AI Singapore and Google partner to enhance Southeast Asian Large Language Model training datasets

AI Singapore (AISG) and Google Research have embarked on Project SEALD (Southeast Asian Languages in One Network Data), a research collaboration to enhance datasets that can be used to train, fine-tune, and evaluate large language models (LLMs) in languages spoken across Southeast Asia (SEA). This collaboration seeks to improve cultural context awareness and capabilities in SEA LLMs, and advance their applicability across the region to bring broad benefits to society.

Improving inclusivity in SEA LLMs 

Starting with Indonesian, Thai, Tamil, Filipino, and Burmese, the research under Project SEALD will help build a diverse and high-quality data corpus of languages spoken in SEA to support the training of models under SEA-LION (Southeast Asian Languages in One Network)—an initiative by AISG to develop a family of LLMs specifically pre-trained and instruction-tuned to be more representative of SEA’s cultural contexts and linguistic nuances—and other models that can add value to SEA-centric use cases.

Under Project SEALD, AISG and Google Research Asia Pacific (APAC) will work together on:

  • Developing translocalization and translation models,
  • Establishing best practices for instruction tuning datasets,
  • Creating tools to enable translocalization at scale, and
  • Publishing pre-training recipes for SEA languages.

AISG and Google will release the datasets and outputs from Project SEALD as open source to advance the progress of the SEA LLM ecosystem and foster strong regional expertise.

As a specific use case, Project SEALD is working to improve communications with under-represented populations of migrant workers in Singapore, who may speak and understand a variety of regional languages with greater fluency than English. Data collection efforts to better capture linguistic nuances within this community will provide the foundation for enhanced engagement by both the Singapore Government and employers.

When integrated into one of the generative AI solutions first developed under the AI Trailblazers initiative by the Singapore Government and Google Cloud, the datasets and output from Project SEALD can aid outreach across a variety of important domains, such as redressal of worker grievances and extension of assistance schemes.

Lastly, Project SEALD will engage ecosystem partners in academia, industry, and government. This includes working with industry players on data collection, curation, and quality checks; collaborating with academia in different SEA countries to implement state-of-the-art techniques in evaluation and benchmarking; and partnering with government stakeholders in Singapore and across the region to advance use cases for public good.

Advancing SEA LLMs for the region 

Building on this, AISG is collaborating with Google Cloud to make its SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, which provides organizations with access to first-party, third-party, and open models that meet Google Cloud’s strict enterprise safety and quality standards. Through Vertex AI, organizations can use enterprise-grade tools to easily customize these models to address relevant use cases and integrate them into their applications. In addition, AISG will continue to make its SEA-LION LLMs available on Hugging Face, which has been partnering with Google Cloud to help developers train, tune, and serve open models quickly and cost-effectively.

AISG has also initiated collaborations across Singapore and other SEA countries. For example, AISG has signed Memorandums of Understanding (MOUs) or Letters of Intent (LOIs) with Indonesian, Malaysian, and Vietnamese entities for the development of datasets and applications for regional LLMs. In addition, AISG has been engaging partners in Thailand, the Philippines, and Indonesia to build resources on regional language syntax and semantics. Finally, in the Singapore context, AISG works closely with public sector and R&D stakeholders on safety alignment and multimodality.

Elsewhere in APAC, Google Research is running a similar large-scale language inclusivity project in India with the Indian Institute of Science: Project Vaani, an initiative that is gathering, transcribing, and open-sourcing speech data from across all of India’s 773 districts.

Key partner quotes

“Google is proud to be partnering with AISG to put Singapore and SEA on the map of AI model development. By focusing on languages spoken and used in SEA and cultural understanding, Project SEALD will significantly improve the existing corpus and evaluation benchmarks for these languages. This will open new opportunities and make AI more inclusive, accessible, and helpful for individuals and businesses throughout the region.” – Yolyn Ang, Vice President, Knowledge and Information Partnerships, Google APAC 

“The SEA-LION LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the SEA-LION data corpus and continuously improve SEA-LION’s capabilities. We are happy that Google now stands as a key part of the SEA-LION ecosystem and we look forward to building better datasets through Project SEALD in collaboration with Google for the benefit of the entire community.” – Leslie Teo, Senior Director of AI Products, AISG

“VISTEC is excited to be part of this pan-ASEAN natural language processing (NLP) development offered by Project SEALD, a vital collaborative mechanism that sets our diverse NLP communities in one collective and strategic direction. In particular, Project SEALD will alleviate the resource constraints associated with incorporating SEA languages into AI innovations by delivering new pre-trained language models, datasets, and benchmarks. VISTEC is proud to be an official partner, contributing our expertise in Thai NLP to this project.” – Sarana Nutanong, Vidyasirimedhi Institute of Science and Technology, Thailand

“As we continue to work with AISG through XFORM, Inc. in developing localized, comprehensive, and inclusive datasets, we are looking forward to contributing to Project SEALD, which will make a significant contribution in building localized, culture-driven, context-sensitive, and open-source LLMs for SEA through the Ateneo Social Computing Science Laboratory.” – Maria Regina Estuar, Head, Ateneo Social Computing Science Laboratory; CEO, XFORM, Inc., the Philippines 

Call for partnerships

Help shape the future of AI in SEA! Partner with Google and AISG to enhance regional LLMs and create language solutions tailored to our region. Researchers, developers, and businesses, your expertise is needed to drive innovation in this exciting field. Contact us at [email protected] to get involved.