Chat2Find Corpus: 255M Token Trilingual AI Dataset Released

At Chat2Find, our mission has always been to bridge the gap between advanced technology and local linguistic reality. Today, we are proud to take a massive leap forward in that mission by releasing the Chat2Find Corpus to the global research community.

This open-source dataset, now live on Hugging Face, contains approximately 255 million tokens across 279,248 real-world conversations. By capturing the nuances of Sinhala (සිංහල), Tamil (தமிழ்), and English, it offers the kind of conversational data that has been the “missing link” for AI development in the South Asian region.
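To put those two figures in perspective, the short calculation below estimates the average conversation length. It uses only the numbers quoted above; the token total is approximate, so the result is a rough figure, and the variable names are purely illustrative.

```python
# Figures quoted in this announcement; the token total is approximate.
TOTAL_TOKENS = 255_000_000
CONVERSATIONS = 279_248

# Mean tokens per conversation (a rough estimate, since the total is rounded).
avg_tokens = TOTAL_TOKENS / CONVERSATIONS

print(f"Roughly {avg_tokens:.0f} tokens per conversation on average")
```

That works out to a little over 900 tokens per conversation, which suggests multi-turn exchanges rather than single-line messages.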

Beyond Formal Language

Traditional datasets often miss how we actually talk. Our corpus includes a significant number of Singlish and Tanglish conversations, reflecting the transliterated and code-mixed patterns that are native to our digital communication today.
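For readers who want a feel for what code-mixing looks like in practice, here is a minimal sketch that profiles a message by writing system. It is an illustration only, not a description of how the corpus was built or labelled. It relies on the Unicode block ranges for Sinhala (U+0D80 to U+0DFF) and Tamil (U+0B80 to U+0BFF); the function names are hypothetical. Note the limitation: fully transliterated Singlish or Tanglish is written in Latin letters, so a script check alone cannot tell it apart from English.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges for the two Indic scripts in the corpus.
SINHALA = (0x0D80, 0x0DFF)
TAMIL = (0x0B80, 0x0BFF)


def script_of(ch: str) -> Optional[str]:
    """Return the script of one character, or None for digits, spaces, etc."""
    cp = ord(ch)
    if SINHALA[0] <= cp <= SINHALA[1]:
        return "sinhala"
    if TAMIL[0] <= cp <= TAMIL[1]:
        return "tamil"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def script_profile(text: str) -> Counter:
    """Count characters per script, ignoring anything unclassified."""
    return Counter(s for s in map(script_of, text) if s is not None)


def is_script_mixed(text: str) -> bool:
    """True when a message uses more than one writing system."""
    return len(script_profile(text)) > 1


if __name__ == "__main__":
    # Hypothetical example messages, not taken from the corpus.
    for msg in ["මම office යනවා", "naan office poren", "தமிழ்"]:
        print(msg, dict(script_profile(msg)), is_script_mixed(msg))
```

The second sample message is transliterated Tamil in Latin letters, so it is reported as Latin-only. Catching that kind of text needs a language-level classifier rather than a script check.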

A Foundation for Future Reasoning

This release is just the beginning. The Chat2Find Corpus serves as the primary training data for our upcoming suite of open-weights reasoning models. We are currently in the final stages of developing:

Chat2Find Base: A foundational trilingual model.
Chat2Find Instruct: Optimised for complex task-following.
Chat2Find Reasoning: A high-logic model bringing chain-of-thought processing to South Asian languages.

Open for Innovation

Under the MIT License, we invite researchers, startups, and hobbyists to explore, train, and build. Together, we can ensure that the future of AI is as diverse as the languages we speak.

Access the dataset here:

Hugging Face – Chat2Find Corpus

