Bhashini Invites Partners to Annotate Indian Language Data for AI Training


  • Digital India Bhashini Division invites agencies to annotate & label datasets in 22 Indian languages for AI model training.
  • Vendors to work on ASR, MT, TTS, OCR & transliteration tasks using Bhashini’s in-house DCCF platform with strict quality benchmarks.
  • Empanelment open for one year (extendable to two), with bid submissions due by August 28.

The Ministry of Electronics and Information Technology's Digital India Bhashini Division (DIBD) has invited agencies to annotate and label datasets in 22 Indian languages required for training artificial intelligence (AI) models. "Data annotation and labelling is important for machine learning as it gives the context which algorithms need to learn", DIBD CEO Amitabh Nag told Moneycontrol.

"By labeling raw data (such as images, text, or audio) with meaningful tags or labels, these processes allow models to learn patterns, make correct predictions and eventually, execute desired actions", Nag added.

Nag explained that without well-labelled data, "machine learning models fail to learn, resulting in suboptimal performance and untrustworthy results". In this context, DIBD has issued a request for empanelment (RFE), inviting companies to label and annotate Indian datasets. "The RFE is giving a massive opportunity to the data industry players to be a part of the AI revolution," added Nag.

As per the RFE, vendors will be required to support five fundamental AI/ML language tasks: Automatic Speech Recognition (ASR), Machine Translation (MT), Text-to-Speech (TTS), Optical Character Recognition (OCR), and transliteration.

The RFE explained that vendors would be required to mark up raw data with task- and domain-specific metadata. For ASR, for example, chosen vendors would need to deliver both verbatim and cleaned transcripts, along with timestamping and speaker information including age and gender.
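To make the ASR requirement concrete, the sketch below shows one way such an annotation record could be structured and checked. It is only an illustration: the RFE's actual schema is not described here, so every class and field name (`AsrAnnotation`, `Segment`, `Speaker`, `validate`) is hypothetical and is not part of Bhashini's DCCF platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One timestamped stretch of speech within an audio file."""
    start_sec: float
    end_sec: float
    verbatim: str  # transcript exactly as spoken, fillers included
    cleaned: str   # transcript with fillers and false starts removed


@dataclass
class Speaker:
    """Speaker details; these field names are hypothetical."""
    speaker_id: str
    age: Optional[int] = None
    gender: Optional[str] = None


@dataclass
class AsrAnnotation:
    """Hypothetical record tying one audio file to its labels."""
    audio_id: str
    language: str
    speaker: Speaker
    segments: List[Segment] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        previous_end = 0.0
        for i, seg in enumerate(self.segments):
            if seg.end_sec <= seg.start_sec:
                problems.append(f"segment {i}: end is not after start")
            if seg.start_sec < previous_end:
                problems.append(f"segment {i}: overlaps previous segment")
            if not seg.verbatim.strip() or not seg.cleaned.strip():
                problems.append(f"segment {i}: missing transcript")
            previous_end = max(previous_end, seg.end_sec)
        if self.speaker.age is None or self.speaker.gender is None:
            problems.append("speaker: age or gender missing")
        return problems


# Example: a single-segment Hindi clip with both transcript styles.
record = AsrAnnotation(
    audio_id="clip-001",
    language="hi",
    speaker=Speaker("spk-1", age=34, gender="female"),
    segments=[Segment(0.0, 2.5, "um namaste doston", "namaste doston")],
)
print(record.validate())
```

Keeping the verbatim and cleaned transcripts side by side in each segment means a reviewer can compare the two directly, and a simple check of this kind can catch missing speaker details or inverted timestamps before a human quality review.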

For Machine Translation, translations must be checked for context, fluency, and correspondence with the source text. All annotation work must be performed on Bhashini’s in-house Data Capture and Curation Framework (DCCF) platform, the RFE said.

The RFE sets strict quality benchmarks to validate consistency. Industry experts emphasise that high-quality labelled data is crucial for effective AI systems, particularly in low-resource languages.

Bhashini currently supports the 22 official Indian languages listed under the Eighth Schedule of the Indian Constitution. The empanelment will run for one year, extendable to two, and bids are due by August 28.