In her one-room home on a quiet street in Agara, a tiny village three hours southwest of Bangalore that’s fringed by rice paddies and groundnut fields, Preethi P. sits on a stool near a sewing machine. Normally, she would spend hours mending or stitching clothes, averaging less than $1 a day for her work. On this day, however, she is reading a sentence in her native Kannada language into an app on a phone. She pauses briefly, then reads another.
Preethi, who goes by a single name, as is common in the region, is among the 70 workers hired in Agara and neighbouring villages by a startup called Karya to gather text, voice and image data in India’s vernacular languages. She is part of a vast, unseen global workforce — operating in countries like India, Kenya and the Philippines — who collect and label the data that AI chatbots and virtual assistants rely on to generate relevant responses. Unlike many other data contractors, however, Preethi gets paid well for her efforts, at least by local standards.
After three days of working with Karya, Preethi earned 4,500 rupees ($54), more than four times the amount the 22-year-old high school graduate usually makes as a tailor in an entire month. The money is enough, she said, to pay off that month’s installment on a loan taken to partly repair the crumbling mud walls of her home that have been carefully patched up with colourful saris. “All I need is a phone and the internet.”
Karya was founded in 2021, before the rise of ChatGPT, but this year’s frenzy around generative AI has only added to tech companies’ insatiable demand for data. India alone is expected to have nearly one million data annotation workers by 2030, according to Nasscom, the country’s tech industry trade body. Karya differentiates itself from other data vendors by offering its contractors – mostly women, and mostly in rural communities – as much as 20 times the prevailing minimum wage, with the promise of producing better quality Indian-language data that tech companies will pay more to obtain.
“Every year, big tech companies spend billions of dollars collecting training data for their AI” and machine learning models, Manu Chopra, the 27-year-old Stanford-educated computer engineer behind the startup, told Bloomberg in an interview. “Poor pay for such work is an industry failure.”
If meagre wages are an industry failure, it’s one that Silicon Valley bears some responsibility for creating. For years, tech companies have outsourced tasks like data labelling and content moderation to cheaper contractors overseas. But now, some of Silicon Valley’s most prominent names are turning to Karya to address one of the biggest challenges for their AI products: finding high-quality data to build tools that can better serve billions of potential non-English speaking users. These partnerships could represent a powerful shift in the economics of the data industry and Silicon Valley’s relationship with data providers.
Microsoft Corp. has used Karya to source local speech data for its AI products. The Bill & Melinda Gates Foundation is working with Karya to reduce gender biases in data that feeds into large language models, the technology underpinning AI chatbots. And Alphabet Inc.’s Google is leaning on Karya and other local partners to gather speech data in 85 Indian districts. Google plans to expand to every district to include the majority language or dialect spoken and build a generative AI model for 125 Indian languages.
Many AI services have been disproportionately developed with English-language internet data, such as articles, books and social media posts. As a result, these AI models poorly represent the diversity of languages for internet users in other countries who are accessing AI-powered smartphones and apps faster than they’re learning English. Nearly one billion such potential users live in India alone, as the government pushes for a rollout of AI tools in every sphere from healthcare to education to financial services.
“India is the first non-Western country we are doing this in, and we are testing Bard in nine Indian languages,” said Manish Gupta, head of Google Research in India, referring to the company’s AI chatbot. “Over 70 Indian languages spoken by over a million people each had zero digital corpus. The problem is so stark.”
Gupta ticked off a list of issues that AI firms need to address in order to serve India’s internet users: Non-English datasets are dismally low quality; hardly any conversational data exists in Hindi and other Indian languages; and digitized content from books and newspapers in Indian languages is very limited.
When used for South Asian languages, some large language models have been found to make up words and struggle with basic grammar. There are also concerns these AI services may reflect a more skewed view of other cultures. It’s critical to have broad representation of training data, including non-English data, so AI systems “don’t perpetuate harmful stereotypes, produce hate speech, nor yield misinformation,” said Mehran Sahami, a professor in the computer science department at Stanford University.
Karya, a social impact startup headquartered in Bangalore and supported by grants, is able to broaden the pool of languages represented in part by specifically targeting workers in rural areas who might not otherwise be contracted for such tasks. Karya’s app can work without internet access and it provides voice support for those with limited literacy. In India, over 32,000 crowdsourced workers have logged into the app, completing 40 million paid digital tasks such as image recognition, contour alignments, video annotation and speech annotation.
For Chopra, the goal isn’t just to improve the supply of data but to fight poverty. Karya’s founder grew up in an impoverished neighbourhood called Shakur Basti in West Delhi. He won a scholarship to study in an elite school where he was bullied because his classmates said he “smelled poor.” Chopra landed at Stanford to study computer science but realized he hated the “how you make a billion dollars” mindset he encountered there.
After graduating in 2017, he began working on his long-held interest: using technology to tackle poverty. “It takes a mere $1,500 in savings to make an Indian eligible to enter the middle class,” Chopra said. “But the impoverished can take 200 years to reach that level of savings.”
Microsoft, he learned, had been paying a hefty amount for collecting speech data, albeit of poor quality, to feed its AI systems and research. In 2017, for instance, although 1 million hours of digitized spoken data was available in Marathi, a language spoken in Mumbai and the surrounding region of western India, only 165 hours was available for purchase. His startup has since put together 10,000 hours of Marathi speech data for Microsoft’s AI services, read by men and women from five different regions.
“Tech companies want the data, accent and all,” Chopra said. “You cough, they want that in the speech – it represents natural language.”
Saikat Guha, a researcher at Microsoft Research India who focuses on the ethics of data collection, said he has also used Karya’s content for a project to aid those with visual disabilities in finding jobs. “The quality of data is far better than any other source I’ve used,” said Guha. “If you pay workers fairly, they’re more invested in their work, and the end result is better data.”
Meanwhile, over 30,000 young, school-educated women are working with Karya to help collect “gender intentional” datasets – such as that the doctor or boss isn’t always a he – in six Indian languages for the Bill & Melinda Gates Foundation. It’s the biggest such effort in Indian languages and will serve as a corpus to build datasets to reduce gender-related biases in LLMs.
Karya isn’t stopping with India. The company said it’s in talks to sell its platform as a service to organizations in Africa and South America who will do similar work.
For now, women in Yelandur, another village southwest of Bangalore, eagerly await Karya’s next project: transcribing from a Kannada audio recording. Among them is Shambhavi S., 25, who earned a few thousand rupees from a previous assignment while working in the quiet of her home after feeding her in-laws dinner and putting her children to bed.
“I don’t know what artificial intelligence is, I haven’t heard of it,” said Shambhavi. “I want to earn and educate my children, so they can learn how to use it.”