Alexandra Reeve Givens President & CEO at Center for Democracy & Technology | Official website

Lessons from non-English NLP groups highlight need for diverse AI development

Many language models are trained predominantly on English text, with significantly less text from other languages. This imbalance has tangible effects on racialized and marginalized communities. "For example, they have resulted in inaccurate medical advice in Hindi, led to wrongful arrest because of mistranslations in Arabic, and have been accused of fueling ethnic cleansing in Ethiopia due to poor moderation of speech that incites violence."

The focus on English-centric natural language processing (NLP) tools often excludes non-English-speaking communities. In response to this issue, region- and language-specific research groups like Masakhane and AmericasNLP have emerged. These groups aim to counter the dominance of English-centric NLP by enabling their communities to contribute to and benefit from NLP tools developed in their own languages.

Based on research and discussions with these collectives, a brief from the Center for Democracy & Technology (CDT) outlines several promising practices that companies and research groups can use to broaden community participation in multilingual AI development.

"These harms reflect the English-centric nature of natural language processing (NLP) tools," says the CDT brief, titled "Beyond English-Centric AI: Lessons on Community Participation from Non-English NLP Groups." The document highlights the need to involve non-English-speaking communities in the development process.

The full brief offers further insights into how these practices can be implemented effectively.
