Many language models are trained predominantly on English text, which far outweighs text in other languages in their training data. This imbalance has tangible effects on racialized and marginalized communities. "For example, they have resulted in inaccurate medical advice in Hindi, led to wrongful arrest because of mistranslations in Arabic, and have been accused of fueling ethnic cleansing in Ethiopia due to poor moderation of speech that incites violence."
The focus on English-centric natural language processing (NLP) tools often excludes non-English-speaking communities. In response to this issue, region- and language-specific research groups like Masakhane and AmericasNLP have emerged. These groups aim to counter the dominance of English-centric NLP by enabling their communities to contribute to and benefit from NLP tools developed in their own languages.
Drawing on research and discussions with these collectives, a CDT brief outlines several promising practices that companies and research groups can adopt to broaden community participation in multilingual AI development.
"These harms reflect the English-centric nature of natural language processing (NLP) tools," says the CDT brief titled "Beyond English-Centric AI: Lessons on Community Participation from Non-English NLP Groups." The document highlights the need to involve non-English-speaking communities in the development process.
The full brief offers further insights into how these practices can be implemented effectively.