Scholars face hurdles moderating harmful online content due to resource disparities

[Photo: Alexandra Reeve Givens, President & CEO at Center for Democracy & Technology | Official website]


Around 75% of internet users are from non-English-speaking countries in the Majority World, yet social media companies allocate most of their content moderation resources to English-speaking populations in the West. This disparity has led to human rights violations and unjust moderation outcomes in the Majority World. Researchers from these regions have focused on improving automated detection of harmful content in local languages, which are often underrepresented and lack robust technological support.

To understand the challenges faced by researchers addressing online harms, 12 researchers specializing in three low-resource languages were interviewed: Tamil in South Asia, Kiswahili in East and Central Africa, and Quechua in South America. These researchers use Natural Language Processing (NLP) to improve how computers understand low-resource languages, focusing on detecting content that needs moderation, such as misinformation, hate speech, and spam.
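For readers unfamiliar with this kind of work, the sketch below shows the general shape of such a supervised harmful-content detector: labeled posts go in, and a probability that a new post needs moderation comes out. It is a minimal illustration assuming the scikit-learn library and toy data, not any interviewee's actual system.

```python
# A minimal sketch of a supervised harmful-content classifier using
# scikit-learn and toy data. It illustrates the general pipeline only;
# it is not the interviewed researchers' actual system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: 1 = needs moderation, 0 = benign.
posts = [
    "what a lovely morning",
    "win free money click this link now",
    "those people are vermin",
    "see you at the market later",
]
labels = [0, 1, 1, 0]

# Word unigrams and bigrams as TF-IDF features, fed to a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(posts, labels)

# Probability that a new post needs moderation.
print(model.predict_proba(["free money now click"])[0][1])
```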

The investigation reveals a troubling trend: tech companies are withholding crucial data from researchers, hindering the development of automated content moderation technologies for low-resource languages. This issue is compounded by colonial biases in NLP research that impede effective moderation of harmful content in non-English contexts. However, the interviewed NLP researchers believe that partnerships between local researchers and social media giants could improve the situation.

Researchers working with Tamil, Kiswahili, and Quechua identified a significant roadblock: the lack of high-quality digital data due to a colonial legacy favoring English and European languages while neglecting linguistic diversity in the Majority World. They rely on user-generated content on social media as their primary source of digital data but find it insufficient for training AI models for low-resource languages.

African NLP researchers working on Kiswahili complained about being denied access to data if they lacked prior publications—a challenge since they often lack funding to publish work on low-resource languages. The situation worsened when tech companies began charging exorbitant fees for data access or blocked open-source tools used by independent researchers.

This resource gap is rooted in a colonial legacy prioritizing Western institutions as knowledge producers rather than building local research capacity. Tech companies exacerbate this gap by gatekeeping and monetizing user-generated content in low-resource languages.

In response to these challenges, some NLP researchers initiated community-led processes to gather data voluntarily from WhatsApp users or native speakers who donated speech data and helped with manual transcription. However, without sufficient funding, they struggled to fairly compensate community members contributing to these efforts.

Despite widespread interest in developing automated moderation technologies for non-English content, progress has been slow due to inadequate access to high-end computing hardware. Many researchers rely on Google Colab’s free computing resources, but they argue that the allotted time and memory are insufficient for effectively training language models with billions of parameters.
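A back-of-the-envelope calculation illustrates the gap. The multipliers below are common rules of thumb rather than measurements of any specific setup: fine-tuning with the Adam optimizer keeps roughly four values per parameter in memory (weights, gradients, and two optimizer moments), so even a 1-billion-parameter model outgrows a typical free-tier GPU before activations are counted.

```python
# Back-of-the-envelope memory estimate for fine-tuning a 1B-parameter
# model in fp32 with Adam. The multipliers are common rules of thumb
# (assumptions, not measurements of any specific framework).
params = 1_000_000_000       # 1 billion parameters
bytes_per_value = 4          # fp32

weights    = params * bytes_per_value      # model weights
gradients  = params * bytes_per_value      # one gradient per parameter
adam_state = params * bytes_per_value * 2  # Adam's two moment estimates

total_gb = (weights + gradients + adam_state) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~16 GB, beyond a free-tier GPU
```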

Historically poor support for non-Latin scripts has forced people in the Majority World to write their languages in the Latin alphabet, resulting in code-mixing (combining two or more languages in a single text). Existing AI models, trained primarily on English datasets, perform poorly on code-mixed text because they struggle with romanized words from other languages.
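The effect is easy to reproduce. The sketch below, assuming the Hugging Face transformers library and its bert-base-uncased tokenizer, shows how a vocabulary built mostly from English splits romanized Tamil into many short, low-information subword fragments; the sentences themselves are toy examples.

```python
# Illustration of why code-mixed, romanized text is hard for models whose
# vocabularies were built mostly from English. Assumes the Hugging Face
# transformers library; the sentences are toy examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

english  = "the weather is really nice today"
tanglish = "inniki weather romba nalla irukku"  # romanized Tamil-English mix

print(tokenizer.tokenize(english))   # mostly whole, meaningful words
print(tokenizer.tokenize(tanglish))  # romanized Tamil shatters into fragments
```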

The assumptions built into current AI pipelines also fail to account for cultural nuances such as "algospeak," coded language used locally to evade moderation while spreading hate speech. The Tamil NLP researchers we interviewed found that integrating emoji context significantly improved hate speech detection accuracy over earlier methods, which removed emojis during preprocessing, a standardization established years ago on the basis of predominantly English-language studies. Likewise, conventional sentiment analysis overlooks supremacist ideologies that promote ethnic or religious hate through positive-sentiment language. The researchers suggested involving members of targeted communities in annotation so that guidelines accurately reflect the experiences of affected communities.
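The preprocessing choice at issue can be sketched as follows. This is an illustrative contrast assuming the third-party emoji package, not the interviewees' code: the older convention discards emojis outright, while the alternative converts them into textual tokens a classifier can learn from.

```python
# Contrast between two preprocessing choices for social media posts.
# Assumes the third-party `emoji` package; the post is a toy example,
# not data from the study.
import emoji

post = "po da 🐍🔥"  # romanized Tamil ("get lost") plus hostile emojis

def strip_emojis(text: str) -> str:
    # Older convention: discard emojis before classification.
    return "".join(ch for ch in text if not emoji.is_emoji(ch))

def keep_emoji_context(text: str) -> str:
    # Alternative: turn emojis into textual tokens a model can learn from.
    return emoji.demojize(text)

print(strip_emojis(post))        # "po da " -- the hostile signal is lost
print(keep_emoji_context(post))  # "po da :snake::fire:"
```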

These discussions highlight the systemic biases and inequities impeding content moderation research in the Majority World. Although Western-based companies control abundant resources, they have failed to mitigate regional online harms, perpetuating a form of "data colonialism": profiting from content and labor generated in these regions while denying access to the resulting data, thereby entrenching global inequalities. The researchers urge fairer distribution of resources and a shift of power toward grassroots efforts to combat harmful content.

An ongoing power asymmetry compounds the problem: low-wage tasks are outsourced to the Majority World while prestigious development work remains with Silicon Valley engineers. Challenging this colonial division of labor requires collaboration that draws on the linguistic and cultural expertise of local researchers. The researchers caution against appropriating or exploiting volunteer contributions, and call for transparent procedures, responsible safeguarding of contributed datasets, and grants that support fair compensation. Such measures, they argue, would foster a more inclusive and equitable research environment and solutions sensitive to diverse global needs, helping keep online spaces fair, just, and respectful of human dignity as digital landscapes grow.

###
