Foundation model developers are being urged to improve how they benchmark and report on non-English languages. The push stems from concern that claims about these models' effectiveness beyond English often do not match reality, a gap that is especially pronounced for "low-resource" languages, which lack the large volumes of text data needed for model training.
According to a recent article, foundation model developers test their models extensively in English but far less thoroughly in other languages: the benchmarks used for non-English languages are narrower and less robust than developers' claims suggest. This imbalance raises the risk of deploying foundation models in contexts they are ill-equipped to handle, with concrete harms such as giving inaccurate benefits information or failing to moderate online speech in non-English settings.
The article also cautions against leaning on assumptions about "cross-lingual transfer," the idea that training a model in one language improves its abilities in others. How far this transfer extends is still debated, and its limits are not yet well understood, so neither foundation model developers nor downstream application developers should presume that safety measures validated in one language will carry over seamlessly to others.
Furthermore, developers are encouraged to adopt benchmarks specific to individual languages rather than relying solely on parallel benchmarks, which may miss the cultural nuances of different speaker communities. Natively developed, monolingual benchmarks give developers a clearer picture of a model's capabilities and limitations in each language, as sketched below.
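As an illustration of what this could look like in practice, here is a minimal sketch that scores a model on a separate, natively developed benchmark for each language. The `score_fn` callable, the `BenchmarkResult` record, and the dataset paths are hypothetical stand-ins for whatever evaluation harness a developer actually uses; nothing here is prescribed by the article.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkResult:
    language: str
    dataset: str
    accuracy: float

def evaluate_per_language(
    score_fn: Callable[[str], float],
    native_benchmarks: dict[str, str],
) -> list[BenchmarkResult]:
    """Score a model on a natively developed benchmark for each language.

    score_fn wraps whatever model and eval harness is in use and returns
    accuracy on the given dataset path; it is an assumption of this sketch.
    """
    return [
        BenchmarkResult(lang, path, score_fn(path))
        for lang, path in native_benchmarks.items()
    ]

# Illustrative usage: one locally curated benchmark per language, instead
# of a single English-centric suite. The paths are hypothetical.
native_benchmarks = {
    "sw": "benchmarks/swahili_native_qa.jsonl",
    "am": "benchmarks/amharic_native_qa.jsonl",
}
```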
Additionally, the article emphasizes using a variety of approaches to assess non-English text rather than relying solely on machine-translated benchmarks. Machine translations may not reflect how speakers of a language actually communicate, particularly in low-resource languages where translation quality is often poor.
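One lightweight way to act on this, sketched below under assumed inputs, is to compare a model's score on a machine-translated benchmark against its score on a native benchmark for the same task: a large gap suggests the translated set misrepresents real usage. The score dictionaries and the 0.10 threshold are illustrative choices, not values from the article.

```python
def flag_translation_gaps(
    translated_scores: dict[str, float],
    native_scores: dict[str, float],
    threshold: float = 0.10,  # illustrative cutoff, tune per task
) -> list[str]:
    """Return language codes where the machine-translated benchmark score
    diverges sharply from the native benchmark score on the same task,
    which can signal translation artifacts rather than real capability."""
    return [
        lang
        for lang, native in native_scores.items()
        if lang in translated_scores
        and abs(translated_scores[lang] - native) > threshold
    ]

# Toy example: the Swahili gap would be flagged for a closer look.
print(flag_translation_gaps(
    translated_scores={"sw": 0.71, "yo": 0.52},
    native_scores={"sw": 0.48, "yo": 0.50},
))  # -> ['sw']
```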
To enhance transparency, foundation model developers are advised to disclose the volume and sources of their training data in each language. This information helps downstream application developers judge whether a model is a sound basis for fine-tuning in a particular language, and such disclosure is seen as a crucial step towards accountability and trust in how foundation models are deployed.
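The sketch below shows one plausible shape for such a disclosure: a per-language "data card" aggregating token counts and sources. The record schema of (language code, source name, token count) tuples is an assumption made for illustration; the article does not specify a format.

```python
from collections import defaultdict

def language_data_card(records):
    """Aggregate token counts and data sources per language so they can
    be reported publicly. `records` is an iterable of
    (language_code, source_name, token_count) tuples (assumed schema)."""
    tokens = defaultdict(int)
    sources = defaultdict(set)
    for lang, source, n_tokens in records:
        tokens[lang] += n_tokens
        sources[lang].add(source)
    total = sum(tokens.values()) or 1
    return {
        lang: {
            "tokens": n,
            "share": round(n / total, 4),
            "sources": sorted(sources[lang]),
        }
        for lang, n in sorted(tokens.items(), key=lambda kv: -kv[1])
    }

# Toy records showing how stark the high- vs low-resource skew can be.
card = language_data_card([
    ("en", "web-crawl", 9_000_000),
    ("sw", "web-crawl", 40_000),
    ("sw", "news-archive", 15_000),
])
```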
Lastly, the article underscores the need to probe for vulnerabilities with non-English attacks, even when a model is intended primarily for English use. Multilingual red-teaming exercises can surface risks and weaknesses across languages, leading to more robust and secure models.
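A minimal version of such an exercise might look like the following sketch, which replays a set of adversarial prompts in every target language rather than only in English. The `generate` and `is_unsafe` callables are hypothetical placeholders for a model API and a safety classifier, not any real library's interface.

```python
from typing import Callable

def multilingual_red_team(
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    attack_prompts: dict[str, list[str]],
) -> list[tuple[str, str, str]]:
    """Replay adversarial prompts in every target language, not just English.

    `generate` stands in for a model API and `is_unsafe` for a safety
    classifier; both are assumptions of this sketch. `attack_prompts` maps
    a language code to adversarial prompts in that language (for example,
    human translations of an English jailbreak set). Returns the
    (language, prompt, response) triples where the model produced unsafe
    output, so failures can be compared across languages.
    """
    failures = []
    for lang, prompts in attack_prompts.items():
        for prompt in prompts:
            response = generate(prompt)
            if is_unsafe(response):
                failures.append((lang, prompt, response))
    return failures
```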
In conclusion, the call for better benchmarking and transparency in foundation models for non-English languages aims to narrow the gap between high-resource and low-resource languages, enabling more responsible and effective deployment of AI technologies worldwide.