Internet board explores new standards for controlling AI training crawlers

Alexandra Reeve Givens, President & CEO at the Center for Democracy & Technology | Official website

AI-powered chatbots draw on the collective work of billions of humans. To respond to user queries, ChatGPT, Google Gemini, Microsoft Copilot, and other large language models rely on their analysis of trillions of words posted online. Similarly, image generators produce graphics by analyzing billions of photos and illustrations available on the web. Unless these systems are given a specific, limited store of content to learn from, they "crawl" the web to gather the necessary data.

Discussions about the rights of humans who created this content have primarily focused on copyright law. Many AI firms face lawsuits from writers and artists upset that tech companies profit from using or reproducing their work. However, copyright is not the only relevant interest, and courts may not always be the best venue for resolving complex issues involving many stakeholders.

For instance, individuals without a copyright claim may still care about how their work is used or how information about them is shared. Researchers also seek the ability to analyze content for scientific and public-interest purposes without getting entangled in disputes over copyright or AI training methods. Additionally, relying on the legal system may favor larger entities with more resources to fight for their interests in court or establish exclusive licensing agreements.

The internet standards-setting process offers an alternative model for addressing these issues. Tech standards bodies have a history of finding solutions involving various stakeholders, including tech companies large and small, civil society organizations, national governments, researchers, and individual users. A precedent exists for providing website owners with technical means to control automated attempts to index their content. For example, three decades ago, an informal collaborative process resulted in websites using a "robots.txt" file to set parameters around search-engine crawlers.
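To illustrate how the robots.txt mechanism works in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the bot names are hypothetical examples: a site operator who wants to remain indexable by search engines while opting out of a particular AI training crawler could express that like so.

```python
import urllib.robotparser

# Hypothetical robots.txt: block a named AI training crawler
# (bot name is an example), allow everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching a page.
print(parser.can_fetch("GPTBot", "/articles/page.html"))    # False
print(parser.can_fetch("Googlebot", "/articles/page.html")) # True
```

Note that, like the standards discussed at the workshop, this is purely advisory: robots.txt expresses the site owner's wishes, and compliance depends on crawlers choosing to honor them.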

With this background in mind, the Internet Architecture Board will convene a workshop in Washington, DC this week focused on controls for AI training crawling and potential standards work at the Internet Engineering Task Force (IETF). Short position papers from prospective participants will explore the issues involved and propose standards that would allow content creators to track, or opt out of, the inclusion of their work in AI training sets. Eric Null from CDT will present considerations related to privacy and other non-copyright interests while advocating for broad stakeholder inclusion in the standards-setting process.

Although technical standards are merely recommendations without enforcement mechanisms, an updated standard could help build consensus that respects the wishes of those who create and maintain web content used by chatbots and other AI tools.

More information:

- CDT’s position paper for the workshop on AI-Control

- More details on the workshop and solicitation for position papers
