Reddit has announced an update to its robots.txt file, the Robots Exclusion Protocol implementation that tells automated web bots which parts of its site they may crawl. The change is aimed at stopping AI companies from scraping Reddit content and training models on it without permission.
Traditionally, the robots.txt file told search engines which pages they could crawl so they could index content and direct users to relevant results. With the rise of AI, however, that same content is being scraped to train models, often without any acknowledgment of its source. To combat this, Reddit will continue to rate-limit and block unknown bots and crawlers that do not abide by its Public Content Policy and have no agreement with the platform.
The update is designed to deter AI companies from training their large language models on Reddit content, while good-faith actors, such as researchers and organizations like the Internet Archive, should not be affected. The file is only advisory, though: an AI crawler can simply choose to ignore it, since compliance is checked by the crawler itself, which is why Reddit also relies on rate limiting and blocking.
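To make that voluntary nature concrete, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard-library parser. The directives, bot names, and URL are illustrative assumptions for this sketch, not Reddit's actual configuration.

```python
from urllib import robotparser

# Illustrative robots.txt directives (an assumption for this sketch,
# not a copy of Reddit's actual file): block every bot except one
# hypothetical approved crawler.
ROBOTS_TXT = """\
User-agent: ApprovedSearchBot
Allow: /

User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

target = "https://www.reddit.com/r/technology/"  # hypothetical page

for bot in ("ApprovedSearchBot", "UnknownAIScraper"):
    # can_fetch() only reports what the file asks for; nothing stops a
    # crawler from skipping this check entirely, which is why sites
    # pair robots.txt with server-side rate limiting and blocking.
    allowed = parser.can_fetch(bot, target)
    print(f"{bot}: {'allowed' if allowed else 'disallowed'} by robots.txt")
```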
This announcement comes on the heels of a Wired investigation that found AI-powered search startup Perplexity scraping content while ignoring requests not to do so. Perplexity CEO Aravind Srinivas responded by arguing that the robots.txt file is not a legal framework.
Reddit’s updated policy will not affect companies with which it already has an agreement, such as Google, which has a $60 million deal to train AI models on Reddit content. The change signals to other companies that want Reddit’s data for AI training that they will need to secure permission and pay for access.
“Anyone accessing Reddit content must abide by our policies, including those in place to protect redditors,” Reddit stated in its blog post. “We are selective about who we work with and trust with large-scale access to Reddit content.”
The move follows a policy Reddit released last month that lays out how commercial entities and partners may access and use its data.