Denas Grybauskas, Chief Governance and Strategy Officer at Oxylabs, outlines the key legal and ethical considerations the EU AI Act raises for businesses seeking to follow best web data collection practices.
Web scraping today faces an interesting dichotomy. While it is an essential part of the internet experience, powering key sites, the enormous amounts of data being scraped for AI training purposes are putting it under scrutiny.
As the AI boom is changing the whole nature of the web, it is rekindling some old discussions, even on how public data should be accessed. Add the AI copyright infringement headlines muddying the water of how data is being used, and it becomes a difficult space for businesses to navigate.
As discussed in the OxyCon session I chaired this year, the EU AI Act has introduced an additional layer of questions for the industry to address. It definitely has not given businesses that aggregate data a ‘highway code’ for web scraping, and many elements of the law are still unclear, creating easy traps for businesses to fall into.
The uncertain legal landscape
There are some recurring legal issues that businesses need to watch out for when collecting web data:
- Breach of contract: The most common legal claims associated with web data collection come from breach of contract, which occurs when one party fails to fulfil what it agreed to when accepting a site’s Terms of Service. Suppose a company has an account on a specific website, such as a social media site, and also decides to scrape that site. In that case, it naturally puts itself under greater risk exposure. Scraping content from social media sites after agreeing to their ToS has been one of the main drivers of lawsuits in this field. It can still be argued (and has been in some cases) that the act of scraping is unrelated to the purpose of social media sites and the creation of accounts, and that terms of service therefore shouldn’t regulate public data scraping. However, proving this point requires effort.
- Copyright infringement: The legal claims generating the most headlines today are related to copyright infringement, especially those that result in high-profile class actions. These lawsuits spark the most controversy, and one even resulted in London protests earlier this year over claims that Meta stole books. Currently, outlets are reporting on the music publishers embroiled in a legal battle with Anthropic over AI copyright claims. These types of lawsuits reflect an ongoing debate about what data can be used for AI training purposes and how creators should be involved.
- Personal data: Occasionally, publicly available data also includes personal information. Even if it is technically ‘publicly available’, personal data is still protected by privacy laws, typically subject to exceptions and conditions, such as those outlined in the CCPA. Companies should therefore thoroughly evaluate whether collecting such information is necessary and ethical. It’s highly likely that questions of privacy and data ownership will remain the main focus areas in the courts and in public discussions around web data for the foreseeable future.
The underlying perception that web scraping practices exist in a ‘grey area’ often stems from a lack of clarity. The legal landscape today lacks a clear, easy-to-follow ‘one-stop shop’ guide to full compliance, which would ‘unmuddy’ the waters on this issue.
Despite good intent, the EU AI Act has not provided this.
The impact of AI on web scraping
The AI boom has once again brought the spotlight to the need for legal clarification. It has driven an increased demand for data, bringing the term ‘data scraping’ into the mainstream conversation. The amount of web scraping conducted by businesses has skyrocketed, and, unsurprisingly, this has thrown the issue of copyright into the limelight.
However, there are some valid arguments in the US legal system for instances where aggregation of public (copyrighted) data could fall under the fair use doctrine. For example, if a business is transparent about the public data it uses and transforms it into something new, this might be considered fair use. One of the key conditions, as seen in recent US court cases such as Anthropic’s, is that the work for which the public data was aggregated and used must be transformative.
Currently, fair use in the US cannot be contracted away in its entirety. Within its scope, copyrighted material can be repurposed in entirely new ways, transforming it from its copyrighted state.
When doing this, businesses need to be aware of a few factors to behave ethically within current legislation. For example, a court would look at the following to determine fair use when ruling on a copyright infringement claim:
- The nature of the copyrighted work – is it creative or factual, published or unpublished?
- How much of the copyrighted work has been used?
- Has transformation taken place?
- What is the economic impact of the use? Has it affected the market for, or value of, the original work?
When scraping public data to train AI models, it’s crucial to remain vigilant and aware, regardless of your location. The EU has both a database rights regime and the DSM directive, which includes text and data mining exceptions. While legal regimes differ, it is always important to evaluate the source of the data used and the jurisdiction of your company to understand which rules apply to you, and what the best course of action is to stay within those rules.
How can businesses prepare for training on public data?
To stay alert, every AI system deployer and provider must conduct a thorough risk assessment before placing their web data collection solution on the market. Part of this research should include getting to know the regulations of your specific region, ensuring that the key people are fully aware of copyright, privacy and other laws.
Current laws and regulations around AI are incredibly fragmented, making it a challenging environment to navigate. A comprehensive understanding of these laws, including the AI Act and wider EU regulations, will position businesses for seamless web data collection practices.
At the end of the day, the businesses whose AI models will withstand the test of time are the ones that don’t just build with compliance in mind, but truly build systems that can flex to regulations easily.
The EU AI Act in practice
Unfortunately, the EU AI Act still does not give businesses a comprehensive guide to web scraping in the European Union. Instead, it arms them with knowledge about specific obligations for general-purpose AI model providers. As a result, the picture remains fragmented and unstable, with no clear path to compliance.
A thorough understanding of best practices, alongside a risk assessment, is the key to thriving in this legal environment.
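Neither the EU AI Act nor the laws discussed above prescribe technical measures, but one widely recognised (though not legally binding) courtesy signal for ethical collection is a site’s robots.txt file. The sketch below, using only Python’s standard library, shows how a crawler could honour it before fetching anything; the robots.txt content and the `my-crawler` agent name are hypothetical, for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice you would fetch it from
# the target site (e.g. https://example.com/robots.txt) before scraping.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given URL may be fetched by our (hypothetical) crawler,
# and how long to pause between requests.
print(rp.can_fetch("my-crawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-crawler", "https://example.com/private/data"))  # False
print(rp.crawl_delay("my-crawler"))                                    # 10
```

Honouring such signals does not by itself resolve the contractual, copyright or privacy questions above, but it is an easy, verifiable baseline for the risk assessments this article recommends.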
For the technologies of today’s world to remain as unbiased, ethical and representative as possible, we must strive for public data to remain open for AI training purposes. The whole internet is a diverse dataset that, with the right legal guidance, can be utilised to fuel innovation.