The last 12 months have demonstrated the huge capabilities enabled by public web data collection; however, it’s clear that the industry still has room to develop in 2026.
With changes expected to legislation affecting the AI industry, which depends on this data, and legal battles ahead, it will be interesting to watch how this plays out as the year unfolds. One thing we can rely on: the fundamentals of data collection will remain more important than ever.
Below, top tech experts have come together to share their insights into how they expect the data collection landscape to develop, based on their industry expertise, and to reveal what 2026 could bring to businesses and AI worldwide.
Fair use of copyrighted material
Denas Grybauskas, Chief Governance and Strategy Officer at Oxylabs, explained: “In US legal discussions, and potentially in practice, we will see a growing emphasis on the transformation of copyrighted work. The fair use doctrine allows transformative use of copyrighted material, meaning use that adds something new and differs from the original in purpose or character.
“Therefore, much legal discussion will likely focus on whether using content, including web content, for AI training constitutes transformative use sufficient to qualify as fair use.
“At the same time, in cases where the fair use doctrine doesn’t apply – in jurisdictions such as the EU – the industry will need technological mechanisms for credit attribution and workable ways to remunerate creators, without undermining the openness of the web or the seamlessness of access to public information.”
Agentic systems for data collection
Julius Černiauskas, CEO at Oxylabs, said: “Next year will likely see interesting developments in comprehensive agentic systems for public data collection. Take the process of web scraping, which consists of many small tasks. AI agents can automate these tasks.
“Together, they comprise a multi-agent system that can handle much of the process, driving down costs and democratising public data access by making it more accessible without requiring particular skills or engineering teams.
“Once again, new tools and features to automate particular tasks constantly enter the market – something that will multiply next year.”
LLM use for parsing
Juras Juršėnas, COO at Oxylabs, stated: “Over the next 12 months, the use of LLMs for parsing will grow. For the past few years, data parsing has been one of the most impactful AI use cases in public data collection.
“However, it was still limited by the price of LLM tokens and by prompt-size constraints. Developers and data teams always had to clean the HTML to reduce its size before passing it to the LLM for parsing, which required additional resources. Now you might need to do this only in specific cases.
“The number of tools on the market that can do this for you is booming. Thus, it is reasonable to expect an increase in LLM usage for parsing.”
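To make the HTML clean-up step Juršėnas mentions more concrete, here is a minimal sketch of what reducing a page before LLM parsing might look like. It uses only Python's standard library and is an illustration under stated assumptions, not Oxylabs' method or any particular tool's API: the names SKIP_TAGS, TextExtractor, and clean_html are hypothetical, and the list of tags to drop is a guess that would need tuning for each site.

```python
from html.parser import HTMLParser

# Assumption: the contents of these tags rarely matter for data parsing.
SKIP_TAGS = {"script", "style", "noscript", "svg", "head"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of a page, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><style>p{}</style></head>"
        "<body><script>var x=1;</script>"
        "<h1>Price</h1><p> 19.99 EUR </p></body></html>"
    )
    cleaned = clean_html(page)
    print(f"{len(page)} characters reduced to {len(cleaned)}")
    print(cleaned)
```

The cleaned text, rather than the raw markup, would then go into the parsing prompt. Stripping tags this aggressively also discards structure such as tables and attributes, which is one reason a team might skip the step when prompt size and token cost are no longer the constraint.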
Quality vs quantity
Rytis Ulys, Head of Data & AI at Oxylabs, commented: “In 2026, the search for data will focus less on quantity and more on quality. Recent Anthropic research showed that even small amounts of low-quality data can ruin the entire dataset.
“Additionally, it showed that beyond a certain point, adding more low-quality data yields minimal gain – or even degrades performance – compared to using a targeted, relevant subset.
“As such, the fundamentals of data collection will remain more important than ever. Robust tables and catalogues, quality and lineage, and low-latency query engines have become prerequisites for agents and retrieval, not afterthoughts. Graph and vector-augmented retrieval is moving from blog posts to patterns; observability now spans prompts, tools, and cost; and compliance sits alongside performance on the same plane. Data isn’t fading; it’s been promoted to AI’s control surface.”
A better understanding of online data collection
Based on these insights, we can expect interesting developments in comprehensive agentic systems for public data gathering, the growth of LLMs for parsing, and a shift toward quality over quantity in data search.
In tandem, over the next 12 months, legal decisions on copyright law must be made in both the US and Europe, as the current situation has left many in uncertain territory.
Hopefully, 2026 will bring businesses clarity and understanding, with new tools and capabilities to automate processes, as well as a better understanding of web data collection and its role in businesses’ day-to-day lives.