
The Web Scraping Dilemma

  • David Senecal
  • Dec 7, 2025
  • 6 min read

Some of the content of this article was initially published as a white paper that I authored for Akamai Technologies, and has been refreshed with additional insight and perspective.



The world needs data to function: to drive business decisions, train the large language models (LLMs) behind new applications, make information on a wide range of topics more accessible, help academics study how society and human behavior evolve, track price inflation, support public safety research, and investigate complex issues such as climate change. Machine learning models, including LLMs, process these massive amounts of data to generate valuable insights and drive strategic decisions. Today, there is no better place to find data than the internet. Anyone can consult a website and copy the data they need, but collecting it manually is time-consuming and tedious. That's where scraping botnets come in to facilitate data collection, with AI models to analyze it.


Botnets are at the heart of the data-collection strategy that feeds ML models and LLMs. They can be programmed to scrape websites, collect key data points, and format them for easier consumption by models. For competitive analysis, the data of interest is product details, prices, and inventory from ecommerce websites. To train an LLM to answer questions across many topics, the model needs verified information from reliable media outlets known for producing high-quality reports and articles. To assist in software development by solving coding problems, the model must be trained on code examples from a wide range of open-source repositories.
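
To make the extract-and-structure step concrete, here is a minimal Python sketch that pulls product details out of a small, hypothetical page fragment and formats them as JSON for easier consumption by a model. The markup, class names, and fields are assumptions for illustration; a real scraper would fetch pages over HTTP and adapt the extraction logic to each target site.

import json
from html.parser import HTMLParser

# Hypothetical product-page fragment; a real scraper would fetch pages over
# HTTP and adjust the class names for each target site.
SAMPLE_PAGE = """
<div class="product">
  <h1 class="name">Espresso Machine X200</h1>
  <span class="price">$349.99</span>
  <span class="stock">12 in stock</span>
</div>
"""

class ProductParser(HTMLParser):
    """Collects the text of elements whose class matches a field of interest."""
    FIELDS = {"name", "price", "stock"}

    def __init__(self):
        super().__init__()
        self.current = None
        self.data = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self.current = cls

    def handle_data(self, text):
        if self.current:
            self.data[self.current] = text.strip()
            self.current = None

parser = ProductParser()
parser.feed(SAMPLE_PAGE)

# Structured output ready for a model or data science pipeline:
# {"name": "Espresso Machine X200", "price": "$349.99", "stock": "12 in stock"}
print(json.dumps(parser.data, indent=2))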


Two types of botnets

Verified bots


Big internet companies such as Google, Meta, Amazon, Apple, and Microsoft have operated botnets, including Googlebot, Bingbot, and Applebot, for years to support their web search engines, online advertising, and social media services. These bots are generally well behaved: they identify themselves through the User-Agent HTTP header and follow robots.txt directives. Here's an example of the user agents for Googlebot and Bingbot:



Company/service: Google (Googlebot)
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36

Company/service: Microsoft (Bingbot)
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36
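
Because these user-agent strings are trivial to spoof, Google and Microsoft both recommend confirming a crawler's identity with a reverse DNS lookup followed by a forward lookup. Here is a minimal Python sketch of that check; the hostname suffixes reflect the operators' published guidance and may change, and the sample IP address is illustrative.

import socket

# Hostname suffixes the operators publish for their crawlers; adjust these if
# their guidance changes.
VERIFIED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def is_verified_bot(client_ip: str, bot: str) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then forward-resolve
    the hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)
    except OSError:
        return False
    if not hostname.endswith(VERIFIED_SUFFIXES.get(bot, ())):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return client_ip in forward_ips

# 66.249.66.1 sits in a range Google has published for Googlebot (illustrative).
print(is_verified_bot("66.249.66.1", "googlebot"))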




Traffic from web search engines has been accepted for years, and companies even optimize their websites for these crawlers to ensure their site is appropriately indexed and easy to find, attracting new customers. With the emergence of AI agents such as Copilot and ChatGPT, web administrators noticed that new botnets were collecting data from their sites. Here are examples of the user agents for OpenAI, Perplexity, and Anthropic:


Company: OpenAI
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)

Company: Perplexity
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)

Company: Anthropic
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
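
These crawler tokens can be referenced in a site's robots.txt file to allow or disallow them. As a quick way to see what a given robots.txt actually permits, here is a minimal sketch using Python's standard urllib.robotparser; the domain and path are placeholders.

from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot"]

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the robots.txt file

for token in AI_CRAWLERS:
    allowed = rp.can_fetch(token, "https://www.example.com/articles/")
    print(f"{token}: {'allowed' if allowed else 'disallowed'}")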


Initially, there was little understanding of these new platforms: why they were collecting data, how the data would be used, and the potential long-term impact, positive or negative, on the data owner's business. To make matters worse, this rapidly advancing AI technology is well understood only by a small community of technologists, and the space still lacks strong legal oversight to prevent misuse of the information collected. This led most businesses to initially reject the idea of sharing their data with companies running large AI models. But with the emergence of a new generation of web browsers like Atlas and Comet, and even Google Chrome gradually replacing its traditional search experience with Gemini, user interaction with the web through AI agents is growing daily. With that, businesses have begun sharing their data with AI platforms and exploring ways to optimize their web experience for AI agents.


Because they are cooperative and transparent, verified bots are easy to detect. Most bot management products identify and categorize them automatically, making it easier for web security teams to decide how to handle their traffic.


Unverified bots

Unverified bots, sometimes also known as evasive bots, are generally run by the web data collection industry, also known as scraping-as-a-service. They specialize in collecting data at scale on the internet. Some entities have organized around the Ethical Web Data Collection Initiative or the cross-industry Alliance for Responsible Data Collection. Both organizations promote responsible and ethical data collection practices. Among other things, ethical scraping means avoiding private data, scraping responsibly by not impacting the performance or availability of the targeted web server, and scraping during “off hours” to avoid affecting the user experience. According to G2, a peer-to-peer review site focusing on business software, “Businesses can leverage data extraction services providers to help generate leads, gather relevant information from competing businesses’ web pages, identify trends from document collections, and improve analysis of otherwise unstructured information.” 
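
To illustrate those practices, here is a minimal Python sketch of a polite crawler that throttles its requests and runs only during an assumed overnight quiet window. The URLs, window, and delay are placeholder assumptions, not guidance from any of these organizations.

import time
from datetime import datetime
from urllib.request import urlopen

OFF_HOURS = range(1, 5)   # assumed quiet window: 01:00-04:59 local time
DELAY_SECONDS = 2.0       # pause between requests to limit load on the target
TARGET_URLS = [
    "https://www.example.com/?page=1",  # placeholder pages
    "https://www.example.com/?page=2",
]

def crawl_politely(urls):
    for url in urls:
        if datetime.now().hour not in OFF_HOURS:
            print("Outside the off-hours window, stopping.")
            return
        with urlopen(url, timeout=10) as resp:
            body = resp.read()
        print(f"fetched {url}: {len(body)} bytes")
        time.sleep(DELAY_SECONDS)  # spread requests out over time

crawl_politely(TARGET_URLS)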


There are dozens of companies in that space, providing various levels of services:

  • Some mainly offer infrastructure as an extended network of proxies that may include data center, residential, and mobile IP addresses. Proxy services can be easily plugged into any homegrown scraping solution (see the sketch after this list).

  • Others offer scraping services with automated data extraction on top of their proxy infrastructure. They clean and structure the data to make it easier to consume, then deliver it to the customer’s data science team.

  • Finally, the most advanced offering also includes extracting business intelligence from the data collected to help drive business decisions.

Customers of these services can define their targets, the frequency of the data collection, and the level of service they want. 
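
As mentioned in the first item above, a proxy endpoint typically plugs into a homegrown scraper with a few lines of configuration. Here is a minimal Python sketch using the standard library; the proxy address and credentials are placeholders for whatever a provider would issue.

from urllib import request

PROXY = "http://user:password@proxy.example-provider.com:8080"  # placeholder

opener = request.build_opener(
    request.ProxyHandler({"http": PROXY, "https": PROXY})
)

# Requests made through this opener are routed via the proxy, so the target
# site sees the provider's IP address (data center, residential, or mobile,
# depending on the plan) rather than the scraper's own.
with opener.open("https://www.example.com/", timeout=10) as resp:
    print(resp.status, len(resp.read()))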


Web security practitioners and bot management vendors face a technologically advanced adversary staffed with its own engineers and data scientists, which evens the odds against the teams that build bot management products. As bot detection technology advances, scraping technology must evolve in turn so these companies can keep collecting data and protect their revenue.


Website owners’ unease

Beyond the uncertainty of how the collected data is used, the activity represents an operational challenge for website owners. 

  • Increased operating costs: Scraping has a price. On average, it represents 42% of overall site activity; in the most extreme cases, it exceeds 90% of traffic volume, leaving legitimate user traffic as just a small fraction. Processing the extra traffic requires scaling the infrastructure up and down as the scraping activity comes and goes, and it increases the cost of delivering the content (CDN cost).

  • Site stability: Poorly calibrated and overly aggressive scraping activity can result in site stability and availability issues, leading to revenue loss. All too often, scrapers want to get the data fast and may crawl hundreds of thousands of product pages in a short time. Sometimes, multiple scraping services attempt to collect data simultaneously, and the combined activity can easily overwhelm website infrastructure.

  • Metrics skew: Scrapers make themselves difficult to detect. Excess traffic incorrectly identified as human by bot management solutions skews KPI metrics, such as conversion rate, which most marketing teams use to inform product positioning, marketing strategy, and advertising investment (see the toy calculation after this list).
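
Here is a toy calculation, with hypothetical numbers, showing how quickly undetected bot traffic distorts conversion rate: the purchases stay the same while the session count is inflated by traffic wrongly counted as human.

human_sessions = 10_000
purchases = 500
undetected_bot_sessions = 7_000   # scraper sessions misclassified as human

true_rate = purchases / human_sessions
measured_rate = purchases / (human_sessions + undetected_bot_sessions)

print(f"true conversion rate:     {true_rate:.1%}")      # 5.0%
print(f"measured conversion rate: {measured_rate:.1%}")  # 2.9%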


Most websites’ acceptable use policies prohibit scraping. However, proponents of scraping argue that the activity is legal, and indeed, few laws ban it.


The big dilemma

There are always two sides to a story. For web security professionals, scraping activity is unwelcome: it adds load on the web infrastructure, causes occasional instability, increases operational costs, and skews key metrics. The lack of transparency about the intended use of the collected data, and sometimes the aggressive nature of the botnets, does not help either, leading web security teams to do whatever they can to block the traffic.

But companies need data to make decisions and succeed. Everyone needs data about the market and their competitors. Price scraping benefits all of us as consumers; inventory scraping can help economists and financial analysts gauge the health of the economy and inform investment decisions. Scraping social media platforms can support fraud detection and background checks, and help law enforcement find crime victims or their perpetrators. Scraping media and publisher sites helps detect plagiarism and protect intellectual property. And as interaction through AI agents keeps growing, the internet ecosystem continues to evolve; businesses and web security products need to adapt to this new type of interaction.

As a leader of an engineering team that builds bot management products, I have been in discussions with “the other side” for over a year, and it has helped me understand a different perspective. I remain committed to my mission to fight the bots, but understanding the nuances of this problem is key to building a product that adapts to the realities of the internet and helps businesses manage bot activity more effectively.

