The Relevance of Proxy Detection in Today’s Bot and Fraud Landscape

David Senecal
4 days ago

In the web security context, traffic from proxies is often considered high-risk because threat actors frequently use them to load-balance traffic through multiple IP addresses, in an attempt to evade detection while preserving their anonymity. In this article, we’ll review the importance of proxies in the internet ecosystem, discuss methods of detection, and assess how relevant identifying proxies is in today’s threat landscape.

The prevalence of proxies in the Internet ecosystem

Proxy servers are a key component of the internet infrastructure. Their primary purpose is to cache commonly requested web content closer to end users. Enterprises and internet service providers use them to accelerate web traffic for their users, and they are the core of the technology stack used by CDN providers such as Akamai, Fastly, Cloudflare, and Amazon. Without proxies, the internet would be much more congested and much slower than it is today.

Internet users who are concerned about privacy also use proxies to remain anonymous while browsing websites. Two examples are the infamous Tor (The Onion Router) network and Apple’s iCloud Private Relay. In that context, proxies are more like relays: a hop used to transmit a request from A to B. Overall, proxies largely earn their bad reputation from web scraping services and threat actors who use large proxy infrastructures to load-balance traffic and help evade detection. These proxy relays are the focus of this discussion.

Proxy detection methods

Proxies used to accelerate web traffic are generally easy to detect because they typically include the Via and/or X-Forwarded-For HTTP headers in the request. Proxies used to relay requests, by contrast, do not include such information.

IP reputation products such as IP Quality Score, MaxMind, Neustar, and many more provide metadata associated with an IP address, such as the country, the type of connection (cloud, residential, mobile), whether it is known to belong to a VPN or proxy service, and a risk score calculated from that metadata or from activity observed from that IP address. These commercial products use various methods to identify proxies. An IP address is more likely to be part of a proxy network if:

It sends an abnormally high number of requests
It sends requests to an unusually high number of domains
It is associated with a high number of unique user-agents
It is associated with a high number of unique Accept-Language header values
It sends continuous traffic throughout the day, without the typical circadian traffic pattern (traffic peaks during the day and drops at night)
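The indicators above can be sketched as a simple per-IP aggregation. This is only an illustration of the idea, not how any commercial IP reputation product works: the record layout, the function names, and every threshold below are assumptions made for the example and would need tuning against real traffic.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IpStats:
    """Observations for one IP address over a 24-hour window."""
    requests: int = 0
    domains: set = field(default_factory=set)
    user_agents: set = field(default_factory=set)
    languages: set = field(default_factory=set)
    hourly: list = field(default_factory=lambda: [0] * 24)


# Hypothetical thresholds, chosen only for illustration.
MAX_REQUESTS = 5_000
MAX_DOMAINS = 50
MAX_USER_AGENTS = 10
MAX_LANGUAGES = 5
MIN_NIGHT_RATIO = 0.5  # quietest hour / busiest hour


def aggregate(records):
    """Group (ip, hour, domain, user_agent, accept_language) tuples by IP."""
    stats = defaultdict(IpStats)
    for ip, hour, domain, user_agent, language in records:
        s = stats[ip]
        s.requests += 1
        s.domains.add(domain)
        s.user_agents.add(user_agent)
        s.languages.add(language)
        s.hourly[hour] += 1
    return stats


def proxy_indicators(s):
    """Return the names of the indicators that fire for one IP address."""
    flags = []
    if s.requests > MAX_REQUESTS:
        flags.append("high_request_volume")
    if len(s.domains) > MAX_DOMAINS:
        flags.append("many_domains")
    if len(s.user_agents) > MAX_USER_AGENTS:
        flags.append("many_user_agents")
    if len(s.languages) > MAX_LANGUAGES:
        flags.append("many_accept_languages")
    # A human-driven IP goes quiet at night, so its quietest hour is far
    # below its busiest hour. A flat profile suggests relayed traffic.
    busiest = max(s.hourly)
    if busiest and min(s.hourly) / busiest >= MIN_NIGHT_RATIO:
        flags.append("no_circadian_pattern")
    return flags
```

Calling `proxy_indicators(aggregate(records)[ip])` returns a list of indicator names for that address. Note that each check is itself an ordinary bot detection heuristic.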

In essence, one must use common bot and fraud detection methods to detect proxies. The “proxy signal,” in turn, is generally used to help detect bots and fraud. It seems that we’re going in circles…
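The one case that does not require behavioral signals is the self-announcing proxy mentioned at the start of this section. A minimal sketch of that check follows, assuming the request headers are available as a plain dictionary; the function name is hypothetical. Keep in mind that if your site sits behind a CDN or load balancer, that layer usually adds these headers itself, so their presence alone says little about the client.

```python
def announced_proxy_hops(headers):
    """Return the Via / X-Forwarded-For entries found in a request.

    `headers` maps header names to values. Header names are matched
    case-insensitively, and comma-separated values are split into hops.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    found = {}
    for name in ("via", "x-forwarded-for"):
        value = normalized.get(name)
        if value:
            hops = [part.strip() for part in value.split(",") if part.strip()]
            if hops:
                found[name] = hops
    return found
```

An empty result means the request carries no forwarding information, which is exactly what a request sent through a proxy relay looks like. The absence of these headers therefore proves nothing.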

The proliferation of proxy services

There are dozens of companies that offer proxy services. The best-known include Bright Data, Oxylabs, Decodo, and IPRoyal; many others exist. Some of these companies also provide web scraping services, which are among the most common use cases for proxy infrastructure in terms of traffic volume. Beyond scraping, these proxy relays are used for other purposes, such as enabling online marketers to test their sites across various geolocations and to verify whether their ad campaigns load as expected.

All proxy services offer cloud-based IP addresses. However, those are easy to detect using any IP reputation service. Therefore, most proxy services today also offer access to millions of residential and mobile IP addresses in their networks, with Bright Data reporting over 150 million residential IP addresses and Oxylabs reporting over 175 million in 195 countries. One of the main value propositions highlighted by proxy companies is relevant to this discussion: rapid IP rotation across residential and/or mobile IP addresses to evade anti-bot detection.

The activity enabled by proxy companies is controversial because, from a business owner's perspective, it represents unwanted bot activity with questionable purpose and benefit. Some proxy companies have worked hard in recent years to “clear their name” by vetting their customers more carefully and focusing on data collection (scraping) and the marketing angle. Others are less scrupulous and simply sell their bandwidth, enabling fraudsters to conduct attacks such as credential stuffing and account-opening abuse at scale.

Used well, the proxy infrastructure lets bot operators load-balance traffic so that it appears to come from legitimate users in the expected market. For example, if the bot activity targets a business serving the US market, the bot operator can configure the proxy service to route the activity during US daytime hours from US IP addresses assigned by well-known ISPs such as Comcast, AT&T, or T-Mobile.

How are residential and mobile IP addresses acquired?

In a recent assessment of bot activity targeting more than 300 web properties, I found that more than 75% of bot traffic originated from residential or mobile IP addresses. In contrast, fewer than 25% originated from IP addresses assigned to cloud providers. These numbers show how prevalent residential and mobile proxies are among bot operators, and they align with our understanding of common bot evasion strategies: when faced with resistance, operators adjust the "proxy mix" toward residential and mobile IP addresses. Proxy providers use various methods to source residential and mobile IP addresses:

Free mobile app monetization incentive: Proxy providers offer a mobile app software development kit (SDK) to help developers monetize their free applications (e.g., games), increase revenue, enhance the user experience, or provide an ad-free experience. Bright Data, Packet SDK, PYPROXY, and Infatica are examples of companies that offer SDKs that, once integrated into an application and enabled by the user, allow the device to participate in a proxy network. As part of the “opt-in process”, users may be offered an ad-free or enhanced experience if they agree to participate in the proxy network to help collect data for “research”. Once a device opts into the proxy service, some web scraping activity may pass through the device as needed. The SDK provider compensates the mobile app developer based on the number of daily active users who opt in to the proxy service. Although proxy providers claim this method is ethical, it remains controversial because the intent and purpose are often glossed over, glorified, or hidden in terms and conditions that users who opt in either do not read or do not understand.

Bandwidth sharing/monetization: Companies such as Honeygain, PacketStream, Packetshare, and Proxyrack offer applications that allow anyone to earn passive income by sharing their internet bandwidth and participating in a proxy service. Users get paid based on the amount of traffic that transits through their devices. The application runs in the background and requires no user intervention.

Peer-to-peer, Free VPN offering: BrightVPN and others offer a free VPN service in exchange for users sharing their bandwidth and opting into Bright Data’s proxy network.

There are also less common, more expensive private proxy solutions that rely on deployed hardware, such as the hardware mobile proxies offered by companies like Coronium. There have also been reports of hacked IoT devices with embedded proxy services. Deploying hacked devices is complex, far more controversial, and ultimately yields far fewer IP addresses than the monetization models previously described.

Today’s value of proxy detection

IP geolocation and reputation products remain relevant for inferring the country of origin of an IP address or the type of connection it is associated with: a cloud provider, a residential network, or a mobile network. However, in the current landscape, most proxy infrastructure comprises millions of residential and mobile IP addresses that legitimate users may also be using at any given time. Confirmed bot or fraud activity shows a strong overlap with IP addresses associated with proxy services. However, more than 10% of legitimate traffic is also associated with IP addresses used by proxy services, simply because of the schemes described in the previous section.

This indicates that the proxy signal carries a high risk of false positives. On its own it is unreliable, and it must be combined with other signals, such as fingerprint anomalies or abnormal behavior. For example, to improve accuracy, the proxy signal can be combined with a high request rate or an abnormally high number of sites visited within a short period, and paired with a challenge-response strategy (CAPTCHA) to gracefully handle false positives.

Detecting residential and mobile proxies is not easy. Given the limited signal value, I prefer to focus on detection methods with higher returns and accuracy, such as fingerprint anomaly detection, user behavior detection, and behavioral biometric detection. Some of those methods are complex, but when implemented correctly, they yield more accurate, deterministic signals.
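To make the idea of combining signals concrete, here is a minimal sketch of a decision policy in which the proxy signal never triggers an action by itself. The signal names, the request-rate threshold, and the allow/challenge/deny mapping are all assumptions made for illustration; a real policy would be tuned per site.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"  # e.g. a CAPTCHA, to absorb false positives
    DENY = "deny"


@dataclass
class RequestSignals:
    is_known_proxy: bool       # from an IP reputation product
    fingerprint_anomaly: bool  # from fingerprint anomaly detection
    abnormal_behavior: bool    # from behavioral detection
    requests_per_minute: int


HIGH_REQUEST_RATE = 120  # hypothetical threshold


def decide(sig):
    """Treat the proxy signal as supporting evidence, never as a verdict."""
    strong = sig.fingerprint_anomaly or sig.abnormal_behavior
    high_rate = sig.requests_per_minute > HIGH_REQUEST_RATE

    # A strong signal corroborated by proxy use or a high rate: block.
    if strong and (sig.is_known_proxy or high_rate):
        return Action.DENY
    # A strong signal alone, or proxy use plus a high rate: challenge.
    if strong or (sig.is_known_proxy and high_rate):
        return Action.CHALLENGE
    # Proxy use alone is not enough, since legitimate users share these IPs.
    return Action.ALLOW
```

With this policy, a request from a known proxy IP address with no other anomaly is allowed, which keeps the legitimate users who share those addresses out of the blast radius.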

Conclusion

Bot and fraud detection is complex, and to this day, I still see security professionals overemphasizing IP-based and proxy detection in their defense strategy. I understand why: these methods are simpler to understand and relate to, and threat actors are known to use proxy infrastructure extensively. The rationale is clear: if you can detect the proxies, you can detect and mitigate the undesirable activity. However, given its non-negligible false-positive rate, the proxy signal on its own is largely inadequate for protecting websites in today’s threat landscape without impacting legitimate traffic.

The proxy signal worked reasonably well 10 years ago. Since then, apps that can turn a device into a proxy relay have proliferated, and far too many legitimate devices now double as proxy relays for the signal to be usable on its own; at a minimum, it must be combined with other bot detection signals. I used the proxy signal a few years ago, but have set it aside more recently to focus on detecting bots and botnets based on their behavior and characteristics rather than on whether the IP address hosts a proxy relay. If you still want the proxy signal in your detection stack, I recommend purchasing an IP reputation solution rather than attempting to build the detection yourself, so that you can focus on developing more valuable detection methods.