How to Scrape Websites Protected by PerimeterX: Bypassing "Please verify you are human" Challenges

published 2024-06-27
by James Sanders

TLDR: key takeaways

  • PerimeterX is a sophisticated anti-bot system that analyzes requests to detect and block automated scraping
  • It uses techniques like IP monitoring, header analysis, fingerprinting, and behavioral analysis to identify bots
  • You can bypass PerimeterX by scraping cached versions, using fortified headless browsers and proxies, or reverse engineering its detection methods

If you've ever tried to scrape data from major websites, you may have encountered a "Please verify you are human" message and been blocked from accessing the site. Chances are, this was the work of PerimeterX, a leading cybersecurity company that helps websites detect and prevent unwanted bot activity.

PerimeterX's bot detection technology is used by many large companies like Zillow, Wayfair, and Upwork to stop web scraping attempts. Its sophisticated system analyzes numerous factors to distinguish bot traffic from legitimate human users.

However, with the right techniques, it is possible to bypass PerimeterX and successfully scrape data from protected websites. In this guide, we'll explore how PerimeterX works and the most effective methods to get around its anti-bot defenses.

How PerimeterX Detects Bots

To understand how to evade PerimeterX, it helps to first know the main techniques it uses to spot automated scrapers and bots:

IP Address Monitoring

One of the first things PerimeterX looks at is the IP address a request comes from. It analyzes factors like the IP's location, provider (data center vs residential), and reputation to assign a risk score. High-risk IPs associated with prior bot activity are more likely to be blocked.

HTTP Header Analysis

Web browsers send specific HTTP headers with each request that identify the browser, OS, and other details. PerimeterX checks if the headers match known browser fingerprints or have any unusual properties that indicate the request isn't coming from a real browser.
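As a rough illustration of the kind of check described above, the sketch below flags header sets that do not hang together. The rules and the function name `header_inconsistencies` are assumptions made for this example, not PerimeterX's actual logic, which is not public. It relies on the fact that Chromium-based browsers currently send `sec-ch-ua` Client Hints while Firefox does not.

```python
from typing import Dict, List


def header_inconsistencies(headers: Dict[str, str]) -> List[str]:
    """Return reasons a header set looks unlike a real browser (illustrative rules only)."""
    # HTTP header names are case-insensitive, so normalise them before comparing.
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "")
    problems = []

    if not ua:
        problems.append("missing User-Agent")
    if "python-requests" in ua.lower() or "curl/" in ua.lower():
        problems.append("User-Agent names an HTTP library")
    if "accept-language" not in h:
        problems.append("missing Accept-Language")

    # Assumed rule: a Chrome User-Agent should come with Client Hints,
    # and a Firefox User-Agent should not.
    if "Chrome/" in ua and "sec-ch-ua" not in h:
        problems.append("Chrome User-Agent without sec-ch-ua")
    if "Firefox/" in ua and "sec-ch-ua" in h:
        problems.append("Firefox User-Agent with sec-ch-ua")
    return problems
```

Running your own scraper's headers through a check like this before sending them is a cheap way to catch the most obvious mismatches.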

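TLS and HTTP/2 Fingerprinting

When a connection opens, the client and server negotiate TLS and, on most modern sites, HTTP/2. The parameters a client offers during these handshakes, such as the cipher suites and extensions in its TLS ClientHello or the values in its HTTP/2 SETTINGS frame, form a fingerprint that differs between browsers and HTTP libraries. PerimeterX can compare that fingerprint against known browser profiles to judge whether traffic is coming from a genuine browser or an automated script.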
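To show what a TLS fingerprint looks like in practice, here is a minimal sketch of JA3, a public and widely used fingerprinting scheme. Whether PerimeterX uses JA3 itself or a proprietary variant is an assumption; the point is only that the values a client offers in its TLS handshake reduce to a short, comparable hash. The input numbers in the usage note are illustrative, not captured from a real browser.

```python
import hashlib
from typing import Iterable


def is_grease(value: int) -> bool:
    """GREASE placeholders (0x0A0A, 0x1A1A, ... 0xFAFA) are ignored by JA3."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version: int, ciphers: Iterable[int], extensions: Iterable[int],
               curves: Iterable[int], point_formats: Iterable[int]) -> str:
    """Build the JA3 text: five comma-separated fields, list values joined by '-'."""
    parts = [str(version)]
    for field in (ciphers, extensions, curves, point_formats):
        parts.append("-".join(str(v) for v in field if not is_grease(v)))
    return ",".join(parts)


def ja3_hash(version: int, ciphers: Iterable[int], extensions: Iterable[int],
             curves: Iterable[int], point_formats: Iterable[int]) -> str:
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

For example, `ja3_string(769, [0x0A0A, 47, 53], [0, 10, 11], [23, 24], [0])` yields `769,47-53,0-10-11,23-24,0`. Because the order of ciphers and extensions is part of the string, two clients offering the same options in a different order produce different hashes, which is why an HTTP library rarely matches the browser it claims to be.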

Browser and Device Fingerprinting

Browsers provide access to many APIs and properties that can be used to generate a detailed fingerprint of a user's browser and device configuration. PerimeterX looks at combinations of factors like screen size, installed fonts, WebGL rendering, and more to differentiate real users from headless browsers and emulators commonly used for scraping.

Behavioral Analysis and CAPTCHAs

For highly suspicious traffic, PerimeterX serves challenges like CAPTCHAs that require user interaction to solve. It also monitors behavior on the page, like mouse movements and keystrokes, to identify non-human actions. Bots that can't replicate natural user behavior get blocked.

Bypassing PerimeterX Bot Detection

While PerimeterX is continually evolving its defenses, resourceful developers have found several ways to circumvent its protections and scrape websites:

Scrape Google's Cached Version

The simplest way to access a PerimeterX-protected site is often to avoid it entirely. If the data you need isn't updated frequently, you can scrape Google's cached version of the pages instead, which won't trigger PerimeterX's live detections.

To access Google's cache, prefix the URL with: https://webcache.googleusercontent.com/search?q=cache:

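Keep in mind that not all sites allow caching, and this method only works for publicly accessible pages already indexed by Google. Google also began retiring its public cache in 2024, so this endpoint may no longer return results; confirm that it still responds before building a scraper around it.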
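Building the cache address is a simple string operation. The sketch below assumes the cache endpoint still responds for the page you want; the helper name `google_cache_url` and the example address are made up for illustration.

```python
from urllib.parse import quote

CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def google_cache_url(page_url: str) -> str:
    """Prefix a page URL with the Google cache endpoint."""
    # Percent-encode characters such as "?", "&" and "=" so they are not
    # read as part of the cache endpoint's own query string.
    return CACHE_PREFIX + quote(page_url, safe=":/")


print(google_cache_url("https://example.com/homes?page=2&sort=new"))
```

The returned address can then be fetched with any ordinary HTTP client, since the request goes to Google rather than to the protected site.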

Use a Fortified Headless Browser with Proxies

Headless browsers driven by automation tools like Puppeteer can be customized to avoid detection by PerimeterX. Anti-fingerprinting patches help conceal the signs of automation, while carefully vetted proxies (especially residential IPs) reduce the likelihood of blocklists and rate limits kicking in.

This approach takes significant work to get right: every aspect of the headless browser must line up to imitate a real user's fingerprint. A slight discrepancy between any two factors, like sending Firefox user agent headers with a Chrome TLS fingerprint, is an instant red flag.

Costs can also add up, since each page load consumes significant bandwidth (2+ MB on average) compared to a raw HTTP request. Carefully weigh the data ROI against infrastructure and proxy expenses.
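To make the cost trade-off concrete, here is a back-of-the-envelope sketch. The 2 MB per page figure comes from the estimate above; the price of $8 per GB and the 0.1 MB size for a raw HTTP request are placeholder assumptions, so substitute your provider's real rate and your own measurements.

```python
def estimate_proxy_cost(pages: int, mb_per_page: float = 2.0,
                        usd_per_gb: float = 8.0) -> float:
    """Estimate proxy bandwidth cost in USD (1 GB counted as 1000 MB)."""
    gigabytes = pages * mb_per_page / 1000
    return round(gigabytes * usd_per_gb, 2)


# 100,000 full page loads in a headless browser vs. the same number of raw requests.
browser_cost = estimate_proxy_cost(100_000)                  # 200 GB of traffic
raw_cost = estimate_proxy_cost(100_000, mb_per_page=0.1)     # 10 GB of traffic
print(f"headless browser: ${browser_cost:.2f}, raw HTTP: ${raw_cost:.2f}")
```

Under these assumed numbers the headless browser route costs about twenty times more in bandwidth alone, before counting the extra CPU and memory each browser instance needs.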

Reverse-Engineer PerimeterX's Client-Side Scripts

For those willing to get their hands dirty, it's possible to manually circumvent PerimeterX by analyzing and mimicking its client-side JavaScript challenges. In a nutshell:

  1. Intercept the obfuscated JS bundle PerimeterX's tag injects on each page load
  2. De-obfuscate the script to understand its inner workings (no easy task given the layers of encoding used)
  3. Replicate the browser APIs, inputs, and behavioral signals PerimeterX looks for
  4. Automate the process of retrieving, solving and submitting valid PerimeterX tokens with each request

This is not for the faint of heart and requires significant time to reverse-engineer and keep up with PerimeterX's evolving client-side checks. Even subtle lag in how quickly the JS is executed can tip off their behavioral analysis.

Only pursue this path if you have a massive scraping operation where the savings justify the upfront and ongoing engineering investment needed to outsmart PerimeterX at the protocol level. For most, using fortified headless browsers or proxy services is more cost-effective.

Residential & Mobile Proxies for PerimeterX

Whichever bypassing method you use, pairing it with quality proxies is key to avoiding IP-based blocking. Residential and mobile proxy networks are ideal since their IP addresses have strong reputations and are less likely to be banned (compared to data center IPs which many bot operators use).

Always source proxies from reputable providers with large, diverse IP pools and ethical sourcing practices. Cheap proxies scraped together from compromised devices cause more problems than they solve.

Fortunately, there are residential and mobile proxy aggregators that offer access to millions of IPs from tier-1 providers at affordable rates. This allows you to easily rotate high-quality IP addresses to distribute your traffic and keep success rates high.
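Rotation itself can be very simple. The sketch below cycles through a pool in round-robin order and returns a mapping in the shape that the `proxies` argument of the Python requests library expects. The class name `ProxyRotator` and the proxy addresses are hypothetical; real providers usually supply either a list like this or a single gateway address that rotates for you.

```python
from itertools import cycle
from typing import Dict, Iterable


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxy_urls: Iterable[str]) -> None:
        urls = list(proxy_urls)
        if not urls:
            raise ValueError("at least one proxy URL is required")
        self._pool = cycle(urls)

    def next_proxies(self) -> Dict[str, str]:
        # Use the same proxy for both schemes so one request never
        # mixes two exit IPs.
        url = next(self._pool)
        return {"http": url, "https": url}


# Hypothetical proxy endpoints, for illustration only.
rotator = ProxyRotator([
    "http://user:pass@proxy1.example.net:8000",
    "http://user:pass@proxy2.example.net:8000",
])
print(rotator.next_proxies())
```

Each call to `next_proxies()` returns the next proxy in the pool, so passing its result to every outgoing request spreads traffic evenly across the available IPs.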

Frequently Asked Questions

What is PerimeterX?

PerimeterX is a cybersecurity company, merged with HUMAN Security since 2022, that provides a bot detection platform for websites. Its technology analyzes incoming web requests to identify and block malicious bots engaging in content scraping, ad fraud, account takeovers, and other unwanted activities.

How does PerimeterX detect bots?

PerimeterX uses a combination of techniques to spot bots, including monitoring IP reputation, analyzing HTTP headers and protocol fingerprints for signs of automation, using browser fingerprinting to profile devices, and behavioral analysis to distinguish human and bot actions.

Can you bypass PerimeterX?

Yes. Although PerimeterX is one of the most sophisticated bot detection systems, it can be bypassed with the right techniques. Common approaches include using search engine caches, carefully tuned headless browsers, proxy services specializing in PerimeterX, and low-level protocol emulation.

Conclusion

PerimeterX is a formidable adversary for web scrapers, but with some technical know-how and the right tools, it is possible to collect data from protected sites. Whether you use cached versions, headless browsers, proxy services, or custom bot scripts, remember to respect website terms of service and practice good scraping etiquette.

Ultimately, bypassing anti-bot measures is a means to an end—extracting valuable public data to power your business. Focus on the ROI of your scraping operations and consider whether it makes sense to build your own PerimeterX countermeasures or leverage off-the-shelf proxy solutions to keep crawlers running smoothly.

James Sanders
James has been with litport.net since the very early days of our business. He is an automation magician who helps our customers choose the best proxy option for their software. James's goal is to share his knowledge and help your business reach top performance.