Overcoming Anti-Bot Measures: Advanced Techniques
Technical deep-dive into modern anti-bot systems and strategies to navigate them while maintaining ethical scraping practices, including fingerprinting evasion and proxy rotation.
Anti-bot systems have grown from simple IP blocks into layered detection stacks that combine fingerprinting, behavioral analysis, and network-level inspection. This guide walks through how each layer works and how to navigate it without crossing ethical or legal lines. The code samples are starting points, not silver bullets. Treat detection as an ongoing conversation with the target site, not a puzzle you solve once.
TL;DR
- Modern detection stacks combine fingerprinting, network signals, and behavioral analysis, not just IP reputation.
- Use real browsers with stealth patches for JavaScript challenges like Cloudflare.
- Scrape only public data, respect robots.txt, and throttle before you get blocked.
How do modern anti-bot systems detect scrapers?
Detection rarely relies on a single signal. Vendors like Cloudflare, Akamai, and DataDome layer four categories of checks, and passing one does not pass the others. If your scraper gets flagged, the first job is figuring out which layer caught it.
Behavioral analysis
Sites track mouse movements, scroll rhythm, typing cadence, and navigation order. A headless browser that clicks straight to a submit button without touching the page in between looks nothing like a human visitor.
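One way to look less mechanical is to move the cursor along a curved, variable-speed path instead of teleporting to the target. A sketch under stated assumptions: humanMousePath is an illustrative helper (not a library API) that generates a jittered, eased path; Playwright's page.mouse.move is a real API that can replay it.

```javascript
// Illustrative helper: build a jittered, eased path between two points
function humanMousePath(from, to, steps = 25) {
  const points = []
  for (let i = 0; i <= steps; i++) {
    const t = i / steps
    // Smoothstep easing: the cursor accelerates, then decelerates
    const ease = t * t * (3 - 2 * t)
    // Sinusoidal jitter that fades to zero at both endpoints
    const jitter = Math.sin(t * Math.PI) * (Math.random() - 0.5) * 10
    points.push({
      x: from.x + (to.x - from.x) * ease + jitter,
      y: from.y + (to.y - from.y) * ease + jitter
    })
  }
  return points
}

// Replay inside a Playwright page:
// for (const p of humanMousePath({ x: 0, y: 0 }, { x: 640, y: 360 })) {
//   await page.mouse.move(p.x, p.y)
// }
```

This is not a guarantee against behavioral models, only a way to avoid the most obvious "straight line at constant speed" signature.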
Browser fingerprinting
Canvas renders, WebGL parameters, installed fonts, and audio context hashes combine into a stable ID. The same fingerprint across rotating IPs is a strong bot signal.
Network analysis
TLS cipher order, HTTP/2 frame settings, and TCP/IP quirks reveal the client library. A Python requests call has a different TLS fingerprint than Chrome, even with identical headers.
JavaScript challenges
CAPTCHAs, invisible challenges, and timing-based DOM checks require real JS execution. Raw HTTP clients fail these every time.
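Since the first job after a block is figuring out which layer fired, a rough triage helper helps choose the fix. This is a heuristic sketch: the cf-mitigated header is Cloudflare-specific, and the status and phrase checks are common patterns, not an exhaustive list.

```javascript
// Heuristic triage: guess which detection layer flagged a response
function classifyBlock(status, headers, body) {
  if (headers['cf-mitigated'] === 'challenge' || /captcha|challenge/i.test(body)) {
    return 'javascript-challenge'   // needs a real browser (or a solver)
  }
  if (status === 429) {
    return 'rate-limit'             // slow down before rotating anything
  }
  if (status === 403) {
    return 'network-or-fingerprint' // TLS/HTTP2 fingerprint or IP reputation
  }
  return 'not-blocked'
}
```

The point is to react differently per layer: a 429 wants throttling, not a new fingerprint, while a challenge page means raw HTTP is off the table.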
What are the ethical limits of anti-bot evasion?
Evasion techniques are dual-use. The same stealth plugin that helps a price-comparison tool also powers credential stuffing. Before writing any code, set clear rules for what the scraper will and will not do.
Only apply these techniques when:
- You have legal authorization to access the data
- The data is publicly available without login
- Your scraping does not harm the site's operations
- You respect rate limits and terms of service
Never use them to:
- Access password-protected content without permission
- Circumvent authentication systems
- Overwhelm servers with requests
- Scrape personal data without a lawful basis
For the full legal framing (CFAA, GDPR, and robots.txt), see our guide on ethical web scraping best practices.
Which browser automation approach should you use?
Most serious anti-bot systems require a real browser. Raw HTTP works for simple sites, but once you hit a JavaScript challenge, you need Chromium driving the page. Three setups cover the majority of cases.
Playwright with stealth patches
This configuration launches Chromium with automation flags stripped and common detection hooks patched. It handles most JavaScript challenges because the browser is real.
const { chromium } = require('playwright-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')

// Register the stealth plugin before launching
chromium.use(StealthPlugin())

const browser = await chromium.launch({
  headless: true,
  args: [
    '--disable-blink-features=AutomationControlled',
    '--disable-dev-shm-usage',
    '--no-sandbox'
  ]
})

const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  viewport: { width: 1920, height: 1080 },
  locale: 'en-US'
})
Undetected-Chromedriver
For Python workflows, the undetected-chromedriver package automates these patches. The snippet below shows the manual equivalent with selenium-wire: it disables the automation switches Selenium normally sets and overrides the navigator.webdriver property that many detection scripts read first.
from selenium.webdriver.chrome.options import Options as ChromeOptions
from seleniumwire import webdriver  # selenium-wire wraps Selenium

options = ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options)

# Remove the webdriver property before any page script runs
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
Residential proxies
When your IP range gets blocked, routing requests through residential proxies helps traffic blend with consumer ISPs. The snippet wires an authenticated proxy into an Axios request.
const axios = require('axios')
const { HttpsProxyAgent } = require('https-proxy-agent')

async function fetchWithProxy(url) {
  const proxy = {
    host: 'proxy-server.com',  // placeholder provider
    port: 8080,
    auth: {
      username: 'user',
      password: 'pass'
    }
  }
  const response = await axios.get(url, {
    // Disable Axios' built-in proxy handling; the agent does the work
    proxy: false,
    httpsAgent: new HttpsProxyAgent(
      `http://${proxy.auth.username}:${proxy.auth.password}@${proxy.host}:${proxy.port}`
    )
  })
  return response.data
}
How do you obfuscate requests at the network layer?
Header spoofing alone is not enough. Network-layer fingerprints (TLS handshakes, HTTP/2 frames, and timing patterns) leak the client library before your first byte of HTML arrives. Three adjustments narrow the gap between your scraper and a real browser.
TLS fingerprinting
Libraries like undici give more control over the TLS handshake, which is what tools like JA3 hash to classify clients. Matching a browser's cipher order closes one of the loudest bot signals.
// Use undici for better TLS fingerprint control
import { request, Agent } from 'undici'

async function fetchWithTLSSettings(url) {
  return await request(url, {
    dispatcher: new Agent({
      connect: {
        timeout: 30_000
        // ciphers, ALPN, and other tls.connect options can go here
        // to better match a browser's handshake
      }
    })
  })
}
HTTP/2 fingerprint
HTTP/2 settings frames and header ordering differ between browsers and HTTP clients. Preserving the exact header order Chrome uses helps requests pass deeper fingerprint checks.
// http2-wrapper mirrors the https.get API over HTTP/2
const http2 = require('http2-wrapper')

http2.get(url, {
  headers: {
    // Mimic Chrome's header order; fingerprint checks compare it
    'user-agent': 'Mozilla/5.0...',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0'
  }
}, (response) => {
  response.on('data', (chunk) => {
    // Consume the body
  })
})
Request timing randomization
Humans pause. Scrapers that fire exact-interval requests look mechanical. Randomized delays between actions, and variable typing speed inside forms, make traffic patterns less predictable.
// Add human-like delays
function randomDelay(min = 1000, max = 3000) {
  return new Promise(resolve =>
    setTimeout(resolve, Math.random() * (max - min) + min)
  )
}

// Vary typing speed for form inputs
async function humanLikeType(element, text) {
  for (const char of text) {
    await element.type(char)
    await randomDelay(50, 200) // Pause between keystrokes
  }
}
How should you handle CAPTCHAs?
CAPTCHAs exist because a site has explicitly decided bots are unwelcome on that surface. Treat solving them as a last resort, on data you are authorized to access, and be ready to stop if the target escalates. Two patterns cover most use cases.
2Captcha / Anti-Captcha
These services pay human workers to solve challenges and return tokens via API. The Python snippet submits an hCaptcha task and polls until the solver returns a token.
import requests
import time

API_KEY = 'YOUR_2CAPTCHA_KEY'

def solve_captcha(site_key, url):
    # Submit the CAPTCHA task
    response = requests.post('http://2captcha.com/in.php', data={
        'key': API_KEY,
        'method': 'hcaptcha',
        'sitekey': site_key,
        'pageurl': url,
        'json': 1
    })
    task_id = response.json()['request']

    # Poll until a worker returns the token
    while True:
        result = requests.get(
            f'http://2captcha.com/res.php?key={API_KEY}&action=get&id={task_id}&json=1'
        )
        if result.json()['status'] == 1:
            return result.json()['request']
        time.sleep(5)
Playwright CAPTCHA injection
Once a solver returns a token, you inject it into the page's expected element and submit. Use this only in authorized test environments.
// For testing environments only!
async function solveRecaptcha(page, siteKey) {
  // Get a token from the solver service (helper defined elsewhere)
  const captchaResponse = await solveCaptcha2Captcha(siteKey)

  // Inject the token into the hidden response textarea
  await page.evaluate((token) => {
    document.getElementById('g-recaptcha-response').value = token
  }, captchaResponse)

  // Submit form
  await page.click('#submit-button')
}
What advanced evasion techniques exist beyond the basics?
Some targets push past standard HTML scraping. Real-time dashboards use WebSockets, modern apps hit GraphQL endpoints, and a few detection scripts probe for installed browser extensions. Each case needs a tailored approach.
WebSocket connections
When data streams over WebSockets, you connect directly with matching origin and user-agent headers rather than loading the full page.
// Some sites require WebSocket for real-time data
const WebSocket = require('ws')

const ws = new WebSocket('wss://example.com/socket', {
  headers: {
    'Origin': 'https://example.com',
    'User-Agent': 'Mozilla/5.0...'
  }
})

ws.on('message', (data) => {
  // Process real-time updates
})
GraphQL query interception
If a site's frontend calls GraphQL, intercepting responses inside a Playwright session is often cleaner than parsing rendered HTML.
// Intercept and analyze GraphQL queries
page.on('response', async (response) => {
  if (response.url().includes('/graphql')) {
    const data = await response.json()
    // Analyze query structure and responses
  }
})
Browser extension emulation
Some detection scripts inspect navigator.plugins and flag empty lists as automation. Injecting plausible plugin entries sidesteps this simple check.
// Some sites check for specific extensions
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'plugins', {
    get: () => [
      { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
      { name: 'Chrome Native Messaging', filename: 'chrome_native_messaging_host' }
    ]
  })
})
How does fingerprint evasion actually work?
Fingerprinting hashes values the browser normally exposes for rendering and hardware info. Overriding those values with consistent fakes makes the fingerprint stable across sessions without matching the real machine. Two surfaces matter most: Canvas and WebGL.
Canvas fingerprint
This override wraps getContext so getImageData returns predictable bytes. Detection scripts that hash canvas output get the same hash every time.
// Consistent canvas fingerprint
await page.evaluateOnNewDocument(() => {
  const getContext = HTMLCanvasElement.prototype.getContext
  HTMLCanvasElement.prototype.getContext = function (type) {
    const context = getContext.apply(this, arguments)
    if (type === '2d' && context) {
      // Make getImageData return the same bytes on every call
      const originalGetImageData = context.getImageData
      context.getImageData = function () {
        // Caution: ignoring the caller's arguments breaks legitimate
        // canvas use; real stealth patches add per-pixel noise instead
        return originalGetImageData.call(this, 0, 0, 1, 1)
      }
    }
    return context
  }
})
WebGL fingerprint
WebGL exposes GPU vendor and renderer strings through numeric parameter codes. Returning fixed values for the two most-probed codes keeps the fingerprint stable.
// Consistent WebGL parameters
await page.addInitScript(() => {
  const getParameter = WebGLRenderingContext.prototype.getParameter
  WebGLRenderingContext.prototype.getParameter = function (parameter) {
    // UNMASKED_VENDOR_WEBGL
    if (parameter === 37445) {
      return 'Intel Inc.'
    }
    // UNMASKED_RENDERER_WEBGL
    if (parameter === 37446) {
      return 'Intel Iris OpenGL Engine'
    }
    return getParameter.call(this, parameter)
  }
})
How do you maintain access over time?
A scraper that works on day one often fails by week two. Session cookies expire, IPs land on blocklists, and user-agent strings get stale. Three rotation strategies keep a pipeline running.
Rotate user agents
Cycling through current browser strings avoids one of the simplest classification rules.
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36...'
]

function getRandomUA() {
  return userAgents[Math.floor(Math.random() * userAgents.length)]
}
Session management
Persisting cookies between runs preserves warmed-up sessions that have already passed challenges, reducing the chance of re-triggering them.
// Save and reuse cookies
const fs = require('fs')

async function saveCookies(page, file) {
  const cookies = await page.context().cookies()
  fs.writeFileSync(file, JSON.stringify(cookies))
}

async function loadCookies(page, file) {
  const cookies = JSON.parse(fs.readFileSync(file))
  await page.context().addCookies(cookies)
}
IP rotation
Pulling from a proxy pool for each request spreads traffic across many source IPs. This stops rate-limit triggers tied to a single address.
// Rotate through a proxy list (loadProxyList is a helper defined elsewhere)
const axios = require('axios')
const proxies = loadProxyList()

async function getWithRotatingProxy(url) {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)]
  return await axios.get(url, {
    proxy: {
      host: proxy.host,
      port: proxy.port
    },
    timeout: 10000
  })
}
How do you detect blocks and recover?
Silent failures waste budget. A scraper that keeps fetching block pages while logging "success" is worse than one that crashes. Good pipelines check every response and retry with a new identity when flagged.
Monitor for blocking
This check scans the response body for common block phrases. When it matches, the scraper hands off to a recovery routine rather than saving garbage data.
async function checkIfBlocked(page) {
  const content = await page.content()
  // Common block indicators
  const blocked = content.includes('Access denied') ||
    content.includes('CAPTCHA') ||
    content.includes('Request blocked')
  if (blocked) {
    await handleBlock(page)
  }
}

async function handleBlock(page) {
  // Rotate proxy
  // Clear cookies
  // Change user agent
  // Wait before retry
}
Automatic recovery
Wrapping each fetch in a retry loop with exponential backoff and identity rotation turns a one-off block into a self-healing pipeline.
class ScraperWithRecovery {
  async scrape(url, maxRetries = 3) {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await this.attemptScrape(url)
      } catch (error) {
        if (error instanceof BlockedError) {
          await this.rotateIdentity()
          // Exponential backoff: 1s, 2s, 4s...
          await this.delay(Math.pow(2, attempt) * 1000)
        } else {
          throw error
        }
      }
    }
    throw new Error(`Still blocked after ${maxRetries} attempts`)
  }
}
What best practices keep scrapers polite?
The scrapers that survive longest are the ones target sites barely notice. Polite scraping is not a moral flourish; it is the most effective evasion. Three habits matter most.
Rate limiting
Checking robots.txt and capping request rate protects the target and keeps you off abuse lists.
// Always respect robots.txt (fetchRobotsTxt is a helper defined elsewhere)
const robotsTxt = await fetchRobotsTxt(url)
if (!robotsTxt.isAllowed(url)) {
  console.log('Scraping disallowed by robots.txt')
  return
}

// Implement rate limiting with the `limiter` package
const { RateLimiter } = require('limiter')
const limiter = new RateLimiter({
  tokensPerInterval: 1,
  interval: 'second'
})
Off-peak scheduling
Running heavy jobs outside business hours reduces load on the target's origin and the chance of triggering anomaly alerts.
// Add delays between requests
await delay(1000 + Math.random() * 2000)

// Avoid peak hours
const hour = new Date().getHours()
if (hour >= 9 && hour <= 17) {
  await delay(5000) // Slower during business hours
}
Graceful degradation
When the complex scraper fails, falling back to a simpler path keeps some data flowing rather than returning nothing.
// Graceful degradation: fall back to a simpler scraper on failure
let data
try {
  data = await scrapeComplexPage(url)
} catch (error) {
  logger.warn('Complex scraping failed, trying fallback')
  data = await scrapeSimpleVersion(url)
}
Conclusion
Anti-bot evasion is a constant arms race. The goal is not to win every round but to build pipelines that adapt when they lose one. Prioritize ethical practices, respect site policies, and treat blocks as signals, not failures.
The most reliable playbook:
- Use official APIs when available
- Get explicit permission for scraping
- Implement robust error handling and recovery
- Respect rate limits and robots.txt
- Monitor for blocking and adapt accordingly
The next frontier is AI-assisted detection on both sides. We cover the extraction side in the future of web scraping with AI. And when your fingerprint evasion works but throughput cannot keep up, the technical scaling guide covers proxy rotation, worker pools, and session management at scale.
About SIÁN Team
SIÁN Agency builds automated data pipelines for small businesses — from web scraping to AI processing to workflow integration. We write about what we know from building these systems every day.