Scaling Web Scraping Operations: A Technical Guide
Discover the technical architecture and strategies needed to scale web scraping operations from thousands to millions of data points daily with distributed systems and cloud infrastructure.
The pattern that scales web scraping to millions of pages daily is simple: a distributed queue, stateless worker nodes, and a central coordinator that owns retries and storage. Everything else is tuning.
This guide walks through that pattern end to end: the architecture, the code you need at each layer, and the trade-offs between vertical, horizontal, and serverless scaling. It's the same playbook we used to win the Apify 1 Million Challenge.
TL;DR
- Use a distributed queue plus stateless workers plus a central store. Don't scale a single machine.
- Pick scaling mode by workload shape: vertical for <1M/day, containers for steady scale, serverless for spikes.
- Reliability comes from retries with backoff, circuit breakers, and metrics on queue depth and P95 latency.
What architecture should you use to scale scraping?
Recommended pattern: a job queue (Redis or RabbitMQ) feeding a pool of stateless workers, with results written to a shared store. This separates concerns cleanly. The queue absorbs spikes, workers scale horizontally, and storage stays independent of scraping logic.
The ASCII diagram below shows how requests flow from queue to workers to the database.
                ┌─────────────┐
                │    Queue    │
                │ (Redis/RMQ) │
                └──────┬──────┘
                       │
     ┌─────────────────┼─────────────────┐
     │                 │                 │
┌────▼────┐       ┌────▼────┐       ┌────▼────┐
│ Worker  │       │ Worker  │       │ Worker  │
│ Node 1  │       │ Node 2  │       │ Node 3  │
└────┬────┘       └────┬────┘       └────┬────┘
     │                 │                 │
     └─────────────────┼─────────────────┘
                       │
                ┌──────▼──────┐
                │  Database   │
                │  (MongoDB/  │
                │ PostgreSQL) │
                └─────────────┘
What does each component do?
1. Job Queue
- Redis or RabbitMQ for task distribution
- Priority queues for important targets
- Dead letter queues for failed jobs
2. Worker Nodes
- Auto-scaling based on queue depth
- Independent failure isolation
- Geographic distribution for locality
3. Result Storage
- Time-series database for metrics
- Document store for scraped data
- Data lake for raw HTML archives
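The queue and dead-letter behavior above can be modeled in a few lines before committing to Redis or RabbitMQ. This in-memory sketch is illustrative only (the class and method names are ours, not a library API):

```javascript
// Toy in-memory model of the job queue semantics: priority ordering
// plus a dead-letter list for jobs that exhaust their retries.
class JobQueue {
  constructor(maxAttempts = 3) {
    this.jobs = []         // Pending jobs, highest priority first
    this.deadLetter = []   // Jobs that failed maxAttempts times
    this.maxAttempts = maxAttempts
  }

  enqueue(url, priority = 0) {
    this.jobs.push({ url, priority, attempts: 0 })
    this.jobs.sort((a, b) => b.priority - a.priority)
  }

  dequeue() {
    return this.jobs.shift() // undefined when empty
  }

  // Re-queue a failed job, or move it to the dead-letter list
  // once it has used up its attempts.
  fail(job) {
    job.attempts += 1
    if (job.attempts >= this.maxAttempts) this.deadLetter.push(job)
    else this.jobs.push(job)
  }
}
```

In production the same semantics map onto Redis sorted sets or RabbitMQ priority queues, but getting the retry and dead-letter rules right in isolation first makes the distributed version much easier to debug.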
Should you scale vertically or horizontally?
Match the scaling mode to the workload. Vertical scaling (one bigger box) handles up to roughly 1M pages per day and keeps operations simple. Horizontal scaling wins once throughput, fault tolerance, or geographic spread matter more than cost per page.
Vertical scaling (single machine)
The snippet below shows connection pooling and request batching, which are the two tweaks that most often unlock a single-node setup.
// Use connection pooling (pg's Pool)
const { Pool } = require('pg')

const pool = new Pool({
  host: 'localhost',
  database: 'scraping',
  max: 20, // Concurrent connections
  idleTimeoutMillis: 30000
})

// Helpers: split an array into fixed-size chunks, and pause
const chunk = (arr, size) =>
  Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
    arr.slice(i * size, i * size + size))
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

// Implement request batching
async function batchRequest(urls, batchSize = 10) {
  const batches = chunk(urls, batchSize)
  for (const batch of batches) {
    await Promise.all(batch.map(url => fetch(url)))
    await delay(1000) // Rate limiting between batches
  }
}
Horizontal scaling (multiple machines)
For 1M+ pages per day you want orchestration. The Compose file below declares a replicated worker service so Swarm or Kubernetes can schedule copies across the cluster.
# docker-compose.yml
version: '3.8'
services:
  workers:
    image: scraper:latest
    deploy:
      replicas: 10
      update_config:
        parallelism: 2
        delay: 10s
    environment:
      - WORKER_CONCURRENCY=50
Serverless is a strong fit when traffic is spiky and you don't want idle workers. The AWS Lambda handler below pulls a batch of URLs from an SQS event and scrapes them in parallel.
// AWS Lambda example: SQS event → batch of URLs → parallel scrape
export const handler = async (event) => {
  const urls = event.Records.map(r => r.body)
  const results = await Promise.all(
    urls.map(url => scrapeUrl(url))
  )
  return results
}
How do you squeeze more throughput out of each worker?
Three optimizations move the needle more than anything else: non-blocking I/O, connection reuse, and caching of deterministic responses. Apply them before buying more machines.
Async I/O with a concurrency cap
Sequential awaits leave the event loop idle between requests. The version below uses p-limit to run requests in parallel while keeping the fan-out bounded.
// Bad: Sequential requests, one at a time
for (const url of urls) {
  const data = await fetch(url) // Waits for each response
  save(data)
}

// Good: Parallel with a concurrency limit
const pLimit = require('p-limit')
const limit = pLimit(10)
await Promise.all(
  urls.map(url => limit(async () => save(await fetch(url))))
)
Connection pooling
Reusing TCP/TLS connections avoids the handshake on every request. This agent keeps sockets alive and caps how many can idle.
// Reuse connections across requests. Note: Node's built-in fetch
// ignores `agent`; this works with node-fetch.
const https = require('https')
const fetch = require('node-fetch')

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10
})

const response = await fetch(url, { agent })
Smart caching
Cache only idempotent responses. This helper hashes the URL as the cache key and falls through to the network on miss.
// Cache GET requests, don't cache POST
const crypto = require('crypto')
const NodeCache = require('node-cache')

const cache = new NodeCache({ stdTTL: 3600 }) // 1-hour TTL

async function cachedFetch(url) {
  const cacheKey = crypto.createHash('md5').update(url).digest('hex')
  let data = cache.get(cacheKey)
  if (data) return data
  data = await fetch(url).then(r => r.json())
  cache.set(cacheKey, data)
  return data
}
How do you keep a scraping pipeline reliable under failure?
Transient failures dominate scraping. Two patterns absorb most of them: retries with exponential backoff, and circuit breakers that stop hammering a dead target.
Retry with exponential backoff
The helper below doubles the delay between attempts so bursts of failures don't compound. It throws only after the last retry fails.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url)
    } catch (error) {
      if (i === maxRetries - 1) throw error
      const delay = Math.pow(2, i) * 1000 // 1s, 2s, 4s...
      await sleep(delay)
    }
  }
}
Circuit breaker
A circuit breaker trips after repeated failures and short-circuits new calls until a timeout elapses. The class below is the minimum state machine you need.
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0
    this.threshold = threshold
    this.timeout = timeout
    this.state = 'CLOSED' // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = 0  // Timestamp when an OPEN breaker may retry
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN')
      }
      this.state = 'HALF_OPEN' // Timeout elapsed: allow one trial call
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  onSuccess() {
    this.failures = 0
    this.state = 'CLOSED'
  }

  onFailure() {
    this.failures++
    if (this.failures >= this.threshold || this.state === 'HALF_OPEN') {
      this.state = 'OPEN'
      this.nextAttempt = Date.now() + this.timeout
    }
  }
}
What should you monitor?
You can't tune what you don't measure. Six signals catch almost every production issue before it spreads.
- Request rate: requests per second or minute
- Success rate: percentage of successful scrapes
- Response time: P50, P95, P99 latencies
- Error rate: split by error type (timeout, 404, 500)
- Queue depth: jobs waiting to be processed
- Resource usage: CPU, memory, network I/O
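A minimal way to get the latency percentiles above without a full metrics stack is a sliding window of recorded durations. This is a sketch (the `LatencyTracker` name is ours); in production you would likely export these numbers to Prometheus or a similar system instead:

```javascript
// Sliding window of recent request durations with percentile lookup.
class LatencyTracker {
  constructor(windowSize = 1000) {
    this.durations = []
    this.windowSize = windowSize
  }

  record(ms) {
    this.durations.push(ms)
    // Keep only the most recent `windowSize` samples
    if (this.durations.length > this.windowSize) this.durations.shift()
  }

  // p is a percentile like 50, 95, or 99
  percentile(p) {
    if (this.durations.length === 0) return 0
    const sorted = [...this.durations].sort((a, b) => a - b)
    const idx = Math.ceil((p / 100) * sorted.length) - 1
    return sorted[Math.min(sorted.length - 1, Math.max(0, idx))]
  }
}
```

Call `record()` once per completed request, and alert when `percentile(95)` drifts above your baseline: rising P95 with flat P50 usually means one target is degrading, not the whole fleet.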
Structured logging
Structured logs make these metrics searchable. The snippet below emits one event per scrape with the fields you need to slice by worker and target.
// Structured logging: one event per scrape
logger.info('Scrape completed', {
  url: normalizedUrl,
  status: response.status,
  duration: Date.now() - startTime,
  itemsExtracted: data.length,
  workerId: process.env.WORKER_ID
})
How do you control infrastructure costs?
Pick the compute model that matches your duty cycle. The table below shows the rough ranges we see per million pages scraped.
| Approach | Cost/1M Pages | Best For |
|---|---|---|
| Single Server | $50-100 | Prototyping |
| Serverless | $200-500 | Spiky workloads |
| Containers | $100-300 | Steady scale |
| Spot Instances | $50-150 | Fault-tolerant tasks |
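To sanity-check these ranges against your own volume, the arithmetic is straightforward (the function name and the 30-day month are our assumptions):

```javascript
// Rough monthly cost from daily volume and a cost-per-million-pages rate.
function monthlyCost(pagesPerDay, costPerMillion) {
  const pagesPerMonth = pagesPerDay * 30
  return (pagesPerMonth / 1_000_000) * costPerMillion
}

// e.g. 1M pages/day on containers at $200 per 1M pages → $6,000/month
```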
Optimization tips
- Schedule during off-peak hours. Many sites have lower traffic at night, and spot pricing tends to be cheaper then.
- Deduplicate requests before they enter the queue:

const seen = new Set()
urls = urls.filter(url => {
  if (seen.has(url)) return false
  seen.add(url)
  return true
})

- Compress stored data before writing it:

const zlib = require('zlib')
const compressed = zlib.gzipSync(JSON.stringify(data))
Common pitfalls
- Underestimating infrastructure costs. Start with estimates, then measure actual costs.
- Ignoring rate limits. Implement adaptive rate limiting from day one.
- No failure isolation. One bad target shouldn't crash your system.
- Insufficient monitoring. Blind spots turn small failures into outages.
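The adaptive rate limiting mentioned above can be as simple as scaling the inter-request delay with recent outcomes. A sketch with illustrative constants (`AdaptiveLimiter` and its thresholds are ours, not a library API):

```javascript
// Adaptive delay: back off when errors rise, speed up when things recover.
class AdaptiveLimiter {
  constructor(baseDelayMs = 1000, maxDelayMs = 30000) {
    this.baseDelayMs = baseDelayMs
    this.maxDelayMs = maxDelayMs
    this.delayMs = baseDelayMs
  }

  onSuccess() {
    // Decay back toward the base delay
    this.delayMs = Math.max(this.baseDelayMs, this.delayMs * 0.9)
  }

  onError() {
    // Double the delay, capped at the maximum
    this.delayMs = Math.min(this.maxDelayMs, this.delayMs * 2)
  }

  wait() {
    return new Promise(resolve => setTimeout(resolve, this.delayMs))
  }
}
```

A worker calls `await limiter.wait()` before each request and reports the outcome afterward; keep one limiter per target domain so a struggling site slows only its own traffic.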
Getting started
- Start small. Prove the architecture with 10-100 targets.
- Measure baseline. Know your current performance.
- Add gradually. Increase scale while monitoring.
- Automate everything. From deployment to recovery.
Once the pipeline moves volume, the questions shift to downstream processing and compliance. For the streaming layer see real-time data processing; for rate limiting and robots.txt at scale see ethical web scraping best practices.
Conclusion
Scaling web scraping is an architecture problem, not a code problem. Get the queue, workers, and storage separated early, then tune throughput and reliability on top. Measure before you scale, and let queue depth and P95 latency tell you when to add capacity.
About SIÁN Team
SIÁN Agency builds automated data pipelines for small businesses — from web scraping to AI processing to workflow integration. We write about what we know from building these systems every day.