
Scaling Web Scraping Operations: A Technical Guide

Discover the technical architecture and strategies needed to scale web scraping operations from thousands to millions of data points daily with distributed systems and cloud infrastructure.

Emily Johnson
January 5, 2024
12 min read
Architecture
Scalability
Infrastructure
Cloud
Distributed Systems

# Scaling Web Scraping Operations: A Technical Guide

Scaling web scraping from prototype to production requires careful architectural planning. This guide covers the technical strategies and infrastructure decisions needed to reliably process millions of data points per day.

## Architectural Patterns

### Distributed Scraping Architecture

For large-scale operations, a single server approach won't suffice. Consider this architecture:

```
              ┌─────────────┐
              │    Queue    │
              │ (Redis/RMQ) │
              └──────┬──────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
┌────▼────┐     ┌────▼────┐     ┌────▼────┐
│ Worker  │     │ Worker  │     │ Worker  │
│ Node 1  │     │ Node 2  │     │ Node 3  │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │  Database   │
              │  (MongoDB/  │
              │ PostgreSQL) │
              └─────────────┘
```

### Key Components

**1. Job Queue**
- Redis or RabbitMQ for task distribution
- Priority queues for important targets
- Dead letter queues for failed jobs

**2. Worker Nodes**
- Auto-scaling based on queue depth
- Independent failure isolation
- Geographic distribution for locality

**3. Result Storage**
- Time-series database for metrics
- Document store for scraped data
- Data lake for raw HTML archives
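
To see how the queue semantics fit together before committing to Redis or RabbitMQ, the priority and dead-letter behavior can be modeled in a few lines of plain JavaScript. This is an illustrative in-memory sketch, not a real queue client API; the `JobQueue` class and its method names are invented for the example:

```javascript
// In-memory model of a priority queue with a dead-letter queue.
// In production this role is played by Redis/RabbitMQ.
class JobQueue {
  constructor(maxAttempts = 3) {
    this.jobs = []         // pending jobs, highest priority first
    this.deadLetter = []   // jobs that exhausted their retries
    this.maxAttempts = maxAttempts
  }

  enqueue(url, priority = 0) {
    this.jobs.push({ url, priority, attempts: 0 })
    this.jobs.sort((a, b) => b.priority - a.priority) // priority ordering
  }

  dequeue() {
    return this.jobs.shift()
  }

  // Re-queue a failed job, or move it to the dead-letter queue
  fail(job) {
    job.attempts++
    if (job.attempts >= this.maxAttempts) {
      this.deadLetter.push(job)
    } else {
      this.jobs.push(job)
    }
  }
}

const q = new JobQueue(2)
q.enqueue('https://example.com/low', 1)
q.enqueue('https://example.com/high', 10)

let job = q.dequeue()  // high-priority job comes out first
q.fail(job)            // attempt 1: re-queued at the back
q.dequeue()            // the low-priority job
job = q.dequeue()      // the retried high-priority job
q.fail(job)            // attempt 2: exhausted, moved to dead-letter
```

The dead-letter queue is what you inspect later to decide whether failures were transient (retry them) or structural (fix the scraper).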

## Scaling Strategies

### Vertical Scaling (Single Machine)

For moderate scale (up to 1M pages/day):

```javascript
// Use connection pooling (node-postgres)
const { Pool } = require('pg')

const pool = new Pool({
  host: 'localhost',
  database: 'scraping',
  max: 20,                 // concurrent connections
  idleTimeoutMillis: 30000
})

// Implement request batching
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))
const chunk = (arr, size) =>
  Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
    arr.slice(i * size, i * size + size))

async function batchRequest(urls, batchSize = 10) {
  const batches = chunk(urls, batchSize)
  for (const batch of batches) {
    await Promise.all(batch.map(url => fetch(url)))
    await delay(1000) // rate limiting between batches
  }
}
```

### Horizontal Scaling (Multiple Machines)

For large scale (1M+ pages/day):

**Docker Swarm / Kubernetes**
```yaml
# docker-compose.yml
version: '3.8'
services:
  workers:
    image: scraper:latest
    deploy:
      replicas: 10
      update_config:
        parallelism: 2
        delay: 10s
    environment:
      - WORKER_CONCURRENCY=50
```

**Serverless Functions**
```javascript
// AWS Lambda example: consume URLs from an SQS-triggered event
// (scrapeUrl is your own scraping function)
export const handler = async (event) => {
  const urls = event.Records.map(r => r.body)
  const results = await Promise.all(
    urls.map(url => scrapeUrl(url))
  )
  return results
}
```

## Performance Optimization

### 1. Async I/O Throughout

```javascript
// Bad: sequential requests
for (const url of urls) {
  const data = await fetch(url) // each fetch blocks the next
  save(data)
}

// Good: parallel with a concurrency limit
const pLimit = require('p-limit')
const limit = pLimit(10)

await Promise.all(
  urls.map(url => limit(() => fetch(url)))
)
```

### 2. Connection Pooling

```javascript
// Reuse connections across requests
// (the agent option works with node-fetch, not the built-in fetch)
const https = require('https')

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10
})

const response = await fetch(url, { agent })
```

### 3. Smart Caching

```javascript
// Cache GET requests; don't cache POST
const crypto = require('crypto')
const NodeCache = require('node-cache')
const cache = new NodeCache({ stdTTL: 3600 }) // entries expire after 1 hour

async function cachedFetch(url) {
  const cacheKey = crypto.createHash('md5').update(url).digest('hex')

  let data = cache.get(cacheKey)
  if (data) return data

  data = await fetch(url).then(r => r.json())
  cache.set(cacheKey, data)
  return data
}
```

## Reliability Patterns

### Retry Logic with Exponential Backoff

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url)
    } catch (error) {
      if (i === maxRetries - 1) throw error

      const delay = Math.pow(2, i) * 1000 // 1s, 2s, 4s, ...
      await sleep(delay)
    }
  }
}
```

### Circuit Breaker Pattern

```javascript
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0
    this.threshold = threshold
    this.timeout = timeout
    this.state = 'CLOSED' // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = 0
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN')
      }
      this.state = 'HALF_OPEN' // let one probe request through
    }

    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  onSuccess() {
    this.failures = 0
    this.state = 'CLOSED'
  }

  onFailure() {
    this.failures++
    if (this.failures >= this.threshold) {
      this.state = 'OPEN'
      this.nextAttempt = Date.now() + this.timeout
    }
  }
}
```

## Monitoring and Observability

### Key Metrics to Track

1. **Request Rate**: Requests per second/minute
2. **Success Rate**: Percentage of successful scrapes
3. **Response Time**: P50, P95, P99 latencies
4. **Error Rate**: By error type (timeout, 404, 500, etc.)
5. **Queue Depth**: Number of pending jobs
6. **Resource Usage**: CPU, memory, network I/O
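
In practice these metrics come from tools like Prometheus or CloudWatch, but the latency percentiles above are easy to compute directly from recorded request durations. A minimal sketch using the nearest-rank method:

```javascript
// Compute a percentile from an array of request durations (in ms)
// using the nearest-rank method.
function percentile(durations, p) {
  const sorted = [...durations].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank - 1, 0)]
}

// Illustrative sample with one slow outlier
const durations = [120, 250, 95, 310, 180, 2200, 140, 160, 205, 175]
const p50 = percentile(durations, 50) // 175
const p95 = percentile(durations, 95) // 2200
const p99 = percentile(durations, 99) // 2200
```

Note how a single slow request dominates P95/P99 while leaving P50 untouched; that is exactly why averages alone are misleading for scraping workloads.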

### Logging Strategy

```javascript
// Structured logging
logger.info('Scrape completed', {
  url: normalizedUrl,
  status: response.status,
  duration: Date.now() - startTime,
  itemsExtracted: data.length,
  workerId: process.env.WORKER_ID
})
```

## Cost Optimization

### Infrastructure Costs

| Approach | Cost/1M Pages | Best For |
|----------|---------------|----------|
| Single Server | $50-100 | Prototyping |
| Serverless | $200-500 | Spiky workloads |
| Containers | $100-300 | Steady scale |
| Spot Instances | $50-150 | Fault-tolerant tasks |

### Optimization Tips

1. **Schedule During Off-Peak Hours**
- Many sites have lower traffic at night
- Reduce costs with spot pricing

2. **Deduplicate Requests**
```javascript
const seen = new Set()
urls = urls.filter(url => {
  if (seen.has(url)) return false
  seen.add(url)
  return true
})
```

3. **Compress Stored Data**
```javascript
const zlib = require('zlib')
const compressed = zlib.gzipSync(JSON.stringify(data))
```

## Common Pitfalls

1. **Underestimating Infrastructure Costs**
- Start with estimates, then measure actual costs

2. **Ignoring Rate Limits**
- Implement adaptive rate limiting

3. **No Failure Isolation**
- One bad target shouldn't crash your system

4. **Insufficient Monitoring**
- You can't improve what you don't measure
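
The adaptive rate limiting mentioned in pitfall 2 can be as simple as backing off when the target starts returning 429/503 and easing back toward a floor on success. A sketch, where the class name and the multipliers are illustrative choices, not a standard library:

```javascript
// Adaptive delay: double on HTTP 429/503, slowly recover on success.
class AdaptiveRateLimiter {
  constructor(minDelayMs = 500, maxDelayMs = 60000) {
    this.delayMs = minDelayMs
    this.minDelayMs = minDelayMs
    this.maxDelayMs = maxDelayMs
  }

  // Call after each response to adjust the pace
  record(statusCode) {
    if (statusCode === 429 || statusCode === 503) {
      this.delayMs = Math.min(this.delayMs * 2, this.maxDelayMs)   // back off
    } else {
      this.delayMs = Math.max(this.delayMs * 0.9, this.minDelayMs) // recover
    }
  }

  // Await this before each request
  wait() {
    return new Promise(resolve => setTimeout(resolve, this.delayMs))
  }
}

const limiter = new AdaptiveRateLimiter(500)
limiter.record(429) // delay: 500 -> 1000ms
limiter.record(429) // delay: 1000 -> 2000ms
limiter.record(200) // delay: 2000 -> 1800ms
```

Doubling on failure and recovering at only 10% per success keeps the scraper conservative: it slows down fast and speeds up cautiously.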

## Getting Started

1. **Start Small**: Prove the architecture with 10-100 targets
2. **Measure Baseline**: Know your current performance
3. **Add Gradually**: Increase scale while monitoring
4. **Automate Everything**: From deployment to recovery

## Conclusion

Building scalable web scraping operations requires careful planning and the right architecture. Focus on reliability from the start, and scale incrementally based on measured performance.

Need help scaling your web scraping operations? SIÁN Agency specializes in building enterprise-grade scraping infrastructure.

About Emily Johnson

Emily Johnson is a cloud architecture and scalability expert. She writes about building enterprise-grade web scraping infrastructure and distributed systems.
