
Scaling Web Scraping Operations: A Technical Guide

Discover the technical architecture and strategies needed to scale web scraping operations from thousands to millions of data points daily with distributed systems and cloud infrastructure.

SIÁN Team
February 10, 2026
12 min read
Architecture
Scalability
Infrastructure
Cloud
Distributed Systems

The pattern that scales web scraping to millions of pages daily is simple: a distributed queue, stateless worker nodes, and a central coordinator that owns retries and storage. Everything else is tuning.

This guide walks through that pattern end to end. It covers the architecture, the code you need at each layer, and the trade-offs between vertical, horizontal, and serverless scaling. The same playbook is what we used to win the Apify 1 Million Challenge.

TL;DR

  • Use a distributed queue plus stateless workers plus a central store. Don't scale a single machine.
  • Pick scaling mode by workload shape: vertical for <1M/day, containers for steady scale, serverless for spikes.
  • Reliability comes from retries with backoff, circuit breakers, and metrics on queue depth and P95 latency.

What architecture should you use to scale scraping?

Recommended pattern: a job queue (Redis or RabbitMQ) feeding a pool of stateless workers, with results written to a shared store. This separates concerns cleanly. The queue absorbs spikes, workers scale horizontally, and storage stays independent of scraping logic.
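As a minimal illustration of that flow, the in-memory sketch below simulates jobs entering a queue, stateless workers draining it concurrently, and results landing in a shared store. The class and function names here are hypothetical; in production the queue would be Redis or RabbitMQ and the store a database.

```javascript
// Hypothetical in-memory sketch of the queue -> workers -> store flow.
// In production the queue is Redis/RabbitMQ and the store a database.
class JobQueue {
  constructor() { this.jobs = [] }
  push(job) { this.jobs.push(job) }
  pop() { return this.jobs.shift() }
  get depth() { return this.jobs.length }
}

async function runWorker(id, queue, store, scrape) {
  let job
  while ((job = queue.pop()) !== undefined) {
    // Workers are stateless: everything they need rides on the job itself
    store.push({ workerId: id, url: job, data: await scrape(job) })
  }
}

async function runPool(urls, workerCount, scrape) {
  const queue = new JobQueue()
  const store = []
  urls.forEach(url => queue.push(url))
  // All workers drain the same shared queue until it is empty
  await Promise.all(
    Array.from({ length: workerCount }, (_, i) =>
      runWorker(i, queue, store, scrape))
  )
  return store
}
```

Because workers only ever touch the queue and the store, adding capacity is just adding more `runWorker` calls; the real system swaps the in-memory queue for a networked one without changing this shape.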

The ASCII diagram below shows how requests flow from queue to workers to the database.

                    ┌─────────────┐
                    │   Queue     │
                    │ (Redis/RMQ) │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
   ┌────▼────┐       ┌────▼────┐       ┌────▼────┐
   │ Worker  │       │ Worker  │       │ Worker  │
   │ Node 1  │       │ Node 2  │       │ Node 3  │
   └────┬────┘       └────┬────┘       └────┬────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                   ┌──────▼──────┐
                   │  Database   │
                   │ (MongoDB/   │
                   │  PostgreSQL)│
                   └─────────────┘

What does each component do?

1. Job Queue

  • Redis or RabbitMQ for task distribution
  • Priority queues for important targets
  • Dead letter queues for failed jobs

2. Worker Nodes

  • Auto-scaling based on queue depth
  • Independent failure isolation
  • Geographic distribution for locality

3. Result Storage

  • Time-series database for metrics
  • Document store for scraped data
  • Data lake for raw HTML archives
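The queue behaviors listed under component 1 (priority ordering and a dead letter queue) can be sketched in a few lines. This is an illustrative in-memory version; real deployments would use Redis sorted sets or RabbitMQ's dead-letter exchange instead.

```javascript
// Illustrative sketch of a priority queue with a dead letter queue.
// In production: Redis sorted sets or RabbitMQ's x-dead-letter-exchange.
class ScrapeQueue {
  constructor(maxAttempts = 3) {
    this.pending = []     // { url, priority, attempts }
    this.deadLetter = []  // jobs that exhausted their retries
    this.maxAttempts = maxAttempts
  }

  enqueue(url, priority = 0) {
    this.pending.push({ url, priority, attempts: 0 })
    // Higher priority first, so important targets jump the line
    this.pending.sort((a, b) => b.priority - a.priority)
  }

  dequeue() { return this.pending.shift() }

  // Failed jobs requeue until maxAttempts, then move to the DLQ for inspection
  fail(job) {
    job.attempts += 1
    if (job.attempts >= this.maxAttempts) this.deadLetter.push(job)
    else this.pending.push(job)
  }
}
```

The dead letter queue is what keeps one permanently broken URL from being retried forever: after `maxAttempts` it is parked for a human to look at instead of consuming worker time.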

Should you scale vertically or horizontally?

Match the scaling mode to the workload. Vertical scaling (one bigger box) handles up to roughly 1M pages per day and keeps operations simple. Horizontal scaling wins once throughput, fault tolerance, or geographic spread matter more than cost per page.

Vertical scaling (single machine)

The snippet below shows connection pooling and request batching, which are the two tweaks that most often unlock a single-node setup.

// Use connection pooling (pg's Pool shown here)
const { Pool } = require('pg')

const pool = new Pool({
  host: 'localhost',
  database: 'scraping',
  max: 20, // Cap on concurrent connections
  idleTimeoutMillis: 30000 // Close clients idle for 30s
})

// Implement request batching (chunk and delay are small helpers)
const chunk = (arr, size) =>
  Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
    arr.slice(i * size, i * size + size))
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

async function batchRequest(urls, batchSize = 10) {
  for (const batch of chunk(urls, batchSize)) {
    await Promise.all(batch.map(url => fetch(url)))
    await delay(1000) // Rate limiting between batches
  }
}

Horizontal scaling (multiple machines)

For 1M+ pages per day you want orchestration. The Compose file below declares a replicated worker service so Docker Swarm can schedule copies across the cluster (the same shape translates to a Kubernetes Deployment).

# docker-compose.yml
version: '3.8'
services:
  workers:
    image: scraper:latest
    deploy:
      replicas: 10
      update_config:
        parallelism: 2
        delay: 10s
    environment:
      - WORKER_CONCURRENCY=50

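The worker-node list above mentions auto-scaling based on queue depth. The decision itself is small: target a fixed number of pending jobs per worker and clamp to the replica bounds. The sketch below is illustrative; `jobsPerWorker` and the min/max bounds are tuning knobs, not fixed rules.

```javascript
// Illustrative autoscaler decision: size the worker pool from queue depth.
// jobsPerWorker, minReplicas and maxReplicas are per-deployment tuning knobs.
function desiredReplicas(queueDepth,
  { jobsPerWorker = 100, minReplicas = 1, maxReplicas = 10 } = {}) {
  const wanted = Math.ceil(queueDepth / jobsPerWorker)
  // Clamp so a burst can't spin up unbounded workers,
  // and an empty queue still keeps a warm minimum
  return Math.min(maxReplicas, Math.max(minReplicas, wanted))
}
```

A cron job or controller loop polls the queue, calls this, and updates `replicas` in the deployment; the clamp is what keeps a traffic spike from becoming a cost spike.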
Serverless is a strong fit when traffic is spiky and you don't want idle workers. The AWS Lambda handler below pulls a batch of URLs from an SQS event and scrapes them in parallel.

// AWS Lambda example
export const handler = async (event) => {
  const urls = event.Records.map(r => r.body)
  const results = await Promise.all(
    urls.map(url => scrapeUrl(url))
  )
  return results
}

How do you squeeze more throughput out of each worker?

Three optimizations move the needle more than anything else: non-blocking I/O, connection reuse, and caching of deterministic responses. Apply them before buying more machines.

Async I/O with a concurrency cap

Sequential awaits leave the event loop idle while each request is in flight. The version below uses p-limit to run requests in parallel while keeping the fan-out bounded.

// Bad: Blocking requests
for (const url of urls) {
  const data = await fetch(url) // Sequential
  save(data)
}

// Good: Parallel with concurrency limit
const pLimit = require('p-limit')
const limit = pLimit(10)

await Promise.all(
  urls.map(url => limit(() => fetch(url)))
)

Connection pooling

Reusing TCP/TLS connections avoids the handshake on every request. This agent keeps sockets alive and caps how many can idle.

// Reuse connections across requests
// (the agent option here assumes node-fetch; Node's built-in fetch ignores it)
const https = require('https')
const fetch = require('node-fetch')

const agent = new https.Agent({
  keepAlive: true,     // Reuse sockets instead of re-handshaking
  maxSockets: 50,      // Cap concurrent sockets per host
  maxFreeSockets: 10   // Cap idle sockets kept warm
})

const response = await fetch(url, { agent })

Smart caching

Cache only idempotent responses. This helper hashes the URL as the cache key and falls through to the network on miss.

// Cache GET requests, don't cache POST
const crypto = require('crypto')
const NodeCache = require('node-cache')

const cache = new NodeCache({ stdTTL: 3600 }) // Entries expire after 1 hour

async function cachedFetch(url) {
  const cacheKey = crypto.createHash('md5').update(url).digest('hex')

  const cached = cache.get(cacheKey)
  if (cached) return cached

  const data = await fetch(url).then(r => r.json())
  cache.set(cacheKey, data)
  return data
}

How do you keep a scraping pipeline reliable under failure?

Transient failures dominate scraping. Two patterns absorb most of them: retries with exponential backoff, and circuit breakers that stop hammering a dead target.

Retry with exponential backoff

The helper below doubles the delay between attempts so bursts of failures don't compound. It throws only after the last retry fails.

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url)
    } catch (error) {
      if (i === maxRetries - 1) throw error // Out of attempts: surface the error

      const delay = Math.pow(2, i) * 1000 // 1s, 2s, 4s, ...
      await sleep(delay)
    }
  }
}

Circuit breaker

A circuit breaker trips after repeated failures and short-circuits new calls until a timeout elapses. The class below is the minimum state machine you need.

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0
    this.threshold = threshold
    this.timeout = timeout
    this.state = 'CLOSED' // CLOSED, OPEN, HALF_OPEN
    this.openedAt = 0
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      // After the timeout, let one probe request through
      if (Date.now() - this.openedAt >= this.timeout) {
        this.state = 'HALF_OPEN'
      } else {
        throw new Error('Circuit breaker is OPEN')
      }
    }

    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  onSuccess() {
    this.failures = 0
    this.state = 'CLOSED'
  }

  onFailure() {
    this.failures += 1
    if (this.failures >= this.threshold) {
      this.state = 'OPEN'
      this.openedAt = Date.now()
    }
  }
}

What should you monitor?

You can't tune what you don't measure. Six signals catch almost every production issue before it spreads.

  1. Request rate: requests per second or minute
  2. Success rate: percentage of successful scrapes
  3. Response time: P50, P95, P99 latencies
  4. Error rate: split by error type (timeout, 404, 500)
  5. Queue depth: jobs waiting to be processed
  6. Resource usage: CPU, memory, network I/O
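The latency percentiles in item 3 can be computed directly from raw duration samples. A minimal sketch using the nearest-rank method; at high volume you'd switch to a streaming histogram rather than sorting every window.

```javascript
// Nearest-rank percentile over raw duration samples (ms).
// Fine for dashboards; use a histogram sketch at very high request volume.
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  // Nearest-rank: the smallest value with at least p% of samples at or below it
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}
```

Track P95 and P99 alongside P50: a healthy median with a climbing P95 usually means one target or proxy pool is degrading before the rest.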

Structured logging

Structured logs make these metrics searchable. The snippet below emits one event per scrape with the fields you need to slice by worker and target.

// Structured logging
logger.info('Scrape completed', {
  url: normalizedUrl,
  status: response.status,
  duration: Date.now() - startTime,
  itemsExtracted: data.length,
  workerId: process.env.WORKER_ID
})

How do you control infrastructure costs?

Pick the compute model that matches your duty cycle. The table below shows the rough ranges we see per million pages scraped.

Approach         Cost/1M Pages   Best For
Single Server    $50-100         Prototyping
Serverless       $200-500        Spiky workloads
Containers       $100-300        Steady scale
Spot Instances   $50-150         Fault-tolerant tasks

Optimization tips

  1. Schedule during off-peak hours

    • Many sites have lower traffic at night
    • Spot pricing tends to be cheaper then
  2. Deduplicate requests

    const seen = new Set()
    urls = urls.filter(url => {
      if (seen.has(url)) return false
      seen.add(url)
      return true
    })
    
  3. Compress stored data

    const zlib = require('zlib')
    const compressed = zlib.gzipSync(JSON.stringify(data))
    

Common pitfalls

  1. Underestimating infrastructure costs. Start with estimates, then measure actual costs.
  2. Ignoring rate limits. Implement adaptive rate limiting from day one.
  3. No failure isolation. One bad target shouldn't crash your system.
  4. Insufficient monitoring. You can't improve what you don't measure.
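Pitfall 2 calls for adaptive rate limiting. One hedged sketch of the idea: widen the delay whenever the target pushes back, shrink it slowly while requests succeed. The multiplicative factors below are assumptions to tune per target, not fixed constants.

```javascript
// Hypothetical adaptive rate limiter: double the delay on failure,
// decay it back toward baseDelay on success. Factors are per-target knobs.
class AdaptiveRateLimiter {
  constructor(baseDelay = 1000, maxDelay = 60000) {
    this.baseDelay = baseDelay
    this.maxDelay = maxDelay
    this.currentDelay = baseDelay
  }

  onSuccess() {
    // Recover gently so a single success doesn't undo the backoff
    this.currentDelay = Math.max(this.baseDelay, this.currentDelay * 0.9)
  }

  onFailure() {
    // Back off fast when the target signals distress (429s, timeouts)
    this.currentDelay = Math.min(this.maxDelay, this.currentDelay * 2)
  }

  get delayMs() { return Math.round(this.currentDelay) }
}
```

A worker waits `delayMs` between requests to a given host and reports each outcome; the asymmetry (fast backoff, slow recovery) mirrors TCP congestion control and keeps you from re-triggering a block the moment it lifts.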

Getting started

  1. Start small. Prove the architecture with 10-100 targets.
  2. Measure baseline. Know your current performance.
  3. Add gradually. Increase scale while monitoring.
  4. Automate everything. From deployment to recovery.

Once the pipeline moves volume, the questions shift to downstream processing and compliance. For the streaming layer see real-time data processing; for rate limiting and robots.txt at scale see ethical web scraping best practices.

Conclusion

Scaling web scraping is an architecture problem, not a code problem. Get the queue, workers, and storage separated early, then tune throughput and reliability on top. Measure before you scale, and let queue depth and P95 latency tell you when to add capacity.

About SIÁN Team

SIÁN Agency builds automated data pipelines for small businesses — from web scraping to AI processing to workflow integration. We write about what we know from building these systems every day.
