# Ethical Web Scraping: Best Practices for 2024
Web scraping operates in a complex legal and ethical landscape. As data becomes increasingly valuable, organizations must balance their data needs with respect for website owners, user privacy, and legal obligations.
## Understanding the Legal Framework
### Copyright Considerations
Facts themselves cannot be copyrighted, but the creative arrangement and presentation of data can be. Key principles:
- **Public Domain**: Government data and facts are generally safe to scrape
- **Creative Works**: Original articles, images, and creative content are protected
- **Terms of Service**: Website ToS can create binding contracts regarding scraping
### GDPR and Data Privacy
When scraping personal data from EU sources:
- Always have a legal basis for processing (contract, legitimate interest, or consent)
- Implement appropriate data security measures
- Respect data subject rights (access, deletion, portability)
- Maintain records of processing activities
### Computer Fraud and Abuse Act (CFAA)
In the United States, the CFAA has been used to prosecute unauthorized access. However, recent court decisions, notably the Supreme Court's Van Buren v. United States (2021) and the hiQ Labs v. LinkedIn litigation, have clarified that accessing publicly available data generally doesn't violate the CFAA.
## Technical Best Practices
### 1. Respect robots.txt
Always check and respect the robots.txt file:
```
User-agent: *
Disallow: /admin
Disallow: /private
Crawl-delay: 1
```
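A directive file like the one above can be checked programmatically before fetching any path. The sketch below is a deliberately simplified parser (it handles only a single `User-agent` group and prefix matching, which are assumptions; real robots.txt files can have multiple groups and wildcard rules, so a production crawler should use a tested library):

```javascript
// Parse the rules that apply to a given bot name from robots.txt text.
// Simplified: matches either "*" or an exact (case-insensitive) bot name.
function parseRobots(txt, botName) {
  const rules = { disallow: [], crawlDelay: 0 };
  let applies = false;
  for (const raw of txt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    switch (key.toLowerCase()) {
      case 'user-agent':
        applies = value === '*' || value.toLowerCase() === botName.toLowerCase();
        break;
      case 'disallow':
        if (applies && value) rules.disallow.push(value);
        break;
      case 'crawl-delay':
        if (applies) rules.crawlDelay = Number(value) || 0;
        break;
    }
  }
  return rules;
}

// Returns true if the path may be fetched under the parsed rules.
function isAllowed(rules, path) {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}
```

With the example file above, `isAllowed(rules, '/admin/users')` is false and the crawl delay of 1 second should feed directly into your rate limiter.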
### 2. Implement Rate Limiting
Never overwhelm target servers:
```javascript
// Good: a fixed, respectful delay between requests
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
await delay(1000) // 1 second between requests

// Better: adaptive rate limiting, waiting longer when the server responds slowly
const adaptiveDelay = (responseMs) => delay(Math.max(1000, 2 * responseMs))
await adaptiveDelay(serverResponseTime) // measured from the previous request
```
### 3. Identify Your Bot
Use descriptive user agents:
```javascript
// A descriptive User-Agent lets site owners identify you and get in touch
const response = await fetch('https://example.com/page', {
  headers: {
    'User-Agent': 'MyBot/1.0 (+https://mysite.com/bot-info); contact@mysite.com'
  }
})
```
### 4. Cache Responsibly
- Implement local caching to reduce redundant requests
- Respect cache headers from the server
- Set appropriate expiration times
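The points above can be sketched as a small in-memory cache that honors the server's `Cache-Control: max-age` directive. The cache shape, default TTL, and `max-age` parsing here are illustrative assumptions, not a full HTTP caching implementation (which would also handle `ETag`, `no-store`, and validation):

```javascript
// Extract max-age (in seconds) from a Cache-Control header, or fall back to a default TTL.
function maxAgeSeconds(cacheControl, fallback = 300) {
  const match = /max-age=(\d+)/.exec(cacheControl || '');
  return match ? Number(match[1]) : fallback;
}

// A tiny in-memory cache keyed by URL, expiring entries per the server's max-age.
const cache = new Map();

function getCached(url, now = Date.now()) {
  const entry = cache.get(url);
  return entry && entry.expires > now ? entry.body : null; // null when stale or absent
}

function putCached(url, body, cacheControl, now = Date.now()) {
  cache.set(url, { body, expires: now + maxAgeSeconds(cacheControl) * 1000 });
}
```

Checking `getCached(url)` before every request turns repeat lookups into local hits, which directly reduces load on the target server.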
## Ethical Guidelines
### Transparency
- Clearly identify your organization in user agent strings
- Provide contact information for webmasters
- Offer to stop scraping upon request
### Proportionality
- Only collect data you actually need
- Avoid scraping during peak hours when possible
- Don't scrape more frequently than necessary
### Attribution
When appropriate, attribute the original source of scraped data:
- "Data sourced from [website]"
- Link back to original content when displaying online
## Common Mistakes to Avoid
1. **Ignoring robots.txt** - This is the first rule of ethical scraping
2. **Scraping personal data without legal basis** - GDPR violations can be expensive
3. **Overwhelming servers** - Can be considered a denial-of-service attack
4. **Scraping behind logins without permission** - May violate ToS and computer fraud laws
5. **Repackaging copyrighted content** - Clear copyright violation
## Building a Sustainable Scraping Strategy
### Start with Permission
Whenever possible, get explicit permission:
- Check if the site offers an API
- Contact the website owner for access
- Consider licensing arrangements for commercial use
### Implement Monitoring
Set up systems to ensure ongoing compliance:
- Regular audits of scraping targets
- Automated alerts for blocked IPs or rate limits
- Review of newly published content for copyright issues
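One concrete monitoring hook is to treat HTTP 429 (Too Many Requests) responses as a compliance signal and back off automatically. The retry policy below (honor `Retry-After` when present, otherwise exponential backoff, with the retry count and base delay as illustrative assumptions) is a minimal sketch:

```javascript
// Fetch with backoff: honor Retry-After on 429, otherwise back off exponentially.
async function fetchWithBackoff(url, options = {}, maxRetries = 3) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options);
    if (response.status !== 429) return response;
    const retryAfter = Number(response.headers.get('Retry-After'));
    const waitMs = retryAfter ? retryAfter * 1000 : 1000 * 2 ** attempt;
    await sleep(waitMs); // back off before retrying
  }
  throw new Error(`Still rate-limited after ${maxRetries} retries: ${url}`);
}
```

Logging every 429 alongside the wait applied gives you the "automated alerts" data point above; a sustained rise in 429s is a sign your rate limits need to be loosened in the site's favor.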
### Document Everything
Maintain records of:
- Legal basis for scraping each target
- Rate limiting configurations
- Data retention and deletion policies
- Communications with website owners
## Conclusion
Ethical web scraping isn't just about following laws—it's about being a good internet citizen. By implementing these best practices, you can build sustainable scraping operations that respect both legal requirements and ethical norms.
When in doubt, consult with legal counsel familiar with data scraping regulations in your jurisdiction.