Web Development · Automation

Web Scraping in 2026: Playwright, Puppeteer, and the Legal Line

Playwright has become the dominant web scraping tool, but the tooling decision is the easy part. Here's a practical guide to scraping that works, and the legal and ethical lines you need to know before shipping.

Abhishek Gupta

7 min read

Someone on your team needs to pull data from a website that does not have an API. Maybe it is competitor pricing, maybe it is a public government data source that only has a web interface, maybe it is product listings from a marketplace. Web scraping is the answer. The question is how to do it cleanly, reliably, and without stepping into legal or ethical problems.

This is a practical guide. It covers tool choice, the patterns that work, the patterns that get you blocked, and the legal and ethical situation you need to understand before you ship anything commercial.

Playwright vs. Puppeteer vs. everything else

Playwright is the right default for new projects in 2026. The quick comparison:

Tool | Language | Browser support | Maintained by
Playwright | JS/TS, Python, Java, .NET | Chromium, Firefox, WebKit | Microsoft
Puppeteer | JS/TS | Chromium, Firefox | Google
Selenium | JS, Python, Java, Ruby, C# | All major browsers | Selenium Project
Scrapy | Python | N/A (HTTP only) | Open source

Playwright wins on most dimensions for browser-based scraping: cross-browser support, automatic waiting for elements, a cleaner async API, and first-class Python support. Selenium is worth considering when you need a language Playwright has no official bindings for, such as Ruby, or when a large project has already standardized on it. Scrapy is the right tool when you are scraping at scale and the target site is server-rendered HTML that needs no JavaScript. Plain HTTP requests are faster and cheaper than driving a browser for that workload.

Puppeteer is still used, particularly in Node.js projects where switching would be costly. But for greenfield work, there is little reason to choose it over Playwright.

A minimal working Playwright scraper

Here is what a clean Python scraper looks like in 2026:

import asyncio
from playwright.async_api import async_playwright, Page


async def scrape_listing(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 800},
        )
        page: Page = await context.new_page()

        try:
            await page.goto(url, wait_until="domcontentloaded")
            # Actions auto-wait, but an explicit wait documents what the page must contain
            await page.wait_for_selector(".product-title", timeout=10_000)

            title = await page.text_content(".product-title")
            price = await page.text_content(".product-price")
            return {"title": title, "price": price}
        finally:
            # Close the browser even if navigation or a selector wait fails
            await browser.close()


if __name__ == "__main__":
    print(asyncio.run(scrape_listing("https://example.com/product/1")))

A few notes on this pattern:

  • Always set a realistic user_agent. Headless Chromium's default user agent identifies itself as HeadlessChrome, which is trivial to detect. Keep the version number in your string reasonably current.
  • wait_until="domcontentloaded" is faster than networkidle and sufficient when your target content is in the DOM rather than loaded by a subsequent fetch.
  • wait_for_selector is more reliable than wait_for_timeout. An explicit time delay is fragile; an element wait adapts to the actual page state.
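
The scraper above returns whatever text_content hands back, which can be None or padded with whitespace. A minimal sketch of cleaning those fields before storing them follows. The helper names clean_text and parse_price are hypothetical, and the price parser assumes a dot as the decimal separator and commas as thousands separators, which will not hold for every locale.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace; return None for missing or empty text."""
    if raw is None:
        return None
    collapsed = " ".join(raw.split())
    return collapsed or None


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Pull the first number out of a price string such as '$1,299.00'.

    Assumption: '.' is the decimal separator and ',' groups thousands.
    """
    text = clean_text(raw)
    if text is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None
```

Returning None rather than raising keeps a single malformed field from killing the whole run; a separate check on the finished batch can then decide whether too many fields came back empty.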

Patterns that prevent blocks

Anti-bot systems fingerprint several signals. Here are the practical countermeasures:

Rate limiting. Add delays between requests. A jittered pause of a few seconds between pages is a reasonable starting point for most sites. Machine-speed requests (hundreds per minute) will get you blocked by any site that runs bot protection.

import asyncio
import random

async def polite_fetch(page, url):
    await page.goto(url)
    # jitter the wait so it does not look robotic
    await asyncio.sleep(random.uniform(1.5, 3.5))
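
When one job touches several sites, a fixed sleep after every request slows everything down equally. One possible refinement, sketched here under the assumption that pacing per host is what you want, is to track the next allowed request time for each hostname. The HostThrottle class is hypothetical and the gap values simply mirror the jitter range used above.

```python
import asyncio
import random
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a jittered minimum gap between requests to the same host."""

    def __init__(self, min_gap: float = 1.5, max_gap: float = 3.5):
        self.min_gap = min_gap
        self.max_gap = max_gap
        self._next_allowed: dict[str, float] = {}

    def delay_for(self, url: str, now: float) -> float:
        """Return how long to wait before requesting url, and reserve the slot."""
        host = urlsplit(url).netloc
        ready_at = max(self._next_allowed.get(host, now), now)
        # Book the following slot so concurrent callers queue up behind this one
        self._next_allowed[host] = ready_at + random.uniform(self.min_gap, self.max_gap)
        return ready_at - now

    async def wait(self, url: str) -> None:
        """Sleep until the host of url may be requested again."""
        await asyncio.sleep(self.delay_for(url, time.monotonic()))
```

Usage would be `await throttle.wait(url)` immediately before `await page.goto(url)`. Requests to different hosts do not wait on each other, while repeated requests to one host are spaced out.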

IP rotation. A single IP hitting a site repeatedly at any speed is detectable. For larger-scale work, use a rotating proxy service or accept that your scrape will be throttled or blocked.

Request headers. Rotate User-Agent strings, set Accept-Language and Accept headers to plausible browser values, and keep everything you send consistent. A User-Agent that claims one Chrome version alongside Sec-CH-UA client hints that report another is an easy mismatch to detect.
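
One way to keep those values consistent is to rotate whole profiles rather than individual headers. The sketch below is an illustration only: the PROFILES list and pick_context_options are hypothetical, and the user agent strings should be replaced with ones that match current browser releases. The keys are real new_context options; Playwright derives the Accept-Language header from locale.

```python
import copy
import random
from typing import Any

# Hypothetical profiles: each one is a set of values that belong together
PROFILES: list[dict[str, Any]] = [
    {
        "user_agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "locale": "en-US",
        "viewport": {"width": 1280, "height": 800},
    },
    {
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "locale": "en-GB",
        "viewport": {"width": 1366, "height": 768},
    },
]


def pick_context_options(rng: random.Random) -> dict[str, Any]:
    """Pick one complete profile; never mix fields from different profiles."""
    return copy.deepcopy(rng.choice(PROFILES))
```

The result can be passed straight through, as in `context = await browser.new_context(**pick_context_options(random.Random()))`.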

Session management. Logged-in sessions are more complex. If a site requires login, that is a clear signal that scraping it may violate the ToS, and you should read the legal section below carefully.

Handling JavaScript-rendered content

Sites that load content via JavaScript after page load need a headful or headless browser (Playwright or Puppeteer, not a plain HTTP fetcher like requests). The Playwright pattern above handles this. Two specific cases worth knowing:

Infinite scroll. The content loads as you scroll. Simulate this:

# Scroll to the bottom incrementally to trigger loading
for _ in range(5):
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await asyncio.sleep(1.5)
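
A fixed number of scrolls either stops too early or wastes time on short pages. A possible variation, assuming the page grows taller each time new items load, is to keep scrolling until the document height stops changing. The function name scroll_until_stable and the max_rounds cap are illustrative choices, not part of Playwright.

```python
import asyncio


async def scroll_until_stable(page, max_rounds: int = 20, pause: float = 1.5) -> int:
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = await page.evaluate("document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give the page time to fetch and render the next batch
        await asyncio.sleep(pause)
        height = await page.evaluate("document.body.scrollHeight")
        if height == last_height:
            return round_number
        last_height = height
    return max_rounds
```

The cap matters: some feeds never end, and without it the loop would run until the browser ran out of memory.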

Waiting for network responses. Sometimes you want to intercept the API call the page makes rather than parse the rendered HTML. Playwright’s page.expect_response() lets you wait for a specific URL and parse the JSON directly, which is cleaner and more stable than HTML parsing:

async with page.expect_response(lambda r: "/api/products" in r.url) as resp_info:
    await page.goto(url)
response = await resp_info.value
data = await response.json()

This is worth checking: many sites that do not have a public API do have internal API calls. Intercepting those is simpler and more stable than scraping the HTML.
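
A substring match on the full URL can also catch unrelated requests, for example an analytics call that carries the API path in its query string. A stricter predicate, sketched here with the same assumed /api/products path, checks the URL path, the status code, and the content type. It relies only on the url, status, and headers attributes of Playwright's Response object; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_product_api_response(response, path_prefix: str = "/api/products") -> bool:
    """Match successful JSON responses whose URL path starts with path_prefix."""
    path = urlsplit(response.url).path
    # Playwright exposes response header names in lower case
    content_type = response.headers.get("content-type", "")
    return (
        path.startswith(path_prefix)
        and response.status == 200
        and "application/json" in content_type
    )
```

It drops in where the lambda was: `async with page.expect_response(is_product_api_response) as resp_info:`.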

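The legal line

The legal picture shifted with the 9th Circuit’s ruling in hiQ v. LinkedIn (issued in 2019 and reaffirmed in 2022 after the Supreme Court sent the case back for reconsideration), which held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. Scraping public data that is accessible without authentication is therefore generally not treated as unauthorized access under US federal law.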

That said, three things can create real legal exposure:

Terms of Service violations. Scraping in violation of a site’s ToS is a contract issue, not a criminal one under most readings of the CFAA, but it can result in civil action and account termination. Read the ToS before scraping commercially.

Personal data. If the data you are collecting contains personal information about EU residents, GDPR applies. If it contains California residents’ personal data, CCPA may apply, depending on the size and nature of your business. “It’s public” is not a defense if you are building a database of personal information at scale.

Competitive misappropriation. Some courts have found scraping to create liability under state unfair competition law, particularly when it harms the business being scraped. The lines here are less clear.

The practical rule: check robots.txt, honor the Crawl-delay if it is set, do not scrape behind authentication without permission, and if you are building a commercial product on scraped data, talk to a lawyer before you launch.
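
The robots.txt part of that rule is easy to automate with Python's standard library. The sketch below parses robots.txt text that has already been downloaded and returns whether a URL may be fetched and how long to wait between requests. The helper names, the "my-scraper" agent string in the usage note, and the 2-second fallback delay are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_policy(
    parser: RobotFileParser,
    user_agent: str,
    url: str,
    default_delay: float = 2.0,
) -> tuple[bool, float]:
    """Return (allowed, delay_seconds) for user_agent requesting url."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    # Fall back to our own delay when the site does not set Crawl-delay
    return allowed, float(delay) if delay is not None else default_delay
```

A call such as `fetch_policy(parser, "my-scraper", url)` then gates each request. To load the rules from a live site instead of a string, RobotFileParser also offers set_url() followed by read().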

When to use a managed scraping service

For personal projects and internal tooling, running your own Playwright scraper is fine. For commercial products where the scraping is core to the value proposition, managed services are usually the better investment:

  • Apify. Orchestration layer, actor-based jobs, runs at scale.
  • Bright Data. Residential proxy network, web unlocker, structured data products.
  • Oxylabs. Similar to Bright Data, strong enterprise support.
  • ScrapingBee. Simpler API, good for teams that want a service without managing the infrastructure.

These services handle residential IP rotation, browser fingerprinting challenges, and CAPTCHA solving. If your target site uses Cloudflare or Akamai Bot Manager, you will spend weeks building and maintaining bypass logic that a managed service has already solved. That time cost is real.

Storing and using scraped data

The scraping is usually the easy part. Think about these before you start:

  • How often do you need to refresh? Run your scrape on a schedule (Airflow, Prefect, even a cron job on a VPS) rather than on demand.
  • Where does the data land? Postgres with a scraped_at timestamp column is a sensible starting point. Add source_url so you can debug which page a row came from.
  • How do you handle change? Structure evolves. Name your selectors clearly, write tests against the structure, and get notified when a scrape yields zero results.
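
To make the storage suggestion concrete, here is a minimal sketch of a table with scraped_at and source_url columns. It uses SQLite from the standard library as a stand-in so it runs without a server; the listings table and its title and price columns are assumptions, and a Postgres version would look much the same with a timestamptz column.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    id INTEGER PRIMARY KEY,
    source_url TEXT NOT NULL,
    title TEXT,
    price TEXT,
    scraped_at TEXT NOT NULL
)
"""


def save_listing(conn: sqlite3.Connection, source_url: str, record: dict) -> None:
    """Insert one scraped record, stamped with the current UTC time."""
    scraped_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO listings (source_url, title, price, scraped_at) "
            "VALUES (?, ?, ?, ?)",
            (source_url, record.get("title"), record.get("price"), scraped_at),
        )
```

Appending a new row per scrape, rather than updating in place, keeps a history you can diff when a price or title changes.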

The most common failure mode is a scraper that silently stops working because the site changed a class name. An empty result is not an error the way a thrown exception is, so you have to build that check explicitly.
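
That check can be as small as a function that turns an empty or incomplete batch into an exception your scheduler will notice. This is a sketch under assumptions: the EmptyScrapeError name, the min_rows threshold, and the required field names are all illustrative.

```python
class EmptyScrapeError(RuntimeError):
    """Raised when a scrape returns too few rows or rows with missing fields."""


def require_results(
    rows: list[dict],
    source_url: str,
    min_rows: int = 1,
    required_fields: tuple[str, ...] = ("title", "price"),
) -> list[dict]:
    """Return rows unchanged, or raise if the batch looks broken."""
    if len(rows) < min_rows:
        raise EmptyScrapeError(
            f"{source_url}: expected at least {min_rows} rows, got {len(rows)}"
        )
    for index, row in enumerate(rows):
        # A field that is missing, None, or empty usually means a selector broke
        missing = [name for name in required_fields if not row.get(name)]
        if missing:
            raise EmptyScrapeError(f"{source_url}: row {index} is missing {missing}")
    return rows
```

Calling it on every batch before the data is written means a renamed class on the target site produces a failed job and an alert instead of a quietly empty table.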


Web scraping is a legitimate tool for gathering data that is otherwise unavailable. Done well (with rate limiting, legal awareness, and a clear data storage strategy) it works reliably for years. Done without those considerations, it is a fragile script that gets blocked and creates liability. The gap between the two is mostly planning, not sophistication.

For related reading on automation tooling at the project level, GitHub Actions in 2026 shows how scheduled scraping jobs fit into a larger CI/CD workflow.

Frequently asked questions

Is web scraping legal in 2026?

Scraping publicly accessible data is generally legal in the US following hiQ v. LinkedIn, in which the 9th Circuit held that scraping public LinkedIn profiles likely did not violate the CFAA. However, it can still violate Terms of Service (a contract claim), data protection laws if the data includes personal information about EU or California residents, and copyright if the scraped content is original. The short answer: usually legal for public data, but check three things before you scrape: robots.txt, the ToS, and whether the data is personal.

Playwright or Puppeteer in 2026?

Playwright for almost everything. It is actively maintained by Microsoft, supports Chromium, Firefox, and WebKit, has a cleaner async API, auto-waits for elements before interacting, and has first-class Python, TypeScript, Java, and .NET bindings. Puppeteer is JavaScript/TypeScript only and has no WebKit support. Playwright was started at Microsoft by engineers who had previously worked on Puppeteer at Google, which is why the two APIs look so similar.

What is the fastest way to get blocked when scraping?

Sending requests too fast from a single IP without rotating headers or respecting crawl delays. Most anti-bot systems fingerprint browser behavior, request cadence, and IP reputation. A headless browser hitting a site at machine speed from a single datacenter IP with no wait time looks nothing like a real user.

When should I use a managed scraping service instead of building my own?

When the target site has aggressive anti-bot protection (Cloudflare, Akamai), when you need residential IP rotation at scale, or when maintaining your own browser fleet is not your core product. Managed services like Apify, Bright Data, and Oxylabs handle the infrastructure; you describe what you want to extract.
