Why use n8n for scraping at all?
Most “n8n web scraping” tutorials are written by vendors trying to sell you their proxy or rendering API. This one isn’t. n8n is a workflow engine, not a scraping framework, and that’s exactly what makes it a great fit for the 80% of scraping jobs that aren’t fighting Cloudflare.
You’d pick n8n over a Python script (BeautifulSoup, Scrapy) when you want the scrape to be one node in a longer pipeline: scrape, transform, write to Airtable, alert a Slack channel, retry on failure. You’d pick it over a managed service (Apify, ScrapingBee, Octoparse) when you want to keep ownership of your data and pay flat hosting instead of per-page fees.
The three tiers of scraping difficulty
Almost every scraping task you’ll meet falls into one of three buckets. Pick your tier first, because the right architecture is different for each.
Tier 1: Static HTML sites
The page renders fully on the server, no JavaScript needed. Government data, Wikipedia, most news sites, many B2B directories, e-commerce category pages on smaller stores.
Recipe: HTTP Request node to fetch the URL, HTML node (formerly HTML Extract) to select elements with CSS selectors, Set node to clean the data, then write to your destination.
CSS selector example for the HTML node:

```json
{
  "title": "h1.product-title",
  "price": "span.price",
  "in_stock": ".availability"
}
```
Add a User-Agent header in the HTTP Request node (something realistic like a recent Chrome string), and respect robots.txt.
Tier 2: Paginated and rate-limited sites
The data is static but spread across many pages, or the site rate-limits you after a few requests. Job boards, large product catalogs, search results.
Recipe: Add a Loop Over Items node (formerly Split In Batches) configured with a small batch size (5-10 items) and a Wait node between batches (1-3 seconds). Use the HTTP Request node’s built-in retry on fail with exponential backoff. If the site uses cursor pagination, store the cursor in the workflow’s static data so you can resume after a crash.
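If you roll your own retry in a Code node instead of using the HTTP Request node's built-in option, exponential backoff with a cap is the usual shape. A sketch; the base delay and cap are arbitrary illustrative values, not n8n defaults:

```javascript
// Backoff delay in ms for a 0-indexed retry attempt:
// doubles each attempt, capped, with optional random jitter
// so parallel workflows don't retry in lockstep.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000, jitterMs = 0) {
  const exp = Math.min(baseMs * 2 ** attempt, capMs);
  return exp + Math.floor(Math.random() * jitterMs);
}

// attempt 0 -> 1000 ms, 1 -> 2000, 2 -> 4000, ... capped at 30000
```

In practice you would `await new Promise(r => setTimeout(r, backoffDelay(attempt, 1000, 30000, 500)))` between failed fetches.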
For sites that block on IP, route the HTTP Request node through a residential proxy provider (Bright Data, Smartproxy, Oxylabs) by setting the proxy URL in the node’s options.
Tier 3: JavaScript-rendered sites
The page is a Single Page App (React, Vue, Next.js) and the data only appears after JS executes. Most modern SaaS dashboards, infinite-scroll feeds, sites with anti-bot protection like Cloudflare or DataDome.
The HTTP Request node won’t work here because it returns the raw HTML shell. You have three options:
- Find the underlying API. Open DevTools, watch the Network tab, and you’ll usually find the JSON endpoint the page is calling. Hit that directly: it’s faster, more stable, and uses fewer resources than rendering the page.
- Use Puppeteer or Playwright via the Code node. n8n’s Code node can run Node.js, so you can spin up a headless Chromium, navigate, and extract. This works for self-hosted instances; n8n Cloud doesn’t allow arbitrary npm packages.
- Use a rendering API. Plug in ScrapingBee, ScrapeGraphAI, Bright Data Web Unlocker, or Apify via the HTTP Request node. They handle the headless browser, proxy rotation, and CAPTCHA solving, and return clean HTML or JSON. Expect to pay $30-200/month depending on volume.
Common scraping use cases
Competitor price monitoring
Scrape your top 5 competitors’ product pages every 6 hours. Compare against your last snapshot in Airtable or Postgres. When a price changes by more than 5%, post to Slack with the old and new value plus a screenshot link.
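The 5% threshold check is worth getting right, since a naive absolute diff alerts on rounding noise and breaks when there is no previous snapshot. A sketch of the comparison you might run in a Code node; the field names and default threshold are assumptions:

```javascript
// True when the new price deviates from the stored snapshot by more
// than thresholdPct percent. Guards against a missing/zero old price.
function priceChanged(oldPrice, newPrice, thresholdPct = 5) {
  if (!oldPrice) return false; // no snapshot yet: nothing to compare
  const changePct = (Math.abs(newPrice - oldPrice) / oldPrice) * 100;
  return changePct > thresholdPct;
}

// priceChanged(100, 104) -> false (4% change, under threshold)
// priceChanged(100, 106) -> true  (6% change, over threshold)
```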
Lead enrichment
For each company in your CRM, scrape its public homepage for the contact email pattern, employee count from About pages, and tech stack from BuiltWith or Wappalyzer’s free endpoints. Push the enriched fields back to HubSpot or Pipedrive.
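Pulling a contact email out of raw homepage HTML is typically a regex pass in a Code node. A minimal sketch; the regex is deliberately loose and will miss obfuscated addresses, which is a known limitation rather than a bug:

```javascript
// Pull the first plausible email address out of a page's HTML.
// Deliberately simple: won't catch obfuscations like "name [at] example.com"
// or addresses rendered as images.
function extractEmail(html) {
  const match = html.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/);
  return match ? match[0] : null;
}

// extractEmail('<p>Reach us at hello@example.com</p>') -> 'hello@example.com'
// extractEmail('<p>No contact info here</p>')          -> null
```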
News and PR monitoring
Scrape industry news sites (or use their RSS feeds where available) on a 30-minute cron. When a story matches keywords from your watchlist, summarize it with OpenAI and email the brief to your team. Cheaper than Meltwater for small teams.
Job board aggregation
Scrape niche job boards your target audience reads. Filter by city and tech stack. Push new postings to a Telegram channel or your own internal jobs feed.
Review and rating tracking
Scrape product reviews on Amazon (where allowed by ToS), Trustpilot, G2, Capterra. Run sentiment analysis with OpenAI. Trigger an alert when negative reviews spike.
n8n scraping vs Python (Scrapy, BeautifulSoup)
| | n8n | Python (Scrapy) |
|---|---|---|
| Setup time | Minutes | Hours |
| Best for volumes | Up to ~100k pages/day | Millions of pages/day |
| Visual debugging | Yes | No |
| Plug into other tools (CRM, Slack) | Native | You write the glue |
| Cost at small scale | $0-20/month | $0 + your time |
| Anti-bot resilience | Via 3rd party APIs | You configure proxies, fingerprints |
Rule of thumb: if the scrape is one step in a business workflow, use n8n. If scraping is the product, use Scrapy.
When n8n is NOT the right tool
- You need to scrape millions of pages per day. n8n’s queue model isn’t optimized for that throughput; use Scrapy on dedicated workers.
- The target site uses aggressive anti-bot (Cloudflare Turnstile, PerimeterX, Akamai Bot Manager) and you’re not willing to pay for a premium unlocker API. n8n won’t magically solve that problem.
- You need browser-level interaction (login flows, multi-step forms with CAPTCHA). Possible with Puppeteer in the Code node, but you’re fighting the platform; use Playwright directly.
- You’re on n8n Cloud and need npm packages like puppeteer-extra-plugin-stealth. Cloud doesn’t allow arbitrary modules.
Legal and ethical guardrails
Scraping public data is generally legal in the US (per the 2022 hiQ Labs v. LinkedIn Ninth Circuit ruling), but the picture is murkier in the EU and varies by country. Regardless of legality, follow these basics:
- Read the site’s robots.txt and Terms of Service. Many explicitly prohibit scraping.
- Don’t scrape personal data without a lawful basis under GDPR.
- Throttle aggressively. A scraper that hammers a small site at 100 req/s is rude and may be a CFAA violation in the US.
- Identify yourself in the User-Agent when possible (e.g. YourCompanyBot/1.0 ([email protected])).
FAQ
Can I scrape JavaScript-rendered sites in n8n Cloud?
Not natively, because Cloud doesn’t allow arbitrary npm packages. The cleanest path is to call a rendering API (ScrapingBee, Apify, Browserless) from the HTTP Request node and have them do the rendering for you.
How do I avoid getting blocked?
Rotate IPs via a residential proxy, set a believable User-Agent, throttle to 1-2 requests per second per domain, randomize wait times, and respect robots.txt. If you’re still blocked, the site has paid anti-bot protection and you’ll need a premium unlocker service.
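The "randomize wait times" advice maps to a small helper you can drop into a Code node between requests. A sketch; the 1-3 second window matches the throttling suggested above:

```javascript
// Random delay between minMs (inclusive) and maxMs (exclusive), so
// consecutive hits to the same domain don't arrive on a machine-perfect beat.
function randomDelayMs(minMs = 1000, maxMs = 3000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// Awaited between requests inside a Code node:
async function politeWait() {
  await new Promise((resolve) => setTimeout(resolve, randomDelayMs()));
}
```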
Is n8n web scraping faster than Apify?
For small jobs, yes: there’s no platform overhead. For large jobs, Apify is faster because it’s purpose-built with autoscaling actors. Apify also handles fingerprinting and proxies for you, which n8n doesn’t.
What’s the best n8n node for parsing HTML?
The built-in HTML node (called HTML Extract in older versions). It accepts CSS selectors and returns clean JSON, no regex needed.
Where to go next
Once you have your data, pipe it through our batch processing guide to handle thousands of records, or read the error handling primer so your scraper survives the inevitable 503.
Ready to automate?
Start your free n8n trial today and put these workflows into production.
Disclosure: links to n8n.io are affiliate links. If you start a paid plan after clicking, n8nfuel earns a commission at no extra cost to you.