Tushar Rayamajhi | AI Engineer & Backend Developer

The Problem with Web Scrapers

Every web scraper has the same fate. You build it, it works great for a while, then one day the site quietly redesigns a page. Maybe the developer renamed a CSS class. Maybe they changed the date format. Maybe they A/B tested a new detail page layout on 30% of listings. Your scraper keeps running, but it fails silently. You find out days later when you notice the database stopped getting new rows.

The traditional answer is to babysit the scraper. Keep checking it. Keep fixing it manually. That doesn't scale if you're scraping many sites or if you have other things to do.

My approach to this problem: use AI only at two specific moments, and let code handle everything in between.

The Core Idea: Minimal AI

Most people think of AI-powered scrapers as AI that runs on every request — reading pages, extracting data, deciding what to save. That's slow, expensive, and overkill.

The better mental model is: AI as a code writer, not a code runner.

Claude writes the scraper once. After that, pure Python handles all the fetching, parsing, and saving. No AI involved in the daily loop. Claude only gets called again if something breaks — and even then, only to update the code, not to run it.

The result: a scraper that costs almost nothing to operate, runs on autopilot, and fixes itself when the site changes.

The Architecture

The diagram below shows all three phases. Watch how data flows: the blue path is a one-time bootstrap, the green path is the normal scrape loop that runs forever, and the red/orange path is the self-healing loop that activates only when something breaks.

Phase 1 — Bootstrap: You and Claude Write the Scraper Together

This is the only time a human is directly involved. You open Claude and have a conversation:

"Here's the site: example.com/properties. I want to scrape listing title, price, location, and date posted from each property's detail page. Here's my database schema: [your CREATE TABLE statement]. Write a scraper."

Claude uses its own web fetch tool to visit the listing page and 2–3 detail pages. It reads the HTML, identifies the structure, figures out how pagination works, and finds where each piece of data lives in the DOM.

Then Claude writes scraper.py. You don't write a single line of this file. Claude owns it.

What scraper.py Does — Two Jobs

The scraper Claude writes has two distinct responsibilities baked in from the start.

Job 1: Scrape and save

python

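from urllib.parse import urljoin

import psycopg2
import requests
from bs4 import BeautifulSoup

# BASE_URL, HEADERS and DB_CONFIG are assumed to be module-level
# settings defined near the top of scraper.py.

def scrape_listing_page(url):
    """Return the detail-page URLs linked from one listing page."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # urljoin copes with both relative and absolute hrefs
    return [
        urljoin(BASE_URL, a["href"])
        for a in soup.select("a.listing-card__link")
    ]

def scrape_detail_page(url):
    """Extract one listing. A missing element or attribute raises on purpose,
    so the failure gets logged instead of saving a half-empty row."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    return {
        "title":    soup.select_one("h1.listing-title").text.strip(),
        "price":    soup.select_one("[data-price]")["data-price"],
        "location": soup.select_one(".listing-location").text.strip(),
        "posted":   soup.select_one("time.posted-date")["datetime"],
        "url":      url,
    }

def save_to_db(record):
    """Upsert one listing, keyed on its URL."""
    conn = psycopg2.connect(**DB_CONFIG)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO listings (title, price, location, posted, url, scraped_at) "
                "VALUES (%s, %s, %s, %s, %s, NOW()) ON CONFLICT (url) DO UPDATE "
                "SET price=EXCLUDED.price, scraped_at=NOW()",
                (record["title"], record["price"], record["location"], record["posted"], record["url"])
            )
        conn.commit()
    finally:
        # Close the connection even if the insert fails
        conn.close()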

Job 2: Log errors to CSV

This is equally important. Every time something unexpected happens, the scraper writes it to errors.csv — not just a print statement, but a structured log that will later trigger Claude:

python

import csv
from datetime import datetime

def log_error(url: str, error_type: str, message: str, fn: str):
    with open("errors.csv", "a", newline="") as f:
        csv.writer(f).writerow([url, error_type, message, fn, datetime.now().isoformat()])

The error type matters because not all errors mean the same thing.
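
To make that concrete, here is a minimal sketch of reading errors.csv back and counting rows per error type and function. It assumes the five-column, header-less layout that log_error writes above; the summarize_errors name is hypothetical and not part of the original scraper.

```python
import csv
from collections import Counter
from pathlib import Path

# Column order assumed to match log_error above (the file has no header row).
COLUMNS = ["url", "error_type", "message", "fn", "timestamp"]


def summarize_errors(path="errors.csv"):
    """Count logged errors per (error_type, fn) pair. Hypothetical helper."""
    file = Path(path)
    if not file.exists():
        return Counter()
    with file.open(newline="") as f:
        # Skip blank lines; map each row onto the assumed column names.
        rows = [dict(zip(COLUMNS, row)) for row in csv.reader(f) if row]
    return Counter((r["error_type"], r["fn"]) for r in rows)
```

A run of identical AttributeError rows from one function points at a single renamed selector, while a mix of types across functions suggests a larger redesign.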

Phase 2 — The Scrape Loop: No AI Involved

Once scraper.py exists, two cron jobs do all the work:

# Scrape every 6 hours
0 */6 * * *   cd /home/deploy/scraper && python scraper.py

# If errors.csv has content, trigger the fix — check every 30 minutes
*/30 * * * *  [ -s /home/deploy/scraper/errors.csv ] && bash /home/deploy/scraper/fix_errors.sh

The first job is the engine. The second job is the watchdog. On a normal day, the watchdog never fires. The scraper runs, data flows into the database, and Claude is completely uninvolved.
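
The article does not show the entry point that the first cron job runs, so the following is only a sketch of what one scrape cycle might look like. The run function and its parameters are assumptions: the four callables stand in for scrape_listing_page, scrape_detail_page, save_to_db and whatever error handler the scraper uses, and they are passed in so the loop can be exercised without a network or a database.

```python
def run(listing_url, list_fn, detail_fn, save_fn, on_error):
    """One scrape cycle: listing page -> detail pages -> database.

    Returns the number of records saved. All four callables are
    placeholders for the scraper's own functions (an assumption of
    this sketch).
    """
    saved = 0
    try:
        detail_urls = list_fn(listing_url)
    except Exception as exc:
        # Without the listing page there is nothing to iterate over.
        on_error(listing_url, exc)
        return saved

    for url in detail_urls:
        try:
            save_fn(detail_fn(url))
            saved += 1
        except Exception as exc:
            # One bad page must not stop the rest of the run.
            on_error(url, exc)
    return saved
```

Catching errors per page matters here: when only some listings are on a new layout, the rest still reach the database during the same cycle.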

When the Site Changes: Two Types of Error

Not every error means the code is broken. The scraper needs to distinguish between problems that will resolve themselves and problems that require a code fix.

Temporary Errors — Skip and Retry

Error	Cause	Action
`HTTP 503`	Site temporarily down	Skip, retry next cron
`HTTP 429`	Rate limited	Skip, retry next cron
Network timeout	Slow connection	Skip, retry next cron
`HTTP 500`	Server-side issue	Skip, retry next cron

These don't go to errors.csv. The scraper logs them to stdout and moves on. They'll resolve on their own.

Structural Errors — Write to CSV, Trigger Claude

These mean the code no longer matches reality:

Error	What it means	Claude's fix
`AttributeError: 'NoneType' has no attribute 'text'`	A CSS selector found nothing — class was renamed	Claude curls the URL, finds the new class name, updates the selector
`KeyError` on a dict field	A data attribute was removed or renamed	Claude finds where the data moved, updates the extractor
`ValueError` on DB insert	Data format changed (e.g. `"$1,200"` → `"1200 USD"`)	Claude updates the parsing/cleaning logic
`HTTP 404` on a previously-working URL pattern	Pagination structure changed	Claude fetches the listing page, finds the new pagination pattern
All selectors return `None`	Site migrated to JavaScript-rendered content	Claude identifies the API call the frontend makes and switches scraper to hit the API directly
DB insert fails with constraint violation	A required field is now missing or null	Claude finds where the field moved in the HTML
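
As an illustration of the `ValueError` row, here is a sketch of cleaning logic that accepts both price formats from the table. The parse_price name is hypothetical, and the function assumes that a comma is a thousands separator, which would not hold for sites using European number formatting.

```python
import re
from decimal import Decimal


def parse_price(raw):
    """Extract a numeric amount from strings like "$1,200" or "1200 USD".

    Raises ValueError when no number is present, so an unknown format
    still surfaces as a structural error. Assumes "," separates thousands.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no numeric price in {raw!r}")
    return Decimal(match.group().replace(",", ""))
```

Raising on unrecognized input, rather than returning a default, keeps the scraper from writing a wrong price into the database.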

The key rule in code:

python

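def handle_error(url, error, http_status=None, fn="scrape_detail_page"):
    # Temporary: server-side or rate-limit responses, do not touch errors.csv
    if http_status in (500, 502, 503, 504, 429):
        print(f"[temp] {url}: HTTP {http_status}, will retry")
        return

    # Temporary: timeouts and dropped connections carry no HTTP status
    if isinstance(error, (requests.Timeout, requests.ConnectionError)):
        print(f"[temp] {url}: {type(error).__name__}, will retry")
        return

    # Structural: write to errors.csv, this needs Claude
    log_error(url, type(error).__name__, str(error), fn)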
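
The snippet above takes an http_status argument but does not show where it comes from. One possible way to derive it, sketched here with a hypothetical status_of helper: the HTTPError that raise_for_status() raises in requests carries the response on its .response attribute, while parsing errors such as AttributeError have no such attribute.

```python
def status_of(error):
    """Return the HTTP status attached to an exception, or None.

    Works by duck typing: any exception with a .response object that
    has a .status_code is treated as an HTTP error.
    """
    response = getattr(error, "response", None)
    return getattr(response, "status_code", None)
```

Inside an except block, the call would then look like handle_error(url, exc, http_status=status_of(exc)).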

Phase 3 — Self-Healing: Claude Fixes the Code

When errors.csv has content, the second cron job fires fix_errors.sh. This shell script is the one thing you write manually, once. It contains a pre-built prompt with all the context Claude needs to fix the scraper without breaking what's already working.

The Shell Script

bash

#!/bin/bash
# fix_errors.sh
# Written once by you. Runs automatically when errors.csv has content.

set -e
cd "$(dirname "$0")"

SCHEMA=$(cat schema.sql)
CODE=$(cat scraper.py)
ERRORS=$(cat errors.csv)

PROMPT="You are maintaining a production Python web scraper.

DATABASE SCHEMA (do not change this):
$SCHEMA

CURRENT SCRAPER CODE (file: scraper.py):
$CODE

ERRORS LOGGED (url, error_type, message, function, timestamp):
$ERRORS

YOUR TASK:
1. Use your web fetch tool to visit each failed URL and analyze the current HTML structure.
2. Identify what changed on the site that caused these errors.
3. Update scraper.py to handle the new structure.
4. Do NOT remove or change any existing working logic — only extend or fix the broken parts.
5. Keep the same error logging format (errors.csv with the same columns).
6. Keep all database insert logic compatible with the schema above.
7. Output only the complete updated Python file, no explanation.

Update scraper.py to fix these errors:"

# Run Claude with the prompt, capture output as new scraper.py
claude -p "$PROMPT" > scraper_new.py

# Only replace if Claude returned a non-empty file
if [ -s scraper_new.py ]; then
    mv scraper_new.py scraper.py
    echo "$(date): scraper.py updated" >> fix_log.txt
    # Clear errors after successful fix
    > errors.csv
else
    echo "$(date): Claude returned empty output, keeping original" >> fix_log.txt
fi
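
The script above only checks that the output is non-empty. A stricter gate could confirm that the output at least parses as Python before it replaces scraper.py. The sketch below is one way to do that; the extract_valid_python name is hypothetical, and stripping a surrounding Markdown fence is an assumption about how model output might arrive, not something the original script handles.

```python
import ast

TICKS = "`" * 3  # a Markdown code fence marker


def extract_valid_python(text):
    """Return text as Python source if it parses, otherwise None."""
    lines = text.strip().splitlines()
    # Drop one surrounding code fence, if present.
    if lines and lines[0].startswith(TICKS):
        lines = lines[1:]
    if lines and lines[-1].strip() == TICKS:
        lines = lines[:-1]
    source = "\n".join(lines).strip()
    if not source:
        return None
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return None
    return source + "\n"
```

A successful parse only shows that the file is syntactically valid, not that the new selectors are right. The next scrape cycle remains the real test, and anything that still fails lands back in errors.csv.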

What Claude Does With This

Claude receives the prompt, then:

1. Reads the error list: AttributeError on https://example.com/property/1842
2. Uses its fetch tool to visit https://example.com/property/1842
3. Reads the current HTML and sees that h1.listing-title is now h1[data-listing-title]
4. Updates the selector in scrape_detail_page
5. Checks whether other selectors are also affected
6. Returns the full updated scraper.py

Critically: Claude can see both the error message (what broke) and the live HTML (why it broke and what it looks like now). The prompt explicitly tells Claude not to remove existing working logic — so pages that still use the old layout keep working while pages on the new layout now work too:

python

def scrape_detail_page(url):
    # ...
    # Try new layout first (data attribute)
    title_el = soup.select_one("h1[data-listing-title]")
    # Fall back to old layout (class-based)
    if not title_el:
        title_el = soup.select_one("h1.listing-title")
    # Still nothing → unknown layout → log for Claude to fix next cycle
    if not title_el:
        raise AttributeError("title element not found — layout may have changed again")
    # ...
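
As layout variants accumulate, the same try-then-fall-back pattern can be factored into a small helper. This is a sketch under assumptions: select_first is a hypothetical name, and soup is any object with a BeautifulSoup-style select_one(selector) method.

```python
def select_first(soup, selectors):
    """Return the first element matched by any selector, in the order given.

    Raises AttributeError when nothing matches, the same error type the
    scraper already treats as structural.
    """
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    raise AttributeError(f"no element matched any of {list(selectors)}")
```

The title lookup above would then read select_first(soup, ["h1[data-listing-title]", "h1.listing-title"]), with the newest layout listed first.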

The Complete File Structure on VPS

/home/deploy/scraper/
├── scraper.py          ← Claude writes and updates this
├── schema.sql          ← your DB schema (never changes)
├── fix_errors.sh       ← you write this once
├── errors.csv          ← auto-written by scraper.py, cleared after fix
└── fix_log.txt         ← audit log of every Claude fix

That's it. No Docker, no complex orchestration. Two cron jobs and five files.

The Self-Healing Loop Continues

After Claude updates scraper.py and errors.csv is cleared, the next cron cycle runs the updated scraper. If it passes without errors — you're done. The site's new layout is now handled.

If a new error type appears (maybe Claude's fix worked for most pages but one edge case still fails), the same loop runs again. Over time, scraper.py accumulates handlers for every layout variant the site has ever used. It gets harder to break with each cycle.

The cost of all this: a handful of Claude API calls per month, only on days when a site actually changes. On normal days the bill is zero for the AI portion — you're just paying for compute to run Python.

Why This Pattern Works

The key insight is that scraper failures are rare and predictable. Sites don't redesign every day. When they do, the failure mode is usually the same: a selector stops matching, or a field changes format. That's a narrow, well-defined problem that Claude is very good at solving with access to the live HTML.

By limiting Claude's involvement to exactly that moment — "here is the broken code, here is what the page looks like now, fix it" — you get the benefit of AI adaptability without the cost and latency of running AI on every single page request.

The scraper runs fast, runs cheap, and fixes itself. You just watch the database fill up.

The Self-Healing Scraper: Using Claude Only When It Breaks
