
The Self-Healing Scraper: Using Claude Only When It Breaks

How I think about building web scrapers that fix themselves — Claude writes the code once, the scraper runs on autopilot, and when a site changes its layout, Claude analyzes the new HTML and updates the code. Minimal AI, maximum automation.

The Problem with Web Scrapers

Every web scraper has the same fate. You build it, it works great for a while, then one day the site quietly redesigns a page. Maybe the developer renamed a CSS class. Maybe they changed the date format. Maybe they A/B tested a new detail page layout on 30% of listings. Your scraper keeps running but fails silently. You find out days later when you notice the database stopped getting new rows.

The traditional answer is to babysit the scraper. Keep checking it. Keep fixing it manually. That doesn't scale if you're scraping many sites or if you have other things to do.

My approach to this problem: use AI only at two specific moments, and let code handle everything in between.

The Core Idea: Minimal AI

Most people think of AI-powered scrapers as AI that runs on every request — reading pages, extracting data, deciding what to save. That's slow, expensive, and overkill.

The better mental model is: AI as a code writer, not a code runner.

Claude writes the scraper once. After that, pure Python handles all the fetching, parsing, and saving. No AI involved in the daily loop. Claude only gets called again if something breaks — and even then, only to update the code, not to run it.

The result: a scraper that costs almost nothing to operate, runs on autopilot, and fixes itself when the site changes.

The Architecture

The diagram below shows all three phases. Watch how data flows: the blue path is a one-time bootstrap, the green path is the normal scrape loop that runs forever, and the red/orange path is the self-healing loop that activates only when something breaks.

Phase 1 — Bootstrap: You and Claude Write the Scraper Together

This is the only time a human is directly involved. You open Claude and have a conversation:

"Here's the site: example.com/properties. I want to scrape listing title, price, location, and date posted from each property's detail page. Here's my database schema: [your CREATE TABLE statement]. Write a scraper."

Claude uses its own web fetch tool to visit the listing page and 2–3 detail pages. It reads the HTML, identifies the structure, figures out how pagination works, and finds where each piece of data lives in the DOM.

Then Claude writes scraper.py. You don't write a single line of this file. Claude owns it.

What scraper.py Does — Two Jobs

The scraper Claude writes has two distinct responsibilities baked in from the start.

Job 1: Scrape and save

python
import psycopg2
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# BASE_URL, HEADERS, and DB_CONFIG are assumed to be defined at module level.

def scrape_listing_page(url):
    """Return the detail-page URLs linked from one listing page."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # urljoin handles both relative and absolute hrefs
    detail_urls = [
        urljoin(BASE_URL, a["href"])
        for a in soup.select("a.listing-card__link")
    ]
    return detail_urls

def scrape_detail_page(url):
    """Extract one listing. A missing element raises, which marks a structural error."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    return {
        "title":    soup.select_one("h1.listing-title").text.strip(),
        "price":    soup.select_one("[data-price]")["data-price"],
        "location": soup.select_one(".listing-location").text.strip(),
        "posted":   soup.select_one("time.posted-date")["datetime"],
        "url":      url,
    }

def save_to_db(record):
    """Upsert one listing, keyed on its URL."""
    conn = psycopg2.connect(**DB_CONFIG)
    try:
        # "with conn" commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO listings (title, price, location, posted, url, scraped_at) "
                "VALUES (%s, %s, %s, %s, %s, NOW()) ON CONFLICT (url) DO UPDATE "
                "SET price=EXCLUDED.price, scraped_at=NOW()",
                (record["title"], record["price"], record["location"], record["posted"], record["url"])
            )
    finally:
        conn.close()  # always release the connection, even if the insert fails

Job 2: Log errors to CSV

This is equally important. Every time something unexpected happens, the scraper writes it to errors.csv — not just a print statement, but a structured log that will later trigger Claude:

python
import csv
from datetime import datetime

def log_error(url: str, error_type: str, message: str, fn: str):
    with open("errors.csv", "a", newline="") as f:
        csv.writer(f).writerow([url, error_type, message, fn, datetime.now().isoformat()])

The error type matters because not all errors mean the same thing.
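Because the error type drives what happens next, it can help to summarize errors.csv before acting on it. The following is a minimal sketch, assuming the five-column layout written by log_error with no header row. The name summarize_errors is hypothetical and not part of the original scraper.

```python
import csv
from collections import Counter

# Column order assumed to match log_error: url, error_type, message, fn, timestamp
COLUMNS = ["url", "error_type", "message", "fn", "timestamp"]

def summarize_errors(path="errors.csv"):
    """Count logged errors per (function, error_type) pair."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != len(COLUMNS):
                continue  # skip malformed or partially written rows
            record = dict(zip(COLUMNS, row))
            counts[(record["fn"], record["error_type"])] += 1
    return counts
```

A summary like this makes it easy to see whether one selector broke everywhere or several things changed at once.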

Phase 2 — The Scrape Loop: No AI Involved

Once scraper.py exists, two cron jobs do all the work:

# Scrape every 6 hours
0 */6 * * *   cd /home/deploy/scraper && python scraper.py

# If errors.csv has content, trigger the fix — check every 30 minutes
*/30 * * * *  [ -s /home/deploy/scraper/errors.csv ] && bash /home/deploy/scraper/fix_errors.sh

The first job is the engine. The second job is the watchdog. On a normal day, the watchdog never fires. The scraper runs, data flows into the database, and Claude is completely uninvolved.
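The engine's entry point might look roughly like the sketch below. The names run and status_of are hypothetical. The injected callables stand in for this article's scrape_listing_page, scrape_detail_page, save_to_db, and handle_error. Treating an empty listing page as an error is my own assumption, added so that a page with zero links does not fail silently.

```python
def status_of(exc):
    """HTTP status attached to an exception, if any (requests.HTTPError carries .response)."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None)

def run(listing_url, list_fn, detail_fn, save_fn, on_error):
    """One scrape pass: listing page -> detail pages -> database.

    Returns the number of records saved. Every failure is handed to on_error,
    which decides whether it is temporary or structural.
    """
    try:
        detail_urls = list_fn(listing_url)
    except Exception as exc:
        on_error(listing_url, exc, status_of(exc))
        return 0

    if not detail_urls:
        # Assumption: a listing page with no links means the selector broke
        on_error(listing_url, LookupError("no detail links found on listing page"), None)
        return 0

    saved = 0
    for url in detail_urls:
        try:
            save_fn(detail_fn(url))
            saved += 1
        except Exception as exc:
            # One bad page should not stop the rest of the run
            on_error(url, exc, status_of(exc))
    return saved
```

With the functions from this article, the call would be something like run("https://example.com/properties", scrape_listing_page, scrape_detail_page, save_to_db, handle_error).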

When the Site Changes: Two Types of Error

Not every error means the code is broken. The scraper needs to distinguish between problems that will resolve themselves and problems that require a code fix.

Temporary Errors — Skip and Retry

Error | Cause | Action
HTTP 503 | Site temporarily down | Skip, retry next cron
HTTP 429 | Rate limited | Skip, retry next cron
Network timeout | Slow connection | Skip, retry next cron
HTTP 500 | Server-side issue | Skip, retry next cron

These don't go to errors.csv. The scraper logs them to stdout and moves on. They'll resolve on their own.

Structural Errors — Write to CSV, Trigger Claude

These mean the code no longer matches reality:

Error | What it means | Claude's fix
AttributeError: 'NoneType' object has no attribute 'text' | A CSS selector found nothing because a class was renamed | Claude fetches the URL, finds the new class name, updates the selector
KeyError on a dict field | A data attribute was removed or renamed | Claude finds where the data moved, updates the extractor
ValueError on DB insert | Data format changed (e.g. "$1,200" → "1200 USD") | Claude updates the parsing/cleaning logic
HTTP 404 on a previously-working URL pattern | Pagination structure changed | Claude fetches the listing page, finds the new pagination pattern
All selectors return None | Site migrated to JavaScript-rendered content | Claude identifies the API call the frontend makes and switches the scraper to hit the API directly
DB insert fails with constraint violation | A required field is now missing or null | Claude finds where the field moved in the HTML

The key rule in code:

python
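import requests

TEMPORARY_STATUSES = (429, 500, 502, 503, 504)

def handle_error(url, error, http_status=None, fn="scrape_detail_page"):
    # Temporary: rate limits, server errors, and network timeouts. Do not touch errors.csv.
    is_network = isinstance(error, (requests.Timeout, requests.ConnectionError))
    if http_status in TEMPORARY_STATUSES or is_network:
        print(f"[temp] {url}: {http_status or type(error).__name__}, will retry")
        return

    # Structural: write to errors.csv, this needs Claude
    log_error(url, type(error).__name__, str(error), fn)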
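The ValueError row in the table above is easier to catch if the cleaning step rejects anything it does not recognize instead of guessing. Here is a minimal sketch, assuming prices arrive in one of the two formats the table mentions. The helper name parse_price and the exact patterns are illustrative, not taken from the original scraper.

```python
import re
from decimal import Decimal

# Formats assumed for illustration: "$1,200", "$1,200.50", "1200 USD"
_PRICE_PATTERNS = [
    re.compile(r"^\$\s*(\d[\d,]*(?:\.\d+)?)$"),
    re.compile(r"^(\d[\d,]*(?:\.\d+)?)\s*USD$"),
]

def parse_price(raw):
    """Return the price as a Decimal, or raise ValueError for unknown formats."""
    text = raw.strip()
    for pattern in _PRICE_PATTERNS:
        match = pattern.match(text)
        if match:
            return Decimal(match.group(1).replace(",", ""))
    raise ValueError(f"unrecognized price format: {raw!r}")
```

Because an unknown format raises ValueError rather than storing a bad value, the change shows up in errors.csv as a structural error on the first run after the site changes.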

Phase 3 — Self-Healing: Claude Fixes the Code

When errors.csv has content, the second cron job fires fix_errors.sh. This shell script is the only code you write by hand, and you write it once. It contains a pre-built prompt with all the context Claude needs to fix the scraper without breaking what's already working.

The Shell Script

bash
#!/bin/bash
# fix_errors.sh
# Written once by you. Runs automatically when errors.csv has content.

set -e
cd "$(dirname "$0")"

SCHEMA=$(cat schema.sql)
CODE=$(cat scraper.py)
ERRORS=$(cat errors.csv)

PROMPT="You are maintaining a production Python web scraper.

DATABASE SCHEMA (do not change this):
$SCHEMA

CURRENT SCRAPER CODE (file: scraper.py):
$CODE

ERRORS LOGGED (url, error_type, message, function, timestamp):
$ERRORS

YOUR TASK:
1. Use your web fetch tool to visit each failed URL and analyze the current HTML structure.
2. Identify what changed on the site that caused these errors.
3. Update scraper.py to handle the new structure.
4. Do NOT remove or change any existing working logic — only extend or fix the broken parts.
5. Keep the same error logging format (errors.csv with the same columns).
6. Keep all database insert logic compatible with the schema above.
7. Output only the complete updated Python file, no explanation.

Update scraper.py to fix these errors:"

# Run Claude with the prompt, capture output as new scraper.py
claude -p "$PROMPT" > scraper_new.py

# Only replace if Claude returned a non-empty file that compiles as Python
if [ -s scraper_new.py ] && python -m py_compile scraper_new.py; then
    mv scraper_new.py scraper.py
    echo "$(date): scraper.py updated" >> fix_log.txt
    # Clear errors after successful fix
    > errors.csv
else
    echo "$(date): Claude returned empty or invalid output, keeping original" >> fix_log.txt
fi
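A reply can be non-empty and still be unusable, for example if it contains prose or drops a function the cron job depends on. One possible extra gate is sketched below in Python. The helper validate_candidate is hypothetical, and the required function names are an assumption that simply mirrors the functions shown earlier in this article.

```python
import ast

# Functions this article's scraper.py defines; adjust to match your own file
REQUIRED_FUNCTIONS = {"scrape_listing_page", "scrape_detail_page", "save_to_db", "log_error"}

def validate_candidate(source):
    """Return a list of problems found in a candidate scraper.py (empty list = OK)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"not valid Python: {exc.msg} (line {exc.lineno})"]

    # Only top-level function definitions count
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    missing = sorted(REQUIRED_FUNCTIONS - defined)
    return [f"missing function: {name}" for name in missing]
```

The shell script could run a check like this on scraper_new.py before the mv and keep the original file whenever the returned list is not empty.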

What Claude Does With This

Claude receives the prompt, then:

  1. Reads the error list — AttributeError on https://example.com/property/1842
  2. Uses its fetch tool to visit https://example.com/property/1842
  3. Reads the current HTML and sees that .listing-title is now h1[data-listing-title]
  4. Updates the selector in scrape_detail_page
  5. Checks whether other selectors are also affected
  6. Returns the full updated scraper.py

Critically: Claude can see both the error message (what broke) and the live HTML (why it broke and what it looks like now). The prompt explicitly tells Claude not to remove existing working logic — so pages that still use the old layout keep working while pages on the new layout now work too:

python
def scrape_detail_page(url):
    # ...
    # Try new layout first (data attribute)
    title_el = soup.select_one("h1[data-listing-title]")
    # Fall back to old layout (class-based)
    if not title_el:
        title_el = soup.select_one("h1.listing-title")
    # Still nothing → unknown layout → log for Claude to fix next cycle
    if not title_el:
        raise AttributeError("title element not found — layout may have changed again")
    # ...
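As layout variants accumulate, the try-then-fall-back pattern above can be folded into a small helper. This is a sketch, assuming select_one is any callable that behaves like BeautifulSoup's soup.select_one and returns an element or None. The name select_first is hypothetical.

```python
def select_first(select_one, selectors):
    """Return the first element matched by any selector, trying them in order.

    Raises AttributeError when nothing matches, so the failure is logged as a
    structural error like the other selector failures.
    """
    for selector in selectors:
        element = select_one(selector)
        if element is not None:
            return element
    raise AttributeError(f"no selector matched: {', '.join(selectors)}")
```

Inside scrape_detail_page this would read title_el = select_first(soup.select_one, ["h1[data-listing-title]", "h1.listing-title"]), with the newest layout listed first.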

The Complete File Structure on VPS

/home/deploy/scraper/
├── scraper.py          ← Claude writes and updates this
├── schema.sql          ← your DB schema (never changes)
├── fix_errors.sh       ← you write this once
├── errors.csv          ← auto-written by scraper.py, cleared after fix
└── fix_log.txt         ← audit log of every Claude fix

That's it. No Docker, no complex orchestration. Two cron jobs and five files.

The Self-Healing Loop Continues

After Claude updates scraper.py and errors.csv is cleared, the next cron cycle runs the updated scraper. If it passes without errors, you're done. The site's new layout is now handled.

If a new error type appears (maybe Claude's fix worked for most pages but one edge case still fails), the same loop runs again. Over time, scraper.py accumulates handlers for every layout variant the site has ever used. It gets harder to break with each cycle.

The cost of all this: a handful of Claude API calls per month, only on days when a site actually changes. On normal days the bill is zero for the AI portion — you're just paying for compute to run Python.

Why This Pattern Works

The key insight is that scraper failures are rare and predictable. Sites don't redesign every day. When they do, the failure mode is almost always the same: something the code assumes about the page (a selector, an attribute, a data format) stops matching. That's a narrow, well-defined problem that Claude is very good at solving with access to the live HTML.

By limiting Claude's involvement to exactly that moment — "here is the broken code, here is what the page looks like now, fix it" — you get the benefit of AI adaptability without the cost and latency of running AI on every single page request.

The scraper runs fast, runs cheap, and fixes itself. You just watch the database fill up.

Tushar Rayamajhi | AI Engineer & Backend Developer