The Problem with Web Scrapers
Every web scraper has the same fate. You build it, it works great for a while, then one day the site quietly redesigns a page. Maybe the developer renamed a CSS class. Maybe they changed the date format. Maybe they A/B tested a new detail page layout on 30% of listings. Your scraper keeps running, but it fails silently. You find out days later when you notice the database stopped getting new rows.
The traditional answer is to babysit the scraper. Keep checking it. Keep fixing it manually. That doesn't scale if you're scraping many sites or if you have other things to do.
My approach to this problem: use AI only at two specific moments, and let code handle everything in between.
The Core Idea: Minimal AI
Most people think of AI-powered scrapers as AI that runs on every request — reading pages, extracting data, deciding what to save. That's slow, expensive, and overkill.
The better mental model is: AI as a code writer, not a code runner.
Claude writes the scraper once. After that, pure Python handles all the fetching, parsing, and saving. No AI involved in the daily loop. Claude only gets called again if something breaks — and even then, only to update the code, not to run it.
The result: a scraper that costs almost nothing to operate, runs on autopilot, and fixes itself when the site changes.
The Architecture
The diagram below shows all three phases. Watch how data flows: the blue path is a one-time bootstrap, the green path is the normal scrape loop that runs forever, and the red/orange path is the self-healing loop that activates only when something breaks.
Phase 1 — Bootstrap: You and Claude Write the Scraper Together
This is the only time a human is directly involved. You open Claude and have a conversation:
"Here's the site: example.com/properties. I want to scrape listing title, price, location, and date posted from each property's detail page. Here's my database schema: [your CREATE TABLE statement]. Write a scraper."
Claude uses its own web fetch tool to visit the listing page and 2–3 detail pages. It reads the HTML, identifies the structure, figures out how pagination works, and finds where each piece of data lives in the DOM.
Then Claude writes scraper.py. You don't write a single line of this file. Claude owns it.
What scraper.py Does — Two Jobs
The scraper Claude writes has two distinct responsibilities baked in from the start.
Job 1: Scrape and save
import psycopg2
import requests
from bs4 import BeautifulSoup

# HEADERS, BASE_URL, and DB_CONFIG are module-level constants assumed to be
# defined elsewhere in scraper.py.

def scrape_listing_page(url):
    """Return the detail-page URLs linked from one listing page."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    detail_urls = [
        BASE_URL + a["href"]
        for a in soup.select("a.listing-card__link")
    ]
    return detail_urls

def scrape_detail_page(url):
    """Extract one listing record; a missing element raises so it can be logged."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "title": soup.select_one("h1.listing-title").text.strip(),
        "price": soup.select_one("[data-price]")["data-price"],
        "location": soup.select_one(".listing-location").text.strip(),
        "posted": soup.select_one("time.posted-date")["datetime"],
        "url": url,
    }

def save_to_db(record):
    """Insert one record, or update price and scraped_at if the url already exists."""
    conn = psycopg2.connect(**DB_CONFIG)
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO listings (title, price, location, posted, url, scraped_at) "
        "VALUES (%s, %s, %s, %s, %s, NOW()) ON CONFLICT (url) DO UPDATE "
        "SET price=EXCLUDED.price, scraped_at=NOW()",
        (record["title"], record["price"], record["location"], record["posted"], record["url"])
    )
    conn.commit()
    cur.close()
    conn.close()

Job 2: Log errors to CSV
This is equally important. Every time something unexpected happens, the scraper writes it to errors.csv — not just a print statement, but a structured log that will later trigger Claude:
import csv
from datetime import datetime

def log_error(url: str, error_type: str, message: str, fn: str):
    with open("errors.csv", "a", newline="") as f:
        csv.writer(f).writerow([url, error_type, message, fn, datetime.now().isoformat()])

The error type matters because not all errors mean the same thing.
Phase 2 — The Scrape Loop: No AI Involved
Once scraper.py exists, two cron jobs do all the work:
# Scrape every 6 hours
0 */6 * * * cd /home/deploy/scraper && python scraper.py
# If errors.csv has content, trigger the fix — check every 30 minutes
*/30 * * * * [ -s /home/deploy/scraper/errors.csv ] && bash /home/deploy/scraper/fix_errors.sh
The first job is the engine. The second job is the watchdog. On a normal day, the watchdog never fires. The scraper runs, data flows into the database, and Claude is completely uninvolved.
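To make the "no AI in the loop" point concrete, here is a minimal sketch of what the entry point of scraper.py might look like. The name run_once and the choice to pass the scraper's functions in as parameters are assumptions for illustration, not part of the original file; injecting them simply keeps the sketch runnable without network or database access.

```python
from typing import Callable, Iterable


def run_once(
    listing_url: str,
    list_fn: Callable[[str], Iterable[str]],
    detail_fn: Callable[[str], dict],
    save_fn: Callable[[dict], None],
    on_error: Callable[[str, Exception], None],
) -> dict:
    """Run one scrape cycle: list detail URLs, scrape each, save each.

    Hypothetical driver. A failure on one page is reported through
    on_error and does not stop the remaining pages from being processed.
    """
    stats = {"saved": 0, "failed": 0}
    try:
        detail_urls = list(list_fn(listing_url))
    except Exception as exc:  # the listing page itself failed
        on_error(listing_url, exc)
        stats["failed"] += 1
        return stats

    for url in detail_urls:
        try:
            save_fn(detail_fn(url))
            stats["saved"] += 1
        except Exception as exc:
            on_error(url, exc)
            stats["failed"] += 1
    return stats
```

In the real file the call would presumably pass scrape_listing_page, scrape_detail_page, save_to_db, and the scraper's error handler. Catching errors per page is the design choice that matters here: one broken detail page produces one log entry instead of aborting the whole run.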
When the Site Changes: Two Types of Error
Not every error means the code is broken. The scraper needs to distinguish between problems that will resolve themselves and problems that require a code fix.
Temporary Errors — Skip and Retry
| Error | Cause | Action |
|---|---|---|
| HTTP 503 | Site temporarily down | Skip, retry next cron |
| HTTP 429 | Rate limited | Skip, retry next cron |
| Network timeout | Slow connection | Skip, retry next cron |
| HTTP 500 | Server-side issue | Skip, retry next cron |
These don't go to errors.csv. The scraper logs them to stdout and moves on. They'll resolve on their own.
Structural Errors — Write to CSV, Trigger Claude
These mean the code no longer matches reality:
| Error | What it means | Claude's fix |
|---|---|---|
| AttributeError: 'NoneType' object has no attribute 'text' | A CSS selector matched nothing, typically because a class was renamed | Claude curls the URL, finds the new class name, updates the selector |
| KeyError on a dict field | A data attribute was removed or renamed | Claude finds where the data moved, updates the extractor |
| ValueError on DB insert | Data format changed (e.g. "$1,200" → "1200 USD") | Claude updates the parsing/cleaning logic |
| HTTP 404 on a previously-working URL pattern | Pagination structure changed | Claude fetches the listing page, finds the new pagination pattern |
| All selectors return None | Site migrated to JavaScript-rendered content | Claude identifies the API call the frontend makes and switches scraper to hit the API directly |
| DB insert fails with constraint violation | A required field is now missing or null | Claude finds where the field moved in the HTML |
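The data-format row is the easiest one to picture in code. The sketch below shows one way a cleaned-up price parser might look after such a fix. parse_price is a hypothetical helper, not something from the original scraper, and the two input formats come from the table's own example.

```python
import re
from decimal import Decimal


def parse_price(raw: str) -> Decimal:
    """Extract a numeric price from strings like "$1,200" or "1200 USD".

    Hypothetical helper. Assumes "," is a thousands separator and "." is
    the decimal point. Raises ValueError when no number is present, so an
    unknown format still surfaces as a structural error.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no numeric price in {raw!r}")
    return Decimal(match.group().replace(",", ""))
```

Raising instead of returning a default keeps the failure visible: a format the parser has never seen ends up in errors.csv rather than as a wrong number in the database.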
The key rule in code:
def handle_error(url, error, http_status=None):
    # Temporary: do not touch errors.csv
    if http_status in (500, 502, 503, 504, 429):
        print(f"[temp] {url}: HTTP {http_status}, will retry")
        return
    # Network timeouts carry no HTTP status but are temporary too
    if isinstance(error, requests.Timeout):
        print(f"[temp] {url}: {type(error).__name__}, will retry")
        return
    # Structural: write to errors.csv, this needs Claude
    log_error(url, type(error).__name__, str(error), "scrape_detail_page")

Phase 3 — Self-Healing: Claude Fixes the Code
When errors.csv has content, the second cron job fires fix_errors.sh. This shell file is the one thing you write manually, once. It contains a pre-built prompt with all the context Claude needs to fix the scraper without breaking what's already working.
The Shell Script
#!/bin/bash
# fix_errors.sh
# Written once by you. Runs automatically when errors.csv has content.
set -e
cd "$(dirname "$0")"
SCHEMA=$(cat schema.sql)
CODE=$(cat scraper.py)
ERRORS=$(cat errors.csv)
PROMPT="You are maintaining a production Python web scraper.
DATABASE SCHEMA (do not change this):
$SCHEMA
CURRENT SCRAPER CODE (file: scraper.py):
$CODE
ERRORS LOGGED (url, error_type, message, function, timestamp):
$ERRORS
YOUR TASK:
1. Use your web fetch tool to visit each failed URL and analyze the current HTML structure.
2. Identify what changed on the site that caused these errors.
3. Update scraper.py to handle the new structure.
4. Do NOT remove or change any existing working logic — only extend or fix the broken parts.
5. Keep the same error logging format (errors.csv with the same columns).
6. Keep all database insert logic compatible with the schema above.
7. Output only the complete updated Python file, no explanation.
Update scraper.py to fix these errors:"
# Run Claude with the prompt, capture output as new scraper.py
claude -p "$PROMPT" > scraper_new.py
# Only replace if Claude returned a non-empty file
if [ -s scraper_new.py ]; then
    mv scraper_new.py scraper.py
    echo "$(date): scraper.py updated" >> fix_log.txt
    # Clear errors after successful fix
    > errors.csv
else
    echo "$(date): Claude returned empty output, keeping original" >> fix_log.txt
fi

What Claude Does With This
Claude receives the prompt, then:
- Reads the error list: AttributeError on https://example.com/property/1842
- Uses its fetch tool to visit https://example.com/property/1842
- Reads the current HTML and sees that .listing-title is now h1[data-listing-title]
- Updates the selector in scrape_detail_page
- Checks whether other selectors are also affected
- Returns the full updated scraper.py
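That last step deserves a guard, because fix_errors.sh swaps in whatever comes back as long as it is non-empty. Below is a minimal sketch of a stricter check, assuming a hypothetical helper named is_safe_replacement that could be run against scraper_new.py before the mv. The required function names are the ones from the scraper shown earlier.

```python
import ast

# Function names the scraper in this article is expected to define.
REQUIRED_FUNCTIONS = frozenset(
    {"scrape_listing_page", "scrape_detail_page", "save_to_db", "log_error"}
)


def is_safe_replacement(source: str, required=REQUIRED_FUNCTIONS) -> bool:
    """Return True if source parses as Python and defines every required function.

    This only catches gross failures: prose or Markdown fences in the
    output, a truncated file, or a dropped function. It does not prove
    that the new selectors are correct.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    defined = {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    return set(required) <= defined
```

A check like this is cheap and never calls Claude, so it fits the pattern: if it returns False, the script can keep the original scraper.py and leave errors.csv in place for the next cycle.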
Critically: Claude can see both the error message (what broke) and the live HTML (why it broke and what it looks like now). The prompt explicitly tells Claude not to remove existing working logic — so pages that still use the old layout keep working while pages on the new layout now work too:
def scrape_detail_page(url):
    # ...
    # Try new layout first (data attribute)
    title_el = soup.select_one("h1[data-listing-title]")
    # Fall back to old layout (class-based)
    if not title_el:
        title_el = soup.select_one("h1.listing-title")
    # Still nothing → unknown layout → log for Claude to fix next cycle
    if not title_el:
        raise AttributeError("title element not found — layout may have changed again")
    # ...

The Complete File Structure on VPS
/home/deploy/scraper/
├── scraper.py ← Claude writes and updates this
├── schema.sql ← your DB schema (never changes)
├── fix_errors.sh ← you write this once
├── errors.csv ← auto-written by scraper.py, cleared after fix
└── fix_log.txt ← audit log of every Claude fix
That's it. No Docker, no complex orchestration. Two cron jobs and five files.
The Self-Healing Loop Continues
After Claude updates scraper.py and errors.csv is cleared, the next cron cycle runs the updated scraper. If it passes without errors — you're done. The site's new layout is now handled.
If a new error type appears (maybe Claude's fix worked for most pages but one edge case still fails), the same loop runs again. Over time, scraper.py accumulates handlers for every layout variant the site has ever used. It gets harder to break with each cycle.
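As layout variants accumulate, an ordered list of selectors per field may stay more readable than a growing chain of if blocks. The sketch below assumes a hypothetical helper named first_match. It works with any object that exposes a select_one method, as BeautifulSoup objects do, and the two selectors are the title variants from the earlier example.

```python
from typing import Any, Iterable, Optional


def first_match(page: Any, selectors: Iterable[str]) -> Optional[Any]:
    """Return the first element matched by any selector, tried in order.

    Returns None when nothing matches, so the caller can raise and let
    the error reach errors.csv.
    """
    for selector in selectors:
        element = page.select_one(selector)
        if element is not None:
            return element
    return None


# Selector variants for the title field, most recent layout first.
TITLE_SELECTORS = ["h1[data-listing-title]", "h1.listing-title"]
```

With this shape, a future fix for the title field amounts to prepending one string to TITLE_SELECTORS, which makes each change small and easy to review in fix_log.txt or a diff.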
The cost of all this: a handful of Claude API calls per month, only on days when a site actually changes. On normal days the bill is zero for the AI portion — you're just paying for compute to run Python.
Why This Pattern Works
The key insight is that scraper failures are rare and predictable. Sites don't redesign every day. When they do, the failure mode is usually the same: the code no longer matches the page, most often because a selector stops matching. That's a narrow, well-defined problem that Claude is very good at solving with access to the live HTML.
By limiting Claude's involvement to exactly that moment — "here is the broken code, here is what the page looks like now, fix it" — you get the benefit of AI adaptability without the cost and latency of running AI on every single page request.
The scraper runs fast, runs cheap, and fixes itself. You just watch the database fill up.