Jeff Hooton · 24 min read

Building Trawl: A Polite-By-Default Local Replacement for Firecrawl

Every AI pipeline I build eventually needs to scrape the web. Pull pricing pages, extract product data, crawl a docs site for RAG, walk a catalog for a client. Managed services like Firecrawl solve this well if you’re happy paying per page and running all your traffic through somebody else’s API. I wasn’t, so I built trawl: a single Go binary that does tiered scraping, BFS crawling, schema extraction, markdown conversion, screenshots, content caching, and the dozen other things you actually need, all locally, with no API key, no recurring cost, and no runtime dependency beyond what Go ships.

Trawl is polite by default. robots.txt respected, declared User-Agent, per-host rate limiting, opt-out not opt-in. It produces clean markdown ready for an LLM pipeline, composes with jq and grep the way a CLI should, and ships with a four-tier opt-in evasion model (plus an explicit refusal list) for the minority of sites that fight back. This post is the story of how it came together, including the Lightpanda feature I didn’t build because the data said not to, the two bugs my unit tests missed until I ran a real bench, and the day I shipped a TLS forgery feature that unblocked zero targets in the benchmark and led the release notes with exactly that.

Polite by default is a product decision, not a setting

The first call I made was that politeness is not a knob. It’s a load-bearing architectural decision about what trawl is.

When you run trawl scrape https://example.com with no flags, this is what goes out on the wire:
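Roughly this (the exact header block is my illustration, not a capture from trawl; the declared User-Agent string is the one quoted verbatim later in this post):

```http
GET /robots.txt HTTP/1.1
Host: example.com
User-Agent: trawl/0.2.0 (+https://github.com/jeffdhooton/trawl)

GET / HTTP/1.1
Host: example.com
User-Agent: trawl/0.2.0 (+https://github.com/jeffdhooton/trawl)
```

robots.txt is fetched and honored first, the page request is rate-limited per host, and the User-Agent names the tool and links to its source.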

Nothing in that default posture lies about what trawl is or where the traffic is coming from. A default trawl scrape run today in 2026 will send byte-for-byte the same request as the same command in 2028. The identity is stable, auditable, and declared.

The alternative model is the common one: aggressive defaults with a --polite knob. That shape is wrong because it inverts the responsibility. The default is the mode most users end up in. If the default is aggressive, the library’s net effect on the open web is aggressive, and the small minority who care enough to flip the knob is the only thing between you and a world of scraper traffic that pretends not to be scraper traffic. If the default is polite, the net effect is polite and the operator explicitly owns the choice when they deviate.

This ends up mattering structurally when you build the evasion tiers later, because every evasion feature in trawl is opt-in behind an explicit flag AND every record in the output JSONL carries a metadata.evasion field recording which tier was active for that fetch. If I ever want to know whether a given trawl run was polite, I can grep the output. The audit trail is structural, not aspirational.

Tiered routing with per-host learning

The thing that separates trawl from “a wrapper around chromedp” is that most of the web doesn’t need Chromium. The majority of pages still serve real HTML on first byte, and paying 2-5 seconds of browser spin-up on every request is wasteful.

There’s an Engine interface with two implementations: an HTTP tier (net/http plus goquery) and a Chromium tier (chromedp). The router runs HTTP first. If the response comes back as real content, it’s done. If it comes back as a SPA shell (empty body, React mount point, that kind of thing), the router escalates to Chromium. If it comes back as a 404 or DNS failure, it short-circuits, because Chromium can’t help there either.

The twist is per-host tier learning. After the first successful fetch, trawl remembers which tier worked for each host. The cache lives at $TRAWL_HOME/tier-cache and is cross-job by design. The second time you run against vercel.com, trawl skips straight to the right tier based on what worked last time. This sounds small but it compounds on large crawls. You don’t pay the “try HTTP, escalate to Chromium” tax on every URL, only on the first one per host.

The router composes with a persistent frontier backed by BadgerDB. Jobs survive SIGINT, machine restarts, and mid-crawl crashes. trawl resume <job-id> picks up exactly where the last run left off. I built this early because long crawls that throw away three hours of work when your laptop goes to sleep are deeply unpleasant, and the alternative is to split crawls into tiny chunks and manage the seams yourself, which is worse.

The feature I didn’t build

The original SPEC called for a three-tier router: HTTP → Lightpanda → Chromium. Lightpanda is a lightweight headless browser, faster than Chromium but heavier than raw HTTP. The theory was that some pages need JS rendering but don’t need full Chromium, and a middle tier would catch those more cheaply.

I almost built it. Then I wrote docs/BENCHMARK.md first and committed to a decision rule: Lightpanda only ships if the Chromium escalation rate on a real sample goes above 15%. Below 15%, the savings don’t justify 4-6 hours of subprocess lifecycle management plus a whole new class of bugs.

Three measurements on a 7000-company test dataset:

Run  Mode                            Reachable  Chromium escalation
A    Direct pricing URLs (n=500)     31.6%      11.4%
B    Homepage + link follow (n=500)  18.2%      4.4%
C    Hybrid discovery (n=999)        35.5%      14.08%

All three below 15%. Run C crept up to 14%, close enough that I sat on the decision for a day and re-read the data. The 95% confidence interval on Run A straddled the threshold at roughly [7%, 16%], so technically the measurements didn’t exclude the possibility that Lightpanda would help. But the deeper signal was that rot dominates the failure distribution. Most unreachable pages weren’t unreachable because of JS rendering, they were unreachable because the seed data was from 2019 and the URLs had since rotted into 404s. No middle tier fixes rotted data.

The Lightpanda entry in docs/DECISIONS.md records all three measurements, the confidence interval math, the caveat about sample bias (pages with discoverable pricing paths are a biased subset), and a reversal rule: if the Chromium escalation rate ever climbs above 15% on a large sample, the question reopens. The Engine interface is already there, so adding a middle tier later is a contained change.

Most teams would have built it. I didn’t, because the data didn’t earn it. The discipline of writing the measurement rule before running the measurement is the load-bearing part. If I’d run the benchmark first and decided what to do, confirmation bias would have won every time. “Well, 14% is almost 15%, and the interval includes numbers above 15%, and I already wrote half the code, and…” Writing the rule first meant the data actually got to decide, and the 4-6 hours of build cost went into something else.

Eight features in one afternoon

Once the Engine interface, router, frontier, and JSONL sink were solid, I sat down on 2026-04-10 and shipped eight Firecrawl-parity features in a single session.

Eight features in one session, not because the features were trivial but because the architecture didn’t need re-shaping to fit them. The Engine interface absorbed the extraction hooks, the frontier absorbed the BFS termination protocol, the JSONL sink absorbed the CSV flattener, the politeness gate absorbed the per-host overrides. Nothing fought back. The commit log for that day has two commits totaling around 3000 lines, both passing the full test suite on the first run, with a DECISIONS.md entry pinning down the subtle semantic calls (cache key is URL-plus-tier not URL, retries are HTTP-only not Chromium, CSV column set locks at first write) before they ossified.

The first real production workload

The first real consumer was my business partner migrating a project off Firecrawl. The corpus was 1857 entries from the Stanford Encyclopedia of Philosophy, each one a long-form philosophy article with nested structure: title, publication info, table of contents, related entries, author, copyright. I wrote a schema file (docs/examples/sep-article.yaml) that describes the extraction declaratively, then ran:

trawl batch sep-urls.txt \
  --schema docs/examples/sep-article.yaml \
  --format markdown --readability \
  --politeness slow-stanford.yaml \
  -o sep.jsonl

Result: 100% reach, 0 failures. 1857 entries, all extracted cleanly, markdown bodies plus nested structured data. A slow politeness profile for *.stanford.edu meant I wasn’t hammering a cooperative public host. The same corpus had real recurring cost on Firecrawl. On trawl it cost nothing except a few minutes of wall-clock time and the CPU cycles of my laptop.

I’d been building against synthetic benchmarks and small tests before that, and the SEP run was the first real workload. sep-article.yaml became the reference schema I measure future schema-extraction decisions against. If a proposed change to the schema syntax would break the SEP extraction, it has to justify itself against the 1857-entry production corpus on the other side.

The evasion question

About a week in, a screenshot of somebody suggesting “add random delays between scrapes so you look more human” showed up in my feed. I’d already built per-host rate limiting with jitter, so that specific suggestion was a no-op, but the underlying question was real: should trawl have stealth features at all?

I’d been dodging it because the answer changes what trawl is. There’s a clean line between “polite scraper that respects robots.txt and declares its identity” and “bypass tool that forges browsers and defeats anti-abuse controls.” A lot of scraping projects drift across that line one feature at a time, and by the time you notice, your default users are running the aggressive mode and you’re shipping CAPTCHA solver integrations.

The answer I landed on, documented in docs/EVASION.md before a single line of stealth code shipped, was a four-tier opt-in model.

Every tier has a §5.x decision rule. Every tier records its activation in the output record’s metadata.evasion field. And EVASION.md §6 is a refusal list: things trawl won’t implement regardless of consumer demand, each with a “we said no because Y” paragraph so future-me doesn’t have to re-litigate. CAPTCHA solver integration, credential-based auth bypass, DoS-rate throughput, distributed evasion via residential proxy pools, session theft from real browsers. Each refusal has a structural reason, not a tactical one.

The EVASION.md doc shipped before the evasion code, which meant the design commitment was already paid when the build pressure landed. What I couldn’t predict was the two classes of bug that would show up during the build, neither of which was caught by the existing unit tests, both of which the bench caught the first time the new code faced a real adversary.

The bench caught bugs the unit tests missed

I built Tier 1 and Tier 2 together in one session. Rotating UA picker, Sec-Fetch-* block, jittered pacing, in-memory cookie jar, Chromium stealth script patching navigator.webdriver and friends. Around 400 LOC including tests. Unit tests green. Smoke tests against cooperative targets (example.org, httpbin.org, a couple of my own sites) green. Everything looked fine.

Then I built a hostile-site bench. 13 targets in three buckets: five Tier-1-expected (sites that gate on UA but not much else — Indeed, Glassdoor, Reddit, Crunchbase, SEC EDGAR), five Tier-2-expected (sites that need JS rendering — Vercel, Linear, Notion, Stripe, G2), three Tier-3+ wall (sites that fight hard — Zillow, Walmart, Best Buy). bench/evasion/run.sh ran each target through three modes (baseline, tier1, tier1+2) and printed a fixed-width comparison table. Three seconds between targets, raw JSONL preserved per run.

First bench run flagged two bugs that cost me hours.

Bug 1: Accept-Encoding broke transparent decompression.

Symptom: Vercel, Linear, Notion, and Stripe bodies all came back 10-15x smaller than they should be. Vercel returned about 78KB when the cooperative-mode scrape on the same URL returned about 920KB. Same code, same target, same everything except the header block.

It took a while to find because the HTTP response looked fine. Status 200, content-type: text/html; charset=utf-8, body present. But when I piped the body field through jq -r .body | file -, it reported gzip compressed data. The body I was writing to the output record was raw gzip bytes.

Root cause: the browser-mimicry header set was explicitly setting Accept-Encoding: gzip, deflate, br, zstd, which is what real Chrome sends. Go’s net/http has a subtle footgun in this area. The package docs say the Transport transparently decompresses gzip responses, but there’s a condition hiding in the next sentence: it only does this if the caller has NOT set the Accept-Encoding header. When I set it myself, Go’s client assumed I was going to handle decompression myself. It received valid gzip, handed me the raw compressed bytes, and moved on.

The unit test TestHTTPFetchBrowserLikeHeaders asserted that all the expected headers were present on the request. Nothing asserted “and the response body is decompressed HTML rather than raw gzip bytes,” because that assertion was implicit in every cooperative-mode test. They’d all been happy letting the stdlib decompress. The bug was invisible to every test I had, visible in the bench output within seconds (the “body 78KB” row next to the “body 920KB” row is hard to miss once you know to look).

Fix: delete the explicit Accept-Encoding line. Go’s net/http will send Accept-Encoding: gzip automatically behind the scenes and handle the decompression. Real Chrome sends the richer set, and I’m willing to accept that mismatch as the price of not reimplementing Brotli and Zstandard decompression in the HTTP engine. I left a comment in internal/engine/http.go explaining why the line is deliberately absent, because the absence looks like a bug to anyone reading the code cold:

// What we DON'T set:
//   Accept-Encoding. Go's net/http transparently sends `gzip` and
//   decompresses it for us IF the operator hasn't set the header
//   manually. Setting it ourselves opts out of transparent
//   decompression and leaves the body as raw compressed bytes,
//   which broke body extraction the first time we tried. Real
//   Chrome sends `gzip, deflate, br, zstd`; we accept that mismatch
//   as the price of not reimplementing decompression ourselves.
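The footgun is easy to reproduce against a local test server. This standalone sketch (mine, not trawl's code) shows the same body arriving as raw gzip bytes or decompressed HTML, with the only difference being whether the header was set manually:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// gzipServer always gzips its body and says so via Content-Encoding.
func gzipServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		io.WriteString(gz, "<html>real content</html>")
		gz.Close()
	}))
}

// fetch issues one GET; setHeader reproduces the browser-mimicry
// mistake of setting Accept-Encoding manually, which opts out of
// net/http's transparent gzip decompression.
func fetch(url string, setHeader bool) []byte {
	req, _ := http.NewRequest("GET", url, nil)
	if setHeader {
		req.Header.Set("Accept-Encoding", "gzip, deflate, br, zstd")
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return body
}

func main() {
	srv := gzipServer()
	defer srv.Close()

	raw := fetch(srv.URL, true)    // raw gzip bytes: 0x1f 0x8b magic
	clean := fetch(srv.URL, false) // stdlib decompressed it for us

	fmt.Printf("explicit header: % x ...\n", raw[:2])
	fmt.Printf("header unset:    %s\n", clean)
}
```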

Bug 2: Chromium kept emitting trawl/0.2.0 under --browser-like.

After fixing the gzip bug, I re-ran the bench. SEC EDGAR returned 200 in the HTTP tier with browser-like headers, but 403 in the Chromium tier with stealth on. Same target, same --browser-like flag. The harder mode was failing where the easier mode was succeeding. Best Buy showed the same pattern. Something in the Chromium path was wrong in a way I hadn’t predicted.

Root cause: the Tier 1 work had only touched the HTTP engine. The Chromium engine was still being constructed with UserAgent: "trawl/0.2.0" from its default config, so chromedp was driving a real Chrome browser context with a scraper-shaped User-Agent override. The request headers looked like this:

User-Agent: trawl/0.2.0 (+https://github.com/jeffdhooton/trawl)
Sec-CH-UA: "Chromium";v="132", "Google Chrome";v="132", "Not=A?Brand";v="99"

A browser-shaped Sec-CH-UA block with a scraper-shaped User-Agent. SEC EDGAR’s UA gate caught it instantly. “Your User-Agent says trawl but your client hints say Chrome 132” is the easiest possible mismatch to detect, and I was serving it on a silver platter.

Fix: in the shared cmd/trawl/evasion.go helper, when opts.browserLike && chromiumCfg.UserAgent == "", run the same UA picker the HTTP engine uses, pick one Chrome UA, and pin Chromium to it for the life of the run. Chromium can’t rotate per-host the way HTTP can because the browser allocator is one process per engine instance, and changing UA mid-session would itself be a tell. One consistent UA per run is the right model.

After both fixes, the same bench run flipped 11 of 12 actionable targets from blocked to reachable. Both bugs passed every unit test I had. The unit tests said the headers were present and the stealth patches ran. The bench said the response body was raw gzip and the User-Agent was still scraper-branded. I now assume any new evasion feature is broken until the bench says otherwise.

G2, DataDome, and the refusal that held

While I was staring at the bench results, one row kept looking suspicious. G2’s data-integration category page came back as status 200, 2527 bytes, under Tier 1+2. That’s suspiciously small for what should have been a long category page with dozens of vendor cards. I re-fetched with --format html and inspected: the body was a DataDome CAPTCHA challenge stub. Status 200, no real content, just the challenge HTML with JavaScript calling home to DataDome’s edge.

I could have added a CAPTCHA solver integration and tried to get past it. docs/EVASION.md §6.1 already answered that question:

No built-in integration with 2Captcha, CapSolver, Anti-Captcha, or similar paid solver APIs. If a site is serving CAPTCHAs at you, it has already decided it doesn’t want your traffic. Respect that, or escalate to a human-driven flow. Don’t automate past the “you lost” state: doing so turns trawl from a scraper into a bypass tool, and those are reputationally and legally different categories.

G2 moved from the “Tier 2 expected” bucket in the bench to the “Tier 3+ wall” bucket with a note that the target is DataDome-gated and belongs in the §6.1 refusal pile, not the “Tier 3 might help” pile. Future bench runs still include G2 as a control (“does the refusal still hold, or did we accidentally start bypassing?”) but the expected state is “blocked.”

The refusal is structural. Nobody proposes CAPTCHA solver integration as a casual feature ask when the refusal is already documented with reasoning, and the documented refusal is what keeps the bench honest.

Tier 3, the h1 ALPN trap, and shipping the data that says you might not need it

Tier 3 is TLS ClientHello forgery via uTLS. The idea is to make trawl’s TLS handshake look byte-identical to Chrome’s so that JA3/JA4 fingerprinting at the CDN layer doesn’t catch it. Cloudflare, Akamai, and DataDome use these fingerprints as one of several signals to decide whether to serve real content or a block page.

The §5.3 decision rule for Tier 3 was the strictest of any tier. It required a consumer report with a packet capture showing JA3/JA4 blocking AND a demonstrable Tier 1+2 failure on the same target. I waived it and shipped Tier 3 anyway. The argument for waiving was “the next hostile target should hit a tool that’s already ready, not discover the gap mid-incident.” The argument against was the maintenance commitment: every TLS preset is a moving target that decays over time as browsers update their cipher lists, and a stale preset is worse than no preset because it becomes a distinctive fingerprint of “trawl pretending to be Chrome from two versions ago.” I attached a quarterly maintenance commitment (verify against tls.peet.ws, bump the uTLS pin if HelloChrome_Auto has drifted) and recorded the waiver in DECISIONS.md with the framing: speculative ship is a one-time waiver, not a precedent. Any future Tier 3 follow-up (HTTP/2 SETTINGS forging, more presets, header-order manipulation) has to clear the original bar AND show a concrete target that current Tier 3 fails on.

The build was about two hours. One new file, internal/engine/tls_utls.go, providing a DialTLSContext that uses uTLS’s HelloChrome_Auto preset. The strategy was to keep stdlib http.Transport intact for everything else (connection pooling, redirects, retries, cookie jar) and override only the TLS handshake. Narrowest possible change.

The h1 ALPN trap.

Smoke test against tls.peet.ws worked on the first run. The site is a TLS fingerprint mirror that echoes your JA3 and JA4 back. It reported t13d1516h2_8daaf6152771_d8a2da3f94cd. The h2 in the middle means ALPN negotiated HTTP/2, and the cipher-and-extension hash matched real Chrome’s exactly.

Smoke test against example.org returned garbage bytes:

\x00\x00\x12\x04\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00d\x00\x04...

Not a body. Not an error. Just bytes. I stared at them for a minute before I recognized the shape. \x00\x00\x12 is the length field of a 9-byte HTTP/2 frame header, claiming 0x12 (18) bytes of payload, and \x04 is the type byte for a SETTINGS frame. The payload looked like SETTINGS_MAX_CONCURRENT_STREAMS=100, SETTINGS_INITIAL_WINDOW_SIZE=?, which is the server’s opening SETTINGS frame on an h2 connection. The server was speaking HTTP/2 at me, and stdlib was trying to parse it as HTTP/1.1.

Root cause took some digging. The http.Transport’s automatic HTTP/2 upgrade path requires the connection returned by DialTLSContext to be a literal *tls.Conn. uTLS’s UConn implements most of the shapes stdlib checks, but the type assertion that gates auto-h2 is a specific type match, not an interface check. When I handed back a *utls.UConn, stdlib’s auto-h2 path silently fell through and the client started speaking HTTP/1.1 over a connection where ALPN had negotiated h2. The server sent an h2 SETTINGS frame (the correct first frame on an h2 connection), and stdlib tried to parse it as an HTTP/1.1 response.

First fix attempt: set Config.NextProtos = []string{"http/1.1"} to force ALPN to negotiate h1. That did nothing. Turns out uTLS parrots bake the ALPN extension into their HelloID specs, so NextProtos from the Config is ignored when you use a preset like HelloChrome_Auto. Chrome’s ALPN list is baked into the preset because that’s the whole point of parroting a real browser.

Real fix: grab Chrome’s spec via utls.UTLSIdToSpec(helloID), walk the Extensions slice to find the *utls.ALPNExtension, rewrite its AlpnProtocols field to []string{"http/1.1"}, and apply via HelloCustom instead of the original HelloID. That forces h1 on the wire while keeping cipher list, extensions, signature algorithms, supported groups, and GREASE values identical to Chrome.

The cost is that the forged JA4 now reads t13d1516h1_8daaf6152771_d8a2da3f94cd instead of real Chrome’s h2_ ending. The only dimension on which trawl’s ClientHello deviates from real Chrome is the ALPN protocol marker. Everything else matches. Lifting the deviation requires routing h2 through golang.org/x/net/http2.Transport with a custom DialTLS, which is a separate PR I haven’t done yet.

This whole class of footgun is documented basically nowhere outside the uTLS issue tracker. If you’re trying uTLS for the first time and getting raw h2 frames where your HTTP response should have been, the fix is this paragraph. I would have saved myself some time if anyone had written it down first, so now there’s at least one place.
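For the next person, here is the shape of that fix as code, using github.com/refraction-networking/utls. Treat it as a sketch: forceH1Spec and dialTLS are my names, this is not trawl's internal/engine/tls_utls.go, and it only illustrates the API calls the paragraphs above describe.

```go
package tlssketch

import (
	"net"

	utls "github.com/refraction-networking/utls"
)

// forceH1Spec expands a parrot preset into its full ClientHelloSpec,
// finds the baked-in ALPN extension, and rewrites its protocol list to
// http/1.1 only. Setting Config.NextProtos does NOT work here, because
// presets carry their own ALPN list.
func forceH1Spec(id utls.ClientHelloID) (utls.ClientHelloSpec, error) {
	spec, err := utls.UTLSIdToSpec(id)
	if err != nil {
		return spec, err
	}
	for _, ext := range spec.Extensions {
		if alpn, ok := ext.(*utls.ALPNExtension); ok {
			alpn.AlpnProtocols = []string{"http/1.1"}
		}
	}
	return spec, nil
}

// dialTLS is the DialTLSContext body in miniature: plain TCP dial, then
// a uTLS handshake with the rewritten spec applied via HelloCustom.
// Everything else (pooling, redirects, cookies) stays on the stdlib
// http.Transport.
func dialTLS(network, addr, serverName string) (net.Conn, error) {
	raw, err := net.Dial(network, addr)
	if err != nil {
		return nil, err
	}
	spec, err := forceH1Spec(utls.HelloChrome_Auto)
	if err != nil {
		return nil, err
	}
	uconn := utls.UClient(raw, &utls.Config{ServerName: serverName}, utls.HelloCustom)
	if err := uconn.ApplyPreset(&spec); err != nil {
		return nil, err
	}
	if err := uconn.Handshake(); err != nil {
		return nil, err
	}
	// ALPN now negotiates http/1.1, so stdlib's HTTP/1.1 parser and the
	// wire protocol agree, at the cost of a JA4 ending in h1 rather
	// than Chrome's h2.
	return uconn, nil
}
```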

The 0 of 13 finding.

After the fix, smoke tests all passed. I was about to cut the release when I asked myself the question I should have asked an hour earlier: “Have I actually run this against something I expected to need Tier 3?” The honest answer was no. tls.peet.ws is a fingerprint mirror, not a blocker. example.org is cooperative. The Tier 1+2 bench had already found 0 of 12 actionable targets needing more than Tier 1+2, but that bench had never been re-run with --tls-match chrome added.

So I added a fourth mode to bench/evasion/run.sh: --browser-like --tls-match chrome --tiers http. Forced --tiers http because Chromium has its own real Chrome TLS stack and would mask the signal. I wanted to know whether the ClientHello forgery alone unlocked anything the HTTP tier couldn’t already reach with just browser-like headers. Re-ran the 13-target corpus. Five minutes of wall clock.

Result: 0 of 13 targets unlocked. None.

The four targets where Tier 1+2 was still blocked (Glassdoor, Crunchbase, Walmart, plus G2 which sits in the refusal bucket) came back as 403 or 404 under Tier 3, with body sizes within 100 bytes of their Tier 1 responses. The TLS handshake completed successfully. The blocks are at the HTTP and application layer — UA gating, JS fingerprinting, CAPTCHAs — none of which TLS forgery touches. The feature worked exactly as designed. It just didn’t unblock anything.

I had three options. Ship with marketing-toned release notes (“new TLS forgery tier, Chrome preset, uTLS-powered!”). Ship with honest release notes. Or hold the release and go hunt for known JA3/JA4-blocking targets to justify the build. I picked option two. The v0.3.0 release notes lead with:

Empirical honesty: 0 of 13 targets in our bench needed this.

We added --tls-match chrome to the evasion tier ladder, but the same benchmark that drove Tier 1+2 development showed zero additional unblocks from TLS forgery alone. The blocks that survive Tier 1+2 are at the HTTP and JS layers, which TLS forgery doesn’t touch. Ship it if you need it. The data says you probably don’t.

The install instructions come after that section. That’s not how features get sold, and it’s the only framing that’s honest after running the bench. The ML and security tooling industry almost never publishes release notes like this because every release has to be “10x faster” or “bypasses [thing].” I don’t have a marketing problem, I have an identity problem, and shipping a feature while publishing the data that says you might not need it is good for the identity.

The feature is there when the next hostile target shows up with a JA4 block. The DECISIONS.md entry records the “0 of 13” result as an empirical floor: any future Tier 3 follow-up has to clear that floor AND show a concrete consumer need. The maintenance commitment is recorded too. If the quarterly verification ever finds that HelloChrome_Auto has drifted from real Chrome, the next release notes will say that too.

The stack

The whole thing is one binary you drop on your PATH. curl | sh to install, no Docker, no API keys, no managed service, no runtime dependencies. The SEP run that had real recurring cost on Firecrawl cost nothing on trawl except a few minutes of wall-clock and some laptop CPU.

The decision-discipline stack is at least as load-bearing as the code. Each doc has a specific job. ROADMAP tracks direction. DECISIONS captures one-off architectural calls with the data that drove them. BENCHMARK commits to measurable triggers before the measurements happen. EVASION locks in the shape of a scary feature area before consumer pressure lands. Together they let me ship eight features in one day without drift and skip a ninth the same week because the data said to.

github.com/jeffdhooton/trawl