Pipeline

archiveinator processes each URL through a configurable pipeline — a sequence of steps that run in order to block ads, load the page, detect and bypass paywalls, clean the DOM, and produce a self-contained HTML archive.

How the Pipeline Works

Steps run sequentially in the order defined in your config
Each step receives an ArchiveContext object and passes it to the next step
Some steps run inside the browser (marked below) — they operate on the live page before it’s serialized
The full pipeline is customizable — enable or disable steps in config.yaml or the Web UI Settings page
Bypass steps stop as soon as the page becomes accessible
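
The rules above can be illustrated with a minimal sketch. The `ArchiveContext` fields, the `Step` class, and `run_pipeline` are hypothetical stand-ins chosen for illustration, not archiveinator's actual internals; the sketch only assumes that bypass steps are skipped once the context is no longer marked as paywalled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ArchiveContext:
    """Hypothetical stand-in for the real ArchiveContext; fields are assumed."""
    url: str
    html: str = ""
    paywalled: bool = False
    log: List[str] = field(default_factory=list)


@dataclass
class Step:
    """Hypothetical step wrapper: a name, a callable, and two flags."""
    name: str
    run: Callable[[ArchiveContext], ArchiveContext]
    is_bypass: bool = False
    enabled: bool = True


def run_pipeline(steps: List[Step], ctx: ArchiveContext) -> ArchiveContext:
    """Run enabled steps in order, handing the context from one step to the next."""
    for step in steps:
        if not step.enabled:
            continue
        if step.is_bypass and not ctx.paywalled:
            continue  # page is already accessible, so remaining bypasses are skipped
        ctx = step.run(ctx)
        ctx.log.append(step.name)
    return ctx
```

In this sketch, a bypass step signals success by clearing `ctx.paywalled`, which causes every later bypass step to be skipped while cleanup and inlining steps still run.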

Pipeline Steps (in order)

1. `network_ad_blocking`

Runs before browser launch. Intercepts network requests and blocks ads, trackers, and known malicious domains using EasyList and EasyPrivacy rule sets. Requests to blocked domains are never made, saving bandwidth and preventing tracking.

Default: enabled
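
As a rough illustration of domain-level blocking, the sketch below handles only the simplest EasyList rule form (`||domain^`) and ignores every other rule type (options, exceptions, cosmetic filters). The function names are hypothetical. A matcher like this could plausibly back a Playwright `page.route` handler that calls `route.abort()` for blocked URLs, but how archiveinator actually wires it up is not stated here.

```python
from urllib.parse import urlsplit


def parse_domain_rules(lines):
    """Collect domains from plain `||domain^` rules; all other rule types are ignored."""
    domains = set()
    for line in lines:
        line = line.strip()
        if line.startswith("||") and line.endswith("^"):
            domains.add(line[2:-1].lower())
    return domains


def is_blocked(url, blocked_domains):
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check the host and each parent domain: a.b.example.com, b.example.com, example.com, ...
    return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))
```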

2. `page_load`

Launches a headless Chromium browser via Playwright and loads the target URL. Waits for network idle before proceeding. All subsequent in-browser steps operate on the page loaded here.

Default: enabled (required)

3. `paywall_detection`

Runs inside the browser. Detects whether the loaded page is behind a paywall using three methods:

HTTP status: 401, 402, 403, or 429
DOM selectors: known paywall elements (Piano/TinyPass modals, .paywall, .content-gate, and 30+ others)
Word count: fewer than 150 words suggests a teaser stub rather than a full article

If a paywall is detected, subsequent bypass strategies are triggered.

Default: enabled
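
The three checks can be combined as in the following sketch. `PageSnapshot` and `detect_paywall` are hypothetical names, and the order in which the checks are evaluated is an assumption; only the status codes and the 150-word threshold come from the description above.

```python
import re
from dataclasses import dataclass
from typing import Optional

PAYWALL_STATUSES = {401, 402, 403, 429}
MIN_WORDS = 150


@dataclass
class PageSnapshot:
    """Hypothetical summary of the loaded page; field names are assumed."""
    status: int
    selector_hits: int   # how many known paywall selectors matched in the DOM
    visible_text: str


def detect_paywall(page: PageSnapshot) -> Optional[str]:
    """Return the reason a paywall is suspected, or None if the page looks accessible."""
    if page.status in PAYWALL_STATUSES:
        return f"http_status:{page.status}"
    if page.selector_hits > 0:
        return "dom_selector"
    if len(re.findall(r"\w+", page.visible_text)) < MIN_WORDS:
        return "low_word_count"
    return None
```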

4. `js_overlay_removal`

Runs inside the browser. Removes paywall modal elements from the live page DOM and restores body scroll before the page is serialized. No reload required — fires while the browser is still open.

Default: enabled
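
A minimal sketch of this kind of in-page cleanup, assuming a Playwright sync `Page` object and a shortened selector list. The embedded script and the `remove_overlays` helper are illustrative, not the project's actual code; `page.evaluate(expression, arg)` is the standard Playwright call for running a function in the page.

```python
OVERLAY_SELECTORS = [".paywall", ".content-gate"]  # small sample, not the full selector list

# JavaScript run inside the page: remove matching elements, then re-enable scrolling.
REMOVE_OVERLAYS_JS = """
(selectors) => {
  let removed = 0;
  for (const sel of selectors) {
    document.querySelectorAll(sel).forEach((el) => { el.remove(); removed += 1; });
  }
  // Paywall scripts commonly lock scrolling through inline overflow styles.
  document.documentElement.style.overflow = "auto";
  document.body.style.overflow = "auto";
  return removed;
}
"""


def remove_overlays(page, selectors=OVERLAY_SELECTORS):
    """Run the removal script in the live page and return how many elements were removed."""
    return page.evaluate(REMOVE_OVERLAYS_JS, list(selectors))
```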

5. `js_disabled`

Retries the page load with JavaScript disabled. Some client-side paywalls rely entirely on JavaScript to hide content — loading the page without JS reveals the full article.

Default: enabled

6. `stealth_browser`

Retries with playwright-stealth anti-fingerprinting patches applied. Patches JS properties that automation detection tools probe (navigator.webdriver, navigator.plugins, canvas fingerprint, etc.). Effective against Cloudflare “Just a moment” challenges and basic DataDome checks.

Default: enabled

7. `patchright_load`

Retries using Patchright, a fork of Playwright that patches the Chromium binary itself to remove Chrome DevTools Protocol (CDP) socket detection signatures. Unlike stealth_browser (which patches at the JS layer), Patchright removes CDP fingerprints at the binary level, which makes automation much harder for PerimeterX and modern DataDome challenges to detect.

Triggered by: bot challenge, PerimeterX, or DataDome detection.

Default: enabled

8. `flaresolverr`

Contacts a running FlareSolverr instance to obtain a cf_clearance cookie that bypasses Cloudflare IUAM (I’m Under Attack Mode) and Turnstile challenges. The cookie is injected into the browser context and the page is reloaded.

FlareSolverr is a Docker sidecar — it must be running separately. This step is a complete no-op if no URL is configured.

Enable by setting in config.yaml:

flaresolverr_url: "http://localhost:8191/v1"

Or set the FLARESOLVERR_URL environment variable.

Triggered by: Cloudflare detection.

Default: enabled (no-op unless flaresolverr_url is set)
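
The exchange with FlareSolverr can be sketched as below. The request body follows FlareSolverr's `request.get` command, and the response is assumed to carry `status` and `solution.cookies` as in FlareSolverr's documented output; the helper names are hypothetical. The returned dictionaries are shaped so they could be passed to Playwright's `context.add_cookies`.

```python
import json


def build_flaresolverr_request(target_url, max_timeout_ms=60000):
    """JSON body for a FlareSolverr `request.get` command (POSTed to flaresolverr_url)."""
    return json.dumps({"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms})


def clearance_cookies(response_body):
    """Extract cf_clearance cookies from a FlareSolverr response, shaped for Playwright."""
    data = json.loads(response_body)
    if data.get("status") != "ok":
        return []
    cookies = data.get("solution", {}).get("cookies", [])
    return [
        {"name": c["name"], "value": c["value"], "domain": c["domain"], "path": c.get("path", "/")}
        for c in cookies
        if c.get("name") == "cf_clearance"
    ]
```

Cloudflare generally ties cf_clearance to the user agent that solved the challenge, so a real implementation would likely also reuse the `userAgent` value from the response.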

9. `camoufox_load`

Retries using Camoufox — a patched Firefox binary that applies anti-fingerprinting at the binary level across canvas, WebGL, fonts, TLS ClientHello, and HTTP/2 SETTINGS frames. Firefox has a fundamentally different engine fingerprint from Chromium; sites tuned against Chromium automation often let Firefox through.

Setting humanize=True adds human-like mouse movement and interaction timing, which helps against behavioral analysis.

Triggered by: any paywall still present after the Chromium-based strategies above.

Default: enabled

10. `ua_cycling`

If the page is still paywalled, retries the page load with the next enabled user agent from your config. Requires user_agents.cycle: true. Successful agent/domain pairs are cached so future runs on the same domain start with the known-good UA.

Default: enabled
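
The caching behavior might look like the following sketch. `UACycler` is a hypothetical helper that keeps its cache in memory, keyed by hostname; how archiveinator actually stores agent/domain pairs is not specified here.

```python
from typing import Dict, List
from urllib.parse import urlsplit


class UACycler:
    """Hypothetical helper: try the cached agent first, then the rest in config order."""

    def __init__(self, agents: List[str]):
        self.agents = agents
        self.known_good: Dict[str, str] = {}  # hostname -> agent that worked last time

    def candidates(self, url: str) -> List[str]:
        """Agents to try for this URL, with the known-good one (if any) first."""
        domain = urlsplit(url).hostname or ""
        cached = self.known_good.get(domain)
        rest = [ua for ua in self.agents if ua != cached]
        return ([cached] if cached else []) + rest

    def record_success(self, url: str, agent: str) -> None:
        """Remember which agent got through for this URL's domain."""
        self.known_good[urlsplit(url).hostname or ""] = agent
```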

11. `header_tricks`

Retries with Googlebot user agent, Referer: https://www.google.com/, and X-Forwarded-For: 66.249.66.1. Many publishers allow Googlebot through paywalls to stay indexed.

Default: enabled

12. `google_news`

Retries with Googlebot UA and Referer: https://news.google.com/, simulating a Google News click-through. Works on publishers that whitelist Google News traffic.

Default: enabled
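
The header sets used by header_tricks and google_news can be expressed as browser context options, as in this sketch. The `context_options` helper is hypothetical; the keys `user_agent` and `extra_http_headers` match the keyword arguments of Playwright's `browser.new_context`, and the Googlebot string is the commonly published desktop Googlebot user agent, assumed here rather than taken from archiveinator.

```python
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def context_options(user_agent, referer, forwarded_for=None):
    """Build keyword arguments for a browser context that spoofs the given headers."""
    headers = {"Referer": referer}
    if forwarded_for:
        headers["X-Forwarded-For"] = forwarded_for
    return {"user_agent": user_agent, "extra_http_headers": headers}


# header_tricks: Googlebot UA, Google referer, and a Googlebot-range forwarded IP
HEADER_TRICKS = context_options(GOOGLEBOT_UA, "https://www.google.com/", "66.249.66.1")

# google_news: Googlebot UA with a Google News referer
GOOGLE_NEWS = context_options(GOOGLEBOT_UA, "https://news.google.com/")
```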

13. `dom_ad_cleanup`

Runs inside the browser. Removes residual ad elements from the DOM — Google Ads, DFP slots, Taboola widgets, Outbrain containers, tracking pixels, and other advertising DOM cruft.

Default: enabled

14. `image_dedup`

Runs inside the browser. Collapses <picture> elements and srcset attributes to a single image URL ≤ 1200px wide. This prevents responsive image duplication in the final archive.

Default: enabled
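
Choosing a single candidate from a srcset attribute could work as in this sketch. It handles only width (`w`) descriptors and assumes URLs contain no commas; the function name and the fallback to the narrowest image when every candidate exceeds the limit are illustrative choices, not documented behavior.

```python
def pick_srcset_candidate(srcset, max_width=1200):
    """Pick the widest `w`-descriptor candidate that does not exceed max_width.

    Falls back to the narrowest candidate if all are wider, and returns None
    when the attribute has no width descriptors (for example `1x, 2x`).
    """
    candidates = []
    for entry in srcset.split(","):  # naive split: assumes no commas inside URLs
        parts = entry.split()
        if len(parts) == 2 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            candidates.append((int(parts[1][:-1]), parts[0]))
    if not candidates:
        return None
    fitting = [c for c in candidates if c[0] <= max_width]
    return max(fitting)[1] if fitting else min(candidates)[1]
```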

15. `content_extraction`

Last-resort fallback if the page is still paywalled after all bypass strategies. Uses trafilatura to extract the article body from whatever HTML was retrieved. The archive is saved as a clean, readable document containing the article text.

Default: enabled

16. `archive_fallback`

If the page is still inaccessible, queries the Wayback Machine and then archive.today for an archived copy of the URL. The most recent snapshot is used if found.

Default: enabled
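
For the Wayback Machine half of this step, a lookup might be sketched as follows. It assumes the public availability endpoint at archive.org/wayback/available and its `archived_snapshots.closest` response shape; the helper names are hypothetical, and the archive.today lookup is not covered.

```python
import json
from urllib.parse import urlencode


def wayback_lookup_url(target_url):
    """Build the availability-API URL for a target page."""
    return "https://archive.org/wayback/available?" + urlencode({"url": target_url})


def closest_snapshot(response_body):
    """Return the snapshot URL from an availability response, or None if there is none."""
    closest = json.loads(response_body).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```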

17. `asset_inlining`

If included, this step must be last. Uses monolith to inline all external assets (CSS, images, fonts, JS), producing a single self-contained HTML file that can be viewed offline with no external dependencies.

Default: enabled

Bypass Strategy Summary

Strategy	Targets	Trigger
`js_disabled`	Client-side JS paywalls	paywall detected
`stealth_browser`	Cloudflare, basic DataDome	bot challenge
`patchright_load`	PerimeterX, DataDome (CDP detection)	bot challenge
`flaresolverr`	Cloudflare IUAM/Turnstile	Cloudflare detection
`camoufox_load`	Any Chromium-aware bot protection	any paywall remaining after Chromium-based strategies
`ua_cycling`	UA-based rate limiting	any paywall
`header_tricks`	Google-whitelisted publishers	any paywall
`google_news`	Google News whitelisted publishers	any paywall
`content_extraction`	All (text extraction fallback)	any paywall
`archive_fallback`	All (archived copy fallback)	any paywall

Customizing the Pipeline

Via Config File

pipeline:
  - step: network_ad_blocking
    enabled: true
  - step: page_load
    enabled: true
  - step: paywall_detection
    enabled: false   # disable paywall detection
  - step: asset_inlining
    enabled: true
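
Once parsed (for example with a YAML loader), a pipeline list like the one above can be checked before use. The sketch below is an assumption-laden illustration: `enabled_steps` and `validate_order` are hypothetical helpers, and treating a missing `enabled` key as true is a guess rather than documented behavior. The two rules checked come from the step descriptions: page_load is required, and asset_inlining must be last if included.

```python
def enabled_steps(pipeline):
    """Names of enabled steps, in config order (missing `enabled` is assumed to mean true)."""
    return [entry["step"] for entry in pipeline if entry.get("enabled", True)]


def validate_order(pipeline):
    """Return a list of problems found in the pipeline config; empty means it looks fine."""
    steps = enabled_steps(pipeline)
    problems = []
    if "page_load" not in steps:
        problems.append("page_load is required")
    if "asset_inlining" in steps and steps[-1] != "asset_inlining":
        problems.append("asset_inlining must be last")
    return problems
```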

Via Web UI

Visit Settings in the web interface to toggle steps on/off. Per-domain overrides can be set through Site Profiles.
