Paywall Bypass
archiveinator automatically detects paywalled pages and works through a sequence of bypass strategies, stopping as soon as the page becomes accessible.
Detection
A page is considered paywalled if any of the following are true:
- HTTP status is 401, 402, 403, or 429
- DOM selectors match known paywall elements:
  - Piano/TinyPass modals (.tp-modal, .tp-backdrop, .piano-container)
  - Generic paywall overlays (.paywall, .content-gate, .metered-content)
  - Subscription walls (.subscribe-wall, .subscription-overlay)
  - And 30+ additional publisher-specific selectors
- Bot challenge page (known challenge vendor selectors present):
  - PerimeterX: #px-captcha, #px-loader, [id^='px-']
  - Cloudflare: #challenge-form, .cf-browser-verification, [data-ray]
  - DataDome: #datadome-captcha, [id*='datadome'], script[src*='datadome']
  - Akamai: #ak_bmsc
- Bot challenge title: page title contains patterns like "just a moment", "are you a robot", "access denied", "ddos protection", etc.
- Word count is suspiciously low (< 150 words), indicating a teaser stub rather than the full article
Each check produces a paywall reason string that the bypass engine uses to choose which strategies to try. See Bypass Decision Logic below.
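The checks above can be sketched as a single function. This is a minimal illustration, not archiveinator's actual code: the name `detect_paywall`, its parameters, the order in which checks are evaluated, and the exact reason strings are all assumptions, chosen so the strings contain the substrings used in the trigger map.

```python
from typing import Optional

BLOCK_STATUSES = {401, 402, 403, 429}
CHALLENGE_TITLES = ("just a moment", "are you a robot", "access denied", "ddos protection")
MIN_WORDS = 150  # threshold stated in the detection list


def detect_paywall(status: int, title: str, word_count: int,
                   paywall_selector_hit: bool = False,
                   challenge_vendor: Optional[str] = None) -> Optional[str]:
    """Return a hypothetical paywall reason string, or None if the page looks open.

    `challenge_vendor` is the vendor name (e.g. "cloudflare") when one of the
    vendor selectors matched; selector matching itself is left to the caller.
    """
    if challenge_vendor:
        return f"bot challenge: {challenge_vendor}"
    if any(pattern in title.lower() for pattern in CHALLENGE_TITLES):
        return "bot challenge: title"
    if status in BLOCK_STATUSES:
        return f"HTTP {status}"
    if paywall_selector_hit:
        return "DOM selector matched"
    if word_count < MIN_WORDS:
        return "low word count"
    return None
```

For example, `detect_paywall(200, "Story", 80)` yields `"low word count"`, while a 900-word page with no selector hits yields `None`.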
Bypass Decision Logic
The bypass engine reads the paywall_reason set by detection and uses it to select strategies. Not every strategy is tried for every block type — targeted strategies run first to avoid wasting time on approaches that can’t work.
Reason → Strategy Trigger Map
| Detection result | paywall_reason contains | Strategies triggered |
|---|---|---|
| PerimeterX CAPTCHA | "bot challenge" | stealth_browser → patchright_load → camoufox_load → ua_cycling → header_tricks → google_news |
| Cloudflare challenge | "bot challenge" + "cloudflare" | stealth_browser → patchright_load → flaresolverr → camoufox_load → ua_cycling → header_tricks → google_news |
| DataDome CAPTCHA | "bot challenge" + "datadome" | stealth_browser → patchright_load → camoufox_load → ua_cycling → header_tricks → google_news |
| Subscription paywall | "DOM selector matched" | camoufox_load → ua_cycling → header_tricks → google_news → content_extraction → archive_fallback |
| HTTP 403 (soft) | "HTTP 403" | stealth_browser → patchright_load → camoufox_load → ua_cycling → header_tricks → google_news |
| HTTP 403 (hard block) | "hard block" | (ua_cycling and content_extraction skipped) → patchright_load → camoufox_load → header_tricks → google_news → archive_fallback |
| Timeout | "timeout" | stealth_browser → camoufox_load → ua_cycling → header_tricks → google_news |
| Low word count | "low word count" | camoufox_load → ua_cycling → header_tricks → google_news → content_extraction → archive_fallback |
Bold = the strategy specifically designed for that block type.
Trigger Conditions (exact)
Each strategy checks a specific condition before running:
| Strategy | Fires when |
|---|---|
| stealth_browser | paywall_reason contains "bot challenge", "HTTP 403", or "timeout" AND is not a hard block |
| patchright_load | paywall_reason contains "bot challenge", "perimeter", "datadome", or "HTTP 403" |
| flaresolverr | paywall_reason contains "cloudflare" |
| camoufox_load | Page is still paywalled (always tried after above strategies) |
| ua_cycling | Page is still paywalled AND NOT a hard block AND user_agents.cycle: true |
| header_tricks | Page is still paywalled |
| google_news | Page is still paywalled |
| content_extraction | Page is still paywalled AND NOT a hard block |
| archive_fallback | Page is still paywalled |
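The trigger conditions can be read as a small planning function. The sketch below is an assumption-laden illustration: `plan_strategies` is a hypothetical name, matching is done case-insensitively, and "page is still paywalled" is treated as always true (that is, every earlier strategy is assumed to have failed). It follows only the trigger conditions table, so its output can have a longer tail than the rows of the reason map.

```python
from __future__ import annotations


def plan_strategies(reason: str, ua_cycle: bool = True) -> list[str]:
    """List the strategies whose trigger condition matches, in run order."""
    r = reason.lower()
    hard = "hard block" in r
    plan: list[str] = []
    if any(k in r for k in ("bot challenge", "http 403", "timeout")) and not hard:
        plan.append("stealth_browser")
    if any(k in r for k in ("bot challenge", "perimeter", "datadome", "http 403")):
        plan.append("patchright_load")
    if "cloudflare" in r:
        plan.append("flaresolverr")
    plan.append("camoufox_load")          # always tried while still paywalled
    if not hard and ua_cycle:             # ua_cycle mirrors user_agents.cycle
        plan.append("ua_cycling")
    plan += ["header_tricks", "google_news"]
    if not hard:
        plan.append("content_extraction")
    plan.append("archive_fallback")
    return plan
```

With a hypothetical reason string such as `"HTTP 403 hard block"`, the plan is `patchright_load`, `camoufox_load`, `header_tricks`, `google_news`, `archive_fallback`, which matches the hard-block row of the map.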
Hard Block vs Soft Block
A hard block is detected when the HTTP status is 403 AND the page has fewer than 100 words, indicating a server-side rejection with no useful content at all (e.g., a bare “Access Denied” page). Hard blocks skip stealth_browser, ua_cycling, and content_extraction, since there’s nothing to extract; the remaining strategies (patchright_load, camoufox_load, header_tricks, google_news) still run before the archive fallback.
A soft block (subscription paywall) returns HTTP 200 with a partial article teaser and a DOM overlay or login gate.
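The hard-block rule reduces to a two-part predicate. A minimal sketch, with `is_hard_block` as a hypothetical helper name:

```python
HARD_BLOCK_MAX_WORDS = 100  # threshold stated above


def is_hard_block(status: int, word_count: int) -> bool:
    """True only for HTTP 403 responses that carry almost no text."""
    return status == 403 and word_count < HARD_BLOCK_MAX_WORDS
```

A 403 page with 12 words is a hard block; a 403 page with 400 words, or a 200 page with 12 words, is not.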
Bypass Cache
Before trying any strategy, the engine checks the per-domain bypass cache. If a previous run on the same domain succeeded with a specific strategy, that strategy is tried first — skipping strategies that have never worked for this domain.
On success, the winning strategy and UA (if applicable) are recorded. On total failure, the failure is recorded so stale cache entries are eventually pruned.
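One way to picture the cache is an in-memory map from domain to the last winning strategy. Everything here is hypothetical: the class and method names, the choice to keep the remaining strategies as fallbacks after the cached one, and the pruning threshold of three failures are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_FAILURES = 3  # assumed pruning threshold, not taken from archiveinator


@dataclass
class CacheEntry:
    strategy: str
    user_agent: Optional[str] = None
    failures: int = 0


class BypassCache:
    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def record_success(self, domain: str, strategy: str,
                       user_agent: Optional[str] = None) -> None:
        # A fresh success replaces the entry and resets the failure count.
        self._entries[domain] = CacheEntry(strategy, user_agent)

    def record_failure(self, domain: str) -> None:
        entry = self._entries.get(domain)
        if entry is None:
            return
        entry.failures += 1
        if entry.failures >= MAX_FAILURES:
            del self._entries[domain]  # stale entry is pruned

    def order(self, domain: str, plan: List[str]) -> List[str]:
        """Move the cached strategy to the front of the plan, if it is in the plan."""
        entry = self._entries.get(domain)
        if entry is None or entry.strategy not in plan:
            return list(plan)
        return [entry.strategy] + [s for s in plan if s != entry.strategy]
```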
Bypass Strategies (in order)
Strategies run sequentially. Each one is tried only if the page is still paywalled after the previous strategy.
1. JS Overlay Removal
Runs inside the browser without reloading the page.
Removes paywall modal elements from the live page DOM and restores body scroll before the page is serialized. This handles JS-injected overlays that appear after page load.
- No page reload required
- Fires while the browser is still open
- Targets Piano/TinyPass modals, generic overlays, and subscription gates
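A rough sketch of this step using Playwright's `page.evaluate(expression, arg)`: the selector list is limited to the ones named on this page, and `remove_overlays` plus the embedded JavaScript are illustrative assumptions rather than the tool's real implementation.

```python
OVERLAY_SELECTORS = (
    ".tp-modal", ".tp-backdrop", ".piano-container",
    ".paywall", ".content-gate", ".metered-content",
    ".subscribe-wall", ".subscription-overlay",
)

# Runs in the page: delete matching nodes, then undo scroll locking.
REMOVE_OVERLAYS_JS = """
(selectors) => {
  let removed = 0;
  for (const sel of selectors) {
    for (const el of document.querySelectorAll(sel)) {
      el.remove();
      removed += 1;
    }
  }
  for (const root of [document.documentElement, document.body]) {
    root.style.overflow = "auto";
  }
  return removed;
}
"""


def remove_overlays(page, selectors=OVERLAY_SELECTORS) -> int:
    """Remove overlay elements from a live page; returns how many were removed.

    `page` is any object exposing Playwright's page.evaluate(expression, arg).
    """
    return page.evaluate(REMOVE_OVERLAYS_JS, list(selectors))
```

Because the function only needs an object with `evaluate`, it can be called on a Playwright `Page` just before the DOM is serialized.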
2. Stealth Browser
Retries with playwright-stealth anti-fingerprinting patches applied — patches navigator.webdriver, navigator.plugins, canvas fingerprint, and other automation-detectable JS properties.
- Triggered by: bot challenge, HTTP 403, or timeout
- Effective against: Cloudflare “Just a moment”, basic DataDome
3. Patchright (CDP-safe Chromium)
Retries using Patchright — a Playwright fork that patches the Chromium binary to remove Chrome DevTools Protocol (CDP) socket detection signatures.
- Patches at the binary level, not the JS layer — bypasses detection that playwright-stealth cannot
- Triggered by: any bot challenge (including PerimeterX and DataDome) or HTTP 403
- Effective against: PerimeterX, modern DataDome
4. FlareSolverr (Cloudflare IUAM solver)
Contacts a running FlareSolverr Docker sidecar to obtain a cf_clearance cookie, which is injected into the browser context for a subsequent reload.
- Targeted specifically at Cloudflare IUAM (“I’m Under Attack Mode”) and Turnstile challenges
- No-op unless configured: set flaresolverr_url: "http://localhost:8191/v1" in config.yaml or the FLARESOLVERR_URL env var
- Triggered by: Cloudflare detection
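The exchange with FlareSolverr can be sketched with the standard library. The `request.get` command, the `maxTimeout` field, and the `solution.cookies` list follow FlareSolverr's documented v1 API; the function names and the way the cookie is reshaped for Playwright's `context.add_cookies()` are assumptions for illustration.

```python
import json
import urllib.request
from typing import Optional


def solve_request(flaresolverr_url: str, target_url: str, timeout_ms: int = 60000) -> dict:
    """POST a request.get command to FlareSolverr and return the decoded JSON reply."""
    payload = json.dumps({"cmd": "request.get", "url": target_url,
                          "maxTimeout": timeout_ms}).encode("utf-8")
    req = urllib.request.Request(flaresolverr_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    # Allow a little slack beyond FlareSolverr's own timeout.
    with urllib.request.urlopen(req, timeout=timeout_ms / 1000 + 5) as resp:
        return json.load(resp)


def clearance_cookie(reply: dict) -> Optional[dict]:
    """Pick cf_clearance out of the reply, shaped for context.add_cookies()."""
    for cookie in reply.get("solution", {}).get("cookies", []):
        if cookie.get("name") == "cf_clearance":
            return {"name": cookie["name"], "value": cookie["value"],
                    "domain": cookie["domain"], "path": cookie.get("path", "/")}
    return None
```

The reply also carries `solution.userAgent`; since Cloudflare generally ties a clearance cookie to the user agent that earned it, the subsequent reload would likely need to reuse that value.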
5. Camoufox (Firefox engine)
Retries using Camoufox — a patched Firefox binary with binary-level anti-fingerprinting across canvas, WebGL, fonts, TLS ClientHello, and HTTP/2 SETTINGS frames.
- Uses a completely different browser engine (Firefox) — distinct from Chromium at every layer
- humanize=True adds realistic mouse movement and timing for behavioral analysis bypass
- Final browser-engine fallback: tried when all Chromium-based strategies fail
6. UA Cycling
Retries the page load with the next enabled user agent from your config.
- Requires user_agents.cycle: true in config
- Cycles through all enabled agents until one works or the list is exhausted
- Successful agent/domain pairs are cached — future runs on the same domain start with the known-good UA
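The cycling loop might look like the following sketch. `cycle_user_agents`, `fetch`, and `is_paywalled` are hypothetical names; the page loader and the paywall check are passed in as callables so the loop itself stays independent of any browser.

```python
from typing import Callable, List, Optional, Tuple


def cycle_user_agents(url: str, agents: List[str],
                      fetch: Callable[[str, str], str],
                      is_paywalled: Callable[[str], bool],
                      cached_ua: Optional[str] = None) -> Optional[Tuple[str, str]]:
    """Try each enabled agent, known-good one first; return (agent, html) or None."""
    ordered = list(agents)
    if cached_ua in ordered:
        # Start with the agent that worked for this domain last time.
        ordered.remove(cached_ua)
        ordered.insert(0, cached_ua)
    for ua in ordered:
        html = fetch(url, ua)
        if not is_paywalled(html):
            return ua, html
    return None  # list exhausted
```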
7. Header Tricks
Retries the page load with:
- User-Agent: Googlebot
- Referer: https://www.google.com/
- X-Forwarded-For: 66.249.66.1
Many publishers allow Googlebot through paywalls to stay indexed in search results.
8. Google News Referral
Retries the page load with:
- User-Agent: Googlebot
- Referer: https://news.google.com/
Simulates a click-through from Google News. Works on publishers that whitelist Google News traffic.
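The two header-based retries (Header Tricks and Google News Referral) differ only in the headers they send. The sketch below expresses them as named profiles and builds a standard-library request for illustration; `HEADER_PROFILES` and `build_request` are hypothetical names, and it is an assumption that the full Googlebot string is what gets sent as the user agent. In a browser-driven tool the same headers would more likely be applied to the browser context than to `urllib`.

```python
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

HEADER_PROFILES = {
    "header_tricks": {
        "User-Agent": GOOGLEBOT_UA,
        "Referer": "https://www.google.com/",
        "X-Forwarded-For": "66.249.66.1",
    },
    "google_news": {
        "User-Agent": GOOGLEBOT_UA,
        "Referer": "https://news.google.com/",
    },
}


def build_request(url: str, profile: str) -> urllib.request.Request:
    """Return an unsent request carrying the headers of the named profile."""
    return urllib.request.Request(url, headers=HEADER_PROFILES[profile])
```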
9. Content Extraction Fallback
If the page is still paywalled after all retries, trafilatura extracts the article body from whatever HTML was retrieved.
The archive is saved as a clean, readable document containing the extracted article text — stripped of ads, navigation, and paywall elements.
10. Archive Service Fallback
Queries the Wayback Machine and then archive.today for an archived copy of the URL. The most recent snapshot is used if found.
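For the Wayback Machine half of this step, a lookup can be sketched against the Internet Archive's public availability endpoint, which returns the closest (by default, most recent) snapshot. The helper names are hypothetical, and whether archiveinator uses this endpoint or another Wayback API is not stated here; archive.today is left out of the sketch.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url: str) -> str:
    """Build the availability query for a page URL."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(reply: dict) -> Optional[str]:
    """Return the snapshot URL from an availability reply, or None if there is none."""
    closest = reply.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timeout: float = 15.0) -> Optional[str]:
    """Query the Wayback Machine and return a snapshot URL if one exists."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```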
Bypass Cache
Successful bypasses are tracked per domain:
| What’s cached | Purpose |
|---|---|
| Winning user agent | Future runs on the same domain start with the known-good UA, skipping failed agents |
| Successful strategy | The last successful bypass approach is tried first on subsequent visits |
The cache is stored alongside the config file:
| Platform | Cache Path |
|---|---|
| macOS | ~/Library/Application Support/archiveinator/ua_cache.yaml |
| Linux | ~/.config/archiveinator/ua_cache.yaml |
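The path table can be expressed as a small helper. `cache_path` is a hypothetical name, and treating every non-macOS platform like Linux is an assumption, since only those two platforms are listed.

```python
import sys
from pathlib import Path
from typing import Optional


def cache_path(platform: str = sys.platform, home: Optional[Path] = None) -> Path:
    """Return the ua_cache.yaml location for the given platform."""
    home = home if home is not None else Path.home()
    if platform == "darwin":  # macOS
        base = home / "Library" / "Application Support"
    else:                     # Linux (assumed default for anything else)
        base = home / ".config"
    return base / "archiveinator" / "ua_cache.yaml"
```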
Researching a New Paywalled Site
When a site’s paywall isn’t handled by the built-in strategies, use archiveinator ladder to quickly test header and referrer combinations before writing any code.
# Start the proxy
archiveinator ladder
# Test variations with curl — no browser spin-up, instant feedback
curl http://localhost:8181/https://example.com # default Googlebot UA
curl http://localhost:8181/raw/https://example.com # raw HTML
curl http://localhost:8181/api/https://example.com | jq -r '.body' | wc -w # word count (-r prints the raw text, so JSON escapes don't skew the count)
Once you find a combination that works, create a rule file:
# ~/.config/archiveinator/ladder-rules/example.yaml
domains:
  - example.com
rules:
  referer: "https://news.google.com/"
  user_agent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Restart archiveinator ladder to load the new rule, then verify with curl. When the approach is confirmed, encode it as a site profile in the Web UI (Settings → Site Profiles) using the same UA and referrer values — future archives of that domain will pick it up automatically.
Partial Archives
If all strategies are exhausted without success, a partial archive of whatever HTML was retrieved is saved with _partial in the filename:
2026-03-21_14-30_example.com_article-title_partial.html
This ensures you have some content even when bypass fails — useful for pages that are paywalled through novel or undocumented methods.
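The naming pattern in the example above can be reproduced with a short helper. `archive_filename` is a hypothetical name, and the non-partial form (the same name without the `_partial` suffix) is inferred from the example rather than documented.

```python
from datetime import datetime


def archive_filename(when: datetime, domain: str, slug: str, partial: bool = False) -> str:
    """Build <date>_<time>_<domain>_<slug>[_partial].html."""
    stem = f"{when:%Y-%m-%d_%H-%M}_{domain}_{slug}"
    return stem + ("_partial" if partial else "") + ".html"
```

Calling `archive_filename(datetime(2026, 3, 21, 14, 30), "example.com", "article-title", partial=True)` returns the filename shown above.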