Development

Set up a local development environment for contributing to archiveinator.

Dev Setup

# Clone the repo
git clone https://github.com/p0rkchop/archiveinator.git
cd archiveinator

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install with dev and web dependencies
pip3 install -e ".[dev,web]"

# Run setup (installs Chromium, monolith, blocklists)
archiveinator setup

Project Structure

archiveinator/
  archiveinator/
    cli.py                # CLI entry point (archive, setup, login, serve, ladder, cache)
    config.py             # Config model, defaults, YAML migration
    pipeline.py           # ArchiveContext dataclass
    bypass_cache.py       # Per-domain bypass strategy cache (YAML)
    ua_manager.py         # UA cycling and per-domain tracking
    naming.py             # Output filename format
    setup_cmd.py          # archiveinator setup logic
    blocklist.py          # EasyList/EasyPrivacy loading
    steps/
      page_load.py        # Playwright page load + ad blocking + paywall detection
      paywall.py          # Paywall/bot detection logic (selectors, titles, word count)
      js_overlay.py       # JS overlay/modal removal (93 selectors)
      stealth_browser.py  # playwright-stealth anti-fingerprinting (JS layer)
      patchright_load.py  # Patchright CDP-patched Chromium (PerimeterX/DataDome)
      flaresolverr.py     # FlareSolverr Cloudflare cookie solver (opt-in sidecar)
      camoufox_load.py    # Camoufox patched Firefox (binary-level fingerprinting)
      curl_impersonate.py # curl-impersonate TLS fast path (Linux/Docker only)
      ad_blocking.py      # Network-level request interception
      dom_cleanup.py      # DOM ad node removal
      google_news.py      # Google News referrer bypass
      header_tricks.py    # Googlebot UA + header spoofing
      ua_cycling.py       # User agent cycling bypass
      image_dedup.py      # picture/srcset deduplication
      asset_inlining.py   # monolith binary integration
      content_extraction.py  # trafilatura text fallback
      archive_fallback.py # Wayback Machine + archive.today fallback
    web/
      app.py              # FastAPI factory, lifespan, middleware
      auth.py             # Session auth, bcrypt
      db.py               # SQLAlchemy engine + session factory
      models.py           # 7 ORM models (User, ArchiveJob, SiteProfile, ...)
      job_manager.py      # Async job lifecycle + WebSocket progress streaming
      scheduler.py        # APScheduler: cron schedules + RSS polling
      feed_reader.py      # feedparser RSS/Atom parsing
      emailer.py          # Resend.com email notifications
      templates.py        # render_page() and HTML helpers
      routes/
        archive.py        # POST /archive, WS /archive/{id}/ws, GET /download/{id}
        auth.py           # /auth/register, /auth/login, /auth/logout
        bulk.py           # Bulk import (bookmarks HTML, text, CSV)
        config.py         # User pipeline settings
        dashboard.py      # Main dashboard
        feeds.py          # RSS feed management
        jobs.py           # Archive history
        profiles.py       # Site profiles + cookie upload
        schedules.py      # Cron schedule management
      static/             # CSS and JavaScript
  tests/
    unit/                 # Fast tests (no network, no browser)
    qa/
      test_mock_paywall.py   # Mock HTTP server paywall scenarios
      test_real_urls.py      # Live site bypass validation
      sites.yaml             # 80+ paywalled site catalog
  docs/                   # GitHub Pages (Just the Docs)

Running Tests

# Unit tests (fast, no network required)
pytest tests/unit/ -q

# Mock paywall tests (local server, no external network)
pytest tests/qa/test_mock_paywall.py -m mock_paywall -q

# Real-URL tests (requires network + full setup)
pytest tests/qa/test_real_urls.py -m real_url -q

# Filter real-URL tests by difficulty or paywall type
pytest tests/qa/test_real_urls.py -m real_url --qa-difficulty easy
pytest tests/qa/test_real_urls.py -m real_url --qa-paywall-type piano
pytest tests/qa/test_real_urls.py -m real_url --qa-site "NYT"
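The three --qa-* filters can be read as narrowing the site catalog in tests/qa/sites.yaml before any URL is fetched. The sketch below only illustrates that idea and is not the project's conftest: the Site fields (name, difficulty, paywall_type), the filter_sites helper, the sample entries, and the case-insensitive substring match for --qa-site are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    # Hypothetical shape of one catalog entry; the real sites.yaml may differ.
    name: str
    difficulty: str
    paywall_type: str


def filter_sites(sites, difficulty: Optional[str] = None,
                 paywall_type: Optional[str] = None,
                 site: Optional[str] = None):
    """Keep only the entries that satisfy every filter that was given."""
    selected = []
    for entry in sites:
        if difficulty is not None and entry.difficulty != difficulty:
            continue
        if paywall_type is not None and entry.paywall_type != paywall_type:
            continue
        # Assumed: --qa-site matches a case-insensitive substring of the name.
        if site is not None and site.lower() not in entry.name.lower():
            continue
        selected.append(entry)
    return selected


# Made-up sample entries, for illustration only.
CATALOG = [
    Site("Example Times", "hard", "metered"),
    Site("Example Daily", "easy", "piano"),
    Site("Example Weekly", "easy", "metered"),
]

print([s.name for s in filter_sites(CATALOG, difficulty="easy")])
print([s.name for s in filter_sites(CATALOG, site="times")])
```

Under these assumptions the filters combine with AND semantics, so passing several options at once narrows the selection further.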

Lint and Type Check

# Lint
ruff check .

# Format check
ruff format --check .

# Auto-fix lint and format
ruff check --fix . && ruff format .

# Type check
mypy archiveinator/

Researching a New Paywalled Site

Use archiveinator ladder to quickly test header/referrer bypass combinations without modifying code:

# Start the Ladder proxy (requires Docker)
archiveinator ladder

# Test in a separate terminal
curl http://localhost:8181/https://target-site.com
curl -s http://localhost:8181/api/https://target-site.com | jq -r '.body' | wc -w

# Iterate with a YAML rule file
# ~/.config/archiveinator/ladder-rules/target.yaml

See Paywall Bypass → Researching a New Site for the full workflow.
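When iterating on a rule file, it can help to compare word counts between attempts from a script rather than by eye. The following is a minimal Python sketch of the same measurement as the curl, jq, and wc pipeline above. It assumes the /api/ endpoint returns a JSON object with a string body field, as the jq filter implies; the body_word_count helper and the sample responses are hypothetical.

```python
import json


def body_word_count(response_text: str) -> int:
    """Count whitespace-separated words in the "body" field of a JSON reply."""
    payload = json.loads(response_text)
    # A missing or null body counts as zero words rather than raising.
    return len((payload.get("body") or "").split())


# Made-up responses standing in for two rule-file iterations.
before = json.dumps({"body": "Subscribe to continue reading"})
after = json.dumps({"body": "Full article text\nwith several more words than before"})

print(body_word_count(before), body_word_count(after))
```

A count that rises noticeably after a rule change suggests the rule exposed more of the article, though the count includes any markup present in the body.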

Web UI Development

# Install dev and web dependencies (already done if you followed Dev Setup)
pip3 install -e ".[dev,web]"

# Start with auto-reload and verbose logging
archiveinator serve --dev

The SQLite database is created in the platform data directory (alongside config.yaml) on first startup. In Docker, sessions survive server restarts when /data is a persistent mount.

CI/CD

GitHub Actions workflows:

Workflow	Trigger	Jobs
`ci.yml`	Push / PR to main	Tests (Python 3.12), Lint + Type check, Docker build test
`release.yml`	`v*` tag push	Build monolith binaries, create GitHub Release, build + push Docker image
`qa-paywall.yml`	Weekly (Mon 06:00 UTC)	Real-URL paywall bypass tests; opens issue on failure
`update-blocklists.yml`	Weekly (Mon 03:00 UTC)	Refresh EasyList + EasyPrivacy, commit back

Release Process

1. Bump the version in pyproject.toml and archiveinator/web/app.py
2. Run uv lock to update the lockfile
3. Commit: git commit -m "release: bump to vX.Y.Z"
4. Push: git push origin main
5. Wait for CI to pass
6. Tag and push: git tag vX.Y.Z && git push origin vX.Y.Z
7. The release workflow then builds the binaries and the Docker image and creates the GitHub Release automatically

Documentation

Documentation lives in docs/ and is published via GitHub Pages using the Just the Docs theme. GitHub Pages deploys automatically from the /docs folder on the main branch after every push.

To preview locally, install Jekyll and run bundle exec jekyll serve in the docs/ directory.