ArchiveBox

Status: Active

Overview

ArchiveBox is a self-hosted tool that archives web content from URLs, bookmarks, RSS feeds, and browser history by rendering pages in a headless browser and saving them in formats such as HTML, PDF, PNG, and WARC. It extracts media, git repositories, audio, video, and other assets into accessible folders. It is aimed at individuals preserving their personal browsing data, researchers, and organizations with compliance requirements. It can archive private content through Chrome profiles, and it stores everything locally rather than relying on a centralized service.

Key Features

  • Multi-format Archiving - Saves pages in HTML, PDF, PNG, WARC, and other formats using headless Chrome, wget, and yt-dlp.
  • Content Extraction - Automatically pulls media, articles, git repos, audio, video, subtitles, images, and PDFs from pages.
  • Import Sources - Supports bulk imports from bookmarks, RSS feeds, Pocket, Pinboard, and browser history.
  • CLI and Web UI - Operates through a command-line interface or an optional Django-based web interface for adding and viewing archives.
  • Scheduled Archiving - Supports scheduled or real-time importing from the supported import sources.
  • Authenticated Archiving - Uses Chrome user profiles with cookies to access login-required or paywalled content.
  • Modular Dependencies - Bundles tools like Chrome, wget, and readability, and supports storage backends like S3 or Google Drive.
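
To make the CLI and scheduling features more concrete, here is a minimal Python sketch that assembles `archivebox add` and `archivebox schedule` command lines. The helper names `build_add_command` and `build_schedule_command` are hypothetical and not part of ArchiveBox. The flags shown (`--depth`, `--every`) follow the project's documented usage, but check `archivebox add --help` and `archivebox schedule --help` for your installed version before relying on them. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def build_add_command(url, depth=0):
    """Build the argv list for archiving a single URL.

    depth=0 archives only the page itself; depth=1 also follows
    the links found on that page.
    """
    return ["archivebox", "add", f"--depth={depth}", url]


def build_schedule_command(url, every="day", depth=1):
    """Build the argv list for a recurring import, e.g. from an RSS feed."""
    return ["archivebox", "schedule", f"--every={every}", f"--depth={depth}", url]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is
    # safe to run on a machine without ArchiveBox installed.
    for cmd in (
        build_add_command("https://example.com"),
        build_schedule_command("https://example.com/feed.rss"),
    ):
        print(shlex.join(cmd))
```

To actually run a command, the list can be passed to `subprocess.run(cmd, check=True)` from inside an initialized ArchiveBox data directory.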

Pricing

Plan | Price | Includes
Community | Free | Full open-source features, self-hosted on own hardware.
Self-Hosted | Free | CLI, web UI, API access with custom configuration and storage.

Platforms & Requirements

Runs on Linux and macOS, and via Docker on any system that supports Docker. Requires Node.js, Python 3, and Chrome/Chromium. At least 2 GB of RAM is needed for basic use, and more for large archives. Windows support is limited to Docker or WSL.
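
As a rough pre-flight check for the requirements listed above, the following Python sketch looks for the relevant executables on the `PATH`. The executable names in `CANDIDATES` are assumptions; actual names vary by platform and install method, and ArchiveBox performs its own dependency detection, so treat this as an illustration only.

```python
import shutil
import sys

# Assumed executable names; adjust for your platform and install method.
CANDIDATES = {
    "Node.js": ["node", "nodejs"],
    "Chrome/Chromium": ["chromium", "chromium-browser", "google-chrome", "chrome"],
}


def find_first(names, which=shutil.which):
    """Return the path of the first executable found, or None."""
    for name in names:
        path = which(name)
        if path:
            return path
    return None


def check_requirements(which=shutil.which):
    """Map each requirement to the path where it was found (or None)."""
    report = {"Python 3": sys.executable if sys.version_info[0] >= 3 else None}
    for label, names in CANDIDATES.items():
        report[label] = find_first(names, which)
    return report


if __name__ == "__main__":
    for label, path in check_requirements().items():
        print(f"{label}: {path or 'not found'}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.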

Integrations & Ecosystem

  • Browser history (Chrome, Firefox)
  • Bookmarks (HTML, JSON)
  • RSS feeds
  • Pocket/Pinboard
  • yt-dlp for media
  • S3, Google Drive, NFS/SMB storage
  • REST API (alpha)
  • Python API (beta)
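
To illustrate what a bookmarks import involves, the sketch below pulls unique http(s) links out of a Netscape-style bookmarks HTML export using only the Python standard library. This is not ArchiveBox's own parser; the class `BookmarkLinkParser`, the function `extract_bookmark_urls`, and the sample data are hypothetical. ArchiveBox can read such files directly, so a script like this is only useful for inspecting or filtering links before import.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collect href values from <a> tags in a bookmarks export."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...>
        # in Netscape-style exports arrives here as "a" / "href".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(("http://", "https://")):
            self.urls.append(href)


def extract_bookmark_urls(html_text):
    """Return http(s) URLs in document order, with duplicates removed."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    seen = set()
    unique = []
    for url in parser.urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
<DT><A HREF="https://example.com/" ADD_DATE="0">Example</A>
<DT><A HREF="https://example.org/page">Page</A>
<DT><A HREF="https://example.com/">Duplicate</A>
<DT><A HREF="javascript:void(0)">Bookmarklet</A>
</DL><p>
"""

if __name__ == "__main__":
    # One URL per line, suitable for piping into `archivebox add`.
    print("\n".join(extract_bookmark_urls(SAMPLE)))
```

Filtering on the URL scheme drops bookmarklets and other non-web entries, and de-duplicating keeps the same page from being queued twice.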

Alternatives

App | Difference
Webrecorder | Browser-based recorder focused on interactive session capture rather than bulk CLI imports.
SingleFile | Lightweight single-HTML archiver without multi-format extraction or scheduling.
Wallabag | Read-it-later service with article extraction but less emphasis on full-page and media archiving.
Archive.org's Save Page Now | Public web service for single URLs without private content or local storage support.

Reputation

ArchiveBox is regarded as a robust, privacy-focused archiving solution for power users comfortable with self-hosting and CLI tools. Its strengths include comprehensive format support and extraction capabilities beyond those of public services. Criticisms center on setup complexity, dependency on Chrome, and security caveats for authenticated archiving, including warnings against using personal Chrome profiles until fixes are implemented.
