crau is the way (most) Brazilians pronounce "crawl". It's the easiest command-line tool for archiving the Web and playing back archives: you just need a list of URLs.
pip install crau
Archive a list of URLs by passing them via command-line:
crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N
or passing a text file (one URL per line):
echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txt
crau archive myarchive.warc.gz -i urls.txt
Run crau archive --help for more options.
List archived URLs in a WARC file:
crau list myarchive.warc.gz
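For context, each record in a WARC file carries a WARC-Target-URI header naming the archived URL, which is what a listing is built from. A minimal stdlib-only sketch of that idea (a toy parser for illustration, not crau's actual implementation, and it ignores proper WARC record framing):

```python
import gzip


def list_urls(warc_bytes):
    """Yield WARC-Target-URI header values from raw gzipped WARC bytes (toy parser)."""
    for line in gzip.decompress(warc_bytes).split(b"\r\n"):
        if line.startswith(b"WARC-Target-URI:"):
            yield line.split(b":", 1)[1].strip().decode()


# A tiny synthetic record, just to show the header layout
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/page-1\r\n"
    b"\r\n"
)
print(list(list_urls(gzip.compress(record))))  # ['http://example.com/page-1']
```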
Extract a file from an archive:
crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html
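A WARC response record stores the full HTTP response after the WARC headers, so extracting a file amounts to finding the record for a URL and skipping two header blocks. A rough stdlib-only sketch of that layout (again a toy parser for a single record, not what crau actually does):

```python
import gzip


def extract_body(warc_bytes, url):
    """Return the HTTP payload of the response record for `url` (toy parser)."""
    data = gzip.decompress(warc_bytes)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    if b"WARC-Target-URI: " + url.encode() not in warc_headers:
        return None
    _, _, body = rest.partition(b"\r\n\r\n")  # skip the embedded HTTP headers
    return body


record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/page.html\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>...</html>"
)
print(extract_body(gzip.compress(record), "https://example.com/page.html"))
# b'<html>...</html>'
```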
Run a server on localhost:8080 to play your archive:
crau play myarchive.warc.gz
If you've mirrored a website using wget -r, httrack or a similar tool, so that the files are on your file system, you can use crau to create a WARC file from them. Run:
crau pack [--inner-directory=path] <start-url> <path-or-archive> <warc-filename>
Where:

start-url: base URL you've downloaded (it will be joined with each file's relative path to form the archived URLs).
path-or-archive: path where the files are located. Can also be a .tar.gz, .tar.bz2, .tar.xz or .zip archive; crau will retrieve all files inside it.
warc-filename: WARC file to be created.
--inner-directory: used when a TAR/ZIP archive is passed, to filter which directory inside it should be packed (e.g.: if the archive has a backup/ directory on the root and a www.example.com/ inside of it, so the files are actually inside backup/www.example.com/, just pass --inner-directory=backup/www.example.com/ and only the files inside this directory will be packed; backup/www.example.com/contact.html will be archived as <start-url>/contact.html).

There are other archiving tools, of course. The motivation to start this project was a lack of easy, fast and robust software to archive URLs - I just wanted to execute one command without thinking and get a WARC file. Depending on your problem, crau may not be the best answer - check out more archiving tools in awesome-web-archiving.
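The core of the pack step is the mapping from local files to archived URLs: each file's path relative to the mirrored directory is joined with start-url. A small illustration of that mapping (my own sketch, not crau's code; the directory layout and URL are made up):

```python
import pathlib
import tempfile
from urllib.parse import urljoin


def packed_urls(start_url, root):
    """Map each file under `root` to the URL it would be archived as."""
    root = pathlib.Path(root)
    return {
        str(path.relative_to(root)): urljoin(start_url, str(path.relative_to(root)))
        for path in root.rglob("*")
        if path.is_file()
    }


# Simulate a small mirrored tree, as left behind by wget -r or httrack
with tempfile.TemporaryDirectory() as tmp:
    site = pathlib.Path(tmp) / "www.example.com"
    site.mkdir()
    (site / "contact.html").write_text("<html>...</html>")
    print(packed_urls("http://www.example.com/", site))
    # {'contact.html': 'http://www.example.com/contact.html'}
```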
Some of these tools can easily send URLs to archiving services such as archive.is via the command line and can also archive locally, but when archiving they call wget to do the job.
Clone the repository:
git clone https://github.com/turicas/crau.git
Install development dependencies (you may want to create a virtualenv):
cd crau && pip install -r requirements-development.txt
Install an editable version of the package:
pip install -e .
Make your changes, commit them to another branch and then create a pull request on GitHub.