crau is the way (most) Brazilians pronounce "crawl". It's the easiest command-line tool for archiving the Web and playing back archives: you just need a list of URLs.
pip install crau
Archive a list of URLs by passing them via command-line:
crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N
or passing a text file (one URL per line):
echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txt
crau archive myarchive.warc.gz -i urls.txt
Run crau archive --help for more options.
List archived URLs in a WARC file:
crau list myarchive.warc.gz
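For context, each record in a WARC file carries a WARC-Target-URI header naming the archived URL, which is what a listing is built from. A minimal stdlib-only sketch of that idea (a toy parser for illustration, not crau's actual implementation, and it ignores proper WARC record framing):

```python
import gzip


def list_urls(warc_bytes):
    """Yield WARC-Target-URI header values from raw gzipped WARC bytes (toy parser)."""
    for line in gzip.decompress(warc_bytes).split(b"\r\n"):
        if line.startswith(b"WARC-Target-URI:"):
            yield line.split(b":", 1)[1].strip().decode()


# A tiny synthetic record, just to show the header layout
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/page-1\r\n"
    b"\r\n"
)
print(list(list_urls(gzip.compress(record))))  # ['http://example.com/page-1']
```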
Extract a file from an archive:
crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html
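A WARC response record stores the full HTTP response after the WARC headers, so extracting a file amounts to finding the record for a URL and skipping two header blocks. A rough stdlib-only sketch of that layout (again a toy parser for a single record, not what crau actually does):

```python
import gzip


def extract_body(warc_bytes, url):
    """Return the HTTP payload of the response record for `url` (toy parser)."""
    data = gzip.decompress(warc_bytes)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    if b"WARC-Target-URI: " + url.encode() not in warc_headers:
        return None
    _, _, body = rest.partition(b"\r\n\r\n")  # skip the embedded HTTP headers
    return body


record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/page.html\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>...</html>"
)
print(extract_body(gzip.compress(record), "https://example.com/page.html"))
# b'<html>...</html>'
```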
Run a server on localhost:8080 to play your archive:
crau play myarchive.warc.gz
If you've mirrored a website using wget -r, httrack or a similar tool, so that the files are on your file system, you can use crau to create a WARC file from them. Run:
crau pack [--inner-directory=path] <start-url> <path-or-archive> <warc-filename>
Where:

start-url: base URL you've downloaded (it will be joined with each file's relative path to form the archived URLs).
path-or-archive: path where the files are located. Can also be a .tar.gz, .tar.bz2, .tar.xz or .zip archive; crau will retrieve all files inside it.
warc-filename: WARC file to be created.
--inner-directory: used when a TAR/ZIP archive is passed, to filter which directory inside it should be packed (e.g.: if the archive has a backup/ directory on the root and a www.example.com/ inside of it, so the files are actually inside backup/www.example.com/, just pass --inner-directory=backup/www.example.com/ and only the files inside this directory will be packed; backup/www.example.com/contact.html will be archived as <start-url>/contact.html).

There are other archiving tools, of course. The motivation to start this project was a lack of easy, fast and robust software to archive URLs - I just wanted to execute one command without thinking and get a WARC file. Depending on your problem, crau may not be the best answer - check out more archiving tools in awesome-web-archiving.
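The core of the pack step is the mapping from local files to archived URLs: each file's path relative to the mirrored directory is joined with start-url. A small illustration of that mapping (my own sketch, not crau's code; the directory layout and URL are made up):

```python
import pathlib
import tempfile
from urllib.parse import urljoin


def packed_urls(start_url, root):
    """Map each file under `root` to the URL it would be archived as."""
    root = pathlib.Path(root)
    return {
        str(path.relative_to(root)): urljoin(start_url, str(path.relative_to(root)))
        for path in root.rglob("*")
        if path.is_file()
    }


# Simulate a small mirrored tree, as left behind by wget -r or httrack
with tempfile.TemporaryDirectory() as tmp:
    site = pathlib.Path(tmp) / "www.example.com"
    site.mkdir()
    (site / "contact.html").write_text("<html>...</html>")
    print(packed_urls("http://www.example.com/", site))
    # {'contact.html': 'http://www.example.com/contact.html'}
```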
Some of these tools can easily send URLs to archiving services such as archive.is via the command line and can also archive locally, but when archiving they call wget to do the job.
Clone the repository:
git clone https://github.com/turicas/crau.git
Install development dependencies (you may want to create a virtualenv):
cd crau && pip install -r requirements-development.txt
Install an editable version of the package:
pip install -e .
Make your changes, commit them to another branch and then create a pull request on GitHub.