Playing around with SavePageNow data.
MIT License
This repository includes Jupyter notebooks that document research into the Internet Archive's Save Page Now web archive data. The project is a collaboration between Shawn Walker, Jess Ogden and Ed Summers.
The notebooks have a rough order, since some of them rely on data created in others. They are listed here as a table of contents if you want to follow the path of exploration.
Some of the notebooks depend on third-party Python packages, so you'll need to install those. pipenv is a handy tool for managing a project's Python dependencies. These steps should get you up and running:
pip install pipenv
git clone https://github.com/edsu/spn
cd spn
pipenv install
pipenv shell
jupyter notebook
Note: if you are using a notebook that requires Spark you'll need to set these in your environment before starting Jupyter:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
jupyter notebook Spark.ipynb
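If you would rather not export those variables in your shell, a minimal sketch of the same setup in Python could look like this. The helper names `spark_notebook_env` and `launch_notebook` are hypothetical and not part of this repository; the sketch only mirrors the two exports and the command shown above.

```python
import os
import subprocess


def spark_notebook_env(base=None):
    """Return a copy of the environment with the two PySpark driver variables set.

    `base` defaults to the current process environment; it is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def launch_notebook(notebook="Spark.ipynb"):
    """Start Jupyter on the given notebook with the Spark variables in place.

    Assumes `jupyter` is on the PATH, e.g. inside `pipenv shell`.
    """
    return subprocess.run(["jupyter", "notebook", notebook], env=spark_notebook_env())
```

Calling `launch_notebook()` from inside the pipenv shell should behave like the three shell lines above, without changing the environment of your current shell.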
Before settling on warcio in Spark-enabled Jupyter notebooks, we tried parts of the Archives Unleashed Toolkit and ArchiveSpark. You can find some artifacts from that effort in analysis. We don't mean to cast any shade on those projects by not using them. We knew exactly what we wanted to look for in the WARC data, and were not interested in bringing up a general-purpose WARC analysis toolkit on the XSEDE platform. It was easier to understand how to get things working with fewer moving pieces.
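To give a sense of how little is needed when you already know what to look for, here is a minimal sketch of reading a WARC file with warcio outside of Spark. It is an illustration only, not code from the notebooks: the function names `iter_records` and `count_record_types` are hypothetical, and counting record types is just an example of a narrow question to ask of the data.

```python
from collections import Counter


def iter_records(path):
    """Yield (record type, target URI) pairs from a single WARC file."""
    # Imported here so the counting helper below can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header("WARC-Target-URI")
            yield record.rec_type, uri


def count_record_types(pairs):
    """Count how many records of each WARC type appear in (type, uri) pairs."""
    return Counter(rec_type for rec_type, _ in pairs)
```

For example, `count_record_types(iter_records("example.warc.gz"))` would tally the request, response and other record types in a file, where `example.warc.gz` stands in for whatever WARC file you have at hand.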