arxiv-latex-extract

Extract LaTeX source code from arxiv.org bulk archives

APACHE-2.0 License

ALE: arXiv LaTeX Extract

ALE is a tool for bulk-extracting LaTeX sources from arXiv.org by processing arXiv Bulk Data. Unlike other tools that rely exclusively on Amazon S3 for downloading, ALE primarily uses the mirror on archive.org, which is free but may be out of date. If the optional boto3 package is also installed and the environment variables AWS_ACCESS_KEY and AWS_SECRET_KEY hold valid AWS credentials, missing buckets are retrieved from Amazon S3.
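The fallback condition described above can be sketched as follows. This is a minimal illustration and not the code in main.py: the function names `s3_fallback_available` and `choose_sources` are hypothetical, and the check only tests that boto3 is importable and that both variables are set, not that the credentials are actually valid.

```python
import importlib.util
import os


def s3_fallback_available(env=os.environ):
    """Return True if boto3 is importable and both credential variables are set.

    Hypothetical helper: it does not verify the credentials with AWS.
    """
    has_boto3 = importlib.util.find_spec("boto3") is not None
    has_keys = all(env.get(key) for key in ("AWS_ACCESS_KEY", "AWS_SECRET_KEY"))
    return has_boto3 and has_keys


def choose_sources(env=os.environ):
    """List download sources in order of preference (assumed ordering)."""
    sources = ["archive.org"]  # the free mirror is always tried first
    if s3_fallback_available(env):
        sources.append("s3")  # only used for files missing from the mirror
    return sources


print(choose_sources())
```

Passing the environment as a parameter keeps the check easy to test without touching real credentials.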

Installation

Clone the repository and install all requirements.

git clone https://github.com/potamides/arxiv-latex-extract.git
cd arxiv-latex-extract
pip install -r requirements.txt

In addition, this project needs latexpand to flatten LaTeX files, so make sure it is installed and on your PATH.
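One way to confirm that latexpand is reachable before starting a long run is a quick lookup on PATH. The snippet below is a small sketch using the standard library; the helper name `find_tool` is hypothetical and not part of this project.

```python
import shutil


def find_tool(name="latexpand"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


path = find_tool()
print(path if path else "latexpand not found on PATH")
```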

Usage

To launch the script, run main.py:

python main.py

It displays a progress bar and saves extracted files in extracted/. Archive files are downloaded to archives/ as needed and deleted right afterwards. By default, the script only extracts papers that make use of TikZ, i.e., papers that were released after October 23rd, 2005 (the release date of TikZ 1.0) and contain the phrase tikzpicture. To change this, adapt the modulino in main.py to your liking.
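The default filter described above amounts to a date check plus a substring check. The following is a rough sketch of that logic, written as a modulino (a module that can also be run as a script); the names `TIKZ_RELEASE` and `uses_tikz` are assumptions for illustration, and the actual filter in main.py may be structured differently.

```python
from datetime import date

# Release date of TikZ 1.0, as stated in this README.
TIKZ_RELEASE = date(2005, 10, 23)


def uses_tikz(source: str, released: date) -> bool:
    """Keep a paper only if it is newer than TikZ 1.0 and mentions tikzpicture."""
    return released > TIKZ_RELEASE and "tikzpicture" in source


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    sample = r"\begin{tikzpicture}\draw (0,0) -- (1,1);\end{tikzpicture}"
    print(uses_tikz(sample, date(2020, 1, 1)))
```

Swapping in a different predicate with the same signature (for example, one that checks for another environment name) would be one way to adapt the filter.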

Limitations

While this project worked wonderfully for my task, it is still a messy script that was hacked together in a short amount of time. Use at your own risk!

Acknowledgments

The code for cleaning up LaTeX files is largely based on the arXiv processing code of RedPajama-Data.