Extract LaTeX source code from arxiv.org bulk archives
APACHE-2.0 License
ALE is a tool for bulk extracting LaTeX sources from arXiv.org by processing arXiv Bulk Data. Unlike other tools that rely exclusively on Amazon S3 for downloading, ALE primarily uses the mirror on archive.org, which is a free alternative but may be out of date. If the optional boto3 package is also installed and the environment variables AWS_ACCESS_KEY and AWS_SECRET_KEY are set to valid AWS credentials, buckets missing from the mirror are retrieved from Amazon S3.
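Before starting a long run, it can help to check whether the S3 fallback would be usable at all. The following is a minimal sketch, not code from this repository: the function name s3_fallback_available is hypothetical, and it only checks that boto3 can be imported and that both variables are non-empty. It does not verify that the credentials are actually valid.

```python
import importlib.util
import os


def s3_fallback_available(env=None):
    """Return True if boto3 is importable and both credential variables are set.

    Hypothetical helper for illustration; it does not contact AWS, so it
    cannot tell whether the credentials are valid.
    """
    env = os.environ if env is None else env
    # find_spec returns None when the top-level package is not installed.
    has_boto3 = importlib.util.find_spec("boto3") is not None
    has_keys = bool(env.get("AWS_ACCESS_KEY")) and bool(env.get("AWS_SECRET_KEY"))
    return has_boto3 and has_keys


if __name__ == "__main__":
    print("S3 fallback available:", s3_fallback_available())
```

If this prints False, only the buckets present on the archive.org mirror can be processed.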
Clone the repository and install all requirements.
git clone https://github.com/potamides/arxiv-latex-extract.git
cd arxiv-latex-extract
pip install -r requirements.txt
In addition, this project needs latexpand to flatten LaTeX files, so make sure it is installed and on your PATH.
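One quick way to confirm that latexpand can be found is to look it up on PATH from Python. This is a small sketch using the standard library only; the helper name find_tool is hypothetical and not part of this project.

```python
import shutil


def find_tool(name="latexpand"):
    """Return the absolute path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    print(path if path else "latexpand not found on PATH")
```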
To launch the script, execute main.py:
python main.py
It displays a progress bar and saves extracted files in extracted/. Archive files are downloaded to archives/ as needed and deleted right afterwards. By default, the script only extracts papers that make use of TikZ, i.e., papers that were released after October 23rd, 2005 (release date of TikZ 1.0) and contain the phrase tikzpicture. To change this, adapt the modulino in main.py to your liking.
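To make the default criterion concrete, here is a minimal sketch of such a filter. It is an illustration under assumptions, not the code in main.py: the names TIKZ_RELEASE and uses_tikz are hypothetical, and the sketch assumes the release date and the flattened source text of a paper are already available.

```python
from datetime import date

# Release date of TikZ 1.0, as stated in the text above.
TIKZ_RELEASE = date(2005, 10, 23)


def uses_tikz(released, source):
    """Return True if a paper was released after TikZ 1.0 and mentions tikzpicture.

    Hypothetical predicate for illustration: `released` is a datetime.date,
    `source` is the LaTeX source as a string.
    """
    return released > TIKZ_RELEASE and "tikzpicture" in source


if __name__ == "__main__":
    sample = r"\begin{tikzpicture}\draw (0,0) -- (1,1);\end{tikzpicture}"
    print(uses_tikz(date(2012, 3, 1), sample))
```

A different selection rule, for example one based on another environment name or another cutoff date, would only need a different predicate of the same shape.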
While this project worked wonderfully for my task, it is still a messy script that was hacked together in a short amount of time. Use at your own risk!
The code for cleaning up LaTeX files is largely based on the arXiv processing code of RedPajama-Data.