Run online variational LDA on all the abstracts from the arXiv. The implementation is based on Matt Hoffman's GPL-licensed code.
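For orientation, the core of online variational LDA (Hoffman, Blei and Bach, 2010) is a stochastic update of the topic-word parameters lambda. Each mini-batch gives an estimate of lambda as if the whole corpus looked like that batch, and that estimate is blended into the running value with a decaying step size. The sketch below shows only this blending step. The function name, the argument names, and the default values of eta, tau0 and kappa are illustrative assumptions and are not taken from analysis.py.

```python
import numpy as np


def online_lambda_update(lam, sstats, t, n_docs, batch_size,
                         eta=0.01, tau0=1024.0, kappa=0.7):
    """Blend one mini-batch estimate into the topic-word parameters.

    lam        -- current lambda, shape (n_topics, n_words)
    sstats     -- expected per-topic word counts from the E-step on this batch
    t          -- number of mini-batches processed so far
    n_docs     -- total number of documents in the corpus
    batch_size -- number of documents in this mini-batch
    """
    # Step size decays with t; kappa in (0.5, 1] is the usual convergence condition.
    rho = (tau0 + t) ** (-kappa)
    # What lambda would be if the whole corpus looked like this mini-batch.
    lam_hat = eta + (n_docs / batch_size) * sstats
    # Weighted average of the old value and the new estimate.
    return (1.0 - rho) * lam + rho * lam_hat
```

Because rho shrinks as t grows, early mini-batches move lambda a lot and later ones only refine it, which is why such a run can be stopped at any point and still yield usable topics.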
You'll need a mongod instance running on the port given by the environment variable MONGO_PORT and a redis-server instance running on the port given by the REDIS_PORT environment variable.
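As a rough illustration of how the two ports might be picked up, here is a minimal sketch using pymongo and redis. The function names and the localhost default are assumptions made for this example, and analysis.py may wire things up differently.

```python
import os


def service_ports(env=None):
    """Read the MongoDB and Redis ports from the environment as integers."""
    env = os.environ if env is None else env
    return {
        "mongo": int(env["MONGO_PORT"]),
        "redis": int(env["REDIS_PORT"]),
    }


def connect(env=None, host="localhost"):
    """Open clients for both services (the host default is an assumption)."""
    # Imported here so the port lookup above works without the drivers installed.
    import pymongo
    import redis

    ports = service_ports(env)
    mongo = pymongo.MongoClient(host, ports["mongo"])
    cache = redis.Redis(host=host, port=ports["redis"])
    return mongo, cache
```

A missing variable raises KeyError from service_ports, which is a quick way to confirm that both MONGO_PORT and REDIS_PORT are exported before starting a long run.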
The code depends on the Python packages numpy, scipy, requests, pymongo and redis.
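To check up front that these packages are importable, something like the following sketch could be run first. The helper name missing_packages is hypothetical and is not part of analysis.py.

```python
import importlib.util

REQUIRED = ["numpy", "scipy", "requests", "pymongo", "redis"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Install them with: pip install " + " ".join(missing))
```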
mkdir abstracts

./analysis.py scrape abstracts
    Scrapes all the metadata from the arXiv into abstracts/raw-*.xml. This takes a long time.

./analysis.py parse abstracts/raw-*.xml
    Parses the raw responses and stores the results in the MongoDB database arxiv, in the collection abstracts.

./analysis.py build-vocab
    Counts all the words in the corpus, with some words removed.

./analysis.py get-vocab 100 5000 > vocab.txt
    Lists the vocabulary and saves it to vocab.txt.

./analysis.py run vocab.txt
    Runs online variational LDA on randomly sampled abstracts and writes its output to lambda-*.txt files. This will run forever, so just kill it whenever you like.

./analysis.py vocab.txt lambda-100.txt
    Lists the topics and their most