

ArXiv analysis

Run online variational LDA on all the abstracts from the arXiv. The implementation is based on Matt Hoffman's GPL-licensed code.
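For orientation, here is a minimal sketch of the topic update at the heart of online variational LDA. It is not taken from this repository: the function names, the plain-list matrix layout, and the default values for eta, tau0 and kappa are all assumptions chosen for illustration. Each step blends the current topic parameters (lambda) with an estimate computed from one mini-batch, using a step size that decays over time.

```python
def learning_rate(t, tau0=1024.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** -kappa for update number t.

    tau0 and kappa are illustrative defaults (assumptions), with
    kappa expected to lie in (0.5, 1].
    """
    return (tau0 + t) ** (-kappa)


def update_lambda(lam, sstats, t, corpus_size, batch_size,
                  eta=0.01, tau0=1024.0, kappa=0.7):
    """Blend the current topics with the mini-batch estimate.

    lam and sstats are topics-by-words matrices stored as lists of lists.
    sstats holds the expected word counts per topic for the mini-batch.
    """
    rho = learning_rate(t, tau0, kappa)
    # Rescale the mini-batch statistics as if the whole corpus had been seen.
    scale = corpus_size / batch_size
    return [
        [(1.0 - rho) * l + rho * (eta + scale * s)
         for l, s in zip(lam_row, ss_row)]
        for lam_row, ss_row in zip(lam, sstats)
    ]
```

With a small tau0 the early mini-batches dominate the result; larger values slow down the initial updates.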


You'll need a mongod instance running on the port given by the MONGO_PORT environment variable and a redis-server instance running on the port given by the REDIS_PORT environment variable.
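Before starting, it can help to confirm that both services are reachable. The sketch below uses only the Python standard library. The helper names and the choice of localhost as the host are assumptions, not part of this project.

```python
import os
import socket


def port_from_env(name):
    """Read a TCP port number from the environment variable `name`."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError("environment variable %s is not set" % name)
    return int(value)


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(names=("MONGO_PORT", "REDIS_PORT")):
    """Map each environment variable name to (port, reachable)."""
    result = {}
    for name in names:
        port = port_from_env(name)
        result[name] = (port, is_listening(port))
    return result
```

Calling check_services() returns a dictionary with one (port, reachable) pair per variable. This only checks that a port is open, not that the right kind of server is behind it.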

The code depends on the following Python packages: numpy, scipy, requests, pymongo, and redis.

  • mkdir abstracts
  • ./ scrape abstracts — scrapes all the metadata from the arXiv
    OAI interface and saves the raw XML
    responses as abstracts/raw-*.xml. This takes a long time because of
    the arXiv's flow control policies. It took me approximately 6 hours.
  • ./ parse abstracts/raw-*.xml — parses the raw responses and
    saves the abstracts to a MongoDB database called arxiv in the collection
    called abstracts.
  • ./ build-vocab — counts all the words in the corpus removing
    anything with fewer than 3 characters and any stop words.
  • ./ get-vocab 100 5000 > vocab.txt — lists the vocabulary
    skipping the 100 most popular words and keeping 5000 words total.
  • ./ run vocab.txt — runs online variational LDA by randomly
    selecting articles from the database. The topic distributions are stored
    in the lambda-*.txt files. This will run forever, so just kill it whenever
    you feel like it.
  • ./ vocab.txt lambda-100.txt — lists the topics and their most
    common words at step 100.
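
The vocabulary and topic-listing steps above can be sketched roughly as follows. This is an illustration rather than the project's actual code. The regular-expression tokenizer, the tiny stop-word list, the function names, and the assumption that each topic is a row of per-word weights in vocabulary order are all hypothetical.

```python
from collections import Counter
import re

# Assumption: a tiny illustrative stop-word list, not the one the project uses.
STOP_WORDS = {"the", "and", "for", "with", "that", "this"}


def count_words(abstracts, min_length=3, stop_words=STOP_WORDS):
    """Count words, dropping short words and stop words."""
    counts = Counter()
    for text in abstracts:
        # Assumption: words are runs of ASCII letters, compared in lowercase.
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length and word not in stop_words:
                counts[word] += 1
    return counts


def select_vocab(counts, skip=100, keep=5000):
    """Skip the `skip` most popular words, then keep the next `keep`."""
    ranked = [word for word, _ in counts.most_common()]
    return ranked[skip:skip + keep]


def top_words(topic_weights, vocab, n=10):
    """Return the n words with the largest weight in one topic."""
    order = sorted(range(len(vocab)),
                   key=lambda i: topic_weights[i], reverse=True)
    return [vocab[i] for i in order[:n]]
```

For example, select_vocab(count_words(abstracts), 100, 5000) mirrors the arguments given to get-vocab above. Skipping the most popular words removes terms that appear in nearly every abstract and so carry little topical signal.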