"word2vec for Lucene" extracts word vectors from a Lucene index.
This section explains how to use the demo environment provided with this project.
Download Apache Solr 4.10.2 (recommended) and unzip the downloaded file into an appropriate directory. Go to the example directory and launch Solr with the solr.solr.home and solr.dir properties set.
$ cd solr-4.10.2/example
$ java -Dsolr.solr.home=${word2vec-lucene}/solrhome -Dsolr.dir=${solr-inst-dir} -jar start.jar
Run ant to prepare the sample input file text8.xml.
$ ant t8-solr
This takes several minutes.
Index the text8.xml file into Solr.
$ ./post.sh collection1 text8.xml
Once you have the Lucene index, you can create the vectors.txt file.
$ ./demo-word2vec.sh collection1
With the -f option, you can specify an arbitrary output vectors file.
$ ./demo-word2vec.sh collection1 -f vectors_my.txt
If you have a PDF file of the book "Lucene in Action", post it to Solr.
$ ./solrcell.sh LuceneInAction.pdf
Download the livedoor news corpus from the RONDHUIT site and extract it into an appropriate directory.
$ cd ${word2vec-lucene}
$ mkdir work
$ cd work
$ wget http://www.rondhuit.com/download/livedoor-news-data.tar.gz
$ tar xvzf livedoor-news-data.tar.gz
$ cd ..
Index the livedoor news corpus XML files into Solr.
$ ./post.sh ldcc work/*.xml
Once you have the Lucene index, you can create the vectors.txt file. The -a option specifies the analyzer class, here JapaneseAnalyzer.
$ ./demo-word2vec.sh ldcc -a org.apache.lucene.analysis.ja.JapaneseAnalyzer
With the -f option, you can specify an arbitrary output vectors file.
$ ./demo-word2vec.sh ldcc -a org.apache.lucene.analysis.ja.JapaneseAnalyzer -f vectors_my.txt
Once you have the word vectors file vectors.txt, you can find the 40 words closest to a word you specify.
With the -f option, you can specify an arbitrary input vectors file.
$ ./demo-distance.sh [-f <vectors_file>]
cat
Word: cat
Position in vocabulary: 2601
Word                    Cosine distance
------------------------------------------------------------------------
rat                     0.591972
cats                    0.587605
hyena                   0.583455
squirrel                0.580696
dogs                    0.568277
dog                     0.556022
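The ranking above can be sketched in a few lines of Python. This is only an illustration and not the project's implementation: the function names and the toy 3-dimensional vectors are made up, and it assumes the reported score is the cosine similarity between two vectors (higher means closer), which is how the listing above reads.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_words(vectors, query, top_n=40):
    """Return up to top_n (word, score) pairs, most similar first."""
    target = vectors[query]
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word != query  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy vectors, invented for illustration only (real vectors have many more dimensions).
toy_vectors = {
    "cat": [1.0, 0.9, 0.1],
    "cats": [0.9, 1.0, 0.2],
    "dog": [0.8, 0.5, 0.4],
    "car": [0.1, 0.2, 1.0],
}

for word, score in closest_words(toy_vectors, "cat", top_n=3):
    print(f"{word}\t{score:.6f}")
```

With these toy values, "cats" ranks first and "car" last, mirroring the shape of the demo output.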
You can also perform vector arithmetic, e.g. vector('paris') - vector('france') + vector('italy') or vector('king') - vector('man') + vector('woman').
With the -f option, you can specify an arbitrary input vectors file.
$ ./demo-analogy.sh [-f <vectors_file>]
france paris italy
man king woman
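The arithmetic behind an input such as "man king woman" can be sketched as follows. This is a minimal illustration, not the code behind demo-analogy.sh: the function names and the hand-made 4-dimensional vectors are hypothetical, and normalizing each vector to unit length before combining them is an assumption.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def analogy(vectors, a, b, c, top_n=1):
    """Rank words by similarity to vector(b) - vector(a) + vector(c)."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    target = normalize(
        [y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])]
    )
    scored = [
        (word, sum(p * q for p, q in zip(target, vec)))
        for word, vec in unit.items()
        if word not in (a, b, c)  # the three input words are excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hand-made vectors with axes (male, female, royal, fruit); illustration only.
toy_vectors = {
    "man": [1.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 0.0],
    "king": [1.0, 0.0, 1.0, 0.0],
    "queen": [0.0, 1.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

# Same word order as the demo input "man king woman".
print(analogy(toy_vectors, "man", "king", "woman"))
```

On this toy data the top-ranked word is "queen", because subtracting "man" from "king" leaves mostly the royal component and adding "woman" moves the result toward the female axis.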
This tool supports plain text files as well as Lucene indexes. See TextFileCreateVectors.java for details. The words in the text file must be separated by whitespace. English text is normally written this way, so it needs no preprocessing. For some languages, such as Japanese, you need to "tokenize" the sentences into space-separated words before running TextFileCreateVectors.java.
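The following Python sketch shows why the pretokenizing step matters. It is not taken from TextFileCreateVectors.java; it only demonstrates what splitting on whitespace yields. The function name is hypothetical, and the segmented Japanese string is written by hand to stand in for the output of whatever tokenizer you run beforehand.

```python
def whitespace_tokens(line):
    """Split a line into words on runs of whitespace."""
    return line.split()

english = "the cat sat on the mat"
japanese_raw = "猫が好きです"            # unsegmented sentence, no spaces
japanese_tokenized = "猫 が 好き です"    # hand-segmented stand-in for tokenizer output

print(whitespace_tokens(english))             # one token per English word
print(whitespace_tokens(japanese_raw))        # the whole sentence becomes a single "word"
print(whitespace_tokens(japanese_tokenized))  # one token per segmented word
```

Without segmentation, every distinct sentence would be treated as one vocabulary entry, so no useful word vectors could be learned from it.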