A small toolkit to generate count-based, PPMI-weighted, SVD-reduced Distributional Semantic Models.
pip install counterix
or, after a git clone:
python3 setup.py install
To generate a raw count matrix from a tokenized corpus, run:
counterix generate \
--corpus /abs/path/to/corpus/txt/file \
--min-count frequency_threshold \
--win-size window_size
If the --output parameter is not set, the output files will be saved to the corpus directory.
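As a rough illustration of what a raw count model contains, here is a minimal Python sketch of window-based co-occurrence counting with a frequency threshold. It is not counterix's implementation: the function name `count_cooccurrences`, the symmetric window, and the choice to drop rare words before applying the window are all assumptions made for this example.

```python
from collections import Counter


def count_cooccurrences(sentences, min_count=1, win_size=2):
    """Count symmetric word co-occurrences within a sliding window.

    Hypothetical helper for illustration only; counterix may count differently.
    """
    # Word frequencies over the whole corpus, used for the frequency threshold.
    freqs = Counter(tok for sent in sentences for tok in sent)
    counts = Counter()
    for sent in sentences:
        # Assumption: words below the threshold are dropped before windowing.
        kept = [tok for tok in sent if freqs[tok] >= min_count]
        for i, target in enumerate(kept):
            # Pair the target with up to win_size words to its right,
            # and record the pair in both directions.
            for context in kept[i + 1 : i + 1 + win_size]:
                counts[(target, context)] += 1
                counts[(context, target)] += 1
    return counts


# Toy tokenized corpus: one list of tokens per line.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
counts = count_cooccurrences(corpus, min_count=2, win_size=1)
```

In this toy corpus only "the", "sat" and "on" reach the threshold of 2, so the resulting counts cover pairs of those three words only.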
To weight a raw count model with PPMI, run:
counterix weigh --model /abs/path/to/raw/count/npz/model
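For readers unfamiliar with PPMI (Positive Pointwise Mutual Information), the sketch below shows the standard formula, max(0, log(P(w,c) / (P(w) P(c)))), applied to a small dictionary of counts. The function name `ppmi`, the dictionary representation, and the base-2 logarithm are assumptions for this example and may differ from what counterix does internally.

```python
import math
from collections import Counter


def ppmi(counts):
    """Return PPMI weights for a {(word, context): count} mapping.

    Hypothetical helper for illustration only. Cells whose PMI is not
    positive are omitted, which corresponds to a zero in a sparse matrix.
    """
    total = sum(counts.values())
    row_sums = Counter()
    col_sums = Counter()
    for (word, context), n in counts.items():
        row_sums[word] += n
        col_sums[context] += n
    weighted = {}
    for (word, context), n in counts.items():
        # PMI = log2( P(w,c) / (P(w) * P(c)) ), written in terms of counts.
        pmi = math.log2(n * total / (row_sums[word] * col_sums[context]))
        if pmi > 0:
            weighted[(word, context)] = pmi
    return weighted


# Toy counts, invented for the example.
counts = {("cat", "purrs"): 2, ("cat", "the"): 1, ("dog", "the"): 1}
weighted = ppmi(counts)
```

Here the pair ("cat", "the") co-occurs less often than chance would predict, so its PMI is negative and it is left out of the result.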
To apply SVD to a PPMI-weighted model, with k=10000, run:
counterix svd \
--model /abs/path/to/ppmi/npz/model \
--dim 10000
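To make the role of the dimension k concrete, the following sketch reduces a tiny dense PPMI-like matrix to k=2 dimensions with NumPy. It only illustrates the idea of truncated SVD: the function name `truncated_svd`, the toy matrix, and the choice to scale the left singular vectors by the singular values are assumptions, and the solver counterix actually uses is not stated here. For a large sparse matrix and k=10000, a sparse solver such as `scipy.sparse.linalg.svds` would be a more realistic choice than the dense routine below.

```python
import numpy as np


def truncated_svd(matrix, dim):
    """Return the top `dim` left singular vectors and singular values.

    Hypothetical helper for illustration only.
    """
    # Dense SVD; NumPy returns singular values sorted in descending order.
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the first `dim` components.
    return u[:, :dim], s[:dim]


# Toy symmetric matrix standing in for a PPMI model (values invented).
ppmi_matrix = np.array([
    [0.0, 1.2, 0.0, 0.4],
    [1.2, 0.0, 0.3, 0.0],
    [0.0, 0.3, 0.0, 0.9],
    [0.4, 0.0, 0.9, 0.0],
])
u, s = truncated_svd(ppmi_matrix, dim=2)
# Assumption: word vectors are the left singular vectors scaled by s.
embeddings = u * s
```

Each row of `embeddings` is then a 2-dimensional vector for the corresponding word.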
To control the number of threads used during SVD, set the OMP_NUM_THREADS environment variable when running counterix, for example by prefixing the command with env OMP_NUM_THREADS=1.