This is an entry to Kaggle's Sentiment Analysis on Movie Reviews (SAMR) competition.
It's written for Python 3.3 and it's based on scikit-learn and nltk.
Quoting from Kaggle's description page:
This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive.
Some examples:
So the goal of the competition is to produce an algorithm to classify phrases
into these categories. And that's what samr does.
After installing, just run:
generate_kaggle_submission.py samr/data/model2.json > submission.csv
And that will generate a Kaggle submission file that scores near 0.65844 on the
leaderboard (it should take about 3 minutes, and as of 2014-07-22 that score was
good for 2nd place).
The model2.json argument above is a configuration file for samr that
determines how the scikit-learn pipeline is going to be built, along with other
hyperparameters. Here is how it looks:
{
    "classifier": "randomforest",
    "classifier_args": {"n_estimators": 100, "min_samples_leaf": 10, "n_jobs": -1},
    "lowercase": "true",
    "map_to_synsets": "true",
    "map_to_lex": "true",
    "duplicates": "true"
}
You can try samr with configuration files of your own (as long as the
options are implemented), yielding different and perhaps even better scores.
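As a rough sketch of how such a variation could be produced, the snippet below copies the configuration shown above and overrides one of the classifier arguments. The helper name make_variant and the value 200 are made up for illustration; whether a given change improves the score is something you would have to measure.

```python
import json

# The configuration shown above, reproduced as a Python dict.
# The on/off options are kept as the strings "true", exactly as in model2.json.
base_config = {
    "classifier": "randomforest",
    "classifier_args": {"n_estimators": 100, "min_samples_leaf": 10, "n_jobs": -1},
    "lowercase": "true",
    "map_to_synsets": "true",
    "map_to_lex": "true",
    "duplicates": "true",
}


def make_variant(config, **classifier_overrides):
    """Return a copy of config with some classifier_args replaced.

    Hypothetical helper, not part of samr.
    """
    variant = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    variant["classifier_args"].update(classifier_overrides)
    return variant


# Illustrative change only: more trees, everything else untouched.
variant = make_variant(base_config, n_estimators=200)
print(json.dumps(variant, indent=4))
```

Saving the printed JSON to a file of your own and passing that file to generate_kaggle_submission.py in place of samr/data/model2.json would then run the variation.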
In particular, model2.json feeds a random forest classifier
with a concatenation of 3 kinds of features:
During prediction, it also checks for duplicates between the test set and the train set (there are quite a few).
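The duplicate check can be pictured with a minimal sketch like the one below. This is not samr's actual code: the function name, the fallback_predict callable and the integer labels are assumptions made for illustration. The idea is simply that a phrase already seen in the train set reuses its known label, and only unseen phrases go through the classifier.

```python
def predict_with_duplicates(train_phrases, train_labels, test_phrases, fallback_predict):
    """Label test phrases, reusing the training label for exact duplicates.

    fallback_predict is any callable taking a list of phrases and returning
    a list of labels (a stand-in for the trained classifier).
    """
    # Map each training phrase to its label; if a phrase is repeated in the
    # training data, the last label seen wins.
    known = dict(zip(train_phrases, train_labels))

    # Only phrases not seen during training are sent to the classifier.
    unseen = [p for p in test_phrases if p not in known]
    predicted = dict(zip(unseen, fallback_predict(unseen))) if unseen else {}

    return [known[p] if p in known else predicted[p] for p in test_phrases]


# Made-up toy data, for illustration only (labels assumed to be integers).
train_phrases = ["a great movie", "dull"]
train_labels = [4, 1]


def always_neutral(phrases):
    """Trivial stand-in classifier that labels everything as 2."""
    return [2 for _ in phrases]


labels = predict_with_duplicates(
    train_phrases, train_labels, ["dull", "something new"], always_neutral
)
print(labels)
```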
And that's it! Want more details? See the code! It's only 350 lines.
If you know the drill, this should be enough:
git clone https://github.com/rafacarrascosa/samr.git
pip install -e samr -r samr/docs/setup/requirements-dev.txt
download_3rdparty_data.py
Then you will need to manually download train.tsv and test.tsv from the
competition's data folder and unzip them into the samr/data folder. You may be
asked to join Kaggle and/or accept the competition rules before downloading the data.
Even though samr is written for Python 3.3, it may also work with Python 2.7
(and the last time I checked it did), but this is not supported and it may
break in the future.
If the short instructions are not enough, read on.
These instructions will install the development version of samr inside a
Python 3.3 virtualenv. They were written for a blank, vanilla Ubuntu 14.04 and
tested using Docker (awesome tool btw). They should
work more or less unchanged with other Ubuntu versions and Debian-based OSs.
Open a console and 'cd' into an empty folder of your choice. Now, execute the following commands:
Install Python 3.3 and compilation requirements for numpy and scipy:
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:fkrull/deadsnakes
sudo apt-get update
sudo apt-get install -y python3.3 python3.3-dev python-scipy gfortran libopenblas-dev liblapack-dev git wget
Create virtualenv, bootstrap pip and bootstrap numpy:
python3.3 -m venv venv
source venv/bin/activate
wget https://bootstrap.pypa.io/get-pip.py
python3.3 get-pip.py
echo 'PATH="$VIRTUAL_ENV/local/bin:$PATH"; export PATH' >> venv/bin/activate
source venv/bin/activate
pip install numpy==1.8.1
Clone and install samr:
git clone https://github.com/rafacarrascosa/samr.git
pip install -e samr -r samr/docs/setup/requirements-dev.txt
download_3rdparty_data.py
Optionally run the tests:
nosetests samr/tests
Lastly, you will need to manually download train.tsv and test.tsv from the
competition's data folder and unzip them into the samr/data folder. You may be
asked to join Kaggle and/or accept the competition rules before downloading the data.
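If you want to double-check that the data ended up where it is expected, a small script along these lines could help. It is only a sketch: the function name missing_data_files is made up, and it assumes you run it from the folder that contains the samr clone.

```python
import os

# The two files that have to be downloaded manually from Kaggle.
REQUIRED_FILES = ("train.tsv", "test.tsv")


def missing_data_files(data_dir=os.path.join("samr", "data")):
    """Return the names of the required files not found in data_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from samr/data: {}".format(", ".join(missing)))
    else:
        print("train.tsv and test.tsv are in place.")
```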
The installation is self-contained (within the folder you chose at the start) with two exceptions:
sudo apt-get made system-wide changes; to uninstall them, use sudo apt-get remove.
nltk downloads data to ~/nltk_data; once you no longer use nltk, it's safe to delete.
This project is open-source and BSD licensed, see the LICENSE file for details.
This license basically allows you to do anything, but in case you're wondering:
I'm ok if you use samr to beat my score at the competition, just share back
what you've learned!
This project was developed by Rafael Carrascosa, you can contact me at [email protected].