Official implementation of: Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic
The official implementation of our paper "Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic", submitted to Task 8 (Semantic Question Similarity in Arabic) of the NSURL-2019 workshop.
Install the required packages listed in the requirements.txt file:
pip install -r requirements.txt
Then install ELMoForManyLangs and download the pretrained Arabic ELMo model (model 136 from the NLPL vectors repository):

git clone https://github.com/HIT-SCIR/ELMoForManyLangs.git
cd ELMoForManyLangs
python setup.py install
cd ..
mkdir -p elmo_dir
wget http://vectors.nlpl.eu/repository/11/136.zip -O elmo_dir/136.zip
unzip elmo_dir/136.zip -d elmo_dir
cp ELMoForManyLangs/configs/cnn_50_100_512_4096_sample.json elmo_dir/cnn_50_100_512_4096_sample.json
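To sanity-check the setup, the downloaded model can be loaded with ELMoForManyLangs' Embedder class. A minimal sketch; the example sentence is arbitrary:

from elmoformanylangs import Embedder

# Load the pretrained Arabic ELMo model unpacked into elmo_dir above.
embedder = Embedder('elmo_dir')

# sents2elmo maps tokenized sentences to per-token contextual embeddings.
vectors = embedder.sents2elmo([['كيف', 'حالك', '؟']])
print(vectors[0].shape)  # (3, 1024)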
Preprocess the data to separate punctuation marks from words:
python 1_preprocess.py --dataset-split train
python 1_preprocess.py --dataset-split test
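The core of this step amounts to padding punctuation with spaces. A minimal, hypothetical sketch of the idea (the actual logic lives in 1_preprocess.py):

import re

def separate_punctuation(text):
    # Insert spaces around any non-word, non-space character (Latin and
    # Arabic punctuation alike), then collapse repeated whitespace.
    text = re.sub(r'([^\w\s])', r' \1 ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(separate_punctuation('كيف حالك؟'))  # كيف حالك ؟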
Enlarge the training data using the positive and negative transitive properties described in the paper (illustrated in the sketch after the command):
python 2_enlarge.py
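Concretely: if question A is a duplicate of B and B is a duplicate of C, then A and C are labeled duplicates (positive transitivity); if A is a duplicate of B but B is not a duplicate of C, then A and C are labeled non-duplicates (negative transitivity). A hypothetical sketch of this idea, not the exact logic of 2_enlarge.py:

def enlarge(pairs):
    # pairs: list of (question_a, question_b, label) tuples,
    # label 1 for duplicates and 0 for non-duplicates.
    seen = {(a, b) for a, b, _ in pairs} | {(b, a) for a, b, _ in pairs}
    new_pairs = []
    for a, b, label_ab in pairs:
        if label_ab != 1:
            continue  # both inferences start from a known duplicate pair
        for c, d, label_cd in pairs:
            if b != c or a == d or (a, d) in seen:
                continue
            # Positive transitivity: a~b and b~d imply a~d (label 1).
            # Negative transitivity: a~b and b!~d imply a!~d (label 0).
            new_pairs.append((a, d, label_cd))
            seen.add((a, d))
            seen.add((d, a))
    return pairs + new_pairs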
To make the training step faster, we pre-generate word embeddings from either the ELMo or the BERT model and store them in a pickle file:
python 3_build_embeddings_dict.py --embeddings-type elmo # For ELMo
python 3_build_embeddings_dict.py --embeddings-type bert # For BERT
We adopted ELMoForManyLangs over bert-embedding because it yields better results.
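A minimal sketch of the caching idea for the ELMo case; the input file name and the dictionary layout are assumptions, not necessarily the script's actual ones:

import pickle
from elmoformanylangs import Embedder

embedder = Embedder('elmo_dir')

# Hypothetical input: one preprocessed question per line, tokens separated
# by spaces after the punctuation-separation step above.
with open('data/train_processed.txt', encoding='utf-8') as f:
    questions = [line.split() for line in f]

# Cache each question's (num_tokens x 1024) ELMo matrix so training can
# look embeddings up instead of recomputing them every epoch.
embeddings = dict(zip((' '.join(q) for q in questions),
                      embedder.sents2elmo(questions)))

with open('embeddings_dict.pkl', 'wb') as f:
    pickle.dump(embeddings, f)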
Train the model using ELMo embeddings with a dropout rate of 0.2, a batch size of 256, 100 epochs, and a dev set of 2,000 examples:
python 4_train.py --embeddings-type elmo --dropout-rate 0.2 --batch-size 256 --epochs 100 --dev-split 2000
This hyperparameter setup gives the best results in our experiments; change the values to experiment further.
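For orientation only, here is a generic Siamese BiLSTM classifier over the pre-generated ELMo embeddings, written in Keras. This is a sketch with illustrative layer sizes, not the paper's exact architecture (see the figure below):

from keras.layers import Input, LSTM, Bidirectional, Dense, Dropout, concatenate
from keras.models import Model

# Two questions enter as padded sequences of 1024-dim ELMo vectors.
q1 = Input(shape=(None, 1024))
q2 = Input(shape=(None, 1024))

# A shared BiLSTM encodes both questions into fixed-size vectors.
encoder = Bidirectional(LSTM(256))
v1, v2 = encoder(q1), encoder(q2)

# Concatenate the two encodings, apply dropout, and classify.
merged = Dropout(0.2)(concatenate([v1, v2]))
output = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[q1, q2], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])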
Inference on the test set requires the path to a model checkpoint. The default classification threshold is 0.5 and can be changed with the optional --threshold argument:
python 5_infer.py --model-path checkpoints/epoch100.h5
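The thresholding itself is straightforward; a small sketch assuming the model emits one similarity probability per question pair:

import numpy as np

def to_labels(probabilities, threshold=0.5):
    # Pairs scoring at or above the threshold are labeled duplicates (1).
    return (np.asarray(probabilities) >= threshold).astype(int)

print(to_labels([0.91, 0.12, 0.50]))  # [1 0 1]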
The following figure illustrates the structure of our best model.
The project is available as open source under the terms of the MIT License.