Official implementation of: Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic
The official implementation of our paper "Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic", submitted to Task 8 (Semantic Question Similarity in Arabic) of the NSURL-2019 workshop.
Install the required packages listed in the requirements.txt file:
pip install -r requirements.txt
Then install ELMoForManyLangs and download the pretrained Arabic ELMo model (model 136 from the NLPL vectors repository):

git clone https://github.com/HIT-SCIR/ELMoForManyLangs.git
cd ELMoForManyLangs
python setup.py install
cd ..
mkdir -p elmo_dir
wget http://vectors.nlpl.eu/repository/11/136.zip -O elmo_dir/136.zip
unzip elmo_dir/136.zip -d elmo_dir
cp ELMoForManyLangs/configs/cnn_50_100_512_4096_sample.json elmo_dir/cnn_50_100_512_4096_sample.json
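To sanity-check the setup, the downloaded model can be loaded with ELMoForManyLangs' Embedder class. A minimal sketch; the example sentence is arbitrary:

from elmoformanylangs import Embedder

# Load the pretrained Arabic ELMo model unpacked into elmo_dir above.
embedder = Embedder('elmo_dir')

# sents2elmo maps tokenized sentences to per-token contextual embeddings.
vectors = embedder.sents2elmo([['كيف', 'حالك', '؟']])
print(vectors[0].shape)  # (3, 1024)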
Preprocess the data to separate punctuation marks from words:
python 1_preprocess.py --dataset-split train
python 1_preprocess.py --dataset-split test
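The core of this step amounts to padding punctuation with spaces. A minimal, hypothetical sketch of the idea (the actual logic lives in 1_preprocess.py):

import re

def separate_punctuation(text):
    # Insert spaces around any non-word, non-space character (Latin and
    # Arabic punctuation alike), then collapse repeated whitespace.
    text = re.sub(r'([^\w\s])', r' \1 ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(separate_punctuation('كيف حالك؟'))  # كيف حالك ؟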
Enlarge the training data using the positive and negative transitive properties described in the paper (illustrated in the sketch after the command):
python 2_enlarge.py
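Concretely: if question A is a duplicate of B and B is a duplicate of C, then A and C are labeled duplicates (positive transitivity); if A is a duplicate of B but B is not a duplicate of C, then A and C are labeled non-duplicates (negative transitivity). A hypothetical sketch of this idea, not the exact logic of 2_enlarge.py:

def enlarge(pairs):
    # pairs: list of (question_a, question_b, label) tuples,
    # label 1 for duplicates and 0 for non-duplicates.
    seen = {(a, b) for a, b, _ in pairs} | {(b, a) for a, b, _ in pairs}
    new_pairs = []
    for a, b, label_ab in pairs:
        if label_ab != 1:
            continue  # both inferences start from a known duplicate pair
        for c, d, label_cd in pairs:
            if b != c or a == d or (a, d) in seen:
                continue
            # Positive transitivity: a~b and b~d imply a~d (label 1).
            # Negative transitivity: a~b and b!~d imply a!~d (label 0).
            new_pairs.append((a, d, label_cd))
            seen.add((a, d))
            seen.add((d, a))
    return pairs + new_pairs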
To make the training step faster, we pre-generate word embeddings from either the ELMo or the BERT model and store them in a pickle file:
python 3_build_embeddings_dict.py --embeddings-type elmo # For ELMo
python 3_build_embeddings_dict.py --embeddings-type bert # For BERT
We adopted ELMoForManyLangs over bert-embedding because it yields better results.
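A minimal sketch of the caching idea for the ELMo case; the input file name and the dictionary layout are assumptions, not necessarily the script's actual ones:

import pickle
from elmoformanylangs import Embedder

embedder = Embedder('elmo_dir')

# Hypothetical input: one preprocessed question per line, tokens separated
# by spaces after the punctuation-separation step above.
with open('data/train_processed.txt', encoding='utf-8') as f:
    questions = [line.split() for line in f]

# Cache each question's (num_tokens x 1024) ELMo matrix so training can
# look embeddings up instead of recomputing them every epoch.
embeddings = dict(zip((' '.join(q) for q in questions),
                      embedder.sents2elmo(questions)))

with open('embeddings_dict.pkl', 'wb') as f:
    pickle.dump(embeddings, f)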
Train the model using ELMo embeddings with a dropout rate of 0.2, a batch size of 256, 100 epochs, and a dev set of 2,000 examples:
python 4_train.py --embeddings-type elmo --dropout-rate 0.2 --batch-size 256 --epochs 100 --dev-split 2000
This hyperparameter setup gives the best results in our experiments; change the values to experiment further.
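For orientation only, here is a generic Siamese BiLSTM classifier over the pre-generated ELMo embeddings, written in Keras. This is a sketch with illustrative layer sizes, not the paper's exact architecture (see the figure below):

from keras.layers import Input, LSTM, Bidirectional, Dense, Dropout, concatenate
from keras.models import Model

# Two questions enter as padded sequences of 1024-dim ELMo vectors.
q1 = Input(shape=(None, 1024))
q2 = Input(shape=(None, 1024))

# A shared BiLSTM encodes both questions into fixed-size vectors.
encoder = Bidirectional(LSTM(256))
v1, v2 = encoder(q1), encoder(q2)

# Concatenate the two encodings, apply dropout, and classify.
merged = Dropout(0.2)(concatenate([v1, v2]))
output = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[q1, q2], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])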
Inference on the test set requires the path to a model checkpoint. The default classification threshold is 0.5 and can be changed with the optional --threshold argument:
python 5_infer.py --model-path checkpoints/epoch100.h5
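The thresholding itself is straightforward; a small sketch assuming the model emits one similarity probability per question pair:

import numpy as np

def to_labels(probabilities, threshold=0.5):
    # Pairs scoring at or above the threshold are labeled duplicates (1).
    return (np.asarray(probabilities) >= threshold).astype(int)

print(to_labels([0.91, 0.12, 0.50]))  # [1 0 1]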
The following figure illustrates the structure of our best model.
The project is available as open source under the terms of the MIT License.