NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

OTHER License

Stars
1.8K

NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

Introduction

NeuralClassifier is designed for quick implementation of neural models for hierarchical multi-label classification task, which is more challenging and common in real-world scenarios. A salient feature is that NeuralClassifier currently provides a variety of text encoders, such as FastText, TextCNN, TextRNN, RCNN, VDCNN, DPCNN, DRNN, AttentiveConvNet and Transformer encoder, etc. It also supports other text classification scenarios, including binary-class and multi-class classification. It is built on PyTorch. Experiments show that models built in our toolkit achieve comparable performance with reported results in the literature.

Support tasks

  • Binary-class text classifcation
  • Multi-class text classification
  • Multi-label text classification
  • Hiearchical (multi-label) text classification (HMC)

Support text encoders

Requirement

  • Python 3
  • PyTorch 0.4+
  • Numpy 1.14.3+

System Architecture

Usage

Training

How to train a non-hierarchical classifier

python train.py conf/train.json 
  • set task_info.hierarchical = false.
  • model_name can be FastTextTextCNNTextRNNTextRCNNDRNNVDCNNDPCNNAttentiveConvNetTransformer.

How to train a hierarchical classifier using hierarchial penalty

python train.py conf/train.hierar.json
  • set task_info.hierarchical = true.
  • model_name can be FastTextTextCNNTextRNNTextRCNNDRNNVDCNNDPCNNAttentiveConvNetTransformer

How to train a hierarchical classifier with HMCN

python train.py conf/train.hmcn.json 
  • set task_info.hierarchical = false.
  • set model_name = HMCN

Detail configurations and explanations see Configuration.

The training info will be outputted in standard output and log.logger_file.

Evaluation

python eval.py conf/train.json
  • if eval.is_flat = false, hierarchical evaluation will be outputted.
  • eval.model_dir is the model to evaluate.
  • data.test_json_files is the input text file to evaluate.

The evaluation info will be outputed in eval.dir.

Prediction

python predict.py conf/train.json data/predict.json 
  • predict.json should be of json format, while each instance has a dummy label like "" or any other label in label map.
  • eval.model_dir is the model to predict.
  • eval.top_k is the number of labels to output.
  • eval.threshold is the probability threshold.

The predict info will be outputed in predict.txt.

Input Data Format

JSON example:

{
    "doc_label": ["Computer--MachineLearning--DeepLearning", "Neuro--ComputationalNeuro"],
    "doc_token": ["I", "love", "deep", "learning"],
    "doc_keyword": ["deep learning"],
    "doc_topic": ["AI", "Machine learning"]
}

"doc_keyword" and "doc_topic" are optional.

Performance

0. Dataset

1. Compare with state-of-the-art

2. Different text encoders

3. Hierarchical vs Flat

Acknowledgement

Some public codes are referenced by our toolkit:

Update

  • 2019-04-29, init version
Related Projects