NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

Introduction

NeuralClassifier is designed for quick implementation of neural models for hierarchical multi-label classification, a more challenging task that is common in real-world scenarios. A salient feature is that NeuralClassifier provides a variety of text encoders, including FastText, TextCNN, TextRNN, RCNN, VDCNN, DPCNN, DRNN, AttentiveConvNet and the Transformer encoder. It also supports other text classification scenarios, including binary-class and multi-class classification. It is built on PyTorch. Experiments show that models built with the toolkit achieve performance comparable to results reported in the literature.

Supported tasks

Binary-class text classification
Multi-class text classification
Multi-label text classification
Hierarchical (multi-label) text classification (HMC)

Supported text encoders

TextCNN (Kim, 2014)
RCNN (Lai et al., 2015)
TextRNN (Liu et al., 2016)
FastText (Joulin et al., 2016)
VDCNN (Conneau et al., 2016)
DPCNN (Johnson and Zhang, 2017)
AttentiveConvNet (Yin and Schütze, 2017)
DRNN (Wang, 2018)
Region embedding (Qiao et al., 2018)
Transformer encoder (Vaswani et al., 2017)
Star-Transformer encoder (Guo et al., 2019)
HMCN (Wehrmann et al., 2018)

Requirements

Python 3
PyTorch 0.4+
NumPy 1.14.3+

System Architecture

Usage

Training

How to train a non-hierarchical classifier

python train.py conf/train.json

Set task_info.hierarchical = false.
model_name can be FastText, TextCNN, TextRNN, TextRCNN, DRNN, VDCNN, DPCNN, AttentiveConvNet or Transformer.

How to train a hierarchical classifier using a hierarchical penalty

python train.py conf/train.hierar.json

Set task_info.hierarchical = true.
model_name can be FastText, TextCNN, TextRNN, TextRCNN, DRNN, VDCNN, DPCNN, AttentiveConvNet or Transformer.

How to train a hierarchical classifier with HMCN

python train.py conf/train.hmcn.json

Set task_info.hierarchical = false.
Set model_name = HMCN.

For detailed configuration options and explanations, see Configuration.

Training information is written to standard output and to the file given by log.logger_file.
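The three training setups above differ only in two settings, so one base configuration can be turned into the others programmatically. The following is a minimal sketch, not part of the toolkit: the helper names set_dotted and make_variant are hypothetical, the base dictionary is only a stand-in for conf/train.json (a real file contains more keys), and it assumes that a dotted name such as task_info.hierarchical maps to nested JSON objects.

```python
import copy
import json


def set_dotted(config, dotted_key, value):
    """Set a nested value addressed by a dotted key such as 'task_info.hierarchical'."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Create intermediate objects on demand (assumed nested layout).
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


def make_variant(base_config, overrides):
    """Return a deep copy of base_config with the dotted-key overrides applied."""
    variant = copy.deepcopy(base_config)
    for dotted_key, value in overrides.items():
        set_dotted(variant, dotted_key, value)
    return variant


# Stand-in for the contents of conf/train.json (illustrative only).
base = {"task_info": {"hierarchical": False}, "model_name": "TextCNN"}

# Hierarchical penalty: hierarchical = true, any of the listed encoders.
hierarchical = make_variant(
    base, {"task_info.hierarchical": True, "model_name": "TextRNN"}
)

# HMCN: hierarchical = false, model_name = HMCN.
hmcn = make_variant(
    base, {"task_info.hierarchical": False, "model_name": "HMCN"}
)

print(json.dumps(hierarchical, indent=2))
print(json.dumps(hmcn, indent=2))
```

Writing each variant to its own file with json.dump and passing that path to train.py keeps the base configuration untouched.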

Evaluation

python eval.py conf/train.json

If eval.is_flat = false, hierarchical evaluation results are also output.
eval.model_dir is the model to evaluate.
data.test_json_files is the input text file to evaluate.

The evaluation results are written to eval.dir.

Prediction

python predict.py conf/train.json data/predict.json

predict.json should be in JSON format; each instance needs a dummy label, such as "" or any other label in the label map.
eval.model_dir is the model used for prediction.
eval.top_k is the number of labels to output.
eval.threshold is the probability threshold.

The predictions are written to predict.txt.
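To make the roles of eval.top_k and eval.threshold concrete, here is a minimal sketch of one plausible way the two settings could interact. It is an assumption, not the toolkit's code: the function select_labels is hypothetical, the probabilities are made up, and the sketch assumes a label is kept when its probability is at least the threshold, with at most top_k labels returned.

```python
def select_labels(probabilities, top_k, threshold):
    """Keep labels scoring at least `threshold`, at most `top_k` of them, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [label for label, prob in ranked[:top_k] if prob >= threshold]


# Made-up per-label probabilities for a single document.
scores = {
    "Computer--MachineLearning--DeepLearning": 0.91,
    "Neuro--ComputationalNeuro": 0.40,
    "Computer--MachineLearning": 0.75,
}

selected = select_labels(scores, top_k=2, threshold=0.5)
print(selected)

# A stricter threshold can return fewer than top_k labels.
strict = select_labels(scores, top_k=2, threshold=0.8)
print(strict)
```

Under these assumptions, top_k caps the length of the output while the threshold filters out low-confidence labels.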

Input Data Format

JSON example:

{
    "doc_label": ["Computer--MachineLearning--DeepLearning", "Neuro--ComputationalNeuro"],
    "doc_token": ["I", "love", "deep", "learning"],
    "doc_keyword": ["deep learning"],
    "doc_topic": ["AI", "Machine learning"]
}
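A record in this format can be produced and checked with a few lines of Python. The sketch below is illustrative only: validate_record and label_path are hypothetical helpers, not toolkit functions, and it assumes that "--" separates the levels of a hierarchical label, as the example above suggests. doc_keyword and doc_topic are treated as optional.

```python
import json

REQUIRED_FIELDS = ("doc_label", "doc_token")
OPTIONAL_FIELDS = ("doc_keyword", "doc_topic")


def validate_record(record):
    """Return a list of problems in one record; an empty list means it looks well-formed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field in REQUIRED_FIELDS + OPTIONAL_FIELDS:
        # Absent fields default to an empty list, which passes the type check.
        value = record.get(field, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")
    return problems


def label_path(label, separator="--"):
    """Split a hierarchical label into its levels, top level first (assumed separator)."""
    return label.split(separator)


record = {
    "doc_label": ["Computer--MachineLearning--DeepLearning", "Neuro--ComputationalNeuro"],
    "doc_token": ["I", "love", "deep", "learning"],
    "doc_keyword": ["deep learning"],
    "doc_topic": ["AI", "Machine learning"],
}

# Serialize the record and check that it survives a round trip.
line = json.dumps(record, ensure_ascii=False)
problems = validate_record(json.loads(line))
print(problems)  # → []

print(label_path(record["doc_label"][0]))  # → ['Computer', 'MachineLearning', 'DeepLearning']
```

Running such a check over a data file before training catches malformed records early, which is cheaper than a failure partway through a run.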

"doc_keyword" and "doc_topic" are optional.

Performance

3. Hierarchical vs Flat

Acknowledgement

Our toolkit draws on some publicly available code:

Update

2019-04-29, initial version
