Early Text Classification

This repository contains the implementation of the Early Text Classification framework.

Classification is a widely studied problem in supervised learning. Nonetheless, some scenarios have received little attention despite their applicability. One such scenario is early text classification, where the category of a document must be determined as soon as possible, before the whole document has been read. The importance of this variant is evident in tasks like sexual predator detection, where one wants to identify an offender as early as possible. This framework highlights the two main pieces involved in this problem: classification with partial information (the cpi module) and deciding the moment of classification (the dmc module).

Based on the paper:

Loyola J.M., Errecalde M.L., Escalante H.J., Montes y Gomez M. (2018) Learning When to Classify for Early Text Classification. In: De Giusti A. (eds) Computer Science CACIC 2017. CACIC 2017. Communications in Computer and Information Science, vol 790. Springer, Cham. [Springer Link] [SEDICI Link]

How to use the framework

The Jupyter notebook example.ipynb shows how to use this framework. You need to specify the following parameters:

etc_kwargs : dict
- dataset_name: name of the dataset to use. The dataset is expected to be already split into a training set and a test set, and to be located inside the folder dataset. There should be two files named {dataset_name}_train.txt and {dataset_name}_test.txt. Each file must contain one line per document i with the structure {label_i}[TAB]{document_i}. Both files must end with an empty line.
- initial_step: initial percentage of the document to read.
- step_size: percentage of the document to read in each step.
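
The file layout and the step parameters above can be illustrated with a minimal sketch. It is not the framework's own loader: the helper names (write_split, read_split, reading_schedule, partial_document), the dataset name toy, and the temporary folder are hypothetical. It also assumes that percentages are given out of 100 and are measured in whitespace-separated tokens, which this README does not state.

```python
import os
import tempfile


def write_split(path, examples):
    # One record per line: {label}<TAB>{document}. Every record is
    # newline-terminated, which is our reading of "end with an empty line".
    with open(path, "w", encoding="utf-8") as handle:
        for label, document in examples:
            handle.write(f"{label}\t{document}\n")


def read_split(path):
    # Parse the file back into (label, document) pairs, skipping blank lines.
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            label, document = line.split("\t", 1)
            pairs.append((label, document))
    return pairs


def reading_schedule(initial_step, step_size):
    # Cumulative percentages of the document read at each step, capped at 100.
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    percentages = []
    current = initial_step
    while current < 100:
        percentages.append(current)
        current += step_size
    percentages.append(100)
    return percentages


def partial_document(document, percentage):
    # Assumption: the percentage refers to whitespace-separated tokens.
    tokens = document.split()
    cutoff = max(1, round(len(tokens) * percentage / 100))
    return " ".join(tokens[:cutoff])


# Hypothetical toy dataset written to a temporary folder instead of dataset/.
dataset_dir = tempfile.mkdtemp()
train_path = os.path.join(dataset_dir, "toy_train.txt")
write_split(
    train_path,
    [("spam", "win a free prize now"), ("ham", "see you at lunch tomorrow ok")],
)

train_pairs = read_split(train_path)
schedule = reading_schedule(initial_step=20, step_size=40)
prefixes = [partial_document(train_pairs[0][1], p) for p in schedule]
```

With these illustrative values, the first document is seen in three increasingly long prefixes, which is the kind of partial information the classifier has to work with at each step.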
preprocess_kwargs : dict
- min_word_length: minimum number of letters a term must have to be considered.
- max_number_words: maximum number of words to consider. To include all words, use 'all'.
cpi_kwargs : dict
- train_dataset_percentage: percentage of documents to use for training cpi.
- test_dataset_percentage: percentage of documents to use for testing cpi.
- doc_rep: document representation to use. For now the only representation available is term_frec.
- cpi_clf: classifier for the cpi module. It must have methods fit(X, y), predict(X) and get_params() similar to those in the scikit-learn API. The method fit should accept a sparse matrix as the parameter X.
context_kwargs : dict
- number_most_common: number of most common terms of each category to use.
dmc_kwargs : dict
- train_dataset_percentage: percentage of documents to use for training dmc.
- test_dataset_percentage: percentage of documents to use for testing dmc.
- dmc_clf: classifier for the dmc module. It must have methods fit(X, y), predict(X) and get_params() similar to those in the scikit-learn API.
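
As a rough illustration of how the five dictionaries fit together, the sketch below fills them with placeholder values and checks that a classifier exposes the three required methods. Every value is illustrative rather than recommended. MajorityClassifier and has_classifier_interface are hypothetical helpers that are not part of this repository, and the sketch assumes percentages are given out of 100; example.ipynb is the reference for the actual scale and for how the dictionaries are passed to the framework.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in exposing fit, predict and get_params."""

    def __init__(self):
        self.majority_ = None

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Works for lists and for matrices (dense or sparse) with a shape.
        n_rows = X.shape[0] if hasattr(X, "shape") else len(X)
        return [self.majority_] * n_rows

    def get_params(self, deep=True):
        # This toy classifier has no hyperparameters.
        return {}


def has_classifier_interface(clf):
    # True when clf offers callable fit, predict and get_params attributes.
    required = ("fit", "predict", "get_params")
    return all(callable(getattr(clf, name, None)) for name in required)


# Illustrative values only; the percentage scale (out of 100) is an assumption.
etc_kwargs = {"dataset_name": "toy", "initial_step": 20, "step_size": 40}
preprocess_kwargs = {"min_word_length": 2, "max_number_words": "all"}
cpi_kwargs = {
    "train_dataset_percentage": 75,
    "test_dataset_percentage": 25,
    "doc_rep": "term_frec",
    "cpi_clf": MajorityClassifier(),
}
context_kwargs = {"number_most_common": 25}
dmc_kwargs = {
    "train_dataset_percentage": 75,
    "test_dataset_percentage": 25,
    "dmc_clf": MajorityClassifier(),
}

for clf in (cpi_kwargs["cpi_clf"], dmc_kwargs["dmc_clf"]):
    if not has_classifier_interface(clf):
        raise TypeError("classifier must provide fit, predict and get_params")
```

In practice a scikit-learn estimator would normally take the place of the stand-in, since scikit-learn estimators provide fit, predict and get_params; for cpi_clf, pick one whose fit accepts sparse input.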