Notebooks used to train an AWD LSTM language model and classifier for the Kaggle Google QUEST Challenge: https://www.kaggle.com/c/google-quest-challenge
Notebooks used to train an AWD LSTM language model and classifier using fastai v2 for the Kaggle Google QUEST Challenge
These are all heavily influenced by the fastai-v2 ULMFiT and Wikitext tutorials here
I entered the competition quite late and spent too much time learning the new fastai v2 framework, so I didn't actually get any decent results, but hopefully some of the code here will be useful for you in your LSTM endeavours!
Processes and combines 3 different text datasets into a single source ready for language model pre-training. This notebook outputs an 850 MB text data file with 84M words/tokens with the following distribution:
This notebook will pretrain an AWD LSTM model using a custom text dataset designed especially for this Q&A competition.
The SentencePiece tokenizer with byte-pair encoding (BPE) was used for tokenization instead of the standard fastai spaCy tokenizer. The language model was trained for 7 epochs, at 2h14m per epoch.
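For readers unfamiliar with byte-pair encoding, here is a toy sketch of the idea in plain Python: repeatedly fuse the most frequent adjacent pair of symbols into a new symbol. This is only an illustration, not how SentencePiece implements it, and the function names (`learn_bpe_merges`, `merge_pair`, `apply_bpe`) are hypothetical and do not come from the notebooks.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters, with a frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Frequent character sequences end up as single tokens while rare words fall back to smaller pieces, which is why a subword vocabulary can cover text that a word-level vocabulary would map to unknown tokens.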
Finetune the pretrained AWD LSTM Language Model on the competition Q&A data. Because we are finetuning the LM, we can use all of the competition data, both the train and test set.
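Since language model finetuning only needs raw text and no labels, the train and test rows can be pooled. Below is a minimal sketch of that pooling, assuming each row is a dict and that the text lives in the columns `question_title`, `question_body` and `answer`; the helper name `collect_lm_texts` is hypothetical and not taken from the notebooks.

```python
def collect_lm_texts(train_rows, test_rows,
                     text_cols=("question_title", "question_body", "answer")):
    """Pool the text fields of train and test rows into one list of strings.

    Label columns are ignored: the language model only sees raw text,
    so using the test set here does not leak any targets.
    """
    texts = []
    for row in list(train_rows) + list(test_rows):
        # Join the non-empty text fields of a row into a single document.
        parts = [str(row[c]) for c in text_cols if row.get(c)]
        if parts:
            texts.append(" ".join(parts))
    return texts
```

The resulting list could then be handed to whatever language model dataloader is in use.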
Uses the encoder from our finetuned language model in a text classification model, which is then trained on the competition data.
Note: At the time of writing, the fastai SortedDL dataloader sorts predictions according to item size. For Kaggle, however, we need the predictions output in the same order as the submission file, which is generally the same order as the input test file.
So there was one more step after making our preds before I could use them: I had to get the original indexes from the dataloader and then re-sort the preds:
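# original item indexes, in the order the dataloader served the items
pred_idxs = tst_cls_dls.get_idxs()
# pair each pred with its original index, then sort on the index only to restore the test-file order
sorted_preds = [x for _,x in sorted(zip(pred_idxs, list(torch.unbind(preds[0]))), key=lambda p: p[0])]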
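The same re-sorting idea can be shown without fastai or torch. The sketch below assumes `idxs[i]` is the original position of the item that produced `preds[i]` and that `idxs` is a permutation of `range(len(preds))`; the helper name `restore_original_order` is hypothetical.

```python
def restore_original_order(idxs, preds):
    """Put predictions back in the order of the original input items."""
    if len(idxs) != len(preds):
        raise ValueError("idxs and preds must have the same length")
    restored = [None] * len(preds)
    for original_pos, pred in zip(idxs, preds):
        # Each pred goes back to the slot its item came from.
        restored[original_pos] = pred
    return restored


# Toy stand-in for a length-sorted dataloader: longest text first.
texts = ["bb", "a", "dddd", "ccc"]
idxs = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
preds = [len(texts[i]) for i in idxs]  # "predictions" in dataloader order
restored = restore_original_order(idxs, preds)
```

Writing each pred directly into its slot avoids sorting altogether, so it runs in linear time and never compares the predictions themselves.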