kaggle_otto_rs

3rd place solution to the OTTO Multi-Objective Recommender System Kaggle Competition - Theo's Part

Status:

  • Document code: Done
  • Clean notebooks: Done
  • Write README: Done
  • Rerun the full pipeline to make sure everything works: To do

Introduction - Adapted from Kaggle

The pipeline follows the classical candidate extraction & reranking scheme.

  • CV = 0.5917 - [0.5621, 0.4438, 0.6706] (clicks, carts, orders) -> LB 0.6028

Clicks uses a single model; for carts & orders I blend a few XGBs, but the boost is small. Blending with my teammates' models gave our Public 0.60437 / Private 0.60382 LB!

Candidates

I use the candidates from Chris (link), as well as a slightly modified version of the ones from his public kernel. This results in approx. 80 candidates per session.
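
As a rough illustration (the actual logic lives in src/data/candidates.py and src/data/candidates_chris.py; the frames below are toy data), combining candidate sources boils down to concatenating per-session lists and deduplicating:

    import pandas as pd

    # Toy candidate lists from two sources.
    chris_cands = pd.DataFrame({"session": [1, 1, 2], "candidate": [10, 11, 12]})
    theo_cands = pd.DataFrame({"session": [1, 2, 2], "candidate": [11, 12, 13]})

    candidates = (
        pd.concat([chris_cands, theo_cands])
        .drop_duplicates(["session", "candidate"])
        .reset_index(drop=True)
    )
    # In the actual pipeline this yields ~80 candidates per session.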

Feature engineering

Most of my 744 features come from the following process:

  • Compute item-item scores (such as w2v similarities, matrix factorization similarities, and the coefficients of Chris' covisitation matrices) between the candidate and the items in the session
  • Compute a weight encoding the item's position in the session, its timestamp, and its type
  • Aggregate! (sketched below)

Features are computed in batches on a 32GB V100 using RAPIDS. It's fast :)
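
Here is a minimal pandas sketch of that recipe (the actual pipeline uses cuDF, whose API closely mirrors pandas; all column names and weight values below are illustrative, not the repo's):

    import pandas as pd

    # One row per (candidate, session item) pair, with a precomputed
    # item-item score (e.g. a w2v cosine similarity).
    pairs = pd.DataFrame({
        "session": [1, 1, 1, 2, 2],
        "candidate": [101, 101, 101, 202, 202],
        "score": [0.9, 0.4, 0.7, 0.2, 0.8],
        "pos_from_end": [0, 1, 2, 0, 1],           # recency of the session item
        "type_weight": [1.0, 6.0, 3.0, 1.0, 6.0],  # e.g. click/cart/order weights
    })

    # Weight each pair: recent items and stronger event types count more.
    pairs["w"] = pairs["type_weight"] * 0.5 ** pairs["pos_from_end"]
    pairs["weighted_score"] = pairs["w"] * pairs["score"]

    # Aggregate per (session, candidate): each aggregation is one feature.
    feats = pairs.groupby(["session", "candidate"]).agg(
        score_sum=("weighted_score", "sum"),
        score_max=("score", "max"),
        score_mean=("score", "mean"),
    )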

Overall pipeline

I run an Optuna search for each fold (which is not good practice, but I had a really reliable CV setup). The pipeline can take a while to run, but the actual bottleneck is reading the huge parquet files. Heavy downsampling makes it possible to keep everything in RAM and to train on GPU using the tricks Chris shared publicly.
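
Per-fold tuning looks roughly like the sketch below. This is only an assumed, simplified search space with toy data standing in for a fold, not the repo's actual setup:

    import numpy as np
    import optuna
    import xgboost as xgb

    # Toy data standing in for one fold's candidate features and labels.
    np.random.seed(0)
    X = np.random.rand(1000, 10)
    y = (np.random.rand(1000) < 0.1).astype(int)
    dtrain = xgb.DMatrix(X[:800], label=y[:800])
    dval = xgb.DMatrix(X[800:], label=y[800:])

    def objective(trial):
        params = {
            "objective": "binary:logistic",
            "eval_metric": "auc",
            "tree_method": "gpu_hist",  # GPU training (XGBoost 1.x syntax)
            "max_depth": trial.suggest_int("max_depth", 4, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        booster = xgb.train(
            params, dtrain, num_boost_round=1000,
            evals=[(dval, "val")], early_stopping_rounds=50, verbose_eval=False,
        )
        return booster.best_score  # best validation AUC

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)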

How to use the repository

Prerequisites

  • Clone the repository

  • Requirements:

    • RAPIDS! The latest stable version should work.
    • pip install -r requirements.txt
    • A few other packages whose exact versions don't really matter.
  • Download the data:

    • Put the competition data from Kaggle in the input folder.

Run the pipeline

Most of the pipeline is handled in notebooks. The order in which they should be run is given by the number prefix in their names. The pipeline should run fine on a machine with a 32GB GPU (such as the V100 used here).

  • Prepare the data using 1-Preparation.ipynb.
  • Create covisitation matrices using 2-Matrices_Chris.ipynb and 2-Matrices_Theo.ipynb. These notebooks have to be run with MODE="val" and MODE="test".
  • Create candidates using 3-Candidates.ipynb. The notebook has to be run with MODE="val", MODE="test" and MODE="extra".
  • Create embeddings using 4-Matrix_Factorization.ipynb, 4-Seq2Seq_Giba.ipynb and 4-Word2Vec.ipynb. These notebooks have to be run with MODE="val" and MODE="test".
  • Create features using the fe_main.py script in the src folder: python fe_main.py --mode MODE with modes val, test and extra.
  • Train an XGBoost model using 6-XGB.ipynb. You need to train a model for each of the 3 targets; the main parameter to tweak is POS_RATIO (a sketch of this kind of sampling follows the list).
  • Evaluate your ensembles and generate submission files using 7-Blend.ipynb.
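
For reference, here is a minimal sketch of positive-ratio downsampling. The repo's actual sampling lives in utils/load, and the exact semantics of POS_RATIO here are an assumption:

    import pandas as pd

    POS_RATIO = 0.2  # illustrative value, tune per target

    def downsample(df: pd.DataFrame, pos_ratio: float, seed: int = 0) -> pd.DataFrame:
        """Keep all positives; sample negatives so positives are ~pos_ratio of rows."""
        pos = df[df["target"] == 1]
        neg = df[df["target"] == 0]
        n_neg = int(len(pos) * (1 - pos_ratio) / pos_ratio)
        neg = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
        return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle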

If you run into memory issues:

  • For matrix computation, increase the PIECES values.
  • For candidates, Chris' candidates use a lot of RAM, but the code can be refactored to work in chunks (not implemented).
  • For feature engineering, reduce CHUNK_SIZE.
  • For training, validation data can be downsampled. I already downsample it for carts and clicks in the utils/load/load_parquets_cudf_folds function, but you can downsample further.

Code structure

If you wish to dive into the code, the repository naming should be straightforward. Each function is documented. The structure is as follows:

src
 data
    candidates_chris.py         # Chris' candidates utils
    candidates.py               # Theo's candidates utils
    covisitation.py             # Theo's covisitation matrices
    fe.py                       # Feature engineering
    preparation.py              # Data preparation utils
 inference           
    boosting.py                 # Main file
    predict.py                  # Predict function
 model_zoo 
    __init__.py
    lgbm.py                     # LGBM Ranker kept for legacy
    xgb.py                      # XGBoost classifier
 otto_src                        
    evaluate.py                 # From the competition repo
    labels.py                   # From the competition repo
    my_split.py                 # My custom splitting functions
    testset.py                  # From the competition repo
 training           
    boosting.py                 # Trains a boosting model
 utils          
    load.py                     # Data loading utils 
    logger.py                   # Logging utils
    metrics.py                  # Metrics for the competition
    plot.py                     # Plotting utils
    torch.py                    # Torch utils

 fe_main.py                      # Main for feature engineering
 params.py                       # Main parameters