kaggle_otto_rs

3rd place solution to the OTTO Multi-Objective Recommender System Kaggle Competition - Theo's Part

Status:

  • Document code: Done
  • Clean notebooks: Done
  • Write README: Done
  • Rerun the full pipeline to make sure everything works: To do

Introduction - Adapted from Kaggle

The pipeline follows the classical candidate extraction & reranking scheme.

  • CV = 0.5917 - [0.5621, 0.4438, 0.6706] (clicks, carts, orders) -> LB 0.6028

Clicks uses a single model; for carts & orders I blend a few XGBs, but the boost is small. Blending with my teammates' models gave our Public 0.60437 / Private 0.60382 LB!

Candidates

I use the candidates from Chris (link), as well as a slightly modified version of the ones from his public kernel. This results in approx. 80 candidates per session.
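
As a rough illustration (the actual logic lives in src/data/candidates.py and src/data/candidates_chris.py; the frames below are toy data), combining candidate sources boils down to concatenating per-session lists and deduplicating:

    import pandas as pd

    # Toy candidate lists from two sources.
    chris_cands = pd.DataFrame({"session": [1, 1, 2], "candidate": [10, 11, 12]})
    theo_cands = pd.DataFrame({"session": [1, 2, 2], "candidate": [11, 12, 13]})

    candidates = (
        pd.concat([chris_cands, theo_cands])
        .drop_duplicates(["session", "candidate"])
        .reset_index(drop=True)
    )
    # In the actual pipeline this yields ~80 candidates per session.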

Feature engineering

Most of my 744 features come from the following process:

  • Compute item-item scores (such as w2v similarities, matrix factorization similarities, and the coefficients of Chris' covisitation matrices) between the candidate and the items in the session
  • Compute a weight encoding the item's position in the session, its timestamp, and its type
  • Aggregate! (sketched below)

Features are computed in batches on a 32GB V100 using RAPIDS. It's fast :)
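
Here is a minimal pandas sketch of that recipe (the actual pipeline uses cuDF, whose API closely mirrors pandas; all column names and weight values below are illustrative, not the repo's):

    import pandas as pd

    # One row per (candidate, session item) pair, with a precomputed
    # item-item score (e.g. a w2v cosine similarity).
    pairs = pd.DataFrame({
        "session": [1, 1, 1, 2, 2],
        "candidate": [101, 101, 101, 202, 202],
        "score": [0.9, 0.4, 0.7, 0.2, 0.8],
        "pos_from_end": [0, 1, 2, 0, 1],           # recency of the session item
        "type_weight": [1.0, 6.0, 3.0, 1.0, 6.0],  # e.g. click/cart/order weights
    })

    # Weight each pair: recent items and stronger event types count more.
    pairs["w"] = pairs["type_weight"] * 0.5 ** pairs["pos_from_end"]
    pairs["weighted_score"] = pairs["w"] * pairs["score"]

    # Aggregate per (session, candidate): each aggregation is one feature.
    feats = pairs.groupby(["session", "candidate"]).agg(
        score_sum=("weighted_score", "sum"),
        score_max=("score", "max"),
        score_mean=("score", "mean"),
    )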

Overall pipeline

I run an Optuna search for each fold (which is not good practice, but I had a really reliable CV setup). The pipeline can take a while to run, but the actual bottleneck is reading the huge parquet files. Heavy downsampling makes it possible to keep everything in RAM and to train on GPU using the tricks Chris shared publicly.
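
Per-fold tuning looks roughly like the sketch below. This is only an assumed, simplified search space with toy data standing in for a fold, not the repo's actual setup:

    import numpy as np
    import optuna
    import xgboost as xgb

    # Toy data standing in for one fold's candidate features and labels.
    np.random.seed(0)
    X = np.random.rand(1000, 10)
    y = (np.random.rand(1000) < 0.1).astype(int)
    dtrain = xgb.DMatrix(X[:800], label=y[:800])
    dval = xgb.DMatrix(X[800:], label=y[800:])

    def objective(trial):
        params = {
            "objective": "binary:logistic",
            "eval_metric": "auc",
            "tree_method": "gpu_hist",  # GPU training (XGBoost 1.x syntax)
            "max_depth": trial.suggest_int("max_depth", 4, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        booster = xgb.train(
            params, dtrain, num_boost_round=1000,
            evals=[(dval, "val")], early_stopping_rounds=50, verbose_eval=False,
        )
        return booster.best_score  # best validation AUC

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)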

How to use the repository

Prerequisites

  • Clone the repository

  • Requirements:

    • RAPIDS! The latest stable version should work.
    • pip install -r requirements.txt
    • A few other packages whose exact versions don't really matter.
  • Download the data:

    • Put the competition data from Kaggle in the input folder.

Run the pipeline

Most of the pipeline is handled in notebooks. The order in which they should be run is given by the number prefix in their names. The pipeline should run fine on a machine with a 32GB GPU (such as the V100 used here).

  • Prepare the data using 1-Preparation.ipynb.
  • Create covisitation matrices using 2-Matrices_Chris.ipynb and 2-Matrices_Theo.ipynb. These notebooks have to be run with MODE="val" and MODE="test".
  • Create candidates using 3-Candidates.ipynb. The notebook has to be run with MODE="val", MODE="test" and MODE="extra".
  • Create embeddings using 4-Matrix_Factorization.ipynb, 4-Seq2Seq_Giba.ipynb and 4-Word2Vec.ipynb. These notebooks have to be run with MODE="val" and MODE="test".
  • Create features using the fe_main.py script in the src folder: python fe_main.py --mode MODE with modes val, test and extra.
  • Train an XGBoost model using 6-XGB.ipynb. You need to train a model for each of the 3 targets; the main parameter to tweak is POS_RATIO (a sketch of this kind of sampling follows the list).
  • Evaluate your ensembles and generate submission files using 7-Blend.ipynb.
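
For reference, here is a minimal sketch of positive-ratio downsampling. The repo's actual sampling lives in utils/load, and the exact semantics of POS_RATIO here are an assumption:

    import pandas as pd

    POS_RATIO = 0.2  # illustrative value, tune per target

    def downsample(df: pd.DataFrame, pos_ratio: float, seed: int = 0) -> pd.DataFrame:
        """Keep all positives; sample negatives so positives are ~pos_ratio of rows."""
        pos = df[df["target"] == 1]
        neg = df[df["target"] == 0]
        n_neg = int(len(pos) * (1 - pos_ratio) / pos_ratio)
        neg = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
        return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle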

If you run into memory issues:

  • For matrix computation, increase the PIECES values.
  • For candidates, Chris' candidates use a lot of RAM, but the code can be refactored to work in chunks (not implemented).
  • For feature engineering, reduce CHUNK_SIZE.
  • For training, validation data can be downsampled. I already downsample it for carts and clicks in the utils/load/load_parquets_cudf_folds function, but you can downsample further.

Code structure

If you wish to dive into the code, the repository naming should be straightforward. Each function is documented. The structure is as follows:

src
 data
    candidates_chris.py         # Chris' candidates utils
    candidates.py               # Theo's candidates utils
    covisitation.py             # Theo's covisitation matrices
    fe.py                       # Feature engineering
    preparation.py              # Data preparation utils
 inference           
    boosting.py                 # Main file
    predict.py                  # Predict function
 model_zoo 
    __init__.py
    lgbm.py                     # LGBM Ranker kept for legacy
    xgb.py                      # XGBoost classifier
 otto_src                        
    evaluate.py                 # From the competition repo
    labels.py                   # From the competition repo
    my_split.py                 # My custom splitting functions
    testset.py                  # From the competition repo
 training           
    boosting.py                 # Trains a boosting model
 utils          
    load.py                     # Data loading utils 
    logger.py                   # Logging utils
    metrics.py                  # Metrics for the competition
    plot.py                     # Plotting utils
    torch.py                    # Torch utils

 fe_main.py                      # Main for feature engineering
 params.py                       # Main parameters