# 3rd place solution for the OTTO – Multi-Objective Recommender System competition
MIT License
The pipeline follows the classical candidate extraction & reranking scheme.
Clicks uses a single model; I blend a few XGBs for carts & orders, but the boost is small. Blending with models from my teammates gave us Public 0.60437 / Private 0.60382 on the LB!
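For context, here is a minimal sketch of what such a blend can look like. The helper is hypothetical and assumes each model produced a score per `(session, candidate)` pair with identical row order; the actual blend lives in `7-Blend.ipynb` and may differ:

```python
import pandas as pd

def blend_predictions(preds, weights):
    """Hypothetical weighted blend of several models' scores.
    `preds` is a list of DataFrames with identical (session, candidate) rows."""
    blended = preds[0][["session", "candidate"]].copy()
    blended["score"] = sum(w * p["score"].to_numpy() for w, p in zip(weights, preds))
    return blended

# e.g. blend_predictions([xgb1_preds, xgb2_preds], weights=[0.6, 0.4])
```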
I use the candidates from Chris (link), as well as a slightly modified version of the ones from his public kernel. This results in approx. 80 candidates per session.
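As an illustration of this step, a minimal sketch of merging several candidate sources per session; the helper and column names are assumptions, not the code from `3-Candidates.ipynb`:

```python
import pandas as pd

def merge_candidate_sources(sources, n_max=80):
    """Hypothetical helper: merges several candidate DataFrames with
    (session, candidate) columns, deduplicates, and caps the list per session."""
    cands = pd.concat(sources, ignore_index=True)
    # Keep the first occurrence: pass the strongest source first so its
    # candidates survive deduplication and the per-session cap.
    cands = cands.drop_duplicates(subset=["session", "candidate"], keep="first")
    return cands.groupby("session").head(n_max).reset_index(drop=True)
```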
Most of my 744 features come from the following process: they are computed per batch on a 32GB V100 using RAPIDS. It's fast :)
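To give an idea of the RAPIDS approach, a toy cuDF sketch of per-session aggregates; the actual features in `src/data/fe.py` are far richer, and the aggregations below are assumptions:

```python
import cudf

def session_aggregates(events):
    """Toy GPU feature computation with cuDF: simple per-session aggregates
    over the raw events (session, aid, ts columns from the OTTO data)."""
    stats = events.groupby("session").agg({"aid": ["count", "nunique"], "ts": ["min", "max"]})
    stats.columns = ["n_events", "n_unique_aids", "ts_min", "ts_max"]
    stats = stats.reset_index()
    # Session duration as an example of a derived feature.
    stats["duration"] = stats["ts_max"] - stats["ts_min"]
    return stats
```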
I run an Optuna search for each fold (which is not good practice, but I had a really reliable CV setup). The pipeline can be a bit long to run, but the real bottleneck is reading huge parquet files. Heavy downsampling makes it possible to keep everything in RAM and to train on GPU using the tricks Chris shared publicly.
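For reference, a minimal sketch of what per-fold tuning can look like; the search space, metric, and parameter ranges below are illustrative assumptions, not the exact solution setup:

```python
import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score

def tune_fold(X_train, y_train, X_val, y_val, n_trials=50):
    """Sketch of per-fold Optuna tuning for an XGBoost classifier."""
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)

    def objective(trial):
        params = {
            "objective": "binary:logistic",
            "tree_method": "gpu_hist",  # train on GPU
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "max_depth": trial.suggest_int("max_depth", 4, 10),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        }
        model = xgb.train(
            params, dtrain, num_boost_round=1000,
            evals=[(dval, "val")], early_stopping_rounds=50, verbose_eval=False,
        )
        preds = model.predict(dval, iteration_range=(0, model.best_iteration + 1))
        return roc_auc_score(y_val, preds)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```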
- Clone the repository.
- Install the requirements: `pip install -r requirements.txt`
- Download the data in the `input` folder.

Most of the pipeline is handled in notebooks. The order in which they should be run is specified in the name. The pipeline should run fine on a machine with 32GB of RAM.
1. `1-Preparation.ipynb`
2. `2-Matrices_Chris.ipynb` and `2-Matrices_Theo.ipynb`. These notebooks have to be run with `MODE="val"` and `MODE="test"`.
3. `3-Candidates.ipynb`. This notebook has to be run with `MODE="val"`, `MODE="test"` and `MODE="extra"`.
4. `4-Matrix_Factorization.ipynb`, `4-Seq2Seq_Giba.ipynb` and `4-Word2Vec.ipynb`. These notebooks have to be run with `MODE="val"` and `MODE="test"`.
5. The `fe_main.py` script in the `src` folder. Use `python fe_main.py --mode MODE` with modes `val`, `test` and `extra`.
6. `6-XGB.ipynb`. You need to train models with the 3 targets; the main parameter to tweak is `POS_RATIO` (see the sketch after this list).
7. `7-Blend.ipynb`
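As a hedged sketch of how a `POS_RATIO`-style parameter is typically used when training a reranker, keeping all positives and subsampling negatives; the exact rule in `6-XGB.ipynb` may differ:

```python
import pandas as pd

def downsample_negatives(df, target, pos_ratio=0.1, seed=42):
    """Illustrative downsampling: keeps every positive and samples negatives
    so that positives make up roughly `pos_ratio` of the rows."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0]
    n_neg = min(len(neg), int(len(pos) * (1 - pos_ratio) / pos_ratio))
    neg = neg.sample(n=n_neg, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed).reset_index(drop=True)
```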
If you run into memory issues:
- Adjust the `PIECES` values.
- Reduce `CHUNK_SIZE`.
- Downsampling is already done in the `utils/load/load_parquets_cudf_folds` function, but you can downsample more (see the sketch below).
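To make the last bullet concrete, a minimal sketch of chunked parquet loading with negative downsampling. The function below is hypothetical; the real logic lives in `utils/load.py`, and the target column and `pos_ratio` rule are assumptions:

```python
import glob
import cudf

def load_parquets_downsampled(path_regex, target="gt_clicks", pos_ratio=0.1):
    """Hypothetical chunked loader: reads parquet pieces one by one and keeps
    all positives plus a subsample of negatives so everything fits in RAM."""
    chunks = []
    for path in sorted(glob.glob(path_regex)):
        df = cudf.read_parquet(path)
        pos = df[df[target] == 1]
        neg = df[df[target] == 0]
        n_neg = min(len(neg), int(len(pos) * (1 - pos_ratio) / pos_ratio))
        chunks.append(cudf.concat([pos, neg.sample(n=n_neg)]))
        del df, pos, neg  # free GPU memory between pieces
    return cudf.concat(chunks, ignore_index=True)
```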
If you wish to dive into the code, the repository naming should be straightforward, and each function is documented. The structure is the following:

    src
    ├── data
    │   ├── candidates_chris.py     # Chris' candidates utils
    │   ├── candidates.py           # Theo's candidates utils
    │   ├── covisitation.py         # Theo's covisitation matrices
    │   ├── fe.py                   # Feature engineering
    │   └── preparation.py          # Data preparation utils
    ├── inference
    │   ├── boosting.py             # Main file
    │   └── predict.py              # Predict function
    ├── model_zoo
    │   ├── __init__.py
    │   ├── lgbm.py                 # LGBM Ranker kept for legacy
    │   └── xgb.py                  # XGBoost classifier
    ├── otto_src
    │   ├── evaluate.py             # From the competition repo
    │   ├── labels.py               # From the competition repo
    │   ├── my_split.py             # My custom splitting functions
    │   └── testset.py              # From the competition repo
    ├── training
    │   └── boosting.py             # Trains a boosting model
    ├── utils
    │   ├── load.py                 # Data loading utils
    │   ├── logger.py               # Logging utils
    │   ├── metrics.py              # Metrics for the competition
    │   ├── plot.py                 # Plotting utils
    │   └── torch.py                # Torch utils
    ├── fe_main.py                  # Main for feature engineering
    └── params.py                   # Main parameters