New tips are posted on LinkedIn, Twitter, and Facebook.
👉 Sign up to receive 2 video tips by email every week! 👈
Click to discuss the tip on LinkedIn, click to view the Jupyter notebook for a tip, or click to watch the tip video on YouTube:
# | Description | Links |
---|---|---|
1 | Use ColumnTransformer to apply different preprocessing to different columns | |
2 | Seven ways to select columns using ColumnTransformer | |
3 | What is the difference between "fit" and "transform"? | |
4 | Use "fit_transform" on training data, but "transform" (only) on testing/new data | |
5 | Four reasons to use scikit-learn (not pandas) for ML preprocessing | |
6 | Encode categorical features using OneHotEncoder or OrdinalEncoder | |
7 | Handle unknown categories with OneHotEncoder by encoding them as zeros | |
8 | Use Pipeline to chain together multiple steps | |
9 | Add a missing indicator to encode "missingness" as a feature | |
10 | Set a "random_state" to make your code reproducible | |
11 | Impute missing values using KNNImputer or IterativeImputer | |
12 | What is the difference between Pipeline and make_pipeline? | |
13 | Examine the intermediate steps in a Pipeline | |
14 | HistGradientBoostingClassifier natively supports missing values | |
15 | Three reasons not to use drop='first' with OneHotEncoder | |
16 | Use cross_val_score and GridSearchCV on a Pipeline | |
17 | Try RandomizedSearchCV if GridSearchCV is taking too long | |
18 | Display GridSearchCV or RandomizedSearchCV results in a DataFrame | |
19 | Important tuning parameters for LogisticRegression | |
20 | Plot a confusion matrix | |
21 | Compare multiple ROC curves in a single plot | |
22 | Use the correct methods for each type of Pipeline | |
23 | Display the intercept and coefficients for a linear model | |
24 | Visualize a decision tree two different ways | |
25 | Prune a decision tree to avoid overfitting | |
26 | Use stratified sampling with train_test_split | |
27 | Two ways to impute missing values for a categorical feature | |
28 | Save a model or Pipeline using joblib | |
29 | Vectorize two text columns in a ColumnTransformer | |
30 | Four ways to examine the steps of a Pipeline | |
31 | Shuffle your dataset when using cross_val_score | |
32 | Use AUC to evaluate multiclass problems | |
33 | Use FunctionTransformer to convert functions into transformers | |
34 | Add feature selection to a Pipeline | |
35 | Don't use .values when passing a pandas object to scikit-learn | |
36 | Most parameters should be passed as keyword arguments | |
37 | Create an interactive diagram of a Pipeline in Jupyter | |
38 | Get the feature names output by a ColumnTransformer | |
39 | Load a toy dataset into a DataFrame | |
40 | Estimators only print parameters that have been changed | |
41 | Drop the first category from binary features (only) with OneHotEncoder | |
42 | Passthrough some columns and drop others in a ColumnTransformer | |
43 | Use OrdinalEncoder instead of OneHotEncoder with tree-based models | |
44 | Speed up GridSearchCV using parallel processing | |
45 | Create feature interactions using PolynomialFeatures | |
46 | Ensemble multiple models using VotingClassifier or VotingRegressor | |
47 | Tune the parameters of a VotingClassifier or VotingRegressor | |
48 | Access part of a Pipeline using slicing | |
49 | Tune multiple models simultaneously with GridSearchCV | |
50 | Adapt this pattern to solve many Machine Learning problems | |
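
To give a feel for how a few of these tips fit together, here is a minimal sketch that combines tips 1, 7, 8, and 16: a ColumnTransformer inside a Pipeline, evaluated with cross_val_score. The DataFrame, its column names, and its values are made up for illustration, and this is not necessarily the exact pattern shown in the notebooks or in tip 50.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column names and values are assumptions for this sketch
df = pd.DataFrame({
    "age": [22, 38, 26, 35, None, 54, 2, 27, 14, 4, 58, 20],
    "embarked": ["S", "C", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "Q"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})
X = df[["age", "embarked"]]
y = df["survived"]

# Tip 1: apply different preprocessing to different columns
ct = make_column_transformer(
    (SimpleImputer(strategy="median"), ["age"]),
    # Tip 7: unknown categories are encoded as all zeros instead of raising an error
    (OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
)

# Tip 8: chain preprocessing and the model into a single Pipeline
pipe = make_pipeline(ct, LogisticRegression(max_iter=1000, random_state=0))

# Tip 16: cross-validate the whole Pipeline, so the preprocessing is
# fit on the training folds only and merely applied to the held-out fold
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean())

# Fit on all of the data, then predict for a new row with an unseen category
pipe.fit(X, y)
X_new = pd.DataFrame({"age": [30], "embarked": ["X"]})
pred = pipe.predict(X_new)
print(pred)
```

Because the preprocessing lives inside the Pipeline, the same object can be passed to cross_val_score, fit on the full dataset, and used for prediction without repeating any of the preprocessing code by hand.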
You can interact with all of these notebooks online using Binder.
Note: Some of the tips do not include any code, and can only be viewed on LinkedIn.
Hi! I'm Kevin Markham, the founder of Data School. I've been teaching data science in Python since 2014. I create these tips because I love using scikit-learn and I want to help others use it more effectively.
I teach three courses:
👉 Find out which course is right for you! 👈
In 2019, I also posted 100 pandas tricks, and I created a video featuring my top 25 pandas tricks.
© 2020-2021 Data School. All rights reserved.