step-select

A Scikit-Learn style feature selector using best subsets and stepwise regression.

Install

Create a virtual environment with Python 3.8 and install from PyPI:

pip install step-select

Use

Preliminaries

Note: this example requires two additional packages: pandas and statsmodels.

In this example we'll show how the ForwardSelector and SubsetSelector classes can be used on their own or in conjunction with a Scikit-Learn Pipeline object.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant

from steps.forward import ForwardSelector
from steps.subset import SubsetSelector

We'll download the auto dataset via Statsmodels; we'll use mpg as the endogenous variable and the remaining variables as exogenous. We won't use make, as that would create several dummy variables and increase the number of parameters to 12+, which is too many for the SubsetSelector class; we'll also drop price.

data = statsmodels.datasets.webuse('auto')
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
data.fillna(0, inplace=True)
data.head()

X = data.iloc[:, 3:]
y = data['mpg']

Forward Stepwise Selection

The ForwardSelector follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.
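The greedy loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the package's implementation: the helper names `ols_aic` and `forward_select` are hypothetical, the criterion is assumed to be the AIC of an ordinary least squares fit (computed up to an additive constant), and the data are synthetic.

```python
import numpy as np


def ols_aic(X, y):
    """AIC of an OLS fit with intercept, up to an additive constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k


def forward_select(X, y):
    """Greedy forward selection; returns a boolean mask over the columns of X."""
    n, p = X.shape
    selected = []
    best_score = ols_aic(np.empty((n, 0)), y)  # null (intercept-only) model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = {j: ols_aic(X[:, selected + [j]], y) for j in candidates}
        best_j = min(scores, key=scores.get)
        if scores[best_j] >= best_score:
            break  # no candidate improves the criterion: stop
        selected.append(best_j)
        best_score = scores[best_j]
    mask = np.zeros(p, dtype=bool)
    mask[selected] = True
    return mask


# Synthetic data (assumption): only columns 0 and 2 drive the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)
mask = forward_select(X_demo, y_demo)
```

On this data the mask should flag columns 0 and 2; because the AIC penalty is mild, a noise column may occasionally be admitted as well.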

The ForwardSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.

Parameter	Type	Description
normalize	bool	Whether to normalize features; default `False`
metric	str	Optimization metric to use; must be one of `aic` or `bic`; default `aic`

The ForwardSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.

selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)

ForwardSelector(normalize=True)

X.loc[:, selector.best_support_]

Best Subset Selection

The SubsetSelector follows a very simple algorithm: compare all possible models built from the $p$ candidate features, and select the model that minimizes the selection criterion. This algorithm is only appropriate for $p \le 12$ features, as it quickly becomes computationally expensive: there are $\binom{p}{k} = \frac{p!}{k!(p-k)!}$ possible models with exactly $k$ features, and $2^p$ possible models in total, where $p$ is the total number of candidate features and $k$ is the number of features included in the model.
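To get a feel for how quickly the search space grows, the number of candidate models can be counted with the standard library alone. The helper name `count_models` is hypothetical and only serves this illustration.

```python
from itertools import combinations
from math import comb


def count_models(p):
    """Number of models with exactly k of p features, for k = 0..p."""
    return [comb(p, k) for k in range(p + 1)]


counts = count_models(12)
total = sum(counts)  # 4096, i.e. 2 ** 12 models for 12 candidate features

# Enumerating the subsets themselves: all pairs drawn from 4 features.
subsets_of_size_2 = list(combinations(range(4), 2))  # 6 pairs
```

Each extra candidate feature doubles the total, which is why an exhaustive search is only practical for a small number of features.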

The SubsetSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.

Parameter	Type	Description
normalize	bool	Whether to normalize features; default `False`
metric	str	Optimization metric to use; must be one of `aic` or `bic`; default `aic`

The SubsetSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.

selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)

SubsetSelector(normalize=True)

X.loc[:, selector.get_support()]
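For intuition, the exhaustive search described in this section can be sketched as follows. This is not the package's implementation: the names `ols_aic` and `best_subset` are hypothetical, the criterion is assumed to be the AIC of an ordinary least squares fit (up to an additive constant), and the data are synthetic.

```python
from itertools import combinations

import numpy as np


def ols_aic(X, y):
    """AIC of an OLS fit with intercept, up to an additive constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def best_subset(X, y):
    """Score every non-empty subset of columns; return the best boolean mask."""
    n, p = X.shape
    best_score = ols_aic(np.empty((n, 0)), y)  # start from the null model
    best_cols = ()
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            score = ols_aic(X[:, list(cols)], y)
            if score < best_score:
                best_score, best_cols = score, cols
    mask = np.zeros(p, dtype=bool)
    mask[list(best_cols)] = True
    return mask


# Synthetic data (assumption): only columns 0 and 2 drive the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)
subset_mask = best_subset(X_demo, y_demo)
```

Unlike the greedy forward search, this loop fits every one of the $2^p$ candidate models, so its cost doubles with each added feature.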

Comparing the full model

Using the features selected by the SubsetSelector yields a model with 4 fewer parameters and slightly improved AIC and BIC metrics. The summaries indicate possible multicollinearity in both models, likely caused by weight, length, displacement and other features that are all related to the weight of a vehicle.

Note: Selection using BIC as the optimization metric yields a model where weight is the only selected feature. The Bayesian information criterion penalizes additional parameters more than AIC does.

mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()

mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
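The note above on BIC follows directly from the standard definitions: AIC adds a penalty of 2 per parameter, while BIC adds $\ln(n)$ per parameter, where $n$ is the number of observations. A small sketch using only the standard library; the function names and the sample size of 74 are illustrative assumptions.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


def penalty_per_parameter(n):
    """Cost each criterion charges for one additional parameter."""
    return {"aic": 2.0, "bic": math.log(n)}


# Illustrative sample size (assumption): ln(74) is about 4.3, more than twice
# the AIC penalty, so BIC favors smaller models.
penalties = penalty_per_parameter(74)
```

BIC is the stricter criterion whenever $\ln(n) > 2$, that is, for any sample with at least 8 observations.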

Use in Scikit-Learn Pipeline

Both ForwardSelector and SubsetSelector objects are compatible with Scikit-Learn Pipeline objects, and can be used as feature selection steps:

pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)

Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
                ('regression', LinearRegression())])

pl.score(X, y)

0.7097132531085899
