Skippa

SciKIt-learn Pre-processing Pipeline in PAndas

Read more in the introductory blog post on towardsdatascience.

Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.

Skippa helps you create a pre-processing and modeling pipeline that is based on scikit-learn transformers but preserves the pandas dataframe format throughout all pre-processing. This makes it much easier to define a series of transformation steps that refer to columns in your intermediate dataframe.

So it is basically the same idea as scikit-pandas, but with a different (and hopefully better) way to achieve it.
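To illustrate the problem being addressed, here is a minimal sketch using plain scikit-learn and pandas only (no Skippa calls). It shows that, with default settings, a scikit-learn transformer returns a NumPy array, so column labels have to be restored by hand before a later step can refer to columns by name. The toy data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data (illustrative values only)
df = pd.DataFrame({'y': [1.0, 16.0, 1000.0], 'z': [0.4, 2.0, 8.7]})

# With default settings, fit_transform returns a NumPy array:
# the column names 'y' and 'z' are no longer attached to the result.
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# To keep referring to columns by name, the labels must be restored manually.
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(scaled_df.columns.tolist())
```

Doing this after every step is the bookkeeping that a dataframe-preserving pipeline is meant to remove.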

Installation

pip install skippa

Optional, if you want to use the gradio app functionality:

pip install skippa[gradio]

Basic usage

Import the Skippa class and the columns helper function:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

from skippa import Skippa, columns

Get some data:

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])

Define your pipeline:

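pipe = (
    Skippa()
        # keep only the feature columns used by the model
        .select(columns(['x', 'x2', 'y', 'z']))
        # treat the string columns as categoricals
        .cast(columns(['x', 'x2']), 'category')
        # fill missing values: median for numbers, most frequent value for categories
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        # standardize the numeric columns
        .scale(columns(dtype_include='number'), type='standard')
        # one-hot encode the categorical columns
        .onehot(columns(['x', 'x2']))
        # final estimator
        .model(LogisticRegression())
)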

Then use it for fitting and predicting like this:

pipe.fit(X=df, y=y)

predictions = pipe.predict_proba(df)

If you want details on your model, use:

model = pipe.get_model()
print(model.coef_)
print(model.intercept_)
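For comparison, the same steps could be written with plain scikit-learn roughly as follows. This is only a sketch of an approximately equivalent pipeline, not a description of what Skippa does internally; the step names ('num', 'cat', 'preprocess', and so on) and the handle_unknown='ignore' setting are choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same feature columns as in the example above
df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})
y = np.array([0, 0, 1])

# Numeric columns: median imputation, then standard scaling
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: most-frequent imputation, then one-hot encoding
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Columns must be routed to each sub-pipeline by name up front;
# the transformed output no longer carries the original column labels.
preprocess = ColumnTransformer([
    ('num', numeric, ['y', 'z']),
    ('cat', categorical, ['x', 'x2']),
])

sk_pipe = Pipeline([('preprocess', preprocess), ('model', LogisticRegression())])
sk_pipe.fit(df, y)
proba = sk_pipe.predict_proba(df)
print(proba.shape)
```

The chained Skippa version expresses the same intent as a single sequence of steps, with column selection by name or dtype at each step.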

(De)serialization

You can also save and load your model pipelines (for deployment). N.B. dill is used for serialization and deserialization because joblib and pickle don't provide enough support.

pipe.save('./models/my_skippa_model_pipeline.dill')

...

my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
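The note above does not say which pickle limitations motivated the choice of dill. One well-known limitation, shown here only as a plausible illustration and not as the confirmed reason, is that the standard pickle module cannot serialize lambda functions, whereas dill is designed to handle them. The sketch below uses only the standard library; the `double` lambda is a hypothetical stand-in for a user-supplied function inside a pipeline step.

```python
import pickle

# Hypothetical user-supplied function, e.g. one passed to an apply-style step
double = lambda value: value * 2

try:
    pickle.dumps(double)
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    # pickle stores functions by qualified name and cannot look up a lambda
    pickled_ok = False

print(pickled_ok)
```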

See the ./examples directory for more examples.

To Do

Support pandas assign for creating new columns based on existing columns
Support cast / astype transformer
Support for .apply transformer: wrapper around pandas.DataFrame.apply
Check how GridSearch (or other param search) works with Skippa
Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
Support PCA transformer
Facilitate random seed in Skippa object that is dispatched to all downstream operations
Fix lazy evaluation in fit-transform: casting to category and then selecting category columns doesn't work; each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
Investigate if Skippa can directly extend sklearn's Pipeline -> using getitem trick
Use sklearn's new dataframe output setting
Validation of pipeline steps
Input validation in transformers
Transformer for replacing values (pandas .replace)
Support arbitrary transformer (if column-preserving)
Eliminate the need to call columns explicitly

Credits

Skippa is powered by Data Science Lab Amsterdam
This project structure is based on the audreyr/cookiecutter-pypackage project template.
