AutoPrep - Automated Preprocessing Pipeline with Univariate Anomaly Indicators

This pipeline focuses on data preprocessing, standardization, and cleaning, with additional features to identify univariate anomalies.

The pipeline is built on scikit-learn's Pipeline and Transformer concepts:
- Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- Transformer: https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html

pip install AutoPrep

Dependencies

scikit-learn
category_encoders
bitstring

Basic Usage

To use the pipeline, import the required libraries, create an AutoPrep instance, fit it on training data, and transform new data. Here is a basic example:

import numpy as np
import pandas as pd

from AutoPrep import AutoPrep

# Training data: 'Name' is constant and 'Is Manager' contains an empty string.
X_train = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Alice', 'Alice', 'Alice'],
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000.00, 60000.50, 75000.75, 8_000],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ''],
})

# Test data: adds an unseen name ('Bob'), a missing age, and an extreme salary.
X_test = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Alice', 'Alice', 'Bob'],
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000.00, 60000.50, 75000.75, 8_000_000],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ''],
})

# Keep columns without variance (such as 'Name' in X_train).
pipeline = AutoPrep(remove_columns_no_variance=False)

# Fit on the training data, then transform the test data.
pipeline.fit(X=X_train)
X_output = pipeline.transform(X=X_test)

X_output

Highlights ⭐

📌 Implementation of univariate methods / Detection of univariate anomalies

Both methods (the modified Z-score and Tukey's fences) are robust to outliers: they rely on the median and quartiles rather than the mean and standard deviation, so extreme values do not bias the location estimate. The resulting indicators also help downstream multivariate anomaly detection algorithms identify univariate anomalies.
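
As a rough illustration of the two methods, here is a minimal sketch in plain Python. It is not AutoPrep's internal code: the function names, the 3.5 cutoff for the modified Z-score, and the 1.5 multiplier for the Tukey fences are common textbook defaults assumed here, and the pipeline may use different settings.

```python
from statistics import median, quantiles


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    # MAD = median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined,
        # so this sketch simply reports 0.0 for every value.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Salary column of X_test from the usage example.
salaries = [50000.00, 60000.50, 75000.75, 8_000_000]

# Assumed cutoff: |modified Z-score| > 3.5 marks an anomaly.
z_flags = [abs(z) > 3.5 for z in modified_z_scores(salaries)]
t_flags = tukey_outliers(salaries)

print(z_flags)  # [False, False, False, True]
print(t_flags)  # [False, False, False, True]
```

Both checks flag only the 8,000,000 salary. The median and quartiles barely move when that value is added, which is why the fences stay tight around the ordinary salaries.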

📌 BinaryEncoder instead of OneHotEncoder for nominal columns / Big Data and Performance

Recent research reports that binary encoding achieves results similar to one-hot encoding for nominal columns while producing significantly fewer dimensions:

- John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." Journal of Big Data 7.1 (2020), pp. 1–41. See Tables 2 and 4.
- Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." World Conference on Information Systems and Technologies, 2021, pp. 146–155. See p. 151.
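
To make the dimensionality argument concrete, here is a small sketch of the idea behind binary encoding in plain Python. The helper names are hypothetical and this is not the category_encoders implementation; reserving code 0 for unseen categories is an assumption made for this example.

```python
def fit_binary_encoder(categories):
    """Assign each distinct category an integer code, starting at 1.

    Code 0 is reserved for categories not seen during fitting
    (an assumption of this sketch).
    """
    codes = {}
    for category in categories:
        if category not in codes:
            codes[category] = len(codes) + 1
    # Number of bits needed to represent the largest code.
    n_bits = max(1, len(codes).bit_length())
    return codes, n_bits


def binary_encode(value, codes, n_bits):
    """Return the code of `value` as a list of bits, most significant first."""
    code = codes.get(value, 0)
    return [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


# 'Rank' column from the usage example, plus one unseen category 'E'.
codes, n_bits = fit_binary_encoder(["A", "B", "C", "D"])
encoded = {rank: binary_encode(rank, codes, n_bits) for rank in ["A", "B", "C", "D", "E"]}

print(n_bits)        # 3
print(encoded["C"])  # [0, 1, 1]
print(encoded["E"])  # [0, 0, 0]
```

Four categories fit into 3 columns here instead of the 4 that one-hot encoding would need. The gap widens quickly: under the same scheme 1,000 categories need 10 columns rather than 1,000.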

📌 Transformation of time series data and standardization of data with RobustScaler / Normalization for better prediction results
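
For the scaling part, scikit-learn's RobustScaler by default centers each column on its median and divides by the interquartile range. The following plain-Python sketch reproduces that idea for a single column; the function name is hypothetical and the handling of constant columns is an assumption of this example, not a statement about AutoPrep's internals.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale values as (x - median) / IQR."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: fall back to a scale of 1.0 for constant columns
    # to avoid dividing by zero.
    scale = iqr if iqr != 0 else 1.0
    return [(v - med) / scale for v in values]


# 'Age' column of X_train from the usage example.
ages = [25, 30, 35, 40]
scaled = robust_scale(ages)

print([round(s, 3) for s in scaled])  # [-1.0, -0.333, 0.333, 1.0]
```

Because the median and quartiles are insensitive to a few extreme values, one outlier does not compress the remaining values into a narrow band the way mean and standard deviation scaling can.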

📌 Labeling of NaN values in an extra column instead of removing them / No loss of information
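
The idea can be sketched as follows in plain Python. The helper name, the `_is_missing` column suffix, and the choice of fill value are hypothetical and chosen only for illustration; AutoPrep's actual column naming and imputation strategy may differ.

```python
import math


def add_missing_indicator(rows, column, fill_value):
    """Return copies of `rows` with a 0/1 flag column marking missing values.

    Missing entries (None or NaN) are replaced by `fill_value`, so the row
    is kept and the flag records that the value was originally absent.
    """
    labeled = []
    for row in rows:
        value = row[column]
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        new_row = dict(row)
        new_row[f"{column}_is_missing"] = int(missing)
        if missing:
            new_row[column] = fill_value
        labeled.append(new_row)
    return labeled


# 'Age' column of X_test from the usage example; the last value is missing.
rows = [{"Age": 25}, {"Age": 30}, {"Age": 35}, {"Age": float("nan")}]
labeled = add_missing_indicator(rows, "Age", fill_value=30)

print(labeled[3])  # {'Age': 30, 'Age_is_missing': 1}
```

All four rows survive, and a downstream model can still learn from the fact that a value was missing.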

Pipeline - Built-in Logic

Reference

https://www.researchgate.net/publication/379640146_Detektion_von_Anomalien_in_der_Datenqualitatskontrolle_mittels_unuberwachter_Ansatze (German Thesis)
