Easy TensorFlow/Keras Feature Preprocessing Pipelines
MIT License
The EasyFlow package implements an interface similar to scikit-learn's Pipeline API. It provides easy-to-use feature preprocessing pipelines for building a full training and inference pipeline natively in Keras. All pipelines are implemented as Keras layers.
There is a need for an interface in Keras that mimics the scikit-learn Pipeline API, such as Pipeline, FeatureUnion and ColumnTransformer, but implemented natively as Keras layers. The usual design pattern, especially for tabular data, is to do the preprocessing with scikit-learn first and then feed the data to a Keras model. With EasyFlow you don't need to leave the TensorFlow/Keras ecosystem to build custom pipelines, and your preprocessing pipeline is part of your model architecture.
Main interfaces are:
FeaturePreprocessor: This layer applies the supplied feature preprocessing steps and returns a separate layer for each step. This gives the user more flexibility when a more advanced network architecture is needed, for example a Wide and Deep network.
FeatureUnion: This layer is similar to FeaturePreprocessor, with an extra step that concatenates all layers into a single layer.
Install the package with pip:
pip install easy-tensorflow
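To make the difference between the two main interfaces concrete, here is a conceptual sketch in plain Python. It is not EasyFlow's implementation: `apply_steps`, `apply_and_concatenate` and the two toy transforms are hypothetical stand-ins for Keras layers, and the only thing taken from the description above is that one interface returns one output per step while the other joins them into a single output.

```python
# Conceptual sketch only: plain-Python stand-ins, not EasyFlow or Keras code.
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
# A step is (name, transform, feature names), mirroring a pipeline step tuple.
Step = Tuple[str, Callable[[List[float]], List[float]], Sequence[str]]


def apply_steps(row: Row, steps: Sequence[Step]) -> Dict[str, List[float]]:
    """Return one output per step, keyed by step name (FeaturePreprocessor-style)."""
    return {
        name: transform([row[column] for column in columns])
        for name, transform, columns in steps
    }


def apply_and_concatenate(row: Row, steps: Sequence[Step]) -> List[float]:
    """Join every step's output into one flat vector (FeatureUnion-style)."""
    outputs = apply_steps(row, steps)
    return [value for name, _, _ in steps for value in outputs[name]]


# Hypothetical transforms standing in for preprocessing layers.
def scale_by_100(values: List[float]) -> List[float]:
    return [value / 100.0 for value in values]


def identity(values: List[float]) -> List[float]:
    return list(values)


steps = [
    ("numeric", scale_by_100, ["age", "chol"]),
    ("categorical", identity, ["sex"]),
]
row = {"age": 50.0, "chol": 200.0, "sex": 1.0}

separate = apply_steps(row, steps)            # one entry per step
combined = apply_and_concatenate(row, steps)  # a single flat vector
```

Keeping the outputs separate is what makes it possible to route, say, the categorical output to a "wide" branch and the numeric output to a "deep" branch. The concatenated form suits a single dense stack.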
Let's look at a quick example:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Normalization, StringLookup, IntegerLookup
# local imports
from easyflow.data import TensorflowDataMapper
from easyflow.preprocessing import FeatureUnion
from easyflow.preprocessing import (
    FeatureInputLayer,
    StringToIntegerLookup,
)
Use the TensorflowDataMapper class to map a pandas DataFrame to a tf.data.Dataset.
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)
labels = dataframe.pop("target")
batch_size = 32
dataset_mapper = TensorflowDataMapper()
dataset = dataset_mapper.map(dataframe, labels)
train_data_set, val_data_set = dataset_mapper.split_data_set(dataset)
train_data_set = train_data_set.batch(batch_size)
val_data_set = val_data_set.batch(batch_size)
NUMERICAL_FEATURES = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']
CATEGORICAL_FEATURES = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca']
# thal is represented as a string
STRING_CATEGORICAL_FEATURES = ['thal']
dtype_mapper = {
    "age": tf.float32,
    "sex": tf.float32,
    "cp": tf.float32,
    "trestbps": tf.float32,
    "chol": tf.float32,
    "fbs": tf.float32,
    "restecg": tf.float32,
    "thalach": tf.float32,
    "exang": tf.float32,
    "oldpeak": tf.float32,
    "slope": tf.float32,
    "ca": tf.float32,
    "thal": tf.string,
}
This is the main part where EasyFlow fits in. We can now easily set up a feature preprocessing pipeline as a Keras layer with only a few lines of code.
# Each step is a (name, preprocessing layer, feature names) tuple.
feature_preprocessor_list = [
    ('numeric_encoder', Normalization(), NUMERICAL_FEATURES),
    ('categorical_encoder', IntegerLookup(output_mode='multi_hot'), CATEGORICAL_FEATURES),
    ('string_encoder', StringToIntegerLookup(), STRING_CATEGORICAL_FEATURES),
]
preprocessor = FeatureUnion(feature_preprocessor_list)
preprocessor.adapt(train_data_set)
feature_layer_inputs = FeatureInputLayer(dtype_mapper)
preprocessing_layer = preprocessor(feature_layer_inputs)
# set up a simple network
x = tf.keras.layers.Dense(128, activation="relu")(preprocessing_layer)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=feature_layer_inputs, outputs=outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy'),
        tf.keras.metrics.AUC(name='auc'),
    ],
)
history = model.fit(train_data_set, validation_data=val_data_set, epochs=10)
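The `adapt` call in the example is the step where the preprocessing layers learn their state from the training data. As a rough illustration of what that means, the plain-Python sketch below learns a mean and variance for a numeric column (the statistics Keras's Normalization layer is documented to use) and a string-to-index vocabulary with index 0 reserved for unseen values (as in the default StringLookup configuration). The helper names and the toy values are hypothetical, the tie-breaking rule for the vocabulary order is an arbitrary choice made here, and nothing below is EasyFlow or Keras code.

```python
# Plain-Python illustration of "adapting" preprocessing steps.
# These helpers are hypothetical stand-ins, not Keras or EasyFlow code.
from collections import Counter
from math import sqrt


def fit_normalizer(values):
    """Learn the mean and population variance of a numeric column."""
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return mean, variance


def normalize(value, mean, variance):
    """Shift and scale a value using previously learned statistics."""
    return (value - mean) / sqrt(variance)


def fit_vocabulary(values):
    """Map each distinct string to an index starting at 1; 0 means 'unseen'."""
    counts = Counter(values)
    # Most frequent first; ties broken alphabetically (an arbitrary choice here).
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered, start=1)}


def lookup(token, vocabulary):
    """Return the learned index, or 0 for an out-of-vocabulary token."""
    return vocabulary.get(token, 0)


# Toy training values; state is learned from the training split only.
train_ages = [40.0, 50.0, 60.0]
train_thal = ["normal", "fixed", "normal", "reversible", "normal", "fixed"]

mean, variance = fit_normalizer(train_ages)
vocabulary = fit_vocabulary(train_thal)

# The learned state is then reused unchanged on new data.
scaled_age = normalize(70.0, mean, variance)
known_index = lookup("fixed", vocabulary)
unseen_index = lookup("unknown", vocabulary)
```

Because this state lives inside the layers, it is saved and loaded together with the model, which is the practical benefit of making preprocessing part of the model architecture.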
The easyflow.preprocessing module contains functionality similar to scikit-learn's Pipeline, FeatureUnion and ColumnTransformer. This is a quick introduction.