A simple AutoML tool for small datasets with useful helper functions
MIT License
ZipML is a lightweight AutoML library designed for small datasets, offering essential helper functions like train-test splitting, model comparison, and confusion matrix generation.
Install the package via pip:
pip install zipml
Alternatively, clone the repository:
git clone https://github.com/abdozmantar/zipml.git
cd zipml
pip install .
Here's a practical example of how to use ZipML:
import pandas as pd
from zipml.model import analyze_model_predictions
from zipml.model import calculate_model_results
from zipml.visualization import save_and_plot_confusion_matrix
from zipml.data import split_data
from zipml import compare_models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
# Sample dataset
data = {
    'feature_1': [0.517, 0.648, 0.105, 0.331, 0.781, 0.026, 0.048],
    'feature_2': [0.202, 0.425, 0.643, 0.721, 0.646, 0.827, 0.303],
    'feature_3': [0.897, 0.579, 0.014, 0.167, 0.015, 0.358, 0.744],
    'feature_4': [0.457, 0.856, 0.376, 0.527, 0.648, 0.534, 0.047],
    'feature_5': [0.046, 0.118, 0.222, 0.001, 0.969, 0.239, 0.203],
    'target': [0, 1, 1, 1, 1, 1, 0]
}
# Creating DataFrame
df = pd.DataFrame(data)
# Splitting data into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = split_data(X, y)
# Define models
models = [
    RandomForestClassifier(),
    LogisticRegression(),
    GradientBoostingClassifier()
]
# Compare models and select the best one
best_model, performance = compare_models(models, X_train, X_test, y_train, y_test)
print(f"Best model: {best_model} with performance: {performance}")
# Calculate performance metrics for the best model
best_model_metrics = calculate_model_results(y_test, best_model.predict(X_test))
# Analyze model predictions
val_df, most_wrong = analyze_model_predictions(best_model, X_test, y_test)
# Save and plot confusion matrix
save_and_plot_confusion_matrix(y_test, best_model.predict(X_test), save_path="confusion_matrix.png")
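For readers curious about what a split, compare, and confusion-matrix workflow like the one above involves, here is a minimal sketch that uses only scikit-learn. It is not ZipML's implementation: the helper name pick_best_by_accuracy, the choice of accuracy as the ranking metric, the synthetic dataset, and the 75/25 stratified split are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def pick_best_by_accuracy(models, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit each model, score it on the test set,
    and return (best_model, {model_name: accuracy})."""
    scores = {}
    best_model, best_score = None, -1.0
    for model in models:
        model.fit(X_train, y_train)
        score = accuracy_score(y_test, model.predict(X_test))
        scores[type(model).__name__] = score
        if score > best_score:
            best_model, best_score = model, score
    return best_model, scores


# Synthetic data stands in for a small real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified split keeps both classes present in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = [
    RandomForestClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(random_state=0),
]

best_model, scores = pick_best_by_accuracy(models, X_train, X_test, y_train, y_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, best_model.predict(X_test))

print(type(best_model).__name__, scores)
print(cm)
```

On a dataset as small as the seven-row sample above, a single held-out split gives very noisy scores, so cross-validation may be a better basis for ranking models in practice.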
You can run ZipML from the command line using the following commands:
zipml --train train.csv --test test.csv --model randomforest --result results.json
--train: Path to the training dataset CSV file.
--test: Path to the testing dataset CSV file.
--model: Name of the model to be trained (e.g., randomforest, logisticregression, gradientboosting).
--result: Path to the JSON file where results will be saved.
To compare multiple models:
zipml --train train.csv --test test.csv --compare --compare_models randomforest svc knn --result results.json
--compare: A flag that enables multiple model comparison.
--compare_models: A list of models to compare (e.g., randomforest, logisticregression, gradientboosting).
--result: Path to the JSON file where comparison results will be saved.
To load a saved model and make predictions:
zipml --load_model trained_model.pkl --test test.csv --result predictions.json
--load_model: Path to the saved model file.
--test: Path to the testing dataset CSV file.
--result: Path to the JSON file where predictions will be saved.
To save the trained model after training:
zipml --train train.csv --test test.csv --model randomforest --save_model trained_model.pkl
--save_model: Path to the file where the trained model will be saved.
To contribute:
Create your feature branch (git checkout -b feature/foo).
Commit your changes (git commit -am 'Add some foo').
Push to the branch (git push origin feature/foo).
Abdullah OZMANTAR (GitHub: @abdozmantar)
This project is licensed under the MIT License - see the LICENSE file for details.