Data Analysis Toolkit

DataAnalysisToolkit is a Python package that bundles tools for common data analysis tasks: loading CSV data, performing statistical analysis, cleaning data, and visualizing results. It is aimed at data analysts, data scientists, and anyone getting started with data exploration and machine learning.

Features

Data Loading: Load data directly from CSV files into a Python environment.
Statistical Analysis: Perform calculations like mean, median, mode, and trimmed mean.
Outlier Detection: Identify outliers using the z-score method.
Data Cleaning: Handle missing values, drop duplicates, and encode categorical data.
Data Splitting: Easily split data into training and testing sets for machine learning models.
Data Visualization: Create histograms and other plots to explore data visually.
Data Export: Export cleaned and processed data back into CSV format.
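
The statistics and the z-score method listed above can be illustrated with a minimal standard-library sketch. The function names trimmed_mean and zscore_outliers, the proportion and threshold parameters, and the sample data are assumptions made for illustration; this is not the toolkit's API or its actual implementation.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sample column with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

print(statistics.mean(data), statistics.median(data), statistics.mode(data))
print(trimmed_mean(data, proportion=0.1))    # 11.75
print(zscore_outliers(data, threshold=2.5))  # [95]
```

The example uses a threshold of 2.5 rather than the common 3. With the population standard deviation, the largest possible absolute z-score in a sample of n values is sqrt(n - 1), so a threshold of 3 can never be exceeded in a 10-value sample.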

Enhanced Functionalities

Advanced Visualization: Utilize a dedicated visualizer for creating a variety of insightful data plots.
Feature Engineering: Enhance your data with new, informative features.
Model Evaluation: Assess the performance of machine learning models.
Report Generation: Automatically generate comprehensive HTML reports with summaries and visualizations.
Data Imputation: Implement advanced imputation techniques to handle missing data.
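
As a rough illustration of what model evaluation typically computes, the sketch below derives accuracy, precision, recall, and F1 for a binary classifier using plain Python. The function name binary_classification_report, its positive parameter, and the label lists are hypothetical; the README does not say which metrics the toolkit reports.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))

    # Confusion-matrix counts for the positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard each ratio against a zero denominator
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = binary_classification_report(y_true, y_pred)
print(report)
```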

This toolkit is suited to preliminary data analysis and can be used as one step in a larger data processing workflow.

Getting Started

Here's how you can get started with DataAnalysisToolkit:

from data_analysis_toolkit import DataAnalysisToolkit

# Initialize the analyzer with the path to a CSV file
analyzer = DataAnalysisToolkit('../data/test.csv')


# Calculate the mean, median, mode, and trimmed mean of a column
statistics = analyzer.calculate_statistics('column_name')
print(statistics)

# Detect outliers in a column using the z-score method
outliers = analyzer.detect_outliers('column_name')
print(outliers)

# Handle missing values in a column
analyzer.handle_missing_values('column_name', strategy='fill', fill_value=0)

# Drop duplicate rows in the DataFrame
analyzer.drop_duplicates()

# Encode categorical features in the DataFrame
analyzer.encode_categorical_features()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = analyzer.split_data('target_column')

# Plot a histogram of a column
analyzer.plot_data('column_name')

# Export the data to a CSV file
analyzer.export_data('new_file.csv')
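
To make the shape of the four values returned by the split step concrete, here is a minimal standard-library sketch of a seeded train/test split. The function name split_rows, the test_size and seed parameters, and the sample rows are assumptions for illustration; the README does not state the split ratio or the internals of split_data.

```python
import random


def split_rows(rows, target, test_size=0.2, seed=42):
    """Shuffle rows, then split them into train/test features and targets."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = max(1, round(len(shuffled) * test_size))
    test, train = shuffled[:n_test], shuffled[n_test:]

    def features(part):
        # Every column except the target
        return [{k: v for k, v in row.items() if k != target} for row in part]

    def targets(part):
        return [row[target] for row in part]

    return features(train), features(test), targets(train), targets(test)


# Hypothetical rows, as they might look after loading a CSV file
rows = [{"age": 20 + i, "income": 1000 * i, "label": i % 2} for i in range(10)]
X_train, X_test, y_train, y_test = split_rows(rows, "label", test_size=0.2)
print(len(X_train), len(X_test))  # 8 2
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, and fixing the seed keeps the split identical between runs.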

Installation

Install DataAnalysisToolkit using pip:

pip install dataanalysistoolkit

Documentation

For detailed documentation, examples, and usage guides, please visit DataAnalysisToolkit Documentation.

Contributing

Contributions are welcome! For guidelines on how to contribute, please refer to our Contribution Guide.

License

DataAnalysisToolkit is open-sourced under the MIT License. For more details, see the LICENSE file.

Developed with ❤ by the DataAnalysisToolkit Team.
