DataDoctor is a Python package for data cleaning and preprocessing. It provides various methods to treat common issues in data such as missing values, duplicate records, inconsistent data formats, outliers, inconsistent naming conventions, data entry errors, and more.
MIT License
The package uses popular libraries such as pandas, numpy, scikit-learn, fuzzywuzzy, and chardet.
- pandas: a Python library for data analysis and manipulation that provides fast, flexible, and expressive data structures such as DataFrame and Series.
- numpy: a Python library for scientific computing that adds support for large, multi-dimensional arrays and matrices, along with high-level mathematical functions to operate on these arrays.
- scikit-learn: a Python library for machine learning that provides simple and efficient tools for data mining and data analysis, such as classification, regression, clustering, and dimensionality reduction.
- fuzzywuzzy: a Python library for fuzzy string matching that uses the Levenshtein distance to calculate the similarity between two strings.
- python-Levenshtein: a Python extension module that provides fast computation of Levenshtein distance and string operations based on it.
- chardet: a Python library that can automatically detect the character encoding of a text file or a byte string.
Follow the instructions below to get started with DataDoctor().
pip install DataDoctor
or, inside a Jupyter notebook cell:
!pip install DataDoctor
from data_doctor import DataDoctor
doctor = DataDoctor()
data = "Your_Data.csv"  # placeholder: replace with your own data
The DataDoctor() class provides a variety of methods for treating common data issues. These methods include:
- treat_missing_data(): treats missing data in the loaded dataset by applying an imputation technique based on the data type of each column.
- treat_duplicate_records(): treats duplicate records in the loaded dataset by removing them.
- treat_inconsistent_data_formats(): treats inconsistent data formats in the loaded data by converting all values in string columns to lowercase.
- treat_inaccurate_data_entries(): treats inaccurate data entries in string columns of the loaded data by replacing them with the most frequent value in the column.
- treat_outliers(): treats outliers in numerical columns of the loaded data by removing them using the Isolation Forest algorithm.
- treat_inconsistent_naming_conventions(): treats inconsistent naming conventions in the loaded data by converting all column names to lowercase.
- treat_data_entry_errors(): treats data entry errors in string columns of the loaded data by replacing incorrect or similar-but-incorrect entries with the most similar valid value.
- treat_inconsistent_units_of_measurement(): treats inconsistent units of measurement in numerical columns of the loaded data by converting the values to a consistent unit of measurement.
- treat_incorrect_data_types(): treats incorrect data types in the loaded data by converting string columns that should contain numeric data to their appropriate numeric data type.
- treat_invalid_values(): treats invalid values in numerical columns of a dataset by replacing them with np.nan (a missing value indicator).
- treat_inconsistent_or_conflicting_values(): treats inconsistent or conflicting values in string columns of the loaded data by resolving cases where there are conflicting values within the same entry or inconsistent values across different entries in a column.
- treat_encoding_errors(): treats encoding errors in string columns of the loaded dataset by determining the encoding of each string column and then decoding the values using the detected encoding.
- treat_inconsistent_date_and_time_formats(): treats inconsistent date and time formats in string columns of a dataset by converting the values to a consistent date and time format.
- treat_inconsistent_variable_names(): treats inconsistent variable names in the loaded data by replacing non-word characters with underscores and converting them to lowercase.
- treat_inconsistent_capitalization_or_punctuation(): treats inconsistent capitalization or punctuation in column names within the loaded dataset by replacing non-word characters with underscores and converting them to lowercase.
- treat_spelling_or_typographical_errors(): treats spelling or typographical errors in string columns of the loaded data by replacing incorrect or misspelled values with the most similar valid value from a given set of valid names.
- generate_report(): generates a report that provides information about the data treatment steps performed on the loaded data. It gives an overview of the issues addressed and the changes made to the data.
Each method has its own advantages and benefits in improving data quality and consistency. The DataDoctor class offers a comprehensive toolkit for data treatment and cleaning.
DataDoctor() Class:
The load_data() method is part of the DataDoctor() class. It loads data into the DataDoctor() object for further treatment and analysis by initialising the object's dataframe attribute with the provided data.
The following code loads the data:
doctor.load_data(data)
Advantages:
Loading the data makes it available to the DataDoctor() class, allowing you to perform a variety of data treatments and analyses on the loaded data.
The treat_missing_data() method is used to treat missing data in the loaded dataset. It handles missing values by applying an imputation technique based on the data type of each column. For numerical columns, it uses the IterativeImputer class to estimate missing values based on other features. For non-numerical columns, it uses the SimpleImputer class with the 'most_frequent' strategy to replace missing values with the most frequent value in the column.
The following code applies this treatment:
doctor.treat_missing_data()
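The imputation strategy described above can be sketched directly with scikit-learn. This is a minimal illustration, not the package's actual implementation: the helper name impute_by_dtype and the sample columns are hypothetical, and it assumes missing values are stored as np.nan.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer


def impute_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute numeric and non-numeric columns separately."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(num_cols)

    if len(num_cols) > 0:
        # Numeric columns: estimate each missing value from the other numeric features.
        out[num_cols] = IterativeImputer(random_state=0).fit_transform(out[num_cols])
    if len(other_cols) > 0:
        # Non-numeric columns: fill with the most frequent value of the column.
        out[other_cols] = SimpleImputer(strategy="most_frequent").fit_transform(out[other_cols])
    return out


sample = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "income": [50.0, 62.0, np.nan, 80.0],
    "city": pd.Series(["Pune", "Delhi", np.nan, "Delhi"], dtype=object),
})
imputed = impute_by_dtype(sample)
print(imputed)
```

Splitting by data type matters because model-based imputation only makes sense for numeric features, while a most-frequent fill works for any column.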
Advantages:
Why use this feature from DataDoctor():
The treat_duplicate_records() method is part of the DataDoctor() class and is used to treat duplicate records in the loaded dataset. Duplicate records are dataset rows with identical values across all columns. These duplicates can occur for various reasons, such as data entry errors, system issues, or merging datasets.
The following code applies this treatment:
doctor.treat_duplicate_records()
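For reference, removing rows that are identical across all columns can be sketched with plain pandas. This is an illustrative equivalent under the definition given above, not necessarily how the package implements it; the sample data is made up.

```python
import pandas as pd

records = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Ana", "Ben", "Ben", "Cy"],
})

# Count rows that repeat an earlier row in every column, then drop them.
n_duplicates = int(records.duplicated().sum())
deduplicated = records.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate(s) removed")
print(deduplicated)
```

Counting the duplicates before dropping them makes it easy to review how many rows would be affected.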
Advantages:
Why use this feature from DataDoctor():
The treat_duplicate_records() method from DataDoctor() provides a convenient and automated way to handle duplicate records in a dataset. It saves time and effort compared to manually identifying and removing duplicates. By using this feature, you can ensure data quality, consistency, and accuracy, leading to more reliable analysis and decision-making based on clean and unique data.
Note: It's important to consider the context and nature of the dataset before applying this method. In some cases, duplicate records may be intentional and hold important information. Therefore, it's recommended to review the data and consult domain experts if necessary before treating duplicates.
This method is part of the DataDoctor() class and is used to treat inconsistent data formats in the loaded data. It performs the following operation:
- It converts all values in string columns to lowercase using the str.lower() method. This ensures consistent capitalisation for all string values within the column.
Inconsistent data formats can pose challenges when working with data analysis and processing tasks. Inconsistent capitalization within string values can lead to difficulties in searching, grouping, and analyzing data. By applying the treat_inconsistent_data_formats() method, you can achieve a consistent format for all string values within the dataset by converting them to lowercase. This enhances data consistency, simplifies data manipulation, and enables accurate comparison and analysis of string values.
The following code applies this treatment:
doctor.treat_inconsistent_data_formats()
Advantages and Benefits:
Why use this feature from DataDoctor():
The treat_inconsistent_data_formats() method is a valuable feature provided by the DataDoctor() class. Using this feature, you can easily address inconsistent capitalisation within string values, improving data quality, consistency, and ease of analysis. It helps you overcome common challenges associated with inconsistent data formats and ensures compatibility with various data processing tasks. Incorporating this feature as part of your data treatment workflow can significantly enhance the reliability and usability of your dataset.
This method is part of the DataDoctor class and is used to treat inaccurate data entries in string columns of the loaded data. It identifies valid values for each string column and replaces any entries that do not match the valid values with np.nan.
Inaccurate data entries are values in a string column that do not correspond to valid or expected values. These can occur for various reasons, such as human error during data entry or inconsistencies in the data source. The treat_inaccurate_data_entries() method aims to address this issue by identifying inaccurate entries and replacing them with missing values (np.nan).
The following code applies this treatment:
doctor.treat_inaccurate_data_entries()
Advantages and Benefits:
Why use this feature from DataDoctor():
While the treat_inaccurate_data_entries() method is effective in treating inaccurate data entries, it relies on the assumption that there are predefined valid values for each string column. If valid values are not known or cannot be determined, alternative data treatment approaches may be necessary.
It's important to review the generated report after applying this method to identify the number of errors detected and the columns affected. This information helps in understanding the impact of the treatment and the quality of the dataset.
Example:
Suppose you have a dataset containing a "Gender" column with entries such as "M", "Male", "F", "Female", and some inaccurate entries like "Man" and "Fem". Calling treat_inaccurate_data_entries() on this dataset will identify these inaccurate entries and replace them with np.nan, resulting in a more consistent representation of gender information.
The treat_outliers() method is a feature provided by the DataDoctor() class to treat outliers in numerical columns of the loaded data. Outliers are data points that deviate significantly from the normal range of values in a dataset and can distort the analysis or modelling process. This method uses the Isolation Forest algorithm to detect and remove outliers from the data.
The following code applies this treatment:
doctor.treat_outliers()
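Outlier removal with Isolation Forest, as described above, can be sketched with scikit-learn. This is a rough illustration on synthetic data, not the package's own code; the column name, the injected outliers, and the default contamination setting are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 ordinary values plus two extreme ones.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50.0, 5.0, size=200), [500.0, -400.0]])
df = pd.DataFrame({"value": values})

# fit_predict returns 1 for inliers and -1 for points flagged as outliers.
forest = IsolationForest(random_state=0)
labels = forest.fit_predict(df[["value"]])

# Keep only the rows the model considers inliers.
cleaned = df[labels == 1].reset_index(drop=True)
print(f"Removed {len(df) - len(cleaned)} rows flagged as outliers")
```

With default settings the model may also flag some ordinary points near the edges of the distribution, so it is worth checking how many rows were removed before continuing.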
Advantages:
- The treat_outliers() method helps improve the data's quality and reliability, leading to more accurate analysis and modelling.
- The treat_outliers() method helps in building more robust and reliable models that generalise well to unseen data.
Why use this feature from DataDoctor():
The treat_outliers() method provides a convenient way to handle outliers within numerical columns of a dataset. Outliers can be detrimental to the accuracy and reliability of data analysis and modelling tasks. By using this feature from DataDoctor(), you can easily detect and remove outliers, ensuring that your data is clean, representative, and suitable for further analysis or modelling purposes. The method's integration within the DataDoctor() class allows for a comprehensive and systematic treatment of various data issues, providing a streamlined workflow for data preprocessing and quality improvement.
This method is part of the DataDoctor() class and is used to treat inconsistent naming conventions in the loaded dataset. It ensures that all column names are in a consistent format by converting them to lowercase.
Inconsistent naming conventions are variations in the capitalisation or formatting of column names within the dataset. For example, some column names may be written in uppercase, some in lowercase, and some may contain special characters or spaces. Such inconsistencies can make it difficult to work with the data and can lead to errors in data analysis and processing.
The treat_inconsistent_naming_conventions() method addresses this issue by converting the column names to lowercase. This ensures uniformity and makes it easier to refer to and manipulate the columns in subsequent data operations.
The following code applies this treatment:
doctor.treat_inconsistent_naming_conventions()
Advantages:
Why use this feature from DataDoctor():
The treat_inconsistent_naming_conventions() feature from DataDoctor is beneficial in data preprocessing and data cleaning tasks. By applying this method, you can ensure that your dataset has consistent and standardised column names, which is essential for maintaining data quality and enabling smooth data analysis and manipulation.
Using this feature helps to eliminate potential naming-related issues, reduces the likelihood of errors caused by inconsistent naming conventions, and enhances the overall quality and usability of the dataset. It promotes good data hygiene practices and prepares the dataset for further analysis, modelling, or integration with other data systems.
The treat_data_entry_errors() method is a feature provided by the DataDoctor() class that helps treat data entry errors in string columns of a given dataset. This method is designed to identify and correct inaccurate or misspelled entries in the data.
Data entry errors can occur when human input or data collection processes result in inconsistencies or inaccuracies in the recorded data. Such errors can include misspellings, typos, or variations in naming conventions. The treat_data_entry_errors() method aims to identify and rectify these errors by comparing the entries with a set of valid values and replacing incorrect or similar-but-incorrect entries with the most similar valid value.
The following code applies this treatment:
doctor.treat_data_entry_errors()
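The idea of replacing an entry with its most similar valid value can be illustrated with a small sketch. DataDoctor relies on fuzzywuzzy for fuzzy matching; the example below instead uses difflib from the Python standard library as a stand-in so that it runs without extra dependencies. The helper name correct_entry, the list of valid values, and the 0.6 similarity cutoff are all hypothetical.

```python
from difflib import get_close_matches


def correct_entry(value, valid_values, cutoff=0.6):
    """Return the most similar valid value, or the original if none is close enough."""
    matches = get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else value


valid_cities = ["London", "Paris", "Berlin"]
raw = ["Londn", "paris", "Berlin", "Tokyo"]

# Entries close to a valid value are corrected; unrelated entries are left untouched.
corrected = [correct_entry(v, valid_cities) for v in raw]
print(corrected)
```

The cutoff controls how aggressive the correction is: a lower value corrects more entries but increases the risk of replacing a legitimate value with the wrong one.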
Advantages:
The treat_data_entry_errors() method offers several advantages in the data treatment process:
Why use this feature from DataDoctor():
The treat_data_entry_errors() feature from DataDoctor() is beneficial in scenarios where the dataset contains string columns with potential data entry errors. By utilising this feature, you can:
- Improve Data Quality - By automatically correcting inaccurate or misspelled entries, the feature enhances the overall quality and integrity of the dataset.
- Increase Data Consistency - Inconsistencies in data entries can hinder analysis and modelling tasks. The feature helps to standardise and harmonise the data by replacing incorrect entries with the most similar valid values.
- Save Manual Effort - Correcting data entry errors manually can be time-consuming and error-prone. The feature automates the error correction process, reducing the need for manual intervention and saving valuable time and effort.
- Enhance Data Analysis - Accurate and consistent data allows for more reliable analysis, modelling, and decision-making. By using the treat_data_entry_errors() feature, you can ensure the reliability of the data and derive more accurate insights from the dataset.
This method is part of the DataDoctor() class and is used to treat inconsistent units of measurement in numerical columns of the loaded data. It converts the values in the numerical columns to a consistent unit of measurement.
In many datasets, numerical columns may contain values measured in different units, leading to inconsistencies and hindering accurate analysis or modelling. The treat_inconsistent_units_of_measurement() method addresses this issue by applying a conversion factor to the numerical values, ensuring that they are consistent and comparable.
The following code applies this treatment:
doctor.treat_inconsistent_units_of_measurement()
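To make the notion of a conversion factor concrete, here is a small standalone sketch. It is only an illustration of the general technique: the README does not say how DataDoctor determines the unit of each value, so the explicit unit labels, the TO_KG table, and the to_kilograms helper are assumptions made for this example.

```python
# Hypothetical conversion factors from each unit to kilograms.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kilograms(value, unit):
    """Convert a value expressed in `unit` to kilograms using a conversion factor."""
    return value * TO_KG[unit]


# Measurements recorded in mixed units.
measurements = [(2.0, "kg"), (500.0, "g"), (10.0, "lb")]

# After conversion every value is expressed in the same unit and can be compared.
in_kg = [to_kilograms(value, unit) for value, unit in measurements]
print(in_kg)
```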
Advantage:
Treating inconsistent units of measurement is important because it brings uniformity to the data, allowing for meaningful comparisons and analysis. By applying a conversion factor, this method ensures that all numerical values in the dataset are expressed in a single unit, making it easier to interpret and analyse the data.
Why use this feature from DataDoctor():
Using the treat_inconsistent_units_of_measurement() feature from the DataDoctor() class provides several benefits:
The treat_incorrect_data_types() method in the DataDoctor() class is used to treat incorrect data types in the loaded data. It is designed to handle situations where columns that should contain numeric data are mistakenly encoded as string data. By identifying such columns and converting their values to numeric data, this method helps ensure the data is in the appropriate format for further analysis or processing.
Incorrect data types can occur for various reasons, such as data entry errors or formatting issues during data collection. Numeric values stored as strings can lead to incorrect calculations, inaccurate analysis, or unexpected errors in subsequent data operations. The treat_incorrect_data_types() method helps resolve these issues by attempting to convert string columns to their appropriate numeric data type.
The following code applies this treatment:
doctor.treat_incorrect_data_types()
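A conversion of this kind can be sketched with pandas.to_numeric. The example below is an assumption about one reasonable approach, not the package's actual logic: the helper name convert_numeric_strings is hypothetical, and a column is converted only when every non-missing value parses as a number.

```python
import pandas as pd


def convert_numeric_strings(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: convert columns whose values all parse as numbers."""
    out = frame.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # Convert only when no non-missing value was lost during parsing.
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out


df = pd.DataFrame({"price": ["10", "12.5", "7"], "label": ["a", "b", "c"]})
typed = convert_numeric_strings(df)
print(typed.dtypes)
```

Checking that nothing was coerced to a missing value avoids silently wiping out genuinely textual columns.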
Advantages:
Why use this feature from DataDoctor():
The treat_incorrect_data_types() feature from DataDoctor offers an automated and systematic approach to address the issue of incorrect data types in a dataset. By utilising this feature, you can benefit from the following:
The treat_invalid_values() method is a feature provided by the DataDoctor() class. It is used to treat invalid values in numerical columns of a dataset.
Invalid values are values that fall outside the expected range or are not meaningful in the context of the data. These values can be problematic because they can introduce errors and inconsistencies in the analysis or modelling process. This method aims to identify and handle such invalid values effectively.
The following code applies this treatment:
doctor.treat_invalid_values()
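As an illustration of replacing out-of-range values with np.nan, consider the sketch below. The README does not state how DataDoctor decides which values are invalid, so the valid range of 0 to 120 for an age column is purely a hypothetical rule chosen for this example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([34, -5, 51, 230, 18], dtype="float64")

# Hypothetical validity rule: an age must lie between 0 and 120 inclusive.
is_valid = ages.between(0, 120)

# Keep valid values and replace everything else with np.nan.
treated = ages.where(is_valid, np.nan)
print(treated)
```

Marking invalid values as missing, rather than deleting the rows, keeps the rest of each record available and lets a later imputation step fill the gaps.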
Advantages:
Why use this feature from DataDoctor():
The treat_invalid_values() feature from the DataDoctor() class offers a convenient and reliable way to handle invalid values in numerical columns. By utilising this feature, you can:
The treat_inconsistent_or_conflicting_values() method is part of the DataDoctor() class and is used to treat inconsistent or conflicting values in string columns of the loaded data. It identifies and resolves cases where there are conflicting values within the same entry or inconsistent values across different entries in a column.
The following code applies this treatment:
doctor.treat_inconsistent_or_conflicting_values()
Advantage:
Why use this feature from DataDoctor():
The treat_encoding_errors() method in the DataDoctor() class is designed to handle encoding errors in string columns of the loaded dataset. Encoding errors can occur when the encoding of the data does not match the expected encoding, leading to decoding issues and incorrect representation of characters.
The treat_encoding_errors() method detects and addresses encoding errors by attempting to determine the encoding of each string column and then decoding the values using the detected encoding. This ensures that the data is properly decoded and represented, preventing issues such as garbled text or incorrect characters.
The following code applies this treatment:
doctor.treat_encoding_errors()
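The detect-then-decode idea can be illustrated without any third-party dependency. DataDoctor uses chardet to detect encodings; the sketch below is a simplified stand-in that tries a short list of candidate encodings in order. The helper name decode_with_fallback and the candidate list are assumptions made for this example.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Decode bytes using the first candidate encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort if every candidate failed: replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


# The same text stored with two different encodings.
utf8_bytes = "café".encode("utf-8")
latin_bytes = "café".encode("latin-1")

decoded = [decode_with_fallback(utf8_bytes), decode_with_fallback(latin_bytes)]
print(decoded)
```

Trying fixed candidates is cruder than statistical detection: a wrong candidate can decode without error and still produce incorrect characters, which is why a detector such as chardet is useful on real data.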
Advantages:
The treat_encoding_errors() method offers several advantages:
Why use this feature from DataDoctor():
Using the treat_encoding_errors() feature from DataDoctor() is beneficial in scenarios where you encounter encoding errors in your dataset. By employing this feature, you can address and fix encoding issues, resulting in clean, correctly encoded data that is ready for further analysis or processing.
The treat_inconsistent_date_and_time_formats() method is a feature provided by the DataDoctor() class from the data_doctor module. It is designed to address inconsistent date and time formats in string columns of a dataset by attempting to convert the values in these columns to a consistent date and time format.
In datasets that contain date and time information, it is common to encounter variations in the format of these values. This inconsistency can arise from differences in data entry practices or data sources. The treat_inconsistent_date_and_time_formats() method aims to standardize the date and time formats across the dataset, ensuring consistency and facilitating data analysis and processing.
The following code applies this treatment:
doctor.treat_inconsistent_date_and_time_formats()
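Standardising mixed date formats can be sketched with the standard library alone. This is an illustrative approach rather than the package's implementation: the list of candidate formats, the target format, and the helper name standardise_date are all assumptions, and values that match no candidate are returned as None.

```python
from datetime import datetime

# Hypothetical set of input formats expected in the data.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y-%m-%d %H:%M:%S")
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def standardise_date(text):
    """Parse text with the first matching format and re-emit it in TARGET_FORMAT."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue
    return None  # no candidate format matched


raw_dates = ["2023-05-17", "17/05/2023", "May 17, 2023", "not a date"]
standardised = [standardise_date(d) for d in raw_dates]
print(standardised)
```

An explicit format list makes ambiguous inputs such as 03/04/2023 predictable, because the order of the candidates decides whether the day or the month comes first.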
Advantages:
The advantages of using the treat_inconsistent_date_and_time_formats() method include:
Why use this feature from DataDoctor():
The DataDoctor() class and its associated methods, including treat_inconsistent_date_and_time_formats(), provide a comprehensive data treatment and cleaning toolkit. By using this feature, you can address inconsistencies in date and time formats in a dataset, ensuring data quality and preparing the dataset for downstream analysis. This method eliminates the need for manual and error-prone data formatting, as it automates the process of converting inconsistent date and time values to a standardised format.
By utilizing the treat_inconsistent_date_and_time_formats() method from DataDoctor(), you can save time and effort in preprocessing and cleansing your data, allowing you to focus more on extracting meaningful insights and making informed decisions based on the treated dataset.
The treat_inconsistent_variable_names() method is a feature provided by the DataDoctor() class. It is used to treat inconsistent variable names in the loaded data. This method ensures that variable names follow a consistent format by replacing non-word characters with underscores and converting the names to lowercase.
Variable names are inconsistent when the names of variables (columns) in a dataset have different capitalisations, contain special characters, or have inconsistent word separators. For example, a dataset may have variables named "Age", "gender", and "Income(in USD)". Inconsistent variable names can make data analysis and modelling tasks more challenging and error-prone. The treat_inconsistent_variable_names() method aims to standardise and clean up the variable names to improve data consistency and ease further data processing.
The following code applies this treatment:
doctor.treat_inconsistent_variable_names()
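The renaming rule described above (non-word characters become underscores, then everything is lowercased) can be sketched with a regular expression. The helper name clean_name is hypothetical, and the package's exact pattern may differ.

```python
import re


def clean_name(name: str) -> str:
    """Replace each non-word character with an underscore and lowercase the result."""
    return re.sub(r"\W", "_", name.strip()).lower()


columns = ["Age", "gender", "Income(in USD)"]
cleaned_columns = [clean_name(c) for c in columns]
print(cleaned_columns)
```

With this rule a name ending in a special character, such as "Income(in USD)", ends in an underscore after cleaning, which is harmless but worth knowing when referring to the column later.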
Advantages:
Why use this feature from DataDoctor():
The treat_inconsistent_variable_names() feature provided by the DataDoctor() class offers several benefits for data preprocessing and analysis:
The treat_inconsistent_capitalization_or_punctuation() method is part of the DataDoctor() class in the data_doctor module. This method is designed to treat inconsistent capitalisation or punctuation in column names within the loaded dataset.
Column names have inconsistent capitalisation or punctuation when they are not consistently formatted in terms of capitalisation (e.g., a mix of uppercase and lowercase letters) or punctuation (e.g., inconsistent use of underscores or hyphens). Inconsistent formatting can make it difficult to work with the dataset and can lead to errors when performing data analysis or modelling tasks.
The following code applies this treatment:
doctor.treat_inconsistent_capitalization_or_punctuation()
Advantages:
Treating inconsistent capitalisation or punctuation in column names has several advantages:
Why use this feature from DataDoctor():
The treat_inconsistent_capitalization_or_punctuation() method provides an automated way to address inconsistent capitalisation or punctuation in column names. By using this feature, you can ensure that your dataset's column names adhere to consistent formatting standards, improving data quality and facilitating further data analysis and modelling tasks.
Example Result:
Suppose we have a dataset with the following inconsistent column names:
After applying the treat_inconsistent_capitalization_or_punctuation() method, the column names would be standardised to:
This standardisation helps create a more consistent and readable dataset, making it easier to work with the data and avoid potential issues due to inconsistent naming conventions.
The treat_spelling_or_typographical_errors() method in the DataDoctor() class is designed to address spelling or typographical errors in string columns of the loaded data. It applies a correction mechanism to replace incorrect or misspelled values with the most similar valid value from a given set of valid names.
Spelling or typographical errors in data can lead to inconsistencies and inaccuracies in analysis and modelling tasks. This method helps to standardise the values in string columns by identifying and replacing incorrect or misspelled entries with the most similar valid value. By correcting these errors, the method improves the quality and reliability of the data.
The following code applies this treatment:
doctor.treat_spelling_or_typographical_errors()
Advantages:
Why use this feature from DataDoctor():
Using the treat_spelling_or_typographical_errors() feature from DataDoctor() can provide several benefits:
It is important to note that the correction mechanism used in this method relies on fuzzy matching algorithms to find the most similar valid value. While it can significantly improve the accuracy of data, manual review and verification may still be necessary in certain cases to ensure the correctness and relevance of the corrections.
Once the treat_spelling_or_typographical_errors() method is executed, it will process the loaded data, identify misspelled or incorrect values in string columns, and replace them with the most similar valid value. The changes will be applied to the dataframe attribute of the DataDoctor() object.
The generate_report() method in the DataDoctor() class generates a report summarizing the data treatment steps that have been applied to the loaded data and prints it to the console. The report gives an overview of the issues addressed and the changes made to the data.
The following code generates the report:
doctor.generate_report()
Advantage:
The generate_report() method provides a concise summary of the data treatment process. It helps users understand the transformations applied to the data, identifies any issues or errors that were encountered, and provides transparency in the data treatment workflow. This report can be useful for auditing purposes, sharing information with stakeholders, or documenting the data treatment process.
Why use this feature from DataDoctor():
The generate_report() feature from the DataDoctor class is beneficial because it allows users to quickly obtain an overview of the data treatment steps that have been applied. It provides a comprehensive summary of the actions taken to clean and preprocess the data, making it easier to understand the transformations made and the resulting data quality improvements. This report can be used to communicate the data treatment process to other team members, ensure reproducibility, and track the progress of data cleaning efforts.
Results:
When you call the generate_report() method, it will print the data treatment report to the console. The report will contain a header indicating that it is the "Data Treatment Report," followed by a separator line. Each step of the data treatment process will be listed, along with relevant information and statistics about the treatment performed.
The output will look similar to the following:
Data Treatment Report:
======================
Missing data treated for column 'Column1'. 10 errors detected.
Duplicate records treated. 5 duplicates removed.
Inconsistent data formats treated.
Inaccurate data entries treated for column 'Column2'. 3 errors detected.
Outliers treated.
The DataDoctor package uses a combination of techniques to clean and preprocess data. These techniques include iterative imputation, simple imputation, isolation forest, fuzzy matching, and others. Each technique serves a specific purpose in the data cleaning and preprocessing process.
Iterative imputation is a method for handling missing data by modeling each missing value as a function of other variables in the data. Simple imputation is another method for handling missing data by replacing missing values with a single estimated value.
Isolation forest is an algorithm for detecting anomalies in data. It works by building an ensemble of isolation trees to isolate anomalies from the rest of the data.
Fuzzy matching is a technique for finding matches between strings that are not exactly the same but are similar. It can be used to identify and link records that refer to the same entity but have variations in spelling or formatting.
By using these techniques together, the DataDoctor package can effectively clean and preprocess data to improve its quality and usability.
I welcome feedback, bug reports, and feature requests. Please submit them to the Issue Tracker.
Contributions are welcome! Read the Contribution Guidelines for more information.
This project is licensed under the MIT License.