Daily-Dose-of-Data-Science

A collection of code snippets from the publication Daily Dose of Data Science on Substack: http://www.dailydoseofds.com/

Daily Dose of Data Science

Daily Dose of Data Science is a publication on Substack that brings together intriguing frameworks, libraries, technologies, and tips that make the life cycle of a Data Science project effortless.

This repository is a collection of all the code snippets presented in my publication. If you want to receive these tips in your mailbox daily, you can subscribe to my Substack newsletter.

Run These Code Snippets on Your Local Machine

To download the tips listed here, you can clone this repo.

git clone https://github.com/ChawlaAvi/Daily-Dose-of-Data-Science

Table of Contents

  1. Pandas
  2. Jupyter Tips
  3. Python
  4. Plotting
  5. NumPy
  6. Memory Optimization
  7. Cool Tools
  8. Run-time Optimization
  9. Sklearn
  10. Debugging
  11. Missing Data
  12. ML-AI News
  13. Machine Learning
  14. Statistics
  15. Testing
  16. Terminal
  17. Documents
  18. Animations

Pandas

Title Notebook Substack Article
One-Minute Guide To Becoming a Polars-savvy Data Scientist
Avoid Using Pandas' Apply() Method At All Times
Pandas vs Polars Run-time and Memory Comparison
A Lesser-Known Feature of the Merge Method in Pandas
A Highly Overlooked Approach To Analysing Pandas DataFrames
The Most Common Misconception About Inplace Operations in Pandas
Become A Bilingual Data Scientist With These Pandas to SQL Translations
Avoid This Costly Mistake When Indexing A DataFrame
AutoProfiler: Automatically Profile Your DataFrame As You Work
Why You Should Avoid Appending Rows To A DataFrame
Are You Sure You Are Using The Correct Pandas Terminologies?
If You Are Not Able To Code A Vectorized Approach, Try This.
Why Are We Typically Advised To Never Iterate Over A DataFrame?
PyGWalker: Analyze Pandas Dataframe in Jupyter using a Tableau-style Interface
A Simple Trick to Make The Most Out of Pivot Tables in Pandas
Never Worry About Parsing Errors Again While Reading CSV with Pandas
An Interesting and Lesser-Known Way To Create Plots Using Pandas
Generate Helpful Hints As You Write Your Pandas Code
Speed-up Parquet I/O of Pandas by 5x
Stop Using The Describe Method in Pandas. Instead, use Skimpy.
Stop Using The Describe Method in Pandas. Instead, use Summarytools.
Analyze A Pandas DataFrame Without Code
70x Faster Pandas By Changing Just One Line of Code
Reduce Memory Usage Of A Pandas DataFrame By 90%
Speed-up Pandas Apply 5x with NumPy
A Lesser-Known Feature of Apply Method In Pandas
Create Pandas DataFrame from Dataclass
Run SQL in Jupyter To Analyze A Pandas DataFrame
When You Should Not Use the head() Method In Pandas
Three Lesser-known Tips For Reading a CSV File Using Pandas
The Best File Format To Store A Pandas DataFrame
Lesser-Known Feature of the Merge Method in Pandas
The Best Way to Use Apply() in Pandas
A No-code Tool To Understand Your Data Quickly
Display Progress Bar With Apply() in Pandas
Supercharge value_counts() Method in Pandas With Sidetable
Explore CSV Data Right From The Terminal
Define the Correct DataType for Categorical Columns
Don't Create Conditional Columns in Pandas with Apply
Write Your Own Flavor Of Pandas
Create DataFrame Hassle-free By Using Clipboard
Alter the Datatype of Multiple Columns at Once
Why you should not dump DataFrames to a CSV
Why You Should Not Read CSVs with Pandas
Parallelize Pandas Apply() With Swifter
A Hidden Feature of Describe Method In Pandas
Enrich Your Notebook With Interactive Controls
Data Analysis Using No-Code Pandas In Jupyter
Create Pivot Tables, Aggregations and Plots Without Any Code
Parallelize Pandas with Pandarallel
Pretty Plotting With Pandas
How to Read Multiple CSV Files Efficiently
Configure Sklearn To Output Pandas DataFrame
Datatype For Handling Missing Valued Columns in Pandas
Vectorization Does Not Always Guarantee Better Performance

Jupyter Tips

Title Notebook Substack Article
Declutter Your Jupyter Notebook Using Interactive Controls
Jupyter Notebook + Spreadsheet + AI All in One Place With Mito
The Coolest GitHub-Colab Integration You Would Ever See
Break the Linear Presentation of Notebooks With Stickyland
Restart Jupyter Kernel Without Losing Variables
Annotate Data With The Click Of A Button Using Pigeon
Build Elegant Web Apps Right From Jupyter Notebook with Mercury
Supercharge Your Jupyter Kernel With ipyflow
PyGWalker: Analyze Pandas Dataframe in Jupyter using a Tableau-style Interface
Draw The Data You Are Looking For In Seconds
Never Search Jupyter Notebooks Manually Again To Find Your Code
Stop Previewing Raw DataFrames. Instead, Use DataTables
Label Your Data With The Click Of A Button
The Coolest Jupyter Notebook Hack
View Documentation in Jupyter Notebook
Get Notified When Jupyter Cell Has Executed
Clear Cell Output In Jupyter Notebook During Run-time
CodeSquire: The AI Coding Assistant You Should Use Over GitHub Copilot
Find Your Code Hiding In Some Jupyter Notebook With Ease
Enrich Your Notebook With Interactive Controls
Data Analysis Using No-Code Pandas In Jupyter
Create Pivot Tables, Aggregations and Plots Without Any Code
Restart Notebook Without Losing Variables
Retrieve Previously Computed Output In Jupyter Notebook
Transfer Variables Between Jupyter Notebooks

Python

Title Notebook Substack Article
7 Elegant Usages of Underscore in Python
How To Enforce Type Hints in Python?
A Common Misconception About Deleting Objects in Python
What Makes The Join() Method Blazingly Faster Than Iteration?
A Hidden Feature of a Popular String Method in Python
Execute Python Project Directory as a Script
Improve Python Run-time Without Changing A Single Line of Code
A Lesser-Known Difference Between For-Loops and List Comprehensions
Magic Methods: An Underrated Gem of Python OOP
9 Command Line Flags To Run Python Scripts More Flexibly
Use Custom Python Objects In A Boolean Context
You Were Probably Given Incomplete Info About A Tuple's Immutability
A Counterintuitive Thing About Python Dictionaries
Probably The Fastest Way To Execute Your Python Code
A Counterintuitive Fact About Python Functions
Manipulating Mutable Objects In Python Can Get Confusing At Times
Most Python Programmers Don't Know This About Python OOP
You Can Add a List As a Dictionary's Key (Technically)!
Why Python Does Not Offer True OOP Encapsulation
Most Python Programmers Don't Know This About Python For-loops
How To Enable Function Overloading In Python
The Right Way to Roll Out Library Updates in Python
F-strings Are Much More Versatile Than You Think
A Single Line That Will Make Your Python Code Faster
Make Dot Notation More Powerful in Python
An Elegant Way To Perform Shutdown Tasks in Python
What Are Class Methods and When To Use Them?
Hide Attributes While Printing A Dataclass Object
List : Tuple :: Set : ?
Post_init: Add Attributes To A Dataclass Post Initialization
Simplify Your Functions With Partial Functions
DotMap: A Better Alternative to Python Dictionary
Prevent Wild Imports With __all__ in Python
Performance Comparison of Python 3.11 and Python 3.10
Why 256 is 256 But 257 is not 257?
Make a Class Object Behave Like a Function
Lesser-known Feature of Pickle Files
Specify Loops and Runs In %%timeit
Don't Use time.time() To Measure Execution Time
Import Your Python Package as a Module
Fine-grained Error Tracking With Python 3.11
Run Python Project Directory As A Script
Use Slotted Class To Improve Your Python Code
Using Dictionaries In Place of If-conditions
In Defense of Match-case Statements in Python

Plotting

Title Notebook Substack Article
Don't Overuse Scatter, Line and Bar Plots. Try These Four Elegant Alternatives.
Sankey Diagrams: An Underrated Gem of Data Visualization
Enrich Your Heatmaps With This Simple Trick
The Coolest Matplotlib Hack to Create Subplots Intuitively
Waterfall Charts: A Better Alternative to Line/Bar Plot
Enrich Your Confusion Matrix With A Sankey Diagram
A Simple One-Liner to Create Professional Looking Matplotlib Plots
Visualise The Change In Rank Over Time With Bump Charts
A Simple Trick That Significantly Improves The Quality of Matplotlib Plots
A Lesser-known Feature of Creating Plots with Plotly
A Little Bit Of Extra Effort Can Hugely Transform Your Basic Matplotlib Plots
Interactively Visualise A Decision Tree With A Sankey Diagram
Use Histograms With Caution. They Are Highly Misleading!
Three Simple Ways To (Instantly) Make Your Scatter Plots Clutter Free
Matplotlib Has Numerous Hidden Gems. Here's One of Them.
A Simple Trick That Will Make Heatmaps More Elegant
The Limitations Of Heatmap That Are Slowing Down Your Data Analysis
An Underrated Technique To Improve Your Data Visualizations
Who Said Matplotlib Cannot Create Interactive Plots?
Don't Create Messy Bar Plots. Instead, Try Bubble Charts!
Use Box Plots With Caution! They May Be Misleading.
An Underrated Technique To Create Better Data Plots
An Interesting and Lesser-Known Way To Create Plots Using Pandas
Style Matplotlib Plots To Make Them More Attractive
Simple One-Liners to Preview a Decision Tree Using Sklearn
Create Data Plots Right From The Terminal
Make Your Matplotlib Plots More Professional
Perfplot: Measure, Visualize and Compare Run-time With Ease
Prettify Word Clouds In Python
Calendar Map As A Richer Alternative to Line Plot
Density Plot As A Richer Alternative to Scatter Plot
Python One-Liner To Create Sketchy Hand-drawn Plots
Create a Moving Bubbles Chart in Python
Visualizing Google Search Trends of 2022 using Python
Create A Racing Bar Chart In Python
Elegantly Plot the Decision Boundary of a Classifier
Dot Plot: A Potential Alternative to Bar Plot
Hexbin Plots As A Richer Alternative to Scatter Plots
Enrich Your Notebook With Interactive Controls
Regression Plot Made Easy with Plotly
Pretty Plotting With Pandas
Polynomial Linear Regression Plot Made Easy With Seaborn
Analyse Flow Data With Sankey Diagrams

NumPy

Title Notebook Substack Article
A Major Limitation of NumPy Which Most Users Aren't Aware Of
Beware of This Unexpected Behaviour of NumPy Methods
Speedup NumPy Methods 25x With Bottleneck
Speed-up NumPy 20x with Numexpr
An Elegant Way To Perform Matrix Multiplication
Difference Between Dot and Matmul in NumPy
Don't Print NumPy Arrays! Use Lovely-NumPy Instead
Polynomial Linear Regression with NumPy

Memory Optimization

Title Notebook Substack Article
70x Faster Pandas By Changing Just One Line of Code
Reduce Memory Usage Of A Pandas DataFrame By 90%
The Best File Format To Store A Pandas DataFrame
Define the Correct DataType for Categorical Columns
Datatype For Handling Missing Valued Columns in Pandas
Save Memory with Python Generators

Cool Tools

Title Notebook Substack Article
CNN Explainer: Interactively Visualize a Convolutional Neural Network
Break the Linear Presentation of Notebooks With Stickyland
Annotate Data With The Click Of A Button Using Pigeon
Mito Just Got Supercharged With AI!
PyGWalker: Analyze Pandas Dataframe in Jupyter using a Tableau-style Interface
Supercharge Shell With Python Using Xonsh
Draw The Data You Are Looking For In Seconds
Preview Your README File Locally In GitHub Style
This GUI Tool Can Possibly Save You Hours Of Manual Work
Stop Previewing Raw DataFrames. Instead, Use DataTables.
Converting Python To LaTeX Has Possibly Never Been So Simple
Label Your Data With The Click Of A Button
Analyze A Pandas DataFrame Without Code
A No-Code Online Tool To Explore and Understand Neural Networks
Speed-up NumPy 20x with Numexpr
Debugging Made Easy With PySnooper
Deep Learning Network Debugging Made Easy
CodeSquire: The AI Coding Assistant You Should Use Over GitHub Copilot
Find Unused Python Code With Ease
Enrich Your Notebook With Interactive Controls
Data Analysis Using No-Code Pandas In Jupyter
Modify Python Code During Run-Time
Modify Function During Run-Time
Importing Modules Made Easy with Pyforest
Create Pivot Tables, Aggregations and Plots Without Any Code

Run-time Optimization

Title Notebook Substack Article
Pandas vs Polars Run-time and Memory Comparison
The Limitation of KMeans Which Is Often Overlooked by Many
Most Sklearn Users Don't Know This About Its LinearRegression Implementation
Probably The Fastest Way To Execute Your Python Code
Why Are We Typically Advised To Never Iterate Over A DataFrame?
Speed-up Parquet I/O of Pandas by 5x
A Single Line That Will Make Your Python Code Faster
Make Sklearn KMeans 20x times faster
Speed-up NumPy 20x with Numexpr
The Best File Format To Store A Pandas DataFrame
The Best Way to Use Apply() in Pandas
Don't Create Conditional Columns in Pandas with Apply
Why you should not dump DataFrames to a CSV
Parallelize Pandas Apply() With Swifter
Parallelize Pandas with Pandarallel
How to Read Multiple CSV Files Efficiently

Sklearn

Title Notebook Substack Article
Why Sklearn's Linear Regression Has No Hyperparameters?
Scikit-LLM: Integrate Sklearn API with Large Language Models
Most Sklearn Users Don't Know This About Its LinearRegression Implementation
A Lesser-Known Feature of Sklearn To Train Models on Large Datasets
Sklearn One-liner to Generate Synthetic Data
Skorch: Use Scikit-learn API on PyTorch Models
Make Sklearn KMeans 20x times faster
Build Baseline Models Effortlessly With Sklearn
Polynomial Linear Regression with NumPy
An Elegant Way to Import Metrics From Sklearn
Feature Tracking Made Simple In Sklearn Transformers
Configure Sklearn To Output Pandas DataFrame

Debugging

Title Notebook Substack Article
Debugging Made Easy With PySnooper
Don't use print() to debug your code.
Inspect Program Flow with IceCream
Lesser-known Feature of f-strings in Python

Missing Data

Title Notebook Substack Article
Handle Missing Data With Missingno
Datatype For Handling Missing Valued Columns in Pandas

ML-AI News

Title Notebook Substack Article
Now You Can Use DALLE With OpenAI API

Machine Learning

Title Notebook Substack Article
Decision Trees ALWAYS Overfit. Here's A Lesser-Known Technique To Prevent It.
Evaluate Clustering Performance Without Ground Truth Labels
The Most Common Misconception About Continuous Probability Distributions
A Common Misconception About Feature Scaling and Standardization
Random Forest May Not Need An Explicit Validation Set For Evaluation
A Visual and Overly Simplified Guide To Bagging and Boosting
10 Most Common (and Must-Know) Loss Functions in ML
Theil-Sen Regression: The Robust Twin of Linear Regression
The Limitations Of Elbow Curve And What You Should Replace It With
21 Most Important (and Must-know) Mathematical Equations in Data Science
Try This If Your Linear Regression Model is Underperforming
The Limitation of KMeans Which Is Often Overlooked by Many
Nine Most Important Distributions in Data Science
The Limitation of Linear Regression Which is Often Overlooked By Many
A Reliable and Efficient Technique To Measure Feature Importance
Does Every ML Algorithm Rely on Gradient Descent?
Visualize The Performance Of Linear Regression With This Simple Plot
Confidence Interval and Prediction Interval Are Not The Same
The Ultimate Categorization of Performance Metrics in ML
The Most Overlooked Problem With One-Hot Encoding
9 Most Important Plots in Data Science
Is Categorical Feature Encoding Always Necessary Before Training ML Models?
The Counterintuitive Behaviour of Training Accuracy and Training Loss
A Highly Overlooked Point In The Implementation of Sigmoid Function
The Ultimate Categorization of Clustering Algorithms
A Lesser-Known Feature of Sklearn To Train Models on Large Datasets
Visualize The Performance Of Any Linear Regression Model With This Simple Plot
How To Truly Use The Train, Validation and Test Set
The Advantages and Disadvantages of PCA To Consider Before Using It
Loss Functions: An Algorithm-wise Comprehensive Summary
Is Data Normalization Always Necessary Before Training ML Models?
A Visual Guide to Stochastic, Mini-batch, and Batch Gradient Descent
The Taxonomy Of Regression Algorithms That Many Don't Bother To Remember
The Limitation of PCA Which Many Folks Often Ignore
Breathing KMeans: A Better and Faster Alternative to KMeans
How Many Dimensions Should You Reduce Your Data To When Using PCA?
A Visual Guide To Sampling Techniques in Machine Learning
A Visual and Overly Simplified Guide to PCA
The Limitation Of Euclidean Distance Which Many Often Ignore
Visualising The Impact Of Regularisation Parameter
A (Highly) Important Point to Consider Before You Use KMeans Next Time
Is Class Imbalance Always A Big Problem To Deal With?
A Visual Comparison Between Locality and Density-based Clustering
Why Don't We Call It Logistic Classification Instead?
A Typical Thing About Decision Trees Which Many Often Ignore
Always Validate Your Output Variable Before Using Linear Regression
Why Is It Important To Shuffle Your Dataset Before Training An ML Model
Why Are We Typically Advised To Set Seeds for Random Generators?
This Small Tweak Can Significantly Boost The Run-time of KMeans
Most ML Folks Often Neglect This While Using Linear Regression
Is This The Best Animated Guide To KMeans Ever?
An Effective Yet Underrated Technique To Improve Model Performance
How to Encode Categorical Features With Many Categories?
Why KMeans May Not Be The Apt Clustering Algorithm Always
Skorch: Use Scikit-learn API on PyTorch Models
A No-Code Online Tool To Explore and Understand Neural Networks
Make Sklearn KMeans 20x times faster
Deep Learning Network Debugging Made Easy
Build Baseline Models Effortlessly With Sklearn
Polynomial Linear Regression with NumPy

Statistics

Title Notebook Substack Article
Be Cautious Before Drawing Any Conclusions Using Summary Statistics
The Limitation Of Pearson Correlation Which Many Often Ignore
Pandas and NumPy Return Different Values for Standard Deviation. Why?
Why Correlation (and Other Statistics) Can Be Misleading

Testing

Title Notebook Substack Article
Generate Your Own Fake Data In Seconds

Terminal

Title Notebook Substack Article
Supercharge Shell With Python Using Xonsh
Most Command-line Users Don't Know This Cool Trick About Using Terminals
Never Refactor Your Code Manually Again. Instead, Use Sourcery!
Create Data Plots Right From The Terminal
Visualize Commit History of Git Repo With Beautiful Animations
How Would You Identify Fuzzy Duplicates In A Data With Million Records?
Automated Code Refactoring With Sourcery
Explore CSV Data Right From The Terminal

Documents

Title Document Substack Article
Daily Dose of Data Science - Full Archive
35 Hidden Python Libraries That Are Absolute Gems
40 Open-Source Tools to Supercharge Your Pandas Workflow
37 Hidden Python Libraries That Are Absolute Gems
10 Automated EDA Tools That Will Save You Hours Of (Tedious) Work
30 Python Libraries to (Hugely) Boost Your Data Science Productivity

Animations

Title Notebook Substack Video
Visualizing The Data Transformation of a Neural Network