Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
How To Become a Data Engineer
Useful articles
The AI Hierarchy of Needs
The Rise of Data Engineer
The Downfall of the Data Engineer
A Beginner’s Guide to Data Engineering
Part I
Part II
Part III
Functional Data Engineering — a modern paradigm for batch data processing
How to become a Data Engineer
Ru, En
Introduction to Apache Airflow
Ru, En
Apache Airflow Alternatives
Talks
Data Engineering Principles - Build frameworks not pipelines
by Gatis Seja
Functional Data Engineering - A Set of Best Practices
by Maxime Beauchemin
Advanced Data Engineering Patterns with Apache Airflow
by Maxime Beauchemin
Creating a Data Engineering Culture
by Jesse Anderson
Streaming 101: Hello Streaming
by Josh Fischer
Algorithms & Data Structures
Algorithmic Toolbox
in Russian
Data Structures
in Russian
Data Structures & Algorithms Specialization
on Coursera
Algorithms Specialization
from Stanford on Coursera
SQL
Comprehensive SQL Tutorial
by Mode Analytics
SQL Practice
on Leetcode
Modern SQL
a website about modern SQL syntax
Introduction to Window Functions
En, Ru
Programming
Scala School
by Twitter
Fluent Python
intermediate level book about Python
Intro to Scala
in Russian on Stepik by Tinkoff Bank
The Hitchhiker’s Guide to Python
by Kenneth Reitz & Tanya Schlusser
Learn Python 3 The Hard Way
by Zed A. Shaw
Databases
Intro to Database Systems
by Carnegie Mellon University
Advanced Database Systems
by Carnegie Mellon University
On Disk IO
I.
Flavors of IO
II.
More Flavours of IO
III.
LSM Trees
IV.
B-Trees and RUM Conjecture
V.
Access Patterns in LSM Trees
Distributed Systems
Distributed systems for fun and profit
by Mikito Takada
Distributed Systems
by Maarten van Steen & Andrew S. Tanenbaum
CSE138: Distributed Systems
by Lindsey Kuper
CS 436: Distributed Computer Systems
by University of Waterloo
MIT 6.824: Distributed Systems
by Robert Morris from MIT
Distributed consensus reading list
maintained by Heidi Howard from University of Cambridge
Books
Designing Data-Intensive Applications
by Martin Kleppmann
Fundamentals of Data Engineering: Plan and Build Robust Data Systems
by Joe Reis & Matt Housley
Introduction to Algorithms
by Thomas Cormen
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling
Star Schema The Complete Reference
Database Internals: A Deep Dive into How Distributed Data Systems Work
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
A Philosophy of Software Design
Grokking Streaming Systems
by Josh Fischer & Ning Wang
Guide to High Performance Distributed Computing
by K.G. Srinivasa & Anil Kumar Muppalla
Data Pipelines with Apache Airflow
by Bas P. Harenslak and Julian Rutger de Ruiter
Courses
Data Engineering on Google Cloud Platform Specialization
by Google
Data Engineer Nanodegree
by Udacity
Data Engineering with Python
by DataCamp
Blogs
Martin Kleppmann
author of Designing Data-Intensive Applications
BaseDS
by Vaidehi Joshi about Distributed Systems
Tools
Apache Airflow
is a platform to programmatically author, schedule and monitor workflows in Python
Apache Spark
is a unified analytics engine for large-scale data processing
Apache Kafka
is a distributed streaming platform
Luigi
is a Python package that helps you build complex pipelines of batch jobs.
Dagster.io
is a system for building modern data applications.
Prefect
includes everything you need to create and run data applications.
Metaflow
helps you build and manage real-life data science projects with ease.
lakeFS
lets you build repeatable, atomic and versioned data lake operations, from complex ETL jobs to data science and analytics.
Cloud Platforms
Amazon Web Services
Google Cloud Platform
Microsoft Azure
Yandex Cloud
DigitalOcean
IBM Cloud
Communities
data Engineering
- Telegram chat about data engineering
Data Engineering Subreddit
- subreddit about data engineering
Data Engineering Jobs
Data Engineering jobs
Other
Data Engineering Podcast
Newsletters & Digests
DataEng Telegram channel
- Telegram channel about data engineering (rus/eng)
Data Engineering Weekly
SF Data Weekly
- A weekly email of useful links for people interested in building data platforms
Data Elixir
- Data Elixir is an email newsletter that keeps you on top of the tools and trends in Data Science.
Data Governance, Privacy and Security
- DbAdmin News is a newsletter on the technology behind Data Governance, Security and Privacy