DAT4 Course Repository
Course materials for General Assembly's Data Science course in Washington, DC (12/15/14 - 3/16/15).
Instructors: Sinan Ozdemir and Kevin Markham (Data School blog, email newsletter, YouTube channel)
Teaching Assistant: Brandon Burroughs
Office hours: 1-3pm on Saturday and Sunday (Starbucks at 15th & K), 5:15-6:30pm on Monday (GA)
Course Project information
Installation and Setup
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "DAT4 team" and add your photo!
Class 1: Introduction
- Introduction to General Assembly
- Course overview: our philosophy and expectations (slides)
- Data science overview (slides)
- Tools: check for proper setup of Anaconda, overview of Slack
Homework:
- Resolve any installation issues before next class.
Optional:
Class 2: Python
- Brief overview of Python environments: Python interpreter, IPython interpreter, Spyder
- Python quiz (solution)
- Working with data in Python
Homework:
Optional:
Resources:
Class 3: Getting Data
Homework:
- Think about your project question, and start looking for data that will help you to answer your question.
- Prepare for our next class on Git and GitHub:
- You'll need to know some command line basics, so please work through GA's excellent command line tutorial and then take this brief quiz.
- Check for proper setup of Git by running the following command. If that doesn't work, you probably need to install Git.
git clone https://github.com/justmarkham/DAT-project-examples.git
- Create a GitHub account. (You don't need to download anything from GitHub.)
Optional:
- If you aren't feeling comfortable with the Python we've done so far, keep practicing using the resources above!
Resources:
Class 4: Git and GitHub
- Special guest: Nick DePrey presenting his class project from DAT2
- Git and GitHub (slides)
Homework:
- Project milestone: Submit your question and data set to your folder in DAT4-students before class on Wednesday! (This is a great opportunity to practice writing Markdown and creating a pull request.)
Optional:
- Clone this repo (DAT4) for easy access to the course files.
Resources:
Class 5: Pandas
- Pandas for data exploration, analysis, and visualization (code)
Homework:
Optional:
Resources:
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib.
- To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and Columbia's Data Mining class has an excellent slide deck.
Class 6: Numpy, Machine Learning, KNN
- Numpy (code)
- "Human learning" with iris data (code, solution)
- Machine Learning and K-Nearest Neighbors (slides)
Homework:
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?
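To experiment with the questions above outside the interactive visualization, here is a minimal K-Nearest Neighbors sketch in plain Python. The data points, labels, and the function names `euclidean` and `knn_predict` are hypothetical and chosen only for illustration; this is not the code used in class.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two clusters, plus one "B" point sitting near the "A" cluster.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
           (6.0, 6.0), (6.5, 7.0), (7.0, 6.0),
           (2.5, 2.5)]
train_y = ["A", "A", "A", "B", "B", "B", "B"]
query = (2.4, 2.4)

# Compare the prediction for a small, a moderate, and the largest possible K.
for k in (1, 3, 7):
    print(k, knn_predict(train_X, train_y, query, k))
```

Try moving the query point or the stray "B" point and note which values of K change their prediction most readily.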
Resources:
Class 7: scikit-learn, Model Evaluation Procedures
Homework:
Optional:
- Practice what we learned in class today!
- If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
- If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
- Either way, you can submit your commented code to DAT4-students, and we'll give you feedback.
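The idea of converting numerical ranges into categories can be sketched as follows. This is a minimal plain-Python illustration; the bin edges, labels, prices, and the helper name `to_category` are all hypothetical. In practice, the `pandas.cut` function does a similar job on a whole column, though its handling of bin edges may differ from this sketch.

```python
import bisect


def to_category(value, edges, labels):
    """Map a numeric value to a label.

    `edges` are the sorted boundaries between bins, so there must be exactly
    one more label than edges. A value equal to an edge falls in the higher bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect.bisect_right(edges, value)]


# Hypothetical regression response (for example, a sale price) turned into three classes.
edges = [100000, 300000]
labels = ["low", "medium", "high"]
prices = [85000, 100000, 250000, 420000]

categories = [to_category(p, edges, labels) for p in prices]
print(categories)  # ['low', 'medium', 'medium', 'high']
```

The resulting categories can then serve as the response for a classifier such as KNN.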
Resources:
Class 8: Linear Regression
Homework:
Optional:
- Similar to last class, your optional exercise is to practice what we have been learning in class, either on your project data or on another dataset.
Resources:
Class 9: Logistic Regression, Preview of Other Models
Resources:
Class 10: Model Evaluation Metrics
- Finishing model evaluation procedures (slides, code)
- Review of test set approach
- Cross-validation
- Model evaluation metrics (slides)
- Regression:
- Root Mean Squared Error (code)
- Classification:
Homework:
Optional:
Resources:
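As a companion to the cross-validation and Root Mean Squared Error topics listed for this class, here is a minimal plain-Python sketch. The function names `rmse` and `k_fold_indices`, the toy response values, and the "predict the training mean" baseline are assumptions made for illustration; the class code may be organized differently.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Hypothetical response values; the "model" simply predicts the training mean.
y = [2.0, 4.0, 6.0, 8.0]
fold_scores = []
for train, test in k_fold_indices(len(y), 2):
    train_mean = sum(y[i] for i in train) / len(train)
    fold_scores.append(rmse([y[i] for i in test], [train_mean] * len(test)))

print(sum(fold_scores) / len(fold_scores))  # average RMSE across folds
```

This sketch does not shuffle the rows before splitting; real data is usually shuffled first unless it is ordered in time.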
Class 11: Working a Data Problem
- Today we will work on a real-world data problem! Our data covers 7 months of stock data for a fictional company, ZYX, including Twitter sentiment, volume, and stock price. Our goal is to create a predictive model that predicts forward returns.
- Project overview (slides)
- Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...
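Since the goal is to predict forward returns, it may help to see one common way to compute them. The sketch below assumes a forward return over horizon h is defined as (price[t+h] - price[t]) / price[t]; the project documentation may define it differently, so check there first. The function name `forward_returns` and the prices are hypothetical.

```python
def forward_returns(prices, horizon=1):
    """Return the fractional price change from each time step to `horizon` steps later.

    The last `horizon` entries have no future price, so they are None.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    returns = []
    for t in range(len(prices)):
        if t + horizon < len(prices):
            returns.append((prices[t + horizon] - prices[t]) / prices[t])
        else:
            returns.append(None)
    return returns


# Hypothetical daily closing prices.
prices = [100.0, 110.0, 99.0, 99.0]
print(forward_returns(prices, horizon=1))
```

Because the target at time t uses a price from the future, features used to predict it should only use information available at or before time t.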
Class 12: Clustering and Visualization
- The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
- The code for today focuses on two main examples:
- We will investigate simple clustering using the iris data set.
- We will take a look at a harder example, using Pandora songs as data. See data.
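For a feel of what K-Means does before looking at the class code, here is a minimal plain-Python sketch of the standard assign-then-update loop. The points, the starting centroids, and the function names are hypothetical; starting centroids are fixed here for reproducibility, whereas library implementations typically choose them randomly or with a smarter initialization.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Group each point with its nearest centroid; returns one list per centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members; keep it in place if it has none.
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

Try different starting centroids, such as two points from the same group, to see how the result depends on initialization.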
Homework:
- Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Monday. Here are some questions to think about while you read:
- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
- Before he tried the "statistical approach" to spam filtering, what was his approach?
- How exactly does his statistical filtering system work?
- What did Paul say were some of the benefits of the statistical approach?
- How good was his prediction of the "spam of the future"?
- Below are the foundational topics upon which Monday's class will depend. Please review these materials before class:
- Confusion matrix: Kevin's guide roughly mirrors the lecture from class 10.
- Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
- Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
- You should definitely be working on your project! Your rough draft is due in two weeks!
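As a quick refresher on the confusion matrix, sensitivity, and specificity topics listed above, the definitions can be sketched in plain Python. The labels and predictions are hypothetical (1 standing for "spam"), and the function names are chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)


# Hypothetical labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)       # 3 2 4 1
print(sensitivity(tp, fn))  # 0.75
print(specificity(tn, fp))
```

Both ratio functions divide by zero if the data contains no actual positives (or no actual negatives), so a real implementation would need to handle that case.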
Resources:
Class 13: Naive Bayes
- Briefly discuss A Plan for Spam
- Probability and Bayes' theorem
- Naive Bayes classification
- Naive Bayes classification in scikit-learn (code)
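To connect Bayes' theorem to Naive Bayes classification, here is a minimal plain-Python sketch for a two-class spam example. All probabilities are made-up numbers, the function name `posterior_spam` is hypothetical, and only words present in the message are considered. This is a simplification for illustration; it is neither Paul Graham's exact formula nor what scikit-learn does internally.

```python
def posterior_spam(words, prior_spam, p_word_given_spam, p_word_given_ham):
    """Return P(spam | words), treating words as independent given the class."""
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for word in words:
        # The "naive" step: multiply per-word likelihoods as if they were independent.
        spam_score *= p_word_given_spam[word]
        ham_score *= p_word_given_ham[word]
    # Bayes' theorem: divide by the total probability of seeing these words.
    return spam_score / (spam_score + ham_score)


# Hypothetical probabilities, as if estimated from a training corpus.
prior_spam = 0.4
p_word_given_spam = {"free": 0.5, "meeting": 0.1}
p_word_given_ham = {"free": 0.05, "meeting": 0.4}

print(posterior_spam(["free"], prior_spam, p_word_given_spam, p_word_given_ham))
print(posterior_spam(["free", "meeting"], prior_spam, p_word_given_spam, p_word_given_ham))
```

Adding the word "meeting", which is more likely in non-spam under these made-up numbers, pulls the spam probability down from the single-word case.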
Resources:
Homework:
- Download all of the NLTK collections.
- In Python, use the following commands to bring up the download menu:
import nltk
nltk.download()
- Choose "all".
- Alternatively, just type nltk.download('all')
- Install two new packages: textblob and lda.
- Open a terminal or command prompt.
- Type pip install textblob and pip install lda.
Class 14: Natural Language Processing
- Overview of Natural Language Processing (slides)
- Real World Examples
- Natural Language Processing (code)
- NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization
- Alternative: TextBlob
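Two of the topics above, tokenization and TF-IDF, can be sketched without any libraries. The version below uses a simple regular-expression tokenizer and the textbook formula tf * log(N / df); the documents and function names are hypothetical. Libraries such as NLTK and scikit-learn use more careful tokenizers and often smoothed or normalized TF-IDF variants, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus.
documents = ["The cat sat", "The dog sat", "The dog barked"]
for doc_scores in tf_idf(documents):
    print(doc_scores)
```

A term that appears in every document, like "the" here, scores zero, while a term unique to one document scores highest within that document.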
Resources:
Class 15: Decision Trees
Homework:
- By next Wednesday (before class), review the project drafts of your two assigned peers according to these guidelines. You should upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT4-students. If your last name is Smith and you are reviewing Jones, you should name your file smith_reviews_jones.md.
Resources:
Installing Graphviz (optional):
- Mac:
- Windows:
- Download and install MSI file
- Add it to your Path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as:
C:\Program Files (x86)\Graphviz2.38\bin
Class 16: Ensembling
Resources:
Class 17: Databases and MapReduce
Resources:
Class 18: Recommenders
- Recommendation Engines (slides)
- Recommendation Engine Example (code)
Resources:
Class 19: Advanced scikit-learn
- Advanced scikit-learn (code)
Homework:
Resources:
Class 20: Course Review
Resources:
Class 21: Project Presentations
Class 22: Project Presentations