titanic

Download, explore, and wrangle the Titanic passenger manifest dataset with an eye toward developing a predictive model for survival.

MIT License

This tutorial is based on the Kaggle competition "Predicting Survival Aboard the Titanic".

RMS Titanic, Ocean Liner (1912). "Cd51-1000g" by Boris Lux, licensed under CC BY-SA 3.0 via Wikimedia Commons.

STEP ONE: EXPLORATORY ANALYSIS

Start by cloning this repository.

Anaconda users: you should have everything you need, but if you find you are missing anything, type this into the command line:

conda install -c https://conda.anaconda.org/blaze <package>

Others: make sure the required libraries are installed by using:

pip install -r requirements.txt    

Then look inside the data folder and open train.csv to check out the dataset we'll be exploring today.
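A first look at the file from Python might go something like the sketch below. It assumes pandas is installed and that train.csv follows the Kaggle Titanic column layout (PassengerId, Survived, Pclass, Name, Sex, Age, and so on). The three inline rows are made-up stand-ins so the snippet runs without the repository, and load_passengers and SAMPLE_CSV are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Made-up rows in the Kaggle Titanic column layout (assumed to match data/train.csv).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T-0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T-0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,T-0003,7.92,,S
"""


def load_passengers(source):
    """Read a Titanic-style CSV from a path or a file-like object."""
    return pd.read_csv(source)


# In the lab you would pass the real file instead: load_passengers("data/train.csv")
df = load_passengers(io.StringIO(SAMPLE_CSV))

print(df.shape)           # (rows, columns)
print(df.dtypes)          # which columns are numeric and which are text
print(df.head())          # first few records
print(df.isnull().sum())  # missing values per column
```

The missing-value counts are usually the most useful part of this first pass, since they hint at which columns will need cleaning later.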

To start the lab, open the IPython Notebook file titanic_wrangling.ipynb.

Things to think about

  1. How do you explore a new dataset?
  2. What should you look for in tabular data?
  3. What visualization tools can help you explore?
  4. What is the end goal of data wrangling? Why are we even doing this?
  5. What needs cleaning, and how do you clean it?
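As one possible starting point for these questions, here is a minimal pandas sketch. The six passengers are invented, the column names are assumed to follow the Kaggle Titanic layout, and the cleaning choices (median fill, 0/1 encoding) are only one option among many, not necessarily what the notebook does.

```python
import pandas as pd

# Made-up passengers; column names follow the Kaggle Titanic layout (an assumption).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0],
})

# What to look for: group-level differences in the outcome.
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# What to clean: fill missing ages with the median (one simple choice among many).
median_age = df["Age"].median()
clean = df.copy()
clean["Age"] = clean["Age"].fillna(median_age)

# Models need numbers, so encode the text column as 0/1.
clean["Sex"] = clean["Sex"].map({"male": 0, "female": 1})
print(clean)
```

Working on a copy keeps the raw data intact, which makes it easier to compare different cleaning strategies side by side.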

See also: "Baby steps to performing exploratory analysis in Python" and "Data munging using Pandas"

STEP TWO: MACHINE LEARNING FROM DISASTER

(You will do this portion in the Machine Learning course.)

The IPython Notebook for this class is called titanicML_workshop.ipynb. To get it, navigate on the command line to the titanic repository that you cloned for the last class and run the following (git stash sets aside any local changes so that the pull applies cleanly):

git stash
git pull origin master    

If you haven't already installed Scikit-learn, do that now.

Anaconda users: you already have Scikit-learn! If you ever find you are missing anything, type this into the command line:

conda install -c https://conda.anaconda.org/blaze <package>

Everyone else, make sure Scikit-learn is installed:

WINDOWS USERS:

pip install -U scikit-learn

MAC OSX USERS:

pip install -U numpy scipy scikit-learn

LINUX w/ Python 2:

sudo apt-get install build-essential python-dev python-setuptools \
                 python-numpy python-scipy \
                 libatlas-dev libatlas3gf-base
sudo apt-get install python-matplotlib

LINUX w/ Python 3:

sudo apt-get install build-essential python3-dev python3-setuptools \
                 python3-numpy python3-scipy \
                 libatlas-dev libatlas3gf-base
sudo apt-get install python3-matplotlib
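Whichever route you took, a quick check from Python can confirm that the libraries are importable. The sketch below uses only the standard library. The function name missing_packages and the REQUIRED list are hypothetical, chosen for illustration; note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the libraries this tutorial leans on (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All set.")
```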

Problems with installation? Check out: http://scikit-learn.org/stable/install.html

If you get hung up with the installation or the repo update, you can also get the gist: https://gist.github.com/rebeccabilbro/d40599f4ec96aa21dc48

Key Concepts

Machine Learning

Classification

Cross-Validation http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
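A held-out test set is the simplest form of this idea. The sketch below is a toy illustration, not the workshop's code: the arrays are invented stand-ins for the wrangled Titanic features. It imports train_test_split from sklearn.model_selection, where the function lives in scikit-learn 0.18 and later; the linked page documents the older sklearn.cross_validation module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the wrangled Titanic data.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% for testing; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when comparing models on the same held-out rows.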

Model Evaluation

  - Scores
  - Classification reports
  - Visualization tools
  - Precision and recall
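To make these evaluation ideas concrete, here is a small sketch using scikit-learn's metrics module. The true labels and predictions are hypothetical (1 = survived, 0 = did not); in the workshop they would come from a fitted model and a held-out test set.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = survived, 0 = did not).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of predicted survivors, share that survived
recall = recall_score(y_true, y_pred)        # of actual survivors, share that were found
print(accuracy, precision, recall)

# Per-class precision, recall, and F1 in one text table.
print(classification_report(y_true, y_pred))
```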

Key Tools in Scikit-Learn

Linear Regression http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Random Forests http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
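A minimal random forest fit might look like the following sketch. The two features (passenger class and an encoded sex column) and the labels are invented so that the pattern is easy to learn; they are not taken from the real manifest, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy features: [Pclass, Sex (0 = male, 1 = female)] with invented labels.
X = np.array([[3, 0], [1, 1], [3, 1], [2, 0], [1, 1], [3, 0], [2, 1], [1, 0]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# An ensemble of 50 decision trees, each grown on a bootstrap sample of the rows.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Predict for two new passengers and inspect which feature the forest relied on.
predictions = model.predict(np.array([[3, 0], [1, 1]]))
print(predictions)
print(model.feature_importances_)
```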

Support Vector Machines http://scikit-learn.org/stable/modules/svm.html
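As a rough sketch of how a support vector classifier can be scored with k-fold cross-validation, consider the example below. The features and labels are invented toy data, and the linear kernel and cv=4 are arbitrary choices for illustration rather than settings from the workshop.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy features [Pclass, Sex (0 = male, 1 = female)] with invented labels.
X = np.array([[3, 0], [1, 1], [3, 1], [2, 0], [1, 1], [3, 0], [2, 1], [1, 0]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# A linear kernel keeps the example simple; SVC defaults to an RBF kernel.
classifier = SVC(kernel="linear", C=1.0)

# 4-fold cross-validation: fit on three folds, score accuracy on the fourth.
scores = cross_val_score(classifier, X, y, cv=4)
print(scores, scores.mean())
```

Averaging the fold scores gives a steadier estimate of accuracy than a single train/test split, at the cost of fitting the model several times.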

Sources

This tutorial is based on the following tutorials for Kaggle's Titanic competition:

https://www.kaggle.com/mlchang/titanic/logistic-model-using-scikit-learn/run/91385

https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests

https://github.com/savarin/pyconuk-introtutorial/tree/master/notebooks
