Titanic

Download, explore, and wrangle the Titanic passenger manifest dataset with an eye toward developing a predictive model for survival.

This tutorial is based on the Kaggle competition "Predicting Survival Aboard the Titanic".

RMS Titanic, ocean liner (1912). "Cd51-1000g" by Boris Lux, licensed under CC BY-SA 3.0 via Wikimedia Commons.

STEP ONE: EXPLORATORY ANALYSIS

Start by cloning this repository.

Anaconda users: you should have everything you need, but if you find you are missing anything, type this into the command line:

conda install -c https://conda.anaconda.org/blaze <package>

Others: make sure the required libraries are installed by using:

pip install -r requirements.txt

Then look inside the data folder and open train.csv to check out the dataset we'll be exploring today.
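
If you would rather glance at the file from a script before opening the notebook, here is a minimal sketch using only the Python standard library. It assumes the file sits at data/train.csv relative to the repository root. The helper name peek_csv is hypothetical and is not part of the lab notebook.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) from a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first line holds the column names
        rows = list(islice(reader, n_rows))  # read only the rows we need
    return header, rows


if __name__ == "__main__":
    # Assumed location; adjust if your clone is laid out differently.
    csv_path = Path("data") / "train.csv"
    if csv_path.exists():
        header, rows = peek_csv(csv_path)
        print(header)
        for row in rows:
            print(row)
    else:
        print(f"{csv_path} not found; run this from the repository root.")
```

Every value comes back as a string here, which is a useful reminder that type conversion is part of the wrangling work.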

To start the lab, open the IPython Notebook file: titanic_wrangling.ipynb.

Things to think about

How do you explore a new dataset?
What should you look for in tabular data?
What visualization tools can help you explore?
What is the end goal of data wrangling? Why are we even doing this?
What needs cleaning, and how should you clean it?
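
One possible way into the "what to clean" question is to count the empty cells in each column. The sketch below uses only the standard library. The helper name count_missing is hypothetical, and the sample rows and column names are invented for illustration; they are not taken from train.csv.

```python
import csv
import io


def count_missing(handle):
    """Return (row_count, {column: number of empty cells}) for an open CSV file."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = {name: 0 for name in fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in fieldnames:
            value = row.get(name)
            # Treat absent fields and blank strings as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return total, missing


# Invented sample data, used only to show the shape of the result.
SAMPLE = "PassengerId,Age,Cabin\n1,30,\n2,,B12\n3,45,\n"

if __name__ == "__main__":
    total, missing = count_missing(io.StringIO(SAMPLE))
    for column, count in missing.items():
        print(f"{column}: {count} of {total} missing")
```

To run it on the real data, pass an open file instead of the sample, for example open("data/train.csv", newline=""), assuming that path matches your clone. Columns with many gaps are candidates for either imputation or removal.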

STEP TWO: MACHINE LEARNING FROM DISASTER

(You will do this portion in the Machine Learning course.)

The IPython Notebook for this class is called titanicML_workshop.ipynb. To get it, navigate on the command line to the titanic repository that you cloned for the previous class, then stash any uncommitted local changes and pull the latest version:

git stash
git pull origin master

If you haven't already installed Scikit-learn, do that now.

Anaconda users: you already have Scikit-learn! If you ever find you are missing anything, type this into the command line:

conda install -c https://conda.anaconda.org/blaze <package>

Everyone else, make sure Scikit-learn is installed:

WINDOWS USERS:

pip install -U scikit-learn

MAC OS X USERS:

pip install -U numpy scipy scikit-learn

LINUX w/ Python 2:

sudo apt-get install build-essential python-dev python-setuptools \
                 python-numpy python-scipy \
                 libatlas-dev libatlas3gf-base
sudo apt-get install python-matplotlib

LINUX w/ Python 3:

sudo apt-get install build-essential python3-dev python3-setuptools \
                 python3-numpy python3-scipy \
                 libatlas-dev libatlas3gf-base
sudo apt-get install python3-matplotlib

Problems with installation? Check out: http://scikit-learn.org/stable/install.html
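
To check whether the libraries can be imported after installing, a small sketch like the following may help. It assumes the import names numpy, scipy, sklearn, and matplotlib (scikit-learn is imported as sklearn). The helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names, which can differ from the names used with pip or apt-get.
    wanted = ["numpy", "scipy", "sklearn", "matplotlib"]
    absent = missing_modules(wanted)
    print("Missing:", ", ".join(absent) if absent else "none")
```

Run it with the same Python interpreter you use for the notebook, since a package installed for one interpreter is not necessarily visible to another.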

If you get hung up with the installation or the repo update, you can also get the gist: https://gist.github.com/rebeccabilbro/d40599f4ec96aa21dc48