
Training models with Apache Spark, PySpark for Titanic Kaggle competition


Sparkling Titanic

Introduction

The script in this repository trains a logistic regression model and makes predictions for the Titanic dataset, as part of the Kaggle competition, using Apache Spark (spark-1.3.1-bin-hadoop2.4) with its Python API on a local machine. The data is loaded as a Spark DataFrame; for more instructions see this.
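As a rough illustration of the training step, here is a minimal sketch using the RDD-based MLlib API that ships with Spark 1.3. It is not the repository's script. The feature columns (Pclass, Sex, SibSp, Parch, Fare), the file name train.csv, and the helper names parse_row, train and main are assumptions made for this example. It also reads the CSV on the driver with Python's csv module instead of loading it as a Spark DataFrame.

```python
import csv


def parse_row(row):
    """Turn one record of Kaggle's train.csv (a dict) into (label, features).

    The column choice is an assumption for this sketch; columns with
    missing values (such as Age) are left out on purpose.
    """
    label = float(row["Survived"])
    features = [
        float(row["Pclass"]),
        1.0 if row["Sex"] == "male" else 0.0,  # encode sex as 0/1
        float(row["SibSp"]),
        float(row["Parch"]),
        float(row["Fare"]),
    ]
    return label, features


def train(sc, path="train.csv"):
    """Fit a logistic regression model on an RDD of LabeledPoint."""
    # Spark imports live here so parse_row can be used without Spark.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    with open(path) as f:  # the file is small, so read it on the driver
        rows = list(csv.DictReader(f))
    points = (sc.parallelize(rows)
                .map(parse_row)
                .map(lambda lf: LabeledPoint(lf[0], lf[1])))
    return LogisticRegressionWithLBFGS.train(points, iterations=100)


def main():
    from pyspark import SparkContext
    sc = SparkContext(appName="SparklingTitanic")  # hypothetical app name
    model = train(sc)
    print(model.weights)
    sc.stop()
```

To try it, add a call to main() at the end of the file and run the file with spark-submit. Predictions for a new passenger would come from model.predict(features) with the features in the same order as in parse_row.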

The following will be added later:

  • Imputing NAs in train and test sets
  • Cross-validation
  • Using more features and feature engineering
  • RandomForest classifier, SVM, etc.

Running PySpark Script in Shell

Use $SPARK_HOME/bin/spark-submit scriptDirectoryPath/ to run the script. To run with multiple threads, add the option --master local[N], where N is the number of threads.
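The same local[N] master URL can also be set from inside the script. The sketch below shows one way to do that. The helper names local_master and make_context and the application name are hypothetical and are not part of the repository.

```python
def local_master(n_threads=None):
    """Build a local-mode master URL such as "local[4]".

    With no argument this returns "local[*]", which asks Spark to use
    as many worker threads as there are logical cores.
    """
    if n_threads is None:
        return "local[*]"
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return "local[%d]" % n_threads


def make_context(n_threads=None, app_name="SparklingTitanic"):
    """Create a SparkContext that runs locally with n_threads threads."""
    # Imported here so local_master can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(local_master(n_threads)).setAppName(app_name)
    return SparkContext(conf=conf)
```

A master set on SparkConf in code takes precedence over the --master flag passed to spark-submit, so use one or the other. Keeping it on the command line is usually the more flexible choice.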