Training models with Apache Spark (PySpark) for the Titanic Kaggle competition
titanic_logReg.py trains a logistic regression model and makes predictions for the Titanic dataset, as part of the Kaggle competition, using Apache Spark (spark-1.3.1-bin-hadoop2.4) with its Python API on a local machine. I used pyspark_csv.py to load the data as a Spark DataFrame; for more instructions, see this.
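As a rough illustration of the training step, here is a minimal sketch using the RDD-based MLlib API that ships with Spark 1.x. It is not the contents of titanic_logReg.py: the feature choice (Pclass, Sex, Age, Fare), the placeholder age of 30.0, and both function names are assumptions made for the example, and rows are assumed to be dicts keyed by the Kaggle column names.

```python
def to_label_and_features(row):
    """Map one Titanic record (a dict keyed by Kaggle column names)
    to a (label, features) pair of floats."""
    sex = 1.0 if row["Sex"] == "male" else 0.0
    # Missing ages get a placeholder; 30.0 is an arbitrary assumption.
    age = float(row["Age"]) if row["Age"] not in (None, "") else 30.0
    features = [float(row["Pclass"]), sex, age, float(row["Fare"])]
    return float(row["Survived"]), features


def train_logistic_regression(rdd_of_rows, iterations=100):
    """Train a model on an RDD of row dicts (hypothetical helper).
    Needs a Spark 1.x installation; LogisticRegressionWithSGD was
    deprecated in Spark 2.0 and is absent from later releases."""
    # Imported lazily so the helper above works without Spark.
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    def to_point(row):
        label, features = to_label_and_features(row)
        return LabeledPoint(label, features)

    points = rdd_of_rows.map(to_point)
    return LogisticRegressionWithSGD.train(points, iterations=iterations)
```

Keeping the row conversion separate from the Spark call makes it possible to check the feature encoding without starting a SparkContext.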
The following will be added later
Use $SPARK_HOME/bin/spark-submit scriptDirectoryPath/titanic_logReg.py to run the script. For multithreading, you can add the option --master local[N], where N is the number of threads.
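If you prefer to launch the job from Python, the command above could be assembled along these lines. This is only a sketch: build_submit_command is a hypothetical helper, scriptDirectoryPath is the same placeholder as above, and SPARK_HOME is assumed to be set in the environment unless passed explicitly.

```python
import os


def build_submit_command(script_path, threads=None, spark_home=None):
    """Return the spark-submit argument list for a local run."""
    # Fall back to the SPARK_HOME environment variable (assumed to be set).
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    command = [os.path.join(spark_home, "bin", "spark-submit")]
    if threads is not None:
        # spark-submit options must come before the script path.
        command += ["--master", "local[{}]".format(threads)]
    command.append(script_path)
    return command
```

Passing the returned list to subprocess.check_call would start the job, for example build_submit_command("scriptDirectoryPath/titanic_logReg.py", threads=4) for four local threads.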