v1.2, Nov 16: Update the readme to work with Canvas v1.1; fix a bug in "Entrance.scala"
v1.0, Nov 13, Initial version
In this phase, you are required to do spatial hot spot analysis. In particular, you need to complete two different hot spot analysis tasks:
This task requires a range join operation between a rectangle dataset and a point dataset. For each rectangle, count the number of points located within it: the more points a rectangle contains, the hotter it is. The task, therefore, is to calculate the hotness of all the rectangles.
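The counting step above can be sketched on plain Scala collections; in the template the same idea would run as a Spark range join. The names below (`Rect`, `contains`, `countPointsPerZone`) are illustrative, not from the template:

```scala
// Hypothetical sketch of the hot-zone hotness computation.
case class Rect(x1: Double, y1: Double, x2: Double, y2: Double)

// A point lies in a rectangle if it falls within both coordinate ranges
// (boundary points counted as inside here; the template may differ).
def contains(r: Rect, px: Double, py: Double): Boolean =
  px >= math.min(r.x1, r.x2) && px <= math.max(r.x1, r.x2) &&
  py >= math.min(r.y1, r.y2) && py <= math.max(r.y1, r.y2)

// For each rectangle, count the points inside it (its "hotness").
def countPointsPerZone(zones: Seq[Rect],
                       points: Seq[(Double, Double)]): Map[Rect, Int] =
  zones.map(z => z -> points.count { case (px, py) => contains(z, px, py) })
       .toMap
```

A Spark version would express the same predicate as a join condition between the zone and point datasets instead of a nested scan.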
This task will focus on applying spatial statistics to spatio-temporal big data in order to identify statistically significant spatial hot spots using Apache Spark. The topic of this task is from ACM SIGSPATIAL GISCUP 2016.
The Problem Definition page is here: http://sigspatial2016.sigspatial.org/giscup2016/problem
The Submit Format page is here: http://sigspatial2016.sigspatial.org/giscup2016/submit
Note: you may clip the source data to an envelope (latitude 40.5N to 40.9N, longitude 73.7W to 74.25W) encompassing New York City in order to remove some of the noisy, erroneous data.
As stated in the Problem Definition page, in this task you are asked to implement a Spark program to calculate the Getis-Ord statistic of the NYC Taxi Trip datasets. We call this "hot cell analysis".
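The Getis-Ord Gi* statistic can be sketched on plain Scala collections as below. This is a hedged illustration of the formula from the GISCUP 2016 problem definition, not the template's implementation; the function name `getisOrd` and the binary-weight setup (w_ij = 1 for the focal cell and its neighbors, 0 otherwise) are our assumptions:

```scala
// xs: the attribute value (point count) of every cell in the study area.
// neighborVals: the values of the focal cell and its neighbors, i.e. the
// cells whose weight w_ij = 1.
def getisOrd(xs: Seq[Double], neighborVals: Seq[Double]): Double = {
  val n    = xs.size.toDouble
  val mean = xs.sum / n                                   // X-bar
  val std  = math.sqrt(xs.map(x => x * x).sum / n - mean * mean)  // S
  val wSum = neighborVals.size.toDouble                   // sum of w_ij
  // With binary weights, the sum of w_ij^2 equals the sum of w_ij.
  val numerator   = neighborVals.sum - mean * wSum
  val denominator = std * math.sqrt((n * wSum - wSum * wSum) / (n - 1))
  numerator / denominator
}
```

A cell surrounded by high counts gets a large positive score (a hot spot); a cell surrounded by low counts gets a negative one.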
To reduce the computation power needed, we made the following changes:
Example
test/output hotzoneanalysis src/resources/point-hotzone.csv src/resources/zone-hotzone.csv hotcellanalysis src/resources/yellow_trip_sample_100000.csv
Note:
The main function/entrance is the "cse512.Entrance" Scala file.
Point data: the input point dataset is the pickup points of the New York City taxi trip datasets. The data format in this phase is the original NYC taxi trip format, which is different from Phase 2, but the coding template has already parsed it for you. Find the data in our ASU Google Drive shared folder: https://drive.google.com/open?id=1bN-U4nknvN5p7jiVHO-wduM7oXR5CBji
Zone data (only for hot zone analysis): at "src/resources/zone-hotzone" of the template
The input point data can be any small subset of NYC taxi dataset.
The input point data is a monthly NYC taxi trip dataset (2009-2012) like "yellow_tripdata_2009-01_point.csv"
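Before the statistic can be computed, each pickup record has to be mapped to a space-time cell. The sketch below assumes the GISCUP 2016 setup of 0.01-degree square cells and one-day time steps; the exact helper in the template may differ, and `toCell` with its rounding choice is our assumption:

```scala
import java.time.LocalDate

// Map a pickup record to (cellX, cellY, day) coordinates, assuming
// 0.01-degree cells and the day of month as the time step.
def toCell(lon: Double, lat: Double, pickupDate: LocalDate): (Int, Int, Int) = {
  val step = 0.01                          // assumed cell size in degrees
  val x = math.floor(lon / step).toInt     // e.g. lon = -73.984 -> x = -7399
  val y = math.floor(lat / step).toInt
  val z = pickupDate.getDayOfMonth
  (x, y, z)
}
```

Cell coordinates of this shape match the example output rows such as "-7399,4075,15" shown later in this readme.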
All zones with their counts, sorted by the "rectangle" string in ascending order.
"-73.795658,40.743334,-73.753772,40.779114",1
"-73.797297,40.738291,-73.775740,40.770411",1
"-73.832707,40.620010,-73.746541,40.665414",20
The coordinates of the top 50 hottest cells, sorted by their G score in descending order. Note: DO NOT OUTPUT the G score.
-7399,4075,15
-7399,4075,29
-7399,4075,22
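Producing rows of that shape amounts to ranking the scored cells and dropping the score column. A minimal illustration, with our own names (`top50` and the tuple layout are not from the template):

```scala
// Given (cellX, cellY, day, gScore) tuples, keep the 50 hottest cells
// by G score descending and emit only the coordinates, as required.
def top50(cells: Seq[(Int, Int, Int, Double)]): Seq[String] =
  cells.sortBy { case (_, _, _, g) => -g }
       .take(50)
       .map { case (x, y, z, _) => s"$x,$y,$z" }
```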
An example input and answer are provided in the "testcase" folder of the coding template.
DO NOT DELETE any existing code in the coding template unless you see this: "YOU NEED TO CHANGE THIS PART".
In the code template, compile and package your code with "sbt clean assembly". We will run the compiled package on our cluster directly using "spark-submit" with parameters. If your code cannot compile and package, you will not receive any points. This section is the same as in Phase 2.
If you are using the Scala template, add
.master("local[*]")
after
.config("spark.some.config.option", "some-value")
to tell the IDE that the master IP is localhost, and change % "provided" to % "compile" in order to debug your code in the IDE.

If you are using the Scala template, package your code with "sbt clean assembly". You may need to install sbt in order to run this command. Then run the jar with spark-submit, for example:
./bin/spark-submit ~/GitHub/CSE512-Project-Hotspot-Analysis-Template/target/scala-2.11/CSE512-Project-Hotspot-Analysis-Template-assembly-0.1.0.jar test/output hotzoneanalysis src/resources/point-hotzone.csv src/resources/zone-hotzone.csv hotcellanalysis src/resources/yellow_tripdata_2009-01_point.csv