A tool for monitoring and tuning Spark jobs for efficiency.
Apache 2.0 License

The missing Spark performance debugger that can be dragged and dropped into your Spark application!

Featured at Spark Summit 2016 EU and Spark Summit 2017 SF (introduction, slides, and video available for each).
First, figure out your current Spark version. Refer to `project/BuildUtils.scala` and look for `SUPPORTED_SPARK_VERSIONS`; these are the versions we support out of the box.

If your Spark version has a precompiled Sparklint jar (say, 1.6.1 on Scala 2.10), the jar name will be `sparklint-spark161_2.10`. Note that in the jar name, `161` means Spark 1.6.1 and `2.10` means Scala 2.10.

If your Spark version is not precompiled (e.g. 1.5.0), you can add an entry in `project/BuildUtils.getSparkMajorVersion`, then provide compatible code similar to `spark-1.6` in `src/main/spark-1.5`.
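The artifact naming convention described above can be sketched as a tiny helper (hypothetical, for illustration only; the real build logic lives in `project/BuildUtils.scala`):

```scala
// Illustrative sketch of the jar naming convention:
// sparklint-spark<version digits>_<scala binary version>
object SparklintArtifact {
  def artifactName(sparkVersion: String, scalaBinaryVersion: String): String =
    s"sparklint-spark${sparkVersion.replace(".", "")}_$scalaBinaryVersion"

  def main(args: Array[String]): Unit = {
    println(artifactName("1.6.1", "2.10")) // sparklint-spark161_2.10
    println(artifactName("2.0.1", "2.11")) // sparklint-spark201_2.11
  }
}
```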
For more detail about event logging, how to enable it, and how to gather log files, check http://spark.apache.org/docs/latest/configuration.html#spark-ui
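Event logging is not enabled by default. A typical `spark-defaults.conf` fragment looks like the following (the log directory is a placeholder; see the configuration page linked above for details):

```
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///path/to/eventlog/dir
```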
`SparklintListener` is an implementation of `SparkFirehoseListener` that listens to Spark events while the application is running. To enable it, try one of the following:

- Use the `--packages` option to inject the dependency at job submission time if a precompiled jar exists, e.g. `--conf spark.extraListeners=com.groupon.sparklint.SparklintListener --packages com.groupon.sparklint:sparklint-spark201_2.11:1.0.8`
- Add the dependency to your project directly (see below), then submit with `--conf spark.extraListeners=com.groupon.sparklint.SparklintListener`

Finally, find your Spark application's driver node address, open a browser, and visit port 23763 (the default port) of the driver node.
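Put together, a submission using the precompiled jar might look like this (a sketch; the application class and jar name are placeholders):

```shell
spark-submit \
  --class com.example.MyApp \
  --conf spark.extraListeners=com.groupon.sparklint.SparklintListener \
  --packages com.groupon.sparklint:sparklint-spark201_2.11:1.0.8 \
  my-app.jar
```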
Add the dependency directly. For `pom.xml`:

```xml
<dependency>
  <groupId>com.groupon.sparklint</groupId>
  <artifactId>sparklint-spark201_2.11</artifactId>
  <version>1.0.12</version>
</dependency>
```
For `build.sbt`:

```scala
libraryDependencies += "com.groupon.sparklint" %% "sparklint-spark201" % "1.0.12"
```
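Since `spark.extraListeners` is ordinary Spark configuration, it can also be set programmatically once the dependency above is on your classpath (a configuration sketch; the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Register SparklintListener from code instead of passing --conf
// on the spark-submit command line. App name is a placeholder.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.extraListeners", "com.groupon.sparklint.SparklintListener")

val sc = new SparkContext(conf)
```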
`SparklintServer` can run on your local machine. It reads Spark event logs from the location specified; you can feed Sparklint an event log file to play back activities.

Run `sbt run` to start the server. You can add a directory, a log file, or a remote history server via the UI, or on the command line:

```
sbt "run -d /path/to/log/dir -r"
sbt "run -f /path/to/logfile -r"
sbt "run --historyServer http://path/to/server -r"
```

Then open http://localhost:23763 in your browser.
The `sbt docker` command builds a `roboxue/sparklint:latest` image on your local machine; the image runs `sbt run` for you. Then:

```
docker run -v /path/to/logs/dir:/logs -p 23763:23763 roboxue/sparklint -d /logs && open localhost:23763
docker run -v /path/to/logs/file:/logfile -p 23763:23763 roboxue/sparklint -f /logfile && open localhost:23763
```
To change the port, add `--conf spark.sparklint.port=4242` to your spark-submit script, or pass `--port 4242` as an `sbt run` command-line argument.

Command-line options:

- `-f [FileName]`: Filename of a Spark event log source to use.
- `-d [DirectoryName]`: Directory of Spark event log sources to use. Read in filename sort order.
- `--historyServer [Address]`: Address of a Spark history server to fetch event logs from.
- `-p [pollRate]`: The interval (in seconds) between polling for changes in directory and history event sources.
- `-r`: Set this flag to run each buffer through to its end state on startup.

Development commands (run inside the `sbt` shell):
- `test`: run tests against the default Spark version
- `+ test`: run tests across all supported Scala versions
- `testQuick`: run only the tests affected by the latest changes
- `set sparkVersion := "2.0.0"`: target a specific Spark version
- `++ 2.11.8`: switch to a specific Scala version
- `package`: build the jar
- `+ foreachSparkVersion test`: run tests for every supported Spark version
- `+ foreachSparkVersion publishSigned`: publish signed artifacts for every supported Spark version
- `docker`: build the Docker image
- `dockerPublish`: publish the Docker image
- `dockerBuildAndPush`: build and push the Docker image
To release:

```
sbt release  # github branch merging
git checkout master
sbt sparklintRelease sonatypeReleaseAll
```