DISTOD algorithm: Distributed discovery of bidirectional order dependencies
MIT License
The DISTOD data profiling algorithm is a distributed algorithm to discover bidirectional order dependencies (in set-based form) from relational data. DISTOD is based on the single-threaded FASTOD-BID algorithm [1], but DISTOD scales elastically to many machines outperforming FASTOD-BID by up to orders of magnitude.
Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query optimization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded [1] and distributed [2] state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets.
Schmidl, S., Papenbrock, T. Efficient distributed discovery of bidirectional order dependencies. The VLDB Journal 31, pages 4974 (2022). https://doi.org/10.1007/s00778-021-00683-4
The following software is required:
distod-vx.x.x.jar
) from the latest release.distod.jar
.java -Xms2g -Xmx2g -XX:+UseG1GC \
-Ddistod.input.path="data/iris.csv" -Ddistod.input.has-header="yes" \
-Dfile.encoding=UTF-8 \
-jar distod.jar
Attention!
Please make sure to always run DISTOD with the G1 GC (
-XX:+UseG1GC
) and both heap memory limits (-Xms
and-Xmx
) set. If you do not explicitly disable system monitoring, it is required to set both heap memory limits during application start to the same value. This allows the system monitoring component to make more accurate decision regarding the memory usage.
DISTOD exposes various parameters to control its execution and features as configuration options. Configuration values can be set on the command line or using a configuration file:
java -Xms2g -Xmx2g -XX:+UseG1GC -Ddistod.input.path="data/iris.csv" -Ddistod.input.has-header="yes" -Dfile.encoding=UTF-8 -jar distod.jar
distod.conf
with the content
distod.input {
path = data/iris.csv
has-header = yes
}
run DISTOD using java -Xms2g -Xmx2g -XX:+UseG1GC -Dconfig.file=distod.conf -Dfile.encoding=UTF-8 -jar distod.jar
.For a full description of all ways one can use and set configuration options, we refer to the lightbend/config
documentation which we make use of here.
You can find all parameters of DISTOD, their default value, and their description in the application.conf
file bundled with DISTOD.
You can also change the logging behavior of DISTOD by altering the configuration of DISTOD's logback logger. Follow this procedure:
logback.xml
) with your logging configuration, such as:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<appender name="FILE" class="ch.qos.logback.core.FileAppender">
<file>distod.log</file>
<append>false</append>
<encoder>
<pattern>[%d{HH:mm:ss.SSS} %-5level] %30.30X{akkaSource:-local}| %msg%n</pattern>
</encoder>
</appender>
<root level="INFO">
<appender-ref ref="FILE"/>
</root>
</configuration>
-Dlogback.configurationFile
:
java -Xms2g -Xmx2g -XX:+UseG1GC -Dlogback.configurationFile=logback.xml \
-Ddistod.input.path="data/iris.csv" -Ddistod.input.has-header="yes" \
-Dfile.encoding=UTF-8 \
-jar distod.jar
After some careful, manual evaluation on the OpenJDK 1.8.0_265 (64-Bit) Server VM,
we set the following additional java options to tune the Java GC to clean up more older generation objects that get freed up by DISTOD's PartitionMgr
.
-XX:G1ReservePercent=10
: matches DISTOD's default parameter value for distod.monitoring.heap-eviction-threshold
(see configuration file)-XX:G1HeapWastePercent=1
: reduce waste and try to reclaim as much OldGen as possible; increases mixed GC cycles (reduced from 5)-XX:MaxGCPauseMillis=400
: allow longer GC pauses (increased from 200)-XX:G1MixedGCLiveThresholdPercent=60
: already mark regions that only contain 60% garbage (reduced from 85)-XX:G1MixedGCCountTarget=10
: have up to 10 consecutive mixed runs to clean up more OldGen garbage (increased from 8)-XX:G1OldCSetRegionThresholdPercent=20
: reclaim up to 20% (of heap size) OldGen garbage in one run (increased from 10)This results in the following run-command:
java -Xms2g -Xmx2g -XX:+UseG1GC \
-XX:+UnlockExperimentalVMOptions \
-XX:G1ReservePercent=10 -XX:MaxGCPauseMillis=400 -XX:G1HeapWastePercent=1 \
-XX:G1MixedGCLiveThresholdPercent=60 -XX:G1MixedGCCountTarget=10 -XX:G1OldCSetRegionThresholdPercent=20 \
-Dconfig.file="$distod.conf" -Dlogback.configurationFile=logback.xml -Dfile.encoding=UTF-8 \
-jar distod.jar
The configuration of DISTOD and its competitors used for the experiments in the paper can be found in the experiments
-folder.
Note that we do not publish the results of the experiments in this Github-Repository due to their size.
Please contact Sebastian Schmidl directly for the experiments' result backups.
git clone [email protected]:CodeLionX/distod.git
sbt assembly
You can find the fat-jar for DISTOD in the target folder of the distod
-module.distod
-module run sbt distod/assembly
.distod
from within SBT, use sbt "; project distod; set javaOptions += "-Dconfig.file=distod.conf"; run"
If you want to contribute to this project, you are more then welcome to do so. Please read the contribution guidlines before submitting new issues or pull requests.
JMC is a tool to analyze metric recordings with the Java Flight Recorder (JFR). JFR is a very lightweight way to collect low level and detailed runtime information built into the Oracle JDK. The application (JAR) must therefore be run with an Oracle JVM. You can download it (Java SE JDK 11) from their website. A Oracle account is required (free).
To profile the application, build the DISTOD assembly with sbt assembly
and run it with an Oracle JVM using the following parameters:
oracle-java -XX:+FlightRecorder -XX:StartFlightRecording=maxage=5m,filename=distod-1.jfr,dumponexit=true -Dcom.sun.management.jmxremote.autodiscovery=true -jar distod.jar
If you use a JDK older then version 11, you also have to enable the commercial features with the flag -XX:+UnlockCommercialFeatures
before the other options are available.
You can adjust the filename of the results in the parameters. Afterwards the profiling results can be examined using the JDK Mission Control (JMC). You can download it from this site.
Start a longer running job with DISTOD, then open VisualVM and connect to the running VM. You can profile the process using the sample tab ("CPU"). Stop it at any time to inspect the results without them changing constantly.
If you want to monitor a remote process, you can start the following daemon for VisualVM to connect to:
jstatd -J-Djava.security.policy=jstatd.all.policy
An example jstatd.all.policy
-file can be found in the deployment
folder.
If you want to use the sampler, you have to enable JMX as well.
To do this, use the following options when starting the JVM and then connect to hostname:9010
using VisualVM:
-Dcom.sun.management.jmxremote
: enables JMX-Dcom.sun.management.jmxremote.authenticate=false
, -Dcom.sun.management.jmxremote.ssl=false
: disables authentication (dangerous!)-Dcom.sun.management.jmxremote.port=9010
, -Dcom.sun.management.jmxremote.rmi.port=9010
: sets remote port-Djava.rmi.server.hostname="$(hostname)"
: sets remote host to bind toExample call:
java -Xms2g -Xmx2g -XX:+UseG1GC \
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.rmi.port=9010 -Djava.rmi.server.hostname="$(hostname)" \
-Dconfig.file="distod.conf" \
-Dlogback.configurationFile=logback.xml \
-Dfile.encoding=UTF-8 \
-jar distod.jar
Supply program arguments to run
-command in SBT shell:
; set javaOptions += "-Dconfig.file=path/to/config.conf"; run
Pin a process to a CPU core (or multiple). The CPU number starts with 1. This command only works on linux:
taskset --cpu-list 1,2 <cmd>
Running the distributed version of FASTOD with Spark:
lib
-folder), application (jar
-file) on head node of cluster (odin01
)scripts/to-json.py -s
script.spark-submit
on head node to start algorithmspark-submit --jars libs/fastutil-6.1.0.jar,libs/lucene-core-4.5.1.jar --class FastODMain --master spark://odin01:7077 --executor-memory 10G --num-executors 2 --executor-cores 10 --total-executor-cores 20 distributed-fastod.jar file:${DATASET}" "${BATCHSIZE}"
The total-executor-cores
values is calculated based on the number of executors (nodes) and the number of processors (cores) that should be used by the executor.