A tool for data sampling, data generation, and data diffing
A tool for random data sampling and generation
ScalaCheck generators (Gen[T]) for property-based testing for Scala case classes, Avro, Protocol Buffers, and BigQuery TableRow
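To make the Gen[T] idea concrete, here is a minimal sketch of a generator for a case class and a property checked against it. It uses only plain ScalaCheck (`org.scalacheck.Gen` and `Prop.forAll`) rather than any Ratatool-specific helper, and the `User` case class and its fields are hypothetical, chosen only for illustration.

```scala
import org.scalacheck.{Gen, Prop}
import org.scalacheck.Prop.forAll

// Hypothetical case class, used only for illustration.
final case class User(id: Long, name: String, age: Int)

object UserGenExample {
  // Compose per-field generators into a Gen[User].
  val userGen: Gen[User] = for {
    id   <- Gen.posNum[Long]
    name <- Gen.alphaStr
    age  <- Gen.choose(0, 120)
  } yield User(id, name, age)

  // A property that is evaluated against many generated users.
  val ageInRange: Prop = forAll(userGen) { user =>
    user.age >= 0 && user.age <= 120
  }

  def main(args: Array[String]): Unit = {
    // Gen#sample returns Option[User] because generation can fail.
    println(userGen.sample)
    // Runs the property and prints the result to the console.
    ageInRange.check()
  }
}
```

Because the dependency above is declared in the "test" scope, code like this would normally live under the test sources of an sbt project.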
For more information and documentation, see the project-level READMEs.
If you use sbt, add the following dependency to your build file:
libraryDependencies += "com.spotify" %% "ratatool-scalacheck" % "0.3.10" % "test"
If needed, the following other libraries are published:
ratatool-diffy
ratatool-sampling
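As a sketch, the additional libraries could be declared in build.sbt as shown here. This assumes they are published under the same group ID and version as ratatool-scalacheck; the scopes are also an assumption, so adjust them to your build.

```scala
// build.sbt fragment.
// Assumption: all modules share the group ID and version of ratatool-scalacheck.
val ratatoolVersion = "0.3.10"

libraryDependencies ++= Seq(
  "com.spotify" %% "ratatool-scalacheck" % ratatoolVersion % "test",
  "com.spotify" %% "ratatool-diffy"      % ratatoolVersion,
  "com.spotify" %% "ratatool-sampling"   % ratatoolVersion
)
```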
Or install via our Homebrew tap if you're on a Mac:
brew tap spotify/public
brew install ratatool
ratatool
Or download the release archive, extract it, and run the CLI:
wget https://github.com/spotify/ratatool/releases/download/v0.3.10/ratatool-cli-0.3.10.tar.gz
bin/ratatool directSampler
The command-line tool can sample from the local file system, or directly from Google Cloud Storage if the Google Cloud SDK is installed and authenticated.
bin/ratatool bigSampler avro --head -n 1000 --in gs://path/to/dataset --out out.avro
bin/ratatool bigSampler parquet --head -n 1000 --in gs://path/to/dataset --out out.parquet
# write output to both JSON file and BigQuery table
bin/ratatool bigSampler bigquery --head -n 1000 --in project_id:dataset_id.table_id \
--out out.json --tableOut project_id:dataset_id.table_id
It can also be used to sample from HDFS if core-site.xml and hdfs-site.xml are available.
bin/ratatool bigSampler avro \
--head -n 10 --in hdfs://namenode/path/to/dataset --out file:///path/to/out.avro
Or execute BigDiffy directly:
bin/ratatool bigDiffy \
--input-mode=avro \
--key=record.key \
--lhs=gs://path/to/left \
--rhs=gs://path/to/right \
--output=gs://path/to/output \
--runner=DataflowRunner ....
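The command above compares a left-hand and a right-hand dataset record by record, matched on a key. The following is a small in-memory sketch of that idea using only the Scala standard library. It is illustrative and is not BigDiffy's implementation; the `DiffResult` names and the `KeyedDiffSketch` object are hypothetical.

```scala
// Illustrative only: an in-memory keyed diff, not BigDiffy's implementation.
sealed trait DiffResult
case object Same extends DiffResult
case object Different extends DiffResult
case object MissingLhs extends DiffResult
case object MissingRhs extends DiffResult

object KeyedDiffSketch {
  // Classify every key that appears on either side.
  def diff[K, V](lhs: Map[K, V], rhs: Map[K, V]): Map[K, DiffResult] =
    (lhs.keySet ++ rhs.keySet).iterator.map { key =>
      val result: DiffResult = (lhs.get(key), rhs.get(key)) match {
        case (Some(l), Some(r)) if l == r => Same
        case (Some(_), Some(_))           => Different
        case (None, Some(_))              => MissingLhs
        case (Some(_), None)              => MissingRhs
        // Cannot happen: every key comes from at least one of the maps.
        case (None, None) =>
          throw new IllegalStateException(s"key $key is absent from both sides")
      }
      key -> result
    }.toMap

  def main(args: Array[String]): Unit = {
    val left  = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val right = Map("a" -> 1, "b" -> 20, "d" -> 4)
    // Print one classification per key, sorted by key.
    diff(left, right).toSeq.sortBy(_._1).foreach(println)
  }
}
```

A distributed run differs mainly in that the two sides are grouped by key across workers instead of being held in local maps.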
To test local changes before release:
$ sbt
> project ratatoolCli
> packArchive
The built CLI archive is then available at ratatool-cli/target/ratatool-cli-{version}.tar.gz
Copyright 2016-2018 Spotify AB.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0