Girder as a backend for a Bazel ExternalData, aka File Versioning
BSD-3-Clause License
The intent of this project is to demonstrate how to integrate the download of large data files from Girder with a Bazel-based build system.
Note that the principles illustrated here are not specific to Bazel; they have already been successfully implemented in other build systems like CMake through its ExternalData module.
Possible improvements and known limitations are documented in the issue tracker.
Note that while this demo is restricted to Linux, support for large data files using Bazel and Girder can definitely be extended to other platforms like macOS and Windows.
This is the server where the large files are stored.
After installing Girder, enable the hashsum_download plugin, which provides the download-by-hashsum endpoint used below.
Also install and enable the data_revisions plugin to support filepath-based versioning.
Optionally, to allow visualizing OBJ files, enable the vtk_viewer plugin.
Create a Collection with a data folder.
Create a Girder API key with at least the read_data and write_data scopes.
The <folder_id> used below can be retrieved by browsing to the created folder and copying it from the URL (e.g. 59307411739ba619e0eaa82e).
The test data used in this demo are licensed under the Creative Commons Attribution-ShareAlike license. More details here.
# Download test data
cd /tmp
curl -L --progress-bar -o small_dragon.obj https://github.com/jcfr/bazel-large-files-with-girder/releases/download/test-data/small_dragon.obj
curl -L --progress-bar -o large_dragon.obj https://github.com/jcfr/bazel-large-files-with-girder/releases/download/test-data/large_dragon.obj
# Install Girder client
mkvirtualenv test_data_upload
pip install girder-client
# Checkout project including upload client
git clone https://github.com/jcfr/bazel-large-files-with-girder.git
cd bazel-large-files-with-girder
# Set Girder server information
export GIRDER_API_KEY=<API_KEY>
export GIRDER_SERVER=https://girder.example.org
export GIRDER_FOLDER_ID=59307411739ba619e0eaa82e
# Upload file into data folder
mv /tmp/small_dragon.obj ./data/dragon.obj
./thirdparty/girder_data_revision_upload.py ./data/dragon.obj
mv /tmp/large_dragon.obj ./data/dragon.obj
./thirdparty/girder_data_revision_upload.py ./data/dragon.obj
By using the thirdparty/girder_data_revision_upload.py script, each item is associated with additional metadata:
keys versionedFilePath and versionedFileRevision, allowing retrieval of a file revision based on a canonical path (e.g. data/dragon.obj)
key vtkView set to surface, allowing uploaded *.obj files to be rendered in the 3D viewer powered by the vtk_viewer Girder plugin (which internally uses vtk.js).
You can reuse the Girder API key created earlier to upload the test data, or you can create a new one with only the read_data scope.
Get sources:
cd /tmp
git clone https://github.com/jcfr/bazel-large-files-with-girder.git
Configure Girder server address and API key:
In the context of this demo, the Girder server URL and API key are hardcoded in tools/download_data.sh. Make sure to edit this file and replace the following strings with the ones matching your environment:
<GIRDER_SERVER> .... : Replace with the Girder URL, for example https://girder.example.org
<GIRDER_API_KEY> ... : Replace with the API key created above
Run the test (the default branch v2-step1-small_dragon references a small OBJ file of ~8 MB):
bazel run test:inspect_dragon
The small dragon is expected to appear:
Type CTRL-C
to exit.
Note that while we execute the test outside of the Bazel sandbox using run (required to allow the viewer to access the DISPLAY environment variable), downloading Girder data is expected to work in a sandboxed environment. It only requires bash and a basic Python interpreter.
Run the test associated with branch v2-step2-large_dragon, which references a large OBJ file (~35 MB):
git checkout -b v2-step2-large_dragon origin/v2-step2-large_dragon
bazel run test:inspect_dragon
The large dragon is expected to appear:
Type CTRL-C
to exit.
This source tree is a simple Bazel project organized like this:
<root>
|-- WORKSPACE ... : Download objviewer used in inspect model tests
|-- data ........ : Key files referencing large data
|-- test ........ : Tests depending on large data files
|-- tools ....... : Girder client script to download large files
Note that the inspect model tests are not legitimate tests; they simply allow interactively visualizing which model has been fetched and made available in the Bazel workspace.
The OBJ model visualization is done by running a simple Linux executable statically built against VTK. More details here.
The basic idea is to update Bazel test (or binary) targets to depend on a data target representing the file to download.
In this demo project, the test inspect_dragon depends on a data target //data:dragon.obj that takes care of downloading the file from Girder when the test is executed.
sh_test(
name = "inspect_dragon",
srcs = ["inspect_model.sh"],
args = ["$(location //data:dragon.obj)"],
data = ["@objviewerArchive//:objviewer", "//data:dragon.obj"],
)
Note also that the @objviewerArchive//:objviewer
target takes care of
downloading the viewer used in the test.
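For reference, here is a minimal sketch of what a data target like //data:dragon.obj could look like. This is an illustration only, not the project's actual BUILD file; the genrule wiring and the //tools:download_data label are assumptions.

```python
# data/BUILD (hypothetical sketch)
genrule(
    name = "dragon_obj",
    srcs = ["dragon.obj.sha512"],   # key file tracked in git
    outs = ["dragon.obj"],          # actual data, fetched at build time
    cmd = "$(location //tools:download_data) $< $@",
    tools = ["//tools:download_data"],
)
```

Targets that list //data:dragon.obj in their data attribute then trigger the download through Bazel's ordinary dependency resolution.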
TBD
TBD
TBD
Data files are not directly stored in the source tree managed by the version control system (e.g. git); instead, a key file containing the hashsum of the original data is tracked.
The use of a cryptographic hash function allows a data file of arbitrary size to be uniquely represented by a fixed-size string called a hashsum.
We will be using the SHA-512 function, a collision-resistant implementation belonging to the SHA-2 family of hash algorithms.
For example, the key file associated with a file named large_data_file.ext could be stored as:
data/large_data_file.ext.sha512
and contain a 128-character string like:
01fdd890676e9b2f7f5a8eb25c01dcdb168d23e3f9d95f804df44ff235a1c022c8f516a4fe5871f37ebaa2188c640c7624c738c71c5f3965924b7bd2f9bab11b
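Such a key file can be (re)generated with standard tools; a minimal sketch using sha512sum, where the sample data file is a hypothetical stand-in for a real large file:

```shell
# Hypothetical stand-in for a real large data file.
mkdir -p data
printf 'example payload' > data/large_data_file.ext

# Compute the SHA-512 hashsum and store it in the key file tracked by git.
sha512sum data/large_data_file.ext | cut -d' ' -f1 > data/large_data_file.ext.sha512

# The key file holds a single 128-character hex string.
cat data/large_data_file.ext.sha512
```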
Given the hashsum contained in the key file, the actual file can be downloaded from Girder using this API endpoint:
https://${server}/api/v1/file/hashsum/sha512/${hashsum}/download
where ${hashsum}
corresponds to a 128-character string.
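Putting the two together, the download URL can be assembled from the key file; a sketch assuming the example key file from above exists. The server name is a placeholder, and the final curl call is left commented out since it requires a live server:

```shell
# Placeholder server; replace with your Girder instance.
server=girder.example.org

# Read the hashsum from the key file and build the download URL.
hashsum=$(cat data/large_data_file.ext.sha512)
url="https://${server}/api/v1/file/hashsum/sha512/${hashsum}/download"
echo "${url}"

# curl -L -o data/large_data_file.ext "${url}"
```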
Installation of the data_revisions plugin adds a new item/revisions endpoint allowing retrieval of all items matching a given path:
https://${server}/api/v1#!/item/item_getRevisionsByPath
For example, see http://ec2-184-72-193-101.compute-1.amazonaws.com/api/v1#!/item/item_getRevisionsByPath
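A sketch of querying this endpoint with curl. The server name is a placeholder, and the query parameter name path is an assumption based on the endpoint description above; check the interactive API docs for the exact signature:

```shell
# Placeholder server and canonical path.
server=girder.example.org
file_path=data/dragon.obj

# Build the URL listing all revisions of the item at this canonical path.
url="https://${server}/api/v1/item/revisions?path=${file_path}"
echo "${url}"

# curl -s "${url}"
```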
The git-lfs (Git Large File Storage) extension also allows storing pointers to files kept on a data server.
While this approach also leverages hashsums to retrieve the original files, it is worth noting that, by default, all data files are downloaded when git cloning a repository.
Since in our case we are interested in:
... the git-lfs approach is suboptimal.
Instead, directly integrating with the build system allows leveraging its dependency resolution mechanism to selectively download files.
It is covered by the 3-Clause BSD License:
https://opensource.org/licenses/BSD-3-Clause
The license file was added on 2017-10-24, but you may consider that the license applies to all prior revisions as well.