Rarr

A simple native R reader for Zarr Arrays

MIT License

title: "Zarr arrays with Rarr"
author: "Mike L. Smith"
output:
  github_document:
    toc: true
    toc_depth: 2

Introduction to Rarr

The Zarr specification defines a format for chunked, compressed, N-dimensional arrays. Its design allows efficient access to subsets of the stored array, and supports both local and cloud storage systems. Zarr is seeing increasing adoption in a number of scientific fields where multi-dimensional data are prevalent.

Rarr is intended to be a simple interface for reading and writing individual Zarr arrays. It is developed in R and C with no reliance on external libraries or APIs for interfacing with the Zarr arrays. Additional compression libraries (e.g. blosc) are bundled with Rarr to provide support for datasets compressed using these tools.

Limitations with Rarr

If you know about Zarr arrays already, you'll probably be aware they can be stored in hierarchical groups, where additional metadata can explain the relationship between the arrays. Currently, Rarr is not designed to be aware of these hierarchical Zarr array collections. However, the component arrays can be read individually by providing the path to them directly.
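As a minimal sketch of reading one component array by its own path, the example below builds a plain directory holding two arrays and then reads only one of them. The directory layout and the array names `counts` and `weights` are hypothetical, and the directory merely stands in for a group (it contains no group metadata). The code only runs if Rarr is installed.

```r
if (requireNamespace("Rarr", quietly = TRUE)) {
  # Hypothetical layout: a directory standing in for a Zarr group with two arrays.
  group_dir <- tempfile(fileext = ".zarr")
  dir.create(group_dir)

  Rarr::write_zarr_array(
    x = array(1:20, dim = c(4, 5)),
    zarr_array_path = file.path(group_dir, "counts"),
    chunk_dim = c(2, 5)
  )
  Rarr::write_zarr_array(
    x = array(runif(20), dim = c(4, 5)),
    zarr_array_path = file.path(group_dir, "weights"),
    chunk_dim = c(2, 5)
  )

  # Point Rarr at one component array, not at the enclosing directory.
  counts <- Rarr::read_zarr_array(file.path(group_dir, "counts"))
}
```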

Currently, there are also limitations on the Zarr datatypes that can be accessed using Rarr. For now, most numeric types can be read into R, although in some instances (e.g. 64-bit integers) there is potential for loss of information. Writing is more limited: it supports only datatypes that are natively available in R, and only the column-first representation.
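The source of this information loss can be seen in base R alone. The sketch below does not use Rarr and makes no claim about how Rarr performs the conversion internally; it only shows the limits of R's native numeric types.

```r
# R's native integer type is 32-bit, so the largest representable value is:
int_max <- .Machine$integer.max   # 2147483647, i.e. 2^31 - 1

# A value such as 2^31 does not fit; coercing it yields NA (with a warning).
too_big <- suppressWarnings(as.integer(2^31))
is.na(too_big)   # TRUE

# Doubles hold every integer exactly only up to 2^53, so storing a large
# 64-bit integer in a double can silently change its value.
2^53 == 2^53 + 1   # TRUE: the two values are indistinguishable as doubles
```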

Quick start guide

Current Status

Reading and Writing

Reading Zarr arrays is reasonably well supported. Writing is available, but is more limited. Both aspects are under active development.
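A minimal round trip might look like the sketch below. It assumes Rarr is installed and uses a hypothetical temporary path; the array dimensions and chunk shape are arbitrary choices for illustration.

```r
# Sketch only: runs the round trip if Rarr is installed, otherwise does nothing.
if (requireNamespace("Rarr", quietly = TRUE)) {
  # A 10 x 10 x 6 integer array (int32 is one of the types supported for writing).
  x <- array(1:600, dim = c(10, 10, 6))

  # Hypothetical output location in the session's temporary directory.
  zarr_path <- tempfile(fileext = ".zarr")

  # Write the array, splitting it into chunks of 10 x 5 x 1 elements.
  Rarr::write_zarr_array(x = x, zarr_array_path = zarr_path, chunk_dim = c(10, 5, 1))

  # Print a summary of the array metadata.
  Rarr::zarr_overview(zarr_path)

  # Read a subset: rows 1-4, columns 1-2, first slice of the third dimension.
  first_block <- Rarr::read_zarr_array(zarr_path, index = list(1:4, 1:2, 1))

  # Read the whole array back by omitting the index.
  y <- Rarr::read_zarr_array(zarr_path)
}
```

Passing an `index` list with one entry per dimension means only the chunks overlapping the request need to be touched, which is the main benefit of the chunked layout.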

Data Types

Currently there is only support for reading and writing a subset of the possible datatypes that can be found in a Zarr array. In some instances there are also limitations on the datatypes natively supported by R, requiring conversion from the Zarr datatype. The table below summarises the current status of datatype support. It will be updated as progress is made.

| Zarr Data Type | Status (reading / writing) | Notes |
|---|---|---|
| boolean | ✔ / ❌ | |
| int8 | ✔ / ❌ | |
| uint8 | ✔ / ❌ | |
| int16 | ✔ / ❌ | |
| uint16 | ✔ / ❌ | |
| int32 | ✔ / ✔ | |
| uint32 | ✔ / ❌ | Values outside the range of int32 are converted to NA. Future plan is to allow conversion to double or use the bit64 package. |
| int64 | ✔ / ❌ | Values outside the range of int32 are converted to NA. Future plan is to allow conversion to double or use the bit64 package. |
| uint64 | ✔ / ❌ | Values outside the range of int32 are converted to NA. Future plan is to allow conversion to double or use the bit64 package. |
| half / float16 | ✔ / ❌ | Converted to double in R. No effort is made to assess loss of precision due to conversion. |
| single / float32 | ✔ / ❌ | Converted to double in R. No effort is made to assess loss of precision due to conversion. |
| double / float64 | ✔ / ✔ | |
| complex | ❌ / ❌ | |
| timedelta | ❌ / ❌ | |
| datetime | ❌ / ❌ | |
| string | ✔ / ✔ | |
| Unicode | ✔ / ✔ | |
| void * | ❌ / ❌ | |
| Structured data types | ❌ / ❌ | |
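To illustrate what conversion from a smaller floating-point type to double can look like, the base R sketch below stores a value in 4 bytes and reads it back as a double. It does not use Rarr and is not a description of Rarr's internal code. The widening step itself is exact, but the result need not equal the double you might expect from the decimal value.

```r
# Store 0.1 as a 4-byte (float32) value, then read it back as an R double.
bytes <- writeBin(0.1, raw(), size = 4)
widened <- readBin(bytes, what = "numeric", size = 4)

length(bytes)                  # 4
widened == 0.1                 # FALSE
format(widened, digits = 15)   # "0.100000001490116"
```

Comparisons against values read from a float32 array are therefore safer with a tolerance, for example via `all.equal()`, than with `==`.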

Compression Tools

| Compression Tool | Status (reading / writing) | Notes |
|---|---|---|
| zlib / gzip | ✔ / ✔ | Only system default compression level (normally 6) is enabled for writing. |
| bzip2 | ✔ / ✔ | Only compression level 9 is enabled for writing. |
| blosc | ✔ / ✔ | Only lz4 compression level 5 is enabled for writing. |
| LZMA | ✔ / ✔ | |
| LZ4 | ✔ / ✔ | |
| Zstd | ✔ / ✔ | |

Please open an issue if support for a required compression tool is missing.
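As a sketch of choosing a compression tool when writing, the example below writes the same array once with zlib and once with blosc, then reads both back. It assumes Rarr is installed and that the `compressor` argument accepts the result of the `use_zlib()` and `use_blosc()` helpers called with their defaults; the output paths are hypothetical temporary locations.

```r
if (requireNamespace("Rarr", quietly = TRUE)) {
  x <- array(1:600, dim = c(10, 10, 6))

  # Same data written twice, once with zlib and once with blosc.
  path_zlib <- tempfile(fileext = ".zarr")
  path_blosc <- tempfile(fileext = ".zarr")

  Rarr::write_zarr_array(
    x, zarr_array_path = path_zlib, chunk_dim = c(10, 10, 1),
    compressor = Rarr::use_zlib()
  )
  Rarr::write_zarr_array(
    x, zarr_array_path = path_blosc, chunk_dim = c(10, 10, 1),
    compressor = Rarr::use_blosc()
  )

  # The compressor is recorded in the array metadata, so reading
  # needs no compressor argument.
  same <- all(Rarr::read_zarr_array(path_zlib) == Rarr::read_zarr_array(path_blosc))
}
```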

Filters

There is currently no support for additional filters. Please open an issue if you require filter support.

Required system libraries

To provide support for the blosc and Zstd compression tools, Rarr links against libraries providing these tools. If you have them installed on your system, Rarr will attempt to use those versions. If they are not detected, Rarr will compile and use versions that are distributed with the package. Either way the functionality will be available. However, if you are using the system libraries and later remove them, Rarr may fail to work correctly.

This only concerns users installing the package from source. If you are using the pre-built binaries for Windows or macOS distributed by Bioconductor, this should not be an issue for you.