{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AlignedBinaryFormat\n",
"\n",
"\n",
"\n",
"\n",
"This package provides a simple (yet powerful) interface to handle memory mapped data.\n",
"The "data" must be in the form of Array
s and BitArray
s) (although I may later support a Table
interface).\n",
"The eltype
of the Array
must be a Julia primitive type.\n",
"When accessing the data we avoid the use of reinterpret
by aligning the arrays on disk.\n",
"\n",
"For convenience String
s DataType
s, and arbitrary serialization
can be saved to labels but these are not memory mapped.\n",
"If you need a memory mapped string considering using a Vector{Char}
which can be memory mapped.\n",
"\n",
"## Usage Example"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using AlignedBinaryFormat\n",
"temp = tempname();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To write out data we do the following."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"abf = abfopen(temp, "w") # "w" is used to write only, memory mapping requires w+\n",
"write(abf, "my x array", rand(Float16,4))\n",
"abf["whY array"] = rand(Char,2,2,2) # alias of write(abf,"ζ!/b",rand(2,3))\n",
"close(abf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could also have used do block syntax"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"abfopen(temp, "r+") do abf\n",
" abf["ζ!/b"] = rand(2,3)\n",
" write(abf, "bitmat", rand(3,2) .< 0.5)\n",
" write(abf, "log", """\n",
" this is what I did\n",
" and how!\n",
" """)\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To perform serialization we need to wrap our type with Serialized
.\n",
"If we are only saving a DataType
we do not need to wrap it in Serialized
."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AlignedBinaryFormat.AbfFile([read] <file /tmp/jl_cMROHo>)\n",
"┌────────────┬──────────────────┬───────────┬────────┬────────────┐\n",
"│ label │ type │ shape │ bytes │ status │\n",
"├────────────┼──────────────────┼───────────┼────────┼────────────┤\n",
"│ bitmat │ BitArray{2} │ (3, 2) │ <8B> │ unloaded │\n",
"│ foo │ Foo │ (-1,) │ <121B> │ unloaded │\n",
"│ log │ String │ (28,) │ <112B> │ unloaded │\n",
"│ my x array │ Array{Float16,1} │ (4,) │ <8B> │ unloaded │\n",
"│ type │ DataType │ (-1,) │ <23B> │ unloaded │\n",
"│ whY array │ Array{Char,3} │ (2, 2, 2) │ <32B> │ unloaded │\n",
"│ ζ!/b │ Array{Float64,2} │ (2, 3) │ <48B> │ unloaded │\n",
"└────────────┴──────────────────┴───────────┴────────┴────────────┘\n"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"struct Foo\n",
" x::Vector{Float64}\n",
" y::Int\n",
"end\n",
"\n",
"struct Bar\n",
" x::String\n",
"end\n",
"\n",
"abfopen(temp, "r+") do abf\n",
" write(abf, "type", Bar)\n",
" write(abf, "foo", AbfSerializer(Foo(rand(3), -1)))\n",
"end\n",
"\n",
"abf = abfopen(temp, "r")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two methods for accessing the data are read
and getindex
, which is simply an alias of read
and allows for the dictionary-like syntax.\n",
"When data is first accessed a reference to the item is cached in a dictionary contained within the AbfFile
instance.\n",
"This permits the same reference to be returned without remapping the same data multiple times due to repeated read
calls.\n",
"Strings are not memory mapped and are copied directly into the cache dictionary."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"count(abf["bitmat"]) = 3\n",
"this is what I did\n",
"and how!\n",
"\n",
"AlignedBinaryFormat.AbfFile([read] <file /tmp/jl_cMROHo>)\n",
"┌────────────┬──────────────────┬───────────┬────────┬────────────┐\n",
"│ label │ type │ shape │ bytes │ status │\n",
"├────────────┼──────────────────┼───────────┼────────┼────────────┤\n",
"│ bitmat │ BitArray{2} │ (3, 2) │ <8B> │ loaded │\n",
"│ foo │ Foo │ (-1,) │ <121B> │ unloaded │\n",
"│ log │ String │ (28,) │ <112B> │ loaded │\n",
"│ my x array │ Array{Float16,1} │ (4,) │ <8B> │ loaded │\n",
"│ type │ DataType │ (-1,) │ <23B> │ unloaded │\n",
"│ whY array │ Array{Char,3} │ (2, 2, 2) │ <32B> │ unloaded │\n",
"│ ζ!/b │ Array{Float64,2} │ (2, 3) │ <48B> │ unloaded │\n",
"└────────────┴──────────────────┴───────────┴────────┴────────────┘\n"
]
}
],
"source": [
"@show count(abf["bitmat"])\n",
"read(abf, "my x array") # just calling read(abf, label) loads the array\n",
"println(read(abf, "log"))\n",
"show(abf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the example above, bitmat
and my x array
are memory mapped and log
is cached.\n",
"Doing abf[\"bitmat\"]
or read(abf, \"bitmat\")
will return the same reference to bitmat
."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bitmat1 === bitmat2 = true\n"
]
}
],
"source": [
"bitmat1 = abf["bitmat"]\n",
"bitmat2 = read(abf, "bitmat")\n",
"@show bitmat1 === bitmat2;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Close does not unlink references you may have made\n",
"Memory mapping persists after the file has been closed and is only unlinked once the reference is garbage collected.\n",
"It is recommended to use the do-block syntax to avoid unintended access."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x = 10-element Array{Float64,1}:\n",
" 0.6737894194841203 \n",
" 0.28704212997665324\n",
" 0.8980864989824102 \n",
" 0.5948813070093115 \n",
" 0.03298590842249727\n",
" 0.9587674078053836 \n",
" 0.4213390775874484 \n",
" 0.8851345495595084 \n",
" 0.35869029635893646\n",
" 0.4629117484636671 \n",
"\n",
"AlignedBinaryFormat.AbfFile([read/write] <file /tmp/jl_cMROHo>)\n",
"┌───────┬──────────────────┬───────┬───────┬────────┐\n",
"│ label │ type │ shape │ bytes │ status │\n",
"├───────┼──────────────────┼───────┼───────┼────────┤\n",
"│ x │ Array{Float64,1} │ (10,) │ <80B> │ loaded │\n",
"└───────┴──────────────────┴───────┴───────┴────────┘\n",
"\n",
"\n",
"AlignedBinaryFormat.AbfFile([closed] <file /tmp/jl_cMROHo>)\n",
"┌───────┬──────┬───────┬───────┬────────┐\n",
"│ label │ type │ shape │ bytes │ status │\n",
"├───────┼──────┼───────┼───────┼────────┤\n",
"└───────┴──────┴───────┴───────┴────────┘\n",
"\n",
"x = 10-element Array{Float64,1}:\n",
" 0.6737894194841203 \n",
" 0.28704212997665324\n",
" 0.8980864989824102 \n",
" 0.5948813070093115 \n",
" 0.03298590842249727\n",
" 0.9587674078053836 \n",
" 0.4213390775874484 \n",
" 0.8851345495595084 \n",
" 0.35869029635893646\n",
" 0.4629117484636671 \n",
"\n"
]
}
],
"source": [
"abf = abfopen(temp, "w+")\n",
"write(abf, "x", rand(10))\n",
"x = read(abf, "x")\n",
"print("x = ")\n",
"show(stdout, MIME("text/plain"), x)\n",
"println("\n")\n",
"show(abf)\n",
"println("\n")\n",
"close(abf)\n",
"show(abf)\n",
"print("\nx = ")\n",
"show(stdout, MIME("text/plain"), x)\n",
"println("\n")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# File permissions and modifying data on-disk\n",
"\n",
"| mode | read data | modify data | add data | description |\n",
"|------|:---------:|:-----------:|:--------:|--------------------------------------|\n",
"| r
| yes | no | no | read only mode |\n",
"| r+
| yes | yes | yes | read/write existing file |\n",
"| w
| no | no | yes | overwrites existing file, write only |\n",
"| w+
| yes | yes | yes | overwrites existing file, read/write |\n",
"| a
| no | no | yes | modify existing file, create if it doesn't exist, write only |\n",
"| a+
| yes | yes | yes | modify existing file, create if it doesn't exist, read/write |\n",
"\n",
"Read permission is required to memory map so w
can only write data to the file but can not read it back in to memory-map.\n",
"Memory-mapped arrays opened using r
can only be read.\n",
"If the file is opened with r+/w+
arrays can be modified in place."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"file opened with "w"\n",
"AlignedBinaryFormat.AbfReadError(IOStream(<file /tmp/jl_cMROHo>))\n",
"\n",
"file opened with "w+"\n",
"x[1] = 0.3923187061956266\n",
"x[1] = -1.0\n",
"\n",
"file opened with "r+"\n",
"(abf["x"])[1] = 3.0\n",
"abf["y"] = [0.3650633407781294, 0.06535729174631832]\n",
"\n",
"file opened with "r"\n",
"x[1] = 3.0\n",
"abf["y"] = [0.3650633407781294, 0.06535729174631832]\n",
"ReadOnlyMemoryError()\n",
"\n",
"file opened with "a"\n",
"AlignedBinaryFormat.AbfReadError(IOStream(<file /tmp/jl_cMROHo>))\n",
"\n",
"file opened with "a+"\n",
"(read(abf, "x"))[1] = 3.0\n",
"(read(abf, "z"))[1] = 0.9387386553608443\n"
]
}
],
"source": [
"println("file opened with \"w\"")\n",
"abfopen(temp, "w") do abf\n",
" write(abf, "x", rand(3,2))\n",
" try\n",
" x = read(abf, "x")\n",
" catch e\n",
" show(e)\n",
" println()\n",
" end\n",
"end\n",
"println("\nfile opened with \"w+\"")\n",
"abfopen(temp, "w+") do abf\n",
" write(abf, "x", rand(3,2))\n",
" x = read(abf, "x")\n",
" @show x[1]\n",
" x[1] = -1\n",
" @show x[1]\n",
"end\n",
"println("\nfile opened with \"r+\"")\n",
"abfopen(temp, "r+") do abf\n",
" abf["x"][1] = 3\n",
" @show abf["x"][1]\n",
" write(abf, "y", rand(2))\n",
" @show abf["y"]\n",
"end\n",
"println("\nfile opened with \"r\"")\n",
"abfopen(temp, "r") do abf\n",
" x = read(abf, "x")\n",
" @show x[1]\n",
" @show abf["y"]\n",
" try\n",
" x[1] = 3\n",
" catch e\n",
" show(e)\n",
" println()\n",
" end\n",
"end;\n",
"println("\nfile opened with \"a\"")\n",
"abfopen(temp, "a") do abf\n",
" try\n",
" x = read(abf, "x")\n",
" catch e\n",
" show(e)\n",
" println()\n",
" end\n",
" write(abf, "z", rand(3))\n",
"end;\n",
"println("\nfile opened with \"a+\"")\n",
"abfopen(temp, "a+") do abf\n",
" @show read(abf, "x")[1]\n",
" @show read(abf, "z")[1]\n",
"end;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why not use JLD/HDF5
?\n",
" 1. They do not support memory mapping of any Julia isbits
primitive type (see here for supported data types).\n",
" 2. When memory mapping they often return an array as a ReinterpretArray
forcing the use of AbstractArray
in Type
and Function
definitions."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"typeof(x) = Array{Float16,2}\n",
"typeof(y) = BitArray{2}\n",
"typeof(z) = Array{Float64,1}\n"
]
}
],
"source": [
"x = rand(Float16,3,2);\n",
"y = x .< 0.5;\n",
"z = rand(29);\n",
"@show typeof(x);\n",
"@show typeof(y);\n",
"@show typeof(z);"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ismmappable(j["x"]) = false\n",
"typeof(read(j, "x")) = Array{Float16,2}\n",
"typeof(read(j, "y")) = BitArray{2}\n",
"typeof(read(j, "z")) = Base.ReinterpretArray{Float64,1,UInt8,Array{UInt8,1}}\n"
]
}
],
"source": [
"using JLD, HDF5\n",
"jldopen(temp, "w"; mmaparrays=true) do j\n",
" write(j,"x",x)\n",
" write(j,"y",y)\n",
" write(j,"z",z)\n",
"end\n",
"jldopen(temp, "r"; mmaparrays=true) do j\n",
" @show ismmappable(j["x"])\n",
" @show typeof(read(j,"x"))\n",
" @show typeof(read(j,"y"))\n",
" @show typeof(read(j,"z"))\n",
"end\n",
"rm(temp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see x::Matrix{Float16}
isn't able to be memory mapped and z::Vector{Float64}
gets read back as ReinterpretArray
\n",
"\n",
"## Why not use JLD2
\n",
"JLD2
doesn't actually support memory mapping.\n",
"See my comment here.\n",
"\n",
"---\n",
"\n",
"# File Layout\n",
"As an example lets examine what the structure of the following file would look like."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AlignedBinaryFormat.AbfFile([read/write] <file /tmp/jl_cMROHo>)\n",
"┌───────────┬──────────────────┬─────────┬────────┬────────────┐\n",
"│ label │ type │ shape │ bytes │ status │\n",
"├───────────┼──────────────────┼─────────┼────────┼────────────┤\n",
"│ log │ String │ (36,) │ <144B> │ unloaded │\n",
"│ myX │ Array{Float16,2} │ (10, 3) │ <60B> │ unloaded │\n",
"│ somez │ Array{Float64,1} │ (29,) │ <232B> │ unloaded │\n",
"│ whybitarr │ BitArray{2} │ (3, 2) │ <8B> │ unloaded │\n",
"└───────────┴──────────────────┴─────────┴────────┴────────────┘\n"
]
}
],
"source": [
"abfopen(temp, "w+") do abf\n",
" abf["myX"] = rand(Float16,10,3)\n",
" abf["whybitarr"] = x .< 0.5\n",
" abf["log"] = "some log information about this file"\n",
" abf["somez"] = rand(29)\n",
" show(abf)\n",
"end;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file will have the following layout for each of the stored Arrays
\n",
"* a UInt8
indicating endian-ness of the numeric data contained within\n",
" * UInt8(0)
indicates that the following data is little-endian\n",
" * UInt8(255)
indicates that the following data is big-endian\n",
" * this depends on the host machine generating the file\n",
" * currently conversion between little- and big-endian is not supported\n",
"* an Int
indicating the length of the key length(keyname)
\n",
"* the String
which is the key
of the array\n",
"* an Int
indicating the length of string(T<:Union{Array,BitArray})
\n",
"* the String
representation of the "container", i.e., \"Array\"
, \"BitArray\"
or \"String\"
\n",
" * there is no arbitrary code evaluation, types are determined via an ImmutableDict{String,DataType}
\n",
"* for \"Array\"
s an Int
and string(T)
\n",
"* an Int
indication the number of dimensions N
\n",
"* N
Int
s that give the shape of the array\n",
"* the buffer is then aligned to T
\n",
"* the data\n",
"\n",
"Note: the labels are displayed sorted by label but are written to the file sequentially.\n",
"\n",
"\n", "<BOF> # beginning of file\n", "0x00 # UInt8(0) - Little Endian\n", "3 # length of \"myX\"\n", "myX # label of first item\n", "5 # length of \"Array\"\n", "Array\n", "7 # length of \"Float16\"\n", "Float16\n", "2 # number of dimensions\n", "10 # length of dimension 1\n", "3 # length of dimension 2\n", "... (alignment spacing)\n", "<the data>\n", "0x00 # UInt8(0) - Little Endian\n", "9 # length of \"whybitarr\"\n", "whybitarr # label of second item\n", "8 # length of \"BitArray\"\n", "BitArray\n", "2 # number of dimensions\n", "10 # length of dimension 1\n", "3 # length of dimension 2\n", "... (alignment spacing)\n", "<the data>\n", "0x00 # UInt8(0) - Little Endian\n", "3 # length of \"log\"\n", "log # label of third item\n", "6 # length of \"String\"\n", "String\n", "36 # length of \"some log information ...\"\n", "some log information ...\n", "0x00 # UInt8(0) - Little Endian\n", "5 # length of \"somez\"\n", "somez # label of fourth item\n", "5 # length of \"Array\"\n", "Array\n", "7 # length of \"Float64\"\n", "Float64\n", "1 # number of dimensions\n", "29 # length of dimension 1\n", "... (alignment spacing)\n", "<the data>\n", "<EOF> # end of file\n", "
\n",
"\n",
"# Acknowledgements\n",
"Early inspiration was drawn from this gist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.2.0",
"language": "julia",
"name": "julia-1.2"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.2.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}