Open Source Ecosystems

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AlignedBinaryFormat\n", "\n", "\n", "\n", "\n", "This package provides a simple (yet powerful) interface to handle memory mapped data.\n", "The "data" must be in the form of Arrays and BitArrays) (although I may later support a Table interface).\n", "The eltype of the Array must be a Julia primitive type.\n", "When accessing the data we avoid the use of reinterpret by aligning the arrays on disk.\n", "\n", "For convenience Strings DataTypes, and arbitrary serialization can be saved to labels but these are not memory mapped.\n", "If you need a memory mapped string considering using a Vector{Char} which can be memory mapped.\n", "\n", "## Usage Example" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "using AlignedBinaryFormat\n", "temp = tempname();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To write out data we do the following." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "abf = abfopen(temp, "w") # "w" is used to write only, memory mapping requires w+\n", "write(abf, "my x array", rand(Float16,4))\n", "abf["whY array"] = rand(Char,2,2,2) # alias of write(abf,"ζ!/b",rand(2,3))\n", "close(abf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also have used do block syntax" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "abfopen(temp, "r+") do abf\n", " abf["ζ!/b"] = rand(2,3)\n", " write(abf, "bitmat", rand(3,2) .< 0.5)\n", " write(abf, "log", """\n", " this is what I did\n", " and how!\n", " """)\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform serialization we need to wrap our type with Serialized.\n", "If we are only saving a DataType we do not need to wrap it in Serialized." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AlignedBinaryFormat.AbfFile([read] <file /tmp/jl_cMROHo>)\n", "┌────────────┬──────────────────┬───────────┬────────┬────────────┐\n", "│ label │ type │ shape │ bytes │ status │\n", "├────────────┼──────────────────┼───────────┼────────┼────────────┤\n", "│ bitmat │ BitArray{2} │ (3, 2) │ <8B> │ unloaded │\n", "│ foo │ Foo │ (-1,) │ <121B> │ unloaded │\n", "│ log │ String │ (28,) │ <112B> │ unloaded │\n", "│ my x array │ Array{Float16,1} │ (4,) │ <8B> │ unloaded │\n", "│ type │ DataType │ (-1,) │ <23B> │ unloaded │\n", "│ whY array │ Array{Char,3} │ (2, 2, 2) │ <32B> │ unloaded │\n", "│ ζ!/b │ Array{Float64,2} │ (2, 3) │ <48B> │ unloaded │\n", "└────────────┴──────────────────┴───────────┴────────┴────────────┘\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "struct Foo\n", " x::Vector{Float64}\n", " y::Int\n", "end\n", "\n", "struct Bar\n", " x::String\n", "end\n", "\n", "abfopen(temp, "r+") do abf\n", " write(abf, "type", Bar)\n", " write(abf, "foo", AbfSerializer(Foo(rand(3), -1)))\n", "end\n", "\n", "abf = abfopen(temp, "r")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two methods for accessing the data are read and getindex, which is simply an alias of read and allows for the dictionary-like syntax.\n", "When data is first accessed a reference to the item is cached in a dictionary contained within the AbfFile instance.\n", "This permits the same reference to be returned without remapping the same data multiple times due to repeated read calls.\n", "Strings are not memory mapped and are copied directly into the cache dictionary." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count(abf["bitmat"]) = 3\n", "this is what I did\n", "and how!\n", "\n", "AlignedBinaryFormat.AbfFile([read] <file /tmp/jl_cMROHo>)\n", "┌────────────┬──────────────────┬───────────┬────────┬────────────┐\n", "│ label │ type │ shape │ bytes │ status │\n", "├────────────┼──────────────────┼───────────┼────────┼────────────┤\n", "│ bitmat │ BitArray{2} │ (3, 2) │ <8B> │ loaded │\n", "│ foo │ Foo │ (-1,) │ <121B> │ unloaded │\n", "│ log │ String │ (28,) │ <112B> │ loaded │\n", "│ my x array │ Array{Float16,1} │ (4,) │ <8B> │ loaded │\n", "│ type │ DataType │ (-1,) │ <23B> │ unloaded │\n", "│ whY array │ Array{Char,3} │ (2, 2, 2) │ <32B> │ unloaded │\n", "│ ζ!/b │ Array{Float64,2} │ (2, 3) │ <48B> │ unloaded │\n", "└────────────┴──────────────────┴───────────┴────────┴────────────┘\n" ] } ], "source": [ "@show count(abf["bitmat"])\n", "read(abf, "my x array") # just calling read(abf, label) loads the array\n", "println(read(abf, "log"))\n", "show(abf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, bitmat and my x array are memory mapped and log is cached.\n", "Doing abf[\"bitmat\"] or read(abf, \"bitmat\") will return the same reference to bitmat." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bitmat1 === bitmat2 = true\n" ] } ], "source": [ "bitmat1 = abf["bitmat"]\n", "bitmat2 = read(abf, "bitmat")\n", "@show bitmat1 === bitmat2;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Close does not unlink references you may have made\n", "Memory mapping persists after the file has been closed and is only unlinked once the reference is garbage collected.\n", "It is recommended to use the do-block syntax to avoid unintended access." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x = 10-element Array{Float64,1}:\n", " 0.6737894194841203 \n", " 0.28704212997665324\n", " 0.8980864989824102 \n", " 0.5948813070093115 \n", " 0.03298590842249727\n", " 0.9587674078053836 \n", " 0.4213390775874484 \n", " 0.8851345495595084 \n", " 0.35869029635893646\n", " 0.4629117484636671 \n", "\n", "AlignedBinaryFormat.AbfFile([read/write] <file /tmp/jl_cMROHo>)\n", "┌───────┬──────────────────┬───────┬───────┬────────┐\n", "│ label │ type │ shape │ bytes │ status │\n", "├───────┼──────────────────┼───────┼───────┼────────┤\n", "│ x │ Array{Float64,1} │ (10,) │ <80B> │ loaded │\n", "└───────┴──────────────────┴───────┴───────┴────────┘\n", "\n", "\n", "AlignedBinaryFormat.AbfFile([closed] <file /tmp/jl_cMROHo>)\n", "┌───────┬──────┬───────┬───────┬────────┐\n", "│ label │ type │ shape │ bytes │ status │\n", "├───────┼──────┼───────┼───────┼────────┤\n", "└───────┴──────┴───────┴───────┴────────┘\n", "\n", "x = 10-element Array{Float64,1}:\n", " 0.6737894194841203 \n", " 0.28704212997665324\n", " 0.8980864989824102 \n", " 0.5948813070093115 \n", " 0.03298590842249727\n", " 0.9587674078053836 \n", " 0.4213390775874484 \n", " 0.8851345495595084 \n", " 0.35869029635893646\n", " 0.4629117484636671 \n", "\n" ] } ], "source": [ "abf = abfopen(temp, "w+")\n", "write(abf, "x", rand(10))\n", "x = read(abf, "x")\n", "print("x = ")\n", "show(stdout, MIME("text/plain"), x)\n", "println("\n")\n", "show(abf)\n", "println("\n")\n", "close(abf)\n", "show(abf)\n", "print("\nx = ")\n", "show(stdout, MIME("text/plain"), x)\n", "println("\n")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# File permissions and modifying data on-disk\n", "\n", "| mode | read data | modify data | add data | description |\n", "|------|:---------:|:-----------:|:--------:|--------------------------------------|\n", "| r | yes | no | no | read only mode |\n", "| r+ | yes | yes | yes | read/write existing file |\n", "| w | no | no | yes | overwrites existing file, write only |\n", "| w+ | yes | yes | yes | overwrites existing file, read/write |\n", "| a | no | no | yes | modify existing file, create if it doesn't exist, write only |\n", "| a+ | yes | yes | yes | modify existing file, create if it doesn't exist, read/write |\n", "\n", "Read permission is required to memory map so w can only write data to the file but can not read it back in to memory-map.\n", "Memory-mapped arrays opened using r can only be read.\n", "If the file is opened with r+/w+ arrays can be modified in place." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "file opened with "w"\n", "AlignedBinaryFormat.AbfReadError(IOStream(<file /tmp/jl_cMROHo>))\n", "\n", "file opened with "w+"\n", "x[1] = 0.3923187061956266\n", "x[1] = -1.0\n", "\n", "file opened with "r+"\n", "(abf["x"])[1] = 3.0\n", "abf["y"] = [0.3650633407781294, 0.06535729174631832]\n", "\n", "file opened with "r"\n", "x[1] = 3.0\n", "abf["y"] = [0.3650633407781294, 0.06535729174631832]\n", "ReadOnlyMemoryError()\n", "\n", "file opened with "a"\n", "AlignedBinaryFormat.AbfReadError(IOStream(<file /tmp/jl_cMROHo>))\n", "\n", "file opened with "a+"\n", "(read(abf, "x"))[1] = 3.0\n", "(read(abf, "z"))[1] = 0.9387386553608443\n" ] } ], "source": [ "println("file opened with \"w\"")\n", "abfopen(temp, "w") do abf\n", " write(abf, "x", rand(3,2))\n", " try\n", " x = read(abf, "x")\n", " catch e\n", " show(e)\n", " println()\n", " end\n", "end\n", "println("\nfile opened with \"w+\"")\n", "abfopen(temp, "w+") do abf\n", " write(abf, "x", rand(3,2))\n", " x = read(abf, "x")\n", " @show x[1]\n", " x[1] = -1\n", " @show x[1]\n", "end\n", "println("\nfile opened with \"r+\"")\n", "abfopen(temp, "r+") do abf\n", " abf["x"][1] = 3\n", " @show abf["x"][1]\n", " write(abf, "y", rand(2))\n", " @show abf["y"]\n", "end\n", "println("\nfile opened with \"r\"")\n", "abfopen(temp, "r") do abf\n", " x = read(abf, "x")\n", " @show x[1]\n", " @show abf["y"]\n", " try\n", " x[1] = 3\n", " catch e\n", " show(e)\n", " println()\n", " end\n", "end;\n", "println("\nfile opened with \"a\"")\n", "abfopen(temp, "a") do abf\n", " try\n", " x = read(abf, "x")\n", " catch e\n", " show(e)\n", " println()\n", " end\n", " write(abf, "z", rand(3))\n", "end;\n", "println("\nfile opened with \"a+\"")\n", "abfopen(temp, "a+") do abf\n", " @show read(abf, "x")[1]\n", " @show read(abf, "z")[1]\n", "end;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why not use JLD/HDF5?\n", " 1. They do not support memory mapping of any Julia isbits primitive type (see here for supported data types).\n", " 2. When memory mapping they often return an array as a ReinterpretArray forcing the use of AbstractArray in Type and Function definitions." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "typeof(x) = Array{Float16,2}\n", "typeof(y) = BitArray{2}\n", "typeof(z) = Array{Float64,1}\n" ] } ], "source": [ "x = rand(Float16,3,2);\n", "y = x .< 0.5;\n", "z = rand(29);\n", "@show typeof(x);\n", "@show typeof(y);\n", "@show typeof(z);" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ismmappable(j["x"]) = false\n", "typeof(read(j, "x")) = Array{Float16,2}\n", "typeof(read(j, "y")) = BitArray{2}\n", "typeof(read(j, "z")) = Base.ReinterpretArray{Float64,1,UInt8,Array{UInt8,1}}\n" ] } ], "source": [ "using JLD, HDF5\n", "jldopen(temp, "w"; mmaparrays=true) do j\n", " write(j,"x",x)\n", " write(j,"y",y)\n", " write(j,"z",z)\n", "end\n", "jldopen(temp, "r"; mmaparrays=true) do j\n", " @show ismmappable(j["x"])\n", " @show typeof(read(j,"x"))\n", " @show typeof(read(j,"y"))\n", " @show typeof(read(j,"z"))\n", "end\n", "rm(temp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see x::Matrix{Float16} isn't able to be memory mapped and z::Vector{Float64} gets read back as ReinterpretArray\n", "\n", "## Why not use JLD2\n", "JLD2 doesn't actually support memory mapping.\n", "See my comment here.\n", "\n", "---\n", "\n", "# File Layout\n", "As an example lets examine what the structure of the following file would look like." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AlignedBinaryFormat.AbfFile([read/write] <file /tmp/jl_cMROHo>)\n", "┌───────────┬──────────────────┬─────────┬────────┬────────────┐\n", "│ label │ type │ shape │ bytes │ status │\n", "├───────────┼──────────────────┼─────────┼────────┼────────────┤\n", "│ log │ String │ (36,) │ <144B> │ unloaded │\n", "│ myX │ Array{Float16,2} │ (10, 3) │ <60B> │ unloaded │\n", "│ somez │ Array{Float64,1} │ (29,) │ <232B> │ unloaded │\n", "│ whybitarr │ BitArray{2} │ (3, 2) │ <8B> │ unloaded │\n", "└───────────┴──────────────────┴─────────┴────────┴────────────┘\n" ] } ], "source": [ "abfopen(temp, "w+") do abf\n", " abf["myX"] = rand(Float16,10,3)\n", " abf["whybitarr"] = x .< 0.5\n", " abf["log"] = "some log information about this file"\n", " abf["somez"] = rand(29)\n", " show(abf)\n", "end;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file will have the following layout for each of the stored Arrays\n", "* a UInt8 indicating endian-ness of the numeric data contained within\n", " * UInt8(0) indicates that the following data is little-endian\n", " * UInt8(255) indicates that the following data is big-endian\n", " * this depends on the host machine generating the file\n", " * currently conversion between little- and big-endian is not supported\n", "* an Int indicating the length of the key length(keyname)\n", "* the String which is the key of the array\n", "* an Int indicating the length of string(T<:Union{Array,BitArray})\n", "* the String representation of the "container", i.e., \"Array\", \"BitArray\" or \"String\"\n", " * there is no arbitrary code evaluation, types are determined via an ImmutableDict{String,DataType}\n", "* for \"Array\"s an Int and string(T)\n", "* an Int indication the number of dimensions N\n", "* N Ints that give the shape of the array\n", "* the buffer is then aligned to T\n", "* the data\n", "\n", "Note: the labels are displayed sorted by label but are written to the file sequentially.\n", "\n", "\n", "<BOF> # beginning of file\n", "0x00 # UInt8(0) - Little Endian\n", "3 # length of \"myX\"\n", "myX # label of first item\n", "5 # length of \"Array\"\n", "Array\n", "7 # length of \"Float16\"\n", "Float16\n", "2 # number of dimensions\n", "10 # length of dimension 1\n", "3 # length of dimension 2\n", "... (alignment spacing)\n", "<the data>\n", "0x00 # UInt8(0) - Little Endian\n", "9 # length of \"whybitarr\"\n", "whybitarr # label of second item\n", "8 # length of \"BitArray\"\n", "BitArray\n", "2 # number of dimensions\n", "10 # length of dimension 1\n", "3 # length of dimension 2\n", "... (alignment spacing)\n", "<the data>\n", "0x00 # UInt8(0) - Little Endian\n", "3 # length of \"log\"\n", "log # label of third item\n", "6 # length of \"String\"\n", "String\n", "36 # length of \"some log information ...\"\n", "some log information ...\n", "0x00 # UInt8(0) - Little Endian\n", "5 # length of \"somez\"\n", "somez # label of fourth item\n", "5 # length of \"Array\"\n", "Array\n", "7 # length of \"Float64\"\n", "Float64\n", "1 # number of dimensions\n", "29 # length of dimension 1\n", "... (alignment spacing)\n", "<the data>\n", "<EOF> # end of file\n", "\n", "\n", "# Acknowledgements\n", "Early inspiration was drawn from this gist." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.2.0", "language": "julia", "name": "julia-1.2" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 4 }