FSharp codegen for Apache Avro
NuGet Package: FSharp.Avro.Apache.Tools
This package provides a tool that generates F# types wrapping Apache Avro serialisation mechanics.
OPTIONS:
--schema-file <file> Path to .avsc file
--output <file> Output location
--record-repr <repr> Record representation, 'class' or 'record'
--namespace <string> Map an Avro schema namespace to a .NET namespace.
The format is "my.avro.namespace:my.csharp.namespace"
May be specified multiple times to map multiple namespaces.
--help display this list of options.
Avro already has an "official" codegen tool for .NET, but it comes with some disadvantages:
It does not support Avro unions at the level of generated types: an object type is used to unify the choices.
For example, a field that is declared in Avro as
{"name": "Foo", "type": ["string", "int"]}
will be generated in C# as
public object Foo { get; set; }
There is no support for optional types at the generated code level.
Generated properties for types like ["null", "string"] and string will both be of type string in C#.
There is no structural equality provided for the generated types.
Generated types are extremely mutable.
Bugs like this one exist.
To make the developer experience a bit better, this tiny library was born.
The goal of this library is to keep using the "official" Apache Avro library for the actual encoding and decoding of Avro payloads, while providing developers with more structured and friendly types that mitigate the issues above as much as possible.
Compared to building an F# Avro library from the ground up (which may be considered as a next step), the approach of using the Apache Avro library has its trade-offs:
The generated code is still not entirely pure. We can mitigate much of this and keep most of the rest conveniently hidden, but strictly speaking it is still there.
For example, while the types generated for Avro records have an immutable interface, they still need to implement ISpecificRecord and provide a way for mutation (via the CLIMutable attribute) for the Apache serialiser to work.
We inherit bugs from the Apache Avro library. Some of them we can mitigate, some we cannot.
F# code is generated as follows:
The tool offers a choice between two representations for Avro records: an F# record and a .NET class.
Consider this simple form of an Avro record:
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}
With the 'record' representation, the type generated for the schema above is an F# record with CLIMutable that implements ISpecificRecord:
[<CLIMutable>]
type Person =
    { name: string
      age: int }
    static member _SCHEMA : Avro.Schema = ...
    interface Avro.Specific.ISpecificRecord with
        member this.Get(pos: int) = ...
        member this.Put(pos: int, value: obj) = ...
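As a small illustration of what the record representation buys, the sketch below uses a simplified stand-in for the generated Person type. The Avro-specific members (_SCHEMA and the ISpecificRecord implementation) are left out so that it runs without the Avro package, and the values alice, sameAlice and olderAlice are made up for the example.

```fsharp
// Simplified stand-in for the generated record: Avro-specific members are omitted.
[<CLIMutable>]
type Person =
    { name: string
      age: int }

let alice = { name = "Alice"; age = 30 }
let sameAlice = { name = "Alice"; age = 30 }

// Records compare by value, not by reference.
let areEqual = (alice = sameAlice)

// "Updating" a record creates a new value and leaves the original untouched.
let olderAlice = { alice with age = 31 }

printfn "%b %d %d" areEqual alice.age olderAlice.age
```

From F# code the record behaves like any other immutable record, even though CLIMutable is there for the serialiser.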
With the 'class' representation, the type generated for the same schema is a .NET class that provides a constructor and structural equality.
It also has an unsafe default constructor (required by Apache Avro), but we make it inaccessible to F# developers.
[<Sealed>]
type Person(name: string, age: int) =
    let mutable __name = name
    let mutable __age = age
    [<CompilerMessage("This method is not intended for use from F#.", 10001, IsError = true, IsHidden = true)>]
    new () = Person(Unchecked.defaultof<string>, Unchecked.defaultof<int>)
    member this.name = __name
    member this.age = __age
    static member _SCHEMA : Avro.Schema = ...
    interface Avro.Specific.ISpecificRecord with
        member this.Get(pos: int) = ...
        member this.Put(pos: int, value: obj) = ...
    interface System.IEquatable<Person> with
        member this.Equals other = ...
    override this.Equals(other) = ...
    override this.GetHashCode() = ...
Unfortunately, the Apache Avro library requires an Avro enum to be represented as a .NET enum.
Because of that, we cannot generate a nice discriminated union and have to fall back to generating enums:
Avro:
{
  "type": "enum",
  "name": "Suit",
  "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}
F#:
type Suit =
    | SPADES = 0
    | HEARTS = 1
    | DIAMONDS = 2
    | CLUBS = 3
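To make the trade-off concrete, here is a minimal sketch of consuming such an enum from F#. The colour helper and the bogus value are hypothetical and only serve the illustration. Because a .NET enum can hold any underlying integer, a match needs a wildcard case to be complete, which a discriminated union would not require.

```fsharp
open System

// Same shape as the generated enum above.
type Suit =
    | SPADES = 0
    | HEARTS = 1
    | DIAMONDS = 2
    | CLUBS = 3

// Hypothetical helper: the wildcard case covers values outside the declared symbols.
let colour (suit: Suit) =
    match suit with
    | Suit.SPADES | Suit.CLUBS -> "black"
    | Suit.HEARTS | Suit.DIAMONDS -> "red"
    | _ -> "unknown"

// An out-of-range value is representable with an enum, unlike with a union.
let bogus = enum<Suit> 42
let isDeclared = Enum.IsDefined(typeof<Suit>, box bogus)

printfn "%s %s %b" (colour Suit.HEARTS) (colour bogus) isDeclared
```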
The official C# codegen tool uses IList<T> for arrays. This tool simply uses the 'T array type.
The official C# codegen tool uses IDictionary<string, T> for maps. This tool uses Map<string, 'T>.
The official C# codegen tool uses object to represent union types. This tool uses F#'s Choice and Option types.
Examples:
Avro type | F# Type |
---|---|
["null", "string"] | string option |
["int", "string"] | Choice<int, string> |
["string", "User", "Role"] | Choice<string, User, Role> |
["null", "string", "int"] | Choice<string, int> option |
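The mapping above means that union fields can be consumed with ordinary pattern matching. The sketch below assumes two hypothetical fields, a nickname declared as ["null", "string"] and an identifier declared as ["null", "string", "int"]; the function names are made up for the illustration.

```fsharp
// Hypothetical field types, following the mapping in the table above:
//   ["null", "string"]        -> string option
//   ["null", "string", "int"] -> Choice<string, int> option

let describeNickname (nickname: string option) =
    match nickname with
    | Some value -> sprintf "nickname: %s" value
    | None -> "no nickname"

let describeIdentifier (identifier: Choice<string, int> option) =
    match identifier with
    | Some (Choice1Of2 text) -> sprintf "text id: %s" text
    | Some (Choice2Of2 number) -> sprintf "numeric id: %d" number
    | None -> "no id"

printfn "%s" (describeNickname (Some "Bob"))
printfn "%s" (describeIdentifier (Some (Choice2Of2 7)))
printfn "%s" (describeIdentifier None)
```

The compiler checks that every alternative is handled, which is not possible when the field is typed as object.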
Consider this schema:
{
  "name": "md5",
  "type": { "type": "fixed", "size": 16, "name": "MD5" }
}
Unfortunately, Apache Avro relies heavily on fixed types inheriting from the SpecificFixed hierarchy, so we cannot have a simple type MD5 = MD5 of byte array.
However, this tool tries to mitigate this inconvenience and provides a slightly better developer experience:
type MD5 private (value: byte[]) =
    inherit Avro.Specific.SpecificFixed(uint 16)
    override this.Schema = ...
    static member _SCHEMA = ...
    // smart constructor
    static member Create(value) : Result<MD5, string> =
        match Array.length (value) with
        | 16 -> Ok(MD5 value)
        | _ -> Error "Fixed size value Test.AvroMsg.MD5 is required have length 16"

[<AutoOpen>]
module MD5 =
    let (|MD5|) (value: MD5) = value.Value
The generated type has its constructor hidden and provides a "smart constructor" (the static Create function) instead, to make sure that the declared size is respected and that the values are correct by construction.
It also provides an active pattern to make pattern matching easier.
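The following sketch shows how a smart constructor and an active pattern work together. It uses a stand-in type, Md5Like, which does not inherit from Avro.Specific.SpecificFixed, so that the example runs without the Avro package. The names Md5Like, Md5Bytes and describe, as well as the error text, are assumptions made for this illustration and are not what the tool generates.

```fsharp
// Stand-in for the generated fixed type (no Avro base class).
type Md5Like private (value: byte[]) =
    member this.Value = value

    // Smart constructor: the only way to obtain an instance.
    static member Create(value: byte[]) : Result<Md5Like, string> =
        match Array.length value with
        | 16 -> Ok(Md5Like value)
        | n -> Error(sprintf "Expected 16 bytes but got %d" n)

// Active pattern that unwraps the underlying bytes.
let (|Md5Bytes|) (value: Md5Like) = value.Value

// Hypothetical caller: invalid sizes surface as an Error instead of an exception.
let describe (bytes: byte[]) =
    match Md5Like.Create(bytes) with
    | Ok (Md5Bytes raw) -> sprintf "valid, %d bytes" raw.Length
    | Error message -> message

printfn "%s" (describe (Array.zeroCreate 16))
printfn "%s" (describe (Array.zeroCreate 3))
```

Because the constructor is private, code outside the type cannot create a value of the wrong length.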
Primitive types are unchanged from what Apache Avro does: all the primitives are the same .NET primitives.
The Apache Avro library conveniently solves the logical types puzzle, and this tool just relies on that solution without deviating from it.
For performance reasons, this tool may generate slightly trickier code than a "straightforward" implementation would, such as using CLIMutable or carefully cached reflection, which is needed to implement ISpecificRecord.
These tricks are typically internal (to the generated code) and are not exposed to developers using the result of this tool.
Populating a fairly complex Avro type (~15 properties, nested, has optionals and choices) yields these results:
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
'C# Classes' | 775.3 ns | 13.47 ns | 17.04 ns | 1.00 | 0.00 |
'F# Classes' | 1,111.3 ns | 21.64 ns | 21.25 ns | 1.43 | 0.04 |
'F# Records' | 1,209.1 ns | 17.10 ns | 16.00 ns | 1.80 | 0.03 |
F# types are slower than the C# ones, perhaps because they do a bit more work, such as checking the types of inputs (C# classes just blindly cast values and leave unions as object).
At this point we do not consider "just above a microsecond" performance (for a fairly complex data type, too) to be a critical issue, despite it being almost 2x slower than C#.
But optimisations and hints are always welcome :)
The biggest known issue so far is AVRO-3671. C# code generated with the official codegen tool cannot handle it and either crashes or uses the wrong types. This tool makes a best effort to mitigate the issue. For example, in the case where the C# code crashes, the F# code will work and use the correct type. But this issue cannot be fully eliminated until AVRO-3671 is addressed.
Other bug reports and suggestions are appreciated and welcome!