Purpose-built DAFSA for named character reference matching during HTML tokenization
0BSD License
This is an attempt at a purpose-built DAFSA of named character references for implementing the Named character reference state of HTML tokenization.
That is, the goal is to encode the necessary data compactly while still allowing for fast matching of named character references, while taking full advantage of the note that:
This list [of named character references] is static and will not be expanded or changed in the future.
u21
integers, which allows the storage of 2,231 character reference -> codepoint(s)
transformations in 5,857 bytesSome relevant information about the set of named character references:
&
which every named character reference starts with). The characters are (using Zig switch case syntax): '1'...'8', ';', 'a'...'z', 'A'...'Z'
u12
u8
.U+1D56B
, meaning all first codepoint values can fit into a u17
.U+0338
, U+20D2
, U+200A
, U+0333
, U+20E5
, U+FE00
, U+006A
, U+0331
), meaning the value of the second codepoint can be encoded as a u3
(with a supporting lookup function to go from an enum -> codepoint). One more bit is needed to encode the 'no second codepoint' option.named_character_references.zig
Contains the generated dafsa
array, the packed version of the codepoints_lookup
array, and helper functions to deal with both in ways that are relevant to the Named character reference state specification.
generate.zig
Note: Running this is only necessary if the encoding of the
dafsa
orcodepoints_lookup
arrays are changed
Requires entities.json
which can be downloaded from here.
zig run generate.zig > generated.zig
Outputs the generated Zig code to stdout, containing the dafsa
array and the codepoints_lookup
packed array.
test.zig
Note: This is not a full test of the Named character reference state tokenization step; instead, it's a somewhat crude approximation in order to run the
namedEntities.test
test cases
Requires namedEntities.test
from html5lib-tests
which can be downloaded from here
zig test test.zig
Encoding the DAFSA nodes:
Minimal perfect hashing:
Constructing and minimizing the trie when generating the DAFSA: