Fast, StAX-like XML Parser for BEAM Languages
APACHE-2.0 License
Unlike SAX or DOM parsing of XML, which force the user to handle everything at once, this parser lets the user consume events from a stream whenever it suits them. Simply call next_event on the stream.
This means that the user can parse multiple streams from the same process at the same time.
It works like an iterator on any set or list-like type, but returns XML events instead.
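
A minimal sketch of that pull style, assuming only what this README shows elsewhere: yaccety_sax:stream/1 accepts a binary, yaccety_sax:next_event/1 returns {Event, NewState}, and each event is a map with a type key whose last value is endDocument. The module and function names (pull_sketch, event_types/1, collect_types/2) are made up for illustration. The loop takes the "next" function as an argument, so it can be tried against any stream-like state.

```erlang
-module(pull_sketch).
-export([event_types/1, collect_types/2]).

%% Collect the event types of a whole document held in a binary.
%% Assumes yaccety_sax:stream/1 and yaccety_sax:next_event/1 behave
%% as in the README example; nothing else about the API is assumed.
event_types(Xml) when is_binary(Xml) ->
    collect_types(fun yaccety_sax:next_event/1, yaccety_sax:stream(Xml)).

%% Pull events one at a time until the endDocument event shows up.
%% Next is any fun of the form State -> {Event, NewState}.
collect_types(Next, State) ->
    collect_types(Next, State, []).

collect_types(Next, State, Acc) ->
    case Next(State) of
        {#{type := endDocument}, _Done} ->
            lists:reverse([endDocument | Acc]);
        {#{type := Type}, State1} ->
            collect_types(Next, State1, [Type | Acc])
    end.
```

Because the caller decides when to ask for the next event, two such loops can be interleaved in one process, which is what the diff example further down relies on.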
yaccety_sax is a namespace-aware, non-validating XML 1.0 parser.
Chances are that when parsing XML from some REST API, you won't need a lot of the features yaccety has. This is what yaccety_sax_simple is for.
It works mostly in the same way as the full version, except for:
yaccety_sax_simple:string/1 without a continuation function

kinda_equal(Filename1, Filename2) ->
    % UTF-16 file with external DTD and full of whitespace nodes
    {Cont, Init} = ys_utils:trancoding_file_continuation(Filename1),
    LhState = yaccety_sax:stream(Init, [
        {whitespace, false},
        {comments, false},
        {proc_inst, false},
        {continuation, {Cont, <<>>}},
        {base, filename:dirname(Filename1)},
        {external, fun ys_utils:external_file_reader/2}
    ]),
    % Start Document event
    {_, LhState1} = yaccety_sax:next_event(LhState),
    % DTD event
    {_, LhState2} = yaccety_sax:next_event(LhState1),
    % UTF-8 file with no DTD or whitespace nodes
    % Could have streamed this file as well...
    {ok, Bin2} = file:read_file(Filename2),
    RhState = yaccety_sax:stream(Bin2),
    % Start Document event
    {_, RhState1} = yaccety_sax:next_event(RhState),
    % Now both streams are in a comparable state, so diff them
    diff_loop(LhState2, RhState1).

diff_loop(LhState, RhState) ->
    {LhEvent, LhState1} = yaccety_sax:next_event(LhState),
    {RhEvent, RhState1} = yaccety_sax:next_event(RhState),
    #{type := EventType} = LhEvent,
    % Some function that checks equality, maybe ignoring
    % namespaces or prefixes or something.
    case equal_enough(LhEvent, RhEvent) of
        true when EventType =:= endDocument -> true;
        true -> diff_loop(LhState1, RhState1);
        false -> false
    end.
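
The comparison function is left open in the example above. One possible shape for it, as a sketch only: treat two events as equal when they have the same type and their maps match after dropping keys the caller does not care about. Apart from type, the key names used here (prefix, line) are assumptions about the event maps, not confirmed yaccety_sax fields, and the module name equal_sketch is made up; adjust the ignored keys to whatever the events actually contain.

```erlang
-module(equal_sketch).
-export([equal_enough/2, equal_enough/3]).

%% Hypothetical default: keys to ignore when comparing two events.
%% These names are placeholders, not confirmed yaccety_sax keys.
-define(IGNORED_KEYS, [prefix, line]).

equal_enough(LhEvent, RhEvent) ->
    equal_enough(LhEvent, RhEvent, ?IGNORED_KEYS).

%% Both events must be maps with the same type; every remaining key
%% except the ignored ones must then match exactly.
equal_enough(#{type := Type} = LhEvent, #{type := Type} = RhEvent, Ignored) ->
    maps:without(Ignored, LhEvent) =:= maps:without(Ignored, RhEvent);
equal_enough(_LhEvent, _RhEvent, _Ignored) ->
    false.
```

Keeping the ignored keys as an argument makes it easy to tighten or loosen the comparison without touching the diff loop.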
Just for fun, parsing a 5.2 GB Wiki abstract dump with a callback that throws away all events:
yaccety_sax takes around 5 minutes on my machine.
xmerl_sax_parser with default settings is still running...
xmerl_sax_parser with a larger buffer in the continuation function takes around 12 minutes.

Another big difference is that the xmerl process held onto about 42 MB by the end of parsing. yaccety never went above 109 KB.
I didn't try xmerl_scan on the 5.2 GB file. Not sure it's a good idea to try.
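
How the numbers above were measured is not stated here. For anyone who wants to try something similar, the following is a rough sketch of a throw-everything-away loop with timing. It assumes only the {Event, NewState} return shape and the endDocument event type shown earlier; the module and function names (drain_sketch, drain/2, timed_drain/2) are hypothetical. Note that it samples the process memory once at the end, so it does not capture a peak value.

```erlang
-module(drain_sketch).
-export([drain/2, timed_drain/2]).

%% Pull and discard every event until endDocument.
%% Returns the number of events seen, endDocument included.
%% Next is any fun of the form State -> {Event, NewState},
%% for example fun yaccety_sax:next_event/1.
drain(Next, State) ->
    drain(Next, State, 0).

drain(Next, State, Count) ->
    case Next(State) of
        {#{type := endDocument}, _Done} -> Count + 1;
        {_Event, State1} -> drain(Next, State1, Count + 1)
    end.

%% Drain the stream and report wall-clock time in microseconds plus
%% the calling process's memory in bytes, sampled after the run.
timed_drain(Next, State) ->
    {Micros, Count} = timer:tc(fun() -> drain(Next, State) end),
    {memory, Bytes} = erlang:process_info(self(), memory),
    #{events => Count, micros => Micros, memory_bytes => Bytes}.
```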
I'm sure there are other parsers out there that stream-parse large data. It would be cool to see how all of them react.
Anyone who has seen The Benny Hill Show knows the song that inspired the name for the repo: Yakety Sax.