Miniup - the pragmatic parsing library

[Try now!](TODO link to test page)

TODO

A short introduction to miniup

Welcome to Miniup, the pragmatic parsing library. Miniup is strongly inspired by context-free grammar parsing libraries such as Stratego/XT and offers grammars in a pragmatic and therefore easy-to-use way. It is based on the PEG parsing formalism, but adds some convenient features and language constructions on top of plain PEG grammars, which makes it well suited for rapid prototyping of new languages.

The pragmatism of Miniup:

Easy-to-write grammars. Miniup offers built-in support for common concepts such as lists, operators and choices. These concepts spare the grammar writer from having to be aware of the underlying parsing algorithm and its typical drawbacks.
Context-free parsing algorithm. Miniup applies no lexical analysis before parsing and is therefore a context-free parser. This means the parser can switch between different languages during the parse process.
No semantic actions. Many parser generators allow semantic actions to be defined inside a grammar. However, this kills the readability of a grammar, makes interoperability with other environments very hard and prevents the parser from doing automatic refactorings on the grammar. In fact, allowing semantic actions mixes two concerns which should be separated in my humble opinion: parsing input to an AST, and interpreting an AST to do something sensible with it.
Friendly AST. The AST generated by the Miniup parser is very intuitive. For example, if you parse a list, the parser will return a JSON array instead of a right recursive structure as most parsers do.
No intermediate code. Miniup does not generate a parser, it is a parser. You can hook up miniup directly in your java / javascript / typescript environment without needing intermediate compilation steps.
Built-in support for recognizing whitespace and common built-in tokens like identifiers, numbers, strings and comments.

Properties of miniup

miniup has a small playground.
miniup recognizes all common PEG structures and should be compatible with most PEG grammars.
miniup is small, just over 1 kloc of typescript code. Minified the lib is TODO kB.
miniup has an extensive test suite, covering 90% of the code. Please keep this stat high when contributing!
miniup interprets grammars, and even allows you to modify grammars at runtime!
miniup can be used in both browser and server (node) environments.
miniup offers a command line interface. Install it with npm install -g miniup.

Parsing constructions introduced by miniup

miniup can match lambdas (parsing rules that may recognize nothing, even in combination with repetition operators!). Use - to match nothing.
miniup supports left recursion: a = a 'x' / 'y'
miniup has built in support for recognizing binary operators, respecting precedence and right / left associativity: binop = 'and' > 'or' > expr
miniup can match lists without building recursive ASTs and has built in support for separators: statements = (statement ; ';')+
miniup can match regular expressions (use the native javascript syntax): number = (/\d+/)
miniup allows anonymous rules if your grammar consists of just one rule.
miniup offers built in support for whitespace, allowing the parser to automatically consume and ignore any whitespace between tokens: the grammar numbers = @whitespace-on (@whitespace-off $('-'? [0-9]+))* applied to the input 1 -22 34 yields ["1", "-22", "34"]
miniup can match text that is composed from multiple grammar expressions using $. (See previous example)
miniup can import other grammar definitions and apply them: TODO
miniup can match sets of expressions: TODO
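
The left-recursive rule a = a 'x' / 'y' from the list above recognizes a 'y' followed by any number of 'x's, nested to the left. A minimal hand-written sketch of that behavior in plain JavaScript, independent of miniup; the function name parseA and the { left, op } node shape are assumptions made for illustration and do not reflect miniup's implementation or AST:

```javascript
// Hand-written recognizer for the left-recursive rule: a = a 'x' / 'y'
// Illustration only; parseA and the { left, op } node shape are hypothetical.
function parseA(input) {
  if (input[0] !== "y") return null;         // base case: the 'y' alternative
  let node = "y";
  let pos = 1;
  while (input[pos] === "x") {               // each 'x' wraps the result so far
    node = { left: node, op: "x" };
    pos++;
  }
  return pos === input.length ? node : null; // require the whole input to match
}
```

The loop is the usual way to express left recursion by hand: the result built so far becomes the left child of the next node, so the nesting leans to the left.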

Getting started

Running miniup from commandline

Miniup can be run from the commandline if it is installed with npm. (So install nodejs / npm first).

Install miniup by running npm install -g miniup

Now you can run miniup against a grammar file by using

miniup -s <grammar-file> -i <input file>

or by using

miniup -g <grammar> <input>

Example:

$ miniup -g 'numbers = @whitespace-on (FLOAT / INTEGER ; ",")*' '4, -5, 6.7'
[
  "4",
  "-5",
  "6.7"
]

$ miniup -c -g 'g = greeting: ("hello" / "hi") (/\s+/) who: DOUBLEQUOTEDSTRING' 'hello "world"'
{
  "greeting": "hello",
  "who": "\"world\""
}

Check miniup --help for a list of all command line options.

From javascript

Embed miniup in your app. In a node environment, install and require miniup. In the browser, include miniup.js using a script tag or any other way. From there on, load the grammar by using miniup.Grammar.load or miniup.Grammar.loadFromFile.

var miniup = require("miniup");
var grammar = miniup.Grammar.load("numbers = @whitespace-on (FLOAT / INTEGER ; \",\")*");
var numbers = grammar.parse("4, -5, 6.7",  { cleanAST : true });
console.dir(numbers); //prints ["4", "-5", "6.7"]

Or, as a one-liner: miniup can be invoked directly as a function, with a grammar, an input and an optional parse options object. Note that the rule name can even be left out if there is just one rule:

var numbers = require("miniup")("@whitespace-on (FLOAT / INTEGER ; \",\")*", "4, -5, 6.7");

That's all!

Miniup grammar concepts

/* comment */

Java style multi-line comment

// comment END OF LINE

Java style single line comment

rulename "displayname" = expr

Defines a rule with rulename. A rule with a certain name can be defined only once. expr refers to any of the constructions defined below. The "displayname" string is a human-friendly name used in error reporting and is optional.

Instead of using the = sign, a left arrow <- is allowed as well.

rule

Searches for a rule named rulename and tries to match it. Rules can be used before they are defined (syntactically speaking).

Example:

All examples in this section are in the format grammar x input » output. The examples can be verified by running the command miniup -c -g "grammar" "input" which should produce the mentioned output.

phone = number; number = [0-9]+; x 45 » ["4", "5"]

'literal'

Tries to match literal literally in the input. Both single- and double quotes are allowed to define the literal. Normal java(script) escaping is allowed within the literal (e.g. 'quote\' and newline\n'). Unicode, hexadecimal and octal escape sequences are allowed as well. The i flag can be added after the closing quote to perform the match case-insensitive.

TODO: mention automatic word boundaries if auto whitespace.

Example: foo = "baR"i x BAr » "BAr"

[characterset]

Tries to match exactly one character from the given characterset. Ranges and negations can be used, similar to regular expressions. For example [^0-9 ] matches everything but a digit or a space. The i flag can be added after the closing bracket to perform the match case-insensitively. Within the characterset, the backslash (\) can be used as escape character.

Note that character sets can be combined with the $ operator to join the individually matched characters into a single string.

Example: foo = [^bar] x R » "R"

Example: phone = number; number = [0-9]+; x 45 » ["4", "5"]

Example: phone = number; number = $[0-9]+; x 45 » "45"

expr1 expr2 ... exprn

Tries to match all the expressions in order, or fails. Returns an object containing the submatches. If the extendedAST option is enabled, each submatch is available under its (zero-based) index. Otherwise only labeled items are available (see label).

Example: foo = 'bar' name:'baz' x barbaz » { name: "baz" }

expr1 / expr2 ... / exprn

Tries to match either expr1, expr2 or exprn. The choice rule returns the first successful match and does not attempt to match any subsequent choices. This is unlike the behavior of the | operator in context-free grammars.

Example: foo = 'a' / 'b' / 'b' / 'c' x b » "b"
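
The consequence of ordered choice is easiest to see when one alternative is a prefix of another. A minimal sketch in plain JavaScript, independent of miniup; orderedChoice is a hypothetical helper that only handles literal alternatives:

```javascript
// Ordered choice over literal alternatives: the first alternative that matches wins.
// Illustration only; orderedChoice is a hypothetical helper, not part of miniup.
function orderedChoice(alternatives, input, pos = 0) {
  for (const literal of alternatives) {
    if (input.startsWith(literal, pos)) {
      return { match: literal, end: pos + literal.length };
    }
  }
  return null; // no alternative matched
}

// 'a' / 'ab' on the input "ab": 'a' succeeds first, so 'ab' is never tried.
const first = orderedChoice(["a", "ab"], "ab");
```

Here first.match is "a" and one character of input is left over. Listing the longer alternative first ('ab' / 'a') is the usual way to avoid this.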

(expr1 expr2 ... exprn)

Groups the list of expressions. Behavior is identical to not using parentheses, but parentheses are very useful to make quantifiers or predicates apply to multiple expressions at once.

Example:

decl = @whitespace-on modifiers:(pub:'public'? stat:'static'?) name: IDENTIFIER '(' ')'

x static foo() » { modifiers : { pub : null, stat: 'static' }, name: "foo" }

label: expr

Matches expr and stores it under label in the resulting AST.

Items will not be available in the result AST unless they are either labeled or matched by a Regex or Characterclass (todo: verify:) or Literal matcher. If the extendedAST parse option (--extended) is used, all items are available in the resulting AST.

Example: abc = a:'a' 'b' c:'c' x abc » { a: "a", c: "c"}

Example (with extended AST enabled): abc = a:'a' 'b' c:'c' x abc » { a: "a", c: "c", 0: "a", 1: "b", 2: "c", length: 3 }
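
The difference between the default and the extended AST can be sketched in plain JavaScript. This is an illustration under stated assumptions, not miniup's code: matchSequence and its argument shape are hypothetical, and only literal parts are supported.

```javascript
// Match a sequence of literals and build the result object.
// Illustration only; matchSequence and the { label, literal } parts are hypothetical.
function matchSequence(parts, input, extendedAST = false) {
  const result = {};
  let pos = 0;
  for (let i = 0; i < parts.length; i++) {
    const { label, literal } = parts[i];
    if (!input.startsWith(literal, pos)) return null; // the whole sequence fails
    pos += literal.length;
    if (label) result[label] = literal;               // labeled submatch
    if (extendedAST) result[i] = literal;             // zero-based index
  }
  if (extendedAST) result.length = parts.length;
  return result;
}

// abc = a:'a' 'b' c:'c'
const abc = [{ label: "a", literal: "a" }, { literal: "b" }, { label: "c", literal: "c" }];
```

Calling matchSequence(abc, "abc") keeps only the labeled items a and c, while matchSequence(abc, "abc", true) additionally exposes every item under its index together with a length property.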

expr?

Optionally matches expr. If expr is not found, the parse is still considered to be successful.

Example: foo = bar:'bar'? baz:'baz' x baz » { bar: null, baz: 'baz'}

expr*

Matches expr as many times as possible; zero matches are fine as well. Matching is always greedy. The matches are returned as an array.

Example: foo = 'a'* x aaaa » ['a', 'a', 'a', 'a']

In combination with the semicolon (;) separators can be recognized. It is allowed to refer to other rules for separators.

Example: foo = ('a' ; ',')* x a,a,a » ['a','a','a']
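
A separated list amounts to "one item, then any number of separator-plus-item pairs", collected into a flat array. A minimal sketch in plain JavaScript, independent of miniup; matchList is a hypothetical helper that only handles a literal item and a literal separator:

```javascript
// Match ('a' ; ',')* : zero or more items with a separator between them,
// collected into a flat array. Illustration only; matchList is hypothetical.
function matchList(item, separator, input) {
  const items = [];
  let pos = 0;
  if (input.startsWith(item, pos)) {
    items.push(item);
    pos += item.length;
    // After the first item, every further item must be preceded by a separator.
    while (input.startsWith(separator + item, pos)) {
      items.push(item);
      pos += separator.length + item.length;
    }
  }
  return { items, end: pos };
}
```

Because a separator is only consumed together with the item that follows it, a trailing separator is left unconsumed in this sketch.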

expr+

Matches expr as many times as possible, but at least once. Matching is always greedy. The matches are returned as an array. Use parentheses and a semicolon to introduce separators.

Example: foo = 'a'+ x aaaa » ['a', 'a', 'a', 'a']

&expr

Positive predicate. Tries to match expr. If expr is found, the match is considered successful, but no input is consumed. Can be used as 'lookahead'.

Example: foo = &'0' num:[0-9]+ x 017 » { num : [ '0', '1', '7' ] }. But, this rule will fail on the input 117.

!expr

Negative predicate. Tries to match expr. If expr is not found, the match is considered successful and parsing continues. If expr is found, the match fails.

Example: foo = 'idontlike ' !'coffee' what:/[a-z]*/ x idontlike tea » { what: 'tea' }. But, this rule will fail on idontlike coffee. And rightfully so.
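
Both predicates inspect the input at the current position without consuming it. The two example rules can be sketched by hand in plain JavaScript; the helper names are hypothetical and the code is an illustration, not miniup's implementation:

```javascript
// Lookahead checks the input at a position without consuming it.
// Illustration only; these helpers are hypothetical, not miniup API.
const lookingAt = (literal, input, pos) => input.startsWith(literal, pos);

// foo = &'0' num:[0-9]+
function parseLeadingZeroNumber(input) {
  if (!lookingAt("0", input, 0)) return null;   // positive predicate: &'0'
  const m = /^[0-9]+/.exec(input);              // the '0' is still there to be matched
  return m ? { num: m[0].split("") } : null;
}

// foo = 'idontlike ' !'coffee' what:/[a-z]*/
function parseDislike(input) {
  const prefix = "idontlike ";
  if (!input.startsWith(prefix)) return null;
  if (lookingAt("coffee", input, prefix.length)) return null; // negative predicate: !'coffee'
  const m = /^[a-z]*/.exec(input.slice(prefix.length));
  return { what: m[0] };
}
```

Note that in parseLeadingZeroNumber the leading 0 ends up in the result: the predicate only checked for it, and the digits rule then matched it as usual.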

.

The . matches any single character.

Miniup PEG extensions

$expr

Matches expr. But instead of returning the result, the original input string on which the match was made is returned.

Example: number = $('-'? [0-9]+) x -44 » "-44"

Example: number = sign:'-'? digits:[0-9]+ x -44 » { sign: "-", digits: ["4", "4"] }
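
The effect of $ can be sketched as "remember where the match started and ended, and return that slice of the input". A minimal plain JavaScript illustration, independent of miniup; matchNumber and its asText flag are hypothetical:

```javascript
// number = sign:'-'? digits:[0-9]+   versus   number = $('-'? [0-9]+)
// Illustration only; matchNumber is a hypothetical helper, not miniup API.
function matchNumber(input, asText) {
  let pos = 0;
  const sign = input[pos] === "-" ? input[pos++] : null;
  const digits = [];
  while (pos < input.length && input[pos] >= "0" && input[pos] <= "9") {
    digits.push(input[pos++]);
  }
  if (digits.length === 0) return null;
  // With $, return the slice of the input that was matched instead of the parts.
  return asText ? input.slice(0, pos) : { sign, digits };
}
```

With asText set, the caller gets one string covering the sign and all digits. Without it, the caller gets the individual submatches.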

(/regular expression/)

Matches the specified regular expression at the beginning of the remainder of the input. Useful to express more powerful patterns than charactersets. The syntax is equal to native JavaScript regular expressions and the same escaping rules apply. Only the i flag is supported; the g and m flags are not.

The surrounding parentheses are not strictly necessary, but are usually required to distinguish regular expressions from choices. For example a / b / c matches just a, b or c, while a (/ b /) c only matches a b c.

Example: float = (/[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?/) x -34.3e523 » "-34.3e523"
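
Matching "at the beginning of the remainder of the input" corresponds to anchoring the regular expression at the current parse position. In plain JavaScript this can be sketched with the sticky (y) flag; matchAt is a hypothetical helper, and whether miniup uses this mechanism internally is not stated here:

```javascript
// Match a regular expression exactly at a given position in the input,
// using the sticky (y) flag. Illustration only; matchAt is hypothetical.
function matchAt(regex, input, pos) {
  const sticky = new RegExp(regex.source, regex.ignoreCase ? "iy" : "y");
  sticky.lastIndex = pos;
  const m = sticky.exec(input);
  return m ? { text: m[0], end: pos + m[0].length } : null;
}

const float = /[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?/;
```

For the input "x=-34.3e523", matchAt(float, input, 2) matches the number, while matchAt(float, input, 0) fails, because a sticky match never skips ahead to a later position.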

(expr1 ... exprn separator)*?

Matches all items between the parentheses zero or more times. The last item of the sequence is used as the separator between repetitions.

Example: args = args:(expr ',')*?; expr = 'dummy' x dummy,dummy,dummy » { args: ["dummy", "dummy", "dummy"] }

(expr1 ... exprn separator)+?

Behaves the same as the *? operator, but requires at least one match.

Example: args = args:(expr ',')+?; expr = 'dummy' x dummy » { args: ["dummy"] }

(expr1 ... exprn separator)#

Matches any subset of the provided expressions, but none are required.

Example: modifiers = @whitespace-on (public:'public' static:'static' final: 'final')# x final public » {public:"public", static: null, final: "final"}
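
The result shape of a set match, with null for every expression that did not occur, can be sketched in plain JavaScript. This is an illustration under assumptions, not miniup's implementation: matchSet is hypothetical, handles keywords only, splits on whitespace for brevity, and rejects repeated keywords (the text above does not say how miniup treats repeats).

```javascript
// Match any subset of keywords, in any order, each at most once.
// Keywords that do not occur are reported as null.
// Illustration only; matchSet is a hypothetical helper, not miniup API.
function matchSet(keywords, input) {
  const result = Object.fromEntries(keywords.map((k) => [k, null]));
  for (const word of input.split(/\s+/).filter(Boolean)) {
    if (!keywords.includes(word)) return null; // not part of the set
    if (result[word] !== null) return null;    // assumption: repeats are rejected
    result[word] = word;
  }
  return result;
}
```

For example, matchSet(["public", "static", "final"], "final public") fills in public and final and leaves static as null, regardless of the order in the input.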

@whitespace-on (or off) expr1 .. exprn

Enables or disables automatic whitespace parsing for this rule. Enabling automatic whitespace parsing avoids the need to explicitly match whitespace between tokens, which is very useful in many grammars. Automatic whitespace matching is turned off by default for compatibility with existing PEG grammars. Enabling whitespace enables it for the rest of this rule and for all rules called by it. After completing (or failing) the match, the whitespace status is reset to its original value.

Note that using this construction requires a rule with the name 'whitespace' to be defined.

Example: numbers = @whitespace-on ($number)+; number = @whitespace-off '-'? [0-9] + ('.' [0-9]+)?; whitespace = WHITESPACECHARS x 42 3.16 -12 » ["42", "3.16", "-12"]
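
The example grammar can be mimicked by hand in plain JavaScript to show what "automatic" means: whitespace is skipped before every token, but never inside one. This is a sketch under assumptions, not miniup's implementation; matchNumbers is a hypothetical helper.

```javascript
// numbers = @whitespace-on ($number)+ : whitespace between tokens is skipped
// automatically, but not inside a token.
// Illustration only; matchNumbers is a hypothetical helper, not miniup API.
function matchNumbers(input) {
  const number = /-?[0-9]+(\.[0-9]+)?/y;  // token, matched with whitespace "off"
  const whitespace = /\s*/y;              // skipped between tokens
  const result = [];
  let pos = 0;
  for (;;) {
    whitespace.lastIndex = pos;
    whitespace.exec(input);               // always succeeds, possibly on nothing
    pos = whitespace.lastIndex;
    number.lastIndex = pos;
    const m = number.exec(input);
    if (!m) break;
    result.push(m[0]);
    pos = number.lastIndex;
  }
  // At least one number is required and the whole input must be consumed.
  return result.length > 0 && pos === input.length ? result : null;
}
```

Because the number token itself is matched without whitespace skipping, an input such as "- 12" is rejected rather than being read as -12.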

@import grammar.name

TODO: @import "filename.peg".rule

Behaves similarly to a rule call, but @import applies a rule from another grammar. The statement applies the rule name from the grammar registered under the name grammar. Note that the grammar has to be registered first.

Example:

var coffeeGrammar = miniup.Grammar.load("coffee = flavor : ('coffee' /  'cappucino')");
miniup.Grammar.register('CoffeeGrammar', coffeeGrammar);
var fooGrammar = miniup.Grammar.load("foo = @import CoffeeGrammar.coffee");
fooGrammar.parse("cappucino");
//returns: { flavor : "cappucino" }

@operators left:expr > right:expr on expr

TODO:

.

built-in tokens

left recursion

adapt grammar on the fly

Show me a grammar!

Some grammar

Run from command line

Some AST

Use from javascript

Miniup reference

Language options

Miniup grammar concepts

List

Choice

Token

Import

Operator

Run Miniup from commandline

Miniup can read input from files, standard input stream or arguments. To parse input from the input stream use the following command:

echo "+31 2344567" | miniup -rc -g "phone = '+' countrycode:[0-9]+ ' ' phonenumber:[0-9]+"

Which (TODO) outputs:

{
	countrycode: "31",
	phonenumber: "2344567"
}

Or, if you provide invalid input:

echo "+31 23oops7" | miniup -rc -g "phone = '+' countrycode:[0-9]+ ' ' phonenumber:[0-9]+"

TODO: error output

All available command line options can be listed by using miniup --help

Run Miniup from javascript

Build a grammar object from a grammar string:

var grammar = miniup.Grammar.load("phone = '+' countrycode:[0-9]+ ' ' phonenumber:[0-9]+");

To parse input, use grammar.parse(input /*string*/ [, options /*options object*/])

var ast = grammar.parse("+31 2344567");
console.log("Country: " + ast.countrycode + " Phone: " + ast.phonenumber);

The following options can be provided to the parse function:

{
	debug: true, /* default: false, prints the complete parse strategy*/
	cleanAST: true, /* default: false, does not include additional match information such as $text, $rule and $pos*/
	extendedAST: true, /* default: false, if true, unlabeled items are added to the resulting AST as well */
	startSymbol: 'rulename', /* use alternative start symbol. Default: first rule that is defined in the grammar */
	inputName: 'string', /* default: 'input', name of the input to use in error reporting, such as a filename */
	allowLeftRecursion: true /* default: true; use --no-leftrecursion on the command line to disable */
}

Built-in tokens

The following tokens are available by default in every grammar and can be called by any rule. These regexes are all available in the miniup.RegExpUtil namespace. Values for REGEX, INTEGER, FLOAT, BOOLEAN and strings will be automatically parsed to the corresponding javascript type.

IDENTIFIER /[a-zA-Z_][a-zA-Z_0-9]/
WHITESPACE /\s+/
REGEX //([^\/]|(\.))/i?/
SINGLEQUOTEDSTRING /'([^'\]|(\.))'/
DOUBLEQUOTEDSTRING /"([^"\]|(\.))"/
SINGLELINECOMMENT ///.(\n|$)/
MULTILINECOMMENT //*(?:[^]|*(?!/))?*//
CHARACTERCLASS /[([^\/]|(\.))]/
INTEGER /(-|+)?\d+/
FLOAT /[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?/
BOOLEAN /(true|false)\b/
LINEENDCHAR /\r?\n|\u2028|\u2029/

* Interpreted as whitespace

** These regular expressions are in unescaped form. To use them in a Miniup grammar file, apply JavaScript style escaping: first prefix each forward slash or backslash with a backslash, then wrap the whole regular expression in forward slashes. Or take a look at the miniup.txt file, which contains the miniup grammar described in miniup.
