APACHE-2.0 License
TyRE provides the following features compared to the standard Java regex matcher:
`Unit`, `Char`, `Tuple`, `List`, and `Either` types, but it can be freely transformed by the user.
Goals: The main goal of this library is to provide safer regex parsing than the Java regex matcher. This is achieved through compile-time regex validation and a more refined return type than that of `java.util.regex.Matcher`, with its arbitrary number of capture groups and possibly `null` captures.
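The `null` captures mentioned above are easy to reproduce with the standard library alone. The snippet below is a small illustration that does not use TyRE; the pattern and the value names are chosen only for this example.

```scala
import java.util.regex.Pattern

// Two capture groups, but only one of them can take part in a match.
val matcher = Pattern.compile("(a)|(b)").matcher("b")
val matched = matcher.matches()          // true: the whole input matches
val first   = matcher.group(1)           // null: group 1 did not participate
val second  = matcher.group(2)           // "b"

// Guarding against null is left entirely to the caller.
val safeFirst = Option(matcher.group(1)) // None
```

Nothing in the type of `matcher` says how many groups exist or which of them may be `null`; that is the information a typed parse result is meant to carry.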
Currently, TyRE is not production ready. However, we welcome you to download the released library and play around.
`Tyre` is a type constructor parameterized by the type of the parse tree. The result of parsing a `word: String` using a `pattern: Tyre[R]` will be `Option[R]`, where `None` denotes that `word` did not match `pattern`.
TyRE patterns can be built through smart constructors. The following set of constructors is complete:
// matches an empty string
Tyre.epsilon: Tyre[Unit]
// matches a single character satisfying predicate `f`
Pred.pred(f: Char => Boolean): Tyre[Char]
// sequence of patterns
Tyre[R]#<*>[S](re: Tyre[S]): Tyre[(R, S)]
// alternative of patterns
Tyre[R]#<|>[S](re: Tyre[S]): Tyre[Either[R, S]]
// zero or more repetitions of a pattern (Kleene star)
Tyre[R]#rep: Tyre[List[R]]
// transformation of the parse result with function `f`
Tyre[R]#map[S](f: R => S): Tyre[S]
Please note that recursive definitions cannot be used.
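To see how the result types of these constructors compose, here is a toy model written from scratch. It is only a sketch under stated assumptions: `Toy`, `Results`, and the naive backtracking strategy are hypothetical and are not how TyRE is implemented (TyRE compiles patterns to an automaton). Only the method names and type signatures are borrowed from the list above.

```scala
// Every (parse tree, remaining input) pair a pattern can produce.
type Results[A] = List[(A, List[Char])]

// Hypothetical toy model, NOT TyRE's implementation.
final case class Toy[R](step: List[Char] => Results[R]):
  // sequence: run `this`, then `that` on what is left
  def <*>[S](that: Toy[S]): Toy[(R, S)] = Toy { in =>
    for
      (r, rest)  <- step(in)
      (s, rest2) <- that.step(rest)
    yield ((r, s), rest2)
  }
  // alternative: try both branches and tag the result
  def <|>[S](that: Toy[S]): Toy[Either[R, S]] = Toy { in =>
    val left: Results[Either[R, S]]  = step(in).map((r, rest) => (Left(r), rest))
    val right: Results[Either[R, S]] = that.step(in).map((s, rest) => (Right(s), rest))
    left ++ right
  }
  // zero or more repetitions, longest first; each round must consume input
  def rep: Toy[List[R]] = Toy { in =>
    val more: Results[List[R]] =
      for
        (r, rest) <- step(in)
        if rest.length < in.length
        (rs, rest2) <- rep.step(rest)
      yield (r :: rs, rest2)
    more ++ List((Nil, in))
  }
  def map[S](f: R => S): Toy[S] =
    Toy(in => step(in).map((r, rest) => (f(r), rest)))
  // succeed only if some parse consumes the whole word
  def run(word: String): Option[R] =
    step(word.toList).collectFirst { case (r, Nil) => r }

object Toy:
  val epsilon: Toy[Unit] = Toy(in => List(((), in)))
  def pred(f: Char => Boolean): Toy[Char] = Toy {
    case c :: rest if f(c) => List((c, rest))
    case _                 => Nil
  }

val lower = Toy.pred(c => 'a' <= c && c <= 'z')
val digit = Toy.pred(c => '0' <= c && c <= '9')

// Either[Char, List[Char]] mapped to List[Char]
val toyExample: Toy[List[Char]] = (lower <|> digit.rep).map {
  case Left(c)     => List(c)
  case Right(list) => list
}

val r1 = toyExample.run("c")   // Some(List('c'))
val r2 = toyExample.run("123") // Some(List('1', '2', '3'))
val r3 = toyExample.run("z23") // None
```

The point of the sketch is the types: sequencing produces a pair, alternation an `Either`, repetition a `List`, and `map` is the only place where the user chooses the shape of the result.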
Though these constructors can be used explicitly, it's usually more convenient to use string literals for creating patterns. The syntax of string literals is largely standard; its full description can be found in the Supported syntax section.
To use TyRE for pattern matching you have to first define your `Tyre` pattern. This can be done using string interpolation, e.g.
val example: Tyre[(Char, Char)] = tyre"[a-z]t"
The syntax of Tyre patterns is very similar to the standard regex syntax, with a few exceptions, of which the most relevant are:
// `!s` converts the preceding group to a string
tyre"([a-z]t)!s" : Tyre[String]
// `|` corresponds to a union
val union: Tyre[String | Char] = tyre"([a-z]t)!s|a"
// `||` corresponds to a tagged union
val taggedUnion: Tyre[Either[String, Char]] = tyre"([a-z]t)!s||a"
Next you need to compile the pattern into our enriched automaton:
val parser: Automaton[(Char, Char)] = example.compile()
The `Automaton[T]` trait has a single public method, `def run(str: String): Option[T]`, which parses the input string. Note that the method returns the parse tree only if the whole input string matches the pattern.
parser.run("zt") // = Some(('z', 't'))
parser.run("zx") // = None
val example: Tyre[List[Char]] = tyre"[a-z]||[0-9]*".map:
  case Left(c) => List(c)
  case Right(list) => list
val parser = example.compile()
parser.run("c") // = Some(List('c'))
parser.run("123") // = Some(List('1', '2', '3'))
parser.run("z23") // = None
val t1: Tyre[Char] = tyre"[a-z]|[0-9]"
val t2: Tyre[String] = tyre"(${t1}*)!s ?:\)".map(_(0))
val parser = t2.compile()
parser.run("a12b :)") // = Some("a12b")
parser.run(":)") // = Some("")
parser.run("Ab :)") // = None
TyRE syntax reflects standard regular expressions, albeit with several limitations. Some of them might get eliminated as the library grows. Others, like capture groups, are at odds with TyRE design objectives. Some specific syntax extensions have also been introduced to make writing TyRE expressions more convenient.
token | match |
---|---|
`x` | the character `x` |
`[abc]` | a single character of `a`, `b` or `c` |
`[^abc]` | any character except `a`, `b` and `c` |
`[a-zA-Z]` | a character in the range `a`-`z` or `A`-`Z` (inclusive) |
`[^a-zA-Z]` | a character not in the range `a`-`z` and `A`-`Z` (inclusive) |
`.` | any single character |
`\uhhhh` | the Unicode character with hexadecimal value `hhhh` |
`\w` | any word character (`[a-zA-Z0-9_]`) |
`\W` | any non-word character (`[^a-zA-Z0-9_]`) |
`\s` | any whitespace character (`[ \t\n\r\f\u000B]`) |
`\S` | a non-whitespace character (`[^\s]`) |
`\h` | any horizontal whitespace character (`[ \t\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]`) |
`\H` | not a horizontal whitespace character (`[^\h]`) |
`\v` | any vertical whitespace character (`[\n\r\f\u000B\u0085\u2028\u2029]`) |
`\V` | not a vertical whitespace character (`[^\v]`) |
`\d` | any digit (`[0-9]`) |
`\D` | a non-digit (`[^0-9]`) |
`\t` | the horizontal tab character |
`\n` | the line feed (new line) character |
`\r` | the carriage return character |
`\f` | the form feed (new page) character (`\u000C`) |
`\` | matches nothing by itself; quotes (escapes) the following character if it is a special one, triggers an error otherwise |
`XY` | `X` followed by `Y` (not in `[]`) |
`X\|Y` | `X` or `Y`: union (not in `[]`) |
`X\|\|Y` | either `X` or `Y`: tagged union, see below (not in `[]`) |
`(X)` | group `X` to override precedence (not in `[]`) |
`X*` | `X` zero or more times (not in `[]`) |
`X+` | `X` at least once (one or more times) (not in `[]`) |
`X?` | `X` optionally (zero or one time) (not in `[]`) |
`X!s` | converts `X` to string (not in `[]`) |
Patterns other than the above (e.g. exact quantifiers, intersection with `&&`, hexadecimal or Unicode characters) are not supported (yet).
Regular alternation (`|`) corresponds to a union type, while the tagged one (`||`) corresponds to `Either`.
E.g. `ab|cd` results in `(Char, Char)` while `ab||cd` yields `Either[(Char, Char), (Char, Char)]`; `ab|c` yields `(Char, Char) | Char` while `ab||c` yields `Either[(Char, Char), Char]`.
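The difference between the two result types is a property of Scala 3 itself and can be shown without TyRE. In the sketch below, `Pair`, `describe`, and the sample values are made up for illustration.

```scala
type Pair = (Char, Char)

// A union of identical types is just that type: which branch matched is lost.
val untagged: Pair | Pair = ('a', 'b')
val samePair: Pair = untagged // compiles because Pair | Pair is a subtype of Pair

// Either keeps the tag even when both sides have the same type.
val tagged: Either[Pair, Pair] = Right(('c', 'd'))

// An untagged union of different types is told apart by a type test.
def describe(x: Pair | Char): String = x match
  case c: Char => s"single $c"
  case (a, b)  => s"pair $a$b"
```

This is why `ab|cd` can collapse to a plain `(Char, Char)`, while `ab||cd` still records which alternative matched.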
TyRE always parses the whole input; you can think of it as `^` and `$` from traditional regular expressions being inserted implicitly at the beginning and end of the pattern. Boundary matchers are thus not supported.
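For readers used to `java.util.regex`, the closest analogue of this behaviour is `Matcher.matches()` as opposed to `Matcher.find()`. The comparison below uses only the Java standard library; it is an analogy, not a statement about TyRE internals.

```scala
import java.util.regex.Pattern

val pattern = Pattern.compile("[a-z]t")

// `matches` requires the entire input to match the pattern.
val whole   = pattern.matcher("zt").matches()  // true
val partial = pattern.matcher("zt!").matches() // false: trailing "!" is not consumed

// `find` accepts a match anywhere in the input.
val somewhere = pattern.matcher("zt!").find()  // true
```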
TyRE doesn't support any flags / options (yet). In particular, no case insensitive or Unicode character class matching is supported.
Special characters must be escaped with a backslash, e.g. `\.` matches a dot and `\+` matches a plus. Although different sets of special characters are in use outside and inside brackets, escaping is allowed for all of them in both contexts. Escape sequences are not allowed for non-special characters.
TyRE is more regular in its escaping requirements than traditional regexes: characters always have to be escaped if they have a special meaning in the given context. For example, a dash is not allowed at the beginning or end of a `[]` character class, so `[+-]` won't work; you have to write `[+\-]`.
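For comparison, `java.util.regex` accepts both spellings, so the escaped form stays valid when a pattern is shared between the two. The check below uses only the standard library; the value names are illustrative.

```scala
// Unescaped trailing dash: accepted by java.util.regex as a literal dash.
val lenient = """[+-]"""
// Escaped dash: the spelling the text above requires for TyRE, also valid in Java.
val strict  = """[+\-]"""

val dashLenient = "-".matches(lenient) // true
val dashStrict  = "-".matches(strict)  // true
val plusStrict  = "+".matches(strict)  // true
```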
special character(s) | outside `[]` | inside `[]` | purpose |
---|---|---|---|
`\` | yes | yes | escape (quote) next character (not allowed for non-special characters) or mark predefined character or character class |
`.` | yes | no | match any character |
`-` | no | yes | range |
`^` | yes[^1] | yes | negate (complement) a `[]` character class (allowed only after opening bracket) |
`[]` | yes | yes | character class |
`()` | yes | no | grouping |
`*+?` | yes | no | quantifiers (not allowed directly after conversion operator) |
`!` | yes | no | conversion (allowed only with predefined operators, currently `s`) |
`\|` | yes | no | alternation (including strict alternation `\|\|`) |
[^1]: Although not used by TyRE, `^` is treated as a special character outside brackets to stick with traditional regular expressions.
TyRE operator precedence follows the standard of regular expressions - the order of special characters in the above table mirrors it roughly.
1. Backslash (`\`), the escape operator: it yields a single character while parsing.
2. Range (`-`), negation (`^`), and the character class itself (`[]`): they translate to a single character as well.
3. Grouping (`()`) converts its content to a single token; the operators in the rest of this list are applied to it as a whole.
4. Quantifiers (`*+?`) and conversions (`!`): they are applied to the preceding character, character class, or a whole pattern if it is put in parentheses.
5. Alternation (`|`).
Note that you can put the conversion operator directly after a quantifier but not vice versa: `\w*!s` is legal, but `\w!s*` isn't (although you can write `(\w!s)*`).
Grouping in TyRE doesn't define a capture group. A capture group is deemed unnecessary because the TyRE return type fully mirrors the expression structure.
Complex TyRE patterns result in complex return types. For example, `\w+@[A-Za-z0-9\-]+\.com` yields `((Char, List[Char]), Char, (Char, List[Char]), Char, Char, Char, Char)`. Quite often, only part of that information is needed. One way to simplify the type is to use the `map()` method on `Tyre` while building the pattern gradually. The other is to use the conversion operator.
`!s` converts the preceding token into a string. `\w+!s` yields a `String` instead of `(Char, List[Char])`. If we are interested in the user and domain parts of the above example, we may refactor it into `\w+!s@([A-Za-z0-9\-]+\.com)!s`, which yields `(String, Char, String)`.
`!s` is currently the only conversion available, but additional ones are being considered.
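To make concrete what the conversion saves, the sketch below performs the same flattening by hand on a value of the shape documented for `\w+`. It is plain Scala with no TyRE involved; `OneOrMore` and `asString` are hypothetical names, standing in for the kind of function one would otherwise pass to `map()`.

```scala
// Shape documented above for `\w+`: the first character plus the rest.
type OneOrMore = (Char, List[Char])

// Hypothetical helper: the manual equivalent of the `!s` conversion.
def asString(tree: OneOrMore): String =
  val (head, tail) = tree
  (head :: tail).mkString

val user: OneOrMore = ('b', List('o', 'b'))
val userName = asString(user) // "bob"
```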
TyRE makes use of the following tools, languages and libraries (in alphabetical order):
/C means compile/runtime dependency, /T means test dependency, /D means development tool. Only direct dependencies are listed above.