Python library for creating PEG parsers
MIT License
Published by ptmcg almost 4 years ago
API CHANGE
Diagnostic flags have been moved to an enum, pyparsing.Diagnostics, and they are enabled through module-level methods:
pyparsing.enable_diag()
pyparsing.disable_diag()
pyparsing.enable_all_warnings()
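As a minimal sketch of the calls listed above, assuming pyparsing 3.0 or later is installed, a single diagnostic can be toggled by passing its enum member. The member name used here is taken from the 2.4.1 switch list and is assumed to carry over unchanged into the enum.

```python
import pyparsing as pp

# Pick one diagnostic by its enum member (assumed to match the old
# __diag__ switch of the same name).
flag = pp.Diagnostics.warn_multiple_tokens_in_named_alternation

pp.enable_diag(flag)   # turn this one warning on
pp.disable_diag(flag)  # and turn it off again

# To switch on every warning-type diagnostic at once:
# pp.enable_all_warnings()
```

These switches are module-wide settings, so they are usually set once, right after importing pyparsing.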
API CHANGE
Most previous SyntaxWarnings that were issued when using pyparsing classes incorrectly have been converted to TypeError and ValueError exceptions, consistent with Python calling conventions. All warnings issued by diagnostic flags have been converted from SyntaxWarnings to UserWarnings.
To support parsers that are intended to generate native Python collection types such as lists and dicts, the Group and Dict classes now accept an additional boolean keyword argument aslist and asdict respectively. See the jsonParser.py example in the pyparsing/examples source directory for how to return types as ParseResults and as Python collection types, and the distinctions in working with the different types.
In addition, parse actions that must return a value of list type (which would normally be converted internally to a ParseResults) can override this default behavior by returning their list wrapped in the new ParseResults.List class:
from pyparsing import ParseResults

# this parse action tries to return a list, but pyparsing
# will convert to a ParseResults
def return_as_list_but_still_get_parse_results(tokens):
    return tokens.asList()

# this parse action returns the tokens as a list, and pyparsing will
# maintain its list type in the final parsing results
def return_as_list(tokens):
    return ParseResults.List(tokens.asList())
This is the mechanism used internally by the Group class when defined using aslist=True.
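The difference between the default Group result and the aslist=True form can be sketched as follows, assuming pyparsing 3.0 or later. The grammar (a run of integers) and the helper name to_plain_list are illustrative only.

```python
import pyparsing as pp

integer = pp.Word(pp.nums)

# Default Group: the nested item is a ParseResults.
as_results = pp.Group(integer[1, ...]).parse_string("1 2 3")

# aslist=True: the nested item is a plain Python list.
as_list = pp.Group(integer[1, ...], aslist=True).parse_string("1 2 3")

# A parse action can get the same effect by wrapping its return value
# in ParseResults.List (to_plain_list is a hypothetical helper name).
def to_plain_list(tokens):
    return pp.ParseResults.List(tokens.as_list())

wrapped = integer[1, ...].add_parse_action(to_plain_list).parse_string("1 2 3")

print(type(as_results[0]).__name__)
print(type(as_list[0]).__name__)
print(type(wrapped[0]).__name__)
```

In each case the outer result is still a ParseResults; only the nested item changes type.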
A new IndentedBlock class is introduced, to eventually replace the current indentedBlock helper method. The interface is largely the same; however, the new class manages its own internal indentation stack, so it is no longer necessary to maintain an external indentStack variable.
API CHANGE
Added cache_hit keyword argument to debug actions. Previously, if packrat parsing was enabled, the debug methods were not called in the event of cache hits. Now these methods will be called, with an added argument cache_hit=True.
If you are using packrat parsing and enable debug on expressions using a custom debug method, you can add the cache_hit=False keyword argument, and your method will be called on packrat cache hits. If you choose not to add this keyword argument, the debug methods will fail silently, behaving as they did previously.
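A rough sketch of custom debug actions that accept the new argument, assuming pyparsing 3.0 or later. The function names, the recorded tuples and the sample grammar are illustrative. The positional parameters follow the start, success and exception debug action signatures as I understand them, so treat them as an assumption to check against the API documentation.

```python
import pyparsing as pp

# Packrat parsing is a module-wide setting.
pp.ParserElement.enable_packrat()

calls = []  # (kind, location, cache_hit) for every debug callback

def on_try(instring, loc, expr, cache_hit=False):
    calls.append(("try", loc, cache_hit))

def on_match(instring, startloc, endloc, expr, toks, cache_hit=False):
    calls.append(("match", startloc, cache_hit))

def on_fail(instring, loc, expr, exc, cache_hit=False):
    calls.append(("fail", loc, cache_hit))

integer = pp.Word(pp.nums).set_name("integer")
integer.set_debug_actions(on_try, on_match, on_fail)

# Both alternatives start with integer at the same location, so the
# second attempt is a candidate for a packrat cache hit.
expr = (integer + "a") | (integer + "b")
result = expr.parse_string("1 b")

for entry in calls:
    print(entry)
```

Giving cache_hit a default of False keeps the same functions usable whether or not the argument is supplied.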
When using setDebug with packrat parsing enabled, packrat cache hits will now be included in the output, shown with a leading '*'. (Previously, cache hits and responses were not included in debug output.) For those using custom debug actions, see the previous item regarding an optional API change for those methods.
setDebug output will also show more details about what expression is about to be parsed (the current line of text being parsed, and the current parse position):
Match integer at loc 0(1,1)
  1 2 3
  ^
Matched integer -> ['1']
The current debug location will also be indicated after whitespace has been skipped (was previously inconsistent, reported in Issue #244, by Frank Goyens, thanks!).
Modified the repr() output for ParseResults to include the class name as part of the output. This is to clarify for new pyparsing users who misread the repr output as a tuple of a list and a dict. pyparsing results will now read like:
ParseResults(['abc', 'def'], {'qty': 100})
instead of just:
(['abc', 'def'], {'qty': 100})
Fixed bugs in Each when passed OneOrMore or ZeroOrMore expressions:
. first expression match could be enclosed in an extra nesting level
. out-of-order expressions now handled correctly if mixed with required expressions
. results names are maintained correctly for these expressions
Fixed traceback trimming, and added ParserElement.verbose_traceback save/restore to reset_pyparsing_context().
Default string for Word expressions now also includes indications of min and max length specification, if applicable, similar to regex length specifications:
Word(alphas) -> "W:(A-Za-z)"
Word(nums) -> "W:(0-9)"
Word(nums, exact=3) -> "W:(0-9){3}"
Word(nums, min=2) -> "W:(0-9){2,...}"
Word(nums, max=3) -> "W:(0-9){1,3}"
Word(nums, min=2, max=3) -> "W:(0-9){2,3}"
For expressions of the Char class (similar to Word(..., exact=1)), the expression is simply the character range in parentheses:
Char(nums) -> "(0-9)"
Char(alphas) -> "(A-Za-z)"
Removed copy() override in Keyword class which did not preserve definition of ident chars from the original expression. PR #233 submitted by jgrey4296, thanks!
In addition to pyparsing.__version__, there is now also a pyparsing.__version_info__, following the same structure and field names as in sys.version_info.
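One use for the structured version is a feature check, sketched here assuming pyparsing 3.0 or later is installed. The variable name has_indented_block is illustrative.

```python
import pyparsing as pp

info = pp.__version_info__

# Field access mirrors sys.version_info: major, minor, micro, ...
print(info.major, info.minor, info.micro)

# Tuple comparison avoids parsing the __version__ string by hand.
has_indented_block = (info.major, info.minor) >= (3, 0)
print(has_indented_block)
```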
Published by ptmcg over 4 years ago
Summary of changes for 3.0.0 can be found in "What's New in Pyparsing 3.0.0" documentation.
API CHANGE
Changed result returned when parsing using countedArray, the array items are no longer returned in a doubly-nested list.
An excellent new enhancement is the new railroad diagram generator for documenting pyparsing parsers:
import pyparsing as pp
from pyparsing.diagram import to_railroad, railroad_to_html
from pathlib import Path
# define a simple grammar for parsing street addresses such
# as "123 Main Street"
# number word...
number = pp.Word(pp.nums).setName("number")
name = pp.Word(pp.alphas).setName("word")[1, ...]
parser = number("house_number") + name("street")
parser.setName("street address")
# construct railroad track diagram for this parser and
# save as HTML
rr = to_railroad(parser)
Path('parser_rr_diag.html').write_text(railroad_to_html(rr))
Very nice work provided by Michael Milton, thanks a ton!
Enhanced default strings created for Word expressions, now showing string ranges if possible. Word(alphas) would formerly print as W:(ABCD...), now prints as W:(A-Za-z).
Added ignoreWhitespace(recurse:bool = True) and added a recurse argument to leaveWhitespace, both added to provide finer control over pyparsing's whitespace skipping. Also contributed by Michael Milton.
The unicode range definitions for the various languages were recalculated by interrogating the unicodedata module by character name, selecting characters that contained that language in their Unicode name. (Issue #227)
Also, pyparsing_unicode.Korean was renamed to Hangul (Korean is also defined as a synonym for compatibility).
Enhanced ParseResults dump() to show both results names and list subitems. Fixes bug where adding a results name would hide lower-level structures in the ParseResults.
Added new __diag__ warnings:
"warn_on_parse_using_empty_Forward" - warns that a Forward has been included in a grammar, but no expression was attached to it using '<<=' or '<<'
"warn_on_assignment_to_Forward" - warns that a Forward has been created, but was probably later overwritten by erroneously using '=' instead of '<<=' (this is a common mistake when using Forwards) (currently not working on PyPy)
Added ParserElement.recurse() method to make it simpler for grammar utilities to navigate through the tree of expressions in a pyparsing grammar.
Fixed bug in ParseResults repr() which showed all matching entries for a results name, even if listAllMatches was set to False when creating the ParseResults originally. Reported by Nicholas42 on GitHub, good catch! (Issue #205)
Modified refactored modules to use relative imports, as pointed out by setuptools project member jaraco, thank you!
Off-by-one bug found in the roman_numerals.py example, a bug that has been there for about 14 years! PR submitted by Jay Pedersen, nice catch!
A simplified Lua parser has been added to the examples (lua_parser.py).
Added make_diagram.py to the examples directory to demonstrate creation of railroad diagrams for selected pyparsing examples. Also restructured some examples to make their parsers importable without running their embedded tests.
Published by ptmcg over 4 years ago
Published by ptmcg almost 5 years ago
Fixed typos in White mapping of whitespace characters, to use
correct "\u" prefix instead of "u".
Fix bug in left-associative ternary operators defined using
infixNotation. First reported on StackOverflow by user Jeronimo.
Backport of pyparsing_test namespace from 3.0.0, including
TestParseResultsAsserts mixin class defining unittest-helper
methods:
. def assertParseResultsEquals(
self, result, expected_list=None, expected_dict=None, msg=None)
. def assertParseAndCheckList(
self, expr, test_string, expected_list, msg=None, verbose=True)
. def assertParseAndCheckDict(
self, expr, test_string, expected_dict, msg=None, verbose=True)
. def assertRunTestResults(
self, run_tests_report, expected_parse_results=None, msg=None)
. def assertRaisesParseException(self, exc_type=ParseException, msg=None)
To use the methods in this mixin class, declare your unittest classes as:
import unittest
from pyparsing import pyparsing_test as ppt

class MyParserTest(ppt.TestParseResultsAsserts, unittest.TestCase):
    ...
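A small test case built on the mixin might look like the sketch below, assuming a pyparsing version that ships pyparsing_test (2.4.6 or later). The class name, test names and grammar are illustrative.

```python
import unittest

import pyparsing as pp
from pyparsing import pyparsing_test as ppt


class IntegerListTest(ppt.TestParseResultsAsserts, unittest.TestCase):
    def test_parses_integers(self):
        integers = pp.Word(pp.nums)[1, ...]
        # Parses the string and compares the resulting token list.
        self.assertParseAndCheckList(
            integers, "1 2 3", ["1", "2", "3"], verbose=False
        )

    def test_rejects_letters(self):
        integers = pp.Word(pp.nums)[1, ...]
        # Context manager form: the body must raise ParseException.
        with self.assertRaisesParseException():
            integers.parseString("abc")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(IntegerListTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The mixin is listed before unittest.TestCase so that its assert helpers are found first in the method resolution order.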
Published by ptmcg almost 5 years ago
Published by ptmcg almost 5 years ago
Check-in bug in Pyparsing 2.4.3 that raised UserWarnings was masked by stdout buffering in unit tests - fixed.
Published by ptmcg almost 5 years ago
(Backport of selected critical items from 3.0.0 development branch.)
Fixed a bug in ParserElement.__eq__ that would for some parsers create a recursion error at parser definition time. Thanks to Michael Clerx for the assist. (Addresses issue #123)
Fixed bug in indentedBlock where a block that ended at the end of the input string could cause pyparsing to loop forever. Raised as part of discussion on StackOverflow with geckos.
Backports from pyparsing 3.0.0:
. __diag__.enable_all_warnings()
. Fixed bug in PrecededBy which caused infinite recursion, issue #127
. support for using regex-compiled RE to construct Regex expressions
Published by ptmcg about 5 years ago
Updated the shorthand notation that has been added for repetition
expressions: expr[min, max], with '...' valid as a min or max value:
Better interpretation of [...] as ZeroOrMore raised by crowsonkb,
thanks for keeping me in line!
If upgrading from 2.4.1 or 2.4.1.1 and you have used expr[...] for OneOrMore(expr), it must be updated to expr[1, ...].
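The practical difference shows up on empty input. This is a minimal sketch, assuming pyparsing 2.4.2 or later, with illustrative variable names.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

zero_or_more = word[...]    # ZeroOrMore(word) as of 2.4.2
one_or_more = word[1, ...]  # OneOrMore(word)

# ZeroOrMore accepts an empty string and returns no tokens.
empty = zero_or_more.parseString("")

# OneOrMore needs at least one match, so the same input fails.
try:
    one_or_more.parseString("")
    one_or_more_matched_empty = True
except pp.ParseException:
    one_or_more_matched_empty = False

print(empty.asList(), one_or_more_matched_empty)
```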
The defaults on all the __diag__
switches have been set to False,
to avoid getting alarming warnings. To use these diagnostics, set
them to True after importing pyparsing.
Example:
import pyparsing as pp
pp.__diag__.warn_multiple_tokens_in_named_alternation = True
Fixed bug introduced by the use of getitem for repetition,
overlooking Python's legacy implementation of iteration
by sequentially calling getitem with increasing numbers until
getting an IndexError. Found during investigation of problem
reported by murlock, merci!
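The legacy iteration protocol behind this bug can be reproduced without pyparsing. In the sketch below the class names are hypothetical, and setting __iter__ to None is only one way to opt out of iteration, not necessarily the fix pyparsing applied.

```python
class Countdown:
    """Defines __getitem__ only, yet Python treats it as iterable."""

    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)
        return 3 - index


class NotIterable(Countdown):
    # Assigning None to __iter__ tells Python the class is not iterable,
    # which disables the __getitem__ fallback.
    __iter__ = None


# Python calls __getitem__(0), __getitem__(1), ... until IndexError.
items = list(Countdown())

try:
    iter(NotIterable())
    still_iterable = True
except TypeError:
    still_iterable = False

print(items, still_iterable)
```

A class that uses __getitem__ for something other than indexing, as the repetition notation does, can therefore end up in for loops or list() calls by accident.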
Published by ptmcg about 5 years ago
This is a re-release of version 2.4.1 to restore the release history
in PyPI, since the 2.4.1 release was deleted.
There are 3 known issues in this release, which are fixed in
the upcoming 2.4.2:
API change adding support for expr[...] - the original
code in 2.4.1 incorrectly implemented this as OneOrMore.
Code using this feature under this release should explicitly
use expr[0, ...] for ZeroOrMore and expr[1, ...] for
OneOrMore. In 2.4.2 you will be able to write expr[...]
equivalent to ZeroOrMore(expr).
Bug if composing And, Or, MatchFirst, or Each expressions
using a single expression. This only affects code which uses
explicit expression construction using the And, Or, etc.
classes instead of using overloaded operators '+', '^', and
so on. If constructing an And using a single expression,
you may get an error that "cannot multiply ParserElement by
0 or (0, 0)" or a Python IndexError. Change code like
cmd = Or(Word(alphas))
to
cmd = Or([Word(alphas)])
(Note that this is not the recommended style for constructing
Or expressions.)
Some newly-added __diag__ switches are enabled by default,
which may give rise to noisy user warnings for existing parsers.
You can disable them using:
import pyparsing as pp
pp.__diag__.warn_multiple_tokens_in_named_alternation = False
pp.__diag__.warn_ungrouped_named_tokens_in_collection = False
pp.__diag__.warn_name_set_on_empty_Forward = False
pp.__diag__.warn_on_multiple_string_args_to_oneof = False
pp.__diag__.enable_debug_on_named_expressions = False
In 2.4.2 these will all be set to False by default.
Published by ptmcg about 5 years ago
Release candidate for 2.4.2:
. __getitem__-induced iterability for ParserElement class
. __diag__ flags are now all False by default
Published by ptmcg about 5 years ago
For a minor point release, this release contains many new features!
A new shorthand notation has been added for repetition expressions: expr[min, max], with ... valid as a min or max value:
. expr[...] is equivalent to OneOrMore(expr)
. expr[0, ...] is equivalent to ZeroOrMore(expr)
. expr[1, ...] is equivalent to OneOrMore(expr)
. expr[n, ...] or expr[n,] is equivalent to expr*n + ZeroOrMore(expr) (read as "n or more instances of expr")
. expr[..., n] is equivalent to expr*(0, n)
. expr[m, n] is equivalent to expr*(m, n)
expr[..., n] and expr[m, n] do not raise an exception if more than n exprs exist in the input stream. If this behavior is desired, then write expr[..., n] + ~expr.
... can also be used as shorthand for SkipTo when used in adding parse expressions to compose an And expression.
Literal('start') + ... + Literal('end')
And(['start', ..., 'end'])
are both equivalent to:
Literal('start') + SkipTo('end')("_skipped*") + Literal('end')
The ... form has the added benefit of not requiring repeating the skip target expression. Note that the skipped text is returned with '_skipped' as a results name, and that the contents of _skipped will contain a list of text from all ...s in the expression.
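A short sketch of the ... form in use, assuming pyparsing 2.4.1 or later. The input string is illustrative, and the skipped text may carry trailing whitespace, so it is stripped before printing.

```python
import pyparsing as pp

# "..." stands in for SkipTo(<the expression that follows it>).
expr = pp.Literal("start") + ... + pp.Literal("end")

result = expr.parseString("start 123 456 end")

# The skipped text appears in the token list and, as a list, under
# the '_skipped' results name.
skipped = [text.strip() for text in result["_skipped"]]

print(result.asList())
print(skipped)
```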
... can also be used as a "skip forward in case of error" expression:
expr = "start" + (Word(nums).setName("int") | ...) + "end"
expr.parseString("start 456 end")
['start', '456', 'end']
expr.parseString("start 456 foo 789 end")
['start', '456', 'foo 789 ', 'end']
- _skipped: ['foo 789 ']
expr.parseString("start foo end")
['start', 'foo ', 'end']
- _skipped: ['foo ']
expr.parseString("start end")
['start', '', 'end']
- _skipped: ['missing <int>']
Note that in all the error cases, the '_skipped' results name is present, showing a list of the extra or missing items.
This form is only valid when used with the '|' operator.
Improved exception messages to show what was actually found, not just what was expected.
word = pp.Word(pp.alphas)
pp.OneOrMore(word).parseString("aaa bbb 123", parseAll=True)
Former exception message:
pyparsing.ParseException: Expected end of text (at char 8), (line:1, col:9)
New exception message:
pyparsing.ParseException: Expected end of text, found '1' (at char 8), (line:1, col:9)
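The same failure can be caught and inspected in code, as in this sketch assuming pyparsing 2.4.1 or later. The exact "found ..." text may differ between versions, so only the location attributes are relied on here.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

message = None
location = None
try:
    pp.OneOrMore(word).parseString("aaa bbb 123", parseAll=True)
except pp.ParseException as err:
    message = str(err)                # human-readable description
    location = (err.lineno, err.col)  # 1-based line and column

print(message)
print(location)
```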
Added diagnostic switches to help detect and warn about common parser construction mistakes, or enable additional parse debugging. Switches are attached to the pyparsing.__diag__ namespace object:
. warn_multiple_tokens_in_named_alternation - flag to enable warnings when a results name is defined on a MatchFirst or Or expression with one or more And subexpressions (default=True)
. warn_ungrouped_named_tokens_in_collection - flag to enable warnings when a results name is defined on a containing expression with ungrouped subexpressions that also have results names (default=True)
. warn_name_set_on_empty_Forward - flag to enable warnings when a Forward is defined with a results name, but has no contents defined (default=False)
. warn_on_multiple_string_args_to_oneof - flag to enable warnings when oneOf is incorrectly called with multiple str arguments (default=True)
. enable_debug_on_named_expressions - flag to auto-enable debug on all subsequent calls to ParserElement.setName() (default=False)
warn_multiple_tokens_in_named_alternation is intended to help those who currently have set __compat__.collect_all_And_tokens to False as a workaround for using the pre-2.3.1 code with named MatchFirst or Or expressions containing an And expression.
Added ParseResults.from_dict classmethod, to simplify creation of a ParseResults with results names using a dict, which may be nested. This makes it easy to add a sub-level of named items to the parsed tokens in a parse action.
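A minimal sketch of building named, nested results from a dict, assuming pyparsing 2.4.1 or later. The keys and values are illustrative.

```python
import pyparsing as pp

result = pp.ParseResults.from_dict(
    {"name": "widget", "size": {"width": 3, "height": 4}}
)

# Top-level and nested keys become results names.
print(result["name"])
print(result["size"]["width"], result["size"]["height"])
```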
Added asKeyword argument (default=False) to oneOf, to force keyword-style matching on the generated expressions.
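The effect of asKeyword shows up when an alternative is a prefix of a longer word. This is a sketch assuming pyparsing 2.4.1 or later, with illustrative word lists and inputs.

```python
import pyparsing as pp

plain = pp.oneOf("in if")
keyword = pp.oneOf("in if", asKeyword=True)

# Literal-style matching accepts the leading "in" of "into".
plain_result = plain.parseString("into")

# Keyword-style matching requires a word break after the match.
try:
    keyword.parseString("into")
    keyword_matched = True
except pp.ParseException:
    keyword_matched = False

print(plain_result.asList(), keyword_matched)
```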
ParserElement.runTests now accepts an optional 'file' argument to redirect test output to a file-like object (such as a StringIO, or opened file). Default is to write to sys.stdout.
conditionAsParseAction is a helper method for constructing a parse action method from a predicate function that simply returns a boolean result. Useful for those places where a predicate cannot be added using addCondition, but must be converted to a parse action (such as in infixNotation). May be used as a decorator if default message and exception types can be used. See ParserElement.addCondition for more details about the expected signature and behavior for predicate condition methods.
While investigating issue #93, I found that Or and addCondition could interact to select an alternative that is not the longest match. This is because Or first checks all alternatives for matches without running attached parse actions or conditions, orders by longest match, and then rechecks for matches with conditions and parse actions. Some expressions, when checking with conditions, may end up matching on a shorter token list than originally matched, but would be selected because of its original priority. This matching code has been expanded to do more extensive searching for matches when a second-pass check matches a smaller list than in the first pass.
Fixed issue #87, a regression in indented block. Reported by Renz Bagaporo, who submitted a very nice repro example, which makes the bug-fixing process a lot easier, thanks!
Fixed MemoryError issue #85 and #91 with str generation for Forwards. Thanks decalage2 and Harmon758 for your patience.
Modified setParseAction to accept None as an argument, indicating that all previously-defined parse actions for the expression should be cleared.
Modified pyparsing_common.real and sci_real to parse reals without leading integer digits before the decimal point, consistent with Python real number formats. Original PR #98 submitted by ansobolev.
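A quick check of the relaxed number formats, sketched assuming pyparsing 2.4.1 or later. The inputs are illustrative.

```python
import pyparsing as pp

ppc = pp.pyparsing_common

# No digit is needed before the decimal point, as in Python literals.
value = ppc.real.parseString(".5")[0]
sci_value = ppc.sci_real.parseString(".5e1")[0]

print(value, sci_value)
```

Both expressions convert their match to float, so the results can be compared directly with Python numbers.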
Modified runTests to call postParse function before dumping out the parsed results - allows for postParse to add further results, such as indications of additional validation success/failure.
Updated statemachine example: refactored state transitions to use overridden classmethods; added <statename>Mixin class to simplify definition of application classes that "own" the state object and delegate to it to model state-specific properties and behavior.
Added example nested_markup.py, showing a simple wiki markup with nested markup directives, and illustrating the use of ... for skipping over input to match the next expression. (This example uses syntax that is not valid under Python 2.)
Rewrote delta_time.py example (renamed from deltaTime.py) to fix some omitted formats and upgrade to latest pyparsing idioms, beginning with writing an actual BNF.
With the help and encouragement from several contributors, including Matej Cepl and Cengiz Kaygusuz, I've started cleaning up the internal coding styles in core pyparsing, bringing it up to modern coding practices from pyparsing's early development days dating back to 2003. Whitespace has been largely standardized along PEP8 guidelines, removing extra spaces around parentheses, and adding them around arithmetic operators and after colons and commas. I was going to hold off on doing this work until after 2.4.1, but after cleaning up a few trial classes, the difference was so significant that I continued on to the rest of the core code base. This should facilitate future work and submitted PRs, allowing them to focus on substantive code changes, and not get sidetracked by whitespace issues.
NOTE: Deprecated functions and features that will be dropped in pyparsing 2.5.0 (planned next release):
. support for Python 2 - ongoing users running with Python 2 can continue to use pyparsing 2.4.1
. ParseResults.asXML() - if used for debugging, switch to using ParseResults.dump(); if used for data transfer, use ParseResults.asDict() to convert to a nested Python dict, which can then be converted to XML or JSON or other transfer format
. operatorPrecedence synonym for infixNotation - convert to calling infixNotation
. commaSeparatedList - convert to using pyparsing_common.comma_separated_list
. upcaseTokens and downcaseTokens - convert to using pyparsing_common.upcaseTokens and downcaseTokens
. __compat__.collect_all_And_tokens will not be settable to False to revert to pre-2.3.1 results name behavior - review use of names for MatchFirst and Or expressions containing And expressions, as they will return the complete list of parsed tokens, not just the first one. Use __diag__.warn_multiple_tokens_in_named_alternation to help identify those expressions in your parsers that will have changed as a result.
Published by ptmcg over 5 years ago
Well, it looks like the API change that was introduced in 2.3.1 was more drastic than expected, so for a friendlier forward upgrade path, this release:
. Bumps the current version number to 2.4.0, to reflect this incompatible change.
. Adds a pyparsing.__compat__ object for specifying compatibility with future breaking changes.
. Conditionalizes the API-breaking behavior, based on the value pyparsing.__compat__.collect_all_And_tokens. By default, this value will be set to True, reflecting the new bugfixed behavior. To set this value to False, add to your code:
import pyparsing
pyparsing.__compat__.collect_all_And_tokens = False
. User code that is dependent on the pre-bugfix behavior can restore it by setting this value to False.
In 2.5 and later versions, the conditional code will be removed and setting the flag to True or False in these later versions will have no effect.
Updated unitTests.py and simple_unit_tests.py to be compatible with python setup.py test. To run tests using setup, do:
python setup.py test
python setup.py test -s unitTests.suite
python setup.py test -s simple_unit_tests.suite
Prompted by issue #83 and PR submitted by bdragon28, thanks.
Fixed bug in ParserElement.runTests handling '\n' literals in quoted strings.
Added tag_body attribute to the start tag expressions generated by makeHTMLTags, so that you can avoid using SkipTo to roll your own tag body expression:
a, aEnd = pp.makeHTMLTags('a')
link = a + a.tag_body("displayed_text") + aEnd
# html_page is assumed to hold the HTML source text to search
for t in link.searchString(html_page):
    print(t.displayed_text, '->', t.startA.href)
indentedBlock failure handling was improved; PR submitted by TMiguelT, thanks!
Addressed Py2 incompatibility in simple_unit_tests, plus explain() and Forward str() cleanup; PRs graciously provided by eswald.
Fixed docstring with embedded '\w', which creates SyntaxWarnings in Py3.8, issue #80.
Examples:
Added example parser for rosettacode.org tutorial compiler.
Added example to show how an HTML table can be parsed into a collection of Python lists or dicts, one per row.
Updated SimpleSQL.py example to handle nested selects, reworked 'where' expression to use infixNotation.
Added include_preprocessor.py, similar to macroExpander.py.
Examples using makeHTMLTags use new tag_body expression when retrieving a tag's body text.
Updated examples that are runnable as unit tests:
python setup.py test -s examples.antlr_grammar_tests
python setup.py test -s examples.test_bibparse
Published by ptmcg almost 6 years ago
New features in Pyparsing 2.3.1 -
ParseException.explain() method, to convert a raw Python traceback into a list of the parse expressions leading up to a parse mismatch.
New unicode sets Latin-A and Latin-B, and the ability to define custom sets using multiple inheritance.
class Turkish_set(pp.pyparsing_unicode.Latin1, pp.pyparsing_unicode.LatinA):
    pass

turkish_word = pp.Word(Turkish_set.alphas)
State machine examples, showing how to extend Python with your own pyparsing-enabled syntax. The examples implement a 'statemachine' keyword to define a set of classes and transition attribute to implement a State pattern:
statemachine TrafficLightState:
Red -> Green
Green -> Yellow
Yellow -> Red
Transitions can be named also:
statemachine LibraryBookState:
New -(shelve)-> Available
Available -(reserve)-> OnHold
OnHold -(release)-> Available
Available -(checkout)-> CheckedOut
CheckedOut -(checkin)-> Available
Example parser for decaf language. This language is commonly used in university CS compiler classes.
Fixup of docstrings to Sphinx format, so pyparsing docs are now available on readthedocs.io! (https://pyparsing-docs.readthedocs.io/en/latest/)
Published by ptmcg almost 6 years ago
POSSIBLE API CHANGES:
. IndexErrors raised in parse actions are now wrapped in ParseExceptions
. ParseResults have had several bugfixes which remove erroneous nesting levels
New classes:
. PrecededBy - lookbehind match
. Char - single character match (similar to Word(exact=1))
Published by ptmcg about 6 years ago
Fixed bug in SkipTo, if a SkipTo expression that was skipping to
an expression that returned a list (such as an And), and the
SkipTo was saved as a named result, the named result could be
saved as a ParseResults - should always be saved as a string.
Issue #28, reported by seron.
Added simple_unit_tests.py, as a collection of easy-to-follow unit
tests for various classes and features of the pyparsing library.
Primary intent is more to be instructional than actually rigorous
testing. Complex tests can still be added in the unitTests.py file.
New features added to the Regex class:
optional asGroupList parameter, returns all the capture groups as
a list
optional asMatch parameter, returns the raw re.match result
new sub(repl) method, which adds a parse action calling
re.sub(pattern, repl, parsed_result). Simplifies creating
Regex expressions to be used with transformString. Like re.sub,
repl may be an ordinary string (similar to using pyparsing's
replaceWith), or may contain references to capture groups by group
number, or may be a callable that takes an re match group and
returns a string.
For instance:
expr = pp.Regex(r"([Hh]\d):\s*(.*)").sub(r"<\1>\2</\1>")
expr.transformString("h1: This is the title")
will return
<h1>This is the title</h1>
Fixed omission of LICENSE file in source tarball, also added
CODE_OF_CONDUCT.md per GitHub community standards.
Issue #31
Published by ptmcg about 6 years ago