A command line tool and Python library for converting lists of strings into matching regular expressions (finite automata).
MIT License
Compatible with Python 3.4+
This library and command line tool compresses multiple strings into one regular expression that can be used to find or match these strings later in a larger piece of text.
As simple as pip install w2re
Input strings are: is, in, it, if, the, than
As a library:
from w2re import iterable_to_regexp
iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'])
'(?:i[fnst]|th(?:e|an))'
As a command line tool:
echo -e "is\nin\nit\nif\nthe\nthan" | w2re
(?:i[fnst]|th(?:e|an))
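A quick way to check a generated expression is to run it against the original inputs with the standard re module. The sketch below does not call w2re; it hard-codes the pattern from the example above, and the list of non-input strings is made up for illustration.

```python
import re

# Regular expression copied verbatim from the example output above.
pattern = re.compile(r'(?:i[fnst]|th(?:e|an))')

inputs = ['is', 'in', 'it', 'if', 'the', 'than']
others = ['i', 'th', 'then', 'at']  # hypothetical strings that were not in the input

# fullmatch() succeeds only when the entire string is consumed by the pattern.
matched = [w for w in inputs if pattern.fullmatch(w)]
rejected = [w for w in others if not pattern.fullmatch(w)]

print(matched)   # every input string
print(rejected)  # every non-input string
```

Using fullmatch rather than search keeps the check strict: a string such as then contains the, but it is not one of the inputs.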
Input text is The Zen of Python
Counting words:
from collections import Counter
from re import findall
from requests import get
from w2re import iterable_to_regexp
Counter(
    findall(
        iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than']),
        get('https://raw.githubusercontent.com/python/peps/master/pep-0020.txt').text
    )
).most_common()
[('is', 15), ('it', 12), ('in', 11), ('than', 8), ('the', 7), ('if', 2)]
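The same counting idea can be tried offline on a short string. This is only a sketch: the pattern is copied from the earlier example instead of being generated by w2re, and the sample sentence is invented. It also shows that the plain pattern matches substrings and is case sensitive.

```python
from collections import Counter
from re import findall

# Pattern copied from the example above (not generated at run time).
pattern = r'(?:i[fnst]|th(?:e|an))'

# Hypothetical sample text, chosen to show substring matches.
text = 'This is better than that'

# "is" is counted twice: once inside "This" and once as a separate word.
# "This" and "that" do not contribute "th..." matches ("Th" is upper case,
# and "that" is neither "the" nor "than").
counts = Counter(findall(pattern, text))
print(counts.most_common())  # [('is', 2), ('than', 1)]
```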
This is very useful if you need to search for multiple strings and are not sure how to write the correct regexp (or like me, are lazy and write libraries for it instead).
Terminate your input with EOF (Ctrl+D on an empty line on Linux).
w2re
i am searching for this
and this
and this as well
(?:i\ am\ searching\ for\ this|and\ this(?:\ as\ wel{2})?)
echo 'hahaha' | w2re
(?:ha){3}
This unfortunately does not produce a range yet. E.g. subsubsection, subsection and section will become s(?:ection|ubs(?:ection|ubsection)) rather than the expected (?:sub){0,2}section.
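Although the generated expression is longer, it accepts the same strings as the compact range form. The following sketch compares the two patterns quoted above using only the re module; the extra sample strings are assumptions added for the comparison.

```python
import re

# Expression currently produced for the three section names.
generated = re.compile(r's(?:ection|ubs(?:ection|ubsection))')
# Hand-written range form mentioned as the expected result.
compact = re.compile(r'(?:sub){0,2}section')

# The last two samples are hypothetical strings that neither pattern should accept.
samples = ['section', 'subsection', 'subsubsection', 'subsubsubsection', 'sub']

generated_hits = [s for s in samples if generated.fullmatch(s)]
compact_hits = [s for s in samples if compact.fullmatch(s)]

print(generated_hits == compact_hits)  # True
```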
echo '* test: ...' | w2re
\*\ test\:\ \.{3}
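Because special characters are escaped, the result matches the input literally. A minimal check with the re module, using the pattern from the output above and one made-up counterexample:

```python
import re

# Escaped pattern copied from the command output above.
pattern = re.compile(r'\*\ test\:\ \.{3}')

# The original input matches in full.
literal_match = pattern.fullmatch('* test: ...')

# Hypothetical counterexample: the dots are literal, not wildcards,
# so three arbitrary characters do not match.
wildcard_match = pattern.fullmatch('* test: abc')

print(literal_match is not None)  # True
print(wildcard_match is None)     # True
```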
w2re -i /usr/share/dict/words
head -n 10 /usr/share/dict/words | w2re
A(?:\'s|MD(?:\'s)?|OL(?:\'s)?|WS(?:\'s)?|achen(?:\'s)?)
import w2re
w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'])
'(?:i[fnst]|th(?:e|an))'
import w2re
import io
w2re.stream_to_regexp(io.StringIO('is\nin\nit\nif\nthe\nthan'))
'(?:i[fnst]|th(?:e|an))'
w2re.PythonFormatter
Standard Python formatted regular expression, based on the re module. This is the default formatter for both the command line tool and the library.
import w2re
w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'], w2re.PythonFormatter)
'(?:i[fnst]|th(?:e|an))'
w2re.PythonWordMatchFormatter
Standard Python formatted regular expression, based on the re module. Suitable for matching whole words, rather than strings. Unlike PythonFormatter, it won't match Python in Pythonista.
import w2re
w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'], w2re.PythonWordMatchFormatter)
'(?:\\W+|\\A)((?:i[fnst]|th(?:e|an)))(?=\\W+|\\Z)'
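The difference between the two formatters can be seen by running both expressions over the same text. This sketch hard-codes the two patterns shown in this document (with the doubled backslashes of the repr collapsed into a raw string) and uses an invented sample sentence.

```python
import re

# Output of PythonFormatter for the example word list.
plain = r'(?:i[fnst]|th(?:e|an))'
# Output of PythonWordMatchFormatter for the same list.
whole_word = r'(?:\W+|\A)((?:i[fnst]|th(?:e|an)))(?=\W+|\Z)'

text = 'it is the thesis'  # hypothetical sample text

# The plain pattern also finds "the" and "is" inside "thesis".
plain_hits = re.findall(plain, text)
# The word-match pattern has one capturing group, so findall returns
# only the captured word and skips matches embedded in longer words.
word_hits = re.findall(whole_word, text)

print(plain_hits)  # ['it', 'is', 'the', 'the', 'is']
print(word_hits)   # ['it', 'is', 'the']
```

Note that the word itself is in capturing group 1; with re.search or re.finditer, group(0) would also include any leading non-word characters.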
w2re.BaseFormatter
Base class for implementation of custom formatters. See the w2re.formatters module.