Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.
This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include:
This module is primarily intended as a building block for other packages that process Wiktionary or Wikipedia data, particularly for data extraction. You will need to write code to use it.
For pre-existing extraction modules that use this package, please see:
Install from source:
git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
This package includes tests written using the unittest framework. The test dependencies can be installed with the command:
python -m pip install -e .[dev]
To run the tests, use the following command in the top-level directory:
make test
To run a specific test, use the following syntax:
python -m unittest tests.test_[module].[Module]Tests.test_[name]
Python's unittest framework help and options can be accessed through:
python -m unittest -h
This package is primarily intended for processing Wiktionary and Wikipedia dump files (though you can also use it for processing individual pages or other files that are in wikitext format). To download WikiMedia dump files, go to the dump download page. We recommend using the <name>-<date>-pages-articles.xml.bz2 files.
Usage example:
from functools import partial
from typing import Any
from wikitextprocessor import Wtp, WikiNode, NodeKind, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(wtp: Wtp, page: Page) -> Any:
    wtp.start_page(page.title)
    # process parse tree
    tree = wtp.parse(page.body)
    # or get expanded plain text
    text = wtp.expand(page.body)

wtp = Wtp(
    db_path="en_20230801.db", lang_code="en", project="wiktionary"
)
# extract dump file then save pages to SQLite file
process_dump(
    wtp,
    "enwiktionary-20230801-pages-articles.xml.bz2",
    {0, 10, 110, 828},  # namespace ids, listed at the start of the dump file
)
# run the handler on every page in namespace 0
for _ in map(
    partial(page_handler, wtp), wtp.get_all_pages([0])
):
    pass
The basic operation is as follows:
Most of the functionality is hidden behind the Wtp object. WikiNode objects are used for representing the parse tree that is returned by the Wtp.parse() function. NodeKind is an enumeration type used to encode the type of a WikiNode.
def __init__(
    self,
    db_path: Optional[Union[str, Path]] = None,
    lang_code="en",
    template_override_funcs: Dict[str, Callable[[Sequence[str]], str]] = {},
    project: str = "wiktionary",
):
The initializer can usually be called without arguments, but recognizes the following arguments:
db_path - can be None, in which case a temporary database file [...] /tmp, or a path for the database file which contains [...] /tmp (3.4G for English dump file), [...] Wtp.process() but instead use Wtp.reprocess() or just call Wtp.expand() or Wtp.parse() on [...] Wtp.process() [...]
lang_code - the language code of the dump file.
template_override_funcs - Python functions for overriding expanded template text.
project - "wiktionary" or "wikipedia".
def read_by_title(
    self, title: str, namespace_id: Optional[int] = None
) -> Optional[str]:
Reads the contents of the page with the specified title from the cache file. There is usually no need to call this function explicitly, as Wtp.process() and Wtp.reprocess() normally load the page automatically. This function does not automatically call Wtp.start_page().
Arguments are:
title - the title of the page to read
namespace_id - namespace id number; this argument is required if title doesn't have a namespace prefix like Template:.
This returns the page contents as a string, or None if the page does not exist.
def parse(
    self,
    text: str,
    pre_expand=False,
    expand_all=False,
    additional_expand=None,
    do_not_pre_expand=None,
    template_fn=None,
    post_template_fn=None,
) -> WikiNode:
Parses wikitext into a parse tree (WikiNode), optionally expanding some or all of the templates and Lua macros in the wikitext (using the definitions for the templates and macros in the cache files, as added by Wtp.process() or calls to Wtp.add_page()).
The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates, Lua macros, and error messages). The Wtp.process() and Wtp.reprocess() functions will call it automatically.
This accepts the following arguments:
text (str) - the wikitext to be parsed
pre_expand (boolean) - if set to True, the templates that were [...]
expand_all - if set to True, expands all templates and Lua [...]
additional_expand (set or None) - if this argument is provided, it [...] pre_expand is True or just these [...] expand_all is set to True).
This returns the parse tree. See below for documentation of the WikiNode class used for representing the parse tree.
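As an illustration of working with the returned tree, the following is a minimal sketch of a depth-first walk that collects plain text from children. It assumes only that children holds a mix of strings and nested nodes. The NodeKind and WikiNode classes here are simplified stand-ins defined for the example (so that it runs without the package installed), not the real classes, and iter_text is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Iterator, List, Union


class NodeKind(Enum):
    # Stand-in for the package's NodeKind; only a few members are mirrored.
    ROOT = auto()
    LEVEL2 = auto()
    BOLD = auto()


@dataclass
class WikiNode:
    # Simplified stand-in for the package's WikiNode (assumption: children
    # is a list mixing plain strings and nested nodes).
    kind: NodeKind
    children: List[Union["WikiNode", str]] = field(default_factory=list)


def iter_text(node: Union[WikiNode, str]) -> Iterator[str]:
    """Yield every plain-text fragment at or below node, depth first."""
    if isinstance(node, str):
        yield node
        return
    for child in node.children:
        yield from iter_text(child)


tree = WikiNode(
    NodeKind.ROOT,
    ["Intro ", WikiNode(NodeKind.BOLD, ["bold"]), " text"],
)
print("".join(iter_text(tree)))
```

Note that this walk only visits children; content stored elsewhere on a node (for example a section title kept in args) would need separate handling.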
def node_to_wikitext(self, node)
Converts a part of a parse tree back to wikitext.
node (WikiNode, str, list/tuple of these) - This is the part of the [...] node.children can be used directly as [...]
def expand(self, text, template_fn=None, post_template_fn=None,
           pre_expand=False, templates_to_expand=None,
           expand_parserfns=True, expand_invoke=True)
Expands the selected templates, parser functions and Lua macros in the given Wikitext. This can selectively expand some or all templates. This can also capture the arguments and/or the expansion of any template as well as substitute custom expansions instead of the default expansions.
The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates and Lua macros). The Wtp.process() and Wtp.reprocess() functions will call it automatically. The page title is also used in error messages.
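To make the capture-and-substitute idea concrete, here is a minimal sketch of a template_fn callback. It assumes the callback is called as template_fn(name, args), that returning None leaves the template to be expanded normally, and that returning a string (here the empty string) replaces the expansion. The key layout of args (integers for positional arguments, strings for named ones) is an assumption, and the template names and the SUPPRESSED set are hypothetical.

```python
from typing import Dict, List, Optional, Tuple, Union

ArgKey = Union[int, str]  # assumed: positional args by int, named args by str

# Every template call seen during expansion is recorded here.
captured: List[Tuple[str, Dict[ArgKey, str]]] = []

# Hypothetical set of templates whose output should be dropped entirely.
SUPPRESSED = {"rfe"}


def template_fn(name: str, args: Dict[ArgKey, str]) -> Optional[str]:
    """Record the call; suppress selected templates, expand the rest normally."""
    captured.append((name, dict(args)))
    if name in SUPPRESSED:
        return ""  # substitute an empty expansion
    return None  # fall back to the default expansion


# Direct calls, standing in for what the expander would do:
print(template_fn("rfe", {1: "en"}))
print(template_fn("m", {1: "en", 2: "word"}))
print(captured)
```

With the package installed, such a callback would be passed as wtp.expand(text, template_fn=template_fn).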
The arguments are as follows:
text (str) - the wikitext to be expanded
template_fn (function) - if set, this will be called as template_fn(name, args), where name (str) is the name of the [...] args is a dictionary containing arguments to the [...] None to cause the template to be expanded in [...] "" (empty string) to [...]
post_template_fn (function) - if set, this will be called post_template_fn(name, ht, expansion) after the template has [...] None to use the [...]
pre_expand (boolean) - if set to True, all templates that were [...]
templates_to_expand (None or set or dictionary) - if this is set, [...] None, all templates will be expanded.
expand_parserfns (boolean) - Normally, wikitext parser functions will [...] False to prevent parser function [...]
expand_invoke (boolean) - Normally, the #invoke parser function [...] False to prevent expansion of the #invoke parser function.
def start_page(self, title)
This function should be called before starting the processing of a new page or file. This saves the page title (which is frequently accessed by templates, parser functions, and Lua macros). The page title is also used in error messages.
The Wtp.process() and Wtp.reprocess() functions will automatically call this before calling the page handler for each page. This needs to be called manually when processing wikitext obtained from other sources.
The arguments are as follows:
title (str) - The page title. For normal pages, there is usually no [...] Template: prefix and Lua modules [...] Module: prefix, and other prefixes are also used (e.g., Thesaurus:).
def start_section(self, title)
Sets the title of the current section on the page. This is automatically reset to None by Wtp.start_page(). The section title is only used in error, warning, and debug messages.
The arguments are:
title (str) - the title of the section, or None to clear it.
def start_subsection(self, title)
Sets the title of the current subsection of the current section on the page. This is automatically reset to None by Wtp.start_page() and Wtp.start_section(). The subsection title is only used in error, warning, and debug messages.
The arguments are:
title (str) - the title of the subsection, or None to clear it.
def add_page(self, title: str, namespace_id: int, body: Optional[str] = None,
             redirect_to: Optional[str] = None, need_pre_expand: bool = False,
             model: str = "wikitext") -> None:
This function is used to add pages, templates, and modules for
processing. There is usually no need to use this if Wtp.process()
is used; however, this can be used to add templates and pages for
testing or other special processing needs.
The arguments are:
title - the title of the page to be added (normal pages typically [...] Template:, and Lua [...] Module:)
namespace_id - namespace id
body - the content of the page, template, or module
redirect_to - title of redirect page
need_pre_expand - set to True if the page is a template that needs to [...]
model - the model value for the page (usually wikitext [...] Scribunto for Lua modules)
The Wtp.analyze_templates() function needs to be called after calling Wtp.add_page() before pages can be expanded or parsed (it should preferably only be called once after adding all pages and templates).
def analyze_templates(self)
Analyzes the template definitions in the cache file and determines which
of them should be pre-expanded before parsing because they affect the
document structure significantly. Some templates in, e.g., Wiktionary
expand to table start tags, table end tags, or list items, and parsing
results are generally much better if they are expanded before parsing.
The actual expansion only happens if pre_expand or some other argument to Wtp.expand() or Wtp.parse() tells them to do so.
The analysis is heuristic and is not guaranteed to find every such template. In particular, it cannot detect templates that call Lua modules that output Wikitext control structures (there are several templates in Wiktionary that call Lua code that outputs list items, for example). Such templates may need to be identified manually and specified as additional templates to expand. Luckily, there seem to be relatively few such templates, at least in Wiktionary.
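The kind of check involved, and its blind spot, can be illustrated with a toy heuristic. This is not the package's actual algorithm: it is a sketch that assumes a template is "structural" if any line of its body starts a table ({|), ends a table (|}), or begins a list item. The function and pattern names are hypothetical.

```python
import re

# A line that opens or closes a table, or starts a list item, in wikitext.
STRUCTURAL_LINE = re.compile(r"^(\{\||\|\}|[*#:;])", re.MULTILINE)


def looks_structural(template_body: str) -> bool:
    """Return True if the template text itself emits table or list markup."""
    return STRUCTURAL_LINE.search(template_body) is not None


print(looks_structural('{| class="wikitable"'))   # table start tag
print(looks_structural("text\n* item"))           # list item on second line
print(looks_structural("plain {{{1}}} text"))     # ordinary inline template
# A template that only calls Lua looks harmless to a text-level check,
# even if the Lua code outputs list items when executed.
print(looks_structural("{{#invoke:foo|bar}}"))
```

The last call shows why templates backed by Lua modules cannot be classified from their source text alone.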
This function is automatically called by Wtp.process() at the end of phase 1. An explicit call is only necessary if Wtp.add_page() has been used by the application.
Various functions in this module, including Wtp.parse() and Wtp.expand(), may generate errors and warnings. Those will be displayed on stdout as well as collected in Wtp.errors, Wtp.warnings, and Wtp.debugs. These fields will contain lists of dictionaries, where each dictionary describes an error/warning/debug message. The dictionary can have the following keys (not all of them are always present):
msg (str) - the error message
trace (str or None) - optional stacktrace where the error occurred
title (str) - the page title on which the error occurred
section (str or None) - the section where the error occurred
subsection (str or None) - the subsection where the error occurred
path (tuple of str) - a path of title, template names, parser function [...]
The fields containing the error messages will be cleared by every call to Wtp.start_page() (including the implicit calls during Wtp.process() and Wtp.reprocess()). Thus, the page_handler function often returns these lists together with any information extracted from the page, and they can be collected together from the values returned by the iterators returned by these functions. The Wtp.to_return() function may be useful for this.
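A minimal sketch of turning one such message dictionary into a log line follows. It assumes the keys listed above and treats every key as optional, since not all of them are always present. format_message is a hypothetical helper, not part of the package, and the sample entry is made up for illustration.

```python
from typing import Any, Dict


def format_message(level: str, entry: Dict[str, Any]) -> str:
    """Render one error/warning/debug dictionary as a single line."""
    # Location: page title, then section and subsection when present.
    parts = [entry.get("title") or "<unknown page>"]
    for key in ("section", "subsection"):
        if entry.get(key):
            parts.append(entry[key])
    line = f"{level}: {'/'.join(parts)}: {entry.get('msg', '')}"
    # Expansion path (title, template names, ...), if recorded.
    path = entry.get("path")
    if path:
        line += " [at " + " > ".join(path) + "]"
    return line


entry = {
    "msg": "unbalanced brackets",
    "title": "example",
    "section": "English",
    "subsection": None,
    "path": ("example", "some-template"),
}
print(format_message("ERROR", entry))
# prints: ERROR: example/English: unbalanced brackets [at example > some-template]
```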
The following functions can be used for reporting errors. These can also be called by application code from within the page_handler function as well as the template_fn and post_template_fn functions to report errors, warnings, and debug messages in a uniform way.
def error(self, msg, trace=None)
Reports an error message. The error will be added to the Wtp.errors list and printed to stdout. The arguments are:
[...] trace (str or None) - an optional stack trace giving more information [...]
def warning(self, msg, trace=None)
Reports a warning message. The warning will be added to the Wtp.warnings list and printed to stdout. The arguments are the same as for Wtp.error().
def debug(self, msg, trace=None)
Reports a debug message. The message will be added to the Wtp.debugs list and printed to stdout. The arguments are the same as for Wtp.error().
def to_return(self)
Produces a dictionary containing the error, warning, and debug messages from Wtp. This would typically be called at the end of a page_handler function and the value returned along with whatever data was extracted from that page. The error lists are reset by Wtp.start_page() (including the implicit calls from Wtp.process() and Wtp.reprocess()), so they should be saved (e.g., by this call) for each page. (Given the parallelism in the processing of the pages, they cannot just be accumulated in the subprocesses.)
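Because the messages have to be saved per page, the caller usually ends up merging them afterwards. The sketch below assumes each page result is a dictionary carrying errors, warnings, and debugs lists in the shape produced by Wtp.to_return(), possibly alongside extracted data. merge_messages and the sample results are hypothetical and only illustrate the aggregation step.

```python
from typing import Any, Dict, Iterable, List

MESSAGE_KEYS = ("errors", "warnings", "debugs")


def merge_messages(results: Iterable[Dict[str, Any]]) -> Dict[str, List[dict]]:
    """Concatenate the per-page message lists into one dictionary."""
    merged: Dict[str, List[dict]] = {key: [] for key in MESSAGE_KEYS}
    for result in results:
        for key in MESSAGE_KEYS:
            # Tolerate results that omit a key entirely.
            merged[key].extend(result.get(key, []))
    return merged


# Made-up per-page results, as a page handler might return them.
page_results = [
    {"data": "page a", "errors": [{"msg": "e1", "title": "a"}],
     "warnings": [], "debugs": []},
    {"data": "page b", "errors": [],
     "warnings": [{"msg": "w1", "title": "b"}], "debugs": []},
]
totals = merge_messages(page_results)
print({key: len(value) for key, value in totals.items()})
```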
The returned dictionary contains the following keys:
errors - a list of dictionaries describing any error messages
warnings - a list of dictionaries describing any warning messages
debugs - a list of dictionaries describing any debug messages.
The WikiNode class represents a parse tree node and is returned by Wtp.parse(). This object can be printed or converted to a string and will display a human-readable format that is suitable for debugging purposes (at least for small parse trees).
The WikiNode objects have the following fields:
kind (NodeKind, see below) - The type of the node. This determines [...]
children (list) - Contents of the node. This is generally used when [...]
args (list or str, depending on kind) - Direct arguments to the [...]
attrs - A dictionary containing HTML attributes or a definition list [...] def key).
The NodeKind type is an enumerated value for parse tree (WikiNode) node types. Currently the following values are used (typically these need to be prefixed by NodeKind., e.g., NodeKind.LEVEL2):
ROOT - The root node of the parse tree.
LEVEL2 - Level 2 subtitle (==). The args field contains the title [...] children field contains any contents that are within this section
LEVEL3 - Level 3 subtitle (===)
LEVEL4 - Level 4 subtitle (====)
LEVEL5 - Level 5 subtitle (=====)
LEVEL6 - Level 6 subtitle (======)
ITALIC - Italic, content is in children
BOLD - Bold, content is in children
HLINE - A horizontal line (no arguments or children)
LIST - Indicates a list. Each list and sublist will start with [...] args will contain the prefix used to open the [...] "##" - note this is stored directly as a string [...] args). List items will be stored in children.
LIST_ITEM - A list item in the children of a LIST node. args [...] LIST node). [...] children. If the list is a definition list (i.e., the prefix ends [...] ";"), then children contains the item label to be defined [...] definition contains the definition.
PREFORMATTED - Preformatted text where markup is interpreted. Content [...] children. This is used for lines starting with a space in [...]
PRE - Preformatted text where markup is not interpreted. Content [...] children. This is indicated in wikitext by [...]
LINK - An internal wikimedia link ([[...]] in wikitext). The link [...] args. This tag is also used for media inclusion. [...] children.
TEMPLATE - A template call (transclusion). Template name is in the [...] args. [...] children field is not used. In wikitext templates are marked up [...]
TEMPLATE_ARG - A template argument. The argument name is in the first [...] args followed by any subsequent arguments (normally at most two [...] children field is not used. In wikitext [...]
PARSER_FN - A parser function invocation. This is also used for built-in [...] args and parser function arguments in subsequent [...]
URL - An external URL. The first argument is the URL. The second [...] args) is the display text. The children [...]
TABLE - A table. Content is in children. In wikitext, a table [...]
TABLE_CAPTION - A table caption. This can only occur under TABLE. The content is in children. The attrs field contains [...]
TABLE_ROW - A table row. This can only occur under TABLE. The [...] children (normally the content would be TABLE_CELL [...] TABLE_HEADER_CELL nodes). The attrs field contains a dictionary [...]
TABLE_HEADER_CELL - A table header cell. This can only occur under TABLE_ROW. Content is in children. The attrs field contains [...]
TABLE_CELL - A table cell. This can only occur under TABLE_ROW. [...] children. The attrs field contains a dictionary [...]
MAGIC_WORD - A MediaWiki magic word. The magic word is assigned [...] args as a string (i.e., not in a list). children is [...] __NOTOC__.
HTML - An HTML tag (or a matched pair of HTML tags). args is the [...] attrs is set to a dictionary of any HTML attributes from the tag. [...] children.
This can generally process a few Wiktionary pages per second per processor core, including expansion of all templates, Lua macros, parsing the full page, and analyzing the parse. On a multi-core machine, this can generally process a few dozen to a few hundred pages per second, depending on the speed and the number of the cores.
Most of the processing effort goes to expanding Lua macros. You can elect not to expand Lua macros, but they are used extensively in Wiktionary and for important information. Expanding templates and Lua macros allows much more robust and complete data extraction, but does not come cheap.
Please create an issue on GitHub to report bugs or to contribute!