Allowlist-based HTML cleaner
BSD-3-CLAUSE License
This is an allowlist-based and very opinionated HTML sanitizer that can be used for both untrusted and trusted sources. It attempts to clean up the mess made by various rich text editors and/or copy-pasting to make styling of webpages simpler and more consistent. It builds on the excellent HTML cleaner in lxml_ to make the result both valid and safe.
HTML sanitizer goes further than e.g. bleach_ in that it not only ensures that content is safe and tags and attributes conform to a given allowlist, but also applies additional transforms to HTML fragments.
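
- Convert some tags (such as ``<span style="...">``, ``<b>`` and
  ``<i>``) into either ``<strong>`` or ``<em>`` (but never both).
- Merge adjacent tags of the same type (such as several ``<strong>`` or
  ``<h3>`` directly after each other).
- Automatically remove redundant list markers inside ``<li>`` tags.

Usage with the default configuration::

    >>> from html_sanitizer import Sanitizer
    >>> sanitizer = Sanitizer()  # default configuration
    >>> sanitizer.sanitize('<span style="font-weight:bold">some text</span>')
    '<strong>some text</strong>'
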
- Bold spans and ``b`` tags are converted into ``strong`` tags, italic
  spans and ``i`` tags into ``em`` tags (if ``strong`` and ``em`` are
  allowed at all).
- A ``div`` element is used to wrap the HTML fragment for the parser,
  therefore ``div`` tags are not allowed.

The default settings are::

    DEFAULT_SETTINGS = {
        "tags": {
            "a", "h1", "h2", "h3", "strong", "em", "p", "ul", "ol",
            "li", "br", "sub", "sup", "hr",
        },
        "attributes": {"a": ("href", "name", "target", "title", "id", "rel")},
        "empty": {"hr", "a", "br"},
        "separate": {"a", "p", "li"},
        "whitespace": {"br"},
        "keep_typographic_whitespace": False,
        "add_nofollow": False,
        "autolink": False,
        "sanitize_href": sanitize_href,
        "element_preprocessors": [
            # convert span elements into em/strong if a matching style rule
            # has been found. strong has precedence, strong & em at the same
            # time is not supported
            bold_span_to_strong,
            italic_span_to_em,
            tag_replacer("b", "strong"),
            tag_replacer("i", "em"),
            tag_replacer("form", "p"),
            target_blank_noopener,
        ],
        "element_postprocessors": [],
        "is_mergeable": lambda e1, e2: True,
    }

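The keys' meaning is as follows:

- ``tags``: A ``set()`` of allowed tags.
- ``attributes``: A ``dict()`` mapping tags to their allowed attributes.
- ``empty``: Tags which are allowed to be empty. By default, empty tags
  (containing no text or only whitespace) are dropped.
- ``separate``: Tags which are not merged if they appear as siblings. By
  default, tags of the same type are merged.
- ``whitespace``: Tags which are treated as whitespace and removed from
  the beginning or end of other elements' content.
- ``keep_typographic_whitespace``: Keep typographically used space
  characters such as the non-breaking space.
- ``add_nofollow``: Whether to add ``rel="nofollow"`` to all links.
- ``autolink``: Enable lxml_'s autolinker_. May be either a boolean or a
  dictionary; a dictionary is passed as keyword arguments to
  ``autolink``.
- ``sanitize_href``: A callable that gets an anchor's ``href`` value and
  returns a sanitized version. The default implementation checks whether
  links start with a few allowed prefixes, and if not, returns a single
  hash (``#``).
- ``element_preprocessors`` and ``element_postprocessors``: Additional
  filters that are called on all elements in the tree. Under certain
  circumstances elements are processed more than once (search the code
  for ``backlog.append``). Preprocessors are run before whitespace
  normalization, postprocessors afterwards.
- ``is_mergeable``: Adjacent elements which aren't kept separate are
  merged by default. This callable can be used to prevent merging of
  adjacent elements, e.g. when their classes do not match
  (``lambda e1, e2: e1.get('class') == e2.get('class')``).

Settings can be specified partially when initializing a sanitizer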
instance, but are still checked for consistency. For example, it is not
allowed to have tags in ``empty`` that are not in ``tags``, that is,
tags that are allowed to be empty but at the same time not allowed at
all. The ``Sanitizer`` constructor raises ``TypeError`` exceptions when
it detects inconsistencies.
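
The consistency rule described above can be sketched in a few lines of standalone Python. This is not the library's own code: ``check_settings`` is a hypothetical helper, and it only models the single rule mentioned here (everything in ``empty`` must also be in ``tags``).

```python
def check_settings(settings):
    """Raise TypeError if a tag may be empty without being allowed at all.

    Hypothetical helper for illustration; the real checks live in the
    Sanitizer constructor and may cover more cases.
    """
    tags = set(settings["tags"])
    empty = set(settings["empty"])
    not_allowed = empty - tags
    if not_allowed:
        raise TypeError(
            "Tags in 'empty' must also be in 'tags': %s"
            % ", ".join(sorted(not_allowed))
        )
    return settings


# Consistent: "br" is allowed and may be empty.
consistent = check_settings({"tags": {"p", "br"}, "empty": {"br"}})

# Inconsistent: "hr" may be empty but is not an allowed tag.
try:
    check_settings({"tags": {"p"}, "empty": {"hr"}})
except TypeError as exc:
    error_message = str(exc)
```

With the real library, the same kind of mistake would presumably surface as a ``TypeError`` as soon as ``Sanitizer(...)`` is called with the inconsistent settings.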
An example for an even more restricted configuration might be::

    >>> from html_sanitizer import Sanitizer
    >>> sanitizer = Sanitizer({
    ...     'tags': ('h1', 'h2', 'p'),
    ...     'attributes': {},
    ...     'empty': set(),
    ...     'separate': set(),
    ... })

The rationale for such a restricted set of allowed tags (e.g. no
images) is documented in the `design decisions`_ section of
django-content-editor_'s documentation.
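
As a sketch of the callable-valued settings (``sanitize_href`` and ``is_mergeable``), the following shows what custom callables might look like. The function names and the prefix list are assumptions made for illustration, not the library's defaults, and the example assumes that ``sanitize_href`` returns the value to use for the link, with ``#`` as the fallback. Standard-library ``ElementTree`` elements stand in for lxml elements, since both expose ``.get()``.

```python
import xml.etree.ElementTree as ET

# Assumption: a stricter prefix list than the default, chosen for illustration.
ALLOWED_PREFIXES = ("https://", "mailto:", "/", "#")


def strict_sanitize_href(href):
    """Return href unchanged if it starts with an allowed prefix, else "#"."""
    if href.startswith(ALLOWED_PREFIXES):
        return href
    return "#"


def same_class_is_mergeable(e1, e2):
    """Only allow merging adjacent elements whose class attributes match."""
    return e1.get("class") == e2.get("class")


# Partial settings dict of the kind that could be passed to Sanitizer(...).
custom_settings = {
    "sanitize_href": strict_sanitize_href,
    "is_mergeable": same_class_is_mergeable,
}

# ElementTree elements are stand-ins for the lxml elements the library uses.
first = ET.fromstring('<p class="lead">a</p>')
second = ET.fromstring('<p class="lead">b</p>')
third = ET.fromstring("<p>c</p>")

keeps_https = strict_sanitize_href("https://example.com/")
drops_script = strict_sanitize_href("javascript:alert(1)")
merge_same = same_class_is_mergeable(first, second)
merge_different = same_class_is_mergeable(first, third)
```

Because settings can be specified partially, a dict like ``custom_settings`` would only override these two keys and leave the remaining defaults in place.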
HTML sanitizer does not depend on Django, but ships with a module which makes configuring sanitizers using Django settings easier. Usage is as follows::

    >>> from html_sanitizer.django import get_sanitizer
    >>> sanitizer = get_sanitizer([name=...])

Different sanitizers can be configured. The default configuration is
aptly named ``'default'``. Example settings follow::

    HTML_SANITIZERS = {
        'default': {
            'tags': ...,
        },
        ...
    }

The ``'default'`` configuration is special: If it isn't explicitly
defined, the default configuration above is used instead. Non-existing
configurations will lead to ``ImproperlyConfigured`` exceptions.

The ``get_sanitizer`` function caches sanitizer instances, so feel free
to call it as often as you want to.
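
The lookup behaviour described above (fallback for ``'default'``, an error for unknown names, cached instances) can be modelled roughly as follows. This is a standalone sketch, not the module's actual implementation: ``get_sanitizer_sketch``, ``FakeSanitizer``, the shortened ``DEFAULT_SETTINGS`` and the ``'comments'`` entry are all hypothetical, and the exception class is a stand-in for Django's ``ImproperlyConfigured``.

```python
# Shortened stand-in for the library's default settings.
DEFAULT_SETTINGS = {"tags": {"p", "strong", "em"}}

# Stand-in for the Django setting; note that no "default" entry is defined.
HTML_SANITIZERS = {
    "comments": {"tags": {"p"}},
}


class ImproperlyConfigured(Exception):
    """Stand-in for django.core.exceptions.ImproperlyConfigured."""


class FakeSanitizer:
    """Placeholder for html_sanitizer.Sanitizer that only stores settings."""

    def __init__(self, settings):
        self.settings = settings


_cache = {}


def get_sanitizer_sketch(name="default"):
    """Return one cached sanitizer instance per configuration name."""
    if name not in _cache:
        if name in HTML_SANITIZERS:
            settings = HTML_SANITIZERS[name]
        elif name == "default":
            # "default" falls back to the built-in settings if not configured.
            settings = DEFAULT_SETTINGS
        else:
            raise ImproperlyConfigured("Unknown sanitizer %r" % name)
        _cache[name] = FakeSanitizer(settings)
    return _cache[name]
```

Repeated calls with the same name return the same object, which is why calling the function often is cheap.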
Please report security issues to me directly at [email protected].

.. _bleach: https://bleach.readthedocs.io/
.. _Django: https://www.djangoproject.com/
.. _django-content-editor: http://django-content-editor.readthedocs.io/
.. _FeinCMS: https://pypi.python.org/pypi/FeinCMS
.. _feincms-cleanse: https://pypi.python.org/pypi/feincms-cleanse
.. _design decisions: http://django-content-editor.readthedocs.io/en/latest/#design-decisions
.. _lxml: http://lxml.de/
.. _autolinker: http://lxml.de/api/lxml.html.clean-module.html