percentcoding - fast URL encoding and decoding

Percent encoding is a generalization of the text escaping method defined for URIs in RFC 3986. Unlike C backslash escaping, which requires that every reserved character be explicitly named (e.g. 0x0a corresponds to \n), percent encoding can easily accommodate an arbitrary set of reserved characters, because each reserved byte is simply written as '%' followed by its two-digit hexadecimal value.
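
To make the idea concrete, here is a minimal pure-Python 3 sketch of percent encoding with a caller-chosen safe set. It is an illustration only, not the percentcoding implementation; the function name percent_encode and the choice to always escape '%' are assumptions made for this example.

```python
import string


def percent_encode(data: bytes, safe: bytes) -> str:
    """Escape every byte that is not in `safe` as %XX (uppercase hex)."""
    # Assumption for this sketch: '%' is always escaped, even if listed as safe,
    # so that the output can be decoded unambiguously.
    safe_set = set(safe) - {ord("%")}
    return "".join(
        chr(b) if b in safe_set else "%{:02X}".format(b)
        for b in data
    )


# Treat letters, digits and punctuation as safe; whitespace is reserved.
safe = (string.ascii_letters + string.digits + string.punctuation).encode("ascii")
print(percent_encode(b"a\nleaf falls", safe))  # a%0Aleaf%20falls
```

Changing the reserved set only means passing a different `safe` argument; no per-character escape names are needed.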

For the specific case of URI escaping, the percentcoding library also provides a 10x faster drop-in replacement for the urllib.quote, urllib.unquote, urllib.quote_plus, and urllib.unquote_plus functions. A unit test suite is included.

Examples

As a faster replacement for urllib.quote and urllib.unquote:

#!/usr/bin/env python
from percentcoding import quote, unquote

text = "This is a test!"   # avoid shadowing the built-in str
escaped = quote(text)
print(escaped)
# Decoding must give back the original string.
assert text == unquote(escaped)

Escaping whitespace in whitespace-delimited records:

#!/usr/bin/env python
import percentcoding
import string

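# Byte values 0..254 as single-character strings.
ascii = set(chr(c) for c in xrange(255))
whitespace = set(string.whitespace)
# Everything except whitespace is marked safe.
safe = ''.join(ascii - whitespace)
codec = percentcoding.Codec(safe)

record = ["a\nleaf\nfalls", " X\tY\tZ "]
# Encoded fields contain no whitespace, so a space can delimit them.
print(" ".join(codec.encode(v) for v in record))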
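
The same record round trip can be sketched with only the Python 3 standard library, which is handy for checking the expected shape of the output without building the extension. This uses urllib.parse.quote as a stand-in for Codec.encode, which is an assumption: the stdlib function also escapes control characters and non-ASCII text, not just whitespace.

```python
import string
from urllib.parse import quote, unquote

record = ["a\nleaf\nfalls", " X\tY\tZ "]

# Keep punctuation (except '%') unescaped. Whitespace, control characters
# and non-ASCII characters are still escaped by quote().
safe = string.punctuation.replace("%", "")
line = " ".join(quote(v, safe=safe) for v in record)
print(line)  # a%0Aleaf%0Afalls %20X%09Y%09Z%20

# No field contains a raw space any more, so splitting recovers the fields.
decoded = [unquote(field) for field in line.split(" ")]
assert decoded == record
```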

Performance

The percentcoding library is about 10x faster than the standard urllib.quote and urllib.unquote implementations: in the run below, encoding is roughly 13x faster and decoding roughly 9x faster. This is not surprising; the standard implementations are pure Python.

$ ./benchmark.py

percentcodec.encode x 10000
0.348151922226

percentcodec.decode x 10000
0.381587028503

urllib.quote x 10000
4.51035284996

urllib.unquote x 10000
3.50923490524
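
The contents of benchmark.py are not shown here. As a rough sketch of how such a measurement can be taken, the following Python 3 snippet times the standard-library functions with timeit; the sample string and the use of timeit are assumptions, and the numbers it prints are not comparable to the figures above.

```python
import timeit
from urllib.parse import quote, unquote

sample = "This is a test!"
escaped = quote(sample)
n = 10000  # same iteration count as in the output above

# timeit.timeit returns the total elapsed seconds for n calls.
encode_seconds = timeit.timeit(lambda: quote(sample), number=n)
decode_seconds = timeit.timeit(lambda: unquote(escaped), number=n)

print("quote   x %d: %f" % (n, encode_seconds))
print("unquote x %d: %f" % (n, decode_seconds))
```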

Notes

(TODO: move into pydoc)

All ASCII characters not occurring in the safe set are considered unsafe and will be escaped by encode.

quote and unquote do not map the '+' character to a space, which is what processing application/x-www-form-urlencoded data requires. Like urllib, percentcoding exports quote_plus and unquote_plus for that case.
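
The difference is easiest to see side by side. This sketch uses the Python 3 standard-library functions in urllib.parse; since percentcoding is described as a drop-in replacement, its functions are expected to behave the same way, but that is not verified here.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

print(quote("a b+c"))        # a%20b%2Bc   (space becomes %20)
print(quote_plus("a b+c"))   # a+b%2Bc     (space becomes '+')
print(unquote("a+b"))        # a+b         ('+' is left alone)
print(unquote_plus("a+b"))   # a b         ('+' becomes a space)
```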

The "%%" character sequence decodes to '%', but it is not the canonical encoding; the canonical encoding of '%' is "%25".

When decoding, if an invalid hex sequence is encountered (e.g. "%az"), it is copied to the output as-is.
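
These two decoding rules can be sketched in pure Python 3 as follows. The function lenient_decode is a hypothetical name for illustration, and the handling of a truncated escape at the end of the input (copied as-is) is an assumption; the library's actual decoder may differ in such edge cases.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def lenient_decode(s: str) -> str:
    """Decode %XX escapes, accept '%%' as '%', copy invalid escapes as-is."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "%" and s[i + 1:i + 2] == "%":
            # Non-canonical but accepted spelling of a literal '%'.
            out.append("%")
            i += 2
        elif (s[i] == "%" and len(s) >= i + 3
              and s[i + 1] in HEX_DIGITS and s[i + 2] in HEX_DIGITS):
            # Valid escape: two hex digits give one byte value (0..255).
            out.append(chr(int(s[i + 1:i + 3], 16)))
            i += 3
        else:
            # Ordinary character, or a '%' that does not start a valid escape.
            out.append(s[i])
            i += 1
    return "".join(out)


print(lenient_decode("100%%"))   # 100%
print(lenient_decode("100%25"))  # 100%
print(lenient_decode("%az"))     # %az
```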

Per the spec, Unicode and UTF-8 strings are encoded byte-wise, resulting in an ASCII string. Decoding returns a byte string, which, if the original was Unicode, can be converted back using the Python string method decode:

unquote(s).decode('utf8')
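
For comparison, the same round trip in Python 3 with the standard library looks like this. It is a sketch using urllib.parse rather than percentcoding; in Python 3 the byte-level decoder is unquote_to_bytes, because unquote already returns text.

```python
from urllib.parse import quote, unquote_to_bytes

text = "caf\u00e9"                        # 'cafe' with an acute accent on the e
escaped = quote(text.encode("utf-8"))    # UTF-8 bytes in, ASCII str out
print(escaped)                           # caf%C3%A9

raw = unquote_to_bytes(escaped)          # b'caf\xc3\xa9'
assert raw.decode("utf8") == text
```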

Installation

Ubuntu / Debian users:

fakeroot ./debian/rules binary
dpkg -i ../python-percentcoding*.deb

If there's no "real" packaging for your system yet:

./setup.py build_ext --inplace
./test.py
./setup.py build
./setup.py install
