Top 1000 website URLs
I've taken a dataset of top URLs from https://gtmetrix.com/top1000.html
`download.py` to download and save the content and headers.
`analysis.py` to determine the likely character encodings.

Total pages: 855
Total pages with a specified charset: 73.1%
Total pages with no specified charset, which successfully decode as utf-8: 22.5%
Total pages with no specified charset, which result in a successful chardet guess: 4.4%
Total pages with no specified charset, and an unsuccessful chardet guess: 0.0%
Total pages with no specified charset, and chardet returned no guess at all: 0.0%
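For context, here's a rough sketch of the sort of categorisation `analysis.py` performs. (This is a hypothetical reconstruction, not the actual script; it assumes each saved page is available as a headers dict plus the raw body bytes.)

```python
# Hypothetical sketch of the categorisation above, not the actual analysis.py.
import chardet

def categorise(headers: dict, content: bytes) -> str:
    content_type = headers.get("content-type", "")
    if "charset=" in content_type:
        return "specified charset"
    try:
        content.decode("utf-8")
        return "no specified charset, decodes as utf-8"
    except UnicodeDecodeError:
        pass
    guess = chardet.detect(content)["encoding"]
    if guess is None:
        return "no specified charset, chardet returned no guess"
    try:
        content.decode(guess)
        return "no specified charset, successful chardet guess"
    except (UnicodeDecodeError, LookupError):
        return "no specified charset, unsuccessful chardet guess"
```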
Based on the results I've seen here, I think the most sensible policy for HTTPX to use for character decoding would be: use the specified charset when one is given; otherwise attempt `.decode("utf-8", errors="strict")`, and only fall back to a chardet guess, decoded with `errors='replace'`, if that fails. (Much quicker than chardet, and handles the large majority of cases.)

Also, it's important that the Chinese character set GB2312 should always be coerced to its superset GBK, since that resolves a bunch of cases that'll otherwise fail. This is important both for cases with a specified charset, and for cases where we guess with chardet.
See https://en.wikipedia.org/wiki/GBK_(character_encoding)
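A minimal sketch of that proposed policy, purely illustrative (the function name and shape here are assumptions, not HTTPX's actual API):

```python
# Illustrative sketch of the proposed decoding policy; not HTTPX's implementation.
import chardet

def decode_body(content: bytes, specified_charset=None):
    if specified_charset is not None:
        # Always coerce GB2312 to its superset GBK.
        charset = "gbk" if specified_charset.lower() == "gb2312" else specified_charset
        return content.decode(charset, errors="replace")
    try:
        # Much quicker than chardet, and handles the large majority of cases.
        return content.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        guess = chardet.detect(content)["encoding"] or "utf-8"
        if guess.lower() == "gb2312":
            guess = "gbk"
        return content.decode(guess, errors="replace")
```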
Specified charsets:
utf-8 598
gbk 8
iso-8859-1 6
shift_jis 4
euc-jp 2
iso-8859-15 2
utf8 1
iso-8859-2 1
euc-kr 1
windows-1251 1
windows-1256 1
Unspecified charsets, which successfully decode as UTF-8:
utf-8 192
Guessed charsets:
gbk 21
shift_jis 8
euc-kr 5
windows-1252 2
iso-8859-1 1
euc-jp 1
Follow-up:
Switched from chardet to charset_normalizer - in all the examples we have here it gave the same results, but was significantly faster.
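For reference, this is roughly how the guess is obtained with charset_normalizer (a small sketch; the surrounding function is illustrative):

```python
from charset_normalizer import from_bytes

def guess_charset(content: bytes):
    # from_bytes() returns a ranked set of candidate matches;
    # .best() is None when no guess can be made at all.
    best = from_bytes(content).best()
    return best.encoding if best is not None else None
```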
Was interested to figure out if the proposed "fallback to utf-8 first" was a good policy or not, so did some digging into the 192 cases where no charset was specified, but utf-8 decoding works successfully.
Of these cases, what we're interested in is whether there are any where `content.decode(charset_normalizer_guess) != content.decode("utf-8")`. It turns out there are exactly 12 such cases, and in all 12 the reason is that the encoding is actually utf-8-sig (i.e. it includes a leading BOM).
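To make the BOM point concrete (a toy example, not taken from the dataset):

```python
content = b"\xef\xbb\xbfhello"  # utf-8 content with a leading BOM

content.decode("utf-8")      # '\ufeffhello' -- the BOM survives as a character
content.decode("utf-8-sig")  # 'hello'       -- the BOM is stripped
```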
Anyways, what's the actual upshot of that?

We can keep our charset decoding policy nice & simple: attempt utf-8 first when no charset is specified, and only fall back to decoding with `errors='replace'` from the `charset_normalizer` guess if that fails.
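Putting that together, a hedged sketch of what the resulting policy might look like (again illustrative, not the actual HTTPX code):

```python
from charset_normalizer import from_bytes

def decode_text(content: bytes, specified_charset=None):
    if specified_charset is not None:
        return content.decode(specified_charset, errors="replace")
    try:
        # Handles the large majority of pages without any detection step.
        return content.decode("utf-8")
    except UnicodeDecodeError:
        best = from_bytes(content).best()
        guess = best.encoding if best is not None else "utf-8"
        return content.decode(guess, errors="replace")
```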