Adds support for running the chardet CLI via python -m chardet (0e9b7bc20366163efcc221281201baff4100fe19, @dan-blanchard)
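Running a package with python -m generally relies on the package providing a __main__.py module. The sketch below is not chardet's code: it builds a throwaway package (the name demo_cli_pkg and the helper run_package_as_module are hypothetical) and runs it through the standard-library runpy module, which is what Python uses to implement the -m switch.

```python
import runpy
import sys
import tempfile
from pathlib import Path


def run_package_as_module(package_name: str, main_source: str) -> dict:
    """Create a temporary package with a __main__.py and run it like `python -m`."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / package_name
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "__main__.py").write_text(main_source)
        sys.path.insert(0, tmp)
        try:
            # For a package name, run_module executes the package's
            # __main__ submodule and returns its globals.
            return runpy.run_module(package_name, run_name="__main__")
        finally:
            sys.path.remove(tmp)
            # Forget the throwaway package so later imports are unaffected.
            stale = [
                name for name in sys.modules
                if name == package_name or name.startswith(package_name + ".")
            ]
            for name in stale:
                del sys.modules[name]


# Hypothetical package whose __main__.py just records how it was run.
globals_after_run = run_package_as_module(
    "demo_cli_pkg", "RESULT = 'ran as ' + __name__\n"
)
print(globals_after_run["RESULT"])
```

A real command-line package would typically have its __main__.py call the same entry-point function that its console script uses.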
Published by dan-blanchard almost 2 years ago
- […] should_rename_legacy argument to most functions, which will rename older encodings to their more modern equivalents (e.g., GB2312 becomes GB18030) (#264, @dan-blanchard)
- […] --minimal flag to chardetect command (#214, @dan-blanchard)
Published by dan-blanchard over 2 years ago
⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️
In addition to that change, it features the following user-facing changes:
- […] SingleByteCharSetProber confidence to match latest uchardet (#209)
- […] detect_all return child prober confidences (#210)
Published by dan-blanchard almost 4 years ago
⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️
This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:
- The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
- A new chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).
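The short-circuiting behaviour of CharsetGroupProber mentioned in the list above can be illustrated with a toy group prober. This is a minimal sketch under stated assumptions, not chardet's implementation: State, MarkerProber, and GroupProber are hypothetical names, and the "starts with a marker" rule is invented purely for the demo.

```python
from enum import Enum


class State(Enum):
    DETECTING = 0
    FOUND_IT = 1


class MarkerProber:
    """Toy prober: reports a definite match when the data starts with its marker."""

    def __init__(self, name: str, marker: bytes) -> None:
        self.name = name
        self.marker = marker
        self.calls = 0  # how many times this prober was fed data

    def feed(self, data: bytes) -> State:
        self.calls += 1
        return State.FOUND_IT if data.startswith(self.marker) else State.DETECTING


class GroupProber:
    """Feeds data to child probers and stops at the first definite match."""

    def __init__(self, probers: list) -> None:
        self.probers = probers
        self.best = None

    def feed(self, data: bytes) -> State:
        for prober in self.probers:
            if prober.feed(data) is State.FOUND_IT:
                self.best = prober
                # Short-circuit: the remaining probers are never fed.
                return State.FOUND_IT
        return State.DETECTING


probers = [
    MarkerProber("A", b"AA"),
    MarkerProber("B", b"BB"),
    MarkerProber("C", b"CC"),
]
group = GroupProber(probers)
state = group.feed(b"BB some payload")
print(state.name, group.best.name, [p.calls for p in probers])
```

The call counts show the saving: once prober B reports a definite match, prober C does no work at all.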
Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM
Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738
windows-1255: 6.336261495718825
Total time: 357.05358052253723s (10.054513372323958 calls per second)
Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104
Total time: 268.0230791568756s (13.394368915143872 calls per second)
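To make the two benchmark runs easier to compare, the following sketch computes per-encoding speedup ratios from a handful of the figures above. The numbers are copied from the output; the helper name speedup and the choice of which encodings to include are illustrative only.

```python
# Calls per second, copied from the chardet 3.0.4 and 4.0.0 runs above.
OLD = {
    "utf-8": 13.966236809766901,
    "windows-1252": 99.48409117053738,
    "iso-8859-1": 120.63424740403983,
    "overall": 10.054513372323958,
}
NEW = {
    "utf-8": 108.62174360115105,
    "windows-1252": 211.27653358317968,
    "iso-8859-1": 59.25262902122995,
    "overall": 13.394368915143872,
}


def speedup(old: dict, new: dict) -> dict:
    """Return new/old for every encoding; values above 1.0 mean faster."""
    return {name: new[name] / old[name] for name in old}


ratios = speedup(OLD, NEW)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>12}: {ratio:.2f}x")
```

On these figures utf-8 detection runs at roughly 7.8 times the old rate and overall throughput improves by about a third, while iso-8859-1 drops to about half its previous rate, so the gain is not uniform across encodings.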
Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.
Published by dan-blanchard over 7 years ago
This minor bugfix release just fixes some packaging and documentation issues:
- […] setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
- […] test.py is included in the manifest (PR #118, thanks @zmedico)
Published by dan-blanchard over 7 years ago
This release fixes a crash when debugging logging was enabled. (Issue #115, PRs #117 and #125)
Published by dan-blanchard over 7 years ago
Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).
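Code that may still run against an affected version can guard against the None case on the caller's side. The sketch below is a hypothetical helper, not part of chardet: normalize_result and the default values it fills in are assumptions chosen for illustration.

```python
from typing import Optional

EXPECTED_KEYS = ("encoding", "language", "confidence")


def normalize_result(result: Optional[dict]) -> dict:
    """Return a dict with the three expected keys, even when given None."""
    if result is None:
        # Assumed placeholder values meaning "nothing detected".
        return {"encoding": None, "language": None, "confidence": 0.0}
    # Keep only the documented keys; missing ones become None.
    return {key: result.get(key) for key in EXPECTED_KEYS}


print(normalize_result(None))
print(normalize_result({"encoding": "utf-8", "confidence": 0.99, "language": ""}))
```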
Published by dan-blanchard over 7 years ago
This bugfix release fixes a crash in the EUC-TW prober when it encountered certain strings (Issue #67).
Published by dan-blanchard over 7 years ago
This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:
- […] 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
- […] chardetect crash with non-ASCII file names (PR #39, thanks @nkanaev)
- […] mTypicalPositiveRatio, and instead typical_positive_ratio)
- […] filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)
- […] filter_english_letters to match C implementation (c6654595)
- […] hypothesis-based test (PR #66, thanks @DRMacIver)
- […] bytes.decode() (PR #73, thanks @snoack)
- […] chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
- […] bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
- […] language property to probers and UniversalDetector results (PR #180)
Published by dan-blanchard almost 10 years ago
First version after merger with charade. Loads of little changes.
Published by dan-blanchard almost 10 years ago
Fix missing paren in chardetect.py
Published by dan-blanchard about 10 years ago
In this release, we:
- […] chardetect to use argparse for argument parsing.
- […] gh-pages branch. You can now access them at http://chardet.github.io.
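For readers unfamiliar with argparse, here is a minimal sketch of a file-oriented command-line parser of the kind the first item refers to. It is not chardetect's actual parser: the program name detect-demo, the paths argument, and the "-" convention for standard input are all assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser that accepts zero or more input paths."""
    parser = argparse.ArgumentParser(
        prog="detect-demo",
        description="Report the encoding of each input file (illustrative only).",
    )
    parser.add_argument(
        "paths",
        nargs="*",
        default=["-"],
        help="files to inspect; '-' (the default) stands for standard input",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the demo is repeatable.
args = build_parser().parse_args(["a.txt", "b.txt"])
print(args.paths)
```

Using argparse also gives the tool a generated --help message and consistent error reporting for bad arguments without extra code.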