chardet

Python character encoding detector

LGPL-2.1 License


chardet - chardet 5.2.0 Latest Release

Published by dan-blanchard about 1 year ago

Adds support for running chardet CLI via python -m chardet (0e9b7bc20366163efcc221281201baff4100fe19, @dan-blanchard)

chardet - chardet 5.1.0

Published by dan-blanchard almost 2 years ago

Features

  • Add should_rename_legacy argument to most functions, which will rename older encodings to their more modern equivalents (e.g., GB2312 becomes GB18030) (#264, @dan-blanchard)
  • Add capital letter sharp S and ISO-8859-15 support (#222, @SimonWaldherr)
  • Add a prober for MacRoman encoding (#5 updated as c292b52a97e57c95429ef559af36845019b88b33, Rob Speer and @dan-blanchard)
  • Add --minimal flag to chardetect command (#214, @dan-blanchard)
  • Add type annotations to the project and run mypy on CI (#261, @jdufresne)
  • Add support for Python 3.11 (#274, @hugovk)
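
The should_rename_legacy item above rests on the modern encoding being a superset of the legacy one. The following is a minimal sketch using only Python's standard codecs, not chardet itself, of why the GB2312 to GB18030 rename is safe when decoding. The LEGACY_RENAMES table and the modern_name helper are hypothetical names for illustration and hold only the pair named in the release note.

```python
# Illustrative only: a one-entry rename table holding the pair named in the
# release note. This is NOT chardet's internal mapping.
LEGACY_RENAMES = {"gb2312": "gb18030"}


def modern_name(encoding: str) -> str:
    """Return the modern equivalent of a legacy encoding name, if known."""
    return LEGACY_RENAMES.get(encoding.lower(), encoding)


text = "中文"
raw = text.encode("gb2312")

# GB18030 is backward compatible with GB2312, so the same bytes decode to
# the same text under either codec name.
assert raw.decode("gb2312") == raw.decode(modern_name("GB2312")) == text

# The newer codec also covers characters the legacy one cannot encode.
emoji = "😀"
try:
    emoji.encode("gb2312")
    legacy_can_encode = True
except UnicodeEncodeError:
    legacy_can_encode = False

modern_round_trip = emoji.encode("gb18030").decode("gb18030")
print(legacy_can_encode, modern_round_trip == emoji)
```

Reporting the broader name means a caller who later re-encodes the text is less likely to hit a UnicodeEncodeError.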

Fixes

  • Clarify LGPL version in License trove classifier (#255, @musicinmybrain)
  • Remove support for EOL Python 3.6 (#260, @jdufresne)
  • Remove unnecessary guards for non-falsey values (#259, @jdufresne)

Misc changes

  • Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)
  • Remove setup.py in favor of build package (#262, @jdufresne)
  • Run tests on macOS, Windows, and 3.11-dev (#267, @dan-blanchard)

chardet - chardet 5.0.0

Published by dan-blanchard over 2 years ago

⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️

In addition to that change, it features the following user-facing changes:

  • Added a prober for Johab Korean (#207, @grizlupo)
  • Added a prober for UTF-16/32 BE/LE (#109, #206, @jpz)
  • Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, and Turkish, which should help prevent future errors with those languages
  • Improved XML tag filtering, which should improve accuracy for XML files (#208)
  • Tweaked SingleByteCharSetProber confidence to match latest uchardet (#209)
  • Made detect_all return child prober confidences (#210)
  • Updated examples in docs (#223, @domdfcoding)
  • Documentation fixes (#212, #224, #225, #226, #220, #221, #244, from too many contributors to mention)
  • Minor performance improvements (#252, @deedy5)
  • Add support for Python 3.10 when testing (#232, @jdufresne)
  • Lots of little development cycle improvements, mostly thanks to @jdufresne

chardet - chardet 4.0.0

Published by dan-blanchard almost 4 years ago

⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️

Major Changes

This release is multiple years in the making, and provides some quality-of-life improvements to chardet. The primary user-facing changes are:

  1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
  2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
  3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
  4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
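
The short-circuiting described in item 2 can be sketched in a few lines. This is a toy model under stated assumptions, not chardet's CharsetGroupProber. The ToyProber class, its marker-matching rule, and the feed_group function are hypothetical. Only the idea of stopping at the first definite match (a FOUND_IT state) comes from the notes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    DETECTING = auto()
    FOUND_IT = auto()


@dataclass
class ToyProber:
    """Hypothetical prober: a definite match means the marker is present."""

    name: str
    marker: bytes
    calls: int = 0

    def feed(self, data: bytes) -> State:
        self.calls += 1  # count how often this prober does any work
        return State.FOUND_IT if self.marker in data else State.DETECTING


def feed_group(probers: List[ToyProber], data: bytes) -> Optional[str]:
    """Feed each prober in turn, stopping at the first definite match."""
    for prober in probers:
        if prober.feed(data) is State.FOUND_IT:
            return prober.name  # short-circuit: later probers never run
    return None


probers = [
    ToyProber("first", b"AA"),
    ToyProber("second", b"BB"),
    ToyProber("third", b"CC"),
]
winner = feed_group(probers, b"xx AA yy")
print(winner, [p.calls for p in probers])  # first [1, 0, 0]
```

Once one member reports a definite match, the remaining members do no work, which is where a group-level speedup would come from.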

The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).

Benchmarks

Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

old version (chardet 3.0.4)

Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738 
windows-1255: 6.336261495718825

Total time: 357.05358052253723s (10.054513372323958 calls per second)

new version (chardet 4.0.0)


Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104

Total time: 268.0230791568756s (13.394368915143872 calls per second)


Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

Full changelog

  • Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard
  • Add API option to get all the encodings confidence (#111) @mdamien
  • Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard
  • Add benchmark script (d702545, 8dccd00, 726973e, 71a0fad) @dan-blanchard
  • Include license file in the generated wheel package (#141) @jdufresne
  • Drop support for Python 2.6 (#143) @jdufresne
  • Remove unused coverage configuration (#142) @jdufresne
  • Doc the chardet package suitable for production (#144) @jdufresne
  • Pass python_requires argument to setuptools (#150) @jdufresne
  • Update pypi.python.org URL to pypi.org (#155) @jdufresne
  • Typo fix (#159) @saintamh
  • Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok
  • Test Python 3.7 and 3.8 and document support (#175) @jdufresne
  • Drop support for end-of-life Python 3.4 (#181) @jdufresne
  • Workaround for distutils bug in python 2.7 (#165) @xeor
  • Remove deprecated license_file from setup.cfg (#182) @jdufresne
  • Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne
  • Add testing for Python 3.9 (#201) @jdufresne
  • Adds explicit os and distro definitions (#140) @edumco
  • Remove shebang from nonexecutable script (#192) @hrnciar
  • Remove use of deprecated 'setup.py test' (#187) @jdufresne
  • Remove unnecessary numeric placeholders from format strings (#176) @jdufresne
  • Update links (#152) @aaaxx
  • Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne
  • Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard
  • Switch from Travis to GitHub Actions (#204) @dan-blanchard
  • Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard
  • Add language to detect_all output (1e208b7) @dan-blanchard

chardet - chardet 3.0.4

Published by dan-blanchard over 7 years ago

This minor bugfix release just fixes some packaging and documentation issues:

  • Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
  • Make sure test.py is included in the manifest (PR #118, thanks @zmedico)
  • Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)
  • Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)

chardet - chardet 3.0.3

Published by dan-blanchard over 7 years ago

This release fixes a crash when debugging logging was enabled. (Issue #115, PRs #117 and #125)

chardet - chardet 3.0.2

Published by dan-blanchard over 7 years ago

Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).
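
A short usage sketch of the behaviour this fix provides, assuming chardet is installed. The sample bytes are arbitrary, and no particular detection result is assumed.

```python
import chardet

sample = "Any bytes object works here: naïve café.".encode("utf-8")
result = chardet.detect(sample)

# The result is a dict, so the three keys can be read without first
# checking whether the result itself is None.
print(result["encoding"], result["confidence"], result["language"])

# The *value* under "encoding" can still be None when nothing matches
# (for example on empty input), so guard before decoding.
if result["encoding"] is not None:
    text = sample.decode(result["encoding"], errors="replace")
    print(text)
```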

chardet - chardet 3.0.1

Published by dan-blanchard over 7 years ago

This bugfix release fixes a crash in the EUC-TW prober when it encountered certain strings (Issue #67).

chardet - chardet 3.0.0

Published by dan-blanchard over 7 years ago

This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:

  • Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)
  • Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)
  • Removed Python 3.2 from testing, but added 3.4 through 3.6
  • Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
  • Fixed chardetect crash with non-ascii file names (PR #39, thanks @nkanaev)
  • Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)
  • Modernized test scripts and infrastructure so we've got Travis testing and all that stuff
  • Rename filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)
  • Updated filter_english_letters to match C implementation (c6654595)
  • Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a079)
  • Allow CLI sub-package to be importable (PR #55)
  • Add a hypothesis-based test (PR #66, thanks @DRMacIver)
  • Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)
  • Fixed broken links in docs (PR #90, thanks @roskakori)
  • Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
  • Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
  • Add language property to probers and UniversalDetector results (PR #180)
  • Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)
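
The item above about stripping endianness from BOM-prefixed UTF predictions is easiest to see with Python's own codecs. This sketch uses only the standard library, not chardet. It shows why a generic name such as "utf-16" is the more convenient value to hand to bytes.decode() when the data starts with a BOM.

```python
import codecs

text = "héllo"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# An endianness-specific codec treats the BOM as ordinary data, so it
# survives as a leading U+FEFF character in the decoded string.
with_bom = raw.decode("utf-16-le")
assert with_bom == "\ufeff" + text

# The generic codec reads the BOM to pick the byte order and drops it.
without_bom = raw.decode("utf-16")
assert without_bom == text
```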
chardet - chardet 2.2.0

Published by dan-blanchard almost 10 years ago

First version after merger with charade. Loads of little changes.

chardet - chardet 2.2.1

Published by dan-blanchard almost 10 years ago

Fix missing paren in chardetect.py

chardet - chardet 2.3.0

Published by dan-blanchard about 10 years ago

In this release, we:

  • Added support for CP932 detection (thanks to @hashy).
  • Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8).
  • Modified chardetect to use argparse for argument parsing.
  • Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.