# chardet
Universal character encoding detector.
[License: 0BSD](LICENSE)
chardet 7 is a ground-up, 0BSD-licensed rewrite of [chardet].
It keeps the same package name and the same public API, so it works as a
drop-in replacement for chardet 5.x/6.x while being much faster and more accurate.
It requires Python 3.10+, has zero runtime dependencies, and works on PyPy.
## Why chardet 7?
**99.3% accuracy** on 2,517 test files. **47x faster** than chardet 6.0.0
and **1.5x faster** than charset-normalizer 3.4.6. **Language detection**
for every result. **MIME type detection** for binary files. **0BSD licensed.**
| | chardet 7.4.0 (mypyc) | chardet 6.0.0 | [charset-normalizer] 3.4.6 |
| ---------------------- | :--------------------: | :-----------: | :-------------------------: |
| Accuracy (2,517 files) | **99.3%** | 88.2% | 85.4% |
| Speed | **551 files/s** | 12 files/s | 376 files/s |
| Language detection | **95.7%** | 40.0% | 59.2% |
| Peak memory | **52.9 MiB** | 29.5 MiB | 78.8 MiB |
| Streaming detection | **yes** | yes | no |
| Encoding era filtering | **yes** | no | no |
| Encoding filters | **yes** | no | yes |
| MIME type detection | **yes** | no | no |
| Supported encodings | 99 | 84 | 99 |
| License | 0BSD | LGPL | MIT |
[charset-normalizer]: https://github.com/jawah/charset_normalizer
## Installation
```bash
pip install chardet
```
## Quick Start
```python
import chardet

chardet.detect(b"Python is a great programming language for beginners and experts alike.")
# {'encoding': 'ascii', 'confidence': 1.0, 'language': 'en', 'mime_type': 'text/plain'}

# UTF-8 English with accented characters
chardet.detect("The naïve approach doesn't always work in complex systems.".encode("utf-8"))
# {'encoding': 'utf-8', 'confidence': 0.84, 'language': 'en', 'mime_type': 'text/plain'}

# Japanese EUC-JP
chardet.detect("日本語の文字コード検出テストです。このテキストはEUC-JPでエンコードされています。正しく検出できるか確認します。".encode("euc-jp"))
# {'encoding': 'EUC-JP', 'confidence': 1.0, 'language': 'ja', 'mime_type': 'text/plain'}

# Get all candidate encodings ranked by confidence
text = "Le café est une boisson très populaire en France et dans le monde entier."
results = chardet.detect_all(text.encode("windows-1252"))
for r in results[:4]:
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1252 0.32
# iso8859-15 0.32
# ISO-8859-1 0.32
# MacRoman 0.31
```
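A common next step after detection is decoding the bytes with the reported encoding. The following is a minimal sketch of that pattern, not part of chardet itself: the helper name `decode_detected`, the `min_confidence` threshold of 0.5, and the UTF-8 fallback are all assumptions made for illustration. The detector is passed in as a callable so the snippet runs on its own; in practice you would pass `chardet.detect`.

```python
from typing import Any, Callable, Mapping


def decode_detected(
    data: bytes,
    detect: Callable[[bytes], Mapping[str, Any]],
    min_confidence: float = 0.5,  # hypothetical threshold, tune for your data
    fallback: str = "utf-8",      # hypothetical fallback encoding
) -> str:
    """Decode `data` using the encoding reported by `detect`.

    `detect` is expected to return a mapping with "encoding" and
    "confidence" keys, as in the Quick Start results above.
    """
    result = detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Fall back when no encoding is reported or the guess is weak.
    if not encoding or confidence < min_confidence:
        encoding = fallback

    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec Python knows about.
        return data.decode(fallback, errors="replace")
```

With chardet installed, a call would look like `decode_detected(raw_bytes, chardet.detect)`. Using `errors="replace"` keeps the helper from raising on undecodable bytes, at the cost of inserting replacement characters; use `errors="strict"` if you would rather fail loudly.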
### Streaming Detection
For large files or network streams, use `UniversalDetector` to feed data
incrementally:
```python
from chardet import UniversalDetector
detector = UniversalDetector()
with open("unknown.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break