Character detection in a text file in Python using the Universal Encoding Detector (chardet)

Question

I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.

While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.

However, I cannot work out how to assign the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).

My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:

import chardet    
rawdata=open(infile,"r").read()
chardet.detect(rawdata)

Character detection is necessary as the script goes on to run the following (as well as several similar uses):

inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()

Any help would be greatly appreciated.

David Z · Accepted Answer · 2017-04-26 17:56:50Z

70

chardet.detect() returns a dictionary whose 'encoding' key holds the detected encoding. So you can do this:

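import chardet

# Read in binary mode: chardet.detect() should be given raw bytes, not decoded text.
with open(infile, 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)  # dictionary that includes an 'encoding' key
charenc = result['encoding']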

The chardet documentation doesn't state explicitly whether the module expects text strings or byte strings, but a text string has already been decoded, so there is nothing left to detect; you should pass byte strings. Hence the binary mode flag (b) in the call to open(). Depending on which versions of Python and of the library you're using, chardet.detect() might also accept a text string, so if you omit the b you might find that it works anyway even though you're technically doing something wrong.
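To tie the detection step to the decoding step from the question, here is a minimal sketch for Python 3. The helper name read_text, its detect parameter, and the "utf-8" fallback are assumptions made for illustration, not part of chardet. The detector is passed in as a callable so that another function with the same bytes-in, dictionary-out shape can be substituted; it defaults to chardet.detect.

```python
def read_text(path, detect=None, fallback="utf-8"):
    """Read a file as text, guessing its encoding from the raw bytes.

    `detect` is any callable that takes bytes and returns a dictionary
    with an 'encoding' key. It defaults to chardet.detect.
    `fallback` is a hypothetical default used when no guess is returned.
    """
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect

    # Binary mode: the detector needs the undecoded bytes.
    with open(path, "rb") as f:
        rawdata = f.read()

    # The 'encoding' value may be None when the detector has no guess,
    # so fall back to a default rather than passing None to decode().
    charenc = detect(rawdata)["encoding"] or fallback

    # Python 3 equivalent of the question's unicode(inF.read(), charenc).
    return rawdata.decode(charenc), charenc
```

Calling text, charenc = read_text(infile) then gives both the decoded text and the encoding name for later use in the script. The guess is only a guess, so decode() can still raise UnicodeDecodeError if the detector is wrong.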

edited Apr 26, 2017 at 17:56

answered Jul 24, 2010 at 4:24


9 Comments

木川炎星 Over a year ago

Thank you! I thought it would be something simple!

Endophage Over a year ago

Just what I needed. Out of curiosity, is there some way to get it to return more than one result so you can see say, the 3 highest confidence level guesses?

David Z Over a year ago

@Endophage I'm not sure, I haven't really used it much myself.

John Lemberger Over a year ago

Python 3.6 threw TypeError: Expected object of type bytes or bytearray, got: <class 'str'> when attempting to open a UTF-8 file. Opening with "rb" instead of "r" fixed the problem.

Mark Ransom Over a year ago

@Ousret also available at pypi.org/project/charset-normalizer which means you can install it with pip. I haven't had a chance to try it yet but it looks interesting.
