8

I have some text files that are encoded with different character encodings, such as ASCII, UTF-8, Big5, and GB2312.

Now I want to determine their actual character encodings so that I can view them in a text editor; otherwise they show up as garbled characters.

I searched online and found that the file command can display the character encoding of a file, like:

$ file -bi *
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii
text/plain; charset=iso-8859-1
text/plain; charset=utf-8
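
For what it's worth, passing -b --mime-encoding prints just the charset part of the same guess:

$ file -b --mime-encoding *
iso-8859-1
us-ascii
iso-8859-1
utf-8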

Unfortunately, files encoded with Big5 and GB2312 both report charset=iso-8859-1, so I still can't tell them apart. Is there a better way to check the character encoding of a text file?

7
  • have you tried uchardet or enconv? Commented Feb 11, 2018 at 21:19
  • @ewcz Thank you. They work. Commented Feb 12, 2018 at 5:51
  • 2
    You cannot reliably check an encoding; you can only guess. file makes a bad guess while uchardet makes a better one, but both are guessing (a small demonstration follows these comments). Commented Feb 12, 2018 at 6:03
  • I have a hard time believing you have ASCII-encoded files. It is far more likely to be happenstance that your files' current contents are limited to the C0 Controls and Basic Latin characters. If a file is indeed ASCII, perhaps you have a specification or standard that says so. Then you won't need guessing programs. Commented Feb 13, 2018 at 0:16
  • 1
    When someone writes a text file, they choose a character encoding. That's almost never ASCII. If they were to choose ASCII, they would likely do so because of a specification or standard. In every case, the reader must use the same encoding to read the file. So, a specification or standard is one way to know which encoding is being used and you should have it available to you. Guessing is very sketchy. You might do so from a sample. But if a file is part of a repetitive process then the file might have different content in the future that could invalidate the guess. Commented Feb 13, 2018 at 3:45
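
To see why any of these tools can only guess, here is a minimal sketch: the byte pair 0xB0 0xA1 is a valid character in both GB2312 and Big5 (in GB2312 it happens to be 啊), so nothing in the bytes themselves says which encoding was meant. The file name below is made up for illustration:

$ printf '\260\241\n' > sample.txt       # bytes 0xB0 0xA1 plus a newline
$ iconv -f GB2312 -t UTF-8 sample.txt    # decodes cleanly as one character
$ iconv -f BIG5 -t UTF-8 sample.txt      # also decodes cleanly, to a different character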

2 Answers

12

To some extent, @ewcz's advice works.

$ uchardet *
big5.txt: BIG5
conf: ASCII
gb2312-windows.txt: GB18030
gb.txt: GB18030
test.java: UTF-8
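
The names uchardet prints (BIG5, GB18030, ASCII, UTF-8) are also accepted by iconv, so as a rough sketch you can feed its guess straight into a conversion. This assumes every guess is a name iconv recognizes; the .utf8 output suffix is just an illustrative choice:

$ for f in *.txt; do
>   enc=$(uchardet "$f")
>   iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"
> done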

And with enca:

$ enca -L chinese *
big5.txt: Traditional Chinese Industrial Standard; Big5
conf: 7bit ASCII characters
gb2312-windows.txt: Simplified Chinese National Standard; GB2312
  CRLF line terminators
gb.txt: Simplified Chinese National Standard; GB2312
test.java: Universal transformation format 8 bits; UTF-8
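
enca can also convert rather than just detect (that is what enconv is built around). Assuming your build supports conversion, something like this rewrites the files in place:

$ enca -L chinese -x UTF-8 *.txt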

1 Comment

The huge advantage of uchardet is that it analyses the whole file (I just tried it with a 20 GiB file), as opposed to file and enca.
1

You can use a command line tool like detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

Then you can detect the encoding like so:

$ dfeal "/home/user name/Documents/subtitle file.srt"
# Possible result: { language: french, encoding: CP1252, confidence: { language: 0.99, encoding: 1 } }

Make sure you have Node.js and NPM installed! If you don't have them installed already:

$ sudo apt install nodejs npm

2 Comments

Running this command on a simple text file on macOS doesn't detect a language: "language": null. Did I miss something?
How big is your text file? This package can only reliably detect the language with text files of 500 words or more.
