Conversion of plain-text files from ASCII to Unicode without any command

Question

Why is an ASCII-encoded file reported as UTF-8 once non-ASCII text is appended, and as ASCII again once that text is removed?

user:~$ echo 'A  B  C  |  }  ~' > ./file 
user:~$ 
user:~$ file --brief --mime ./file
text/plain; charset=us-ascii
user:~$ 
user:~$ 
user:~$ echo 'ᴁ  ♫  ⼌  𝐑  🀵  🈀' >> ./file 
user:~$ 
user:~$ file --brief --mime ./file 
text/plain; charset=utf-8
user:~$
user:~$  
user:~$ cat ./file 
A  B  C  |  }  ~
ᴁ  ♫  ⼌  𝐑  🀵  🈀
user:~$ 
user:~$ 
user:~$ sed -i '$d' ./file 
user:~$ 
user:~$ cat ./file 
A  B  C  |  }  ~
user:~$
user:~$ file --brief --mime ./file 
text/plain; charset=us-ascii
user:~$

In case you cannot read a character in the second echo statement: From first to last: U+1D01, ᴁ; U+266B, ♫; U+2F0C, ⼌; U+1D411, 𝐑; U+1F035, 🀵; U+1F200, 🈀.

The locale settings are:

user:~$ echo $LANG
en_US.UTF-8
user:~$ echo $LANGUAGE
en_US:en
user:~$ echo $LC_COLLATE

user:~$ echo $LC_CTYPE

user:~$ echo $SHELL
/bin/bash
user:~$ 
user:~$ ps -p $$
  PID TTY          TIME CMD
 7537 pts/6    00:00:00 bash
user:~$

What causes the automated conversion? Can I prevent a conversion?
– user2964971, Commented Oct 8, 2014 at 9:05

ASCII is valid UTF-8, so there is no conversion at all. The file utility could report UTF-8 for the ASCII-only revisions as well, but it chooses to display the more specific charset.
– Siyuan Ren, Commented Oct 8, 2014 at 9:43

Jenny D · Accepted Answer · 2014-10-08 09:09:58Z

7

I think you're confusing "encoding" and "character sets".

In the first case, the file contains only characters found in US-ASCII. This means that the file will look the same no matter what language settings you're using to display it.

In the second case, the file now contains characters outside US-ASCII, stored as multi-byte UTF-8 sequences, because that's what you put into it.

There's no conversion happening here; the command is simply informing you of what the contents of the file are.
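
One way to check this yourself is to compare the bytes of the first line before and after appending the non-ASCII text. The following is only a minimal sketch: the helper name first_line_hex is made up for illustration, and it assumes a POSIX shell with head, od, tr, wc and mktemp available.

```shell
#!/bin/sh
# Hypothetical helper (not part of file(1)): hex dump of a file's first line.
first_line_hex() {
    head -n 1 "$1" | od -An -v -tx1 | tr -d ' \n'
}

tmp=$(mktemp)
printf 'A  B  C  |  }  ~\n' > "$tmp"
before=$(first_line_hex "$tmp")

# Append U+266B (♫) as its three UTF-8 bytes E2 99 AB, written in octal.
printf '\342\231\253\n' >> "$tmp"
after=$(first_line_hex "$tmp")

if [ "$before" = "$after" ]; then
    echo "first line unchanged: $before"
fi

# 17 bytes for the ASCII line plus 4 bytes for the appended line.
total_bytes=$(wc -c < "$tmp" | tr -d ' ')
rm -f "$tmp"
```

The ASCII line keeps exactly the same bytes; appending only adds new bytes at the end of the file, and file reports on whatever bytes it finds.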

Anthon · Accepted Answer · 2014-10-08 09:55:07Z

The file command only guesses what is in the files you have it analyse. It does the analysis by reading a limited number of bytes from the start of a file, sometimes in a multi-step process (if it finds a clear marker at the beginning). For an unstructured text file it will certainly read more bytes than your extended ./file contains, so it analyses every character.

In your second example you put some UTF-8 encoded characters in the file, and based on that file concludes that the file is UTF-8 encoded. If you had, e.g., a 900 KB file with only ASCII characters and appended your UTF-8 echo line, file would still report it as ASCII, because it would not read as far as the UTF-8 encoded characters.

The threshold lies somewhere close to 100 KB.
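
The effect of inspecting only a prefix can be imitated without file itself. The sketch below is not how file is implemented; prefix_charset is a hypothetical helper that merely looks for bytes with the high bit set in the first N bytes, and the limits 512 and 2048 are arbitrary values chosen for the demonstration, not file's real threshold. It assumes a shell with head -c, tr, wc and mktemp.

```shell
#!/bin/sh
# Hypothetical helper: classify a file by looking only at its first N bytes.
# Prints "us-ascii" if none of those bytes has the high bit set,
# otherwise "non-ascii" (it does not verify that the bytes are valid UTF-8).
prefix_charset() {
    high=$(head -c "$2" "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' ')
    if [ "$high" -eq 0 ]; then
        echo "us-ascii"
    else
        echo "non-ascii"
    fi
}

tmp=$(mktemp)

# 100 lines of 10 bytes each: 1000 bytes of pure ASCII.
i=0
while [ "$i" -lt 100 ]; do
    printf 'ABCDEFGHI\n'
    i=$((i + 1))
done > "$tmp"

# Then U+266B (♫) as its UTF-8 bytes E2 99 AB, starting at offset 1000.
printf '\342\231\253\n' >> "$tmp"

short=$(prefix_charset "$tmp" 512)    # stops before the non-ASCII bytes
full=$(prefix_charset "$tmp" 2048)    # reads past them
echo "first 512 bytes:  $short"
echo "first 2048 bytes: $full"
rm -f "$tmp"
```

With the short limit the non-ASCII bytes are never seen, so the same file is classified differently depending on how far the scan reaches.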
