Computers only understand raw binary data, raw bits. One bit is a binary digit: 0 or 1. An 8-bit number is a byte. One byte is a number between 0 and 255.
ASCII is a table that maps numbers to characters. Numbers between 0 and 31 are control characters: tab, new line, and others. Numbers between 32 and 126 are printable characters: the letter a, the digit 1, the % sign, the underscore _.
So with ASCII, there are 33 control characters (0 to 31, plus 127, the DEL character) and 95 printable characters.
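As a quick check of those counts, here is a minimal Python sketch. It assumes the usual convention that the control characters are codes 0 to 31 plus 127 (DEL); the names `control` and `printable` are only illustrative.

```python
# Split the 128 ASCII codes into control and printable characters.
# Assumption: control characters are codes 0-31 plus 127 (DEL).
control = [n for n in range(128) if n < 32 or n == 127]
printable = [chr(n) for n in range(128) if 32 <= n <= 126]

print(len(control), len(printable))  # 33 95
print("".join(printable))            # from the space character up to ~
```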
ASCII underlies the most commonly used character encodings today. The first 128 entries of the Unicode table match the ASCII character set.
ASCII is a 7-bit character set: numbers between 0 and 127. With 8 bits we can go up to 255.
The most common alternative to ASCII is EBCDIC, which is not compatible with ASCII and still exists today on IBM computers and databases.
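To make the incompatibility concrete, the sketch below prints the byte used for the same character in both systems. It assumes that Python's `cp037` codec, one EBCDIC code page among several, is representative enough for illustration; other EBCDIC code pages differ in some positions.

```python
# Compare the byte used for the same character in ASCII and in EBCDIC.
# Assumption: "cp037" (an EBCDIC code page shipped with Python) stands in
# for EBCDIC in general.
for ch in "Aa0":
    ascii_byte = ch.encode("ascii")[0]
    ebcdic_byte = ch.encode("cp037")[0]
    print(ch, f"ASCII={ascii_byte:02X}", f"EBCDIC={ebcdic_byte:02X}")
```

None of the three characters share a byte value between the two tables, so text written in one cannot be read as the other without conversion.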
One byte, that is, one 8-bit number, is the most common unit in computer science nowadays. One byte is a number between 0 and 255.
ASCII defines a meaning for each number between 0 and 127.
The character associated with numbers between 128 and 255 depends on the character encoding being used. Two widely used character encodings nowadays are windows1252 and UTF-8.
In windows1252 the number corresponding to the € sign is 128. One byte: [80]. In the Unicode database, the € sign is number 8364.
Now I give you the number 8364. Two bytes: [20, AC]. In UTF-8 the euro sign is the number 14844588. Three bytes: [E2, 82, AC].
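These numbers can be reproduced with a short Python sketch. It assumes that `cp1252` is Python's codec name for windows1252, and it uses UTF-16 big-endian as one encoding that stores the code point 8364 directly in two bytes; that choice is an illustration, not the only two-byte form.

```python
euro = "€"

# Unicode code point of the euro sign.
print(ord(euro), hex(ord(euro)))  # 8364 0x20ac

# The same character encoded three ways (cp1252 is Python's name for windows1252).
cp1252_bytes = euro.encode("cp1252")
utf16_bytes = euro.encode("utf-16-be")
utf8_bytes = euro.encode("utf-8")
print(cp1252_bytes.hex(), utf16_bytes.hex(), utf8_bytes.hex())  # 80 20ac e282ac

# Reading the three UTF-8 bytes as one big-endian integer.
print(int.from_bytes(utf8_bytes, "big"))  # 14844588
```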
Now I give you some raw data. Let's say 20AC. Is it two windows1252 characters, a space followed by ¬, or one single Unicode € sign?
I give you some more raw data. E282AC. Well, in windows1252 these bytes would read "â‚¬", which is unlikely to be intended text, so it is probably not windows1252. It could be macRoman "‚Ç¨" or OEM 437 "Γé¼" or the UTF-8 "€" sign.
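The ambiguity is easy to see by decoding the same raw bytes under each candidate encoding. The sketch below uses Python's codec names, assumed here to correspond to the encodings mentioned: `cp1252` for windows1252, `mac_roman` for macRoman, `cp437` for OEM 437, and `utf-16-be` for reading 20AC as a single 16-bit code unit.

```python
# Decode the same raw bytes under several candidate encodings.
# Codec names are Python's: cp1252 = windows1252, mac_roman = macRoman,
# cp437 = OEM 437.
samples = {
    "20AC": ["cp1252", "utf-16-be"],
    "E282AC": ["cp1252", "mac_roman", "cp437", "utf-8"],
}

for hex_string, encodings in samples.items():
    raw = bytes.fromhex(hex_string)
    for encoding in encodings:
        # repr() makes the leading space in the cp1252 reading of 20AC visible.
        print(hex_string, encoding, repr(raw.decode(encoding)))
```

Every decode succeeds, which is the point: the bytes alone do not say which reading is the right one.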
It is possible to guess the encoding of a stream of raw bytes based on the characteristics of the character encodings and on statistics, but there is no fully reliable way to do that. A byte between 128 and 255 on its own is invalid in UTF-8; such bytes only appear as part of multi-byte sequences. The é is common in some languages (French), so if you see many bytes with the value E9 surrounded by letters it is probably a windows1252-encoded string, the E9 byte representing the é character.
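A minimal sketch of such a guess in Python: try strict UTF-8 first, and fall back to windows1252 when the bytes are not valid UTF-8. The function name `guess_decode` and the fallback choice are assumptions made for illustration; this is one fallible heuristic, not a reliable detector.

```python
def guess_decode(raw: bytes):
    """Return (text, encoding_name) using a simple, fallible heuristic."""
    try:
        # Strict UTF-8 rejects a lone byte in the 128-255 range.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is windows1252.
        # errors="replace" covers the few bytes windows1252 leaves unassigned.
        return raw.decode("cp1252", errors="replace"), "cp1252"


print(guess_decode(b"caf\xe9"))      # E9 alone is not valid UTF-8
print(guess_decode(b"caf\xc3\xa9"))  # C3 A9 is the UTF-8 form of é
```

Both calls return the text "café", but with different encoding names. Pure ASCII input is reported as UTF-8, since it is valid under both encodings.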
When you have a stream of raw bytes that represents a string, it is far better to know the matching encoding than to try to guess it.
Below is a screenshot of one raw byte in various encodings that were once widely used.
