Detect encoding in PHP without multibyte extension?

Question

Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?

If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?

Detecting encoding isn't easy. A plain ASCII file that uses only the 0-127 chars is also a perfectly valid UTF-8 file, and you can't tell whether it was built as UTF-8 or as old-school ASCII, because the two are indistinguishable. You could do stuff like looking for the BOM, but not all files have that.
– Marc B, Commented Oct 8, 2015 at 20:29
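
The two checks mentioned in the comment above can be sketched without mbstring. The helper names `has_utf8_bom` and `is_pure_ascii` are made up for illustration and are not PHP built-ins. Neither check identifies an encoding; they only report what the bytes look like.

```php
<?php
// Hypothetical helpers for illustration; neither is part of PHP itself.

// True when the string starts with the three-byte UTF-8 byte order mark.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// True when every byte is in the 0-127 range, so the string is valid
// ASCII and, byte for byte, equally valid UTF-8.
function is_pure_ascii(string $bytes): bool
{
    // No /u modifier: the pattern is matched against raw bytes.
    return preg_match('/[^\x00-\x7F]/', $bytes) === 0;
}

var_dump(has_utf8_bom("\xEF\xBB\xBFhello")); // bool(true)
var_dump(has_utf8_bom("hello"));             // bool(false)
var_dump(is_pure_ascii("hello"));            // bool(true)
var_dump(is_pure_ascii("caf\xC3\xA9"));      // bool(false)
```

A missing BOM says nothing either way, since most UTF-8 text is written without one.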

user3942918 · Accepted Answer · 2016-05-05 02:47:31Z

3

Strings in PHP are just byte sequences; they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding. It makes an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first encoding whose function accepts the sequence. These functions are very basic and don't even exist for many popular encodings.

There is no way, with or without the mbstring extension, to ascertain the encoding of a string. You can only rule some encodings out, and only if the string happens to contain byte sequences that would be invalid in those particular encodings.
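
As a minimal sketch of ruling UTF-8 out without mbstring, one option is the PCRE `/u` modifier, which makes `preg_match()` validate the subject as UTF-8 before matching. The function name `could_be_utf8` is hypothetical; the sketch assumes only the PCRE extension, which ships with PHP.

```php
<?php
// Hypothetical helper: returns false when the bytes cannot be UTF-8.
// A true result only means "not ruled out", never "definitely UTF-8".
function could_be_utf8(string $bytes): bool
{
    // With the /u modifier PCRE checks that the subject is well-formed
    // UTF-8; on malformed input preg_match() returns false instead of 1.
    return preg_match('//u', $bytes) === 1;
}

var_dump(could_be_utf8("caf\xC3\xA9")); // bool(true): well-formed UTF-8
var_dump(could_be_utf8("caf\xE9"));     // bool(false): truncated sequence
var_dump(could_be_utf8("plain ascii")); // bool(true): ASCII is valid UTF-8
```

The name is deliberately `could_be` rather than `is`: a string that passes may still have been written in some other encoding.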

You will never know whether "\xC2\xA4" is supposed to be the UTF-8 ¤ or ISO-8859-1 Â¤ just by looking at it - because they're the exact same bytes.
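
To make the ambiguity concrete, the sketch below reads the same two bytes both ways. It assumes the iconv extension is available (it is used here only to re-encode the ISO-8859-1 reading so the result can be shown as hex).

```php
<?php
$bytes = "\xC2\xA4";

// Reading 1: the two bytes form one well-formed UTF-8 character (U+00A4).
$validAsUtf8 = preg_match('//u', $bytes) === 1;

// Reading 2: treat each byte as one ISO-8859-1 character (U+00C2, U+00A4)
// and re-encode the result as UTF-8. Every byte is a valid ISO-8859-1
// character, so this reading can never be ruled out.
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);

var_dump($validAsUtf8);         // bool(true)
var_dump(bin2hex($fromLatin1)); // string(8) "c382c2a4"
```

Both readings succeed, so nothing in the bytes themselves says which one the author intended.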

For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

edited May 5, 2016 at 2:47

answered Oct 8, 2015 at 21:06

user3942918


1 Comment

Jon B Over a year ago

Thanks for that info. As to the last part of my question: by your logic, it should be possible to detect that a string is not UTF-8, correct? What would that look like?

Machavity · Accepted Answer · 2015-10-08 20:28:57Z

0

There's always iconv, which is generally enabled in PHP by default:

<pre>
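<?php
// This only sets and then dumps iconv's own configuration;
// it does not inspect the bytes of any particular string.
// Changing these settings is deprecated as of PHP 5.6.
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
var_dump(iconv_get_encoding('all'));
?>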
</pre>

answered Oct 8, 2015 at 20:28

Machavity♦


1 Comment

Jon B Over a year ago

I definitely wasn't aware of those functions, but I need a way to test a specific string.
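
For testing a specific string with iconv, one possible sketch is to convert it from UTF-8 to UTF-8 and see whether it survives unchanged. The function name `iconv_accepts_as_utf8` is hypothetical, and the result depends on the iconv library PHP was built against rejecting malformed input, which is an assumption here rather than something this page confirms.

```php
<?php
// Hypothetical helper. Assumes the iconv extension is loaded and that the
// underlying iconv library fails on malformed UTF-8 input.
function iconv_accepts_as_utf8(string $bytes): bool
{
    // iconv() returns false (and raises a notice, silenced by @) when it
    // hits an illegal sequence; otherwise the bytes come back unchanged.
    return @iconv('UTF-8', 'UTF-8', $bytes) === $bytes;
}

var_dump(iconv_accepts_as_utf8("caf\xC3\xA9"));
var_dump(iconv_accepts_as_utf8("\xFF"));
```

As with any such check, passing only means UTF-8 was not ruled out.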
