1

Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?

If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?

1
  • Detecting encoding isn't easy. A plain ASCII file that uses only the 0-127 chars is also a perfectly valid UTF-8 file, and you can't tell whether it was written as UTF-8 or as old-school ASCII, because the two are indistinguishable. You could look for a BOM, but not all files have one. Commented Oct 8, 2015 at 20:29

2 Answers

3

Strings in PHP are just byte sequences; they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding. It makes an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first encoding for which the sequence matches. These functions are very basic and don't even exist for many popular encodings.
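To illustrate the "first match wins" idea without mbstring, here is a minimal sketch. The function name `guess_encoding()`, its candidate list, and the three validators are assumptions made up for this example, not how mbstring is implemented. It relies only on PCRE (`preg_match`), which is assumed to be built with UTF-8 support.

```php
<?php
// Hypothetical helper: returns the first candidate whose validator accepts
// the bytes, or null if none does. This is a guess, not a detection.
function guess_encoding(string $bytes, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-1']): ?string
{
    $validators = [
        // Only bytes 0x00-0x7F are allowed in ASCII.
        'ASCII' => function (string $s): bool {
            return preg_match('/^[\x00-\x7F]*$/D', $s) === 1;
        },
        // With the /u modifier, preg_match() returns false on malformed UTF-8.
        'UTF-8' => function (string $s): bool {
            return preg_match('//u', $s) === 1;
        },
        // Every byte sequence is acceptable as ISO-8859-1, so this never rejects.
        'ISO-8859-1' => function (string $s): bool {
            return true;
        },
    ];

    foreach ($candidates as $name) {
        if (isset($validators[$name]) && $validators[$name]($bytes)) {
            return $name;
        }
    }
    return null;
}
```

Note that the result depends on the order of the candidates: `guess_encoding("\xC2\xA4")` reports UTF-8 with the default order, but the same bytes are reported as ISO-8859-1 if that candidate is listed first.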

There is no way, with or without the mbstring extension, to ascertain the encoding of a string. At best you can rule some encodings out, and only if the string happens to contain byte sequences that would be invalid in those particular encodings.
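Ruling out UTF-8 could look something like the following. This is a sketch that needs no extension at all; the name `is_valid_utf8()` is made up for this example, and the byte ranges follow the well-formed UTF-8 table (no overlong forms, no surrogates, nothing above U+10FFFF). A `false` result means the string is definitely not UTF-8; a `true` result only means it could be.

```php
<?php
// Hypothetical helper: checks whether $bytes is a well-formed UTF-8 sequence.
function is_valid_utf8(string $bytes): bool
{
    $len = strlen($bytes);
    for ($i = 0; $i < $len; ) {
        $b = ord($bytes[$i]);
        if ($b <= 0x7F) {            // single-byte (ASCII) character
            $i++;
            continue;
        }
        // $need = number of continuation bytes; $min/$max = allowed range
        // of the FIRST continuation byte (this excludes overlongs/surrogates).
        if ($b >= 0xC2 && $b <= 0xDF)      { $need = 1; $min = 0x80; $max = 0xBF; }
        elseif ($b === 0xE0)               { $need = 2; $min = 0xA0; $max = 0xBF; }
        elseif ($b === 0xED)               { $need = 2; $min = 0x80; $max = 0x9F; }
        elseif ($b >= 0xE1 && $b <= 0xEF)  { $need = 2; $min = 0x80; $max = 0xBF; }
        elseif ($b === 0xF0)               { $need = 3; $min = 0x90; $max = 0xBF; }
        elseif ($b >= 0xF1 && $b <= 0xF3)  { $need = 3; $min = 0x80; $max = 0xBF; }
        elseif ($b === 0xF4)               { $need = 3; $min = 0x80; $max = 0x8F; }
        else { return false; }             // 0x80-0xC1 and 0xF5-0xFF never start a character

        if ($i + $need >= $len) {
            return false;                  // sequence is cut off at the end of the string
        }
        $c = ord($bytes[$i + 1]);
        if ($c < $min || $c > $max) {
            return false;
        }
        for ($k = 2; $k <= $need; $k++) {  // remaining continuation bytes: 0x80-0xBF
            $c = ord($bytes[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}
```

If PCRE is available, `preg_match('//u', $bytes) === 1` should give the same answer in one line, since the `u` modifier makes `preg_match()` fail on malformed UTF-8.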

You will never know whether "\xC2\xA4" is supposed to be the UTF-8 "¤" or the ISO-8859-1 "Â¤" just by looking at it, because both are the exact same bytes.
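A short sketch of that ambiguity, using only core functions and PCRE (assumed to be built with UTF-8 support). The variable names are made up for this example.

```php
<?php
$bytes = "\xC2\xA4";

// The raw bytes are all PHP actually stores.
$hex = bin2hex($bytes);                          // "c2a4"

// Read as UTF-8: well-formed, and it decodes to a single character.
$validAsUtf8 = preg_match('//u', $bytes) === 1;  // true
$utf8Chars   = preg_match_all('/./us', $bytes);  // 1

// Read as ISO-8859-1: every byte is one character, so nothing can be
// invalid, and the same string is two characters long.
$latin1Chars = strlen($bytes);                   // 2
```

Both readings are valid, so no validity check can tell you which one the author intended.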

For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

1 Comment

Thanks for that info. As to the last part of my question: by your logic, it should be possible to detect that a string is not UTF-8, correct? What would that look like?
0

There's always iconv, which is generally enabled in PHP by default:

<pre>
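<?php
// Set the encodings iconv uses for its own conversions and output.
// This only configures iconv; it does not inspect or test any string.
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
// Dump the currently configured input, output, and internal encodings.
var_dump(iconv_get_encoding('all'));
?>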
</pre>

1 Comment

I definitely wasn't aware of those methods, but I need a way to specifically test a string.
