3

Let's say (for simplicity's sake) that I have a multibyte, UTF-8 encoded string variable with 3 letters (consisting of 4 bytes):

$original = 'Fön';

Since it's UTF-8, the bytes' hex values are (excluding the BOM):

46 C3 B6 6E

As the $original variable is user-defined, I will need to handle two things:

  1. Get the exact number of bytes (not UTF-8 characters) used in the string, and
  2. A way to access each individual byte (not UTF-8 character).

I would tend to use strlen() to handle "1.", and access the $original variable's bytes with a simple $original[$byteposition], like this:

<?php
header('Content-Type: text/html; charset=UTF-8');

$original = 'Fön';
$totalbytes = strlen($original);
for($byteposition = 0; $byteposition < $totalbytes; $byteposition++)
{
    $currentbyte = $original[$byteposition];

    /*
        Doesn't work since var_dump shows 3 bytes.
    */
    var_dump($currentbyte);

    /*
        Fails too since "ord" only works on ASCII chars.
        It returns "46 F6 6E"
    */
    printf("%02X", ord($currentbyte));
    echo('<br>');
}

exit();
?>

This proves my initial idea is not working:

  1. var_dump shows 3 bytes
  2. printf fails too since "ord" only works on ASCII chars

How can I get the single bytes from a multibyte PHP string variable in a binary-safe way?

What I am looking for is a binary-safe way to convert UTF-8 string(s) into byte-array(s).

4
  • If strlen() returns a character count rather than a byte count, check php.ini for the value of mbstring.func_overload. But are you sure your ö is a UTF-8 character and not simply extended ASCII? F6 is the hex code for ö in extended ASCII. Commented Aug 1, 2013 at 11:46
  • Just an idea: $a = utf8_encode('Fön'); $b = unpack('C*', $a); var_dump($b); The result is an array with 4 int values. I used utf8_encode() because I had an ISO-encoded file. Commented Aug 1, 2013 at 11:49
  • You can also find a uniord function in the comments here: us.php.net/manual/en/function.ord.php (search for "uniord"). Commented Aug 1, 2013 at 11:51
  • @MarkBaker Yes, I am sure it's UTF-8 as a memory-dump and a file-dump both show ö is correctly represented as C3 B6, which fits UTF-8 and not extended ASCII (which would be represented by 1 byte). Commented Aug 1, 2013 at 12:10
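
The encoding question raised in these comments can be checked with a short sketch. It assumes nothing beyond core PHP functions. The byte escapes stand in for "Fön" so that the result does not depend on how the script file itself is saved, and the variable names are only illustrative.

```php
<?php
// Byte escapes make the encoding explicit, independent of the file's own encoding.
$utf8   = "F\xC3\xB6n"; // "Fön" encoded as UTF-8
$latin1 = "F\xF6n";     // "Fön" encoded as ISO-8859-1 ("extended ASCII")

// strlen() counts bytes (unless mbstring.func_overload alters it on old PHP versions).
$utf8Length   = strlen($utf8);
$latin1Length = strlen($latin1);
echo $utf8Length, "\n";   // 4
echo $latin1Length, "\n"; // 3

// bin2hex() dumps the raw bytes, which reveals the encoding actually in use.
$utf8Hex   = bin2hex($utf8);
$latin1Hex = bin2hex($latin1);
echo $utf8Hex, "\n";   // 46c3b66e
echo $latin1Hex, "\n"; // 46f66e
```

If a dump of your own string shows f6 rather than c3b6, the string is most likely not UTF-8 at that point in the script.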

2 Answers

6

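You can get a byte array by unpacking the string with unpack(). In this example, utf8_encode() first converts an ISO-8859-1 source string to UTF-8; skip that call if your string is already UTF-8, otherwise it will be encoded twice: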

$a = utf8_encode('Fön');
$b = unpack('C*', $a); 
var_dump($b);

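The format code C* stands for "unsigned char", repeated for every remaining byte. Note that the array returned by unpack() is indexed from 1, not 0.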
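
For a string that is already UTF-8, as in the question, a minimal sketch without utf8_encode() might look like the following. The byte escapes stand in for "Fön" saved as UTF-8, and the variable names are assumptions made for illustration.

```php
<?php
$original = "F\xC3\xB6n"; // the UTF-8 bytes of "Fön"

// unpack('C*') yields one unsigned integer per byte, keyed from 1.
// array_values() re-indexes the result from 0.
$bytes = array_values(unpack('C*', $original));

print_r($bytes); // values 70, 195, 182, 110

// Hex view of each byte, matching the dump in the question.
$hex = array_map(function ($b) { return sprintf('%02X', $b); }, $bytes);
$hexLine = implode(' ', $hex);
echo $hexLine, "\n"; // 46 C3 B6 6E
```

count($bytes) then gives the byte count, which is the same number strlen($original) returns.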

0

I actually wrote my own class for this problem.
I was trying to reproduce JavaScript's new TextEncoder("utf-8").encode(...) in PHP.
This is what I came up with. It uses the PHP ord() function to get the bytes and the chr() function to build the UTF-8 string back:

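class Uint8Array {
    public $val = array(); // byte values (0-255)
    public $length = 0;    // number of bytes in $val

    // Load bytes from a raw string ("utf8") or from a string of hex digits ("hex").
    function from($string, $mode = "utf8") {
        $arr = [];
        if ($mode == "utf8") {
            // str_split() splits into single bytes, not characters.
            foreach (str_split($string) as $chr) {
                $arr[] = ord($chr);
            }
        } elseif ($mode == "hex") {
            // Every two hex digits form one byte; substr() avoids reading
            // past the end when the input has an odd number of digits.
            for ($i = 0; $i < strlen($string); $i += 2) {
                $arr[] = hexdec(substr($string, $i, 2));
            }
        } else {
            return null; // unknown mode: leave the object unchanged
        }
        $this->val = $arr;
        $this->length = count($arr);
        return $arr;
    }

    // Convert the stored bytes back to a raw string ("utf8") or hex digits ("hex").
    function toString($enc = "utf8") {
        $str = "";
        if ($enc == "utf8") {
            foreach ($this->val as $byte) {
                $str .= chr($byte);
            }
        } elseif ($enc == "hex") {
            foreach ($this->val as $byte) {
                $str .= str_pad(dechex($byte), 2, "0", STR_PAD_LEFT);
            }
        } else {
            return null; // unknown encoding
        }
        return $str;
    }
}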

Use it like this:

Create an instance:

$handle = new Uint8Array;

Pass input with ->from(string, encoding), where the encoding is either "utf8" or "hex" (hex digits without spaces):

$handle->from("Fön","utf8");
//or with hex bytes
$handle->from("46c3b66e","hex");

Get output with ->toString(encoding), where the encoding is "utf8" or "hex":

$to_utf8 = $handle->toString("utf8");
//Fön
$to_hex = $handle->toString("hex");
//46c3b66e

The byte array itself can be found in ->val, and its size in ->length:

$bytearray = $handle->val;
//[70, 195, 182, 110]
$arrayleng = $handle->length;
//4

That is all; feel free to use this!

You can learn more about the functions used here:
chr() ord()
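
For comparison, the same round trips can be sketched with built-in functions only. This is an alternative under stated assumptions rather than a replacement for the class: it assumes PHP 5.6 or later for the ... argument unpacking, and the variable names are illustrative.

```php
<?php
$string = "F\xC3\xB6n"; // the UTF-8 bytes of "Fön"

// String -> byte array (0-indexed).
$bytes = array_values(unpack('C*', $string));

// Byte array -> string; pack() takes one argument per byte.
$restored = pack('C*', ...$bytes);

// String <-> hex digits.
$hex     = bin2hex($string); // "46c3b66e"
$fromHex = hex2bin($hex);

var_dump($restored === $string); // bool(true)
var_dump($fromHex === $string);  // bool(true)
```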
