74

I need to replace Microsoft Word's version of single and double quotations marks (“ ” ‘ ’) with regular quotes (' and ") due to an encoding issue in my application. I do not need them to be HTML entities and I cannot change my database schema.

I have two options: to use either a regular expression or an associated array.

Is there a better way to do this?

1

8 Answers 8

122

I have found an answer to this question. You need just one line of code using iconv() function in php:

// replace Microsoft Word version of single  and double quotations marks (“ ” ‘ ’) with  regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);     
Sign up to request clarification or add additional context in comments.

12 Comments

if my answer helped u can u upvote my original question stackoverflow.com/questions/6597268/…
Thanks however in my case I needed to pick the right character encoding (which was CP1252 and not UTF-8): $output = iconv('CP1252', 'ASCII//TRANSLIT', $input);
@eric good to know you used your mind on it for others. thanks for sharing :)
Yep this worked for me. I'd recommend this over the accepted answer :)
This is a good trick. but I notice this solution removes special characters like é, è, à, â and others. Any solution to shirk the issue?
|
91

Considering you only want to replace a few specific and well identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery regex will bring you ;-)

And if you encounter some other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever is necessary / whenever they are identified.


The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

function convert_smart_quotes($string) 
{ 
    $search = array(chr(145), 
                    chr(146), 
                    chr(147), 
                    chr(148), 
                    chr(151)); 

    $replace = array("'", 
                     "'", 
                     '"', 
                     '"', 
                     '-'); 

    return str_replace($search, $replace, $string); 
} 

(I don't have Microsoft Word on this computer, so I can't test by myself)

I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...

6 Comments

How would you specify the MS characters?
This is what I was looking for. Thanks. The search array did not work as is, I ended up using the Hex version that was provided in the comments from the link you gave above.
The '&' sign copied from MS word doesn't encode properly, is there anyway we can use this snippet to encode that to '&'. (aswell as bullets and other chars)
For other users: You might look for chr(149) (bullet) and replace it with an asterisk as well. This page has a list of several chr() characters you might want to convert.
You don't check the encoding of the string first, so this function will mangle certain Unicode passed into it.
|
44

Your Microsoft-encoded quotes are the probably the typographic quotation marks. You can simply replace them with str_replace if you know the encoding of the string in that you want to replace them.

Here’s an example for UTF-8 but using a single mapping array with strtr:

$quotes = array(
    "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
    "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
    "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
    "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
    "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
    "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
    "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
    "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
    "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
    "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
    "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
    "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);

If you’re need another encoding, you can use mb_convert_encoding to convert the keys.

5 Comments

Rather than the ugly \x escapes, couldn't you simply include the literal characters in your source file?
@R..: That’s the problem: There are many that don’t know enough about character encodings and/or what character encoding they’re using.
worked a charm thanks. Gotta love importing excel spread sheets into mysql :S +1
This worked for me when the accepted answer, for some reason, did not (probably a UTF-8 thing). Thanks.
@marcvangend The accepted answer does not expect UTF-8 but some other single-byte character encoding.
13

If like me you arrive here with an enormous range of broken ASCII / Microsoft Word characters that are doing weird things to your CMS or RTE and iconv isn't working, then this mad function might just be for you.

Make sure your encoding is UTF-8 when you save this function to a file.

<?php
    /**
     * fixMSWord
     *
     * Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
     * correctly map and will be replaced by spaces.
     *
     * @author      Robin Cafolla
     * @date        2013-03-22
     */
    function fixMSWord($string) {
        $map = Array(
            '33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*',
            '43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4',
            '53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
            '63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
            '73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R',
            '83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
            '93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f',
            '103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
            '113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
            '123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '&#8364;', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"',
            '133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ',
            '143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
            '153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢',
            '163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬',
            '173'=> '­', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
            '183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
            '193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê',
            '203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
            '213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ',
            '223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
            '233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
            '243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü',
            '253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
        );

        $search = Array();
        $replace = Array();

        foreach ($map as $s => $r) {
            $search[] = chr((int)$s);
            $replace[] = $r;
        }

        return str_replace($search, $replace, $string);
    }

8 Comments

Can i use this for project because this is in MIT licenses
In general the MIT licence lets you use it in whatever way you like, so long as you don't remove the licence :)
You decided to put a license on what essentially amounts to... an array?
It doesn't matter what license you place inside an answer, all user content is licensed under cc by-sa 3.0 with attribution required. You can see this in the footer. This code is no longer under the MIT license.
@NobleUplift I have renamed it to fixMSWord. I'd agree that it does mangle, but if you have the problem this function fixes it does the job, and I've yet to find another solution.
|
6

Every single one of the previous answers except for Gumbo's will mangle Unicode strings:

echo convert_smart_quotes("This is Yi: ꑑ. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad.");

Results in:

This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.

The iconv:

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

Results in:

PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1

You can change it to //IGNORE, which will remove the characters, but not translate them.

This is the best way to replace Microsoft quotes encoded in CP1252. If they are in Unicode and you need to replace them, use Gumbo's answer:

function convert_cp1252_to_ascii($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */

        $replace = array(
            128 => "E",    // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",     // UNDEFINED
            130 => ",",    // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "f",    // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => ",,",   // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "...",  // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "t",    // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "T",    // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "^",    // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "%",    // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "S",    // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "<",    // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "OE",   // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",     // UNDEFINED
            142 => "Z",    // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
            143 => "",     // UNDEFINED
            144 => "",     // UNDEFINED
            145 => "'",    // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
            146 => "'",    // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "\"",   // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "\"",   // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "*",    // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "-",    // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "--",   // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "~",    // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "TM",   // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "s",    // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => ">",    // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "oe",   // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",     // UNDEFINED
            158 => "z",    // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "Y",    // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );

        $find = array();
        foreach (array_keys($replace) as $key) {
            $find[] = chr($key);
        }

        $input = str_replace($find, array_values($replace), $input);
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}

Taken from this answer, with some modifications. If you want to control over what you find/replace, use that function.

Comments

6

We used the following. It deals with a few more special characters.

$text = str_replace(chr(130), ',', $text);    // Baseline single quote
$text = str_replace(chr(132), '"', $text);    // Baseline double quote
$text = str_replace(chr(133), '...', $text);  // Ellipsis
$text = str_replace(chr(145), "'", $text);    // Left single quote
$text = str_replace(chr(146), "'", $text);    // Right single quote
$text = str_replace(chr(147), '"', $text);    // Left double quote
$text = str_replace(chr(148), '"', $text);    // Right double quote

$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');

1 Comment

You should check the encoding of the string $text before you run replaces in it. It could already be a Unicode string and you are mangling it.
0

I had a similar issue, but could not find the character to replace (copying the character directly into PHP code did not work).

Finally I logged the character using error_log(substr($str, 10, 11)); This printed "#8217;" which then worked with str_replace

So it seems the character code that error_log printed corresponded to the replaceable character.

Comments

0

The iconv method is essentially useless these days, especially if your application is being hosted on something other than linux.

This is something you could try:

function replaceSmartQuotes($text) {
    $search = array(
        "\xE2\x80\x9C", // left double quotation mark
        "\xE2\x80\x9D", // right double quotation mark
        "\xE2\x80\x98", // left single quotation mark
        "\xE2\x80\x99"  // right single quotation mark
    );

    $replace = array(
        '"', // straight double quote
        '"', // straight double quote
        "'", // straight single quote
        "'"  // straight single quote
    );

    return str_replace($search, $replace, $text);
}

You can than use this function throughout your app like so:

$inputText = "Here is a “quote” and here’s another one.";
$cleanText = replaceSmartQuotes($inputText);
echo $cleanText; // Outputs: Here is a "quote" and here's another one.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.