PostgreSQL - Replace HTML Entities

Question

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(

So I started writing a bunch of queries that look like this:

UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';

Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...

Shall I just press on with the queries? There will probably only be 40 or so of them.
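For illustration, the regex idea above (capture the code point, convert it, rebuild the string) can be sketched in plain Python. This is only a sketch under assumptions: the helper name `decode_numeric_entities` is made up, it handles numeric entities only (hex like `&#xe0;` and decimal like `&#224;`), and named entities such as `&amp;` are deliberately left untouched.

```python
import re

# Matches hexadecimal (&#xe0;) and decimal (&#224;) numeric character references.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Replace numeric entities with the characters they encode (hypothetical helper)."""

    def _replace(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: keep the original text unchanged.
            return match.group(0)

    return _NUMERIC_ENTITY.sub(_replace, text)


print(decode_numeric_entities("voil&#xe0; &#225; &amp;"))
```

A single pass like this would cover every numeric entity at once, instead of one UPDATE per entity.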

You'll want to VACUUM aggressively if you're doing this, to avoid huge table bloat. Doing the text processing in a PL is by far the better approach, as @SzymonGuz explains. It's possible in SQL using substring or regexp_matches and a replacement table, but it'll be slow and ugly.
– Craig Ringer, Commented Aug 28, 2012 at 3:21

Randal Schwartz · Accepted Answer · 2014-04-06 05:24:39Z

7

Write a function in PL/Perl (the untrusted variant, plperlu) that uses the HTML::Entities module: https://metacpan.org/pod/HTML::Entities

Of course, you need to have Perl installed and PL/Perl available.

1) First of all create the procedural language pl/perlu:

CREATE EXTENSION plperlu;

2) Then create a function like this:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;

3) Then you can use it like this:

select decode_html_entities('aaabbb&amp;.... asasdasdasd &hellip;');
   decode_html_entities    
---------------------------
 aaabbb&.... asasdasdasd …
(1 row)

edited Apr 6, 2014 at 5:24

Randal Schwartz


answered Aug 27, 2012 at 19:30

Szymon Lipiński


3 Comments

lynks Over a year ago

Thanks, I was hoping not to have to jump into perl, but I guess a pure sql solution would be a little too much to ask for!

Szymon Lipiński Over a year ago

Well, it would be also very easy, however much longer than those 2 lines of perl code.

Stefan Steiger Over a year ago

Requires apt-get install postgresql-plperl-9.1

Stefan Steiger · Accepted Answer · 2025-11-20 12:07:58Z

You could use xpath (HTML-encoded content is the same as XML encoded content):

SELECT 
   'AT&amp;T' AS input 
  ,(xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] AS output

So much for the theory.
HTML is not really XML, though, so this fails on named entities for umlauts (e.g. &auml;, &ouml;, &uuml;) and on any other HTML entity that XML does not define.
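To see the difference outside the database, here is a small Python sketch (not PostgreSQL itself, just the same XML rules applied by Python's standard XML parser). The helper name `decode_via_xml` is made up for this example. XML only predefines `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`, so a strict XML parser rejects `&auml;`, while an HTML-aware decoder accepts it.

```python
import html
import xml.etree.ElementTree as ET


def decode_via_xml(fragment: str) -> str:
    """Decode entities by parsing the fragment as XML content (hypothetical helper)."""
    root = ET.fromstring("<z>" + fragment + "</z>")
    return root.text or ""


# A predefined XML entity decodes fine.
print(decode_via_xml("AT&amp;T"))

# A named HTML entity is undefined in XML, so the parser raises an error.
try:
    decode_via_xml("&auml;")
except ET.ParseError as exc:
    print("XML parser rejected &auml;:", exc)

# An HTML-aware decoder knows the HTML entity table.
print(html.unescape("&auml;"))
```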

Note:
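This used to work (probably PG 9.1).
It doesn't anymore (currently tested in PG 18): the result still contains AT&amp;T. This may be a bug in PostgreSQL, but it is more likely intended: xpath() returns xml values, and newer versions escape special characters again when a text node is returned as xml.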

Here's how to fix this (without using a plAnything):

SELECT x.txt
FROM xmltable(
    '/root' PASSING xmlparse(content '<root>' || 'AT&amp;T' || '</root>')
    COLUMNS txt text PATH 'text()'
) AS x;

So for a table with multiple rows, you can do:

WITH CTE AS
(
              SELECT 'AT&amp;T' AS enc
    UNION ALL SELECT '&lt;script&gt;' AS enc 
)
SELECT
     CTE.enc 
    ,t_decoded.txt 
FROM CTE 

LEFT JOIN LATERAL 
    (
        SELECT x.txt 
        FROM xmltable(
            '/root' PASSING xmlparse(content '<root>' || CTE.enc || '</root>')
            COLUMNS txt text PATH 'text()'
        ) AS x
    ) AS t_decoded ON 1=1

But to parse HTML properly, use PL/Python 3 (because named HTML entities for umlauts, e.g. &auml;, are not valid XML):

CREATE EXTENSION IF NOT EXISTS plpython3u;


-- Decode HTML entities
CREATE OR REPLACE FUNCTION decode_html_entities(input_text text)
RETURNS text AS $$
    import html
    if input_text is None:
        return None
    return html.unescape(input_text)
$$ LANGUAGE plpython3u;


-- Encode HTML entities
CREATE OR REPLACE FUNCTION encode_html_entities(input_text text)
RETURNS text AS $$
    import html
    if input_text is None:
        return None
    return html.escape(input_text)
$$ LANGUAGE plpython3u;


-- Testing: 
SELECT
    decode_html_entities('&lt;script&gt;alert(&quot;test &auml;&ouml;&uuml; &Auml;&Ouml;&Uuml;&quot;)&lt;/script&gt;')
    -- Returns: <script>alert("test äöü ÄÖÜ")</script>
   ,encode_html_entities('<script>alert("test äöü ÄÖÜ")</script>')
    -- Returns: &lt;script&gt;alert(&quot;test äöü ÄÖÜ&quot;)&lt;/script&gt;
;
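Because the functions above are thin wrappers around Python's standard `html` module, their behaviour can be checked in a plain Python session before touching the database. The sketch below is an assumption-laden stand-in, not the PL/Python functions themselves: the names `decode` and `encode` are made up, and `None` plays the role of SQL NULL.

```python
import html
from typing import Optional


def decode(value: Optional[str]) -> Optional[str]:
    """Mirror of the decode function body: None (NULL) stays None."""
    if value is None:
        return None
    return html.unescape(value)


def encode(value: Optional[str]) -> Optional[str]:
    """Mirror of the encode function body: None (NULL) stays None."""
    if value is None:
        return None
    return html.escape(value)


# Hex entities from the question, plus named entities that XML does not define.
for sample in ["&#xe0;", "&#xe1;", "&#xe2;", "AT&amp;T", "&auml;&ouml;&uuml;", None]:
    print(repr(sample), "->", repr(decode(sample)))

# html.escape only touches &, <, > and quotes; non-ASCII letters pass through.
print(encode('<b>"äöü"</b>'))
```

Note that `html.escape` does not produce named entities for characters such as ä, which matches the test output shown above.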

Note:
You need to install plPython3 (or plPerl if you use Perl) for PostgreSQL, e.g. on Ubuntu for PostgreSQL 18:

sudo apt install postgresql postgresql-plpython3-18
sudo apt install postgresql postgresql-plperl-18

Note that Python 3 is often faster than Perl for this task on modern machines, but measure with your own data before relying on that.

Note that the NULL handling in xmltable is arguably incorrect (it may behave as documented, but the result is still surprising):


WITH xml_data AS 
(
    SELECT '<table>
              <row>
                <col1>A1</col1>
                <col2>B1</col2>
                <col3>C1</col3>
                <col4>3.14</col4>
              </row>
              <row>
                <col1>A2</col1>
                <col2>B2</col2>
                <col3>C2</col3>
                <col4>15</col4>
              </row>
    
            <row>
                <col2></col2>
                <col3 />
                <col4></col4>
              </row>
    
            </table>' AS data
)
SELECT 
    -- xml_data.data, 
     x.col1 
    ,x.col2 
    ,x.col3 
    ,x.col4 
FROM xml_data 
CROSS JOIN LATERAL xmltable 
    ( 
        '/table/row'  -- XPath to select each row 
        PASSING xmlparse(content data) 
        COLUMNS 
             col1 text PATH 'col1/text()' 
            ,col2 text PATH 'col2/text()' 
            ,col3 text PATH 'col3/text()' 
            ,col4 float PATH 'col4/text()' 
    ) AS x 
WHERE (1=1) 
;

Notice how values like <col2></col2> come back as NULL instead of an empty string. <col3 /> should also be an empty string, because it is the same as <col3></col3>. But PostgreSQL doesn't do it that way: all empty strings are turned into NULLs, which might not be what you want.
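For comparison, here is a sketch of the semantics argued for above, using Python's standard XML library instead of PostgreSQL: a missing element maps to `None` (NULL), while an empty element, written either way, maps to an empty string. The helper name `rows_as_dicts` and the shortened sample document are made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<table>
  <row><col1>A1</col1><col2>B1</col2><col3>C1</col3></row>
  <row><col2></col2><col3 /></row>
</table>"""


def rows_as_dicts(xml_text: str, columns: list) -> list:
    """Return one dict per <row>; a missing column becomes None, an empty one ''."""
    root = ET.fromstring(xml_text)
    # findtext() returns the default (None) only when the element is absent;
    # an element that exists but has no text yields an empty string.
    return [
        {column: row.findtext(column) for column in columns}
        for row in root.findall("row")
    ]


for row in rows_as_dicts(XML, ["col1", "col2", "col3"]):
    print(row)
```

In the second row, col1 is absent and comes back as `None`, while col2 and col3 exist but are empty and come back as `''`.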

Note that if you use Perl, you also need to correctly handle the NULL-input case:

CREATE EXTENSION IF NOT EXISTS plperlu;


CREATE OR REPLACE FUNCTION decode_html_entities_perl(input_text text)
RETURNS text AS $$
    use HTML::Entities;

    my ($input_text) = @_;           # named parameter
    return undef unless defined $input_text;  # handle NULL
    return decode_entities($input_text);
$$ LANGUAGE plperlu;


-- Encode HTML entities
CREATE OR REPLACE FUNCTION encode_html_entities_perl(input_text text)
RETURNS text AS $$
    use HTML::Entities;

    my ($input_text) = @_;           # named parameter
    return undef unless defined $input_text;  # handle NULL
    return encode_entities($input_text);
$$ LANGUAGE plperlu;





WITH test_strings AS 
(
    SELECT NULL::text AS txt
    UNION ALL
    SELECT '' 
    UNION ALL
    SELECT '<div>Hello & "World"</div>'
    UNION ALL
    SELECT '&lt;span&gt;Test&lt;/span&gt;'
    UNION ALL
    SELECT 'Normal Text'
)
SELECT
    txt AS original,
    encode_html_entities_perl(txt) AS encoded,
    decode_html_entities_perl(txt) AS decoded
FROM test_strings;

(That is, a NULL input should return NULL, not an empty string.)

Your example does nothing on PG 16.8. Both input and output are the same, but I had to cast output to text.
I can confirm: This doesn't work anymore (it used to). But I couldn't get it to work by casting to text either. Seems more like a bug. I added a plpython3 variant.
@Joe Love: Found the solution to that - see the edited answer.

sorrell · Accepted Answer · 2019-06-20 01:19:43Z

1

This is what it took for me to get this working on Ubuntu 18.04 with PG10. Perl didn't decode some entities for some reason, so I used Python 3.

From the command line

sudo apt install postgresql-plpython3-10

From your SQL interface:

CREATE LANGUAGE plpython3u;

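CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
    # html.unescape replaces HTMLParser().unescape, which was removed in Python 3.9
    import html
    # The parameter keeps its original name, str, which shadows the Python builtin
    if str is None:
        return None  # SQL NULL in, SQL NULL out
    return html.unescape(str)
$$ LANGUAGE plpython3u;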

edited Jun 20, 2019 at 1:19

answered Jun 20, 2019 at 0:23

sorrell
