
I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(

So I started writing a bunch of queries that look like this:

UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';

Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...

Shall I just press on with the queries? There will probably only be 40 or so of them.
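
The regex idea above (capture the hex digits, convert them, rebuild the string) can be sketched in plain Python to see what it involves. This is only an illustration: `decode_hex_entities` and `HEX_ENTITY` are hypothetical names, not something from the database or any library, and the sketch covers hexadecimal references such as &#xe0; only, not named entities like &amp;.

```python
import re

# Hypothetical helper: matches hexadecimal numeric character references
# such as &#xe0; and captures only the hex digits.
HEX_ENTITY = re.compile(r"&#x([0-9a-fA-F]+);")


def decode_hex_entities(text: str) -> str:
    # Convert each captured hex value to its Unicode code point.
    return HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_hex_entities("caf&#xe9; &#xe0; la carte"))  # café à la carte
```

One pattern covers every hex reference, so the 40 or so separate queries collapse into a single pass. Values above the Unicode range would raise a ValueError in this sketch, so crawled input may need a guard.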

  • You'll want to VACUUM aggressively if you're doing this, to avoid huge table bloat. Doing the text processing in a PL is by far the better approach, as @SzymonGuz explains. It's possible in SQL using substring or regexp_matches and a replacement table, but it'll be slow and ugly. Commented Aug 28, 2012 at 3:21
  • Thanks for the VACUUM tip, I shall look into that. Commented Aug 28, 2012 at 10:12

3 Answers


Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities

Of course you need to have perl installed and pl/perl available.

1) First of all create the procedural language pl/perlu:

CREATE EXTENSION plperlu;

2) Then create a function like this:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;

3) Then you can use it like this:

select decode_html_entities('aaabbb&amp;.... asasdasdasd &#x2026;');
   decode_html_entities    
---------------------------
 aaabbb&.... asasdasdasd …
(1 row)

3 Comments

Thanks, I was hoping not to have to jump into perl, but I guess a pure sql solution would be a little too much to ask for!
Well, it would be also very easy, however much longer than those 2 lines of perl code.
Requires apt-get install postgresql-plperl-9.1

You could use xpath (HTML-encoded content is the same as XML encoded content):

SELECT 
   'AT&T' AS input 
  ,(xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] AS output 

So much for the theory.
HTML is not really XML though, so this will fail on umlauts (&auml;&ouml;&uuml;), or on any other HTML entity that is not defined in XML.
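
The difference between the two entity sets can be seen outside the database. The sketch below uses Python's standard library as a stand-in (PostgreSQL uses libxml2, so error messages will differ); `xml_text` is a hypothetical helper that mirrors the `<z>...</z>` wrapping trick.

```python
import html
import xml.etree.ElementTree as ET


def xml_text(fragment: str) -> str:
    """Wrap the fragment in <z>...</z>, parse it as XML, return the text."""
    return ET.fromstring("<z>" + fragment + "</z>").text


# &amp; is one of the five entities predefined by XML, so this parses.
print(xml_text("AT&amp;T"))  # AT&T

# &auml; is an HTML entity that XML does not define, so parsing fails.
try:
    xml_text("&auml;&ouml;&uuml;")
except ET.ParseError as exc:
    print("XML parser rejects it:", exc)

# An HTML-aware decoder knows the full HTML entity table.
print(html.unescape("&auml;&ouml;&uuml;"))  # äöü
```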

Note:
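This used to work (probably PG 9.1).
It doesn't anymore (currently tested in PG18). It looks like a bug in PostgreSQL, but as far as I can tell it is intentional: xpath() returns xml values, and since 9.2 it escapes special characters in the string values it returns, so the output still contains &amp;.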

Here's how to fix this (without using a plAnything):

SELECT x.txt
FROM xmltable(
    '/root' PASSING xmlparse(content '<root>' || 'AT&amp;T' || '</root>')
    COLUMNS txt text PATH 'text()'
) AS x;

So for a table with multiple rows, you can do:

;WITH CTE AS
(
              SELECT 'AT&amp;T' AS enc
    UNION ALL SELECT '&lt;script&gt;' AS enc 
)
SELECT
     CTE.enc 
    ,t_decoded.txt 
FROM CTE 

LEFT JOIN LATERAL 
    (
        SELECT x.txt 
        FROM xmltable(
            '/root' PASSING xmlparse(content '<root>' || CTE.enc || '</root>')
            COLUMNS txt text PATH 'text()'
        ) AS x
    ) AS t_decoded ON 1=1 

But to parse HTML properly, use plPython3 (because HTML entities for umlauts, e.g. &auml;, are not valid XML):

CREATE EXTENSION IF NOT EXISTS plpython3u;


-- Decode HTML entities
CREATE OR REPLACE FUNCTION decode_html_entities(input_text text)
RETURNS text AS $$
    import html
    if input_text is None:
        return None
    return html.unescape(input_text)
$$ LANGUAGE plpython3u;


-- Encode HTML entities
CREATE OR REPLACE FUNCTION encode_html_entities(input_text text)
RETURNS text AS $$
    import html
    if input_text is None:
        return None
    return html.escape(input_text)
$$ LANGUAGE plpython3u;


-- Testing: 
SELECT
    decode_html_entities('&lt;script&gt;alert(&quot;test &auml;&ouml;&uuml; &Auml;&Ouml;&Uuml;&quot;)&lt;/script&gt;')
    -- Returns: <script>alert("test äöü ÄÖÜ")</script>
   ,encode_html_entities('<script>alert("test äöü ÄÖÜ")</script>')
    -- Returns: &lt;script&gt;alert(&quot;test äöü ÄÖÜ&quot;)&lt;/script&gt;
; 

Note:
You need to install plPython3 (or plPerl if you use Perl) for PostgreSQL, e.g. on Ubuntu for PostgreSQL 18:

sudo apt install postgresql postgresql-plpython3-18
sudo apt install postgresql postgresql-plperl-18

Note that Python 3 may well be faster than Perl here, since the decoding is done by Python's built-in html module. Benchmark on your own data if it matters.

Note that the NULL-handling in xmltable is incorrect (maybe as documented, but incorrect):


WITH xml_data AS 
(
    SELECT '<table>
              <row>
                <col1>A1</col1>
                <col2>B1</col2>
                <col3>C1</col3>
                <col4>3.14</col4>
              </row>
              <row>
                <col1>A2</col1>
                <col2>B2</col2>
                <col3>C2</col3>
                <col4>15</col4>
              </row>
    
            <row>
                <col2></col2>
                <col3 />
                <col4></col4>
              </row>
    
            </table>' AS data
)
SELECT 
    -- xml_data.data, 
     x.col1 
    ,x.col2 
    ,x.col3 
    ,x.col4 
FROM xml_data 
CROSS JOIN LATERAL xmltable 
    ( 
        '/table/row'  -- XPath to select each row 
        PASSING xmlparse(content data) 
        COLUMNS 
             col1 text PATH 'col1/text()' 
            ,col2 text PATH 'col2/text()' 
            ,col3 text PATH 'col3/text()' 
            ,col4 float PATH 'col4/text()' 
    ) AS x 
WHERE (1=1) 
;

Notice how values like <col2></col2> come back as NULL instead of an empty string. <col3 /> should also be an empty string, because it's the same as <col3></col3>. But PostgreSQL doesn't do it that way: an empty element has no text node, so the 'text()' path matches nothing and the column is NULL. This turns all empty strings into NULLs, which might not be what you want.
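
The same effect shows up in other XML tooling, which suggests it comes from the XML data model rather than from xmltable alone. A minimal sketch with Python's ElementTree (a stand-in, not what PostgreSQL uses internally), run on a cut-down version of the third row:

```python
import xml.etree.ElementTree as ET

# Cut-down third row: col1 is absent, col2 and col3 are empty elements.
row = ET.fromstring("<row><col2></col2><col3 /><col4>15</col4></row>")

for name in ("col1", "col2", "col3", "col4"):
    elem = row.find(name)
    if elem is None:
        print(name, "-> element missing")
    else:
        # Empty elements carry no text node, so .text is None, not "".
        print(name, "-> text is", repr(elem.text))
```

A missing element and an empty element are distinguishable here (find() returns None for the first), but both forms of the empty element report no text. If the empty string matters, it has to be restored explicitly after parsing.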

Note that if you use Perl, you also need to correctly handle the NULL-input case:

CREATE EXTENSION IF NOT EXISTS plperlu;


CREATE OR REPLACE FUNCTION decode_html_entities_perl(input_text text)
RETURNS text AS $$
    use HTML::Entities;

    my ($input_text) = @_;           # named parameter
    return undef unless defined $input_text;  # handle NULL
    return decode_entities($input_text);
$$ LANGUAGE plperlu;


-- Encode HTML entities
CREATE OR REPLACE FUNCTION encode_html_entities_perl(input_text text)
RETURNS text AS $$
    use HTML::Entities;

    my ($input_text) = @_;           # named parameter
    return undef unless defined $input_text;  # handle NULL
    return encode_entities($input_text);
$$ LANGUAGE plperlu;





WITH test_strings AS 
(
    SELECT NULL::text AS txt
    UNION ALL
    SELECT '' 
    UNION ALL
    SELECT '<div>Hello & "World"</div>'
    UNION ALL
    SELECT '&lt;span&gt;Test&lt;/span&gt;'
    UNION ALL
    SELECT 'Normal Text'
)
SELECT
    txt AS original,
    encode_html_entities_perl(txt) AS encoded,
    decode_html_entities_perl(txt) AS decoded
FROM test_strings;

The point of the explicit check is that NULL input returns NULL rather than an empty string.

3 Comments

Your example does nothing on PG 16.8. Both input and output are the same, but I had to cast output to text.
I can confirm: This doesn't work anymore (it used to). But I couldn't get it to work by casting to text either. Seems more like a bug. I added a plpython3 variant.
@Joe Love: Found the solution to that - see the edited answer.

This is what it took for me to get this working on Ubuntu 18.04 with PG10. Perl didn't decode some entities like &comma; for some reason, so I used Python3.

From the command line

sudo apt install postgresql-plpython3-10

From your SQL interface:

CREATE LANGUAGE plpython3u;

CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
    # html.unescape (Python 3.4+) replaces HTMLParser().unescape,
    # which was removed in Python 3.9.
    import html
    if str is None:
        return None  # SQL NULL in, SQL NULL out
    return html.unescape(str)
$$ LANGUAGE plpython3u;
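
The decoding logic can be checked in a plain Python shell before installing it in the database. This sketch assumes Python 3.4 or later, where `html.unescape` is available in the standard library; the standalone `htmlchars` below is only a local copy for testing, not the database function itself.

```python
import html


def htmlchars(value):
    # Mirror SQL NULL handling: None in, None out.
    if value is None:
        return None
    return html.unescape(value)


print(htmlchars("a&comma; b &amp; c"))  # a, b & c
print(htmlchars(None))                  # None
```

The HTML5 entity table used by `html.unescape` includes &comma;, the entity that the Perl module did not decode in this setup.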
