You could use xpath(), on the assumption that HTML-encoded content is the same as XML-encoded content:
SELECT
'AT&amp;T' AS input
,(xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] AS output
That's the theory, at least.
HTML is not really XML, though, so this fails on umlaut entities (&auml;, &ouml;, &uuml;) and on any other named HTML entity that XML does not predefine. XML itself only knows &amp;, &lt;, &gt;, &quot; and &apos;.
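To make the difference concrete, here is a minimal sketch in Python (outside the database) that feeds the same strings to an XML parser and to an HTML-aware decoder. The helper names `decode_via_xml` and `xml_can_decode` are made up for this illustration, and Python's ElementTree parser stands in for PostgreSQL's XML parser, so details may differ.

```python
import html
import xml.etree.ElementTree as ET


def decode_via_xml(encoded):
    """Decode entities by wrapping the text in an element and parsing it as XML."""
    return ET.fromstring("<z>" + encoded + "</z>").text


def xml_can_decode(encoded):
    """Return True if the XML parser accepts the entity-encoded text."""
    try:
        decode_via_xml(encoded)
        return True
    except ET.ParseError:
        # Named HTML entities such as &uuml; are undefined in plain XML.
        return False


print(decode_via_xml("AT&amp;T"))       # predefined XML entity: decodes fine
print(xml_can_decode("M&uuml;nchen"))   # HTML-only entity: the XML parser rejects it
print(html.unescape("M&uuml;nchen") == "M\u00fcnchen")  # an HTML decoder handles it
```

The XML route only covers the five predefined entities (plus numeric character references); everything else needs an HTML-aware decoder.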
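Note:
The xpath() query above used to work (probably in PG 9.1).
It doesn't anymore (tested on PG 18): the text node comes back still encoded. That is either a bug in PostgreSQL or, perhaps more likely, a consequence of xpath() returning values of type xml rather than text.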
Here's how to fix this without any procedural language (PL), by letting xmltable return the value as text:
SELECT x.txt
FROM xmltable(
'/root' PASSING xmlparse(content '<root>' || 'AT&amp;T' || '</root>')
COLUMNS txt text PATH 'text()'
) AS x;
So for a table with multiple rows, you can do:
WITH CTE AS
(
SELECT 'AT&amp;T' AS enc
UNION ALL SELECT '&lt;script&gt;' AS enc
)
SELECT
CTE.enc
,t_decoded.txt
FROM CTE
LEFT JOIN LATERAL
(
SELECT x.txt
FROM xmltable(
'/root' PASSING xmlparse(content '<root>' || CTE.enc || '</root>')
COLUMNS txt text PATH 'text()'
) AS x
) AS t_decoded ON 1=1
But to parse HTML properly, use plPython3, because named HTML entities such as umlauts (e.g. &auml;) are not valid XML:
CREATE EXTENSION IF NOT EXISTS plpython3u;
-- Decode HTML entities
CREATE OR REPLACE FUNCTION decode_html_entities(input_text text)
RETURNS text AS $$
    import html
    # SQL NULL arrives as None; pass it through unchanged
    if input_text is None:
        return None
    return html.unescape(input_text)
$$ LANGUAGE plpython3u;
-- Encode HTML entities
CREATE OR REPLACE FUNCTION encode_html_entities(input_text text)
RETURNS text AS $$
    import html
    # SQL NULL arrives as None; pass it through unchanged
    if input_text is None:
        return None
    return html.escape(input_text)
$$ LANGUAGE plpython3u;
-- Testing:
SELECT
decode_html_entities('&lt;script&gt;alert(&quot;test &auml;&ouml;&uuml; &Auml;&Ouml;&Uuml;&quot;)&lt;/script&gt;')
-- Returns: <script>alert("test äöü ÄÖÜ")</script>
,encode_html_entities('<script>alert("test äöü ÄÖÜ")</script>')
-- Returns: &lt;script&gt;alert(&quot;test äöü ÄÖÜ&quot;)&lt;/script&gt;
;
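The two function bodies are plain Python, so they can be tried outside the database before creating the functions. A minimal sketch, reusing the same function names as ordinary Python functions, with `None` standing in for SQL NULL:

```python
import html


def decode_html_entities(input_text):
    # Mirrors the PL/Python body: NULL (None) in, NULL (None) out.
    if input_text is None:
        return None
    return html.unescape(input_text)


def encode_html_entities(input_text):
    if input_text is None:
        return None
    # html.escape also escapes double and single quotes by default (quote=True).
    return html.escape(input_text)


encoded = encode_html_entities('<b>"AT&T"</b>')
print(encoded)                         # &lt;b&gt;&quot;AT&amp;T&quot;&lt;/b&gt;
print(decode_html_entities(encoded))   # <b>"AT&T"</b>
```

Note that html.escape leaves non-ASCII characters such as umlauts untouched; it only escapes the characters that are special in HTML markup.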
Note:
You need to install plPython3 (or plPerl if you use Perl) for PostgreSQL, e.g. on Ubuntu for PostgreSQL 18:
sudo apt install postgresql postgresql-plpython3-18
sudo apt install postgresql postgresql-plperl-18
Note that Python 3 is usually said to be faster than Perl for this, because Python’s built-in html module is well optimized, but measure on your own data if performance matters.
Note that the NULL handling in xmltable is incorrect (possibly as documented, but still incorrect):
WITH xml_data AS
(
SELECT '<table>
<row>
<col1>A1</col1>
<col2>B1</col2>
<col3>C1</col3>
<col4>3.14</col4>
</row>
<row>
<col1>A2</col1>
<col2>B2</col2>
<col3>C2</col3>
<col4>15</col4>
</row>
<row>
<col2></col2>
<col3 />
<col4></col4>
</row>
</table>' AS data
)
SELECT
-- xml_data.data,
x.col1
,x.col2
,x.col3
,x.col4
FROM xml_data
CROSS JOIN LATERAL xmltable
(
'/table/row' -- XPath to select each row
PASSING xmlparse(content data)
COLUMNS
col1 text PATH 'col1/text()'
,col2 text PATH 'col2/text()'
,col3 text PATH 'col3/text()'
,col4 float PATH 'col4/text()'
) AS x
WHERE (1=1)
;
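Notice how empty elements like <col2></col2> come back as NULL instead of an empty string. <col3 /> should also be an empty string, because it is the same as <col3></col3>. But PostgreSQL turns all of these into NULL, so an empty element can no longer be told apart from a missing one (col1 in the third row), which might not be correct. The likely cause is the text() step in the column paths: an empty element has no text node, so the path matches nothing and xmltable returns NULL.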
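For comparison, here is a minimal sketch in Python of the semantics argued for above: a missing element becomes NULL (None), while an empty element becomes an empty string. This is not how xmltable works; it uses Python's ElementTree, and the helper `column_text` is a made-up name for this illustration.

```python
import xml.etree.ElementTree as ET

# Same shape as the third <row> above: col1 is missing, col2 and col3 are empty.
ROW = "<row><col2></col2><col3 /><col4>15</col4></row>"


def column_text(row, name):
    """Return None if the element is missing, '' if it is present but empty."""
    element = row.find(name)
    if element is None:
        return None             # element absent: the NULL case
    return element.text or ""   # <x></x> and <x /> both become ''


row = ET.fromstring(ROW)
for name in ("col1", "col2", "col3", "col4"):
    print(name, repr(column_text(row, name)))
```

Logic like this could run client-side or inside a PL/Python function if the distinction between missing and empty matters for your data.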
Note that if you use Perl, you also need to handle NULL input correctly:
CREATE EXTENSION IF NOT EXISTS plperlu;
CREATE OR REPLACE FUNCTION decode_html_entities_perl(input_text text)
RETURNS text AS $$
use HTML::Entities;
my ($input_text) = @_; # named parameter
return undef unless defined $input_text; # handle NULL
return decode_entities($input_text);
$$ LANGUAGE plperlu;
-- Encode HTML entities
CREATE OR REPLACE FUNCTION encode_html_entities_perl(input_text text)
RETURNS text AS $$
use HTML::Entities;
my ($input_text) = @_; # named parameter
return undef unless defined $input_text; # handle NULL
return encode_entities($input_text);
$$ LANGUAGE plperlu;
WITH test_strings AS
(
SELECT NULL::text AS txt
UNION ALL
SELECT ''
UNION ALL
SELECT '<div>Hello & "World"</div>'
UNION ALL
SELECT '&lt;span&gt;Test&lt;/span&gt;'
UNION ALL
SELECT 'Normal Text'
)
SELECT
txt AS original,
encode_html_entities_perl(txt) AS encoded,
decode_html_entities_perl(txt) AS decoded
FROM test_strings;
That is, a NULL input must return NULL, not an empty string.
VACUUM aggressively if you're doing this, to avoid huge table bloat. Doing the text processing in a PL is by far the better approach, as @SzymonGuz explains. It's possible in SQL using substring or regexp_matches and a replacement table, but it'll be slow and ugly.
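To show what the replacement-table idea amounts to, here is a minimal sketch in Python rather than SQL. The table `REPLACEMENTS` is hypothetical and deliberately tiny, and `decode_with_table` is a made-up name. A real table would need every entity you expect to see, which is one reason the approach gets ugly.

```python
import re

# Hypothetical, deliberately incomplete replacement table.
REPLACEMENTS = {
    "amp": "&",
    "lt": "<",
    "gt": ">",
    "quot": '"',
    "auml": "\u00e4",
}

# Matches named entities such as &amp; or &auml;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_with_table(text):
    if text is None:
        return None
    # A single pass over the input, so "&amp;lt;" becomes "&lt;" and is not
    # decoded a second time. Unknown entities are left unchanged.
    return ENTITY.sub(lambda m: REPLACEMENTS.get(m.group(1), m.group(0)), text)


print(decode_with_table("AT&amp;T"))
print(decode_with_table("&amp;lt;"))
```

Replacing entities one after another with chained replace calls would decode "&amp;lt;" twice; the single regex pass avoids that, and it is the part that is awkward to reproduce in plain SQL.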