21

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling it, but they were stripping the text between the tags too!

0

6 Answers

57
select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;

4 Comments

What does the E'' syntax mean here?
You could use <[^>]*> to also match empty tags
@JackDeeth E'' enables escape string syntax, but it's not needed in this case since the regular expression doesn't contain any backslashes
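
As a quick, hedged illustration of the two comments above (the sample string and expected output are mine, not part of the original answer):

-- <[^>]*> also matches the degenerate empty tag "<>"; a plain (non-E) string
-- behaves the same here because the pattern contains no backslashes
select regexp_replace('<p>Hello <b>world</b><></p>', '<[^>]*>', '', 'g');
-- expected: 'Hello world'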
17

Use xpath

Feed your database with the XML datatype, not with "second class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM's loadHTML() and saveXML() methods).

It is fast and it is very safe!

The common information-retrieval need is not the full content but something inside the XHTML, so the power of XPath is welcome.

Example: retrieve all paragraphs with class="fn":

  WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] as frags
    FROM t 
  ) SELECT array_to_string(frags,' ') AS my_p_fn2txt
    FROM needinfo
    WHERE array_length(frags , 1)>0
  -- for full content use xpath('//text()',xhtml)
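
For context, a minimal hypothetical setup that the query above could run against (the sample row and the exact table definition are my assumptions, not from the original answer):

  CREATE TABLE t (id serial PRIMARY KEY, xhtml xml);  -- XHTML already cleaned, e.g. by HTML-Tidy

  INSERT INTO t (xhtml)
  VALUES ('<div><p class="fn">first note</p><p>other text</p></div>'::xml);

  -- full content with all tags stripped:
  SELECT array_to_string(xpath('//text()', xhtml)::text[], ' ') FROM t;
  -- expected: 'first note other text'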

regex solutions...

I do not recommend them, because a regex is not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.

 CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
     SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
        E'(?x)(< [^>]*? >)', '', 'g')
 $$ LANGUAGE SQL;
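
A quick usage sketch (the sample markup and expected output are mine): the inner regexp_replace keeps an image's alt text, the outer one removes the remaining tags. Note the doubled backslashes, which the E'' escape syntax requires so that \s and the \2/\3 references reach the regex engine intact.

 SELECT strip_tags('<p>See <img src="x.png" alt="a chart"> for details.</p>');
 -- expected: 'See a chart for details.'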

See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.

2 Comments

Did you ever work around issues with XHTML specific XML entities such as &nbsp; that fail with "line 1: Entity 'nbsp' not defined"?
Hi @Jan-WillemGmeligMeyling, nowadays the best approach is to never use entities in UTF-8 text: use a real chr(160), or input the XML as real UTF-8 text... Use only the basic entities, &lt;, &gt; and &amp;... Use xmlelement to convert invalid text into valid text (e.g. xmlelement(name foo,'A>B')) and cast to XML to preserve already-valid XML (e.g. '<A/>B'::xml)... In both cases chr(160) will be encoded as UTF-8.
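
A small, hedged illustration of that comment (the sample values are mine):

SELECT xmlelement(name foo, 'A>B');                   -- special characters are escaped: <foo>A&gt;B</foo>
SELECT '<A/>B'::xml;                                  -- already-valid XML content is preserved as-is
SELECT replace('a&nbsp;b', '&nbsp;', chr(160))::xml;  -- swap the entity for the real U+00A0 character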
10

The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.

Here's an example in PL/PerlU that takes advantage of the CPAN's HTML modules.

CREATE FUNCTION extract_contents_from_html(text) returns text AS $$
  use HTML::TreeBuilder;
  use HTML::FormatText;
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content(shift);
  my $formatter = HTML::FormatText->new(leftmargin=>0, rightmargin=>78);
  return $formatter->format($tree);
$$ LANGUAGE plperlu;

Demo:

select extract_contents_from_html('<html><body color="white">Hi there!<br>How are you?</body></html>') ;

Output:

     extract_contents_from_html 
    ----------------------------
     Hi there!
     How are you?

One needs to be aware of the caveats that come with untrusted languages, though.

3 Comments

One advantage of this technique is that if you are storing the actual HTML in the database but want to do text search on just the contents, you don't need to store the contents redundantly alongside the HTML. You could create a GiST or GIN index on (for example) to_tsvector('english', extract_contents_from_html(html_text)) -- see the sketch after these comments. I've done something very similar with extracting text from PDF files in the database using a function referencing the poppler library.
Using the standard DOM and standard PL/pgSQL is faster (!) and more reliable. See how to use it with XPath here.
That's not comparable. Your suggested regexp-based replacement will treat embedded JS code and CSS as if it were HTML content, whereas HTML::FormatText will skip it.
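
A hedged sketch of the indexing idea from the first comment above (the docs table and the query terms are mine); the function has to be marked IMMUTABLE before it can appear in an expression index:

ALTER FUNCTION extract_contents_from_html(text) IMMUTABLE;

CREATE TABLE docs (id serial PRIMARY KEY, html_text text);

-- full-text index over the extracted plain text, without storing that text redundantly
CREATE INDEX docs_fts_idx ON docs
  USING GIN (to_tsvector('english', extract_contents_from_html(html_text)));

-- query with the same expression so the planner can use the index
SELECT id FROM docs
WHERE to_tsvector('english', extract_contents_from_html(html_text))
      @@ plainto_tsquery('english', 'search words');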
7

Any solution performed in the RDBMS is going to involve either string handling or regexes: to my knowledge there is NO way to manipulate HTML in a standards-compliant, safe way in the database. To reiterate, what you are asking for is very, VERY unsafe.

A much better option is to do this in your application. This is application logic, and NOT the job or concern of your storage layer.

A great way to do this (in PHP, at least) would be HTML purifier. Don't do this in JavaScript, the user can tamper with it very easily.

4 Comments

Well, since the database is PostgreSQL, you have many languages available for writing functions which are capable of doing this safely and sanely (as pointed out by @DanielVérité). I agree regarding use of PostgreSQL regexp functions or plpgsql code, but surely you would agree that perl, with appropriate modules, is up to the task, regardless of whether that is done inside or outside the database. Without knowing more I can't say which I think is the best solution, but if HTML pages are stored in the database and they want text search capability on content, db-side might be a good idea.
Perl is up to the task. Running Perl inside of database operations is still much more expensive and is bad database design.
To say "any solution performed in the RDBMS is going to involve (...)" is a very delicate claim, and in this case it is wrong. The assertion "(...) there is NO way to manipulate HTML in a standards-compliant" way is also wrong: DOM (!) is a standard. See a DOM+XPath solution here at this other answer.
This is typical zealotry. We live in a big world, and sometimes doing stuff that might be dangerous in some circumstances can be used safely for providing a massive performance boost without completely rearchitecting an existing solution. That this provided valuable advice for the original poster is beside the point - Google brings you here for the general case and there are valid usecases that the other solutions cater to.
2
regexp_replace("Content",'\s*(<[^>]+>|<script.+?<\/script>|<style.+?<\/style>)\s*','','gi')

This works well for me: it removes common HTML tags while keeping the inner text (like some-text), and it removes script and style blocks together with their inner code.
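
A quick, hedged check of the same idea on a throwaway string of my own (the script/style alternatives are listed first here so the whole block is consumed before the generic tag pattern can grab just the opening tag):

SELECT regexp_replace(
  '<p>keep me</p><script>var x = 1;</script><style>p {color: red}</style>',
  '\s*(<script.+?</script>|<style.+?</style>|<[^>]+>)\s*', '', 'gi');
-- expected: 'keep me'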

Comments

-1

Don't do it in PostgreSQL.

It is not designed to do this.

Use PHP or whatever language you are using to serve webpages.

Be careful with regular expressions though. HTML is a complex language which cannot be described with regular expressions.

Use a DOM parser to strip out tags.

With regular expressions you can make sure that nothing unsafe is left, but you can easily strip out more than you want, or end up with malformed tags in the output.

3 Comments

Alas, an XML DOM parser will choke on most HTML, which tends to be invalid XML. Way better to do it in the app as you said, where you can get a more robust HTML-to-text tool or an HTML-friendly DOM.
@JamesMitch I think "Don't do it in PostgreSQL" is wrong; for me it is the best and fastest solution (!).
@CraigRinger, I prefer to say "a DOM parser will choke on most BAD HTML", so there are two good practices: 1) filter bad HTML with tidy-html5 before inserting into the database; 2) review the HTML production in your internal system, changing bad HTML into good. Any "normal HTML4 or HTML5" can be loaded by the standard DOM's loadHTML() method (even bad HTML, if you silence the warnings). I've been using libxml2 with success for years (to feed PostgreSQL's XML datatype); it is the same library used by PostgreSQL, Mozilla, Android, etc.
