21

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling it, but they were stripping the text between the tags too!

0

6 Answers

57
select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;

4 Comments

What does the E'' syntax mean here?
You could use <[^>]*> to also match empty tags
@JackDeeth E'' enables escape string syntax, but it's not needed in this case since the regular expression doesn't contain any backslashes
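
As a quick, hedged illustration of the two comments above (the sample string and expected output are mine, not part of the original answer):

-- <[^>]*> also matches the degenerate empty tag "<>"; a plain (non-E) string
-- behaves the same here because the pattern contains no backslashes
select regexp_replace('<p>Hello <b>world</b><></p>', '<[^>]*>', '', 'g');
-- expected: 'Hello world'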
17

Use xpath

Feed your database with the XML datatype, not with "second class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM's loadHTML() and saveXML() methods).

It is fast and it is very safe!

The common information-retrieval need is not the full content but something inside the XHTML, so the power of XPath is welcome.

Example: retrieve all paragraphs with class="fn":

  WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] as frags
    FROM t 
  ) SELECT array_to_string(frags,' ') AS my_p_fn2txt
    FROM needinfo
    WHERE array_length(frags , 1)>0
  -- for full content use xpath('//text()',xhtml)
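
For context, a minimal hypothetical setup that the query above could run against (the sample row and the exact table definition are my assumptions, not from the original answer):

  CREATE TABLE t (id serial PRIMARY KEY, xhtml xml);  -- XHTML already cleaned, e.g. by HTML-Tidy

  INSERT INTO t (xhtml)
  VALUES ('<div><p class="fn">first note</p><p>other text</p></div>'::xml);

  -- full content with all tags stripped:
  SELECT array_to_string(xpath('//text()', xhtml)::text[], ' ') FROM t;
  -- expected: 'first note other text'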

regex solutions...

I do not recommend them, because a regex is not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.

 CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
     SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
        E'(?x)(< [^>]*? >)', '', 'g')
 $$ LANGUAGE SQL;
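
A quick usage sketch (the sample markup and expected output are mine): the inner regexp_replace keeps an image's alt text, the outer one removes the remaining tags. Note the doubled backslashes, which the E'' escape syntax requires so that \s and the \2/\3 references reach the regex engine intact.

 SELECT strip_tags('<p>See <img src="x.png" alt="a chart"> for details.</p>');
 -- expected: 'See a chart for details.'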

See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.

2 Comments

Did you ever work around issues with XHTML specific XML entities such as &nbsp; that fail with "line 1: Entity 'nbsp' not defined"?
Hi @Jan-WillemGmeligMeyling, nowadays the best approach is to never use entities in UTF-8 text: use a real chr(160), or input the XML as real UTF-8 text... Use only the basic entities, &lt;, &gt; and &amp;... Use xmlelement to convert invalid text into valid text (e.g. xmlelement(name foo,'A>B')) and cast to XML to preserve already-valid XML (e.g. '<A/>B'::xml)... In both cases chr(160) will be encoded as UTF-8.
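
A small, hedged illustration of that comment (the sample values are mine):

SELECT xmlelement(name foo, 'A>B');                   -- special characters are escaped: <foo>A&gt;B</foo>
SELECT '<A/>B'::xml;                                  -- already-valid XML content is preserved as-is
SELECT replace('a&nbsp;b', '&nbsp;', chr(160))::xml;  -- swap the entity for the real U+00A0 character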
10

The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.

Here's an example in PL/PerlU that takes advantage of the CPAN's HTML modules.

CREATE FUNCTION extract_contents_from_html(text) returns text AS $$
  use HTML::TreeBuilder;
  use HTML::FormatText;
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content(shift);
  my $formatter = HTML::FormatText->new(leftmargin=>0, rightmargin=>78);
  return $formatter->format($tree);
$$ LANGUAGE plperlu;

Demo:

select extract_contents_from_html('<html><body color="white">Hi there!<br>How are you?</body></html>') ;

Output:

     extract_contents_from_html 
    ----------------------------
     Hi there!
     How are you?

One needs to be aware of the caveats that come with untrusted languages, though.

3 Comments

One advantage of this technique is that if you are storing the actual HTML in the database but want to do text search on just the contents, you don't need to store the contents redundantly alongside the HTML. You could create a GiST or GIN index on (for example) to_tsvector('english', extract_contents_from_html(html_text)) -- see the sketch after these comments. I've done something very similar with extracting text from PDF files in the database using a function referencing the poppler library.
Using the standard DOM and standard PL/pgSQL is faster (!) and more reliable. See how to use it with XPath here.
That's not comparable. Your suggested regexp-based replacement will treat embedded JS code and CSS as if it were HTML content, whereas HTML::FormatText will skip it.
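
A hedged sketch of the indexing idea from the first comment above (the docs table and the query terms are mine); the function has to be marked IMMUTABLE before it can appear in an expression index:

ALTER FUNCTION extract_contents_from_html(text) IMMUTABLE;

CREATE TABLE docs (id serial PRIMARY KEY, html_text text);

-- full-text index over the extracted plain text, without storing that text redundantly
CREATE INDEX docs_fts_idx ON docs
  USING GIN (to_tsvector('english', extract_contents_from_html(html_text)));

-- query with the same expression so the planner can use the index
SELECT id FROM docs
WHERE to_tsvector('english', extract_contents_from_html(html_text))
      @@ plainto_tsquery('english', 'search words');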
7

Any solution performed in the RDBMS is going to involve either string handling or regexes: to my knowledge there is NO way to manipulate HTML in a standards-compliant, safe way in the database. To reiterate, what you are asking for is very, VERY unsafe.

A much better option is to do this in your application. This is application logic, and NOT the job or concern of your storage layer.

A great way to do this (in PHP, at least) would be HTML purifier. Don't do this in JavaScript, the user can tamper with it very easily.

4 Comments

Well, since the database is PostgreSQL, you have many languages available for writing functions which are capable of doing this safely and sanely (as pointed out by @DanielVérité). I agree regarding use of PostgreSQL regexp functions or plpgsql code, but surely you would agree that perl, with appropriate modules, is up to the task, regardless of whether that is done inside or outside the database. Without knowing more I can't say which I think is the best solution, but if HTML pages are stored in the database and they want text search capability on content, db-side might be a good idea.
Perl is up to the task. Running Perl inside of database operations is still much more expensive and is bad database design.
To say "any solution performed in the RDBMS is going to involve (...)" is a very delicate claim, and in this case it is wrong. The assertion "(...) there is NO way to manipulate HTML in a standards-compliant" way is also wrong: DOM (!) is a standard. See a DOM+XPath solution here at this other answer.
This is typical zealotry. We live in a big world, and sometimes doing stuff that might be dangerous in some circumstances can be used safely for providing a massive performance boost without completely rearchitecting an existing solution. That this provided valuable advice for the original poster is beside the point - Google brings you here for the general case and there are valid usecases that the other solutions cater to.
2
regexp_replace("Content",'\s*(<[^>]+>|<script.+?<\/script>|<style.+?<\/style>)\s*','','gi')

This works well for me: it removes common HTML tags while keeping the inner text (like some-text), and it removes script and style blocks together with their inner code.
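
A quick, hedged check of the same idea on a throwaway string of my own (the script/style alternatives are listed first here so the whole block is consumed before the generic tag pattern can grab just the opening tag):

SELECT regexp_replace(
  '<p>keep me</p><script>var x = 1;</script><style>p {color: red}</style>',
  '\s*(<script.+?</script>|<style.+?</style>|<[^>]+>)\s*', '', 'gi');
-- expected: 'keep me'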

Comments

-1

Don't do it in PostgreSQL.

It is not designed to do this.

Use PHP or whatever language you are using to serve webpages.

Be careful with regular expressions though. HTML is a complex language which cannot be described with regular expressions.

Use a DOM parser to strip out tags.

With regular expressions you can make sure that nothing unsafe is left, but you can easily strip out more than you want, or end up with malformed tags in the output.

3 Comments

Alas, an XML DOM parser will choke on most HTML, which tends to be invalid XML. Way better to do it in the app as you said, where you can get a more robust HTML-to-text tool or an HTML-friendly DOM.
@JamesMitch I think "Don't do it in PostgreSQL" is wrong; for me it is the best and fastest solution (!).
@CraigRinger, I prefer to say "a DOM parser will choke on most BAD HTML", so there are two good practices: 1) filter bad HTML with tidy-html5 before inserting into the database; 2) review the HTML production in your internal system, changing bad HTML into good. Any "normal HTML4 or HTML5" can be loaded by the standard DOM's loadHTML() method (even bad HTML, if you silence the warnings). I've been using libxml2 with success for years (to feed PostgreSQL's XML datatype); it is the same library used by PostgreSQL, Mozilla, Android, etc.
