Select HTML text element with regex?

Question

I want to look for © in an HTML document, and basically get the entity the copyright is attributed to.

The copyright line shows up a couple of different ways:

<p class="bg-copy">&copy; 2011  The New York Times Company</p>

or

<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">
&copy; 2011</a> 
<a href="http://www.nytco.com/">The New York Times Company</a>

or

<br>Published since 1996<br>Copyright &copy; CounterPunch<br>
All rights reserved.<br>

I want to ignore the dates and intervening tags and just get "The New York Times Company" or "CounterPunch".

I haven't been able to find much on using regex with JavaScript or jQuery, though I get the impression that it can lead to major headaches. If there is a better approach, let me know.

Don't use regex; use the DOM tree to find what you are looking for instead. A link: howtocreate.co.uk/tutorials/javascript/dombasics
– FailedDev, Commented Oct 30, 2011 at 19:00
Normally the response you'd get is: please don't use regex for JS parsing, use a JS parser. The question is, can you?
– ZenMaster, Commented Oct 30, 2011 at 19:00
@ZenMaster Regex is NOT the tool for this kind of parsing.
– FailedDev, Commented Oct 30, 2011 at 19:02

Community · Accepted Answer · 2017-05-23 11:44:01Z

2

For a robust solution, you will probably need a combination of DOM navigation and some heuristics. Your examples are solvable with regex, but there are so many more scenarios possible...

&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)

works for your three samples, but ONLY for them and similar cases.

See it on Rubular.

Explanation:

&copy; // the copyright entity
[\s\d]* // followed by any whitespace or digits (the year)
(?:<\/.+?>[^>]*>)? // optionally followed by a closing tag and the next opening tag
([^<]*) // then capture everything up to the next tag

See this answer on how to use it in JavaScript with jQuery. Basically, you can use the match(/regex/) function; the captured text is at index 1 of the returned array:

var result = string.match(/&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/);
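
As a minimal sketch of how that result might be consumed: match() returns null when nothing matches, so the capture group is worth guarding. The helper name extractCopyrightHolder is made up for illustration; the sample strings are copied from the question.

```javascript
// Regex from the answer above; capture group 1 holds the text after the date.
var COPYRIGHT_RE = /&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/;

// Hypothetical helper: returns the captured text, trimmed, or null if no match.
function extractCopyrightHolder(html) {
    var result = html.match(COPYRIGHT_RE);
    return result ? result[1].trim() : null;
}

// The three samples from the question, as raw HTML strings.
var samples = [
    '<p class="bg-copy">&copy; 2011  The New York Times Company</p>',
    '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">\n' +
        '&copy; 2011</a> \n' +
        '<a href="http://www.nytco.com/">The New York Times Company</a>',
    '<br>Published since 1996<br>Copyright &copy; CounterPunch<br>\n' +
        'All rights reserved.<br>'
];

samples.forEach(function (html) {
    console.log(extractCopyrightHolder(html));
});
```

An input where an opening tag directly follows the year, such as the made-up string '&copy; 2011 <span>Example Corp</span>', yields an empty string, which shows how narrow the pattern is.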

edited May 23, 2017 at 11:44

CommunityBot

answered Oct 30, 2011 at 19:48

morja

2 Comments

tarayani Over a year ago

Thanks, I see that it works, but I decided to find the "©" encoding in a page and parse that element. However, now I'm having trouble with that: stackoverflow.com/questions/8282250/…

tarayani Over a year ago

Also, would you mind breaking down your regex for me? I don't really understand it. And how would I use this in JavaScript?

unloco · Accepted Answer · 2011-11-29 13:38:27Z

0

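// Select elements whose text contains ©, then keep only the innermost ones
// (those with no descendant that also contains ©) and read their text.
$('*:contains(©)').filter(function () {
    return $(this).find('*:contains(©)').length === 0;
}).text();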

Test it here: http://jsfiddle.net/unloco/kGPYA/
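
The text returned by the snippet above still includes the © symbol, the year, and possibly surrounding words such as "Copyright" or "All rights reserved". A rough sketch of a cleanup step in plain JavaScript follows; the helper name cleanCopyrightText and its heuristics (skip leading years, stop at the first line break, drop a trailing rights notice) are assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper: reduce the text of the matched element to the holder name.
// Returns null when the text contains no © symbol at all.
function cleanCopyrightText(text) {
    var idx = text.indexOf("\u00A9"); // "\u00A9" is the © character
    if (idx === -1) {
        return null;
    }
    return text
        .slice(idx + 1)                                 // keep only what follows the symbol
        .replace(/^[\s\d,\-\u2013]+/, "")               // drop leading spaces, years, year ranges
        .split(/[\r\n]/)[0]                             // stop at the first line break
        .replace(/\s*All rights reserved\.?\s*$/i, "")  // drop a trailing rights notice
        .trim();
}

console.log(cleanCopyrightText("\u00A9 2011  The New York Times Company"));
console.log(cleanCopyrightText("Published since 1996Copyright \u00A9 CounterPunch\nAll rights reserved."));
```

In a page, the string passed in would be the result of the .text() call above. For the second sample in the question, the innermost element containing © is the link that holds only the year, so this cleanup returns an empty string there and the sibling link would have to be inspected separately.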

answered Nov 29, 2011 at 13:38

unloco
