I want to look for © in an HTML document, and basically get the entity the copyright is attributed to.

The copyright line shows up a couple of different ways:

<p class="bg-copy">&copy; 2011  The New York Times Company</p>

or

<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">
&copy; 2011</a> 
<a href="http://www.nytco.com/">The New York Times Company</a>

or

<br>Published since 1996<br>Copyright &copy; CounterPunch<br>
All rights reserved.<br>

I want to ignore the dates and intervening tags and just get "The New York Times Company" or "CounterPunch".

I haven't been able to find much on using regex with JavaScript or jQuery, though I get the impression that it can lead to major headaches. If there is a better approach to this, let me know.

  • Don't use regex, rather use the DOM tree to find what you are looking for. Some link : howtocreate.co.uk/tutorials/javascript/dombasics Commented Oct 30, 2011 at 19:00
  • Normally the response you'd get is - please, don't use regex for JS parsing. Use JS parser. Question is - can you? Commented Oct 30, 2011 at 19:00
  • @ZenMaster Regex is NOT the tool for this kind of parsing. Commented Oct 30, 2011 at 19:02

2 Answers

For a robust solution, you will probably need a combination of DOM navigation and some heuristics. Your examples are solvable with regex, but there are so many more scenarios possible...

&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)

works for your three samples. But ONLY for them and similar cases.

See on rubular

Explanation:

&copy; // the copyright entity, as written in the markup
[\s\d]* // followed by any run of whitespace or digits (the year)
(?:<\/.+?>[^>]*>)? // optionally followed by a closing tag and another opening one
([^<]*) // then capture everything up to the next tag

See this answer for how to use it in JavaScript with jQuery. Basically, you can call the string's match(/regex/) method; the captured name ends up in result[1]:

var result = string.match(/&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/)
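As a rough sketch of how that call could be wrapped and checked against the three samples from the question: the helper name extractCopyrightHolder and the samples array are made up for illustration, and the markup is assumed to be available as a raw string in which the symbol is still written as the entity &copy; rather than as a literal ©.

```javascript
// Pattern from the answer above; capture group 1 holds the text after the symbol.
var COPYRIGHT_RE = /&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/;

// Hypothetical helper (not part of any library): returns the captured
// holder name, or null when the pattern does not match at all.
function extractCopyrightHolder(html) {
    var result = html.match(COPYRIGHT_RE);
    return result ? result[1].trim() : null;
}

// The three snippets from the question, as raw markup strings.
var samples = [
    '<p class="bg-copy">&copy; 2011  The New York Times Company</p>',
    '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">\n' +
        '&copy; 2011</a> \n' +
        '<a href="http://www.nytco.com/">The New York Times Company</a>',
    '<br>Published since 1996<br>Copyright &copy; CounterPunch<br>\n' +
        'All rights reserved.<br>'
];

var holders = samples.map(extractCopyrightHolder);
console.log(holders);
```

The null check matters because match() returns null when nothing matches, so indexing the result directly would throw on pages without a copyright line.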

2 Comments

thanks, I see that that works, but I decided to find "&copy;" encoding in a page and parse that element. However, now I'm having trouble with that: stackoverflow.com/questions/8282250/…
also, would you mind breaking down your regex for me? I don't really understand it. and how would I use this in javascript?
0
// Select every element whose text contains ©, then keep only the innermost ones
$('*:contains("©")').filter(function () {
    return $(this).find('*:contains("©")').length === 0;
}).text();

test it here http://jsfiddle.net/unloco/kGPYA/
