Help with a regular expression pattern to extract some text from HTML in C#

Question

I have this HTML block:

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">nanana<span>bababa</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>


<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

What I need is to extract only the text marked "(this text needs to be extracted)". Here is what I have so far:

<th[^>]*>(.*?)<input[^>]*name="myUniqueInput"[^>]*>

The problem with this pattern is that it matches everything from the first <th> up to "myUniqueInput". Any idea how to fix this? Thanks in advance.
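To make the reported behaviour concrete, here is a minimal sketch that reproduces it. It assumes RegexOptions.Singleline is in effect (the question does not say which options are used), and the HTML is a shortened stand-in for the block above. The class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OvermatchDemo
{
    // Shortened stand-in for the question's HTML: one unwanted row, then the wanted one.
    public const string Html =
        "<tr><th colspan=\"2\">some text</th></tr>\n" +
        "<tr><td class=\"row1\">lalala<span>dadada</span></td>\n" +
        "<td class=\"row2\"><input name=\"unwantedinput\"></td></tr>\n" +
        "<tr><th colspan=\"2\">Some other text</th></tr>\n" +
        "<tr><td class=\"row1\">(this text needs to be extracted)</td>\n" +
        "<td class=\"row2\"><input name=\"myUniqueInput\"></td></tr>\n";

    // The pattern from the question, unchanged.
    public const string Pattern =
        "<th[^>]*>(.*?)<input[^>]*name=\"myUniqueInput\"[^>]*>";

    public static Match Run()
    {
        // Assumption: Singleline is on, so '.' also matches newlines.
        return Regex.Match(Html, Pattern, RegexOptions.Singleline);
    }

    public static void Main()
    {
        Match m = Run();
        // The match starts at the FIRST <th>, not at the one nearest the input.
        Console.WriteLine(m.Index);
        // So the lazy group still swallows the unwanted row on its way to the input.
        Console.WriteLine(m.Groups[1].Value.Contains("unwantedinput"));
    }
}
```

A lazy quantifier only limits how far the group extends to the right; the engine still takes the leftmost position where a match can start, which here is the first <th>.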

Duplicate of "RegEx match open tags except XHTML self-contained tags", "Regular expression to find a value in a webpage", and too many others to count.
– outis, Commented Apr 30, 2011 at 9:05

Johan Soderberg · Accepted Answer · 2011-04-30 09:06:55Z

1

/<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/

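You can match more or less depending on how well you know what the HTML will look like. The idea is to anchor on the <td> that directly precedes the input with the unique name, then capture everything between the opening <td> and closing </td> of the cell before it. (The surrounding slashes are regex delimiters from other languages; leave them out in C#.)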
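As a rough illustration, the pattern above could be used from C# as follows. This is only a sketch: the class name CellExtractor and the method TryExtract are hypothetical, the delimiting slashes are dropped, and the sample HTML is a shortened copy of the question's block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CellExtractor
{
    // The answer's pattern without the /.../ delimiters, as a verbatim string.
    static readonly Regex Pattern = new Regex(
        @"<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name=""myUniqueInput""");

    // Hypothetical helper: returns true and the cell text when the pattern matches.
    public static bool TryExtract(string html, out string text)
    {
        Match m = Pattern.Match(html);
        text = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string html =
            "<tr>\n<td class=\"row1\">lalala<span>dadada</span></td>\n" +
            "<td class=\"row2\"><input name=\"unwantedinput\"></td>\n</tr>\n" +
            "<tr>\n<td class=\"row1\">(this text needs to be extracted)</td>\n" +
            "<td class=\"row2\"><input name=\"myUniqueInput\"></td>\n</tr>\n";

        string text;
        if (TryExtract(html, out text))
            Console.WriteLine(text); // prints "(this text needs to be extracted)"
    }
}
```

Because the capture group is [^<]*, it stops at the first tag inside the cell, so this variant only works when the wanted cell contains plain text.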


3 Comments

Desolator Over a year ago

I need it to be more dynamic. For example, if I want it to match "<td class="row1"><span>(this text needs to be extracted)</span></td> <td class="row2"><input name="myUniqueInput"></td>", it will fail.

Johan Soderberg Over a year ago

/<td[^>]*>(.*?)<\/td>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/ This will capture the text with the <span> included, so the result needs to be filtered afterwards. How dynamic does it have to be? :)

Desolator Over a year ago

Well, it partially works with the HTML I have. Anyway, thanks for the idea. I will try to make the pattern more dynamic so that it matches everything I need. Thanks for your help :)
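The "capture with the <span>, then filter" idea from the comments above could look roughly like this in C#. It is a sketch under assumptions: the stray ">*" after "</td>" in the commented pattern is dropped, the tag-stripping regex is a simple stand-in for the filtering step, and the names SpanTolerantExtractor and Extract are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanTolerantExtractor
{
    // (.*?) captures the whole cell, inner tags included.
    // No RegexOptions.Singleline on purpose: '.' must not cross line breaks,
    // otherwise the match would start at an earlier <td> and overshoot.
    static readonly Regex Cell = new Regex(
        @"<td[^>]*>(.*?)</td>\s*<td[^>]*>\s*<input[^>]*name=""myUniqueInput""");

    // Naive filter that removes anything that looks like a tag.
    static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Returns the cell text with inner tags removed, or "" when nothing matches.
    public static string Extract(string html)
    {
        Match m = Cell.Match(html);
        return m.Success ? Tag.Replace(m.Groups[1].Value, "").Trim() : string.Empty;
    }

    public static void Main()
    {
        string html =
            "<td class=\"row1\">lalala<span>dadada</span></td>\n" +
            "<td class=\"row2\"><input name=\"unwantedinput\"></td>\n" +
            "<td class=\"row1\"><span>(this text needs to be extracted)</span></td>\n" +
            "<td class=\"row2\"><input name=\"myUniqueInput\"></td>\n";

        Console.WriteLine(Extract(html)); // prints "(this text needs to be extracted)"
    }
}
```

This relies on each row's cells sitting on their own lines, as in the question's HTML. If several rows share one line, the lazy group can still start at an earlier <td>, so the pattern would need tightening for that layout.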

Brian Willis · Accepted Answer · 2011-04-30 08:57:55Z

0

It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly. Have you considered using a library to parse the HTML for you, and then extracting the data from there?


3 Comments

Ankur Over a year ago

As for a library to parse HTML, you can use the HTML Agility Pack (htmlagilitypack.codeplex.com). It is .NET-specific :)

Desolator Over a year ago

Thanks for the answer. I know that there are HTML parsers, but I'm not interested in that right now. I just need a quick and dirty solution.

Andrew Savinykh Over a year ago

I don't agree that "it's generally accepted that regular expressions aren't expressive enough to parse HTML correctly". You are probably conflating the bracket-matching problem, which can't be solved with "standard" regular expressions, with the more general case of web scraping. Regular expressions are quite sufficient for web scraping in many cases. And even bracket matching can be achieved with certain regex implementations, for example the one in the .NET Framework.
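For readers wondering what bracket matching in the .NET implementation looks like, here is a small sketch using balancing groups on plain parentheses. It illustrates the feature the comment alludes to and is not a recipe for parsing HTML. The class name BalancedParens and the group name "open" are arbitrary choices.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedParens
{
    // (?<open>\()    pushes a capture onto the "open" stack for every '('.
    // (?<-open>\))   pops one capture for every ')', and fails if the stack is empty.
    // (?(open)(?!))  fails at the end if any '(' is still unclosed.
    static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("(a(b)c)")); // True
        Console.WriteLine(IsBalanced("(a(b"));    // False
        Console.WriteLine(IsBalanced("a)b("));    // False
    }
}
```

The same push/pop idea can be adapted to other paired delimiters, though real HTML adds enough irregularities (attributes, comments, unclosed tags) that a parser is usually the more robust choice.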
