I have this HTML block:

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">nanana<span>bababa</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>


<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

What I need is to extract only the text "(this text needs to be extracted)", that is, the contents of the cell just before the "myUniqueInput" input. Here is what I've done so far:

<th[^>]*>(.*?)<input[^>]*name="myUniqueInput"[^>]*>

The problem with this pattern is that it matches everything from the first <th> up to "myUniqueInput". Any idea how to fix this? Thanks in advance.

2 Answers

1
/<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/

You can match more or less depending on how well you know what the HTML will look like. The idea is to anchor on the input name, skip back over the <td> that contains it, and then capture everything between the previous <td> and its </td>.
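
To make the idea concrete, here is a minimal sketch of the pattern above in Python (the question does not name a language, so Python is an assumption, as are the helper name extract_cell_text and the shortened sample markup). Because [^<]* and [^>]* cannot run across tag boundaries, the match cannot start at an earlier row and stretch down to "myUniqueInput" the way (.*?) does in the question.

```python
import re

# Shortened copy of the markup from the question (assumed sample data).
html = """
<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>
"""

# Same pattern as in the answer, without the /.../ delimiters.
pattern = re.compile(
    r'<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"'
)


def extract_cell_text(source):
    """Return the plain text of the cell before the input, or None."""
    match = pattern.search(source)
    return match.group(1) if match else None


print(extract_cell_text(html))
```

Note that the capture group only accepts plain text, so a cell that contains nested tags will not match with this version.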


3 Comments

I need it to be more dynamic. For example, it fails on this input: <td class="row1"><span>(this text needs to be extracted)</span></td> <td class="row2"><input name="myUniqueInput"></td>
/<td[^>]*>(.*?)<\/td>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/ This captures the cell contents with the <span> tags included, so they need to be filtered out afterwards. How dynamic does it have to be? :)
Well, it partially works with the HTML I have. Anyway, thanks for the idea. I will try to make the pattern more dynamic so that it matches everything I need. Thanks for your help :)
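
The capture-then-filter approach from this comment thread could look roughly like the sketch below. It is written in Python as an assumption, and the names CELL_BEFORE_INPUT, TAG, and extract_text are made up for illustration. It also swaps (.*?) for a stricter group that refuses to cross a </td>, so the match cannot start in an earlier cell even when the whole markup sits on one line.

```python
import re

# Capture everything inside the cell that directly precedes the wanted input.
# (?:(?!</td>).)* consumes characters one at a time but never steps over a
# closing </td>, so the capture stays inside a single cell.
CELL_BEFORE_INPUT = re.compile(
    r'<td[^>]*>((?:(?!</td>).)*)</td>\s*<td[^>]*>\s*'
    r'<input[^>]*name="myUniqueInput"',
    re.DOTALL,
)

# Crude tag stripper used as the "filter afterwards" step.
TAG = re.compile(r'<[^>]+>')


def extract_text(source):
    """Return the cell text with nested tags removed, or None if not found."""
    match = CELL_BEFORE_INPUT.search(source)
    if match is None:
        return None
    return TAG.sub('', match.group(1)).strip()


sample = (
    '<td class="row1">lalala<span>dadada</span></td> '
    '<td class="row2"><input name="unwantedinput"></td> '
    '<td class="row1"><span>(this text needs to be extracted)</span></td> '
    '<td class="row2"><input name="myUniqueInput"></td>'
)
print(extract_text(sample))
```

This is still a quick and dirty solution: it assumes the input sits in the cell immediately after the wanted one and that cells are not nested.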
0

It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly. Have you considered using a library to parse the HTML for you, and then extracting the data from there?
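
As a rough illustration of the parser route, here is a sketch using Python's standard-library html.parser module. The choice of Python, the class name CellBeforeInput, and the helper find_cell_before_input are all assumptions for the example, not something the question specifies. It remembers the text of the most recently closed <td> and reports it when the wanted input shows up.

```python
from html.parser import HTMLParser


class CellBeforeInput(HTMLParser):
    """Collect the text of the <td> that precedes a given <input name=...>."""

    def __init__(self, input_name):
        super().__init__()
        self.input_name = input_name
        self.in_cell = False
        self.current = []     # text pieces of the cell being read
        self.previous = None  # text of the last completed cell
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.current = []
        elif tag == "input" and dict(attrs).get("name") == self.input_name:
            # Keep only the first hit.
            if self.result is None:
                self.result = self.previous

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.previous = "".join(self.current).strip()
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <span> is still delivered here.
        if self.in_cell:
            self.current.append(data)


def find_cell_before_input(source, input_name):
    parser = CellBeforeInput(input_name)
    parser.feed(source)
    parser.close()
    return parser.result


sample = """
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
<tr>
<td class="row1"><span>(this text needs to be extracted)</span></td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>
"""
print(find_cell_before_input(sample, "myUniqueInput"))
```

Unlike the regex versions, this does not care about attribute order, whitespace, or nested tags inside the cell.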

3 Comments

If you want a library to parse the HTML, you can use the HTML Agility Pack (htmlagilitypack.codeplex.com). It is .NET-specific :)
Thanks for the answer. I know there are HTML parsers, but I'm not interested in that right now. I just need a quick and dirty solution.
I don't agree that "It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly". You are probably conflating the bracket-matching problem, which "standard" regular expressions cannot solve, with the more general case of web scraping. Regular expressions are quite sufficient for web scraping in many cases. Even bracket matching can be achieved with certain regular expression implementations, for example the one in the .NET Framework.
