I'm trying to use a regular expression in VBScript to replace the contents of an HTML tag that has the class 'candidate' with the text 'PLACEHOLDER'. However, it doesn't always work.

<[^\>]*class=""[^\>]*candidate[^\>]*""[^\>]*>([\s\S]*?)</[^\>]*>

Flags: IgnoreCase = True, Multiline = True, Global = True

The first issue is that I don't know in advance which type of HTML tag will carry this class (e.g. it might be a <div> tag or a <p> tag). Secondly, the regex doesn't work well when the element contains inner HTML tags.
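
The second problem can be reproduced in isolation. The following is a minimal sketch, assuming the pattern and flags quoted above and a shortened, single-line subject string made up for illustration. It suggests that the lazy group ([\s\S]*?) ends at the first closing tag it meets, whichever element that tag belongs to:

```vbscript
Option Explicit

' Same pattern and flags as in the question.
Dim re
Set re = New RegExp
re.IgnoreCase = True
re.Multiline = True
re.Global = True
re.Pattern = "<[^\>]*class=""[^\>]*candidate[^\>]*""[^\>]*>([\s\S]*?)</[^\>]*>"

' Shortened subject (an assumption made for this illustration only).
Dim subject
subject = "<div class=""candidate""><p>Test 1:</p><ul><li>Test 2</li></ul></div>"

Dim m
For Each m In re.Execute(subject)
    ' The match stops at the first "</...>", here the closing </p>:
    ' <div class="candidate"><p>Test 1:</p>
    WScript.Echo m.Value
Next
```

Because the pattern has no notion of nesting, it cannot tell which closing tag pairs with the opening candidate tag.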

Subject HTML:

<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
<p>Test 1:</p>
<ul>
    <li>Test 2</li>
    <li>Test 3 </li>
    <li>Test 4 </li>
</ul>
<p>Test 5</p>
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>

Expected:

<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
PLACEHOLDER
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>

Actual:

<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
PLACEHOLDER
    <li>Test 2</li>
    <li>Test 3 </li>
    <li>Test 4 </li>
</ul>
<p>Test 5</p>
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>

The candidate tag may also contain inner tags of the same type but with different classes; the regex handles that case only sporadically.

e.g:

<div class="candidate">Test<div class="normal">Test</div></div>

Any help would be greatly appreciated.

1 Answer

Does it have to be a regular expression? The task is really easy using MSHTML (or any other HTML parser). In this example, I put your subject HTML in a file called "test.htm":

Option Explicit

Const ForReading = 1

Dim fso
Set fso = CreateObject("Scripting.FileSystemObject")
Dim inFile
Set inFile = fso.OpenTextFile("test.htm", ForReading)

Dim html
Set html = CreateObject("htmlfile")
html.write inFile.ReadAll()
inFile.Close

Dim allElements
Set allElements = html.getElementsByTagName("*")

Dim el
For Each el in allElements
    If (HasClass(el, "candidate")) Then
        el.innerText = "PLACEHOLDER"
    End If
Next

WScript.Echo html.body.outerHtml

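' Takes into account the fact that the HTML "class" attribute can
' contain multiple whitespace-delimited classes. The name must be
' surrounded by whitespace or the ends of the string, so "candidate"
' does not match "candidate-list" (which "\b" would allow, because
' "-" counts as a word boundary).
Function HasClass(el, className)
    Dim re
    Set re = New RegExp

    re.Pattern = "(^|\s)" & className & "(\s|$)"
    HasClass = re.Test(el.className)
End Function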
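
The nested one-line example from the question can be handled the same way. The following is a rough sketch rather than a tested implementation. It assumes the HTML is passed in as a string instead of being read from a file, and it uses a simple InStr check that only treats spaces as class delimiters. It walks the collection backwards by index, so replacing an element's contents does not shift the positions of elements that are still to be visited:

```vbscript
Option Explicit

Dim html
Set html = CreateObject("htmlfile")
html.write "<div class=""candidate"">Test<div class=""normal"">Test</div></div>"
html.close

Dim divs
Set divs = html.getElementsByTagName("div")

Dim i, el
' Iterate from the last element to the first: the collection is live,
' and setting innerText removes the element's descendants from it.
For i = divs.length - 1 To 0 Step -1
    Set el = divs.item(i)
    ' Space-padded comparison (assumes classes are separated by spaces only).
    If InStr(" " & el.className & " ", " candidate ") > 0 Then
        el.innerText = "PLACEHOLDER"
    End If
Next

WScript.Echo html.body.innerHTML
```

Since the parser has already built the element tree, the inner div is simply discarded together with the rest of the candidate's contents.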

1 Comment

+1 for hinting to adapt the approach instead of using the Golden Hammer of Regex (+2 int, -2 wis)
