I need to repair several huge buggy XML files. Because they are buggy, I cannot just do:

[xml]$xml = Get-Content .\data.xml

I want to parse them with captured groups. However, I don't know how to handle nested tags.

Here is a simple example to illustrate my problem.

$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'
$Pattern = '<tag>(?<Content>.+?)</tag>'
([regex]::Matches($Xml, $Pattern)).Value

This piece of code returns:

<tag><tag><tag>Anything</tag>
<tag>Something else</tag>
<tag><tag>Another value</tag>
<tag>And another one...</tag>

How can I change my Regex pattern to get this?

<tag>Anything</tag>
<tag>Something else</tag>
<tag>Another value</tag>
<tag>And another one...</tag>

It seems that regex recursion would fit my needs. However, I couldn't find anyone explaining how it works in PowerShell (if it works at all).

  • .NET regex does not support recursion. It supports balanced constructs. And PowerShell has an XML parser. Commented Feb 10, 2019 at 21:51
  • @WiktorStribiżew, thanks for this information. I will try to find some documentation about this technique. Commented Feb 11, 2019 at 17:42
  • Why use lookahead? Use <tag>(?<Content>[^<]*)</tag> Commented Feb 11, 2019 at 17:44
  • And why not? :-) There seem to be several possible ways to get the same result. I just looked up what it means, and it is really interesting too. Thanks for the suggestion. Commented Feb 11, 2019 at 18:21
  • Marco's solution will also match <tag><font><tag>Anything</font></tag> in <tag><tag><font><tag>Anything</font></tag></tag>. Not sure what you need. Commented Feb 11, 2019 at 18:24

2 Answers

Negative lookahead is enough.

<tag>(?!<tag>)(?<Content>.+?)<\/tag>

It matches only a <tag> that is not immediately followed by another <tag>. In a run of start tags that is the last one, since it is the only one that passes the lookahead check.
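
As a rough sketch of how this could be tried out, the snippet below runs the lookahead pattern against the sample string from the question and, for comparison, the [^<]* variant suggested in the comments. The variable names ($lookahead, $noMarkup and so on) are only illustrative, and the slash is left unescaped because .NET does not require escaping it.

```powershell
# Sample input copied from the question.
$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'

# Pattern from this answer: reject a <tag> that is directly followed by another <tag>.
$lookahead = '<tag>(?!<tag>)(?<Content>.+?)</tag>'

# Variant from the comments: the content may not contain any "<" at all.
$noMarkup = '<tag>(?<Content>[^<]*)</tag>'

$lookaheadMatches = [regex]::Matches($xml, $lookahead)
$noMarkupMatches  = [regex]::Matches($xml, $noMarkup)

# Whole matches, one per line.
$lookaheadMatches | ForEach-Object { $_.Value }

# Only the text captured by the named group "Content".
$lookaheadMatches | ForEach-Object { $_.Groups['Content'].Value }
```

On this particular input both patterns find the same four elements. They differ once other markup appears inside a tag: as noted in the comments, the lookahead version can still swallow something like <font>, while the [^<]* version would skip such an element entirely.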

Your "specification" consists of a single example of input and desired output, which isn't necessarily a very good basis for writing code. For the given example, though, you could replace any sequence of <tag> start tags with a single <tag> start tag, and any sequence of </tag> end tags with a single </tag> end tag.

So replace (<tag>)+ by <tag>, and (</tag>)+ by </tag>.
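
In PowerShell, a minimal sketch of these two replacements could use the -replace operator, again on the sample string from the question. The variable names ($flattened, $elements) and the final extraction step are my own additions for illustration, not part of the suggestion above.

```powershell
# Sample input copied from the question.
$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'

# Collapse every run of start tags, then every run of end tags, to a single tag.
$flattened = $xml -replace '(<tag>)+', '<tag>' -replace '(</tag>)+', '</tag>'

# Assumption: after flattening, no element contains "<", so each one can be
# picked out with a simple pattern and emitted on its own line.
$elements = [regex]::Matches($flattened, '<tag>[^<]*</tag>') | ForEach-Object { $_.Value }
$elements
```

Note that the flattened string is a sequence of sibling elements with no single root, so it would still need a wrapper element before it could be cast to [xml].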

If I have misunderstood the question, then you need to find a way to describe the problem more clearly.

Of course, repairing bad XML is no substitute for fixing the buggy code that generated the bad XML in the first place.

1 Comment

I didn't think about this technique, but it would do the job too. Thanks!
