PowerShell Regex for nested XML tags

Question

I need to repair several huge buggy XML files. Because they are buggy, I cannot just do:

[xml]$xml = Get-Content .\data.xml

I want to parse them with captured groups. However, I don't know how to handle nested tags.

Here is a simple example to illustrate my problem.

$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'
$Pattern = '<tag>(?<Content>.+?)</tag>'
([regex]::Matches($Xml, $Pattern)).Value

This piece of code returns:

<tag><tag><tag>Anything</tag>
<tag>Something else</tag>
<tag><tag>Another value</tag>
<tag>And another one...</tag>

How can I change my Regex pattern to get this?

<tag>Anything</tag>
<tag>Something else</tag>
<tag>Another value</tag>
<tag>And another one...</tag>

It seems that regex recursion would fit my needs. However, I couldn't find anyone explaining how it works in PowerShell (if it works at all).

.NET regex does not support recursion. It supports balanced constructs. And PowerShell has an XML parser.
– Wiktor Stribiżew, Commented Feb 10, 2019 at 21:51
@WiktorStribiżew, thanks for this information. I will try to find some documentation about this technique.
– Luke, Commented Feb 11, 2019 at 17:42
And why not? :-) There seem to be several possible ways to get the same result. Just searched for the meaning. Really interesting too. Thanks for this interesting suggestion.
– Luke, Commented Feb 11, 2019 at 18:21
Marco's solution will also match <tag><font><tag>Anything</font></tag> in <tag><tag><font><tag>Anything</font></tag></tag>. Not sure what you need.
– Wiktor Stribiżew, Commented Feb 11, 2019 at 18:24
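
For readers unfamiliar with the "balanced constructs" mentioned in the comments, here is a minimal sketch of a .NET balancing group applied to the sample string from the question. The pattern and the names $balancedPattern, $balanced and Depth are illustrative assumptions, not something posted in this thread. Note that it matches the outermost balanced element, not the innermost tags the question asks for, and that it assumes the input is on a single line, since . does not match newlines by default.

```powershell
# Sample input from the question.
$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'

# Hypothetical pattern: every nested <tag> pushes onto the "Depth" stack,
# every nested </tag> pops it. The conditional (?(Depth)(?!)) fails the
# match if any opening tag is still unclosed when the final </tag> is reached.
$balancedPattern = '<tag>(?>(?!</?tag>).|<tag>(?<Depth>)|</tag>(?<-Depth>))*(?(Depth)(?!))</tag>'

$balanced = [regex]::Matches($xml, $balancedPattern)

# On this sample the whole string is one balanced element.
$balanced.Count                  # 1
$balanced[0].Value -eq $xml      # True
```

This only shows what the construct does. Whether it helps with genuinely broken XML depends on how the files are broken.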

Marco Luzzara · Accepted Answer · 2019-02-10 20:40:16Z

2

Negative lookahead is enough.

<tag>(?!<tag>)(?<Content>.+?)<\/tag>

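It matches only the last <tag> in a run of consecutive opening tags, since that is the only one not immediately followed by another <tag> and therefore the only one that passes the lookahead check.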
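
As a quick check, the pattern can be run against the sample string from the question. This is a minimal sketch: the variable names $found, $values and $contents are illustrative, and the escaped slash is dropped because .NET does not require it.

```powershell
# Sample input from the question.
$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'

# An opening <tag> only matches when it is NOT immediately followed by another <tag>.
$pattern = '<tag>(?!<tag>)(?<Content>.+?)</tag>'

# Avoid naming the variable $matches: that is an automatic variable in PowerShell.
$found = [regex]::Matches($xml, $pattern)

# Full matches, tags included.
$values = $found | ForEach-Object { $_.Value }
# Only the text captured by the named group "Content".
$contents = $found | ForEach-Object { $_.Groups['Content'].Value }

$values
# <tag>Anything</tag>
# <tag>Something else</tag>
# <tag>Another value</tag>
# <tag>And another one...</tag>
```

The lookahead only inspects what directly follows the opening tag, so, as noted in the comments on the question, an opening <tag> followed by a different element that itself contains a <tag> will still match.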

Michael Kay · Answer · 2019-02-10 23:34:40Z

0

Your "specification" consists of a single example of input and desired output, which isn't necessarily a very good basis for writing code, but for the given example you could adopt the approach of replacing any sequence of <tag> start tags with a single <tag> start tag, and any sequence of </tag> end tags with a single </tag> end tag.

So replace (<tag>)+ by <tag>, and (</tag>)+ by </tag>.
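
In PowerShell this could be done with two chained -replace operations. This is a minimal sketch on the sample string from the question: the variable name $collapsed is illustrative, and non-capturing groups are used because the captured text is not needed.

```powershell
# Sample input from the question.
$xml = '<tag><tag><tag>Anything</tag><tag>Something else</tag></tag><tag><tag>Another value</tag><tag>And another one...</tag></tag></tag>'

# Collapse each run of start tags to one start tag, then each run of end tags to one end tag.
$collapsed = $xml -replace '(?:<tag>)+', '<tag>' -replace '(?:</tag>)+', '</tag>'

$collapsed
# <tag>Anything</tag><tag>Something else</tag><tag>Another value</tag><tag>And another one...</tag>
```

Unlike the Matches approach, this returns one rewritten string rather than a collection of individual elements.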

If I have misunderstood the question, then you need to find a way to describe the problem more clearly.

Of course, repairing bad XML is no substitute for fixing the buggy code that generated the bad XML in the first place.

1 Comment

Luke Over a year ago

I didn't think about this technique, but it would do the job too. Thanks!