
I'm using a PowerShell script to automate the replacement of some troublesome characters in an XML file, such as & ' - £

The script works well for these characters, but I also want to remove the double quote character " when it appears inside an XML attribute value. Attribute values are themselves enclosed in double quotes, so I cannot simply remove every double quote from the file without breaking the attributes.

My PowerShell script is below:

(Get-Content C:\test\communication.xml) | 
Foreach-Object {$_ -replace "&", "+" -replace "£", "GBP" -replace "'", "" -replace "–", " "} |
Set-Content C:\test\communication.xml

What I'd like to be able to do is remove ONLY the double quotes that appear inside an XML attribute value, leaving the pair of double quotes that enclose the value intact, as shown below. I know that PowerShell treats each line as a separate object, so I suspect this should be quite easy, possibly by using conditions?

An example XML file is below:

<?xml version="1.0" encoding="UTF-8"?>
<Portal> 
<communication updates="Text data with no double quotes in the attribute" />
<communication updates="Text data that "includes" double quotes within the double quotes for the attribute" />
</Portal>

In the above example I'd like to remove only the double quotes that immediately surround the word "includes", but not the double quote to the left of the word "Text" or the one to the right of the word "attribute". The attribute text will change on a regular basis, but the opening double quote will always be immediately to the right of the = symbol, and the closing double quote will always be immediately to the left of a space followed by a forward slash ( /). Thanks

  • You can daisy-chain -replace operations ($_ -replace "&","+" -replace "£","GBP" ...). A separate loop for each replacement is not required. Commented Jul 1, 2013 at 20:51
  • great thanks for that little tip Ansgar, I'll amend my code. Any idea on the replacing of specific double quotes? Commented Jul 1, 2013 at 20:54
  • Do the files always have just one attribute per line? Commented Jul 1, 2013 at 21:38
  • That's not valid XML. Can you fix the source instead? Double quotes within an XML attribute should be encoded as &quot;, e.g., <communication updates="Text data that &quot;includes&quot; double quotes within the double quotes." /> Commented Jul 1, 2013 at 22:09
  • @splattered thanks, I know it's not valid XML, which is why I want to be able to remove the rogue characters. Human users manually amend the XML attributes in a text editor and save the file, so I'm not able to fix the source; rogue characters and bad XML are inevitable. The XML files are then synced to tablets, where a locally hosted HTML page uses JavaScript to parse the XML and add the attribute contents to the page as text. If users enter illegal characters, the attributes don't get displayed because the XML cannot be parsed, and some characters such as ' and £ don't format properly. Commented Jul 2, 2013 at 21:00

1 Answer

Try this regex:

"(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")"

In your code it would be:

(Get-Content C:\test\communication.xml) | 
Foreach-Object {$_ -replace "&", "+" `
    -replace "£", "GBP" `
    -replace "'", "" `
    -replace "–", " " `
    -replace "(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")", ""} |
Set-Content C:\test\communication.xml

This will take any " that has another " both in front of and behind it on the same line (except on a line that has ?xml in it) and replace it with nothing.

Edit to include a breakdown of the regex:

(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")

1. (?<!\?xml.*)----> Negative lookbehind: skips any quote that has "?xml" before it on the same line
2. (?<=`".*?)------> Lookbehind searching for a quotation mark.  
       The ` is to escape the quotation mark, which is needed for powershell
3. `"--------------> The actual quotation mark you are searching for
4. (?=.*?`")-------> Lookahead searching for a quotation mark
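
To see the pattern in action without touching a file, here is a minimal sketch that applies it to the sample lines from the question. The pattern is written in a single-quoted string, so the backtick escapes are not needed. The last input line, with two attributes, is a hypothetical addition that is not part of the original example; it is there only to illustrate a limitation.

```powershell
# Same regex as above, in single quotes so " needs no backtick escape.
$pattern = '(?<!\?xml.*)(?<=".*?)"(?=.*?")'

$lines = @(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<communication updates="Text data with no double quotes in the attribute" />',
    '<communication updates="Text data that "includes" double quotes within the double quotes for the attribute" />',
    # Hypothetical line with two attributes (not in the original example).
    '<communication updates="first" notes="second" />'
)

# Remove every quote that has another quote both before and after it on the line.
$cleaned = $lines | ForEach-Object { $_ -replace $pattern, '' }
$cleaned
```

The first two lines should come back unchanged and the third should lose only the quotes around "includes". The fourth line comes back as `<communication updates="first notes=second" />`, because the quotes between the two attributes also have a quote on either side. The approach therefore seems safe only when each line holds a single attribute.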

For more information about lookbehinds and lookaheads see this site


2 Comments

Hi Nick, I now have an extra challenge (probably quite a small one) which I'd really appreciate your help with... I had scheduled this script to run at a fixed frequency, but what I'd now like to do is run it every time a change is made to one of the XML files. I have an app which handles this nicely, but because the script saves the file each time it runs, every run is seen as a new change and this creates an infinite loop. Ideally I'd like the Set-Content command to run ONLY if any characters have been replaced. Hopefully this is possible. Thanks again
@ladders81 It shouldn't be too hard. Just create a variable with everything in communication.xml and run the replaces on it, saving to a new variable. Then compare the two: if they are the same, do nothing; if not, do the Set-Content. If you're unsure how to do that, create a new question and I'm sure people will be able to guide you.
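
The compare-then-write idea from the comment above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `Update-CleanedXml` is hypothetical, the £ and en dash characters are built from their code points so the script does not depend on how it is saved, and the comparison is done on the joined text with the case-sensitive `-cne` operator.

```powershell
# Hypothetical helper: rewrites the file only when a replacement changed something.
function Update-CleanedXml {
    param([Parameter(Mandatory = $true)][string]$Path)

    # Build the non-ASCII characters from code points (an assumption made here
    # to avoid script-encoding problems; literals work too if the file encoding is right).
    $pound  = [string][char]0x00A3   # £
    $enDash = [string][char]0x2013   # en dash

    # Read all lines first so the file is closed before any write.
    $original = @(Get-Content -Path $Path)

    $cleaned = @($original | ForEach-Object {
        $_ -replace '&', '+' `
           -replace $pound, 'GBP' `
           -replace "'", '' `
           -replace $enDash, ' ' `
           -replace '(?<!\?xml.*)(?<=".*?)"(?=.*?")', ''
    })

    # Write back only if the text actually differs (case-sensitive comparison).
    if (($original -join "`n") -cne ($cleaned -join "`n")) {
        Set-Content -Path $Path -Value $cleaned
        return $true
    }
    return $false
}
```

It could then be called as `Update-CleanedXml -Path C:\test\communication.xml`. The return value is `$true` when the file was rewritten and `$false` when it was left alone, so a second run on an already clean file should not trigger the file watcher again.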
