
The following Python code:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import time
import fileinput
import re

ts = str(int(time.time()))
modifiedline =''
for line in fileinput.input("singleoutbound.xml"):
    line = re.sub('OrderName=".*"','OrderName="'+ts+'"', line)
    line = re.sub('OrderNo=".*"','OrderNo="'+ts+'"', line)

    line = re.sub('ShipmentNo=".*"','ShipmentNo="'+ts+'"', line)

    line = re.sub('TrackingNo=".*"','TrackingNo="'+ts+'"', line)

    line = re.sub('WaveKey=".*"','WaveKey="'+ts+'"', line)
    modifiedline=modifiedline+line

produces a modifiedline string in which some lines are truncated from the point where the first match is found.

How do I ensure it returns the complete string for each line?

Edit:

I have changed the way I am solving this problem, inspired by Tomalak's answer

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import time

ts = str(int(time.time()))

doc = ET.parse('singleoutbound.xml')

for elem in doc.iterfind('.//*'):
    if 'OrderName' in elem.attrib:
        elem.attrib['OrderName'] = ts
    if 'OrderNo' in elem.attrib:
        elem.attrib['OrderNo'] = ts
    if 'ShipmentNo' in elem.attrib:
        elem.attrib['ShipmentNo'] = ts
    if 'TrackingNo' in elem.attrib:
        elem.attrib['TrackingNo'] = ts
    if 'WaveKey' in elem.attrib:
        elem.attrib['WaveKey'] = ts


doc.write('singleoutbound_2.xml')
  • You are using regular expressions to replace parts of XML? Scrap your code, start over. Modifications on *ML should be done with a proper tool, in your case with a DOM API (or with XSLT). The ElementTree module you import is a proper tool, but I don't see you using it anywhere. Commented Aug 24, 2016 at 15:04
  • It's not clear what your expected behavior actually is. Can you provide a sample singleoutbound.xml with your question, the actual output that your code generates, and the desired output that you want your code to produce? Also, I'll point out that your code as written doesn't return anything. You construct modifiedline, but do not output it, store it or return it. Commented Aug 24, 2016 at 15:15
  • This 'ShipmentNo="'+ts+'"' looks like a replacement string built at run time. I think the replacement string is expected to be a compile-time string. Does this work without exceptions? Commented Aug 24, 2016 at 15:17
  • @sln: I generated a simple XML file with one match for each of the five regexes given, and the code consumed it with no exceptions thrown. Commented Aug 24, 2016 at 15:19
  • Disregard my question. Apparently the string can be built at run time, though I don't know how they could do that without an internal eval of the code. dotnetperls.com/sub-python Commented Aug 24, 2016 at 15:21

2 Answers


Here is how to use ElementTree to make modifications to an XML file without accidentally breaking it:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import time

ts = str(int(time.time()))

doc = ET.parse('singleoutbound.xml')

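# ElementTree expects a relative path: './/*[@OrderName]' is '//*[@OrderName]' anchored at the root element
for elem in doc.iterfind('.//*[@OrderName]'):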
    elem.attrib['OrderName'] = ts

# and so on

doc.write('singleoutbound_2.xml')

Things to understand:

  • XML represents a tree-shaped data structure that consists of elements, attributes and values, among other things. Treating it as line-based plain text fails to recognize this fact.
  • There is a language to select items from that tree of data, called XPath. It's powerful and not difficult to learn. Learn it. I've used //*[@OrderName] above to find all elements that have an OrderName attribute.
  • Trying to modify the document tree with improper tools like string replace and regular expressions will lead to more complex and hard-to-maintain code. You will encounter run-time errors for completely valid input that your regex has no special case for, character encoding issues and silent errors that are only caught when someone looks at your program's output. In other words: It's the wrong thing to do, so don't do it.
  • The above code is actually simpler and much easier to reason about and extend than your code.
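The "and so on" step can be folded into one loop over the attribute names. A minimal sketch, assuming a made-up in-memory document in place of singleoutbound.xml (the element names Orders, Order and Shipment, the Status attribute, and the helper stamp_attributes are hypothetical and only there for illustration):

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical sample; the real singleoutbound.xml is not shown in the question.
SAMPLE = (
    '<Orders>'
    '<Order OrderName="old" OrderNo="old" Status="open">'
    '<Shipment ShipmentNo="old" TrackingNo="old" WaveKey="old"/>'
    '</Order>'
    '</Orders>'
)

ATTRIBUTES = ("OrderName", "OrderNo", "ShipmentNo", "TrackingNo", "WaveKey")


def stamp_attributes(root, names, value):
    """Set every attribute listed in names to value, wherever it occurs."""
    for elem in root.iter():  # iter() visits the root and all descendants
        for name in names:
            if name in elem.attrib:
                elem.set(name, value)


ts = str(int(time.time()))
root = ET.fromstring(SAMPLE)
stamp_attributes(root, ATTRIBUTES, ts)

# Serialize back to text; attributes not listed (Status) are left untouched.
result = ET.tostring(root, encoding="unicode")
print(result)
```

With a real file, ET.parse(...).getroot() would stand in for ET.fromstring(SAMPLE), and doc.write(...) for ET.tostring(...).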

6 Comments

Thank you Tomalak, I have edited the question text with the new code inspired by your answer, which I am using to solve my problem!
@PraveerN That code is looking very good. That's the way to go. You can also make a for attribName in ['OrderName', 'OrderNo', 'etc'] loop instead of copying the lines.
Parsing and writing a document with etree may change the document a bit. There should not be any semantic changes, but there may be other changes, as etree does a bit of normalization, e.g. on CDATA.
@jan That's right. There are multiple ways of representing an in-memory tree as serialized XML. That's the core point of my argument. Don't rely on the textual representation, it's ephemeral. Think of it as a transport container between parsers. Don't write tools that rely on the textual representation. A parser will give you the actual data, whether it was in a CDATA or not, escaped or not, broken over multiple lines or not, etc.
There are parsers that handle CDATA differently from non-CDATA, and possibly there are other quirks. So while it is clear that running stuff through etree will not break other well-implemented applications that parse the data, it might cause problems for apps that make problematic use of correct parsers (very improbable).

Do not use regexes for parsing XML unless you have an important reason to do so.

* matches greedily, so ".*" runs up to the last " in the line. What you actually seem to want is the non-greedy *?, which stops at the next " instead.

So just replace each * with *? in your code and you should be fine (apart from the usual do-not-regex-XML-problems).
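A small sketch of the difference, assuming a made-up input line (the Status and Note attributes and the fixed timestamp are hypothetical, chosen so the output is reproducible):

```python
import re

# Hypothetical line with more than one quoted attribute after OrderNo.
line = '<Order OrderNo="123" Status="open" Note="rush"/>'
ts = "1472051040"  # fixed value instead of time.time() for a stable result

# Greedy: .* runs to the LAST double quote on the line,
# so Status and Note are swallowed by the match and lost.
greedy = re.sub('OrderNo=".*"', 'OrderNo="' + ts + '"', line)

# Non-greedy: .*? stops at the NEXT double quote.
lazy = re.sub('OrderNo=".*?"', 'OrderNo="' + ts + '"', line)

print(greedy)  # <Order OrderNo="1472051040"/>
print(lazy)    # <Order OrderNo="1472051040" Status="open" Note="rush"/>
```

The greedy result is the truncation described in the question. A negated character class such as [^"]* in place of .*? would be another way to stop at the next quote.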

Edit:

The usual problem with regex and XML is that your regex works fine at first but fails on valid XML from other sources (e.g. other exporters, or even other versions of the same exporter), because there are different ways of saying the same thing in XML. For example, <name att="123"></name> and <name att="123"/> are the same as <name att='123' />, which in turn is the same as any of these with the 123 written in &-quoted form. Depending on namespace use, this may also be the same as <a:name att="123"/> or <b:name att="123"/>.
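To make this concrete, here is a minimal sketch that runs a parser and a regex over four spellings of the same element. The pattern att="(.*?)" is only an illustrative stand-in for the kind of regex used in the question, and the namespace variants are left out because they would need xmlns declarations:

```python
import re
import xml.etree.ElementTree as ET

# Four serializations of the same element with the same attribute value.
variants = [
    '<name att="123"></name>',
    '<name att="123"/>',
    "<name att='123' />",             # single quotes are just as valid
    '<name att="&#49;&#50;&#51;"/>',  # 123 written as character references
]

pattern = re.compile(r'att="(.*?)"')  # illustrative regex, assumes double quotes

parsed = []   # values as seen by the XML parser
matched = []  # values as seen by the regex (None when it finds nothing)
for text in variants:
    parsed.append(ET.fromstring(text).get("att"))
    hit = pattern.search(text)
    matched.append(hit.group(1) if hit else None)

print(parsed)   # ['123', '123', '123', '123']
print(matched)  # ['123', '123', None, '&#49;&#50;&#51;']
```

The parser reports the same value for every variant, while the regex misses the single-quoted form and returns the raw, unresolved text for the escaped one.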

Short:

You cannot be sure that your regex will still work when something outside your control changes.

But:

  • Some parsers may produce unexpected results in such cases, too.
  • Some exporters produce bad XML that normal parsers do not understand correctly, so if the exporters cannot be fixed, workarounds like regexes are needed.

11 Comments

Thank you, this worked for me. Could you elaborate upon "(apart from the usual do-not-regex-XML-problems)"?
If you know about the "usual do-not-regex-XML-problems", why do you advise people to do it anyway?
The comments below the question already warn about potential problems, and there are actual use cases for regex with XML, like parsing invalid XML.
Elaborated on the regexing-XML problems.
I don't believe there are use-cases where regex is the superior solution to parse XML. Name one, I'm genuinely interested. Generally speaking: XML is an extremely strict format, any well-formed input will be parsed properly. Any input that cannot be parsed simply isn't XML - even if it has angle brackets and stuff like that. Instead of tinkering with the consumer, the producer of broken XML should be fixed. If that's not possible, there are specialized tools like tidy to preprocess the document, or more lenient HTML parsers can deal with it. Falling back to regex simply is never necessary.