Replacing multiple strings with regex in python for a file giving truncated string

Question

The following python code

import xml.etree.cElementTree as ET
import time
import fileinput
import re

ts = str(int(time.time()))
modifiedline =''
for line in fileinput.input("singleoutbound.xml"):
    line = re.sub('OrderName=".*"','OrderName="'+ts+'"', line)
    line = re.sub('OrderNo=".*"','OrderNo="'+ts+'"', line)

    line = re.sub('ShipmentNo=".*"','ShipmentNo="'+ts+'"', line)

    line = re.sub('TrackingNo=".*"','TrackingNo="'+ts+'"', line)

    line = re.sub('WaveKey=".*"','WaveKey="'+ts+'"', line)
    modifiedline=modifiedline+line

Returns the modifiedline string with some lines truncated wherever the first match is found

How do I ensure it returns the complete string for each line?

Edit:

I have changed the way I am solving this problem, inspired by Tomalak's answer

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import time

ts = str(int(time.time()))

doc = ET.parse('singleoutbound.xml')

# './/*' selects every element below the root; ElementTree expects the
# relative form rather than the absolute '//*'
for elem in doc.iterfind('.//*'):
    if 'OrderName' in elem.attrib:
        elem.attrib['OrderName'] = ts
    if 'OrderNo' in elem.attrib:
        elem.attrib['OrderNo'] = ts
    if 'ShipmentNo' in elem.attrib:
        elem.attrib['ShipmentNo'] = ts
    if 'TrackingNo' in elem.attrib:
        elem.attrib['TrackingNo'] = ts
    if 'WaveKey' in elem.attrib:
        elem.attrib['WaveKey'] = ts

doc.write('singleoutbound_2.xml')

You are using regular expressions to replace parts of XML? Scrap your code, start over. Modifications on *ML should be done with a proper tool, in your case with a DOM API (or with XSLT). The ElementTree module you import is a proper tool, but I don't see you using it anywhere.
– Tomalak, Commented Aug 24, 2016 at 15:04
It's not clear what your expected behavior actually is. Can you provide a sample singleoutbound.xml with your question, the actual output that your code generates, and the desired output that you want your code to produce? Also, I'll point out that your code as written doesn't return anything. You construct modifiedline, but do not output it, store it or return it.
– Matthew Cole, Commented Aug 24, 2016 at 15:15
This 'ShipmentNo="'+ts+'"' looks like a runtime replacement string. I think the replacement string expects a compile-time string. Does this work with no exceptions?
– user557597, Commented Aug 24, 2016 at 15:17
@sln: I generated a simple XML file with one match for each of the five regexes given, and it consumed it with no exceptions thrown.
– Matthew Cole, Commented Aug 24, 2016 at 15:19
Disregard question, apparently it can be used at runtime but I don't know how they could do that without an internal eval of the code. dotnetperls.com/sub-python
– user557597, Commented Aug 24, 2016 at 15:21

Tomalak · Accepted Answer · 2016-08-25 07:07:50Z

1

Here is how to use ElementTree to make modifications to an XML file without accidentally breaking it:

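import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import time

ts = str(int(time.time()))

doc = ET.parse('singleoutbound.xml')

# ElementTree expects the relative form './/*[@OrderName]' of the
# XPath //*[@OrderName]
for elem in doc.iterfind('.//*[@OrderName]'):
    elem.attrib['OrderName'] = ts

# and so on

doc.write('singleoutbound_2.xml')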

Things to understand:

XML represents a tree-shaped data structure that consists of elements, attributes and values, among other things. Treating it as line-based plain text fails to recognize this fact.
There is a language to select items from that tree of data, called XPath. It's powerful and not difficult to learn. Learn it. I've used //*[@OrderName] above to find all elements that have an OrderName attribute.
Trying to modify the document tree with improper tools like string replace and regular expressions will lead to more complex and hard-to-maintain code. You will encounter run-time errors for completely valid input that your regex has no special case for, character encoding issues and silent errors that are only caught when someone looks at your program's output. In other words: It's the wrong thing to do, so don't do it.
The above code is actually simpler and much easier to reason about and extend than your code.
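
To make the XPath point a little more concrete, here is a minimal sketch of a few selections using the subset of XPath that ElementTree supports. The sample document, its element names and its values are invented for illustration; only the attribute names come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the structure and values are made up.
SAMPLE = """
<Shipments>
  <Shipment ShipmentNo="S1" TrackingNo="T1">
    <Order OrderName="A" OrderNo="1"/>
    <Order OrderNo="2"/>
  </Shipment>
  <Wave WaveKey="W1"/>
</Shipments>
"""

root = ET.fromstring(SAMPLE)

# Every element below the root, at any depth, that has an OrderName attribute.
with_order_name = root.findall(".//*[@OrderName]")

# Only <Order> elements whose OrderNo attribute equals "2".
order_two = root.findall(".//Order[@OrderNo='2']")

# Direct children of the root that are named Wave.
waves = root.findall("Wave")

print([e.tag for e in with_order_name])  # ['Order']
print(len(order_two), len(waves))        # 1 1
```

ElementTree implements only part of XPath, so more elaborate expressions may need a library such as lxml.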

edited Aug 25, 2016 at 7:07

answered Aug 25, 2016 at 6:51

Tomalak

6 Comments

Praveer N Over a year ago

Thank you Tomalak, I have edited the question text with the new code inspired by your answer, which I am using to solve my problem!

Tomalak Over a year ago

@PraveerN That code is looking very good. That's the way to go. You can also make a for attribName in ['OrderName', 'OrderNo', 'etc'] loop instead of copying the lines.
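
The loop suggested in this comment could look roughly like the sketch below. The helper name stamp_attributes and the in-memory sample document are assumptions made for illustration; the attribute names are the five from the question.

```python
import time
import xml.etree.ElementTree as ET

# The five attribute names from the question.
ATTRIBUTES = ("OrderName", "OrderNo", "ShipmentNo", "TrackingNo", "WaveKey")


def stamp_attributes(root, value, names=ATTRIBUTES):
    """Set each attribute in `names` to `value` wherever it already exists.

    Returns the number of attributes that were overwritten.
    """
    changed = 0
    for elem in root.iter():  # root.iter() visits the root and all descendants
        for name in names:
            if name in elem.attrib:
                elem.set(name, value)
                changed += 1
    return changed


# Hypothetical sample input; a real run would use ET.parse(...).getroot().
sample = ET.fromstring(
    '<Shipment ShipmentNo="S1" Carrier="X">'
    '<Order OrderName="A" OrderNo="1"/>'
    '</Shipment>'
)
ts = str(int(time.time()))
count = stamp_attributes(sample, ts)
print(count)  # 3
print(ET.tostring(sample, encoding="unicode"))
```

Attributes that are not in the list (Carrier here) are left alone, and adding another attribute name only means extending the tuple.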

janbrohl Over a year ago

Parsing and writing a document with etree may change the document a bit. Although there should not be any semantic changes, there may be other changes, as etree does a bit of normalization, e.g. on CDATA.

Tomalak Over a year ago

@jan That's right. There are multiple ways of representing an in-memory tree as serialized XML. That's the core point of my argument. Don't rely on the textual representation, it's ephemeral. Think of it as a transport container between parsers. Don't write tools that rely on the textual representation. A parser will give you the actual data, whether it was in a CDATA or not, escaped or not, broken over multiple lines or not, etc.
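
A small sketch of the point made in this exchange, using invented sample strings: the parser hands back the same text whether it was written as CDATA or with escapes, and on output ElementTree picks one representation of its own.

```python
import xml.etree.ElementTree as ET

# Two textual representations of the same data (made-up examples).
cdata_form = ET.fromstring("<note><![CDATA[a < b & c]]></note>")
escaped_form = ET.fromstring("<note>a &lt; b &amp; c</note>")

# After parsing, both carry the identical text value.
print(cdata_form.text == escaped_form.text)  # True
print(cdata_form.text)                       # a < b & c

# Serializing does not bring the CDATA section back; the text is escaped.
print(ET.tostring(cdata_form, encoding="unicode"))
# <note>a &lt; b &amp; c</note>
```

This is the normalization janbrohl mentions: the data is unchanged, but a tool that compares the files byte for byte would see a difference.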

janbrohl Over a year ago

There are parsers that handle CDATA differently from non-CDATA, and possibly there are other quirks. So while it is clear that running stuff through etree will not break other well-implemented applications parsing the data, it might cause problems for apps with problematic use of correct parsers (very improbable).

janbrohl · Accepted Answer · 2016-08-25 13:24:10Z

0

Do not use regexes for parsing XML if you don't have an important reason for doing so.

.* does greedy matching, so it matches up to the last " in the line. What you actually seem to want is .*?, which matches only up to the next ".

So just replace each * with *? in your code and you should be fine (apart from the usual do-not-regex-XML-problems).
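
To illustrate the difference, here is a minimal sketch on a single made-up line (the sample line and the fixed timestamp are assumptions, not taken from the asker's file):

```python
import re

# Hypothetical line with several attributes, as in the question's XML.
line = '<Order OrderName="A1" OrderNo="B2" Status="open"/>'
ts = "1472050000"  # fixed stand-in for str(int(time.time()))

# Greedy: .* runs to the LAST double quote on the line, so the
# attributes after OrderName are swallowed by the replacement.
greedy = re.sub(r'OrderName=".*"', 'OrderName="' + ts + '"', line)

# Non-greedy: .*? stops at the NEXT double quote.
lazy = re.sub(r'OrderName=".*?"', 'OrderName="' + ts + '"', line)

print(greedy)  # <Order OrderName="1472050000"/>
print(lazy)    # <Order OrderName="1472050000" OrderNo="B2" Status="open"/>
```

The greedy result is the truncation described in the question. A negated character class such as [^"]* in place of .*? gives the same result here.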

Edit:

The usual problem with regex and XML is that your regex works fine at first but fails on valid XML from other sources (e.g. other exporters, or even other versions of the same exporter), because there are different ways of saying the same thing in XML. For example, <name att="123"></name> and <name att="123"/> are the same as <name att='123' />, which is the same again with the 123 &-quoted. Depending on namespace use, these may also be the same as <a:name att="123"/> or <b:name att="123"/>.
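
As a rough illustration of the equivalent spellings listed above, the sketch below feeds four of them to both a parser and a naive regex. The pattern is a made-up example of the kind of regex one might write first; namespaces are left out to keep it short.

```python
import re
import xml.etree.ElementTree as ET

# Four spellings of the same element with the same attribute value.
variants = [
    '<name att="123"></name>',
    '<name att="123"/>',
    "<name att='123' />",
    '<name att="&#49;&#50;&#51;"/>',  # 123 written as character references
]

# A parser reports the same value for every variant.
values = [ET.fromstring(v).get("att") for v in variants]

# A naive regex written against the first spelling misses the last two.
pattern = re.compile(r'att="123"')
regex_hits = [bool(pattern.search(v)) for v in variants]

print(values)      # ['123', '123', '123', '123']
print(regex_hits)  # [True, True, False, False]
```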

Short:

Actually you cannot be sure that your Regex still works when something that you cannot control changes.

But:

Some parsers may produce unexpected results in such cases, too.
Some exporters produce bad XML that normal parsers do not understand correctly, so, if the exporters cannot be fixed, workarounds like regexes are needed.

edited Aug 25, 2016 at 13:24

answered Aug 24, 2016 at 15:29

janbrohl

11 Comments

Praveer N Over a year ago

Thank you, this worked for me. Could you elaborate on "(apart from the usual do-not-regex-XML-problems)"?

Tomalak Over a year ago

If you know about the "usual do-not-regex-XML-problems", why do you advise people to do it anyway?

janbrohl Over a year ago

The comments below the question already warn about potential problems and there are actually use-cases to use regex with XML like for parsing invalid XML.

janbrohl Over a year ago

elaborated on regexing-XML-problems

Tomalak Over a year ago

I don't believe there are use-cases where regex is the superior solution to parse XML. Name one, I'm genuinely interested. Generally speaking: XML is an extremely strict format, any well-formed input will be parsed properly. Any input that cannot be parsed simply isn't XML - even if it has angle brackets and stuff like that. Instead of tinkering with the consumer, the producer of broken XML should be fixed. If that's not possible, there are specialized tools like tidy to preprocess the document, or more lenient HTML parsers can deal with it. Falling back to regex simply is never necessary.
