I have a file that contains text as well as some xml content dumped into it. It looks something like this :

The authentication details : <id>70016683</id><password>password@123</password>
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request>
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>

I am using a Python program to parse this file. I would like to replace the XML part of each line with a placeholder, xml_obj. The output should look something like this:

The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj

At the same time, I would also like to extract the replaced XML text and store it in a list. The list should contain None for any line that doesn't have an XML object.

I have tried using regex for this purpose:

       xml_tag = re.search(r"<\w*>", line)  # first opening tag on the line
       if xml_tag:
            start_position = xml_tag.start()
            # build the matching closing tag, e.g. <id> -> </id>
            xml_word = xml_tag.group()[:1] + '/' + xml_tag.group()[1:]
            xml_pattern = r'{}'.format(xml_word)
            # match objects have .end(), not .stop()
            stop_position = re.search(xml_pattern, line).end()

But this code retrieves the start and stop positions for only one XML tag and its content on the first line, and for the entire XML block on the last line (of the input file). I would like to get all the XML content, irrespective of its structure, and replace it with 'xml_obj'.

Any advice would be helpful. Thanks in advance.

Edit :

I also want to apply the same logic to files that look like this :

The authentication details : ID <id>70016683</id> Password <password>password@123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>

The above files may have more than one xml object in a line.

They may also have some plain text after the xml part.

  • Does the xml snippet in each line always follow a :? Commented Nov 25, 2020 at 11:30
  • @Jack - Not always. Commented Nov 25, 2020 at 11:32
  • Then please edit your question and add a few more representative samples of what the text looks like. Commented Nov 25, 2020 at 11:35
  • @JackFleeting - Sure Commented Nov 25, 2020 at 11:36

1 Answer

The following is a little convoluted, but assuming that your actual text is correctly represented by the sample in your question, try this:

txt = """[your sample text above]"""
lines = txt.splitlines()
entries = []  # extracted xml for each line, or None
new_txt = ''

for line in lines:
    # mark the first ' <' on the line and split there: [text before, xml part]
    entry = line.replace(' <', ' xxx<', 1).split('xxx')
    if len(entry) == 2:
        entries.append(entry[1])
        entry[1] = "xml_obj"
        line = ''.join(entry)
    else:
        entries.append(None)  # no xml on this line
    new_txt += line + '\n'
for entry in entries:
    print(entry)
print('---')
print(new_txt)

Output:

<id>70016683</id><password>password@123</password>
None
<request><id>90016133</id><password>password@3212</password></request>
<Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
---
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
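
Note that the split-based approach above only handles a single XML run at the end of a line. For the files described in the question's edit (several XML objects per line, possibly followed by plain text), a regex with a backreference is one possible sketch. It is not part of the original answer, and the names `XML_RUN` and `replace_xml` are made up for illustration. It assumes tags carry no attributes, that no element is nested inside another element of the same name, and that each line's list entry is either a list of the removed XML strings or None.

```python
import re

# One or more adjacent elements of the form <tag>...</tag>. The backreference
# \1 ties each closing tag to its own opening tag, so nested children such as
# <request><id>..</id></request> are swallowed as part of the outer element.
# Assumption: no attributes, no self-closing tags, no same-name nesting.
XML_RUN = re.compile(r"(?:<(\w+)>.*?</\1>)+")


def replace_xml(line, placeholder="xml_obj"):
    """Return (line with every XML run replaced, list of removed runs or None)."""
    found = [m.group(0) for m in XML_RUN.finditer(line)]
    return XML_RUN.sub(placeholder, line), (found or None)


sample = """\
The authentication details : ID <id>70016683</id> Password <password>password@123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>"""

new_lines = []
extracted = []
for line in sample.splitlines():
    new_line, found = replace_xml(line)
    new_lines.append(new_line)
    extracted.append(found)

print("\n".join(new_lines))
print("---")
for item in extracted:
    print(item)
```

Adjacent elements with no text between them (as in the first sample file) are treated as one object, which matches the output requested in the question. For anything beyond simple, well-formed snippets, a real XML parser would be more robust than a regex.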

3 Comments

@ Jack - I was trying it out on other types of files I had. I have edited the question to include that as well. Could you help me out with that too? Thanks in advance.
@sunehaks Please post it as a separate question (per SO policy) and I'll be happy to take a look.
@ Jack Fleeting - Sure.
