Linux: Command to delete line(s) from XML file with matching string starting with the 2nd occurrence

Question

I have an XML file that looks something like this:

<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .`
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .`
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .`
</Header>

I would like to delete the "Header" (and not /Header) lines starting with the 2nd occurrence - don't ask why :-). So the output should look something like this (yes, I know that it is not well formed, but I am going to perform other processing on it as well):

<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .`
</Header>
<Date>2017-04-18</Date>
   .
   .
   .`
</Header>
<Date>2017-04-18</Date>
   .
   .
   .`
</Header>

I tried:

sed -i '2,${/<Header/d;}' file

but that deleted all the occurrences of Header. Any suggestions?

Thanks
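
A possible explanation for the result above, offered as a sketch rather than a diagnosis: in sed, `2,$` is a range of input line numbers, not "from the 2nd match onward". If the real file has any line before the first Header (the XML declaration below is an assumption, it is not shown in the question), every Header line falls inside the range and is deleted.

```shell
# Hypothetical input: an XML declaration on line 1, so the first <Header is on line 2.
input=$(mktemp)
cat > "$input" <<'EOF'
<?xml version="1.0"?>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
EOF

# 2,$ selects input lines 2 through the last line, regardless of how many matches came before.
by_line=$(sed '2,${/<Header/d;}' "$input")

# Count the opening Header lines that survived (grep -c exits non-zero on 0 matches, hence || true).
remaining=$(printf '%s\n' "$by_line" | grep -c '<Header' || true)
echo "Header lines left: $remaining"
rm -f "$input"
```

With this input both opening Header lines sit on line 2 or later, so none survive.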

@Cyrus Could be an XY problem, true. But what OP wants is not so hard to do that understanding the Y becomes important. However, OP, the recommendation by Cyrus (thinking of alternative ways to achieve the ultimate goal) is valid. You might give that some thought.
– Yunnosch, Commented May 4, 2017 at 5:55
It makes no sense: after such a deletion your XML will become invalid.
– RomanPerekhrest, Commented May 4, 2017 at 6:11

potong · Accepted Answer · 2017-05-04 14:56:57Z

2

This might work for you (GNU sed):

sed '/^<\/Header/,${/^<Header/d}' file

From the first closing Header tag to the end of the file, remove any lines beginning with a Header tag.
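
A minimal way to try this out, assuming a shortened sample file shaped like the one in the question (the temporary file and the variable names are only for illustration). The command is written with `d;}` instead of `d}`, which is the spelling POSIX asks for and behaves the same in GNU sed.

```shell
# Shortened sample modelled on the question: three Header blocks.
input=$(mktemp)
cat > "$input" <<'EOF'
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
EOF

# Range: from the first line starting with </Header to the last line ($).
# Inside that range, delete every line that starts with <Header.
result=$(sed '/^<\/Header/,${/^<Header/d;}' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

The first opening Header line lies before the range starts, so it is kept; the later ones are inside the range and are removed. Note that the `^` anchors assume the tags start in column 1, as they do in the question.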

answered May 4, 2017 at 14:56

potong


Yunnosch · Accepted Answer · 2017-05-04 17:13:19Z

0

sed '/<Header/{p;:a;s/^.*$//;N;s/\n//;/<Header/!p;ba}' input.txt

find the first occurrence
print it
start a loop
- forget the current line
- get the next
- get rid of the unwanted newline
- print it if it is not a match
loop

This assumes that your header lines are always a single line. Otherwise it gets tough. In that case, think about whether this might be an XY problem (see comment by Cyrus). I also assume that removing the indentation of the date lines is not actually wanted.
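
The same loop idea can also be spelled out as a commented multi-line script. This is a variant sketch, not the command above: it uses `-n` with `n` (fetch the next line) instead of emptying the pattern space and using `N`, and it carries the same single-line-header assumption. The sample file and variable names are only for illustration.

```shell
# Shortened sample modelled on the question: three Header blocks.
input=$(mktemp)
cat > "$input" <<'EOF'
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
EOF

result=$(sed -n '
# Lines before the first <Header fall through to the final p below.
/<Header/{
  # First opening Header line: print it, then stay in the loop until end of input.
  p
  :a
  # n replaces the pattern space with the next line (nothing is auto-printed because of -n).
  n
  # Print every later line unless it is another opening Header line.
  /<Header/!p
  ba
}
p
' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Putting each command on its own line avoids relying on how a particular sed parses `;` and `}` after labels, and the indentation of the Date lines is left untouched.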

edited May 4, 2017 at 17:13

answered May 4, 2017 at 5:50

Yunnosch


3 Comments

user3152289 Over a year ago

I tried the above, but am getting an error: sed: -e expression #1, char 19: unknown command: `z'

user3152289 Over a year ago

I just used this command from another post and it worked: awk '!/^<Header/ || !f++' file

Yunnosch Over a year ago

@user3152289 I avoided the z, which seems not to be supported by your sed. Maybe try my changed answer, just to test whether it works on your sed, too. Use whatever helps you. Consider making your own answer, having three different styles would be cool.
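
For reference, a short sketch of how the awk one-liner from the comment above behaves, run on a shortened sample shaped like the file in the question (the temporary file and variable names are only for illustration). A line is printed when it does not start with `<Header`, or when it does and the counter `f` is still 0; `f++` is only evaluated for lines that start with `<Header`, because `||` short-circuits.

```shell
# Shortened sample modelled on the question: three Header blocks.
input=$(mktemp)
cat > "$input" <<'EOF'
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
EOF

# !/^<Header/  true for every line that is not an opening Header line
# !f++         true only the first time it is evaluated (f is 0, then incremented)
result=$(awk '!/^<Header/ || !f++' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Unlike an in-place `sed -i`, this writes to standard output, so the result has to be redirected to a new file if the original should be replaced.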
