I have an XML file where I need to keep the order of the tags, but a tag called media has duplicate lines in consecutive order. I would like to delete one of the duplicate media tags while preserving all of the parent tags (which are also consecutive and repeat). I'm wondering if there is an awk solution that deletes a line only if a pattern is matched. For example:

<story>
   <article>
      <media>One line</media>
      <media>One line</media>    <-- Same line as above, want to delete this
      <media>Another Line</media>
      <media>Another Line</media>  <-- Another duplicate, want to delete this
   </article>
</story>
<story>
   <article>
     ........ and so on

I want to keep the consecutive story and article tags and just delete duplicates for the media tag. I've tried a number of awk scripts but nothing seems to work without sorting the file and ruining the order of the xml. Any help much appreciated.

1 Comment

not a clear example. Please move your "as above" notations into your comments. Commented Jan 7, 2015 at 3:52

4 Answers


An awk script can do this:

awk '!(f == $0){print} {f=$0}' input

Test

$ cat input
<story>
   <article>
      <media>One line</media>
      <media>One line</media>
      <media>Another Line</media>
      <media>Another Line</media>
this
   </article>
</story>
<story>
   <article>

$ awk '!(f == $0){print} {f=$0}' input
<story>
   <article>
      <media>One line</media>
      <media>Another Line</media>
this
   </article>
</story>
<story>
   <article>

Or, more concisely:

$ awk 'f!=$0&&f=$0' input

Thanks to Jidder
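One caveat with the condensed form (my note, not from the original answer): in awk an assignment evaluates to the assigned value, so `f=$0` is false for lines that are empty (or just `0`), and such lines get silently dropped:

```shell
# The condensed one-liner loses blank lines: f=$0 evaluates to the
# assigned value, and an empty string is false in awk.
printf 'a\na\n\nb\n' | awk 'f!=$0&&f=$0'
# prints:
# a
# b        (the blank line between a and b is gone)

# The longer form keeps the blank line:
printf 'a\na\n\nb\n' | awk '!(f == $0){print} {f=$0}'
# prints:
# a
#
# b
```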


2 Comments

Shorter: awk 'f!=$0&&f=$0'
@Jidder That's even shorter

This uses the behaviour of uniq, which normally expects a sorted file: it removes duplicate lines that exactly repeat the previous line.

uniq YourFile
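For example, note that uniq collapses any run of identical consecutive lines, not only `<media>` ones (a sketch with made-up sample lines):

```shell
# uniq removes ANY consecutive duplicate line, regardless of the tag.
printf '<media>x</media>\n<media>x</media>\n<hr/>\n<hr/>\n' > sample.txt
uniq sample.txt
# <media>x</media>
# <hr/>          (the duplicated <hr/> is collapsed too)
```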

2 Comments

Would this not remove all duplicates regardless of the tag?
You are right, and the request is not clear on this point. Here it removes all consecutive duplicate lines, whatever they are (tag or not), which is essentially what half of the replies that don't look at the tag do. The request is to remove lines, and it explains the case with the media tag in the sample. So if another kind of duplicate line occurs that should not be removed (like two consecutive lines of <BR> in HTML), my solution is not adequate.

Consider the file:

$ cat file
<story>
   <article>
      <media>One Line</media>
      <media>One Line</media>
      <media>Another Line</media>
      <media>Another Line</media>
   </article>
</story>
<story>
   <article>
     ........ and so on

To remove duplicate media lines and only duplicate media lines:

$ awk '/<media>/ && $0==last{next} {last=$0} 1' file
<story>
   <article>
      <media>One Line</media>
      <media>Another Line</media>
   </article>
</story>
<story>
   <article>
     ........ and so on

How it works

  • /<media>/ && $0==last{next}

    Any line that has a <media> tag and matches the previous line is skipped: the command next tells awk to skip all remaining commands and start over on the next line.

  • last=$0

    This saves the last line, in its entirety, in the variable last.

  • 1

    This is cryptic awk notation which means print the current line. If you prefer clarity to conciseness, you may replace the 1 with {print $0}.
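For instance, the same script written out long-hand (an equivalent sketch of the one-liner above, using a small sample file):

```shell
# Recreate a sample file like the one in the answer.
cat > file <<'EOF'
<story>
   <article>
      <media>One Line</media>
      <media>One Line</media>
   </article>
</story>
EOF

awk '
/<media>/ && $0 == last { next }   # duplicate <media> line: skip it
{ last = $0 }                      # remember this line for the next comparison
{ print }                          # print every line that was not skipped
' file
```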

2 Comments

Shorter awk '!/<media>/||$0!=last&&last=$0'
Thanks for the concise explanation. AWK is a bit hard for me, and your details helped me to solve the problem in another question: stackoverflow.com/questions/50071125/…

This might work for you (GNU sed):

sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' file

This deletes duplicate lines that begin with the <media> tag.

N.B. This deletes the first line of each duplicate pair rather than the second, but as they are identical it makes no difference.
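For example, run against the sample file from the question (assuming GNU sed, for -r and \s):

```shell
# Recreate the sample file, then run the sed command.
cat > file <<'EOF'
<story>
   <article>
      <media>One line</media>
      <media>One line</media>
      <media>Another Line</media>
      <media>Another Line</media>
   </article>
</story>
EOF

sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' file
# <story>
#    <article>
#       <media>One line</media>
#       <media>Another Line</media>
#    </article>
# </story>
```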

