I have an XML file where I need to keep the order of the tags, but a tag called media has duplicate lines in consecutive order. I would like to delete one of the duplicate media tags while preserving all of the parent tags (which are also consecutive and repeat). I'm wondering if there is an awk solution that deletes a line only if a pattern is matched. For example:

<story>
   <article>
      <media>One line</media>
      <media>One line</media>    <-- Same line as above, want to delete this
      <media>Another Line</media>
      <media>Another Line</media>  <-- Another duplicate, want to delete this
   </article>
</story>
<story>
   <article>
     ........ and so on

I want to keep the consecutive story and article tags and just delete duplicates for the media tag. I've tried a number of awk scripts but nothing seems to work without sorting the file and ruining the order of the xml. Any help much appreciated.

1 Comment

not a clear example. Please move your "as above" notations into your comments. Commented Jan 7, 2015 at 3:52

4 Answers


An awk script can do this:

awk '!(f == $0){print} {f=$0}' input

Test

$ cat input
<story>
   <article>
      <media>One line</media>
      <media>One line</media>
      <media>Another Line</media>
      <media>Another Line</media>
this
   </article>
</story>
<story>
   <article>

$ awk '!(f == $0){print} {f=$0}' input
<story>
   <article>
      <media>One line</media>
      <media>Another Line</media>
this
   </article>
</story>
<story>
   <article>

Or, more concisely:

$ awk 'f!=$0&&f=$0' input

Thanks to Jidder
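One caveat with the condensed form (my note, not from the original answer): in awk an assignment evaluates to the assigned value, so `f=$0` is false for lines that are empty (or just `0`), and such lines get silently dropped:

```shell
# The condensed one-liner loses blank lines: f=$0 evaluates to the
# assigned value, and an empty string is false in awk.
printf 'a\na\n\nb\n' | awk 'f!=$0&&f=$0'
# prints:
# a
# b        (the blank line between a and b is gone)

# The longer form keeps the blank line:
printf 'a\na\n\nb\n' | awk '!(f == $0){print} {f=$0}'
# prints:
# a
#
# b
```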


2 Comments

Shorter: awk 'f!=$0&&f=$0'
@Jidder That's even shorter

This uses the behaviour of uniq, which normally expects a sorted file: it removes duplicate lines that exactly repeat the previous line.

uniq YourFile
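For example, note that uniq collapses any run of identical consecutive lines, not only `<media>` ones (a sketch with made-up sample lines):

```shell
# uniq removes ANY consecutive duplicate line, regardless of the tag.
printf '<media>x</media>\n<media>x</media>\n<hr/>\n<hr/>\n' > sample.txt
uniq sample.txt
# <media>x</media>
# <hr/>          (the duplicated <hr/> is collapsed too)
```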

2 Comments

Would this not remove all duplicates regardless of the tag?
You are right, and the request is not clear on this point. Here it removes all consecutive duplicate lines, whatever they are (tag or not), which is essentially what half of the replies that don't look at the tag do. The request is to remove lines, and it explains the case with the media tag in the sample. So if another kind of duplicate line occurs that should not be removed (like two consecutive lines of <BR> in HTML), my solution is not adequate.

Consider the file:

$ cat file
<story>
   <article>
      <media>One Line</media>
      <media>One Line</media>
      <media>Another Line</media>
      <media>Another Line</media>
   </article>
</story>
<story>
   <article>
     ........ and so on

To remove duplicate media lines and only duplicate media lines:

$ awk '/<media>/ && $0==last{next} {last=$0} 1' file
<story>
   <article>
      <media>One Line</media>
      <media>Another Line</media>
   </article>
</story>
<story>
   <article>
     ........ and so on

How it works

  • /<media>/ && $0==last{next}

    Any line that has a <media> tag and matches the previous line is skipped: the command next tells awk to skip all remaining commands and start over on the next line.

  • last=$0

    This saves the last line, in its entirety, in the variable last.

  • 1

    This is cryptic awk notation which means print the current line. If you prefer clarity to conciseness, you may replace the 1 with {print $0}.
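For instance, the same script written out long-hand (an equivalent sketch of the one-liner above, using a small sample file):

```shell
# Recreate a sample file like the one in the answer.
cat > file <<'EOF'
<story>
   <article>
      <media>One Line</media>
      <media>One Line</media>
   </article>
</story>
EOF

awk '
/<media>/ && $0 == last { next }   # duplicate <media> line: skip it
{ last = $0 }                      # remember this line for the next comparison
{ print }                          # print every line that was not skipped
' file
```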

2 Comments

Shorter awk '!/<media>/||$0!=last&&last=$0'
Thanks for the concise explanation. AWK is a bit hard for me, and your details helped me to solve the problem in another question: stackoverflow.com/questions/50071125/…

This might work for you (GNU sed):

sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' file

This deletes duplicate lines that begin with the <media> tag.

N.B. This deletes the first line of each duplicate pair rather than the second, but as they are identical it makes no difference.
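For example, run against the sample file from the question (assuming GNU sed, for -r and \s):

```shell
# Recreate the sample file, then run the sed command.
cat > file <<'EOF'
<story>
   <article>
      <media>One line</media>
      <media>One line</media>
      <media>Another Line</media>
      <media>Another Line</media>
   </article>
</story>
EOF

sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' file
# <story>
#    <article>
#       <media>One line</media>
#       <media>Another Line</media>
#    </article>
# </story>
```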

