
I have a .txt file into which a number of Snort alerts are written. I would like to search through this file, delete the duplicate alerts, and keep only one of each. I am using the following code so far:

import re

with open('SnortReportFinal', 'r') as f:
    file_lines = f.readlines()

cont_lines = []
for line in range(len(file_lines)):
    if re.search(r'\d:\d+:\d+', file_lines[line]):
        cont_lines.append(line)

for idx in cont_lines[1:]: # skip one instance of the string
    file_lines[idx] = "" # replace all others

with open('SnortReportFinal', 'w') as f:
    f.writelines(file_lines)

The regular expression matches the string I am searching for, i.e. 1:234:5. Should it find multiple instances of the same string, I would like it to delete them and keep only one. This does not work: all other strings are being deleted, and only the one string the expression matched first is kept.

File Contains text like this:

[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]

Where the part [1:368:6] could be a variation of numbers, e.g. [1:5476:5].

I would like my expected output to be only:

[1:368:6] ICMP PING BSDtype [**]
[1:563:2] ICMP PING BSDtype [**]

The rest of the lines should be deleted. By "the rest" I mean lines with duplicate numbers; lines with different numbers are fine.
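For reference, one way to get this expected output is to key the de-duplication on the alert ID that the regex matches, rather than blanking every match after the first. A sketch (the `dedupe_alerts` helper name is illustrative, and the pattern is widened to `\d+:\d+:\d+` so multi-digit first fields also match):

```python
import re

def dedupe_alerts(lines):
    """Keep the first line for each alert ID like 1:368:6; drop later duplicates."""
    seen_ids = set()
    kept = []
    for line in lines:
        m = re.search(r'\d+:\d+:\d+', line)
        if m:
            if m.group() in seen_ids:
                continue  # duplicate alert ID: skip this line
            seen_ids.add(m.group())
        kept.append(line)
    return kept

sample = [
    "[1:368:6] ICMP PING BSDtype [**]\n",
    "[1:368:6] ICMP PING BSDtype [**]\n",
    "[1:563:2] ICMP PING BSDtype [**]\n",
]
print(dedupe_alerts(sample))
```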

  • Does it matter what order the alerts are in the file? Commented Mar 16, 2015 at 19:02
  • No the alerts can be in any order. Commented Mar 16, 2015 at 19:02
  • Why are you using a regex? Are there other lines? Commented Mar 16, 2015 at 19:44

2 Answers


It seems like you really don't need a regex for this. To remove duplicates, simply:

alerts = set(f.readlines())

This converts the list of lines in the file to a set, which deletes the duplicates. From here you can directly write the set back to your text file.

Alternatively, you can directly call set on the file object as Padraic Cunningham points out in the comments:

alerts = set(f)
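A sketch of the full round trip this answer describes, with a small sample file written first so the snippet is self-contained (the file name follows the question; note that a set loses the original line order):

```python
# Create a small sample file to demonstrate the round trip.
sample = "[1:368:6] ICMP PING BSDtype [**]\n" * 3 + "[1:563:2] ICMP PING BSDtype [**]\n"
with open('SnortReportFinal', 'w') as f:
    f.write(sample)

# Calling set() on the file object collapses duplicate lines.
with open('SnortReportFinal', 'r') as f:
    alerts = set(f)

# Write the unique lines back; their order is arbitrary.
with open('SnortReportFinal', 'w') as f:
    f.writelines(alerts)
```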

3 Comments

This will potentially fail unless you map(str.rstrip, ...) over the lines first; you can also call set on the file object.
@PadraicCunningham the only reason it would fail without rstrip() would be if there's differences in the whitespace; with computer generated output that shouldn't be an issue. +1 for calling set() directly on the file object though
@wnnmaw, if you had a duplicate line at the end without a newline etc.. it would not catch it.
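The edge case raised in these comments can be shown in a couple of lines: if the final line of the file has no trailing newline, it compares unequal to its newline-terminated duplicates unless the lines are stripped first.

```python
lines = ["[1:368:6] ICMP PING BSDtype [**]\n",
         "[1:368:6] ICMP PING BSDtype [**]"]   # last line lacks a newline
print(len(set(lines)))                   # the duplicate survives
print(len(set(map(str.rstrip, lines))))  # stripping catches it
```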

You don't need a regex; you can use a set:

seen=set(i.strip() for i in open('infile.txt'))

Example:

>>> s="""[1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:563:2] ICMP PING BSDtype [**]"""
>>> set(s.split('\n'))
set(['[1:563:2] ICMP PING BSDtype [**]', '[1:368:6] ICMP PING BSDtype [**]'])
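To also write the de-duplicated alerts back out while keeping the first occurrence of each line in its original order, a seen-set loop works; a sketch with placeholder file names (a sample input file is created first so the snippet runs on its own):

```python
# Sample input; 'infile.txt' and 'outfile.txt' are placeholder names.
with open('infile.txt', 'w') as f:
    f.write("[1:368:6] ICMP PING BSDtype [**]\n" * 4
            + "[1:563:2] ICMP PING BSDtype [**]\n")

seen = set()
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        key = line.strip()
        if key not in seen:          # first time we see this alert line
            seen.add(key)
            fout.write(key + '\n')   # later duplicates are skipped
```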

