
I have a .txt file into which a number of Snort alerts are written. I would like to search through this file, delete the duplicate alerts, and keep only one of each. I am using the following code so far:

import re

with open('SnortReportFinal', 'r') as f:
    file_lines = f.readlines()

cont_lines = []
for line in range(len(file_lines)):
    if re.search(r'\d:\d+:\d+', file_lines[line]):
        cont_lines.append(line)

for idx in cont_lines[1:]: # skip one instance of the string
    file_lines[idx] = "" # replace all others

with open('SnortReportFinal', 'w') as f:
    f.writelines(file_lines)

The regular expression matches the string I am searching for, i.e. 1:234:5. Should it find multiple instances of the same string, I would like it to delete them and keep only one. This does not work: all other strings are being deleted, and only the one string the expression matched first is kept.

File Contains text like this:

[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]

Where the part [1:368:6] could be a variation of numbers, e.g. [1:5476:5].

I would like my expected output to be only:

[1:368:6] ICMP PING BSDtype [**]
[1:563:2] ICMP PING BSDtype [**]

The rest of the lines should be deleted. By "the rest" I mean lines with duplicate numbers; lines with different numbers are fine.
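For reference, one way to get this expected output is to key the de-duplication on the alert ID that the regex matches, rather than blanking every match after the first. A sketch (the `dedupe_alerts` helper name is illustrative, and the pattern is widened to `\d+:\d+:\d+` so multi-digit first fields also match):

```python
import re

def dedupe_alerts(lines):
    """Keep the first line for each alert ID like 1:368:6; drop later duplicates."""
    seen_ids = set()
    kept = []
    for line in lines:
        m = re.search(r'\d+:\d+:\d+', line)
        if m:
            if m.group() in seen_ids:
                continue  # duplicate alert ID: skip this line
            seen_ids.add(m.group())
        kept.append(line)
    return kept

sample = [
    "[1:368:6] ICMP PING BSDtype [**]\n",
    "[1:368:6] ICMP PING BSDtype [**]\n",
    "[1:563:2] ICMP PING BSDtype [**]\n",
]
print(dedupe_alerts(sample))
```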

  • Does it matter what order the alerts are in the file? Commented Mar 16, 2015 at 19:02
  • No the alerts can be in any order. Commented Mar 16, 2015 at 19:02
  • Why are you using a regex? Are there other lines? Commented Mar 16, 2015 at 19:44

2 Answers


It seems like you really don't need a regex for this. To remove duplicates, simply:

alerts = set(f.readlines())

This converts the list of lines in the file to a set, which deletes the duplicates. From here you can directly write the set back to your text file.

Alternatively, you can directly call set on the file object as Padraic Cunningham points out in the comments:

alerts = set(f)
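A sketch of the full round trip this answer describes, with a small sample file written first so the snippet is self-contained (the file name follows the question; note that a set loses the original line order):

```python
# Create a small sample file to demonstrate the round trip.
sample = "[1:368:6] ICMP PING BSDtype [**]\n" * 3 + "[1:563:2] ICMP PING BSDtype [**]\n"
with open('SnortReportFinal', 'w') as f:
    f.write(sample)

# Calling set() on the file object collapses duplicate lines.
with open('SnortReportFinal', 'r') as f:
    alerts = set(f)

# Write the unique lines back; their order is arbitrary.
with open('SnortReportFinal', 'w') as f:
    f.writelines(alerts)
```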

3 Comments

This will potentially fail unless you map(str.rstrip, ...) over the lines first; you can also call set on the file object.
@PadraicCunningham the only reason it would fail without rstrip() would be if there's differences in the whitespace; with computer generated output that shouldn't be an issue. +1 for calling set() directly on the file object though
@wnnmaw, if you had a duplicate line at the end without a newline etc.. it would not catch it.
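The edge case raised in these comments can be shown in a couple of lines: if the final line of the file has no trailing newline, it compares unequal to its newline-terminated duplicates unless the lines are stripped first.

```python
lines = ["[1:368:6] ICMP PING BSDtype [**]\n",
         "[1:368:6] ICMP PING BSDtype [**]"]   # last line lacks a newline
print(len(set(lines)))                   # the duplicate survives
print(len(set(map(str.rstrip, lines))))  # stripping catches it
```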

You don't need a regex; you can use a set:

seen=set(i.strip() for i in open('infile.txt'))

Example:

>>> s="""[1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:563:2] ICMP PING BSDtype [**]"""
>>> set(s.split('\n'))
set(['[1:563:2] ICMP PING BSDtype [**]', '[1:368:6] ICMP PING BSDtype [**]'])
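To also write the de-duplicated alerts back out while keeping the first occurrence of each line in its original order, a seen-set loop works; a sketch with placeholder file names (a sample input file is created first so the snippet runs on its own):

```python
# Sample input; 'infile.txt' and 'outfile.txt' are placeholder names.
with open('infile.txt', 'w') as f:
    f.write("[1:368:6] ICMP PING BSDtype [**]\n" * 4
            + "[1:563:2] ICMP PING BSDtype [**]\n")

seen = set()
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        key = line.strip()
        if key not in seen:          # first time we see this alert line
            seen.add(key)
            fout.write(key + '\n')   # later duplicates are skipped
```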

