Iterate over multiple files and count multiple strings

Question

I want to write code that opens multiple text files and counts how many times each predefined string occurs in each file. My desired output is a list with the total number of occurrences of each string across all the files.

My desired strings are values of a dictionary.

For instance:

mi = {"key1": "string1", "key2": "string2"}  # and so on...

To open a single file and do the count, I already have working code. See below:

mi = {} #my dictionary
data = open("test.txt", "r").read()
import collections 
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = list()

for value in od_mi.values():
    count = data.count(value)
    count_occur.append(count)

lista_keys = []   
for key in od_mi.keys():
    lista_keys.append(key)

dic_final = dict(zip(lista_keys, count_occur))
od_mi_final = collections.OrderedDict(sorted(dic_final.items()))

print(od_mi_final)  # final dictionary mapping each key to the number of times its string occurs

My next goal is to do the same with multiple files. I have a group of text files that are named according to a pattern, e.g. "ABC 01.2015.txt", "ABC 02.2015.txt", ...

I made 3 text files as test files; in each file, each string occurs once. Therefore, in my test run the desired output is a count of 3 for each string.

mi = {}
import collections
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = []
for i in range(2, 5):
    for value in od_mi.values():
        x = "ABC" + " " + str(i) + ".2015.txt"
        data = open(x, "r").read()
        contar = data.count(value)
        count_occur.append(contar)

print(count_occur)

Output:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

I realize that my code is not accumulating the counts as it goes through the loop; each pass just adds another separate count. How can I fix this?

Padraic Cunningham · Accepted Answer · 2016-03-17 17:19:16Z

2

Make a Counter from the values in your mi dict, then use the intersection between the Counter's keys and the words of each split line:

from collections import Counter

mi = {"key1": "string1", "key2": "string2"}

# start every string at 0 so strings that never occur still show up
counts = Counter(dict.fromkeys(mi.values(), 0))
for fle in list_of_file_names:  # list_of_file_names: the paths of your text files
    with open(fle) as f:
        for words in map(str.split, f):
            # viewkeys() is Python 2; on Python 3 use counts.keys()
            counts.update(counts.viewkeys() & words)
print(counts)
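One detail of the intersection approach is worth seeing on a small input: the intersection is a set, so a string is counted at most once per line, however often it occurs on that line. Below is a minimal sketch written for Python 3 (hence `keys()` rather than `viewkeys()`); the `lines` list is made-up sample data standing in for the lines of a file.

```python
from collections import Counter

mi = {"key1": "string1", "key2": "string2"}

# Hypothetical sample data standing in for the lines of one file.
lines = ["string1 foo string1", "string2 string1"]

counts = Counter(dict.fromkeys(mi.values(), 0))
for line in lines:
    # The intersection is a set, so "string1" adds 1 for the first line
    # even though it occurs twice there.
    counts.update(counts.keys() & line.split())

print(counts)
```

If every occurrence should be counted, updating the Counter with a list of the matching words instead of a set would avoid this.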

If you are looking for exact matches and you have multiple word phrases to find, your best bet will be a regex with word boundaries:

import re
from collections import Counter

# re.escape keeps characters such as "." or "/" in the strings literal
patt = re.compile("|".join([r"\b{}\b".format(re.escape(v)) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line))
print(counts)
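For phrases that contain punctuation, such as the ones quoted in the comments, two things may get in the way of a word-boundary pattern: regex metacharacters in the phrase, and the fact that `\b` does not match between a trailing "." and a following space. The following is a sketch for Python 3, not the answer's own code: it swaps `\b` for the lookarounds `(?<!\w)` and `(?!\w)`, escapes each phrase with `re.escape`, and tries longer phrases first. The `text` string is invented sample data.

```python
import re
from collections import Counter

mi = {
    "RTL. 10": "O rótulo apresentado está ilegível/incompleto.",
    "RTL. 11": "sobre",
}

# Invented sample text; in practice this would come from a file.
text = ("Nota: O rótulo apresentado está ilegível/incompleto. "
        "Mais sobre isso, sobretudo aqui.")

# Longest phrases first, so a long phrase is tried before a shorter one
# that could match at the same position.
alternatives = sorted(mi.values(), key=len, reverse=True)

# (?<!\w) and (?!\w) require "no word character" on either side, which
# also works when the phrase itself starts or ends with punctuation.
patt = re.compile(
    "|".join(r"(?<!\w){}(?!\w)".format(re.escape(v)) for v in alternatives)
)

counts = Counter(dict.fromkeys(mi.values(), 0))
counts.update(patt.findall(text))
print(counts)
```

Here "sobretudo" is not counted as "sobre", because the lookahead rejects a match that is followed by another word character.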

You might find that calling the regex on f.read() is faster, presuming the file content fits into memory:

with open(fle) as f:
    counts.update(patt.findall(f.read()))
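Put together, the whole-file variant could look like the sketch below. The helper name `count_in_files`, its `encoding` parameter, and the throwaway files written to a temporary directory are all assumptions made for illustration; the file names only mimic the pattern from the question.

```python
import io
import os
import re
import tempfile
from collections import Counter


def count_in_files(file_names, strings, encoding="utf-8"):
    """Hypothetical helper: total whole-word matches of each string."""
    patt = re.compile(
        "|".join(r"\b{}\b".format(re.escape(s)) for s in strings)
    )
    counts = Counter(dict.fromkeys(strings, 0))
    for name in file_names:
        with io.open(name, encoding=encoding) as f:
            # One findall per file; assumes the file fits into memory.
            counts.update(patt.findall(f.read()))
    return counts


# Demo: three throwaway files, each containing every string once.
tmp_dir = tempfile.mkdtemp()
names = []
for i in range(2, 5):
    name = os.path.join(tmp_dir, "ABC {}.2015.txt".format(i))
    with io.open(name, "w", encoding="utf-8") as f:
        f.write(u"string1 and string2\n")
    names.append(name)

totals = count_in_files(names, ["string1", "string2"])
print(totals)
```

With three files that each contain every string once, each total comes out as 3, which is the result the question asks for.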

The standard re module won't find overlapping matches. If you pip install regex, it will catch the overlapping matches once you set the overlapped flag:

import regex
from collections import Counter

counts = Counter(dict.fromkeys(mi.values(), 0))

# regex.escape keeps characters such as "." or "/" in the strings literal
patt = regex.compile("|".join([r"\b{}\b".format(regex.escape(v)) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line, overlapped=True))
print(counts)

If we change your examples slightly you can see the difference:

In [30]: s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."

In [31]: mi =  {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}
In [32]: patt = re.compile("|".join([r"\b{}\b".format(v) for v in mi.values()])) 
In [33]: patt.findall(s)
Out[33]: ['conflitantes sobre']

In [34]: patt = regex.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))

In [35]: patt.findall(s,overlapped=True)
Out[35]: ['conflitantes sobre', 'sobre']
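If installing a third-party module is not an option, a different technique gives the same counts for this example: search for each string with its own pattern instead of one combined alternation, so a match of one phrase cannot hide a match of another. This is a sketch with the standard `re` module for Python 3, reusing the `s` and `mi` values from the session above; it scans the text once per string, so it may be slower for large dictionaries.

```python
import re
from collections import Counter

s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."
mi = {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}

counts = Counter()
for value in mi.values():
    # One pattern per string: "sobre" is found even though it lies
    # inside the span already matched by "conflitantes sobre".
    patt = re.compile(r"\b{}\b".format(re.escape(value)))
    counts[value] = len(patt.findall(s))

print(counts)
```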

edited Mar 17, 2016 at 17:19

answered Mar 17, 2016 at 16:12

Padraic Cunningham


12 Comments

José Ferraz Neto Over a year ago

Hello, Padraic! Thank you for your help! I ran a test with my original dictionary, which has many strings with special characters, and for some reason it does not work: the final output counts zero for them. But I remade my test files to search only for strings without special characters, and your code works great. Can you help solve the problem with my original set of strings?

Padraic Cunningham Over a year ago

Can you provide a small sample of your file content and what you have in your dict values? Also what version of python?

José Ferraz Neto Over a year ago

I use Canopy, which uses 'Python 2.7.10 | 64-bit | (default, Oct 21 2015, 17:08:47) [MSC v.1500 64 bit (AMD64)]'. Some examples: "RTL. 10": "O rótulo apresentado está ilegível/incompleto.", "RTL. 11": "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."

Padraic Cunningham Over a year ago

Are you actually looking for more than just single words?

Padraic Cunningham Over a year ago

Ah ok, well that is a very different story. You are also matching substrings, i.e. foo would match foobar?

mvelay · Accepted Answer · 2016-03-17 16:03:28Z

0

You should use Counter to simplify your code:

from collections import Counter

mi = {'key1': 'string1', 'key2': 'string2'}
count_occur = []
with open("test.txt", "r") as data_file:
    for data in data_file:
        count_occur.extend([d for d in data.split() if d in mi.values()])

print(Counter(count_occur))

Then, to process multiple files, just loop over a list of files, for example:

from collections import Counter

count_occur = []
mi = {'key1': 'string1', 'key2': 'string2'}
files = ["ABC" + " " + str(i) +".2015.txt" for i in range(2,5)]

for file_c in files:
    with open(file_c, "r") as data_file:
        for data in data_file:
            count_occur.extend([d for d in data.split() if d in mi.values()])

print(Counter(count_occur))
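The same loop over files can also keep the substring counting from the question (`str.count`) and report the totals per dictionary key. The sketch below assumes that is the behaviour wanted; the helper name `count_substrings`, the glob pattern, and the temporary demo files are illustrative choices, not part of the original code.

```python
import glob
import io
import os
import tempfile
from collections import Counter


def count_substrings(file_names, mi, encoding="utf-8"):
    """Hypothetical helper: sum data.count(value) over all files, per key."""
    totals = Counter(dict.fromkeys(mi, 0))
    for name in file_names:
        with io.open(name, encoding=encoding) as f:
            data = f.read()
        for key, value in mi.items():
            # Add to the running total instead of appending a new entry.
            totals[key] += data.count(value)
    return totals


# Demo: three throwaway files named like the pattern in the question.
tmp_dir = tempfile.mkdtemp()
for i in range(2, 5):
    path = os.path.join(tmp_dir, "ABC {}.2015.txt".format(i))
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"string1 then string2\n")

# glob picks up every file that fits the naming pattern.
files = sorted(glob.glob(os.path.join(tmp_dir, "ABC *.2015.txt")))
totals = count_substrings(files, {"key1": "string1", "key2": "string2"})
print(totals)
```

Because `str.count` matches substrings, "foo" would also be counted inside "foobar"; the word-boundary patterns in the other answer avoid that.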

edited Mar 17, 2016 at 16:03

answered Mar 17, 2016 at 14:08

mvelay


5 Comments

José Ferraz Neto Over a year ago

Hi massiou, thank you for your answer. In each case the output was: Counter(). There was no number in the output, I believe...

mvelay Over a year ago

It means that 'count_occur' is empty. Is your 'mi' dictionary filled with the requested strings?

José Ferraz Neto Over a year ago

Yes, sir! My dictionary is filled with all strings. I checked if the count_occur list has something inside and it's empty.

José Ferraz Neto Over a year ago

For some reason the code has the same output as before.

José Ferraz Neto Over a year ago

Just a few details: my test files are encoded in UTF-8 and have special characters in them.
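Two things could plausibly explain the empty Counter, though neither is confirmed in this thread. First, the loop compares single whitespace-separated tokens, so a multi-word phrase can never equal one of them. Second, on Python 2 a UTF-8 file read as bytes will not compare equal to text with accented characters unless both sides are decoded the same way. The sketch below, written for Python 3, addresses only the second point by opening the file with an explicit encoding through `io.open`; the single-word value and the file content are invented for the demo.

```python
import io
import os
import tempfile
from collections import Counter

# Hypothetical single-word value with an accented character.
mi = {u"RTL. 11": u"sintética"}

# Throwaway UTF-8 file standing in for test.txt.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"natureza mineral e sintética no rótulo\n")

wanted = set(mi.values())
count_occur = []
# io.open decodes the bytes to text, so tokens and dict values are
# the same kind of string when they are compared.
with io.open(path, encoding="utf-8") as data_file:
    for data in data_file:
        count_occur.extend(d for d in data.split() if d in wanted)

counts = Counter(count_occur)
print(counts)
```

On Python 2.7 the source file would also need an encoding declaration (for example `# -*- coding: utf-8 -*-`) for the accented literals to be accepted.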
