3

I want to write code that opens multiple text files and counts how many times predefined strings occur in each file. My desired output is a list with the total number of occurrences of each string across all the files.

My desired strings are values of a dictionary.

For instance:

mi = {"key1": "string1", "key2": "string2"}  # and so on...

For a single file, I already have code that produces the count I want:

import collections

mi = {}  # my dictionary
with open("test.txt", "r") as f:
    data = f.read()

od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = []

# Count the occurrences of each value in the file
for value in od_mi.values():
    count = data.count(value)
    count_occur.append(count)

lista_keys = list(od_mi.keys())

dic_final = dict(zip(lista_keys, count_occur))
od_mi_final = collections.OrderedDict(sorted(dic_final.items()))

# A final dictionary mapping each key to the number of times its string occurs
print(od_mi_final)

My next goal is to do the same with multiple files. I have a group of text files named according to a pattern, e.g. "ABC 01.2015.txt", "ABC 02.2015.txt", and so on.

I made 3 test files, and each string occurs exactly once in each of them. Therefore, in my test run the desired output is a count of 3 for each string.

import collections

mi = {}
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = []
for i in range(2, 5):
    for value in od_mi.values():
        x = "ABC" + " " + str(i) + ".2015.txt"
        data = open(x, "r").read()
        contar = data.count(value)
        count_occur.append(contar)

print(count_occur)

Output:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

I realize that my code is not accumulating the counts across files: it records a separate count each time it goes through the loop. How can I fix this?

2 Answers

2

Make a Counter from the values in your mi dict, then use the intersection between the new Counter dict keys and each line of split words:

from collections import Counter

mi = {"key1": "string1", "key2": "string2"}

# list_of_file_names is assumed to be a list of the paths to search
counts = Counter(dict.fromkeys(mi.values(), 0))
for fle in list_of_file_names:
    with open(fle) as f:
        for words in map(str.split, f):
            # viewkeys() is Python 2.7; use counts.keys() on Python 3
            counts.update(counts.viewkeys() & words)
print(counts)
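
As a rough illustration of how the key intersection behaves, here is a minimal Python 3 sketch that runs on a few in-memory lines instead of files. The sample lines are made up for the example, and `counts.keys()` stands in for the Python 2.7 `viewkeys()`. Note that the intersection is a set, so a string that repeats on one line is counted only once for that line.

```python
from collections import Counter

# Hypothetical search strings and sample lines, standing in for real files.
mi = {"key1": "string1", "key2": "string2"}
sample_lines = [
    "string1 appears here with string2",
    "string1 string1 twice on one line",
    "nothing relevant",
]

# Start every search string at zero so absent strings still show up.
counts = Counter(dict.fromkeys(mi.values(), 0))

for line in sample_lines:
    words = line.split()
    # The intersection keeps each matching string once per line,
    # however often it repeats on that line.
    counts.update(counts.keys() & set(words))

print(counts)
```

Here "string1" occurs three times in the sample but ends up with a count of 2, because the second line contributes only one hit. If every repetition should count, a per-word tally or the regex approach is probably the better fit.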

If you are looking for exact matches and you have multiple word phrases to find, your best bet will be a regex with word boundaries:

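import re
from collections import Counter

# Start every search string at zero
counts = Counter(dict.fromkeys(mi.values(), 0))

# re.escape keeps characters such as "." literal instead of acting as regex syntax
patt = re.compile("|".join([r"\b{}\b".format(re.escape(v)) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line))
print(counts)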

You could also call the regex on f.read(), presuming the file content fits into memory:

with open(fle) as f:
    counts.update(patt.findall(f.read()))
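
The word-boundary idea can be tried end to end on a plain string. The following is only a sketch in Python 3: the dictionary values and the text are invented for the example, and each value is passed through `re.escape` so that a character such as "." is matched literally rather than as "any character". Keep in mind that `\b` only matches between a word character and a non-word character, so a search string that begins or ends with punctuation may need a different delimiter.

```python
import re
from collections import Counter

# Hypothetical search strings; "v1.0" contains a regex metacharacter.
mi = {"key1": "v1.0", "key2": "string2"}
text = "v1.0 and v1x0 and string2, string2"

counts = Counter(dict.fromkeys(mi.values(), 0))

# Escape each value, wrap it in word boundaries, and join with "|".
patt = re.compile("|".join(r"\b{}\b".format(re.escape(v)) for v in mi.values()))

# findall returns the matched text, which doubles as the Counter key.
counts.update(patt.findall(text))
print(counts)
```

Without the escaping step, "v1x0" would also match the first alternative and show up as an extra key in the result.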

The standard re module won't find overlapping matches. If you pip install regex, that module will catch the overlapping matches once you set the overlapped flag:

import regex
from collections import Counter

counts = Counter(dict.fromkeys(mi.values(), 0))

# regex.escape keeps characters such as "." literal
patt = regex.compile("|".join([r"\b{}\b".format(regex.escape(v)) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line, overlapped=True))
print(counts)

If we change your examples slightly you can see the difference:

In [30]: s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."

In [31]: mi =  {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}
In [32]: patt = re.compile("|".join([r"\b{}\b".format(v) for v in mi.values()])) 
In [33]: patt.findall(s)
Out[33]: ['conflitantes sobre']

In [34]: patt = regex.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))

In [35]: patt.findall(s,overlapped=True)
Out[35]: ['conflitantes sobre', 'sobre']
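
If installing a third-party module is not an option, a similar result can be approximated with the standard re module by wrapping the alternation in a zero-width lookahead. This is a different technique from the overlapped flag, shown here only as a Python 3 sketch on the same sample sentence. It reports at most one match per start position, so two search strings that begin at the same character would yield only the first alternative that matches there (the values are sorted longest first for that reason).

```python
import re

s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."
mi = {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}

# Longest values first, so the longer phrase wins when two start at the same spot.
alternatives = "|".join(
    r"\b{}\b".format(re.escape(v))
    for v in sorted(mi.values(), key=len, reverse=True)
)

# The lookahead consumes no text, so the scan retries at every position
# and the capture group carries the text that was actually matched.
patt = re.compile("(?=({}))".format(alternatives))

matches = [m.group(1) for m in patt.finditer(s)]
print(matches)
```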

12 Comments

Hello, Padraic! Thank you for your help! I ran a test with my original dictionary, which has many strings with special characters, and for some reason it does not work: the final output counts zero for them. But when I remade my test files and searched only for strings without special characters, your code worked great. Can you help me solve the problem with my original set of strings?
Can you provide a small sample of your file content and what you have in your dict values? Also, which version of Python?
I use Canopy, which uses 'Python 2.7.10 | 64-bit | (default, Oct 21 2015, 17:08:47) [MSC v.1500 64 bit (AMD64)]'. Some examples: "RTL. 10": "O rótulo apresentado está ilegível/incompleto.", "RTL. 11": "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."
Are you actually looking for more than just single words?
Ah ok, well that is a very different story. You are also matching substrings, i.e. foo would match foobar?
0

You should use Counter to simplify your code:

from collections import Counter

mi = {'key1': 'string1', 'key2': 'string2'}
count_occur = []
with open("test.txt", "r") as data_file:
    for data in data_file:
        count_occur.extend([d for d in data.split() if d in mi.values()])

print(Counter(count_occur))

Then, to process multiple files, just loop over a list of files, for example:

from collections import Counter

count_occur = []
mi = {'key1': 'string1', 'key2': 'string2'}
files = ["ABC" + " " + str(i) +".2015.txt" for i in range(2,5)]

for file_c in files:
    with open(file_c, "r") as data_file:
        for data in data_file:
            count_occur.extend([d for d in data.split() if d in mi.values()])

print(Counter(count_occur))
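
One caveat with the split-based version: data.split() yields single whitespace-separated tokens, so a dictionary value made of several words can never be equal to one of them. For multi-word strings, a substring count summed per key may be closer to what the question asks for. The following is a minimal Python 3 sketch under a few assumptions: the helper name count_in_files is made up, the files are UTF-8, and throwaway test files are created in a temporary folder so the example runs on its own.

```python
import os
import tempfile
from collections import Counter

# Sample values taken from the strings quoted in the comments.
mi = {
    "RTL. 10": "O rótulo apresentado está ilegível/incompleto.",
    "RTL. 11": "O rótulo contém informações conflitantes sobre a natureza mineral e sintética.",
}


def count_in_files(paths, terms):
    """Sum the substring occurrences of each value over all files, per key."""
    totals = Counter({key: 0 for key in terms})
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            data = handle.read()
        for key, value in terms.items():
            # Add to the running total instead of appending a new entry.
            totals[key] += data.count(value)
    return totals


# Build three throwaway test files, each containing every string once.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(2, 5):
        path = os.path.join(folder, "ABC {}.2015.txt".format(i))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(mi.values()) + "\n")
        paths.append(path)
    totals = count_in_files(paths, mi)

print(dict(totals))
```

Because str.count matches substrings, "foo" would also be counted inside "foobar"; whether that is acceptable depends on the strings being searched.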

5 Comments

Hi massiou, thank you for your answer. In each case the output was: Counter(). There was no number in the function output, I believe...
It means that 'count_occur' is empty. Is your 'mi' dictionary filled with the requested strings?
Yes, sir! My dictionary is filled with all the strings. I checked whether the count_occur list has anything inside, and it's empty.
For some reason the code has the same output as before.
Just a few details: my test files are encoded in UTF-8 and have special characters in them.
