
I am trying to count the number of words whose length is between 1 and 5. The file size is around 4 GB, and I am getting a memory error.

import os
files = os.listdir('C:/Users/rram/Desktop/')
for file_name in files:
    file_path = "C:/Users/rram/Desktop/" + file_name
    f = open(file_path, 'r')
    text = f.readlines()
    update_text = ''
    wordcount = {}
    for line in text:
        arr = line.split("|")
        word = arr[13]
        if 1 <= len(word) < 6:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1
            update_text += '|'.join(arr)
    print (wordcount)     #print update_text
    print 'closing', file_path, '\t', 'total files' , '\n\n'
    f.close()

In the end I get a MemoryError on this line: text = f.readlines()

Can you please help me optimize it?

  • Delete the line text = f.readlines(); you can iterate over the file handle directly. Commented Jun 1, 2018 at 8:33
  • Can you please correct the indentation? Commented Jun 1, 2018 at 8:33
  • You should iterate over the lines with for line in f:. Don't overload your memory by reading the whole file at once. Commented Jun 1, 2018 at 8:33
  • Sorry, the indentation got lost when copy-pasting. @MohamedALANI Commented Jun 1, 2018 at 8:44
  • Can I use f.readline() for faster output, since it loads into memory and then performs the operation? Commented Jun 1, 2018 at 9:44

1 Answer


As suggested in the comments, you should read the file line by line rather than loading the entire file into memory.

For example:

count = 0
with open('words.txt', 'r') as f:
    # Iterating over the file object reads one line at a time
    for line in f:
        for word in line.split():
            if 1 <= len(word) <= 5:
                count = count + 1
print(count)

EDIT:

If you only want to count the words in the 14th column, and split on | instead, then:

count = 0
with open('words.txt', 'r') as f:
    for line in f:
        iterator = 0  # index of the current field within the line
        for word in line.split("|"):
            if 1 <= len(word) <= 5 and iterator == 13:
                count = count + 1
            iterator = iterator + 1
print(count)

Note that you should avoid writing this:

arr = line.split("|")
word = arr[13]

since a line may contain fewer than 14 fields, in which case arr[13] raises an IndexError.
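If you also need the per-word totals that the question collects in wordcount, the same streaming loop can feed a collections.Counter. The following is only a sketch under stated assumptions: the function name count_short_words and its parameters (column, sep, min_len, max_len) are made up for illustration, and records with too few fields are skipped rather than reported.

```python
from collections import Counter


def count_short_words(path, column=13, sep="|", min_len=1, max_len=5):
    """Stream a delimited file and tally the short values found in one column.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    counts = Counter()
    with open(path, "r") as f:
        for line in f:  # only one line is held in memory at a time
            fields = line.rstrip("\n").split(sep)
            if len(fields) <= column:
                continue  # short record: skip it instead of raising IndexError
            word = fields[column]
            if min_len <= len(word) <= max_len:
                counts[word] += 1
    return counts
```

Calling sum(count_short_words('words.txt').values()) should give the same single total as the loop above, while the Counter itself keeps the count for each distinct word.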


3 Comments

In my file there are many records, and in each record the fields are separated by |. I am specifically looking at column number 14.
@FlorentJousse: if you are concerned that arr does not have enough elements, use count += len(arr) > 13 and 1 <= len(arr[13]) <= 5
Dear @FlorentJousse, I am sure my records have more than 14 fields. There are around 70 columns, and where there is no data only the pipes are present, like |||||||||. Thank you very much. Is there any other advantage of the iterator in the above code?
