Python memory error when searching substring

Question

I am trying to find substring of very large string and getting memory error:

The code:

def substr(string):
    le = []
    st = list(string)
    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            le.append(''.join(st[s:s1]))

    cou = Counter(le)
    cou_val = cou.keys()
    cou_val.remove('')
    return le, cou_val

I am getting error as ile "solution.py", line 31, in substr le.append(''.join(st[s:s1])) MemoryError

How to tackle this problem?

possible duplicate of stackoverflow.com/questions/11283220/memory-error-in-python — Rikka
– Rikka, Commented Nov 21, 2015 at 7:13

J. Malcolm · Accepted Answer · 2015-11-21 15:05:31Z

1

Answer

I noticed that your code prints all the possible substrings of string in a certain order. I suggest that instead of storing all of them in an array, you use code to return just the substring that you want. I tested the subroutine below with 'a very long string' and it always returns the same value as if you were to get an indexed value from an array.

    string = 'a very long string'
    def substr2(string,i):
        return string[i//(len(string)+1):i%(len(string)+1)]

    print(substr2(string,10))

Solution

The way you order the arguments for your for loops (s,s1) work similarly to a number system. s1 increments by 1 until it gets to a given value, then it resets to 0 and s increments by 1, repeating the cycle. This is seen in a decimal system (e.g. 01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16 etc.)

The i//n div operator returns the integer value of i/n. (e.g. 14//10=1). The i%n mod operator returns the remainder value of i/n. (e.g. 14%10 is 4).

So if we were to, for example, increment i by 1 and define (s,s1) as [i//10,i%10], we would get:

[0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,6],[0,7],[0,8],[0,9],[1,0],[1,1],[1,2] etc.

We can utilize these to produce the same answer as in your array.

PS. My first answer. Hope this helps!

answered Nov 21, 2015 at 15:05

J. Malcolm

621 silver badge5 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

urban Over a year ago

+1: Nice idea and nice answer :). You can also write the above as a generator by looping in substr2 and yeald instead of return.

urban · Accepted Answer · 2015-11-21 12:31:45Z

It seems that you are running out of memory. When the string is too large the code you posted seems to be copying it over and over again into the le list. As @Rikka's link suggests, buffer/memoryview may be of use for you but I have never used it.

As a workaround to your solution/code I would suggest that instead of storing each substring in le, store the indexes as a tuple. Additionally, I don't think that st list is required (not sure tho if your way speeds it up) so the result would be (code not tested):

def substr(string):
    le = []

    for s in xrange(len(string)+1):
        for s1 in xrange(len(string)+1):
            # Skip empty strings
            if s!=s1:
                le.append((s, s1))

    cou = Counter(le)
    cou_val = cou.keys()
    cou_val.remove('')
    return le, cou_val

Now, an example of how you can use the substr is (code not tested):

myString = "very long string here"
matchString = "here"
matchPos = False
indexes, count = substr(myString)

# Get all the substrings without storing them simultaneously in memory
for i in indexes:
    # construct substring and compare
    if myString[i[0],i[1]]==matchString:
        matchPos = i
        break

After the above you have start and end positions of the 1st occurrence of "here" into your large string. I am not sure what you try to achieve but this can easily be modified to find all occurrences, count matches, etc - I just post it as example. I am also not sure why the Counter is there...

This approach should not present the memory error, however, it is a trade-off between memory and CPU and I expect it to be bit slower on runtime since every time you use indexes you have to re-construct every substring.

Hope it helps

Seven He · Accepted Answer · 2015-11-23 12:59:20Z

-1

The solution: The error in memory is always caused by out of range.And the slice technique also has some rules. When the step is positive, just like 1, the first index must be greater than the second.And on the contrary, when negative, such as -1, the number of the index is shorter than the second, but it is actually the greater one.(-1 > -2) So in your program, the index s is greater than s1 when step is one, so you access a place you have not applied for it.And you know, that is MemoryError!!!

edited Nov 23, 2015 at 12:59

answered Nov 21, 2015 at 7:21

Seven He

13 bronze badges

Collectives™ on Stack Overflow

Python memory error when searching substring

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related