
I'm dealing with huge numbers. I have to write all the numbers between 1,000,000 and 1,000,000,000 (1M-1B) into a .txt file. Since it throws a MemoryError if I do it in a single list, I sliced the range into chunks (I don't like this solution but couldn't find any other).

The problem is, even with the first 50M numbers (1M-50M), I can't even open the .txt file. It's 458 MB and took around 15 minutes to write, so I guess the full file would be around 9 GB and take over 4 hours to write.

When I try to open the .txt file that contains the numbers between 1M and 50M, I get:

myfile.txt has stopped working

So right now the file contains the numbers between 1M and 50M and I can't even open it; I guess if I write all the numbers it will be impossible to open.

I have to shuffle the numbers between 1M and 1B and store them in a .txt file. Basically it's a freelance job, and I'll have to deal with bigger ranges like 100B later. If even the first 50M causes this problem, I don't know how I'll finish when the numbers get bigger.

Here is the code for 1M-50M:

import random

x = 1000000        # start of the first chunk (1M)
y = 10000000       # chunk size: 10M numbers per slice

while x < 50000001:
    nums = [a for a in range(x, x + y)]   # build one 10M-number chunk
    random.shuffle(nums)
    with open("nums.txt", "a+") as f:     # append the shuffled chunk to the file
        for z in nums:
            f.write(str(z) + "\n")
        x += 10000000

How can I speed up this process?

How can I open this .txt file? Should I create a new file every time? If I choose that option I'll have to slice the numbers into even smaller chunks, since even 50M numbers is a problem.

Is there any module you can suggest that may be useful for this process?

  • I suggest you rethink your process on this, whatever you're doing. Opening up a 9 GB text file requires at least 9 GB of RAM. Furthermore, it's likely that there's a better approach to whatever problem you're solving... writing integers to a text file is usually not a good approach. Commented May 22, 2016 at 20:21
  • Plain text is not an efficient way to store that much numeric data. What are you doing that makes you think you want to do that? Commented May 22, 2016 at 20:21
  • out of curiosity, what sort of job would this be for? Commented May 22, 2016 at 20:21
  • @hichris123: Opening a huge file doesn't consume much RAM, but trying to read the whole thing at once certainly does. Commented May 22, 2016 at 20:27
  • It does sound like the application would be better served by something that can take a seed and use it to generate the same ordered list of numbers each time. Commented May 22, 2016 at 21:20
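
A minimal sketch of the seed idea from the last comment (my own illustration, not from the question; the function name and seed value are hypothetical): instead of storing 9 GB of shuffled numbers, you store only the seed and regenerate the identical permutation whenever you need it.

import numpy

SEED = 12345  # hypothetical seed; store this instead of the whole file

def shuffled_range(start, stop, seed=SEED):
    # A fixed seed gives the same pseudo-random permutation of range(start, stop) every call.
    rng = numpy.random.RandomState(seed)
    nums = numpy.arange(start, stop)
    rng.shuffle(nums)
    return nums

# Both calls produce the identical ordering, so nothing has to be stored or re-read.
a = shuffled_range(10**6, 2 * 10**6)
b = shuffled_range(10**6, 2 * 10**6)
assert (a == b).all()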

2 Answers


Is there any module can you suggest may be useful for this process?

Using NumPy is really helpful for working with large arrays.

How can I speed up this process?

Using NumPy's functions arange and tofile dramatically speeds up the process (see the code below). Generating the initial array is about 50 times faster, and writing the array to a file is about 7 times faster.

The code just performs each operation once (change number=1 to a higher value to get better accuracy) and only generates the numbers between 1M and 2M, but you can see the general picture.

import random
import timeit
import numpy

x = 10**6
y = 2 * 10**6

def list_rand():
    nums = [a for a in range(x, y)]
    random.shuffle(nums)
    return nums

def numpy_rand():
    nums = numpy.arange(x, y)
    numpy.random.shuffle(nums)
    return nums

def std_write(nums):
    with open('nums_std.txt', 'w') as f:
        for z in nums:
            f.write(str(z) + '\n')

def numpy_write(nums):
    with open('nums_numpy.txt', 'w') as f:
        nums.tofile(f, '\n')

print('list generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='list_rand()', setup='from __main__ import list_rand', number=1)))

print('numpy array generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_rand()', setup='from __main__ import numpy_rand', number=1)))

print('standard write [secs]')
nums = list_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='std_write(nums)', setup='from __main__ import std_write, nums', number=1)))

print('numpy write [secs]')
nums = numpy_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_write(nums)', setup='from __main__ import numpy_write, nums', number=1)))



list generation, random [secs]
    1.3995
numpy array generation, random [secs]
    0.0319
standard write [secs]
    2.5745
numpy write [secs]
    0.3622

How can I open this .txt file? Should I create a new file every time? If I choose that option I'll have to slice the numbers into even smaller chunks, since even 50M numbers is a problem.

It really depends on what you are trying to do with the numbers. Find their relative position? Delete one from the list? Restore the array?


2 Comments

I couldn't adapt it to my code, actually; I haven't used numpy before. Could you slice it every 20M or so, so I don't get a MemoryError?
You can use arange with a start and stop position, so, yes, you create slices in numpy just like you did with your list. What computer are you using for this job?
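
To illustrate the point in the comment above, here is a rough sketch of chunked generation with numpy (my own code, not the answerer's; the 20M chunk size and file name are just examples). Only one chunk is in memory at a time, and each chunk is shuffled independently, exactly like the slicing in the question:

import numpy

CHUNK = 20 * 10**6           # hypothetical chunk size: 20M numbers per slice
START, STOP = 10**6, 10**9   # 1M .. 1B, as in the question

with open("nums.txt", "w") as f:
    for lo in range(START, STOP, CHUNK):
        hi = min(lo + CHUNK, STOP)
        nums = numpy.arange(lo, hi)    # only this slice lives in memory
        numpy.random.shuffle(nums)     # shuffles within the chunk, like the original code
        nums.tofile(f, "\n")           # fast text write, numbers separated by newlines
        f.write("\n")                  # tofile adds no separator after the last item
        f.flush()                      # tofile bypasses Python's buffer, so flush to keep output ordered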

I won't help you with the Python itself, but if you need to shuffle a consecutive sequence, you can improve the shuffling algorithm. Make a bit array of 1E9 items; it would be about 125 MB. Generate a random number; if it is not yet present in the bit array, mark it there and write it to the file. Repeat until you have 99% of the numbers in the file.

Now convert the unused numbers in the bit array into an ordinary array; it would be about 80 MB. Shuffle them and write them to the file.

It took about 200 MB of memory for 1E9 items (and about 8 minutes, written in C#). You should be able to shuffle 100E9 items in 20 GB of RAM and in less than a day.
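
A rough Python sketch of this bit-array approach (my own illustration, not the answerer's C# code; the function name and file name are hypothetical, and a plain bytearray stands in for a dedicated bit-array type):

import random

def bit_shuffle_to_file(path, start=10**6, stop=10**9, fill_ratio=0.99):
    n = stop - start
    seen = bytearray((n + 7) // 8)   # 1 bit per number, ~125 MB for 1E9 numbers
    written = 0
    target = int(n * fill_ratio)
    with open(path, "w") as f:
        # Phase 1: rejection sampling. Draw random numbers and emit each one the
        # first time it is drawn; cheap while the bit array is still mostly empty.
        while written < target:
            i = random.randrange(n)
            byte, mask = i >> 3, 1 << (i & 7)
            if not seen[byte] & mask:
                seen[byte] |= mask
                f.write(str(start + i) + "\n")
                written += 1
        # Phase 2: the last ~1% would need too many rejected draws, so collect the
        # unused numbers, shuffle just that small list, and append it.
        rest = [start + i for i in range(n) if not seen[i >> 3] & (1 << (i & 7))]
        random.shuffle(rest)
        for z in rest:
            f.write(str(z) + "\n")

# Scaled-down run; the full 1M-1B range works the same way, just much longer.
bit_shuffle_to_file("nums_bitshuffle.txt", start=10**6, stop=2 * 10**6)

Note that the 80 MB estimate for the leftover numbers assumes 8-byte integers; a Python list of ~10M ints is noticeably larger, so treat the figures as order-of-magnitude.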

5 Comments

Do you mean 1E9 as hexadecimal? That is only 489, which is far too few items for the actual numbers in the question...
Plus there is no bitarray module for Python 3.4. At least I couldn't find one; I'm still looking.
@TadhgMcDonald-Jensen Even Python understands 1E9 as a number without problem. It is 1000000000.
@GLHF There seem to be packages, but I admit I did not investigate them.
ooh, it is scientific notation, 1e9 -> 10^9, ok got it.
