
I read in a large Python array (20332 × 17009) from a CSV file on a Windows 7 64-bit machine with 12 GB of RAM. The array only has values in about half of its positions, like the example below. I only need the positions that have values for analysis, rather than the whole array.

[0 0 0 0 0 0
 0 0 0 3 8 0
 0 4 2 7 0 0
 0 0 5 2 0 0
 0 0 1 0 0 0]

I am wondering: is it possible to ignore the 0 values during analysis and save memory?

Thanks in advance!

Comments
  • You could use an associative array (which would be a dict in Python) with index => value (see the sketch below). Commented Jan 13, 2013 at 22:27
  • What is the origin of the data in the array? A file or other line-oriented device? What makes something NA vs. data of interest? What is the real size we are talking about? Your example is 6x5, so the bother of 'saving memory' is not really worth the effort. Are we talking thousands by thousands? How many are NA vs. A? The more you say, the more applicable the answer... Commented Jan 13, 2013 at 22:35
  • You could implement a SparseList. Commented Jan 13, 2013 at 22:47
  • Both SciPy and Pandas offer sparse matrices, but if really only half of the numbers are zero, the overhead is probably not worth it. The same goes for most other options. Commented Jan 13, 2013 at 22:57
  • If you don't need random access to the entire array at once, don't read the whole thing into memory; process it line by line instead. Otherwise, memory-mapping the file via mmap might help, especially with a 64-bit Python. Commented Jan 13, 2013 at 23:09
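For what it's worth, a minimal sketch of the dict-of-non-zeros idea from the first comment (the file name is a placeholder, and a comma delimiter is assumed):

    # Store only the non-zero cells, keyed by (row, col).
    values = {}
    with open("data.csv") as f:
        for row, line in enumerate(f):
            for col, field in enumerate(line.split(",")):
                v = int(field)
                if v != 0:  # skip zeros entirely
                    values[(row, col)] = v

    # Missing keys read back as 0, just as the dense array would.
    print(values.get((2, 1), 0))

Each dict entry carries tens of bytes of overhead (the tuple key, boxed ints, and the hash-table slot), so with roughly half the cells non-zero this can easily use more memory than a compact dense array; it only pays off when the data is very sparse.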

1 Answer


Given your description, a sparse representation may not be very useful to you. There are many other options, though:

  1. Make sure your values are represented using the smallest data type possible. The example you show above is best represented as single-byte integers. Reading into a NumPy array or Python's built-in array module gives you good control over the data type; see the sketch after this list.

  2. You can trade speed for memory by reading only part of the data at a time. If you re-write the entire dataset as binary instead of CSV, then you can use mmap to access the file as if it were already in memory (this would also make it faster to read and write).

  3. If you really need the entire dataset in memory (and it really doesn't fit), then some sort of compression may be necessary. Sparse matrices are an option (as larsmans mentioned in the comments, both scipy and pandas have sparse matrix implementations), but these will only help if the fraction of zero-value entries is large. Better compression options will depend on the nature of your data. Consider breaking up the array into chunks and compressing those with a fast compression algorithm like RLE, SZIP, etc.
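For concreteness, here is a minimal sketch of points 1-3 with NumPy and SciPy (the file names, the comma delimiter, and the int8 dtype are assumptions based on the example data):

    import numpy as np
    from scipy import sparse

    # 1. Smallest dtype that fits: one byte per value instead of the default
    #    8-byte float (for 20332 x 17009, roughly 0.35 GB vs 2.8 GB).
    arr = np.loadtxt("data.csv", delimiter=",", dtype=np.int8)

    # 2. Re-write once as raw binary, then memory-map it on later runs so the
    #    OS pages data in only as it is touched.
    arr.tofile("data.bin")
    mapped = np.memmap("data.bin", dtype=np.int8, mode="r", shape=arr.shape)

    # 3. A sparse matrix stores a value plus a column index per non-zero
    #    entry, so at ~50% density it can cost more than the dense int8 array.
    sp = sparse.csr_matrix(arr)
    print(arr.nbytes, sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)

The final comparison makes the sparse-overhead caveat from the comments concrete: only switch to a sparse format if the printed sparse total actually comes out smaller than the dense one.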
