3

I need to write a huge number of number-number pairs into a NumPy array. Since many of these pairs have a second value of 0, I thought of making something akin to a dictionary. The problem is that I've read through the NumPy documentation on structured arrays, and it seems that the dictionary-like structures described there can only use strings as keys.

Other than that, I need insertion and searching to have O(log N) complexity. I thought of making my own red-black tree structure using a regular NumPy array as storage, but I'm fairly certain there's an easier way to go about this.

Language is Python 2.7.12.

4
  • Why do you need to use a NumPy array specifically? Commented Aug 12, 2016 at 15:57
  • @David Z Because the amount of data I'm using is too much to store in RAM. That's why I need to hand it to another datatype (this one) which supports writing directly to a database on the hard drive. It is essentially a NumPy array with the ability to write to the hard drive if needed... Commented Aug 12, 2016 at 16:09
  • Ah, that's useful information to have (and to include in the question). It also opens up the option of using a different large-capacity storage library if there is one that makes this task easier. Commented Aug 12, 2016 at 17:43
  • @DavidZ Well, using another library isn't really an option, since I'll have to redo everything on the new system, which will arguably be more difficult than just writing my own wrapper. Anyways, I'll leave this for next week to ponder. Hope that someone has an idea... Commented Aug 12, 2016 at 17:57

2 Answers

2

The most basic form of a dictionary is a structure called a HashMap. Implementing a hashmap relies on turning your key into a value that can be quickly looked up. A trivial example would be using ints as keys: the value for key 1 would go in array[1], the value for key 2 would go in array[2], and the hash function is simply the identity function. You can easily implement that using a numpy array.
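As a minimal sketch of that identity-hash idea, assuming the keys are small non-negative ints (the MAX_KEY bound and the put/get helper names are hypothetical, chosen only for illustration):

```python
import numpy as np

MAX_KEY = 10  # hypothetical upper bound on the keys

# One slot per possible key; 0 doubles as "nothing stored here".
table = np.zeros(MAX_KEY + 1, dtype=np.int64)

def put(key, value):
    # Identity hash: the key itself is the array index.
    table[key] = value

def get(key):
    return int(table[key])

put(3, 42)
put(7, 5)
```

Both operations are a single indexed access, but the array has to be as long as the largest key, so this only pays off when the key range is bounded and reasonably dense.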

If you want to use other types, it's just a case of writing a good hash function to turn those keys into unique indexes into your array. For example, if you know you've got an (int, int) tuple, and the first value is non-negative and always less than 100, you can do 100*key[1] + key[0].
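A rough sketch of such a tuple hash, assuming 0 <= key[0] < 100 so that two different tuples can never land in the same slot (SECOND_LIMIT and the slot helper are hypothetical names used only here):

```python
import numpy as np

FIRST_LIMIT = 100   # assumption: 0 <= key[0] < 100
SECOND_LIMIT = 50   # hypothetical bound: 0 <= key[1] < 50

# One slot for every possible (key[0], key[1]) combination.
table = np.zeros(FIRST_LIMIT * SECOND_LIMIT, dtype=np.int64)

def slot(key):
    # Each value of key[1] owns its own block of FIRST_LIMIT slots.
    return FIRST_LIMIT * key[1] + key[0]

table[slot((7, 3))] = 99
```

If key[0] were allowed to reach 100, the key (100, 0) would collide with (0, 1), which is why the bound has to be strict.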

The implementation of your hash function is what will make or break your dictionary replacement.


1 Comment

Yeah, I get that and I can easily build my array so that finding the position of a value is quick, but insertion is the problem. If I want to keep the array sorted, after I insert I'll need to move every element greater than the inserted one to the right, effectively making insertion's complexity O(n). I need to achieve a complexity of O(logN), as can be done using red-black trees, preferably without needing to write one myself...
0

So you have an (N,2) array, and many values in x[:,1] are 0.

What do you mean by insertion? Adding a value to the array to make it (N+1,2)? Or just changing x[i,:] to something new?

And what about the search? numpy arrays are great for finding the ith values, x[i,:], but not that good for finding the values that match z. See the related question "python numpy filter two dimentional array by condition".
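To make the difference concrete, here is a small sketch with made-up data: lookup by position is a direct access, while lookup by value has to compare every row.

```python
import numpy as np

# Made-up (N, 2) data; the first column plays the role of the key.
x = np.array([[5, 0], [2, 7], [9, 0], [4, 3]])

# By position: one direct access.
row = x[1, :]

# By value: build a boolean mask over all rows, which is O(N).
mask = x[:, 0] == 9
matches = x[mask]            # rows whose first column equals 9
idx = np.flatnonzero(mask)   # positions of those rows
```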

scipy.sparse implements various forms of sparse matrix, which are useful if less than a tenth of the possible values are non-zero. One format is dok, a dictionary of keys. It is actually a dict subclass, and the keys are 2d index tuples (i,j). Other formats store their values as arrays, e.g. row, cols and data.

structured arrays are meant for cases with a modest number of named fields, and each field can hold a different type of data. But I don't think it helps to turn a (N,2) array into a (N,) array with 2 fields.
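A short sketch of why the structured form doesn't buy much here (the field names 'key' and 'value' are hypothetical): the name selects a whole field, not an individual record, so finding a record by its key is still a scan.

```python
import numpy as np

# An (N,) structured array with two int32 fields instead of an (N, 2) array.
pairs = np.zeros(3, dtype=[('key', np.int32), ('value', np.int32)])
pairs['key'] = [10, 20, 30]
pairs['value'] = [0, 5, 0]

# 'key' names a field (a whole column), not one record.
keys = pairs['key']

# Looking up the value stored for key 20 still compares every record.
hit = pairs[pairs['key'] == 20]['value']
```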

================

Your comments suggest that you aren't familiar with how numpy arrays are stored or accessed.

An array consists of a flat 1d data buffer (just a C array of bytes), and attributes like shape, strides, itemsize and dtype.

Let's say it is np.arange(100).

In [1324]: np.arange(100).__array_interface__
Out[1324]: 
{'data': (163329128, False),
 'descr': [('', '<i4')],
 'shape': (100,),
 'strides': (4,),
 'typestr': '<i4',
 'version': 3}

So if I ask for x[50], it multiplies the stride, 4 bytes/element, by 50 elements to get 200 bytes, asks, in C code, for the 4 bytes at 163329128+200, and returns them as an integer (an object of np.int32 type, actually).
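The same arithmetic can be reproduced from Python. This is only an illustration of the offset calculation, not how numpy is implemented internally; the dtype is pinned to int32 so the 4-byte stride holds on any platform.

```python
import numpy as np

x = np.arange(100, dtype=np.int32)

# Byte offset of element 50: index times the stride of the only axis.
offset = 50 * x.strides[0]

# Reinterpret the same buffer as raw bytes, take the 4 bytes at that
# offset, and read them back as one int32.
raw = x.view(np.uint8)
value = raw[offset:offset + 4].view(np.int32)[0]
```

Here value is the same number that x[50] returns.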

For a structured array the type descr and bytes per element will be larger, but access will be the same. For a 2d array it will take the shape and strides tuples into account to find the appropriate index.

Strides for an (N,2) array of 4-byte integers are (8,4). So the x[10,1] element is accessed at an offset of 10*8 + 1*4 = 84 bytes, and x[:,1] is accessed at offsets i*8 + 4 for i in range...
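A small check of that 2d offset calculation, assuming a C-contiguous int32 array (N = 20 is an arbitrary choice for the example):

```python
import numpy as np

x = np.arange(40, dtype=np.int32).reshape(20, 2)

# Byte offset of x[10, 1]: row index times row stride plus
# column index times column stride.
offset = 10 * x.strides[0] + 1 * x.strides[1]

# Read the 4 bytes at that offset straight from the flat buffer.
raw = x.reshape(-1).view(np.uint8)
value = raw[offset:offset + 4].view(np.int32)[0]

# A column is a view over the same buffer that steps 8 bytes at a time.
col = x[:, 1]
```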

But in all cases it relies on the values being arranged in a rectangular, predictable pattern. There's nothing fancy about the numpy data structures. They are relatively fast simply because many operations are coded in compiled code.

Sorting, accessing items by value, and rearranging elements are possible with arrays, but are not a strong point. More often than not these actions will produce a new array, with values copied from old to new in some new pattern.
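For the sorted-insert case discussed in the comments, a minimal sketch looks like this: the position can be found by binary search, but the insertion itself copies everything into a new array.

```python
import numpy as np

keys = np.array([2, 4, 9], dtype=np.int64)  # already sorted

# Binary search for the slot that keeps the array sorted: O(log N).
pos = np.searchsorted(keys, 5)

# np.insert does not modify keys; it builds a new array by copying: O(N).
new_keys = np.insert(keys, pos, 5)
```

So search can be O(log N) on a sorted array, but insertion stays O(N) because of the copy.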

There are just a few builtin numpy array subclasses, mainly np.matrix and np.ma.masked_array, and they don't extend the access methods. Subclassing isn't as easy as with regular Python classes, since numpy has so much of its own compiled code. A subclass has to have a __new__ method rather than the regular __init__.
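For reference, a bare-bones subclass following the usual __new__ / __array_finalize__ pattern might look like the sketch below. The class name TaggedArray and its tag attribute are hypothetical, and nothing here changes how elements are looked up.

```python
import numpy as np

class TaggedArray(np.ndarray):
    """ndarray subclass that carries one extra attribute."""

    def __new__(cls, input_array, tag=None):
        # Instances are created by viewing an existing array as this class,
        # which is why __new__ is used instead of __init__.
        obj = np.asarray(input_array).view(cls)
        obj.tag = tag
        return obj

    def __array_finalize__(self, obj):
        # Also called for views and slices; copy the attribute across.
        if obj is None:
            return
        self.tag = getattr(obj, 'tag', None)

a = TaggedArray([1, 2, 3], tag='pairs')
b = a[1:]  # slicing keeps the subclass and the tag
```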

There are Python modules that maintain ordered lists: bisect keeps a list sorted, and heapq keeps it in heap order. But I don't see how they will help you with the large, out-of-RAM data issue.
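A quick sketch of what those two standard-library modules do, on plain in-memory Python lists:

```python
import bisect
import heapq

# bisect: binary search finds the slot in O(log N), but the list insert
# still shifts the elements after it, so insort is O(N) overall.
keys = [2, 4, 9]
bisect.insort(keys, 5)
i = bisect.bisect_left(keys, 4)  # index of key 4

# heapq: O(log N) push and pop of the smallest item, but the list is only
# heap-ordered, so there is no fast lookup of an arbitrary key.
heap = [9, 2, 4]
heapq.heapify(heap)
heapq.heappush(heap, 1)
smallest = heapq.heappop(heap)
```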

3 Comments

You could say that I want a (N, 2) array that's sorted by its first element such that it has insertion and search of complexity O(logN). By insertion I mean adding a new element to the array so that it retains its sorted state. By search I mean finding the index and, thus, the second value of an element, given its first value. I know this is possible, since it's what Red-black trees do and that's how Python's dictionaries work. I was asking if there was a built-in subtype of numpy.array with these properties, since it has indexing...
by an element, as stated on NumPy's structured arrays page (the part about specifying the name instead of the index). Also, I'm tightly limited to just NumPy arrays, so no scipy or other libraries...
I've expanded on how numpy arrays are stored and accessed.
