Replace data of an array by two values of a second array

Question

I have two numpy arrays "Elements" and "nodes". My aim is to gather some data of these arrays. I need to replace "Elements" data of the two last columns by the two coordinates contains in "nodes" array. The two arrays are very huge, I have to automate it.

This posts refers to an old one: Replace data of an array by 2 values of a second array

with a difference that arrays are very huge (Elements: (3342558,5) and nodes: (581589,4)) and the previous way out does not work.

An example :

    import numpy as np
    
    Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
    
    nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])
    
    results = np.array([[1., 0., 0., 3., 3.],
    [2., 1., 1., 2., 2.]])

The previous way out proposed by hpaulj

    e = Elements[:,1:].ravel().astype(int)
    n=nodes[:,0].astype(int)
    
    I, J = np.where(e==n[:,None])
    
    results = np.zeros((e.shape[0],2),nodes.dtype)
    results[J] = nodes[I,:1]
    results = results.reshape(2,4)

But with huge arrays, this script does not work:
DepreciationWarning: elementwise comparison failed; this will raise an error in the future...

I don't see why bigger size should be a problem. I get your warning with e==[]. There may be other mis matches that trigger the warning. For start verify the arrays shapes at the problem line. — hpaulj
– hpaulj, Commented Oct 20, 2020 at 16:54
e.shape : (13370232,) and n.shape: (581589,) and this error appears for a critical size because I try to increase arrays until the error — user3601754
– user3601754, Commented Oct 21, 2020 at 6:14
With those numbers I'd expect a memory error, since e==n[:,None] would produce a (13370232, 581589) shaped array. I won't even try to test that. — hpaulj
– hpaulj, Commented Oct 21, 2020 at 6:57
Sorry I was busy. I retained your 1st approach Divakar. It s the very efficient, it works well and it last less 5 seconds!!! Thanks for your help ! — user3601754
– user3601754, Commented Oct 27, 2020 at 14:08

Divakar · Accepted Answer · 2020-10-23 14:32:14Z

Most of the game would be to figure out the corresponding matching indices from Elements in nodes.

Approach #1

Since it seems you are open to conversion to integer, let's assume we could take them as integers. With that, we could use an array-assignment + mapping based method, as shown below :

ar = Elements.astype(int)
a = ar[:,1:].ravel()
nd = nodes[:,0].astype(int)

n = a.max()+1
# for generalized case of neagtive ints in a or nodes having non-matching values:
# n = max(a.max()-min(0,a.min()), nd.max()-min(0,nd.min()))+1

lookup = np.empty(n, dtype=int)
lookup[nd] = np.arange(len(nd))
indices = lookup[a]

nc = (Elements.shape[1]-1)*(nodes.shape[1]-1) # 4 for given setup
out = np.concatenate((ar[:,0,None], nodes[indices,1:].reshape(-1,nc)),axis=1)

Approach #2

We could also use np.searchsorted to get those indices.

For nodes having rows sorted based on first col and matching case, we can simply use :

indices = np.searchsorted(nd, a)

For not-necessarily sorted case and matching case :

sidx = nd.argsort()
idx = np.searchsorted(nd, a, sorter=sidx)
indices = sidx[idx]

For non-matching case, use an invalid bool array :

invalid = idx==len(nd)
idx[invalid] = 0
indices = sidx[idx]

Approach #3

Another with concatenation + sorting -

b = np.concatenate((nd,a))
sidx = b.argsort(kind='stable')

n = len(nd)
v = sidx<n
counts = np.diff(np.flatnonzero(np.r_[v,True]))
r = np.repeat(sidx[v], counts)

indices = np.empty(len(a), dtype=int)
indices[sidx[~v]-n] = r[sidx>=n]

To detect non-matching ones, use :

nd[indices] != a

Port the idea here to numba :

from numba import njit

def numba1(Elements, nodes):
    a = Elements[:,1:].ravel()
    nd = nodes[:,0]
    b = np.concatenate((nd,a))
    sidx = b.argsort(kind='stable')
    
    n = len(nodes)        
    ncols = Elements.shape[1]-1
    size = nodes.shape[1]-1        
    dt = np.result_type(Elements.dtype, nodes.dtype)
    nc = ncols*size
    
    out = np.empty((len(Elements),1+nc), dtype=dt)
    out[:,0] = Elements[:,0]
    return numba1_func(out, sidx, nodes, n, ncols, size)

@njit
def numba1_func(out, sidx, nodes, n, ncols, size):
    N = len(sidx)    
    for i in range(N):
        if sidx[i]<n:
            cur_id = sidx[i]
            continue
        else:
            idx = sidx[i]-n        
            row = idx//ncols
            col = idx-row*ncols        
            cc = col*size+1
            for ii in range(size):
                out[row, cc+ii] = nodes[cur_id,ii+1]
    return out

Sebastien D · Accepted Answer · 2020-10-23 10:19:52Z

1

Would you consider using pandas?

import pandas as pd
Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])

df_elements = pd.DataFrame(Elements,columns = ['idx','node1','node2'])
df_nodes = pd.DataFrame(nodes, columns = ['node_id','x','y'])

#Double merge to get the coordinates from df_nodes
results = df_elements.merge(df_nodes, left_on = 'node1', right_on="node_id", how='left').merge(df_nodes, left_on="node2",right_on = "node_id", how='left')[['idx',"x_x",'y_x','x_y','y_y']].values

Output

array([[1., 0., 0., 3., 3.],
       [2., 1., 1., 2., 2.]])

answered Oct 23, 2020 at 10:19

Sebastien D

4,5024 gold badges23 silver badges50 bronze badges

Comments

Dhruv Kothari · Accepted Answer · 2020-10-24 06:46:49Z

First, let's estimate the sizes of the arrays to see if we will encounter a memory error

from sys import getsizeof

Element_size = getsizeof(np.random.randint(0,100,(3342558,5))) / (1024**3)
nodes_size = getsizeof(np.random.randint(0,100,(581589,4))) / (1024**3)
result_size = getsizeof(np.random.randint(0,100,(3342558,13))) / (1024**3)

total_size = Element_size + nodes_size + result_size

Running this script (13=(5-1)*(4-1)+1), the total_size is about 0.46 GB, this means we don't need to worry too much about memory error, but we should still do our best to avoid making copies of an array.

We first create arrays to work with

elements = np.random.randint(0,100,(100,5))
elements[:,0] = np.arange(100)
nodes = np.random.randint(0,100,(300,4))

# create an empty result array 
results = np.empty((100,13)).astype(elements.dtype)
results[:,:5] = elements

As you can see, we create the array results in the first place, there are two benefits to create this array at the beginning:

Most operations can be in-place operations performed on results.
If the memory space is not sufficient, you will know this when you create results.

With these arrays, you can solve your problem with

aux_inds = np.arange(4)
def argmax_with_exception(row):
    
    mask = row[1:5][:,None] == nodes[:,0]
    indices = np.argmax(mask,axis=1)
    node_slices = nodes[indices][:,1:]

    # if a node in Element is not found in the array nodes
    not_found = aux_inds[~np.any(mask,axis=1)]
    node_slices[not_found] = np.ones(3) * -999
    row[1:] = node_slices.flatten()
    
np.apply_along_axis(argmax_with_exception,1,results)

in which, if a node in Element is not found in nodes, its value will be assigned to (-999,-999,-999).

In this approach, np.apply_along_axis(argmax_with_exception,1, results) will perform an in-place operation on the array results, therefore, it is unlikely you will run into memory error as long as the arrays can be created in the first place. If, however, the machine you are working with has a very small RAM, you can save the array Elements to disk in the first place, then load it into results with results[:,:5] = np.load('Elements.npy')

Aaj Kaal · Accepted Answer · 2020-10-26 02:33:50Z

In order to understand the pythonic solution first look at the solution provided by sgnfis on the old post: Old solution

import numpy as np
# I used numpy 1.10.1 here

Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])

# Create an array with enough rows and five columns
res = np.zeros((np.shape(Elements)[0],5))

for i in range(np.shape(Elements)[0]):
    res[i,0] = Elements[i,0] # The first column stays the same

    # Find the Value of the 2nd column of Elements in the first column of nodes.
    nodesindex = np.where(nodes[:,0]==Elements[i,1])
    # Replace second and third row of the results with the ventries from nodes.
    res[i,1:3]=nodes[nodesindex,1:3]

    #Do the same for the 3rd column of Elements
    nodesindex = np.where(nodes[:,0]==Elements[i,2])
    res[i,3:5]=nodes[nodesindex,1:3]

print(res)

The above solution is now turned into pythonic solution as given below: New Solution:

import numpy as np

Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])

# Create an array with enough rows and five columns
res = np.zeros((np.shape(Elements)[0],5))
res[:,0] = Elements[:,0]  # The first column stays the same
res[:,1:3]=[nodes[np.where(nodes[:,0]==Elements[i,1]),1:3] for i in range(np.shape(Elements)[0])]
res[:,3:5]=[nodes[np.where(nodes[:,0]==Elements[i,2]),1:3] for i in range(np.shape(Elements)[0])]
print(res)

Collectives™ on Stack Overflow

Replace data of an array by two values of a second array

4 Answers 4

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related