I have a Python list where each element is a 2D NumPy array of shape (20, 22). I need to convert the list to a NumPy array, but doing np.array(my_list) literally eats the RAM, and so does np.asarray(my_list).
The list has around 7M samples, so instead of converting my list to a NumPy array, I was thinking of starting with a NumPy array and appending the 2D arrays to it one by one.
I can't find a way of doing that with NumPy. My aim is to start with something like this:
numpy_array = np.array([])
df_values = df.to_numpy()  # faster than df.values
for x in df_values:
    if condition:
        start_point += 20
        end_point += 20
        features = df_values[start_point:end_point]  # 20 rows, 22 columns
        np.append(numpy_array, features)
As you can see above, after each iteration the shape of numpy_array should change like this:
first iteration: (1, 20, 22)
second iteration: (2, 20, 22)
third iteration: (3, 20, 22)
N iteration: (N, 20, 22)
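To make that concrete, here is a tiny toy example (with random stand-in data) of the shape progression I mean. Note that np.append has to be reassigned to its result, and with axis=0 it copies the whole array on every call, which is part of what I am trying to avoid:

import numpy as np

numpy_array = np.empty((0, 20, 22))    # start empty, but with the right trailing shape
for i in range(3):
    features = np.random.rand(20, 22)  # stand-in for a real (20, 22) slice
    numpy_array = np.append(numpy_array, features[None], axis=0)  # returns a new, copied array
    print(numpy_array.shape)           # (1, 20, 22), then (2, 20, 22), then (3, 20, 22)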
Update:
Here is my full code:
import time

from tqdm import tqdm

def get_X(df_values):
    x = []  # np.array([], dtype=np.object)
    y = []  # np.array([], dtype=int32)
    counter = 0
    start_point = 20
    previous_ticker = None
    index = 0
    time_1 = time.time()
    df_length = len(df_values)
    for row in tqdm(df_values):
        if 0 <= start_point < df_length:
            ticker = df_values[start_point][0]
            flag = row[30]
            if index == 0:
                previous_ticker = ticker
            if ticker != previous_ticker:
                counter += 20
                start_point += 20
                previous_ticker = ticker
            features = df_values[counter:start_point]
            x.append(features)
            y.append(flag)
            # np.append(x, features)
            # np.append(y, flag)
            counter += 1
            start_point += 1
            index += 1
        else:
            break
    print("Time to finish the loop", time.time() - time_1)
    return x, y

x, y = get_X(df.to_numpy())
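Note that after this loop x is still a Python list of 2D arrays, so I would still hit the same np.array(x) memory problem at the end. What I am trying to get to is something like the rough sketch below, where the output is preallocated once and each window is written into it directly. The names (get_X_preallocated, n_windows, window_size) are made up for illustration, and the ticker-jump logic from my real code is left out:

import numpy as np

def get_X_preallocated(df_values, n_windows, window_size=20):
    # n_windows must be known (or over-estimated) up front and must not
    # exceed len(df_values) - window_size
    n_cols = df_values.shape[1]
    x = np.empty((n_windows, window_size, n_cols), dtype=df_values.dtype)
    y = np.empty(n_windows, dtype=df_values.dtype)
    for i in range(n_windows):
        x[i] = df_values[i:i + window_size]  # written straight into the preallocated block
        y[i] = df_values[i][30]              # column 30 assumed to hold the flag, as above
    return x, y

# hypothetical usage: x, y = get_X_preallocated(df.to_numpy(), n_windows=len(df) - 20)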