
I am using np.delete() to drop a specific band from my ndarray. However, while profiling the memory usage with memory_profiler, I noticed that after using np.delete the memory usage doubles, even though I would expect it to decrease slightly.

Here is the full example:

import numpy as np

def clean_data(raster_np):
    # Build column names
    scl_index = 0
    scl = raster_np[:, scl_index]

    # Create mask for invalid SCL values
    invalid_scl_mask = np.isin(scl, [0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12])

    # Set rows to NaN where SCL is invalid
    raster_np[invalid_scl_mask, :] = np.nan

    # Drop SCL column
    raster_np = np.delete(raster_np, scl_index, axis=1)

    # Replace 0s with NaN
    raster_np[raster_np == 0] = np.nan


    return raster

# Function call
raster, meta = load_s2_tile(...)
raster = clean_data(raster)

Here is the profiling output (see line 33):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    20   5647.9 MiB   5647.9 MiB           1   @profile
    21                                         def clean_data(raster_np):
    22                                             # Build column names
    23   5647.9 MiB      0.0 MiB           1       scl_index = 0
    24   5647.9 MiB      0.0 MiB           1       scl = raster_np[:, scl_index]
    25                                         
    26                                             # Create mask for invalid SCL values
    27   5762.9 MiB    115.0 MiB           1       invalid_scl_mask = np.isin(scl, [0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12])
    28                                         
    29                                             # Set rows to NaN where SCL is invalid
    30   5762.9 MiB      0.0 MiB           1       raster_np[invalid_scl_mask, :] = np.nan
    31                                         
    32                                             # Drop SCL column
    33  10821.6 MiB   5058.8 MiB           1       raster_np = np.delete(raster_np, scl_index, axis=1)
    34                                         
    35                                             # Replace 0s with NaN
    36  10821.8 MiB      0.2 MiB           1       raster_np[raster_np == 0] = np.nan
    37                                         
    38                                         
    39  10821.8 MiB      0.0 MiB           1       return raster

If someone could point out why this is the case and how to avoid it, that would be great! I would not expect this behaviour, as I do not have any other references to raster.

  • Looking inside the function isn't helpful, since the caller still holds a reference to the original array. You have to wait until the call is done and you can replace/remove the reference to the original. Also, a full minimal reproducible example is required for this kind of question. Commented May 29 at 18:01
  • Well, even looking only inside the function we see at least one reference: scl is a reference to the old raster_np. So both the old and the new raster_np have to continue to exist in memory. When scl is no longer used (or is overwritten, for example with a slice of the new raster_np), the old raster_np can be garbage collected, assuming there aren't any other references outside this partial code (see the sketch below). Commented May 30 at 7:17
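A quick way to see the internal reference the second comment points at (toy shapes here, not the real raster; scl is a view because basic slicing never copies):

import numpy as np

raster_np = np.zeros((5, 13), dtype=np.float32)
scl = raster_np[:, 0]                       # basic slicing returns a view, not a copy

print(scl.base is raster_np)                # True: scl keeps the original buffer alive
print(np.shares_memory(scl, raster_np))     # True: no data was copied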

2 Answers


raster_np = np.delete(raster_np, scl_index, axis=1)

np.delete returns a new array. It does not modify raster_np in place.

Assigning that result back to raster_np rebinds the local name, but because this happens inside the function, it does not replace the external reference. So now you have both the original raster_np and the new, smaller array in memory.

The final assignment raster = clean_data(raster) may drop the memory use, depending on garbage collection and overall memory management issues.
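A small illustration of this point (the array shape is made up; only the behaviour of np.delete matters here):

import numpy as np

a = np.ones((1000, 13), dtype=np.float32)
b = np.delete(a, 0, axis=1)          # allocates a brand-new array; a is untouched

print(np.shares_memory(a, b))        # False: b is a separate copy
print(a.shape, b.shape)              # (1000, 13) (1000, 12)

While both a and b are referenced, both allocations stay alive, which is exactly the doubling the profiler shows.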


3 Comments

Thanks for your explanation! While I understand that np.delete returns a new array, I would expect that at this point there are no more references to the raster from outside the function. So shouldn't the memory for the original raster get cleared? Is there any way to force freeing the memory held by the external reference? Or would this be considered bad practice? While the reply from @quantumsurge provides a solution, I would nevertheless like to understand how to deal with such situations if encountered elsewhere.
That assignment within the function breaks the function's own link to the original raster object. So Python/NumPy will not gc the external raster until after you exit the function (a sketch of this follows below these comments).
No need to invoke external (outside the function) references for that correct explanation. There is an internal one: scl
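A minimal sketch of the point about the external reference (drop_first_column and the toy array below are placeholders, not the original code): while the function runs, the caller's variable still points at the original array, so it cannot be collected before the call returns.

import sys
import numpy as np

def drop_first_column(arr):
    out = np.delete(arr, 0, axis=1)   # new allocation; arr itself is untouched
    # arr is still referenced by the caller's variable (and by this parameter),
    # so the original buffer cannot be freed before the function returns.
    print(sys.getrefcount(arr))
    return out

raster = np.zeros((4, 3), dtype=np.float32)
raster = drop_first_column(raster)    # the old array becomes collectable only here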

np.delete() creates a new array rather than modifying the existing array in place.

In your example, you can use slicing instead, which is O(1):

raster_np = raster_np[:, 1:]  # if SCL is the first column
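A rough sketch of how the question's function could apply this, assuming the SCL band really is the first column (this reworks the posted code and has not been run against the real data):

import numpy as np

def clean_data(raster_np):
    scl_index = 0
    scl = raster_np[:, scl_index]

    # Mask rows with invalid SCL values (in place, no extra full-size copy)
    invalid_scl_mask = np.isin(scl, [0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12])
    raster_np[invalid_scl_mask, :] = np.nan

    # Drop the SCL column with a slice: an O(1) view, not a new allocation
    raster_np = raster_np[:, scl_index + 1:]

    # Replace 0s with NaN; this writes through the view into the original buffer
    raster_np[raster_np == 0] = np.nan

    return raster_np

The trade-off is that the returned array is a view, so the full original buffer stays alive as long as the result does; if the extra column's memory matters, an explicit .copy() of the slice (followed by dropping the original reference) releases it at the cost of one copy.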

1 Comment

Yep. That turns to our advantage the exact reason we have a problem in the first place: the existing line scl = raster_np[:, 0] does not create a new array but a view of the existing one, which is why, after raster_np is "np.deleted", the old array isn't garbage collected.
