1

I have a large numpy array with 4 million rows and 4 columns (shape = (4000000,4))

I need to reduce the number of rows based on the value in the fourth column. For example, a few rows of my data set look like this:

import numpy as np
a = np.array([[1.32, 24.42, 224.21312, 0], [1.32, 24.42, 224.21312, 0], [1.32, 24.42, 224.21312, 1], [1.32, 24.42, 224.21312, 1], [1.32, 24.42, 224.21312, 0]])

My result should be the following (only rows whose last column equals 1):

b = np.array([[1.32, 24.42, 224.21312, 1], [1.32, 24.42, 224.21312, 1]])

A for loop to go through each row is taking a long time to process.

I have 200 of these arrays, so I am already using multiprocessing for each array.

Looking for suggestions.

2 Answers

3

Does this work for you?

a[a[:,3] == 1]

gives:

array([[  1.32   ,  24.42   , 224.21312,   1.     ],
       [  1.32   ,  24.42   , 224.21312,   1.     ]])
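
To make the idea a bit more concrete, here is a minimal sketch of what the mask does, using a small random array as a stand-in for the 4-million-row one. The array contents, the seed, and the names mask, filtered and looped are made up for illustration:

```python
import numpy as np

# Hypothetical stand-in for the real data: 3 float columns plus a 0/1 flag column.
rng = np.random.default_rng(0)
values = rng.random((1000, 3))
flags = rng.integers(0, 2, size=(1000, 1))
a = np.hstack([values, flags])

# Comparing the fourth column to 1 builds a boolean array with one entry per row.
mask = a[:, 3] == 1

# Indexing with the mask keeps only the rows where it is True (this returns a copy).
filtered = a[mask]

# The slow row-by-row loop, kept only to check that both approaches agree.
looped = np.array([row for row in a if row[3] == 1])
```

The comparison and the selection both run inside NumPy rather than in a Python loop, which is why this should be much faster on large arrays.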

1 Comment

a[a[:, -1] == 1] is slightly better because it selects on the last column no matter how many columns the array has.
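
A small sketch of that point, assuming the flag always sits in the last column. The helper name keep_flagged and the six-column array are hypothetical, for illustration only:

```python
import numpy as np

def keep_flagged(arr):
    # Hypothetical helper: keep the rows whose last column equals 1.
    return arr[arr[:, -1] == 1]

four_cols = np.array([[1.32, 24.42, 224.21312, 0],
                      [1.32, 24.42, 224.21312, 1]])
six_cols = np.array([[1, 2, 3, 4, 5, 1],
                     [1, 2, 3, 4, 5, 0],
                     [9, 8, 7, 6, 5, 1]])

# The same function works for both widths because -1 always means "last column".
kept_four = keep_flagged(four_cols)
kept_six = keep_flagged(six_cols)
```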
0

You can convert it to a DataFrame, do the filtering there, and then convert back to an array:

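import pandas as pd

df = pd.DataFrame(a)
df = df[df[3] == 1]  # keep rows whose column 3 equals 1
a = df.to_numpy()  # as_matrix() was deprecated and removed in pandas 1.0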

Output:

array([[  1.32   ,  24.42   , 224.21312,   1.     ],
       [  1.32   ,  24.42   , 224.21312,   1.     ]])

1 Comment

You haven't really converted anything, because it's still a numpy array underneath; this just wraps it in the overhead of pandas. You can operate on the array directly, as shown in the other answer.
