I am working with numpy arrays filled with strings. My goal is to assign the values of a smaller array b to a slice of a larger array a.

The implementation that I had in mind is the following:

import numpy as np

a = np.empty((10,), dtype=str)

b = np.array(['TEST' for _ in range(2)], dtype=str)

print(b)

a[1:3] = b

print(a)

print(b) prints, as expected, ['TEST' 'TEST'].

But then print(a) prints ['' 'T' 'T' '' '' '' '' '' '' '']. So the values from b are not correctly assigned to the slice of a.

Any idea of what is causing this wizardry?

Thanks!

2 Answers

You can see it as a form of overflow.

Have a look at the exact types of your arrays:

>>> a.dtype
dtype('<U1') # Array of 1 unicode char
>>> b.dtype
dtype('<U4') # array of 4 unicode chars

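When you define an array of strings, numpy gives it a fixed string width. If you pass data, it infers the smallest width that can contain all the elements you defined; if you pass only dtype=str and no data, as with np.empty, the width is 1.

  • for a, no data is given, so each cell holds 1 character
  • for b, TEST is 4 chars long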

Then, when you assign a new value to an element of an array of strings, numpy truncates the new value to the width of the array. Here, it keeps only the first letter of TEST, T.

Your slicing operation has nothing to do with it:

a = np.zeros(1, dtype=str)
a[0] = 'hello world'
print(a[0])
# h

How to overcome it

  1. Define a with a dtype of object: numpy will no longer store the strings in fixed-width cells, and you'll get predictable behaviour
  2. Increase the size of the char array: a = np.zeros(10, dtype='U256') will increase the capacity of each cell to 256 characters
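Both options can be sketched as follows. This is only a minimal illustration: the names a_obj and a_wide are made up for the example, and np.full is used here simply to start the object array with empty strings.

```python
import numpy as np

b = np.array(['TEST', 'TEST'])

# Option 1: object dtype holds references to ordinary Python strings,
# so there is no fixed width that a value could overflow.
a_obj = np.full(10, '', dtype=object)
a_obj[1:3] = b

# Option 2: fixed-width unicode dtype with room for 256 characters per cell.
a_wide = np.zeros(10, dtype='U256')
a_wide[1:3] = b

print(a_obj)
print(a_wide)
```

Option 2 still truncates anything longer than 256 characters, so it only fits when an upper bound on the string length is known.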

2 Comments

Thanks! Haha, what a coincidence! Thank you for your answer, and I hope you are doing well!
You're welcome! It's quite an unlikely encounter :) Take care too!

The problem is that numpy limits the strings to length 1 when the array is created with dtype=str and no data, so anything longer is truncated.

You can resolve the issue by using dtype='<U4' though.

So the following code would work for your case:

import numpy as np

a = np.empty((10,), dtype='<U4')

b = np.array(['TEST' for _ in range(2)], dtype=str)

print(b)

a[1:3] = b

print(a)

The number in dtype='<U4' specifies the maximum possible length for a string in that array - so for your case 4 is fine since 'TEST' only has 4 letters.
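If the maximum length is not known in advance, one possible variation (a sketch, not part of the code above) is to reuse the dtype that numpy inferred for b instead of hardcoding the 4:

```python
import numpy as np

b = np.array(['TEST', 'TEST'])

# Reuse b's dtype (here '<U4') so that a can hold anything b contains.
a = np.zeros(10, dtype=b.dtype)
a[1:3] = b

print(a.dtype)
print(a)
```

np.zeros is used here rather than np.empty so that the untouched cells are guaranteed to be empty strings.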

2 Comments

Thanks! Do you know the reason behind that?
numpy tries to be as efficient as possible. Storing strings of arbitrary length would require dynamic memory allocation, which would take far longer than the current behavior, where it can preallocate the memory.
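A small sketch of what that preallocation looks like in practice (the names narrow and wide are only illustrative): every cell of a '<U' array has the same fixed size, at 4 bytes per character, so the total size of the array is known as soon as it is created.

```python
import numpy as np

# Every cell in a '<U' array has the same fixed size: 4 bytes per character.
narrow = np.empty(10, dtype='<U1')
wide = np.empty(10, dtype='<U4')

# itemsize is the number of bytes per cell, nbytes the size of the whole buffer.
print(narrow.itemsize, narrow.nbytes)
print(wide.itemsize, wide.nbytes)
```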
