Numpy array of strings, value assignment

Question

I am working with numpy arrays filled with strings. My goal is to assign the values of a second, smaller array b to a slice of a first array a.

The implementation that I had in mind is the following:

import numpy as np

a = np.empty((10,), dtype=str)
b = np.array(['TEST' for _ in range(2)], dtype=str)
print(b)

a[1:3] = b
print(a)

print(b) returns, as expected, ['TEST' 'TEST'].

But then print(a) returns ['' 'T' 'T' '' '' '' '' '' '' '']. So the values from b are not correctly assigned to the slice of a.

Any idea of what is causing this wizardry?

Thanks!

Grégoire Roussel · Accepted Answer · 2020-04-17 14:58:42Z

6

You can see it as a form of overflow.

Have a look at the exact types of your arrays:

>>> a.dtype
dtype('<U1') # each element holds at most 1 Unicode character
>>> b.dtype
dtype('<U4') # each element holds at most 4 Unicode characters

When you define an array of strings, numpy infers the smallest string size that can hold all the elements you defined.

for a, which is created without any values, the width defaults to 1 character
for b, TEST is 4 characters long, so the width is 4
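
The inferred widths can be checked directly. This is a minimal sketch; the array c is a hypothetical extra example that is not part of the question, and the byte-order prefix in the printed dtype (the leading <) assumes a little-endian machine.

```python
import numpy as np

# Created without values: the string width defaults to 1 character.
a = np.empty((10,), dtype=str)

# Created from values: the width is that of the longest element.
b = np.array(['TEST', 'TEST'], dtype=str)
c = np.array(['a', 'abcdef'])  # hypothetical example, longest element has 6 chars

print(a.dtype)  # <U1
print(b.dtype)  # <U4
print(c.dtype)  # <U6
```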

Then, when you assign a new value to any element of an array of strings, numpy silently truncates the new value to the capacity of the array. Here, it keeps only the first letter of TEST, T.

Your slicing operation has nothing to do with it:

a = np.zeros(1, dtype=str)
a[0] = 'hello world'
print(a[0])
# h

How to overcome it

Define a with a dtype of object: numpy will no longer try to optimize its storage space, and you'll get predictable behaviour.
Increase the size of the char array: a = np.zeros(10, dtype='U256') will increase the capacity of each cell to 256 characters.
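
Both options can be sketched as follows. The names a_obj and a_wide are hypothetical, and np.full is used for the object array only so that it starts out filled with empty strings (np.empty with dtype=object would hold None instead).

```python
import numpy as np

b = np.array(['TEST', 'TEST'])

# Option 1: object dtype. Each cell references a Python str of any length.
a_obj = np.full((10,), '', dtype=object)
a_obj[1:3] = b

# Option 2: a fixed width that is large enough for the expected values.
a_wide = np.zeros(10, dtype='U256')
a_wide[1:3] = b

print(a_wide[:4])  # ['' 'TEST' 'TEST' '']
```

The object array gives up the compact fixed-width storage, so the second option is usually preferable when a reasonable upper bound on the string length is known.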



2 Comments

pierresegonne Over a year ago

Thanks! Hahaha, what a coincidence! Thank you for your answer, and I hope you're doing well!

Grégoire Roussel Over a year ago

You're welcome! It's quite an unlikely encounter :) Take care too!

sempersmile · Accepted Answer · 2020-04-17 14:52:53Z

2

The problem is that numpy truncates the strings to length 1, because a was created with dtype=str and no values.

You can resolve the issue by using dtype='<U4' though.

So the following code would work for your case:

import numpy as np

a = np.empty((10,), dtype='<U4')
b = np.array(['TEST' for _ in range(2)], dtype=str)
print(b)

a[1:3] = b
print(a)

The number in dtype='<U4' specifies the maximum possible length for a string in that array - so for your case 4 is fine since 'TEST' only has 4 letters.
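
If the length of the values is not known in advance, the width does not have to be hard-coded. A possible sketch, assuming the target array can simply reuse the dtype of b; the value 'LONGER' and the name narrow are hypothetical and only serve the illustration:

```python
import numpy as np

b = np.array(['TEST', 'LONGER'])  # inferred dtype has width 6

# Reuse b's dtype so that a can hold every value of b.
a = np.zeros(10, dtype=b.dtype)
a[1:3] = b
print(a[:4])  # ['' 'TEST' 'LONGER' '']

# An existing narrow array can be widened with astype before assigning.
narrow = np.zeros(3, dtype=str)   # width 1
widened = narrow.astype(b.dtype)  # width 6, returns a new array
widened[0] = 'LONGER'
print(widened[0])  # LONGER
```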



2 Comments

pierresegonne Over a year ago

Thanks! Do you know the reason behind that?

sempersmile Over a year ago

numpy tries to be as efficient as possible. If it stored strings of arbitrary length with dynamic memory allocation, this would be much slower than the current behavior, where it can preallocate the memory.
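
The preallocation is visible in the array's memory footprint. A small sketch with hypothetical names narrow and wide: a 'U' dtype stores 4 bytes per character, so every element occupies the full width whether or not it is used.

```python
import numpy as np

narrow = np.zeros(10, dtype='U4')
wide = np.zeros(10, dtype='U256')

# itemsize is bytes per element, nbytes is bytes for the whole array.
print(narrow.itemsize, narrow.nbytes)  # 16 160
print(wide.itemsize, wide.nbytes)      # 1024 10240
```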
