How large can a Numpy Unicode array be?
dtype='U100', 'U1000', 'U1000000'?
I cannot find any reference to maximums in the documentation.
I found this line in https://numpy.org/doc/stable/reference/arrays.dtypes.html:
Total dtype itemsize is limited to ctypes.c_int.
For a 32-bit signed integer that is 2,147,483,647 bytes. The limit applies to the byte size of one element, and each Unicode code point in a 'U' dtype is stored as 4 bytes (UTF-32), so the maximum character count is 2147483647 // 4, or 536,870,911.
>>> import numpy as np
>>> np.array(['abcdef'],dtype='U536870911')
array(['abcdef'], dtype='<U536870911')
>>> np.array(['abcdef'],dtype='U536870911').itemsize
2147483644
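To sanity-check that arithmetic against the documented cap (this assumes the common case where ctypes.c_int is a 32-bit type, which it is on virtually all current platforms):
>>> import ctypes
>>> ctypes.sizeof(ctypes.c_int)    # C int is 4 bytes on common platforms
4
>>> (2**31 - 1) // 4               # max code points at 4 bytes per UTF-32 unit
536870911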
Also, from the documentation of dtype.itemsize:
The element size of this data-type object.
For 18 of the 21 types this number is fixed by the data-type. For the flexible data-types, this number can be anything.
Some clarifications from the comments: 'U10000' and ('U', 10000) are equivalent spellings, so while the documentation talks about the tuple form, it covers the string form as well. Each code point takes 4 bytes, since np.array(['abcdefghij']).itemsize is 40. For an actual array, itemsize is the byte size of one element, and longer inputs are truncated to the dtype's length: x = np.array(['abcdef'], dtype=('U', 2)) returns array(['ab'], dtype='<U2') and x.itemsize == 8. Empirically, ('U', 2**29 - 1) is the largest count accepted before NumPy complains, and 4 times that is 0x7ffffffc bytes, just under the c_int cap.
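Putting those observations into one quick session:
>>> np.array(['abcdefghij']).itemsize   # 10 code points * 4 bytes each
40
>>> x = np.array(['abcdef'], dtype=('U', 2))
>>> x                                   # input truncated to 2 code points
array(['ab'], dtype='<U2')
>>> x.itemsize
8
>>> 4 * (2**29 - 1) == 0x7ffffffc       # largest accepted count, in bytes
True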
numpy isn't great for large strings. nbytes for a np.zeros((1000,), dtype='U500') will be 1000*500*4, regardless of the actual strings stored. repeat and tile don't change the dtype, nor do they depend on any dtype details (other than the number of bytes per element), so they behave the same whether the dtype is numeric or 'U1000'. But as you found in the other question, joining the elements into larger strings is best done with lists and Python strings.
str objects have a lot of overhead. It is possible to intern Python strings, however, which makes them very efficient if you have many potential repeats; the runtime may also intern some strings on its own as an optimization. Generally, experiment with sys.getsizeof to get an idea of the cost, keeping in mind that a 'U' dtype always allocates space for the maximum possible length per element.
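A short session illustrating those points (the interned string below is just an arbitrary example):
>>> import sys
>>> a = np.zeros((1000,), dtype='U500')
>>> a.nbytes                  # 1000 elements * 500 code points * 4 bytes
2000000
>>> np.repeat(a, 2).dtype     # repeat keeps the per-element dtype unchanged
dtype('<U500')
>>> s = sys.intern('some repeated value')
>>> t = sys.intern('some repeated value')
>>> s is t                    # interned equal strings share one object
True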