Pandas cast on Numpy different dtype array

Question

I'm currently having an issue with the way that Pandas casts a NumPy array into a DataFrame.

Sample code:

import numpy as np
import pandas as pd

example_array = np.array([
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan]])

df = pd.DataFrame(example_array, index=['int', 'string', 'float', 'nan'])
df = df.T

df.dtypes

output:

int       object
string    object
float     object
nan       object
dtype: object

It seems that either NumPy or Pandas does not recognise the types or convert them properly. While looking around, I found a suggestion to specify the dtype when creating a Series; however, this does not help me, as I'm working with a large NumPy array.

Example by @r-max:

In [2]: df = pd.DataFrame({'x': pd.Series(['1.0', '2.0', '3.0'], dtype=float), 'y': pd.Series(['1', '2', '3'], dtype=int)})

In [3]: df
Out[3]: 
   x  y
0  1  1
1  2  2
2  3  3

[3 rows x 2 columns]

In [4]: df.dtypes
Out[4]: 
x    float64
y      int64
dtype: object

Is there a better solution to this problem? Is this a bug?

Thanks!

Fixed. Either way, from what I saw, Pandas converts it into a NumPy array.
– Akmal Soliev, Commented May 5, 2021 at 13:42
Okay, you've got an array now, but that's an array of object dtype. You'll want to avoid building arrays like that if you want to use NumPy effectively.
– user2357112, Commented May 5, 2021 at 13:44
Hint: if you just try printing the DataFrame, you will see why everything is marked as object.
– heretolearn, Commented May 5, 2021 at 13:48
As you can see from the first example, I have constructed multiple independent lists and passed them through DataFrame; however, the result is the same. That's nice to know, but I am wondering what the solution to this issue is.
– Akmal Soliev, Commented May 5, 2021 at 14:14

heretolearn · Accepted Answer · 2021-05-05 15:04:47Z

The issue is not with NumPy or Pandas but with how you have used np.array to create the example array.

When you define a NumPy array from a nested list, each sublist is treated as a row. So if you print df in the above case (before the df.T transpose), it looks like this:

           0     1      2
int        1     2      3
string   one   two  three
float   4.01  5.01   6.01
nan      nan   nan    nan

As you can see, each column has a mix of int, string, float, and null values. This is why you get object when you check the datatype of each column.

I suppose what you want is to have these values as columns.

import numpy as np
import pandas as pd

# One homogeneous array per column, so each keeps its own dtype
arr1 = np.array([1, 2, 3])
arr2 = np.array(['one', 'two', 'three'])
arr3 = np.array([4.01, 5.01, 6.01])
arr4 = np.array([np.nan, np.nan, np.nan])

df = pd.DataFrame({'int': arr1, 'string': arr2, 'float': arr3, 'nan': arr4})

Output:

   int string  float  nan
0    1    one   4.01  NaN
1    2    two   5.01  NaN
2    3  three   6.01  NaN

df.dtypes

int         int64
string     object
float     float64
nan       float64
dtype: object
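If the data starts out as a list of rows rather than as separate arrays, the same column-wise construction can be sketched by pairing each row with a column name. The names `rows` and `names` below are made up for illustration; this is a minimal sketch, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one inner list per intended column
rows = [
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan],
]
names = ['int', 'string', 'float', 'nan']

# Pair each name with its list; pandas infers a dtype per column
df = pd.DataFrame(dict(zip(names, rows)))
print(df.dtypes)
```

Because the values never pass through a single 2-D np.array, the numbers are not converted to strings first.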

hpaulj · Accepted Answer · 2021-05-05 15:38:19Z

Note the starting array dtype:

In [158]: example_array = np.array([
     ...: [1, 2, 3],
     ...: ['one', 'two', 'three'],
     ...: [4.01, 5.01, 6.01],
     ...: [np.nan, np.nan, np.nan]])
In [159]: example_array
Out[159]: 
array([['1', '2', '3'],
       ['one', 'two', 'three'],
       ['4.01', '5.01', '6.01'],
       ['nan', 'nan', 'nan']], dtype='<U32')

You've lost the distinction between strings, floats, and integers right away.
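If the data really has to live in a single 2-D array, one possible workaround is to build it with dtype=object so the Python values are kept as they are, and then ask pandas to re-infer the column types after the transpose with DataFrame.infer_objects(). This is a sketch under that assumption; the variable names are illustrative, and the dtype reported for the string column may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# dtype=object keeps the original Python ints, strs and floats
obj_array = np.array([
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan]], dtype=object)

df_obj = pd.DataFrame(obj_array, index=['int', 'string', 'float', 'nan']).T

# Every column is still object here; infer_objects() attempts a
# soft conversion of each object column to a more specific dtype
df_inferred = df_obj.infer_objects()
print(df_inferred.dtypes)
```

Object arrays give up most of NumPy's speed, so for a large array it is probably still better to keep each column in its own homogeneous array.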

But starting with a list:

In [178]: alist =[
     ...: [1, 2, 3],
     ...: ['one', 'two', 'three'],
     ...: [4.01, 5.01, 6.01],
     ...: [np.nan, np.nan, np.nan]]

and performing a "transpose" on that to make a list of tuples:

In [179]: blist = list(zip(*alist))
In [180]: blist
Out[180]: [(1, 'one', 4.01, nan), (2, 'two', 5.01, nan), (3, 'three', 6.01, nan)]

We can then make a dataframe with distinct dtypes (by column):

In [181]: pd.DataFrame(_, columns=list('abcd'))
Out[181]: 
   a      b     c   d
0  1    one  4.01 NaN
1  2    two  5.01 NaN
2  3  three  6.01 NaN
In [182]: _.dtypes
Out[182]: 
a      int64
b     object
c    float64
d    float64
dtype: object

An equivalent structured array, with one field per column:

In [184]: _181.to_records()
Out[184]: 
rec.array([(0, 1, 'one', 4.01, nan), (1, 2, 'two', 5.01, nan),
           (2, 3, 'three', 6.01, nan)],
          dtype=[('index', '<i8'), ('a', '<i8'), ('b', 'O'), ('c', '<f8'), ('d', '<f8')])
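Going the other way, a structured array can also be built directly from the list of tuples and handed to pandas. The field names and types below ('i8', 'U5', 'f8') are chosen for this example only; this is a rough sketch rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Same tuples as blist above, one tuple per row
records = [(1, 'one', 4.01, np.nan),
           (2, 'two', 5.01, np.nan),
           (3, 'three', 6.01, np.nan)]

# One (name, type) pair per field; 'U5' fits strings of up to 5 characters
field_types = [('a', 'i8'), ('b', 'U5'), ('c', 'f8'), ('d', 'f8')]
structured = np.array(records, dtype=field_types)

# pandas turns each field into a column
df_from_struct = pd.DataFrame(structured)
print(df_from_struct.dtypes)
```

Here the per-column dtypes are stated explicitly up front, so nothing depends on type inference.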
