1

I'm having an issue with the way pandas converts a NumPy array into a DataFrame.

Sample code:

import numpy as np
import pandas as pd

example_array = np.array([
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan]])


df = pd.DataFrame(example_array, index=['int', 'string', 'float', 'nan'])

df = df.T

df.dtypes

output:

int       object
string    object
float     object
nan       object
dtype: object

It seems that either NumPy or pandas neither recognises the types nor converts them properly. One suggestion I found was to specify the dtype when creating each Series, but that does not help me because I'm working with a large NumPy array.

Example by @r-max:

In [2]: df = pd.DataFrame({'x': pd.Series(['1.0', '2.0', '3.0'], dtype=float), 'y': pd.Series(['1', '2', '3'], dtype=int)})

In [3]: df
Out[3]: 
   x  y
0  1  1
1  2  2
2  3  3

[3 rows x 2 columns]

In [4]: df.dtypes
Out[4]: 
x    float64
y      int64
dtype: object

Is there a better solution to this problem? Is this a bug?

Thanks!

5
  • I don't see any NumPy arrays in the input. Commented May 5, 2021 at 13:40
  • Fixed. Either way, from what I saw, pandas converts it into a NumPy array. Commented May 5, 2021 at 13:42
  • Okay, you've got an array now, but that's an array of object dtype. You'll want to avoid building arrays like that if you want to use NumPy effectively. Commented May 5, 2021 at 13:44
  • Hint : if you just try printing the dataframe, you will know why everything is marked as object Commented May 5, 2021 at 13:48
  • As you can see from the first example, I have constructed multiple independent lists and passed them through DataFrame, however, the result is the same. That's nice to know, but I am wondering what is the solution to this issue? Commented May 5, 2021 at 14:14

2 Answers

2

The issue is not with NumPy or pandas but with how you used np.array to create the example array.

When you define a NumPy array from nested lists, each sub-list is treated as a row. So if you print df before the transpose in the example above, it looks like this:

           0     1      2
int        1     2      3
string   one   two  three
float   4.01  5.01   6.01
nan      nan   nan    nan

As you can see, each column has a mix of int, string, float, and null values. This is why the columns are reported as object when you check their dtypes.

I suppose what you want is to have these values as columns.

arr1 = np.array([1, 2, 3])
arr2 = np.array(['one', 'two', 'three'])
arr3 = np.array([4.01, 5.01, 6.01])
arr4 = np.array([np.nan, np.nan, np.nan])

df = pd.DataFrame({'int': arr1, 'string': arr2, 'float': arr3, 'nan': arr4})

Output:

   int string  float  nan
0    1    one   4.01  NaN
1    2    two   5.01  NaN
2    3  three   6.01  NaN

df.dtypes

int         int64
string     object
float     float64
nan       float64
dtype: object
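
If the data has already been through np.array as in the question, so every value is now a string, one possible way to recover numeric columns is to cast them explicitly after the transpose. This is only a sketch and not part of the answer above: the `target_dtypes` mapping and the `converted` name are made up for illustration, and it assumes you know which columns hold numbers.

```python
import numpy as np
import pandas as pd

# Same construction as in the question: NumPy coerces every value to a string.
example_array = np.array([
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan]])

df = pd.DataFrame(example_array, index=['int', 'string', 'float', 'nan']).T

# Hypothetical mapping: the columns that should be parsed back into numbers.
# The 'string' column is left out, so it keeps its current dtype.
target_dtypes = {'int': 'int64', 'float': 'float64', 'nan': 'float64'}
converted = df.astype(target_dtypes)

print(converted.dtypes)
```

Casting works here because the strings are valid number literals; a column containing something like 'one' would raise an error under astype.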

0

Note the starting array dtype:

In [158]: example_array = np.array([
     ...: [1, 2, 3],
     ...: ['one', 'two', 'three'],
     ...: [4.01, 5.01, 6.01],
     ...: [np.nan, np.nan, np.nan]])
In [159]: example_array
Out[159]: 
array([['1', '2', '3'],
       ['one', 'two', 'three'],
       ['4.01', '5.01', '6.01'],
       ['nan', 'nan', 'nan']], dtype='<U32')

You've lost the distinction between strings, floats, and integers right away.

But starting with a list:

In [178]: alist =[
     ...: [1, 2, 3],
     ...: ['one', 'two', 'three'],
     ...: [4.01, 5.01, 6.01],
     ...: [np.nan, np.nan, np.nan]]

and performing a "transpose" on that to make a list of tuples:

In [179]: blist = list(zip(*alist))
In [180]: blist
Out[180]: [(1, 'one', 4.01, nan), (2, 'two', 5.01, nan), (3, 'three', 6.01, nan)]

We can then make a dataframe with distinct dtypes (by column):

In [181]: pd.DataFrame(_, columns=list('abcd'))
Out[181]: 
   a      b     c   d
0  1    one  4.01 NaN
1  2    two  5.01 NaN
2  3  three  6.01 NaN
In [182]: _.dtypes
Out[182]: 
a      int64
b     object
c    float64
d    float64
dtype: object

An equivalent structured array, with one field per column:

In [184]: _181.to_records()
Out[184]: 
rec.array([(0, 1, 'one', 4.01, nan), (1, 2, 'two', 5.01, nan),
           (2, 3, 'three', 6.01, nan)],
          dtype=[('index', '<i8'), ('a', '<i8'), ('b', 'O'), ('c', '<f8'), ('d', '<f8')])
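
If the data has to start out as a single NumPy array, as in the question, a possible variation on the list approach is to build the array with dtype=object and then ask pandas to infer per-column dtypes. This is a sketch under that assumption rather than part of the session above; the names `obj_array` and `inferred` are made up for illustration.

```python
import numpy as np
import pandas as pd

# dtype=object stops NumPy from coercing everything to strings;
# each cell keeps its original Python object.
obj_array = np.array([
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan]], dtype=object)

# Transpose so each original row becomes a column, then let pandas
# infer a better dtype for each object column.
inferred = pd.DataFrame(
    obj_array.T, columns=['int', 'string', 'float', 'nan']
).infer_objects()

print(inferred.dtypes)
```

Object arrays give up most of NumPy's speed and memory advantages, so for a large dataset it is probably better to keep each column in its own typed array when that is an option.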
