
Why does read_csv automatically convert all my columns to 'object' dtypes? I want to read a 10 GB CSV (float and int columns) and load it into a pandas DataFrame. I don't run into this issue (where every numeric column is converted to object) when I read a smaller file (100 MB or less) with either pandas or dask.

CSV sample:

I tried to specify dtype explicitly as part of read_csv; I still ended up with objects (verified after the read with df.dtypes):

import numpy as np
import pandas as pd

file = 'D:/path/combine.csv'
data_type = {'Lat': np.float32, 'Long': np.float32,
             'HorizontalAccuracy': np.int64, 'RSRP': np.int64}
data = pd.read_csv(file, low_memory=False, dtype=data_type)
data.dtypes
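
To see which raw values are defeating the cast, one option (a minimal sketch, assuming the column names in the dtype dict actually match the file's header) is to read the suspect columns back as strings and coerce them:

import pandas as pd

file = 'D:/path/combine.csv'
# Read the suspect columns as plain strings, coerce to numeric,
# and inspect the rows that fail the cast (they become NaN).
raw = pd.read_csv(file, usecols=['Lat', 'Long'], dtype=str)
coerced = raw.apply(pd.to_numeric, errors='coerce')
print(raw[coerced.isna().any(axis=1)].head(20))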

I also tried reading just the first line of the file to get the column names automatically, then reading the full file with a dtype defined for every column; I still ended up with all objects:

import numpy as np
import pandas as pd

file = 'D:/path/combine.csv'
col_names = pd.read_csv(file, nrows=0).columns
types_dict = {'Lat': np.float32, 'Long': np.float32,
              'HorizontalAccuracy': np.int64, 'RSRP': np.int64}
types_dict.update({col: np.int64 for col in col_names if col not in types_dict})
data = pd.read_csv(file, low_memory=False, dtype=types_dict)
data.dtypes

TypeError: Cannot cast array from dtype('O') to dtype('float32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError: could not convert string to float: '\x1a'
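
For reference, '\x1a' is the ASCII SUB control character (the old DOS end-of-file marker), which often shows up in truncated or badly concatenated files. A minimal sketch to locate it by byte offset:

# Scan the file in binary chunks and report every occurrence of the
# SUB (0x1A) byte; these offsets point at the corrupt regions.
chunk_size = 1 << 20  # 1 MiB
offset = 0
with open('D:/path/combine.csv', 'rb') as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        pos = chunk.find(b'\x1a')
        while pos != -1:
            print('found 0x1A at byte offset', offset + pos)
            pos = chunk.find(b'\x1a', pos + 1)
        offset += len(chunk)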

I also tried read_csv with dask while explicitly specifying dtype; I got an error about not being able to convert a string to float:

import numpy as np
import dask.dataframe as dd

file = 'D:/path/combine.csv'
data_type = {'Lat': np.float32, 'Long': np.float32,
             'HorizontalAccuracy': np.int64, 'RSRP': np.int64}
ddf = dd.read_csv(file, dtype=data_type)
ddf.compute()

TypeError: Cannot cast array from dtype('O') to dtype('float32') according to the rule 'safe'

ValueError: could not convert string to float: 'Latitude'
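
The string 'Latitude' in this traceback is column-header text showing up inside the data, which is typical when several CSVs are concatenated with their header lines left intact. A minimal sketch to count such stray header rows (assuming 'Latitude' only ever appears in header lines):

# Count header-like lines embedded in the data (line 1 is the real header).
stray = 0
with open('D:/path/combine.csv', errors='replace') as f:
    for lineno, line in enumerate(f, start=1):
        if lineno > 1 and 'Latitude' in line:
            stray += 1
print('embedded header lines:', stray)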

  • We can't guess. Give us a minimal reproducible example which shows when pandas uses object unnecessarily. See How to make good reproducible pandas examples if you need help with this. Commented Dec 28, 2018 at 20:32
  • Keep in mind that "string"-type columns will always show up as "object" in pandas. The float and int types are as expected, though. Commented Dec 28, 2018 at 20:34
  • You should probably include the head of your dataset, the code you are using (just the relevant part) to load the csv into a dataframe, and then your error message. Describing it to us in words is less informative than showing code. Commented Dec 28, 2018 at 20:37
  • @jpp I can't reproduce it here since the problem doesn't appear without loading a large file. Commented Dec 28, 2018 at 21:32
  • @webelo Sorry, I didn't have any strings in my input file; I edited my question. Thanks for the info, though. Commented Dec 28, 2018 at 21:32

2 Answers


There are some issues with your code. Most importantly, the column names in your CSV file are different from the keys in the dtype dict you are assigning; these need to match exactly, or pandas will throw an error. So, using this as my input file (I tried to replicate yours as closely as possible):

Location_Lat,Location_Long,Location_H,Location_Z,QOS_RSRP
47.46058,-123.053,6,98588,-115
47.62989,-122.81,87,98312,-114
47.54865,-122.859,9,98312,-113

I noticed that you had one extra column (Location_Z) that you did not include in your dtypes, so I left it out and read only the specified columns based on your information above. You can specify everything you want in the pd.read_csv() call as follows:

import numpy as np
import pandas as pd

df = pd.read_csv('fun.csv', header=0, usecols=['Location_Lat',
                                               'Location_Long',
                                               'Location_H',
                                               'QOS_RSRP'],
                 dtype={'Location_Lat':np.float32,
                        'Location_Long':np.float32,
                        'Location_H':np.int64,
                        'QOS_RSRP':np.int64}, low_memory=False)
print(df)
print(df.dtypes)

A couple of things to note. I explicitly assigned header=0. This ensures pandas uses the first line of the CSV as your header (so I know what the columns will be called). If you want it to be a different line, just set header=(line you want).

Second, I told pandas to use only 4 of the 5 columns. Since you did not specify a dtype for the Location_Z column, I did not include it; if you want it, just add it to the usecols= param above and specify its dtype in the dtype= param.

Lastly, the dtype dict uses the column names from the dataframe to assign dtypes. Fortunately, we assigned the columns with header=0, so pandas already "knows" the column names. In pandas, floats and ints have a much lower memory cost than string objects. The reason your df is returning object dtypes for everything may be that the header line is being read in as row 1 of your data, in which case pandas assigns the whole column the object dtype. The result of the code above, printed to screen, is:

[dkennetz@hpc02  fun_temp]$ python3.5 pandas_fun.py
   Location_Lat  Location_Long  Location_H  QOS_RSRP
0     47.460579    -123.053001           6      -115
1     47.629890    -122.809998          87      -114
2     47.548649    -122.859001           9      -113
Location_Lat     float32
Location_Long    float32
Location_H         int64
QOS_RSRP           int64
dtype: object

So we can see that only 4 of the 5 columns are loaded into the dataframe, and that the dtypes are indeed float32 and int64 as assigned. The dtype: object at the bottom refers to the df.dtypes Series itself (its elements are dtype objects), not to your data columns.
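
To make the header point concrete, here is a minimal sketch against the fun.csv sample above, showing what happens when the header line is treated as data:

import pandas as pd

# header=None forces the header line to be treated as a data row, so every
# column mixes a string with numbers and falls back to the object dtype.
df_bad = pd.read_csv('fun.csv', header=None)
print(df_bad.dtypes)   # all columns: object

# header=0 consumes that line as column names, and numeric dtypes return.
df_good = pd.read_csv('fun.csv', header=0)
print(df_good.dtypes)  # float64 / int64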

I hope this clears up any questions you may have had!


2 Comments

Thanks so much for your time. I finally figured out the problem: the placement of header=0 actually matters! In my code I had dd.read_csv(file, usecols=twr_coln, dtype=data_type, header=0), and dask must have tried to process the whole file BEFORE seeing that header=0, resulting in a long processing time. I needed the call to be dd.read_csv(file, header=0, usecols=twr_coln, dtype=data_type); then the processing time was drastically reduced. Your comment helped me troubleshoot!
That is good! The order does indeed matter because of the size of your dataset. If I do the same thing with a 3-column, 51 GB CSV (passing header=0 last), pandas throws an out-of-memory error, but if I pass it first, the in-memory size of the dataset is drastically reduced because of the datatypes. And I am talking about pandas, not dask; dask inherits the pandas API, so this could be where your issue is occurring. It has to do with how pandas reads in information from large files (it looks at something like the first 100 lines) and tries to guess the datatype from those.

My problem of timeouts, and of the entire large CSV being read as object despite containing only numbers, was caused by not specifying header=0. More importantly, where header=0 is placed in your read_csv call determines whether it works.

Wrong code:

dd.read_csv(file, usecols=twr_coln, dtype=data_type, header=0)

Right code:

dd.read_csv(file, header=0, usecols=twr_coln, dtype=data_type)
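
Putting it together, a minimal end-to-end sketch (twr_coln and data_type here are stand-ins for your own column list and dtype dict, using the names from the question):

import numpy as np
import dask.dataframe as dd

file = 'D:/path/combine.csv'
twr_coln = ['Lat', 'Long', 'HorizontalAccuracy', 'RSRP']
data_type = {'Lat': np.float32, 'Long': np.float32,
             'HorizontalAccuracy': np.int64, 'RSRP': np.int64}

# header=0 tells dask the first line is the header, so it is never
# parsed as data while dtypes are being inferred or applied.
ddf = dd.read_csv(file, header=0, usecols=twr_coln, dtype=data_type)
df = ddf.compute()
print(df.dtypes)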
