
I have to compare two data sources to check whether each record matches across all columns. One data source comes from an Excel file, the other from a SQL table. I tried using DataFrame.equals() like I have in the past.

However, the comparison fails because of pesky datatype mismatches. Even though the data looks the same, the differing dtypes make excel_df.loc[excel_df['ID'] == 1].equals(sql_df.loc[sql_df['ID'] == 1]) return False. Here is an example of the dtypes from pd.read_excel():

COLUMN ID                         int64
ANOTHER Id                      float64
SOME Date                datetime64[ns]
Another Date             datetime64[ns] 

The same columns from pd.read_sql():

COLUMN ID                        float64
ANOTHER Id                       float64
SOME Date                         object
Another Date                      object

I could try using the converters argument of pd.read_excel() to match SQL, or doing df['Column_Name'] = df['Column_Name'].astype(dtype_here), but I am dealing with a lot of columns. Is there an easier way to compare values across all columns?

Checking pd.read_sql(), there is no converters-style argument, but I'm looking for something like:

df = pd.read_sql("Select * From Foo", con, dtypes={Column_name: str,
                                                   Column_name2: int})
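To illustrate the per-column casting route mentioned above, here is a minimal sketch on toy frames (assumed data, not the real Excel/SQL sources), showing why it gets tedious with many columns:

```python
import pandas as pd

# Toy frames standing in for the two sources (assumed data)
excel_df = pd.DataFrame({"ID": [1, 2],
                         "When": pd.to_datetime(["2017-01-01", "2017-02-01"])})  # int64 / datetime64[ns]
sql_df = pd.DataFrame({"ID": [1.0, 2.0],                       # float64
                       "When": ["2017-01-01", "2017-02-01"]})  # object, as read_sql returned it

# The tedious per-column route: cast each Excel column to match the SQL side
fixed = excel_df.copy()
fixed["ID"] = fixed["ID"].astype("float64")
fixed["When"] = fixed["When"].dt.strftime("%Y-%m-%d")  # datetime64 -> str to match the object column

print(fixed.equals(sql_df))  # True once every column's dtype and values line up
```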

2 Answers


How about:

excel_df = pd.read_excel(...)
sql_df = pd.read_sql(...)

# attempt to cast all columns of excel_df to the types of sql_df
excel_df.astype(sql_df.dtypes.to_dict()).equals(sql_df)
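A minimal runnable sketch of this one-shot cast, on toy frames (assumed data):

```python
import pandas as pd

# Toy frames standing in for the real sources (assumed data)
excel_df = pd.DataFrame({"ID": [1, 2], "Val": [10, 20]})        # int64 columns
sql_df = pd.DataFrame({"ID": [1.0, 2.0], "Val": [10.0, 20.0]})  # float64 columns

print(excel_df.equals(sql_df))                      # False: same values, different dtypes
aligned = excel_df.astype(sql_df.dtypes.to_dict())  # cast every column to the SQL dtype
print(aligned.equals(sql_df))                       # True
```

Note that equals() requires matching dtypes in corresponding columns, which is exactly why the cast is needed.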

8 Comments

I wish this would've worked. I get TypeError: data type not understood
This answer did lead me to a semi-usable solution. I created a loop, and it does not raise the TypeError. However, I did have to change some dtypes manually (dates and zip codes that were being read as int needed to be str). If you would like to update your answer, I can accept it for the community. Here is what I found: for column in df1.columns.tolist(): df1[column] = df1[column].astype(sql_df[column].dtype) (proper indentation must be used).
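For readability, the loop from the comment above with proper indentation, sketched on assumed toy frames:

```python
import pandas as pd

# Toy frames (assumed): df1 plays the Excel side, sql_df the SQL side
df1 = pd.DataFrame({"ID": [1, 2], "Count": [10, 20]})                # both int64
sql_df = pd.DataFrame({"ID": pd.Series([1.0, 2.0]),                  # float64
                       "Count": pd.Series([10, 20], dtype=object)})  # object

# Cast column by column to the matching sql_df dtype
for column in df1.columns.tolist():
    df1[column] = df1[column].astype(sql_df[column].dtype)

print(df1.equals(sql_df))  # True once every dtype matches
```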
@Matt, which Pandas version are you using?
Also, can you guarantee that the columns of both data frames are identical? (i.e. all columns of excel_df are columns of sql_df and vice-versa)
Version 0.20.2, and yes, the columns will always match.

If you are seeing object dtype, that means pandas can't interpret some of the rows as dates, so it casts the whole column as object (which is essentially a string column).

Look at the documentation for the dtype, converters and parse_dates arguments: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can also check the dayfirst argument to parse the dates correctly.
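If the goal is to turn those object date columns back into datetime64 after the read, pd.to_datetime (which also accepts dayfirst) can do it; a sketch on assumed data:

```python
import pandas as pd

# Assumed toy column: day-first date strings that came back as object dtype
sql_df = pd.DataFrame({"Some Date": ["01/02/2017", "15/03/2017"]})

# Parse with dayfirst=True so 01/02/2017 means 1 February, not January 2
sql_df["Some Date"] = pd.to_datetime(sql_df["Some Date"], dayfirst=True)

print(sql_df["Some Date"].dtype)  # datetime64[ns]
```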

