
I have to compare two data sources to check whether each record matches across all columns. One data source comes from an Excel file, the other from a SQL table. I tried using DataFrame.equals() like I have in the past.

However, the comparison fails because of pesky datatype mismatches. Even though the data looks the same, the differing dtypes make excel_df.loc[excel_df['ID'] == 1].equals(sql_df.loc[sql_df['ID'] == 1]) return False. Here is an example of the dtypes from pd.read_excel():

COLUMN ID                         int64
ANOTHER Id                      float64
SOME Date                datetime64[ns]
Another Date             datetime64[ns] 

The same columns from pd.read_sql():

COLUMN ID                        float64
ANOTHER Id                       float64
SOME Date                         object
Another Date                      object

I could try using the converters argument of pd.read_excel() to match SQL, or doing df['Column_Name'] = df['Column_Name'].astype(dtype_here), but I am dealing with a lot of columns. Is there an easier way to compare values across all columns?

Checking pd.read_sql(), there is no converters-style argument, but I'm looking for something like:

df = pd.read_sql("Select * From Foo", con, dtypes={Column_name: str,
                                                   Column_name2: int})
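To illustrate the per-column casting route mentioned above, here is a minimal sketch on toy frames (assumed data, not the real Excel/SQL sources), showing why it gets tedious with many columns:

```python
import pandas as pd

# Toy frames standing in for the two sources (assumed data)
excel_df = pd.DataFrame({"ID": [1, 2],
                         "When": pd.to_datetime(["2017-01-01", "2017-02-01"])})  # int64 / datetime64[ns]
sql_df = pd.DataFrame({"ID": [1.0, 2.0],                       # float64
                       "When": ["2017-01-01", "2017-02-01"]})  # object, as read_sql returned it

# The tedious per-column route: cast each Excel column to match the SQL side
fixed = excel_df.copy()
fixed["ID"] = fixed["ID"].astype("float64")
fixed["When"] = fixed["When"].dt.strftime("%Y-%m-%d")  # datetime64 -> str to match the object column

print(fixed.equals(sql_df))  # True once every column's dtype and values line up
```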

2 Answers


How about:

excel_df = pd.read_excel(...)
sql_df = pd.read_sql(...)

# attempt to cast all columns of excel_df to the types of sql_df
excel_df.astype(sql_df.dtypes.to_dict()).equals(sql_df)
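A minimal runnable sketch of this one-shot cast, on toy frames (assumed data):

```python
import pandas as pd

# Toy frames standing in for the real sources (assumed data)
excel_df = pd.DataFrame({"ID": [1, 2], "Val": [10, 20]})        # int64 columns
sql_df = pd.DataFrame({"ID": [1.0, 2.0], "Val": [10.0, 20.0]})  # float64 columns

print(excel_df.equals(sql_df))                      # False: same values, different dtypes
aligned = excel_df.astype(sql_df.dtypes.to_dict())  # cast every column to the SQL dtype
print(aligned.equals(sql_df))                       # True
```

Note that equals() requires matching dtypes in corresponding columns, which is exactly why the cast is needed.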

8 Comments

I wish this would've worked. I get TypeError: data type not understood
This answer did lead me to a semi-usable solution. I created a loop, and it does not raise the TypeError. However, I did have to change some dtypes manually (dates and zip codes that were being read as int needed to be str). If you would like to update your answer, I can accept it for the community. Here is what I found: for column in df1.columns.tolist(): df1[column] = df1[column].astype(sql_df[column].dtype) (proper indentation must be used).
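For readability, the loop from the comment above with proper indentation, sketched on assumed toy frames:

```python
import pandas as pd

# Toy frames (assumed): df1 plays the Excel side, sql_df the SQL side
df1 = pd.DataFrame({"ID": [1, 2], "Count": [10, 20]})                # both int64
sql_df = pd.DataFrame({"ID": pd.Series([1.0, 2.0]),                  # float64
                       "Count": pd.Series([10, 20], dtype=object)})  # object

# Cast column by column to the matching sql_df dtype
for column in df1.columns.tolist():
    df1[column] = df1[column].astype(sql_df[column].dtype)

print(df1.equals(sql_df))  # True once every dtype matches
```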
@Matt, which Pandas version are you using?
Also, can you guarantee that the columns of both data frames are identical? (i.e. all columns of excel_df are columns of sql_df and vice-versa)
Version 0.20.2, and yes, the columns will always match.

If you are seeing object dtype, that means pandas can't interpret some of the rows as dates, so it casts the whole column as object (which is essentially a string column).

Look at the documentation for the dtype, converters and parse_dates arguments: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can also check the dayfirst argument to parse the dates correctly.
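If the goal is to turn those object date columns back into datetime64 after the read, pd.to_datetime (which also accepts dayfirst) can do it; a sketch on assumed data:

```python
import pandas as pd

# Assumed toy column: day-first date strings that came back as object dtype
sql_df = pd.DataFrame({"Some Date": ["01/02/2017", "15/03/2017"]})

# Parse with dayfirst=True so 01/02/2017 means 1 February, not January 2
sql_df["Some Date"] = pd.to_datetime(sql_df["Some Date"], dayfirst=True)

print(sql_df["Some Date"].dtype)  # datetime64[ns]
```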

