I have some data in a PostgreSQL table.
I am pulling the data back to a notebook via code like the following:
import numpy as np
import pandas as pd
%load_ext sql
%sql postgresql://foo:foo@localhost:5432/barbar
result_from_sql = %sql SELECT Date, Year, Score, Cost FROM MyData;
result_df = result_from_sql.DataFrame()
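As a cross-check (not part of the original notebook), the same kind of pull can be reproduced with pandas.read_sql. SQLite is used below purely as a self-contained stand-in for the Postgres DSN, and the table contents are placeholders, so the dtypes it yields won't necessarily match what the Postgres driver reports:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the Postgres connection;
# the values are placeholders, not the real MyData contents.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyData (Date TEXT, Year INTEGER, Score REAL, Cost REAL)"
)
conn.execute("INSERT INTO MyData VALUES ('2020-01-01', 2020, 1.5, 9.99)")

# read_sql maps the driver's reported column types onto dtypes directly.
result_df = pd.read_sql("SELECT Date, Year, Score, Cost FROM MyData;", conn)
print(result_df.dtypes)
```

With this stand-in driver, Score and Cost land as float64 straight away; whether the Postgres driver does the same depends on how it returns NUMERIC values.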
In the PostgreSQL table all columns are typed correctly, but result_df comes back as follows:
result_df.dtypes
date object
year int64
score object
cost object
Converting the date column was fine:
result_df['date'] = pd.to_datetime(result_df['date'])
As was converting all None values to NaN:
result_df.replace([None], [np.nan], inplace=True)
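A toy version of that replacement, on a placeholder object-dtype column rather than the real data:

```python
import numpy as np
import pandas as pd

# Object column containing None, standing in for one of the pulled columns.
ser = pd.Series(["1.5", None, "3"], dtype=object)

# replace treats None in the to_replace list as a missing-value marker,
# so it is swapped for NaN.
cleaned = ser.replace([None], [np.nan])
print(cleaned)
```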
But to convert the score and cost columns to numeric I have to execute the following three lines:
s = ['score', 'cost']
result_df[s] = pd.to_numeric(result_df[s].astype(str), errors='coerce')
result_df[s] = result_df[s].apply(pd.to_numeric, errors='coerce')
If I use only lines 1 and 2, the dtypes are still object; if I use only lines 1 and 3, every value becomes NaN, as if nothing coerced at all.
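For concreteness, here is a self-contained toy version of the conversion. The sample values, and the use of Decimal (which is what some Postgres drivers return for NUMERIC columns), are assumptions about the real data; with these placeholders the column-wise apply succeeds, so whatever trips the coercion in the real table is not captured here:

```python
from decimal import Decimal

import numpy as np
import pandas as pd

# Stand-in for result_df; the values are placeholders, not the real data.
toy = pd.DataFrame({
    "score": pd.Series([Decimal("1.5"), None], dtype=object),
    "cost": pd.Series([Decimal("9.99"), Decimal("0.5")], dtype=object),
})

s = ["score", "cost"]

# On recent pandas, handing the whole sub-frame to pd.to_numeric raises,
# because to_numeric only accepts scalar or 1-d input:
try:
    toy[s] = pd.to_numeric(toy[s].astype(str), errors="coerce")
except TypeError as err:
    print("DataFrame input rejected:", err)

# Column-wise apply converts each Series individually instead:
toy[s] = toy[s].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```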
Why do I have to use this code, and is there a more elegant solution?