
I would like to preface by saying I am very new to Spark. I have a working program in Pandas that I need to run on Spark, and I am using Databricks to do this. After initializing 'sqlContext' and 'sc', I load in a CSV file and create a Spark dataframe. I then convert this dataframe into a Pandas dataframe, for which I have already written the code to do what I need.
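Roughly, that load-and-convert step looks like this on Spark 2.x (a simplified sketch; the path and options here are illustrative, not my exact code):

    spark_df = (sqlContext.read
        .format('csv')
        .option('header', 'true')
        .load('/FileStore/tables/my_file.csv'))
    df = spark_df.toPandas()  # hand off to the existing Pandas logic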

Objective: I need to load in a CSV file, identify the data type of each column, and return those data types. The tricky part is that dates come in a variety of formats, for which I have written (with help from this community) regular expressions to match them. I do this for every data type. At the end, I convert the columns to the correct types and print each column's type.
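The detection works roughly like this (a simplified sketch; the patterns and names are illustrative, not my full code):

    import re

    date_patterns = [
        re.compile(r'^\d{4}-\d{2}-\d{2}$'),      # e.g. 2017-07-11
        re.compile(r'^\d{1,2}/\d{1,2}/\d{4}$'),  # e.g. 7/11/2017
    ]

    def count_matches(values, patterns):
        # Count how many values match any of the given patterns
        return sum(1 for v in values if any(p.match(str(v)) for p in patterns))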

After successfully loading my Pandas dataframe, I am getting this error: "TypeError: to_numeric() got an unexpected keyword argument 'downcast'"

The code that triggered this:

# Changing the column data types
if len(int_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
if len(float_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
if len(boolean_count) == len(str_count):
    df[lst[col]] = df[lst[col]].astype('bool')
if len(date_count) == len(str_count):
    df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')

'lst' is the list of column headers and 'col' is the index variable I use to iterate through them (sketched below). This code worked perfectly when run in PyCharm, so I am not sure why I am getting this error on Spark.
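For context, the loop around that block looks roughly like this (simplified; the count lists are built from the regex matches described above):

    lst = list(df.columns)        # 'lst' holds the column headers
    for col in range(len(lst)):   # 'col' indexes into that list
        # int_count, float_count, boolean_count, date_count and str_count
        # are built here from the regex matches, one pass per column,
        # before the conversion block above runs
        ...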

Any help would be great!

3 Comments
  • What is df, a pandas dataframe or a Spark one? And in which exact command does the error happen? Sharing more details of your code wouldn't hurt... Commented Jul 11, 2017 at 11:46
  • @desertnaut The df is my Pandas dataframe. The error occurs at the first command that uses downcast. I didn't want to post my entire code here, but I'll gladly post any more information. Commented Jul 11, 2017 at 12:03
  • So, it sounds like a pandas-related question - cannot see what Spark has to do with it (other than converting the initial dataframe). I suggest trying to load the initial data directly to a pandas df - if you still face the issue, it has indeed nothing to do with Spark (in any case, 'pandas command on Spark' is not an accurate description of your issue). Commented Jul 11, 2017 at 12:38

1 Answer


From your comments:

I have tried to load the initial data directly into a pandas df, but it consistently throws an error saying the file doesn't exist, which is why I have had to convert it after loading it into Spark.

So, my answer has nothing to do with Spark, only with uploading data to Databricks Cloud (Community Edition), which seems to be your real issue here.

After initializing a cluster and uploading a file user_info.csv, we get the following screen, which includes the actual path for our uploaded file:

[Screenshot: the Databricks data-import UI showing the uploaded user_info.csv and its generated path]

Now, in a Databricks notebook, if you try to use this exact path with pandas, you'll get a File does not exist error:

 import pandas as pd
 pandas_df = pd.read_csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv")
 [...]
 IOError: File /FileStore/tables/1zpotrjo1499779563504/user_info.csv does not exist

because, as the instructions clearly mention, in that case (i.e. files you want loaded directly in pandas or R instead of Spark) you need to prepend the file path with /dbfs:

 pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv") # works OK
 pandas_df.head() # works OK
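For contrast, a quick sketch: Spark's own reader addresses DBFS directly, so the same file is read without the /dbfs prefix:

    spark_df = sqlContext.read.csv(
        "/FileStore/tables/1zpotrjo1499779563504/user_info.csv", header=True)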

5 Comments

Thank you for the answer! While I am able to upload it directly as a Pandas dataframe, I am still getting the same error that I originally mentioned. I don't really know which community to reach out to, as I am sort of lost as to why it wouldn't work. Here is my entire code if it helps: codesend.com/view/19783918ed48c5546829571fda051986
@rmahesh So, at least now you know that it is not due to Spark, so arguably my answer was not without its merit (and you could even upvote it). Check for version incompatibilities (Databricks CE runs Python 2.7.12 & pandas 0.18.1; see the compatibility sketch after these comments), and if the error persists, raise an issue with Databricks at forums.databricks.com
Your answer definitely has merit, I just wanted to provide all the information. I will reach out to Databricks with this error, thank you again.
@rmahesh did you ever fix this problem? I'm running into the same issue
@Kate I do not believe so no.
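On the version point above: pd.to_numeric() only gained the downcast argument in pandas 0.19, so on pandas 0.18.1 (as shipped on Databricks CE at the time) the call in the question fails exactly as reported. A minimal compatibility sketch (the wrapper name is hypothetical):

    import pandas as pd

    def to_numeric_compat(series, downcast=None):
        # Coerce to numeric; tolerate pandas < 0.19, where to_numeric()
        # has no 'downcast' keyword
        try:
            return pd.to_numeric(series, errors='coerce', downcast=downcast)
        except TypeError:  # old pandas: retry without the unsupported keyword
            return pd.to_numeric(series, errors='coerce')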
