
I would like to preface by saying I am very new to Spark. I have a working program in Pandas that I need to run on Spark, and I am using Databricks to do this. After initializing 'sqlContext' and 'sc', I load in a CSV file and create a Spark dataframe. I then convert this dataframe into a Pandas dataframe, for which I have already written the code to do what I need.
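Roughly, that load-and-convert step looks like this on Spark 2.x (a simplified sketch; the path and options here are illustrative, not my exact code):

    spark_df = (sqlContext.read
        .format('csv')
        .option('header', 'true')
        .load('/FileStore/tables/my_file.csv'))
    df = spark_df.toPandas()  # hand off to the existing Pandas logic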

Objective: I need to load in a CSV file, identify the data type of each column, and return those data types. The tricky part is that dates come in a variety of formats, for which I have written (with help from this community) regular expressions to match them. I do this for every data type. At the end, I convert the columns to the correct types and print each column's type.
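The detection works roughly like this (a simplified sketch; the patterns and names are illustrative, not my full code):

    import re

    date_patterns = [
        re.compile(r'^\d{4}-\d{2}-\d{2}$'),      # e.g. 2017-07-11
        re.compile(r'^\d{1,2}/\d{1,2}/\d{4}$'),  # e.g. 7/11/2017
    ]

    def count_matches(values, patterns):
        # Count how many values match any of the given patterns
        return sum(1 for v in values if any(p.match(str(v)) for p in patterns))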

After successfully loading my Pandas dataframe, I am getting this error: "TypeError: to_numeric() got an unexpected keyword argument 'downcast'"

The code that triggered this:

# Changing the column data types
if len(int_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
if len(float_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
if len(boolean_count) == len(str_count):
    df[lst[col]] = df[lst[col]].astype('bool')
if len(date_count) == len(str_count):
    df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')

'lst' is the list of column headers and 'col' is the index variable I use to iterate through them (sketched below). This code worked perfectly when run in PyCharm, so I am not sure why I am getting this error on Spark.
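For context, the loop around that block looks roughly like this (simplified; the count lists are built from the regex matches described above):

    lst = list(df.columns)        # 'lst' holds the column headers
    for col in range(len(lst)):   # 'col' indexes into that list
        # int_count, float_count, boolean_count, date_count and str_count
        # are built here from the regex matches, one pass per column,
        # before the conversion block above runs
        ...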

Any help would be great!

3 Comments
  • What is df, a pandas dataframe or a Spark one? And in which exact command does the error happen? Sharing more details of your code wouldn't hurt... Commented Jul 11, 2017 at 11:46
  • @desertnaut The df is my Pandas dataframe. The error occurs at the first command that uses downcast. I didn't want to post my entire code here, but I'll gladly post any more information. Commented Jul 11, 2017 at 12:03
  • So, it sounds like a pandas-related question - cannot see what Spark has to do with it (other than converting the initial dataframe). I suggest trying to load the initial data directly to a pandas df - if you still face the issue, it has indeed nothing to do with Spark (in any case, 'pandas command on Spark' is not an accurate description of your issue). Commented Jul 11, 2017 at 12:38

1 Answer


From your comments:

I have tried to load the initial data directly into a pandas df, but it consistently throws an error saying the file doesn't exist, which is why I have had to convert it after loading it into Spark.

So, my answer has nothing to do with Spark, only with uploading data to Databricks Cloud (Community Edition), which seems to be your real issue here.

After initializing a cluster and uploading a file user_info.csv, we get the following screen, which includes the actual path for our uploaded file:

[Screenshot: the Databricks data-import UI showing the uploaded user_info.csv and its generated path]

Now, in a Databricks notebook, if you try to use this exact path with pandas, you'll get a File does not exist error:

 import pandas as pd
 pandas_df = pd.read_csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv")
 [...]
 IOError: File /FileStore/tables/1zpotrjo1499779563504/user_info.csv does not exist

because, as the instructions clearly mention, in that case (i.e. files you want loaded directly in pandas or R instead of Spark) you need to prepend the file path with /dbfs:

 pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv") # works OK
 pandas_df.head() # works OK
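For contrast, a quick sketch: Spark's own reader addresses DBFS directly, so the same file is read without the /dbfs prefix:

    spark_df = sqlContext.read.csv(
        "/FileStore/tables/1zpotrjo1499779563504/user_info.csv", header=True)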

5 Comments

Thank you for the answer! While I am able to upload it directly as a Pandas dataframe, I am still getting the same error that I originally mentioned. I don't really know which community to reach out to, as I am sort of lost as to why it wouldn't work. Here is my entire code if it helps: codesend.com/view/19783918ed48c5546829571fda051986
@rmahesh So, at least now you know that it is not due to Spark, so arguably my answer was not without its merit (and you could even upvote it). Check for version incompatibilities (Databricks CE runs Python 2.7.12 & pandas 0.18.1; see the compatibility sketch after these comments), and if the error persists, raise an issue with Databricks at forums.databricks.com
Your answer definitely has merit, I just wanted to provide all the information. I will reach out to Databricks with this error, thank you again.
@rmahesh did you ever fix this problem? I'm running into the same issue
@Kate I do not believe so no.
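On the version point above: pd.to_numeric() only gained the downcast argument in pandas 0.19, so on pandas 0.18.1 (as shipped on Databricks CE at the time) the call in the question fails exactly as reported. A minimal compatibility sketch (the wrapper name is hypothetical):

    import pandas as pd

    def to_numeric_compat(series, downcast=None):
        # Coerce to numeric; tolerate pandas < 0.19, where to_numeric()
        # has no 'downcast' keyword
        try:
            return pd.to_numeric(series, errors='coerce', downcast=downcast)
        except TypeError:  # old pandas: retry without the unsupported keyword
            return pd.to_numeric(series, errors='coerce')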
