
I am trying to read a .xlsx file from a local path in PySpark.

I've written the below code:

from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Planning') \
      .enableHiveSupport() \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()

Error:

TypeError: 'DataFrameReader' object is not callable

  • Hi @OMG, read allows you to access a DataFrameReader, which enables loading parquet / csv / json / text / excel / ... files with specific methods. Commented Jan 22, 2020 at 8:08
  • @baitmbarek: shall I use .load? Please help. Commented Jan 22, 2020 at 8:12
  • You can take a look at these suggestions first: datascience.stackexchange.com/questions/22736/… Commented Jan 22, 2020 at 8:14
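To make the first comment concrete: spark.read (or sqlContext.read) returns a DataFrameReader, so you call one of its loading methods rather than calling read itself. Plain Spark has no built-in Excel method, so the minimal sketch below uses a CSV file purely for illustration, and the path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('Planning').getOrCreate()

# read is a DataFrameReader; use csv()/json()/parquet() or format().load(), not read(...)
df = spark.read.option('header', 'true').csv(r'C:\P_DATA\some_file.csv')  # placeholder path
df.show()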

2 Answers


You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname', inferSchema='true')
df = spark.createDataFrame(pdf)

df.show()

4 Comments

Thanks Amit, but getting an error like: ImportError: Install xlrd >= 1.0.0 for Excel support
The xlrd package is not installed. Just pip install xlrd and it will start working.
inferSchema is not (or no longer, probably?) a supported argument. (TypeError: read_excel() got an unexpected keyword argument 'inferSchema')
Is there a way to read an Excel file directly into Spark without using pandas as an intermediate step?
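Pulling the fixes from these comments together, a sketch that works with current pandas drops the unsupported inferSchema argument and names an Excel engine explicitly (openpyxl for .xlsx, since newer xlrd releases no longer read that format); the file and sheet names are placeholders:

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

# requires: pip install openpyxl
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname', engine='openpyxl')
df = spark.createDataFrame(pdf)
df.show()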

You could use the crealytics spark-excel package.

You need to add it to Spark, either by its Maven coordinates or when starting the Spark shell, as below.

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
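Since the question is about PySpark, the same --packages flag also works with the pyspark launcher, or the coordinate can be set programmatically through the spark.jars.packages config when building the session. A minimal sketch using the same coordinate as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Planning") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.1") \
    .getOrCreate()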

For Databricks users: add it as a library by navigating to Cluster > 'clusterName' > Libraries > Install New, and provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Sheet1'!") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(r"C:\P_DATA\tyco_93_A.xlsx")

More options are available on the GitHub page below.

https://github.com/crealytics/spark-excel

1 Comment

Version 0.14.0 was released in Aug 2021 and it's working. Versions 0.15.0, 0.15.1, 0.15.2 and 0.16.0 were also released for Spark 3, but these are not working, so stick with 0.14.0.
