
I am trying to read a .xlsx file from a local path in PySpark.

I've written the below code:

from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Planning') \
      .enableHiveSupport() \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()

Error:

TypeError: 'DataFrameReader' object is not callable

  • Hi @OMG, read allows you to access a DataFrameReader, which enables loading parquet / csv / json / text / excel / ... files with specific methods. Commented Jan 22, 2020 at 8:08
  • @baitmbarek: shall I use .load? Please help. Commented Jan 22, 2020 at 8:12
  • You can take a look at these suggestions first: datascience.stackexchange.com/questions/22736/… Commented Jan 22, 2020 at 8:14
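To make the first comment concrete: spark.read (or sqlContext.read) returns a DataFrameReader, so you call one of its loading methods rather than calling read itself. Plain Spark has no built-in Excel method, so the minimal sketch below uses a CSV file purely for illustration, and the path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('Planning').getOrCreate()

# read is a DataFrameReader; use csv()/json()/parquet() or format().load(), not read(...)
df = spark.read.option('header', 'true').csv(r'C:\P_DATA\some_file.csv')  # placeholder path
df.show()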

2 Answers


You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname', inferSchema='true')
df = spark.createDataFrame(pdf)

df.show()

4 Comments

Thanks Amit, but getting an error like: ImportError: Install xlrd >= 1.0.0 for Excel support
The xlrd package is not installed. Just pip install xlrd and it will start working.
inferSchema is not (or no longer, probably?) a supported argument. (TypeError: read_excel() got an unexpected keyword argument 'inferSchema')
Is there a way to read an Excel file directly into Spark without using pandas as an intermediate step?
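Pulling the fixes from these comments together, a sketch that works with current pandas drops the unsupported inferSchema argument and names an Excel engine explicitly (openpyxl for .xlsx, since newer xlrd releases no longer read that format); the file and sheet names are placeholders:

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

# requires: pip install openpyxl
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname', engine='openpyxl')
df = spark.createDataFrame(pdf)
df.show()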

You could use the crealytics spark-excel package.

You need to add it to Spark, either by its Maven coordinates or when starting the Spark shell, as below.

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
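Since the question is about PySpark, the same --packages flag also works with the pyspark launcher, or the coordinate can be set programmatically through the spark.jars.packages config when building the session. A minimal sketch using the same coordinate as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Planning") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.1") \
    .getOrCreate()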

For Databricks users: add it as a library by navigating to Cluster > 'clusterName' > Libraries > Install New, and provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Sheet1'!") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(r"C:\P_DATA\tyco_93_A.xlsx")

More options are available on the GitHub page below.

https://github.com/crealytics/spark-excel

1 Comment

Version 0.14.0 was released in Aug 2021 and it's working. Versions 0.15.0, 0.15.1, 0.15.2 and 0.16.0 were also released for Spark 3, but these are not working, so stick with 0.14.0.
