I have 100 Excel (*.xlsx) files stored in HDFS. The 100 *.xlsx files are organized into 10 directories, as shown below:

/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx
/user/cloudera/raw_data/dataPoint2/dataPoint.xlsx
...
/user/cloudera/raw_data/dataPoint10/dataPoint.xlsx

Reading in one of the *.xlsx files from above with

rawData = sc.textFile("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx")

returned gibberish, because *.xlsx is a zipped binary format and sc.textFile decodes the raw bytes line by line as plain text.

One obvious suggestion I received was to use ssconvert, the Gnumeric spreadsheet application's command-line utility:

$ ssconvert dataPoint.xlsx dataPoint.csv

and then load the resulting *.csv files into HDFS so they can be read directly. But that is not what I am trying to solve, nor is it the requirement.

Solutions in Python (preferably) or Java would be appreciated. I am a rookie, so a detailed walkthrough would be really helpful.

Thanks in advance.

3 Comments
  • I would load each file with xlrd (pypi.python.org/pypi/xlrd), process it, and then union all the data; a sketch of this approach follows these comments. Commented Mar 2, 2016 at 13:46
  • @TomRon when you say process it, do you mean extract the sheet data to a Python list and then load the list to an RDD? Commented Mar 2, 2016 at 14:34
  • Try using pandas as described in stackoverflow.com/questions/9884353/xls-to-csv-convertor to convert to CSV, then load that into a Spark RDD. Commented Mar 4, 2016 at 19:33
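
A minimal sketch of the xlrd/pandas approach from the comments, assuming the workbooks have first been copied from HDFS to the driver's local disk (the /tmp/raw_data mirror below is hypothetical, e.g. created with hdfs dfs -get /user/cloudera/raw_data /tmp/raw_data) and that pandas has an Excel engine such as xlrd or openpyxl installed. Each workbook is parsed with pandas.read_excel, and the per-file row lists are unioned into one RDD:

import glob
import pandas as pd
from pyspark import SparkContext

sc = SparkContext(appName="xlsx-to-rdd")

# Hypothetical local mirror of /user/cloudera/raw_data on the driver
rdds = []
for path in glob.glob("/tmp/raw_data/dataPoint*/dataPoint.xlsx"):
    df = pd.read_excel(path)          # parse the workbook's first sheet
    rdds.append(sc.parallelize(df.values.tolist()))  # one list per row

rawData = sc.union(rdds)              # a single RDD over all the files
print(rawData.take(5))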

4 Answers

Use the following code to read Excel files into Spark directly from HDFS using the Hadoop FileSystem API. You still have to use the Apache POI API to parse the data; a minimal parsing sketch is included in the read method below.

import org.apache.spark.{SparkConf, SparkContext}
import java.io.InputStream
import java.net.URI
import org.apache.poi.ss.usermodel.WorkbookFactory
import scala.collection.JavaConversions._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object Excel {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Excel-read-write").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Open the workbook as a stream straight from HDFS
    val fs = FileSystem.get(URI.create("hdfs://localhost:9000/user/files/timetable.xlsx"), new Configuration())
    val path = new Path("hdfs://localhost:9000/user/files/timetable.xlsx")
    val inputStream = fs.open(path)
    read(inputStream)
  }

  // A minimal sketch of the POI parsing: dump every cell of the first
  // sheet, one comma-separated line per row. WorkbookFactory detects
  // .xls vs .xlsx automatically.
  def read(in: InputStream): Unit = {
    val workbook = WorkbookFactory.create(in)
    val sheet = workbook.getSheetAt(0)
    for (row <- sheet) { // JavaConversions makes the sheet iterable
      println(row.map(_.toString).mkString(","))
    }
    workbook.close()
  }
}

The read(in: InputStream) method is where the Apache POI API parses the data; the body above is only a minimal cell-dumping sketch, so adapt it to your sheet layout.


You can use the Spark Excel Library to load xlsx files into DataFrames directly. See this answer for a detailed example.

As of version 0.8.4, the library does not support streaming and loads all the source rows into memory for conversion.
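
A hedged sketch of that route from PySpark, assuming Spark 2.x with the spark-excel package on the classpath (e.g. launched with --packages com.crealytics:spark-excel_2.11:0.8.4); the sheet name is an assumption, and option names have shifted between releases, so check the library's README:

df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1")   # assumed sheet name
      .option("useHeader", "true")     # first row holds the column names
      .load("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx"))
df.show()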


If you are willing to build yourself a custom XLSX-to-CSV converter, the Apache POI Event API would be ideal for this. The API suits spreadsheets with a large memory footprint because it streams the file instead of loading it whole. You can read about what it does here; here is an example of XLSX processing with the XSSF event code.

2 Comments
  • Can you please elaborate with an example or throw some more light?
  • If you've gone through the second link I provided, you will see a class SheetHandler that implements two methods, startElement and endElement. These methods receive notifications of sheet elements such as a cell value or the end of a row. You will notice the cell values being printed to standard output in endElement. Similarly, you can supply an output path and write those values to a CSV file, or customize these methods to act on any attribute or value they encounter. A comparable streaming approach in Python is sketched below.
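
Not the POI Event API itself, but a comparable low-memory technique in Python: openpyxl's read-only mode streams rows instead of materializing the whole workbook, which suits the same large-spreadsheet case. A minimal XLSX-to-CSV converter along those lines:

import csv
from openpyxl import load_workbook

def xlsx_to_csv(xlsx_path, csv_path):
    wb = load_workbook(xlsx_path, read_only=True)  # streaming parser
    ws = wb.active                                 # first worksheet
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in ws.rows:                        # rows arrive lazily
            writer.writerow([cell.value for cell in row])
    wb.close()

xlsx_to_csv("dataPoint.xlsx", "dataPoint.csv")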

You can try the HadoopOffice library: https://github.com/ZuInnoTe/hadoopoffice/wiki

It works with Spark, and if you can use the Spark2 data source API you can also use Python. If you cannot use the Spark2 data source API, you can read and write the files with the standard Spark APIs using the Hadoop file formats provided by the HadoopOffice library.
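
A hedged sketch of the Spark2 data-source route from PySpark; the format string and option below follow the project wiki and may differ between releases, and the HadoopOffice Spark 2 package must be on the classpath:

df = (spark.read
      .format("org.zuinnote.spark.office.excel")
      .option("read.locale.bcp47", "us")  # locale used to render cell values
      .load("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx"))
df.show()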
