I have a multiline flat file that I want to convert into a 4-column DataFrame (or an RDD of 4-element arrays) via PySpark. The equivalent Spark Scala code is:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.rdd.RDDFunctions._ // provides sliding()

val spark = SparkSession.builder.appName("findApp").getOrCreate()
import spark.implicits._ // needed for toDF
val path = "/mypath/file"
val df = spark.sparkContext.textFile(path)
  .sliding(4, 4)                                  // non-overlapping groups of 4 lines
  .map { case Array(x, y, z, a) => (x, y, z, a) } // Array -> tuple so toDF works
  .toDF("x", "y", "z", "a")
```
There is no sliding() function in PySpark. What is the equivalent? (One possible workaround is sketched after the example below.) The input is:
A
B
C
D
A2
B2
C2
D2
The desired output is:
| x | y | z | a |
|---|---|---|---|
| A | B | C | D |
| A2 | B2 | C2 | D2 |
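
One possible equivalent, as a minimal sketch: number each line with zipWithIndex() and group every four consecutive lines into one record. This assumes the file's line order is the record order and the line count is a multiple of 4 (a short final group would break the 4-column toDF); the path is the placeholder from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("findApp").getOrCreate()
path = "/mypath/file"  # placeholder path from the question

df = (
    spark.sparkContext.textFile(path)
    .zipWithIndex()                                      # (line, global line number)
    .map(lambda li: (li[1] // 4, (li[1] % 4, li[0])))    # key every 4 lines as one record
    .groupByKey()
    .map(lambda kv: tuple(v for _, v in sorted(kv[1])))  # restore in-record order
    .toDF(["x", "y", "z", "a"])
)
df.show()
```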
I should add that the data sets are around 50 million records each, and there are a couple of hundred data sets, so it's over 2 terabytes of data in total because one column holds >300 features. I like the pandas code by @GoodMan.
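
At that scale the groupByKey shuffle above may be worth avoiding; a DataFrame-native variant of the same idea (again just a sketch under the same assumptions, reusing the `spark` and `path` from the snippet above) groups with a pivot instead, which lets Spark do partial aggregation before the shuffle:

```python
from pyspark.sql import functions as F

df = (
    spark.sparkContext.textFile(path)
    .zipWithIndex()                                  # (line, global line number)
    .map(lambda li: (li[1] // 4, li[1] % 4, li[0]))  # (record id, position, value)
    .toDF(["rec", "pos", "value"])
    .groupBy("rec")
    .pivot("pos", [0, 1, 2, 3])  # explicit pivot values avoid an extra pass over the data
    .agg(F.first("value"))
    .orderBy("rec")
    .selectExpr("`0` AS x", "`1` AS y", "`2` AS z", "`3` AS a")
)
```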