
I have the following data as a .txt file in tab-separated format stored in my blob storage. I'm using pyspark.sql to load the data into Databricks as a pyspark.sql DataFrame.

This is the shape of the data.

df = spark.createDataFrame(
    [
        (302, 'foo'),  # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'Input']  # column labels
)

display(df)

First, I created a schema for the data before loading it:

from pyspark.sql.types import StructType, StructField, StringType

data_schema = [
    StructField('LU', StringType(), True),
    StructField('Input', StringType(), True),
]

mySchema = StructType(fields=data_schema)

I then use the following code to read in the data:

df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True)
df.show() 

However, when I look at the data, the first column looks fine but the second column's values show as null.

+----------+-----+
|        LU|Input|
+----------+-----+
|302       | null|
|203       | null|
|202       | null|
|202       | null|
|172       | null|
|172       | null|
+----------+-----+

Does anyone know why the 'Input' variable shows as null? This is just dummy data; when using real data that has 30+ variables, only the first variable's values ever load and everything else is null.

Thanks

  • Could you add sample records from the file? Commented Apr 11, 2020 at 14:11
  • @Shu I'm not sure what you mean? Commented Apr 11, 2020 at 15:23
  • Add some sample data from the /filepath/filename.txt file. Commented Apr 11, 2020 at 15:27
  • @Shu I didn't think you could add a data file to Stack Overflow? How do you do that? Commented Apr 11, 2020 at 15:37
  • @Mrmoleje: Try: df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t'). You can just add some lines as plain text to your question. Commented Apr 13, 2020 at 17:37

3 Answers


To avoid this issue in the future, consider letting Spark infer the schema on a first read and saving it as JSON; on later reads you can load that schema back. This avoids mistakes when creating the schema manually.

df.schema.json()
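
A minimal sketch of that round trip, assuming the tab-separated file from the question (the path is illustrative):

import json
from pyspark.sql.types import StructType

# First read: let Spark infer the schema, then serialise it to JSON.
df = spark.read.csv("/filepath/filename.txt", inferSchema=True, header=True, sep='\t')
schema_json = df.schema.json()

# Later reads: rebuild the schema from the saved JSON and skip inference.
saved_schema = StructType.fromJson(json.loads(schema_json))
df2 = spark.read.csv("/filepath/filename.txt", schema=saved_schema, header=True, sep='\t')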



I worked out what the issue was with my data. In my schema I had:

StructField('Date', DateType(), True)

on a date value like 13/03/2000. Spark's CSV reader can't parse a date in that format into DateType unless you tell it the format to expect; by default it only accepts dates like 2000-03-13 (yyyy-MM-dd).

This means that when the schema was applied, every variable in a row with a failed parse came back null, even though only one variable had actually failed.
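
A minimal sketch of one fix, assuming dd/MM/yyyy dates and a tab-separated file (the path and second column are illustrative): the dateFormat option tells the CSV reader how to parse the date strings.

from pyspark.sql.types import StructType, StructField, DateType, StringType

mySchema = StructType([
    StructField('Date', DateType(), True),
    StructField('Input', StringType(), True),
])

# dateFormat makes the reader expect dd/MM/yyyy strings such as 13/03/2000.
df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True,
                    sep='\t', dateFormat='dd/MM/yyyy')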

Hope that helps some.



Since you already have a header in the file, why don't you let Spark infer the schema? I tried with your sample data and it gave the correct result.

>>> df = spark.read.csv("file:////Users/sam/Downloads/file.txt",  inferSchema=True, header=True, sep='\t')
>>> df.show()
+---+-----+
| LU|Input|
+---+-----+
|302|  foo|
|203|  bar|
|202|  foo|
|202|  bar|
|172|  xxx|
|172|  yyy|
+---+-----+

>>> df.printSchema()
root
 |-- LU: integer (nullable = true)
 |-- Input: string (nullable = true)

Databricks sometimes does not show the correct result, so let Spark infer the schema first, then work out what the issue is with your own schema and take corrective action.

I suspect you need LongType (or IntegerType) for the first field, but compare against the inferred schema. Since I am not sure about the actual file, I am just pointing you in that direction.
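
As a sketch of that corrective action, assuming the inferred schema above (the types and separator are my assumptions, not confirmed from the actual file):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Explicit schema matching what inference reported, with LU as a long.
mySchema = StructType([
    StructField('LU', LongType(), True),
    StructField('Input', StringType(), True),
])

df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t')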

