
I have the following data as a .txt file in tab-separated format stored in my blob storage. I'm using pyspark.sql to load the data into Databricks as a pyspark.sql DataFrame.

This is the shape of the data.

df = spark.createDataFrame(
    [
        (302, 'foo'),  # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'Input']  # column labels
)

display(df)

First, I created a schema for the data before loading it:

from pyspark.sql.types import StructType, StructField, StringType

data_schema = [
    StructField('LU', StringType(), True),
    StructField('Input', StringType(), True),
]

mySchema = StructType(fields=data_schema)

I then use the following code to read in the data:

df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True)
df.show() 

However, when I look at the data, the first column looks fine but the second column's values show as null.

+----------+-----+
|        LU|Input|
+----------+-----+
|302       | null|
|203       | null|
|202       | null|
|202       | null|
|172       | null|
|172       | null|
+----------+-----+

Does anyone know why the 'Input' variable shows as null? This is just dummy data; when using real data that has 30+ variables, only the first variable's values ever load and everything else is null.

Thanks

  • Could you add sample records from the file? Commented Apr 11, 2020 at 14:11
  • @Shu I'm not sure what you mean? Commented Apr 11, 2020 at 15:23
  • Add some sample data from the /filepath/filename.txt file. Commented Apr 11, 2020 at 15:27
  • @Shu I didn't think you could add a data file to Stack Overflow? How do you do that? Commented Apr 11, 2020 at 15:37
  • @Mrmoleje: Try: df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t'). You can just add some lines as plain text to your question. Commented Apr 13, 2020 at 17:37

3 Answers


To avoid this issue in the future, consider letting Spark infer the schema on a first read and saving it as JSON; on later reads you can load that schema back. This avoids mistakes when creating the schema manually.

df.schema.json()
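
A minimal sketch of that round trip, assuming the tab-separated file from the question (the path is illustrative):

import json
from pyspark.sql.types import StructType

# First read: let Spark infer the schema, then serialise it to JSON.
df = spark.read.csv("/filepath/filename.txt", inferSchema=True, header=True, sep='\t')
schema_json = df.schema.json()

# Later reads: rebuild the schema from the saved JSON and skip inference.
saved_schema = StructType.fromJson(json.loads(schema_json))
df2 = spark.read.csv("/filepath/filename.txt", schema=saved_schema, header=True, sep='\t')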



I worked out what the issue was with my data. In my schema I had:

StructField('Date', DateType(), True)

on a date value like 13/03/2000. Spark's CSV reader can't parse a date in that format into DateType unless you tell it the format to expect; by default it only accepts dates like 2000-03-13 (yyyy-MM-dd).

This means that when the schema was applied, every variable in a row with a failed parse came back null, even though only one variable had actually failed.
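
A minimal sketch of one fix, assuming dd/MM/yyyy dates and a tab-separated file (the path and second column are illustrative): the dateFormat option tells the CSV reader how to parse the date strings.

from pyspark.sql.types import StructType, StructField, DateType, StringType

mySchema = StructType([
    StructField('Date', DateType(), True),
    StructField('Input', StringType(), True),
])

# dateFormat makes the reader expect dd/MM/yyyy strings such as 13/03/2000.
df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True,
                    sep='\t', dateFormat='dd/MM/yyyy')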

Hope that helps some.



Since you already have a header in the file, why don't you let Spark infer the schema? I tried with your sample data and it gave the correct result.

>>> df = spark.read.csv("file:////Users/sam/Downloads/file.txt",  inferSchema=True, header=True, sep='\t')
>>> df.show()
+---+-----+
| LU|Input|
+---+-----+
|302|  foo|
|203|  bar|
|202|  foo|
|202|  bar|
|172|  xxx|
|172|  yyy|
+---+-----+

>>> df.printSchema()
root
 |-- LU: integer (nullable = true)
 |-- Input: string (nullable = true)

Databricks sometimes does not show the correct result, so let Spark infer the schema first, then work out what the issue is with your own schema and take corrective action.

I suspect you need LongType (or IntegerType) for the first field, but compare against the inferred schema. Since I am not sure about the actual file, I am just pointing you in that direction.
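
As a sketch of that corrective action, assuming the inferred schema above (the types and separator are my assumptions, not confirmed from the actual file):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Explicit schema matching what inference reported, with LU as a long.
mySchema = StructType([
    StructField('LU', LongType(), True),
    StructField('Input', StringType(), True),
])

df = spark.read.csv("/filepath/filename.txt", schema=mySchema, header=True, sep='\t')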

