PySpark loading CSV AttributeError: 'RDD' object has no attribute '_get_object_id'

Question

I'm trying to load a CSV file into a Spark DataFrame. This is what I have done so far:

from pyspark import SparkConf, SparkContext, sql

# sc is a SparkContext.
appName = "testSpark"
master = "local"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)

# csv path
text_file = sc.textFile("hdfs:///path/to/sensordata20171008223515.csv")
df = sqlContext.load(source="com.databricks.spark.csv", header = 'true', path = text_file)

print df.schema()

Here's the trace:

Traceback (most recent call last):
  File "/home/centos/main.py", line 16, in <module>
    df = sc.textFile(text_file).map(lambda line: (line.split(';')[0], line.split(';')[1])).collect()
  File "/usr/hdp/2.5.6.0-40/spark/python/lib/pyspark.zip/pyspark/context.py", line 474, in textFile
  File "/usr/hdp/2.5.6.0-40/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 804, in __call__
  File "/usr/hdp/2.5.6.0-40/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 278, in get_command_part
AttributeError: 'RDD' object has no attribute '_get_object_id'

I'm new to Spark, so if anyone could tell me what I've done wrong, that would be very helpful.

Alper t. Turker · Accepted Answer · 2017-08-11 11:17:06Z

You cannot pass an RDD to the CSV reader. `sc.textFile` returns an RDD, but the `path` argument must be a string, so pass the path directly:

df = sqlContext.load(source="com.databricks.spark.csv",
    header='true', path="hdfs:///path/to/sensordata20171008223515.csv")
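
On Spark 1.4 and later, the same load can also be written with the `sqlContext.read` API. Here is a minimal sketch, assuming the spark-csv package is available to your job. `load_csv` is a hypothetical helper name, and the type check is only there to fail early with a clearer message than the Py4J error:

```python
def load_csv(sqlContext, path, header="true"):
    """Load a CSV file into a DataFrame via the spark-csv data source.

    `path` must be a plain string. Passing an RDD (for example the
    result of sc.textFile) is what leads to the '_get_object_id' error.
    """
    # Note: on Python 2 a unicode path would fail this check;
    # basestring may be the better type to test against there.
    if not isinstance(path, str):
        raise TypeError("path must be a string, got %s" % type(path).__name__)
    return (sqlContext.read
            .format("com.databricks.spark.csv")
            .options(header=header)
            .load(path))
```

Usage would then look like `df = load_csv(sqlContext, "hdfs:///path/to/sensordata20171008223515.csv")`.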

Only a limited number of formats (notably JSON) support an RDD as an input argument.
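
If the lines are already in an RDD, one workaround is to split them yourself and build the DataFrame from the resulting tuples. This is a rough sketch, assuming a `;` delimiter as in the `split(';')` call in your traceback. `split_fields` is a made-up helper name, and the naive split does not handle quoted fields:

```python
def split_fields(line, sep=";"):
    """Split one CSV line into a tuple of stripped fields (naive, no quoting support)."""
    return tuple(field.strip() for field in line.split(sep))

# Possible use with Spark (a sketch, assuming the first line is a header):
#   lines = sc.textFile("hdfs:///path/to/sensordata20171008223515.csv")
#   header = lines.first()
#   rows = lines.filter(lambda l: l != header).map(split_fields)
#   df = sqlContext.createDataFrame(rows, list(split_fields(header)))
```

All columns come out as strings this way, so any numeric columns would still need casting afterwards.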

