
Folks,

I am running PySpark code that reads a 500 MB file from HDFS and constructs a NumPy matrix from the contents of the file.

Cluster Info:

9 datanodes, each with 128 GB memory / 48 vCores

Job config

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName('test') \
        .set('spark.executor.cores', 4) \
        .set('spark.executor.memory', '72g') \
        .set('spark.driver.memory', '16g') \
        .set('spark.yarn.executor.memoryOverhead', 4096) \
        .set('spark.dynamicAllocation.enabled', 'true') \
        .set('spark.shuffle.service.enabled', 'true') \
        .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .set('spark.driver.maxResultSize', 10000) \
        .set('spark.kryoserializer.buffer.max', 2044)

    sc = SparkContext(conf=conf)  # in the notebook an sc may already exist

    fileRDD = sc.textFile("/tmp/test_file.txt")
    fileRDD.cache()  # cache() needs parentheses to actually mark the RDD for caching
    list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()

Error

The collect() call is throwing an out-of-memory error.

18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from Host/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space

Any help is much appreciated.

  • This is because you're calling the .collect() method. Commented May 17, 2018 at 23:19
  • .collect() returns the result to the driver node, and I think the driver is running out of memory. Try changing .set('spark.driver.maxResultSize',10000) to a higher value (see the sketch below). Commented May 17, 2018 at 23:52
  • I tried setting .set('spark.driver.maxResultSize',2147483648), but I still get the same error. Commented May 18, 2018 at 13:33
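
To make the comments' suggestion concrete: spark.driver.maxResultSize is a size setting, so giving it an explicit unit such as '4g' avoids ambiguity, and collect() also needs spark.driver.memory to be large enough to hold the parsed lines. A minimal sketch of the driver-side settings (values are illustrative only):

    # Sketch (illustrative values): collect() ships every partition back to the
    # driver, so both of these driver-side settings have to be large enough.
    from pyspark import SparkConf

    conf = SparkConf().setAppName('test') \
        .set('spark.driver.memory', '32g') \
        .set('spark.driver.maxResultSize', '4g')   # explicit unit instead of a bare number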

1 Answer


A little background on this issue

I was having this issue while running the code through a Jupyter Notebook that runs on an edge node of a Hadoop cluster.

Finding in Jupyter

Since you can only submit code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver is always the edge node, which is already packed with other long-running daemon processes. The memory available there is always less than the memory required for fileRDD.collect() on my file.
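
A quick way to confirm this from inside the notebook (a minimal sketch, assuming sc is the notebook's active SparkContext):

    # Sketch: check where the driver is running and what it was given
    conf = sc.getConf()
    print(sc.master)                                      # e.g. 'yarn'
    print(conf.get('spark.submit.deployMode', 'client'))  # Jupyter/edge-node sessions run in client mode
    print(conf.get('spark.driver.memory', '1g'))          # the heap the edge-node driver actually has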

Worked fine in spark-submit

I put the notebook content into a .py file and invoked it through spark-submit with the same settings. It ran in seconds there, because with --deploy-mode cluster, YARN launches the driver on one of the cluster nodes that has the required memory free.

    spark-submit --name "test_app" --master yarn --deploy-mode cluster \
      --conf spark.executor.cores=4 \
      --conf spark.executor.memory=72g \
      --conf spark.driver.memory=72g \
      --conf spark.yarn.executor.memoryOverhead=8192 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.kryoserializer.buffer.max=2044 \
      --conf spark.driver.maxResultSize=1g \
      --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
      --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
      test.py
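
A minimal sketch of what that test.py looks like (simplified; the real file was just the notebook code copied as-is):

    # test.py -- sketch of the notebook code moved into a script for spark-submit
    from pyspark import SparkContext

    sc = SparkContext(appName='test_app')   # all other settings come from the --conf flags

    fileRDD = sc.textFile("/tmp/test_file.txt")
    fileRDD.cache()

    # In cluster mode the driver lands on a node with 72g free, so collect() has room
    list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
    print(len(list_of_lines_from_file))

    sc.stop()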

Next step:

Our next step is to see if the Jupyter notebook can submit the Spark job to the YARN cluster via Livy JobServer or a similar approach.
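
A rough sketch of what that could look like with Livy's batch REST API (the host, port, and HDFS path below are placeholders; Livy listens on port 8998 by default):

    # Sketch: submit the same script as a YARN batch job through Livy's REST API
    import json
    import requests

    livy_url = "http://livy-host:8998/batches"   # placeholder Livy endpoint
    payload = {
        "file": "hdfs:///user/me/test.py",       # the script must be readable from the cluster
        "name": "test_app",
        "driverMemory": "72g",
        "executorMemory": "72g",
        "executorCores": 4,
        "conf": {"spark.dynamicAllocation.enabled": "true"},
    }
    resp = requests.post(livy_url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.status_code, resp.json())         # returns the batch id and its state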
