
I’m trying to run a Spark job on YARN in client mode. I have two nodes, each with the following configuration: [node configuration screenshot; per the answer below, roughly 11G of memory and 16 cores per node]

I’m getting “ExecutorLostFailure (executor 1 lost)”.

I have tried most of the Spark tuning configurations. Initially I was getting around six executor failures; after tuning I’m down to losing a single executor.

This is my configuration (my spark-submit command):

HADOOP_USER_NAME=hdfs spark-submit \
  --class genkvs.CreateFieldMappings \
  --master yarn-client \
  --driver-memory 11g \
  --executor-memory 11G \
  --total-executor-cores 16 \
  --num-executors 15 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.akka.frameSize=1000 \
  --conf spark.shuffle.memoryFraction=1 \
  --conf spark.rdd.compress=true \
  --conf spark.core.connection.ack.wait.timeout=800 \
  my-data/lookup_cache_spark-assembly-1.0-SNAPSHOT.jar \
  -h hdfs://hdp-node-1.zone24x7.lk:8020 -p 800

My data size is 6GB and I’m doing a groupBy in my job.

import org.apache.spark.rdd.RDD

// Groups the records by their fourth field (a String key); this is a wide
// transformation, so it shuffles the full dataset across executors.
def process(in: RDD[(String, String, Int, String)]) = {
    in.groupBy(_._4)
}
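From what I’ve read, groupBy has to materialize every value for a key in memory on a single executor, so I assume it adds to the heap pressure. If my job only needed a per-key aggregate, I gather something like the following sketch would be the lighter option (hypothetical: it computes counts, which is not what my job does; it’s just to illustrate the difference):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on Spark < 1.3

// Sketch only: reduceByKey combines values map-side before the shuffle,
// so no executor ever holds all values for one key in memory at once.
def processCounts(in: RDD[(String, String, Int, String)]): RDD[(String, Int)] =
  in.map(t => (t._4, 1)).reduceByKey(_ + _)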

I’m new to Spark; please help me find my mistake. I’ve been struggling with this for at least a week now.

Thank you very much in advance.

1 Answer


Two issues pop out:

  • spark.shuffle.memoryFraction is set to 1. Why did you choose that instead of leaving it at the default of 0.2? That may starve the other, non-shuffle operations.

  • You only have 11G available for 16 cores. With only 11G I would set the number of executors in your job to no more than 3, and initially (to get past the executors-lost issue) just try 1; a config sketch follows this list. With 16 executors each one gets roughly 700 MB, so it is no surprise they are hitting OutOfMemoryErrors and losing executors.
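To make that concrete, here is a minimal sketch of a configuration along those lines, expressed as a SparkConf (the same settings can be passed to spark-submit as --conf flags). The memory and core numbers are assumptions; adjust them to what YARN actually grants your containers:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the advice above. The memory figure is an assumption:
// leave some of the node's 11G to YARN/JVM overhead instead of claiming it all.
val conf = new SparkConf()
  .setAppName("genkvs.CreateFieldMappings")
  .setMaster("yarn-client")
  .set("spark.executor.instances", "1")  // a single executor first, to get past the failures
  .set("spark.executor.memory", "8g")    // comfortably under the 11G per node
  .set("spark.executor.cores", "4")
// spark.shuffle.memoryFraction is deliberately left at its 0.2 default
val sc = new SparkContext(conf)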


Comments

Initially I got the following error: "Missing an output location for shuffle 0". That is how I concluded I needed to increase my shuffle memory fraction. Thanks for your suggestions; I will try them and get back to you.
It still fails - Spark: Executor Lost Failure
When I go through the stack trace it shows a java.lang.OutOfMemoryError: Java heap space exception. I think that is why it's losing the executors. Any suggestions?
I mentioned in the answer to try a single executor. You don't have enough memory available in the cluster in general.
Yes, I tried it. Even with a single executor I get the same issue.
