
I run my Python k-means program on Spark with the following command:

./bin/spark-submit --master spark://master_ip:7077 my_kmeans.py

The main Python k-means program looks like this:

import joblib as jl  # assuming jl is joblib, given the .jl.z file
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# load the data on the driver, then distribute it as an RDD
X = jl.load('X.jl.z')
data_x = sc.parallelize(X)
# train k-means with 10000 clusters
model = KMeans.train(data_x, 10000, maxIterations=5)

The file 'X.jl.z' is about 100 MB.

But I get this Spark error:

  File "/home/xxx/tmp/spark-2.0.2-bin-hadoop2.7/my_kmeans.py", line 24, in <module>
    data_x = sc.parallelize(X)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
  : java.lang.OutOfMemoryError: Java heap space

I know how to modify the JVM heap size for a Java program, but how can I increase the heap size for my Python program?

1 Answer


Try increasing the number of partitions:

# use n = 2-4 partitions for each CPU in your cluster
data_x = sc.parallelize(X, n)
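As a minimal sketch for picking n (assuming the cluster's default parallelism reflects its total core count; tune the multiplier to your workload):

# defaultParallelism is typically the total number of cores in the cluster
n = sc.defaultParallelism * 3  # i.e. 2-4 partitions per CPU
data_x = sc.parallelize(X, n)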

or:

Maximum heap size settings can be set with spark.driver.memory in cluster mode, or through the --driver-memory command-line option in client mode.
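For example, to give the driver a 4 GB heap (the 4g value is an assumption; size it to your data):

# client mode: pass the flag on the command line
./bin/spark-submit --master spark://master_ip:7077 --driver-memory 4g my_kmeans.py

# cluster mode: set the property instead, e.g. in conf/spark-defaults.conf
spark.driver.memory  4g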


1 Comment

What if it is running on a local computer?
