
I run my Python k-means program on Spark with the following command:

./bin/spark-submit --master spark://master_ip:7077 my_kmeans.py

The main Python k-means program looks like this:

import joblib as jl  # assuming jl is joblib, given the .jl.z file
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# load the data on the driver, then distribute it as an RDD
X = jl.load('X.jl.z')
data_x = sc.parallelize(X)
# train k-means with 10000 clusters
model = KMeans.train(data_x, 10000, maxIterations=5)

The file 'X.jl.z' is about 100 MB.

But I get this Spark error:

  File "/home/xxx/tmp/spark-2.0.2-bin-hadoop2.7/my_kmeans.py", line 24, in <module>
    data_x = sc.parallelize(X)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
  : java.lang.OutOfMemoryError: Java heap space

I know how to modify the JVM heap size for a Java program, but how can I increase the heap size for my Python program?

1 Answer


Try increasing the number of partitions:

# use n = 2-4 partitions for each CPU in your cluster
data_x = sc.parallelize(X, n)
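As a minimal sketch for picking n (assuming the cluster's default parallelism reflects its total core count; tune the multiplier to your workload):

# defaultParallelism is typically the total number of cores in the cluster
n = sc.defaultParallelism * 3  # i.e. 2-4 partitions per CPU
data_x = sc.parallelize(X, n)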

or:

Maximum heap size settings can be set with spark.driver.memory in cluster mode, or through the --driver-memory command-line option in client mode.
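For example, to give the driver a 4 GB heap (the 4g value is an assumption; size it to your data):

# client mode: pass the flag on the command line
./bin/spark-submit --master spark://master_ip:7077 --driver-memory 4g my_kmeans.py

# cluster mode: set the property instead, e.g. in conf/spark-defaults.conf
spark.driver.memory  4g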


1 Comment

What if it is running on a local computer?
