All Questions

1 vote · 1 answer · 2k views

PySpark - df.cache().count() taking forever to run

I'm trying to force eager evaluation for PySpark, using the count methodology I read online:

    spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
    spark_df....
asked by Haiyang Li
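The usual culprit when this pattern hangs is the JDBC read itself: with no partitioning options, Spark pulls the whole table through a single task, and cache().count() simply forces that slow read to happen. A minimal sketch, assuming a numeric split column named id and placeholder connection details (the real url, query, and bounds would come from the asker's environment):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("eager-jdbc").getOrCreate()

    # Placeholder connection details, for illustration only.
    jdbcUrl = "jdbc:postgresql://host:5432/db"
    connectionProperties = {"user": "user", "password": "pass"}

    # cache() is lazy; count() forces the full read and materializes the cache.
    # Partitioning the read turns one long scan into parallel tasks.
    spark_df = spark.read.jdbc(
        url=jdbcUrl,
        table="(SELECT * FROM big_table) AS t",  # pushdown query, aliased
        column="id",           # assumed numeric column to split on
        lowerBound=1,
        upperBound=1_000_000,  # assumed key range
        numPartitions=8,
        properties=connectionProperties,
    ).cache()
    spark_df.count()  # triggers the actual read; later actions hit the cache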
1 vote · 0 answers · 39 views

Bugs due to PySpark lazy evaluation [duplicate]

    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName("Ark API Stats")
    sc = SparkContext(conf=conf)
    a = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
    count = [2,4]
    array = [a.filter(...
asked by Xing Shi
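The truncated snippet points at the classic interaction between Python's late-binding closures and Spark's laziness: every lambda built in the comprehension sees the final value of the loop variable by the time an action runs. A small sketch of the bug and the default-argument fix, with the filter predicate (x % c == 0) assumed for illustration since the original is cut off:

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("Ark API Stats")
    sc = SparkContext(conf=conf)

    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    count = [2, 4]

    # Buggy: each lambda closes over c. filter() is lazy, so the closures are
    # serialized only when collect() runs, by which time c == 4 for all of them.
    buggy = [a.filter(lambda x: x % c == 0) for c in count]
    print([rdd.collect() for rdd in buggy])  # [[4, 8], [4, 8]]

    # Fix: bind the current value as a default argument, freezing it per lambda.
    fixed = [a.filter(lambda x, c=c: x % c == 0) for c in count]
    print([rdd.collect() for rdd in fixed])  # [[2, 4, 6, 8, 10], [4, 8]]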
1 vote · 1 answer · 2k views

PySpark lazy evaluation in loops too slow

First of all, I want to let you know that I am still very new to Spark and getting used to the lazy-evaluation concept. Here is my issue: I have two Spark DataFrames that I load from reading CSV.GZ ...
asked by Oscar Mike
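When transformations accumulate inside a loop, the logical plan grows with every iteration, and each action re-analyzes (and may recompute) the entire chain, which is what makes such loops crawl. A common remedy is to truncate the lineage periodically with checkpoint(). A sketch under assumed data, standing in for the question's CSV.GZ DataFrames:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("loop-lineage").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    # Synthetic stand-in for the DataFrames loaded from CSV.GZ in the question.
    df = spark.range(1_000_000).withColumn("v", F.col("id") % 100)

    for i in range(20):
        # Each iteration stacks another step onto the (lazy) query plan.
        df = df.withColumn("v", F.col("v") + 1)

        if i % 5 == 4:
            # checkpoint() materializes the data and cuts the lineage,
            # keeping plan analysis from getting slower every iteration.
            df = df.checkpoint()

    print(df.agg(F.sum("v")).collect())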
0 votes · 2 answers · 653 views

RDD creation and variable binding

I have some very simple code:

    def fun(x, n):
        return (x, n)

    rdds = []
    for i in range(2):
        rdd = sc.parallelize(range(5*i, 5*(i+1)))
        rdd = rdd.map(lambda x: fun(x, i))
        rdds.append(rdd)

    a =...
asked by abhinavkulkarni
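Run as written, both RDDs above tag every element with i == 1: map() is lazy, the lambda closes over i, and the closure is serialized only when an action runs, after the loop has finished. Binding the current value as a default argument is the standard fix; a runnable sketch:

    from pyspark import SparkContext

    sc = SparkContext(appName="binding-demo")

    def fun(x, n):
        return (x, n)

    rdds = []
    for i in range(2):
        rdd = sc.parallelize(range(5 * i, 5 * (i + 1)))
        # i=i freezes the current loop value in the lambda's defaults;
        # a bare lambda x: fun(x, i) would see i == 1 for both RDDs.
        rdd = rdd.map(lambda x, i=i: fun(x, i))
        rdds.append(rdd)

    print(rdds[0].collect())  # [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
    print(rdds[1].collect())  # [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]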