5 votes
Accepted

Find the youngest athlete to win a Gold medal

It's more common to use from pyspark.sql import functions as F, since there are functions like ...
C.Nivs • 3,117
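
A minimal sketch of the convention the answer recommends; the DataFrame is invented for illustration. Aliasing the module as F keeps Spark's min, max, and sum from shadowing Python's builtins:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 23), ("B", 19)], ["name", "age"])

# F.min is unambiguously the Spark column function, not the Python builtin.
df.agg(F.min("age").alias("youngest_age")).show()
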
3 votes
Accepted

Joining Apache Spark data frames, with many conditional substitutions

Indeed, the sequence of when statements is very repetitive and can be refactored. All whens are similar, except the last one, ...
Antot • 3,742
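
The refactoring idea can be sketched by folding a list of (condition, value) pairs into one chained expression; the rules below are hypothetical, not the reviewed code's:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A",), ("P",), ("X",)], ["status"])

# Hypothetical substitutions; the one differing case goes in otherwise().
rules = [
    (F.col("status") == "A", "active"),
    (F.col("status") == "I", "inactive"),
    (F.col("status") == "P", "pending"),
]

# Fold the list into a single when().when()...otherwise() column expression.
expr = reduce(lambda acc, rule: acc.when(*rule), rules[1:], F.when(*rules[0]))
df.withColumn("status_label", expr.otherwise("unknown")).show()
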
3 votes

PYSPARK: Find the penultimate (2nd largest value) row-wise

Is this approach structured well for a production environment? I would be reluctant to run this code in a production environment. Some of it isn't due to the code at all -- it starts with the chosen example. We'...
J_H • 43.3k
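
One common way to take the second-largest value across columns, as a hedged sketch with invented column names: build an array of the row's values, sort it descending, and take index 1.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4, 9, 7), (1, 3, 2)], ["a", "b", "c"])

# Sort each row's values descending; element 1 is the penultimate value.
penultimate = F.sort_array(F.array("a", "b", "c"), asc=False)[1]
df.withColumn("second_largest", penultimate).show()
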
2 votes
Accepted

Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions

In tryRun()/runWithRetry(): naming is extremely poor. The function name should show what it does. "Run" what? The comment doesn't even help -- it just ...
Snowbody • 8,692
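
The reviewed code is Scala, but the naming point carries over; a Python sketch of a retry helper whose name says what is being retried (all names illustrative):

import time

def fetch_with_retry(fetch, max_attempts=3, delay_seconds=1.0):
    """Call fetch(), retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the real error
            time.sleep(delay_seconds)
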
2 votes

Graph coloring problem with Spark (JAVA)?

Without seeing the rest of the class, I have no idea why this needs to not be static -- maybe one of the functions it calls interacts with some internal ...
Sara J • 4,221
2 votes

Binary check code in pyspark

Spark dataframes (and columns) have a distinct method, which you can use to get all values in that column. Getting the actual values out is a bit more complicated ...
Graipher • 41.7k
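
A short sketch of the distinct-plus-collect pattern the answer describes (column name invented); each collected Row still has to be unpacked to reach the plain value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (1,), (2,)], ["flag"])

# distinct() returns a DataFrame of Rows; unpack each Row to get the value.
values = [row["flag"] for row in df.select("flag").distinct().collect()]
print(values)  # e.g. [0, 1, 2] -- order is not guaranteed
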
1 vote

Code that loops through a df, joins two separate dataframes, and sinks into Delta Lake; how do I make it run faster?

This submission is about performance. It includes no CPU profiling, nor performance measurements of any kind, and does not generate example data which someone else could query against. This seems ...
J_H • 43.3k
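
A hedged sketch of what such evidence could look like: generate synthetic data that anyone can reproduce, then time the job; the schema and sizes are invented:

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Synthetic, reproducible example data (invented schema).
df = spark.range(1_000_000).withColumn("key", F.col("id") % 1000)

start = time.perf_counter()
n = df.groupBy("key").count().count()  # the outer count() forces execution
print(f"{n} groups in {time.perf_counter() - start:.2f}s")
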
1 vote

PySpark: Create a column containing the minimum divided by maximum of each row

For the most part, the code layout is good, and you chose meaningful names for variables and functions. When I run pylint, it complains about these unused imports: ...
toolic • 16.4k
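
A minimal sketch of the row-wise computation the question asks about, using least and greatest over invented columns:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4.0, 9.0, 7.0), (2.0, 2.0, 8.0)], ["a", "b", "c"])

# least/greatest operate row-wise across the listed columns.
cols = [F.col(c) for c in df.columns]
df.withColumn("min_over_max", F.least(*cols) / F.greatest(*cols)).show()
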
1 vote

Managing PySpark DataFrames

Formatting: PySpark code can create some pretty long lines; parentheses allow you to break them up for easier readability. Let's do that first: ...
C.Nivs • 3,117
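
The formatting suggestion in brief; the chain itself is illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 34), ("B", 19)], ["name", "age"])

# Parentheses let the chain break across lines, one step per line,
# with no backslash continuations.
result = (
    df
    .filter(F.col("age") > 21)
    .withColumn("decade", F.floor(F.col("age") / 10) * 10)
    .groupBy("decade")
    .count()
)
result.show()
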
1 vote
Accepted

Large dataset with pyspark - optimizing join, sort, compare between rows and group by with aggregation

I believe the solution below should accomplish what you are looking to do in a more efficient way. Your current method involves a lot of "shuffle" operations (group by, sorting, joining). The below ...
zam • 26
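
One common way such shuffles get reduced, sketched with an invented schema (an assumption about the shape of the fix, not the answer's exact code): a single window per key replaces a group-by followed by a self-join:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 1, 10.0), ("u1", 2, 30.0), ("u2", 1, 5.0)],
    ["user", "seq", "value"],
)

# One window spec gives access to the previous row within each user,
# avoiding a separate aggregation and join back.
w = Window.partitionBy("user").orderBy("seq")
df = df.withColumn("prev_value", F.lag("value").over(w))
df.withColumn("delta", F.col("value") - F.col("prev_value")).show()
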
1 vote

PySpark SCD Type 1

Some observations about your SQL table and data types. In most relational databases, and definitely SQL Server, choosing an ...
Stu • 111
1 vote

PySpark SCD Type 1

A random scattering of things: CapitalCities is suspicious and suggests an improperly normalized schema. One would expect a separate table with a foreign key if a ...
Reinderien • 71.2k
1 vote
Accepted

Group events close in time into sessions and assign unique session IDs

Couldn't you reduce the complexity a lot by transforming the timestamp data per user to a KeyValueGroupedDataSet[String, Int] and then grouping the sessions based on ...
lex82 • 1,153
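
The answer itself sketches a Scala KeyValueGroupedDataSet; a PySpark rendering of the same grouping idea (gap threshold and schema invented) uses a window plus a running sum of session-start flags:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("u1", 100), ("u1", 130), ("u1", 4000), ("u2", 50)],
    ["user", "ts"],
)

GAP = 1800  # hypothetical session timeout, in seconds

# Flag rows whose gap to the previous event exceeds the timeout; the first
# event per user has no predecessor, so it starts a session by definition.
w = Window.partitionBy("user").orderBy("ts")
starts = (F.col("ts") - F.lag("ts").over(w) > GAP).cast("int")
events = events.withColumn("new_session", F.coalesce(starts, F.lit(1)))

# A running sum of the flags numbers the sessions within each user.
session_no = F.sum("new_session").over(w).cast("string")
events.withColumn("session_id", F.concat_ws("-", "user", session_no)).show()
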
1 vote

Scala app to transpose columns into rows

Immediately I'd ask if there's any specific style guide that allows these vastly different names, otherwise I'd suggest following IDE hints and/or a linter and rename the variables and methods to be ...
ferada • 11.4k
1 vote

Filtering and creating new columns by condensing the lists for each item information

This is an old but active question; I hope I can still help with my insights on how I would refactor the code. I would start by removing declarative comments ("this does x" type of comments). There ...
svacx • 433
1 vote
Accepted

Binary check code in pyspark

This one is O(1) in terms of PySpark collect operations, unlike the previous answers, both of which are O(n), where ...
foxale • 26
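
The complexity claim can be illustrated with a single aggregation (column name invented): one collect happens regardless of row count, because the per-row work stays on the executors:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (1,)], ["flag"])

# Count rows holding anything other than 0 or 1, in one job, one collect.
non_binary = df.agg(
    F.count(F.when(~F.col("flag").isin(0, 1), True)).alias("bad")
).collect()[0]["bad"]
print("binary column" if non_binary == 0 else "non-binary values present")
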

Only top-scored, non-community-wiki answers of a minimum length are eligible.