5 votes
Accepted

Find the youngest athlete to win a Gold medal

It's more common to use from pyspark.sql import functions as F, since there are functions like ...
C.Nivs • 3,117
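
A minimal sketch of the convention the answer recommends; the DataFrame is invented for illustration. Aliasing the module as F keeps Spark's min, max, and sum from shadowing Python's builtins:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 23), ("B", 19)], ["name", "age"])

# F.min is unambiguously the Spark column function, not the Python builtin.
df.agg(F.min("age").alias("youngest_age")).show()
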
3 votes
Accepted

Joining Apache Spark data frames, with many conditional substitutions

Indeed, the sequence of when statements is very repetitive and can be refactored. All whens are similar, except the last one, ...
Antot • 3,742
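
The refactoring idea can be sketched by folding a list of (condition, value) pairs into one chained expression; the rules below are hypothetical, not the reviewed code's:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A",), ("P",), ("X",)], ["status"])

# Hypothetical substitutions; the one differing case goes in otherwise().
rules = [
    (F.col("status") == "A", "active"),
    (F.col("status") == "I", "inactive"),
    (F.col("status") == "P", "pending"),
]

# Fold the list into a single when().when()...otherwise() column expression.
expr = reduce(lambda acc, rule: acc.when(*rule), rules[1:], F.when(*rules[0]))
df.withColumn("status_label", expr.otherwise("unknown")).show()
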
3 votes

PYSPARK: Find the penultimate (2nd largest value) row-wise

Is this approach structured well for a production environment? I would be reluctant to run this code in a production environment. Some of it isn't due to the code at all -- it starts with the chosen example. We'...
J_H • 43.3k
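
One common way to take the second-largest value across columns, as a hedged sketch with invented column names: build an array of the row's values, sort it descending, and take index 1.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4, 9, 7), (1, 3, 2)], ["a", "b", "c"])

# Sort each row's values descending; element 1 is the penultimate value.
penultimate = F.sort_array(F.array("a", "b", "c"), asc=False)[1]
df.withColumn("second_largest", penultimate).show()
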
2 votes
Accepted

Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions

In tryRun()/runWithRetry(): naming is extremely poor. The function name should show what it does. "Run" what? The comment doesn't even help -- it just ...
Snowbody • 8,692
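
The reviewed code is Scala, but the naming point carries over; a Python sketch of a retry helper whose name says what is being retried (all names illustrative):

import time

def fetch_with_retry(fetch, max_attempts=3, delay_seconds=1.0):
    """Call fetch(), retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the real error
            time.sleep(delay_seconds)
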
2 votes

Graph coloring problem with Spark (JAVA)?

Without seeing the rest of the class, I have no idea why this needs to not be static -- maybe one of the functions it calls interacts with some internal ...
Sara J • 4,221
2 votes

Binary check code in pyspark

Spark dataframes (and columns) have a distinct method, which you can use to get all values in that column. Getting the actual values out is a bit more complicated ...
Graipher • 41.7k
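
A short sketch of the distinct-plus-collect pattern the answer describes (column name invented); each collected Row still has to be unpacked to reach the plain value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (1,), (2,)], ["flag"])

# distinct() returns a DataFrame of Rows; unpack each Row to get the value.
values = [row["flag"] for row in df.select("flag").distinct().collect()]
print(values)  # e.g. [0, 1, 2] -- order is not guaranteed
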
1 vote

Code that loops through a df, joins two separate dataframes, and sinks into Delta Lake; how do I make it run faster?

This submission is about performance. It includes no CPU profiling, nor performance measurements of any kind, and does not generate example data which someone else could query against. This seems ...
J_H • 43.3k
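
A hedged sketch of what such evidence could look like: generate synthetic data that anyone can reproduce, then time the job; the schema and sizes are invented:

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Synthetic, reproducible example data (invented schema).
df = spark.range(1_000_000).withColumn("key", F.col("id") % 1000)

start = time.perf_counter()
n = df.groupBy("key").count().count()  # the outer count() forces execution
print(f"{n} groups in {time.perf_counter() - start:.2f}s")
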
1 vote

PySpark: Create a column containing the minimum divided by maximum of each row

For the most part, the code layout is good, and you chose meaningful names for variables and functions. When I run pylint, it complains about these unused imports: ...
toolic • 16.4k
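
A minimal sketch of the row-wise computation the question asks about, using least and greatest over invented columns:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4.0, 9.0, 7.0), (2.0, 2.0, 8.0)], ["a", "b", "c"])

# least/greatest operate row-wise across the listed columns.
cols = [F.col(c) for c in df.columns]
df.withColumn("min_over_max", F.least(*cols) / F.greatest(*cols)).show()
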
1 vote

Managing PySpark DataFrames

Formatting: PySpark code can create some pretty long lines; parentheses allow you to break them up for easier readability. Let's do that first: ...
C.Nivs • 3,117
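
The formatting suggestion in brief; the chain itself is illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 34), ("B", 19)], ["name", "age"])

# Parentheses let the chain break across lines, one step per line,
# with no backslash continuations.
result = (
    df
    .filter(F.col("age") > 21)
    .withColumn("decade", F.floor(F.col("age") / 10) * 10)
    .groupBy("decade")
    .count()
)
result.show()
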
1 vote
Accepted

Large dataset with pyspark - optimizing join, sort, compare between rows and group by with aggregation

I believe the solution below should accomplish what you are looking to do in a more efficient way. Your current method involves a lot of "shuffle" operations (group by, sorting, joining). The below ...
zam • 26
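
One common way such shuffles get reduced, sketched with an invented schema (an assumption about the shape of the fix, not the answer's exact code): a single window per key replaces a group-by followed by a self-join:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 1, 10.0), ("u1", 2, 30.0), ("u2", 1, 5.0)],
    ["user", "seq", "value"],
)

# One window spec gives access to the previous row within each user,
# avoiding a separate aggregation and join back.
w = Window.partitionBy("user").orderBy("seq")
df = df.withColumn("prev_value", F.lag("value").over(w))
df.withColumn("delta", F.col("value") - F.col("prev_value")).show()
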
1 vote

PySpark SCD Type 1

Some observations about your SQL table and data types. In most relational databases, and definitely SQL Server, choosing an ...
Stu • 111
1 vote

PySpark SCD Type 1

A random scattering of things: CapitalCities is suspicious and suggests an improperly normalized schema. One would expect a separate table with a foreign key if a ...
Reinderien • 71.2k
1 vote
Accepted

Group events close in time into sessions and assign unique session IDs

Couldn't you reduce the complexity a lot by transforming the timestamp data per user to a KeyValueGroupedDataSet[String, Int] and then grouping the sessions based on ...
lex82 • 1,153
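
The answer itself sketches a Scala KeyValueGroupedDataSet; a PySpark rendering of the same grouping idea (gap threshold and schema invented) uses a window plus a running sum of session-start flags:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("u1", 100), ("u1", 130), ("u1", 4000), ("u2", 50)],
    ["user", "ts"],
)

GAP = 1800  # hypothetical session timeout, in seconds

# Flag rows whose gap to the previous event exceeds the timeout; the first
# event per user has no predecessor, so it starts a session by definition.
w = Window.partitionBy("user").orderBy("ts")
starts = (F.col("ts") - F.lag("ts").over(w) > GAP).cast("int")
events = events.withColumn("new_session", F.coalesce(starts, F.lit(1)))

# A running sum of the flags numbers the sessions within each user.
session_no = F.sum("new_session").over(w).cast("string")
events.withColumn("session_id", F.concat_ws("-", "user", session_no)).show()
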
1 vote

Scala app to transpose columns into rows

Immediately I'd ask if there's any specific style guide that allows these vastly different names, otherwise I'd suggest following IDE hints and/or a linter and rename the variables and methods to be ...
ferada • 11.4k
1 vote

Filtering and creating new columns by condensing the lists for each item information

This is an old but active question; I hope I can still help with my insights on how I would refactor the code. I would start by removing declarative comments ("this does x" type of comments). There ...
svacx • 433
1 vote
Accepted

Binary check code in pyspark

This one is O(1) in terms of PySpark collect operations, unlike the previous answers, both of which are O(n), where ...
foxale • 26
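
The complexity claim can be illustrated with a single aggregation (column name invented): one collect happens regardless of row count, because the per-row work stays on the executors:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (1,)], ["flag"])

# Count rows holding anything other than 0 or 1, in one job, one collect.
non_binary = df.agg(
    F.count(F.when(~F.col("flag").isin(0, 1), True)).alias("bad")
).collect()[0]["bad"]
print("binary column" if non_binary == 0 else "non-binary values present")
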

Only top-scored, non-community-wiki answers of a minimum length are eligible.