5 votes · Accepted
Find the youngest athlete to win a Gold medal
It's more common to use from pyspark.sql import functions as F since there are functions like <...
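The truncated excerpt presumably refers to names such as min and max, which would shadow Python builtins if imported unqualified. A minimal sketch of the aliased-import style it recommends; the data and column names here are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
medals = spark.createDataFrame(
    [("Alice", 19, "Gold"), ("Bob", 23, "Gold")],
    ["athlete", "age", "medal"],
)

# F.min would shadow Python's builtin min() if imported unqualified.
medals.filter(F.col("medal") == "Gold").agg(F.min("age").alias("youngest")).show()
```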
3 votes · Accepted
Joining Apache Spark data frames, with many conditional substitutions
Indeed, the sequence of when statements is very repetitive and can be refactored. All whens are similar, except the last one, ...
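The full refactoring isn't shown in the excerpt; one common way to collapse a repetitive when chain is to build it in a loop. A sketch with hypothetical columns and substitution rules:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(None, 1, 2), (7, None, 3)], ["a_1", "a_2", "fallback"])

# Hypothetical substitution rules; the real ones come from the question.
rules = [(F.col("a_1").isNotNull(), F.col("a_1")),
         (F.col("a_2").isNotNull(), F.col("a_2"))]

# Build the repetitive when(...).when(...) chain in a loop.
expr = F.when(*rules[0])
for cond, value in rules[1:]:
    expr = expr.when(cond, value)
expr = expr.otherwise(F.col("fallback"))

df.withColumn("substituted", expr).show()
```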
3 votes
PYSPARK: Find the penultimate (2nd largest value) row-wise
I would be reluctant to run this code in a production environment. Some of it isn't due to the code at all -- it starts with the chosen example. We'...
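For context, one common way to get the second-largest value per row (not necessarily the approach discussed in the answer) is to sort the row's values into an array:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 9, 7), (5, 1, 2)], ["c1", "c2", "c3"])

# Sort the row's values into an array, then take the second element
# from the end; element_at supports negative (from-the-end) indices.
arr = F.array_sort(F.array("c1", "c2", "c3"))
df.withColumn("penultimate", F.element_at(arr, -2)).show()
```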
2 votes · Accepted
Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions
In tryRun()/runWithRetry(): naming is extremely poor. The function name should show what it does. "Run" what? The comment doesn't even help -- it just ...
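To illustrate the naming point with a hypothetical rename, sketched in Python even though the original code is Scala:

```python
import time

# Hypothetical rename illustrating the point: say what is being run and retried.
def send_report_with_retry(send_report, attempts=3, delay_seconds=2.0):
    """Retry a flaky send; the name now says what 'run' meant."""
    for attempt in range(1, attempts + 1):
        try:
            return send_report()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
```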
2 votes
Graph coloring problem with Spark (JAVA)?
Without seeing the rest of the class, I have no idea why this needs to not be static -- maybe one of the functions it calls interacts with some internal ...
2 votes
Binary check code in pyspark
Spark dataframes (and columns) have a distinct method, which you can use to get all values in that column. Getting the actual values out is a bit more complicated ...
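A sketch of the distinct-then-collect idea, with a made-up flag column; unpacking the Row objects is the "bit more complicated" part:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (1,)], ["flag"])

# distinct() returns a dataframe of Rows; each Row has to be unpacked
# before the values can be compared against {0, 1}.
values = {row["flag"] for row in df.select("flag").distinct().collect()}
is_binary = values <= {0, 1}
print(is_binary)
```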
1 vote
Code that loops through a df, joins two separate dataframes, and sinks into a delta lake; how do I make it run faster?
This submission is about performance. It includes no CPU profiling, nor performance measurements of any kind, and does not generate example data which someone else could query against. This seems ...
1 vote
PySpark: Create a column containing the minimum divided by maximum of each row
For the most part, the code layout is good, and you chose meaningful names for variables and functions. When I run pylint, it complains about these unused imports: ...
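The excerpt covers style rather than the computation itself; for reference, a row-wise min/max ratio can be written with least and greatest (a sketch with invented column names):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2.0, 8.0, 4.0)], ["x", "y", "z"])

# least/greatest evaluate row-wise across columns, so no self-join
# or explode is needed for a per-row min/max ratio.
cols = ["x", "y", "z"]
df.withColumn("min_over_max", F.least(*cols) / F.greatest(*cols)).show()
```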
1 vote
Managing PySpark DataFrames
Formatting: PySpark code can create some pretty long lines; parentheses let you break them up for easier readability. Let's do that first: ...
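A sketch of the parenthesized-chain layout the excerpt describes, using made-up data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("DE", 21, 0.9), ("FR", 17, 0.4)],
                           ["country", "age", "score"])

# Wrapping the chain in parentheses lets each transformation sit on
# its own line without backslash continuations.
result = (
    df
    .filter(F.col("age") >= 18)
    .groupBy("country")
    .agg(F.avg("score").alias("avg_score"))
)
result.show()
```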
1 vote · Accepted
Large dataset with pyspark - optimizing join, sort, compare between rows and group by with aggregation
I believe the solution below does what you are looking for in a more efficient way. Your current method involves a lot of "shuffle" operations (group by, sorting, joining). The below ...
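The answer's actual solution is truncated; one standard shuffle-reducing move, shown here as an assumption rather than the answer's code, is to replace separate join + sort + groupBy passes with a single window:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 1, 10.0), ("u1", 2, 12.0), ("u2", 1, 7.0)],
    ["user", "seq", "value"],
)

# One window handles partitioning, ordering, and row comparison in a
# single shuffle, instead of separate join + sort + groupBy passes.
w = Window.partitionBy("user").orderBy("seq")
df.withColumn("delta", F.col("value") - F.lag("value").over(w)).show()
```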
1 vote
PySpark SCD Type 1
Some observations about your SQL table and data types. In most relational databases, and definitely SQL Server, choosing an ...
1 vote
PySpark SCD Type 1
A random scattering of things: CapitalCities is suspicious and suggests an improperly normalized schema. One would expect a separate table with a foreign key if a ...
1 vote · Accepted
Group events close in time into sessions and assign unique session IDs
Couldn't you reduce the complexity a lot by transforming the timestamp data per user to a KeyValueGroupedDataSet[String, Int] and then grouping the sessions based on ...
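The answer is Scala; a PySpark analogue of the same gap-based grouping idea, with an assumed 30-minute threshold and unix-second timestamps:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [("u1", 100), ("u1", 200), ("u1", 5000), ("u2", 50)],
    ["user", "ts"],  # ts assumed to be unix seconds
)

GAP = 30 * 60  # assumed session gap of 30 minutes

# A new session starts whenever the gap to the previous event exceeds
# the threshold; a running sum of those flags numbers the sessions.
w = Window.partitionBy("user").orderBy("ts")
sessions = (
    events
    .withColumn("gap", F.col("ts") - F.lag("ts").over(w))
    .withColumn("is_new", (F.col("gap").isNull() | (F.col("gap") > GAP)).cast("int"))
    .withColumn("session_idx", F.sum("is_new").over(w))
    .withColumn("session_id",
                F.concat_ws("-", F.col("user"), F.col("session_idx").cast("string")))
)
sessions.show()
```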
1 vote
Scala app to transpose columns into rows
Immediately I'd ask if there's any specific style guide that allows these vastly different names; otherwise I'd suggest following IDE hints and/or a linter and renaming the variables and methods to be ...
1 vote
Filtering and creating new columns by condensing the lists for each item information
This is an old but still active question; I hope I can still help with my insights on how I would refactor the code. I would start by removing declarative comments ("this does x" type of comments). There ...
1 vote · Accepted
Binary check code in pyspark
This one is O(1) in terms of PySpark collect operations, unlike the previous answers, both of which are O(n), where ...
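The excerpt doesn't show the code, but one way an O(1)-collect check can look is a single aggregation (a sketch, not necessarily the answer's implementation):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (1,)], ["flag"])

# A single aggregation answers the question in one collect, no matter
# how many rows there are: does any value fall outside {0, 1}?
row = df.agg(F.max((~F.col("flag").isin(0, 1)).cast("int")).alias("bad")).first()
print(row["bad"] == 0)  # True means the column is binary
```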
Related Tags
apache-spark × 34
python × 18
scala × 12
performance × 7
python-3.x × 7
java × 5
mapreduce × 4
beginner × 3
machine-learning × 3
sql × 2
hadoop × 2
algorithm × 1
object-oriented × 1
game × 1
functional-programming × 1
regex × 1
graph × 1
sql-server × 1
xml × 1
statistics × 1
logging × 1
compression × 1
coordinate-system × 1
geospatial × 1
static × 1