
Add PySpark ETL best practices cursorrules#203

Open
rishikaidnani wants to merge 4 commits into PatrickJS:main from rishikaidnani:add-pyspark-etl-best-practices

Conversation


@rishikaidnani rishikaidnani commented Mar 21, 2026

Summary

Adds production-tested PySpark & ETL best practices as a .cursorrules file — the first PySpark/Spark-specific rules in the repository.

What's covered

8 sections covering the full ETL development lifecycle:

  1. Project Structure — ETL base class scaffold, config factory pattern, .transform() pipeline composition, shared partition-aware readers, reusable merge utilities
  2. Code Style — F.col() prefix convention, named conditions, select over withColumn, alias over withColumnRenamed, chaining limits
  3. Joins — explicit how=, left over right, .alias() for disambiguation, F.broadcast() for small dims, no .dropDuplicates() as a crutch
  4. Window Functions — explicit frame specification, row_number vs first, ignorenulls=True, avoid empty partitionBy()
  5. Map & Array HOFs — map_zip_with for conflict-aware merges, transform + array_max for nested structs, avoid UDFs
  6. Cumulative Table Patterns — idempotent merges, order-independent conflict resolution, primary key uniqueness validation
  7. Data Quality & Performance — F.lit(None) over empty strings, .otherwise() pitfalls, production-safe logging, intentional persist()
  8. Iceberg Write Patterns — .byName() for schema evolution, __partitions metadata table, write.distribution-mode (none/hash/range)

Credits

Inspired by the Palantir PySpark Style Guide and production experience debugging data skew, cumulative table merges, and Iceberg write patterns.

Checklist

  • Rule is in its own rules/pyspark-etl-best-practices-cursorrules-prompt-file/ directory
  • Directory contains .cursorrules and README.md
  • Main README.md updated with entry in the "Language-Specific" section (alphabetical order)

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive PySpark ETL best-practices guide covering ETL scaffolding, configuration patterns, pipeline composition, code-style conventions, join/merge strategies, window-function guidance, higher-order/map-merge patterns, idempotent cumulative/snapshot table rules, data-quality guardrails, and Iceberg write/read patterns.
    • Included usage instructions for applying the rules within PySpark projects.

coderabbitai bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5835bed6-ba58-4465-b5a8-5ce1d1afe002

📥 Commits

Reviewing files that changed from the base of the PR and between c3d938c and 3b4df49.

📒 Files selected for processing (1)
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
✅ Files skipped from review due to trivial changes (1)
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules

📝 Walkthrough

Walkthrough

Adds a new "PySpark ETL Best Practices" ruleset and its README, and updates the repository README to link to the new ruleset. Changes are documentation-only; no public APIs or code entities were modified.

Changes

Cohort / File(s) — Summary

  • Main README Update — README.md: Inserted a new list entry linking to the PySpark ETL Best Practices ruleset, with a brief description.
  • PySpark ETL Best Practices Rules — rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules, rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md: Added a comprehensive .cursorrules file prescribing ETL scaffold, config parsing, transform composition, partition-aware readers, join/window/map patterns, data-quality guardrails, and Iceberg write conventions; included a README describing usage.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested reviewers

  • PatrickJS

Poem

🐰 With whiskers twitching, I hop and cheer,
A ruleset landed, tidy and clear.
Joins and windows, maps in line,
ETL steps now neatly defined.
Hop on, coders — spark the light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed — The title clearly and concisely summarizes the main change: adding a new PySpark ETL best practices cursorrules file to the repository.
  • Docstring Coverage — ✅ Passed — No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's reviews stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the `reviews.profile` setting to `assertive` to make CodeRabbit nitpick more issues in your PRs.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules`:
- Around lines 352-364: The code assumes max_partition from
partition_df.orderBy(...).first() is non-null; when the partitions table is
empty this will be None, and max_partition["partition_date"] will raise a
TypeError. Fix by checking that max_partition is not None before accessing its
keys (in the block that defines max_partition and latest_date), and handle the
empty case (e.g., set latest_date to None, or raise a clear error / log via
processLogger) so subsequent logic using latest_date (or
partition_df/max_partition) won't crash.
- Around line 96-109: Update the example to use F.col() for column references in
the datediff call: replace the current F.datediff('date_a', 'date_b') usage with
a call that passes F.col('date_a') and F.col('date_b') so the date_passed
variable uses F.datediff(F.col('date_a'), F.col('date_b')) consistent with the
guideline; ensure the variables is_delivered, date_passed and has_registration
all use F.col() where appropriate and keep the final F.when(...) expression
unchanged.
- Around line 214-254: The examples reference the Window alias W (W.partitionBy,
W.unboundedPreceding, W.unboundedFollowing) but never define/import it; add
guidance to import and alias Spark's Window class (e.g., "from
pyspark.sql.window import Window as W") near the top or in the Code Style
section so examples using W and the F.* conventions are valid and consistent;
update the documentation text to mention importing Window as W when showing
windowed examples.
- Around line 259-274: The lambda passed to map_zip_with uses when() unprefixed
— change all uses of when(...) to F.when(...) in that lambda (and anywhere else
in this snippet) to follow the project convention; update or confirm the module
alias import (functions as F) is present so F.when is available, and keep
map_zip_with/map_concat usages unchanged except for the F.when prefix to ensure
consistent PySpark expression usage.
- Around line 59-74: The read_latest method in class PartitionedReader can crash
when the table is empty because .first() may return None; modify
PartitionedReader.read_latest to capture the result of
.agg(F.max(partition_col)).first() into a variable (e.g., first_row), check if
it is None (or if first_row[0] is None), and handle that case by returning an
empty DataFrame with the target table schema (e.g., use
spark.createDataFrame(spark.sparkContext.emptyRDD(),
spark.read.table(table_name).schema) or spark.read.table(table_name).limit(0))
or raise a clear, descriptive error; otherwise proceed to filter on max_val as
before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe9a030d-4e7c-4503-8aef-52afa13d30f6

📥 Commits

Reviewing files that changed from the base of the PR and between fc2ce04 and c3d938c.

📒 Files selected for processing (3)
  • README.md
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md

Labels

None yet

2 participants