
Add PySpark ETL best practices cursorrules#203

Open
rishikaidnani wants to merge 4 commits into PatrickJS:main from rishikaidnani:add-pyspark-etl-best-practices

Conversation


@rishikaidnani rishikaidnani commented Mar 21, 2026

Summary

Adds production-tested PySpark & ETL best practices as a .cursorrules file — the first PySpark/Spark-specific rules in the repository.

What's covered

8 sections covering the full ETL development lifecycle:

  1. Project Structure — ETL base class scaffold, config factory pattern, .transform() pipeline composition, shared partition-aware readers, reusable merge utilities
  2. Code Style — F.col() prefix convention, named conditions, select over withColumn, alias over withColumnRenamed, chaining limits
  3. Joins — explicit how=, left over right, .alias() for disambiguation, F.broadcast() for small dims, no .dropDuplicates() as a crutch
  4. Window Functions — explicit frame specification, row_number vs first, ignorenulls=True, avoid empty partitionBy()
  5. Map & Array HOFs — map_zip_with for conflict-aware merges, transform + array_max for nested structs, avoid UDFs
  6. Cumulative Table Patterns — idempotent merges, order-independent conflict resolution, primary key uniqueness validation
  7. Data Quality & Performance — F.lit(None) over empty strings, .otherwise() pitfalls, production-safe logging, intentional persist()
  8. Iceberg Write Patterns — .byName() for schema evolution, __partitions metadata table, write.distribution-mode (none/hash/range)

Credits

Inspired by the Palantir PySpark Style Guide and production experience debugging data skew, cumulative table merges, and Iceberg write patterns.

Checklist

  • Rule is in its own rules/pyspark-etl-best-practices-cursorrules-prompt-file/ directory
  • Directory contains .cursorrules and README.md
  • Main README.md updated with entry in the "Language-Specific" section (alphabetical order)

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive PySpark ETL best-practices guide covering ETL scaffolding, configuration patterns, pipeline composition, code-style conventions, join/merge strategies, window-function guidance, higher-order/map-merge patterns, idempotent cumulative/snapshot table rules, data-quality guardrails, and Iceberg write/read patterns.
    • Included usage instructions for applying the rules within PySpark projects.

coderabbitai bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5835bed6-ba58-4465-b5a8-5ce1d1afe002

📥 Commits

Reviewing files that changed from the base of the PR and between c3d938c and 3b4df49.

📒 Files selected for processing (1)
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
✅ Files skipped from review due to trivial changes (1)
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules

📝 Walkthrough

Walkthrough

Adds a new "PySpark ETL Best Practices" ruleset and its README, and updates the repository README to link to the new ruleset. Changes are documentation-only; no public APIs or code entities were modified.

Changes

Cohort / File(s) — Summary

  • Main README Update — README.md: Inserted a new list entry linking to the PySpark ETL Best Practices ruleset, with a brief description.
  • PySpark ETL Best Practices Rules — rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules, rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md: Added a comprehensive .cursorrules file prescribing ETL scaffold, config parsing, transform composition, partition-aware readers, join/window/map patterns, data-quality guardrails, and Iceberg write conventions; included a README describing usage.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested reviewers

  • PatrickJS

Poem

🐰 With whiskers twitching, I hop and cheer,
A ruleset landed, tidy and clear.
Joins and windows, maps in line,
ETL steps now neatly defined.
Hop on, coders — spark the light! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed — The title clearly and concisely summarizes the main change: adding a new PySpark ETL best practices cursorrules file to the repository.
  • Docstring Coverage — ✅ Passed — No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's reviews stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the `reviews.profile` setting to `assertive` to make CodeRabbit nitpick more issues in your PRs.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules`:
- Around lines 352-364: The code assumes max_partition from
partition_df.orderBy(...).first() is non-null; when the partitions table is
empty this will be None, and max_partition["partition_date"] will raise a
TypeError. Fix by checking that max_partition is not None before accessing its
keys (in the block that defines max_partition and latest_date), and handle the
empty case (e.g., set latest_date to None, or raise a clear error / log via
processLogger) so subsequent logic using latest_date (or
partition_df/max_partition) won't crash.
- Around line 96-109: Update the example to use F.col() for column references in
the datediff call: replace the current F.datediff('date_a', 'date_b') usage with
a call that passes F.col('date_a') and F.col('date_b') so the date_passed
variable uses F.datediff(F.col('date_a'), F.col('date_b')) consistent with the
guideline; ensure the variables is_delivered, date_passed and has_registration
all use F.col() where appropriate and keep the final F.when(...) expression
unchanged.
- Around line 214-254: The examples reference the Window alias W (W.partitionBy,
W.unboundedPreceding, W.unboundedFollowing) but never define/import it; add
guidance to import and alias Spark's Window class (e.g., "from
pyspark.sql.window import Window as W") near the top or in the Code Style
section so examples using W and the F.* conventions are valid and consistent;
update the documentation text to mention importing Window as W when showing
windowed examples.
- Around line 259-274: The lambda passed to map_zip_with uses when() unprefixed
— change all uses of when(...) to F.when(...) in that lambda (and anywhere else
in this snippet) to follow the project convention; update or confirm the module
alias import (functions as F) is present so F.when is available, and keep
map_zip_with/map_concat usages unchanged except for the F.when prefix to ensure
consistent PySpark expression usage.
- Around line 59-74: The read_latest method in class PartitionedReader can crash
when the table is empty because .first() may return None; modify
PartitionedReader.read_latest to capture the result of
.agg(F.max(partition_col)).first() into a variable (e.g., first_row), check if
it is None (or if first_row[0] is None), and handle that case by returning an
empty DataFrame with the target table schema (e.g., use
spark.createDataFrame(spark.sparkContext.emptyRDD(),
spark.read.table(table_name).schema) or spark.read.table(table_name).limit(0))
or raise a clear, descriptive error; otherwise proceed to filter on max_val as
before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe9a030d-4e7c-4503-8aef-52afa13d30f6

📥 Commits

Reviewing files that changed from the base of the PR and between fc2ce04 and c3d938c.

📒 Files selected for processing (3)
  • README.md
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/.cursorrules
  • rules/pyspark-etl-best-practices-cursorrules-prompt-file/README.md

Labels

None yet

2 participants