feat: support CLUSTER BY [AUTO, NONE] for Databricks #5846

Merged
StuffbyYuki merged 7 commits into SQLMesh:main from EhabEasee:feat/clustered-by-auto-none
Jun 30, 2026

Conversation

@EhabEasee

@EhabEasee EhabEasee commented Jun 17, 2026

Contributor

Description

Databricks supports two keyword forms of liquid clustering that don't take column arguments:

  • CLUSTER BY AUTO — lets Databricks automatically select clustering columns
  • CLUSTER BY NONE — disables liquid clustering on a table

Previously, SQLMesh had no way to express these in a model definition. This PR adds support for both.

constants.py: Adds LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"}) as a shared constant used across the parser, validator, and adapter.

Parsing (dialect.py): The clustered_by property parser now recognises bare AUTO and NONE tokens (unquoted VAR tokens) as liquid clustering keywords rather than column references. Backtick-quoted `auto` / `none` are still treated as regular column names, preserving backwards compatibility for columns that happen to share those names.
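The distinction between a bare keyword and a backtick-quoted column can be sketched in isolation. This is a minimal illustration, not the code in dialect.py: the Token class and the classify helper are hypothetical stand-ins, and case-insensitive keyword matching is an assumption.

```python
from dataclasses import dataclass

# Mirrors the constant named in the PR description.
LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a lexer token (not a SQLMesh or SQLGlot type)."""

    text: str
    quoted: bool  # True when the name was written in backticks


def classify(token: Token) -> str:
    """Return "keyword" for bare AUTO/NONE and "column" for everything else."""
    # Assumption: keyword matching ignores case.
    if not token.quoted and token.text.upper() in LIQUID_CLUSTERING_KEYWORDS:
        return "keyword"
    return "column"
```

Under these assumptions, classify(Token("AUTO", quoted=False)) yields "keyword", while classify(Token("auto", quoted=True)) yields "column".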

Validation (meta.py): A single string passed to clustered_by is normalised to a list before processing. The validator then skips the column-count check for exp.Var(AUTO|NONE), but only when the field is clustered_by and the dialect is databricks. On deserialisation from JSON, keyword strings are restored to exp.Var sentinels before list_of_fields_validator can normalise them into quoted columns.
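A rough sketch of the normalisation step, assuming simplified stand-in types: Var and Column below are hypothetical replacements for the SQLGlot expressions, normalise_clustered_by is an invented name, and treating keywords inside a multi-item list as ordinary columns is an assumption based on the mixed-list tests in the test plan.

```python
from dataclasses import dataclass
from typing import List, Union

LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for the exp.Var keyword sentinel."""

    this: str


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a quoted column reference."""

    name: str


def normalise_clustered_by(value, field: str, dialect: str) -> List[Union[Var, Column]]:
    # A single value is wrapped in a list before any further processing.
    items = [value] if isinstance(value, (str, Var, Column)) else list(value)
    # Keywords are only honoured for clustered_by on the databricks dialect.
    in_scope = field == "clustered_by" and dialect == "databricks"
    result = []
    for item in items:
        if isinstance(item, str):
            is_keyword = (
                in_scope
                and len(items) == 1  # assumption: keywords only count when alone
                and item.upper() in LIQUID_CLUSTERING_KEYWORDS
            )
            # Restore the sentinel before the string can become a quoted column.
            item = Var(item.upper()) if is_keyword else Column(item)
        result.append(item)
    return result
```

With this sketch, the plain string "AUTO" becomes [Var("AUTO")] for clustered_by on databricks, and stays a Column for any other field or dialect.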

Validation (definition.py): The validate_definition column-existence check skips keyword sentinels for the same clustered_by + databricks scope.
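The scoped skip in the column-existence check might look roughly like this. The missing_columns helper and the Var/Column classes are hypothetical; the real check in definition.py may be structured differently.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Union

LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for the exp.Var keyword sentinel."""

    this: str


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column reference."""

    name: str


def missing_columns(
    clustered_by: Sequence[Union[Var, Column]], model_columns: Set[str], dialect: str
) -> List[str]:
    """Return clustered_by names that are not columns of the model."""
    missing = []
    for item in clustered_by:
        is_keyword = (
            dialect == "databricks"
            and isinstance(item, Var)
            and item.this in LIQUID_CLUSTERING_KEYWORDS
        )
        if is_keyword:
            continue  # keyword sentinels are not columns, so nothing to look up
        name = item.this if isinstance(item, Var) else item.name
        if name not in model_columns:
            missing.append(name)
    return missing
```

Because the skip is gated on the dialect, the same sentinel on a non-Databricks model is still reported as a missing column in this sketch.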

Code generation (databricks.py): _build_table_properties_exp detects a single exp.Var in clustered_by (guarded by a ValueError if the Var holds an unexpected value), and emits CLUSTER BY AUTO / CLUSTER BY NONE without wrapping in a tuple. Multi-column paths are unchanged.
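As a string-level sketch of the branching described above (the real adapter builds SQLGlot expressions rather than strings), with cluster_by_clause, Var, and Column all being hypothetical names and backtick quoting of columns assumed:

```python
from dataclasses import dataclass
from typing import Sequence, Union

LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for the exp.Var keyword sentinel."""

    this: str


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column reference."""

    name: str


def cluster_by_clause(clustered_by: Sequence[Union[Var, Column]]) -> str:
    # Keyword path: exactly one Var, emitted without parentheses.
    if len(clustered_by) == 1 and isinstance(clustered_by[0], Var):
        keyword = clustered_by[0].this
        if keyword not in LIQUID_CLUSTERING_KEYWORDS:
            # Guard against a bare Var that is not a known keyword.
            raise ValueError(f"Unexpected clustering keyword: {keyword!r}")
        return f"CLUSTER BY {keyword}"
    # Column path: unchanged, columns wrapped in a tuple.
    columns = ", ".join(f"`{col.name}`" for col in clustered_by)
    return f"CLUSTER BY ({columns})"
```

Here cluster_by_clause([Var("AUTO")]) produces CLUSTER BY AUTO, while a list of columns keeps the parenthesised form.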

Usage:

-- In a SQLMesh model definition
MODEL (
  name my_catalog.my_schema.my_table,
  kind FULL,
  dialect databricks,
  clustered_by AUTO
);

MODEL (
  name my_catalog.my_schema.my_table,
  kind FULL,
  dialect databricks,
  clustered_by NONE
);

Via the Python API, both a plain string and exp.Var are accepted:

create_sql_model(..., dialect="databricks", clustered_by="AUTO")
create_sql_model(..., dialect="databricks", clustered_by=exp.Var(this="AUTO"))

Columns with the names auto or none are still supported via backtick quoting:

MODEL (
  name my_catalog.my_schema.my_table,
  kind FULL,
  dialect databricks,
  clustered_by (`auto`, `none`)
);

Test Plan

  • tests/core/test_dialect.py — parser round-trips: AUTO/NONE keywords, backtick-quoted columns, paren-wrapped single columns, multi-column lists, mixed list (a, AUTO), non-Databricks dialect
  • tests/core/test_model.py — model DDL; Python API with both exp.Var and plain string; backtick-quoted column names; render_definition output; JSON serialisation round-trip; non-Databricks dialect rejection; mixed-list column treatment
  • tests/core/engine_adapter/test_databricks.py — adapter emits CLUSTER BY AUTO / CLUSTER BY NONE without column parens

Checklist

  • I have run make style and fixed any issues
  • I have added tests for my changes (if applicable)
  • All existing tests pass (make fast-test)
  • My commits are signed off (git commit -s) per the DCO
@EhabEasee EhabEasee force-pushed the feat/clustered-by-auto-none branch from 6f3e9a9 to 4f29141 Compare June 25, 2026 09:39
@EhabEasee EhabEasee changed the title feat: support CLUSTER BY AUTO and CLUSTER BY NONE for Databricks liquid clustering Jun 25, 2026
@StuffbyYuki StuffbyYuki self-requested a review June 29, 2026 05:27
@StuffbyYuki

Collaborator

@EhabEasee Thanks for this PR!

Not trying to be nit-picky, but here are a few items:

  • Docs: Add a note in model docs that Databricks supports clustered_by AUTO / NONE, and that backticks are needed for real columns named auto/none.
  • Test: test_clustered_by_keyword_non_databricks_dialect: perhaps use pytest.raises(ConfigError) instead of (ConfigError, Exception).

Let me know if I'm missing anything!
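The test suggestion can be illustrated without pytest. pytest.raises accepts a raised exception when it is an instance of the expected type or tuple of types, so including Exception in the tuple lets unrelated failures satisfy the test. The ConfigError class and would_pass helper below are hypothetical stand-ins for that matching rule.

```python
class ConfigError(Exception):
    """Hypothetical stand-in for SQLMesh's ConfigError."""


def would_pass(expected, raised: BaseException) -> bool:
    """Mimic the isinstance-style match that pytest.raises performs."""
    return isinstance(raised, expected)


unrelated_bug = TypeError("unexpected bug in the code under test")

# The broad tuple accepts any Exception subclass, hiding the unrelated bug.
broad = would_pass((ConfigError, Exception), unrelated_bug)

# The narrow form only accepts the intended error type.
narrow = would_pass(ConfigError, unrelated_bug)
```

In this sketch broad is True and narrow is False, which is why the narrower pytest.raises(ConfigError) makes the test stricter.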

@EhabEasee

EhabEasee commented Jun 29, 2026

Contributor Author

@StuffbyYuki both comments make sense and I've made the updates. However, the comment in the docs feels misplaced and easy to miss.

I was considering adding it in the Databricks engine docs but couldn't find a reasonable place to add it. Do you have any suggestions on a more relevant place to add that note? The StarRocks docs seem to have something similar so I could imitate that?

@StuffbyYuki

Collaborator

@EhabEasee thanks! Yeah, I don't think it has to be a big block like the one in the StarRocks docs, but I figured adding something somewhere in the docs might be helpful! I'll let you decide where and how to put it in the docs.

@EhabEasee

Contributor Author

@StuffbyYuki I added a new section to the databricks integration docs. Let me know if you have any more feedback

@StuffbyYuki

Collaborator

@EhabEasee It looks like your commits need DCO checks!

…id clustering

Adds parser, validator, and Databricks adapter support for the keyword
forms of liquid clustering. Bare AUTO/NONE (unquoted VAR tokens) are
recognised as keywords; backtick-quoted `auto`/`none` and
parenthesised forms remain real column references.

- Add LIQUID_CLUSTERING_KEYWORDS constant to avoid repeating the
  sentinel set across dialect, meta, definition, and adapter
- Parser (dialect.py): detect VAR-token AUTO/NONE on clustered_by;
  strip Paren from single-column clustered_by to match partitioned_by
  normalisation
- Validator (meta.py): normalise single string input to list; restore
  keyword sentinels from JSON strings on deserialisation; skip
  column-count check for keywords, gated on clustered_by + databricks
- validate_definition (definition.py): skip keyword sentinels in the
  column-existence check, same gate
- Adapter (databricks.py): emit CLUSTER BY AUTO / CLUSTER BY NONE
  without a tuple wrapper; raise ValueError on unexpected bare Var
- Tests: parser round-trips, Python API (exp.Var and plain string),
  backtick-quoted columns, render_definition, JSON round-trip,
  non-Databricks rejection, mixed-list behaviour, adapter SQL emission

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
@EhabEasee EhabEasee force-pushed the feat/clustered-by-auto-none branch from fb8c119 to 9b49578 Compare June 30, 2026 18:43
…ed_by docs

Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
…d_non_databricks_dialect

Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
… clustered_by docs"

This reverts commit bb70305.

Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
…tion docs

Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
@EhabEasee EhabEasee force-pushed the feat/clustered-by-auto-none branch 2 times, most recently from 1c7d73a to a26e4f6 Compare June 30, 2026 18:49
@StuffbyYuki StuffbyYuki merged commit 9a25aa1 into SQLMesh:main Jun 30, 2026
32 checks passed

2 participants