@nkaki nkaki commented Jan 29, 2026

Rationale for this change

To make it easier for readers to get an overall picture of Parquet encodings.

What changes are included in this PR?

Adds a summary table to Encodings.md that lists the encoding types (each linked to its description), enums and targets for different Parquet format versions.

See rendered format here: https://github.com/nkaki/parquet-format/blob/master/Encodings.md

Do these changes have PoC implementations?

No - Documentation change only

Closes #550

Contributor

@alamb alamb left a comment

Thanks @nkaki -- this is a great start

Encodings.md Outdated

### Supported Encodings

| Encoding type | Encoding enum | Encoding Targets <br> (Parquet 2.0.0+) | Encoding Targets <br> (Parquet 1.0.0+) |
Contributor

I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.

I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec

I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?

Author

@alamb
Thank you for the review!

> I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.
> I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec

I agree on focusing on the current version of the spec. At some point it would be great if the Parquet site made previous versions easy to browse. For the table, I will remove the last column and rename the third one.

One question: would "Data Page V2" (header?) be a better term in this case?

> I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?

For PLAIN_DICTIONARY and RLE_DICTIONARY, I will merge the rows and mark the PLAIN_DICTIONARY enum as deprecated.

For BIT_PACKED, the deprecated encoding is still explained in the document and is linked to from other encodings, so I thought it should stay in the table and link to its details. I see a few options:

  1. Remove BIT_PACKED encoding from the table (your suggestion)
  2. Remove BIT_PACKED encoding description from the page and from the table (this may break links).
  3. Separate the currently supported and deprecated encodings into separate tables, and change the layout of the page.
  • Layout A:
    supported encodings table
    deprecated encodings table (only BIT_PACKED)
    supported + deprecated encodings descriptions (current order)
  • Layout B:
    supported encodings table
    supported encodings descriptions (current order without BIT_PACKED)
    deprecated encodings table (only BIT_PACKED)
    deprecated encodings descriptions (only BIT_PACKED)
  • Layout C:
    supported encodings table
    deprecated encodings table (only BIT_PACKED)
    supported encodings descriptions (current order without BIT_PACKED)
    deprecated encodings descriptions (only BIT_PACKED)
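
As a rough illustration of option 3 (not part of the PR), the split into a supported table and a deprecated table could be modeled as below. The names `ROWS`, `split_tables`, and `render_table` are hypothetical, the row labels only approximate the section titles in Encodings.md, and the columns are reduced to two for brevity. The enum values are the ones defined for `Encoding` in parquet.thrift.

```python
# Illustrative sketch only: ROWS, split_tables, and render_table are hypothetical names.
# Enum values follow the Encoding enum in parquet.thrift; row labels are approximate.

# Each row: (encoding type, [(enum name, value, enum_deprecated)], encoding_deprecated)
ROWS = [
    ("Plain", [("PLAIN", 0, False)], False),
    # PLAIN_DICTIONARY and RLE_DICTIONARY share one row; only the old enum is marked.
    ("Dictionary", [("PLAIN_DICTIONARY", 2, True), ("RLE_DICTIONARY", 8, False)], False),
    ("RLE / bit-packing hybrid", [("RLE", 3, False)], False),
    ("Bit-packed", [("BIT_PACKED", 4, False)], True),
    ("Delta", [("DELTA_BINARY_PACKED", 5, False)], False),
    ("Delta-length byte array", [("DELTA_LENGTH_BYTE_ARRAY", 6, False)], False),
    ("Delta strings", [("DELTA_BYTE_ARRAY", 7, False)], False),
    ("Byte stream split", [("BYTE_STREAM_SPLIT", 9, False)], False),
]


def split_tables(rows):
    """Return (supported, deprecated) rows, preserving the original order."""
    supported = [r for r in rows if not r[2]]
    deprecated = [r for r in rows if r[2]]
    return supported, deprecated


def render_table(rows):
    """Render rows as a two-column Markdown table."""
    lines = ["| Encoding type | Encoding enum |", "| --- | --- |"]
    for label, enums, _ in rows:
        cell = ", ".join(
            f"{name} = {value}" + (" (deprecated)" if dep else "")
            for name, value, dep in enums
        )
        lines.append(f"| {label} | {cell} |")
    return "\n".join(lines)


supported, deprecated = split_tables(ROWS)
print(render_table(supported))
print()
print(render_table(deprecated))
```

With this shape, the three layouts differ only in where the two rendered tables and the description sections are placed on the page, and the merged dictionary row carries the deprecation marker on the old enum alone.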

Also, for the Encoding Targets column, should I list only the physical types and remove the other encoding targets (e.g. repetition and definition levels)?

Author

I removed the v1 column and split the table into separate supported and deprecated encodings tables. If the deprecated encodings table is not needed, I will remove it.

Link to the rendered page: https://github.com/nkaki/parquet-format/blob/master/Encodings.md

…gs.md) (apache#550) - remove v1 related column, and separate tables for supported and deprecated encodings