-
Notifications
You must be signed in to change notification settings - Fork 467
MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550) #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nkaki -- this is a great start
Encodings.md
Outdated
|
|
||
| ### Supported Encodings | ||
|
|
||
| | Encoding type | Encoding enum | Encoding Targets <br> (Parquet 2.0.0+) | Encoding Targets <br> (Parquet 1.0.0+) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.
I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec
I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb
Thank you for the review!
I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.
I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec
I agree on focusing on current versions spec. At some point it would be great to make the parquet site able to see the previous versions easily. For the table I will remove the last column and rename the thrid one.
And just a question, would Data Page V2 (header?) would be a better term in this case?
I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?
For PLAIN_DICTIONARY and RLE_DICTIONARY, I will merge the rows and mark PLAIN_DICTIONARY enum as deprecated.
For BIT_PACKED, since the deprecated encodings are still explained in the document and it is linked by other encodings , I thought it should be in the table and linked to the details. I think there are few options.
- Remove BIT_PACKED encoding from the table (your suggestion)
- Remove BIT_PACKED encoding description from the page and from the table (this may break links).
- Seperate currently supported and deprecated encodings as seperate tables, and change the layout of the page.
- Layout A:
supported encodings table
deprecated encodings table (only BIT_PACKED)
supported + deprecated encodings descriptions (current order) - Layout B:
supported encodings table
supported encodings descriptions (current order with out BIT_PACKED)
deprecated encodings table (only BIT_PACKED)
deprecated encodings descriptions (only BIT_PACKED) - Layout C:
supported encodings table
deprecated encodings table (only BIT_PACKED)
supported encodings descriptions (current order with out BIT_PACKED)
deprecated encodings descriptions (only BIT_PACKED)
Also about Encoding Targets column should I just list the physical types? removing other encoding targets (e.g. Repetition and definition levels)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed v1 columns and seperated the table. If the deprecated encodings table is not needed I will remove it.
Link to the rendered page: https://github.com/nkaki/parquet-format/blob/master/Encodings.md
…gs.md) (apache#550) - remove v1 related column, and seperate tables for supported and deprecated encodings
Rationale for this change
To make it easier for readers to get an overall picture of Parquet encodings.
What changes are included in this PR?
Adds a summary table to Encodings.md that lists the encoding types (each linked to its description), enums and targets for different Parquet format versions.
See rendered format here: https://github.com/nkaki/parquet-format/blob/master/Encodings.md
Do these changes have PoC implementations?
No - Documentation change only
Closes #550