GH-46905: [C++][Parquet] Expose Statistics.is_{min/max}_value_exact and default set to true if min/max are set #46992
@wgtmac this is the PR I am working on that pairs with the new Parquet file in parquet-testing. I will update the parquet-testing submodule once that file is merged so the tests pass. I would appreciate early feedback in case I am missing something.
…o std::optional<bool>
…ue) instead of (max, min, is_min_exact_value, is_max_exact_value)
General LGTM. Also, the C++ code currently doesn't generate "truncated" min/max values, so I wonder whether files written by parquet-cpp can be regarded as "not truncated"?
Yes, currently C++ doesn't generate truncated min-max statistics, that's why I am defaulting
I am not sure I entirely understand what you mean in this part. Do you mean to update
cpp/src/parquet/reader_test.cc
@@ -1023,7 +1023,7 @@ Column 0
   Uncompressed Size: 103, Compressed Size: 104
 Column 1
   Values: 3, Null Values: 0, Distinct Values: 0
-  Max: 1, Min: 1
+  Max: 1, Min: 1, Is Max Value Exact: N/A, Is Min Value Exact: N/A
Perhaps we want to make the display terser? For example (but feel free to do better :)):
- Max: 1, Min: 1, Is Max Value Exact: N/A, Is Min Value Exact: N/A
+ Max (exact: unknown): 1, Min (exact: unknown): 1
I've tested the parquet-reader tool with the new file:
--- Rows: 12 ---
Column 0
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: false): Kf, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 250, Compressed Size: 250
Column 1
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: false): Kf, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 250, Compressed Size: 250
Column 2
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): 🚀Kevin Bacon, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 258, Compressed Size: 258
Column 3
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): ��, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 236, Compressed Size: 236
Column 4
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): Ke, Min (exact: true): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 210, Compressed Size: 210
Column 5
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): Ke, Min (exact: true): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 210, Compressed Size: 210
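For reference, the terse layout in the output above could be produced by a small helper along these lines. This is only a sketch: `ExactToString` and `FormatMinMax` are hypothetical names, not the actual ParquetFilePrinter code, and it assumes the flags are carried as `std::optional<bool>`, with an unset flag printed as "unknown".

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not the real printer code): renders an optional
// exactness flag as "true", "false", or "unknown" when the flag is unset.
std::string ExactToString(std::optional<bool> exact) {
  if (!exact.has_value()) return "unknown";
  return *exact ? "true" : "false";
}

// Builds one statistics line in the terse "Max (exact: ...): v" layout.
std::string FormatMinMax(const std::string& max, std::optional<bool> max_exact,
                         const std::string& min, std::optional<bool> min_exact) {
  std::ostringstream out;
  out << "Max (exact: " << ExactToString(max_exact) << "): " << max
      << ", Min (exact: " << ExactToString(min_exact) << "): " << min;
  return out.str();
}
```

Keeping the flag next to the value it describes makes it harder to misread which bound is inexact than with two trailing fields.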
Rationale for this change
The is_{min/max}_value_exact fields exist in the Thrift definition, and some implementations already use them when truncating min and max values. This PR exposes those fields and defaults them to true when writing files from C++, since no truncation happens there at the moment: if min/max statistics are generated, is_{min/max}_value_exact can be set to true.

Truncation of string and binary min/max values is out of scope for this PR and can be done in a follow-up.
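The defaulting rule described here can be sketched as follows. `EncodedStats` and `SetExactFlags` are hypothetical stand-ins for illustration, not the types or functions touched by this PR. The field names simply mirror the Thrift fields.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the statistics a writer emits for a column chunk.
struct EncodedStats {
  std::optional<std::string> min;
  std::optional<std::string> max;
  std::optional<bool> is_min_value_exact;
  std::optional<bool> is_max_value_exact;
};

// The writer never truncates, so any min/max it stores is the exact value.
// A flag stays unset when the corresponding statistic is absent.
void SetExactFlags(EncodedStats& stats) {
  if (stats.min.has_value()) stats.is_min_value_exact = true;
  if (stats.max.has_value()) stats.is_max_value_exact = true;
}
```

If truncation is added later, only the places that shorten a value would need to set the flag to false instead.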
What changes are included in this PR?
ParquetFilePrinter
Are these changes tested?
Yes, on CI.
Are there any user-facing changes?
Yes, the new fields will be available to users through the API when reading Parquet files.
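As an illustration of why a reader might care about these fields, here is a minimal sketch of how a consumer could treat the flag. `ColumnStats`, `ExactMin`, and `CanSkipForLessThan` are hypothetical names, not part of the Arrow API. It assumes an unset flag (a file that lacks the field) is treated conservatively as "not known to be exact".

```cpp
#include <optional>
#include <string>

// Hypothetical view of the min statistic a reader sees for one column chunk.
struct ColumnStats {
  std::string min;
  std::optional<bool> is_min_value_exact;  // unset when the file lacks the field
};

// Returns the column minimum only when the file states that it is exact,
// e.g. to answer MIN(col) from metadata without scanning the data.
std::optional<std::string> ExactMin(const ColumnStats& stats) {
  if (stats.is_min_value_exact.value_or(false)) return stats.min;
  return std::nullopt;
}

// A truncated min is still a valid lower bound, so a chunk can be skipped
// for the predicate `value < target` whenever that bound is >= target.
bool CanSkipForLessThan(const ColumnStats& stats, const std::string& target) {
  return stats.min >= target;
}
```

The point of the split is that pruning stays safe with inexact bounds, while returning the statistic as an actual value requires the exact flag.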