GH-46905: [C++][Parquet] Expose Statistics.is_{min/max}_value_exact and default set to true if min/max are set #46992
@wgtmac this is the PR I am working on that pairs with the new Parquet file in parquet-testing. I will update the parquet-testing submodule once the Parquet file is merged so that the tests pass. I would appreciate early feedback in case I am missing something.
…o std::optional<bool>
…ue) instead of (max, min, is_min_exact_value, is_max_exact_value)
Generally LGTM. Also, the C++ code currently doesn't generate "truncated" min/max values, so I wonder whether files written by parquet-cpp can be regarded as "not truncated"?
Yes, C++ currently doesn't generate truncated min/max statistics, that's why I am defaulting
I am not sure I entirely understand what you mean by this part. Do you mean to update
cpp/src/parquet/reader_test.cc
@@ -1023,7 +1023,7 @@ Column 0
   Uncompressed Size: 103, Compressed Size: 104
 Column 1
   Values: 3, Null Values: 0, Distinct Values: 0
-  Max: 1, Min: 1
+  Max: 1, Min: 1, Is Max Value Exact: N/A, Is Min Value Exact: N/A
Perhaps we want to make the display terser? For example (but feel free to do better :)):
-  Max: 1, Min: 1, Is Max Value Exact: N/A, Is Min Value Exact: N/A
+  Max (exact: unknown): 1, Min (exact: unknown): 1
I've tested the parquet-reader tool with the new file:
--- Rows: 12 ---
Column 0
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: false): Kf, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 250, Compressed Size: 250
Column 1
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: false): Kf, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 250, Compressed Size: 250
Column 2
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): 🚀Kevin Bacon, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 258, Compressed Size: 258
Column 3
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): ��, Min (exact: false): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 236, Compressed Size: 236
Column 4
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): Ke, Min (exact: true): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 210, Compressed Size: 210
Column 5
Values: 12, Null Values: 0, Distinct Values: 0
Max (exact: true): Ke, Min (exact: true): Al
Compression: UNCOMPRESSED, Encodings: PLAIN RLE
Uncompressed Size: 210, Compressed Size: 210
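The inexact bounds in the output above (for example "Kf" and "Al") are consistent with a writer that truncates long string statistics. The following is only a sketch of one common truncation scheme, not the code that produced the test file and not part of this PR: `Bound`, `TruncateMin`, and `TruncateMax` are hypothetical names, the byte limit is an assumed parameter, and UTF-8 validity of the incremented byte is ignored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: a statistics bound plus its exactness flag.
struct Bound {
  std::string value;
  bool is_exact;
};

// Lower bound: under bytewise ordering a prefix never sorts after the
// full string, so a truncated prefix is still a valid (inexact) minimum.
Bound TruncateMin(const std::string& min, std::size_t max_len) {
  if (min.size() <= max_len) {
    return {min, true};
  }
  return {min.substr(0, max_len), false};
}

// Upper bound: a bare prefix would sort before the full string, so the
// last byte that can be incremented is bumped by one to stay above it.
Bound TruncateMax(const std::string& max, std::size_t max_len) {
  if (max.size() <= max_len) {
    return {max, true};
  }
  std::string prefix = max.substr(0, max_len);
  // Trailing 0xFF bytes cannot be incremented, so drop them first.
  while (!prefix.empty() &&
         static_cast<unsigned char>(prefix.back()) == 0xFF) {
    prefix.pop_back();
  }
  if (prefix.empty()) {
    // No shorter upper bound exists; keep the full, exact value.
    return {max, true};
  }
  prefix.back() =
      static_cast<char>(static_cast<unsigned char>(prefix.back()) + 1);
  return {prefix, false};
}
```

With an assumed 2-byte limit, `TruncateMax("Kevin Bacon", 2)` yields "Kf" flagged as inexact, which resembles the `Max (exact: false): Kf` lines; whether the test file was written this way is an assumption.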
@@ -166,10 +166,20 @@ void ParquetFilePrinter::DebugPrint(std::ostream& stream, std::list<int> selecte
       stream << " Values: " << column_chunk->num_values();
       if (column_chunk->is_stats_set()) {
         std::string min = stats->min(), max = stats->max();
+        std::string max_exact =
+            stats->is_max_value_exact.has_value()
Is it overkill to print unknown? Could we simplify it as below:
std::string max_exact = stats->is_max_value_exact.value_or(false) ? "true" : "false";
To be honest, I prefer to print unknown. If I am debugging or investigating, I would rather see that the flag is unknown than see false, which would not really be correct and might lead to misinterpretation. That said, I am happy to do what the majority prefers :)
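The difference between the two printing options discussed in this thread can be sketched as follows. Both helper names are hypothetical and neither is taken from the PR; the only assumption is that the flag is available as a `std::optional<bool>`.

```cpp
#include <optional>
#include <string>

// Hypothetical helper: renders the flag as one of three states, so an
// unset flag is never confused with an explicit false.
std::string FormatExactFlag(const std::optional<bool>& is_exact) {
  if (!is_exact.has_value()) {
    return "unknown";
  }
  return *is_exact ? "true" : "false";
}

// The simplification suggested in review: an unset flag collapses
// into "false", so the two cases become indistinguishable in output.
std::string FormatExactFlagCollapsed(const std::optional<bool>& is_exact) {
  return is_exact.value_or(false) ? "true" : "false";
}
```

For an unset flag the first helper prints "unknown" and the second prints "false", which is the ambiguity the reply above is concerned about.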
             << R"("IsMaxValueExact": ")"
             << (stats->is_max_value_exact().value() ? "true" : "false") << "\"";
    } else {
      stream << ", "
Same question here.
cpp/src/parquet/statistics.cc
@@ -613,6 +615,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (has_min_max) {
       PlainDecode(encoded_min, &min_);
       PlainDecode(encoded_max, &max_);
+      statistics_.is_min_value_exact = true;
Given the comment above ("Create stats from a thrift Statistics object"), shouldn't we expand the function parameters to include these two fields?
@@ -259,6 +264,14 @@ class PARQUET_EXPORT Statistics {
   /// \brief Plain-encoded maximum value
   virtual std::string EncodeMax() const = 0;

+  /// \brief Return the minimum value exact flag if set.
+  /// It will be true if there was no truncation.
+  virtual std::optional<bool> is_min_value_exact() const = 0;
Should we consider simplifying it to virtual bool is_min_value_exact() const = 0; which returns false if the value is missing internally?
Similar to my comments on the printer, I'll let the other Parquet committers and PMC members discuss, but IMHO we should be explicit. The underlying Thrift fields are optional in the Parquet spec, so we should expose that optionality. I had a previous implementation with virtual bool is_min_value_exact() const = 0; but it also needed the corresponding virtual bool HasIsMinValueExact() const = 0;.
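The two API shapes compared in this thread can be sketched side by side. This is an illustration only: `RawStats`, `OptionalAccessor`, and `PairedAccessor` are hypothetical stand-ins, not the actual parquet::Statistics classes.

```cpp
#include <optional>

// Hypothetical stand-in for the optional Thrift field.
struct RawStats {
  std::optional<bool> is_min_value_exact;
};

// Shape A: one accessor that exposes the optionality directly.
class OptionalAccessor {
 public:
  explicit OptionalAccessor(RawStats raw) : raw_(raw) {}
  std::optional<bool> is_min_value_exact() const {
    return raw_.is_min_value_exact;
  }

 private:
  RawStats raw_;
};

// Shape B: a plain bool plus a separate presence check. Callers that
// skip HasIsMinValueExact() cannot tell "unset" apart from "false".
class PairedAccessor {
 public:
  explicit PairedAccessor(RawStats raw) : raw_(raw) {}
  bool HasIsMinValueExact() const {
    return raw_.is_min_value_exact.has_value();
  }
  bool is_min_value_exact() const {
    return raw_.is_min_value_exact.value_or(false);
  }

 private:
  RawStats raw_;
};
```

Shape A keeps the three states in a single return value, while shape B needs two calls to recover the same information.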
…ieve is_{min/max}_value_exact
The Python failures are unrelated; they are also failing on other PRs. I opened:
Rationale for this change
The is_{min/max}_value_exact fields exist in the Thrift definition, and some implementations already use them and truncate min and max values. This PR aims to expose those values and to default them to true when writing files from C++, as no truncation happens at the moment. If min/max statistics are generated, we can set is_{min/max}_value_exact to true.
Truncation for string and binary min/max is out of scope for this PR; we can do this in a follow-up PR.
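The write-side default described above can be sketched as follows. `EncodedStatsSketch` and `BuildWriterStats` are hypothetical names used only for illustration and do not reflect the actual Arrow C++ types; the sketch assumes a writer that never truncates.

```cpp
#include <optional>
#include <string>
#include <utility>

// Hypothetical mirror of the relevant optional statistics fields.
struct EncodedStatsSketch {
  std::optional<std::string> min;
  std::optional<std::string> max;
  std::optional<bool> is_min_value_exact;
  std::optional<bool> is_max_value_exact;
};

// A writer that never truncates can mark both bounds as exact whenever
// it emits min/max; without min/max the flags stay unset.
EncodedStatsSketch BuildWriterStats(
    const std::optional<std::pair<std::string, std::string>>& min_max) {
  EncodedStatsSketch out;
  if (min_max.has_value()) {
    out.min = min_max->first;
    out.max = min_max->second;
    out.is_min_value_exact = true;
    out.is_max_value_exact = true;
  }
  return out;
}
```

Leaving the flags unset when no min/max is written keeps the "unknown" state available to readers.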
What changes are included in this PR?
ParquetFilePrinter
Are these changes tested?
Yes on CI.
Are there any user-facing changes?
Yes, the new fields will be available for the users on the API when reading Parquet files.