GH-46905: [C++][Parquet] Expose Statistics.is_{min/max}_value_exact and default set to true if min/max are set #46992

Open
wants to merge 16 commits into
base: main

raulcd
Member

@raulcd raulcd commented Jul 4, 2025

Rationale for this change

The is_{min/max}_value_exact fields exist in the Thrift definition, and some implementations already use them when truncating min and max values. This PR exposes those fields and defaults them to true when writing files from C++, since no truncation happens there at the moment: whenever min/max statistics are generated, is_{min/max}_value_exact can be set to true.

Truncation of string and binary min/max values is out of scope for this PR; it can be done in a follow-up.
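
As an illustration of the defaulting rule described above, here is a minimal self-contained sketch. It does not use the Arrow classes; `ColumnStatsSketch` and `SetMinMax` are hypothetical names chosen for this example, and the real change may be structured differently. The idea is only that a writer which stores min/max without truncating them can mark both flags as true, while an unset flag means "unknown".

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for encoded column statistics; not the Arrow class.
struct ColumnStatsSketch {
  std::optional<std::string> min;
  std::optional<std::string> max;
  // Unset means "unknown", e.g. the writer never recorded the flag.
  std::optional<bool> is_min_value_exact;
  std::optional<bool> is_max_value_exact;
};

// Stores min/max exactly as computed (no truncation), so both flags become true.
inline void SetMinMax(ColumnStatsSketch& stats, std::string min, std::string max) {
  assert(min <= max);  // the caller is expected to pass an ordered pair
  stats.min = std::move(min);
  stats.max = std::move(max);
  stats.is_min_value_exact = true;
  stats.is_max_value_exact = true;
}
```

Keeping the flags as `std::optional<bool>` in this sketch preserves the difference between "not exact" and "not recorded", which matters for files produced by writers that never set the fields.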

What changes are included in this PR?

Are these changes tested?

Yes, on CI.

Are there any user-facing changes?

Yes, the new fields will be available to users through the API when reading Parquet files.

This comment was marked as outdated.

@raulcd raulcd changed the title GH-46905: [C++][Parquet] Expose Statistics.is_{min/max}_value_exact and default set to true if min/max are set GH-46905: [C++][Parquet] Expose Statistics.is_{min/max}_value_exact and default set to true if min/max are set Jul 4, 2025

github-actions bot commented Jul 4, 2025

⚠️ GitHub issue #46905 has been automatically assigned in GitHub to PR creator.

@raulcd
Member Author

raulcd commented Jul 7, 2025

@wgtmac this is the PR I am working on that pairs with the new Parquet file in parquet-testing. I will update the parquet-testing submodule once the Parquet file is merged so that the tests pass. I would appreciate early feedback in case I am missing something.

@github-actions github-actions bot added awaiting changes and removed awaiting committer review labels Jul 8, 2025
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Jul 8, 2025
…ue) instead of (max, min, is_min_exact_value, is_max_exact_value)
@mapleFU
Member

mapleFU commented Jul 8, 2025

General LGTM. Also, the C++ code currently doesn't generate "truncated" min/max values; I wonder whether a file written by parquet-cpp can be regarded as "not-truncated"?

@raulcd
Member Author

raulcd commented Jul 8, 2025

> General LGTM, also, currently C++ code doesn't generate "truncated" min-max,

Yes, C++ currently doesn't generate truncated min/max statistics. That is why I am defaulting is_{min/max}_value_exact=true whenever we set min/max.

> I wonder if parquet-cpp file can regard as "not-truncated"?

I am not sure I entirely understand this part. Do you mean updating ParquetFilePrinter to show something like not-truncated?

@@ -1023,7 +1023,7 @@ Column 0
 Uncompressed Size: 103, Compressed Size: 104
 Column 1
 Values: 3, Null Values: 0, Distinct Values: 0
-Max: 1, Min: 1
+Max: 1, Min: 1, Is Max Value Exact: N/A, Is Min Value Exact: N/A
Member

Perhaps we want to make the display terser? For example (but feel free to do better :)):

Suggested change
-Max: 1, Min: 1, Is Max Value Exact: N/A, Is Min Value Exact: N/A
+Max (exact: unknown): 1, Min (exact: unknown): 1

Member Author

I've tested the parquet-reader tool with the new file:

--- Rows: 12 ---
Column 0
  Values: 12, Null Values: 0, Distinct Values: 0
  Max (exact: false): Kf, Min (exact: false): Al
  Compression: UNCOMPRESSED, Encodings: PLAIN RLE 
  Uncompressed Size: 250, Compressed Size: 250
Column 1
  Values: 12, Null Values: 0, Distinct Values: 0
  Max (exact: false): Kf, Min (exact: false): Al
  Compression: UNCOMPRESSED, Encodings: PLAIN RLE 
  Uncompressed Size: 250, Compressed Size: 250
Column 2
  Values: 12, Null Values: 0, Distinct Values: 0
  Max (exact: true): 🚀Kevin Bacon, Min (exact: false): Al
  Compression: UNCOMPRESSED, Encodings: PLAIN RLE 
  Uncompressed Size: 258, Compressed Size: 258
Column 3
  Values: 12, Null Values: 0, Distinct Values: 0
  Max (exact: true): ��, Min (exact: false): Al
  Compression: UNCOMPRESSED, Encodings: PLAIN RLE 
  Uncompressed Size: 236, Compressed Size: 236
Column 4
  Values: 12, Null Values: 0, Distinct Values: 0
  Max (exact: true): Ke, Min (exact: true): Al
  Compression: UNCOMPRESSED, Encodings: PLAIN RLE 
  Uncompressed Size: 210, Compressed Size: 210
Column 5
  Values: 12, Null Values: 0, Distinct Values: 0
  Max (exact: true): Ke, Min (exact: true): Al
  Compression: UNCOMPRESSED, Encodings: PLAIN RLE 
  Uncompressed Size: 210, Compressed Size: 210
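
For reference, the terse display format shown above could be produced along the following lines. This is only a sketch under the assumption that each exactness flag is an optional boolean, with an unset flag printed as "unknown"; `ExactLabel` and `FormatMinMax` are hypothetical helpers for this example, not functions from ParquetFilePrinter.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Renders an optional flag as "true" or "false", or "unknown" when it is unset.
inline std::string ExactLabel(const std::optional<bool>& exact) {
  if (!exact.has_value()) return "unknown";
  return *exact ? "true" : "false";
}

// Hypothetical helper that builds one statistics line in the terse format,
// e.g. "Max (exact: true): Ke, Min (exact: true): Al".
inline std::string FormatMinMax(const std::string& max, std::optional<bool> max_exact,
                                const std::string& min, std::optional<bool> min_exact) {
  std::ostringstream out;
  out << "Max (exact: " << ExactLabel(max_exact) << "): " << max
      << ", Min (exact: " << ExactLabel(min_exact) << "): " << min;
  return out.str();
}
```

With both flags unset, this yields the "exact: unknown" form from the suggested change; with the flags set, it yields lines like those in the parquet-reader output above.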

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Jul 9, 2025
@github-actions github-actions bot added awaiting change review awaiting changes and removed awaiting changes awaiting change review labels Jul 9, 2025
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Jul 9, 2025
@raulcd raulcd marked this pull request as ready for review July 9, 2025 09:44
@raulcd
Member Author

raulcd commented Jul 21, 2025

@wgtmac @mapleFU can you take a look while Antoine is on holidays? Otherwise I'll just wait for him to be back :)

4 participants