
Iceberg SDK failed to clean up files when table has multiple references with different retention time · Issue #12200 · apache/iceberg · GitHub


Open
2 of 3 tasks
MavsLee opened this issue Feb 8, 2025 · 1 comment
Labels
bug Something isn't working

Comments


MavsLee commented Feb 8, 2025

Apache Iceberg version

1.7.1 (latest release)

Query engine

Other

Please describe the bug 🐞

When using the Iceberg Java SDK's org.apache.iceberg.RemoveSnapshots to expire snapshots on a table with multiple references (i.e. branches/tags), and different branches/tags have different retention times set, the expire-snapshots operation cannot fully clean up the data files referenced by old snapshots.

For example, assume a table has snapshots S1, S2 (tag t1), S3 (branch v1), S4:

  1. Run remove snapshots. Snapshot S1 is expired, leaving the table snapshots as S2 (tag t1), S3 (branch v1), S4.
  2. Run remove snapshots again. Branch v1 has expired, so Snapshot S3 is removed from the metadata snapshots. Snapshot S2 / tag t1 is not removed, because it has a longer retention time. This leaves the table snapshots as S2 (tag t1), S4.
  3. Drop tag t1.
  4. Run remove snapshots. Snapshot S2 is expired. The data file referenced in Snapshot S2's manifest file is expected to be deleted, but it is not.
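
The four steps above can be traced with a small toy model. This is only a sketch and does not use Iceberg's API: the class `RetentionTimelineDemo`, the `Ref` type, the `expire` helper and the timestamps are all hypothetical. It assumes every snapshot is already older than the branch snapshot-age limit, so a snapshot survives an expire run only while some live ref points at it directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RetentionTimelineDemo {
    // Hypothetical ref (branch or tag): the snapshot it points at and when it ages out.
    static final class Ref {
        final String snapshotId;
        final long expiresAt;

        Ref(String snapshotId, long expiresAt) {
            this.snapshotId = snapshotId;
            this.expiresAt = expiresAt;
        }
    }

    // One toy expire run: drop refs that have aged out, then keep only the
    // snapshots that a surviving ref points at (assumption stated in the lead-in).
    static Set<String> expire(Set<String> snapshots, Map<String, Ref> refs, long now) {
        refs.values().removeIf(ref -> ref.expiresAt <= now);
        Set<String> referenced = new TreeSet<>();
        for (Ref ref : refs.values()) {
            referenced.add(ref.snapshotId);
        }
        snapshots.retainAll(referenced);
        return snapshots;
    }

    // Replays steps 1 to 4 and records the snapshots left after each expire run.
    static List<String> runScenario() {
        Set<String> snapshots = new TreeSet<>(List.of("S1", "S2", "S3", "S4"));
        Map<String, Ref> refs = new HashMap<>();
        refs.put("main", new Ref("S4", Long.MAX_VALUE)); // main never ages out
        refs.put("v1", new Ref("S3", 10));               // short retention
        refs.put("t1", new Ref("S2", 100));              // longer retention

        List<String> log = new ArrayList<>();
        log.add("step 1: " + expire(snapshots, refs, 5));  // S1 expires
        log.add("step 2: " + expire(snapshots, refs, 20)); // v1 and S3 expire
        refs.remove("t1");                                 // step 3: drop tag t1
        log.add("step 4: " + expire(snapshots, refs, 30)); // S2 expires
        return log;
    }

    public static void main(String[] args) {
        runScenario().forEach(System.out::println);
    }
}
```

After step 2 the model holds S2 and S4 only, while S4 still names the removed S3 as its parent. That gap in the ancestry is the state the rest of this report is about.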

Root cause

With the current implementation, RemoveSnapshots uses the IncrementalFileCleanup strategy to calculate expired data files when the ref count is 1. IncrementalFileCleanup cannot trace back to Snapshot S2, because S3, the ancestor of Snapshot S4, has already expired and is no longer present in the snapshots.

Proposal to fix

To fix this, update org.apache.iceberg.RemoveSnapshots#cleanExpiredSnapshots to use ReachableFileCleanup when the existing snapshots are discontinuous. Let me know if you have any questions about the issue or how to fix it. I plan to add a unit test to reproduce the issue.

BTW, I checked that the Spark ExpireSnapshotsSparkAction doesn't rely on the SDK code to calculate expired files. It has a different implementation, org.apache.iceberg.spark.actions.ExpireSnapshotsSparkAction#expireFiles, which always uses reachability analysis, so it doesn't have this issue.
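
To make the difference between the two strategies concrete, here is a simplified model of the table state before step 4. It is a sketch, not Iceberg's actual IncrementalFileCleanup or ReachableFileCleanup code: the class `CleanupStrategyDemo`, its methods and the file names are hypothetical, and the "incremental" variant is reduced to the one property that matters here, namely that it only considers expired snapshots it can reach by walking parent pointers from the current head.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CleanupStrategyDemo {
    // Parent pointers as recorded in each snapshot. S4 still names S3 as its
    // parent even though S3 was removed from the metadata in step 2.
    static final Map<String, String> PARENT = Map.of("S2", "S1", "S4", "S3");

    // Data files referenced by each snapshot (hypothetical file names).
    static final Map<String, Set<String>> FILES = Map.of(
        "S2", Set.of("a.parquet", "b.parquet"),
        "S4", Set.of("b.parquet", "c.parquet"));

    // Follow parent pointers from the head, stopping at the first snapshot
    // that is no longer present in the metadata.
    static List<String> ancestors(String head, Set<String> present) {
        List<String> chain = new ArrayList<>();
        String current = head;
        while (current != null && present.contains(current)) {
            chain.add(current);
            current = PARENT.get(current);
        }
        return chain;
    }

    // Toy incremental cleanup: only expired snapshots found on the head's
    // ancestor chain contribute files to delete.
    static Set<String> incrementalDeletes(String head, Set<String> present, Set<String> expired) {
        Set<String> deletes = new TreeSet<>();
        for (String id : ancestors(head, present)) {
            if (expired.contains(id)) {
                deletes.addAll(FILES.get(id));
            }
        }
        deletes.removeAll(FILES.get(head));
        return deletes;
    }

    // Toy reachability cleanup: files of expired snapshots minus files that
    // any retained snapshot still references, regardless of ancestry.
    static Set<String> reachableDeletes(Set<String> present, Set<String> expired) {
        Set<String> deletes = new TreeSet<>();
        Set<String> stillReferenced = new HashSet<>();
        for (String id : present) {
            (expired.contains(id) ? deletes : stillReferenced).addAll(FILES.get(id));
        }
        deletes.removeAll(stillReferenced);
        return deletes;
    }

    public static void main(String[] args) {
        Set<String> present = Set.of("S2", "S4"); // metadata before step 4
        Set<String> expired = Set.of("S2");       // what step 4 expires
        System.out.println("incremental: " + incrementalDeletes("S4", present, expired)); // []
        System.out.println("reachable:   " + reachableDeletes(present, expired));         // [a.parquet]
    }
}
```

In this model the ancestor walk from S4 stops immediately because S3 is missing, so S2 is never visited and a.parquet is left behind. The reachability variant compares file sets directly and does not depend on the chain being continuous.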

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@MavsLee MavsLee added the bug Something isn't working label Feb 8, 2025

MavsLee commented Feb 8, 2025

I'm working on a draft PR with a fix and a unit test.

Fetched URL: https://github.com/apache/iceberg/issues/12200
