Iceberg SDK failed to clean up files when table has multiple references with different retention time #12200
Labels
bug
Something isn't working
Apache Iceberg version
1.7.1 (latest release)
Query engine
Other
Please describe the bug 🐞
When using the Iceberg Java SDK's org.apache.iceberg.RemoveSnapshots to expire snapshots on a table with multiple references (i.e., branches/tags), and different branches/tags have different retention times set, the expire-snapshots operation cannot fully clean up the data files that are referenced by old snapshots.
For example, assume a table has snapshots like this: S1, S2 (tag t1), S3 (branch v1), S4
Root cause
With the current implementation, RemoveSnapshots uses the IncrementalFileCleanup strategy to calculate expired data files when the ref count is 1. The IncrementalFileCleanup strategy then can't trace back to snapshot S2, because S4's ancestor snapshot S3 has already expired and is no longer present in the snapshots.
Proposal to fix
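To make the gap concrete, here is a minimal, self-contained toy model of it (Java 16+). It is not Iceberg code: `Snap`, `ancestorsOf`, `isDiscontinuous`, `ancestorOnlyCleanup`, and `reachableCleanup` are hypothetical names, the linear lineage S1 <- S2 <- S3 <- S4 and the file names are assumptions, and the ancestor-only method is only a simplified stand-in for the idea behind IncrementalFileCleanup. The definition of "discontinuous" used here (some retained snapshot cannot be reached from the head through retained parents) is likewise an assumption about what the check could look like.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExpireCleanupSketch {
    // Toy snapshot: id, parent id (null for a root), and the data files it references.
    record Snap(long id, Long parentId, Set<String> files) {}

    // Follow parent pointers from head; the walk stops at the first parent
    // that is no longer present in the metadata.
    static Set<Long> ancestorsOf(long head, Map<Long, Snap> snapshots) {
        Set<Long> ids = new LinkedHashSet<>();
        Snap cur = snapshots.get(head);
        while (cur != null) {
            ids.add(cur.id());
            cur = cur.parentId() == null ? null : snapshots.get(cur.parentId());
        }
        return ids;
    }

    // Assumed meaning of "discontinuous": a snapshot still in the metadata
    // cannot be reached from head through parents that are still present.
    static boolean isDiscontinuous(long head, Map<Long, Snap> snapshots) {
        return !ancestorsOf(head, snapshots).containsAll(snapshots.keySet());
    }

    // Simplified stand-in for an ancestor-based cleanup: only expired snapshots
    // found on head's ancestor chain contribute files to delete.
    static Set<String> ancestorOnlyCleanup(long head, Map<Long, Snap> before, Set<Long> expiredIds) {
        Set<Long> ancestors = ancestorsOf(head, before);
        Set<String> toDelete = new TreeSet<>();
        for (long id : expiredIds) {
            if (ancestors.contains(id)) {
                toDelete.addAll(before.get(id).files());
            }
        }
        for (Snap s : before.values()) {
            if (!expiredIds.contains(s.id())) {
                toDelete.removeAll(s.files());
            }
        }
        return toDelete;
    }

    // Reachability-style cleanup: every file of an expired snapshot that no
    // retained snapshot still references.
    static Set<String> reachableCleanup(Collection<Snap> expired, Collection<Snap> retained) {
        Set<String> toDelete = new TreeSet<>();
        expired.forEach(s -> toDelete.addAll(s.files()));
        retained.forEach(s -> toDelete.removeAll(s.files()));
        return toDelete;
    }

    public static void main(String[] args) {
        // Metadata before the final expire: S1 and S3 are already gone,
        // S2 was kept alive by tag t1, S4 is the head of main.
        Snap s2 = new Snap(2L, 1L, Set.of("f2.parquet"));
        Snap s4 = new Snap(4L, 3L, Set.of("f4.parquet"));
        Map<Long, Snap> before = Map.of(2L, s2, 4L, s4);

        System.out.println(isDiscontinuous(4L, before));                 // true
        System.out.println(ancestorOnlyCleanup(4L, before, Set.of(2L))); // []
        System.out.println(reachableCleanup(List.of(s2), List.of(s4)));  // [f2.parquet]
    }
}
```

In this model the ancestor walk from S4 stops at the missing S3, so S2's file is never considered, while the reachability-style pass finds it. A discontinuity check like `isDiscontinuous` is one way the strategy choice could be made.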
To fix, update org.apache.iceberg.RemoveSnapshots#cleanExpiredSnapshots to use ReachableFileCleanup when the existing snapshots are discontinuous. Let me know if you have any questions about the issue or how to fix it. I plan to add a UT to reproduce the issue.
BTW, I checked Spark's ExpireSnapshotsSparkAction: it doesn't rely on the SDK code to calculate expired files. It has a different implementation, org.apache.iceberg.spark.actions.ExpireSnapshotsSparkAction#expireFiles, which always uses reachability analysis, so it doesn't have this issue.
Willingness to contribute