[lake/iceberg] Iceberg encoding strategy #1350


Merged

merged 15 commits into apache:main on Jul 22, 2025

Conversation

MehulBatra
Contributor

@MehulBatra MehulBatra commented Jul 16, 2025

Purpose

Implement Iceberg encoding strategy for Fluss.

Linked Issue: #1341

Brief change log

  • Add DataLakeFormat.ICEBERG enum value
  • Implement IcebergKeyEncoder using the binary writer for key encoding
  • Implement IcebergBinaryRowWriter to support encoding

Tests

Added unit tests verifying that the encoding matches between Fluss and Iceberg.

API and Format

None

Documentation

  • New feature: Iceberg lake format support for key encoding and bucketing

@MehulBatra MehulBatra marked this pull request as draft July 16, 2025 20:55
@MehulBatra MehulBatra marked this pull request as ready for review July 17, 2025 21:25
Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra Thanks for the PR. I left minor comments. PTAL


@MehulBatra
Contributor Author

Currently our CI runs on Java 8. Since I'm using Iceberg 1.9.1 in my code, which only supports Java 11, it is failing the CI check. I'm trying to downgrade the version to 1.4.3 to support Java 8; my assumption is that it will pass after that.
Stacktrace:
Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project fluss-common: Compilation failure
Error: /home/fluss/actions-runner/_work/fluss/fluss/fluss-common/src/test/java/com/alibaba/fluss/row/encode/iceberg/IcebergKeyEncoderTest.java:[30,32] cannot access org.apache.iceberg.types.Conversions
Error: bad class file: /home/fluss/.m2/repository/org/apache/iceberg/iceberg-api/1.9.1/iceberg-api-1.9.1.jar(org/apache/iceberg/types/Conversions.class)
Error: class file has wrong version 55.0, should be 52.0
Error: Please remove or make sure it appears in the correct subdirectory of the classpath.
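For context: class file major version 55 is Java 11 bytecode and 52 is Java 8, which is why the Java 8 build rejects the Iceberg 1.9.1 jar. The downgrade described here would look roughly like this in the module's pom.xml (the exact coordinates and scope are assumptions based on this thread):

```xml
<!-- Pin a Java 8-compatible Iceberg release: 1.9.x ships Java 11
     (class file 55.0) bytecode, while the 1.4.x line still targets Java 8. -->
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-api</artifactId>
    <version>1.4.3</version>
    <scope>test</scope>
</dependency>
```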

@luoyuxia
Contributor

Hi @luoyuxia, I took inspiration from https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Conversions.java for the encoding and https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java for the types.

Thanks for the explanation. I think it makes sense to be inspired by Iceberg's encoding. But keep in mind we use it for a different purpose:

  • Iceberg uses it to store the binary values of field statistics, right?
  • We use it to distribute records. In Fluss's implementation, it first encodes the bucket columns, and then uses the encoded value to do the bucket partition transform, which should align with Iceberg. But Iceberg uses the literal value to do the bucket partition transform. So it means we don't really care about how to encode the bucket columns into a binary value. We only need to get the literal value back from the binary value (this is called decoding the binary value). That means we can use any encoding. But it's fine for me to follow how Iceberg encodes the values.
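As a concrete illustration of the encode/decode direction discussed here, this is a hedged sketch of Iceberg-style single-value binary serialization for primitive keys (fixed-width numbers little-endian, strings UTF-8, per the Iceberg spec's single-value serialization). The class and method names are illustrative, not the actual Conversions API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SingleValueCodecSketch {

    // Encode primitives the way Iceberg's single-value serialization does:
    // fixed-width numbers are written little-endian.
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(v).array();
    }

    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    // Strings are serialized as their UTF-8 bytes.
    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // The decode direction described above: recover the literal value from
    // the binary value before applying the bucket transform.
    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeLong(34L);
        System.out.println(decodeLong(encoded)); // round-trips back to 34
    }
}
```

The key point the comment makes is that the encoding itself is interchangeable: as long as the literal value can be recovered, any byte layout works for Fluss's purposes.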

@luoyuxia
Contributor

Currently our CI runs on Java 8. Since I'm using Iceberg 1.9.1 in my code, which only supports Java 11, it is failing the CI check. I'm trying to downgrade the version to 1.4.3 to support Java 8; my assumption is that it will pass after that. Stacktrace: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project fluss-common: Compilation failure Error: /home/fluss/actions-runner/_work/fluss/fluss/fluss-common/src/test/java/com/alibaba/fluss/row/encode/iceberg/IcebergKeyEncoderTest.java:[30,32] cannot access org.apache.iceberg.types.Conversions Error: bad class file: /home/fluss/.m2/repository/org/apache/iceberg/iceberg-api/1.9.1/iceberg-api-1.9.1.jar(org/apache/iceberg/types/Conversions.class) Error: class file has wrong version 55.0, should be 52.0 Error: Please remove or make sure it appears in the correct subdirectory of the classpath.

If that's the case, we can downgrade the version to 1.4.3 first, and leave a TODO for when #1195 is resolved.

Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra Thanks for the PR. Left minor comments.

@MehulBatra
Contributor Author

Hi @luoyuxia, I took inspiration from https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Conversions.java for the encoding and https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java for the types.

Thanks for the explanation. I think it makes sense to be inspired by Iceberg's encoding. But keep in mind we use it for a different purpose:

  • Iceberg uses it to store the binary values of field statistics, right?
  • We use it to distribute records. In Fluss's implementation, it first encodes the bucket columns, and then uses the encoded value to do the bucket partition transform, which should align with Iceberg. But Iceberg uses the literal value to do the bucket partition transform. So it means we don't really care about how to encode the bucket columns into a binary value. We only need to get the literal value back from the binary value (this is called decoding the binary value). That means we can use any encoding. But it's fine for me to follow how Iceberg encodes the values.

-> Agreed, Iceberg does use these to store partition stats and min/max values for filtering.
-> The end goal is to encode the values in an Iceberg-compatible format. Can you add more on this?
We only need to get the literal value back from the binary value (this is called decoding the binary value). That means we can use any encoding. But it's fine for me to follow how Iceberg encodes the values.

@MehulBatra MehulBatra force-pushed the key_encoding_iceberg branch from 4f7644c to 179f1db on July 21, 2025 19:26
@luoyuxia
Contributor

luoyuxia commented Jul 22, 2025

-> The end goal is to encode the values in an Iceberg-compatible format. Can you add more on this?

We don't need to encode the values in an Iceberg-compatible format. We just need to make sure the bucket partition transform aligns with Iceberg. Iceberg buckets via the literal value, without an encoding process, which is different from Paimon. Fluss needs to encode because we need to align with the process by which Fluss writes rows, which will:

  • step 1: first encode the row into bytes
  • step 2: hash (distribute) the bytes into a bucket

So for Iceberg, although we encode it, we still need to decode the bytes back into the literal value (an int value, long value, etc.) and use the Iceberg bucket strategy. See BucketInteger and BucketLong in the Iceberg repo.
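The decode-then-bucket step described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Fluss or Iceberg code: the Iceberg spec defines the bucket transform as a 32-bit Murmur3 hash (x86 variant, seed 0) of the value, where int and long keys are hashed as the 8-byte little-endian encoding of the value widened to long, followed by (hash & Integer.MAX_VALUE) % N. The class and method names here are my own:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IcebergBucketSketch {

    // Murmur3 x86 32-bit hash with seed 0, for inputs whose length is a
    // multiple of 4 bytes (enough for int/long keys; no tail handling).
    static int murmur3x86_32(byte[] data) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = 0; // seed 0, as the Iceberg spec requires
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int k1 = buf.getInt();
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }
        // Finalization (fmix32).
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Iceberg hashes int and long keys identically: as the 8-byte
    // little-endian encoding of the value widened to long.
    static int hashLong(long value) {
        byte[] bytes =
                ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
        return murmur3x86_32(bytes);
    }

    // bucket(N, v) = (hash(v) & Integer.MAX_VALUE) % N
    static int bucket(long value, int numBuckets) {
        return (hashLong(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // The Iceberg spec's appendix lists 2017239379 as the hash of 34.
        System.out.println(hashLong(34L));
        System.out.println(bucket(34L, 16));
    }
}
```

Iceberg's own Bucket transform classes (the BucketInteger/BucketLong mentioned above) do the equivalent; the point is that only the literal value feeds the hash, so Fluss's binary encoding must be decoded back to the literal before bucketing.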

Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra Thanks for the update. Only minor comments left; should be merged in the next iteration.

pom.xml Outdated
@@ -98,6 +98,7 @@
 <curator.version>5.4.0</curator.version>
 <netty.version>4.1.104</netty.version>
 <arrow.version>15.0.0</arrow.version>
+<iceberg.version>1.4.3</iceberg.version>
Contributor


Remove this; we don't need it in the root pom.xml.

Contributor Author

@MehulBatra MehulBatra Jul 22, 2025


    <!-- paimon bundle, only for test purpose -->
    <dependency>
        <groupId>org.apache.paimon</groupId>
        <artifactId>paimon-bundle</artifactId>
        <version>${paimon.version}</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-api</artifactId>
        <version>${iceberg.version}</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Paimon and other packages also use the root pom.xml for versions; I tried to match that and did the same for iceberg-api in the fluss-common pom.xml.

Contributor


Sorry, I missed that the test needs it. I'd like to retract this comment.

@luoyuxia luoyuxia force-pushed the key_encoding_iceberg branch from 14d9e40 to 9aaa307 on July 22, 2025 03:19
Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra I appended some commits to refine the pom file. LGTM

@luoyuxia luoyuxia merged commit b492c29 into apache:main Jul 22, 2025
3 checks passed
@MehulBatra
Contributor Author

Thank you @luoyuxia for the review and feedback!
