[lake/iceberg] Iceberg encoding strategy #1350


Merged

merged 15 commits into apache:main on Jul 22, 2025

Conversation

MehulBatra
Contributor

@MehulBatra MehulBatra commented Jul 16, 2025

Purpose

Implement Iceberg encoding strategy for Fluss.

Linked Issue: #1341

Brief change log

  • Add DataLakeFormat.ICEBERG enum value
  • Implement IcebergKeyEncoder using the binary writer for key encoding
  • Implement IcebergBinaryRowWriter to support encoding

Tests

Added unit tests verifying that the encoding matches between Fluss and Iceberg.

API and Format

None

Documentation

  • New feature: Iceberg lake format support for key encoding and bucketing

@MehulBatra MehulBatra marked this pull request as draft July 16, 2025 20:55
@MehulBatra MehulBatra marked this pull request as ready for review July 17, 2025 21:25
Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra Thanks for the PR. I left minor comments. PTAL


@MehulBatra
Contributor Author

Currently our CI runs on Java 8. Since I'm using Iceberg 1.9.1 in my code, which only supports Java 11, it is failing the CI check. I'm trying to downgrade the version to 1.4.3 to support Java 8; my assumption is that it will pass after that.
Stacktrace:
Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project fluss-common: Compilation failure
Error: /home/fluss/actions-runner/_work/fluss/fluss/fluss-common/src/test/java/com/alibaba/fluss/row/encode/iceberg/IcebergKeyEncoderTest.java:[30,32] cannot access org.apache.iceberg.types.Conversions
Error: bad class file: /home/fluss/.m2/repository/org/apache/iceberg/iceberg-api/1.9.1/iceberg-api-1.9.1.jar(org/apache/iceberg/types/Conversions.class)
Error: class file has wrong version 55.0, should be 52.0
Error: Please remove or make sure it appears in the correct subdirectory of the classpath.
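For context: class file major version 55 is Java 11 bytecode and 52 is Java 8, which is why the Java 8 build rejects the Iceberg 1.9.1 jar. The downgrade described here would look roughly like this in the module's pom.xml (the exact coordinates and scope are assumptions based on this thread):

```xml
<!-- Pin a Java 8-compatible Iceberg release: 1.9.x ships Java 11
     (class file 55.0) bytecode, while the 1.4.x line still targets Java 8. -->
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-api</artifactId>
    <version>1.4.3</version>
    <scope>test</scope>
</dependency>
```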

@luoyuxia
Contributor

Hi @luoyuxia, I took inspiration from https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Conversions.java for the encoding and https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java for the types.

Thanks for the explanation. I think it makes sense to be inspired by Iceberg's encoding. But keep in mind we use it for a different purpose:

  • Iceberg uses it to store the binary values of field statistics, right?
  • We use it to distribute records. In Fluss's implementation, it first encodes the bucket columns, and then uses the encoded value to do the bucket partition transform, which should align with Iceberg. But Iceberg uses the literal value to do the bucket partition transform. So it means we don't really care about how to encode the bucket columns into a binary value. We only need to get the literal value back from the binary value (this is called decoding the binary value). That means we can use any encoding. But it's fine for me to follow how Iceberg encodes the values.
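As a concrete illustration of the encode/decode direction discussed here, this is a hedged sketch of Iceberg-style single-value binary serialization for primitive keys (fixed-width numbers little-endian, strings UTF-8, per the Iceberg spec's single-value serialization). The class and method names are illustrative, not the actual Conversions API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SingleValueCodecSketch {

    // Encode primitives the way Iceberg's single-value serialization does:
    // fixed-width numbers are written little-endian.
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(v).array();
    }

    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    // Strings are serialized as their UTF-8 bytes.
    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // The decode direction described above: recover the literal value from
    // the binary value before applying the bucket transform.
    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeLong(34L);
        System.out.println(decodeLong(encoded)); // round-trips back to 34
    }
}
```

The key point the comment makes is that the encoding itself is interchangeable: as long as the literal value can be recovered, any byte layout works for Fluss's purposes.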

@luoyuxia
Contributor

Currently our CI runs on Java 8. Since I'm using Iceberg 1.9.1 in my code, which only supports Java 11, it is failing the CI check. I'm trying to downgrade the version to 1.4.3 to support Java 8; my assumption is that it will pass after that. Stacktrace: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project fluss-common: Compilation failure Error: /home/fluss/actions-runner/_work/fluss/fluss/fluss-common/src/test/java/com/alibaba/fluss/row/encode/iceberg/IcebergKeyEncoderTest.java:[30,32] cannot access org.apache.iceberg.types.Conversions Error: bad class file: /home/fluss/.m2/repository/org/apache/iceberg/iceberg-api/1.9.1/iceberg-api-1.9.1.jar(org/apache/iceberg/types/Conversions.class) Error: class file has wrong version 55.0, should be 52.0 Error: Please remove or make sure it appears in the correct subdirectory of the classpath.

If that's the case, we can downgrade the version to 1.4.3 first, and leave a TODO for when #1195 is resolved.

Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra Thanks for the PR. Left minor comments.

@MehulBatra
Contributor Author

Hi @luoyuxia, I took inspiration from https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Conversions.java for the encoding and https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java for the types.

Thanks for the explanation. I think it makes sense to be inspired by Iceberg's encoding. But keep in mind we use it for a different purpose:

  • Iceberg uses it to store the binary values of field statistics, right?
  • We use it to distribute records. In Fluss's implementation, it first encodes the bucket columns, and then uses the encoded value to do the bucket partition transform, which should align with Iceberg. But Iceberg uses the literal value to do the bucket partition transform. So it means we don't really care about how to encode the bucket columns into a binary value. We only need to get the literal value back from the binary value (this is called decoding the binary value). That means we can use any encoding. But it's fine for me to follow how Iceberg encodes the values.

-> Agreed, Iceberg does use these to store partition stats and min/max values for filtering.
-> The end goal is to encode the values in an Iceberg-compatible format. Can you add more on this?
We only need to get the literal value back from the binary value (this is called decoding the binary value). That means we can use any encoding. But it's fine for me to follow how Iceberg encodes the values.

@MehulBatra MehulBatra force-pushed the key_encoding_iceberg branch from 4f7644c to 179f1db on July 21, 2025 19:26
@luoyuxia
Contributor

luoyuxia commented Jul 22, 2025

-> The end goal is to encode the values in an Iceberg-compatible format. Can you add more on this?

We don't need to encode the values in an Iceberg-compatible format. We just need to make sure the bucket partition transform aligns with Iceberg. Iceberg buckets via the literal value, without an encoding process, which is different from Paimon. Fluss needs to encode because we need to align with the process by which Fluss writes rows, which will:

  • step 1: first encode the row into bytes
  • step 2: hash (distribute) the bytes into a bucket

So for Iceberg, although we encode it, we still need to decode the bytes back into the literal value (an int value, long value, etc.) and use the Iceberg bucket strategy. See BucketInteger and BucketLong in the Iceberg repo.
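The decode-then-bucket step described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Fluss or Iceberg code: the Iceberg spec defines the bucket transform as a 32-bit Murmur3 hash (x86 variant, seed 0) of the value, where int and long keys are hashed as the 8-byte little-endian encoding of the value widened to long, followed by (hash & Integer.MAX_VALUE) % N. The class and method names here are my own:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IcebergBucketSketch {

    // Murmur3 x86 32-bit hash with seed 0, for inputs whose length is a
    // multiple of 4 bytes (enough for int/long keys; no tail handling).
    static int murmur3x86_32(byte[] data) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = 0; // seed 0, as the Iceberg spec requires
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int k1 = buf.getInt();
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }
        // Finalization (fmix32).
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Iceberg hashes int and long keys identically: as the 8-byte
    // little-endian encoding of the value widened to long.
    static int hashLong(long value) {
        byte[] bytes =
                ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
        return murmur3x86_32(bytes);
    }

    // bucket(N, v) = (hash(v) & Integer.MAX_VALUE) % N
    static int bucket(long value, int numBuckets) {
        return (hashLong(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // The Iceberg spec's appendix lists 2017239379 as the hash of 34.
        System.out.println(hashLong(34L));
        System.out.println(bucket(34L, 16));
    }
}
```

Iceberg's own Bucket transform classes (the BucketInteger/BucketLong mentioned above) do the equivalent; the point is that only the literal value feeds the hash, so Fluss's binary encoding must be decoded back to the literal before bucketing.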

Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra Thanks for the update. Only minor comments left; should be merged in the next iteration.

pom.xml Outdated
@@ -98,6 +98,7 @@
 <curator.version>5.4.0</curator.version>
 <netty.version>4.1.104</netty.version>
 <arrow.version>15.0.0</arrow.version>
+<iceberg.version>1.4.3</iceberg.version>
Contributor


Remove this; we don't need it in the root pom.xml.

Contributor Author

@MehulBatra MehulBatra Jul 22, 2025


    <!-- paimon bundle, only for test purpose -->
    <dependency>
        <groupId>org.apache.paimon</groupId>
        <artifactId>paimon-bundle</artifactId>
        <version>${paimon.version}</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-api</artifactId>
        <version>${iceberg.version}</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Paimon and other packages also use the root pom.xml for versions; I tried to match that and did the same for iceberg-api in the fluss-common pom.xml.

Contributor


Sorry, I missed that the test needs it. I'd like to retract this comment.

@luoyuxia luoyuxia force-pushed the key_encoding_iceberg branch from 14d9e40 to 9aaa307 on July 22, 2025 03:19
Contributor

@luoyuxia luoyuxia left a comment


@MehulBatra I appended some commits to refine the pom file. LGTM

@luoyuxia luoyuxia merged commit b492c29 into apache:main Jul 22, 2025
3 checks passed
@MehulBatra
Contributor Author

Thank you @luoyuxia for the review and feedback!
