Flink: Prevent setting endTag/endSnapshotId for streaming source #10207

Merged · 5 commits into apache:main · Apr 26, 2024

Conversation

pvary (Contributor) commented Apr 23, 2024:

Tried to hack around issues with setting endSnapshotId for a streaming job. It turns out the code silently overwrites the values set in the context.

Since it doesn't make sense to set an end for the streaming case, I propose adding validation to prevent these cases.
The Preconditions are copied from StreamingMonitorFunction, used by the old FlinkSource implementation.

github-actions bot added the flink label Apr 23, 2024
      Preconditions.checkArgument(
          snapshotId == null, "Cannot set snapshot-id option for streaming reader");
      Preconditions.checkArgument(
          asOfTimestamp == null, "Cannot set as-of-timestamp option for streaming reader");
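As a minimal illustration of how Guava-style Preconditions checks like these fire (this is a standalone sketch, not Iceberg code — the helper below imitates Guava's Preconditions.checkArgument):

```java
public class PreconditionsSketch {
  // Imitates com.google.common.base.Preconditions.checkArgument:
  // throws IllegalArgumentException when the expression is false.
  static void checkArgument(boolean expression, String message) {
    if (!expression) {
      throw new IllegalArgumentException(message);
    }
  }

  // Mirrors the streaming-mode validation quoted above: a streaming read
  // configured with snapshot-id or as-of-timestamp fails fast.
  static void validateStreaming(Long snapshotId, Long asOfTimestamp) {
    checkArgument(snapshotId == null, "Cannot set snapshot-id option for streaming reader");
    checkArgument(
        asOfTimestamp == null, "Cannot set as-of-timestamp option for streaming reader");
  }

  public static void main(String[] args) {
    validateStreaming(null, null); // valid streaming config: no exception
    try {
      validateStreaming(42L, null); // invalid: snapshot-id set for streaming
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The point of failing fast here is that the misconfiguration surfaces at job construction time instead of being silently overwritten later.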
stevenzwu (Contributor):
Where was the end snapshot (id or tag) silently overridden? I thought it might be a valid scenario. E.g., during backfill, a job might run in streaming mode and discover splits snapshot by snapshot, with an end snapshot.

pvary (Contributor, author):
Here is the code:

      Snapshot toSnapshotInclusive =
          toSnapshotInclusive(
              lastConsumedSnapshotId, currentSnapshot, scanContext.maxPlanningSnapshotCount());
      IcebergEnumeratorPosition newPosition =
          IcebergEnumeratorPosition.of(
              toSnapshotInclusive.snapshotId(), toSnapshotInclusive.timestampMillis());
      ScanContext incrementalScan =
          scanContext.copyWithAppendsBetween(
              lastPosition.snapshotId(), toSnapshotInclusive.snapshotId());

The toSnapshotInclusive helper reads up to the current snapshot. Its only role is to prevent reading more snapshots than maxPlanningSnapshotCount, and we set its result as the last snapshot for the scan.
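A hypothetical sketch of this capping behavior (names are simplified stand-ins for the enumerator code, not the real Iceberg API): each discovery cycle advances at most maxPlanningSnapshotCount snapshots, so the "end" snapshot of one incremental scan is an internal planning detail, not a user-facing bounded end position.

```java
import java.util.List;

public class PlanningCapSketch {
  // Pick the inclusive end snapshot for one incremental scan: walk the
  // snapshots that appeared since the last discovery, but stop after the cap.
  static long toSnapshotIdInclusive(
      List<Long> snapshotIdsAfterLastConsumed, int maxPlanningSnapshotCount) {
    int steps = Math.min(snapshotIdsAfterLastConsumed.size(), maxPlanningSnapshotCount);
    return snapshotIdsAfterLastConsumed.get(steps - 1);
  }

  public static void main(String[] args) {
    // Five new snapshots appeared since the last discovery, cap is 3:
    List<Long> newSnapshots = List.of(101L, 102L, 103L, 104L, 105L);
    System.out.println(toSnapshotIdInclusive(newSnapshots, 3)); // 103
    // Fewer new snapshots than the cap: read up to the current snapshot.
    System.out.println(toSnapshotIdInclusive(newSnapshots, 10)); // 105
  }
}
```

Because the cap is re-applied every cycle, the stream eventually catches up to the table head; nothing in this path honors a user-supplied endSnapshotId.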

stevenzwu (Contributor):

This is to set the end snapshot per incremental scan/discovery. The source doesn't check/support a user-facing end snapshot (like the Kafka source's bounded end position).

pvary (Contributor, author):

Yeah, I got it. This is why we need the new validations to prevent wrong parameterization of the source.

pvary force-pushed the prevent_end_on_stream branch from 76cdc62 to c2815f6 on April 24, 2024 11:14
…d add timeout to the previously failing test
@@ -130,7 +131,9 @@ private ScanContext(
     this.watermarkColumn = watermarkColumn;
     this.watermarkColumnTimeUnit = watermarkColumnTimeUnit;
 
-    validate();
+    if (!skipValidate) {
stevenzwu (Contributor):

is this only for the endSnapshotId? if yes, should we remove the validation on endSnapshotId?

pvary (Contributor, author):

The TestStreamScanSql test was stuck because we set snapshotId too.
So this is for both snapshotId and endSnapshotId.

On the philosophical level, the question is:

  • Do we want to make sure that the created ScanContext is always valid?

If yes, then we change copyWithAppendsBetween and copyWithSnapshotId to remove the streaming flag, as those are no longer streaming scans.

If no, then we accept that programmatically created ScanContext objects don't need validation.

I opted for the first solution, as I'm not sure where we depend on the streaming settings, and it was the least disruptive.

stevenzwu (Contributor) commented Apr 25, 2024:

> If yes, then we change copyWithAppendsBetween and copyWithSnapshotId to remove the streaming flag, as those are no longer streaming scans.

That is also not correct, because a copy should mean a copy. The main problem is that ScanContext is used for both user intention (via the source builder) and internal incremental scans. I agree that the internal incremental scan shouldn't have the streaming setting.

Note that ScanContext is not a public class; users can't construct this object directly. Maybe the validate() method shouldn't be called by the constructor, and only by the ScanContext#Builder#build() method? Or move some more intrinsic validation to the IcebergSource builder?

pvary (Contributor, author):

Went with your suggestion: removed the generic validation and called the validation explicitly from the source.
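A hypothetical sketch of the agreed resolution, assuming simplified names (the ScanContext and buildValidated below are stand-ins, not the real Iceberg classes): the constructor no longer validates, so internal incremental copies stay legal, while the user-facing entry point calls validate() explicitly.

```java
public class ValidationSketch {
  static class ScanContext {
    final boolean streaming;
    final Long endSnapshotId;

    ScanContext(boolean streaming, Long endSnapshotId) {
      this.streaming = streaming;
      this.endSnapshotId = endSnapshotId;
      // No validate() call here: internal incremental copies
      // (e.g. copyWithAppendsBetween) may carry streaming=true
      // together with an end snapshot for one planning cycle.
    }

    void validate() {
      if (streaming && endSnapshotId != null) {
        throw new IllegalArgumentException(
            "Cannot set end-snapshot-id option for streaming reader");
      }
    }
  }

  // Stand-in for the source builder: validates the user's intent once,
  // at construction time, instead of on every internal copy.
  static ScanContext buildValidated(boolean streaming, Long endSnapshotId) {
    ScanContext ctx = new ScanContext(streaming, endSnapshotId);
    ctx.validate();
    return ctx;
  }
}
```

This keeps "copy means copy" intact: only the path that reflects user configuration pays the validation cost.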

pvary force-pushed the prevent_end_on_stream branch from cef9372 to 789191a on April 26, 2024 11:16
pvary merged commit 646440a into apache:main Apr 26, 2024
13 checks passed
pvary deleted the prevent_end_on_stream branch April 26, 2024 17:19
pvary (Contributor, author) commented Apr 26, 2024:
Merged to main.
Thanks for the review @stevenzwu!

pvary pushed a commit to pvary/iceberg that referenced this pull request Apr 27, 2024
stevenzwu pushed a commit that referenced this pull request Apr 27, 2024
Co-authored-by: Peter Vary <peter_vary4@apple.com>
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
Co-authored-by: Peter Vary <peter_vary4@apple.com>
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Co-authored-by: Peter Vary <peter_vary4@apple.com>
2 participants
Fetched URL: https://github.com/apache/iceberg/pull/10207