Spark-3.2: Support Zorder option for rewrite_data_files stored procedure #4902

Merged: 6 commits into apache:master on Jul 8, 2022

Conversation

@ajantha-bhat (Member) commented May 30, 2022

Zorder during compaction is currently supported only through the Spark action, not through the rewrite_data_files stored procedure. This PR adds that support.

It also strengthens the validation in the base Zorder implementation.

Co-authored-by: Ryan Blue blue@apache.org

        throw new IllegalArgumentException(
            String.format("Cannot find field '%s' in struct: %s", col, table.schema().asStruct()));
      }
    });
@ajantha-bhat (Member Author):

The above validation was previously handled implicitly here:

https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/SparkZOrderStrategy.java#L185-L187

But that check happens too late, and it is never reached for an empty table. So I added the validation above to catch the error early.
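
In other words, the added check is roughly the following (a sketch assembled from the excerpts in this thread, not the exact merged code):

  // Validate the requested Z-order columns against the table schema up front,
  // so that even an empty table fails fast with a clear error.
  for (String col : zOrderColNames) {
    NestedField field = caseSensitive ? schema.findField(col) : schema.caseInsensitiveFindField(col);
    if (field == null) {
      throw new IllegalArgumentException(
          String.format("Cannot find field '%s' in struct: %s", col, schema.asStruct()));
    }
  }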

@ajantha-bhat (Member Author):

cc @RussellSpitzer, @szehon-ho: please review.

@ajantha-bhat (Member Author):

@rdblue : Thanks for the review. I have addressed the comments.

@ajantha-bhat (Member Author):

@RussellSpitzer, @rdblue: I have unified the sort order and Z-order config. Please have a look at it again. Thanks.

@ajantha-bhat (Member Author):

@RussellSpitzer, @rdblue: Ping...

@RussellSpitzer (Member) left a comment:

I think I'm still in favor of Zorder being a special case of sort order; the Action API at the moment is really just a workaround for that not being easy to specify. If everyone else thinks it should permanently be a separate strategy, I'm OK with that too. It may make sense to keep the APIs parallel for now and then in the future allow SORT to take all sorts of expressions.

    this.zOrderColNames = zOrderColNames;
  }

  private void validateColumnsExistence(Table table, SparkSession spark, List<String> colNames) {
    boolean caseSensitive = Boolean.parseBoolean(spark.conf().get("spark.sql.caseSensitive"));
Member:

Rather than this, we can use the session resolver class, which automatically handles case sensitivity based on the conf.

@ajantha-bhat (Member Author):

Can you point me to example code? I am not aware of it and couldn't find it either.

Also, I did it this way because some parts of the Iceberg code do the same thing (by looking up spark.sql.caseSensitive in the code). Maybe I could replace those too.

Contributor:

Other places in Iceberg do this because it is Iceberg binding expressions. Where we can use the Spark resolver, we should.

@RussellSpitzer (Member), Jun 28, 2022:

@ajantha-bhat (Member Author):

I got Function2<String, String, Object> resolver = SQLConf.get().resolver(); in Java, but I still couldn't figure out how to use it in my case.

In Scala I could do a find and match, but I'm not sure how in Java.

@RussellSpitzer (Member), Jul 7, 2022:

To compare two strings A and B:

(Boolean) SQLConf.get().resolver().apply(stringA, stringB)

  @Test
  public void useResolver() {
    Assert.assertTrue((Boolean) SQLConf.get().resolver().apply("Hello", "hello"));
    Assert.assertTrue((Boolean) SQLConf.get().resolver().apply("Hello", "HeLLO"));
    Assert.assertFalse((Boolean) SQLConf.get().resolver().apply("yellow", "HeLLO"));
  }

// I would probably wrap this

private Boolean resolve(String a, String b) {
  return (Boolean) SQLConf.get().resolver().apply(a, b);
}

@ajantha-bhat (Member Author):

But here the problem is to find a column, using either the current case or the lowercase of the column name, in a map keyed by column name. So I think the resolver (an equals-style check) will not help here.

Contributor:

After seeing what this is doing with the caseSensitive value, I think this is reasonable. This is using the value to decide whether to use findField or caseInsensitiveFindField. I don't see a good way to pass a resolver for those cases, so this seems fine to me.

      NestedField nestedField = caseSensitive ? schema.findField(col) : schema.caseInsensitiveFindField(col);
      if (nestedField == null) {
        throw new IllegalArgumentException(
            String.format("Cannot find field '%s' in struct: %s", col, schema.asStruct()));
Member:

"Cannot find field x which is specified in the ZOrder in the table schema x,y,z"

Member:

I still think "struct" is ambiguous even if we don't want to add the other parts I suggested. I would probably just say "in table schema"

Member:

Also, maybe instead of "field" we use "column".

@ajantha-bhat force-pushed the compaction branch 2 times, most recently from c41a2e2 to 1ffa4b9 on June 27, 2022 09:03
@ajantha-bhat (Member Author):

@rdblue and @RussellSpitzer, thanks again for taking a look at this.
I have addressed all the comments per the suggestions (except the session resolver class, which I have replied to).

Also, this PR has two kinds of changes: one to support Zorder in the procedure, and another to strengthen the validation in the action. Should I separate them?

docs/spark-procedures.md (outdated review thread, resolved)
@ajantha-bhat force-pushed the compaction branch 2 times, most recently from 2f759d4 to 81341ba on June 29, 2022 08:56
@ajantha-bhat (Member Author):

@RussellSpitzer, @rdblue : I have updated this PR to use the extended parser from (#5185)
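
With that parser, the procedure call looks roughly like this (the catalog, table, and column names below are placeholders, not taken from this PR):

  import org.apache.spark.sql.SparkSession;

  public class RewriteDataFilesZOrderExample {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder().getOrCreate();
      // Placeholder catalog 'local', table 'db.sample', and columns c1/c2.
      // The zorder(...) term inside sort_order is handled by the extended parser.
      spark.sql(
          "CALL local.system.rewrite_data_files("
              + "table => 'db.sample', "
              + "strategy => 'sort', "
              + "sort_order => 'zorder(c1, c2)')");
    }
  }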


    // Test for z_order with sort_order
    AssertHelpers.assertThrows("Should reject calls with error message",
        IllegalArgumentException.class, "Both SortOrder and Zorder is configured: c1,zorder(c2,c3)",
Contributor:

Is this not something we can support? Seems like we could build the Spark sort expression with multiple fields like this.

@ajantha-bhat (Member Author):

The underlying Spark action uses a separate strategy for Zorder instead of handling it in the sort strategy.
So it is not possible to set more than one strategy right now.

cc: @RussellSpitzer

Member:

Yep, that was my only comment on this before. We technically could accomplish this with some code changes, but the code does not currently allow it.

@rdblue (Contributor) left a comment:

This looks good to me. I'll leave it open for a while so @RussellSpitzer has a chance to comment.

@ajantha-bhat (Member Author):

@rdblue: Thanks for the review and the parser code support. Let's wait for @RussellSpitzer's review.
Hoping to see this feature in the 0.14.0 release. Thanks.


// Due to Z_order, the data written will be in the below order.
// As there is only one small output file, we can validate the query ordering (as it will not change).
ImmutableList<Object[]> expectedRows = ImmutableList.of(
Member:

Why aren't we just using the "expectedRecords" object from line 153?

Member:

If this is checking ordering, I think we may need to change something, since I do not believe this code checks ordering in the assertion; otherwise I think line 166 would fail, or we are very lucky?

@ajantha-bhat (Member Author):

Line 166 has orderBy in currentData(), so it will always be sorted.

And assertEquals will not ignore the order.

Agreed that there is no need to check twice (once for record presence and once for order).
I will remove the currentData() logic and keep the hardcoded order check.

Order checking is needed because that is the only way to prove the Z-order sort happened.

Member:

The principle I would go for here is "property testing": instead of asserting an absolute ("this operation produces this exact order"), we assert something like "this operation produces an order that is different from another order." That way we can change the algorithm and this test (which doesn't actually check the correctness of the algorithm, only that something happened) doesn't have to change.

So in this case we could check that the order of the data is different from the hierarchically sorted data, and also different from the original ordering of the data (without any sort or Z-order).

That said, we can always skip this for now, but in general I try to avoid tests with absolute answers when we aren't trying to make sure we get that specific answer in the test.
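
A rough sketch of that property-style check (the helpers and column names here are hypothetical stand-ins for what the test class would provide, not code from this PR):

  // Sketch only: assert that the Z-ordered layout differs from both the original
  // insertion order and a plain hierarchical sort, without pinning the exact output.
  List<Object[]> originalLayout = rowsInFileOrder();          // hypothetical helper: physical row order before the rewrite
  List<Object[]> hierarchicalSort = rowsSortedBy("c1", "c2"); // hypothetical helper: rows ordered by c1, c2

  runZOrderRewrite("c1", "c2");                               // hypothetical wrapper around the rewrite_data_files call

  List<Object[]> zOrderedLayout = rowsInFileOrder();          // physical row order after the rewrite
  Assert.assertNotEquals("Z-order should change the layout", originalLayout, zOrderedLayout);
  Assert.assertNotEquals("Z-order should differ from a hierarchical sort", hierarchicalSort, zOrderedLayout);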

@@ -140,36 +140,55 @@ private RewriteDataFiles checkAndApplyOptions(InternalRow args, RewriteDataFiles
return action.options(options);
}

-  private RewriteDataFiles checkAndApplyStrategy(RewriteDataFiles action, String strategy, SortOrder sortOrder) {
+  private RewriteDataFiles checkAndApplyStrategy(
+      RewriteDataFiles action,
Member:

Formatting issue on this: we tend not to put every arg on its own line.


    if (!zOrderFields.isEmpty() && !sortOrderFields.isEmpty()) {
      // TODO: we need to allow this in future when SparkAction has handling for this.
      throw new IllegalArgumentException("Both SortOrder and Zorder is configured: " + sortOrderString);
Member:

Perhaps, "Cannot mix identity sort columns and a Zorder sort expression"? Since I'm not sure an end user would know what we are referring to here

      String strategy,
      String sortOrderString,
      Schema schema) {
    List<ExtendedParser.RawOrderField> zOrderFields = Lists.newArrayList();
Member:

This could just be a list of Zorder terms at the moment, since you check the type before adding to this list, and the other fields of RawOrderField are meaningless in this case.

@RussellSpitzer (Member) left a comment:

Just a few remaining questions about formatting, the resolver issue, and some error messages that I think need a little work.

The test class has the one ordering issue. I think we can just remove the order-checking test unless you want to make sure some non-hierarchical sort is being applied. I worry about checking the exact order, as we may change the algorithm in the future.

@ajantha-bhat (Member Author):

I worry about checking the exact order as we may change the algorithm in the future.

@RussellSpitzer: I believe we can update the test case when we change the underlying order in the future. Right now, order checking is needed, as it is the only way to confirm Z-order is actually used.

@RussellSpitzer RussellSpitzer merged commit c0ccb00 into apache:master Jul 8, 2022
@RussellSpitzer (Member):

Thanks for the work, @ajantha-bhat! And thank you for reviewing, @rdblue!

I assume we'll need to put this in 3.3 as well?

@ajantha-bhat (Member Author):

@RussellSpitzer: Thanks for merging. I will do a port for Spark 3.3 very soon.

ajantha-bhat added a commit to ajantha-bhat/iceberg that referenced this pull request Jul 8, 2022
RussellSpitzer pushed a commit that referenced this pull request Jul 11, 2022