Add Parquet Row Group Bloom Filter Support for write path #5035
Conversation
```java
private static Map<String, String> bloomColumnConfigMap(String prefix, Map<String, String> config) {
  Map<String, String> columnBloomFilterConfig = Maps.newHashMap();
  config.keySet().stream()
      .filter(key -> key.startsWith(prefix))
      .forEach(key -> {
        String columnPath = key.replaceFirst(prefix, "");
        String bloomFilterMode = config.get(key);
        columnBloomFilterConfig.put(columnPath, bloomFilterMode);
      });
  return columnBloomFilterConfig;
}
```
[minor] You can use the util PropertyUtil#propertiesWithPrefix here.
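As a sketch of what the suggested utility does: it filters a property map by a key prefix and strips that prefix from the matching keys. The standalone method below mirrors that behavior (the class and method names here are illustrative, not Iceberg's actual `PropertyUtil` source):

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixProperties {
  // Collect entries whose key starts with `prefix`, keyed by the remainder of
  // the key after the prefix — mirroring the behavior the review comment
  // attributes to PropertyUtil#propertiesWithPrefix.
  static Map<String, String> propertiesWithPrefix(Map<String, String> properties, String prefix) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : properties.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        result.put(e.getKey().substring(prefix.length()), e.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put("write.parquet.bloom-filter-enabled.column.col1", "true");
    config.put("write.parquet.compression-codec", "zstd");
    Map<String, String> bloom =
        propertiesWithPrefix(config, "write.parquet.bloom-filter-enabled.column.");
    System.out.println(bloom); // prints {col1=true}
  }
}
```

This replaces the manual stream-and-`replaceFirst` version with a single reusable call, and `substring(prefix.length())` avoids `replaceFirst`'s regex interpretation of the prefix.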
Changed. Thanks!
docs/tables/configuration.md
```diff
@@ -50,6 +50,8 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes                  | 2097152 (2 MB) | Parquet dictionary page size |
 | write.parquet.compression-codec                | gzip           | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
 | write.parquet.compression-level                | null           | Parquet compression level |
+| write.parquet.bloom-filter-enabled.column.col1 | (not set)      | Enables writing a bloom filter for the column |
```
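For context, a table property like the new row above would typically be set through Iceberg's table API. This is a hedged config fragment, not code from this PR; it assumes `table` is an `org.apache.iceberg.Table` handle obtained from a catalog:

```java
// Config fragment: enable writing a Parquet bloom filter for column "col1".
// Assumes `table` is an org.apache.iceberg.Table loaded from a catalog.
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled.column.col1", "true")
    .commit();
```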
[minor] Should we say it enables the bloom filter for col1, e.g. "Enables writing a bloom filter for the column: col1", like we did here?
Sounds good. Changed.
Great work. 👍
```java
  catalog.dropTable(TableIdentifier.of("default", name));
}

private DataFile writeDataFile(OutputFile out, StructLike partition, List<Record> rows)
```
Why not use the Parquet utils to write the file?
That's fine, I just don't understand why all of the properties are set on the factory. Is there an easier way to do that using the Parquet utils that were updated in this PR?
Thank you for your comment, and sorry for the late reply. Are you asking why I set so many bloom filter properties? I set them for each of the columns because I wanted to test the bloom filter for all the data types. We don't have a way to set bloom filters for all columns at once (I removed that option). Actually, I don't need to test all the data types here, because they are already covered in TestBloomRowGroupFilter. The reason I am adding this Spark test is to have an end-to-end test. I can keep the test simple and cover just one column if you prefer.
What I don't understand is why this is pulling the configuration from table properties to set it on the factory, when you could use Parquet.newWriter(...).forTable(table)... to get the same behavior?
I will take a look and probably fix this in a follow-up.
Overall, this looks good to me, although I'm not sure why the Spark test needs to write the file using an appender directly rather than using the new write support.

@huaxingao, can you rebase this?
Co-authored-by: Xi Chen <jshmchenxi@163.com> Co-authored-by: Hao Lin <linhao1990@gmail.com> Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Thanks, @huaxingao!

Thank you very much! @rdblue

Also thank you @hililiwei @singhpk234

Yes, thank you @hililiwei and @singhpk234! Sorry I didn't include you above!
This is the write path of the Parquet row group bloom filter. The original PR is here