
Add Parquet Row Group Bloom Filter Support by huaxingao · Pull Request #4831 · apache/iceberg

Add Parquet Row Group Bloom Filter Support #4831

Closed
huaxingao wants to merge 8 commits from the bf branch

Conversation

huaxingao
Contributor

Co-Authored-By: Xi Chen jshmchenxi@163.com
Co-Authored-By: Hao Lin linhao@qiyi.com
Co-Authored-By: Huaxin Gao huaxin_gao@apple.com

Currently, Iceberg has ParquetMetricsRowGroupFilter and ParquetDictionaryRowGroupFilter. This PR adds one more filter, ParquetBloomRowGroupFilter, which uses the Parquet row-group bloom filter to determine whether a row group needs to be read.
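Roughly, the check works like the sketch below, using the parquet-mr bloom filter API. The class, method, and variable names are illustrative only, not the PR's actual implementation, and only a single equality predicate on a long column is shown.

```java
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.BloomFilterReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class BloomRowGroupCheck {
  // Returns false only when the bloom filter proves the value is absent from the
  // row group; true means the row group must still be read.
  static boolean shouldRead(BloomFilterReader bloomReader, ColumnChunkMetaData column, long value) {
    BloomFilter bloom = bloomReader.readBloomFilter(column);
    if (bloom == null) {
      return true; // no bloom filter was written for this column chunk
    }
    long hash = bloom.hash(value); // hash with the filter's own hash function (XXH64)
    return bloom.findHash(hash);   // false => definitely not present, skip the row group
  }
}
```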

huaxingao and others added 3 commits May 20, 2022 18:50
Co-authored-by: Xi Chen <jshmchenxi@163.com>
Co-authored-by: Hao Lin <linhao@qiyi.com>
Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: Xi Chen <jshmchenxi@163.com>
Co-authored-by: Hao Lin <linhao1990@gmail.com>
Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
@jshmchenxi
Contributor

@huaxingao Thanks for continuing this work!

Contributor
@hililiwei hililiwei left a comment

Great work. Left some comments.

core/src/main/java/org/apache/iceberg/TableProperties.java (outdated, resolved)
.set(TableProperties.PARQUET_VECTORIZATION_ENABLED, "true")
.set(TableProperties.PARQUET_BATCH_SIZE, "4")
.commit();
}
Contributor

I don't quite understand this. Can we inject these properties directly when the table is created instead of changing it after it is created?

table = catalog.createTable(TableIdentifier.of("default", name), schema);

Contributor Author

These properties can be set either at table creation or updated later; I don't think it matters. I picked the second option because, when I wrote the test, I happened to use TestSparkReaderDeletes as my template and followed the style there.

Contributor

+1 to following existing style

api/src/main/java/org/apache/iceberg/data/Record.java (outdated, resolved)
@huaxingao
Contributor Author

cc @aokolnychyi @RussellSpitzer @flyrain @szehon-ho
This PR is ready for review. Thank you very much in advance!

@huaxingao
Contributor Author

also cc @kbendick @chenjunjiedada @rdblue

public static final boolean DEFAULT_PARQUET_BLOOM_FILTER_ENABLED_DEFAULT = false;

public static final String PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX = "write.parquet.bloom-filter-enabled.column.";
public static final String PARQUET_BLOOM_FILTER_COLUMN_EXPECTED_NDV_PREFIX =
    "write.parquet.bloom-filter-expected-ndv.column.";
Collaborator

It would be better to document that the NDV is specific to a Parquet file.

Contributor Author

Thanks for your comment. I documented this in configuration.md.

@github-actions github-actions bot added the docs label May 25, 2022
@@ -167,6 +167,16 @@ private TableProperties() {
"write.delete.parquet.row-group-check-max-record-count";
public static final int PARQUET_ROW_GROUP_CHECK_MAX_RECORD_COUNT_DEFAULT = 10000;

public static final String DEFAULT_PARQUET_BLOOM_FILTER_ENABLED = "write.parquet.bloom-filter-enabled.default";
Contributor

does it ever make sense to enable bloom filter for all columns? should we only allow bloom filter for explicitly specified columns?

Contributor Author

It is not common to enable bloom filters for all columns, but it's legal. This is consistent with the parquet-mr bloom filter implementation.

Contributor

I agree with Steven. I think it only makes sense to enable bloom filters for some columns.

Contributor

+1. While it's consistent with the parquet-mr bloom filter implementation, we need to think of user experience first and foremost.

It doesn't make sense to enable bloom filters for a lot of columns. And many users don't do any tuning of their metadata / statistics.

I think it's in line with other things we do to make the user experience better, like turning off column-level statistics after a certain number of columns. We can point it out in the docs under a big !!!NOTE (that's highlighted) that bloom filters are only used when turned on.

It's really an advanced thing to use at all imo.

Contributor Author

Sounds good. I have removed DEFAULT_PARQUET_BLOOM_FILTER_ENABLED. Now users need to enable the bloom filter for individual columns using PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX. If the column is a complex type, the field inside the complex type needs to be enabled, for example:
set(PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX + "struct_col.int_field", "true")
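For reference, a minimal sketch of setting those properties through the Iceberg API; the Table handle and the column names (`id`, `struct_col.int_field`) are invented for illustration:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

class EnableBloomFilters {
  // Enable bloom filters for a top-level column and for a field nested inside a struct.
  static void enable(Table table) {
    table.updateProperties()
        .set(TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX + "id", "true")
        .set(TableProperties.PARQUET_BLOOM_FILTER_COLUMN_ENABLED_PREFIX + "struct_col.int_field", "true")
        .commit();
  }
}
```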

| Property | Default | Description |
| -------- | ------- | ----------- |
| write.parquet.bloom-filter-enabled.default | false | Whether to enable writing bloom filter for all columns |
| write.parquet.bloom-filter-enabled.column.col1 | (not set) | Whether to enable writing bloom filter for column 'col1' to allow per-column configuration; This property overrides `bloom-filter-enabled.default` for the specified column; For example, setting both `write.parquet.bloom-filter-enabled.default=true` and `write.parquet.bloom-filter-enabled.column.some_col=false` will enable bloom filter for all columns except `some_col` |
| write.parquet.bloom-filter-expected-ndv.column.col1 | (not set) | The expected number of distinct values in a column, it is used to compute the optimal size of the bloom filter; Note that the NDV is specific for a parquet file. If this property is not set, the bloom filter will use the maximum size set in `bloom-filter-max-bytes`; If this property is set for a column, then no need to enable the bloom filter with `write.parquet.bloom-filter-enabled` property |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset |
Contributor

What is the behavior of this? If the NDV requires a size that is too large, does it skip writing the bloom filter?

Contributor Author

If the NDV requires a size that is too large, parquet still writes the bloom filter using the max bytes set by this property, not using the bitset calculated by NDV.

Contributor

I guess there probably isn't much we can do about this, although that behavior makes no sense to me. Is it possible to set the expected false positive probability anywhere? Or is that hard-coded in the Parquet library?

Contributor Author

There isn't a property to set fpp in Parquet.

Contributor

What fpp is used by Parquet?

Contributor Author

Parquet uses 0.01 for fpp.
I chatted offline with @chenjunjiedada; we can probably add a config for fpp in Parquet first, and then in Iceberg.
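For context, the classic bloom filter sizing formula gives a feel for how NDV and fpp interact with the 1 MB `bloom-filter-max-bytes` cap. This is a back-of-the-envelope sketch, not the exact sizing code in parquet-mr's split-block implementation:

```java
class BloomFilterSizing {
  // Classic sizing: m = -n * ln(p) / (ln 2)^2 bits for n distinct values at false
  // positive probability p. parquet-mr's split-block filter uses its own variant and
  // then caps the result at write.parquet.bloom-filter-max-bytes.
  static long approxBloomFilterBytes(long ndv, double fpp) {
    double bits = -ndv * Math.log(fpp) / (Math.log(2) * Math.log(2));
    return (long) Math.ceil(bits / 8);
  }

  public static void main(String[] args) {
    // 1,000,000 NDV at fpp = 0.01 needs roughly 1.2 MB, which already exceeds the
    // 1,048,576-byte default cap, so the filter would be written at the capped size.
    System.out.println(approxBloomFilterBytes(1_000_000L, 0.01));
  }
}
```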

parquetWriteBuilder.withBloomFilterNDV(colPath, Long.valueOf(numDistinctValue));
}

return new ParquetWriteAdapter<>(
Contributor

@aokolnychyi, what do you think about removing the old ParquetWriteAdapter code? I don't think that anyone uses it anymore.

"parquet",
RANDOM.nextBoolean(),
WRITE_DISTRIBUTION_MODE_HASH,
true
Contributor

Why not add bloom filter testing to one of the existing cases? I don't see a reason why this can't be tested along with other cases. And Spark tests already take a long time. We should avoid adding more cases.

Contributor Author

I want to have some bloom filter write/read tests that involve copy-on-write or merge-on-read. I think this is the only place that tests copy-on-write or merge-on-read (I may be wrong). That's why I added a parameter here to turn on the bloom filter for testing.

Contributor

I think it's fine to have some tests with the row level operations. What I was pointing out is that we can add bloom filters to the existing tests. There's no need to execute the entire suite an additional time.

Contributor Author

I see. Thanks for the clarification. Instead of executing the entire suite, is it OK to add a new (but small) test suite to test bloom filter for copy on write or merge on read? I didn't find any existing copy on write/merge on read tests with filters.

@rdblue
Contributor

rdblue commented May 27, 2022

Thanks for working on this, @huaxingao! It looks close overall.

Does it make sense to split this into separate read and write PRs? It seems like adding the evaluator is quite a lot to review independently, and we don't need the write path for its tests.

@huaxingao
Contributor Author

Thanks @rdblue @kbendick for reviewing!
I have addressed the comments and changed the code accordingly. I updated the code in this PR because I think it might be easier for you to check whether I addressed the comments here. I can split this PR into two afterwards.

}

public static Set<Integer> boundReferences(
StructType struct, List<Expression> exprs, boolean caseSensitive, boolean alreadyBound) {
Contributor

I don't think it is a good idea to add a second boolean argument to this method. It is confusing enough with just one.

How about using a different method name for this and then renaming this to be a private internal implementation?

Contributor Author

I changed this method name to exprReferences (please let me know if you have a better name), but I still need to keep this as public because I need to access this method from ParquetBloomRowGroupFilter, which is in a different package.

Contributor

Maybe just references since it doesn't bind?

Contributor Author

Sounds good. Will change.

config.keySet().stream()
.filter(key -> key.startsWith(prefix))
.forEach(key -> {
String columnPath = key.replaceFirst(prefix, "");
Contributor

Since this uses column name in the config, is there any logic to update these configs when columns are renamed?

Contributor Author

Good question. I don't have logic to update the configs when columns are renamed. I think we are OK, though. On the write path, I use these configs to write bloom filters at file creation time; the configs are not used for reads. On the read path, the bloom filters are loaded by field id instead of column name. If columns are renamed after the bloom filters have been written, as long as the ids stay the same, the bloom filters should load fine.

Contributor

We're okay for adding read support, but we should consider how to configure write support then.


private <T> boolean shouldRead(PrimitiveType primitiveType, T value, BloomFilter bloom, Type type) {
long hashValue = 0;
switch (type.typeId()) {
Contributor

I think that this needs to use the Parquet type from the file, not the Iceberg type.

The Iceberg type comes from the read schema, which could be long when the file's type is int because of type promotion rules. We need to make sure that the hash function used matches the file's type because that was what produced the bloom filter. Hopefully, the same hash function is used for long and int in Parquet, but I'm not sure that is the case so the safest thing is to convert value to the Parquet type.

Contributor Author

I changed the code to switch (primitiveType.getPrimitiveTypeName())

hashValue = bloom.hash(((Number) value).intValue());
return bloom.findHash(hashValue);
}
case INT64:
Contributor

Looks like this will fall through if the type ID is not handled? I think you need to add a break;.

Contributor Author

Will fix.
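A sketch of what the cleaned-up method might look like, switching on the file's Parquet primitive type and returning from every branch so nothing falls through; the per-type handling (and the byte[] value for BINARY) is an assumption for illustration, not the PR's final code:

```java
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.PrimitiveType;

class BloomFilterEval {
  // Tests an equality value against a row group's bloom filter, hashing with the
  // physical type that produced the filter. Unhandled types return true ("must read")
  // instead of falling through to another case.
  static <T> boolean shouldRead(PrimitiveType primitiveType, T value, BloomFilter bloom) {
    switch (primitiveType.getPrimitiveTypeName()) {
      case INT32:
        return bloom.findHash(bloom.hash(((Number) value).intValue()));
      case INT64:
        return bloom.findHash(bloom.hash(((Number) value).longValue()));
      case FLOAT:
        return bloom.findHash(bloom.hash(((Number) value).floatValue()));
      case DOUBLE:
        return bloom.findHash(bloom.hash(((Number) value).doubleValue()));
      case BINARY:
        // assumes the value arrives as raw bytes; real code must convert from the
        // engine's string/binary representation first
        return bloom.findHash(bloom.hash(Binary.fromConstantByteArray((byte[]) value)));
      default:
        return true; // no bloom filter support for this physical type
    }
  }
}
```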

switch (type.typeId()) {
case DECIMAL:
BigDecimal decimalValue = (BigDecimal) value;
hashValue = bloom.hash(decimalValue.unscaledValue().intValue());
Contributor

This is what Parquet would do to hash a decimal stored as an int?

Contributor Author

Yes

Contributor
@rdblue rdblue left a comment

@huaxingao, other than cleaning up shouldRead to avoid falling through for unexpected Iceberg types, I think the read side of this is about ready to go. Do you want to open a separate PR for that?

@huaxingao
Contributor Author

@rdblue I have a question about how to test the read path. If we separate the write and read paths, I don't have a good way to test the read path. Shall we do the write path first? For the write path, we can test whether the bloom filter is written correctly by calling the Parquet bloom filter APIs. Or should I just submit the read path PR without a test, and add the test after the write path is in?

@rdblue
Contributor

rdblue commented Jun 29, 2022

I'm going to close this because it is broken into separate read and write side PRs.

@rdblue rdblue closed this Jun 29, 2022
@jia-zhengwei

@huaxingao
How do I create a bloom filter index for Iceberg with Spark?
I know I need to set write.parquet.bloom-filter-enabled.column.col1='true' when creating the table; do I need any other steps when inserting data with PySpark?
How can I verify that the bloom filter was written? Will some new index file be generated?
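A minimal sketch of what the question describes: creating an Iceberg table with the per-column bloom filter property from Spark. The catalog, table, and column names are invented, and this is an illustration rather than a verified recipe.

```java
import org.apache.spark.sql.SparkSession;

class CreateTableWithBloomFilter {
  // Create an Iceberg table with a per-column bloom filter property; Parquet files
  // written to the table afterwards should carry a bloom filter for that column.
  static void create(SparkSession spark) {
    spark.sql(
        "CREATE TABLE demo.db.events (col1 BIGINT, data STRING) USING iceberg "
            + "TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.col1'='true')");
  }
}
```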

@huaxingao huaxingao deleted the bf branch March 19, 2024 21:54
8 participants







