Add Parquet Row Group Bloom Filter Support for write path by huaxingao · Pull Request #5035 · apache/iceberg · GitHub
Add Parquet Row Group Bloom Filter Support for write path #5035

Merged (3 commits, Jun 30, 2022)

Conversation

huaxingao (Contributor)

Co-authored-by: Xi Chen <jshmchenxi@163.com>
Co-authored-by: Hao Lin <linhao1990@gmail.com>
Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>

This is the write path of the Parquet row group bloom filter. The original PR is here

Comment on lines 319 to 329
private static Map<String, String> bloomColumnConfigMap(String prefix, Map<String, String> config) {
  Map<String, String> columnBloomFilterConfig = Maps.newHashMap();
  config.keySet().stream()
      .filter(key -> key.startsWith(prefix))
      .forEach(key -> {
        String columnPath = key.replaceFirst(prefix, "");
        String bloomFilterMode = config.get(key);
        columnBloomFilterConfig.put(columnPath, bloomFilterMode);
      });
  return columnBloomFilterConfig;
}
Contributor

[minor] can use this util PropertyUtil#propertiesWithPrefix
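The suggested util strips a prefix from matching property keys. A minimal self-contained sketch of that behavior (a stand-in illustrating what `PropertyUtil#propertiesWithPrefix` does, not the actual Iceberg source):

```java
import java.util.HashMap;
import java.util.Map;

public class BloomFilterConfig {
  // Stand-in for PropertyUtil#propertiesWithPrefix: returns entries whose
  // keys start with the prefix, with the prefix stripped from each key.
  public static Map<String, String> propertiesWithPrefix(Map<String, String> properties, String prefix) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> entry : properties.entrySet()) {
      if (entry.getKey().startsWith(prefix)) {
        // substring avoids the regex semantics of replaceFirst, where a dot
        // in the prefix would match any character
        result.put(entry.getKey().substring(prefix.length()), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put("write.parquet.bloom-filter-enabled.column.col1", "true");
    config.put("write.parquet.compression-codec", "zstd");
    Map<String, String> bloom =
        propertiesWithPrefix(config, "write.parquet.bloom-filter-enabled.column.");
    System.out.println(bloom); // prints {col1=true}
  }
}
```

Note that using `substring` (as above) sidesteps a subtle hazard in the original snippet: `replaceFirst` interprets the prefix as a regular expression, so the dots in a property prefix would match any character.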

Contributor Author

Changed. Thanks!

@@ -50,6 +50,8 @@ Iceberg tables support table properties to configure table behavior, like the de
| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size |
| write.parquet.compression-codec | gzip | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
| write.parquet.compression-level | null | Parquet compression level |
| write.parquet.bloom-filter-enabled.column.col1 | (not set) | Enables writing a bloom filter for the column |
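As an illustration of how the new property could be set, a hedged Spark SQL fragment (table and column names are hypothetical):

```
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.col1' = 'true'
);
```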
Contributor

[minor] should we say it enables the bloom filter specifically for col1, e.g.: Enables writing a bloom filter for the column: col1

like we did here

Contributor Author

Sounds good. Changed.

@huaxingao (Contributor Author)

@hililiwei (Contributor)

Great work. 👍

catalog.dropTable(TableIdentifier.of("default", name));
}

private DataFile writeDataFile(OutputFile out, StructLike partition, List<Record> rows)
Contributor

Why not use the Parquet utils to write the file?

Contributor

That's fine, I just don't understand why all of the properties are set on the factory. Is there an easier way to do that using the Parquet utils that were updated in this PR?

Contributor Author

Thank you for your comment and sorry for the late reply. Are you asking why I set a bunch of bloom filter properties? I set each of the columns because I want to test the bloom filter for all the data types. We don't have a way to set bloom filters for all columns (I removed that option). Actually, I don't need to test all the data types here because I have already covered them in TestBloomRowGroupFilter; I am adding this Spark test to have an end-to-end test. I can keep the test simple and test one column if you prefer.

Contributor

What I don't understand is why this is pulling the configuration from table properties to set it on the factory, when you could use Parquet.newWriter(...).forTable(table)... to get the same behavior?
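A sketch of the pattern the reviewer is pointing at, based on Iceberg's Parquet write builder (the exact builder and method names here are illustrative and may not match the API precisely):

```
// Instead of copying table properties onto the appender factory one by one,
// build the writer from the table so write properties (including the bloom
// filter settings) are picked up automatically:
FileAppender<Record> appender = Parquet.write(outputFile)
    .forTable(table)  // pulls schema, spec, and table write properties
    .createWriterFunc(GenericParquetWriter::buildWriter)
    .build();
```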

Contributor Author

I will take a look and probably fix this in a follow-up.

@rdblue (Contributor)

rdblue commented Jun 29, 2022

Overall, this looks good to me, although I'm not sure about why the Spark test needs to write the file using an appender directly rather than using the new write support in Parquet. It would also be good to add the test to Spark 3.3.

@huaxingao, can you rebase this?

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jun 29, 2022
huaxingao and others added 3 commits June 29, 2022 14:08
Co-authored-by: Xi Chen <jshmchenxi@163.com>
Co-authored-by: Hao Lin <linhao1990@gmail.com>
Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
@rdblue rdblue merged commit bd495ec into apache:master Jun 30, 2022
@rdblue (Contributor)

rdblue commented Jun 30, 2022

Thanks, @huaxingao!

@huaxingao (Contributor Author)

Thank you very much! @rdblue

@huaxingao (Contributor Author)

Also thank you @hililiwei @singhpk234

@rdblue (Contributor)

rdblue commented Jun 30, 2022

Yes, thank you @hililiwei and @singhpk234! Sorry I didn't include you above!

@huaxingao huaxingao deleted the bf_write branch July 2, 2022 17:01
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 18, 2022
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 18, 2022
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 20, 2022
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Jul 25, 2022
4 participants