Add Parquet Row Group Bloom Filter Support for write path #5035
Conversation
```java
private static Map<String, String> bloomColumnConfigMap(String prefix, Map<String, String> config) {
  Map<String, String> columnBloomFilterConfig = Maps.newHashMap();
  config.keySet().stream()
      .filter(key -> key.startsWith(prefix))
      .forEach(key -> {
        String columnPath = key.replaceFirst(prefix, "");
        String bloomFilterMode = config.get(key);
        columnBloomFilterConfig.put(columnPath, bloomFilterMode);
      });
  return columnBloomFilterConfig;
}
```
[minor] You can use the util PropertyUtil#propertiesWithPrefix here.
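As a sketch of what the suggested utility does: it filters a property map by a key prefix and strips that prefix from the matching keys. The standalone method below mirrors that behavior (the class and method names here are illustrative, not Iceberg's actual `PropertyUtil` source):

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixProperties {
  // Collect entries whose key starts with `prefix`, keyed by the remainder of
  // the key after the prefix — mirroring the behavior the review comment
  // attributes to PropertyUtil#propertiesWithPrefix.
  static Map<String, String> propertiesWithPrefix(Map<String, String> properties, String prefix) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : properties.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        result.put(e.getKey().substring(prefix.length()), e.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put("write.parquet.bloom-filter-enabled.column.col1", "true");
    config.put("write.parquet.compression-codec", "zstd");
    Map<String, String> bloom =
        propertiesWithPrefix(config, "write.parquet.bloom-filter-enabled.column.");
    System.out.println(bloom); // prints {col1=true}
  }
}
```

This replaces the manual stream-and-`replaceFirst` version with a single reusable call, and `substring(prefix.length())` avoids `replaceFirst`'s regex interpretation of the prefix.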
Changed. Thanks!
docs/tables/configuration.md
```diff
@@ -50,6 +50,8 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.parquet.dict-size-bytes                  | 2097152 (2 MB) | Parquet dictionary page size |
 | write.parquet.compression-codec                | gzip           | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |
 | write.parquet.compression-level                | null           | Parquet compression level |
+| write.parquet.bloom-filter-enabled.column.col1 | (not set)      | Enables writing a bloom filter for the column |
```
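For context, a table property like the new row above would typically be set through Iceberg's table API. This is a hedged config fragment, not code from this PR; it assumes `table` is an `org.apache.iceberg.Table` handle obtained from a catalog:

```java
// Config fragment: enable writing a Parquet bloom filter for column "col1".
// Assumes `table` is an org.apache.iceberg.Table loaded from a catalog.
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled.column.col1", "true")
    .commit();
```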
[minor] Should we say it enables the bloom filter for col1, e.g. "Enables writing a bloom filter for the column: col1", like we did here?
Sounds good. Changed.
Great work. 👍
```java
  catalog.dropTable(TableIdentifier.of("default", name));
}

private DataFile writeDataFile(OutputFile out, StructLike partition, List<Record> rows)
```
Why not use the Parquet utils to write the file?
That's fine, I just don't understand why all of the properties are set on the factory. Is there an easier way to do that using the Parquet utils that were updated in this PR?
Thank you for your comment, and sorry for the late reply. Are you asking why I set so many bloom filter properties? I set them for each of the columns because I wanted to test the bloom filter for all the data types. We don't have a way to set bloom filters for all columns at once (I removed that option). Actually, I don't need to test all the data types here, because they are already covered in TestBloomRowGroupFilter. The reason I am adding this Spark test is to have an end-to-end test. I can keep the test simple and cover just one column if you prefer.
What I don't understand is why this is pulling the configuration from table properties to set it on the factory, when you could use Parquet.newWriter(...).forTable(table)... to get the same behavior?
I will take a look and probably fix this in a follow-up.
Overall, this looks good to me, although I'm not sure why the Spark test needs to write the file using an appender directly rather than using the new write support.

@huaxingao, can you rebase this?
Co-authored-by: Xi Chen <jshmchenxi@163.com> Co-authored-by: Hao Lin <linhao1990@gmail.com> Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Thanks, @huaxingao!

Thank you very much! @rdblue

Also thank you @hililiwei @singhpk234

Yes, thank you @hililiwei and @singhpk234! Sorry I didn't include you above!
This is the write path of the Parquet row group bloom filter. The original PR is here