Implement projection and filter pushdown to Parquet #11418
Comments
When using the "PARQUET:filename.parquet" syntax, this should already be implemented to the best extent we can using the ScannerBuilder::Filter API (https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset14ScannerBuilder6FilterERKN7compute10ExpressionE). With the regular "filename.parquet" syntax, we do optimizations on the OGR side, which may be inferior to the prior approach. The reason we have those two modes is that when the driver was developed, the ArrowDataset library, which makes it possible to use ScannerBuilder::Filter, wasn't declared to be stable.
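For illustration, a minimal sketch of how the two access modes look from the GDAL Python bindings; the file and the population field are hypothetical, not from the thread:

```python
from osgeo import ogr

# Dataset mode: the "PARQUET:" prefix routes reads through the ArrowDataset
# library, so attribute filters can be translated to ScannerBuilder::Filter
# expressions and pushed into the scan.
ds = ogr.Open("PARQUET:filename.parquet")
layer = ds.GetLayer(0)
layer.SetAttributeFilter("population > 100000")  # hypothetical field name

# Plain mode: the same filter is evaluated with OGR-side optimizations,
# which may be less effective than the ArrowDataset path.
ds_plain = ogr.Open("filename.parquet")
```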
Thanks for the quick answer. Just to understand you correctly: the Parquet reader already does filtering and projection pushdown? (Sorry for my ignorance about the implementation, where I don't yet see what the typical dependency on Arrow as a columnar format means.) E.g. with the following test query from https://docs.overturemaps.org/getting-data/duckdb/ (not yet adapted to OGR), where the attributes id, name and geometry are "projected" and the WHERE statement "filters" are optimized, including hive_partitioning and zone maps:
Yes, for partitioned datasets like OvertureMaps, the "PARQUET:" dataset mode (https://gdal.org/en/stable/drivers/vector/parquet.html#dataset-partitioning-read-support) is necessarily used, and attribute and spatial filter pushdown is implemented using https://arrow.apache.org/docs/cpp/dataset.html#filtering-data . I believe the reason for DuckDB being much faster than Arrow is a smarter implementation. I don't think there's anything that can be done on the GDAL side (I may have missed something, of course).
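As an illustration of the Arrow-side mechanism the driver relies on, here is a pyarrow sketch, the Python analogue of the linked C++ filtering-data API; the path and field names are assumptions, not the actual OvertureMaps schema:

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# A Hive-partitioned layout, as used by OvertureMaps releases (path assumed).
dataset = ds.dataset("overture/theme=places/type=place/",
                     format="parquet", partitioning="hive")

# Both the column projection and the filter are pushed into the scan:
# partitions and row groups whose statistics cannot match are pruned
# before their data is read.
table = dataset.to_table(
    columns=["id", "names"],
    filter=pc.field("confidence") > 0.9,
)
```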
#11459 implements attribute and spatial filter pushdown for Parquet through the ADBC driver, new in 3.11dev, using DuckDB's Parquet capabilities.
@rouault - I may be missing something, but whenever I've tried to create a GeoParquet file using GDAL, it is missing Column and Offset indexes - is this expected? I don't see how efficient filter pushdown would be possible without these. I could, of course, be mistaken here.
What do you mean exactly by Column and Offset indexes?
The page indexes for each column chunk: https://parquet.apache.org/docs/file-format/pageindex/
Thanks. The capability was added in libarrow 12 per apache/arrow#34054, but needs to be explicitly enabled. Will be done per #11565.
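For reference, a hedged sketch of enabling the page index when writing with pyarrow, which exposes the libarrow capability mentioned above; the file name and data are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# write_page_index=True emits the ColumnIndex/OffsetIndex structures that
# readers need for page-level filter pushdown; it is off by default.
pq.write_table(table, "with_page_index.parquet", write_page_index=True)
```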
Fantastic! Thanks for the speedy response!
Feature description
This is a read-only optimization well known in databases, big data and distributed computing. Filter and projection pushdown provide significant performance benefits if the Parquet file is prepared accordingly.
Projection pushdown means that when you query a Parquet file, the driver only reads the columns required for the query.
Filter pushdown means that when you apply a filter to a column scanned from a Parquet file, the filter is pushed down and can be used to skip parts of the file using the built-in zone maps of the Parquet file. Note that this depends on whether the Parquet file contains such zone maps.
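To make the zone-map idea concrete, a small pyarrow sketch that inspects the row-group statistics a pushed-down filter would consult; the file name and data are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"value": list(range(1000))}), "example.parquet")

# Row-group min/max statistics are the "zone maps": a reader evaluating
# e.g. value > 2000 can skip this row group without decoding any data.
meta = pq.ParquetFile("example.parquet").metadata
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)  # 0 999
```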
Additional context
See for example https://duckdb.org/docs/data/parquet/overview#partial-reading.
I'm not sure if this "filter pushdown" is what @rouault meant when he wrote in #10887 "... I believe especially if the result set is small compared to the table size, despite my attempts at translating OGR SQL to Arrow filtering capabilities".
I may be able to contribute, but I'd first like to know whether this is feasible.