Implement projection and filter pushdown to Parquet #11418

Open
sfkeller opened this issue Dec 2, 2024 · 9 comments

Comments

@sfkeller

sfkeller commented Dec 2, 2024

Feature description

These are read-only features well known in databases, big data, and distributed computing. Filter and projection pushdown provide significant performance benefits if the Parquet file is prepared accordingly.

Projection pushdown means that when you query a Parquet file, the driver only reads the columns required for the query.

Filter pushdown means that when you apply a filter to a column scanned from a Parquet file, the filter is pushed down and can be used to skip parts of the file via the Parquet file's built-in zone maps (min/max statistics per row group). Note that this depends on whether the Parquet file actually contains such statistics.
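For illustration, here is a minimal sketch of how these two concepts surface in OGR's Python bindings, where SetIgnoredFields plays the role of projection pushdown and SetAttributeFilter the role of filter pushdown (file name and column names are hypothetical):

from osgeo import ogr

ogr.UseExceptions()

ds = ogr.Open("places.parquet")  # hypothetical file
lyr = ds.GetLayer(0)

# Projection pushdown: declare the columns the query does NOT need,
# so the driver can avoid decoding them from the file.
keep = {"id", "name"}
lyr.SetIgnoredFields(
    [f.GetName() for f in lyr.schema if f.GetName() not in keep]
)

# Filter pushdown: a driver supporting it can evaluate this against
# row-group statistics (zone maps) and skip whole row groups.
lyr.SetAttributeFilter("name = 'pizza_restaurant'")

for feat in lyr:
    print(feat.GetField("id"), feat.GetField("name"))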

Additional context

See for example https://duckdb.org/docs/data/parquet/overview#partial-reading.

I'm not sure whether this "filter pushdown" is what @rouault meant when he wrote in #10887: "... I believe especially if the result set is small compared to the table size, despite my attempts at translating OGR SQL to Arrow filtering capabilities".

I may be able to contribute, but I'd first like to know whether this is feasible.

@rouault
Member

rouault commented Dec 2, 2024

When using the "PARQUET:filename.parquet" syntax, this should already be implemented to the best extent we can using the ScannerBuilder::Filter API (https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset14ScannerBuilder6FilterERKN7compute10ExpressionE)

With the regular "filename.parquet" syntax, we do optimizations on the OGR side that may be inferior to the prior approach. The reason we have those two modes is that when the driver was developed, the ArrowDataset library, which makes it possible to use ScannerBuilder::Filter, wasn't declared to be stable.
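A minimal sketch of the two open modes from Python (the file name and column are hypothetical, and the "PARQUET:" mode requires a GDAL build with the ArrowDataset library):

from osgeo import ogr

ogr.UseExceptions()

# Regular mode: pushdown optimizations happen on the OGR side.
ds_plain = ogr.Open("data.parquet")

# Dataset mode: goes through the ArrowDataset library, so the driver
# can translate OGR filters into Arrow compute expressions
# (ScannerBuilder::Filter) for pushdown.
ds_arrow = ogr.Open("PARQUET:data.parquet")

lyr = ds_arrow.GetLayer(0)
lyr.SetAttributeFilter("city = 'Zurich'")  # hypothetical column
print(lyr.GetFeatureCount())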

@sfkeller
Author

sfkeller commented Dec 3, 2024

Thanks for the quick answer. Just to understand you correctly: the Parquet reader already does filter and projection pushdown? (Sorry for my ignorance about the implementation; I don't yet see what the typical dependency on Arrow as a columnar format means.)

E.g. with the following test query from https://docs.overturemaps.org/getting-data/duckdb/ (not yet adapted to OGR), where the attributes id, name and geometry are "projected" and the WHERE clause "filters" are optimized, including hive partitioning and zone maps:

SELECT
  id,
  names.primary AS name,
  geometry                            -- the GEOMETRY type to be implemented
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/*', filename=true, hive_partitioning=1)
WHERE
  bbox.xmin BETWEEN -75 AND -73       -- Only the bbox xmin/ymin values are needed
  AND bbox.ymin BETWEEN 40 AND 41     -- because these are point geometries.
  AND categories.primary = 'pizza_restaurant';
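A rough OGR adaptation of the same query might look like this (a sketch only, assuming a GDAL build with ArrowDataset support, /vsis3/ access configured, and the driver's default flattening of struct columns into dotted field names such as names.primary):

from osgeo import gdal, ogr

gdal.UseExceptions()

# Assumed path: "PARQUET:" dataset mode over the partitioned S3 layout.
ds = ogr.Open(
    "PARQUET:/vsis3/overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/"
)
lyr = ds.GetLayer(0)

# Projection pushdown: keep only the columns the query needs.
keep = {"id", "names.primary", "categories.primary"}
lyr.SetIgnoredFields(
    [f.GetName() for f in lyr.schema if f.GetName() not in keep]
)

# Filter pushdown: translated into Arrow compute expressions where possible.
lyr.SetSpatialFilterRect(-75, 40, -73, 41)
lyr.SetAttributeFilter("\"categories.primary\" = 'pizza_restaurant'")

for feat in lyr:
    print(feat.GetField("id"), feat.GetField("names.primary"))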

@rouault
Member

rouault commented Dec 3, 2024

The parquet reader already does filtering and projection pushdown?

Yes. For partitioned datasets like OvertureMaps, the "PARQUET:" dataset mode (https://gdal.org/en/stable/drivers/vector/parquet.html#dataset-partitioning-read-support) is necessarily used, and attribute and spatial filter pushdown is implemented using https://arrow.apache.org/docs/cpp/dataset.html#filtering-data . I believe the reason DuckDB is much faster than Arrow is a smarter implementation. I don't think there's anything that can be done on the GDAL side (I may have missed something, of course).

@rouault
Member

rouault commented Dec 8, 2024

#11459 implements attribute and spatial filter pushdown for Parquet through the ADBC driver (new in 3.11dev), using DuckDB's Parquet capabilities.
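If I read the ADBC path correctly, usage might look like the following sketch (assuming a GDAL 3.11dev build with ADBC support and libduckdb available; the "ADBC:" prefix and the pushdown behaviour are my reading of #11459, not verified):

from osgeo import ogr

ogr.UseExceptions()

# Assumption: the ADBC driver delegates the Parquet scan to DuckDB.
ds = ogr.Open("ADBC:places.parquet")  # hypothetical file
lyr = ds.GetLayer(0)

# Attribute and spatial filters set here are candidates for pushdown
# into DuckDB's Parquet reader.
lyr.SetAttributeFilter("category = 'pizza_restaurant'")  # hypothetical column
lyr.SetSpatialFilterRect(-75, 40, -73, 41)
print(lyr.GetFeatureCount())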

@dwilson1988

@rouault - I may be missing something, but whenever I've tried to create a GeoParquet file using GDAL, the files are missing Column and Offset indexes - is this expected? I don't see how efficient filter pushdown would be possible without these. I, of course, could be mistaken here.

@rouault
Member

rouault commented Jan 1, 2025

they are missing Column and Offset indexes

What do you mean exactly by Column and Offset indexes?

@dwilson1988

The page indexes for each column chunk: https://parquet.apache.org/docs/file-format/pageindex/

@rouault
Member

rouault commented Jan 1, 2025

The page indexes for each column chunk: https://parquet.apache.org/docs/file-format/pageindex/

Thanks. The capability was added in libarrow 12 per apache/arrow#34054, but it has to be explicitly enabled. Will be done per #11565.
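For reference, the underlying Arrow capability can be exercised directly; a minimal pyarrow sketch (independent of GDAL, assuming pyarrow >= 12):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# write_page_index=True emits the ColumnIndex/OffsetIndex structures
# that readers can use for page-level filter pushdown.
pq.write_table(table, "indexed.parquet", write_page_index=True)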

@dwilson1988

Fantastic! Thanks for the speedy response!
