Implement projection and filter pushdown to Parquet #11418

Open
sfkeller opened this issue Dec 2, 2024 · 9 comments

Comments

@sfkeller

sfkeller commented Dec 2, 2024

Feature description

These are read-only features well known in databases, big data, and distributed computing. Filter and projection pushdown provide significant performance benefits if the Parquet file is prepared accordingly.

Projection pushdown means that when you query a Parquet file, the driver only reads the columns required for the query.

Filter pushdown means that when you apply a filter to a column scanned from a Parquet file, the filter is pushed down and can be used to skip parts of the file via the Parquet file's built-in zone maps (min/max statistics per row group). Note that this depends on whether the Parquet file actually contains such statistics.
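For illustration, here is a minimal sketch of how these two concepts surface in OGR's Python bindings, where SetIgnoredFields plays the role of projection pushdown and SetAttributeFilter the role of filter pushdown (file name and column names are hypothetical):

from osgeo import ogr

ogr.UseExceptions()

ds = ogr.Open("places.parquet")  # hypothetical file
lyr = ds.GetLayer(0)

# Projection pushdown: declare the columns the query does NOT need,
# so the driver can avoid decoding them from the file.
keep = {"id", "name"}
lyr.SetIgnoredFields(
    [f.GetName() for f in lyr.schema if f.GetName() not in keep]
)

# Filter pushdown: a driver supporting it can evaluate this against
# row-group statistics (zone maps) and skip whole row groups.
lyr.SetAttributeFilter("name = 'pizza_restaurant'")

for feat in lyr:
    print(feat.GetField("id"), feat.GetField("name"))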

Additional context

See for example https://duckdb.org/docs/data/parquet/overview#partial-reading.

I'm not sure whether this "filter pushdown" is what @rouault meant when he wrote in #10887: "... I believe especially if the result set is small compared to the table size, despite my attempts at translating OGR SQL to Arrow filtering capabilities".

I may be able to contribute, but I'd first like to know whether this is feasible.

@rouault
Member

rouault commented Dec 2, 2024

When using the "PARQUET:filename.parquet" syntax, this should already be implemented to the best extent we can using the ScannerBuilder::Filter API (https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset14ScannerBuilder6FilterERKN7compute10ExpressionE)

With the regular "filename.parquet" syntax, we do optimizations on the OGR side that may be inferior to the prior approach. The reason we have those two modes is that when the driver was developed, the ArrowDataset library, which makes it possible to use ScannerBuilder::Filter, wasn't declared to be stable.
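A minimal sketch of the two open modes from Python (the file name and column are hypothetical, and the "PARQUET:" mode requires a GDAL build with the ArrowDataset library):

from osgeo import ogr

ogr.UseExceptions()

# Regular mode: pushdown optimizations happen on the OGR side.
ds_plain = ogr.Open("data.parquet")

# Dataset mode: goes through the ArrowDataset library, so the driver
# can translate OGR filters into Arrow compute expressions
# (ScannerBuilder::Filter) for pushdown.
ds_arrow = ogr.Open("PARQUET:data.parquet")

lyr = ds_arrow.GetLayer(0)
lyr.SetAttributeFilter("city = 'Zurich'")  # hypothetical column
print(lyr.GetFeatureCount())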

@sfkeller
Author

sfkeller commented Dec 3, 2024

Thanks for the quick answer. Just to understand you correctly: the Parquet reader already does filter and projection pushdown? (Sorry for my ignorance about the implementation; I don't yet see what the typical dependency on Arrow as a columnar format means.)

E.g. with the following test query from https://docs.overturemaps.org/getting-data/duckdb/ (not yet adapted to OGR), where the attributes id, name and geometry are "projected" and the WHERE clause "filters" are optimized, including hive partitioning and zone maps:

SELECT
  id,
  names.primary AS name,
  geometry                            -- the GEOMETRY type to be implemented
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/*', filename=true, hive_partitioning=1)
WHERE
  bbox.xmin BETWEEN -75 AND -73       -- Only the bbox xmin/ymin values are needed
  AND bbox.ymin BETWEEN 40 AND 41     -- because these are point geometries.
  AND categories.primary = 'pizza_restaurant';
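A rough OGR adaptation of the same query might look like this (a sketch only, assuming a GDAL build with ArrowDataset support, /vsis3/ access configured, and the driver's default flattening of struct columns into dotted field names such as names.primary):

from osgeo import gdal, ogr

gdal.UseExceptions()

# Assumed path: "PARQUET:" dataset mode over the partitioned S3 layout.
ds = ogr.Open(
    "PARQUET:/vsis3/overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/"
)
lyr = ds.GetLayer(0)

# Projection pushdown: keep only the columns the query needs.
keep = {"id", "names.primary", "categories.primary"}
lyr.SetIgnoredFields(
    [f.GetName() for f in lyr.schema if f.GetName() not in keep]
)

# Filter pushdown: translated into Arrow compute expressions where possible.
lyr.SetSpatialFilterRect(-75, 40, -73, 41)
lyr.SetAttributeFilter("\"categories.primary\" = 'pizza_restaurant'")

for feat in lyr:
    print(feat.GetField("id"), feat.GetField("names.primary"))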

@rouault
Member

rouault commented Dec 3, 2024

The parquet reader already does filtering and projection pushdown?

Yes. For partitioned datasets like OvertureMaps, the "PARQUET:" dataset mode (https://gdal.org/en/stable/drivers/vector/parquet.html#dataset-partitioning-read-support) is necessarily used, and attribute and spatial filter pushdown is implemented using https://arrow.apache.org/docs/cpp/dataset.html#filtering-data . I believe the reason DuckDB is much faster than Arrow is a smarter implementation. I don't think there's anything that can be done on the GDAL side (I may have missed something, of course).

@rouault
Member

rouault commented Dec 8, 2024

#11459 implements attribute and spatial filter pushdown for Parquet through the ADBC driver (new in 3.11dev), using DuckDB's Parquet capabilities.
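If I read the ADBC path correctly, usage might look like the following sketch (assuming a GDAL 3.11dev build with ADBC support and libduckdb available; the "ADBC:" prefix and the pushdown behaviour are my reading of #11459, not verified):

from osgeo import ogr

ogr.UseExceptions()

# Assumption: the ADBC driver delegates the Parquet scan to DuckDB.
ds = ogr.Open("ADBC:places.parquet")  # hypothetical file
lyr = ds.GetLayer(0)

# Attribute and spatial filters set here are candidates for pushdown
# into DuckDB's Parquet reader.
lyr.SetAttributeFilter("category = 'pizza_restaurant'")  # hypothetical column
lyr.SetSpatialFilterRect(-75, 40, -73, 41)
print(lyr.GetFeatureCount())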

@dwilson1988

@rouault - I may be missing something, but whenever I've tried to create a GeoParquet file using GDAL, the files are missing Column and Offset indexes - is this expected? I don't see how efficient filter pushdown would be possible without these. I, of course, could be mistaken here.

@rouault
Member

rouault commented Jan 1, 2025

they are missing Column and Offset indexes

What do you mean exactly by Column and Offset indexes?

@dwilson1988

The page indexes for each column chunk: https://parquet.apache.org/docs/file-format/pageindex/

@rouault
Member

rouault commented Jan 1, 2025

The page indexes for each column chunk: https://parquet.apache.org/docs/file-format/pageindex/

Thanks. The capability was added in libarrow 12 per apache/arrow#34054, but it has to be explicitly enabled. Will be done per #11565.
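For reference, the underlying Arrow capability can be exercised directly; a minimal pyarrow sketch (independent of GDAL, assuming pyarrow >= 12):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# write_page_index=True emits the ColumnIndex/OffsetIndex structures
# that readers can use for page-level filter pushdown.
pq.write_table(table, "indexed.parquet", write_page_index=True)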

@dwilson1988

Fantastic! Thanks for the speedy response!
