Avoid schema validation by default in ArrowEngine #6536


Merged — 3 commits merged into dask:master on Aug 21, 2020

Conversation

rjzamora

Addresses cudf#6055

Given the computational expense of validating the schema of every file in a dataset, we should certainly avoid doing this by default.

Example/motivation (I can confirm that this PR achieves the same performance improvement):

Non-dask_cudf example:

import dask
import dask.dataframe as dd

# Write a small multi-file parquet dataset to benchmark against
path = "test.parquet"
dask.datasets.timeseries(partition_freq="600s").to_parquet(path)

# %timeit is an IPython magic; use timeit.timeit in a plain script
%timeit dd.read_parquet(
    path,
    engine="pyarrow",
    gather_statistics=False,
    split_row_groups=False,
)

With This PR:

274 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Before This PR:

4.4 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  • Tests added / passed
  • Passes black dask / flake8 dask

@kkraus14 kkraus14 mentioned this pull request Aug 21, 2020
@martindurant

Could perhaps use a test?
I am happy to merge as is, if you think the current tests cover both True and False cases.

@TomAugspurger left a comment

LGTM, one question about the implementation, which is also present on master so not a huge deal to fix here.

@TomAugspurger

Could perhaps use a test?

Or perhaps a benchmark in https://github.com/dask/dask-benchmarks, but we need a bit more tooling around that to automatically detect and report regressions.

@rjzamora

Could perhaps use a test?

I modified test_to_parquet_pyarrow_w_inconsistent_schema_by_partition_fails_by_default to cover this change.

Or perhaps a benchmark in https://github.com/dask/dask-benchmarks, but we need a bit more tooling around that to automatically detect and report regressions.

Yeah, we should definitely try to incorporate meaningful parquet/IO benchmarks to detect these types of regressions.

@martindurant

we should definitely try to incorporate meaningful parquet/IO benchmarks

I suggest this is a separate item of work.

@TomAugspurger TomAugspurger merged commit 23196a5 into dask:master Aug 21, 2020
@TomAugspurger

Thanks!

@rjzamora rjzamora deleted the avoid-schema-validation branch August 21, 2020 19:37
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020