Avoid schema validation by default in ArrowEngine #6536


Merged — 3 commits merged into dask:master on Aug 21, 2020

Conversation

rjzamora

Addresses cudf#6055

Given the computational expense of validating the schema of every file in a dataset, we should certainly avoid doing this by default.

Example/motivation (I can confirm that this PR achieves the same performance improvement):

Non-dask_cudf example:

import dask
import dask.dataframe as dd

# Write a small multi-file parquet dataset to benchmark against
path = "test.parquet"
dask.datasets.timeseries(partition_freq="600s").to_parquet(path)

# %timeit is an IPython magic; use timeit.timeit in a plain script
%timeit dd.read_parquet(
    path,
    engine="pyarrow",
    gather_statistics=False,
    split_row_groups=False,
)

With This PR:

274 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Before This PR:

4.4 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  • Tests added / passed
  • Passes black dask / flake8 dask

@kkraus14 kkraus14 mentioned this pull request Aug 21, 2020
@martindurant

Could perhaps use a test?
I am happy to merge as is, if you think the current tests cover both True and False cases.

@TomAugspurger left a comment

LGTM, one question about the implementation, which is also present on master so not a huge deal to fix here.

@TomAugspurger

Could perhaps use a test?

Or perhaps a benchmark in https://github.com/dask/dask-benchmarks, but we need a bit more tooling around that to automatically detect and report regressions.

@rjzamora

Could perhaps use a test?

I modified test_to_parquet_pyarrow_w_inconsistent_schema_by_partition_fails_by_default to cover this change.

Or perhaps a benchmark in https://github.com/dask/dask-benchmarks, but we need a bit more tooling around that to automatically detect and report regressions.

Yeah, we should definitely try to incorporate meaningful parquet/IO benchmarks to detect these types of regressions.

@martindurant

we should definitely try to incorporate meaningful parquet/IO benchmarks

I suggest this is a separate item of work.

@TomAugspurger TomAugspurger merged commit 23196a5 into dask:master Aug 21, 2020
@TomAugspurger

Thanks!

@rjzamora rjzamora deleted the avoid-schema-validation branch August 21, 2020 19:37
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020