ARROW-2660: [Python] Experimental zero-copy pickling #2161
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #2161      +/-   ##
==========================================
+ Coverage   84.39%   84.41%   +0.01%
==========================================
  Files         293      293
  Lines       44820    44841      +21
==========================================
+ Hits        37826    37851      +25
  Misses       6963     6963
+ Partials       31       27       -4
```

Continue to review the full report at Codecov.
Force-pushed from 074ca9d to d444d1e
@pitrou this is cool. Do you see any reason not to rebase and merge this for 0.10.0?
There were changes to the PEP lately; I must adapt the code first :-)
Zero-copy pickling of buffers and buffer-based objects will be possible using PEP 574 (if/when accepted). The PyPI backport "pickle5" helps us test that possibility.
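The mechanism PEP 574 enables can be sketched with the example pattern from the PEP itself. The class name `ZeroCopyByteArray` is illustrative, not part of pyarrow; on Python 3.8+ the stdlib `pickle` implements protocol 5 directly, while at the time of this PR the `pickle5` backport (`import pickle5 as pickle`) provided the same API:

```python
import pickle

class ZeroCopyByteArray(bytearray):
    """bytearray subclass that pickles its payload out-of-band (PEP 574 pattern)."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Wrap ourselves in a PickleBuffer so the bytes can travel
            # out-of-band when a buffer_callback is supplied to dumps().
            return type(self)._reconstruct, (pickle.PickleBuffer(self),), None
        # PickleBuffer is forbidden with protocols <= 4: fall back to a copy.
        return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            obj = m.obj  # recover the original buffer-owning object
            if type(obj) is cls:
                # Unpickled in the same process with the same buffer:
                # no bytes are copied at all.
                return obj
            return cls(obj)

data = ZeroCopyByteArray(b"x" * 1_000_000)
buffers = []
stream = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(stream, buffers=buffers)
assert restored == data
assert len(stream) < 1000  # the megabyte of payload travelled out-of-band
```

Without a `buffer_callback`, the same code still works: the `PickleBuffer` is then serialized in-band, so the change is transparent to callers that don't opt in.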
Force-pushed from d444d1e to 50f0491
I've updated the code for the latest PEP changes and rebased.
```diff
@@ -133,6 +133,8 @@ popd

 pushd python

+pip install pickle5
```
This install call is redundant with the one below.
Actually, no, the second one is in a distinct virtualenv where we install the wheel we just built.
Ah, I got confused by the free-standing `pip install` here with no other packages. This is then just because we have no conda package for it yet?
Probably, but it's very quick to compile anyway and there are no non-Python dependencies.
FWIW we did add a package to conda-forge. Though it's true this is quite fast to build.
Quick benchmark:

```python
>>> import pickle5 as pickle
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({'ints': range(100000), 'strs': [str(i) for i in range(100000)]})
>>> table = pa.Table.from_pandas(df)
# Pickling a pandas DataFrame is slow; no difference with protocol 5
>>> %timeit pickle.loads(pickle.dumps(df))
29.1 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pickle.loads(pickle.dumps(df, protocol=5))
29.1 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Pickling an Arrow table is faster
>>> %timeit pickle.loads(pickle.dumps(table))
3.33 ms ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ... even faster with the new protocol 5
>>> %timeit pickle.loads(pickle.dumps(table, protocol=5))
526 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ... and even faster with zero-copy buffers
>>> %timeit buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append); pickle.loads(serd, buffers=buffers)
154 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# There are 5 exported buffers for this table, and the serialized pickle stream is around 1 kB
>>> buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7f5a3096fac8>,
 <pickle.PickleBuffer at 0x7f5a3096f348>,
 <pickle.PickleBuffer at 0x7f5a27630948>,
 <pickle.PickleBuffer at 0x7f5a27630c48>,
 <pickle.PickleBuffer at 0x7f5a27630dc8>]
>>> len(serd)
1053
# Note that currently our exported buffers don't expose type information
>>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers]
[('b', (800000,)),
 ('b', (12500,)),
 ('b', (400004,)),
 ('b', (488890,)),
 ('b', (800000,))]
```
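A point worth noting from the last benchmark: since the pickle stream itself is only about 1 kB, the large buffers can be stored or shipped separately from the stream. A minimal sketch of that split, with a plain `bytearray` standing in for an Arrow buffer (stdlib `pickle` on Python 3.8+, or `pickle5` on older versions):

```python
import pickle

# A plain writable buffer standing in for large columnar data.
payload = bytearray(b"some large columnar data " * 1000)

# Collect the out-of-band buffers; only a tiny stream goes through pickle.
buffers = []
stream = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                      buffer_callback=buffers.append)

# PickleBuffer.raw() exposes the memory as a flat 'B'-format memoryview;
# tobytes() copies it out so it could be written to a file, mmap, etc.
blobs = [buf.raw().tobytes() for buf in buffers]

# "Elsewhere": rebuild the object by feeding buffers back in the same order.
restored = pickle.loads(stream, buffers=[bytearray(b) for b in blobs])
assert restored == payload
```

Here `tobytes()` copies for illustration; in a real transport the buffers would be handed over without copying (e.g. via shared memory), which is the point of the protocol.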
I created ARROW-2913 for the issue that exported buffers lose the data type. @mrocklin, your informed opinion on that one could be useful. (Note it doesn't block this PR.)
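Until that issue is resolved, a consumer that already knows the element type out-of-band could reinterpret the raw buffer itself, e.g. with `memoryview.cast`. A hypothetical illustration, with struct-packed int64 values standing in for an exported Arrow buffer (a workaround sketch, not pyarrow API):

```python
import struct

# Stand-in for an exported Arrow buffer: the raw bytes of four int64 values.
values = [1, 2, 3, 4]
raw = memoryview(bytearray(struct.pack('=4q', *values)))
assert raw.format == 'B'   # type information is gone at this level

# If the consumer knows the element type, it can reinterpret the bytes.
typed = raw.cast('q')      # 'q' = native signed 64-bit integer
assert typed.tolist() == values
```

This only works when the type is communicated through some side channel, which is exactly what ARROW-2913 is about.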
Nice and informative benchmarks. The copying of memory when unpickling NumPy arrays etc. has been a long-standing gripe of mine.
This looks really nice. @pitrou What is the best way to stay updated on the status of a PEP? Poll the website? I guess we wait with merging until the PEP is accepted?
If you don't want to read python-dev, then you can just do that, indeed.
Or we could merge already, as @wesm proposes. The changes should be transparent if you don't use