Accessing items in NpzFile using [] is accidentally linear #29081
Comments
I'm confused about the title: your description sounds like accessing items in an NpzFile is linear, and iterating is quadratic — am I missing something? You could alternately define ... Your other option would be to redefine ... After some tinkering: I tried both of my first two suggestions, and reading 600 1000-element arrays is still slower than reading a single array with half a million elements; with my changes it's nearly the same speed as with the default NumPy implementation.
If there are simple fixes, we can entertain them, but I'm going to suggest that this use case is well out of scope of what the format is intended to accomplish. There are better options out there.
@DWesl, yes: I would expect ...

@DWesl, @rkern: The simple solution is indeed to convert `_files` into a set, which I think is a simple fix, and I'm happy to send a patch. I was just asking if there were any ideas about how to handle ...

Here's code that demonstrates the problem:

```python
#!/usr/bin/env python3
import numpy as np
from io import BytesIO
import timeit

sz = 50
for N in [10, 100, 1000, 10000, 100000]:
    out = [np.random.rand(sz) for _ in range(N)]
    memfile = BytesIO()
    np.savez(memfile, *out)  # will save arrays as arr_0, arr_1, ...
    memfile.seek(0)
    npf = np.load(memfile, allow_pickle=True)
    gl = {'N': N, 'npf': npf}
    print(N)
    print("first:", timeit.timeit('npf["arr_0"]', number=30, globals=gl))
    print("last:", timeit.timeit(f'npf["arr_{N-1}"]', number=30, globals=gl))
    print("iteration:", timeit.timeit('for i in range(N): npf[f"arr_{i}"]', number=30, globals=gl))
```

Here's output on my machine:
Notice how the time to access the last element depends on the number of files, and how the time for iteration quickly blows up.
I think we should just apply the simple fix (without changing ...). The format probably shouldn't be used here, but I agree that doesn't mean access should unnecessarily scale as O(n).
Thank you, I'll open a pull request with the fix soon. Point taken about NPZ not being ideal for our use case, but I was in the process of converting from NPZ to the new format when I discovered that I could make it work with minimal changes, and thought I might alert upstream.
We have an NPZ file storing a very large number of arrays, O(100K) to O(1M). We access these arrays using `f[key]` syntax. Unfortunately, the code in `__getitem__` first looks up the key in the list containing the items inside the zip file. This makes what should be an O(1) access an O(n) access. If we're iterating through all elements of the file using keys, this is accidentally quadratic. On my decade-old laptop, looking through a list containing 50K elements this way takes more than a minute. For 1M elements, on a more modern computer, the time can be up to 15 hours.

The code is here:
numpy/numpy/lib/_npyio_impl.py
Line 252 in 3363b38
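The slow step is the list membership test. A minimal standalone timing (using hypothetical key names, not NumPy's actual internals) illustrates the asymptotic difference the issue describes:

```python
# Standalone sketch: membership tests against a list scan linearly,
# while a set uses a hash table and answers in O(1) on average.
import timeit

keys = [f"arr_{i}" for i in range(100_000)]
key_set = set(keys)

# Look up the last key, the worst case for a linear scan.
list_time = timeit.timeit(lambda: "arr_99999" in keys, number=100)
set_time = timeit.timeit(lambda: "arr_99999" in key_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

On any CPython build the set lookup is orders of magnitude faster for late keys, which is exactly the gap the benchmark above measures through `NpzFile.__getitem__`.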
I'm not sure how to address this. `files` is part of the external API and is a list, so changing that might be impossible. However, `_files` is internal and could become a set. In the common(?) case that npz files mostly contain npy files, this would ameliorate the situation quite a bit.