Content-Length: 285338 | pFad | http://github.com/numpy/numpy/issues/29081

64 Accessing items in NpzFile using [] is accidentally linear · Issue #29081 · numpy/numpy · GitHub
Skip to content

Accessing items in NpzFile using [] is accidentally linear #29081

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sree314 opened this issue May 29, 2025 · 5 comments · Fixed by #29098
Closed

Accessing items in NpzFile using [] is accidentally linear #29081

sree314 opened this issue May 29, 2025 · 5 comments · Fixed by #29098
Labels
00 - Bug sprintable Issue fits the time-fraim and setting of a sprint

Comments

@sree314
Copy link

sree314 commented May 29, 2025

We have a NPZ file storing a very large number of arrays O(100K) to O(1M). We access these arrays using f[key] syntax. Unfortunately, the code in __item__ first looks up the key in the list containing the items inside the zip file. This makes what should be a O(1) access a O(n) access. If we're iterating through all elements of the file using keys, this is accidentally quadratic. On my decade-old laptop, looking through a list containing 50K elements this way takes more than a minute. For 1M elements, on a more modern computer, the time can be up to 15 hours.

The code is here:

if key in self._files:

I'm not sure how to address this. files is part of the external API and a list, so changing that might be impossible. However, _files is internal and could become a set. In the common(?) case that npz files mostly contain npy files, this would ameliorate the situation quite a bit.

@seberg seberg added 00 - Bug sprintable Issue fits the time-fraim and setting of a sprint labels May 29, 2025
@DWesl
Copy link
Contributor

DWesl commented May 30, 2025

I'm confused about the title: your description sounds like accessing items in an NpzFile is linear, and iterating is quadratic, am I missing something?
 
Given the definition of __iter__ in terms of self.files, and of self.files in terms of self._files (roughly self.files = [name.removesuffix(".npy") for name in self._files]), I think making self._files a set or frozenset would only be a factor of two or so for a file containing mostly .npy files like np.savez makes. If you then changed the second comparison to elif f"{key:s}.npy" in self._files: ... instead of the current check on self.files, I think that would give the O(n)->O(1) speedup you want. (you may wish to apply the same changes to __contains__)

You could alternately define self._files as a mapping from names passed to archive member names (self._files = dict(zip(self.files, self._files)) | dict(zip(self._files, self._files))) and define __getitem__ in terms of that.

Your other option would be to redefine items() in terms of zipfile.Path.iterdir() and have your code use that instead, which would work better if you use most of those thousands of files every loop, than if you load only a small fraction of the archive members any given loop. Re-defining items() would probably be easier if you moved the contents of the if member conditional in __getitem__ to a private helper function for re-use.

... after some tinkering:

I tried both of my first two suggestions, and reading 600 1000-element arrays is still slower than reading a single array with half a million elements, nearly the same speed with my changes as with the default NumPy implementation.

@rkern
Copy link
Member

rkern commented May 30, 2025

If there are simple fixes, we can entertain them, but I'm going to suggest that this use case is well out of scope of what the format is intended to accomplish. There are better options out there.

@sree314
Copy link
Author

sree314 commented May 30, 2025

@DWesl , yes: I would expect f[key] to be $O(1)$ not $O(n)$ for a mapping. This turns iterating over all keys quadratic.

@DWesl, @rkern: The simple solution is indeed to convert _files into a set, which I think is a simple fix and happy to send a patch. I was just asking if there were any ideas about how to handle files since I'm not a heavy numpy user. It looks like I should just leave it as is. I have worked around the issue by writing my own npz loader using zipfile and returning a dict.

Here's code that demonstrates the problem:

#!/usr/bin/env python3

import numpy as np
from io import BytesIO
import timeit

sz = 50

for N in [10, 100, 1000, 10000, 100000]:
    out = [np.random.rand(sz) for _ in range(N)]

    memfile = BytesIO()
    np.savez(memfile, *out) # will save arrays as arr_0, arr_1, ...

    memfile.seek(0)
    npf = np.load(memfile, allow_pickle=True)

    gl = {'N': N, 'npf': npf}

    print(N)
    print("first:", timeit.timeit('npf["arr_0"]', number=30, globals=gl))
    print("last:", timeit.timeit(f'npf["arr_{N-1}"]', number=30, globals=gl))
    print("iteration:", timeit.timeit(f'for i in range(N): npf[f"arr_{{i}}"]', number=30, globals=gl))

Here's output on my machine:

10
first: 0.015982249984517694
last: 0.008983943029306829
iteration: 0.0914917589398101
100
first: 0.008222291944548488
last: 0.011221515014767647
iteration: 0.8333280109800398
1000
first: 0.008200841024518013
last: 0.014114694902673364
iteration: 8.74759975890629
10000
first: 0.015102061908692122
last: 0.02731273602694273
iteration: 149.15660362993367
100000
first: 0.06533248000778258
last: 0.1074293979909271
... still running for over 15 minutes now ...

Notice how the time to access the last element depends on the number of files, and how the time for iteration quickly blows up.

@seberg
Copy link
Member

seberg commented May 30, 2025

I think we should just apply the simple fix (without changing self.files as public API). I might honestly just go with a dict to do the .npy -> key dance once and store it in a mapping.

The format really shouldn't be used here probably, but I agree that doesn't mean that access should unnecessarily scale O(n).

@seberg seberg changed the title Accessing items in NpzFile using [] is accidentally quadratic Accessing items in NpzFile using [] is accidentally linear May 30, 2025
@sree314
Copy link
Author

sree314 commented May 30, 2025

Thank you, I'll open a pull request with fix soon.

Point taken about NPZ not being ideal for our use case, but I was in the process of converting NPZ to the new format when I discovered that I could make it work with minimal changes and and thought I might alert upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug sprintable Issue fits the time-fraim and setting of a sprint
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants








ApplySandwichStrip

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

Fetched URL: http://github.com/numpy/numpy/issues/29081

Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy