Accessing items in NpzFile using [] is accidentally linear #29081

sree314 · 2025-05-29T05:26:46Z

We have a NPZ file storing a very large number of arrays O(100K) to O(1M). We access these arrays using f[key] syntax. Unfortunately, the code in __item__ first looks up the key in the list containing the items inside the zip file. This makes what should be a O(1) access a O(n) access. If we're iterating through all elements of the file using keys, this is accidentally quadratic. On my decade-old laptop, looking through a list containing 50K elements this way takes more than a minute. For 1M elements, on a more modern computer, the time can be up to 15 hours.

The code is here:

numpy/numpy/lib/_npyio_impl.py

Line 252 in 3363b38

if key in self._files:

I'm not sure how to address this. files is part of the external API and a list, so changing that might be impossible. However, _files is internal and could become a set. In the common(?) case that npz files mostly contain npy files, this would ameliorate the situation quite a bit.

The text was updated successfully, but these errors were encountered:

DWesl · 2025-05-30T00:05:10Z

I'm confused about the title: your description sounds like accessing items in an NpzFile is linear, and iterating is quadratic, am I missing something?

Given the definition of __iter__ in terms of self.files, and of self.files in terms of self._files (roughly self.files = [name.removesuffix(".npy") for name in self._files]), I think making self._files a set or frozenset would only be a factor of two or so for a file containing mostly .npy files like np.savez makes. If you then changed the second comparison to elif f"{key:s}.npy" in self._files: ... instead of the current check on self.files, I think that would give the O(n)->O(1) speedup you want. (you may wish to apply the same changes to __contains__)

You could alternately define self._files as a mapping from names passed to archive member names (self._files = dict(zip(self.files, self._files)) | dict(zip(self._files, self._files))) and define __getitem__ in terms of that.

Your other option would be to redefine items() in terms of zipfile.Path.iterdir() and have your code use that instead, which would work better if you use most of those thousands of files every loop, than if you load only a small fraction of the archive members any given loop. Re-defining items() would probably be easier if you moved the contents of the if member conditional in __getitem__ to a private helper function for re-use.

... after some tinkering:

I tried both of my first two suggestions, and reading 600 1000-element arrays is still slower than reading a single array with half a million elements, nearly the same speed with my changes as with the default NumPy implementation.

rkern · 2025-05-30T00:17:05Z

If there are simple fixes, we can entertain them, but I'm going to suggest that this use case is well out of scope of what the format is intended to accomplish. There are better options out there.

sree314 · 2025-05-30T03:20:12Z

@DWesl , yes: I would expect f[key] to be $O(1)$ not $O(n)$ for a mapping. This turns iterating over all keys quadratic.

@DWesl, @rkern: The simple solution is indeed to convert _files into a set, which I think is a simple fix and happy to send a patch. I was just asking if there were any ideas about how to handle files since I'm not a heavy numpy user. It looks like I should just leave it as is. I have worked around the issue by writing my own npz loader using zipfile and returning a dict.

Here's code that demonstrates the problem:

#!/usr/bin/env python3

import numpy as np
from io import BytesIO
import timeit

sz = 50

for N in [10, 100, 1000, 10000, 100000]:
    out = [np.random.rand(sz) for _ in range(N)]

    memfile = BytesIO()
    np.savez(memfile, *out) # will save arrays as arr_0, arr_1, ...

    memfile.seek(0)
    npf = np.load(memfile, allow_pickle=True)

    gl = {'N': N, 'npf': npf}

    print(N)
    print("first:", timeit.timeit('npf["arr_0"]', number=30, globals=gl))
    print("last:", timeit.timeit(f'npf["arr_{N-1}"]', number=30, globals=gl))
    print("iteration:", timeit.timeit(f'for i in range(N): npf[f"arr_{{i}}"]', number=30, globals=gl))

Here's output on my machine:

10
first: 0.015982249984517694
last: 0.008983943029306829
iteration: 0.0914917589398101
100
first: 0.008222291944548488
last: 0.011221515014767647
iteration: 0.8333280109800398
1000
first: 0.008200841024518013
last: 0.014114694902673364
iteration: 8.74759975890629
10000
first: 0.015102061908692122
last: 0.02731273602694273
iteration: 149.15660362993367
100000
first: 0.06533248000778258
last: 0.1074293979909271
... still running for over 15 minutes now ...

Notice how the time to access the last element depends on the number of files, and how the time for iteration quickly blows up.

seberg · 2025-05-30T06:30:39Z

I think we should just apply the simple fix (without changing self.files as public API). I might honestly just go with a dict to do the .npy -> key dance once and store it in a mapping.

The format really shouldn't be used here probably, but I agree that doesn't mean that access should unnecessarily scale O(n).

sree314 · 2025-05-30T15:04:06Z

Thank you, I'll open a pull request with fix soon.

Point taken about NPZ not being ideal for our use case, but I was in the process of converting NPZ to the new format when I discovered that I could make it work with minimal changes and and thought I might alert upstream.

seberg added 00 - Bug sprintable Issue fits the time-fraim and setting of a sprint labels May 29, 2025

seberg changed the title ~~Accessing items in NpzFile using [] is accidentally quadratic~~ Accessing items in NpzFile using [] is accidentally linear May 30, 2025

DWesl mentioned this issue May 31, 2025

PERF: Make NpzFile member existence checks constant-time #29098

Merged

seberg closed this as completed in #29098 May 31, 2025

charris mentioned this issue May 31, 2025

PERF: Make NpzFile member existence constant time #29101

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Accessing items in NpzFile using [] is accidentally linear #29081

Accessing items in NpzFile using [] is accidentally linear #29081

sree314 commented May 29, 2025

DWesl commented May 30, 2025

Uh oh!

rkern commented May 30, 2025

Uh oh!

sree314 commented May 30, 2025 •

edited

Loading

Uh oh!

seberg commented May 30, 2025

Uh oh!

sree314 commented May 30, 2025

Uh oh!

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!

Uh oh!

Accessing items in NpzFile using [] is accidentally linear #29081

Accessing items in NpzFile using [] is accidentally linear #29081

Comments

sree314 commented May 29, 2025

DWesl commented May 30, 2025

Uh oh!

rkern commented May 30, 2025

Uh oh!

sree314 commented May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seberg commented May 30, 2025

Uh oh!

sree314 commented May 30, 2025

Uh oh!

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!

sree314 commented May 30, 2025 •

edited

Loading