ENH: Add madvise for memmap objects #29260

Open
wants to merge 5 commits into
base: main

IndifferentArea
Contributor

A simple wrapper around Python's mmap.madvise, as requested in #13172.
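For readers unfamiliar with the underlying call, here is a minimal sketch of the standard-library `mmap.madvise` that the wrapper builds on. It is not the PR's implementation: the helper name `advise` is hypothetical, and the guard assumes that platforms without `madvise()` simply lack the `MADV_*` constants.

```python
import mmap
import tempfile


def advise(mapping: mmap.mmap, option_name: str) -> bool:
    """Apply a madvise hint by name; return False if it is unsupported here.

    `advise` is a hypothetical helper for illustration, not a NumPy API.
    """
    option = getattr(mmap, option_name, None)  # e.g. mmap.MADV_RANDOM
    if option is None or not hasattr(mapping, "madvise"):
        return False  # platform without madvise() support
    mapping.madvise(option)  # no start/length given: advise the whole mapping
    return True


with tempfile.TemporaryFile() as f:
    f.write(b"\0" * mmap.PAGESIZE * 4)  # four pages of zeros
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        applied = advise(m, "MADV_RANDOM")
        first_byte = m[0]  # the mapping stays readable after the hint
```

The hint is advisory only: the kernel may change its read-ahead behaviour, but the contents of the mapping are unaffected.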

I've seen the long-stalled PR #24489 addressing similar functionality. I'm eager to contribute more to NumPy, and if the original author of that PR wishes to continue their efforts, I'm happy to step aside and close this PR.

After reading the comments and review on that PR, I'm not sure how to handle views of a memmap.
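One possible reading of the view question is that a view covers only part of the mapping, while `madvise` expects a page-aligned start. Under that assumption, advising just the bytes a view covers would mean rounding its byte range outward to page boundaries. The helper below is a hypothetical sketch of that arithmetic, not something taken from this PR or from #24489.

```python
import mmap


def page_aligned_range(byte_offset: int, nbytes: int, page: int = mmap.PAGESIZE):
    """Round [byte_offset, byte_offset + nbytes) outward to page boundaries.

    Returns (start, length), both multiples of `page`, so they could be
    passed as the start/length arguments of mmap.madvise.
    """
    start = (byte_offset // page) * page              # floor to a page boundary
    end = -(-(byte_offset + nbytes) // page) * page   # ceiling to a page boundary
    return start, end - start


# A 100-byte view starting at byte 5000 lies entirely inside the second 4 KiB page.
aligned_start, aligned_length = page_aligned_range(5000, 100, page=4096)
```

Rounding outward means the hint may also cover bytes outside the view, which is one reason the semantics for views need a deliberate decision.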

@IndifferentArea IndifferentArea marked this pull request as draft June 23, 2025 14:30
Copy link
Member

@jorenham jorenham left a comment

Don't forget about the stubs :)

@IndifferentArea IndifferentArea marked this pull request as ready for review June 30, 2025 13:34
@nikoladze

Found #13172 while looking for the "sprintable" issues at scipy25 and recognised it since I ran across this before. It seems this PR implements everything that's needed and seems to work for the application I had in mind. So I would appreciate it if this could be merged 🙂

More context:


I ran into this 4 years ago when trying to read relatively random chunks of data across many files on NFS-mounted storage (context: particle physics data read via the uproot library, which used memory-mapped files back then).

Trying out the changes from this PR, I could roughly simulate what I was doing back then with the following (the complete setup is a bit more involved and would require changes in downstream libraries):

# create 512M random data
import numpy as np

x = np.random.rand(2**26).view(np.uint8)
with open("data.bin", "wb") as f:
    f.write(x.tobytes())

Reading random chunks, but in roughly sequential order:

import numpy as np
starts = np.sort(np.random.randint(0, 2**29, size=1024))
data = np.memmap("data.bin", dtype=np.uint8, mode="r")
for start in starts:
    data[start: start + 2**14].copy()

Depending on the filesystem and mount configuration, this reads a large fraction of the file, although the loop touches only a few percent of it (1024 chunks of 16 KiB, about 16 MiB of 512 MiB). I checked this using the vmtouch tool:

$ vmtouch -e data.bin # evict from cache

Then run the Python code and check the touched pages. On my laptop I get:

$ vmtouch data.bin
           Files: 1
     Directories: 0
  Resident Pages: 29021/131072  113M/512M  22.1% # <- before the changes
         Elapsed: 0.017649 seconds

With this PR I can do:

import numpy as np
import mmap
starts = np.sort(np.random.randint(0, 2**29, size=1024))
data = np.memmap("data.bin", dtype=np.uint8, mode="r")
data.madvise(mmap.MADV_RANDOM)
for start in starts:
    data[start: start + 2**14].copy()

which results in:

$ vmtouch data.bin
           Files: 1
     Directories: 0
  Resident Pages: 5032/131072  19M/512M  3.84% # <- after the changes
         Elapsed: 0.012261 seconds

Now only the needed pages are touched (again, details depend a bit on the system). This can help when running on a larger number of files, especially on NFS-mounted storage.
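As a rough sanity check on the two vmtouch results, the access pattern itself bounds how many pages need to be resident. The sketch below assumes a 4096-byte page (consistent with 131072 pages for 512M in the vmtouch output) and that each unaligned 16 KiB read can spill into one extra page; the variable names are illustrative only.

```python
PAGE = 4096          # assumed page size in bytes
FILE_SIZE = 2**29    # 512 MiB test file
CHUNK = 2**14        # 16 KiB per read
N_READS = 1024       # number of random chunks

total_pages = FILE_SIZE // PAGE
# An unaligned 16 KiB slice covers 4 full pages' worth of bytes but can
# straddle 5 distinct pages.
pages_per_chunk = CHUNK // PAGE + 1
max_needed_pages = N_READS * pages_per_chunk

observed_before = 29021  # resident pages reported without madvise
observed_after = 5032    # resident pages reported with MADV_RANDOM

print(f"upper bound on needed pages: {max_needed_pages} "
      f"({max_needed_pages / total_pages:.2%} of the file)")
print(f"without hint: {observed_before / max_needed_pages:.1f}x the bound")
print(f"with hint: within bound = {observed_after <= max_needed_pages}")
```

Under these assumptions the bound is 5120 pages, so the 5032 pages observed with `MADV_RANDOM` sit just below what the reads themselves require, while the run without the hint pulled in several times more.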

@IndifferentArea
Contributor Author

IndifferentArea commented Jul 13, 2025

It seems this PR implements everything that's needed and seems to work for the application I had in mind.

Thanks for your comment, I'm really happy to see that this PR can help others. I'll make this PR ready to be merged ASAP.

@IndifferentArea IndifferentArea requested a review from jorenham July 13, 2025 02:12
@IndifferentArea
Contributor Author

@seberg could you please review this when you are free? I'll be active on it.

3 participants