ENH: Add madvise for memmap objects #29260
Don't forget about the stubs :)
Found #13172 while looking for the "sprintable" issues at scipy25 and recognised it, since I ran across this before. It seems this PR implements everything that's needed and works for the application I had in mind, so I would appreciate it if this can be merged 🙂

More context: I ran into this 4 years ago when trying to read relatively random chunks of data across many files on NFS-mounted storage (context: particle physics data read via the

Trying out the changes from this PR, I could roughly simulate what I was doing back then with the following (the complete setup is a bit more involved and would require changes in downstream libraries):

# create 512M random data
import numpy as np
x = np.random.rand(2**26).view(np.uint8)
with open("data.bin", "wb") as f:
    f.write(x.tobytes())

Reading random chunks, but roughly sequentially:

import numpy as np
starts = np.sort(np.random.randint(0, 2**29, size=1024))
data = np.memmap("data.bin", dtype=np.uint8, mode="r")
for start in starts:
    data[start: start + 2**14].copy()

Depending on the filesystem and mount configuration, this reads a large fraction of the file although only a few percent of it has been touched. I checked this using the

$ vmtouch -e data.bin  # evict from cache

Then run the Python code and check the touched pages. On my laptop I get:

After this PR I can do:

import numpy as np
import mmap
starts = np.sort(np.random.randint(0, 2**29, size=1024))
data = np.memmap("data.bin", dtype=np.uint8, mode="r")
data.madvise(mmap.MADV_RANDOM)
for start in starts:
    data[start: start + 2**14].copy()

Which results in
Touching only the needed pages (again, details depend a bit on the system). This can help when running on a larger number of files, especially on NFS-mounted storage.
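For readers who want to try the same access-pattern hint without this PR, here is a minimal sketch that uses only the standard library's `mmap.madvise` (Python 3.8+, and only on platforms that provide `madvise`). The helper name `read_chunks_with_random_hint` and its parameters are made up for illustration and are not part of NumPy or of this PR.

```python
import mmap


def read_chunks_with_random_hint(path, starts, chunk=2**14):
    """Return `chunk`-byte reads at the given byte offsets.

    Hypothetical helper for illustration only; not a NumPy API.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # madvise() and MADV_RANDOM are platform dependent, so the hint
        # is applied only when both are available.
        if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
            mm.madvise(mmap.MADV_RANDOM)
        # Slicing an mmap object copies the requested bytes out of the map.
        return [mm[s:s + chunk] for s in starts]
```

Whether the hint actually reduces the number of pages read still depends on the kernel, the filesystem, and the mount options, as noted above.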
Thanks for your comment, I'm really happy to see this PR can help others. I'll make this PR ready to be merged asap.
@seberg could you please review this when you are free? I'll be active on it.
A simple wrapper around Python's mmap.madvise, as requested in #13172.

I've seen the long-stalled PR #24489 addressing similar functionality. I'm eager to contribute more to NumPy, and if the original author of that PR wishes to continue their efforts, I'm happy to step aside and close this PR.
After seeing the comments and review on that PR, I'm not sure how to handle views of the mmap.