Unnecessary dependence in fromfile on the ability to seek into a file #28840

Open
eyalroz opened this issue Apr 27, 2025 · 6 comments
@eyalroz

eyalroz commented Apr 27, 2025

If, in my script, my input file is a PIPE, e.g. I call my script like so:

cat nums.fp32.bin | ./myscript

and my script contains the line:

arr = numpy.fromfile(sys.stdin, dtype=numpy.dtype('f32'))

then NumPy fails. When I run the script, I get:

$ cat nums.fp32.bin | ./myscript
File "./myscript", line 123, in main
    arr = numpy.fromfile(sys.stdin, dtype=numpy.dtype('f32'))
OSError: obtaining file position failed

Why is NumPy not "falling back" on reading individual values, without knowing the overall number of them?

@seberg
Member

seberg commented Apr 28, 2025

That would require memory resizing, etc., which fromfile doesn't do. I suppose it could do it, but there seems to be no downside to me in reading via: data = raw_reading; arr = np.frombuffer(data, dtype='f') for binary data.
(For non-binary reads, np.loadtxt should be superior anyway.)
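
The workaround suggested above can be sketched roughly as follows. This is only an illustration, not NumPy's own code: `read_float32_stream` is a hypothetical helper name, `io.BytesIO` stands in for the pipe, and 32-bit floats are assumed (spelled `np.float32` here). In a real script the stream would typically be `sys.stdin.buffer`, since `sys.stdin` itself is a text stream.

```python
import io

import numpy as np


def read_float32_stream(stream):
    """Read all remaining bytes from a binary stream and view them as float32.

    Hypothetical helper for illustration; it never calls seek() or tell().
    """
    data = stream.read()  # reads until EOF, which also works on pipes
    # frombuffer raises ValueError if len(data) is not a multiple of 4
    return np.frombuffer(data, dtype=np.float32)


# Stand-in for piped input; a script would pass sys.stdin.buffer instead
fake_pipe = io.BytesIO(np.array([1.0, 2.5, -3.0], dtype=np.float32).tobytes())
arr = read_float32_stream(fake_pipe)
print(arr)
```

Because the array is backed by an immutable `bytes` object, it is read-only; call `arr.copy()` if a writable array is needed.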

@eyalroz
Author

eyalroz commented Apr 28, 2025

Well, it seems - naively - that reading raw, then calling frombuffer, would mean requiring twice the memory, at least temporarily... but regardless - the point is that it's not justified for NumPy to fail. That is, the fact that the user of numpy could have gone to the effort of checking the seekability of the file and writing something differently is not reason enough for refusing to perform the read within the library itself.

@seberg
Member

seberg commented Apr 28, 2025

> Well, it seems - naively - that reading raw, then calling frombuffer

Just for the record (not 100% sure you are aware): There is no additional copy involved.
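
A small check of the "no additional copy" point, as a sketch: `np.frombuffer` returns an array that views the memory of the object it is given rather than owning its own. A `bytearray` is used here (an assumption made for the demonstration) only because it is mutable, which makes the sharing visible.

```python
import numpy as np

raw = bytearray(b"\x01\x02\x03\x04")
view = np.frombuffer(raw, dtype=np.uint8)

# The array does not own its data; it borrows the bytearray's memory
owns_data = view.flags.owndata

# A change made through the bytearray is visible through the array
raw[0] = 255
first_after_change = int(view[0])

print(owns_data, first_after_change)
```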

@eyalroz
Author

eyalroz commented Apr 28, 2025

> Just for the record (not 100% sure you are aware): There is no additional copy involved.

Ok, that's a nice optimization, kudos... I'm not a Pythonista, so I wouldn't know.

However, in that case - why not just write the few lines of raw reading and a frombuffer invocation, in case you've discovered that the file is unseekable? It seems like little effort on your part, based on what you've said; it rounds out the API; it doesn't fill your codebase with only-used-once functionality... seems like a win.
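
The fallback proposed here might look something like the sketch below. It is user-level code written under assumptions, not what NumPy does or plans to do: `fromfile_with_fallback` is a hypothetical name, and the seekable branch assumes a real on-disk file object. An OS pipe is used so that the stream is genuinely unseekable.

```python
import os

import numpy as np


def fromfile_with_fallback(f, dtype):
    """Hypothetical helper: prefer np.fromfile, fall back for unseekable streams."""
    if f.seekable():
        # Assumes f is a real file opened in binary mode
        return np.fromfile(f, dtype=dtype)
    # Pipes and similar streams: read everything, then view it without copying
    return np.frombuffer(f.read(), dtype=dtype)


expected = np.arange(4, dtype=np.float32)

# Build a real pipe so the reader end cannot seek
read_fd, write_fd = os.pipe()
os.write(write_fd, expected.tobytes())
os.close(write_fd)  # closing the writer lets read() reach EOF

with os.fdopen(read_fd, "rb") as pipe_reader:
    was_seekable = pipe_reader.seekable()
    result = fromfile_with_fallback(pipe_reader, np.float32)

print(was_seekable, result)
```

One difference between the two branches is worth keeping in mind: the fallback returns a read-only array when the data comes from `bytes`, whereas `np.fromfile` returns a writable one.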

@aureliobarbosa
Contributor

Why not just improve the documentation by clearly stating the file is a raw binary file and not a buffered file?

I think the same issue occurs with numpy.rec.fromfile, but there the documentation clearly states that "The file object must support random access (i.e. it must have tell and seek methods).", as indicated below:

@set_module("numpy.rec")
def fromfile(fd, dtype=None, shape=None, offset=0, formats=None,
             names=None, titles=None, aligned=False, byteorder=None):
    """Create an array from binary file data

    Parameters
    ----------
    fd : str or file type
        If file is a string or a path-like object then that file is opened,
        else it is assumed to be a file object. The file object must
        support random access (i.e. it must have tell and seek methods).
    dtype : data-type, optional

This would avoid confusion and close the issue. If a maintainer agrees with that, I can do the PR.

Otherwise, it would be better to tag this issue as an enhancement.

@seberg
Member

seberg commented May 14, 2025

> Why not just improve the documentation by clearly stating the file is a raw binary file and not a buffered file?

I think that sounds great. It doesn't mean we can't ever extend it (although if we point to alternatives, I am not sure that it is necessary).

3 participants









Fetched URL: https://github.com/numpy/numpy/issues/28840
