gh-51067: Add remove() and repack() to ZipFile #134627
base: main
Changes from 64 commits
@@ -518,6 +518,66 @@ ZipFile Objects

   .. versionadded:: 3.11


.. method:: ZipFile.remove(zinfo_or_arcname)

   Removes a member from the archive. *zinfo_or_arcname* may be the full path
   of the member or a :class:`ZipInfo` instance.

   If multiple members share the same full path, only one is removed when
   a path is provided.

   This does not physically remove the local file entry from the archive.
   Call :meth:`repack` afterwards to reclaim space.
danny0838 marked this conversation as resolved.

   The archive must be opened with mode ``'w'``, ``'x'`` or ``'a'``.

   Returns the removed :class:`ZipInfo` instance.

   Calling :meth:`remove` on a closed ZipFile will raise a :exc:`ValueError`.

   .. versionadded:: next
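
For comparison, the following is a minimal sketch of the workaround available with the current standard library: copying every member except the unwanted one into a new archive. It is not the method proposed in this diff. `copy_without` is a hypothetical helper name, and unlike `remove()` as documented above it drops every member with the given name and rewrites the whole archive instead of editing it in place.

```python
import io
import zipfile


def copy_without(src, dst, arcname):
    """Copy every member of *src* except those named *arcname* into *dst*."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            if info.filename != arcname:
                # Reuse the member's ZipInfo so its name, timestamp and
                # compression method carry over to the new archive.
                zout.writestr(info, zin.read(info))


# Demo with in-memory archives.
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"beta")

dst = io.BytesIO()
copy_without(src, dst, "a.txt")
```

The cost of this approach grows with the size of the whole archive, which is the motivation for an in-place `remove()` followed by an optional `repack()`.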


.. method:: ZipFile.repack(removed=None, *, \
                           strict_descriptor=False[, chunk_size])

   Rewrites the archive to remove stale local file entries, shrinking its file
   size.

   If *removed* is provided, it must be a sequence of :class:`ZipInfo` objects
   representing removed entries; only their corresponding local file entries
   will be removed.

   If *removed* is not provided, the archive is scanned to identify and remove
   local file entries that are no longer referenced in the central directory.
   The algorithm assumes that local file entries (and the central directory,
   which is mostly treated as the "last entry") are stored consecutively:

I think it would be good to define what you mean by a stale local file entry here. In the third paragraph it is somewhat explained but I would suggest adding something earlier on about what a stale file entry is.

I think that "stale local file entries" is the general abstract concept of what this method does (and should do). From the context, readers should be able to get that "stale local file entries" is defined as "local file entries referenced by the provided removed ZipInfo objects" when removed is provided, and "local file entries that are no longer referenced in the central directory (and meeting the 3 criteria)" when removed is not provided, and potentially another definition if more algorithms/modes are added in the future. I can be more explicit by saying "stale local file entries" is defined as above, but it would probably be too redundant.

Readers should not need to read multiple paragraphs to infer the meaning of a phrase in the first sentence of documentation when it can be briefly defined earlier on instead. I think defining what "stale" means in a stale local file entry would be sufficient, as that is not a term coming from appnote.txt, and only introduced in these docs.

As aforementioned, the definition of "stale" is difficult and almost as complex/long as paragraphs 2~4. Even if we provide the "definition" first, the reader still needs to read the equally long sentences to get it.

can simply replace "stale" by "a local file entry that doesn't exist in the central directory" or something similar.

What about simply unreferenced local file entries?

ambiguous imo

   #. Data before the first referenced entry is removed only when it appears to
      be a sequence of consecutive entries with no extra following bytes; extra
      preceding bytes are preserved.

This seems hard to work around if this isn't the behavior a user wants. i.e. if the bytes

I think documenting the list of limitations of repack's scan would be an acceptable alternative to adding this.

Can you be more specific about the "some other format" you are mentioning? If it's something like:
then the unreferenced local file entry will be removed (since it's "a sequence of consecutive entries with extra preceding bytes"). If it's something like:
or
then all bytes before the first local file entry will be preserved. I think this is the reasonable behavior.

There are many formats that build on top of zip, such as doc, java class files, etc. These have prefixes and you need to be sure what you detect is an unreferenced file entry vs a local file entry magic and noise. As far as I am aware, there's no way to be certain what you have is an actual file entry. So there are potentially cases where the following layout could be a misinterpretation of a prefix to a zip file, and repack would incorrectly remove data from that prefix.

Yes, there is no 100% definite way to determine that a sequence of bytes is a stale local file entry; that's why I call it a heuristic. But I think the current criteria are accurate enough for most real-world cases, and a false removal is very, very unlikely to happen. If you don't agree with me, can you provide a good example or test case where normal bytes are misinterpreted and falsely removed as local file entries before the first referenced local file entry? Without this it's kind of like a dry talk, which is non-constructive.

Given that heuristics and data scanning are involved I think it is worthwhile to add a
Realistically I think people with (2) style formats should know this but it is good to set expectations. I'm asking for (1) preemptively because we're basically guaranteed to get security "vulnerability" reports about all possible API behaviors today [I'm on the security team - we really do] so pre-documenting the thing makes replying to those easier. 😺

That's exactly the kind of thing someone would do in a future security@ report. :P We can stay ahead of that by at least not offering a guarantee of safety for this API. The first innocent real world starting points that come to mind are zip files embedded stored within zip files. And file formats involving multiple zip files within them where only one of them appears at the end of the file and thus acts like a zip to normal zip file tooling such as

For (1): not only
For (2): this is what
This can even be improved by providing an option to check the CRC when validating everything that looks like a "local file entry". Though it would impact performance significantly and I don't think it's worth it. A prepended zip should be safe since it has a central directory, which will be identified as "extra following bytes" and skipped from stripping. Unless the zip is abnormal, e.g. having the last local file entry overlapping in the central comment and thus having no additional bytes after its end. A zip embedded as the content of a member is also 100% safe since the algorithm won't strip anything inside a local file entry. A zip embedded immediately after a local file entry will be falsely stripped, but it's explicitly precluded by the documented presumption that "local file entries are stored consecutively", and should be something unlikely to happen on a normal zip-like file. Given that the current documentation already explains its assumption and algorithm, I expect the developer to be able to estimate the risk on their own. Although it's not 100% safe, worrying about this may be something like worrying about a repository breaking due to SHA-1 collision when using Git.

I agree that it would be good to set a fair expectation on the heuristics-based algorithm and encourage providing removed for better performance and accuracy, but I also don't want to give the impression that the algorithm is something fragile that could easily blow up on a random input. Don't you think it's overkill for Git to show a big warning saying that it doesn't guarantee your data won't break accidentally? Anyway, I'm open to this. It's welcome if someone can provide a good rephrasing or note without such issues.

Revised in 93db94a26

   #. Data between referenced entries is removed only when it appears to
      be a sequence of consecutive entries with no extra preceding bytes; extra
      following bytes are preserved.
   #. Entries must not overlap. If any entry's data overlaps with another, a
      :exc:`BadZipFile` error is raised and no changes are made.
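
To make the regions discussed in the three rules above concrete, here is a rough sketch, using only the current standard library, that computes the byte spans covered by referenced entries and the ranges that lie before or between them. `referenced_spans` and `find_gaps` are hypothetical helper names, not code from this PR. The sketch assumes entries without data descriptors and makes no attempt to decide whether a gap actually holds stale entries.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header: signature, five 16-bit fields,
# CRC plus two 32-bit sizes, then the name and extra-field lengths.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def referenced_spans(data):
    """Return sorted (start, end) spans of entries in the central directory."""
    spans = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            fields = LOCAL_HEADER.unpack_from(data, info.header_offset)
            if fields[0] != b"PK\x03\x04":
                raise zipfile.BadZipFile("bad local file header signature")
            name_len, extra_len = fields[9], fields[10]
            end = (info.header_offset + LOCAL_HEADER.size
                   + name_len + extra_len + info.compress_size)
            spans.append((info.header_offset, end))
    return sorted(spans)


def find_gaps(spans):
    """Return ranges before/between spans; reject overlapping spans."""
    gaps, pos = [], 0
    for start, end in spans:
        if start < pos:
            raise zipfile.BadZipFile("overlapping entries")
        if start > pos:
            gaps.append((pos, start))
        pos = end
    return gaps


# Demo: a ZIP archive appended to 32 bytes of unrelated leading data.
buf = io.BytesIO(b"#" * 32)
with zipfile.ZipFile(buf, "a") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"beta")

spans = referenced_spans(buf.getvalue())
gaps = find_gaps(spans)
```

In the demo the only gap is the leading data, which rule 1 would preserve because it does not look like a sequence of local file entries.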

   When scanning, setting ``strict_descriptor=True`` disables detection of any

Why not default to strict_descriptor=True given that it performs better and the zip files we expect people to be manipulating in remove/repack manners are presumed most likely to be "modern" forms?

This is exactly one of the open questions (#134627 (comment), #134627 (comment)). The current quick decision is primarily because it adheres better to the spec and most of the Python stdlib tends to prioritize compatibility over performance. E.g.

   entry using an unsigned data descriptor (deprecated in the ZIP specification
   since version 6.3.0, released on 2006-09-29, and used only by some legacy
   tools). This improves performance, but may cause some stale entries to be
   preserved.

   *chunk_size* may be specified to control the buffer size when moving
   entry data (default is 1 MiB).
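
As an illustration of what a buffer size means when moving entry data, the sketch below shifts a block of bytes toward the start of a file in fixed-size chunks. `move_block` is a hypothetical helper written for this example and is not taken from the PR; it only assumes a seekable, readable and writable binary file object.

```python
import io


def move_block(fp, src, dst, length, chunk_size=2**20):
    """Move *length* bytes from offset *src* to a lower offset *dst* in chunks."""
    if dst > src:
        raise ValueError("this sketch only moves data toward the start")
    while length > 0:
        fp.seek(src)
        chunk = fp.read(min(chunk_size, length))
        if not chunk:
            raise EOFError("unexpected end of file")
        # Writing below the read position never clobbers unread data.
        fp.seek(dst)
        fp.write(chunk)
        src += len(chunk)
        dst += len(chunk)
        length -= len(chunk)


# Demo: close a 4-byte gap ("XXXX") using a deliberately tiny chunk size.
fp = io.BytesIO(b"AAAAXXXXBBBBBBBB")
move_block(fp, src=8, dst=4, length=8, chunk_size=3)
fp.truncate(12)
result = fp.getvalue()
```

A larger chunk size means fewer read and write calls at the cost of more memory per call.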

   The archive must be opened with mode ``'a'``.
danny0838 marked this conversation as resolved.

   Calling :meth:`repack` on a closed ZipFile will raise a :exc:`ValueError`.

   .. versionadded:: next


The following data attributes are also available:

.. attribute:: ZipFile.filename

I think it would be good to add that which one gets removed is unspecified and should not be relied on (to defensively discourage mis-use).

Actually the current implementation is definite: the one mapped by ZipFile.getinfo(name) will get removed, and the last one in the filelist with the same name (if it exists) will be the new mapped one. This is similar to what will be mapped by ZipFile.getinfo(name) if there are multiple zinfos with the same name. The current implementation is always the last one in the filelist, though it's also undocumented. The question is whether we should document the definite behavior or state that it's actually undefined and the current behavior should not be relied on. Until the question is resolved I would just keep the current statement.

I think I would lean towards saying the behavior is undefined in this method. I would want some discussion about documenting the behavior with multiple zinfos.

I haven't gone back and looked, but I have a vague recollection that multiple zip entries with the same name is/was a "normal" zip file legacy-ish "feature" as that was how replacing the contents of one zip file member was implemented in cases where the entire thing cannot be rewritten as it could be done solely by rewriting the end of the file and central directory.

@gpshead This is the convention of many ZIP tools, including Python, at least currently. Unfortunately it's not clearly documented in the ZIP spec.
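
The lookup behavior described in this thread can be observed with the current standard library. The sketch below writes two members with the same name and checks which one `getinfo()` and `read()` resolve to. As the thread notes, this behavior is undocumented, so the sketch shows what current CPython does and is not a guarantee.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.txt", b"old")
    with warnings.catch_warnings():
        # zipfile emits a "Duplicate name" UserWarning for the second write.
        warnings.simplefilter("ignore", UserWarning)
        zf.writestr("data.txt", b"new")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()              # both entries are listed
    mapped = zf.getinfo("data.txt")    # name lookup picks one of them
    is_last = mapped is zf.infolist()[-1]
    content = zf.read("data.txt")
```

Here the name resolves to the entry written last, which matches the "replace by appending" convention mentioned above.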