gh-51067: Add remove() and repack() to ZipFile #134627

Open: wants to merge 72 commits into base: main
Showing changes from 64 commits
Commits (72):
6aed859
Add `remove()` and `repack()` to `ZipFile`
danny0838 May 24, 2025
5453dbc
📜🤖 Added by blurb_it.
blurb-it[bot] May 24, 2025
80ab2e2
Fix and optimize test code
danny0838 May 24, 2025
72c2a66
Handle common setups with `setUpClass`
danny0838 May 24, 2025
a4b410b
Add tests for mode `w` and `x` for `remove()`
danny0838 May 24, 2025
a9e85c6
Introduce `_calc_initial_entry_offset` and refactor
danny0838 May 24, 2025
236cd06
Optimize `_calc_initial_entry_offset` by introducing cache
danny0838 May 24, 2025
bdc58c7
Introduce `_validate_local_file_entry` and refactor
danny0838 May 24, 2025
c3c8345
Introduce `_debug` and refactor
danny0838 May 24, 2025
1b7d75a
Introduce `_move_entry_data` and rework chunk_size passing
danny0838 May 25, 2025
51c9254
Refactor `_validate_local_file_entry`
danny0838 May 25, 2025
0d971d8
Add `strict_descriptor` option
danny0838 May 25, 2025
8f0a504
Fix and improve validation tests
danny0838 May 25, 2025
0cb8682
Remove obsolete NameToInfo updating
danny0838 May 25, 2025
a788a00
Use `zinfo` rather than `info`
danny0838 May 25, 2025
ae01b8c
Raise on overlapping file blocks
danny0838 May 25, 2025
edee203
Rework writing protection
danny0838 May 25, 2025
555ac78
Update doc
danny0838 May 25, 2025
95fde31
Fix typo
danny0838 May 26, 2025
8a448e4
Add test for bytes between file entries
danny0838 May 26, 2025
4c35eb2
Check `testzip()` after zip file closed
danny0838 May 26, 2025
926338c
Support `repack(removed)`
danny0838 May 26, 2025
e76f9a1
Fix bytes between entries be removed when `removed` is passed
danny0838 May 26, 2025
93f4c25
Fix bad test code
danny0838 May 26, 2025
9e94209
Revise docstring
danny0838 May 27, 2025
3ef72c6
Add `tearDown` for tests
danny0838 May 28, 2025
fbf7588
Rename methods and parameters
danny0838 May 28, 2025
81a419a
Adjust parameter order
danny0838 May 28, 2025
c62a455
Optimize code and revise comment
danny0838 May 28, 2025
a05353c
Improve debug for `_ZipRepacker.repack()`
danny0838 May 29, 2025
3d0240c
Rework `_validate_local_file_entry_sequence` to return size or None
danny0838 May 29, 2025
31c4c93
Rework `_validate_local_file_entry_sequence` to allow passing no `che…
danny0838 May 29, 2025
f8fade1
Introduce `_scan_data_descriptor_no_sig_by_decompression`
danny0838 May 30, 2025
c80d21b
Strip only entries immediately following a referenced entry
danny0838 May 29, 2025
e1caea9
Adjust method names
danny0838 May 30, 2025
2b23d46
Add memory usage test
danny0838 May 30, 2025
de4f15b
Fix rst
danny0838 May 30, 2025
ea3259f
Optimize code
danny0838 Jun 1, 2025
fef92c4
Fix and optimize `_iter_scan_signature`
danny0838 Jun 1, 2025
8067b0c
Fix `_scan_data_descriptor`
danny0838 Jun 1, 2025
92d3a9c
Fix and optimize `_scan_data_descriptor_no_sig`
danny0838 Jun 1, 2025
b5d7ae3
Rename `_trace_compressed_block_end`
danny0838 Jun 1, 2025
1d5ec61
Fix `_scan_data_descriptor_no_sig_by_decompression`
danny0838 Jun 1, 2025
db9d0d6
Add tests for `_ZipRepacker`
danny0838 Jun 1, 2025
aaa566c
Remove unneeded import
danny0838 Jun 1, 2025
578c7c8
Add requirements
danny0838 Jun 1, 2025
c470c33
Fix `_scan_data_descriptor_no_sig_by_decompression` when library not …
danny0838 Jun 1, 2025
b1dcb07
Test with pre-calculated CRC
danny0838 Jun 1, 2025
04cddef
Remove unneeded import
danny0838 Jun 1, 2025
797a62c
Fix and optimize `repack`
danny0838 Jun 1, 2025
3b2f232
Remove unneeded catch type
danny0838 Jun 14, 2025
cb549c9
Patch more explicitly
danny0838 Jun 14, 2025
0f50a6f
Remove unneeded variables
danny0838 Jun 14, 2025
c759b63
Improve dependency check for decompression tests
danny0838 Jun 14, 2025
1ece5b1
Refactor and optimize `RepackHelperMixin`
danny0838 Jun 14, 2025
ce88616
Update NEWS
danny0838 Jun 20, 2025
5f093e5
Sync with danny0838/zipremove@1691ca25bf971cf1e45d5ed7d22c512636f20cb8
danny0838 Jun 20, 2025
11c0937
Revise NEWS
danny0838 Jun 20, 2025
4b2176e
Sync with danny0838/zipremove@1843d87b70e6cb129fb55446eaf4486a87d2af4d
danny0838 Jun 21, 2025
d9824ce
Fix timezone related timestamp issue
danny0838 Jun 21, 2025
85811ab
Simplify tests with data descriptors
danny0838 Jun 22, 2025
748ac63
Sync with danny0838/zipremove@e79042768f3c2541e0226f6bed3a9ff2ee04fac0
danny0838 Jun 23, 2025
001a8d0
Sync with danny0838/zipremove@87bcdb50411a355d24c35f31dcbe4273c0568cf8
danny0838 Jun 24, 2025
3a364ce
Sync with danny0838/zipremove@6a78bd15de87afde510f8a1b6364365c6e17f252
danny0838 Jun 25, 2025
0832528
Sync with danny0838/zipremove@092f98b4d7b3a0cd335fe4ba64e7090ebb3dc6da
danny0838 Jun 27, 2025
f20ec5d
Revise doc for `repack`
danny0838 Jun 28, 2025
8e69c09
Revise doc for `remove`
danny0838 Jun 28, 2025
725b1a3
Update `data_offset`
danny0838 Jun 29, 2025
9e82bb7
Revise doc for `repack`
danny0838 Jul 1, 2025
93db94a
Revise doc for `repack`
danny0838 Jul 2, 2025
72673e0
Sync with danny0838/zipremove@8bedf7c9b891acadc3393d2f1267b78bd9b5a49a
danny0838 Jul 3, 2025
e926a95
Sync with danny0838/zipremove@86a240bf019fe9212b1e72c963306186163fb8b8
danny0838 Jul 22, 2025
Doc/library/zipfile.rst (60 additions, 0 deletions)
@@ -518,6 +518,66 @@ ZipFile Objects
.. versionadded:: 3.11


.. method:: ZipFile.remove(zinfo_or_arcname)

Removes a member from the archive. *zinfo_or_arcname* may be the full path
of the member or a :class:`ZipInfo` instance.

If multiple members share the same full path, only one is removed when
a path is provided.
Member

I think it would be good to add that which one gets removed is unspecified and should not be relied on (to defensively discourage mis-use).

Author

Actually, the current implementation is deterministic: the entry mapped by ZipFile.getinfo(name) gets removed, and the last remaining entry in the filelist with the same name (if any) becomes the newly mapped one.

This mirrors what ZipFile.getinfo(name) maps to when multiple zinfos share the same name: in the current implementation it is always the last one in the filelist, though that is also undocumented.

The question is whether we should document the definite behavior or state that it is undefined and that the current behavior should not be relied on. Until that question is resolved, I would keep the current statement.
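
As an illustration of the `getinfo()` mapping described above, here is a minimal sketch that runs on current CPython without this PR. It relies on the behavior stated in the comment (the last entry with a given name wins), which is undocumented and could change; the archive contents and member names are made up for the example.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"old")
    with warnings.catch_warnings():
        # zipfile emits a UserWarning ("Duplicate name") for the second write.
        warnings.simplefilter("ignore", UserWarning)
        zf.writestr("a.txt", b"new")

with zipfile.ZipFile(buf) as zf:
    # Both entries are listed in the central directory...
    names = [zi.filename for zi in zf.infolist()]
    # ...but the name lookup resolves to the last one in the filelist.
    mapped = zf.getinfo("a.txt")
    is_last = mapped is zf.infolist()[-1]
    content = zf.read("a.txt")
```

Under the behavior described in the comment, removing `"a.txt"` by name would drop the entry that `getinfo()` currently maps to, after which the earlier entry would become the mapped one.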

Member

I think I would lean towards saying the behavior is undefined in this method. I would want some discussion about documenting the behavior with multiple zinfos.

Member

I haven't gone back and looked, but I vaguely recall that multiple zip entries with the same name is/was a "normal", legacy-ish zip "feature": it was how replacing the contents of one zip file member was implemented in cases where the entire file could not be rewritten, since it could be done solely by rewriting the end of the file and the central directory.

Author

@gpshead This is the convention of many ZIP tools, including Python, at least currently. Unfortunately it's not clearly documented in the ZIP spec.


This does not physically remove the local file entry from the archive.
Call :meth:`repack` afterwards to reclaim space.

The archive must be opened with mode ``'w'``, ``'x'`` or ``'a'``.

Returns the removed :class:`ZipInfo` instance.

Calling :meth:`remove` on a closed ZipFile will raise a :exc:`ValueError`.

.. versionadded:: next


.. method:: ZipFile.repack(removed=None, *, \
strict_descriptor=False[, chunk_size])

Rewrites the archive to remove stale local file entries, shrinking its file
size.

If *removed* is provided, it must be a sequence of :class:`ZipInfo` objects
representing removed entries; only their corresponding local file entries
will be removed.

If *removed* is not provided, the archive is scanned to identify and remove
local file entries that are no longer referenced in the central directory.
The algorithm assumes that local file entries (and the central directory,
which is mostly treated as the "last entry") are stored consecutively:
Member

I think it would be good to define what you mean by a stale local file entry here. In the third paragraph it is somewhat explained but I would suggest adding something earlier on about what a stale file entry is.

Author

I think "stale local file entries" is the general, abstract concept of what this method removes (and should remove).

According to the context readers should be able to get that "stale local file entries" is defined as "local file entries referenced by the provided removed ZipInfo objects" when removed is provided, and "local file entries that are no longer referenced in the central directory (and meeting the 3 criteria)" when removed is not provided, and potentially another definition if more algorithm/mode is added in the future.

I could be more explicit and say that "stale local file entries" is defined as above, but that would probably be too redundant.

Member

According to the context readers should be able to get that "stale local file entries" is defined as "local file entries referenced by the provided removed ZipInfo objects" when removed is provided, and "local file entries that are no longer referenced in the central directory (and meeting the 3 criteria)" when removed is not provided, and potentially another definition if more algorithm/mode is added in the future.

Readers should not need to read multiple paragraphs to infer the meaning of a phrase in the first sentence of documentation when it can be briefly defined earlier on instead.

I think defining what "stale" means in "stale local file entry" would be sufficient, as that term does not come from appnote.txt and is only introduced in these docs.

Author (danny0838, Jun 27, 2025)

As mentioned above, the definition of "stale" is difficult and almost as complex and long as paragraphs 2-4. Even if we provide the "definition" first, the reader still needs to read equally long sentences to get it.

Commenter

You can simply replace "stale" with "a local file entry that doesn't exist in the central directory" or something similar.

Author (danny0838, Jun 30, 2025)

What about simply unreferenced local file entries?

Commenter

ambiguous imo


#. Data before the first referenced entry is removed only when it appears to
be a sequence of consecutive entries with no extra following bytes; extra
preceding bytes are preserved.
Member

This seems hard to work around if this isn't the behavior a user wants. i.e. if the bytes PK\003\004 appear before the first entry in some other format then a user cannot use repack according to my reading of this (I will revisit this once I read the implementation). Perhaps it would be better to take a start_offset (defaulting to ZipFile.data_offset or 0 maybe) that is the offset to start the search?

Member

I think documenting the list of limitations of repack's scan would be an acceptable alternative to adding this.

Author (danny0838, Jun 27, 2025)

Can you be more specific about the "some other format" you mention?

If it's something like:

[PK\003\004 and noise]
[unreferenced local file entry]
[first local file entry]

then the unreferenced local file entry will be removed (since it's "a sequence of consecutive entries with extra preceding bytes").

If it's something like:

[PK\003\004 and noise]
[first local file entry]

or

[unreferenced local file entry]
[PK\003\004 and noise]
[first local file entry]

then all bytes before the first local file entry will be preserved.

I think this is the reasonable behavior.

Member

There are many formats that build on top of zip, such as docx, Java JAR files, etc. These have prefixes, and you need to be sure that what you detect is an unreferenced file entry rather than a local file entry magic and noise. As far as I am aware, there's no way to be certain that what you have is an actual file entry. So there are potentially cases where the following layout could be a misinterpretation of a prefix to a zip file, and repack would incorrectly remove data from that prefix.

[unreferenced local file entry]
[PK\003\004 and noise]
[first local file entry]

Author (danny0838, Jun 27, 2025)

Yes, there is no 100% definite way to determine that a sequence of bytes is a stale local file entry; that's why I call it a heuristic. But I think the current criteria are accurate enough for most real-world cases, and a false removal is very, very unlikely to happen.

If you don't agree with me, can you provide a good example or test case in which normal bytes are misinterpreted and falsely removed as local file entries before the first referenced local file entry? Without this, it's just theoretical talk, which is non-constructive.

Member

Given that heuristics and data scanning are involved, I think it is worthwhile to add a .. note:: to the repack docs here stating that it (1) cannot be guaranteed safe to use on untrusted zip file inputs or (2) does not guarantee that zip-like archives such as those with executable data prepended will survive unharmed.

Realistically I think people with (2) style formats should know this but it is good to set expectations.

I'm asking for (1) preemptively because we're basically guaranteed to get security "vulnerability" reports about all possible API behaviors today [I'm on the security team - we really do] so pre-documenting the thing makes replying to those easier. 😺
We are never going to be able to prevent all such DoS, zipbomb, frankenzip, or multiple-interpretations-by-different-tooling zip format reports, given that the zip format definition combined with the collective behavior of the world's implementations is so... fuzzy.

can you provide a good example or test case that normal bytes be misinterpreted and falsely removed as local file entries before the first referenced local file entry?

That's exactly the kind of thing someone would do in a future security@ report. :P We can stay ahead of that by at least not offering a guarantee of safety for this API. The first innocent real-world starting points that come to mind are zip files stored within zip files, and file formats involving multiple zip files within them where only one of them appears at the end of the file and thus acts like a zip to normal zip file tooling, such as zipfile seeking out the end-of-file central directory. Those might not trigger a match in your logic as is, but work backwards from the logic and there may be ways to construct some that could? I'm just suggesting we don't try to overthink our robustness and make guarantees here.

Author (danny0838, Jun 30, 2025)

For (1): not only ZipFile.repack but the whole zipfile package is not guaranteed to be safe on an untrusted ZIP file. If a note or warning is required, it should probably be for the whole package rather than for this method only.

For (2): this is what repack wants to at least guarantee. The current algorithm should be quite safe against false positives, as it's very unlikely that random binary bytes would happen to form a local file header magic and also have an "entry length" that ends exactly at the position of the first referenced local file header (or the next "local file entry").

This could even be improved by providing an option to check the CRC when validating every seeming "local file entry", though it would impact performance significantly and I don't think it's worth it.

A prepended zip should be safe since it has a central directory, which will be identified as "extra following bytes" and skipped from stripping. Unless the zip is abnormal, e.g. having the last local file entry overlapping in the central comment and thus having no additional bytes after its end.

A zip embedded as the content of a member is also 100% safe since the algorithm won't strip anything inside a local file entry.

A zip embedded immediately after a local file entry will be falsely stripped, but it's explicitly precluded by the documented presumption that "local file entries are stored consecutively", and should be something unlikely to happen on a normal zip-like file.

Given that the current documentation already explains its assumptions and algorithm, I expect developers to be able to estimate the risk on their own. Although it's not 100% safe, worrying about this is something like worrying about a repository breaking due to a SHA-1 collision when using Git. I agree that it would be good to set a fair expectation for the heuristics-based algorithm and to encourage passing removed for better performance and accuracy, but I also don't want to give the impression that the algorithm is fragile and could easily blow up on random input. Don't you think it would be overkill for Git to show a big warning saying that it doesn't guarantee your data won't break accidentally?

Anyway, I'm open to this. It's welcome if someone can provide a good rephrasing or note without such issues.
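
To make the criterion discussed above concrete ("a header magic whose entry length ends exactly at the next referenced header"), here is a simplified sketch. It is not the PR's implementation: `entry_end` and `is_entry_sequence` are hypothetical helper names, and the sketch assumes no data descriptors, no ZIP64 and no encryption.

```python
import io
import struct
import zipfile

LFH_FORMAT = "<4s5H3L2H"  # fixed part of a local file header (30 bytes)
LFH_SIZE = struct.calcsize(LFH_FORMAT)
LFH_MAGIC = b"PK\x03\x04"


def entry_end(data, offset):
    """Return the end offset of a plausible local file entry, else None."""
    header = data[offset:offset + LFH_SIZE]
    if len(header) < LFH_SIZE:
        return None
    (magic, _ver, flags, _method, _time, _date,
     _crc, csize, _usize, name_len, extra_len) = struct.unpack(LFH_FORMAT, header)
    # Reject a wrong magic; also reject entries that use a data descriptor
    # (flag bit 3), which this simplified sketch does not handle.
    if magic != LFH_MAGIC or flags & 0x08:
        return None
    end = offset + LFH_SIZE + name_len + extra_len + csize
    return end if end <= len(data) else None


def is_entry_sequence(data, start, stop):
    """True if data[start:stop] is exactly a run of back-to-back entries."""
    pos = start
    while pos < stop:
        pos = entry_end(data, pos)
        if pos is None:
            return False
    return pos == stop and start < stop


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("first.txt", b"hello")
    zf.writestr("second.txt", b"world")
    second_offset = zf.getinfo("second.txt").header_offset
data = buf.getvalue()

# The bytes before the second entry are exactly one well-formed entry.
clean = is_entry_sequence(data, 0, second_offset)

# A magic followed by noise does not parse as an entry ending at the
# right position, so the whole prefix would be preserved.
noise = b"PK\x03\x04 and noise"
noisy = is_entry_sequence(noise + data, 0, len(noise) + second_offset)
```

In this sketch a false positive requires both a matching magic and a computed length that lands exactly on the boundary, which is the property the comment above relies on.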

Author

Revised in 93db94a26

#. Data between referenced entries is removed only when it appears to
be a sequence of consecutive entries with no extra preceding bytes; extra
following bytes are preserved.
#. Entries must not overlap. If any entry's data overlaps with another, a
:exc:`BadZipFile` error is raised and no changes are made.
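
The non-overlap rule in the last item can be sketched as a check over byte ranges. This is only an illustration under stated assumptions: `check_no_overlap` is a hypothetical helper, and the `(start, end)` spans (end exclusive) are assumed to have been computed elsewhere from the entries' header offsets and sizes.

```python
import zipfile


def check_no_overlap(spans):
    """Raise BadZipFile if any (start, end) byte ranges overlap."""
    prev_end = 0
    for start, end in sorted(spans):
        if start < prev_end:
            raise zipfile.BadZipFile(
                f"entry at offset {start} overlaps data ending at {prev_end}"
            )
        prev_end = end


# Adjacent entries are fine; the check returns without raising.
check_no_overlap([(44, 90), (0, 44)])

# An entry that starts inside the previous one is rejected.
try:
    check_no_overlap([(0, 50), (44, 90)])
    overlap_detected = False
except zipfile.BadZipFile:
    overlap_detected = True
```

Running such a check before moving any bytes is one way to satisfy the "no changes are made" part of the rule.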

When scanning, setting ``strict_descriptor=True`` disables detection of any
Member

Why not default to strict_descriptor=True given that it performs better and the zip files we expect people to be manipulating in remove/repack manners are presumed most likely to be "modern" forms?

Author

This is exactly one of the open questions (#134627 (comment), #134627 (comment)).

The current quick decision was made primarily because it adheres better to the spec, and most of the Python stdlib tends to prioritize compatibility over performance, e.g. json.dump with ensure_ascii=True and http.server with HTTP version 1.0. But it's not set in stone and can be changed, based on a vote or something?

entry using an unsigned data descriptor (deprecated in the ZIP specification
since version 6.3.0, released on 2006-09-29, and used only by some legacy
tools). This improves performance, but may cause some stale entries to be
preserved.
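
For readers unfamiliar with the term, a data descriptor is the record that follows an entry's data when its sizes were not known up front. The ZIP specification makes the 4-byte signature ``PK\x07\x08`` optional, so a reader may meet either form. A minimal sketch of parsing both forms follows; `parse_data_descriptor` is a hypothetical helper, not part of zipfile or this PR.

```python
import struct

DD_SIGNATURE = b"PK\x07\x08"


def parse_data_descriptor(buf, zip64=False):
    """Parse a data descriptor located at the start of buf.

    Returns (signed, crc, compressed_size, uncompressed_size).
    A leading PK\\x07\\x08 is treated as the signature; an unsigned
    descriptor whose CRC happens to equal that value is ambiguous.
    """
    signed = buf[:4] == DD_SIGNATURE
    body = buf[4:] if signed else buf
    # ZIP64 descriptors carry 8-byte sizes instead of 4-byte ones.
    fmt = "<LQQ" if zip64 else "<LLL"
    crc, csize, usize = struct.unpack_from(fmt, body)
    return signed, crc, csize, usize


fields = struct.pack("<LLL", 0x12345678, 5, 5)
signed_result = parse_data_descriptor(DD_SIGNATURE + fields)
unsigned_result = parse_data_descriptor(fields)
```

Parsing is the easy part. Without a signature to search for, locating where an unreferenced entry's compressed data ends takes more work, which is presumably why skipping such entries with ``strict_descriptor=True`` is faster.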

*chunk_size* may be specified to control the buffer size when moving
entry data (default is 1 MiB).

The archive must be opened with mode ``'a'``.

Calling :meth:`repack` on a closed ZipFile will raise a :exc:`ValueError`.

.. versionadded:: next
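
A possible usage pattern for the two methods is sketched below. The `remove()` and `repack()` calls follow the signatures proposed in this diff and only exist in interpreters built with this PR, so the sketch checks for them and otherwise falls back to the usual workaround of rewriting the archive. `drop_member` is a hypothetical helper name.

```python
import os
import tempfile
import zipfile


def drop_member(path, arcname):
    """Delete *arcname* from the archive at *path* and reclaim its space."""
    if hasattr(zipfile.ZipFile, "remove") and hasattr(zipfile.ZipFile, "repack"):
        # Proposed API from this PR; not available in released Python versions.
        with zipfile.ZipFile(path, "a") as zf:
            removed = zf.remove(arcname)
            # Passing the removed ZipInfo avoids the heuristic scan.
            zf.repack([removed])
        return

    # Fallback: copy every other member into a new archive and swap it in.
    # Unlike remove(), this drops all members named *arcname*.
    fd, tmp = tempfile.mkstemp(suffix=".zip", dir=os.path.dirname(path) or ".")
    os.close(fd)
    try:
        with zipfile.ZipFile(path) as src, zipfile.ZipFile(tmp, "w") as dst:
            for info in src.infolist():
                if info.filename != arcname:
                    dst.writestr(info, src.read(info))
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The fallback needs temporary disk space for a second copy of the archive and recompresses each kept member, which is the cost the in-place `repack()` is meant to avoid.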


The following data attributes are also available:

.. attribute:: ZipFile.filename