Next TAR header offset recomputation is wrong for GNU sparse 1.0 file combined with 'size' PAX header key #136602

@mxmlnkn

Bug report

Bug description:

For a more detailed description, please see #136601.

I have hit a bug that causes TAR file parsing to end prematurely for very large sparse files. The computed offset of the next TAR header is off by one 512 B block.

The problem lies in how the next TAR header offset is recomputed when the PAX header contains a size key that overrides the overflowed (> 8 GiB) TAR size field:

cpython/Lib/tarfile.py

Lines 1562 to 1569 in 47b01da

if "size" in pax_headers:
    # If the extended header replaces the size field,
    # we need to recalculate the offset where the next
    # header starts.
    offset = next.offset_data
    if next.isreg() or next.type not in SUPPORTED_TYPES:
        offset += next._block(next.size)
    tarfile.offset = offset

The problem is that next.offset_data is used for this recomputation even though next.offset_data gets overwritten in _proc_gnusparse_10:

next.offset_data = tarfile.fileobj.tell()

As a result, the computed offset of the next TAR header is too large by the number of blocks it takes to store the sparse map, because next.offset_data now points past the map while next.size still counts it.
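The arithmetic can be sketched with the numbers printed by the reproducer further down. This is only an illustration, not code from tarfile: the `block` helper imitates `TarInfo._block`, and it assumes that the sparse map occupies exactly one block in this archive and that the PAX size value counts that block.

```python
BLOCKSIZE = 512  # same value as tarfile.BLOCKSIZE


def block(count):
    """Round a byte count up to a multiple of BLOCKSIZE, like TarInfo._block."""
    blocks, remainder = divmod(count, BLOCKSIZE)
    if remainder:
        blocks += 1
    return blocks * BLOCKSIZE


# Values taken from the reproducer output.
size = 9653191172             # tarInfo.size, taken from the PAX 'size' key
offset_data_after_map = 2048  # tarInfo.offset_data as set by _proc_gnusparse_10

# Assumption: the sparse map fits into a single 512 B block here.
map_bytes = BLOCKSIZE
data_start = offset_data_after_map - map_bytes  # where the member's blocks begin

# What the current code computes: it starts counting after the map,
# although 'size' already includes the map block.
buggy_next_header = offset_data_after_map + block(size)

# What the offset should be if counting starts at the beginning of the map.
expected_next_header = data_start + block(size)

print(buggy_next_header - expected_next_header)  # prints 512
```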

Maybe I am wrong and have overlooked something, but I can say that this change fixes it for my test case:

diff --git a/Lib/tarfile.py b/Lib/tarfile.py
index 068aa13ed7..7f3e62f5a2 100644
--- a/Lib/tarfile.py
+++ b/Lib/tarfile.py
@@ -1565,7 +1565,7 @@ def _proc_pax(self, tarfile):
                 # header starts.
                 offset = next.offset_data
                 if next.isreg() or next.type not in SUPPORTED_TYPES:
-                    offset += next._block(next.size)
+                    offset += next._block(next.size) - BLOCKSIZE
                 tarfile.offset = offset
 
         return next
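The patch subtracts exactly one block, which matches the reproducer, where the sparse map appears to fit into a single block. A map with many entries would presumably span several blocks. The sketch below is not part of tarfile. It parses a GNU sparse 1.0 map from an in-memory buffer and reports how many bytes the map occupies including padding. The function name `parse_sparse_map_10` is hypothetical, and the first map is reconstructed from the sparse list printed by the reproducer.

```python
BLOCKSIZE = 512  # same value as tarfile.BLOCKSIZE


def parse_sparse_map_10(data):
    """Parse a GNU sparse 1.0 map from bytes.

    Returns the list of (offset, size) entries and the number of bytes
    the map occupies once rounded up to a whole number of blocks.
    """
    pos = 0

    def read_number():
        # Each number is stored as decimal text terminated by a newline.
        nonlocal pos
        end = data.index(b"\n", pos)
        value = int(data[pos:end])
        pos = end + 1
        return value

    count = read_number()
    entries = [(read_number(), read_number()) for _ in range(count)]
    padded = -(-pos // BLOCKSIZE) * BLOCKSIZE  # round up to a full block
    return entries, padded


# Map reconstructed from the sparse list printed by the reproducer.
small_map = b"3\n0\n1073741824\n1084227584\n8579448836\n9663676420\n0\n"
entries, map_bytes = parse_sparse_map_10(small_map)
print(entries, map_bytes)

# Synthetic map with 100 entries (made-up offsets) to show a longer map.
big_map = b"100\n" + b"".join(b"%d\n512\n" % (i * 1000000) for i in range(100))
big_entries, big_map_bytes = parse_sparse_map_10(big_map)
print(len(big_entries), big_map_bytes)
```

Under these assumptions, a more general correction would subtract the padded map size rather than a fixed BLOCKSIZE, or remember the data offset before the map is read.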

Minimal reproducer (tested on EXT4 with GNU tar 1.35):

echo bar > foo
echo bar > sparse
fallocate -l 9G sparse
echo bar >> sparse
fallocate --punch-hole -o 1G -l 10M sparse
tar --numeric-owner --format=pax --sparse-version=1.0 -cSf sparse.tar sparse foo
ls -la sparse.tar
# -rw-rw-r-- 1 user user 9663682560 Jul 13 14:14 sparse.tar
tar tvlf sparse.tar
# -rw-rw-r-- 1000/1000 9663676420 2025-07-13 14:13 sparse
# -rw-rw-r-- 1000/1000          4 2025-07-13 14:11 foo
python3 -c 'import sys, tarfile;
[print(tarInfo.sparse, tarInfo.offset, tarInfo.offset_data, tarInfo.size, tarInfo.name)
for tarInfo in tarfile.open(sys.argv[1])]' sparse.tar
# [(0, 1073741824), (1084227584, 8579448836), (9663676420, 0)] 0 2048 9653191172 sparse
#  -> foo is missing!
cat sparse.tar | xz -9 | zstd -19 | base64

The reproducer archive sparse-file-larger-than-8GiB-followed-by-normal-file.tar.xz.zst, encoded as base64:

cat <<EOF | base64 -d | zstd -d > sparse-file-larger-than-8GiB-followed-by-normal-file.tar.xz
KLUv/QRojBIA1CP9N3pYWgAABObWtEYCACEBHAAAABDPWMz//5wCcV0AFwvGh5JaO6ePxyUOuA/z
XtE/5U/vyT1WUwqPhMr1HTeZeJyWILwrrtDwH0eKx6KKGcU7D2aYidf/9bCtFMcWp8KxDA1FLF58
w9bO4J+eDKd9QfIZFPCutpNB91dMk9bSVazx9pUcWEWn2r0SWsv1BtSYmVDmdKaMdGC/Epx8bcRA
nm5Joy2Tgi3O7VouoCAqha+1YYNOQyyB4sG+tDbfLGdW6fyZMztJ/lRFQwtlFpDLHGFpia92kkke
+2a/mwMvPc58aiT5X56QuH2mw1OhsrBKnbYYnT89BJjyAh2GTOeDbtZ/lLDGwhvxkXlnCm/M8Qiq
fUGfqAjnBeikNY2nodSBFo8YQh+636fk9xfuTQ3kKQ8qEWa613HftzHJ/X/ha1bKD91T/SPTCgd/
rhyvFtn8FBBiUS7UayidinQBNmGebczIaRsKUQKoffUTC9EbCrRXDQjQMjfDyo7N/eDIxD7jBImH
Dv8Qk/hxeFn4C83/lShGD6n8fN77mjAuVsCPhfODgcBlxCVT+PWRNjEFpbDub8FwTUcM0ZERqq1g
HbrOsScYXFmG6WZSWL7pdqxZ5OVbBQj5x9qt/PtSK3TNHlsgQvndUz34KWQJO4DLKmzftTvwxL0u
X6oPPktmQpAT+5I61gCf/xABKwDsc1On/b6ufDEan7eNMW5wnqcjX+woy4XRlZiKfiqR8id19xnA
BphNmP3Yr9WQD1EPP7IEADz8NCsncIBOR5aC/hM+FaZUAAAAAQD9/778uQYCRAAAAAEA/f85AAJE
AAAAAQD9/zkAAkQAAAABAP3/OQACRAAAAAEA/f85AAJEAAAAAQD9/zkAAkQAAAABAP3/OQACRAAA
AAEA/f85AAJEAAAAAQD9/zkAAh0GAIQLkODIaNUNHQG+Ib0xB5201x9u5Typk+S1zSY18D/tc2o+
BXKM/RM9v6MTQoFntxwNm0So6CELgft8dinBPFJBg583tJn+q69PwBnThZQjYTzvNhv0fkxX4Gjm
SgwnOrb7GU5pc2qtcrNCcHrPaNkQicmkdyzESbMAA8S2zfCiJIzpnN25EroA08/3fFWQ44Jfrake
AIiPdXLNPTRNAAGY3VWAsID7IwAAdKr/2BQXOzADAAAAAARZWgIAG0DNWVsOgERj+N4=
EOF
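The Python one-liner from the reproducer can also be written as a small script that reports which expected members are missing. This is a sketch only: the helper names `list_members` and `missing_members` are made up for illustration, and the expected names `sparse` and `foo` are those from the reproducer above.

```python
import sys
import tarfile


def list_members(path):
    """Return (name, offset, offset_data, size, sparse) for each member found."""
    # tarfile.open detects compression (such as xz) on its own in the default mode.
    with tarfile.open(path) as archive:
        return [(m.name, m.offset, m.offset_data, m.size, m.sparse)
                for m in archive]


def missing_members(path, expected_names):
    """Return the expected names that tarfile did not list."""
    found = {entry[0] for entry in list_members(path)}
    return [name for name in expected_names if name not in found]


if __name__ == "__main__" and len(sys.argv) > 1:
    for entry in list_members(sys.argv[1]):
        print(*entry)
    print("missing:", missing_members(sys.argv[1], ["sparse", "foo"]))
```

With the bug present, running it on the reproducer archive should report foo as missing.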

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux
