
Process never ends when sending tensors through multiprocessing queues in Python 3.12+ with filesystem strategy #153050

@rafalh

Description

🐛 Describe the bug

If a tensor is sent through a multiprocessing queue, something blocks the process from exiting after the end of the script is reached (I have to press Ctrl+C to end the program).
It seems to be related to the resource tracker process (multiprocessing.resource_tracker.ResourceTracker) that Python starts automatically: at the point where the process should exit, the resource tracker child process is still visible in the process tree, and if I kill it the main process ends successfully.
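
For anyone reproducing this, the tracker's PID can also be read from the private, CPython-internal resource tracker singleton instead of using pstree. A minimal sketch, relying on non-public attributes that may change between Python versions:

import multiprocessing.resource_tracker as rt

# _resource_tracker and _pid are CPython implementation details, not public API.
# _pid stays None until something has actually spawned the tracker process.
print("resource tracker pid:", rt._resource_tracker._pid)
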
The problem occurs in Python 3.12 but not in Python 3.11. I am using macOS Sequoia; I tried running the examples in an Ubuntu container and couldn't reproduce the problem there, so it may be macOS specific. Multiple torch versions are affected: I tested 2.2.0 (the oldest that installs successfully on Python 3.12) and 2.7.0 (the latest).
Calling multiprocessing.set_start_method("fork") fixes the issue (the default start method is spawn), but fork is not recommended on macOS according to the Python docs. The spawn and forkserver start methods both exhibit the hang.
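
For reference, a minimal sketch of that workaround; the start method must be set before any queues or processes are created, and the Python docs warn that fork is unsafe on macOS, so this is a stopgap rather than a fix:

import multiprocessing

if __name__ == "__main__":
    # Must run once, early, before any Process/Queue/Pool objects exist;
    # raises RuntimeError if the start method was already set elsewhere.
    multiprocessing.set_start_method("fork")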

Example using DataLoader:

from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    def __getitem__(self, index: int) -> int:
        return 1

    def __len__(self) -> int:
        return 10

def main() -> None:
    dataset = DummyDataset()
    data_loader = DataLoader(dataset, num_workers=1)  # a single worker subprocess is enough to trigger the hang
    for batch_idx, batch in enumerate(data_loader):
        print(batch_idx, batch)
    print("DONE?")

if __name__ == "__main__":
    main()
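
The title mentions the file_system sharing strategy; to confirm which strategy torch.multiprocessing is actually using on a given platform, a minimal sketch using the public getters:

import torch.multiprocessing as mp

print(mp.get_sharing_strategy())        # e.g. "file_system" on macOS
print(mp.get_all_sharing_strategies())  # strategies supported on this platform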

Example using just a tensor and a queue:

import torch.multiprocessing as multiprocessing
from torch import Tensor

def worker(q):
    q.put(Tensor(0))  # Tensor(0) is an empty (0-element) tensor; any tensor payload triggers the hang
    print("worker process ended")

def main() -> None:
    q = multiprocessing.Queue()
    w = multiprocessing.Process(target=worker, args=(q,))
    w.start()
    w.join()
    print(q.get())
    print("DONE?")

if __name__ == "__main__":
    main()

In both cases the program does not exit after printing "DONE?" (unless interrupted with Ctrl+C), and the process tree then looks like this:

~/tmp$ pstree 48529
-+= 48529 rafal.harabien /opt/homebrew/Cellar/python@3.12/3.12.10/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python /Users/rafal.harabien/minimal_mp_hang.py
 \--- 48530 rafal.harabien /opt/homebrew/Cellar/python@3.12/3.12.10/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python -c from multiprocessing.resource_tracker import main;main(6)

The second example works fine when sending non-tensor values, e.g. an int.
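
A minimal sketch isolating that variable, with torch.zeros(1) assumed as a stand-in tensor payload; running it with and without a --tensor argument differs only in the payload type:

import sys
import torch
import torch.multiprocessing as multiprocessing

def worker(q, use_tensor: bool):
    # The payload type is the only difference between the two runs.
    q.put(torch.zeros(1) if use_tensor else 1)

def main() -> None:
    q = multiprocessing.Queue()
    w = multiprocessing.Process(target=worker, args=(q, "--tensor" in sys.argv))
    w.start()
    w.join()
    print(q.get())
    print("DONE?")  # with --tensor, the process hangs after this on Python 3.12 + macOS

if __name__ == "__main__":
    main()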

Versions

((venv_py312) ) ~/tmp$ python collect_env.py
/Users/rafal.harabien/tmp/venv_py312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.7.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.4.1 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.0.13.3)
CMake version: version 4.0.1
Libc version: N/A

Python version: 3.12.10 (main, Apr 8 2025, 11:35:47) [Clang 16.0.0 (clang-1600.0.26.6)] (64-bit runtime)
Python platform: macOS-15.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] torch==2.7.0
[conda] No relevant packages

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @VitalyFedyunin @albanD @malfet

Labels

high priority, module: deadlock, module: macos, module: multiprocessing, module: regression, triaged
