🐛 Describe the bug
If a tensor is sent through a multiprocessing queue, something blocks the process from exiting after the end of the script is reached (I have to press Ctrl+C to terminate the program).
It seems to be related to the resource tracker process (`multiprocessing.resource_tracker.ResourceTracker`) that Python starts automatically: when the main process should exit, I can still see the resource tracker child process in the process tree, and if I kill it, the main process ends successfully.
The problem occurs on Python 3.12 but not on Python 3.11. I am using macOS Sequoia. I tried running the examples in an Ubuntu container and could not reproduce the problem there, so it may be macOS specific. Multiple torch versions are affected; I tested 2.2.0 (the oldest one that installs successfully on Python 3.12) and 2.7.0 (the latest).
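For context, the resource tracker is a helper process the stdlib spawns to clean up leaked shared-memory resources, and tensors sent through a `torch.multiprocessing` queue have their storage moved to shared memory. A minimal stdlib-only sketch showing the tracker process being launched (it peeks at the private `_resource_tracker._pid` attribute, which is a CPython implementation detail, so treat this as illustrative only):

```python
from multiprocessing import shared_memory, resource_tracker

# Creating a shared-memory segment makes Python launch the
# resource_tracker helper process (the extra child seen in the
# process tree below).
shm = shared_memory.SharedMemory(create=True, size=16)

# _pid is a private attribute of the tracker singleton; it is set once
# the helper process is running.
tracker_pid = resource_tracker._resource_tracker._pid
print("resource tracker pid:", tracker_pid)

shm.close()
shm.unlink()
```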
Calling `multiprocessing.set_start_method("fork")` fixes the issue (the default start method on macOS is `spawn`), but `fork` is not recommended on macOS according to the Python docs. The `spawn` and `forkserver` start methods both exhibit the hang.
Example using `DataLoader`:
```python
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    def __getitem__(self, index: int) -> int:
        return 1

    def __len__(self) -> int:
        return 10


def main() -> None:
    dataset = DummyDataset()
    data_loader = DataLoader(dataset, num_workers=1)
    for batch_idx, batch in enumerate(data_loader):
        print(batch_idx, batch)
    print("DONE?")


if __name__ == "__main__":
    main()
```
Example using just a tensor and a queue:
```python
import torch.multiprocessing as multiprocessing
from torch import Tensor


def worker(q):
    q.put(Tensor(0))
    print("worker process ended")


def main() -> None:
    q = multiprocessing.Queue()
    w = multiprocessing.Process(target=worker, args=(q,))
    w.start()
    w.join()
    print(q.get())
    print("DONE?")


if __name__ == "__main__":
    main()
```
In both cases the program does not exit after printing "DONE?" (unless interrupted with Ctrl+C), and the process tree looks like this:
```
~/tmp$ pstree 48529
-+= 48529 rafal.harabien /opt/homebrew/Cellar/python@3.12/3.12.10/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python /Users/rafal.harabien/minimal_mp_hang.py
 \--- 48530 rafal.harabien /opt/homebrew/Cellar/python@3.12/3.12.10/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python -c from multiprocessing.resource_tracker import main;main(6)
```
The second example works fine when sending non-tensor values, e.g. an `int`.
Versions
```
((venv_py312) ) ~/tmp$ python collect_env.py
/Users/rafal.harabien/tmp/venv_py312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.7.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.4.1 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.0.13.3)
CMake version: version 4.0.1
Libc version: N/A

Python version: 3.12.10 (main, Apr  8 2025, 11:35:47) [Clang 16.0.0 (clang-1600.0.26.6)] (64-bit runtime)
Python platform: macOS-15.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] torch==2.7.0
[conda] No relevant packages
```
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @VitalyFedyunin @albanD @malfet