Added support for AMD GPUs in "docker run --gpus". #49952
base: master
Conversation
@sgopinath1 as a maintainer of the NVIDIA Container Toolkit and its components I would strongly recommend against using the environment variable to control this behaviour -- even as an interim solution. Adding this behaviour now means that we have to keep in mind when implementing a…
@elezar a couple of points:
We would love to be able to use --gpus for AMD! BTW, the AMD Container Toolkit is now published: https://instinct.docs.amd.com/projects/container-toolkit/en/latest/container-runtime/overview.html
daemon/amd_linux.go (Outdated)
```go
// countToDevicesAMD returns the list 0, 1, ... count-1 of deviceIDs.
func countToDevicesAMD(count int) string {
	devices := make([]string, count)
	for i := range devices {
		devices[i] = strconv.Itoa(i)
	}
	return strings.Join(devices, ",")
}
```
This is the same implementation as `countToDevices` in `nvidia_linux.go`. Does it make sense to just use that function?
Yes, makes sense. Changed accordingly.
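For illustration, the count-to-IDs expansion performed by this shared helper can be exercised standalone. The sketch below uses a local copy named `countToDevices` (the name of the helper in `nvidia_linux.go`); it is not the daemon code itself.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// countToDevices expands a GPU count into the comma-separated ID list
// "0,1,...,count-1" — the same expansion as the helper shared between
// nvidia_linux.go and the AMD code path.
func countToDevices(count int) string {
	devices := make([]string, count)
	for i := range devices {
		devices[i] = strconv.Itoa(i)
	}
	return strings.Join(devices, ",")
}

func main() {
	fmt.Println(countToDevices(4)) // prints "0,1,2,3"
}
```

So a request like `--gpus 4` ultimately selects devices `0,1,2,3`.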
daemon/amd_linux.go (Outdated)
```go
const amdContainerRuntime = "amd-container-runtime"

func init() {
```
@sgopinath1 instead of having a separate `init` function for `nvidia` and `amd` GPUs, does it make sense to refactor this and the code in `nvidia_linux.go` to have a single `init` function that checks for the existence of the various executables and registers the drivers accordingly?
Agreed. I have modified the `init` function in `nvidia_linux.go` to register the AMD driver also.
@elezar thanks for reviewing. I have updated the code as per your suggestions. Please review.
daemon/nvidia_linux.go (Outdated)
```diff
-const nvidiaHook = "nvidia-container-runtime-hook"
+const (
+	nvidiaHook = "nvidia-container-runtime-hook"
+	amdHook    = "amd-container-runtime"
+)
```
```diff
-	amdHook = "amd-container-runtime"
+	amdContainerRuntimeExecutableName = "amd-container-runtime"
```
Done as suggested.
@elezar Let me know if there are any further comments on the changes. Thanks.
LGTM
Could you do a quick rebase and squash the commits?
- Added backend code to support the exact same interface used today for Nvidia GPUs, allowing customers to use the same docker commands for both Nvidia and AMD GPUs. Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>
- Reused common functions from nvidia_linux.go. Removed duplicate code in amd_linux.go by reusing the init() and countToDevices() functions in nvidia_linux.go. AMD driver is registered in init(). Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>
- Renamed amd-container-runtime constant. Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>
Done.
daemon/nvidia_linux.go (Outdated)
```go
} else {
	// no "gpu" capability
}
```
Hm.. looks like the linter doesn't like this (even with a comment inside the branch 🤔)

```
daemon/nvidia_linux.go:55:9: SA9003: empty branch (staticcheck)
	} else {
	       ^
```
daemon/nvidia_linux.go (Outdated)
```go
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
	nvidiaDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setNvidiaGPUs,
	}
	for c := range allNvidiaCaps {
		nvidiaDriver.capset[string(c)] = struct{}{}
	}
	registerDeviceDriver("nvidia", nvidiaDriver)
} else if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}}
	amdDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setAMDGPUs,
	}
	registerDeviceDriver("amd", amdDriver)
} else {
	// no "gpu" capability
}
```
Perhaps an early return would work;
```go
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
	for c := range allNvidiaCaps {
		capset[string(c)] = struct{}{}
	}
	registerDeviceDriver("nvidia", &deviceDriver{
		capset:     capset,
		updateSpec: setNvidiaGPUs,
	})
	return
}
if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	registerDeviceDriver("amd", &deviceDriver{
		capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
		updateSpec: setAMDGPUs,
	})
	return
}
// no "gpu" capability
```
Curious though; should `amd` and `nvidia` be considered mutually exclusive? Would splitting this into two `init` funcs (one in the nvidia file, one in the amd file) and having both register a driver (if present) work?
They are mutually exclusive, but I suggested that the same init function be used because @sgopinath1 was also checking for the NVIDIA runtime in the AMD init function. The intent was to ensure that the AMD runtime is not used if the NVIDIA hook is present and that the `nvidia` logic takes precedence. It may be cleaner to repeat some code here to keep these logically separate.
Gotcha; yup, makes sense
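The precedence rule described above (NVIDIA wins when both helper binaries are installed) can be isolated in a tiny pure function. `pickGPUDriver` below is a hypothetical name for illustration only, not a symbol in the daemon; the real code checks binary presence via `exec.LookPath`.

```go
package main

import "fmt"

// pickGPUDriver illustrates the driver-selection precedence discussed above:
// if the NVIDIA hook is present it wins, otherwise the AMD runtime is used
// if present, otherwise no "gpu" capability is registered.
func pickGPUDriver(nvidiaPresent, amdPresent bool) string {
	if nvidiaPresent {
		return "nvidia"
	}
	if amdPresent {
		return "amd"
	}
	return "" // no "gpu" capability
}

func main() {
	fmt.Println(pickGPUDriver(true, true))  // prints "nvidia" — NVIDIA takes precedence
	fmt.Println(pickGPUDriver(false, true)) // prints "amd"
}
```

Keeping the decision in one place is what motivated merging the two `init` functions, even though splitting per-vendor files would otherwise be cleaner.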
> Perhaps an early return would work;
@thaJeztah Changed as suggested.
```go
// Register Nvidia driver if Nvidia helper binary is present.
// Else, register AMD driver if AMD helper binary is present.
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
```
Also curious if `gpu` and `nvidia` should be added to `allNvidiaCaps`, so that no merging is needed 🤔 (not sure)
Let's keep that out of scope for this PR. I think the `allNvidiaCaps` is supposed to mirror our `NVIDIA_DRIVER_CAPABILITIES`. These are different from the Docker capabilities (which include `gpu` and `nvidia`). I will revisit this when I start working on a `--gpus` to CDI migration plan in the next couple of weeks.
Thanks! Yes, definitely fine for separate; mostly me thinking out loud here 😅
I have created #50099 to actually remove support for the NVIDIA-specific capabilities. This should simplify things, but I'm not sure whether we're really breaking any users here.
```diff
-const nvidiaHook = "nvidia-container-runtime-hook"
+const (
+	nvidiaHook                        = "nvidia-container-runtime-hook"
+	amdContainerRuntimeExecutableName = "amd-container-runtime"
+)
```
If we split (see other comment), it's probably cleaner to have this const defined in the `amd_linux` file. Possibly rename both files to have a common prefix (`gpu_amd_linux.go`, `gpu_nvidia_linux.go`) for easier discoverability that they go together.
@elezar Are you fine with renaming `nvidia_linux.go` as `gpu_nvidia_linux.go`?
Yes, that's fine.
@elezar I have renamed `amd_linux.go` to `gpu_amd_linux.go` in this PR. Could you please rename `nvidia_linux.go` in your PR?
Also renamed amd_linux.go to gpu_amd_linux.go. Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>
This change adds support for AMD GPUs in the `docker run --gpus` command.

- What I did

Added backend code to support the exact same interface used today for Nvidia GPUs, allowing customers to use the same docker commands for both Nvidia and AMD GPUs.

- How I did it

Registered a device driver with the `gpu` capability. Translated the `--gpus` input in the docker command to an environment variable, `AMD_VISIBLE_DEVICES`, that is handled by the AMD container runtime.

- How to verify it
The AMD container runtime must be installed on the system to verify this functionality. The AMD container runtime is expected to be published as an open-source project soon.

The following commands can be used to specify which GPUs are required inside the container, and the `rocm-smi` output can be used to verify that the correct GPUs are made available inside the container.

To use all available GPUs:

```
docker run --runtime=amd --gpus all rocm/rocm-terminal rocm-smi
```

OR

```
docker run --runtime=amd --gpus device=all rocm/rocm-terminal rocm-smi
```

To use any 2 GPUs:

```
docker run --runtime=amd --gpus 2 rocm/rocm-terminal rocm-smi
```

To use a set of specific GPUs:

```
docker run --runtime=amd --gpus 1,2,3 rocm/rocm-terminal rocm-smi
```

OR

```
docker run --runtime=amd --gpus '"device=1,2,3"' rocm/rocm-terminal rocm-smi
```
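The translation sketched in "How I did it" — from a `--gpus` request to the `AMD_VISIBLE_DEVICES` variable consumed by the AMD container runtime — can be illustrated standalone. `gpusToAMDEnv` below is a hypothetical helper for illustration, not the actual daemon function; it only shows the count-expansion and device-list pass-through behaviour.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpusToAMDEnv sketches how a --gpus request maps onto AMD_VISIBLE_DEVICES:
// an explicit device list (e.g. from device=1,2,3) is joined as-is, while a
// bare count N expands to the IDs "0,1,...,N-1".
func gpusToAMDEnv(count int, deviceIDs []string) string {
	if len(deviceIDs) > 0 {
		return "AMD_VISIBLE_DEVICES=" + strings.Join(deviceIDs, ",")
	}
	ids := make([]string, count)
	for i := range ids {
		ids[i] = strconv.Itoa(i)
	}
	return "AMD_VISIBLE_DEVICES=" + strings.Join(ids, ",")
}

func main() {
	fmt.Println(gpusToAMDEnv(2, nil))                     // --gpus 2
	fmt.Println(gpusToAMDEnv(0, []string{"1", "2", "3"})) // --gpus '"device=1,2,3"'
}
```

Inside the container, `rocm-smi` should then report only the GPUs listed in `AMD_VISIBLE_DEVICES`.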
- Human readable description for the release notes