
Added support for AMD GPUs in "docker run --gpus". #49952


Open · wants to merge 2 commits into master

Conversation

sgopinath1 (Author)

This change adds support for AMD GPUs to the docker run --gpus command.

- What I did

Added backend code to support the exact same interface used today for Nvidia GPUs, allowing customers to use the same docker commands for both Nvidia and AMD GPUs.

- How I did it

  • Followed the same approach as Nvidia by registering a new driver with the gpu capability.
  • Similar to the Nvidia GPU driver, the AMD driver maps the --gpus input from the docker command to an environment variable, AMD_VISIBLE_DEVICES, which is handled by the AMD container runtime (see the sketch after this list).
  • The AMD driver is registered only if the Nvidia container runtime is not installed on the system and the AMD container runtime is installed.
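
For illustration, a minimal sketch of that env-var mapping, assuming the AMD container runtime consumes AMD_VISIBLE_DEVICES as described above (the package placement, function name, and signature here are illustrative, not this PR's code):

    package daemon // illustrative placement

    import (
    	"strconv"
    	"strings"
    )

    // amdVisibleDevices builds the value injected into the container spec,
    // either from a device count (--gpus 2 -> "0,1") or from an explicit
    // ID list (--gpus '"device=1,2,3"' -> "1,2,3").
    func amdVisibleDevices(count int, deviceIDs []string) string {
    	if len(deviceIDs) > 0 {
    		return strings.Join(deviceIDs, ",")
    	}
    	devices := make([]string, count)
    	for i := range devices {
    		devices[i] = strconv.Itoa(i)
    	}
    	return strings.Join(devices, ",")
    }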

- How to verify it

The AMD container runtime must be installed on the system to verify this functionality. It is expected to be published as an open-source project soon.

The following commands can be used to specify which GPUs are required inside the container; the rocm-smi output verifies that the correct GPUs are made available inside the container. A direct check of the environment variable follows the list below.

  • To use all available GPUs

    docker run --runtime=amd --gpus all rocm/rocm-terminal rocm-smi

    OR

    docker run --runtime=amd --gpus device=all rocm/rocm-terminal rocm-smi

  • To use any 2 GPUs

    docker run --runtime=amd --gpus 2 rocm/rocm-terminal rocm-smi

  • To use a set of specific GPUs

    docker run --runtime=amd --gpus 1,2,3 rocm/rocm-terminal rocm-smi

    OR

    docker run --runtime=amd --gpus '"device=1,2,3"' rocm/rocm-terminal rocm-smi
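
Assuming the env-var mapping described above, the device selection can also be checked directly (an illustrative check, not part of this PR's verification steps):

    docker run --runtime=amd --gpus '"device=1,2,3"' rocm/rocm-terminal env | grep AMD_VISIBLE_DEVICES

The expected output includes AMD_VISIBLE_DEVICES=1,2,3.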

- Human readable description for the release notes

Added support for AMD GPUs in the docker run --gpus command.

elezar (Contributor) commented May 13, 2025

@sgopinath1 as a maintainer of the NVIDIA Container Toolkit and its components, I would strongly recommend against using the environment variable to control this behaviour -- even as an interim solution. Adding this behaviour now means we will have to keep it in mind when implementing a --gpus flag to CDI mapping, as discussed in #49824.

sgopinath1 (Author) commented May 15, 2025

@elezar a couple of points:

  1. This PR neither introduces new user-visible behavior nor changes existing behavior. The backend code is also identical to the Nvidia driver's. So, IMO, this should not add any new variables or considerations when we move to the long-term solution of mapping the --gpus flag to CDI.
  2. The AMD container toolkit will support CDI. However, this PR is for customers who are insisting on parity with Nvidia w.r.t. the --gpus flag. As I understand it, there is no timeline for the long-term solution yet. We need to provide customers a way to use the --gpus flag with AMD GPUs ASAP.

deke997 commented May 17, 2025

We would love to be able to use --gpus for AMD!

BTW, the AMD Container Toolkit is now published: https://instinct.docs.amd.com/projects/container-toolkit/en/latest/container-runtime/overview.html

Comment on lines 56 to 63
// countToDevicesAMD returns the list 0, 1, ... count-1 of deviceIDs.
func countToDevicesAMD(count int) string {
	devices := make([]string, count)
	for i := range devices {
		devices[i] = strconv.Itoa(i)
	}
	return strings.Join(devices, ",")
}
Contributor

This is the same implementation as countToDevices in nvidia_linux.go. Does it make sense to just use that function?

Author

Yes, makes sense. Changed accordingly.


const amdContainerRuntime = "amd-container-runtime"

func init() {
Contributor

@sgopinath1 instead of having a separate init function for nvidia and amd GPUs, does it make sense to refactor this and the code in nvidia_linux.go to have a single init function that checks for the existence of the various executables and registers the drivers accordingly?

Author

Agreed. I have modified the init function in nvidia_linux.go to register the AMD driver also.

sgopinath1 (Author)

@elezar thanks for reviewing. I have updated the code as per your suggestions. Please review.

const nvidiaHook = "nvidia-container-runtime-hook"
const (
	nvidiaHook = "nvidia-container-runtime-hook"
	amdHook    = "amd-container-runtime"
Contributor

Suggested change
amdHook = "amd-container-runtime"
amdContainerRuntimeExecutableName = "amd-container-runtime"

Author

Done as suggested.

sgopinath1 (Author) commented May 26, 2025

@elezar Let me know if there are any further comments on the changes. Thanks.

elezar (Contributor) commented May 27, 2025

LGTM

thaJeztah (Member)

Could you do a quick rebase and squash the commits?

Added backend code to support the exact same interface
used today for Nvidia GPUs, allowing customers to use
the same docker commands for both Nvidia and AMD GPUs.

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

Reused common functions from nvidia_linux.go.

Removed duplicate code in amd_linux.go by reusing
the init() and countToDevices() functions in
nvidia_linux.go. AMD driver is registered in init().

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

Renamed amd-container-runtime constant

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

sgopinath1 (Author)

Could you do a quick rebase and squash the commits?

Done.

Comment on lines 55 to 57
} else {
	// no "gpu" capability
}
Member

Hm.. looks like the linter doesn't like this (even with a comment inside the branch 🤔)

daemon/nvidia_linux.go:55:9: SA9003: empty branch (staticcheck)
	} else {
	       ^

Comment on lines 38 to 57
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
	nvidiaDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setNvidiaGPUs,
	}
	for c := range allNvidiaCaps {
		nvidiaDriver.capset[string(c)] = struct{}{}
	}
	registerDeviceDriver("nvidia", nvidiaDriver)
} else if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}}
	amdDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setAMDGPUs,
	}
	registerDeviceDriver("amd", amdDriver)
} else {
	// no "gpu" capability
}
Member

Perhaps an early return would work;

Suggested change
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
	nvidiaDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setNvidiaGPUs,
	}
	for c := range allNvidiaCaps {
		nvidiaDriver.capset[string(c)] = struct{}{}
	}
	registerDeviceDriver("nvidia", nvidiaDriver)
} else if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}}
	amdDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setAMDGPUs,
	}
	registerDeviceDriver("amd", amdDriver)
} else {
	// no "gpu" capability
}
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
	for c := range allNvidiaCaps {
		capset[string(c)] = struct{}{}
	}
	registerDeviceDriver("nvidia", &deviceDriver{
		capset:     capset,
		updateSpec: setNvidiaGPUs,
	})
	return
}
if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	registerDeviceDriver("amd", &deviceDriver{
		capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
		updateSpec: setAMDGPUs,
	})
	return
}
// no "gpu" capability

Member

Curious though; should amd and nvidia be considered mutually exclusive? Would splitting this into two init funcs (one in the nvidia file, one in the amd file) and having both register a driver (if present) work?

Contributor

They are mutually exclusive, but I suggested that the same init function be used because @sgopinath1 was also checking for the NVIDIA runtime in the AMD init function. The intent was to ensure that the AMD runtime is not used if the NVIDIA hook is present and that the nvidia logic takes precedence.

It may be cleaner to repeat some code here to keep these logically separate.

Member

Gotcha; yup, makes sense
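
For illustration, a minimal sketch of the split-init alternative discussed above, reusing identifiers from this PR's diff; the AMD init defers to the NVIDIA hook so the precedence described above is preserved (a sketch, not the code merged here):

    // Hypothetical: would live in gpu_amd_linux.go, alongside a matching
    // nvidia-only init in gpu_nvidia_linux.go.
    func init() {
    	// nvidia logic takes precedence: skip AMD registration when the
    	// NVIDIA hook is installed.
    	if _, err := exec.LookPath(nvidiaHook); err == nil {
    		return
    	}
    	if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
    		registerDeviceDriver("amd", &deviceDriver{
    			capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
    			updateSpec: setAMDGPUs,
    		})
    	}
    }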

Author

Perhaps an early return would work;

@thaJeztah Changed as suggested.

// Register Nvidia driver if Nvidia helper binary is present.
// Else, register AMD driver if AMD helper binary is present.
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
Member

Also curious if gpu and nvidia should be added to allNvidiaCaps, so that no merging is needed 🤔 (not sure)

Contributor

Let's keep that out of scope for this PR. I think the allNvidiaCaps is supposed to mirror our NVIDIA_DRIVER_CAPABILITIES. These are different from the Docker capabilities (which include gpu and nvidia). I will revisit this when I start working on a --gpus to CDI migration plan in the next couple of weeks.

Member

Thanks! Yes, definitely fine for separate; mostly me thinking out loud here 😅

Contributor

I have created #50099 to actually remove support for the NVIDIA-specific capabilities. This should simplify things, but I'm not sure whether we're really breaking any users here.

const nvidiaHook = "nvidia-container-runtime-hook"
const (
	nvidiaHook                        = "nvidia-container-runtime-hook"
	amdContainerRuntimeExecutableName = "amd-container-runtime"
Member

If we split (see other comment), it's probably cleaner to have this const defined in the amd_linux file.

Possibly rename both files to have a common prefix (gpu_amd_linux.go, gpu_nvidia_linux.go) for easier discoverability that they go together.

Author

@elezar Are you fine with renaming nvidia_linux.go as gpu_nvidia_linux.go?

Contributor

Yes, that's fine.

Author

@elezar I have renamed amd_linux.go to gpu_amd_linux.go in this PR. Could you please rename nvidia_linux.go in your PR?

Also renamed amd_linux.go to gpu_amd_linux.go.

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>