GPU move - Stage 1
Stage 1 involves moving two GPU cards from the Hadoop cluster to the DSE-K8S cluster.
We will attempt to install both of these cards in a single host.
- Shut down an-worker1096
- Shut down an-worker1097
- Shut down dse-k8s-worker1001
- Remove the GPU card from an-worker1096
- Remove the GPU card from an-worker1097
- Install both GPU cards into dse-k8s-worker1001
- Retrieve the GPU Ready Configuration Cable Install Kit (470-ACQQ) from an-worker1096
- Retrieve the GPU Ready Configuration Cable Install Kit (470-ACQQ) from an-worker1097
- Boot all three servers
GPU move - Stage 2
Stage 2 involves moving another two GPU cards from the Hadoop cluster to the DSE-K8S cluster.
We are going to install both of these cards in a single host.
- Shut down an-worker1098
- Shut down an-worker1099
- Shut down dse-k8s-worker1002
- Remove the GPU card from an-worker1098
- Remove the GPU card from an-worker1099
- Install both GPU cards into dse-k8s-worker1002
- Boot all three servers
Once this work is done, @BTullis will follow up with Puppet changes to remove the GPU customization from an-worker109[8-9].
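As a sanity check on the two stages above, the planned moves can be written down as data and checked against the desired outcome of four cards installed in pairs. This is only an illustrative sketch: the `MOVES` list is transcribed from the Stage 1 and Stage 2 lists, and the helper names `hosts_to_shut_down` and `cards_per_destination` are hypothetical, not part of any existing tooling.

```python
from collections import Counter

# Planned moves transcribed from the Stage 1 and Stage 2 lists:
# (source Hadoop host, destination DSE-K8S host). One GPU card per source host.
MOVES = [
    ("an-worker1096", "dse-k8s-worker1001"),
    ("an-worker1097", "dse-k8s-worker1001"),
    ("an-worker1098", "dse-k8s-worker1002"),
    ("an-worker1099", "dse-k8s-worker1002"),
]


def hosts_to_shut_down(moves):
    """Return every host touched by the given moves, sources first, no duplicates."""
    sources = [src for src, _ in moves]
    destinations = [dst for _, dst in moves]
    # dict.fromkeys preserves insertion order while dropping repeats.
    return list(dict.fromkeys(sources + destinations))


def cards_per_destination(moves):
    """Count how many cards each destination host would receive."""
    return Counter(dst for _, dst in moves)


if __name__ == "__main__":
    print("Stage 1 hosts to shut down:", hosts_to_shut_down(MOVES[:2]))
    print("Stage 2 hosts to shut down:", hosts_to_shut_down(MOVES[2:]))
    print("Cards per destination:", dict(cards_per_destination(MOVES)))
```

Keeping the plan as a flat list of (source, destination) pairs makes it easy to confirm that each stage touches exactly three servers and that each destination receives two cards.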
Original description below
Current status:
- We have six AMD GPUs installed in the Hadoop cluster, but they are currently under-utilized.
- We have a new Kubernetes cluster named DSE-K8S, which we would like to be able to use for GPU-based workloads.
- Four of the hosts in this DSE-K8S cluster are supposedly GPU-ready, with all necessary cable kits in place.
- The other four nodes are supposedly GPU compatible, but are missing the cable kits.
See the following page for more information on the existing GPUs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
Desired status:
- Four of the six existing GPU cards are removed from the Hadoop cluster
- These four cards are installed in pairs in two of the dse-k8s-worker nodes
- Any spare GPU cable kits are reclaimed from the Hadoop hosts from which the cards were removed (if feasible, convenient, practical, etc.)
This will obviously require the kind input and cooperation of the DC-Ops team, and specifically ops-eqiad, to carry out the physical card moves. Data-Engineering can collaborate on shutting down and depooling the relevant servers to facilitate the work.
I have already spoken to representatives from the Research team, such as @leila and @Miriam, and I believe that they are happy in principle with this hardware move. I also know that @achou is one of the main users of the existing cards in Hadoop, so they may have insights on how and when to proceed.
Ultimately, this is still an experiment focused on trying to extract value from the GPUs that we already have and to ascertain more about the compatibility. We know that in the longer term the DC-Ops team would prefer that we buy servers with GPUs fitted in the factory, but that has not been possible yet, and therefore we would be keen for this hardware move if it's at all practical.