End-To-End Hybrid-Cloud
AI Infrastructure
Nick Geyer, Cisco Systems Inc.
Eugene Minchenko, Cisco Systems Inc.
BRKCOM-1008
#CiscoLive
Cisco Webex App
https://ciscolive.ciscoevents.com/ciscolivebot/#BRKCOM-1008
Questions?
Use Cisco Webex App to chat
with the speaker after the session
How: 1. Find this session in the Cisco Live Mobile App
BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 2
Agenda
• Introduction
• AI Fundamentals & Impacts on Infrastructure Design Decisions
• Training Infrastructure & Network Considerations for AI Environments
• Inferencing, Fine-Tuning, & Compute Infrastructure
• Sizing for Inferencing
• AI Infrastructure Automation & Cisco Validated Designs
• Future Trends and Industry Impacts of AI Infrastructure Demands
• Summary
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
AI sets a new standard for Infrastructure
• How can we harness all the data available to us to simplify data center operations?
• Is our network AI-ready, with the ability to support data training and inferencing use cases?
• How are we addressing corporate and regulatory sustainability requirements in our data center design?
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
What we know
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 5
What we’re hearing from IT infra and operations
• Need consistency; avoid new islands of operations
• Optimize for utilization and efficiency in many dimensions—support multiple projects, leverage GPUs wisely, power and cooling needs, lifecycle management
• Comprehensive security protocols and measures
• Support a rapidly-evolving software ecosystem
• Manage cloud vs. on-prem vs. hosted models
• Straddle the training → fine tuning → inferencing → repeat model
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 6
Cisco’s 2-Fold AI Strategy & Our Focus Today
In — Using AI to maximize YOUR experience with Cisco products: develop AI tools across the Cisco portfolio that help manage networks more effectively
• Delivering better results
• Providing intelligent guidance
On — Enabling YOUR infrastructure to support adoption of AI applications: develop products that help accelerate YOUR adoption of AI for your business solutions
• High-speed networking for AI training and inference clusters
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 7
AI Fundamentals &
Impacts on
Infrastructure
Design Decisions
AI: Level Setting and Definition
Data Science · Computer Science
Supervised Learning
Unsupervised Learning
Reinforcement Learning
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 9
AI Infrastructure Requirements
AI Infrastructure Requirements Spectrum
Spectrum: infrastructure requirements grow with the extent of model customization, from the AI Adopter market profile to the AI Innovator.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 10
LLM Training vs. Fine Tuning vs. Inferencing
Comparison table columns: Model Stage · Relative Compute · Relative Utilization · Analogy
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 11
AI Maturity Model
Align customer capabilities to technology investment
1. Business use for AI not yet defined; data culture to support AI not established; AI not a priority on the exec agenda
2. Formulated short-term AI strategy, with proof-of-concept scenarios for quick wins; exec and board support for AI, but not across all lines of business; small data science skillset on staff
3. Defined standalone AI strategy, platform in place, dedicated AI budget; decentralized support across staff, adequate resources for early stages
4. AI strategy based on a long-term roadmap for new services; framework defined to assure quality, format, and ownership
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 12
Operationalizing AI/ML is not trivial
Everyone in your organization plays a critical role in a complex process
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 13
AI and Infrastructure Pipelines
Layers: Storage · Compute · Network
• Data preparation: high storage requirement for ETL and data cleansing, optimized for AI retrieval
• Training: compute intensive, often with GPU acceleration and a high-speed, low-latency network
• Inferencing: lower compute, GPU acceleration, and network demands; requirements can increase with scale
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 14
Framework and Common Software
Pipeline: Data Ingest → Data Preprocessing → Model Training → Model Validation → Model Deployment
IO intensive (ingest/preprocessing) → Compute/GPU intensive (training) → Latency sensitive (inferencing)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 15
The need for flexible AI acceleration
1. Real-time generative AI inference for mixed video and audio streams: group chat, screen share, and recording for meeting summarization, transcription, or translation workloads
2. For the diversity of AI workloads: Data Ingest and Preparation → Edge Inferencing → Data Center Inferencing → Fine Tuning → Large Foundational Training, with GPU counts scaling from 0–2 and 1–4 through 4–64 up to 64–10K+
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 16
Revolutionizing AI workloads with 5th Gen Intel
Xeon Scalable Processors
High Performance Features
• Intel AMX with built-in AI accelerator in each
core
• Accelerated computations and reduced memory
bandwidth pressure
• Significant memory reductions with BF16/INT8
Software Optimization
Kubernetes (Red Hat OpenShift)
• Software suite of optimized open-source
frameworks and tools
• Intel Xeon optimizations integrated into popular
deep learning frameworks
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 17
Will Organizations Build
Large Clusters with over
1000 GPUs?
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 18
Inference and Fine Tuning
https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 19
99% of customers will not
be building infrastructure
to train their own LLMs
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 20
Many customers will build
GPU clusters in their existing
DCs for training use case
specific “smaller” models, for
fine tuning existing models,
and to do inferencing or
generative AI.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 21
Sample Large Language Model use Cases
• Summarization: LLMs are highly effective in text summarization tasks, in areas such as academic research, business report summaries, legal analysis, education materials, emails, etc.
• Translation: language translation is a key use case for LLMs in areas such as travel & tourism, legal, emergency services, education, and real-time translation.
• Dialog: use cases for LLM chatbots include customer service, personal assistants, tech support, and news and information.
• Text Generation: use of LLMs for content creation, marketing, documentation, business communication, product documentation, etc.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 22
Enterprise Considerations to Define Requirements
Questions:
• What is the use case?
• Am I training? Fine-tuning? Inferencing? RAG?
• How much data am I training on?
• How many models am I training on?
• Am I using private data?
• Who is responsible for management?
Considerations:
• Cost
• Accuracy
• Model size
• User experience (response time)
• Data fidelity
• Concurrent users/inputs
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 23
Where can this be run?
Enterprises can choose where any model should be trained. Primarily there are two options:
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 24
Smart Cloud, not Cloud First
On-Premise Data Center
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 25
Bringing it all together
A helicopter view of an AI Deployment Journey
1. Deploy AI-ready infrastructure
2. Install common AI models from industry repositories
3. Prep and inject data to fine tune the model
4. Periodic model updates and infrastructure scaling as required
5. Deploy application for inferencing
(From core to edge, managed through Cisco Intersight: HCI AI, NGC, FlexPod AI, FlashStack AI)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 26
AI Training
Infrastructure &
Network
Considerations for
AI Environments
Breaking-down Machine Learning – The Process
(Process loop: data and training produce a decision or recommendation; feedback and new/live data flow back in; the model is retrained as required.)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 28
Architecting an AI/ML training cluster - Considerations
AI/ML lifecycle: training & feedback → inferencing → retraining
Job Completion Time is driven by: scalability, congestion management, low latency, high bandwidth, and no traffic drops
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 29
Training and Inference Network Behaviors
Dimensions compared for training vs. inference: network bandwidth, network latency sensitivity, memory bandwidth, memory capacity, and compute.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 30
AI Networking: RDMA
Remote Direct Memory Access
Benefits of RDMA
• Low latency and CPU overhead
• High network utilization
• Efficient data transfer
• Supported by all major operating systems
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 31
Remote Direct Memory Access (RDMA)….InfiniBand
• RDMA allows AI/ML nodes to exchange data over a network by accessing the bytes directly in RAM (direct memory-to-NIC communication)
• Latency is very low because the CPU and kernel can be bypassed
• RDMA data was natively exchanged over InfiniBand fabrics
• Later, the RoCEv2 (RDMA over Converged Ethernet) protocol allowed the exchange over Ethernet fabrics
• RoCEv2 requires a non-blocking, lossless Ethernet transport, which in turn requires ECN and PFC
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 32
AI Networking: RoCE v1/RoCE v2 Protocol Stacks
RDMA Over Converged Ethernet
(Both protocol stacks span software and hardware layers.)
RoCE v1
• Ethernet link layer protocol
• Dedicated EtherType (0x8915)
• Can be used with or without a VLAN tag
RoCE v2
• Internet layer protocol – can be routed
• Dedicated UDP port (4791)
• UDP source port field is used to carry an opaque flow identifier
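To make the encapsulation concrete, here is a minimal sketch using Scapy; the addresses, ports, and payload bytes are illustrative assumptions, and the raw payload merely stands in for the InfiniBand transport headers.

```python
# Minimal sketch (assumption: Scapy installed; addresses are illustrative).
# RoCE v1 rides directly on Ethernet with EtherType 0x8915; RoCE v2 is routable
# because it is carried in UDP with destination port 4791. The UDP source port
# acts as an opaque flow identifier, which gives ECMP hashing per-flow entropy.
from scapy.all import Ether, IP, UDP, Raw

roce_v1 = Ether(type=0x8915) / Raw(b"GRH+BTH+payload")   # L2-only, not routable

roce_v2 = (
    Ether()
    / IP(src="10.1.1.10", dst="10.1.1.20", tos=0x02)  # ECT(0) so switches can ECN-mark
    / UDP(sport=49152, dport=4791)                    # sport = flow entropy, dport fixed
    / Raw(b"BTH+payload")                             # placeholder for the IB transport header
)

roce_v2.show()
```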
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 33
RoCEv2: PFC and ECN Together for Lossless Transport
How does it work?
• ECN is a Layer 3 congestion avoidance protocol: an IP-layer notification system that lets switches indirectly tell the sources to slow down their throughput.
• WRED thresholds are set low in the no-drop queue to signal congestion early with CNPs, giving endpoints enough time to react.
• PFC is a Layer 2 congestion avoidance protocol.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 34
Data Center Quantized Congestion Notification
• Neither IP ECN nor PFC alone provides a complete congestion management framework
• IP ECN signalling might take too long to relieve the congestion
• PFC alone could introduce other problems such as head-of-line blocking and unfairness between flows
• The two together provide the desired result of lossless RDMA communications across Ethernet networks (this is called DCQCN)
• The requirements are:
  • Ethernet devices compatible with both techniques
  • Proper configurations applied
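As a rough illustration of how the WRED thresholds drive ECN marking in the no-drop queue, the sketch below mirrors the shape of the algorithm; the threshold values echo the sample configuration shown later in the deck, and the linear marking ramp is a simplification of the actual ASIC behavior.

```python
import random

# Illustrative WRED/ECN marking model (simplified; real ASIC behavior differs).
# Below the minimum threshold nothing is marked; between min and max the marking
# probability ramps up; ECT-capable packets are marked CE instead of dropped.
MIN_KB, MAX_KB, MAX_PROB = 150, 3000, 0.07   # echoes: min 150 KB, max 3000 KB, drop-probability 7

def ecn_mark(queue_depth_kb: float) -> bool:
    """Return True if the packet should be marked ECN CE (0b11)."""
    if queue_depth_kb <= MIN_KB:
        return False
    if queue_depth_kb >= MAX_KB:
        return True
    ramp = (queue_depth_kb - MIN_KB) / (MAX_KB - MIN_KB)
    return random.random() < ramp * MAX_PROB

# Marking starts gently as the queue fills, signalling senders via CNPs long
# before the buffer is exhausted and PFC would have to pause the link.
for depth in (100, 500, 1500, 3000):
    hits = sum(ecn_mark(depth) for _ in range(10_000))
    print(f"queue {depth:>5} KB -> ~{hits / 100:.1f}% of packets marked")
```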
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 35
AI/ML Flow Characteristics (Training Focused)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 36
Bringing Visibility to AI workloads
• With the granular visibility provided by Cisco Nexus Dashboard Insights, the network administrator can observe drops in the network.
• Tune thresholds until congestion hot spots clear and packet drops stop under normal traffic conditions.
• This is the first and most important step to ensure that the AI/ML network will cope with regular traffic congestion occurrences effectively.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 37
Monitoring These Events
• DCQCN leaves fabric congestion management in a self-healing state
• Still it is important to keep it under
control:
• Frequently congested links can be
discovered
• QoS policies can be tweaked with a
direct feedback from the monitoring
tools
• Nexus ASICs can stream these metrics
directly to Nexus Dashboard Insights
• NDI will then collect, aggregate and
visualize them all to provide insights to
the operations team
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 38
Nexus Dashboard Insights – Congestion Visibility
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 39
Designing a Network for AI success
• Dedicated network
• Clos topology
• A stalled/idle job is expensive: wasted resources and time
• Visibility is key!
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 40
Do I need a backend network?
Frontend Network
Nexus Dashboard
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 41
Cisco Nexus HyperFabric
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 42
A Simplified Backend Network for AI Environments
Cisco Nexus HyperFabric Use Cases
Use cases: extend data centers · downsize data center tooling/footprint · manage multiple customer data centers
• Plug-and-play deployment • Managed from cloud
• Easily expand to data center edge/colo • Remote hands assistance
• Data center anywhere with cloud controller
• Small fabrics of 1-2 switches
• Planning/design tools to help build rollout
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 43
Building High-performance AI/ML Ethernet Fabrics
Maximizing customer choice and options
• Cisco Nexus HyperFabric AI Cluster — Enterprise / Public Sector / Commercial
• Nexus 9000 with Nexus Dashboard — Enterprise / Public Sector / Commercial, Service Providers, Tier-2 Web / AI-as-a-Service
• Cisco 8000 — Tier-2 Web / AI-as-a-Service, Hyperscalers
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 44
Building an AI Workload Pod for Training
• Backend spines and backend leafs carry the training traffic; a separate front-end compute fabric carries everything else
• 32 rack servers split across 2 racks

Performance Testing
• Linear scalability demonstrated through benchmarks
• Cisco UCS C245 M6 and C240 M7 rack servers (C240 M7 with MLNX-CX7 2x200G adapters)
• NetApp A800 storage and Cisco Nexus N9K-C9364D-GX2A switches

Accelerated Deployment
• Centralized management and automation
• NVIDIA HPC-X Software Toolkit setup & configuration
• NetApp DataOps Toolkit to help developers and data scientists

CVD Link: Cisco UCS C-Series Rack Server and NetApp AFF A400 storage array connected to Cisco Nexus 93600CD-GX leaf switch with Layer 2 configuration for single-rack testing
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 46
The Blueprint For Today
Built to accommodate 1024 GPUs along with storage devices
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 47
Inferencing,
Fine-Tuning, &
Compute
Infrastructure
Model Inferencing Use Cases
Productization Phase
Self-driving vehicles
Machine translation
Content generation
Recommender systems
Images/Video/Voice
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 49
Large Language Models (LLMs)
Limitations for enterprise use
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 50
Training LLMs
Resource-Intensive and costly
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 51
Use Foundational Models
Starting point for most Enterprises
Foundational models (FM) are pre-trained, general-purpose models: download them, then customize or integrate directly for inferencing in enterprise applications.
Examples: BERT, GPT, Llama, Mistral AI, Stable Diffusion, Cohere, Claude, BLOOM, …
Model size: LLMs <100B parameters · other generative <1B · predictive <100M
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 52
LLM, Fine-Tuning and RAG?
(Diagram: a base LLM built from a general training set, compared with an LLM combined with a private dataset via fine-tuning or RAG.)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 53
Business value of LLM + RAG
• RAG helps mitigate hallucination, the generation of incorrect or misleading information.
• One of the major concerns with AI models is their "black box" nature: we are unsure which sources were used to generate content. When RAG generates a response, it references the sources it used, enhancing transparency and instilling trust in users.
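A minimal sketch of the RAG pattern described above, assuming the sentence-transformers library and a small in-memory document store rather than a production vector database; the model name, documents, and the generate() call are placeholders, not part of any validated design.

```python
# Minimal RAG sketch: retrieve the most relevant private documents, then ground
# the LLM prompt in them and keep the sources for transparency.
# Assumptions: sentence-transformers is installed; generate() is a placeholder
# for whatever serving layer (Triton, TGI, etc.) hosts the LLM.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model, illustrative

knowledge_base = [
    "Policy doc: GPU clusters must be deployed in the primary data center.",
    "Runbook: fine-tuning jobs are scheduled through the MLOps platform.",
]
doc_vectors = embedder.encode(knowledge_base, convert_to_tensor=True)

def retrieve(question: str, k: int = 1):
    """Return the top-k knowledge base entries with their similarity scores."""
    q_vec = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_vec, doc_vectors)[0]
    top = scores.topk(k)
    return [(knowledge_base[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

question = "Where should GPU clusters be deployed?"
sources = retrieve(question)
prompt = f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"
# response = generate(prompt)          # hypothetical call to the serving layer
print(prompt)                           # the cited sources travel with the answer
```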
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 54
IT Infrastructure for Enterprise GenAI
High-level Architecture
Stack: AI/ML infrastructure and Kubernetes on top of an infrastructure layer of compute (CPU + GPU), network, block/file storage, and an object store.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 55
Scale fine-tuning and inferencing compute
from the data center to the edge
UCS X-Series
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 56
Simpler, Smarter, More Agile
Fabric-based adaptive computing: server profiles composed from storage, HBA, NIC, and network policies.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 57
Modular architecture
Ideal for AI component evolution
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 59
UCS X-Series for AI Workloads
Expandability and Flexibility
1. No backplane
2. X-Fabric
3. Server disaggregation (PCIe Node)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 60
X440P PCIe Node
• Two different types
• Provides 2 or 4 PCIe slots per node
• Connects via X-Fabric to the adjacent compute node
• Dedicated power and cooling for the GPUs (no disks or CPUs blocking airflow)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 61
Riser Style A
• Up to two dual width A16,
A40, L40, L40S, A100 or H100
(NVL*), Flex170, MI210* GPUs
• One x16 per riser = 1 per CPU
• No mixing of GPUs
* planned
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 62
Riser Style B
• Up to 4 single width
T4/L4/Flex140 GPUs
• Two x8 per riser = 2 per CPU
• No mixing of GPU models
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 63
X210c/X215c Blade with GPU options
Additional front-card GPU options
• Up to two GPUs slide into the front of the X210c/X215c compute node
• Can be used with a PCIe node to provide up to 6 GPUs per host
• Intel or AMD CPU
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 64
Cisco GPU-accelerated platforms offering
X-Series C-Series Rack Servers
C240 M6 INTEL
C245 M6 AMD C240 M7 INTEL C245 M8 AMD
C220 M6 INTEL C225 M8 AMD
3x NVIDIA T4
5x NVIDIA A10 3x NVIDIA A16 3x NVIDIA L4 Plan (2H’24) Plan (2H’24)
3x NVIDIA A16 3x NVIDIA A30
Up to 24x HHHL GPUs or 3x NVIDIA L4
3x NVIDIA A30 3x NVIDIA A40 C225 M6 AMD NVIDIA H100-80
8x FHFL GPUs per X9508 chassis
3x NVIDIA A40 3x NVIDIA A100-80 NVIDIA L40S
3x NVIDIA A100-80 2x NVIDIA H100-80 3x NVIDIA T4 NVIDIA L40
Plan (Q3’24)
3x NVIDIA L40 NVIDIA L4
8x NVIDIA L4 8x NVIDIA L4 C220 M7 INTEL NVIDIA H100-NVL
X210c M6/M7 2S Blades X440p + X210c M6/M7 X440p + X210c M7 (C240 M6 only) 2x NVIDIA L40S NVIDIA A16
2x NVIDIA T4 (MEZZ) 4x NVIDIA T4 (M6 Only) 2x NVIDIA H100-NVL 5x Intel Flex140 AMD MI210
3x NVIDIA L4
2x NVIDIA A16 3x Intel Flex170
X210c M7 2S Blade 3x Intel FLex140
Intel Flex140 (MEZZ) 2x NVIDIA A40
2 x NVIDIA A100-80 X440p + X215c M8 AMD
X210c M7 2S Blade 2x NVIDIA H100-NVL Plans are Subject to change
X440p + M7 (X210c & X410c)
NVIDIA L4 (MEZZ) 2x AMD MI210
2x NVIDIA H100-80
X215c M8 2S Blade 4x NVIDIA L4
2x NVIDIA L40
NVIDIA L4 (MEZZ) 2x NVIDIA L40S
4x NVIDIA L4
2x NVIDIA L40
2x NVIDIA L40S
2x NVIDIA A16
X440p + M7 (X210c & X410c)
4x Intel Flex140
Please Refer to the Server Specifications and HCL for detailed configuration support:
2x Intel Flex170 Plans are Subject to change C-Series: https://www.cisco.com/c/en/us/support/servers-unified-computing/ucs-c-series-rack-servers/series.html#~tab-documents
X-Series: https://www.cisco.com/c/en/us/support/servers-unified-computing/ucs-x-series-modular-system/series.html#~tab-documents
UCS HCL: https://ucshcltool.cloudapps.cisco.com/public/
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 65
Sizing for
Inferencing
LLM Inference Performance
How many GPUs do I need for inference?
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 67
LLM Inferencing Performance
Objective and Subjective
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 68
LLM Inference – Estimating Memory
How much memory does my model need?
• Model memory = precision in bytes × number of parameters (P)
• Example: 13 billion parameters × 2 bytes/parameter = 26 GB
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 69
LLM Inference – Estimating Memory
How much memory does my model need?
• Memory (inference) = model memory + ~20% overhead
• Example: 26 GB + 20% overhead = 31.2 GB
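The back-of-the-envelope arithmetic above can be captured in a few lines; the 2 bytes per parameter assumes FP16/BF16 weights and the 20% overhead is the rule of thumb from the slide, not a measured number.

```python
def inference_memory_gb(params_billion: float,
                        bytes_per_param: int = 2,   # FP16/BF16; 4 for FP32, 1 for INT8
                        overhead: float = 0.20) -> float:
    """Rule-of-thumb GPU memory needed to serve a model (weights + ~20% overhead)."""
    model_gb = params_billion * bytes_per_param        # 1e9 params * bytes ~= GB
    return model_gb * (1 + overhead)

# 13B parameters at FP16: 26 GB of weights, ~31.2 GB with overhead
print(inference_memory_gb(13))        # 31.2
print(inference_memory_gb(13, 1))     # ~15.6 GB if quantized to INT8
```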
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 70
LLM Inference - GPU Estimation
Which GPU do I use?
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 71
LLM Inference - Methodology
How many GPUs do I need for inference?
For a given model and inferencing runtime, start with
enough GPUs to load the model based on memory sizing
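Continuing the same hedged arithmetic, a starting GPU count can be derived by dividing the memory estimate by the usable memory of the chosen GPU; 80 GB is used here only as an example card size, and real deployments should validate the result against the target inferencing runtime, latency, and concurrency goals.

```python
import math

def gpus_to_load_model(model_params_billion: float,
                       gpu_memory_gb: float = 80,     # e.g. an 80 GB-class accelerator
                       bytes_per_param: int = 2,
                       overhead: float = 0.20) -> int:
    """Starting point only: enough GPUs to hold the model; throughput and
    concurrency targets may push the count higher."""
    needed_gb = model_params_billion * bytes_per_param * (1 + overhead)
    return math.ceil(needed_gb / gpu_memory_gb)

print(gpus_to_load_model(13))    # 1  -> a 13B FP16 model fits on a single 80 GB GPU
print(gpus_to_load_model(70))    # 3  -> a 70B FP16 model needs multiple GPUs
```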
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 72
Sample Performance Comparison with Nvidia A100
Llama 2 7B | NV-GPT-8B-Chat-4k-SFT | Llama2 13B
Llama 2 – 7B
Input tokens Length: 128 and output Tokens Length: 20
GPUs | Batch Size | Average Latency (ms) | Average Throughput (sentences/s)
1    | 1          | 241.1                | 4.1
2    | 1          | 249.9                | 8.0
4    | 1          | 280.2                | 14.3
8    | 1          | 336.4                | 23.8
1    | 2          | 197.1                | 5.1
2    | 2          | 204.1                | 9.8
4    | 2          | 230.2                | 17.4
8    | 2          | 312.6                | 25.5
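A quick way to read the table is to look at throughput per GPU; the sketch below reproduces that calculation for the batch-size-1 rows using the published numbers, and treats perfect linearity as the baseline.

```python
# Scaling check for the Llama 2 7B, batch size 1 rows of the table above.
# Throughput is in sentences/s; efficiency compares against perfect linear scaling.
rows = [(1, 4.1), (2, 8.0), (4, 14.3), (8, 23.8)]   # (GPUs, throughput)

base_gpus, base_tp = rows[0]
for gpus, tp in rows:
    per_gpu = tp / gpus
    efficiency = tp / (base_tp * gpus / base_gpus)
    print(f"{gpus} GPUs: {per_gpu:.2f} sentences/s per GPU, {efficiency:.0%} of linear")
```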
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 75
Integrate with DevOps to accelerate
AI application delivery
Data center
Colo
Edge
• Accelerate CI/CD processes and extend infrastructure-as-code (IaC) workflows by integrating Intersight into your DevOps toolchains
• Simplify lifecycle management with integrated infrastructure and workload orchestration tools
#CiscoLive © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public
Day 0/2: Operations (Full Stack Bare Metal)
• Lack of visibility across multiple infra and cluster deployments
Operational • Difficulty gathering compliance and resource audits
Challenges • Capacity planning and inventory expansion
Hybrid Cloud Console (SaaS): optimization, security, supported alerts, admin
• Telemetry – infra health, alerts, alarms, security
• Infra capacity management for expansion
• RBAC, multi-tenant
• UCS-X at the edge sites
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 77
One-click OpenShift cluster deployment
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 78
AI project deployment workflow example
1. Deploy AI-ready infrastructure
2. Deploy Red Hat OpenShift and other resources*
3. Load LLM from Hugging Face and explore/evaluate
4. Deploy Vector Database for RAG
5. Deploy Q/A Chatbot App using Enterprise Knowledge Base**
* Workbenches/namespaces
** For demo purposes
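The "load an LLM from Hugging Face and explore/evaluate" step above can be as small as the sketch below; the model ID is a placeholder, gated models additionally require an access token, and a GPU-enabled PyTorch install is assumed.

```python
# Minimal sketch of pulling and probing an LLM from Hugging Face.
# Assumptions: transformers + GPU-enabled PyTorch; the model ID is illustrative.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",   # placeholder; gated models need a HF token
    device_map="auto",                        # spread across available GPUs
)

print(generator("Summarize what a vector database is used for in RAG:",
                max_new_tokens=64)[0]["generated_text"])
```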
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 79
Model delivery lifecycle: iterate, with scalability and governance.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 80
Cisco Validated
Designs (CVD’s)
for AI
Cisco Validated Designs (CVD)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 83
Cisco Compute Coverage: UCS only · FlexPod · FlashStack · Nutanix
• MLOps: explore the cutting edge of MLOps, where the efficiency of machine learning workflows meets the rigor of operational excellence.
• Text-to-image: immerse yourself in the innovative world of text-to-image synthesis, where vivid images are conjured from descriptive language or existing photos.
• Image analysis: delve into the realm of image analysis, where advanced algorithms interpret and understand visual data with astonishing accuracy.
Optimized for AI
Accelerated Deployment
• Deployment validation of popular Inferencing Servers
and AI models such as Stable Diffusion and Llama 2
LLMs with diverse model serving options
• Automated deployment with Ansible playbook
AI at Scale
• Scale discretely with future-ready and modular design
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 85
FlashStack for Generative AI | Inferencing with LLMs
Foundational Architecture for Gen AI
• Validated NVIDIA NeMo Inference with TensorRT-LLM, which accelerates inference performance of LLMs on NVIDIA GPUs
• Validated models using the Text Generation Inference server from Hugging Face
Deployment
• Extensive breadth of validation of AI models such as GPT, Stable Diffusion, and Llama 2 LLMs with diverse model serving options
• Automated deployment with Ansible playbooks
Consistent Performance
• Consistent average latency and throughput
• Better price-to-performance ratio
Validated stack (managed by Cisco Intersight):
• Generative AI models: NeMo GPT, Llama, Stable Diffusion
• Inferencing servers: NVIDIA Triton, Text Generation Inference, PyTorch
• Red Hat OpenShift: control plane and worker virtual machines
• Virtualization: VMware vSphere
• FlashStack infrastructure: Cisco UCS X210c compute nodes, Cisco UCS X440p PCIe nodes, Pure Storage FlashBlade or FlashArray, NVIDIA GPU accelerators
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 86
Cisco and Nutanix partner for AI: The Power of Two
GPT-in-a-Box
• AI everywhere: existing apps and new experiences
• Proven platforms: CVDs and automated playbooks
• Secure foundation: end-to-end resiliency
(Nutanix Cloud Platform + Cisco Compute and Networking, managed through Cisco Intersight)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 87
Cisco Compute Hyperconverged GPT-in-a-Box
Deploy hybrid-cloud AI-ready clusters with Cisco Validated Designs (CVDs)
Stack: generative AI apps on AHV virtualization and Nutanix AOS, running on GPU-enabled Cisco compute managed by Cisco Intersight
Benefits:
• Risk reduction & fast time to market
• Streamline operations
• Proven performance
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 88
CVDs to simplify end-to-end AI infrastructure
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 89
Future Trends and
Industry Impacts of
AI Infrastructure
Demands
AI drives a better future for the data center
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 91
Power & Cooling Trends
• CPU, GPU, and switch ASIC power requirements are moving from ~350W TDP today to 400W+ and far beyond in the coming year(s)
• Traditional fan cooling consumes a lot of power and becomes less efficient as system power increases
• Passive cooling is approaching its limits
• Liquid cooling technology addresses future cooling requirements with significantly better cooling efficiency and reduced noise levels
• Closed-loop liquid cooling provides a retrofit solution

2U server power breakdown (total ~2400W): CPU 500W (21%), GPU 600W (25%), memory 480W (20%), storage 360W (15%), fans 240W (10%), misc 150W (7%), plus PCIe I/O
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 92
Liquid Cooling Technologies
• Option 1: PAO6 – zero GWP, cheaper, lower cooling capability; FC-40 – better cooling, higher GWP; material compatibility considerations
• Option 2: better cooling with FC-3284; heatsink design uses a boiling-enhancement coating; material compatibility considerations; high GWP
• Option 3: better cooling with PG25; zero GWP; leaks can be catastrophic; requires parallel connections to avoid pre-heat
• Option 4: better cooling with R134a, Novec 7000, or another refrigerant; enables highly dense systems, series connections OK; leaks not catastrophic
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 93
Compute Express Link (CXL)
Disaggregation Technologies
• Alternate protocol that runs across the standard PCIe physical layer
• Uses a flexible processor port that can auto-negotiate to either the
standard PCI transaction protocol or alternate CXL transaction protocols
• First generation CXL aligns to 32 Gbps PCIe 5.0
• CXL usage expected to be a key driver for an aggressive timeline to PCIe
6.0
• Allows you to build fungible platforms
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 94
UCS X-Fabric Technology For Disaggregation
Open, modular design enables compute and accelerator node connectivity
• Compute nodes (chassis front) connect to GPU nodes (chassis rear)
• No midplane nor cables = easy upgrades
• Expandability to address new use cases in the future (memory & storage nodes)
UCS X-Fabric Technology
• Internal fabric interconnects the nodes
• Industry-standard PCIe traffic today
• CXL will evolve out of PCIe for next-generation speeds, cache coherency, shared I/O, and memory
• Upgrade to future generations
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 95
Expanding Ecosystem of Viable GPU Options
Availability stages: Available Now · Available Now · 1H CY 2024 · In Development
Next-generation AI accelerator: Falcon Shores (process nodes 7nm and 5nm; timeline 2024–2025)
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 96
Ultra Ethernet Consortium - UEC
https://ultraethernet.org/uec-progresses-towards-v1-0-set-of-specifications/
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 97
Ultra Ethernet Consortium - UEC
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 98
Open Standard NVLink Alternatives
Introduction of Ultra Accelerator Link (UALink)
• AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft are announcing the formation of a group that will create a new industry standard, UALink, and build its ecosystem.
• A low-latency, high-bandwidth fabric for hundreds of accelerators: an interconnect for GPU-to-GPU communications.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 99
Silicon Photonics
Bringing Higher Data Rates, Lower Latency & Reduced Power Consumption
Fiber Optic Photonics
• Operates over length scales of hundreds or thousands of kilometers, e.g. undersea fiber optic links for the internet
• The majority of the optical link is light traveling in fiber optic cable
• Source laser, periodic repeaters/amplifiers, and a photodetector at the receiver
• All components (lasers, amplifiers, photodetectors, optical modulators, splitters, etc.) are discrete and connected = very costly
Silicon Photonics
• Integrated photonics technology
• All optical components are created directly on the same silicon-on-insulator (SOI) substrate, i.e. compact photonic chips that can be closely integrated with CMOS logic
• Because all components share the same substrate, optical components can be packed far denser than discrete optics can achieve
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 100
Summary
Take Aways and Closing
- Cisco Makes AI Hybrid Cloud Possible
• Very few customers will train the largest models; most will use pre-trained models with their own data and deploy the associated inference models.
• The use cases must drive which AI models, methods, and techniques to utilize; AI consultants play a vital role in assessment, guidance, and adoption.
• AI is driving the next push for modernized data center facilities, upgraded networks, compute, storage, and operational models.
• Major investments are not required to start: you can get started with CPU-based acceleration and existing infrastructure.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 102
Complete Your Session Evaluations
Earn 100 points per survey completed and compete on the Cisco Live
Challenge leaderboard.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 103
• Visit the Cisco Showcase
for related demos
Contact us at:
eminchen@cisco.com, nicgeyer@cisco.com
BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 104
Thank you
#CiscoLive
Congestion in the fabric
Congestion can always happen – some math:
• 16 ToR leafs (L1 … L16), each dual-connected to every spine (S1 … S4) with 2x200Gbps links; host links run at 100Gbps
How does RoCEv2 handle this?
• RoCEv2 MUST run over a lossless network; retransmission must be avoided
• Ethernet networks are lossy by design, so drops can happen
• Incast example: senders S1, S2 and S3 sit behind leafs L1, L2 and L3, receiver R behind leaf L4; each sender transmits 100Gbps, so the flows sum to 300Gbps converging on R's 100Gbps access link
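The arithmetic behind the incast example is simple enough to script; the link speeds are the ones from the figure, and the point is only that the offered load far exceeds the receiver's access link, so the fabric must buffer, mark, pause, or drop the excess.

```python
# Incast arithmetic from the example above: three 100 Gbps senders converge on
# one receiver whose access link is also 100 Gbps.
senders_gbps = [100, 100, 100]     # S1, S2, S3
receiver_link_gbps = 100

offered = sum(senders_gbps)
overload = offered - receiver_link_gbps
print(f"offered {offered} Gbps into a {receiver_link_gbps} Gbps link "
      f"-> {overload} Gbps ({overload / offered:.0%}) must be absorbed or signalled away")
```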
ECN In Action With RoCEv2
(Important: the state changes and actions in this sequence happen in nanoseconds.)
• Senders S1, S2 and S3 each push 100Gbps toward receiver R, and the no-drop queue on the path toward R fills past the WRED minimum threshold. Sample policy: random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
• The congested switch starts marking packets destined to R with ECN CE (0b11) instead of forwarding them as ECT (0b10)
• When R receives CE-marked packets, it sends CNPs (Congestion Notification Packets, sent once every X msec) back to the offending senders
• Each sender that receives CNPs reduces its transmit rate (for example 100 → 50 → 25Gbps), so the aggregate toward R falls (300 → 150 → 125 → 75Gbps)
• Once the queue drains below the threshold, packets are no longer CE-marked, CNPs stop, and the senders can ramp back up
Considerations
• When a switch's no-drop queue fills past the xOFF threshold, it sends PFC PAUSE frames upstream on that priority; the upstream device stops transmitting and absorbs in-flight packets into its own buffer
• The process repeats from there until it reaches the original senders; at that point they also temporarily stop sending packets
• By the time this happens all the buffers in the network should be flushed and forwarding can start again
PFC and ECN Joining Forces
• Buffer crosses xOFF → PFC frames are sent towards the source
• Buffer drains back below xON → PFC frames are no longer sent
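A toy model of the xOFF/xON behavior described above; the thresholds and packet sizes are arbitrary, and only the shape of the mechanism (pause above xOFF, resume below xON) matches the slide.

```python
# Toy PFC model: a no-drop queue asserts PAUSE when its depth crosses the xOFF
# threshold and releases it once draining brings the depth back under xON.
XOFF_KB, XON_KB = 800, 400          # arbitrary illustrative thresholds

class NoDropQueue:
    def __init__(self):
        self.depth_kb = 0
        self.paused = False          # True while PFC pause frames are sent upstream

    def enqueue(self, kb):
        self.depth_kb += kb
        if not self.paused and self.depth_kb >= XOFF_KB:
            self.paused = True       # send PFC pause toward the source(s)

    def dequeue(self, kb):
        self.depth_kb = max(0, self.depth_kb - kb)
        if self.paused and self.depth_kb <= XON_KB:
            self.paused = False      # stop sending pause frames, traffic resumes

q = NoDropQueue()
for arriving, draining in [(300, 0), (300, 0), (300, 100), (0, 400), (0, 400)]:
    q.enqueue(arriving)
    q.dequeue(draining)
    print(f"depth={q.depth_kb:>4} KB paused={q.paused}")
```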
Priority Flow Control In Action With RoCEv2
• While the ECN/CNP exchange is still in flight, the congested no-drop queue on the leaf in front of receiver R keeps filling; when it crosses xOFF, that leaf sends PFC pause frames upstream to the spine
• The spine buffers the paused traffic and, before sending its own PFC frames, also starts marking packets with ECN 0b11
• As each upstream buffer fills in turn, PFC propagates hop by hop (spine → sender leafs L1/L2/L3 → senders), with every switch also marking packets ECN 0b11 before sending PFC
• Eventually the senders themselves are paused; combined with the CNP-driven rate reduction, per-sender rates drop (for example to 20Gbps each, 60Gbps aggregate toward R) and the congestion clears
ECN and PFC – What Each One Brings
RoCEv2 can leverage the use of both ECN and PFC to achieve its goals (i.e. lossless transport)
• ECN is an IP layer notification system. It allows the switches to indirectly inform the sources as
soon as a threshold is reached and let them slow down the throughput
• PFC works at Layer 2 and serves as a way to use the buffer capacity of switches in the data
path to temporarily ensure the no-drop queue is honoured. It effectively happens at each
switch, hop-by-hop, back to the source, giving the source time to react without dropping
packets
• ECN should react first, and PFC acts as a fail-safe if the reaction is not fast enough
• Together the combination helps achieve the lossless behavior required by AI/ML traffic
• This collaboration of both is called Data Center Quantized Congestion Notification (DCQCN)
• All Nexus 9000 CloudScale ASICs support DCQCN
Alternatives to ECN with WRED
Approximate Fair Drop
• The Nexus 9000 ASICs also implement advanced queuing algorithms that can avoid some non-optimal WRED results
• For example, WRED has no knowledge of which flows are consuming most of the bandwidth; ECN marking happens only based on probability
• AFD constantly tracks the amount of traffic exchanged and divides flows into two categories:
  • Elephant flows: long and heavy, which will be penalized (ECN marked)
  • Mice flows: short and light, which will not be penalized (not ECN marked)
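A simplified sketch of the elephant/mice distinction AFD relies on; real hardware tracks flows in ASIC tables with decaying counters, while this just tallies bytes per flow and flags only flows above an illustrative threshold for marking.

```python
from collections import defaultdict

# Simplified AFD-style accounting: track bytes per flow and only subject
# "elephant" flows to ECN marking, leaving short "mice" flows untouched.
ELEPHANT_BYTES = 1_000_000           # illustrative threshold

bytes_per_flow = defaultdict(int)

def observe(flow_id: str, packet_bytes: int) -> bool:
    """Return True if this packet belongs to an elephant flow and may be ECN-marked."""
    bytes_per_flow[flow_id] += packet_bytes
    return bytes_per_flow[flow_id] > ELEPHANT_BYTES

# A long RDMA transfer quickly crosses the threshold; a short RPC never does.
for _ in range(2000):
    observe("gpu0->gpu7 rdma_write", 9000)    # jumbo frames from a training job
observe("client->api rpc", 200)

for flow, total in bytes_per_flow.items():
    kind = "elephant (mark)" if total > ELEPHANT_BYTES else "mouse (don't mark)"
    print(f"{flow}: {total} bytes -> {kind}")
```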