
The Blueprint to Building

End-To-End Hybrid-Cloud
AI Infrastructure
Nick Geyer, Cisco Systems Inc.
Eugene Minchenko, Cisco Systems Inc.
BRKCOM-1008

#CiscoLive
Cisco Webex App
https://ciscolive.ciscoevents.com/ciscolivebot/#BRKCOM-1008

Questions?
Use Cisco Webex App to chat
with the speaker after the session

How
1 Find this session in the Cisco Live Mobile App

2 Click “Join the Discussion”

3 Install the Webex App or go directly to the Webex space

4 Enter messages/questions in the Webex space

Webex spaces will be moderated by the speaker until June 7, 2024.

BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 2
Agenda

• Introduction
• AI Fundamentals & Impacts on Infrastructure Design Decisions
• Training Infrastructure & Network Considerations for AI Environments
• Inferencing, Fine-Tuning, & Compute Infrastructure
• Sizing for Inferencing
• AI Infrastructure Automation & Cisco Validated Designs
• Future Trends and Industry Impacts of AI Infrastructure Demands
• Summary

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
AI sets a new standard for Infrastructure

Only 13% of data center management leaders say their network can accommodate AI computational needs.

AIOps: How can we harness all the data available to us to simplify data center operations?

Scale and Performance: Is our network AI-ready, with the ability to support data training and inferencing use cases?

Sustainability: How are we addressing corporate and regulatory sustainability requirements in our data center design?

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
What we know

Every organization's AI approach and needs are different:
Build the Model | Training
Optimize the Model | Fine-tuning & RAG
Use the Model | Inferencing

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 5
What we’re hearing from IT infra and operations

• Need consistency; avoid new islands of operations
• Optimize for utilization and efficiency in many dimensions: support multiple projects, leverage GPUs wisely, manage power and cooling needs, and handle lifecycle management
• Comprehensive security protocols and measures
• Support a rapidly evolving software ecosystem
• Manage cloud vs. on-prem vs. hosted models
• Straddle the training → fine-tuning → inferencing → repeat cycle
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 6
Cisco’s 2-Fold AI Strategy & Our Focus Today
AI In: Using AI to maximize YOUR experience with Cisco products. Develop AI tools across the Cisco portfolio that help manage networks more effectively:
• Delivering better results
• Providing intelligent guidance
• Providing better security
• Solving day-to-day challenges

AI On: Enabling YOUR infrastructure to support adoption of AI applications. Develop products that help accelerate YOUR adoption of AI for your business solutions:
• High-speed networking for AI training and inference clusters
• Flexible compute building blocks to build AI compute clusters
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 7
AI Fundamentals &
Impacts on
Infrastructure
Design Decisions
AI: Level Setting and Definition

AI sits at the intersection of data science and computer science and spans:
• Supervised learning, unsupervised learning, and reinforcement learning
• Generative Adversarial Networks (GANs)
• Transformer-based language models (ChatGPT, LLaMA 2, etc.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 9
AI Infrastructure Requirements
AI Infrastructure Requirements Spectrum (market profile from AI Innovator to AI Adopter):

Extensive model customization (hyperscalers and large enterprises)
• Custom foundation models or extensive fine tuning
• $10M+ infrastructure and resources; months of development
• Compute: 1,000-10,000 GPU clusters; Network: InfiniBand / Ethernet

Moderate model customization (large enterprise training); most GenAI projects are here
• Pre-trained model with RAG, P-tuning, and fine tuning
• $M+ infrastructure and resources; weeks of development
• Compute: 100-1,000 GPU clusters; Network: InfiniBand / Ethernet

Large production inferencing and AI model lifecycle
• Compute: 4-8 GPU/node; Network: InfiniBand / Ethernet

Smaller production inferencing, AI model lifecycle, and small-parameter training
• Compute: 2-4 GPU/node; Network: Ethernet

Low model customization (GenAI-as-a-service)
• Consumption model, $ per inference; fastest time to market
• Initial testing of pretrained models
• Compute: CPU only to 2 GPU/node; Network: Ethernet
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 10
LLM Training vs. Fine Tuning vs. Inferencing
Relative compute utilization by model stage, with an analogy:

• Inferencing: 1X. Analogy: answering biology questions ("What is the role of mitochondria?")
• Fine tuning: 10X. Analogy: learning biology: words, terms, concepts
• Training: 100X. Analogy: learning the English language: patterns of words
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 11
AI Maturity Model
Align customer capabilities to technology investment

Exploratory
• Business use for AI not yet defined
• Data culture to support AI not established
• Exec agenda for AI not a priority
• No AI processes or technologies in place for implementation
• No investment in infrastructure to support AI workloads

Experimental
• Formulated short-term AI strategy; proof-of-concept scenarios for quick wins
• Exec and board support for AI, but not across all lines of business; small data science skillset on staff
• Data advancement with policy and a degree of governance using point solutions
• Trial of AI-adjacent technologies with future budget allocation

Plan
• Defined standalone AI strategy; platform in place; dedicated AI budget
• Decentralized support across staff; adequate resources for early stages
• Data gathering and analytics on a centralized platform for a variety of use cases
• AI used for internal processes (billing automation, segment analysis)

Transform
• AI strategy based on a long-term roadmap for new services
• Framework defined to assure quality, format, and ownership
• Data available in real time for predictive analysis
• A centralized platform model with pre-integrated AI capabilities

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 12
Operationalizing AI/ML is not trivial
Everyone in your organization plays a critical role in a complex process

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 13
AI and Infrastructure Pipelines

Roles across the pipeline: Data Engineer, Data Scientist, and DevOps | SecOps | Infrastructure, over shared storage, compute, and network.

• Data Preparation: preparing structured or unstructured data to create a training data set for the model. High storage requirement for ETL and data cleansing, optimized for AI retrieval.
• Training and Customization: a selected model learns from the training data set and builds relationships. Compute intensive, often with GPU acceleration and a high-speed, low-latency network.
• Inference: when prompted, the model interprets new, unseen data and creates a response for the user based on its training. Lower compute, GPU acceleration, and network demands, though requirements can increase with scale.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 14
Framework and Common Software
Pipeline stages: Data Ingest → Data Preprocessing → Model Training → Model Validation → Model Deployment

• Data ingest and preprocessing (IO intensive): data engineering, data visualization, feature identification
• Model training (compute/GPU intensive): weekly retraining, model management
• Model validation: model ranking and validation
• Model deployment (latency sensitive, inferencing): production deployment, establishing the feedback loop

Underlying layers: data ingestion, the AI/ML/DL framework, and the inferencing end point.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 15
The need for flexible AI acceleration
1. Inference for real-time workloads. Example: an AI-enabled video conferencing app handles mixed video and audio streams, using general-purpose compute (CPU) for tasks such as group chat, screen share, and recording, and AI acceleration (GPU) for real-time generative AI such as transcription, meeting summarization, or translation.

2. For the diversity of AI workloads, GPU counts and network designs scale with the stage: data ingest and preparation and edge inferencing run on roughly 0-2 GPUs, data center inferencing on 1-4, fine tuning on 4-64, and large foundational training on 64-10K+ GPUs. Network considerations shift from a shared fabric at the small end to dedicated fabrics for large training clusters.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 16
Revolutionizing AI workloads with 5th Gen Intel
Xeon Scalable Processors
High Performance Features
• Intel AMX with built-in AI accelerator in each
core
• Accelerated computations and reduced memory
bandwidth pressure
• Significant memory reductions with BF16/INT8

Enhanced System Capabilities


• Larger last-level cache for improved data locality
• Higher core frequency and faster memory with
DDR5
• Intel AVX-512 for non-deep learning vector
computations

Software Optimization
Kubernetes (Red Hat OpenShift)
• Software suite of optimized open-source
frameworks and tools
• Intel Xeon optimizations integrated into popular
deep learning frameworks

TCO Benefits and Compatibility


• Lower operational costs and a smaller environmental
footprint
• Available on UCS X-Series, C240, C220 platforms

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 17
Will Organizations Build
Large Clusters with over
1000 GPUs?

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 18
Inference and Fine Tuning

https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 19
99% of customers will not
be building infrastructure
to train their own LLMs

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 20
Many customers will build
GPU clusters in their existing
DCs for training use case
specific ”smaller” models, for
fine tuning existing models,
and to do inferencing or
generative AI.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 21
Sample Large Language Model use Cases
• Summarization: LLMs are highly effective in text summarization tasks, in areas such as academic research, business report summaries, legal documents, education materials, emails, etc.
• Translation: language translation is a key use case for LLMs in areas such as travel & tourism, legal, emergency services, education, and real-time translation.
• Dialog: examples of use cases for LLM chatbots include customer service, personal assistants, tech support, and news and information.
• Text Generation: use of LLMs for content creation, marketing, documentation, business communication, product documentation, etc.
• Sentiment Analysis: use LLMs to determine sentiment in areas such as comments, responses, content moderation, feedback, and market research.
• Code Generation: LLMs can be used to increase coding productivity, with tools such as Copilot, in areas such as web development, data analysis, education tools, etc.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 22
Enterprise Considerations to Define Requirements
Workload questions:
• What is the use case?
• Am I training? Fine tuning? Inferencing? Using RAG?
• How much data am I training on?
• How many models am I training?
• Am I using private data?
• Who is responsible for management?

Requirement dimensions:
• Cost
• Accuracy
• Model size
• User experience (response time)
• Data fidelity
• Concurrent users/inputs

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 23
Where can this be run?
Enterprises can choose where any model should be trained. There are primarily two options:

On premises
• Always available for the enterprise to use
• Flexibility for a large enterprise to leverage the same cluster for different functions
• Data is stored locally (data sovereignty)

Public clouds
• Provides flexibility; pay for what you need
• Cost grows with more data and training
• Challenges: cost of egress data from the cloud, latency, and lock-in

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 24
Smart Cloud, not Cloud First
Example: on-premises data center for a quantitative trading firm, London, UK (12,000 GPUs)

• Example hyperscaler cost model: cloud provider (Lambda Labs) at $1.99 per H100 GPU-hour. Potential annual cost: ~$210 million.
• Example on-prem cost model: colo, servers, storage, and network. Potential annual cost: ~$130 million per year (over 3 years).
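A quick back-of-the-envelope check of the hyperscaler figure (a rough sketch: the $1.99/hour rate and continuous 24x7 usage come from the slide, everything else is simple arithmetic):

# Rough annual-cost check for the 12,000-GPU cloud example above.
GPUS = 12_000
RATE_PER_GPU_HOUR = 1.99          # USD per H100-hour, as quoted on the slide
HOURS_PER_YEAR = 24 * 365

annual_cloud_cost = GPUS * RATE_PER_GPU_HOUR * HOURS_PER_YEAR
print(f"Cloud: ~${annual_cloud_cost / 1e6:.0f}M per year")   # ~$209M, matching the ~$210M figure

# The on-prem figure (~$130M/year over 3 years) bundles colo, servers, storage,
# and networking, so it is not directly derivable from a single hourly rate.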

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 25
Bringing it all together
A helicopter view of an AI Deployment Journey
1. Deploy AI-ready infrastructure (Cisco Validated Designs, Cisco Intersight, FlexPod AI, FlashStack AI, HCI AI)
2. Install common AI models from industry repositories (e.g., NVIDIA AI Enterprise, NGC)
3. Prep and inject data to fine tune the model (customer enterprise data)
4. Periodic model updates and infrastructure scaling as required
5. Deploy the application for inferencing, at the core and at the edge

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 26
AI Training
Infrastructure &
Network
Considerations for
AI Environments
Breaking-down Machine Learning – The Process

Training data → training infrastructure (dataset + algorithm) → trained model (a function with weighted parameters) → inference infrastructure → output (decision, trend, prediction, classification, generative content, recommendation). New/live data feeds inference, feedback is collected on the results, and retraining happens as required.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 28
Architecting an AI/ML training cluster - Considerations

AI models and applications consume massive amounts of data, and the data is constantly growing, so the infrastructure faces many challenges in growing at the same scale as the data.

The AI/ML lifecycle (training, inferencing, feedback, and retraining) is measured by job completion time, which depends on:
• Scalability
• Congestion management
• Low latency
• High bandwidth
• No traffic drops
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 29
Training and Inference Network Behaviors
(Radar-chart comparison of LLM training, ranking training, LLM inference, and ranking inference across five dimensions: network bandwidth, network latency sensitivity, memory bandwidth, memory capacity, and compute.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 30
AI Networking: RDMA
Remote Direct Memory Access
Benefits of RDMA
• Low latency and CPU overhead
• High network utilization
• Efficient data transfer
• Supported by all major operating systems

Zero Copy Networking

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 31
Remote Direct Memory Access (RDMA)….InfiniBand
• RDMA allows AI/ML nodes to exchange data over a network by accessing the bytes directly in RAM (direct memory-to-NIC communication between system/GPU memory and the RDMA NIC)
• Latency is very low because the CPU and kernel can be bypassed
• RDMA data was natively exchanged over InfiniBand fabrics
• Later, the RoCEv2 (RDMA over Converged Ethernet) protocol allowed the exchange over Ethernet fabrics, which requires a non-blocking, lossless Ethernet transport with ECN and PFC

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 32
AI Networking: RoCE v1/RoCE v2 Protocol Stacks
RDMA Over Converged Ethernet

RoCE v1
• Ethernet link layer protocol
• Dedicated EtherType (0x8915)
• Can be used with or without a VLAN tag

RoCE v2
• Internet layer protocol; can be routed
• Dedicated UDP port (4791)
• The UDP source port field is used to carry an opaque flow identifier
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 33
RoCEv2: PFC and ECN Together for Lossless Transport
How does it work?
ECN is a Layer 3 congestion avoidance protocol: an IP-layer notification system that lets switches indirectly tell traffic sources to slow down. WRED thresholds are set low in the no-drop queue so that congestion is signalled early with CNPs, giving the endpoints enough time to react.

PFC is a Layer 2 congestion avoidance protocol. PFC thresholds are set higher than the ECN thresholds: when oversubscription fills buffers too quickly for ECN to react, PFC kicks in and mitigates the congestion.
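As a purely illustrative sketch of why the ECN (WRED) marking thresholds sit below the PFC pause threshold, the toy queue model below marks packets early and only asserts PFC as a backstop (the threshold values are invented for illustration, not Nexus defaults):

# Toy no-drop queue: ECN-mark early, assert PFC pause only as a lossless backstop.
# Threshold numbers are illustrative only, not recommended or default values.
ECN_MIN = 100      # WRED minimum threshold (cells): start probabilistic marking here
ECN_MAX = 300      # WRED maximum threshold: mark every packet beyond this
PFC_XOFF = 800     # PFC pause threshold: deliberately higher than the ECN range

def queue_action(depth_cells: int) -> str:
    if depth_cells >= PFC_XOFF:
        return "send PFC pause upstream (lossless backstop)"
    if depth_cells >= ECN_MAX:
        return "ECN-mark all packets (receiver answers with CNPs)"
    if depth_cells >= ECN_MIN:
        frac = (depth_cells - ECN_MIN) / (ECN_MAX - ECN_MIN)
        return f"ECN-mark ~{frac:.0%} of packets"
    return "forward normally"

for depth in (50, 150, 400, 900):
    print(depth, "->", queue_action(depth))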

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 34
Data Center Quantized Congestion Notification
• Neither IP ECN nor PFC alone provides a complete congestion management framework
• IP ECN signalling might take too long to relieve the congestion
• PFC could introduce other problems, such as head-of-line blocking and unfairness between flows
• Used together, they provide the desired result of lossless RDMA communications across Ethernet networks (this combination is called DCQCN)
• The requirements are:
  • Ethernet devices compatible with both techniques
  • Proper configurations applied

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 35
AI/ML Flow Characteristics (Training Focused)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 36
Bringing Visibility to AI workloads

With the granular visibility provided by Cisco Nexus Dashboard Insights, the network administrator can observe drops and congestion hot spots, then tune thresholds until the hot spots clear and packet drops stop under normal traffic conditions. This is the first and most important step to ensure that the AI/ML network copes effectively with regular traffic congestion occurrences.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 37
Monitoring These Events
• DCQCN leaves fabric congestion management in a self-healing state
• Still it is important to keep it under
control:
• Frequently congested links can be
discovered
• QoS policies can be tweaked with a
direct feedback from the monitoring
tools
• Nexus ASICs can stream these metrics
directly to Nexus Dashboard Insights
• NDI will then collect, aggregate and
visualize them all to provide insights to
the operations team

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 38
Nexus Dashboard Insights – Congestion Visibility

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 39
Designing a Network for AI success
Design goals:
• Dedicated network
• Non-blocking, lossless fabric
• High throughput, no oversubscription
• Low jitter, low latency
• Clos topology
• Visibility is key!

Why it matters:
• Stalled or idle jobs waste expensive resources and time; on average 25% of jobs fail
• The goal is to optimize job completion time

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 40
Do I need a backend network?

The frontend network (managed with Nexus Dashboard, at 10G | 25G | 50G | 100G | 400G | 800G) connects storage, compute, GPU, DPU, and FPGA resources. The backend network must be lossless and high-throughput, with low jitter and low latency.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 41
Cisco Nexus HyperFabric AI Cluster
On-prem AI infrastructure in partnership with NVIDIA: pods of plug-and-play data center fabrics that democratize AI infrastructure and provide visibility into the full AI stack.

• Unified stack, including NVIDIA AI Enterprise (NVAIE)
• AI-native operational model with cloud-managed operations
• High-performance Ethernet on Cisco 6000 Series switches, built on Cisco Silicon One and optics innovations
• NVIDIA GPU servers with NVIDIA BlueField-3 DPU/NICs and VAST storage

A solution that will enable you to spend time on AI innovation, not on IT.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 42
A Simplified Backend Network for AI Environments
Cisco Nexus HyperFabric Use Cases

Cloud SaaS controller:
• Single global UI for all owned fabrics; single global API endpoint
• Underlay and lifecycle automation
• Ease of use for IT generalists; self-service for fabric tenants
• Simple to deploy and manage; scalable, AI-ready Ethernet fabric

Use cases:
• Build new data centers: start small and grow fabrics (1+)
• AI/ML/HPC fabrics and AI clusters
• Extend data centers: plug-and-play deployment; easily expand to data center edge/colo; data center anywhere with the cloud controller
• Downsize the data center tooling footprint: small fabrics of 1-2 switches
• Manage multiple customer data centers: managed from the cloud, remote-hands assistance, planning/design tools to help build the rollout
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 43
Building High-performance AI/ML Ethernet Fabrics
Maximizing customer choice and options
Cisco Nexus HyperFabric AI Cluster (Cisco cloud managed as a service, full stack; for enterprise, public sector, and commercial)
• Turnkey AI pod with HyperFabric-managed servers (BMC), NICs, and switches
• Converged Ethernet infrastructure; greenfield deployments only
• 400G to 800G on Cisco 6000 (Silicon One) switches
• FCS target CY25

Nexus 9000 with Nexus Dashboard (private cloud managed, interoperable; for enterprise, public sector, commercial, service providers, and Tier-2 web / AI-as-a-service)
• General-purpose AI multi-pod fabric; simplified network operations with Nexus Dashboard
• CVDs for converged Ethernet infrastructure; greenfield and brownfield deployments
• 100G to 400G to 800G on Nexus (Cloud Scale and Silicon One) switches
• Shipping

Cisco 8000 (customizable solution, BYO management, SONiC / BYO NOS; for Tier-2 web / AI-as-a-service and hyperscalers)
• Cisco validated SONiC or community sourced; customer assembled and operated
• ECMP (shipping) and Scheduled Ethernet (FCS target 2H 2024) options; greenfield deployments
• 400G to 800G on Silicon One switches

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 44
Building an AI Workload Pod for Training
• Backend network for training: backend spines and leafs, alongside a frontend ToR and front-end compute fabric
• 32 rack servers split across 2 racks per pod
• Scale up to 30 pods per spine: 960 servers, 1,920 GPUs
• Clustered GPUs with direct memory access over RoCE-enabled NICs; full RoCEv2 support on the compute
• Link speeds in the topology: 100 Gbps and 400 Gbps
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 45
GPU Intensive Applications: converged infrastructure example

(Topology: Cisco UCS C245 M6 rack servers with NVMe SSDs and SATA HDDs plus a Cisco C240 M7 with Mellanox CX7 2x200G NICs, a Cisco Nexus 1/10G copper management switch, Cisco Nexus 93600CD-GX, 9336C-FX2, and 9364D-GX2A switches, and a NetApp A800 array; 100Gb front-end network and a back-end lossless, non-blocking 400G network.)

Performance testing: linear scalability demonstrated through benchmark tests on real-life model simulations, showcasing consistent performance even with varying dataset sizes:
• Weather simulation (MiniWeather)
• Nuclear engineering (Minisweep)
• Cosmology (High Performance Geometric Multigrid)

Accelerated deployment:
• Centralized management and automation
• NVIDIA HPC-X Software Toolkit setup and configuration
• NetApp DataOps Toolkit to help developers and data scientists perform numerous data management tasks

CVD link: Cisco UCS C-Series Rack Server and NetApp AFF A400 storage array connected to a Cisco Nexus 93600CD-GX leaf switch with a Layer 2 configuration for single-rack testing
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 46
The Blueprint For Today
Built to accommodate 1024 GPUs along with storage devices

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 47
Inferencing,
Fine-Tuning, &
Compute
Infrastructure
Model Inferencing Use Cases
Productization Phase

• Self-driving vehicles
• Face recognition and computer vision
• Conversational agents
• Machine translation
• Analysis of medical images
• Content generation (images/video/voice)
• Recommender systems

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 49
Large Language Models (LLMs)
Limitations for enterprise use

Hallucination: can make things up; always has an answer
Sources: where did the information come from?
Outdated: models may be stale as soon as they are released
Customize: cannot personalize or use more current data
Update: cannot edit the model to remove or change data

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 50
Training LLMs
Resource-Intensive and costly

Large Language Models are pre-trained on a large corpus of publicly available unlabeled data. Training takes thousands of GPUs over a span of months, and periodic re-training is required to stay up to date.

GPT-3 Large (175B parameters)
• Training set tokens: 300B; vocabulary size: ~50k
• Number of GPUs: 10k x V100; training time: one month

Llama (65B parameters)
• Training set tokens: ~1-1.3T; vocabulary size: ~32k
• Number of GPUs: 2,048 x A100; training time: 21 days

Building LLMs from scratch is cost-prohibitive for the average enterprise.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 51
Use Foundational Models
Starting point for most Enterprises

BERT
GPT
Llama Customize or
integrate directly for
Foundational Download Mistral AI inferencing in
models (FM) Stable Diffusion enterprise
Cohere applications
Model Size Claude
LLMs <100B BLOOM
Other Generative <1B …
Predictive <100M Pre-trained,
general-purpose
models

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 52
LLM, Fine-Tuning and RAG?

(Diagram comparing three patterns: a base LLM built from its training set, which generates answers when the user asks; a fine-tuned LLM additionally trained on a private dataset; and RAG, where a retriever pulls relevant content from a real-time private dataset and the user's question plus that retrieved context are passed to the LLM to generate the answer.)
RAG: Retrieval-Augmented Generation

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 53
Business value of LLM + RAG
• RAG helps in mitigating hallucination or generation of incorrect or
misleading information.

• Fine-tuning a pre-trained language model can be a resource-intensive


process. RAG offers a cost-effective alternative.

• RAG generates context-aware responses by retrieving relevant data before crafting a response, which leads to clearer and more meaningful interactions with users.

• One of the major concerns with AI models is their "black box" nature: we are unsure which sources were used to generate content. When RAG generates a response, it references the sources it used, enhancing transparency and instilling trust in users.
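To make the retrieve-then-generate flow concrete, here is a minimal, dependency-free sketch of the RAG pattern; the retriever is a naive word-overlap scorer standing in for a real embedding model and vector database, and all names are illustrative rather than any specific product API:

# Minimal RAG sketch: retrieve relevant private documents, then prompt the LLM
# with that context. A real deployment would use an embedding model plus a
# vector database instead of this naive word-overlap scoring.
knowledge_base = {
    "ucs-x": "Cisco UCS X-Series is a modular compute system with X-Fabric for GPU expansion.",
    "roce": "RoCEv2 carries RDMA over routed Ethernet and relies on ECN and PFC for lossless transport.",
}

def retrieve(question: str, top_k: int = 1) -> list[str]:
    q_words = set(question.lower().split())
    ranked = sorted(knowledge_base.values(),
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What does RoCEv2 need for lossless transport?"))

The assembled prompt is what gets sent to the LLM serving endpoint, and the retrieved passages double as citable sources for the answer, which is where the transparency benefit above comes from.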

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 54
IT Infrastructure for Enterprise GenAI
High-level Architecture

Generative AI and predictive AI use cases (app + model) each run an ML model with its ML frameworks, tools, and runtimes, governed by MLOps. Everything sits on Kubernetes and a shared infrastructure layer: compute (CPU + GPU), network, block/file storage, and an object store.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 55
Scale fine-tuning and inferencing compute
from the data center to the edge

Scale in the enterprise and at the edge with UCS X-Series, managed through Intersight:
• Optimize for smaller scale: decrease components, operating costs, and management complexity
• Drive sustainable outcomes: reduce power, cooling, and physical footprint
• Run any workload: from transactions to AI inferencing

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 56
Simpler, Smarter, More Agile

Fabric-based adaptive computing with innovative stateless server configuration: server profiles bundle storage, HBA, NIC, and network policies and are applied to bare metal.
• Scale seamlessly to changing business needs
• Faster deployment of applications, with greater control and flexibility
• Infrastructure shapes itself to your specific workloads
• Better performance, resiliency, and high availability
• Less cost and complexity with fewer components

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 57
Modular architecture
Ideal for AI component evolution

• Investment preservation: the convenience to upgrade or replace individual parts without overhauling the entire system reduces cost and ensures that initial investments remain valuable over time.
• Multi-vendor support: components can be selected from different vendors. The best example is accelerators, where you can move from AMD or Intel (AMX) CPUs to NVIDIA A100 and then H100 GPUs, or to AMD GPUs in the future.
• Management and upgradability: keep your technology stack current, adaptable, and competitive. Cisco Intersight is a SaaS-based platform that provides cloud-scale management from data center to edge.

The X-Series modular system decouples the lifecycles of CPU, GPU, memory, storage, and fabrics, providing a perpetual architecture that efficiently brings you the latest innovations:
• Modularity on X-Series: cloud-powered composability with Cisco Intersight
• X-Fabric: flexible GPU acceleration across server nodes (PCIe nodes 1-4)
• No backplane or cables = easy upgrades

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 59
UCS X-Series for AI Workloads
Expandability and Flexibility

1. No backplane
2. X-Fabric
3. Server disaggregation (PCIe Node)
(Diagram: PCIe Nodes 1-4 connected through UCS X-Fabric Technology.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 60
X440P PCIe Node
• Two different types; provides 2 or 4 PCIe slots per node
• Connects via X-Fabric to the adjacent compute node
• Dedicated power and cooling for the GPUs (no disks or CPUs blocking airflow)

(Diagram: four X440P PCIe nodes in a chassis hosting GPUs 1-12.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 61
Riser Style A
• Up to two dual width A16,
A40, L40, L40S, A100 or H100
(NVL*), Flex170, MI210* GPUs
• One x16 per riser = 1 per CPU
• No mixing of GPUs

* planned

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 62
Riser Style B
• Up to 4 single width
T4/L4/Flex140 GPUs
• Two x8 per riser = 2 per CPU
• No mixing of GPU models

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 63
X210c/X215c Blade with GPU options
Additional Front Card GPU Options

• Up to six U.2 NVMe drives or up to two GPUs on the front of the node
• Slides into the front of an X210c/X215c compute node (Intel or AMD CPU)
• Can be used with a PCIe node to provide up to 6 GPUs per host

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 64
Cisco GPU-accelerated platforms offering
X-Series C-Series Rack Servers

X-Series: up to 24x HHHL GPUs or 8x FHFL GPUs per X9508 chassis. X210c/X215c M6/M7/M8 2S blades take mezzanine GPUs (NVIDIA T4, L4, Intel Flex 140); pairing a blade (X210c or X410c) with the X440p PCIe node adds larger GPUs such as NVIDIA T4, A16, A40, A100-80, H100-80, H100-NVL, L4, L40, L40S, AMD MI210, and Intel Flex 140/170.

C-Series rack servers: the C220/C225 and C240/C245 families (M6 and M7 today, M8 planned for 2H'24) support NVIDIA T4, A10, A16, A30, A40, A100-80, H100-80, H100-NVL, L4, L40, L40S, AMD MI210, and Intel Flex GPUs in various per-model counts.

Plans are subject to change. Please refer to the server specifications and HCL for detailed configuration support:
C-Series: https://www.cisco.com/c/en/us/support/servers-unified-computing/ucs-c-series-rack-servers/series.html#~tab-documents
X-Series: https://www.cisco.com/c/en/us/support/servers-unified-computing/ucs-x-series-modular-system/series.html#~tab-documents
UCS HCL: https://ucshcltool.cloudapps.cisco.com/public/

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 65
Sizing for
Inferencing
LLM Inference Performance
How many GPUs do I need for inference?

• Use case: determines the model and the minimum GPU; the CPU will also have an impact
• Model architecture: impacts compute requirements per inference token (TFLOPs)
• Context length: depends on the model; use an average token size or vary token lengths in tests
• GPU: depends on its performance (TFLOPS); use tests to verify actual performance

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 67
LLM Inferencing Performance
Objective and Subjective

Latency (example prompt: "What is Cisco UCS?")
• Time to first token
• Total generation time
• Time to second/next token

Throughput (43 output tokens in the example)
• Requests per second, dependent on concurrency and total generation time
• Tokens per second is the standard measure (> 30 per second)

User experience is a combination of low latency, throughput, and accuracy.
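Both headline metrics fall straight out of request timestamps; a small sketch (the timing values are hypothetical):

# Deriving the latency and throughput metrics above from raw timestamps.
request_sent = 0.00            # seconds (hypothetical timings)
first_token_at = 0.35
last_token_at = 1.75
output_tokens = 43             # as in the example prompt above

time_to_first_token = first_token_at - request_sent           # 0.35 s
total_generation_time = last_token_at - request_sent          # 1.75 s
tokens_per_second = output_tokens / (last_token_at - first_token_at)

print(f"TTFT {time_to_first_token:.2f}s, total {total_generation_time:.2f}s, "
      f"{tokens_per_second:.1f} tokens/s")   # ~30.7 tokens/s, right at the usability threshold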

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 68
LLM Inference – Estimating Memory
How much memory does my model need?

For a given precision: FP32, FP16, TF16…

• Model Memory
Precision in Bytes x # of parameters (P)

Example: Llama2 – 13B parameters

• Model Memory:
13 billion x 2Bytes/parameter = 26GB

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 69
LLM Inference – Estimating Memory
How much memory does my model need?

For a given precision: FP32, FP16, TF16…

• Memory (Inference)
Model Memory + ~20% overhead

Example: Llama2 – 13B parameters

• Memory (Inference):
26GB + 20% overhead = 31.2GB
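The two rules of thumb fold into a small helper (a sketch, using the slide's ~20% inference overhead and decimal gigabytes):

# Rule-of-thumb inference memory: parameters x bytes-per-parameter, plus ~20% overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def inference_memory_gb(params_billions: float, precision: str = "fp16",
                        overhead: float = 0.20) -> float:
    model_gb = params_billions * BYTES_PER_PARAM[precision]   # 1B params x 2 bytes = 2 GB at FP16
    return model_gb * (1 + overhead)

print(inference_memory_gb(13))   # Llama2-13B at FP16 -> 31.2 GB, as above
print(inference_memory_gb(70))   # a 70B model at FP16 -> 168 GB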

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 70
LLM Inference - GPU Estimation
Which GPU do I use?

Based on model memory, a 13B parameter model (~31 GB with overhead) can be loaded by any GPU with at least 32 GB. Similarly, a 70B parameter model (~168 GB) would require roughly two A100-80 GPUs (168 GB / 80 GB). A small helper sketch follows the table.

GPU Model | Memory (GB) | Memory Bandwidth (GB/s) | FP16 Tensor Core (TFLOP/s)
H100 | 80 | 2000 | 756
A100 | 80 | 1935 | 312
L40S | 48 | 864 | 362
L4 | 24 | 300 | 121
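Combining the memory estimate with the table gives a first-pass GPU count (a sketch; real sizing should also allow for KV cache, batch size, and throughput targets):

# Fractional GPU count from memory footprint alone; round up in practice.
GPU_MEMORY_GB = {"H100": 80, "A100": 80, "L40S": 48, "L4": 24}

def gpus_for_model(memory_gb: float, gpu: str) -> float:
    return memory_gb / GPU_MEMORY_GB[gpu]

print(gpus_for_model(31.2, "L40S"))   # 13B at FP16 -> 0.65, so one 48 GB L40S fits it
print(gpus_for_model(168, "A100"))    # 70B at FP16 -> 2.1, i.e. roughly two A100-80s as above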

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 71
LLM Inference - Methodology
How many GPUs do I need for inference?
For a given model and inferencing runtime, start with enough GPUs to load the model based on memory sizing, then:

• Vary concurrent inference requests and measure throughput and latency metrics for a given token length (context)
• Vary batch sizes and measure throughput and latency; this maximizes compute utilization for non-real-time use cases
• Add a second GPU and repeat the concurrent-request and batch-size tests (as needed)
• Monitor GPU compute and memory utilization, along with inferencing performance, across all tests
• Select a configuration that optimally balances latency, throughput, and cost

Sample tool: https://github.com/openshift-psap/llm-load-test
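A stripped-down version of that methodology, in the spirit of the llm-load-test tool linked above (the endpoint URL and payload shape are placeholders for whatever serving runtime you deploy):

# Concurrency sweep against an HTTP inference endpoint (sketch).
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests   # assumes a reachable HTTP serving runtime; adjust the payload to match it

ENDPOINT = "http://inference.example.local/v1/generate"      # placeholder URL
PAYLOAD = {"prompt": "What is Cisco UCS?", "max_tokens": 128}

def one_request() -> float:
    start = time.perf_counter()
    requests.post(ENDPOINT, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start

def sweep(concurrency: int, total_requests: int = 32) -> None:
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total_requests)))
    wall = time.perf_counter() - t0
    print(f"concurrency={concurrency:3d}  p50={statistics.median(latencies):.2f}s  "
          f"throughput={total_requests / wall:.1f} req/s")

for c in (1, 2, 4, 8, 16):
    sweep(c)   # watch GPU compute/memory utilization alongside these numbers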

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 72
Sample Performance Comparison with Nvidia A100
Llama 2 7B | NV-GPT-8B-Chat-4k-SFT | Llama2 13B

Llama 2 – 7B
Input tokens Length: 128 and output Tokens Length: 20

Batch Size | GPUs | Average Latency (ms) | Average Throughput (sentences/s)
1 | 1 | 241.1 | 4.1
2 | 1 | 249.9 | 8.0
4 | 1 | 280.2 | 14.3
8 | 1 | 336.4 | 23.8
1 | 2 | 197.1 | 5.1
2 | 2 | 204.1 | 9.8
4 | 2 | 230.2 | 17.4
8 | 2 | 312.6 | 25.5
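A quick sanity check on the table: throughput is essentially batch size divided by average latency, which also shows why the second GPU (tensor parallelism) buys lower latency rather than doubled throughput:

# Re-derive the reported throughput column from batch size and latency.
rows = [  # (batch, gpus, avg_latency_ms, reported_sentences_per_s)
    (1, 1, 241.1, 4.1), (2, 1, 249.9, 8.0), (4, 1, 280.2, 14.3), (8, 1, 336.4, 23.8),
    (1, 2, 197.1, 5.1), (2, 2, 204.1, 9.8), (4, 2, 230.2, 17.4), (8, 2, 312.6, 25.5),
]
for batch, gpus, latency_ms, reported in rows:
    derived = batch / (latency_ms / 1000.0)
    print(f"batch={batch} gpus={gpus}  derived={derived:.1f}  reported={reported}")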

Optimized price-to-performance ratio with FlashStack AI.
AI Infrastructure
Automation
Policy based compute to scale operations

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 75
Integrate with DevOps to accelerate
AI application delivery
Dev and DevOps teams work alongside infrastructure and operations teams across the data center, colo, and edge.
• Accelerate CI/CD processes and extend infrastructure-as-code (IaC) workflows by integrating Intersight into your DevOps toolchains
• Simplify lifecycle management with integrated infrastructure and workload orchestration tools
#CiscoLive © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public
Day 0/2: Operations (Full Stack Bare Metal)
Operational challenges:
• Lack of visibility across multiple infrastructure and cluster deployments
• Difficulty gathering compliance and resource audits
• Capacity planning and inventory expansion

The hybrid cloud console (Intersight SaaS, or the optional on-prem Intersight Private Appliance for air-gapped use cases) gives hybrid-cloud and Kubernetes admins:
• Telemetry: infra health, alerts, alarms, and security advisories, plus optimization guidance
• Infrastructure capacity management for expansion; inventory (firmware, network, storage)
• Add/remove bare metal nodes and cluster lifecycle (upgrade/downgrade) across edge sites running UCS-X
• Field alerts, alarms, and security advisories; telemetry, metrics, and actionable insights; observability; hardware compatibility; RBAC and multi-tenancy
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 77
One-click OpenShift cluster deployment

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 78
AI project deployment workflow example
1. Deploy AI-ready infrastructure (Cisco Intersight, Cisco Validated Designs)
2. Deploy Red Hat OpenShift and other resources (image registry, pipelines, artifacts repo, model repo)
3. Deploy two projects (workbenches/namespaces) in OpenShift AI
4. Load an LLM from Hugging Face and explore/evaluate it
5. Save and upload the model to the model repo
6. Deploy a model serving runtime
7. Deploy the LLM for inferencing (model delivery pipeline)
8. Deploy a vector database for RAG (with the Attu open-source GUI)
9. Ingest enterprise data, including unstructured data, into the vector database (a sketch of this step follows the list)
10. Deploy the Q/A chatbot app using the enterprise knowledge base
11. Deploy the GUI front end for the Q/A chatbot
12. Deploy the enterprise Q/A chatbot for inferencing at the core and edge (application inferencing pipeline, for demo purposes)
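Step 9, the knowledge-base ingest, typically looks like the sketch below; the chunking is plain Python, while embed() and vector_db.insert() are placeholders for whatever embedding model and vector database the deployment uses, not a specific product API:

# Sketch of the vector-database ingest step (step 9 above).
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split a document into overlapping chunks so retrieval keeps local context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(documents: dict[str, str], embed, vector_db) -> None:
    # embed() returns a vector for a text chunk; vector_db is a placeholder client.
    for doc_id, text in documents.items():
        for n, piece in enumerate(chunk(text)):
            vector_db.insert(id=f"{doc_id}-{n}", vector=embed(piece), text=piece)

# The Q/A chatbot (steps 10-12) then embeds each user question, queries the vector
# database for the nearest chunks, and passes them to the served LLM as context.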

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 79
Streamline and scale the model delivery lifecycle using MLOps, for scalability, governance, efficiency, reliability, and adaptability:
• Prepare data: gathering and preparing data for AI
• Experiment / tune the model: apply scientific rigor to understand the data and build or customize the model
• Serve and integrate with the app: the model becomes available for production inferencing
• Monitor / maintain the model: track model quality, metrics, and drift, then iterate

The pace of AI/ML technology shifts requires a strong foundation to adapt.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 80
Cisco Validated
Designs (CVD’s)
for AI
Cisco Validated Designs (CVD)

• Accelerate: ready-to-go solutions for faster time to value
• Expert guidance: CVDs provide everything from system designs to implementation guides and Ansible automation
• Less risk: reduce risk with tested architectures for standardized, repeatable deployments
• Cisco unified computing TAC support: a single point of contact for the solution; Cisco will coordinate with partners as needed to resolve issues
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 83
Cisco Compute coverage: UCS only, FlexPod, FlashStack, Nutanix

Explore Cisco validated AI demos showcasing a broad spectrum of AI technologies and practices ready to transform your business:
• Large Language Models (LLMs): discover the power of LLM inferencing as it seamlessly processes and generates human-like text in real time (NVIDIA AI Enterprise, Hugging Face, NVIDIA TRT-LLM, text-to-text Gen AI)
• Retrieval Augmented Generation (RAG): experience an enterprise-grade RAG chatbot delivering responses tailored to your enterprise-specific content (NVIDIA AI Enterprise, NVIDIA NIM, vector database, text-to-text Gen AI)
• MLOps: explore the cutting edge of MLOps, where the efficiency of machine learning workflows meets the rigor of operational excellence (Red Hat OpenShift AI, NVIDIA AI Enterprise, Hugging Face, LangChain, Mistral, vLLM, Gen AI)
• Image synthesis: immerse yourself in the innovative world of text-to-image synthesis, where vivid images are conjured from descriptive language or existing photos (Gen AI, diffusion models, text-to-image)
• Image analysis: delve into the realm of image analysis, where advanced algorithms interpret and understand visual data with astonishing accuracy (predictive AI, Kaggle, Keras neural network, Intel AMX, image-to-text)
FlexPod for Generative AI Inferencing

Optimized for AI

• Comprehensive suite of AI tools and frameworks


with NVIDIA AI Enterprise that support optimization
for NVIDIA GPU

• Validated NVIDIA NeMo with TRT-LLM that accelerates


inference performance of LLMs on NVIDIA GPUs

• Metrics dashboard for insights into cluster and GPU


performance and behavior

Accelerated Deployment
• Deployment validation of popular Inferencing Servers
and AI models such as Stable Diffusion and Llama 2
LLMs with diverse model serving options
• Automated deployment with Ansible playbook

AI at Scale
• Scale discretely with future-ready and modular design

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 85
FlashStack for Generative AI | Inferencing with LLMs
Foundational architecture for Gen AI:
• Validated NVIDIA NeMo Inference with TensorRT-LLM, which accelerates inference performance of LLMs on NVIDIA GPUs
• Validated models using the Text Generation Inference server from Hugging Face
• Metrics dashboard for insights into infrastructure, cluster, and GPU performance and behavior

Simplify and accelerate model deployment:
• Extensive breadth of validation of AI models such as GPT, Stable Diffusion, and Llama 2 LLMs with diverse model serving options
• Automated deployment with Ansible playbooks

Consistent performance:
• Consistent average latency and throughput
• Better price-to-performance ratio

Solution stack, managed by Cisco Intersight: generative AI models (NeMo GPT, Llama, Stable Diffusion); inferencing servers (NVIDIA Triton, Text Generation Inference, PyTorch); NVIDIA AI Enterprise (advanced AI platform with advanced integration); Portworx Enterprise (model repository and storage for applications); Red Hat OpenShift (control plane and worker virtual machines); VMware vSphere virtualization; FlashStack infrastructure (Cisco UCS X210c compute nodes, UCS X440p PCIe nodes, Pure Storage FlashBlade or FlashArray, NVIDIA GPU accelerators).

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 86
Cisco and Nutanix partner for AI: The Power of Two
Chat GPT-in-a-box

• AI everywhere: existing apps and new experiences
• Proven platforms: Nutanix Cloud Platform and Cisco Intersight, with CVDs and automated playbooks
• Secure foundation: Cisco Compute and Networking with end-to-end resiliency

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 87
Cisco Compute Hyperconverged GPT-in-a-Box
Deploy hybrid-cloud AI-ready clusters with Cisco Validated Designs (CVDs)
Business challenges: streamlined governance with enterprise software, sustainable energy use, and the complexity of hybrid cloud.

Benefits: optimized GenAI infrastructure, risk reduction and fast time to market, streamlined operations, proven performance, protection of valuable data, and simplified hybrid cloud operations.

Stack: generative AI apps and foundation models on Kubeflow and PyTorch, Kubernetes, Nutanix Files and Object Storage, AHV virtualization, and Nutanix AOS, running on GPU-enabled Cisco compute managed by Cisco Intersight.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 88
CVDs to simplify end-to-end AI infrastructure
1. CVD blueprint for AI networks: best-performing AI/ML networks with a focus on application performance; intelligent buffering, low latency, telemetry, and RoCEv2; dynamic congestion avoidance; one IP network for both front end and back end; automation for day-2 operations; validated designs for the network and ecosystem partners.

2. CVDs for simplified AI-ready infrastructure (expanded roadmap): NVIDIA AI Enterprise, Red Hat OpenShift AI, GPT-in-a-Box on Nutanix Hyperconverged, Gen AI with Cloudera Data Platform, NGC, Developer Cloud.

3. CVD playbooks supporting common AI models (new): large language models (GPT-3, BERT, T5); computer vision models (ResNet, EfficientNet, YOLO); generative models (GANs, VAEs).

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 89
Future Trends and
Industry Impacts of
AI Infrastructure
Demands
AI drives a better future

With a new kind of data center: simple, sustainable, and future-ready.
• Artificial intelligence and simplified cloud operations
• More programmability and control, less operational complexity
• More efficient performance for new workloads, less costly to build, deploy, and operate
• Edge inferencing and fleet management; sustainability and power efficiency

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 91
Power & Cooling Trends

Example 2U accelerated-server power budget (~2,400 W total): CPU 500 W (21%), GPU 600 W (25%), memory 480 W (20%), storage 360 W (15%), fans 240 W (10%), misc/PCIe I/O 150 W (7%).

• CPU, GPU, and switch ASIC power requirements are moving from ~350 W TDP today to 400 W+ and far beyond in the coming years
• Traditional fan cooling consumes a lot of power and becomes less efficient as system power increases; passive cooling is approaching its limits
• Liquid cooling technology addresses future cooling requirements with significantly better cooling efficiency and reduced noise levels; closed-loop liquid cooling provides a retrofit solution
• Future data center designs will need to provision for rack-level liquid cooling infrastructure (with an external Cooling Distribution Unit, CDU)
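The per-component figures above sum roughly as follows (a sketch using the slide's numbers; the 16-server rack fill is a hypothetical assumption and facility overhead such as PUE is not included):

# 2U accelerated-server power budget from the breakdown above (watts).
budget = {"CPU": 500, "GPU": 600, "Memory": 480, "Storage": 360, "Fans": 240, "Misc/PCIe-IO": 150}
server_watts = sum(budget.values())            # 2,330 W, i.e. the "~2,400 W" figure
rack_servers = 16                              # hypothetical rack fill
rack_kw = server_watts * rack_servers / 1000
print(f"Per server: {server_watts} W, per rack of {rack_servers}: {rack_kw:.1f} kW")
# At ~37 kW per rack, air cooling alone becomes marginal, which is why the
# liquid cooling options on the next slide matter.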

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 92
Liquid Cooling Technologies
• Single-phase immersion: PAO6 (zero GWP, cheaper, lower cooling capability) or FC-40 (better cooling, higher GWP); material compatibility concerns
• Two-phase immersion: better cooling with FC-3284; heatsink design uses a boiling-enhancement coating; material compatibility concerns; high GWP
• Single-phase cold plate: better cooling with PG25; zero GWP; leaks can be catastrophic; requires parallel connections to avoid pre-heat
• Two-phase cold plate: better cooling with R134a, Novec 7000, or another refrigerant; enables highly dense systems; series connections are OK; leaks are not catastrophic

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 93
Compute Express Link (CXL)
Disaggregation Technologies
• Alternate protocol that runs across the standard PCIe physical layer
• Uses a flexible processor port that can auto-negotiate to either the
standard PCI transaction protocol or alternate CXL transaction protocols
• First generation CXL aligns to 32 Gbps PCIe 5.0
• CXL usage expected to be a key driver for an aggressive timeline to PCIe
6.0
• Allows you to build fungible platforms

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 94
UCS X-Fabric Technology For Disaggregation
Open, modular design enables compute and accelerator node connectivity

• Open standards: PCIe 4/5/6, CXL*; not just another PCIe switch
• No midplane or cables = easy upgrades
• Expandability to address new use cases in the future (memory and storage nodes)
• UCS X-Fabric Technology: an internal fabric at the rear of the chassis that interconnects compute and GPU nodes, carries industry-standard PCIe traffic, and can be upgraded to future generations
• CXL will evolve out of standard PCIe for next-generation speeds, cache coherency, shared I/O, and memory

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 95
Expanding Ecosystem of Viable GPU Options
Accelerators available now (7 nm) and in 1H CY2024 (5 nm) offer native RoCE with scale-up and scale-out, available via HLS-1 and HLS-Gaudi2 servers (x8), SMC servers (x8), Aivres/IEI servers (x8), SDSC, public cloud (AWS EC2), and Intel Developer Cloud. The next-generation AI accelerator, Falcon Shores, is in development on the 2025 horizon.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 96
Ultra Ethernet Consortium - UEC

https://ultraethernet.org/uec-progresses-towards-v1-0-set-of-specifications/
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 97
Ultra Ethernet Consortium - UEC

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 98
Open Standard NVLink Alternatives
Introduction of Ultra Accelerator Link (UALink)
AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft are announcing the formation of a group that will create a new industry standard, UALink, and build its ecosystem: a low-latency, high-bandwidth fabric for hundreds of accelerators, serving as the interconnect for GPU-to-GPU communications.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 99
Silicon Photonics
Bringing Higher Data Rates, Lower Latency & Reduced Power Consumption
• Fiber Optic Photonics
  • Operates over length scales of hundreds or thousands of kilometers, e.g. undersea fiber optic links for the internet
  • The majority of the optical link involves light in fiber optic cable
  • Source laser, periodic repeaters/amplifiers, and a photodetector at the receiver
  • All components (lasers, amplifiers, photodetectors, optical modulators, splitters, etc.) are discrete and connected == very costly

• Silicon Photonics
  • Integrated photonics technology
  • All optical components are created directly on the same silicon-on-insulator (SOI) substrate, i.e. compact photonic chips that can be closely integrated with CMOS logic
  • Because all components are created on the same substrate, optical components can be packed far more densely than discrete optics can achieve

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 100
Summary
Takeaways and Closing
- Cisco Makes AI Hybrid Cloud Possible

• Compute: flexible GPU acceleration
• Network: lossless, high-performance fabrics
• Storage: scalability, tight coupling with compute & networking

AI is pushing infrastructure requirements

• Very few customers will train the largest models. Most will use pre-trained models with their own data and deploy associated inference models.
• The use cases must drive which AI models, methods, and techniques to utilize. AI consultants play a vital role in assessment, guidance, and adoption.
• AI is driving the next push for modernized data center facilities; upgraded networks, compute, and storage; and operational models.
• Major investments are not required to start. You can get started with CPU-based acceleration and existing infrastructure.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 102
Complete Your Session Evaluations

Complete a minimum of 4 session surveys and the Overall Event Survey to be


entered in a drawing to win 1 of 5 full conference passes to Cisco Live 2025.

Earn 100 points per survey completed and compete on the Cisco Live
Challenge leaderboard.

Level up and earn exclusive prizes!

Complete your surveys in the Cisco Live mobile app.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 103
Continue your education

• Visit the Cisco Showcase for related demos
• Book your one-on-one Meet the Engineer meeting
• Attend the interactive education with DevNet, Capture the Flag, and Walk-in Labs
• Visit the On-Demand Library for more sessions at www.CiscoLive.com/on-demand

Contact us at:
eminchen@cisco.com, nicgeyer@cisco.com

BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 104
Thank you

#CiscoLive
Congestion in the fabric
Congestion can always happen

• Congestion can always happen, even in a non-blocking switch/fabric
• Let's consider the following example with some math:
  • 16 ToRs (L1-L16), each dual-connected to every spine (S1-S4) with 2x400Gbps links
  • Every ToR therefore has 3.2Tbps of uplink capacity
  • Each ToR is attached to 26 dual-homed nodes via 100Gbps links
  • Every node could be firing up to 200Gbps of traffic without exceeding the uplink capacity
  • But where is this traffic going?

(Diagram: spines S1-S4, leaves L1-L16; 2x400Gbps leaf-to-spine links, 100Gbps node links)
Congestion could always happen

• If the traffic aggregated toward a node exceeds its egress bandwidth capacity, then we have congestion
• The impact depends on the data plane protocol:
  • Protocols with congestion control capabilities, like TCP, can auto-adjust the flow throughput
  • Other protocols, like UDP, have no concept of congestion control

(Diagram: flows summing to 300Gbps converge on a node attached via a 100Gbps link; 2x400Gbps leaf-to-spine links)
How Does RoCEv2 Solve This?
• RoCEv2 MUST run over a lossless network; retransmission must be avoided
• Ethernet networks are lossy by design; drops can happen
• RoCEv2 encapsulates data chunks in IP/UDP packets
• UDP doesn't have a native congestion control mechanism
• RoCEv2 uses the Data Center Quantized Congestion Notification scheme, which relies primarily on two existing flow control techniques:
  • IP Explicit Congestion Notification (RFC 3168, 2001)
  • Priority Flow Control (802.1Qbb)

(Same example: flows summing to 300Gbps toward a receiver attached via a 100Gbps link)
Data Center Quantized Congestion Notification
• Neither IP ECN nor PFC alone can provide a valid congestion management framework
  • IP ECN signalling might take too long to relieve the congestion
  • PFC can introduce other problems, like head-of-line blocking and unfairness between flows
• The two of them together provide the desired result of lossless RDMA communications across Ethernet networks (this combination is called DCQCN)
• The requirements are:
  • Ethernet devices compatible with both techniques
  • Proper configurations applied
Explicit Congestion Notification
• ECN is implemented via QoS queuing policies leveraging WRED (Weighted Random Early Detection)
• Buffer utilization is constantly monitored; when the buffer goes above the low threshold, some packets get their ECN bits marked to 0b11. Only ECN-capable packets are marked
• If the buffer goes above the high threshold, then all ECN-capable packets are marked with 0b11

ECN field in the IP header (next to DSCP):
• 0b00 --> Non ECN-capable
• 0b01 --> ECN-capable
• 0b10 --> ECN-capable
• 0b11 --> Congestion Experienced

(Diagram: MAC | IP | UDP | RoCEv2 packet; buffer utilization against the WRED low and high thresholds)
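The marking behaviour can be illustrated with a small sketch. This is a simplified model of WRED-based ECN marking, not the exact Nexus ASIC algorithm: below the low threshold nothing is marked, between the thresholds ECN-capable packets are marked with a probability that ramps up to a maximum mark probability, and above the high threshold every ECN-capable packet is marked. The threshold and probability values mirror the configuration shown later in this appendix.

```python
import random

ECT = {0b01, 0b10}          # ECN-capable codepoints
CE = 0b11                   # Congestion Experienced

def wred_ecn_mark(buffer_kb, ecn_bits,
                  low_kb=150, high_kb=3000, max_mark_prob=0.07):
    """Return the (possibly remarked) ECN bits for one packet.

    Simplified WRED/ECN model: mark probability grows linearly between the
    low and high thresholds; above the high threshold everything ECN-capable
    is marked CE. Non-ECN-capable packets are never remarked here.
    """
    if ecn_bits not in ECT or buffer_kb < low_kb:
        return ecn_bits
    if buffer_kb >= high_kb:
        return CE
    mark_prob = max_mark_prob * (buffer_kb - low_kb) / (high_kb - low_kb)
    return CE if random.random() < mark_prob else ecn_bits

# Example: at a moderately full buffer only a fraction of packets get CE.
marked = sum(wred_ecn_mark(2000, 0b10) == CE for _ in range(10_000))
print(f"Marked {marked} of 10000 packets at 2000 kB occupancy")
```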
ECN In Action With RoCEv2
S:S1; D:R X; ECN:0b10

L1
Sender –S1

L2 Spine L4
Sender –S2 Receiver -R

L3
Sender – S3

200Gbps

100Gbps
ECN In Action With RoCEv2
S:S1; D:R X; ECN:0b10
50Gbps
L1
Sender –S1

50Gbps
L2 Spine L4
Sender –S2 Receiver -R
S:S1; D:R X; ECN:0b10

L3
Sender – S3

200Gbps

100Gbps
ECN In Action With RoCEv2

S:S1; D:R X; ECN:0b10


50Gbps
L1
Sender –S1

S:S2; D:R X; ECN:0b10 50Gbps 100Gbps


L2 Spine L4
Sender –S2 Receiver -R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b10

L3
Sender – S3

200Gbps

100Gbps
ECN In Action With RoCEv2
IMPORTANT: The status changes and actions in the next slides happen within nanoseconds

S:S1; D:R X; ECN:0b10


50Gbps
L1
Sender –S1

S:S2; D:R X; ECN:0b10 50Gbps 150Gbps


L2 Spine L4
Sender –S2 Receiver -R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b10
S:S3; D:R X; ECN:0b10 S:S3; D:R X; ECN:0b10

L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2
random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
50Gbps

L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b11
S:S3; D:R X; ECN:0b10

L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S2 X; CNP
CNP = Congestion Notification Packet
sent once every X msec
L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

50Gbps
L1
Sender –S1

25Gbps 125Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
S:S3; D:R X; ECN:0b11

L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

50Gbps
L1
Sender –S1

25Gbps 125Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
sent once every X msec
L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

25Gbps
L1
Sender –S1

25Gbps 75Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b10
S:S3; D:R X; ECN:0b10

L3
Sender – S3
25Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2
Considerations

• The latency between ECN marking and the subsequent throttling of the throughput rate could be significant
• CNP packets must be prioritized!
• While notifications are propagating, buffers might get fully saturated, and this will cause a tail drop
• This is why DCQCN combines ECN with PFC

(Chart: RoCEv2 queue buffer saturation, 0-120%)
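The sender-side reaction can also be sketched. The following is a heavily simplified, illustrative model of DCQCN-style rate control: cut the rate multiplicatively when a CNP arrives, then recover additively while no CNPs are seen. The constants (cut factor, recovery step) are arbitrary illustration values, not the actual DCQCN parameters implemented by real NICs.

```python
class SimpleDcqcnSender:
    """Toy DCQCN-like rate controller (illustration only, not the real algorithm)."""

    def __init__(self, line_rate_gbps=100.0, cut_factor=0.5, recover_step_gbps=5.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps
        self.cut_factor = cut_factor            # multiplicative decrease on CNP
        self.recover_step = recover_step_gbps   # additive increase per quiet interval

    def on_cnp(self):
        """Receiver reflected a CE mark back as a Congestion Notification Packet."""
        self.rate = max(1.0, self.rate * self.cut_factor)

    def on_quiet_interval(self):
        """No CNP seen during the last interval: recover toward line rate."""
        self.rate = min(self.line_rate, self.rate + self.recover_step)


sender = SimpleDcqcnSender()
for event in ["cnp", "cnp", "quiet", "quiet", "quiet"]:
    sender.on_cnp() if event == "cnp" else sender.on_quiet_interval()
    print(f"{event:>5}: rate = {sender.rate:.1f} Gbps")
```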
Priority Flow Control
• With PFC we can define a no-drop queue
• Every time that queue reaches a defined threshold, the almost-saturated device sends pause frames to the devices causing the buildup
• A device that receives a pause frame stops forwarding packets classified into that queue and places them in its own buffer
• The process repeats from there until it reaches the original senders; at that point they also temporarily stop sending packets
• By the time this happens, all the buffers in the network should be flushed and forwarding can start again

(Diagram: per-port queues 1-8; PAUSE/XOFF frames propagate hop by hop toward the senders, XON resumes forwarding)
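A toy queue model makes the xOFF/xON behaviour easier to see. The thresholds below are made-up illustration values; real switches derive them from per-port buffer accounting and headroom, and "resume" is actually a pause frame with zero quanta.

```python
class PfcQueue:
    """Toy no-drop queue that emits PFC pause/resume decisions (illustration only)."""

    def __init__(self, xoff_kb=500, xon_kb=300):
        self.occupancy_kb = 0
        self.xoff_kb = xoff_kb   # above this, ask upstream to pause
        self.xon_kb = xon_kb     # below this, allow upstream to resume
        self.paused_upstream = False

    def enqueue(self, packet_kb):
        self.occupancy_kb += packet_kb
        if not self.paused_upstream and self.occupancy_kb >= self.xoff_kb:
            self.paused_upstream = True
            return "send PFC pause (XOFF) toward the sender"
        return None

    def dequeue(self, packet_kb):
        self.occupancy_kb = max(0, self.occupancy_kb - packet_kb)
        if self.paused_upstream and self.occupancy_kb <= self.xon_kb:
            self.paused_upstream = False
            return "send PFC resume (XON) toward the sender"
        return None


q = PfcQueue()
for _ in range(6):                   # ingress faster than egress: buffer fills
    action = q.enqueue(100)
    if action:
        print(f"{q.occupancy_kb} kB: {action}")
for _ in range(4):                   # buffer drains while upstream is paused
    action = q.dequeue(100)
    if action:
        print(f"{q.occupancy_kb} kB: {action}")
```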
PFC and ECN
Joining Forces

As buffer utilization on the congested queue grows, the switch escalates its reaction:
1. Buffer crosses the WRED low threshold --> some ECN-capable packets are marked with CE
2. Buffer crosses the WRED high threshold --> all ECN-capable packets are marked with CE
3. Buffer crosses the xOFF threshold --> PFC pause frames are sent towards the source
4. Buffer drains back below the xON threshold --> PFC pause frames are no longer sent
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2
Note: Spine will also start marking packets
with ECN 0b11 before sending PFCs
50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
PFC
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2
Note: Every Switch will also start marking
packets with ECN 0b11 before sending PFCs
50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
PFC
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
PFC
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
PFC PFC
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
PFC
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

20Gbps
L1
Sender –S1

20Gbps 60Gbps
L2 Spine L4
Sender –S2 Receiver-R

L3
Sender – S3
20Gbps
200Gbps

100Gbps
ECN and PFC – What Each One Brings
RoCEv2 can leverage the use of both ECN and PFC to achieve its goals (i.e. lossless transport)
• ECN is an IP layer notification system. It allows the switches to indirectly inform the sources as
soon as a threshold is reached and let them slow down the throughput
• PFC works at Layer 2 and serves as a way to use the buffer capacity of switches in the data
path to temporarily ensure the no-drop queue is honoured. It effectively happens at each
switch, hop-by-hop, back to the source, giving the source time to react without dropping
packets
• ECN should react first; PFC acts as a fail-safe if the reaction is not fast enough
• In any case, the combination can help achieve the lossless outcome required by AI/ML traffic
• This collaboration of the two is called Data Center Quantized Congestion Notification (DCQCN)
• All Nexus 9000 CloudScale ASICs support DCQCN
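One way to picture the division of labour is a single buffer with escalating thresholds: the WRED/ECN thresholds sit lower than the PFC xOFF threshold, so marking fires first and pausing only kicks in if marking did not relieve the congestion fast enough. The numbers below are illustrative assumptions, not recommended settings.

```python
def dcqcn_actions(buffer_kb,
                  wred_low_kb=150, wred_high_kb=3000, xoff_kb=3500):
    """Return the reactions a DCQCN-style queue would take at a given occupancy.

    Illustrative only: the point is the ordering (ECN marking escalates before
    PFC pausing acts as the fail-safe), not the specific threshold values.
    """
    actions = []
    if buffer_kb >= wred_low_kb:
        actions.append("mark some ECN-capable packets CE")
    if buffer_kb >= wred_high_kb:
        actions.append("mark all ECN-capable packets CE")
    if buffer_kb >= xoff_kb:
        actions.append("send PFC pause frames upstream (fail-safe)")
    return actions or ["forward normally"]

for occupancy in (100, 1000, 3200, 3600):
    print(f"{occupancy:>5} kB -> {dcqcn_actions(occupancy)}")
```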
Alternatives to ECN with WRED
Approximate Fair Drop
• Nexus 9000 ASICs also implement advanced queuing algorithms that can avoid some non-optimal WRED results
• For example, WRED has no knowledge of which flows are consuming most of the bandwidth; ECN marking happens only based on probability
• AFD constantly tracks the amount of traffic exchanged per flow and divides flows into two categories:
  • Elephant flows: long and heavy, which will be penalized (ECN marked)
  • Mice flows: short and light, which will not be penalized (ECN marked)
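As a rough illustration of the elephant/mice distinction, the sketch below tracks per-flow byte counts and only makes flows above a size cutoff eligible for ECN marking. The 1 MB cutoff and the flow key are arbitrary choices for the example; the actual AFD implementation tracks fair rates in the ASIC hardware rather than simple byte counts.

```python
from collections import defaultdict

ELEPHANT_BYTES = 1_000_000   # arbitrary cutoff for this illustration

flow_bytes = defaultdict(int)

def afd_like_mark(flow_key, packet_bytes):
    """Return True if this packet should be eligible for ECN marking.

    Toy approximation of AFD's idea: only long/heavy (elephant) flows are
    penalized; short/light (mice) flows pass through unmarked.
    """
    flow_bytes[flow_key] += packet_bytes
    return flow_bytes[flow_key] > ELEPHANT_BYTES

# A bulk RDMA transfer quickly becomes an elephant; a short flow stays a mouse.
# (4791 is the RoCEv2 UDP destination port.)
for _ in range(2000):
    afd_like_mark(("10.0.0.1", "10.0.0.9", 4791), 1000)
print("bulk flow marked: ", afd_like_mark(("10.0.0.1", "10.0.0.9", 4791), 1000))
print("small flow marked:", afd_like_mark(("10.0.0.2", "10.0.0.9", 4791), 200))
```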
