
The Blueprint to Building

End-To-End Hybrid-Cloud
AI Infrastructure
Nick Geyer, Cisco Systems Inc.
Eugene Minchenko, Cisco Systems Inc.
BRKCOM-1008

#CiscoLive
Cisco Webex App
https://ciscolive.ciscoevents.com/ciscolivebot/#BRKCOM-1008

Questions?
Use Cisco Webex App to chat
with the speaker after the session

How
1 Find this session in the Cisco Live Mobile App

2 Click “Join the Discussion”

3 Install the Webex App or go directly to the Webex space

4 Enter messages/questions in the Webex space

Webex spaces will be moderated by the speaker until June 7, 2024.

BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 2
Agenda

• Introduction
• AI Fundamentals & Impacts on Infrastructure Design Decisions
• Training Infrastructure & Network Considerations for AI Environments
• Inferencing, Fine-Tuning, & Compute Infrastructure
• Sizing for Inferencing
• AI Infrastructure Automation & Cisco Validated Designs
• Future Trends and Industry Impacts of AI Infrastructure Demands
• Summary

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
AI sets a new standard for Infrastructure

Only 13% of data center management leaders say their network can accommodate AI computational needs.

AIOps: How can we harness all the data available to us to simplify data center operations?

Scale and Performance: Is our network AI-ready, with the ability to support data training and inferencing use cases?

Sustainability: How are we addressing corporate and regulatory sustainability requirements in our data center design?

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
What we know

Every organization's AI approach and needs are different:
Build the Model | Training
Optimize the Model | Fine-tuning & RAG
Use the Model | Inferencing

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 5
What we’re hearing from IT infra and operations

• Need consistency; avoid new islands of operations
• Optimize for utilization and efficiency in many dimensions: support multiple projects, leverage GPUs wisely, manage power and cooling needs, and handle lifecycle management
• Comprehensive security protocols and measures
• Support a rapidly evolving software ecosystem
• Manage cloud vs. on-prem vs. hosted models
• Straddle the training → fine-tuning → inferencing → repeat cycle
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 6
Cisco’s 2-Fold AI Strategy & Our Focus Today
AI In: Using AI to maximize YOUR experience with Cisco products. Develop AI tools across the Cisco portfolio that help manage networks more effectively:
• Delivering better results
• Providing intelligent guidance
• Providing better security
• Solving day-to-day challenges

AI On: Enabling YOUR infrastructure to support adoption of AI applications. Develop products that help accelerate YOUR adoption of AI for your business solutions:
• High-speed networking for AI training and inference clusters
• Flexible compute building blocks to build AI compute clusters
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 7
AI Fundamentals &
Impacts on
Infrastructure
Design Decisions
AI: Level Setting and Definition

AI sits at the intersection of data science and computer science and spans:
• Supervised learning, unsupervised learning, and reinforcement learning
• Generative Adversarial Networks (GANs)
• Transformer-based language models (ChatGPT, LLaMA 2, etc.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 9
AI Infrastructure Requirements
AI Infrastructure Requirements Spectrum (market profile from AI Innovator to AI Adopter):

Extensive model customization (hyperscalers and large enterprises)
• Custom foundation models or extensive fine tuning
• $10M+ infrastructure and resources; months of development
• Compute: 1,000-10,000 GPU clusters; Network: InfiniBand / Ethernet

Moderate model customization (large enterprise training); most GenAI projects are here
• Pre-trained model with RAG, P-tuning, and fine tuning
• $M+ infrastructure and resources; weeks of development
• Compute: 100-1,000 GPU clusters; Network: InfiniBand / Ethernet

Large production inferencing and AI model lifecycle
• Compute: 4-8 GPU/node; Network: InfiniBand / Ethernet

Smaller production inferencing, AI model lifecycle, and small-parameter training
• Compute: 2-4 GPU/node; Network: Ethernet

Low model customization (GenAI-as-a-service)
• Consumption model, $ per inference; fastest time to market
• Initial testing of pretrained models
• Compute: CPU only to 2 GPU/node; Network: Ethernet
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 10
LLM Training vs. Fine Tuning vs. Inferencing
Relative compute utilization by model stage, with an analogy:

• Inferencing: 1X. Analogy: answering biology questions ("What is the role of mitochondria?")
• Fine tuning: 10X. Analogy: learning biology: words, terms, concepts
• Training: 100X. Analogy: learning the English language: patterns of words
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 11
AI Maturity Model
Align customer capabilities to technology investment

Exploratory
• Business use for AI not yet defined
• Data culture to support AI not established
• Exec agenda for AI not a priority
• No AI processes or technologies in place for implementation
• No investment in infrastructure to support AI workloads

Experimental
• Formulated short-term AI strategy; proof-of-concept scenarios for quick wins
• Exec and board support for AI, but not across all lines of business; small data science skillset on staff
• Data advancement with policy and a degree of governance using point solutions
• Trial of AI-adjacent technologies with future budget allocation

Plan
• Defined standalone AI strategy; platform in place; dedicated AI budget
• Decentralized support across staff; adequate resources for early stages
• Data gathering and analytics on a centralized platform for a variety of use cases
• AI used for internal processes (billing automation, segment analysis)

Transform
• AI strategy based on a long-term roadmap for new services
• Framework defined to assure quality, format, and ownership
• Data available in real time for predictive analysis
• A centralized platform model with pre-integrated AI capabilities

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 12
Operationalizing AI/ML is not trivial
Everyone in your organization plays a critical role in a complex process

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 13
AI and Infrastructure Pipelines

Roles across the pipeline: Data Engineer, Data Scientist, and DevOps | SecOps | Infrastructure, over shared storage, compute, and network.

• Data Preparation: preparing structured or unstructured data to create a training data set for the model. High storage requirement for ETL and data cleansing, optimized for AI retrieval.
• Training and Customization: a selected model learns from the training data set and builds relationships. Compute intensive, often with GPU acceleration and a high-speed, low-latency network.
• Inference: when prompted, the model interprets new, unseen data and creates a response for the user based on its training. Lower compute, GPU acceleration, and network demands, though requirements can increase with scale.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 14
Framework and Common Software
Pipeline stages: Data Ingest → Data Preprocessing → Model Training → Model Validation → Model Deployment

• Data ingest and preprocessing (IO intensive): data engineering, data visualization, feature identification
• Model training (compute/GPU intensive): weekly retraining, model management
• Model validation: model ranking and validation
• Model deployment (latency sensitive, inferencing): production deployment, establishing the feedback loop

Underlying layers: data ingestion, the AI/ML/DL framework, and the inferencing end point.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 15
The need for flexible AI acceleration
1. Inference for real-time workloads. Example: an AI-enabled video conferencing app handles mixed video and audio streams, using general-purpose compute (CPU) for tasks such as group chat, screen share, and recording, and AI acceleration (GPU) for real-time generative AI such as transcription, meeting summarization, or translation.

2. For the diversity of AI workloads, GPU counts and network designs scale with the stage: data ingest and preparation and edge inferencing run on roughly 0-2 GPUs, data center inferencing on 1-4, fine tuning on 4-64, and large foundational training on 64-10K+ GPUs. Network considerations shift from a shared fabric at the small end to dedicated fabrics for large training clusters.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 16
Revolutionizing AI workloads with 5th Gen Intel
Xeon Scalable Processors
High Performance Features
• Intel AMX with built-in AI accelerator in each
core
• Accelerated computations and reduced memory
bandwidth pressure
• Significant memory reductions with BF16/INT8

Enhanced System Capabilities


• Larger last-level cache for improved data locality
• Higher core frequency and faster memory with
DDR5
• Intel AVX-512 for non-deep learning vector
computations

Software Optimization
Kubernetes (Red Hat OpenShift)
• Software suite of optimized open-source
frameworks and tools
• Intel Xeon optimizations integrated into popular
deep learning frameworks

TCO Benefits and Compatibility


• Lower operational costs and a smaller environmental
footprint
• Available on UCS X-Series, C240, C220 platforms

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 17
Will Organizations Build
Large Clusters with over
1000 GPUs?

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 18
Inference and Fine Tuning

https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 19
99% of customers will not
be building infrastructure
to train their own LLMs

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 20
Many customers will build
GPU clusters in their existing
DCs for training use case
specific ”smaller” models, for
fine tuning existing models,
and to do inferencing or
generative AI.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 21
Sample Large Language Model use Cases
• Summarization: LLMs are highly effective in text summarization tasks, in areas such as academic research, business report summaries, legal documents, education materials, emails, etc.
• Translation: language translation is a key use case for LLMs in areas such as travel & tourism, legal, emergency services, education, and real-time translation.
• Dialog: examples of use cases for LLM chatbots include customer service, personal assistants, tech support, and news and information.
• Text Generation: use of LLMs for content creation, marketing, documentation, business communication, product documentation, etc.
• Sentiment Analysis: use LLMs to determine sentiment in areas such as comments, responses, content moderation, feedback, and market research.
• Code Generation: LLMs can be used to increase coding productivity, with tools such as Copilot, in areas such as web development, data analysis, education tools, etc.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 22
Enterprise Considerations to Define Requirements
Workload questions:
• What is the use case?
• Am I training? Fine tuning? Inferencing? Using RAG?
• How much data am I training on?
• How many models am I training?
• Am I using private data?
• Who is responsible for management?

Requirement dimensions:
• Cost
• Accuracy
• Model size
• User experience (response time)
• Data fidelity
• Concurrent users/inputs

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 23
Where can this be run?
Enterprises can choose where any model should be trained. There are primarily two options:

On premises
• Always available for the enterprise to use
• Flexibility for a large enterprise to leverage the same cluster for different functions
• Data is stored locally (data sovereignty)

Public clouds
• Provides flexibility; pay for what you need
• Cost grows with more data and training
• Challenges: cost of egress data from the cloud, latency, and lock-in

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 24
Smart Cloud, not Cloud First
Example: on-premises data center for a quantitative trading firm, London, UK (12,000 GPUs)

• Example hyperscaler cost model: cloud provider (Lambda Labs) at $1.99 per H100 GPU-hour. Potential annual cost: ~$210 million.
• Example on-prem cost model: colo, servers, storage, and network. Potential annual cost: ~$130 million per year (over 3 years).
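A quick back-of-the-envelope check of the hyperscaler figure (a rough sketch: the $1.99/hour rate and continuous 24x7 usage come from the slide, everything else is simple arithmetic):

# Rough annual-cost check for the 12,000-GPU cloud example above.
GPUS = 12_000
RATE_PER_GPU_HOUR = 1.99          # USD per H100-hour, as quoted on the slide
HOURS_PER_YEAR = 24 * 365

annual_cloud_cost = GPUS * RATE_PER_GPU_HOUR * HOURS_PER_YEAR
print(f"Cloud: ~${annual_cloud_cost / 1e6:.0f}M per year")   # ~$209M, matching the ~$210M figure

# The on-prem figure (~$130M/year over 3 years) bundles colo, servers, storage,
# and networking, so it is not directly derivable from a single hourly rate.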

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 25
Bringing it all together
A helicopter view of an AI Deployment Journey
1. Deploy AI-ready infrastructure (Cisco Validated Designs, Cisco Intersight, FlexPod AI, FlashStack AI, HCI AI)
2. Install common AI models from industry repositories (e.g., NVIDIA AI Enterprise, NGC)
3. Prep and inject data to fine tune the model (customer enterprise data)
4. Periodic model updates and infrastructure scaling as required
5. Deploy the application for inferencing, at the core and at the edge

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 26
AI Training
Infrastructure &
Network
Considerations for
AI Environments
Breaking-down Machine Learning – The Process

Training data → training infrastructure (dataset + algorithm) → trained model (a function with weighted parameters) → inference infrastructure → output (decision, trend, prediction, classification, generative content, recommendation). New/live data feeds inference, feedback is collected on the results, and retraining happens as required.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 28
Architecting an AI/ML training cluster - Considerations

AI models and applications consume massive amounts of data, and the data is constantly growing, so the infrastructure faces many challenges in growing at the same scale as the data.

The AI/ML lifecycle (training, inferencing, feedback, and retraining) is measured by job completion time, which depends on:
• Scalability
• Congestion management
• Low latency
• High bandwidth
• No traffic drops
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 29
Training and Inference Network Behaviors
(Radar-chart comparison of LLM training, ranking training, LLM inference, and ranking inference across five dimensions: network bandwidth, network latency sensitivity, memory bandwidth, memory capacity, and compute.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 30
AI Networking: RDMA
Remote Direct Memory Access
Benefits of RDMA
• Low latency and CPU overhead
• High network utilization
• Efficient data transfer
• Supported by all major operating systems

Zero Copy Networking

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 31
Remote Direct Memory Access (RDMA)….InfiniBand
• RDMA allows AI/ML nodes to exchange data over a network by accessing the bytes directly in RAM (direct memory-to-NIC communication between system/GPU memory and the RDMA NIC)
• Latency is very low because the CPU and kernel can be bypassed
• RDMA data was natively exchanged over InfiniBand fabrics
• Later, the RoCEv2 (RDMA over Converged Ethernet) protocol allowed the exchange over Ethernet fabrics, which requires a non-blocking, lossless Ethernet transport with ECN and PFC

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 32
AI Networking: RoCE v1/RoCE v2 Protocol Stacks
RDMA Over Converged Ethernet

RoCE v1
• Ethernet link layer protocol
• Dedicated EtherType (0x8915)
• Can be used with or without a VLAN tag

RoCE v2
• Internet layer protocol; can be routed
• Dedicated UDP port (4791)
• The UDP source port field is used to carry an opaque flow identifier
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 33
RoCEv2: PFC and ECN Together for Lossless Transport
How does it work?
ECN is a Layer 3 congestion avoidance protocol: an IP-layer notification system that lets switches indirectly tell traffic sources to slow down. WRED thresholds are set low in the no-drop queue so that congestion is signalled early with CNPs, giving the endpoints enough time to react.

PFC is a Layer 2 congestion avoidance protocol. PFC thresholds are set higher than the ECN thresholds: when oversubscription fills buffers too quickly for ECN to react, PFC kicks in and mitigates the congestion.
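As a purely illustrative sketch of why the ECN (WRED) marking thresholds sit below the PFC pause threshold, the toy queue model below marks packets early and only asserts PFC as a backstop (the threshold values are invented for illustration, not Nexus defaults):

# Toy no-drop queue: ECN-mark early, assert PFC pause only as a lossless backstop.
# Threshold numbers are illustrative only, not recommended or default values.
ECN_MIN = 100      # WRED minimum threshold (cells): start probabilistic marking here
ECN_MAX = 300      # WRED maximum threshold: mark every packet beyond this
PFC_XOFF = 800     # PFC pause threshold: deliberately higher than the ECN range

def queue_action(depth_cells: int) -> str:
    if depth_cells >= PFC_XOFF:
        return "send PFC pause upstream (lossless backstop)"
    if depth_cells >= ECN_MAX:
        return "ECN-mark all packets (receiver answers with CNPs)"
    if depth_cells >= ECN_MIN:
        frac = (depth_cells - ECN_MIN) / (ECN_MAX - ECN_MIN)
        return f"ECN-mark ~{frac:.0%} of packets"
    return "forward normally"

for depth in (50, 150, 400, 900):
    print(depth, "->", queue_action(depth))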

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 34
Data Center Quantized Congestion Notification
• Neither IP ECN nor PFC alone provides a complete congestion management framework
• IP ECN signalling might take too long to relieve the congestion
• PFC could introduce other problems, such as head-of-line blocking and unfairness between flows
• Used together, they provide the desired result of lossless RDMA communications across Ethernet networks (this combination is called DCQCN)
• The requirements are:
  • Ethernet devices compatible with both techniques
  • Proper configurations applied

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 35
AI/ML Flow Characteristics (Training Focused)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 36
Bringing Visibility to AI workloads

With the granular visibility provided by Cisco Nexus Dashboard Insights, the network administrator can observe drops and congestion hot spots, then tune thresholds until the hot spots clear and packet drops stop under normal traffic conditions. This is the first and most important step to ensure that the AI/ML network copes effectively with regular traffic congestion occurrences.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 37
Monitoring These Events
• DCQCN leaves fabric congestion management in a self-healing state
• Still it is important to keep it under
control:
• Frequently congested links can be
discovered
• QoS policies can be tweaked with a
direct feedback from the monitoring
tools
• Nexus ASICs can stream these metrics
directly to Nexus Dashboard Insights
• NDI will then collect, aggregate and
visualize them all to provide insights to
the operations team

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 38
Nexus Dashboard Insights – Congestion Visibility

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 39
Designing a Network for AI success
Design goals:
• Dedicated network
• Non-blocking, lossless fabric
• High throughput, no oversubscription
• Low jitter, low latency
• Clos topology
• Visibility is key!

Why it matters:
• Stalled or idle jobs waste expensive resources and time; on average 25% of jobs fail
• The goal is to optimize job completion time

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 40
Do I need a backend network?

The frontend network (managed with Nexus Dashboard, at 10G | 25G | 50G | 100G | 400G | 800G) connects storage, compute, GPU, DPU, and FPGA resources. The backend network must be lossless and high-throughput, with low jitter and low latency.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 41
Cisco Nexus HyperFabric AI Cluster
On-prem AI infrastructure in partnership with NVIDIA: pods of plug-and-play data center fabrics that democratize AI infrastructure and provide visibility into the full AI stack.

• Unified stack, including NVIDIA AI Enterprise (NVAIE)
• AI-native operational model with cloud-managed operations
• High-performance Ethernet on Cisco 6000 Series switches, built on Cisco Silicon One and optics innovations
• NVIDIA GPU servers with NVIDIA BlueField-3 DPU/NICs and VAST storage

A solution that will enable you to spend time on AI innovation, not on IT.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 42
A Simplified Backend Network for AI Environments
Cisco Nexus HyperFabric Use Cases

Cloud SaaS controller:
• Single global UI for all owned fabrics; single global API endpoint
• Underlay and lifecycle automation
• Ease of use for IT generalists; self-service for fabric tenants
• Simple to deploy and manage; scalable, AI-ready Ethernet fabric

Use cases:
• Build new data centers: start small and grow fabrics (1+)
• AI/ML/HPC fabrics and AI clusters
• Extend data centers: plug-and-play deployment; easily expand to data center edge/colo; data center anywhere with the cloud controller
• Downsize the data center tooling footprint: small fabrics of 1-2 switches
• Manage multiple customer data centers: managed from the cloud, remote-hands assistance, planning/design tools to help build the rollout
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 43
Building High-performance AI/ML Ethernet Fabrics
Maximizing customer choice and options
Cisco Nexus HyperFabric AI Cluster (Cisco cloud managed as a service, full stack; for enterprise, public sector, and commercial)
• Turnkey AI pod with HyperFabric-managed servers (BMC), NICs, and switches
• Converged Ethernet infrastructure; greenfield deployments only
• 400G to 800G on Cisco 6000 (Silicon One) switches
• FCS target CY25

Nexus 9000 with Nexus Dashboard (private cloud managed, interoperable; for enterprise, public sector, commercial, service providers, and Tier-2 web / AI-as-a-service)
• General-purpose AI multi-pod fabric; simplified network operations with Nexus Dashboard
• CVDs for converged Ethernet infrastructure; greenfield and brownfield deployments
• 100G to 400G to 800G on Nexus (Cloud Scale and Silicon One) switches
• Shipping

Cisco 8000 (customizable solution, BYO management, SONiC / BYO NOS; for Tier-2 web / AI-as-a-service and hyperscalers)
• Cisco validated SONiC or community sourced; customer assembled and operated
• ECMP (shipping) and Scheduled Ethernet (FCS target 2H 2024) options; greenfield deployments
• 400G to 800G on Silicon One switches

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 44
Building an AI Workload Pod for Training
• Backend network for training: backend spines and leafs, alongside a frontend ToR and front-end compute fabric
• 32 rack servers split across 2 racks per pod
• Scale up to 30 pods per spine: 960 servers, 1,920 GPUs
• Clustered GPUs with direct memory access over RoCE-enabled NICs; full RoCEv2 support on the compute
• Link speeds in the topology: 100 Gbps and 400 Gbps
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 45
GPU Intensive Applications: converged infrastructure example

(Topology: Cisco UCS C245 M6 rack servers with NVMe SSDs and SATA HDDs plus a Cisco C240 M7 with Mellanox CX7 2x200G NICs, a Cisco Nexus 1/10G copper management switch, Cisco Nexus 93600CD-GX, 9336C-FX2, and 9364D-GX2A switches, and a NetApp A800 array; 100Gb front-end network and a back-end lossless, non-blocking 400G network.)

Performance testing: linear scalability demonstrated through benchmark tests on real-life model simulations, showcasing consistent performance even with varying dataset sizes:
• Weather simulation (MiniWeather)
• Nuclear engineering (Minisweep)
• Cosmology (High Performance Geometric Multigrid)

Accelerated deployment:
• Centralized management and automation
• NVIDIA HPC-X Software Toolkit setup and configuration
• NetApp DataOps Toolkit to help developers and data scientists perform numerous data management tasks

CVD link: Cisco UCS C-Series Rack Server and NetApp AFF A400 storage array connected to a Cisco Nexus 93600CD-GX leaf switch with a Layer 2 configuration for single-rack testing
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 46
The Blueprint For Today
Built to accommodate 1024 GPUs along with storage devices

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 47
Inferencing,
Fine-Tuning, &
Compute
Infrastructure
Model Inferencing Use Cases
Productization Phase

• Self-driving vehicles
• Face recognition and computer vision
• Conversational agents
• Machine translation
• Analysis of medical images
• Content generation (images/video/voice)
• Recommender systems

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 49
Large Language Models (LLMs)
Limitations for enterprise use

Hallucination: can make things up; always has an answer
Sources: where did the information come from?
Outdated: models may be stale as soon as they are released
Customize: cannot personalize or use more current data
Update: cannot edit the model to remove or change data

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 50
Training LLMs
Resource-Intensive and costly

Large Language Models are pre-trained on a large corpus of publicly available unlabeled data. Training takes thousands of GPUs over a span of months, and periodic re-training is required to stay up to date.

GPT-3 Large (175B parameters)
• Training set tokens: 300B; vocabulary size: ~50k
• Number of GPUs: 10k x V100; training time: one month

Llama (65B parameters)
• Training set tokens: ~1-1.3T; vocabulary size: ~32k
• Number of GPUs: 2,048 x A100; training time: 21 days

Building LLMs from scratch is cost-prohibitive for the average enterprise.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 51
Use Foundational Models
Starting point for most Enterprises

BERT
GPT
Llama Customize or
integrate directly for
Foundational Download Mistral AI inferencing in
models (FM) Stable Diffusion enterprise
Cohere applications
Model Size Claude
LLMs <100B BLOOM
Other Generative <1B …
Predictive <100M Pre-trained,
general-purpose
models

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 52
LLM, Fine-Tuning and RAG?

(Diagram comparing three patterns: a base LLM built from its training set, which generates answers when the user asks; a fine-tuned LLM additionally trained on a private dataset; and RAG, where a retriever pulls relevant content from a real-time private dataset and the user's question plus that retrieved context are passed to the LLM to generate the answer.)
RAG: Retrieval-Augmented Generation

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 53
Business value of LLM + RAG
• RAG helps in mitigating hallucination or generation of incorrect or
misleading information.

• Fine-tuning a pre-trained language model can be a resource-intensive


process. RAG offers a cost-effective alternative.

• RAG generates context-aware responses by retrieving relevant data before crafting a response, which leads to clearer and more meaningful interactions with users.

• One of the major concerns with AI models is their "black box" nature: we are unsure which sources were used to generate content. When RAG generates a response, it references the sources it used, enhancing transparency and instilling trust in users.
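To make the retrieve-then-generate flow concrete, here is a minimal, dependency-free sketch of the RAG pattern; the retriever is a naive word-overlap scorer standing in for a real embedding model and vector database, and all names are illustrative rather than any specific product API:

# Minimal RAG sketch: retrieve relevant private documents, then prompt the LLM
# with that context. A real deployment would use an embedding model plus a
# vector database instead of this naive word-overlap scoring.
knowledge_base = {
    "ucs-x": "Cisco UCS X-Series is a modular compute system with X-Fabric for GPU expansion.",
    "roce": "RoCEv2 carries RDMA over routed Ethernet and relies on ECN and PFC for lossless transport.",
}

def retrieve(question: str, top_k: int = 1) -> list[str]:
    q_words = set(question.lower().split())
    ranked = sorted(knowledge_base.values(),
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What does RoCEv2 need for lossless transport?"))

The assembled prompt is what gets sent to the LLM serving endpoint, and the retrieved passages double as citable sources for the answer, which is where the transparency benefit above comes from.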

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 54
IT Infrastructure for Enterprise GenAI
High-level Architecture

Generative AI and predictive AI use cases (app + model) each run an ML model with its ML frameworks, tools, and runtimes, governed by MLOps. Everything sits on Kubernetes and a shared infrastructure layer: compute (CPU + GPU), network, block/file storage, and an object store.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 55
Scale fine-tuning and inferencing compute
from the data center to the edge

Scale in the enterprise and at the edge with UCS X-Series, managed through Intersight:
• Optimize for smaller scale: decrease components, operating costs, and management complexity
• Drive sustainable outcomes: reduce power, cooling, and physical footprint
• Run any workload: from transactions to AI inferencing

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 56
Simpler, Smarter, More Agile

Fabric-based adaptive computing with innovative stateless server configuration: server profiles bundle storage, HBA, NIC, and network policies and are applied to bare metal.
• Scale seamlessly to changing business needs
• Faster deployment of applications, with greater control and flexibility
• Infrastructure shapes itself to your specific workloads
• Better performance, resiliency, and high availability
• Less cost and complexity with fewer components

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 57
Modular architecture
Ideal for AI component evolution

• Investment preservation: the convenience to upgrade or replace individual parts without overhauling the entire system reduces cost and ensures that initial investments remain valuable over time.
• Multi-vendor support: components can be selected from different vendors. The best example is accelerators, where you can move from AMD or Intel (AMX) CPUs to NVIDIA A100 and then H100 GPUs, or to AMD GPUs in the future.
• Management and upgradability: keep your technology stack current, adaptable, and competitive. Cisco Intersight is a SaaS-based platform that provides cloud-scale management from data center to edge.

The X-Series modular system decouples the lifecycles of CPU, GPU, memory, storage, and fabrics, providing a perpetual architecture that efficiently brings you the latest innovations:
• Modularity on X-Series: cloud-powered composability with Cisco Intersight
• X-Fabric: flexible GPU acceleration across server nodes (PCIe nodes 1-4)
• No backplane or cables = easy upgrades

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 59
UCS X-Series for AI Workloads
Expandability and Flexibility

1. No backplane
2. X-Fabric
3. Server disaggregation (PCIe Node)
(Diagram: PCIe Nodes 1-4 connected through UCS X-Fabric Technology.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 60
X440P PCIe Node
• Two different types; provides 2 or 4 PCIe slots per node
• Connects via X-Fabric to the adjacent compute node
• Dedicated power and cooling for the GPUs (no disks or CPUs blocking airflow)

(Diagram: four X440P PCIe nodes in a chassis hosting GPUs 1-12.)

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 61
Riser Style A
• Up to two dual width A16,
A40, L40, L40S, A100 or H100
(NVL*), Flex170, MI210* GPUs
• One x16 per riser = 1 per CPU
• No mixing of GPUs

* planned

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 62
Riser Style B
• Up to 4 single width
T4/L4/Flex140 GPUs
• Two x8 per riser = 2 per CPU
• No mixing of GPU models

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 63
X210c/X215c Blade with GPU options
Additional Front Card GPU Options

• Up to six U.2 NVMe drives or up to two GPUs on the front of the node
• Slides into the front of an X210c/X215c compute node (Intel or AMD CPU)
• Can be used with a PCIe node to provide up to 6 GPUs per host

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 64
Cisco GPU-accelerated platforms offering
X-Series C-Series Rack Servers

X-Series: up to 24x HHHL GPUs or 8x FHFL GPUs per X9508 chassis. X210c/X215c M6/M7/M8 2S blades take mezzanine GPUs (NVIDIA T4, L4, Intel Flex 140); pairing a blade (X210c or X410c) with the X440p PCIe node adds larger GPUs such as NVIDIA T4, A16, A40, A100-80, H100-80, H100-NVL, L4, L40, L40S, AMD MI210, and Intel Flex 140/170.

C-Series rack servers: the C220/C225 and C240/C245 families (M6 and M7 today, M8 planned for 2H'24) support NVIDIA T4, A10, A16, A30, A40, A100-80, H100-80, H100-NVL, L4, L40, L40S, AMD MI210, and Intel Flex GPUs in various per-model counts.

Plans are subject to change. Please refer to the server specifications and HCL for detailed configuration support:
C-Series: https://www.cisco.com/c/en/us/support/servers-unified-computing/ucs-c-series-rack-servers/series.html#~tab-documents
X-Series: https://www.cisco.com/c/en/us/support/servers-unified-computing/ucs-x-series-modular-system/series.html#~tab-documents
UCS HCL: https://ucshcltool.cloudapps.cisco.com/public/

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 65
Sizing for
Inferencing
LLM Inference Performance
How many GPUs do I need for inference?

• Use case: determines the model and the minimum GPU; the CPU will also have an impact
• Model architecture: impacts compute requirements per inference token (TFLOPs)
• Context length: depends on the model; use an average token size or vary token lengths in tests
• GPU: depends on its performance (TFLOPS); use tests to verify actual performance

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 67
LLM Inferencing Performance
Objective and Subjective

Latency (example prompt: "What is Cisco UCS?")
• Time to first token
• Total generation time
• Time to second/next token

Throughput (43 output tokens in the example)
• Requests per second, dependent on concurrency and total generation time
• Tokens per second is the standard measure (> 30 per second)

User experience is a combination of low latency, throughput, and accuracy.
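Both headline metrics fall straight out of request timestamps; a small sketch (the timing values are hypothetical):

# Deriving the latency and throughput metrics above from raw timestamps.
request_sent = 0.00            # seconds (hypothetical timings)
first_token_at = 0.35
last_token_at = 1.75
output_tokens = 43             # as in the example prompt above

time_to_first_token = first_token_at - request_sent           # 0.35 s
total_generation_time = last_token_at - request_sent          # 1.75 s
tokens_per_second = output_tokens / (last_token_at - first_token_at)

print(f"TTFT {time_to_first_token:.2f}s, total {total_generation_time:.2f}s, "
      f"{tokens_per_second:.1f} tokens/s")   # ~30.7 tokens/s, right at the usability threshold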

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 68
LLM Inference – Estimating Memory
How much memory does my model need?

For a given precision: FP32, FP16, TF16…

• Model Memory
Precision in Bytes x # of parameters (P)

Example: Llama2 – 13B parameters

• Model Memory:
13 billion x 2Bytes/parameter = 26GB

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 69
LLM Inference – Estimating Memory
How much memory does my model need?

For a given precision: FP32, FP16, TF16…

• Memory (Inference)
Model Memory + ~20% overhead

Example: Llama2 – 13B parameters

• Memory (Inference):
26GB + 20% overhead = 31.2GB
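The two rules of thumb fold into a small helper (a sketch, using the slide's ~20% inference overhead and decimal gigabytes):

# Rule-of-thumb inference memory: parameters x bytes-per-parameter, plus ~20% overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def inference_memory_gb(params_billions: float, precision: str = "fp16",
                        overhead: float = 0.20) -> float:
    model_gb = params_billions * BYTES_PER_PARAM[precision]   # 1B params x 2 bytes = 2 GB at FP16
    return model_gb * (1 + overhead)

print(inference_memory_gb(13))   # Llama2-13B at FP16 -> 31.2 GB, as above
print(inference_memory_gb(70))   # a 70B model at FP16 -> 168 GB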

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 70
LLM Inference - GPU Estimation
Which GPU do I use?

Based on model memory, a 13B parameter model (~31 GB with overhead) can be loaded by any GPU with at least 32 GB. Similarly, a 70B parameter model (~168 GB) would require roughly two A100-80 GPUs (168 GB / 80 GB). A small helper sketch follows the table.

GPU Model | Memory (GB) | Memory Bandwidth (GB/s) | FP16 Tensor Core (TFLOP/s)
H100 | 80 | 2000 | 756
A100 | 80 | 1935 | 312
L40S | 48 | 864 | 362
L4 | 24 | 300 | 121
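Combining the memory estimate with the table gives a first-pass GPU count (a sketch; real sizing should also allow for KV cache, batch size, and throughput targets):

# Fractional GPU count from memory footprint alone; round up in practice.
GPU_MEMORY_GB = {"H100": 80, "A100": 80, "L40S": 48, "L4": 24}

def gpus_for_model(memory_gb: float, gpu: str) -> float:
    return memory_gb / GPU_MEMORY_GB[gpu]

print(gpus_for_model(31.2, "L40S"))   # 13B at FP16 -> 0.65, so one 48 GB L40S fits it
print(gpus_for_model(168, "A100"))    # 70B at FP16 -> 2.1, i.e. roughly two A100-80s as above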

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 71
LLM Inference - Methodology
How many GPUs do I need for inference?
For a given model and inferencing runtime, start with enough GPUs to load the model based on memory sizing, then:

• Vary concurrent inference requests and measure throughput and latency metrics for a given token length (context)
• Vary batch sizes and measure throughput and latency; this maximizes compute utilization for non-real-time use cases
• Add a second GPU and repeat the concurrent-request and batch-size tests (as needed)
• Monitor GPU compute and memory utilization, along with inferencing performance, across all tests
• Select a configuration that optimally balances latency, throughput, and cost

Sample tool: https://github.com/openshift-psap/llm-load-test
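A stripped-down version of that methodology, in the spirit of the llm-load-test tool linked above (the endpoint URL and payload shape are placeholders for whatever serving runtime you deploy):

# Concurrency sweep against an HTTP inference endpoint (sketch).
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests   # assumes a reachable HTTP serving runtime; adjust the payload to match it

ENDPOINT = "http://inference.example.local/v1/generate"      # placeholder URL
PAYLOAD = {"prompt": "What is Cisco UCS?", "max_tokens": 128}

def one_request() -> float:
    start = time.perf_counter()
    requests.post(ENDPOINT, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start

def sweep(concurrency: int, total_requests: int = 32) -> None:
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total_requests)))
    wall = time.perf_counter() - t0
    print(f"concurrency={concurrency:3d}  p50={statistics.median(latencies):.2f}s  "
          f"throughput={total_requests / wall:.1f} req/s")

for c in (1, 2, 4, 8, 16):
    sweep(c)   # watch GPU compute/memory utilization alongside these numbers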

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 72
Sample Performance Comparison with Nvidia A100
Llama 2 7B | NV-GPT-8B-Chat-4k-SFT | Llama2 13B

Llama 2 – 7B
Input tokens Length: 128 and output Tokens Length: 20

Batch Size | GPUs | Average Latency (ms) | Average Throughput (sentences/s)
1 | 1 | 241.1 | 4.1
2 | 1 | 249.9 | 8.0
4 | 1 | 280.2 | 14.3
8 | 1 | 336.4 | 23.8
1 | 2 | 197.1 | 5.1
2 | 2 | 204.1 | 9.8
4 | 2 | 230.2 | 17.4
8 | 2 | 312.6 | 25.5
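A quick sanity check on the table: throughput is essentially batch size divided by average latency, which also shows why the second GPU (tensor parallelism) buys lower latency rather than doubled throughput:

# Re-derive the reported throughput column from batch size and latency.
rows = [  # (batch, gpus, avg_latency_ms, reported_sentences_per_s)
    (1, 1, 241.1, 4.1), (2, 1, 249.9, 8.0), (4, 1, 280.2, 14.3), (8, 1, 336.4, 23.8),
    (1, 2, 197.1, 5.1), (2, 2, 204.1, 9.8), (4, 2, 230.2, 17.4), (8, 2, 312.6, 25.5),
]
for batch, gpus, latency_ms, reported in rows:
    derived = batch / (latency_ms / 1000.0)
    print(f"batch={batch} gpus={gpus}  derived={derived:.1f}  reported={reported}")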

Optimized price-to-performance ratio with FlashStack AI.
AI Infrastructure
Automation
Policy based compute to scale operations

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 75
Integrate with DevOps to accelerate
AI application delivery
Dev and DevOps teams work alongside infrastructure and operations teams across the data center, colo, and edge.
• Accelerate CI/CD processes and extend infrastructure-as-code (IaC) workflows by integrating Intersight into your DevOps toolchains
• Simplify lifecycle management with integrated infrastructure and workload orchestration tools
#CiscoLive © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public
Day 0/2: Operations (Full Stack Bare Metal)
Operational challenges:
• Lack of visibility across multiple infrastructure and cluster deployments
• Difficulty gathering compliance and resource audits
• Capacity planning and inventory expansion

The hybrid cloud console (Intersight SaaS, or the optional on-prem Intersight Private Appliance for air-gapped use cases) gives hybrid-cloud and Kubernetes admins:
• Telemetry: infra health, alerts, alarms, and security advisories, plus optimization guidance
• Infrastructure capacity management for expansion; inventory (firmware, network, storage)
• Add/remove bare metal nodes and cluster lifecycle (upgrade/downgrade) across edge sites running UCS-X
• Field alerts, alarms, and security advisories; telemetry, metrics, and actionable insights; observability; hardware compatibility; RBAC and multi-tenancy
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 77
One-click OpenShift cluster deployment

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 78
AI project deployment workflow example
1. Deploy AI-ready infrastructure (Cisco Intersight, Cisco Validated Designs)
2. Deploy Red Hat OpenShift and other resources (image registry, pipelines, artifacts repo, model repo)
3. Deploy two projects (workbenches/namespaces) in OpenShift AI
4. Load an LLM from Hugging Face and explore/evaluate it
5. Save and upload the model to the model repo
6. Deploy a model serving runtime
7. Deploy the LLM for inferencing (model delivery pipeline)
8. Deploy a vector database for RAG (with the Attu open-source GUI)
9. Ingest enterprise data, including unstructured data, into the vector database (a sketch of this step follows the list)
10. Deploy the Q/A chatbot app using the enterprise knowledge base
11. Deploy the GUI front end for the Q/A chatbot
12. Deploy the enterprise Q/A chatbot for inferencing at the core and edge (application inferencing pipeline, for demo purposes)
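Step 9, the knowledge-base ingest, typically looks like the sketch below; the chunking is plain Python, while embed() and vector_db.insert() are placeholders for whatever embedding model and vector database the deployment uses, not a specific product API:

# Sketch of the vector-database ingest step (step 9 above).
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split a document into overlapping chunks so retrieval keeps local context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(documents: dict[str, str], embed, vector_db) -> None:
    # embed() returns a vector for a text chunk; vector_db is a placeholder client.
    for doc_id, text in documents.items():
        for n, piece in enumerate(chunk(text)):
            vector_db.insert(id=f"{doc_id}-{n}", vector=embed(piece), text=piece)

# The Q/A chatbot (steps 10-12) then embeds each user question, queries the vector
# database for the nearest chunks, and passes them to the served LLM as context.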

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 79
Streamline and scale the model delivery lifecycle using MLOps, for scalability, governance, efficiency, reliability, and adaptability:
• Prepare data: gathering and preparing data for AI
• Experiment / tune the model: apply scientific rigor to understand the data and build or customize the model
• Serve and integrate with the app: the model becomes available for production inferencing
• Monitor / maintain the model: track model quality, metrics, and drift, then iterate

The pace of AI/ML technology shifts requires a strong foundation to adapt.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 80
Cisco Validated
Designs (CVD’s)
for AI
Cisco Validated Designs (CVD)

• Accelerate: ready-to-go solutions for faster time to value
• Expert guidance: CVDs provide everything from system designs to implementation guides and Ansible automation
• Less risk: reduce risk with tested architectures for standardized, repeatable deployments
• Cisco unified computing TAC support: a single point of contact for the solution; Cisco will coordinate with partners as needed to resolve issues
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 83
Cisco Compute coverage: UCS only, FlexPod, FlashStack, Nutanix

Explore Cisco validated AI demos showcasing a broad spectrum of AI technologies and practices ready to transform your business:
• Large Language Models (LLMs): discover the power of LLM inferencing as it seamlessly processes and generates human-like text in real time (NVIDIA AI Enterprise, Hugging Face, NVIDIA TRT-LLM, text-to-text Gen AI)
• Retrieval Augmented Generation (RAG): experience an enterprise-grade RAG chatbot delivering responses tailored to your enterprise-specific content (NVIDIA AI Enterprise, NVIDIA NIM, vector database, text-to-text Gen AI)
• MLOps: explore the cutting edge of MLOps, where the efficiency of machine learning workflows meets the rigor of operational excellence (Red Hat OpenShift AI, NVIDIA AI Enterprise, Hugging Face, LangChain, Mistral, vLLM, Gen AI)
• Image synthesis: immerse yourself in the innovative world of text-to-image synthesis, where vivid images are conjured from descriptive language or existing photos (Gen AI, diffusion models, text-to-image)
• Image analysis: delve into the realm of image analysis, where advanced algorithms interpret and understand visual data with astonishing accuracy (predictive AI, Kaggle, Keras neural network, Intel AMX, image-to-text)
FlexPod for Generative AI Inferencing

Optimized for AI

• Comprehensive suite of AI tools and frameworks


with NVIDIA AI Enterprise that support optimization
for NVIDIA GPU

• Validated NVIDIA NeMo with TRT-LLM that accelerates


inference performance of LLMs on NVIDIA GPUs

• Metrics dashboard for insights into cluster and GPU


performance and behavior

Accelerated Deployment
• Deployment validation of popular Inferencing Servers
and AI models such as Stable Diffusion and Llama 2
LLMs with diverse model serving options
• Automated deployment with Ansible playbook

AI at Scale
• Scale discretely with future-ready and modular design

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 85
FlashStack for Generative AI | Inferencing with LLMs
Foundational architecture for Gen AI:
• Validated NVIDIA NeMo Inference with TensorRT-LLM, which accelerates inference performance of LLMs on NVIDIA GPUs
• Validated models using the Text Generation Inference server from Hugging Face
• Metrics dashboard for insights into infrastructure, cluster, and GPU performance and behavior

Simplify and accelerate model deployment:
• Extensive breadth of validation of AI models such as GPT, Stable Diffusion, and Llama 2 LLMs with diverse model serving options
• Automated deployment with Ansible playbooks

Consistent performance:
• Consistent average latency and throughput
• Better price-to-performance ratio

Solution stack, managed by Cisco Intersight: generative AI models (NeMo GPT, Llama, Stable Diffusion); inferencing servers (NVIDIA Triton, Text Generation Inference, PyTorch); NVIDIA AI Enterprise (advanced AI platform with advanced integration); Portworx Enterprise (model repository and storage for applications); Red Hat OpenShift (control plane and worker virtual machines); VMware vSphere virtualization; FlashStack infrastructure (Cisco UCS X210c compute nodes, UCS X440p PCIe nodes, Pure Storage FlashBlade or FlashArray, NVIDIA GPU accelerators).

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 86
Cisco and Nutanix partner for AI: The Power of Two
Chat GPT-in-a-box

• AI everywhere: existing apps and new experiences
• Proven platforms: Nutanix Cloud Platform and Cisco Intersight, with CVDs and automated playbooks
• Secure foundation: Cisco Compute and Networking with end-to-end resiliency

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 87
Cisco Compute Hyperconverged GPT-in-a-Box
Deploy hybrid-cloud AI-ready clusters with Cisco Validated Designs (CVDs)
Business challenges: streamlined governance with enterprise software, sustainable energy use, and the complexity of hybrid cloud.

Benefits: optimized GenAI infrastructure, risk reduction and fast time to market, streamlined operations, proven performance, protection of valuable data, and simplified hybrid cloud operations.

Stack: generative AI apps and foundation models on Kubeflow and PyTorch, Kubernetes, Nutanix Files and Object Storage, AHV virtualization, and Nutanix AOS, running on GPU-enabled Cisco compute managed by Cisco Intersight.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 88
CVDs to simplify end-to-end AI infrastructure
1. CVD blueprint for AI networks: best-performing AI/ML networks with a focus on application performance; intelligent buffering, low latency, telemetry, and RoCEv2; dynamic congestion avoidance; one IP network for both front end and back end; automation for day-2 operations; validated designs for the network and ecosystem partners.

2. CVDs for simplified AI-ready infrastructure (expanded roadmap): NVIDIA AI Enterprise, Red Hat OpenShift AI, GPT-in-a-Box on Nutanix Hyperconverged, Gen AI with Cloudera Data Platform, NGC, Developer Cloud.

3. CVD playbooks supporting common AI models (new): large language models (GPT-3, BERT, T5); computer vision models (ResNet, EfficientNet, YOLO); generative models (GANs, VAEs).

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 89
Future Trends and
Industry Impacts of
AI Infrastructure
Demands
AI drives a better future

With a new kind of data center: simple, sustainable, and future-ready.
• Artificial intelligence and simplified cloud operations
• More programmability and control, less operational complexity
• More efficient performance for new workloads, less costly to build, deploy, and operate
• Edge inferencing and fleet management; sustainability and power efficiency

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 91
Power & Cooling Trends

Example 2U accelerated-server power budget (~2,400 W total): CPU 500 W (21%), GPU 600 W (25%), memory 480 W (20%), storage 360 W (15%), fans 240 W (10%), misc/PCIe I/O 150 W (7%).

• CPU, GPU, and switch ASIC power requirements are moving from ~350 W TDP today to 400 W+ and far beyond in the coming years
• Traditional fan cooling consumes a lot of power and becomes less efficient as system power increases; passive cooling is approaching its limits
• Liquid cooling technology addresses future cooling requirements with significantly better cooling efficiency and reduced noise levels; closed-loop liquid cooling provides a retrofit solution
• Future data center designs will need to provision for rack-level liquid cooling infrastructure (with an external Cooling Distribution Unit, CDU)
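The per-component figures above sum roughly as follows (a sketch using the slide's numbers; the 16-server rack fill is a hypothetical assumption and facility overhead such as PUE is not included):

# 2U accelerated-server power budget from the breakdown above (watts).
budget = {"CPU": 500, "GPU": 600, "Memory": 480, "Storage": 360, "Fans": 240, "Misc/PCIe-IO": 150}
server_watts = sum(budget.values())            # 2,330 W, i.e. the "~2,400 W" figure
rack_servers = 16                              # hypothetical rack fill
rack_kw = server_watts * rack_servers / 1000
print(f"Per server: {server_watts} W, per rack of {rack_servers}: {rack_kw:.1f} kW")
# At ~37 kW per rack, air cooling alone becomes marginal, which is why the
# liquid cooling options on the next slide matter.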

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 92
Liquid Cooling Technologies
• Single-phase immersion: PAO6 (zero GWP, cheaper, lower cooling capability) or FC-40 (better cooling, higher GWP); material compatibility concerns
• Two-phase immersion: better cooling with FC-3284; heatsink design uses a boiling-enhancement coating; material compatibility concerns; high GWP
• Single-phase cold plate: better cooling with PG25; zero GWP; leaks can be catastrophic; requires parallel connections to avoid pre-heat
• Two-phase cold plate: better cooling with R134a, Novec 7000, or another refrigerant; enables highly dense systems; series connections are OK; leaks are not catastrophic

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 93
Compute Express Link (CXL)
Disaggregation Technologies
• Alternate protocol that runs across the standard PCIe physical layer
• Uses a flexible processor port that can auto-negotiate to either the
standard PCI transaction protocol or alternate CXL transaction protocols
• First generation CXL aligns to 32 Gbps PCIe 5.0
• CXL usage expected to be a key driver for an aggressive timeline to PCIe
6.0
• Allows you to build fungible platforms

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 94
UCS X-Fabric Technology For Disaggregation
Open, modular design enables compute and accelerator node connectivity

• Open standards: PCIe 4/5/6, CXL*; not just another PCIe switch
• No midplane or cables = easy upgrades
• Expandability to address new use cases in the future (memory and storage nodes)
• UCS X-Fabric Technology: an internal fabric at the rear of the chassis that interconnects compute and GPU nodes, carries industry-standard PCIe traffic, and can be upgraded to future generations
• CXL will evolve out of standard PCIe for next-generation speeds, cache coherency, shared I/O, and memory

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 95
Expanding Ecosystem of Viable GPU Options
Accelerators available now (7 nm) and in 1H CY2024 (5 nm) offer native RoCE with scale-up and scale-out, available via HLS-1 and HLS-Gaudi2 servers (x8), SMC servers (x8), Aivres/IEI servers (x8), SDSC, public cloud (AWS EC2), and Intel Developer Cloud. The next-generation AI accelerator, Falcon Shores, is in development on the 2025 horizon.
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 96
Ultra Ethernet Consortium - UEC

https://ultraethernet.org/uec-progresses-towards-v1-0-set-of-specifications/
#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 97
Ultra Ethernet Consortium - UEC

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 98
Open Standard NVLink Alternatives
Introduction of Ultra Accelerator Link (UALink)
AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, and Microsoft are announcing the formation of a group that will create a new industry standard, UALink, and build its ecosystem: a low-latency, high-bandwidth fabric for hundreds of accelerators, serving as the interconnect for GPU-to-GPU communications.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 99
Silicon Photonics
Bringing Higher Data Rates, Lower Latency & Reduced Power Consumption
• Fiber Optic Photonics
  • Operates over length scales of hundreds or thousands of kilometers, e.g. undersea fiber optic links for the internet
  • The majority of the optical link involves light in fiber optic cable
  • Source laser, periodic repeaters/amplifiers, and a photodetector at the receiver
  • All components (lasers, amplifiers, photodetectors, optical modulators, splitters, etc.) are discrete and connected == very costly

• Silicon Photonics
  • Integrated photonics technology
  • All optical components are created directly on the same silicon-on-insulator (SOI) substrate, i.e. compact photonic chips that can be closely integrated with CMOS logic
  • Because all components are created on the same substrate, optical components can be packed far more densely than discrete optics can achieve

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 100
Summary
Takeaways and Closing
- Cisco Makes AI Hybrid Cloud Possible

• Compute: flexible GPU acceleration
• Network: lossless, high-performance fabrics
• Storage: scalability, tight coupling with compute & networking

AI is pushing infrastructure requirements

• Very few customers will train the largest models. Most will use pre-trained models with their own data and deploy associated inference models.
• The use cases must drive which AI models, methods, and techniques to utilize. AI consultants play a vital role in assessment, guidance, and adoption.
• AI is driving the next push for modernized data center facilities; upgraded networks, compute, and storage; and operational models.
• Major investments are not required to start. You can get started with CPU-based acceleration and existing infrastructure.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 102
Complete Your Session Evaluations

Complete a minimum of 4 session surveys and the Overall Event Survey to be


entered in a drawing to win 1 of 5 full conference passes to Cisco Live 2025.

Earn 100 points per survey completed and compete on the Cisco Live
Challenge leaderboard.

Level up and earn exclusive prizes!

Complete your surveys in the Cisco Live mobile app.

#CiscoLive BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 103
Continue your education

• Visit the Cisco Showcase for related demos
• Book your one-on-one Meet the Engineer meeting
• Attend the interactive education with DevNet, Capture the Flag, and Walk-in Labs
• Visit the On-Demand Library for more sessions at www.CiscoLive.com/on-demand

Contact us at:
eminchen@cisco.com, nicgeyer@cisco.com

BRKCOM-1008 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 104
Thank you

#CiscoLive
Congestion in the fabric
Congestion can always happen

• Congestion can always happen, even in a non-blocking switch/fabric
• Let's consider the following example with some math:
  • 16 ToRs (L1-L16), each dual-connected to every spine (S1-S4) with 2x400Gbps links
  • Every ToR therefore has 3.2Tbps of uplink capacity
  • Each ToR is attached to 26 dual-homed nodes via 100Gbps links
  • Every node could be firing up to 200Gbps of traffic without exceeding the uplink capacity
  • But where is this traffic going?

(Diagram: spines S1-S4, leaves L1-L16; 2x400Gbps leaf-to-spine links, 100Gbps node links)
Congestion could always happen

• If the traffic aggregated toward a node exceeds its egress bandwidth capacity, then we have congestion
• The impact depends on the data plane protocol:
  • Protocols with congestion control capabilities, like TCP, can auto-adjust the flow throughput
  • Other protocols, like UDP, have no concept of congestion control

(Diagram: flows summing to 300Gbps converge on a node attached via a 100Gbps link; 2x400Gbps leaf-to-spine links)
How Does RoCEv2 Solve This?
• RoCEv2 MUST run over a lossless network; retransmission must be avoided
• Ethernet networks are lossy by design; drops can happen
• RoCEv2 encapsulates data chunks in IP/UDP packets
• UDP doesn't have a native congestion control mechanism
• RoCEv2 uses the Data Center Quantized Congestion Notification scheme, which relies primarily on two existing flow control techniques:
  • IP Explicit Congestion Notification (RFC 3168, 2001)
  • Priority Flow Control (802.1Qbb)

(Same example: flows summing to 300Gbps toward a receiver attached via a 100Gbps link)
Data Center Quantized Congestion Notification
• Neither IP ECN nor PFC alone can provide a valid congestion management framework
  • IP ECN signalling might take too long to relieve the congestion
  • PFC can introduce other problems, like head-of-line blocking and unfairness between flows
• The two of them together provide the desired result of lossless RDMA communications across Ethernet networks (this combination is called DCQCN)
• The requirements are:
  • Ethernet devices compatible with both techniques
  • Proper configurations applied
Explicit Congestion Notification
• ECN is implemented via QoS queuing policies leveraging WRED (Weighted Random Early Detection)
• Buffer utilization is constantly monitored; when the buffer goes above the low threshold, some packets get their ECN bits marked to 0b11. Only ECN-capable packets are marked
• If the buffer goes above the high threshold, then all ECN-capable packets are marked with 0b11

ECN field in the IP header (next to DSCP):
• 0b00 --> Non ECN-capable
• 0b01 --> ECN-capable
• 0b10 --> ECN-capable
• 0b11 --> Congestion Experienced

(Diagram: MAC | IP | UDP | RoCEv2 packet; buffer utilization against the WRED low and high thresholds)
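The marking behaviour can be illustrated with a small sketch. This is a simplified model of WRED-based ECN marking, not the exact Nexus ASIC algorithm: below the low threshold nothing is marked, between the thresholds ECN-capable packets are marked with a probability that ramps up to a maximum mark probability, and above the high threshold every ECN-capable packet is marked. The threshold and probability values mirror the configuration shown later in this appendix.

```python
import random

ECT = {0b01, 0b10}          # ECN-capable codepoints
CE = 0b11                   # Congestion Experienced

def wred_ecn_mark(buffer_kb, ecn_bits,
                  low_kb=150, high_kb=3000, max_mark_prob=0.07):
    """Return the (possibly remarked) ECN bits for one packet.

    Simplified WRED/ECN model: mark probability grows linearly between the
    low and high thresholds; above the high threshold everything ECN-capable
    is marked CE. Non-ECN-capable packets are never remarked here.
    """
    if ecn_bits not in ECT or buffer_kb < low_kb:
        return ecn_bits
    if buffer_kb >= high_kb:
        return CE
    mark_prob = max_mark_prob * (buffer_kb - low_kb) / (high_kb - low_kb)
    return CE if random.random() < mark_prob else ecn_bits

# Example: at a moderately full buffer only a fraction of packets get CE.
marked = sum(wred_ecn_mark(2000, 0b10) == CE for _ in range(10_000))
print(f"Marked {marked} of 10000 packets at 2000 kB occupancy")
```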
ECN In Action With RoCEv2
S:S1; D:R X; ECN:0b10

L1
Sender –S1

L2 Spine L4
Sender –S2 Receiver -R

L3
Sender – S3

200Gbps

100Gbps
ECN In Action With RoCEv2
S:S1; D:R X; ECN:0b10
50Gbps
L1
Sender –S1

50Gbps
L2 Spine L4
Sender –S2 Receiver -R
S:S1; D:R X; ECN:0b10

L3
Sender – S3

200Gbps

100Gbps
ECN In Action With RoCEv2

S:S1; D:R X; ECN:0b10


50Gbps
L1
Sender –S1

S:S2; D:R X; ECN:0b10 50Gbps 100Gbps


L2 Spine L4
Sender –S2 Receiver -R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b10

L3
Sender – S3

200Gbps

100Gbps
ECN In Action With RoCEv2
IMPORTANT: The status changes and actions in the next slides happen within nanoseconds

S:S1; D:R X; ECN:0b10


50Gbps
L1
Sender –S1

S:S2; D:R X; ECN:0b10 50Gbps 150Gbps


L2 Spine L4
Sender –S2 Receiver -R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b10
S:S3; D:R X; ECN:0b10 S:S3; D:R X; ECN:0b10

L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2
random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
50Gbps

L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b11
S:S3; D:R X; ECN:0b10

L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S2 X; CNP
CNP = Congestion Notification Packet
sent once every X msec
L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

50Gbps
L1
Sender –S1

25Gbps 125Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
S:S3; D:R X; ECN:0b11

L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

50Gbps
L1
Sender –S1

25Gbps 125Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
sent once every X msec
L3
Sender – S3
50Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2

25Gbps
L1
Sender –S1

25Gbps 75Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:S1; D:R X; ECN:0b10
S:S2; D:R X; ECN:0b10
S:S3; D:R X; ECN:0b10

L3
Sender – S3
25Gbps
200Gbps

100Gbps
ECN In Action With RoCEv2
Considerations

• The latency between ECN marking and the subsequent throttling of the throughput rate could be significant
• CNP packets must be prioritized!
• While notifications are propagating, buffers might get fully saturated, and this will cause a tail drop
• This is why DCQCN combines ECN with PFC

(Chart: RoCEv2 queue buffer saturation, 0-120%)
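The sender-side reaction can also be sketched. The following is a heavily simplified, illustrative model of DCQCN-style rate control: cut the rate multiplicatively when a CNP arrives, then recover additively while no CNPs are seen. The constants (cut factor, recovery step) are arbitrary illustration values, not the actual DCQCN parameters implemented by real NICs.

```python
class SimpleDcqcnSender:
    """Toy DCQCN-like rate controller (illustration only, not the real algorithm)."""

    def __init__(self, line_rate_gbps=100.0, cut_factor=0.5, recover_step_gbps=5.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps
        self.cut_factor = cut_factor            # multiplicative decrease on CNP
        self.recover_step = recover_step_gbps   # additive increase per quiet interval

    def on_cnp(self):
        """Receiver reflected a CE mark back as a Congestion Notification Packet."""
        self.rate = max(1.0, self.rate * self.cut_factor)

    def on_quiet_interval(self):
        """No CNP seen during the last interval: recover toward line rate."""
        self.rate = min(self.line_rate, self.rate + self.recover_step)


sender = SimpleDcqcnSender()
for event in ["cnp", "cnp", "quiet", "quiet", "quiet"]:
    sender.on_cnp() if event == "cnp" else sender.on_quiet_interval()
    print(f"{event:>5}: rate = {sender.rate:.1f} Gbps")
```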
Priority Flow Control
• With PFC we can define a no-drop queue
• Every time that queue reaches a defined threshold, the almost-saturated device sends pause frames to the devices causing the buildup
• A device that receives a pause frame stops forwarding packets classified into that queue and places them in its own buffer
• The process repeats from there until it reaches the original senders; at that point they also temporarily stop sending packets
• By the time this happens, all the buffers in the network should be flushed and forwarding can start again

(Diagram: per-port queues 1-8; PAUSE/XOFF frames propagate hop by hop toward the senders, XON resumes forwarding)
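A toy queue model makes the xOFF/xON behaviour easier to see. The thresholds below are made-up illustration values; real switches derive them from per-port buffer accounting and headroom, and "resume" is actually a pause frame with zero quanta.

```python
class PfcQueue:
    """Toy no-drop queue that emits PFC pause/resume decisions (illustration only)."""

    def __init__(self, xoff_kb=500, xon_kb=300):
        self.occupancy_kb = 0
        self.xoff_kb = xoff_kb   # above this, ask upstream to pause
        self.xon_kb = xon_kb     # below this, allow upstream to resume
        self.paused_upstream = False

    def enqueue(self, packet_kb):
        self.occupancy_kb += packet_kb
        if not self.paused_upstream and self.occupancy_kb >= self.xoff_kb:
            self.paused_upstream = True
            return "send PFC pause (XOFF) toward the sender"
        return None

    def dequeue(self, packet_kb):
        self.occupancy_kb = max(0, self.occupancy_kb - packet_kb)
        if self.paused_upstream and self.occupancy_kb <= self.xon_kb:
            self.paused_upstream = False
            return "send PFC resume (XON) toward the sender"
        return None


q = PfcQueue()
for _ in range(6):                   # ingress faster than egress: buffer fills
    action = q.enqueue(100)
    if action:
        print(f"{q.occupancy_kb} kB: {action}")
for _ in range(4):                   # buffer drains while upstream is paused
    action = q.dequeue(100)
    if action:
        print(f"{q.occupancy_kb} kB: {action}")
```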
PFC and ECN
Joining Forces

As buffer utilization on the congested queue grows, the switch escalates its reaction:
1. Buffer crosses the WRED low threshold --> some ECN-capable packets are marked with CE
2. Buffer crosses the WRED high threshold --> all ECN-capable packets are marked with CE
3. Buffer crosses the xOFF threshold --> PFC pause frames are sent towards the source
4. Buffer drains back below the xON threshold --> PFC pause frames are no longer sent
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2
Note: Spine will also start marking packets
with ECN 0b11 before sending PFCs
50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
PFC
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2
Note: Every Switch will also start marking
packets with ECN 0b11 before sending PFCs
50Gbps
L1
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
PFC
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

50Gbps
L1
PFC
Sender –S1

50Gbps 150Gbps
PFC
L2 Spine L4
PFC PFC
Sender –S2 Receiver-R
S:R; D:S1 X; CNP
S:R; D:S2 X; CNP
S:R; D:S3 X; CNP
PFC
L3 S:S1; D:R X; ECN:0b11
S:S2; D:R X; ECN:0b11
Sender – S3
50Gbps S:S3; D:R X; ECN:0b11

200Gbps

100Gbps
Priority Flow Control In Action With RoCEv2

20Gbps
L1
Sender –S1

20Gbps 60Gbps
L2 Spine L4
Sender –S2 Receiver-R

L3
Sender – S3
20Gbps
200Gbps

100Gbps
ECN and PFC – What Each One Brings
RoCEv2 can leverage the use of both ECN and PFC to achieve its goals (i.e. lossless transport)
• ECN is an IP layer notification system. It allows the switches to indirectly inform the sources as
soon as a threshold is reached and let them slow down the throughput
• PFC works at Layer 2 and serves as a way to use the buffer capacity of switches in the data
path to temporarily ensure the no-drop queue is honoured. It effectively happens at each
switch, hop-by-hop, back to the source, giving the source time to react without dropping
packets
• ECN should react first; PFC acts as a fail-safe if the reaction is not fast enough
• In any case, the combination can help achieve the lossless outcome required by AI/ML traffic
• This collaboration of the two is called Data Center Quantized Congestion Notification (DCQCN)
• All Nexus 9000 CloudScale ASICs support DCQCN
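One way to picture the division of labour is a single buffer with escalating thresholds: the WRED/ECN thresholds sit lower than the PFC xOFF threshold, so marking fires first and pausing only kicks in if marking did not relieve the congestion fast enough. The numbers below are illustrative assumptions, not recommended settings.

```python
def dcqcn_actions(buffer_kb,
                  wred_low_kb=150, wred_high_kb=3000, xoff_kb=3500):
    """Return the reactions a DCQCN-style queue would take at a given occupancy.

    Illustrative only: the point is the ordering (ECN marking escalates before
    PFC pausing acts as the fail-safe), not the specific threshold values.
    """
    actions = []
    if buffer_kb >= wred_low_kb:
        actions.append("mark some ECN-capable packets CE")
    if buffer_kb >= wred_high_kb:
        actions.append("mark all ECN-capable packets CE")
    if buffer_kb >= xoff_kb:
        actions.append("send PFC pause frames upstream (fail-safe)")
    return actions or ["forward normally"]

for occupancy in (100, 1000, 3200, 3600):
    print(f"{occupancy:>5} kB -> {dcqcn_actions(occupancy)}")
```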
Alternatives to ECN with WRED
Approximate Fair Drop
• Nexus 9000 ASICs also implement advanced queuing algorithms that can avoid some non-optimal WRED results
• For example, WRED has no knowledge of which flows are consuming most of the bandwidth; ECN marking happens only based on probability
• AFD constantly tracks the amount of traffic exchanged per flow and divides flows into two categories:
  • Elephant flows: long and heavy, which will be penalized (ECN marked)
  • Mice flows: short and light, which will not be penalized (ECN marked)
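As a rough illustration of the elephant/mice distinction, the sketch below tracks per-flow byte counts and only makes flows above a size cutoff eligible for ECN marking. The 1 MB cutoff and the flow key are arbitrary choices for the example; the actual AFD implementation tracks fair rates in the ASIC hardware rather than simple byte counts.

```python
from collections import defaultdict

ELEPHANT_BYTES = 1_000_000   # arbitrary cutoff for this illustration

flow_bytes = defaultdict(int)

def afd_like_mark(flow_key, packet_bytes):
    """Return True if this packet should be eligible for ECN marking.

    Toy approximation of AFD's idea: only long/heavy (elephant) flows are
    penalized; short/light (mice) flows pass through unmarked.
    """
    flow_bytes[flow_key] += packet_bytes
    return flow_bytes[flow_key] > ELEPHANT_BYTES

# A bulk RDMA transfer quickly becomes an elephant; a short flow stays a mouse.
# (4791 is the RoCEv2 UDP destination port.)
for _ in range(2000):
    afd_like_mark(("10.0.0.1", "10.0.0.9", 4791), 1000)
print("bulk flow marked: ", afd_like_mark(("10.0.0.1", "10.0.0.9", 4791), 1000))
print("small flow marked:", afd_like_mark(("10.0.0.2", "10.0.0.9", 4791), 200))
```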
