Dell Networking RoCE Configuration
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND
TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF
ANY KIND.
© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express
written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Other trademarks and trade names may be used
in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims any
proprietary interest in the marks and names of others.
Networking equipment vendors have recognized the need to implement and deliver such a solution to their customers, and an ecosystem of NIC (Network Interface Card) vendors and networking vendors has created a unique solution, enhancing the necessary protocols within the data center to achieve this data transfer efficiency.
The Dell Networking data center product family supports DCB (Data Center Bridging). DCB comes with three additional features: PFC, ETS, and DCBx. These features enable a granular, dedicated class of service that RDMA leverages to deliver on its promises of speed and lossless service.
Objective
This technical brief is intended for both internal and external consumption. It provides a set of recommended step-by-step instructions for deploying a converged data center environment, based on proven practices, using DCB end-to-end together with its companion set of features.
This document discusses RoCE v1 and v2 only.
Audience
The intended audience for this document is system engineers, system administrators, network architects, and network operators who design, deploy, or maintain a converged network infrastructure.
Steps 1 – 5:
These steps suffer from disadvantages in throughput and efficiency (latency), especially in high performance computing environments and applications. Without RDMA, multiple memory copies take place, and this becomes a bottleneck in HPC (High Performance Computing) environments, primarily because memory speeds have not kept pace with the increasing speeds of CPUs and interconnects. In addition to this bottleneck, there is the extra processing required for TCP/IP transfers. The typical TCP/IP network stack is implemented in the host operating system, which means the host CPU must perform all the necessary operations such as packet processing, checksum calculation, and handling of system interrupts.
To address these issues, the RDMA Consortium was created. RDMA technology enables a more efficient way of transferring data by removing intermediate memory copies and minimizing operating system and CPU involvement. As a result, latency is reduced and throughput is increased, because one compute node can place data directly into another node's memory with minimal demands on memory bus bandwidth and CPU processing overhead (see Figure 2).
There are two commonly known RDMA technologies that run over Ethernet: RoCE (RDMA over Converged Ethernet) and iWARP.
With RoCE, DCB needs to be deployed end-to-end: from one host, across the network infrastructure, to the other host. In other words, RoCE has a network element to it, and the hosts and the network must work together.
There are two versions of RoCE, 1 and 2. Version 1 is a pure Layer 2 solution, while version 2, which is still under review, is a Layer 3 solution.
Unlike RoCE, iWARP is based on TCP/IP, so it has no special requirements in order to be supported by Layer 2 or Layer 3 networking devices.
So what changed? Ethernet changed. It evolved from basic 802.1 to what we know as the 802.1Q extensions (802.1Qbb, 802.1Qaz, and 802.1Qau). RoCE and these extensions work together, and we cannot mention one without the other.
This evolution provides the necessary tools to address RDMA transport requirements (see Table 2).
Together, PFC, ETS, and QCN make up what is known as Data Center Bridging, or DCB. In addition to PFC, ETS, and QCN, DCBx is another important component of what DCB offers, even though it is not defined as one of the 802.1 extensions.
802.1Qbb (PFC)
802.1Qbb provides a link-level flow control mechanism that can be controlled independently for each of the 802.1p Class of Service values, ranging from 0 to 7. The goal of this enhancement is to ensure no packet loss due to congestion on the wire in a DCB-enabled network. With this enhancement, QoS (Quality of Service) over Ethernet is delivered.
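To show how PFC is expressed on a Dell Networking OS switch, the sketch below makes a single 802.1p value lossless inside a dcb-map. The hostname, map name, port, and priority chosen here are placeholders rather than part of the tested setup, and the show command syntax may vary by OS release.
Switch(conf)#dcb-map PFC-EXAMPLE
Switch(conf-dcbmap-PFC-EXAMPLE)#priority-group 0 bandwidth 75 pfc off best-effort group, drops allowed
Switch(conf-dcbmap-PFC-EXAMPLE)#priority-group 1 bandwidth 25 pfc on lossless group, PFC pause per priority
Switch(conf-dcbmap-PFC-EXAMPLE)#priority-pgid 0 0 0 1 0 0 0 0 only 802.1p 3 is placed in the lossless group
Switch(conf-dcbmap-PFC-EXAMPLE)#end
Switch#show interfaces tengigabitethernet 0/1 pfc detail check the per-priority PFC state on a port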
802.1Qaz (ETS)
802.1Qaz continues in the spirit of delivering solid quality of service by guaranteeing a percentage of the link bandwidth to a specific traffic class. For example, voice, video, and data can each be assigned a different amount of guaranteed bandwidth.
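For example, the dcb-map configured later in this document expresses exactly this kind of guarantee: 50% of the link for the LAN priority group and 25% each for the iSCSI and RoCE groups. The excerpt below is a sketch only (the hostname is a placeholder and the show command may vary by OS release).
Switch(conf)#dcb-map ALL
Switch(conf-dcbmap-ALL)#priority-group 0 bandwidth 50 pfc off 50% of the link guaranteed to the LAN class
Switch(conf-dcbmap-ALL)#priority-group 1 bandwidth 25 pfc on 25% guaranteed to the iSCSI class
Switch(conf-dcbmap-ALL)#priority-group 2 bandwidth 25 pfc on 25% guaranteed to the RoCE class
Switch(conf-dcbmap-ALL)#end
Switch#show interfaces tengigabitethernet 0/1 ets detail verify the ETS bandwidth allocation per priority group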
802.1Qau (QCN)
802.1Qau provides an end-to-end congestion control mechanism for upper-layer protocols that otherwise have no native, built-in congestion control. 802.1Qau, together with Qbb and Qaz, makes Ethernet a “lossless” technology.
RoCE v1 and v2
Version 1 was the first version made available, and it was defined as a link-layer protocol allowing two hosts in the same broadcast domain (VLAN) to communicate. This version of the protocol uses Ethertype 0x8915, which means that the frame length is limited by the standard Ethernet protocol definition, i.e., 1500 bytes for a standard Ethernet frame and 9000 bytes for an Ethernet jumbo frame.
Starting with Dell Networking OS release 9.0, RoCE v1 support was introduced on the S6000 data center switch, allowing for HPC (High Performance Computing) and similar applications. With OS release 9.5 and above, version 2 support was introduced.
Version 2 is referred to as Routable RoCE (RRoCE). It overcomes version 1's limitation of being bound to a single broadcast domain (VLAN). With version 2, IPv4 and IPv6 are supported, allowing RoCE to be delivered across different subnets and bringing scalability into the picture.
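Because RoCE v1 payloads are bounded by the Ethernet frame size, jumbo frames are commonly enabled end-to-end on every port and port-channel carrying RoCE traffic. The sketch below is illustrative only; the port and the 9216-byte value (a 9000-byte payload plus headroom for headers) are assumptions and should be checked against the platform's supported MTU range.
Switch#conf
Switch(conf)#int fo0/1
Switch(conf-if-fo0/1)#mtu 9216 allow 9000-byte jumbo payloads plus Ethernet/VLAN headers (placeholder port and value)
Switch(conf-if-fo0/1)#end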
RoCE continues to evolve, and with the ubiquity of Ethernet and its constant speed improvements and per-port cost reductions, it is poised to be a strong and viable alternative to InfiniBand. There will always be cases where InfiniBand is more relevant than RoCE or iWARP, but for the most part the requirements of HPC and HPC-like environments can be met by the current Ethernet standards enhancements.
Using this setup, RoCE v1 and v2 were configured and deployed using the following items:
Following are the tested configurations for each of the different types of traffic found in this typical data center deployment. The converged data traffic characteristics are:
- LAN traffic: PFC off, 50% guaranteed bandwidth (priority group 0)
- iSCSI traffic: PFC on (lossless), 25% guaranteed bandwidth (priority group 1, 802.1p 4)
- RoCE traffic: PFC on (lossless), 25% guaranteed bandwidth (priority group 2, 802.1p 5)
1. Enable DCB
S6K_1#conf
S6K_1(conf)#dcb enable pfc-queues 4 enable DCB and pfc with 4 queues
2. Configure the DCB map and priority groups. Turn pfc on or off per traffic class type
S6K_1(conf)#dcb-map ALL configure the dcb map and turn on pfc and ets on specific traffic class
S6K_1(conf-dcbmap-ALL)#priority-group 0 bandwidth 50 pfc off LAN traffic, pfc is off, assign 50% BW
S6K_1(conf-dcbmap-ALL)#priority-group 1 bandwidth 25 pfc on iSCSI traffic pfc is on, assign 25% BW
S6K_1(conf-dcbmap-ALL)#priority-group 2 bandwidth 25 pfc on RoCE traffic pfc is on, assign 25% BW
S6K_1(conf-dcbmap-ALL)#priority-pgid 0 0 0 0 1 2 0 0 map 802.1p 4 to group 1 (iSCSI) and 802.1p 5 to group 2 (RoCE)
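NOTE: Depending on the Dell Networking OS release, the dcb-map also has to be applied to each interface carrying the converged traffic before the ETS and PFC settings take effect. The lines below are a sketch only; the port is a placeholder, so verify the exact syntax against your release documentation.
S6K_1(conf)#int fo0/8
S6K_1(conf-if-fo0/8)#dcb-map ALL apply the dcb-map defined above to a host- or fabric-facing port (placeholder port)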
3. Configure the VLT interconnect (VLTi) port-channel and the VLT domain between the two S6000 peers.
S6K_1#conf
S6K_1(conf)#int fo0/4
S6K_1(conf-if-fo0/4)#port-channel-protocol lacp
S6K_1(conf-if-fo0/4-lacp)#port-channel 1 mode active
S6K_1(conf-if-fo0/4-lacp)#no shut
S6K_1(conf-if-fo0/4-lacp)#end
S6K_1#conf
S6K_1(conf)# vlt domain 1
S6K_1(conf-vlt-domain)#peer-link port-channel 1
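The VLT domain above is shown in its minimal form. As a sketch only (the backup destination address and unit IDs are assumptions, not taken from the tested setup), a production VLT domain typically also carries a backup heartbeat destination and explicit unit IDs, and the peering can be verified once both S6000s are configured:
S6K_1(conf-vlt-domain)#back-up destination 10.10.10.2 heartbeat to the peer's management address (placeholder IP)
S6K_1(conf-vlt-domain)#unit-id 0 use unit-id 1 on S6K_2
S6K_1(conf-vlt-domain)#end
S6K_1#show vlt brief confirm the VLTi and peer status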
4. Configure the downstream VLT port-channel facing the MXL and tag it to the VLAN.
S6K_1#conf
S6K_1(conf)#int te0/32
S6K_1(conf-if-te0/32)#port-channel-protocol lacp
S6K_1(conf-if-te0/32-lacp)#port-channel 10 mode active
S6K_1(conf-if-te0/32-lacp)#no shut
S6K_1(conf-if-te0/32-lacp)#end
S6K_1#conf
S6K_1(conf)#int po 10
S6K_1(conf-if-po-10)#portmode hybrid
S6K_1(conf-if-po-10)#switchport
S6K_1(conf-if-po-10)#vlt-peer-lag port-channel 10
S6K_1(conf-if-po-10)#no shut
The same set of commands would be configured on the S6K_2 port Te0/35 to create the VLT peer port-channel.
S6K_1#conf
S6K_1(conf)#int vlan 10
S6K_1(conf-if-vl-10)#tag po10
1. Enable DCB
MXL_1#conf
MXL_1(conf)#dcb enable
2. Re-configure the queue map on the MXL so that 802.1p 3 is assigned to queue 3 and 802.1p 5 is assigned to queue 1. This re-configuration is necessary because the S6K has 4 PFC queues, while the MXL has only 2 usable PFC queues. The DCB configuration is pushed by the S6K as the source, and the dcb-map calls for two priority groups on which PFC is turned on.
The configuration line "priority-pgid 0 0 0 0 1 2 0 0" in the dcb-map turns PFC on for 802.1p 4 and 5, but the default queue mapping (see above) places 802.1p 5 in queue 3, along with 802.1p 6 and 7. This creates a DCB configuration conflict between the S6K (source) and the MXL, which requires re-configuring the queue map.
NOTE: The following data center products (Z9100, Z9500, S6000, S4048) have 4 “lossless” queues, while the following products (S4810, S4820T, S5000) have 2 “lossless” queues.
MXL_1#conf
MXL_1(conf)#service-class dot1p-mapping dot1p3 3 dot1p5 1
MXL_1(conf)#do sh qos dot1p-queue-mapping
Dot1p Priority : 0 1 2 3 4 5 6 7
Queue : 0 0 0 3 2 1 3 3
3. Configure the respective DCBx port-role on the interfaces facing the hosts.
MXL_1#conf
MXL_1(conf)#int range te0/1 , te0/7
MXL_1(conf-if-te-0/1,te-0/7)#portmode hybrid
MXL_1(conf-if-te-0/1,te-0/7)#switchport
MXL_1(conf-if-te-0/1,te-0/7)#protocol lldp
MXL_1(conf-if-te-0/1,te-0/7-lldp)#dcbx port-role auto-downstream
MXL_1(conf-if-te-0/1,te-0/7-lldp)#dcbx version auto
MXL_1(conf-if-te-0/1,te-0/7-lldp)#end
4. Configure the port-channel to the VLT domain and respective DCBx port-role on the upstream interfaces
to the source.
MXL_1#conf
MXL_1(conf)#int range te0/49 - 50
MXL_1(conf-if-range-te-0/49-50)#port-channel-protocol lacp
MXL_1(conf-if-range-te-0/49-50-lacp)#port-channel 10 mode active
MXL_1(conf-if-range-te-0/49-50-lacp)#no shut
MXL_1(conf-if-range-te-0/49-50-lacp)#exit
MXL_1(conf-if-range-te-0/49-50)#protocol lldp
MXL_1(conf-if-range-te-0/49-50-lldp)#dcbx port-role auto-upstream
MXL_1(conf-if-range-te-0/49-50-lldp)#no shut
MXL_1(conf-if-range-te-0/49-50-lldp)#end
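With the auto-upstream and auto-downstream port roles in place, DCBx operation can be checked per interface. The commands below are a sketch; the exact syntax and output format may vary by OS release.
MXL_1#show interfaces tengigabitethernet 0/49 dcbx detail negotiated DCBx version, port role, and peer TLVs on the uplink
MXL_1#show interfaces tengigabitethernet 0/1 pfc detail PFC configuration received on a host-facing port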