BCSDNutshell VC
Product Training
BCSD-in-a-Nutshell Virtual Classroom Version
Part 1
Objectives
Review the BCSD-in-a-Nutshell Training
Review the key business applications of SANs
Identify the specific business problems being addressed by the SAN
solution
Capture information that defines the current SAN environment
Capture information that defines the requirements for the target SAN
environment
Review performance design methodology for SANs and MetaSANS
We will identify the key pieces of information that Brocade believes should be
collected before designing a SAN. This information covers both business issues and
requirements, as well as more technically-oriented information about servers,
storage, device availability, and performance.
Note - Brocade recognizes that many of its partners use their own data gathering
tools, including questionnaires and spreadsheets. The methods demonstrated in this
course are not meant to supersede those of Brocade partners, but instead are
provided as examples.
BCSD Overview
This training prepares you to pass the Brocade Certified SAN Designer (BCSD) certification exam.
Training Objectives
The BCSD-in-a-Nutshell training focuses on the four main tasks for a
performance requirements
Given a particular configuration with performance information, evaluate locality,
Training Agenda
The BCSD-in-a-Nutshell virtual class training is delivered in (2) two-
Increase Resource Utilization - Simplify management: 7-10 times more storage managed per administrator
Improved Data Availability - Reduce backup and restore times by 50-90%
Business Continuity - Extend SANs over distance for Disaster Recovery
Performance & Scalability - Accelerate mission-critical applications
10
Server consolidation
Storage applications
Backups
High-availability applications
Disaster tolerance
Keep data not only at a local site, but also at a remote site
Performance
Security
Scalability
11
SAN Applications
What SANs Add
Storage Applications and Server Consolidation:
Improved connectivity (fan-in)
Improved bandwidth (vs. SCSI)
Improved availability
Better utilization of storage systems
Backups:
Separate infrastructure that off-loads the existing WAN/LAN
infrastructure
SCSI-based infrastructure better suited to large-block I/Os
Avoids IP stack for better performance
Long-distance connectivity to facilitate remote site operations
12
Storage Applications: As storage systems become larger and more robust, users
are migrating data from numerous small storage systems (either SCSI-based or FC-based) to smaller numbers of large storage systems. The various servers that
accessed the distributed storage must now share fewer storage ports when accessing
the consolidated storage. The availability of each storage port is thus more
important, and connectivity to the storage port must be more flexible so that the
storage can be highly utilized. The connection to the storage port must support
higher levels of bandwidth per storage port. In a storage consolidation solution,
we choose Storage Area Networks that connect servers to highly available and
reliable storage. The switches to which the storage is connected must support
improved resource utilization by providing additional connectivity and high
performance.
Server Consolidation: The similarities between storage applications and server
consolidation are clear; the only difference is whether we focus on the server or the
storage.
Backups: In a SAN-based backup, the SAN is used to move the backup data, hence
the definition of LAN-free. In reality, though, a LAN is used to send a few bytes
of information about each file being backed up.
SAN Applications
What SANs Add (cont.)
High-availability applications
Highly-available data network
Easier sharing of storage
Better long-distance connectivity for clusters
Disaster tolerance
Fibre Channel high-availability features
Longer-distance storage access through the SCSI driver (better
performance)
Performance
Improved Fibre Channel bandwidth (vs. SCSI)
SilkWorm switch features (trunking)
13
SAN Applications
What SANs Add (cont.)
Security needs
Fabric zoning ensures that servers access only the storage to which they are assigned
14
subject categories:
Business
Requirements
Current SAN
Requirements
Target SAN
Requirements
15
Business Requirements
What are the business challenges that need to be addressed by a SAN solution?
16
Key areas by which success/failure is measured:
Server and Storage Consolidation
High Availability
Backup Needs
Disaster Tolerance and Business Continuance
Performance
SAN-Enabled Applications
Manageability
Scalability
Security Needs
Reduced Total Cost of Ownership
Storage Utilization
17
Business requirements are a key part of the SAN design process. For both the end user and
the SAN designer, they establish standards by which success and failure can be ascertained.
Server/Storage Consolidation - The following servers shall access the following storage systems:
High Availability - No more than X minutes of downtime per year, including Y minutes for scheduled maintenance.
Backup Needs - The following servers must complete their backups according to this schedule:
Disaster Tolerance and Business Continuance - In the event of failures at site A, site B shall assume control of all applications and data within C minutes.
Performance - On server E, application F shall load database G within H minutes.
SAN-Enabled Applications - I will need to use Multiple Pathing Software, Volume Management, Shared Storage, Clustering.
Manageability - X Management Station will run Y applications and access the infrastructure using Z methods.
Scalability - My network will grow A% over the next B time frame, so my interface will need to provide C number of ports.
Security Needs - Our Depth of Defense is X and we require Y security needs.
Reduced Total Cost of Ownership - Overall SAN costs, including hardware, software, and labor, will be reduced by X% over the next 12 months.
Storage Utilization - Disk storage should be at least Y% utilized, up from the current level of X%.
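The High Availability requirement above ("no more than X minutes of downtime per year") maps directly to an availability percentage. As an illustrative sketch of the arithmetic (the 99.99% figure is an assumption for illustration, not a number from the course):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into allowed downtime minutes per year."""
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR

# "Four nines" (99.99%) allows roughly 52.6 minutes of downtime per year
print(round(downtime_minutes_per_year(99.99), 1))
```

Running the requirement in reverse (minutes of downtime back to a percentage) gives the end user a concrete way to phrase the X and Y values above.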
Any SAN design must take into consideration the existing SAN infrastructure
Before creating a SAN design, collect information on the following requirements:
Facilities
SAN Infrastructure
SAN-Enabled Applications
Servers
Storage (Disk and Tape)
Availability (Data and Application)
Performance
Security
LAN Infrastructure
Multiprotocol Infrastructure
18
19
Documenting the geographic information at an early stage gives you time to prepare
for additional changes that may be required (e.g., acquiring additional floor space
for new equipment, adding fiber optic cable pulls, ensuring sufficient power and
cooling, etc.).
To document the current SAN environment, the following information is needed:
Number of fabrics in SAN
Number of sites in environment
Expected growth rate
20
You can obtain most of this information by running the supportShow command
and uploading the switch configuration file. Brocade also provides services to help
with this type of data collection; more on this later in this module.
Which SAN-enabled applications are currently implemented?
Server consolidation
Storage consolidation and storage utilization
High availability
Backup Needs
Disaster tolerance
Performance
Security Needs
Scalability
21
Here are some of the server, storage, and switch hardware and software decisions
that may be affected by the SAN-enabled applications implemented on the current
SAN:
Server consolidation - Fewer/faster servers, Clustering, Volume Management, Multi-Pathing, Persistent Binding
Storage consolidation and storage utilization - Selective Presentation, LUN Masking, LUN Security, Virtualization
High availability - Clustering, Multi-Pathing
Backup Needs - Tape Consolidation, LAN-Free, Serverless, Virtualization
Disaster tolerance - Disaster Recovery (DR), Remote Take-over, Extended Fabrics, Remote Switch or ISL_RDY Mode, FC-to-FC Routing, FCIP
Performance - Trunking, Multi-Pathing
Security Needs - Secure Fabric OS, Zoning, LUN Security
Scalability - Fabric topology selection, switch/Director selection
22
Consider the above list to be the absolute minimum information to be gathered about
each server. Some additional notes on the values above:
Server system - The server name, manufacturer, and the manufacturer's model name.
Operating system - The name of the operating system, and the specific release of the OS (and any service packs, patches, etc.).
Ethernet requirements - The number of Ethernet interfaces needed (higher for clustered servers), and the required speeds.
HBAs - The quantity, manufacturer, model, and driver revision information.
Software Applications - The main application run by this server; for multiple-server applications (cluster software, multi-server databases, etc.), note the partner server.
Disk storage - The storage system used; the amount of storage used; the number of storage-system connection points used by the server to access the storage; and the number of separate logical units (or LUNs) through which the server data is accessed.
Tape storage - The tape library that contains the drive(s), and the speed of each drive.
SAN Issues - Whether some type of dynamic multi-pathing (DMP) software is installed on this server, and additional SAN data security (LUN masking on the server or disk storage, zoning).
Physical location - Room, building, city, etc. where the server is installed.
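The minimum per-server data set above can be captured as a simple record in a data-gathering tool. The following sketch uses hypothetical field names and sample values, not a Brocade-defined schema:

```python
from dataclasses import dataclass

@dataclass
class ServerRecord:
    """Minimal per-server data-gathering record; the field names
    are illustrative, mirroring the list above."""
    name: str              # server name and model
    os: str                # operating system and patch level
    ethernet_ports: int    # Ethernet interfaces needed
    hba_count: int         # HBAs installed
    application: str       # main application (note partner servers separately)
    disk_storage_gb: int   # amount of disk storage used
    lun_count: int         # separate LUNs accessed
    multipathing: bool     # DMP software installed?
    location: str          # room, building, city

# Sample entry; all values are invented for illustration
rec = ServerRecord("SERV1", "Solaris 9", 2, 2, "Oracle",
                   500, 8, True, "Bldg 1, Room 100")
print(rec.name, rec.hba_count)
```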
2006 Brocade Communications Systems, Incorporated.
Revision March 2006
Page 1-22
23
24
25
Data and application availability depend on servers being able to access the data
stored on the SAN. This availability must endure implementation, routine
maintenance, and any repairs or upgrades. If you implement a dual-fabric
redundant SAN solution, then application availability will be improved.
26
Any SAN design must always support sustained performance requirements and
provide for peak needs. Gathering relevant information at an early stage helps to
identify peak times, and encourages planning for additional capacity or modifying
processes to avoid network overloading.
Server   Target   Peak (MB/sec)   When is peak?   Sustained (MB/sec)
SERV1    STOR2    10              1st, 1800
SERV1    TAPE6    20              Sun, 0100
SERV2    STOR1    10              1st, 1800
SERV2    STOR2    10              1st, 1800
SERV2    TAPE3    20              Sun, 0100
SERV3    STOR1    40              30th, 2300      20
SERV3    TAPE3    20              Sun, 0100
27
The above example represents a performance table for three servers within a
proposed SAN. We can see that the connection to TAPE3 has a large peak every
Sunday at 0100 (possibly a full backup by SERV2 and SERV3?), and that STOR2
has a large peak on the first of every month at 1800 (perhaps an end-of-month
process?). As we gather data about other servers in the SAN, we can develop
greater insight about the requirements of the final design.
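The coinciding peaks called out above can be cross-checked by summing per-target peak bandwidth for flows that peak at the same time. A minimal sketch, restating the table data as tuples:

```python
# (server, target, peak MB/sec, when) - values restate the table above
flows = [
    ("SERV1", "STOR2", 10, "1st, 1800"),
    ("SERV1", "TAPE6", 20, "Sun, 0100"),
    ("SERV2", "STOR1", 10, "1st, 1800"),
    ("SERV2", "STOR2", 10, "1st, 1800"),
    ("SERV2", "TAPE3", 20, "Sun, 0100"),
    ("SERV3", "STOR1", 40, "30th, 2300"),
    ("SERV3", "TAPE3", 20, "Sun, 0100"),
]

def aggregate_peaks(flows):
    """Sum peak MB/sec per (target, time) to find coinciding load."""
    totals = {}
    for _server, target, peak, when in flows:
        totals[(target, when)] = totals.get((target, when), 0) + peak
    return totals

peaks = aggregate_peaks(flows)
print(peaks[("TAPE3", "Sun, 0100")])   # SERV2 + SERV3 backing up together
print(peaks[("STOR2", "1st, 1800")])   # SERV1 + SERV2 month-end load
```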
1. Server Level - Persistent Binding
2. Fabric Level - Hard Zoning, Secure Fabric OS
3. Storage Level - Selective LUN Presentation, LUN Security
SAN data security is rarely implemented as just a single-level solution
28
SAN security is implemented on both servers and storage, as well as in the fabric or
SAN itself. By implementing a multi-level SAN security solution, simple attack
techniques (like password theft or WWN spoofing) are thwarted.
29
When documenting the current LAN infrastructure, focus on two key areas:
To manage the SAN-attached devices, what are the IP addresses reserved, and are
they in separated VLANs or subnets? In addition, do the SAN management
applications require constant device access (for live monitoring)?
If the SAN devices are going to transfer data via SAN gateway switches or an
MPR (over FCIP), what are the speed and distance requirements for the
WAN/MAN?
30
31
32
Before creating a SAN design, the target SAN environment must be clearly defined.
When collecting information on the current SAN environment, you may also be
able to collect the information listed above on the target SAN. The focus here
should be on new initiatives, growth estimates, and planning details for the target
SAN solution.
33
These questions can play a major role in determining the final structure of a
solution.
planning details
Customer schedule: target time for pilot, first implementation, any
34
To ensure timely delivery of the target SAN, as well as correct interlock with your
switch vendor and SAN partners, cover vital business items up front.
design
As needed, create your own spreadsheet or database
Many vendors use their own proprietary spreadsheets
Brocade also provides the SAN Health tool that automates data collection
35
The organization of current and target SAN requirements is vital to ensuring a
valid, accurate SAN design. Depending on the size of the data set, use either a basic
spreadsheet or a more complex database. Depending on your switch vendor, you
may be required to use a specific spreadsheet. The Brocade SAN Health tool
(downloadable from http://www.brocade.com) automates collection of data
from the current SAN.
36
The above points are essential to planning and designing a SAN solution that can be
implemented and managed successfully.
[Diagram: a Resilient Fabric (Switches 1-4, multiple paths between switches) next to a Non-Resilient Fabric (Switches A-D, where a single switch is a SPOF)]
37
[Diagram: a dual resilient-fabric SAN with no SPOFs - Resilient Fabric A (Switches 1A-4A) and Resilient Fabric B (Switches 1B-4B)]
38
39
A redundant SAN, composed of at least two resilient fabrics, is a key part of any
long-term high-availability solution. With a redundant SAN solution, all SAN
installations and maintenance can be performed on one fabric at a time without
affecting ongoing data traffic that continues in the second fabric.
Making each fabric resilient (e.g., by the addition of multiple paths between
switches, dual-HBA servers with path failover, and dual-ported storage) ensures
that any device failures do not compromise data access. Multiple resilient fabrics
ensure that hardware failures, software failures, human errors, and site problems
which might fail a switch or fabric do not fail the SAN.
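The value of redundant fabrics can be illustrated with simple probability arithmetic. Assuming fabric failures are independent (a simplification) and using an illustrative 0.1% per-fabric downtime figure:

```python
def san_unavailability(fabric_unavail: float, fabrics: int = 2) -> float:
    """With independent failures, the SAN is unavailable only when
    every fabric is unavailable at the same time."""
    return fabric_unavail ** fabrics

# A fabric that is down 0.1% of the time, duplicated into a dual-fabric SAN
print(san_unavailability(0.001))
```

Even a modest per-fabric availability compounds quickly when fabrics fail independently, which is why redundancy sits at the heart of every long-term high-availability SAN design.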
40
The above slide shows a dual-fabric solution in which each fabric contains a single
Brocade SilkWorm 12000 chassis with two logical switches each.
Environmental hazards: The Brocade SilkWorm 12000 chassis is a robust
enclosure, but it may not survive all physical hazards (including fires, water
damage, earthquakes, etc.). To ensure the best availability, only one fabric should
reside in a single Brocade SilkWorm 12000 chassis. With dual-fabric designs, try to
keep the Brocade SilkWorm 12000 chassis for each fabric in separate locations
within the data center or the enterprise to provide even better chances of survival.
Firmware faults: Within a single Brocade SilkWorm 12000 chassis, the same
version of Brocade Fabric OS runs on both logical switches. During a firmware
upgrade, we want to test new firmware in one fabric before upgrading the second
fabric. The logical switches in a Brocade SilkWorm 12000 chassis should be
kept within the same fabric so that new firmware (or any firmware faults) can be
isolated to only one of the two fabrics in the SAN.
Chassis-level operator errors: As we have seen, there are certain Brocade
SilkWorm 12000 operations that can affect both logical switches within the chassis.
To keep any errors resulting from these operations from affecting both
fabrics in a SAN, keep the logical switches in a Brocade SilkWorm 12000 chassis
within the same fabric.
41
How do you design highly-available SAN solutions?
Make each fabric resilient (at least two paths between any two
switches in the fabric)
Have redundant fabrics (at least two fabrics per SAN)
Attach devices redundantly to the SAN (to at least two fabrics)
When designing a highly-available MetaSAN solution, leverage the
same techniques used to design highly-available SAN solutions
Focus: Avoid creating single fault domains
42
The list of techniques above should already be familiar from SAN design:
Fabric resiliency ensures that any failures within a fabric do not interrupt data
flow.
Redundant fabrics protect against any failures of an entire fabric.
Redundant device attachment defends against failures within the attached devices,
or in any cables.
These techniques result in a SAN design that avoids single points of failure that
could result in loss of data flow. When designing MetaSANs, seek to leverage
resiliency and redundancy, but in the context of Edge Fabrics and Routed Fabrics.
43
Attaching dual-fabric SANs: In the example above, SAN-1 and SAN-2 are dual
fabric SANs (Fabric A and Fabric B). Each SAN fabric is connected to
separate Routed Fabrics (A and B), ensuring that devices in the two SANs can
communicate, but that fault isolation is maintained between fabrics in the same
SAN.
[Diagram: SAN-1 Edge Fabrics each attached by 4 IFLs; SAN-2 Edge Fabrics each attached by 2 IFLs]
44
Attaching Edge Fabrics with resiliency: In the example above, the Edge Fabrics in
SAN-1 are each attached to a Routed Fabric by 4 IFLs. This ensures that the
connection between these Edge Fabrics and the Backbone Fabric is resilient. In a
similar manner, the Edge Fabrics in SAN-2 are each attached to a Backbone Fabric
by 2 IFLs, the minimum needed to ensure resiliency.
45
The example above expands on the concepts presented in the previous slide. By
connecting each Edge Fabric to two Backbone Fabrics, Routed Fabric maintenance
(and other issues) can be handled without interrupting Edge Fabric access, an
essential requirement for solutions that require the highest levels of availability.
46
47
Brocade 8-port and 16-port switches are non-blocking and congestion-free.
The SilkWorm 3900 and 12000 are non-blocking switches that can encounter
congestion under certain conditions, particularly if the sustained SilkWorm 12000
quad-to-quad traffic is expected to exceed 4 Gbit/sec, or sustained SilkWorm 3900
octet-to-octet traffic is expected to exceed 8 Gbit/sec. This happens only if several
2 Gbit/sec devices connected to different quads/octets are expected to communicate
simultaneously at a sustained rate of greater than 1 Gbit/sec. This is a high
requirement, as a device that connects at 2 Gbit/sec rarely sustains the full
bandwidth over time. In this case, try one of these options:
If multiple high-performance devices are communicating with each other
simultaneously and at full bandwidth, localize all the high-performance 2
Gbit/sec devices together on the same quad or octet. This leverages the
congestion-free connections available within the same quad/octet.
If the high-performance devices are not communicating with each other, or are shared
by many other devices, then these devices should be connected to quads/octets
with lower-performing devices. By limiting the number of high-performance
devices per quad/octet, we manage the quad-to-quad and octet-to-octet traffic
within the limits discussed earlier.
[Diagram: servers, storage, and tape attached to two switches connected by ISLs]
48
[Diagram: 2-4 ISLs between SilkWorm 12000 switches, distributed across port cards]
49
Connections between Brocade SilkWorm 12000 switches are first distributed across
port cards to ensure that the high levels of switch availability are not lost through
over-concentrated connections. After distributing these ISLs across at least two port
cards, we can then leverage trunking by concentrating any additional ISLs within
the same quad of a port card.
[Diagram: servers SERV1-4 attached to switch SW1, sharing a single ISL to switch SW2 and storage system STOR2]
50
The ISL over-subscription ratio is defined in terms of bandwidth, not just HBA and
ISL counts. This recognizes the difference in performance of HBAs and ISLs,
depending on the speed of the switches in the Fabric and the HBAs in the attached
servers. In the example above, the four servers SERV1-4 connected to switch SW1
must share a single ISL to reach storage system STOR2, making for a 4:1 ISL oversubscription ratio.
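The 4:1 figure in this example can be computed directly from bandwidths, as the paragraph suggests. A sketch, assuming 2 Gbit/sec links throughout (an illustrative choice):

```python
def isl_oversubscription(server_gbps, isl_gbps):
    """ISL over-subscription ratio: total attached server bandwidth
    divided by total ISL bandwidth."""
    return sum(server_gbps) / sum(isl_gbps)

# Four 2 Gbit/sec servers on SW1 sharing one 2 Gbit/sec ISL to STOR2
ratio = isl_oversubscription([2, 2, 2, 2], [2])
print(f"{ratio:.0f}:1")  # 4:1
```

Because the formula works in bandwidth rather than port counts, mixing 1 Gbit/sec HBAs with 2 Gbit/sec ISLs changes the ratio even when the number of devices stays the same.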
Brocade recommends an ISL over-subscription ratio of 7:1
One ISL per eight-port switch
Two ISLs per 16-port card or switch
Four ISLs per SilkWorm 3900 switch
High-performance SANs or conservative designs may consider an ISL
over-subscription ratio of 3:1
Edge
Core
51
Brocade has found that SAN designs with an ISL over-subscription ratio of 7:1 are
suitable for many user environments. This corresponds to two ISLs per SilkWorm
12000/24000 port card, four ISLs per 32-port SilkWorm 3900, two ISLs per 16-port
SilkWorm 3800/3850, and one ISL per eight-port SilkWorm 3200/3250. The
example above demonstrates a 7:1 ISL over-subscription ratio between the
SilkWorm 3850 Edge switches and the Core switches.
To provide additional ISL bandwidth for large-block applications, add ISLs in
parallel with existing ISLs. A common ISL over-subscription ratio for higher-bandwidth solutions (or more conservative designs) is 3:1.
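The per-switch ISL counts quoted above follow from finding the smallest ISL count that keeps the remaining device ports within the target ratio. A sketch of that arithmetic:

```python
def isls_needed(total_ports: int, target_ratio: float) -> int:
    """Smallest number of ISLs so that the remaining device ports,
    divided by the ISL count, stay within the target ratio."""
    isls = 1
    while (total_ports - isls) / isls > target_ratio:
        isls += 1
    return isls

# A 7:1 ratio reproduces the guidance above:
# one ISL per 8-port switch, two per 16 ports, four per SilkWorm 3900
for ports in (8, 16, 32):
    print(ports, "ports ->", isls_needed(ports, 7), "ISL(s)")
```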
Edge
Core
52
Locality between servers and storage has the lowest cost of any of the techniques
discussed today. By carefully selecting attach points from the available switch
ports, we can keep server-storage I/O traffic away from the ISLs. By some
measures, this may be the most difficult technique to implement, as the planning
needed for locality grows as the size of the SAN grows. As discussed earlier,
high-traffic servers are best suited to locality; here, we refine this definition to
large-bandwidth servers, as these servers tend to share the available ISL bandwidth
poorly.
Adding ISLs adds bandwidth immediately, and can leverage trunking if you plan
carefully. In the example above, two additional ISLs have been added between the
left-most Edge and left-most Core switches.
ISL
Quad
53
Ports within the same quad share the same port control circuitry. If we connect
multiple ISLs between two switches that share the same quad on both switches, the
SilkWorm port circuitry merges them into a single trunk. On most SilkWorm
switches, the port numbers are color-coded to indicate quad membership.
For the SilkWorm 3000-series and 12000 switches, the following port groups are
members of the same quad:
SilkWorm 3200: Ports 0-3; ports 4-7.
SilkWorm 3800 (image, above): Ports 0-3; ports 4-7; ports 8-11; ports 12-15.
SilkWorm 3900: Ports 0-3; ports 4-7; ports 8-11; ports 12-15; ports 16-19;
ports 20-23; ports 24-27; ports 28-31.
SilkWorm 12000 (16-port card): Ports 0-3; ports 4-7; ports 8-11; ports 12-15.
In the example above, one quad has been allocated for ISLs, two quads for servers,
and one quad for storage. We will refine this methodology later in this module.
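Because quads are consecutive groups of four ports on these switches, quad membership reduces to integer division. A small helper (a hypothetical sketch, not a Brocade tool):

```python
def quad(port: int) -> int:
    """Quad index for SilkWorm 3x00/12000-style port numbering:
    ports 0-3 are quad 0, ports 4-7 are quad 1, and so on."""
    return port // 4

def same_quad(a: int, b: int) -> bool:
    """ISLs placed on ports in the same quad on both switches
    can be merged into a single trunk."""
    return quad(a) == quad(b)

print(same_quad(4, 7))   # True: both ports in quad 1
print(same_quad(3, 4))   # False: quad boundary between ports 3 and 4
```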
[Diagram: SilkWorm 3900 port layout - eight quads grouped into four octets, with one quad in each half allocated to ISLs]
54
In the slide above, we see a SilkWorm 3900 with the recommended allocation of
ISL quads.
On the SilkWorm 3900, adjacent pairs of quads share additional connections,
providing even better levels of performance. These octets, shown in the drawing
above, are the following groups of ports: ports 0-7; ports 8-15; ports 16-23; and
ports 24-31.
55
In the slide above, the port blades in a SilkWorm 12000 logical switch are shown.
The strengths of this strategy are:
Distributing the ISLs across the blades maximizes availability by minimizing the
impact of a failed Port Card.
Using different quads on each Port Card provides for a slight improvement in
performance.
Host A -> Storage B: high fabric locality
Host X -> Tape Y: no fabric locality
56
Hosts and storage are considered fabric local if they are connected to the same
Edge Fabric. In the example above, host A and storage B have a high degree of
fabric locality, as both are connected to SAN-1 Fabric A; thus, IFLs in the Routed
Fabric do not carry traffic between these devices. However, host X and tape Y are
connected to separate Edge Fabrics, so that any traffic between these devices would
be carried by IFLs (as shown by the arrows above).
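Fabric locality reduces to whether host and device share an Edge Fabric; a trivial check (the fabric names are illustrative):

```python
def fabric_local(host_fabric: str, device_fabric: str) -> bool:
    """Traffic stays off the IFLs only when both devices
    attach to the same Edge Fabric."""
    return host_fabric == device_fabric

print(fabric_local("Fabric A", "Fabric A"))  # same Edge Fabric: no IFL traffic
print(fabric_local("Fabric A", "Fabric B"))  # different Edge Fabrics: IFL traffic
```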
traffic to IFLs
Similar to the ISL Oversubscription Ratio concept used in SAN
design
Check fan-in and fan-out profiles
If actual bandwidth data is available, use the formula: (sum of
server bandwidths) / (sum of IFL bandwidths)
Recommendations for connecting and provisioning IFLs:
Connect at a 15:1 IFL oversubscription ratio
Provision for a 7:1 IFL oversubscription ratio
To provide resiliency, connect each Edge Fabric to each Backbone
Fabric with at least two 2 Gbit/sec IFLs
57
In SAN design, the ISL Oversubscription Ratio determines the number of ISLs
needed to connect a switch to the fabric, based on the number of devices attached to
the switch that may need to share the ISLs. In a similar manner, the IFL
Oversubscription Ratio determines the number of IFLs needed to connect an Edge
Fabric to a Backbone Fabric, based on the number of devices with non-fabric-local
traffic that must share the available IFL bandwidth.
The IFL oversubscription ratios above translate to IFLs as:
15:1 - one IFL per SilkWorm Multiprotocol Router
7:1 - two IFLs per SilkWorm Multiprotocol Router
This technique should yield a conservative estimate for the number of IFLs, but
may not be suitable for specific MetaSAN solutions. As in SAN-based solutions,
use tools that track actual port traffic metrics (Fabric Watch, Web Tools, etc.) to
fine-tune IFL counts.
As with ISLs, provision for future IFLs by pre-allocating ports on Edge Fabric
switches and Backbone Fabric routers. For good resiliency, connect each Edge
Fabric to each Backbone Fabric by at least two 2 Gbit/sec IFLs.
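The connect/provision guidance above can be sketched with the same bandwidth formula given earlier; the device counts below are assumptions for illustration:

```python
def ifl_oversubscription(server_gbps, ifl_gbps):
    """IFL over-subscription ratio: non-fabric-local server bandwidth
    divided by total IFL bandwidth into the Backbone Fabric."""
    return sum(server_gbps) / sum(ifl_gbps)

# Fourteen 2 Gbit/sec servers behind two 2 Gbit/sec IFLs: the 7:1
# provisioning target recommended above
provisioned = ifl_oversubscription([2] * 14, [2, 2])
print(f"{provisioned:.0f}:1")  # 7:1
```

As the module notes, treat this as a conservative starting point and refine the IFL count with actual port-traffic metrics from tools such as Fabric Watch or Web Tools.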