M05 Cluster Storage Overview
In environments where information transfers between the computer and storage must be
maximized for speed, efficiency, and reliability, such as in banking or trading businesses, high-
performance interconnects that maximize input/output (I/O) throughput are critical. Such
organizations use storage area networks (SANs), usually with Fibre Channel cabling and
interconnects, or links, in redundant configurations, and storage arrays with hardware-based
redundant array of independent disks (RAID) for high availability and high performance.
High-performing equipment helps to ensure that high performance needs can be met, but it does
not guarantee them. The functioning of the storage network is also dependent on the capabilities of
the host operating system, specifically the host operating system drivers that interface with the
storage hardware to pass I/O requests to and from the storage devices. This is especially important
in Fibre Channel SANs, where a complex system of switches and links between servers and
storage requires an effective means of detecting link problems and eliciting the appropriate
response from the operating system.
This module describes in detail how SANs are used in a clustered environment. Storage array
terminology will be discussed in detail. This module also describes the architecture of multipath
I/O.
It is important to remember that shared storage planning is one of the most critical pieces of cluster
deployment. If due diligence in this process is not done prior to production deployment, this is the
area where customers face the greatest risk of cluster instability.
Prerequisites
Before starting this session, you should be able to:
● Understand the steps needed prior to cluster deployment.
● Describe the hardware needed for a Windows Failover cluster.
A SAN is defined as a set of interconnected devices, such as disks, tapes, and servers, that are
connected to a common communication and data transfer infrastructure, such as Fibre Channel.
The common communication and data transfer mechanism for a given deployment is called the
storage fabric.
The purpose of the SAN is to enable multiple servers access to a pool of storage in which any
server can potentially access any storage unit. In this environment, management, which determines
who is authorized to access which devices, and sequencing or serialization guarantees, which
determine who can access which devices at what point in time, play a large role in providing
security.
SANs evolved to address the increasingly difficult job of managing storage at a time when
storage usage is growing explosively. With devices locally attached to a given server or in the
server enclosure itself, performing day-to-day management tasks becomes extremely complex.
Backing up the data in the data center requires complex procedures as the data is distributed
amongst the nodes and is accessible only through the server to which it is attached. As a given
server outgrows its current storage pool, storage specific to that server has to be acquired and
attached, even if there are other servers with plenty of storage space available. Other benefits can
also be gained, such as multiple servers sharing data sequentially or in parallel, and device-to-device
backups, in which data is transferred directly from one device to another without first passing
through a backup server.
A SAN is a network like any other network, such as a LAN infrastructure. You can use a SAN to
connect many different devices and hosts to provide access to any device from anywhere. Direct
attached storage (DAS) technologies, such as SCSI, are tuned to the specific requirements of
connecting mass storage devices to host computers. In particular, they are low latency, high
bandwidth connections with extremely high data integrity semantics. Network technology, on the
other hand, is tuned more to providing application-to-application connectivity in increasingly
complex and large-scale environments. Typical network infrastructures have high connectivity, can
route data across many independent network segments, potentially over very large distances, and
have many network management and troubleshooting tools.
SANs capitalize on the best of the storage technologies and network technologies to provide a low-
latency, high-bandwidth interconnect that can span large distances, offers high connectivity, and
provides a good management infrastructure from the start.
These features enable you to implement Failover Clustering Technologies. Without the ability to
connect multiple hosts to a collection of storage, clustering would not be possible.
A SAN environment provides the following benefits:
● Centralization of storage into a single pool enables storage resources and server resources to
grow independently. It also enables storage to be dynamically assigned from the pool as and
when it is required. Storage on a given server can be increased or decreased as needed
without complex reconfiguring or re-cabling of devices.
● Common infrastructure for attaching storage enables a single common management model for
configuration and deployment.
● Storage devices are inherently shared by multiple systems. Ensuring data integrity guarantees
and enforcing security policies for access rights to a given device is a core part of the
infrastructure.
● Data can be transferred directly from device to device without server intervention. For
example, data can be moved from a disk to a tape without first being read into the memory of
a backup server. This frees up compute cycles for business logic rather than management
related tasks. Direct device-to-device transfer is also used in Geo-Cluster solutions where
SAN replication is needed to keep multiple sites synchronized from a data perspective.
● It enables clusters to be built where shared access to a data set is required. Consider a
clustered Microsoft SQL Server environment. At any point in time, a SQL Server instance
may be hosted on one computer in the cluster and it must have exclusive access to its
associated database on a disk from the node on which it is hosted. In the event of a failure or
an explicit management operation, the SQL Server instance may fail over to another node in
the cluster. After it fails over, the SQL Server instance must be able to have exclusive access
to the database on disk from its new host node.
The following sections discuss the different aspects of a SAN that should be evaluated for your
cluster solution. Diligence and time spent on this evaluation can increase the reliability and
availability of your cluster solution, providing a strong foundation component for your
implementation.
While Fibre Channel is by far the leading technology today, other SAN technologies are available,
for example, SCSI over InfiniBand and iSCSI (the SCSI protocol running over a standard IP
network). All these technologies allow a pool of devices to be accessed from a set of servers,
decoupling the compute needs from the storage needs.
In a network attached storage (NAS) solution, by contrast, the file servers hide the details of how data is stored on disks and present a
high-level file system view to application servers. In a NAS environment, the file servers provide
file system management functions such as the ability to back up a file server.
A SAN provides a broad range of advantages over locally connected devices. It enables computers
to be detached from storage units, providing flexible deployment and re-purposing of servers and
storage to suit current business needs. In a SAN environment, you do not have to be concerned
about buying the right devices for a given server, or with re-cabling a data center to attach storage
to a specific server.
Microsoft supports SANs, both as part of the base Microsoft Windows platform and as part of a
complete Windows Server Clustering high-availability solution. Multiple server clusters can be
deployed in a single SAN environment, along with stand-alone Windows servers and with
non-Windows–based platforms.
Storage Topologies
There are four types of storage I/O technologies supported in Windows Server 2003 server clusters:
iSCSI, parallel SCSI, Serial Attached SCSI (SAS), and Fibre Channel.
Note Microsoft Windows Server 2003 provides support for SCSI interconnects and Fibre Channel
arbitrated loops (FC-AL) for two nodes only. For configurations with more than two nodes, you
need to use a switched Fibre Channel (FC-SW) environment, called a fabric, or an iSCSI solution.
As a clustering administrator, you should be aware of the following implementation issues with
regard to SANs:
● iSCSI
○ Physical Components
iSCSI uses standard IP networks to transfer block-level data between computer systems
and storage devices. Unlike Fibre Channel, iSCSI uses existing network infrastructure,
such as network adapters, network cabling, hubs, switches, routers, and supporting
software. Using network adapters rather than HBAs enables transfer of both SCSI block
commands and normal messaging traffic. This gives iSCSI an advantage over Fibre
Channel network configurations, which require use of both HBAs and network adapters
to accommodate both types of traffic. While this is not a problem for large servers, thin
servers can accommodate only a limited number of interconnects.
iSCSI is based on the serial SCSI standards, and can operate over existing Category 5, or
higher, copper Ethernet cable or fiber optic wiring.
○ Transfer Protocols
iSCSI describes the transport protocol for carrying SCSI commands over TCP/IP. TCP
handles flow control and facilitates reliable transmission of data to the recipient by
providing guaranteed in-order delivery of the data stream. IP routes packets to the
destination network. These protocols ensure that data is correctly transferred from
requesting applications, called initiators, to target devices.
Transmissions across Category 5 network cabling are at speeds up to 1 gigabit per
second. Error rates on gigabit Ethernet are in the same low range as Fibre Channel.
○ Limitations
The amount of time it takes to queue, transfer, and process data across the network is
called latency. A drawback of transmitting SCSI commands across a TCP/IP network is
that latency is higher than it is on Fibre Channel networks, in part because of the
overhead associated with TCP/IP protocols. Additionally, many currently deployed
Ethernet switches were not designed with the low latency specifications associated with
Fibre Channel. Thus, although Ethernet cabling is capable of high speeds, the actual
speed of transmission may be lower than expected, particularly during periods of
network congestion.
The second concern about iSCSI transmission is data integrity, both in terms of errors
and security. Error handling is addressed at each protocol level. Protection against tampering
or snooping as the data passes over networks can be handled by implementing the IP
security protocol (IPsec).
○ Simplify Implementation
iSCSI enables you to create storage networks from existing network components, rather
than having to add a second network fabric type. This simplifies hardware
configurations because Ethernet switches can be used. It also enables using existing
security methods, such as firewalls and IP security, which includes encryption,
authentication, and data integrity measures.
Sharing storage among multiple systems requires a method for managing storage access
so that systems access only the storage that is assigned to them. In Fibre Channel
networks, this is done by assigning systems to zones. In iSCSI, this can be done by using
virtual LAN (VLAN) techniques. In Fibre Channel, LUN masking must be used to
provide finer granularity of storage access. For iSCSI, this is handled as part of the
design by allowing targets to be specific to individual hosts.
The use of IP traffic prioritization or Quality of Service (QoS) can help ensure that
storage traffic has the highest priority on the network, which helps to alleviate latency
issues.
○ Enable Remote Capabilities
iSCSI is not limited to the metropolitan-area distances that constrain Fibre Channel.
iSCSI storage networks can be LANs, MANs, or WANs, allowing global distribution.
iSCSI has the ability to eliminate the conventional boundaries of storage networking,
enabling businesses to access data worldwide, and ensuring the most robust disaster
protection possible. To do this for Fibre Channel based SANs, it is necessary to
introduce additional protocol translations, such as Fibre Channel over IP (FCIP), and devices that
provide this capability on each end of the SAN links.
○ Simplify Clustering
When multiple servers share access to the same storage, as is done with Server Clusters,
configuration of Fibre Channel SANs can be very difficult—one improperly configured
system affects the entire SAN. iSCSI clusters, unlike Fibre Channel clusters, do not
require complex configurations. Instead, iSCSI configuration is easily accomplished as
part of the iSCSI protocol, with little need for intervention by system administrators.
Changes introduced by hardware replacement are largely transparent on iSCSI but are a
major source of errors on Fibre Channel implementations.
● Parallel SCSI
○ Supported in Windows Server 2003 Enterprise Edition, and only for up to two nodes.
○ SCSI adapters and storage solutions need to be certified.
○ SCSI cards that host the interconnect, or shared disks, should have different SCSI IDs,
which are normally 6 and 7. Ensure that device access requirements are in line with SCSI
IDs and priorities.
○ The SCSI adapter BIOS should be disabled.
○ If devices are daisy-chained, ensure that both ends of the shared bus are terminated.
○ Use physical terminating devices and not controller-based or device-based termination.
○ SCSI hubs are not supported.
○ Avoid the use of connector converters, such as 68-pin to 50-pin.
○ Avoid combining multiple device types, such as single ended and differential.
● Fibre Channel
○ Fibre Channel Arbitrated Loops (FC-AL) are supported up to two nodes.
○ Fibre Channel Fabric (FC-SW) is supported for all higher node combinations.
○ Components and configuration need to be on the Windows Server Catalog.
○ Supports a multi-cluster environment.
○ Fault tolerant multi-path I/O drivers and components also need to be certified.
○ Virtualization engines need to be certified.
Note The switch is the only component that is not currently certified by Microsoft. It is
recommended that the end user get the appropriate interoperability guarantees from the switch
vendor before implementing switch fabric topologies. In complicated topologies, where multiple
switches are used and connected through inter-switch links (ISLs), it is recommended that the customer work closely
with Microsoft and the switch and storage vendors during the implementation phase to ensure that
all of the components work well together.
iSCSI Basics
An iSCSI-based network consists of:
● Server and storage device end nodes.
● Either network interface cards or HBAs with iSCSI over TCP/IP capability on the server.
● Storage devices with iSCSI-enabled Ethernet connections, called iSCSI Targets.
● iSCSI storage switches and routers.
● Clients attached to the same network with iSCSI initiator software installed.
Most current SANs use Fibre Channel technology, making it necessary to use multi-protocol
bridges, or storage routers capable of translating iSCSI to Fibre Channel. This enables iSCSI-
connected hosts to communicate with existing Fibre Channel devices.
Storage traffic is commonly initiated by a host computer and received by the target storage device.
Since target devices can have multiple storage devices associated with them (each one being a
logical unit), the final destination of the data is not the target per se, but specific logical units
within the target.
iSCSI, or Internet SCSI, is an IP-based storage networking standard developed by the IETF. By
enabling block-based storage over IP networks, iSCSI enables storage management over greater
distances than Fibre Channel technologies, and at lower cost. The Microsoft iSCSI initiator service
helps to bring the advantages of high-end SAN solutions to small and midsized businesses. The
service enables a full range of solutions, from low-end network, adapter-only implementations to
high-end offloaded configurations that can rival Fibre Channel solutions.
iSCSI Protocol
The iSCSI protocol stack links SCSI commands for storage and IP protocols for networking to
provide an end-to-end protocol for transporting commands and block-level data down through the
host initiator layers and up through the stack layers of the target storage devices. This
communication is fully bidirectional, as shown in Figure 4, where the arrows indicate the
communication path between the initiator and the target by means of the network.
Figure 4. iSCSI Protocol Stack Layers
The initiator, usually a server, makes the application requests. These are converted by the SCSI
class driver to SCSI commands, which are transported in command descriptor blocks (CDBs). At
the iSCSI protocol layer, the SCSI CDBs, under control of the iSCSI device driver, are packaged
in a protocol data unit (PDU) which now carries additional information, including the logical unit
number of the destination device. The PDU is passed on to TCP/IP, which encapsulates the PDU.
TCP/IP then passes it to IP, which adds the routing address of the final device destination. Finally,
the network layer, typically Ethernet, adds information and sends the packet across the physical
network to the target storage device.
Additional PDUs are used for target responses and for the actual data flow. Data for write requests
flows from the initiator to the target and is encapsulated by the initiator; data for read requests
flows from the target to the initiator, and the target does the encapsulation.
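To make the layering concrete, the following Python sketch builds a simplified SCSI command PDU around a CDB and hands it to a TCP socket. The field layout is loosely modeled on the iSCSI Basic Header Segment, but this is an illustrative sketch rather than a complete RFC 3720 implementation; the target address is hypothetical, and a real exchange would be preceded by an iSCSI login phase.

```python
import socket
import struct

def build_iscsi_command_pdu(lun: int, task_tag: int, cdb: bytes,
                            expected_length: int, cmd_sn: int) -> bytes:
    """Build a simplified SCSI Command PDU (illustrative, not full RFC 3720)."""
    opcode = 0x01                      # SCSI Command opcode
    flags = 0x80 | 0x40                # Final bit plus Read flag, for example
    cdb = cdb.ljust(16, b"\x00")[:16]  # the CDB field in the header is 16 bytes
    header = struct.pack(
        ">BB2xI8sIIII16s",             # big-endian, 48-byte simplified header
        opcode, flags,
        0,                             # AHS / data segment lengths (none here)
        struct.pack(">Q", lun),        # 8-byte LUN field
        task_tag,                      # initiator task tag
        expected_length,               # expected data transfer length
        cmd_sn,                        # command sequence number
        0,                             # ExpStatSN placeholder
        cdb,
    )
    return header

def send_read_capacity(target_ip: str, target_port: int = 3260) -> None:
    """Open a TCP connection (normally preceded by an iSCSI login) and send one PDU."""
    cdb = bytes([0x25]) + bytes(9)     # READ CAPACITY(10) CDB, 10 bytes
    pdu = build_iscsi_command_pdu(lun=0, task_tag=1, cdb=cdb,
                                  expected_length=8, cmd_sn=1)
    with socket.create_connection((target_ip, target_port)) as sock:
        sock.sendall(pdu)              # TCP/IP supplies ordering, routing, delivery

# Hypothetical target address; uncomment only against a real, logged-in session.
# send_read_capacity("192.0.2.10")
```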
Discovery
SANs can become very large and complex. While using pooled storage resources is a desirable
configuration, initiators must be able to determine both the storage resources available on the
network, and whether or not access to that storage is permitted. A number of discovery methods
are possible, and to some degree, the method used depends on the size and complexity of the SAN
configuration.
● Administrator Control
For simple SAN configurations, you can manually specify the target node name, IP address,
and port to the initiator and target devices. If any changes occur on the SAN, you must
update these names as well.
● SendTargets
A second option for small storage networks is for the initiator to use the SendTargets operation
to discover targets. The address of a target portal is manually configured and the initiator
establishes a discovery session to perform the SendTargets command. The target device
responds by sending a complete list of additional targets that are available to the initiator.
This method is semi-automated, which means that the administrator might still be required to
enter a range of target addresses.
● SLP
A third method is to use the Service Location Protocol (SLP). Early versions of this protocol
did not scale well to large networks. In the attempt to rectify this limitation, a number of
agents were developed to help discover targets, making discovery management
administratively complex.
● iSNS
The Internet Storage Name Service (iSNS) is a relatively new device discovery protocol,
ratified by the IETF, that provides both naming and resource discovery services for storage
devices on the IP network. iSNS builds upon both IP and Fibre Channel technologies.
The protocol uses an iSNS server as the central location for tracking information about targets
and initiators. The server can run on any host, target, or initiator on the network. iSNS client
software is required in each host initiator or storage target device to enable communication
with the server. In the initiator, the iSNS client registers the initiator and queries the list of
targets. In the target, the iSNS client registers the target with the server.
iSNS provides the following capabilities:
○ Name Registration Service
This enables initiators and targets to register and query the iSNS server directory for
information regarding initiator and target ID and addresses.
○ Network Zoning and Logon Control Service
iSNS initiators can be restricted to zones so that they are prevented from discovering
target devices outside their discovery domains. This prevents initiators from accessing
storage devices that are not intended for their use. Logon control allows targets to
determine which initiators can access them.
○ State Change Notification Service
This service enables iSNS to notify clients of changes in the network, such as the
addition or removal of targets, or changes in zoning membership. Only initiators that are
registered to receive notifications will get these packets, reducing random broadcast
traffic on the network.
From its inception, iSNS was designed to be scalable, working effectively in both
centralized and distributed environments. iSNS also supports Fibre Channel IP, enabling
configurations that link Fibre Channel and iSCSI to use iSNS to get information from Fibre
Channel networks as well. Hence, iSNS can act as a unifying protocol for discovery.
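As an illustration of the directory behavior just described, the following Python sketch models an iSNS-style name service in memory: initiators and targets register, and a query returns only the targets that share a discovery domain with the querying initiator. The class, names, and addresses are hypothetical; this is a conceptual sketch, not the iSNS wire protocol.

```python
from collections import defaultdict

class NameService:
    """In-memory sketch of an iSNS-style directory with discovery domains."""

    def __init__(self):
        self.targets = {}                    # target name -> portal address
        self.domains = defaultdict(set)      # domain name -> registered node names

    def register_target(self, name, address, domain):
        self.targets[name] = address
        self.domains[domain].add(name)

    def register_initiator(self, name, domain):
        self.domains[domain].add(name)

    def query_targets(self, initiator):
        """Return only targets that share a discovery domain with the initiator."""
        visible = {}
        for members in self.domains.values():
            if initiator in members:
                for node in members:
                    if node in self.targets:
                        visible[node] = self.targets[node]
        return visible

# Example: the initiator sees the array in its own domain, but not the other one.
isns = NameService()
isns.register_target("iqn.2003-01.com.example:array1", "192.0.2.10:3260", "domain-a")
isns.register_target("iqn.2003-01.com.example:array2", "192.0.2.20:3260", "domain-b")
isns.register_initiator("iqn.2003-01.com.example:host1", "domain-a")
print(isns.query_targets("iqn.2003-01.com.example:host1"))   # only array1 is returned
```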
Session Management
For the initiator to transmit information to the target, the initiator must first establish a session with
the target through an iSCSI logon process. This process starts the TCP/IP connection, verifies that
the initiator has access to the target (authentication), and allows negotiation of various parameters
including the type of security protocol to be used, and the maximum data packet size. If the logon
is successful, an ID is assigned to both initiator (an initiator session ID, or ISID) and target (a
target session ID, or TSID). Thereafter, the full feature phase, which allows for reading and writing
of data, can begin. Multiple TCP connections can be established between each initiator-target
pair, allowing unrelated transactions during one session. Sessions between the initiator and its
storage devices generally remain open, but logging out is available as an option.
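The following sketch illustrates the kind of parameter negotiation that takes place during login: each side offers values and the session settles on parameters both ends can support. The key names are common iSCSI login keys, but the selection rules shown here are simplified for illustration and do not reproduce the full RFC 3720 negotiation semantics.

```python
def negotiate(initiator_offer: dict, target_offer: dict) -> dict:
    """Pick session parameters both ends can support (simplified illustration)."""
    settled = {}
    # Numeric limits: settle on the smaller of the two offers.
    for key in ("MaxRecvDataSegmentLength", "MaxBurstLength"):
        settled[key] = min(initiator_offer[key], target_offer[key])
    # List-valued keys: first initiator preference the target also supports.
    for key in ("AuthMethod", "HeaderDigest"):
        common = [value for value in initiator_offer[key] if value in target_offer[key]]
        settled[key] = common[0] if common else "None"
    return settled

session = negotiate(
    {"MaxRecvDataSegmentLength": 262144, "MaxBurstLength": 262144,
     "AuthMethod": ["CHAP", "None"], "HeaderDigest": ["CRC32C", "None"]},
    {"MaxRecvDataSegmentLength": 65536, "MaxBurstLength": 262144,
     "AuthMethod": ["CHAP"], "HeaderDigest": ["None"]},
)
print(session)   # an ISID/TSID pair would be assigned once the logon succeeds
```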
Error Handling
While iSCSI can be deployed over gigabit Ethernet, which has low error rates, it is also designed
to run over both standard IP networks and WANs, which have higher error rates. WANs are
particularly error-prone since the possibility of errors increases with distance and the number of
devices across which the information must travel. Errors can occur at a number of levels, including
the iSCSI session level (connection to host lost), the TCP connection level (TCP connection lost),
and the SCSI level (loss or damage to PDU).
Error recovery is enabled through initiator and target buffering of commands and responses. If the
target does not acknowledge receipt of the data because it was lost or corrupted, the buffered data
can be resent by the initiator, a target, or a switch.
iSCSI session recovery, which is necessary if the connection to the target is lost due to network
problems or protocol errors, can be reestablished by the iSCSI initiator. The initiator attempts to
reconnect to the target, continuing until the connection is reestablished.
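A minimal sketch of the recovery idea described above, assuming hypothetical send and reconnect callbacks: the initiator keeps every outstanding command buffered until it is acknowledged, and after a reconnect it simply re-issues whatever is still pending.

```python
import time

class InitiatorQueue:
    """Buffer outstanding commands so they can be re-sent after a failure."""

    def __init__(self, send_fn, reconnect_fn):
        self.send_fn = send_fn            # callable that transmits one command
        self.reconnect_fn = reconnect_fn  # callable that re-establishes the session
        self.pending = {}                 # task tag -> command bytes

    def submit(self, task_tag, command):
        self.pending[task_tag] = command  # keep a copy until acknowledged
        self.send_fn(task_tag, command)

    def acknowledge(self, task_tag):
        self.pending.pop(task_tag, None)  # target confirmed receipt; drop the copy

    def recover(self, max_attempts=5, delay_seconds=2.0):
        """Re-establish the session and retransmit everything still pending."""
        for attempt in range(1, max_attempts + 1):
            if self.reconnect_fn():
                for task_tag, command in sorted(self.pending.items()):
                    self.send_fn(task_tag, command)
                return True
            time.sleep(delay_seconds)     # back off before trying again
        return False
```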
Security
Security is critically important because iSCSI operates in the Internet environment. The IP protocol
itself does not authenticate legitimacy of the data source (sender), and it does not protect the
transferred data. iSCSI, therefore, requires strict measures to ensure security across IP networks.
The iSCSI protocol specifies the use of IP security (IPsec) to ensure that:
● The communicating end points (initiator and target) are authentic.
● The transferred data has been secured through encryption and is thus kept confidential.
● Data integrity is maintained without modification by a third party.
● Data is not processed more than once, even if it has been received multiple times.
The Internet Key Exchange (IKE) protocol can assist with key exchanges, a necessary part of the
IPsec implementation.
iSCSI also requires that the Challenge Handshake Authentication Protocol (CHAP) be
implemented to further authenticate end node identities. Other optional authentication protocols
include Kerberos (such as the Windows implementation), which is a highly scalable option.
Even though the standard requires that these protocols be implemented, there is no such
requirement to use them in an iSCSI network. Before implementing iSCSI, you should review the
security measures to make sure that they are appropriate for the intended use and configuration of
the iSCSI storage network.
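As an illustration of the CHAP exchange, the one-way response defined in RFC 1994 (which iSCSI reuses) is an MD5 hash over the challenge identifier, the shared secret, and the challenge. The sketch below shows that calculation end to end; the secret is hypothetical, and a real deployment would rely on the initiator and target implementations rather than hand-rolled code.

```python
import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """CHAP response = MD5(identifier || secret || challenge), per RFC 1994."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# Target side: issue a challenge.
identifier = 0x01
challenge = os.urandom(16)
secret = b"example-shared-secret"          # configured on both initiator and target

# Initiator side: compute and return the response.
response = chap_response(identifier, secret, challenge)

# Target side: recompute and compare to authenticate the initiator.
assert response == chap_response(identifier, secret, challenge)
```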
Performance
SCSI over TCP/IP can suffer performance degradation, especially in high-traffic settings, if the
host CPU is responsible for processing TCP/IP. This performance limitation is especially dramatic
when compared with Fibre Channel, which does not have TCP/IP overhead.
Point-to-Point
Point-to-point Fibre Channel is a simple way to connect two (and only two) devices directly
together, as shown in Figure 5. It is the Fibre Channel equivalent of direct attached storage (DAS).
From a cluster and storage infrastructure perspective, point-to-point is not a scalable enterprise
configuration.
Arbitrated Loops
A Fibre Channel arbitrated loop is exactly what it says; it is a set of hosts and devices that are
connected into a single loop, as shown in Figure 6. It is a cost-effective way to connect up to 126
devices and hosts into a single network.
Figure 6. Fibre Channel Arbitrated Loop
Note An FC-AL can support up to 126 devices and hosts on a single loop, plus one connection to a
switched fabric. The 8-bit loop address field allows 256 values, but only 127 of them are valid loop
addresses, which is what imposes the 126-device limit; a switched fabric environment does not have
this limitation.
Devices on the loop share the media; each device is connected in series to the next device in the
loop and so on around the loop. Any packet traveling from one device to another must pass
through all intermediate devices. In Figure 6, for host A to communicate with device D, all traffic
between the devices must flow through the adapters on host B and device C. The devices in the
loop do not need to look at the packet; they will simply pass it through. This is all done at the
physical layer by the Fibre Channel interface card itself; it does not require processing on the host
or the device. This is very analogous to the way a token-ring topology operates.
When a host or device wants to communicate with another host or device, it must first arbitrate for
the loop. The initiating device does this by sending an arbitration packet around the loop that
contains its own loop address.
The arbitration packet travels around the loop and when the initiating device receives its own
arbitration packet back, the initiating device is considered to be the loop owner. The initiating
device next sends an open request to the destination device, which sets up a logical point-to-point
connection between the initiating device and target. The initiating device can then send as much
data as required before closing down the connection. All intermediate devices pass the data
through. There is no limit on the length of time for any given connection and therefore other
devices wanting to communicate must wait until the data transfer is completed and the connection
is closed before they can arbitrate.
If multiple devices or hosts want to communicate at the same time, each device or host sends out
an arbitration packet that travels around the loop. If an arbitrating device receives an arbitration
packet from a different device before it receives its own packet back, it knows there has been a
collision.
In this case, the device with the lowest loop address is declared the winner and is considered the
loop owner. There is a fairness algorithm built into the standard that prohibits a device from
re-arbitrating until all other devices have been given an opportunity; however, this is an optional
part of the standard.
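The short simulation below sketches the arbitration rule just described: each device that wants the loop sends an arbitration request, colliding requests are resolved in favor of the lowest loop address, and the winner opens a connection and transfers its frames while intermediate devices pass them through. It is purely illustrative and ignores timing, the fairness window, and the actual FC-AL priority encoding.

```python
def arbitrate(requesting_addresses):
    """Return the winning loop address when several devices arbitrate at once."""
    if not requesting_addresses:
        return None
    # When arbitration packets collide, the lowest loop address wins the loop.
    return min(requesting_addresses)

def transfer(winner, target, data_frames):
    """The loop owner opens a point-to-point connection and sends its frames."""
    print(f"Device {winner:#04x} opens a connection to {target:#04x}")
    for frame in data_frames:
        # Intermediate devices simply pass the frames through at the physical layer.
        print(f"  frame: {frame}")
    print("  connection closed; other devices may now arbitrate")

winner = arbitrate({0x23, 0x45, 0x01})   # devices 0x01, 0x23, and 0x45 all arbitrate
transfer(winner, target=0x45, data_frames=["READ", "DATA", "STATUS"])
```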
Note Not all devices and HBAs support loop configurations because it is an optional part of the
Fibre Channel standard. However, for a loop to operate correctly, all devices on the loop must have
arbitrated loop support. Figure 7 shows a schematic of the wiring for a simple arbitrated loop
configuration.
Communication in an arbitrated loop can occur in both directions on the loop depending on the
technology used to build the loop. In some cases communication can occur both ways
simultaneously.
Although a loop can support up to 126 devices, as the number of devices on the arbitrated loop
increases, so does the length of the path. This increases the latency of individual operations.
Many loop devices, such as JBODs, have dipswitches to set the device address on the loop, known
as hard addressing. Most, if not all, devices implement hard addresses, so it is possible to assign a
loop ID to a device. However, just as in a SCSI configuration, different devices must have unique
hard IDs. In cases where a device on the loop already has a conflicting address when a new device
is added, the new device either picks a different ID or it does not get an ID at all.
Note Most of the current FC-AL devices are configured automatically to avoid any address
conflicts. However, if a conflict does happen then it can lead to I/O disruptions or failures.
Unlike many bus technologies, the devices on an arbitrated loop do not have to be given fixed
addresses either by software configuration or via hardware switches. When the loop initializes,
each device on the loop must obtain an Arbitrated Loop Physical Address, which is dynamically
assigned. This process is initiated when a host or device sends out a Loop Initialization Primitive
(LIP), which accomplishes similar functionality to a parallel-SCSI Bus Reset; a master is
dynamically selected for the loop and the master controls a well-defined process where each device
is assigned an address.
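The sketch below illustrates the outcome of loop initialization as described above: devices that request a hard address keep it if it is free, everything else receives one of the remaining soft addresses, and a device is left without an address if the pool is exhausted. The 126-address pool matches the FC-AL limit; the function and device names are illustrative.

```python
def initialize_loop(devices):
    """Assign loop addresses after a LIP.

    `devices` maps a device name to its requested hard address (or None).
    Returns a mapping of device name to assigned address (or None if none left).
    """
    pool = list(range(1, 127))        # 126 assignable loop addresses
    assigned = {}

    # First pass: honor hard addresses that do not conflict.
    for name, wanted in devices.items():
        if wanted is not None and wanted in pool:
            assigned[name] = wanted
            pool.remove(wanted)

    # Second pass: everything else takes a free soft address, if any remain.
    for name, wanted in devices.items():
        if name not in assigned:
            assigned[name] = pool.pop(0) if pool else None
    return assigned

print(initialize_loop({"host-a": 5, "jbod-1": 5, "tape-1": None}))
# host-a keeps 5, jbod-1 is moved to another address, tape-1 gets a soft address
```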
A LIP is generated by a device or host when the adapter is powered up or when a loop failure is
detected, such as loss of carrier. Unfortunately, this means that when new devices are added to a
loop or when devices on the loop are power-cycled, all the devices and hosts on the loop can
change their physical addresses. LIPs are also issued when the Cluster Server needs to break
reservations to disks and take ownership when nodes are arbitrating for disks.
For these reasons, arbitrated loops provide a solution for small numbers of hosts and devices in
relatively static configurations.
Switched Fabric
When a host or device is powered on, it must first log in to the fabric. This enables the device to
determine the type of fabric (there is a set of characteristics about what the fabric will support) and
it causes a host or device to be given a fabric address. A given host or device continues to use the
same fabric address while it is logged into the fabric and the fabric address is guaranteed to be
unique for that fabric. When a host or device wants to communicate with another device, it must
establish a connection to that device before transmitting data in a way similar to the arbitrated
loop. However, unlike the arbitrated loop, the connection open packets and the data packets are
sent directly from the source to the target. In this case, the switches take care of routing the packets
in the fabric.
Fibre Channel fabrics can be extended in different ways, such as by federating or cascading
switches. As a result, switched fabrics provide a much more scalable SAN environment for large
configurations than is possible using an arbitrated loop configuration.
You can deploy Fibre Channel arbitrated loop configurations within larger switched SANs. Many
switches incorporate functionality to enable arbitrated loop or point-to-point devices to be connected
to any given port. The ports can typically sense whether the device is a loop device or not and adapt the
protocols and port semantics accordingly. This enables platforms, specific host adapters, or devices
that only support arbitrated loop configurations today, to be attached to switched SAN fabrics.
FC-AL
Pros:
● Low cost
● Loops are easily expanded and combined, with up to 126 hosts and devices
● Easy for vendors to develop
Cons:
● More complex to deploy
● Maximum of 126 devices
● Devices share media, thus lower overall bandwidth
Switched Fabric
Pros:
● Easy to deploy
● Supports 16 million hosts and devices
● Communication at full wire speed; no shared media
● Switches provide fault isolation and re-routing
Cons:
● Increased development complexity
● Interoperability issues between components from different vendors
● Higher cost
Note Be sure to select the correct HBA for the topology that you are using. Although some
switches can auto-detect the type of HBA in use, using the wrong HBA in a topology can cause
many stability issues in the storage fabric.
Hubs
Hubs are the simplest form of Fibre Channel devices and are used to connect devices and hosts
into arbitrated loop (FC-AL) configurations. Hubs typically have 4, 8, 12, or 16 ports, allowing up
to 16 devices and hosts to be attached. However, the bandwidth on a hub is shared by all devices
on the hub. In addition, hubs are typically half-duplex, although newer full duplex hubs are
becoming available. In other words, communication between devices or hosts on a hub can only
occur in one direction at a time. Because of these performance constraints, hubs are typically used
in small and low bandwidth configurations.
Figure 9 below shows two hosts and two storage devices connected to the hub with the arrows
showing the physical loop provided by the hub.
Figure 9. FC-AL Hub Configuration
A typical hub detects empty ports on the hub and does not configure them into the loop. Some
hubs provide higher levels of control over how the ports are configured and when devices are
inserted into the loop.
Switches
A switch is a more complex storage fabric device that provides the full Fibre Channel bandwidth to
each port independently, as shown in Figure 10. Typical switches enable ports to be configured in
either an arbitrated loop or a switched mode fabric.
When a switch is used in an arbitrated loop configuration, the ports are typically full bandwidth
and bi-directional, enabling devices and hosts to communicate at full Fibre Channel speed in both
directions. In this mode, ports are configured into a loop, providing a high-performance arbitrated
loop configuration.
Switches are the basic infrastructure used for large, point-to-point, switched fabrics. In this mode, a
switch enables any device to communicate directly with any other device at full Fibre Channel
speed, currently 1 Gbit/sec or 2 Gbit/sec.
Figure 10. Switched Fibre Configuration
Switches typically support 16, 32, 64, or 128 ports, enabling complex fabric configurations. In
addition, switches can be connected together in a variety of ways to provide larger configurations
that consist of multiple switches. Several manufacturers, such as Brocade, Cisco, and McData,
provide a range of switches for different deployment configurations, from very high performance
switches that can be connected together to provide a core fabric to edge switches that connect
servers and devices with less intensive requirements.
Figure 11 below shows how switches can be interconnected to provide a scalable storage fabric
supporting many hundreds of devices and hosts.
The core backbone of the SAN fabric is provided by high performance and high port density
switches. The inter-switch bandwidth in the core is typically 8Gbit/sec and above. Large data
center class machines and large storage pools can be connected directly to the backbone for
maximum performance. Servers and storage with lower performance requirements, such as
departmental servers, may be connected via large arrays of edge switches, each of which may have
16 to 64 ports.
Storage Controller
In its most basic form, a storage controller is a box that houses a set of disks and provides a
connection, which may be redundant and highly available, to a SAN fabric. Typically, disks in this
type of controller appear as individual devices that map directly to the individual spindles housed
in the controller. This is known as a Just a Bunch of Disks (JBOD) configuration. The controller
provides no added value; it is just a concentrator that connects multiple devices to a single fabric
switch port, or to a small number of ports for high availability.
Modern controllers usually provide some level of redundancy for data. For example, many
controllers offer a wide variety of RAID levels, such as RAID 1, RAID 5, RAID 0+1, and many
other algorithms to ensure data availability in the event of the failure of an individual disk drive. In
this case, the hosts do not see devices that correspond directly to the individual spindles. Rather,
the controller presents a virtual view of highly available storage devices, called logical devices, to
the hosts.
In Figure 13, although there are five physical disk drives in the storage cabinet, only two logical
devices are visible to the hosts and can be addressed through the storage fabric. The controller does
not expose the physical disks themselves.
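As a simple illustration of this presentation model, the sketch below carves a pool of physical spindles into two logical devices and reports only the logical view, much as the controller in Figure 13 hides its five disks behind two logical devices. The RAID 1 and RAID 5 capacity formulas are standard; the class and its names are hypothetical.

```python
class StorageController:
    """Present logical devices to hosts while hiding the physical spindles."""

    def __init__(self, spindle_count, spindle_gb):
        self.free_spindles = spindle_count
        self.spindle_gb = spindle_gb
        self.logical_devices = {}            # LUN -> usable capacity in GB

    def create_logical_device(self, lun, raid_level, spindles):
        if spindles > self.free_spindles:
            raise ValueError("not enough free spindles")
        if raid_level == 1:                  # mirrored: half the raw capacity
            usable = spindles * self.spindle_gb // 2
        elif raid_level == 5:                # one spindle's worth of parity
            usable = (spindles - 1) * self.spindle_gb
        else:
            raise ValueError("only RAID 1 and RAID 5 are modeled here")
        self.free_spindles -= spindles
        self.logical_devices[lun] = usable

    def report_luns(self):
        """What a host sees through the fabric: LUNs, never spindles."""
        return {lun: f"{gb} GB" for lun, gb in self.logical_devices.items()}

controller = StorageController(spindle_count=5, spindle_gb=146)
controller.create_logical_device(lun=0, raid_level=1, spindles=2)   # mirrored pair
controller.create_logical_device(lun=1, raid_level=5, spindles=3)   # 3-disk RAID 5
print(controller.report_luns())              # {0: '146 GB', 1: '292 GB'}
```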
Many controllers are capable of connecting directly to a switched fabric. However, the disk drives
themselves are typically either SCSI or are disks that have a built-in FC-AL interface.
In Figure 14, the storage infrastructure that the disks connect to is completely independent from the
infrastructure presented to the storage fabric.
A highly available storage controller requires at least two ports for connection to the Fibre Channel
fabric; controllers usually have only a small number of such ports. The logical devices
themselves are exposed through the controller ports as LUNs.
Highly available SAN deployments also rely on hardware serviceability features, for example:
● Built-in hot-swap and hot-plug for all components from HBAs to switches and controllers.
Many high-end switches and most, if not all, enterprise-class storage controllers enable
interface cards, memory, CPU, and disk drives to be hot-swapped.
There are various SAN designs that have different performance and availability characteristics.
Different switch vendors provide different levels of support and different topologies. However,
most of the topologies are derived from standard network topology design. The topologies include:
● Multiple independent fabrics
● Federated fabrics
● Core backbone
Multiple Independent Fabrics
Listed below are the pros and cons of multiple independent fabrics:
● Pros
○ Resilient to management or user errors. For example, if security is changed or zones are
deleted, the configuration on the alternate fabric is untouched and can be re-applied to the
broken fabric.
● Cons
○ Managing multiple independent fabrics can be costly and error prone. Each fabric should
have the same zoning and security information to ensure a consistent view of the fabric
regardless of the communication port chosen.
○ Hosts and devices must have multiple adapters. In the case of a host, multiple adapters
are typically treated as different storage buses. Additional multipathing software, such as
Microsoft MPIO, is required to ensure that the host gets a single view of the devices
across the two HBAs.
Federated Fabrics
In a federated fabric, multiple switches are connected together, as shown in Figure 16. Individual
hosts and devices are connected to at least two switches.
Figure 16. Federated Switches for Single Fabric View
Listed below are the pros and cons of federated fabrics:
● Pros
○ Management is simplified; the configuration is a highly available single fabric, and
therefore there is only one set of zoning information and one set of security information
to manage.
○ The fabric itself can route around failures such as link failures and switch failures.
● Cons
○ Hosts with multiple adapters must run additional multipathing software to ensure that the
host gets a single view of the devices where there are multiple paths from the HBAs to
the devices.
○ Management errors are propagated to the entire fabric.
Core Backbone
A core backbone configuration enables you to scale-out a federated fabric environment. Figure 17
shows a backbone configuration. The core of the fabric is built using highly scalable, high
performance switches where the inter-switch connections provide high performance
communication, currently at speeds of 8-10 Gbit/sec.
Figure 17. Backbone Configuration
Redundant edge switches can be cascaded from the core infrastructure to provide high numbers of
ports for storage and host devices. Listed below are the pros and cons of a backbone
configuration:
● Pros
○ Highly scalable and available SAN configuration.
○ Management is simplified; the configuration is a highly available single fabric.
Therefore, there is only one set of zoning information and one set of security
information to manage.
○ The fabric itself can route around failures, such as link failures and switch failures.
● Cons
○ Hosts with multiple adapters must run additional multipathing software to ensure that the
host gets a single view of the devices where there are multiple paths from the HBAs to
the devices.
Fabric Management
SANs are becoming increasingly complex, and large configurations are becoming more and more
common. While SANs certainly provide many benefits over direct-attached storage, the key issue is
how to manage this complexity.
A storage fabric can have many devices and hosts attached to it. With all of the data stored in a
single, ubiquitous cloud of storage, controlling which hosts have access to what data is extremely
important. It is also important that the security mechanism be an end-to-end solution so that badly
behaved devices or hosts cannot circumvent security and access unauthorized data.
The most common methods to accomplish this are:
● Zoning
● LUN Masking
Zoning
You can attach multiple devices and nodes to a SAN. When data is stored in a single cloud, or
storage entity, it is important that you control which hosts have access to specific devices. Zoning
controls access from one node to another. Zoning enables you to isolate a single server to a group of
storage devices or a single storage device, or to associate a group of multiple servers with one or
more storage devices, as might be needed in a server cluster deployment.
The basic premise of zoning is to control who can see what in a SAN. There are a number of
approaches broken down according to server, storage and switch.
On any server, there are various mechanisms to control which devices an application can see and
whether or not the application can talk to another device. At the lowest level, the firmware or
driver of an HBA has a masking capability to control whether or not the server can see other
devices. In addition, you can configure the operating system to control which devices it tries to
mount as a storage volume. Finally, you can also use extra layered software for volume
management, clustering, and file system sharing, which can also control application access.
For storage-based zoning, if you ignore JBODs and the earlier RAID subsystems, most disk arrays
provide a form of selective presentation. The array is configured with a list of which servers can
access which LUNs on which ports, and it simply ignores or rejects access requests from
devices that are not in those lists.
In terms of switch zoning, most Fibre Channel switches support some form of zoning to control
which devices on which ports can access other devices or ports.
Ideally, you should use a combination of both approaches. You should, using some operating
system or software capability, control what devices or LUNs are mounted on the server. Do not use
a mount-all approach. Use selective presentation on the storage array, and use zoning in the fabric.
Hard Zoning
Hard zoning is a mechanism, implemented at the switch level, which provides an isolation
boundary. A port, which may be either a host adapter port or a storage controller port, can be
configured as part of a zone. Only ports in a given zone can communicate with other ports in that zone. The
zoning is configured and access control is implemented by the switches in the fabric, so a host
adapter cannot spoof the zones that it is in and gain access to data for which it has not been
configured.
Some switches cannot do hard zoning at all. Some can, although not to the granularity of
individual ports and only with many restrictions. Other switches can do hard zoning only if all the
zones use the port-ID syntax, which is why port-ID zoning is often assumed to be the same as hard
zoning. Still other switches can now do hard zoning of zones that use either port-ID or WWN
syntax.
Figure 18. Zoning
In Figure 18, hosts A and B can access data from storage controller S1. However, host C cannot
access the data because it is not in Zone A. Host C can only access data from storage S2.
Many current switches allow overlapping zones. This enables a storage controller to reside in
multiple zones, enabling the devices in that controller to be shared amongst different servers in
different zones, as shown in Figure 19. Finer granularity access controls are required to protect
individual disks against access from unauthorized servers in this environment.
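The sketch below models the access rule illustrated in Figure 18: two ports can communicate only if at least one zone contains both of them, and a storage port may appear in more than one zone where overlapping zones are supported. The zone layout mirrors the figure; the data structures are illustrative and do not represent any particular switch interface.

```python
# Zone membership as configured on the fabric switches (mirrors Figures 18 and 19).
zones = {
    "zone_a": {"host_a", "host_b", "storage_s1"},
    "zone_b": {"host_c", "storage_s2"},
}

def can_communicate(port_1: str, port_2: str) -> bool:
    """Two ports may talk only if some zone contains both of them."""
    return any(port_1 in members and port_2 in members for members in zones.values())

print(can_communicate("host_a", "storage_s1"))   # True  - both are in zone_a
print(can_communicate("host_c", "storage_s1"))   # False - host_c is only in zone_b
print(can_communicate("host_c", "storage_s2"))   # True  - both are in zone_b
```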
Soft Zoning
Soft zones are a software-based zoning method. Unlike hard zones, which are implemented at a
hardware level and have finite boundaries, a soft zone is based on the World Wide Name (WWN).
A WWN is a 64-bit identifier for a device or port.
All zoning can be implemented in either hardware or software. Software zoning is done by the
name server or other fabric access software. When a host tries to open a connection to a device,
access controls can be checked at that time.
Zoning is not only a security feature, but also limits the traffic flow within a given SAN
environment. Traffic between ports is only routed to those pieces of the fabric that are in the same
zone. With modern switches, as new switches are added to an existing fabric, the new switches are
automatically updated with the current zoning information.
I/Os from hosts or devices in a fabric cannot leak out and affect other zones in the fabric causing
noise or cross talk between zones. This is fundamental to deploying Server clusters on a SAN.
A virtual SAN (VSAN) is a higher-level construct with a separate name server database, rather
than one database that is common to all zones. It may even run as a separate service within the
switch, so the possibility of cross-contamination is lower and problems are more localized.
Of course, there will still be problems if a device is connected to two separate VSANs, because the
device could behave unexpectedly and can potentially bring down both VSANs.
Note, however, that a standards-based management system might use the Fibre Channel unzoned
name server query to identify all the devices on the fabric, regardless of zone membership.
LUN Masking
While zoning provides a high-level security infrastructure in the storage fabric, it does not provide
the fine-grain level of access control needed for large storage devices. In a typical environment, a
storage controller may have many gigabytes or terabytes of storage to be shared amongst a set of
servers.
Storage controllers typically provide LUN-level access controls that enable an administrator to
restrict access to a given LUN to one or more hosts. By providing this access control at the storage
controller, the controller itself can enforce access policies to the data.
LUN masking, performed at the storage controller level, enables you to define relationships
between LUNs and individual servers. Storage controllers usually provide the means for creating
LUN-level access controls that allow access to a given LUN by one or more hosts. By providing
this access control at the storage controller, the controller itself enforces access policies to the
devices. LUN masking provides more granular security than zoning, because LUNs provide a
means for sharing storage at the port level.
LUN masking, performed at the host level, hides specific LUNs from applications. Although the
HBA and the lower layers of the operating system have access to and could communicate with a
set of devices, LUN masking prevents the higher layers from knowing that the device exists and
therefore applications cannot use those devices. LUN masking is a policy-driven software security
and access control mechanism enforced at the host. For this policy to be successful, the
administrator has to trust the drivers and the operating systems to adhere to the policies.
LUN masking is a SAN security technique that accomplishes a similar goal to zoning, but in a
different way. To understand this process, you must know that an initiator, which is typically a
server or workstation, begins a transaction with a target, which is typically a storage device such as
a tape or disk array, by generating an I/O command. A logical unit in the SCSI-based target
executes the I/O commands. A LUN, then, is a SCSI identifier for the logical unit within a target.
In Fibre Channel SANs, LUNs are assigned based on the WWNs of the devices and components.
LUNs represent physical storage components, including disks and tape drives.
In LUN masking, LUNs are assigned to host servers; the server can see only the LUNs that are
assigned to it. If multiple servers or departments are accessing a single storage device, LUN
masking enables you to limit the visibility of these servers or departments to a specific LUN, or
multiple LUNS, to help ensure security.
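A minimal sketch of the masking behavior described above, as it might be enforced at a storage controller: each initiator WWN maps to the set of LUNs it is allowed to see, LUN reports are filtered through that map, and commands addressed to a masked LUN are rejected. The WWNs and LUN numbers are hypothetical.

```python
# Masking table kept by the storage controller: initiator WWN -> visible LUNs.
lun_masks = {
    "50:06:01:60:90:20:1e:aa": {0, 1},       # cluster node 1
    "50:06:01:60:90:20:1e:ab": {0, 1},       # cluster node 2
    "50:06:01:60:90:20:1e:ff": {4},          # unrelated backup server
}

all_luns = {0, 1, 2, 3, 4}

def report_luns(initiator_wwn: str) -> set:
    """Return only the LUNs the initiator has been granted; hide everything else."""
    return all_luns & lun_masks.get(initiator_wwn, set())

def check_io(initiator_wwn: str, lun: int) -> bool:
    """Reject commands addressed to a LUN the initiator is not allowed to use."""
    return lun in lun_masks.get(initiator_wwn, set())

print(report_luns("50:06:01:60:90:20:1e:aa"))    # {0, 1}
print(check_io("50:06:01:60:90:20:1e:ff", 0))    # False - masked out
```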
You can implement LUN masking at various locations within the SAN, including storage arrays,
bridges, routers, and HBAs.
When LUN masking is implemented in an HBA, software on the server and firmware in the HBA
limit the addresses from which commands are accepted. You can configure the HBA device driver
to restrict visibility to specific LUNs. One characteristic of this technique is that its boundaries are
essentially limited to the server in which the HBA resides.
You can also implement LUN masking in a RAID subsystem, where typically a disk controller
orchestrates the operation of a set of disk drives. In this scenario, the subsystem maintains a table
of port addresses via the RAID subsystem controller. This table indicates which addresses are
allowed to issue commands to specific LUNs. In addition, certain LUNs are masked out so that the
storage controller does not present them to specific servers. This form of LUN masking extends to
the subsystem in which the mapping is executed.
If a RAID system does not support LUN masking, you can implement this functionality by using a
bridge or router placed between the servers and the storage devices and subsystems. In this case,
you can configure the system so that only specific servers can see certain LUNs.
When properly implemented, LUN masking fully isolates servers and storage from events such as
resets. It is important to thoroughly test your design and implementation of LUN masking,
especially if you use LUN masking in server clusters.
As with all things, the approach you use depends as much on how you operate as on your
technology. Consider the technology and the implementation carefully.
Although zoning is not the answer to all your problems, it is a vital part of storage provisioning. It
is recommended to implement zoning even if it seems like overkill for a small SAN. After you get
going in the right direction, it will be easier to continue with a robust and reliable approach.
SAN Management
SAN management is a huge topic on its own and is outside the scope of this training. Different
vendors provide a wide range of tools for setting up, configuring, monitoring, and managing the
SAN fabric, as well as the state of devices and hosts on the fabric.
Note Certain applications, such as Exchange and SQL Server, used in clusters only support
Synchronous data replication. Ensure that the application in question supports the method of
replication you choose.
However, such solutions have limited scaling capacities and any additional complexities, such
as storage replication and recovery mechanisms, must be addressed by the hardware vendors.
Some server clusters carry a feature that enables the boot disk, pagefile disks, and the cluster
disks to be hosted on the same channel. There are other performance and operational
implications that need to be considered before implementation.
See Microsoft Knowledge Base article Q305547: Support for Booting from a Storage Area
Network (SAN), which discusses this feature.
Note Enabling Kerberos on a network name has a number of implications that you should ensure
you fully understand before checking the box.
All cluster node computer accounts, as well as the virtual server computer account, must be trusted
for delegation. See online help for how to do this.
To ensure that the user’s private keys are available to all nodes in the cluster, you must enable
roaming profiles for users who want to store data using EFS. See online help for how to enable
roaming profiles.
After the cluster file shares have been created and the configuration steps above have been carried
out, user data can be stored in encrypted files for added security.
Disk Quotas
Configuring disk quotas on shared disks is fully supported.
Autochk/Chkdsk/Chkntfs
Every time Windows restarts, autochk.exe is called to scan all volumes to check if the volume dirty
bit is set. If the dirty bit is set, autochk performs an immediate chkdsk /f on that volume to repair
any potential corruption. Chkdsk is a native Windows tool that can determine the extent of file and
file system corruption. If Chkdsk runs in write-mode, it will automatically attempt to remedy disk
corruption.
The Chkntfs.exe utility is designed to disable the automatic running of chkdsk on specific volumes
when Windows restarts from an improper shutdown. Chkntfs can also be used to unschedule a
chkdsk if chkdsk /f was used to schedule one on an active volume at the next system restart.
All of the above are supported to run in specific configurations in Server Clusters. The relevant KB
articles that explain the procedures in detail are:
● Q174617 - Chkdsk runs while running Microsoft Cluster Server Setup
● Q176970 - How to Run the CHKDSK /F Command on a Shared Cluster Disk
● Q160963 - CHKNTFS.EXE: What You Can Use It For
● File System
○ Cluster server only supports the NTFS file system on cluster disks. This ensures that file
protection can be used to protect data on the cluster disks. Since the cluster disks can
fail over between nodes, you must only use domain user accounts, Local System,
Network Service, or Local Service to protect files. Local user accounts on one machine
have no meaning on other machines in the cluster.
● Disk Health Checks
○ Cluster disks are periodically checked to make sure that they are healthy. The cluster
service account must have write access to the top-level directory of all cluster disks. If
the cluster account does not have write access, the disk may be declared as failed.
● Quorum Disk
○ The quorum disk health determines the health of the entire cluster. If the quorum disk
fails, the cluster service will become unavailable on all cluster nodes. The cluster service
checks the health of the quorum disk and arbitrates for exclusive access to the physical
drive using standard I/O operations. These operations are queued to the device along
with any other I/Os to that device. If the cluster service I/O operations are delayed by
extremely heavy traffic, the cluster service will declare the quorum disk as failed and
force a regroup to bring the quorum back online somewhere else in the cluster. To
protect against malicious applications flooding the quorum disk with I/Os, the quorum
disk should be a dedicated disk that is not used by other applications that generate heavy
I/O. Access to the quorum disk should be restricted to the local Administrator group and
the cluster service account.
○ If the quorum disk fills up, the cluster service may be unable to log required data. In this
case, the cluster service will fail, potentially on all cluster nodes. To protect against
malicious applications filling up the quorum disk, access should be restricted to the local
Administrator group and the cluster service account.
○ For both reasons above, do not use the quorum disk to store other application data.
● Cluster Data Disks
○ As with the quorum disk, other cluster disks are periodically checked using the same
technique. When securing clustered data disks, the Cluster Service Account (CSA) still needs
to have privileges on the disk, or the health checks will not be able to complete and
the Cluster service will assume the disk has failed.
Must Do
● Each cluster on a SAN must be deployed in its own zone. The cluster uses mechanisms to
protect access to the disks that can have an adverse effect on other clusters in the same
zone. By using zoning to separate the cluster traffic from other cluster or non-cluster
traffic, there is no chance of interference. Figure 21 shows two clusters sharing a single
storage controller. Each cluster is in its own zone. The LUNs presented by the storage
controller must be allocated to individual clusters using the fine-grained security provided
by the storage controller itself. LUNs must be set up as visible to all nodes in the cluster,
and a given LUN should be visible to only a single cluster.
Figure 21. Clusters assigned to individual zones
● All HBAs in a single cluster must be of the same type and at the same firmware revision
level. Many storage and switch vendors require that all HBAs in the same zone, and in
some cases in the same fabric, are of the same type and have the same firmware revision.
● All storage device drivers and HBA device drivers in a cluster must be at the same
software version.
● SCSI bus resets are not used on a Fibre Channel arbitrated loop. They are interpreted by
the HBA and driver software and cause a loop initialization primitive (LIP) to be sent,
which resets all devices on the loop.
● When adding a new server to a SAN, ensure that the HBA is appropriate for the topology.
In some configurations, adding an arbitrated loop HBA to a switched Fibre Channel fabric
can result in widespread failures of the storage fabric. There have been real-world
examples of this causing serious downtime.
● The cluster software ensures that access to devices that can be reached by multiple hosts
in the same cluster is controlled and that only one host actually mounts a given disk at any
one time. When first creating a cluster, make sure that only one node can access the disks
that are to be managed by the cluster. This can be done either by leaving the other
prospective cluster members powered off, or by using access controls or zoning to stop
the other hosts from accessing the disks. After a single-node cluster has been created, the
disks marked as cluster-managed are protected; the other hosts can then be booted, or the
disks made visible to them, so that they can be added to the cluster. This is no different
from any cluster configuration that has disks accessible from multiple hosts.
Must Not Do
● You must never allow multiple hosts access to the same storage devices unless they are in
the same cluster. If multiple hosts that are not in the same cluster can access a given disk,
it will lead to data corruption.
● You must never put any non-disk device into the same zone as cluster disk storage
devices.
Other Hints
Highly available systems such as clustered servers should typically be deployed with multiple
HBAs and a highly available storage fabric. In these cases, be sure to always load the multipath
driver software. Without it, if the I/O subsystem in the Windows Server 2003 platform sees two
HBAs, it assumes they are separate buses and enumerates all the devices on each as though they
were distinct devices, when the host is actually seeing multiple paths to the same disks.
Many controllers provide snapshots at the controller level that can be exposed to the cluster as a
completely separate LUN. The cluster does not react well to multiple devices having the same
signature. If the snapshot is exposed back to the host with the original disk online, the base I/O
subsystem will re-write the signature. However, if the snapshot is exposed to another node in the
cluster, the cluster software will not recognize it as a different disk. Do not expose a hardware
snapshot of a clustered disk back to a node in the same cluster. While this is not specifically a SAN
issue, the controllers that provide this functionality are typically deployed in a SAN environment.
Note: Some controllers use a cluster resource type other than Physical Disk; you need to create a
resource of the appropriate type for such environments. Only basic, MBR-format disks that contain
at least one NTFS partition can be managed by the cluster, and a disk must be formatted before it
is added.
Remember that the same rules apply when adding disks as when creating a cluster: if multiple
nodes can see a disk before any node in the cluster is managing it, data corruption will result.
When adding a new disk, first make the disk visible to only one cluster node. After the disk is
added as a cluster resource, make the disk visible to the other cluster nodes.
To remove a disk from a cluster, first remove the cluster resource corresponding to that disk. After
it is removed from the cluster, the disk can be removed, either physically or through deletion and
re-assignment of the LUN.
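These add and remove steps can also be scripted with the cluster.exe command-line tool. The following is a hedged sketch, assuming a resource named Disk Z: in a group named Cluster Group (both names are placeholders, not values from this document); associating the resource with the correct LUN, through its disk signature private property, is typically done in Cluster Administrator:
cluster res                                                                 (list existing resources and their state)
cluster res "Disk Z:" /create /group:"Cluster Group" /type:"Physical Disk"  (create the physical disk resource)
cluster res "Disk Z:" /online                                               (bring the new disk resource online)
cluster res "Disk Z:" /offline                                              (take the disk resource offline before removal)
cluster res "Disk Z:" /delete                                               (remove the resource; the LUN can then be unpresented)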
There are several KB articles on replacing a cluster-managed disk. While disks in a cluster should
typically be RAID sets or mirror sets, there are sometimes issues that cause catastrophic failures
leading to a disk having to be rebuilt from the ground up. There are also cases where cluster disks
are not redundant and failure of those disks also leads to a disk having to be replaced. The steps
outlined in those articles should be used if you need to rebuild a LUN due to failures. You can
refer to the following KB articles:
● Q243195 - Replacing a cluster managed disk in Windows NT 4.0
● Q280425 – Recovering from an Event ID 1034 on a Server Cluster
● Q280425 – Using ASR to replace a disk in Windows Server 2003
Expanding Disks
In Windows Server 2003, you can expand volumes dynamically without requiring a reboot. You
can use the Diskpart tool to expand volumes dynamically; Diskpart is included with Windows
Server 2003 and is available for Windows 2000 as a download from www.microsoft.com.
As long as the underlying disk subsystem, primarily the SAN, supports dynamic expansion of a
LUN, and the resulting free space is visible immediately after the partition in Disk Management,
diskpart can be used to move the end of the partition anywhere within that free space.
The diskpart command used to expand a volume is:
extend
Extends the volume with focus into the next contiguous unallocated space. For basic volumes, the
unallocated space must be on the same disk as the partition with focus, and it must follow (be at a
higher sector offset than) that partition. A dynamic simple or spanned volume can be extended to
any empty space on any dynamic disk. Using this command, you can extend an existing volume
into newly created space.
If the partition was previously formatted with the NTFS file system, the file system is
automatically extended to occupy the larger partition. No data loss occurs. If the partition was
previously formatted with any file system format other than NTFS, the command fails with no
change to the partition.
You cannot extend the current system or boot partitions.
Syntax
extend [size=N] [disk=N] [noerr]
extend filesystem [noerr]
Parameters
size=N
The amount of space in megabytes (MB) to add to the current partition. If no size is given, the
volume is extended to take up all of the next contiguous unallocated space.
disk=N
The dynamic disk on which the volume is extended. An amount of space equal to size=N is
allocated on the disk. If no disk is specified, the volume is extended on the current disk.
filesystem
For use only on disks where the file system was not extended with the volume. Extends the file
system of the volume with focus so that the file system occupies the entire volume.
noerr
For scripting only. When an error is encountered, DiskPart continues to process commands as if
the error did not occur. Without the noerr parameter, an error causes DiskPart to exit with an
error code.
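A minimal sketch of extending a volume interactively follows (the volume number and size are placeholders, not values from this document); the same commands can be placed in a text file and run unattended with diskpart /s scriptfile.txt:
diskpart
DISKPART> list volume            (identify the volume that has contiguous free space following it)
DISKPART> select volume 3        (give focus to the volume to be extended)
DISKPART> extend size=1024       (add 1024 MB; omit size= to use all contiguous unallocated space)
DISKPART> exit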
See the following link for more extensive documentation on the diskpart syntax:
http://technet2.microsoft.com/WindowsServer/en/library/ca099518-dde5-4eac-a1f1-38eff6e3e5091033.mspx?mfr=true
○ For device removal, the device receives an alert and the driver determines that the device
can be removed. When an adapter is removed, the multipath bus driver must remove any
child devices in the stack as appropriate.
● Dynamic Load Balancing
The multipath software supports the ability to distribute I/O transactions across multiple
adapters. The device-specific module is responsible for the load-balancing policy for its storage
device. For example, this policy can enable the device-specific module to recommend which path
to take for better performance, based upon the current characteristics of the connected SAN.
● Fault Tolerance
Multipath software can also function in a fault-tolerant mode in which only a single channel is
active at any given moment. Should a failure in that channel be detected by the MPIO
subsystem, disk I/O can be transparently routed to the inactive channel without any interruption
being detected by the OS. Some MPIO solutions can operate in a hybrid mode where both
channels are active and a failure of either channel reroutes I/O to the other.
Architecture
The primary goal of the MPIO architecture is to ensure correct operation to a disk device that can
be reached via more than one physical path. When the operating system determines that there are
multiple paths to a disk device, it uses the multi-path drivers to correctly initialize and then use
one of the physical paths at the right time. At a high level, the MS MPIO architecture
accomplishes this by use of pseudo device objects within the operating system that replace the
physical and functional device objects involved.
In a multipath configuration, multiple HBAs can see the same logical disk. The MS MPIO
architecture creates a pseudo HBA that acts in their place whenever an operation is attempted on
a disk device that can be reached via either of the HBAs. This is similar in concept to NIC teaming.
The MS MPIO architecture also creates pseudo logical disk devices, one for each logical disk in
the system that can be reached via more than one path. The drivers place these pseudo logical
devices in the I/O stack and prevent direct I/O to the physical devices involved.
When an I/O operation is sent to one of these pseudo devices, the MS MPIO architecture makes a
determination based upon interaction with the DSM on which physical path to use. The I/O
directed to the pseudo device is instead directed to one of the physical pathways for completion.
Additional References
The following Microsoft Knowledge Base articles provide additional information about storage:
● Q174617: Chkdsk runs while running Microsoft Cluster Server Setup
● Q176970: How to Run the CHKDSK /F Command on a Shared Cluster Disk
● Q160963: CHKNTFS.EXE: What You Can Use It For
● Q293778: Multiple-Path Software May cause Disk Signature to Change.
● Q243195: Replacing a cluster managed disk in Windows NT 4.0
Review
Topics discussed in this session include:
● Storage Area Network terms and technologies
● iSCSI
● Microsoft disk features in a clustered environment
● Multipath I/O