Isilon OneFS
Abstract
This white paper provides technical details on the key features and
capabilities of the Dell EMC Isilon OneFS operating system that is used to
power all Dell EMC Isilon scale-out NAS storage solutions.
December 2019
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this
publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright © Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Other trademarks may be trademarks of their respective owners.
Intended Audience
This paper presents information for deploying and managing a Dell EMC Isilon cluster and provides a comprehensive background to
the Isilon OneFS architecture.
The target audience for this white paper is anyone configuring and managing an Isilon clustered storage environment. It is assumed
that the reader has a basic understanding of storage, networking, operating systems, and data management.
More information on OneFS commands and feature configuration is available in the OneFS Administration Guide.
OneFS overview
OneFS combines the three layers of traditional storage architectures—file system, volume manager, and data protection—into one
unified software layer, creating a single intelligent distributed file system that runs on an Isilon storage cluster.
This is the core innovation that directly enables enterprises to successfully utilize scale-out NAS in their environments today. OneFS
adheres to the key principles of scale-out: intelligent software, commodity hardware, and a distributed architecture. OneFS is not only the
operating system but also the underlying file system that drives and stores data in the Isilon scale-out NAS cluster.
Isilon nodes
OneFS works exclusively with Isilon scale-out NAS nodes, a group of which is referred to as a “cluster”. A single Isilon cluster consists of multiple
nodes, which are rack-mountable enterprise appliances containing memory, CPU, networking (Ethernet or low-latency InfiniBand
interconnects), disk controllers, and storage media. As such, each node in the distributed cluster provides both compute and storage
capacity.
With the new generation of Isilon hardware (‘Gen6’), a single chassis of 4 nodes in a 4U form factor is required to create a cluster,
which scales up to 252 nodes in OneFS 8.2 and later. Previous Isilon hardware platforms need a minimum of three nodes and 6U of
rack space to form a cluster. There are several different types of nodes, all of which can be incorporated into a single cluster, where
different nodes provide varying ratios of capacity to throughput or input/output operations per second (IOPS).
Each node or chassis added to a cluster increases aggregate disk, cache, CPU, and network capacity. OneFS leverages each of the
hardware building blocks, so that the whole becomes greater than the sum of the parts. The RAM is grouped together into a single
coherent cache, allowing I/O on any part of the cluster to benefit from data cached anywhere. A file system journal ensures that writes
are safe across power failures. Spindles and CPU are combined to increase throughput, capacity and IOPS as the cluster grows,
for access to one file or for multiple files. A cluster’s storage capacity can range from a minimum of 18 terabytes (TB) to a maximum of
greater than 58 petabytes (PB). The maximum capacity will continue to increase as storage media and node chassis continue to get
denser.
Isilon nodes are broken into several classes, or tiers, according to their functionality:
Network
There are two types of networks associated with a cluster: internal and external.
Back-end network
All intra-node communication in a cluster is performed across a dedicated backend network, comprising either 10 or 40 Gb Ethernet, or
low-latency QDR InfiniBand (IB). This back-end network, which is configured with redundant switches for high availability, acts as the
backplane for the cluster. This enables each node to act as a contributor in the cluster and isolates node-to-node communication to a
private, high-speed, low-latency network. This back-end network utilizes Internet Protocol (IP) for node-to-node communication.
Front-end network
Clients connect to the cluster using Ethernet connections (1GbE, 10GbE or 40GbE) that are available on all nodes. Because each
node provides its own Ethernet ports, the amount of network bandwidth available to the cluster scales linearly with performance and
capacity. The Isilon cluster supports standard network communication protocols to a customer network, including NFS, SMB, HTTP,
FTP, HDFS, and OpenStack Swift. Additionally, OneFS provides full integration with both IPv4 and IPv6 environments.
Figure 2 depicts the complete architecture: software, hardware, and network all working together in your environment with servers to
provide a completely distributed single file system that can scale dynamically as workloads and capacity needs or throughput needs
change in a scale-out environment.
Isilon SmartConnect is a load balancer that works at the front-end Ethernet layer to evenly distribute client connections across the
cluster. SmartConnect supports dynamic NFS failover and failback for Linux and UNIX clients and SMB3 continuous availability for
Windows clients. This ensures that when a node failure occurs, or preventative maintenance is performed, all in-flight reads and writes
are handed off to another node in the cluster to finish their operations without any user or application interruption.
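As a conceptual illustration only (not the SmartConnect implementation), the following Python sketch shows how a front-end balancer might answer each new client connection with the node currently holding the fewest connections, and then redistribute connections when a node is taken offline. The node names and client identifiers are hypothetical.

    # Hypothetical sketch of connection balancing across cluster nodes.
    # This is not SmartConnect's actual algorithm; it only illustrates the idea
    # of spreading client connections and draining a node for maintenance.

    class ConnectionBalancer:
        def __init__(self, nodes):
            # Track active client connections per node, e.g. {"node1": 0, ...}
            self.connections = {node: 0 for node in nodes}

        def assign(self, client):
            # Hand the new client to the least-loaded node.
            node = min(self.connections, key=self.connections.get)
            self.connections[node] += 1
            return node

        def drain(self, node):
            # Node failure or planned maintenance: move its connections elsewhere.
            moved = self.connections.pop(node)
            for _ in range(moved):
                self.assign("failed-over-client")
            return moved

    balancer = ConnectionBalancer(["node1", "node2", "node3"])
    for i in range(9):
        balancer.assign(f"client{i}")
    balancer.drain("node2")          # simulate taking node2 offline
    print(balancer.connections)      # remaining nodes absorb the load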
Client services
The front-end protocols that clients can use to interact with OneFS are referred to as client services. Please refer to the Supported
Protocols section for a detailed list of supported protocols. To understand how OneFS communicates with clients, we split the
I/O subsystem into two halves: the top half, or ‘initiator’, and the bottom half, or ‘participant’. Every node in the cluster is a
participant for a particular I/O operation. The node that the client connects to is the initiator, and that node acts as the ‘captain’ for the
entire I/O operation. The read and write operations are detailed in later sections.
Cluster operations
In a clustered architecture, there are cluster jobs that are responsible for taking care of the health and maintenance of the cluster
itself—these jobs are all managed by the OneFS job engine. The job engine runs across the entire cluster and is responsible for
dividing and conquering large storage management and protection tasks. To achieve this, it reduces a task into smaller work items and
then allocates, or maps, these portions of the overall job to multiple worker threads on each node. Progress is tracked and reported on
throughout job execution and a detailed report and status is presented upon completion or termination.
Job Engine includes a comprehensive check-pointing system which allows jobs to be paused and resumed, in addition to stopped and
started. The Job Engine framework also includes an adaptive impact management system.
The Job Engine typically executes jobs as background tasks across the cluster, using spare or specially reserved capacity and
resources. The jobs themselves can be categorized into three primary classes:
• ChangelistCreate: Creates a list of changes between two consecutive SyncIQ snapshots. (Access method: Tree)
• Collect: Reclaims disk space that could not be freed because a node or drive was unavailable while it suffered from a failure condition. (Access method: Drive + LIN)
• FlexProtect: Rebuilds and re-protects the file system to recover from a failure scenario. (Access method: Drive + LIN)
• FSAnalyze: Gathers file system analytics data that is used in conjunction with InsightIQ. (Access method: LIN)
• IntegrityScan: Performs online verification and correction of any file system inconsistencies. (Access method: LIN)
• QuotaScan: Updates quota accounting for domains created on an existing directory path. (Access method: Tree)
• SmartPools: Moves data between the tiers of nodes within the same cluster. (Access method: LIN)
• SnapshotDelete: Frees disk space that is associated with deleted snapshots. (Access method: LIN)
• TreeDelete: Deletes a path in the file system directly from the cluster itself. (Access method: Tree)
Although the file system maintenance jobs are run by default, either on a schedule or in reaction to a particular file system event, any
job engine job can be managed by configuring both its priority-level (in relation to other jobs) and its impact policy.
An impact policy can consist of one or many impact intervals, which are blocks of time within a given week. Each impact interval can
be configured to use a single pre-defined impact-level which specifies the amount of cluster resources to use for a particular cluster
operation. Available job engine impact-levels are:
• Paused
• Low
• Medium
• High
This degree of granularity allows impact intervals and levels to be configured per job, in order to ensure smooth cluster operation, and
the resulting impact policies dictate when a job runs and the resources that a job can consume.
Additionally, job engine jobs are prioritized on a scale of one to ten, with a lower value signifying a higher priority. This is similar in
concept to the UNIX scheduling utility, ‘nice’.
The job engine allows up to three jobs to be run simultaneously. This concurrent job execution is governed by the following criteria:
• Job Priority
• Exclusion Sets - jobs which cannot run together (for example, FlexProtect and AutoBalance)
• Cluster health - most jobs cannot run when the cluster is in a degraded state.
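To make the interplay of priority, exclusion sets, and the three-job concurrency limit concrete, the following Python sketch selects which queued jobs may start. It is an illustration only, not the Job Engine's actual scheduler; the job names mirror those in the table above, and the exclusion set shown is just the FlexProtect/AutoBalance example.

    # Illustrative scheduler sketch: pick up to three runnable jobs, honoring
    # priority (lower value = higher priority) and exclusion sets whose
    # members may not run at the same time. Not the actual OneFS Job Engine.

    MAX_CONCURRENT = 3
    EXCLUSION_SETS = [{"FlexProtect", "AutoBalance"}]   # example from the text

    def schedule(queued, running):
        """Return the jobs to start, given (name, priority) tuples in 'queued'."""
        started = []
        for name, priority in sorted(queued, key=lambda job: job[1]):
            active = running | {job for job, _ in started}
            if len(active) >= MAX_CONCURRENT:
                break
            # Skip a job if a member of the same exclusion set is already active.
            conflict = any(name in s and active & (s - {name}) for s in EXCLUSION_SETS)
            if not conflict:
                started.append((name, priority))
        return started

    queued = [("FlexProtect", 1), ("AutoBalance", 4), ("SmartPools", 6), ("FSAnalyze", 8)]
    print(schedule(queued, running=set()))
    # FlexProtect starts first; AutoBalance is held back by the exclusion set,
    # so SmartPools and FSAnalyze fill the remaining two slots.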
Because all information is shared among nodes across the internal network, data can be written to or read from any node, thus
optimizing performance when multiple users are concurrently reading and writing to the same set of data.
OneFS is truly a single file system with one namespace. Data and metadata are striped across the nodes for redundancy and
availability. The storage has been completely virtualized for the users and administrator. The file tree can grow organically without
requiring planning or oversight about how the tree grows or how users use it. No special thought has to be applied by the administrator
about tiering files to the appropriate disk, because Isilon SmartPools will handle that automatically without disrupting the single tree. No
special consideration needs to be given to how one might replicate such a large tree, because the Isilon SyncIQ service automatically
parallelizes the transfer of the file tree to one or more alternate clusters, without regard to the shape or depth of the file tree.
This design should be compared with namespace aggregation, which is a commonly-used technology to make traditional NAS
“appear” to have a single namespace. With namespace aggregation, files still have to be managed in separate volumes, but a simple
“veneer” layer allows for individual directories in volumes to be “glued” to a “top-level” tree via symbolic links. In that model, LUNs and
volumes, as well as volume limits, are still present. Files have to be manually moved from volume-to-volume in order to load-balance.
The administrator has to be careful about how the tree is laid out. Tiering is far from seamless and requires significant and continual
intervention. Failover requires mirroring files between volumes, driving down efficiency and ramping up purchase cost, power and
cooling. Overall the administrator burden when using namespace aggregation is higher than it is for a simple traditional NAS device.
This prevents such infrastructures from growing very large.
Data layout
OneFS uses physical pointers and extents for metadata and stores file and directory metadata in inodes. OneFS logical inodes (LINs)
are typically 512 bytes in size, which allows them to fit into the native 512-byte sectors with which the majority of hard drives are formatted.
However, in OneFS 8.0 and onward, support is also provided for 8KB inodes, in order to support the denser classes of hard drive
which are now formatted with 4KB sectors.
B-trees are used extensively in the file system, allowing scalability to billions of objects and near-instant lookups of data or metadata.
OneFS is a completely symmetric and highly distributed file system. Data and metadata are always redundant across multiple
hardware devices. Data is protected using erasure coding across the nodes in the cluster; this creates a cluster that has high
efficiency, allowing 80% or better raw-to-usable capacity on clusters of five nodes or more. Metadata (which generally makes up less than 1%
of the system) is mirrored in the cluster for performance and availability. As OneFS is not reliant on RAID, the amount of redundancy is
selectable by the administrator, at the file- or directory-level beyond the defaults of the cluster. Metadata access and locking tasks are
managed by all nodes collectively and equally in a peer-to-peer architecture. This symmetry is key to the simplicity and resiliency of the
architecture. There is no single metadata server, lock manager or gateway node.
File writes
The OneFS software runs on all nodes equally - creating a single file system that runs across every node. No one node controls or
“masters” the cluster; all nodes are true peers.
If we were to look at all the components within every node of a cluster that are involved in I/O from a high-level, it would look like
Figure 6 above. We have split the stack into a “top” layer, called the Initiator, and a “bottom” layer, called the Participant. This division
is used as a “logical model” for the analysis of any one given read or write. At a physical-level, CPUs and RAM cache in the nodes are
simultaneously handling Initiator and Participant tasks for I/O taking place throughout the cluster. There are caches and a distributed
lock manager that are excluded from the diagram above to keep it simple. They will be covered in later sections of the paper.
When a client connects to a node to write a file, it is connecting to the top half or Initiator of that node. Files are broken into smaller
logical chunks called stripes before being written to the bottom half or Participant of a node (disk). Failure-safe buffering using a write
coalescer is used to ensure that writes are efficient and read-modify-write operations are avoided. The size of each file chunk is
referred to as the stripe unit size.
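As an illustration of this chunking idea, the Python sketch below divides a buffer of file data into fixed-size stripe units and deals them out to participant nodes. It is only a sketch, not the OneFS write path; the 128KB figure matches the stripe unit size discussed later in the compression section, and the round-robin placement is a simplification.

    # Sketch: split a file's data into stripe units and deal them out
    # round-robin to participant nodes. Purely illustrative.

    STRIPE_UNIT = 128 * 1024   # 128KB stripe unit, i.e. sixteen 8KB blocks

    def stripe(data, nodes):
        """Yield (node, stripe_unit) pairs for a buffer of file data."""
        for i in range(0, len(data), STRIPE_UNIT):
            unit = data[i:i + STRIPE_UNIT]
            yield nodes[(i // STRIPE_UNIT) % len(nodes)], unit

    file_data = bytes(400 * 1024)                # a 400KB example file
    layout = list(stripe(file_data, ["node1", "node2", "node3"]))
    print([(node, len(unit)) for node, unit in layout])
    # Three full 128KB units plus a 16KB remainder, spread across the nodes.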
Figure 7 below shows a file write happening across all nodes in a three-node cluster.
• Concurrency: Optimizes for current load on the cluster, featuring many simultaneous clients. This setting provides the best
behavior for mixed workloads.
• Streaming: Optimizes for high-speed streaming of a single file, for example to enable very fast reading with a single client.
• Random: Optimizes for unpredictable access to the file, by adjusting striping and disabling the use of any prefetch cache.
OneFS also includes real-time adaptive prefetch, providing the optimal read performance for files with a recognizable access pattern,
without any administrative intervention.
The largest file size that OneFS supports was increased to 16TB in OneFS 8.2.2, up from a maximum of 4TB in prior
releases.
OneFS caching
The OneFS caching infrastructure design is predicated on aggregating the cache present on each node in a cluster into one globally
accessible pool of memory. To do this, Isilon uses an efficient messaging system, similar to non-uniform memory access (NUMA). This
allows all the nodes’ memory cache to be available to each and every node in the cluster. Remote memory is accessed over an
internal interconnect and has much lower latency than accessing hard disk drives.
For remote memory access, OneFS utilizes a redundant, under-subscribed flat Ethernet network, as, essentially, a distributed system
bus. While not as fast as local memory, remote memory access is still very fast due to the low latency of 40Gb Ethernet.
The OneFS caching subsystem is coherent across the cluster. This means that if the same content exists in the private caches of
multiple nodes, this cached data is consistent across all instances. OneFS utilizes the MESI Protocol to maintain cache coherency.
This protocol implements an “invalidate-on-write” policy to ensure that all data is consistent across the entire shared cache.
OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or coalescer. These, and their high-level
interaction, are illustrated in the following diagram.
The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and analogous to the cache used in
processors (CPUs). These two cache layers are present in all Isilon storage nodes.
1. Node 1 and Node 5 each have a copy of data located at an address in shared cache.
4. Node 1 must re-read the data from shared cache to get the updated value.
The following diagram illustrates the various states that in-cache data can take, and the transitions between them. The various states
in the figure are:
• M – Modified: The data exists only in local cache and has been changed from the value in shared cache. Modified data is typically
referred to as dirty.
• E – Exclusive: The data exists only in local cache but matches what is in shared cache. This data is often referred to as clean.
• S – Shared: The data in local cache may also be in other local caches in the cluster.
• I – Invalid: The data in local cache is no longer valid; a current copy must be re-read from shared cache before use.
L1 cache coherency is managed via a MESI-like protocol using distributed locks, as described above.
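The following Python sketch is a simplified model of the invalidate-on-write behavior described above, tracking the M/E/S/I state of a single cached block in each node's local cache. It is a conceptual model only, not OneFS code, and the node names are hypothetical.

    # Simplified MESI-style model for a single cached block across nodes.
    # Demonstrates invalidate-on-write: a write in one cache invalidates
    # the copies held by the others. Conceptual only.

    MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

    class CacheLine:
        def __init__(self):
            self.state = INVALID
            self.value = None

    class Coherency:
        def __init__(self, node_names):
            self.caches = {n: CacheLine() for n in node_names}

        def read(self, node, backing_value):
            line = self.caches[node]
            if line.state == INVALID:            # miss: fetch from shared storage
                line.value = backing_value
                others = [c for n, c in self.caches.items()
                          if n != node and c.state != INVALID]
                line.state = SHARED if others else EXCLUSIVE
                for c in others:
                    c.state = SHARED
            return line.value

        def write(self, node, value):
            for n, c in self.caches.items():     # invalidate-on-write
                if n != node:
                    c.state, c.value = INVALID, None
            line = self.caches[node]
            line.state, line.value = MODIFIED, value

    coh = Coherency(["node1", "node5"])
    coh.read("node1", backing_value=42)
    coh.read("node5", backing_value=42)          # both now hold the block Shared
    coh.write("node5", 43)                       # node5 -> Modified, node1 -> Invalid
    print(coh.caches["node1"].state, coh.caches["node5"].state)   # prints: I M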
OneFS also uses a dedicated inode cache in which recently requested inodes are kept. The inode cache frequently has a large impact
on performance, because clients often cache data, and many network I/O activities are primarily requests for file attributes and
metadata, which can be quickly returned from the cached inode.
L1 cache is utilized differently in the Isilon Accelerator nodes, which don’t contain any disk drives. Instead, the entire read cache is
L1 cache, since all the data is fetched from other storage nodes. Also, cache aging is based on a least recently used (LRU) eviction
policy, as opposed to the drop-behind algorithm typically used in a storage node’s L1 cache. Because an accelerator’s L1 cache is
large and the data in it is much more likely to be requested again, data blocks are not immediately removed from cache upon use.
However, metadata- and update-heavy workloads don’t benefit as much, and an accelerator’s cache is only beneficial to clients directly
connected to the node.
Level 2 cache
The Level 2 cache (L2), or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 cache
is globally accessible from any node in the cluster and is used to reduce the latency of a read operation by not requiring a seek directly
from the disk drives. As such, the amount of data prefetched into L2 cache for use by remote nodes is much greater than that in L1
cache.
L2 cache is also known as local cache because it contains data retrieved from disk drives located on that node and then made
available for requests from remote nodes. Data in L2 cache is evicted according to a Least Recently Used (LRU) algorithm.
Data in L2 cache is addressed by the local node using an offset into a disk drive which is local to that node. Since the node knows
where the data requested by the remote nodes is located on disk, this is a very fast way of retrieving data destined for remote nodes. A
remote node accesses L2 cache by doing a lookup of the block address for a particular file object. As described above, there is no
MESI invalidation necessary here and the cache is updated automatically during writes and kept coherent by the transaction system
and NVRAM.
Level 3 cache
An optional third tier of read cache, called SmartFlash or Level 3 cache (L3), is also configurable on nodes that contain solid state
drives (SSDs). SmartFlash (L3) is an eviction cache that is populated by L2 cache blocks as they are aged out from memory. There
are several benefits to using SSDs for caching rather than as traditional file system storage devices. For example, when reserved for
caching, the entire SSD will be used, and writes will occur in a very linear and predictable way. This provides far better utilization and
also results in considerably reduced wear and increased durability over regular file system usage, particularly with random write
workloads. Using SSD for cache also makes sizing SSD capacity a much more straightforward and less error prone prospect
compared to using SSDs as a storage tier.
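As a conceptual sketch of the relationship between L2 and L3 described above (not the actual OneFS caching code), blocks aged out of a memory-based LRU cache can be captured by a larger, SSD-backed eviction cache:

    # Sketch of an L2-style LRU cache whose evicted blocks populate an
    # L3-style eviction cache, mirroring the behavior described in the text.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity, victim=None):
            self.capacity = capacity
            self.victim = victim          # where evicted entries go (e.g. L3)
            self.entries = OrderedDict()

        def get(self, block):
            if block in self.entries:
                self.entries.move_to_end(block)      # most recently used
                return self.entries[block]
            return None

        def put(self, block, data):
            self.entries[block] = data
            self.entries.move_to_end(block)
            if len(self.entries) > self.capacity:
                old_block, old_data = self.entries.popitem(last=False)
                if self.victim is not None:          # aged-out block drops to L3
                    self.victim.put(old_block, old_data)

    l3 = LRUCache(capacity=1000)                     # SSD-backed, much larger
    l2 = LRUCache(capacity=2, victim=l3)             # RAM-backed, small
    l2.put("blk1", b"a"); l2.put("blk2", b"b"); l2.put("blk3", b"c")
    print(l2.get("blk1"), l3.get("blk1"))            # evicted from L2, found in L3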
The following diagram illustrates how clients interact with the OneFS read cache infrastructure and the write coalescer. L1 cache still
interacts with the L2 cache on any node it requires, and the L2 cache interacts with both the storage subsystem and L3 cache. L3
cache is stored on an SSD within the node and each node in the same node pool has L3 cache enabled. The diagram also illustrates a
separate node pool where L3 cache is not enabled. This node pool either does not contain the required SSDs, or has L3 cache
disabled, with the SSDs being used for a filesystem-based SmartPools SSD data or metadata strategy.
OneFS dictates that a file is written across multiple nodes in the cluster, and possibly multiple drives within a node, so all read requests
involve reading remote (and possibly local) data. When a read request arrives from a client, OneFS determines whether the requested
data is in local cache. Any data resident in local cache is read immediately. If data requested is not in local cache, it is read from disk.
For data not on the local node, a request is made from the remote nodes on which it resides. On each of the other nodes, another
cache lookup is performed. Any data in the cache is returned immediately, and any data not in the cache is retrieved from disk.
When the data has been retrieved from local and remote cache (and possibly disk), it is returned back to the client.
The high-level steps for fulfilling a read request on both a local and remote node are:
1. Determine whether part of the requested data is in the local L1 cache. If so, return to client.
2. If not in the local cache, request data from the remote node(s).
On remote nodes:
1. Determine whether requested data is in the local L2 or L3 cache. If so, return to the requesting node.
2. If not in the local cache, read from disk and return to the requesting node.
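A minimal Python sketch of those steps follows. It is illustrative only; the node, cache, and disk objects are simple dictionaries standing in for the OneFS subsystems described above.

    # Sketch of the read path described above: check the initiator's L1 cache,
    # then ask the owning node, which checks its L2/L3 cache before going to disk.

    def read_block(block, initiator, owners):
        # Step 1 (initiator): serve from local L1 cache if present.
        if block in initiator["l1"]:
            return initiator["l1"][block]
        # Step 2 (initiator): otherwise request the block from the owning node.
        owner = owners[block]
        data = remote_read(block, owner)
        initiator["l1"][block] = data          # populate L1 for subsequent reads
        return data

    def remote_read(block, owner):
        # Step 1 (remote node): serve from L2/L3 cache if present.
        for cache in (owner["l2"], owner["l3"]):
            if block in cache:
                return cache[block]
        # Step 2 (remote node): fall back to disk and populate L2.
        data = owner["disk"][block]
        owner["l2"][block] = data
        return data

    node1 = {"l1": {}}
    node2 = {"l2": {}, "l3": {}, "disk": {"blk7": b"payload"}}
    print(read_block("blk7", node1, owners={"blk7": node2}))   # read from disk
    print(read_block("blk7", node1, owners={"blk7": node2}))   # now served from L1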
Write caching accelerates the process of writing data to an Isilon cluster. This is achieved by batching up smaller write requests and
sending them to disk in bigger chunks, removing a significant amount of disk writing latency. When clients write to the cluster, OneFS
temporarily writes the data to an NVRAM-based journal cache on the initiator node, instead of immediately writing to disk. OneFS can
then flush these cached writes to disk at a later, more convenient time. Additionally, these writes are also mirrored to participant nodes’
NVRAM journals to satisfy the file’s protection requirement. Therefore, in the event of a cluster split or unexpected node outage,
uncommitted cached writes are fully protected.
• An NFS client sends Node 1 a write request for a file with +2n protection.
• Node 1 accepts the writes into its NVRAM write cache (fast path) and then mirrors the writes to participant nodes’ log files for
protection.
• As Node 1’s write cache fills, it is periodically flushed, and writes are committed to disk via the two-phase commit process
(described above) with the appropriate erasure code (ECC) protection applied (+2n).
• The write cache and participant node log files are cleared and available to accept new writes.
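The following Python sketch models that sequence at a very high level: the initiator journals incoming writes, mirrors them to participant nodes' journals for protection, and later flushes the coalesced writes to disk. It covers journal mirroring and flushing only and is not the OneFS two-phase commit implementation.

    # High-level model of write caching: journal locally, mirror to participants,
    # then flush coalesced writes to disk and clear the journals.

    class Node:
        def __init__(self, name):
            self.name = name
            self.journal = []        # stands in for the NVRAM-backed journal
            self.disk = []

    class Initiator(Node):
        def __init__(self, name, participants):
            super().__init__(name)
            self.participants = participants

        def write(self, data):
            self.journal.append(data)                    # fast path: journal locally
            for peer in self.participants:               # mirror for protection
                peer.journal.append(data)

        def flush(self):
            self.disk.extend(self.journal)               # commit coalesced writes
            for peer in self.participants:               # journals can now be cleared
                peer.journal.clear()
            self.journal.clear()

    peers = [Node("node2"), Node("node3")]
    node1 = Initiator("node1", peers)
    node1.write(b"block-a"); node1.write(b"block-b")
    node1.flush()
    print(len(node1.disk), len(node1.journal), len(peers[0].journal))   # 2 0 0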
File reads
In an Isilon cluster, data, metadata and inodes are all distributed on multiple nodes, and even across multiple drives within nodes.
When reading or writing to the cluster, the node a client attaches to acts as the “captain” for the operation.
In a read operation, the “captain” node gathers all of the data from the various nodes in the cluster and presents it in a cohesive way to
the requestor.
Due to the use of cost-optimized industry standard hardware, the Isilon cluster provides a high ratio of cache to disk (multiple GB per
node) that is dynamically allocated for read and write operations as needed. This RAM-based cache is unified and coherent across all
nodes in the cluster, allowing a client read request on one node to benefit from I/O already transacted on another node. These cached
blocks can be quickly accessed from any node across the low-latency InfiniBand backplane, allowing for a large, efficient RAM cache,
which greatly accelerates read performance.
As the cluster grows larger, the cache benefit increases. For this reason, the amount of I/O to disk on an Isilon cluster is generally
substantially lower than it is on traditional platforms, allowing for reduced latencies and a better user experience.
For files marked with an access pattern of concurrent or streaming, OneFS can take advantage of pre-fetching of data based on
heuristics used by the Isilon SmartRead component. SmartRead can create a data “pipeline” from L2 cache, prefetching into a local
“L1” cache on the “captain” node. This greatly improves sequential-read performance across all protocols and means that reads come
directly from RAM within milliseconds. For high-sequential cases, SmartRead can very aggressively prefetch ahead, allowing reads or
writes of individual files at very high data rates.
Figure 10 illustrates how SmartRead reads a sequentially-accessed, non-cached file that is requested by a client attached to Node1 in
a 3-node cluster.
1. Node1 reads metadata to identify where all the blocks of file data exist.
6. For highly sequential cases, data in L1 cache may be optionally “dropped behind” to free RAM for other L1 or L2 cache demands.
SmartRead’s intelligent caching allows for very high read performance with high levels of concurrent access. Importantly, it is faster for
Node1 to get file data from the cache of Node2 (over the low-latency cluster interconnect) than to access its own local disk.
SmartRead’s algorithms control how aggressive the pre-fetching is (disabling pre-fetch for random-access cases), how long data
stays in the cache, and where data is cached.
Figure 13 below illustrates an example of how threads from different nodes could request a lock from the coordinator.
2. Thread 1 from Node 4 and thread 2 from Node 3 request a shared lock on a file from Node 2 at the same time.
4. If no exclusive locks exist, Node 2 grants thread 1 from Node 4 and thread 2 from Node 3 shared locks on the requested file.
5. Node 3 and Node 4 are now performing a read on the requested file.
6. Thread 3 from Node 1 requests an exclusive lock for the same file as being read by Node 3 and Node 4.
7. Node 2 checks with Node 3 and Node 4 if the shared locks can be reclaimed.
8. Node 3 and Node 4 are still reading so Node 2 asks thread 3 from Node 1 to wait for a brief instant.
9. Thread 3 from Node 1 blocks until the exclusive lock is granted by Node 2 and then completes the write operation.
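A condensed Python model of that shared/exclusive interaction is shown below, using standard reader-writer lock semantics to stand in for OneFS's distributed lock manager, which is far more involved than this single-node sketch.

    # Reader-writer lock sketch: many shared (read) holders, or one exclusive
    # (write) holder. Stands in for the coordinator behavior described above.
    import threading

    class FileLock:
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0
            self._writer = False

        def acquire_shared(self):
            with self._cond:
                while self._writer:                  # wait out an exclusive holder
                    self._cond.wait()
                self._readers += 1

        def release_shared(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()          # a waiting writer may proceed

        def acquire_exclusive(self):
            with self._cond:
                while self._writer or self._readers: # wait until all readers finish
                    self._cond.wait()
                self._writer = True

        def release_exclusive(self):
            with self._cond:
                self._writer = False
                self._cond.notify_all()

    lock = FileLock()
    lock.acquire_shared(); lock.acquire_shared()     # two readers, as in the figure
    lock.release_shared(); lock.release_shared()
    lock.acquire_exclusive()                         # writer proceeds once readers finish
    lock.release_exclusive()
    print("lock sequence completed")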
Multi-threaded IO
With the growing use of large NFS datastores for server virtualization and enterprise application support comes the need for high
throughput and low latency to large files. To accommodate this, OneFS Multi-writer supports multiple threads concurrently writing to
individual files.
In the above example, concurrent write access to a large file can become limited by the exclusive locking mechanism, applied at the
whole file level. To avoid this potential bottleneck, OneFS Multi-writer provides more granular write locking by sub-dividing the file
into separate regions and granting exclusive write locks to individual regions, as opposed to the entire file. As such, multiple clients can
simultaneously write to different portions of the same file.
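The sketch below illustrates the region-locking idea in Python. It is a conceptual model rather than the Multi-writer implementation, and the region size shown is an arbitrary example value.

    # Conceptual model of region-based write locking: the file's byte range is
    # divided into fixed-size regions, each with its own lock, so writers to
    # different regions never contend with one another.
    import threading
    from collections import defaultdict

    REGION_SIZE = 1024 * 1024            # 1MB regions, an arbitrary example value

    class MultiWriterFile:
        def __init__(self):
            self._region_locks = defaultdict(threading.Lock)

        def write(self, offset, data):
            region = offset // REGION_SIZE
            with self._region_locks[region]:     # exclusive only within this region
                pass                             # actual data write would happen here
            return region

    f = MultiWriterFile()
    # Two clients writing to different parts of the same file use different locks:
    print(f.write(0, b"x" * 4096), f.write(10 * REGION_SIZE, b"y" * 4096))   # 0 10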
Smaller clusters can be protected with +1n protection, but this implies that while a single drive or node could be recovered, two drives
in two different nodes could not. Drive failures are orders of magnitude more likely than node failures. For clusters with large drives, it
is desirable to provide protection for multiple drive failures, though single-node recoverability is acceptable.
To provide for a situation where we wish to have double-disk redundancy and single-node redundancy, we can build double- or triple-
width protection groups. These double- or triple-width protection groups will “wrap” once or twice over the same set of nodes, as
they are laid out. Since each protection group contains exactly two disks worth of redundancy, this mechanism will allow a cluster to
sustain either a two or three drive failure or a full node failure, without any data unavailability.
Most important for small clusters, this method of striping is highly efficient, with an on-disk efficiency of N/(N+M). For example, on a
cluster of five nodes with double-failure protection, were we to use N=3, M=2, we would obtain a 3+2 protection group with an
efficiency of 1−2/5 or 60%. Using the same 5-node cluster but with each protection group laid out over 2 stripes, N would now be 8 and
M=2, so we could obtain 1-2/(8+2) or 80% efficiency on disk, retaining our double-drive failure protection and sacrificing only double-
node failure protection.
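The arithmetic behind those figures can be checked with a few lines of Python (N data elements plus M protection elements per protection group):

    # On-disk efficiency of an N+M protection group: N data elements out of
    # N+M total elements, i.e. 1 - M/(N+M).

    def efficiency(n, m):
        return n / (n + m)

    print(efficiency(3, 2))   # 0.6 -> 60% for a 3+2 group on a 5-node cluster
    print(efficiency(8, 2))   # 0.8 -> 80% for the same cluster laid out over 2 stripes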
OneFS supports several protection schemes. These include the ubiquitous +2d:1n, which protects against two drive failures or one
node failure.
Isilon recommends using the recommended protection level for a particular cluster configuration. This recommended level of
protection is clearly marked as ‘suggested’ in the OneFS WebUI storage pools configuration pages and is typically configured by
default. For all current Gen6 hardware configurations, the recommended protection level is ‘+2d:1n’.
The hybrid protection schemes are particularly useful for Isilon Gen6 chassis and other high-density node configurations, where the
probability of multiple drives failing far surpasses that of an entire node failure. In the unlikely event that multiple devices have
simultaneously failed, such that the file is “beyond its protection level”, OneFS will re-protect everything possible and report errors on
the individual files affected to the cluster’s logs.
OneFS also provides a variety of mirroring options ranging from 2x to 8x, allowing from two to eight mirrors of the specified content.
Metadata, for example, is mirrored at one level above FEC by default. For example, if a file is protected at +1n, its associated metadata
object will be 3x mirrored.
The full range of OneFS protection levels is summarized in the following table:
OneFS enables an administrator to modify the protection policy in real time, while clients are attached and are reading and writing
data.
Be aware that increasing a cluster’s protection level may increase the amount of space consumed by the data on the cluster.
Automatic partitioning
Data tiering and management in OneFS is handled by the SmartPools framework. From a data protection and layout efficiency point of
view, SmartPools facilitates the subdivision of large numbers of high-capacity, homogeneous nodes into smaller, more ‘Mean Time to
Data Loss’ (MTTDL) friendly disk pools. For example, an 80-node archive cluster would typically run at a +4n protection level.
However, partitioning it into four twenty-node disk pools would allow each pool to run at +2d:1n, thereby lowering the protection
overhead and improving space utilization, without any net increase in management overhead.
In keeping with the goal of storage management simplicity, OneFS will automatically calculate and partition the cluster into pools of
disks, or ‘node pools’, which are optimized for both MTTDL and efficient space utilization. This means that protection level decisions,
such as the eighty-node cluster example above, are not left to the customer.
With Automatic Provisioning, every set of compatible node hardware is automatically divided into disk pools comprising up to forty
nodes and six drives per node. These node pools are protected by default at +2d:1n, and multiple pools can then be combined into
logical tiers and managed with SmartPools file pool policies. By subdividing a node’s disks into multiple, separately protected pools,
nodes are significantly more resilient to multiple disk failures than previously possible.
Figure 18. Gen6 Isilon Platform Chassis Front View Showing Drive Sleds
Each sled is a tray which slides into the front of the chassis and contains between three and six drives, depending on the configuration
of a particular chassis. Disk Pools are the smallest unit within the Storage Pools hierarchy. OneFS provisioning works on the premise
of dividing similar nodes’ drives into sets, or disk pools, with each pool representing a separate failure domain. These disk pools are
protected by default at +2d:1n (or the ability to withstand two drives or one entire node failure).
Disk pools are laid out across all five sleds in each Gen6 Isilon node. For example, a node with three drives per sled will have the
following disk pool configuration:
Partner node protection increases reliability because partner nodes are placed in different failure domains; even if both partner nodes
go down, each failure domain suffers the loss of only a single node.
Chassis protection ensures that if an entire chassis failed, each failure domain would only lose one node.
A 40 node or larger cluster with four neighborhoods, protected at the default level of +2d:1n can sustain a single node failure per
neighborhood. This protects the cluster against a single Gen6 chassis failure.
Overall, a Gen6 platform cluster will have reliability at least one order of magnitude greater than previous-generation clusters of a
similar capacity, as a direct result of the following enhancements:
• Mirrored Journals
• Smaller Neighborhoods
Compatibility
Certain similar, but non-identical, node types can be provisioned into an existing node pool through node compatibility. OneFS requires that a
node pool must contain a minimum of three nodes.
Due to significant architectural differences, there are no node compatibilities between the Gen6 Isilon platform and any previous
generations of Isilon hardware.
Supported protocols
Clients with adequate credentials and privileges can create, modify, and read data using one of the standard supported methods for
communicating with the cluster:
On the Microsoft Windows side, the SMB protocol is supported up to version 3. As part of the SMB3 dialect, OneFS supports the
following features:
• SMB3 Multi-path
• SMB3 Encryption
SMB3 encryption can be configured on a per-share, per-zone, or cluster-wide basis. Only operating systems that support SMB3
encryption can work with encrypted shares. These operating systems can also work with unencrypted shares if the cluster is
configured to allow non-encrypted connections. Other operating systems can access non-encrypted shares only if the cluster is
configured to allow non-encrypted connections.
By default, only the SMB/CIFS and NFS protocols are enabled in the Isilon cluster. The file system root for all data in the cluster is /ifs
(the Isilon OneFS file system). This is presented via the SMB/CIFS protocol as an ‘ifs’ share (\\<cluster_name>\ifs), and via the NFS
protocol as a ‘/ifs’ export (<cluster_name>:/ifs).
Data is common between all protocols, so changes made to file content via one access protocol are instantly viewable from all
others.
OneFS provides full support for both IPv4 and IPv6 environments across the front-end Ethernet network(s), SmartConnect, and the
complete array of storage protocols and management tools.
Additionally, OneFS CloudPools supports the following cloud providers’ storage APIs, allowing files to be stubbed out to a number of
storage targets, including:
• Microsoft Azure
• Virtustream
File filtering
OneFS file filtering can be used across NFS and SMB clients to allow or disallow writes to an export, share, or access zone. This
feature allows certain types of file extensions to be blocked, for files which might cause security problems, productivity disruptions,
throughput issues or storage clutter. Configuration can be either via a blacklist, which blocks explicit file extensions, or a whitelist,
which explicitly allows writes of only certain file types.
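Conceptually, the check reduces to a simple extension test, as in the hedged Python sketch below. The deny and allow lists are illustrative examples; OneFS applies this configuration per export, share, or access zone.

    # Sketch of file-filtering logic: either block writes for extensions on a
    # deny list, or permit writes only for extensions on an allow list.
    import os

    def write_allowed(filename, deny=None, allow=None):
        ext = os.path.splitext(filename)[1].lower().lstrip(".")
        if allow is not None:       # allow-list mode: only these may be written
            return ext in allow
        if deny is not None:        # deny-list mode: everything else may be written
            return ext not in deny
        return True

    print(write_allowed("movie.avi", deny={"avi", "mp3"}))          # False
    print(write_allowed("report.docx", allow={"docx", "xlsx"}))     # True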
SmartDedupe architecture
The OneFS SmartDedupe architecture comprises the following principal modules:
• Deduplication Job
• Deduplication Engine
• Shadow Store
• Deduplication Infrastructure
Shadow stores
OneFS shadow stores are file system containers that allow data to be stored in a sharable manner. As such, files on OneFS can
contain both physical data and pointers, or references, to shared blocks in shadow stores. Shadow stores were introduced in OneFS
7.0, initially supporting Isilon OneFS file clones, and there are many overlaps between cloning and deduplicating files.
Shadow stores are similar to regular files, but typically don’t contain all the metadata typically associated with regular file inodes. In
particular, time-based attributes (creation time, modification time, etc.) are explicitly not maintained. Each shadow store can contain up
to 256 blocks, with each block able to be referenced by 32,000 files. If this 32K reference limit is exceeded, a new shadow store is
created. Additionally, shadow stores do not reference other shadow stores. And snapshots of shadow stores are not allowed, since
shadow stores have no hard links.
Shadow stores are also utilized for OneFS file clones and small file storage efficiency (SFSE), in addition to deduplication.
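A toy Python sketch of block sharing by way of a shadow store is shown below. The block hashing and the 32,000-reference limit follow the description above; everything else, including the use of SHA-256, is a simplification rather than the SmartDedupe implementation.

    # Toy model of block sharing: identical blocks are stored once in a shadow
    # store and referenced by files, subject to a per-block reference limit.
    import hashlib

    MAX_REFS = 32000                      # reference limit noted in the text

    class ShadowStore:
        def __init__(self):
            self.blocks = {}              # digest -> (data, reference count)

        def share(self, data):
            digest = hashlib.sha256(data).hexdigest()
            stored, refs = self.blocks.get(digest, (data, 0))
            if refs >= MAX_REFS:
                raise RuntimeError("reference limit hit; a new shadow store would be created")
            self.blocks[digest] = (stored, refs + 1)
            return digest                 # a file would keep this reference

    store = ShadowStore()
    ref_a = store.share(b"\x00" * 8192)   # first file writes an 8KB block
    ref_b = store.share(b"\x00" * 8192)   # second, identical block is shared
    print(ref_a == ref_b, store.blocks[ref_a][1])   # True 2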
Small File Storage Efficiency trades a small read latency performance penalty for improved storage utilization. The archived files
obviously remain writable, but when containerized files with shadow references are deleted, truncated or overwritten it can leave
unreferenced blocks in shadow stores. These blocks are later freed and can result in holes which reduces the storage efficiency.
The actual efficiency loss depends on the protection level layout used by the shadow store. Smaller protection group sizes are more
susceptible, as are containerized files, since all the blocks in containers have at most one referring file and the packed sizes (file size)
are small.
A defragmenter is provided in OneFS 8.2 and later to reduce the fragmentation of files as a result of overwrites and deletes. This
shadow store defragmenter is integrated into the ShadowStoreDelete job. The defragmentation process works by dividing each
containerized file into logical chunks (~32MB each) and assessing each chunk for fragmentation.
If the storage efficiency of a fragmented chunk is below target, that chunk is processed by evacuating the data to another location. The
default target efficiency is 90% of the maximum storage efficiency available with the protection level used by the shadow store. Larger
protection group sizes can tolerate a higher level of fragmentation before the storage efficiency drops below this threshold.
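The following Python fragment sketches the assessment step only. The ~32MB chunk size and the 90% default target come from the description above; the per-chunk efficiency values and the helper names are hypothetical stand-ins.

    # Sketch of the defragmenter's decision step: walk a containerized file in
    # ~32MB logical chunks and flag any chunk whose storage efficiency has
    # dropped below the target for evacuation to another location.

    CHUNK_SIZE = 32 * 1024 * 1024
    TARGET_RATIO = 0.90            # default: 90% of the maximum storage efficiency

    def chunks_to_repack(file_size, efficiency_of_chunk, max_efficiency):
        """Return the chunk offsets that should be evacuated to another location."""
        target = TARGET_RATIO * max_efficiency
        return [offset for offset in range(0, file_size, CHUNK_SIZE)
                if efficiency_of_chunk(offset) < target]

    # Hypothetical example: a 96MB file where the middle chunk has become sparse.
    measured = {0: 0.80, CHUNK_SIZE: 0.55, 2 * CHUNK_SIZE: 0.78}
    print(chunks_to_repack(96 * 1024 * 1024, measured.get, max_efficiency=0.80))
    # Only the second chunk (offset 33554432) falls below the 0.72 target.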
The in-line data reduction write path comprises three main phases:
The F810 includes a hardware compression off-load capability, with each node in an F810 chassis containing a Mellanox Innova-2
Flex Adapter. This means that compression and decompression are transparently performed by the Mellanox adapter with minimal
latency, thereby avoiding the need for consuming a node’s expensive CPU and memory resources.
The OneFS hardware compression engine uses zlib, with a software implementation of igzip as fallback in the event of a compression
hardware failure. OneFS employs a compression chunk size of 128KB, with each chunk comprising sixteen 8KB data blocks. This is
optimal since it is also the same size that OneFS uses for its data protection stripe units, providing simplicity and efficiency, by avoiding
the overhead of additional chunk packing.
Consider the diagram above. After compression, this chunk is reduced from sixteen to six 8KB blocks in size. This means that this
chunk is now physically 48KB in size. OneFS provides a transparent logical overlay to the physical attributes. This overlay describes
whether the backing data is compressed or not and which blocks in the chunk are physical or sparse, such that file system consumers
are unaffected by compression. As such, the compressed chunk is logically represented as 128KB in size, regardless of its actual
physical size.
Efficiency savings must be at least 8KB (one block) in order for compression to occur, otherwise that chunk or file will be passed over
and remain in its original, uncompressed state. For example, a file of 16KB that yields 8KB (one block) of savings would be
compressed. Once a file has been compressed, it is then FEC protected.
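A rough Python illustration of the per-chunk decision follows, using zlib as a stand-in for the hardware and igzip engines named above. The 128KB chunk and 8KB block sizes come from the text; the decision logic is a sketch, not the OneFS code path.

    # Per-chunk compression decision: compress a 128KB chunk (sixteen 8KB blocks)
    # and keep the compressed form only if it saves at least one full 8KB block.
    import os
    import zlib

    BLOCK = 8 * 1024
    CHUNK = 16 * BLOCK                      # 128KB compression chunk

    def compress_chunk(chunk):
        compressed = zlib.compress(chunk)
        blocks_needed = -(-len(compressed) // BLOCK)       # round up to whole blocks
        if blocks_needed <= (len(chunk) // BLOCK) - 1:     # saves at least one block?
            return compressed, blocks_needed
        return chunk, len(chunk) // BLOCK                  # otherwise store uncompressed

    data, blocks = compress_chunk(b"A" * CHUNK)
    print(blocks)             # repetitive data: stored in far fewer than 16 blocks
    data, blocks = compress_chunk(os.urandom(CHUNK))
    print(blocks)             # incompressible data: stays at 16 blocks, uncompressed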
Compression chunks will never cross node pools. This avoids the need to de-compress or recompress data to change protection
levels, perform recovered writes, or otherwise shift protection-group boundaries.
Interfaces
Administrators can use multiple interfaces to administer an Isilon storage cluster in their environments:
• Command Line Interface via SSH network access or RS232 serial connection
• RESTful Platform API for programmatic control and automation of cluster configuration and management.
More information on OneFS commands and feature configuration is available in the OneFS Administration Guide.
OneFS supports the use of more than one authentication type. However, it is recommended that you fully understand the interactions
between authentication types before enabling multiple methods on the cluster. Refer to the product documentation for detailed
information about how to properly configure multiple authentication modes.
Active Directory
Active Directory, a Microsoft directory service accessed via LDAP, can store information about network resources.
While Active Directory can serve many functions, the primary reason for joining the cluster to the domain is to perform user and group
authentication.
Each node in the cluster shares the same Active Directory machine account, making it very easy to administer and manage.
LDAP
The Lightweight Directory Access Protocol (LDAP) is a networking protocol used for defining, querying, and modifying services and
resources. A primary advantage of LDAP is the open nature of the directory services and the ability to use LDAP across many
platforms. The Isilon clustered storage system can use LDAP to authenticate users and groups in order to grant them access to the
cluster.
NIS
The Network Information Service (NIS), designed by Sun Microsystems, is a directory services protocol that can be used by the Isilon
system to authenticate users and groups when accessing the cluster. NIS, sometimes referred to as Yellow Pages (YP), is different
from NIS+, which the Isilon cluster does not support.
Local users
The Isilon clustered storage system supports local user and group authentication. You can create local user and group accounts
directly on the cluster, using the WebUI interface. Local authentication can be useful when directory services—Active Directory, LDAP,
or NIS—are not used, or when a specific user or application needs to access the cluster.
Access zones
Access zones provide a method to logically partition cluster access and allocate resources to self-contained units, thereby providing a
shared tenant, or multi-tenant, environment. To facilitate this, Access Zones tie together the three core external access components:
• Authentication
As such, Isilon SmartConnect™ zones are associated with a set of SMB/CIFS shares, NFS exports, HDFS racks, and one or more
authentication providers per zone for access control. This provides the benefits of a centrally managed single file system, which can be
provisioned and secured for multiple tenants. This is particularly useful for enterprise environments where multiple separate business
units are served by a central IT department. Another example is during a server consolidation initiative, when merging multiple
Windows file servers that are joined to separate, un-trusted, Active Directory forests.
With Access Zones, the built-in System access zone includes an instance of each supported authentication provider, all available SMB
shares, and all available NFS exports by default.
These authentication providers can include multiple instances of Microsoft Active Directory, LDAP, NIS, and local user or group
databases.
Software upgrade
Upgrading to the latest version of OneFS allows you to take advantage of any new features, fixes and functionality on the Isilon
cluster. Clusters can be upgraded using two methods: simultaneous or rolling upgrade.
Simultaneous upgrade
A simultaneous upgrade installs the new operating system and reboots all nodes in the cluster at the same time. A simultaneous
upgrade requires a temporary, sub-two-minute, interruption of service during the upgrade process while the nodes are restarted.
Rolling upgrade
A rolling upgrade individually upgrades and restarts each node in the cluster sequentially. During a rolling upgrade, the cluster remains
online and continues serving data to clients with no interruption in service. Prior to OneFS 8.0, a rolling upgrade can only be performed
within a OneFS code version family and not between OneFS major code version revisions. From OneFS 8.0 onwards, every new
release will be rolling-upgradable from the prior version.
Non-disruptive upgrades
Non-disruptive upgrades (NDUs) allow a cluster administrator to upgrade the storage OS while their end users continue to access data
without error or interruption. Updating the operating system on an Isilon cluster is a simple matter of a rolling upgrade. During this
process, one node at a time is upgraded to the new code, and the active NFS and SMB3 clients attached to it are automatically
migrated to other nodes in the cluster. Partial upgrade is also permitted, whereby a subset of cluster nodes can be upgraded. The
subset of nodes may also be grown during the upgrade. In OneFS 8.2 and later, an upgrade can be paused and resumed, allowing
customers to span upgrades over multiple smaller maintenance windows. Additionally, OneFS 8.2.2 introduces parallel upgrades,
whereby clusters can upgrade an entire neighborhood, or fault domain, at a time, substantially reducing the duration of large cluster
upgrades.
• InsightIQ™ (Performance Management): Maximize performance of your Isilon scale-out storage system with innovative performance monitoring and reporting tools.
• ClarityNow™ (Data Analysis and Management): Locate, access and manage data in seconds, no matter where it resides – across file and object storage, on-prem or in the cloud. Gain a holistic view across heterogeneous storage systems with a single pane of glass, effectively breaking down data trapped in silos.
• SmartPools™ (Resource Management): Implement a highly efficient, automated tiered storage strategy to optimize storage performance and costs.
• SmartQuotas™ (Data Management): Assign and manage quotas that seamlessly partition and thin provision storage into easily managed segments at the cluster, directory, sub-directory, user, and group levels.
• SmartConnect™ (Data Access): Enable client connection load balancing and dynamic NFS failover and failback of client connections across storage nodes to optimize use of cluster resources.
• SnapshotIQ™ (Data Protection): Protect data efficiently and reliably with secure, near-instantaneous snapshots while incurring little to no performance overhead. Speed recovery of critical data with near-immediate on-demand snapshot restores.
• Isilon for vCenter (Data Management): Manage Isilon functions from vCenter.
• SyncIQ™ (Data Replication): Replicate and distribute large, mission-critical data sets asynchronously to multiple shared storage systems in multiple sites for reliable disaster recovery capability. Push-button failover and failback simplicity increases availability of mission-critical data.
• SmartLock™ (Data Retention): Protect your critical data against accidental, premature or malicious alteration or deletion with a software-based approach to Write Once Read Many (WORM), and meet stringent compliance and governance needs such as SEC 17a-4 requirements.
• SmartDedupe™ (Data Deduplication): Maximize storage efficiency by scanning the cluster for identical blocks and then eliminating the duplicates, decreasing the amount of physical storage required.
• CloudPools™ (Cloud Tiering): Define which data on your cluster should be archived to cloud storage. Cloud providers include Microsoft Azure, Google Cloud, Amazon S3, Dell EMC ECS, Virtustream and native Isilon.
Please refer to product documentation for details on all of the above Isilon software products.
Conclusion
With Isilon scale-out NAS solutions powered by the OneFS operating system, organizations can scale from 18 TB to more than 68 PB
within a single file system, single volume, with a single point of administration. OneFS delivers high-performance, high-throughput, or
both, without adding management complexity.
OneFS is the next-generation file system designed to meet these challenges. OneFS provides:
• High availability
OneFS is ideally suited for file-based and unstructured “Big Data” applications in enterprise data lake environments – including large-
scale home directories, file shares, archives, virtualization and business analytics – as well as a wide range of data-intensive, high
performance computing environments including energy exploration, financial services, Internet and hosting services, business
intelligence, engineering, manufacturing, media & entertainment, bioinformatics, and scientific research.
Shop Dell EMC Isilon to compare features and get more information.