IDC White Paper
IDC OPINION
As enterprises move into the digital era through a process termed digital transformation, they are
moving to more data-centric business models. While there are big data and analytics workloads that
do not use artificial intelligence (AI), AI-driven applications will grow at a rapid rate over the next five
years. AI workloads include machine learning (ML) and deep learning (DL) workloads, and while
more data helps drive better business insights across both application types, this is particularly true for
DL workloads. Experience with DL workloads over the past three years indicates that outdated storage
architectures can pose serious challenges in efficiently scaling large AI-driven workloads, and over
88% of enterprises are purchasing newer, more modern storage infrastructure designs for those types of
applications.
AI-driven applications commonly have a multistage data pipeline that includes ingest, transformation,
training, inferencing, production, and archiving, with much of the data shared between stages. When
enterprises can consolidate the storage processing for all of these stages onto a single storage
system, it provides the most cost-effective infrastructure. This type of consolidation can, however,
potentially put performance, availability, and security at risk if the underlying storage does not support
low latency, high data concurrency, and the multitenant management capabilities needed to handle the data for each
stage according to its individual requirements. Successful AI workloads tend to grow very fast, making
easy, extensive scalability important as well. Enterprises often want to use public cloud–based services
during the data life cycle, so systems need to support cloud-native capabilities and integration. And finally,
because these systems tend to be quite large (often growing to the multi-petabyte [PB] range and beyond),
high infrastructure efficiency is important to drive a low total cost of ownership.
Because of these requirements, enterprises are gravitating more and more to software-defined, scale-
out storage infrastructures that support extensive hybrid cloud integration. NetApp, a well-established
storage systems vendor that held the number two market share spot by revenue for external storage in
2021, offers enterprise-class storage systems that meet these requirements well. In addition to offering
the technical capabilities demanded by AI workloads, NetApp has also implemented a deployment
model that makes buying IT infrastructure solutions for AI-driven workloads fast and easy.
NetApp ONTAP AI provides a prepackaged solution that includes the accelerated compute and
networking from NVIDIA so often needed for AI workloads and a software-defined, scale-out
storage architecture based on ONTAP, the vendor's proven enterprise-class storage operating system.
SITUATION OVERVIEW
AI technologies are becoming an increasingly important driver of business success for digitally
transformed enterprises today. ML, a subset of AI, reviews data inputs, identifies correlations and
patterns in that data, and then applies that learning to make informed decisions across a wide variety
of use cases — everything from recommendation engines and fraud detection to customer analysis and
forecasting events — that can supplement and help guide better human decision making. DL is an
evolution of machine learning that uses more complex, multilayered neural networks that can actually
learn and make intelligent decisions on their own without human involvement.
Unlike ML, DL uses a multilayered structure of algorithms whose design is inspired by the biological
network of neurons in the human brain to provide a learning system that is far more capable than that
of standard ML models (which are based on only one- or two-layer "networks"). DL is used to address
far more complex problems such as natural language processing, virtual assistants that mimic
humans, and autonomous vehicle systems, and these models may have thousands to trillions of
"parameters" in the multilayer neural network. IDC's AI DL Training Infrastructure Survey, completed in
June 2022, explored existing enterprises' use of AI technology across both ML and DL workloads.
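To make the layering distinction concrete, the sketch below contrasts a classic one-layer model with a small multilayer network using plain NumPy; the layer sizes and data are illustrative assumptions, not drawn from the survey.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def shallow_model(x, w, b):
    # Classic ML: a single weighted layer (logistic regression).
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def deep_model(x, layers):
    # DL: many stacked layers, each learning a higher-level representation.
    for w, b in layers[:-1]:
        x = relu(x @ w + b)
    w, b = layers[-1]
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))                       # 4 samples, 32 input features

# One- or two-layer "network" typical of standard ML models.
shallow = shallow_model(x, rng.normal(size=(32, 1)), np.zeros(1))

# A (small) multilayer network; real DL models stack tens to hundreds of layers.
sizes = [32, 64, 64, 64, 1]
layers = [(rng.normal(size=(m, n)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]
deep = deep_model(x, layers)
print(shallow.shape, deep.shape)                   # (4, 1) (4, 1)
```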
AI workloads in general perform better when they leverage larger data sets for training purposes, and
this is particularly true for DL workloads. AI life-cycle applications are one of the two fastest-growing
workloads over the next five years (with non-AI-driven big data analytics workloads being the other
one), and they are contributing strongly to the projected data growth rates in the enterprise. Most
enterprises are experiencing data growth rates from 30% to 40% per year and will soon be managing
multi-petabyte storage environments (if they are not already). Roughly 70% of enterprises undergoing
digital transformation will be modernizing their storage infrastructure over the next two years to deal
with the performance, availability, scalability, and security requirements for the new workloads they are
deploying in an era of heightened privacy concerns and rampant cyberattacks.
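The compounding effect of those growth rates is easy to underestimate. The short calculation below, which assumes an illustrative 500TB starting point rather than any surveyed figure, shows how 30-40% annual growth pushes an estate into petabyte territory within a few years.

```python
# Compound data growth: capacity_n = capacity_0 * (1 + rate) ** years.
# The 500TB starting point is an assumed illustration, not a survey figure.
start_tb = 500
for rate in (0.30, 0.40):
    capacity = [start_tb * (1 + rate) ** year for year in range(6)]
    print(f"{int(rate * 100)}%/yr:", [f"{c / 1000:.1f}PB" for c in capacity])
# Even a 500TB estate reaches the petabyte range within three years and keeps compounding.
```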
Figure 1 shows a typical AI data pipeline with its multiple stages. In a typical AI project, the enterprise
first carefully identifies a problem, and the data science team chooses the algorithm
that maximizes the possibility of success. Data is then gathered and ingested into the storage
infrastructure where it will be converted into a usable format and then used to train the AI model. As
the model is being built through this "training" effort that runs the data set through the algorithm, the
data science team is evaluating the accuracy of the recommendations or predictions the model is
making (the "inferencing" stage). Once the accuracy is acceptable, the model can be deployed in
production and the team will develop applications that use the model. But AI applications are not static,
and the training workflow must continue to be refined and optimized over time or the usefulness of the model will decline.
FIGURE 1
The AI Data Pipeline: ingest data → transform data → train model → validate model with inference → production inference → data tiering, with the training and inference workflows spanning these stages.
The different stages of the AI data pipeline require different capabilities from the IT infrastructure.
Accelerated compute, implemented through graphics processing units (GPUs), application-specific integrated
circuits (ASICs), or field-programmable gate arrays (FPGAs), can be much more efficient than general-
purpose processors when dealing with data-intensive workloads. IDC's AI DL Training Infrastructure
Survey indicated that 62% of enterprises running AI workloads were running them on "high-density
clusters," defined as scale-out IT infrastructure leveraging some form of accelerated compute.
Acceleration is based on parallelization of the data and/or the model, allowing for simultaneous
processing on thousands of cores and potentially hundreds of compute systems rather than sequential
processing. Parallel processing architectures are optimal for deep learning neural networks.
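A minimal sketch of the data-parallel idea follows: the batch is split into shards, each worker computes a gradient on its shard, and the results are averaged into a single update. NumPy stands in here for a GPU framework, and the model, data, and worker count are illustrative assumptions.

```python
import numpy as np

def gradient(w, x, y):
    # Gradient of mean squared error for a linear model; stands in for any model.
    return 2.0 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(1)
x, y = rng.normal(size=(4096, 16)), rng.normal(size=(4096, 1))
w = np.zeros((16, 1))

n_workers = 8                                  # e.g., 8 GPUs or 8 cluster nodes
x_shards = np.array_split(x, n_workers)        # data parallelism: shard the batch
y_shards = np.array_split(y, n_workers)

# Each worker computes its local gradient simultaneously (shown sequentially here
# for clarity); an all-reduce then averages them into one global update.
local_grads = [gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
global_grad = np.mean(local_grads, axis=0)
w -= 0.01 * global_grad
```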
Implementing the right storage infrastructure can also make a big difference, particularly for larger-
scale AI workloads. Already, over 43% of enterprises running these workloads operate in integrated
hybrid cloud environments, and 69% of them are regularly moving data between on- and off-premises
locations. Software-defined storage is important in providing the flexibility and data mobility needed in
these hybrid cloud environments. Given the high data growth rates, scale-out architectures provide the
widest range of scalability needed with easy nondisruptive expansion over time.
Figure 2 shows AI DL training projects being run in the most common deployment scenarios.
(% of respondents; n = 314; base = all respondents)
The use of managed services for AI training workloads is popular, with 62% of the enterprises
surveyed using them, although deployment locations are split. Over 46% of them rely on infrastructure
at a managed service provider location, almost 30% of them use this approach with on-premises
infrastructure, and over 8% are working with colocation partners like Equinix. Top factors driving the
interest in the use of managed services include a better perceived value proposition versus more
traditional "do it yourself" methods, protection against obsolescence in a rapidly changing market, and
the ability to offload infrastructure management so that staff can focus on more strategic operations.
The major driver for these developments is the AI model size. To achieve high levels of accuracy, it is
not enough to feed an AI training model large amounts of data; the model also needs to have a large
number of layers and parameters, which are the weights of the connections in the neural network. Just
two years ago, the AI community marveled at model sizes that had reached 300 billion parameters;
today AI scientists have developed trillion-plus parameter models.
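Because parameters are simply the weights (plus biases) on the connections between layers, the count grows multiplicatively with layer width and linearly with depth. The small helper below, with assumed layer sizes, shows how quickly fully connected layers alone add up.

```python
# Parameter count of a fully connected network: each layer contributes
# (inputs x outputs) weights plus one bias per output.
def count_parameters(layer_sizes):
    return sum(m * n + n for m, n in zip(layer_sizes, layer_sizes[1:]))

print(count_parameters([784, 128, 10]))           # small 2-layer model: 101,770
print(count_parameters([4096] * 48 + [4096]))     # deep, wide stack: ~805 million
# Transformer-style models add attention and embedding weights on top of this,
# which is how production models reach hundreds of billions of parameters.
```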
Unfortunately, the number of parameters in a neural network and the volumes of data fed into it
correlate directly with the amount of compute required. In other words, the more capable and/or
accurate an AI model needs to be, the more compute is required to train that model. What's more, the
faster an organization wants to have its model trained, the more compute is required as well since
model training can be distributed across many nodes in a cluster; hence a larger cluster trains a model
with greater speed. This is why companies like NVIDIA today build and install DGX SuperPODs in
customer datacenters.
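A rough sizing sketch illustrates the relationship. It uses the commonly cited approximation of about 6 floating-point operations per parameter per training token for dense models; the model size, token count, and per-GPU throughput below are assumptions chosen for illustration rather than measurements.

```python
# Rule-of-thumb training cost for dense models: FLOPs ~= 6 * parameters * tokens.
params = 175e9          # assumed model size (parameters)
tokens = 300e9          # assumed training tokens
total_flops = 6 * params * tokens

# Assumed sustained throughput per accelerator (FLOPs/s); real numbers vary
# widely with precision, interconnect, and software stack.
per_gpu_flops = 150e12

for gpus in (8, 64, 512):
    days = total_flops / (gpus * per_gpu_flops) / 86400
    print(f"{gpus:4d} GPUs -> ~{days:,.0f} days")
# Doubling the cluster roughly halves training time, which is why large models
# are trained on multi-node clusters such as DGX SuperPODs.
```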
It is also the reason for two other trends: the increasing interest in foundational or pretrained models,
whereby an organization does not attempt to train a complex model "from scratch" but instead
purchases it and then fine-tunes it for its own purposes; and the growing interest in as-a-service AI
platforms, which remove all infrastructure management from the customer and provide an AI training
environment that is operated as a service and paid for with a subscription model, either on premises or
in a cloud. (Note that public cloud puts significant pressure on AI scientists and engineers as every
iteration of the model training adds to the cloud bill.)
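A back-of-the-envelope calculation, using assumed instance pricing and run times rather than any surveyed figures, shows why repeated training iterations on public cloud infrastructure draw that scrutiny.

```python
# Assumed, illustrative numbers: an 8-GPU cloud instance at $30/hour and a
# training run that takes 24 hours; neither figure comes from the survey.
instance_cost_per_hour = 30.0
hours_per_training_run = 24
runs = [1, 10, 50]                      # hyperparameter sweeps multiply iterations

for n in runs:
    total = n * hours_per_training_run * instance_cost_per_hour
    print(f"{n:3d} training iterations -> ${total:,.0f}")
# One run costs $720, but a 50-run sweep already exceeds $35,000, so every
# additional iteration shows up directly on the cloud bill.
```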
Those higher-level objectives translate into a number of specific requirements for the IT administrators
actually managing the storage infrastructure for AI training workloads. A primary capability needed in
storage that will be used for AI is its ability to support high degrees of concurrency. The same data is
generally used across different stages, each of which can have very different I/O profiles, and multiple
stages of the AI data pipeline will be operating concurrently much of the time. For example, depending
on the scale of data collection, the ingest stage can demand extremely high sequential write
performance, while the model training stage typically requires very high random read performance
(with a smaller percentage of random writes). For real-time workloads, which are on the rise for
enterprises, the production inferencing stage can also require extremely rapid response, driving the
need for very low latencies.
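The contrast between those two profiles can be sketched with nothing more than ordinary file I/O. The example below writes a file sequentially (the ingest pattern) and then reads random blocks from it (the training pattern); the file and block sizes are illustrative.

```python
import os, random

PATH, BLOCK, BLOCKS = "dataset.bin", 1 << 20, 256   # 1MiB blocks, 256MiB file

# Ingest stage: large sequential writes as raw data streams in.
with open(PATH, "wb") as f:
    for _ in range(BLOCKS):
        f.write(os.urandom(BLOCK))

# Training stage: shuffled (random) reads as samples are drawn each epoch,
# typically issued concurrently by many data-loader workers and GPU clients.
with open(PATH, "rb") as f:
    for block in random.sample(range(BLOCKS), k=64):
        f.seek(block * BLOCK)
        _ = f.read(BLOCK)

os.remove(PATH)
# A storage system serving the whole pipeline has to sustain both patterns,
# often at the same time and with low latency for real-time inferencing.
```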
IT administrators were clear about the capabilities topping their list of storage purchase criteria in IDC's
AI DL Training Infrastructure Survey. Ease of use and high-performance data movement (for rapidly
loading data sets for new training runs) were in a virtual tie at the top, followed by cloud-enabled
storage, high read performance, and high availability. For those enterprises running multiple AI
workloads on the same storage system and sharing data across applications, one of the driving forces
behind this workload consolidation was a significant reduction in the time spent moving data between
different storage systems that were hosting various stages of the AI data pipeline. Some 52% of respondents
were either already consolidating AI workloads onto fewer platforms or expressed an interest in
moving to higher levels of consolidation, and 75% are prioritizing data sharing without data movement
(a factor often requiring support for higher-speed host networks like NVMe over Fabrics). The callout
for high availability is driven by the increasing importance of AI workloads and their criticality to daily
business operations that would suffer in the event of outages.
Figure 4 lists storage infrastructure criteria that enterprises are looking at to support AI deep learning
training workloads.
Among the criteria shown in the figure: high-performance data movement (for loading data sets for training runs), 27.4%; scalability, 18.5%. (% of respondents; n = 314; base = all respondents)
Enterprises expressed a strong preference for mixed media support in the storage systems used for AI
training workloads. Storage administrators need to be able to field all-flash, hybrid, and hard disk drive
(HDD)–based nodes to meet different performance, capacity, and cost requirements, and they need the
ability to easily accommodate new storage device types as they become available. While HDD storage
can provide high capacity at a low price per gigabyte for archive and other less performance-sensitive
AI data stages, all-flash configurations were used to support low latency, high degrees of concurrency,
higher storage density (i.e., TB/U), and better infrastructure efficiency for performance-sensitive data
stages and workloads. NVMe-based flash is particularly important in making efficient use of GPUs
since they operate with much higher performance than general-purpose CPUs in data-intensive
environments. IDC would also note that for data services like compression, deduplication, and
encryption that run inline, flash storage provides a much better ability to maintain low-latency
performance over time as a system grows.
IDC's AI DL Training Infrastructure Survey found that more than 83% of the enterprises that had
purchased a storage system for AI workloads had also at some point purchased a converged
infrastructure offering. The top drivers of those purchases included better administrative productivity
(due to the unified management interface), ease of purchasing based on the implied reference
architecture, a faster time to value, and a single point of support contact. When converged
infrastructure stacks already include most of the components an enterprise would have bought
separately anyway, they offer a clearly better path than assembling and validating those components
individually. And even with these offerings, enterprises do have some flexibility in adding other
components they may need going forward (although those additions may or may not be pretested by
the vendors or covered under the single support contract).
Figure 5 lists the major reasons behind enterprises' converged infrastructure stack purchase decisions
for AI DL training workloads.
(% of respondents; n = 257; base = respondents who indicated their organization purchased a converged infrastructure stack for AI training workloads)
Reference architectures are available from many vendors as well, although their benefits do not go
quite as far as converged infrastructure stacks. A reference architecture specifies a pretested
configuration using multivendor components so that they have been validated to work together but
leaves it up to the customer to buy the components from their various vendors. Ordering a complete
system requires more manual effort on the part of customers, support contacts are split across the
various vendors, and there is no unified management interface. When converged infrastructure stacks
that include the products an enterprise wants are not available, it may be better to work from a
reference architecture since enterprises will not need to validate all the product combinations
themselves.
NetApp is a $6.3 billion hybrid cloud data services and data management vendor headquartered in
San Jose, California. Today, the vendor is recognized as a leader in enterprise storage, and its broad
portfolio includes block-, file-, and object-based storage platforms as well as converged infrastructure,
technical support, and consulting services — all based around a software-defined product strategy that
offers outright purchase and subscription-based deployment options.
Working jointly with Cisco back in 2010, NetApp created the converged infrastructure market with its
FlexPod offering. In 2018, NetApp partnered with NVIDIA to create a converged infrastructure stack for
enterprise AI called NetApp ONTAP AI. The system combined NVIDIA DGX accelerated compute
systems and networking, and NetApp NVMe-based all-flash storage (the NetApp AFF A800 at the
time), along with a unified management interface, single point of support contact, and integration with
the NetApp Data Fabric. Through the NetApp Data Fabric, the vendor's unifying architecture to
standardize data management practices across cloud, on premises, and edge devices, NetApp
ONTAP AI enables enterprises to create a seamless data pipeline that spans from the edge to the core
to the cloud for AI workloads. The benefits of this targeted solution for enterprise AI workloads included
fast and easy ordering and deployment, an ability to nondisruptively scale to hundreds of petabytes of
cluster capacity, and the ability to operate with confidence using NetApp's enterprise-grade, highly
available storage platforms.
The vendor has kept this converged infrastructure offering up to date as new accelerated compute and
storage platforms have become available. Today's NetApp ONTAP AI combines the NVIDIA DGX
A100, the world's first 5PFLOPS system, with the new NVMe-based NetApp AFF systems and the
latest NVIDIA high-performance networking. The NVIDIA DGX A100 has the power to unify AI
workloads for training and inferencing as well as data analytics and other high-performance workloads
on a single compute platform, while the high-end NetApp AFF A900 delivers the low latency, high
availability, and scalable storage capacity and the proven enterprise storage capabilities to meet the
needs of both data scientists and IT administrators. The converged infrastructure offering is available
in 2-, 4-, and 8-node configurations (referring to the number of interconnected DGX systems), combined
with optimized AI software.
Bundled NetApp ONTAP AI solution components include the NVIDIA Base Command software stack,
the NetApp AI Control Plane, and the NetApp DataOps Toolkit. The NVIDIA Base Command software
stack is a full-stack suite of pre-optimized AI software including a DGX-optimized OS, drivers,
infrastructure acceleration libraries that speed I/O, enterprise-grade cluster management, job
scheduling and orchestration, and full access to the NVIDIA AI Enterprise software suite for additional
developer assets like optimized frameworks, pretrained models, model scripts, AI and data science
tools, and industry SDKs. The NetApp AI Control Plane integrates Kubernetes and Kubeflow with the
NetApp Data Fabric to simplify data management in multicloud environments, while the NetApp
DataOps Toolkit is a Python library that makes it easy for data scientists to perform common data
management tasks in AI environments like provisioning new storage, cloning data, and creating
snapshots for traceability and baselining. ONTAP AI also supports NVIDIA GPUDirect Storage for both
ONTAP NFS and the E-Series BeeGFS parallel file system.
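As an illustration of the kind of task the DataOps Toolkit is meant to simplify, the hedged sketch below clones a dataset volume for an experiment and snapshots it for traceability. The module path, function names, and parameters shown are assumptions made for the sketch, not the toolkit's documented API; the actual calls should be taken from NetApp's DataOps Toolkit documentation.

```python
# Illustrative only: the module path and function signatures below are assumed
# for the sake of the sketch; consult the NetApp DataOps Toolkit documentation
# for the actual API before using it.
from netapp_dataops.traditional import clone_volume, create_snapshot  # assumed import

EXPERIMENT = "exp-lr-sweep-01"

# Clone the curated training dataset so the experiment gets an isolated,
# space-efficient copy without physically moving any data.
clone_volume(
    source_volume_name="train_dataset",            # assumed parameter names
    new_volume_name=f"train_dataset_{EXPERIMENT}",
)

# Snapshot the clone before the run starts so the exact data used for this
# training run can be traced and reproduced later.
create_snapshot(
    volume_name=f"train_dataset_{EXPERIMENT}",
    snapshot_name=f"baseline_{EXPERIMENT}",
)
```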
At the heart of ONTAP AI are NVIDIA DGX compute systems, a fully integrated hardware and software
turnkey AI platform that is purpose-built for analytics, AI training, and AI inferencing, delivering
5PFLOPS with a 6U form factor. The NVIDIA DGX A100 integrates eight NVIDIA A100 Tensor Core
GPUs, interconnected with the NVIDIA NVSwitch architecture, offering an ultra-high-bandwidth, low-latency fabric between the GPUs.
The NVIDIA DGX A100 also features NVIDIA ConnectX-7 InfiniBand/Ethernet network adapters with 500GBps
(gigabytes per second) of peak bidirectional bandwidth. These network adapters enable the DGX A100
to serve as the building block for large AI clusters such as NVIDIA DGX SuperPOD, enabling
datacenter-scale AI performance and supporting datacenter-scale workloads.
The storage in NetApp ONTAP AI is based around the vendor's flagship scale-out storage operating
system (ONTAP). Of interest to data scientists, ONTAP environments can nondisruptively scale from 2
nodes up to 24 nodes in a single cluster and support over 700PB across multiple namespaces (a
single namespace can host over 20PB of raw capacity). Provisioning new storage is intuitive, fast, and
very easy through the NetApp Cloud Manager. The use of end-to-end NVMe technology allows
NetApp ONTAP AI clusters to support very low latencies and a high degree of concurrency to support
multiple applications simultaneously working with data across multiple data pipeline stages and
support very high-speed data loading.
NetApp worked with NVIDIA to produce several reference architectures, based on ONTAP AI, that are
targeted at use cases in different industries.
Storage administrators will also appreciate the comprehensive and proven enterprise storage
management feature set on all ONTAP arrays. ONTAP is an enterprise-class, clustered data
management solution that delivers high performance, high capacity, and nondisruptive operations (to support maintenance, upgrades, and scaling without downtime).
ONTAP runs on both NetApp's NVMe-based all-flash systems (AFF) and its hybrid and/or HDD-based
systems (FAS) and supports unified (block and file) as well as block-only (all-SAN array) storage.
Supported access methods for ONTAP include NFS, SMB, and S3 as well as Fibre Channel (FC) and
iSCSI (for block access). It includes a wide range of storage efficiency technologies, space-efficient
snapshot technology (including immutable snapshots), many replication options (including both
synchronous and asynchronous as well as MetroCluster configurations for instant recovery with zero data loss in
metro environments), and support for popular APIs from vendors such as VMware, Microsoft, and
Oracle. Storage administrators interested in a deeper take from IDC on what ONTAP brings to the
table can refer to Meeting the High Availability Requirements in Digitally Transformed Enterprises (IDC
#US48442021, March 2022).
ONTAP also supports tiering to external S3 targets whether those are on premises (e.g., using the
vendor's object-based storage called StorageGRID) or off premises (any public cloud provider that
supports S3 storage). The use of external storage targets can provide a very cost-effective, massively
scalable platform for long-term archiving. It also can provide a potential secondary HDD-based tier
(with the first HDD-based tier being in ONTAP-based FAS systems).
ONTAP also supports S3 protocol access, allowing NFS mount points and their associated files to be
accessed as buckets — a feature named multiprotocol access.
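A short sketch of what that multiprotocol access looks like from the data science side follows, using the standard boto3 S3 client. The endpoint, credentials, and bucket name are placeholders, and the example assumes an NFS export has already been exposed as an S3 bucket as described above.

```python
import boto3

# Placeholder endpoint and credentials for an ONTAP S3 server; in practice these
# come from the storage team. The bucket maps to an existing NFS export.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ontap-s3.example.internal",
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Files that data engineers wrote over NFS show up as objects in the bucket,
# so cloud-native tools can read the same data without copying it.
response = s3.list_objects_v2(Bucket="train-dataset", Prefix="images/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```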
BeeGFS is a parallel file system with a distributed metadata architecture that was developed and
optimized for HPC environments demanding extremely high throughput to very large files. It is POSIX
compliant, offers an extremely scalable intelligent (parallel) client, and is widely used with technical
and HPC workloads. NetApp has created a reference architecture around BeeGFS called "NetApp E-
Series Storage with BeeGFS" to meet these types of requirements, which can also arise with certain
AI-driven workloads. This reference architecture is based on the NetApp E-Series systems, which
support all-flash, hybrid, and all-HDD-based systems to meet a variety of requirements. Included within
the E-Series portfolio are the EF-Series systems, which are all-flash, based on NVMe, and deliver the
lowest latencies within the NetApp storage portfolio for the most demanding workloads. The E-Series
systems are enterprise grade, include a variety of storage management capabilities, support Ansible
automation, and feature "six nines" (99.9999%) availability in a highly scalable system built around
performance-optimized storage building blocks. Although BeeGFS is free open source software,
NetApp customers purchasing this solution can obtain a support contract that comprehensively covers
both the hardware and software in this reference architecture solution.
When customers are looking for a parallel file system, prefer to use InfiniBand for storage interconnect,
and are working with single long-running jobs using larger file sizes, NetApp E-Series Storage with
BeeGFS is the better fit.
NetApp has also delivered solutions in the cloud for a range of AI use cases.
CHALLENGES/OPPORTUNITIES
Enterprises vary in terms of who makes the buying decisions for AI infrastructure. IDC's AI DL Training
Infrastructure Survey indicated that the CIO/CTO level made that decision in almost 40% of cases,
while the IT infrastructure team made it in 34% of cases and lines of business made it in over 10% of
cases. Data scientists and AI application developers made that decision in 9.5% of cases. To select
the best system for a given enterprise's AI training workloads, decision makers must have a good idea
of the requirements from the data scientist/developer side as well as the IT operations side. When
selecting the right storage infrastructure for these workloads, it is important that all those affected are
consulted and have an opportunity to agree on objectives and priorities.
The opportunity for enterprises when adding a new storage system specifically for AI workloads is to
determine how much other workload consolidation that platform could (and should) support. The more
AI data pipeline stages can be consolidated onto a single storage platform, the better, and when that platform can also absorb additional workload types, the economics improve further.
Within enterprises themselves, it is to their benefit if the requirements of all constituencies are
understood up front as well, so they can be very clear with potential storage vendors what their
different organizations require in an AI storage platform. Enterprises should seek to maximize
workload consolidation during infrastructure modernization efforts to drive better infrastructure
efficiencies but need to ensure that they do not compromise performance, availability, or security in
doing so.
CONCLUSION
While it is still early in the move toward AI in the enterprise, it is clear that this will be a central and
strategic workload in digitally transformed enterprises. Storage infrastructure spend in enterprises for
AI workloads alone will be a $5.4 billion market by 2024. A large percentage of this spend will be for
the replacement of existing storage systems that, after getting past AI pilots, enterprises discover
cannot provide the performance and scalability needed for AI workloads. DL workloads in particular
require large data sets, and delivering performance consistently at scale for those applications
outpaces the capabilities of many general-purpose storage systems.
AI workloads generally have a number of data pipeline stages running from ingest and transformation
through training, inferencing, and archiving, and enterprises are rightly focusing on storage systems
that can handle all those stages at once without undue performance and availability
impacts. When a single storage system can meet all the requirements, it offers far better economics
than older approaches that require storage silos for one or more stages. As enterprises go through
digital transformation in general, they are looking to improve IT infrastructure efficiencies, and being
able to consolidate all stages of AI workloads on a single system goes a long way toward achieving
that goal. When those storage systems have the performance and capacity to be able to consolidate
additional workload types, the economic advantages only get better.
NetApp ONTAP AI is a converged infrastructure stack that combines NVIDIA DGX accelerated
compute, NVIDIA high-speed switching, ONTAP-based storage, and a large selection of tools to help
manage AI workloads effectively. The stack combines all of those components into an integrated
system, purchased under a single SKU, that is easy to buy and deploy, is fully supported by NVIDIA,
and includes a unified management interface that boosts administrative productivity for these types of
configurations. These systems are based on proven technologies, and NetApp has deployed hundreds
of these systems for AI workloads in enterprises over just the past several years. Overall, NetApp has
tens of thousands of storage systems in enterprises across all workloads and use cases, but the
vendor has been very successful in delivering rapid time to value for customers using ONTAP for AI
workloads. Almost 60% of enterprises are also running non-AI workloads on the storage systems they
are using for those workloads, and NetApp ONTAP offers a comprehensive set of capabilities to
enable these types of multitenant environments without compromising performance, availability, or security.
APPENDIX
Figures 6–9 provide additional results from IDC's AI DL Training Infrastructure Survey.
FIGURE 6 (% of respondents; n = 314; base = all respondents): Yes, 93.3%; No, 6.7%
FIGURE 7 (% of respondents; n = 251; base = respondents who indicated their organization currently runs AI deep learning training projects on premises only, in a nonintegrated mix of on premises and cloud, or in an integrated hybrid cloud): Yes, 68.5%; No, 31.5%
FIGURE 8 (% of respondents; n = 314; base = all respondents): Cloning a data set for development, testing, and/or training runs, 43.0%; Other, 0.0%
FIGURE 9 (% of respondents; n = 294; base = respondents who indicated their organization currently runs AI deep learning training workloads in its IT environment)
Copyright Notice
External Publication of IDC Information and Data — Any IDC information that is to be used in advertising, press
releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or
Country Manager. A draft of the proposed document should accompany any such request. IDC reserves the right
to deny approval of external usage for any reason.