ENTERPRISE AI IN THE CLOUD
Rabi Jay
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means,
electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of
the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through
payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923,
(978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission
should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201)
748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its
affiliates in the United States and other countries and may not be used without written permission. All other trademarks
are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned
in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this
book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book
and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be
created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not
be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware
that websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not
limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care
Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
If you believe you’ve found a mistake in this book, please bring it to our attention by emailing our reader support team at
wileysupport@wiley.com with the subject line “Possible Book Errata Submission.”
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in
electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Control Number: 2023950107
THERE HAVE BEEN so many people who helped make this book a reality. Without them you would not be read-
ing this today.
I would like to thank my wife, Suji; my son, Rohan; my friends Dominic, Sundi, Karthik, J Siva, Ramkumar, Aravind, Rajesh, and Kumaran; Stanley Mammen for his tips; CM for helping me promote the book; and many others for their ongoing support and for putting up with me through it all.
I had the opportunity to learn from many of the top minds in the enterprise AI space, as well as some brilliant
authors, during the development of this book. I hope some of that has been translated into the knowledge I have
imparted here in this book.
Special thanks to everyone who helped edit this book. Thanks to Kenyon Brown, acquisition editor, who helped
me throughout the writing process and made it a pleasant experience.
Thanks to Kezia Endsley for her tireless editing assistance, for always being quick and positive, and for keeping
me on target through it all. She was very supportive of my efforts, and her kind approach helped me when the
going was tough. Her feedback and input on every chapter of the book helped make it ten times better than it
would have otherwise been. I would also like to thank Evelyn Wellborn for catching many errors as part of the
proofreading process that has certainly elevated the quality of this book.
Thanks to Navin Vijayakumar, managing editor, and Pete Gaughan, senior managing editor, for their insight into
planning the content and marketing tips. Also many thanks to Magesh Elangovan and Vijayprabhakar Settu for
their help with proofreading and images respectively.
A huge thanks to my friend and technical editor, Irwin Castelino, for reading the first drafts and making his
technical edits. His critical evaluation and suggestions helped me make this a valuable product from the reader’s
perspective.
Without these folks and more, this book simply would not have happened.
Thank you,
—Rabi Jay
ABOUT THE AUTHOR
RABI JAY IS a seasoned digital transformation specialist with more than 15 years steering a ship through the
uncharted waters of industries such as retail, aerospace, and software technology. Armed with a plethora of certi-
fications, including AWS Machine Learning, Azure, ITIL, and SAP, he’s your go-to guy for anything cloud and AI.
In various roles, from digital platform lead to VP of architecture, he’s led transformative projects in Martech, AI,
and business process optimization. But Rabi Jay isn’t just about executing projects; he’s about shaping the very
landscape of digital transformation. As a global alliance manager for major cloud corporations, he’s had a bird’s-
eye view of tech upheavals across diverse industries.
The author of SAP NetWeaver Portal Technology, published by McGraw-Hill, Rabi Jay is also the voice behind
the LinkedIn newsletter and podcast Enterprise AI Transformation, where he dishes out cutting-edge insights. In
his role as VP of digital transformation, he has championed the convergence of human-centric design with state-
of-the-art AI platforms and change management methodologies.
Why does all this matter? Because Rabi Jay is not just an expert; he’s a visionary. Passionate about guiding
companies through the evolving maze of AI and digital transformation, he’s got his finger on the pulse of what’s
next. And he’s bundled all this expertise into this latest endeavor: Enterprise AI in the Cloud: A Practical Guide to
Deploying End-to-End Machine Learning and ChatGPT Solutions.
He is excited about what the future holds for us and looks forward to playing a lead role in shaping this AI-fueled
smart world of the future. Connect with him to be part of this revolution.
➤➤ LinkedIn: www.linkedin.com/in/rabijay1
➤➤ Website: rabiml.com
➤➤ Twitter: https://twitter.com/rabijay1
➤➤ YouTube: www.youtube.com/@rabijay1
➤➤ Podcast: https://open.spotify.com/show/7vCeNI8c02pvIgeYDE8xwS
IRWIN CASTELINO IS an executive with 15+ years of experience in digital transformation and data analytics, lev-
eraging AI and ML to optimize business processes for organizations. His expertise spans the implementation and
integration of enterprise applications such as SAP and Oracle. He has delivered predictive solutions to optimize
container shipments and improve equipment maintenance to enable better ROI and uptime of resources. The
solutions he has delivered have consistently been a mix of internal and external data flows, with very large data-
sets being an integral part of the delivered solutions. His clients have ranged across the supply chain, beverage,
food distribution, pharmaceuticals, personal products, banking, energy, chemicals, manufacturing, automotive,
aerospace, and defense industries.
Introduction
WELCOME TO Enterprise AI in the Cloud: A Practical Guide to Deploying End-to-End Machine Learning and
ChatGPT Solutions. This book is the definitive guide to equip readers with the methodology and tools necessary
to implement artificial intelligence (AI), machine learning (ML), and generative AI technologies. You have in your
hands a powerful guide to potentially transform your company and your own career.
In this book, you learn how to
➤➤ Develop AI strategy, solve challenges, and drive change
➤➤ Identify and prioritize AI use cases, evaluate AI/ML platforms, and launch a pilot project
➤➤ Build a dream team, empower people, and manage projects effectively
➤➤ Set up an AI infrastructure using the major cloud platforms and scale your operations using MLOps
➤➤ Process and engineer data and deploy, operate, and monitor AI models in production
➤➤ Govern models and implement AI ethically and responsibly
➤➤ Scale your AI effort by setting up an AI center of excellence (AI COE), an AI operating model, and an
enterprise transformation plan
➤➤ Evolve your company using generative AI such as ChatGPT, plan for the future, and continuously inno-
vate with AI
From real-world AI implementation, AI/ML use cases, and hands-on labs to nontechnical aspects such as team
development and AI-first strategy, this book has it all.
In a nutshell, this book is a comprehensive guide that bridges the gap between theory and real-world AI deploy-
ments. It’s a blend of strategy and tactics, challenges, and solutions that make it an indispensable resource for
those interested in building and operating AI systems for their enterprise.
➤➤ Part IV: Building and Governing Your Team: People make magic happen! Part IV explores the organizational changes required to empower your workforce. I guide you through the steps to launching your pilot and assembling your dream team. It’s all about nurturing the human side of things.
➤➤ Part V: Setting Up Infrastructure and Managing Operations: In this part, you roll up your sleeves and
get technical. Part V is like your DIY guide to building your own AI/ML platform. Here, I discuss the
technical requirements and the daily operations of the platform with a focus on automation and scale.
This part is a hands-on toolkit for those who are hungry to get geeky.
➤➤ Part VI: Processing Data and Modeling: Data is the lifeblood of AI. Part VI is where you get your hands
dirty with data and modeling. I teach you how to process data in the cloud, choose the right AI/ML
algorithm based on your use case, and get your models trained, tuned, and evaluated. It is where the
science meets the art.
➤➤ Part VII: Deploying and Monitoring Models: Yay! It is launching time. Part VII guides you through the
process of deploying the model into production for consumption. I also discuss the nuances of moni-
toring, securing, and governing models so they are working smoothly, safely, and securely.
➤➤ Part VIII: Scaling and Transforming AI: You have built it, so now you can make it even bigger! In Part
VIII, I present a roadmap to scale your AI transformation. I discuss how to take your game to the next
level by introducing the AI maturity framework and establishing an AI COE. I also guide you through
the process of building an AI operating model and transformation plan. This is where AI transitions
from the project level to an enterprise-level powerhouse.
➤➤ Part IX: Evolving and Maturing AI: This is where you peek into a crystal ball. I delve into the exciting
world of generative AI, discuss where the AI space is headed, and provide guidance on how to continue
your AI journey.
C-Level Executives
This is a great book for C-level executives who want to learn about the strategic impact, business case, and execu-
tion of AI transformation at scale across the enterprise.
One of their major struggles is applying AI practically and profitably to their business processes and strategies. This book provides a detailed list of AI use cases and guides them in developing an AI strategy and business case, thus helping them make intelligent and informed decisions.
UNIQUE FEATURES
This book is a toolbox with tools to achieve end-to-end AI transformation. This isn’t your regular technical book,
and here’s why.
Up-to-Date Content
I believe that my book is the most comprehensive and up-to-date guide to AI transformation available. It is a
must-read for anyone who wants to use AI to transform their business.
I wish you the best on your enterprise AI transformation journey. You can reach me at www.linkedin.com/
in/rabijay1. You can also keep in touch with my latest progress via rabiml.com.
Hands-on Approach
The chapters in this book have been aligned to the various steps in an enterprise AI implementation initiative, starting all the way from strategy to execution to post-go-live operations. Each chapter contains a number of review questions to cement your understanding of the topics covered in the book. In addition, I have added a number of hands-on exercises, best practice tips, and downloadable templates with examples to guide you in your enterprise-wide AI implementation.
PART I
Introduction
In this section, we dive into how enterprises are undergoing transformation through the adoption of AI
using cloud technologies. I cover industry use cases for AI in the cloud and its benefits, as well as the
current state of AI transformation. I also discuss various case studies of successful AI implementations,
including the U.S. Government, Capital One, and Netflix.
1
Enterprise Transformation with
AI in the Cloud
The future of computing is at the edge, powered by AI and the cloud.
—Satya Nadella
Welcome to the exciting journey of enterprise transformation with AI in the cloud! This chapter is designed
for anyone eager to understand the power and potential of AI in today’s business landscape. You’re prob-
ably here because you sense that AI isn’t just another buzzword, but a game-changer. But how exactly can
it transform your business? That’s the question explored in this chapter.
[Figure: Pie charts showing the current state of enterprise AI adoption, broken down into the categories Considering AI, Tested PoCs with limited success, Limited AI adoption of use cases, Launched PoCs/Ready to scale, and Enterprise-wide AI adoption, with the percentage of companies at each stage.]
NOTE Enterprise AI transformation involves the implementation of end-to-end AI, ML, and
Gen AI systems to drive business outcomes.
Define enterprise AI transformation and the importance of adopting AI and ML, including
generative AI technologies for enterprises.
[Figure: The enterprise AI transformation cycle: (1) AI-first strategy mindset; (2) current state, business objectives, and AI-first strategy; (3) identify use cases; (4) plan to integrate AI; (5) build PoC, set up infrastructure and tools, and develop skills; (6) measure results, leading to increased innovation, growth, and profitability.]
An AI-first strategy is first a change in mindset that seeks to embrace AI to achieve business objectives. It involves
identifying use cases to adopt AI, planning for proof of concepts (PoCs), and coming up with a plan to integrate
AI into various aspects of an organization, such as customer service, product development, and decision-making.
It includes building the necessary infrastructure, tools, and skills; building a commitment to learning; partnering
with experts and vendors to be at the forefront of AI; and embracing AI for business opportunities.
It also promotes greater collaboration between IT and business and gives companies a competitive advantage to
drive innovation, growth, and profitability.
If you do not adopt the AI-first strategy, you run the risk of losing your competitive advantage. You may not use
the resources effectively, continue to live in silos between the business and IT, and, more importantly, lose out on
opportunities to drive innovation, growth, and increased profits.
Adopting an AI-first strategy is the scope of this book. By following the steps outlined in this book, you learn to
implement an AI-first strategy in your company.
Explain what AI-first strategy means and mention at least three benefits of an AI-first strategy.
Using Netflix or any other case study, explain how the adoption of cloud and AI technologies
led to process transformation.
Explain the role of cloud computing in implementing robust, scalable, and ethical AI.
Scale
We are currently seeing two trends in the modern world. One is that customers constantly demand immediacy.
Big Data
Companies are no longer tethered to their physical data centers. Thanks to Amazon, Google, and Microsoft’s
cloud services, companies can now set up a lot of computing and networking power in minutes.
The cloud provides machine learning frameworks (such as MXNet, TensorFlow, and PyTorch) and AI services
(such as transcription, image recognition, and speech recognition) that anyone can leverage to do intense machine
learning computations to predict, classify, and
enable quick actions.
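To make the predict-and-classify idea concrete, here is a toy, framework-free sketch in plain Python (the data points are invented for illustration); frameworks such as TensorFlow or PyTorch apply the same principle at vastly larger scale on cloud hardware:

```python
# A toy illustration of classification: a tiny perceptron learns a boundary
# that separates two classes of 2-D points. Cloud ML frameworks run the same
# kind of computation over millions of examples on elastic infrastructure.
def train_perceptron(samples, epochs=20, lr=0.1):
    """samples: list of ((x1, x2), label) with label in {-1, +1}."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in samples:
            if label * (w1 * x1 + w2 * x2 + b) <= 0:  # misclassified: update
                w1 += lr * label * x1
                w2 += lr * label * x2
                b += lr * label
    return w1, w2, b

def predict(weights, point):
    w1, w2, b = weights
    x1, x2 = point
    return 1 if w1 * x1 + w2 * x2 + b > 0 else -1

# Invented, linearly separable training data
data = [((2.0, 1.0), 1), ((3.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
weights = train_perceptron(data)
print(predict(weights, (2.5, 1.5)))    # → 1
print(predict(weights, (-1.5, -1.5)))  # → -1
```

The same learning loop, scaled up to deep networks and large datasets, is what the cloud-hosted frameworks and GPU fleets make practical.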
Innovative Products and Services
Recently, generative AI has caught the attention of many. It was made possible by the underlying computing power that cloud technologies provide, along with the data infrastructure to process petabytes of data to train these models.
NOTE The need for different architectures for different use cases has forced companies to rely increasingly on cloud computing, which can meet these unique demands on infrastructure.
Enterprise-wide AI Opportunities
Speaking of AI opportunities, you can categorize AI opportunities as enterprise-wide and industry-specific oppor-
tunities. Let’s discuss enterprise AI opportunities first.
➤➤ Customer complaint categorization: Some examples of automation include sorting and categorizing incoming customer complaints using natural language processing to understand the content and classify it as technical issues, shipping issues, or billing issues.
➤➤ Computer vision systems in quality control: Computer vision systems can be used to analyze videos and images to spot defects in products in a car assembly or computer manufacturing plant.
➤➤ Improved customer service via chatbots: Better customer service through chatbots using NLP to answer queries 24/7, route to the right person, and schedule appointments.
➤➤ Customer recognition with computer vision and speech recognition: Using computer vision and speech recognition to recognize customers on the website and in contact centers and to tailor the experience accordingly.
➤➤ Speech recognition for call and chat analysis: Use speech recognition to analyze phone calls and chats to identify areas of improvement as well as to train agents.
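The complaint-categorization idea can be sketched with a toy text classifier. This is a hypothetical, minimal illustration (the training complaints and categories are invented, and a real system would use a proper NLP library or a cloud NLP service):

```python
# A tiny bag-of-words Naive Bayes classifier that routes complaints into
# billing, shipping, or technical categories. Everything here is a toy
# stand-in for production NLP pipelines.
import math
from collections import Counter, defaultdict

# Invented labeled examples
training = [
    ("i was charged twice on my invoice", "billing"),
    ("refund the extra payment on my bill", "billing"),
    ("my package arrived late and damaged", "shipping"),
    ("the delivery tracking number does not work", "shipping"),
    ("the app crashes when i log in", "technical"),
    ("error message appears on the checkout page", "technical"),
]

word_counts = defaultdict(Counter)  # per-category word frequencies
class_counts = Counter()            # documents per category
vocab = set()
for text, label in training:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(text):
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior + log likelihood with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("why was my bill charged twice"))  # → billing
```

A deployed version of this pattern would classify each incoming complaint and route it to the right queue automatically.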
[Figure: Enterprise AI opportunities: automate processes, optimize processes, transform compliance processes, enhance employee collaboration, improve customer service, and develop innovative products/services.]
Identify New Customer Needs and Develop Innovative Products and Services
AI can identify new customer needs to develop new products and services by analyzing social media, purchase
history, reviews, complaints, and browsing behavior.
A financial company may find that customers are paying a third-party service higher fees to move money overseas, in which case the company can create a money transfer service that’s cheaper and of better quality.
➤➤ Competitor analysis: Companies can analyze competitors’ products, pricing, and customer reviews to identify customer pain points and meet a need in the marketplace.
➤➤ Network analysis for business expansion: Companies can use network analysis of internal and external partners and individuals to identify influential clusters, gain access to more business ideas and new customers, and contact them via social media.
➤➤ Enhanced collaboration via chatbots and predictive insights: Using chatbots, better decision-making, and predictive insights, employees can collaborate more effectively to drive better business outcomes.
➤➤ AI-powered workflows for complex projects: Employees can use AI-powered workflows to collaborate more effectively on complex projects. This can help automate routine tasks so employees can work more strategically.
➤➤ Analysis of project management and collaboration tools: By analyzing project management tools, collaboration tools, and productivity apps, AI can explain why breakdowns happen along with solutions. AI can recommend which employee is best suited for a job.
➤➤ Videoconferencing for remote collaboration: Employees can use AI-powered videoconferencing to work remotely and collaborate virtually from anywhere. AI can recommend ideal times to have calls.
➤➤ Predictive analytics for proactive issue identification: You can use predictive analytics to identify issues proactively before they happen and deal with individuals and departments accordingly.
➤➤ Chatbots and virtual assistants for compliance: You can also use chatbots and virtual assistants to answer compliance questions from employees.
➤➤ Changing compliance requirements: AI can help keep pace with changing compliance requirements by staying up-to-date and proposing improvement areas in compliance programs.
Fraud Detection, Risk Management, and Customer Service in the Finance and
Insurance Industries
Here are some finance and insurance industry–related use cases:
➤➤ Fraud detection, risk management, and customer service: AI is widely adopted in the finance industry for use cases such as fraud detection, risk management, and customer service.
➤➤ Loan processing by smart robots: Human agents are now being replaced by smart robots that can process loans in just a few seconds.
➤➤ Robo-advisors for investment decisions: Financial advisors are being replaced by robots that can process large amounts of data to make the right investment decisions for customers. These robots are smart enough to analyze data from social media, emails, and other personal data.
➤➤ Claims processing and product recommendation in insurance: AI is used in the insurance industry to reduce claims processing time and recommend insurance plans and products based on customer data.
NOTE Adoption of AI can enable better patient outcomes such as better health due to better diagnosis and treatment, reduced costs due to reduced patient readmissions, and increased operational efficiency.
➤➤ Predictive maintenance, quality control, and supply chain optimization: AI is used for predictive maintenance, quality control, and supply chain optimization.
➤➤ Collaborative robots (cobots): This industry is also seeing the advent of cobots, which are robots that work collaboratively with humans.
Amazon is a classic example of how AI can be used to provide product recommendations based on the
user’s browsing behavior.
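As a rough illustration of how browsing behavior can drive recommendations, here is a minimal co-occurrence sketch (the products and sessions are invented, and production recommenders such as Amazon’s are far more sophisticated):

```python
# Items that appear together in the same browsing sessions are treated as
# related; the items most often co-browsed with a given item are recommended.
from collections import Counter, defaultdict
from itertools import combinations

# Invented browsing sessions
sessions = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse", "laptop-bag"],
    ["phone", "phone-case", "charger"],
    ["laptop", "keyboard", "monitor"],
]

# Count how often each pair of items shares a session
co_occurrence = defaultdict(Counter)
for session in sessions:
    for a, b in combinations(sorted(set(session)), 2):
        co_occurrence[a][b] += 1
        co_occurrence[b][a] += 1

def recommend(item, k=2):
    """Top-k items most often browsed alongside `item`."""
    return [other for other, _ in co_occurrence[item].most_common(k)]

print(recommend("laptop"))  # keyboard and mouse (each co-browsed twice)
```

Real systems layer collaborative filtering, embeddings, and ranking models on top of this basic signal, but the intuition is the same: behavior reveals related products.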
➤➤ Self-driving cars: Active research is being carried out by Tesla, Volvo, Uber, and Volkswagen, and though the technology is used in controlled situations, pretty soon we will be seeing self-driving cars more commonly on the road.
➤➤ Navigational and driver assistance systems: Some of the practical applications of AI are navigating obstacles and roads, and driver assistance systems such as adaptive cruise control, automatic emergency braking, lane departure warning, and radars and cameras to enable safer driving.
➤➤ Warehouse automation and shipment optimization: AI-powered robots are sorting products in warehouses for delivery, moving cases within warehouses, delivering for pallet building, and so on. AI algorithms are used for shipment route determination and optimization for cost minimization.
➤➤ Public transport management: Even in public transport, AI is used for traffic light management, transportation scheduling, and routing.
➤➤ Taking orders and serving food: AI is used to take orders and serve food.
➤➤ Crop management in agriculture: In agriculture, AI is used to raise crops by analyzing salinity, heat, UV light, and water.
➤➤ Smart machinery in farming: Smart tractors and plucking machines are also being used in the farming sector.
➤➤ Price prediction, buyer-seller matching, and market analysis: Predict prices and rental income, match buyers with sellers, and analyze market conditions and trends.
➤➤ Automated document processing and customer support: Process real estate documents automatically and use chatbots to answer customer queries.
➤➤ Smart home technology: Smart home technology for increased security and energy efficiency in the home, including voice assistants like Alexa, Google Assistant, and Siri; smart appliances like refrigerators, washing machines, and thermostats; smart lighting; and home security systems.
➤➤ Recommendation systems in streaming platforms: Netflix and Amazon use AI to recommend movies and songs to users based on their preferences and viewing behavior.
➤➤ Storyboarding and music composition in movies: In the movie industry, AI is used to build a script for storyboarding and is even used to compose music.
➤➤ Storyline advancement in gaming: AI is used in the gaming industry to advance the storyline in a specific direction to keep the player engaged.
➤➤ Predictive maintenance and energy management: Predictive maintenance of equipment such as transmission lines and power plants, managing the supply and consumption of energy, and integrating with renewable energy to reduce costs.
➤➤ Energy efficiency in buildings: Improve energy efficiency in consumption in office spaces and homes by adjusting cooling and heating systems and lighting.
➤➤ Fraud prevention: Prevent fraud such as billing errors and meter tampering.
SUMMARY
The key takeaways from this chapter underscore the necessity for organizations to take AI seriously to gain a
competitive edge. This means elevating machine learning projects from mere PoC pet projects into large-scale,
enterprise-wide transformative efforts. To achieve this, you need to adopt a more systematic AI-first strategy and
methodology, the primary focus of this book.
This chapter also delved into the transformative power of cloud computing, enabling many use cases previously
unattainable and providing a pathway to thrilling consumer experiences. You reviewed some of the use cases that
AI and ML enabled and realized their broad adoption and impact across various industries.
The next chapter showcases three case studies of companies that successfully implemented AI on an enterprise-
wide level.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. AI and ML can be incorporated into business processes to
A. Detect quality issues in production
B. Analyze the user sentiment in a Twitter feed
C. Translate documents
D. All the above
2. An AI-first strategy
A. Prioritizes customer service over AI and ML technologies
B. Focuses on product management over AI technology
C. Focuses on embracing AI to achieve business objectives
D. Focuses on building a solid technology platform for AI
3. Prioritizing data and AI initiatives helps
A. Companies to focus on customer service
B. Companies to identify new business opportunities
C. Companies to enable greater collaboration between IT and business
D. Companies to build stronger business cases
4. Neglecting an AI-first strategy will lead to
A. Losing the competitive edge
B. Not using resources effectively
C. Living in silos between AI and the business
D. Innovation, growth, and increased profits
5. What is the main advantage that digital natives have when it comes to adopting AI?
A. They have a risk-taking culture and view failure as a steppingstone for success.
B. They have a lot more money to invest in AI.
C. They have a better brand image than traditional companies.
ANSWER KEY
1. D    2. C    3. B    4. A    5. A    6. D    7. D    8. C    9. C    10. C    11. B    12. C    13. D
2
Case Studies of Enterprise
AI in the Cloud
Every company is a software company. You have to start thinking and operating like a digital company.
It’s no longer just about procuring one solution and deploying one. It’s not about one simple software
solution. It’s really you yourself thinking of your own future as a digital company.
—Satya Nadella, CEO, Microsoft
Now that you have learned about some of the possibilities of AI in the previous chapter, this chapter dives
deeper into the fascinating world of enterprise transformation with AI in the cloud through three real-
world case studies. If you are interested in knowing how AI transforms industries, this chapter is tailor-
made for you.
You not only learn how giants like the U.S. government, Capital One, and Netflix have leveraged AI and
cloud technologies, but their stories also serve as practical examples that inform the methodology that I
present in this book.
The key takeaway is how these organizations built resilient systems by embracing cloud-native principles
and AI technologies to become AI-first companies, which is essential for any enterprise aspiring to thrive in
the digital age.
FIGURE 2.1: Challenges, benefits, and solutions adopted by the U.S. government
[The figure shows three solution elements: a data lakes foundation, cloud migration, and augmented AI and ML.]
The following table lists the challenges faced by the U.S. government:
➤➤ Enrollment process complexity: Given the large volume of applications, one can only imagine how complex this process can be and how long it can take. Millions of customers depend on it for their healthcare, unemployment needs, and to stay out of poverty. The challenges with the enrollment process include complex application processes, huge backlogs in processing the claims, and lengthy adjudication, evaluation, and eligibility processes.
➤➤ Long customer response time: Customers had to wait several weeks to get a response even though a large workforce was trying to serve them.
➤➤ Inadequacies of legacy systems: Legacy systems provided a poor customer experience and an inability to scale during surges in demand. They also lacked self-service capabilities, a mobile application submission option, text communications for case status, and the ability to schedule appointments online.
➤➤ Multisource application processing: The applications came from multiple sources, such as the web, mail-in, and contact centers, and in multiple formats, such as PDFs and images. Reviewing these applications in such large quantities took a lot of time and effort, was error-prone, needed domain expertise, and was not an efficient process. Setting up human review systems to process this many applications not only was costly but also took a long time. It involved custom software for review tasks, complex workflows, and many reviewers.
➤➤ Security concerns: Another challenge was the need to maintain data security, as millions of documents amounting to terabytes of data needed to be stored, and sensitive data needed to be protected and encrypted.
➤➤ High volume of customer service calls: Another major challenge was the number of customer service calls, amounting to more than 50 million calls just for the Social Security Administration team. These calls were related to pending application status disputes and other benefits-related questions.
➤➤ Lack of management insight: Moreover, management lacked insight into the program operations, enrollment issues, backlogs, budgeting, waste, fraud, and abuse.
➤➤ AI and A2I technologies: Automated some of the workflow steps, such as adjudication and approvals.
➤➤ Data-driven insights: Provided leadership with the required information to take remedial actions.
➤➤ Cloud migration: Leveraged some of the operational efficiencies that come with the cloud, such as scalability, efficiency, and performance.
➤➤ Data management: To begin with, they had the data challenge. They had data spread across multiple parts of the organization, such as customer relationship management (CRM), enterprise resource planning (ERP), and streaming data from social media. They had to bring all that into one place, which they did with Amazon S3.
➤➤ Infrastructure challenges: They wanted to solve their infrastructure problems. They wanted to become an agile company and reduce their development time from months to days or even minutes.
➤➤ Cloud migration challenge: They wanted to migrate their 11 data centers to the AWS cloud and leverage cloud-native development capabilities to develop a resilient system.
Moving to the cloud was one of the key steps many organizations took on this journey to become an AI-first company.
NOTE Cloud-first thinking lays the foundation for a machine learning platform by adopting
cloud principles when developing applications, deploying them, and serving your customers.
[Figure: The path to thrilling customer experiences as an AI-first company. Scattered data is collected into a data lake; migration into the cloud reduces environment build time from months to minutes; infrastructure transformation brings microservices architecture, CI/CD pipelines, blue-green switching, A/B testing, and failover techniques; cloud-native principles enable DevOps, agile development, and a customer-centric approach; AI/ML models generate new insights with high scalability, reliability, high availability, and reduced cost; the mindset shift culminates in customer experiences such as the Eno chatbot, fraud detection, and speech recognition in call centers.]
NOTE You will find that most companies follow this path—adopting technology results in
process improvement. Process improvement leads to organizational change, which results in
product transformation, which leads to innovation in developing new products and services,
thus generating new business outcomes.
Netflix’s story can easily be applied to any digital-native company such as Uber, Amazon, and Airbnb. They have
all followed this path to become world-class, AI-first companies.
[Figure: Technology adoption (cloud computing, big data, machine learning) leads to product transformation and product innovation, yielding intelligent products and new business models.]
Transition to streaming service: By adopting the cloud, Netflix could switch from DVDs to a streaming service. It helped Netflix not only disrupt its competitors but also become a global company almost overnight, as it could reach people worldwide.
Utilization of data analytics: Netflix used customer behavior and profile data from a data analytics platform to generate new insights and drive its content production, acquisition, and marketing to focus on its customers' needs.
Technology adoption: By adopting technology, Netflix reduced costs while improving its operational efficiency and customer experience.
Driving process change: It also allowed Netflix to drive process change by reshaping its operating model around products and services. This ability to quickly develop new software helped Netflix focus more on products and reorganize its business model around those products and services.
Breaking down of functional silos and organizational transformation: It has also helped Netflix break down its functional silos. In the process, the company became an agile, product-centric, and customer-focused organization. Thus, the process transformation drove organizational change toward a product-centric, customer-centric, agile organization.
Product innovation driven by organizational transformation: Netflix leveraged its organizational transformation to drive product innovation. By being customer-centric, Netflix has been able to come up with new products and services, for instance, streaming original content. Its customer centricity has also helped it prioritize its recommendation algorithms as part of the streaming service.
Continuous innovation and customer centricity: Being agile helped Netflix regularly identify the need for new customer experiences. It created local content to meet regional demand, which has helped it grow globally. Its customer centricity is also helping it determine the need for new programs, such as letting viewers choose their own storyline in its hit interactive program Bandersnatch.
NOTE The cloud made becoming agile easy. By adopting the cloud, Netflix could develop
software iteratively and quickly. In other words, it helped them adopt an agile development
approach rather than a waterfall methodology. Agile, in turn, helped them become more
product-centric and customer-focused.
By adopting cloud and AI innovation, companies find it simpler to transform their processes, which drives organizational transformation. Consequently, they become more product-centric and customer-centric.
In turn, this creates the right conditions for innovation, breaking down silos and encouraging experimentation
that eventually results in new products, services, and business models like you just saw in the Netflix
example.
AI helps you respond quickly to business events in an agile manner. Agility is one of the few competitive advantages companies have today.
Today's customers want to be treated to unique offers based on their specific needs, and in real time. That means the business must crunch massive amounts of data to generate new insights and act in minutes rather than the days and weeks it used to take. Machine learning has now made this possible. You too can achieve these feats by adopting the best practices and the comprehensive methodology in this book.
Explain how Netflix enabled process transformation, organizational change, and product
innovation. Research other companies that have achieved similar transformation.
SUMMARY
This chapter delved into case studies of how three organizations transformed into world-class organizations by adopting technology that spurred process improvement and organizational change. This transformative journey eventually helped them become customer-centric, leading to product innovation and even pioneering new business models.
The next chapter unravels some of the challenges of implementing a machine learning deployment that might
prevent organizations from deploying one into production.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. The U.S. government solved its challenges processing multisource applications by
A. Setting up human review systems
B. Using custom software to manage review tasks
C. Automating some of the workflows using AI/ML technology
D. All of the above
2. What is the common theme among companies that have become world-class?
A. They focus on reducing business risks.
B. They prioritize enhancing their performance in terms of the environment, society, and governance.
ANSWER KEY
1. D 3. A 5. A
2. C 4. C
PART II
Strategizing and Assessing for AI
In this part, we discuss the nitty-gritty of AI, such as the challenges you may face during the AI journey
along with the ethical concerns, and the four phases that you can adopt to build your AI capabilities.
I then discuss the roadmap to develop AI strategy, finding the best use cases for your project, and evaluating the AI/ML platforms and services from various cloud providers. It's like a step-by-step guide to your AI adventure.
3
Addressing the Challenges with Enterprise AI
Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one
more time.
—Thomas A. Edison
This chapter might just be your ally on your path to AI transformation. As someone keen on implementing
AI, whether you are a business strategist, technology leader, data scientist, or curious innovator, you must
be aware that deploying AI, ML, and generative AI solutions is exciting, but it can also be like navigating a
minefield. That’s because AI isn’t just about algorithms and models; it’s also intrinsically connected to data,
infrastructure, change management, regulations, and strategic decision-making. You will soon realize that
an AI transformation is markedly different from a digital transformation.
This chapter is designed to help you understand the technical challenges, ethical dilemmas, and strategic
hurdles. Recognizing these challenges is the first step toward preparing your organization for them. In later
chapters, I guide you with a practical, systematic methodology to address these challenges with clearly
defined steps, best practices, case studies, hands-on exercises, and templates. Each chapter in this book
maps to the tasks in the phases shown in Figure 3.1.
Note that these challenges by no means represent a complete list. Although the challenges are many, they
are not insurmountable. You can convert these challenges into opportunities for innovation and growth
based on the insights and best practices I share throughout this book.
Business-Related Challenges
Here is a sample of business-related challenges:
Cost of running ML workloads: The cost of running intensive machine learning workloads may get in the way of deploying models into production.
Keeping up with change: AI and machine learning are constantly evolving, and it is challenging for AI/ML practitioners and cloud engineers to keep up with those changes.
Selecting the right use cases: Choosing the right use cases can be challenging, as you need to understand the business problem, ensure data availability for the use cases, choose the right ML model, integrate with backend systems, factor in the ethical and legal issues, and measure ROI for the use cases.
Alignment with business strategy: Alignment between AI/ML and business strategies can be challenging if not handled properly with business prioritization of requirements, understanding of AI impact, and stakeholder communication.
Skills and talent shortages: Challenges include having the right skills and talent due to the shortage of qualified professionals, needing cross-functional expertise, and trying to keep up with new developments.
Collecting high-quality data: Collecting high-quality data can be challenging and time-consuming. Models require data in the right format, so cleaning and labeling data for model training is required, which can be time-consuming and costly.
Complexity of machine learning: Machine learning can be complex, with statistics, mathematical concepts, and machine learning programming skills that make it challenging to navigate.
Measuring model performance: Measuring the performance of models, coming up with the proper performance metrics, and validating them under changing business conditions and data can be challenging.
Monitoring model performance: Monitoring model performance can be challenging because of changing business conditions, the need to measure model drift, and the need to ensure the model performs well over time and stays aligned with the business needs.
Choosing the right model: Choosing the right model is a challenge: simpler models do not capture all the underlying data patterns, while complex models can capture complex patterns but can remain black boxes that are difficult to interpret.
NOTE Poor data quality can impact model performance and sometimes prevent models
from being deployed in production.
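The model-monitoring challenge described above can be made concrete with a small sketch. The snippet below is one minimal, illustrative way to flag data drift between a feature's training-time distribution and its live distribution using the population stability index (PSI); the function name is my own, and the 0.2 threshold is a common rule of thumb rather than a standard, so treat this as a sketch, not a prescribed implementation.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (training-time) sample and a live sample
    of the same feature. Values near 0 mean the distributions match;
    a common rule of thumb flags PSI > 0.2 as significant drift."""
    # Cut points come from the baseline distribution's quantiles.
    cuts = np.percentile(expected, np.linspace(0, 100, bins + 1)[1:-1])
    exp_pct = np.bincount(np.searchsorted(cuts, expected), minlength=bins) / len(expected)
    act_pct = np.bincount(np.searchsorted(cuts, actual), minlength=bins) / len(actual)
    # Floor the bin shares to avoid log(0) and division by zero.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)   # feature at training time
stable = rng.normal(0.0, 1.0, 10_000)     # live data, same distribution
drifted = rng.normal(0.75, 1.0, 10_000)   # live data after a mean shift

psi_live = population_stability_index(baseline, stable)
psi_drift = population_stability_index(baseline, drifted)
print(psi_live < 0.1, psi_drift > 0.2)  # only the shifted data trips the drift flag
```

In production you would run a check like this per feature on a schedule and alert when the index crosses your agreed threshold, which is exactly the kind of ongoing monitoring the table calls out as challenging.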
Platform-Related Challenges
Here is a sample of platform-related challenges:
Integrating with backend systems: Integrating with backend systems and infrastructure can be challenging due to technical, data, security, scalability, human-in-the-loop integration, and regulatory challenges.
Managing the entire machine learning lifecycle: Managing the entire machine learning lifecycle is challenging due to the complexity around data preparation, feature engineering, model selection and tuning, evaluation, deployment, monitoring, model interpretation, and scaling for large datasets.
Scaling machine learning workloads: Scaling machine learning workloads for large amounts of data in real time can be challenging, as doing so can require significant computational resources.
Selecting the right cloud provider: Selecting the right cloud provider is challenging: you must ensure proper performance, scalability, reliability, and security, choose from a wide range of complex options, and ensure alignment with your business goals and culture.
Table 3.1 summarizes all the challenges related to AI transformation covered so far.
Operations: Models should be monitored for bias, data drift, and model drift.
Managing end-to-end security: Managing end-to-end security across the entire machine learning lifecycle can be challenging, as security needs to be implemented across the data layer, models, infrastructure, network, and cloud components.
Ensuring compliance and avoiding bias: Ensuring compliance, maintaining data privacy, implementing security, and avoiding bias and ethical discrimination in models can be challenging.
Brainstorm with your team the different ways AI, ML, and Generative AI can integrate with
your company or a company of your choice.
Get Ready: AI Transformation Is More Challenging Than Digital Transformation
Bringing the data, the technology, the skill sets, and the infrastructure together takes a lot of work. A strong mix of technology and business domain knowledge is the foundation for success in this
journey. Figure 3.2 shows the different skills you need. It shows how these skills are a combination of domain,
mathematics, and computer science knowledge areas.
[Figure 3.2: The skills needed for AI, combining domain expertise (for example, finance, marketing, and operations), mathematics, and computer science. Roles such as ML engineer, DevOps, data scientist, data engineer, and business analyst sit at the intersections of these knowledge areas.]
Factors to Consider
In a nutshell, given all these considerations, whether an organization should proceed with a small-scale PoC solution or a more significant AI transformational effort will depend on where they are in their AI journey.
➤➤ If your company is just beginning to test AI and has not identified a particular use case, then a point-based PoC solution may be a good fit.
➤➤ However, if your company has already gained some experience through a previous point-based PoC
solution and has identified several use cases, as well as the right resources, skill sets, infrastructure, and
data available for a large-scale initiative, then a large-scale initiative may be the way to go, given the vast
benefits it can have for the organization.
It is recommended that you kick-start the AI transformation effort with a PoC and then follow it up with a larger-scale AI transformation initiative.
NOTE Deciding between a smaller PoC and a large-scale AI transformation depends on the organization's risk-taking culture and capacity for organizational change, as well as the maturity of its AI processes, practices, data, and organizational structure.
Step 3: The teams need to decide the next course of action: whether to move forward with a large-scale AI initiative or continue with their PoCs. The teams must give a short presentation capturing the situation, their decision, and their reasons.
Step 4: Once all the teams have presented, discuss as a class the common viewpoints and differences in approach. Individuals can capture their own notes and decide how they would apply the lessons in a real-world setting.
SUMMARY
While embarking on the AI transformation journey can be quite exhilarating, the landscape ahead can also be pretty daunting. Without a doubt, the promises are alluring, but it is essential to approach them with both enthusiasm and preparation. This chapter is your first step in the journey ahead.
This chapter considered some challenges in implementing a machine learning deployment that prevent organizations from deploying models into production. Given the immense power behind an AI solution, it is also vital for the implementation team to factor in some of the social, ethical, and humanitarian considerations when deploying an AI solution. The next chapter discusses the requirements to implement responsible AI.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the
information. You’ll find the answers in the following section.
1. What is the main advantage that digital natives have when it comes to adopting AI?
A. They have a risk-taking culture and view failure as a steppingstone for success.
B. They have a lot more money to invest in AI.
C. They have a better brand image than traditional companies.
2. What are the challenges when deploying AI/ML solutions in a business environment? Choose all
that apply.
A. AI and ML technologies are complex, making it challenging to deliver value.
B. The cost of running machine learning workloads can be a barrier to deployment.
C. All business problems are easily solvable using AI/ML.
D. There is a surplus of qualified AI/ML professionals in the market.
3. What is a common data-related challenge in machine learning?
A. There’s usually an excess of high-quality data available.
B. Models don’t need data in a specific format, making the process easier.
C. Measuring the performance of models is straightforward and easy.
D. Collecting high-quality data and preparing it for machine learning can be time-consuming
and costly.
4. What is a significant challenge when managing the entire machine learning lifecycle?
A. It is easier to conduct feature engineering and select models.
B. It is easy because data preparation does not require any significant effort.
C. The complexity around data preparation and feature engineering makes it challenging.
D. None of the machine learning tasks presents any significant challenge.
5. What makes AI transformation more complex than digital transformation?
A. AI transformation requires basic IT skills and infrastructure.
B. AI transformation involves changes in how work is done and the interaction of employees with
technology.
C. AI transformation does not require a substantial amount of high-quality data.
D. AI transformation does not present any challenges related to bias, ethics, privacy, and
accountability.
ANSWER KEY
1. A
2. A, B
3. D
4. C
5. B
4
Designing AI Systems Responsibly
The real problem is not whether machines think but whether men do.
—B.F. Skinner
This chapter grapples with one of the most pressing issues of our time: designing AI systems responsibly.
It is even more critical if you are a data scientist, project manager, AI developer, or executive involved in
implementing AI. It is not just about developing code or training models but about ensuring that what you
develop is aligned with human values, ethics, and safety.
This chapter explores the key pillars of Responsible AI, from robustness and collaboration to trustworthiness and scalability. This chapter also shares the essential pillars of a Responsible AI framework, the four
key principles of Responsible AI, and some of the nuances of AI design, development, and deployment. As
shown in Figure 4.1, this chapter sets the stage for subsequent chapters for the practical deployment and
scaling of AI. Whether you have just started or are trying to refine an existing implementation, this chapter
will guide you.
[Figure 4.1: Where this chapter fits in the methodology phases (01 Strategize and Prepare, 02 Plan and Launch Pilot, 03 ...).]
[Figure: The pillars of Responsible AI: ethical AI, robust AI, human-centric AI, collaborative AI, trustworthy AI, and scalable AI.]
Without these Responsible AI attributes, it will be challenging to deploy an AI system in production because users will not have the required trust and confidence to use it. The following sections look at each of these attributes and explain what they mean.
The Pillars of Responsible AI
Robust AI
You need to design AI systems to be robust, meaning they should be able to perform well even in situations for which they have not been previously trained. Robustness is especially critical for use cases such as defense, autonomous vehicles, and robotics, where the consequences could be catastrophic if the system misbehaves.
BEST PRACTICE TIP A robust AI system should perform consistently even when the data constantly changes, especially in an enterprise setting, where many data sources feed the model. If the system produces unreliable insights, users will lose trust and confidence in it. You need to test the system against surprise attacks and unexpected situations.
Collaborative AI
The idea behind collaborative AI is to leverage the strengths of both AI and humans. While humans bring creative thinking, leadership, judgment, and emotional and social skills, machines can crunch large amounts of data and help humans with physical labor, information gathering, speed, and routine tasks such as customer service. By combining the strengths of both humans and AI, you can build better systems to improve business processes. Figure 4.3 shows examples of this collaboration, such as chatbots, cobots, virtual assistants, and other intelligent machines. These intelligent machines work alongside humans to make decisions and carry out tasks collaboratively.
[Figure 4.3: AI and humans working together. AI strengths (data crunching, physical labor, information gathering, speed, routine tasks) complement human strengths (creative thinking, leadership, judgment, emotional and social skills) through cobots, chatbots, virtual assistants, and other intelligent machines.]
The critical takeaway from collaborative AI systems is that AI will not replace humans but will complement and enhance human decision-making capacity. By taking care of mundane, repetitive tasks, AI systems help humans focus on unique, high-value tasks and increase their capacity to deliver more.
Collaborative AI has created the need for humans to train AI systems so that they behave correctly, as well as to explain why AI systems behave in a certain way. For example, in the case of car accidents, you need to explain why an accident happened or why the car failed to prevent one. It has also created the need for human supervision to ensure that AI systems work responsibly.
It’s been shown that companies that replace humans can realize only short-term gains. In contrast, companies that
use AI to improve human capacity to make decisions and carry out mundane tasks faster at scale can redesign
business processes and reap greater long-term benefits.
44 ❘ CHAPTER 4 Designing AI Systems Responsibly
Some examples of collaborative AI are HSBC and Danske Bank using AI to improve the speed and accuracy of fraud detection. Another example is Mercedes-Benz, which has replaced traditional robots with AI-enabled cobots (robots that work alongside humans) to achieve flexibility in its processes, redesigning its human-machine collaboration.
Trustworthy AI
You need to build trustworthy systems, meaning they should behave consistently, reliably, and safely. They should
also protect the user’s well-being, safety, and privacy. The goal is to build trust between the user and the system.
You should also be able to explain to the user how the system took its actions and be accountable to the user for how it reached a conclusion or made a prediction. Figure 4.4 shows the key elements of a trustworthy system.
[Figure 4.4: The key elements of trustworthy AI: accuracy, human oversight and accountability, fairness, privacy, security, reliability, safety, and robustness.]
The Google search engine is an excellent example of a trustworthy system. Its algorithms ensure that results are shown based on the relevance of websites to the search query. It also ranks the results and filters out spam and malicious websites when presenting them. It provides transparent and explainable results to the user regarding how it came up with those results. Google has also developed a set of ethical guidelines and principles and is transparent with users about how it collects, protects, and uses their data. Google has built a reputation among users as a trustworthy search engine AI system.
Scalable AI
You must build AI systems that can handle large amounts of data without compromising performance during operation. For a system to be scalable, you need to ensure not just the scalable management of data and models but also scalability during development and deployment, as well as scalable algorithms and infrastructure (see Figure 4.5). Subsequent chapters on cloud platform architecture, automating operations, and data and model processing discuss how to solve data scarcity and collection and how to reuse data at scale. To address development and deployment scalability, the book discusses using production pipelines and scalable architectures, including distributed cloud computing and edge devices.
Human-centric AI
You need to build AI systems that are human-centric, meaning they should respect human values and enable people to make decisions. For this to happen, a conscious effort needs to be made to incorporate human values and needs in the design. You should factor in the social, ethical, and human behavioral impact. These systems must be built to be transparent, explainable, and accountable to the users, who should understand how these systems work and be able to make informed decisions based on the system's recommendations. Figure 4.6 shows some of the key considerations when building a human-centric AI system.
FIGURE 4.6: Building human-centric AI systems with human values at the forefront. Key considerations include respect for human values and adaptability.
A healthcare system that provides personalized treatment recommendations to patients based on their
historical medical data, genetics, and lifestyle is an excellent example of a human-centric AI system because it
ensures that the treatment is not based on a one-size-fits-all approach but is personalized based on the patient’s
unique needs and circumstances. The developers must also consider the data privacy of the user. They should
also be careful about the potential impact of making wrong recommendations. That is why it would be
necessary for the system to explain how a decision was made. This is critical because it will help the doctors
trust this system more, allowing them to make recommendations to patients much more confidently. Moreover,
you should continuously monitor the system to ensure it operates as expected. It should also adapt to the
changing patient and healthcare system requirements.
Identify potential social, humanitarian, and ethical issues when implementing AI and propose ethically sound solutions. Use this understanding to explore the key design elements that ensure the AI system is robust, collaborative, trustworthy, scalable, safe, and human-centric.
Develop solutions for prejudice and bias (data scientist): Propose methods to address biases in data collection, model training, and deployment. Deliverable: strategies for minimizing biases.
Develop solutions for privacy and security (security expert): Design controls to ensure data privacy and compliance with relevant regulations. Deliverable: data privacy and security plan.
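To make the bias-mitigation task concrete, here is a minimal, illustrative sketch of one common fairness check, the demographic parity difference, which compares positive-outcome rates between two groups. The data, function names, and the 0.1 review threshold are assumptions for illustration, not a prescribed standard.

```python
def positive_rate(y_pred, groups, g):
    """Share of positive (1) predictions among members of group g."""
    selected = [p for p, grp in zip(y_pred, groups) if grp == g]
    return sum(selected) / len(selected)

def demographic_parity_difference(y_pred, groups):
    """Absolute gap in positive-prediction rates between group 0 and
    group 1; 0.0 means both groups receive positive outcomes equally."""
    return abs(positive_rate(y_pred, groups, 0) - positive_rate(y_pred, groups, 1))

# Hypothetical loan-approval predictions and a binary sensitive attribute.
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

gap = demographic_parity_difference(y_pred, groups)
print(gap)        # 0.5: group 0 is approved 75% of the time, group 1 only 25%
print(gap > 0.1)  # flag for human review under our illustrative threshold
```

A data scientist might run a check like this after each training cycle and again in production, since a model that looks fair on training data can drift toward biased outcomes as the live data changes.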
Step 1: Assign a student or a group of students to a topic related to Responsible AI, such as
potential ethical issues, biases, privacy concerns, or other challenges.
Step 2: Have them give a short presentation covering their scenario, challenges, and proposed solutions. Identify the common themes and unique insights and document them for application to future projects.
SUMMARY
This chapter explained the importance of Responsible AI to ensure AI systems are designed and deployed with ethical, social, and human-centric principles in mind. Users should be able to understand AI decisions, and the systems should eliminate bias and prioritize human values. These systems must behave consistently and reliably and should scale to large amounts of data. That means you must ensure adherence to data privacy regulations, implement robust security and testing, and involve stakeholders during design.
In the next chapter, we will discuss developing the AI strategy, the roadmap, and getting alignment from various
stakeholders.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. Transparency in AI design means
A. The models used in AI systems should not introduce bias when they are adapted to changing
conditions.
B. The data used as input to a model should be biased.
C. The users and organizations understand how an AI system operates and makes its predictions so
that they can use it appropriately.
D. It should not be possible for either humans or the programs to change the predefined model development process.
2. Which of the following is NOT a best practice tip for Responsible AI?
A. The data used as input to a model should not be biased.
B. The organization implementing AI should focus only on the positive impacts of AI.
C. Humans or programs should not be able to change the predefined model development process.
D. The models used in an AI system should not introduce bias when they are adapted to changing
conditions.
Answer Key ❘ 49
ANSWER KEY
1. C 3. B 5. A
2. B 4. B
5
Envisioning and Aligning Your AI Strategy
By failing to prepare, you are preparing to fail.
—Benjamin Franklin
This chapter not only sets the tone for your AI transformation but also ensures that every step you take
from now on resonates with your company’s broader objectives.
Targeting business strategists struggling to find their AI direction, technology leaders grappling with business and technology alignment, and AI enthusiasts eager to embed AI meaningfully into their business,
this chapter is focused on charting a clear vision for your AI initiatives and aligning it with your business
objectives. Without a clearly defined vision and business case, even the most robust AI implementation can
fail when not aligned with the business goals.
This chapter presents a methodology that clearly lays out the tasks and deliverables you can follow across the various stages of your AI implementation journey. It serves as a bridge between the groundwork laid in Chapters 1–4 for idea generation and the later chapters, from developing the business case in Chapter 6 to managing strategic changes in Chapter 7 (see Figure 5.1).
By the end of this chapter, you will have a toolkit of strategies, a clear roadmap of tasks, a business case,
and a checklist of tasks vital for your AI journey.
I know that implementing enterprise-wide AI is not an easy thing to do. However, to drive business
transformation with AI rather than solve one or two AI use cases, enterprise-wide AI is the answer, and
this book is your end-to-end guide. As shown in Figure 5.2, the chapters in the book are arranged in a manner that should help you manage all these challenges and implement enterprise-wide AI using a systematic methodology.
[Figure 5.2: The phases of the methodology (01 Strategize and Prepare, 02 Plan and Launch Pilot, 03 ...), covering tasks such as building your platform, selecting your AI/ML algorithms and frameworks, integrating your apps and systems, and automating your operations.]
Envision Phase
Tasks and Deliverables
The Envision phase covers these tasks: define AI vision and strategy, identify business transformation opportunities, obtain leadership support, prioritize AI initiatives, and establish a measurement framework.
Prioritize initiatives (project managers, strategy team): Identify the different areas that AI can impact and prioritize your AI initiatives across those areas by their value. Deliverables: a prioritized list of AI initiatives and an impact assessment across areas.
Say you are working for a retail company to improve its supply chain. As part of the Envision
phase, you have identified the AI strategy, brainstormed several AI initiatives, obtained
leadership buy-in, established a measurement framework, and prioritized your AI initiatives.
For example:
➤➤ Business strategy: Leverage AI to improve supply chain operations.
➤➤ AI strategy: Use appropriate algorithms to analyze supply chain data and generate insights
to optimize operations. This includes processing data, building models, and leveraging
appropriate AI tools.
➤➤ Business goals: Reduce lead times by 20 percent, increase on-time delivery to 98 percent,
and reduce transportation costs by 10 percent.
➤➤ Measurement framework: This is a set of metrics used to track progress. In this case, it
includes lead times, on-time delivery, transportation costs, inventory turnover, and order
cycle time.
➤➤ AI initiatives: These include demand forecasting for better inventory management, route
optimization to minimize transportation costs, and warehouse automation to reduce
labor costs.
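The measurement framework above can be expressed as a small tracking script. This is a hypothetical sketch: the metric names, baselines, and current values are invented for illustration, and the targets mirror the example goals (a 20 percent lead-time reduction and a 10 percent transportation-cost reduction).

```python
# Hypothetical KPI tracker for the supply chain measurement framework.
# Baselines and current values are invented; targets mirror the example goals.

def pct_change(baseline, current):
    """Relative change from baseline as a fraction (negative = decrease)."""
    return (current - baseline) / baseline

metrics = {
    # target_change is the desired relative change, e.g. -0.20 = reduce by 20%
    "lead_time_days":     {"baseline": 10.0,      "current": 7.8,     "target_change": -0.20},
    "transport_cost_usd": {"baseline": 1_000_000, "current": 930_000, "target_change": -0.10},
}

for name, m in metrics.items():
    change = pct_change(m["baseline"], m["current"])
    status = "on track" if change <= m["target_change"] else "behind"
    print(f"{name}: {change:+.1%} vs target {m['target_change']:+.0%} ({status})")
```

The point of the framework is exactly this comparison: each metric has a baseline, a target, and a current reading, so progress can be reported objectively rather than anecdotally.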
Align Phase
Tasks and Deliverables
The Align phase covers five tasks: Assess Current Maturity Levels, Identify Remedial Measures, Initiate Change Management Processes, Develop a Plan, and Create a Roadmap.
Assess your current maturity levels: Identify the current maturity level in your organization across business, technology, people, process, security, and governance. Document the concerns, limitations, or gaps shared by your stakeholders. Owners: business analysts, technology leaders. Deliverables: maturity assessment report, list of concerns and limitations.
Initiate your change management processes: Put in place organizational change management processes to adequately address any hurdles. Talk to different stakeholders for their support; for example, you may have to speak with IT partners to procure the right resources or with the training curriculum lead to put together the right set of courses. Your goal is to ensure that all the stakeholders are fully on board and supportive of this AI initiative. Owners: change managers, IT partners, training leads. Deliverables: change management plan, stakeholder support documentation.
Develop your plan: During this phase, you capture these gaps in the form of a plan so that you can work toward launching the AI initiative in the next phase. There are two types of plans: one at the project level to implement the identified AI POC as part of the Launch phase, and another that enables enterprise AI from a transformation perspective, known as the AI transformation plan. Owners: project managers, strategy team. Deliverables: project plan, AI transformation plan.
Create your roadmap: Create a roadmap with clearly defined milestones, resources, and timelines. Owners: strategic planners, project managers. Deliverables: AI initiative roadmap, milestone chart.
Continuing with the retail example, you may perform the following tasks during the
Align phase:
➤➤ Change communication plan: You find out during your gap analysis that some stakehold-
ers are concerned about AI’s impact on their jobs. So you put together a change manage-
ment process that involves periodic communication about the value of the AI initiative
along with reassurances about how it benefits their jobs.
➤➤ Skills gap: You identify remedial action items to address technical skills gaps. One of those
measures is hiring new employees with AI and ML skills and working with the training
curriculum team to develop AI courses for your employees to bring them up to speed with
the technologies.
➤➤ Data issue: During the gap analysis you also realized that your company does not have all
the data required to build and train the models. You assess different options to generate
or acquire necessary data from various marketplaces. While discussing this with your IT
leads, you also learn that you need to build certain aspects of the AI platform.
➤➤ Roadmap: You consolidate all the action items into a plan and define a roadmap with
clear timelines and milestones.
➤➤ Business case: At the end of this phase, you build a strong business case to launch the pro-
ject and ensure all the impacted stakeholders are supportive and committed to the success
of this AI initiative.
SUMMARY
In this chapter, you took a tour of the Envision and Align phases for implementing enterprise AI across the
organization. By now, you must be aware of the challenges in implementing enterprise AI and why a detailed step-
by-step methodology is not just helpful but also essential.
In Chapter 10, you will learn about the Launch phase and in subsequent chapters, I will dive into how to scale
your AI adoption across the enterprise. If you have ever wondered how to make AI work for your company, these
four phases—Envision, Align, Launch, and Scale—must be your allies and guide on this journey. Think of these as
the signposts that guide you on your path to becoming an AI-first organization.
In particular, the importance of strategically implementing AI and aligning your AI initiatives with your business goals cannot be overstated. Your journey has just begun. A clear roadmap, coupled with strategic alignment, is a must for success.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. Which one of the following is not a fundamental building block in an enterprise AI initiative?
A. Data
B. People
C. Business
D. Marketing
2. What is the primary objective of the Envision phase?
A. To launch a pilot to evaluate the idea
B. To define the AI strategy and prioritize AI initiatives
C. To take an iterative approach to learn from the pilot and scale it
D. To align the stakeholders for AI readiness
3. Which of the following is not a primary objective of the Align phase?
A. To assess current maturity levels
B. To identify remedial measures
C. To implement AI initiatives identified in the Envision phase
D. To initiate change management initiatives
4. What is the primary objective of the Launch phase?
A. To identify pilot projects
B. To implement pilot projects
C. To deploy a full-scale AI initiative
5. Which of the following is not part of the Scale phase?
A. To implement a robust AI/ML platform
B. To implement strong security controls
C. To establish an automated operations process to track and remedy issues
D. To implement a pilot project
ANSWER KEY
1. D  2. B  3. C  4. B  5. D
6
Developing an AI Strategy and Portfolio
All men can see these tactics whereby I conquer, but what none can see is the strategy out of which vic-
tory is evolved.
—Sun Tzu, The Art of War
This chapter transitions from ideation and vision, as laid out in Chapters 3 to 5, to translating that vision into a compelling business case. Even the most exciting AI ideas will not gain traction, the necessary resources, or organizational support unless you have a robust, strategically aligned business case.
In addition to the business case, this chapter crafts the AI strategy that guides the direction, scope, and
approach for your AI initiatives, thus setting the stage for planning the pilot and building the team (see
Figure 6.1). By the end of this chapter, you will have a well-defined business case and a comprehensive AI
strategy aligned with your organizational goals.
FIGURE 6.1: STRATEGIZE AND PREPARE: Develop business case and AI strategy
In Chapter 2, I talked about how a company like Netflix transformed their business by
leveraging cloud and AI technology. I discussed how technology drove process transformation,
enabling them to become more efficient and scale globally. Process transformation, in turn,
created new opportunities for organizational transformation, enabling them to become
product-centric, customer-focused, and agile. And this organizational transformation resulted
in their ability to identify new products, services, and, eventually, new business models.
To achieve business transformation and build its competitive advantage, Netflix had to leverage processes to deploy various resources, tangible and intangible, from within and outside the enterprise. Typically, these resources are a mix of people, processes, and technology. The
ability to bring together all those resources to achieve a favorable business outcome is known
as capability.
In the case of your AI effort, you would do the same to build your AI capability to achieve a competitive advan-
tage. You will leverage a team of skilled data scientists, cloud and data engineers, software engineers, and business
teams using the Agile methodology to develop machine learning models on a complex, scalable cloud infrastruc-
ture employing leading-edge best practices.
➤➤ Platforms
➤➤ Machine learning
➤➤ Operations
➤➤ Security
➤➤ Governance
Different organizations have different strengths and weaknesses in these categories, which determines the success
rate and the path they take toward achieving their business goals. For example, a few companies may be good at
product design, which may help them design new products and drive innovation. In contrast, other companies
may be good at supply chain management, helping them deliver products quickly and cheaply. These capabili-
ties have been developed over time through investments in people, processes, and technology and have become a
competitive advantage for an organization.
Enterprise AI capability spans eight areas:
➤➤ Business: Ensure AI and business strategy alignment and AI and business goals alignment, and develop use cases.
➤➤ People: Have the right people, skills, and culture to build and deploy AI.
➤➤ Data: Ensure data is available with high quality to train accurate models and generate insights to drive action.
➤➤ Governance: Define an AI governance and risk management framework for ethical and Responsible AI.
➤➤ AI/ML/Generative AI: Develop and use high-performing models for process automation, decision-making, human language use cases, and innovation.
➤➤ Platform: A suitable and reliable platform to store and process data and to build and deploy models.
➤➤ Security: Ensure data, model, and infrastructure security against unauthorized access, use, or destruction.
➤➤ Operations: Build and deploy quality models using high-quality data using MLOps.
In the context of an enterprise-wide AI initiative, below are the focus areas for you to manage effectively:
➤➤ Define your AI strategy
➤➤ Manage your portfolio
➤➤ Encourage innovation
➤➤ Manage your product lifecycle
➤➤ Develop strategic partnerships
➤➤ Monetize data
➤➤ Draw business insights
➤➤ Maximize machine learning capability
Leveraging the eight business aspects shown in Figure 6.3 effectively will increase the probability of success of
your AI transformation effort, enterprise-wide.
Define your AI strategy: AI strategy document. A comprehensive document that captures the strengths, weaknesses, opportunities, and threats, and the AI vision and business goals.
• For example, evaluating the growth of technology platforms, customer expectations, and regulatory requirements.
Manage your AI portfolio: Portfolio analysis. Analyze your portfolio based on short- and long-term goals.
• For example, categorize your list of AI initiatives into high value, low value, low risk, high risk, high complexity, and low complexity in terms of delivery capability.
Manage your AI product lifecycle: AI product strategy and roadmap. Develop the product strategy from a customer-centric point of view to solve a problem identified through customers and data. Product development and delivery plan. Helps to have a clear roadmap so all parties are clear about their role; helps to manage plans, resources, milestones, and risks effectively.
Develop your strategic partnerships: List of potential partners. Helpful to identify companies with machine learning, data analytics, and AI skills. Partnership agreements. Includes a list of the roles and responsibilities of all the parties involved.
The company wanted to optimize its supply chain operations using AI. During the Envision
and Align phases, the company evaluated the competitor landscape, technology developments,
regulatory changes, and customer expectations to understand the business impact.
By looking at its business goals and the impact assessment, it decided to leverage predictive
analytics to optimize its operations and meet its long-term goals.
The company developed its technology strategy to deploy AI technology to improve its supply
chain operations and meet its business goals. As part of its strategy, it reorganized its teams
around products and value streams to become more customer-centric. To ensure a successful
implementation, it solicited feedback from many stakeholders, such as customers, employees,
leadership teams, and third-party vendors.
In the end, the company achieved its business goals using AI technology.
Figure 6.4 shows examples of business strategy, AI strategy, business goals, and AI initiatives.
Consider the example of Netflix. They employed portfolio management to prioritize their
products to deliver an engaging customer experience and maintain their competitive edge.
Netflix used data about their customers’ behavior to personalize their viewing experience.
They also prioritized the products that optimized their content delivery network to allow
faster streaming and reduce buffer times. They also ensured they had the necessary technical
skills, resources, and infrastructure to deliver the customer experience without compromising
on performance, security, and reliability. Using portfolio management to prioritize their AI products against strategic goals, such as personalizing the customer experience and improving operational efficiency, helped them ensure superior performance and their ability to deliver.
FIGURE 6.4: Examples of business strategy, AI strategy, business goals, and AI initiatives
Scenario: Assume you are part of an AI team trying to improve supply chain opera-
tions using AI.
Goal: Leverage your company's organizational nontechnical capabilities in strategy, portfolio, data, modeling, and innovation.
The figure shows the execution focus areas: 01 Manage AI Strategy, 03 Manage Innovation, 05 Strategic Partnerships, 06 Develop Data Science Capability, and 07 Monetize Data (data monetization ideas).
SUMMARY
This chapter delved into the task of translating AI ideas into actionable and impactful AI strategies. This chapter
was a step-by-step guide to craft a compelling AI strategy and a prioritized portfolio.
You also learned how to chart out a comprehensive AI strategy that provides direction and focus for your future
AI endeavors. This chapter serves as the bridge between your overarching AI strategy and execution. In the next
chapter, let’s discuss the critical components to accelerate AI adoption in your company.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. The ability to bring together resources to achieve a business outcome is called what?
A. Innovation
B. Achievement
C. Capability
D. Transformation
2. What is the first step in developing an AI strategy?
A. Identify your business goals.
B. Assess your current AI capabilities.
C. Develop a roadmap for AI adoption.
D. Identify potential use cases.
3. What is the purpose of managing your portfolio in the context of an AI strategy?
A. To ensure all projects have the same priority
B. To get a holistic picture of all the products and services to meet customer needs
C. To prioritize products and services based on their cost
D. To focus primarily on customer experience
4. What is the key to leveraging organizational capabilities for competitive advantage in AI projects?
A. Choosing the latest machine learning models and algorithms
B. Increasing the technical skills of your team
C. Focusing on nontechnical aspects such as business, governance, and people
D. Investing in the most expensive AI technologies
ANSWER KEY
1. C 3. B
2. A 4. C
7
Managing Strategic Change
Change is the only constant in life, but with strategic management, we navigate its tides.
—Inspired by Heraclitus
If you are an executive, team leader, change leader, or someone driving strategic change in your organiza-
tion, this chapter is for you. The world of AI, in particular generative AI, is fast-moving, and being able to
manage change is not merely a nice thing to have but a must. While earlier chapters focused on under-
standing the promise of AI, this chapter delves into how to strategically manage and accelerate that change
(see Figure 7.1).
This chapter guides you through a three-phase approach to accelerate AI adoption, from developing an AI
acceleration charter to ensuring leadership alignment and creating a change acceleration strategy. I also tie
these steps to tangible deliverables that will serve as evidence of your progress. Note also that the steps outlined here apply to both traditional AI and generative AI implementations.
Whether you are trying to reposition a team or redefine an entire company, this chapter provides you with the
tools needed to drive strategic change. Remember that though every organization’s AI journey is unique, the
change management principles are universal.
ACTIVITIES: Develop AI Acceleration Charter and Governance Mechanisms
➤➤ Analyze value drivers: Identify the main factors contributing to the organization's value, such as operational efficiency and cost reduction.
➤➤ Assess stakeholders.
➤➤ Create a communication strategy and plan.
➤➤ Create an assessment strategy and plan.
➤➤ Establish AI governance: Set up procedures for decision-making, issue management, and accountability within the AI initiative.
➤➤ Define change acceleration metrics: Establish benchmarks to measure the success and impact of the AI initiative on the organization.
➤➤ Define your future state AI vision: Articulate a vision for the future state of the organization, enabled by AI technology.
BEST PRACTICE TIP Develop an AI charter, analyze value drivers, establish governance,
build an AI team, and then define the goals, vision, metrics, and budget to implement AI
successfully.
from concerns such as poor communication, lack of cross-functional collaboration, and AI implementation-related challenges such as data handling and model building, trust, and ethics.
Leadership by example: Encourage leaders to lead by example and display behaviors that support AI adoption.
Transforming your workplace involves assessing your current capabilities in your workforce and producing strat-
egies to fill the gaps. Figure 7.6 shows the steps and deliverables.
Accelerating Your AI Adoption with Strategic Change Management ❘ 71
ACTIVITIES
➤➤ Skills gap assessment: Conduct a skills gap analysis that outlines the gaps in skill sets and areas where upskilling is required.
➤➤ Performance management system: Develop a performance management system that aligns with the new roles that arise in an AI implementation, building on what was decided in the Align phase, and update it as new developments happen in AI.
➤➤ Talent program: Develop a talent program that includes hiring third-party consultants and vendors, as well as an in-house training program.
➤➤ Brand communication strategy: Develop a brand communication strategy and promote your brand.
IBM CASE STUDY: IBM’S NEW COLLAR JOB INITIATIVE: A MODEL FOR
WORKFORCE TRANSFORMATION
IBM’s new collar job initiative is a notable example of workforce transformation. IBM, with
this program, is trying to fill the skills and experience gap in the technology space by helping
those who cannot get a traditional four-year degree to get into these high-tech jobs. With the
advent of modern technologies such as AI, ML, and advanced analytics, there is a need for
people who can maintain and operate these systems. This does not require a four-year degree; a
nontraditional vocational program suffices. IBM created a variety of training and education
initiatives, such as online courses, apprenticeships, and internship programs.
Since its launch in 2017, the program has found tremendous success. According to IBM, about 50 percent of its new hires in 2018 came from nontraditional educational backgrounds, most of them through this new collar initiative. More importantly, it has helped IBM diversify its workforce by including people from all kinds of educational backgrounds.
Overall, this has been a new trend in the industry, and this is something that you need to consider as you look for
ways to augment your workforce.
The leadership alignment steps and deliverables:
➤➤ Envision and Align: Set a clear vision and align leaders toward the AI initiative.
➤➤ Assess stakeholders: Understand the needs, concerns, and attitudes of stakeholders.
➤➤ Organizational readiness assessment: Gauge the organization's readiness for the change and prepare strategies to fill in any gaps.
➤➤ Change impact assessment: Evaluate how the AI initiative will affect different stakeholders and develop a plan to manage these impacts.
➤➤ Create a case for change: Create a persuasive document highlighting the benefits of the AI initiative to gain buy-in from stakeholders.
➤➤ Launch and Scale: Initiate the AI initiative and scale it up per organizational needs.
FIGURE 7.7: Ensuring leadership alignment for AI, including generative AI initiatives
ACTIVITIES
➤➤ Communication plan: Communicate the business case for AI adoption and how it aligns with the business goals.
➤➤ Mentoring: Create mentoring programs, job rotation, shadowing, and coaching so employees can learn from experts. Establish a community of practice, such as an AI COE, to communicate AI/ML standards.
➤➤ Skills assessment: Conduct a skills gap assessment for AI/ML and cloud skills.
➤➤ Onboarding process: Make AI/ML training part of the onboarding process.
➤➤ Hands-on labs: Allow employees to create sandboxes to get their hands dirty with AI/ML tools.
➤➤ Certifications: Encourage employees to take charge of their own learning and get certified in AI.
➤➤ Rewards program: Set up a rewards program for employees who use AI/ML skills in their work.
➤➤ Knowledge sharing: Encourage employees to share their knowledge with others and become AI evangelists.
When you implement these recommendations effectively, you will see a significant increase in the number of
employees using AI/ML and generative AI in their work, leading to increased efficiency, improved customer sat-
isfaction, and revenue. You will notice that your culture has become innovative and agile, with employees taking
ownership of their learning and adopting AI in their workplace. All this will eventually help you become competi-
tive in a highly competitive landscape and even position you as an industry leader.
SUMMARY
This chapter discussed the critical components of accelerating AI adoption in your company. By building the AI
acceleration charter, achieving leadership alignment, and developing a systematic AI acceleration strategy, you can
galvanize your employees around AI to achieve business outcomes.
You learned that you could achieve organizational transformation by adopting a systematic, programmatic
approach to business transformation and people empowerment.
In the next chapter, let’s discuss finding the right AI/ML use cases for your company.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. Which of the following is not included in Phase 1 of the change acceleration program?
A. Analyze value drivers
B. Establish governance for the AI initiative
C. Develop an AI acceleration charter
D. Ensure leadership alignment
ANSWER KEY
1. D  2. C  3. D  4. C
PART III
Planning and Launching a Pilot
Project
This part covers all the challenges and tasks centered on planning and launching a pilot project, including
identifying use cases for your project, evaluating appropriate platforms and services, and launching the
actual project.
8
Identifying Use Cases for Your
AI/ML Project
The only way to do great work is to love what you do.
—Steve Jobs
In the next few chapters, let’s discuss planning and launching a pilot (see Figure 8.1). This chapter focuses
on identifying the right use cases. Having addressed challenges, approached design responsibly, and charted
the AI strategy and vision, it is time to identify the right use case for your AI/ML project.
FIGURE 8.1: PLAN AND LAUNCH: Identify Use Cases for Your AI/ML & Gen AI Project
Whether you are a business executive looking to solve business problems or a developer trying to develop state-
of-the-art applications, this chapter is for you. Selecting the right use case is more than just a decision—it’s the
compass that sets your AI journey in the right direction. It will not only help your team solve the business prob-
lem by choosing the suitable set of technologies, but it will also lead to increased operational efficiency, reduced
costs, and even new business opportunities for a competitive edge.
You also learn to prioritize your use cases, which will help you focus all your energies on the most pressing
problems and maximize business impact. It’s your path to positive stakeholder results and a higher probability of
success. Let’s get started.
TIP Prioritize your use cases that have the highest business impact to reduce costs, increase
efficiency, and gain a competitive advantage.
The use case identification process flow: document your business problem or pain points; decide on the business objectives and success metrics; research industry trends; review applications in various industries; identify the right use case for the business problem; prioritize the use cases; assess data availability and quality; conduct a PoC to choose the right model; design the solution architecture; develop a project plan; decide build versus buy; and conduct a vendor evaluation.
Remember that machine learning helps you identify patterns in your dataset to predict outcomes. This feature/
capability can be used to improve existing processes while also enabling new opportunities. Processes that are
well suited to AI/ML are those
➤➤ Where decisions are based on data
➤➤ Where decisions happen repeatedly, thousands or millions of times
➤➤ That can be automated using software
➤➤ That are slow
The five steps are AI/ML benefits education, identify business objectives/pain points, quantify potential business value, measure business value, and celebrate!
➤➤ AI/ML benefits education: Spread awareness and understanding of the potential benefits of AI/ML and Gen AI across the organization.
➤➤ Identify business objectives/pain points: Outline the key business objectives or pain points that AI/ML and Gen AI could address.
➤➤ Quantify potential business value: Estimate the potential business value that could be realized by implementing AI/ML and Gen AI solutions.
➤➤ Measure business value: Establish metrics like ROI, payback period, or net present value to measure the actual business value gained from the AI/ML and Gen AI implementation.
TIP Start by educating everyone about the benefits of AI/ML and Gen AI. Define the
business objectives and quantify the potential business value to measure success.
Based on these business objectives/pain points, you can quantify the business value that can be realized upon
completion of your use case. Some examples of business value are cost savings, revenue growth, improved
accuracy, automation, better customer experience, and faster decision-making. You should be able to measure the
business value using ROI, payback period, or net present value.
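The three measures can be made concrete with a few lines of code. This is a minimal sketch; the cost and cash flow figures are invented assumptions, not figures from the book.

```python
# Three standard measures for quantifying AI business value.

def roi(total_benefit, total_cost):
    """Return on investment as a fraction: (benefit - cost) / cost."""
    return (total_benefit - total_cost) / total_cost

def payback_period(initial_cost, annual_net_cash_flow):
    """Years needed for the cash flows to recover the initial cost."""
    return initial_cost / annual_net_cash_flow

def npv(rate, initial_cost, cash_flows):
    """Net present value: discounted future cash flows minus initial cost."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1)) - initial_cost

# A hypothetical AI use case: $500k to build, $250k/year net benefit for 4 years.
print(f"ROI: {roi(4 * 250_000, 500_000):.0%}")
print(f"Payback: {payback_period(500_000, 250_000):.1f} years")
print(f"NPV at 10%: ${npv(0.10, 500_000, [250_000] * 4):,.0f}")
```

Note that ROI ignores the time value of money, payback period ignores everything after the break-even point, and NPV captures both, which is why business cases often report all three.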
➤➤ Customer experience improvement: Using customer data to personalize recommendations. Helps increase customer engagement, satisfaction, and loyalty. It's a common use case to personalize marketing campaigns and ad targeting.
➤➤ Fraud detection: Analyzing transactional data to detect unusual patterns. You can notify customers of potential fraud. Many companies across multiple industries analyze transactional data to detect fraudulent activity.
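As a toy illustration of the fraud detection idea above (spotting unusual patterns in transactional data), the following sketch flags outlier amounts with a simple z-score rule. Production fraud systems use trained ML models on many features, and these transaction amounts are made up.

```python
# Toy anomaly detector: flag transaction amounts far from the mean.
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.5):
    """Flag amounts more than `threshold` sample std devs from the mean.

    A modest threshold is used because with small samples a single
    outlier cannot exceed (n - 1) / sqrt(n) standard deviations.
    """
    mu, sigma = mean(amounts), stdev(amounts)
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

transactions = [42.0, 18.5, 55.0, 23.0, 61.0, 37.5, 4950.0, 29.0, 44.0, 31.0]
print(flag_anomalies(transactions))  # prints [4950.0]
```

The same pattern-versus-baseline idea generalizes: an ML model learns what "normal" looks like from historical data and scores new transactions against it.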
Pain points can be manual processes, redundant processes, complex processes, data quality issues, and other inef-
ficiencies in your process.
TIP Identify the pain points that hinder your business, and conduct root-cause analysis to
shortlist your use cases.
Define success metrics:
➤➤ Customer satisfaction: For use cases that improve customer service, resolve customer issues quicker, and provide personalized customer experiences.
➤➤ Cost savings: Useful when automating business processes, eliminating errors, or improving process efficiency.
➤➤ Employee productivity: For use cases that automate repetitive employee tasks, reduce errors, and provide real-time insights.
TIP Define success metrics that align with your business objectives and use cases to measure
the effectiveness of your AI solution.
Learn from pitfalls: Preempt and prepare for potential challenges by understanding pitfalls faced by others; for example, learn about risk factors from healthcare industry applications. Sources: case studies, industry reports.
TIP Look to AI industry trends and other AI applications across other industries to learn
from best practices and avoid potential pitfalls.
Map use case to business problem: Choose a suitable use case that can address your business problem. Refer to the "Use Cases to Choose From" section and select a use case that aligns with your business problem. Deliverable: a chosen use case that aligns with your business problem; for example, choosing a use case like predictive maintenance if your business problem is high equipment downtime.
➤➤ Time to implement
➤➤ Risks, if any
As part of your feasibility analysis, you need to consult various stakeholders across your organization from busi-
ness and technology, to identify their concerns and address them appropriately, as well as get buy-in.
The prioritization matrix plots business value (high to low) against feasibility (high to low). Example placements: predictive maintenance and chatbots in the high-value, high-feasibility quadrant; employee sentiment analysis in the low-value, high-feasibility quadrant; and hyper-personalization and full automation use cases in the low-feasibility quadrants.
FIGURE 8.5: Business value and feasibility analysis to prioritize use cases
As shown in Figure 8.5, you can classify the use cases based on the business value and feasibility. Your business
value criteria can include factors such as cost reduction, revenue growth, and customer satisfaction, and feasibility
criteria can include factors such as technical feasibility and organizational readiness.
➤➤ High business value and high feasibility use cases: Your priority use cases will be those that deliver high
business value and are easier to implement. Predictive maintenance in manufacturing can prevent equip-
ment downtime and extend the life of machinery. The business value is very high, and with the necessary
sensor data and appropriate ML models, it is often easier to implement. Similarly, chatbots are easier to
implement and can have a significant impact on customer service and efficiency.
➤➤ High business value and low feasibility use cases: Under this category fall use cases such as predicting diseases and full automation of manufacturing processes. While these use cases could save millions of lives and reduce costs, they can be challenging to implement due to technology limitations.
➤➤ Low business value and high feasibility use cases: Use cases such as employee sentiment analysis to ana-
lyze employee feedback may be easier to implement but may have little impact on the business. Similarly,
employing AI to tag images automatically may be easy to implement but have a low business impact.
➤➤ Low business value and low feasibility use cases: Hyper-personalization of advertisements may appear
very attractive, but the cost of implementing it may make it highly prohibitive due to data privacy and
technical requirements. Similarly, complete automation of all business processes may appear very attrac-
tive but may be of limited value for a small business and technically and financially prohibitive.
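The value/feasibility classification above can be sketched as a simple scoring exercise. The use case names come from this section, but the 1-to-5 scores are invented for illustration; in practice they come from stakeholder assessments.

```python
# Sketch of the business value vs. feasibility prioritization matrix.
# Scores (1-5) are hypothetical; real scores come from stakeholder workshops.

def quadrant(value, feasibility, cutoff=3):
    """Classify a use case into one of the four matrix quadrants."""
    v = "high value" if value >= cutoff else "low value"
    f = "high feasibility" if feasibility >= cutoff else "low feasibility"
    return f"{v}, {f}"

use_cases = {
    "Predictive maintenance": (5, 4),
    "Chatbots":               (4, 5),
    "Disease prediction":     (5, 2),
    "Employee sentiment":     (2, 4),
    "Hyper-personalization":  (2, 2),
}

# Rank by business value first, then feasibility, highest first.
for name, (v, f) in sorted(use_cases.items(), key=lambda kv: (-kv[1][0], -kv[1][1])):
    print(f"{name}: {quadrant(v, f)}")
```

Even this crude ranking makes the priority order explicit and debatable, which is the real point of the exercise: stakeholders argue about scores rather than about vague preferences.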
TIP Prioritize your use cases based on their potential impact and feasibility.
This is a use case prioritization workshop that involves collaborating with various stakeholders
to compare different use cases and arrive at a decision.
90 ❘ CHAPTER 8 Identifying Use Cases for Your AI/ML Project
Prioritize Use Cases: Based on the impact and feasibility assessments, prioritize the use cases. Create a matrix using the impact and feasibility criteria to rank the use cases. The output is a ranked list of use cases, for example: Use case A: high impact, high feasibility; Use case B: high impact, low feasibility, etc.
Use Cases to Choose From ❘ 91
TIP Use AI for DevOps to automate, optimize, and improve the DevOps process.
TIP Ensure data privacy and compliance with regulations like HIPAA to build trust and
safeguard patient information.
BEST PRACTICE TIP AI should enhance customer experience, streamline processes, and
complement rather than replace human agents to maintain a personal touch.
identify trends and patterns, forecast future performance, detect anomalies, determine root causes, and adapt to
changing conditions.
Sales, marketing, and CX insights can help identify growth and strategy opportunities; IT monitoring can help
manage disruptions; inventory and workforce planning can optimize resources; and AI for Data Analytics (AIDA)
enhances existing analytics with ML.
Success Stories: Foxconn boosted forecast accuracy by 8 percent with Amazon Forecast, saving approximately $500,000. Digitata used Amazon Lookout for Metrics to detect issues in hours instead of days, saving 7.5 percent of sales revenue.
Tools: Amazon’s Business Metrics Analysis ML solution, which adapts analysis dynamically to changing business conditions; Azure ML; and Google BigQuery.
TIP Use AI to analyze business metrics, adapt to changing business conditions, and identify
new growth opportunities.
Content Creation
To create high-quality content in large quantities, you need scalable platforms and extensive resources, which the cloud provides. AI facilitates personalized content creation at scales beyond human capacity, with machine learning models that produce varied content. AI-enabled content creation allows for the following:
➤➤ Personalized content
➤➤ Automated generation
➤➤ Machine learning models
➤➤ Massive scale
Success Stories: News media companies such as the Associated Press, the Washington Post, and Forbes have used
content generation tools.
Tools: GANs, natural language processing and computer vision tools, ChatGPT, Claude, Writesonic, Perplexity, Bard, and Copy.ai.
TIP Financial services companies can use AI for applications such as risk management, per-
sonalized finance, and trading.
Cybersecurity
Cybersecurity is the practice of protecting systems, networks, and data from hacking, unauthorized access, disclosure, modification, misuse, and destruction. It involves the use of systems, processes, and policies to protect assets from cyber-threats such as viruses, malware, phishing, and ransomware attacks.
Cybersecurity ensures business continuity to prevent downtime and aids in compliance with regulations such
as GDPR. It uses tools like IBM’s Qradar for network security and Symantec’s software for AI-driven endpoint
protection.
Success Stories: FICO uses AI for fraud detection; Splunk SOAR automates responses to security incidents; FireEye’s Helix detects malicious activity; and Dell uses Cylance for cybersecurity.
Tools: You can use natural language processing to detect malware, filter emails, and identify fraudulent trans-
actions, while deep learning can detect malicious images, analyze videos for unauthorized access, and recog-
nize scams.
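As a toy illustration of the NLP-based email filtering mentioned above, the sketch below flags messages by suspicious-keyword density. Real services rely on trained language models; the keyword list and threshold here are invented for the example:

```python
# Toy phishing filter: flag emails whose suspicious-keyword density
# crosses a threshold. Real AI filters use trained language models;
# this keyword list and threshold are illustrative assumptions only.

SUSPICIOUS = {"urgent", "verify", "password", "wire", "account", "suspended"}

def phishing_score(text):
    """Fraction of words that appear in the suspicious-keyword set."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?:") in SUSPICIOUS)
    return hits / len(words)

def is_suspicious(text, threshold=0.15):
    return phishing_score(text) >= threshold

print(is_suspicious("Urgent: verify your password or your account is suspended"))
print(is_suspicious("Lunch at noon tomorrow?"))
```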
TIP Use AI to improve the security of systems, networks, and data by proactively identifying
and preventing cyber-attacks.
Digital Twinning
A digital twin is a virtual replica of a physical system or object. As shown in Figure 8.6, you can use that replica to simulate system behavior, identify potential issues, and improve the performance of those systems. Digital twins are created by gathering data from multiple sources or sensors and then using that data to build a virtual model of the system.
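The gather-and-mirror loop described above can be sketched minimally. The sensor name, smoothing factor, and tolerance below are assumptions for illustration, not part of any particular product:

```python
# Minimal digital-twin sketch: mirror sensor readings into a virtual
# model and flag readings that drift far from the smoothed state.
# Sensor names, smoothing factor, and tolerance are illustrative.

class DigitalTwin:
    def __init__(self, alpha=0.5, tolerance=10.0):
        self.state = {}          # smoothed value per sensor
        self.alpha = alpha       # exponential-smoothing factor
        self.tolerance = tolerance

    def ingest(self, sensor, value):
        """Update the virtual model; return True if the reading is anomalous."""
        prev = self.state.get(sensor)
        if prev is None:
            self.state[sensor] = value
            return False
        anomalous = abs(value - prev) > self.tolerance
        self.state[sensor] = self.alpha * value + (1 - self.alpha) * prev
        return anomalous

twin = DigitalTwin()
readings = [("bearing_temp_C", 70), ("bearing_temp_C", 72), ("bearing_temp_C", 95)]
for sensor, value in readings:
    if twin.ingest(sensor, value):
        print(f"Anomaly: {sensor} jumped to {value}")
```

In practice the "model" would be a trained ML model rather than simple smoothing, fed by an IoT data pipeline.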
Success Stories: Rolls Royce, GE Healthcare, Johnson Controls, Siemens, and ENGIE utilized digital twins to
optimize performance, reduce equipment failures, and improve energy efficiency in their respective industries.
Tools: AWS IoT TwinMaker, Google Cloud IoT Core, and Microsoft’s Azure IoT Hub can be used to create digital twins. Combined with these twins, you can use machine learning models to detect patterns for further action.
TIP Use digital twins to improve the efficiency, safety, and performance of physical systems
by creating a virtual replica and simulating behavior to identify issues and optimize opera-
tions.
Identity Verification
Identity verification is essential for customers creating new accounts or accessing different accounts, systems, and information. It is used in the healthcare industry to access patient records, in the financial services industry to access or open new accounts, and in the government sector to access services based on authentication. You can verify identities through document and biometric authentication methods like passports or facial recognition or by using personal information for authentication.
Success Stories: Goldman Sachs uses Google Cloud Vision for verifying transactions; Gamestatix uses Azure’s facial recognition for user access; and HealthTab utilizes Amazon Rekognition to authenticate medical professionals accessing patient records.
Tools: You can use Azure Cognitive Services and Active Directory for identity verification and Amazon Rekognition for facial and text recognition. Google provides Google Cloud Identity Platform and Firebase Authentication.
TIP Use identity verification to protect individuals and organizations from fraud and iden-
tity theft.
TIP Use IDP to automate manual processes and extract valuable information from docu-
ments; this frees employees’ valuable time.
Intelligent Search
Intelligent search is a type of search that uses AI to understand the meaning of the search term and the context
behind that search and to rank the search results based on relevance. Figure 8.7 shows how, by understanding the context and the meaning of search queries, intelligent search improves search accuracy, efficiency, and user engagement. It’s used in ecommerce, customer service, and knowledge management.
Improved accuracy: By understanding the meaning and the context behind the search term, the system can provide more accurate search results.
Improved efficiency: By ranking the search results by relevance, intelligent search requires less interaction from users to get the results they want, thus improving efficiency.
Success Stories: Coursera uses Amazon machine learning to provide the right recommendations to their students
on their platform.
Tools: You can use Amazon Kendra for intelligent search; Microsoft Graph for relational data; and IBM Watson
Health to analyze medical data for doctor recommendations. Amazon Comprehend can help with text analysis using NLP, and you can use unsupervised machine learning and NLP techniques for relevant result retrieval and contextual understanding of search terms.
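One classical baseline for this kind of relevance ranking is TF-IDF weighting with cosine similarity, sketched below in plain Python. The tiny document corpus and queries are invented; managed services such as Amazon Kendra use far more sophisticated semantic models:

```python
# Rank documents by cosine similarity of TF-IDF vectors -- a classical
# baseline for the relevance ranking that intelligent search performs.
# The tiny corpus and queries below are invented for illustration.
import math
from collections import Counter

docs = {
    "return-policy": "you can return any product within thirty days",
    "shipping": "standard shipping takes five business days",
    "warranty": "every product includes one year of warranty coverage",
}

# Inverse document frequency: rare terms weigh more than common ones
vocab = {w for text in docs.values() for w in text.split()}
idf = {w: math.log(len(docs) / sum(w in t.split() for t in docs.values()))
       for w in vocab}

def tfidf(text):
    counts = Counter(text.split())
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query):
    qvec = tfidf(query)
    scores = {name: cosine(qvec, tfidf(text)) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(search("how do i return a product"))  # "return-policy" should rank first
```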
TIP Use intelligent search to provide more relevant and accurate search results, save users’
time, and avoid frustration.
Machine Translation
Machine translation involves translating from one language to another using AI and ML technologies.
Success Stories: It is used in education, government, and healthcare to communicate between people speaking
different languages. It helps with enhanced learning, collaboration, government services, travel, safety, and better
healthcare through better communication.
Tools: Machine translation uses AI for text or speech conversion, while AWS Translate, Microsoft Azure Transla-
tor, and Google Cloud Translate all provide translation services.
TIP Use machine translation to translate text or speech from one language to another,
leading to greater collaboration, learning, and safety.
Media Intelligence
The demand for media in the form of images, video, and audio is increasing at an unprecedented rate across industries such as education, entertainment, and advertising. Companies must engage with their customers through content, and a lot of manual work is required to label, tag, review, and come up with headlines, summaries, and so on. By adding machine learning to content workflows, you can increase these assets’ lifetime value, reduce costs, and increase the speed of delivering content.
Some of the use cases include audience analysis, market research, public relations, and social media monitoring.
Benefits include faster content processing, reduced costs, increased lifetime value, and higher quality experiences.
Success Stories: Netflix employs machine learning for personalized content streaming, while the New York Times
uses media intelligence to curate engaging topics for its readers.
Tools: AWS Elemental Media Insight, Amazon Rekognition, Google Cloud Natural Language, Microsoft Cognitive Services, and Azure HDInsight enable sentiment analysis, data extraction, and content pattern analysis from media content such as video, audio, and text.
TIP Use media intelligence to analyze media content and gain insights into audience demographics, trends, and brand perception.
ML Modernization
ML modernization involves updating existing machine learning models and pipelines to adopt new technologies and tools, improving their efficiency, accuracy, and performance (see Figure 8.8).
Some of the focus areas of ML modernization are upgrading the algorithm, incorporating new technologies,
retraining, preventing bias, scaling model development, and reducing model development time.
Success Stories: Intuit uses AWS SageMaker for accurate fraud detection in financial transactions. Airbus utilizes
the Google AI platform to enhance airplane wing design, achieving cost savings and better fuel efficiency.
Tools: AWS offers Amazon SageMaker, a comprehensive ML platform with prebuilt models and AWS integrations. Google Cloud AutoML is useful for those with limited ML knowledge, whereas Google AI Platform is for custom model development with Google Cloud integrations. Microsoft’s Azure Machine Learning platform is an all-encompassing platform integrated with Azure storage and serverless computing functions.
TIP Use ML modernization to keep your machine learning models and pipelines up-to-date
with the latest technologies and tools, improving their accuracy, performance, and efficiency.
ML-Powered Personalization
ML-powered personalization involves analyzing user demographic and behavioral data to predict their interests and provide them with personalized products, services, and content. It is used in recommendation engines, search engine results, social media posts, and ecommerce products.
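A minimal sketch of the idea is user-based collaborative filtering: recommend what the most similar user liked. Managed services such as Amazon Personalize use much richer models; the users and items below are invented for illustration:

```python
# Minimal user-based collaborative filtering: recommend items liked by
# the most similar user, using Jaccard similarity over liked-item sets.
# The users and items here are invented for illustration.

likes = {
    "ana":   {"guitar", "amp", "strings"},
    "ben":   {"guitar", "strings", "tuner"},
    "chloe": {"drums", "sticks"},
}

def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user):
    mine = likes[user]
    best = max((u for u in likes if u != user),
               key=lambda u: jaccard(mine, likes[u]))
    return sorted(likes[best] - mine)

print(recommend("ana"))  # items ana's nearest neighbor liked that she hasn't
```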
Success Stories: Yamaha Corporation boosted sales by 30 percent using Amazon Personalize. YouTube enhanced
user engagement by 60 percent with Google Cloud Recommendations API, and Air France raised sales by 13
percent by adopting Azure Machine Learning and Azure Personalizer.
Tools: Using machine learning, natural language processing, and computer vision, services like Amazon Personalize, Amazon Forecast, Google Cloud AutoML, and Microsoft Azure Machine Learning offer personalized content and user experiences.
TIP Use ML algorithms to predict user interests and provide personalized product services
and content, such as recommendation engines, search engines, and social media.
Computer Vision
Computer vision is the ability of computers to interpret and understand visual data from the world around them, such as images and videos, to identify objects, track movements, and analyze scenes.
Figure 8.9 illustrates the various applications of computer vision, use cases, and the technologies that can be used
to implement it.
Computer vision can be used to detect objects, understand scenes, recognize actions in videos, identify diabetic
retinopathy for proactive screening, and spot defects in manufacturing.
Success Stories: Amazon uses computer vision in its Go store to monitor customer behavior and automatically
adds products to carts.
Tools: Amazon uses Rekognition to analyze images and videos and SageMaker for building models. Google provides Google Cloud Vision, along with GAN and CNN techniques, for image analysis. Azure Cognitive Services provides similar capabilities.
TIP Use computer vision in applications for object detection, scene understanding, facial
recognition, action recognition, and more.
TIP Use machine learning for personal protective equipment detection to ensure workers are
wearing the proper safety gear in manufacturing, healthcare, and construction industries.
Generative AI
Generative AI refers to a type of machine learning that can generate new data similar to the data it was trained on. It can be used to generate new content, such as images, speech, and natural language, and for data augmentation. This makes generating new product descriptions, news articles, and other types of content possible.
You can also use it to generate new types of videos and images, which is helpful in marketing.
Generative AI can keep users engaged through marketing campaigns and by creating new fashion designs in
the fashion industry. It has been making waves ever since ChatGPT, discussed in more detail in Chapter 24,
was released in November 2022. I include a few use cases to employ generative AI, but note that this is a very
fast-moving space, and new use cases are continuously emerging. Figure 8.10 shows the different types of data
supported currently by generative AI.
Content creation: Create content such as blogs, articles, and reports based on keywords or questions. Example: News agencies create news articles or sports reports.
Drafting emails: AI suggests full sentences or even paragraphs to draft an email. Example: Gmail’s Smart Compose feature.
Creative writing: Write stories, scripts, or poetry based on user prompts. Example: Scripts for movies or commercials.
Text summarization: Create summaries for large pieces of content. Example: Summarizing large articles and reports.
Generate code: Generate code based on keywords or requirements. Example: Microsoft’s Copilot and Amazon CodeWhisperer.
Detect bugs: Identify bugs. Example: DeepCode can detect bugs in static code.
Refactoring: Suggest improvements in code. Example: Codota, now known as Tabnine, suggests code improvements.
Review code: Assist in the code review process by suggesting improvements or identifying issues. Example: Code Climate provides automated code review for test coverage and maintainability.
Unit testing: Generate unit test cases. Example: Diffblue Cover generates unit tests using AI.
Figure 8.10 depicts the eight data types currently supported by generative AI: Text (models like GPT-4 generate text based on user prompts); Code (models can generate, debug, and review code); Images (DALL-E 2 and Stable Diffusion generate images from text); Data (models can generate synthetic datasets for training other models); Audio (generative models can produce synthetic audio such as music and speech, used for voice assistants, text-to-speech systems, and music composition); Video (generative AI models can produce synthetic video, used in deepfakes and movie production from text prompts); 3D (models can generate 3D shapes and objects such as cars and furniture from text descriptions); and Molecules (models can generate novel molecular structures with desired chemical properties).
Image synthesis: Generates new images resembling the original images. Example: DALL-E by OpenAI can create images from text descriptions.
Style transfer: Applies the style of one image to another. Example: DeepArt can recreate photos in the artistic style of any painting.
Image enhancement: Improves image quality by increasing resolution, removing noise, etc. Example: Let’s Enhance can increase image resolution without losing quality.
Image segmentation: Identifies individual objects within an image. Example: Google’s DeepLab uses AI for semantic image segmentation.
Image colorization: Adds color to black-and-white images or videos. Example: DeOldify colorizes black-and-white images and videos.
Image captioning: Generates a textual description of an image. Example: Google’s Cloud Vision API provides a text description for an image.
Facial recognition: Identifies or verifies a person in a digital image or a video frame. Example: Face++ can do facial recognition in security systems, payment gateways, etc.
Object detection: Detects instances of objects in images. Example: Amazon Rekognition can identify objects, people, text, scenes, and activities in images.
Augmented reality: Integrates digital information with the user’s environment in real time. Example: Snapchat’s lenses can overlay digital content onto the physical world.
Speech synthesis: Generates human-like speech. Example: Google Text-to-Speech and Amazon Polly convert text into lifelike speech.
Music composition: Generates new music compositions. Example: OpenAI’s MuseNet composes new music in various styles.
Voice cloning: Replicates a person’s voice after learning from a small sample. Example: ElevenLabs and Resemble AI’s voice cloning technology can create unique voices for virtual assistants or voiceover.
Speech enhancement: Enhances the quality of speech signals by reducing noise. Example: Krisp, an AI-based app, can mute background noise in any communication app.
Speech-to-text: Converts spoken language into written text. Example: Google’s Cloud Speech-to-Text converts audio to text for transcription services.
Video synthesis: Generates new video content based on specific instructions or data. Example: Nvidia’s GANs can create videos of virtual environments, such as cityscapes.
Video colorization: Adds color to black-and-white videos. Example: DeOldify is an AI model for colorizing and restoring old images and videos.
Super resolution: Increases the resolution of video content. Example: Video Enhance AI by Topaz Labs improves video quality using AI.
Motion transfer: Transfers the motion of one person in a video to another person in a different video. Example: First Order Motion Model for Image Animation enables animating new faces in images.
Video compression: Compresses videos without losing quality. Example: H.266/VVC is a video codec standard enabling high-quality video compression.
Predictive video generation: Predicts the next frames in a video sequence. Example: PredRNN is a recurrent network for predictive video generation.
Video inpainting: Fills in missing or removed parts of a video. Example: Free-Form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN restores missing video parts.
3D object generation: Creates 3D models from descriptions or data. Example: NVIDIA’s GANs can generate 3D models of objects from 2D images.
Virtual reality content creation: Generates immersive VR environments. Example: Artomatix can generate texture maps for virtual reality.
Generative design: Suggests design alternatives for designers. Example: Autodesk’s Generative Design tool provides design alternatives for maximum efficiency.
SUMMARY
This chapter provided a detailed review of how to find the right AI and ML use cases for your company. It began
by exploring the process flow for identifying use cases, starting from educating people about the potential of AI/
ML and Gen AI, and how it can solve business problems. It covered defining business objectives, identifying pain
points, conducting a root-cause analysis, identifying the success criteria, and exploring the latest AI industry trends.
Armed with this information, you can identify your use cases, prioritize them, choose the suitable model, and
finalize them through an iterative process.
In the next chapter, let’s discuss cloud-based AI/ML services offered by different cloud providers to bring your AI/
ML use cases to life.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. Identify the type of processes that are ready for AI/ML.
A. Automated processes
B. Processes that depend on data and happen repeatedly
C. Processes that are easier to automate by software and are fast
D. All the above
2. The purpose behind conducting a root-cause analysis is
A. To identify the pain points that are hindering your business
B. To quantify the business value upon completion of the use case
C. To measure the success of your use case using ROI, payback period, or NPV technique
D. To understand the root cause to identify the proper use case
3. Choose the success metric that would result in cost savings.
A. Improved customer satisfaction
B. Reduced errors
C. Increased revenue growth
4. What is the primary purpose of mapping the use case to a business problem?
A. To assess the potential impact of an AI solution
B. To ensure the use case aligns with the business objectives
C. To identify potential pitfalls in the solution
D. All the above
5. Why is it important to iterate until you finalize the use case?
A. To get a solid understanding of the problem being solved
B. To get feedback from the stakeholders
ANSWER KEY
1. B   2. D   3. B   4. B   5. D   6. A   7. A   8. B   9. A   10. C
9
Evaluating AI/ML Platforms and Services
Information is the oil of the 21st Century, and analytics is the combustion engine.
—Peter Sondergaard
Welcome to the exciting world of evaluating AI/ML platforms and services. Whether you are a data scientist, business analyst, developer, or decision-maker for AI/ML implementation, this chapter will help you choose the right platform and services.
Choosing the right platform and services is pivotal to the success of your AI/ML as well as Gen AI implementation. This chapter covers the various AI and ML services that cloud service providers such as AWS, Microsoft, and Google provide. The idea is to leverage what’s available to get your project kickstarted. Now is an excellent time to look at the different portfolios of AI/ML services and map them to the
business problems you are trying to solve. This includes evaluating specialized services for healthcare and
industry solutions so that you can embark on a path that resonates with your business goals and regulatory
landscape.
It isn’t just about choosing the right tool; it is about aligning with the fastest time to market, handling high-
intensity workloads, maintaining cost-effectiveness, and staying at the cutting edge of technology. Once you
have identified the right use cases, you need the right set of tools, setting the stage to launch your pilot, the
focus of the next chapter (see Figure 9.1).
Given the plethora of AI/ML platforms and services, each with its own features and offerings, it can be
quite overwhelming. This chapter guides you through a comprehensive evaluation process and presents the
choices from the three major cloud providers to help you choose the right platform and services.
This chapter can help you to turn your possibilities into reality. Let’s get started.
Benefits and Factors to Consider When Choosing an AI/ML Service ❘ 107
Objective: Conduct a comprehensive assessment of leading AI/ML platforms and services to determine the most suitable solutions for your enterprise’s specific use cases and business goals.
FIGURE 9.1: PLAN AND LAUNCH: Evaluate AI/ML Platforms & Services
BENEFIT DESCRIPTION
Faster time to market You can speed up your implementation time by leveraging these services because
you do not have to build these algorithms from scratch. Instead, you can focus
on solving the specifics of your complex business problems. These services come
as part of preconfigured environments, which allows for faster prototyping and
experimentation, faster time to market, and more cost-effective implementation.
Tip: Focus on solving complex business problems instead of building algorithms
from scratch.
108 ❘ CHAPTER 9 Evaluating AI/ML Platforms and Services
Intensive workloads These AI services are designed to handle high computational demands and data
processing workloads that are typical of enterprise AI. They provide a scalable
infrastructure along with high-performance capabilities for these AI services to be
developed, trained, and deployed efficiently.
By leveraging the scalability, reliability, security, and performance offered by these AI
services, you can address the common challenges faced by typical machine learning
projects easily. The providers offer large GPUs and TPUs that are optimized for large
ML workloads and reduce training and inference times.
Tip: Use these services to handle large data volumes and spikes in processing
demands by dynamically allocating cloud computing resources based on need.
Cost-effective solutions By leveraging these services that are hosted in the cloud, you do not have to invest up front or build in-house expertise. Thus, you save time and money.
By leveraging the cloud provider’s economies of scale and the pay-as-you-go pricing model, you can avoid up-front infrastructure costs, reduce operating costs, and implement cost-effective AI solutions.
Tip: Monitor the usage of cloud resources to effectively manage costs.
Collaboration and integration These services allow sharing of models, code repositories, and version control systems. Additionally, since they can easily be integrated with other cloud services, databases, and APIs, it is possible to leverage additional capabilities and data sources to meet the needs of the use cases.
Better security These AI services can also be compliant with industry standards and regulatory
compliance requirements and data privacy laws. You can implement security measures such as
data encryption and access control.
Rich capabilities These services come with capabilities such as natural language processing, computer
vision, speech recognition, and recommendation engines in the form of pretrained
models and APIs.
Tip: Integrate these capabilities into business applications, thus empowering your
company to deliver personalized experiences, automate processes, and leverage
valuable business insights that drive customer-centric actions.
Better tool support AI services include popular frameworks such as TensorFlow, PyTorch, and AutoML
platforms that simplify and speed up the model development process.
Up-to-date with innovation Cloud providers also invest a lot in innovation and keep releasing new services and updates to existing services, which will help you stay up-to-date and benefit from the latest research and development.
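The pay-as-you-go benefit can be made concrete with simple arithmetic. All prices below are placeholder assumptions, not actual provider rates:

```python
# Compare an up-front hardware purchase against pay-as-you-go cloud
# training. All prices are placeholder assumptions, not real rates.

def on_prem_cost(hardware_usd, monthly_ops_usd, months):
    """Up-front hardware plus ongoing operations cost."""
    return hardware_usd + monthly_ops_usd * months

def cloud_cost(gpu_hourly_usd, hours_per_month, months):
    """Pay only for the GPU hours actually consumed."""
    return gpu_hourly_usd * hours_per_month * months

# A team that trains only ~40 GPU-hours a month for a year:
print("on-prem:", on_prem_cost(50_000, 500, 12))
print("cloud:  ", cloud_cost(4.00, 40, 12))
```

The break-even point shifts toward on-premises only at sustained, heavy utilization, which is why monitoring usage (as the tip above suggests) matters.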
Identify your business problem: Before choosing an AI service, you need to first identify the business problem you are trying to solve, including the business objectives, the desired outcomes, and the impact expected on your business operations, customer experience, and decisions.
Explore AI/ML service capabilities: Explore the various capabilities offered by these AI services, such as natural language processing, computer vision, and speech recognition, and ensure that they meet your use case.
Evaluate performance and scalability needs: Evaluate the performance and scalability needs of your ML solution and ensure that your service can handle them. Tip: Consider training and inference times, processing speeds, and dataset processing needs.
Assess development and integration needs: Assess the availability of software development kits and the ability to integrate with your company’s backend systems, applications, databases, and processes.
Assess model development tools: Assess the tools available for model development, such as SageMaker, PyTorch, AutoML, TensorFlow, and scikit-learn.
Evaluate the service’s security capabilities: Evaluate your security needs and ensure that the provider can meet them.
Evaluate service pricing: Evaluate the pricing models and ensure that they meet your budget expectations. Tip: Factor in costs for training, data storage, and API calls to avoid future surprises.
Evaluate vendor support: Evaluate vendor support factors such as their support SLAs, their responsiveness, their uptime guarantees, and their track record with you.
Keep up with innovations: Make necessary allowances for the growth of your use case, such as increasing data volumes, innovation, and changing business conditions.
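The checklist above lends itself to a weighted scoring matrix. The criteria weights and 1-to-5 scores below are illustrative assumptions to be replaced with your own evaluation results:

```python
# Weighted scoring matrix for comparing AI/ML services against the
# evaluation criteria above. Weights and 1-5 scores are illustrative
# assumptions -- substitute your own from the evaluation.

weights = {
    "capabilities": 0.25, "performance": 0.20, "integration": 0.15,
    "security": 0.15, "pricing": 0.15, "support": 0.10,
}

scores = {
    "Provider A": {"capabilities": 5, "performance": 4, "integration": 3,
                   "security": 4, "pricing": 2, "support": 4},
    "Provider B": {"capabilities": 4, "performance": 4, "integration": 5,
                   "security": 4, "pricing": 4, "support": 3},
}

def weighted_score(provider):
    return round(sum(weights[c] * scores[provider][c] for c in weights), 2)

ranking = sorted(scores, key=weighted_score, reverse=True)
for p in ranking:
    print(p, weighted_score(p))
```

Running the matrix with the whole team, as in the hands-on exercise below, makes the trade-offs explicit rather than leaving them implicit.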
This is a hands-on exercise that needs the team to participate, collaborate, research, and arrive
at a solution.
The goal is to help the team identify a business problem, evaluate the various AI/ML services available from different cloud providers to address that problem, and then develop an implementation plan for the chosen solution.
At the end of this exercise, the team should be aware of the process involved when selecting an
AI/ML service and consider various business and technical factors.
AI Services
AWS provides high-level AI services, which are pretrained AI services that provide intelligent solutions for well-
known use cases such as fraud detection, personalized recommendations, contact center intelligence, document
processing, intelligent search, business metrics analysis, forecasting, quality control, cybersecurity, and so on.
These AI services can be incorporated into the applications and workflows and used to build end-to-end
solutions in AWS.
NOTE AWS also provides prebuilt, industry-specific AI services, such as for the industrial and healthcare industries.
Amazon SageMaker
In the middle layer, AWS provides SageMaker. This tool is geared toward experts, such as data scientists and ML engineers, who can use it to execute the entire machine learning pipeline, from data ingestion, labeling, data preparation, and feature engineering through model building, training, tuning, deployment, and monitoring. It can also be used for bias detection, AutoML, hosting, explainability, and workflows. These topics are discussed in greater detail in later chapters.
AI Frameworks
At the bottom of the stack are the AI frameworks that can be used by expert practitioners. This layer provides all the major frameworks, such as TensorFlow, MXNet, PyTorch, and Caffe2. AWS also provides high-performance instances for model training, such as P4d instances powered by NVIDIA Tensor Core GPUs. For inference, which typically constitutes 90 percent of the ML cost, AWS provides EC2 Inf1 instances powered by AWS Inferentia chips. Please note that this is a fast-moving space, and you should watch for the latest developments.
Core AI Services ❘ 113
Purpose: Algorithms analyze data and learn from it to solve a business problem; models make predictions and decisions; and AI services provide tools and infrastructure for building ML apps.
Examples of algorithms are linear regression, logistic regression, decision trees, and support vector machines.
Examples of models are deep neural networks, and examples of AI services are AWS SageMaker, Azure Machine
Learning, and Google Cloud ML Engine.
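The distinction between an algorithm, a model, and a service can be grounded with the simplest algorithm mentioned, linear regression: the fitting procedure is the algorithm, and the learned coefficients are the model. The toy data below is invented for illustration:

```python
# Linear regression by ordinary least squares on toy data: the
# *algorithm* is the closed-form fit; the resulting coefficients are
# the *model* used for predictions. The data points are invented.

def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    intercept = my - slope * mx
    return slope, intercept

# Hours of machine use vs. maintenance cost (toy numbers following y = 2x + 1)
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
slope, intercept = fit(xs, ys)
predict = lambda x: slope * x + intercept
print(predict(5))  # → 11.0
```

An AI service, by contrast, wraps this whole fit-then-predict cycle (at much larger scale) behind managed infrastructure and APIs.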
The next sections review the AI services. For the sake of easy understanding, they are categorized into core and
specialized AI services.
CORE AI SERVICES
This section discusses the core AI services offered by AWS, as shown in Figure 9.3.
114 ❘ CHAPTER 9 Evaluating AI/ML Platforms and Services
FIGURE 9.3: Core AI services, including Amazon Lex (chatbots), Amazon Polly and Amazon Transcribe (speech),
and Amazon Transcribe Call Analytics
Amazon Comprehend
Amazon Comprehend is a natural language processing service. It uses machine learning models to classify words
and phrases into categories. It can be used to solve business problems such as the following:
FUNCTIONALITY DESCRIPTION
Key phrase extraction — Extract critical phrases and entities such as people, organizations, and products from text.
Accuracy — Increase the accuracy of your search results, product recommendations, and security.
Sentiment analysis — Identify the sentiment in a text document, whether it is neutral, positive, or negative.
Text classification — Classify text such as product reviews, customer feedback, customer service tickets, news
articles, and so on.
Language detection — Detect abusive language, phishing attacks in text, and fake news.
Compliance risks — Understand customer feedback better and identify compliance risks.
To get started, you need to create an AWS account, enable the Amazon Comprehend service, upload your text
data to Amazon S3, and use Amazon Comprehend to extract insights from it.
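The getting-started steps above can be sketched with the AWS SDK for Python (boto3). The function name and default region are illustrative, and the call assumes an AWS account with credentials configured and Comprehend permissions:

```python
def detect_review_sentiment(text, region_name="us-east-1"):
    """Classify the sentiment of a piece of text with Amazon Comprehend.

    Returns the dominant sentiment label (POSITIVE, NEGATIVE, NEUTRAL,
    or MIXED) plus the per-label confidence scores. boto3 is imported
    lazily so this sketch can be loaded without the SDK installed.
    """
    import boto3  # AWS SDK for Python

    client = boto3.client("comprehend", region_name=region_name)
    response = client.detect_sentiment(Text=text, LanguageCode="en")
    return response["Sentiment"], response["SentimentScore"]
```

For batches of documents stored in S3, you would instead start an asynchronous analysis job and read the results back from S3.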
AWS Translate
AWS Translate is a fully managed service that can translate content and be integrated into applications. It can
translate various text formats such as HTML, JSON, XML, plain text, and speech. It can be used in the
following domains:
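A minimal sketch of such an integration with boto3 (the function name is illustrative; the call assumes configured AWS credentials):

```python
def translate_text(text, target_language, source_language="auto"):
    """Translate a string with Amazon Translate.

    With source_language="auto", the service detects the source language
    itself. boto3 is imported lazily so the sketch can be loaded without
    the SDK installed.
    """
    import boto3

    client = boto3.client("translate")
    response = client.translate_text(
        Text=text,
        SourceLanguageCode=source_language,
        TargetLanguageCode=target_language,
    )
    return response["TranslatedText"]
```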
MARKET APPLICATION
Amazon Textract
Amazon Textract is a machine learning service that uses optical character recognition to extract information such
as text and tables from documents, whether they are PDFs, images, or image-based documents. It can be used for
the following:
FUNCTIONALITY DESCRIPTION
Document analysis Identify different types of content such as text, tables, and images.
Form extraction Extract information from forms such as names, addresses, phone numbers, etc.
Data extraction Extract information from documents such as invoice numbers, dates, amounts, and
recipients.
Text recognition Recognize text such as authors, document titles, and date created.
Amazon Textract can save time and money extracting data from various documents. These use cases are shown in
Figure 9.4.
FIGURE 9.4: Amazon Textract use cases: loan and tax document analysis, automated invoicing, and elimination of
manual errors (finance); medical record analysis and patient care improvement (healthcare); contract extraction
and document processing (legal); resume and application review and recruitment acceleration (human resources);
and improved citizen services (government)
INDUSTRY APPLICATION
Finance Extract data from loan tax documents and other documents to speed up the loan
approval process and eliminate errors due to manual data entry.
Healthcare Review medical treatment plans, patient medical records, etc., to provide better
patient care.
Legal Extract data from legal documents such as contracts and court filings to reduce the time
required to process this information manually.
Human resources Use it to review resumes and applications and speed up the recruitment process to
recruit the top candidates.
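As a sketch of the invoice and form use cases above, the boto3 call might look like this (bucket and key names are placeholders; requires AWS credentials):

```python
def extract_form_fields(bucket, document_key, region_name="us-east-1"):
    """Run Amazon Textract form and table analysis on a document in S3.

    FeatureTypes=["FORMS", "TABLES"] asks Textract to pair field names
    with values (e.g., invoice number, date, amount) and to reconstruct
    tables. Returns the raw Block objects for downstream parsing.
    boto3 is imported lazily.
    """
    import boto3

    client = boto3.client("textract", region_name=region_name)
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": document_key}},
        FeatureTypes=["FORMS", "TABLES"],
    )
    return response["Blocks"]
```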
TIP Be sure that you thoroughly analyze and understand your specific use cases and requirements before
choosing AWS text and document services such as Amazon Comprehend, AWS Translate, and Amazon Textract.
Amazon Lex
Amazon Lex lets you build conversational interfaces that use voice and text. Typical applications include customer
support (answer questions, process orders, manage accounts, make recommendations); home security
(voice-activated control, alarm systems, music playback); education (coursework assistance, recommendations,
Q&A); and healthcare (scheduling appointments, prescription refills, health advice).
FIGURE 9.5: Amazon Lex: Conversational interfaces using voice and text
Chatbots — Can create chatbots that handle customer questions, process orders, check shipping status, manage
accounts, and make product recommendations.
Voice-activated systems — Can control home security systems, alarms, and clocks, and play music.
Virtual assistants Can take orders and book reservations and appointments.
SECTOR APPLICATION
Financial Use chatbots to check account balances, make payments, and transfer funds.
Healthcare Schedule appointments for patients, refill prescriptions, and make generic
recommendations related to health.
Education Help students find coursework, make course recommendations, and answer course-
related questions.
Human resources Answer employees’ questions about benefits and policies, schedule time off, and update
personal information.
Travel Make reservations, learn about travel policies, and check flight status.
Speech
Amazon provides Amazon Polly and Amazon Transcribe for speech-related services.
TIP You should ensure accurate and high-quality speech-to-text and text-to-speech conversions for optimal
performance and the best user experience.
Amazon Polly
Amazon Polly is a cloud-based text-to-speech service that uses deep learning to create authentic lifelike voices.
These voices can be added to applications to create innovative speech-enabled products. Amazon Polly can be
used for the following:
IVR systems Develop IVR systems to interact with customers to answer their queries
using NLP.
Enterprise applications — Help make applications more accessible for users with visual impairments and
reading difficulties.
E-learning and Add audio to courses to make the learning experience pleasant for users who
training platforms prefer to listen rather than read.
Voice-enabled IoT devices — Integrate Amazon Polly with a device such as a thermostat to control it by voice and
make the experience much more pleasant.
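Generating audio for any of these uses comes down to one boto3 call. This sketch assumes configured AWS credentials; the voice ID is one of Polly's built-in voices:

```python
def synthesize_to_mp3(text, output_path, voice_id="Joanna"):
    """Convert text to speech with Amazon Polly and save it as an MP3 file.

    "Joanna" is one of Polly's US English voices; swap in another VoiceId
    for different languages or styles. boto3 is imported lazily.
    """
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId=voice_id
    )
    # AudioStream is a streaming body; write its bytes to disk.
    with open(output_path, "wb") as f:
        f.write(response["AudioStream"].read())
    return output_path
```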
Amazon Transcribe
Amazon Transcribe is a speech-to-text cloud service capability from AWS that businesses can use to convert large
volumes of audio into text for subsequent processing.
INDUSTRY APPLICATION
Call centers Convert call center call recordings into text and analyze them to understand customer needs,
preferences, and problems.
Media — Convert audio and video content into text so that it is easily searchable by users and can be analyzed to
understand user engagement and preferences.
Legal Transcribe large volumes of court hearings and proceedings to help law firms and legal
departments.
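Because large audio volumes are involved, Transcribe runs as an asynchronous job over a file in S3. A hedged sketch (the job name and S3 URI are placeholders; requires AWS credentials):

```python
def start_call_transcription(job_name, media_uri, language_code="en-US"):
    """Kick off an asynchronous Amazon Transcribe job for a recording.

    media_uri is an S3 URI such as "s3://my-bucket/call.mp3" (placeholder).
    The job runs in the background; poll get_transcription_job for the
    finished transcript. boto3 is imported lazily.
    """
    import boto3

    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        MediaFormat="mp3",
        LanguageCode=language_code,
    )
    return job_name
```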
Vision Services
This section discusses some of the available vision services, such as Amazon Rekognition and AWS Panorama.
Amazon Rekognition
Amazon Rekognition helps businesses add video and image analysis capabilities to their applications. It can detect
objects and analyze people and scenes in videos and images. It is used in the following situations:
SECTOR APPLICATION
Security and surveillance — Identify, in real time, people who are trespassing on the premises.
Content moderation Remove inappropriate content in online communities and other online marketplaces.
Marketing and Personalize marketing campaigns by analyzing videos and images to understand
advertising customer demographics, behavior, and preferences.
Retail Categorize products based on images to create catalogs, saving much time.
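The product-categorization use case reduces to asking Rekognition for labels on each image. A sketch (bucket and key names are placeholders; requires AWS credentials):

```python
def label_image(bucket, image_key, min_confidence=80.0):
    """Detect objects and scenes in an S3-hosted image with Amazon Rekognition.

    Returns (label name, confidence) pairs at or above min_confidence,
    which can drive catalog categorization. boto3 is imported lazily.
    """
    import boto3

    rekognition = boto3.client("rekognition")
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": image_key}},
        MaxLabels=10,
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]
```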
TIP You can leverage vision services like Amazon Rekognition and AWS Panorama for
enhanced security, personalized marketing, and optimized operations.
AWS Panorama
AWS Panorama allows you to add computer vision to your applications. It enables you to detect objects, people, and
scenes in videos and images and create 3D models of objects and scenes. It has uses in the following industries:
INDUSTRY APPLICATION
Retail — Track customers and, based on their behavior, recommend products they are likely to buy. Use Panorama
to determine where customers spend more time in the store by tracking their movements, and use this
information to redesign the store layout so that customers can quickly find their products.
Manufacturing Identify defects in the products. Also, track the movement of products within the
manufacturing assembly and improve the manufacturing process.
Supply chain management — Identify the movement of goods through the supply chain, identify potential areas
where delays are happening, and identify opportunities to improve.
Warehouse management — Identify the movement of people and vehicles within a warehouse and identify
potential opportunities to improve the logistics process.
Task 1: Text and document analysis — Owner: data scientist. Tools and data: AWS Comprehend, AWS Translate,
AWS Textract, and sample data. Deliverable: a document detailing results and potential use cases. Steps: Use AWS
Comprehend for sentiment analysis on customer reviews; use AWS Translate to translate content such as customer
support tickets; use AWS Textract to extract data from a set of invoices.
Task 2: Speech services implementation — Owner: ML engineer. Tools and data: AWS Polly, AWS Transcribe, and
sample text and audio data. Deliverable: a document detailing results and potential use cases. Steps: Use AWS
Polly to develop IVR systems and add audio to e-learning courses; use AWS Transcribe to convert call center
recordings into text for analysis.
Task 3: Vision services implementation — Owner: data scientist. Tools and data: AWS Rekognition, AWS
Panorama, and sample images and video data. Deliverable: a document detailing results and potential use cases.
Steps: Use AWS Rekognition for security surveillance to identify trespassers; use AWS Panorama to track customer
behavior in retail stores.
Task 4: Chatbots — Owner: AI engineer. Tools and data: Amazon Lex and sample prompts for testing. Deliverable:
a working chatbot and documentation of its functionalities. Steps: Use Amazon Lex to build a chatbot that handles
customer questions, processes orders, and checks shipping status.
Task 5: Evaluation and report writing — Owner: project manager. Inputs: feedback from task owners and the
objectives and key results (OKRs). Deliverable: a comprehensive report on the implementation, evaluation, and
potential business use cases of AWS AI services. Steps: Collect feedback and results from all task owners; evaluate
the results against the objectives and key results (OKRs); prepare a comprehensive report detailing the
implementation, results, potential business use cases, and future improvements.
Specialized AI Services ❘ 121
SPECIALIZED AI SERVICES
This section covers the specialized AI services from AWS, as illustrated in Figure 9.6.
FIGURE 9.6: Specialized AI services, including Amazon DevOps Guru (code and DevOps), Amazon Kendra (search),
and Amazon Monitron, Amazon Lookout for Equipment, and Amazon Lookout for Vision (industrial)
Amazon Personalize
Amazon Personalize is a fully managed service that personalizes customer experiences at scale. It can be used to
make product recommendations on websites, improve search results, and power marketing campaigns by
recommending relevant products to customers.
It uses collaborative filtering, content-based filtering, and rule-based filtering. User, behavioral, and product data
are used to make these recommendations.
Collaborative filtering — Uses data from other users with similar interests to recommend products to
the current user
Content-based filtering Uses data from the products that the user has already interacted with to
recommend other related products
Rules-based filtering Is based on the rules that you specify to recommend the products
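The collaborative-filtering idea can be illustrated in a few lines of plain Python. This is a toy user-based sketch on a tiny ratings dictionary, not how Amazon Personalize is implemented internally:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two interaction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(ratings, user, k=2):
    """Recommend items the user hasn't interacted with, scored by the
    ratings of the k most similar users (user-based collaborative filtering)."""
    items = sorted({i for r in ratings.values() for i in r})

    def vec(u):
        # Represent each user as a vector of ratings over all items.
        return [ratings[u].get(i, 0) for i in items]

    peers = sorted(
        (u for u in ratings if u != user),
        key=lambda u: cosine(vec(user), vec(u)),
        reverse=True,
    )[:k]
    scores = {}
    for peer in peers:
        for item, rating in ratings[peer].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0) + rating
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "alice": {"book": 5, "lamp": 3},
    "bob":   {"book": 4, "lamp": 3, "mug": 5},
    "carol": {"desk": 4},
}
print(recommend(ratings, "alice"))  # → ['mug', 'desk']
```

Because bob's tastes most closely resemble alice's, his "mug" rating outranks carol's "desk"; Personalize applies the same intuition at much larger scale with learned models.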
122 ❘ CHAPTER 9 Evaluating AI/ML Platforms and Services
Amazon Forecast
Amazon Forecast is a fully managed service that uses machine learning to generate accurate time-series forecasts.
The following sections offer a view into how it works and some of the places where it can be leveraged.
Figure 9.7 shows how Amazon Forecast works: it uses the latest algorithms to predict future time-series data from
historical data and requires no machine learning experience on the user's part.
FIGURE 9.7: How Amazon Forecast works. You upload historical data (sales, customer, website, app usage,
financial, weather, social media, and supply chain data) and related data (promotions, events, holidays, product
data, and so on). Amazon Forecast inspects the data, identifies key attributes, selects the right algorithms, and
trains a custom model, which it then uses to generate customized forecasts. Forecasts can be viewed in a console,
exported for downstream analysis, or accessed via API.
ARIMA
ARIMA stands for Auto-Regressive Integrated Moving Average and is a statistical model that uses past data to
forecast future values. It is used for the following:
Forecast financial data — Can be used to forecast financial data such as stock prices and exchange rates.
Predict sales or demand — Capable of predicting sales or demand during different times of the year, such as
holidays or seasons, due to its ability to capture trends and seasonality in data.
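The autoregressive building block of ARIMA is easy to see in miniature. This toy fits a single AR(1) coefficient by least squares on a zero-mean series; it omits the differencing and moving-average parts and is nothing like a production implementation:

```python
def fit_ar1(series):
    """Estimate phi in x_t ≈ phi * x_{t-1} by least squares
    (the AR part of ARIMA, without mean/intercept terms)."""
    num = sum(series[t - 1] * series[t] for t in range(1, len(series)))
    den = sum(x * x for x in series[:-1])
    return num / den

def forecast_ar1(series, steps=1):
    """Forecast future values by repeatedly applying the fitted coefficient."""
    phi = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out

history = [1, 2, 4, 8, 16]       # doubles each step, so phi comes out as 2.0
print(forecast_ar1(history, 2))  # → [32.0, 64.0]
```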
Prophet
Prophet is a time-series forecasting tool developed by Facebook. Some of its features include the following:
FEATURE DESCRIPTION
Additive models — Uses an additive model to capture seasonality, trends, and other effects that influence time-
series data when making predictions
Use cases Useful when you have multiple sources of data such as IoT and sensor data, for
example, to predict when a machine may require maintenance
DeepAR
DeepAR is a deep-learning-based supervised learning technique. Some of its features include the following:
FEATURE DESCRIPTION
Recurrent Neural RNNs are used to model complex time-series data and can handle multiple time-
Networks (RNNs) series data with different levels of granularity.
Use cases RNNs are used where there are multiple factors that can influence the outcome.
Customer behavior Often used to forecast customer behavior such as website traffic and sales.
Stock prices Can be used to predict stock prices based on past data, current events, and
other factors.
ETS
ETS (Exponential Smoothing State Space Model) is a time-series forecasting method. Some of its features include
the following:
FEATURE DESCRIPTION
Smoothing approach ETS uses a smoothing approach by generating a weighted average of past data to
predict future values.
Seasonality and trends — ETS works well with seasonal data and trends and is particularly useful for fast-changing
situations.
Short-term trends ETS is useful to predict short-term trends such as forecasting demand for perishable
goods, or the demand for different types of medications based on historical and
other data.
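The weighted-average idea behind ETS can be shown with simple exponential smoothing in a few lines (a toy sketch of the smoothing component only; full ETS also models trend and seasonal states):

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the newest observation and the previous smoothed value,
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}. Recent data gets more
    weight as alpha grows; the final smoothed value serves as the
    one-step-ahead forecast."""
    smoothed = series[0]
    for x in series[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed

demand = [100, 100, 100, 100]
print(exponential_smoothing(demand, alpha=0.3))  # a flat series forecasts itself
```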
Note that no forecasting model is perfect. All these models have some degree of error, and you’ll get the best
results by combining predictions from multiple models. Table 9.2 compares and contrasts these forecast models.
ARIMA — Type: statistical. Example data: sales data, website traffic, and customer behavior data. Strengths: simple
and easy to interpret; handles seasonality and trends well. Limitations: assumes linear relationships; can be
sensitive to outliers.
Amazon Lookout for Metrics
Amazon Lookout for Metrics uses machine learning to detect anomalies in business and operational metrics. Some
of its features include the following:
FEATURE DESCRIPTION
Prebuilt and You can either use prebuilt models or create your own custom models.
custom models
AWS console You can use the AWS console to create a detector, feed in the labeled data to train
the detector, test it, and, when satisfied, deploy it into production.
FEATURE DESCRIPTION
Detect anomalies Used to detect anomalies in websites, traffic, machine downtime, sales data inventory
levels, customer satisfaction, and operational data.
Automatic detection Automatically detects and labels anomalous data points and even identifies the root
cause behind those anomalies.
Integration Designed to be fully integrated with other AWS data storage solutions such as
Redshift, S3, RDS, and other third-party databases.
Amazon Kendra
Amazon Kendra provides search enabled by machine learning for websites and apps, as shown in Figure 9.8.
FIGURE 9.8: Amazon Kendra. Data source connectors (S3, Salesforce, Slack, and so on) feed in content; techniques
to improve search accuracy include relevance tuning, pre-trained search domains, custom content attributes,
custom synonyms, and custom document enrichment; optimized results power website search, contact center
agent assist, and embedded search apps.
Employee knowledge base Set up an enterprise search solution that includes content from databases,
websites, and other internal documentation such as FAQs, employee
handbooks, and product documentation.
Customer support knowledge management system — For customer support teams; includes FAQs,
troubleshooting guides, product documentation, and customer support tickets.
Ecommerce Create a custom search engine where customers can find their products
quickly, thus increasing customer service satisfaction and sales.
Healthcare — Create a medical search engine that enables doctors and professionals to find the information they
need quickly; this can help doctors make better decisions and improve patient outcomes.
Legal — Create a legal search engine for lawyers to quickly find the information they need, which can improve
their service to clients by helping them build more robust legal cases and provide better legal advice.
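Querying an existing Kendra index is a single call. This sketch assumes AWS credentials and an index you have already created and populated (the index ID is a placeholder):

```python
def search_knowledge_base(index_id, question):
    """Ask an Amazon Kendra index a natural-language question.

    Returns the text excerpts of the top results; each result item also
    carries the source document URI and a relevance-backed ranking.
    boto3 is imported lazily.
    """
    import boto3

    kendra = boto3.client("kendra")
    response = kendra.query(IndexId=index_id, QueryText=question)
    return [item["DocumentExcerpt"]["Text"] for item in response["ResultItems"]]
```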
Amazon CodeGuru
Amazon CodeGuru Reviewer fits into the cycle of writing code, testing the app, and deploying the app, as shown
in Figure 9.9.
FIGURE 9.9: Amazon CodeGuru uses machine learning to improve app code.
BENEFIT DESCRIPTION
Improve code quality — Conduct code reviews, automated suggestions, and remediations to enhance the quality
of the code.
Find and fix bugs — Reduce development time by providing suggestions to improve the code.
Improve security — Find security bugs and help comply with regulations by providing suggestions to improve
the code.
Improve performance — Provide suggestions so that you can leverage the appropriate infrastructure to
improve performance.
TIP Use Amazon CodeGuru to enhance code quality, security, and performance, and use
DevOps Guru to proactively monitor DevOps processes and detect issues.
Amazon DevOps Guru
Amazon DevOps Guru uses machine learning to detect operational issues before they impact customers. Some of
its features include the following:
FEATURE DESCRIPTION
Anomaly detection Leverages operational data from AWS CloudWatch, CloudTrail, and other AWS
services, continuously analyzes them, and detects anomalies that are then used
to generate alerts for subsequent reactive or proactive action
Real-time alerts Generates alerts through integration with AWS Systems Manager, Amazon
EventBridge, Amazon SNS, and other third-party incident management systems
Performance, security, Helps improve performance, provides security, and ensures high availability by
and availability preventing downtime
TIP The integration capability of DevOps Guru comes in handy, especially for AI and ML applications, because
they are often complex, require considerable compute resources, and are challenging to monitor in real time.
This is a learning exercise focused on applying the skills and knowledge gained about AWS services to a series of
real-world tasks.
Industrial Solutions
Amazon provides a few industrial solutions, including Amazon Monitron, Amazon Lookout for Equipment,
and Amazon Lookout for Vision (see Figure 9.10). These solutions apply artificial intelligence (AI) and machine
learning (ML) to vast amounts of industrial data generated by sensors and systems, producing insights that help
optimize industrial processes and improve quality and safety.
FIGURE 9.10: Amazon industrial solutions: Amazon Monitron (predictive maintenance and operational efficiency),
Amazon Lookout for Vision (detect defects and anomalies), and Amazon Lookout for Equipment (reduce
downtime, improve reliability)
Amazon Monitron
Amazon Monitron is an industrial solution that can detect anomalies in industrial equipment. Some of its features
include the following:
FEATURE DESCRIPTION
Predictive maintenance — Combines sensors, gateways, and machine learning to reduce unplanned downtime
and improve predictive maintenance.
Issue detection Detects vibration, flow, temperature, sound, and pressure to identify potential issues.
Equipment care Improves equipment reliability, reduces maintenance costs, and improves safety.
Industry adoption — Can be used in several industries, such as manufacturing, transportation, energy, and
automotive, to monitor equipment such as pumps, motors, bearings, valves, and more.
Use cases Industrial use cases include quality control, energy optimization, predictive maintenance,
and asset management.
Amazon Lookout for Vision
Amazon Lookout for Vision uses computer vision to spot visual defects and anomalies. Some of its features include
the following:
FEATURE DESCRIPTION
Detect visual damages — This service can identify visual damage, including dents and water damage; safety
hazards due to oil spills, leaks, and blocked walkways; and product quality issues such as color and
texture anomalies.
Use cases This tool is particularly useful in industries such as manufacturing, packaging, and
logistics.
Healthcare Solutions
AWS provides several healthcare-related services, including Amazon HealthLake, Amazon Comprehend Medical,
and Amazon Transcribe Medical, each of which is discussed in the following sections.
Amazon HealthLake
HealthLake is a managed service that stores, transforms, and analyzes healthcare data to improve healthcare
quality, reduce costs, and improve the efficiency of healthcare organizations. Here are some of its features:
FEATURE DESCRIPTION
Proactive care — HealthLake can look into patients' data to identify those at risk of developing complications and
then proactively treat or educate them.
Cost reduction Hospitals can use HealthLake to understand where they are spending money, such
as on prescription drugs, and then negotiate that with the drug companies for
better pricing.
Appointment scheduling — Hospitals can automate scheduling appointments using HealthLake, which will not
only impart efficiency to the hospital's operations but also result in better patient satisfaction.
Clinical trials HealthLake can analyze large datasets to identify anomalies, which can help medical
researchers identify potential participants for clinical trials and lead to new therapies
and treatments.
Clinical support system — HealthLake can also act as a clinical support system by providing physicians with the
latest clinical research and insights to help them adjust their treatment plans or identify other potential
treatment options.
Population health — Healthcare organizations can also track trends in population health over time and then
proactively develop measures to prevent disease or improve overall health.
TIP You can leverage Amazon healthcare services for proactive treatment, cost optimization,
automation, clinical support, and secure medical transcription.
Amazon Comprehend Medical
Amazon Comprehend Medical extracts medical information from unstructured clinical text. Here are some of
its features:
FEATURE DESCRIPTION
Population health Helps with population health management by better understanding the patient’s health
and developing personalized treatments.
Medical coding — You can also use Amazon Comprehend Medical to automatically code medical diagnoses and
procedures for billing purposes, saving the time and effort required for manual medical coding.
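Entity extraction from clinical text is one boto3 call. A hedged sketch (requires AWS credentials; the function name is illustrative):

```python
def extract_medical_entities(clinical_text):
    """Extract medical entities (conditions, medications, anatomy, and
    so on) from clinical text with Amazon Comprehend Medical.

    Returns (entity text, category) pairs; each entity in the response
    also carries confidence scores and traits. boto3 is imported lazily.
    """
    import boto3

    client = boto3.client("comprehendmedical")
    response = client.detect_entities_v2(Text=clinical_text)
    return [(e["Text"], e["Category"]) for e in response["Entities"]]
```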
Amazon Transcribe Medical
Amazon Transcribe Medical converts medical speech to text. Here are some of its features:
FEATURE DESCRIPTION
Compliance HIPAA-eligible
Use case Securely transcribes medical speech, such as physician-patient conversations, medical
dictations, and clinical notes
Clinical notes Includes progress notes, discharge summaries, and consultation notes
The focus of this exercise is to provide practical experience using these Amazon industrial solutions to improve
operational efficiency, predictive maintenance, and quality control in an industrial setting.
Amazon SageMaker
Amazon SageMaker is a fully managed machine learning service provided by AWS to build, train, and deploy
machine learning models. It provides end-to-end support for the entire machine learning lifecycle, as shown in
Figure 9.12. It allows developers and data scientists to deploy models at scale in production.
FIGURE 9.12: Amazon SageMaker provides end-to-end support for the full machine learning lifecycle: prepare
data, store features, detect bias, build with notebooks, train and tune model parameters, deploy in production,
explain predictions, manage edge devices, and manage and monitor models. Capabilities include notebooks,
model training, hosting, autopilot, and lifecycle integration. SageMaker speeds up development and enables data
scientists and developers to focus on core tasks and deploy models at scale in production.
CATEGORY FEATURES
Capabilities ➤➤ Uses Jupyter notebooks to develop and train machine learning models
Benefits Helps companies to reduce costs, improve efficiency, and make better business decisions
FEATURE DESCRIPTION
Collaboration Enables more collaboration between developers, data scientists, and business analysts.
Speed Helps organizations speed up the machine learning workflow and deploy
models quickly.
Democratization Business analysts can use SageMaker with limited machine learning knowledge.
Use Cases Includes fraud detection, product recommendations, and customer churn prediction.
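Once a model such as a churn predictor is deployed, applications get real-time predictions by calling its endpoint. A sketch (the endpoint name is a placeholder for one you have created; requires AWS credentials):

```python
def invoke_model(endpoint_name, csv_row):
    """Get a real-time prediction from a deployed SageMaker endpoint.

    csv_row is a single record serialized in the format the model was
    trained to accept (CSV here). boto3 is imported lazily.
    """
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_row,
    )
    return response["Body"].read().decode("utf-8")
```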
FEATURE DESCRIPTION
Hands-on Helps employees to learn machine learning development through guided, hands-on
experience experiences
Experimentation Allows users to learn and experiment with different models without worrying about
the underlying hardware and infrastructure
TIP Use Amazon SageMaker’s end-to-end support for ML lifecycle, democratize ML with
SageMaker Canvas, and upskill employees with SageMaker Studio Lab.
THE GOOGLE AI/ML SERVICES STACK
Vertex AI
Vertex AI is a cloud-based machine learning platform that helps companies build, train, and deploy models in
production. It helps with the end-to-end machine learning workflow, including data preparation, model training,
deployment, and management.
Vertex AI Workbench
Vertex AI Workbench is a single development environment where data scientists can develop and then deploy
models. Here are some of its features:
FEATURE DESCRIPTION
Easy transition from development Integrated with Vertex AI and other tools, which helps minimize the
to production transition from development to production
AI Hub
Google’s AI Hub is a centralized platform where data scientists, machine learning engineers, developers, and other
AI practitioners discover, share, and deploy models from within the hub. Here are some of its features:
FEATURE DESCRIPTION
Share assets Gain access to other assets such as datasets and APIs.
Collaboration tools Access other tools for collaboration, such as version control, code reviews, and
issue tracking.
This hands-on exercise is focused on helping data scientists and AI practitioners to use
Google’s AI/ML services such as Vertex AI, Vertex AI Workbench, and AI Hub. It will help the
team members to create and deploy models and prototype solutions.
Task 1: Assume you are working for a retail company to improve sales using a predictive model.
TOOLS STEPS
Vertex AI — Gather and prepare the necessary data, such as historical sales data and data related to external
factors that could impact sales. Deploy the trained model and create a mechanism for new data input.
For Developers
The AI/ML services listed in Table 9.3 are available to developers.
➤➤ Vision: text recognition, image search, landmark detection
➤➤ Translation: Cloud Translation API, Google Translate, neural machine translation
AutoML
AutoML is a set of tools and techniques that makes model development accessible by automating many of its
tasks. Some of its features include the following:
The Google AI/ML Services Stack ❘ 139
FEATURE DESCRIPTION
Easier model Intended to make model development easier for those with limited knowledge of
development machine learning
Development tools Provides several tools, such as AutoML Vision, AutoML Natural Language, and
AutoML Tables, to solve specific ML problems
Cloud Natural Language
The Cloud Natural Language API offers the following features:
FEATURE DESCRIPTION
Natural language tasks — Helps with sentiment analysis, entity recognition, syntax analysis, content classification,
and more.
Application integration — Developers can apply natural language processing to applications using this API.
Sentiment analysis Developers can train machine learning models to classify, extract, and detect
sentiments.
Dialogflow
The Google Dialogflow platform enables developers to create conversational interfaces for applications such as
chatbots, voice assistants, and other conversational agents. Some of its features include the following:
FEATURE DESCRIPTION
User requests Leverages machine learning and natural language understanding techniques to
understand user requests and provide responses.
Human-like Facilitates natural interaction and human-like capabilities and offers enterprise-wide
interactions scalability.
TIP Use Vertex AI for data scientists, AutoML and Dialogflow to streamline development,
and Cloud Natural Language and Vision AI to enhance applications.
Media Translation
Media translation enables real-time translation of audio or video content into text in the language of your choice.
Currently, it supports 12 languages. The following sections discuss the services that Google offers under the media
translation umbrella.
TIP Media translation can be used to make your content more accessible to a global audi-
ence.
Speech-To-Text
FEATURE DESCRIPTION
Google Contact Center AI — Speech-to-Text plays a vital role in Google Contact Center AI by converting spoken
words into written text, which supports real-time transcription, sentiment analysis, call analytics, and interactive
voice response (IVR).
Speech-to-text conversion — Provides real-time conversion of speech into text and can be used for video
captioning, voice-enabled chatbots, transcribing meetings and calls, and so on.
Supports multiple languages — Includes English, French, Spanish, and Japanese, and can understand multiple
dialects and accents.
Personalized communications — Can be used to personalize communications and engage with customers via
voice-based user interfaces in devices and applications.
Translation AI
Google offers this translation service, which can be used to translate text and speech between different languages.
Some of its features include the following:
FEATURE DESCRIPTION
Global reach Helps reach customers globally and provides compelling experiences by
engaging with them in their local languages.
Translation options Google provides three translation options: Cloud Translation API, Google
Translate, and Neural Machine Translation.
Language support Google Translate is free and can be used for nearly 100 languages. It translates
text, images, speech, and web pages.
Google Translate API A paid service that can be used to integrate translation into your applications
and services.
Neural machine A much more accurate translation service that uses neural networks for
translation translation.
Video AI
Google Video AI is used in many applications such as Google Photos, YouTube, and Google Cloud. It provides
automatic video tagging, video recommendations, video transcription and translation, and content moderation.
Vision AI
Google Vision AI is a cloud-based image analysis service that detects and analyzes objects, faces, text, and other
visual entities in videos and images. It is helpful for label detection, face detection, text recognition, optical
character recognition (extracting text from scanned documents), image search (finding similar images), and
landmark detection.
In this exercise, the project team will build an application using Google AI/ML
services using image recognition, natural language processing, translation, and
chatbot functionality.
Azure provides a range of AI/ML services, such as Azure Cognitive Services and Azure Video Indexer, which apply
natural language processing, computer vision, and speech recognition to tasks such as extracting entities from
audio and video.
Azure Cognitive Services
FEATURE DESCRIPTION
Purpose: Helps developers integrate AI and ML capabilities into their applications, websites, and devices.
Capabilities: Includes natural language processing, computer vision, speech recognition, and more.
Use cases: Can improve customer service, fraud detection, product recommendations, image classification, and natural language processing.
Method: Just make an API call to embed the ability to speak, hear, see, search, understand, and make decisions in your apps.
Specific services: Can be used for speech-to-text, text-to-speech, speech translation, and speaker recognition.
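Because these services are consumed through simple API calls, a request can be assembled in a few lines. The sketch below builds the URL, headers, and body for a sentiment-analysis call; the Ocp-Apim-Subscription-Key header is the standard Cognitive Services one, but the exact path and body schema vary by service and API version, so treat the details as illustrative and check current Azure documentation.

```python
import json

def build_cognitive_request(endpoint, api_key, documents):
    """Assemble the URL, headers, and body for an illustrative Azure
    Cognitive Services (sentiment analysis) REST call."""
    url = f"{endpoint}/text/analytics/v3.1/sentiment"
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,   # standard auth header
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "documents": [
            {"id": str(i), "language": "en", "text": text}
            for i, text in enumerate(documents, start=1)
        ]
    })
    return url, headers, body

url, headers, body = build_cognitive_request(
    "https://example.cognitiveservices.azure.com", "<your-key>",
    ["The support team resolved my issue quickly."],
)
```

Sending the request (for example with the requests library or the Azure SDK) returns per-document sentiment scores you can feed into your application logic.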
FEATURE DESCRIPTION
Cognitive service for decisions: Provides content moderation and personalization of experiences for users, and addresses device problems earlier, all contributing to better decision-making.
Open API service: Provides an open API service for advanced language models and coding.
TIP Use Azure Applied AI services to integrate AI capabilities into your applications and
unlock the power of natural language processing, computer vision, real-time analytics, and
intelligent search.
TIP Azure Stream Analytics can be used to monitor and analyze streaming data in real time,
which can help you make better decisions and take action more quickly.
FEATURE DESCRIPTION
AI and NLP: Uses AI and natural language processing to search data such as images and text.
Search features: Provides semantic search, faceted navigation, fuzzy matching, and geospatial search features.
Unstructured data: Can extract insights from unstructured data through sentiment analysis, key phrase extraction, and entity recognition.
PHASE FEATURE
Data preparation: During data preparation, developers can label training data, manage labeling projects, use analytics to explore data, and access and share datasets.
Build and train models: Provides notebooks, automated machine learning capability, the ability to run experiments, accelerated training on a scalable compute engine, access to secure compute instances, and access to open-source libraries such as Scikit-learn, PyTorch, TensorFlow, Keras, and more.
Validate and deploy models: Provides the ability to deploy model endpoints for real-time and batch inference, pipelines and CI/CD workflows, prebuilt instances for inference, a model repository, and hybrid and multicloud deployment options.
Manage and monitor models: Provides the ability to track logs, analyze data and models, detect drift, maintain model accuracy, debug models, trace machine learning artifacts for compliance, and ensure continuous monitoring of security and cost controls, including quota management and automatic shutdown.
This exercise gives your team practical experience with building, training, and deploying a machine learning model using Azure Machine Learning. It covers important aspects such as data preparation, model training, validation, deployment, and monitoring.
Remember to clean up resources at the end of the exercise to avoid unnecessary charges. Also
note that the roles in the exercise can vary in your organization.
Dataiku
Dataiku is an enterprise-class data science platform that allows data preparation, machine learning, and collaboration all in one place. It enables use cases such as customer segmentation, fraud detection, product recommendation, inventory management, and supply chain optimization.
Dataiku has key capabilities for explainable AI. It includes interactive reports for feature importance, partial
dependence plots, subpopulation analysis, and individual prediction explanations.
DataRobot
DataRobot’s specific capabilities include AutoML, which automates the machine learning development and
deployment lifecycle, and explainability, which explains how the models work and helps businesses build trust
around their models. It facilitates collaboration by allowing users to share their work, including code, models,
and workflows, in a collaboration platform. Its governance platform provides the tools to manage risks with
machine learning models to comply with regulations and protect data.
KNIME
KNIME is a powerful open-source data analytics tool that helps with data preparation and machine learning modeling. In addition to typical data preparation, feature extraction, and machine learning modeling tools, it provides a web extraction tool that extracts data from websites.
IBM Watson
IBM Watson is a robust set of AI/ML tools that businesses can leverage to carry out several use cases around natural language processing, speech recognition, machine vision, and so on. It provides capabilities such as natural language processing, machine learning to make predictions, deep learning for highly accurate speech recognition and image recognition, speech-to-text, text-to-speech, visual recognition, and virtual agents.
TIP When choosing an enterprise cloud AI platform, it is important to thoroughly assess the
features, scalability, security, and integration capabilities of the platform.
Salesforce Einstein AI
Salesforce Einstein AI is a robust set of AI and machine learning tools developed by Salesforce to enhance the
CRM experience for businesses. Some of its key capabilities include predictive analytics, NLP, image recognition,
automated insights, and intelligent automation for workflows.
Oracle Cloud AI
Developed by Oracle, this product helps developers and businesses to integrate AI into their applications and
workflows. It provides capabilities such as machine learning, an autonomous database to manage the data
automatically, chatbots that can integrate with messaging platforms and websites, intelligent apps that provide
recommendations and automate tasks, and data analytics.
This hands-on exercise is focused on researching the capabilities of the platforms to implement the use cases under consideration. The team will evaluate the platforms against features such as AutoML, explainability, collaboration, security, scalability, integration, and deployment options.
SUMMARY
This chapter reviewed the benefits of cloud-based AI/ML services and the factors to consider when choosing one, explaining the criteria to apply when selecting an AI/ML service from one of the major cloud service providers: AWS, Google, and Azure.
It began by discussing the benefits of using cloud AI/ML services, including faster time to market, handling intensive machine learning workloads, cost effectiveness, collaboration, integration capabilities, security compliance, rich AI capabilities, tool support, and keeping up with innovations.
To assist you with the evaluation process, the chapter reviewed several factors, including identifying the business problem, exploring service capabilities, reviewing performance and scalability needs, and considering development and integration needs, security capabilities, price points, vendor support, and regulations.
It also examined various AI/ML services offered by the three major cloud providers, giving you a glimpse into the capabilities and tools available for various use cases and industries.
In the following chapters, you continue your journey toward deploying enterprise AI in the cloud. You explore
data processing, building AI platforms, modeling pipelines, ML operations, and governance practices, and delve
into AI/ML algorithms to train models.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. Choose the correct answer regarding AI/ML services.
A. The ML technology was limited to a few companies in the past but is now spreading to more
companies due to the advent of AI/ML services.
B. The ML technology is still in the periphery and has not reached the core of business applications.
C. AI and ML are only popular with data scientists and not so much with other team members.
D. Cloud service providers are still working on providing the right set of AI and ML services
for projects.
2. Which of the following is a high-level AI service?
A. SageMaker
B. MXNet
C. Fraud detection
D. TensorFlow
3. Choose the correct machine learning algorithm.
A. SageMaker
B. Azure ML
C. Decision trees
D. Google Cloud ML Engine
4. Choose the AWS service that helps developers identify the phrases and entities in a document.
A. AWS Comprehend
B. AWS Rekognition
C. Amazon Textract
D. SageMaker
5. How can Amazon Polly be used in enterprise applications?
A. To develop IVR systems
B. To make applications more accessible to the visually impaired
C. To create lifelike voices in apps
D. To add audio to e-learning courses
6. What can AWS Panorama be used for in retail industries?
A. To personalize marketing campaigns
B. To track customer behavior and recommend products
ANSWER KEY
1. A 6. B 11. C
2. C 7. D 12. C
3. C 8. B 13. A
4. A 9. B 14. B
5. B 10. C 15. C
10
Launching Your Pilot Project
The people who are crazy enough to think they can change the world are the ones who do.
—Steve Jobs
This chapter is a pivotal moment in your AI journey, where you transition from strategy to execution. You
are on the cusp of introducing AI/ML into your company operations.
Launching a pilot is no small feat; it is fraught with challenges and requires precise coordination, and that's where the methodology comes into play. This chapter assists you in navigating the essential stages of launching, from initial planning to the nitty-gritty details of a machine learning process.
This chapter brings all the strategy and vision discussed in the previous chapters to reality through a pilot
launch, starting from business objectives to model development, deployment, and monitoring. I make this
real through a hands-on project implementation exercise that bridges theory with practice. Ultimately, the
goal is to launch a successful pilot that makes your strategic vision a reality. See Figure 10.1.
[Figure 10.1: The AI implementation methodology wheel: 01 Strategize and Prepare; 02 Plan and Launch Pilot; 03 Build and Govern Your Team; Continue Your AI Journey.]
NOTE Understanding your company’s vision and goals while conducting a maturity
assessment helps identify gaps and enables you to create an AI transformation plan. This
in turn helps you move from pilot to production and make AI a core part of your business
strategy.
Change management activities: You have initiated some change-management activities to get leadership alignment and communicated the AI vision to the key stakeholders to get their support. You have also established a smaller version of a COE, a group of A-class players from various departments working on the pilot as a core team.
Talent management: You also have to address any gaps in the talent required to implement the pilot, such as data scientists, machine learning engineers, or cloud engineers who will be required to establish the platform.
ML platform: From an ML platform perspective, you have identified the steps required to set up the AI platform and any gaps that need to be addressed during the launch.
Data capability: You have assessed the current state of the data, identified any data gaps, and determined the data infrastructure required to collect, ingest, clean, and process the data to feed the models.
Governance: Lastly, from a governance perspective, you have procured funding for this launch by working with a sponsor, ideally an executive who understands the value of the technology, is behind this launch, and is looking forward to ensuring its success.
Ethical considerations: You should consider the principles to be factored into the design of the models to ensure a trustworthy, ethical, and socially responsible implementation of your AI model.
[Figure: Launch your ML project: identify initiatives; initiate change management initiatives; form a small COE team.]
NOTE After completing the AI maturity model assessment, use the insights to develop a
roadmap and project plan for your AI launch.
Note that these phases are not necessarily sequential and that there are feedback loops between some phases, such
as problem definition and monitoring, data preparation and model building, and model development and model
monitoring.
The next sections briefly examine each of these phases shown in Figure 10.3, discuss in greater detail the business
goal, and frame the machine learning problem.
Identify the business problem: First, you should have a clear idea of the problem you're trying to solve and the business value you get by solving that problem.
Measure the business value: You should be able to measure the business value against specific business objectives and success criteria. The business objective is what you want to achieve, while the business value is what you expect to gain at the end of this implementation.
Identify the machine learning problem type: Your primary focus is to translate your business problem into a machine learning problem. You try to identify the type of ML problem, such as whether it is a classification problem, regression problem, anomaly detection problem, and so on.
Identify the label: Given a dataset, you will identify one of the attributes as the label, which is the attribute that needs to be predicted.
Define the performance and error metrics: Focus on optimizing the performance and error metrics.
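For a binary problem such as churn prediction, the performance metrics you commit to during framing can be stated precisely. This toy sketch computes accuracy, precision, and recall from scratch; real projects typically rely on a library such as scikit-learn.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard against /0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Example: churn predictions (1 = churned) for six customers
acc, prec, rec = classification_metrics([1, 0, 1, 1, 0, 0],
                                        [1, 0, 0, 1, 1, 0])
```

Which metric to optimize depends on the business problem: for fraud detection, a missed fraud (low recall) is usually costlier than a false alarm (low precision).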
Follow a defined machine learning lifecycle that includes defining business goals, framing the problem as a
machine learning problem, data processing, model building, model deployment, and model monitoring, with
iterative feedback loops between some of the phases.
Data Processing
Data processing is covered in greater detail in Chapter 15. Data must be fed into the model in the correct format during the build process.
Data collection and preparation: This involves identifying the different sources for the data, both internal and external, as well as collecting data from multiple sources. You must also preprocess the data, for example by addressing missing data, formatting data types, and handling any anomalies or outliers.
Feature engineering: Feature engineering involves creating new data, transforming data, and extracting and selecting a subset of features from the existing set.
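The preprocessing and feature engineering steps above can be sketched in a few lines. This toy example fills missing values, clips outliers, and derives a new ratio feature; the field names are illustrative, and production pipelines would use tools such as pandas or a feature store instead.

```python
from statistics import mean

def preprocess(records, numeric_field):
    """Fill missing values with the field mean, clip to the observed
    range, and derive a ratio feature. Field names are hypothetical."""
    values = [r[numeric_field] for r in records if r[numeric_field] is not None]
    fill = mean(values)
    lo, hi = min(values), max(values)
    cleaned = []
    for r in records:
        v = r[numeric_field] if r[numeric_field] is not None else fill
        v = max(lo, min(hi, v))  # clip outliers to the observed range
        # Feature engineering: create a new derived feature
        cleaned.append({**r, numeric_field: v,
                        "spend_per_visit": v / max(r["visits"], 1)})
    return cleaned

rows = preprocess(
    [{"monthly_spend": 120.0, "visits": 4},
     {"monthly_spend": None, "visits": 2},
     {"monthly_spend": 300.0, "visits": 10}],
    "monthly_spend",
)
```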
Model Development
Model development is covered in greater detail in Chapter 17. Model development involves the following:
Model building, tuning, and evaluation: Building, testing, tuning, and evaluating the model before it can be deployed into staging and production.
Setting up CI/CD pipelines for automatic deployment: Setting up a CI/CD pipeline for automatic deployment into staging and production environments based on a set of criteria.
NOTE A CI/CD pipeline is a set of tools that automate the process of software deployment
from source code control to production deployment.
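The "set of criteria" that gates automatic deployment is often just a comparison of evaluation metrics against agreed thresholds. Here is a minimal, illustrative sketch of such a promotion gate; the metric names and thresholds are hypothetical.

```python
def should_promote(metrics, thresholds):
    """Return True when every metric meets or beats its threshold.

    A simplified stand-in for the criteria a CI/CD pipeline might
    evaluate before promoting a model to staging or production.
    """
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

promote = should_promote({"accuracy": 0.93, "recall": 0.88},
                         {"accuracy": 0.90, "recall": 0.85})
hold = should_promote({"accuracy": 0.93, "recall": 0.80},
                      {"accuracy": 0.90, "recall": 0.85})
```

In practice, a pipeline tool (Azure Pipelines, GitHub Actions, SageMaker Pipelines, and the like) would run this check as a stage and block deployment when it returns False.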
Model Deployment
Once the model has been tested, tuned, evaluated, and validated, it is ready to be deployed into production to
make inferences and predictions. Chapter 18 covers this.
Model Monitoring
Model monitoring involves continuously monitoring the model’s performance so that it performs well against a
particular set of desired parameters and proactively detects and mitigates performance degradation. Chapter 19
covers this.
TIP When a model's performance degrades, it is important to retrain the model on new data that is representative of the current data distribution.
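Drift detection can start very simply, for example by alerting when the mean of recent inputs moves too far from the training-time baseline. The sketch below is deliberately basic and illustrative; production monitoring typically uses richer tests such as the population stability index or the Kolmogorov-Smirnov test.

```python
from statistics import mean, stdev

def mean_shift_alert(baseline, current, z_threshold=3.0):
    """Flag drift when the mean of a recent batch moves more than
    z_threshold baseline standard deviations from the training-time
    mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0.0:
        return mean(current) != mu
    return abs(mean(current) - mu) / sigma > z_threshold

baseline = [10.0, 11.0, 9.5, 10.5, 10.0]   # feature values at training time
drifted = mean_shift_alert(baseline, [15.0, 16.0, 15.5])
stable = mean_shift_alert(baseline, [10.2, 9.9, 10.4])
```

An alert like this would typically trigger the retraining workflow described in the tip above rather than an immediate rollback.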
Your ultimate goal is to implement AI effectively and scale it across the enterprise. However,
you need to start somewhere, and that’s the goal of this exercise—to help you plan and launch
your pilot project.
SUMMARY
This chapter covered the various steps you take as you move from the Align phase to the Launch phase. The
chapter discussed how to prepare for the launch of your machine learning project. It discussed the different
phases of a typical machine learning project: business goals, problem framing, data processing, model development, deployment, and monitoring. The chapter also included a hands-on exercise to plan and launch an AI pilot.
In the next chapter, let’s discuss putting into motion some organizational change management steps to ensure
your people wholeheartedly embrace your AI plans.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. The focus of the core team implementing the AI pilot is
A. To establish the platform
B. To implement AI initiatives
C. To conduct a maturity model assessment
D. To establish a COE
2. Why do you need a label in machine learning problem framing?
A. To make predictions.
B. To evaluate the model.
C. It identifies the type of machine learning problem.
D. It is a feature created during feature engineering.
3. What is the primary focus of the machine learning problem framing phase in the machine learning
lifecycle?
A. Data processing
B. Model definition
C. Defining the business goal
D. Identifying the type of machine learning problem
4. Feature engineering is used
A. To evaluate the model’s features
B. To collect and preprocess features
C. To create new data and transform existing data
D. To identify the business data
5. Which phase involves translating a business problem into a machine learning problem?
A. Data processing
B. Model deployment
C. Machine learning problem framing
D. Business goal identification
ANSWER KEY
1. A 3. D 5. C
2. A 4. C
PART IV
Building and Governing Your
Team
People make magic happen! Part IV explores the organizational changes required to empower your
workforce. I guide you through the steps to launching your pilot and assembling your dream team. It’s all
about nurturing the human side of things.
11
Empowering Your People Through Org Change Management
Culture eats strategy for breakfast.
—Peter Drucker
This chapter is a pivotal step in the AI implementation methodology. By now, you have a good grip on the
AI cloud ecosystem, use cases, and AI strategy. But here comes the crucial part: ensuring your AI plans are
wholeheartedly embraced by your people (see Figure 11.1).
[Figure: the AI implementation methodology wheel, highlighting Build and Govern Your Team. Objective: Foster a transformative organizational culture that champions AI, prioritizing people-centered approaches to ensure seamless AI adoption and alignment with your business goals.]
FIGURE 11.1: BUILD AND GOVERN YOUR TEAM: Empower Your People Through Org Change Management
The chapter starts by exploring how to evolve your company culture to foster AI adoption, bring innovation, and
enable meaningful change. Knowing that technology alone won’t cut it, the methodology delves into strategies to
redesign your organization to enable agility and renewed thinking around adopting AI at the core. It is built on
three strategies.
➤➤ Evolving the culture
➤➤ Redesigning the organization
➤➤ Aligning the organization
This is where the key deliverables—such as the cultural evolution program and organizational redesign plan
discussed in this chapter—come in.
You need to get your entire organization aligned, starting from top executives to frontline staff, around AI adop-
tion. In a world where change is constant, mastering org change management is a must. This chapter helps you
with the insights and tools needed to lead change confidently so that your AI initiatives not only survive but
thrive. Let’s get started.
As you can see, you need to take care of the people when implementing the cloud and AI at the enterprise level.
In the following two sections, you see how to systematically address people aspects using a comprehensive change
management methodology. As part of this methodology, I address the objectives, tasks, and deliverables for these
capabilities, such as cultural evolution, org redesign, org alignment, change acceleration, leadership transformation, workforce transformation, and cloud/AI upskilling.
[Figure: People enablement for enterprise AI: Organizational alignment (ensure technology change is driven by business to drive business outcomes); Organizational design (design your organization structure to implement AI strategy); Cloud fluency (develop and hire the skills and knowledge to use the cloud); AI fluency (develop and hire the skills and knowledge to use AI technology).]
Note that you would also need to implement change management techniques when choosing AI/ML platforms and services. AI platforms come with their own training modules, may impact daily workflows, and can act as catalysts for collaboration. AI may also require additional security controls. It is therefore important to get the stakeholders' buy-in when adopting a new AI/ML platform or service and to adopt change management practices that enable growth and development.
Regardless of the type of stress, the fact is that it is going to impact their performance. It is, therefore, important to assess the organization's current culture and how its people react to change. Based on this assessment, you must take a number of steps, as shown in Figures 11.4 and 11.5. The focus of these steps is to bring about change that enables innovation, align people on transformational values, and build an open culture that encourages risk-taking. You should also provide resources to build capability and foster collaboration between teams.
TIP If you manage your company culture well, you will see greater employee participation,
higher motivation levels, and increased productivity, which translate into innovation and
customer satisfaction.
[Figure: Evolve your culture: implement a culture evolution plan to target employee groups through training programs, workshops, and mentoring; monitor the effectiveness of the culture evolution plan and make necessary improvements to increase its effectiveness.]
[Figure: Organizational redesign activities with use case examples: you may realize that your company is very hierarchical and siloed; to solve the issue of hierarchy and lack of collaboration, you may decide to implement cross-functional teams with appropriate roles and responsibilities; you may create a new AI division with a dedicated team of data scientists, machine learning engineers, and data analysts; the company reviews the progress so far and makes further changes to the structure, jobs, and roles.]
[Figure: Job redesign steps: assess the current job descriptions and role profiles required for AI adoption; create the plan for new job descriptions and role profiles for the new AI adoption; implement the new job structure and role profiles; evaluate the success of the new jobs and make changes as required.]
NOTE Chapter 5 titled “Envisioning and Aligning Your AI Strategy” discussed aligning the
organization around the AI vision and strategy. That theme continues to play throughout the
book. For example, the people-centric alignment plays an even more vital role at the ground
level when it comes to choosing an AI/ML platform or service. It only continues to increase
in importance as you scale your AI capabilities. You learn about this more in Part VIII of
this book.
Figure 11.7 shows different components that need to be aligned as part of ensuring organizational alignment.
This exercise gives you tangible experience in dealing with org change management practices in
the context of AI adoption.
SUMMARY
This chapter explained how companies can empower their people through organizational change management to
build a competitive advantage. The key takeaway from this chapter is that technology won’t be enough to bring
change and innovation at the enterprise scale.
The chapter started by examining the different cultural tensions that might arise during major transformational efforts and devising strategies to address them. It provided step-by-step instructions to foster innovation, risk-taking, collaboration, and alignment with transformational values.
Remember that it’s important to align your organization at all levels to ensure that strategy, processes, systems,
culture, and teams work together to achieve the organizational goals. Prioritizing people and culture to drive
organizational change, continuous learning, growth, and innovation will translate into business success and customer satisfaction.
In the next chapter, let's discuss building a high-performing team that can kickstart your AI transformation effort.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. Which of the following is not the only factor in the success of an AI project?
A. Leadership support
B. Employee training and development
C. Adequate resources
D. Change management plan
2. What are the challenges of evolving your culture for AI adoption, innovation, and change?
A. Employees are afraid of adopting modern technologies.
B. Employees lack the necessary skills.
C. Employees are resistant to change.
D. It is difficult to measure the ROI of AI projects.
3. Which of the following is NOT one of the eight major areas of focus to build your competitive advantage in an AI initiative?
A. Strategy and planning
B. People
C. Data
D. Location
E. Platforms
F. Operations
G. Security
H. Governance
4. What did the case study company identify as an obstacle to the adoption of cloud and AI technologies?
A. Lack of funding
B. Hierarchical and siloed organizational structure
C. Lack of market demand
D. Technological incompatibility
ANSWER KEY
1. D 3. D 5. B
2. A, B, C 4. B
NOTE
1. www.gartner.com/en/corporate-communications/trends/diagnosing-cultural-tensions#:
~:text=During%20times%20of%20significant%20change,and%20helping%20employees%20build%20
judgment.
12
Building Your Team
Coming together is a beginning, staying together is progress, and working together is success.
—Henry Ford
In this chapter, you embark on the fundamental task of building a collaborative team essential to kickstarting your AI project. The importance of assembling a synergistic team cannot be overstated. This
chapter delves into the various AI/ML roles and responsibilities involved in an AI implementation, be it
the core roles such as data scientists and machine learning engineers or the equally crucial supporting roles
such as security engineers.
Given that AI is taking center stage and evolving so rapidly, even noncore roles are quickly becoming core.
What is essential, though, is to tailor these roles according to your specific business, project, and company
requirements, and that’s where this chapter comes in. You should also get good at transforming traditional
roles, such as system administrators and business analysts, to harness your company’s existing knowledge
base and expertise. This, coupled with judiciously tapping into external skill sets, can bridge any looming
skills gap.
Building a team is almost an art; the goal is to assemble a diverse, cross-functional team with clear roles
and responsibilities to drive effective AI transformation tailored to business objectives. See Figure 12.1.
This dream team will go on to steward your AI operations, starting from data processing in Chapter 15,
model monitoring in Chapter 19, and managing the ethics of AI in Chapter 20.
[Figure: the AI implementation methodology wheel, highlighting Build and Govern Your Team. Objective: Assemble a diverse, cross-functional team with clear roles and responsibilities to drive effective AI transformation tailored to business objectives.]
FIGURE 12.1: BUILD AND GOVERN YOUR TEAM: Building your team
When implementing your first enterprise-grade AI/ML project, you will face some challenges in identifying the
right resources. This is because the ML platform has many components, and you need to bring together skill sets
from multiple organizational units in your company. It includes data science, data management, model building,
governance, cloud engineering, functional, and security engineering skills.
TIP Create a cross-functional team with the appropriate skill sets, while also adapting exist-
ing job roles as needed to align with your company and project goals.
These roles are not always strictly defined, and there is often overlap between them. I recommend you use this as a guideline but then customize it to align with your specific needs, based on the skill sets that you have in your organization and the hiring plan that you may have to initiate to fill the gaps.
NOTE Most AI initiatives have a higher risk of failing if they don’t have the right skills.
The following sections review the new roles in the cloud-based AI scenario. Let’s first look at the core team and
then the extended teams they need to work with. Figure 12.2 summarizes the roles involved in a typical machine
learning project.
The actual composition of your team and the definition of core versus auxiliary depends on your
specific use case and project needs.
Core AI Roles
Several roles are required on an AI project. Depending on the type of AI project, some roles may be critical for a
project’s success and in other cases may not. In general, the following roles can be considered core roles. Use this
for guidance only, as the actual roles may vary based on your specific needs.
➤➤ AI architect
➤➤ Data scientist
➤➤ Data engineer
➤➤ Machine learning engineer
➤➤ MLOps engineer
AI Architect
The AI architect is an emerging role primarily focused on defining the transformational architecture that AI introduces. You can think of AI architects as the glue between the business stakeholders, the data engineers, the data
scientists, the machine learning engineers, DevOps, DataOps, MLOps, and others involved in an AI initiative.
They are responsible for the following:
TASK DESCRIPTION
Designing the overall AI strategy: Identifying the AI application opportunities and use cases, as well as the underlying business and technical requirements to enable those use cases. Includes developing the AI roadmap.
Designing the AI architecture: Selecting the right AI technologies and ensuring the right data quality and security.
Defining the overall framework for building and deploying AI applications: Ensuring that the new AI technology is integrated with the existing business processes and applications.
Data Scientist
At its core, the data scientist role is about taking in data and refining it by creating and running models and
drawing insights to make business decisions. It is a critical role to ensure that the project is successful. Defining
the problem is one of the first things this role must do by working with the business stakeholders.
Data Engineer
The data engineer works closely with the data scientist, to whom they provide the data. They focus primarily on
the following:
➤➤ Data acquisition, integration, modeling the data, optimization, quality, and self-service
➤➤ Ensuring that the data pipeline is appropriately implemented and is working satisfactorily so that the
models get the correct data in the proper format and quality
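If you are following along in code, the pipeline-quality responsibility above can be sketched as an automated gate. The field names and thresholds below are hypothetical placeholders; a real data engineer would tailor them to the actual schema:

```python
def check_record(record, required_fields=("customer_id", "event_time", "amount")):
    """Return a list of quality issues found in a single record."""
    issues = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    # Type/range check: amounts must be numeric and non-negative
    amount = record.get("amount")
    if amount is not None and (not isinstance(amount, (int, float)) or amount < 0):
        issues.append("invalid amount")
    return issues

def quality_gate(records, max_bad_ratio=0.05):
    """Fail the pipeline run if too many records have issues."""
    bad = sum(1 for r in records if check_record(r))
    ratio = bad / len(records) if records else 0.0
    return {"bad_records": bad, "bad_ratio": ratio, "passed": ratio <= max_bad_ratio}
```

A gate like this would typically run as a pipeline step, blocking downstream model training when too many records fail validation.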
MLOps Engineer
The MLOps engineer builds and manages automation pipelines to operationalize the ML platform and ML
pipelines for fully/partially automated CI/CD pipelines. These pipelines automate building Docker images, model
training, and model deployment. MLOps engineers also have a role in overall platform governance, such as data/
model lineage, infrastructure, and model monitoring.
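As a rough illustration of what these pipelines automate, the following toy orchestrator runs the stages in order (image build, training, deployment) and stops at the first failure. The stage functions and the 0.9 quality gate are invented for the example; real pipelines delegate each stage to CI/CD tooling or a workflow engine:

```python
# Illustrative only: a toy orchestrator showing the stage ordering an MLOps
# pipeline automates. Real pipelines delegate each stage to CI/CD tooling
# or a managed workflow service.
def run_pipeline(stages, context):
    """Run (name, callable) stages in order; stop at the first failure."""
    log = []
    for name, stage in stages:
        ok = stage(context)
        log.append((name, ok))
        if not ok:
            break
    return log

def build_image(ctx):
    ctx["image"] = "model:latest"   # stand-in for a Docker image build
    return True

def train_model(ctx):
    ctx["model_metric"] = 0.91      # stand-in for a training job's metric
    return True

def deploy_model(ctx):
    # Hypothetical quality gate: deploy only if the metric clears 0.9
    return ctx.get("model_metric", 0) >= 0.9

stages = [("build", build_image), ("train", train_model), ("deploy", deploy_model)]
```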
Cloud Engineer
Cloud engineers aim to ensure the cloud infrastructure is secure, reliable, and cost-effective. They are responsible
for the following:
Setting up the cloud infrastructure: This includes configuring virtual machines and setting up databases and storage. It may also include configuring cloud-based services such as Amazon SageMaker, Azure Machine Learning Studio, and the Google Cloud Platform.
Understanding the Roles and Responsibilities in an ML Project ❘ 177
Setting up the accounts and working with security: This includes creating and managing user accounts, setting up access controls, and working with security teams to ensure the infrastructure is secure.
Managing the security and access controls: This includes monitoring and enforcing security policies, managing user permissions, and responding to security incidents.
Monitoring the performance of the system: This includes monitoring resource usage, identifying bottlenecks, and implementing cost-saving strategies.
Business Analyst
Business analysts engage early in the ML lifecycle and translate business requirements into actionable tasks. Their
primary role is to bridge the gap between business stakeholders and the technical team.
Data scientists are responsible for collecting, cleaning, and analyzing data. They use their knowledge of statistics and machine learning to develop models that can be used to make predictions or decisions. Business analysts are responsible for understanding the business needs and requirements for an AI/ML project. They work with data scientists to ensure the project meets the business's goals.
Data scientists typically have strong backgrounds in mathematics, statistics, and computer science. Business analysts typically have strong backgrounds in business, economics, and communication.
Data scientists are typically more focused on the technical aspects of AI/ML, while business analysts are typically more focused on the business aspects. However, both roles are essential for the success of an AI/ML project.
Business analysts need to understand the business requirements and translate them into technical specifications
that the developers and data scientists can use to build the AI/ML system. They also need to analyze the data and
provide insights to the stakeholders on how the system is performing and how it can be improved.
Let’s say you are working as a business analyst on an AI/ML project for a telecom company. One of
the company’s fundamental business problems is customer churn when customers cancel their
subscriptions and switch to a competitor’s service. The company wants to use AI/ML to predict which
customers will likely churn so they can take measures to retain them.
As a business analyst, your first step is to gather the requirements from the stakeholders. You need to
understand what data sources are available, what business rules need to be applied, and what the
system’s output should look like. You then work with the data scientists and technical team to design
the system architecture and define the data processing pipeline.
Once the system is built, you must work with the data scientists to analyze the data and provide
insights to the stakeholders. You need to monitor the system's performance and recommend improvements. For example, if customers who experience network issues are more likely to churn, you might recommend that the company invest in improving its network infrastructure in those areas.
178 ❘ CHAPTER 12 Building Your Team
Overall, as a business analyst in an AI/ML project, your role is crucial in ensuring that the AI/ML system is
aligned with the business objectives and that the stakeholders are informed about the system’s performance and
potential improvements.
Domain Expert
Domain experts have valuable functional knowledge and understanding of the environment where you implement
the ML solution. They do the following:
Requirements gathering: Help ML engineers and data scientists develop and validate assumptions and hypotheses.
Design involvement: Engage early in the ML lifecycle and stay in close contact with the engineering owners throughout the Evaluation phase.
Industry compliance: Provide valuable insights and help ensure the project adheres to industry standards and best practices.
Domain experts have a deep understanding of their respective fields. Their expertise helps projects succeed, saving significant time and money and minimizing reputational losses, which makes a domain expert critical for any project requiring specialized knowledge.
Domain experts are especially important for healthcare, finance and banking, retail and e-commerce, manufacturing, energy, agriculture, real estate, and so on.
NOTE Not having a domain expert can result in models being created in a vacuum, thus
missing critical industry requirements and resulting in poor model performance.
Data Analyst
A data analyst is a professional who does the following:
Data collection and analysis: Collects, cleans, analyzes, and draws insights from the data.
Business interpretation: Helps interpret that data for the business stakeholders to make business decisions.
Exploratory data analysis: Conducts exploratory data analysis to identify patterns and trends and uses data visualization tools to help interpret the data for the business stakeholders.
Collaboration with data scientists: In the context of enterprise AI, may also work with data scientists to select and preprocess data.
Model testing and validation: Tests and validates the models to ensure they are reliable and accurate.
NOTE Data scientists have a broader scope and dive deep into advanced computations and
predictive analytics, while data analysts are focused on deriving insights from data to guide
business decisions. In small organizations, the roles can be played by the same person.
Responsible deployment: Ensures the model is deployed responsibly and ethically, protecting the company's brand image and preventing reputational risk.
Data management: Ensures the data used to train the model is free from bias, accurate, and complete.
Model testing and validation: Ensures the model is adequately tested and validated before deployment in production by testing it against several scenarios so that it is robust and can perform well even in unexpected situations.
Compliance and transparency: Ensures the model is transparent, explainable, and compliant with legal requirements by working with legal and compliance teams.
Security Engineer
The role of the security engineer is to ensure the security of the AI systems end to end, spanning the data, the models, the network, and the infrastructure. They need to do the following:
Implement security controls: Do so at all layers of the AI system, across the data preparation, model training, testing, and deployment stages.
Risk management: Identify and assess the security risks, implement controls to mitigate risks, and assess the effectiveness of those controls.
Incident response: Respond to security incidents such as ransomware attacks, address data breaches in real time, identify the root cause, and implement steps to remediate those risks.
Legal compliance: Ensure that the AI infrastructure is legally compliant concerning data privacy and security by working with legal and risk teams.
Model security: Manage the security of machine learning models by ensuring no tampering with training data and no bias introduced into the data.
Software Engineer
The role of a software engineer is to do the following:
Integration and interface development: Integrate the AI algorithms and their output into other back-end applications, and develop front-end user interfaces for users to interact with the models.
Performance and security: Ensure these solutions are performant, reliable, scalable, and secure.
Maintenance: Continue monitoring the AI solution in production, identify and fix bugs, and add new features.
Model Owners
The model owner is a new role primarily responsible for the model’s development, maintenance, governance, and
performance. The model owner is typically a technical person who works with the business stakeholders. They
need to do the following:
Safety and responsibility: Ensure that the model can perform safely and responsibly.
Business problem understanding: Understand the business problem the model is supposed to solve.
Performance metrics definition: Define the performance metrics by which the model will be evaluated.
Data selection and bias checking: Select the correct data to train the model on and ensure there is no bias in the data.
Algorithm selection and deployment: Work with the data scientists to choose a suitable algorithm, configure it, and deploy it in production.
Model monitoring: Monitor the model's performance to ensure it is working accurately, with fairness and transparency.
Model maintenance: Maintain the model by incorporating new data, retraining it, and addressing any issues.
Model Validators
Model validators are responsible for the following:
Model testing and validation: Test and validate the models to ensure they work accurately and responsibly while meeting the business's objectives and requirements.
Error and bias check: Ensure that the models work without errors or bias and with fairness, which may include designing tests for various scenarios and conducting statistical analysis to check for biases or limitations in the model.
Legal compliance and documentation: Ensure that the models comply with legal requirements, and capture documentation about how these models work so it can be shared with other stakeholders, auditors, and senior management.
IT Auditor
The IT auditor is responsible for analyzing system access activities, identifying anomalies and violations, preparing audit reports for audit findings, and recommending remediations. Their areas of focus include model governance, data integrity and management, AI application security, ethical considerations, AI system performance and scalability, and continuous monitoring.
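Parts of this analysis can be automated. As a simple illustration, the sketch below flags users whose failed-access counts exceed a threshold; the log format and threshold are hypothetical, and real audit tooling would apply far richer rules:

```python
from collections import Counter

def flag_access_anomalies(access_log, max_failures=3):
    """Flag users whose failed-access count exceeds a threshold (illustrative).

    Each log entry is assumed to be a dict with "user" and "success" keys.
    """
    failures = Counter(entry["user"] for entry in access_log if not entry["success"])
    # Sorted so the audit report output is deterministic
    return sorted(user for user, count in failures.items() if count > max_failures)
```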
This exercise is to help everyone understand their new roles better and learn how these roles
collaborate with each other.
SUMMARY
The chapter discussed the roles and responsibilities you need to fill for your project to take off and ensure the
efficient development, deployment, and management of your AI systems. It also discussed the core roles and other
noncore roles that you need to fill depending on the specific needs of your project. The discussion of these roles
should help you build a blueprint for assembling a team of professionals who can navigate the complexities of AI
development by taking on responsibilities and driving innovation. Remember to customize these roles according
to your company’s needs.
Understanding these roles and how they interact with each other will help you plan the team structure, communicate the vision of the AI system, and ensure that the AI system is robust, secure, fair, and legally compliant.
In the next chapter, we will discuss setting up a robust enterprise AI cloud platform that will help you take your
model from ideation to production quickly with the least amount of risk.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. The primary responsibility of a machine learning engineer is
A. Running models and drawing insights
B. Ensuring the data pipeline is appropriately implemented
C. Designing the overall AI strategy
D. Deploying models into production
ANSWER KEY
1. D 3. C 5. C
2. C 4. C
PART V
Setting Up Infrastructure and
Managing Operations
In this part, you roll up your sleeves and get technical. Part V is like your DIY guide to building your own
AI/ML platform. Here, I discuss the technical requirements and the daily operations of the platform with a
focus on automation and scale. This part is a hands-on toolkit for those who are hungry to get geeky.
13
Setting Up an Enterprise AI
Cloud Platform Infrastructure
Design is not just what it looks like and feels like. Design is how it works.
—Steve Jobs
In this chapter, you get your hands dirty by setting up an enterprise AI cloud platform infrastructure (see Figure 13.1). By now, you should have a good idea of the strategic vision, have aligned with the stakeholders on the AI/ML use cases and the business metrics, implemented change management processes, built a team, and may even have a good idea of the ML algorithm that you want to adopt.
FIGURE 13.1: The Enterprise AI Journey Map. Step 04, Set Up Infrastructure and Manage Operations, covers setting up the enterprise AI cloud platform infrastructure and operating your AI platform with automation for scalability and speed. Objective: design and implement a robust, scalable AI cloud infrastructure tailored to specific use cases, while weighing the key considerations of build-vs-buy and multi-cloud strategies.
Building a production-ready, robust enterprise AI cloud platform infrastructure is important because it helps you
take your models from ideation to production in the shortest possible time and with the least risk possible.
This chapter is not just about the theory; it’s about integrating AI into business processes to automate, reduce
costs, increase efficiency, and create new opportunities. To do this, you must create efficient models and build a
repeatable, robust, stable, and secure enterprise AI system adaptable to changing conditions. The right infrastructure will serve as a solid foundation for serving models with good performance, scalability, and reliability and as a launchpad to transition from pilot projects to full-blown AI operations. See Figure 13.1.
In this chapter, you
➤➤ Review the typical reference architectures (AWS, Azure, and the Google Cloud Platform) for the most
common use cases
➤➤ Review the general components of an ML platform
➤➤ Learn how to choose between building and buying a ready-made ML platform
➤➤ Learn how to choose between different cloud providers
➤➤ Review hybrid and edge computing and the multicloud architecture
1. Data sources: Ingest the data from multiple sources into a single data source. Data can come from SAP, ERP, CRM, websites, Salesforce, Google Analytics, and so on. (AWS Data Migration Service, AWS Glue)
Reference Architecture Patterns for Typical Use Cases ❘ 189
4. Activation layer: Use the insights from the intelligence layer and enrich them with recommendations or predictions from AI/ML to create the next best action and personalize customer experiences. (AWS Lambda, Amazon Personalize)
[Figure: Customer 360-degree architecture. Data sources (CRM, ERP, SFDC, social media, mobile, POS, MDM, sensors, chat, call center, images, videos, emails, and third-party data such as weather, futures, and commodities) feed a data ingestion layer (REST API, streaming, ETL, FTP) into connected data stores (data lake, data warehouse, feature store, graph database). A data harmonization layer (catalog, data discovery, data lineage, data quality, data lifecycle) supports customer 360 processing (transforms, aggregation, deduping, entity linking, ontology creation, refinement), which feeds the intelligence/consumption layer (SQL query, data warehouse, business intelligence, AI/ML) and the activation layer (digital channels, call center, data exchange). Cross-cutting services include identity and access management, key management, secrets management, logging, auditing, and the CI/CD stack.]
Figure 13.3 shows an implementation of the customer 360-degree architecture using AWS components. You should be able to build a similar architecture using Google or Azure Cloud components.
NOTE The customer 360-degree architecture has four steps: data ingestion, knowledge
graph system, intelligence layer for insights, and activation layer for personalized customer
experiences.
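The harmonization step (deduping and entity linking) can be sketched in a few lines. Matching on a normalized email address is a simplifying assumption made here for illustration; production entity resolution uses much richer matching rules:

```python
def normalize_email(email):
    """Normalize an email address so trivially different spellings match."""
    return email.strip().lower() if email else None

def dedupe_customers(records):
    """Merge customer records that share a normalized email (illustrative key)."""
    merged = {}
    for rec in records:
        key = normalize_email(rec.get("email"))
        if key is None:
            continue  # records without a linking key are skipped in this sketch
        profile = merged.setdefault(key, {})
        # Later sources fill in fields earlier sources left empty
        for field, value in rec.items():
            if value and not profile.get(field):
                profile[field] = value
    return list(merged.values())
```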
FIGURE 13.3: Customer 360-degree architecture using AWS components (AWS components shown in bold)
This is an experiential learning exercise that allows the team members to put their theoretical knowledge of AWS services into practice and acquire AWS skills.
Set up data sources (Data engineer): Create mock data that represents different data sources (CRM system, web analytics, etc.). Deliverable: mock data files.
Data ingestion (Data scientist): Ingest the data from multiple sources into a single data source on AWS using AWS Data Migration Service or AWS Glue. Deliverable: AWS Data Migration Service or AWS Glue configuration.
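If you choose AWS Glue for the ingestion task, a crawler can catalog the mock data files. The sketch below uses the real boto3 `create_crawler`/`start_crawler` calls, but the crawler name, IAM role, database, and S3 path are hypothetical, and the AWS calls are kept out of the pure payload-building helper:

```python
def crawler_config(name, role_arn, database, s3_path):
    """Build the request payload for an AWS Glue crawler over an S3 path."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_start_crawler(cfg):
    # Requires AWS credentials; boto3 is imported lazily so the payload
    # helper above stays testable without AWS access.
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**cfg)            # register the crawler
    glue.start_crawler(Name=cfg["Name"])  # kick off cataloging of the mock data

if __name__ == "__main__":
    cfg = crawler_config(
        name="mock-sources-crawler",                        # hypothetical names
        role_arn="arn:aws:iam::123456789012:role/GlueRole",
        database="customer360",
        s3_path="s3://my-mock-data-bucket/raw/",
    )
    create_and_start_crawler(cfg)
```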
FIGURE 13.4: Event-driven near real-time predictive analytics using IoT data (AWS components
shown in bold)
FIGURE 13.5: IoT-based event-driven predictive analytics using AWS components (AWS components
shown in bold)
Figure 13.5 shows the IoT-based event-driven architecture using AWS components. The following table shows the steps involved in setting up an Internet of Things (IoT)-based architecture that facilitates an end-to-end journey, from data collection to insight generation.
Data collection: Data from IoT devices, such as medical devices, car sensors, and industrial equipment, is collected. (AWS Greengrass)
Data ingestion: The collected data is then ingested into the cloud platform. (AWS IoT Core, AWS IoT SiteWise)
Stream transformation: The streamed data from these IoT devices is transformed using data from a data warehouse. (Amazon Kinesis Data Analytics)
Data storage: The transformed data is then stored for further downstream analysis. (Amazon Redshift)
Model training and deployment: The model is trained and deployed to generate inferences. (Amazon SageMaker)
Event triggering: Inferences from the models trigger certain events. (AWS Lambda)
TIP IoT architecture leverages streaming data from IoT devices to make real-time decisions
for use cases such as personalized healthcare, autonomous cars, and equipment monitoring.
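The ingestion step in this architecture can be sketched with boto3. The stream name is hypothetical, and the AWS call is isolated in its own function so the encoding helper stays testable; using the device ID as the partition key keeps each device's readings ordered:

```python
import json

def encode_reading(device_id, metric, value, timestamp):
    """Serialize one IoT reading into the bytes payload Kinesis expects."""
    payload = {
        "device_id": device_id,
        "metric": metric,
        "value": value,
        "timestamp": timestamp,
    }
    return json.dumps(payload).encode("utf-8")

def send_reading(stream_name, device_id, metric, value, timestamp):
    # Requires AWS credentials; boto3 imported lazily to keep helpers testable.
    import boto3
    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=stream_name,   # e.g., a hypothetical "iot-sensor-stream"
        Data=encode_reading(device_id, metric, value, timestamp),
        PartitionKey=device_id,   # readings from one device stay ordered
    )
```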
This is an experiential learning exercise that allows the team members to design and implement an event-driven architecture that collects, processes, and analyzes data from IoT devices in real time.
Prerequisites: You need a basic knowledge of AWS services, IoT devices, and some data science concepts.
Tasks: Use the steps outlined in the previous table to build this architecture. The details of building this architecture are outside the scope of the book, but it would be a good starting point if your use case involves processing IoT data.
[Figure: Personalized recommendation flow. (1) Data files are uploaded to a data lake (Amazon S3); (2) a preprocessor (AWS Glue DataBrew) outputs preprocessed data to a dataset; (3) the personalize engine (Amazon Personalize) creates models (campaigns); (4) a recommended list feeds an event trigger (AWS Lambda); (5) the client gets real-time recommendations through an API gateway (Amazon API Gateway).]
FIGURE 13.7: Personalized recommendation architecture using AWS components (shown in bold)
The second step involves training the model in Amazon Personalize to make real-time personalization recommen-
dations. In this step, you use three types of data.
➤➤ Event data, such as the items the user clicks on, purchases, or browses
➤➤ Item data, such as the price, category, and other information in the catalog
➤➤ Data about the users, such as the user’s age, location, and so on
The next step is to feed this data into the Amazon Personalize service and get a personalized private model hosted
for you through the Amazon API Gateway. You can then use this API to feed the recommendations to your users
in your business application.
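Once the model (a Personalize campaign) is deployed, fetching recommendations is a single runtime call. The campaign ARN below is a hypothetical placeholder; `get_recommendations` is the actual Amazon Personalize runtime operation, and the response-parsing helper is kept separate so it can be tested without AWS access:

```python
def top_item_ids(response, limit=5):
    """Pull the item IDs out of a Personalize GetRecommendations response."""
    return [item["itemId"] for item in response.get("itemList", [])][:limit]

def recommend_for_user(campaign_arn, user_id):
    # Requires AWS credentials and a deployed Personalize campaign.
    import boto3
    runtime = boto3.client("personalize-runtime")
    response = runtime.get_recommendations(
        campaignArn=campaign_arn,   # hypothetical ARN supplied by the caller
        userId=user_id,
    )
    return top_item_ids(response)
```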
NOTE To train a personalized model, you must gather event data, item data, and user data.
This is an experiential learning exercise that allows the team members to develop a personalized recommendation system using Amazon Personalize.
Prerequisites: You need a basic knowledge of AWS services, data gathering techniques, data
cleaning processes, and some machine learning concepts.
Data gathering (Data engineer): Gather click-view and item-view data from websites and store this data in Amazon S3. Deliverable: data stored in Amazon S3.
Model hosting (Machine learning engineer): Host the trained model via Amazon API Gateway to make it accessible for the application. Deliverable: API endpoint for the model.
[Figure: Real-time customer engagement architecture. Email data sources feed a communications service; email engagement data and customer demographic and subscription data flow through a streaming data pipeline into an S3 data lake with a centralized data catalog.]
The following steps implement the real-time customer engagement reference architecture using the AWS components shown in Figure 13.9:
Upload user data: First, you upload the users and their contact info, such as emails, into Amazon Pinpoint and collect data about customer interactions into S3. (Amazon Pinpoint, Amazon S3)
Ingest data: You then ingest this data in Amazon S3 using either Amazon Kinesis Data Firehose or an Amazon Kinesis data stream. (Amazon S3, Amazon Kinesis Data Firehose, Amazon Kinesis Data Stream)
Train model: You then use this data in S3 to train a model in Amazon SageMaker to predict the likelihood of customer churn or gather customer segmentation data. You can create a SageMaker endpoint once the model is ready for production. (Amazon S3, Amazon SageMaker)
Run inference and export results: You then run the inference in a batch manner and export the results into S3 and Pinpoint. (Amazon S3, Amazon Pinpoint)
Combine data and gather insights: You can combine the data in Pinpoint with other data from your data lake and get insights using Amazon Athena. (Amazon Pinpoint, Amazon Athena)
Visualize data: Finally, you can use Amazon QuickSight to visualize the data and share it with others. (Amazon QuickSight)
[Figure content: email communications (Amazon Pinpoint, Amazon SNS) feed a streaming data pipeline (Amazon Kinesis Data Firehose) into an S3 data lake with a centralized data catalog (AWS Glue); batch transformation and model training (Amazon SageMaker, Jupyter notebooks) support consumption through a SQL query analyzer (Amazon Athena) and BI intelligence (Amazon QuickSight).]
FIGURE 13.9: Real-time customer engagement architecture on AWS (AWS components shown in bold italics)
TIP By using Amazon SageMaker Pipelines, you can automate the steps to build, train, and
deploy machine learning models and save time and effort.
Figure 13.10 shows a customer engagement architecture that you can implement using Azure components such as Azure Cosmos DB.
This is an experiential learning exercise that allows the team members to implement a real-time customer engagement system focusing on use cases such as customer churn and personalized recommendations using customer segmentation.
Prerequisites: You need a basic knowledge of AWS services, IoT devices, and some data science concepts.
Tasks: Use the steps outlined in the table to build this architecture. The details of building this architecture are outside the scope of the book, but it would be a good starting point if your use case involves engaging with customers.
NOTE Use Amazon SageMaker Autopilot to automatically build, train, and deploy machine
learning models for fraud detection.
[Figure: Fraud detection architecture. An API gateway feeds an ML platform and an event management service; the model and data live in a data lake, and a data pipeline loads results into the data lake for consumption through a SQL query analyzer and BI intelligence tools.]
To build the fraud detection architecture using AWS components shown in Figure 13.12, you must first use an AWS CloudFormation template to instantiate an instance of Amazon SageMaker with the sample dataset and train models using that dataset.
The solution also contains an AWS Lambda function that invokes the two Amazon SageMaker endpoints for classification and anomaly detection using the sample data.
Amazon Kinesis Data Firehose loads the transactions into Amazon S3, which is used to make predictions and generate new insights.
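A minimal sketch of that Lambda function follows. The endpoint names, input shape, and response parsing are simplified assumptions for illustration; `invoke_endpoint` is the real SageMaker runtime call, and the rule combining the two model outputs is deliberately naive:

```python
def fraud_decision(fraud_probability, anomaly_score,
                   prob_threshold=0.5, anomaly_threshold=1.0):
    """Illustrative rule: flag when either model is confident enough."""
    return fraud_probability >= prob_threshold or anomaly_score >= anomaly_threshold

def lambda_handler(event, context):
    # Invokes the two SageMaker endpoints described in the text. Endpoint
    # names and the event shape are hypothetical; response parsing is
    # simplified (real endpoints return CSV/JSON bodies to be parsed).
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    body = event["transaction_csv"]  # assumed input shape
    prob = float(runtime.invoke_endpoint(
        EndpointName="fraud-xgboost", ContentType="text/csv",
        Body=body)["Body"].read())
    score = float(runtime.invoke_endpoint(
        EndpointName="fraud-random-cut-forest", ContentType="text/csv",
        Body=body)["Body"].read())
    return {"fraud": fraud_decision(prob, score)}
```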
[Figure content: the API gateway (Amazon API Gateway) feeds the ML platform (Amazon SageMaker, running random cut forest and XGBoost models) and an event management service (AWS Lambda); the model and data are stored in a data lake (S3). A data pipeline (Amazon Kinesis Data Firehose) loads results into S3 for consumption through a SQL query analyzer (Amazon Athena) and BI intelligence (Amazon QuickSight).]
FIGURE 13.12: Fraud detection architecture on AWS (AWS components shown in bold italics)
This is an experiential learning exercise that allows the team members to build a working
fraud detection system that can automatically detect fraudulent transactions in real time.
Prerequisites: You need a basic knowledge of AWS services and some data science concepts.
This is a hands-on experiential learning exercise that allows the team members to make an informed decision on building versus buying an AI/ML platform based on the company's needs.
Using the table, execute the steps to deliver a build-versus-buy recommendation to your leadership team.
Identify Organization's Needs (Business analyst): List the specific requirements and constraints of your organization with respect to an ML platform.
Buy: Pay specific attention to the need for customization and configuration, integrate-ability, and vendor expertise and support. Also factor in cost, scalability, and security requirements.
Build: Consider factors like in-house expertise, cost, integration, scalability, training needs, and the need for control of data.
Deliverable: Comprehensive list of organizational requirements for an ML platform.

Explore Available Options (Data scientist): Explore various ML platforms, both ready-made and open-source options, and evaluate their pros and cons.
Buy: Factor in attributes to evaluate such as features, cost, community support, and security and compliance. Also consider vendor support and skillsets, ease of use, scalability, integration ability, interoperability, training, and upgradability.
Build: N/A.
Deliverable: Comparative analysis report of different ML platforms.

Assess In-house Talent (HR manager): Your success depends on your capacity to support either scenario.
Buy: You must consider the vendor's AI/ML expertise and support. You need a smaller team to implement and maintain the system. Evaluate your in-house skills and training needs.
Build: You need AI/ML talent to build, maintain, and operate the platform, and there is significant investment in training and development.
Deliverable: Report on the current state of in-house talent and training needs.

Consider Interoperability (IT architect): Ensuring the chosen AI/ML services and solutions can work seamlessly with existing or planned systems is paramount.
Buy: Evaluate the cloud provider's compatibility with other systems. Usually, it is easier to integrate within their ecosystem. Vendor lock-in is possible.
Build: You can customize to ensure maximum compatibility with other systems, but it requires extra resources and time.
Deliverable: Assessment report on interoperability of potential ML platforms.

Evaluate Control over Platform (IT manager): For an AI/ML platform, the ability to tweak the algorithms or change the architecture can be crucial depending on your company's specific needs.
Buy: Limited control over the underlying AI/ML algorithms and tools, and you may have to depend on the vendor's upgrade/release cycle for new features.
Build: You would have greater control over aspects such as adding or modifying features as needed.
Deliverable: Report on the degree of control provided by different options.

Explore Open-Source Support (Data scientist): Open-source tools provide greater community support and have frequent updates, which is important to consider.
Buy: The vendor may not offer full support for open-source AI/ML tools. Some may work well only with their proprietary tools.
Build: You could incorporate as many open-source components as necessary, providing you with flexibility and innovation speed.
Deliverable: Report on the open-source support of different options.

Estimate Time-to-Market (Project manager): You have to factor in the time urgency to choose between the two options.
Buy: You can deploy faster due to prebuilt AI/ML capabilities but may still need customization and integration.
Build: Custom development can be time-consuming, leading to slower deployment.
Deliverable: Estimated timeline for each option.

Assess Cost (Financial analyst): You must factor in both the initial and ongoing costs when choosing to buy or build.
Buy: You would have upfront costs, possibly with ongoing licensing or subscription fees.
Build: Potentially high upfront development costs, plus ongoing costs for maintenance, upgrades, and potential modifications.
Deliverable: Cost assessment report for build versus buy.

Review Feature Requirements (Business analyst): Your choice will depend on your specific business and functional requirements.
Buy: May not be as customizable. Might not meet all specific requirements.
Build: It is usually possible to meet most specific business requirements.
Deliverable: Report on the alignment of feature requirements with available options.

Examine Scalability and Flexibility (Data engineer): As your business grows, your need to deal with large volumes of data and complex models will also grow.
Buy: The vendor should be able to scale and adapt to your growing data and model complexity.
Build: You can build your platform with scalability in mind from the start and provide greater flexibility.
Deliverable: Report on the scalability and flexibility of each option.

Assess Vendor Support and Upgrades (IT manager): The AI/ML field is constantly evolving, and this is a key factor to consider from an innovation perspective.
Buy: Vendors provide regular support and updates, but you are still dependent on their update cycle.
Build: You have to take greater responsibility for support and updates, and this may require additional resources.
Deliverable: Assessment report on vendor support and upgrades.

Analyze Security and Compliance (IT security officer): Ensuring data security and meeting regulatory compliance are critical, regardless of whether the solution is bought or built.
Buy: AI/ML cloud providers should ensure that their solution meets industry and regulatory compliance standards such as HIPAA or GDPR.
Build: You have greater control over the compliance process, but it also means you have to design with security and regulatory compliance in mind, potentially adding complexity.
Deliverable: Report on the security and compliance of each option.

Make a Recommendation (Project lead): Make a recommendation on whether to build or buy the ML platform.
Deliverable: Recommendation report on whether to build or buy the ML platform.
204 ❘ CHAPTER 13 Setting Up an Enterprise AI Cloud Platform Infrastructure
TIP Consider factors such as talent, interoperability, control, open-source support, time-to-
market, cost, feature requirements, scalability, vendor support, and security compliance when
making the build versus buy decision.
The goal of this hands-on exercise is to help you gain a comprehensive understanding of the
different cloud providers, including their strengths and weaknesses, and choose the right cloud
provider or a combination of providers for different AI/ML services, platforms, and tools that
best suit your company’s needs.
Here are the steps needed to choose the best cloud provider:
TIP Consider the range of AI/ML services offered by the various cloud providers, including
prebuilt models, development tools, and support for the entire ML lifecycle when choosing a
cloud provider.
COMPONENT: DESCRIPTION
Data ingestion: Ingest enormous amounts of medical data, such as X-rays, CT scans, and MRIs, from multiple sources.
Preprocessing and data cleaning: Automatically preprocess and clean the data, such as by reducing noise, adjusting contrast, and carrying out normalization.
Labeling: Label a subset of images to create a labeled dataset for subsequent model training. This may involve identifying and annotating abnormalities and identifying new feature sets.
Model training: Train the ML models using the labeled data to carry out machine learning tasks such as anomaly detection, classification, and segmentation.
Model testing: Test the machine learning models by integrating them into an environment that simulates real-life clinical scenarios. This helps with thorough performance analysis and evaluation of the models before they can be deployed in production.
Model deployment: Deploy the models in a production scenario, such as a specific hospital or a diagnostic center, to assist doctors and radiologists in accurately interpreting medical images.
Performance monitoring: Continuously monitor the performance of these models by measuring their accuracy, recall, and precision metrics in real time.
Key Components of an Enterprise AI/ML Healthcare Platform ❘ 207
Alerting, dashboarding, and feedback reports: Generate alerts, dashboards, and feedback reports to assist healthcare professionals so that they can make further improvements to these models, as well as provide better patient care by suggesting alternative diagnoses, identifying better treatments, and providing better care overall.
NOTE Handling large amounts of data requires a robust and automated data ingestion and
preprocessing pipeline.
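To make the preprocessing step concrete, here is a minimal Python sketch of an automated cleaning-and-normalization routine. It operates on plain lists of pixel intensities rather than real DICOM images, and the function names are illustrative, not from any particular imaging toolkit.

```python
def normalize_scan(pixels, lo=0.0, hi=1.0):
    """Min-max normalize raw pixel intensities into [lo, hi].

    A simplified stand-in for the contrast-adjustment/normalization
    step described above; real pipelines operate on DICOM images.
    """
    p_min, p_max = min(pixels), max(pixels)
    if p_max == p_min:                      # flat image: avoid divide-by-zero
        return [lo for _ in pixels]
    scale = (hi - lo) / (p_max - p_min)
    return [lo + (p - p_min) * scale for p in pixels]

def clean_batch(scans):
    """Drop empty scans, then normalize the rest (ingestion -> cleaning)."""
    return [normalize_scan(s) for s in scans if s]
```

In a production pipeline this logic would run automatically on every ingested batch, which is exactly why the robustness called out in the note above matters.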
Figure 13.13 shows some core platform architecture components, each of which is covered in the following sections.
[Figure 13.13 depicts a five-stage pipeline: Data Sources (ERP, CRM, SCM, third-party, social media, etc.) → Data Management (data transformation, versioning, shared data lake, processing and labeling) → ML Model Experiment (pre-training, automated triggers, feature engineering) → Model Training and Evaluation (training and tuning, model evaluation) → Model Deployment (model registry and deployment, model operationalization, model execution, model serving).]
Once you store large datasets in S3, other data processing and ML services can access that data. Storing data in
one central location, such as S3, helps you manage data scientists’ workflow, facilitates automation, and enables
true collaboration and governance.
Creating the following folder structure with proper data classification will help you exercise access control and
facilitate machine learning workflows, tracking, traceability, and reproducibility during experiments:
FOLDER: DESCRIPTION
Users and project teams folders: Individual data scientists and teams can use these folders to store the training, test, and validation datasets they use for experimentation.
Shared features folders: These folders can contain shared features such as customer data, product data, time series, and location data, which can all be used for model training across teams and models.
Training, validation, and test datasets folders: These are used for formal model training, testing, and validation. Managing these folders is essential, as it helps with the traceability and reproducibility of models.
Automated ML pipelines folders: These folders can be accessed only by automated pipelines, such as an AWS CodePipeline, for storing artifacts related to the pipeline.
Models in production folders: These models should be actively monitored and managed through version control and play a key role in ensuring traceability.
Data Sources
Your data can come from one or more sources located in various locations or systems. It helps to treat them
as enterprise-level assets rather than department-specific assets. The data sources can include databases, data
warehouses, data lakes, external APIs, files, and even streaming data from IoT devices and sensors. These sources
provide the raw data for the ML models during the training process.
Data Ingestion
Data ingestion is the process of loading the data from these data sources into the machine learning platform,
which can be an automated or manual process. It includes techniques such as data extraction, integration,
and loading.
Data Catalog
The data catalog acts as the central repository for all metadata about the datasets that are used in the ML system.
It contains information related to data sources, dataset description, data quality metrics, schema information,
data format, and usage statistics. Data scientists, analysts, and data engineers use this catalog to discover, understand, and access data related to their projects.
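A data catalog can be reduced to a searchable store of metadata records. The Python sketch below illustrates the idea with a dictionary-backed catalog; field names and the tag-based discovery method are illustrative, and real catalogs (for example, AWS Glue Data Catalog) track far more, such as partitions and lineage.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Minimal metadata record for one dataset."""
    name: str
    source: str
    schema: dict
    quality_score: float = 1.0
    tags: list = field(default_factory=list)

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def discover(self, tag):
        """Let scientists find datasets relevant to their project by tag."""
        return [e.name for e in self._entries.values() if tag in e.tags]
```

The value of the catalog is in the `discover` path: teams find existing datasets instead of re-ingesting them.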
TIP Establish a shared, centralized data lake to promote governance, collaboration, and
seamless data access across your company.
Data Transformation
Data transformation is the process of getting the raw data in a format and structure ready to use in ML models.
It involves cleaning the data, preparing it, feature transformation, data normalization, dimensionality reduction,
and so on.
Data Versioning
Data versioning is the process of assigning versions to datasets and tracking any changes to the data used in the model. It lets you revert to an older version if necessary and track the lineage of datasets used in different model training experiments, which helps with the traceability and reproducibility of ML experiments.
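One common way to implement dataset versioning is content addressing: hash the data, and the hash is the version. The sketch below shows the core idea in a few lines of Python; it is a toy illustration of the approach tools like DVC build on, and the class and field names are invented for this example.

```python
import hashlib
import json

def dataset_version(records):
    """Content-addressed version id: identical data yields the same
    version, and any change yields a new one."""
    blob = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

class VersionLog:
    """Track which dataset version trained which model, for lineage."""
    def __init__(self):
        self.lineage = []          # (model_name, dataset_version) pairs

    def record(self, model_name, records):
        self.lineage.append((model_name, dataset_version(records)))
```

Because the version is derived from the content itself, two teams training on "the same" dataset can verify they really did, and any silent change to the data produces a visibly different version in the lineage log.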
Data querying and analysis: Teams must be able to explore and analyze data, conduct SQL analysis or data transformations, and visualize results. For example, use Amazon Athena to query data in a data lake using SQL. (Tools: Amazon Athena; Google BigQuery ML, Dataflow; Azure HDInsight, Azure Data Lake Analytics)

Code authoring: Teams should be able to write and edit code using IDEs and notebooks. For example, SageMaker Notebook Instances and SageMaker Studio can be used for code authoring, data processing, model training, and experimentation. (Tools: Amazon SageMaker Notebook, Amazon SageMaker Studio; AI Platform Notebooks; Azure ML Studio)

Data processing: This includes tools for data preprocessing and transformation, such as data cleaning, data wrangling, and feature engineering. For example, use Amazon Data Wrangler for data import, transformation, visualization, and analysis; Amazon SageMaker Processing can be used for large-scale data processing, including integration with Apache Spark and scikit-learn. (Tools: Amazon Data Wrangler, SageMaker Processing; Google AI Hub; Azure Machine Learning Studio)

Feature store: Enables sharing common features across teams for model training and inference. (Tools: Amazon SageMaker Feature Store; Vertex AI Feature Store; Azure Machine Learning)

Model training and tuning: Involves the ability to train and test using various techniques and algorithms, evaluate models, and conduct hyperparameter tuning. For example, use the Amazon SageMaker training/tuning service. (Tools: Amazon SageMaker Training/Tuning Service; Google Cloud AutoML, Vertex AI; Azure ML Studio, Azure Automated Machine Learning)

Containerization: The ability to build, test, and deploy models and workflows as containerized applications or services. For example, use Amazon ECR to store training, processing, and inference containers. (Tools: Amazon ECR; Google Kubernetes Engine; Azure Container Registry)

Source code control: Managing code using source code control systems such as Git to track changes, enable collaboration between team members, and ensure reproducibility. Artifact repositories are private package repositories used to manage library packages and mirror public library sites. (Tools: Git)

Data science library access: Providing easy access to data science libraries such as NumPy, Pandas, scikit-learn, PyTorch, and TensorFlow to allow easy development and experimentation.
[Figure: A centralized environment (source code control with CodeCommit, a continuous delivery service with CodePipeline, and an artifact repository with CodeArtifact) feeds a development environment containing a container repository (Amazon ECR), a data lake (S3), a workflow orchestrator (AWS Step Functions), event management (AWS Lambda), ETL software (AWS Glue), and an ML platform (Amazon SageMaker).]
➤➤ Deploy models in the cloud for increased scalability and performance, and deploy some models at the edge for increased security.
➤➤ You can carry out data storage, data preprocessing, and model training in your own on-prem infrastructure and offload distributed processing, large-scale training, and hosting the model to the cloud.
➤➤ Hybrid computing can help you with resource optimization, cost optimization, and data governance.
Edge Computing
Here are some considerations for edge computing:
➤➤ It brings data processing and model training to edge devices, such as IoT devices, sensors, or edge servers, closer to the point of data generation or data source.
➤➤ It can be handy for use cases that require real-time processing of data with exceptionally low latencies,
reduced bandwidth, and increased security and privacy.
NOTE Hybrid and edge computing are especially useful in use cases such as remote healthcare monitoring, autonomous vehicles, and industrial IoT.
To meet these requirements, you may need to adopt model compression techniques such as quantization and model pruning to reduce the model size. Moreover, the model should be designed to operate in offline or intermittent-connectivity scenarios.
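To illustrate the quantization idea, here is a toy symmetric linear quantizer that maps float weights to int8 values. This is a sketch of the concept only: production frameworks (for example, TensorFlow Lite or PyTorch quantization) also calibrate activations, fuse operations, and handle per-channel scales.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8 range.

    Returns (quantized values, scale); dividing storage by 4 versus
    float32 is the model-size win described above.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid 0 scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return [v * scale for v in q]
```

The trade-off is a small, bounded rounding error per weight in exchange for a roughly 4x smaller model, which is often what makes deployment on constrained edge hardware feasible.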
➤➤ Data synchronization, latency, and network connectivity: In hybrid and edge computing scenarios, data synchronization, latency, and network connectivity are crucial factors for the effective operation of machine learning models.
➤➤ To keep the data consistent between the cloud and the edge devices: Adopt a push or pull synchronization strategy. In push synchronization, the cloud pushes the data to the edge device; in pull synchronization, the edge device pulls the data from the cloud.
➤➤ To improve latency: You can adopt techniques like edge caching, preprocessing, or local inference to
avoid round trips to the cloud.
➤➤ To address network connectivity issues: You can build fault-tolerant systems by using compression techniques or by optimizing data transmission protocols.
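The pull synchronization, caching, and offline-tolerance points above can be combined in one small sketch. The class below pulls the latest value from the cloud when the network is up and falls back to its local cache when it is not; `fetch_from_cloud` is a hypothetical stand-in for a real HTTPS or MQTT call.

```python
class EdgeNode:
    """Pull synchronization with a local cache and an offline fallback."""

    def __init__(self, fetch_from_cloud):
        self._fetch = fetch_from_cloud   # injected network call (may raise)
        self._cache = {}

    def get(self, key):
        try:
            # Pull the latest copy when connectivity allows.
            self._cache[key] = self._fetch(key)
        except ConnectionError:
            # Offline or flaky network: serve the cached copy instead
            # of failing, avoiding a round trip to the cloud.
            pass
        if key not in self._cache:
            raise LookupError(f"{key} unavailable offline")
        return self._cache[key]
```

Serving from the cache on connectivity failure is exactly the fault-tolerance behavior the bullets describe: the edge device degrades gracefully rather than blocking on the cloud.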
CATEGORY: DETAILS
Workload distribution: You can implement workload distribution in such a manner that you choose one cloud provider for model training based on its infrastructure capabilities, such as specialized GPU instances, and choose another provider for real-time inference based on its low-latency services.
• You can also reduce the risk of service disruptions by distributing the workloads across multiple cloud providers.
• Using geolocation services, you can distribute the workload to users based on their locations.
• You can also use load-balancing features to distribute the load across multiple cloud providers so that no one cloud provider is overloaded.
Redundancy management: You can employ redundancy management to ensure there is no single point of failure. You can implement this through data redundancy, infrastructure redundancy, application redundancy, load balancing, and service failover techniques. You can set up either active-active or active-passive configurations to ensure continuity of services.
Disaster recovery: By implementing the redundancy management techniques discussed earlier, you can also ensure you have a sound disaster recovery system in place so that you can fail over should the system go down.
Data synchronization: Keeping the data synchronized across multiple cloud environments is a critical step, and you must employ different techniques, such as real-time or scheduled synchronization. You must employ distributed databases, data synchronization, replication, and streaming techniques.
Security: You must employ a robust multicloud security strategy to protect data, applications, and networks. Security measures such as IAM, access controls, data encryption, network security, and monitoring are essential. Data flowing between cloud environments should be encrypted, and security protocols across the different cloud environments should be synchronized.
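The workload-distribution and failover ideas in the table can be sketched as a tiny round-robin router. This is a toy model of the behavior only; real multicloud setups use DNS/geo load balancers and automated health checks, and the provider names below are illustrative.

```python
import itertools

class MultiCloudRouter:
    """Round-robin dispatch across providers with simple failover."""

    def __init__(self, providers):
        self._cycle = itertools.cycle(providers)  # rotate for even spread
        self.healthy = {p: True for p in providers}
        self._n = len(providers)

    def route(self):
        # Skip unhealthy providers so one outage does not stop routing.
        for _ in range(self._n):
            p = next(self._cycle)
            if self.healthy[p]:
                return p
        raise RuntimeError("no healthy provider available")
```

Marking a provider unhealthy and letting traffic flow to the remaining ones is the "no single point of failure" property the redundancy-management row describes.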
NOTE Achieving a successful multicloud architecture helps you avoid vendor lock-in, but it
can be a complex endeavor.
SUMMARY
This chapter laid the foundations for the rest of your AI journey. The infrastructure serves as the technical bridge between your initial strategic planning and team building and the subsequent stages of data processing, model training, deployment, and scaling.
This chapter covered the complex components of an enterprise AI cloud platform and dived into the five most common types of reference architectures. You also reviewed the considerations for buy-versus-build decisions and the criteria for choosing between cloud providers. The chapter delved into the specific components of the ML and DL platforms, data management architecture, and the nuances of hybrid and edge computing and multicloud architectures.
Whether it is building an AI platform for healthcare or synchronizing data in real time, this chapter provided the necessary knowledge and tools to benefit IT professionals, cloud architects, system administrators, data scientists, decision-makers, and technology strategists.
But your journey doesn’t stop here. With your infrastructure now set, you will be processing your data, choosing
the right algorithms, training your models, and ensuring they are deployed efficiently and ethically. Remember
that the infrastructure plays a key role in scaling your AI efforts, deploying models seamlessly, and evolving and
maturing your AI capabilities, all of which are covered in the remaining chapters of this book.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
1. Name the AWS service used to collect data from IoT devices in the described architecture.
A. AWS Greengrass
B. AWS IoT Core
C. Amazon Redshift
D. Amazon S3
2. The ______ service is used to train a personalized recommendation model.
A. Amazon S3
B. AWS Glue DataBrew
C. Amazon Personalize
D. Amazon API Gateway
3. The ______ service is used to automate the steps of building, training, and deploying machine learning models.
A. Amazon Pinpoint
B. Amazon S3
C. Amazon SageMaker
D. Amazon Athena
4. The biggest advantage of adopting machine learning in the cloud is that developers with little or no
machine learning skills can:
A. Only deploy pretrained models
B. Only build simple models
C. Build, train, and deploy complex models in production
D. Only access curated data
5. ____________ should be considered when deciding whether to build or buy solutions for an enterprise
ML platform. (choose more than one answer)
A. Availability of in-house talent
B. Integration with third-party applications
C. Cost of hardware and software licenses
D. Time required for model training
6. True or False: It is more important to manage data than to improve models in machine learning.
7. _______ is NOT a technique for improving latency in hybrid and edge computing.
A. Edge caching
B. Preprocessing
C. Local inference
D. Round trips to the cloud
ANSWER KEY
1. B  2. C  3. C  4. C  5. A, B, C, D  6. True  7. D  8. A
14
Operating Your AI Platform with
MLOps Best Practices
The value of an idea lies in the using of it.
—Thomas Edison
Now that the AI platform has been built, your focus shifts toward operating the AI platform; ensuring the
efficiency, reliability, and repeatability of the ML operations workflow; and enabling scale with automation
(see Figure 14.1).
The target audience includes tech leads, operations managers, ML engineers, data scientists, systems administrators, and IT professionals looking to employ automation in daily operations.
Automation using MLOps isn’t just a buzzword. It’s the secret sauce to scaling rapidly, reducing manual
errors, speeding up processes, and enabling adaptability in a dynamic business landscape.
In this chapter, you review various key components, from model operationalization and deployment scenarios to automation pipelines, platform monitoring, performance optimization, security, and much more. It includes actionable deliverables that will help you crystallize the MLOps ideas presented in this chapter.
This chapter serves as a pivotal transition point from setting up your AI infrastructure (Chapter 13) to
actual data processing and modeling (Chapters 15–17), highlighting the role of MLOps in ensuring this
transition is smooth, efficient, and automated.
What Is MLOps?
MLOps stands for “machine learning operations,” which is a set of practices, principles, and tools that unify
machine learning (ML) development and operations (Ops). It focuses on automating the end-to-end machine
learning lifecycle, from data preparation to model training to deployment and monitoring. It enables faster,
more reliable, and scalable ML implementations, which is the focus of this chapter. MLOps is similar to DevOps in software development, but it is specifically geared toward automating ML workflows.
[Figure 14.1 depicts the nine-step Enterprise AI Journey Map: (1) Strategize and Prepare, (2) Plan and Launch Pilot, (3) Build and Govern Your Team, (4) Setup Infrastructure and Manage Operations, (5) Process Data and Modeling, (6) Deploy and Monitor Models, (7) Scale and Transform AI, (8) Evolve and Mature AI, and (9) Continue Your AI Journey. Step 4, the focus here, covers setting up the enterprise AI cloud platform infrastructure and automating AI platform operations for scale and speed. Objective: efficiently manage and scale your AI platform by integrating MLOps best practices for automation, model deployment, and tracking throughout the AI lifecycle.]
FIGURE 14.1: SETUP INFRASTRUCTURE AND MANAGE OPERATIONS: Automate AI platform operations
for scale and speed
MLOps is about continuous integration and continuous delivery (CI/CD) of not just models, but also of data.
The goal of MLOps is to automate and scale the tasks of processing and engineering data. Employ the best practices in this chapter so that, as you process data in Chapter 15, new data is automatically processed and made available for model training.
In Chapters 16 and 17, you may try different algorithms and train and tune multiple models. Your goal must be
to deploy the best practices in this chapter to ensure that each model version is tracked, stored, and can be rolled
back or forward as needed. This also ties in with the concept of tagging and container image management, which
will facilitate this process.
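The version-tracking and rollback behavior described above can be sketched as a minimal registry. This is a toy illustration, not the SageMaker Model Registry API: plain strings stand in for S3 artifact URIs, and the tag list echoes the tagging idea mentioned in the text.

```python
class ModelRegistry:
    """Track model versions with tags so deployments can roll back or forward."""

    def __init__(self):
        self._versions = []        # (version number, artifact, tags)
        self.deployed = None       # currently deployed version number

    def register(self, artifact, tags=()):
        self._versions.append((len(self._versions) + 1, artifact, tuple(tags)))

    def deploy(self, version):
        assert 1 <= version <= len(self._versions), "unknown version"
        self.deployed = version

    def rollback(self):
        """Step back one version, e.g., after a failed evaluation."""
        if self.deployed is not None and self.deployed > 1:
            self.deployed -= 1
        return self.deployed
```

Because every trained model is registered before deployment, rolling back is just repointing the endpoint at an earlier registry entry rather than retraining anything.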
Once you have trained your models, they need to be evaluated. The practices outlined in this chapter and
Chapter 19 can help automate this step. It ensures that as soon as the model is trained using the practices in
Chapter 17, it’s automatically evaluated and metrics are logged.
Finally, I want to emphasize the importance of having feedback loops. By that I mean the evaluations and insights
obtained from the training, tuning, and evaluation of the models should be fed back into the system to refine the
data processing methods in Chapter 15.
STEP IN MACHINE LEARNING WORKFLOW: DESCRIPTION — TOOLS/APPLICATIONS — USE CASES

Data collection, ingestion, cleaning, and feature transformation: Automate these steps in an ordered, predefined sequence. (Tools: Apache Airflow, Google Cloud Composer)

MLOps automation (model training, testing, tuning, deployment, and monitoring): You can create machine learning workflows using a CI/CD pipeline and a discipline known as MLOps, which evolved from DevOps, a common practice in software development. (Tools: Kubeflow, MLflow, AWS SageMaker Pipelines, Azure Machine Learning pipelines, Google Cloud AI Platform pipelines) Use case: You can retrain a model if it doesn't meet the performance thresholds, validate it, and redeploy it once it meets the performance criteria.

Using a feature store: A feature store is a key element of the MLOps lifecycle. It speeds up machine learning tasks and eliminates the need to create the same feature multiple times. (Tools: Online and offline feature storage; online feature storage is used for real-time inference, while offline feature storage is used to maintain a history of feature values.) Use case: Helpful for training and batch inference. It helps machine learning engineers and data scientists create, share, and manage features during ML development and reduces the curation work required to create features out of raw data.

Continuous learning: When a model is released into production, it has been trained on past data, but its performance may decline as time goes on because customers' tastes, product lines, and many other variables change. In other words, while the world has changed, the model remains unchanged. This means you need to continuously monitor the model's performance and retrain it in an automated manner, a process known as continuous learning. (Tools: AWS SageMaker, Azure Machine Learning, Google Cloud AI Platform) Use case: Just as a restaurant's menu needs to be updated with customers' changing tastes, you need to keep your models up-to-date.

Data version control: Data version tracking is like a recipe book in which you capture all the ingredients, the process steps, and the meal created at the end; the list helps you re-create or tweak the meal because you know precisely what was done. In the context of machine learning, you maintain the list of datasets used, the models and their versions, and the parameters used to tune each model. All this information helps you in the future, either to tweak the model for better performance or to troubleshoot it and identify the root cause of poor performance. If you get different results every time, you can assess whether the model, the data, or both contribute to the results. (Tools: Data version control tools such as Data Version Control (DVC) manage data versions and track the models they were used with; DVC can integrate with other version control tools, such as Git, and handle large amounts of data.) Use case: You can maintain a record of all the data, code, and models used; share data and models across teams; reproduce experiments; and roll back changes, making your machine learning workflows collaborative, reproducible, and reliable.
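The online/offline split behind a feature store can be captured in a short sketch. The class below keeps the latest value per feature for low-latency lookups and appends every write to a history for training; the feature names and record shape are illustrative, not any vendor's API.

```python
class FeatureStore:
    """Online store holds only the latest value (real-time inference);
    offline store appends full history (training and batch inference)."""

    def __init__(self):
        self.online = {}    # (entity, feature) -> latest value
        self.offline = []   # (timestamp, entity, feature, value) history

    def put(self, entity, feature, value, ts):
        self.online[(entity, feature)] = value
        self.offline.append((ts, entity, feature, value))

    def get_online(self, entity, feature):
        return self.online[(entity, feature)]

    def history(self, entity, feature):
        return [v for _, e, f, v in self.offline if (e, f) == (entity, feature)]
```

A single `put` feeds both stores, which is what lets training jobs and live endpoints consume the same feature definitions without duplicated curation work.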
TIP Use a machine learning workflow management tool that can automate the machine learning process to scale your ML operations.
MODEL OPERATIONALIZATION
Model operationalization is the process of taking a previously trained model and getting it ready for execution, hosting, monitoring, and consumption in production. It ensures that the model can be executed on new data, hosted on a scalable and reliable infrastructure, monitored for performance, and seamlessly integrated into applications and systems to draw business insights for further action. Remember that while training a model is a significant achievement by itself, its real value is realized only when it is operationalized. Let's look at each step of model operationalization in more detail:
Model execution: The process of taking a previously trained model and integrating it into an existing application or system to take new data and make predictions in real-world scenarios. Model execution can happen using two methods: online processing and batch processing. In online processing, one piece of data at a time is fed into the model, as is the case with product recommendations. In batch processing, a large batch of data is fed into the model for prediction, as is the case for fraud detection and risk management.

Model hosting: The ability to host models in production in a controlled environment to ensure functionality and performance testing before deploying them for widespread use. It involves setting up the necessary infrastructure and providing the compute resources needed to run efficiently without significant latency under high workloads. Model hosting can be done using various approaches, such as deploying on serverless computing, containers such as Docker, or dedicated servers. In-house platforms can also host models but can be more expensive to maintain. Amazon SageMaker Hosting is an example of a hosting service.

Model monitoring: The process of continuously monitoring the deployed models for performance and behavior in a production environment. It involves gathering metrics such as model response time, prediction outputs, and other data to assess the model's accuracy, stability, and adherence to performance goals. It helps detect concept drift and model drift and identify anomalies to ensure the model's performance does not degrade over time. (Tools: metrics-gathering tools such as Amazon CloudWatch)

Model consumption: The actual process of using the model to make predictions. You may need to integrate the model into downstream applications, systems, user interfaces, and other business processes or workflows so that users can make predictions using the model. Model consumption can happen in two ways: through an API for real-time predictions or through a web service for batch predictions over large datasets.
TIP Implement continuous model monitoring to maintain quality and reliability of models
in real-world scenarios.
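The continuous-monitoring idea in the tip can be sketched as a rolling-window accuracy check that flags when retraining is needed. The window size and threshold below are illustrative choices; production monitors (for example, SageMaker Model Monitor) also track latency, data drift statistics, and more.

```python
from collections import deque

class ModelMonitor:
    """Flag model-performance degradation over a rolling window of results."""

    def __init__(self, window=100, threshold=0.9):
        self._results = deque(maxlen=window)  # recent hit/miss outcomes
        self.threshold = threshold            # minimum acceptable accuracy

    def record(self, prediction, actual):
        self._results.append(prediction == actual)

    def needs_retraining(self):
        """True when rolling accuracy drops below the threshold."""
        if not self._results:
            return False
        accuracy = sum(self._results) / len(self._results)
        return accuracy < self.threshold
```

Wiring `needs_retraining()` to the retraining pipeline closes the feedback loop: degradation detected in production automatically triggers the continuous-learning process described earlier.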
Automation Pipelines
Automation pipelines are the bedrock of enterprise AI. Through automation, you can enforce governance and standardization, such as tagging and naming conventions, and ensure repeatability across the entire machine learning lifecycle, from data preparation through model training, testing, deployment, and monitoring.
As shown in Figure 14.2, the automation pipeline consists of several components for model training and deployment, such as the following:
[Figure 14.2 shows the automation pipeline across environments: a development environment with a data pipeline (SageMaker Processing), model training (SageMaker Training), an ML workflow orchestrator (AWS Step Functions), an ML platform (Amazon SageMaker), and data storage (S3); a test/UAT environment with data storage (S3); and a production environment with data storage (S3) and an API gateway serving a SageMaker endpoint.]
The table below describes the various components included in Figures 14.2 and 14.3 for model training and
deployment.
Code repository: The starting point for an automated pipeline run. It consists of several artifacts, such as Docker files, training scripts, and other dependency packages required to train models, build containers, and deploy models. (Tools: Docker files, training scripts, dependency packages; Azure OpenAI Code Repository, Google Artifact Registry)

Code build service: A code build service can be used to build custom artifacts, such as Docker containers, and push them to a container repository like Amazon ECR. (Tools: AWS CodeBuild, Jenkins; Azure DevOps, Google Cloud Build, Kubernetes)

Data processing service: A data processing service can be used to process raw data into training/validation/test datasets for model-training purposes. (Tools: Amazon SageMaker data processing, Dataflow, Dataproc, Azure Databricks)

Model training service: The model training service gets the data from the data lake, trains the model, and stores the model output back into the output folder of the data storage, typically S3. (Tools: Amazon SageMaker training service, Google AutoML, Vertex AI, Azure Machine Learning Studio)

Model registry: A model registry is used to manage the model inventory; it contains the metadata for the model artifacts, their location, the associated inference container, and the roles. (Tools: AWS Model Registry, AI Hub, Vertex AI Model Registry, Azure OpenAI Code Repository, Azure DevOps, Azure Machine Learning Model Registry)

Model hosting service: The model hosting service accepts the model artifacts from the model registry and deploys them into a serving endpoint. (Tools: AWS model hosting service, Google AI Platform Prediction, Azure Kubernetes Service)
Figure 14.3 shows the implementation of a CI/CD pipeline using AWS components.
This pipeline architecture completely automates the machine learning workflow from code commit, pipeline start,
container build, model building, model registration, and production deployment. In the case of AWS, you can use
tools like CodeCommit, CodeBuild, CloudFormation, Step Functions, and SageMaker Pipelines.
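Conceptually, such a pipeline is an ordered sequence of stages with a manual approval gate before deployment. The sketch below models that flow in plain Python as a toy stand-in for CodePipeline or SageMaker Pipelines; the stage names and the shared `context` dict are invented for illustration.

```python
def run_pipeline(stages, context):
    """Run stages in order; a stage returning False halts the pipeline
    (for example, an unapproved deployment)."""
    for name, stage in stages:
        context["log"].append(name)       # record progress for auditability
        if not stage(context):
            break
    return context

# Toy stages: each mutates the shared context and returns True to continue.
def build(ctx):   ctx["image"] = "ecr/model:1"; return True
def train(ctx):   ctx["model"] = "model.tar.gz"; return True
def approve(ctx): return ctx.get("approved", False)   # manual gate
def deploy(ctx):  ctx["endpoint"] = "live"; return True

STAGES = [("build", build), ("train", train),
          ("approve", approve), ("deploy", deploy)]
```

The approval stage shows why the gate matters: without an explicit approval flag, the pipeline completes build and train but never reaches deployment.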
Implementing automated pipelines is the secret sauce behind an enterprise AI platform, as they facilitate automation, standardization, and reproducibility in machine learning workflows.
TIP By automating the deployment of models in production, companies can scale their AI implementations across the enterprise.
[Figure 14.3 depicts an AWS CI/CD pipeline implementation involving CodeStar, CodeCommit (mirroring an on-premises code repository in the corporate data center), CodePipeline, CodeBuild, Amazon ECR, S3, SageMaker, a model registry, a Lambda function, and a manual model approver.]
The goal of this exercise is to give you an idea of the steps to design a data architecture and
implement a data pipeline to gather, clean, and store data for machine learning experiments. In
addition to automating the model training and deployment process, automating the data
pipeline can help you scale.
Identify data sources (Data engineer): List the different data sources that will be utilized, including databases, APIs, filesystems, and so on. Deliverable: A list of the different data sources.
Test the pipeline (Data engineer): Ensure that the data is correctly processed and stored in the desired format, and create a report summarizing the results. Deliverable: A report detailing the implementation and test results.
Deployment Scenarios
In the previous sections, we discussed automating the data pipeline and the model training and deployment steps. In this section, let's discuss creating different automation pipelines based on the deployment scenario. Deployment scenarios can range from comprehensive end-to-end pipelines that span raw code to production deployment, to more specialized pipelines such as Docker build or model registration pipelines.
➤➤ End-to-end automated pipeline: This involves taking the source code and data and training the model
all the way to deploying it into production. It involves stages like building the Docker image, processing
data, training the model, getting approval for deployment, and finally deploying the model.
➤➤ Docker build pipeline: This involves building the Docker image from the Docker file and pushing it into
a container repository, such as Amazon ECR, along with associated metadata.
➤➤ Model training pipeline: This involves training the model using an existing Docker container and then
optionally registering it into a model registry.
➤➤ Model registration pipeline: This pipeline registers the model in the model registry and stores the model artifacts in S3, with the associated container image in the ECR registry.
TIP Use a continuous integration/continuous delivery (CI/CD) pipeline to automate the process of building, testing, and deploying your machine learning models.
➤➤ Model deployment pipeline: This pipeline deploys the model from the model registry into an endpoint.
Figure 14.4 shows a code deployment pipeline. These pipelines can be triggered by an event, by a source code commit, or through the command line.
In Figure 14.4, a Lambda function is used to trigger the pipeline. The Lambda function can look for a change in
an existing source code file or determine when a new file is deployed to trigger the pipeline accordingly.
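In Python, this trigger logic might look like the following sketch. The pipeline name, the S3 event shape, and the watched prefix are illustrative assumptions; the client is injected so the decision logic can be exercised without AWS (in Lambda you would pass boto3.client("codepipeline")).

```python
# Sketch of a Lambda-style pipeline trigger. PIPELINE_NAME and the
# "react only to files under train/" rule are assumptions for illustration.

PIPELINE_NAME = "model-deploy-pipeline"  # hypothetical pipeline name

def object_key(event):
    """Pull the S3 object key out of an S3 put-event payload."""
    return event["Records"][0]["s3"]["object"]["key"]

def should_trigger(key, watched_prefix="train/"):
    """Trigger only when a new artifact lands under the watched prefix."""
    return key.startswith(watched_prefix)

def lambda_handler(event, context, client=None):
    """Start the pipeline when a watched file changes.

    `client` is injected so the logic can be tested without AWS.
    """
    key = object_key(event)
    if not should_trigger(key):
        return {"started": False, "key": key}
    client.start_pipeline_execution(name=PIPELINE_NAME)
    return {"started": True, "key": key}
```

Keeping the decision rule in pure functions like this makes the trigger easy to unit-test before wiring it to real events.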
Model inventory management provides transparency, efficiency, and control over these assets. You can keep track of deployed models, including their versions, metadata, and so on, and monitor and control their performance to ensure they are aligned with business objectives. The following table shows the different model inventory management tools available from the major cloud providers.
[Figure 14.4: A change committed to a source file in the source control repository (AWS CodeCommit) triggers a function (AWS Lambda), which starts the CI pipeline (AWS CodeBuild) and the CD pipeline (Amazon CodePipeline); the compute platform is monitored by Amazon CloudWatch.]
AWS | SageMaker Model Registry | Whenever a new model is developed, you can register it there, and every time it is trained, the latest version can be created. It supports:
➤➤ Model auditing
Azure | Azure Machine Learning Model Management Service | It supports:
➤➤ Attaching metadata to models (model name, owner name, business unit, version number, approval status, and other custom metadata)
[Figure: Model Registry (SageMaker Model Registry), Container Registry (Amazon ECR), and Model File Store (Amazon S3).]
NOTE Implement model inventory management to track model lineage, manage model
metadata, and audit model use.
Google | Google Cloud Audit Logs, Google Cloud Storage, Google Cloud Logging, Google's Security Command Center |
➤➤ Captures logs that are sent to other downstream systems for analysis.
➤➤ Stores logs in Google Cloud Storage buckets.
➤➤ Logs can be analyzed using Google's centralized monitoring system, Google Cloud Logging.
➤➤ Security Command Center offers security and compliance monitoring, vulnerability scanning, and threat detection capabilities.
TIP Implement a robust auditing system to track all activity and flag suspicious behavior. Create roles for each user so that proper attribution can be made for these activities during auditing.
Model Operationalization ❘ 229
[Figure: A multi-account MLOps architecture. The experimentation environment uses SageMaker (ML platform), S3 (data lake), and ECR (container store); auditing and logging use CloudTrail (audit) and SNS (notification); a central ops environment hosts the automation pipeline components (CodePipeline, CodeBuild, data lake, container registry), which deploy to the QA/UAT and production accounts.]
This table lists the tools provided by AWS to track code, datasets, metadata, container, training job, model, and
endpoint versions:
Code Versioning | GitLab, Bitbucket, CodeCommit | Code repositories help you track the versions of the code artifacts. You can check in and check out using a commit ID, which you can use to track the versions of the artifact.
Dataset Versioning | S3, DVC | You can track the versions of your dataset by using a proper S3 data partitioning scheme. When you upload a new dataset, you can add a unique S3 bucket/prefix to identify the dataset. DVC is another open-source dataset versioning system that can be integrated with source code control systems such as GitLab and Bitbucket when creating new datasets, and S3 can remain your backend store.
Metadata Tracking | SageMaker Lineage Tracking Service | Metadata associated with datasets can be tracked by providing details such as the dataset properties and the lineage.
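The S3 prefix-based dataset versioning scheme can be sketched as a small helper. The datasets/<name>/v<version>/<date>/ layout is one possible convention for illustration, not an AWS requirement:

```python
from datetime import date

def dataset_prefix(name, version, as_of=None):
    """Build a versioned S3 prefix such as 'datasets/sales/v3/2024-01-15/'.

    Partitioning by name, version, and date lets each upload get a
    unique, self-describing location without overwriting earlier data.
    """
    as_of = as_of or date.today()
    return f"datasets/{name}/v{version}/{as_of.isoformat()}/"
```

For example, dataset_prefix("sales", 3, date(2024, 1, 15)) yields "datasets/sales/v3/2024-01-15/".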
Container Versioning | Amazon ECR (Elastic Container Registry) | Amazon ECR tracks the container versions using the image URL containing the repository URI and image digest ID.
Training Job Versioning | SageMaker | A SageMaker training job contains its ARN and other metadata, such as URIs for hyperparameters, containers, datasets, and model output.
Model Versioning | Model Packages | Model versions can be tracked using model packages that contain their own ARNs and details such as the model URI and the container URI.
Endpoint Versioning | SageMaker | As part of creating the SageMaker endpoint, you can add other metadata, such as the model used, to the endpoint configuration data.
NOTE Implement a data and artifact lineage tracking system to understand how models
work and identify potential problems.
This table lists the corresponding tools provided by Azure:
Code Versioning | Azure DevOps | Tracks code versions and integrates well with Git repositories.
Metadata Tracking | Azure Machine Learning Services | Helps track metadata such as model training metrics, dataset properties, and experiment configurations.
Container Versioning | Azure Container Registry | Keeps track of container images and their versions.
Training Job Versioning | Azure Machine Learning Services | Manages training job versions and other parameters such as hyperparameters, training data, and outputs.
Model Versioning | Azure Machine Learning Services | Tracks model versions. Models can be deployed as web services, and metadata such as model properties, deployment configurations, and version information can be tracked.
Artifact Versioning | Azure DevOps, Azure Data Factory, Azure Data Catalog | Used to track the versions of artifacts throughout the machine learning lifecycle.
This table lists the corresponding tools provided by Google Cloud:
Code Versioning | Git, Google Cloud Source Repositories | Supports integrating Git and Google Cloud Source Repositories to help track code versions.
Dataset Versioning | Google Cloud Storage | Datasets can be stored in Google Cloud Storage in versioned buckets, which allows you to track different versions of the datasets.
Metadata Tracking | Cloud Metadata Catalog | Can track different versions of model datasets and other artifacts by associating metadata such as properties, descriptions, and lineage.
Container Versioning | Google Cloud Container Registry | Can track different versions of container images.
Training Job Versioning | Google Cloud AI Platform | Can track different versions of training jobs, models, and endpoints.
Artifact Versioning | Kubeflow, Data Catalog, Dataflow | Provides services to track artifacts throughout the machine learning lifecycle.
Google Cloud also provides services such as Kubeflow, Data Catalog, and Dataflow to track artifacts throughout
the machine learning lifecycle.
Public images | Docker Hub, GitHub, ECR Public Gallery | Publicly available images stored in public repositories.
Base images | Customer ECR instance | Foundational images made from public images. They contain additional OS patches and other hardening.
Framework images | Central team | Built on top of base images, these represent stable versions of the environment. They may include specific versions of frameworks such as TensorFlow Serving.
NOTE A secure container registry should allow you to store container images and provide
image signing, vulnerability scanning, role-based access control, and image auditing features.
The following are container management-related best practices using AWS to meet security and regulatory
requirements:
Security controls | Security patches | Implement security controls, consistent patching, and vulnerability scanning to meet security and regulatory requirements.
Versioning | Image tags | Use image tags to indicate specific versions of containers; unique tags can be used to track GitHub repository branches.
Access control | Identity access management | Implement granular access control for containers by adopting resource-based and identity-based policies.
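The unique-tag practice for tracking container versions and repository branches can be sketched as a helper. The <branch>-<version>-<short-sha> scheme below is an illustrative convention, not an ECR requirement:

```python
def image_tag(branch, commit_sha, version):
    """Compose a unique, traceable container image tag.

    A short commit SHA keeps the tag readable while still pointing
    back to the exact commit; slashes in branch names are replaced
    because registries typically disallow them in tags.
    """
    safe_branch = branch.replace("/", "-")
    return f"{safe_branch}-{version}-{commit_sha[:7]}"
```

For example, a build of branch feature/retrain at version 1.4.0 might be tagged feature-retrain-1.4.0-<short-sha>.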
This table lists Azure-related security management best practices for containers:
Image building | Azure Pipelines | Use Azure Pipelines, a CI/CD service, to build your images in an automated fashion.
Image storage | Azure Container Registry | Store your containers in Azure Container Registry, a secure and scalable repository.
Policy enforcement | Azure Policy | Use Azure Policy to enforce organizational security policies on your images.
This table lists Google-related security management best practices for containers:
Image building | Cloud Build | Use Cloud Build to build your images.
Tag Management
Tag management is another critical component of MLOps, as it helps track ML experiments and model versions. In addition to all the things discussed about using tags to manage resource usage, you can also use tags to manage other tasks, covered in this section.
➤➤ SageMaker Studio domains: These platforms help data scientists, machine learning engineers, and developers collaborate. They provide capabilities such as Jupyter notebooks, integrated debuggers, experiment environments, model registries, and Git integration.
[Figure 14.8 workflow: identify different ML environments; identify the resources that you want to tag; create a tag key and value for each resource; assign the tags to the resources; automatically delete resources.]
FIGURE 14.8: Using tags to track resource usage, cost management, billing, and access control
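A tag mandate can be sketched as a simple compliance check. The required keys below are an example policy, not a cloud-provider default; in practice you would enforce the mandate with AWS tag policies, Azure Policy, or Google Cloud organization policies:

```python
# Example tag policy: every resource must carry these keys (illustrative).
REQUIRED_TAG_KEYS = {"environment", "cost-center", "owner", "project"}

def missing_tags(resource_tags):
    """Return the required tag keys a resource is missing, sorted.

    `resource_tags` is a mapping of tag key to tag value, as returned
    by most cloud tagging APIs after normalization.
    """
    return sorted(REQUIRED_TAG_KEYS - set(resource_tags))
```

A periodic job could run this check over all resources and flag (or delete) the non-compliant ones.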
Azure | Azure Resource Manager tags help you attach metadata to various resources to control their cost, usage, organization, and access control. You can assign these tags to virtual machines, storage systems, and ML platform components, based on organizational units, cost centers, environments, or projects. You can use Azure Policy to mandate that tags be assigned to all resources, which helps you ensure governance, control cost, and enforce resource usage limitations. | The Azure Cost Management and Billing service helps you control the usage of ML resources and manage their budget accordingly. The Azure Machine Learning service controls costs by setting up compute targets and instance types and sizes. Azure also provides cost management APIs that can help you retrieve cost and usage data related to various resources, which can be used in custom reporting for tracking purposes.
Google | Google provides similar resource labels to tag virtual machines, storage buckets, models, and other ML components. Using Google Cloud policy, you can mandate that tags be associated with various resources, and these tags can be maintained at the organization, folder, or project level. | You can also control costs using Google's budget and billing alerts, AI Platform auto-scaling, BigQuery cost control, and the Cloud Billing API.
TIP You can use tags to track resource ownership, automate tasks, and comply with regulations.
This hands-on exercise aims to set up an enterprise AI platform with MLOps best practices that contains the major architectural components, such as a data pipeline, model experimentation platform, MLOps environment, workflow automation and CI/CD pipeline, performance optimization, platform monitoring, security, and governance control practices.
Also note that the roles used here are for general guidance; the actual roles used in your company will depend on your org structure, operating culture, and other considerations.
Finally, note that each of the following tasks depends on other tasks, and they have to come together holistically to contribute to the effective functioning of the platform as a whole.
SUMMARY
In the world of AI/ML production environments, three pillars stand out: efficiency, scalability, and reliability. To achieve those goals, automating the ML lifecycle is essential. MLOps, covered in this deep, technically intensive chapter, is the secret sauce to achieve that automation. By enabling this automation, MLOps serves to bridge ML development and operations, thus bringing speed to deployment. This chapter focused mainly on cloud engineers and other IT professionals responsible for operating and automating as much of the ML lifecycle as possible. Moreover, business and technology strategists stand to benefit, given the accelerated speed to market and scale that MLOps brings.
The chapter covered a number of key concepts, such as container images and their management, using tagging for access control, tracking resource usage, and cost control. In addition, it explored platform monitoring, model operationalization, automated pipelines, model inventory management, and data and artifact lineage tracking. It discussed best practices on the AWS, Azure, and Google Cloud platforms, covering topics such as security controls, organization, versioning, and access control.
The chapter provided hands-on exercises for setting up a modern enterprise AI platform incorporating the latest MLOps practices, encapsulating crucial architectural components like the data pipeline, MLOps environment, CI/CD pipeline, and governance practices. These exercises, structured around real-world roles like data engineer, data scientist, and ML engineer, not only teach the MLOps methodology but also increase awareness of the responsibilities that come with these roles.
This chapter is tailored toward professionals working in fields such as data engineering, machine learning,
DevOps, and cloud architecture, providing both theoretical knowledge and practical tools to implement robust AI
platforms. Whether a beginner or an experienced practitioner, you are likely to gain insights that will further your
understanding of the complex landscape of modern AI infrastructure.
In the next chapter, let’s discuss how to process data and engineer features.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You'll find the answers in the following section.
ANSWER KEY
1. A 4. A 7. A
2. D 5. D 8. D
3. D 6. D
PART VI
Processing Data and Modeling
Data is the lifeblood of AI. Part VI is where you get your hands dirty with data and modeling. I teach you
how to process data in the cloud, choose the right AI/ML algorithm based on your use case, and get your
models trained, tuned, and evaluated. It is where science meets art.
15
Process Data and Engineer
Features in the Cloud
The world is one big data problem.
—Andrew McAfee
In the world of AI, the true essence lies not just in sophisticated algorithms, but in the quality and structure of data. As most AI practitioners would attest, raw data is rarely ready to be used as is. This is where cloud-based data processing comes in. It refers to collecting, ingesting, storing, preprocessing, and engineering data for use in machine learning models through cloud-based data technologies from AWS, Azure, and Google.
So why is data so vital? Simply put, the most sophisticated model is only as good as the data on which it is trained. Incorrectly processed or poorly engineered data can lead to inaccurate predictions and decisions that are costly to your business.
As companies embrace the scalability and power of the cloud, learning how to process data in the cloud becomes paramount. The cloud provides several benefits, including scalability, performance, security, flexibility, and cost-effectiveness.
In this chapter, you learn about exploring your data needs and the benefits and challenges of cloud-based
data processing, and you dive into hands-on exercises, including feature engineering and transformation
techniques.
This chapter underscores the importance of data augmentation, showcases methods to handle missing data
and inconsistencies, and presents the art and science of feature engineering, all happening in the cloud. See
Figure 15.1.
This chapter addresses many topics, including data storage and exploration, data storage architectures, and
distributed data processing. Always remember, in the end, data is king—it is not just the starting point, but
the foundation for everything else when it comes to AI.
244 ❘ CHAPTER 15 Process Data and Engineer Features in the Cloud
[Figure 15.1: The Enterprise AI journey map, with stages Strategize and Prepare; Plan and Launch Pilot; Build and Govern Your Team; Set Up Infrastructure and Manage Operations; Process Data and Modeling (the current stage, covering Process Data and Engineer Features in the Cloud; Choose Your AI/ML Algorithm; and Train, Tune, and Evaluate Models); Deploy and Monitor Models; Scale and Transform AI; Evolve and Mature AI; and Continue Your AI Journey. Objective: Master cloud-based data processing by understanding, collecting, and refining data, coupled with advanced feature engineering techniques, to ensure optimized input for AI/ML model training.]
FIGURE 15.1: PROCESS DATA AND MODELING: Process data and engineer features in the cloud
Problem Statement: The management of a retail chain store is facing problems in predicting the sales for the upcoming quarter. In the past, inaccuracies in the forecasts have resulted in either overstocking or understocking, leading to inefficient resource allocation or missed revenue opportunities. The management wants to leverage the company's past data to develop a machine learning model that can predict sales more accurately for the next quarter across their retail stores. They need this prediction a few weeks before the start of the quarter to allow effective inventory planning, staffing, and other resource planning. The goal is to minimize waste and increase profitability.
Here are the steps your team can follow:
STEP | TASK | EXAMPLE | TASK OWNER (SKILL SETS) | DELIVERABLES
Understanding the business problem | First you need to consult stakeholders to get a clear understanding of what they want to predict and their time frame. | Stakeholders want to predict sales a few weeks before the start of the next quarter. | Project manager, business analyst | A clear statement of the problem, goals, and objectives of the project
Identifying data needs | You can now start understanding the data needs, such as what kinds of data are required and the potential sources for the same. List what types of data you might need for such a prediction and possible sources. | Past sales data is essential, along with data related to promotional offers, seasonal trends, price changes, customer reviews, competitor data, economic indicators, and so on. | Data scientist, data analyst | A comprehensive list of required data and potential data sources
Gathering data | You then try to identify the data sources and decide to gather the past sales data from the company's databases, promotional offers and seasonal data from the marketing team, customer reviews from the website, economic indicators from the public databases, and competitor data from industry reports. Do not underestimate the amount of effort needed to clean this data. | Collect data from the company's databases, marketing team, publicly available economic data, industry reports, etc. | Data engineer, data analyst | Collected raw data from different sources
Data cleaning | Once you gather the data, the next step is to clean the data, which may have missing values, outliers, and other inconsistencies. | Handle missing sales figures, correct inconsistencies in promotion categorization, remove outlier values in sales figures. | Data scientist, data analyst | Cleaned and standardized dataset
Exploratory Data Analysis (EDA) | Explore the data for seasonal trends and the relationships between various data elements. This will eventually help you choose the suitable ML model and guide you in preparing the data for your model. | Identify patterns in sales based on promotions, holidays, competitor activities, economic indicators, etc. | Data scientist, data analyst | Report on the findings of the EDA, including visualizations and initial insights
Feature engineering and model building | Based on EDA, brainstorm potential new features that could improve your model's performance. Prepare the data for consumption in machine learning models. | Create features like frequency of promotions, duration since last promotion, average spending per visit, etc. | Data scientist, machine learning engineer | New features added to the dataset, ML model prototypes
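One of the engineered features from the example column, duration since last promotion, can be sketched as a small helper. Returning None when no promotion precedes the sale is an illustrative modeling choice; you might instead use a sentinel value or a separate flag:

```python
from datetime import date

def days_since_last_promotion(sale_date, promotion_dates):
    """Derive a 'duration since last promotion' feature for one sale.

    Considers only promotions on or before the sale date, and measures
    the gap in days to the most recent one.
    """
    past = [p for p in promotion_dates if p <= sale_date]
    if not past:
        return None  # no prior promotion; encoding this gap is a modeling choice
    return (sale_date - max(past)).days
```

In a pipeline, this function would be applied row by row (or vectorized) to produce a new feature column.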
Benefits and Challenges of Cloud-Based Data Processing ❘ 247
NOTE While data may be available internally, one point that is often overlooked is the suitability of the data for consumption by the required algorithms.
TIP Cloud-based tools like AWS Glue can clean the data, handle missing values, engineer
features, and prepare data for the ML model.
As noted in the previous case study about the retail company, there are different data types, such as structured, unstructured, and semi-structured data. Each data type has a different purpose, format, and set of suitable models, as shown in the following table:
Structured data | Historical sales data | Helps identify sales trends and seasonality | Comes as a CSV file or a SQL database | Regression models for predicting future sales
Different ML models can use this data; for example, a regression model for predicting future sales works well with structured sales data, while a deep learning model can be used on unstructured customer review data to identify patterns in customer sentiment. A decision tree, on the other hand, may work well with semi-structured data, such as economic indicators, to assess how they impact sales.
TIP Match the data types with the most suitable models and processing techniques.
Going back to the meal analogy, the better the quality and variety of the ingredients used, the better the meal (in this case, the predictions).
This hands-on exercise can help make this concept of handling different data types clearer.
Problem Statement: A retail company is facing challenges handling multiple data types, such as structured, unstructured, and semi-structured data, for use in its ML models. Your task is to guide them to build a clean, consolidated dataset to use in their ML models.
Align datasets. | Ensure that all the structured, unstructured, and semi-structured datasets are aligned properly, and aggregate them as needed. | Data analyst, data scientist | Datasets that are aligned based on time, e.g., daily, weekly, or monthly
Check and clean datasets. | Perform a quality check for missing values, outliers, and other inconsistencies, and correct them. | Data analyst, data scientist | A high-quality dataset, free of missing values, outliers, and other inconsistencies, that is ready for machine learning model training
STAGE DESCRIPTION
Labeled data Data about which you already know the target answer. If the label is missing, the activity
involves labeling the missing data either manually or using automated means.
Data collection Depending on the use case, data can be collected from various sources, such as sensors,
time-series data, event data, IoT data, and social media.
In the retail company case study, you may need to use an API or a SQL query to store
the past sales data, pricing data, or promotional offers data in a cloud storage such as
S3.
Web scraping Use web scraping to gather the review data and handle real-time ingestion for customer
reviews, as review data is continuously trickling in. The marketing data can come as
semi-structured data such as an XML, JSON, or CSV file, which you need to ingest into
the cloud environment.
Direct downloads In the case of economic indicators data, you can use the APIs offered by the financial
institutions or the government bodies or a direct download to extract data and use
different methods to ingest data into the cloud.
Data ingestion Collected data must be ingested into storage facilities, through either real-time or batch
methods processing.
The Data Processing Phases of the ML Lifecycle ❘ 251
Data Collection
NOTE Data can be collected from various sources, such as sensors, time-series data, event data, IoT data, and social media, using APIs or SQL queries, web scraping, or direct downloads.
Data that's collected must be ingested into various storage facilities using real-time or batch processing. The following table compares the pros and cons of the two methods:
Real-time data ingestion | For real-time data ingestion, you must use a real-time processing system such as Apache Kafka or AWS Kinesis. Customer reviews, for example, can be smaller in volume but high in velocity as new reviews keep coming in; for this scenario, a real-time ingestion method is appropriate. | Pros: Handles high-velocity data; ensures data is current. | Cons: Requires specific tools and systems; not suitable for large-volume data.
Batch data processing | Processes large volumes of data at specific time intervals. The velocity with which the data comes in can vary; for example, sales data can come in large quantities but be updated once a day or weekly. In such situations, you may use a batch ingestion process. | Pros: Handles large-volume data; doesn't require immediate processing. | Cons: Data may not always be current; not suitable for high-velocity data.
NOTE Choose real-time or batch ingestion methods and tools based on the velocity and
volume of data. Unless you are positive that you are doing one-time pulls of data from a
source, consider automation as an important part of your AI/ML journey.
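The batch side of this trade-off can be sketched as a simple micro-batcher; real services such as AWS Kinesis or Apache Kafka add buffering, ordering, and delivery guarantees that this toy loop does not:

```python
def batches(records, batch_size):
    """Group an incoming record stream into fixed-size batches.

    Yields full batches as they fill up, then flushes any remainder,
    so every record is delivered exactly once.
    """
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

For example, list(batches(range(5), 2)) groups five records into batches of two plus a final batch of one.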
The following table lists the pros and cons of these options:
Transactional SQL databases | Ideal for structured data and supports real-time processing | Not ideal for handling large volumes of data.
Data lakes | Useful for storing a vast amount of raw data in its native format | Data may need further processing before it can be used.
Data warehouses | Optimized for analyzing structured data and performing complex queries | Might not be the best choice for unstructured or semi-structured data.
The following table lists the storage options from various cloud providers:
Azure Blob Storage | Azure Blob Storage can be a good choice due to its scalability and its integration with Azure Cloud services. | Pros: It is scalable and integrates well with Azure Cloud services. | Cons: It can become more expensive, integration with non-Azure services can be challenging, and it is less intuitive to use than S3 or Google Cloud Storage.
TIP When choosing a data storage option, it’s crucial to consider factors such as cost,
user-friendliness, integration with other services, in-house skill sets, tiered storage, disaster
recovery, and data type and volume.
When choosing these storage options, you should evaluate them against the following factors:
➤➤ Cost
➤➤ User-friendliness
➤➤ Integration ability with other services
➤➤ In-house and easily acquirable skills
➤➤ The ability to have tiered storage where you can store infrequently accessed data to lower cost storage
➤➤ Multiregional storage for disaster recovery
Data Preparation
Data preparation is the process of preparing the collected data for further analysis and processing. It includes data preprocessing, exploratory data analysis, data visualization, and feature engineering.
Figure 15.5 shows a data processing workflow. Note that your project may have different workflows with different steps.
➤➤ Data preprocessing: Data preprocessing is the stage where data is cleaned to remove any errors or inconsistencies.
➤➤ Exploratory data analysis: Data exploration or exploratory data analysis is about looking at what’s in
your dataset. It is part of data preparation and can be used to understand data, carry out sanity checks,
and validate the quality of the data.
➤➤ Methods used: It involves generating summary statistics, visualizing the distribution of data and
variables, and identifying relationships between variables. You can use out-of-the-box features
provided by various cloud ML platforms, data visualization tools such as Tableau and Google
Data Studio, and software libraries such as Pandas and Matplotlib in Python.
➤➤ Example: In this case study example, you can explore summary statistics such as the mean, median, and range of sales per day or product. You can create histograms to visualize this data or create scatterplots to identify relationships between sales and other variables such as price and promotional events.
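As a minimal sketch, the summary statistics in this example can be computed with Python's standard library (in practice you would more likely reach for Pandas or a cloud EDA tool):

```python
from statistics import mean, median

def summarize(values):
    """Summary statistics for a first EDA pass over a numeric series,
    such as daily sales figures."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
    }
```

For example, summarize([10, 20, 30, 100]) returns a mean of 40, a median of 25, and a range of 90.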
➤➤ Feature engineering: Feature engineering is the phase where new features are created from existing data
to improve the performance of the models.
➤➤ Data visualization: Data visualization can help with exploratory data analysis by providing visual representations of data to better understand patterns, trends, and correlations.
[Figure: Share of time spent across ML tasks: data cleansing 26%, data visualization 22%, data loading 20%, model selection 10%, deploying models 10%.]
TIP When preparing data, ensure thorough data preprocessing for cleaning, exploratory
data analysis for understanding, feature engineering for model performance enhancement,
and data visualization to discern patterns and correlations.
Data Preprocessing
Data preprocessing is about preparing the data to be in the right shape and quality for model training.
Figure 15.6 shows the different strategies available for data preprocessing.
Understanding the Data Exploration and Preprocessing Stage ❘ 255
Data Wrangling
[Figure 15.6: Data collection feeds data preparation, which comprises data wrangling, data preprocessing, EDA, data visualization, and feature engineering, including augmentation.]
Data wrangling is part of the data preparation process. It involves the process of transforming and mapping data
from one “raw” format into another to allow subsequent analysis and consumption.
Using tools such as SageMaker Data Wrangler will help you prepare data using its visual interface to access data,
perform EDA, perform feature engineering, and then seamlessly import the data into the SageMaker pipeline for
model training.
SageMaker Data Wrangler comes with about 300 built-in transforms, custom transforms using Python and PySpark, built-in analysis tools such as scatterplots and histograms, and model analysis capabilities such as feature importance, target leakage, and model explainability.
If the amount of missing data is small, you can simply remove the record. If you're using imputation techniques, you can fill the missing data with mean or median values, use the regression method, or use multiple imputation by chained equations (MICE) techniques.
Missing values | Missing values in the data can lead to inaccurate results or algorithm failure. These gaps in the dataset need to be dealt with effectively to maintain the integrity of the data analysis. | Imputation, regression method | Problem: A retail store finds missing sales data for some days in its sales tracking system. Action: They decide to fill the missing values to maintain the continuity and reliability of the data.
Handling Outliers
In the case of outliers, you have to identify if there’s an error or an extreme value. If it is an error, you try to
resolve the error, and if it is an extreme value, you can decide to keep it but try to reduce the impact of it by using
techniques such as normalization. Data normalization helps keep the data on a similar scale. See the following
table for more details:
Handling outliers: Outliers are data points that deviate significantly from other observations. They can be errors or extreme values and can lead to skewed or misleading results in data analysis.
Techniques: normalization (min-max scaling), standardization (Z-score normalization).
Example: A retail store notices that its sales quantity data can vary from hundreds to thousands, while product prices range from $1 to $50. Action: to avoid giving more weight to sales quantity because of its larger values, they normalize the data to bring it to a similar scale. Technique used: min-max scaling, which transforms the data to fit within a specified range, typically 0 to 1. Because min-max scaling remains sensitive to extreme values, they also apply standardization (Z-score normalization), which is less sensitive to outliers; it transforms the data to have a mean of 0 and a standard deviation of 1.
TIP Be sure to clean up data by handling missing values using imputation techniques like
mean or median values, regression, or MICE, and take care of outliers using normalization
techniques like min-max scaling or standardization.
Data Partitioning
It is important to split data into training, validation, and test datasets so that the models do not overfit and you can
evaluate the models accurately. You need to ensure that there is no leakage of test datasets into training datasets.
One way to achieve this is to remove duplicates before splitting.
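As a minimal sketch, the deduplicate-then-split approach can be done with pandas and scikit-learn; the column names and values here are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical customer data; customer 2 appears twice (a duplicate).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5, 6, 7, 8, 9],
    "spend": [120, 80, 80, 45, 200, 150, 90, 60, 30, 110],
})

# Remove duplicates BEFORE splitting so the same record cannot
# leak from the training set into the test set.
df = df.drop_duplicates()

# Hold out a test set first, then split the rest into training
# and validation sets.
train_val, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)
```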
Normalization
To calculate the normalized value, you take each value, subtract the minimum value, and divide by the range (maximum minus minimum). So, for a footfall value of 200 with a minimum of 100 and a range of 300, the normalized value is (200 - 100) / 300 ≈ 0.33.
You do the same for the income level and the price. By normalizing all the values, the vast difference in the absolute values is minimized, and your algorithm will give equal weight to all these features.
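The footfall calculation, assuming a minimum of 100 and a range of 300, can be written as a small sketch:

```python
def min_max_normalize(value, min_val, max_val):
    """Scale a value into the 0-1 range: (x - min) / (max - min)."""
    return (value - min_val) / (max_val - min_val)

# Footfall example from the text: minimum 100, range 300 (so maximum 400).
normalized = min_max_normalize(200, 100, 400)
print(round(normalized, 2))  # 0.33
```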
Standardization
If you didn’t scale these values, income levels would have an unfair advantage in influencing the predictions, making your results inaccurate. Note that standardization is preferred for handling outliers, as it’s less sensitive to them.
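A minimal standardization sketch using only the standard library (the income values are hypothetical):

```python
import statistics

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
z = standardize(incomes)
# The standardized values now have mean 0 and standard deviation 1.
```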
might happen if most of your data belongs to middle-income, middle-aged customers, so your model may be very
good at predicting those segments. However, it might not be suitable for lower-income, younger groups, or high-
income, older groups.
Sometimes the bias can come from the algorithm itself. For example, if it has a predisposition to predict favorably for larger segments and you’re running a discount program using this model, it might give more discounts to the middle-aged, middle-income groups.
You can mitigate bias with data-level, algorithm-level, and post-processing-level techniques, along with continuous monitoring, as shown in the following table:
Data level: Build a balanced dataset by undersampling the overrepresented groups and oversampling the underrepresented groups. Example: you could balance your dataset by including more data from lower-income, younger groups or high-income, older groups.
Algorithm level: (1) Try different algorithms. (2) Change the parameters of the chosen algorithm. (3) Use an ensemble method that combines the predictions of multiple models. Example: test different algorithms or adjust the parameters of your chosen algorithm to see if they perform better for underrepresented groups; you could also use ensemble methods to combine the predictions of several models.
Post-processing level: (1) Conduct a fairness audit of the model. (2) Apply a fairness correction to adjust the model’s output until the bias is reduced. Example: first check whether the model predicts accurately across multiple groups, then experiment with adjusting the model’s output to reduce any identified bias.
Continuous monitoring: (1) Monitor the model’s performance across all groups. (2) Make corrections to the model as needed. Example: regularly monitor the model’s performance across all customer groups and adjust as necessary; this helps ensure the model continues to make accurate predictions for all customer groups.
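The data-level technique can be sketched with pandas oversampling; the segment names and counts are hypothetical:

```python
import pandas as pd

# Hypothetical imbalanced dataset: few rows for the younger, lower-income group.
df = pd.DataFrame({
    "segment": ["middle"] * 8 + ["young_low_income"] * 2,
    "spend": [100, 120, 90, 110, 95, 105, 115, 98, 40, 35],
})

# Oversample each underrepresented group (with replacement) until
# every segment contributes the same number of rows.
target = df["segment"].value_counts().max()
balanced = pd.concat(
    [g.sample(target, replace=True, random_state=0) for _, g in df.groupby("segment")],
    ignore_index=True,
)
```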
NOTE You can detect and mitigate bias in machine learning models by using tools like SageMaker Clarify, the Azure Machine Learning Interpretability Toolkit, Fairlearn, and GCP Explainable AI.
Using SageMaker Clarify, you can monitor for bias at the data preparation stage, before training starts, and during each stage of the ML lifecycle. Clarify can also detect bias after deployment, and it provides detailed reports that quantify different types of bias along with feature importance scores that explain how the model made its predictions across the data, model, and monitoring stages.
Feature Engineering ❘ 259
Azure provides the Azure Machine Learning Interpretability Toolkit and Azure Fairlearn, and Google provides
GCP Explainable AI to detect and mitigate bias and provide explanations for predictions.
Data Augmentation
Another technique you can use is data augmentation. This involves creating new data from existing data. It is
often used in deep learning. See the following table for more details:
Creation of new data: Creating new data from the existing data. Example: in a retail setting, you could create synthetic customer reviews to augment your existing dataset.
Adding noise to data: Adding noise can improve the robustness of the model. Example: you could add noise to the sales data, allowing your model to handle a broader range of data and possibly improving its performance.
FEATURE ENGINEERING
Each unique attribute of data is a feature. Feature engineering is about selecting the right attributes for the
model to become more accurate and generalizable for new data. You choose different features based on the
type of model.
Feature engineering consists of the components shown in Figure 15.7.
[FIGURE 15.7: Feature engineering components, including feature selection and feature transformation]
Feature Types
This section explains the various feature types that you can create for your models. The following table lists these
feature types, their suitable models, and some examples:
260 ❘ CHAPTER 15 Process Data and Engineer Features in the Cloud
Binary features: Take on two possible values, like yes/no or 1/0. Suitable models: various. Example: whether or not a sales promotion was running.
Sentiment scores or topic features: Derived from text data such as customer reviews; sentiment scores or topics are extracted using NLP techniques. Suitable models: deep learning models such as RNNs and transformers. Example: sentiment scores or topics obtained from the analysis of customer reviews.
Interaction features: Represent dependencies between other features, such as the interaction between price and promotions. Suitable models: useful for any model that might miss these relationships. Example: a feature representing the effect of a promotion on a high-priced item.
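Two of these feature types, a binary promotion flag and a price-promotion interaction, can be sketched in pandas (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 45.0, 30.0],
    "promotion": ["yes", "no", "yes"],
})

# Binary feature: 1 if a sales promotion was running, else 0.
df["promo_flag"] = (df["promotion"] == "yes").astype(int)

# Interaction feature: captures the combined effect of price and promotion.
df["price_x_promo"] = df["price"] * df["promo_flag"]
```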
TIP Use domain knowledge to create features that are relevant to the problem you are trying to solve.
Feature Selection
Feature selection is the process of selecting the attributes that are most relevant to predicting the results. Consider the use case of predicting a customer’s spending habits. Assume the dataset contains features such as the customer’s age, income, location, purchase habits, number of children, job title, education level, and frequency of store visits.
When selecting features to predict the customer’s spending habits, you have to decide which features are most relevant to that prediction and which are irrelevant, duplicated, or even counterproductive.
Correlation Matrix
Another tool that can help you select features is the correlation matrix, a table showing the relationship between each feature and the target variable. A high value implies that the feature plays a greater role in predicting the target, in this case the spending habits. You can also use the correlation matrix to eliminate features that are highly correlated with each other. For example, suppose income and occupation title are highly correlated. In that case, you may retain income and remove occupation title from the dataset, because income may correlate more strongly with the target variable, namely the spending habits.
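This selection step can be sketched with pandas; the features, values, and 0.5 threshold are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30, 45, 60, 75, 90],
    "age": [22, 35, 41, 50, 63],
    "spending": [20, 32, 45, 54, 70],  # target variable
})

# Correlation of every feature with the target.
corr = df.corr()["spending"].drop("spending")

# Keep features whose absolute correlation with the target is high.
selected = corr[corr.abs() > 0.5].index.tolist()
```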
The advantage of feature selection is that you can reduce the size of the datasets by removing redundant and irrelevant features. The smaller dataset will result in less overfitting of the model, in addition to higher performance and simplicity, which makes it easier to interpret.
NOTE Use feature selection techniques like examining feature importance scores and correlation matrices to select relevant features, reduce overfitting, improve accuracy, and reduce training time. This results in higher performance and simplicity, leading to better interpretability.
Feature Extraction
Feature extraction involves techniques that combine the existing attributes to create new ones. This is done to reduce the dimensionality of the dataset while retaining most of the original information. Figure 15.8 shows some examples of those techniques.
NOTE Use feature extraction techniques like PCA, ICA, and linear discriminant analysis
(LDA) to create new features. This reduces the dimensionality of the dataset while retaining
the original information.
[FIGURE 15.8: Feature extraction techniques: principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA)]
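A minimal PCA sketch with scikit-learn, using random data in place of a real dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 original features

# Project onto the 2 directions of greatest variance,
# reducing dimensionality while retaining most of the information.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```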
Feature Creation
Feature creation helps you add new relevant variables to your dataset to improve your model’s performance. Still,
you must be careful to ensure that it does not add noise or complexity to the model. The following table explains
the different feature creation techniques with examples:
One-hot encoding: Converts categorical variables into a form that can be fed into the model for better predictions. Example: you can convert the preferred shopping day category into two binary features, prefers_weekday and prefers_weekend. Each feature takes a value of 0 or 1, depending on whether the customer prefers to shop during the weekday or the weekend.
Binning: Combines similar items into numerical or categorical bins. Example: you can create age groups such as 16 to 25, 26 to 35, 36 to 45, and so on. This increases the accuracy of the prediction by reducing the impact of minor differences between ages.
Splitting: Creates two or more features from one feature. Example: you can split an address into features such as city, state, and street, which can help you discover spending patterns based on these new features.
Calculated features: New features created by performing calculations on existing features. Example: you can create a feature such as average spending per visit by adding all the spending over the year and dividing it by the number of customer visits in that year.
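One-hot encoding and binning can be sketched in pandas; the column names and bin edges follow the examples above:

```python
import pandas as pd

df = pd.DataFrame({
    "preferred_day": ["weekday", "weekend", "weekday"],
    "age": [19, 31, 42],
})

# One-hot encoding: turn the category into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["preferred_day"])

# Binning: group ages into the ranges described in the text.
encoded["age_group"] = pd.cut(
    df["age"], bins=[15, 25, 35, 45], labels=["16-25", "26-35", "36-45"]
)
```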
TIP Evaluate the performance of your model with different sets of features.
Feature Transformation
Feature transformation helps ensure the data is complete, clean, and ready for use in a model. The following table
explains the different feature transformation techniques available with examples:
Cartesian products of features: By calculating the Cartesian product of two features, you can create a new feature with greater meaning for the end result. Example: suppose you have two features, average spending per visit and total number of visits per month. Multiplying the two gives you monthly spending, which may be more relevant to solving your problem.
Nonlinear transformations: Help you capture the nonlinear effects of data. Example: binning numerical values into categories. You can create age groups such as 16 to 25, 26 to 35, 36 to 45, and so on. This categorization helps you assess the impact of age on spending habits from a nonlinear perspective.
Domain-specific features: New features you can create based on your domain knowledge. Example: if you know that customers who buy organic-only products are high spenders, you can create a new feature named organic buyers.
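The Cartesian-product example above can be sketched in pandas (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "avg_spend_per_visit": [25.0, 40.0],
    "visits_per_month": [4, 2],
})

# Multiplying the two features yields a more meaningful one: monthly spending.
df["monthly_spend"] = df["avg_spend_per_visit"] * df["visits_per_month"]
```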
TIP Use a variety of transformation techniques, such as scaling, normalization, and nonlinear transformations. Start with a small number of techniques and add more as needed.
Feature Imputation
Feature imputation is the process of handling missing data using different imputation methods depending on the
feature type, as shown in Figure 15.9.
[FIGURE 15.9: Feature imputation methods: simple imputation, predictive imputation, and multivariate imputation]
Simple Imputation
Simple imputation involves replacing the missing values with mean, median, or mode values. For example, if you
have a feature named “number of visits per month,” you can fill in the missing values with the median number of
visits for all customers per month.
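A median-fill sketch in pandas (the visit counts are hypothetical):

```python
import pandas as pd

visits = pd.Series([4, 7, None, 5, None, 6], name="visits_per_month")

# Replace missing values with the median number of visits.
filled = visits.fillna(visits.median())
```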
Predictive Imputation
Predictive imputation involves predicting the missing values using other features. For example, you can predict
the missing value for income by training a model using other attributes such as occupation and education levels.
Multivariate Imputation
Multivariate imputation is used when there are relationships between features. You can use methods like MICE to fill in missing values by predicting each feature with missing values from the other features in a round-robin fashion. This results in multiple imputed datasets, which can then be combined to account for uncertainty in the missing data.
For example, consider a dataset with three variables: Math score, English score, and Attendance. If values for the Math and English scores are missing, instead of imputing them separately, you can use the Attendance record to predict the Math score and then use the predicted Math score and the Attendance record to predict the English score.
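scikit-learn’s IterativeImputer implements this round-robin idea (it is MICE-like rather than full MICE, since by default it returns a single imputed dataset). The scores below are hypothetical:

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns: math score, english score, attendance (with missing values).
X = np.array([
    [80.0, 75.0, 0.90],
    [np.nan, 70.0, 0.80],
    [60.0, np.nan, 0.60],
    [90.0, 88.0, 0.95],
])

# Each column with missing values is predicted from the other columns,
# cycling round-robin until the estimates stabilize.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```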
NOTE Choose the right imputation method, such as simple, predictive, or multivariate imputation, based on the type of missing data and your available resources, to reduce bias.
SUMMARY
This chapter covered a wide range of topics related to cloud-based data processing and feature engineering, essential steps in the machine learning lifecycle that are crucial for professionals in cloud engineering, data engineering, and machine learning. The detailed insights into data collection techniques, preprocessing methods, and feature engineering, combined with the hands-on exercises, should give you the skills and knowledge to start processing data. You also learned how cloud providers such as AWS, Microsoft, and Google play a vital role in making data processing scalable, secure, and cost-effective.
Beyond the technology factor, you also learned how the human element comes into play for making decisions
around data partitioning, augmentation, balancing, and handling inconsistent values.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You’ll find the answers in the following section.
1. Which of the following is true about data processing in the cloud?
A. Local servers are more scalable than cloud servers.
B. Data from multiple sources cannot be processed easily using cloud tools.
C. AWS Glue can be used for ETL and Amazon Athena for data query in S3.
D. Data stored in the cloud does not have rigorous security requirements.
2. Which of the following is an example of structured data?
A. Customer reviews
B. Economic indicators in XML or JSON files
C. Sales trends or seasonality
D. None of the above
3. Which of the following is true regarding the data processing workflow?
A. Data processing workflow includes only data collection and does not include data preparation.
B. Labeled data indicates information that has no target.
C. Exploratory data analysis is part of data preparation and can be used to understand data.
D. Data cannot be collected from sources such as sensors, time-series data, event data, or IoT data.
4. Which of the following is true about real-time and batch data ingestion?
A. Real-time data ingestion cannot handle high-velocity data.
B. Apache Kafka and AWS Kinesis are used for real-time data ingestion.
C. Batch data ingestion is the preferred method to keep data current.
D. The choice between the real-time and batch ingestion methods does not depend on the velocity
and volume of data.
13. Which of the following statements is true about data warehouses and data lakes?
A. Data warehouses hold highly structured data, while data lakes hold structured and unstructured data.
B. Data warehouses are like a farmer’s market, while data lakes are like well-organized grocery stores.
C. Data warehouses provide scalability and flexibility, while data lakes allow for higher performance
and easy data analysis.
D. Data warehouses and data lakes are interchangeable terms for storing big data.
14. What is the main purpose of a feature store in MLOps?
A. To maintain a history of model performances
B. To provide a platform for hosting machine learning models
C. To create, share, and manage features during machine learning development
D. To orchestrate and schedule data pipeline tasks
15. What is the main purpose of data version control in machine learning?
A. To store multiple versions of data, track the models with which they were used, and reproduce
experiments
B. To keep track of all the models in production and monitor their performance
C. To orchestrate and schedule data pipeline tasks
D. To provide a platform for hosting machine learning models
16. What is the purpose of data lineage tracking in machine learning?
A. To create versions of your data and model artifacts to protect from deletion
B. To track the changes the data has undergone throughout the machine learning lifecycle and identify any unintended changes
C. To enforce security by developing a plan to reduce the exposure of data and the sprawl of data
D. To protect data privacy and ensure compliance with data protection regulations
ANSWER KEY
1. C 7. C 13. A
2. C 8. B 14. C
3. C 9. D 15. A
4. B 10. D 16. B
5. C 11. D
6. B 12. C
16
Choosing Your AI/ML Algorithms
The only source of knowledge is experience.
—Albert Einstein
Data, as pivotal as it is, still requires algorithms to become actionable and give life to your AI initiatives.
Choosing the right AI/ML algorithm is like choosing the right tool for a job. This chapter covers different
machine learning algorithms and explores aspects such as how they work, when to use them, and what use
cases they can be employed for (see Figure 16.1).
[FIGURE: Enterprise AI journey map, phases 01 to 09. The Process Data and Modeling phase includes Process Data and Engineer Features in the Cloud; Choose Your AI/ML Algorithms; and Train, Tune, and Evaluate Models]
FIGURE 16.1: PROCESS DATA AND MODELING: Choose your AI/ML algorithm
Back to the Basics: What Is Artificial Intelligence? ❘ 269
Note that the choice of a machine learning algorithm is not a straightforward process, and the relationship between algorithms and use cases is many-to-many: the same use case can be served by more than one algorithm, and the same algorithm can solve many use cases and business problems.
This chapter explores different categories of machine learning, such as supervised learning, unsupervised learning, and deep learning. You’ll discover a plethora of algorithms such as linear regression, decision trees, neural networks, and more.
Along the way, I discuss factors to consider when choosing an algorithm, ensuring that you have a robust methodology to align your choice with the problem at hand. By the end of this chapter, you’ll have a firm grasp on how to pick the right algorithm.
Artificial intelligence includes a collection of technologies such as natural language processing, speech recognition, and computer vision. Figure 16.2 shows the range of technologies and applications that fall under the broad umbrella of artificial intelligence.
Here’s Gartner’s definition of AI:
Artificial intelligence (AI) applies advanced analysis and logic-based techniques, including
machine learning, to interpret events, support and automate decisions, and take action.
(Source: www.gartner.com/en/topics/artificial-intelligence)
Identify the type of problem: The first thing to do is identify the problem you need to solve. In this case, you’re trying to increase sales, which means you’re trying to identify customers who are likely to purchase from your site. This is a classification problem because you’re trying to identify who will buy and who will not.
Assess the size of the dataset: If you have a large dataset, using complex algorithms such as deep learning is possible. If the dataset is smaller, you can use simpler algorithms such as logistic regression.
Decide on the interpretability of the model: The algorithm choice also depends on whether the model needs to be interpretable, that is, whether you must explain how the model came up with its prediction. If you have to explain how the model predicts the outcome, you cannot use complex algorithms such as deep learning.
Decide on the performance metrics: It is also vital to decide on the metrics you’ll use to evaluate the model’s performance. In this case, you can use accuracy, precision, and recall to evaluate how well the model predicts which customers will likely make a purchase.
Plan for scalability: It also helps to plan for the scalability of the model. In this case, given that you have a large dataset, it may be a good idea to plan for a distributed machine learning framework such as Apache Spark.
TIP When choosing a machine learning algorithm, carefully consider the problem type,
dataset size, interpretability requirements, appropriate performance metrics, and the need for
scalability to ensure an effective and efficient solution.
This is a strategic hands-on exercise focused on choosing the right machine learning algorithm for your project’s use case. It is a critical step because choosing the right algorithm determines the accuracy of the model and therefore drives the outcome of your overall project.
It will help you think critically about the various phases of machine learning, such as identifying a problem, assessing data, interpreting models, measuring performance, and considering scalability requirements. This exercise mirrors the thought process that goes into planning a data analysis or modeling project.
Goal: Assume that you have been hired as a data scientist for an e-commerce company that
has trusted you with the responsibility of leveraging the customer data to gain a competitive
advantage to address the stiff competition the company has been facing. The marketing team
has told you that they would like to predict which customers are likely to purchase next month
so they can plan their marketing campaigns accordingly. They have given you a dataset that
contains the customers’ past purchases as well as other demographic information such as age,
location, browsing behavior, and so on. Now your task is to identify a suitable algorithm to
solve the problem.
Identify the type of problem: Decide what type of machine learning problem this is; in this case, it is a classification problem because you are trying to identify the customers who are likely to buy versus those who may not.
Assess the size of the dataset: If the dataset is very large, you may need a deep learning algorithm; otherwise, a simpler algorithm would suffice.
Check whether the model needs to be interpretable: If the model needs to be interpretable, you will need a simpler model.
Check the scalability requirements: If the dataset is large and likely to grow, you should explore machine learning algorithms that are compatible with distributed processing, such as those offered by Apache Spark.
Write a report on the finalized algorithm: Prepare a report covering the type of algorithm, how it meets the requirements, the challenges and limitations of the approach, and other alternatives if available.
NOTE An essential characteristic of machine learning is that models keep improving over time: they learn from existing data and make better predictions as they see more data.
Data-Driven Predictions Using Machine Learning ❘ 273
The next few sections discuss the different algorithms shown in Table 16.1. The main point of this discussion of different applications is for you to understand these different types and then go back to your organization and apply them to your own problems.
TIP When using supervised learning, ensure you have a large, labeled, well-formatted, and
clean dataset for accurate training and prediction.
[Figure: The three steps of supervised learning: labeling, training, and prediction]
➤➤ Labeling: A human labels the input data by identifying the input and output variables. In the case of
identifying spam, they would identify the input fields such as subject, to, and from fields and define the
output variable as spam or not spam.
➤➤ Training: The algorithm is then trained on the labeled input data to understand the patterns between the
input variables and the output.
➤➤ Prediction: Once the training is complete and when you determine that the algorithm is pretty accurate,
the algorithm is applied to new data to predict the outcome. See Figure 16.5.
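The three steps can be sketched end to end with scikit-learn; the messages and labels below are an invented toy dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1, labeling: a human has labeled each message as spam (1) or not (0).
messages = [
    "win a free prize now", "limited offer click here",
    "meeting at noon tomorrow", "please review the attached report",
]
labels = [1, 1, 0, 0]

# Step 2, training: learn patterns between the input text and the labels.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Step 3, prediction: apply the trained model to new, unseen data.
prediction = model.predict(vectorizer.transform(["free prize offer"]))[0]
```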
Classification
Within supervised learning, you have two types of algorithms, classification and regression (see Figure 16.6). In
the case of classification, you are trying to predict whether the outcome belongs to a particular type of category.
In contrast, you’re trying to predict a continuous value in the case of regression.
[Figure: a model trained on pictures of animals labeled by humans]
[FIGURE 16.6: Supervised learning taxonomy: classification (binary, multiclass, multilabel, imbalanced, KNN) and regression (linear regression, ridge regression, lasso regression)]
Another example of a classification problem involves an Amazon fulfillment center. There are millions of boxes, each holding hundreds of items, and some items may be missing or damaged. You can use machine learning to identify which boxes are damaged and then send warehouse representatives to address the issue. Note that classification can be further subdivided into multilabel classification, imbalanced classification, ensemble methods, support vector machines, KNN, and so on. Logistic regression, SVMs, and decision trees are commonly used for binary classification, while neural networks, random forests, and multinomial naïve Bayes are commonly used for multiclass classification.
Regression Problem
This section looks at the example of a regression problem that focuses on predicting a continuous value. One
such example is forecasting the demand for a product based on past historical time-series data. In this case, the
output of the regression problem is the number of products the customers will buy. You can even use regression to
predict a product’s price based on its historical data and other factors. Regression can be further subdivided into
linear regression, multiple linear regression, polynomial regression, ridge regression, lasso regression, and so on.
Energy consumption analysis: Goal: implement energy-saving measures and optimize resources. Industry: sustainability. Data: energy data, building data. Output: energy consumption analysis.
Other factors can impact the price of the car, such as the car’s condition, make and model, local market condi-
tions, competition, advertisement, and so on. Although this is a trivial example, it serves to help you understand
how linear regression works. Note that you will be using linear regression algorithms provided by different
frameworks in the real world.
Linear regression is a commonly used tool in various fields and industries, including finance, manufacturing,
statistics, economics, psychology, and so on. It can be used to optimize price points and estimate product-price
elasticities.
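A minimal scikit-learn sketch of the car-price idea; the mileage and price figures are invented and perfectly linear for clarity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical used-car data: mileage (thousands of miles) vs. price ($).
mileage = np.array([[20], [40], [60], [80], [100]])
price = np.array([22000, 18000, 14000, 10000, 6000])

model = LinearRegression().fit(mileage, price)

# Predict the price of a car with 50,000 miles on it.
predicted = model.predict([[50]])[0]
```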
Simplicity of relationships: Linear regression may not be suited to capturing complex relationships between the variables. Alternative: neural networks or other complex algorithms.
TIP When using linear regression, ensure a linear relationship between variables, avoid multicollinearity, and limit the number of independent variables. For complex relationships, consider advanced models such as neural networks.
Categorical values represent different categories or groups and are represented by labels. They can be further
divided into nominal or ordinal types.
Consider an example where you want to predict whether a customer will churn. In this case, you collect all the
data that will contribute to this event, such as how many products they bought, how often they bought, how
much they spent, and how often they canceled their subscription. You should be able to get all this information
from a customer loyalty program.
Once you train the logistic regression model using this data, the software will fit a logistic regression curve to the data, and you can use this curve to understand the relationship between the various independent variables and the probability that the customer will churn, as shown in Figure 16.8.
FIGURE 16.8: Example logistic regression graph showing the probability of customers churning based on the
number of days not using the service
This algorithm is useful in many use cases in healthcare, finance, and marketing, as shown in Table 16.3. Some of
these use cases are whether a customer will default on a loan, whether a prospect will buy a car, and whether a
customer will churn.
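A sketch of the churn example with scikit-learn; the inactivity data and labels are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: days of inactive usage vs. whether the customer churned.
days_inactive = np.array([[1], [1], [2], [2], [3], [3], [4], [4], [5], [5]])
churned = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(days_inactive, churned)

# Estimated probability of churn for a customer inactive for 5 days.
p_churn = model.predict_proba([[5]])[0, 1]
```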
Email spam filtering: Goal: reduce spam and increase email deliverability. Industries: technology, email, security. Data: email data and metadata. Output: spam/not spam.
Sentiment analysis: Goal: make data-driven decisions. Industries: social media, marketing. Data: text data. Output: sentiment label.
Classification problems: When the outcome variable is binary, the outcome can be yes or no, true or false, and so on; in other words, it’s a classification problem.
Large datasets: When the dataset is small, the model is likely to overfit. Logistic regression is ideal for large datasets, which lend computational efficiency and more statistical power for accurate predictions.
Data-Driven Predictions Using Machine Learning ❘ 281
Interpretability | When you want the model to be interpretable, meaning you want to understand how the outcome is predicted based on the inputs.
Efficiency | When you want an efficient algorithm that can perform fast with fewer computational resources.
TIP Overfitting means the model has almost memorized the outcomes and cannot be generalized to new data. It performs poorly on new data even though it works perfectly on training data.
Proactive maintenance | Equipment maintenance scheduling, reduced downtime, optimized equipment performance | Manufacturing, IoT | Sensor data, maintenance records | Equipment maintenance needs
This section covers the situations when decision trees can be adopted.
Both numerical and categorical data | Decision trees can handle both numerical and categorical data types. | A dataset that contains both numerical values like age and income and categorical values like gender and nationality
Handling missing data | Decision trees can handle missing data by assigning probabilities based on the available data. | Imputing missing data based on the relationships in available data
Large datasets | Decision trees are efficient for large datasets. | Processing datasets with high volume
Ensemble models | Decision trees can be combined with other algorithms to build ensemble models. | Building a random forest or gradient boosting model
Better performance | Decision trees can be tuned to develop high-performing models. | Improving the prediction accuracy or reducing error rates
[Decision tree: the root splits on gender; the female leaf is survived (0.73; 36% of observations); the male branch splits on age and then on number of siblings, ending in leaves died (0.17; 61%), died (0.02; 2%), and survived (0.89; 2%)]
FIGURE 16.9: Decision tree that shows the survival of passengers on the Titanic. Figures under the leaves
show the probability of survival and the percentage of observations in the leaf.
Source: Wikipedia
TIP Consider decision trees for large, mixed-type datasets that need to be well understood by stakeholders and for handling missing data; use ensemble models for enhanced performance.
The goal of this exercise is to build a decision tree model to predict whether a loan applicant
will default.
Prerequisites: This is a high-level overview, but to actually execute this exercise, you should be
familiar with Python, pandas, and the scikit-learn library.
Access to a loan dataset that contains data such as applicant name, age, income, and so on,
and whether they defaulted.
Dataset preparation | Data analyst/scientist | Load the loan dataset into a pandas dataframe. Clean and format the data as needed. | A cleaned and formatted dataset
Splitting the data | Data analyst/scientist | Divide the dataset into a training set and a testing set. | A training set and a testing set
Training the model | Data analyst/scientist | Train the selected model using the training dataset. | A trained decision tree model
Model evaluation | Data analyst/scientist | Evaluate the model using the test dataset. Use relevant metrics such as accuracy, precision, recall, F1 score, or AUC-ROC. | Evaluation results
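The steps in the preceding table can be sketched end to end as follows. This is a hedged outline only: the loan data here is synthetic (a real project would load an actual loan dataset), and it uses pandas and scikit-learn as stated in the prerequisites:

```python
# Sketch of the exercise: prepare, split, train, and evaluate a decision tree.
# The toy loan dataset below is invented; replace it with a real dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Dataset preparation: load into a pandas dataframe (here, built in memory)
df = pd.DataFrame({
    "age":       [25, 40, 35, 50, 23, 44, 31, 60, 28, 48] * 5,
    "income":    [30, 80, 60, 90, 25, 85, 55, 95, 35, 88] * 5,
    "defaulted": [1, 0, 0, 0, 1, 0, 1, 0, 1, 0] * 5,
})

# Splitting the data into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "income"]], df["defaulted"], test_size=0.3, random_state=42)

# Training the model
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Model evaluation (accuracy; precision, recall, F1, or AUC-ROC work the same way)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")
```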
[Figure: an instance is passed to each tree in the random forest, and majority voting over the trees' outputs produces the final class]
The methodology and benefits of using Random Forest are outlined in the following table:
Training | You break down your dataset into a training and a test dataset, and then use the training data to build multiple decision trees, each using a random subset of data and features. | By doing this, you eliminate some of the drawbacks of a single decision tree, such as overfitting and bias. The final decision is arrived at by combining the outcomes of the individual trees.
Testing | To evaluate the model's performance, you then test the model against the test dataset and compare the predictions to the actual outcomes using metrics such as accuracy, precision, recall, and F1 score. | Helps evaluate the performance of the model and improve its accuracy.
Accuracy | When you need accurate predictions, this method is good because it aggregates multiple decision trees, which is more accurate than a single tree. | Essential for making reliable predictions and minimizing errors.
Robustness | When you need robustness in the model predictions, this method is good because it uses multiple decision trees and avoids issues such as overfitting. | Enhances the stability of the model and its ability to handle a variety of data inputs.
Random forests are especially suited for customer churn prediction, fraud detection, image classification, recommendation systems, predictive maintenance, demand forecasting, and credit risk assessment.
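As a hedged illustration of the bagging-and-voting idea described above, here is a small random forest on synthetic data (the features and labels are made up):

```python
# Sketch: a random forest on synthetic data; not a real business dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))            # four synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends on two of them

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample of the rows and a random
# subset of the features; the forest takes a majority vote when predicting.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)
print(f"Test accuracy: {score:.2f}")
```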
TIP Random forests are helpful for both classification and regression types of problems
because they use multiple decision trees to enhance the stability of the model and its ability
to handle a variety of data inputs.
Building the classifier | You're building a classifier that identifies different types of fruits from several features, such as color, texture, size, and shape. | Helps in categorizing data based on independent features. Note: The algorithm is called naïve because in the real world features are usually not independent.
Creating a base model | Calculate the probability of the different labels based on the training dataset. | The base model serves as the initial representation of the problem, which can then be improved through further training and adjustments.
Prediction | You then run new data through this model to make predictions. | Applying the model to new data allows you to predict outcomes based on the relationships learned from the training data.
The Naïve Bayes algorithm is a powerful tool that can be used to identify spam, classify text and images, categorize documents, analyze customer reviews, classify news articles, detect fraud, diagnose medical issues, perform sentiment analysis, and perform facial recognition.
The following table captures the conditions under which the Naïve Bayes theorem is recommended:
Text data | Naïve Bayes works well when the data is text. | Useful for sentiment analysis, topic classification, and spam detection
Categorical data | Naïve Bayes works well when the data is categorical. | Useful when dealing with colors, demographic information, product classification, etc.
High dimensionality | Performs well on large datasets with many features. | Applicable in scenarios with many input variables or features
Real-time inference | Naïve Bayes is a simple and fast algorithm, suitable for real-time applications. | Beneficial for applications requiring real-time predictions or decisions
Limited labeled data | Naïve Bayes works well with limited labeling in the dataset. | Useful in situations where there's limited labeled data for training
TIP Consider using Naïve Bayes for classification tasks involving text or categorical data,
especially in high-dimensional spaces. Its simplicity and efficiency make it a strong candidate
for real-time prediction tasks.
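For instance, a minimal spam-filtering sketch with scikit-learn might look like the following. The tiny email corpus is invented for illustration, and a real filter would train on thousands of labeled messages:

```python
# Sketch: Naive Bayes spam filtering on an invented six-email corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now", "claim your free money",
    "meeting agenda for monday", "project status report attached",
    "free cash win now", "lunch meeting moved to noon",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # word counts per email

model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen email
new_email = vectorizer.transform(["win free money now"])
print(model.predict(new_email)[0])     # words seen mostly in spam messages
```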
the model would add a third feature, such as the frequency of certain words in the email, and transform the
two-dimensional data into three dimensions.
SVM can also choose among several other types of kernel functions, such as linear, polynomial, and radial basis functions.
The following table captures the conditions under which support vector machines can be adopted:
Linear and nonlinear data | Support vector machines are helpful for classification use cases and, more importantly, for both linear and nonlinear data, thanks to the use of kernel functions. | Suitable for various classification problems, regardless of linearity
High-dimensional data | SVMs can be helpful for high-dimensional (many-feature) use cases with large amounts of data. | Useful when dealing with many features or large amounts of data
Limited data | SVMs work well with limited data, avoiding overfitting and improving generalization. | Ideal for scenarios with limited training data
Complex classifications | Because SVMs can handle nonlinear use cases well, they are helpful for complex classification scenarios. | Suitable for complex or intricate classification tasks
Outliers and noise handling | SVMs are efficient at handling outliers and noise in the data. | Good fit for datasets that may contain noise or outliers
Table 16.5 lists some of the use cases suited to support vector machines.
SVMs can also be used for handwriting recognition, fraud detection, customer sentiment analysis, spam email
detection, medical diagnosis, and market trend prediction.
TIP Consider using SVMs for classification tasks involving both linear and nonlinear data. They are especially useful when dealing with high-dimensional or limited datasets and are capable of handling complex classifications and outliers effectively.
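As a sketch of the kernel idea discussed above, the snippet below uses an SVM with a radial basis function (RBF) kernel to separate a nonlinear, circular class boundary; the data is synthetic:

```python
# Sketch: an RBF-kernel SVM on a nonlinear, two-class problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # circular decision boundary

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where the circular boundary becomes separable, as described in the text.
model = SVC(kernel="rbf", C=1.0)
model.fit(X, y)
train_acc = model.score(X, y)
print(f"Training accuracy: {train_acc:.2f}")
```

A linear kernel would fail badly here, which is exactly why kernel choice matters for nonlinear data.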
Large and complex datasets | Particularly useful for large and complex enterprise-level datasets. | Useful when dealing with a complex dataset with many variables
High accuracy | Useful for high accuracy requirements. | Ideal when high precision is essential
Complex feature engineering | Can handle various input features, including categorical and numerical features. | Applicable in scenarios that require intricate feature engineering
Noise and outliers | Since this is an ensemble model, it can handle complex scenarios, making it more robust to noise and outliers. | Good fit for datasets that may contain noise or outliers
Real-time inferences | Ensemble models can be trained very quickly and therefore are ideal for time-sensitive applications and real-time inferences. | Beneficial in scenarios that require real-time predictions
TIP Consider using gradient boosting for complex, large datasets requiring high accuracy
because it corrects previous model errors, handles intricate features, noise and outliers, and
helps with real-time inferences.
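A hedged sketch with scikit-learn's GradientBoostingClassifier on synthetic data illustrates the idea of correcting previous model errors: each new tree is fit to the mistakes of the ensemble built so far.

```python
# Sketch: gradient boosting on synthetic, nonlinear data; not production code.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# An XOR-like target: nonlinear, so a single simple model struggles with it
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially; each one fits the residual errors so far
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 random_state=0)
gbm.fit(X_train, y_train)
acc = gbm.score(X_test, y_test)
print(f"Test accuracy: {acc:.2f}")
```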
Small and medium-sized datasets | Works best with small to medium-sized datasets; larger datasets can become computationally intensive. | Not recommended for very large datasets due to high computational cost | Decision tree, random forest, or gradient boosting for large datasets
Classification and regression | Suitable only for problems that involve classification or regression. | Not suitable for problems that don't involve classification or regression tasks | Deep learning or reinforcement learning for complex tasks
Not for real-time inference | Suitable for situations where you can train offline and deploy online. If real-time online performance is essential, other models should be explored. | Not ideal for situations requiring real-time predictions | Random forests or gradient boosting machines (GBMs) for real-time predictions
Explainability | A good model when explainability is required, as predictions can be easily traced. | Helpful when it's necessary to explain the decision-making process | Decision tree or logistic regression for explainability
TIP Use KNN in classification or regression tasks for small to medium-sized, structured, and labeled datasets. It is essential to select an appropriate value for K to balance noise sensitivity against including dissimilar data points.
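A short sketch using the small, labeled iris dataset that ships with scikit-learn shows the pattern; K=5 here is an illustrative choice, not a universal one:

```python
# Sketch: KNN on a small, structured, labeled dataset (iris).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K=5 balances noise sensitivity (K too small) against pulling in
# dissimilar data points (K too large)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print(f"Test accuracy: {acc:.2f}")
```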
Large data | Suitable for very large amounts of data | Data-intensive fields like genomic research or social network analysis
Large-scale cloud AI | Well suited for large-scale enterprise cloud AI implementations | Allows workload distribution across multiple servers and GPUs for speedier training and inference
Refer to the section “Deep Learning” in this chapter for more details.
TIP Consider using neural networks for large datasets, complex pattern analysis, predictive
modeling, real-time CNN tasks, or large-scale cloud AI deployments.
Clustering
An excellent example of a clustering application is when you have a large amount of customer data. You use this algorithm to break the data into multiple segments based on buying patterns and other attributes. You could then analyze those different segments, identify one of them as college students or moms, and so on, and then tailor your products and services to the customer segments you are marketing to.
FIGURE 16.13: How unsupervised learning works (original data versus clustered data)
Another example of a clustering application is topic modeling. For example, you can feed the content from multiple books into this algorithm, generate topics, and classify those books into different categories.
Anomaly Detection
Anomaly detection is used to detect anomalies in data; in other words, it looks for patterns that do not conform to expected behavior. It is of great use in fraud detection of customer transactions, such as an unusually large number of purchases in a foreign country, or in spotting abnormal patient readings that indicate a potentially serious health condition. The following table summarizes the different types of unsupervised learning algorithms:
TIP Use unsupervised learning when dealing with unlabeled data to discover hidden pat-
terns. Clustering helps with grouping similar data, while anomaly detection uncovers outliers
useful in fraud detection or health diagnostics.
[Figure: a 3-D scatter plot of k-means clusters along Dim. 1, Dim. 2, and Dim. 3, with the data points grouped into Cluster 1, Cluster 2, and Cluster 3]
The algorithm starts by randomly assigning K cluster centers and then attaching each remaining data point to the nearest cluster center. The cluster centers are then recomputed by calculating the mean of all the data points attached to each center. The process is repeated until the cluster centers no longer change.
K-means clustering is helpful for image segmentation, customer segmentation, fraud detection, and anomaly
detection.
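The loop just described can be sketched directly in NumPy. The two "customer" blobs below are synthetic, and the guard against empty clusters is a practical detail not spelled out in the text:

```python
# Sketch of the k-means loop described above, in plain NumPy.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start by randomly picking K data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Attach each point to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its attached points
        # (keeping the old center if a cluster happens to be empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # stop when centers settle
            break
        centers = new_centers
    return centers, labels

# Two obvious blobs of synthetic "customer" data
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```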
These are the conditions under which K-means clustering can be adopted:
Large data | You have a large amount of data to be processed. | Analyzing massive user datasets in social media platforms
Pattern analysis | When you want to identify patterns in the data. | Identifying common buying patterns in retail consumer data
Segmentation use cases | There is a need for segmentation of customers, products, or any other entity. | Segmenting customers into distinct groups based on purchasing behavior
Uncorrelated data | Useful when the data is not highly correlated. | Clustering unrelated blog posts based on topics
Unknown number of clusters | Useful when the number of clusters is not known in advance. | Organizing news articles into categories without pre-established groups
TIP Use K-means clustering for large, unlabeled datasets where the number of clus-
ters is unknown and the data is not highly correlated. It’s useful for tasks like customer
segmentation and process optimization.
NOTE Given that PCA doesn't use labels, it is considered an unsupervised learning technique.
Say you have product data packed with a lot of information, such as sales history and specifications of the product (its color, size, shape, and so on). Each row represents a product, and each column is an attribute of the product. However, the dataset is so large that making any sense of the data is becoming difficult. Follow these steps to use PCA:
Calculate the covariance matrix. | The first step is to use PCA to calculate the covariance matrix of the data. This matrix tells you how correlated the different variables are: highly correlated variables have a high covariance value, and negatively correlated variables show a negative covariance value. | For a dataset with variables like age, income, and purchasing history, the covariance matrix reflects how these variables correlate with each other.
Identify the principal components. | Next, you use the covariance matrix to identify the principal components in the data. These components are linear combinations of variables that capture the significant patterns in the data. The first principal component is the one that explains the most variance in the data, followed by the second. | In a customer dataset, the principal components might be combinations of variables that explain the most variability, such as a combination of age and income.
Retain the top N components. | After identifying all the principal components, you can choose to retain the top N components and discard the rest, thus reducing the number of dimensions. Suppose the analysis shows that the top three principal components account for 90 percent of the data variance, as shown in Figure 16.15; you can then retain those three dimensions instead of the ten or more you had. | If the top three principal components account for 90 percent of the data variance, these can be retained while the rest are discarded, simplifying the data from 10 dimensions to 3.
[Figure: scree plot of eigenvalues against component number, with the point of inflection marking where the curve bends]
FIGURE 16.15: Interpreting the PCA: The start of the bend indicates that three factors should be retained.
NOTE A scree plot is a line plot of the eigenvalues of factors or principal components, used in PCA to identify the significant factors.
Dimensionality reduction | Used to reduce the dimensions in the data, thus helping to simplify complex datasets with numerous variables
Pattern analysis | Helps to reveal underlying patterns in the data by focusing on the principal components that explain the most variance
Dataset compression | Used to compress a dataset by retaining only the most informative dimensions, which can be beneficial for saving storage and improving computational efficiency
Noise reduction | Removes noise from a dataset by focusing on the components that explain the most variance and ignoring components associated with noise
TIP Use PCA to simplify high-dimensional data; it is beneficial for noise reduction, data compression, and improving cluster visualization.
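The steps above can be sketched with scikit-learn's PCA. The synthetic "product" data below is generated from three underlying factors, so, by construction, the top three components should dominate the variance:

```python
# Sketch: PCA on synthetic 10-column data driven by 3 hidden factors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))            # three underlying factors
X = base @ rng.normal(size=(3, 10))         # spread across 10 correlated columns
X += rng.normal(scale=0.01, size=X.shape)   # a little noise

pca = PCA(n_components=10)
pca.fit(X)

# Fraction of the total variance each principal component explains
print(pca.explained_variance_ratio_.round(3))

# As in the Figure 16.15 example, the top components capture almost
# everything, so the rest can be discarded
top3 = pca.explained_variance_ratio_[:3].sum()
print(f"Variance explained by top 3 components: {top3:.0%}")
```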
Large dimensionality | When the data has many dimensions and you want to save computational resources by reducing the number of dimensions.
Identifying essential features | SVD can be used to identify the most essential features in a dataset.
Collaborative filtering | SVD can be used for product recommendations, such as collaborative filtering, to identify which products users are likely to purchase based on their historical behavior.
Image and signal processing | In image and signal processing tasks, SVD can be employed to manipulate and analyze data effectively. This includes tasks like image compression, noise reduction, and signal enhancement.
TIP Utilize SVD for efficient dimension reduction and feature extraction in large datasets,
making it useful for recommendation systems and image processing tasks.
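A minimal NumPy sketch shows the dimension-reduction idea on a synthetic, low-rank "users by products" ratings matrix; the matrix is invented for illustration:

```python
# Sketch: truncated SVD on a synthetic ratings matrix.
import numpy as np

rng = np.random.default_rng(0)
# A rank-2 "users x products" ratings matrix plus a little noise
ratings = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 8))
ratings += rng.normal(scale=0.01, size=ratings.shape)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the top-2 singular values: a compact approximation of the matrix
k = 2
approx = U[:, :k] * s[:k] @ Vt[:k]
error = np.linalg.norm(ratings - approx) / np.linalg.norm(ratings)
print(f"Relative reconstruction error with rank {k}: {error:.4f}")
```

Because the underlying matrix is (nearly) rank 2, two singular values reconstruct it almost perfectly; this is the same mechanism collaborative-filtering recommenders exploit.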
Using Autoencoders
An autoencoder is a type of artificial neural network that is primarily used in unsupervised learning and for dimensionality reduction. It solves many problems, such as facial recognition, feature detection, and anomaly detection; it can also learn word representations and be used in generative models to create new data that is like the input training data.
NOTE Autoencoders are versatile tools that can be used for both supervised and unsupervised learning.
The autoencoder is focused on creating the most compact representation of the input data, effectively reducing
the dimensionality of the data.
The autoencoder has an input layer that's connected to an encoder, which is a series of hidden layers that reduce the dimensionality of the input data. The autoencoder contains two functions. The first is an encoding function that transforms the input data into a lower-dimensional latent space representation, also known as an encoding.
Figure 16.16 shows how an autoencoder works.
[Figure: the input X passes through the encoder to produce the code; the decoder reconstructs the output X' from the code]
The encoded data is then passed to a decoder, which is a series of hidden layers. The decoder layer uses a decoder
function that re-creates the input data from the encoded representation. The output layer is compared to the input
layer and the error is used to train the autoencoder.
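A tiny linear autoencoder in plain NumPy illustrates the encode/decode/reconstruction-error loop just described. Real autoencoders use deep, nonlinear layers in a framework such as TensorFlow or PyTorch, so treat this purely as a sketch:

```python
# Sketch: a linear autoencoder trained by gradient descent in NumPy.
import numpy as np

rng = np.random.default_rng(0)
# 6-dimensional data that actually lives on a 2-dimensional plane
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6))

W_enc = rng.normal(scale=0.3, size=(6, 2))  # encoder: 6 -> 2 (the "code")
W_dec = rng.normal(scale=0.3, size=(2, 6))  # decoder: 2 -> 6

lr = 0.05
for _ in range(2000):
    code = X @ W_enc                  # encoding function
    X_hat = code @ W_dec              # decoding function
    err = X_hat - X                   # output compared to input
    # The reconstruction error drives the weight updates (training)
    W_dec -= lr * code.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

loss = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
print(f"Reconstruction MSE: {loss:.4f}")
```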
Autoencoders can be used for the use cases listed in Table 16.6.
Hierarchical clustering is a bottom-up approach: it treats each data point as its own cluster and then merges clusters step by step to form a hierarchical structure. It uses different linkage techniques, such as single linkage, complete linkage, and average linkage.
Hierarchical clustering is used in various domains, including biology, social sciences, image analysis, market
segmentation, customer segmentation, and so on. It can be used to cluster customers into micro-segmented groups
using various criteria, such as customer loyalty and social media data.
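A brief sketch with scikit-learn's agglomerative (bottom-up) clustering illustrates this. The two customer micro-segments below are synthetic, and average linkage is one of the linkage options mentioned above:

```python
# Sketch: agglomerative clustering of two synthetic customer segments.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two synthetic micro-segments of customers (e.g., spend, visits)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(4, 0.5, size=(20, 2))])

# "average" linkage merges clusters by average pairwise distance;
# "single" and "complete" linkage are the other techniques named above
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)
```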
Content-based filtering is based on the content of the item itself. For example, if you liked a book, the system may recommend books similar to the one you liked.
Recommender systems are used in the following areas:
Product recommendation | To suggest products to users based on their past purchase and browsing history | Amazon
Social media | To suggest friends, groups, or posts based on a user's social interactions | Facebook
News personalization | To recommend news articles based on a user's reading history and topic preferences | Google News
Travel recommendations | To suggest hotels, flights, and other travel arrangements based on a user's past search and travel history | Expedia
Personalized patient care | To suggest treatments, medications, and other healthcare services based on a patient's medical history | IBM Watson
TIP To deliver relevant recommendations at scale, leverage recommender systems that apply
collaborative filtering or content-based algorithms.
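As a sketch of content-based filtering, the snippet below recommends the book whose description is most similar to one the user liked, using TF-IDF and cosine similarity; the titles and descriptions are invented:

```python
# Sketch: content-based book recommendation with TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

books = {
    "Space Odyssey":   "science fiction space travel aliens",
    "Galactic Wars":   "space battles aliens science fiction",
    "Baking Basics":   "bread cakes baking recipes kitchen",
    "Pastry Handbook": "pastry baking desserts recipes",
}
titles = list(books)
tfidf = TfidfVectorizer().fit_transform(books.values())
sim = cosine_similarity(tfidf)       # item-to-item similarity matrix

# Recommend the book most similar to the one the user liked
liked = titles.index("Space Odyssey")
scores = sim[liked].copy()
scores[liked] = -1                   # exclude the liked book itself
print("Recommended:", titles[scores.argmax()])
```

Collaborative filtering works the same way structurally, but the similarity is computed over user behavior instead of item content.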
Customer segmentation | To segment customers based on their purchase history, demographics, and other factors to target marketing campaigns effectively
Risk management | To assess the risk of a customer defaulting on a loan or failing to meet a contract
Enterprise AI | From an enterprise AI perspective, you can use manifold learning for dimensionality reduction, data visualization, feature extraction, and anomaly detection
[Figure: the agent takes an action in the environment; an interpreter translates the result into a state and a reward, which are fed back to the agent]
FIGURE 16.18: Reinforcement learning is when an agent takes action in an environment, which leads to a
reward and is fed back to the agent.
Robotics | RL can be used in robotics applications, such as picking and placing objects in a warehouse, navigating through cluttered environments, and even playing sports.
Finance | Can be used in finance to develop algorithms that predict stock prices and automatically place trades.
Customer service | Used to create chatbots that place orders and troubleshoot problems.
TIP For problems with limited training data or undefined end states, consider a reinforcement learning approach, where algorithms learn through trial-and-error interactions.
Let’s assume you have a robot that needs to be programmed to pick and place objects in
various locations.
Here are the high-level steps you can follow. Actual details are beyond the scope of this book, but you can find additional resources from OpenAI Gym, GitHub, and so on.
Step 4 | Train the reinforcement learning algorithm using one of the available methods.
Step 5 | Test the reinforcement learning algorithm by giving the agent a new set of tasks and checking its performance.
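To make the trial-and-error idea concrete, here is a pure-Python sketch of tabular Q-learning on a made-up one-dimensional "move to the goal" task, a heavily simplified stand-in for the robot exercise above:

```python
# Sketch: tabular Q-learning on a tiny 1-D task (states 0..5, goal at 5).
import random

N_STATES, GOAL = 6, 5
ACTIONS = (-1, 1)                    # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2
random.seed(0)

def choose_action(s):
    # Epsilon-greedy, with random tie-breaking while Q-values are equal
    if random.random() < eps or Q[(s, -1)] == Q[(s, 1)]:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda act: Q[(s, act)])

for episode in range(300):
    s = 0
    while s != GOAL:
        a = choose_action(s)
        s2 = min(max(s + a, 0), N_STATES - 1)      # environment: move in bounds
        r = 1.0 if s2 == GOAL else 0.0             # reward only at the goal
        target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # Q-learning update
        s = s2

# After training, the greedy policy should move right toward the goal
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)}
print(policy)
```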
Deep Learning
Deep learning is a subset of machine learning, and it uses the concept of neural networks, which mimic how the
human brain works. It’s a way of training computers to learn and improve independently without being explicitly
programmed. Just like our brains learn from experience, these deep learning models learn by training on large
amounts of data. Figure 16.19 shows how deep learning works and how it’s different from traditional and other
machine learning algorithms.
[Figure: traditional programming combines input data with software code to produce predictions on new data; machine learning trains a model from input data plus manually developed features; deep learning trains a model from input data with machine-developed features]
FIGURE 16.19: Deep learning is different from machine learning and traditional programming.
Neural networks consist of layers of interconnected nodes, where each node processes the data and sends it to
the next layer. This helps the neural network represent the data and generalize the patterns in that data. These
multiple layers allow these deep learning models to learn more complex data representations.
This ability to learn complex data makes deep learning models good candidates for speech and image recognition, natural language processing, and gaming use cases.
TIP One advantage of deep learning is that you do not have to do manual feature extraction
because the system can do that on its own.
From an enterprise AI perspective, deep learning can improve the quality of customer service interactions because
of its ability to improve the accuracy of speech and image recognition. Deep learning can also be used to detect
anomalies in financial transactions and fraud detections and improve supply chain operations by predicting
demand and optimizing inventory levels.
Deep learning can process a wider range of data resources, requires less preprocessing from humans, and produces more accurate results than traditional machine learning approaches.
Implementing deep learning in an enterprise does come with its own nuances.
TIP Because of the intensive nature of neural networks, you need robust data storage, processing, and analytics infrastructure, as well as specialized hardware such as GPUs, for proper training and inference.
You also need strong data scientists and engineers who can design, build, and train deep learning models and deploy them in production. They need to understand the different types of deep learning models available so that they can pick the appropriate model for the business problem at hand. A proper understanding of the business requirements is also needed in order to develop technical specifications for adopting deep learning.
[Figure: the input passes through convolution and pooling layers, and fully connected layers produce the output]
The convolution operation is then applied to the input data in a sliding-window fashion: the filters slide across the input, one pixel at a time. The output of a convolution operation is a feature map. Feature maps can then be used to classify the data or detect objects in the input data.
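The sliding-window operation can be sketched in NumPy; the tiny image and edge-detecting filter below are invented for illustration:

```python
# Sketch: a 2-D convolution, sliding a small filter across an image.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the window under the filter and sum: one output value
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
edge_kernel = np.array([[-1.0, 1.0]])   # responds to vertical edges

feature_map = convolve2d(image, edge_kernel)
print(feature_map)   # large values where the dark-to-bright edge sits
```

In a real CNN the filter weights are learned during training rather than hand-written, and many filters run in parallel to produce many feature maps.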
CNNs are useful under the following conditions:
SITUATION | EXPLANATION
Spatial relationships | When the data has spatial relationships, where the position of the features is important for classification.
Large data | When the amount of data is large. Since CNNs learn in a hierarchical manner, they can learn from large amounts of data without overfitting.
Noisy data | When the data has a lot of noise. Since CNNs learn in a hierarchical manner, they can learn features despite noise in the data.
Object detection | Utilized in self-driving cars to detect objects such as other cars, pedestrians, and cyclists
Image classification | Employed in Google Photos to automatically label photos based on what's in the image
Financial trading | Used by investment firms to predict stock prices or identify patterns in market trends
TIP For machine learning problems involving image, video, audio, or time-series data,
leverage CNNs to automatically learn spatial relationships and features.
The goal is to develop a CNN model that can accurately classify images in a predefined
dataset. But the larger intention is for the team to gain insights into the end-to-end process of
developing, training, and deploying a deep learning model.
Model training | Data scientist | Train the selected model using the training dataset. During this step, choose an appropriate loss function and optimizer. Use techniques such as data augmentation, dropout, or batch normalization to improve performance and reduce overfitting. | A trained CNN model
Forecasting | Used for forecasting; for example, a bank may use an RNN to predict the price of gold in the next month
TIP For sequential or time-series data like text, audio, or sensor streams, RNNs are well-
suited to learn temporal relationships.
Transformer Models
A transformer is an advanced neural network used for natural language processing tasks such as machine translation, text generation, text summarization, natural language understanding, speech recognition, and question answering.
Transformers work based on the attention mechanism, which allows them to learn long-term dependencies in text. They calculate similarity scores between each input token and each output token; these scores are then turned into weights over the input tokens, which the transformer uses to attend to different parts of the input when generating the output text.
The transformer model can learn long-range dependencies because of the use of the attention mechanism and is
also more efficient because it is not as computationally intensive as the recurrent neural network.
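A NumPy sketch of scaled dot-product attention shows the similarity-score mechanism just described; the token vectors here are random stand-ins for real embeddings:

```python
# Sketch: scaled dot-product attention, the core transformer operation.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Softmax turns the similarity scores into weights over the input tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights      # weighted mix of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 output positions ("queries")
K = rng.normal(size=(5, 4))   # 5 input tokens ("keys")
V = rng.normal(size=(5, 4))   # values carried by the input tokens

out, weights = attention(Q, K, V)
print(out.shape)              # one attended vector per query
```

Because every token attends to every other token in one step, long-range dependencies do not have to be carried through a recurrent state, which is the efficiency advantage noted above.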
Transformer models are suitable under the following conditions:
Translation | By capturing the relationships between words in different languages, transformers help translate between languages.
Contextual meaning | By capturing the contextual meaning of words, they have been effective in sentiment analysis, text classification, named entity recognition, and text summarization.
Question answering | They have been able to help answer questions based on context.
Recommendation systems | Useful for building recommendation systems by analyzing user preferences, historical data, and item features.
Image processing | Helpful for image classification, object detection, image generation, and image captioning.
Information extraction | Useful for extracting key information from documents, document classification, summarization, and information retrieval.
Text generation | Useful for text generation for chatbots, story generation, content creation for marketing, and so on.
[Figure: the generator turns random input into sample images, and the discriminator compares them against realistic images]
Image and video editing | The image and video editing capabilities of GANs can be used to generate realistic images and videos in the entertainment, marketing, and advertising industries.
Text generation | The text generation capabilities of GANs can be used to create content for websites and social media, as well as new forms of art and literature.
Music composition | The music composition capabilities of GANs can be used to create songs, soundtracks, and entire music albums, as well as music for TV and movies.
Drug discovery: GANs can be used to create new drug molecules with the potential to treat diseases.
Financial data generation: GANs can be used to create realistic financial data, such as stock prices and economic indicators, which can be used to make financial decisions and forecast economic trends.
Climate modeling: GANs can be used to generate realistic climate models to gain a better understanding of climate change.
Artificial intelligence: GANs can be used to create training data for other models to improve their accuracy.
Facial reconstruction: GANs can be used to reconstruct a face from incomplete or degraded images, which can be used in forensics, entertainment, and virtual character creation.
TIP For generating realistic synthetic data like images, audio, and text, consider using
GANs.
TIP AI/ML frameworks provide prebuilt components for data preprocessing, model
architecture, building, and deployment, allowing you to focus more on the functionality and
business problem.
310 ❘ CHAPTER 16 Choosing Your AI/ML Algorithms
Visualization: TensorFlow has good visualization tools, including TensorBoard; PyTorch has good visualization tools, including the PyTorch Debugger and Visdom.
Keras
Keras is a high-level neural network API built to run on top of other frameworks, such as TensorFlow and Theano, with the aim of making model development user-friendly on top of those complex frameworks.
Keras has a user-friendly interface for beginners, but even experienced developers can use it to build models and experiment with deep learning. It reduces the number of user actions required, provides clear error messages, and offers extensive documentation and guides. It makes it easy to define and train deep learning models and is helpful for complex tasks such as image classification, NLP, and speech recognition. It helps researchers and scientists develop iterative prototypes and drives faster innovation through experimentation.
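As a sketch of what this user-friendliness looks like in practice, a small classifier can be defined and trained in a few lines. This assumes TensorFlow's bundled Keras; the random data and layer sizes are purely illustrative, not a recommended architecture.

```python
# Minimal Keras sketch: define, compile, and train a small binary classifier.
# Assumes TensorFlow is installed; the data is random and purely illustrative.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.random((100, 4))                    # 100 samples, 4 features
y = (X.sum(axis=1) > 2).astype("float32")   # toy binary labels

model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

probs = model.predict(X[:3], verbose=0)     # predicted probabilities in [0, 1]
```

The entire workflow of defining, compiling, and fitting the model reads almost like a description of the network, which is the point of the API.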
Caffe
Caffe stands for Convolutional Architecture for Fast Feature Embedding. It is a deep learning framework developed by the Berkeley Vision and Learning Center. It is maintained by the community and is written in C++ with a Python interface. Because it is developed in C++, it is less approachable for beginners and less user-friendly than TensorFlow or PyTorch. It has extensive documentation and is intended to be fast and flexible, as it can be extended to support new tasks.
NOTE Caffe is helpful for large-scale, complex tasks such as image recognition, object
detection, semantic segmentation, and speech recognition.
MXNet
MXNet is a robust, open-source deep learning framework developed by the University of Washington and the University of Hong Kong. It is used by Baidu for its search algorithms and is written in C++. It can be tricky for beginners and is not as user-friendly as TensorFlow or PyTorch.
Scikit
Scikit is short for SciPy toolkit; the library's full name is scikit-learn. It's written in Python and is a powerful and versatile machine learning library that is well suited to beginners. It includes classification, regression, and clustering algorithms, such as support vector machines, random forests, and k-means.
Scikit is intended to work with other Python libraries, namely NumPy and SciPy. It's free software released under the BSD license. It is a good choice for beginners who want to learn machine learning.
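A typical scikit-learn workflow fits in a handful of lines. The following sketch trains a random forest on the library's bundled iris dataset and scores it on a holdout set; the dataset and hyperparameter choices are illustrative.

```python
# Illustrative scikit-learn workflow: train a random forest classifier on the
# bundled iris dataset and check its accuracy on a 20% holdout set.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"holdout accuracy: {accuracy:.2f}")
```

The same `fit`/`predict`/`score` pattern applies across the library's estimators, which is what makes it approachable for beginners.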
Chainer
Chainer is a powerful open-source deep learning framework, written in Python on top of libraries such as NumPy and CuPy, that has been used to develop many deep learning models. It can be used for classification, object detection, and NLP use cases. It was developed by the Japanese venture company Preferred Networks in collaboration with IBM, Intel, NVIDIA, and Microsoft.
Many more AI/ML frameworks exist, such as CNTK, DLib, and so on.
SUMMARY
You took a comprehensive trip into the world of AI and ML algorithms. Tailored for experienced data scientists
and beginners alike, this chapter demystified the process of selecting the right algorithms based on use cases.
The chapter dived into the basics of artificial intelligence, reviewed different types of machine learning, discussed
the nuances between models and algorithms, and provided hands-on exercises to help you choose the right
algorithms.
Understanding these algorithms is akin to possessing the keys that unlock the vast potential hidden in the data.
While choosing the algorithms, note that the choice of an algorithm is rooted not only in the data but also in the business problem you are trying to solve. In subsequent chapters, you can look forward to diving deeper into training, tuning, evaluating, and deploying models. Your journey continues!
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. What are the types of data that machine learning algorithms can use?
A. Text data
B. Images
C. Only structured data
D. All kinds of data
2. What is the primary purpose of AI?
A. To automate business processes
B. To improve customer service
C. To mimic human intelligence
D. To monitor events and take action
3. The purpose of labeled data in supervised learning is
A. To identify input variables
B. To predict continuous values
C. To create a model
D. To define the output variable
4. Which one of the following is a regression problem?
A. Identifying spam emails
B. Forecasting product demand
C. Recommendation systems
D. Classifying images
5. The purpose of reinforcement learning is to
A. Classify data into labels
B. Predict continuous values
C. Learn from interactions and rewards
D. Automate process interactions
6. Random forests are useful for which types of problems?
A. Classification
B. Regression
C. Both classification and regression
D. None of the above
7. How does KNN work?
A. It uses deep learning to work.
B. It calculates the mean of all data points to make predictions.
C. It identifies the K closest data points and uses their outcomes to make predictions.
D. It randomly assigns clusters to data points.
ANSWER KEY
1. D  2. C  3. D  4. B  5. C  6. C  7. C  8. C  9. B  10. B  11. A  12. C  13. B  14. B  15. C
17
Training, Tuning, and Evaluating Models
Success is not final; failure is not fatal: It is the courage to continue that counts.
—Winston Churchill
Traveling the AI journey is like solving a jigsaw puzzle. In the previous chapter, we got one step closer to
solving this puzzle by learning how to choose the most apt AI/ML algorithm, a critical step in transitioning
data into actionable intelligence. This chapter builds on that foundation to get into the actual act of modeling, which brings these algorithms to life.
This chapter delves into the three pillars of modeling: training, tuning, and evaluation (see Figure 17.1).
The focus is on ensuring these models are secure, efficient, and high performing. We begin by looking at
the intricacies of model building and the likely challenges during the training process. From distributed
processing to container code, we dive into the tools and techniques of model development. You learn about
optimizing models for high performance using hyperparameters and model tuning. Evaluation and valida-
tion will help you align with your enterprise objectives when dealing with real-world data and ensure the
models are robust and ready to achieve tangible, transformative outcomes.
MODEL BUILDING
Model building is an iterative process, and it is the first step in the process that sets up the model for train-
ing to begin. Figure 17.2 shows the model development lifecycle.
The purpose of model building is to build a working model that can be used to make predictions on new data. It is the process of defining the model’s structure, parameters, and hyperparameters. Taking the cooking analogy, model building is like creating the recipe for your meal.
Start by selecting an algorithm suited to your data type and problem. However, always be open
to trying different algorithms based on model performance and the insights gained during the
training phase.
[Chapter-opening figure: Enterprise AI journey map, highlighting the Process Data and Modeling stage (process data and engineer features in the cloud; choose your AI/ML algorithm; train, tune, and evaluate models). Objective: master the iterative process of refining AI/ML models through rigorous training, hyperparameter tuning, and evaluation, ensuring they are ready for deployment with high precision, reliability, and efficiency.]
[Figure: The model development lifecycle, cycling through model development, model tuning, model evaluation, and model validation (validate the model using a validation dataset).]
You will be using various tools and frameworks such as TensorFlow, PyTorch, scikit-learn, and so on.
[Figure: Numbered steps of the model development process, including model building, training code, model training, validation metrics, and training containers.]
Model building is all about experimenting with different algorithms, features, and hyperparameters iteratively
until a satisfactory performance is achieved. As part of this process, you may be required to carry out some data
preprocessing, such as scaling, encoding, or feature engineering.
TIP Model architecture refers to the algorithm or framework that is used to predict. It can
refer to the layers, nodes, activation functions, loss functions, and optimization algorithms
that are arranged and configured in a certain way. Some common machine learning model
architectures are linear regression, decision trees, neural networks, convolutional neural networks, and recurrent neural networks.
For our case study, you may choose classification algorithms such as logistic regression, naïve Bayes, and support
vector machines as possible candidates to build the model.
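Comparing such candidate classifiers can be automated. The following sketch evaluates logistic regression, naive Bayes, and a linear support vector machine with cross-validation; the synthetic dataset and scikit-learn usage are illustrative assumptions, not the book's actual case study code.

```python
# Compare candidate classifiers (logistic regression, naive Bayes, linear SVM)
# on a synthetic binary classification dataset using 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "support vector machine": LinearSVC(max_iter=10000),
}
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The highest-scoring candidate becomes the model you shortlist for training, while keeping the others available if its performance later disappoints.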
NOTE Refer to Chapter 16 on algorithms to learn more about various algorithms and when
to use which one.
MODEL TRAINING
Once you have shortlisted a model during model building, the next step is to train your model with the training
dataset. Model training is a crucial step in machine learning that results in a working model, which you can vali-
date, test, and deploy. Using the cooking analogy, model training is like actually starting to cook the meal using
the recipe you came up with during the model building process.
NOTE A model is a set of rules or equations that the machine can use to learn from data to
make predictions. Depending on the type of problem, these predictions could be about iden-
tifying if an email is spam or classifying images as different types of animals.
During this step, you will feed the training data into the model and begin the training process. The training
process teaches the machine how to make its predictions. During this process, the weights (parameters) of the
algorithm get updated to reduce the loss, which is the difference between the prediction and the actual labels. In
other words, the training process helps to increase the accuracy with which the machine makes its predictions.
TIP Keep a close eye on metrics such as accuracy, precision, and recall to check if the model
is performing well or needs adjustment.
Distributed Training
When dealing with a lot of data or a complex model, your training process can take a long time, sometimes hours or even days. This is where distributed training comes to the rescue. Distributed training involves using multiple machines to train your model.
There are two types of distributed training: data parallelism and model parallelism.
Model Parallelism
Model parallelism involves splitting the model into smaller parts to predict different features and assigning each
part to different machines. The models are then combined to form the final model.
Model parallelism is more complex than data parallelism but is more suited for models that cannot fit into one
device. You can also combine model parallelism with other techniques such as data parallelism to improve train-
ing efficiency.
Data Parallelism
Data parallelism involves splitting the training dataset into mini batches evenly distributed across multiple nodes.
For example, if you have 1,000 emails to identify spam, you can divide them into 10 groups of 100 emails each
and assign each group to a different computer. Therefore, each node trains on only a fraction of the dataset. You
will update the model with the results from the other machines to make the final model.
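The email example above can be simulated in a small sketch. This is plain Python standing in for a real multi-machine setup (which would typically use a framework feature such as PyTorch's DistributedDataParallel); the one-parameter model and its gradient are illustrative.

```python
# Toy simulation of data parallelism: split the dataset into mini-batches,
# compute a partial gradient on each "node", then average the partial
# gradients to produce a single model update.
import random

random.seed(0)
dataset = [(random.random(), random.choice([0, 1])) for _ in range(1000)]

def partial_gradient(batch, weight):
    # Gradient of squared error for a one-parameter linear model (illustrative).
    return sum(2 * (weight * x - label) * x for x, label in batch) / len(batch)

num_nodes = 10
batch_size = len(dataset) // num_nodes
batches = [dataset[i * batch_size:(i + 1) * batch_size] for i in range(num_nodes)]

weight = 0.0
grads = [partial_gradient(b, weight) for b in batches]   # runs "in parallel"
weight -= 0.1 * (sum(grads) / len(grads))                # aggregate and update
print(f"{len(batches)} nodes, {batch_size} examples each, new weight {weight:.4f}")
```

Each node sees only a tenth of the data, yet the averaged gradient drives one shared model forward, which is the core idea of data parallelism.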
NOTE An optimization algorithm helps update the parameters of the model to reduce a
loss function, such as mean squared error. A good example is the gradient descent algorithm,
which adjusts the parameters iteratively in the direction of the steepest descent of the loss
function.
CHALLENGES IN MODEL TRAINING
[Figure: Challenges in model training, including saturated activations (output flatlining; model not improving) and vanishing gradients (gradient slope fading; model stops learning).]
System Bottlenecks
When your system faces a bottleneck in memory, CPU, disk space, or network bandwidth, it can drag down the performance of the entire training process.
Saturated Activations
Saturated activations occur when the outputs of activation functions, such as sigmoid or tanh, have reached their minimum or maximum levels. When this happens, we say the activation functions are saturated, and the model stops improving because further weight changes barely move the outputs. ReLU is an alternative in such situations.
Vanishing Gradients
Vanishing gradients happen when the gradients or the incremental changes to the neural network weights are
minimal or close to 0 during backpropagation. When this happens, the neural network will not learn much from
the data. It can happen when using sigmoid or tanh, as well as when the gradients are not propagated back
through the network.
TIP Think of saturated activation functions as rusty gears in a machine. If they are not
moving, neither is your model’s learning. Similarly, vanishing gradients are like whispers in a
noisy room and your model is not hearing them. Opt for ReLU activation functions and
proper weight initialization to address these issues.
To resolve this situation, you can either use ReLU or use weight initialization schemes to prevent the gradients
from becoming too small.
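A small numerical sketch shows why this happens: the derivative of the sigmoid function is at most 0.25, so multiplying many per-layer gradients together during backpropagation collapses the product toward zero, while ReLU's derivative is 1 for positive inputs. The 20-layer depth below is an illustrative choice.

```python
# Illustrate vanishing gradients: the sigmoid derivative is capped at 0.25,
# so the product of derivatives across many layers collapses toward zero.
# ReLU's derivative is 1 for positive inputs, so the product survives.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

layers = 20
sigmoid_grad = 1.0
relu_grad = 1.0
for _ in range(layers):
    sigmoid_grad *= sigmoid_derivative(0.0)  # best case: 0.25 per layer
    relu_grad *= 1.0                         # ReLU derivative for positive input

print(f"after {layers} layers: sigmoid grad {sigmoid_grad:.2e}, relu grad {relu_grad}")
```

Even in sigmoid's best case (input 0), twenty layers shrink the gradient by a factor of roughly a trillion, which is why the early layers stop learning.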
You can use a debugger tool to keep track of the network’s inputs, outputs, gradients, weights, metrics, logs,
errors, and so on. The tool can also capture the state of a machine learning job at periodic intervals, such as vari-
ous epochs, batches, steps, or layers.
Some examples of debuggers are the PyTorch debugger, the TensorFlow debugger, SageMaker Debugger, and the Azure Machine Learning debugger. ML debuggers help you debug and profile performance at various intervals using breakpoints, alerts, watches, visualizations, and filters.
TIP Building training code containers is an essential step in the machine learning process to
build and deploy models quickly and reliably at scale.
Suppose you’re trying to build a model to predict customer churn. You have collected all your data and used the
logistic regression model to predict customer churn. And now, you have written some Python code in your local
machine to train the model. You have integrated your code with other frameworks and libraries such as scikit-
learn and pandas.
Now, you are all set to deploy your working code into production using the cloud. Unfortunately, this is where
you are going to face some issues, such as the following:
➤➤ Your code may not run due to differences in the hardware, software, and operating system configuration.
➤➤ Your code may not be able to access the data that is stored in the cloud platform.
➤➤ You may need to scale up or down your compute and storage resources depending upon the workload.
Here comes your training container to your rescue. You will start by creating a container image with the training
code and the entire dependent stack, such as the libraries, framework, tools, and environment variables for your
model to run. You can then train the model quickly and deploy it on another platform at scale because now your
code is running as part of a self-contained unit with all the dependencies taken care of.
You can then use a Dockerfile to build the container image and push it to a container registry such as Docker Hub or the Google Container Registry.
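As a sketch, a minimal Dockerfile for such a training container might look like the following. The base image, file names, and environment variable here are illustrative assumptions, not a prescribed setup.

```dockerfile
# Illustrative training container: package the training code and its
# dependency stack into a self-contained image.
FROM python:3.11-slim

WORKDIR /opt/ml/code

# Pin the dependent stack (e.g., scikit-learn, pandas) in requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training script and any supporting modules
COPY train.py .

# Environment variables the training code expects (illustrative)
ENV MODEL_DIR=/opt/ml/model

ENTRYPOINT ["python", "train.py"]
```

You would then build the image with something like `docker build -t churn-train .` and push it to your registry, after which the same container runs identically on your laptop or in the cloud.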
Model Artifacts
You need to realize that the final model is one of many outputs you will get when training a model. There will
also be other model artifacts, such as the trained parameters, model definition, and metadata. These model arti-
facts are essential pieces of the puzzle to understand, use, reproduce, and improve upon your model. Figure 17.5
shows the artifacts.
[Figure 17.5: Model artifacts, including trained parameters, the model definition, and other metadata such as preprocessing steps, library versions, hyperparameters, and validation metrics.]
Trained Parameters
In the case of predicting the price of a house, the trained parameters are the weights and biases that your model learned during the model training process. These parameters define the relationship between the features and the price of the house; they basically determine how the model works.
Model Definition
Model definition defines the architecture or structure of the model: how the data flows into the model, the layers, the nodes, and how they interact. In the case of the house pricing prediction model, this could define the structure of the neural network or the regression model. It would detail how the inputs (house features) are processed to produce the output (predicted price).
Other Metadata
This could be any information that you may have used during the model training process, such as the preprocessing steps that you used to prepare the data, the versions of the libraries, the hyperparameters used to tune the model, the validation metrics that you used to evaluate the model, and so on.
TIP It is vital to keep track of the model artifacts as they will help in reproducibility,
governance, explainability, troubleshooting, lineage tracking, and centralized artifacts
management.
MODEL TUNING
Model tuning involves fine-tuning the parameters not optimized during the training process. Model tuning helps
to modify the structure of the model itself. During model training, the internal trainable parameters, such as
weights and biases of the neural networks, are trained using the training data.
In the case of model tuning, the focus is on tuning hyperparameters. Hyperparameters are structural settings that are set before fitting the model and are not learnable from the training data. These are parameters such as the learning rate, the number of epochs, regularization strength, and so on.
TIP Think of model tuning as sharpening your pencil. While training sets the foundation,
tuning is where you achieve perfection.
Hyperparameters
Suppose you are a chef trying to create a recipe to bake bread. While creating your bread, you can change a few
parameters, such as the temperature and the baking time. You can think of these parameters as hyperparameters.
Hyperparameters control how the model works and help optimize the model's performance on the training and validation datasets. They help to avoid overfitting or underfitting. These values cannot be estimated from data and are often configured by a practitioner. Hyperparameters control the behavior of the algorithm or the model, including how fast the training occurs and how complex the model is. They are set before the model gets trained, and they guide how the model learns its trainable parameters.
I next discuss the learning rate, regularization, batch size, number of epochs, hidden layers, and hidden unit
hyperparameters, as shown in Figure 17.6.
Learning Rate
Learning rate is the rate at which the model parameters are updated at each iteration of the gradient descent.
You can think of the learning rate as the baking temperature. A high temperature can speed up the baking, but
it can get spoiled by overheating. Similarly, a high learning rate speeds up the convergence to minimize the loss
function. However, it can also cause instability or overshooting. A low learning rate can ensure stability but can
take a long time to converge and potentially get stuck in local minima.
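This trade-off can be seen in a tiny gradient descent sketch. The function f(w) = w² is an illustrative stand-in for a loss function; its minimum is at w = 0.

```python
# Gradient descent on f(w) = w**2 with different learning rates.
# A small rate converges steadily toward the minimum (w = 0); a too-large
# rate overshoots further on every step and diverges.
def gradient_descent(learning_rate, steps=30, start=1.0):
    w = start
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w -= learning_rate * grad
    return w

stable = gradient_descent(learning_rate=0.1)    # converges toward 0
diverged = gradient_descent(learning_rate=1.1)  # overshoots and blows up
print(f"lr=0.1 -> w = {stable:.6f}")
print(f"lr=1.1 -> w = {diverged:.2e}")
```

With a learning rate of 0.1, each step shrinks w by a constant factor; with 1.1, each step lands farther from the minimum than the last, which is exactly the overshooting behavior described above.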
[Figure 17.6: Common hyperparameters: learning rate, regularization (counters overfitting), batch size (impacts memory usage), number of epochs, hidden layers (add model complexity), and hidden units (determine capacity).]
Regularization
Regularization is adding a penalty to the loss to reduce the complexity or magnitude of the parameters. While this
approach may reduce the variance and help avoid overfitting, it can introduce bias. Using the cooking analogy,
compare the situation of overfitting to having added more salt. To reduce this saltiness, you would add more
water. You can think of regularization in a similar manner.
Batch Size
Batch size is the number of data points used to train the model in each iteration of the gradient descent.
A large batch size can reduce the variance and noise in the parameter updates, but it takes a long time to compute and uses a lot of memory. A smaller batch size has the reverse effect: it reduces the time taken to compute and uses less memory, but it increases the noise and variance in the parameter updates.
Number of Epochs
The number of epochs stands for the number of times the entire training dataset is fed to the model for training.
A higher number means the model runs the risk of overfitting, while a lower number may lead to underfitting of
the model.
To find the sweet spot for the number of epochs, you can resort to techniques such as early stopping in the case of
overfitting and adopt techniques such as cross-validation to get better results.
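Early stopping can be sketched as a simple patience loop. The validation losses below are made up for illustration; real values would come from evaluating the model after each epoch.

```python
# Early stopping sketch: stop training when validation loss has not
# improved for `patience` consecutive epochs.
def train_with_early_stopping(val_losses, patience=2):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stopping early at epoch {epoch} (best loss {best_loss})")
            return epoch
    return len(val_losses)

# Illustrative validation losses: improvement stalls after epoch 4.
stopped_at = train_with_early_stopping([0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60])
```

Here training halts at epoch 6, two epochs after the best validation loss, instead of continuing to overfit for the full run.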
Hidden Layers
Hidden layers are the layers that are not visible to the user in a neural network.
NOTE The hidden layer is an important parameter that can decide the complexity of the
model.
Shallow layers can catch broad features, and deeper layers can catch more intricate, specific details. A higher number of hidden layers can result in overfitting, while a lower number can lead to underfitting. To identify the right number of hidden layers, you can try grid search or random search techniques to evaluate the performance using different numbers of hidden layers.
Hidden Units
Hidden units are the number of neurons in a layer in a neural network. They determine the representational
capacity of the neural network. A higher value can lead to overfitting, and a lower value can lead to underfitting.
You can identify the correct optimal number of hidden units using techniques such as a grid or random search to
evaluate the network’s performance using different numbers of hidden units.
You can use your intuition or experience when choosing the hyperparameters to manually tune a model. Your
goal is to choose the model with the best validation performance. While using an automated approach, the grid
search provides the best possible model. In contrast, the random search approach is more efficient.
TIP If you’re running on, you will choose the random search approach, but if you’re looking
for accuracy, you will choose the grid search approach.
To use grid search to tune the hyperparameters, you can set the maximum depth of the decision tree to 10, 20, 30,
40, and 50. You can set the number of trees in the random forest to 100, 200, 300, 400, and 500.
When you use grid search, the system will try all 25 combinations of the hyperparameter values, which will take a
long time but will provide the best model.
If you used random search, the system would randomly sample the maximum depth of the tree from a uniform
distribution between 1 and 50 and randomly sample the number of trees from a uniform distribution between
100 and 500.
NOTE While random search will not give the best possible model, it will be faster than the
grid search approach.
Bayesian Search
Bayesian search is a hyperparameter optimization technique well suited to noisy or more complex use cases where a grid or random search struggles. It is more efficient than grid or random search but can be complex to implement and may need some knowledge of Bayesian statistics.
TIP Bayesian search can be used for any machine learning model but may not be as effective
as a grid or random search for more straightforward objective functions.
The Bayesian search uses Bayesian statistics to find the best hyperparameters and works by building a probabil-
istic model of the objective function being optimized. It has been successfully used in many applications such as
spam filtering and credit card fraud detection.
MODEL VALIDATION
This section discusses how to validate the models and the metrics that you have to use.
Holdout Validation
Holdout validation involves breaking the dataset into a training dataset and a test dataset. You will use the training dataset to train the model and then test the model using the test dataset. By comparing the results with the actual labels, you can evaluate how accurate the model is.
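A minimal holdout split can be sketched like this. An 80/20 split is a common but illustrative choice, and the integer "dataset" stands in for labeled examples.

```python
# Holdout validation sketch: shuffle the dataset, then split it 80/20
# into training and test sets.
import random

random.seed(42)
dataset = list(range(100))          # stand-in for 100 labeled examples
random.shuffle(dataset)

split = int(len(dataset) * 0.8)
train_set, test_set = dataset[:split], dataset[split:]
print(f"{len(train_set)} training examples, {len(test_set)} test examples")
```

Shuffling before splitting matters: without it, any ordering in the data (say, by date or by class) leaks into the split and biases the evaluation.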
[Figure: Validating machine learning models: holdout validation (dataset split into two groups; train on one, test on the other) and leave-one-out (dataset split into N groups; train on N-1 groups, test on the held-out case).]
K-Fold Cross-Validation
This involves breaking the dataset into K folds and training and testing the model on each fold. Then you compare the average performance across all folds with other models.
TIP Using the leave-one-out cross-validation technique helps you test the data against all the
data points, which is very accurate.
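K-fold cross-validation can be sketched as follows. The scoring lambda is a placeholder standing in for actually training a model on the training indices and measuring accuracy on the held-out fold.

```python
# K-fold cross-validation sketch: partition the data into K folds, train on
# K-1 folds, test on the held-out fold, and average the scores across folds.
def make_folds(n_samples, k):
    fold_size = n_samples // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def cross_validate(data, k, score_fn):
    folds = make_folds(len(data), k)
    scores = []
    for held_out in folds:
        train_idx = [i for fold in folds if fold is not held_out for i in fold]
        scores.append(score_fn(train_idx, held_out))
    return sum(scores) / len(scores)

data = list(range(100))
# Placeholder scorer: a real one would fit a model on train_idx and return
# its accuracy on the held-out fold.
avg = cross_validate(data, k=5,
                     score_fn=lambda train_idx, test_idx: len(test_idx) / len(data))
print(f"average score across 5 folds: {avg:.2f}")
```

Every example serves as test data exactly once, which is why the averaged score is a more stable estimate than a single holdout split.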
Validation Metrics
As shown in Figure 17.9, validation metrics are used during the different phases of a model development lifecycle.
TIP Validation metrics are used to measure the model performance in a quantifiable manner
during different phases of the model development lifecycle. During model building, use them
to choose the right model; during training, use them to ensure learning is on the right track;
during tuning, use them to choose the right hyperparameters; and during evaluation, use
them to finalize the model.
TIP No single metric will tell you the whole story. While accuracy may be very impressive, it
can be misleading if your classes are imbalanced. That’s why you need to look at other met-
rics such as precision, recall, and F1 score.
Accuracy
Accuracy is the percentage of the correctly predicted labels out of all predictions, calculated as follows:
Accuracy = Number of correct predictions / Total number of predictions
[Figure: Key classification metrics to evaluate models: accuracy (overall performance metric), precision (positive prediction accuracy), recall (completeness of positive predictions), F1 score (balance between precision and recall), and AUC-ROC (area under the receiver operating characteristic curve; how well the model can differentiate between the positive and negative classes).]
Precision
Precision is the percentage of the correctly predicted positive labels out of the total positive predictions.
Precision = Number of true positives / (Number of true positives + Number of false positives)
Recall
Recall is the percentage of the correctly predicted positive labels out of the total actual positives.
Recall = Number of true positives / (Number of true positives + Number of false negatives)
F1 Score
F1 score is the harmonic mean of the precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
AUC-ROC
AUC-ROC stands for area under the receiver operating characteristic curve, and it helps to determine how well the model can differentiate between the positive and negative classes. Its value ranges from 0 to 1. A value of 0.5 implies random guessing, and 1 implies perfect classification. It plots the true positive rate (recall) against the false positive rate (1 – specificity) for various threshold values.
Since these predictions are probabilities, we must convert them into binary values using a threshold. For a threshold of 0.5, a predicted probability greater than 0.5 is treated as positive, and any value less than 0.5 is treated as negative.
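Thresholding takes only a couple of lines; the probabilities below are illustrative.

```python
# Convert predicted probabilities into binary labels using a 0.5 threshold.
probabilities = [0.2, 0.8, 0.6, 0.4]
predictions = [1 if p > 0.5 else 0 for p in probabilities]
print(predictions)  # [0, 1, 1, 0]
```

Sweeping this threshold from 0 to 1 and recording the true and false positive rates at each setting is exactly what traces out the ROC curve.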
NOTE AUC-ROC is a measure of how well a model can distinguish between two classes,
ranging from 0 to 1.
Confusion Matrix
Confusion matrix is a table that summarizes the number of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) for a given class.
Consider the scenario for predicting customer churn. Table 17.1 shows the confusion matrix.

                     Actual No Churn    Actual Churn
Predicted No Churn   TN                 FN
Predicted Churn      FP                 TP
Customer   Actual Label   Predicted Probability   Predicted Label (threshold 0.5)
1          0              0.2                     0
2          1              0.8                     1
3          0              0.6                     1
4          1              0.4                     0
                     Actual No Churn    Actual Churn
Predicted No Churn   TN = 90            FN = 10
Predicted Churn      FP = 20            TP = 80
These metrics reveal that this model is fairly accurate, but there is still some scope to improve, as it makes some false predictions. You can use these metrics to choose between different models or algorithms. You can also use them to choose between different hyperparameters, such as the regularization or threshold value, as part of the model-tuning process.
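From confusion matrix counts, the metrics above can be computed directly. The counts here (TP = 80, TN = 90, FP = 20, FN = 10) are illustrative values for the churn scenario.

```python
# Compute classification metrics from illustrative confusion matrix counts.
tp, tn, fp, fn = 80, 90, 20, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")   # 0.850
print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.889
print(f"f1        = {f1:.3f}")         # 0.842
```

Note how accuracy alone (0.85) hides the asymmetry between precision and recall, which is why all four metrics are worth reporting together.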
MODEL EVALUATION
After training a model, the next important step is to evaluate the model for its performance and accuracy. Think
of model evaluation as your final examiner that validates if the model is ready for the real world.
As part of the evaluation, you have to take care of three things.
➤➤ Ensure the model can generalize well for new unseen data.
➤➤ Ensure the model satisfies the business rules, objectives, and expectations. Otherwise, it’s like a ship
without a compass.
➤➤ Ensure it can accommodate different scenarios and handle different trade-offs. The fact is no model can
handle every situation perfectly.
Model evaluation will help you to address these challenges. As part of it, you will select the suitable model after a
lot of experimentation, followed by validating the model and fine-tuning it with the correct hyperparameters.
NOTE Note that the model evaluation can take place using offline or online data.
Offline data evaluation will be done using holdout data, while online evaluation will take place in real time in
production, which can be risky but more effective.
Figure 17.11 shows the various components of the model performance evaluation pipeline. Model evaluation
involves testing the model against a test dataset not used for training or validation. It helps to test the final accu-
racy and quality of the model. You will use evaluation metrics such as accuracy, precision, recall, and F1 score.
During this phase, you will look for potential bias and discrimination in the model and also try to explain how
the model works using a feature importance score or confusion matrix.
TIP Model evaluation is a critical step to ensure your model is robust, fair, aligned with your
business objectives, and ready for the real world.
BEST PRACTICES
This section reviews some practices you can implement during the model development lifecycle to streamline
workflows, secure your platform, enhance model robustness, improve performance, and reduce costs.
[Figure: MLOps pipeline: a data pipeline prepares data, a feature pipeline fetches features, the develop stage trains/tunes/evaluates models, deployment serves real-time or batch inference to apps, and monitoring raises alarms that feed a re-train pipeline, all coordinated through a CI/CD/CT pipeline with a model registry that stores artifacts (data, code, model).]
TIP IaC and CaC will lead to more predictable deployments and fewer manual errors.
➤➤ Implementing CaC will ensure the traceability of different data and model artifacts across multiple envi-
ronments and configurations. It will help you troubleshoot issues and avoid the manual, tedious process
of managing configurations across multiple environments.
TIP IaC and CaC help manage versions, making rolling back changes easier. They are life-
savers to scale ML operations.
You can use AWS CloudFormation, Terraform (open source), and AWS Cloud Development Kit to implement IaC.
332 ❘ CHAPTER 17 Training, Tuning, and Evaluating Models
Embrace Containerization
Containerization provides the following benefits:
➤➤ Governance and Compliance: By packaging the ML code, libraries, and frameworks into the container
and ensuring that these are approved by the organization, you can ensure compliance and compatibility
with environments as well as share these containers across multiple teams within the organization.
➤➤ Standardization: It will ensure standardization of the machine learning infrastructure and code, in addi-
tion to enabling governance across the enterprise.
TIP Implementing MLOps best practices will not only streamline your ML development
process but also ensure consistency, governance, and collaboration, which are foundational
pillars for an agile, reliable, and scalable AI implementation.
TIP To avoid transfer learning risk, you can use the AWS SageMaker Debugger to detect any
bias in the new predictions.
[FIGURE: Machine learning security best practices — protect from transfer learning risks, protect against data poisoning threats, secure your ML environment, and encrypt inter-node cluster communications.]
➤➤ Diversify data sources: Use more than one data source to train your models so the impact of compro-
mise to a single data source is minimized.
➤➤ Audit and oversight: Have an audit process in place to check for unusual changes and ensure alignment
with corporate security.
[FIGURE: Building robust and trustworthy models.]
➤➤ Feature consistency: Use the same feature engineering code in training and inference environments. Use only trusted data sources for training. Before training, check the quality of the data for outliers or incorrect labels.
➤➤ Relevant data: Use a validation dataset representative of the data that the model will use in production. Check for underlying shifts in the patterns and distributions in the training data. Be on the lookout for drift in the data and the model.
➤➤ CI/CD/CT automation: Use the CI/CD/CT pipeline to automate the process of building, testing, and deploying the models. Use a version control system to track the changes in the model and data.
➤➤ Detect data shifts: Check for underlying shifts in the patterns and distributions in the training data,
which might affect model reliability.
➤➤ Drift monitoring: Be on the lookout for drift in data and the model and address them appropriately to
ensure accuracy.
➤➤ Automated validation checks: Implement automated validation checks using relevant and updated data
to ensure the model works as expected.
TIP Using SageMaker Debugger, you can capture real-time training metrics that you can use
to optimize performance.
[FIGURE: Cost optimization best practices.]
➤➤ Use hyperparameter tuning techniques: Fine-tune settings for the best results.
➤➤ Choose the right instance size: Start small, then scale up.
➤➤ Use warm starting and checkpointing: Reuse previous knowledge and save states regularly.
➤➤ Select local training: Experiment locally before moving to the cloud.
➤➤ Choose an optimal ML framework.
➤➤ Monitor resource usage and stop resources not in use.
Goal: Understand how to choose the correct instance size for a given problem and then
experiment with local training before scaling.
➤➤ Scale up (Data scientist): Gradually increase the instance size, observing the training time and performance metrics, and note down the changes. Deliverable: performance report including training time and metrics at various instance sizes.
➤➤ Analyze (Data scientist): Compare the results to identify the optimal balance between instance size and performance. Deliverable: optimal instance size report and cost-benefit analysis.
Goal: Use AWS SageMaker to automate the training process and optimize hyperparameters.
training and evaluation phases. It will help you to ensure that the model meets the required quality standards
before it is deployed.
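The goal above — automating training and hyperparameter optimization — boils down to searching a hyperparameter space. This plain-Python random-search sketch illustrates the idea behind managed tuning jobs; the objective function, search space, and trial count are all illustrative stand-ins, not SageMaker APIs:

```python
import random

def random_search(train_fn, space, trials=200, seed=0):
    """Sample hyperparameter combinations at random and keep the best
    (lowest validation loss) -- the idea behind automated tuning jobs."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        loss = train_fn(params)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best

# Toy objective standing in for a real training job: minimized at lr=0.1, depth=4
toy = lambda p: abs(p["lr"] - 0.1) + abs(p["depth"] - 4)
space = {"lr": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_params, best_loss = random_search(toy, space, trials=200)
```

A managed service adds smarter strategies (Bayesian optimization, early stopping) and parallel trial execution, but the loop structure is the same.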
SUMMARY
In this chapter, you began a comprehensive journey through training, tuning, and evaluating machine learn-
ing models. You examined the various stages, including model building, model training, model tuning, model
validation, and model evaluation. You also learned about some best practices in areas such as security, reliability,
performance, and cost management.
This chapter serves as a foundational guide for data scientists, machine learning engineers, and leaders to con-
tinue to explore machine learning and the model-building process in particular. As we close this chapter, it is
essential to remember that this is a continuous process, and the challenges and opportunities will continue.
In the next chapter, let’s discuss how to deploy models in production to ensure the models function optimally and
responsibly.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. What is one of the benefits of building models in the cloud?
A. Scalability
B. Security
C. Privacy
D. None of the above
2. What is the purpose of model building?
A. To build a working model that can be used to make predictions on new data
B. To train the model
C. To tune the model
D. To validate the model
3. ______ is the primary step in machine learning that results in a working model that can be validated,
tested, and deployed.
A. Model training
B. Model building
C. Model tuning
D. Model evaluation
4. What is the difference between the model build process and the model training process?
A. The model build process focuses on discovering patterns in the training dataset and selecting the
best-performing model, while the model training process focuses on adjusting the model’s param-
eters to improve its performance.
B. The model build process focuses on adjusting the model’s parameters to improve its performance,
while the model training process focuses on discovering patterns in the training dataset and select-
ing the best-performing model.
C. The model build process and the model training process are the same.
D. None of the above.
C. Model fine-tuning
D. Model training
13. What is the benefit of implementing infrastructure as code (IaC)?
A. Consistency across different infrastructures
B. Traceability of different data and model artifacts
C. Both A and B
D. Neither A nor B
14. What is one of the risks associated with transfer learning?
A. The inherited weights from the pretrained model may not produce the correct predictions.
B. The fine-tuned model may take longer to train than a new model.
C. The pre-trained model might not have been trained on relevant data.
D. All of the above.
15. What is the purpose of conducting a performance trade-off analysis in machine learning?
A. To find a balance between model complexity and accuracy
B. To ensure optimal utilization of resources without compromising performance
C. To compare the performance of different models
D. Both A and B
ANSWER KEY
1. A 6. B 11. C
2. A 7. A 12. D
3. A 8. C 13. A
4. A 9. B 14. D
5. A 10. A 15. D
PART VII
Deploying and Monitoring
Models
Hooray! It is launch time. Part VII guides you through the process of deploying the model into produc-
tion for consumption. I also discuss the nuances of monitoring, securing, and governing models so they are
working smoothly, safely, and securely.
18
Deploying Your Models Into
Production
Do not wait to strike till the iron is hot but make it hot by striking.
—William Butler Yeats
Having navigated through data processing, algorithm selection, model training and fine-tuning, you are
now on the verge of an important step: model deployment. This is an essential chapter for data scientists,
ML engineers, IT professionals, and organizational leaders involved in deploying models into production
using the cloud.
The true value of an AI model is not just in its design or accuracy but in its real-world use. This chapter
dives deep into the nuances of deploying your model, from understanding the challenges in model deploy-
ment, monitoring, and governance to deciding between real-time and batch inferences.
Model deployment isn’t just about pushing a model live. It involves strategic decisions, systematic pro-
cesses, and synchronized architecture. Keep in mind that you will not just be launching models into pro-
duction; you will also be ensuring that the models are functioning optimally and responsibly within your
larger AI ecosystem. See Figure 18.1.
The focus of this chapter, though, is a successfully deployed model that makes the impact it was designed to
have, enabling continuous innovation and growth.
FIGURE 18.1: DEPLOY AND MONITOR MODELS: Deploy your models into production.
➤➤ Ethical use and responsibility: You need to ensure that the model is used responsibly and ethically in
a production environment without discriminating against its users. To enforce governance, you must
define policies and procedures integrated into your organization’s overall governance framework and
establish systems to detect bias proactively and address them in real time.
TIP Adopt deployment best practices such as automatically deploying through CI/CD pipe-
lines, deployment scripts, version control, continuous performance monitoring, ethical use,
proactive bias detection, and establishing a model governance framework.
TIP Solving these challenges requires a mix of technology initiatives such as MLOps, pipe-
lines, explainability techniques, governance policies, data use policies, regulatory guidelines,
and culture.
Pre-deployment Checklist
First, you must ensure the model is fully trained and tested. Second, ensure that you have fully optimized the
model using the best practices discussed in the previous chapter. Third, perform dry runs in a staging or sandbox
environment to uncover unforeseen issues.
[FIGURE: A production inference endpoint. The endpoint's container fetches features from the feature store and artifacts (models, code) from the model registry, runs the inference code, and serves predictions to the consuming application.]
Cloud Deployment
Here are some considerations for cloud deployment:
➤➤ Scalability: Cloud is the best option if you have many users to serve and if the load fluctuates. It
allows you to process large amounts of data using a large computing capacity, and to store large
amounts of data.
[FIGURE: Cloud deployment options. Edge deployment is ideal for real-time processing, high privacy, low bandwidth requirements, and low latency; hybrid deployment combines the flexibility and scalability of the cloud with the control and security of on-premises.]
➤➤ Service integration: The cloud providers also provide several services that make deploying and managing
your model easy. For example, you can leverage services for computing, storage, and networking.
➤➤ Data security: The cloud allows you to connect to sensitive data using robust network security protocols
offered by the cloud providers.
For these reasons, the cloud is often the most popular option.
On-Premises Deployment
Some companies may resort to on-premises deployment of their models using their own servers, storage, and
networks in their data centers.
➤➤ Security requirement: It is preferred when security is a primary requirement, and you want to keep your
data in your own locations.
➤➤ Infrastructure control: It also gives you more control over your infrastructure. However, you need to
regularly update your hardware and software, have robust backup and disaster recovery, and regularly
monitor for bottlenecks.
➤➤ Expensive: However, it can be more expensive and resource intensive to manage.
Edge Deployment
Here are some considerations for edge deployment:
➤➤ Model performance and latency: If your model performance and latency are important, then you would
resort to deploying the model at the edge where the data resides and action happens. Your edge devices
could be smartphones, IoT sensors, tablets, or industrial machines.
➤➤ Real-time processing: It allows you to process data in real time, provides greater privacy, and has fewer
bandwidth requirements.
➤➤ More complex: However, this can be the most complex deployment type because managing and updat-
ing models across multiple devices can take time and effort.
➤➤ Other considerations: Remember that you need to implement device management to manage software
updates, may encounter issues because of limited computing power, and may need to synchronize data
between central servers/cloud and devices.
NOTE AWS IoT Greengrass can run models in edge devices using models trained in the
cloud.
Hybrid Deployment
People use a hybrid deployment to get the best of both worlds, namely, cloud and on-prem deployments.
➤➤ Flexible and scalable: This approach can provide flexibility and scalability of cloud deployment.
➤➤ Greater control and security: This approach can provide greater control and security through an on-
prem deployment.
➤➤ More complex: It can be challenging to implement and manage.
With Amazon SageMaker, you can run the models on the cloud and at the edge. It has a compiler that can read
models from various frameworks and compile them into binary code to run the model in the target platform. It
also enables the “train models once and run anywhere” concept.
TIP Think long term and ensure that your deployment method suits your future growth
potential. Avoid vendor lock-in and factor in regulatory requirements.
Blue/Green Deployment
Consider these characteristics of blue/green deployment.
➤➤ Environment consistency: In the blue/green deployment approach, you have two identical production
environments running in parallel. You will continue the live traffic on the blue production environment,
and the green environment will be used for the deployment. When a new version of the model is ready
for deployment, you will deploy it first in the green environment to test it. Once it has been tested and
found to be working fine, you will start directing the live traffic to the green environment. The blue envi-
ronment will now become the backup.
➤➤ Rapid rollback: The benefit of this approach is that you can roll back your changes if the model is not
working correctly. It is scalable and requires minimal downtime.
➤➤ Resource management: But the disadvantage is that it has additional infrastructure costs due to the need
to maintain two production environments in parallel. It is more complex to implement and needs addi-
tional resources.
[FIGURE: Deployment strategies. Blue/green deployment: test on green, switch live traffic upon success; scalable with minimal downtime. Canary deployment: release to a small group initially; monitor and scale up traffic if successful. A/B testing: split users randomly and serve old and new versions; allows data-driven decisions. Shadow deployment: run old and new versions side by side for testing; a real-life testing environment, scalable.]
Canary Deployment
Consider these characteristics of canary deployment:
➤➤ Segmentation: You release the model to a small group of users.
➤➤ Continuous monitoring: It allows you to monitor the model’s performance closely, and once you find it
satisfactory, you can slowly ramp up the traffic to 100 percent.
➤➤ Live testing: The benefit of this approach is that you can test the model in a live, fully controlled
environment.
➤➤ Rollback: It allows you to roll out a new model without impacting all the users and can roll back if
there is an issue with the new model.
➤➤ Small sample size: The disadvantage is that you are testing it against a small sample size, and not all
errors may be caught. It won’t be easy to target a suitable set of users and assess the full impact of the
new model.
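The traffic ramp-up behind canary deployment can be sketched as a weighted router; the version names and the 10 percent starting fraction below are illustrative assumptions, not a prescribed configuration:

```python
import random

def route_request(canary_fraction, rng=random.random):
    """Return which model version should serve a request.

    canary_fraction is the share of traffic (0.0-1.0) sent to the new
    "canary" model; the rest goes to the stable model.
    """
    return "canary" if rng() < canary_fraction else "stable"

# Start by sending ~10% of traffic to the new model, then ramp up
# (e.g., 0.1 -> 0.25 -> 0.5 -> 1.0) while monitoring stays healthy.
served = [route_request(0.1) for _ in range(1000)]
```

In a managed setting, the load balancer or endpoint configuration performs this weighting for you; the sketch only shows the decision each request goes through.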
A/B Testing
Here are some characteristics of A/B testing:
➤➤ Controlled variables: You split the users into two groups in a random fashion and then test them with
the old and new versions of the model to compare their performance. The first group is called the con-
trol group and is served the old model, and the other group is called the treatment group and is served
the new model.
➤➤ Statistical relevance: The benefit of this approach is that it allows you to make data-driven decisions. It
allows you to test different features and configurations.
➤➤ Planning: It takes a lot of planning to avoid biases. It will need to run for a substantial amount of time
or need a substantial amount of traffic to reach statistical significance.
➤➤ Targeting: It is also difficult to target the right set of users and therefore understand the full impact of
the model.
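To decide when an A/B test has reached statistical significance, a two-proportion z-test is one common choice. This plain-Python sketch (with made-up conversion counts) shows the calculation; treat it as a simple formulation rather than a full experimentation framework:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test: returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts: control (old model) converts 200/2000,
# treatment (new model) converts 260/2000
z, p = two_proportion_z(200, 2000, 260, 2000)
significant = p < 0.05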
352 ❘ CHAPTER 18 Deploying Your Models Into Production
Shadow Deployment
With shadow deployment, you have the old and new versions running side by side.
➤➤ Data synchronization: While the old version serves the users, the new version is used purely for testing
and analysis. Both versions take in the same input data.
➤➤ Real-life testing: The advantage of this method is that you can do real-life testing in a real-life environ-
ment, which involves minimal downtime, is scalable, and can be rolled back in the case of an issue.
➤➤ Additional complexity: Maintaining two models may require additional resources and is more complex
to implement.
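The side-by-side pattern can be sketched as a wrapper that always serves the live model and only logs the shadow model's output for offline comparison; the stand-in models below are illustrative:

```python
def shadow_predict(live_model, shadow_model, features, log):
    """Serve the live model's prediction; record the shadow model's
    prediction on the same input for offline analysis only."""
    live_out = live_model(features)
    try:
        shadow_out = shadow_model(features)  # must never affect the user
        log.append({"input": features, "live": live_out,
                    "shadow": shadow_out, "agree": live_out == shadow_out})
    except Exception as exc:  # shadow failures are logged, not surfaced
        log.append({"input": features, "live": live_out, "error": str(exc)})
    return live_out

# Illustrative stand-ins for a deployed model and a candidate model
old = lambda x: int(x["score"] > 0.5)
new = lambda x: int(x["score"] > 0.6)
audit = []
result = shadow_predict(old, new, {"score": 0.55}, audit)
```

The key design point is the try/except around the shadow call: a broken candidate model can never degrade the user-facing response.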
TIP Regardless of the deployment strategy used, always ensure that the infrastructure
is robust and scalable and that you have well-defined rollback plans.
Consider these factors when choosing between real-time and batch inference:
➤➤ Business requirement: Real-time inference suits cases that require instant feedback (e.g., fraud detection); batch inference suits cases where the result isn’t immediately necessary (e.g., monthly reports).
➤➤ Infrastructure and cost: Real-time needs robust infrastructure for 24/7 uptime and scalability and can be more expensive; batch is tuned for large-scale processing and is generally more cost-effective.
➤➤ Latency: Real-time serves decisions that must be quick (e.g., algorithmic trading); batch can tolerate delayed results (e.g., daily analytics).
➤➤ Data consistency: Real-time handles sporadic or inconsistent streaming data; batch deals with consistent, structured datasets.
➤➤ Error handling: Real-time needs instant alerts and quick correction mechanisms; batch requires checkpoints and logging for post-facto error detection.
➤➤ Resource allocation: For real-time, high availability and redundancy are critical; batch needs significant storage and compute during processing.
➤➤ Hybrid approaches: The two can be combined, pairing immediate predictions with deeper, comprehensive insights.
TIP Batch inference is not valuable for use cases where real-time predictions are essential.
[FIGURE: The inference pipeline. Step 02, data preprocessing: cleanse and format data for model input. Step 04, model inference: make predictions using the model.]
Data Collection
During this step, the data from various sources, such as IoT sensors, is collected and stored in data storage. How-
ever, this data is not ready yet for use in the model.
Data Preprocessing
During the data preprocessing step, the incoming data is cleansed and converted into a suitable format to be fed
into the model. The data may need to be resized, normalized, standardized, or augmented. It is an essential step
because it will determine the accuracy with which the model predicts, as it depends on the accuracy and consist-
ency of the input data.
Feature Extraction
The pipeline extracts the relevant features from the input dataset during this step. Again, this is an important step
that ensures accuracy in the model performance. For example, in the case of image processing, this may involve
edge detection, object segmentation, and other techniques to extract meaningful information from the raw image.
Model Inference
This is the most critical part of the inference pipeline and is the core because this is where the actual prediction
happens. This step may provide meaningful insights to drive subsequent actions.
TIP Use optimized models for faster inference and retrain models as necessary.
Post-processing
The predictions are converted into the format for other downstream applications during this step.
You can see that implementing an inference pipeline can improve efficiency, latency, and accuracy in real time.
However, it can be a complex process that requires careful planning, design, and implementation.
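The stages just described — preprocessing, feature extraction, inference, and post-processing — can be chained into a single pipeline object. This plain-Python sketch uses toy stand-in stages; the field names and the averaging "model" are illustrative only:

```python
class InferencePipeline:
    """Chain the pipeline stages; each stage is a plain function."""

    def __init__(self, preprocess, extract_features, model, postprocess):
        self.stages = [preprocess, extract_features, model, postprocess]

    def run(self, raw_record):
        out = raw_record
        for stage in self.stages:
            out = stage(out)
        return out

# Illustrative stages for a toy numeric record
pipeline = InferencePipeline(
    preprocess=lambda r: {k: float(v) for k, v in r.items()},   # cleanse/format
    extract_features=lambda r: [r["temp"], r["humidity"]],      # pick features
    model=lambda f: sum(f) / len(f),                            # stand-in model
    postprocess=lambda y: {"prediction": round(y, 2)},          # downstream format
)
result = pipeline.run({"temp": "21.5", "humidity": "0.4"})
```

Because each stage is an independent function, you can test, version, and swap stages separately, which is exactly why production pipelines are built this way.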
➤➤ Collect and preprocess data (Data collection team): Gather data from various sources and preprocess it to make it suitable for the model (e.g., resizing, normalizing). Deliverable: preprocessed data ready for feature extraction.
➤➤ Test the pipeline (Testing and QA team): Evaluate the entire inference pipeline using real-time data to ensure it functions as expected. Deliverable: a fully functioning inference pipeline validated with real-time data, including a testing report.
NOTE Sometimes even a minor change in the architecture configuration can result in differ-
ent inferences in different environments.
You must also ensure the model shows the same accuracy range across development, staging, and production
environments. Any model that fails to show this level of accuracy should not be deployed in production. It
becomes imperative as you scale your operations across multiple geographies and product lines.
➤➤ Identify bias metrics (Data science team): Define and select the specific metrics that will be used to measure and identify bias in the training data. Deliverable: a list of selected bias metrics tailored to the specific use case and data.
➤➤ Run bias analysis (Machine learning engineers): Execute the bias detection process using SageMaker Clarify, analyzing the training data against the defined metrics. Deliverable: raw output of the bias analysis, including identified biases and related statistical measures.
➤➤ Analyze bias results (Data science and analytics team): Interpret the results of the bias analysis, understanding the underlying causes and potential impact on model performance. Deliverable: an in-depth analysis report highlighting the findings, causes, and potential remedies for detected biases.
➤➤ Implement remediation strategies (Data science team): Based on the analysis, apply mitigation strategies to reduce or eliminate the identified biases in the training data. Deliverable: revised training data and model training approach, minimizing the identified biases.
MLOps Automation: Implementing CI/CD for Models
TIP Implementing CI/CD can introduce a lot of efficiency and scale, especially when
managing large enterprise AI implementations with many models to manage.
Goal: Automate the process of training, testing, and deploying a model using Jenkins.
Write Scripts Develop scripts for training, testing, and Development and
deploying the model as part of the CI/ data science team
CD process.
Configure Integrate the written scripts into the Jenkins DevOps and
the Pipeline pipeline and set triggers, parameters, etc. configuration team
SUMMARY
This chapter discussed the intricate process of deploying models. It started with an introduction to model
deployment followed by considerations and challenges in model deployment. It then discussed choosing the right
deployment option from cloud to on-premises, edge, and hybrid setups, and choosing an appropriate deployment
strategy, including blue/green, canary, A/B testing, and shadow deployments. It addressed the choice between real-
time and batch inference and discussed the pros and cons of each approach. It also discussed other deployment
best practices such as synchronizing architecture and configuration across environments, striving for identical
performance between training and production, and generating insightful governance reports.
In the next chapter, let’s discuss how to monitor models for data quality, data drift, concept drift, model perfor-
mance, bias drift, and feature attribution drift.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. Which of the following is NOT a common challenge in model deployment?
A. Models may not scale well to handle large data volumes.
B. Lack of integration with backend processes can create data silos.
C. Models become more explainable and interpretable over time.
D. Reproducing models for retraining and redeployment can become difficult due to changes in
dependencies.
2. Which deployment option can be the most complex because managing and updating models across mul-
tiple devices can take time and effort?
A. Cloud deployment
B. On-premises deployment
C. Edge deployment
D. Hybrid deployment
3. Which of the following is a disadvantage of real-time inference?
A. It is not scalable for large datasets.
B. It requires more resources and is expensive to implement.
C. It is not valuable for use cases where real-time predictions are essential.
D. Predictions can’t be stored in a database for later access.
4. Which step in an inference pipeline converts predictions into a format suitable for other downstream
applications?
A. Data collection
B. Data preprocessing
C. Feature extraction
D. Post-processing
5. What do TensorFlow Extended (TFX), Azure Machine Learning, and Google Cloud AI Platform have
in common?
A. They are open-source platforms.
B. They are end-to-end platforms to build, train, and deploy machine learning models.
C. They are specifically designed for version control.
D. They are data collection tools used in the data preprocessing stage of model development.
ANSWER KEY
1. C 3. B 5. B
2. C 4. D
19
Monitoring Models
What gets measured gets managed.
—Peter Drucker
This chapter deals with the critical components of monitoring, including making informed decisions
around real-time and batch monitoring. It delves into checking the health of model endpoints, selecting
appropriate performance metrics, and ensuring model freshness. To ensure the AI systems remain agile and
responsive, you learn how to review and update features, automate endpoint changes, and scale on demand
(see Figure 19.1).
FIGURE 19.1: DEPLOY AND MONITOR MODELS: Monitor and secure models.
MONITORING MODELS
Monitoring models is an essential aspect of MLOps to ensure that the model performs well and that the data
used is high quality.
TIP It is important to monitor the models for any changes and update them with new data,
algorithms, or any other changes required.
Addressing these challenges requires a mix of monitoring strategies, tools, infrastructure, and analysis techniques,
covered in this chapter.
Key Strategies for Monitoring ML Models
TIP Do a phased rollout by starting with shadow deployments, and then move to canary
before a full-scale deployment.
While monitoring data quality involves ensuring that the data used to train and deploy machine learning models
is of high quality, monitoring for data drift involves detecting changes in the distribution of data used to train and
deploy machine learning models.
Data drift can be caused by changes in how data is collected, processed, or stored. Poor data quality can result
from errors during data entry, data corruption, or data inconsistencies. While poor data quality can lead to inac-
curate results, data drift can lead to decreased model accuracy.
An example of data drift is a model that predicts if a customer will default on a loan. It may have been trained
five years ago based on customers with good credit. But over time, the model performance may decrease because
customers with poor credit ratings have started to apply in more significant numbers, and your data distribution
is no longer the same.
TIP You need to clearly distinguish between data drift and data quality issues because the
resolution is different. For data drift issues, you would need to retrain the model using new
data, while for data quality, you would need to clean the data or remove outliers. You can
use unsupervised learning algorithms to train the model so that it can adapt to the changing
data by changing the hidden structures dynamically.
There are several methods to detect data drift, such as the following (see Figure 19.2):
➤➤ Population stability index (PSI), which compares the expected probability distribution in the present
dataset with that of the training dataset
➤➤ Kolmogorov-Smirnov test
➤➤ Kullback-Leibler divergence
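As an illustration of the first method, here is a minimal PSI sketch in plain Python. The equal-width binning, the 1e-6 floor for empty bins, and the 0.25 drift threshold mentioned in the comment are common choices, not fixed rules — tune them for your data:

```python
import math

def psi(expected, actual, bins=10):
    """Population stability index between a baseline (training) sample
    and a current sample, using equal-width bins over the combined range."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions give a PSI near 0; a common rule of thumb
# (an assumption, not a standard) flags PSI > 0.25 as major drift.
baseline = [i / 100 for i in range(100)]
drifted = [0.5 + i / 200 for i in range(100)]
baseline_psi, drift_psi = psi(baseline, baseline), psi(baseline, drifted)
```

The Kolmogorov-Smirnov test and Kullback-Leibler divergence follow the same pattern: compare a baseline distribution against the live distribution and alert when the distance crosses a threshold.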
TIP Implement a comprehensive monitoring system that automates the monitoring process
to ensure models work accurately.
TIP Plan for periodic retraining of models when significant concept drift is detected.
TIP To detect bias drift, you must measure metrics such as demographic parity, equal oppor-
tunity, and equal odds.
SageMaker Clarify can help detect bias in the models and is also integrated with SageMaker Model Monitor,
which can help detect data and concept drifts.
TIP To monitor feature attribution drift, you must track metrics such as featuring impor-
tance score and feature correlation.
As an example of feature attribution drift, consider a customer churn model: the number of years a customer
has been with the company may have been the most critical feature, but when a competitor introduces a new
product, customers may churn regardless, and tenure will no longer be a significant factor.
While it is true that implementing a comprehensive monitoring system can be very intensive, it can automate the
monitoring effort and save you a lot of manual effort and time to ensure the models are working accurately.
TIP SageMaker Model Monitor can help detect data and model drifts in real time by evalu-
ating the inputs and outputs and sending you alerts.
Model Explainability
Model explainability is about explaining how the model makes its predictions so that it can be trusted. It is essen-
tial as it helps you understand the root causes of a specific prediction and avoid biases. You need to know how
each feature contributes to the prediction and whether the prediction is in line with the existing domain knowl-
edge. It requires measuring metrics related to model quality, fairness, and drift.
TIP Where possible, opt for models that are inherently more interpretable. Leverage tools
like LIME or SHAP to understand complex model behaviors.
Classification Metrics
Chapter 17 discussed classification metrics, including accuracy, precision, recall, F1 score, AUC-ROC,
and the confusion matrix.
[FIGURE: Metric selection decision flow — start by identifying the problem type.]
Regression Metrics
Consider these regression metrics:
➤➤ Mean absolute error (MAE): This gives an idea of how wrong the predictions were and is calculated by
taking the average of the absolute differences between the predicted and actual values.
➤➤ Mean squared error (MSE): This is similar to MAE except that the differences are squared before averaging, which accentuates larger errors more than MAE does.
Tools for Monitoring Models ❘ 367
➤➤ Root mean squared error (RMSE): As the name suggests, this is the square root of the MSE.
➤➤ R-squared: Also known as the coefficient of determination, this is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
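The four regression metrics above can be computed in a few lines. This is an illustrative, dependency-free sketch; in practice you would likely use a library such as scikit-learn:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: like MAE, but larger errors weigh more."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variation."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot
```

For example, a constant prediction of 2 against actuals [1, 2, 3] yields an R-squared of 0, meaning the model explains none of the variation.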
Clustering Metrics
The adjusted Rand index and silhouette coefficient are two clustering metrics.
➤➤ Adjusted Rand index (ARI) measures the similarity between two clusterings of the same data.
➤➤ Silhouette coefficient identifies a model that has better-defined clusters. It also provides a visual way to assess clusters.
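As a sketch of how ARI works, here is a minimal pure-Python implementation based on the standard contingency-table formula. It assumes both labelings cover the same items and that the partitions are not both trivial (a degenerate case that would divide by zero):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two cluster assignments of the same items.

    1.0 means identical partitions (up to relabeling); values near 0
    indicate agreement no better than chance.
    """
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency table cells
    a = Counter(labels_a)                      # row sums
    b = Counter(labels_b)                      # column sums
    index = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Note that ARI is invariant to cluster labels: swapping the names of two clusters leaves the score unchanged.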
Ranking Metrics
Precision@k and mean average precision are two metrics used to measure rankings.
➤➤ Precision@k measures how many of the first k recommendations are in the set of true positive items.
True positive is when the actual and predicted values are the same.
➤➤ Mean average precision (MAP) is used to measure the ranking of the recommended items.
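Both ranking metrics are short to implement. The following illustrative sketch assumes recommendations are ordered lists and relevant items are given as sets:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are truly relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def mean_average_precision(all_recommended, all_relevant):
    """Mean over queries of average precision, which rewards placing
    relevant items near the top of the ranking."""
    ap_scores = []
    for recommended, relevant in zip(all_recommended, all_relevant):
        hits, precisions = 0, []
        for i, item in enumerate(recommended, start=1):
            if item in relevant:
                hits += 1
                precisions.append(hits / i)   # precision at each hit position
        ap_scores.append(sum(precisions) / len(relevant) if relevant else 0.0)
    return sum(ap_scores) / len(ap_scores)
```

For instance, recommending ["a", "b", "c"] when {"a", "c"} are relevant gives precision@2 of 0.5 and an average precision of 5/6.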
TIP Set reasonable ranges for metrics, and when the performance falls out of range, trigger a
review or retraining process.
NOTE If the data under consideration is extensive, real-time monitoring may not be an
option. Real-time monitoring is also more expensive.
Open-Source Libraries
In this category, you have the following options:
➤➤ Evidently, and Prometheus with Grafana, can monitor model performance, data, and fairness.
➤➤ Deequ, an open-source library from Amazon, is a data quality validation library built on Apache Spark.
➤➤ TensorBoard tracks TensorFlow models.
Third-Party Tools
A number of third-party tools such as Whylabs, Arize, Datadog, Monte Carlo, Labelbox, and Seldon are worth
checking out.
TIP Choose cloud provider tools for scale, open-source tools for customizability, and third-
party tools for an end-to-end integrated solution.
Setting up monitoring involves five steps:
1. Decide on the model metrics: Monitor infrastructure health and endpoint responses, and track model versions and data.
2. Set up the thresholds: Establish statistical, business, and adaptive thresholds.
3. Employ a monitoring service: Use monitoring services like Amazon CloudWatch and AWS Cost Profiler to track metrics.
4. Set up alerts: Configure alerts for issue detection.
5. Conduct periodic reviews: Perform regular checks and adjust metrics and thresholds.
NOTE You can use Amazon SageMaker Model Monitor to monitor the quality of the
models and send metrics data to CloudWatch.
➤➤ Monitor the health of the endpoint responses. The metrics used here are the accuracy of the model responses and the number of errors in them. You can use Amazon SageMaker Clarify to identify bias and feature attribution drift.
➤➤ Track and trace these metrics to the model’s version or the data used for these predictions.
Setting Up Alerts
You should also set up alerts to be notified in case of an issue, such as increased memory utilization, a sudden
increase in latency, and so on. Integrate them with other notification systems such as emails, SNS, SMS, and
even Slack.
TIP Ensure the alerts also have the necessary information as to what went wrong and pos-
sibly the root cause so that you can act on them.
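As a sketch of alerts that carry context, the following illustrative helper compares observed metrics against allowed ranges and builds alert payloads that include the value, the violated range, and a hint at possible causes. The metric names and thresholds here are hypothetical examples, not values from the text:

```python
def check_metrics(metrics, thresholds):
    """Compare observed metrics against (low, high) ranges and build
    alert payloads that say what went wrong, not just that it did."""
    alerts = []
    for name, value in metrics.items():
        low, high = thresholds.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            alerts.append({
                "metric": name,
                "value": value,
                "allowed_range": (low, high),
                "message": (
                    f"{name}={value} is outside [{low}, {high}]; "
                    "check recent deployments and input data drift."
                ),
            })
    return alerts

# Illustrative thresholds; tune these to your own baselines.
alerts = check_metrics(
    {"latency_ms": 950, "error_rate": 0.01},
    {"latency_ms": (0, 500), "error_rate": (0, 0.05)},
)
```

The resulting payloads can then be forwarded to SNS, email, or Slack so responders see the diagnosis alongside the notification.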
By integrating the endpoint into the change management system, you can track every time a new model is
deployed to the endpoint. When you face a problem during a deployment, you can roll back to the older version.
This process is not only automatic, but it also reduces downtime. You could use a change control system such as
Git and CI/CD pipeline management tools like Jenkins, GitLab, or AWS CodePipeline.
TIP By automating model deployment to the endpoint, you can eliminate errors and reduce downtime, which will be reflected in the improved accuracy of your model, thus helping you achieve your business objectives.
[Figure: building resilience: model development, training, and deployment with a SageMaker pipeline; collaboration and resource management with SageMaker Projects; and implementing a recoverable SageMaker endpoint.]
TIP Plan for scale, prepare for the unexpected, and leverage tools to automate pipelines, control versions, centralize components, track lineage, and adopt autoscaling.
The field of MLOps is complex and requires a combination of technical skills, strategic thinking, and effective implementation. This chapter contains a set of hands-on exercises that cater to different aspects of a machine learning pipeline, reflecting real-world scenarios. From setting up alerts and monitoring dashboards to automating deployment and ensuring scalability, these exercises reflect most MLOps responsibilities. It is vital that the different task owners in these exercises collaborate with each other to deploy these configurations. Of course, in your specific organization, these titles may vary, and therefore I recommend you tailor them accordingly.
This section delves into reviewing features periodically, implementing model update pipelines, and keeping models fresh with a scheduler pipeline (see Figure 19.6).
[Figure 19.6: the scheduler pipeline: an offline feature pipeline feeds a batch feature store; the alarm manager fetches artifacts (model/data quality, model explainability, model bias); and the scheduler triggers the retrain pipeline, which registers models in the model registry.]
For example, the patient’s health may depend on several factors. However, these factors can change with time,
a phenomenon known as concept drift. There may have been a disease outbreak that caused a change in these
factors. Under such situations, you need to retrain your model; this is where a scheduler pipeline can greatly help.
The scheduler pipeline retrains the model at predefined business intervals, which helps keep the model always
reflective of the latest data patterns.
The data preparation pipeline gets activated when the scheduler pipeline activates the retraining process. The data
preparation pipeline collects and cleans the new data for use in a model.
At the same time, the feature pipeline gets activated, which extracts the important or relevant features from the
prepared data to feed into the model.
Once the model is retrained with the new features, the CI/CD/CT pipeline gets activated. During this process, the trained model is integrated with the rest of the system, deployed to the production environment, and then tested to ensure its functionality and performance are intact.
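The scheduler-driven retraining flow described above can be sketched as two small functions. The stages here are stand-ins for real data-preparation, feature, training, and CI/CD/CT steps, and the 30-day interval is an arbitrary example, not a recommendation from the text:

```python
import datetime

def due_for_retraining(last_trained, now, interval_days=30):
    """The scheduler's core check: has the retraining interval elapsed?"""
    return (now - last_trained) >= datetime.timedelta(days=interval_days)

def run_retraining_cycle(raw_data):
    """Chain the pipelines described above; each stage is a placeholder
    for a real data-prep, feature, training, or CI/CD/CT step."""
    cleaned = [row for row in raw_data if row is not None]        # data preparation
    features = [{"tenure": row["tenure"]} for row in cleaned]     # feature pipeline
    model = {"trained_on": len(features)}                         # (re)training
    return {"model": model, "deployed": True, "tests_passed": True}  # CI/CD/CT
```

In a real deployment, the schedule check would be handled by EventBridge, cron, or a pipeline scheduler rather than called inline.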
TIP A scheduler pipeline ensures that the model is constantly kept up-to-date even if the
data or model drift happens.
SUMMARY
This chapter discussed monitoring models and the key components of monitoring, such as monitoring data qual-
ity, data drift, concept drift, model performance, bias drift, and feature attribution drift. It discussed real-time
versus batch monitoring and the tools for monitoring models.
To ensure resilience and recovery, the chapter also covered advanced topics such as automating endpoint changes
through a pipeline, implementing recoverable endpoints, and autoscaling to handle variable workloads.
At this point, you should be equipped with the knowledge and strategies to not just deploy but also to monitor
your models to ensure reliability, scalability, and performance. Needless to say, this is just the beginning of your
learning journey, but I hope this chapter has given you a good start.
In the next chapter, let’s deal with model governance nuances such as ethics, bias, fairness, explainability, and
interpretability.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. What is the main purpose of monitoring model performance?
A. To ensure that the model performs above a certain threshold
B. To detect bias in the model’s predictions
C. To monitor the distribution of the data used to train the model
D. To predict future changes in the model’s accuracy
2. Which of the following performance metrics for regression problem gives an idea of how wrong the
predictions were?
A. Mean absolute error (MAE)
B. Mean squared error
C. Root mean squared error
D. R-squared
3. Which clustering metric measures the similarity between two data clusters?
A. Mean average precision (MAP)
B. Adjusted Rand index (ARI)
C. Silhouette coefficient
D. Precision@k
4. What is the concept drift in the context of AI models?
A. The idea that the factors impacting the output of a model may change over time
B. The phenomenon that AI models become outdated if not updated regularly
C. A situation where the model’s accuracy improves over time
D. The occurrence of errors in AI models as a result of excessive retraining
5. How can you identify sensitive data such as personally identifiable information (PII) and intellectual
property in your system?
A. By enabling data access logging
B. By classifying your data using tools such as Amazon Macie
C. By monitoring data to look for any anomalous incidents
D. By using threat detection services
ANSWER KEY
1. A 3. B 5. B
2. A 4. A
20
Governing Models for Bias and
Ethics
In matters of truth and justice, there is no difference between large and small problems, for issues
concerning the treatment of people are all the same.
—Albert Einstein
This chapter dives into governing models, explaining why governance is more than just a regulatory concern and is an integral part of responsible AI. It serves as your guide to nuances such as ethics, addressing bias, ensuring fairness, and making models explainable and interpretable (see Figure 20.1).
[Figure: the Enterprise AI journey map: 01 Strategize and Prepare; 02 Plan and Launch Pilot; 03 Build and Govern Your Team; 04 Set Up Infrastructure and Manage Operations; Deploy and Monitor Models (deploy your models into production, monitor and secure models, govern models for bias and ethics); Evolve and Mature AI; 09 Continue Your AI Journey.]
FIGURE 20.1: DEPLOY AND GOVERN MODELS: Govern models for bias and ethics
378 ❘ CHAPTER 20 Governing Models for Bias and Ethics
Whether you are implementing augmented AI for human review, managing artifacts, controlling costs, or setting
up an entire AI governance framework, this chapter can guide you.
[Figure: chapter topics: addressing fairness and bias in models, and implementing augmented AI for human review.]
TIP Schedule periodic bias audits, preferably by a third party, to ensure an unbiased review.
➤➤ Analyze for bias: Utilize techniques such as statistical tests to identify any inherent biases. (Owner: Data scientist. Deliverable: Bias analysis report.)
Model explainability is about trying to make sense of the model’s outputs, irrespective of whether the model is
interpretable. Model explainability especially comes into play when dealing with neural networks where it is
simply impossible to interpret how the model is working due to its complexity.
For model explainability, you can use feature importance scores, partial dependence plots, or more advanced
methods like Local Interpretable Model-Agnostic Explanations (LIME) or Shapley Additive Explanations (SHAP).
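Alongside LIME and SHAP, a simpler model-agnostic technique is permutation feature importance: shuffle one feature's column and measure how much a metric degrades. This illustrative, dependency-free sketch assumes a classifier exposed as a plain predict function; the API shown is my own, not from any particular library:

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Importance of feature j = drop in accuracy after shuffling column j.

    Model-agnostic: works with any predict(row) callable, which is what
    makes the same idea useful for opaque models like neural networks.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the feature-target relationship
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(permuted))
    return importances
```

A feature the model ignores scores exactly zero, while shuffling a decisive feature produces a large accuracy drop.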
TIP Consider setting up a central repository to store model decision explanations that can
be referenced later during audits or inquiries.
TIP Regularly discuss with ethicists or an ethics review board and get their sign-off before
deploying a model in production.
Operationalizing Governance ❘ 381
[Figure: pillars of responsible AI governance: bias and fairness; transparency and explainability; inclusivity; ethical considerations; data privacy and security; accountability; and robustness and reliability.]
TIP Leverage augmented AI to provide valuable feedback to the AI team that built the
model to improve the model parameters when needed.
OPERATIONALIZING GOVERNANCE
The previous section discussed different strategies to ensure fairness in your models. This section dives into some of the best practices to operationalize governance across the model's lifecycle, from creation to monitoring. It involves tracking the models, including their owners, lineage, versions, artifacts, and costs, and establishing an ethics governance framework.
TIP Store all tracking information in a central repository for access by others for transpar-
ency and ease of governance.
Use tools like MLflow, Data Version Control (DVC), and even Git for version control.
➤➤ Version management and rollback: Maintaining the versions of these artifacts and their relation-
ships will help you to roll back to a previous version, which will be handy for troubleshooting and
auditability.
➤➤ Performance monitoring: When the model performance starts to dwindle, you can try to compare differ-
ent versions to understand what changed that caused this performance degradation.
➤➤ Governance and compliance: It is important to keep different versions of the artifacts simply for govern-
ance and compliance reasons so that we can provide the required documentation to the regulators.
➤➤ Cost savings and efficiency: By implementing an efficient artifact management process, we can remove
or archive old files, avoid duplicate files, save costs, and improve efficiency.
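The version-management and rollback idea above can be sketched as a toy registry: every register() call creates a new version, and rollback() restores the previous one for troubleshooting. Real systems would use MLflow, DVC, or a managed model registry rather than an in-memory class like this:

```python
class ArtifactRegistry:
    """Toy version store for model artifacts and their metadata."""

    def __init__(self):
        self._versions = {}   # artifact name -> list of (version, metadata)

    def register(self, name, metadata):
        """Append a new version and return its version number."""
        history = self._versions.setdefault(name, [])
        version = len(history) + 1
        history.append((version, metadata))
        return version

    def latest(self, name):
        """Return the (version, metadata) pair currently in effect."""
        return self._versions[name][-1]

    def rollback(self, name):
        """Discard the current version and restore the previous one."""
        history = self._versions[name]
        if len(history) < 2:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        history.pop()
        return self.latest(name)
```

Keeping metadata (such as evaluation metrics) alongside each version is what makes the performance comparisons described above possible.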
TIP Organize periodic knowledge sharing sessions to familiarize teams with existing models
and artifacts.
[Figure: artifacts to manage: feature engineering logs, model evaluation, parameters and metrics, configuration, and datasets.]
TIP Be inclusive of stakeholders from various departments to get a holistic view when draft-
ing an ethics policy.
Goal: Promote transparency, governance, compliance, quality, trust, and alignment with
organizational objectives through a model governance framework.
SUMMARY
This chapter delved into the governance of models. Ethical considerations were emphasized, and you explored
the topics of fairness, bias, explainability, and interpretability, as well as the implementation of augmented AI for
human review. You learned about model tracking, from ownership to version control and lineage, and discussed
managing model artifacts and controlling costs through tagging. The chapter concluded with practical steps to
establish a robust model governance framework.
REVIEW QUESTIONS
1. Which of the following services is useful for detecting threats like data leaks or unauthorized access?
A. Amazon S3
B. Amazon Macie
C. Amazon GuardDuty
D. AWS CodeCommit
2. What is the primary difference between model explainability and interpretability?
A. Explainability is about understanding the model’s inner workings, while interpretability is about
making sense of the model’s outputs.
B. Explainability is about making sense of the model’s outputs, while interpretability is about under-
standing the model’s inner workings.
C. There is no difference; they mean the same thing.
D. Both concepts do not exist in model development.
3. Which of the following is NOT typically considered an artifact to be stored and managed in machine
learning operations?
A. The preprocessed dataset used for training
B. Evaluation metrics to evaluate model performance
C. Temporary files generated during data cleaning
D. Model configuration, such as the hyperparameters
4. What is the first step in setting up a model governance framework?
A. Establish review processes
B. Implement continuous monitoring
C. Define ownership
D. Conduct periodic audits
ANSWER KEY
1. C 3. C
2. B 4. C
PART VIII
Scaling and Transforming AI
You have built it, so now you can make it even bigger! In Part VIII, I present a roadmap to scale your
AI transformation. I will discuss how to take your game to the next level by introducing the AI maturity
framework and establishing an AI center of excellence (AI COE). I will also guide you through the process
of building an AI operating model and transformation plan. This is where AI transitions from a project-
level to an enterprise-level powerhouse.
21
Using the AI Maturity
Framework to Transform Your
Business
The greatest danger in times of turbulence is not the turbulence—it is to act with yesterday’s logic.
—Peter Drucker
Having laid the foundation blocks for integrating AI into your business workflows, you will now venture
into a pivotal phase: scaling and transforming your business through AI.
This chapter signifies a transformative shift from merely deploying AI systems to aspiring to become a
genuinely AI-first company. But then, how do you know you are ready to make the transition? Enter the AI
Maturity Framework (see Figure 21.1).
[Figure: the Enterprise AI journey map: 01 Strategize and Prepare; 02 Plan and Launch Pilot; 03 Build and Govern Your Team; Scale and Transform; Continue Your AI Journey.]
FIGURE 21.1: SCALE AND TRANSFORM: Use the AI Maturity Framework to transform your business
392 ❘ CHAPTER 21 Using the AI Maturity Framework to Transform Your Business
As you navigate this chapter, you will find the AI Maturity Framework to be an indispensable tool. This chapter
walks you through the five stages of maturity, starting with the initial discovery, then pilot, operationalization,
scaling, and culminating at the Transforming stage, where AI is not an addition but is core to your business strat-
egy and operations.
TIP Understanding your company's vision and goals so you can conduct a maturity assessment helps identify gaps and enables you to create an AI transformation plan. This, in turn, helps you move from pilot to production and makes AI a core part of your business strategy.
This chapter focuses on leveraging the AI Maturity Framework, a vital tool tailored to guide your business
through the transformational journey of AI integration. By understanding where your organization stands in its
AI maturity and recognizing the stages ahead, you can strategically plan, execute, and optimize your AI-driven
solutions.
These are interrelated, as you will learn after reading the rest of this chapter. The AI COE is responsible for build-
ing the AI operating model, a framework for adopting AI. The AI COE is also responsible for developing the AI
transformation plan, which is used to implement the AI operating model. See Figure 21.2.
[Figure 21.2: the AI center of excellence, comprised of subject matter experts for everything AI, builds the AI operating model (a framework for adopting AI) and the AI transformation plan (a plan to implement the operating model and scale AI adoption across the enterprise).]
This section explains how to set up an AI center of excellence to implement AI at the enterprise level. Typically,
companies face many challenges during the first six months of setting up an AI COE, so following some of the
best practices outlined in this section is wise.
NOTE Setting up an AI COE is a strategic endeavor and needs careful thought and leader-
ship support.
[Figure: a sample AI maturity self-assessment radar chart scoring six dimensions (Strategy & Planning, People & Change Management, Platforms & Operations, Security, Governance, and Data/Models) on a 0 to 100 scale.]
Note that AI is more than just a technology; it is a new way of thinking and operating. To succeed with AI,
organizations must become digital at the core.
TIP To succeed with AI, you must measure AI maturity continuously and address gaps to
become more data-driven and AI-enabled, ultimately leading to new products, services, and
business models.
Measuring AI maturity and addressing the gaps will help you assess and scale your business impact from AI
systems. As you must have gathered from some of the case studies I presented earlier in Chapter 2, your initial
focus will be improving your operations to achieve your business goals. Eventually, your focus will move toward
making AI a core part of your organizational strategy, ultimately leading to defining new products, services, and
business models.
It is essential to make progress across all dimensions; otherwise, it's not possible to make good progress overall with AI. For example, even if you make good progress with your platform, people, and operations, you may not succeed if you lack a strong business case for AI.
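The point that overall maturity is capped by the weakest dimension can be sketched as a small gap-analysis helper. The dimension names and scores below are illustrative (loosely echoing the sample assessment chart), and the target of 80 is an arbitrary choice, not a figure from the text:

```python
def maturity_gaps(scores, target=80):
    """Rank dimensions by how far they fall short of the target.

    Overall maturity is taken as the minimum score, echoing the point
    that lagging in one dimension limits progress everywhere.
    """
    gaps = {dim: max(target - score, 0) for dim, score in scores.items()}
    ranked = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
    overall = min(scores.values())
    return overall, [dim for dim, gap in ranked if gap > 0]

# Hypothetical self-assessment scores across the six dimensions.
scores = {
    "Strategy & Planning": 90,
    "People & Change Management": 70,
    "Platforms & Operations": 60,
    "Data/Models": 60,
    "Governance": 90,
    "Security": 70,
}
```

Running the helper on these scores surfaces Platforms & Operations and Data/Models as the priority gaps to address first.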
➤➤ Understanding AI: Organizations are just beginning to hear about AI and trying to understand how it can help them.
➤➤ AI adoption plans: Organizations have yet to make any clear plans to adopt AI.
➤➤ Business use cases: This is typically handled by those focused on building business use cases and understanding costs and benefits.
During the next stage, organizations become good at identifying promising AI opportunities and start putting together a roadmap to define AI solutions.
A retail company identified hundreds of potential AI projects but, upon deeper analysis, found that most of them were addressable through traditional programming options. They evaluated the remaining use cases for feasibility, viability, and desirability. This process eventually helped them develop a strategy and a vision for AI, setting them up for the next stage, the Pilot stage.
The AI Maturity Framework ❘ 395
TIP Use the AI Maturity Framework to conduct a self-assessment, perhaps annually, biannually, or as needed, to identify areas of strength and improvement.
➤➤ Initial experimentation: The organization has just started experimenting with AI and may have done a proof of concept but has yet to put it into production.
➤➤ Identifying blocks and enablers: The organization's focus is on identifying the blocks and enablers of AI.
➤➤ Identifying potential projects: The organization is trying to identify which projects can be put into production and how it will measure success.
An insurance company wanted AI to help their claims processing division scale. They implemented a proof of concept using deep learning optical character recognition and predictive techniques to improve the process. They curated test data and defined metrics to evaluate the project's success. They eventually developed a model that improved their processing rates and saved costs, enabling them to work toward putting it into production, which is the next stage: operationalizing.
TIP Post assessment, chart out a roadmap for the next phase, but be ready to pivot based on
your progress and external landscape.
➤➤ Focus on business outcomes: The focus is on achieving business outcomes for business impact. There are clear business cases with agreed-upon metrics and adequate processes and tools to ensure the responsible use of AI.
➤➤ Capture lessons learned: Lessons learned to improve operations and technology are captured; for example, refining the data strategy and identifying improvements to AI operations and technology are included as follow-up action items for the next iterative plan and execution.
➤➤ Get leadership sponsorship: More complex use cases require leadership sponsorship for budget and mandates, and for ensuring the responsible use of AI.
A healthcare company wanted to reduce the number of patients missing appointments, as it was causing a lot of operational and financial issues. They worked with administrative staff and doctors to come up with the requirements for the model, and a proof of concept reduced the number of missed appointments by 25 percent. As they gathered metrics on the model's performance and continued to improve it, they became satisfied with its performance and confident enough to deploy it into production.
➤➤ AI solution deployment: By now, the organization has deployed at least one AI solution in production and is looking to deploy multiple solutions with positive ROI.
➤➤ Component and model reuse: As more solutions are deployed, there is a greater opportunity for component and model reuse, and organizational alignment is important.
➤➤ Operational maturity: In the final stages, the organization has achieved maturity in development and deployment operations.
➤➤ Ethics framework: The organization has developed an ethics framework for the responsible use of AI across the company.
➤➤ C-level support and alignment: The organization has obtained C-level support for multiple AI deployments and achieved organizational alignment.
[Figure: the Optimizing stage: production deployment; C-level support obtained and organizational alignment achieved; opportunity for component and model (artifact) reuse; and matured operations.]
➤➤ New products and services development: They are beginning to develop new products, services, and business models.
➤➤ AI in budget decisions: AI is a key factor during budget decisions, and executives make their decisions based on AI-driven data and insights.
➤➤ Organizational silo changes: Organizational silos have been broken down to further integrate the teams across data, infrastructure, operations, and business functions.
➤➤ AI in business strategy and ethics: AI has become fundamental to how business strategy is defined, and the focus is more on deploying AI in an ethical and socially responsible manner.
NOTE Very few organizations have reached this stage. The companies that have reached this
stage either were born with AI from the start, such as Uber and Airbnb, or were digital-first
companies that eventually shifted to become AI-first. These are companies such as Amazon
and Google. Other traditional companies are trying to become digital at their core before
becoming completely transformational through AI.
This exercise will help your implementation team develop a clear understanding of the
opportunity areas to advance your AI maturity level and create a plan to improve it.
The following table provides the steps your team can take from self-assessment to tracking
progress continuously. At the end of this exercise, every team member should have a clear
understanding of their role in advancing your company’s maturity level.
➤➤ Group discussion: Hold a team meeting to discuss the gaps and areas of concern openly to arrive at a consensus. Encourage open dialogue. (Owner: Team leads and department heads. Deliverable: Meeting minutes and identified gaps.)
➤➤ Document action items: Document clear action items or initiatives for those six dimensions to achieve the goals agreed upon. (Owner: Planning and strategy teams. Deliverable: Action items list.)
➤➤ Assign task owners: Assign a team member who will be responsible for each action item or initiative. (Owner: Project manager. Deliverable: List of task owners for each action item.)
➤➤ Progress tracking: Track the progress regularly toward meeting the goals previously agreed upon and make the necessary changes. (Owner: Monitoring and control teams. Deliverable: Progress reports and updates.)
TIP AI maturity is not just about implementing AI projects but is more about getting
maximum value from AI.
➤➤ Discovery: There is no clearly defined strategy for AI. AI is looked upon as a nice-to-have tool. The efforts tend to be either too narrow or too broad, having little to no value to the organization. Internal experts are enthusiastic to try some projects on the side with a view to learning, but with no value to the organization as such. Recommendation: Align the business and technical leaders with the need and value of having an AI strategy.
➤➤ Pilot: Strategy for AI is just beginning to emerge. Some initial use cases are being identified, but there is still no clear understanding of how AI can add value. Some proofs of concept are being explored with limited funding from one business unit or a team. The onus still lies with the project owner to prove the benefits of AI. Recommendation: Try to gather leadership support based on the success of the initial POCs.
➤➤ Operationalizing: A clear strategy for AI has been developed and communicated throughout the organization. There is a shared understanding of what benefits AI can give to the organization, and use cases have been prioritized based on their potential impact. Executive support is available for the effort, allowing the organization to unlock the budget and provide a mandate to execute its strategy. Recommendation: Document the AI strategy for the organization to establish a shared understanding. Gain a budget and secure C-suite sponsorship for AI projects.
➤➤ Optimizing: There's a clearly defined AI strategy for the organization, and that strategy is integrated with the business strategy. There are established processes to manage AI projects. The organization is actively monitoring and measuring the impact of AI on the organization. Recommendation: Align AI strategy with other organizational roadmaps. Discover opportunities to execute across other BUs or functional teams for maximizing business impact.
People
This dimension involves having the leadership support and the skills, roles, profiles, and performance measures
required to both implement and operate AI successfully (see Figure 21.6).
Chapter 7 discussed the importance of leading people and how to go about building a change acceleration strat-
egy, as well as all the different deliverables that you have to deliver during the four phases in an implementation.
It discussed leadership support and defining job roles, responsibilities, and personas. It talked about building the
AI training courses and curriculums, as well as transforming the workforce through a talent program. Table 21.2
covers how to measure the maturity of this dimension along the five stages of a maturity model.
[Figure 21.6: the People dimension across maturity stages. Discovery: awareness of AI potential, unclear roles, need for cross-team collaboration. Pilot: experimentation, collaborative POCs, leadership communication. Operationalizing: emerging roles, defined performance metrics, leadership actively engaged. Accompanying actions: define AI-related roles and responsibilities; establish performance metrics and form communities of influence.]
➤➤ Discovery: Employees are aware of the potential of AI but do not know how to apply it to the business. Organizations do not know how to define roles and responsibilities to implement AI initiatives. Business teams need help from their AI counterparts on how to apply AI to solve business problems, while the data science teams need help from the business to understand the business problems that are suitable for AI solutions. Recommendation: Develop AI literacy for both business and technical teams so they can gain confidence. Obtain help from internal and external AI experts. Encourage collaboration and knowledge sharing between teams.
➤➤ Operationalizing: New roles and responsibilities are emerging. Performance metrics are defined to track the AI performance of employees. Centers of excellence and communities of influence are being established to fill resources for these new roles. Leadership is actively engaged in communicating the vision to the employees to motivate them to participate in the new vision. Recommendation: Clearly define who's responsible for leadership, budget, team structure, and rules. Have clearly defined rewards and recognition programs to recognize excellence in performance, as well as clearly defined performance evaluation programs that acknowledge AI contributions. Encourage centers of excellence and communities of influence to engage employees outside the formal org structure to facilitate collaboration and knowledge sharing.
➤➤ Optimizing: Organizations have clearly defined roles and responsibilities for AI. Communities of influence and centers of excellence have expanded mandates, including reaching out to the external AI ecosystem. There is a well-defined talent strategy to enable learning for employees. Leaders are very actively engaged in facilitating organizations through the change. AI is fully integrated into all business operations, and employees are skilled in working with AI and are actively contributing to the development and optimization of AI solutions. Recommendation: Include the AI organization in strategy discussions at the executive table, with accountability for achieving enterprise-level KPIs for AI. Establish formal learning journeys for those involved in implementing and using AI.
➤➤ Discovery: Platforms either do not exist or are disconnected. Operational processes do not exist. Recommendation: Research new AI platforms and understand the need for CI/CD for AI projects.
➤➤ Pilot: Some tools are in place for AI, and some CI/CD processes exist but are not standardized. Recommendation: Standardize tools and platforms. Establish processes for deploying and monitoring AI.
➤➤ Operationalizing: Platforms and tools are standardized, along with established operational processes that are followed. Recommendation: Continuously monitor the platforms and tools and ensure scalability and reliability of AI deployments.
➤➤ Optimizing: Advanced tools are used to monitor models, and there is seamless integration between platforms and operations. Recommendation: Focus on automation and keep exploring emerging AI platforms, tools, and operations best practices.
Data/Models
In this dimension, you focus on ensuring the data is clean, unbiased, reliable, and available. The models need to
be scalable, accurate, ethical, optimized, and relevant.
Discovery
Characteristics: Data is siloed and of poor quality, and few models, if any, have been developed.
Recommendations: Centralize data storage and set up data-cleaning processes. Start with simple models to understand data patterns better.

Pilot
Characteristics: Data pipelines are in development. Models have been developed in pilots but are not in production yet.
Recommendations: Streamline data pipelines and focus on putting models into production and validating them.

Operationalizing
Characteristics: Data is regularly cleansed, and data pipelines are implemented. Models are in production but need to be monitored.
Recommendations: Focus on achieving consistent data quality and implementing model versioning.

Optimizing
Characteristics: You have a robust data infrastructure. Models are regularly updated and optimized for performance.
Recommendations: Explore A/B testing and automated feature engineering for performance optimization.
Governance
This dimension deals with the ethical use of AI and the alignment of AI projects with the business goals.
Pilot
Characteristics: Initial processes are set but are not enforced or standardized.
Recommendations: Review guidelines regularly and refine them to align with business goals.

Optimizing
Characteristics: Governance is integrated into AI projects, and ethical concerns are paramount.
Recommendations: Ensure governance processes are adapted to changing situations.
Security
This dimension deals with securing the platforms, tools, infrastructure, models, and data.
Discovery
Characteristics: Little to no focus on security.
Recommendations: Understand the need for security and initiate some security measures.

Pilot
Characteristics: Some security measures are in place but are not enforced or standardized.
Recommendations: Prioritize data encryption and access controls. Review potential security issues.

Operationalizing
Characteristics: AI security measures are clear and communicated, but there may still be some security lapses or concerns.
Recommendations: Conduct regular security audits to address gaps, and ensure all personnel are well trained in security.

Optimizing
Characteristics: A robust security infrastructure is in place, and threats are regularly monitored.
Recommendations: Implement advanced security and threat-detection techniques, and be proactive rather than reactive.

Transforming
Characteristics: Security is integrated into the DNA of AI projects. Security protocols are continuously monitored and evaluated.
Recommendations: Adopt the latest security techniques and ensure a security-first culture in all AI projects.
The goal of this hands-on exercise is to help your team assess the maturity level in MLOps
based on various areas such as data management, model development, deployment, monitor-
ing, and team collaboration.
Note that this is an iterative process, and you need to conduct this exercise periodically to
improve over time with effort and focus.
Step: Identify key areas for assessment
Who: Team leads across various disciplines
Activity: Identify the key focus areas in the ML workflow that need to be assessed, for example, data management, model development, model deployment, model monitoring, and collaboration among teams.
Deliverable: A list of key focus areas in the ML workflow
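The assessment itself can be captured in a simple scoring sheet. The following sketch (hypothetical, not from the book) rates each focus area with a maturity stage, Discovery through Optimizing, and summarizes where the team stands:

```python
# Hypothetical scoring helper for the maturity self-assessment exercise:
# each focus area is rated with a stage number (1 = Discovery, 2 = Pilot,
# 3 = Operationalizing, 4 = Optimizing) and the results are summarized.

STAGES = {1: "Discovery", 2: "Pilot", 3: "Operationalizing", 4: "Optimizing"}

def summarize_assessment(scores):
    """scores: dict mapping a focus area to its stage number (1-4)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    lines = [f"{area}: {STAGES[level]}" for area, level in ranked]
    weakest = ranked[0][0]                       # area most in need of focus
    average = sum(scores.values()) / len(scores)
    return lines, weakest, average

# Example ratings for the focus areas named above
areas = {
    "data management": 2,
    "model development": 3,
    "model deployment": 1,
    "model monitoring": 1,
    "team collaboration": 2,
}
lines, weakest, average = summarize_assessment(areas)
```

Re-running the same sheet at each periodic review makes the improvement over time visible.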
SUMMARY
This chapter covered the various steps you take as you move from the Align phase to the Launch phase and
eventually the Scale phase. You used the AI Maturity Framework to evaluate your progress in strategy, people,
data/models, platforms, governance, and security dimensions. It is a comprehensive framework to transform your
business by adopting AI at the enterprise level. The chapter also discussed how you can take your company from
the Launching to Optimizing to Transforming stages by evaluating your progress in these dimensions.
The heart of the chapter was a hands-on exercise to develop an AI maturity plan. It can guide your team from
self-assessment to continuous progress tracking.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. Conducting a company’s AI maturity assessment during the Align phase helps
A. To identify gaps in the company’s vision and goals
B. To assess the company’s AI adoption across various dimensions
C. To create a proof of concept during the Launch phase
D. To optimize the company’s AI efforts by expanding the adoption of AI
2. Identify the dimension in the AI Maturity Framework that is not discussed in this chapter.
A. Strategy and planning
B. Governance
C. Finance and accounting
D. People and change management
3. A company is just beginning to hear about AI and trying to figure out how AI can help them. Which
stage of maturity are they in?
A. Pilot
B. Operationalizing
C. Optimizing
D. Discovery
4. What identifies the Optimizing stage?
A. Improving AI operations and technology
B. Maximizing business impact
C. Deploying multiple AI solutions
D. Developing an ethics framework
5. What identifies the Transforming stage?
A. Improving AI operations and technology
B. Maximizing business impact
C. Deploying multiple AI solutions
D. Developing new products
ANSWER KEY
1. B 3. D 5. D
2. C 4. C
22
Setting Up Your AI COE
Great things in business are never done by one person: they’re done by a team of people.
—Steve Jobs
As you navigate your journey through enterprise AI transformation, you might realize that creating and
deploying models into production is just one step of the larger journey toward becoming AI-first. For your
company to fully transition into enterprise-wide AI, it needs a holistic approach to integrating AI into its
core. And this holistic approach is the crux of this chapter.
This chapter delves into the compelling reasons to establish an AI Center of Excellence (AI COE), one of
the three pillars to scale and transform your business with AI. You learn about its team structure, multifac-
eted roles, and the symbiotic relationship between the COE business and platform teams (see Figure 22.1).
FIGURE 22.1: The enterprise AI journey: (01) Strategize and Prepare, (02) Plan and Launch Pilot, (03) Build and Govern Your Team, Scale and Transform, and Continue Your AI Journey
The AI COE serves as an engine, propelling your organization through its transformation phases, and also acts as
the AI evangelist to promote AI adoption. This chapter dives into how to evolve the AI COE’s role from strategy
to operations and its transition from being a central unit to an enterprise-wide advisor.
These are interrelated, as you will realize after reading the rest of this chapter. The AI COE is responsible for
building the AI operating model, a framework for adopting AI. It is also responsible for developing the AI trans-
formation plan, which is used to implement the AI operating model.
This section explains how to set up an AI center of excellence to implement AI at the enterprise level. Typically,
companies face many challenges during the first six months of setting up an AI COE, so following some of the
best practices outlined in this section is wise.
TIP Before starting an AI COE, ensure everyone is clear about its objectives—to foster inno-
vation, streamline operations, or optimize customer experiences.
The AI COE can also help mitigate the risks associated with AI, such as ethics, social responsibility, bias, and data
privacy concerns.
Finally, it can act as an engine for innovation by promoting a culture of experimentation, helping to identify new use cases and adopt new technologies and approaches.
TIP Clearly communicate the goals, progress, and achievements of the AI COE to the rest of
the organization to build trust and foster a culture of AI adoption.
NOTE The Center of Excellence team may sometimes also be known as the AI capability
team, AI practices team, and so on.
You should include stakeholders from the business and the technology as part of this AI COE; you could call
them the AI business and platform team. You could have people with titles such as IT manager, finance manager,
digital ops manager, operations manager, systems architect, systems administrator, application developer, database
administrator, DevOps manager, AI engineer, ML engineer, data scientist, data analyst, and BI analyst. You can
even have managers from various lines of business, such as finance, human capital, marketing, sales, operations,
supply chain, and so on.
TIP Having a diverse mix of AI experts, data scientists, cloud architects, business analysts,
and representatives across various departments is critical to get a holistic view of your orga-
nization’s needs and challenges.
TIP Empower your AI COE to make quick decisions that will help AI projects and accel-
erate AI adoption.
Figure 22.3 shows how your AI COE starts as a centralized unit to becoming an enterprise-wide advisor and
shifts its focus from strategy to operations. This is a good visual that you could use in your discussions about AI
COE within your company.
FIGURE 22.3: The AI COE evolves from a centralized unit into an enterprise-wide advisor across BUs, shifting its focus from scaling MLOps to optimizing AI operations
TIP Embracing an agile approach can help your AI COE adapt quickly to changes and
improve iteratively.
The steps to form an AI COE team are covered next in the form of a hands-on exercise.
Phase 1: Gathering the requirements
Gather your team and discuss the need for an AI COE. Document the gaps that the AI COE can address, especially with regard to adoption. Discuss the team makeup and identify potential members.

Phase 2: Planning the AI COE
Plan the roles and responsibilities of the team members. Prepare a charter for the AI COE explaining its purpose, goals, processes, and procedures.

Phase 3: Scoping the AI COE
Identify the initial scope for the AI COE, such as the strategy, training needs, pilot projects, scaling strategies, and other operational aspects. Identify training needs and develop the course curriculum, working with third-party vendors and internal training teams. Identify the KPIs that will be used to evaluate the performance of the AI COE.

Phase 4: Implementing the pilot project
Identify a pilot project that can help you develop AI practices that can be adopted for future projects. The goal is to build agile practices and develop sets of AI policies for scaling AI. Implement the project and document the lessons learned. Your team must gain hands-on experience through pilots and demo labs.
SUMMARY
This chapter looked closely at the significance, formation, and the ongoing operation of the AI COE. It discussed
how it is one of the strategic pillars for becoming AI-first, including some of the intricacies involved in setting
it up and its evolution from a strategic focus to an operational one. Through a hands-on exercise, the chapter
offered insights on how to implement an AI COE, including a step-by-step guide to overcome the challenges that
you are likely to face.
The path to becoming an AI-first company has been laid. It’s going to take leadership commitment, strategic
planning, and an unwavering focus on execution. It’s your turn to take action.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. Conducting a company’s AI maturity assessment during the Align phase helps
A. To identify gaps in the company’s vision and goals
B. To assess the company’s AI adoption across various dimensions
C. To create a proof of concept during the Launch phase
D. To optimize the company’s AI efforts by expanding the adoption of AI
2. Identify the dimension in the AI maturity framework that is not discussed in this chapter.
A. Strategy and planning
B. Governance
ANSWER KEY
1. B 6. C 11. D
2. C 7. A 12. C
3. D 8. A 13. A
4. C 9. D 14. A
5. D 10. D 15. B
23
Building Your AI Operating Model and Transformation Plan
If you don’t know where you’re going, any road will get you there.
—Lewis Carroll
You are now standing at an important juncture—architecting the framework that decides how your
company will embed AI into its processes and sketch the master blueprint for your AI transformation. This
chapter builds on the work done in the previous chapter, where you focused on establishing an AI COE.
Your emphasis now shifts to understanding and implementing an AI operating model that resonates with
your organizational ethos to deliver both value and continuous innovation (see Figure 23.1). It is not just
about technology or even strategy but is more about taking a customer-centric lens and navigating the
intricacies of AI-driven innovation. You will focus on building cross-functional teams, instilling a culture of
continuous improvement and innovation, and aligning with core strategic values.
Objective: Implement a cohesive AI operating model tailored for your organization, accompanied by a robust transformation plan that ensures a customer-centric, product-oriented approach to AI solutions, emphasizing iterative growth and measurable outcomes.
FIGURE 23.1: SCALE AND TRANSFORM: Build your AI operating model and transformation plan
It doesn’t end there. You will be crafting a transformation plan tailor-made for you, which will act as a catalyst
for transformation to come. You will chart out the long-term vision, objectives, and timelines. A hands-on exer-
cise will help you to draft such a plan.
The AI operating model covers six aspects: AI strategy, ideation, process development, tools and resources, defined roles and responsibilities, and success metrics. It is supported by practices such as executive buy-in, starting small, avoiding analysis paralysis, staying flexible, and celebrating successes.
By ensuring that these aspects are taken care of, an operating model helps to increase the probability of success of
AI initiatives.
TIP To sustain long-term growth and innovation for your business, position AI as not just
a tool but as a force for business transformation through an AI operating model that instills
AI-first thinking.
TIP Some tips for developing a successful operating model are having executive buy-in,
starting small, avoiding analysis paralysis, being flexible to incorporate changes as the tech-
nology is fast-moving, and ensuring that you celebrate successes, however small or big they
may be.
A product-centric approach to building AI solutions: identify customer pain points; assess gaps across the business, governance, security, platforms, data/model, and operations dimensions; prioritize business outcomes; implement AI solutions through an AI operating model; and continuously improve through measurement and testing.
For example, suppose a customer complains about slow response times, lack of personalized options,
or limited self-service options. In that case, you start by assessing gaps in various dimensions such as
business, governance, security, platforms, data/model, and operations, assess your current operating
model, and devise a plan considering your business strategy. You may realize there is a talent gap, or that your processes must align with your AI strategy. As a result, you may decide to implement a chatbot and
come up with a plan along with an operating model to address the customer’s concern.
By identifying any blocks to meeting the customers’ needs and addressing them, your company can adopt AI suc-
cessfully across the organization. The outcome is an AI strategy with an operating model that enables achieving
business outcomes while meeting the customer’s needs.
TIP Creating a map of the customer journey and tying it to your AI solutions to provide a seamless, connected customer experience is a good practice.
Amazon broke down its monolithic Java e-commerce application into a collection of individual
products such as the home page, the search page, the products page, the account pages, the
shopping cart, personalized recommendations, and so on. Behind each one of these products
was a team that was entirely focused on enabling that functionality from end to end, starting
from the requirements gathering all the way to deploying the solution and making sure it
performed well.
A product-centric approach decomposes a monolithic application (here, a chatbot) into individual products such as natural language understanding, the user interface, personalized recommendations, database migration, and model migration.
TIP Have a dedicated product team for each product or functionality with end-to-end own-
ership from ideation to deployment to continuous improvement.
For example, a database migration team handles the process of moving data from one database to another, while a personalized recommendations model team manages the AI models that generate personalized recommendations for users.
In the case of the chatbot example, you would need to organize teams around the natural language
understanding, user interface, recommendation model, database migration, and data lake. These teams
would be responsible for setting up their product strategy, vision, and requirements to ensure they
perform well in production.
TIP Celebrating small wins will not only keep your team motivated but will also reinforce
the value of an iterative approach.
FIGURE 23.7: Start small and build iteratively with cross-functional teams. An AI pilot team (scrum master, product owner, AI architects, data scientists, machine learning engineers, and cloud engineers) focuses on viability, desirability, feasibility, and operability.
Say your company has a product team responsible for a product recommendation engine that displays
upsell/cross-sell products on your e-commerce website. Your product recommendation engine team
could establish measures such as the recommendation engine’s uptime, response time, and error rate to
ensure they are delivering what is expected to meet the customer’s needs.
The team may also need to measure other services their product depends on, such as a database or an
authentication service. This would allow them to quickly identify any issues with these services that
could impact their own product’s performance or availability.
To ensure the continuous availability of their recommendation engine, the team could set up a testing
environment that simulates the failure of dependent services and ensures their application can still
function properly and recover itself in such scenarios. This would help the team identify and fix any
issues early on before it impacts customers.
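As a sketch of how such measures might be computed (the request-record shape and numbers here are hypothetical), uptime, average response time, and error rate can be derived from request logs and an availability window:

```python
# Illustrative computation of the product measures mentioned above:
# uptime percentage, average response time, and error rate.

def service_metrics(requests, total_seconds, downtime_seconds):
    errors = sum(1 for r in requests if r["status"] >= 500)
    avg_ms = sum(r["latency_ms"] for r in requests) / len(requests)
    return {
        "uptime_pct": 100 * (total_seconds - downtime_seconds) / total_seconds,
        "avg_response_ms": avg_ms,
        "error_rate_pct": 100 * errors / len(requests),
    }

# One day of service time with roughly 14 minutes of downtime
reqs = [
    {"status": 200, "latency_ms": 40},
    {"status": 200, "latency_ms": 60},
    {"status": 503, "latency_ms": 120},
    {"status": 200, "latency_ms": 80},
]
metrics = service_metrics(reqs, total_seconds=86_400, downtime_seconds=864)
```

Tracking the same metrics for dependent services, as the team above does, uses the identical calculation against each dependency's logs.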
Product development: At the beginning of the process, the product teams build their solutions.
Define success metrics: Teams come up with measures of success for the product in production, for example, the recommendation engine's uptime, response time, and error rate.
Identify dependencies: The product owner identifies other products/services on which their product depends.
Incorporate dependencies in the testing plan: Identified dependencies are factored into the testing plan, enabling the team to test dependent services and proactively plan for self-recovery options.
Create testing scenarios: Set up a testing environment that simulates various situations, including the failure of dependent services.
Improve resiliency and availability: The goal of the whole process.
Result: Continuous delivery of a highly available product that meets the customer's needs.
Aligning Operating Model with Strategic Value and Establishing a Clear Roadmap for
AI
Finally, when establishing the operating model, you need to ensure that it is aligned with your business objectives
and built in a manner to deliver upon the AI roadmap. You should be able to deliver iteratively, with minimal
risks and a focus on quicker time to value (see Figure 23.9).
Activities:
1. Ensure the operating model aligns with the company's business objectives.
2. Lay out the plan for AI initiatives, from basic functionality to more complex tasks.
3. Deliver the basic chatbot functions, like responding to customer queries.
4. Add advanced features such as multilanguage support, personalized responses, sentiment analysis, and database integration.
5. Develop a minimum viable product (MVP) with high strategic importance, focusing on providing accurate answers, high availability, reliability, security, and reduced downtime.
Business objectives: Align the operating model with business objectives, and deliver iteratively with minimal risks and quicker time to value.
FIGURE 23.9: Aligning operating model with strategic value and establishing a clear AI roadmap
For example, suppose your team is building a chatbot. In that case, you may decide to deliver the essential func-
tionality first, such as responding to customers’ basic queries, and then add more complex requirements, such
as multilanguage support, personalization of the responses, sentiment analysis, and integration with customer
databases.
Moreover, suppose the chatbot is of high strategic importance to the organization. In that case, you may focus on
building an MVP that aims to provide more accurate answers, which would mean your operating model delivery
approach prioritizes high availability, reliability, security, and reduced downtime so as not to impact customer
satisfaction negatively.
Goal: The goal of this exercise is to give you and your team a practical sense of how to go
about implementing some of the best practices suggested in this chapter. This exercise can be
done in a training setting, in a workshop, or even as part of a real-world project plan-
ning process.
Vision
The Vision component contains an AI transformation initiative’s overarching vision, goals, and objectives and
explains how this vision helps the organization achieve its business goals.
The AI transformation plan, running from vision to KPI evaluation, contains these components:

Vision: The overarching goal of the AI transformation initiative.
Current-state assessment: Strengths, weaknesses, opportunities, and threats of the current AI infrastructure.
Future-state vision: The desired state of the organization after the implementation of the AI transformation plan.
Objectives: Aims such as assembling a cross-functional team and establishing an AI ethical framework.
Roadmap: Major milestones and phases to achieve the future-state vision, such as "Assess and Plan" and "Prototype and Test."
Implementation plan and timeline: The time taken to implement each phase of a project, potential risks, and mitigation steps.
Change management plan: How to manage changes and communicate effectively during the implementation process.
Key performance indicators (KPIs): Metrics used to measure the success of the AI initiative, such as accuracy.
The vision outlines how the transformation initiative will impact the organization’s operations and the outcomes
it aims to achieve through this transformation initiative.
For example, the vision could be “to become a data-driven organization focused on improving productivity
and efficiency, making better decisions, and identifying new business opportunities through the adoption of AI
technology.”
These are other examples of vision:
Thrilling customer experiences: To provide thrilling customer experiences through personalized and proactive solutions by adopting AI technology.
Enhanced services and operations: To leverage AI technologies to provide better, more personalized services to our customers while also enhancing our own operational efficiency and risk management capabilities.
Personalized user experience: To utilize artificial intelligence to enhance and personalize the user experience by analyzing user data and behavior to recommend personalized content, improve streaming quality, and automate our operations.
AI in content creation and production: To use AI in content creation and production to optimize our offerings and create more engaging content for our audience.
Objectives
Some of the objectives could be as follows:
OBJECTIVES
AI readiness assessment: To assess and document the AI readiness capabilities and gaps across the organization.
Comprehensive AI strategy: To develop a comprehensive strategy that includes top-level use cases for AI that align with the overall business strategy.
Assemble AI expert team: To assemble a cross-functional team of AI experts, such as data scientists, machine learning engineers, data engineers, and industry experts, and establish a center of excellence to assist with implementing and adopting AI.
Ethical AI framework: To establish and implement an AI ethical framework that ensures the responsible and ethical use of AI in accordance with social, ethical, and regulatory considerations.
Data management and infrastructure plan: To ensure a data management and infrastructure plan is in place to develop and deploy AI at scale.
AI deployment plan: To have a plan in place to develop and deploy AI at scale to meet business goals, deliver business value, and improve productivity, efficiency, and customer experience.
Monitoring and deployment infrastructure: To have a monitoring and deployment infrastructure in place that ensures the AI solutions operate satisfactorily post deployment.
Current-State Assessment
The plan should contain the documentation of all the findings during the current-state assessment of the AI
infrastructure process, people, and business domains. It should clearly list these strengths, weaknesses, opportu-
nities, and threats for the organization that it will face during its AI adoption journey. The plan should be well
geared to address the gaps discovered between the current-state assessment and the future vision laid out for the
organization.
Future-State Vision
The future-state vision of the company should be well documented, as well as the benefits you hope to receive
when you achieve that vision. It would depend on the use cases you identified in the Envision phase.
Roadmap
This is the most crucial part of the AI transformation plan because it documents all the significant milestones to
achieve the future state vision during this AI transformation journey. It should contain all the major phases or
milestones along with the individual projects, dates, and deliverables for each of those phases or milestones that
help address the gaps identified.
The following table shows a sample roadmap with descriptions of the major milestones:
Assess and plan: Includes business objectives, AI opportunities, use cases, impact and feasibility analysis; data quality, availability, and security assessment; technical infrastructure and resources; and the project plan and funding.
Prototype and test: Shortlist one to two use cases and build an MVP; test the MVP in a controlled environment; gather feedback, refine it, and test again iteratively. Finally, develop a plan to scale the solution and finalize the technical infrastructure.
Pilot and refine: Release the pilot in the real world, collect data, and refine the solution based on feedback. Define a change management plan for stakeholders.
Roll out and continually improve: Roll out to a broader user base, monitor impact and effectiveness, continuously refine, and define new use cases based on experience.
You can use several metrics to measure the success of an AI transformation initiative. The following table shows various KPIs used for this purpose:
Accuracy of models: Measures the percentage of accurate predictions made by the AI models.
Efficiency of models: Measures the speed at which predictions are made.
ROI: Measures the return on investment, calculated as the profits generated by the AI initiative divided by the investment made.
Customer satisfaction: Measures customer satisfaction, typically based on surveys, feedback forms, and customer reviews.
Employee satisfaction: Measures employee satisfaction, typically based on surveys and feedback forms.
Time to market: Measures the time it takes to deploy a model in production. The faster, the more competitive the organization can be.
Data quality: Measures the quality of the data. The better the data quality, the better the performance of the models.
Cost reduction: Measures the cost savings from the AI effort, calculated by comparing costs before and after the initiative.
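Two of these KPIs reduce to simple arithmetic. The sketch below (with made-up figures) computes model accuracy as the percentage of correct predictions and ROI as profits divided by investment:

```python
# Illustrative KPI calculations; the numbers are invented for the example.

def accuracy_pct(predictions, actuals):
    """Percentage of predictions that match the actual outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return 100 * correct / len(actuals)

def roi(profits, investment):
    """Profits generated by the AI initiative divided by the investment made."""
    return profits / investment

acc = accuracy_pct([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 predictions correct
ret = roi(profits=250_000, investment=100_000)
```

Tracking these values per reporting period, alongside the softer KPIs above, gives the transformation team a consistent scorecard.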
SUMMARY
This chapter took a deep dive into how to craft an AI operating model that fits businesses of all sizes. It unlocked
some of the keys for achieving seamless AI integration, such as starting small, scaling with agility, building cross-
functional teams around a product, and fostering a culture of innovation.
While the AI COE team will be responsible for developing and executing the AI transformation plan, the AI
operating model will provide the framework and the best practices to foster AI adoption at your enterprise. The
transformation plan includes vision and objectives, current-state assessment, future-state vision, roadmap, imple-
mentation plan, timeline, change management, and communication plan.
Regardless of whether you are beginning your AI journey or looking to mature your AI adoption, the lessons herein are pivotal in integrating AI into your business and staying ahead of the innovation curve.
In the next chapter, we will discuss implementing generative AI.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. What is an AI operating model?
A. A formal structure or framework to implement AI in a scaled manner
B. A model to develop AI solutions for operations
C. A way to align AI operations with business goals
D. A model to operate AI solutions without compliance
2. An AI transformation plan is
A. A plan for implementing individual AI initiatives
B. A plan to increase the maturity levels of the organization
C. A plan to develop AI technologies
D. A plan to outsource AI development
3. What are the three pillars to scale and become AI-first, as described in the chapter?
A. AI operating model, AI transformation plan, customer-centric strategy
B. AI center of excellence, AI operating model, AI transformation plan
C. AI adoption at scale, AI innovation, AI roadmap
D. AI center of excellence, AI innovation curve, AI operating model
4. Which of the following is NOT part of the AI transformation plan discussed in the chapter?
A. Vision and objectives
B. Future-state vision
C. Timeline
D. Marketing strategies
5. According to the chapter, what is vital for achieving seamless AI integration?
A. Building cross-functional teams around a product and fostering a culture of innovation
B. Focusing only on large corporations and excluding small businesses
C. Ignoring the need for strategic alignment with business goals
D. Starting with a complex plan that covers it all
ANSWER KEY
1. A 3. B 5. A
2. B 4. D
PART IX
Evolving and Maturing AI
This is where you peek into the crystal ball. I delve into the exciting world of Generative AI, discuss where
the AI space is headed, and provide guidance on how to continue your AI journey.
24
Implementing Generative AI Use Cases with ChatGPT for the Enterprise
The best way to predict the future is to invent it.
—Alan Kay
Welcome to the exciting world of generative AI and ChatGPT in particular, which has been making news recently because of its ability to create realistic data by mimicking existing training datasets. It is like an artist who has learned the style of Picasso or van Gogh and creates pieces that mimic their art but are entirely new creations.
With every step in your enterprise AI journey, your understanding of AI and your ability to integrate AI within your company have been increasing. It’s now an excellent time to explore the cutting-edge world of generative AI and its implementation opportunities. See Figure 24.1.
[FIGURE 24.1: EVOLVE AND MATURE: Generative AI and ChatGPT use cases for your enterprise. The journey stages shown are (1) Strategize and Prepare, (2) Plan and Launch Pilot, (3) Build and Govern Your Team, Evolve and Mature, and Continue Your AI Journey.]
434 ❘ CHAPTER 24 Implementing Generative AI Use Cases with ChatGPT for the Enterprise
This chapter gives you a tour of the exciting world of generative AI, starting from the innovative GANs to the
intricate diffusion models. It guides you through the technological marvels that have revolutionized industries,
from drug design to chip design.
With the strategy framework, risk mitigation plans, and other tools presented in this and other chapters in this
book, you can make generative AI applications such as ChatGPT an enterprise reality. This is where innovation
truly comes to life!
[Figure: the layers of AI.
Artificial Intelligence: tools and systems that simulate human intelligence to optimize business processes and decision-making.
Deep Learning: advanced ML techniques that use neural networks to analyze massive amounts of data to identify patterns.
Generative AI: enterprise solutions that autonomously generate new content, designs, or products based on learned business data patterns.
—Gartner]
At this point, you already know what artificial intelligence, machine learning, and deep learning mean. If you are
unsure, refer to Chapter 16, where I discuss this in greater detail.
In this section, you learn how generative AI evolved over a period of time to where it is now.
TIP Use deep learning techniques only when traditional machine learning techniques are not
sufficient.
The Rise and Reach of Generative AI ❘ 437
[Figure: a neural network with an input layer, a hidden layer, and an output layer.]
Definition: A discriminative model tries to learn the boundary between different classes, while a generative model tries to model how the data is generated.
Training Approach: A discriminative model tries to learn the decision boundary between classes, while a generative model tries to learn the distribution of the data itself.
On the other hand, generative models are focused on generating new data instances based on learning the probability distribution of the existing data.
In summary, generative models generate new data, while discriminative models try to distinguish between real
and fake data.
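The contrast can be made concrete with a toy Python sketch (the one-dimensional data and class labels below are invented for illustration): the discriminative part learns only a decision boundary, while the generative part learns each class’s distribution and samples brand-new instances from it.

```python
import random
import statistics

# Toy 1-D dataset (invented): class "a" clusters near 1.0, class "b" near 5.0.
data = [(1.1, "a"), (0.9, "a"), (1.3, "a"), (4.8, "b"), (5.2, "b"), (5.0, "b")]

# Discriminative view: learn only the boundary that separates the classes.
mean_a = statistics.mean(x for x, y in data if y == "a")
mean_b = statistics.mean(x for x, y in data if y == "b")
boundary = (mean_a + mean_b) / 2  # midpoint decision boundary

def classify(x: float) -> str:
    """Predict the label for a new data point."""
    return "a" if x < boundary else "b"

# Generative view: learn the distribution of class "a", then sample from it.
stdev_a = statistics.stdev(x for x, y in data if y == "a")

def generate_class_a() -> float:
    """Produce a brand-new data instance that resembles class 'a'."""
    return random.gauss(mean_a, stdev_a)
```

Note that `classify` never creates anything new, while `generate_class_a` produces data that was never in the training set, which is the essence of a generative model.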
NOTE It is not generative AI when the output is a number or a class, such as a probability or a classification. For it to be generative AI, the output needs to be natural content, such as text, an image, audio, or video.
[Figure: a foundation model is trained on data of many types (text, images, speech, 3D signals, code) and then adapted to tasks such as question answering, sentiment analysis, information extraction, image captioning, object recognition, and instruction following.]
Some examples of these foundation models are the Pathways Language Model (PaLM) and the Language Model for Dialog Applications (LaMDA). These models have been trained on very large amounts of data from the Internet, and the resulting foundation models can now answer almost any question posed to them, whether typed as a prompt or spoken aloud.
TIP Training foundation models can be expensive, but they are impressive in performance; hence, it’s important to put guardrails in place to ensure users are not impacted negatively.
TABLE 24.3: The Differences Between Generative Language and Generative Image Models
Definition: A generative language model learns about patterns in language through training data; a generative image model produces new images through techniques like diffusion.
Purpose: A language model classifies or predicts data; an image model generates new data based on trained data.
Training Data: Language models use text data from sources like books and websites; image models use image data from datasets like ImageNet and CIFAR.
Training Approach: A language model tries to learn the statistical properties of language and generates text accordingly; an image model tries to learn the statistical properties of images and generates images accordingly.
Large language models are those that can understand, generate, and work with human language. The beauty of these models is that, since they have been trained on such large amounts of text data, they understand various patterns, grammar, and even some real-world facts.
LLMs are considered large because they have a large number of parameters learned through the training process. The more parameters a model has, the more it can learn and remember. For example, GPT-3 has 175 billion parameters, which is like saying it has a brain that can store about 175,000 copies of War and Peace. So, when you ask the GPT-3 model who won the 2018 soccer World Cup, having read this information during training, it can immediately respond with the answer. And it gets even more interesting with GPT-4: although OpenAI has not disclosed its size, early speculation put it at as many as 100 trillion parameters, which by the same logic would mean storing 100 million copies of War and Peace in a brain.
LLMs are so huge that they demand a lot of compute power, which puts training their own language model out of reach for most enterprises. Instead, the trend is to fine-tune these models for specific use cases or industries so that the LLM can operate efficiently and provide better performance for that particular use case. For example, an LLM fine-tuned for healthcare will provide better accuracy for healthcare use cases.
Note that in addition to generative language models, you also have generative image models that can take an
image as input and generate text, an image, or a video.
NOTE A large language model takes in text and generates text, an image, a video, or a
decision.
Data Types Used for Training: Large language models use text data; foundation models use text, images, audio, video, and other sensory data.
Primary Function: Large language models are specifically designed for language tasks such as text generation, translation, question answering, and summarization; foundation models are designed to be adaptable for tasks beyond language, which could include computer vision and robotics.
[Figure: examples of AI-powered content and generative AI applications: personalized search results and trending topics (Twitter); personalized recommendations and job postings (LinkedIn); workforce optimization; generating new product ideas; employee transformation; virtual assistants; task automation; job displacement.]
The uses of generative AI are not restricted to businesses alone but can also impact our personal lives. Take, for
example, the Stanford University research, where they converted human brain signals into written text.
TIP Use generative AI not to replace creativity or other relationships but to augment your
capabilities and open new possibilities to transform business and personal lives.
TIP Use GANs to create realistic human faces and even photorealistic artwork out of
sketches.
444 ❘ CHAPTER 24 Implementing Generative AI Use Cases with ChatGPT for the Enterprise
NOTE Transformers have an encoder and a decoder, where the encoder encodes the input
sequence and passes it to the decoder, which learns how to decode the representation for a
relevant task.
[Figure: a transformer with an encoding component and a decoding component; for example, the input "How are you?" is encoded, passed through the generative pretrained transformer model, and decoded into the output "I am doing fine and you?"]
NOTE Diffusion models can handle text-to-image, text-to-video, text-to-3D, and text-to-task generation.
Implementing Generative AI and ChatGPT ❘ 445
Text-to-3D can be used in video games, and text-to-task models can be used to carry out tasks such as question and answer, performing a search, or taking some action such as navigating a UI or updating a document.
Table 24.5 captures the differences between GANs, VAEs, and diffusion models.
Working Principle: GANs use two neural networks competing in a zero-sum game to generate new data that closely resembles the training dataset; VAEs are neural networks that learn to encode and decode data by mapping it to a latent space; diffusion models add noise to the original image and then gradually remove it to create a range of new images, depending on the noise with which the recovery process started.
DALL-E and CLIP are two related models from OpenAI. DALL-E can generate images from text descriptions, while CLIP can understand both text and images and relate them to each other. For example, you can ask DALL-E to create an orange in the form of an apple, and DALL-E will do precisely that, creating a novel image that was never seen before.
Diffusion models are not without their challenges because they can be very computationally intensive and slower
than GANs. Despite their challenges, diffusion models are fascinating and extend the limits of virtual reality and
what’s possible with Gen AI.
The issue with that approach is that you need to figure out how your data is going to be leveraged by
those companies.
➤➤ Consuming APIs: When you use APIs, cost is a critical factor because you are charged based on the tokens you use, both in the query you send to the API and in the response you get back. It is therefore important to exercise control over the amount of data input to the API as well as the amount of data generated from the API.
➤➤ Prompt engineering: The other option is prompt engineering, which has been gaining much attention lately. We are already seeing innovations in this space in the form of new tools and ecosystems such as Hugging Face, LangChain, and so on.
➤➤ Fine-tuning the LLMs: The next option, which gives you even more control, is fine-tuning the LLM, which requires technical expertise within your company. The last option is the most complex: pretraining a model, which is almost like starting from scratch; it’s costly and therefore ruled out for most companies.
➤➤ Using packaged software with embedded gen AI: The easiest option is to use packaged software, where you don’t need any technical team; anybody can use it. Examples are ChatGPT, Microsoft Bing, Google Bard, Adobe Firefly, and so on. In the future, we are going to see most of these large language models embedded in enterprise applications.
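To make the token-cost point in the consuming-APIs option concrete, here is a minimal Python sketch of a pre-call budget gate. The roughly-four-characters-per-token heuristic and the per-token prices are illustrative assumptions only, not any provider’s actual tokenizer or pricing.

```python
# Rough sketch of controlling API spend by estimating tokens before a call.
# The ~4-characters-per-token heuristic and the prices below are invented
# for illustration; real providers publish their own tokenizers and rates.

PRICE_PER_1K_INPUT = 0.0015   # hypothetical $ per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.002   # hypothetical $ per 1,000 output tokens

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly one token per 4 characters."""
    return max(1, len(text) // 4)

def estimated_cost(prompt: str, max_output_tokens: int) -> float:
    """Worst-case cost of one call: input tokens plus the output cap."""
    input_cost = estimate_tokens(prompt) / 1000 * PRICE_PER_1K_INPUT
    output_cost = max_output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    return input_cost + output_cost

def within_budget(prompt: str, max_output_tokens: int, budget: float) -> bool:
    """Gate the API call: refuse requests whose worst-case cost exceeds budget."""
    return estimated_cost(prompt, max_output_tokens) <= budget
```

A real implementation would use the provider’s tokenizer and current price sheet, but the gating logic stays the same.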
[Figure: options for leveraging LLMs along a build-vs.-buy spectrum, from using software with gen AI embedded (easiest) to building new models (hardest).]
TIP Fine-tuning pretrained models can be a viable middle path between fully customized
models and off-the-shelf products.
➤➤ Customized fine-tuning
➤➤ Retrieval augmented generation (RAG)
[Figure: retrieval augmented generation: a user request goes through prompt engineering; information is retrieved and aggregated from files, folders, and databases; and the combined instructions and context are passed to the LLM model.]
The RAG approach has been used in various applications such as customer service, customer support, news writ-
ing, and even therapy.
RAG is also used in certain help desk use cases, where the language models can make their own API calls to other external systems, given their ability to create their own code. This allows them to create their own tools and then use those tools to respond to the user with suggestions.
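The retrieve-then-prompt flow described above can be sketched in a few lines of Python. This is a toy: keyword overlap stands in for a real vector-database search, `call_llm` is a stub, and the document snippets are invented.

```python
# Minimal sketch of the retrieve-then-prompt pattern behind RAG.
# Keyword overlap stands in for real vector search, and call_llm is a stub;
# in practice you would use an embedding index and a hosted model.

DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many query words they share."""
    words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved context and the user request into one prompt."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"[LLM response to {len(prompt)} chars of prompt]"

query = "What is the refund policy?"
context = retrieve(query, DOCUMENTS)
answer = call_llm(build_prompt(query, context))
```

The grounding step is the point: the model answers from your retrieved company data rather than from its training memory alone.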
TIP When integrating internal data with external data, pay attention to data privacy and access control security, and leverage domain experts to tailor models efficiently.
Ensuring Fairness and Bias Mitigation Through Continuous Testing and Guardrails
Third, set up the necessary guardrails to check for biases in the system and address them accordingly. Keep testing
continuously to ensure they are working accurately.
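As a toy illustration of such guardrails, the sketch below combines a deny-list check on model output with a crude demographic-swap bias probe. The blocked terms, the fake model, and the response-length heuristic are all invented for illustration; production systems rely on trained safety classifiers and much larger, continuously run test suites.

```python
# Toy sketch of two guardrail checks run continuously against model output.
# The deny-list and the demographic-swap test are illustrative only.

BLOCKED_TERMS = {"ssn", "password"}  # hypothetical deny-list

def violates_policy(response: str) -> bool:
    """Flag responses that leak terms on the deny-list."""
    text = response.lower()
    return any(term in text for term in BLOCKED_TERMS)

def bias_gap(model, template: str, groups: list[str]) -> int:
    """Compare response lengths when only the group term is swapped;
    a large gap is a crude signal the model treats groups differently."""
    lengths = [len(model(template.format(group=g))) for g in groups]
    return max(lengths) - min(lengths)

def fake_model(prompt: str) -> str:
    # Deliberately biased stand-in: one group gets a terse reply.
    return "ok" if "group_b" in prompt else "a longer, friendlier answer"

gap = bias_gap(fake_model, "Describe a typical {group} customer.", ["group_a", "group_b"])
```

In practice these checks run on every release and on samples of live traffic, and a nonzero gap triggers review rather than automated judgment.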
TIP Be prepared to manage risks proactively: starting with a pilot, maintaining transparency
with users, looking out for bias, handling data privacy issues, and planning for contingencies
are vital.
[Figure: strategy best practices for generative AI:
➤➤ Disruptive potential of generative AI: update your AI strategy with gen AI.
➤➤ Automation to augmentation: move from process automation to smart automation to people augmentation.
➤➤ Agile platforms and tools: flexible architecture; integrate with open-source and licensed software.
➤➤ Shorter-term focus for the roadmap: from a 3-year outlook to a 1-year outlook; monitor frequently and make frequent strategy updates.
➤➤ Predictive to generative use cases: assess tools impact, process impact, and people impact.
➤➤ Data and analytics, enterprise-wide focus: enterprise-wide governance; AI ethics framework.
➤➤ Talent upskilling: AI literacy.]
Risk of Disruption
Generative AI and ChatGPT can disrupt the industry you are currently in. This could mean that you need to change your current AI strategy or complement it with a generative AI and ChatGPT-related strategy.
NOTE Virtual assistants can help employees make decisions at critical junctures, cope with
stress during the day, mentor employees, and be impartial in their interactions with the
employees.
TIP Anticipate disruption and bake it into your business strategy to create a competitive
edge.
Risk of Disruption
Large industry players run the risk of being disrupted by smaller players.
TIP Plan on investing in advanced deepfake detection techniques to mitigate the impact.
Goal: The goal of this exercise is to help those involved in implementing generative AI to
manage and mitigate risks.
Generative AI Cloud Platforms ❘ 453
Understanding Risks
to answer customer queries based on your company data. The major cloud providers let you train these models within your company’s environment, taking all security precautions and best practices so that you retain control over the data and do not expose it to the Internet. In addition to the cloud platform, you also need development platforms that can help you build generative AI applications on top of them.
NOTE When combined with the Model Garden provided in Google’s Vertex AI platform, it can be a game-changer for developing agent-based generative AI applications.
Figure 24.15 shows the foundation models available in the Model Garden. One promising application of generative AI using Google tools is Google’s Magenta project, an application that can generate unique music in a particular genre or style based on the music shared by the user.
[FIGURE 24.15: Foundation models in the Model Garden. Language FMs: BERT, PaLM API for Chat, PaLM API for Text. Vision FMs: Stable Diffusion v1-5, CLIP, Embeddings Extractor, BLIP Image Captioning, BLIP VQA, OWL-ViT, ViT GPT2.]
No Code AI Development with Generative AI App Builder
Google’s Generative AI App Builder allows you to create generative AI applications without coding. According to Google, it helps businesses and governments build their own AI-powered chat interfaces and digital assistants. It connects conversational AI flows with out-of-the-box search experiences and foundation models and thus helps companies build gen AI apps in minutes or hours.
[Figure: the AWS generative AI cloud platform: AWS SageMaker; AWS Titan, a family of pretrained foundation models; AWS Bedrock; and AWS CodeWhisperer, a coding companion for coding faster and smarter.]
AWS SageMaker: All-in-One Solution for Training and Deploying Generative AI Models
SageMaker is another excellent tool to train and deploy generative AI models. It provides a comprehensive devel-
opment environment to build, train, and deploy generative AI models. And it comes with its own debugger and
other monitoring tools to manage bias, explainability, and so on.
magic, CodeWhisperer will return the entire function to you with all the code in the programming language of your choice. During a recent challenge, it was found that developers using CodeWhisperer completed their coding tasks 57 percent faster and were 27 percent more likely to complete them successfully.
CodeWhisperer is available for Python, Java, JavaScript, TypeScript, C#, Go, Kotlin, Rust, PHP, and SQL. It can be accessed from IDEs such as VS Code, IntelliJ IDEA, AWS Cloud9, and many more via the AWS Toolkit IDE extensions. Having been trained on billions of lines of code, CodeWhisperer can identify security vulnerabilities based on Open Worldwide Application Security Project (OWASP) best practices and is an effective tool for handling open-source code responsibly. Figure 24.17 shows the different steps involved when using AWS CodeWhisperer.
[FIGURE 24.17: Steps involved when using AWS CodeWhisperer: set up CodeWhisperer, select the programming language, identify security vulnerabilities, and manage collaboration and version control.]
TIP One of the most significant benefits of using AWS Bedrock is that you can create a private copy of the foundation model and then use your data within your own private VPC; therefore, your data security concerns are well addressed.
Optimizing Performance and Cost Using EC2 Trn1n and Inf2 Instances
Now say you are working with foundation models and looking for a high-performing and cost-effective infrastructure. That’s where Amazon EC2 Trn1n and Inf2 instances, powered by AWS Trainium and AWS Inferentia2, respectively, come into play. You can use Trainium instances to reduce the cost of training; in contrast, you can use Inferentia instances to reduce the cost of making predictions. As of this writing, AWS Trainium chips are built to distribute training across multiple servers employing 800 Gbps of second-generation Elastic Fabric Adapter (EFA) networking. This allows you to scale up to 30,000 Trainium chips, which equals more than 6 exaflops of computing power, all within the same AWS Availability Zone, with an impressive petabit-scale networking capacity.
AWS Inferentia chips are optimized for models with hundreds of billions of parameters. As of this writing, each chip can deliver up to 190 tera floating-point operations per second (TFLOPS). On the other hand, AWS Trainium is a second-generation ML accelerator for deep learning training of 100B+ parameter models. Amazon Search and Runway use Trainium to train their models to generate text, translate languages, and answer questions. They can reduce the training time from months to weeks or even days while reducing the cost. Money Forward uses it to train its models to detect fraud. Figure 24.18 captures the features of AWS Inferentia and Trainium.
ChatGPT uses generative AI models to generate new content in response to users’ natural language prompts. The Azure OpenAI service allows you to leverage these capabilities and combine them with the security and scalability of the Azure cloud platform. Codex is an AI system developed by OpenAI that generates code from text.
The first step to building a generative AI solution with Azure OpenAI is to provision an OpenAI resource in your
Azure subscription, as shown in Figure 24.19.
NOTE At this point, the OpenAI Azure service is available with limited access to ensure its
ethical use. Certain models are available in certain regions only.
NOTE Azure OpenAI Studio is a great way to build your gen AI apps without worrying
about the underlying infrastructure.
GPT-4: Models that generate natural language and code; these models are currently in preview. Model IDs: gpt-4, gpt-4-32k.
GPT-3: Models that can understand and generate natural language. Model IDs: text-davinci-003, text-curie-001, text-babbage-001, text-ada-001, gpt-35-turbo.
Codex: Models that can understand and generate code, including translating natural language to code. Model IDs: code-davinci-002, code-cushman-001.
To make API calls to the model, you must deploy it. Note that you can deploy only one instance of the model.
Once deployed, you can test it using the prompts. A prompt is the text input that is sent to the model’s endpoint.
Responses can come in the form of text, code, or other formats.
Prompts can be of various types, such as classifying content, generating new content, holding conversations,
translating, summarizing content, picking up where you left off, and getting facts.
Other parameters include top probabilities (top P), frequency penalty, presence penalty, pre-response text, and
post-response text.
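The way these parameters fit together can be sketched as the request body sent to a completions-style endpoint. The field names below mirror common OpenAI-style REST payloads, but treat them as assumptions and verify the exact schema against the current Azure OpenAI documentation.

```python
# Sketch of how the generation parameters fit into a completions-style
# request body. Field names mirror common OpenAI-style REST payloads but
# are assumptions here; check the current Azure OpenAI docs before use.

def build_completion_request(prompt: str, **overrides) -> dict:
    """Assemble a request body with sensible defaults; keyword overrides win."""
    payload = {
        "prompt": prompt,
        "max_tokens": 256,        # cap on generated tokens
        "temperature": 0.7,       # higher = more random output
        "top_p": 0.95,            # nucleus sampling: top probability mass
        "frequency_penalty": 0.0, # discourage repeating frequent tokens
        "presence_penalty": 0.0,  # discourage reusing tokens already present
        "stop": None,             # optional stop sequences
    }
    payload.update(overrides)
    return payload

# Lower temperature for a deterministic summarization task.
request = build_completion_request("Summarize this ticket:", temperature=0.2)
```

Centralizing defaults like this keeps parameter choices consistent across an application while letting individual calls override them.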
As shown in Figure 24.21, it can also troubleshoot bugs quickly by catching errors, enhance developer creativity by suggesting new ideas, reduce learning curves, and improve collaboration by increasing the readability and maintainability of the code.
[Figure: GitHub Copilot, your coding companion.
Features: generating code, filling in missing code, suggesting alternative code, translating code, troubleshooting bugs, enhancing creativity, reducing learning curves.
Benefits: helps developers stay in the flow, learn new programming languages, find and fix bugs, and improve the quality of the code.
Results: 74% focus on more satisfying work, 88% feel more productive, 96% are faster with repetitive tasks.
Integration: Visual Studio, Neovim, VS Code, JetBrains IDEs.]
To use GitHub Copilot, you just need to install the extension in your IDE, such as Visual Studio, Neovim, and
JetBrains. Currently, you need to have an active GitHub Copilot subscription to use Copilot. While GitHub
Copilot may differ in the details from AWS CodeWhisperer, it’s similar in terms of its goals to improve developer
productivity.
NOTE The G in GPT stands for generative, P for pretrained, and T for transformer models.
[Figure: other generative AI tools: NVIDIA StyleGAN creates hyper-realistic faces that do not exist in reality; Runway ML lets you develop generative AI without coding; OpenAI’s GPT-3 and DALL-E are integrated into the Kuki.ai chatbot.]
SUMMARY
In this chapter, you journeyed into the complex world of generative AI. You learned about the critical components
of deep learning, semi-supervised learning, generative versus discriminative models, diffusion models, foundation
models, and large language and large image models.
Ultimately, generative AI opens a new chapter of innovation and potential. It can revolutionize and disrupt businesses, augment human potential, and provide a new perspective for solving business problems. However, it is not without risks and potential ethical concerns, and it creates an urgent need to protect data and user privacy and ensure fairness and transparency in decision-making.
It is undoubtedly a new era for businesses and personal lives, and I look forward to seeing how this technology
evolves and transforms our lives.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the information. You’ll find the answers in the following section.
1. What is the estimated economic benefit generated by generative AI use cases annually across industries
according to McKinsey?
A. $1.5 to $2.3 trillion
B. $2.6 to $4.4 trillion
C. $5.5 to $7.2 trillion
D. $6.3 to $8.5 trillion
2. Discriminative models in machine learning are focused on
A. Predicting the label for new data points
B. Generating new data instances
C. Understanding the relationships between the features and the labels
D. Learning the probability distribution of the existing data
3. What is the main limitation of large language models (LLMs)?
A. They can only understand text.
B. They require a large amount of compute power.
C. They can only generate text.
D. They can work only with small amounts of data.
4. Variational autoencoders (VAEs) are beneficial because
A. They can understand the hidden layers that drive the data.
B. They are used in text generation use cases.
C. They can operate like black boxes.
D. Both A and B.
ANSWER KEY
1. B 5. B 9. D
2. A 6. C 10. C
3. B 7. C 11. B
4. D 8. A
25
Planning for the Future of AI
The future belongs to those who believe in the beauty of their dreams.
—Eleanor Roosevelt
You have learned that AI is not a static field; instead, it continuously and rapidly evolves, reshaping industries and personal lives. While generative AI represents the latest pinnacle of current AI capabilities, it is equally vital to gaze into the future and discern the emerging trends.
[Figure: the enterprise AI journey: (1) Strategize and Prepare, (2) Plan and Launch Pilot, (3) Build and Govern Your Team, Evolve and Mature, and Continue Your AI Journey.]
In this chapter, you learn about the trends in the smart world powered by AR/VR, the captivating concept of the metaverse, and the intricacies of quantum machine learning. You also learn about the productivity revolution in industries, the impact of powerful technologies such as digital humans, and the growing popularity of AI at the edge.
466 ❘ CHAPTER 25 Planning for the Future of AI
This chapter also takes a quick look at the critical enablers poised to reshape AI’s future, such as foundation models, knowledge graphs, hyper-automation, and the democratization of AI/ML.
This chapter provides you the opportunity to build a futuristic roadmap for continuous innovation and sustainable growth in the ever-evolving field of AI.
EMERGING AI TRENDS
We are living in an exciting world, and we haven’t seen anything yet. The future of the world is smart. It is being
shaped by several emerging technologies, such as AR and VR, the metaverse, and digital humans—all of which
are covered in this chapter. The smart world is focused on improving the quality of life of its citizens. In addition, AI is transforming businesses to become more productive and efficient.
Behind this revolution are several emerging trends, such as AI at the edge, intelligent apps, compressed models, self-supervised learning, and generative AI. There are also several critical enablers, including ChatGPT, transformer models, foundation models, knowledge graphs, and the democratization of AI/ML, which are all accelerating this trend. This section looks at these trends.
Smart World
A smart world is about creating virtual replicas of the physical environment and the world we live in. It includes
technologies such as virtual reality, augmented reality, the metaverse, and digital humans. The smart world is
focused on improving citizens’ quality of life. It can be done in many ways, such as the following:
➤➤ Improving transportation: Self-driving cars reduce traffic congestion and improve safety.
➤➤ Managing energy: AI can be used to optimize energy consumption and reduce the cost.
➤➤ Improving healthcare: AI can be used to cure diseases, find new drugs, and improve patient treatments.
➤➤ Increasing security: AI can be used to improve security, detect crime, and prevent terrorism.
➤➤ Protecting the environment: AI can be used to protect the environment through initiatives such as
climate control, pollution control, tracking endangered species, and managing resources.
One of the key initiatives in the smart world category is Singapore’s Smart Nation initiative, which includes developing a data exchange program where government agencies can share data and understand the customer’s needs and problems better.
AR and VR Technology
AR and VR technology can be used to improve business operations and customer service and create new prod-
ucts. Here are some examples:
➤➤ Retail: Augmented reality can be used in the retail industry for customers to try on clothes, makeup, and
other products. This can help customers make better product choices and increase customer satisfaction.
➤➤ Education: Virtual reality can be used for students to go on virtual field trips and experiment with difficult-to-access or hazardous products.
➤➤ Healthcare: In healthcare, VR can be used to reduce pain for patients, help them recuperate from
surgery, and help them to walk after a stroke.
➤➤ Manufacturing: Employees can be trained under challenging terrains and environments and thus
improve safety and reduce costs using virtual reality.
➤➤ Marketing: Both virtual reality and augmented reality can be used in marketing to create immersive
experiences; for example, augmented reality can be used to insert products in the user’s environment,
while virtual reality can be used to create virtual experiences that customers can explore.
Metaverse
The metaverse can be thought of as a platform with which businesses can interact with their customers, employees, and partners.
Here are some examples of how it can be used:
➤➤ Virtual meetings: Meetings can be conducted virtually where participants can interact with each other
and the environment.
➤➤ Product demonstrations: Businesses can demonstrate their products to their customers in a virtual world
using AR and VR technology.
➤➤ Virtual training: Companies can provide immersive simulation environments where employees can train
under challenging circumstances.
➤➤ Remote work: Employees can collaborate through a platform that allows them to interact with other employees in a virtual environment.
AI models, natural language processing, and other AI/ML services can be used to enable these services.
This exercise involves exploring the emerging AI technologies within a specific business or
industry context, understanding the potential of these technologies, identifying potential use
cases, and coming up with an initial high-level implementation plan.
You can ask your workshop participants to explore each of these areas and come up with
deliverables such as presentations to explore real-world examples such as Singapore’s Smart
Nation initiative, AR in retail, VR in education, and digital humans in customer service,
among others.
AI at the Edge
AI at the edge involves deploying ML models on edge devices, such as smartphone sensors, rather than in the cloud. Deploying AI models on edge devices opens more possibilities, such as intelligent apps, compressed models, and self-supervised learning.
The benefits of AI at the edge are faster response times, lower bandwidth requirements, reduced cost, and higher security. One challenge of AI at the edge is that data security can be compromised when unauthorized individuals gain access to the devices.
Intelligent Apps
Intelligent apps are software applications that use AI algorithms to provide intelligent and personalized services
to users. They provide benefits such as personalized experiences, process automation, predictive analytics, and
real-time decision-making. Intelligent apps include recommendation engines, chatbots, virtual agents, predictive maintenance applications, and fraud detection applications. These apps have the potential to revolutionize a company’s operations and customer service. See Figure 25.2.
TIP Be a part of the productivity revolution by leveraging technologies like AI/ML on the
edge, intelligent apps, compressed models, self-supervised learning, and generative AI to
increase productivity and transform your organization.
Compressed Models
Compressed models are about compressing the models in size and complexity without losing accuracy and per-
formance. They open new possibilities concerning the deployment of the models on the edge. Model compression
results in reduced storage, reduced bandwidth requirements, improved performance, lower cost, and improved
security. However, it is not without drawbacks because, with the reduced size, it can lose accuracy, is also chal-
lenging to develop, and is available for only some models. As technology develops, some of these drawbacks may
be eliminated, making this more effective.
Some examples of model compression techniques are defined here:
➤➤ Pruning the weights: Involves removing certain connections (weights) between the nodes of a neural net-
work to reduce the size of the network.
➤➤ Quantization: Reduces the number of bits used to represent a weight. For example, using 8-bit integers instead of 32-bit floating-point numbers can reduce the model size and increase the speed of arithmetic operations.
➤➤ Activation compression: Activations are the outputs of the network's layers. Reducing the size or precision of the activations can also reduce the compute power required to train or use the model.
➤➤ Knowledge distillation: Involves training a smaller, simpler (student) model to replicate the behavior of a larger, more complex (teacher) model. By transferring the knowledge from the larger model to the smaller one, you can deploy the student on devices with limited resources.
➤➤ Low-rank factorization: Involves approximating the weight matrices with the product of two lower-rank matrices to reduce the size of the model. This approach can result in some loss of accuracy.
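To make two of these techniques concrete, here is a minimal NumPy sketch of per-tensor quantization and magnitude pruning. This is an illustrative toy, not a production implementation; frameworks such as TensorFlow Lite and PyTorch provide real compression tooling.

```python
import numpy as np

def quantize_int8(weights):
    """Map 32-bit float weights onto 8-bit integers plus a scale factor."""
    scale = np.abs(weights).max() / 127.0  # one scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

def prune_by_magnitude(weights, fraction=0.5):
    """Zero out the smallest-magnitude weights (simple unstructured pruning)."""
    threshold = np.quantile(np.abs(weights), fraction)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)
print("max quantization error:", np.abs(w - w_restored).max())

w_pruned = prune_by_magnitude(w, fraction=0.5)
print("zeros after pruning:", int((w_pruned == 0).sum()), "of", w.size)
```

The int8 copy needs one quarter of the storage of the float32 original, and the zeroed weights from pruning can be stored sparsely; both effects are what make edge deployment feasible.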
Self-Supervised Learning
Self-supervised learning is a type of machine learning in which a model is trained on a large unlabeled dataset, developing its own features without human supervision. Self-supervised learning provides benefits such as
data efficiency, reduced bias, and improved performance. Its advantage is that it can examine existing data to discover patterns and anomalies, which can then be used to make predictions, classify new data, and make recommendations. In healthcare, for example, it can analyze large volumes of imaging data and surface patterns and anomalies that can then inform the treatment of new patients. Similarly, it can mine customer tickets for patterns that relate customers to issue types and then use that information to automate responses to customers and to give customer service reps insights for addressing those types of issues.
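The core trick is the pretext task: the model generates its own labels from the unlabeled data. The following toy sketch (my illustration, not an example from a specific product) predicts each point in an unlabeled series from its two neighbors, so the "labels" come straight from the data itself:

```python
import numpy as np

# Unlabeled "sensor" data: a noisy sine wave, with no human labels anywhere.
rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 20, 500)) + rng.normal(scale=0.05, size=500)

# Pretext task: predict each point from its left and right neighbors.
X = np.stack([series[:-2], series[2:]], axis=1)  # neighbor pairs
y = series[1:-1]                                 # self-generated target

# Fit a linear predictor with ordinary least squares.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ coef
mse = float(np.mean((pred - y) ** 2))
print("pretext-task MSE:", mse)
```

Once a model has learned structure this way, the learned representation can be reused for downstream prediction, classification, or recommendation tasks with far fewer labeled examples.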
CRITICAL ENABLERS
This section looks at some of the critical enablers that are important for developing AI and ML solutions in an enterprise and that give a company a competitive edge (see Figure 25.3). They are critical enablers because they help democratize the use of AI/ML, speed up the deployment of complex models, automate complex tasks, and increase efficiency, thus driving a rapid productivity revolution in companies.
FIGURE 25.3: Critical enablers: foundation models (building blocks for AI apps), knowledge graphs (connect disparate data), hyper-automation (automate processes with AI), and democratization (make AI accessible to all). Together they speed deployment, increase productivity, drive innovation, and provide a competitive edge.
Foundation Models
You can think of foundation models as basic building blocks from which more complex AI solutions are formed. They are large deep learning models pretrained on vast amounts of data to create a particular type of content, and they can be adapted to other use cases. Foundation models such as GPT-4 and Bard are used primarily for natural language processing, machine translation, and image recognition. Once you have a foundation model, you can develop an application on top of it to leverage its content-creation capabilities; applications such as Jasper and Copy.ai, for example, are built on top of GPT models. This is one of the main reasons we are going to see an explosion in the number of generative AI applications in the future.
Knowledge Graphs
Knowledge graphs are databases where entities and their relationships are stored so that models can quickly
leverage them and be trained to make further predictions. This ability to represent data from various sources in
graphs enables developers to use them in use cases such as product recommendations and customer issue resolution. Figure 25.4 shows a knowledge graph that displays a diverse set of entities and relationships to model real-world contexts and domains.
FIGURE 25.4: Knowledge graph reflecting complex real-world data as meaningful relationships
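At its simplest, a knowledge graph can be sketched as a set of (subject, relation, object) triples. The entity names below are hypothetical, and a production system would use a graph database such as Neo4j or Amazon Neptune rather than a Python list, but the pattern-matching idea is the same:

```python
# A tiny knowledge graph stored as (subject, relation, object) triples.
triples = [
    ("Alice", "purchased", "Laptop"),
    ("Alice", "reported_issue", "BatteryDrain"),
    ("Bob", "purchased", "Laptop"),
    ("Laptop", "made_by", "AcmeCorp"),
    ("BatteryDrain", "affects", "Laptop"),
]

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given pattern (None acts as a wildcard)."""
    return [
        (s, r, o) for s, r, o in triples
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]

# Who purchased a Laptop? The basis of "customers like you also bought..."
buyers = [s for s, _, _ in query(relation="purchased", obj="Laptop")]
print(buyers)  # → ['Alice', 'Bob']
```

Because relationships are first-class data, a model (or a recommendation rule) can traverse them directly, for example linking Bob to BatteryDrain through the Laptop he bought.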
TIP When developing AI and ML solutions, consider critical enablers such as foundational
models, knowledge graphs, and hyper-automation to create a competitive edge for your
enterprise.
Hyper-Automation
Hyper-automation refers to the ability to automate business processes using machine learning, robotic process automation (RPA), and natural language processing.
Democratization of AI/ML
The democratization of AI/ML means making AI and ML services accessible to a broad group of users through user-friendly interfaces, pretrained models and templates, and services hosted in the cloud.
Transformer Models
Transformer models are a class of neural networks especially useful for natural language processing. They are known for capturing long-range dependencies between different parts of a sentence and have demonstrated exceptional performance in conversational applications.
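The mechanism behind this long-range capability is scaled dot-product attention, sketched here in NumPy. This is a bare-bones illustration: real transformers add learned query/key/value projections, multiple heads, and positional encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Core of a transformer layer: every position attends to every other
    position in one step, so long-range dependencies are captured directly."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(2)
seq_len, d = 5, 8                        # 5 tokens, 8-dimensional embeddings
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)
```

Because the attention weights connect all token pairs at once, the first and last words of a long sentence are only one step apart, unlike in a recurrent network.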
This is a hands-on self-guided learning exercise where the participants conduct independent
research to understand various emerging technologies and relate theoretical concepts to
practical applications.
Federated Learning
Federated learning is a new approach that involves training the model across decentralized devices or servers that
hold local data. This is helpful when you cannot move data to cloud servers due to security or latency concerns.
In the case study example about the retail business, this means you can train models using local store data while
maintaining overall model consistency across the company.
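A minimal sketch of the idea, loosely following the federated averaging scheme (the store data here is synthetic, and the linear model is purely illustrative): each client trains locally on data that never leaves it, and only the model weights travel to the server for aggregation.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=20):
    """One client's gradient steps on its private data (linear regression)."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(client_models, client_sizes):
    """Server step: size-weighted average of client models; raw data never moves."""
    return np.average(np.stack(client_models), axis=0, weights=client_sizes)

rng = np.random.default_rng(3)
true_w = np.array([2.0, -1.0])

# Each "store" holds its own local data that cannot leave the premises.
clients = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(10):  # a few communication rounds
    local = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(local, [len(y) for _, y in clients])

print("recovered weights:", global_w)
```

Only the two model coefficients cross the network each round, yet the global model converges close to the weights that generated every store's data.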
AutoML
AutoML and the democratization of AI form another trend, one that is creating many citizen data scientists by making model building intuitive and accessible.
AutoML is a significant advancement that automates parts of the machine learning process. It not only makes model development faster and more accessible but also lowers the barrier to entry for implementing AI.
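What AutoML automates can be illustrated with a toy search loop that tries candidate model configurations and keeps the best validation score. Here polynomial degree stands in for the many choices (algorithm, features, hyperparameters) a real AutoML service explores:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy dataset: y depends on x nonlinearly.
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

def fit_and_score(degree):
    """Train one candidate model (a polynomial fit); score it on held-out data."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_val)
    return float(np.mean((pred - y_val) ** 2))

# The "AutoML" part: evaluate candidate configurations, keep the best.
scores = {d: fit_and_score(d) for d in range(1, 9)}
best_degree = min(scores, key=scores.get)
print("best polynomial degree:", best_degree, "val MSE:", scores[best_degree])
```

A citizen data scientist never sees this loop; the service runs it behind an intuitive interface and simply returns the winning model.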
NOTE Federated learning, AI at the edge, quantum computing, AutoML, and explainable
AI are some exciting new trends in machine learning.
Data Flywheels
A data flywheel is a self-reinforcing cycle: as more data is fed into machine learning applications, more users start using the resulting insights, which creates more demand for the data, which in turn generates more insights that lead to new machine learning applications. Because the cycle keeps feeding itself, it has come to be known as the data flywheel.
Distributed Everything
Distributed everything involves connecting data to people and enabling no-code, low-code, and self-service options that facilitate greater collaboration among stakeholders: data scientists, machine learning engineers, business analysts, data engineers, and others who need easy data visualization and analysis.
SUMMARY
You just traveled through the exciting world of AI, touching upon many innovations such as the metaverse, AR/
VR, digital humans and twins, and similar technologies that are shaping the smart world. You also reviewed some
of the productivity revolution technologies, such as AI on the edge, intelligent apps, self-supervised learning, and
generative models like ChatGPT. Together, these technologies propel the advancements in the smart world in our
personal lives and the productivity revolution in business.
This chapter also delved into the critical enablers that make these advancements possible, such as foundation
models, knowledge graphs, ChatGPT, hyper-automation, transformer models, and leading tools like Keras and
TensorFlow.
REVIEW QUESTIONS
These review questions are included at the end of each chapter to help you test your understanding of the infor-
mation. You’ll find the answers in the following section.
1. What is the metaverse?
A. A platform for businesses to interact with their customers, suppliers, and employees
B. A virtual world for remote work and virtual training
C. A computer-generated version of the metaverse
D. None of the above
2. What are compressed models?
A. Models that are compressed in size and complexity without losing accuracy and performance
B. Models that are uncompressed and have high complexity
C. Models that have low accuracy and performance due to compression
D. Models that have not been trained using machine learning
3. What is self-supervised learning?
A. Models can be developed by training a large unlabeled dataset.
B. Models are supervised by a human expert.
C. Models that are trained on labeled data only.
D. Models that are not used in industry.
Review Questions ❘ 477
12. What principle allows a quantum system to exist in multiple states simultaneously?
A. Entanglement
B. Superposition
C. Quantization
D. Activation
13. What tool can businesses use in the metaverse to interact with customers, employees, and partners?
A. VR 360 goggles
B. AI Fairness 360 toolkit
C. Digital twinning
D. Encryption techniques
14. What is a disadvantage of using digital humans in business interactions?
A. They are available 24/7.
B. Customers may think interactions are fake.
C. They respond consistently to customers.
D. They are engaging.
15. Which term refers to the concept where training the model occurs across decentralized devices or
servers holding local data, especially when data cannot be moved to cloud servers due to security or
latency concerns?
A. Explainable AI
B. Quantum computing
C. Federated learning
D. Edge computing
ANSWER KEY
1. A 6. C 11. C
2. A 7. B 12. B
3. A 8. D 13. A
4. A 9. C 14. B
5. B 10. B 15. C
26
Continuing Your AI Journey
The journey of a thousand miles begins with a single step.
—Lao Tzu
Congratulations on making it this far! I'm impressed with what you've achieved, having deployed AI/ML and potentially generative AI solutions that have transformed your organization. This is a significant milestone and a testament to the hard work and dedication of everyone involved.
You gained insights from case studies, addressed challenges and initiated pilots, built your team, developed
an infrastructure, processed data, and deployed and monitored models.
And you didn’t stop there. As your expertise matured, you scaled your AI initiative using the AI maturity
framework and AI center of excellence and then stepped up your game with generative AI use cases.
Figure 26.1 shows this impressive journey.
FIGURE 26.1: The Enterprise AI Journey Map: 01 Strategize and Prepare; 02 Plan and Launch Pilot; 03 Build and Govern Your Team; 04 Set Up Infrastructure and Manage Operations; …; Evolve and Mature AI; 09 Continue Your AI Journey, which equips your company with strategies and insights to sustain AI initiatives, foster a culture of relentless progress, and responsibly advance on the path of AI-driven transformation.
But it is only the beginning of many more exciting things you and your company want to achieve. AI is an evolv-
ing field, and you must watch out for the latest opportunities and trends to capitalize upon.
Your current implementation will also have unlocked many more opportunities and opened a new chapter in your digital journey. Explore how to build on your past successes and lessons learned as you move to the next phase of your AI adventure.
the previous implementation, which should now help you address them as part of this planning phase. Capture
all the gaps in a document and create an action plan to fill them. Include a mix of action items, such as sending people to training, encouraging them to attend workshops and seminars, and encouraging them to pursue certifications. Where necessary, talk to recruiters about hiring new talent, either as permanent employees or as consultants for the duration of the project.
486
artificial intelligence – AWS Console
487
AWS Elemental Media Insight – case studies
B
C
balance, 257–259
base images, 232 Caffe, 112, 310–311
batch data processing, 251 calculated features, as a feature creation technique, 263
batch inference, 352–353 call centers, Amazon Transcribe for, 118
batch monitoring, 367 canary deployment, 351
batch size, 323 capabilities, 108, 135, 143
Bayesian search, 325 Capital One, 92
best practices, 315, 330–338, 385, 448–453 Capital One and How It Became a Leading Technology
beta testing, 449 Organization in a Highly Regulated Environment
bias, 37, 356, 379, 449. See also model governance case study, 21–24
bias checking, 180, 181 Cartesian products of features, as a feature
bias control, as a model governance challenge, 347 transformation technique, 263
bias drift, monitoring, 365 case for change, creating, 72
binary classification, 275 case studies
binary features, 260 about, 27
binning, as a feature creation technique, 262 answer key, 28
building business case and planning during Align phase, data management architecture, 207–209
55 data science experimentation platform, 209–211
customer-centric AI strategy for driving innovation, 419 developing event-driven architecture using IoT data,
of enterprise transformation with AI in the cloud, 19–28 190–193
evaluating AI opportunities across industries, 11 as a driver for process transformation, 25
How a Company Transformed Itself with AI Using a factors to consider when building ML platforms,
People-Centric Approach, 164–165 200–205
How a Supply Chain Management Company Managed hybrid and edge computing, 211–213
their AI Strategy to Achieve their Business key components of enterprise AI/ML healthcare
Outcomes, 62 platforms, 206–214
How Netflix leveraged their capabilities for competitive key components of ML and DL platforms, 206
advantage, 58 multicloud architecture, 213–214
IBM’s New Collar Job Initiative, 71 personalized recommendation architecture, 193–195
identifying initiatives during Envision phase, 53 real-time customer engagement, 195–198
maturity, 394–396 reference architecture patterns for typical use cases,
Netflix and Portfolio Management, 63 188–200
review questions, 27–28 review questions, 215–216
workbook template, 27 as a role for cloud engineers, 176
categorical features, 260 workbook template, 214
categorical outcomes, predicting using logistic regression, cloud providers, 204–205, 368
278–281 cloud-based data processing, benefits and challenges of,
Central Team, 232 247–250
centralized model management, 227 cloud-first thinking, impact of on DevOps, Agile
centralized unit, 410 development, and machine learning, 23
Chainer, 311 cloud-native principles, 23
change acceleration, 69, 72–75, 166 clustering, 292, 296
change impact assessment, conducting, 71–72 clustering metrics, 367
change logs, as a model governance challenge, 347 CNNs, 98
change management, 36, 153, 166–167, 427 code authoring, as a feature of data science
change tracking, data poisoning threats and, 333 experimentation platform, 210
chatbots, 116–117, 307, 442 code build service, as a component of automation
ChatGPT, 99, 433–434, 440–453, 461–462 pipelines, 223
Chevron, 95 code quality, Amazon Code Guru and, 127
classification, 274–276, 280, 285–286, 327–330 code repository, as a component of automation pipelines,
classification metrics, 366 223
classification models, 260 code versioning, as an artifact, 229–231
C-level support and alignment, in Optimizing stage of CodeCommit, 229
maturity, 397 code-related generative AI use cases, 100–101
climate control, as a use for generative adversarial cognitive service for decisions, Cognitive Services for
networks (GANs), 309 Vision and, 143
clinical notes, Amazon Transcribe Medical and, 132 Cognitive Services for Language, 143
clinical support system, Amazon HealthLake and, 131 Cognitive Services for Vision, 143
clinical trials, Amazon HealthLake and, 131 COIN platform, 93
cloud computing, 7–8, 24–25, 174 Colab, 455
cloud deployment, 348–349 collaboration
cloud engineer, 176–177 AI Hub and tools for, 137
cloud fluency, as a change management focus area, 166 Amazon SageMaker Canvas and, 135
Cloud Metadata Catalog, 231 as a benefit of artifact repositories, 332
Cloud Natural Language, 139, 141 as benefit of using cloud AI/ML services, 108
cloud platform infrastructure with data scientists as a role for data analysts,
about, 187–188, 214–215 179
answer key, 216 encouraging, 482
build vs. buy decision, 200–204 managing artifacts and, 383
choosing between cloud providers, 204–205 SageMaker Studio Lab and, 135
customer 360-degree architecture, 188–190 collaborative AI, 43–44
data anomaly and fraud detection, 198–199 collaborative filtering, 121, 122, 296
collaborative robots, transforming manufacturing cost-effective solutions, as benefit of using cloud AI/ML
with, 13 services, 108
collaborative systems, 299 costs
communication, 72, 140, 427 controlling as a challenge of cloud-based data
competitive advantage, 57–59 processing, 247
complex classifications, 288 controlling for models using tagging, 385
complexity, 350, 352 management and billing of, 234
compliance, 11, 114, 132, 179, 247, 384 on-premises deployment and controlling, 349
component reuse, in Optimizing stage of maturity, 396 optimization best practices, 335–338
compressed models, 470 reducing, 8, 15, 83, 131, 384
computer vision, 93, 98–99 course recommendations, as a use case for recommender
computing capacity, as a benefit of cloud-based data systems, 300
processing, 247 Coursera, 96
concept drift, detecting and addressing, 364–365 credit risk assessment, as a use case for decision tree
configuration, synchronizing across environments, algorithms, 282
355–357 criteria, reviewing and refining, 89
configuration as code (CaC), 331 critical enablers, 471–475
confusion matrix, as a validation metric for classification cross-account access, as an AWS best practice, 233
problems, 329–330 cross-functional collaboration, promoting, 481
consistency, in architecture, 356 cross-functional teams, 421–422
container image management, 232–233 culture evolution, as a change management focus area,
container security, as an AWS best practice, 233 166–168
container versioning, as an artifact, 229–231 Current-State Assessment component, of AI
containerization, 211, 332 transformation plan, 427
content accuracy, 440–441 custom models, Amazon Fraud Detector and, 124
content moderation, 93, 118 customer 360-degree architecture, 188–190
content-based filtering, 121, 122, 299 customer analytics, 277
contextual meaning, as a use case for transformer models, customer behavior, DeepAR and, 123
306, 307 customer churn prediction, as a use case for logistic
continuous improvement, 422–423, 480–481 regression, 280
continuous integration and continuous delivery (CI/CD), Customer ECR Instance, 232
156, 218, 335, 357–359 customer engagement, real-time, 195–198
continuous monitoring, 258, 351 customer experience, 8–9, 14–15
continuous testing, 449 customer loyalty, boosting, 441–442
control, hybrid deployment and, 350 customer needs, identifying new, 9–10
controlled variables, as a characteristic of A/B testing, customer satisfaction, 83, 126
351 customer segmentation, 281, 300
convolutional neural networks (CNNs), 303–304 customer service, 12, 82, 282, 301, 480–481
Copy.ai, 93 customer support, AWS Translate for, 115
copyright, protecting, 452 customer support knowledge management system,
core AI roles Amazon Kendra and, 125
about, 175 customer-centric AI strategy, for driving innovation,
AI architect, 175–176 418–424
business analyst, 177–178 customers at scale, enabling real-time service for, 6–7
cloud engineer, 176–177 cybersecurity, 94
data engineer, 176 Cylance, 94
data scientist, 176
machine learning engineer, 176
MLOps engineer, 176 D
core AI services
about, 113 DALL-E 2, 462
chatbots, 116–117 Danske Bank, 44
speech, 117–118 Darktrace, 93
text and document services, 114–116 dashboards, 207, 369
vision services, 118–120 data
correlation matrix, as a feature selection technique, 261 anomaly, 198–200
artifacts lineage tracking and, 228–231 benefits and challenges of cloud-based, 247–250
augmentation of, 259 as a component of automation pipelines, 223
defining usage policies for, 482 data exploration and preprocessing stage, 253–259
as a focus areas for building competitive advantage, 59 data needs, 244–247
managing different types of, 247–248 as a feature of data science experimentation
as a maturity dimension, 403 platform, 210
needs for, 244–247 in machine learning lifecycle, 156
selection of, as a role for model owners, 180 phases of ML lifecycle, 250–253
storage options for, 252–253 review questions, 265–267
data analysis, 82, 209, 291 workbook template, 265
data analyst, 178–179 Data Processing & Feature Engineering Workflow
data capability, in Envision phase, 154 workbook template, 265
Data Catalog, 208, 231 data quality assurance, 334
data centers, moving into the cloud, 22–23 data querying, as a feature of data science
data cleaning, 206, 255–257 experimentation platform, 209
data collection, 178, 250–251, 353 data reputation, in large-scale AI initiatives, 38
data combinations, in real-time customer data scalability, 45
engagement, 196 data scaling, 257
data compliance, in large-scale AI initiatives, 38 data science experimentation platform, 209–211
data consistency, real-time vs. batch inference and, 352 data science library access, as a feature of data science
data consolidation, building experiences due to, 22 experimentation platform, 211
data denoising, as a use case for autoencoders, 298 data scientists, 136–138, 176, 177
data drift, detecting and addressing, 363–364 data security, 38, 349, 482
data engineer, 176, 244–247 data shifts, 335
data engineering, 243–244, 259–267 data source diversification, data poisoning threats
data ethics, in large-scale AI initiatives, 38 and, 334
data extraction, Amazon Textract for, 115 data sources, 188, 208
data flywheels, 475 data stewardship, 475
data generation, as a use case for autoencoders, 298 data synchronization, 213, 214, 352
data governance, importance of, 36 data transformation, 209
data infrastructure, importance of, 36 data versioning, 209
data ingestion, 196, 206, 208, 250, 251 data visualization, 196, 254
data initiatives, prioritizing, 5–6 data warehouses, 252
data integration, as a benefit of cloud-based data data wrangling, 255
processing, 247 data-driven predictions, using machine learning, 272–309
data lakes, 207–208, 252 Dataflow, 231
data latency, 213 Dataiku, 147
data level, as a mitigation technique, 258 DataOps, 475
data lineage tools, as a model monitoring challenge, 347 data-related challenges, 33
data management, 179, 207–209, 475–476, 482 DataRobot, 147
data parallelism, 318 dataset compression, as a use case for principal
data partitioning, 257 component analysis, 296
data poisoning threats, protecting against, 333–334 dataset versioning, as an artifact, 229–231
data preparation, 145, 146, 247, 253–254 datasets, as a use case for logistic regression, 280
data preprocessing deception, navigating against, 451
about, 253–255, 354 decision trees, 260, 281–283
data augmentation, 259 decision-making, 10, 480–481
data cleaning techniques, 255–257 deep learning, 94, 206, 260, 302–303
data partitioning, 257 DeepAR, 123, 124
data scaling, 257 DeepArt, 462
managing inconsistent formats, 259 DeepDream, 462
unbias and balance, 257–259 deepfakes, 451
data privacy, 38, 347, 449, 482 Dell, 94
data processing demand forecasting, 277, 282
about, 243–244, 265 demand predictions, ARIMA and, 122
answer key, 267 democratization, 135, 472
deployment, 45, 225, 345. See also model deployment enablers, in Pilot stage of maturity, 395
design involvement, as a role for domain experts, 178 encryption, enabling of inter-node cluster
detection, 257–259 communications, 333
developers, Google AI/ML services stack for, 138–141 endpoints, 230, 233–234, 370–371
development tools, AutoML and, 139 end-to-end automated pipeline, 225
device support, Dialogflow and, 139 energy consumption analysis, 277
DevOps, 23, 92 energy efficiency, 451
Dialogflow, 139, 141 energy industry, transforming, 15
diffusion models, 443–445 engagement strategy, creating, 73
digital humans, 467 ENGIE, 94
digital natives, AI adoption by, 35 entanglement, 473
digital twins, 94–95, 467 enterprise AI, 59–64, 300
Digitata, 93 Enterprise AI Cloud Platform Setup Checklist workbook
dimensionality reduction, 296–298 template, 214
dimensions, reducing using principal component analysis, enterprise AI opportunities, 6–15
294–296 enterprise AI transformation
direct downloads stage, as a data collection method, 250 about, 3, 16
disaster recovery, in multicloud architecture, 214 AI-first strategy, 5
discovery sessions, 480 answer key, 18
Discovery stage, as a stage of maturity, 394–395, 399, into business processes, 4–5
401, 403, 404 case studies, 19–28
discriminator, 308 prioritizing Ai and data initiatives, 5–6
discriminator networks, 443 review questions, 16–17
disruption, risk of, 449, 451 success and failure in, 4
distributed everything, 476 workbook template checklist, 15
distributed model management, 227 enterprise AI/ML healthcare platform, key components of,
distributed training, 318–319, 338 206–214
Docker build pipeline, 225 enterprise applications, Amazon Polly for,
Docker file, for creating training code containers, 321 117
Docker Hub, 232 enterprise-wide advisor, 410
document analysis, Amazon Textract for, 115 enterprise-wide governance, expanding scope from data
document services, 114–116 and analytics to, 450
documentation, streamlining, 11 environment consistency, as a characteristic of blue/green
domain expert, 178 deployment, 350
domain-specific features, as a feature transformation Envision phase, 51–53, 153–154
technique, 263 epochs, number of, 323
drift monitoring, 335 equipment maintenance, 130, 282
drug discovery, as a use for generative adversarial error checking/handling, 156, 181, 352
networks (GANs), 309 ethical considerations, 154, 380–381
DVC, 229 ethical framework, 397, 482
dynamic data, as a model monitoring challenge, 347 ethical handling, of private data, 452
ethical risks, addressing, 37
ethical safeguards, implementing, 482
E ethical use and responsibility, 346, 452
ethics. See model governance
EC2 Trn1n and Inf2 instances, 457 event-driven architecture, developing using IoT data,
ecommerce, 115, 126 191–193
edge deployment, 349–350 experimentation, 135, 395
education industry, 116, 466 explainability, 366, 379–380
efficiency, 8, 15, 281, 335, 384 exploratory data analysis, 178, 253–254
e-learning and training platforms, Amazon Polly for, 118 Exponential Smoothing State Space Model (ETS),
emails, 280, 441 123–124
emerging trends, 466–469, 475–476 export results, in real-time customer engagement, 196
employee collaboration, enhancing, 10 Extended Kalman Filter Dynamic Linear Model (ED)
employee knowledge base, Amazon Kendra and, 125 Prophet, 123, 124
employee productivity, as a success metric, 83 external knowledge, 446–447
impact, assessing, 88 J
impact criteria, defining, 87
Implementation plan and timeline component of AI Jenkins, as a CI/CD tool, 358
transformation plan, 427 job displacement, navigating, 452
imputation, as a technique for handling missing values, Johnson Controls, 94
256 Johnson & Johnson, 95
incident response, as a role for security engineers, 179 JP Morgan, 93
inconsistent formats, managing, 259
independent component analysis (ICA), as a feature
extraction technique, 261–262 K
industrial solutions, 129, 130
industry adoption, Amazon Lookout for Equipment and, K Cross validation, 326
130 Keras, 310, 472
industry compliance, as a role for domain experts, 178 key performance indicators, 427–428
industry regulations, as a model governance challenge, key phrase extraction, Amazon Comprehend for, 114
347 K-Means clustering, segmentation using, 292–294
industry trends, 84 K-Nearest Neighbors (KNN), 290–291
inference pipelines, implementing, 353–354 KNIME, 147
inference results, in real-time customer engagement, 196 knowledge distillation, 470
information extraction, as a use case for transformer knowledge graphs, 188, 471–472
models, 307 KPMG, 95
infrastructure Kubeflow, 231, 358
complexities of integration, 35–36
on-premises deployment and control of, 349
in Optimizing stage of maturity, 396 L
real-time vs. batch inference and, 352
scalability, 45 labeled data, 250, 290
infrastructure as a code (IaC), 331 labeling, 206, 274
innovation, 60, 108, 166–168, 482 language barriers, 442
insight gathering, in real-time customer engagement, 196 language detection, Amazon Comprehend for, 114
instance size, selecting optimal, 336 language support, Translation AI and, 140
integration, 108, 125, 180 languages, speech-to-text and, 140
Integration of Services, 142 large data, 293, 303
intellectual property, protecting, 452 large-scale AI initiatives, choosing between smaller proof
intelligence layer step, in customer 360-degree of concepts (PoCs) and, 37–39
architecture, 188 latency, 212, 213, 346, 352
intelligent apps, 469–470 layers, hidden, 324
intelligent document processing (IDP), 95 leadership alignment, ensuring, 69–72
intelligent search, 95–96 leadership sponsorship, in Operationalizing stage of
interaction features, 260 maturity, 396
interface development, as a role for software engineers learning and iteration, in Pilot stage of maturity, 395
180 learning rate, 322–323
internal knowledge, 446–447 leave-one-out cross validation, 326
internal pilots, 448 legal compliance/documentation, 180, 181
inter-node cluster communications, enabling encryption legal industry, 115, 118, 126
of, 333 lessons learned, 396, 480
interpretability, 281, 286, 379–380 life sciences, use cases for, 92
Intuit, 92, 98 lifecycle, machine learning (ML), 155–158
inventory control, transforming retail with, 13 limited data, 288
IoT data, developing event-driven architecture using, linear data, 288
191–193 linear discriminant analysis, as a feature extraction
issue detection, Amazon Lookout for Equipment and, 130 technique, 262
IT auditor, 181 linear regression, predicting value using, 276–278
iterative team-building, 421–422 live testing, as a characteristic of canary deployment, 351
IVR systems, Amazon Polly for, 117 LLMs, 447
loan default prediction, as a use case for logistic manifold learning, nonlinear dimensionality reduction
regression, 280 using, 300
logging, auditing and, 227–228 manufacturing, 13, 119, 466
logistic regression, 260, 278–281 marketing, 115, 118, 466
long term dependencies, RNNs and, 306 max length, as a playground parameter, 460
low-rank factorization, 470 mean absolute error (MAE), 366
mean average precision (MAP), 367
mean squared error, 366
M mean/median imputation, as a technique for handling
missing values, 256
machine learning (ML). See also AI/ML algorithms measurement, continuous improvement through, 422–423
about, 3 media, Amazon Transcribe for, 118
artificial intelligence and, 269–270 media intelligence, 97
choosing optimal framework, 337 media translation, 139–141
cloud computing and hardware demands, 7 medical coding, Amazon Comprehend Medical and, 131
data processing phases of lifecycle, 250–253 medical diagnosis, 280, 282
data-driven predictions using, 272–309 medical imaging, as a use case for convolutional neural
democratization of, 472 networks (CNNs), 304
different categories of, 273 metadata, 322
enabling automation, 335 metadata tracking, as an artifact, 229–231
as a focus areas for building competitive advantage, 59 metaverse, 467
impact of cloud-first thinking on, 23 metrics, 326–330, 369
integrating into business processes, 4–5 MICE, as a technique for handling missing values, 256
leveraging benefits of for business, 7 Microsoft, 92
lifecycle, 155–158 Microsoft AI/ML services stack, 142–145
mapping opportunities, 481 Microsoft Azure Learning, 98
modernization, 97–98 Microsoft Azure OpenAI Service, 457–459
planning for, 154 Microsoft Azure Translator, 96
prioritizing opportunities, 481 Microsoft Cognitive Services, 97
revolutionizing operations management with, 21 Microsoft Graph, 96
securing environment, 333 Microsoft Healthcare Bot, 92
solving business problems with, 80–81 Microsoft’s Azure Form Recognizer, 95
strategies for monitoring models, 363–366 Microsoft’s Azure IOT hub, 95
streamlining workflows using MLOps, 331–332 Microsoft’s Azure Machine Learning Platform, 98
success and failure with, 4 min-max scaling, as a technique for handling outliers, 256
tagging to identify environments, 235 misinformation, 451
machine learning engineer, 176 missing values, 255–256
machine learning (ML) platforms and services mitigation, 257–259
about, 108, 134–135, 148–149 ML Operations Automation Guide workbook template,
answer key, 151 237
AWS, 112–114 MLflow, as a CI/CD tool, 358
benefits and factors to consider, 107–111 MLOps automation, implementing CI/CD for models,
core AI services, 113–120 357–359
in Envision phase, 154 MLOps best practices
factors to consider when building, 200–205 about, 217, 237
Google AI/ML services stack, 136–142 answer key, 239
key components of, 206 automation pipelines, 222–225
machine learning services, 134–135 automation through MLOps workflow, 218–221
Microsoft AI/ML services stack, 142–146 central role of MLOps in bridging infrastructure, data,
other enterprise cloud AI platforms, 147–148 and models, 217–221
review questions, 149–151 container image management, 232–233
securing, 332–334 data and artifacts linage tracking, 228–231
specialized AI services, 121–133 deployment scenarios, 225
workbook template, 148 importance of feedback loops, 218
machine translation, 96 logging and auditing, 227–228
maintenance, as a role for software engineers, 180 model inventory management, 225–227
model testing (continued)
  as a role for model risk managers, 179
  as a role for model validators, 181
model training
  Azure Machine Learning and, 145, 146
  as a component of automation pipelines, 223
  as a feature of data science experimentation platform, 210
  as a key component of enterprise AI/ML healthcare platforms, 206
  pipeline, 225
  in real-time customer engagement, 196
Model Training and Evaluation Sheet workbook template, 338–339
model tuning, as a feature of data science experimentation platform, 210
model validation
  Azure Machine Learning and, 145, 146
  as a role for data analysts, 179
  as a role for model risk managers, 179
  as a role for model validators, 181
model validators, 181
model versioning, as an artifact, 230
model-related challenges, 33
models
  about, 315, 339
  accuracy of, 212
  answer key, 341
  architecture of, 317
  Azure Machine Learning and building, 145, 146
  Azure Machine Learning and management of, 145, 146
  best practices, 330–338
  building, 315–318, 326–327
  choosing, 89–91
  complexity of, as a model monitoring challenge, 347
  compressed, 470
  consumption of, 222
  defined, 318
  defining, 322
  developing algorithm code, 317
  edge deployment and, 349
  evaluating, 327, 330
  evaluating, as a feature of data science experimentation platform, 211
  executing, 221
  harmful, 452
  hyperparameters, 316–317
  improving, as a role for model validators, 181
  inventory management for, 225–227, 347
  lack of transparency in, 452
  latency of, 212
  machine learning algorithms compared with AI services and, 113
  maintenance of, as a role for model owners, 180
  as a maturity dimension, 403
  parameters, 316–317
  registration pipeline, 225
  reuse of, in Optimizing stage of maturity, 396
  review questions, 339–341
  security of, as a role for security engineers, 180
  selecting algorithms, 317–318
  size of, 212
  structure, 316–317
  tracking, 381–383
  tracking lineage of, 383
  training, 318–322, 327
  tuning, 322–325, 327
  validating, 325–330
  workbook template, 338–339
monolithic applications, 420
movie recommendations, as a use case for recommender systems, 299
multiclass classification, 275
multicloud architecture, 213–214
multimodal models, 308
multivariate imputation, 264
music composition, as a use for generative adversarial networks (GANs), 308
MXNet, 112, 311

N
Naïve Bayes, 286–287
natural language processing (NLP), 94, 144, 304
natural language tasks, Cloud Natural Language and, 139
Netflix, 15, 58, 63, 97
Netflix and the Path Companies Take to Become World-Class case study, 24–27
network connectivity, 213
network effect, 7
neural language processing, 93
neural machine translation, Translation AI and, 140
neural networks, data analysis using, 291
neural point forecasting (NPTS), 124
New York Times, 97
news personalization, as a use case for recommender systems, 299
noise handling, 288
noise reduction, as a use case for principal component analysis, 296
noisy data, 303, 306
nominal variables, 279
noncore AI roles
  data analyst, 178–179
  domain expert, 178
  IT auditor, 181
  model owners, 180
  model risk manager, 179
  model validators, 181
  security engineer, 179–180
  software engineer, 180
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.