MLOps Engineering at Scale
By Carl Osipov
About this ebook
In MLOps Engineering at Scale you will learn:
Extracting, transforming, and loading datasets
Querying datasets with SQL
Understanding automatic differentiation in PyTorch
Deploying model training pipelines as a service endpoint
Monitoring and managing your pipeline’s life cycle
Measuring performance improvements
MLOps Engineering at Scale shows you how to put machine learning into production efficiently by using pre-built services from AWS and other cloud vendors. You’ll learn how to rapidly create flexible and scalable machine learning systems without laboring over time-consuming operational tasks or taking on the costly overhead of physical hardware. Following a real-world use case for calculating taxi fares, you will engineer an MLOps pipeline for a PyTorch model using AWS serverless capabilities.
About the technology
A production-ready machine learning system includes efficient data pipelines, integrated monitoring, and means to scale up and down based on demand. Using cloud-based services to implement ML infrastructure reduces development time and lowers hosting costs. Serverless MLOps eliminates the need to build and maintain custom infrastructure, so you can concentrate on your data, models, and algorithms.
About the book
MLOps Engineering at Scale teaches you how to implement efficient machine learning systems using pre-built services from AWS and other cloud vendors. This easy-to-follow book guides you step-by-step as you set up your serverless ML infrastructure, even if you’ve never used a cloud platform before. You’ll also explore tools like PyTorch Lightning, Optuna, and MLFlow that make it easy to build pipelines and scale your deep learning models in production.
What's inside
Reduce or eliminate ML infrastructure management
Learn state-of-the-art MLOps tools like PyTorch Lightning and MLFlow
Deploy training pipelines as a service endpoint
Monitor and manage your pipeline’s life cycle
Measure performance improvements
About the reader
Readers need to know Python, SQL, and the basics of machine learning. No cloud experience required.
About the author
Carl Osipov implemented his first neural net in 2000 and has worked on deep learning and machine learning at Google and IBM.
Table of Contents
PART 1 - MASTERING THE DATA SET
1 Introduction to serverless machine learning
2 Getting started with the data set
3 Exploring and preparing the data set
4 More exploratory data analysis and data preparation
PART 2 - PYTORCH FOR SERVERLESS MACHINE LEARNING
5 Introducing PyTorch: Tensor basics
6 Core PyTorch: Autograd, optimizers, and utilities
7 Serverless machine learning at scale
8 Scaling out with distributed training
PART 3 - SERVERLESS MACHINE LEARNING PIPELINE
9 Feature selection
10 Adopting PyTorch Lightning
11 Hyperparameter optimization
12 Machine learning pipeline
MLOps Engineering at Scale
CARL OSIPOV
To comment go to liveBook
Manning
Shelter Island
For more information on this and other Manning titles go to
www.manning.com
Copyright
For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com
©2022 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
ISBN: 9781617297762
contents
preface
acknowledgments
about this book
about the author
about the cover illustration
Part 1 Mastering the data set
1 Introduction to serverless machine learning
1.1 What is a machine learning platform?
1.2 Challenges when designing a machine learning platform
1.3 Public clouds for machine learning platforms
1.4 What is serverless machine learning?
1.5 Why serverless machine learning?
Serverless vs. IaaS and PaaS
Serverless machine learning life cycle
1.6 Who is this book for?
What you can get out of this book
1.7 How does this book teach?
1.8 When is this book not for you?
1.9 Conclusions
2 Getting started with the data set
2.1 Introducing the Washington, DC taxi rides data set
What is the business use case?
What are the business rules?
What is the schema for the business service?
What are the options for implementing the business service?
What data assets are available for the business service?
Downloading and unzipping the data set
2.2 Starting with object storage for the data set
Understanding object storage vs. filesystems
Authenticating with Amazon Web Services
Creating a serverless object storage bucket
2.3 Discovering the schema for the data set
Introducing AWS Glue
Authorizing the crawler to access your objects
Using a crawler to discover the data schema
2.4 Migrating to columnar storage for more efficient analytics
Introducing column-oriented data formats for analytics
Migrating to a column-oriented data format
3 Exploring and preparing the data set
3.1 Getting started with interactive querying
Choosing the right use case for interactive querying
Introducing AWS Athena
Preparing a sample data set
Interactive querying using Athena from a browser
Interactive querying using a sample data set
Querying the DC taxi data set
3.2 Getting started with data quality
From “garbage in, garbage out” to data quality
Before starting with data quality
Normative principles for data quality
3.3 Applying VACUUM to the DC taxi data
Enforcing the schema to ensure valid values
Cleaning up invalid fare amounts
Improving the accuracy
3.4 Implementing VACUUM in a PySpark job
4 More exploratory data analysis and data preparation
4.1 Getting started with data sampling
Exploring the summary statistics of the cleaned-up data set
Choosing the right sample size for the test data set
Exploring the statistics of alternative sample sizes
Using a PySpark job to sample the test set
Part 2 PyTorch for serverless machine learning
5 Introducing PyTorch: Tensor basics
5.1 Getting started with tensors
5.2 Getting started with PyTorch tensor creation operations
5.3 Creating PyTorch tensors of pseudorandom and interval values
5.4 PyTorch tensor operations and broadcasting
5.5 PyTorch tensors vs. native Python lists
6 Core PyTorch: Autograd, optimizers, and utilities
6.1 Understanding the basics of autodiff
6.2 Linear regression using PyTorch automatic differentiation
6.3 Transitioning to PyTorch optimizers for gradient descent
6.4 Getting started with data set batches for gradient descent
6.5 Data set batches with PyTorch Dataset and DataLoader
6.6 Dataset and DataLoader classes for gradient descent with batches
7 Serverless machine learning at scale
7.1 What if a single node is enough for my machine learning model?
7.2 Using IterableDataset and ObjectStorageDataset
7.3 Gradient descent with out-of-memory data sets
7.4 Faster PyTorch tensor operations with GPUs
7.5 Scaling up to use GPU cores
8 Scaling out with distributed training
8.1 What if the training data set does not fit in memory?
Illustrating gradient accumulation
Preparing a sample model and data set
Understanding gradient descent using out-of-memory data shards
8.2 Parameter server approach to gradient accumulation
8.3 Introducing logical ring-based gradient descent
8.4 Understanding ring-based distributed gradient descent
8.5 Phase 1: Reduce-scatter
8.6 Phase 2: All-gather
Part 3 Serverless machine learning pipeline
9 Feature selection
9.1 Guiding principles for feature selection
Related to the label
Recorded before inference time
Supported by abundant examples
Expressed as a number with a meaningful scale
Based on expert insights about the project
9.2 Feature selection case studies
9.3 Feature selection using guiding principles
Related to the label
Recorded before inference time
Supported by abundant examples
Numeric with meaningful magnitude
Bring expert insight to the problem
9.4 Selecting features for the DC taxi data set
10 Adopting PyTorch Lightning
10.1 Understanding PyTorch Lightning
Converting PyTorch model training to PyTorch Lightning
Enabling test and reporting for a trained model
Enabling validation during model training
11 Hyperparameter optimization
11.1 Hyperparameter optimization with Optuna
Understanding log-uniform hyperparameters
Using categorical and log-uniform hyperparameters
11.2 Neural network layers configuration as a hyperparameter
11.3 Experimenting with the batch normalization hyperparameter
Using Optuna study for hyperparameter optimization
Visualizing an HPO study in Optuna
12 Machine learning pipeline
12.1 Describing the machine learning pipeline
12.2 Enabling PyTorch-distributed training support with Kaen
Understanding PyTorch-distributed training settings
12.3 Unit testing model training in a local Kaen container
12.4 Hyperparameter optimization with Optuna
Enabling MLFlow support
Using HPO for DcTaxiModel in a local Kaen provider
Training with the Kaen AWS provider
Appendix A Introduction to machine learning
Appendix B Getting started with Docker
index
front matter
preface
A useful piece of feedback that I got from a reviewer of this book was that it became a “cheat code” for them to scale the steep MLOps learning curve. I hope that the content of this book will help you become a better informed practitioner of machine learning engineering and data science, as well as a more productive contributor to your projects, your team, and your organization.
In 2021, major technology companies are vocal about their efforts to democratize artificial intelligence (AI) by making technologies like deep learning more accessible to a broader population of scientists and engineers. Regrettably, the democratization approach taken by the corporations focuses too much on core technologies and not enough on the practice of delivering AI systems to end users. As a result, machine learning (ML) engineers and data scientists are well prepared to create experimental, proof-of-concept AI prototypes but fall short in successfully delivering these prototypes to production. This is evident from a wide spectrum of issues: from unacceptably high failure rates of AI projects to ethical controversies about AI systems that make it to end users. I believe that, to become successful, the effort to democratize AI must progress beyond the myopic focus on core, enabling technologies like Keras, PyTorch, and TensorFlow. MLOps emerged as a unifying term for the practice of taking experimental ML code and running it effectively in production. Serverless ML is the leading cloud-native software development model for ML and MLOps, abstracting away infrastructure and improving the productivity of practitioners.
I also encourage you to make use of the Jupyter notebooks that accompany this book. The DC taxi fare project used in the notebook code is designed to give you the practice you need to grow as a practitioner. Happy reading and happy coding!
acknowledgments
I am forever grateful to my daughter, Sophia. You are my eternal source of happiness and inspiration. My wife, Alla, was boundlessly patient with me while I wrote my first book. You were always there to support me and to cheer me along. To my father, Mikhael, I wouldn’t be who I am without you.
I also want to thank the people at Manning who made this book possible: Marina Michaels, my development editor; Frances Buontempo, my technical development editor; Karsten Strøbaek, my technical proofreader; Deirdre Hiam, my project editor; Michele Mitchell, my copyeditor; and Keri Hales, my proofreader.
Many thanks go to the technical peer reviewers: Conor Redmond, Daniela Zapata, Dianshuang Wu, Dimitris Papadopoulos, Dinesh Ghanta, Dr. Irfan Ullah, Girish Ahankari, Jeff Hajewski, Jesús A. Juárez-Guerrero, Trichy Venkataraman Krishnamurthy, Lucian-Paul Torje, Manish Jain, Mario Solomou, Mathijs Affourtit, Michael Jensen, Michael Wright, Pethuru Raj Chelliah, Philip Kirkbride, Rahul Jain, Richard Vaughan, Sayak Paul, Sergio Govoni, Srinivas Aluvala, Tiklu Ganguly, and Todd Cook. Your suggestions helped make this a better book.
about this book
Thank you for purchasing MLOps Engineering at Scale.
Who should read this book
To get the most value from this book, you’ll want to have existing skills in data analysis with Python and SQL, as well as have some experience with machine learning. I expect that if you are reading this book, you are interested in developing your expertise as a machine learning engineer, and you are planning to deploy your machine learning-based prototypes to production.
This book is for information technology professionals or those in academia who have had some exposure to machine learning and are working on or are interested in launching a machine learning system in production. There is a refresher on machine learning prerequisites for this book in appendix A. Keep in mind that if you are brand new to machine learning you may find that studying both machine learning and cloud-based infrastructure for machine learning at the same time can be overwhelming.
If you are a software or a data engineer, and you are planning on starting a machine learning project, this book can help you gain a deeper understanding of the machine learning project life cycle. You will see that although the practice of machine learning depends on traditional information technologies (i.e., computing, storage, and networking), it is different from the traditional information technology in practice. The former is significantly more experimental and more iterative than you may have experienced as a software or a data professional, and you should be prepared for the outcomes to be less known in advance. When working with data, the machine learning practice is more like the scientific process, including forming hypotheses about data, testing alternative models to answer questions about the hypothesis, and ranking and choosing the best performing models to launch atop your machine learning platform.
If you are a machine learning engineer or practitioner, or a data scientist, keep in mind that this book is not about making you a better researcher. The book is not written to educate you about the frontiers of science in machine learning. This book also will not attempt to reteach you the machine learning basics, although you may find the material in appendix A, targeted at information technology professionals, a useful reference. Instead, you should expect to use this book to become a more valuable collaborator on your machine learning team. The book will help you do more with what you already know about data science and machine learning so that you can deliver ready-to-use contributions to your project or your organization. For example, you will learn how to implement your insights about improving machine learning model accuracy and turn them into production-ready capabilities.
How this book is organized: A road map
This book is composed of three parts. In part 1, I chart out the landscape of what it takes to put a machine learning system in production, describe an engineering gap between experimental machine learning code and production machine learning systems, and explain how serverless machine learning can help bridge the gap. By the end of part 1, I’ll have taught you how to use serverless features of a public cloud (Amazon Web Services) to get started with a real-world machine learning use case, prepare a working machine learning data set for the use case, and ensure that you are prepared to apply machine learning to the use case.
Chapter 1 presents a broad view on the field of machine learning systems engineering and what it takes to put the systems in production.
Chapter 2 introduces you to the taxi trips data set for the Washington, DC, municipality and teaches you how to start using the data set for machine learning in the Amazon Web Services (AWS) public cloud.
Chapter 3 applies the AWS Athena interactive query service to dig deeper into the data set, uncover data quality issues, and then address them through a rigorous and principled data quality assurance process.
Chapter 4 demonstrates how to use statistical measures to summarize data set samples and to quantify their similarity to the entire data set. The chapter also covers how to pick the right size for your test, training, and validation data sets and use distributed processing in the cloud to prepare the data set samples for machine learning.
In part 2, I teach you to use the PyTorch deep learning framework to develop models for a structured data set, explain how to distribute and scale up machine learning model training in the cloud, and show how to deploy trained machine learning models to scale with user demand. In the process, you’ll learn to evaluate and assess the performance of alternative machine learning model implementations and how to pick the right one for the use case.
Chapter 5 covers the PyTorch fundamentals by introducing the core tensor application programming interface (API) and helping you gain a level of fluency with using the API.
Chapter 6 focuses on the deep learning aspects of PyTorch, including support for automatic differentiation, alternative gradient descent algorithms, and supporting utilities.
Chapter 7 explains how to scale up your PyTorch programs by teaching about the graphics processing unit (GPU) features and how to take advantage of them to accelerate your deep learning code.
Chapter 8 teaches about data parallel approaches for distributed PyTorch training and covers, in depth, the distinction between traditional parameter server-based approaches and ring-based distributed training (e.g., Horovod).
In part 3, I introduce you to the battle-tested techniques of machine learning practitioners and cover feature engineering, hyperparameter tuning, and machine learning pipeline assembly. By the conclusion of this book, you will have set up a machine learning platform that ingests raw data, prepares it for machine learning, applies feature engineering, and trains high-performance, hyperparameter-tuned machine learning models.
Chapter 9 explores the use cases around feature selection and feature engineering, using case studies to build intuition about the features that can be selected or engineered for the DC taxi data set.
Chapter 10 teaches how to eliminate boilerplate engineering code in your DC taxi PyTorch model implementation by adopting a framework called PyTorch Lightning. Also, the chapter navigates through the steps required to train, validate, and test your enhanced deep learning model.
Chapter 11 integrates your deep learning model with an open-source hyperparameter optimization framework called Optuna, helping you train multiple models based on alternative hyperparameter values, and then ranking the trained models according to their loss and metric performance.
Chapter 12 packages your deep learning model implementation into a Docker container in order to run it through the various stages of the entire machine learning pipeline, starting from the development data set all the way to a trained model ready for production deployment.
About the code
You can access the code for this book from my GitHub repository: github.com/osipov/smlbook. The code in this repository is packaged as Jupyter notebooks and is designed to be used in a Linux-based Jupyter notebook environment. This means that you have options when it comes to how you can execute the code. If you have your own local Jupyter environment, for example, with the JupyterLab desktop application (https://github.com/jupyterlab/jupyterlab_app) or a Conda distribution (https://jupyter.org/install), that’s great! If you do not use a local Jupyter distribution, you can run the code from the notebooks using a cloud-based service such as Google Colab or Binder. My GitHub repository’s README.md file includes badges and hyperlinks to help you launch chapter-specific notebooks in Google Colab.
I strongly urge you to use a local Jupyter installation as opposed to a cloud service, especially if you are worried about the security of your AWS account credentials. Some steps of the code will require you to use your AWS credentials for tasks like creating storage buckets, launching AWS Glue extract-transform-load (ETL) jobs, and more. The code for chapter 12 must be executed on a node with Docker installed, so I recommend planning to use a local Jupyter installation on a laptop or a desktop where you have sufficient capacity to install Docker. You can find out more about Docker installation requirements in appendix B.
liveBook discussion forum
Purchase of MLOps Engineering at Scale includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users.
To access the forum, go to https://livebook.manning.com/#!/book/mlops-engineering-at-scale/discussion. Be sure to join the forum and say hi! You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
about the author
about the cover illustration
The figure on the cover of MLOps Engineering at Scale is captioned Femme du Thibet, or a woman of Tibet. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.
The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.
At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.
Part 1 Mastering the data set
Engineering an effective machine learning system depends on a thorough understanding of the project data set. If you have prior experience building machine learning models, you might be tempted to skip this step. After all, shouldn’t the machine learning algorithms automate the learning of the patterns from the data? However, as you are going to observe throughout this book, machine learning systems that succeed in production depend on a practitioner who understands the project data set and then applies human insights about the data in ways that modern algorithms can’t.
1 Introduction to serverless machine learning
This chapter covers
What serverless machine learning is and why you should care
The difference between machine learning code and a machine learning platform
How this book teaches about serverless machine learning
The target audience for this book
What you can learn from this book
A Grand Canyon-like gulf separates experimental machine learning code and production machine learning systems. The scenic view across the canyon is magical: when a machine learning system is running successfully in production it can seem prescient. The first time I started typing a query into a machine learning-powered autocomplete search bar and saw the system anticipate my words, I was hooked. I must have tried dozens of different queries to see how well the system worked. So, what does it take to trek across the canyon?
It is surprisingly easy to get started. Given the right data and less than an hour of coding time, it is possible to write the experimental machine learning code and re-create the remarkable experience I had using the search bar that predicted my words. In my conversations with information technology professionals, I find that many have started to experiment with machine learning. Online classes in machine learning, such as Andrew Ng’s course on Coursera, have a wealth of information about how to get started with machine learning basics. Increasingly, companies that hire for information technology jobs expect entry-level experience with machine learning.¹
While it is relatively easy to experiment with machine learning, building on the results of the experiments to deliver products, services, or features has proven to be difficult. Some companies have even started to use the word unicorn to describe the unreasonably hard-to-find machine learning practitioners with the skills needed to launch production machine learning systems. Practitioners with successful launch experience often have skills that span machine learning, software engineering, and many information technology specialties.
This book is for those who are interested in trekking the journey from experimental machine learning code to a production machine learning system. In this book, I will teach you how to assemble the components for a machine learning platform and use them as a foundation for your production machine learning system. In the process, you will learn:
How to use and integrate public cloud services, such as those from Amazon Web Services (AWS), for machine learning tasks including data ingest, storage, and processing
How to assess and achieve data quality standards for machine learning from structured data
How to engineer synthetic features to improve machine learning effectiveness
How to reproducibly sample structured data into experimental subsets for exploration and analysis
How to implement machine learning models using PyTorch and Python in a Jupyter notebook environment
How to implement data processing and machine learning pipelines to achieve both high throughput and low latency
How to train and deploy machine learning models that depend on data processing pipelines
How to monitor and manage the life cycle of your machine learning system once it is put in production
Why should you invest the time to learn these skills? They will not make you a renowned machine learning researcher or help you discover the next ground-breaking machine learning algorithm. However, if you learn from this book, you will be prepared to deliver the results of your machine learning efforts sooner and more productively, and to grow into a more valuable contributor to your machine learning project, team, or organization.
1.1 What is a machine learning platform?
If you have never heard the phrase yak shaving as it is used in the information technology industry,² here's a hypothetical example of how it may show up during a day in the life of a machine learning practitioner:
My company wants our machine learning system to launch in a month . . . but it is taking us too long to train our machine learning models . . . so I should speed things up by enabling graphical processing units (GPUs) for training . . . but our GPU device drivers are incompatible with our machine learning framework . . . so I need to upgrade to the latest Linux device drivers for compatibility . . . which means that I need to be on the new version of the Linux distribution.
There are many more similar scenarios in which you need to shave a yak to speed up machine learning. The contemporary practice of launching machine learning-based systems in production and keeping them running has too much in common with the yak-shaving story. Instead of focusing on the features needed to make the product a resounding success, too much engineering time is spent on apparently unrelated activities like re-installing Linux device drivers or searching the web for the right cluster settings to configure the data processing middleware.
Why is that? Even if you have the expertise of machine learning PhDs on your project, you still need the support of many information technology services and resources to launch the system. Hidden Technical Debt in Machine Learning Systems (http://mng.bz/01jl), a peer-reviewed article published in 2015 and based on insights from dozens of machine learning practitioners at Google, reports that mature machine learning systems end up containing at most 5% machine learning code.
This book uses the phrase machine learning platform to describe the other 95%, which plays a supporting yet critical role in the entire system. Having the right machine learning platform can make or break your product.
If you take a closer look at figure 1.1, you should be able to describe some of the capabilities you need from a machine learning platform. Obviously, the platform needs to ingest and store data, process data (which includes applying machine learning and other computations to data), and serve the insights discovered by machine learning to the users of the platform. The less obvious observation is that the platform should be able to handle multiple, concurrent machine learning projects and enable multiple users to run the projects in isolation from each other. Otherwise, every time you replace just the machine learning code, you end up reworking the other 95% of the system.
Figure 1.1 Although machine learning code is what makes your machine learning system stand out, it amounts to only about 5% of the system code, according to the experiences described in Hidden Technical Debt in Machine Learning Systems by Google's Sculley et al. Serverless machine learning helps you assemble the other 95% using cloud-based infrastructure.
1.2 Challenges when designing a machine learning platform
How much data should the platform be able to store and process? AcademicTorrents.com is a website dedicated to helping machine learning practitioners access public data sets suitable for machine learning. The website lists over 50 TB of data sets, of which the largest are 1-5 TB in size. Kaggle, a website popular for hosting data science competitions, includes data sets as large as 3 TB. You might be tempted to dismiss the largest data sets as outliers and focus on more common data sets at the scale of gigabytes. However, you should keep in mind that successes in machine learning are often due to reliance on larger data sets. The Unreasonable Effectiveness of Data, by Peter Norvig et al. (http://mng.bz/5Zz4), argues in favor of machine learning systems that can take advantage of larger data sets: simple models and a lot of data trump more elaborate models based on less data.
A machine learning platform that is expected to store and process terabytes to petabytes of data must be built as a distributed computing system: a cluster of multiple inter-networked servers, each processing a part of the data set. Otherwise, a data set of hundreds of gigabytes to terabytes will cause out-of-memory problems when processed by a single server with a typical hardware configuration. A cluster of servers also addresses the input/output bandwidth limitations of individual servers: most servers can supply a CPU with just a few gigabytes of data per second. This means that most types of data processing performed by a machine learning platform can be sped up by splitting the data sets into chunks (sometimes called shards) that are processed in parallel by the servers in the cluster. This distributed systems design for a machine learning platform is commonly known as scaling out.
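The scale-out idea can be sketched in a few lines of Python. This is only an illustration: the shard size, the toy per-shard sum, and the use of threads as stand-ins for cluster workers are all assumptions for the sketch, not the platform design developed later in the book.

```python
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard):
    """Toy per-shard computation: a partial sum. In a real platform,
    each cluster worker would hold only its own shard in memory."""
    return sum(shard)

def scale_out(records, shard_size=4, workers=2):
    # Split the data set into fixed-size chunks ("shards")...
    shards = [records[i:i + shard_size]
              for i in range(0, len(records), shard_size)]
    # ...process the shards in parallel (threads stand in for the
    # servers of a cluster in this single-machine sketch)...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_shard, shards))
    # ...and combine the partial results into the final answer.
    return sum(partials)

print(scale_out(list(range(10))))  # prints 45, same as sum(range(10))
```

The key property to notice is that each shard fits comfortably in a worker's memory even when the full data set does not, which is exactly what makes scaling out attractive for terabyte-scale data.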
A significant portion of figure 1.1 is the serving part of the infrastructure used in the platform. This is the part that exposes the data insights produced by the machine learning code to the users of the platform. If you have ever had your email provider classify your emails as spam or not spam, or used a product recommendation feature of your favorite e-commerce website, you have interacted as a user with the serving infrastructure of a machine learning platform. The serving infrastructure for a major email or e-commerce provider needs to be capable of making decisions for millions of users around the globe, millions of times a second. Of course, not every machine learning platform needs to operate at this scale. However, if you are planning to deliver a product based on machine learning, you need to keep in mind that it is within the realm of possibility for digital products and services to reach hundreds of millions of users in months. For example, Pokemon Go, a machine learning-powered video game from Niantic, reached half a billion users in less than two months.
Is it prohibitively expensive to launch and operate a machine learning platform at scale? As recently as the 2000s, running a scalable machine learning platform would have required a significant upfront investment in servers, storage, and networking, as well as in software and the expertise needed to build one. The first machine learning platform I worked on for a customer, back in 2009, cost over $100,000 USD and was built using on-premises hardware and open source Apache Hadoop (and Mahout) middleware. In addition to upfront costs, machine learning platforms can be expensive to operate due to wasted resources: most machine learning code underutilizes the capacity of the platform. The training phase of machine learning is resource-intensive, leading to high utilization of computing, storage, and networking. However, training runs are intermittent and relatively rare for a machine learning system in production, translating to low average utilization. Serving infrastructure utilization varies based on the specific use case for a machine learning system and fluctuates based on factors like time of day, seasonality, marketing events, and more.
1.3 Public clouds for machine learning platforms
The good news is that public cloud-computing infrastructure can help you create a machine learning platform and address the challenges described in the previous section. In particular, the approach described in this book will take advantage of public clouds from vendors like Amazon Web Services, Microsoft Azure, or Google Cloud to provide your machine learning platform with:
Secure isolation so that multiple users of your platform can work in parallel with different machine learning projects and code
Access to information technologies like data storage, computing, and networking when your projects need them and for as long as they are needed
Metering based on consumption so that your machine learning projects are billed only for the resources they use
This book will teach you how to create a machine learning platform from public cloud infrastructure using Amazon Web Services as the primary example. In particular, I will teach you:
How to use public cloud services to cost-effectively store data sets regardless of whether they are made of kilobytes or terabytes of data
How to optimize the utilization and cost of your machine learning platform computing infrastructure so that you are using just the servers you need
How to elastically scale your serving infrastructure to reduce the operational costs of your machine learning platform
1.4 What is serverless machine learning?
Serverless machine learning is a software development model in which machine learning code is written to run on a machine learning platform hosted in cloud-computing infrastructure with consumption-based metering and billing.
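To make the definition concrete, here is a minimal sketch of what serverless machine learning code looks like in practice: a single handler function, written in the style of an AWS Lambda handler, that the cloud provider invokes and meters per request. The event field (trip_distance) and the stand-in fare formula are hypothetical placeholders for this sketch; a real endpoint would load a trained model rather than hard-code a formula.

```python
def handler(event, context=None):
    """Sketch of a serverless inference endpoint for the taxi fare use case.
    The platform, not your code, provisions the server, runs the function
    on demand, and bills per invocation."""
    trip_distance = float(event["trip_distance"])
    # Stand-in for model inference: a flat fee plus a per-mile rate.
    # A real handler would invoke a trained PyTorch model here.
    predicted_fare = 2.50 + 1.75 * trip_distance
    return {"predicted_fare": round(predicted_fare, 2)}

print(handler({"trip_distance": 4.0}))  # {'predicted_fare': 9.5}
```

Note what is absent from the sketch: there is no server startup, no web framework, and no capacity planning. That absence is the point of the serverless model.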
If a machine learning system