AWS ML
Machine learning is creating rapid and exciting changes across all levels of society.
Machine learning is a complex subject area. Our goal in this lesson is to introduce you to
some of the most common terms and ideas used in machine learning. I will then walk you
through the different steps involved in machine learning and finish with a series of examples
that use machine learning to solve real-world problems.
Lesson Outline
This lesson is divided into the following sections:
First, we'll discuss what machine learning is, common terminology, and common
components involved in creating a machine learning project.
Next, we'll step into the shoes of a machine learning practitioner. Machine learning
involves using trained models to generate predictions and detect patterns from data.
To understand the process, we'll break down the different steps involved and examine
a common process that applies to the majority of machine learning projects.
Finally, we'll take you through three examples using the steps we described to solve
real-life scenarios that might be faced by machine learning practitioners.
Learning Objectives
By the end of the Introduction to machine learning section, you will be able to do the following. Take a
moment to read through these, checking off each item as you go through them.
What is Machine Learning?
Machine learning (ML) is a modern software development technique and a type of artificial
intelligence (AI) that enables computers to solve problems by using examples of real-world
data. It allows computers to automatically learn and improve from experience without being
explicitly programmed to do so.
Summary
Machine learning is part of the broader field of artificial intelligence. This field is concerned
with the capability of machines to perform activities using human-like intelligence. Within
machine learning there are several different kinds of tasks or techniques:
In supervised learning, every training sample from the dataset has a corresponding label or
output value associated with it. As a result, the algorithm learns to predict labels or output
values. We will explore this in-depth in this lesson.
In unsupervised learning, there are no labels for the training data. A machine learning
algorithm tries to learn the underlying patterns or distributions that govern the data. We will
explore this in-depth in this lesson.
In reinforcement learning, the algorithm figures out which actions to take in a situation to
maximize a reward (in the form of a number) on the way to reaching a specific goal. This is a
completely different approach than supervised and unsupervised learning. We will dive deep
into this in the next lesson.
Imagine, for example, the challenging task of writing a program that can detect if a cat is
present in an image. Solving this in the traditional way would require careful attention to
details like varying lighting conditions, different types of cats, and various poses a cat might
be in.
In machine learning, the problem solver abstracts away part of their solution as a flexible
component called a model, and uses a special program called a model training algorithm to
adjust that model to real-world data. The result is a trained model which can be used to
predict outcomes that are not part of the data set used to train it.
In a way, machine learning automates some of the statistical reasoning and pattern-matching
the problem solver would traditionally do.
The overall goal is to use a model created by a model training algorithm to generate
predictions or find patterns in data that can be used to solve a problem.
Understanding Terminology
Machine learning is a new field created at the intersection of statistics, applied math, and
computer science. Because of the rapid and recent growth of machine learning, each of these
fields might use slightly different formal definitions of the same terms.
Terminology
In supervised learning, every training sample from the dataset has a corresponding label or
output value associated with it. As a result, the algorithm learns to predict labels or output
values.
In reinforcement learning, the algorithm figures out which actions to take in a situation to
maximize a reward (in the form of a number) on the way to reaching a specific goal.
In unsupervised learning, there are no labels for the training data. A machine learning
algorithm tries to learn the underlying patterns or distributions that govern the data.
Additional Reading
Want to learn more about how software and applications come together? Reading through
this entry about the software development process from Wikipedia can help.
Nearly all tasks solved with machine learning involve three primary components: a machine learning model, a model training algorithm, and a model inference algorithm.
You can understand the relationships between these components by imagining the stages of
crafting a teapot from a lump of clay.
1. First, you start with a block of raw clay. At this stage, the clay can be molded into many
different forms and be used to serve many different purposes. You decide to use this lump
of clay to make a teapot.
2. So how do you create this teapot? You inspect and analyze the raw clay and decide how to
change it to make it look more like the teapot you have in mind.
3. Next, you mold the clay to make it look more like the teapot that is your goal.
Congratulations! You've completed your teapot. You've inspected the materials, evaluated
how to change them to reach your goal, and made the changes, and the teapot is now ready
for your enjoyment.
A machine learning model, like a piece of clay, can be molded into many different forms and
serve many different purposes. A more technical definition would be that a machine learning
model is a block of code or framework that can be modified to solve different but related
problems based on the data provided.
Important
A model is an extremely generic program (or block of code), made specific by the data used
to train it. It is used to solve different problems.
Example 1
Imagine you own a snow cone cart, and you have some data about the average number of
snow cones sold per day based on the high temperature. You want to better understand this
relationship to make sure you have enough inventory on hand for those high sales days.
Snow cones sold regression chart
In the graph above, you can see one example of a model, a linear regression model (indicated
by the solid line). You can see that, based on the data provided, the model predicts that as the
high temperature for the day increases, so does the average number of snow cones sold. Sweet!
Example 2
Let's look at a different example that uses the same linear regression model, but with
different data and to answer completely different questions.
Imagine that you work in higher education and you want to better understand the relationship
between the cost of enrollment and the number of students attending college. In this example,
our model predicts that as the cost of tuition increases the number of people attending college
is likely to decrease.
Average tuition regression chart
Using the same linear regression model (indicated by the solid line), you can see that the
number of people attending college does go down as the cost increases.
Both examples showcase that a model is a generic program made specific by the data used to
train it.
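To make this concrete, here is a minimal sketch using scikit-learn. The numbers are invented for illustration; the point is that the same generic linear regression model becomes two different trained models depending only on the data it is given.

```python
# A minimal sketch: one generic model type, two different trained models.
import numpy as np
from sklearn.linear_model import LinearRegression

# Snow cone example: high temperature (F) -> snow cones sold (made-up data)
temps = np.array([[60], [70], [80], [90], [100]])
cones_sold = np.array([50, 80, 120, 160, 200])
snow_cone_model = LinearRegression().fit(temps, cones_sold)

# Tuition example: cost of enrollment ($) -> students attending (made-up data)
tuition = np.array([[5000], [10000], [15000], [20000], [25000]])
students = np.array([9000, 8000, 6500, 5000, 3500])
tuition_model = LinearRegression().fit(tuition, students)

print(snow_cone_model.predict([[95]]))   # rising trend: more cones on hot days
print(tuition_model.predict([[12000]]))  # falling trend: fewer students as cost rises
```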
Model Training
In the preceding section, we talked about two key pieces of information: a model and data. In
this section, we show you how those two pieces of information are used to create a trained
model. This process is called model training.
Model training algorithms work through an iterative process
Let's revisit our clay teapot analogy. We've gotten our piece of clay, and now we want to
make a teapot. Let's look at the algorithm for molding clay and how it resembles a machine
learning algorithm:
Think about the changes that need to be made. The first thing you would do is inspect the
raw clay and think about what changes can be made to make it look more like a teapot.
Similarly, a model training algorithm uses the model to process data and then compares the
results against some end goal, such as our clay teapot.
Make those changes. Now, you mold the clay to make it look more like a teapot. Similarly, a
model training algorithm gently nudges specific parts of the model in a direction that brings
the model closer to achieving the goal.
Repeat. By iterating over these steps over and over, you get closer and closer to what you
want until you determine that you’re close enough that you can stop.
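As a rough illustration of this iterative loop (not the exact algorithm any particular framework uses), here is a minimal gradient-descent sketch with made-up data: inspect the error, nudge the parameter, and repeat until a stop condition is reached.

```python
# "Think about the changes, make those changes, repeat" as a tiny training loop.
import numpy as np

x = np.array([60.0, 70.0, 80.0, 90.0])    # input data (made up)
y = np.array([50.0, 80.0, 120.0, 160.0])  # target values (made up)

w = 0.0             # the model: predict y as w * x
learning_rate = 1e-5

for step in range(1000):
    predictions = w * x
    error = predictions - y                # inspect how far off the model is
    gradient = 2 * np.mean(error * x)      # decide which way to nudge w
    w -= learning_rate * gradient          # make that change

# Stop condition here is simply a fixed number of steps
print(w)
```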
Model Inference: Using Your Trained Model
Now you have your completed teapot. You inspected the clay, evaluated the changes that
needed to be made, and made them, and now the teapot is ready for you to use. Enjoy your
tea!
So what does this mean from a machine learning perspective? We are ready to use the model
inference algorithm to generate predictions using the trained model. This process is often
referred to as model inference.
A finished teapot
Quiz Question
Which of the following are the primary components used in machine learning?
Terminology
A model is an extremely generic program, made specific by the data used to train it.
Model training algorithms work through an iterative process where the current model
iteration is analyzed to determine what changes can be made to get closer to the goal. Those
changes are made and the iteration continues until the model is evaluated to meet the goals.
Think back to the clay teapot analogy. Is it true or false that you always need to have an idea
of what you're making when you're handling your raw block of clay?
False
Question 2 of 2
We introduced three common components of machine learning. Let's review your new
knowledge by matching each component to its definition.
In the preceding diagram, you can see an outline of the major steps of the machine learning
process. Regardless of the specific model or training algorithm used, machine learning
practitioners practice a common workflow to accomplish machine learning tasks.
These steps are iterative. In practice, that means that at each step along the process, you
review how the process is going. Are things operating as you expected? If not, go back and
revisit your current step or previous steps to try and identify the breakdown.
The rest of the course is designed around these very important steps. Check through them again here and
get ready to dive deep into each of them!
Task List
All model training algorithms, and the models themselves, take data as their input. Their
outputs can be very different and are classified into a few different groups based on the task
they are designed to solve. Often, we use the kind of data required to train a model as part of
defining a machine learning task.
Supervised learning
Unsupervised learning
The presence or absence of labeling in your data is often used to identify a machine learning
task.
Supervised tasks
A task is supervised if you are using labeled data. We use the term labeled to refer to data
that already contains the solutions, called labels.
For example: Predicting the number of snow cones sold based on the temperatures is an
example of supervised learning.
Labeled data
In the preceding graph, the data contains both a temperature and the number of snow cones
sold. Both components are used to generate the linear regression shown on the graph. Our
goal was to predict the number of snow cones sold, and we feed that value (the label) into the model during training.
We are providing the model with labeled data and therefore, we are performing a supervised
machine learning task.
Unsupervised tasks
A task is considered to be unsupervised if you are using unlabeled data. This means you don't
need to provide the model with any kind of label or solution while the model is being trained.
Take a look at the preceding picture. Did you notice the tree in the picture? What you just
did, when you noticed the object in the picture and identified it as a tree, is called labeling
the picture. Unlike you, a computer just sees that image as a matrix of pixels of varying
intensity.
Since this image does not have the labeling in its original data, it is considered unlabeled.
Unsupervised learning involves using data that doesn't have a label. One common task is
called clustering. Clustering helps to determine if there are any naturally occurring groupings
in the data.
Imagine that you work for a company that recommends books to readers.
The assumption: You are fairly confident that micro-genres exist, and that there is one called
Teen Vampire Romance. Because you don’t know which micro-genres exist, you can't use
supervised learning techniques.
This is where the unsupervised learning clustering technique might be able to detect some
groupings in the data. The words and phrases used in the book description might provide
some guidance on a book's micro-genre.
In supervised learning, there are two main types of identifiers that you will see:
A categorical label has a discrete set of possible values. In a machine learning problem in
which you want to identify the type of flower based on a picture, you would train your model
using images that have been labeled with the categories of flower you would want to
identify. Furthermore, when you work with categorical labels, you often carry out
classification tasks, which are part of the supervised learning family.
A continuous (regression) label does not have a discrete set of possible values, which often
means you are working with numerical data. In the snow cone sales example, we are trying
to predict the number of snow cones sold. Here, our label is a number that could, in theory,
be any value.
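Here is a minimal sketch (with invented measurements) showing the two label types side by side: a categorical label trained with a classifier, and a continuous label trained with a regressor.

```python
# Categorical label -> classification; continuous label -> regression.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Categorical label: flower species (0 and 1 stand for two species; data is made up)
petal_lengths = [[1.4], [1.3], [4.7], [4.5]]
species = [0, 0, 1, 1]
classifier = LogisticRegression().fit(petal_lengths, species)

# Continuous label: number of snow cones sold (made-up data)
temps = [[60], [75], [90]]
cones_sold = [50, 110, 170]
regressor = LinearRegression().fit(temps, cones_sold)

print(classifier.predict([[1.5]]))  # predicts a category
print(regressor.predict([[80]]))    # predicts a number
```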
In unsupervised learning, clustering is just one example. There are many other options, such
as deep learning.
Quiz Question
Match the data used to the corresponding machine learning approach.
Labeled data: supervised learning
Unlabeled data: unsupervised learning
Terminology
Clustering. Unsupervised learning task that helps to determine if there are any naturally
occurring groupings in the data.
A categorical label has a discrete set of possible values, such as "is a cat" and "is not a cat."
A continuous (regression) label does not have a discrete set of possible values, which means
possibly an unlimited number of possibilities.
Discrete: A term taken from statistics referring to an outcome taking on only a finite number
of values (such as days of the week).
A label refers to data that already contains the solution.
Using unlabeled data means you don't need to provide the model with any kind of label or
solution while the model is being trained.
Additional Reading
The AWS Machine Learning blog is a great resource for learning more about projects in
machine learning.
You can use Amazon SageMaker to calculate new stats in Major League Baseball.
You can also find an article on Flagging suspicious healthcare claims with Amazon SageMaker
on the AWS Machine Learning blog.
What kinds of questions and problems are good for machine learning?
Which of the following problem statements fit the definition of a regression-based task?
I want to determine the expected reading time for online news articles, so I collect data on
my reading time for a week and write a browser plugin to use that data to predict the reading
time for new articles.
I work for a shoe company and want to provide a service to help parents predict their
children's shoe size for any particular age. Within this system, I represent shoe size as
a continuum of values and then round to the nearest shoe size.
Question 2 of 2
This is a broad question (too broad) with many different potential factors affecting how long a
customer might spend listening to music.
How might you change the scope or redefine the question to be better suited, and more
concise, for a machine learning task?
Will changing the frequency of when we start playing ads affect how long a customer
listens to music on our service?
Will creating custom playlists encourage customers to listen to music longer?
Will creating artist interviews about their songs increase how long our customers
spend listening to music?
Step Two: Build a Dataset
Summary
The next step in the machine learning process is to build a dataset that can be used to solve
your machine learning-based problem. Understanding the data needed helps you select better
models and algorithms so you can build more effective solutions.
Working with data is perhaps the most overlooked—yet most important—step of the machine
learning process. In 2017, an O’Reilly study showed that machine learning practitioners
spend 80% of their time working with their data.
You can take an entire class just on working with, understanding, and processing data for
machine learning applications. Good, high-quality data is essential for any kind of machine
learning project. Let's explore some of the common aspects of working with data.
Data collection
Does the data you've collected match the machine learning task and problem you have
defined?
Data inspection
The quality of your data will ultimately be the largest factor that affects how well you can
expect your model to perform. As you inspect your data, look for:
Outliers
Missing or incomplete values
Data that needs to be transformed or preprocessed so it's in the correct format to be used
by your model
Summary statistics
Now that you have some data in hand, it is a best practice to check that your data is in
line with the underlying assumptions of your chosen machine learning model.
With many statistical tools, you can calculate things like the mean, interquartile range
(IQR), and standard deviation. These tools can give you insight into the scope, scale, and
shape of the dataset.
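For instance, here is a minimal sketch, using pandas and made-up snow cone data, of how you might compute these summary statistics.

```python
# Quick summary statistics with pandas (hypothetical column names and values).
import pandas as pd

df = pd.DataFrame({
    "high_temp": [60, 62, 71, 75, 80, 83, 90, 95],
    "snow_cones_sold": [52, 55, 78, 90, 120, 130, 160, 185],
})

print(df.describe())  # count, mean, std, min, quartiles, max for each column

# Interquartile range (IQR) of one column
iqr = df["high_temp"].quantile(0.75) - df["high_temp"].quantile(0.25)
print("IQR of high_temp:", iqr)
```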
Data visualization
You can use data visualization to see outliers and trends in your data and to help stakeholders
understand your data.
Look at the following two graphs. In the first graph, some data seems to have clustered into
different groups. In the second graph, some data points might be outliers.
First graph: some of the data seems to cluster into groups. Second graph: some of the data points seem to be outliers.
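Below is a minimal sketch (synthetic random data) of the kind of scatter plot you might draw to look for clusters and outliers like those described above.

```python
# Visualizing possible clusters and outliers with matplotlib (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=(2, 2), scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=(6, 6), scale=0.3, size=(50, 2))
outliers = np.array([[10, 1], [0, 9]])   # points far from both clusters

data = np.vstack([cluster_a, cluster_b, outliers])
plt.scatter(data[:, 0], data[:, 1])
plt.title("Possible clusters and outliers")
plt.show()
```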
Terminology
Impute is a common term referring to different statistical tools which can be used to
calculate missing values from your dataset.
Outliers are data points that are significantly different from others in the same sample.
Additional reading
In machine learning, you use several statistical-based tools to better understand your data. The
sklearn library has many examples and tutorials, such as this example demonstrating outlier
detection on a real dataset.
Question 1 of 5
True or false: Your data requirements will not change based on the machine learning task you
are using.
False
Question 2 of 5
False
Question 3 of 5
True or false: Data needs to be formatted so that it is compatible with the model and model
training algorithm you plan to use.
True
Question 4 of 5
True or false: Data visualizations are the only way to identify outliers in your data.
False
Question 5 of 5
True or false: After you start using your model (performing inference), you don't need to
check the new data that it receives.
False
The first step in model training is to randomly split the dataset. This allows you to keep some
data hidden during training, so that data can be used to evaluate your model before you put it
into production. Specifically, you do this to test against the bias-variance trade-off. If you're
interested in learning more, see the Further learning and reading section.
Training dataset: The data on which the model will be trained. Most of your data will be
here. Many developers estimate about 80%.
Test dataset: The data withheld from the model during training, which is used to test how
well your model will generalize to new data.
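A minimal sketch of this split using scikit-learn's train_test_split; the feature and label arrays here are placeholders.

```python
# Randomly split a dataset into 80% training data and 20% test data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # hypothetical features
y = np.arange(100)                  # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))    # 80 and 20
```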
The model training algorithm iteratively updates a model's parameters to minimize some loss
function.
Model parameters: Model parameters are settings or configurations the training algorithm
can update to change how the model behaves. Depending on the context, you’ll also hear
other more specific terms used to describe model parameters such as weights and biases.
Weights, which are values that change as the model learns, are more specific to neural
networks.
Loss function: A loss function is used to codify the model’s distance from this goal. For
example, if you were trying to predict a number of snow cone sales based on the day’s
weather, you would care about making predictions that are as accurate as possible. So you
might define a loss function to be “the average distance between your model’s predicted
number of snow cone sales and the correct number.” You can see in the snow cone example
this is the difference between the two purple dots.
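As a small illustration, here is one way such a loss could be computed by hand for made-up snow cone predictions (mean absolute error in this sketch; squared error is another common choice).

```python
# A simple loss: the average distance between predicted and actual sales.
import numpy as np

actual_sales = np.array([110, 160, 200])     # made-up true values
predicted_sales = np.array([100, 150, 210])  # made-up model predictions

loss = np.mean(np.abs(predicted_sales - actual_sales))
print(loss)  # 10.0 -- on average the model is off by 10 snow cones
```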
You continue to cycle through these steps until you reach a predefined stop condition. This
might be based on a training time, the number of training cycles, or an even more intelligent
or application-aware mechanism.
1. Practitioners often use machine learning frameworks that already have working
implementations of models and model training algorithms. You could implement these from
scratch, but you probably won't need to do so unless you’re developing new models or
algorithms.
2. Practitioners use a process called model selection to determine which model or models to
use. The list of established models is constantly growing, and even seasoned machine
learning practitioners may try many different types of models while solving a problem with
machine learning.
3. Hyperparameters are settings on the model which are not changed during training but can
affect how quickly or how reliably the model trains, such as the number of clusters the
model should identify.
4. Be prepared to iterate.
Pragmatic problem solving with machine learning is rarely an exact science, and you might
have assumptions about your data or problem which turn out to be false. Don’t get
discouraged. Instead, foster a habit of trying new things, measuring success, and comparing
results across iterations.
Extended Learning
This information hasn't been covered in the above video but is provided for the advanced
reader.
Linear models
One of the most common models covered in introductory coursework, linear models simply
describe the relationship between a set of input numbers and a set of output numbers through
a linear function (think of y = mx + b, or a line on an x vs. y chart).
Classification tasks often use a strongly related logistic model, which adds an additional
transformation mapping the output of the linear function to the range [0, 1], interpreted as
“probability of being in the target class.” Linear models are fast to train and give you a great
baseline against which to compare more complex models. A lot of media buzz is given to
more complex models, but for most new problems, consider starting with a simple model.
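The sketch below illustrates the transformation described above: the logistic (sigmoid) function squashes a linear function's unbounded output into the [0, 1] range so it can be read as a class probability. The slope, intercept, and inputs are arbitrary.

```python
# Mapping a linear model's output into [0, 1] with the logistic function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, b = 2.0, -5.0                      # arbitrary slope and intercept
x = np.array([0.5, 2.0, 3.5])         # arbitrary inputs

linear_output = m * x + b             # unbounded real numbers
probability = sigmoid(linear_output)  # squashed into the range [0, 1]
print(probability)                    # interpretable as P(target class)
```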
Tree-based models
Tree-based models are probably the second most common model type covered in
introductory coursework. They learn to categorize or regress by building an extremely large
structure of nested if/else blocks, splitting the world into different regions at each if/else
block. Training determines exactly where these splits happen and what value is assigned at
each leaf region.
For example, if you’re trying to determine if a light sensor is in sunlight or shadow, you
might train a tree of depth 1 with the final learned configuration being something like if
(sensor_value > 0.698), then return 1; else return 0;. The tree-based model XGBoost is
commonly used as an off-the-shelf implementation for this kind of model and includes
enhancements beyond what is discussed here. Try tree-based models to quickly get a baseline
before moving on to more complex models.
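Here is a minimal sketch of that depth-1 light-sensor tree using scikit-learn; the sensor readings are invented, and the exact split value the tree learns will differ from the 0.698 quoted above.

```python
# A depth-1 decision tree: one if/else split on a light sensor value.
from sklearn.tree import DecisionTreeClassifier, export_text

sensor_values = [[0.1], [0.3], [0.5], [0.8], [0.9], [0.95]]  # made-up readings
labels = [0, 0, 0, 1, 1, 1]   # 0 = shadow, 1 = sunlight

tree = DecisionTreeClassifier(max_depth=1).fit(sensor_values, labels)
print(export_text(tree, feature_names=["sensor_value"]))  # shows the learned split
print(tree.predict([[0.7]]))  # falls on the "sunlight" side of the split
```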
Extremely popular and powerful, deep learning is a modern approach based around a
conceptual model of how the human brain functions. The model (also called a neural
network) is composed of collections of neurons (very simple computational units) connected
together by weights (mathematical representations of how much information to allow to flow
from one neuron to the next). The process of training involves finding values for each weight.
Various neural network structures have been determined for modeling different kinds of
problems or processing different kinds of data.
FFNN: The most straightforward way of structuring a neural network, the Feed Forward
Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer
containing weights to all neurons in the previous layer.
CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized
data. They are by far the most commonly used type of model when processing images.
RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory
(LSTM) model types are structured to effectively represent for loops in traditional
computing, collecting state while iterating over some object. They can be used for
processing sequences of data.
Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture
enables training over larger datasets involving sequences of data.
For more classical models (linear, tree-based) as well as a set of common ML-related tools,
take a look at scikit-learn. The web documentation for this library is also organized for
those getting familiar with the space and can be a great place to pick up some extremely
useful tools and techniques.
For deep learning, MXNet, TensorFlow, and PyTorch are the three most common libraries.
For the majority of machine learning needs, these three are roughly at feature parity and
equivalent.
Terminology
Hyperparameters are settings on the model which are not changed during training but can
affect how quickly or how reliably the model trains, such as the number of clusters the model
should identify.
A loss function is used to codify the model’s distance from its goal.
Training dataset: The data on which the model will be trained. Most of your data will be
here.
Test dataset: The data withheld from the model during training, which is used to test how
well your model will generalize to new data.
Model parameters are settings or configurations the training algorithm can update to change
how the model behaves.
Additional reading
The Wikipedia entry on the bias-variance trade-off can help you understand more about this
common machine learning concept.
In this AWS Machine Learning blog post, you can see how to train a machine-learning
algorithm to predict the impact of weather on air quality using Amazon SageMaker.
True or false: The loss function measures how far the model is from its goal.
True
Question 2 of 3
Why do you need to split the data into training and test data prior to beginning model
training?
If you use all the data you have collected during training, you won't have any with which to
test the model during the model evaluation phase.
Question 3 of 3
What makes hyperparameters different than model parameters? There may be more than one
correct answer.
Model accuracy is a fairly common evaluation metric. Accuracy is the fraction of predictions
a model gets right.
Here's an example:
Petal length to determine species
Imagine that you built a model to identify a flower as one of two common species based on
measurable details like petal length. You want to know how often your model predicts the
correct species. This would require you to look at your model's accuracy.
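A minimal sketch of computing accuracy with scikit-learn, using made-up true and predicted species labels:

```python
# Accuracy: the fraction of predictions the model gets right.
from sklearn.metrics import accuracy_score

true_species = [1, 0, 1, 1, 0, 1]       # made-up labels (1 = species A, 0 = species B)
predicted_species = [1, 0, 0, 1, 0, 1]  # made-up model predictions

print(accuracy_score(true_species, predicted_species))  # 5 of 6 correct -> about 0.83
```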
Extended Learning
This information hasn't been covered in the above video but is provided for the advanced
reader.
Log loss seeks to calculate how uncertain your model is about the predictions it is generating.
In this context, uncertainty refers to how likely a model thinks the predictions being
generated are to be correct.
For example, let's say you're trying to predict how likely a customer is to buy either a jacket
or t-shirt.
Log loss could be used to understand your model's uncertainty about a given prediction. In a
single instance, your model could predict with 5% certainty that a customer is going to buy a
t-shirt. In another instance, your model could predict with 80% certainty that a customer is
going to buy a t-shirt. Log loss enables you to measure how strongly the model believes that
its prediction is accurate.
In both cases, the model is making a prediction about the same outcome (whether the
customer will buy a t-shirt), but its certainty about that prediction changes.
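A minimal sketch of computing log loss with scikit-learn; the outcomes and probabilities below are invented to mirror the jacket vs. t-shirt example.

```python
# Log loss penalizes confident wrong predictions heavily and rewards
# confident correct predictions.
from sklearn.metrics import log_loss

# True outcomes (1 = bought a t-shirt, 0 = bought a jacket) -- made-up data
y_true = [1, 1, 0]
# Model's predicted probability that each customer buys a t-shirt
t_shirt_prob = [0.05, 0.80, 0.10]

print(log_loss(y_true, t_shirt_prob))  # lower is better
```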
Every step we have gone through is highly iterative and can be changed or re-scoped during
the course of a project. At each step, you might find that you need to go back and reevaluate
some assumptions you had in previous steps. Don't worry! This ambiguity is normal.
Terminology
Log loss seeks to calculate how uncertain your model is about the predictions it is generating.
Additional reading
The tools used for model evaluation are often tailored to a specific use case, so it's difficult to
generalize rules for choosing them. The following articles provide use cases and examples of
specific metrics in use.
This lesson has covered linear regression in detail, explaining how you can envision
minimizing loss, how the model can be used in various scenarios, and the importance of data.
What are some methods or tools that could be useful to consider when evaluating a linear
regression output? Can you provide an example of a situation in which you would apply that
method or tool?
Your reflection
The least squares method is the most common method used for fitting a regression line. It calculates
the best-fit line for the observed data by minimizing the sum of the squares of the vertical
deviations from each data point to the line. Because the deviations are squared before being added,
positive and negative values do not cancel out. For example, we could use only the
first feature of a diabetes dataset in order to illustrate the data points within a two-dimensional
plot. A straight line in the plot would then show how linear regression attempts to draw the
line that best minimizes the residual sum of squares between the observed responses
in the dataset and the responses predicted by the linear approximation. The coefficients, residual
sum of squares, and coefficient of determination can also be calculated.
There are many different tools that can be used to evaluate a linear regression model. Here
are a few examples:
1. Mean absolute error (MAE): This is measured by taking the average of the absolute
difference between the actual values and the predictions. Ideally, this difference is
minimal.
2. Root mean square error (RMSE): This is similar to MAE, but takes a slightly modified
approach so values with large error receive a higher penalty. RMSE takes the square
root of the average squared difference between the prediction and the actual value.
3. Coefficient of determination or R-squared (R^2): This measures how well-observed
outcomes are actually predicted by the model, based on the proportion of total
variation of outcomes.
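Here is a minimal sketch of computing all three metrics with scikit-learn on made-up actual and predicted values:

```python
# MAE, RMSE, and R-squared for a regression model's predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = [200000, 250000, 320000, 410000]      # made-up true values
predicted = [195000, 260000, 300000, 430000]   # made-up predictions

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
r2 = r2_score(actual, predicted)
print(mae, rmse, r2)
```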
Once you have trained your model, have evaluated its effectiveness, and are satisfied with the
results, you're ready to generate predictions on real-world problems using unseen data in the
field. In machine learning, this process is often called inference.
Iterative Process
Even after you deploy your model, you're always monitoring to make sure your model is
producing the kinds of results that you expect. There may be times when you reinvestigate
the data, modify some of the parameters in your model training algorithm, or even change the
model type used for training.
Quiz Question
Generating predictions.
Finding patterns in your data.
Using a trained model.
Testing your model on data it has not seen before.
Introduction to Examples
Through the remainder of the lesson, we will be walking through three different case study
examples of machine learning tasks actually solving problems in the real world.
Supervised learning
o Using machine learning to predict housing prices in a neighborhood based on
lot size and number of bedrooms
Unsupervised learning
o Using machine learning to isolate micro-genres of books by analyzing the
wording on the back cover description.
Deep neural network
o While this type of task is beyond the scope of this lesson, we wanted to show
you the power and versatility of modern machine learning. You will see how it
can be used to analyze raw images from a lab's security camera video footage,
trying to detect chemical spills.
Traditionally, real estate appraisers use many quantifiable details about a home (such as
number of rooms, lot size, and year of construction) to help them estimate the value of a
house.
You detect this relationship and believe that you could use machine learning to predict home
prices.
Can we estimate the price of a house based on lot size or the number of bedrooms?
You access the sale prices for recently sold homes or have them appraised. Since you have
this data, this is a supervised learning task. You want to predict a continuous numeric value,
so this task is also a regression task.
Regression task
Data collection: You collect numerous examples of homes sold in your neighborhood within
the past year, and pay a real estate appraiser to appraise the homes whose selling price is
not known.
Data exploration: You confirm that all of your data is numerical because most machine
learning models operate on sequences of numbers. If there is textual data, you need to
transform it into numbers. You'll see this in the next example.
Data cleaning: Look for things such as missing information or outliers, such as the 10-room
mansion. Several techniques can be used to handle outliers, but you can also just remove
those from your dataset.
Prior to actually training your model, you need to split your data. The standard practice is to
put 80% of your dataset into a training dataset and 20% into a test dataset.
As you see in the preceding chart, when lot size increases, home values increase too. This
relationship is simple enough that a linear model can be used to represent this relationship.
A linear model across a single input variable can be represented as a line. It becomes a plane
for two variables, and then a hyperplane for more than two variables. The intuition, as a line
with a constant slope, doesn't change.
The Python scikit-learn library has tools that can handle the implementation of the model
training algorithm for you.
One of the most common evaluation metrics in a regression scenario is called root mean
square error, or RMSE. The math is beyond the scope of this lesson, but RMSE can be thought of
roughly as the "average error" across your test dataset, so you want this value to be low.
In the following chart, you can see where the data points are in relation to the blue line. You
want the data points to be as close to the "average" line as possible, which would mean less
net error.
You compute the root mean square error between your model’s prediction for a data point in your
test dataset and the true value from your data. This actual calculation is beyond the scope of
this lesson, but it's good to understand the process at a high level.
Interpreting Results
In general, as your model improves, you see a better RMS result. You may still not be
confident about whether the specific value you’ve computed is good or bad.
Many machine learning engineers manually count how many predictions were off by a
threshold (for example, $50,000 in this house pricing problem) to help determine and verify
the model's accuracy.
Now you are ready to put your model into action. As you can see in the following image, this
means seeing how well it predicts with new data not seen during model training.
Terminology
Continuous: Floating-point values with an infinite range of possible values. The opposite of
categorical or discrete values, which take on a limited number of possible values.
Hyperplane: A mathematical term for a flat surface in a space with more than two dimensions.
Plane: A mathematical term for a flat surface (like a piece of paper) on which two points can
be joined by a straight line.
Regression: A common task in supervised machine learning.
Additional reading
The Machine Learning Mastery blog is a fantastic resource for learning more about machine
learning. The following example blog posts dive deeper into training regression-based
machine learning models.
How to Develop Ridge Regression Models in Python offers another approach to solving the
problem in the example from this lesson.
Regression is a popular machine learning task, and you can use several different model
evaluation metrics with it.
Question 1 of 2
True or False: The model used in this example is an unsupervised machine learning task.
False
Challenge yourself
In this example, we used a linear model to solve a simple regression supervised learning task.
This model type is a great first choice when exploring a machine learning problem because
it's very fast and straightforward to train. It typically works well when you have relationships
in your data that are linear (when input changes by X, output changes by some fixed multiple
of X).
Can you think of an example of a problem that would not be solvable by a linear model?
Your reflection
Traffic signal control types of problems cannot be solved by linear programming methods, because
there is no need for optimization in such problems.
Linear models typically fail when there is no helpful linear relationship between the input
variables and the label.
For example, imagine predicting the height (label) of a thrown projectile over time (input
variable). You know the trajectory is not linear; it's curved. Any straight line you try to use to
describe this phenomenon would be invalid for a large range of the projectile's trajectory.
Techniques do exist to modify your data so you can still use linear models in these situations.
Such methods are out of scope for this course, but they are called kernel methods.
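The sketch below illustrates the projectile example: a plain linear model fits the curved trajectory poorly, while adding a squared input feature (one simple data transformation in the spirit of the kernel idea) fits it well. The constants are illustrative.

```python
# A straight line fits a curved trajectory poorly; a squared feature fixes that.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

t = np.linspace(0, 4, 30).reshape(-1, 1)          # time (seconds)
height = 20 * t.ravel() - 4.9 * t.ravel() ** 2    # curved projectile height

linear_fit = LinearRegression().fit(t, height)
t_squared = PolynomialFeatures(degree=2).fit_transform(t)
poly_fit = LinearRegression().fit(t_squared, height)

print(linear_fit.score(t, height))        # R^2 well below 1: the line is a poor fit
print(poly_fit.score(t_squared, height))  # R^2 about 1.0: the curve fits well
```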
Clustering results for k=2 and k=3
During the model evaluation phase, you plan on using a metric to find
which value for k is most appropriate.
You find one cluster that contains a large collection of books you can
categorize as “paranormal teen romance.” This trend is known in your
industry, and therefore you feel somewhat confident in your machine
learning approach. You don’t know if every cluster is going to be as
cohesive as this, but you decide to use this model to see if you can find
anything interesting about which to write an article.
Terminology
Bag of words: A technique used to extract features from the text.
It counts how many times a word appears in a document
(corpus), and then transforms that information into a dataset.
Data vectorization: A process that converts non-numeric data
into a numerical format so that it can be used by a machine
learning model.
Silhouette coefficient: A score from -1 to 1 describing the
clusters found during modeling. A score near zero indicates
overlapping clusters, and scores less than zero indicate data
points assigned to incorrect clusters. A score approaching 1
indicates successful identification of discrete non-overlapping
clusters.
Stop words: A list of words removed by natural language
processing tools when building your dataset. There is no single
universal list of stop words used by all natural language
processing tools.
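Putting these terms together, here is a minimal sketch (with invented book descriptions) of bag-of-words vectorization, k-means clustering, and the silhouette coefficient using scikit-learn.

```python
# Bag of words -> k-means clusters -> silhouette score (toy book descriptions).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

descriptions = [
    "a teen vampire romance in a small town",
    "vampire love story for young adults",
    "a detective hunts a serial killer",
    "gritty crime thriller with a hard-boiled detective",
]

# Bag of words: count word occurrences, dropping common English stop words
X = CountVectorizer(stop_words="english").fit_transform(descriptions)

k = 2
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

print(clusters)                       # which micro-genre cluster each book fell into
print(silhouette_score(X, clusters))  # closer to 1 means cleaner, non-overlapping clusters
```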
Additional reading
Machine Learning Mastery is a great resource for finding examples of
machine learning projects.
The How to Develop a Deep Learning Bag-of-Words Model for
Sentiment Analysis (Text Classification) blog post provides an
example using a bag of words–based approach paired with a deep
learning model.
QUESTION 2 OF 3
In the k-means model used for this example, what does the
value for "k" indicate?
The number of clusters the model will try to find during training.
Note: This example uses a neural network. The algorithm for how a
neural network works is beyond the scope of this lesson. However,
there is still value in seeing how machine learning applies in this case.
Contains spill
Does not contain spill
Image classification
Today, deep neural networks are the most common tool used for
solving this kind of problem. Many deep neural network models are
structured to learn the features on top of the underlying pixels so you
don’t have to learn them. You’ll have a chance to take a deeper look at
this in the next lesson, so we’ll keep things high-level for now.
Why not use accuracy? You realize the model will see the 'Does not contain
spill' class almost all the time, so any model that just predicts "no
spill" most of the time will seem pretty accurate.
What you really care about is an evaluation tool that rarely misses a
real spill.
After doing some internet sleuthing, you realize this is a common
problem and that Precision and Recall will be effective. You can think
of precision as answering the question, "Of all predictions of a spill,
how many were right?" and recall as answering the question, "Of all
actual spills, how many did we detect?"
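A minimal sketch of computing precision and recall with scikit-learn, using made-up spill labels (1 = contains spill, 0 = does not contain spill):

```python
# Precision and recall for the spill detector (made-up labels).
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # actual frames
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # model predictions

print(precision_score(y_true, y_pred))  # of all predicted spills, how many were real?
print(recall_score(y_true, y_pred))     # of all real spills, how many did we detect?
```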
Manual evaluation plays an important role. You are unsure if your
staged spills are sufficiently realistic compared to actual spills. To get
a better sense of how well your model performs with actual spills, you
find additional examples from historical records. This allows you to
confirm that your model is performing satisfactorily.
Step Five: Model Inference
The model can be deployed on a system that enables you to run
machine learning workloads such as AWS Panorama.
Thankfully, most of the time, the results will be from the class 'Does
not contain spill.'
No spill detected
But, when the class 'Contains spill' is detected, a simple paging
system could alert the team to respond.
Spill detected
Terminology
Convolutional neural networks (CNN) are a special type of neural
network particularly good at processing images.
Neural networks: a collection of very simple models connected together.
These simple models are called neurons.
The connections between these models are trainable model parameters called weights.
Additional reading
As you continue your machine learning journey, you will start to
recognize problems that are excellent candidates for machine learning.
The AWS Machine Learning Blog is a great resource for finding more
examples of machine learning projects.
In the Protecting people from hazardous areas through virtual
boundaries with Computer Vision blog post, you can see a more
detailed example of the deep learning process described in this
lesson.
Lesson Review
Congratulations on making it through the lesson. Let's review what you've learned.
Learning Objectives
If you watched all the videos, read through all the text and images, and
completed all the quizzes, then you should've mastered the learning
objectives for the lesson. You should recognize all of these by now. Please
read through and check off each as you go through them.
Task List
Describe commonly used algorithms including linear regression,
logistic regression, and k-means.
Glossary
Bag of words: A technique used to extract features from the text. It
counts how many times a word appears in a document (corpus), and
then transforms that information into a dataset.
A categorical label has a discrete set of possible values, such as "is a
cat" and "is not a cat."
Clustering. Unsupervised learning task that helps to determine if there
are any naturally occurring groupings in the data.
CNN: Convolutional Neural Networks (CNN) represent nested filters
over grid-organized data. They are by far the most commonly used type
of model when processing images.
A continuous (regression) label does not have a discrete set of
possible values, which means possibly an unlimited number of
possibilities.
Data vectorization: A process that converts non-numeric data into a
numerical format so that it can be used by a machine learning model.
Discrete: A term taken from statistics referring to an outcome taking
on only a finite number of values (such as days of the week).
FFNN: The most straightforward way of structuring a neural network,
the Feed Forward Neural Network (FFNN) structures neurons in a
series of layers, with each neuron in a layer containing weights to all
neurons in the previous layer.
Hyperparameters are settings on the model which are not changed
during training but can affect how quickly or how reliably the model
trains, such as the number of clusters the model should identify.
Log loss is used to calculate how uncertain your model is about the
predictions it is generating.
Hyperplane: A mathematical term for a flat surface in a space with more
than two dimensions.
Impute is a common term referring to different statistical tools which
can be used to calculate missing values from your dataset.
A label refers to data that already contains the solution.
A loss function is used to codify the model’s distance from its goal.
Machine learning, or ML, is a modern software development technique
that enables computers to solve problems by using examples of real-
world data.
Model accuracy is the fraction of predictions a model gets right.
Continuous: Floating-point values with an infinite range of possible
values. The opposite of categorical or discrete values, which take on a
limited number of possible values.
Model inference is when the trained model is used to generate
predictions.
A model is an extremely generic program, made specific by the data used
to train it.
Why AWS?
The AWS machine learning mission is to put machine learning in the
hands of every developer.
AWS AI services
By using AWS pre-trained AI services, you can apply ready-made
intelligence to a wide range of applications such as personalized
recommendations, modernizing your contact center, improving safety
and security, and increasing customer engagement.
Industry-specific solutions
With no machine learning knowledge needed, you can add intelligence to a
wide range of applications in different industries including healthcare
and manufacturing.
Getting started
In addition to educational resources such as AWS Training and
Certification, AWS has created a portfolio of educational devices to
help put new machine learning techniques into the hands of developers
in unique and fun ways, with AWS DeepLens, AWS DeepRacer,
and AWS DeepComposer.
AWS DeepLens: A deep learning–enabled video camera
AWS DeepRacer: An autonomous race car designed to test
reinforcement learning models by racing on a physical track
AWS DeepComposer: A composing device powered by
generative AI that creates a melody that transforms into a
completely original song
AWS ML Training and Certification: Curriculum used to train
Amazon developers
Additional Reading
To learn more about AWS AI Services, see Explore AWS AI
services.
To learn more about AWS ML Training and Certification offerings,
see Training and Certification.
Lesson Overview
In this lesson, you'll get an introduction to machine learning (ML) with
AWS and AWS AI devices: AWS DeepLens, AWS DeepComposer, and
AWS DeepRacer. Learn the basics of computer vision with AWS
DeepLens, race around a track and get familiar with reinforcement
learning with AWS DeepRacer, and discover the power of generative AI
by creating music using AWS DeepComposer.
The lesson outline
I understand that the AWS Free Tier provides me with 500 AWS
DeepComposer inference jobs (500 pieces of music) at no cost.
I can use those jobs to finish the demo and exercise in the AWS
DeepComposer section.
Summary
Computer vision got its start in the 1960s in academia. Since its
inception, it has been an interdisciplinary field. Machine learning
practitioners use computers to understand and automate tasks
associated with the visual world.
Since 2010, there has been exponential growth in the field of computer
vision. You can start with simple tasks like image classification and
object detection and then scale all the way up to the nearly real-
time video analysis required for self-driving cars to work at scale.
Summary
Computer vision (CV) has many real-world applications. In this video,
we cover examples of image classification, object detection, semantic
segmentation, and activity recognition. Here's a brief summary of what
you learn about each topic in the video.
New Terms
Input Layer: The first layer in a neural network. This layer
receives all data that passes through the neural network.
Hidden Layer: A layer that occurs between the output and input
layers. Hidden layers are tailored to a specific task.
Output Layer: The last layer in a neural network. This layer is
where the predictions are generated based on the information
captured in the hidden layers.
Additional Reading
You can use the AWS DeepLens Recipes website to find
different learning paths based on your level of expertise. For
example, you can choose either a student or teacher path.
Additionally, you can choose between beginner, intermediate,
and advanced projects which have been created and vetted by
the AWS DeepLens team.
You can check out the AWS machine learning blog to learn
about recent advancements in machine learning. Additionally,
you can use the AWS DeepLens tag to see projects which have
been created by the AWS DeepLens team.
Ready to get started? Check out the Getting started guide in
the AWS DeepLens Developer Guide.
AWS DeepLens
AWS DeepLens allows you to create and deploy end-to-end computer
vision–based applications. The following video provides a brief
introduction to how AWS DeepLens works and how it uses other AWS
services.
Summary
AWS DeepLens is a deep learning–enabled camera that allows you to
deploy trained models directly to the device. You can either use
sample templates and recipes or train your own model.
AWS DeepLens is integrated with several AWS machine learning
services and can perform local inference against deployed models
provisioned from the AWS Cloud. It enables you to learn and explore
the latest artificial intelligence (AI) tools and techniques for developing
computer vision applications based on a deep learning model.
First, you use the AWS console to create your project, store your
data, and train your model.
Then, you use your trained model on the AWS DeepLens device.
On the device, the video stream from the camera is processed,
inference is performed, and the output from inference is passed
into two output streams:
Device stream – The video stream passed through without
processing.
Project stream – The results of the model's processing of
the video frames.
Additional Reading
To learn more about the specifics of the AWS DeepLens device,
see the AWS DeepLens Hardware Specifications in
the AWS DeepLens Developer Guide.
You can buy an AWS DeepLens device on Amazon.com.
Summary
AWS DeepLens is integrated with multiple AWS services. You use these
services to create, train, and launch your AWS DeepLens project. To
create any AWS DeepLens–based project you will need an AWS
account.
Important
Storing data, training a model, and using AWS Lambda to deploy
your model incur costs on your AWS account. For more
information, see the AWS account requirements page.
You are not required to follow this demo on the AWS console.
However, we recommend you watch it and understand the flow of
completing a computer vision project with AWS DeepLens.
Summary
In reinforcement learning (RL), an agent is trained to achieve a goal based on the
feedback it receives as it interacts with an environment. It collects a number as
a reward for each action it takes. Actions that help the agent achieve its goal are
incentivized with higher numbers. Unhelpful actions result in a low reward or no
reward.
With a learning objective of maximizing total cumulative reward, over time, the
agent learns, through trial and error, to map gainful actions to situations. The better
trained the agent, the more efficiently it chooses actions that accomplish its goal.
Summary
Reinforcement learning is used in a variety of fields to solve real-world problems. It’s
particularly useful for addressing sequential problems with long-term goals. Let’s
take a look at some examples.
Some examples of real-world RL include: Industrial robotics, fraud detection, stock trading,
and autonomous driving
New Terms
Agent: The piece of software you are training is called an agent. It makes
decisions in an environment to reach a goal.
Environment: The environment is the surrounding area with which the agent
interacts.
Reward: Feedback is given to an agent for each action it takes in a given
state. This feedback is a numerical reward.
Action: For every state, an agent needs to take an action toward achieving its
goal.
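The loop below is a minimal sketch of these terms in code, assuming a Gymnasium-style environment API. CartPole is just a stand-in environment; a DeepRacer agent would use its own environment and a trained policy instead of random actions.

```python
# Agent/environment loop: take actions, receive rewards, try to maximize them.
import gymnasium as gym

env = gym.make("CartPole-v1")          # the environment the agent interacts with
observation, info = env.reset()        # the agent's initial view of the state

total_reward = 0.0
for _ in range(100):
    action = env.action_space.sample()  # a trained agent would choose, not sample
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # the feedback the agent tries to maximize
    if terminated or truncated:
        observation, info = env.reset()

env.close()
print(total_reward)
```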
Summary
AWS DeepRacer may be autonomous, but you still have an important role to play in
the success of your model. In this section, we introduce the training algorithm,
action space, hyperparameters, and reward function and discuss how your
ideas make a difference.
An algorithm is a set of instructions that tells a computer what to do. ML is
special because it enables computers to learn without being explicitly
programmed to do so.
The training algorithm defines your model’s learning objective, which is to
maximize total cumulative reward. Different algorithms have different
strategies for going about this.
A soft actor critic (SAC) embraces exploration and is data-efficient, but
can lack stability.
A proximal policy optimization (PPO) is stable but data-hungry.
An action space is the set of all valid actions, or choices, available to an agent
as it interacts with an environment.
Discrete action space represents all of an agent's possible actions for
each state in a finite set of steering angle and throttle value
combinations.
Continuous action space allows the agent to select an action from a
range of values that you define for each state.
Hyperparameters are variables that control the performance of your agent
during training. There is a variety of different categories with which to
experiment. Change the values to increase or decrease the influence of
different parts of your model.
For example, the learning rate is a hyperparameter that controls how
many new experiences are counted in learning at each step. A higher
learning rate results in faster training but may reduce the model’s
quality.
The reward function's purpose is to encourage the agent to reach its goal.
Figuring out how to reward which actions is one of your most important jobs.
Summary
This video put the concepts we've learned into action by imagining the reward
function as a grid mapped over the race track in AWS DeepRacer’s training
environment, and visualizing it as metrics plotted on a graph. It also introduced the
trade-off between exploration and exploitation, an important challenge unique to
this type of machine learning.
Each square is a state. The green square is the starting position, or initial state, and the
finish line is the goal, or terminal state.
Key points to remember about reward functions:
Each state on the grid is assigned a score by your reward function. You
incentivize behavior that supports your car’s goal of completing fast laps by
giving the highest numbers to the parts of the track on which you want it to
drive.
The reward function is the actual code you'll write to help your agent
determine if the action it just took was good or bad, and how good or bad it
was (a short sketch of such a function appears after this list).
The squares marked with an X are the track edges and are defined as terminal states, which tell
your car it has gone off track.
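As a concrete illustration, here is a minimal sketch of a reward function in the form the AWS DeepRacer console expects: a Python function that receives a params dictionary and returns a number. The center-line logic follows the commonly used basic example; the thresholds are illustrative.

```python
# A simple "stay near the center line" reward function for AWS DeepRacer.
def reward_function(params):
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]

    # Give higher rewards the closer the car stays to the center of the track
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3   # likely off track: near-zero reward

    return float(reward)
```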
Key points to remember about exploration versus exploitation:
When a car first starts out, it explores by wandering in random directions.
However, the more training an agent gets, the more it learns about an
environment. This experience helps it become more confident about the
actions it chooses.
Exploitation means the car begins to exploit or use information from previous
experiences to help it reach its goal. Different training algorithms utilize
exploration and exploitation differently.
Key points to remember about the reward graph:
While training your car in the AWS DeepRacer console, your training metrics
are displayed on a reward graph.
Plotting the total reward from each episode allows you to see how the model
performs over time. The more reward your car gets, the better your model
performs.
Key points to remember about AWS DeepRacer:
AWS DeepRacer is a combination of a physical car and a virtual simulator in
the AWS Console, the AWS DeepRacer League, and community races.
An AWS DeepRacer device is not required to start learning: you can start now
in the AWS console. The 3D simulator in the AWS console is where training
and evaluation take place.
New Terms
Exploration versus exploitation: An agent should exploit known
information from previous experiences to achieve higher cumulative
rewards, but it also needs to explore to gain new experiences that can be
used in choosing the best actions in the future.
Additional Reading
If you are interested in more tips, workshops, classes, and other resources
for improving your model, you'll find a wealth of resources on the AWS
DeepRacer Pit Stop page.
For detailed step-by-step instructions and troubleshooting support, see
the AWS DeepRacer Developer Documentation.
If you're interested in reading more posts on a range of DeepRacer topics as
well as staying up to date on the newest releases, check out the AWS
Discussion Forums.
If you're interested in connecting with a thriving global community of
reinforcement learning racing enthusiasts, join the AWS DeepRacer Slack
community.
If you're interested in tinkering with DeepRacer's open-source device
software and collaborating with robotics innovators, check out our AWS
DeepRacer GitHub Organization.
Summary
This demonstration walks through the evaluation process in the AWS DeepRacer console.
Once you've created a successful model, you'll learn how to enter it into a race for
the chance to win awards, prizes, and the opportunity to compete in the
worldwide AWS DeepRacer Championship.
Important
To get you started, AWS DeepComposer provides a 12-month Free Tier for first-time users.
With the Free Tier, you can perform up to 500 inference jobs, translating to 500 pieces of
music, using the AWS DeepComposer Music studio. You can use one of these instances to
complete the exercise at no cost. For more information, please read the AWS account
requirements page.
Demo Part 1:
Demo Part 2:
Summary
In the demo, you learned how to create music using AWS DeepComposer.
You will need a music track to get started, and there are several ways to provide one. You can
record your own using the AWS keyboard device or the virtual keyboard provided in the console,
or you can input a MIDI file or choose a provided music track.
Once the music track is input, choose "Continue" to create a model. The models you can
choose from are AR-CNN, GAN, and Transformers; each has a slightly different function.
After choosing a model, you can then adjust the parameters used to train the model.
Once you are done with model creation, you can select "Continue" to listen to and improve your
output melody. To edit the melody, you can either drag or extend notes directly on the piano
roll, or adjust the model parameters and train it again. Keep tuning your melody until you are
happy with it, then click "Continue" to finish the composition.
If you want to enhance your music further with another generative model, you can do it too.
Simply choose a model under the "Next step" section and create a new model to enhance
your music.
Congratulations on creating your first piece of music using AWS DeepComposer! Now you
can download the melody or submit it to a competition. Hope you enjoy the journey of
creating music with AWS DeepComposer.
You can learn more about SageMaker costs in the Amazon SageMaker pricing
documentation.
Now that it’s configured and ready to use, let’s take a moment to investigate what’s inside the
notebook.
The notebook is built around two networks:
A generator network that tries to generate data based on the data it was trained on.
A discriminator network that is trained to differentiate between real data and data which is
created by the generator.
[Diagram: the generator and discriminator networks]
Note: While executing the cell that installs dependency packages, you may see warning
messages indicating that later versions of conda are available for certain packages. It is
completely OK to ignore this message. It should not affect the execution of this notebook.
The next section of the notebook is where we’ll prepare the data so it can train the generator
network.
Data often comes from many places (like a website, IoT sensors, a hard drive, or physical
paper) and it’s usually not clean or in the same format. Before you can better understand your
data, you need to make sure it’s in the right format to be analyzed. Thankfully, there are
library packages that can help! One such library is called NumPy, which was imported into
our notebook.
The data we are preparing today is music and it comes formatted in what’s called a “piano
roll”. Think of a piano roll as a 2D image where the X-axis represents time and the Y-axis
represents the pitch value. Using music as images allows us to leverage existing techniques
within the computer vision domain.
Our data is stored as a NumPy array, or grid of values. Our dataset comprises 229 samples of
4 tracks (all tracks are piano). Each sample is a 32 time-step snippet of a song, so our dataset
has a shape of (number of samples, time steps, pitch range, number of tracks).
Much like there are different libraries to help with cleaning and formatting data, there are also
different frameworks. Some frameworks are better suited for particular kinds of machine
learning workloads, and for this deep learning use case, we're going to use the TensorFlow
framework with the Keras library.
We'll use the dataset object to feed batches of data into our model.
To create the custom GAN, you will need to use an instance type that is not covered in the
Amazon SageMaker free tier. You may incur a cost if you want to build a custom GAN.
You can learn more about SageMaker costs in the Amazon SageMaker pricing
documentation.
Model Architecture
Before we can train our model, let’s take a closer look at model architecture including how
GAN networks interact with the batches of data we feed into the model, and how the
networks communicate with each other.
The model consists of two networks, a generator and a discriminator (critic). These two
networks work in a tight loop:
The generator takes in a batch of single-track piano rolls (melody) as the input and generates
a batch of multi-track piano rolls as the output by adding accompaniments to each of the
input music tracks.
The discriminator evaluates the generated music tracks and predicts how far they deviate
from the real data in the training dataset.
The feedback from the discriminator is used by the generator to help it produce more
realistic music the next time.
As the generator gets better at creating better music and fooling the discriminator, the
discriminator needs to be retrained by using music tracks just generated by the generator as
fake inputs and an equivalent number of songs from the original dataset as the real input.
We alternate between training these two networks until the model converges and produces
realistic music.
The discriminator is a binary classifier which means that it classifies inputs into two groups,
e.g. “real” or “fake” data.
As the model tries to identify data as “real” or “fake”, it’s going to make errors. Any
prediction different than the ground truth is referred to as an error.
The measure of the error in the prediction, given a set of weights, is called a loss function.
Weights represent how important an associated feature is to determining the accuracy of a
prediction.
Loss functions are an important element of training a machine learning model because they
are used to update the weights after every iteration of your model. Updating weights after
iterations optimizes the model making the errors smaller and smaller.
Training and tuning models can take a very long time – weeks or even months sometimes.
Our model will take around an hour to train.
Model Evaluation
Now that the model has finished training it’s time to evaluate its results.
There are several evaluation metrics you can calculate for classification problems and
typically these are decided in the beginning phases as you organize your workflow.
You can:
12. Run the cell to restore the saved checkpoint. If you don't want to wait to complete the
training you can use data from a pre-trained model by setting TRAIN = False in the cell.
13. Run the cell to plot the losses.
14. Run the cell to plot the metrics.
Finally, we are ready to hear what the model produced and visualize the piano roll output!
Once the model is trained and producing acceptable quality, it’s time to see how it does on
data it hasn’t seen. We can test the model on these unknown inputs, using the results as a
proxy for performance on future data.
16. In the first cell, enter 500 as the iteration number, run the cell, and play the music
snippet. Or listen to the example snippet at iteration 500.
17. In the second cell, enter 500 as the iteration number, run the cell, and display the piano
roll.
Play around with the iteration number and see how the output changes over time!
Do you see or hear a quality difference between iteration 500 and iteration 950?
18. Run the next cell to create a video to see how the generated piano rolls change over time.
Inference
Now that the GAN has been trained we can run it on a custom input to generate music.
19. Run the cell to generate a new song based on "Twinkle Twinkle Little Star", or listen
to the example of the generated music provided on this page.
20. Run the next cell and play the generated music, or listen to the example provided on
this page.
Stop and Delete the Jupyter Notebook When You Are Finished!
This project is not covered by the AWS Free Tier so your project will continue to accrue
costs as long as it is running.
4. Select Delete
Recap
In this demo, we learned how to set up a Jupyter notebook in Amazon SageMaker, reviewed
machine learning code, and saw what data preparation, model training, and model evaluation can
look like in a notebook instance. While this was a fun use case for us to explore, the concepts
and techniques can be applied to other machine learning projects, like an object detector or
sentiment analysis on text.
Identify AWS machine learning offerings and how different services are used for
different applications
Explain the fundamentals of computer vision and a couple of popular tasks
Describe how reinforcement learning works in the context of AWS DeepRacer
Explain the fundamentals of generative AI, its applications, and three well-known
generative AI models in the context of music and AWS DeepComposer
Glossary
Action: For every state, an agent needs to take an action toward achieving its goal.
Agent: The piece of software you are training is called an agent. It makes decisions in
an environment to reach a goal.
Discriminator: A neural network trained to differentiate between real and synthetic
data.
Discriminator loss: Evaluates how well the discriminator differentiates between real
and fake data.
Edit event: When a note is either added or removed from your input track during
inference.
Environment: The environment is the surrounding area within which the agent
interacts.
Exploration versus exploitation: An agent should exploit known information from
previous experiences to achieve higher cumulative rewards, but it also needs to
explore to gain new experiences that can be used in choosing the best actions in the
future.
Generator: A neural network that learns to create new data resembling the source
data on which it was trained.
Generator loss: Measures how far the output data deviates from the real data present
in the training dataset.
Hidden layer: A layer that occurs between the output and input layers. Hidden layers
are tailored to a specific task.
Input layer: The first layer in a neural network. This layer receives all data that
passes through the neural network.
Output layer: The last layer in a neural network. This layer is where the predictions
are generated based on the information captured in the hidden layers.
Piano roll: A two-dimensional piano roll matrix that represents input tracks. Time is
on the horizontal axis and pitch is on the vertical axis.
Reward: Feedback is given to an agent for each action it takes in a given state. This
feedback is a numerical reward.
Clean and Modular Code
Production code: Software running on production servers to handle live users and data of the
intended audience. Note that this is different from production-quality code, which describes
code that meets expectations for production in reliability, efficiency, and other aspects.
Ideally, all code in production meets these expectations, but this is not always the case.
Clean code: Code that is readable, simple, and concise. Clean production-quality code is
crucial for collaboration and maintainability in software development.
Modular code: Code that is logically broken up into functions and modules. Modular,
production-quality code makes your code more organized, efficient, and reusable.
Module: A file. Modules allow code to be reused by encapsulating it into files that can be
imported into other files.
Question 1 of 2
Which of the following describes code that is clean? Select all the answers that apply.
Question 2 of 2
Making your code modular makes it easier to do which of the following things? There may
be more than one correct answer.
Refactoring code
Refactoring: Restructuring your code to improve its internal structure without
changing its external functionality. This gives you a chance to clean and modularize
your program after you've got it working.
Since it isn't easy to write your best code while you're still trying to just get it
working, allocating time to do this is essential to producing high-quality code. Despite
the initial time and effort required, this really pays off by speeding up your
development time in the long run.
You become a much stronger programmer when you're constantly looking to improve
your code. The more you refactor, the easier it will be to structure and write good
code the first time.
Be descriptive and imply type: For booleans, you can prefix with is_ or has_ to make it clear
it is a condition. You can also use parts of speech to imply types, like using verbs for
functions and nouns for variables.
Be consistent but clearly differentiate: age_list and age is easier to differentiate than
ages and age.
Avoid abbreviations and single letters: You can determine when to make these exceptions
based on the audience for your code. If you work with other data scientists, certain variables
may be common knowledge, whereas if you work with full-stack engineers, it might be
necessary to provide more descriptive names. (Exceptions include counters and common math
variables.)
Long names aren't the same as descriptive names: You should be descriptive, but only with
relevant information. For example, good function names describe what they do well without
including details about implementation or highly specific uses.
Try testing how effective your names are by asking a fellow programmer to guess the
purpose of a function or variable based on its name, without looking at your code. Coming up
with meaningful names often requires effort to get right.
Organize your code with consistent indentation: the standard is to use four spaces for each
indent. You can make this a default in your text editor.
Separate sections with blank lines to keep your code well organized and readable.
Try to limit your lines to around 79 characters, which is the guideline given in the PEP 8 style
guide. In many good text editors, there is a setting to display a subtle line that indicates
where the 79 character limit is.
For more guidelines, check out the code layout section of PEP 8 in the following notes.
References
PEP 8 guidelines for code layout
Imagine you are writing a program that checks current stock prices and decides whether to buy
a stock or add it to a watchlist. Below is a small snippet of this program, written with three
different naming choices. Which of the following naming choices could make this code cleaner?
There may be more than one correct answer.
# Choice A
stock_limit_prices = {'LUX': 62.48, 'AAPL': 127.67, 'NVDA': 161.24}
for stock_ticker, stock_limit_price in stock_limit_prices.items():
    if stock_limit_price <= get_current_stock_price(stock_ticker):
        buy_stock(stock_ticker)
    else:
        watchlist_stock(stock_ticker)

# Choice B
prices = {'LUX': 62.48, 'AAPL': 127.67, 'NVDA': 161.24}
for ticker, price in prices.items():
    if price <= current_price(ticker):
        buy(ticker)
    else:
        watchlist(ticker)

# Choice C
limit_prices = {'LUX': 62.48, 'AAPL': 127.67, 'NVDA': 161.24}
for ticker, limit in limit_prices.items():
    if limit <= get_current_price(ticker):
        buy(ticker)
    else:
        watchlist(ticker)
Question 2 of 2
Choice C
Don't repeat yourself! Modularization allows you to reuse parts of your code. Generalize and
consolidate repeated code in functions or loops.
Abstracting out code into a function not only makes it less repetitive, but also improves
readability with descriptive function names. Although your code can become more readable
when you abstract out logic into functions, it is possible to over-engineer this and have way
too many modules, so use your judgement.
There are trade-offs to having function calls instead of inline logic. If you have broken up
your code into an unnecessary amount of functions and modules, you'll have to jump around
everywhere if you want to view the implementation details for something that may be too
small to be worth it. Creating more modules doesn't necessarily result in effective
modularization.
Each function you write should be focused on doing one thing. If a function is doing multiple
things, it becomes more difficult to generalize and reuse. Generally, if there's an "and" in
your function name, consider refactoring.
In general-purpose functions, arbitrary variable names (such as arr for any list) can actually make the code more readable.
Try to use no more than three arguments when possible. This is not a hard rule and there are
times when it is more appropriate to use many parameters. But in many cases, it's more
effective to use fewer arguments. Remember we are modularizing to simplify our code and
make it more efficient. If your function has a lot of parameters, you may want to rethink how
you are splitting this up.
Project Documentation
Project documentation is essential for getting others to understand why and how
your code is relevant to them, whether they are potential users of your project or
developers who may contribute to your code. A great first step in project
documentation is your README file. It will often be the first interaction most users
will have with your project.
Whether it's an application or a package, your project should absolutely come with a
README file. At a minimum, this should explain what it does, list its dependencies,
and provide sufficiently detailed instructions on how to use it. Make it as simple as
possible for others to understand the purpose of your project and quickly get
something working.
Translating all your ideas and thoughts formally on paper can be a little difficult, but
you'll get better over time, and doing so makes a significant difference in helping
others realize the value of your project. Writing this documentation can also help
you improve the design of your code, as you're forced to think through your design
decisions more thoroughly. It also helps future contributors to follow your original
intentions.
Here are a few examples of projects with great README files:
Bootstrap
Scikit-learn
Stack Overflow Blog
Quiz: Documentation
QUESTION 1 OF 2
Readable code is preferable over having comments to make your code readable.
Scenario #1
Let's walk through the Git commands that go along with each step in the scenario
you just observed in the video.
Step 1: You have a local version of this repository on your laptop, and to get the latest stable
version, you pull from the develop branch.
Switch to the develop branch
git checkout develop
Pull the latest changes in the develop branch
git pull
Step 2: When you start working on this demographic feature, you create a new branch called
demographic, and start working on your code in this branch.
Create and switch to a new branch called demographic from the develop branch
git checkout -b demographic
Work on this new feature and commit as you go
git commit -m 'added gender recommendations'
git commit -m 'added location specific recommendations'
...
Step 3: However, in the middle of your work, you need to work on another feature. So you
commit your changes on this demographic branch, and switch back to the develop branch.
Commit your changes before switching
git commit -m 'refactored demographic gender and location recommendations'
Switch to the develop branch
git checkout develop
Step 4: From this stable develop branch, you create another branch for a new feature called
friend_groups.
Create and switch to a new branch called friend_groups from the develop branch
git checkout -b friend_groups
Step 5: After you finish your work on the friend_groups branch, you commit your changes,
switch back to the develop branch, merge friend_groups into develop, and push this to the
remote repository’s develop branch.
Commit your changes before switching
git commit -m 'finalized friend_groups recommendations'
Switch to the develop branch
git checkout develop
Merge the friend_groups branch into the develop branch
git merge --no-ff friend_groups
Push to the remote repository
git push origin develop
Step 6: Now, you can switch back to the demographic branch to continue your progress on
that feature.
Switch to the demographic branch
git checkout demographic
Scenario #2
Let's walk through the Git commands that go along with each step in the scenario
you just observed in the video.
Step 1: You check your commit history, seeing messages about the changes you made and
how well the code performed.
View the log history
git log
Step 2: The model at this commit seemed to score the highest, so you decide to take a look.
Check out a commit
git checkout bc90f2cbc9dc4e802b46e7a153aa106dc9a88560
After inspecting your code, you realize what modifications made it perform well, and
use those for your model.
Step 3: Now, you're confident merging your changes back into the development branch and
pushing the updated recommendation engine.
Switch to the develop branch
git checkout develop
Merge the friend_groups branch into the develop branch
git merge --no-ff friend_groups
Push your changes to the remote repository
git push origin develop
Scenario #3
Let's walk through the Git commands that go along with each step in the scenario
you just observed in the video.
Step 1: Andrew commits his changes to the documentation branch, switches to the
development branch, and pulls down the latest changes from the cloud on this development
branch, including the change I merged previously for the friends group feature.
Commit the changes on the documentation branch
git commit -m "standardized all docstrings in process.py"
Switch to the develop branch
git checkout develop
Pull the latest changes on the develop branch down
git pull
Step 2: Andrew merges his documentation branch into the develop branch on his local
repository, and then pushes his changes up to update the develop branch on the remote
repository.
Merge the documentation branch into the develop branch
git merge --no-ff documentation
Push the changes up to the remote repository
git push origin develop
Step 3: After the team reviews your work and Andrew's work, they merge the updates from
the development branch into the master branch. Then, they push the changes to the master
branch on the remote repository. These changes are now in production.
Merge the develop branch into the master branch
git merge --no-ff develop
Push the changes up to the remote repository
git push origin master
Resources
Read this great article on a successful Git branching strategy.
Most commonly, this happens when two branches modify the same file.
For example, in this situation, let’s say you deleted a line that Andrew modified on
his branch. Git wouldn’t know whether to delete the line or modify it. You need to
tell Git which change to take, and some tools even allow you to edit the change
manually. If it isn’t straightforward, you may have to consult with the developer of
the other branch to handle a merge conflict.
To learn more about merge conflicts and methods to handle them, see About
merge conflicts.
Model versioning
In the previous example, you may have noticed that each commit was
documented with a score for that model. This is one simple way to help you
keep track of model versions. Version control in data science can be tricky,
because there are many pieces involved that can be hard to track, such as large
amounts of data, model versions, seeds, and hyperparameters.
The following resources offer useful methods and tools for managing model
versions and large amounts of data. These are here for you to explore, but are
not necessary to know now as you start your journey as a data scientist. On the
job, you’ll always be learning new skills, and many of them will be specific to the
processes set in your company.
Resources
Four Ways Data Science Goes Wrong and How Test-Driven Data Analysis Can
Help: Blog Post
Ned Batchelder: Getting Started Testing: Slide Deck and Presentation Video
Unit tests
We want to test our functions in a way that is repeatable and automated. Ideally,
we'd run a test program that runs all our unit tests and cleanly lets us know which
ones failed and which ones succeeded. Fortunately, there are great tools available in
Python that we can use to create effective unit tests!
To learn more about integration testing and how integration tests relate to unit
tests, see Integration Testing. That article contains other very useful links as well.
Unit Testing Tools
To install pytest, run pip install -U pytest in your terminal. You can see more
information on getting started here.
Create a test file starting with test_ .
Define unit test functions that start with test_ inside the test file.
Enter pytest into your terminal in the directory of your test file and it detects
these tests for you.
test_ is the default; if you wish to change this, you can learn how in
this pytest configuration.
In the test output, periods represent successful unit tests and Fs represent failed
unit tests. Since all you see is which test functions failed, it's wise to have only
one assert statement per test. Otherwise, you won't know exactly how many
tests failed or which tests failed.
Your test won't be stopped by failed assert statements, but it will stop if you
have syntax errors.
Log messages
Logging is the process of recording messages to describe events that have occurred
while running your software. Let's take a look at a few examples, and learn tips for
writing good log messages.
QUIZ QUESTION
What are some ways this log message could be improved? There may
be more than one correct answer.
ERROR - Failed to compute product similarity. I made
sure to fix the error from October so not sure why this
would occur again.
Use the DEBUG level rather than the ERROR level for this log message.
Add more details about this error, such as what step or product the program was on
when this occurred.
Code reviews
Code reviews benefit everyone in a team to promote best programming practices
and prepare code for production. Let's go over what to look for in a code review and
some tips on how to conduct one.
Code reviews
Code review best practices
As you may have noticed, with code reviews you are now dealing with people, not
just computers. So it's important to be thoughtful of their ideas and efforts. You are
in a team and there will be differences in preferences. The goal of code review isn't
to make all code follow your personal preferences, but to ensure it meets a
standard of quality for the whole team.
Tip: Use a code linter
This isn't really a tip for code review, but it can save you lots of time in a code
review. Using a Python code linter like pylint can automatically check for coding
standards and PEP 8 guidelines for you. It's also a good idea to agree on a style
guide as a team to handle disagreements on code style, whether that's an existing
style guide or one you create together incrementally as a team.
BETTER: Make the model evaluation code its own module. This will
simplify models.py to be less repetitive and focus primarily on
building models.
GOOD: How about we consider making the model evaluation code its
own module? This would simplify models.py to only include code
for building models. Organizing these evaluations methods into
separate functions would also allow us to reuse them with
different models without repeating code.
BAD: I wouldn't groupby genre twice like you did here... Just
compute it once and use that for your aggregations.
BAD: You create this groupby dataframe twice here. Just compute
it once, save it as groupby_genre and then use that to get your
average prices and views.
Let's say you were reviewing code that included the following lines:
first_names = []
last_names = []
df['first_name'] = first_names
df['last_names'] = last_names
BAD: You can do this all in one step by using the pandas
str.split method.
GOOD: We can actually simplify this step to the line below using
the pandas str.split method. Found this on this Stack Overflow
post: https://stackoverflow.com/questions/14745022/how-to-split-a-column-into-two-columns
df['first_name'], df['last_name'] = df['name'].str.split(' ', 1).str
OOP
Lesson outline
Object-oriented programming syntax
Procedural vs. object-oriented programming
Classes, objects, methods and attributes
Coding a class
Magic methods
Inheritance
Using object-oriented programming to make a Python package
Making a package
Tour of scikit-learn source code
Putting your package on PyPi
Lesson files
This lesson uses classroom workspaces that contain all of the files and functionality
you need. You can also find the files in the data scientist nanodegree term 2 GitHub
repo.
Procedural versus object-oriented programming
QUIZ QUESTION
Match the vocabulary term on the left with the examples on the right.
Gray, large, round
Scientist, chancellor, actor
Stephen Hawking, Angela Merkel, Brad Pitt
To rain, to ring, to ripen
Color, size, shape
TERM
EXAMPLES
Object
Class
Attribute
Method
Value
self.price = new_price
self tells Python where to look in the computer's memory for the shirt_one object.
Then, Python changes the price of the shirt_one object. When you call
the change_price method, shirt_one.change_price(12), self is implicitly passed
in.
The word self is just a convention. You could actually use any other name as long
as you are consistent, but you should use self to avoid confusing people.
You need to download three files for this exercise. These files are located on this
page in the Supporting materials section.
Shirt_exercise.ipynb contains explanations and instructions.
Answer.py contains the solution to the exercise.
Tests.py contains tests for checking your code; you can run these tests using the last
code cell at the bottom of the notebook.
Getting started
Open the Shirt_exercise.ipynb notebook file using Jupyter Notebook and follow
the instructions in the notebook to complete the exercise.
Supporting Materials
Answer
Tests
Shirt Exercise
def get_price(self):
return self._price
In terms of object-oriented programming, the rules in Python are a bit looser than in
other programming languages. As previously mentioned, in some languages, like C++,
you can explicitly state whether or not an object should be allowed to change or
access an attribute's values directly. Python does not have this option.
For example, suppose you set shirt prices directly in US dollars:
shirt_one.price = 10 # US dollars
If the prices later need to be in Euros, you'll have to manually change every such assignment:
shirt_one.price = 8 # Euros
If you had used a method, then you would only have to change the method to
convert from dollars to Euros.
shirt_one.change_price(10)
For the purposes of this introduction to object-oriented programming, you don't
need to worry about updating attributes directly versus with a method; however, if
you decide to further your study of object-oriented programming, especially in
another language such as C++ or Java, you'll have to take this into consideration.
Modularized code
Thus far in the lesson, all of the code has been in Jupyter Notebooks. For example,
in the previous exercise, a code cell loaded the Shirt class, which gave you access
to the shirt class throughout the rest of the notebook.
If you were developing a software program, you would want to modularize this
code. You would put the Shirt class into its own Python script, which you might
call shirt.py . In another Python script, you would import the Shirt class with a line
like from shirt import Shirt .
For now, as you get used to OOP syntax, you'll be completing exercises in Jupyter
Notebooks. Midway through the lesson, you'll modularize object-oriented code into
separate files.
In the first part, you'll write a Pants class. This class is similar to
the Shirt class with a couple of changes. Then you'll practice
instantiating Pants objects.
In the second part, you'll write another class called SalesPerson. You'll also
instantiate objects for the SalesPerson.
This exercise requires two files, which are located on this page in the Supporting
Materials section.
exercise.ipynb contains explanations and instructions.
answer.py contains the solution to the exercise.
Getting started
Open the exercise.ipynb notebook file using Jupyter Notebook and follow the
instructions in the notebook to complete the exercise.
Supporting Materials
Exercise
Answer
From this point on, please always comment your code. Use both inline comments
and document-level comments as appropriate.
To learn more about docstrings, see Example Google Style Python Docstrings.
Make sure to indent your docstrings correctly or the code will not run. A
docstring should be indented one indentation underneath the class or
method being described.
You don't have to define self in your method docstrings. It's understood
that any method will have self as the first method input.
class Pants:
    """The Pants class represents an article of clothing sold in
    a store
    """

    def __init__(self, color, waist_size, length, price):
        """Method for initializing a Pants object

        Args:
            color (str)
            waist_size (int)
            length (int)
            price (float)

        Attributes:
            color (str): color of a pants object
            waist_size (str): waist size of a pants object
            length (str): length of a pants object
            price (float): price of a pants object
        """
        self.color = color
        self.waist_size = waist_size
        self.length = length
        self.price = price

    def change_price(self, new_price):
        """The change_price method changes the price attribute of a
        pants object

        Args:
            new_price (float): the new price of the pants object

        Returns: None
        """
        self.price = new_price

    def discount(self, percentage):
        """The discount method outputs a discounted price of a
        pants object

        Args:
            percentage (float): a decimal representing the amount
            to discount

        Returns:
            float: the discounted price
        """
        return self.price * (1 - percentage)
Gaussian class
mean
\mu = n * p
In other words, a fair coin has a probability of a positive outcome (heads) p = 0.5. If
you flip a coin 20 times, the mean would be 20 * 0.5 = 10; you'd expect to get 10
heads.
variance
\sigma^2 = n * p * (1 - p)
Continuing with the coin example, n would be the number of coin tosses and p
would be the probability of getting heads.
standard deviation
\sigma = \sqrt{n * p * (1 - p)}
Further resources
If you would like to review the Gaussian (normal) distribution and binomial
distribution, here are a few resources:
This free Udacity course, Intro to Statistics, has a lesson on Gaussian distributions as
well as the binomial distribution.
This free course, Intro to Descriptive Statistics, also has a Gaussian distributions
lesson.
There are also relevant Wikipedia articles:
Quiz
Here are a few quiz questions to help you determine how well you understand the
Gaussian and binomial distributions. Even if you can't remember how to answer
these types of questions, feel free to move on to the next part of the lesson;
however, the material assumes you know what these distributions are and that you
know the basics of how to work with them.
QUESTION 1 OF 3
0.44
0.059
Great job! When finding the probabilities using a continuous distribution, the probability
of obtaining an exact value is zero. If the question had been what is the probability that
a man's weight is between 184.99 and 185.01, then the answer would be a small but
positive value of 0.0002.
QUESTION 2 OF 3
0.23
0.27
0.19
Correct! The area under this particular Gaussian distribution between 120 and 155
would be 0.19. The area under the Gaussian curve represents the probability.
QUESTION 3 OF 3
0.14
0.05
0.12
Well done! The answer is 0.12. You can use either an online calculator or the
binomial distribution formula to get the result.
In this exercise, you will use the Gaussian distribution class for calculating and
visualizing a Gaussian distribution.
This exercise requires three files, which are located on this page in the Supporting
materials section.
Gaussian_code_exercise.ipynb contains explanations and instructions.
Answer.py contains the solution to the exercise.
Numbers.txt can be read in by the read_data_file() method.
Getting started
Open the Gaussian_code_exercise.ipynb notebook file using Jupyter Notebook and
follow the instructions in the notebook to complete the exercise.
Supporting Materials
Gaussian Code Exercise
Numbers
Answer
Getting started
Open the file using Jupyter Notebook and follow these instructions:
To give another example of inheritance, read through the code in this Jupyter
Notebook to see how the code works.
You can see the Gaussian distribution code is refactored into a generic
distribution class and a Gaussian distribution class.
The distribution class takes care of the initialization and
the read_data_file method. The rest of the Gaussian code is in the
Gaussian class. You'll use this distribution class in an exercise at the end of
the lesson.
Run the code in each cell of this Jupyter Notebook.
Supporting Materials
Inheritance Probability Distribution
Throughout the lesson, you can do all of your work in a classroom workspace. These
workspaces provide interfaces that connect to virtual machines in the cloud.
However, if you want to run this code locally on your computer, the commands you
use might be slightly different.
If you are using macOS, you can open an application called Terminal and use the
same commands that you use in the workspace. That is because macOS, like Linux, is
Unix-based.
If you are using Windows, the analogous application is the Command Prompt (or PowerShell).
Its commands can be somewhat different from the Terminal commands, so
use a search engine to find the right commands in a Windows environment.
The classroom workspace has one major benefit. You can do whatever you want to
the workspace, including installing Python packages. If something goes wrong, you
can reset the workspace and start with a clean slate; however, always download
your code files or commit your code to GitHub or GitLab before resetting a
workspace. Otherwise, you'll lose your code!
At the bottom of this page under Supporting materials, download three files.
Gaussiandistribution.py
Generaldistribution.py
example_code.py
Look at how the distribution class and Gaussian class are modularized into different
files.
Supporting Materials
Generaldistribution
Gaussiandistribution
Example Code
Use the following list of resources to learn more about advanced Python object-
oriented programming topics.
Python's Instance, Class, and Static Methods Demystified: This article explains
different types of methods that can be accessed at the class or object level.
Class and Instance Attributes: You can also define attributes at the class level
or at the instance level.
Mixins for Fun and Profit: A class can inherit from multiple parent classes.
Primer on Python Decorators: Decorators are a short-hand way to use
functions inside other functions.
Making a package
In the previous section, the distribution and Gaussian code was refactored into
individual modules. A Python module is just a Python file containing code.
In this next section, you'll convert the distribution code into a Python package.
A package is a collection of Python modules. Although the previous code might
already seem like it was a Python package because it contained multiple files, a
Python package also needs an __init__.py file. In this section, you'll learn how to
create this __init__.py file and then pip install the package into your local Python
installation.
What is pip?
pip is a Python package manager that helps with installing and uninstalling Python
packages. You might have used pip to install packages using the command line: pip
install numpy . When you execute a command like pip install
numpy , pip downloads the package from a Python package repository called PyPi.
For this next exercise, you'll use pip to install a Python package from a local folder
on your computer. The last part of the lesson will focus on uploading packages to
PyPi so that you can share your package with the world.
You can complete this entire lesson within the classroom using the provided
workspaces; however, if you want to develop a package locally on your computer,
you should consider setting up a virtual environment. That way, if you install your
package on your computer, the package won't install into your main Python
installation. Before starting the next exercise, the next part of the lesson will discuss
what virtual environments are and how to use them.
Python environments
In the next part of the lesson, you'll be given a workspace where you can upload
files into a Python package and pip install the package. If you decide to install your
package on your local computer, you'll want to create a virtual environment. A
virtual environment is a silo-ed Python installation apart from your main Python
installation. That way you can install packages and delete the virtual environment
without affecting your main Python installation.
Let's talk about two different Python environment managers: conda and venv . You
can create virtual environments with either one. The following sections describe
each of these environment managers, including some advantages and
disadvantages. If you've taken other data science, machine learning, or artificial
intelligence courses at Udacity, you're probably already familiar with conda .
Conda
Conda does two things: manages packages and manages environments.
As a package manager, conda makes it easy to install Python packages, especially for
data science. For instance, typing conda install numpy installs the numpy package.
As an environment manager, conda allows you to create silo-ed Python installations.
With an environment manager, you can install packages on your computer without
affecting your main Python installation.
The command line code looks something like the following:
conda create --name environmentname
conda activate environmentname
conda install numpy
Which to choose
Whether you choose to create environments with venv or conda will depend on
your use case. conda is very helpful for data science projects, but conda can make
generic Python software development a bit more confusing; that's the case for this
project.
If you create a conda environment, activate the environment, and then pip install
the distributions package, you'll find that the system installs your
package globally rather than in your local conda environment. However, if you
create the conda environment and install pip simultaneously, you'll find
that pip behaves as expected when installing packages into your local environment:
conda create --name environmentname pip
On the other hand, using pip with venv works as expected. pip and venv tend to be
used for generic software development projects including web development. For
this lesson on creating packages, you can use conda or venv if you want to develop
locally on your computer and install your package.
The following video shows how to use venv , which is what we recommend for this
project.
Instructions for venv
For instructions about how to set up virtual environments on a macOS, Linux, or
Windows machine using the terminal, see Installing packages using pip and virtual
environments.
Refer to the following notes for understanding the tutorial:
If you are using Python 2.7.9 or later (including Python 3), the Python
installation should already come with the Python package manager
called pip. There is no need to install it.
env is the name of the environment you want to create. You can
call env anything you want.
Python 3 comes with a virtual environment package preinstalled. Instead of
typing python3 -m virtualenv env, you can type python3 -m venv
env to create a virtual environment.
Once you've activated a virtual environment, you can then use terminal commands
to go into the directory where your Python library is stored. Then, you can run pip
install .
In the next section, you can practice pip installing and creating virtual environments
in the classroom workspace. You'll see that creating a virtual environment actually
creates a new folder containing a Python installation. Deleting this folder removes
the virtual environment.
If you install packages on the workspace and run into issues, you can always reset
the workspace; however, you will lose all of your work. Be sure to download any files
you want to keep before resetting a workspace.
Instructions
Following the instructions from the previous video, convert the modularized code
into a Python package.
On your local computer, you need to create a folder called 3a_python_package .
Inside this folder, you need to create a few folders and files:
A setup.py file, which is required in order to use pip install.
A subfolder called distributions, which is the name of the Python
package.
Inside the distributions folder, you need:
The Gaussiandistribution.py file (provided).
The Generaldistribution.py file (provided).
The __init__.py file (you need to create this file).
Once everything is set up, in order to actually create the package, use your terminal
window to navigate into the 3a_python_package folder.
Enter the following:
cd 3a_python_package
pip install .
(The dot tells pip to install the package located in the current directory.)
If everything is set up correctly, pip installs the distributions package into the
workspace. You can then start the Python interpreter from the terminal by entering:
python
Then, within the Python interpreter, you can use the distributions package by
entering the following:
from distributions import Gaussian
gaussian_one = Gaussian(25, 2)
gaussian_one.mean
gaussian_one + gaussian_one
In other words, you can import and use the Gaussian class because the distributions
package is now officially installed as part of your Python installation.
If you get stuck, there's a solution provided in the Supporting materials section
called 3b_answer_python_package .
If you want to install the Python package locally on your computer, you might want
to set up a virtual environment first. A virtual environment is a silo-ed Python
installation apart from your main Python installation. That way you can easily delete
the virtual environment without affecting your Python installation.
If you want to try using virtual environments in this workspace first, follow these
instructions:
1. There is an issue with the Ubuntu operating system and Python3, in which
the venv package isn't installed correctly. In the workspace, one way to fix
this is by running this command in the workspace terminal: conda update
python. For more information, see venv doesn't create activate script
python3. Then, enter y when prompted. It might take a few minutes for the
workspace to update. If you are not using Anaconda on your local computer,
you can skip this first step.
2. Enter the following command to create a virtual environment: python -m
venv venv_name where venv_name is the name you want to give to your
virtual environment. You'll see a new folder appear with the Python
installation named venv_name.
3. In the terminal, enter source venv_name/bin/activate. You'll notice that
the command line now shows (venv_name)at the beginning of the line to
indicate you are using the venv_name virtual environment.
4. Enter pip install python_package/. That should install your
distributions Python package.
5. Try using the package in a program to see if everything works!
Supporting Materials
Generaldistribution
Gaussiandistribution
3b Answer Python Package
1. pip install your distributions package. In the terminal, make sure you
are in the 4a_binomial_package directory. If not, navigate there by
entering the following at the command line:
cd 4a_binomial_package
pip install .
2. Run the unit tests. Enter the following.
Modify the Binomialdistribution.py code until all the unit tests pass.
If you change the code in the distributions folder after pip installing the package,
Python will not know about the changes.
When you make changes to the package files, you'll need to reinstall the package by running the following:
pip install --upgrade .
Supporting Materials
4a Binomial Package
4b Answer Binomial Package