AI ML Unit 4 QB
Theory questions
approach?
Regression
If the value to be predicted is continuous, the task is a regression problem in
machine learning. Example: given the area name, size of the land, etc. as features,
predict the expected cost of the land.
Classification
If the value to be predicted is a category or discrete value such as yes/no or
positive/negative, the task is a classification problem in machine learning. Example:
given a sentence, predict whether it is a negative or a positive review.
Clustering
Grouping a set of points into a given number of clusters. Example: given the points
3, 4, 8, 9 and the number of clusters set to 2, the ML system might divide the set into
cluster 1 = {3, 4} and cluster 2 = {8, 9}.
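The grouping in this example can be reproduced with a short sketch. The text does not name a clustering algorithm, so a basic k-means loop is assumed here purely for illustration; the function name `kmeans_1d` and the choice of initial centres are also assumptions.

```python
def kmeans_1d(points, k, iterations=10):
    """Cluster 1-D points into k groups with a basic k-means loop."""
    # Assumption: use the k smallest distinct values as the initial centres.
    centres = sorted(set(points))[:k]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centre (by absolute distance).
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return clusters


clusters = kmeans_1d([3, 4, 8, 9], k=2)
print(clusters)
```

With these four points the loop settles on the same two groups as the example. Other initialisations or distance measures could behave differently on less clearly separated data.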
Ranking
Used for constructing a ranker from a set of labelled examples. This example set
consists of instance groups that can be scored against given criteria. The ranking labels
are { 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups
with unknown scores for each instance.
pictorially.
In clustering, the groups are defined by the “distance” among sample cases, and at the end
the researcher looks for some meaning in that grouping. Regression, classification and
clustering are all based on the content of the sample, and the result reflects that sample.
Extrapolating the conclusion to the population requires validation at the level of
confidence you are willing to assume, which involves additional statistical work.
Regression | Classification
In Regression, the output variable must be of continuous nature or real value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) with the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
Regression Algorithms are used with continuous data. | Classification Algorithms are used with discrete data.
In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
model.
Regression Analysis is an analytical process whose end goal is to understand the
inter-relationships in the data and find as much useful information as possible.
According to the book, there are a number of steps which are loosely detailed below.
1 - Problem definition
The very first step is, of course, to define the problem we are trying to solve. This may
be a business question that needs to be answered or simply a prediction we want to make
based on some set of data. At this stage we must know the target variable and the
attributes we presume affect the target variable. These presumptions are analysed later to
judge their credibility. For the sake of our discussion let's take the Titanic Dataset as an
example. In this dataset we have data on about 900 passengers. The problem we must solve
is predicting which passengers likely survived the tragedy, given their data.
2 - Analyse Data
The key is to have visual representations of our data so we can better understand the
'inter-relationships' of the variables; accordingly, the book I was referring to earlier highly
recommends using visual tools to make the EDA (Exploratory Data Analysis) process easier.
For the aforementioned dataset, we could try answering a number of questions that might give
us a better understanding of the problem at hand. What's the survival rate of passengers
from each class?
minimizes the sum of the squares of the errors, provided that no outliers are
present in the data.
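The least-squares criterion mentioned here can be illustrated for a single input variable. The sketch below uses the standard closed-form solution for a straight line; the data values and the function names `fit_line` and `sum_squared_errors` are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimises the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between actual y and the line's prediction."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, slope, intercept)
print(slope, intercept, best)
```

Nudging the slope or intercept away from the fitted values gives a larger sum of squared errors, which is what "best fit" means under this criterion. A single extreme outlier would pull the line strongly, because its error is squared.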
5 - Model evaluation
The final step is model evaluation: measuring and critically assessing how well the model
fits the data points. We run the model on the test data and check how accurately it
predicts the output values. There are a number of measures for this, as
discussed below:
i) We can find the RMSE (root mean squared error) between the actual Y values and the predicted Y values.
There are other variations of it that can be explored.
There are many other methods, some more complex than others, but these are usually a
good place to start. Based on this analysis, the model is updated and refined, after which it
can be used for its intended purpose.
9. What are the sources of the data that is needed for training the
milling/drilling/lathe?
10. What are the sources of the data that is needed for training the
elements?
11. What are the sources of the data that is needed for training the ML model
12. What are the sources of the data that is needed for training the ML model
13. You have given a task of developing a classification model for identifying
data corresponds to tool state either as healthy or faulty. So what are the
14. What is training data? What is labeled data? What is unlabeled data? What
Machine learning models are as good as the data they're trained on. Without
high-quality training data, even the most efficient machine learning algorithms
will fail to perform.
The need for quality, accurate, complete, and relevant data starts early on in the
training process.
Only if the algorithm is fed good training data can it readily pick up the
features and find the relationships that it needs in order to predict down the line.
More precisely, quality training data is a more significant aspect of machine
learning (and artificial intelligence) than any other.
If you introduce the machine learning (ML) algorithms to the right data, you're
setting them up for accuracy and success.
Training data is the initial dataset used to train machine learning algorithms.
Models create and refine their rules using this data. It's a set of data samples used
to fit the parameters of a machine learning model, training it by example.
Dr. Abhishek D. Patange, Mechanical Engineering, College of Engineering Pune (COEP)
QUESTION BANK FOR UNIT 4: DEVELOPMENT OF ML MODEL
Training data is also known as training dataset, learning set, and training set. It's
an essential component of every machine learning model and helps them make
accurate predictions or perform a desired task.
Simply put, training data builds the machine learning model. It teaches what the
expected output looks like. The model analyzes the dataset repeatedly to deeply
understand its characteristics and adjust itself for better performance.
In a broader sense, training data can be classified into two categories: labeled
data and unlabeled data.
Labeled data is a group of data samples tagged with one or more meaningful
labels. It's also called annotated data, and its labels identify specific
characteristics, properties, classifications, or contained objects.
For example, the images of fruits can be tagged as apples, bananas, or grapes.
Labeled training data is used in supervised learning. It enables ML models to
learn the characteristics associated with specific labels, which can be used to
classify newer data points. In the example above, this means that a model can use
labeled image data to understand the features of specific fruits and use this
information to group new images.
Data labeling or annotation is a time-consuming process as humans need to tag
or label the data points. Labeled data collection is challenging and expensive. It
isn't easy to store labeled data when compared to unlabeled data.
As expected, unlabeled data is the opposite of labeled data. It's raw data or data
that's not tagged with any labels for identifying classifications, characteristics, or
properties. It's used in unsupervised machine learning, and the ML models have
to find patterns or similarities in the data to reach conclusions.
Going back to the previous example of apples, bananas, and grapes, in unlabeled
training data, the images of those fruits won't be labeled. The model will have to
evaluate each image by looking at its characteristics, such as color and shape.
After analyzing a considerable number of images, the model will be able to
differentiate new images (new data) into the fruit types of apples, bananas, or
grapes. Of course, the model wouldn't know that the particular fruit is called an
apple. Instead, it knows the characteristics needed to identify it.
There are hybrid models that use a combination of supervised and unsupervised
machine learning.
For machine learning models, historical data is fodder. Just as humans rely on past
experiences to make better decisions, ML models look at their training dataset with past
observations to make predictions.
Predictions could include classifying images as in the case of image recognition, or
understanding the context of a sentence as in natural language processing (NLP).
Think of a data scientist as a teacher, the machine learning algorithm as the student, and
the training dataset as the collection of all textbooks.
The teacher's aspiration is that the student performs well in exams and also in the
real world. In the case of ML algorithms, testing is like exams. The textbooks (training
dataset) contain several examples of the type of questions that'll be asked in the exam.
Of course, they won't contain all the examples of questions that'll be asked in the exam, nor
will all the examples included in the textbooks be asked in the exam. The textbooks
can help prepare the student by teaching them what to expect and how to respond.
No textbook can ever be fully complete. As time passes, the kind of questions asked will
change, and so the information included in the textbooks needs to be changed. In the
case of ML algorithms, the training set should be periodically updated to include new
information.
In short, training data is a textbook that helps data scientists give ML algorithms an idea
of what to expect. Although the training dataset doesn't contain all possible examples,
it'll make algorithms capable of making predictions.
High-quality data translates to accurate machine learning models. Low-quality data can
significantly affect the accuracy of models, which can lead to severe financial losses. It's
almost like giving a student a textbook containing wrong information and expecting them to
excel in the examination. The following are the four primary traits of quality training data.
Relevant
The data needs to be relevant to the task at hand. For example, if you want to train
a computer vision algorithm for autonomous vehicles, you probably won't require images of
fruits and vegetables. Instead, you would need a training dataset containing photos of roads,
sidewalks, pedestrians, and vehicles.
Representative
The AI training data must have the data points or features that the application is made to
predict or classify. Of course, the dataset can never be absolute, but it must have at least the
attributes the AI application is meant to recognize. For example, if the model is meant to
recognize faces within images, it must be fed with diverse data containing people's faces
from various ethnicities. This will reduce the problem of AI bias, and the model won't be
prejudiced against a particular race, gender, or age group.
Uniform
All data should have the same attributes and must come from the same source. Suppose your
machine learning project aims to predict churn rate by looking at customer information. For
that, you'll have a customer information database that includes customer name, address,
number of orders, order frequency, and other relevant information. This is historical data and
can be used as training data. One part of the data can't have additional information, such as
age or gender. This will make training data incomplete and the model inaccurate. In short,
uniformity is a critical aspect of quality training data.
Comprehensive
Again, the training data can never be absolute. But it should be a large dataset that
represents the majority of the model's use cases. The training data must have enough
examples that'll allow the model to learn appropriately. It must contain real-world data
samples as it will help train the model to understand what to expect. If you're thinking of
training data as values placed in large numbers of rows and columns, sorry, you're wrong. It
could be any data type like text, images, audio, or videos.
Humans are highly social creatures, but there are some prejudices that we might have picked
as children and require constant conscious effort to get rid of. Although unfavourable, such
biases may affect our creations, and machine learning applications are no different. For ML
models, training data is the only book they read. Their performance or accuracy will depend
on how comprehensive, relevant, and representative the very book is. That being said, three
factors affect the quality of training data:
People: The people who train the model have a significant impact on its accuracy or
performance. If they're biased, it'll naturally affect how they tag data and, ultimately,
how the ML model functions.
Processes: The data labeling process must have tight quality control checks in place.
This will significantly increase the quality of training data.
Tools: Incompatible or outdated tools can make data quality suffer. Using robust data
labeling software can reduce the cost and time associated with the process.
There isn't a specific answer to how much training data is enough training data. It
depends on the algorithm you're training – its expected outcome, application,
complexity, and many other factors.
Suppose you want to train a text classifier that categorizes sentences based on the
occurrence of the terms "cat" and "dog" and their synonyms such as "kitty," "kitten,"
"pussycat," "puppy," or "doggy". This might not require a large dataset as there are only
a few terms to match and sort.
But, if this was an image classifier that categorized images as "cats" and "dogs," the
number of data points needed in the training dataset would shoot up significantly. In
short, many factors come into play to decide what training data is enough training data.
The amount of data required will change depending on the algorithm used.
For context, deep learning, a subset of machine learning, typically requires millions of data
points to train artificial neural networks (ANNs). In contrast, classical machine learning
algorithms often need only thousands of data points. But of course, this is a rough
generalization, as the amount of data needed varies depending on the application.
As a rule of thumb, the more relevant data a model is trained on, the more accurate it tends
to become. So it's generally better to have a large amount of training data.
Garbage in, garbage out
The phrase "garbage in, garbage out" is one of the oldest and most used phrases in data
science. Even with the rate of data generation growing exponentially, it still holds true.
The key is to feed high-quality, representative data to machine learning algorithms.
Doing so can significantly enhance the accuracy of models. Good quality training data is
also crucial for creating unbiased machine learning applications.
19. How should you split up a dataset into test and training sets?
When we are working on model development, we need to both train the model and test it.
Since it is often hard to obtain a large amount of data while the model is in the
development phase, the most obvious answer is to split the available dataset into two
separate sets: one for training and the other for testing.
A. Splitting the data set into a training set and test set:
The two conditions that need to be taken care of before proceeding with the splitting of the
dataset:
The test set needs to be large enough to give statistically meaningful results.
The characteristics of the training set and test set should be similar.
Once these two conditions are satisfied, the ultimate goal is to develop a model that
performs well on new data.
B. Validation of the trained model over the test data.
The model should not be trained on the test data. Surprisingly good results on evaluation
metrics are often a sign that you are inadvertently training on the test data.
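A minimal sketch of such a split, using only the Python standard library. The function name `train_test_split`, the 20% test fraction and the fixed seed are assumptions made for illustration; libraries such as scikit-learn ship their own, more complete helpers.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out a fraction as the test set."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # hypothetical sample identifiers
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

Shuffling before splitting helps the two sets share similar characteristics, and keeping them disjoint ensures the model is never trained on the data used to evaluate it.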
A test set in machine learning is a secondary (or tertiary) data set that is used to test a
machine learning program after it has been trained on an initial training data set. The
idea is that predictive models always have some sort of unknown capacity that needs to
be tested out, as opposed to analysed from a programming perspective.
A test set is also known as a test data set or test data.
In machine learning, a validation set is used to “tune the parameters” of a classifier. The
validation set is used to evaluate the program's performance as the parameters are
varied, to see how it might function in subsequent testing.
The validation set is also known as a validation data set, development set or dev set.
22. Compare Training data vs. test data vs. validation data.
Training data is used in model training, or in other words, it's the data used to fit the
model. On the contrary, test data is used to evaluate the performance or accuracy of the
model. It's a sample of data used to make an unbiased evaluation of the final model fit
on the training data.
A training dataset is an initial dataset that teaches the ML models to identify desired
patterns or perform a particular task. A testing dataset is used to evaluate how effective
the training was or how accurate the model is.
Once an ML algorithm is trained on a particular dataset and if you test it on the same
dataset, it's more likely to have high accuracy because the model knows what to expect.
If the training dataset contains all possible values the model might encounter in the
future, all well and good.
But that's never the case. A training dataset can never be comprehensive and can't teach
everything that a model might encounter in the real world. Therefore a test dataset,
containing unseen data points, is used to evaluate the model's accuracy.
Then there's validation data. This is a dataset used for frequent evaluation during the
training phase. Although the model sees this dataset occasionally, it doesn't learn from
it. The validation set is also referred to as the development set or dev set. It helps protect
models from overfitting and underfitting.
Although validation data is separate from training data, data scientists might reserve a
part of the training data for validation. But of course, this automatically means that the
validation data was kept away during the training.
Many use the terms "test data" and "validation data" interchangeably. The main
difference between the two is that validation data is used to validate the model during
the training, while the testing set is used to test the model after the training is
completed.
The validation dataset gives the model the first taste of unseen data. However, not all
data scientists perform an initial check using validation data. They might skip this part
and go directly to testing data.
The classifier model can be designed/trained and its performance evaluated in
K-fold cross-validation mode, training mode and test mode.
The main idea behind K-fold cross-validation is that each sample in our dataset has
the opportunity of being tested. It is a special case of cross-validation where we
iterate over the dataset k times. The dataset is split into k parts; in each round,
one part is used for validation, and the remaining k-1 parts are merged into a
training subset used to fit the model.
Computation time is limited because the process is repeated only 10 times when the value
of k is 10. The estimate also has reduced bias.
Every data point gets to be tested exactly once and is used in training k-1 times.
The variance of the resulting estimate is generally reduced as k increases.
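The fold rotation described above can be sketched as an index generator. The function name `k_fold_indices` is hypothetical, and the folds are taken in order without shuffling, which is an assumption made to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, train_idx)
```

Each index appears in exactly one validation fold and in the training subset of the other k-1 rounds, matching the property stated above.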
• Machine learning algorithms have hyperparameters that allow you to tailor the
behavior of the algorithm to your specific dataset.
• Hyperparameters are different from parameters, which are the internal coefficients or
weights for a model found by the learning algorithm. Unlike parameters,
hyperparameters are specified by the practitioner when configuring the model.
• Typically, it is challenging to know what values to use for the hyperparameters of a given
algorithm on a given dataset, therefore it is common to use random or grid search
strategies for different hyperparameter values.
• The more hyperparameters of an algorithm that you need to tune, the slower the
tuning process. Therefore, it is desirable to select a minimum subset of model
hyperparameters to search or tune.
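The two search strategies mentioned above can be sketched as follows. Everything here is illustrative: `toy_score` stands in for "train a model and measure its validation score", and the hyperparameter names `learning_rate` and `depth` and their candidate values are hypothetical.

```python
import itertools
import random


def toy_score(learning_rate, depth):
    """Hypothetical validation score; a real project would train and evaluate a model here."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2


grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]}


def grid_search(grid, score):
    """Evaluate every combination of candidate values and return the best one."""
    names = list(grid)
    candidates = (dict(zip(names, values)) for values in itertools.product(*grid.values()))
    return max(candidates, key=lambda params: score(**params))


def random_search(grid, score, n_trials, seed=0):
    """Evaluate a fixed number of randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=lambda params: score(**params))


best_grid = grid_search(grid, toy_score)
best_random = random_search(grid, toy_score, n_trials=5)
print(best_grid, best_random)
```

Grid search evaluates all 16 combinations here, while random search evaluates only 5, so it is cheaper but may miss the best setting. The cost of grid search grows multiplicatively with each added hyperparameter, which is why tuning a small subset is desirable.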
Max_Depth: The maximum depth of the tree. If this is not specified in the Decision Tree, the
nodes will be expanded until all leaf nodes are pure or until all leaf nodes contain less than
min_samples_split.
Default = None
Input options → integer
Min_Samples_Split: The minimum samples required to split an internal node. If the amount
of sample in an internal node is less than the min_samples_split, then that node will become
a leaf node.
Default = 2
Input options → integer or float (if float, then min_samples_split is fraction)
Min_Samples_Leaf: The minimum samples required to be at a leaf node. Therefore, a split
can only happen if it leaves at least the min_samples_leaf in both of the resulting nodes.
Default = 1
Input options → integer or float (if float, then min_samples_leaf is fraction)
Max_Features: The number of features to consider when looking for the best split. For
example, if there are 35 features in a dataframe and max_features is 9, only 9 of the 35
features will be considered at each split.
Default = None
Input options → integer, float (if float, then max_features is fraction) or {“auto”, “sqrt”,
“log2”}
“auto”: max_features=sqrt(n_features)
“sqrt”: max_features = sqrt(n_features)
“log2”: max_features=log2(n_features)
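To make the input options concrete, the sketch below converts a max_features setting into a number of features for the 35-feature example. The helper `resolve_max_features` is hypothetical, and truncating fractional results to an integer (with a minimum of 1) is an assumption; a given library may round differently.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of features per split."""
    if max_features is None:
        return n_features  # default: consider every feature
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A float is read as a fraction of the available features.
        return max(1, int(max_features * n_features))
    return int(max_features)


for setting in (None, 9, 0.25, "sqrt", "log2"):
    print(setting, resolve_max_features(setting, n_features=35))
```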
The choice of kernel controls the manner in which the input variables are
projected. There are many to choose from, but linear, polynomial, and RBF are the
most common, perhaps just linear and RBF in practice.
kernels in ['linear', 'poly', 'rbf', 'sigmoid']
If the polynomial kernel works out, then it is a good idea to dive into the degree
hyperparameter.
Another critical parameter is the penalty (C) that can take on a range of values and has a
dramatic effect on the shape of the resulting regions for each class. A log scale might be
a good starting point.
C in [100, 10, 1.0, 0.1, 0.001]
Number of neurons: The number of units in each layer. Each neuron has weights, which
amplify its input signals, and a bias, which is an additive term.
Activation function: Defines how a neuron or group of neurons activate ("spiking")
based on input connections and bias term(s).
Learning rate: Step length for gradient descent update
Batch size: Number of training examples in each gradient descent (gd) update.
Epochs: The number of times all training examples have been passed through the
network during training.
Loss function: Loss function specifies how to calculate the error between prediction and
label for a given training example. The error is backpropagated during training in order
to update learnable parameters.
Number of layers: Typically counts the layers between the input and output layers, which
are called hidden layers.
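How learning rate, batch size, epochs and the loss function interact can be seen in a deliberately tiny sketch: a single linear neuron with one weight, trained by mini-batch gradient descent on a mean-squared-error loss. The data, the function name `train_linear_neuron` and the chosen values are hypothetical; a real network has many weights and uses backpropagation through several layers.

```python
def train_linear_neuron(xs, ys, learning_rate, batch_size, epochs):
    """Fit y = w * x with mini-batch gradient descent on a mean-squared-error loss."""
    w = 0.0
    updates = 0
    for _ in range(epochs):  # one epoch = one full pass over the training data
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) with respect to w for this batch.
            grad = sum(2 * x * (w * x - y) for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * grad  # step length is set by the learning rate
            updates += 1
    return w, updates


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # hypothetical data generated from y = 2x

w, updates = train_linear_neuron(xs, ys, learning_rate=0.01, batch_size=2, epochs=200)
print(w, updates)
```

With 4 examples and a batch size of 2 there are 2 updates per epoch, so 200 epochs give 400 updates. A much larger learning rate would make the updates overshoot on this data, which is one reason the learning rate is usually tuned first.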
CLASSIFICATION MODEL:
Confusion matrix: A Confusion matrix is an N x N matrix used for evaluating the
performance of a classification model, where N is the number of target classes. The
matrix compares the actual target values with those predicted by the machine
learning model
What can we learn from this matrix?
• There are two possible predicted classes: "yes" and "no". If we were predicting the
presence of a disease, for example, "yes" would mean they have the disease, and "no"
would mean they don't have the disease.
• The classifier made a total of 165 predictions (e.g., 165 patients were being tested for
the presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.
True positives (TP): we predicted yes (they have the disease),
and they do have the disease.
True negatives (TN): we predicted no, and they don't have the disease.
False positives (FP): we predicted yes, but they don't actually have the disease. (Also
known as a "type I error.")
False negatives (FN): we predicted no, but they actually do have the disease. (Also
known as a "type II error.")
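The four cell counts can be tallied directly from paired lists of actual and predicted labels. The individual cell values used below (TP = 100, FP = 10, FN = 5, TN = 50) are an assumption: they are one split that is consistent with the totals quoted above (165 cases, 110 predicted "yes", 105 actual "yes"), not values taken from the text. The function name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, FN and TN for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Assumed cell counts, chosen only to match the totals quoted in the text.
actual = ["yes"] * 100 + ["no"] * 10 + ["yes"] * 5 + ["no"] * 50
predicted = ["yes"] * 100 + ["yes"] * 10 + ["no"] * 5 + ["no"] * 50

counts = confusion_counts(actual, predicted)
total = sum(counts.values())
accuracy = (counts["TP"] + counts["TN"]) / total  # share of correct predictions
print(counts, total, round(accuracy, 3))
```

The row and column sums recover the totals from the bullets: TP + FP is the number of "yes" predictions and TP + FN is the number of patients who actually have the disease.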
REGRESSION MODEL:
Mean Absolute Error (MAE)
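The mean absolute error (MAE) is defined as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $|y_i - \hat{y}_i|$ is the
absolute value of the difference between the actual and predicted value.
$N$ is the number of sample points.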
Let's dig into this a bit deeper to understand what this calculation represents.
Take a look at the following plot, which shows the number of failures for a piece of
machinery against the age of the machine:
In order to predict the number of failures from the age, we would want to fit a regression
line such as this:
In order to understand how well this line represents the actual data, we need to measure
how good a fit it is. We can do this by measuring the distance from the actual data points to
the line:
You may recall that these distances are called residuals or errors. The mean size of these
errors is the MAE. We can calculate it as follows:
The mean of the absolute errors (MAE) is 8.5. Why do we take the absolute value? To
remove the sign on the error value! If we don't, the positive and negative errors will tend to
cancel each other out, giving a misleadingly small value for our evaluation metric. If
mathematical symbols are not your strong point, you may not immediately see how this
calculation relates to the formula at the start of this chapter:
Mean Absolute Error (MAE) tells us the average error in units of y, the predicted feature. A
value of 0 indicates a perfect fit, i.e. all our predictions are spot on.
The MAE has a big advantage in that the units of the MAE are the same as the units of y,
the feature we want to predict. In the example above, we have an MAE of 8.5, so it means
that on average our predictions of the number of machine failures are incorrect by 8.5
machine failures. This makes MAE very intuitive and the results are easily conveyed to a non-
machine learning expert!
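As a small worked sketch of the MAE calculation, the snippet below uses hypothetical actual and predicted failure counts (they are not the values from the example above, whose table is not reproduced here). It also shows why the absolute value is needed: the signed errors partly cancel.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical failure counts, invented for illustration only.
actual = [15, 30, 40, 55, 75]
predicted = [20, 25, 45, 60, 70]

mae = mean_absolute_error(actual, predicted)
# Without the absolute value, positive and negative errors cancel out.
mean_signed_error = sum(a - p for a, p in zip(actual, predicted)) / len(actual)
print(mae, mean_signed_error)
```

Every prediction here is off by 5 failures, so the MAE is 5.0, while the mean signed error is only -1.0 and would understate how wrong the predictions are.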
Root Mean Square Error (RMSE)
Another evaluation metric for regression is the root mean square error (RMSE). Its
calculation is very similar to MAE, but instead of taking the absolute value to get rid of the
sign on the individual errors, we square the error (because the square of a negative number
is positive). The formula for RMSE is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$
As with MAE, we can think of RMSE as being measured in the y units. So the above error
can be read as an error of 9.9 machine failures on average per observation.
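The RMSE calculation can be sketched in the same way. The numbers below are hypothetical and chosen so the arithmetic is easy to follow; they are not the values behind the 9.9 figure quoted above.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    mean_square = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_square)


# Hypothetical values, invented for illustration only.
actual = [10, 20, 30, 40]
predicted = [13, 16, 30, 40]

rmse = root_mean_squared_error(actual, predicted)
print(rmse)
```

The errors are -3, 4, 0 and 0, so the squared errors sum to 25, their mean is 6.25 and the RMSE is 2.5, expressed in the same units as y.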
MAE vs. RMSE
Compared to MAE, RMSE gives a higher total error and the gap increases as the errors
become larger. It penalizes a few large errors more than a lot of small errors. If you want
your model to avoid large errors, use RMSE over MAE.
Root Mean Square Error (RMSE) indicates the average error in units of y, the predicted
feature, but penalizes larger errors more severely than MAE. A value of 0 indicates a perfect
fit. You should also be aware that RMSE is never smaller than MAE on the same data, and
that the gap between the two measures can widen as the sample size increases.
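The difference in how the two metrics treat large errors can be seen by comparing two hypothetical sets of errors with the same MAE. The error values are invented for illustration.

```python
import math


def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Square root of the mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Two hypothetical error patterns with the same total absolute error.
even_errors = [2, -2, 2, -2]     # many small errors
one_large_error = [0, 0, 0, 8]   # a single large error

print(mae(even_errors), rmse(even_errors))
print(mae(one_large_error), rmse(one_large_error))
```

Both patterns have an MAE of 2.0, but the RMSE rises from 2.0 to 4.0 when the error is concentrated in one observation. This is the sense in which RMSE penalizes a few large errors more than many small ones.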
R-Squared
As stated above, an advantage of both MAE and RMSE is that they can be thought of as
errors in the units of y, the predicted feature. This is helpful when relaying the results to
non-data scientists.
We can say things like "our model can predict the reliability of our machinery to within 8.5
machine failures on average" or "our model can predict the selling price of a house to within
£15k on average".
But take heed! This advantage can also be considered a disadvantage! It says nothing about
whether an error of 8.5 machine failures or an error of £15k on a house price is good or bad.
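We can't compare how good different models are for different scenarios. This is where
R-squared or R2 comes in. Here is the formula for R2:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2}$$
where $\bar{y}$ is the mean of the actual values.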
R2 computes how much better the regression line fits the data than the mean line.
Another way to look at this formula is to compare the variance around the mean line to the
variation around the regression line:
$$R^2 = \frac{\mathrm{var}(\text{mean}) - \mathrm{var}(\text{line})}{\mathrm{var}(\text{mean})}$$
Take our example above, predicting the number of machine failures. We can examine the
errors for our regression line as we did before. We can also compute a mean line (by taking
the mean y value) and examine the errors against this mean line. That is to say, we can see
the errors we would get if our model just predicted the mean number of failures (50.8) for
every age input. Here are the regression and mean lines, and their respective errors:
You can see that the regression line fits the data better than the mean line, which is what we
expected (the mean line is a pretty simplistic model, after all). But can you say how much
better it is? That's exactly what R2 does! Here is the calculation.
Notice something? Most of this is the same as the calculation of RMSE. The additional parts
to the calculation are the column on the far right (in blue) and the final calculation row,
computing R2. So we have an R-squared of 0.85. Without even worrying about the units of y
we can say this is a decent model. Why? Because the model explains 85% of the variation in
the data. That's exactly what an R-squared of 0.85 tells us!
R-squared (R2) tells us the degree to which the model explains the variance in the data. In
other words, how much better it is than just predicting the mean. Here's another example.
What if our data points and regression line looked like this?
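The variance around the regression line is 0. In other words, var(line) is 0. There are no errors.
Now, remember that the formula for R-squared is:
$$R^2 = \frac{\mathrm{var}(\text{mean}) - \mathrm{var}(\text{line})}{\mathrm{var}(\text{mean})} = \frac{\mathrm{var}(\text{mean}) - 0}{\mathrm{var}(\text{mean})} = 1$$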
So, if we have a perfect regression line, with no errors, we get an R-squared of 1. Let's look at
another example. What if our data points and regression line looked like this, with the
regression line equal to the mean line?
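In this case, var(line) and var(mean) are the same. So the above calculation will yield an
R-squared of 0:
$$R^2 = \frac{\mathrm{var}(\text{mean}) - \mathrm{var}(\text{mean})}{\mathrm{var}(\text{mean})} = 0$$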
So, if our regression line is only as good as the mean line, we get an R-squared of 0. What if
our regression line was really bad, worse than the mean line?
It's unlikely to get this bad! But if it does, var(mean)-var(line) will be negative, so R-squared
will be negative. An R-squared of 1 indicates a perfect fit. An R-squared of 0 indicates a
model no better or worse than the mean. An R-squared of less than 0 indicates a model
worse than just predicting the mean.
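The three cases described above (perfect fit, no better than the mean, worse than the mean) can be checked with a short sketch. The data values and the function name `r_squared` are hypothetical; the calculation compares the squared errors around the model's predictions with the squared errors around the mean line.

```python
def r_squared(actual, predicted):
    """1 minus the ratio of squared error around the model to squared error around the mean."""
    mean_y = sum(actual) / len(actual)
    ss_line = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_mean = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_line / ss_mean


# Hypothetical actual values, invented for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(actual, actual)                       # predictions match exactly
mean_only = r_squared(actual, [3.0] * 5)                  # always predict the mean
worse = r_squared(actual, [5.0, 4.0, 3.0, 2.0, 1.0])      # predictions run the wrong way
print(perfect, mean_only, worse)
```

The perfect predictions give 1.0, predicting the mean everywhere gives 0.0, and the reversed predictions give a negative value, since their squared error (40) is four times the squared error around the mean (10).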
30. Identify methodology to attempt following problems and enlist general steps
involved in it.