Machine Learning by Tom Mitchell - Definitions
School of Computing
1. A neuron with 4 inputs has the weight vector w = [1, 2, 3, 4]T and a bias of 0 (zero). The activation
function is linear, where the constant of proportionality equals 2; that is, the activation function is
given by f(net) = 2 × net. If the input vector is x = [4, 8, 5, 6]T, calculate the output of the neuron.
net = 1×4 + 2×8 + 3×5 + 4×6 = 59, so Output = f(net) = 2 × 59 = 118
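The arithmetic can be checked with a short Python sketch. The function name neuron_output and its slope parameter are illustrative choices, not part of the question.

```python
def neuron_output(weights, inputs, bias=0.0, slope=2.0):
    """Linear neuron: f(net) = slope * net, with net = w . x + bias."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return slope * net


# Weight vector and input vector from the question.
output = neuron_output([1, 2, 3, 4], [4, 8, 5, 6])
print(output)  # 118.0
```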
2. What is the biggest difference between Widrow & Hoff’s Delta Rule and the Perceptron Learning
Rule for learning in a single-layer feed forward network?
The Delta Rule is defined for linear activation functions, but the Perceptron Learning Rule is defined
for step activation functions.
Merits:
1. The mathematical formula present here can be applied to any network and does not require any
special mention of the features of the function to be learnt.
2. The computing time is reduced if the weights chosen are small at the beginning.
Demerits:
1. The number of learning steps may be high, and the learning phase involves intensive calculations.
2. The training may cause temporal instability to the system.
In order to create a less complex (parsimonious) model when you have a large number of features in your
dataset, regularization techniques that address over-fitting and feature selection include L1
Regularization and L2 Regularization.
8. List the difference between L1 Regularization & L2 Regularization.
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that
uses L2 is called Ridge Regression.
Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function:
Cost function = \sum_{i=1}^{n} (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
Here the final term, \lambda \sum_{j} \beta_j^2, is the L2 regularization element.
If lambda is zero, we get back OLS. However, if lambda is very large, the penalty carries too much weight
and leads to under-fitting, so how lambda is chosen matters. Chosen well, this technique works very well
to avoid over-fitting.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of
the coefficients as a penalty term to the loss function:
Cost function = \sum_{i=1}^{n} (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
Again, if lambda is zero we get back OLS, whereas a very large value drives the coefficients to zero, so
the model under-fits.
The key difference between these techniques is that Lasso shrinks the less important features’
coefficients to zero, removing some features altogether. So it works well for feature selection when we
have a huge number of features.
Traditional methods like cross-validation and stepwise regression handle overfitting and perform feature
selection well with a small set of features, but these regularization techniques are a great alternative
when we are dealing with a large set of features.
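As a minimal sketch, both costs can be written as the residual sum of squares plus lambda times a penalty on the coefficients. It assumes a linear model with no intercept; the helper names rss, ridge_cost and lasso_cost and the toy data are illustrative.

```python
def rss(X, y, coef):
    """Residual sum of squares of a linear model without an intercept."""
    return sum(
        (yi - sum(c * xij for c, xij in zip(coef, xi))) ** 2
        for xi, yi in zip(X, y)
    )


def ridge_cost(X, y, coef, lam):
    """RSS plus lambda times the sum of squared coefficients (L2 penalty)."""
    return rss(X, y, coef) + lam * sum(c ** 2 for c in coef)


def lasso_cost(X, y, coef, lam):
    """RSS plus lambda times the sum of absolute coefficients (L1 penalty)."""
    return rss(X, y, coef) + lam * sum(abs(c) for c in coef)


# Toy data, chosen only to make the numbers easy to follow.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 6.0]
coef = [1.0, -2.0]

print(ridge_cost(X, y, coef, 0.0))  # 185.0, same as the OLS cost
print(ridge_cost(X, y, coef, 0.5))  # 187.5
print(lasso_cost(X, y, coef, 0.5))  # 186.5
```

With lambda set to zero both functions reduce to the plain least squares cost, which matches the remark about getting back OLS.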
One method for improving network generalization ability is to use a network that is just large enough
to provide an adequate fit to the target function. But sometimes it is hard to know beforehand how large a
network should be for a specific application. One commonly used technique for improving network
generalization is early stopping. This technique monitors the error on a subset of the data (validation
data) that does not actually take part in the training. The training stops when the error on the validation
data has increased for a set number of iterations.
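The stopping criterion can be sketched as follows. This sketch assumes one validation error per epoch, and the parameter name patience (the number of epochs without improvement that is tolerated) is an illustrative choice.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Scanning stops once the validation error has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error kept rising: stop training
    return best_epoch


# Hypothetical validation errors recorded after each epoch.
val_errors = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.5]
print(early_stopping_epoch(val_errors))  # 2
```

In this example training halts after epoch 5, so the later improvement at epoch 6 is never seen. That is the trade-off a small patience value makes.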
10. What is generalization?
The ability of a pattern recognition system to approximate the desired output values for pattern vectors
which are not in the training set.
Regularization is necessary because least squares regression methods, where the residual sum of
squares is minimized, can be unstable. This is especially true if there is multicollinearity in the model.
However, the mere practice of model fitting comes with a major pitfall: any set of data can be fitted
to a model, even if that model is ridiculously complex.
A standard least squares model tends to have some variance in it, i.e. the model won’t generalize well to
a data set different from its training data. Regularization significantly reduces the variance of the model
without a substantial increase in its bias. The tuning parameter λ, used in the regularization techniques,
controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus
reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence
avoiding overfitting) without losing any important properties in the data. But beyond a certain value, the model
starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the
value of λ should be carefully selected.
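The shrinking effect of λ can be seen in the simplest case, a sketch that assumes a single feature, no intercept and an L2 penalty. Minimizing sum (y - w x)^2 + λ w^2 then has the closed-form solution w = sum(x y) / (sum(x^2) + λ). The function name and data are illustrative.

```python
def ridge_coefficient(xs, ys, lam):
    """Closed-form ridge solution for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Data generated by y = 2x, so the unregularized coefficient is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 14.0, 1000.0):
    print(lam, ridge_coefficient(xs, ys, lam))
```

The coefficient moves from 2 at λ = 0 toward 0 as λ grows. A moderate λ trims variance, while a very large one discards the real relationship and underfits.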
16. What are the characteristics of problems that can be solved by neural network learning?
Instances are represented by many attribute-value pairs. The target function output may be
discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes. The training
examples may contain errors. Long training times are acceptable. Fast evaluation of the learned
target function may be required. The ability of humans to understand the learned target function
is not important.
17. State the Equation of Perceptron Training rule.
w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i
w_i = weight, x_i = input,
η (eta) = learning rate, t = target output, o = generated output
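A minimal sketch of the rule in use, where each weight changes by eta * (t - o) * x_i whenever an example is misclassified. It assumes inputs and targets encoded as +1/-1, a constant first input of 1 for the bias weight, and a toy Boolean AND data set. The function names are illustrative.

```python
def perceptron_output(weights, x):
    """Thresholded output: +1 if w . x > 0, otherwise -1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until every example is correct."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = perceptron_output(weights, x)
            if o != t:
                errors += 1
                weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
        if errors == 0:
            break  # all training examples are classified correctly
    return weights


# Boolean AND; the first component of each input is the constant 1.
examples = [
    ((1, -1, -1), -1),
    ((1, -1, 1), -1),
    ((1, 1, -1), -1),
    ((1, 1, 1), 1),
]
weights = train_perceptron(examples)
print([perceptron_output(weights, x) for x, _ in examples])  # [-1, -1, -1, 1]
```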
21. Distinguish between standard gradient descent and stochastic gradient descent.
In standard gradient descent, the error is summed over all examples before updating weights,
whereas in stochastic gradient descent weights are updated upon examining each training example.
Summing over multiple examples in standard gradient descent requires more computation per
weight update step. On the other hand, because it uses the true gradient, standard gradient descent is
often used with a larger step size per weight update than stochastic gradient descent.
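In cases where there are multiple local minima, stochastic gradient descent can sometimes avoid
falling into these local minima, because each update follows the gradient of a single example's error
rather than the true gradient. It is not guaranteed to reach the global minimum.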
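The difference in when the weights change can be sketched for a single linear unit with squared error. The data set (targets generated by t = 1 + 2 x1), the learning rate and the function names are illustrative assumptions.

```python
def linear_output(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def batch_epoch(weights, examples, eta):
    """Standard gradient descent: sum over all examples, then update once."""
    delta = [0.0] * len(weights)
    for x, t in examples:
        err = t - linear_output(weights, x)
        delta = [d + eta * err * xi for d, xi in zip(delta, x)]
    return [w + d for w, d in zip(weights, delta)]


def stochastic_epoch(weights, examples, eta):
    """Stochastic gradient descent: update after every single example."""
    for x, t in examples:
        err = t - linear_output(weights, x)
        weights = [w + eta * err * xi for w, xi in zip(weights, x)]
    return weights


# Inputs are (1, x1); the leading 1 carries the bias weight.
examples = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0)]
w_batch = [0.0, 0.0]
w_sgd = [0.0, 0.0]
for _ in range(500):
    w_batch = batch_epoch(w_batch, examples, eta=0.05)
    w_sgd = stochastic_epoch(w_sgd, examples, eta=0.05)

print(w_batch, w_sgd)  # both approach [1, 2]
```

On this convex toy problem both variants reach the same weights. The contrast is in the bookkeeping: one update per pass through the data versus one update per example.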
• Boolean functions.
• Continuous functions.
• Arbitrary functions.
28. Write down the formula of the general two-sided confidence interval for estimating the difference
between errors of two hypotheses.
29. What is the generic four-step procedure used to derive a confidence interval?
30. If a random variable Y obeys a Normal distribution with
then write down the confidence interval .
Dropout is a regularization technique for reducing overfitting in neural networks. At each training step we
randomly drop out (set to zero) a set of nodes, so we create a different model for each training case, and all of
these models share weights. It is a form of model averaging.
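A minimal sketch of the masking step applied to one layer's activations during training. Rescaling the surviving activations by 1 / (1 - drop_prob) is the common "inverted dropout" convention. It is an assumption added here, not something stated in the text above, and the function name is illustrative.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (training time only).

    Survivors are scaled by 1 / (1 - drop_prob) so the expected value of
    each activation is unchanged (assumed "inverted dropout" convention).
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, 0.5, rng))
```

Each call draws a fresh mask, which is what makes every training case see a different thinned network over the same shared weights.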
Hyperparameters, as opposed to model parameters, can’t be learned from the data; they are set before the training
phase.
Learning rate
It determines how fast we update the weights during optimization. If the learning rate is too small,
gradient descent can be slow to find the minimum; if it is too large, gradient descent may not converge (it
can overshoot the minimum). It is often considered the most important hyperparameter.
Number of epochs
Epoch is defined as one forward pass and one backward pass of all training data.
Batch size
The number of training examples in one forward/backward pass.
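The effect of the learning rate can be illustrated on a toy cost f(w) = w^2, whose gradient is 2w. The cost function, starting point and step count are assumptions picked only for illustration.

```python
def minimize_quadratic(learning_rate, steps=50, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w  # gradient of w**2 is 2w
    return w


for learning_rate in (0.001, 0.1, 1.1):
    print(learning_rate, minimize_quadratic(learning_rate))
```

With 0.001 the weight has barely moved from 1 after 50 steps, with 0.1 it is close to the minimum at 0, and with 1.1 every step overshoots and the magnitude grows.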
The goal of an activation function is to introduce non-linearity into the neural network so that it can learn
more complex functions. Without it, the neural network would only be able to learn functions that are linear
combinations of its input data.
The cost function tells us how well the neural network is performing. Our goal during training is to find
parameters that minimize the cost function. An example of a cost function is the Mean Squared Error
function.
Gradient descent is an optimization algorithm used in machine learning to learn values of parameters that
minimize the cost function. It is an iterative algorithm: in every iteration, we compute the gradient of the cost
function with respect to each parameter and update each parameter θ_j via
θ_j ← θ_j − α ∂J(θ)/∂θ_j
where J is the cost function and α is the learning rate.
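A minimal sketch of the iteration, using mean squared error as the cost and a one-parameter model y = w x. The model, the data, the step size alpha and the function names are illustrative assumptions.

```python
def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, alpha=0.05, iterations=200, w=0.0):
    """Repeatedly step against the gradient of the cost."""
    for _ in range(iterations):
        w -= alpha * mse_gradient(w, xs, ys)
    return w


# Data generated by y = 3x, so the cost is minimized at w = 3.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
w = gradient_descent(xs, ys)
print(w, mse(w, xs, ys))
```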
36. What is data augmentation? List some examples.
Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the
target is not changed, or it is changed in a known way.
Computer vision is one of the fields where data augmentation is very useful. There are many modifications that
we can make to images:
• Resize
• Horizontal or vertical flip
• Rotate
• Add noise
• Deform
• Modify colors
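A few of these modifications can be sketched in plain Python. The sketch assumes a grayscale image stored as a list of pixel rows, and the function names are illustrative. In practice an image library would normally be used.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_noise(image, scale, rng):
    """Add uniform noise in [-scale, scale] to every pixel."""
    return [[pixel + rng.uniform(-scale, scale) for pixel in row] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]

print(horizontal_flip(image))      # [[3, 2, 1], [6, 5, 4]]
print(vertical_flip(image))        # [[4, 5, 6], [1, 2, 3]]
print(rotate_90_clockwise(image))  # [[4, 1], [5, 2], [6, 3]]
```

For a classification target the label usually stays the same after these changes. For targets such as bounding boxes, the same transform has to be applied to the target, which is the "changed in a known way" case.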
37. Why do we need a validation set and test set? What is the difference between them?
When training a model, we divide the available data into three separate sets:
• The training dataset is used for fitting the model’s parameters. However, the accuracy that we
achieve on the training set is not reliable for predicting if the model will be accurate on new samples.
• The validation dataset is used to measure how well the model does on examples that weren’t part of
the training dataset. The metrics computed on the validation data can be used to tune the
hyperparameters of the model. However, every time we evaluate the validation data and we make
decisions based on those scores, we are leaking information from the validation data into our model.
The more evaluations, the more information is leaked. So we can end up overfitting to the validation
data, and once again the validation score won’t be reliable for predicting the behaviour of the model
in the real world.
• The test dataset is used to measure how well the model does on previously unseen examples. It should
only be used once we have tuned the hyperparameters using the validation set.
So if we omit the test set and only use a validation set, the validation score won’t be a good estimate of the
generalization of the model.
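A three-way split can be sketched as follows. The 60/20/20 proportions, the fixed seed and the function name are illustrative assumptions, not recommendations from the text.

```python
import random


def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and cut them into three disjoint sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```

The test portion is set aside first and touched only once, after all tuning against the validation portion is finished.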
Bagging. Dropout can be seen as an extreme form of bagging in which each model is trained on
a single case and each parameter of the model is very strongly regularized by sharing it with the
corresponding parameter in all the other models.
40. In a neural network, which of the techniques are used to deal with overfitting?
41. Which of the following statements is the best description of early stopping?
A. Train the network until a local minimum in the error function is reached
B. Simulate the network on a test dataset after every epoch of training. Stop training when the
generalization error starts to increase
C. Add a momentum term to the weight update in the Generalized Delta Rule, so that training
converges more quickly
Solution: (B)
While a parameter norm penalty is one way to regularize parameters to be close to one
another, the more popular way is to use constraints: to force sets of parameters to be
equal. This method of regularization is often referred to as parameter sharing, where we
interpret the various models or model components as sharing a unique set of parameters.
A significant advantage of parameter sharing over regularizing the parameters to be close
(via a norm penalty) is that only a subset of the parameters (the unique set) need to be
stored in memory.
43. What is Bagging?
45. Suppose you test a hypothesis h and find that it commits r = 300 errors on a sample S of n =
1000 randomly drawn test examples. What is the standard deviation in error_S(h)?
error_S(h) = r / n
= 300 / 1000
= 0.3
The variance in this estimate arises completely from the variance in r.
Because r is Binomially distributed,
variance(r) = np(1 - p)
Since p is unknown, substitute the estimate r / n = 0.3:
variance(r) = 1000 (0.3)(1 - 0.3)
= 210
standard deviation(r)
= square root(variance(r))
= square root(210)
= 14.49
standard deviation(error_S(h))
= standard deviation(r) / n
= 14.49 / 1000
= 0.01449
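The same steps can be reproduced in a few lines of Python. The function name error_rate_std is an illustrative choice.

```python
import math


def error_rate_std(r, n):
    """Standard deviation of the sample error r / n when r is Binomial."""
    p_hat = r / n                         # estimate of the unknown p
    variance_r = n * p_hat * (1 - p_hat)  # Binomial variance of r
    return math.sqrt(variance_r) / n      # dividing r by n divides its std by n


print(error_rate_std(300, 1000))  # about 0.01449
```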
46. Suppose hypothesis h commits r = 10 errors over a sample of n = 65 independently drawn
examples. What is the 90% confidence interval (two-sided) for the true error rate?
error_S(h) = 10 / 65 ≈ 0.15
90% interval = 0.15 ± 1.64 × square root [ 0.15 × ( 1 - 0.15 ) / 65 ]
= 0.15 ± 0.073
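The interval can be checked numerically. This sketch uses the unrounded error rate 10 / 65, so the bounds differ slightly from those obtained with the rounded value 0.15. The function name is an illustrative choice, and z = 1.64 is the two-sided 90% constant used above.

```python
import math


def confidence_interval(r, n, z):
    """Two-sided interval p +/- z * sqrt(p (1 - p) / n) with p = r / n."""
    p = r / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = confidence_interval(10, 65, 1.64)
print(low, high)  # roughly 0.080 and 0.227
```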
47. What is the minimum number of examples ( n ) you must collect to assure that the width of
the two-sided 95% confidence interval will be smaller than 0.1?
n = 370
48. What are the values of weights w0, w1, and w2 for the perceptron whose decision surface is
illustrated in the figure? Assume the surface crosses the x1 axis at -1 and the x2 axis at 2.
49. a) Design a two-input perceptron that implements the Boolean function A∧¬B. (b) Design the two-layer
network of perceptrons that implements A XOR B.
The requested perceptron has 3 inputs: A, B, and the constant 1. The values of A and B are 1
(true) or -1 (false). The following table describes the output O of the perceptron:
A    B    O = A∧¬B
-1   -1   -1
-1    1   -1
 1   -1    1
 1    1   -1
One of the correct decision surfaces (any line that separates the positive point from the negative
points would be fine) is shown in the following picture
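One possible choice of weights for both parts can be checked in code. It assumes the +1/-1 encoding above and a unit that outputs 1 when w0 + w1 A + w2 B > 0 and -1 otherwise. The weight values are just one valid solution among many, and the names are illustrative.

```python
def perceptron(weights, inputs):
    """Output 1 if w0 + w1*a + w2*b > 0, otherwise -1."""
    w0, w1, w2 = weights
    a, b = inputs
    return 1 if w0 + w1 * a + w2 * b > 0 else -1


# (w0, w1, w2) for each unit; one valid choice, not the only one.
A_AND_NOT_B = (-1.0, 1.0, -1.0)
NOT_A_AND_B = (-1.0, -1.0, 1.0)
OR = (1.0, 1.0, 1.0)


def a_and_not_b(a, b):
    return perceptron(A_AND_NOT_B, (a, b))


def xor(a, b):
    """Two-layer network: A XOR B = (A AND NOT B) OR (NOT A AND B)."""
    h1 = perceptron(A_AND_NOT_B, (a, b))
    h2 = perceptron(NOT_A_AND_B, (a, b))
    return perceptron(OR, (h1, h2))


for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, a_and_not_b(a, b), xor(a, b))
```

XOR needs the hidden layer because its positive and negative points cannot be separated by a single line.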
50. Derive a gradient descent training rule for a single unit with output o, where
.
The gradient descent training rule specifies how the weights are to be changed at each step of
the learning procedure so that the prediction error of the unit decreases the most.
51. In the Back-Propagation learning algorithm, what is the object of the learning? Does the Back-Propagation
learning algorithm guarantee to find the global optimum solution?
The object is to learn the weights of the interconnections between the inputs and the hidden units and between the
hidden units and the output units. The algorithm attempts to minimize the squared error between the network output
values and the target values of these outputs. The learning algorithm is not guaranteed to find the global optimum
solution; it is only guaranteed to converge toward a local minimum of the error function.