
SASTRA

School of Computing

Part A Collections: Machine Learning

1. A neuron with 4 inputs has the weight vector w = [1, 2, 3, 4]T and a bias b = 0 (zero). The activation
function is linear, where the constant of proportionality equals 2; that is, the activation function is
given by f(net) = 2 × net. If the input vector is x = [4, 8, 5, 6]T then calculate the output of the neuron.

Output: net = (1)(4) + (2)(8) + (3)(5) + (4)(6) = 59, so f(net) = 2 × 59 = 118.
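As a quick check, here is a minimal sketch of the same computation (assuming NumPy):

```python
# Verifies the arithmetic in Q1.
import numpy as np

w = np.array([1, 2, 3, 4])   # weight vector
x = np.array([4, 8, 5, 6])   # input vector
bias = 0

net = w @ x + bias           # 1*4 + 2*8 + 3*5 + 4*6 = 59
output = 2 * net             # linear activation f(net) = 2 * net
print(output)                # 118
```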

2. What is the biggest difference between Widrow & Hoff’s Delta Rule and the Perceptron Learning
Rule for learning in a single-layer feed forward network?

The Delta Rule is defined for linear activation functions, but the Perceptron Learning Rule is defined
for step activation functions.

3. What are merits and demerits of Back Propagation Algorithm?

Merits:
1. The mathematical formula used can be applied to any network and does not require any
special knowledge of the features of the function to be learnt.
2. The computing time is reduced if the weights chosen at the beginning are small.
Demerits:
1. The number of learning steps may be high, and the learning phase involves intensive calculations.
2. The training may cause temporal instability to the system.

4. What are the applications of back propagation algorithm?


1. Optical character recognition
2. Image compression
3. Data compression
4. Control problems

5. What are the four main steps in back propagation algorithm?


1. Initialization of weights
2. Feed forward function
3. Back propagation
4. Termination
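As a hedged illustration of how these four steps fit together, here is a sketch (assuming NumPy) of a tiny one-hidden-layer sigmoid network trained on XOR; the network size, learning rate, and stopping threshold are illustrative choices, not prescribed by the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Initialization of weights (small initial weights reduce computing time)
W1 = rng.normal(0.0, 0.1, size=(2, 2))   # input -> hidden
W2 = rng.normal(0.0, 0.1, size=(2, 1))   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, W2, eta=0.5):
    # 2. Feed forward function
    h = sigmoid(x @ W1)
    o = sigmoid(h @ W2)
    # 3. Back propagation of the error (squared-error gradients, sigmoid units)
    delta_o = (o - t) * o * (1 - o)
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    W2 -= eta * np.outer(h, delta_o)     # in-place updates, visible to caller
    W1 -= eta * np.outer(x, delta_h)
    return float(((o - t) ** 2).sum())

# 4. Termination: a fixed epoch budget, or stop once the error is small
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])   # XOR targets
for epoch in range(10000):
    if sum(train_step(x, t, W1, W2) for x, t in zip(X, T)) < 0.01:
        break
```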

6. Give some applications of ANN

• Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
• Data processing, including filtering, clustering, blind source separation and compression.
• Robotics, including directing manipulators and computer numerical control.

7. What are the uses of Regularization?

Regularization is used to create a less complex (parsimonious) model when you have a large number of
features in your dataset. Regularization techniques used to address over-fitting and perform feature
selection include L1 Regularization and L2 Regularization.
8. List the difference between L1 Regularization & L2 Regularization.

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model
which uses L2 is called Ridge Regression.

The key difference between these two is the penalty term.

Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function;
the λ-weighted sum of squared coefficients in the cost function below is the L2 regularization element.

Cost function: Σᵢ ( yᵢ − Σⱼ xᵢⱼ βⱼ )² + λ Σⱼ βⱼ²

Here, if lambda is zero then we get back OLS. However, if lambda is very large then it will add too much
penalty and lead to under-fitting. That said, it is important how lambda is chosen. This technique works
very well to avoid the over-fitting issue.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds “absolute value of magnitude” of
coefficient as penalty term to the loss function.

Cost function: Σᵢ ( yᵢ − Σⱼ xᵢⱼ βⱼ )² + λ Σⱼ |βⱼ|

Again, if lambda is zero then we get back OLS, whereas a very large value will make coefficients zero and
hence under-fit. The key difference between these techniques is that Lasso shrinks the less important
features' coefficients to zero, thus removing some features altogether. So it works well for feature
selection when we have a huge number of features. Traditional methods like cross-validation and stepwise
regression handle overfitting and perform feature selection well with a small set of features, but L1 and
L2 regularization are a great alternative when we are dealing with a large set of features.
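As an illustration of that difference, here is a hedged sketch (assuming scikit-learn; the synthetic data and alpha values are illustrative): Lasso drives the irrelevant coefficients to exactly zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features matter; the other eight are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))  # all 10 coefficients shrunk but nonzero
print(np.round(lasso.coef_, 3))  # noise coefficients typically exactly 0
```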

9. Define Early Stopping:

One method for improving network generalization ability is to use a network that is just large enough
to provide an adequate fit to the target function. But sometimes it is hard to know beforehand how large a
network should be for a specific application. One commonly used technique for improving network
generalization is early stopping. This technique monitors the error on a subset of the data (the validation
data) that does not actually take part in the training. The training stops when the error on the validation
data has increased for a certain number of iterations.
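A minimal sketch of the loop this describes; train_epoch and val_error are hypothetical caller-supplied functions, and `patience` (the number of epochs the validation error may fail to improve before stopping) is an illustrative name:

```python
def train_with_early_stopping(train_epoch, val_error, patience=5, max_epochs=1000):
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()                     # one pass over the training data
        err = val_error()                 # error on held-out validation data
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                         # validation error stopped improving
    return best_epoch, best_err
```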
10. What is generalization?
The ability of a pattern recognition system to approximate the desired output values for pattern vectors
which are not in the training set.

11. Define Regularization.


Regularization is defined as “any modification we make to a learning algorithm that is intended to reduce its
generalization error but not its training error.”

12. What are the other names of L2 regularization?


It is also known as weight decay, ridge regression, or Tikhonov regularization.

13. What is Regularization?

Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients. In
simple terms, it reduces the parameters and shrinks (simplifies) the model. This more streamlined, more
parsimonious model will likely perform better at predictions. Regularization adds penalties to more
complex models and then sorts potential models from least overfit to most; the model with the
lowest "overfitting" score is usually the best choice for predictive power.

14. Why is Regularization Necessary?

Regularization is necessary because least squares regression methods, where the residual sum of
squares is minimized, can be unstable. This is especially true if there is multicollinearity in the model.
However, the mere practice of model fitting comes with a major pitfall: any set of data can be fitted
to a model, even if that model is ridiculously complex.

15. What does Regularization achieve?

A standard least squares model tends to have some variance in it, i.e. the model won't generalize well to a
data set different from its training data. Regularization significantly reduces the variance of the model
without a substantial increase in its bias. The tuning parameter λ used in the regularization techniques
controls this trade-off between bias and variance. As the value of λ rises, it reduces the magnitude of the
coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only
reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But
after a certain value, the model starts losing important properties, giving rise to bias and thus
underfitting. Therefore, the value of λ should be carefully selected.

16. What are characteristics of problems that can be solved by neural network learning?

• Instances are represented by many attribute-value pairs.
• The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
• The ability of humans to understand the learned target function is not important.
17. State the Equation of Perceptron Training rule.

wᵢ ← wᵢ + Δwᵢ, where Δwᵢ = η (t − o) xᵢ

Here w = weight, x = input, η (eta) = learning rate, t = target output, and o = generated output.
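As a hedged illustration, the following sketch (assuming NumPy; the OR dataset, eta, and epoch count are illustrative) applies this update rule with a step activation:

```python
import numpy as np

def perceptron_train(X, T, eta=0.1, epochs=50):
    w = np.zeros(X.shape[1] + 1)                 # weights plus bias w0
    for _ in range(epochs):
        for x, t in zip(X, T):
            o = 1 if w[0] + w[1:] @ x > 0 else -1   # step activation
            w[0]  += eta * (t - o)                  # bias update (x0 = 1)
            w[1:] += eta * (t - o) * x              # delta w_i = eta (t - o) x_i
    return w

# Learns a linearly separable function such as OR (inputs and targets in {-1, 1}):
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
T = np.array([-1, 1, 1, 1])
print(perceptron_train(X, T))
```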

18. Drawback of Perceptron Training Rule.


It can be applied only to training samples that are linearly separable. If the data are not linearly
separable, convergence is not assured.

19. When Gradient descent can be applied?


It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
(1) the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and
(2) the error can be differentiated with respect to these hypothesis parameters.

20. What are the difficulties in applying gradient descent?


(1) converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands
of gradient descent steps),
(2) if there are multiple local minima in the error surface, then there is no guarantee that the
procedure will find the global minimum.

21. Distinguish between standard gradient descent and stochastic gradient descent.
In standard gradient descent, the error is summed over all examples before updating weights,
whereas in stochastic gradient descent weights are updated upon examining each training example.
Summing over multiple examples in standard gradient descent requires more computation per
weight update step. On the other hand, because it uses the true gradient, standard gradient descent is
often used with a larger step size per weight update than stochastic gradient descent.
In cases where there are multiple local minima, stochastic gradient descent can sometimes avoid
falling into these local minima and come closer to the global minimum.
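A hedged sketch of the two update schedules for a linear unit trained with squared error (assuming NumPy; the learning rate and epoch count are illustrative):

```python
import numpy as np

def batch_gd(X, t, eta=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = -(t - X @ w) @ X        # error summed over ALL examples
        w -= eta * grad                # one weight update per pass
    return w

def stochastic_gd(X, t, eta=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):     # update after EACH example
            w += eta * (t_d - x_d @ w) * x_d
    return w
```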

22. Distinguish between Perceptron Training Rule and Delta Rule.


The difference between these two training rules is reflected in different convergence properties. The
perceptron training rule converges after a finite number of iterations to a hypothesis that perfectly
classifies the training data, provided the training examples are linearly separable. The delta rule
converges only asymptotically toward the minimum error hypothesis, possibly requiring
unbounded time, but converges regardless of whether the training data are linearly separable.

23. How the problem of local minima can be alleviated?


Adding momentum term to the weight-update rule. Using stochastic gradient descent rather than
true gradient descent. Training multiple networks using the same data, but initializing each network
with different random weights.

24. What set of functions can be represented by feed forward networks?

• Boolean functions.
• Continuous functions.
• Arbitrary functions.

25. What is sample error?

The sample error of hypothesis h with respect to target function f and data sample S is the fraction of
examples in S that h misclassifies: error_S(h) = (1/n) Σ_{x in S} δ( f(x) ≠ h(x) ), where n is the number
of examples in S and δ(·) is 1 if its argument is true and 0 otherwise.

26. What is true error?

The true error of hypothesis h with respect to target function f and instance distribution D is the
probability that h will misclassify an instance drawn at random according to D:
error_D(h) = Pr_{x ~ D} [ f(x) ≠ h(x) ].

27. Mention Central Limit Theorem.

The Central Limit Theorem states that the sum (or mean) of a large number of independent, identically
distributed random variables, each with finite mean and finite variance, follows a distribution that
approaches a Normal distribution as the number of variables grows, regardless of the distribution
governing the individual variables.

28. Write down the formula of the general two-sided confidence interval for estimating the difference
between errors of two hypotheses.

For d = error_D1(h1) − error_D2(h2), estimated by d̂ = error_S1(h1) − error_S2(h2), the N% two-sided
confidence interval is

d̂ ± z_N √( error_S1(h1)(1 − error_S1(h1)) / n1 + error_S2(h2)(1 − error_S2(h2)) / n2 )

29. What is the generic four-step procedure used to derive a confidence interval?

1. Identify the underlying population parameter p to be estimated (e.g., error_D(h)).
2. Define the estimator Y (e.g., error_S(h)); it is desirable to choose a minimum-variance, unbiased estimator.
3. Determine the probability distribution D_Y that governs the estimator Y, including its mean and variance.
4. Determine the N% confidence interval by finding thresholds L and U such that N% of the mass of D_Y falls between L and U.

30. If a random variable Y obeys a Normal distribution with mean μ and standard deviation σ, then write
down the confidence interval for an observed value y of Y.

The N% confidence interval is y ± z_N σ; for example, with z_95 = 1.96, the 95% confidence interval is
y ± 1.96 σ.

31. What is a dropout?

Dropout is a regularization technique for reducing overfitting in neural networks. At each training step we
randomly drop out (set to zero) a set of nodes; thus we create a different model for each training case, and
all of these models share weights. It is a form of model averaging.
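A minimal sketch of the idea (assuming NumPy, and using the common "inverted dropout" scaling, which is an assumption not spelled out in the answer above):

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=np.random.default_rng(0)):
    mask = rng.random(activations.shape) < keep_prob   # random subset of nodes kept
    # Scale survivors by 1/keep_prob so the expected activation is unchanged
    return activations * mask / keep_prob

h = np.ones(8)
print(dropout(h))   # roughly half the units zeroed, survivors scaled to 2.0
```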

32. What are hyperparameters, provide some examples?

Hyperparameters, as opposed to model parameters, can't be learned from the data; they are set before the
training phase.

Learning rate
It determines how fast we want to update the weights during optimization. If the learning rate is too small,
gradient descent can be slow to find the minimum, and if it is too large, gradient descent may not converge
(it can overshoot the minimum). It is considered to be the most important hyperparameter.

Number of epochs
An epoch is defined as one forward pass and one backward pass over all of the training data.

Batch size
The number of training examples in one forward/backward pass.

33. What is the role of the activation function?

The goal of an activation function is to introduce non-linearity into the neural network so that it can learn
more complex functions. Without it, the neural network would only be able to learn functions that are linear
combinations of its input data.

34. What is a cost function?

The cost function tells us how well the neural network is performing. Our goal during training is to find
parameters that minimize the cost function. For an example of a cost function, consider the Mean Squared
Error function.
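A minimal sketch of Mean Squared Error (assuming NumPy; the sample values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))  # 0.02
```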

35. What is a gradient descent?

Gradient descent is an optimization algorithm used in machine learning to learn values of the parameters
that minimize the cost function. It is an iterative algorithm: in every iteration, we compute the gradient
of the cost function J with respect to each parameter θ and update the parameters via θ := θ − η ∂J(θ)/∂θ,
where η is the learning rate.
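A minimal sketch of this update rule on a toy one-parameter cost J(θ) = (θ − 3)²; the function and eta are illustrative assumptions:

```python
def grad_descent(eta=0.1, steps=100):
    theta = 0.0
    for _ in range(steps):
        grad = 2 * (theta - 3)    # dJ/dtheta for J = (theta - 3)^2
        theta -= eta * grad       # move against the gradient
    return theta

print(grad_descent())  # approaches the minimizer theta = 3
```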
36. What is data augmentation? List some examples.

Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the
target is not changed, or it is changed in a known way.
Computer vision is one of the fields where data augmentation is very useful. There are many modifications
that we can do to images:
• Resize
• Horizontal or vertical flip
• Rotate
• Add noise
• Deform
• Modify colors
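A hedged sketch (assuming NumPy) of two label-preserving modifications from the list above, horizontal flip and added noise, applied to a stand-in grayscale image:

```python
import numpy as np

def augment(image, rng):
    flipped = image[:, ::-1]                           # horizontal flip
    noisy = image + rng.normal(0, 0.05, image.shape)   # add Gaussian noise
    return flipped, np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((28, 28))        # stand-in for a grayscale image
flipped, noisy = augment(img, rng)
```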

37. Why do we need a validation set and test set? What is the difference between them?

When training a model, we divide the available data into three separate sets:
• The training dataset is used for fitting the model’s parameters. However, the accuracy that we
achieve on the training set is not reliable for predicting if the model will be accurate on new samples.
• The validation dataset is used to measure how well the model does on examples that weren’t part of
the training dataset. The metrics computed on the validation data can be used to tune the
hyperparameters of the model. However, every time we evaluate the validation data and we make
decisions based on those scores, we are leaking information from the validation data into our model.
The more evaluations, the more information is leaked. So we can end up overfitting to the validation
data, and once again the validation score won’t be reliable for predicting the behaviour of the model
in the real world.
• The test dataset is used to measure how well the model does on previously unseen examples. It should
only be used once we have tuned the parameters using the validation set.
So if we omit the test set and only use a validation set, the validation score won’t be a good estimate of the
generalization of the model.

38. Implement an AND function to a single neuron.


Below is a tabular representation of an AND function:
X1 X2 X1 AND X2
0 0 0
0 1 0
1 0 0
1 1 1
The activation function of our neuron is a threshold (step) function: output 1 if w1·x1 + w2·x2 + Bias > 0,
and 0 otherwise, with Bias = -1.5, w1 = 1, w2 = 1. The net input exceeds 0 only when x1 = x2 = 1, so the
neuron implements AND.
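A minimal check that these weights reproduce the truth table above:

```python
def and_neuron(x1, x2, w1=1, w2=1, bias=-1.5):
    net = w1 * x1 + w2 * x2 + bias
    return 1 if net > 0 else 0       # step activation

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_neuron(x1, x2))   # 1 only when x1 = x2 = 1
```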

39. Which technique perform similar operations as dropout in a neural network?

Bagging. Dropout can be seen as an extreme form of bagging in which each model is trained on
a single case and each parameter of the model is very strongly regularized by sharing it with the
corresponding parameter in all the other models.

40. In a neural network, which of the techniques are used to deal with overfitting?

Dropout, Regularization, Batch Normalization

41. Which of the following statement is the best description of early stopping?

A. Train the network until a local minimum in the error function is reached

B. Simulate the network on a test dataset after every epoch of training. Stop training when the
generalization error starts to increase

C. Add a momentum term to the weight update in the Generalized Delta Rule, so that training
converges more quickly

D. A faster version of backpropagation, such as the 'Quickprop' algorithm

Solution: (B)

42. What is parameter sharing?

While a parameter norm penalty is one way to regularize parameters to be close to one
another, the more popular way is to use constraints: to force sets of parameters to be
equal. This method of regularization is often referred to as parameter sharing, where we
interpret the various models or model components as sharing a unique set of parameters.
A significant advantage of parameter sharing over regularizing the parameters to be close
(via a norm penalty) is that only a subset of the parameters (the unique set) need to be
stored in memory.
43. What is Bagging?

Bagging (bootstrap aggregating) is a technique for reducing generalization error by combining several
models. The idea is to train several different models separately, then have all of the models vote on the
output for test examples. This is an example of a general strategy in machine learning called model
averaging. Techniques employing this strategy are known as ensemble methods.

44. List the advantages of Dropout.

• It is very computationally cheap.
• It does not significantly limit the type of model or training procedure that can be used. It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent.

45. Suppose you test a hypothesis h and find that it commits r = 300 errors on a sample S of n =
1000 randomly drawn test examples.

What is the standard deviation in error_S(h)?

error_S(h) = r / n = 300 / 1000 = 0.3

The variance in this estimate arises entirely from the variance in r. Because r is binomially distributed,

variance(r) = n p (1 − p)

Since p is unknown, substitute the estimate p̂ = r / n = 0.3:

variance(r) ≈ 1000 × 0.3 × (1 − 0.3) = 210
standard deviation(r) = √variance(r) = √210 ≈ 14.49
standard deviation(error_S(h)) = standard deviation(r) / n = 14.49 / 1000 ≈ 0.0145
46. Suppose hypothesis h commits r = 10 errors over a sample of n = 65 independently drawn
examples. What is the 90% confidence interval (two-sided) for the true error rate?

error_S(h) = 10 / 65 ≈ 0.15

90% interval = 0.15 ± 1.64 × √( 0.15 × (1 − 0.15) / 65 ) = 0.15 ± 0.073

47. What is the minimum number of examples (n) you must collect to assure that the width of
the two-sided 95% confidence interval will be smaller than 0.1?

Assume the expected error lies midway between 0.2 and 0.6:

E( error_D(h) ) = ( 0.2 + 0.6 ) / 2 = 0.4

The 95% interval width is 2 × 1.96 × x, where x = √( 0.4 × (1 − 0.4) / n ).

For width < 0.1:
x = 0.1 / ( 2 × 1.96 ) ≈ 0.0255
0.0255 = √( 0.4 × 0.6 / n )
0.00065025 = 0.24 / n
n = 0.24 / 0.00065025 ≈ 369.1

n = 370
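A minimal check of the arithmetic in questions 45 through 47:

```python
from math import sqrt

# Q45: standard deviation of error_S(h) with r = 300, n = 1000
r, n = 300, 1000
p_hat = r / n
print(sqrt(n * p_hat * (1 - p_hat)) / n)          # ~0.0145

# Q46: 90% two-sided interval half-width with r = 10, n = 65 (z_90 = 1.64)
p_hat = 10 / 65
print(1.64 * sqrt(p_hat * (1 - p_hat) / 65))      # ~0.073

# Q47: smallest n with width 2 * 1.96 * sqrt(p(1-p)/n) < 0.1, taking p = 0.4
p = 0.4
print(p * (1 - p) / (0.1 / (2 * 1.96)) ** 2)      # ~368.8 (~369.1 with the
                                                  # rounded 0.0255), so n = 370
```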

48. What are the values of weights w0, w1, and w2 for the perceptron whose decision surface is
illustrated in the figure? Assume the surface crosses the x1 axis at -1 and the x2 axis at 2.

The decision surface is the line where w0 + w1·x1 + w2·x2 = 0. Crossing the x1 axis at -1 gives
w0 − w1 = 0, and crossing the x2 axis at 2 gives w0 + 2·w2 = 0, so w1 = w0 and w2 = −w0/2. The weights
are determined only up to a positive scale factor; for example, w0 = 2, w1 = 2, w2 = −1.

49. a) Design a two-input perceptron that implements the Boolean function A∧¬B. (b) Design the two-layer
network of perceptrons that implements A XOR B.

The requested perceptron has 3 inputs: A, B, and the constant 1. The values of A and B are 1
(true) or -1 (false). The following table describes the output O of the perceptron:
A B O = A ∧ ¬B
-1 -1 -1
-1 1 -1
1 -1 1
1 1 -1

One of the correct decision surfaces (any line that separates the single positive point (A, B) = (1, -1)
from the three negative points would be fine) is the line A − B − 1 = 0, i.e., a perceptron with weights
w0 = −1, wA = 1, wB = −1. For part (b), A XOR B can be implemented by a two-layer network: two hidden
perceptrons computing A ∧ ¬B and ¬A ∧ B, feeding an output perceptron that computes their OR.
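A hedged sketch of both parts; the weight values are one valid choice, not the only one:

```python
def perceptron(w0, wa, wb):
    # Returns a threshold unit: output 1 if w0 + wa*a + wb*b > 0, else -1
    return lambda a, b: 1 if w0 + wa * a + wb * b > 0 else -1

and_not = perceptron(-1, 1, -1)   # fires only for A = 1, B = -1  (A AND NOT B)
not_and = perceptron(-1, -1, 1)   # fires only for A = -1, B = 1  (NOT A AND B)
or_unit = perceptron(1, 1, 1)     # fires unless both inputs are -1

def xor(a, b):
    # Two-layer network: hidden units and_not, not_and, then an OR output unit
    return or_unit(and_not(a, b), not_and(a, b))

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, and_not(a, b), xor(a, b))
```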
50. Derive a gradient descent training rule for a single unit with output o, where

.
The gradient descent training rule specifies how the weights are to be changed at each step of
the learning procedure so that the prediction error of the unit decreases the most. In general, defining
the error as E(w) = ½ Σ_d ( t_d − o_d )², the rule is Δwᵢ = −η ∂E/∂wᵢ = η Σ_d ( t_d − o_d ) ∂o_d/∂wᵢ;
substituting the given definition of the unit's output o into ∂o_d/∂wᵢ yields the concrete training rule.

51. In the Back-Propagation learning algorithm, what is the object of the learning? Does the Back-Propagation
learning algorithm guarantee to find the global optimum solution?

The object is to learn the weights of the interconnections between the inputs and the hidden units and between the
hidden units and the output units. The algorithm attempts to minimize the squared error between the network output
values and the target values of these outputs. The learning algorithm does not guarantee finding the global optimum
solution; it guarantees only that at least a local minimum of the error function will be found.
