Book's Solutions


Chap 2: Statistical Learning

1. For each of parts (a) through (d), indicate whether we would generally expect the
performance of a flexible statistical learning method to be better or worse than an
inflexible method. Justify your answer.
a. The sample size n is extremely large, and the number of predictors p is small.
(*Better)
i. Flexible: we have enough observations to avoid overfitting, so assuming
there are some non-linear relationships in our data, a more flexible model
should provide an improved fit.
ii. Variance increases slowly or not at all initially: for a large number of
observations, we can support more flexible models before variance begins to
increase significantly.
iii. Decreasing bias: As we increase the flexibility of the model, we expect the
bias to decrease as the model captures more structure in the training data.
b. The number of predictors p is extremely large, and the number of observations n is
small. ** Worse
i. Inflexible: we don't have enough observations to avoid overfitting
ii. Variance increases rapidly: for a small number of observations, we can
support a less flexible model before variance begins to increase significantly.
iii. The rate of increase in variance outweighs the rate of decrease in bias as the
number of parameters increases for a fixed, small number of observations.
c. The relationship between the predictors and response is highly non-linear. **Better
i. Flexible: a high variance model affords a better fit to non-linear
relationships
ii. The variance increases slowly or not at all initially.
iii. The bias decreases rapidly.
d. The variance of the error terms, i.e. σ2 = Var(ε), is extremely high. **Worse
i. Inflexible: a high bias model avoids overfitting to the noise in our dataset
ii. The variance increases rapidly.
iii. The rate of increase of the variance outweighs the rate of decrease in bias.

2. Explain whether each scenario is a classification or regression problem, and indicate


whether we are most interested in inference or prediction. Finally, provide n and p.
a. (a) We collect a set of data on the top 500 firms in the US. For each firm we record
profit, number of employees, industry and the CEO salary. We are interested in
understanding which factors affect CEO salary.
i. regression, inference, n=500, p=3 (profit, number of employees, industry; CEO salary is the response)
b. (b) We are considering launching a new product and wish to know whether it will
be a success or a failure. We collect data on 20 similar products that were
previously launched. For each product we have recorded whether it was a success
or failure, price charged for the product, marketing budget, competition price, and
ten other variables.
i. classification, prediction, n=20, p = 10+3 = 13 (10 other variables + price +
marketing budget + competition price).
c. (c) We are interested in predicting the % change in the USD/Euro exchange rate in
relation to the weekly changes in the world stock markets. Hence we collect weekly
data for all of 2012. For each week we record the % change in the USD/Euro, the %
change in the US market, the % change in the British market, and the % change in
the German market.
i. regression, prediction, n=52, p=3 (% change in the US, British, and German markets; the % change in USD/Euro is the response)
3. We now revisit the bias-variance decomposition.
a. (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and
Bayes (or irreducible) error curves, on a single plot, as we go from less flexible
statistical learning methods towards more flexible approaches. The x-axis should
represent the amount of flexibility in the method, and the y-axis should represent
the values for each curve. There should be five curves. Make sure to label each one.
i. [Sketch not captured in this copy: five labelled curves - (squared) bias, variance, training error, test error, and Bayes error - plotted against model flexibility.]
b. (b) Explain why each of the five curves has the shape displayed in part (a).
i. Bayes error: the irreducible error which is a constant irrespective of model
flexibility
ii. Variance: the variance of a model increases with flexibility as the model
picks up variation between training sets, resulting in more variation in f(X)
iii. Bias: bias tends to decrease with flexibility as the model can fit more
complex relationships
iv. Test error: tends to decrease as reduced bias allows the model to better fit
non-linear relationships, but then increases as an increasingly flexible model
begins to fit the noise in the dataset (overfitting)
v. Training error: decreases monotonically with increased flexibility as the
model 'flexes' towards individual datapoints in the training set
vi. Three scenarios are depicted, from leftmost to rightmost:
1. (1) The ideal function is somewhat non-linear
a. Bias: Initially decreases rapidly as a more complex model captures the
structure in the training data, before becoming flat as increasing
complexity further does not model the true relationship any better.
b. Variance: initially increases slowly as a moderately complex model
best approximates the true relationship, before increasing rapidly as
more complex models capture the noise in the training set and fail
to generalize beyond it.
c. Test MSE: Summing the bias and variance curves (plus the irreducible
error) yields the U-shaped test MSE curve.
d. Training MSE: (not pictured) decreases monotonically with model
complexity as more structure is captured in the training data and
predictive performance on the training set improves.
2. (2) The ideal function is linear

a. Bias: Flat - the true relationship is best represented by a low
complexity model; increasing model complexity does not model the
true relationship any better.
b. Variance: Increases - the true relationship is best approximated by a
low complexity model; more complex models capture the noise in
the training set and fail to generalize beyond it.
c. Test MSE: Summing the bias and variance curves yields a test MSE
curve that rises monotonically (the right half of the usual U shape).
d. Training MSE: (not pictured) decreases monotonically with model
complexity as more structure is captured in the training data and
predictive performance on the training set improves.
3. (3) The ideal function is very non-linear
a. Bias: Initially decreases very rapidly as a more complex model
captures the structure in the training data, before becoming flat as
increasing complexity further does not model the true relationship
any better.
b. Variance: Flat for a while - a very complex model best
approximates the true relationship, before increasing slightly as
extremely complex models capture the noise in the training set and
fail to generalize beyond it.
c. Test MSE: Summing the bias and variance curves yields a test MSE
curve that falls monotonically (the left half of the usual U shape).
d. Training MSE: (not pictured) decreases monotonically with model
complexity as more structure is captured in the training data and
predictive performance on the training set improves.
4. a) Describe three real-life applications in which classification might be useful. Describe
the response, as well as the predictors. Is the goal of each application inference or
prediction? Explain your answer.
a. Is this tumor malignant or benign?
i. - response: boolean (is malignant)
ii. - predictors (naive examples): tumor size, white blood cell
count, change in adrenal mass, position in body
iii. - goal: prediction
b. What animals are in this image?
i. response: 'Cat', 'Dog', 'Fish'
ii. predictors: image pixel values
iii. goal: prediction
c. What test bench metrics are most indicative of a faulty Printed Circuit Board
(PCB)?
i. response: boolean (is faulty)
ii. predictors: current draw, voltage draw, output noise,
operating temp.
iii. goal: inference
5. (b) Describe three real-life applications in which regression might be useful. Describe
the response, as well as the predictors. Is the goal of each application inference or
prediction? Explain your answer.
a. How much is this house worth?
i. response: SalePrice
ii. predictors: LivingArea, BathroomCount, GarageCount,
Neighbourhood, CrimeRate

iii. goal: prediction
b. What attributes most affect the market cap. of a company?
i. response: MarketCap
ii. predictors: Sector, Employees, FounderIsCEO, Age,
TotalInvestment, Profitability, RONA
iii. goal: inference
c. How long is this dairy cow likely to live?
i. response: years
ii. predictors: past medical conditions, current weight, milk
yield
iii. goal: prediction
6. Describe three real-life applications in which cluster analysis might be useful.
a. This dataset contains observations of 3 different species of flower. Estimate which
observations belong to the same species.
i. response: a, b, c (species class)
ii. predictors: sepal length, petal length, number of petals
iii. goal: prediction
b. which attributes of the flowers in dataset described above are most predictive of
species?
i. response: a, b, c (species class)
ii. predictors: sepal length, petal length, number of petals
iii. goal: inference
c. Group these audio recordings of birdsong by species.
i. response: (species classes)
ii. predictors: audio sample values
iii. goal: prediction
d. Gene expression clustering in cancer cell lines.
e. Topic clustering in news stories.
f. Infectious agent clustering.
7. What are the advantages and disadvantages of a very flexible (versus a less flexible)
approach for regression or classification? Under what circumstances might a more
flexible approach be preferred to a less flexible approach? When might a less flexible
approach be preferred?
a. Very flexible vs Less flexible
i. Bias: Flexible models have lower bias.
ii. Variance: Flexible models have high variance if there are not a sufficient
number of observations to fit their parameters.
b. When flexible preferred
i. Large number of observations
ii. True relationship is complex
c. When less flexible preferred
i. Small number of observations
ii. True relationship is not complex
d. Less flexible
i. Advantages :
1. gives better results with few observations
2. simpler inference: the effect of each feature can be more easily
understood
3. fewer parameters, faster optimization
ii. Disadvantage
1. performs poorly if observations contain highly non-linear relationships
e. More flexible
i. Advantage: gives a better fit if observations contain non-linear relationships
ii. Disadvantage: can overfit the data, providing poor predictions for new
observations

8. Describe the differences between a parametric and a non-parametric statistical


learning approach. What are the advantages of a parametric approach to regression or
classification (as opposed to a non-parametric approach)? What are its disadvantages?
a. A parametric approach simplifies the problem of estimating the best fit to the
training data f(x) by making some assumptions about the functional form of f(x);
this reduces the problem to estimating the parameters of the model.
b. Parametric: Makes assumptions about the structure of the function we are trying to
approximate in order to reduce the search space of possible functions - for example,
a linear model assumes a linear function and need only search over the space of
possible coefficients, rather than the space of all possible functions.
i. The advantage of the parametric approach is that it simplifies the
problem of estimating f(x), because it is easier to estimate parameters than an
arbitrary function.
ii. The disadvantage of this approach is that the assumed form of the function
f(X) could limit the degree of accuracy with which the model can fit the
training data. If too many parameters are used, in an attempt to increase the
model's flexibility, then overfitting can occur - meaning that the model
begins to fit noise in the training data that is not representative of unseen
observations.
c. A non-parametric approach makes no such assumptions, so f(x) can take any
arbitrary shape.
i. Non-Parametric: Makes no assumption about the structure of the function
we are trying to approximate - for example, KNN.
d. advantages / disadvantages Parametric and Non Parametric:
i. Non-parametric models must search over a very large space and become
intractable for high-dimensional data - for example, the curse of
dimensionality with KNN. Parametric models avoid this problem.
ii. Non-parametric models make no assumption about the structure of the true
function and therefore have low bias. Parametric models have higher bias.
9. The table below provides a training data set containing six observations, three predictors, and
one qualitative response variable.

a. [The table and worked answer were images in the original and are not captured in this copy.]
Chap3: Linear Regression

1. Carefully explain the differences between the KNN classifier and KNN regression
methods.
a. The KNN classifier and KNN regression methods are largely similar. The
KNN classifier predicts a qualitative response: it assigns an observation to the
most common class among its K nearest neighbours, which implicitly defines a
decision boundary segmenting the predictor space into two or more classes. KNN
regression is a non-parametric method for estimating a regression function: it
predicts a quantitative response as the average of the responses of the K nearest
neighbours (see the sketch below).
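
A minimal scikit-learn sketch of the contrast, using made-up toy data (the dataset and K value are purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))            # a single toy predictor

# KNN classification: qualitative response, predicted by majority vote among K neighbours
y_class = (X[:, 0] > 5).astype(int)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)
print(clf.predict([[4.2]]))                      # most common class among the 5 nearest neighbours

# KNN regression: quantitative response, predicted by averaging the K neighbours
y_reg = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=100)
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)
print(reg.predict([[4.2]]))                      # mean response of the 5 nearest neighbours
```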
2. Q3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 =
Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5
= Interaction between GPA and Gender. The response is starting salary after
graduation (in thousands of dollars). Suppose we use least squares to fit the model,
and get β_0 = 50, β_1 = 20 , β_2 = 0.07 , β_3 = 35 , β_4 = 0.01 , β_5 = −10 .
a. (a) Which answer is correct, and why?
i. i. For a fixed value of IQ and GPA, males earn more on average than
females.
ii. ii. For a fixed value of IQ and GPA, females earn more on average
than males.
iii. iii. For a fixed value of IQ and GPA, males earn more on average than
females provided that the GPA is high enough.
iv. iv. For a fixed value of IQ and GPA, females earn more on average
than males provided that the GPA is high enough.
b. iii. For a fixed value of IQ and GPA, males earn more on average than females
provided that the GPA is high enough.
i. X3 is the dummy variable for gender (1 for female, 0 for male) with
coefficient 35, so - all else being equal - the model estimates a starting
salary for females $35k higher than for males. But the GPA/gender
interaction has coefficient -10, so the female-minus-male gap is
35 - 10 x GPA, which becomes negative once GPA > 3.5: above that
GPA, males earn more than females (checked numerically below).
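
The crossover at GPA = 3.5 can be checked numerically. A small sketch using the fitted coefficients from the question (the IQ value is arbitrary, since the gap does not depend on it):

```python
# Predicted starting salary (in $1000s) from the fitted model in the question
def salary(gpa, iq, female):
    return (50 + 20 * gpa + 0.07 * iq + 35 * female
            + 0.01 * gpa * iq - 10 * gpa * female)

# Female-minus-male gap at fixed IQ is 35 - 10*GPA
for gpa in (3.0, 3.5, 4.0):
    gap = salary(gpa, 110, female=1) - salary(gpa, 110, female=0)
    print(f"GPA={gpa}: gap={gap:+.1f}")   # +5.0 at 3.0, +0.0 at 3.5, -5.0 at 4.0
```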

3. Q4. I collect a set of data (n = 100 observations) containing a single predictor and a
quantitative response. I then fit a linear regression model to the data, as well as a
separate cubic regression, i.e.
Y = β0 + β1X + β2X^2 + β3X^3 + ε
a. (a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 +
β1X + ε. Consider the training residual sum of squares (RSS) for the linear
regression, and also the training RSS for the cubic regression. Would we
expect one to be lower than the other, would we expect them to be the same,
or is there not enough information to tell? Justify your answer.
i. A: We would expect the training RSS for the cubic model to be lower,
because it is more flexible, which allows it to fit the variance in the
training data more closely - reducing training RSS even though this does
not represent a closer approximation to the true linear relationship f(x).
b. (b) Answer (a) using test rather than training RSS.

i. A: We would expect the test RSS for the linear regression to be lower,
because its restrictive (higher-bias) assumption is correct, and so the lack of
flexibility in that model carries no cost in estimating the true f(x). The
cubic model is more flexible, and so is likely to overfit the training
data, meaning that the fit of the model will be affected by variance in
the training data that is not representative of the true f(x).
c. (c) Suppose that the true relationship between X and Y is not linear, but we
don’t know how far it is from linear. Consider the training RSS for the linear
regression, and also the training RSS for the cubic regression. Would we
expect one to be lower than the other, would we expect them to be the same,
or is there not enough information to tell? Justify your answer.
i. A: We expect training RSS to decrease as the variance/flexibility
of our model increases. This holds true regardless of the true form of
f(x). So we expect the cubic model to result in a lower training RSS.
d. (d) Answer (c) using test rather than training RSS.
i. There is not enough information to answer this fully.
e. If the true relationship is highly non-linear and there is low noise (or
irreducible error) in our training data then we might expect the more flexible
cubic model to deliver a better test RSS.
f. However, if the relationship is only slightly non-linear or the noise in our
training data is high then a linear model might deliver better results.

Chap 4 Classification

4. We now examine the differences between LDA and QDA.


a. (a) If the Bayes decision boundary is linear, do we expect LDA or QDA to
perform better on the training set? On the test set?
i. Training: QDA should perform best as higher variance model has
increased flexibility to fit noise in the data
ii. Test: LDA should perform best as increased bias is without cost if
Bayes decision boundary is linear.
b. (b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA
to perform better on the training set? On the test set?
i. Training: QDA should perform best as higher variance model has
increased flexibility to fit non-linear relationship in data and noise
ii. Test: QDA should perform best as higher variance model has increased
flexibility to fit non-linear relationship in data
c. (c) In general, as the sample size n increases, do we expect the test prediction
accuracy of QDA relative to LDA to improve, decline, or be unchanged?
Why?
i. Improve, as an increased sample size reduces a more flexible model's
tendency to overfit the training data.
d. (d) True or False: Even if the Bayes decision boundary for a given problem is
linear, we will probably achieve a superior test error rate using QDA rather
than LDA because QDA is flexible enough to model a linear decision
boundary. Justify your answer.
i. False. If the Bayes decision boundary is linear, then a more flexible
model is prone to overfitting and will take account of noise in the training
data, reducing its accuracy in making predictions at test time.

Chap5 Resampling method

1. k-fold cross-validation.
a. (a) Explain how k-fold cross-validation is implemented.
i. In k-fold cross-validation the available observations are divided at random
into k non-overlapping folds of roughly equal size n/k. Each fold in turn is
held out as a test set, and the model is fitted to the remaining n(1 - 1/k)
observations.
ii. The model is thus fitted to each of these k training samples, and then
tested on the observations that were excluded from that sample. This
produces k error scores, which are averaged to produce the final
cross-validation score (see the sketch below).
iii. Note that the proportion of observations included in each
training set increases with k.
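
A minimal scikit-learn sketch of the procedure; the model and synthetic data are placeholders for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# k=5: fit on 4/5 of the observations, score on the held-out fold, repeat for each fold
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())   # average of the 5 fold errors = the CV estimate
```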
2. (b) What are the advantages and disadvantages of k-fold cross- validation relative to:
a. i. The validation set approach?
i. When k>2, cross-validation provides a larger training set than the
validation set approach. This means there is less bias in the
training setting, so cross-validation can produce more accurate
estimates for more flexible models that benefit from a larger number of
observations in the training set.
ii. Cross-validation results also exhibit less variability than the
validation set approach, whose estimate depends heavily on a single random
split. The disadvantage is that cross-validation is more computationally
expensive, because the model must be fitted and tested once per fold.
b. - ii. LOOCV?
i. Cross-validation with k<n produces accuracy estimates with lower variance
than LOOCV, because the n training sets in LOOCV are highly correlated with
one another. It is also less computationally expensive in most settings
(unless the model is a least-squares linear model, where a mathematical
shortcut makes LOOCV cheap - see below).
ii. LOOCV provides the maximum proportion of the original observations
for training, which could give an improved accuracy estimate if the
model is limited by the size of the training set available in k-fold cross-
validation (for some small value of k). LOOCV can also be more
computationally efficient for linear regression, assuming that the
aforementioned mathematical shortcut is implemented.
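
The shortcut referred to is the leverage identity for least squares (ISL eq. 5.2), which gives the LOOCV estimate from a single fit:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$$

where $\hat{y}_i$ is the i-th fitted value from the full least-squares fit and $h_i$ is its leverage.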

Chap6: model Selection and regularization

1. We perform best subset, forward stepwise, and backward stepwise selection on a


single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p
predictors. Explain your answers:
a. (a) Which of the three models with k predictors has the smallest training RSS?
i. The model obtained by best subset selection has training RSS equal to
or smaller than the others. The best subset procedure considers all
possible models with k predictors; forward stepwise and backward
stepwise do not.
ii. For k=1, best subset and forward stepwise will always obtain the same
model.
iii. For k=p, best subset and backward stepwise will always obtain the
same model.
b. (b) Which of the three models with k predictors has the smallest test RSS?
i. There is not enough information to tell. Best subset minimises training
RSS, but precisely because it searches over more models it is more prone
to overfitting, so the smallest test RSS could come from any of the three
procedures.
c. (c)
i. i. The predictors in the k-variable model identified by forward stepwise
are a subset of the predictors in the (k+1)-variable model identified by
forward stepwise selection.
1. True
ii. ii. The predictors in the k-variable model identified by backward
stepwise are a subset of the predictors in the (k + 1) variable model
identified by backward stepwise selection.
1. True: the k-variable model is obtained from the (k+1)-variable
model by deleting exactly one predictor.
iii. iii. The predictors in the k-variable model identified by backward
stepwise are a subset of the predictors in the (k + 1)-variable model
identified by forward stepwise selection.
1. False
iv. iv. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k+1) variable model
identified by backward stepwise selection.
1. False
v. v. The predictors in the k-variable model identified by best subset are a
subset of the predictors in the (k + 1) variable model identified by best
subset selection.
1. False
2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your
answer.
a. (a) The lasso, relative to least squares, is:
i. iii. Less flexible and hence will give improved prediction accuracy
when its increase in bias is less than its decrease in variance.
b. (b) Repeat (a) for ridge regression relative to least squares.
i. iii. Less flexible and hence will give improved prediction accuracy
when its increase in bias is less than its decrease in variance.
c. (c) Repeat (a) for non-linear methods relative to least squares.
i. ii. More flexible and hence will give improved prediction accuracy
when its increase in variance is less than its decrease in bias.
3. Suppose we estimate the regression coefficients in a linear regression model by
minimizing the lasso optimisation objective (reconstructed below) for a particular
value of s. For parts (a) through (e), indicate which of i. through v. is correct.
Justify your answer.
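The objective was an image in the original and is not reproduced in this copy. The answers below are consistent with the penalised form of the lasso, with s playing the role usually given to λ:

$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + s\sum_{j=1}^{p}|\beta_j|$$

(In ISL's constrained version, where s is instead a budget with $\sum_j|\beta_j|\le s$, increasing s increases flexibility and the directions in (a)-(d) are reversed.)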
a. (a) As we increase s from 0, the training RSS will:
i. iii. Steadily increase.
1. We will pay a price in increased bias and receive no benefit
from reduced variance.
b. (b) Repeat (a) for test RSS.
i. ii. Decrease initially, and then eventually start increasing in a U shape.
1. Initially, the increased bias of the model is offset by an even
larger reduction in variance. At higher values of s, the bias
increases and is not offset by further reduction in variance,
increasing the test RSS. (This assumes the model is actually
suffering from high variance initially.)
c. (c) Repeat (a) for variance.
i. iv. Steadily decrease
1. We expect the variance to decrease (possibly rapidly initially)
and then flatten out.
d. (d) Repeat (a) for (squared) bias.
i. iii. Steadily increase.
1. We expect the bias to increase steadily (possibly slowly
initially) and then possibly increase more rapidly.
e. (e) Repeat (a) for the irreducible error.
i. v. Remain constant.
4. Suppose we estimate the regression coefficients in a linear regression model by
minimizing:
a. (the ridge optimisation objective, reconstructed below)
i. Answers are the same as for question 3.
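Under the same convention, the penalised ridge objective replaces the absolute-value penalty with a squared one:

$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + s\sum_{j=1}^{p}\beta_j^2$$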

Chap 7: Decision Trees

1. Boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of
the form
a. It's the case because a depth-one tree is a function of a single variable. In this
setting the boosting algorithm becomes: …
b. Each iteration updates the model by adding the new stump. Note: a stump can't be
shrunk structurally because it is already as small as a tree can possibly be, and
since each stump is a function of a single feature, the resulting sum is additive in
the individual features.
2. Suppose we produce ten bootstrapped samples from a data set containing red and green
classes. We then apply a classification tree to each bootstrapped sample and, for a specific
value of X, produce 10 estimates of P(Class is Red | X). (The ten estimated probabilities
were not captured in this copy.)
a. There are two common ways to combine these results together into a single class
prediction. One is the majority vote approach discussed in this chapter. The second
approach is to classify based on the average probability. In this example, what is
the final classification under each of these two approaches?
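
Since the ten estimates were not captured, the sketch below uses hypothetical probabilities chosen purely to illustrate how the two combination rules can disagree:

```python
# Hypothetical P(Class is Red | X) estimates from the 10 bagged trees (illustrative values)
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each tree votes Red when its estimated probability exceeds 0.5
votes_red = sum(p > 0.5 for p in probs)
print("majority vote:", "Red" if votes_red > len(probs) / 2 else "Green")   # Red (6 of 10 votes)

# Average probability: classify Red only if the mean probability exceeds 0.5
mean_p = sum(probs) / len(probs)
print("average prob.:", "Red" if mean_p > 0.5 else "Green")                 # Green (mean = 0.45)
```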
3. Provide a detailed explanation of the algorithm that is used to fit a regression tree.
a. Step 1: Recursive binary splitting
i. A decision tree divides the predictor space into J distinct, non-overlapping
regions (see 4b above).
ii. But how do we decide where to place the splits?
iii. One approach might be to consider all possible combinations of splits across
all predictors, but this isn't likely to be tractable because the number of
possible combinations of splits is extremely large for any moderately sized
predictor space.
iv. To get around this problem a top-down greedy approach is used called
recursive binary splitting. This reduces the search space by only considering
the optimal split at each step. Residual Sum of Squares (RSS) is used to
measure optimality: at each step we search for the single additional split in
predictor space that most reduces RSS. This process is repeated recursively,
with all subregions of all features considered at each step, and with splits
from previous steps being maintained (i.e. you can't split across a pre-existing
split).
v. As this process is repeated through recursion an additional split and
resultant region is added at each step. The regions at the bottom of the tree
with no subsequent splits are called leaf nodes or leaves. The process is
repeated until the population of observations in each leaf is less than some
threshold t
vi. In the regression setting the estimated value for an observation that falls
into a given leaf node, is the mean response value of all observations in that
leaf.
b. Step 2: Cost complexity pruning
i. If the threshold t is small, then the above step will result in a large,
complex tree that is optimised with respect to the training set.
ii. There will likely be some noise in the training set, not representative of
the true population, that the tree has accounted for: it has overfit the
training data, and this will impede its predictive power on new observations.

iii. One approach to mitigating overfitting might be to continue step one only so
long as the reduction in RSS at each step exceeds some threshold. The
problem with this is that a very valuable split might follow a low-value
split that doesn't exceed the threshold, in which case tree growth is halted
prematurely. I'm not sure how often this situation occurs in practice, but the ISL
authors suggest it does.
iv. Another approach is to grow a very large tree that overfits the training set,
and then prune it back by removing splits. How do we choose which splits to
remove?
v. This can be achieved by adding to the RSS cost function a penalty on the
number of terminal nodes in the tree. The penalised criterion is:
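(The formula was an image in the original; the standard cost-complexity criterion from ISL eq. 8.4, minimised for each value of the tuning parameter α, is reproduced here.)

$$\sum_{m=1}^{|T|}\ \sum_{i:\,x_i\in R_m}\big(y_i-\hat{y}_{R_m}\big)^2 + \alpha\,|T|$$

where $|T|$ is the number of terminal nodes, $R_m$ is the region corresponding to the m-th leaf, and $\hat{y}_{R_m}$ is the mean training response in that leaf.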
c. Step 3: Choose α using cross-validation
i. We can now use cross-validation over some range of α to choose the optimal
size of tree. Each α results in a subtree of the full tree, which we test using
cross-validation to estimate its test score. More generally, this is an
optimisation of the bias-variance tradeoff.
d. Step 4: Return subtree from Step 2 that corresponds to chosen value of α from Step
3
e. Finally, we choose the subtree (value of α) that produces the best cross-validation
score and use this for prediction.

Chap 9 SVM:
1. With a lower cost parameter, the model uses more support vectors because the margin is
now wider.
2. A slight movement of the seventh observation would not affect the maximal margin
hyperplane.
3. Use cross-validation to tune the cost parameter.
4. Using cost=1 we misclassify a training observation, but there is a wider margin and so
there are more support vectors. It seems likely this model will be less affected by
overfitting than when cost=1e5.
5. We can see from the figure that there are a fair number of training errors in this SVM fit. If
we increase the value of cost, we can reduce the number of training errors. However, this
comes at the price of a more irregular decision boundary that seems to be at risk of
overfitting the data.
6. The best parameters are {'C': 1.0, 'gamma': 1.0}, with a score of 0.929.

a. Therefore, the best choice of parameters found involves cost=1 and gamma=1.
We can view the test set predictions for this model by applying the predict()
function to the data.
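
A hedged scikit-learn sketch of the tune-then-predict workflow described above; the dataset and grid values are placeholders, not those of the original lab:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over a grid of cost and gamma values
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.5, 1.0, 2.0]},
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Test-set predictions come from the best model, refitted on the full training set
y_pred = search.predict(X_test)
```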

Chap 11 (Book 3)

1. Advantages of a CNN over a fully connected DNN for image classification:


o CNNs are specifically designed to handle spatial data such as images, making
them more effective for image classification tasks.
o CNNs exploit the spatial correlation present in images by using local
connectivity and shared weights, reducing the number of parameters compared
to fully connected DNNs.
o The use of convolutional layers and pooling layers in CNNs helps in capturing
and extracting hierarchical features, which are crucial for image
understanding.
o CNNs can handle inputs of different sizes due to their use of convolutional
filters and pooling operations, whereas fully connected DNNs require
fixed-size inputs.
o CNNs are translation invariant, meaning they can recognize patterns
regardless of their location in the image, which is important for image
classification tasks.
2. Calculation of parameters and RAM requirements:
o Each convolutional layer has 100, 200, and 400 feature maps, respectively.
o For each layer, the number of parameters is calculated as (filter_height *
filter_width * input_channels + 1) * output_channels.
o The first layer: (3 * 3 * 3 + 1) * 100 = 2800 parameters
o The second layer: (3 * 3 * 100 + 1) * 200 = 180200 parameters
o The third layer: (3 * 3 * 200 + 1) * 400 = 720400 parameters
o Total number of parameters: 2800 + 180200 + 720400 = 903,400 parameters
o If each parameter is stored as a 32-bit float (4 bytes), the parameters alone
require 903,400 * 4 bytes ≈ 3.6 megabytes (MB) for a single-instance
prediction.
o When training on a mini-batch of 50 images, the parameters are shared across
the batch; it is the activation (feature map) memory, not the parameter
memory, that scales roughly 50-fold with batch size (see the verification
sketch below for the parameter counts).
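
The parameter count can be verified by building the stack in Keras. A sketch assuming 3x3 kernels, stride 2, "same" padding, and 200x300 RGB inputs (treat these settings as assumptions about the question's setup):

```python
import tensorflow as tf

# Three conv layers with 100, 200, and 400 feature maps (3x3 kernels, stride 2, SAME padding)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(100, 3, strides=2, padding="same",
                           input_shape=(200, 300, 3)),           # (3*3*3 + 1) * 100   = 2,800
    tf.keras.layers.Conv2D(200, 3, strides=2, padding="same"),   # (3*3*100 + 1) * 200 = 180,200
    tf.keras.layers.Conv2D(400, 3, strides=2, padding="same"),   # (3*3*200 + 1) * 400 = 720,400
])
print(model.count_params())   # 903400
```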
3. If the GPU runs out of memory while training a CNN, here are five possible solutions:
o Reduce the batch size: Decreasing the number of images processed in each
batch reduces the memory requirements.
o Use smaller images: Resizing the input images to a smaller size reduces the
memory needed.
o Decrease the model complexity: Reduce the number of layers, filters, or
channels in the network to reduce the memory footprint.
o Employ mixed precision training: Utilize techniques such as mixed precision
training, which combines low-precision and high-precision floating-point
numbers to reduce memory usage.
o Enable gradient checkpointing: This technique trades off computation for
memory by recomputing certain intermediate values during backpropagation
instead of storing them.
4. Max pooling layer vs. convolutional layer with the same stride:
o Max pooling layers are often used to downsample the spatial dimensions of
feature maps while retaining the most salient features.

o Max pooling layers introduce invariance to small translations and distortions
in the input, making the network more robust to variations in the position or
size of the detected features.
o A convolutional layer with the same stride would not provide the same
properties: it has learnable parameters and computes weighted sums over its
receptive field rather than taking the maximum, so it does not give the same
parameter-free downsampling and small-translation invariance.
5. Local response normalization layer:
o Local response normalization (LRN) layers were previously used to provide a
form of lateral inhibition, where the activity of neurons in one channel is
inhibited relative to nearby channels.
o LRN layers were mainly used in early CNN architectures, such as AlexNet, to
promote competition between neighboring feature maps and enhance the
overall discrimination power of the network.
o However, recent research has shown that the effectiveness of LRN layers is
limited, and they are not widely used in modern architectures. Batch
normalization has become the preferred normalization technique.
6. Innovations in AlexNet, GoogLeNet, ResNet, SENet, and Xception:
o AlexNet: Introduced the concept of deep convolutional neural networks for
image classification, used rectified linear activation units (ReLU), utilized
GPUs for efficient training, and won the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) in 2012.
o GoogLeNet (Inception): Introduced the inception module with 1x1, 3x3, and
5x5 convolutions stacked together, reducing the computational complexity and
improving the network's ability to capture features at different scales.
o ResNet: Introduced residual connections, allowing the network to learn
residual mappings and alleviate the vanishing gradient problem, enabling the
training of very deep networks (e.g., 50, 101, or 152 layers).
o SENet (Squeeze-and-Excitation Network): Introduced the concept of
channel-wise feature recalibration using global information to adaptively
rescale feature maps, enhancing the discriminative power of the network.
o Xception: Introduced the concept of depthwise separable convolutions, which
decouple spatial and channel convolutions, reducing the number of parameters
and computations while maintaining model accuracy.
7. Fully Convolutional Network (FCN):
o A Fully Convolutional Network is a type of neural network architecture
specifically designed for semantic segmentation tasks, where pixel-level
predictions are made for an input image.
o The main difference between a fully connected dense layer and a
convolutional layer is the connectivity pattern and shared weights.
o To convert a dense layer into a convolutional layer, the dense layer's weights
can be reshaped into convolution kernels whose spatial dimensions equal those
of the layer's input feature map, with one filter per dense unit and valid
padding.
o The reshaped kernels are then applied to the input feature map using the
convolution operation, producing a 1x1 output per filter on the original input
size - and a larger, spatial output map on larger inputs (see the sketch below).
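
A minimal Keras sketch of the equivalence (the 7x7x64 input shape and 10 output units are illustrative assumptions):

```python
import tensorflow as tf

# A Dense(10) head acting on a flattened 7x7x64 feature map...
dense_head = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(7, 7, 64)),
    tf.keras.layers.Dense(10),
])

# ...is equivalent to a 7x7 convolution with 10 filters and valid padding,
# whose output is 1x1x10 on this input size (but grows on larger inputs).
conv_head = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, kernel_size=7, padding="valid",
                           input_shape=(7, 7, 64)),
])

# Same parameter count either way: (7*7*64 + 1) * 10 = 31,370
print(dense_head.count_params(), conv_head.count_params())
```

The convolutional version can then slide over larger images, which is what lets an FCN produce dense, spatial predictions.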
8. The main technical difficulty of semantic segmentation:
o Semantic segmentation involves assigning a class label to each pixel in an
image, which requires dense predictions and fine-grained localization.

o The main difficulty lies in capturing and understanding the fine details and
boundaries of objects in the image.
o Challenges include handling class imbalance, dealing with varying object
scales, accurately segmenting objects with intricate shapes or occlusions, and
maintaining spatial consistency across different scales.
9. Building a CNN from scratch for MNIST classification:
o Here's a simple example of a CNN architecture for MNIST classification:
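
The code was an image in the original and is not captured here; a minimal Keras sketch of such an architecture (layer sizes and epoch count are illustrative):

```python
import tensorflow as tf

# Load MNIST and scale pixels to [0, 1]; add a channel axis (28x28x1)
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train, X_test = X_train[..., None] / 255.0, X_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(X_test, y_test))   # [test loss, test accuracy]
```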

Exam:

Solution for CNN

To calculate the total number of parameters for this network, we need the number of parameters for each layer and then add them up. Here are the calculations:
- Convolutional layer 1: 64 filters of size 3x3 with 1 channel each, plus 1 bias term per filter, for a total of (3x3x1x64) + 64 = 640 parameters.
- Max pooling layer 1: No parameters.
- Convolutional layer 2: 64 filters of size 3x3 with 64 channels each, plus 1 bias term per filter, for a total of (3x3x64x64) + 64 = 36,928 parameters.
- Max pooling layer 2: No parameters.
- Flatten layer: No parameters.
- Fully connected layer: 3,136 input neurons and 128 output neurons, plus 1 bias term per output neuron, for a total of (3136x128) + 128 = 401,536 parameters.
Adding these up: 640 + 36,928 + 401,536 = 439,104 parameters in total.

Solution 2 (Contradiction)

To calculate the number of parameters for this network, we count the parameters in each layer and add them up. Here is the breakdown (this solution omits the convolutional bias terms):
- Convolutional layer 1: 32 filters of size 3x3x1 = 288 parameters
- Max pooling layer 1: 2x2 kernel with stride 2 = 0 parameters (no learnable parameters)
- Convolutional layer 2: 64 filters of size 3x3x32 = 18,432 parameters
- Max pooling layer 2: 2x2 kernel with stride 2 = 0 parameters
- Flatten layer: no parameters
- Fully connected layer 1: 1,792 input neurons and 128 output neurons = (1,792 + 1) x 128 = 229,504 parameters
- Fully connected layer 2: 128 input neurons and 10 output neurons = (128 + 1) x 10 = 1,290 parameters
Adding up all the parameters, we get a total of 249,514.

