Book's Solutions
Chap 2: Statistical Learning
1. For each of parts (a) through (d), indicate whether we would generally expect the
performance of a flexible statistical learning method to be better or worse than an
inflexible method. Justify your answer.
a. The sample size n is extremely large, and the number of predictors p is small. **Better
i. Flexible: we have enough observations to avoid overfitting, so assuming
there are some non-linear relationships in our data, a more flexible model
should provide an improved fit.
ii. Variance increases slowly or not at all initially: for a large number of
observations, we can support more flexible models before variance begins to
increase significantly.
iii. Decreasing bias: As we increase the flexibility of the model, we expect the
bias to decrease as the model captures more structure in the training data.
b. The number of predictors p is extremely large, and the number of observations n is
small. **Worse
i. Inflexible: we don't have enough observations to avoid overfitting
ii. Variance increases rapidly: for a small number of observations, we can
support a less flexible model before variance begins to increase significantly.
iii. The rate of increase in variance outweighs the rate of decrease in bias as the
number of parameters increases for a fixed, small number of observations.
c. The relationship between the predictors and response is highly non-linear. **Better
i. Flexible: a high variance model affords a better fit to non-linear
relationships
ii. The variance increases slowly or not at all initially.
iii. The bias decreases rapidly.
d. The variance of the error terms, i.e. σ² = Var(ε), is extremely high. **Worse
i. Inflexible: a high bias model avoids overfitting to the noise in our dataset
ii. The variance increases rapidly.
iii. The rate of increase of the variance outweighs the rate of decrease in bias.
3. (b) Explain why each of the five curves has the shape displayed in part (a).
i. Bayes error: the irreducible error which is a constant irrespective of model
flexibility
ii. Variance: the variance of a model increases with flexibility as the model
picks up variation between training sets, resulting in more variation in the fitted f(X)
iii. Bias: bias tends to decrease with flexibility as the model can fit more
complex relationships
iv. Test error: tends to decrease at first, as reduced bias allows the model to better fit
non-linear relationships, but then increases as an increasingly flexible model
begins to fit the noise in the dataset (overfitting)
v. Training error: decreases monotonically with increased flexibility as the
model 'flexes' towards individual datapoints in the training set
vi. Three scenarios are depicted, from leftmost to rightmost:
1. The ideal function is somewhat non-linear
a. Bias: initially decreases rapidly as a more complex model captures the
structure in the training data, before becoming flat as increasing
complexity further does not model the true relationship any better.
b. Variance: initially increases slowly as a moderately complex model
best approximates the true relationship, before increasing rapidly as
more complex models capture the noise in the training set and fail
to generalize beyond it.
c. Test MSE: summing the (squared) bias and variance curves, plus the
constant irreducible error, yields the U-shaped test MSE curve.
d. Training MSE: (not pictured) decreases monotonically with model
complexity as more structure is captured in the training data and
predictive performance on the training set improves.
2. The ideal function is linear
a. Bias: flat – the true relationship is best represented by a low
complexity model; increasing model complexity does not model the
true relationship any better.
b. Variance: increases – the true relationship is best approximated by a
low complexity model; more complex models capture the noise in
the training set and fail to generalize beyond it.
c. Test MSE: summing the bias and variance curves yields the (half)
U-shaped, monotonically increasing test MSE curve.
d. Training MSE: (not pictured) decreases monotonically with model
complexity as more structure is captured in the training data and
predictive performance on the training set improves.
3. The ideal function is very non-linear
a. Bias: initially decreases very rapidly as a more complex model
captures the structure in the training data, before becoming flat as
increasing complexity further does not model the true relationship
any better.
b. Variance: Flat for a while - a very complex model best
approximates the true relationship, before increasing slightly as
extremely complex models capture the noise in the training set and
fail to generalize beyond it.
c. Test MSE: summing the bias and variance curves yields the (half)
U-shaped test MSE curve.
d. Training MSE: (not pictured) decreases monotonically with model
complexity as more structure is captured in the training data and
predictive performance on the training set improves.
4. (a) Describe three real-life applications in which classification might be useful. Describe
the response, as well as the predictors. Is the goal of each application inference or
prediction? Explain your answer.
a. Is this tumor malignant or benign?
i. response: boolean (is malignant)
ii. predictors (naive examples): tumor size, white blood cell count, change in adrenal mass, position in body
iii. goal: prediction
b. What animals are in this image?
i. response: 'Cat', 'Dog', 'Fish'
ii. predictors: image pixel values
iii. goal: prediction
c. What test bench metrics are most indicative of a faulty Printed Circuit Board
(PCB)?
i. response: boolean (is faulty)
ii. predictors: current draw, voltage draw, output noise, operating temp.
iii. goal: inference
5. (b) Describe three real-life applications in which regression might be useful. Describe
the response, as well as the predictors. Is the goal of each application inference or
prediction? Explain your answer.
a. How much is this house worth?
i. response: SalePrice
ii. predictors: LivingArea, BathroomCount, GarageCount,
Neighbourhood, CrimeRate
iii. goal: prediction
b. What attributes most affect the market cap. of a company?
i. response: MarketCap
ii. predictors: Sector, Employees, FounderIsCEO, Age,
TotalInvestment, Profitability, RONA
iii. goal: inference
c. How long is this dairy cow likely to live?
i. response: years
ii. predictors: past medical conditions, current weight, milk
yield
iii. goal: prediction
6. Describe three real-life applications in which cluster analysis might be useful.
a. This dataset contains observations of 3 different species of flower. Estimate which
observations belong to the same species.
i. response: a, b, c (species class)
ii. predictors: sepal length, petal length, number of petals
iii. goal: prediction
b. Which attributes of the flowers in the dataset described above are most predictive of
species?
i. response: a, b, c (species class)
ii. predictors: sepal length, petal length, number of petals
iii. goal: inference
c. Group these audio recordings of birdsong by species.
i. response: (species classes)
ii. predictors: audio sample values
iii. goal: prediction
d. Gene expression clustering in cancer cell lines.
e. Topic clustering in news stories.
f. Infectious agent clustering.
7. What are the advantages and disadvantages of a very flexible (versus a less flexible)
approach for regression or classification? Under what circumstances might a more
flexible approach be preferred to a less flexible approach? When might a less flexible
approach be preferred?
a. Very flexible vs Less flexible
i. Bias: Flexible models have lower bias.
ii. Variance: flexible models have higher variance unless there are sufficient
observations to fit their parameters.
b. When flexible preferred
i. Large number of observations
ii. True relationship is complex
c. When less flexible preferred
i. Small number of observations
ii. True relationship is not complex
d. Less flexible
i. Advantages :
1. gives better results with few observations
2. simpler inference: the effect of each feature can be more easily
understood
3. fewer parameters, faster optimization
ii. Disadvantage
1. performs poorly if observations contain highly non-linear relationships
e. More flexible
i. Advantage: gives a better fit if observations contain non-linear relationships
ii. Disadvantage: can overfit the data, providing poor predictions for new
observations
Chap 3: Linear Regression
1. Carefully explain the differences between the KNN classifier and KNN regression
methods.
a. The KNN classifier and KNN regression methods are closely related. The
KNN classifier is used in the classification setting: for a given observation it
finds the K nearest training points and predicts the majority class among
them, which implicitly defines a decision boundary separating two or more
classes. KNN regression is a non-parametric method for estimating a
regression function: it predicts the average response of the K nearest training
points, yielding an estimate of a quantitative variable.
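A minimal side-by-side sketch of the two methods (scikit-learn; the toy data and variable names are illustrative assumptions, not from the book):

```python
# KNN classification vs. KNN regression: the same neighbour search,
# but different aggregation (majority class vs. mean response).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y_class = (X[:, 0] > 5).astype(int)               # qualitative response
y_reg = np.sin(X[:, 0]) + rng.normal(0, 0.1, 50)  # quantitative response

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

x0 = [[4.2]]
print(clf.predict(x0))  # majority class among the 3 nearest neighbours
print(reg.predict(x0))  # average response of the 3 nearest neighbours
```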
2. Q3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 =
Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5
= Interaction between GPA and Gender. The response is starting salary after
graduation (in thousands of dollars). Suppose we use least squares to fit the model,
and get β_0 = 50, β_1 = 20, β_2 = 0.07, β_3 = 35, β_4 = 0.01, β_5 = −10.
a. (a) Which answer is correct, and why?
i. For a fixed value of IQ and GPA, males earn more on average than
females.
ii. For a fixed value of IQ and GPA, females earn more on average
than males.
iii. For a fixed value of IQ and GPA, males earn more on average than
females provided that the GPA is high enough.
iv. For a fixed value of IQ and GPA, females earn more on average
than males provided that the GPA is high enough.
b. iii. For a fixed value of IQ and GPA, males earn more on average than females
provided that the GPA is high enough.
i. Because X3 is our dummy variable for gender (1 for female, 0 for
male) with coefficient 35, all else being equal the model estimates a
starting salary for females $35k higher than for males. But the
GPA/Gender interaction term has coefficient −10, so the female-minus-male
difference is 35 − 10 × GPA, which is negative once GPA > 3.5:
for a high enough GPA, males earn more than females.
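A quick numeric check of the crossover (a sketch; the salary function below simply encodes the fitted coefficients from the question):

```python
# Predicted starting salary in $1000s, using the fitted coefficients.
def salary(gpa, iq, female):
    return (50 + 20 * gpa + 0.07 * iq + 35 * female
            + 0.01 * gpa * iq - 10 * gpa * female)

for gpa in (3.0, 3.5, 4.0):
    gap = salary(gpa, 110, 1) - salary(gpa, 110, 0)  # female minus male
    print(f"GPA {gpa}: female - male = {gap:+.1f}k")
# GPA 3.0: +5.0k; GPA 3.5: +0.0k; GPA 4.0: -5.0k  (IQ cancels in the gap)
```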
3. Q4. I collect a set of data (n = 100 observations) containing a single predictor and a
quantitative response. I then fit a linear regression model to the data, as well as a
separate cubic regression, i.e.
Y = β0 + β1X + β2X^2 + β3X^3 + ε
a. (a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 +
β1X + ε. Consider the training residual sum of squares (RSS) for the linear
regression, and also the training RSS for the cubic regression. Would we
expect one to be lower than the other, would we expect them to be the same,
or is there not enough information to tell? Justify your answer.
i. A: We would expect the training RSS for the cubic model to be lower,
because it is more flexible, which allows it to fit the variance in the
training data more closely – reducing training RSS even though this
does not represent a closer approximation to the true linear
relationship f(X).
b. (b) Answer (a) using test rather than training RSS.
i. A: We would expect the test RSS for the linear regression to be lower,
because its restrictive form matches the truly linear relationship, so its
lack of flexibility costs nothing in estimating the true f(X). The
cubic model is more flexible and so is likely to overfit the training
data, meaning its fit is affected by variance in the training data that is
not representative of the true f(X).
c. (c) Suppose that the true relationship between X and Y is not linear, but we
don’t know how far it is from linear. Consider the training RSS for the linear
regression, and also the training RSS for the cubic regression. Would we
expect one to be lower than the other, would we expect them to be the same,
or is there not enough information to tell? Justify your answer.
i. A: We expect training RSS to decrease as the variance/flexibility
of our model increases. This holds regardless of the true form of
f(X), so we expect the cubic model to result in a lower training RSS.
d. (d) Answer (c) using test rather than training RSS.
i. There is not enough information to answer this fully.
e. If the true relationship is highly non-linear and there is low noise (or
irreducible error) in our training data then we might expect the more flexible
cubic model to deliver a better test RSS.
f. However, if the relationship is only slightly non-linear or the noise in our
training data is high then a linear model might deliver better results.
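A small simulation of parts (a) and (b) under an assumed truly linear f(X) (the noise level and sample design are assumptions, chosen only to illustrate the point):

```python
# Fit linear and cubic models to data from a truly linear f(X)
# and compare training vs. test RSS.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 + 3 * x                      # true (linear) relationship

x_tr, x_te = rng.uniform(-3, 3, 100), rng.uniform(-3, 3, 100)
y_tr = f(x_tr) + rng.normal(0, 1, 100)
y_te = f(x_te) + rng.normal(0, 1, 100)

rss = lambda y, yhat: float(np.sum((y - yhat) ** 2))
for degree in (1, 3):                        # linear vs. cubic fit
    c = np.polyfit(x_tr, y_tr, degree)
    print(f"degree {degree}: "
          f"train RSS {rss(y_tr, np.polyval(c, x_tr)):.1f}, "
          f"test RSS {rss(y_te, np.polyval(c, x_te)):.1f}")
# The cubic fit never has higher training RSS; the linear fit
# typically wins on test RSS when the truth is linear.
```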
Chap 4: Classification
Chap 5: Resampling Methods
1. k-fold cross-validation.
a. (a) Explain how k-fold cross-validation is implemented.
i. In k-fold cross-validation the n available observations are randomly
divided into k folds of roughly equal size (about n/k observations each).
ii. The model is then fitted k times, each time on the k − 1 folds not held
out (a training set of about n(1 − 1/k) observations), and tested on the
observations in the held-out fold. This produces k error scores which
are then averaged to produce the final cross-validation score (see the
sketch below).
iii. Note that the proportion of observations included in each training set,
(k − 1)/k, increases with k.
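A minimal sketch of the procedure (scikit-learn; the model and data are placeholders):

```python
# k-fold cross-validation: fit on k-1 folds, score on the held-out fold,
# then average the k scores.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[test_idx],
                                     model.predict(X[test_idx])))
print("5-fold CV estimate of test MSE:", np.mean(scores))
```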
2. (b) What are the advantages and disadvantages of k-fold cross- validation relative to:
a. i. The validation set approach?
i. When k > 2, cross-validation provides a larger training set than the
validation set approach. This means that there is less bias in the
training setting, so cross-validation can produce more accurate
estimates for more flexible models that benefit from a larger number of
observations in the training set.
ii. Cross-validation estimates also exhibit less variability than the
validation set approach, which depends heavily on a single random
split. The main disadvantage is computational: the model must be
fitted and tested once for each of the k folds.
b. ii. LOOCV?
i. Cross-validation with k < n provides less variance in accuracy scores than
LOOCV. It is also less computationally expensive in most settings
(unless the model is a least-squares fit, for which the LOOCV shortcut
below applies).
ii. LOOCV provides the maximum proportion of the original observations
for training, which could give an improved accuracy estimate if the
model is limited by the size of the training set available in k-fold cross-
validation (for some small value of k). LOOCV can also be more
computationally efficient for least-squares linear regression, using the
aforementioned shortcut.
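For reference, the least-squares shortcut referred to above is ISL's Equation (5.2), where ŷ_i is the i-th fitted value from a single fit on all n observations and h_i is its leverage:

```latex
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^{2}
```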
Chap 6: Model Selection and Regularization
Chap 7: Decision Trees
1. Explain why boosting using depth-one trees (or stumps) leads to an additive model:
that is, a model of the form f(X) = Σ_j f_j(X_j).
a. It's the case because a depth-one tree is a function of a single variable. In this
setting the boosting algorithm becomes: …
b. The model is updated at each step by adding the new tree. Note: a stump can't be
shrunk because it is already as small as a tree can possibly be; note also that each
stump is a function of a single feature, so the sum of stumps is a sum of functions
of individual features – an additive model.
2. Suppose we produce ten bootstrapped samples from a data set containing red and green
classes. We then apply a classification tree to each bootstrapped sample and, for a specific
value of X, produce 10 estimates of P(Class is Red | X).
a. There are two common ways to combine these results together into a single class
prediction. One is the majority vote approach discussed in this chapter. The second
approach is to classify based on the average probability. In this example, what is
the final classification under each of these two approaches? (See the worked
sketch below.)
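A worked sketch, assuming the ten estimates of P(Class is Red | X) used in the ISL version of this exercise (0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75); these values are not reproduced in the notes above:

```python
# Majority vote vs. average probability over the ten bagged estimates.
p = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

votes_red = sum(pi > 0.5 for pi in p)              # 6 of 10 trees say Red
majority = "Red" if votes_red > len(p) / 2 else "Green"

avg = sum(p) / len(p)                              # 0.45
by_average = "Red" if avg > 0.5 else "Green"

print(majority, by_average)  # Red Green -- the two approaches disagree
```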
3. Provide a detailed explanation of the algorithm that is used to fit a regression tree.
a. Step 1: Recursive binary splitting
i. A decision tree divides the predictor space into J distinct, non-overlapping
regions (see 4b above).
ii. But how do we decide where to place the splits?
iii. One approach might be to consider all possible combinations of splits across
all predictors, but this isn't likely to be tractable because the number of
possible combinations of splits is extremely large for any moderately sized
predictor space.
iv. To get around this problem a top-down, greedy approach called recursive
binary splitting is used. This reduces the search space by only considering
the optimal split at each step. The Residual Sum of Squares (RSS) is used to
measure optimality: at each step we search for the single additional split in
predictor space that most reduces RSS. This process is repeated recursively,
with all sub-regions of all features considered at each step, and with splits
from previous steps being maintained (i.e. you can't split across a
pre-existing split).
v. As this process is repeated through recursion, an additional split and
resultant region is added at each step. The regions at the bottom of the tree
with no subsequent splits are called leaf nodes or leaves. The process is
repeated until the number of observations in each leaf falls below some
threshold t.
vi. In the regression setting, the estimated value for an observation that falls
into a given leaf node is the mean response value of all training observations
in that leaf.
b. Step 2: Cost complexity pruning
i. If the threshold t is small then the above step will result in a large,
complex tree that is optimised with respect to the training set. The tree will
likely have fitted variation in the training set that is not representative of the
true population – it has overfit the training data and this will impede its
predictive power on new observations.
13 | 20
iii. One approach to mitigate overfitting might be to continue step one only so
long as the reduction in RSS at each step exceeds some threshold. The
problem with this is that a very valuable split might follow a low-value
split that doesn't exceed the threshold, in which case tree growth is halted
prematurely. I'm not sure if this situation occurs often in practice, but the ISL
authors suggest so.
iv. Another approach is to grow a very large tree that overfits the training set,
and then prune it back by removing splits. How do we choose which splits to
remove?
v. This can be achieved by adding a penalty to the RSS cost function that
penalises a higher number of terminal nodes in the tree. The penalised cost
is given below.
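The equation itself is not reproduced in these notes; presumably it is ISL's cost-complexity criterion (Eq. 8.4), where |T| is the number of terminal nodes of subtree T, R_m is the region of the m-th terminal node, and ŷ_{R_m} is its mean training response:

```latex
\sum_{m=1}^{|T|} \sum_{i:\,x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^{2} + \alpha\,|T|
```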
c. Step 3: Choose α using cross-validation
i. We can now use cross-validation over some range of α to choose the optimal
size of tree. Each α results in a subtree of the full tree, which we test using
cross-validation to estimate its test score. More generally this is an
optimisation of the bias-variance tradeoff.
d. Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α
from Step 3
e. Finally we choose the subtree (value of α) that produces the best cross-validation
score and use this for prediction (see the sketch below).
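A sketch of steps 1–4 with scikit-learn (the dataset is a placeholder; scikit-learn exposes the pruning path directly via cost_complexity_pruning_path):

```python
# Grow a large tree, enumerate its cost-complexity pruning path,
# choose alpha by cross-validation, then refit the chosen subtree.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

# Steps 1-2: full tree and its candidate alphas
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Step 3: cross-validate each candidate subtree
cv_scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]

# Step 4: refit with the best alpha
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha,
                                   random_state=0).fit(X, y)
print("chosen alpha:", best_alpha, "leaves:", final_tree.get_n_leaves())
```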
Chap 9: SVM
1. With a lower cost parameter the model uses more support vectors because the margin is
wider.
2. A slight movement of the seventh observation would not affect the maximal margin
hyperplane.
3. Use cross-validation to tune the cost parameter (see the sketch at the end of this section).
4. Using cost=1 we misclassify a training observation, but there is a wider margin and so
there are more support vectors. It seems likely this model will be less affected by
overfitting than when cost=1e5.
5. We can see from the figure that there are a fair number of training errors in this SVM fit. If
we increase the value of cost, we can reduce the number of training errors. However, this
comes at the price of a more irregular decision boundary that seems to be at risk of
overfitting the data.
6. The best parameters are {'C': 1.0, 'gamma': 1.0}, with a score of
0.929
a. Therefore, the best choice of parameters found involves cost=1 and gamma=1.
We can view the test set predictions for this model by applying the predict()
function to the data.
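A sketch of the cross-validated tuning referred to in items 3 and 6 (the data and parameter grid are assumptions; only the grid-search pattern is the point):

```python
# Tune the SVM cost (C) and gamma parameters by cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.5, 1.0, 2.0]},
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_,
      "score:", round(grid.best_score_, 3))
y_pred = grid.best_estimator_.predict(X)  # test-set predictions via predict()
```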
Chap 11 (Book 3)
o Max pooling layers introduce invariance to small translations and distortions
in the input, making the network more robust to variations in the position or
size of the detected features.
o Convolutional layers with the same stride would achieve similar
downsampling, but they introduce learnable parameters and compute a
weighted combination over the receptive field rather than taking the
maximum, so they do not provide the same invariance properties.
5. Local response normalization layer:
o Local response normalization (LRN) layers were previously used to provide a
form of lateral inhibition, where the activity of neurons in one channel is
inhibited relative to nearby channels.
o LRN layers were mainly used in early CNN architectures, such as AlexNet, to
promote competition between neighboring feature maps and enhance the
overall discrimination power of the network.
o However, recent research has shown that the effectiveness of LRN layers is
limited, and they are not widely used in modern architectures. Batch
normalization has become the preferred normalization technique.
6. Innovations in AlexNet, GoogLeNet, ResNet, SENet, and Xception:
o AlexNet: Introduced the concept of deep convolutional neural networks for
image classification, used rectified linear activation units (ReLU), utilized
GPUs for efficient training, and won the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) in 2012.
o GoogLeNet (Inception): Introduced the inception module, which applies 1x1,
3x3, and 5x5 convolutions in parallel and concatenates their outputs, reducing
the computational complexity and improving the network's ability to capture
features at different scales.
o ResNet: Introduced residual connections, allowing the network to learn
residual mappings and alleviate the vanishing gradient problem, enabling the
training of very deep networks (e.g., 50, 101, or 152 layers).
o SENet (Squeeze-and-Excitation Network): Introduced the concept of channel-
wise feature recalibration using global information to adaptively rescale
feature maps, enhancing the discriminative power of the network.
o Xception: Introduced the concept of depthwise separable convolutions, which
decouple spatial and channel convolutions, reducing the number of parameters
and computations while maintaining model accuracy.
7. Fully Convolutional Network (FCN):
o A Fully Convolutional Network is a type of neural network architecture
specifically designed for semantic segmentation tasks, where pixel-level
predictions are made for an input image.
o The main difference between a fully connected dense layer and a
convolutional layer is the connectivity pattern and shared weights.
o To convert a dense layer into a convolutional layer, replace it with a
convolutional layer whose kernel has the same spatial dimensions as its input
feature map, with one filter per neuron in the dense layer and "valid" padding;
the dense layer's weights are reshaped into these kernels.
o Applied to an input of the original size, this produces a 1x1 output feature
map per filter; unlike a dense layer, the network can then also process larger
images, producing correspondingly larger output maps.
8. The main technical difficulty of semantic segmentation:
o Semantic segmentation involves assigning a class label to each pixel in an
image, which requires dense predictions and fine-grained localization.
o The main difficulty lies in capturing and understanding the fine details and
boundaries of objects in the image.
o Challenges include handling class imbalance, dealing with varying object
scales, accurately segmenting objects with intricate shapes or occlusions, and
maintaining spatial consistency across different scales.
9. Building a CNN from scratch for MNIST classification:
o Here's a simple example of a CNN architecture for MNIST classification:
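The code itself did not survive in these notes; below is a minimal Keras reconstruction (the exact architecture is an assumption, a standard small CNN rather than necessarily the original):

```python
# A simple CNN for MNIST classification in Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr = x_tr[..., None] / 255.0  # add channel dim, scale to [0, 1]
x_te = x_te[..., None] / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=5, validation_split=0.1)
print(model.evaluate(x_te, y_te))
```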
Exam:
To calculate the number of parameters for all of page , we need to know the number of parameters
for each layer and then add them up. Here are the calculations:
- Convolutional layer 1: 64 filters of size 3x3 with 1 channel each, plus 1 bias term per filter, for a total of (3x3x1x64) + 64 = 640 parameters.
- Max pooling layer 1: no parameters.
- Convolutional layer 2: 64 filters of size 3x3 with 64 channels each, plus 1 bias term per filter, for a total of (3x3x64x64) + 64 = 36,928 parameters.
- Max pooling layer 2: no parameters.
- Flatten layer: no parameters.
- Fully connected layer: 3,136 input neurons and 128 output neurons, plus 1 bias term per output neuron, for a total of (3136x128) + 128 = 401,536 parameters.
Adding up all of these parameters, we get a total of 640 + 36,928 + 401,536 = 439,104 parameters
for page .
Solution 2 (contradicts Solution 1)
To calculate the number of parameters for page , we need to count the number of parameters in
each layer and add them up. Here is the breakdown:
- Convolutional layer 1: 32 filters of size 3x3x1 = 288 parameters (bias terms omitted)
- Max pooling layer 1: 2x2 kernel with stride 2 = 0 parameters (no learnable parameters)
- Convolutional layer 2: 64 filters of size 3x3x32 = 18,432 parameters (bias terms omitted)
- Max pooling layer 2: 2x2 kernel with stride 2 = 0 parameters
- Flatten layer: no parameters
- Fully connected layer 1: 1,792 input neurons and 128 output neurons = 1,792 x 128 = 229,376 parameters (bias terms omitted)
- Fully connected layer 2: 128 input neurons and 10 output neurons = (128 + 1) x 10 = 1,290 parameters
Adding up all the parameters, we get a total of 249,386 parameters for page
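As a sanity check on Solution 1's layer-by-layer counts, a hedged Keras sketch (the 28x28x1 input and "same" padding are assumptions, chosen so the flatten size comes out at 7x7x64 = 3,136):

```python
# Rebuild the Solution-1 architecture and let Keras count the parameters.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, 3, padding="same", activation="relu"),  # 640
    layers.MaxPooling2D(),                                    # 0
    layers.Conv2D(64, 3, padding="same", activation="relu"),  # 36,928
    layers.MaxPooling2D(),                                    # 0
    layers.Flatten(),                                         # 7*7*64 = 3,136
    layers.Dense(128),                                        # 401,536
])
model.summary()  # total parameters: 439,104
```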