HW_02
Number of *'s ~ difficulty of problem, where * means easy and *** means difficult.
*Q1: Consider the perceptron in two dimensions: h(x) = sign(w^T x), where w = [w0, w1, w2]^T
and x = [1, x1, x2]^T. Technically, x has three coordinates, but we call this perceptron two-
dimensional because the first coordinate is fixed at 1.
(a) Show that the regions on the plane where h(x) = +1 and h(x) = -1 are separated by a line. If
we express this line by the equation x2 = ax1 + b, what are the slope a and intercept b in terms
of w0, w1, w2?
(b) Draw a picture for the cases w = [1, 2, 3]^T and w = -[1, 2, 3]^T. In more than two dimensions,
the +1 and -1 regions are separated by a hyperplane, the generalization of a line.
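A minimal plotting sketch for part (b), assuming numpy and matplotlib are available; it shades the +1 and -1 regions of h(x) over a grid, which makes the separating line visible without giving away the algebra of part (a):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_regions(w, ax):
        # Shade the +1 / -1 regions of h(x) = sign(w0 + w1*x1 + w2*x2)
        x1, x2 = np.meshgrid(np.linspace(-5, 5, 400), np.linspace(-5, 5, 400))
        h = np.sign(w[0] + w[1] * x1 + w[2] * x2)
        ax.contourf(x1, x2, h, levels=[-1.5, 0, 1.5], colors=["lightblue", "salmon"])
        ax.set_title(f"w = {w}")
        ax.set_xlabel("x1")
        ax.set_ylabel("x2")

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    plot_regions([1, 2, 3], axes[0])
    plot_regions([-1, -2, -3], axes[1])
    plt.tight_layout()
    plt.show()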
**Q2: Suppose we have features x ∈ R^p and a two-class response, class-1 and class-2, with
class sizes N1 and N2, and the classes coded as -N/N1 and N/N2, where N = N1 + N2. Show
that the LDA rule classifies a query point x to class-2 if
x^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} (\hat{\mu}_2 + \hat{\mu}_1)^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) - \log\left(\frac{N_2}{N_1}\right)
and class-1 otherwise.
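Hint: a natural starting point is the pair of estimated LDA discriminant functions, with priors \hat{\pi}_k = N_k / N; the rule above follows from the condition \delta_2(x) > \delta_1(x):

\delta_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k, \qquad k = 1, 2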
***Q3: You are designing a spam classifier for an email service. The classifier analyzes an
email for the words "WIN" and "MONEY" and decides whether the email is spam or not.
Based on historical data, the following probabilities are known:
If the email is spam, the probability that it contains the word "WIN" is 0.8.
If the email is not spam, the probability that it contains the word "WIN" is 0.1.
If the email is spam, the probability that it contains the word "MONEY" is 0.7.
If the email is not spam, the probability that it contains the word "MONEY" is 0.2.
If the email is spam, the probability that it contains both "WIN" and "MONEY" is 0.5.
If the email is not spam, the probability that it contains both "WIN" & "MONEY" is 0.05.
Compute:
(a) The marginal probability that an email contains both "WIN" and "MONEY", i.e., P(WIN
∩ MONEY).
(b) If an email contains both "WIN" and "MONEY", calculate the probability that it is spam,
i.e., P(Spam | WIN ∩ MONEY)
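Note that both parts also need the prior probability that an email is spam, which the statement above does not give. The sketch below uses P(Spam) = 0.4 purely as a placeholder assumption (substitute the prior from your course material); part (a) is the law of total probability and part (b) is Bayes' theorem:

    # Placeholder prior: P(Spam) is NOT given in the problem statement
    p_spam = 0.4
    p_both_given_spam = 0.5        # P(WIN and MONEY | Spam)
    p_both_given_not_spam = 0.05   # P(WIN and MONEY | Not Spam)

    # (a) Law of total probability:
    # P(WIN and MONEY) = P(both | Spam) P(Spam) + P(both | Not Spam) P(Not Spam)
    p_both = p_both_given_spam * p_spam + p_both_given_not_spam * (1 - p_spam)

    # (b) Bayes' theorem:
    # P(Spam | WIN and MONEY) = P(both | Spam) P(Spam) / P(WIN and MONEY)
    p_spam_given_both = p_both_given_spam * p_spam / p_both

    print(f"P(WIN and MONEY) = {p_both:.3f}")
    print(f"P(Spam | WIN and MONEY) = {p_spam_given_both:.3f}")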
*Q4: A company is trying to classify a new customer as either “High Spender” or “Low
Spender” based on the amount spent in the last two months. The dataset contains the following
points:
Training Data:
Customer | Amount spent (month 1) | Amount spent (month 2) | Class
1        | 50                     | 60                     | Low Spender
2        | 45                     | 55                     | Low Spender
5        | 55                     | 65                     | Low Spender
(a) Use the Euclidean distance formula to calculate the distance of a new customer with
attributes:
Amount spent (month 1) = 60, Amount spent (month 2) = 70
from all the training points.
(b) If k=3, classify the new customer as either “High Spender” or “Low Spender” based on
the majority vote of the nearest neighbours.
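A minimal sketch of part (a) with numpy, using the three training rows that appear in the table above:

    import numpy as np

    # Training points from the table: (month-1 spend, month-2 spend) and class
    X_train = np.array([[50, 60], [45, 55], [55, 65]])
    labels = ["Low Spender", "Low Spender", "Low Spender"]
    query = np.array([60, 70])  # the new customer

    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - query, axis=1)
    for d, lab in sorted(zip(dists, labels)):
        print(f"distance = {d:.2f} -> {lab}")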
***Q5: The Perceptron algorithm updates its weights whenever it encounters a misclassified
point. The sequence in which data points are presented can influence the updates and,
potentially, the total number of mistakes made during training.
Your task is to investigate this effect:
(a) Implement the Perceptron algorithm for a binary classification problem where
y ∈ {−1, 1} (a sketch covering (a)-(c) follows this question).
o Initialize the weight vector w and bias b to zero.
o Iterate through the dataset for a maximum of 1000 epochs or until convergence.
o Record the total number of mistakes made during training.
(b) Generate a synthetic 2D dataset which is linearly separable. Run the perceptron
algorithm on this data and print the number of mistakes. (Read about make_blobs from
sklearn to generate the dataset)
(c) Shuffle the dataset into 5 random permutations and run the Perceptron algorithm on
each permutation. For each permutation, record the number of mistakes made by the
algorithm.
(d) Are these mistakes consistent or do they change with the order of data presented? What
can you conclude?
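A minimal sketch for parts (a)-(c), assuming numpy and scikit-learn; the blob spread and the seeds are illustrative choices, and you should verify (e.g., by plotting) that the generated data really is linearly separable:

    import numpy as np
    from sklearn.datasets import make_blobs

    def perceptron(X, y, max_epochs=1000):
        # (a) zero-initialized weights and bias; returns (w, b, total mistakes)
        w, b, mistakes = np.zeros(X.shape[1]), 0.0, 0
        for _ in range(max_epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:  # misclassified point
                    w += yi * xi
                    b += yi
                    errors += 1
            mistakes += errors
            if errors == 0:  # converged: a full pass with no updates
                break
        return w, b, mistakes

    # (b) linearly separable 2D data; labels remapped from {0, 1} to {-1, +1}
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=0.8, random_state=0)
    y = 2 * y - 1
    print("mistakes:", perceptron(X, y)[2])

    # (c) five random permutations of the same dataset
    rng = np.random.default_rng(0)
    for i in range(5):
        idx = rng.permutation(len(X))
        print(f"permutation {i + 1}: {perceptron(X[idx], y[idx])[2]} mistakes")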
***Q6: Heads up: this question is a long read, but it covers all the steps of approaching an ML
problem, so stick with it to the end. It will be fun :)
k Nearest Neighbours, or kNN, is the simplest of all machine learning algorithms. It simply
calculates the distance between a sample data point and all the training data points. Then,
it selects the k nearest data points, where k can be any integer. Finally, it assigns the sample
data point to the class to which the majority of the k data points belong. For this problem, you
will be using the UCI Breast Cancer Wisconsin (Original) dataset (PFA). Its first 10 columns
are features and the last (11th) column is the class of breast cancer: Benign (2) /
Malignant (4). Feel free to use sklearn and pandas.
Exploratory Data Analysis (EDA) is an essential part of any data science problem as it
provides us insights about the data.
(a) Statistically analyse the dataset by computing values such as the mean, median, standard
deviation, count, minimum, and maximum.
(b) Check if all the features are numerical values, as we will be finding the distance between
numerical features only. Try to convert the non-numerical values to numerical ones.
(c) Plot the frequency distribution of the 10 features as subplots in a single plot.
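A minimal EDA sketch for (a)-(c) with pandas and matplotlib; the file name is a placeholder, and the assumption that the features are the first 10 columns comes from the problem statement, so adjust both to the attached file:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("breast_cancer_wisconsin.csv")  # placeholder file name

    # (b) inspect dtypes; coerce non-numerical entries (e.g. '?') to NaN
    print(df.dtypes)
    feature_cols = df.columns[:10]
    df[feature_cols] = df[feature_cols].apply(pd.to_numeric, errors="coerce")

    # (a) count, mean, std, min, quartiles (incl. median), max per column
    print(df.describe())

    # (c) frequency distribution of the 10 features as subplots in one figure
    fig, axes = plt.subplots(2, 5, figsize=(16, 6))
    for ax, col in zip(axes.ravel(), feature_cols):
        df[col].hist(ax=ax, bins=20)
        ax.set_title(col)
    plt.tight_layout()
    plt.show()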
Many times the training dataset is not refined, and it is our job to take care of that before
training the model. Null values cause problems in model training because they can be
misleading. In these cases, feature engineering comes into play.
(d) Try to find if there are any null values in the columns.
(e) If you find there are null values, then try to fill those with the median of that feature, as
it can be a good approximation to start training.
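A minimal sketch for (d) and (e), continuing from the DataFrame above:

    # (d) count null values per column
    print(df.isnull().sum())

    # (e) fill nulls with the median of each feature column
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())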
Dataset generation will be the next step, i.e., the division of the dataset into training and testing
sets. This is done to get an idea of model accuracy while keeping in mind the problem of overfitting.
(f) Divide the dataset into training and testing sets. Make sure to set a random state so you don't
get different data and results each time you run the notebook. (A sketch covering (f)-(k) follows this list.)
As the features are on different scales, we want to make sure that each of them contributes
equally. Standard scaling is a method that transforms all the training columns
to have a mean of 0 and a standard deviation of 1.
(g) Perform standard scaling over all the features.
(h) Train the model using different values of K.
(i) Find the training and test errors and accuracies from the model predictions, and plot the
errors as K varies.
(j) Find the K value where the error is the least.
(k) Plot a confusion matrix of the final model.
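A minimal end-to-end sketch for (f)-(k), continuing from the DataFrame above and assuming the class column is the last one; the K range, test fraction, and seed are illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import ConfusionMatrixDisplay

    X = df[feature_cols].values
    y = df[df.columns[-1]].values  # class: Benign (2) / Malignant (4)

    # (f) fixed random_state for a reproducible split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # (g) zero mean / unit variance; fit the scaler on the training data only
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # (h)-(i) train for a range of K and record training and test errors
    ks = range(1, 26)
    train_err, test_err = [], []
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        train_err.append(1 - knn.score(X_train, y_train))
        test_err.append(1 - knn.score(X_test, y_test))

    plt.plot(ks, train_err, label="train error")
    plt.plot(ks, test_err, label="test error")
    plt.xlabel("K"); plt.ylabel("error"); plt.legend(); plt.show()

    # (j) K with the lowest test error
    best_k = ks[int(np.argmin(test_err))]
    print("best K:", best_k)

    # (k) confusion matrix of the final model
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    ConfusionMatrixDisplay.from_estimator(final, X_test, y_test)
    plt.show()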
**Q7: In this question we will be using a Stock Market dataset (provided to you). The dataset
contains 1089 weekly returns for 21 years from 1990 to 2010.
(a) Provide some graphical summaries of the provided dataset; if you observe any patterns,
kindly report them.
(b) Fit an LDA model using the training data from 1990 to 2007, with 'Lag2' as the
only predictor. Compute the confusion matrix and the fraction of correct predictions on the
test data (from 2008 to 2010). (A sketch follows this question.)
(c) Experiment with different predictors and compute the confusion matrix for each.
(d) (Optional) Repeat (b) for KNN and Naïve Bayes.
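A minimal sketch for part (b); the file name and the column names ('Year', 'Lag2', 'Direction') are assumptions based on the usual weekly stock-market dataset, so adjust them to the provided file:

    import pandas as pd
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import confusion_matrix, accuracy_score

    weekly = pd.read_csv("Weekly.csv")  # placeholder file name
    train = weekly["Year"] <= 2007      # 1990-2007 train, 2008-2010 test

    X_train, y_train = weekly.loc[train, ["Lag2"]], weekly.loc[train, "Direction"]
    X_test, y_test = weekly.loc[~train, ["Lag2"]], weekly.loc[~train, "Direction"]

    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    pred = lda.predict(X_test)
    print(confusion_matrix(y_test, pred))
    print("fraction of correct predictions:", accuracy_score(y_test, pred))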