MS_key-2
Instructions:
1. This question paper contains 2 pages (4 sides of paper). Please verify.
2. Write your name, roll number, department in block letters with ink on each page.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Don’t overwrite/scratch answers, especially in MCQs – ambiguous cases will get 0 marks.
Q1 (Optimal DT) Melbo has a multiclass problem with three classes +,×, □. There are 16 datapoints
in total, each with a 2D feature vector (𝑥, 𝑦). 𝑥, 𝑦 can take value 0 or 1. The table below describes
each data point. All 16 points are at the root of a decision tree. Melbo wishes to learn a decision
stump based on the entropy reduction principle to split this node into two children. Help Melbo
finish this task. Hint: take logs to base 2 so there is no need for a calculator. (8 x 0.5 = 4 marks)
SNo Class (𝑥, 𝑦) SNo Class (𝑥, 𝑦) SNo Class (𝑥, 𝑦) SNo Class (𝑥, 𝑦)
1 + (0,1) 5 + (0,1) 9 × (1,0) 13 □ (1,0)
2 + (1,1) 6 + (0,1) 10 × (1,0) 14 □ (0,0)
3 + (0,1) 7 + (1,1) 11 × (0,0) 15 □ (1,0)
4 + (1,1) 8 + (1,1) 12 × (0,0) 16 □ (0,0)
What is the entropy of the two child nodes (give answers for the two nodes separately) if the split is done using the 𝑦 feature (𝑦 = 0 becomes left child, 𝑦 = 1 becomes right child)?
Left child (𝑦 = 0): class proportions are (0, 1⁄2, 1⁄2), i.e., 𝐻left = 1.
Right child (𝑦 = 1): class proportions are (1, 0, 0), i.e., 𝐻right = 0.
What is the reduction in entropy (i.e., 𝐻root − 𝐻children) if the split is done using the 𝑦 feature?
Since each child receives 8 of the 16 points, 𝐻root − ½(𝐻left + 𝐻right) = 1.5 − ½(1 + 0) = 1.
To get the most entropy reduction, should we split using the 𝑥 feature or the 𝑦 feature?
We should split using the 𝑦 feature.
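As a sanity check (not part of the model answer), the entropy computations above can be reproduced with a short Python snippet; the labels '+', 'x', 's' are just placeholder encodings for the three classes +, ×, □:

```python
import math

# The 16 data points from the table as (class, x, y) tuples, in SNo order
data = [('+',0,1),('+',1,1),('+',0,1),('+',1,1),
        ('+',0,1),('+',0,1),('+',1,1),('+',1,1),
        ('x',1,0),('x',1,0),('x',0,0),('x',0,0),
        ('s',1,0),('s',0,0),('s',1,0),('s',0,0)]

def entropy(points):
    """Entropy (base 2) of the class distribution of a list of points."""
    n = len(points)
    counts = {}
    for c, _, _ in points:
        counts[c] = counts.get(c, 0) + 1
    return -sum(k/n * math.log2(k/n) for k in counts.values())

def reduction(points, feat):
    """Entropy reduction when splitting on feature index feat (1 for x, 2 for y)."""
    left  = [p for p in points if p[feat] == 0]
    right = [p for p in points if p[feat] == 1]
    h_children = (len(left)*entropy(left) + len(right)*entropy(right)) / len(points)
    return entropy(points) - h_children
```

Running `reduction(data, 2)` (the 𝑦 split) gives 1, while `reduction(data, 1)` (the 𝑥 split) gives 0, confirming that 𝑦 is the better stump.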
Q2. Write T or F for True/False in the box. Also, give justification. (4 x (1+3) = 16 marks)
1. Recall that ‖𝐯‖0 is the number of non-zero coordinates of the vector 𝐯. Then for any two vectors 𝐚, 𝐛 ∈ ℝ3, we always have ‖𝐚 + 𝐛‖0 ≤ ‖𝐚‖0 + ‖𝐛‖0. If true, give a brief proof, else give a counterexample of two 3D vectors that violate this inequality. Show brief calculations in either case. (Answer: T)
Note that if 𝑎𝑖 = 0 = 𝑏𝑖 for some 𝑖 ∈ [𝑑], then 𝑎𝑖 + 𝑏𝑖 = 0. This means that if 𝑎𝑖 + 𝑏𝑖 ≠ 0, then
either 𝑎𝑖 or 𝑏𝑖 must be non-zero. Another way of saying this is to write
𝕀{𝑎𝑖 + 𝑏𝑖 ≠ 0} ≤ 𝕀{𝑎𝑖 ≠ 0} + 𝕀{𝑏𝑖 ≠ 0}
where 𝕀{𝐴} = 0 if 𝐴 is false and 𝕀{𝐴} = 1 if 𝐴 is true is the indicator function for a statement 𝐴. Taking sums over 𝑖 ∈ [𝑑] gives us ∑𝑖∈[𝑑] 𝕀{𝑎𝑖 + 𝑏𝑖 ≠ 0} ≤ ∑𝑖∈[𝑑] 𝕀{𝑎𝑖 ≠ 0} + ∑𝑖∈[𝑑] 𝕀{𝑏𝑖 ≠ 0}. However, for any vector 𝐯, we have ‖𝐯‖0 = ∑𝑖∈[𝑑] 𝕀{𝑣𝑖 ≠ 0}. This tells us ‖𝐚 + 𝐛‖0 ≤ ‖𝐚‖0 + ‖𝐛‖0.
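The inequality can also be spot-checked numerically (an informal check, not a substitute for the proof); note the cancelling pair where the inequality is strict:

```python
def l0(v):
    """Number of non-zero coordinates of v, i.e., ||v||_0."""
    return sum(1 for x in v if x != 0)

# A few 3D spot checks of ||a + b||_0 <= ||a||_0 + ||b||_0,
# including a cancelling pair where the inequality is strict
pairs = [([1, 0, 2], [0, 3, 0]), ([1, -1, 0], [-1, 1, 0]), ([0, 0, 0], [5, 6, 7])]
for a, b in pairs:
    s = [ai + bi for ai, bi in zip(a, b)]
    assert l0(s) <= l0(a) + l0(b)
```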
2. The function 𝑁(𝑝) ≝ 𝑝 ln 𝑝 is convex over the interval 𝑝 ∈ (0, ∞). If true, give a proof via the chord definition, tangent definition, or second-derivative definition of convex functions. If false, give a counterexample that violates any one definition. (Answer: T)
To use the tangent definition, we have to show that for any 𝑝, 𝑞 ∈ (0, ∞), we always have
𝑁(𝑞) ≥ 𝑁(𝑝) + 𝑁 ′ (𝑝)(𝑞 − 𝑝)
Since 𝑁′(𝑝) = 1 + ln 𝑝, expanding the right-hand side gives 𝑁(𝑝) + 𝑁′(𝑝)(𝑞 − 𝑝) = 𝑝 ln 𝑝 + (1 + ln 𝑝)(𝑞 − 𝑝) = 𝑞 ln 𝑝 + 𝑞 − 𝑝, so the above requirement reduces to showing
𝑞 ln 𝑞 − 𝑞 ≥ 𝑞 ln 𝑝 − 𝑝
Consider the function 𝑓(𝑥) ≝ 𝑞 ln 𝑥 − 𝑥. We have 𝑓′(𝑥) = 𝑞⁄𝑥 − 1 and 𝑓″(𝑥) = −𝑞⁄𝑥². Thus, 𝑓′(𝑞) = 0 and 𝑓″(𝑥) < 0 for all 𝑥 > 0, i.e., 𝑞 is the global maximum of 𝑓. This shows that 𝑓(𝑞) ≥ 𝑓(𝑝), or in other words, 𝑞 ln 𝑞 − 𝑞 ≥ 𝑞 ln 𝑝 − 𝑝, which proves that the function 𝑁(⋅) is convex.
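The tangent-definition inequality can be verified numerically over a grid of positive points (an informal check of the claim, not part of the model answer):

```python
import math

def N(p):  return p * math.log(p)   # N(p) = p ln p
def dN(p): return 1 + math.log(p)   # N'(p) = 1 + ln p

# Tangent-definition check: N(q) >= N(p) + N'(p) * (q - p) for p, q > 0
grid = [0.1 * k for k in range(1, 51)]   # p, q in {0.1, 0.2, ..., 5.0}
ok = all(N(q) >= N(p) + dN(p) * (q - p) - 1e-12 for p in grid for q in grid)
```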
3. The optimum for argmin𝐰∈ℝ𝑑 ‖𝐰 − 𝐰0‖2² + ‖𝑋𝐰 − 𝐲‖2² is always achieved at 𝐰0. Justify by deriving the optimum. 𝑋 ∈ ℝ𝑛×𝑑, 𝐲 ∈ ℝ𝑛, 𝐰0 ∈ ℝ𝑑 are all constants. (Answer: F)
As this is an unconstrained problem with a differentiable objective, first-order optimality tells us that the gradient must vanish at the optimum. This means 2(𝐰 − 𝐰0) + 2𝑋⊤(𝑋𝐰 − 𝐲) = 𝟎, i.e., 𝐰 = (𝑋⊤𝑋 + 𝐼)−1(𝐰0 + 𝑋⊤𝐲). This means that the optimum is not always 𝐰0 (it equals 𝐰0 only in special cases, e.g., when 𝑋⊤(𝑋𝐰0 − 𝐲) = 𝟎).
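The closed form can be checked numerically on random data (an illustration, not part of the model answer; the problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X  = rng.standard_normal((n, d))
y  = rng.standard_normal(n)
w0 = rng.standard_normal(d)

# Closed-form optimum from first-order optimality:
# (w - w0) + X^T (X w - y) = 0  =>  w* = (X^T X + I)^{-1} (w0 + X^T y)
w_star = np.linalg.solve(X.T @ X + np.eye(d), w0 + X.T @ y)

def obj(w):
    return np.sum((w - w0)**2) + np.sum((X @ w - y)**2)

grad = (w_star - w0) + X.T @ (X @ w_star - y)   # half-gradient; vanishes at w*
```

Generically `w_star` differs from `w0` and achieves a strictly smaller objective, confirming the answer F.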
4. For any two vectors 𝐮, 𝐯 ∈ ℝ3, we always have 𝐮⊤𝐯 ≤ (‖𝐮‖2² + ‖𝐯‖2²)⁄2. If true, give a brief proof, else give a counterexample and calculations with two 3D vectors. (Answer: T)
By the Cauchy–Schwarz inequality, we have 𝐮⊤𝐯 ≤ ‖𝐮‖2 ⋅ ‖𝐯‖2. However, we always have
‖𝐮‖2 ⋅ ‖𝐯‖2 ≤ (‖𝐮‖22 + ‖𝐯‖22 )⁄2
since (‖𝐮‖2 − ‖𝐯‖2 )2 ≥ 0. This completes the proof.
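A quick randomized check of the inequality on 3D vectors (informal, not part of the model answer):

```python
import random

random.seed(0)
for _ in range(100):
    u = [random.uniform(-5, 5) for _ in range(3)]
    v = [random.uniform(-5, 5) for _ in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    sq  = sum(x * x for x in u) + sum(x * x for x in v)
    # u.v = (||u||^2 + ||v||^2 - ||u - v||^2) / 2, so the bound holds
    assert dot <= sq / 2 + 1e-12
```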
CS 771A: Intro to Machine Learning, IIT Kanpur Midsem Exam (18 Jun 2023), 40 marks
Q3 (Absolute Tilt) Consider the optimization problem min𝑥∈ℝ 𝑓(𝑥) with objective 𝑓: ℝ → ℝ defined as 𝑓(𝑥) ≝ |𝑥| + 𝑎 ⋅ 𝑥 where 𝑎 ∈ ℝ is a constant (may be positive/negative/zero). Find the point 𝑥∗ at which
the optimum is achieved and 𝑓(𝑥 ∗ ). Note: 𝑥 ∗ and 𝑓(𝑥 ∗ ) depend on 𝑎. Both 𝑥 ∗ , 𝑓(𝑥 ∗ ) can be ∞ or
−∞ for certain cases. You must tell us for each possible case, where is the optimum achieved i.e.,
𝑥 ∗ and what is 𝑓(𝑥 ∗ ). E.g., you might say that case 1 is 𝑎 < 1, in which case we get 𝑓(𝑥 ∗ ) = 1 at
𝑥 ∗ = 0.5, and case 2 is 𝑎 ≥ 1, in which case we get 𝑓(𝑥 ∗ ) = −1 at 𝑥 ∗ = ∞. You may use at most
3 cases to describe your solution. If you don’t need those many cases, leave cases blank. Give brief
derivations. Hint: you should not have to derive the dual to solve this problem. (8 marks)
Case 1: Condition |𝑎| ≤ 1. Optimum reached at 𝑥∗ = 0. Optimal objective value 𝑓(𝑥∗) = 0.
Case 2: Condition |𝑎| > 1. Optimum reached at 𝑥∗ = −∞ (when 𝑎 > 1, since 𝑓(𝑥) = (𝑎 − 1)𝑥 for 𝑥 < 0) or 𝑥∗ = +∞ (when 𝑎 < −1, since 𝑓(𝑥) = (1 + 𝑎)𝑥 for 𝑥 > 0). Optimal objective value 𝑓(𝑥∗) = −∞.
Case 3: (not needed)
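The two cases can be illustrated numerically (an informal check, not part of the model answer):

```python
def f(x, a):
    """The tilted absolute value f(x) = |x| + a*x."""
    return abs(x) + a * x

# Case 1 (|a| <= 1): no grid point beats x* = 0, where f(0) = 0
xs = [k / 10 for k in range(-100, 101)]          # -10.0, -9.9, ..., 10.0
for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    assert all(f(x, a) >= f(0, a) for x in xs)

# Case 2 (|a| > 1): f decreases without bound in one direction
assert f(-1e6, 2.0) < -1e5    # a > 1: f(x) = (a - 1) * x -> -inf as x -> -inf
assert f(1e6, -2.0) < -1e5    # a < -1: f(x) = (1 + a) * x -> -inf as x -> +inf
```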
Q5 (CM to the rescue) Consider the following problem where 𝐚, 𝐛, 𝐜 ∈ ℝ𝑑 and 𝜆 ∈ ℝ are constants
and 𝜆 > 0. Design a coordinate minimization algorithm (choose coordinates cyclically) to solve the
primal. Give brief calculations on how you will create a simplified unidimensional problem for a
chosen coordinate 𝑖 ∈ [𝑑] and then show how to get the optimal value of 𝑥𝑖 . (7 marks)
min𝐱∈ℝ𝑑 ½‖𝐱‖2² + 𝐚⊤𝐱
s.t. 𝐛⊤𝐱 ≤ 𝜆
     𝐜⊤𝐱 ≤ 𝜆
Suppose cyclic coordinate choice has presented a coordinate 𝑖 ∈ [𝑑]. We extract portions of the
optimization problem that depend on 𝑥𝑖 (and treat 𝑥𝑗 , 𝑗 ≠ 𝑖 as constants) to get the following
simplified 1D optimization problem (where 𝑚𝑖 ≝ 𝜆 − ∑𝑗≠𝑖 𝑏𝑗 𝑥𝑗 and 𝑛𝑖 ≝ 𝜆 − ∑𝑗≠𝑖 𝑐𝑗 𝑥𝑗 ).
min𝑥𝑖∈ℝ ½𝑥𝑖² + 𝑎𝑖𝑥𝑖
s.t. 𝑏𝑖𝑥𝑖 ≤ 𝑚𝑖, 𝑐𝑖𝑥𝑖 ≤ 𝑛𝑖
⇒ min𝑥𝑖∈ℝ ½𝑥𝑖² + 𝑎𝑖𝑥𝑖 s.t. 𝑥𝑖 ∈ [𝑙𝑖, 𝑢𝑖]
Now we clean up the constraints to get a single box constraint as shown below
Case: 𝑏𝑖 > 0, 𝑐𝑖 > 0 gives 𝑙𝑖 = −∞ and 𝑢𝑖 = min{𝑚𝑖⁄𝑏𝑖, 𝑛𝑖⁄𝑐𝑖}
Case: 𝑏𝑖 > 0, 𝑐𝑖 < 0 gives 𝑙𝑖 = 𝑛𝑖⁄𝑐𝑖 and 𝑢𝑖 = 𝑚𝑖⁄𝑏𝑖
Case: 𝑏𝑖 < 0, 𝑐𝑖 > 0 gives 𝑙𝑖 = 𝑚𝑖⁄𝑏𝑖 and 𝑢𝑖 = 𝑛𝑖⁄𝑐𝑖
Case: 𝑏𝑖 < 0, 𝑐𝑖 < 0 gives 𝑙𝑖 = max{𝑚𝑖⁄𝑏𝑖, 𝑛𝑖⁄𝑐𝑖} and 𝑢𝑖 = +∞
If either 𝑏𝑖 or 𝑐𝑖 is 0, that constraint is ignored since it no longer involves 𝑥𝑖 . Having converted the
pair of constraints into a single box constraint, we can now apply the QUIN trick to solve this
problem and obtain the optimal value 𝑥𝑖∗ in two simple steps:
1. Find the unconstrained minimum of ½𝑥𝑖² + 𝑎𝑖𝑥𝑖, which turns out to be 𝑧𝑖 ≝ −𝑎𝑖
2. If 𝑧𝑖 ∈ [𝑙𝑖 , 𝑢𝑖 ], then 𝑥𝑖∗ = 𝑧𝑖 else if 𝑧𝑖 < 𝑙𝑖 , 𝑥𝑖∗ = 𝑙𝑖 else 𝑥𝑖∗ = 𝑢𝑖
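The procedure described above can be sketched in Python (a minimal illustration of the described steps, not official solution code; the names `box_bounds` and `coord_min` are our own):

```python
import math

def box_bounds(bi, ci, mi, ni):
    """Convert b_i x_i <= m_i and c_i x_i <= n_i into a box [l_i, u_i].
    A zero coefficient means that constraint does not involve x_i and is ignored."""
    lo, hi = -math.inf, math.inf
    for coef, rhs in ((bi, mi), (ci, ni)):
        if coef > 0:
            hi = min(hi, rhs / coef)     # upper bound x_i <= rhs / coef
        elif coef < 0:
            lo = max(lo, rhs / coef)     # lower bound x_i >= rhs / coef
    return lo, hi

def coord_min(a, b, c, lam, n_passes=100):
    """Cyclic coordinate minimization for
    min (1/2)||x||^2 + a^T x  s.t.  b^T x <= lam, c^T x <= lam."""
    d = len(a)
    x = [0.0] * d                        # feasible start when lam > 0
    for _ in range(n_passes):
        for i in range(d):
            mi = lam - sum(b[j] * x[j] for j in range(d) if j != i)
            ni = lam - sum(c[j] * x[j] for j in range(d) if j != i)
            lo, hi = box_bounds(b[i], c[i], mi, ni)
            # Step 1-2 above: unconstrained minimum z_i = -a_i, clamped to [l_i, u_i]
            x[i] = min(max(-a[i], lo), hi)
    return x
```

For example, with 𝐚 = (3, −1, 2), 𝐛 = (1, 1, 1), 𝐜 = (1, 0, 0) and 𝜆 = 10, the constraints are inactive and the method recovers the unconstrained minimum 𝐱 = −𝐚.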