A Comprehensive Guide to Machine Learning

Soroush Nasiriany, Garrett Thomas, William Wang, Alex Yang, Jennifer Listgarten, Anant Sahai
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
November 18, 2019

About

CS 189 is the Machine Learning course at UC Berkeley. In this guide we have created a comprehensive course guide in order to share our knowledge with students and the general public, and hopefully draw the interest of students from other universities to Berkeley's Machine Learning curriculum.

This guide was started by CS 189 TAs Soroush Nasiriany and Garrett Thomas in Fall 2017, with the assistance of William Wang and Alex Yang. We owe gratitude to Professors Anant Sahai, Stella Yu, and Jennifer Listgarten, as this book is heavily inspired from their lectures. In addition, we are indebted to Professor Jonathan Shewchuk for his machine learning notes, from which we drew inspiration.

The latest version of this document can be found either at http://www.eecs189.org/ or http://snasiriany.me/cs189/. Please report any mistakes to the staff, and contact the authors if you wish to redistribute this document.

Notation

    Notation     Meaning
    R            set of real numbers
    R^n          set (vector space) of n-tuples of real numbers, endowed with the usual inner product
    R^{m x n}    set (vector space) of m-by-n matrices
    δ_ij         Kronecker delta, i.e. δ_ij = 1 if i = j, 0 otherwise
    ∇f(x)        gradient of the function f at x
    ∇²f(x)       Hessian of the function f at x
    p(X)         distribution of random variable X
    p(x)         probability density/mass function evaluated at x
    E[X]         expected value of random variable X
    Var(X)       variance of random variable X
    Cov(X, Y)    covariance of random variables X and Y

Other notes:
• Vectors and matrices are in bold (e.g. x, A). This is true for vectors in R^n as well as for vectors in general vector spaces. We generally use Greek letters for scalars and capital Roman letters for matrices and random variables.
• We assume that vectors are column vectors, i.e. that a vector in R^n can be interpreted as an n-by-1 matrix. As such, taking the transpose of a vector is well-defined (and produces a row vector, which is a 1-by-n matrix).

Contents

1 Regression I
  1.1 Ordinary Least Squares
  1.2 Ridge Regression
  1.3 Feature Engineering
  1.4 Hyperparameters and Validation
2 Regression II
  2.1 MLE and MAP for Regression (Part I)
  2.2 Bias-Variance Tradeoff
  2.3 Multivariate Gaussians
  2.4 MLE and MAP for Regression (Part II)
  2.5 Kernels and Ridge Regression
  2.6 Sparse Least Squares
  2.7 Total Least Squares
3 Dimensionality Reduction
  3.1 Principal Component Analysis
  3.2 Canonical Correlation Analysis
4 Beyond Least Squares: Optimization and Neural Networks
  4.1 Nonlinear Least Squares
  4.2 Optimization
  4.3 Gradient Descent
  4.4 Line Search
  4.5 Convex Optimization
  4.6 Newton's Method
  4.7 Gauss-Newton Algorithm
  4.8 Neural Networks
  4.9 Training Neural Networks
5 Classification
  5.1 Generative vs. Discriminative Classification
  5.2 Least Squares Support Vector Machine
  5.3 Logistic Regression
  5.4 Gaussian Discriminant Analysis
  5.5 Support Vector Machines
  5.6 Duality
  5.7 Nearest Neighbor Classification
6 Clustering
  6.1 K-means Clustering
  6.2 Mixture of Gaussians
  6.3 Expectation Maximization (EM) Algorithm
7 Decision Tree Learning
  7.1 Decision Trees
  7.2 Random Forests
  7.3 Boosting
8 Deep Learning
  8.1 Convolutional Neural Networks
  8.2 CNN Architectures
  8.3 Visualizing and Understanding CNNs
Chapter 1
Regression I

Our goal in machine learning is to extract a relationship from data. In regression tasks, this relationship takes the form of a function y = f(x), where y ∈ R is some quantity that can be predicted from an input x ∈ R^ℓ, which should for the time being be thought of as some collection of numerical measurements. The true relationship f is unknown to us, and our aim is to recover it as well as we can from data. Our end product is a function ŷ = h(x), called the hypothesis, that should approximate f. We assume that we have access to a dataset D = {(x_i, y_i)}_{i=1}^n, where each pair (x_i, y_i) is an example (possibly noisy or otherwise approximate) of the input-output mapping to be learned. Since learning arbitrary functions is intractable, we restrict ourselves to some hypothesis class H of allowable functions. More specifically, we typically employ a parametric model, meaning that there is some finite-dimensional vector w ∈ R^d, the elements of which are known as parameters or weights, that controls the behavior of the function. That is,

    h_w(x) = g(x, w)

for some other function g. The hypothesis class is then the set of all functions induced by the possible choices of the parameters w:

    H = { h_w | w ∈ R^d }

After designating a cost function L, which measures how poorly the predictions ŷ of the hypothesis match the true output y, we can proceed to search for the parameters that best fit the data by minimizing this function:

    w* = argmin_w L(w)

1.1 Ordinary Least Squares

Ordinary least squares (OLS) is one of the simplest regression problems, but it is well-understood and practically useful. It is a linear regression problem, which means that we take h_w to be of the form h_w(x) = x^⊤w. We want

    y_i ≈ ŷ_i = h_w(x_i) = x_i^⊤w

for each i = 1, ..., n. This set of equations can be written in matrix form as

    [ y_1 ]     [ x_1^⊤ ]
    [  ⋮  ]  ≈  [   ⋮   ] w,    i.e.  y ≈ Xw
    [ y_n ]     [ x_n^⊤ ]

In words, the matrix X ∈ R^{n×d} has the input datapoint x_i as its ith row. This matrix is sometimes called the design matrix. Usually n ≥ d, meaning that there are more datapoints than measurements.

There will in general be no exact solution to the equation y = Xw (even if the data were perfect, consider how many equations and variables there are), but we can find an approximate solution by minimizing the sum (or equivalently, the mean) of the squared errors:

    min_w Σ_{i=1}^n (x_i^⊤w − y_i)² = min_w ‖Xw − y‖₂²

Now that we have formulated an optimization problem, we want to go about solving it. We will see that the particular structure of OLS allows us to compute a closed-form expression for a globally optimal solution, which we denote w*_OLS.

Approach 1: Vector calculus

Calculus is the primary mathematical workhorse for studying the optimization of differentiable functions. Recall the following important result: if L : R^d → R is continuously differentiable, then any local optimum w* satisfies ∇L(w*) = 0. In the OLS case,

    L(w) = ‖Xw − y‖₂²
         = (Xw − y)^⊤(Xw − y)
         = (Xw)^⊤Xw − (Xw)^⊤y − y^⊤Xw + y^⊤y
         = w^⊤X^⊤Xw − 2w^⊤X^⊤y + y^⊤y

Using the following results from matrix calculus

    ∇_x(a^⊤x) = a
    ∇_x(x^⊤Ax) = (A + A^⊤)x

the gradient of L is easily seen to be

    ∇L(w) = ∇_w(w^⊤X^⊤Xw − 2w^⊤X^⊤y + y^⊤y)
          = ∇_w(w^⊤X^⊤Xw) − 2∇_w(w^⊤X^⊤y) + ∇_w(y^⊤y)
          = 2X^⊤Xw − 2X^⊤y

where in the last line we have used the symmetry of X^⊤X to simplify X^⊤X + (X^⊤X)^⊤ = 2X^⊤X. Setting the gradient to 0, we conclude that any optimum w*_OLS satisfies

    X^⊤X w*_OLS = X^⊤y
If X is full rank, then X^⊤X is as well (assuming n ≥ d), so we can solve for a unique solution

    w*_OLS = (X^⊤X)^{-1} X^⊤y

Note: Although we write (X^⊤X)^{-1}, in practice one would not actually compute the inverse; it is more numerically stable to solve the linear system of equations above (e.g. with Gaussian elimination).

In this derivation we have used the condition ∇L(w*) = 0, which is a necessary but not sufficient condition for optimality. We found a critical point, but in general such a point could be a local minimum, a local maximum, or a saddle point. Fortunately, in this case the objective function is convex, which implies that any critical point is indeed a global minimum. To show that L is convex, it suffices to compute the Hessian of L, which in this case is

    ∇²L(w) = 2X^⊤X

and show that this is positive semi-definite:

    ∀w,  w^⊤(2X^⊤X)w = 2(Xw)^⊤Xw = 2‖Xw‖₂² ≥ 0

Approach 2: Orthogonal projection

There is also a linear algebraic way to arrive at the same solution: orthogonal projections. Recall that if V is an inner product space and S a subspace of V, then any v ∈ V can be decomposed uniquely in the form

    v = v_S + v_⊥

where v_S ∈ S and v_⊥ ∈ S^⊥. Here S^⊥ is the orthogonal complement of S, i.e. the set of vectors that are perpendicular to every vector in S. The orthogonal projection onto S, denoted P_S, is the linear operator that maps v to v_S in the decomposition above. An important property of the orthogonal projection is that

    ‖v − P_S v‖ ≤ ‖v − s‖

for all s ∈ S, with equality if and only if s = P_S v. That is,

    P_S v = argmin_{s∈S} ‖v − s‖

Proof. By the Pythagorean theorem,

    ‖v − s‖² = ‖(v − P_S v) + (P_S v − s)‖² = ‖v − P_S v‖² + ‖P_S v − s‖² ≥ ‖v − P_S v‖²

with equality holding if and only if ‖P_S v − s‖² = 0, i.e. s = P_S v. Taking square roots on both sides gives ‖v − s‖ ≥ ‖v − P_S v‖ as claimed (since norms are nonnegative). ∎

[Figure: a visual representation of the argument above, showing v, its projection P_S v onto the subspace S, and a generic point s ∈ S.]

In the OLS case,

    w*_OLS = argmin_w ‖Xw − y‖²

But observe that the set of vectors that can be written Xw for some w ∈ R^d is precisely the range of X, which we know to be a subspace of R^n, so

    min_{s ∈ range(X)} ‖s − y‖² = min_{w ∈ R^d} ‖Xw − y‖²

By pattern matching with the earlier optimality statement about P_S, we observe that

    P_{range(X)} y = X w*_OLS

where w*_OLS is any optimum for the right-hand side. The projected point X w*_OLS is always unique, but if X is full rank (again assuming n ≥ d), then the optimum w*_OLS is also unique (as expected). This is because X being full rank means that the columns of X are linearly independent, in which case there is a one-to-one correspondence between w and Xw.

To solve for w*_OLS, we need the following fact¹:

    null(X^⊤) = range(X)^⊥

Since we are projecting onto range(X), the orthogonality condition for optimality is that y − Py ⊥ range(X), i.e. y − X w*_OLS ∈ null(X^⊤). This leads to the equation

    X^⊤(y − X w*_OLS) = 0

which is equivalent to

    X^⊤X w*_OLS = X^⊤y

as before.

¹ This result is often stated as part of the Fundamental Theorem of Linear Algebra (revisited in Section 2.5).
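To make the closed-form solution concrete, here is a minimal NumPy sketch (not part of the original text) that fits OLS by solving the normal equations rather than forming the inverse explicitly, in line with the numerical note above. The data are synthetic and purely illustrative.

```python
import numpy as np

def fit_ols(X, y):
    """Solve the normal equations X^T X w = X^T y (no explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical synthetic data: y = X w_true + noise
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_ols = fit_ols(X, y)
# np.linalg.lstsq solves the same problem with an SVD-based routine.
w_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_ols, w_check)
```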
1.2 Ridge Regression

While Ordinary Least Squares can be used for solving linear least squares problems, it falls short due to numerical instability and generalization issues. Numerical instability arises when the features of the data are close to collinear (leading to linearly dependent feature columns), causing the input matrix X to lose its rank or have singular values that are very close to 0. Why are small singular values bad? Let us illustrate this via the singular value decomposition (SVD) of X:

    X = UΣV^⊤

where U ∈ R^{n×n}, Σ ∈ R^{n×d}, V ∈ R^{d×d}. In the context of OLS, we must have that X^⊤X is invertible, or equivalently, rank(X^⊤X) = rank(X^⊤) = rank(X) = d. Assuming that X and X^⊤ have full column rank d, we can express the SVD of X as

    X = U [ Σ_d ] V^⊤
          [  0  ]

where Σ_d ∈ R^{d×d} is a diagonal matrix with strictly positive entries. Now let's try to expand the (X^⊤X)^{-1} term in OLS using the SVD of X:

    (X^⊤X)^{-1} = ( V [Σ_d  0] U^⊤ U [Σ_d; 0] V^⊤ )^{-1}
                = ( V [Σ_d  0] [Σ_d; 0] V^⊤ )^{-1}        (since U^⊤U = I)
                = ( V Σ_d² V^⊤ )^{-1}
                = V (Σ_d²)^{-1} V^⊤

This means that (X^⊤X)^{-1} will have singular values that are the squared inverses of the singular values of X, potentially leading to extremely large singular values when the singular values of X are close to 0. Such excessively large singular values can be very problematic for numerical stability purposes. In addition, abnormally high values in the optimal w solution would prevent OLS from generalizing to unseen data.

There is a very simple solution to these issues: penalize the entries of w from becoming too large. We can do this by adding a penalty term constraining the norm of w. For a fixed, small scalar λ > 0, we now have:

    min_w ‖Xw − y‖₂² + λ‖w‖₂²

Note that the λ in our objective function is a hyperparameter that measures the sensitivity to the values in w. Just like the degree in polynomial features, λ is a value that we must choose arbitrarily through validation. Let's expand the terms of the objective function:

    L(w) = ‖Xw − y‖₂² + λ‖w‖₂²
         = w^⊤X^⊤Xw − 2w^⊤X^⊤y + y^⊤y + λw^⊤w

Finally, take the gradient of the objective and find the value of w that achieves 0 for the gradient:

    ∇_w L(w) = 0
    2X^⊤Xw − 2X^⊤y + 2λw = 0
    (X^⊤X + λI)w = X^⊤y
    w*_RIDGE = (X^⊤X + λI)^{-1} X^⊤y

This value is guaranteed to achieve the (unique) global minimum, because the objective function is strongly convex. To show that L is strongly convex, it suffices to compute the Hessian of L, which in this case is

    ∇²L(w) = 2X^⊤X + 2λI

and show that this is positive definite (PD):

    ∀w ≠ 0,  w^⊤(X^⊤X + λI)w = (Xw)^⊤Xw + λw^⊤w = ‖Xw‖₂² + λ‖w‖₂² > 0

Since the Hessian is positive definite, we can equivalently say that the eigenvalues of the Hessian are strictly positive and that the objective function is strongly convex. A useful property of strongly convex functions is that they have a unique optimum point, so the solution to ridge regression is unique. We cannot make such guarantees about ordinary least squares, because the corresponding Hessian could have eigenvalues that are 0. Let us explore the case in OLS when the Hessian has a 0 eigenvalue. In this context, the term X^⊤X is not invertible, but this does not imply that no solution exists! In OLS, there always exists a solution; when the Hessian is PD that solution is unique, and when the Hessian is merely PSD, there are infinitely many solutions. (There always exists a solution to the expression X^⊤Xw = X^⊤y, because the range of X^⊤X and the range of X^⊤ are equivalent; since X^⊤y lies in the range of X^⊤, it must equivalently lie in the range of X^⊤X, and therefore there always exists a w that satisfies the equation X^⊤Xw = X^⊤y.)

The technique we just described is known as ridge regression. Note that now the expression X^⊤X + λI is invertible, regardless of the rank of X. Let's find (X^⊤X + λI)^{-1} through the SVD:

    (X^⊤X + λI)^{-1} = ( V Σ_d² V^⊤ + λVV^⊤ )^{-1}
                     = ( V (Σ_d² + λI) V^⊤ )^{-1}
                     = V (Σ_d² + λI)^{-1} V^⊤

Now with our slight tweak, the matrix X^⊤X + λI has become full rank and thus invertible. Its singular values have become σ_i² + λ, so those of its inverse are 1/(σ_i² + λ), meaning that the singular values of the inverse are guaranteed to be at most 1/λ, solving our numerical instability issues.
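A minimal sketch (not from the text) of the ridge solution, again solving the linear system rather than forming the inverse; lam plays the role of λ and the value used here is hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Even with nearly collinear columns, the regularized system stays well conditioned.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=200)])  # almost collinear
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=200)
print(fit_ridge(X, y, lam=1e-2))
```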
Furthermore, we have partially solved the overfitting issue. By penalizing the norm of w, we encourage the weights corresponding to relevant features that capture the main structure of the true model, and penalize the weights corresponding to complex features that only serve to fine-tune the model and fit noise in the data.

1.3 Feature Engineering

We've seen that the least-squares optimization problem

    min_w ‖Xw − y‖₂²

represents the "best-fit" linear model, by projecting y onto the subspace spanned by the columns of X. However, the true input-output relationship y = f(x) may be nonlinear, so it is useful to consider nonlinear models as well. It turns out that we can still do this under the framework of linear least-squares, by augmenting the data with new features. In particular, we devise some function φ : R^ℓ → R^d, called a feature map, that maps each raw data point x ∈ R^ℓ into a vector of features φ(x). The hypothesis function then writes

    h_w(x) = Σ_{j=1}^d w_j φ_j(x) = w^⊤φ(x)

Note that the resulting model is still linear with respect to the features, but it is nonlinear with respect to the original data if φ is nonlinear. The component functions φ_j are sometimes called basis functions because our hypothesis is a linear combination of them. In the simplest case, we could just use the components of x as features (i.e. φ_j(x) = x_j), but in general it is helpful to disambiguate the features of an example from the example's entries.

We can then use least-squares to estimate the weights w, just as before. To do this, we replace the original data matrix X ∈ R^{n×ℓ} by Φ ∈ R^{n×d}, which has φ(x_i)^⊤ as its ith row:

    min_w ‖Φw − y‖₂²

Example: Fitting Ellipses

Let's use least-squares to estimate the parameters of an ellipse from data. Assume that we have n data points D = {(x_{1,i}, x_{2,i})}_{i=1}^n, which may be noisy (i.e. could be off the actual orbit). Our goal is to determine the relationship between x₁ and x₂. We assume that the ellipse from which the points were generated has the form

    w₁x₁² + w₂x₂² + w₃x₁x₂ + w₄x₁ + w₅x₂ = 1

where the coefficients w₁, ..., w₅ are the parameters we wish to estimate. We formulate the problem with least-squares:

    min_w ‖Φw − 1‖₂²

where 1 is the all-ones vector and

    Φ = [ x²_{1,1}   x²_{2,1}   x_{1,1}x_{2,1}   x_{1,1}   x_{2,1} ]
        [    ⋮          ⋮             ⋮             ⋮         ⋮    ]
        [ x²_{1,n}   x²_{2,n}   x_{1,n}x_{2,n}   x_{1,n}   x_{2,n} ]

In this case, the feature map φ is given by φ(x) = (x₁², x₂², x₁x₂, x₁, x₂). Note that there is no "target" vector y here, so this is not a traditional regression problem, but it still fits into the framework of least-squares.
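As an illustration of the feature-map idea (not part of the original text), here is a small NumPy sketch of the ellipse-fitting example; the data points are hypothetical noisy samples from an ellipse.

```python
import numpy as np

def ellipse_features(x1, x2):
    """Feature map phi(x) = (x1^2, x2^2, x1*x2, x1, x2), one row per point."""
    return np.stack([x1**2, x2**2, x1 * x2, x1, x2], axis=1)

# Hypothetical noisy points near an axis-aligned ellipse
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, size=200)
x1 = 2.0 * np.cos(t) + 0.05 * rng.normal(size=t.size)
x2 = 1.0 * np.sin(t) + 0.05 * rng.normal(size=t.size)

Phi = ellipse_features(x1, x2)
ones = np.ones(t.size)                       # the right-hand side is the constant 1
w, *_ = np.linalg.lstsq(Phi, ones, rcond=None)
print(w)  # estimated coefficients w1..w5
```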
Polynomial Features

The example above demonstrates an important class of features known as polynomial features. Remember that a polynomial is a linear combination of monomial basis terms. Monomials can be classified in two ways, by their degree and dimension:

    Dimension \ Degree    0    1          2                    3
    1 (univariate)        1    x          x²                   x³
    2 (bivariate)         1    x₁, x₂     x₁², x₁x₂, x₂²       x₁³, x₁²x₂, x₁x₂², x₂³

A big reason we care about polynomial features is that any smooth function can be approximated arbitrarily closely by some polynomial.² For this reason, polynomials are said to be universal approximators. One downside of polynomials is that as their degree increases, their number of terms increases rapidly. Specifically, one can use a "stars and bars" style combinatorial argument³ to show that a polynomial of degree d in ℓ variables has (ℓ+d choose d) terms. To get an idea for how quickly this quantity grows, consider a few examples:

    (ℓ+d choose d)   d=1    d=3     d=5       d=10         d=25
    ℓ=1              2      4       6         11           26
    ℓ=3              4      20      56        286          3276
    ℓ=5              6      56      252       3003         142506
    ℓ=10             11     286     3003      184756       183579396
    ℓ=25             26     3276    142506    183579396    126410606437752

Later we will learn about the kernel trick, a clever mathematical method that allows us to circumvent this rapidly growing cost in certain cases.

² Taylor's theorem gives more precise statements about the approximation error.
³ We count the number of distinct monomials of degree at most d in ℓ variables x₁, ..., x_ℓ, or equivalently, the number of distinct monomials of degree exactly d in ℓ+1 variables x₀, x₁, ..., x_ℓ. Every monomial has the form x₀^{k₀}x₁^{k₁}⋯x_ℓ^{k_ℓ} with Σ_j k_j = d. This corresponds to an arrangement of d stars and ℓ bars, where the number of stars between consecutive bars (or the end of the expression) gives the degree of that ordered variable. The number of unique ways to arrange these stars and bars is the number of ways to choose the positions of the ℓ bars out of the total ℓ + d slots, i.e. ℓ+d choose ℓ. (You could also pick the positions of the d stars out of the total ℓ + d slots; the expression is symmetric in ℓ and d.)

1.4 Hyperparameters and Validation

As above, consider a hypothesis of the form

    h_w(x) = Σ_{j=1}^d w_j φ_j(x) = w^⊤φ(x)

Observe that the model order d is not one of the decision variables being optimized when we fit to the data. For this reason d is called a hyperparameter. We might say more specifically that it is a model hyperparameter, since it determines the structure of the model. For another example, recall ridge regression, in which we add an ℓ₂ penalty on the parameters:

    min_w ‖Xw − y‖₂² + λ‖w‖₂²

The regularization weight λ is also a hyperparameter, as it is fixed during the minimization above. However λ, unlike the previously discussed hyperparameter d, is not a part of the model. Rather, it is an aspect of the optimization procedure used to fit the model, so we say it is an optimization hyperparameter. Hyperparameters tend to fall into one of these two categories.

Since hyperparameters are not determined by the data-fitting optimization procedure, how should we choose their values? A suitable answer to this question requires some discussion of the different types of error at play.

Types of Error

We have seen that it is common to minimize some measure of how poorly our hypothesis fits the data we have, but what we actually care about is how well the hypothesis predicts future data. Let us try to formally distinguish the various types of error. Assume that the data are distributed according to some (unknown) distribution D, and that we have a loss function ℓ : R × R → R, which is to measure the error between the true output y and our estimate ŷ = h(x). The risk (or true error) of a particular hypothesis h ∈ H is the expected loss over the whole data distribution:

    R(h) = E_{(x,y)∼D}[ℓ(h(x), y)]

Ideally, we would find the hypothesis that minimizes the risk, i.e.

    h* = argmin_{h∈H} R(h)

However, computing this expectation is impossible because we do not have access to the true data distribution. Rather, we have access to samples (x_i, y_i) ∼ D. These enable us to approximate the real problem we care about by minimizing the empirical risk (or training error)

    R̂_train(h) = (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i)

But since we have a finite number of samples, the hypothesis that performs the best on the training data is not necessarily the best on the whole data distribution.
In particular, if we both train and evaluate the hypothesis using the same data points, the training error will be a very biased estimate of the true error, since the hypothesis has been chosen specifically to perform well on those points. This phenomenon is sometimes referred to as "data incest". A common solution is to set aside some portion (say 30%) of the data, to be called the validation set, which is disjoint from the training set and not allowed to be used when fitting the model.

[Figure: the dataset split into a training portion and a validation portion.]

We can use this validation set to estimate the true error by the validation error

    R̂_val(h) = (1/m) Σ_{i=1}^m ℓ(h(x_i^val), y_i^val)

where m is the number of validation points. With this estimate, we have a simple method for choosing hyperparameter values: try a bunch of configurations of the hyperparameters and choose the one that yields the lowest validation error.

The effect of hyperparameters on error

Note that as we add more features to a linear model, training error can only decrease. This is because the optimizer can set w_i = 0 if feature i cannot be used to reduce training error.

[Figure: training error versus model order; training error decreases monotonically.]

Adding more features tends to reduce true error as long as the additional features are useful predictors of the output. However, if we keep adding features, these begin to fit noise in the training data instead of the true signal, causing true error to actually increase. This phenomenon is known as overfitting.

[Figure: true error versus model order; the curve first decreases, then increases once the model overfits.]

The validation error tracks the true error reasonably well as long as the validation set is sufficiently large.

The regularization hyperparameter λ has a somewhat different effect on training error. Observe that if λ = 0, we recover the exact OLS problem, which is directly minimizing the training error. As λ increases, the optimizer places less emphasis on the training error and more emphasis on reducing the magnitude of the parameters. This leads to a degradation in training error as λ grows.

[Figure: training error versus regularization weight; training error increases as λ grows.]

Cross-validation

Setting aside a validation set works well, but comes at a cost, since we cannot use the validation data for training. Since having more data generally improves the quality of the trained model, we may prefer not to let that data go to waste, especially if we have little data to begin with and/or collecting more data is expensive. Cross-validation is an alternative to having a dedicated validation set. k-fold cross-validation works as follows (a code sketch follows this subsection):

1. Shuffle the data and partition it into k equally-sized (or as equal as possible) blocks.
2. For i = 1, ..., k: train the model on all the data except block i, then evaluate the model (i.e. compute the validation error) using block i.
3. Average the k validation errors; this is our final estimate of the true error.

[Figure: in each fold, one block is held out for validation and the remaining blocks are used for training.]

Observe that, although every datapoint is used for evaluation at some time or another, the model is always evaluated on a different set of points than it was trained on, thereby cleverly avoiding the "data incest" problem mentioned earlier.

Note also that this process (except for the shuffling and partitioning) must be repeated for every hyperparameter configuration we wish to test. This is the principal drawback of k-fold cross-validation as compared to using a held-out validation set: there is roughly k times as much computation required. This is not a big deal for the relatively small linear models that we've seen so far, but it can be prohibitively expensive when the model takes a long time to train, as is the case in the Big Data regime or when using neural networks.
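Here is a minimal NumPy sketch of k-fold cross-validation (not from the original text); fit and predict are hypothetical callables supplied by the user, and squared loss is assumed.

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, k=5, seed=0):
    """Estimate the true error of a model by k-fold cross-validation (squared loss)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))             # step 1: shuffle
    folds = np.array_split(idx, k)            # step 1: partition into k blocks
    errors = []
    for i in range(k):                        # step 2: hold out block i
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return np.mean(errors)                    # step 3: average the k validation errors

# Example usage with ridge regression, sweeping the hyperparameter lambda:
# for lam in [1e-3, 1e-2, 1e-1, 1.0]:
#     fit = lambda X, y, lam=lam: np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
#     predict = lambda w, X: X @ w
#     print(lam, k_fold_cv_error(X, y, fit, predict))
```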
Chapter 2
Regression II

2.1 MLE and MAP for Regression (Part I)

So far, we've explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:

    w*_OLS = argmin_w ‖y − Xw‖₂²
    w*_RIDGE = argmin_w ‖y − Xw‖₂² + λ‖w‖₂²

One question that arises is why we specifically use the ℓ₂ norm to measure the error of our predictions, and to penalize the model parameters. We will justify this design choice by exploring the statistical interpretations of regression; namely, we will employ Gaussians, MLE and MAP to validate what we've done so far through a different lens.

Probabilistic Model

In the context of supervised learning, we assume that there exists a true underlying model mapping inputs to outputs:

    f : x → f(x)

The true model is unknown to us, and our goal is to find a hypothesis model that best represents the true model. The only information that we have about the true model is via a dataset

    D = {(x_i, y_i)}_{i=1}^n

where x_i ∈ R^d is the input and y_i ∈ R is the observation, a noisy version of the true output f(x_i):

    Y_i = f(x_i) + Z_i

We assume that x_i is a fixed value (which implies that f(x_i) is fixed as well), while Z_i is a random variable (which implies that Y_i is a random variable as well). We always assume that Z_i has zero mean, because otherwise there would be systematic bias in our observations. The Z_i's could be Gaussian, uniform, Laplacian, etc. In most contexts, we assume that they are independent identically distributed (i.i.d.) Gaussians: Z_i ∼ N(0, σ²). We can therefore say that Y_i is a random variable whose probability distribution is given by

    Y_i ∼ N(f(x_i), σ²)

Now that we have defined the model and data, we wish to find a hypothesis model h_θ (parameterized by θ) that best captures the relationships in the data, while possibly taking into account prior beliefs that we have about the true model. We can represent this as a probability problem, where the goal is to find the optimal model that maximizes our probability.

Maximum Likelihood Estimation

In Maximum Likelihood Estimation (MLE), the goal is to find the hypothesis model that maximizes the probability of the data. If we parameterize the set of hypothesis models with θ, we can express the problem as

    θ̂_MLE = argmax_θ L(θ; D) = p(data = D | true model = h_θ)

The quantity L(θ) that we are maximizing is also known as the likelihood, hence the term MLE. Substituting our representation of D we have

    θ̂_MLE = argmax_θ L(θ; X, y) = p(y₁, ..., y_n | x₁, ..., x_n, θ)

Note that we implicitly condition on the x_i's, because we treat them as fixed values of the data. The only randomness in our data comes from the y_i's (since they are noisy versions of the true values f(x_i)). We can further simplify the problem by working with the log likelihood ℓ(θ; X, y) = log L(θ; X, y):

    θ̂_MLE = argmax_θ L(θ; X, y) = argmax_θ ℓ(θ; X, y)

With logs we are still working with the same problem, because logarithms are monotonic functions.
In other words, we have that

    P(A) < P(B)  ⟺  log P(A) < log P(B)

Let's decompose the log likelihood:

    ℓ(θ; X, y) = log p(y₁, ..., y_n | x₁, ..., x_n, θ) = log Π_{i=1}^n p(y_i | x_i, θ) = Σ_{i=1}^n log p(y_i | x_i, θ)

We decoupled the probabilities from each datapoint because their corresponding noise components are independent. Note that the logs allow us to work with sums rather than products, simplifying the problem (one reason why the log likelihood is such a powerful tool). Each individual term p(y_i | x_i, θ) comes from a Gaussian

    Y_i | θ ∼ N(h_θ(x_i), σ²)

Continuing with logs:

    θ̂_MLE = argmax_θ ℓ(θ; X, y)                                                   (2.1)
          = argmax_θ Σ_{i=1}^n log p(y_i | x_i, θ)                                 (2.2)
          = argmax_θ Σ_{i=1}^n ( −(y_i − h_θ(x_i))²/(2σ²) − log(σ√(2π)) )          (2.3)
          = argmin_θ Σ_{i=1}^n (y_i − h_θ(x_i))²                                   (2.4)
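As a sanity check (not in the original text), the chain above says that maximizing the Gaussian likelihood over θ is the same as minimizing the squared error. The sketch below verifies this numerically for a linear hypothesis h_w(x) = x^⊤w by minimizing the negative log likelihood with SciPy and comparing against the OLS closed form; the noise level sigma is a hypothetical fixed value.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + sigma * rng.normal(size=n)

def neg_log_likelihood(w):
    # -sum_i log N(y_i; x_i^T w, sigma^2)
    r = y - X @ w
    return np.sum(r**2 / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi)))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ols, atol=1e-3))  # the two estimates coincide
```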
2.3 Multivariate Gaussians

    P(Z ∈ T(R')) = ∫⋯∫_{R'} f_Z(Ru) |det(R)| du₁ du₂ ⋯ du_n

Note the change of variables in the last step: we sum over the squares in U space, instead of parallelograms in R space. So far, we have shown that (for any dimension n)

    P(U ∈ R') = ∫⋯∫_{R'} f_U(u) du₁ du₂ ⋯ du_n

and

    P(Z ∈ T(R')) = ∫⋯∫_{R'} f_Z(Ru) |det(R)| du₁ du₂ ⋯ du_n

Notice that these two probabilities are equivalent! The probability that U takes a value in R' must equal the probability that the transformed random vector Z takes a value in the transformed region T(R'). Therefore, we can say that

    P(U ∈ R') = ∫⋯∫_{R'} f_U(u) du₁ ⋯ du_n = ∫⋯∫_{R'} f_Z(Ru) |det(R)| du₁ ⋯ du_n = P(Z ∈ T(R'))

We conclude that

    f_U(u) = f_Z(Ru) |det(R)|

An almost identical argument will allow us to state that

    f_Z(z) = f_U(R^{-1}z) |det(R^{-1})| = f_U(R^{-1}z) · 1/|det(R)|

Since the densities for all the U_i's are i.i.d., and U = R^{-1}Z, we can write the joint density function of Z as

    f_Z(z) = 1/|det(R)| · Π_{i=1}^n 1/√(2π) · exp( −[R^{-1}z]_i² / 2 )
           = 1/(|det(R)| (√(2π))^n) · exp( −(1/2) ‖R^{-1}z‖₂² )
           = 1/(|det(R)| (√(2π))^n) · exp( −(1/2) z^⊤(RR^⊤)^{-1}z )

Note that RR^⊤ is simply the covariance matrix for Z:

    Cov[Z] = E[ZZ^⊤] = E[RUU^⊤R^⊤] = R E[UU^⊤] R^⊤ = R I R^⊤ = RR^⊤

Thus the density function of Z can be written as

    f_Z(z) = 1/(|det(R)| (√(2π))^n) · exp( −(1/2) z^⊤Σ_Z^{-1}z )

Furthermore, we know that

    |det(Σ_Z)| = |det(RR^⊤)| = |det(R) det(R^⊤)| = |det(R) · det(R)| = |det(R)|²

and therefore

    f_Z(z) = 1/((√(2π))^n √(det(Σ_Z))) · exp( −(1/2) z^⊤Σ_Z^{-1}z )

Estimating Gaussians from Data

For a particular multivariate Gaussian distribution f(·), if we do not have the true mean and covariance μ, Σ, then our best bet is to use MLE to estimate them empirically with i.i.d. samples x₁, x₂, ..., x_n:

    μ̂_MLE = (1/n) Σ_{i=1}^n x_i
    Σ̂_MLE = (1/n) Σ_{i=1}^n (x_i − μ̂_MLE)(x_i − μ̂_MLE)^⊤

Note that the above formulas are not necessarily trivial and must be formally proven using MLE. Just to present a glimpse of the process, let's prove that these formulas hold for the case where we are dealing with one-dimensional data points. For notation purposes, assume that D = {x₁, x₂, ..., x_n} is the set of all training data points that belong to class k. Note that the data points are i.i.d. Our goal is to solve the following MLE problem:

    μ̂, σ̂² = argmax_{μ,σ²} P(x₁, x₂, ..., x_n | μ, σ²)
           = argmax_{μ,σ²} Σ_{i=1}^n ln P(x_i | μ, σ²)

Note that the objective above is not jointly convex, so we cannot simply take derivatives and set them to 0! Instead, we decompose the minimization over σ² and μ into a nested optimization problem:

    min_σ min_μ Σ_{i=1}^n (x_i − μ)²/(2σ²) + n ln(σ)

The optimization problem has been decomposed into an inner problem that optimizes for μ given a fixed σ², and an outer problem that optimizes for σ² given the optimal value μ̂. Let's first solve the inner optimization problem. Given a fixed σ², the objective is convex in μ, so we can simply take a partial derivative w.r.t. μ and set it equal to 0:

    ∂/∂μ [ Σ_{i=1}^n (x_i − μ)²/(2σ²) + n ln(σ) ] = −Σ_{i=1}^n (x_i − μ)/σ² = 0  ⟹  μ̂ = (1/n) Σ_{i=1}^n x_i

Having solved the inner optimization problem, we now have that

    min_σ Σ_{i=1}^n (x_i − μ̂)²/(2σ²) + n ln(σ)

Note that this objective is not convex in σ, so we must instead find the critical point of the objective that minimizes the objective. Assuming that σ > 0, the critical points are:

• σ → 0: assuming that not all of the points x_i are equal to μ̂, there are two terms that are at odds with each other: a 1/σ² term that blows up to ∞, and a ln(σ) term that goes to −∞ as σ → 0. Note that the 1/σ² term blows up at a faster rate, so we conclude that the objective tends to ∞ as σ → 0.
• Points at which the derivative w.r.t. σ is 0:

    −Σ_{i=1}^n (x_i − μ̂)²/σ³ + n/σ = 0  ⟹  σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)²

We conclude that the optimal point is

    σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)²
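A small NumPy sketch (not from the text) of the MLE formulas above, checked against samples drawn from a known Gaussian; the true mean and covariance used here are hypothetical.

```python
import numpy as np

def gaussian_mle(X):
    """MLE of mean and covariance: sample mean, and covariance with a 1/n factor."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / X.shape[0]
    return mu_hat, Sigma_hat

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat, Sigma_hat)   # close to the true parameters for large n
```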
Isocontours

Let's try to understand in detail how to visualize a multivariate Gaussian distribution. For simplicity, let's consider a zero-mean Gaussian distribution N(0, Σ), which just leaves us with the covariance matrix Σ. Since Σ is a symmetric, positive semidefinite matrix, we can decompose it by the spectral theorem into Σ = VΛV^⊤, where the columns of V form an orthonormal basis in R^d, and Λ is a diagonal matrix with real, non-negative values. We wish to find its level set, or simply the set of all points x such that the probability density f(x) evaluates to a fixed constant k. This is equivalent to the level set ln(f(x)) = ln(k), which further reduces to

    x^⊤Σ^{-1}x = c

for some constant c. Without loss of generality, assume that this constant is 1. The level set x^⊤Σ^{-1}x = 1 is an ellipsoid with axes v₁, v₂, ..., v_d, with lengths √λ₁, √λ₂, ..., √λ_d respectively. Each axis of the ellipsoid is the vector √λ_i v_i, and we can verify that

    (√λ_i v_i)^⊤ Σ^{-1} (√λ_i v_i) = λ_i v_i^⊤Σ^{-1}v_i = λ_i v_i^⊤(λ_i^{-1} v_i) = v_i^⊤v_i = 1

The entries of Λ dictate how elongated or shrunk the distribution is along each direction. In the case of isotropic distributions, the entries of Λ are all identical, meaning that the axes of the ellipsoid form a circle. In the case of anisotropic distributions, the entries of Λ are not necessarily identical, meaning that the resulting ellipsoid may be elongated/shrunken and also rotated.

[Figure 2.1: Isotropic (left) vs anisotropic (right) contours are ellipsoids with axes √λ_i v_i. Images courtesy of Professor Shewchuk's notes.]

Properties

Let's state some well-known properties of multivariate Gaussians. Given a jointly Gaussian (JG) random vector Z ∼ N(μ_Z, Σ_Z), the linear transformation AZ (where A is an appropriately dimensioned constant matrix) is also JG:

    AZ ∼ N(Aμ_Z, AΣ_Z A^⊤)

We can derive the mean and covariance of AZ using the linearity of expectations:

    μ_{AZ} = E[AZ] = A E[Z] = Aμ_Z

and

    Σ_{AZ} = E[(AZ − E[AZ])(AZ − E[AZ])^⊤]
           = E[A(Z − E[Z])(Z − E[Z])^⊤A^⊤]
           = A E[(Z − E[Z])(Z − E[Z])^⊤] A^⊤
           = AΣ_Z A^⊤

Note that the statements above did not rely on the fact that Z is JG, so this reasoning applies to all random vectors. We know that AZ is JG itself, because it can be expressed as a linear transformation of i.i.d. Gaussians: AZ = ARU.

Now suppose that we have the partition Z = [X; Y] whose distribution is given by Z ∼ N(μ_Z, Σ_Z) with

    μ_Z = [ μ_X ]       Σ_Z = [ Σ_XX   Σ_XY ]
          [ μ_Y ]             [ Σ_YX   Σ_YY ]

It turns out that the marginal distribution of the individual random vector X (and Y) is JG:

    X ∼ N(μ_X, Σ_XX)

However, the converse is not necessarily true: if X and Y are each individually JG, it is not necessarily the case that [X; Y] is JG! To see why, let's suppose that X and Y are individually JG. Thus, we can express each as a linear transformation of i.i.d. Gaussian random variables: X = R_X U_X, Y = R_Y U_Y. We would expect that the expression for the joint distribution would be JG:

    [ X ]   [ R_X   0  ] [ U_X ]
    [ Y ] = [ 0    R_Y ] [ U_Y ]

However, since we cannot guarantee that the entries of U_X are independently distributed from the entries of U_Y, we cannot conclude that the joint distribution is JG. If the entries are independently distributed, then we would be able to conclude that the joint distribution is JG.
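The property that an affine transformation of a Gaussian is again Gaussian is exactly how one samples from N(μ, Σ) in practice: draw i.i.d. standard normals U and return μ + RU with RR^⊤ = Σ. Here is a brief sketch (not from the text), using a Cholesky factor as one valid choice of R.

```python
import numpy as np

def sample_gaussian(mu, Sigma, n, rng):
    """Draw n samples from N(mu, Sigma) as mu + R @ U, where R R^T = Sigma."""
    R = np.linalg.cholesky(Sigma)           # one valid choice of R
    U = rng.standard_normal((len(mu), n))   # i.i.d. N(0, 1) entries
    return (mu[:, None] + R @ U).T

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
Z = sample_gaussian(mu, Sigma, 10000, rng)
print(np.cov(Z.T, bias=True))  # empirical covariance approximates Sigma
```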
Let's now transition back to our discussion of Z. The conditional distribution of X given Y (and vice versa) is also JG:

    X | Y ∼ N( μ_X + Σ_XY Σ_YY^{-1}(Y − μ_Y),  Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX )

If X and Y are uncorrelated (that is, if Σ_XY = Σ_YX = 0), we can say that they are independent. Namely, the conditional distribution of X given Y does not depend on Y:

    X | Y ∼ N( μ_X + 0·Σ_YY^{-1}(Y − μ_Y),  Σ_XX − 0·Σ_YY^{-1}·0 ) = N(μ_X, Σ_XX)

This also follows from the multivariate Gaussian pdf, which in this case factors as

    f_{X,Y}(x, y) = f_X(x) f_Y(y)

Note the significance of this statement. Given any two general random vectors, we cannot necessarily say "if they are uncorrelated, then they are independent". However, in the case of random vectors from the same JG joint distribution, we can make this claim.

2.4 MLE and MAP for Regression (Part II)

The power of probabilistic thinking is that it allows us a way to model situations that arise and adapt our approaches in a reasonably principled way. This is particularly true when it comes to incorporating information about the situation that comes from the physical context of the data gathering process. In this note, we will explore what happens as we vary our assumptions about the noise in our data and the priors for our parameters, as well as the "importance" of certain training points.

So far we have used MLE and MAP to justify the optimization formulation of OLS and ridge regression, respectively. The MLE formulation assumes that the observation Y_i is a noisy version of the true underlying output:

    Y_i = f(x_i) + Z_i

where the noise for each datapoint is crucially i.i.d. The MAP formulation assumes that each model parameter W_j is distributed according to an i.i.d. Gaussian prior:

    W_j ∼ N(μ_j, σ_W²)

So far, we have restricted ourselves to the case when the noise/parameters are i.i.d.:

    Z ∼ N(0, σ²I),    W ∼ N(μ_W, σ_W²I)

However, what about the case when the Z_i's/W_j's are non-identical or dependent on one another? We would like to explore the case when the observation noise and underlying parameters are jointly Gaussian with arbitrary individual covariance matrices, but are independent of each other:

    Z ∼ N(0, Σ_Z),    W ∼ N(μ_W, Σ_W)

It turns out that via a change of coordinates, we can reduce these non-i.i.d. problems back to the i.i.d. case and solve them using the original techniques we used to solve OLS and ridge regression! Changing coordinates is a powerful tool in thinking about machine learning.

Weighted Least Squares

The basic idea of weighted least squares is the following: we place more emphasis on the loss contributed from certain data points over others; that is, we care more about fitting some data points over others. It turns out that this weighted perspective is very useful as a building block when we go beyond traditional least-squares problems.

Optimization View

From an optimization perspective, the problem can be expressed as

    w*_WLS = argmin_w Σ_{i=1}^n ω_i (y_i − x_i^⊤w)²

This objective is the same as OLS, except that each term in the sum is weighted by a positive coefficient ω_i. As always, we can vectorize this problem:

    w*_WLS = argmin_w (y − Xw)^⊤Ω(y − Xw)

where the ith row of X is x_i^⊤, and Ω ∈ R^{n×n} is a diagonal matrix with Ω_ii = ω_i. We can rewrite the WLS objective as an OLS objective:

    w*_WLS = argmin_w (y − Xw)^⊤Ω(y − Xw)
           = argmin_w (y − Xw)^⊤Ω^{1/2}Ω^{1/2}(y − Xw)
           = argmin_w (Ω^{1/2}y − Ω^{1/2}Xw)^⊤(Ω^{1/2}y − Ω^{1/2}Xw)
           = argmin_w ‖Ω^{1/2}y − Ω^{1/2}Xw‖₂²

This formulation is identical to OLS except that we have scaled the data matrix and the observation vector by Ω^{1/2}, and we conclude that

    w*_WLS = ( (Ω^{1/2}X)^⊤(Ω^{1/2}X) )^{-1} (Ω^{1/2}X)^⊤Ω^{1/2}y = (X^⊤ΩX)^{-1} X^⊤Ωy
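A minimal sketch (not from the text) of the weighted least squares solution above; the weight vector is a hypothetical input. The second function shows the equivalent view of rescaling the rows by √ω_i and running ordinary least squares.

```python
import numpy as np

def fit_wls(X, y, weights):
    """Weighted least squares: w = (X^T Omega X)^{-1} X^T Omega y."""
    Omega = np.diag(weights)
    return np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

def fit_wls_rescaled(X, y, weights):
    """Equivalent: scale rows by sqrt(weights) and solve ordinary least squares."""
    s = np.sqrt(weights)
    w, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)
    return w
```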
Probabilistic View

As in MLE, we assume that our observations y are noisy, but now suppose that some of the y_i's are more noisy than others. How can we take this into account in our learning algorithm so we can get a better estimate of the weights? Our probabilistic model looks like

    Y_i = x_i^⊤w + Z_i

where the Z_i's are still independent Gaussian random variables, but not necessarily identical: Z_i ∼ N(0, σ_i²). Jointly, we have that Z ∼ N(0, Σ_Z), where

    Σ_Z = diag(σ₁², σ₂², ..., σ_n²)

We can morph the problem into an MLE one by scaling the data to make sure all the Z_i's are identically distributed, by dividing by σ_i:

    Y_i/σ_i = x_i^⊤w/σ_i + Z_i/σ_i

Note that the scaled noise entries are now i.i.d.: Z_i/σ_i ∼ N(0, 1). Jointly, we can express this change of coordinates as

    Σ_Z^{-1/2}Y ∼ N( Σ_Z^{-1/2}Xw,  Σ_Z^{-1/2}Σ_Z Σ_Z^{-1/2} ) = N( Σ_Z^{-1/2}Xw, I )

This change of variables is sometimes called the reparameterization trick. Now that the noise is i.i.d. using the change of coordinates, we rewrite our original problem as a scaled MLE problem:

    ŵ = argmin_w Σ_{i=1}^n (y_i − x_i^⊤w)²/(2σ_i²) + n log √(2π)

The MLE estimate of this scaled problem is equivalent to the WLS estimate of the original problem:

    ŵ = ( (Σ_Z^{-1/2}X)^⊤(Σ_Z^{-1/2}X) )^{-1} (Σ_Z^{-1/2}X)^⊤Σ_Z^{-1/2}y = (X^⊤Σ_Z^{-1}X)^{-1} X^⊤Σ_Z^{-1}y

As long as no σ_i is 0, Σ_Z is invertible. Note that ω_i from the optimization perspective is directly related to σ_i² from the probabilistic perspective: ω_i = 1/σ_i². Or at the level of matrices, Ω = Σ_Z^{-1}. As the variance σ_i² of the noise corresponding to data point i decreases, the weight ω_i increases: we are more concerned about fitting data point i because it is likely to match the true underlying de-noised point. Inversely, as the variance σ_i² increases, the weight ω_i decreases: we are less concerned about fitting data point i because it is noisy and should not be trusted.

Generalized Least Squares

Now let's consider the case when the noise random variables are dependent on one another. We have

    Y = Xw + Z

where Z is now a jointly Gaussian random vector. That is,

    Z ∼ N(0, Σ_Z),    Y ∼ N(Xw, Σ_Z)

This problem is known as generalized least squares. Our goal is to maximize the probability of our data over the set of possible w's:

    ŵ_GLS = argmax_w 1/((√(2π))^n √(det(Σ_Z))) · exp( −(1/2)(y − Xw)^⊤Σ_Z^{-1}(y − Xw) )
          = argmin_w (y − Xw)^⊤Σ_Z^{-1}(y − Xw)

The optimization problem is therefore given by

    ŵ_GLS = argmin_w (y − Xw)^⊤Σ_Z^{-1}(y − Xw)

Since Σ_Z is symmetric, we can decompose it into its eigenstructure using the spectral theorem:

    Σ_Z = Q diag(σ₁², σ₂², ..., σ_n²) Q^⊤

where Q is orthonormal. As before with weighted least squares, our goal is to find an appropriate linear transformation so that we can reduce the problem to the i.i.d. case. Consider

    Σ_Z^{-1/2} = Q diag(1/σ₁, 1/σ₂, ..., 1/σ_n) Q^⊤

We can scale the data to morph the problem into an MLE problem with i.i.d. noise variables, by premultiplying the data matrix X and the observation vector y by Σ_Z^{-1/2}. Jointly, we can express this change of coordinates as

    Σ_Z^{-1/2}Y ∼ N( Σ_Z^{-1/2}Xw,  Σ_Z^{-1/2}Σ_Z Σ_Z^{-1/2} ) = N( Σ_Z^{-1/2}Xw, I )

Consequently, in a very similar fashion to the independent noise problem, the MLE of the scaled dependent noise problem is

    ŵ_GLS = (X^⊤Σ_Z^{-1}X)^{-1} X^⊤Σ_Z^{-1}y
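A short sketch (not from the text) of generalized least squares by whitening: premultiply X and y by Σ_Z^{-1/2} and solve OLS on the transformed problem. The covariance passed in is a hypothetical input; with Σ_Z = I this reduces to ordinary least squares.

```python
import numpy as np

def fit_gls(X, y, Sigma_Z):
    """GLS: w = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y, computed via whitening."""
    s, Q = np.linalg.eigh(Sigma_Z)                 # spectral decomposition Q diag(s) Q^T
    W = Q @ np.diag(1.0 / np.sqrt(s)) @ Q.T        # the whitening matrix Sigma_Z^{-1/2}
    Xw, yw = W @ X, W @ y
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
```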
"Ridge Regression" with Dependent Parameters

In the ordinary least squares (OLS) statistical model, we assume that the output Y is a linear function of the input, plus some Gaussian noise. We take this one step further in MAP estimation, where we assume that the weights are a random variable. The new statistical model is

    Y = XW + Z

where Y and Z are n-dimensional random vectors, W is a d-dimensional random vector, and X is a fixed n × d matrix. Note that random vectors are not notationally distinguished from matrices here, so keep in mind what each symbol represents. We have seen that ridge regression can be derived by assuming a prior distribution on W in which the W_j are i.i.d. (univariate) Gaussian, or equivalently,

    W ∼ N(0, I)

But more generally, we can allow W to be any multivariate Gaussian:

    W ∼ N(μ_W, Σ_W)

Recall that we can rewrite a multivariate Gaussian variable as an affine transformation of a standard Gaussian variable:

    W = Σ_W^{1/2}V + μ_W,    V ∼ N(0, I)

Plugging this parameterization into our previous statistical model gives

    Y = X(Σ_W^{1/2}V + μ_W) + Z

But this can be re-written as

    Y − Xμ_W = XΣ_W^{1/2}V + Z

which we see has the form of the statistical problem that underlies traditional ridge regression with λ = 1, and therefore

    v̂ = ( Σ_W^{1/2}X^⊤XΣ_W^{1/2} + I )^{-1} Σ_W^{1/2}X^⊤(y − Xμ_W)

However, V is not what we care about; we need to convert back to the actual weights W in order to make predictions. Since W is completely determined by V (assuming fixed mean and covariance),

    ŵ = μ_W + Σ_W^{1/2}v̂
      = μ_W + Σ_W^{1/2}( Σ_W^{1/2}X^⊤XΣ_W^{1/2} + I )^{-1} Σ_W^{1/2}X^⊤(y − Xμ_W)
      = μ_W + (X^⊤X + Σ_W^{-1})^{-1} X^⊤(y − Xμ_W)

Note that there are two terms: the prior mean μ_W, plus another term that depends on both the data and the prior. The positive-definite precision matrix of W's prior (Σ_W^{-1}) controls how the data-fit error affects our estimate. This is called Tikhonov regularization in the literature, and it generalizes ridge regularization.

To gain intuition, let us consider the simplified case where

    Σ_W = diag(σ₁², σ₂², ..., σ_d²)

When the prior variance σ_j² for dimension j is large, the prior is telling us that W_j may take on a wide range of values. Thus we do not want to penalize that dimension as much, preferring to let the data fit sort it out. And indeed the corresponding entry in Σ_W^{-1} will be small, as desired. Conversely, if σ_j² is small, there is little variance in the value of W_j, so W_j will stay close to its prior mean μ_j. As such we penalize the magnitude of the data-fit contribution to W_j more heavily. If all the σ_j² are the same, then we have traditional ridge regularization.
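A brief sketch (not in the original) of the Tikhonov estimator above; the prior mean and covariance are hypothetical illustrative inputs.

```python
import numpy as np

def fit_tikhonov(X, y, mu_w, Sigma_w):
    """MAP estimate with Gaussian prior N(mu_w, Sigma_w) and unit-variance noise:
    w = mu_w + (X^T X + Sigma_w^{-1})^{-1} X^T (y - X mu_w)."""
    A = X.T @ X + np.linalg.inv(Sigma_w)
    return mu_w + np.linalg.solve(A, X.T @ (y - X @ mu_w))

# With mu_w = 0 and Sigma_w = (1 / lam) * I, this reduces to ridge regression.
```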
Alternative derivation: directly conditioning jointly Gaussian random variables

In an explicitly probabilistic perspective, MAP with colored noise (and known X) can be expressed as

    U, V ∼ N(0, I)                                          (2.6)

    [ Y ]   [ R_Z   XR_W ] [ U ]
    [ W ] = [ 0     R_W  ] [ V ]                            (2.7)

where R_Z and R_W determine Z and W, respectively (Z = R_Z U and W = R_W V). Note that R_W appears twice because our model assumes Y = XW + noise, so if W = R_W V, then we must have Y = XR_W V + noise.

We want to find the posterior W | Y = y. The formulation above makes it relatively easy to find the posterior of Y conditioned on W (see below), but not vice versa. So let's pretend instead that

    U′, V′ ∼ N(0, I)

    [ W ]   [ A   B ] [ U′ ]
    [ Y ] = [ 0   D ] [ V′ ]

Now W | Y = y is straightforward. Since V′ = D^{-1}Y, the conditional mean and variance of W | Y = y can be computed as follows:

    E[W | Y = y] = E[AU′ + BV′ | Y = y]
                 = E[AU′ | Y = y] + E[BD^{-1}Y | Y = y]
                 = A E[U′] + E[BD^{-1}Y | Y = y]
                 = BD^{-1}y

    Var(W | Y = y) = E[(W − E[W])(W − E[W])^⊤ | Y = y]
                   = E[(AU′ + BD^{-1}Y − BD^{-1}Y)(AU′ + BD^{-1}Y − BD^{-1}Y)^⊤ | Y = y]
                   = E[(AU′)(AU′)^⊤ | Y = y]
                   = A E[U′(U′)^⊤] A^⊤
                   = AA^⊤

In both cases above where we drop the conditioning on Y, we are using the fact that U′ is independent of V′ (and thus independent of Y = DV′). Therefore

    W | Y = y ∼ N(BD^{-1}y, AA^⊤)

Recall that a Gaussian distribution is completely specified by its mean and covariance matrix. We see that the covariance matrix of the joint distribution is

    Cov( [W; Y] ) = E[ [W; Y][W^⊤  Y^⊤] ] = [ A  B ] [ A  B ]^⊤ = [ AA^⊤ + BB^⊤   BD^⊤ ] = [ Σ_W    Σ_WY ]
                                            [ 0  D ] [ 0  D ]    [ DB^⊤           DD^⊤ ]   [ Σ_YW   Σ_Y  ]

Matching the corresponding terms, we can express the conditional mean and variance of W | Y = y in terms of these (cross-)covariance matrices:

    BD^{-1}y = BD^⊤D^{-⊤}D^{-1}y = (BD^⊤)(DD^⊤)^{-1}y = Σ_WY Σ_Y^{-1} y

    AA^⊤ = AA^⊤ + BB^⊤ − BB^⊤
         = AA^⊤ + BB^⊤ − BD^⊤D^{-⊤}D^{-1}DB^⊤
         = AA^⊤ + BB^⊤ − (BD^⊤)(DD^⊤)^{-1}DB^⊤
         = Σ_W − Σ_WY Σ_Y^{-1} Σ_YW

We can then apply the same reasoning to the original setup:

    Cov( [Y; W] ) = [ R_Z R_Z^⊤ + XR_W R_W^⊤X^⊤    XR_W R_W^⊤ ] = [ Σ_Y    Σ_YW ]
                    [ R_W R_W^⊤X^⊤                 R_W R_W^⊤  ]   [ Σ_WY   Σ_W  ]

Therefore, after defining Σ_Z = R_Z R_Z^⊤, we can read off

    Σ_W = R_W R_W^⊤
    Σ_Y = Σ_Z + XΣ_W X^⊤
    Σ_YW = XΣ_W
    Σ_WY = Σ_W X^⊤

Plugging this into our estimator yields

    ŵ = E[W | Y = y] = Σ_WY Σ_Y^{-1} y = Σ_W X^⊤(Σ_Z + XΣ_W X^⊤)^{-1} y

One may be concerned because this expression does not take the form we expect: the inverted matrix is hitting y directly, unlike in other solutions we've seen. Although this form will turn out to be quite informative when we introduce the idea of the kernel trick in machine learning, it is still disconcertingly different from what we are used to. However, by using a lot of algebra together with the Woodbury matrix identity, (A + UCV)^{-1} = A^{-1} − A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}, it turns out that we can rewrite this expression as

    ŵ = (X^⊤Σ_Z^{-1}X + Σ_W^{-1})^{-1} X^⊤Σ_Z^{-1}y

which looks more familiar. In fact, you can recognize this as the general solution when we have both a generic Gaussian prior on the parameters and colored noise in the observations.

Summary of Linear Gaussian Statistical Models

We have seen a number of related linear models, with varying assumptions about the randomness in the observations and the weights. We summarize the resulting estimates ŵ below:

    Prior on W            Noise Z ∼ N(0, I)                                Noise Z ∼ N(0, Σ_Z)
    No prior              (X^⊤X)^{-1}X^⊤y                                  (X^⊤Σ_Z^{-1}X)^{-1}X^⊤Σ_Z^{-1}y
    W ∼ N(0, λ^{-1}I)     (X^⊤X + λI)^{-1}X^⊤y                             (X^⊤Σ_Z^{-1}X + λI)^{-1}X^⊤Σ_Z^{-1}y
    W ∼ N(μ_W, Σ_W)       μ_W + (X^⊤X + Σ_W^{-1})^{-1}X^⊤(y − Xμ_W)        μ_W + (X^⊤Σ_Z^{-1}X + Σ_W^{-1})^{-1}X^⊤Σ_Z^{-1}(y − Xμ_W)

2.5 Kernels and Ridge Regression

In ridge regression, we are given a vector y ∈ R^n and a matrix X ∈ R^{n×ℓ}, where n is the number of training points and ℓ is the dimension of the raw data points. In most settings we don't want to work with just the raw feature space, so we augment features to the data points and replace X with Φ ∈ R^{n×d}, whose ith row is φ_i^⊤ = φ(x_i)^⊤ with φ(x_i) ∈ R^d. Then we solve a well-defined optimization problem that involves Φ and y, over the parameters w ∈ R^d. Note the problem that arises here. If we have polynomial features of degree at most p in the raw ℓ-dimensional space, then there are d = (ℓ+p choose p) terms that we need to optimize, which can be very, very large (much larger than the number of training points n). Wouldn't it be useful if, instead of solving an optimization problem over d variables, we could solve an equivalent problem over (potentially much smaller) n variables, and achieve a computational runtime independent of the number of augmented features? As it turns out, the concept of kernels (in addition to a technique called the kernel trick) will allow us to achieve this goal.

Recall the solution to ridge regression:

    w* = (Φ^⊤Φ + λI)^{-1}Φ^⊤y

This operation involves calculating Φ^⊤Φ, which is a d × d matrix and takes O(d²n) time to compute. The matrix inversion operation takes an additional O(d³) time to compute. What we would really like is to have an n × n matrix that takes O(n³) to invert. Here's a simple observation: if we flip the order of Φ^⊤ and Φ, we end up with an n × n matrix ΦΦ^⊤. In fact, the matrix ΦΦ^⊤ has a very intuitive meaning: it is the matrix of inner products between all of the augmented datapoints, which in loose terms measures the "similarity" among the datapoints and captures their relationship.
Now let's see if we could somehow express the solution to ridge regression using the matrix ΦΦ^⊤. For simplicity of notation, let's revert back to using X instead of Φ (pretend that we are only working with raw features; our analysis of kernel ridge regression still holds if we use just the raw features). Rearranging the terms of the original ridge regression solution, we have

    w = (X^⊤X + λI)^{-1}X^⊤y
    (X^⊤X + λI)w = X^⊤y
    X^⊤Xw + λw = X^⊤y
    λw = X^⊤y − X^⊤Xw
    w = (1/λ)(X^⊤y − X^⊤Xw)
    w = X^⊤ · (1/λ)(y − Xw)
    w = X^⊤v

which says that whatever w is, it is some linear combination of the training points x_i (because anything of the form X^⊤v is a linear combination of the columns of X^⊤, which are the training points). To find w it suffices to find v, where w = X^⊤v. Recall that the relationship we have to satisfy is X^⊤Xw + λw = X^⊤y. Let's assume that we had v, and just substitute X^⊤v in for all the w's:

    X^⊤X(X^⊤v) + λ(X^⊤v) = X^⊤y
    X^⊤XX^⊤v + X^⊤(λv) = X^⊤y
    X^⊤(XX^⊤v + λv) = X^⊤(y)

We can't yet isolate v and have a closed-form solution for it, but we can make the observation that if we found a v such that we had

    XX^⊤v + λv = y

then that v would also satisfy the above equation. Note that we did not "cancel the X^⊤'s on both sides of the equation." We saw that having v satisfy one equation implied that it satisfied the other as well. So, indeed, we can isolate v in this new equation:

    (XX^⊤ + λI)v = y  ⟹  v = (XX^⊤ + λI)^{-1}y

and have that the v which satisfies this equation will be such that X^⊤v equals w. We conclude that the optimal w is

    w* = X^⊤v* = X^⊤(XX^⊤ + λI)^{-1}y

Recall that previously, we derived ridge regression and ended up with

    w* = (X^⊤X + λI)^{-1}X^⊤y

In fact, these two are equivalent expressions! The question that now arises is which expression should you pick? Which is more efficient to calculate? We will answer this question after we introduce kernels.
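The equivalence of the two expressions is easy to check numerically; here is a short sketch (not from the text) comparing the d × d and n × n forms on hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 200, 0.1             # more features than points, so the n x n form is cheaper
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lam I)^{-1} X^T y
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)     # X^T (X X^T + lam I)^{-1} y
print(np.allclose(w_primal, w_dual))  # True: the two expressions agree
```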
Linear Algebra Derivation

The previous derivation involved using some intuitive manipulations to achieve the desired answer. Let's formalize our derivation using more principled arguments from linear algebra and optimization. Before we do so, we must first introduce the Fundamental Theorem of Linear Algebra (FTLA): Suppose that there is a matrix (linear map) X that maps R^ℓ to R^n. Denote N(X) as the nullspace of X, and R(X) as the range of X. Then the following properties hold:

1. N(X) ⊕ R(X^⊤) = R^ℓ and N(X) ⊥ R(X^⊤). The symbol ⊕ indicates that we are taking a direct sum of N(X) and R(X^⊤), which means that for every u ∈ R^ℓ there exist unique elements u₁ ∈ N(X) and u₂ ∈ R(X^⊤) such that u = u₁ + u₂. Furthermore, the symbol ⊥ indicates that N(X) and R(X^⊤) are orthogonal subspaces. Similarly, N(X^⊤) ⊕ R(X) = R^n by symmetry.
2. N(X^⊤X) = N(X), and N(XX^⊤) = N(X^⊤) by symmetry.
3. R(X^⊤X) = R(X^⊤), and R(XX^⊤) = R(X) by symmetry.

Here's where FTLA comes in, in the context of kernel ridge regression. We know that we can express any w ∈ R^ℓ as a unique combination w = w₁ + w₂, where w₁ ∈ R(X^⊤) and w₂ ∈ N(X). Equivalently, we can express this as w = X^⊤v + r, where v ∈ R^n and r ∈ N(X). Now, instead of optimizing over w ∈ R^ℓ, we can optimize over v ∈ R^n and r ∈ R^ℓ, which equates to optimizing over n + ℓ variables. However, as we shall see, the optimization over r will be trivial, so we just have to optimize an n-dimensional problem.

We know that w = X^⊤v + r, where v ∈ R^n and r ∈ N(X). Let's now solve ridge regression by optimizing over the variables v and r instead of w:

    v*, r* = argmin_{v∈R^n, r∈N(X)} ‖Xw − y‖₂² + λ‖w‖₂²
           = argmin_{v∈R^n, r∈N(X)} ‖X(X^⊤v + r) − y‖₂² + λ‖X^⊤v + r‖₂²
           = argmin_{v∈R^n, r∈N(X)} ‖XX^⊤v + Xr − y‖₂² + λ‖X^⊤v + r‖₂²
           = argmin_{v∈R^n, r∈N(X)} ( v^⊤XX^⊤XX^⊤v − 2v^⊤XX^⊤y + y^⊤y ) + λ( v^⊤XX^⊤v + 2v^⊤Xr + r^⊤r )
           = argmin_{v∈R^n, r∈N(X)} ( v^⊤XX^⊤XX^⊤v − 2v^⊤XX^⊤y + y^⊤y ) + λ( v^⊤XX^⊤v + r^⊤r )

We crossed out Xr and 2v^⊤Xr because r ∈ N(X) and therefore Xr = 0. Now we are optimizing over L(v, r), which is jointly convex in v and r, because its Hessian is PSD. Let's show that this is indeed the case:

    ∇²_r L(v, r) = 2λI ≻ 0
    ∇_r ∇_v L(v, r) = ∇_v ∇_r L(v, r) = 0
    ∇²_v L(v, r) = 2XX^⊤XX^⊤ + 2λXX^⊤ ⪰ 0

Since the cross terms of the Hessian are 0, it suffices that ∇²_v L(v, r) and ∇²_r L(v, r) are PSD to establish joint convexity. With joint convexity established, we can set the gradient to 0 w.r.t. r and v and obtain the global minimum:

    ∇_r L(v, r*) = 2λr* = 0  ⟹  r* = 0

Setting the gradient with respect to v to zero as well, ∇_v L(v*, 0) = 2XX^⊤(XX^⊤v* + λv* − y) = 0, is satisfied by v* = (XX^⊤ + λI)^{-1}y, so that w* = X^⊤v* = X^⊤(XX^⊤ + λI)^{-1}y, matching the expression we derived earlier.