Gradient Descent Algorithm
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 4 + 3x + Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Augment X with a column of ones for the intercept term
X_b = np.c_[np.ones((100, 1)), X]

# Initialize parameters
learning_rate = 0.1
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(2, 1)

def compute_cost(X_b, y, theta):
    predictions = X_b.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# Gradient Descent
cost_history = []
for iteration in range(n_iterations):
    gradients = (1 / m) * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients
    cost_history.append(compute_cost(X_b, y, theta))

# Plot the data and the fitted line
plt.scatter(X, y, label="Data")
plt.plot(X, X_b.dot(theta), "r-", label="Prediction")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

# Plot the cost over the iterations
plt.figure()
plt.plot(cost_history)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

print(f"Intercept: {theta[0][0]}")
print(f"Slope: {theta[1][0]}")
Let's break down why \( X_b \cdot \theta \) is part of the gradient calculation in linear regression.
The linear regression model predicts the output \( \hat{y} \) using the equation:
\[ \hat{y} = \theta_0 + \theta_1 x \]
In vectorized form:
\[ \hat{\mathbf{y}} = X_b \cdot \theta \]
where:
\( \hat{\mathbf{y}} \) is the vector of predicted values, \( X_b \) is the augmented feature matrix (including a column of ones for the bias (intercept) term), and \( \theta \) is the vector of parameters (including the intercept and slope).
Cost Function:
The cost function (Mean Squared Error) for linear regression is:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
where:
\( m \) is the number of training examples, \( h_\theta(x^{(i)}) \) is the prediction for the \( i \)-th example, and \( y^{(i)} \) is its actual value.
To minimize the cost function, we need to compute its gradient with respect to the parameters \( \theta \). The gradient is the vector of partial derivatives of the cost function with respect to each parameter.
The gradient of the cost function is:
\[ \nabla_\theta J(\theta) = \frac{1}{m} X_b^T (X_b \cdot \theta - y) \]
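One way to convince yourself this formula is right is to compare it against a numerical finite-difference gradient. This is only a sanity-check sketch; the synthetic data, seed, and epsilon are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X_b = np.c_[np.ones((20, 1)), 2 * rng.random((20, 1))]
y = 4 + 3 * X_b[:, 1:2] + rng.standard_normal((20, 1))
theta = rng.standard_normal((2, 1))
m = len(X_b)

def cost(t):
    return (1 / (2 * m)) * np.sum((X_b.dot(t) - y) ** 2)

# Analytic gradient from the formula above
grad = (1 / m) * X_b.T.dot(X_b.dot(theta) - y)

# Numerical gradient via central differences
eps = 1e-6
num_grad = np.zeros_like(theta)
for j in range(len(theta)):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[j] += eps
    t_minus[j] -= eps
    num_grad[j] = (cost(t_plus) - cost(t_minus)) / (2 * eps)

print(np.max(np.abs(grad - num_grad)))  # essentially zero
```

Because the cost is quadratic in \( \theta \), the central difference is exact up to floating-point rounding, so the two gradients agree very closely.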
Breaking Down the Gradient Calculation:
Predicted Values: \( X_b \cdot \theta \) gives the predicted values for all training examples. This is the hypothesis function \( h_\theta(x) \).
Error Term: \( X_b \cdot \theta - y \) gives the difference between the predicted values and the actual values (the error term).
Gradient Calculation: \( X_b^T (X_b \cdot \theta - y) \) computes the dot product of the transpose of \( X_b \) and the error term. This gives the sum of the gradients for all training examples.
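That "sum of the gradients for all training examples" can be checked directly: the single matrix product equals an explicit loop that accumulates each example's contribution. The values below are arbitrary synthetic data for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X_b = np.c_[np.ones((5, 1)), rng.random((5, 1))]
y = rng.random((5, 1))
theta = rng.standard_normal((2, 1))

error = X_b.dot(theta) - y            # per-example errors, shape (5, 1)

# Vectorized: one matrix product sums over all examples
vec = X_b.T.dot(error)

# Loop: accumulate each example's contribution error_i * x_i
acc = np.zeros((2, 1))
for i in range(len(X_b)):
    acc += error[i, 0] * X_b[i].reshape(2, 1)

print(np.allclose(vec, acc))  # True
```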
Summary:
See the code: X_b.dot(theta) computes \( mx + c \) for every training example. \( X_b \) is the matrix whose first column is all ones (for the intercept \( c \)) and whose second column holds the x values, while theta holds the intercept and slope. At every iteration step we update theta to a new value.
This line calculates the gradient of the cost function with respect to the parameters \( \theta \). The gradients indicate how much the cost function would change if we adjusted the parameters slightly. By updating the parameters in the opposite direction of the gradients, we minimize the cost function.
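A minimal sketch of a single update step, showing that moving opposite the gradient lowers the cost; the learning rate and data here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X_b = np.c_[np.ones((50, 1)), 2 * rng.random((50, 1))]
y = 4 + 3 * X_b[:, 1:2] + rng.standard_normal((50, 1))
m = len(X_b)

def cost(t):
    return (1 / (2 * m)) * np.sum((X_b.dot(t) - y) ** 2)

theta = np.zeros((2, 1))
before = cost(theta)

# One step in the direction opposite the gradient
grad = (1 / m) * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - 0.1 * grad
after = cost(theta)

print(after < before)  # True for this small step
```

With too large a learning rate the step can overshoot and increase the cost; 0.1 is safely small for this data.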
Summary:
• The error term \( X_b \cdot \theta - y \) is used to calculate how far off the predictions are from the actual values.
• The gradient calculation uses this error term to determine the direction and magnitude of the parameter updates needed to minimize the cost function.
Derivation
1. Cost Function: The cost function (Mean Squared Error) for linear regression is:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
2. Substitute the Hypothesis Function: with \( h_\theta(x^{(i)}) = X_b^{(i)} \cdot \theta \), the cost becomes:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( X_b^{(i)} \cdot \theta - y^{(i)} \right)^2 \]
3. Differentiate with Respect to \( \theta_j \): applying the chain rule to the squared term gives:
\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( X_b^{(i)} \cdot \theta - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( X_b^{(i)} \cdot \theta - y^{(i)} \right) \]
4. Differentiate the Inner Term: the derivative of \( X_b^{(i)} \cdot \theta - y^{(i)} \) with respect to \( \theta_j \) is \( X_{b,j}^{(i)} \), the \( j \)-th feature of the \( i \)-th example.
5. Simplify the Derivative:
\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( X_b^{(i)} \cdot \theta - y^{(i)} \right) \cdot X_{b,j}^{(i)} \]
6. Vectorize the Gradient: we can write the gradient for all parameters \( \theta \) in vectorized form:
\[ \nabla_\theta J(\theta) = \frac{1}{m} X_b^T (X_b \cdot \theta - y) \]
Thus the gradient of the cost function with respect to the parameters \( \theta \) is \( \frac{1}{m} X_b^T (X_b \cdot \theta - y) \), matching the expression used in the code.
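As a final sanity check on the derived update rule, gradient descent built from this gradient should converge to the ordinary least-squares solution, where the gradient vanishes. This sketch uses synthetic data; the iteration count and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
X_b = np.c_[np.ones((200, 1)), 2 * rng.random((200, 1))]
y = 4 + 3 * X_b[:, 1:2] + rng.standard_normal((200, 1))
m = len(X_b)

# Run gradient descent with the derived update rule
theta = np.zeros((2, 1))
for _ in range(5000):
    theta -= 0.1 * (1 / m) * X_b.T.dot(X_b.dot(theta) - y)

# The closed-form least-squares solution, where the gradient is zero
theta_exact = np.linalg.lstsq(X_b, y, rcond=None)[0]

print(np.allclose(theta, theta_exact, atol=1e-4))  # True
```

Both approaches recover parameters close to the true intercept 4 and slope 3, up to the noise in the data.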