Gradient Descent Algorithm


import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data
np.random.seed(0)  # Ensure reproducibility
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add a column of ones to X to account for the bias term (intercept)
X_b = np.c_[np.ones((100, 1)), X]

# Initialize parameters
theta = np.random.randn(2, 1)  # Random initialization
learning_rate = 0.1
n_iterations = 1000
m = len(X_b)  # Number of training examples
tolerance = 1e-3  # Stop when the cost changes by less than this between iterations

# Function to compute the cost (Mean Squared Error with a 1/2 factor)
def compute_cost(X, y, theta):
    n_examples = len(y)  # Use the data passed in rather than the global m
    predictions = X.dot(theta)
    cost = (1 / (2 * n_examples)) * np.sum(np.square(predictions - y))
    return cost

# Gradient Descent
cost_history = []  # To store the cost at each iteration
selected_steps = [0, 3, 5, 11]  # Steps at which to plot the fitting process

plt.figure(figsize=(10, 8))
plt.plot(X, y, "b.", label='Data Points')

for iteration in range(n_iterations):
    # Gradient of the plain MSE, (1/m) * sum(error^2). This is twice the gradient
    # of compute_cost, which only rescales the effective learning rate.
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients
    cost = compute_cost(X_b, y, theta)
    cost_history.append(cost)

    # Plot the fitting process at selected steps
    if iteration in selected_steps:
        plt.plot(X, X_b.dot(theta), label=f'Iteration {iteration}', linestyle='--', linewidth=2)

    # Check for convergence
    if iteration > 0 and abs(cost_history[-2] - cost_history[-1]) < tolerance:
        # iteration is zero-based, so iteration + 1 updates have been applied
        print(f"Converged after {iteration + 1} iterations.")
        break

plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression Fitting Process at Selected Steps")
plt.legend()
plt.grid(True)  # Added grid for better readability
plt.show()

# Plot the cost function vs. iterations
plt.figure()
plt.plot(range(len(cost_history)), cost_history, 'b-', linewidth=2)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Cost Function vs. Iterations')
plt.grid(True)  # Added grid for better readability
plt.show()

# Print the final parameters (intercept and slope)
print(f"Intercept: {theta[0][0]}")
print(f"Slope: {theta[1][0]}")

Let's break down why \(X_b \theta\) is part of the gradient calculation in linear regression.

Linear Regression Model:

The linear regression model predicts the output \(\hat{y}\) using the equation:

\[\hat{y} = \theta_0 + \theta_1 x\]

In matrix form, this can be written as:

\[\hat{y} = X_b \theta\]

where:

\(\hat{y}\) is the vector of predicted values, \(X_b\) is the augmented feature matrix (the
features plus a leading column of ones for the intercept term), and \(\theta\) is the vector
of parameters (the intercept \(\theta_0\) and the slope \(\theta_1\)).
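
As a minimal sketch of the matrix form, the snippet below builds the augmented matrix for three made-up inputs and checks that the matrix product gives the same predictions as the scalar equation. The input values and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy inputs and parameters, chosen only for illustration
x = np.array([[0.0], [1.0], [2.0]])   # three examples, one feature
theta = np.array([[4.0], [3.0]])      # [intercept, slope]

# Augment with a column of ones so the intercept is handled by the matrix product
X_b = np.c_[np.ones((len(x), 1)), x]

y_hat_matrix = X_b.dot(theta)                  # matrix form: X_b . theta
y_hat_scalar = theta[0, 0] + theta[1, 0] * x   # scalar form applied to each example

print(y_hat_matrix.ravel())
print(np.allclose(y_hat_matrix, y_hat_scalar))
```

The column of ones is what lets a single matrix product add the intercept to every prediction.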

Cost Function:

The cost function (Mean Squared Error) for linear regression is:

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\]

where:

\(m\) is the number of training examples and \(h_\theta(x)\) is the hypothesis (predicted
value), which is \(X_b \theta\).
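
The cost formula can be evaluated by hand on a tiny dataset. The sketch below uses three made-up points that lie exactly on the line 4 + 3x; the data and the function name `half_mse_cost` are assumptions for illustration, not part of the original code.

```python
import numpy as np

def half_mse_cost(X_b, y, theta):
    """Cost J(theta) = (1 / (2m)) * sum of squared prediction errors."""
    m = len(y)
    errors = X_b.dot(theta) - y
    return float(np.sum(errors ** 2) / (2 * m))

# Hypothetical toy data lying exactly on y = 4 + 3x
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[4.0], [7.0], [10.0]])
X_b = np.c_[np.ones((3, 1)), x]

cost_zero = half_mse_cost(X_b, y, np.zeros((2, 1)))            # all-zero parameters
cost_true = half_mse_cost(X_b, y, np.array([[4.0], [3.0]]))    # exact parameters

print(cost_zero, cost_true)
```

With all-zero parameters the errors are -4, -7 and -10, so the cost is (16 + 49 + 100) / 6 = 27.5. With the exact parameters every error is zero and so is the cost.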

Gradient of the Cost Function:

To minimize the cost function, we need to compute its gradient with respect to the
parameters \(\theta\). The gradient is the vector of partial derivatives of the cost function
with respect to each parameter.
The gradient of the cost function is:

\[\nabla_\theta J(\theta) = \frac{1}{m} X_b^T (X_b \theta - y)\]

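
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below does this on randomly generated data; the helper names, the step size `eps`, and the data are assumptions made for the example.

```python
import numpy as np

def cost(X_b, y, theta):
    """Cost with the 1/(2m) factor."""
    m = len(y)
    errors = X_b.dot(theta) - y
    return float(np.sum(errors ** 2) / (2 * m))

def analytic_gradient(X_b, y, theta):
    """Gradient from the formula (1/m) * X_b^T (X_b theta - y)."""
    m = len(y)
    return X_b.T.dot(X_b.dot(theta) - y) / m

def numeric_gradient(X_b, y, theta, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = eps
        grad[j, 0] = (cost(X_b, y, theta + step) - cost(X_b, y, theta - step)) / (2 * eps)
    return grad

# Hypothetical synthetic data, similar in spirit to the sample data in the code
rng = np.random.default_rng(0)
x = 2 * rng.random((20, 1))
y = 4 + 3 * x + rng.standard_normal((20, 1))
X_b = np.c_[np.ones((20, 1)), x]
theta = rng.standard_normal((2, 1))

g_analytic = analytic_gradient(X_b, y, theta)
g_numeric = numeric_gradient(X_b, y, theta)
print(np.allclose(g_analytic, g_numeric, atol=1e-5))
```

Because the cost is quadratic in the parameters, the central difference agrees with the formula up to floating-point rounding.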
Breaking Down the Gradient Calculation:

Predicted Values:

\(X_b \theta\) gives the predicted values for all training examples. This is the hypothesis
function \(h_\theta(x)\).

Error Term:

\(X_b \theta - y\) gives the difference between the predicted values and the actual values
(error term).

Gradient Calculation:

\(X_b^T (X_b \theta - y)\) computes the matrix product of the transpose of \(X_b\) and the
error term. This gives the sum of the gradients for all training examples.

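Averaging and Scaling:

The factor \(\frac{2}{m}\) used in the code averages over the \(m\) training examples and
carries the factor of 2 that comes from differentiating the squared error. It is the gradient
scaling for the plain Mean Squared Error, \(\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2\).
With the \(\frac{1}{2m}\) convention used for \(J(\theta)\), the 2 cancels and the factor is
\(\frac{1}{m}\). The two gradients differ only by a constant factor of 2, which acts like a
change in the learning rate.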
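
The sketch below compares the \(\frac{2}{m}\) and \(\frac{1}{m}\) scalings of the gradient on made-up data. It checks that one is exactly twice the other, and that a step with learning rate 0.1 under the first scaling lands on the same parameters as a step with learning rate 0.2 under the second. The data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical synthetic data
rng = np.random.default_rng(1)
x = 2 * rng.random((50, 1))
y = 4 + 3 * x + rng.standard_normal((50, 1))
X_b = np.c_[np.ones((50, 1)), x]
m = len(y)
theta = np.zeros((2, 1))

error = X_b.dot(theta) - y
grad_half_mse = X_b.T.dot(error) / m    # gradient of (1/(2m)) * sum(error^2)
grad_mse = 2 * X_b.T.dot(error) / m     # gradient of (1/m) * sum(error^2)

# One update step under each convention
theta_a = theta - 0.1 * grad_mse
theta_b = theta - 0.2 * grad_half_mse

print(np.allclose(grad_mse, 2 * grad_half_mse))
print(np.allclose(theta_a, theta_b))
```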

Complete Gradient Calculation:

\[\text{gradients} = \frac{2}{m} X_b^T (X_b \theta - y)\]

This line calculates the gradient of the cost function with respect to the parameters \(\theta\).
The gradients indicate how much the cost function would change if we adjusted the
parameters slightly. By updating the parameters in the opposite direction of the
gradients, we minimize the cost function.
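
To illustrate why the update moves against the gradient, the sketch below takes a single step in each direction on a tiny made-up dataset and compares the resulting costs. The data, the starting point, and the learning rate of 0.1 are assumptions for the example.

```python
import numpy as np

def cost(X_b, y, theta):
    """Cost with the 1/(2m) factor."""
    m = len(y)
    errors = X_b.dot(theta) - y
    return float(np.sum(errors ** 2) / (2 * m))

# Hypothetical toy data lying exactly on y = 4 + 3x
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[4.0], [7.0], [10.0]])
X_b = np.c_[np.ones((3, 1)), x]
m = len(y)

theta = np.zeros((2, 1))
learning_rate = 0.1
gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)

cost_before = cost(X_b, y, theta)
cost_after = cost(X_b, y, theta - learning_rate * gradients)      # against the gradient
cost_wrong_way = cost(X_b, y, theta + learning_rate * gradients)  # along the gradient

print(cost_before, cost_after, cost_wrong_way)
```

Stepping against the gradient lowers the cost here, while stepping along it raises the cost. A learning rate that is too large could still overshoot, so this is a check on one step, not a guarantee.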

Summary:

• \(X_b \theta\) represents the predicted values.

• The error term \(X_b \theta - y\) is used to calculate how far off the predictions are
from the actual values.

• The gradient calculation uses this error term to determine the direction and
magnitude of the parameter updates needed to minimize the cost function.
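Python: gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)

See the code. X_b.dot(theta) computes the line c + mx for every data point, where m here
means the slope and c the intercept (not the number of examples). X_b is a matrix whose
first column is all ones and whose second column holds the x values, and theta holds the
intercept and the slope. Every iteration recomputes X_b.dot(theta) with the current theta,
then updates theta to give a new theta.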

Derivation

Cost Function:

The cost function (Mean Squared Error) for linear regression is:

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\]

where:

• \(m\) is the number of training examples.

• \(h_\theta(x)\) is the hypothesis (predicted value), which is \(X_b \theta\).

• \(y\) is the actual value.

Hypothesis Function:

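The hypothesis function for linear regression is: \[h_\theta(x) = X_b \theta\]

For a single training example this reads \(h_\theta(x^{(i)}) = X_b^{(i)} \theta\), where
\(X_b^{(i)}\) is the \(i\)-th row of \(X_b\).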

Gradient of the Cost Function:

To minimize the cost function, we need to compute its gradient with respect to the
parameters \(\theta\). The gradient is the vector of partial derivatives of the cost function
with respect to each parameter.

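Step-by-Step Derivation for gradient equation:

1. Cost Function:

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\]

2. Substitute Hypothesis Function:

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right)^2\]

3. Compute Partial Derivative with Respect to \(\theta\): We need to compute the
partial derivative of \(J(\theta)\) with respect to each parameter \(\theta_j\):

\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \left( \frac{1}{2m} \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right)^2 \right)\]

4. Apply Chain Rule: Using the chain rule, we get:

\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right) \cdot \frac{\partial}{\partial \theta_j} \left(X_b^{(i)} \theta - y^{(i)}\right)\]

5. Simplify the Derivative: The derivative of \(X_b^{(i)} \theta - y^{(i)}\) with respect to
\(\theta_j\) is \(X_{b,j}^{(i)}\), the \(j\)-th entry of row \(i\) of \(X_b\):

\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(X_b^{(i)} \theta - y^{(i)}\right) \cdot X_{b,j}^{(i)}\]

6. Vectorize the Gradient: We can write the gradient for all parameters \(\theta\) in
vectorized form:

\[\nabla_\theta J(\theta) = \frac{1}{m} X_b^T (X_b \theta - y)\]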
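
The step from the per-parameter sum to the vectorized form can be checked numerically. The sketch below computes each partial derivative with explicit loops over examples and parameters, then compares the result with the matrix expression. The data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical synthetic data
rng = np.random.default_rng(2)
x = 2 * rng.random((10, 1))
y = 4 + 3 * x + rng.standard_normal((10, 1))
X_b = np.c_[np.ones((10, 1)), x]
theta = rng.standard_normal((2, 1))
m, n_params = X_b.shape

# Per-parameter form: (1/m) * sum_i (X_b[i] . theta - y[i]) * X_b[i, j]
grad_loop = np.zeros((n_params, 1))
for j in range(n_params):
    total = 0.0
    for i in range(m):
        residual_i = X_b[i].dot(theta)[0] - y[i, 0]
        total += residual_i * X_b[i, j]
    grad_loop[j, 0] = total / m

# Vectorized form: (1/m) * X_b^T (X_b theta - y)
grad_vec = X_b.T.dot(X_b.dot(theta) - y) / m

print(np.allclose(grad_loop, grad_vec))
```

Row \(j\) of \(X_b^T\) multiplied by the error vector is exactly the sum over examples in the per-parameter form, which is why the two results agree.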

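Final Gradient Expression:

The gradient of the cost function \(J(\theta)\) with respect to the parameters \(\theta\) is:

\[\nabla_\theta J(\theta) = \frac{1}{m} X_b^T (X_b \theta - y)\]

The code uses \(\frac{2}{m} X_b^T (X_b \theta - y)\) instead. That is the gradient of the
cost without the \(\frac{1}{2}\) factor, \(\frac{1}{m} \sum_{i=1}^{m} (X_b^{(i)} \theta - y^{(i)})^2 = 2 J(\theta)\).
The two expressions point in the same direction and differ only by a constant factor of 2,
which has the same effect as doubling the learning rate.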
