
ML Lab

This document outlines various experiments using the California Housing Dataset, including linear regression, binary classification, and K-Nearest Neighbors (KNN) classification. Each experiment details the aim, procedure, and result, demonstrating the implementation of different machine learning techniques for predicting housing prices and classifying data. The final experiments implement the k-means algorithm on the Codon Usage dataset and a Naïve Bayes classifier on the Gait Classification dataset.


Ex.no:1
Linear Regression with the California Housing Dataset
Date:

Aim:
Build a linear regression model to predict housing prices using the California Housing Dataset.
Experiment with different features and tune the model's hyperparameters.

Procedure:
Step 1: Load the Dataset

 Download the dataset from Kaggle.

 Save the dataset as housing.csv.

import pandas as pd

# Load dataset

df = pd.read_csv('housing.csv')

df.head()

Step 2: Explore the Data

 Inspect the dataset to understand its structure and features.

df.info()

df.describe()

Step 3: Preprocess the Data

 Check for and handle missing values.


 Encode categorical variables if necessary

df.isnull().sum()

# No missing values in this dataset
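
If the downloaded CSV does contain gaps or categorical columns, a minimal preprocessing sketch could look like the following. This assumes the Kaggle housing.csv, which typically has a 'total_bedrooms' column with missing values and a categorical 'ocean_proximity' column; adjust the column names to the file actually in use.

# Hedged sketch: fill missing numeric values and one-hot encode a categorical column
# (assumes 'total_bedrooms' and 'ocean_proximity' exist in the loaded CSV)
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=True)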

Step 4: Feature Selection

 Choose relevant features based on domain knowledge or correlation analysis.

import seaborn as sns

import matplotlib.pyplot as plt

# Correlation matrix

plt.figure(figsize=(10, 8))

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

plt.show()

Step 5: Split the Data

 Divide the dataset into training and testing sets.

from sklearn.model_selection import train_test_split

X = df[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',


'Latitude', 'Longitude']]

y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 6: Build and Train the Linear Regression Model

 Use `LinearRegression` from scikit-learn to build and train the model.

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

Step 7: Evaluate the Model

 Predict on the test set and evaluate using mean squared error (MSE).

from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
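
MSE alone can be hard to interpret, so the root mean squared error and the R² score (both computable with NumPy and scikit-learn) may optionally be reported alongside it:

# Optional: report RMSE and R² for easier interpretation of the error
import numpy as np
from sklearn.metrics import r2_score

rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f'RMSE: {rmse:.3f}, R^2: {r2:.3f}')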

Step 8: Hyperparameter Tuning

 Use Ridge (L2) regression and tune the hyperparameter `alpha` using GridSearchCV.

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Ridge

param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}

grid_search = GridSearchCV(Ridge(), param_grid, cv=5,


scoring='neg_mean_squared_error')

grid_search.fit(X_train, y_train)

print(f'Best parameters: {grid_search.best_params_}')
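
Because the grid search uses scoring='neg_mean_squared_error', the best cross-validated MSE is the negation of best_score_; it can optionally be printed as well:

# Optional: best cross-validated MSE found by the grid search
print(f'Best cross-validated MSE: {-grid_search.best_score_:.4f}')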

Step 9: Final Model Training with Best Hyperparameters

 Train the Ridge regression model with the best hyperparameters.

best_model = Ridge(alpha=grid_search.best_params_['alpha'])

best_model.fit(X_train, y_train)

# Evaluate the final model

y_pred_final = best_model.predict(X_test)

mse_final = mean_squared_error(y_test, y_pred_final)

print(f'Final Mean Squared Error: {mse_final}')

Result:

A linear regression model to predict housing prices using the California Housing Dataset was implemented successfully.
Ex.no:2
Binary Classification with the California Housing Dataset
Date:

Aim:
Implement a binary classification model to predict whether houses in a neighborhood are above a certain
price threshold.
Procedure:
Step 1: Load the Dataset

 Download the dataset from Kaggle.

 Save the dataset as housing.csv.

import pandas as pd

# Load dataset

df = pd.read_csv('housing.csv')

df.head()

Step 2: Explore the Data

 Inspect the dataset to understand its structure and features.

df.info()

df.describe()

Step 3: Preprocess the Data

 Check for and handle missing values.


 Encode categorical variables if necessary

df.isnull().sum()

# No missing values in this dataset

Step 4: Define the Binary Target


 Create a new binary target variable indicating whether the median house value is above a certain
price threshold. For example, use $200,000 as the threshold.

threshold = 200000

df['Above_Threshold'] = (df['MedHouseVal'] > threshold).astype(int)
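
Before modelling, it can help to check how balanced the two classes are, since a strong imbalance affects how accuracy should be interpreted; a quick check:

# Check the class balance of the new binary target
print(df['Above_Threshold'].value_counts(normalize=True))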


Step 5: Feature Selection
 Choose relevant features based on domain knowledge or correlation analysis.

import seaborn as sns

import matplotlib.pyplot as plt

# Correlation matrix

plt.figure(figsize=(10, 8))

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

plt.show()

Step 6: Split the data


 Divide the dataset into training and testing sets.

from sklearn.model_selection import train_test_split

X = df[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']]

y = df['Above_Threshold']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 7: Build and Train the Logistic Regression Model

 Use `LogisticRegression` from scikit-learn to build and train the model.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # use logistic regression for binary classification; a higher max_iter helps convergence

model.fit(X_train, y_train)
Step 8: Evaluate the Model
 Predict on the test set and evaluate using accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')

print(f'Precision: {precision}')

print(f'Recall: {recall}')

print(f'F1 Score: {f1}')

Step 9: Modify the Classification Threshold

 Adjust the threshold for classification and observe the changes in model performance.

y_pred_proba = model.predict_proba(X_test)[:, 1]

# Modify threshold

new_threshold = 0.7

y_pred_new = (y_pred_proba >= new_threshold).astype(int)

accuracy_new = accuracy_score(y_test, y_pred_new)

precision_new = precision_score(y_test, y_pred_new)

recall_new = recall_score(y_test, y_pred_new)

f1_new = f1_score(y_test, y_pred_new)

Step 10: Experiment with Different Classification Metrics

 Evaluate the model using different metrics like ROC-AUC.

from sklearn.metrics import roc_auc_score, roc_curve

roc_auc = roc_auc_score(y_test, y_pred_proba)

fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

plt.figure()

plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('Receiver Operating Characteristic')

plt.legend(loc='lower right')

plt.show()

print(f'ROC-AUC Score: {roc_auc}')

print(f'New Threshold: {new_threshold}')

print(f'Accuracy: {accuracy_new}')

print(f'Precision: {precision_new}')

print(f'Recall: {recall_new}')

print(f'F1 Score: {f1_new}')

Result:

A binary classification model to predict whether houses in a neighborhood are above a certain price threshold was implemented successfully.

Ex.no:3
Classification with Nearest Neighbours
Date:

Aim:
To classify real vs. fake news headlines using the K-Nearest Neighbors (KNN) classifier in scikit-learn.
Procedure:
Step 1: Import Necessary Libraries

 Import the required libraries.

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score

from sklearn.datasets import fetch_california_housing

Step 2: Load and Prepare the Dataset:

 Collect and preprocess the news headlines dataset.

# Load the dataset

data = fetch_california_housing()

X = pd.DataFrame(data.data, columns=data.feature_names)

y = pd.cut(data.target, bins=2, labels=[0, 1])  # Create a binary target variable for classification

# Check the first few rows of the dataset

print(X.head())

print(y.head())

Step 3: Split the Data:

 Split the dataset into training and validation sets.

# Split the dataset into training and validation sets

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Model:

 Use the KNN classifier to train on the training set.

# Initialize the KNN classifier with k=5

knn = KNeighborsClassifier(n_neighbors=5)

# Train the classifier

knn.fit(X_train, y_train)

Step 5: Evaluate the Model:

 Evaluate the classifier's performance on the validation set.

# Predict on the validation set

y_pred = knn.predict(X_val)

# Calculate the accuracy

accuracy = accuracy_score(y_val, y_pred)

print(f'Validation Accuracy: {accuracy:.2f}')
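
KNN is distance-based, so features with large numeric ranges (for example Population versus MedInc) can dominate the distance computation. As an optional variant, not part of the original procedure, scaling can be placed in front of the classifier with a scikit-learn Pipeline:

# Optional variant: standardize features before KNN, since distances are range-sensitive
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
print(f'Validation Accuracy (scaled): {accuracy_score(y_val, knn_scaled.predict(X_val)):.2f}')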

Result:

Implemented a binary classification model to classify real vs. fake news headlines with the scikit-learn KNN API; training and validation were completed successfully.

Ex.no:4
Validation and test sets using the California Housing Dataset
Date:

Aim:
To perform the complete exercise of training a KNN classifier with validation and test sets
using the California Housing Dataset.
Procedure:
Step 1: Import Necessary Libraries

 Import the required libraries.

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score

from sklearn.datasets import fetch_california_housing

Step 2: Load and Prepare the Dataset:

 Collect and preprocess the dataset.

# Load the dataset

data = fetch_california_housing()

X = pd.DataFrame(data.data, columns=data.feature_names)

y = pd.cut(data.target, bins=2, labels=[0, 1])  # Create a binary target variable for classification

# Check the first few rows of the dataset

print(X.head())

print(y.head())

Step 3: Split the Data:

 Split the dataset into training and validation sets.
 Further split the training set into a smaller training set and a validation set.

# Split the dataset into training and test sets (80-20 split)

X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split the training set into a smaller training set and a validation set
# (test_size=0.25 of the training portion gives roughly a 60/20/20 train/validation/test split overall)

X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)

Step 4: Train the Model:

 Use the KNN classifier to train on the training set.

# Initialize the KNN classifier with k=5

knn = KNeighborsClassifier(n_neighbors=5)

# Train the classifier on the smaller training set

knn.fit(X_train, y_train)

Step 5: Evaluate the Model:

 Evaluate on the training set.


 Evaluate on the Validation set.
 Compare the performance to detect overfitting.

# Predict on the training set
y_train_pred = knn.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f'Training Accuracy: {train_accuracy:.2f}')

# Predict on the validation set and compare the two scores to detect overfitting
y_val_pred = knn.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy:.2f}')

Step 6: Test the Model:

 Evaluate the trained model on the test set to check for overfitting.

# Predict on the test set

y_test_pred = knn.predict(X_test)

test_accuracy = accuracy_score(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy:.2f}')
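
As an optional extension, not in the original procedure, the validation set could also be used to choose the number of neighbours before the final test evaluation; a minimal sketch with arbitrary candidate values:

# Optional: use the validation set to pick k before touching the test set
best_k, best_val_acc = None, 0.0
for k_candidate in [3, 5, 7, 9, 11]:
    candidate = KNeighborsClassifier(n_neighbors=k_candidate)
    candidate.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    if val_acc > best_val_acc:
        best_k, best_val_acc = k_candidate, val_acc
print(f'Best k on the validation set: {best_k} (accuracy {best_val_acc:.2f})')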

Result:

Training a KNN classifier with validation and test sets using the California Housing Dataset was performed successfully.

Ex.no:5
k-means algorithm using the Codon Usage dataset
Date:

Aim:
To implement the k-means algorithm using the Codon Usage dataset from the UCI Machine Learning
Repository.
Procedure:
Step 1: Import Necessary Libraries

 Import the required libraries.

import pandas as pd

from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

Step 2: Load and Prepare the Dataset:

 Load the dataset from the URL.


 Preprocess the data.

# Load the dataset

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/codon/codon_usage.txt"

data = pd.read_csv(url, delim_whitespace=True)

# Display the first few rows of the dataset

print(data.head())

# Standardize the features

scaler = StandardScaler()

data_scaled = scaler.fit_transform(data.iloc[:, 1:])  # Assuming the first column is a non-numeric identifier

Step 3: Implement K-Means Clustering:

 Choose the number of clusters `k`.
 Fit the k-means algorithm to the dataset.
 Analyze the clustering results.

# Choose the number of clusters

k = 3 # Example, you may adjust this based on your analysis

# Initialize and fit the k-means algorithm

kmeans = KMeans(n_clusters=k, random_state=42)

kmeans.fit(data_scaled)

# Get the cluster labels

labels = kmeans.labels_

# Add the cluster labels to the original dataframe

data['Cluster'] = labels

# Display the first few rows of the clustered dataset

print(data.head())
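
A quick way to inspect the result is to count how many rows fall into each cluster:

# Inspect the cluster sizes
print(data['Cluster'].value_counts())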

Step 4: Evaluate the Clustering:

 Use the elbow method to find the optimal number of clusters.


 Visualize the clusters.

# Elbow method to find the optimal number of clusters

inertia = []

# Use a separate loop variable so the chosen k (3) used later for plotting is not overwritten
for n_clusters in range(1, 11):
    kmeans_elbow = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans_elbow.fit(data_scaled)
    inertia.append(kmeans_elbow.inertia_)

# Plot the elbow curve

plt.figure(figsize=(8, 5))

plt.plot(range(1, 11), inertia, marker='o')

plt.xlabel('Number of Clusters')

plt.ylabel('Inertia')

plt.title('Elbow Method for Optimal Number of Clusters')

plt.show()
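
Besides the elbow method, the silhouette score (also available in scikit-learn) can be compared across a few candidate cluster counts; a hedged sketch (this can be slow on large datasets):

# Optional: compare silhouette scores for a few candidate cluster counts
from sklearn.metrics import silhouette_score

for n_clusters in range(2, 6):
    candidate_labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(data_scaled)
    print(f'k={n_clusters}: silhouette score = {silhouette_score(data_scaled, candidate_labels):.3f}')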
Step 5: Visualize the Clusters:

 If the dataset has more than 2 dimensions, use dimensionality reduction techniques to visualize
the clusters.

from sklearn.decomposition import PCA

# Reduce the dimensionality of the data to 2D for visualization

pca = PCA(n_components=2)

data_pca = pca.fit_transform(data_scaled)

# Create a dataframe for the PCA-transformed data

data_pca_df = pd.DataFrame(data_pca, columns=['PC1', 'PC2'])

data_pca_df['Cluster'] = labels

# Plot the clusters

plt.figure(figsize=(10, 7))

for cluster in range(k):
    plt.scatter(data_pca_df[data_pca_df['Cluster'] == cluster]['PC1'],
                data_pca_df[data_pca_df['Cluster'] == cluster]['PC2'],
                label=f'Cluster {cluster}')

plt.xlabel('Principal Component 1')

plt.ylabel('Principal Component 2')

plt.title('Clusters Visualization with PCA')

plt.legend()

plt.show()

Result:

The k-means algorithm was implemented successfully using the Codon Usage dataset from the UCI Machine Learning Repository.

Ex.no:6
Naïve Bayes Classifier using the Gait Classification dataset
Date:

Aim:

To implement the Naïve Bayes Classifier using the Gait Classification dataset from the UCI Machine
Learning Repository.

Procedure:
Step 1: Import Necessary Libraries

 Import the required libraries.

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Step 2: Load and Prepare the Dataset:

 Load the dataset from the URL.


 Preprocess the data.

# Load the dataset

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00310/GaitClassification.csv"

data = pd.read_csv(url)

# Display the first few rows of the dataset

print(data.head())

# Assuming the last column is the target variable

X = data.iloc[:, :-1] # Features

y = data.iloc[:, -1] # Target

# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

Step 3: Split the Data:

 Split the dataset into training and test sets.

# Split the dataset into training and test sets (80-20 split)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Step 4: Train the Naïve Bayes Classifier:

 Train the Gaussian Naïve Bayes classifier on the training set.

# Initialize the Naïve Bayes classifier

nb = GaussianNB()

# Train the classifier on the training set

nb.fit(X_train, y_train)

Step 5: Evaluate the Model:

 Predict on the test set


 Evaluate the classifier's performance using accuracy, classification report, and confusion matrix.

# Predict on the test set

y_pred = nb.predict(X_test)

# Calculate the accuracy

accuracy = accuracy_score(y_test, y_pred)

print(f'Test Accuracy: {accuracy:.2f}')

# Print the classification report

print('Classification Report:')

print(classification_report(y_test, y_pred))

# Print the confusion matrix

print('Confusion Matrix:')

print(confusion_matrix(y_test, y_pred))
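
Optionally, the confusion matrix can also be visualized as a heatmap for easier reading (this uses seaborn and matplotlib, which are not imported in this experiment's Step 1):

# Optional: visualize the confusion matrix as a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show()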

Result:

The Naïve Bayes classifier was implemented successfully using the Gait Classification dataset from the UCI Machine Learning Repository.

