ML Lab - Sukanya Raja


EXPERIMENT-1

 AIM:
Take a suitable dataset and study linear regression with multiple variables

 THEORY:
Linear regression assumes a linear, or straight-line, relationship between the input
variables (X) and the single output variable (y). More specifically, the output (y) can
be calculated from a linear combination of the input variables (X). When there is a
single input variable, the method is referred to as simple linear regression.

Multiple Linear Regression is an extension of Simple Linear Regression, as it takes
more than one predictor variable to predict the response variable. It is an important
regression algorithm that models the linear relationship between a single continuous
dependent variable and more than one independent variable. It uses two or more
independent variables to predict a dependent variable by fitting the best linear
relationship.

It has two or more independent variables (X) and one dependent variable (Y), where
Y is the value to be predicted. Thus, it is an approach for predicting a quantitative
response using multiple features.

y = β0 + β1∗X1 + β2∗X2 + β3∗X3

Where:

 y is the response
 β0 is the intercept
 β1 is the coefficient for X1 (the first feature), β2 is the coefficient for X2, and so on
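
As an illustration of how these coefficients can be estimated, the following is a minimal sketch (not part of the original experiment) that solves the least-squares problem for a small made-up dataset with NumPy; the variable names and the chosen true coefficients are illustrative assumptions.

import numpy as np

np.random.seed(0)
x1 = np.random.rand(50)
x2 = np.random.rand(50)
X = np.column_stack([np.ones(50), x1, x2])   # design matrix with a column of ones for the intercept
y = 1 + 2 * x1 + 3 * x2 + 0.1 * np.random.randn(50)

# least-squares estimate of [beta0, beta1, beta2]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                  # expected to be close to [1, 2, 3]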

 DATASET:
We consider the variables X and Y and generate a dataset with random values instead
of using any predefined dataset. Here, Y is the dependent variable and X holds the
independent variables.

Once the data points are generated, they are appended to form arrays for the
respective variables and plotted in a 3D plot for visualization.

 CODE:
import numpy as np

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
 
def generate_dataset(n):
    # build n samples of the form [1, x1, x2] and the target y = 1 + b1*x1 + b2*x2
    x = []
    y = []
    random_x1 = np.random.rand()   # true coefficient for x1
    random_x2 = np.random.rand()   # true coefficient for x2
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)
 
x, y = generate_dataset(200)
 
mpl.rcParams['legend.fontsize'] = 12
 
fig = plt.figure()
ax = fig.add_subplot(projection ='3d')
 
ax.scatter(x[:, 1], x[:, 2], y, label ='y', s = 10)
ax.legend()
ax.view_init(40, 0)
 
plt.show()

 OUTPUT:

 CONCLUSION:
Multiple Linear Regression attempts to model the relationship between two or more
features and a response by fitting a linear equation to the observed data. The steps to
perform multiple linear regression are almost the same as those for simple linear
regression; the difference lies in the evaluation. We can use it to find out which
factor has the highest impact on the predicted output and how the different variables
relate to each other.
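
As a hedged follow-up, one way to see which factor has the highest impact is to fit scikit-learn's LinearRegression to the x and y arrays produced by generate_dataset() above (dropping the leading column of ones, since the model fits its own intercept) and inspect the coefficients; this sketch is an illustrative extension, not part of the original code.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x[:, 1:], y)                 # x[:, 1:] keeps only the x1 and x2 columns
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)    # for comparably scaled inputs, the larger coefficient marks the more influential feature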

EXPERIMENT-2
 AIM:
Take a suitable dataset and extract the features using PCA

 THEORY:
Principal component analysis (PCA) is a popular technique for analyzing large
datasets containing a high number of dimensions/features per observation, increasing
the interpretability of data while preserving the maximum amount of information, and
enabling the visualization of multidimensional data. Formally, PCA is a statistical
technique for reducing the dimensionality of a dataset. This is accomplished by
linearly transforming the data into a new coordinate system where (most of) the
variation in the data can be described with fewer dimensions than the initial data.
Importantly, the dataset on which the PCA technique is to be used must be scaled, and
the results are sensitive to the relative scaling of the features.
PCA is predominantly used as a dimensionality reduction technique in domains like
facial recognition, computer vision and image compression. It is also used for finding
patterns in data of high dimension in the field of finance, data mining, bioinformatics,
psychology, etc.
PCA considers the correlation among variables. If the correlation is very high, PCA
attempts to combine highly correlated variables and finds the directions of maximum
variance in higher-dimensional data. The following image shows that the first
principal component (PC1) has the largest possible variance and is orthogonal to PC2
(i.e. uncorrelated).
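
To make "directions of maximum variance" concrete, the sketch below performs PCA by hand on a small random matrix: standardize the data, compute the covariance matrix, and take the eigenvectors with the largest eigenvalues. It is a minimal illustration on made-up data, not the dataset used in this experiment.

import numpy as np

np.random.seed(1)
X = np.random.rand(100, 5)                       # toy data: 100 samples, 5 features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # PCA requires scaled data

cov = np.cov(X_std, rowvar=False)                # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)           # eigh works for symmetric matrices
order = np.argsort(eigvals)[::-1]                # sort directions by decreasing variance
components = eigvecs[:, order[:2]]               # PC1 and PC2 (mutually orthogonal)

X_reduced = X_std @ components                   # project the data onto the two components
print(X_reduced.shape)                           # (100, 2)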

 DATASET:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a
breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain a separating plane for these data is the one described
in [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of
Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
All feature values are recorded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant

 CODE:
import numpy as np 
import pandas as pd 

df = pd.read_csv("/content/data_pca.csv")
df = df.drop(['id', 'Unnamed: 32'], axis=1)
df.head()
df_features = df.drop(['diagnosis'], axis=1)
from sklearn.preprocessing import StandardScaler
standardized = StandardScaler()
standardized.fit(df_features)
scaled_data = standardized.transform(df_features)
#PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)

scaled_data.shape
x_pca.shape
def diag(x):
    if x =='M':
        return 1
    else:
        return 0
df_diag= df['diagnosis'].apply(diag)
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
x_pca[:1]
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x_pca[:,0], x_pca[:,1], x_pca[:,2], c=df_diag, s=60)
ax.legend(['Malign'])
ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component')
ax.set_zlabel('Third Principal Component')
ax.view_init(20, 130)
df_pc = pd.DataFrame(pca.components_, columns=df_features.columns)
df_pc
plt.figure(figsize=(15, 8))
sns.heatmap(df_pc, cmap='viridis')
plt.title('Principal Components correlation with the features')
plt.xlabel('Features')
plt.ylabel('Principal Components')

 OUTPUT:

 CONCLUSION:
There is no best technique for dimensionality reduction and no mapping of techniques
to problems. Instead, the best approach is to use systematic controlled experiments to
discover what dimensionality reduction techniques, when paired with your model of
choice, result in the best performance on your dataset. Typically, linear algebra and
manifold learning methods assume that all input features have the same scale or
distribution. This suggests that it is good practice to either normalize or standardize
data prior to using these methods if the input variables have differing scales or units.
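
One such controlled check with PCA is to look at how much variance each component retains. Assuming the pca object and scaled_data from the code above, this short hedged snippet prints the explained-variance ratios so the number of components can be chosen empirically.

# fraction of the total variance captured by each of the three components
print(pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())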

EXPERIMENT-3
 AIM:
Take a suitable dataset and design a classifier using:
a) Logistic regression
b) KNN
c) SVM
d) Decision Tree

 THEORY:
The Classification algorithm is a Supervised Learning technique that is used to
identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and then
classifies new observations into a number of classes or groups.

a) Logistic Regression: Logistic regression is one of the most popular Machine
Learning algorithms and comes under the Supervised Learning technique. It is used for
predicting a categorical dependent variable from a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, true or
false; but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1.
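
The probabilistic output mentioned above comes from the sigmoid (logistic) function σ(z) = 1 / (1 + e^(-z)) applied to a linear combination of the inputs. As a minimal, hedged sketch on a made-up one-feature dataset, scikit-learn exposes these probabilities through predict_proba:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 1-D dataset: small values belong to class 0, large values to class 1
X_toy = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
print(clf.predict_proba([[2.0]]))   # probabilities for class 0 and class 1, between 0 and 1
print(clf.predict([[2.0]]))         # hard label obtained by thresholding the probability at 0.5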

b) KNN: K-Nearest Neighbour is one of the simplest Machine Learning algorithms,
based on the Supervised Learning technique. The K-NN algorithm assumes similarity
between the new case/data and the available cases and puts the new case into the
category that is most similar to the available categories. K-NN stores all the
available data and classifies a new data point based on that similarity, so when new
data appears it can easily be assigned to a well-suited category.

c) SVM: Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression problems.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane. SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Consider the diagram below, in which two different
categories are classified using a decision boundary or hyperplane.

d) Decision tree: Decision Tree is a supervised learning technique that can be used
for both classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
In a Decision Tree, there are two types of nodes: decision nodes and leaf nodes.
Decision nodes are used to make decisions and have multiple branches, whereas leaf
nodes are the outputs of those decisions and do not contain any further branches.

 DATASET:
The Open Access Series of Imaging Studies (OASIS) is a project aimed at making MRI
data sets of the brain freely available to the scientific community. By compiling and
freely distributing MRI data sets, we hope to facilitate future discoveries in basic and
clinical neuroscience. OASIS is made available by the Washington University
Alzheimer’s Disease Research Center, Dr. Randy Buckner at the Howard Hughes
Medical Institute (HHMI) at Harvard University, the Neuroinformatics Research Group
(NRG) at Washington University School of Medicine, and the Biomedical Informatics
Research Network (BIRN).
Longitudinal MRI Data in Nondemented and Demented Older Adults: This set
consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was
scanned on two or more visits, separated by at least one year for a total of 373 imaging
sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single
scan sessions are included. The subjects are all right-handed and include both men and
women. 72 of the subjects were characterized as nondemented throughout the study. 64 of
the included subjects were characterized as demented at the time of their initial visits and
remained so for subsequent scans, including 51 individuals with mild to moderate
Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time
of their initial visit and were subsequently characterized as demented at a later visit.

 CODE:
#Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data= pd.read_csv("/content/oasis_longitudinal.csv")
df= pd.DataFrame(data)
df
df_mod= df.dropna(axis = 0, how ='any')
df_mod= df_mod.drop(['Subject ID','MRI ID','Hand'],axis=1)
label_enc= LabelEncoder()
df_mod.iloc[:,0]= label_enc.fit_transform(df_mod.iloc[:,0].values)
label_enc= LabelEncoder()
df_mod.iloc[:,3]= label_enc.fit_transform(df_mod.iloc[:,3].values)
x= df_mod.iloc[:,2:13].values # independent dataset
y= df_mod.iloc[:,0].values #dependent dataset
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state= 32)
sc= StandardScaler()
x_train= sc.fit_transform(x_train)
x_test= sc.transform(x_test) # transform only; the scaler was already fitted on the training set
Train_accuracy = []
Test_accuracy = []
Algorithm = []
#LogisticRegression
from sklearn.linear_model import LogisticRegression
Algorithm.append('LogisticRegression')
regressor1 = LogisticRegression(random_state=32)
regressor1.fit(x_train, y_train)
predicted1= regressor1.predict(x_test)
accuracy = regressor1.score(x_train, y_train)*100
print(accuracy,'%')
Train_accuracy.append(accuracy)
accuracy = regressor1.score(x_test, y_test)*100
print(accuracy,'%')
Test_accuracy.append(accuracy)
for i in Algorithm, Train_accuracy, Test_accuracy:
  print(i,end=',')

from sklearn.neighbors import KNeighborsClassifier
Algorithm.append('KNeighborsClassifier')
regressor2=KNeighborsClassifier(n_neighbors=3)

regressor2.fit(x_train,y_train)
predicted2= regressor2.predict(x_test)
accuracy = regressor2.score(x_train, y_train)*100
print(accuracy,'%')
Train_accuracy.append(accuracy)
accuracy = regressor2.score(x_test, y_test)*100
print(accuracy,'%')
Test_accuracy.append(accuracy)
for i in Algorithm, Train_accuracy, Test_accuracy:
  print(i,end=',')

from sklearn.svm import SVC
Algorithm.append('SVM')
regressor3=SVC(probability=True)
regressor3.fit(x_train,y_train)
predicted3= regressor3.predict(x_test)
accuracy = regressor3.score(x_train, y_train)*100
print(accuracy,'%')
Train_accuracy.append(accuracy)
accuracy = regressor3.score(x_test, y_test)*100
print(accuracy,'%')
Test_accuracy.append(accuracy)
for i in Algorithm, Train_accuracy, Test_accuracy:
  print(i,end=',')

#Decision Tree
from sklearn.tree import DecisionTreeClassifier
Algorithm.append('DecisionTreeClassifier')
regressor4 = DecisionTreeClassifier()
regressor4.fit(x_train, y_train)
predicted4= regressor4.predict(x_test)
accuracy = regressor4.score(x_train, y_train)*100
print(accuracy,'%')
Train_accuracy.append(accuracy)
accuracy = regressor4.score(x_test, y_test)*100
print(accuracy,'%')
Test_accuracy.append(accuracy)
for i in Algorithm, Train_accuracy, Test_accuracy:
  print(i,end=',')

comparision=pd.DataFrame({'Algorithm':Algorithm,'Train_accuracy':Train_accuracy,'Test_accuracy':Test_accuracy})
comparision

 OUTPUT:
 91.69811320754717 %
 91.01123595505618 %
 ['LogisticRegression'],[91.69811320754717],
[91.01123595505618],92.83018867924528 %
 84.26966292134831 %
 ['LogisticRegression', 'KNeighborsClassifier'],
[91.69811320754717, 92.83018867924528],[91.01123595505618,
84.26966292134831],91.69811320754717 %
 91.01123595505618 %
 ['LogisticRegression', 'KNeighborsClassifier', 'SVM'],
[91.69811320754717, 92.83018867924528, 91.69811320754717],
[91.01123595505618, 84.26966292134831, 91.01123595505618],100.0 %
 80.89887640449437 %
 ['LogisticRegression', 'KNeighborsClassifier', 'SVM',
'DecisionTreeClassifier'],[91.69811320754717, 92.83018867924528,
91.69811320754717, 100.0],[91.01123595505618, 84.26966292134831,
91.01123595505618, 80.89887640449437],

 CONCLUSION:
Based on the testing accuracy of the models, we can conclude that Logistic Regression
and SVM provide the most promising results for this classification task.
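
Because a single train/test split can be noisy, one hedged way to double-check this conclusion is k-fold cross-validation. The sketch below assumes the x and y arrays from the preprocessing code above and compares the four classifiers by mean cross-validated accuracy; it is an optional extension, not part of the original experiment.

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    'LogisticRegression': LogisticRegression(random_state=32),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=3),
    'SVM': SVC(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=32),
}
for name, model in models.items():
    # scaling happens inside the pipeline, so each fold is scaled independently
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, x, y, cv=5)
    print(name, round(scores.mean() * 100, 2), '%')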

EXPERIMENT-4
 AIM:
Take a suitable dataset and cluster the data using K-Means

 THEORY:
K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning and data science. In this experiment, we study
what the K-Means clustering algorithm is and how it works, along with a Python
implementation. K-Means groups an unlabeled dataset into different clusters. Here K
defines the number of predefined clusters to be created in the process: if K=2, there
will be two clusters, for K=3 there will be three clusters, and so on.
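
To show how the algorithm works, here is a small hedged sketch of the core K-Means loop on made-up 2-D points: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. It is an illustrative toy version, not the scikit-learn implementation used in the code below.

import numpy as np

np.random.seed(42)
points = np.random.rand(100, 2)                                   # toy 2-D data
k = 3
centroids = points[np.random.choice(len(points), k, replace=False)]

for _ in range(10):                                               # fixed number of iterations for simplicity
    # assignment step: index of the nearest centroid for every point
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: move each centroid to the mean of the points assigned to it
    centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

print(centroids)                                                  # final cluster centres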

 DATASET:
This dataset is composed by the following five features:
 CustomerID: Unique ID assigned to the customer
 Gender: Gender of the customer
 Age: Age of the customer
 Annual Income (k$): Annual Income of the customer
 Spending Score (1-100): Score assigned by the mall based on customer
behavior and spending nature.

In this particular dataset we have 200 samples to study.

 CODE:
# importing libraries 
import numpy as nm 
import matplotlib.pyplot as mtp 
import pandas as pd 
dataset = pd.read_csv('/content/Mall_Customers.csv') 
x = dataset.iloc[:, [3, 4]].values 
from sklearn.cluster import KMeans
wcss_list= [] 
#Initializing the list for the values of WCSS 
#Using for loop for iterations from 1 to 10. 
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')

mtp.xlabel('Number of clusters(k)') 
mtp.ylabel('wcss_list') 
mtp.show() 
kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(x)
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers') 
mtp.xlabel('Annual Income (k$)') 
mtp.ylabel('Spending Score (1-100)') 
mtp.legend() 
mtp.show()

 OUTPUT:

 CONCLUSION:
Ultimately, we can characterize these clusters according to their annual income and
spending scores. For instance, Cluster 1 has high annual income and a low spending
score, so we can label it “Careful” and name the others in comparison with this
appellation.

If you are clustering in more than two dimensions, the last code section cannot be used
directly to visualize the clusters, because it only handles two-dimensional data.
However, if you first reduce the dataset to two dimensions with a dimensionality
reduction technique, you can then use this last code section to plot the clusters.
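
As a hedged illustration of that last point, the sketch below first reduces a stand-in multi-feature matrix to two dimensions with PCA and then plots the K-Means clusters; the X_multi array here is made-up data standing in for a dataset with more than two features.

import numpy as np
import matplotlib.pyplot as mtp
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

np.random.seed(0)
X_multi = np.random.rand(200, 4)                   # stand-in for a dataset with more than two features

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
labels = kmeans.fit_predict(X_multi)

X_2d = PCA(n_components=2).fit_transform(X_multi)  # reduce to two dimensions purely for plotting
mtp.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=50)
mtp.title('Clusters after PCA projection')
mtp.show()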

EXPERIMENT-5
 AIM:
In a given dataset estimate the number of possible clusters using
Silhouette method

 THEORY:
In Clustering algorithms like K-Means clustering, we have to determine the right
number of clusters for our dataset. This ensures that the data is properly and
efficiently divided. An appropriate value of ‘k’ i.e. the number of clusters helps in
ensuring proper granularity of clusters and helps in maintaining a good balance
between compressibility and accuracy of clusters.
Let us consider two extreme cases:
Case 1: Treat the entire dataset as one cluster. This gives no insight, since every data
point is lumped together and nothing is summarized.
Case 2: Treat each data point as its own cluster. This gives the most “accurate”
clustering, because the distance between each data point and its cluster center is zero,
but it will not help in predicting new inputs and does not enable any kind of data
summarization.

So, we can conclude that it is very important to determine the ‘right’ number of
clusters for any dataset. A simple rule of thumb is to set the number of clusters to
about √(n/2) for a dataset of n points. In this experiment, this rule is used to bound
the search, and the silhouette score method is then described and implemented in
Python to estimate the number of clusters.
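
For reference, the silhouette coefficient of a single sample is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster; values close to 1 indicate well-separated clusters. The hedged sketch below computes s for one point of a tiny made-up two-cluster dataset.

import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])   # the cluster that contains the sample
cluster_b = np.array([[5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])   # the nearest other cluster
sample = cluster_a[0]

a = np.mean([np.linalg.norm(sample - p) for p in cluster_a[1:]])  # mean intra-cluster distance
b = np.mean([np.linalg.norm(sample - p) for p in cluster_b])      # mean distance to the nearest cluster
s = (b - a) / max(a, b)
print(round(s, 3))                                                # close to 1, since the clusters are far apart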

 DATASET:
This dataset is composed by the following five features:
 CustomerID: Unique ID assigned to the customer
 Gender: Gender of the customer
 Age: Age of the customer
 Annual Income (k$): Annual Income of the customer
 Spending Score (1-100): Score assigned by the mall based on customer
behavior and spending nature.

In this particular dataset we have 200 samples to study.

 CODE:
# importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
dataset = pd.read_csv("/content/Mall_Customers.csv")
dataset.head()
# printing the shape of dataset
print(dataset.shape)
# checking for any
# null values present
print(dataset.isnull().sum())
# extracting values from two
# columns for clustering
dataset_new = dataset[['Annual Income (k$)',
          'Spending Score (1-100)']].values
# determining the maximum number of clusters
# using the simple method
limit = int((dataset_new.shape[0]//2)**0.5)

# determining number of clusters
# using silhouette score method
for k in range(2, limit+1):
  model = KMeans(n_clusters=k)
  model.fit(dataset_new)
  pred = model.predict(dataset_new)
  score = silhouette_score(dataset_new, pred)
  print('Silhouette Score for k = {}: {:<.3f}'.format(k, score))

 OUTPUT:

 CONCLUSION:
In addition to the elbow, silhouette and gap statistic methods, more than thirty other
indices and methods have been published for identifying the optimal number of clusters.
Many of these indices can be computed together so that the best number of clusters is
chosen by “majority rule”.
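
Within this experiment, a simple hedged way to act on the printed scores is to collect them in a list and pick the value of k with the highest silhouette score; the snippet below reuses the dataset_new and limit variables (and the imports) from the code above.

scores = []
for k in range(2, limit + 1):
    model = KMeans(n_clusters=k)
    labels = model.fit_predict(dataset_new)
    scores.append(silhouette_score(dataset_new, labels))

# the k with the largest silhouette score is the estimated number of clusters
best_k = range(2, limit + 1)[scores.index(max(scores))]
print('Estimated number of clusters:', best_k)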

EXPERIMENT-6
 AIM:
Take two suitable datasets and design an ANN to classify them

 THEORY:
The term "Artificial neural network" refers to a biologically inspired sub-field of
artificial intelligence modeled after the brain. An Artificial neural network is usually a
computational network based on biological neural networks that construct the
structure of the human brain. Similar to a human brain has neurons interconnected to
each other, artificial neural networks also have neurons that are linked to each other in
various layers of the networks. These neurons are known as nodes.

The architecture of an artificial neural network consists of an input layer, one or
more hidden layers, and an output layer of interconnected nodes.

Neural networks have been successfully used in a variety of applications such as
pattern recognition, optimization problems, forecasting, control systems, and so on.

 DATASET:
Predicting which set of the customers are going to churn out from the organization by
looking into some of the important attributes and applying Machine Learning and
Deep Learning on it.

Customer churn refers to when a customer (player, subscriber, user, etc.) ceases his or
her relationship with a company. Online businesses typically treat a customer as
churned once a particular amount of time has elapsed since the customer's last
interaction with the site or service.

A Predictive Churn Model is a tool that defines the steps and stages of customer
churn, or a customer leaving your service or product. But with an evolving churn
model, you can fight for retention by acting on the metrics as they happen

Customer churn occurs when customers or subscribers stop doing business with a
company or service; it is also known as customer attrition or loss of clients. Similar
in concept to predicting employee turnover, we are going to predict customer churn
using the Churn_Modelling.csv dataset loaded in the code below.

 CODE:
#Import the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Load the Dataset
dataset = pd.read_csv("/content/Churn_Modelling.csv")
dataset.head()
#Split Dataset into X and Y
X = pd.DataFrame(dataset.iloc[:, 3:13].values)
y = dataset.iloc[:, 13].values
#Encode Categorical Data
from sklearn.preprocessing import LabelEncoder
labelencoder_X_2 = LabelEncoder()
X.loc[:, 2] = labelencoder_X_2.fit_transform(X.iloc[:, 2])
labelencoder_X_1 = LabelEncoder()
X.loc[:, 1] = labelencoder_X_1.fit_transform(X.iloc[:, 1])
#One-hot encode column 1 (pd.get_dummies replaces the deprecated OneHotEncoder(categorical_features=...) interface)
dummies = pd.get_dummies(X.iloc[:, 1])
#Drop one dummy column to avoid the dummy-variable trap, then rebuild the feature matrix
X = np.concatenate([dummies.values[:, 1:], X.drop(columns=[1]).values], axis=1).astype(float)
#Split the X and Y Dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#Perform Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Build Artificial Neural Network
import keras
from keras.models import Sequential
from keras.layers import Dense
#Initialize the Artificial Neural Network
classifier = Sequential()
#Add the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
#Add the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
#Add the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
#Compile the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
#Fit the ANN to the Training set (epochs replaces the old nb_epoch argument)
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
#Predict the Test Set Results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
#Make the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test,y_pred)

 OUTPUT:

 CONCLUSION:

The training accuracy starts at about 79%, and after running all 100 epochs the
accuracy increases. From the confusion matrix computed on the test set, we obtain a
final accuracy of about 84.2%.
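
As a sanity check, the accuracy can also be recomputed directly from the confusion matrix, since the correctly classified samples lie on its diagonal. Assuming the cm array produced by the code above, a minimal sketch is:

import numpy as np

# accuracy = correctly classified samples / all samples
manual_accuracy = np.trace(cm) / cm.sum()
print(manual_accuracy)    # should match accuracy_score(y_test, y_pred)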

