numpy_pandas_matplotlib

The document outlines practical exercises for a Machine Learning course focusing on the use of Numpy, Pandas, and Matplotlib for data manipulation, analysis, and visualization. It includes tasks for array creation, slicing, DataFrame manipulation, and linear regression model training and evaluation. The document also emphasizes data preparation, handling missing values, and visualizing data relationships through various plots.

Uploaded by

Raj Desai

CSE303-Machine Learning 22DCS015

PRACTICAL-1

AIM: To understand and apply basic operations of NumPy, Pandas, and Matplotlib for data manipulation, analysis, and visualization.

1.1.1:- Array Creation:


• Create a NumPy array with zeros, ones, and random values.
• Generate an array with a specific sequence of numbers (e.g., arithmetic
progression).
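
The array-creation tasks above can be sketched as follows. This is a minimal illustration, not the original solution; the shapes, the seed, and the variable names are assumptions chosen for the example.

```python
import numpy as np

zeros = np.zeros((2, 3))                  # 2x3 array filled with 0.0
ones = np.ones((2, 3))                    # 2x3 array filled with 1.0

rng = np.random.default_rng(seed=0)       # seeded so repeated runs match (assumed seed)
randoms = rng.random((2, 3))              # uniform samples in [0, 1)

# Arithmetic progression: start at 2, step 3, stop before 20
progression = np.arange(2, 20, 3)

# Five evenly spaced points from 0.0 to 1.0 inclusive
evenly_spaced = np.linspace(0.0, 1.0, 5)
```

`np.arange` is convenient when the step is known, while `np.linspace` is convenient when the number of points is known.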

1.1.2: - Slicing and Updating:

• Extract specific portions of an array using slicing syntax.


• Modify elements within an array using indexing and assignment.
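
A short sketch of slicing and in-place updates, assuming small example arrays that are not part of the original practical:

```python
import numpy as np

a = np.arange(10)               # [0 1 2 ... 9]
first_three = a[:3].copy()      # copy, so later edits to `a` do not show through
every_other = a[::2].copy()     # elements at even positions

a[0] = 100                      # update one element by index
a[5:8] = -1                     # update a whole slice in one assignment

grid = np.arange(12).reshape(3, 4)
second_row = grid[1, :].copy()  # row with index 1
last_column = grid[:, -1].copy()
grid[grid % 2 == 0] = 0         # boolean-mask assignment: zero out even values
```

Plain slices are views onto the same memory, which is why the sketch calls `.copy()` wherever the extracted values should stay unchanged.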


Shape Manipulation:
• Reshape an array into different dimensions (e.g., from 1D to 2D).
• Transpose an array to swap rows and columns.
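
Reshaping and transposing can be illustrated on a six-element array. The array contents are an assumption made for this sketch.

```python
import numpy as np

v = np.arange(6)         # 1D array: [0 1 2 3 4 5]
m = v.reshape(2, 3)      # 2 rows, 3 columns
auto = v.reshape(3, -1)  # -1 lets NumPy infer the second dimension (here 2)
t = m.T                  # transpose swaps rows and columns: shape (3, 2)
```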


4. Looping:
• Iterate through an array using for loops and perform operations on elements.
5. File Reading:
• Load data from a text file (CSV or similar) using NumPy functions (e.g.,
np.genfromtxt).
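
The looping and file-reading tasks can be combined in one small sketch. To stay self-contained it reads CSV text from an in-memory buffer instead of a file on disk; the column names and values are made up for illustration.

```python
import io
import numpy as np

# Hypothetical CSV content; the second data row has a missing weight
csv_text = "height,weight\n1.5,50\n1.8,\n1.7,65\n"

# genfromtxt fills missing numeric fields with NaN by default
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

# Explicit for-loop over rows, skipping the missing value
total = 0.0
for row in data:
    if not np.isnan(row[1]):
        total += row[1]

# Vectorised equivalent of the loop above
total_vectorised = np.nansum(data[:, 1])
```

Passing a file path instead of the `io.StringIO` object reads from disk in the same way.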

6. Performance Benchmarking:
• Measure the execution time for matrix multiplication using NumPy arrays vs. Python lists for a 1000x1000 matrix. Analyze the performance difference.
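
One possible way to set up the benchmark is sketched below. The helper names are hypothetical, and the matrix size is deliberately set to 60 so the pure-Python version finishes quickly; raising `n` to 1000 matches the task but makes the list version take far longer.

```python
import time
import numpy as np

def matmul_lists(a, b):
    """Naive triple-loop matrix product on nested Python lists."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

n = 60  # assumption: small size for a quick run; the task asks for 1000
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

list_result, list_seconds = time_call(matmul_lists, A.tolist(), B.tolist())
numpy_result, numpy_seconds = time_call(np.matmul, A, B)

print(f"lists: {list_seconds:.4f} s, numpy: {numpy_seconds:.6f} s")
```

Checking that both results agree before comparing timings guards against benchmarking a wrong implementation.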

Part 2: Pandas

1. DataFrame Creation:
• Construct a DataFrame from scratch, specifying column names and data
types.
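
A minimal sketch of building a DataFrame from scratch with explicit column names and data types. The column names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [21, 25, 19],
        "score": [88.5, 92.0, 79.25],
    }
).astype({"age": "int32", "score": "float64"})  # set dtypes per column

print(df.dtypes)
```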


1.2.2 File Reading:

• Read the chosen dataset from its source file format (CSV, Excel) into a Pandas
DataFrame.
• Deal with missing values in the data
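
Reading a CSV and handling missing values can be sketched as below. The CSV text is a made-up three-row sample read from memory; a real run would pass the dataset's file path to `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical sample with one missing CRIM value and one missing RM value
csv_text = "CRIM,RM,MEDV\n0.1,6.5,24.0\n,6.4,21.6\n0.3,,34.7\n"
df = pd.read_csv(io.StringIO(csv_text))

missing_per_column = df.isna().sum()             # count of NaN per column

filled = df.fillna(df.mean(numeric_only=True))   # option 1: impute column means
dropped = df.dropna()                            # option 2: drop incomplete rows
```

Mean imputation keeps every row, while dropping rows keeps only fully observed records; which is preferable depends on how much data is missing.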


1.2.3 Slicing and Manipulation:


• Select specific rows and columns using various indexing methods (e.g., by label, position).
• Filter data based on specific conditions using boolean expressions.
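
The indexing methods listed above can be compared on a small example frame. The data and index labels are assumptions for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Pune", "Surat", "Anand", "Delhi"],
        "temp": [28, 33, 35, 31],
        "rain": [10, 2, 7, 0],
    },
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b", "temp"]                # label-based: row "b", column "temp"
by_position = df.iloc[0, 1]                   # position-based: first row, second column
block = df.loc["a":"c", ["city", "temp"]]     # label slices include the end label

# Boolean filtering: combine conditions with & and parentheses
hot_and_dry = df[(df["temp"] > 30) & (df["rain"] < 5)]
```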


Data Export:
• Export the DataFrame to a new CSV file or another desired format.

Masking:
• Apply boolean masks to filter and select data based on specific criteria.
• Read data selectively based on a boolean DataFrame.
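
A sketch of masking and export together. It writes the CSV to an in-memory buffer so nothing is created on disk; passing a file name to `to_csv` would write a file instead. The data is invented.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})

mask = df > 0                 # boolean DataFrame, same shape as df
positives = df.where(mask)    # keep values where mask is True, NaN elsewhere

# Row selection with a boolean Series, then export without the index column
buffer = io.StringIO()
df[mask["a"]].to_csv(buffer, index=False)
csv_out = buffer.getvalue()
```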


Part 3: Matplotlib

Importing and Configuration:


• Import necessary modules from Matplotlib for plotting.
• Configure plot elements like labels, titles, and grid lines.

Line Charts:
• Create a line chart to visualize a numerical variable over time or another
relevant category.

Correlation Matrix:
• Generate a correlation matrix to depict relationships between features in the
dataset.


Histograms:
• Construct histograms to analyze the distribution of numerical features in the
dataset.
• Detect if Outliers exist and Plot the data distribution using Box Plots.
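
A box plot draws its whiskers at 1.5 times the interquartile range by default in Matplotlib, so the same rule can be used to list outliers numerically. The helper name and the sample values below are assumptions for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

sample = [10, 12, 11, 13, 12, 11, 95]   # hypothetical feature values
outliers = iqr_outliers(sample)
```

Points returned by this function are the ones a default box plot would draw as individual fliers.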


Multivariate Data Visualization:


• Create scatter plots or other visualization techniques to represent multivariate
data (multiple features).


Pie Charts:

• Design a pie chart to represent the proportions of categorical data within a specific feature.


Practical 2: Linear Regression


AIM: To understand and apply the linear regression algorithm for the prediction
and evaluate its performance.

Data Preparation:
• Load and inspect the dataset.
• Handle missing values and outliers if present.
• Perform feature scaling or normalization if necessary.

Model Training and Evaluation:


• Split the dataset into training and test sets.
• Implement linear regression using a scikit-learn library.
• Train the model on the training set and make predictions on the test set.
• Evaluate the model using metrics such as mean squared error (MSE) and R-squared.

Analysis and Interpretation:


• Analyze the coefficients of the linear regression model to understand the importance of different features.
• Visualize the relationship between the predicted prices and actual prices.
• Identify any trends or patterns in the residuals to assess model performance.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score
# from sklearn.preprocessing import PolynomialFeatures

df=pd.read_csv('HousingData.csv')
df.shape

(506, 14)

df.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX


PTRATIO \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296
15.3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242
17.8
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242
17.8
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222
18.7
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222
18.7

B LSTAT MEDV
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
4 396.90 NaN 36.2

df.describe().T

count mean std min 25%


50% \
CRIM 486.0 3.611874 8.720192 0.00632 0.081900
0.253715
ZN 486.0 11.211934 23.388876 0.00000 0.000000
0.000000
INDUS 486.0 11.083992 6.835896 0.46000 5.190000
9.690000
CHAS 486.0 0.069959 0.255340 0.00000 0.000000
0.000000
NOX 506.0 0.554695 0.115878 0.38500 0.449000
0.538000
RM 506.0 6.284634 0.702617 3.56100 5.885500

6.208500
AGE 486.0 68.518519 27.999513 2.90000 45.175000
76.800000
DIS 506.0 3.795043 2.105710 1.12960 2.100175
3.207450
RAD 506.0 9.549407 8.707259 1.00000 4.000000
5.000000
TAX 506.0 408.237154 168.537116 187.00000 279.000000
330.000000
PTRATIO 506.0 18.455534 2.164946 12.60000 17.400000
19.050000
B 506.0 356.674032 91.294864 0.32000 375.377500
391.440000
LSTAT 486.0 12.715432 7.155871 1.73000 7.125000
11.430000
MEDV 506.0 22.532806 9.197104 5.00000 17.025000
21.200000

75% max
CRIM 3.560263 88.9762
ZN 12.500000 100.0000
INDUS 18.100000 27.7400
CHAS 0.000000 1.0000
NOX 0.624000 0.8710
RM 6.623500 8.7800
AGE 93.975000 100.0000
DIS 5.188425 12.1265
RAD 24.000000 24.0000
TAX 666.000000 711.0000
PTRATIO 20.200000 22.0000
B 396.225000 396.9000
LSTAT 16.955000 37.9700
MEDV 25.000000 50.0000

sns.scatterplot(y=df["MEDV"],x=df["CRIM"])

<Axes: xlabel='CRIM', ylabel='MEDV'>


sns.scatterplot(y=df["MEDV"],x=df["TAX"])

<Axes: xlabel='TAX', ylabel='MEDV'>


sns.regplot(y=df["MEDV"],x=df["RM"])

<Axes: xlabel='RM', ylabel='MEDV'>


sns.regplot(y=df["MEDV"],x=df["CRIM"])

<Axes: xlabel='CRIM', ylabel='MEDV'>


df.isna().sum()
CRIM 20
ZN 20
INDUS 20
CHAS 20
NOX 0
RM 0
AGE 20
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 20
MEDV 0
dtype: int64

# Impute missing values with each column's mean, then confirm none remain
data=df.fillna(df.mean())

data.isna().sum()
CRIM 0
ZN 0

INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64

# plt.subplots returns a (figure, axes) pair; draw the heatmap on that axes
fig,ax=plt.subplots(figsize=(15,15))
sns.heatmap(data.corr(),annot=True,ax=ax)

<Axes: >


sns.displot(data["MEDV"],kind="kde")

<seaborn.axisgrid.FacetGrid at 0x1ce48a27a70>


plt.grid()
sns.histplot(data["PTRATIO"])

<Axes: xlabel='PTRATIO', ylabel='Count'>


sns.regplot(data=data,x="B",y="MEDV")

<Axes: xlabel='B', ylabel='MEDV'>


sns.regplot(data=data,x="DIS",y="MEDV")

<Axes: xlabel='DIS', ylabel='MEDV'>


Price=data["MEDV"]
new_data=data.drop(["MEDV"],axis=1)
# data=data.drop(["MEDV"],axis=1)

type(data)
# type(Price)
new_data

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD


TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.200000 4.0900 1
296
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.900000 4.9671 2
242
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.100000 4.9671 2
242
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.800000 6.0622 3
222
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.200000 6.0622 3
222
.. ... ... ... ... ... ... ... ... ...
...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.100000 2.4786 1
273


502 0.04527 0.0 11.93 0.0 0.573 6.120 76.700000 2.2875 1


273
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.000000 2.1675 1
273
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.300000 2.3889 1
273
505 0.04741 0.0 11.93 0.0 0.573 6.030 68.518519 2.5050 1
273

PTRATIO B LSTAT
0 15.3 396.90 4.980000
1 17.8 396.90 9.140000
2 17.8 392.83 4.030000
3 18.7 394.63 2.940000
4 18.7 396.90 12.715432
.. ... ... ...
501 21.0 391.99 12.715432
502 21.0 396.90 9.080000
503 21.0 396.90 5.640000
504 21.0 393.45 6.480000
505 21.0 396.90 7.880000

[506 rows x 13 columns]

# from sklearn.preprocessing import StandardScaler


# scaler=StandardScaler()
# columns=data.columns
# my_data=pd.DataFrame(scaler.fit_transform(data),columns=columns)
# my_data.head()

# Hold out 10% of the rows for testing
X_train,X_test,Y_train,Y_test=train_test_split(new_data,Price,test_size=0.1,shuffle=True)

model=LinearRegression()
model.fit(X_train,Y_train)
Yhat=model.predict(X_test)

plt.grid()
columns=X_test.columns
plt.scatter(x=X_test["CRIM"],y=Y_test,color="Blue",label="Actual Output")
plt.scatter(X_test["CRIM"],Yhat,color="orange",label="Predicted Output")
plt.legend()
<matplotlib.legend.Legend at 0x1ce4814d460>


print(model.score(X_train,Y_train))
print(r2_score(Y_test,Yhat))
print(mean_squared_error(Y_test,Yhat))
print(model.coef_)
print(model.intercept_)

0.7419205308931333
0.6392849087461008
37.40510651778141
[-1.07582489e-01 3.51316401e-02 -2.69353830e-02 1.91968740e+00
-1.78211466e+01 4.64921433e+00 -1.07024210e-02 -1.42903936e+00
2.63749083e-01 -1.05532384e-02 -9.31746561e-01 8.56216938e-03
-4.32233650e-01]
30.902529774911535
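
The MSE and R-squared values printed above follow standard definitions, which can be written out directly. The sketch below uses made-up target and prediction arrays, not the housing data, and the helper names are hypothetical. The residuals it computes are the quantity to inspect for trends or patterns.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 7.5, 9.0])   # hypothetical predictions

residuals = y_true - y_pred               # plot these against y_pred to look for patterns
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

A training score (0.74 here) noticeably above the test R-squared (0.64) is a sign that the model fits the training split better than unseen data.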

CSE303 – Machine Learning 22DCS015-Raj Desai

Faculty of Technology and Engineering


Devang Patel Institute of Advance Technology and Research
Department of Computer Science & Engineering
Date: / /

Academic Year : 2024-25 Semester : 5

Course code : CSE303 Course name : Machine Learning

Practical - 3
Aim: To understand and apply the logistic regression algorithm for binary
classification.

1. Data Exploration and Preprocessing:


• Load and inspect the dataset to understand its structure and attributes.
• Handle missing values and perform data cleaning as necessary.
• Explore the distributions of variables and correlations using histograms, box plots, and correlation matrices.

2. Feature Selection and Engineering:


• Select relevant features that may influence the likelihood of diabetes based on domain knowledge and exploratory data analysis.
• Optionally, transform or normalize features to improve model performance.

3. Model Building and Evaluation:

• Split the dataset into training and test sets.
• Implement logistic regression using a scikit-learn library.
• Train the model on the training set and evaluate its performance using metrics such as accuracy, precision, recall, and F1-score.
• Compare the performance of the logistic regression model with other classification algorithms if applicable.
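
The four metrics named above can all be derived from the counts in a binary confusion matrix. The sketch below uses invented counts, and the function name is an assumption for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

For a medical target such as diabetes, recall is often watched closely because a false negative means a missed case.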

4. Interpretation and Insights:


• Interpret the coefficients of the logistic regression model to understand the impact of each feature on the likelihood of diabetes.
• Visualize the model's predictions using ROC curves and precision-recall curves to assess its discriminative ability.
• Discuss insights gained regarding the factors contributing to diabetes risk and potential implications for preventive healthcare strategies.

Extra:
1. With Positively Correlated Features:


2. With Negatively Correlated Features:

22DCS006 , 22DCS012 ,22DCS015

Practical-4(Analysis)
Aim: To understand and apply the KNN algorithm for multi-class classification
to classify instances into multiple categories based on their features.

Objectives:
• Explore and understand the dataset structure and attributes suitable for
multiclass classification.
• Implement the KNN algorithm to classify instances into one of several
predefined classes.
• Evaluate the performance of the KNN model using appropriate metrics.
• Interpret the results and understand the strengths and limitations of the
KNN approach for multiclass classification.
• Additionally, perform feature selection on the dataset and analyze the
results. Document the findings as per the attached template.

1. What is data and where is data?


• Dataset Information:
The car evaluation dataset contains 1728 car records with 7 columns and may be useful for testing constructive induction and structure discovery methods. It was created by Marko Bohanec and is available at the UCI repository.

• Attributes:
This dataset has 6 independent variables (features) and 1 dependent variable (target). The 6 features are buying, maintenance, persons, doors, lug_boot and safety, and the target column is class. All columns contain string values.

2. Is the data good?


The dataset has no missing values in any column, but that does not mean no preprocessing is required. KNN computes distances between instances, so it needs numeric values, while all the values here are strings. We therefore apply Label Encoding (as there are more than 2 class labels).

In the target column, 0 is acceptable (acc), 1 is good (good), 2 is unacceptable (unacc) and 3 is very good (vgood).
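
The 0 to 3 mapping above is what an alphabetical label encoding produces. The sketch below reproduces it with pandas categorical codes as a stand-in; the report does not show which encoder was used, and the sample labels are invented.

```python
import pandas as pd

# Hypothetical slice of the target column
target = pd.Series(["unacc", "acc", "vgood", "good", "unacc"])

as_category = target.astype("category")   # categories are sorted alphabetically
codes = as_category.cat.codes             # integer code per row
mapping = dict(enumerate(as_category.cat.categories))

print(mapping)
```

Because the codes follow alphabetical order rather than quality order, distances between encoded class values carry no meaning for the target; the encoding only gives each class a distinct integer.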

As the class distribution shows, there is some class imbalance in the target column.

No outliers can be seen in the box plot. Hence, we can say that our data is good.
• Correlation Matrix:

From the above correlation matrix we can say that the higher the number of persons, the lower the chance of acceptability. This does not mean that we can drop the 'persons' feature during feature selection: combined with other features it may improve our score, although it may also decrease it.
Let's check this by doing feature selection:
Case  Features                                  Reason
1     All                                       Checking the accuracy without feature selection.
2     persons, buying, lug_boot, maint, safety  Ignoring number of doors as it has no connection with car acceptability.
3     persons, buying, maint, safety            Ignoring size of luggage and focusing on price.
4     persons, maint, safety                    Focusing on maintenance and safety.
5     persons, safety                           Taking safety as an important measure.
6     buying                                    As it is mostly correlated to class.

Confusion Matrices
Case-1 (All features):

Case-2 (5 features):

Case-3 (4 features):

Case-4 (3 features):

Case-5 (2 features):

Case-6 (1 feature):

Performance Table:
Case  Accuracy  Precision  Recall  F1-Score  Time taken
1     92.20     91.99      92.20   91.81     15.2 ms
2     92.20     92.18      92.20   92.18     14.5 ms
3     86.71     87.73      86.71   86.78     22.0 ms
4     77.75     74.93      77.75   75.33     16.2 ms
5     73.99     70.60      73.99   71.61     14.6 ms
6     66.47     44.19      66.47   53.09     17.5 ms


Interpretation of results:
1. Case-2 (5 features) performs slightly better than Case-1 (6 features): the accuracy is the same, but precision and F1-score are higher. This says that 'doors' has only a slight effect on our target column.
2. Case-2 also takes the least time (14.5 ms) and provides the best results.
3. We can't ignore Case-3, as it provides a moderate result even after ignoring 2 features. Still, it takes the longest time (22.0 ms) to give these results.
4. Case-5 takes less time than all the other cases (except Case-2), but it does not provide a noteworthy result.
5. The lowest score is given by Case-6, as it has only 1 feature (buying). This says that buying alone can't decide the car acceptability.
6. The accuracy and recall of Case-1 and Case-2 are the same, but there is a slight difference in precision, F1-score and time, which says that the 'doors' feature was not contributing much.
7. Case-2 gives us a good example of feature selection, i.e., removing an unnecessary feature and getting equal or better scores in less time.
8. Also, the 'persons' feature, despite its negative correlation with 'class', does not decrease the accuracy. This means the persons feature together with the other features provides a better score.

Practical-5 (Analysis)

Aim: The aim of this case study is to demonstrate the application of Principal
Component Analysis (PCA) for feature selection and to compare the
performance of Naïve Bayes and Support Vector Machine (SVM) classifiers
using the full dataset and a reduced feature set.

Objective:
1. To understand the concept of feature selection using PCA.
2. To apply PCA for selecting the K best features from a dataset.
3. To train and evaluate the performance of Naïve Bayes and SVM classifiers.
4. To compare the accuracy and training time of the models with the full dataset
and with the selected K features.
5. Additionally, perform the following tasks on the given dataset:
• Feature selection - PCA
• Cross-validation
6. Interpret and discuss the results as per the attached template.

1. What is data and where is data?

• Dataset Information:
Arcene dataset contains 200 records and its task is to distinguish
cancer versus normal patterns from mass-spectrometric data. This
dataset is one of 5 datasets of the NIPS 2003 feature selection
challenge. It was created by Isabelle Guyon, Steve Gunn, Asa Ben-
Hur, Gideon Dror and is available at UCI repository.

• Attributes:
The dataset is provided as 5 separate files: train_data, test_data, valid_data, valid_labels and train_labels. After merging the files we get a dataset of 200 rows and 10002 columns, where 10001 columns are features and 1 column is the Target, having values 1 and -1.

2. Is the data good?


After merging those files, we get data in which one whole column has NULL values.

As only the 10000th column consists of NULL values, we can easily drop that column and train our model accurately.
As for outliers, we have 10002 columns, so some outliers are to be expected. This can be seen from the image below:

With such a large dataset and only 1 column of NULL values, we can say that the data is good. As for the outliers, we can't expect to totally exclude them from 10002 columns.
As for accuracy, without applying PCA, feature selection and cross-validation, we get 61.67 for NB and 93.33 for SVM.

• Naïve Bayes and SVM with PCA and Cross Validation:

Case      Model  Accuracy  Precision  Recall  F1-Score
1 (95%)   NB     63.00     57.27      64.71   60.62
1 (95%)   SVM    88.00     84.17      89.74   86.79
2 (97%)   NB     65.50     60.02      65.88   62.68
2 (97%)   SVM    89.50     88.65      87.45   88.00
3 (99%)   NB     63.50     58.17      63.59   60.49
3 (99%)   SVM    91.50     89.37      92.03   90.55
4 (100%)  NB     61.50     56.14      61.37   58.43
4 (100%)  SVM    89.50     87.89      88.56   88.16

Here 95%, 97%, 99% and 100% are variance thresholds. They are used in PCA to select how many principal components to keep. For example, if we have 100 features in a dataset and the first 10 or 20 principal components capture 95% of the variance, then we might only need those 10 or 20 principal components instead of all 100 features.
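
How a variance threshold turns into a number of components can be sketched with NumPy alone. This is an illustration under assumptions, not the code used for the results above: the function name and the tiny 4x3 matrix are made up. scikit-learn's `PCA` accepts a float `n_components` between 0 and 1 for the same purpose.

```python
import numpy as np

def components_for_variance(X, threshold):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` (between 0 and 1)."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    explained = s ** 2 / np.sum(s ** 2)           # variance ratio per component
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(s)), cumulative

# Hypothetical data: three orthogonal directions with variances in ratio 81 : 9 : 1
a = np.array([1.0, 1.0, -1.0, -1.0])
b = np.array([1.0, -1.0, 1.0, -1.0])
c = np.array([1.0, -1.0, -1.0, 1.0])
X = np.column_stack([9 * a, 3 * b, 1 * c])

k95, cumulative = components_for_variance(X, 0.95)
k99, _ = components_for_variance(X, 0.99)
```

In this toy matrix the first component alone explains about 89% of the variance, so a 95% threshold needs two components and a 99% threshold needs all three.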

• Naïve Bayes and SVM with Feature Selection and PCA + Cross-validation (top 5000 features):

Case      Model  Accuracy  Precision  Recall  F1-Score
1 (95%)   NB     69.50     63.25      72.68   67.53
1 (95%)   SVM    96.00     96.84      94.25   95.41
2 (97%)   NB     70.50     64.58      72.68   68.27
2 (97%)   SVM    96.00     97.65      93.20   95.36
3 (99%)   NB     69.50     63.49      71.63   67.18
3 (99%)   SVM    95.50     95.67      94.31   94.88
4 (100%)  NB     66.00     58.71      77.25   66.61
4 (100%)  SVM    65.60     58.21      77.25   66.27

• Interpretation of Results:
1. From the above results, the highest accuracies of NB and SVM are 70.50 and 96.00 respectively. Both are obtained with Feature Selection (top 5000 features) + PCA (97% variance threshold) + Cross-Validation.
2. Meanwhile, the lowest accuracy of NB is 61.50, without Feature Selection, and for SVM it is 65.60, with Feature Selection but a 100% variance threshold.
3. There is a drastic drop in the accuracy of SVM at the 100% variance threshold with Feature Selection. A 100% variance threshold means taking all the principal components and keeping the entire variance of the dataset.
4. Without Feature Selection, the best results were 93.33 for SVM (without PCA and cross-validation) and 65.50 for NB (with PCA at 97% and cross-validation).
5. So, by keeping roughly half of the features (the top 5000), we get a higher accuracy, which justifies our feature selection.
6. Feature Selection also helped Naïve Bayes to improve its accuracy and maintain a higher recall.
7. Keeping a 100% variance threshold does not give the best results. A 97% threshold gives the best NB accuracy in both tables and ties for the best SVM accuracy with Feature Selection; without Feature Selection, SVM does best at 99%.
8. Even after applying Feature Selection, PCA and Cross-Validation, we get only 70.50% accuracy for the Naïve Bayes model. On the other hand, without these three, SVM gives 93.33% accuracy, which shows that on the ARCENE dataset we should apply SVM rather than Naïve Bayes.
9. Hence, whenever we have a large dataset like ARCENE, we should first try reducing the features, which might help to increase our accuracy. It might not give a huge increase (as with Naïve Bayes), but we should keep experimenting to find the best results.

Practical-6 (Analysis)
Aim: To understand and implement a Neural Network with Iris dataset using
Keras.

Objective:
1. To familiarize with the Keras library for building neural networks.
2. To understand the architecture of a simple neural network.
3. To implement, train, and evaluate a neural network model.
4. To visualize the performance of the model.
5. To answer key questions about the neural network’s performance and optimization

1. What is data and where is data?

• Dataset Information:
The iris dataset is a classic dataset in machine learning, consisting of
150 samples of three species of iris flowers. This dataset is widely
used for classification tasks and can be found in various sources,
including popular libraries like Scikit-learn and online repositories like
Kaggle.

• Attributes:

The iris dataset features four attributes, each representing a measurement of a specific part of the iris flower, and a target column with 3 classes, i.e., Virginica, Versicolor and Setosa. The attributes include:

• Sepal length: The length of the flower's sepal in centimeters.
• Sepal width: The width of the flower's sepal in centimeters.
• Petal length: The length of the flower's petal in centimeters.
• Petal width: The width of the flower's petal in centimeters.

2. Is the data good?

Observing the dataset provided, it can be said that the data is good and need not be processed further, as there are no null values.

As we can see in the above graph, there is no class imbalance.

• Correlation Matrix:

3. Results:

Case-1: Without Hidden Layers.

Confusion Matrix:

Case-2: With Hidden Layers(4 Hidden Layers):

Confusion Matrix:

Performance Table:
Case                  Accuracy  Precision  Recall  F1_score
Without hidden layer  87%       87%        87%     87%
With hidden layer     100%      100%       100%    100%

Conclusion:

Case-1: Without Hidden Layers

• Based on the confusion matrix, it is observed that the performance is satisfactory, but
limited in terms of capturing complex patterns. Without hidden layers, the model tends to
behave similarly to linear models and may not be able to fully interpret the non-linear
relationships present in the dataset.
• This setup highlights the simplicity of the architecture but shows that the absence of
hidden layers restricts the model's ability to generalize well.

Case-2: With Hidden Layers (4 Hidden Layers)

• The confusion matrix indicates improved performance compared to Case-1. Adding hidden layers allows the network to capture more complex relationships between the input features, improving accuracy in classification tasks.
• The results indicate that the inclusion of hidden layers significantly boosts the model’s
ability to generalize and perform well, particularly in handling non-linear data.

In summary, Case-2 outperforms Case-1 due to the presence of hidden layers, which enable the
network to better capture intricate patterns within the dataset.
CSE303 Machine Learning 22DCS015

Faculty of Technology and Engineering


Devang Patel Institute of Advance Technology and Research
Department of Computer Science & Engineering
Date: / /

Academic Year : 2024-25 Semester : 5

Course code : CSE303 Course name : Machine Learning

Practical - 7
Aim: To understand and apply basic operations of Numpy, Pandas, and
Matplotlib for data manipulation, analysis, and visualization.

Part-1: Data Preprocessing


1. Load the data:

2. Dropping unwanted columns:

3. Processing data to convert object to int or float:


4. Convert "Yes" and "No" to 0 and 1, and "Male" and "Female" to 0 and 1, respectively:

5. Train/Test split:

Part-2: K-Means Clustering:


1. Model Fitting and implementation:

2. Visualizing the result:

Part-3: DB-Scan Clustering:
1. Model Fitting and implementation:

2. Visualize Results:


Part-4: Analysis and Comparison:

Analysis:

Based on the clustering evaluation results, K-means significantly outperforms DBSCAN on the Churn dataset. K-means achieves a Silhouette Score of 0.64, which indicates that the clusters are well-separated and compact. In contrast, DBSCAN's Silhouette Score is negative (-0.11), suggesting poor clustering structure and that many points may be assigned to the wrong cluster or not well grouped.

Regarding the Davies-Bouldin Index, K-means also performs better with a lower score of
0.50 compared to DBSCAN’s 0.78. A lower Davies-Bouldin Index signifies that the clusters
formed by K-means are more distinct and compact, while DBSCAN struggles to produce
cohesive clusters, leading to a higher score.


Practical 8: Implementing Q-Learning Algorithm in Reinforcement Learning

Aim:
To understand and implement the Q-Learning algorithm for solving a reinforcement learning
problem and evaluate its performance.
Objectives
1. To gain practical knowledge of Q-Learning algorithm.
2. To implement Q-Learning in a simulated environment.
3. To analyze the performance of the Q-Learning algorithm.
4. To understand the impact of different hyperparameters on the algorithm’s performance.

Description
A growing e-commerce company is building a new warehouse, and the company would like
all of the picking operations in the new warehouse to be performed by warehouse robots. In
the context of e-commerce warehousing, “picking” is the task of gathering individual items
from various locations in the warehouse in order to fulfil customer orders. After picking items
from the shelves, the robots must bring the items to a specific location within the warehouse
where the items can be packaged for shipping. In order to ensure maximum efficiency and
productivity, the robots will need to learn the shortest path between the item packaging area
and all other locations within the warehouse where the robots are allowed to travel.
Tasks to be Performed:

1. Setup Environment:
• Install the OpenAI Gym toolkit.
• Load the Frozen Lake environment.
• Understand the environment's state and action space.

Output:

2. Implement Q-Learning Algorithm:

• Initialize the Q-table with zeros.
• Define the hyperparameters: learning rate (alpha), discount factor (gamma), and exploration rate (epsilon).
• Implement the Q-Learning update rule.
• Implement an epsilon-greedy policy for action selection.
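
The four steps above can be sketched without Gym on a tiny made-up environment. Everything here is an assumption for illustration: the five-state corridor, the function names and the hyperparameter values are not the Frozen Lake setup used in this practical.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 5, 2, 4   # hypothetical corridor: states 0..4

def step(state, action):
    """Move left (0) or right (1); reaching GOAL gives reward 1 and ends the episode."""
    nxt = max(state - 1, 0) if action == 0 else min(state + 1, GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q[state]))

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=1.0,
          epsilon_min=0.05, epsilon_decay=0.99, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))           # Q-table initialised with zeros
    for _ in range(episodes):
        state = 0
        for _ in range(100):                      # step cap per episode
            action = choose_action(q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * np.max(q[nxt]))
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
            if done:
                break
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return q

q_table = train()
greedy_policy = np.argmax(q_table[:GOAL], axis=1)   # best action in each non-goal state
print(greedy_policy)
```

In this sketch alpha and gamma stay within (0, 1] and epsilon decays from 1.0 toward a small floor, which are the usual ranges for these hyperparameters.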

Output:

3. Training the Agent:

• Run the Q-Learning algorithm for a fixed number of episodes.
• Update the Q-values based on the experiences gained by the agent.
• Monitor the agent performance over time.

Output:

4. Evaluation:
• Test the trained Q-Learning agent in the environment.
• Measure the agent's performance using metrics such as average reward per episode.

Output:

5. Analysis:
• Analyse how different hyperparameters affect the learning process.
• Compare the performance of the Q-Learning agent with other baseline methods if available.

Analyse:
Alpha  Gamma  Epsilon  Epsilon_min  Epsilon_Decay  Avg Reward
0.2    0.9    1.25     0.02         0.912          0.0
1.0    2.9    1.25     1.2          3.2            0.0
0.1    0.99   1.0      0.01         0.995          0.06

Practical 9: Implementing a Convolutional Neural Network (CNN) for Leaf Classification

Aim:
The aim of this case study is to implement a Convolutional Neural Network (CNN) for the binary classification of leaf images. The goal is to build a model that can accurately distinguish between images of fresh and scrotch leaf.

Objectives:
1. Understand the architecture and working principles of Convolutional Neural
Networks.
2. Preprocess the dataset to make it suitable for training and testing the CNN.
3. Implement and train a CNN using a popular deep learning framework.
4. Evaluate the performance of the trained CNN model.
5. Optimize the model to improve its accuracy.
6. Answer key questions related to the implementation and results of the CNN.

1. Dataset Preparation
Directory Setup:
The dataset was split into three main subsets: training, validation, and testing. For each of these subsets, the images were further categorized into two classes, 'fresh leaf' and 'scrotch leaf'. The directory structure was established programmatically using the os module:

- Training Directory: Contains images for training the model.


- train/fresh: Contains all the training images for fresh leaf.
- train/scrotch: Contains all the training images for scrotch leaf.

- Validation Directory: Used for evaluating the model's performance after each epoch
during training.
- validation/fresh: Contains validation images for fresh leaf.
- validation/scrotch: Contains validation images for scrotch leaf.

- Test Directory: Used to test the model's performance after training is completed.
- test/fresh: Contains test images for fresh leaf.
- test/scrotch: Contains test images for scrotch leaf.

Image Preprocessing:
Each image was resized to a uniform size of 256x256 pixels to match the input shape configured for the EfficientNetB5 model.
Color Mode: Images were processed as RGB.
Normalization: Image pixel values were normalized to be within the [0,1] range.

Data augmentation:

2. Model Building
EfficientNetB5 Base Model:
The CNN model used EfficientNetB5, a state-of-the-art pre-trained model trained on the
ImageNet dataset. EfficientNet is known for achieving high accuracy while being
computationally efficient.
Key Details:
- Weights: Pre-trained on ImageNet.
- Input Shape: (256, 256, 3)
- Pooling: Max pooling applied at the end of the base model.
Custom Top Layers:
- BatchNormalization
- Dense Layer with 512 neurons
- Output Layer with Sigmoid activation (for binary classification).

3. Model Training
Compilation:
- Optimizer: Adamax with a learning rate of 0.001.
- Loss Function: Binary Cross-Entropy.
- Metrics: Accuracy.

Results:
- Training Loss: 2.96e-16
- Training Accuracy: 100.00%
- Validation Loss: 0.00
- Validation Accuracy: 100.00%

4. Model Evaluation
Test Set Evaluation:
- Test Loss: 0.00
- Test Accuracy: 100.00%
Classification Report:
Precision, recall, and F1-score for both the fresh and scrotch leaf classes were excellent, with nearly perfect scores for both classes.

5. Confusion Matrix
All 20 test leaf images were classified correctly.
