numpy_pandas_matplotlib
PRACTICAL-1
CSE303-Machine Learning 22DCS015
Shape Manipulation:
• Reshape an array into different dimensions (e.g., from 1D to 2D).
• Transpose an array to swap rows and columns.
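For example, a minimal sketch of these two operations (array values are illustrative):
import numpy as np

a = np.arange(12)        # 1D array: [0, 1, ..., 11]
b = a.reshape(3, 4)      # reshaped into a 3x4 2D array
c = b.T                  # transposed: rows and columns swapped, shape (4, 3)
print(b.shape, c.shape)  # (3, 4) (4, 3)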
4. Looping:
• Iterate through an array using for loops and perform operations on elements.
5. File Reading:
• Load data from a text file (CSV or similar) using NumPy functions (e.g.,
np.genfromtxt).
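A small sketch of both tasks; the file name 'data.csv' is only a placeholder for whichever numeric CSV is used:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
squares = []
for x in arr:            # 4. iterate through the array with a for loop
    squares.append(x ** 2)
print(squares)

data = np.genfromtxt('data.csv', delimiter=',', skip_header=1)  # 5. load numeric CSV data
print(data.shape)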
6. Performance Benchmarking:
• Measure the execution time of matrix multiplication for a 1000x1000 matrix using NumPy arrays vs. Python lists, and analyze the performance difference.
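A minimal sketch of such a benchmark (the pure-Python version is very slow and may take minutes):
import time
import numpy as np

n = 1000
A = np.random.rand(n, n)
B = np.random.rand(n, n)

start = time.time()
C = A @ B                                   # NumPy matrix multiplication
print("NumPy :", time.time() - start, "s")

A_list, B_list = A.tolist(), B.tolist()
start = time.time()
C_list = [[sum(A_list[i][k] * B_list[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]                # nested-list multiplication
print("Lists :", time.time() - start, "s")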
Part 2: Pandas
1. DataFrame Creation:
• Construct a DataFrame from scratch, specifying column names and data
types.
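For instance, a small DataFrame built from scratch with explicit column names and dtypes (values are illustrative):
import pandas as pd

df = pd.DataFrame({
    'name':   pd.Series(['A', 'B', 'C'], dtype='string'),
    'score':  pd.Series([85, 90, 78], dtype='int64'),
    'passed': pd.Series([True, True, False], dtype='bool'),
})
print(df.dtypes)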
• Read the chosen dataset from its source file format (CSV, Excel) into a Pandas
DataFrame.
• Deal with missing values in the data
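A hedged sketch of these two steps; 'dataset.csv' stands in for the chosen dataset:
import pandas as pd

df = pd.read_csv('dataset.csv')               # or pd.read_excel(...) for Excel files
print(df.isna().sum())                        # count missing values per column
df = df.fillna(df.mean(numeric_only=True))    # fill numeric gaps (or df.dropna() to drop rows)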
Data Export:
• Export the DataFrame to a new CSV file or another desired format.
Masking:
• Apply boolean masks to filter and select data based on specific criteria.
• Read data selectively based on a boolean DataFrame.
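A minimal sketch of exporting and masking with illustrative data; the output file name is arbitrary:
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 41, 19], 'score': [60, 75, 50, 90]})

mask = (df['age'] > 20) & (df['score'] >= 60)   # boolean mask from specific criteria
selected = df[mask]                             # filter rows with the mask

print(df[df > 55])                              # selection via a boolean DataFrame (non-matches become NaN)

selected.to_csv('filtered_output.csv', index=False)   # export the result to a new CSV file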
Part 3: Matplotlib
Line Charts:
• Create a line chart to visualize a numerical variable over time or another
relevant category.
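For example, a line chart over illustrative monthly values:
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
sales = [120, 135, 128, 150, 170]

plt.plot(months, sales, marker='o')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Sales over time')
plt.show()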
Correlation Matrix:
• Generate a correlation matrix to depict relationships between features in the
dataset.
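A small sketch using plain Matplotlib on illustrative numeric features:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2, 4, 5, 4, 6],
                   'z': [5, 3, 2, 2, 1]})

corr = df.corr()                    # pairwise correlations between features
plt.imshow(corr, cmap='coolwarm')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title('Correlation matrix')
plt.show()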
Histograms:
• Construct histograms to analyze the distribution of numerical features in the
dataset.
• Detect if Outliers exist and Plot the data distribution using Box Plots.
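A minimal sketch of both plots on synthetic data:
import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(50, 10, 500)   # illustrative numerical feature

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=20)                # distribution of the feature
ax1.set_title('Histogram')
ax2.boxplot(values)                      # points beyond the whiskers indicate outliers
ax2.set_title('Box plot')
plt.show()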
Pie Charts:
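For example, a pie chart over illustrative category counts:
import matplotlib.pyplot as plt

labels = ['Class A', 'Class B', 'Class C']
counts = [45, 30, 25]

plt.pie(counts, labels=labels, autopct='%1.1f%%')
plt.title('Category share')
plt.show()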
Data Preparation:
• Load and inspect the dataset.
• Handle missing values and outliers if present.
• Perform feature scaling or normalization if necessary.
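A hedged sketch of the scaling step, assuming scikit-learn is available and the columns are numeric (values are illustrative):
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})

scaled = StandardScaler().fit_transform(df)       # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(df)     # rescaled to the [0, 1] range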
Data Preparation:
• Load and inspect the dataset.
• Handle missing values and outliers if present.
• Perform feature scaling or normalization if necessary.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score
# from sklearn.preprocessing import PolynomialFeatures
df=pd.read_csv('HousingData.csv')
df.shape
(506, 14)
df.head()
[first five rows; only the trailing columns B, LSTAT and MEDV are reproduced]
        B  LSTAT  MEDV
0  396.90   4.98  24.0
1  396.90   9.14  21.6
2  392.83   4.03  34.7
3  394.63   2.94  33.4
4  396.90    NaN  36.2
df.describe().T
[summary statistics for all 14 columns; output partially reproduced]
          count        mean         std        min         25%         50%         75%       max
AGE       486.0   68.518519   27.999513    2.90000   45.175000   76.800000   93.975000  100.0000
DIS       506.0    3.795043    2.105710    1.12960    2.100175    3.207450    5.188425   12.1265
RAD       506.0    9.549407    8.707259    1.00000    4.000000    5.000000   24.000000   24.0000
TAX       506.0  408.237154  168.537116  187.00000  279.000000  330.000000  666.000000  711.0000
PTRATIO   506.0   18.455534    2.164946   12.60000   17.400000   19.050000   20.200000   22.0000
B         506.0  356.674032   91.294864    0.32000  375.377500  391.440000  396.225000  396.9000
LSTAT     486.0   12.715432    7.155871    1.73000    7.125000   11.430000   16.955000   37.9700
MEDV      506.0   22.532806    9.197104    5.00000   17.025000   21.200000   25.000000   50.0000
For CRIM, ZN, INDUS, CHAS, NOX and RM only the 75% and max columns are recoverable:
              75%       max
CRIM     3.560263   88.9762
ZN      12.500000  100.0000
INDUS   18.100000   27.7400
CHAS     0.000000    1.0000
NOX      0.624000    0.8710
RM       6.623500    8.7800
sns.scatterplot(y=df["MEDV"],x=df["CRIM"])
sns.scatterplot(y=df["MEDV"],x=df["TAX"])
sns.regplot(y=df["MEDV"],x=df["RM"])
sns.regplot(y=df["MEDV"],x=df["CRIM"])
df.isna().sum()
CRIM 20
ZN 20
INDUS 20
CHAS 20
NOX 0
RM 0
AGE 20
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 20
MEDV 0
dtype: int64
data=df.fillna(df.mean())
df.isna().sum()
data.isna().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
fig, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(data.corr(),annot=True)
<Axes: >
sns.displot(data["MEDV"],kind="kde")
<seaborn.axisgrid.FacetGrid at 0x1ce48a27a70>
plt.grid()
sns.histplot(data["PTRATIO"])
sns.regplot(data=data,x="B",y="MEDV")
sns.regplot(data=data,x="DIS",y="MEDV")
Price=data["MEDV"]
new_data=data.drop(["MEDV"],axis=1)
# data=data.drop(["MEDV"],axis=1)
type(data)
# type(Price)
new_data
PTRATIO B LSTAT
0 15.3 396.90 4.980000
1 17.8 396.90 9.140000
2 17.8 392.83 4.030000
3 18.7 394.63 2.940000
4 18.7 396.90 12.715432
.. ... ... ...
501 21.0 391.99 12.715432
502 21.0 396.90 9.080000
503 21.0 396.90 5.640000
504 21.0 393.45 6.480000
505 21.0 396.90 7.880000
X_train, X_test, Y_train, Y_test = train_test_split(new_data, Price, test_size=0.1, shuffle=True)
model=LinearRegression()
model.fit(X_train,Y_train)
Yhat=model.predict(X_test)
plt.grid()
columns = X_test.columns
plt.scatter(x=X_test["CRIM"], y=Y_test, color="blue", label="Actual Output")
plt.scatter(X_test["CRIM"], Yhat, color="orange", label="Predicted Output")
plt.legend()
<matplotlib.legend.Legend at 0x1ce4814d460>
print(model.score(X_train,Y_train))
print(r2_score(Y_test,Yhat))
print(mean_squared_error(Y_test,Yhat))
print(model.coef_)
print(model.intercept_)
0.7419205308931333
0.6392849087461008
37.40510651778141
[-1.07582489e-01 3.51316401e-02 -2.69353830e-02 1.91968740e+00
-1.78211466e+01 4.64921433e+00 -1.07024210e-02 -1.42903936e+00
2.63749083e-01 -1.05532384e-02 -9.31746561e-01 8.56216938e-03
-4.32233650e-01]
30.902529774911535
CSE303 – Machine Learning 22DCS015-Raj Desai
Practical - 3
Aim: To understand and apply the logistic regression algorithm for binary
classification.
Explore the distributions of variables and correlations using histograms, box plots, and
correlation matrices.
Train the model on the training set and evaluate its performance using metrics such as
accuracy, precision, recall, and F1-score.
Compare the performance of the logistic regression model with other classification
algorithms if applicable.
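A minimal sketch of this workflow; the breast-cancer data here is only a stand-in for the practical's actual dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1-score :', f1_score(y_test, y_pred))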
Extra:
1. With Positively Correlated Features:
22DCS006, 22DCS012, 22DCS015
Practical-4 (Analysis)
Aim: To understand and apply the KNN algorithm for multi-class classification
to classify instances into multiple categories based on their features.
Objectives:
• Explore and understand the dataset structure and attributes suitable for
multiclass classification.
• Implement the KNN algorithm to classify instances into one of several
predefined classes.
• Evaluate the performance of the KNN model using appropriate metrics.
• Interpret the results and understand the strengths and limitations of the
KNN approach for multiclass classification.
• Additionally, perform feature selection on the dataset and analyze the
results. Document the findings as per the attached template.
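A hedged sketch of the KNN workflow described above; the file name 'car.csv' and the exact column spellings are assumptions based on the Car Evaluation dataset:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

df = pd.read_csv('car.csv')                               # hypothetical file name
features = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']

X = OrdinalEncoder().fit_transform(df[features])          # string categories -> integers
y = LabelEncoder().fit_transform(df['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))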
• Attributes:
This dataset has 6 independent variables (features) and 1 dependent variable (target). The 6 features are buying, maint (maintenance), persons, doors, lug_boot, and safety, and the target column is class. All of these columns contain string values.
As can be seen, there is some class imbalance in the target column.
No outliers can be seen in this box plot. Hence, we can say that our data
is good.
• Correlation Matrix:
From the above correlation matrix we can say that the greater the number of persons, the lower the chance of acceptability. This does not mean we can ignore the 'persons' feature in our feature selection: we also have other features, and there is a chance that the persons feature combined with other features improves our score, but there is also a possibility that it decreases it. Let's examine this by doing feature selection:
Feature-selection cases and reasons:
Case-1 (all features): Checking the accuracy without feature selection.
Case-2 (persons, buying, lug_boot, maint, safety): Ignoring the number of doors, as it has no connection with car acceptability.
Case-3 (persons, buying, maint, safety): Ignoring the size of the luggage boot and focusing on price.
Case-4 (persons, maint, safety): Focusing on maintenance and safety.
Case-5 (persons, safety): Taking safety as an important measure.
Case-6 (buying): As it is the feature most correlated with class.
Confusion Matrices
Case-1 (All features):
Case-2 (5 features):
Case-3 (4 features)
Case-4 (3 features)
Case-5 (2 features)
Case-6 (1 feature)
Performance Table:
Case Accuracy Precision Recall F1-Score Time taken
Interpretation of results:
1. Case-2, with 5 features, performs better than Case-1 with 6 features. This suggests that 'doors' has only a slight effect on our target column.
2. Also, Case-2 takes the least time and provides the best results.
3. We cannot ignore Case-3, as it provides a moderate result even after dropping two of the six features. Still, it takes considerably more time to produce these results.
4. Case-5 takes less time than all the other cases (except Case-2), but it does not provide a noteworthy result.
5. The lowest score is given by Case-6, as it has only 1 feature (buying). This shows that buying alone cannot decide car acceptability.
6. The accuracy and recall of Case-1 and Case-4 are the same, but there is a slight difference in precision, F1-score, and time, which suggests that the 'doors' feature was not contributing much.
7. Case-2 gives us a good example of feature selection, i.e., removing unnecessary features and gaining more accuracy in less time.
8. Also, the 'persons' feature, despite having a negative correlation with 'class', does not decrease the accuracy, which means that the persons feature combined with other features provides a better score.
Practical-5 (Analysis)
Aim: The aim of this case study is to demonstrate the application of Principal
Component Analysis (PCA) for feature selection and to compare the
performance of Naïve Bayes and Support Vector Machine (SVM) classifiers
using the full dataset and a reduced feature set.
Objective:
1. To understand the concept of feature selection using PCA.
2. To apply PCA for selecting the K best features from a dataset.
3. To train and evaluate the performance of Naïve Bayes and SVM classifiers.
4. To compare the accuracy and training time of the models with the full dataset
and with the selected K features.
5. Additionally, perform the following tasks on the given dataset:
• Feature selection - PCA
• Cross-validation
6. Interpret and discuss the results as per the attached template.
• Dataset Information:
The Arcene dataset contains 200 records, and its task is to distinguish cancer versus normal patterns from mass-spectrometric data. It is one of the 5 datasets of the NIPS 2003 feature selection challenge. It was created by Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror and is available at the UCI repository.
• Attributes:
This dataset provides the data in 5 separate files: train_data, test_data, valid_data, valid_labels, and train_labels. After merging the files we get a dataset of 200 rows and 10002 columns.
Hence, as only the 10000th column consists of NULL values, we can easily drop that column and train our model accurately.
As for outliers, we have 10002 columns, so it is obvious there might be some outliers. This can be seen in the image below:
So, having this huge dataset with only 1 column of NULL values, we can say that the data is good. As for the outliers, we cannot expect to completely remove outliers across 10002 columns.
As for accuracy, without applying PCA, feature selection, or cross-validation, we get 61.67 for NB and 93.33 for SVM.
Here 95%, 97%, 99%, and 100% are variance thresholds. These are used in PCA to decide how many principal components to keep. For example, if we have 100 features in a dataset and the first few principal components capture 95% of the variance, then we might only need 10 or 20 principal components instead of all 100 features.
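A hedged sketch of the pipeline described above (top-5000 feature selection, PCA at a 97% variance threshold, and 5-fold cross-validation); X and y are assumed to have been built by merging the Arcene files:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate(model, X, y):
    pipe = make_pipeline(
        SelectKBest(f_classif, k=5000),   # keep the top 5000 features
        StandardScaler(),
        PCA(n_components=0.97),           # keep components explaining 97% of the variance
        model,
    )
    return cross_val_score(pipe, X, y, cv=5).mean()

# evaluate(GaussianNB(), X, y); evaluate(SVC(), X, y)   # X, y come from the merged Arcene files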
• Interpretation of Results:
1. From the above results we can say that the highest accuracies of NB and SVM are 70.50 and 96.00 respectively, obtained with Feature Selection (top 5000 features) + PCA (97% variance threshold) and Cross-Validation.
2. Meanwhile, the lowest accuracy of NB is 61.50 without Feature Selection, and for SVM it is 65.60 with Feature Selection but a 100% variance threshold.
3. There is a drastic change in the accuracy of SVM because of the 100% variance threshold, which can be explained by the fact that a 100% threshold means taking all the principal components, i.e., keeping the entire variance of the dataset.
4. Without Feature Selection we got 93.33 for SVM (without PCA and cross-validation) and 65.50 for NB (with PCA at 97% and cross-validation).
5. So, this shows that by reducing half of the data we get a higher accuracy, which justifies our feature selection.
6. Feature Selection also helped Naïve Bayes improve its accuracy and maintain a higher recall.
7. Keeping a 100% variance threshold does not give better results; for a higher score we should take a 97% variance threshold, as can be seen in both tables.
8. Even after applying Feature Selection, PCA, and Cross-Validation, we get only 70.50% accuracy for the Naïve Bayes model, while without these three steps SVM gives 93% accuracy, which shows that on the ARCENE dataset we should prefer SVM over Naïve Bayes.
9. Hence, whenever we have a large dataset like ARCENE, we should first try reducing the features, which might help increase our accuracy. It might not give a huge improvement (as with Naïve Bayes), but we should keep experimenting to find the best results.
Practical-6 (Analysis)
Aim: To understand and implement a Neural Network on the Iris dataset using Keras.
Objective:
1. To familiarize with the Keras library for building neural networks.
2. To understand the architecture of a simple neural network.
3. To implement, train, and evaluate a neural network model.
4. To visualize the performance of the model.
5. To answer key questions about the neural network’s performance and optimization
• Dataset Information:
The iris dataset is a classic dataset in machine learning, consisting of
150 samples of three species of iris flowers. This dataset is widely
used for classification tasks and can be found in various sources,
including popular libraries like Scikit-learn and online repositories like
Kaggle.
• Attributes:
Observing the dataset provided, it can be said that the data is good and does not need further processing, as there are no null values.
• Correlation Matrix:
3. Results:
Confusion Matrix:
Confusion Matrix:
Performance Table:
Case Accuracy Precision Recall F1_score
Without hidden layer 87% 87% 87% 87%
Conclusion:
• Based on the confusion matrix, it is observed that the performance is satisfactory, but
limited in terms of capturing complex patterns. Without hidden layers, the model tends to
behave similarly to linear models and may not be able to fully interpret the non-linear
relationships present in the dataset.
• This setup highlights the simplicity of the architecture but shows that the absence of
hidden layers restricts the model's ability to generalize well.
In summary, Case-2 outperforms Case-1 due to the presence of hidden layers, which enable the
network to better capture intricate patterns within the dataset.
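A minimal Keras sketch contrasting the two cases discussed above (the hidden-layer size and number of epochs are assumptions):
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Case-1: no hidden layer (behaves like a linear/softmax classifier)
case1 = keras.Sequential([keras.Input(shape=(4,)),
                          keras.layers.Dense(3, activation='softmax')])

# Case-2: one hidden layer, able to capture non-linear patterns
case2 = keras.Sequential([keras.Input(shape=(4,)),
                          keras.layers.Dense(16, activation='relu'),
                          keras.layers.Dense(3, activation='softmax')])

for model in (case1, case2):
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=100, verbose=0)
    print(model.evaluate(X_test, y_test, verbose=0))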
Practical - 7
Aim: To understand and apply clustering algorithms (K-Means and DBSCAN) for grouping data and to evaluate the quality of the resulting clusters.
4. Convert "Yes" and "No" to 0 and 1, and "Male" and "Female" to 0 and 1, respectively:
5. Train/Test split:
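A small sketch of both steps; the column names here are hypothetical:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Churn':  ['Yes', 'No', 'Yes', 'No'],
                   'Gender': ['Male', 'Female', 'Female', 'Male'],
                   'Target': [1, 0, 1, 0]})

df['Churn']  = df['Churn'].map({'Yes': 0, 'No': 1})        # "Yes"/"No" -> 0/1
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})  # "Male"/"Female" -> 0/1

X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)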
2. Visualizing the result:
Part-3: DBSCAN Clustering:
1. Model Fitting and implementation:
2. Visualize Results:
Analysis:
Regarding the Davies-Bouldin Index, K-means also performs better with a lower score of
0.50 compared to DBSCAN’s 0.78. A lower Davies-Bouldin Index signifies that the clusters
formed by K-means are more distinct and compact, while DBSCAN struggles to produce
cohesive clusters, leading to a higher score.
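A minimal sketch of how these two Davies-Bouldin scores can be computed; the blob data is only a stand-in for the practical's preprocessed features:
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print('K-means DB index:', davies_bouldin_score(X, km_labels))
if len(set(db_labels)) > 1:        # DBSCAN needs at least 2 groups (noise is labelled -1)
    print('DBSCAN  DB index:', davies_bouldin_score(X, db_labels))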
Aim:
To understand and implement the Q-Learning algorithm for solving a reinforcement learning
problem and evaluate its performance.
Objectives
1. To gain practical knowledge of the Q-Learning algorithm.
2. To implement Q-Learning in a simulated environment.
3. To analyze the performance of the Q-Learning algorithm.
4. To understand the impact of different hyperparameters on the algorithm’s performance.
Description
A growing e-commerce company is building a new warehouse, and the company would like
all of the picking operations in the new warehouse to be performed by warehouse robots. In
the context of e-commerce warehousing, “picking” is the task of gathering individual items
from various locations in the warehouse in order to fulfil customer orders. After picking items
from the shelves, the robots must bring the items to a specific location within the warehouse
where the items can be packaged for shipping. In order to ensure maximum efficiency and
productivity, the robots will need to learn the shortest path between the item packaging area
and all other locations within the warehouse where the robots are allowed to travel.
Tasks to be Performed:
1. Setup Environment:
Install the OpenAI Gym toolkit.
Load the Frozen Lake environment.
Understand the environment’s state and action space.
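A hedged sketch of tabular Q-Learning on FrozenLake, assuming the Gym >= 0.26 API (env.reset() returns (state, info) and env.step() returns five values); the hyperparameters follow the third row of the analysis table below:
import numpy as np
import gym

env = gym.make('FrozenLake-v1')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # state-action value table

alpha, gamma = 0.1, 0.99                       # learning rate, discount factor
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        action = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
    epsilon = max(eps_min, epsilon * eps_decay)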
Output:
4. Evaluation:
Test the trained Q-Learning agent in the environment.
Measure the agent’s performance using metrics such as average reward per episode.
Output:
5. Analysis:
Analyse how different hyperparameters affect the learning process.
Compare the performance of the Q-Learning agent with other baseline methods if
available.
Analyse:
Alpha Gamma Epsilon Epsilon_min Epsilon_Decay Avg Reward
0.2 0.9 1.25 0.02 0.912 0.0
1.0 2.9 1.25 1.2 3.2 0.0
0.1 0.99 1.0 0.01 0.995 0.06
Aim:
The aim of this case study is to implement a Convolutional Neural Network (CNN) for the binary classification of leaf images. The goal is to build a model that can accurately distinguish between images of fresh and scrotch leaves.
Objectives:
1. Understand the architecture and working principles of Convolutional Neural
Networks.
2. Preprocess the dataset to make it suitable for training and testing the CNN.
3. Implement and train a CNN using a popular deep learning framework.
4. Evaluate the performance of the trained CNN model.
5. Optimize the model to improve its accuracy.
6. Answer key questions related to the implementation and results of the CNN.
1. Dataset Preparation
Directory Setup:
The dataset was split into three main subsets: training, validation, and testing. For each of
these subsets, the images were further categorized into two classes, ‘fresh leaf’ and
‘scrotch leaf’. The directory structure was established programmatically using the os
module:
- Validation Directory: Used for evaluating the model's performance after each epoch
during training.
- validation/fresh: Contains validation images for fresh leaf.
- validation/scrotch: Contains validation images for scrotch leaf.
- Test Directory: Used to test the model's performance after training is completed.
- test/fresh: Contains test images for fresh leaf.
- test/scrotch: Contains test images for scrotch leaf .
Image Preprocessing:
Each image was resized to a uniform size of 256x256 pixels to match the input size
expected by the EfficientNetB5 model.
Color Mode: Images were processed as RGB.
Normalization: Image pixel values were normalized to be within the [0,1] range.
Data Augmentation:
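A hedged sketch of the image generators; the directory names mirror the structure described above and the specific augmentations are assumptions:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,          # normalize pixel values to [0, 1]
    rotation_range=20,          # augmentation: small random rotations
    horizontal_flip=True,
    zoom_range=0.1,
).flow_from_directory('train', target_size=(256, 256), color_mode='rgb',
                      class_mode='binary', batch_size=32)

val_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    'validation', target_size=(256, 256), color_mode='rgb',
    class_mode='binary', batch_size=32)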
2. Model Building
EfficientNetB5 Base Model:
The CNN model used EfficientNetB5, a state-of-the-art pre-trained model trained on the
ImageNet dataset. EfficientNet is known for achieving high accuracy while being
computationally efficient.
Key Details:
- Weights: Pre-trained on ImageNet.
- Input Shape: (256, 256, 3)
- Pooling: Max pooling applied at the end of the base model.
Custom Top Layers:
- BatchNormalization
- Dense Layer with 512 neurons
- Output Layer with Sigmoid activation (for binary classification).
3. Model Training
Compilation:
- Optimizer: Adamax with a learning rate of 0.001.
- Loss Function: Binary Cross-Entropy.
- Metrics: Accuracy.
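A condensed sketch of the model building and compilation steps described above; the hidden-layer activation is an assumption, as it is not stated:
from tensorflow import keras
from tensorflow.keras.applications import EfficientNetB5

base = EfficientNetB5(weights='imagenet', include_top=False,
                      input_shape=(256, 256, 3), pooling='max')

model = keras.Sequential([
    base,
    keras.layers.BatchNormalization(),
    keras.layers.Dense(512, activation='relu'),    # activation assumed
    keras.layers.Dense(1, activation='sigmoid'),   # binary output: fresh vs. scrotch
])

model.compile(optimizer=keras.optimizers.Adamax(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])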
Results:
- Training Loss: 2.96e-16
- Training Accuracy: 100.00%
- Validation Loss: 0.00
- Validation Accuracy: 100.00%
4. Model Evaluation
Test Set Evaluation:
- Test Loss: 0.00
- Test Accuracy: 100.00%
Classification Report:
Precision, recall, and F1-score were excellent for both classes (fresh and scrotch leaves), with nearly perfect scores.
5. Confusion Matrix
Out of 20 test leaf images, all were classified correctly.