Exercise6 Solution
IE0005 Exercise solutions 6

Exercise 6 : Clusters and Anomalies

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

# %matplotlib inline is a magic command that makes the plots generated
# by matplotlib appear inline in the notebook, immediately below the
# code cell, rather than in a separate output window.
# It can be omitted in recent versions of Jupyter Notebook, since
# "inline" is their default backend.
%matplotlib inline

# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

houseData = pd.read_csv('train.csv')
houseData.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub ...        0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition SalePrice
0    2008       WD        Normal    208500
1    2007       WD        Normal    181500
2    2008       WD        Normal    223500
3    2006       WD       Abnorml    140000
4    2008       WD        Normal    250000

[5 rows x 81 columns]
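
As a quick follow-up (an addition, not part of the original solution), it is worth confirming the shape of the dataset and checking that the two features used later are numeric with no missing values.

# Sanity check the Data


print(houseData.shape) # expect (1460, 81) for the Kaggle train set
houseData[['GrLivArea','GarageArea']].info() # dtypes and non-null counts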

Problem 1 : Clustering by Gr Living Area and Garage Area


Extract the required variables from the dataset, and then perform Bi-Variate Clustering.

# Extract the Features from the Data


X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])

# Plot the Raw Data on a 2D grid


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)

<matplotlib.collections.PathCollection at 0x1dfad8709a0>

Basic KMeans Clustering
Guess the number of clusters from the 2D plot, and perform KMeans Clustering.
We will use the KMeans clustering model from the sklearn.cluster module.

# Import KMeans from sklearn.cluster


import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "sklearn")
from sklearn.cluster import KMeans

# Guess the Number of Clusters


num_clust = 3

# Create Clustering Model using KMeans


kmeans = KMeans(n_clusters = num_clust, n_init = 10)

# Fit the Clustering Model on the Data


kmeans.fit(X)

KMeans(n_clusters=3)

Print the Cluster Centers as Co-ordinates of Features

# Print the Cluster Centers


print("Features", "\tLiving", "\tGarage")
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()

Features    Living    Garage

Cluster 0:  2570.17   678.3
Cluster 1:  1086.74   375.3
Cluster 2:  1696.92   522.68

Labeling the Clusters in the Data


We may use the model on the data to predict the clusters.

# Predict the Cluster Labels


labels = kmeans.predict(X)

# Append Labels to the Data


X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels


sb.countplot(x=X_labeled["Cluster"])

<AxesSubplot: xlabel='Cluster', ylabel='count'>


# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
sc = plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Cluster",
                 cmap = 'viridis', data = X_labeled)
legend = axes.legend(*sc.legend_elements(), loc = "lower right",
                     title = "Classes")

Within Cluster Sum of Squares


WithinSS = 0 : every data point is a cluster on its own
WithinSS = Total Sum of Squares (the sum of squared deviations from the overall mean) : the whole dataset is a single cluster
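
To make this concrete, here is a minimal sketch (an addition, not part of the original solution) that recomputes WithinSS by hand as the sum of squared distances from each point to its assigned cluster center. It should match kmeans.inertia_, printed below, up to floating-point error.

# Recompute WithinSS manually (sketch; assumes kmeans has been fitted above)


centers = kmeans.cluster_centers_[kmeans.labels_] # center assigned to each point
manual_ss = ((X.to_numpy() - centers) ** 2).sum() # squared distances, summed
print("Manual Within Cluster Sum of Squares :", manual_ss)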

# Print the Within Cluster Sum of Squares


print("Within Cluster Sum of Squares :", kmeans.inertia_)

Within Cluster Sum of Squares : 140805418.54656714

Discuss : Is this the optimal clustering that you would be happy with? If not, try changing
num_clust. One systematic way to choose it is sketched below.
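
One hedged way to choose num_clust, beyond guessing (an addition, not part of the original solution): sweep the number of clusters and plot the Within Cluster Sum of Squares for each. An "elbow" in the curve, where the decrease slows down sharply, suggests a reasonable number of clusters.

# Sketch : sweep num_clust and plot WithinSS (the "elbow" method)


inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters = k, n_init = 10)
    km.fit(X)
    inertias.append(km.inertia_)

f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.plot(range(1, 11), inertias, marker = 'o')
plt.xlabel("Number of Clusters")
plt.ylabel("Within Cluster Sum of Squares")
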
Anomaly Detection for the Dataset
Extract the required variables from the dataset, and then perform Bi-Variate Anomaly Detection.

# Extract the Features from the Data


X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])

# Plot the Raw Data on a 2D grid


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)

<matplotlib.collections.PathCollection at 0x1dfaf326910>

Basic Anomaly Detection


Use the Nearest Neighbors (k-NN) pattern-identification method for detecting Outliers and Anomalies.
We will use the LocalOutlierFactor neighborhood model from the sklearn.neighbors module; a toy illustration follows the import below.

# Import LocalOutlierFactor from sklearn.neighbors


from sklearn.neighbors import LocalOutlierFactor
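
As a toy illustration (an addition, not part of the original solution, reusing the import above and NumPy from the setup cell): fit_predict labels inliers as 1 and outliers as -1, based on how isolated each point is relative to the local density of its neighbors.

# Toy sketch : an obvious outlier in a tiny 1D dataset


toy = np.array([[1.0], [1.1], [0.9], [1.05], [10.0]]) # 10.0 is far from the rest
print(LocalOutlierFactor(n_neighbors = 2).fit_predict(toy)) # likely [ 1  1  1  1 -1 ]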

# Set the Parameters for Neighborhood


num_neighbors = 20 # Number of Neighbors
cont_fraction = 0.05 # Fraction of Anomalies

# Create Anomaly Detection Model using LocalOutlierFactor


lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)

# Fit the Model on the Data
lof.fit(X)

LocalOutlierFactor(contamination=0.05)

Labeling the Anomalies in the Data


We may use the model on the data to predict the anomalies.

# Predict the Anomalies


labels = lof.fit_predict(X)

# Append Labels to the Data


X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)

# Summary of the Anomaly Labels


sb.countplot(x=X_labeled["Anomaly"])

<AxesSubplot: xlabel='Anomaly', ylabel='count'>

# Visualize the Anomalies in the Data


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Anomaly",
            cmap = 'viridis', data = X_labeled)

<matplotlib.collections.PathCollection at 0x1dfb0c12e50>

Discuss : Is this the optimal anomaly detection that you would be happy with? If not, try changing
the parameters num_neighbors and cont_fraction. A sketch for inspecting the raw anomaly scores follows below.
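
Beyond the binary labels, the fitted model also exposes raw scores (an addition, not part of the original solution): negative_outlier_factor_ holds the negated LOF score of each training point, so the most negative values mark the strongest anomalies.

# Rank the strongest anomalies by their (negated) LOF scores


X_scored = X.copy()
X_scored["LOF_Score"] = lof.negative_outlier_factor_ # more negative = more anomalous
print(X_scored.sort_values("LOF_Score").head(10)) # the ten strongest anomalies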
