Exercise6 Solution
IE0005 Exercise solutions 6

Exercise 6 : Clusters and Anomalies

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

# %matplotlib inline is a magic command that makes the plots generated
# by matplotlib appear inline in the notebook, immediately below the
# code cell, rather than in a separate output window.
# It can be omitted in recent versions of Jupyter Notebook, since
# "inline" is their default backend.
%matplotlib inline

# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

houseData = pd.read_csv('train.csv')
houseData.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub ...        0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition SalePrice
0    2008       WD        Normal    208500
1    2007       WD        Normal    181500
2    2008       WD        Normal    223500
3    2006       WD       Abnorml    140000
4    2008       WD        Normal    250000

[5 rows x 81 columns]
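
As a quick follow-up (an addition, not part of the original solution), it is worth confirming the shape of the dataset and checking that the two features used later are numeric with no missing values.

# Sanity check the Data


print(houseData.shape) # expect (1460, 81) for the Kaggle train set
houseData[['GrLivArea','GarageArea']].info() # dtypes and non-null counts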

Problem 1 : Clustering by Gr Living Area and Garage Area


Extract the required variables from the dataset, and then perform Bi-Variate Clustering.

# Extract the Features from the Data


X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])

# Plot the Raw Data on a 2D grid


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)

<matplotlib.collections.PathCollection at 0x1dfad8709a0>

Basic KMeans Clustering
Guess the number of clusters from the 2D plot, and perform KMeans Clustering.
We will use the KMeans clustering model from the sklearn.cluster module.

# Import KMeans from sklearn.cluster


import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "sklearn")
from sklearn.cluster import KMeans

# Guess the Number of Clusters


num_clust = 3

# Create Clustering Model using KMeans


kmeans = KMeans(n_clusters = num_clust, n_init = 10)

# Fit the Clustering Model on the Data


kmeans.fit(X)

KMeans(n_clusters=3)

Print the Cluster Centers as Co-ordinates of Features

# Print the Cluster Centers


print("Features", "\tLiving", "\tGarage")
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()

Features    Living    Garage

Cluster 0:  2570.17   678.3
Cluster 1:  1086.74   375.3
Cluster 2:  1696.92   522.68

Labeling the Clusters in the Data


We may use the model on the data to predict the clusters.

# Predict the Cluster Labels


labels = kmeans.predict(X)

# Append Labels to the Data


X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels


sb.countplot(x=X_labeled["Cluster"])

<AxesSubplot: xlabel='Cluster', ylabel='count'>


# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
sc = plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Cluster",
                 cmap = 'viridis', data = X_labeled)
legend = axes.legend(*sc.legend_elements(), loc = "lower right",
                     title = "Classes")

Within Cluster Sum of Squares


WithinSS = 0 : every data point is a cluster on its own
WithinSS = Total Sum of Squares (the sum of squared deviations from the overall mean) : the whole dataset is a single cluster
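
To make this concrete, here is a minimal sketch (an addition, not part of the original solution) that recomputes WithinSS by hand as the sum of squared distances from each point to its assigned cluster center. It should match kmeans.inertia_, printed below, up to floating-point error.

# Recompute WithinSS manually (sketch; assumes kmeans has been fitted above)


centers = kmeans.cluster_centers_[kmeans.labels_] # center assigned to each point
manual_ss = ((X.to_numpy() - centers) ** 2).sum() # squared distances, summed
print("Manual Within Cluster Sum of Squares :", manual_ss)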

# Print the Within Cluster Sum of Squares


print("Within Cluster Sum of Squares :", kmeans.inertia_)

Within Cluster Sum of Squares : 140805418.54656714

Discuss : Is this the optimal clustering that you would be happy with? If not, try changing
num_clust. One systematic way to choose it is sketched below.
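
One hedged way to choose num_clust, beyond guessing (an addition, not part of the original solution): sweep the number of clusters and plot the Within Cluster Sum of Squares for each. An "elbow" in the curve, where the decrease slows down sharply, suggests a reasonable number of clusters.

# Sketch : sweep num_clust and plot WithinSS (the "elbow" method)


inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters = k, n_init = 10)
    km.fit(X)
    inertias.append(km.inertia_)

f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.plot(range(1, 11), inertias, marker = 'o')
plt.xlabel("Number of Clusters")
plt.ylabel("Within Cluster Sum of Squares")
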
Anomaly Detection for the Dataset
Extract the required variables from the dataset, and then perform Bi-Variate Anomaly Detection.

# Extract the Features from the Data


X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])

# Plot the Raw Data on a 2D grid


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)

<matplotlib.collections.PathCollection at 0x1dfaf326910>

Basic Anomaly Detection


Use the Nearest Neighbors (k-NN) pattern-identification method for detecting Outliers and Anomalies.
We will use the LocalOutlierFactor neighborhood model from the sklearn.neighbors module; a toy illustration follows the import below.

# Import LocalOutlierFactor from sklearn.neighbors


from sklearn.neighbors import LocalOutlierFactor
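
As a toy illustration (an addition, not part of the original solution, reusing the import above and NumPy from the setup cell): fit_predict labels inliers as 1 and outliers as -1, based on how isolated each point is relative to the local density of its neighbors.

# Toy sketch : an obvious outlier in a tiny 1D dataset


toy = np.array([[1.0], [1.1], [0.9], [1.05], [10.0]]) # 10.0 is far from the rest
print(LocalOutlierFactor(n_neighbors = 2).fit_predict(toy)) # likely [ 1  1  1  1 -1 ]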

# Set the Parameters for Neighborhood


num_neighbors = 20 # Number of Neighbors
cont_fraction = 0.05 # Fraction of Anomalies

# Create Anomaly Detection Model using LocalOutlierFactor


lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)

# Fit the Model on the Data
lof.fit(X)

LocalOutlierFactor(contamination=0.05)

Labeling the Anomalies in the Data


We may use the model on the data to predict the anomalies.

# Predict the Anomalies


labels = lof.fit_predict(X)

# Append Labels to the Data


X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)

# Summary of the Anomaly Labels


sb.countplot(x=X_labeled["Anomaly"])

<AxesSubplot: xlabel='Anomaly', ylabel='count'>

# Visualize the Anomalies in the Data


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Anomaly",
            cmap = 'viridis', data = X_labeled)

<matplotlib.collections.PathCollection at 0x1dfb0c12e50>

Discuss : Is this the optimal anomaly detection that you would be happy with? If not, try changing
the parameters num_neighbors and cont_fraction. A sketch for inspecting the raw anomaly scores follows below.
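
Beyond the binary labels, the fitted model also exposes raw scores (an addition, not part of the original solution): negative_outlier_factor_ holds the negated LOF score of each training point, so the most negative values mark the strongest anomalies.

# Rank the strongest anomalies by their (negated) LOF scores


X_scored = X.copy()
X_scored["LOF_Score"] = lof.negative_outlier_factor_ # more negative = more anomalous
print(X_scored.sort_values("LOF_Score").head(10)) # the ten strongest anomalies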
