Artificial Intelligence: Semester Project
CSC-411
Semester Project
Instructor
Dr. Samabia Tehseem
Submitted by
Abdur Rehman Anwar
01-134142-199
Suneel Kumar
01-134142-202
Muhammad Ejaz
01-134142-201
19-12-2017
Table of Contents
1. Introduction
1.1 Domain
1.2 Application
2. Dataset
2.1 Details
2.2 Source
3. Algorithms
3.1 K-means
3.1.1 Process
3.1.2 Advantages
3.1.3 Disadvantages
3.2 Agglomerative
3.2.1 Process
3.2.2 Advantages
3.2.3 Disadvantages
3.3 Mean-Shift
3.3.1 Process
3.3.2 Advantages
3.3.3 Disadvantages
4. Result and Analysis
4.1. Screen Shots of Python Based Graphical User Interface (GUI)
4.1.1. Initial Layout of this system.
4.1.2. When user wants to load required file using this project, then it opens dialog box to load file.
4.1.3. If user tries to load other extension files which are not allowed through this system, then it shows error message by using message box.
4.1.4. If user uploaded respective extension file which is allowed through this system specification, then it shows confirmation message by using message box.
4.1.5. This is complete layout of this system, after that system works on the three clustering algorithms as they mentioned by using button for each algorithm.
4.2. K-Mean Result
4.2.1. K-means Graph
4.2.2. K-means Analysis
4.3. Agglomerative Result
4.3.1. Agglomerative Graph
4.3.2. Agglomerative Analysis
4.4. Mean-Shift Result
4.4.1. Mean-Shift Graph
4.4.2. Mean-Shift Analysis
1. Introduction
1.1 Domain
We selected machine learning as the domain of our Artificial Intelligence project. Many results can
be produced by applying clustering algorithms to cricket statistics, which makes it a broad domain
for these types of algorithms.
1.2 Application
To predict match outcomes.
To predict players’ performance.
2. Dataset
2.1 Details
The dataset we chose for our project covers the most runs scored by batsmen in T20 cricket in 2016.
It is a numeric dataset with 14 columns and 50 rows. The columns are player name, matches,
innings, not outs, runs, average runs, strike rate, run rate, best score, 100s, 50s, 6s, 4s and 0s.
2.2 Source
We took this dataset from Kaggle, a well-known dataset repository. It is available on the Kaggle
website under the name T20 Cricket Most Runs 2016 at the link below:
https://www.kaggle.com/frankfernandes/t20mostruns2016/
3. Algorithms
3.1 K-means
K-means clustering is an unsupervised learning algorithm in machine learning. As its name
suggests, it forms k clusters based on k means. It computes the distance of each data point to all
the means, compares these distances, and assigns the point to the cluster whose mean is nearest.
It then recalculates the mean of each cluster and repeats the same process until the new means are
equal to the previous means.
3.1.1 Process
Assume that we have a dataset and a set of k centroids.
Iterate through the whole dataset.
Calculate the distance between each data point and each of these centroids.
Assign each data point to the cluster whose centroid is at the minimum distance from it.
At the end of the pass, recalculate the centroids from the values in their clusters.
Repeat the process until the new centroids are equal to the previous centroids.
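The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's own code: the names `k_means`, `euclidean` and `mean_point`, the `max_iter` and `seed` parameters, the choice to draw the initial centroids at random from the data, and the handling of empty clusters are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(data, k, max_iter=100, seed=0):
    """Cluster `data` (a list of numeric tuples) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # assumption: initial centroids come from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda i: euclidean(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters
```

Calling `k_means(points, 2)` on a list of numeric tuples returns the final centroids together with the points assigned to each cluster.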
3.1.2 Advantages
It is a simple algorithm that beginners can implement.
It is fast compared to the other clustering algorithms.
It works best when the clusters in the dataset are distinct and well separated.
3.1.3 Disadvantages
It does not work well for overlapping data values, as it cannot decide which cluster these values
belong to.
It gives different results for different representations of the same dataset.
Euclidean distance is not well suited to features measured on different scales, because features
with large values dominate the distance.
Choosing the initial centers randomly can lead to inaccurate results.
It performs poorly on noisy data and is sensitive to outliers.
The clusters should be linearly separable; otherwise it will not give correct results.
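One common way to reduce the problem with Euclidean distance on mixed scales is to rescale every column before clustering. The sketch below uses min-max scaling; the function name `min_max_scale` is hypothetical, and the report does not state whether the project applied any scaling.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` (a list of numeric tuples) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            # A constant column carries no information, so map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ))
    return scaled
```

After scaling, a column such as runs (hundreds) and a column such as 100s (single digits) contribute to the distance on a comparable footing.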
3.2 Agglomerative
It is a hierarchical clustering algorithm. It clusters the dataset from the bottom up, so every
cluster at the top has sub-clusters, and these sub-clusters have their own sub-clusters.
3.2.1 Process
Consider each value as a cluster.
Construct a distance matrix.
Merge the two clusters with the minimum distance.
Reconstruct the distance matrix with the new clusters.
Repeat the process until the distance matrix is reduced to two elements.
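The merging procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: it uses single linkage (the distance between two clusters is the distance between their closest points), which the report does not specify, and it recomputes the pairwise distances on every pass instead of storing an explicit matrix. The names `agglomerative`, `single_linkage` and `n_clusters` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2):
    """Cluster distance = distance between the closest pair of points."""
    return min(euclidean(p, q) for p in c1 for q in c2)


def agglomerative(data, n_clusters=2):
    """Merge the closest clusters until only `n_clusters` remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in data]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        # Scan all pairs (the role of the distance matrix) for the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair; distances are recomputed on the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The default `n_clusters=2` mirrors the stopping rule in the last step; passing a different value stops the merging earlier or later.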
3.2.2 Advantages
It can generate an ordering of the objects, which can be useful for data visualization.
Smaller clusters are created, which can be helpful for detection.
It is simple to implement and has multiple applications.
3.2.3 Disadvantages
It can produce different results depending on which distance metric is used to build the distance
matrix.
It can produce imbalanced clusters.
It is very difficult to choose the number of clusters.
3.3 Mean-Shift
Mean shift is a nonparametric clustering algorithm which does not need the number of clusters to be
specified in advance and places no restriction on the shape of the clusters. It shifts the means
toward regions of high density until convergence occurs. It iterates over the region of interest,
calculates the mean of all values in that region, and shifts the center to the newly calculated
mean. In this way it calculates the centroids of all clusters.
3.3.1 Process
First take a random centroid and a radius for the area of interest.
Calculate the distance of every data point from this centroid.
Choose those values which are within the defined area of interest.
Calculate the mean of all chosen values.
Shift the centroid to the newly calculated mean of values.
Repeat the same process until the centroid no longer shifts.
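A minimal sketch of this loop is shown below, using a flat (uniform) kernel of radius `bandwidth`. It departs from the first step in one respect, which is an assumption of the example: instead of a random starting centroid it starts one centroid at every data point and then merges the centroids that converge to the same place. The names `mean_shift`, `shift_point`, `max_iter` and `tol` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shift_point(center, data, bandwidth):
    """Move `center` to the mean of all data points within `bandwidth` of it."""
    window = [p for p in data if euclidean(p, center) <= bandwidth]
    if not window:  # nothing in range: leave the centroid where it is
        return center
    return tuple(sum(col) / len(window) for col in zip(*window))


def mean_shift(data, bandwidth, max_iter=100, tol=1e-6):
    """Return the list of converged centroids (one per detected cluster)."""
    modes = []
    for start in data:  # assumption: one starting centroid per data point
        center = tuple(float(v) for v in start)
        for _ in range(max_iter):
            new_center = shift_point(center, data, bandwidth)
            moved = euclidean(center, new_center)
            center = new_center
            if moved < tol:  # the centroid no longer shifts
                break
        # Keep the centroid only if it is not a duplicate of one already found.
        if all(euclidean(center, m) > bandwidth / 2 for m in modes):
            modes.append(center)
    return modes
```

The number of clusters is never passed in; it falls out of how many distinct centroids remain, and it changes with the chosen `bandwidth`.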
3.3.2 Advantages
The mean shift algorithm is independent of the type of application.
It is simple and easy to implement.
It does not require a predefined cluster shape.
It can be used to cluster datasets with any type of features.
It depends on a single parameter, the bandwidth.
3.3.3 Disadvantages
The bandwidth must be chosen carefully; otherwise the result will be inaccurate.
Choosing the window size for the kernel is non-trivial.
The first centroid is picked randomly, so it can be an outlier.
4. Result and Analysis
4.1. Screen Shots of Python Based Graphical User Interface (GUI)
4.1.1. Initial layout of the system.
4.1.2. When the user wants to load a file, the system opens a dialog box to select it.
4.1.3. If the user tries to load a file with an extension that the system does not allow, it shows
an error message in a message box.
4.1.4. If the user loads a file with an allowed extension, it shows a confirmation message in a
message box.
4.1.5. This is the complete layout of the system. From here, the system runs the three clustering
algorithms, each through its own button.
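The extension check described in 4.1.3 and 4.1.4 can be separated from the GUI and sketched as a plain function. The report does not list the accepted file types or show the GUI code, so `ALLOWED_EXTENSIONS` (set here to `.csv`), the function name `check_file` and the message wording are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the report does not say which extensions are accepted.
ALLOWED_EXTENSIONS = {".csv"}


def check_file(path):
    """Return (ok, message) for a path chosen in the file dialog."""
    extension = Path(path).suffix.lower()  # case-insensitive comparison
    if extension in ALLOWED_EXTENSIONS:
        return True, f"Loaded {Path(path).name} successfully."
    return False, f"Files of type '{extension or 'unknown'}' are not supported."
```

In a tkinter-based interface, for example, the returned message could be passed to `tkinter.messagebox.showinfo` when the check succeeds and to `tkinter.messagebox.showerror` when it fails.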