Eda of Ipl

EXPLORATORY DATA ANALYSIS OF IPL MATCHES
Project submitted in the partial fulfilment of the

requirements for the award of
POST GRADUATE DEGREE IN ACTUARIAL SCIENCE
OF
Mahatma Gandhi University,

Kottayam, Kerala
Submitted by
FASAL K A (Reg. No. 200011019991)
DEPARTMENT OF ACTUARIAL SCIENCE
ST. JOSEPH’S ACADEMY OF HIGHER EDUCATION
AND RESEARCH, MOOLAMATTOM
ARAKULAM, IDUKKI – 685591
2020-2022
ST. JOSEPH’S ACADEMY OF HIGHER EDUCATION
AND RESEARCH, MOOLAMATTOM
DEPARTMENT OF ACTUARIAL SCIENCE
CERTIFICATE
This is to certify that the project entitled “ EXPLORATORY DATA ANALYSIS OF IPL
MATCHES ” is an authentic record of work carried out by Mr. FASAL K A under my
supervision and guidance in the Department of ACTUARIAL SCIENCE at St. Joseph’s
Academy of Higher Education and Research, Moolamattom.
Signature of Guide Signature of HOD

Mr. Vineeth Viswanath Mr. Dibin Thomas
Name of the Examiners: Signature with date

1.
2
DECLARATION
I hereby declare that the project report entitled “ EXPLORATORY DATA ANALYSIS OF
IPL MATCHES ” submitted in partial fulfillment of the requirements for the award of Master’s
Degree in Actuarial Science of Mahatma Gandhi University, Kottayam, is my project work. The
contents of the study, in full or parts, have not been submitted to any other institution or
university for the award of any degree or diploma.
FASAL K A
Place: Moolamattom
Date:
ACKNOWLEDGEMENT
During the course of my project work, several people collaborated with me directly and
indirectly. Without their support it would have been impossible for me to finish my work, and I
wish to dedicate this section to recognizing that support.
First of all, my gratitude goes to GOD, who has showered his divine providence and graces on me
in the completion of this work. I want to extend my sincere appreciation to Mr. Vineeth
Viswanath (Project guide, Actuarial Science, St. Joseph’s Academy of Higher Education and
Research, Moolamattom), who made the completion of this project possible.
I am deeply indebted to the other staff members of the department, Ms. Neenumol Tom and Mr.
Dibin Thomas, for their valuable and experienced guidance during the period of my study.
I would also like to thank my family members and friends for the support they have given
towards the completion of this work.
ABSTRACT
Cricket is a popular sport in India and in many other parts of the world. The T20 format of
the game has become especially popular in recent years, and the Indian Premier League (IPL), a
championship played in this format, has grown rapidly. Cricket is often said to be a game of
uncertainty, and predicting the winner of a match or of the tournament interests many fans.
Technology, meanwhile, is developing at a rapid pace, and machine learning algorithms are a
common choice for researchers who want to make predictions from a trained model. In this
project we will predict the probable percentage of winning teams in the IPL using different
supervised learning methods.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 OBJECTIVE OF THE STUDY
1.2 SCOPE OF THE STUDY
1.3 DATA AND METHODOLOGY
2. DATA ANALYSIS
2.1 MAIN PHASES IN DATA ANALYSIS
3. DATA ANALYSIS USING PYTHON
3.1 DATA PREPARATION AND CLEANING
3.11 MATCHES DF
3.12 DELIVERIES DF
3.2 EXPLORATORY ANALYSIS AND VISUALIZATION
3.21 MATCHES DF
3.22 DELIVERIES DF
4. ASKING AND ANSWERING QUESTIONS
5. CONCLUSION
6. FUTURE SCOPE
7. BIBLIOGRAPHY
CHAPTER 1
INTRODUCTION
1. INTRODUCTION
The Indian Premier League, popularly known as the IPL, is a cricket tournament hosted by the
Board of Control for Cricket in India (BCCI). Players from different countries participate in
the IPL, which makes it an exciting event for cricket lovers. The first season was held in
2008, and the tournament has been played every year since, celebrated as a month-long cricket
festival by Indians and cricket lovers throughout the world. The IPL also gives young players
an opportunity to showcase their talent and gain experience by playing alongside some of the
best and most experienced players in cricket.
In this project I go through two datasets of IPL matches, observe the data, analyse and
process it, and answer a few common questions about the dataset. Go through the notebook
carefully and enjoy the different observations made along the way.
The datasets were taken from Kaggle Datasets. Refer to the link
[IPL 2008-2019 Kaggle Dataset] (https://www.kaggle.com/nowke9/ipldata) for more information
about the dataset and to download it from Kaggle.
With this dataset I try to visualize different trends in the scores of IPL teams and players
from 2008 to 2019. As the current season, IPL 2020, is ongoing, it would be fun and helpful to
see the stats of teams and players visually for the preceding 12 seasons.
The datasets used for this project are `matches.csv` and `deliveries.csv`. There are 756 rows
in the `matches.csv` file, each containing data about a specific match. The `deliveries.csv`
dataset is much larger, with over `1.79 lakh` (179,000) rows, each representing a single
delivery from a match played during those 12 seasons.
I use Python 3 for this analysis and work in Jupyter Notebook (Kaggle and Google Colab are
also good options for running this notebook). The libraries/packages used in this project are
as follows.
* __numpy__ (imported as np, a widely used package for working with arrays in Python)
* __pandas__ (widely used for analysing data and building dataframes)
* __matplotlib__ (a visualization library that makes the analysis more engaging)
* __seaborn__ (adds more colours and styles on top of matplotlib visualizations)
1.1 OBJECTIVE OF THE STUDY
The motivation behind this EDA is to answer the following questions:
● To find the team that won the most number of matches in a season.
● To find the team that lost the most number of matches in a season.
● To check whether winning the toss increases the chances of victory.
● To find the player with the most player of the match awards.
● To find the city that hosted the maximum number of IPL matches.
● To find the most winning team for each season.
● To find the on-field umpire with the maximum number of IPL matches.
● To find the biggest victories in IPL while defending a total and while chasing a total.
1.2 SCOPE OF THE STUDY
The main purpose of EDA is to help understand the data by examining it before making
predictions from it. It helps the data scientist find errors, handle noisy data, detect
anomalies and outliers, and identify new patterns in the data. EDA is used to ensure that the
results are valid and applicable to the business goals at hand. It supports conclusions about
standard deviations, categorical variables, and confidence intervals. Once the process is
complete, its findings can be used for more sophisticated data analysis or modeling,
including machine learning.
1.3 DATA AND METHODOLOGY
Exploratory Data Analysis (EDA) is a method used to analyze and summarize data sets. Data
scientists use it to investigate the patterns in a data set and summarize its characteristics.
This helps them discover patterns in the data, find anomalies, and test hypotheses.
The dataset consists of data about IPL matches played from the year 2008 to 2019. IPL is a
professional Twenty20 cricket league founded by the Board of Control for Cricket in India
(BCCI) in 2008. The league has 8 teams representing 8 different Indian cities or states. It
enjoys tremendous popularity and the brand value of the IPL in 2019 was estimated to be ₹475
billion (US$6.7 billion).
CHAPTER 2
EXPLORATORY DATA ANALYSIS
2. EXPLORATORY DATA ANALYSIS
Data are raw facts and figures that carry no meaning on their own and need to be processed to
obtain the desired information. Information is the result obtained after processing the raw
data at different levels, or the conclusions extracted from a given dataset, through a process
called data analysis.
Data analysis means cleaning the data, transforming it into an understandable form, and then
modeling it to extract useful information for business or organizational use. It is mainly
used in making business decisions. Many libraries are available for doing the analysis, for
example NumPy, Pandas, Seaborn, Matplotlib and Sklearn.
• NumPy:
NumPy is a Python library used for numerical analysis. It stores data in the form of
nd-arrays (n-dimensional arrays).
• Pandas:
Pandas is mainly used for converting data into tabular form, which makes the data more
structured and easier to read.
• Matplotlib:
Matplotlib is a data visualization and graphical plotting package for Python and its numerical
extension NumPy that runs on all platforms.
• Seaborn:
Seaborn is a Python data visualization package based on matplotlib that is tightly connected
with pandas data structures. The core component of Seaborn is visualization, which aids in data
exploration and comprehension.
• Sklearn:
Scikit-learn is a widely used library for machine learning in Python. It includes numerous
useful tools for classification, regression, clustering, and dimensionality reduction.
Data visualization helps data analysis by making the data more understandable and interactive
through plotting or displaying it in pictorial form. Pandas, an open-source Python package
built around two main data structures, series and data frames (older releases also included
panels), meets the need for analyzing and visualizing data.
Data analysis using Python is made easier by several features of the language. Python is a
high-level programming language (the code is in human-readable form), so it is easy for any
programmer or user to understand and use. Many libraries and functions for statistical and
numerical analysis are available in Python. Moreover, the source code is freely available to
anyone (free and open source). This report covers the basic terms and functions that a
beginner needs in order to understand what data analysis is. The rest of this chapter
discusses the main phases in data analysis. Chapter 3 studies data analysis using Python,
covering the basics Python requires for data analysis, with data visualization aiding the
analysis by presenting results in pictorial form. The conclusion is given at the end of the
report.
2.1 MAIN PHASES IN DATA ANALYSIS
A. Data requirements:
Data is the most important unit in any study. Data must be provided as inputs to the analysis
based on the analysis’ requirements. The term “experimental unit” refers to the type of entity
about which data will be gathered (e.g., a person or population of people). Specific variables
of a population (such as height, weight, age, and salary) can be identified and obtained. The
data may be numerical or categorical.
B. Data Collecting:
Data is gathered from a variety of sources, including relational databases, cloud databases,
and other sources, depending on the study’s needs. Field sensors, such as traffic cameras,
satellites, monitoring systems, and so on, can also be used as data sources.
C. Data processing:
Data that is collected must be processed or organized for analysis. For instance, this may
involve arranging data into rows and columns in a table format (known as structured data) for
further analysis, often through the use of spreadsheet or statistical software.
D. Data cleaning:
Once the data has been processed and organized, it is cleaned. Data cleaning scans for
inconsistencies, duplicates, and errors, and then removes them. It includes tasks such as
record matching, identifying inaccurate data, sorting data, identifying outliers,
spell-checking textual data, and maintaining data quality. As a consequence, it guards against
unexpected outcomes and helps deliver high-quality data, which is essential for a successful
outcome.
E. Exploratory data analysis:
Once the datasets are cleaned and free of errors, they can be analyzed. A variety of
techniques can be applied, such as exploratory data analysis (understanding the messages
contained within the obtained data) and descriptive statistics (finding the average, median,
etc.). Data visualization is also used: the data is represented in a graphical format in order
to obtain additional insights into the information within the data.
F. Modeling and algorithms:
Mathematical formulas or models (known as algorithms), may be applied to the data in order
to identify relationships among the variables; for example, using correlation or causation.
G. Data product:
A data product is a computer application that takes data inputs and generates outputs, feeding
them back into the environment. It may be based on a model or algorithm.
CHAPTER 3
DATA ANALYSIS USING PYTHON
3. DATA ANALYSIS USING PYTHON
This section studies data analysis using Python. It explains why Python is used for data
analysis and shows how anyone can start using it. The important libraries, the platform, and
the dataset used to carry out the analysis are introduced. Various Python functions for
numerical analysis are used, and various methods of plotting graphs and charts are discussed.
A. Why use Python?
Python is a high-level, interpreted, multi-purpose programming language. It supports several
programming paradigms, such as procedural and object-oriented programming. It can be used for
many applications, including statistical computing with various packages and functions.
Moreover, it is easy to learn and can be picked up by anyone, including those with little
programming experience.
Some features of Python are listed below:
• Open source and free
• Interpreted language
• Dynamic typing
• Portable
• Numerous IDEs
B. Packages used:
• Numpy
• Pandas
• Seaborn
• Matplotlib
C. Platform used:
• Anaconda (Jupyter Notebook)
D. Dataset used:
• IPL (Indian Premier League) 2008-2019
3.1 Data Preparation and Cleaning
In this section, I explored the data from the surface level and did the required cleaning and
preparation for the analysis
3.11 Matches DF
# importing pandas, which is used to load and handle the data.
import pandas as pd
# reading the matches dataset.
matches = pd.read_csv('ipldata/matches.csv')
# displaying the data of the dataset.
matches
# getting general info about the dataset.
matches.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
1. As seen above, there are 756 non-null values in the venue column but only 749 in the city
column, implying that 7 values are missing from the city column.
2. There are 754 values in the umpire1 and umpire2 columns instead of 756.
3. There are 752 winner values; this may be due to matches being tied or matches
having no result (because of rain).
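The counts read off `info()` above can also be obtained directly. The following is a minimal sketch on a small made-up frame (the rows are illustrative and not taken from `matches.csv`); it uses `isnull().sum()` to count the missing entries in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the matches table; values are illustrative only.
toy = pd.DataFrame({
    "venue": ["Stadium A", "Stadium B", "Stadium C", "Stadium B"],
    "city": ["City A", np.nan, "City C", np.nan],
    "winner": ["Team X", "Team Y", np.nan, "Team X"],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Names of the columns that have at least one missing entry.
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```

Applied to the real dataframe, the same call would be `matches.isnull().sum()`.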
## Finding the venues where the city value is missing.
matches.venue[matches.city.isnull()]
461 Dubai International Cricket Stadium

Name: venue, dtype: object
# Checking if ALL the matches played at Dubai International Cricket Stadium have a
# missing city value.
matches[matches.venue == 'Dubai International Cricket Stadium']
# Filling up the city column with "Dubai" for the matches whose venue is
# Dubai International Cricket Stadium.
matches.loc[matches['venue'] == 'Dubai International Cricket Stadium', 'city'] = 'Dubai'
# Finding out the two rows where the values of umpire1 column are missing.
matches[matches.umpire1.isnull()]
# Filling those values (of umpire2 too) using a quick google search.
matches.loc[4, 'umpire1'] = 'C Shamshuddin'

matches.loc[4, 'umpire2'] = 'CK Nandan'
matches.loc[753, 'umpire1'] = 'Bruce Oxenford'

matches.loc[753, 'umpire2'] = 'Sundaram Ravi'
# Checking data type of Date column.
matches['date']
0 2017-04-05
1 2017-04-06
2 2017-04-07
3 2017-04-08
4 2017-04-08
...
751 05/05/19
752 07/05/19
753 08/05/19
754 10/05/19
755 12/05/19
Name: date, Length: 756, dtype: object
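# Converting the date column into pandas datetime, which standardizes the column.
# The column mixes ISO dates (2017-04-05) with day-first dates (05/05/19), so each
# format is parsed explicitly; left to infer, pandas reads 07/05/19 as 5 July.
iso_dates = pd.to_datetime(matches['date'], format='%Y-%m-%d', errors='coerce')
dmy_dates = pd.to_datetime(matches['date'], format='%d/%m/%y', errors='coerce')
matches['date'] = iso_dates.fillna(dmy_dates)
matches['date']
0 2017-04-05
1 2017-04-06
2 2017-04-07
3 2017-04-08
4 2017-04-08
...
751 2019-05-05
752 2019-05-07
753 2019-05-08
754 2019-05-10
755 2019-05-12
Name: date, Length: 756, dtype: datetime64[ns]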
# Finding all the names of the teams that compete in the IPL.
matches.team1.unique()
array(['Sunrisers Hyderabad', 'Mumbai Indians', 'Gujarat Lions',

'Rising Pune Supergiant', 'Royal Challengers Bangalore',
'Kolkata Knight Riders', 'Delhi Daredevils', 'Kings XI Punjab',
'Chennai Super Kings', 'Rajasthan Royals', 'Deccan Chargers',
'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiants',
'Delhi Capitals'], dtype=object)
1. As seen above, there are multiple names for the same team, i.e., `Rising Pune
Supergiants` and `Rising Pune Supergiant`. The two differ only by the trailing `s`; we
shall fix that.
2. `Delhi Capitals` and `Delhi Daredevils` are names of the same team. The team
representing Delhi, formerly Delhi Daredevils, changed its name to Delhi Capitals in 2018.
So, for simplicity, we will change the values where Delhi Daredevils is used to Delhi
Capitals.
These will be the columns that would require fixing:
1. team1
2. team2
3. winner
# Using the replace method of pandas to fix the team name errors.
team_name_fixes = {'Rising Pune Supergiants': 'Rising Pune Supergiant',
                   'Delhi Daredevils': 'Delhi Capitals'}
matches['team1'] = matches['team1'].replace(team_name_fixes)
matches['team2'] = matches['team2'].replace(team_name_fixes)
matches['winner'] = matches['winner'].replace(team_name_fixes)
# Checking all the names of the teams that compete in the IPL again for
# confirmation of the fix.
matches.team1.unique()
array(['Sunrisers Hyderabad', 'Mumbai Indians', 'Gujarat Lions',
'Rising Pune Supergiant', 'Royal Challengers Bangalore',
'Kolkata Knight Riders', 'Delhi Capitals', 'Kings XI Punjab',
'Chennai Super Kings', 'Rajasthan Royals', 'Deccan Chargers',
'Kochi Tuskers Kerala', 'Pune Warriors'], dtype=object)
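Since the same two name fixes are needed in several columns (and again in the deliveries data), they can be wrapped in a small helper. This is only a sketch: `normalise_team_names` is a hypothetical name introduced here, and the rows are made up for illustration.

```python
import pandas as pd

# Mapping of old team names to the names used in the rest of the analysis.
TEAM_NAME_FIXES = {
    "Rising Pune Supergiants": "Rising Pune Supergiant",
    "Delhi Daredevils": "Delhi Capitals",
}

def normalise_team_names(df, columns, fixes=TEAM_NAME_FIXES):
    """Return a copy of df with the old team names replaced in the given columns."""
    cleaned = df.copy()
    for column in columns:
        cleaned[column] = cleaned[column].replace(fixes)
    return cleaned

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Rising Pune Supergiants"],
    "team2": ["Mumbai Indians", "Delhi Capitals"],
    "winner": ["Delhi Daredevils", "Delhi Capitals"],
})
cleaned = normalise_team_names(toy, ["team1", "team2", "winner"])

# Confirm that none of the old names survive in any of the fixed columns.
leftover = bool(cleaned.isin(list(TEAM_NAME_FIXES)).any().any())
print(cleaned)
print("old names left:", leftover)
```

The same helper could be called with `["batting_team", "bowling_team"]` for the deliveries dataframe.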
# Finding out matches where there was no result.
matches[matches.result == 'no result']
As seen above, there might be multiple names for the same city, just like in the case of
`Bangalore` and `Bengaluru`.
# Finding all the unique values in the city column
matches.city.unique()
array(['Hyderabad', 'Pune', 'Rajkot', 'Indore', 'Bangalore', 'Mumbai',
'Kolkata', 'Delhi', 'Chandigarh', 'Kanpur', 'Jaipur', 'Chennai',
'Cape Town', 'Port Elizabeth', 'Durban', 'Centurion',
'East London', 'Johannesburg', 'Kimberley', 'Bloemfontein',
'Ahmedabad', 'Cuttack', 'Nagpur', 'Dharamsala', 'Kochi',
'Visakhapatnam', 'Raipur', 'Ranchi', 'Abu Dhabi', 'Sharjah',
'Dubai', 'Mohali', 'Bengaluru'], dtype=object)
# Fixing the Bangalore and Bengaluru error.
matches['city'] = matches['city'].replace({'Bangalore': 'Bengaluru'})
3.12 Deliveries DF
# Loading the Deliveries Dataframe which would be used later for some Ball by Ball
# Analysis.
deliveries_df = pd.read_csv('ipldata/deliveries.csv')
# Displaying the data.
deliveries_df
# Finding out all the team names.
deliveries_df.batting_team.unique()
array(['Sunrisers Hyderabad', 'Royal Challengers Bangalore',
'Mumbai Indians', 'Rising Pune Supergiant', 'Gujarat Lions',
'Kolkata Knight Riders', 'Kings XI Punjab', 'Delhi Daredevils',
'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiants',
'Delhi Capitals'], dtype=object)
As seen above, Deliveries dataframe has the same errors as Matches dataframe (Rising Pune Supergiants and Delhi
Daredevils).
The following columns would require the fixing:
1. batting_team
2. bowling_team
# Fixing the errors using the replace function.
team_name_fixes = {'Rising Pune Supergiants': 'Rising Pune Supergiant',
                   'Delhi Daredevils': 'Delhi Capitals'}
deliveries_df['batting_team'] = deliveries_df['batting_team'].replace(team_name_fixes)
deliveries_df['bowling_team'] = deliveries_df['bowling_team'].replace(team_name_fixes)
# Checking all the names of the teams that compete in the IPL again for
# confirmation of the fix.
deliveries_df.batting_team.unique()
array(['Sunrisers Hyderabad', 'Royal Challengers Bangalore',

'Mumbai Indians', 'Rising Pune Supergiant', 'Gujarat Lions',
'Kolkata Knight Riders', 'Kings XI Punjab', 'Delhi Capitals',
'Kochi Tuskers Kerala', 'Pune Warriors'], dtype=object)
3.2 Exploratory Analysis and Visualization
This section contains general analysis of both the matches and the deliveries dataframe in the
form of tables and graphs.
Let's begin by importing matplotlib.pyplot and seaborn .
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (12, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
3.21 Matches Dataframe
Toss Decisions Visualization
# How many times teams decided to bat or bowl.
toss_decision = matches.toss_decision.value_counts()
toss_decision
field 463
bat 293
Name: toss_decision, dtype: int64
# Visualising the toss decisions.
# The counts are looked up by label so the order of value_counts() does not matter.
fielding = toss_decision['field']
batting = toss_decision['bat']
toss_decisions = [fielding, batting]
labels = ["Fielding", 'Batting']
colors = ['Black', "Grey"]
plt.title("Toss Decisions", fontweight='bold', fontsize = 20)
patches, texts, pcts = plt.pie(toss_decisions, colors = colors, labels = labels,
    autopct = "%.2f%%", startangle = 80, counterclock = False,
    wedgeprops = {'linewidth': 3.0, 'edgecolor': 'white'},
    textprops = {'fontsize': 15, 'fontweight' : 'bold'})
plt.axis('equal');
plt.setp(pcts, color='lightcoral', fontweight='bold', fontsize = 15);
As seen from the above visualization, teams tend to field first in the Indian Premier League
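The percentages drawn on the pie chart can be checked with a short calculation. The sketch below rebuilds the two counts reported above as a small Series (an assumption made so the example runs without the dataset) and converts them to shares.

```python
import pandas as pd

# Toss decision counts as reported in the output above.
toss_decision = pd.Series({"field": 463, "bat": 293}, name="toss_decision")

# Share of each decision as a percentage of all tosses.
toss_share = (toss_decision / toss_decision.sum() * 100).round(2)
print(toss_share)
```

On the full dataframe, `matches.toss_decision.value_counts(normalize=True)` gives the same shares as fractions.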
Matches Hosted By Each City
# Using the groupby functionality of pandas, we first group the data by city and then
# count the matches that happened in each city by aggregating the id column with count().
matches_per_city_df = matches.groupby('city')[['id']].count()
matches_per_city_df = matches_per_city_df.sort_values('id', ascending = False).reset_index()
# Renaming the columns.
matches_per_city_df.rename(columns = {'id' : 'matches'}, inplace = True)
matches_per_city_df
# Plotting the visualisation for the top 20 cities who hosted the most matches.
plt.bar(matches_per_city_df['city'][:20], matches_per_city_df['matches'][:20],
alpha = 0.8,
color = 'blueviolet', edgecolor = 'black')
plt.title('Total Matches Hosted By Different Cities', pad = 20, fontweight='bold',
fontsize = 20)
plt.ylabel('Number Of Matches', labelpad = 10, fontweight='bold', fontsize = 15)
plt.xlabel('Cities', labelpad = 20, fontweight='bold', fontsize = 15)
plt.xticks(rotation=60)
Why have some cities hosted fewer than twenty games?
1. The 2009 Indian Premier League season, abbreviated as IPL 2 or the 2009 IPL, was the
second season of the Indian Premier League. The tournament was hosted by South Africa and
was played between 18 April and 24 May 2009.
2. That tournament was held in 8 cities: Cape Town, Johannesburg, Durban,
Centurion (Pretoria), East London, Kimberley, Bloemfontein and Port Elizabeth.
3. Mohali has hosted fewer matches for multiple reasons, one being the renovation in 2011
and another being that the home team (Kings XI Punjab) lost many matches there, which
attracts smaller crowds and hence less revenue, so some of the home matches of Kings XI
Punjab were played at Dharamsala & Indore.
4. Some cities, like Ahmedabad, were introduced later as venues and, having no team
representing the state, hosted fewer matches.
5. Similarly, Rajkot has hosted fewer matches because the team representing the state
(Gujarat Lions) played only two seasons of the IPL (2016 & 2017), as it was one of the two
replacement teams.
The Top 10 Biggest Victory Margins, Both Through Runs & Wickets
# Creating a loser column in the dataframe.
loser = []
for i in range(len(matches)):
    if matches['winner'][i] == matches['team1'][i]:
        loser.append(matches['team2'][i])
    elif matches['winner'][i] == matches['team2'][i]:
        loser.append(matches['team1'][i])
    else:
        # No winner recorded for this match, so the loser is left missing too.
        loser.append(matches['winner'][i])
matches['loser'] = loser
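The loser column can also be built without an explicit loop. The following is a sketch of one alternative using `Series.where`, shown on made-up rows; the team names are placeholders and this is not the approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical fixtures; the last match has no recorded winner.
toy = pd.DataFrame({
    "team1": ["A", "B", "C"],
    "team2": ["B", "C", "A"],
    "winner": ["A", "C", np.nan],
})

team1_won = toy["winner"] == toy["team1"]
team2_won = toy["winner"] == toy["team2"]

# If team1 won, the loser is team2; if team2 won, the loser is team1;
# otherwise (no winner) the loser stays missing.
toy["loser"] = toy["team2"].where(team1_won, toy["team1"].where(team2_won))
print(toy)
```

Comparisons against a missing winner evaluate to False, so matches without a result fall through to the missing value, just as in the loop.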
# Finding out the top 10 biggest margin of victories by runs
largest_runs_wins = matches.win_by_runs.sort_values(ascending = False)[0:10]

matches.loc[largest_runs_wins.index][['date', 'winner','loser',
'win_by_runs']].reset_index()
# Finding out the top 10 biggest margin of victories by wickets
largest_wickets_wins = matches.win_by_wickets.sort_values(ascending = False)[0:10]

matches.loc[largest_wickets_wins.index][['date', 'winner','loser',
'win_by_wickets']].reset_index()
Total Matches In Each Season
# Using the groupby functionality of pandas, grouping by season and counting the
# id column.
total_matches_per_year = matches.groupby('season')[['id']].count()
total_matches_per_year
# Visualising the number of matches in a season using a line chart.
plt.title('Number Of Matches In Each Season', pad = 20, fontweight='bold',

fontsize = 20)
plt.xlabel("Seasons", labelpad = 20, fontweight='bold', fontsize = 15)
plt.ylabel("Number Of Matches", labelpad = 20, fontweight='bold', fontsize = 15)
plt.xticks(total_matches_per_year.index)
x = total_matches_per_year.index
y = total_matches_per_year.id
plt.plot(x, y, marker = 'o', markeredgecolor = 'black', linewidth = 3, markersize
= 10, linestyle = "-", color = 'hotpink');
plt.ylim(40, 80);
Why are there more matches in 2011, 2012 and 2013?
1. In 2011, two new teams were introduced in the IPL, Kochi Tuskers Kerala & Pune
Warriors India. The IPL then had 10 teams instead of 8, which explains the increase in
the number of matches.
2. In 2011, Kochi Tuskers Kerala was terminated because they failed to pay the 10% bank
guarantee they had agreed to pay the IPL committee, despite several reminders from the
BCCI. They were removed from the IPL in October 2011 and were gone by December. So, for
the 2012 season, a new format had to be drawn up for 9 teams instead of 10, hence the
increase in matches again (though only a slight increase).
3. In 2013, just like Kochi Tuskers Kerala, Pune Warriors India were terminated for failing to
furnish a bank guarantee worth Rs 170 crore for the next season.
4. Hence, since 2014, a similar number of matches has been played in the IPL each season
amongst 8 teams.
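As a rough check on how the number of teams drives the number of matches, the sketch below assumes a double round-robin league stage, in which every team plays every other team twice. That format is an assumption made here for illustration and is not stated in the dataset; playoffs and any season-specific format changes are left out.

```python
def double_round_robin_matches(teams: int) -> int:
    """League matches when every team meets every other team twice."""
    return teams * (teams - 1)

# Compare an 8-team season with a 9-team season.
for teams in (8, 9):
    print(teams, "teams:", double_round_robin_matches(teams), "league matches")

# Extra league matches created by adding a ninth team.
extra = double_round_robin_matches(9) - double_round_robin_matches(8)
print("extra matches:", extra)
```

Under this assumption a ninth team adds 16 league matches, which is in line with the higher counts seen in the 9-team seasons.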
Matches Played & Won By Each Team
# Concatenating (joining on axis = 0) the two team columns (team1 and team2) and
# then using value_counts() to determine the number of matches played by each team.
total_matches = pd.concat([matches['team1'], matches['team2']])
total_matches_per_team = total_matches.value_counts().reset_index()
total_matches_per_team.columns = ['Teams', 'Total Matches']
total_matches_per_team
# Using the groupby functionality of pandas, creating a df of wins by each team by
# grouping on the winner column and counting the id column.
wins = matches.groupby('winner')[['id']].count()
wins_per_each_team = wins.sort_values(by = 'id', ascending = False).reset_index()
wins_per_each_team.columns = ["Teams", "Wins"]
wins_per_each_team
# Merging the total matches and total wins datasets and calculating the winning
# percentage.
matches_and_wins = total_matches_per_team.merge(wins_per_each_team, on = 'Teams')
matches_and_wins['Winning Percentage'] = round(
    (matches_and_wins['Wins'] / matches_and_wins['Total Matches']) * 100, 2)
matches_and_wins
# Visualising the above dataframe.
plt.bar(matches_and_wins['Teams'], matches_and_wins['Total Matches'], alpha = 0.8,
    color = 'white', edgecolor = 'black', label = 'Matches Played')
plt.bar(matches_and_wins['Teams'], matches_and_wins['Wins'], alpha = 0.8,
    color = 'cyan', edgecolor = 'black', label = 'Matches Won')
plt.plot(matches_and_wins.Teams, matches_and_wins['Winning Percentage'], color = 'black',
    marker = 'o', linewidth = 2, markersize = 8, linestyle = '--',
    label = 'Winning Percentage')
# Each artist carries its own label so the legend entries cannot be mismatched.
plt.legend();
plt.title('Number Of Matches Played VS Number Of Matches Won', fontweight = 'bold',
    pad = 15, fontsize = 20)
plt.xlabel('Teams', labelpad = -25, fontweight='bold', fontsize = 15)
plt.ylabel('Matches Played, Won & Winning %age', labelpad = 20, fontweight='bold',
    fontsize = 15);
Mumbai Indians have the best winning percentage which is 58.29% .
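The winning percentage is simply wins divided by matches played, times 100. The sketch below works this through on a handful of made-up matches (team names are placeholders). It aligns the two counts on the team name instead of using an inner merge, so a team with no wins is kept with 0% rather than dropped.

```python
import pandas as pd

# Hypothetical match list; team names are placeholders.
toy = pd.DataFrame({
    "team1": ["A", "B", "A", "C"],
    "team2": ["B", "C", "C", "A"],
    "winner": ["A", "C", "A", "C"],
})

# Matches played: each match counts once for team1 and once for team2.
played = pd.concat([toy["team1"], toy["team2"]]).value_counts()
# Matches won by each team (teams without a win are absent here).
won = toy["winner"].value_counts()

# Building the frame from both Series aligns them on the team name.
summary = pd.DataFrame({"played": played, "won": won}).fillna(0)
summary["win_pct"] = (summary["won"] / summary["played"] * 100).round(2)
print(summary.sort_values("win_pct", ascending=False))
```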
Matches Won By The Home Team & The Away Team
Home team is a team which plays the match in its own city and away team is the team it plays
against
# Creating an empty dictionary, the idea is to map each team with its city.
dictionary = {}
# Putting Teams as keys in the dictionary and None as values.
for i in matches_and_wins.Teams:
    if i in dictionary:
        pass
    else:
        dictionary[i] = None
dictionary
{'Mumbai Indians': None,
'Royal Challengers Bangalore': None,
'Kolkata Knight Riders': None,
'Delhi Capitals': None,
'Kings XI Punjab': None,
'Chennai Super Kings': None,
'Rajasthan Royals': None,
'Sunrisers Hyderabad': None,
'Deccan Chargers': None,
'Pune Warriors': None,
'Gujarat Lions': None,
'Rising Pune Supergiant': None,
'Kochi Tuskers Kerala': None}
# Creating a list of (home) cities in the order of the keys of dictionary.
cities = ['Mumbai', 'Bengaluru', 'Kolkata', 'Delhi', 'Mohali', 'Chennai',
          'Jaipur', 'Hyderabad', 'Hyderabad', 'Pune', 'Rajkot', 'Pune', 'Kochi']
# Making cities as the values of the keys in the dictionary.
j = 0
for i in dictionary:
    dictionary[i] = cities[j]
    j += 1
# Dictionary with teams as keys and their home cities as values.
dictionary
{'Mumbai Indians': 'Mumbai',
'Royal Challengers Bangalore': 'Bengaluru',
'Kolkata Knight Riders': 'Kolkata',
'Delhi Capitals': 'Delhi',
'Kings XI Punjab': 'Mohali',
'Chennai Super Kings': 'Chennai',
'Rajasthan Royals': 'Jaipur',
'Sunrisers Hyderabad': 'Hyderabad',
'Deccan Chargers': 'Hyderabad',
'Pune Warriors': 'Pune',
'Gujarat Lions': 'Rajkot',
'Rising Pune Supergiant': 'Pune',
'Kochi Tuskers Kerala': 'Kochi'}
# Creating a df containing only city, team1 and team2 columns.
city_teams_df = matches[['city', 'team1', 'team2']]
# Creating a list of home teams, the idea is to create a list consisting of the home
# team in each match and then using it as a column.
home_team = []
# Seeing the city in which the match took place, and appending the home team list
# using keys and values of the dictionary.
# If both teams' cities do not match, then it is a neutral venue.
for i in range(len(city_teams_df)):
    if dictionary[city_teams_df.team1[i]] == city_teams_df.city[i]:
        home_team.append(city_teams_df.team1[i])
    elif dictionary[city_teams_df.team2[i]] == city_teams_df.city[i]:
        home_team.append(city_teams_df.team2[i])
    else:
        home_team.append("Neutral")
# Making the home team list as a column of the dataset.
matches['home_team'] = home_team
# Creating a list of away teams; the idea is to build a list holding the away
# team of each match and then use it as a column.
away_team = []
# Comparing the home team of each match with team1 and team2 and appending the
# other team to the away team list.
# If the venue is neutral, "Neutral" is stored for the away team as well.
for i in range(len(matches)):
    if matches.home_team[i] == matches.team1[i]:
        away_team.append(matches.team2[i])
    elif matches.home_team[i] == matches.team2[i]:
        away_team.append(matches.team1[i])
    else:
        away_team.append(matches.home_team[i])
# Checking total neutral matches
len(matches[matches.home_team == 'Neutral'])
237
# Making the away team list as a column of the dataset.
matches['away_team'] = away_team
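The home and away rule described above can be isolated in a small function, which makes it easy to check on individual matches. This is a hedged sketch rather than the notebook's own code: the function name home_and_away and the two-team home_city mapping are assumptions made for illustration.

```python
def home_and_away(team1, team2, city, home_city):
    """Return (home, away) for one match, or ('Neutral', 'Neutral')."""
    # team1 is checked first, mirroring the order used in the loops above.
    if home_city.get(team1) == city:
        return team1, team2
    if home_city.get(team2) == city:
        return team2, team1
    # Neither team is based in the host city.
    return 'Neutral', 'Neutral'


# Illustrative mapping with only two teams.
home_city = {'Mumbai Indians': 'Mumbai', 'Chennai Super Kings': 'Chennai'}

print(home_and_away('Chennai Super Kings', 'Mumbai Indians', 'Mumbai', home_city))
print(home_and_away('Chennai Super Kings', 'Mumbai Indians', 'Kolkata', home_city))
```

Using dict.get instead of indexing means a team missing from the mapping is treated as having no home city instead of raising a KeyError; whether that is desirable depends on how complete the mapping is.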
# Finding out the matches which had no result.
matches[matches.result == "no result"]
# Removing the matches which had "no result".
matches_with_result = matches[matches.result != 'no result'].reset_index(drop = True)
# Calculating home team wins and away team wins by comparing the winner column
# with the home team column and the away team column.
home_team_wins = 0
away_team_wins = 0
neutral_matches = 0
for i in range(len(matches_with_result)):
    if matches_with_result.winner[i] == matches_with_result.home_team[i]:
        home_team_wins += 1
    elif matches_with_result.winner[i] == matches_with_result.away_team[i]:
        away_team_wins += 1
    elif matches_with_result.home_team[i] == "Neutral":
        neutral_matches += 1
# Number of home team wins.
home_team_wins
293
# Number of away team wins.
away_team_wins
222
# Number of neutral matches.
neutral_matches
237
# Creating a list of team type and results to visualise.
team_type = ["Home Team", "Away Team"]

team_type_results = [home_team_wins, away_team_wins]
# Visualising the home team and away team wins.
color = ['deepskyblue', 'red']

plt.bar(team_type[0], team_type_results[0], color = 'deepskyblue', width = 0.5,
edgecolor = 'black', linewidth = 2)
plt.bar(team_type[1], team_type_results[1], color = 'red', width = 0.5, edgecolor
= 'black', linewidth = 2)
plt.title('Home Team Wins VS Away Team Wins', fontweight = 800, fontsize = 20, pad
= 20 )
plt.xlabel('Team Type ', fontweight = 800, labelpad = 5.5, fontsize = 15)
plt.ylabel('Total Wins', fontweight = 800, labelpad = 15.5, fontsize = 15)
plt.axis([-1, 2, 0 , 550])
plt.legend(['Home Team', 'Away Team']);
As the visualization shows, home teams won 293 of the 515 decided matches played at a non-neutral venue (about 57%),
so playing in front of the home crowd gives only a slight advantage and the match can go either way.
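The size of the home advantage can be expressed as a share of the decided, non-neutral matches. A minimal sketch using the two counts reported above (the variable names decided and home_share are illustrative):

```python
# Counts taken from the outputs above.
home_team_wins = 293
away_team_wins = 222

# Only matches with a home team and a result are compared.
decided = home_team_wins + away_team_wins
home_share = home_team_wins / decided

print(decided)              # prints 515
print(f"{home_share:.1%}")  # prints 56.9%
```

A share of roughly 57% against 43% supports the reading of a modest, not decisive, home advantage.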
3.22 Deliveries DF Analysis

Different Kinds Of Batsmen Dismissal
# Using the groupby functionality of pandas to group the dataset by dismissal
# kind and aggregating using count.
dismissal_kind_df = deliveries_df.groupby('dismissal_kind')[['match_id']].count()
# Renaming the columns.
dismissal_kind_df = dismissal_kind_df.rename(columns = {'dismissal_kind' : 'dismissal_type', 'match_id' : 'count'})
# Sorting the values by the number of times batsmen have been dismissed in a
# certain way (descending).
dismissal_kind_df = dismissal_kind_df.sort_values(by = 'count', ascending = False).reset_index()
dismissal_kind_df
# Since hit wicket, retired hurt and obstructing the field occur very rarely,
# I decided to drop them so that the visualisation is easier to interpret.
dismissal_kind_df_sig = dismissal_kind_df.drop([6,7,8])
# Visualising the dataframe.
sizes = dismissal_kind_df_sig['count']
labels = ['Caught', 'Bowled', 'Run Out', 'LBW', 'Stumped', 'Caught & Bowled']
plt.title('Different Ways Of Batsman Dismissal', pad = 80, fontweight='bold',
fontsize = 20)
plt.axis('equal')
patches, texts, pcts = plt.pie(sizes, labels = labels, autopct = "%.1f %%",
startangle = 15, counterclock = False, radius = 1.3,
wedgeprops = {'linewidth': 3.0, 'edgecolor': 'black'},
textprops={'fontsize': 13, 'fontweight' : 'bold'});
plt.setp(pcts, color='white', fontweight='bold', fontsize = 10, rotation = 20);
plt.legend(['Caught', 'Bowled', 'Run Out', 'LBW', 'Stumped', 'Caught & Bowled'],
loc = 'upper left');
As the visualization shows, batsmen are most often caught out, followed by being bowled, run out
and dismissed leg before wicket (LBW).
Most Matches Played By Batsmen & Bowlers
# Here we use the groupby functionality of pandas to group the dataset by
# batsman, then aggregate the match_id column with a lambda that builds a set
# (of match ids) and takes its length, and finally sort by that length (descending).
most_match_bats = deliveries_df.groupby(['batsman']).agg({'match_id': lambda x: len(set(x))}).sort_values(ascending = False, by = 'match_id')
most_match_bats[:15]
# Here we use the groupby functionality of pandas to group the dataset by
# bowler, then aggregate the match_id column with a lambda that builds a set
# (of match ids) and takes its length, and finally sort by that length (descending).
most_match_bowlers = deliveries_df.groupby(['bowler']).agg({'match_id': lambda x: len(set(x))}).sort_values(ascending = False, by = 'match_id')
most_match_bowlers[:15]
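Counting the distinct matches per player, as done above with a lambda and a set, can also be written with the built-in nunique aggregation of pandas. The sketch below runs on a tiny made-up frame (players 'A' and 'B' are placeholders), not on the IPL deliveries data.

```python
import pandas as pd

# Made-up ball-by-ball rows: one row per delivery faced.
deliveries = pd.DataFrame({
    'match_id': [1, 1, 1, 2, 2, 3],
    'batsman':  ['A', 'A', 'B', 'A', 'B', 'B'],
})

# nunique counts the distinct match ids within each group.
matches_played = (deliveries.groupby('batsman')['match_id']
                  .nunique()
                  .sort_values(ascending=False))
print(matches_played)
```

Here 'B' appears in matches 1, 2 and 3 and 'A' in matches 1 and 2, so the result ranks 'B' (3) ahead of 'A' (2).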
Most 6s & 4s Hit By Batsmen
# Using the value_counts functionality of pandas to count the batsmen who hit the most sixes.
big_hitters = deliveries_df.batsman[deliveries_df.batsman_runs ==
6].value_counts()[:15]
big_hitters
CH Gayle 327
AB de Villiers 214
MS Dhoni 207
SK Raina 195
RG Sharma 194
V Kohli 191
DA Warner 181
SR Watson 177
KA Pollard 175
YK Pathan 161
RV Uthappa 156
Yuvraj Singh 149
BB McCullum 129
AT Rayudu 120
AD Russell 119
Name: batsman, dtype: int64
# visualising the result.
colors = ['darkred', 'firebrick', 'indianred', 'lightcoral', 'rosybrown',

'darkcyan', 'cyan', 'paleturquoise', 'lightcyan', 'azure', 'darkolivegreen',
'limegreen', 'lawngreen', 'lightgreen', 'mediumspringgreen']
plt.bar(big_hitters.index,big_hitters.values, color = colors, edgecolor = 'black')
plt.xticks(rotation = 60)
plt.title('Batsmen Who Hit The Most 6s', pad = 20, fontweight='bold', fontsize =
20)
plt.xlabel('Batsmen', labelpad = 10, fontweight='bold', fontsize = 15)
plt.ylabel('Number Of Sixes Hit', labelpad = 20, fontweight='bold', fontsize =

15);
# Using the value_counts functionality of pandas to count the batsmen who hit the most fours.
gap_finders = deliveries_df.batsman[deliveries_df.batsman_runs ==
4].value_counts()[:15]
gap_finders
S Dhawan 526
SK Raina 495
G Gambhir 492
V Kohli 482
DA Warner 459
RV Uthappa 436
RG Sharma 431
AM Rahane 405
CH Gayle 376
PA Patel 366
KD Karthik 358
AB de Villiers 357
SR Watson 344
V Sehwag 334
MS Dhoni 297
Name: batsman, dtype: int64
# Visualising the batsmen who hit the most fours.
plt.bar(gap_finders.index,gap_finders.values, color = colors, edgecolor = 'black')

plt.xticks(rotation = 60)
plt.title('Batsmen Who Hit The Most 4s', pad = 20, fontweight='bold', fontsize =
20)
plt.xlabel('Batsmen', labelpad = 10, fontweight='bold', fontsize = 15)
plt.ylabel('Number Of Fours Hit', labelpad = 20, fontweight='bold', fontsize =
15);
Raina, Warner, Kohli, R Sharma and Gayle are in the top 10 of both lists (Uthappa narrowly misses out, ranking
11th for sixes), which makes them the most prolific boundary hitters and the most dangerous batsmen.
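The overlap between the two rankings can be checked mechanically with a set intersection. This sketch copies the top 10 names from the two outputs above into plain lists; the variable names are illustrative.

```python
# Top 10 names copied from the sixes and fours outputs above.
top10_sixes = ['CH Gayle', 'AB de Villiers', 'MS Dhoni', 'SK Raina', 'RG Sharma',
               'V Kohli', 'DA Warner', 'SR Watson', 'KA Pollard', 'YK Pathan']
top10_fours = ['S Dhawan', 'SK Raina', 'G Gambhir', 'V Kohli', 'DA Warner',
               'RV Uthappa', 'RG Sharma', 'AM Rahane', 'CH Gayle', 'PA Patel']

# Players present in both top 10 lists, in alphabetical order.
in_both = sorted(set(top10_sixes) & set(top10_fours))
print(in_both)
# prints ['CH Gayle', 'DA Warner', 'RG Sharma', 'SK Raina', 'V Kohli']
```

With the pandas objects from the notebook, the same check could be made on big_hitters.index[:10] and gap_finders.index[:10].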
Most Wicket-Taking Bowlers
# Notice how there are many NaNs in the player_dismissed column.
deliveries_df
# Replacing the NaNs with 0 (a reconstructed step: the code below tests
# player_dismissed against 0).
deliveries_df['player_dismissed'] = deliveries_df['player_dismissed'].fillna(0)
# Looking at the column to see if the changes were made.
deliveries_df
# Creating an empty list is_wicket; the idea is to make a column in the dataframe
# that shows whether a wicket was taken on a particular delivery or not (run out,
# retired hurt and obstructing the field are not included because they are not
# awarded to the bowler).
# Note: the three exclusions must be combined with "not in" (or "and"); joined
# with "or" the test is always true and every dismissal is counted, which the
# totals printed below appear to reflect.
is_wicket = []
not_awarded = ['run out', 'retired hurt', 'obstructing the field']
for i in range(len(deliveries_df)):
    if deliveries_df.player_dismissed[i] == 0:
        is_wicket.append(0)
    elif deliveries_df.dismissal_kind[i] not in not_awarded:
        is_wicket.append(1)
    else:
        is_wicket.append(0)
# Adding the is_wicket list as a column to the dataset.
deliveries_df['is_wicket'] = is_wicket
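The same wicket flag can be computed without an explicit loop. The following is a sketch on a small made-up frame (bowlers 'X' and 'Y' are placeholders), assuming that a missing dismissal_kind means no wicket fell on that delivery; it is not the notebook's original code.

```python
import pandas as pd

# Made-up deliveries: None means no dismissal on that ball.
deliveries = pd.DataFrame({
    'bowler': ['X', 'X', 'Y', 'Y', 'Y'],
    'dismissal_kind': ['caught', None, 'run out', 'bowled', 'lbw'],
})

# Dismissals that are not credited to the bowler.
not_awarded = ['run out', 'retired hurt', 'obstructing the field']

kind = deliveries['dismissal_kind']
# A bowler's wicket needs a dismissal that is not in the excluded list.
deliveries['is_wicket'] = (kind.notna() & ~kind.isin(not_awarded)).astype(int)

wickets = deliveries.groupby('bowler')['is_wicket'].sum()
print(wickets)
```

In this toy data the run out is not counted, so 'X' is credited with 1 wicket and 'Y' with 2.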
# Using the groupby functionality, grouping the dataset by bowler, summing the
# is_wicket column for each bowler and sorting by the sum (descending).
deliveries_df.groupby('bowler')['is_wicket'].sum().sort_values(ascending = False)[:20]
bowler
SL Malinga 188
DJ Bravo 168
A Mishra 165
Harbhajan Singh 161
PP Chawla 156
B Kumar 141
R Ashwin 138
SP Narine 137
UT Yadav 136
R Vinay Kumar 127
A Nehra 121
Z Khan 119
RA Jadeja 116
SR Watson 107
DW Steyn 104
YS Chahal 102
P Kumar 102
RP Singh 100
PP Ojha 99
MM Sharma 99
Name: is_wicket, dtype: int64
Bowlers Who Bowled The Most Deliveries
# Using the value_counts functionality of pandas to find the 20 bowlers who bowled the most deliveries.
deliveries_df['bowler'].value_counts()[:20]
Harbhajan Singh 3451

A Mishra 3172
PP Chawla 3157
R Ashwin 3016
SL Malinga 2974
DJ Bravo 2711
B Kumar 2707
P Kumar 2637
UT Yadav 2605
SP Narine 2600
RA Jadeja 2541
Z Khan 2276
DW Steyn 2207
R Vinay Kumar 2186
SR Watson 2137
IK Pathan 2113
I Sharma 1999
A Nehra 1974
PP Ojha 1945
RP Singh 1874
Name: bowler, dtype: int64
CHAPTER 4
ASKING AND ANSWERING QUESTIONS
Asking and Answering Questions
In this section, I will try to answer some interesting questions about both the datasets.
Who Won The Most Man Of The Match Awards?

# Using the value_counts functionality, I found the players (top 15) who won the most man of the match awards.
motm = matches[['player_of_match']]
motm = motm.rename(columns = {'player_of_match' : 'Player'})
top_players = motm.Player.value_counts()[:15]
top_players
CH Gayle 21
AB de Villiers 20
RG Sharma 17
MS Dhoni 17
DA Warner 17
YK Pathan 16
SR Watson 15
SK Raina 14
G Gambhir 13
MEK Hussey 12
AM Rahane 12
V Kohli 12
V Sehwag 11
DR Smith 11
AD Russell 11
Name: Player, dtype: int64
# Visualising the result.
labels = top_players.index
sizes = top_players.values
plt.title('Most MOTM Award Winners', y = -0.3, fontweight = 'bold')
plt.axis('equal')
patches, texts, pcts = plt.pie(sizes, labels = labels, autopct = "%.1f %%",
startangle = 90, counterclock = False, radius = 1.5,
wedgeprops = {'linewidth': 3.0, 'edgecolor': 'black'},
textprops={'fontsize': 11});
plt.setp(pcts, color='white', fontweight='bold');
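1. Gayle, Raina, Kohli and Warner feature in all three lists so far (most sixes, most fours and most Man of the
Match awards), and Gambhir in two of them (fours and awards), which makes them very difficult players to face.
2. Interestingly, there are no specialist bowlers in this list; the regular bowlers in it, such as SR Watson, are all-rounders.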
How Many Matches Were There Where The Toss Winner Also Won The
Match?
# taking out only those records where the match winner was also the toss winner.
match_and_toss_winner_df = matches[(matches.toss_winner == matches.winner)]
# Finding out if there were any ties in the above dataframe.
match_and_toss_winner_df.result.unique()
array(['normal', 'tie'], dtype=object)
# Keeping only the matches with a 'normal' result, i.e. matches that were not
# tied and so did not need a tiebreaker to decide the winner.
match_and_toss_winner_df = match_and_toss_winner_df[match_and_toss_winner_df['result'] == 'normal']
match_and_toss_winner_df
# Grouping the dataframe by winner, counting the match ids of each winner and
# sorting by that count (descending).
match_and_toss_winner_df = match_and_toss_winner_df.groupby('winner')[['id']].count()
match_and_toss_winner_df = match_and_toss_winner_df.sort_values(by = 'id', ascending = False).reset_index()
match_and_toss_winner_df
So, there have been 350 IPL matches in which the toss winner also went on to win the match. Chennai Super
Kings have done so most often (57 matches), followed very closely by Mumbai Indians (55 matches) and
Kolkata Knight Riders (53 matches).
How Did The Top Bowlers Dismiss The Batsmen?
# Finding out different dismissal types.
w_types = deliveries_df.dismissal_kind.unique()
# Removing nan, run out, retired hurt and obstructing the field from the
# dismissal types list because they are not awarded to the bowlers.
w = [1, 2, 4, 5, 6, 8]
w_types = [w_types[x] for x in w]
# 1. Using indexing, first finding only those records that have the above
#    dismissal types.
# 2. Then using the groupby functionality, grouping the records by bowler and
#    dismissal kind and aggregating using count.
# 3. Then unstacking the temp df so that it becomes easier to read.
temp = deliveries_df[deliveries_df['dismissal_kind'].isin(w_types)].groupby(by =
['bowler','dismissal_kind']).dismissal_kind.count().unstack(fill_value = 0)
# Then summing across columns to create a total column (the idea is to sort by
# the total (wickets) column in descending order).
temp['total'] = temp.sum(axis=1)
# Sorting by the total column in descending order, then dropping it and keeping
# only the top 10 bowlers.
temp = temp.sort_values('total', ascending = False).drop('total', axis = 1).head(10)
print(temp)
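Selecting the dismissal types by position, as done with the index list w, depends on the order in which the values first appear in the data. A sketch of selecting them by name instead is given below; the order of all_kinds is hypothetical and only stands in for the result of unique().

```python
# Hypothetical order of first appearance; the real order depends on the data.
all_kinds = [float('nan'), 'caught', 'bowled', 'run out', 'lbw',
             'caught and bowled', 'stumped', 'retired hurt', 'hit wicket',
             'obstructing the field']

# Dismissals that are not credited to the bowler.
not_awarded = {'run out', 'retired hurt', 'obstructing the field'}

# Keep only real dismissal names (this drops nan) that a bowler is credited with.
w_types = sorted(k for k in all_kinds
                 if isinstance(k, str) and k not in not_awarded)
print(w_types)
```

Sorting the names alphabetically gives bowled, caught, caught and bowled, hit wicket, lbw, stumped, which is also the column order produced by the groupby and unstack step.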
# Visualising the above dataframe using a stacked bar chart.
x = temp.index
y1 = temp.bowled
y2 = temp.caught
y3 = temp['caught and bowled']
y4 = temp['hit wicket']
y5 = temp.lbw
y6 = temp.stumped
plt.bar(x, y1, color = 'tab:red', edgecolor = 'black')

plt.bar(x, y2, color = 'tab:green', bottom = y1, edgecolor = 'black')
plt.bar(x, y3, color = 'tab:blue', bottom = y1 + y2, edgecolor = 'black')
plt.bar(x, y4, color = 'tab:olive', bottom = y1 + y2 + y3, edgecolor = 'black')
plt.bar(x, y5, color = 'tab:pink', bottom = y1 + y2 + y3 + y4, edgecolor =
'black')
plt.bar(x, y6, color = 'tab:gray', bottom = y1 + y2 + y3 + y4 + y5, edgecolor =
'black')
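# The legend labels follow the order in which the bars are drawn (y1 to y6).
plt.legend(['bowled', 'caught', 'caught and bowled', 'hit wicket', 'lbw', 'stumped'], loc = 1)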
plt.ylim([0, 200])
plt.title('Top Bowlers & Dismissal Types',fontweight = 'bold', pad = 15, fontsize
= 20)
plt.xlabel('Bowlers', labelpad = -15, fontweight='bold', fontsize = 15)
plt.ylabel('Total Wickets & Dismissal Types', labelpad = 20, fontweight='bold',
fontsize = 15);
As we can see from the above visualization, SL Malinga is the most dangerous bowler, with the highest
number of dismissals among the top 10 bowlers.
Who Were The Batsmen Who Struggled Against A Particular Bowler?
# Creating an empty dataframe with bowler, dismissal kind and batsman as columns.
max_dismissal = pd.DataFrame(columns = ["bowler", "dismissal_kind", "batsman"])
# Finding out all the batsmen that have played in the IPL.
batsmen = deliveries_df.batsman.unique()
# 1. Picking out the data of each batsman from the above list one by one.
# 2. Filtering the data based on the dismissal kind of that batsman.
# 3. Grouping by bowler and aggregating using count.
# 4. Sorting values by dismissal kind count (descending) and then picking out the
#    bowler at the top.
# 5. Adding the current batsman to the batsman column (after creating the column).
# 6. Concatenating using the concat functionality of pandas (row-wise).
for x in batsmen:
    current = deliveries_df[deliveries_df['batsman'] == x]
    current = current[current['dismissal_kind'].isin(['caught', 'lbw', 'bowled',
        'stumped', 'caught and bowled', "hit wicket"])]
    current = current.groupby('bowler').count()
    current = current.sort_values(by = 'dismissal_kind', ascending =
        0).dismissal_kind[:1].reset_index()
    current['batsman'] = x
    max_dismissal = pd.concat([max_dismissal, current], ignore_index=True)
# Sorting values using dismissal kind count (descending), reordering and renaming
# the columns of the max_dismissal df & showing the top 10 bowlers who most often
# took the wicket of a particular batsman.
max_dismissal = max_dismissal.sort_values(by = 'dismissal_kind', ascending = 0, ignore_index = True)
max_dismissal = max_dismissal[["batsman", "bowler", "dismissal_kind"]]
max_dismissal.rename(columns = {'batsman' : 'Batsman', 'bowler' : 'Bowler',
'dismissal_kind' : 'Times_Dismissed'}, inplace = True)
max_dismissal.head(10)
1. As seen from the above table, MS Dhoni struggled against Z Khan the most; the remaining rows show the
other batsmen who struggled against a particular bowler.
2. Z Khan & B Kumar are in the top 10 twice.
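The per-batsman loop used for this table can also be expressed as a single grouped count over (batsman, bowler) pairs. This is only a sketch on made-up rows (players 'A', 'B' and bowlers 'X', 'Y' are placeholders), and the names pairs and worst are illustrative.

```python
import pandas as pd

# Made-up dismissals, one row per wicket.
dismissals = pd.DataFrame({
    'batsman': ['A', 'A', 'A', 'B', 'B'],
    'bowler': ['X', 'X', 'Y', 'Y', 'Y'],
    'dismissal_kind': ['caught', 'bowled', 'lbw', 'caught', 'run out'],
})

# Dismissal kinds that are credited to the bowler.
credited = ['caught', 'lbw', 'bowled', 'stumped', 'caught and bowled', 'hit wicket']

# Count how often each bowler dismissed each batsman, largest count first.
pairs = (dismissals[dismissals['dismissal_kind'].isin(credited)]
         .groupby(['batsman', 'bowler'])
         .size()
         .sort_values(ascending=False)
         .reset_index(name='Times_Dismissed'))

# Keep only the most successful bowler against each batsman.
worst = pairs.drop_duplicates('batsman')
print(worst)
```

Because the pairs are sorted by count before duplicates are dropped, the first row kept for each batsman is the bowler who dismissed him most often; ties between bowlers are resolved arbitrarily in this sketch.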
Who Were The Top 10 Batsmen Based On Runs Scored & Who Were The Top
10 Batsmen That Were Dismissed The Most?
#top batsmen according to runs

runs = deliveries_df.groupby(['batsman']).batsman_runs.sum().sort_values(ascending
= False)
runs[:10]
batsman
V Kohli 5434
SK Raina 5415
RG Sharma 4914
DA Warner 4741
S Dhawan 4632
CH Gayle 4560
MS Dhoni 4477
RV Uthappa 4446
AB de Villiers 4428
G Gambhir 4223
Name: batsman_runs, dtype: int64
1. Interestingly, apart from MS Dhoni, everybody else is a top-order batsman (that is, they bat at positions
1, 2 or 3), which makes this a feat for Dhoni considering he bats in the middle order.
2. Raina, Kohli, Sharma, Warner and Gayle have been in every list so far; Gambhir and Uthappa have been in all but one.
# max number of times batsmen getting out

out = deliveries_df.groupby(['batsman']).is_wicket.sum().sort_values(ascending =
False)
out[:10]
batsman
RG Sharma 162
SK Raina 161
RV Uthappa 157
V Kohli 152
KD Karthik 140
S Dhawan 137
G Gambhir 135
PA Patel 127
MS Dhoni 118
AM Rahane 117
Name: is_wicket, dtype: int64
Raina, Kohli, Sharma, Gambhir and Uthappa are also among the most frequently dismissed batsmen (though after
scoring a lot); Warner, by contrast, does not appear in this top 10.
Who Were The Top Batsmen Based On Batting Average?
The batting average of a player is defined as the number of runs scored divided by the number of times
the player was dismissed.
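# Finding top batsmen based on batting average.
# out[:200] keeps the 200 most-dismissed batsmen, which leaves out players with
# very few dismissals.
avg = (runs/out[:200]).sort_values(ascending = False)
avg[:15]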
batsman
HM Amla 44.384615
AB de Villiers 42.576923
JP Duminy 41.653061
DA Warner 41.587719
CH Gayle 41.454545
KL Rahul 41.081633
LMP Simmons 39.962963
ML Hayden 39.535714
OA Shah 38.923077
KS Williamson 38.794118
SE Marsh 38.292308
MEK Hussey 38.019231
MS Dhoni 37.940678
JC Buttler 37.657895
A Symonds 37.461538
dtype: float64
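1. HM Amla, AB de Villiers & JP Duminy are the top three batsmen based on batting average.
2. Gayle, Warner and Dhoni make this list, too; Kohli, with 5434 runs for 152 dismissals (an average of about 35.8), falls just outside it.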
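The batting average definition can be checked by hand against the figures reported earlier. The sketch below uses the runs and dismissal counts printed above for two batsmen; the dictionary names are illustrative.

```python
# Runs and dismissals copied from the outputs above.
runs = {'V Kohli': 5434, 'MS Dhoni': 4477}
times_out = {'V Kohli': 152, 'MS Dhoni': 118}

# Batting average = runs scored / number of times dismissed.
average = {player: runs[player] / times_out[player] for player in runs}

for player, value in average.items():
    print(f"{player}: {value:.2f}")
# prints V Kohli: 35.75
# prints MS Dhoni: 37.94
```

The value for MS Dhoni agrees with the 37.940678 shown in the table, and Kohli's 35.75 is below the 37.46 of the last batsman in the top 15.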
CONCLUSION
The above analysis gives us an overview of the IPL matches, the statistics of different players and some
enjoyable and informative facts about the IPL from its first season in 2008 up to 2019. The observations
contain a lot of information about individual players and about teams as a whole. With that, we have come
to the end of this analysis. If you are a cricket lover, I am sure you have heard about the IPL before, and
for many of you it is one of the favourite tournaments to enjoy with your family. It is fun as well as
exciting to discuss the results of the games you love and to tell others their stories; after going through
this notebook you will have many more stories to tell about the IPL and more knowledge of the game to brag about.
FUTURE SCOPE
There is a lot of scope for improving and extending this project in future. With the data provided and by
adding extra datasets, we can build better team statistics that show the run rate of each team and the overall
position or value the team has in the IPL. We can predict the costs of players in the next seasons of the IPL
using the data of the players and the knowledge of their costs in previous seasons. Using further data
manipulation, we can add observations on the overs in which the most runs are scored and the overs in which
the most dismissals take place. We can also add the dataset of the 2020 IPL and observe and compare how the
performance of players and teams changes by modifying only a few lines of code (the more the data, the merrier
the visualization).
BIBLIOGRAPHY
Websites
● IPL 2008-2019 Dataset: https://www.kaggle.com/nowke9/ipldata
● Kaggle Datasets (Choose Dataset of your choice): https://www.kaggle.com/datasets
● Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
● Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html
● Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html
● Data analysis guide: https://jovian.ml/aakashns/python-pandas-data-analysis
● Python solutions in GeeksforGeeks (Solutions made easy): https://www.geeksforgeeks.org/python-programming-language/
● opendatasets Python library (Choosing and using datasets in Python made easy): https://github.com/JovianML/opendatasets
Textbooks
● Python for Data Analysis, Wes McKinney, 2nd Edition, O'Reilly Media, Inc.