Eda of Ipl
2020-2022
ST. JOSEPH’S ACADEMY OF HIGHER EDUCATION
AND RESEARCH, MOOLAMATTOM
DEPARTMENT OF ACTUARIAL SCIENCE
CERTIFICATE
This is to certify that the project entitled “EXPLORATORY DATA ANALYSIS OF IPL
MATCHES” is an authentic record of work carried out by Mr. FASAL K A under my
supervision and guidance in the Department of ACTUARIAL SCIENCE at St. Joseph’s
Academy of Higher Education and Research, Moolamattom.
DECLARATION
I hereby declare that the project report entitled “EXPLORATORY DATA ANALYSIS OF
IPL MATCHES” submitted in partial fulfillment of the requirements for the award of Master’s
Degree in Actuarial Science of Mahatma Gandhi University, Kottayam, is my project work. The
contents of the study, in full or parts, have not been submitted to any other institution or
university for the award of any degree or diploma.
FASAL K A
Place: Moolamattom
Date:
ACKNOWLEDGEMENT
During the course of my project work, several people collaborated with me directly and
indirectly. Without their support it would have been impossible for me to finish my work, and
I wish to dedicate this section to recognizing their support.
First of all, my gratitude is to GOD, who has showered his divine providence and graces on me
in the completion of this work. I wish to extend my ardent appreciation to Mr. Vineeth
Viswanath (Project guide, Actuarial Science, St. Joseph’s Academy of Higher Education and
Research, Moolamattom), who made the completion of this undertaking possible.
I am deeply indebted to the other staff members of the department, Ms. Neenumol Tom and Mr.
Dibin Thomas, for the valuable and experienced guidance they provided during the period of my
study.
I would also like to thank my family members and friends for the support they have given
towards the completion of this work.
ABSTRACT
Cricket is a popular sport not only in India but also in many other parts of the world.
The T20 format of the game in particular has become very popular in recent years, and the
Indian Premier League (IPL), a championship played in this format, has grown rapidly.
Cricket is often said to be a game of uncertainty, and predicting the winner of a match or
a tournament is a matter of keen interest for many fans. Technology, meanwhile, is
developing at a rapid pace, and machine learning algorithms are a common choice for
researchers who want to make predictions from a trained model. In this project we will predict
the probable percentage of winning teams in the IPL using different supervised learning methods.
TABLE OF CONTENTS
1. INTRODUCTION
2. DATA ANALYSIS
5. CONCLUSION
6. FUTURE SCOPE
7. BIBLIOGRAPHY
CHAPTER 1
INTRODUCTION
1. INTRODUCTION
The Indian Premier League, popularly known as the IPL, is a cricket tournament hosted by the
Board of Control for Cricket in India (BCCI). Players from different countries participate in
the IPL, which makes it exciting entertainment for cricket lovers. The IPL was established in
2008, when its first season was hosted. Since then the tournament has been played every year
and is celebrated as a month-long cricket festival by Indians and cricket lovers throughout the
world. The IPL also gives young players the opportunity to showcase their talent and gain
experience by playing with some of the best and most experienced cricketers.
In this project I go through two datasets of IPL matches in India, observe the data, analyse
and process it, and answer a few common questions about the data. Go through the notebook
carefully and enjoy the different observations I have made.
The dataset was taken from Kaggle Datasets. Refer to this link,
[IPL 2008-2019 Kaggle Dataset](https://www.kaggle.com/nowke9/ipldata), for more information
about the dataset and to download it from Kaggle.
With this dataset I try to visualize different trends in the IPL scores of teams and players
from 2008 to 2019. As the current season, IPL 2020, is ongoing, it is fun and helpful to see
the stats of teams and players visually for the 12 seasons from 2008 to 2019. I hope you enjoy
the visualizations.
The datasets used for this project are `matches.csv` and `deliveries.csv`. There are
756 rows in the `matches.csv` file, each row containing data about a specific match. The
`deliveries.csv` dataset is a large one with over `1.79 Lakhs` rows, where each row
represents one delivery from a match played between 2008 and 2019.
I use Python 3 for this analysis and carry out the project in Jupyter Notebook (Kaggle and
Google Colab are also good options for running this notebook). The libraries/packages used in
this project are as follows.
* __numpy__ (imported as np, one of the most widely used packages for working with arrays in Python)
* __matplotlib__ (makes the analysis fun and interactive with the matplotlib visualization
library)
● To find the team that won the most number of matches in a season.
● To find the team that lost the most number of matches in a season.
● To find whether winning the toss increases the chances of victory.
● To find the player with the most player of the match awards.
● To find the city that hosted the maximum number of IPL matches.
● To find the most winning team for each season.
● To find the on-field umpire with the maximum number of IPL matches.
● To find the biggest victories in IPL while defending a total and while chasing a total.
The main purpose of exploratory data analysis (EDA) is to help understand the data by looking
at it before making predictions from it. It helps the data scientist to find errors, handle
noisy data, detect anomalies and outliers, and identify new patterns in the data. EDA is used
to ensure that the results are valid and applicable to the intended business goals. It also
helps in drawing conclusions about standard deviations, categorical variables, and confidence
intervals. Once the process is complete, its findings can be used for more sophisticated data
analysis or modeling, including machine learning.
1.3 DATA AND METHODOLOGY
Exploratory Data Analysis (EDA) is a method used to analyze and summarize data sets. In
other words, data scientists use it to investigate the patterns in data sets and summarize
their characteristics. This helps them to discover patterns, find anomalies, and test
hypotheses. The dataset consists of data
about IPL matches played from the year 2008 to 2019. IPL is a professional Twenty20 cricket
league founded by the Board of Control for Cricket in India (BCCI) in 2008. The league has 8
teams representing 8 different Indian cities or states. It enjoys tremendous popularity and the
brand value of the IPL in 2019 was estimated to be ₹475 billion (US$6.7 billion).
CHAPTER 2
2. EXPLORATORY DATA ANALYSIS
Data are raw facts and figures that carry no proper information on their own and hence need to
be processed to get the desired information. Information is the result we get after processing
the raw data at different levels, or the conclusions extracted from a given dataset through a
process called data analysis.
Data analysis means cleaning the data, transforming it into an understandable form, and then
modeling it to extract useful information for business or organizational use. It is mainly
used in making business decisions. Many Python libraries are available for doing the analysis,
for example NumPy, Pandas, Seaborn, Matplotlib, and Sklearn.
• NumPy:
NumPy is a Python library used for numerical analysis. It stores the data
in the form of nd-arrays (n-dimensional arrays).
• Pandas:
Pandas is mainly used for converting data into tabular form, which makes the data more
structured and easier to read.
• Matplotlib:
Matplotlib is a data visualization and graphical plotting package for Python and its numerical
extension NumPy that runs on all platforms.
• Seaborn:
Seaborn is a Python data visualization package based on matplotlib that is tightly connected
with pandas data structures. The core component of Seaborn is visualization, which aids in data
exploration and comprehension.
• Sklearn:
Scikit-learn is one of the most useful libraries for machine learning in Python. It includes
numerous useful tools for classification, regression, clustering, and dimensionality reduction.
Data visualization helps data analysis by making it more understandable and interactive
through plotting or displaying the data in pictorial form. Pandas, an open-source Python
package built around two main data structures, Series and DataFrame, meets the need for
analyzing and visualizing data.
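As a small illustration of the NumPy and pandas data structures described above, the following is a minimal sketch. All values are toy numbers chosen for the example and are not taken from the IPL datasets.

```python
import numpy as np
import pandas as pd

# A 2-dimensional NumPy array (nd-array): 2 rows and 3 columns of toy numbers.
scores = np.array([[1, 2, 3], [4, 5, 6]])
shape = scores.shape                 # dimensions of the array
column_totals = scores.sum(axis=0)   # sum down each column

# A pandas DataFrame is a table whose columns are labelled Series.
table = pd.DataFrame({'over': [1, 2, 3], 'runs': [7, 4, 11]})
total_runs = table['runs'].sum()     # aggregate one column

print(shape, column_totals, total_runs)
```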
Data analysis using Python makes the task easier, since Python has many advantages over
other programming languages. As a high-level programming language (the code is in
human-readable form), it is easy for any programmer or user to understand and use. Many
libraries and functions for statistical and numerical analysis are available in Python.
Moreover, the source code is freely available to anyone (free and open source). This paper
includes the basic terms and functions that a beginner needs in order to know what data
analysis is. The paper is divided broadly into 4 sections. In section II, the main steps in
data analysis are discussed. In section III, data analysis using Python is studied, covering
the basics Python needs for data analysis, with data visualization aiding the analysis by
representing the results in picture form. In section IV, the conclusion of the paper is given.
A. Data requirements:
Data is the most important unit in any study. Data must be provided as inputs to the analysis
based on the analysis’ requirements. The term “experimental unit” refers to the type of
entity about which data is gathered (e.g., a person or population of people). Specific
variables of a population (such as height, weight, age, and salary) can be identified and
obtained. The data may be numerical or categorical.
B. Data Collecting:
Data collecting is the process of gathering data. Data is gathered from a variety of
sources, including relational databases, cloud databases, and other sources, depending on the
study’s needs. Field sensors, such as traffic cameras, satellites, monitoring systems, and so
on, can also be used as data sources.
C. Data processing:
Data that is collected must be processed or organized for analysis. For instance, these may
involve arranging data into rows and columns in a table format (known as structured data) for
further analysis, often through the use of spreadsheet or statistical software.
D. Data cleaning:
Data cleaning is carried out after the data has been processed and organized. It scans for
data inconsistencies, duplicates, and errors, and then removes them. The data cleaning process
includes tasks such as record matching, identifying inaccurate data, sorting data, identifying
outliers, spell-checking textual data, and maintaining data quality. As a consequence, it
keeps us from having unexpected outcomes and assists us in delivering high-quality data,
which is essential for a successful outcome.
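A few of the cleaning tasks listed above (finding missing values and removing duplicates) can be sketched with pandas as follows. The small table is made up for the illustration, and the choice to drop rows with a missing `city` is only one possible policy.

```python
import pandas as pd

# Toy table with one duplicated row and one missing city (illustrative values only).
raw = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'city': ['Mumbai', 'Pune', 'Pune', None],
})

missing_per_column = raw.isnull().sum()   # number of missing values in each column
deduplicated = raw.drop_duplicates()      # remove exact duplicate rows
# One possible policy: drop the rows where the city is still unknown.
cleaned = deduplicated.dropna(subset=['city']).reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```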
Once the datasets are cleaned and free of errors, they can be analyzed. A variety of
techniques can be applied, such as exploratory data analysis (understanding the messages
contained within the obtained data) and descriptive statistics (finding the average, median,
etc.). Data visualization is also used, in which the data is represented in a graphical format
in order to obtain additional insights into the information within the data.
Mathematical formulas or models (known as algorithms) may be applied to the data in order
to identify relationships among the variables, for example using correlation or causation.
G. Data product:
A data product is a computer application that takes data inputs and generates outputs, feeding
them back into the environment. It may be based on a model or algorithm.
CHAPTER 3
3. DATA ANALYSIS USING PYTHON
In this section, data analysis using Python is studied, starting with why Python is used for
data analysis and how anyone can start using it. The important libraries, the platforms, and
the dataset used to carry out the analysis are introduced. The use of various Python functions
for numerical analysis is described, along with various methods of plotting graphs and charts.
• Interpreted language
• Dynamic typing
• Portable
• Numerous IDEs
B. Packages used:
• Numpy
• Pandas
• Seaborn
• Matplotlib
C. Platform used:
D. Dataset used:
In this section, I explored the data at the surface level and did the required cleaning and
preparation for the analysis.
3.11 Matches DF
import pandas as pd
matches = pd.read_csv('ipldata/matches.csv')
matches
# getting general info about the dataset.
matches.info()
<class 'pandas.core.frame.DataFrame'>
1. As seen above, there are 756 venues whereas there are 749 cities, implying there
2. There are 754 values in umpire1 and umpire2 columns instead of 756.
3. There are 752 winner values, this may be due to matches being tied or matches
## Finding the venues where the city value is missing.
matches.venue[matches.city.isnull()]
## Checking if ALL the matches played at Dubai International Cricket Stadium have
## missing city value.
# Filling up the city column when the matches have venue Dubai International
# Cricket Stadium with "Dubai".
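The check and the fill described in the comments above can be sketched as follows. This is a minimal illustration on a made-up three-row table that only borrows the `venue` and `city` column names from `matches.csv`; it is not the notebook's original code.

```python
import pandas as pd

# Toy stand-in for the matches dataframe (illustrative rows only).
toy = pd.DataFrame({
    'venue': ['Dubai International Cricket Stadium', 'Wankhede Stadium',
              'Dubai International Cricket Stadium'],
    'city': [None, 'Mumbai', None],
})

# Boolean mask for the matches played at the Dubai stadium.
dubai_rows = toy['venue'] == 'Dubai International Cricket Stadium'
# True only if every match at that venue has a missing city value.
all_missing = bool(toy.loc[dubai_rows, 'city'].isnull().all())
# Fill the missing city values for that venue with "Dubai".
toy.loc[dubai_rows & toy['city'].isnull(), 'city'] = 'Dubai'

print(all_missing)
print(toy)
```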
# Finding out the two rows where the values of umpire1 column are missing.
matches[matches.umpire1.isnull()]
# Filling those values (of umpire2 too) using a quick google search.
matches['date']
0 2017-04-05
1 2017-04-06
2 2017-04-07
3 2017-04-08
4 2017-04-08
...
751 05/05/19
752 07/05/19
753 08/05/19
754 10/05/19
755 12/05/19
Name: date, Length: 756, dtype: object
# Converting the date column's data type into the datetime data type (which
# would standardize the column).
matches['date'] = pd.to_datetime(matches.date)
matches['date']
0 2017-04-05
1 2017-04-06
2 2017-04-07
3 2017-04-08
4 2017-04-08
...
751 2019-05-05
752 2019-07-05
753 2019-08-05
754 2019-10-05
755 2019-12-05
Name: date, Length: 756, dtype: datetime64[ns]
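Note that the raw column mixes two formats: `2017-04-05` (year-month-day) and `07/05/19`, which appears to be day/month/year, since the 2019 season ended in May. In the converted output above, rows 752 to 755 seem to have been read month-first (for example `2019-07-05` instead of 7 May 2019). A minimal sketch of one way to parse both formats explicitly is given below; the helper name `parse_mixed_dates` is hypothetical and the day-first reading of the slash format is an assumption.

```python
import pandas as pd

def parse_mixed_dates(column):
    """Parse a column holding both 'YYYY-MM-DD' and 'DD/MM/YY' strings (assumed formats)."""
    # Values that do not match a format become NaT instead of raising an error.
    iso = pd.to_datetime(column, format='%Y-%m-%d', errors='coerce')
    day_first = pd.to_datetime(column, format='%d/%m/%y', errors='coerce')
    # Use the ISO result where it parsed, otherwise fall back to the day-first result.
    return iso.fillna(day_first)

sample = pd.Series(['2017-04-05', '07/05/19', '12/05/19'])
parsed = parse_mixed_dates(sample)
print(parsed)
```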
# Finding all the names of the teams that compete in the IPL.
matches.team1.unique()
1. As seen above, there are multiple names for the same team, i.e., `Rising Pune
Supergiants` and `Rising Pune Supergiant`. The only difference is the trailing `s`; we
shall fix that.
2. `Delhi Capitals` and `Delhi Daredevils` are names of the same team. The team
representing Delhi, previously Delhi Daredevils, changed its name to Delhi Capitals in 2018.
So, for simplification, we will change the values where Delhi Daredevils is used to Delhi
Capitals.
1. team1
2. team2
3. winner
# Using replace method in pandas library to fix the team name errors.
# Checking all the names of the teams that compete in the IPL again for
# confirmation of the fix.
matches.team1.unique()
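The team-name fix described above can be sketched as follows on a toy table with the three affected columns (`team1`, `team2`, `winner`). The rows are made up for the illustration, and keeping the spelling `Rising Pune Supergiant` is an assumption based on the names that appear later in this report.

```python
import pandas as pd

# Toy stand-in for the matches dataframe (illustrative rows only).
toy = pd.DataFrame({
    'team1': ['Rising Pune Supergiants', 'Delhi Daredevils'],
    'team2': ['Mumbai Indians', 'Rising Pune Supergiant'],
    'winner': ['Mumbai Indians', 'Delhi Daredevils'],
})

# Old name -> standardized name.
name_fixes = {
    'Rising Pune Supergiants': 'Rising Pune Supergiant',
    'Delhi Daredevils': 'Delhi Capitals',
}
team_columns = ['team1', 'team2', 'winner']
# DataFrame.replace with a dict swaps whole cell values that match a key.
toy[team_columns] = toy[team_columns].replace(name_fixes)

print(toy)
```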
As seen below, there are multiple names for the same city, as in the case of
`Bangalore` and `Bengaluru`.
matches.city.unique()
array(['Hyderabad', 'Pune', 'Rajkot', 'Indore', 'Bangalore', 'Mumbai',
'Kolkata', 'Delhi', 'Chandigarh', 'Kanpur', 'Jaipur', 'Chennai',
'Cape Town', 'Port Elizabeth', 'Durban', 'Centurion',
'East London', 'Johannesburg', 'Kimberley', 'Bloemfontein',
'Ahmedabad', 'Cuttack', 'Nagpur', 'Dharamsala', 'Kochi',
'Visakhapatnam', 'Raipur', 'Ranchi', 'Abu Dhabi', 'Sharjah',
'Dubai', 'Mohali', 'Bengaluru'], dtype=object)
matches['city'] = matches['city'].replace({'Bangalore' : 'Bengaluru'})
3.12 Deliveries DF
# Loading the Deliveries Dataframe which would be used later for some Ball by Ball
# Analysis.
deliveries_df = pd.read_csv('ipldata/deliveries.csv')
deliveries_df
deliveries_df.batting_team.unique()
array(['Sunrisers Hyderabad', 'Royal Challengers Bangalore',
'Mumbai Indians', 'Rising Pune Supergiant', 'Gujarat Lions',
'Kolkata Knight Riders', 'Kings XI Punjab', 'Delhi Daredevils',
'Chennai Super Kings', 'Rajasthan Royals', 'Deccan Chargers',
'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiants',
'Delhi Capitals'], dtype=object)
As seen above, Deliveries dataframe has the same errors as Matches dataframe (Rising Pune Supergiants and Delhi
Daredevils).
1. batting_team
2. bowling_team
# Checking all the names of the teams that compete in the IPL again for
# confirmation of the fix.
deliveries_df.batting_team.unique()
3.2 Exploratory Analysis and Visualization
This section contains general analysis of both the matches and the deliveries dataframe in the
form of tables and graphs.
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (12, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
toss_decision = matches.toss_decision.value_counts()
toss_decision
field 463
bat 293
Name: toss_decision, dtype: int64
# Selecting the counts by label rather than by position.
fielding = toss_decision['field']
batting = toss_decision['bat']
toss_decisions = [fielding, batting]
labels = ["Fielding", 'Batting']
colors = ['Black', "Grey"]
plt.title("Toss Decisions", fontweight='bold', fontsize = 20)
patches, texts, pcts = plt.pie(toss_decisions, colors = colors, labels = labels,
                               autopct = "%.2f%%", startangle = 80, counterclock = False,
                               wedgeprops = {'linewidth': 3.0, 'edgecolor': 'white'},
                               textprops = {'fontsize': 15, 'fontweight' : 'bold'})
plt.axis('equal');
As seen from the above visualization, teams that win the toss tend to choose to field first in the Indian Premier League.
# Using the groupby functionality of pandas, here we first group the data by
# city and then count the matches that happened in each city by aggregating
# the id column using count(); the count column is then renamed to 'matches'.
matches_per_city_df = matches.groupby('city')[['id']].count()
matches_per_city_df = matches_per_city_df.sort_values('id', ascending = False).reset_index()
matches_per_city_df = matches_per_city_df.rename(columns = {'id': 'matches'})
# Plotting the visualisation for the top 20 cities that hosted the most matches.
plt.bar(matches_per_city_df['city'][:20], matches_per_city_df['matches'][:20],
        alpha = 0.8, color = 'blueviolet', edgecolor = 'black')
plt.title('Total Matches Hosted By Different Cities', pad = 20, fontweight='bold',
          fontsize = 20)
plt.ylabel('Number Of Matches', labelpad = 10, fontweight='bold', fontsize = 15)
plt.xlabel('Cities', labelpad = 20, fontweight='bold', fontsize = 15)
plt.xticks(rotation=60)
Why have some cities hosted fewer than twenty games?
1. The 2009 Indian Premier League season, abbreviated as IPL 2 or the 2009 IPL, was the
second season of the Indian Premier League. The tournament was hosted by South Africa and
was played between 18 April and 24 May 2009.
3. Mohali has hosted fewer matches for multiple reasons: one is the renovation in 2011, and
another is that the home team (Kings XI Punjab) lost many matches there, which attracted
smaller crowds and hence less revenue, so some of the home matches of Kings XI Punjab were
played at Dharamsala & Indore.
4. Some cities, like Ahmedabad, were introduced later as venues and, having no team
representing the state, hosted fewer matches.
5. Similarly, Rajkot has hosted fewer matches because the team representing the
state (Gujarat Lions) played only two seasons of the IPL (2016 & 2017), as it was one of the
two replacement teams.
The Top 10 Biggest Victory Margins, Both Through Runs & Wickets
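# Building a loser column: the loser is whichever of team1/team2 did not win.
loser = []
for i in range(len(matches)):
    if matches['winner'][i] == matches['team1'][i]:
        loser.append(matches['team2'][i])
    elif matches['winner'][i] == matches['team2'][i]:
        loser.append(matches['team1'][i])
    else:
        # No winner recorded (e.g. no result): carry the winner value over as-is.
        loser.append(matches['winner'][i])
matches['loser'] = loser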
# Finding out the top 10 biggest margin of victories by wickets
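The ranking of the biggest victory margins can be sketched with `nlargest`. The column names `win_by_runs` and `win_by_wickets` are assumptions about how the margins are stored in `matches.csv`, and the rows below are toy values, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for the matches dataframe; the margin column names are assumed.
toy = pd.DataFrame({
    'winner': ['Team A', 'Team B', 'Team C', 'Team A'],
    'loser': ['Team B', 'Team C', 'Team A', 'Team C'],
    'win_by_runs': [50, 0, 12, 0],      # margin when defending a total
    'win_by_wickets': [0, 10, 0, 7],    # margin when chasing a total
})

# nlargest returns the rows with the highest values, already sorted (descending).
top_by_runs = toy.nlargest(10, 'win_by_runs')[['winner', 'loser', 'win_by_runs']]
top_by_wickets = toy.nlargest(10, 'win_by_wickets')[['winner', 'loser', 'win_by_wickets']]

print(top_by_runs)
print(top_by_wickets)
```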
# Using the groupby functionality of pandas, grouping by the season column and
# aggregating the id column using count.
total_matches_per_year = matches.groupby('season')[['id']].count()
total_matches_per_year
# Visualising the number of matches in a season using a line chart.
1. In 2011, two new teams were introduced in the IPL, Kochi Tuskers Kerala & Pune
Warriors India. This meant that the IPL had 10 teams instead of 8, which can
explain the increase in the number of matches.
2. In 2011, Kochi Tuskers Kerala was terminated because they failed to pay the 10% bank
guarantee they had agreed to pay the IPL committee despite several reminders from the
BCCI. They were removed from the IPL in October 2011. By December, they were gone. So,
for the 2012 season, a new format had to be devised for 9 teams instead of 10, hence the
increase in matches again (though only a slight increase).
3. In 2013, just like Kochi Tuskers Kerala, Pune Warriors India were terminated for failing to
furnish a bank guarantee worth Rs 170 crore for the next season.
4. Hence, since 2014, a similar number of matches has been played in the IPL each season
amongst 8 teams.
# Concatenating (joining on axis = 0) the two teams (team1 and team2) columns and
# then using value_counts() to determine the number of matches played by each team.
wins = matches.groupby('winner')[['id']].count()
wins_per_each_team = wins.sort_values(by = 'id', ascending = False).reset_index()
wins_per_each_team.columns = ["Teams", "Wins"]
wins_per_each_team
# Merging the total matches and total wins datasets and calculating the winning
# percentage.
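The steps described in the comments above (counting matches per team, counting wins, merging, and computing the winning percentage) can be sketched as follows. The four-match table is a toy example; the output column names follow the ones used in the plotting code of this section.

```python
import pandas as pd

# Toy stand-in for the matches dataframe (illustrative rows only).
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'team1': ['A', 'B', 'A', 'C'],
    'team2': ['B', 'C', 'C', 'A'],
    'winner': ['A', 'C', 'A', 'A'],
})

# Matches played: stack team1 and team2 on top of each other and count each team.
total = pd.concat([toy['team1'], toy['team2']]).value_counts().reset_index()
total.columns = ['Teams', 'Total Matches']

# Matches won: group by winner and count the match ids.
wins = toy.groupby('winner')[['id']].count().reset_index()
wins.columns = ['Teams', 'Wins']

# Left merge keeps teams without any win; their missing Wins become 0.
matches_and_wins = total.merge(wins, on='Teams', how='left').fillna({'Wins': 0})
matches_and_wins['Winning Percentage'] = (
    matches_and_wins['Wins'] / matches_and_wins['Total Matches'] * 100
).round(2)

print(matches_and_wins)
```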
# Visualising the above dataframe.
plt.xticks(rotation=90)
plt.bar(matches_and_wins['Teams'], matches_and_wins['Total Matches'], alpha = 0.8,
color = 'white', edgecolor = 'black')
plt.bar(matches_and_wins['Teams'], matches_and_wins['Wins'], alpha = 0.8, color =
'cyan', edgecolor = 'black')
plt.plot(matches_and_wins.Teams, matches_and_wins['Winning Percentage'], color =
'black', marker = 'o', linewidth = 2, markersize = 8, linestyle = '--')
plt.legend(['Winning Percentage','Matches Played','Matches Won']);
plt.title('Number Of Matches Played VS Number Of Matches Won',fontweight = 'bold',
pad = 15, fontsize = 20)
plt.xlabel('Teams', labelpad = -25, fontweight='bold', fontsize = 15)
plt.ylabel('Matches Played, Won & Winning %age', labelpad = 20, fontweight='bold',
fontsize = 15);
Matches Won By The Home Team & The Away Team
The home team is the team that plays the match in its own city, and the away team is the
team it plays against.
# Creating an empty dictionary, the idea is to map each team with its city.
dictionary = {}
for i in matches_and_wins.Teams:
    if i in dictionary:
        pass
    else:
        dictionary[i] = None
dictionary
# `cities` must list the home cities in the same order as the dictionary keys.
j = 0
for i in dictionary:
    dictionary[i] = cities[j]
    j += 1
dictionary
{'Mumbai Indians': 'Mumbai',
'Royal Challengers Bangalore': 'Bengaluru',
'Kolkata Knight Riders': 'Kolkata',
'Delhi Capitals': 'Delhi',
'Kings XI Punjab': 'Mohali',
'Chennai Super Kings': 'Chennai',
'Rajasthan Royals': 'Jaipur',
'Sunrisers Hyderabad': 'Hyderabad',
'Deccan Chargers': 'Hyderabad',
'Pune Warriors': 'Pune',
'Gujarat Lions': 'Rajkot',
'Rising Pune Supergiant': 'Pune',
'Kochi Tuskers Kerala': 'Kochi'}
# Creating a list of home teams, the idea is to create a list consisting of the home
# team in each match and then using it as a column.
home_team = []
# Seeing the city in which the match took place, and appending to the home team list
# using keys and values of the dictionary.
# If both teams' cities do not match, then it is a neutral venue.
for i in range(756):
    if dictionary[city_teams_df.team1[i]] == city_teams_df.city[i]:
        home_team.append(city_teams_df.team1[i])
    elif dictionary[city_teams_df.team2[i]] == city_teams_df.city[i]:
        home_team.append(city_teams_df.team2[i])
    else:
        home_team.append("Neutral")
matches['home_team'] = home_team
# Creating a list of away teams, the idea is to create a list consisting of the away
# team in each match and then using it as a column.
away_team = []
# Seeing the city in which the match took place, and appending to the away team list
# using keys and values of the dictionary.
# If both teams' cities do not match, then it is a neutral venue.
for i in range(756):
    if matches.home_team[i] == matches.team1[i]:
        away_team.append(matches.team2[i])
    elif matches.home_team[i] == matches.team2[i]:
        away_team.append(matches.team1[i])
    else:
        away_team.append(matches.home_team[i])
len(matches[matches.home_team == 'Neutral'])
237
matches['away_team'] = away_team
# Calculating home team wins and away team wins by comparing the winner column with
# the home team column and away team column.
home_team_wins = 0
away_team_wins = 0
neutral_matches= 0
for i in range(752):
    if matches_with_result.winner[i] == matches_with_result.home_team[i]:
        home_team_wins += 1
    elif matches_with_result.winner[i] == matches_with_result.away_team[i]:
        away_team_wins += 1
    elif matches_with_result.home_team[i] == "Neutral":
        neutral_matches += 1
home_team_wins
293
away_team_wins
222
neutral_matches
237
As we can see by the visualization, there is only a slight advantage to the team playing in front of their home crowd and the
match can go either way.
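To put the counts reported above on a common scale, the share of home and away wins among the matches that had a home team can be worked out directly. This is a small arithmetic sketch using the three printed counts; excluding the neutral-venue matches from the denominator is a choice made for this illustration.

```python
# Counts printed by the notebook cells above.
home_team_wins = 293
away_team_wins = 222
neutral_matches = 237

# Matches that had a home side (neutral venues excluded).
decided_home_away = home_team_wins + away_team_wins
home_share = 100 * home_team_wins / decided_home_away
away_share = 100 * away_team_wins / decided_home_away

print(f"Matches with a home team: {decided_home_away}")
print(f"Home wins: {home_share:.1f}%  Away wins: {away_share:.1f}%")
```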
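dismissal_kind_df = deliveries_df.groupby('dismissal_kind')[['match_id']].count()
# Sorting the values based on the number of times batsmen have been dismissed in a
# certain way (descending), after renaming the count column; the index is reset so
# that the rows are numbered 0, 1, 2, ...
dismissal_kind_df = (dismissal_kind_df.rename(columns = {'match_id': 'count'})
                     .sort_values('count', ascending = False).reset_index())
# Dropping the three rarest dismissal kinds (rows 6, 7 and 8).
dismissal_kind_df_sig = dismissal_kind_df.drop([6,7,8])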
# Visualising the dataframe.
sizes = dismissal_kind_df_sig['count']
labels = ['Caught', 'Bowled', 'Run Out', 'LBW', 'Stumped', 'Caught & Bowled']
plt.title('Different Ways Of Batsman Dismissal', pad = 80, fontweight='bold',
fontsize = 20)
plt.axis('equal')
patches, texts, pcts = plt.pie(sizes, labels = labels, autopct = "%.1f %%",
startangle = 15, counterclock = False, radius = 1.3,
wedgeprops = {'linewidth': 3.0, 'edgecolor': 'black'},
textprops={'fontsize': 13, 'fontweight' : 'bold'});
plt.setp(pcts, color='white', fontweight='bold', fontsize = 10, rotation = 20);
plt.legend(['Caught', 'Bowled', 'Run Out', 'LBW', 'Stumped', 'Caught & Bowled'],
loc = 'upper left');
As seen by the visualization, batsmen get caught out most often, followed by getting bowled,
run out, and dismissed leg before wicket (LBW).
Most Matches Played By Batsmen & Bowlers
Most 6s & 4s Hit By Batsmen
# Using the value_counts functionality of pandas to count the batsmen who hit the most
# sixes.
big_hitters = deliveries_df.batsman[deliveries_df.batsman_runs ==
6].value_counts()[:15]
big_hitters
CH Gayle 327
AB de Villiers 214
MS Dhoni 207
SK Raina 195
RG Sharma 194
V Kohli 191
DA Warner 181
SR Watson 177
KA Pollard 175
YK Pathan 161
RV Uthappa 156
Yuvraj Singh 149
BB McCullum 129
AT Rayudu 120
AD Russell 119
Name: batsman, dtype: int64
# visualising the result.
gap_finders = deliveries_df.batsman[deliveries_df.batsman_runs ==
4].value_counts()[:15]
gap_finders
S Dhawan 526
SK Raina 495
G Gambhir 492
V Kohli 482
DA Warner 459
RV Uthappa 436
RG Sharma 431
AM Rahane 405
CH Gayle 376
PA Patel 366
KD Karthik 358
AB de Villiers 357
SR Watson 344
V Sehwag 334
MS Dhoni 297
Name: batsman, dtype: int64
# Using the value_counts functionality of pandas to count the batsmen who hit the most
# fours.
Raina, Warner, Kohli, R Sharma and Gayle being in the top 10 of both lists makes them the
most boundary-hitting and dangerous batsmen (Uthappa is in the top 10 for fours and 11th for sixes).
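The overlap between the two top-10 lists can be checked directly from the printed outputs. A minimal sketch in plain Python, with both lists transcribed from the value counts shown above:

```python
# Top 10 six hitters and top 10 four hitters, copied from the outputs above.
sixes_top10 = ['CH Gayle', 'AB de Villiers', 'MS Dhoni', 'SK Raina', 'RG Sharma',
               'V Kohli', 'DA Warner', 'SR Watson', 'KA Pollard', 'YK Pathan']
fours_top10 = ['S Dhawan', 'SK Raina', 'G Gambhir', 'V Kohli', 'DA Warner',
               'RV Uthappa', 'RG Sharma', 'AM Rahane', 'CH Gayle', 'PA Patel']

# Players that appear in both lists, kept in the order of the sixes list.
in_both = [player for player in sixes_top10 if player in fours_top10]
print(in_both)
```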
Most Wickets Taking Bowlers
# Notice how there are many NaNs in the player_dismissed column.
deliveries_df
# Creating an empty list is_wicket, the idea is to make a column in the dataframe
# that shows whether a wicket was taken at a particular delivery or not (run out,
# retired hurt and obstructing the field are meant to be excluded because they are
# not awarded to the bowler).
is_wicket = []
for i in range(179078):
    if deliveries_df.player_dismissed[i] == 0:
        is_wicket.append(0)
    elif deliveries_df.player_dismissed[i] != 0:
        # Note: a chain of `!=` tests joined by `or` is true for every dismissal kind,
        # so as written this branch also counts the three excluded kinds.
        if (deliveries_df.dismissal_kind[i] != 'run out' or
                deliveries_df.dismissal_kind[i] != 'retired hurt' or
                deliveries_df.dismissal_kind[i] != 'obstructing the field'):
            is_wicket.append(1)
deliveries_df['is_wicket'] = is_wicket
# Using the groupby functionality, grouping the dataset on bowlers and summing the
# is_wicket column for each bowler and
# sorting it based on the sum value (descending).
deliveries_df.groupby('bowler')['is_wicket'].sum().sort_values(ascending =
False)[:20]
bowler
SL Malinga 188
DJ Bravo 168
A Mishra 165
Harbhajan Singh 161
PP Chawla 156
B Kumar 141
R Ashwin 138
SP Narine 137
UT Yadav 136
R Vinay Kumar 127
A Nehra 121
Z Khan 119
RA Jadeja 116
SR Watson 107
DW Steyn 104
YS Chahal 102
P Kumar 102
RP Singh 100
PP Ojha 99
MM Sharma 99
Name: is_wicket, dtype: int64
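A caution on the tallies above: a condition of the form `kind != 'run out' or kind != 'retired hurt' or ...` is true for every value of `kind`, so the `is_wicket` flag as computed likely counts run outs and the other excluded kinds as well. A minimal sketch of a flag that excludes them, using `isin` on a toy table (the rows are made up; only the column names come from the deliveries dataframe):

```python
import pandas as pd

# Toy stand-in for the deliveries dataframe (illustrative rows only).
toy = pd.DataFrame({
    'bowler': ['X', 'X', 'Y', 'Y', 'X', 'X'],
    'dismissal_kind': ['caught', 'run out', None, 'bowled', 'retired hurt', 'lbw'],
})

# Dismissal kinds that are not credited to the bowler.
not_credited = ['run out', 'retired hurt', 'obstructing the field']

# 1 when a dismissal happened AND it is a kind credited to the bowler, else 0.
toy['is_wicket'] = (
    toy['dismissal_kind'].notna() & ~toy['dismissal_kind'].isin(not_credited)
).astype(int)

wickets = toy.groupby('bowler')['is_wicket'].sum().sort_values(ascending=False)
print(wickets)
```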
Bowlers Who Bowled The Most Deliveries
# Using the value_counts functionality of pandas to find out the most deliveries
# bowled by top 20 bowlers.
deliveries_df['bowler'].value_counts()[:20]
CHAPTER 4
Asking and Answering Questions
In this section, I will try to answer some interesting questions about both the datasets.
motm = matches[['player_of_match']]
motm = motm.rename(columns = {'player_of_match' : 'Player'})
top_players = motm.Player.value_counts()[:15]
top_players
CH Gayle 21
AB de Villiers 20
RG Sharma 17
MS Dhoni 17
DA Warner 17
YK Pathan 16
SR Watson 15
SK Raina 14
G Gambhir 13
MEK Hussey 12
AM Rahane 12
V Kohli 12
V Sehwag 11
DR Smith 11
AD Russell 11
Name: Player, dtype: int64
# Visualising the result.
labels = top_players.index
sizes = top_players.values
plt.title('Most MOTM Award Winners', y = -0.3, fontweight = 'bold')
plt.axis('equal')
patches, texts, pcts = plt.pie(sizes, labels = labels, autopct = "%.1f %%",
startangle = 90, counterclock = False, radius = 1.5,
wedgeprops = {'linewidth': 3.0, 'edgecolor': 'black'},
textprops={'fontsize': 11});
plt.setp(pcts, color='white', fontweight='bold');
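1. Gayle, Raina, Kohli and Warner appear in all three lists above (most sixes, most fours and
most player of the match awards), which makes them very scary players to face; Gambhir
appears in two of them (fours and awards).
2. Interestingly, there are no specialist bowlers in this list.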
How Many Matches Were There Where The Toss Winner Also Won The
Match?
# taking out only those records where the match winner was also the toss winner.
match_and_toss_winner_df.result.unique()
# Only keeping the matches that had normal results (not tied and not decided by a
# tiebreaker method).
match_and_toss_winner_df = match_and_toss_winner_df[match_and_toss_winner_df['result'] == 'normal']
match_and_toss_winner_df
# Grouping the dataframe by winner, aggregating the id column using count and
# sorting by the count values.
match_and_toss_winner_df = match_and_toss_winner_df.groupby('winner')[['id']].count()
match_and_toss_winner_df = match_and_toss_winner_df.sort_values(by = 'id', ascending = False).reset_index()
match_and_toss_winner_df
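The whole toss-winner analysis, including the initial filter, can be sketched on a toy table as follows. The column name `toss_winner` is an assumption about `matches.csv`, and the rows are made up for the illustration.

```python
import pandas as pd

# Toy stand-in for the matches dataframe; `toss_winner` is an assumed column name.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'toss_winner': ['A', 'B', 'C', 'A'],
    'winner': ['A', 'C', 'C', 'A'],
    'result': ['normal', 'normal', 'tie', 'normal'],
})

# Keep only the matches where the toss winner also won the match.
both = toy[toy['toss_winner'] == toy['winner']]
# Keep only the matches with a normal result.
both_normal = both[both['result'] == 'normal']
# Count such matches per team, most successful team first.
per_team = (both_normal.groupby('winner')[['id']].count()
            .sort_values(by='id', ascending=False).reset_index())

print(per_team)
```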
So, there are 350 matches in the IPL where the toss winner has also gone on to win the
match. Of these, Chennai Super Kings have won the most, i.e., 57 matches, followed very
closely by Mumbai Indians with 55 matches, who in turn are followed very closely by
Kolkata Knight Riders with 53 matches.
w_types = deliveries_df.dismissal_kind.unique()
# Removing nan, run out and retired hurt from the list of dismissal types because
# they are not credited to the bowler.
w = [1, 2, 4, 5, 6, 8]
w_types = [w_types[x] for x in w]
# 1. Using indexing, first finding out only those records that have the above
#    dismissal types.
# 2. Then using groupby functionality, grouping the records by bowler and
#    dismissal kind and aggregating using count.
# 3. Then unstacking the temp df so that it becomes easier to read.
temp = deliveries_df[deliveries_df['dismissal_kind'].isin(w_types)].groupby(by = ['bowler', 'dismissal_kind']).dismissal_kind.count().unstack(fill_value = 0)
# Then summing across columns to create a total column (the idea is to sort by the
# total (wickets) column in descending order).
temp['total'] = temp.sum(axis=1)
# Sorting by the total column in descending order, then dropping it and keeping
# only the top 10 bowlers.
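Selecting dismissal types by their position in `unique()` depends on the order in which the values first appear in the data. The sketch below is one possible alternative, not the code used in this project. It names the bowler-credited dismissal types explicitly and then carries out the sort-and-trim step described in the last comment. The toy rows and the name `bowler_wickets` are hypothetical.

```python
import pandas as pd

# Toy deliveries table; rows are invented for illustration.
deliveries = pd.DataFrame({
    "bowler": ["X", "X", "X", "X", "Y", "Y", "Z", "Z"],
    "dismissal_kind": ["bowled", "caught", "lbw", "run out",
                       "lbw", None, "caught", "caught"],
})

# Dismissal types credited to the bowler, listed by name instead of by position.
bowler_wickets = ["bowled", "caught", "caught and bowled",
                  "hit wicket", "lbw", "stumped"]

# Count wickets per bowler and dismissal type, one column per type.
temp = (
    deliveries[deliveries["dismissal_kind"].isin(bowler_wickets)]
    .groupby(["bowler", "dismissal_kind"]).size()
    .unstack(fill_value=0)
)

# Sort by total wickets, then drop the helper column and keep the top 10.
temp["total"] = temp.sum(axis=1)
top = temp.sort_values(by="total", ascending=False).drop(columns="total").head(10)
print(top)
```

Because the list is written out by name, the result does not change if the rows of the dataset are reordered.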
# Visualising the above dataframe using a stacked bar chart.
x = temp.index
y1 = temp.bowled
y2 = temp.caught
y3 = temp['caught and bowled']
y4 = temp['hit wicket']
y5 = temp.lbw
y6 = temp.stumped
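The plotting calls themselves do not appear in the listing. One common way to build a stacked bar chart is to draw each dismissal type on top of the running total of the previous ones. The sketch below follows that idea under stated assumptions: the counts are invented, and the non-interactive `Agg` backend is chosen only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented wicket counts per bowler and dismissal type.
temp = pd.DataFrame(
    {"bowled": [5, 3], "caught": [10, 8], "lbw": [2, 4]},
    index=["Bowler A", "Bowler B"],
)

fig, ax = plt.subplots(figsize=(8, 4))
bottom = pd.Series(0, index=temp.index)  # running height of each stack
for kind in temp.columns:
    # Each segment starts where the previous segments ended.
    ax.bar(temp.index, temp[kind], bottom=bottom, label=kind)
    bottom = bottom + temp[kind]

ax.set_ylabel("Wickets")
ax.legend(title="Dismissal kind")
stack_heights = bottom.tolist()  # final height of each bowler's bar
```

Looping over the columns avoids one hand-written variable per dismissal type, so the chart adapts if a type is missing from the data.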
As we can see from the above visualization, SL Malinga is the most dangerous bowler in this
list, dismissing his opponents very frequently.
# Creating an empty dataframe with bowler, dismissal kind and batsman as columns.
max_dismissal = pd.DataFrame(columns=['bowler', 'dismissal_kind', 'batsman'])
# Finding out all the batsmen that have played in the IPL.
batsmen = deliveries_df.batsman.unique()
# 1. Picking out the data of each batsman from the above list one by one.
# 2. Filtering out data based on dismissal kind of that batsman.
# 3. Grouping by bowler and aggregating using count.
# 4. Sorting values by dismissal kind count (descending) and then picking out the
#    bowler at the top.
# 5. Adding the current batsman to the column batsman (after creating the column).
# 6. Concatenating row-wise using the concat functionality of pandas.
for x in batsmen:
    current = deliveries_df[deliveries_df['batsman'] == x]
    current = current[current['dismissal_kind'].isin(['caught', 'lbw', 'bowled', 'stumped', 'caught and bowled', 'hit wicket'])]
    current = current.groupby('bowler').count()
    current = current.sort_values(by = 'dismissal_kind', ascending = False).dismissal_kind[:1].reset_index()
    current['batsman'] = x
    max_dismissal = pd.concat([max_dismissal, current], ignore_index=True)
# Sorting values using dismissal kind count (descending), creating a max_dismissal
# df, renaming the columns & showing the top 10 bowlers who took the wicket of a
# particular batsman.
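The per-batsman loop can also be expressed without explicit iteration. The sketch below is one possible vectorised variant and is not the code used in this project. It counts dismissals per batsman and bowler pair, then keeps the most frequent bowler for each batsman. The toy rows and the column name `dismissals` are assumptions for illustration. If two bowlers are tied for a batsman, either may be kept.

```python
import pandas as pd

# Toy deliveries table; rows are invented for illustration.
deliveries = pd.DataFrame({
    "batsman": ["P", "P", "P", "Q", "Q", "Q"],
    "bowler": ["X", "X", "Y", "Y", "Y", "X"],
    "dismissal_kind": ["caught", "bowled", "lbw", "caught", "stumped", "run out"],
})

bowler_wickets = ["caught", "lbw", "bowled", "stumped",
                  "caught and bowled", "hit wicket"]

# Number of times each bowler dismissed each batsman.
counts = (
    deliveries[deliveries["dismissal_kind"].isin(bowler_wickets)]
    .groupby(["batsman", "bowler"]).size()
    .reset_index(name="dismissals")
)

# Sort by count (descending) and keep the first, i.e. most frequent, bowler
# for every batsman.
max_dismissal = (
    counts.sort_values(by="dismissals", ascending=False)
    .drop_duplicates(subset="batsman")
    .reset_index(drop=True)
)
top10 = max_dismissal.head(10)
print(top10)
```

A single `groupby` touches the data once, whereas the loop filters the full table again for every batsman.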
1. As seen from the above table, MS Dhoni struggled the most against Z Khan; the other batsmen
in the table likewise struggled against particular bowlers.
2. Z Khan & B Kumar each appear twice in the top 10.
Who Were The Top 10 Batsmen Based On Runs Scored & Who Were The Top
10 Batsmen That Were Dismissed The Most?
batsman
V Kohli 5434
SK Raina 5415
RG Sharma 4914
DA Warner 4741
S Dhawan 4632
CH Gayle 4560
MS Dhoni 4477
RV Uthappa 4446
AB de Villiers 4428
G Gambhir 4223
Name: batsman_runs, dtype: int64
1. Interestingly, apart from MS Dhoni, everyone in this list is a top-order batsman (that is, they
bat at positions 1, 2 or 3), which makes this a notable feat for Dhoni, who bats in the middle order.
2. Raina, Kohli, Sharma, Gambhir, Warner, Uthappa and Gayle have appeared in every good list so far.
batsman
RG Sharma 162
SK Raina 161
RV Uthappa 157
V Kohli 152
KD Karthik 140
S Dhawan 137
G Gambhir 135
PA Patel 127
MS Dhoni 118
AM Rahane 117
Name: is_wicket, dtype: int64
Raina, Kohli, Sharma, Gambhir and Uthappa are also among the most frequently dismissed batsmen
(though usually after scoring a lot of runs).
The batting average of a player is defined as the number of runs scored divided by the number
of times the player was dismissed.
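The definition above can be sketched with a single grouped aggregation. The column names `batsman_runs` and `is_wicket` follow the printed outputs in this section, while the toy rows are invented. Counting `is_wicket` per striker is a simplification, since a run-out can dismiss the non-striker. Batsmen who were never dismissed are left out because their average is undefined.

```python
import pandas as pd

# Toy deliveries table; rows are invented for illustration.
deliveries = pd.DataFrame({
    "batsman": ["P", "P", "P", "Q", "Q", "R"],
    "batsman_runs": [4, 6, 0, 1, 2, 3],
    "is_wicket": [0, 0, 1, 1, 1, 0],
})

# Total runs and total dismissals for every batsman.
per_batsman = deliveries.groupby("batsman").agg(
    runs=("batsman_runs", "sum"),
    dismissals=("is_wicket", "sum"),
)

# Drop batsmen with no dismissals to avoid dividing by zero.
dismissed = per_batsman[per_batsman["dismissals"] > 0]

# Batting average = runs scored / times dismissed, highest first.
batting_average = (dismissed["runs"] / dismissed["dismissals"]).sort_values(ascending=False)
print(batting_average)
```

In practice a minimum number of dismissals or innings is often required as well, so that a player with very few appearances does not top the list.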
CH Gayle 41.454545
KL Rahul 41.081633
LMP Simmons 39.962963
ML Hayden 39.535714
OA Shah 38.923077
KS Williamson 38.794118
SE Marsh 38.292308
MEK Hussey 38.019231
MS Dhoni 37.940678
JC Buttler 37.657895
A Symonds 37.461538
dtype: float64
1. HM Amla, JP Duminy & AB de Villiers are the top batsmen based on batting average.
CONCLUSION
The above analysis gives an overview of IPL matches, the statistics of different players and
some enjoyable and informative facts about the IPL, from its first season in 2008 up to 2019.
The observations contain a lot of information about individual players as well as about teams
as a whole. With that, we have come to the end of this analysis. If you are a cricket lover, you
have surely heard of the IPL, and for many of you it is one of the favourite tournaments to enjoy
with your family. It is fun as well as exciting to discuss the results of the games you love and
to tell others the stories behind them; after going through this notebook you will have many more
stories to tell about the IPL and more knowledge of the game to share.
FUTURE SCOPE
There is a lot of scope for improving and extending this project in future. With the data
provided, and by adding extra datasets, we can build better team statistics that show the run
rate of each team and the overall position or value the team has in the IPL. We can predict the
cost of players in the next seasons of the IPL using the players' data together with their cost
in previous seasons. Further data manipulation can identify the overs in which the most runs
are scored and the overs in which most dismissals take place. We can also add the dataset of
the 2020 IPL and, by modifying only a few lines of code, observe and compare how the performance
of players and teams changes (the more the data, the merrier the visualization).