Madan Gopal Jhanwar
in European Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases (ECML-PKDD 2016)
1 Introduction
Statistical modeling has been used in sports for decades and has contributed
significantly to on-field success. Cricket is one of the most popular sports
in the world, second only to soccer. Various natural factors affecting the game,
enormous media coverage, and a huge betting market have given strong incentives
to model the game from various perspectives. However, the complex rules
governing the game, the ability of the players and their performances on a given
day, and various other natural parameters all play an integral role in the final
outcome of a cricket match. This makes it significantly challenging to predict
the result of a game accurately.
The game of cricket is played in three formats: Test matches, One Day
Internationals (ODIs) and Twenty20s (T20s). We focus our research on ODIs, the
most popular format of the game. To
predict the outcome of ODI cricket matches, we propose an approach where we
first estimate the batting and bowling potentials of the 22 players playing the
match using their career statistics and active participation in recent games. We
2 Madan Gopal Jhanwar and Vikram Pudi
use these player potentials to quantify the relative dominance one team has over
the other. Taking two other base features into account, namely the toss decision
and the venue of the match, along with the relative team strength, we apply
supervised learning algorithms to predict the winner of the match.
The major contributions of our paper are as follows:
2 Related Work
In the literature, Duckworth and Lewis proposed a solution, called the D/L
method [1], to reset targets in rain-interrupted matches, which was adopted by
the International Cricket Council (ICC) in 1998. Further, the use of
Duckworth-Lewis resources to assess players' performances has been studied in
[1], [2] and [3]. Optimal batting orders are discussed in [4] and [5]. Methods
of graphical representation to compare players are presented in [6], [7], and
[8]. [9] considers the strength of the opponent team, along with other factors,
in modeling the performance of batsmen and bowlers. However, as in any sport,
winning is the ultimate goal in cricket. [10] takes into account various
factors affecting the game, including home team advantage, day/night effect and
toss, and uses a Bayesian classifier to predict the outcome of the match. [11]
uses a combination of linear regression and nearest-neighbor clustering
algorithms to predict the outcome of a match, taking into account both
historical data and the instantaneous state of a match while the game is still
in progress. [12] studied the role of multiple factors, including home field
advantage, toss, match type (day or day and night), competing teams, venue
familiarity, and season, and applied Support Vector Machines (SVM) and Naive
Bayes classifiers to predict the winner of a match.
In this paper, we address a critical aspect that has not been studied yet: the
composition of a team changes over time. A team consists of 11 players, and
these 11 players are replaced over time. A team changes its composition
depending upon the match conditions, venue, opponent team, etc. There can be
various other reasons as well, such as a player getting injured, being dropped
from the team for poor performance, or retiring from the sport. Figure 1 shows
that on average at least 2 players change per match for each team. Therefore,
relying completely on historical team-level data is not only insufficient but
also misleading, since it does not portray the current competence of a team.
Taking such obsolete factors into account might lead us to incorrect
conclusions.
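The quantity plotted in Figure 1 can be illustrated with a short sketch. The text does not state exactly how a "change" is counted, so the code below assumes that a change is a player in the current playing XI who was not in the same team's previous playing XI; the function names and the toy lineups are hypothetical.

```python
from typing import List, Set


def player_changes(previous_xi: Set[str], current_xi: Set[str]) -> int:
    """Players in the current XI who were not in the previous XI (assumed definition)."""
    return len(current_xi - previous_xi)


def average_changes(lineups: List[Set[str]]) -> float:
    """Mean number of changes between consecutive matches, given lineups in date order."""
    if len(lineups) < 2:
        return 0.0
    changes = [player_changes(prev, cur) for prev, cur in zip(lineups, lineups[1:])]
    return sum(changes) / len(changes)


# Toy lineups with 3 players each instead of 11, purely for illustration.
lineups = [
    {"A", "B", "C"},
    {"A", "B", "D"},  # 1 change: D replaces C
    {"A", "E", "F"},  # 2 changes: E and F replace B and D
]
avg = average_changes(lineups)
print(avg)
```

Using sets ignores the batting order, which matches the idea of team composition rather than team ordering.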
Predicting the Outcome of ODI Cricket Matches 3
[Figure 1: bar chart. x-axis: Countries (Australia, South Africa, England,
New Zealand, India, Sri Lanka, Pakistan, West Indies, Bangladesh); y-axis: 0.0
to 3.0.]
Fig. 1. The average number of player changes per match for all the teams in the
years 2010-2014.
3 Methodology
In this section, we explain our approach to the problem in detail, including the
definitions and the mechanics of various algorithms used to model the batsmen,
bowlers and the teams.
Notation                  Description
φ_{Matches Played}        #Matches played by the player
φ_{Batting Innings}       #Matches in which the player batted
φ_{Batting Average}       #Runs scored divided by the #times the player got out
φ_{Num Centuries}         #Times the player scored ≥ 100 runs in a match
φ_{Num Fifties}           #Times the player scored ≥ 50 but less than 100 runs in a match
φ_{Bowling Innings}       #Matches in which the player bowled
φ_{Wkts Taken}            #Wickets taken by the player
φ_{FWkts Hauls}           #Times the player has taken ≥ 5 wickets in a match
φ_{Bowling Average}       #Runs conceded by the player per wicket taken
φ_{Bowling Economy}       Average #runs conceded by the player per over bowled
The pseudocode of the algorithm to model the batsmen for a given match is
given in Algorithm 1. Lines 2-6 calculate a player's Career Score using his
overall career statistics. Variable u (line 3) is the ratio of the number of
matches in which the batsman batted to the total number of matches he played. It
captures whether the player is a full-time specialist batsman or not. Higher
values of u indicate that the player often bats at the top of the batting order
and hence gets to bat in almost every match. Lower values of u tell us that the
player bats lower down the batting order and that his chances of batting in the
next match are comparatively low. Variable φ_{Career Score} (line 6) takes all
the career statistics into account, and therefore signifies the Career Score of
the batsman. Similarly, lines 7-8 calculate the Recent Score of a batsman.
Variable M (line 7) holds the recent matches played by the player. Variable
φ_{Recent Score} (line 8) captures the Recent Score of a batsman, which is the
average number of runs scored by the player in his recent games. Since the
Career Score and the Recent Score of players have different ranges, we normalize
them (lines 11-12) to lie in a common range of [0,1]. Finally, variable
φ_{Batsman Score} (line 13) stores the Batsman Score of a player, which is a
combination of his Career Score and Recent Score.
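The steps described for Algorithm 1 can be sketched in Python as follows. Only the structure comes from the description above: the ratio u, the average of recent runs, the normalization to [0,1], and a final combination. The exact way career statistics are weighted and the way the two scores are combined are not given in the text, so the weights `w_avg`, `w_100`, `w_50` and the blend factor `alpha` are hypothetical placeholders, and min-max scaling across the listed players is an assumed normalization.

```python
from typing import List, Sequence


def batting_participation(batting_innings: int, matches_played: int) -> float:
    """u: share of matches in which the player actually batted."""
    return batting_innings / matches_played if matches_played else 0.0


def career_score(u: float, batting_average: float, centuries: int, fifties: int,
                 w_avg: float = 1.0, w_100: float = 1.0, w_50: float = 1.0) -> float:
    """Career Score. The weighted sum is a HYPOTHETICAL stand-in, not the paper's formula."""
    return u * (w_avg * batting_average + w_100 * centuries + w_50 * fifties)


def recent_score(recent_runs: Sequence[float]) -> float:
    """Recent Score: average runs over the player's recent matches (the set M)."""
    return sum(recent_runs) / len(recent_runs) if recent_runs else 0.0


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; assumed normalization across the given players."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def batsman_scores(career: Sequence[float], recent: Sequence[float],
                   alpha: float = 0.5) -> List[float]:
    """Blend normalized Career and Recent Scores; alpha is a hypothetical weight."""
    c, r = min_max(career), min_max(recent)
    return [alpha * ci + (1 - alpha) * ri for ci, ri in zip(c, r)]


# Two made-up players: a top-order specialist and a lower-order batsman.
career = [
    career_score(batting_participation(90, 100), 40.0, 5, 20),
    career_score(batting_participation(30, 100), 15.0, 0, 2),
]
recent = [recent_score([50, 20, 35, 15]), recent_score([5, 10])]
scores = batsman_scores(career, recent)
print(scores)
```

Multiplying by u means a player who rarely bats contributes little even with a good average, which is the role the text assigns to that variable.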
The pseudo code of the algorithm to model bowlers for a given match is
given in Algorithm 2. Variable u (line 3) is the ratio of the number of matches in
which the bowler bowled to the total number of matches he played. It captures
whether the player is a full-time specialist bowler or not. Higher values of u
indicate that the player often bowls at the top of the bowling order and hence,
he gets to bowl in almost every match. On the other hand, lower values of u
tell us that the player is a part-time bowler who doesn’t bowl in every match he
plays and his chances of bowling in the next match are also comparatively low.
Variables v and w (lines 4-5) consider other statistically significant features
of a bowler. Finally, variable φ_{Bowler Score} (line 6) takes everything into
account, and therefore signifies the Bowler Score of the player.
Notice that, unlike for batsmen, we have not considered the recent performances
of a bowler in calculating his Bowler Score. This is due to a lack of data, as
we do not have match-wise individual performances of every bowler.
Modeling Teams: The batsmen and the bowlers are the fundamental units of
a team. Therefore, using the modeled batsmen and bowlers, we intend to define
an overall score of a team with respect to the other. We define the batting score
of a team as the summation of the batting scores of all its players. Similarly, the
bowling score of a team is defined as the summation of the bowling scores of all
its players. We use the scores of all the players directly in the team score,
because the variable u in Algorithms 1 and 2 already weights the contribution
of each individual player to the team score.
Our algorithm to find the relative strength between two teams, A and B,
competing against one another in a match m is shown in Algorithm 3. Since the
Batsman Scores and the Bowler Scores have different ranges, we first normalize
them to lie in the same range of [0,1] (lines 1-4). Lines 5-8 of the Algorithm
calculate the batting and bowling scores of both the teams. Variable S(A/B)
(line 9) captures the relative strength of team A against team B. The algorithm
follows the fundamental aspect of the game strategy where the batsmen of one
team work against the bowlers of the other team and vice-versa.
Note that out of the two competing teams, any one of them could be con-
sidered as team A and all the feature values and the target value would update
accordingly.
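A minimal sketch of the team model and of the relative strength S(A/B) is given below. Summing player scores into team batting and bowling scores follows the text. The formula that pits one team's batting against the other's bowling is not spelled out above, so the difference of two ratios used here is an assumption, and the function names and the small `eps` guard against division by zero are hypothetical.

```python
from typing import Sequence


def team_score(player_scores: Sequence[float]) -> float:
    """Team batting (or bowling) score: the sum of its players' normalized scores."""
    return sum(player_scores)


def relative_strength(bat_a: Sequence[float], bowl_a: Sequence[float],
                      bat_b: Sequence[float], bowl_b: Sequence[float],
                      eps: float = 1e-9) -> float:
    """S(A/B) under an ASSUMED form: A's batting vs B's bowling, minus the reverse."""
    a_attack = team_score(bat_a) / (team_score(bowl_b) + eps)
    b_attack = team_score(bat_b) / (team_score(bowl_a) + eps)
    return a_attack - b_attack


# Toy teams with two players each; all scores are already scaled to [0, 1].
bat_a, bowl_a = [0.8, 0.6], [0.5, 0.5]
bat_b, bowl_b = [0.4, 0.4], [0.5, 0.5]

s_ab = relative_strength(bat_a, bowl_a, bat_b, bowl_b)
s_ba = relative_strength(bat_b, bowl_b, bat_a, bowl_a)
print(s_ab, s_ba)
```

With this form, swapping the roles of the two teams only flips the sign of the score, which is consistent with the remark that either team may be treated as team A.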
Dataset: To retrieve all the required statistics, the entire dataset was
scraped from the cricinfo website [13]. It includes all the matches played
between 2010 and 2014. For every match, the dataset contains the basic match
details: the two competing teams, the outcome of the toss, the date on which the
match was held, the venue and the winner. Along with these, the career
statistics of the participating players and their performances in every match
are also included.
We have restricted our study to the top 9 ODI-playing teams, namely,
Australia, South Africa, India, England, Sri Lanka, Pakistan, New Zealand,
Bangladesh and West Indies. Since the impact of nature on the game cannot
be foreseen, a total of 109 matches which were either interrupted by rain or
ended in a draw/tie have been removed from the dataset. Finally, we divided the
dataset into training data and test data. The training dataset contains all the
matches played during the years 2010 to 2013, and the test dataset contains all
the matches played in the year 2014. There are a total of 299 matches in the
training dataset and 67 matches in the test dataset.
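The chronological split described above can be sketched as follows. The `Match` record and its field names are hypothetical; only the rule itself (matches before 2014 for training, matches in 2014 for testing) comes from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Match:
    played_on: date
    team_a: str
    team_b: str
    winner: str


def chronological_split(matches: List[Match],
                        first_test_year: int = 2014) -> Tuple[List[Match], List[Match]]:
    """Train on matches before first_test_year, test on matches from that year on."""
    ordered = sorted(matches, key=lambda m: m.played_on)
    train = [m for m in ordered if m.played_on.year < first_test_year]
    test = [m for m in ordered if m.played_on.year >= first_test_year]
    return train, test


# Made-up matches, for illustration only.
matches = [
    Match(date(2014, 1, 19), "New Zealand", "India", "New Zealand"),
    Match(date(2011, 4, 2), "India", "Sri Lanka", "India"),
    Match(date(2013, 6, 23), "England", "India", "India"),
]
train, test = chronological_split(matches)
print(len(train), len(test))
```

Splitting by date rather than at random keeps every training match strictly earlier than every test match.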
Binary Classifiers: Using various binary and numeric features, with the outcome
of the match as the label, we evaluated a large number of binary classifiers
using their scikit-learn implementations [18] to generate supervised
classification models, including SVM, Random Forests, Logistic Regression,
Decision Trees and kNN. We used the sweep feature to experiment with all the
possible values and combinations of the parameters for all the algorithms. The
accuracy of the kNN algorithm, with k=4, was statistically superior to that of
the best models of the other classifiers, as shown in Figure 2. Because using
the data of future matches to predict the outcome of past matches is
meaningless, we did not carry out any cross-validation procedure that would
disturb the chronological order of the data.
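To make the classification step concrete, here is a dependency-free sketch of a k-nearest-neighbour classifier and a small sweep over k. It is a stand-in for illustration, not the scikit-learn implementation used in the study. The feature layout (relative strength, toss flag, venue flag), the Euclidean distance, the tie-breaking rule and the toy data are all assumptions.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

# (feature vector, label); label is 1 if team A won, else 0.
Sample = Tuple[Sequence[float], int]


def knn_predict(train: List[Sample], x: Sequence[float], k: int) -> int:
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(train, key=lambda s: dist(s[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    # Ties can occur for even k; breaking them toward the smaller label is an assumption.
    return min(label for label, n in votes.items() if n == top)


def accuracy(train: List[Sample], later: List[Sample], k: int) -> float:
    """Fraction of later matches whose winner is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in later)
    return hits / len(later)


# Toy features: (relative strength, team A won toss, team A at home). Made-up values.
train = [
    ((0.6, 1, 1), 1), ((0.4, 0, 1), 1), ((0.5, 1, 0), 1),
    ((-0.5, 0, 0), 0), ((-0.6, 1, 0), 0), ((-0.4, 0, 1), 0),
]
later = [((0.7, 1, 1), 1), ((-0.7, 0, 0), 0)]  # chronologically after `train`

# Sweep over k, evaluating only on matches that come after the training matches.
sweep = {k: accuracy(train, later, k) for k in (1, 2, 3, 4, 5)}
print(sweep)
```

Because a binary vote can tie when k is even, as with k=4, the tie-breaking rule affects predictions and is worth fixing explicitly in any real implementation.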
[Figure 2: bar chart. x-axis: Algorithm (SVM, Random Forest, kNN, Logistic
Regression, Decision Tree); y-axis: Accuracy, 0.58 to 0.72.]
The main obstacle we faced while evaluating our approach was the inability to
compare directly against previous models such as [10] and [11], because they use
different underlying datasets and our dataset lacks some of their features. For
instance, we do not have the timings of the matches (day/night) used by [10], or
the instantaneous state of the matches at multiple stages used by [11]. We
therefore compared our model with two baseline models: the team winning the
toss wins the match (Model 1), and the team with positive relative strength, as
calculated in Algorithm 3, wins the match (Model 2). The results are tabulated
in Table 2. The higher accuracy of our model supports the significance of the
combination of features used.
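The two baselines are simple enough to state as code. The sketch below assumes a hypothetical `MatchRecord` layout and uses made-up matches, so the printed accuracies are illustrative only and are not results from the study. Treating a relative strength of exactly zero as a prediction for team B is also an assumption.

```python
from typing import Callable, List, NamedTuple


class MatchRecord(NamedTuple):
    toss_winner: str
    team_a: str
    team_b: str
    strength_a_vs_b: float  # relative strength S(A/B)
    winner: str


def predict_by_toss(m: MatchRecord) -> str:
    """Model 1: the team that wins the toss wins the match."""
    return m.toss_winner


def predict_by_strength(m: MatchRecord) -> str:
    """Model 2: team A wins if S(A/B) is positive, otherwise team B (assumed for 0)."""
    return m.team_a if m.strength_a_vs_b > 0 else m.team_b


def accuracy(matches: List[MatchRecord], predict: Callable[[MatchRecord], str]) -> float:
    """Fraction of matches whose winner is predicted correctly."""
    return sum(predict(m) == m.winner for m in matches) / len(matches)


# Made-up matches, for illustration only.
matches = [
    MatchRecord("India", "India", "Australia", 0.3, "India"),
    MatchRecord("England", "Pakistan", "England", 0.1, "Pakistan"),
    MatchRecord("New Zealand", "Sri Lanka", "New Zealand", -0.2, "Sri Lanka"),
    MatchRecord("South Africa", "Bangladesh", "South Africa", -0.4, "South Africa"),
]
toss_acc = accuracy(matches, predict_by_toss)
strength_acc = accuracy(matches, predict_by_strength)
print(toss_acc, strength_acc)
```

Passing the prediction rule as a function keeps the accuracy computation identical for both baselines.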
Although we cannot directly compare these results with the prior
state-of-the-art approaches due to differences in the dataset, it is noteworthy
that the best accuracy in predicting the outcome of ODI cricket matches reported
so far in the literature is between 0.68 and 0.70 [11]. Team-wise winning
accuracy, as predicted by our model, is shown in Figure 3.
Table 2. Comparing our kNN-based model with other baseline models
Model        Accuracy
Model 1      0.56
Model 2      0.63
Our Model    0.71
[Figure 3: bar chart. x-axis: Country (New Zealand, Sri Lanka, Pakistan,
England, India, Bangladesh, West Indies, Australia, South Africa); y-axis:
Accuracy, 0.0 to 0.9.]
5 Conclusion
The paper addresses the problem of predicting the outcome of an ODI cricket
match using the statistics of 366 matches. The novelty of our approach lies in
addressing the problem as a dynamic one, and using the participating players as
the key feature in predicting the winner of the match. We observe that simple
features can yield very promising results.
References
1. Duckworth, Frank C., and Anthony J. Lewis. "A fair method for resetting the
target in interrupted one-day cricket matches." Journal of the Operational
Research Society 49.3 (1998): 220-227.
2. Beaudoin, David, and Tim B. Swartz. "The best batsmen and bowlers in one-day
cricket." South African Statistical Journal 37.2 (2003): 203.
3. Lewis, A. J. "Towards fairer measures of player performance in one-day
cricket." Journal of the Operational Research Society 56.7 (2005): 804-815.
4. Swartz, Tim B., Paramjit S. Gill, and David Beaudoin. "Optimal batting orders
in one-day cricket." Computers and Operations Research 33.7 (2006): 1939-1950.
5. Norman, John M., and Stephen R. Clarke. "Optimal batting orders in cricket."
Journal of the Operational Research Society 61.6 (2010): 980-986.
6. Kimber, Alan. "A graphical display for comparing bowlers in cricket." Teaching
Statistics 15.3 (1993): 84-86.
7. Barr, G. D. I., and B. S. Kantor. "A criterion for comparing and selecting
batsmen in limited overs cricket." Journal of the Operational Research Society
55.12 (2004): 1266-1274.
8. Van Staden, Paul Jacobus. "Comparison of cricketers' bowling and batting
performances using graphical displays." (2009).