Analysis of Football DataFinalPiece

This dissertation by Patrik Atkinson analyzes football data to provide insights for clubs, fans, and bookmakers, focusing on forecasting match results, evaluating Expected Goals (xG), and understanding interruptions during play. It employs various statistical models, including Poisson distributions for match results and logistic regression for xG, demonstrating their effectiveness using data from major European leagues. The findings highlight the potential for financial gains through informed betting strategies and improved transfer market efficiency, while also addressing the significant impact of interruptions on match duration.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/363816236

ANALYSIS OF FOOTBALL DATA

Preprint · September 2022

1 author:

Patrik Atkinson
The University of Manchester
All content following this page was uploaded by Patrik Atkinson on 01 December 2022.


ANALYSIS OF FOOTBALL DATA

A dissertation submitted to the University of Manchester


for the degree of Master of Science
in the Faculty of Science and Engineering

2022

Patrik Atkinson
Department of Mathematics
Contents

Abstract
Declaration
Intellectual Property Statement
Acknowledgements

1 Introduction

2 Football Scores
2.1 Methods and Modelling
2.1.1 Dixon and Coles Independent Poisson Model
2.1.2 Bivariate Poisson Model
2.1.3 Other Approaches
2.2 Data and Processing
2.3 Results and Discussion
2.3.1 Premier League 2021/22
2.3.2 Bundesliga 2021/22
2.3.3 La Liga 2021/22
2.3.4 Serie A
2.3.5 Ligue 1 2021/22
2.3.6 Betting Strategy

3 Expected Goals (xG)
3.1 Methods and Modelling
3.1.1 Rathke’s 2017 Zonal xG Model
3.1.2 Gómez’ 2020 Distance and Angle xG Model
3.1.3 Stats Perform’s xG Model
3.2 Data and Processing
3.3 Results and Discussion
3.3.1 Rathke’s 2017 xG Model
3.3.2 Distance and Angle Logistic Regression xG model
3.3.3 Comparing xG models

4 Football Interruptions
4.1 Methods and Modelling
4.1.1 Zhao & Zhang’s Gamma distributed GLM with log link
4.1.2 Alternative Approaches to Modelling Football Interruptions
4.2 Data and Processing
4.3 Results and Discussion
4.3.1 Timing Effect
4.3.2 Area (Location) Effect
4.3.3 League Effect
4.3.4 Area and Timing Interaction Effect
4.3.5 Interruption Results Discussion

5 Concluding Discussion

A Appendix
A.1 Derivations

B Appendix
B.1 Figures and Tables

C Appendix
C.1 R Code
C.1.1 Football Scores Data Processing and Analysis
C.1.2 xG Data Processing and Analysis
C.1.3 Interruptions Data Processing and Analysis

List of Tables

2.1 Table of attack and defence parameter estimates for Premier League clubs 2021/22.
2.2 Table of forecast scores & observed scores for Premier League matchweek 1 2022/23.

3.1 Table of xG estimates for each zone and the number of shots to result in 1 xG.
3.2 Table of xG estimates for 3 central locations using the logistic regression model.
3.3 Table of AIC and BIC values for the two xG models.
3.4 Table of Brier score and skill score for contrasting xG models.

B.1 Table of attack and defence parameter estimates for Bundesliga 2021/22.
B.2 Table of attack and defence parameter estimates for La Liga 2021/22.
B.3 Table of attack and defence parameter estimates for Serie A 2021/22.
B.4 Table of attack and defence parameter estimates for Ligue 1 2021/22.
B.5 Table of parameter estimates for the fitted free kick interruptions GLM. Significant at *95%, **99%, ***99.9%. All values taken to 3s.f.
B.6 Table of parameter estimates for the fitted throw in interruptions GLM.
B.7 Table of parameter estimates for the fitted out of bounds ball interruptions GLM.
B.8 Table of parameter estimates for the fitted free kick interruptions GLM.
B.9 Table of parameter estimates for the fitted corner interruptions GLM.
B.10 Table of parameter estimates for the fitted penalty interruptions GLM.

List of Figures

1.1 Premier league home and away goals from seasons 2017/18 - 2021/22 (Fixture Download 2022) with fitted Poisson distributions (red).

2.1 Distribution of Premier League results for seasons 2017/18 - 2021/22.

3.1 Distance and angle to goal from a shot.
3.2 All unblocked open play unsuccessful shots (blue) and goals (red) from the top 5 European leagues 2017/18, European Championship 2016 and FIFA World Cup 2018.
3.3 Heatmap of the zonal xG model.
3.4 Heat map demonstrating estimated xG for logistic regression model with covariates distance and angle to goal.
3.5 ROC curve for both the zonal and distance & angle xG models.

B.1 Distributions of 4 European leagues home and away goals from seasons 2017/18 - 2021/22.
B.2 Densities of 4 European leagues scores from seasons 2017/18 - 2021/22.
B.3 Results for the Rathke 2017 zonal xG model.
B.4 Pitch segmentation for the location covariate when modelling free kick interruptions.
B.5 Example structure of Wyscout data (Pappalardo et al. 2019) when loaded into R using ’fromJSON’.

Abstract

The global appeal of, and heavy financial investment in, football have led to the
accumulation of richly detailed data that can be analysed to provide added value for
football associations, clubs, fans and bookmakers. Such value can be obtained via
critical evaluation of selected modelling approaches for football scores, Expected Goals
and interruptions, which are covered in this dissertation. First, accurate forecasting of
football match results may enable a long-term positive financial return for bookmakers
or those placing statistically informed bets. These result forecasts can be made
using models which assume the numbers of home and away goals follow independent
Poisson distributions. Data from the ’big 5’ European leagues demonstrate the
effectiveness of the model, and suggest that a betting strategy producing long-term
profit may be attainable. Second, successful identification of outstanding shot-takers
using the Expected Goals (xG) metric can provide greater transfer market efficiency for
football clubs, ultimately reducing financial loss through poor transfer decisions. A
zonal xG model is contrasted with a logistic regression model; both adequately fit the
data with non-significant deviance (p = 0.6047, p = 0.1654). The logistic regression
model, with distance and angle to goal as covariates, demonstrates a better fit for match
event data from elite European and world football (Pappalardo et al. 2019). Third,
interruptions to play are important, accounting for a mean duration of 36 minutes 16
seconds per match. Thus, an empirical investigation of significant factors influencing
the duration of interruptions is undertaken. The duration of interruptions follows a
gamma distribution and is modelled using a GLM with log link. Significant variation in
the duration of different interruption types is demonstrated according to the time in a
match, location on the pitch and the league. Notably, penalty duration in the leagues
using VAR in 2017/18, Serie A and the Bundesliga, was longer (p < 0.001) than in the
Premier League, which did not use VAR in 2017/18. Many findings agreed with Zhao &
Zhang (2021).

Declaration
No portion of the work referred to in the dissertation has
been submitted in support of an application for another
degree or qualification of this or any other university or
other institute of learning.

Intellectual Property Statement

i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and
s/he has given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard or elec-
tronic copy, may be made only in accordance with the Copyright, Designs and
Patents Act 1988 (as amended) and regulations issued under it or, where appropri-
ate, in accordance with licensing agreements which the University has entered into.
This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intel-
lectual property (the “Intellectual Property”) and any reproductions of copyright
works in the dissertation, for example graphs and tables (“Reproductions”), which
may be described in this dissertation, may not be owned by the author and may
be owned by third parties. Such Intellectual Property and Reproductions cannot
and must not be made available for use without the prior written permission of the
owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in
any relevant Dissertation restriction declarations deposited in the University Library,
The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s
Guidance on Presentation of Dissertations.

Acknowledgements

Firstly, I would like to thank my project supervisor Dr. Timothy Waite for his
invaluable insight and support throughout all aspects of this dissertation, from the data
collecting and processing to the production of the final write-up. Thank you to the
academic and administrative staff in the Department of Mathematics at The University
of Manchester who have made my 4 years of study very enjoyable and have
greatly enriched my passion for statistics. I would also like to thank all of the staff at
Alder Centre for Education (ACE) for their care and teaching during my prolonged
periods of poor health in earlier education. In particular, I thank Robin Steere MBE for
sharing his mathematics knowledge via voluntary tuition, without which attaining the
level of understanding required to study a mathematics degree would not have been possible.

I would like to thank all of my family and friends for their patience, support and
interest throughout the research and writing of this dissertation. This support pro-
vided me with the grounding to produce my best possible quality of work. Lastly,
thank you to Jürgen Klopp and Liverpool Football Club for being so good at footy -
enhancing my enjoyment of football, which has made this dissertation an extremely
interesting and delectable project.

Chapter 1

Introduction

Football is a sport viewed on a global scale. From local clubs to international sides,
millions of viewers are drawn in on a weekly basis, and billions of viewers watch
the biggest events in world football (Top Media Advertising n.d.). Such large-scale
viewership provides the opportunity for broadcasters and bookmakers to profit. The
number of Premier League games covered by UK broadcasters is increasing, as is the
value of broadcasting rights. From 2001-03 just 110 of the 380 matches in a Premier
League season were televised live in a deal worth £400m annually, whereas 168 matches
each season from 2016-19 were televised live in a deal worth £1.712bn annually
(Butler & Massey 2019). Complementing the increased accessibility and value of live
football, the number of bets placed on Premier League matches with the bookmaker
Betfair increased almost linearly year on year from 2009 to 2015 (Deutscher et al. 2019).
Hence a mathematical approach to modelling scores and forecasting the scoreline in a
given football match may be of interest, to investigate whether money could be earned
through strategically betting on football outcomes. Models which are able to produce
accurate outcome probabilities could be contrasted with bookmakers’ odds, to assess the
feasibility of designing a strategy which results in long-term monetary gain for the bettor.
Treating the number of goals scored by the home or away team separately, the
distributions of both home and away goals over multiple seasons appear to closely follow
Poisson distributions, as can be seen in figure 1.1 for the Premier League, where
typically the home team scores more goals than the away team due to the ’home
advantage’. The Poisson approximation may also be a reasonable assumption for other
elite European leagues, see figure B.1. Although the fitted Poisson distributions, shown
in red in figure 1.1, appear close, the observed home and away goal counts differ
significantly from them under the likelihood ratio test, with p = 0.00130 and
p = 0.0000198 for home and away goals respectively. However, this may be a feature of
the large sample size of 1900 matches, which results in small standard errors. As is
discussed in section 2.1.1, the under- and overestimation using a Poisson approximation
for low scores of 0 and 1 home and away goals further supports Dixon and Coles’ choice
to include the function τ in equation 2.6.

Figure 1.1: Premier league home and away goals from seasons 2017/18 - 2021/22
(Fixture Download 2022) with fitted Poisson distributions (red).

There is substantial variability from one game to another, caused by a multitude of
variables, which makes prediction of any singular outcome challenging. Dixon & Coles
(1997) propose treating the number of home and away goals scored in a game as fol-
lowing independent Poisson distributions with a correction factor for lower scoring
matches, namely 0-0, 1-0, 0-1, and 1-1, which contradict the independence assump-
tion.

A similar approach for modelling football scores uses a Bivariate Poisson distribution,
as proposed by Maher (1982) and later developed by Karlis & Ntzoufras (2003). Home
and away goals in a game are not independent of each other, since the teams
are simultaneously playing each other with one football, rather than two separate
games at opposite ends of the pitch. Therefore a potential benefit of using a Bivariate
Poisson distribution, instead of two independent Poisson distributions, is that
the dependence between home and away goals can be considered within the model’s
distribution. Since the numbers of goals scored by the home and away teams are being
estimated, there are a considerable number of betting markets which could be
investigated. These include ’correct score’, ’total goals exact’, ’total goals
over/under’, ’both teams to score’, ’match result’, and many others.

A considerably different analysis of the number of goals scored in a football match
is Expected Goals, which is used in post-match analysis as opposed to forecasting.
Expected Goals (xG) was first described by Green (2012) as a method of assessing the
performance of Premier League goalscorers. It is a model which provides a numerical
estimate for the probability that a shot will result in a goal. One of the primary uses
for xG is providing a metric for football clubs to identify whether the team or a player
has been effective with the shots they have taken. In an isolated match a team or
player with considerably greater xG than the number of goals actually scored could be
perceived as having poor luck, missing some shots from good positions, or facing an
opposition goalkeeper playing considerably better than on average. Whereas if xG is
greater than goals scored throughout a longer period of games or a full season, it may
be more indicative of a trend of poor conversion from the shots taken. The opposite
is also true that a team or player who out-scores their xG throughout a season has
been more effective in front of goal than average. This may indicate that the team or
a player is above average in converting their chances. A crucial use of this information
for football clubs can be to inform the club of possible transfer targets to improve their
attack, or sell players that are regularly under-performing relative to their xG. Coaches
and tacticians may also use xG models to instruct individuals to take more shots from
particular zones where they have greater xG and thus maximise their opportunity to
score goals.

To model xG, since the outcome is binary (1 for a goal, 0 for no goal), a common
approach is to use logistic regression. Typically the differences between xG models
lie in the choice of which covariates are sufficient to explain as much
variation between shot outcomes as possible. Another challenge involving covariates
is the complexity of, and access to, data, since some variables may be defined differently
in different data sets and many providers of spatial football event data, such as Opta,
withhold their data behind a pay-wall (Stats Perform 2022). For the purpose of this
study two contrasting models will be considered: a model which divides the attacking
half of the pitch into zones and calculates empirical probabilities for each zone, as
proposed by Rathke (2017), and a logistic regression model with covariates for the
distance and angle to the goal, similar to that used by Opta (Green 2012) and as
discussed in an article by Gómez (2020).
A third aspect of football data analysis, for which approaches vary, is modelling
the duration of interruptions in football matches. By investigating such models, there
is opportunity to develop a deeper understanding of an important aspect of football
strategy: managing the time in a match. Some features which may influence a player’s
attitude towards managing interruptions in a football match could include the
tiredness of teammates and opponents, the game state including the time and extent of a
lead, where on the pitch the interruption has taken place, and whether there is a
considerable chance of scoring a goal or conceding possession in a defensive area. This
poses the opportunity to investigate the extent of ’time-wasting’, which is considered by
Cox (2018) as one of the ”dark arts” of football, and whether it may be more prevalent in
different footballing cultures. Football teams may have an interest in modelling the
duration of interruptions as an opportunity to monitor the well-being of their players
and devise a strategy to reduce player fatigue, therefore reducing the risk of injury
(Ekstrand et al. 2004).
Investigating the duration of interruptions may also benefit spectators, by giving
greater insight into which interruptions have unnecessarily long durations and could
therefore be shortened by means of a rule change implemented by a governing body
in football such as the English Football Association (F.A.), the UEFA Congress or
FIFA, to make matches more time-efficient. Zhao & Zhang (2021) propose modelling
the duration of six different types of interruptions (free kicks, throw-ins, out-of-bounds
balls, goal kicks, corners and penalties) using a generalised linear model (GLM) with
durations belonging to the gamma family of distributions and a log link function.
This dissertation is divided into 5 chapters: 1 introduction, 2
football scores, 3 Expected Goals, 4 interruptions, and 5 concluding discussion.
Chapters 2, 3 and 4 are each divided into three sections: methods and modelling, data and
processing, and results and discussion.
All statistical analyses in this dissertation are carried out using R Studio (R Core Team
2020).
Chapter 2

Football Scores

2.1 Methods and Modelling

2.1.1 Dixon and Coles Independent Poisson Model

Maher (1982) demonstrated the adequacy of modelling the number of goals scored
by the home and away teams in association football matches using independent
Poisson distributions, with rates determined by the quality of each team based on past
performance. A Poisson distribution appears suitable since the support of a Poisson
distributed random variable is {0, 1, 2, ...}, which is also the set of possible numbers
of goals scored by either home or away team. Both home and away goals appear to
follow Poisson distributions for the Premier League in the seasons 2017/18 - 2021/22
(see figure 1.1). The use of Poisson distributions for home and away goals also appears
appropriate for four of the most elite leagues in European football (the German
Bundesliga, La Liga, Serie A, and Ligue 1); see figure B.1. In the simplified version of
Maher’s model, shown to have a log-likelihood not significantly different from that of
the full original model, there are 2n independent parameters, comprising a constant
parameter k, which quantifies the ’home effect’ (differences in performance when teams
are playing at their home stadium in comparison to away), and, for each team, an attack
parameter αi and a defence parameter βi for i = 1, ..., n, with n the number of teams in
the division. Maher imposes the constraint $\sum_{i=1}^{n} \alpha_i = \sum_{i=1}^{n} \beta_i$
to ensure there are 2n identifiable parameters.
The independent Poisson model proposed by Dixon & Coles (1997) models the number
of home team goals scored by team ’i’ against away team ’j’ as Xi,j and away team
goals scored by team ’j’ against team ’i’ as Yi,j such that:

\[ X_{i,j} \sim \mathrm{Poisson}(\lambda) \tag{2.1} \]

\[ Y_{i,j} \sim \mathrm{Poisson}(\mu) \tag{2.2} \]

\[ \log(\lambda) = \alpha_i + \beta_j + \gamma \tag{2.3} \]

\[ \log(\mu) = \alpha_j + \beta_i \tag{2.4} \]

Here λ = exp(αi + βj + γ), µ = exp(αj + βi), and αi, βi > 0 are the ’attack’ and
’defence’ rates respectively for the ith team in the league. The constant ’home effect’
is denoted γ > 0, and Xi,j, Yi,j are assumed to be independent. Equations 2.3 and
2.4 describe the GLMs with Poisson distributed responses and canonical log link
functions. The assumption that home effect (or typically home advantage) is constant
for all teams may not be valid. Peeters & van Ours (2021) discuss how ”clubs differ
substantially” in relative home advantage, which describes the differences between
teams in the same division. Thus an alteration to the Dixon and Coles model could be
team-specific home advantages γi for i = 1,...,n to account for relative home advantage,
although this would add a further n − 1 parameters to be estimated.
Dixon and Coles test the validity of the independence assumption by computing the
ratio of joint empirical probability function f˜(x, y), with the product of marginal
empirical probability functions for home f˜H (x) and away f˜A (y) scores, with ’x’ and ’y’
the number of goals scored by home and away teams respectively.

\[ \frac{\tilde{f}(x, y)}{\tilde{f}_H(x)\,\tilde{f}_A(y)} \tag{2.5} \]
If this ratio is equal to 1, then the product of empirical marginal distributions is
equal to the joint empirical distribution, and thus independence is satisfied, since for
two independent events A and B, P(A) P(B) = P(A ∩ B). When computing this ratio
and respective bootstrap standard errors for each possible score, the only scores which
had ratios significantly different to 1 were 0-0, 1-0, 0-1 and 1-1, with respective ratios
(standard errors): 1.115 (0.0352), 0.937 (0.0243), 0.92 (0.0287), 1.057 (0.02). Since a
ratio above 1 means a score occurs more often than the product of the marginals implies,
the independence model significantly underestimates the probability of 0-0 and 1-1, and
overestimates 1-0 and 0-1. The details of this test are vague, although each of these
ratios deviates from 1 by more than two standard errors, which typically corresponds to
the limits of an approximate 95% confidence interval. One possible explanation for such
small standard errors is the frequency at which these scores occur, as can be seen by the
size and coloration of points in figure 2.1: the number of samples n is large in the
calculation of the standard error, $\text{s.e.} = \sigma/\sqrt{n}$. Conducting a
chi-square test for independence on the contingency table of scores could be viewed as a
more rigorous approach. This chi-square test is carried out in section 2.3 for all 1900
Premier League full-time results from 2017/18 - 2021/22.
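
The ratio in equation 2.5 can be illustrated with a short sketch. The Python snippet below is only an assumed, minimal illustration (the dissertation's own analysis is carried out in R); the function name `independence_ratios` and the toy score list are hypothetical and are not taken from the data used in this chapter.

```python
from collections import Counter

def independence_ratios(scores):
    """Return {(x, y): f(x, y) / (f_H(x) * f_A(y))} for each observed scoreline.

    `scores` is a list of (home goals, away goals) tuples. All three
    empirical probability functions are estimated from the same sample.
    """
    n = len(scores)
    joint = Counter(scores)
    home = Counter(x for x, _ in scores)
    away = Counter(y for _, y in scores)
    return {
        (x, y): (count / n) / ((home[x] / n) * (away[y] / n))
        for (x, y), count in joint.items()
    }

# Hypothetical toy sample of scorelines, for illustration only.
toy_scores = [(0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (2, 1), (1, 2)]
ratios = independence_ratios(toy_scores)

# A ratio above 1 means the scoreline is observed more often than the
# product of the marginal frequencies would imply under independence.
for score, ratio in sorted(ratios.items()):
    print(score, round(ratio, 3))
```

Bootstrap standard errors, as used by Dixon and Coles, could be approximated by resampling the list of scores with replacement and recomputing the ratios; that step is omitted here.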
To adapt Maher’s model to account for the lack of independence found in the scores
0-0, 1-0, 0-1, 1-1, a dependence parameter ρ is introduced using the function τλ,µ (x, y)
by Dixon and Coles.

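\[ \tau_{\lambda,\mu}(x, y) = \begin{cases} 1 - \lambda\mu\rho & \text{if } x = y = 0, \\ 1 + \lambda\rho & \text{if } x = 0,\ y = 1, \\ 1 + \mu\rho & \text{if } x = 1,\ y = 0, \\ 1 - \rho & \text{if } x = y = 1, \\ 1 & \text{otherwise} \end{cases} \tag{2.6} \]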

Therefore the probability mass function (pmf) of the Dixon and Coles model can be
described in equation 2.7.

\[ \mathbb{P}(X_{i,j} = x, Y_{i,j} = y) = \tau_{\lambda,\mu}(x, y)\, \frac{\lambda^{x} \exp(-\lambda)}{x!}\, \frac{\mu^{y} \exp(-\mu)}{y!} \tag{2.7} \]

To ensure each probability is confined to the range [0, 1], the dependence parameter
is constrained by $\rho \in \left[\max\left(-\tfrac{1}{\lambda}, -\tfrac{1}{\mu}\right), \min\left(\tfrac{1}{\lambda\mu}, 1\right)\right]$.
The case ρ = 0 leads to the original independence model. The function τλ,µ (x, y) could
be effective in improving the accuracy of the model since it rescales the probabilities
of 0-0 and 1-1 in one direction and those of 1-0 and 0-1 in the other: ρ < 0 raises the
former pair and lowers the latter, while ρ > 0 does the reverse. The misestimation of
these four scores under independence can therefore be corrected.
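
As an illustration of equations 2.6 and 2.7, the following Python sketch evaluates the adjusted probability mass function and the admissible range for ρ. It is a sketch under stated assumptions rather than the code used in this dissertation (which is written in R): the function names and the values of `lam`, `mu` and `rho` are hypothetical.

```python
import math

def tau(x, y, lam, mu, rho):
    """Dependence adjustment of equation 2.6 for low-scoring results."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def poisson_pmf(k, rate):
    """Poisson probability of observing k events with the given rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def dixon_coles_pmf(x, y, lam, mu, rho):
    """Joint probability of the score x-y as in equation 2.7."""
    return tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)

def rho_bounds(lam, mu):
    """Interval for rho that keeps every probability within [0, 1]."""
    return max(-1.0 / lam, -1.0 / mu), min(1.0 / (lam * mu), 1.0)

# Hypothetical rates and dependence parameter, for illustration only.
lam, mu, rho = 1.5, 1.1, -0.1
lower, upper = rho_bounds(lam, mu)

# The four adjustments cancel, so the probabilities still sum to 1
# (up to truncation of the infinite sum at 30 goals per side).
total = sum(dixon_coles_pmf(x, y, lam, mu, rho)
            for x in range(30) for y in range(30))
print(round(total, 10), lower <= rho <= upper)
```

Setting `rho = 0.0` makes `tau` equal to 1 for every score, which recovers the product of two independent Poisson probabilities.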
An additional feature implemented by Dixon and Coles to further improve on Maher’s
independent Poisson model is non-static estimates of attack and defence
parameters for each team. Initially the model is static since the attack and defence
parameters are independent of time. Therefore, using maximum likelihood estimation
with the likelihood function in equation 2.8, together with equations 2.9 and 2.10, the
estimates of attack and defence rates for each team would not reflect recent form.

\[ L(\alpha_i, \beta_i, \rho, \gamma;\ i = 1, \dots, n) = \prod_{k=1}^{N} \tau_{\lambda_k,\mu_k}(x_k, y_k)\, \exp(-\lambda_k)\, \lambda_k^{x_k} \exp(-\mu_k)\, \mu_k^{y_k} \tag{2.8} \]

\[ \lambda_k = \alpha_{i(k)}\, \beta_{j(k)}\, \gamma \tag{2.9} \]

\[ \mu_k = \alpha_{j(k)}\, \beta_{i(k)} \tag{2.10} \]

(i(k) and j(k) denote the home team ’i’ and the away team ’j’ in match ’k’.)
Introducing a time-dependent weighting function ϕ(t) prevents the parameters from
remaining static. At each time point (i.e. after each match week), a pseudo-likelihood
function can be constructed as in equation 2.11.
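\[ L_t(\alpha_i, \beta_i, \rho, \gamma;\ i = 1, \dots, n) = \prod_{k \in A_t} \left[ \tau_{\lambda_k,\mu_k}(x_k, y_k)\, \exp(-\lambda_k)\, \lambda_k^{x_k} \exp(-\mu_k)\, \mu_k^{y_k} \right]^{\phi(t - t_k)} \tag{2.11} \]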
With tk the time (match week) of match k, At = {k : tk < t} is the set of all matches
played before time t. Thus ϕ(t − tk ) decays as the time-lag increases, to reflect how
recent performances could be a more accurate representation of the quality of a team.
Dixon and Coles set this weighting function to decay exponentially as the time-lag
increases, with the magnitude of this decay at each time-step dictated by the rate ξ > 0.

ϕ(t) = exp(−ξ t) (2.12)
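
Taking logarithms of equation 2.11 turns the weighted product into a weighted sum, which is the form usually passed to a numerical optimiser. The Python sketch below shows only that weighting step, under the assumption that the log of each match's bracketed term has already been evaluated; the names `phi` and `weighted_log_likelihood` and the numbers in `terms` are hypothetical.

```python
import math

def phi(lag, xi):
    """Exponential down-weighting of equation 2.12 for a given time-lag."""
    return math.exp(-xi * lag)

def weighted_log_likelihood(terms, t, xi):
    """Log of the pseudo-likelihood in equation 2.11.

    `terms` is a list of (t_k, log_term_k) pairs, where log_term_k is the
    log of the bracketed quantity for match k. Only matches with t_k < t
    contribute, matching the definition of the set A_t.
    """
    return sum(phi(t - t_k, xi) * log_term
               for t_k, log_term in terms
               if t_k < t)

# Hypothetical (match week, log-term) pairs, for illustration only.
terms = [(1, -2.0), (5, -3.0), (9, -2.5)]

static = weighted_log_likelihood(terms, t=10, xi=0.0)     # every weight is 1
weighted = weighted_log_likelihood(terms, t=10, xi=0.1)   # recent matches dominate
print(static, round(weighted, 4))
```

With `xi = 0.0` each match receives weight 1, so the result is the ordinary static log-likelihood.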

The constant ξ reduces the weighting of all previous results; however, the reduction
grows with the time elapsed since a match, thus providing greater relative weight to
recent results when computing the parameter estimates for each team’s ’attack’ and
’defence’ rate. Should ξ = 0, the model corresponds to the original static model. In
order to find the optimal value for the rate of decay ξ, the value of ξ which maximises
equation 2.13 should be found, as described by Dixon & Coles (1997).
\[ S(\xi) = \sum_{k=1}^{N} \left( \delta_k^{H} \log p_k^{H} + \delta_k^{D} \log p_k^{D} + \delta_k^{A} \log p_k^{A} \right) \tag{2.13} \]

Here $\delta_k^{H}$, $\delta_k^{D}$ and $\delta_k^{A}$ are indicator functions for match k,
with $\delta_k^{H}$ defined in equation 2.14 and the other two defined analogously, and
$p_k^{H}$, $p_k^{D}$ and $p_k^{A}$ are the probabilities of a home win, draw or away win
respectively.

\[ \delta_k^{H} = \begin{cases} 1 & \text{result: home win} \\ 0 & \text{result: draw or away win} \end{cases} \tag{2.14} \]

In the case of a home win, this probability is calculated as the sum of all estimated
probabilities such that the number of home team goals is greater than away team goals in
match k, or more formally equation 2.15.

\[ p_k^{H} = \sum_{\{(a,b)\,:\,a>b\}} \mathbb{P}(X_k = a, Y_k = b) \tag{2.15} \]
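
Equations 2.13 to 2.15 can be sketched as follows. This Python snippet is an assumed illustration only (the analysis in this dissertation uses R): it sums score probabilities into home win, draw and away win probabilities, then evaluates S for a list of forecasts. The function names, the truncation at 20 goals and the example rates are hypothetical, and independent Poisson scores (the ρ = 0 case) are used purely to keep the example short.

```python
import math

def poisson_pmf(k, rate):
    """Poisson probability of observing k events with the given rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def outcome_probabilities(score_prob, max_goals=20):
    """Sum P(X = a, Y = b) over a > b, a == b and a < b (equation 2.15).

    `score_prob(a, b)` returns the model probability of the score a-b.
    The infinite sums are truncated at `max_goals` per side.
    """
    home = draw = away = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = score_prob(a, b)
            if a > b:
                home += p
            elif a == b:
                draw += p
            else:
                away += p
    return home, draw, away

def predictive_log_score(forecasts, results):
    """Equation 2.13: sum of log-probabilities assigned to observed results.

    `forecasts` holds (p_home, p_draw, p_away) tuples and `results` holds
    the matching observed outcomes coded as "H", "D" or "A".
    """
    index = {"H": 0, "D": 1, "A": 2}
    return sum(math.log(p[index[r]]) for p, r in zip(forecasts, results))

# Hypothetical rates for one match, for illustration only.
lam, mu = 1.5, 1.1
p_home, p_draw, p_away = outcome_probabilities(
    lambda a, b: poisson_pmf(a, lam) * poisson_pmf(b, mu))

# Score of this single forecast if the match ended in a home win.
s_value = predictive_log_score([(p_home, p_draw, p_away)], ["H"])
print(round(p_home + p_draw + p_away, 10), round(s_value, 4))
```

In practice the forecasts for match k would be produced from parameters estimated with the pseudo-likelihood at time tk for a candidate ξ, and S would be compared across a grid of ξ values.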

To ensure unique parameter estimates can be obtained, a constraint on the mean
of the attack parameters is imposed: $\frac{1}{n}\sum_{i=1}^{n} \alpha_i = 1$. Therefore
the model is no longer over-parameterized and 185 of the total 186 parameters (attack and
defence parameters for each of the 92 teams across the four professional divisions in
English football, plus the dependence parameter and home effect parameter) are
identifiable.
One of the main applications of modelling football scores could be to gain a long-term
advantage over bookmakers, thus earning money through betting. As per Dixon &
Coles (1997), match result odds in the fractional format o1:o2 are transformed into
probabilities by p = o2 / (o1 + o2). Alternatively, bookmakers’ odds provided in a
decimal format can be converted to an odds-implied probability as 1 / (decimal odds).
Taking the sum of these implied probabilities for home win, draw and away win and
subtracting 1 leaves the bookmaker’s expected profit margin from each fixture on
match outcome betting. This approach is adopted for a full week of Premier League
fixtures, namely the second matchweek of the 2022/23 season commencing Saturday
13th August, to get an average ’take’ or expected gain by the bookmaker over these
10 matches. The mean sum of implied probabilities for these 10 fixtures is 1.04884
(s.d. = 0.01173), taken from one of the UK’s largest betting sites, Betfair (2022),
meaning the expected gross profit (without accounting for other business expenses) for
Betfair, and plausibly for other bookmakers, could be approximately 4.88% for Premier
League match outcome bets. Therefore, to gain a long-term advantage over betting
companies, an estimated probability must exceed the odds-implied probability by
approximately 5%, under the assumption that the Dixon and Coles approach and the
bookmakers’ approach are equally accurate. This assumption may be inadequate given
the resources available to such large-scale companies to model sporting outcomes.
Therefore increasing this imbalance between estimated probabilities and odds-implied
probabilities may increase the expected gain

for those using the Dixon and Coles approach. Under the assumption that probability
estimates from the Dixon and Coles model are as accurate as the bookmakers’ estimates,
the expected return after the bookmaker’s expected profit can be calculated as in
equation 2.16.

\[
\text{Expected Return}(\%) = \left( \frac{E}{I} - BP \right) \times 100 \tag{2.16}
\]

Where E denotes the estimated probability of an outcome using the Dixon and Coles
method, I denotes the odds-implied probability from a bookmaker and BP is the
bookmaker’s mean sum of implied probabilities (1.04884 in the example above). For
example, placing bets in which the estimated probability is 10% greater than the
odds-implied probability (E/I = 1.10) would give an expected return of approximately
5.12% for the bettor, whereas placing bets only when the estimated probability is 20%
greater than the odds-implied probability gives an expected return of approximately
15.12%. Even at the upper limit of a 95% confidence interval, whereby bookmakers
would expect a long-term return of approximately 7.23%, both betting strategies above
would result in a positive expected return for those placing the bets.
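As an illustration of the odds conversions and of equation 2.16, a minimal R sketch is given below. The function names and the three decimal odds are hypothetical and chosen only for demonstration; the value 1.04884 is the mean sum of implied probabilities reported above.

```r
# Fractional odds o1:o2 to implied probability, as per Dixon & Coles (1997).
fractional_to_prob <- function(o1, o2) o2 / (o1 + o2)

# Decimal odds to implied probability.
decimal_to_prob <- function(decimal_odds) 1 / decimal_odds

# Sum of implied probabilities for home win, draw and away win;
# subtracting 1 gives the bookmaker's expected margin for the fixture.
implied_sum <- function(home, draw, away) sum(decimal_to_prob(c(home, draw, away)))

# Equation 2.16: E = estimated probability, I = odds-implied probability,
# BP = bookmaker's (mean) sum of implied probabilities.
expected_return <- function(E, I, BP) (E / I - BP) * 100

implied_sum(2.0, 3.5, 3.8) - 1               # hypothetical odds, margin of about 4.9%
expected_return(E = 1.10, I = 1, BP = 1.04884)  # approximately 5.12
expected_return(E = 1.20, I = 1, BP = 1.04884)  # approximately 15.12
```

Only the ratio E/I matters in equation 2.16, which is why the last two calls set I = 1.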

A consequence of increasing the required difference between probability estimates and
odds-implied probabilities is that fewer bets will be placed over the length of a season.
Therefore the betting regime which provides the highest percentage return is also the
strictest in terms of the number of bets placed. As discussed at the end of section 2.3.5,
the Dixon & Coles (1997) model appears suitable for other elite European leagues. Hence
under a strict betting regime, such as a ratio which provides an expected return of
20-30%, the total number of bets placed could be maximised by betting on multiple
European leagues, including the ’big 5’ discussed in section 2.3.
In an experiment which tests whether a financial advantage over bookmakers could be
achieved using the Dixon and Coles modelling approach, several variables should
remain fixed. These include the bookmaker or odds provider, the stake amount, and
the difference between estimated and odds-implied probabilities being tested for a
chosen betting regime.
A future area of study could investigate how the betting stake could be weighted
according to the disparity between estimated and odds-implied probabilities to
maximise the bettor’s overall expected gain.

2.1.2 Bivariate Poisson Model

The Bivariate Poisson model suggested by Maher (1982), similarly to the Dixon &
Coles (1997) model, uses attack and defence parameters for each team. However, its
construction requires three independent Poisson distributions so that the correlation
between home and away goals in a given match can be accounted for in the model.
Using notation more consistent with the Dixon and Coles model, Maher’s Bivariate
Poisson model can be described with similarly defined features. First, three
independent Poisson distributions are constructed in equations 2.17, 2.18 and 2.19.
Then the home and away goals are re-defined in equations 2.20 and 2.21, as in section
2.1.1. Equation 2.22 gives the bivariate Poisson distribution of home and away goals
in a match with the ith home team playing the jth away team in the league.

Ui,j ∼ Poisson(µi,j − ηi,j ) (2.17)

Vi,j ∼ Poisson(λi,j − ηi,j ) (2.18)

Wi,j ∼ Poisson(ηi,j ) (2.19)

Xi,j = Vi,j + Wi,j ∼ Poisson(λi,j ) (2.20)

Yi,j = Ui,j + Wi,j ∼ Poisson(µi,j ) (2.21)

(Xi,j , Yi,j ) ∼ BP (λi,j , µi,j , ηi,j ) (2.22)

Where

• Xi,j is the number of home goals scored by team ’i’ against team ’j’, for i, j =
1, ...n with n teams in the league.

• Yi,j is the number of away goals scored by team ’j’ against team ’i’.

• αi , βi are the attack and defence parameters for team ’i’ respectively.

• log(λi,j ) = αi + βj + γ, log(µi,j ) = αj + βi

• γ represents the ’home effect’.

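• \(\eta_{i,j} = \mathrm{cov}(X_{i,j}, Y_{i,j}) = \rho \sqrt{\lambda_{i,j}\,\mu_{i,j}}\).

• \(\rho \in \left[ 0, \min\left( \sqrt{\lambda_{i,j} / \mu_{i,j}}, \sqrt{\mu_{i,j} / \lambda_{i,j}} \right) \right]\)
is the correlation (Holgate 1964) between home and away goals to be estimated,
where λi,j and µi,j are the means of the marginal distributions.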

Therefore the joint probability mass function can be defined as equation 2.23.

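\[
\mathbb{P}(X_{i,j} = x_{i,j}, Y_{i,j} = y_{i,j}) = \exp(\eta_{i,j} - \lambda_{i,j} - \mu_{i,j})
\, \frac{(\lambda_{i,j} - \eta_{i,j})^{x_{i,j}}}{x_{i,j}!}
\, \frac{(\mu_{i,j} - \eta_{i,j})^{y_{i,j}}}{y_{i,j}!}
\sum_{w_{i,j}=0}^{\min(x_{i,j},\, y_{i,j})}
\binom{x_{i,j}}{w_{i,j}} \binom{y_{i,j}}{w_{i,j}} \, w_{i,j}!
\left( \frac{\eta_{i,j}}{(\lambda_{i,j} - \eta_{i,j})(\mu_{i,j} - \eta_{i,j})} \right)^{w_{i,j}} \tag{2.23}
\]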

The expected values are: IE[Xi,j ] = exp(αi + βj + γ) and IE[Yi,j ] = exp(αj + βi ).


To obtain maximum likelihood estimates for αi , βi , ρ, γ, the likelihood function of these
parameters is required. This likelihood function can be written as equation 2.24.
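\[
L(\alpha, \beta, \rho, \gamma \mid x, y) = \prod_{i=1}^{n} \prod_{j \neq i}
\exp(\eta_{i,j} - \lambda_{i,j} - \mu_{i,j})
\, \frac{(\lambda_{i,j} - \eta_{i,j})^{x_{i,j}}}{x_{i,j}!}
\, \frac{(\mu_{i,j} - \eta_{i,j})^{y_{i,j}}}{y_{i,j}!}
\sum_{w_{i,j}=0}^{\min(x_{i,j},\, y_{i,j})}
\binom{x_{i,j}}{w_{i,j}} \binom{y_{i,j}}{w_{i,j}} \, w_{i,j}!
\left( \frac{\eta_{i,j}}{(\lambda_{i,j} - \eta_{i,j})(\mu_{i,j} - \eta_{i,j})} \right)^{w_{i,j}} \tag{2.24}
\]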

Full equations for the pmf and likelihood function of the bivariate Poisson in terms
of attack and defence parameters can be found in equations A.4 and A.5.
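To make the structure of the bivariate Poisson pmf and its log-likelihood concrete, a minimal R sketch follows. It is written directly from equations 2.23 and 2.24 and is not code used in this project; the function names and the parameter values in the usage lines are hypothetical.

```r
# Joint pmf of equation 2.23 for a single scoreline (x, y), given the marginal
# means lambda and mu and the covariance eta (0 <= eta <= min(lambda, mu)).
dbivpois <- function(x, y, lambda, mu, eta) {
  w <- 0:min(x, y)
  inner <- choose(x, w) * choose(y, w) * factorial(w) *
    (eta / ((lambda - eta) * (mu - eta)))^w
  exp(eta - lambda - mu) *
    (lambda - eta)^x / factorial(x) *
    (mu - eta)^y / factorial(y) *
    sum(inner)
}

# Log of the likelihood in equation 2.24 for vectors of observed scores and
# the corresponding match-level lambda, mu and eta.
loglik_bivpois <- function(x, y, lambda, mu, eta) {
  sum(mapply(function(xk, yk, l, m, e) log(dbivpois(xk, yk, l, m, e)),
             x, y, lambda, mu, eta))
}

# Hypothetical usage: probability of a 2-1 home win
dbivpois(2, 1, lambda = 1.5, mu = 1.1, eta = 0.2)
# With eta = 0 the pmf reduces to the product of two independent Poissons
dbivpois(2, 1, lambda = 1.5, mu = 1.1, eta = 0)
```

Working with the log-likelihood rather than the likelihood avoids numerical underflow when the product runs over several hundred matches.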
Karlis & Ntzoufras (2003) account for the underestimated probability of equal numbers
of home and away goals (draws) in a Bivariate Poisson distribution, an effect which is
amplified by the increase in covariance between home and away goals. Their diagonal
inflated model can be described in equation 2.25 using similar notation to above.

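\[
P_D(x, y) =
\begin{cases}
(1 - p)\,\mathrm{BP}(x, y \mid \lambda, \mu, \eta), & x \neq y \\
(1 - p)\,\mathrm{BP}(x, y \mid \lambda, \mu, \eta) + p\,D(x, \theta), & x = y
\end{cases} \tag{2.25}
\]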

Where D(x, θ) is a discrete distribution to be specified, with corresponding parameter
vector θ, and p can be taken as the estimated probability of a draw, i.e. the case x = y.
Due to the time constraints of this project, attaining forecasts using a bivariate Poisson
approach to contrast with the Dixon and Coles approach was not possible.

2.1.3 Other Approaches

Forecasting football scores is not limited to these models described by Maher (1982),
Dixon & Coles (1997), and Karlis & Ntzoufras (2003). There are a multitude of
alternative models which raise further questions of whether improvements to modelling

football scores can be achieved.


Owen (2011) develops the approach of using GLMs to forecast football scores, based on
estimates quantifying the quality of each team, by introducing dynamic parameters. By
design a dynamic GLM requires Bayesian estimation (West et al. 1985). The Bayesian
parameter estimates may vary in time, which may be more appropriate than static
estimates since the performance of football teams is unlikely to remain constant
week-by-week. The dynamic estimates for attack and defence parameters αi,t , βi,t at
time ’t’ are specified and estimated by Markov chain Monte Carlo (MCMC) sampling
using a random walk Metropolis algorithm, with mean located at the estimate from the
previous time step αi, t−1 , βi, t−1 and constant, common (to all teams) ’evolution
variance’ σ 2 . Thus the proposal distributions are:

αi,t ∼ N(αi, t−1 , σ 2 ) (2.26)

βi,t ∼ N(βi, t−1 , σ 2 ) (2.27)
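The role of the ’evolution variance’ in equations 2.26 and 2.27 can be visualised with a short R simulation. This is only a sketch of the random walk itself, not of the MCMC estimation in Owen (2011); the function name, starting value, σ values and number of time steps are all hypothetical.

```r
# Simulate a team-strength path in which each value is drawn from a normal
# distribution centred on the previous value with standard deviation sigma.
simulate_strength <- function(start, sigma, n_steps) {
  increments <- rnorm(n_steps, mean = 0, sd = sigma)
  start + c(0, cumsum(increments))
}

set.seed(1)
# Hypothetical attack parameter over 37 further match weeks
steady_team   <- simulate_strength(start = 1.0, sigma = 0.01, n_steps = 37)
volatile_team <- simulate_strength(start = 1.0, sigma = 0.10, n_steps = 37)
range(steady_team)
range(volatile_team)
```

A larger σ allows the parameter to drift further from week to week, which is the behaviour a single common variance cannot distinguish between teams.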

However the assumption of constant and common ’evolution variance’ for all teams
may not be valid. Throughout a season a team may have substantial fluctuations in
form, which may be irregular. For example at the beginning of a season a team that
has signed many new players may have greater fluctuations in form than later in the
season when the players have become more familiar with their teammates, and there-
fore may achieve more consistent form. Likewise the assumption of common ’evolution
variance’ for all teams may lead to reduced model accuracy, since some teams may be
more consistent than others, e.g. teams at the top/bottom may be more consistently
good/poor than teams in the middle of the division who retain neither consistently
good nor poor form. Although the decision to impose these assumptions is
understandable as a way to minimise the number of parameters in the model, clustering
together teams which may have similar ’evolution variances’ could benefit the accuracy
of the model without adding a substantial number of parameters to be estimated.
Boshnakov et al. (2017) challenge the assumption made in the independent Poisson
and Bivariate Poisson models of a time-homogeneous Poisson process describing the
pattern of goals scored, instead modelling the inter-arrival times between goals using
a Weibull renewal process. A Weibull renewal process does not assume the exponential
inter-arrival distribution which the Poisson process does, therefore use of a Weibull

renewal process may provide greater flexibility in the model, since the Weibull hazard
function can account for some of the stochastic nature in the time-intervals between
goals. A possible extension to this model could be introducing a function which ac-
counts for dependency on the game-state, i.e. a narrow lead, drawing, little time
remaining etc. A model which included dependency on the game state could further
model the intensities of attacking and defensive sequences in football which may vary
depending on whether a team has scored, or if there is little time left in the game. For
example, in a cup final with only a few minutes until full-time, a team trailing by 1
goal may be more likely to attack the opposition intensively, due to the relative
importance of scoring one goal and the irrelevance of conceding a further goal.
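The flexibility of the Weibull hazard function mentioned above can be seen in a small R sketch, which computes the hazard as the density divided by the survival function. It illustrates only the hazard, not the count model of Boshnakov et al. (2017); the function name and the shape and scale values are hypothetical.

```r
# Hazard of a Weibull inter-arrival time: density divided by survival function.
weibull_hazard <- function(t, shape, scale) {
  dweibull(t, shape = shape, scale = scale) /
    pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)
}

minutes <- c(10, 45, 80)
# shape = 1 is the exponential case implied by a Poisson process: constant hazard
weibull_hazard(minutes, shape = 1, scale = 30)
# shape > 1 gives a hazard that increases with the time since the last goal
weibull_hazard(minutes, shape = 1.3, scale = 30)
```

With shape equal to 1 the hazard is constant at 1/scale, so the Poisson assumption is recovered as a special case.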
van Der Wurp et al. (2020) take an alternative approach to modelling the dependence
between home and away goals by extending the ’bivariate copulas for discrete count
data’ of Trivedi & Zimmer (2017) with a LASSO-type penalty structure. Here the
log-likelihood function obtained via copulas is penalised using the least absolute
shrinkage and selection operator (LASSO) and made adaptive through weights. The
weights dictate the strength of shrinkage in the LASSO penalisation and are adaptive
in the sense that they are based on the inverse of the maximum likelihood estimates of
the regression coefficients.

2.2 Data and Processing


To model football scores for recent seasons using the Dixon and Coles model, full-
time result data was gathered for at least the past five seasons for each of the ’big
five’ European leagues: Premier League (England), Bundesliga (Germany), La Liga
(Spain), Serie A (Italy), and Ligue 1 (France) from publicly available data sets online:
(Fixture Download 2022), (Kaggle 2018), (Kaggle 2022), (Kaggle 2020), (La Liga
2022), and (Data Hub 2022). This enabled access to approximately 9,000 match
results. Much of the data was taken from a variety of sources and compiled into five
data frames (one for each league) which followed the principles of tidy data. These
principles can mostly be summarised by 3 rules described by Wickham & Grolemund
(2017) as:

1. Each variable is a column.



2. Each observation is a row.

3. Each value must have its own cell.

To follow these principles, match results of the format ”x-y” in a cell were split into
two columns, home goals and away goals, with the data extracted from its string format
using the ’substr’ function in R (R Core Team 2020). To avoid issues when combining
the data frames for each season, the column names were changed where necessary to
ensure they were uniform for each season and league. Team names were checked for
spelling errors and abbreviations used by different sources, and renamed using the
’mutate’ function in R (R Core Team 2020) to ensure each team could be identified
using only one unique name. The result of this data processing is 5 data frames (one
for each European league), following the tidy principles, which contain information on
all matches to have taken place in the seasons 2017/18 to 2021/22. Each data frame
consists of 7 columns: (”Season”, ”Matchweek”, ”H.Team”, ”A.Team”, ”H.goals”,
”A.goals”, ”Result”). For example the first row of the data frame for the Premier
League is:

Season    Matchweek   H.Team    A.Team           H.goals   A.goals   Result
2017/18   1           Arsenal   Leicester City   4         3         H

That is, a home win for Arsenal against Leicester City by 4 goals to 3 in the first week
of fixtures of the 2017/18 season.
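The splitting step described above can be sketched in base R as follows. This is an illustration rather than the code used for the project: it uses ’strsplit’ instead of ’substr’ so that double-digit scores are handled, and the input column name ”Score” and the second row are hypothetical.

```r
# Hypothetical raw data with the score held as a single "x-y" string
raw <- data.frame(
  H.Team = c("Arsenal", "Team A"),
  A.Team = c("Leicester City", "Team B"),
  Score  = c("4-3", "10-1"),
  stringsAsFactors = FALSE
)

# Split each "x-y" string on the hyphen and convert the parts to integers
parts <- strsplit(raw$Score, "-", fixed = TRUE)
raw$H.goals <- as.integer(sapply(parts, `[`, 1))
raw$A.goals <- as.integer(sapply(parts, `[`, 2))

# Derive the result column: H = home win, D = draw, A = away win
raw$Result <- ifelse(raw$H.goals > raw$A.goals, "H",
                     ifelse(raw$H.goals < raw$A.goals, "A", "D"))
raw[, c("H.Team", "A.Team", "H.goals", "A.goals", "Result")]
```

Each variable then occupies its own column and each match its own row, in line with the tidy data rules listed earlier.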
The Dixon & Coles (1997) model is applied to scores data from the most recent com-
pleted season 2021/22 for five of the most elite leagues in Europe. The R function
’dixoncoles’ available in the ’regista’ package (Torvaney 2022) has arguments: home
goals, away goals, home team, away team, data. The parameters output of this func-
tion consists of the natural logarithm of each of the n attack and n defence parameters,
the natural logarithm of the home advantage parameter γ and an estimate for the de-
pendence parameter ρ, totalling 2(n + 1) parameter estimates for n total number of
teams in each league.

2.3 Results and Discussion


As suggested in section 2.1.1, a more rigorous method to test the assumption of in-
dependence between the number of goals scored by home and away teams is the chi-
squared test.

x <- fullPL17_21$H.goals
y <- fullPL17_21$A.goals
chisq.test(x,y)

Performing this chi-squared test in R results in a p-value of p = 0.1759, which is
non-significant at a 10% significance level, with test statistic χ2 = 83.034 on 72
degrees of freedom. This suggests there is insufficient evidence to reject the assumption
of independence at a 10% significance level. However, this may be caused by the
considerable number of result combinations for which there were no data, e.g. 9-9.
Figure 2.1 demonstrates how few matches in the seasons 2017/18 - 2021/22 finished with
either the home or away team scoring more than four goals. Repeating this test for only
those matches where neither team scored more than 4 goals, which accounted for 1812 of
the total 1900 games (95.4%), there is strong evidence that the assumption of
independence is not valid at a 1% significance level, since the p-value is approximately
p = 0.0002:

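# Assumed definition of leq4 (not shown in the original): matches in which
# neither team scored more than 4 goals
leq4 <- subset(fullPL17_21, as.numeric(H.goals) <= 4 & as.numeric(A.goals) <= 4)
lteq4x <- as.numeric(leq4$H.goals)
lteq4y <- as.numeric(leq4$A.goals)
chisq.test(lteq4x, lteq4y)
# Pearson's Chi-squared test
# data: lteq4x and lteq4y
# X-squared = 43.954, df = 16, p-value = 0.0002005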

2.3.1 Premier League 2021/22

The Premier League consists of 20 teams, each playing every other team once at home
(19 home matches) and once away from home (19 away matches) in each season. Thus
there are a total of 380 matches per season. The estimate for ’home effect’ is γ = 1.159.
This is the estimate of the ratio of home to away goals, i.e. the number of goals scored

Team             α       β       (α − β)
Arsenal          1.138   1.178   -0.040
Aston Villa      0.975   1.314   -0.339
Brentford        0.902   1.358   -0.456
Brighton         0.780   1.060   -0.280
Burnley          0.637   1.268   -0.631
Chelsea          1.398   0.821    0.576
Crystal Palace   0.931   1.117   -0.187
Everton          0.816   1.593   -0.777
Leeds            0.807   1.906   -1.099
Leicester        1.170   1.451   -0.281
Liverpool        1.717   0.658    1.059
Man City         1.809   0.662    1.148
Man United       1.072   1.394   -0.322
Newcastle        0.831   1.497   -0.666
Norwich          0.444   1.989   -1.546
Southampton      0.817   1.617   -0.800
Tottenham        1.277   0.989    0.288
Watford          0.652   1.843   -1.191
West Ham         1.123   1.251   -0.128
Wolves           0.704   1.032   -0.327

Table 2.1: Table of attack and defence parameter estimates for Premier League clubs
2021/22.

by the home team is approximately 1.159 times that scored by the away team. The
estimate for the dependence parameter is ρ = −0.00372.
Since these parameters dictate the expected number of goals a team may score and
concede in a given match, the combination of a home team with a high attack parameter
and low defence parameter and an away team with a high defence parameter and low
attack parameter is likely to result in a large victory for the home team. For example,
using the parameter estimates in table 2.1, consider Man City (α = 1.809, β = 0.662)
at home to Norwich (α = 0.444, β = 1.989). By inspecting the pmf for all combinations
of home and away scores up to 10 goals for each team, the outcome with highest estimated
probability is a 5-0 win to Manchester City with probability 14.5%. The expected
scoreline, assuming the function τ in equation 2.6 is equal to 1, is Man City 4 - 0
Norwich (rounded from 4.170 - 0.294), whereas the actual score in this fixture on 21st
August 2021 was Man City 5 - 0 Norwich (Premier League 2021). Although this example
is only one fixture, where the model has been trained using data including this

observed result, and football matches often do not finish as may be expected prior to
kick-off, it does emphasise the model’s ability to vary its forecasts depending on the
attacking and defensive qualities of each team. This is evident from the considerable
contrast between the model predicting the most likely result to be 5-0 and the density
of Premier League results in figure 2.1, in which so few fixtures finish 5-0. However,
this example should not be used in isolation to draw conclusions about the accuracy of
the model.
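The expected scoreline quoted for this fixture can be reproduced with a few lines of R. The sketch assumes the estimates in table 2.1 are on their natural scale, so that the home rate is the product of the home attack parameter, the away defence parameter and γ; this form is inferred because it reproduces the figures 4.170 and 0.294. As in the text, τ is taken to be 1, and the variable names are hypothetical.

```r
gamma   <- 1.159                          # home effect estimate
city    <- c(alpha = 1.809, beta = 0.662) # table 2.1
norwich <- c(alpha = 0.444, beta = 1.989) # table 2.1

# Expected goals for the home team (Man City) and away team (Norwich)
lambda <- city[["alpha"]] * norwich[["beta"]] * gamma  # about 4.170
mu     <- norwich[["alpha"]] * city[["beta"]]          # about 0.294

# pmf for all scorelines up to 10 goals each (independence assumed, tau = 1);
# rows are home goals, columns are away goals
grid <- outer(dpois(0:10, lambda), dpois(0:10, mu))
dimnames(grid) <- list(home = 0:10, away = 0:10)

round(c(lambda = lambda, mu = mu), 3)
grid["5", "0"]  # estimated probability of the observed 5-0 scoreline
```

The same grid can be summed over the cells with more home than away goals to give the probability of a home win.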
Table 2.1 lists higher values of α and lower values of β for teams which have typically
performed more strongly in recent seasons, whereas teams generally considered to be
weaker appear to have a lower attack parameter α and a higher defence parameter β.
Hence the difference between α and β could be an effective way of gauging the overall
quality of a team in a single value. This difference follows goal difference more closely
than league standings, as some teams may win more frequently but by a smaller margin
of goals than a team which only occasionally wins but by a large margin, so it may not
exactly describe the quality of a team in terms of winning games.
Estimates for the probability of an outcome can be made by training the model with
the previous season’s results to forecast the first set of matches in match week 1, then
updating the data and remodelling each week as the season develops. Forecasting the
outcomes in the early match weeks of a season poses a considerable challenge due to
the many factors which may result in a team playing differently to the previous
season. Factors which may have the greatest influence on performance differences
include new managers, transferred players, lack of match fitness, the players not having
played together recently, and many other factors which could have minor effects, such
as weather differences. A further limitation to forecasting at the beginning of a season
is the requirement to have the scores data from all 4 divisions of professional English
football, and perhaps cup competitions, to ensure that newly promoted teams, such as
Fulham, Bournemouth and Nottingham Forest for the 2022/23 Premier League season,
can be included in the forecasts.
The expected number of goals, and therefore match score, can be obtained by momen-
tarily disregarding the function τλ,µ (x, y) in equation 2.6. Expected result is applied
to fixtures in match week 1 of the 2022/23 season as seen in table 2.2. Similarly, the
’most probable’ result can be obtained by inspecting the pmf of possible scorelines and

Fixture (Week 1 2022/23)          Expected   Most Probable (est. IP%)   Score
Crystal Palace vs Arsenal         1-1        1 - 1 (12.8%)              0-2
Fulham vs Liverpool               -          -                          2-2
Bournemouth vs Aston Villa        -          -                          2-0
Newcastle vs Nottingham Forest    -          -                          2-0
Tottenham vs Southampton          2-1        2 - 0 (11.7%)              4-1
Leeds vs Wolves                   1-1        0 - 1 (13.3%)              2-1
Everton vs Chelsea                1-2        0 - 2 (12.3%)              0-1
Leicester vs Brentford            2-1        1 - 1 (10.4%)              2-2
Man United vs Brighton            1-1        1 - 1 (13.0%)              1-2
West Ham vs Man City              1-2        0 - 2 (11.3%)              0-2

Table 2.2: Table of forecast scores & observed scores for Premier League matchweek
1 2022/23.

extracting the result with highest probability. All fixtures and results information is
taken from the Premier League (2022b) website.
As can be seen in table 2.2, only one of the ’most probable’ scores correctly predicted
the match score. This may be unsurprising given the estimated probability of each
scoreline is between 10.4% and 13.3%, thus 1 of the 7 most probable estimates being
accurate is a success rate of approximately 14.3%, although this is too small a sample
to draw conclusions about the accuracy of the model.
Measuring the accuracy of probability estimates for outcomes using the Dixon and
Coles model pmf can be achieved using a Brier score. A Brier score is a method of
measuring prediction probabilities of mutually exclusive and exhaustive events, such
as in a football match where the outcomes are home win, draw, or away win. This is
described by Brier (1950) as equation 2.28
\[
P = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{R} (f_{ij} - E_{ij})^2 \tag{2.28}
\]

Where P is the Brier score, N is the number of matches for which prediction
probabilities are being measured, R is the number of classes (in this case 3: home
win, draw and away win), fij is the probability estimate for the jth class in match i,
and Eij = 1 or 0 denotes whether the event happened (= 1) or did not happen (= 0).
A Brier score for R = 3 has a scale of [0, 2], where 0 corresponds to perfect accuracy
in the model and 2 represents perfect inaccuracy.
To assess the effectiveness of the Dixon and Coles model for the match outcome fore-
casts for the Premier League 2021/22 season, the Brier score is computed for matches

in the first 3 match weeks (27 matches, since newly promoted teams could not be
estimated in the first week) and then the first 3 match weeks of the second half of the
season (match weeks 20, 21 and 22, giving 28 games).
For the estimates of the first 3 match weeks, the Dixon and Coles model was trained
using results data from the 2020/21 season together with each previous match week as
the 2021/22 season progressed. The model for the latter estimates was trained using the
190 games already played during the 2021/22 season and the games from each previous
match week, i.e. for match week 21 the model was trained with data from the first 20
match weeks. This provides insight into how effectively the Dixon and Coles model
performs overall and whether there may be indication of a substantial difference in
accuracy at the beginning of the season in comparison to near the middle of the season.
The Brier score for the 27 matches at the beginning of the season is P = 0.35027,
whereas for the 28 matches at the beginning of the second half of the season it is
P = 0.34684. A lower score indicates that the model is better at predicting the
probability of outcomes. Since the Brier score is nearer to 0 than 2 for both periods of
matches, the Dixon and Coles model appears to predict the probability of outcomes
with reasonable accuracy. The Brier score is slightly larger for the probability estimates
at the start of the season. However, this difference is small and may not indicate a
significant difference in forecast accuracy.
The Brier skill score for multiple categories can take values in the range (−∞, 1] and
is defined by Glen (2022) as equation 2.29.

\[
BSS = 1 - \frac{P}{P_{\mathrm{ref}}} \tag{2.29}
\]

With Pref denoting the Brier score for a reference forecast. In this case the reference
forecast could be an equal probability of 1/3 for each outcome (home win, draw and
away win). The Brier skill scores for the periods at the beginning and near the middle
of the 2021/22 Premier League season are BSS1−3 = 0.47459 and BSS20−22 = 0.47974
respectively. Since both of these values are greater than 0, it suggests the Dixon and
Coles model forecasts perform better than the reference forecast.
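Equations 2.28 and 2.29 can be sketched in R as below. The function names, the three forecasts and the three observed results are hypothetical and serve only to show the calculation; the final line uses the Brier score reported above for match weeks 1-3.

```r
# Equation 2.28: f and E are N x R matrices of forecast probabilities and
# 0/1 indicators of the observed outcome.
brier_score <- function(f, E) sum((f - E)^2) / nrow(f)

# Equation 2.29: skill relative to a reference forecast.
brier_skill <- function(P, P_ref) 1 - P / P_ref

outcomes <- c("H", "D", "A")
observed <- c("H", "A", "D")                # hypothetical results
E <- outer(observed, outcomes, "==") * 1    # one row per match, one 1 per row
f <- rbind(c(0.60, 0.25, 0.15),             # hypothetical forecasts
           c(0.30, 0.30, 0.40),
           c(0.40, 0.30, 0.30))
f_ref <- matrix(1 / 3, nrow = 3, ncol = 3)  # equal-probability reference

P     <- brier_score(f, E)
P_ref <- brier_score(f_ref, E)  # equals 2/3 whatever the observed results
brier_skill(P, P_ref)

# Using the Brier score reported for match weeks 1-3
brier_skill(0.35027, 2 / 3)     # approximately 0.4746
```

Because the equal-probability reference always scores 2/3, the skill score here is simply a rescaling of the Brier score.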
Another method of assessing model performance in the beginning of the season when
compared to match weeks 20, 21, and 22 is using the surprisal. This can be defined in

the case of forecasting an exact score for match k by equation 2.30.


\[
\text{surprisal} = -\sum_{k} \log\left( \mathbb{P}(X_k = x_k, Y_k = y_k) \right) \tag{2.30}
\]

The surprisal takes values in the range [0, +∞), with values closer to zero showing
less surprisal and therefore probability estimates closer to the occurrence of the
observed result. For example, the surprisal for the Man City vs Norwich match discussed
earlier in section 2.3.1 is 2.111, whereas the Leicester City vs Liverpool game, which
ended 1-0 and was the only Premier League game during the 2021/22 season where
Liverpool failed to score (Premier League 2022a), had a surprisal of 4.697. The
surprisal for the 27 games at the beginning of the 2021/22 season was 100.19, whereas
for the 28 games at the start of the second half of the 2021/22 Premier League season
it was 82.121. Since the surprisal is considerably lower for the second half of the season,
despite including one additional observation, it may suggest the model is more accurate
at forecasting exact scores of matches nearer the middle of the season in match weeks
20-22 than at the start of the season in match weeks 1-3.
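The surprisal in equation 2.30 can be sketched in R as follows. The function name is hypothetical, and the sketch assumes independent Poisson scores with τ = 1 and rates formed as the product of the table 2.1 estimates, as in the expected scoreline calculation for Man City vs Norwich.

```r
# Equation 2.30: sum of negative log-probabilities of the observed scores,
# given vectors of home rates lambda and away rates mu (tau taken as 1).
surprisal <- function(x, y, lambda, mu) {
  -sum(dpois(x, lambda, log = TRUE) + dpois(y, mu, log = TRUE))
}

# Man City 5 - 0 Norwich, using the estimates from table 2.1
lambda <- 1.809 * 1.989 * 1.159  # Man City attack x Norwich defence x home effect
mu     <- 0.444 * 0.662          # Norwich attack x Man City defence
surprisal(5, 0, lambda, mu)      # approximately 2.11
```

Passing vectors of scores and rates returns the total over several matches, which is how the figures for the two blocks of match weeks are compared.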

2.3.2 Bundesliga 2021/22

The Bundesliga has 18 teams, each playing 17 home and 17 away matches per
season, giving a total of 306 matches played per season. The estimate for ’home effect’
is γ = 1.304, suggesting a greater ratio of home to away goals than in the Premier
League. This is supported by the ratio of home (539) to away (415) goals across all
Bundesliga matches, 1.299, which is greater than the ratio of home to away goals in
the Premier League (1.159). The dependence parameter for the 2021/22 Bundesliga
season is estimated to be ρ = −0.148. Based on these parameter estimates it could be
suggested that Bayern Munich, the league winners by an 8-point margin (Sky Sports
2022a), are one of the strongest teams overall, since they have the largest attack
parameter α = 1.809 and the second smallest defence parameter β = 0.981. At the
other end of the table were Greuther Fürth, who finished 10 points below the next team
above them (Sky Sports 2022a) and can be easily identified as weak with α = 0.566 and
β = 2.059, the second lowest attack parameter and the highest defence parameter.

2.3.3 La Liga 2021/22

Similarly to the Premier League, La Liga has 20 teams and a total of 380 matches per
season. Fitting the Dixon and Coles model to the data collected for the 2021/22 La Liga
season gave γ = 1.318 as the estimated ’home effect’ and ρ = −0.100 as the dependence
parameter estimate. The home effect estimate is again greater than 1, thus the home
effect is advantageous for the home team. Similarly to the estimates for the other
leagues, the attack and defence parameter estimates appear to closely reflect the
abilities of each of the teams. Real Madrid and Barcelona have the greatest positive
difference between attack and defence values, with α − β = 0.926 and 0.527 respectively.
This may be expected as Real Madrid and Barcelona finished 1st and 2nd in La Liga
with the 1st and 2nd highest goal difference (Sky Sports 2022b) and are historically
the 1st and 2nd most successful clubs in Spanish football when measured by major
honours (Carnicero 2022).

2.3.4 Serie A

Following the same format as both the Premier League and La Liga, 20 teams compete
in the Italian Serie A, with each season comprising 380 matches. The estimated
’home effect’ is γ = 1.103, thus greater than 1, suggesting a home advantage in the
Italian top division. The dependence parameter is estimated to be ρ = −0.0547.
Intuitively, since α and β are designed to model the numbers of goals scored and
conceded by each team in a given match, the difference between α and β closely
follows the total goal difference (total goals scored - total goals conceded) for each
team. This can be demonstrated by the three teams with the greatest goal difference,
Inter Milan (+52), Napoli (+43), and AC Milan (+38) (Sky Sports 2022d), which follow
the same ranking when compared by their respective values of (α − β) = 0.693, 0.542,
0.463. However, using α − β may not be the best way to predict which team is best, since
AC Milan won the 2021/22 season despite having a smaller positive difference between
attack and defence parameter estimates than both Inter Milan and Napoli.

2.3.5 Ligue 1 2021/22

Each season of Ligue 1 is made up of a total of 380 matches played by the 20 teams,
and follows the same format as the Premier League, La Liga and Serie A. The ’home
effect’ parameter estimate is γ = 1.289 which, similarly to the other four leagues,
suggests there is a greater number of goals scored by the home team than the away
team. The maximum likelihood estimate for the dependence parameter is ρ = −0.047.
Unsurprisingly, Paris Saint-Germain (PSG) appear to be the strongest team in terms
of attack with α = 1.667, which reflects their total tally of 90 goals for the season
(Sky Sports 2022c), more than any other club. Similarly, the worst club defensively
was Bordeaux, who conceded 91 goals throughout the season (Sky Sports 2022c); their
defence parameter β = 2.099 was the largest in the league and at least 0.35
greater than that of any other team.
Overall the Dixon & Coles (1997) model appears to demonstrate that it can identify
the stronger and weaker performing teams across the big 5 European leagues, not only
English football as discussed in their paper. This leads to the potential opportunity
to develop a betting strategy for these other European leagues.

2.3.6 Betting Strategy

The betting strategy discussed at the end of section 2.1.1 could be implemented for
any of the major five European leagues, however only the Premier League is considered
here. The probability of outcomes which have available betting markets can be
estimated, with the disparity between these estimates and the implied probability from
bookmakers’ odds used to evaluate an expected gain.
One such example is the only match outcome in the first week of Premier League
2022/23 fixtures which had a relative difference greater than 30% between estimated
probability and odds-implied probability. The Dixon and Coles model estimated the
probability of a Brighton win away from home against Manchester United to be 0.309,
whereas the bookmaker Betfair gave decimal odds of 5 (4/1 in fractional format)
approximately 24 hours prior to kick-off (Betfair 2022) for that outcome, which gives
an odds-implied probability of 0.2. Therefore, the ratio of estimated to implied
probability is 0.309 / 0.2 = 1.545, which with a betting strategy set to accept all bets with

a ratio greater than 1.3 would result in the bet being placed. This match resulted in
a 2 - 1 Brighton win (Premier League 2022b), and would return £5 from a £1 stake.
Although this is only a sample of one game, and with an estimated probability of 0.309
it is an unlikely event for which only around 1 in 3 bets placed would be expected to win, it
highlights the potential of the model to identify disparity between the probability
of an outcome and bookmakers' odds.
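The selection rule described above can be sketched in R as follows. This is a minimal illustration rather than the code used for the analysis: the function name `bet_ratio` is an assumption introduced here, while the probability, odds and threshold of 1.3 are taken from the text.

```r
# Hypothetical helper: ratio of model probability to odds-implied probability.
bet_ratio <- function(model_prob, decimal_odds) {
  implied_prob <- 1 / decimal_odds   # decimal odds of 5 imply probability 0.2
  model_prob / implied_prob
}

# Brighton away win at Manchester United, figures from the text.
ratio <- bet_ratio(model_prob = 0.309, decimal_odds = 5)
place_bet <- ratio > 1.3             # threshold used in the text

# Expected return per 1 unit staked under the model probability (stake included).
expected_return <- 0.309 * 5
```

Note that with decimal odds the ratio and the expected return per unit staked coincide, so a ratio threshold can equally be read as an expected return threshold.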
An area for future study could investigate over the length of a full season, or longer,
whether positive financial return can be achieved under betting strategies with differ-
ent expected return thresholds.
Figure 2.1: Distribution of Premier League results for seasons 2017/18 - 2021/22 (heatmap of full-time result frequency; axes: number of home goals, number of away goals).
Chapter 3

Expected Goals (xG)

3.1 Methods and Modelling


Typically, Expected Goals are used to assess the performance of attackers or players
who frequently take shots. However a simple adaptation to xG can provide insight into
the performance of goalkeepers and potentially defenders. This adaptation involves
evaluating the number of shots and their respective xG faced by a goalkeeper in a
match and calculating Expected Goals Against (xGA). Similarly to how xG provides
a metric to gauge the shooting ability of the attacking team, xGA can indicate how
well a goalkeeper is performing. Depending on which covariates are included in the
logistic regression model, the effectiveness of defensive pressure in impacting attackers’
shot conversion may also be measurable. Over a long period of games, the better
a goalkeeper (or defence) is, the greater their xGA would be in comparison to the
observed number of goals conceded. For brevity only modelling of xG will be discussed,
since xGA is a simple transformation of an xG model.

3.1.1 Rathke’s 2017 Zonal xG Model

Rathke (2017) considers the location of a shot in the attacking half of a football pitch
which is divided into eight zones, shown in figure B.3 which presents the xG estimate
for each zone. In the setup of this model error is introduced, since the lengths and
widths of professional football pitches vary (Football History 2022), therefore dividing
the pitch into fixed zones will lead to different zone sizes for each different pitch size.

However this error may not be large since the 18-yard box (penalty area) is a fixed
size for all professional football pitches and accounts for five of the eight zones. These
five zones also have some of the highest xG estimates and are the same size regardless
of pitch dimensions.
The number of total shots, shots on target, and goals for each zone is used to compute
empirical estimates for the probability of a shot resulting in a goal, by separately
computing the following for i = 1, ..., 8:

$$xG_i = \frac{\#\text{ goals from zone } i}{\#\text{ total shots from zone } i} \tag{3.1}$$

$$xG_i = \frac{\#\text{ goals from zone } i}{\#\text{ shots on target from zone } i} \tag{3.2}$$

Rathke uses the empirical probability estimates based on total shots from each zone,
equation 3.1, as the estimate for xG in each zone. If equation 3.2 were chosen,
the quality of players’ shooting ability may be distorted. In particular a player who
frequently shoots off-target, yet has a high goal conversion rate when they shoot on-target,
would appear to be a far more accomplished shot taker than would be a
true reflection of their shooting ability.
The total xG for a team in a given game is calculated as the sum of the xG values for
each shot, determined by the zone from which each shot was taken. In a given
match, total xG can be calculated by equation 3.3.

$$xG = \sum_{i=1}^{8} xG_i \times n_i \tag{3.3}$$

where ni is the total number of shots from zone i, with i = 1, 2, ..., 8 and xGi denotes
the xG value for zone i. Rathke’s model assumes that xG is bilaterally symmetric
around the central-line between left and right touchlines, and that xG is uniform for
each zone regardless of position within that zone. Both of these assumptions may
not be valid. Firstly the majority of football players are right-footed (Bryson et al.
2013), to be more precise, 60% of all football players in the top 5 European leagues in
2005/06 were right-footed. This may lead to a systematic difference in the likelihood
of a shot being scored from one side of the pitch than the other, since players shooting
on their preferred foot can often utilise greater power, accuracy and curve than on
their weaker foot. Thus if the assumption of symmetry about the centre of the pitch
is not appropriate, using symmetric zones may induce error in the model. Second, in
a given zone the distance and angle to the goal could change considerably from one
part to another. An example of this could be in zone 5 (see figure B.3), where at
the nearest location to goal in zone 5 the ball is approximately 7.2 yards from the
centre of the goal with the angle of the goal 53◦ from post-to-post. However from the
furthest corner of zone 5 the ball would be approximately 28.5 yards from the centre
of the goal and with an angle of goal of just 10.3°. Thus due to the considerably different
distances and angles within some zones, the assumption of uniform probability
of scoring from within each xG zone may not be appropriate. Improvements to
this model could be made using more data. Empirical estimates for a greater number
of smaller zones may still provide adequate accuracy, should a sufficiently large set of
data be available. These smaller zones could also be designed such that the symmetry
assumption is not required.
The simplicity of this model may benefit a spectator who does not have access to shot
coordinates, yet is keen enough to memorise the xG for each zone and, while
watching a match, gauge which zone each shot was taken from, keeping track of each
team’s xG. Therefore this method may be more inclusive for spectators.
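The match total in equation 3.3 can be sketched in R as below. This is only an illustration under stated assumptions: the zone estimates are those reported later in table 3.1, whereas the vector `shot_zones` is a made-up set of shots for a single team in a single match.

```r
# Zone xG estimates from table 3.1 (zones 1 to 8).
zone_xg <- c(0.5383, 0.2807, 0.2183, 0.0782, 0.1425, 0.0560, 0.1134, 0.0279)

# Hypothetical zones of a team's shots in one match (illustrative only).
shot_zones <- c(2, 2, 5, 8, 8, 8, 1, 4)

# n_i: number of shots from each zone, including zones with no shots.
n <- tabulate(shot_zones, nbins = 8)

# Equation 3.3: total xG is the sum over zones of xG_i * n_i.
total_xg <- sum(zone_xg * n)
```

Summing the zone value shot by shot, `sum(zone_xg[shot_zones])`, gives the same total, which is how a spectator keeping a running tally would arrive at it.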

3.1.2 Gómez’ 2020 Distance and Angle xG Model

Modelling xG using logistic regression with the distance from the goal line and the
angle of the goal is proposed by Gómez (2020). Both the distance to the centre of
the goal line and the angle of the goal can be treated as approximately continuous
variables and the model can be set up as a generalised linear model (GLM) with
binomial response. The model can be described using the standard set up of a GLM
in two equations 3.4 and 3.5.

$$y = \mu + \epsilon \tag{3.4}$$

$$g(\mu) = x^T \beta \tag{3.5}$$

Where:

• y is the response (in this case 1 for a goal, 0 for no goal).


• µ is the mean response.

• ϵ is random error with zero mean.

• g(.) is an invertible function called the link function.

• $x = (x_1, \dots, x_p)^T$ is the set of covariates.

• $\beta = (\beta_1, \dots, \beta_p)^T$ are unknown parameters to be estimated.

The link function in this case is chosen to be the canonical link for binomial responses,
the logit, defined by equation 3.6.

$$g(\pi) = \log\left(\frac{\pi}{1 - \pi}\right) \tag{3.6}$$

Here π denotes the probability of a goal being scored, which will later be the xG value
for each shot. The GLM is described later in equation 3.13 after the covariates have
been defined.
The Wyscout events data from Pappalardo et al. (2019) records the locations of shots
using (x, y) coordinates, with both x and y in the range [0, 100]. These are defined from
the perspective of the attacking team: x reflects how close the team is to the goal line
in front of them and y denotes how close the team is to the right side of the field.
For example, the centre circle is located at (50, 50) whereas the right vertex (from the
attackers’ perspective) of the opposition 18-yard box is located at (84, 80). Since these
are recorded to the nearest integer, there may be some rounding error which could
impact model accuracy. To calculate the distance to goal, for simplicity a singular
point on the goal line is chosen.
Gómez (2020) chooses the centre of the goal line which appears a reasonable selection.
By using a singular point on the goal line some error is introduced into the model,
since a shot could be taken from almost on the goal line and within the posts, yet
still have distance of up to 3.66m to the centre of the goal. This effect is large for
shots taken close but wide from the goal, however is less for shots taken from greater
distances.
Another simplification which induces error into the model is to fix the size of the pitch,
since pitch sizes vary. In this case the assumption of a common, fixed pitch size means
that the conversion from coordinates to distances will not truly reflect the distance to
the goal should the pitch differ from the dimensions set.
A common choice of pitch size among elite teams in European football is
105m x 68m (Football History 2022). Therefore to calculate the distance
from goal using the x, y coordinates, a conversion from percentage to metres is used
based on the 105 x 68m pitch dimensions and is described in equations 3.7 and 3.8.
$$x_m = (100 - x) \times \frac{105}{100} \tag{3.7}$$

$$y_m = y \times \frac{68}{100} \tag{3.8}$$
Where $x_m$, $y_m$ are the coordinates in metres and x is transformed to denote distance
from the opposition goal line (the short boundary line at the end of the pitch). Since $x_m$,
the distance from the goal line, and $34 - y_m$, the lateral offset from the centre of the
goal, are the two shorter sides of a right-angled triangle, Pythagoras’ theorem gives the
distance to the centre of the goal as the hypotenuse. This calculation is given by
equation 3.9.

$$L = \sqrt{x_m^2 + (34 - y_m)^2} \tag{3.9}$$

The distance calculated in equation 3.9 can be visualised as the dashed line in figure 3.1.
Here L is the distance to the centre of the goal, which on a 68m wide pitch is located at
$y_m$ = 34m. This contradicts Gómez’ calculation
of distance, where 32.5m is used instead of 34m despite also basing calculations on a
pitch of dimension 105m x 68m. I believe using 32.5m is an unintentional mistake.
To calculate the angle of the goal as depicted with the black angle line in figure 3.1,
the angles to each goal post must be found individually. Based on a 68m wide pitch
with the standard 7.32m goal width, the y values of the left and right goal posts are
calculated by equations 3.10 and 3.11.

$$y_{\text{left}} = \frac{68 - 7.32}{2} = 30.34\text{m} \tag{3.10}$$

$$y_{\text{right}} = 30.34 + 7.32 = 37.66\text{m} \tag{3.11}$$

Then the angle between the goals (black) on figure 3.1 is the difference between the
angle from the furthest and nearest posts (red and light blue angle lines) on figure 3.1.
Therefore the calculation for angle of goal is given by equation 3.12.

$$A^{\circ} = \left[\arctan\left(\frac{37.66 - y_m}{x_m}\right) - \arctan\left(\frac{30.34 - y_m}{x_m}\right)\right] \times \frac{180}{\pi} \tag{3.12}$$

The GLM can then be described using the distance and angle to goal as equation
3.13.

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 L + \beta_2 A \tag{3.13}$$
Where L (metres) and A (degrees◦ ) are defined as above in equations 3.9 and 3.12
respectively. Therefore after algebraic manipulation of equation 3.13, the estimate for
xG for a given distance L and angle A from goal is given by equation 3.14.

$$\pi = \frac{\exp(\beta_0 + \beta_1 L + \beta_2 A)}{1 + \exp(\beta_0 + \beta_1 L + \beta_2 A)} \tag{3.14}$$
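As a worked check of equations 3.7 to 3.14, the sketch below converts a coordinate pair into distance and angle. The helper names `shot_geometry` and `xg_from_geometry` are assumptions introduced for illustration and are not functions from the appendix; the pitch size of 105m x 68m is the one assumed in the text.

```r
# Hypothetical helper: convert Wyscout-style (x, y) percentages to the
# distance L (metres) and angle A (degrees) of equations 3.7 to 3.12,
# assuming a 105m x 68m pitch.
shot_geometry <- function(x, y) {
  xm <- (100 - x) * 105 / 100                     # equation 3.7
  ym <- y * 68 / 100                              # equation 3.8
  L  <- sqrt(xm^2 + (34 - ym)^2)                  # equation 3.9
  A  <- (atan((37.66 - ym) / xm) -
         atan((30.34 - ym) / xm)) * 180 / pi      # equation 3.12
  c(L = L, A = A)
}

# Equation 3.14 for given coefficients; plogis() is the inverse logit.
xg_from_geometry <- function(L, A, beta) {
  plogis(beta[1] + beta[2] * L + beta[3] * A)
}

# A central shot 11m from the goal line (the penalty spot).
penalty_spot <- shot_geometry(x = 100 - 11 / 1.05, y = 50)
```

For this central position the distance is 11m and the angle is a little under 37 degrees, since each post is 3.66m either side of the centre of the goal.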

A possible cause of issues in this model may be the strong negative correlation between
distance and angle to goal. For the shots data discussed in section 3.2, the correlation
between distance and angle to goal was ρ = −0.74. As distance increases, typically
the angle to goal will decrease, thus the covariates are negatively associated. However,
as discussed in section 3.3.2, the model still provides adequate fit to the data. Models which use
orthogonal distances from the centre of the pitch and the distance to the goal line were
considered, yet as discussed at the end of section 3.3.3 provided no improvement.

3.1.3 Stats Perform’s xG Model

With less restricted access to data, a considerably greater number of covariates can
be included in the logistic regression model in an attempt to explain a greater proportion
of variation in the binary response data. Whitmore (2021) writes about the
xG model used by Stats Perform, who own sports analytics company Opta. Although
technical details of the model and data are not included in the article due to corporate
confidentiality, their logistic regression model is said to include many variables, some
of the most important of which are listed in Whitmore’s article as:

• ”Distance to goal.

• Angle to goal.

• One-to-one.

• Big chance.

• Body part (e.g., header or foot).


• Type of assist (e.g, through ball, cross, pull-back etc).

• Pattern of play (e.g., open play, fast break, direct free kick, corner kick, throw-in
etc).”

Achieving a model with such a list of covariates, some of which are categorical, is likely
to produce greater explanatory power and is possible due to access to ”hundreds of
thousands of shots from historical Opta data.” Another feature which is considered
by this model as opposed to Rathke or Gómez is treating penalties, direct free kicks
and headers from set-pieces differently. For example the xG for a penalty in the Stats
Perform model is given to be a fixed value equal to the empirical estimate of how
many penalties are scored, which for their data is given to be 0.79 xG. This may be a
more appropriate method to model xG for set-pieces as the situation of a set-piece can
considerably differ from a shot from open-play. For example, for a free-kick, FIFA Law
13 (FIFA 2012) states that the wall of players blocking the shot should remain at least
10yds (9.15m approx) from the ball until it is in play, whereas a shot from a similar
position in open play could have an opponent far closer attempting to block the shot
or pressuring the attacker to take the shot with less opportunity for composure.
While having a substantial list of covariates included in the model may provide a highly
effective xG model, a simpler model containing only the covariates which explain a
significant amount of variation in the data may be considered an adequate model.
In addition to the covariates included in the Stats Perform model, there may be other
factors which could improve the modelling of xG. Some of these may include: proxim-
ity of the nearest defender (or some equivalent measure of defensive pressure on the
attacker at the time of the shot), the velocity (speed and direction) of the ball at the
moment before the ball is struck, the velocity of the attacking player at the moment
the ball is struck, the number of touches for the shooter before the shot is taken, and
any other features which may influence the quality of a shot. A limitation of using
some of these covariates is the demand for in-depth data which may currently not be
recorded or available. Thus this can be viewed as an area for future study.
An extension to current xG models could be removing the assumption of homogeneity
of teams in the model, instead accounting for the quality of the team in the xG
model. While this loses some of the useful interpretation of xG in terms of providing
direct comparison with an average calculated across all qualities of teams and leagues
included in the shots data, it could provide greater accuracy in estimating xG and
give insight into which players are more efficient with their shots relative to the other
members of their squad and, from a tactical viewpoint, where teams should encourage
individual players to take more shots from to maximise their xG. To implement this
method, a similar model to any of the above could be suitable, for example including
a covariate in the distance and angle model which provides a measurement of team
quality. This covariate could correspond to a team’s league standing or use a rating
system such as Elo rating, thus allowing the logistic regression model to more
accurately estimate xG for the best and worst teams, who typically over-perform and
under-perform their respective xG estimates.

Figure 3.1: Distance and angle to goal from a shot.

3.2 Data and Processing


The data required to construct even a basic xG model, such as Rathke’s model
with 8 shot zones, must include all shots for a fixed period, from a set competition,
with information on whether each shot was a goal and where on the pitch the
shot was taken from. These requirements were met by the Pappalardo et al. (2019) spatio-temporal
match events data set collected by Wyscout. These data comprised observations of
the following variables: eventId, eventName, subEventId, tags, eventSec, id, matchId,
matchPeriod, playerId, positions, and teamId. Observations were available for all
matches played in the top 5 European leagues for the season 2017/18; English Premier
League, German Bundesliga, Italian Serie A, Spanish La Liga, and French Ligue 1 and
also for the European Championship 2016 (’Euro 2016’) and the FIFA World Cup 2018.
These data were split into 7 JSON (JavaScript Object Notation) files which were made
available by Pappalardo et al. (2019). A total of 1,941 matches is covered by the data,
thus providing a sufficiently large set of shots data with which an xG model could be
created.
To read this data into R, the ’jsonlite’ (Ooms 2014) package was required. Then using
the ’fromJSON’ function, each of the 7 files could be read in turn into the R console. The
structure of these files did not follow the tidy data principles, since for each observation
the ’tags’ column was a list, with each item in the list being an n x 1 data frame for
n = 3, 4, 5, 6. These coded tags could be translated using the ’tags2name’ comma-separated
values (csv) file also available with the spatio-temporal data. Also each
cell of the ’positions’ column was a 2 x 2 data frame which contained the (x, y)
coordinates of the location at the start of the event and the coordinates at the end of that event.
Since the only events of interest for the xG model were shots, the events data was
subset using the following command, in this case for the Premier League, where the
data was called ’eng’ and the ’eventId’ 10 corresponded to a shot:

eng_shots <- as.list(subset(eng, eng$eventId == 10))

Then it was necessary to determine which of these shots were goals, so the function
’is_goal’ in section C.1.2 was written to create a binary vector with 1 corresponding
to a goal and 0 to no goal. Another similar function ’is_blocked’, also found
in section C.1.2, was created to determine which shots were blocked so these could be
removed when creating the xG models. Next the positional data was extracted into
numeric vector form using a ’for’ loop which counted through each event observation,
creating two vectors of length equal to the number of shots, with x1 and y1 being the
x and y coordinates respectively in the range [0, 100] for the location each shot was
taken from:

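# eng_n: number of shots; eng_pos: list of 'positions' data frames, one per shot
x1 <- numeric(eng_n)
y1 <- numeric(eng_n)
for (i in seq_len(eng_n)) {
  x1[i] <- eng_pos[[i]][1, 2]  # first row, second column: x at the start of the event
  y1[i] <- eng_pos[[i]][1, 1]  # first row, first column: y at the start of the event
}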
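As an aside, the same extraction can be written without an explicit loop. The sketch below is an alternative rather than the approach used in this work, and it runs on a toy stand-in for the ’positions’ data; the column order (y, x) is an assumption inferred from the indexing in the loop above.

```r
# Toy stand-in for the 'positions' column: one 2 x 2 data frame per shot,
# with columns assumed to be ordered (y, x) and the first row holding the
# start location of the event.
eng_pos <- list(
  data.frame(y = c(41, 0), x = c(88, 0)),
  data.frame(y = c(52, 100), x = c(85, 100))
)

# Extract the start coordinates of every shot without an explicit loop.
x1 <- vapply(eng_pos, function(p) as.numeric(p[1, 2]), numeric(1))
y1 <- vapply(eng_pos, function(p) as.numeric(p[1, 1]), numeric(1))
```

Using `vapply` with `numeric(1)` also guards against a malformed entry, since it raises an error if any element does not return a single number.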

Then smaller data frames of all shots which followed the tidy data principles of Wick-
ham & Grolemund (2017) for each of the 7 competitions were created by using the
’cbind’ function, and each column was given appropriate and identical names for each
data frame so they could be later combined for all seven competitions without further
complication.

# note: if any column is character (e.g. the match period), cbind() returns a character matrix
eng_df <- cbind(eng_shots$matchId, eng_shots$teamId, eng_shots$playerId,
                eng_shots$matchPeriod, eng_shots$eventSec, x1, y1, goal, block)
colnames(eng_df) <- c('MatchID', 'TeamID', 'PlayerID', 'MatchPeriod',
                      'Time (s)', 'x1', 'y1', 'Goal', 'Blocked')
save(eng_df, file = "eng.Rda")

Each of these were saved as ’Rda’ files to later be loaded and combined. Once the files
for all 7 competitions had been reloaded in R, they were combined as a data frame
using the ’as.data.frame’ and ’rbind’ functions.

shots_df <- as.data.frame(rbind(UCL_df, wc_df, eng_df, fra_df,
                                itl_df, ger_df, spa_df))

This resulted in a data frame with 43,078 observed shots, each with 9 variables and
no missing values. Of these shots, 10,222 (23.73%) were blocked and so were removed
for the purpose of creating the xG model, since the definition of a blocked shot was
vague and could include close range or long distance blocks, none of which were goals.
Since blocks are removed, xG should only be calculated for unblocked shots, otherwise
these xG models may overestimate the number of goals scored. Of
the remaining 32,760 shots, 4,492 were goals, which gives an approximate conversion rate of
13.71%. With the data processed, exploratory data analysis could be undertaken to
gain an initial insight into the relationship between shot location and conversion rate.
To produce figure 3.2 based on that of Gómez (2020), packages ’ggplot2’ (Wickham
2016) and ’ggsoccer’ (Torvaney 2020) were installed.


Prior to modelling, the x and y coordinates are converted into the necessary units for
both the models. First for the Rathke model, since the zones were divided on the pitch
using natural markers such as the 6-yard and 18-yard boxes, x and y were converted
into yards based on a 105 x 68m (or 114.829 x 74.366yds using 1:1.094 conversion)
pitch:

ydsx1 <- (100 - unblocked_shots$x1) * (114.829/100)


ydsy1 <- unblocked_shots$y1 * (74.3657/100)

Where the x coordinate is also transformed to reflect the distance from the opposition
goal line as in equation 3.7.
Later for the logistic regression model with covariates relating to the distance and angle
from goal, metres are used since a metric system may be easier for interpretation of
the model. These conversions were again made based on a 105 x 68m pitch size.

x1m <- (100 - xG_df$x1) * (105/100)


y1m <- xG_df$y1 * (68/100)

Where xG_df$x1 and xG_df$y1 are the x and y coordinates respectively.


For the Rathke zonal model, it was necessary to evaluate which zone each shot was
taken from. Using the x and y coordinates in yards, the function ’zone’, in section
C.1.2, is designed to allocate a number from 1 to 8 for each shot corresponding to
the shot location. The output of ’zone’ is a numeric vector of length equal to
the number of shots in the data frame. The limits for these zones were calculated
on the basis of a 114.829 x 74.366 yd pitch. The zone for each shot was successfully
determined by the function and the vector of zones was added to the data frame.
Using the number for each zone, empirical estimates could be computed by creating
subsets of shots from each zone from the data frame of all shots. The shots data was
partitioned into a training set and test set with proportions 95% (31,122 shots) and
5% (1638 shots) respectively. Empirical estimates were found by summing the binary
outcome ’Goal’ for the subset of each zone and dividing by the total number of shots
in that zone. For example with zone 1:

zone1 <- train[(train$Zone == 1),]


xG1 <- sum(zone1$Goal) / length(zone1$Goal)
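The partition and the per-zone estimates can also be obtained in a few lines. The following is a sketch on made-up data, not the code used to produce the results: the seed, the toy data frame and the use of `sample` and `tapply` are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the illustrative split is reproducible

# Toy stand-in for the unblocked shots data frame (values are made up).
shots <- data.frame(
  Zone = rep(1:8, each = 50),
  Goal = rbinom(400, size = 1, prob = 0.15)
)

# 95% / 5% partition into training and test sets.
train_idx <- sample(nrow(shots), size = round(0.95 * nrow(shots)))
train <- shots[train_idx, ]
test  <- shots[-train_idx, ]

# Empirical xG per zone: goals divided by shots, i.e. the mean of 'Goal'.
zone_xg <- tapply(train$Goal, train$Zone, mean)
```

Because ’Goal’ is binary, the mean within each zone is identical to the sum of goals divided by the number of shots used in the zone 1 example.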
Prior to creating the GLM for the distance and angle logistic regression model in R,
both covariates must be computed. As discussed in equation 3.9 using Pythagoras’
theorem, the distance to the centre of the goal (in metres) was calculated.

dist_to_goal_line_centre_m <- sqrt((x1m)^2 + (34 - y1m)^2)

Next the angle of the goal, in degrees, was calculated as discussed in equation 3.12.

angle_of_goal <- (atan((37.66 - y1m) / x1m) - atan((30.34 - y1m) / x1m)) * 180 / pi

Where ’x1m’ and ’y1m’ are the x and y coordinates as defined formally in equations
3.7 and 3.8. These variables are appended as columns to the shots data frame, now denoted
’df_angle’. The GLM was then fitted with binary ’Goal’ as the response
and ’dist_to_goal_line_centre_m’ and ’angle_of_goal’ as the explanatory variables. The
exponential family distribution was set to the binomial distribution with logit link.

fittedxG <- glm(Goal ~ dist_to_goal_line_centre_m + angle_of_goal,
                family = binomial(link = "logit"), data = df_angle)
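To show how xG values are obtained from such a fit, the sketch below applies the same model specification to simulated data. Everything in it is an assumption for illustration: the covariate ranges, the coefficients used to simulate goals and the seed are arbitrary, and none of the numbers correspond to the Wyscout data.

```r
set.seed(42)  # arbitrary seed for a reproducible illustration

# Simulated covariates (made up): distance in metres and angle in degrees.
n <- 5000
sim <- data.frame(
  dist_to_goal_line_centre_m = runif(n, 2, 35),
  angle_of_goal              = runif(n, 5, 80)
)

# Simulate goals from a logistic model with arbitrary illustrative coefficients.
eta <- -1.3 - 0.08 * sim$dist_to_goal_line_centre_m + 0.02 * sim$angle_of_goal
sim$Goal <- rbinom(n, size = 1, prob = plogis(eta))

# Same model specification as in the text, fitted to the simulated data.
fit <- glm(Goal ~ dist_to_goal_line_centre_m + angle_of_goal,
           family = binomial(link = "logit"), data = sim)

# xG for each shot is the fitted probability of a goal.
xg <- predict(fit, type = "response")
```

Calling `predict` with `type = "response"` returns probabilities on the [0, 1] scale, whereas the default returns the linear predictor on the logit scale.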

3.3 Results and Discussion

3.3.1 Rathke’s 2017 xG Model

Figure 3.2 indicates that in the Rathke (2017) model, the central zones and those
inside the penalty area have greater xG than wider or further zones from the goal such
as zones 7 and 8. This xG model, trained using 95% of the unblocked shot data from
the top 5 European leagues, European Championship 2016 and the FIFA World Cup 2018, is
given in table 3.1 and is depicted in figure B.3.
The increase in xG estimates can be visualised using the heatmap, figure 3.3, whereby
the code was based on a figure designed by Gómez (2020).
Figure 3.2: All unblocked open play unsuccessful shots (blue) and goals (red) from the
top 5 European leagues 2017/18, European Championship 2016 and FIFA World Cup
2018.

Zone    xG        #shots per xG
1       0.5383    1.86
2       0.2807    3.56
3       0.2183    4.58
4       0.0782    12.8
5       0.1425    7.02
6       0.0560    17.9
7       0.1134    8.82
8       0.0279    35.8

Table 3.1: Table of xG estimates for each zone and the number of shots to result in 1
xG.

The goodness of fit of this model can be assessed using the chi-square goodness of
fit test where the data are treated as binomial counts for each of the zones from 1
to 8. Using the data from the test set, the number of shots, goals and the expected
number of goals based on the xG value for each zone were calculated. The test statistic
is calculated in equation 3.15, with $O_i$ the observed number of goals in each zone and
$E_i$ the expected number of goals in each zone.

$$\chi^2 = \sum_{i=1}^{8} \frac{(O_i - E_i)^2}{E_i} \tag{3.15}$$

This gave $\chi^2 = 5.45 < 14.07 = \chi^2_{0.95,7}$, with p-value p = 0.6047. Hence there is insufficient
evidence to reject the null hypothesis at a 5% significance level and therefore the
zonal xG model adequately fits the data. Table 3.1 also provides the number of shots
required from each zone to expect one goal to be scored, calculated as $\frac{1}{xG_i}$. This may
be a more naturally intuitive statistic for a typical football fan to follow, in particular
for the zones with low xG.
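The calculation in equation 3.15 can be sketched in R as follows. The per-zone shot and goal counts here are hypothetical, chosen only so that the shots sum to the test set size of 1638; the zone estimates are those of table 3.1 and 7 degrees of freedom are used as in the text.

```r
# Hypothetical test-set counts per zone (illustrative only).
shots_per_zone <- c(40, 150, 120, 300, 200, 350, 180, 298)
observed_goals <- c(22, 45, 24, 20, 30, 21, 19, 9)

# Zone xG estimates from table 3.1 give the expected goals E_i.
zone_xg <- c(0.5383, 0.2807, 0.2183, 0.0782, 0.1425, 0.0560, 0.1134, 0.0279)
expected_goals <- shots_per_zone * zone_xg

# Equation 3.15 and its comparison with the chi-square(7) distribution.
chi_sq   <- sum((observed_goals - expected_goals)^2 / expected_goals)
critical <- qchisq(0.95, df = 7)
p_value  <- pchisq(chi_sq, df = 7, lower.tail = FALSE)
```

The critical value returned by `qchisq(0.95, df = 7)` is the 14.07 quoted above; the test statistic and p-value from these made-up counts are not the values reported in the text.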

3.3.2 Distance and Angle Logistic Regression xG model

Figure 3.2 indicates that the closer and more central to goal a shot is taken,
the more frequently it is a goal, whereas shots taken from a considerably further
distance, in particular outside of the 18-yard box, or from a wide position were less
frequently goals. This suggests that in the logistic regression model with distance and
angle to goal as explanatory variables, the coefficient of distance would be negative,
so the xG estimate decreases as distance increases, and the coefficient of angle would be positive,
so as the angle to goal becomes larger (with closer and more central shots), the xG
estimate increases. The summary of the fitted model in R shows
that the intercept, distance, and angle covariates are all significant in the model at a
0.1% level. Moreover the estimates are $\beta_0 = -1.28797$, $\beta_1 = -0.07931$, $\beta_2 = 0.02206$
to 5 decimal places. As anticipated the coefficient of distance is negative, whereas
the coefficient of angle is positive. Thus the fitted logistic regression model for xG is
described in equation 3.16 using the formula discussed in equation 3.13.

$$\log\left(\frac{xG}{1 - xG}\right) = -1.28797 - 0.07931L + 0.02206A \tag{3.16}$$

Hence the xG estimate for a given shot from distance L (m) and at angle A (degrees)
from the goal is given by equation 3.17.

$$xG = \frac{\exp(-1.28797 - 0.07931L + 0.02206A)}{1 + \exp(-1.28797 - 0.07931L + 0.02206A)} \tag{3.17}$$
For example, table 3.2 demonstrates the xG for 3 open play shots each from a central
position to the goal with their distances and angles using this fitted model.

Location                   Distance L (m)    Angle A (°)    xG
Edge of the 6-yard box     5.5               67             0.440
Penalty spot               11                37             0.206
Edge of the 18-yard box    16.5              25             0.115

Table 3.2: Table of xG estimates for 3 central locations using the logistic regression
model.
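The entries of table 3.2 can be reproduced from equation 3.17 with the short sketch below. It assumes that, for a central shot, the angle of the goal is twice the angle subtended by half the 7.32m goal width, which is equation 3.12 evaluated at $y_m$ = 34; the variable names are illustrative.

```r
# Fitted coefficients reported in the text (intercept, distance, angle).
beta <- c(-1.28797, -0.07931, 0.02206)

# Central shots: distance L in metres to the centre of the goal line.
L <- c(5.5, 11, 16.5)

# For a central shot the goal posts are 3.66m either side, so the
# angle of the goal in degrees is 2 * atan(3.66 / L).
A <- 2 * atan(3.66 / L) * 180 / pi

# Equation 3.17: inverse logit of the linear predictor.
xG <- plogis(beta[1] + beta[2] * L + beta[3] * A)
round(data.frame(L, A, xG), 3)
```

The xG values agree with the table to 3 decimal places when the unrounded angles are used, which suggests the tabulated angles were rounded only for display.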

The closer the ball is to the goal, with a wider angle, the greater value of xG for
each shot as can be visualised in the figure 3.4; a heat map designed similarly to
that of Gómez (2020). The scale of both heat maps is set identically, enabling direct
comparison between the models.
Since the xG values using the distance and angle logistic regression model do not
naturally fall into categories as the zonal xG model did, the Hosmer-Lemeshow test
can be used to assess the goodness of fit for this model. The Hosmer-Lemeshow test
statistic groups observations together, in this case by their xG values, to provide a test
statistic which is asymptotically chi-squared distributed with g-2 degrees of freedom,
with g the number of groups. This statistic was calculated for the test set of 1638
shots with the default 10 groups using the ’logitgof’ function available in the R package
’generalhoslem’ (Jay 2019). The output of this test was test statistic $X^2 = 11.693$
with p = 0.1654. Hence at a 5% significance level there is insufficient evidence to
reject the null hypothesis, therefore the null hypothesis that the model provides an
adequate fit for the data is retained.

3.3.3 Comparing xG models

Both xG models provide adequate fit to the data at a 5% significance level, thus it
may be desired to draw comparisons investigating which model provides the better fit.
This can be achieved in numerous ways, including ROC curves and their AUC values,
AIC and BIC, Brier score and skill score, and their respective MSEs.
Prior to comparison the zonal xG model can be formulated as a logistic regression
model using the GLM function in R, with eight parameters for the categorical variable;
zone with 8 factor levels.
This is equivalent to calculating the empirical estimates for each zone. An ROC
(receiver operating characteristic) curve is a visualisation method capable of providing
insight into the performance of a classification model, such as a logistic regression
model. In an ROC curve the sensitivity (true positive rate) is plotted against specificity
(1 - false positive rate) to provide an idea of how effectively the model is predicting
the outcomes. The perfect model would follow the dotted line in figure 3.5, therefore
having unit area under the curve (AUC) equal to 1.

        Zonal xG model    Distance and angle model
AIC     23,626.51         23,109.57
BIC     23,693.68         23,134.76

Table 3.3: Table of AIC and BIC values for the two xG models.

                               Brier Score    Brier Skill Score
Zonal xG model                 0.10780        0.08757
Distance and Angle xG model    0.10561        0.10612

Table 3.4: Table of Brier score and skill score for contrasting xG models.
The performance of both models appears similar, however figure 3.5 suggests that the
xG model using distance and angle to goal as covariates provides a better fit for the
data. This is reflected in the calculations for AUC, where the distance and angle xG
model had AUC= 0.7411, whereas Rathke’s zonal xG model had AUC= 0.7113.
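The text does not state how the AUC values were computed, so the following is only one possible approach, sketched in base R. It uses the rank-sum (Mann-Whitney) identity, under which the AUC is the probability that a randomly chosen goal was given a higher xG than a randomly chosen non-goal; the function name `auc_rank` and the data are made up.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity, i.e.
# the probability that a random goal has a higher xG than a random non-goal.
auc_rank <- function(goal, xg) {
  r  <- rank(xg)                      # average ranks handle ties
  n1 <- sum(goal == 1)
  n0 <- sum(goal == 0)
  (sum(r[goal == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example (made-up values).
goal <- c(1, 0, 1, 0, 0)
xg   <- c(0.9, 0.3, 0.6, 0.7, 0.1)
auc_rank(goal, xg)
```

In the toy example 5 of the 6 goal and non-goal pairs are ordered correctly, giving an AUC of 5/6.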
The models could also be compared based on AIC and BIC, see table 3.3. The larger
AIC and BIC for the zonal xG model indicates that the distance and angle model may
be a more suitable model for the data.
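As a brief consistency check on table 3.3, assuming the standard definitions of AIC and BIC with $k$ estimated parameters, $n$ observations and maximised likelihood $\hat{L}$, the gap between the two criteria depends only on $k$ and $n$:

```latex
\mathrm{AIC} = 2k - 2\log\hat{L}, \qquad
\mathrm{BIC} = k\log n - 2\log\hat{L}, \qquad
\mathrm{BIC} - \mathrm{AIC} = k\,(\log n - 2)
```

Taking $n$ = 32,760 (an assumption, namely that all unblocked shots were used in these fits) gives $\log n \approx 10.397$, so the gap is approximately $8 \times 8.397 \approx 67.18$ for the zonal model and $3 \times 8.397 \approx 25.19$ for the distance and angle model. These agree, up to rounding, with the tabulated differences 23,693.68 - 23,626.51 = 67.17 and 23,134.76 - 23,109.57 = 25.19.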
These models could also be evaluated by Brier score and Brier skill score for each shot,
since xG is in essence a probability estimate in the range [0,1] with outcome 1 for a
goal and 0 for no goal. Brier score and Brier skill score are discussed in more detail
in 2.3.1, however the number of classes ’R’ in equation 2.28 is now 2 (goal, no goal)
instead of 3 earlier. These values can be found in table 3.4, and were calculated using
the ’BrierxG’ function found in section C.1.2.
The Brier skill score was calculated based on a reference level that each shot had equal
probability of being a goal, with that probability level set to be the overall conversion
rate of all shots in the data set, namely 13.71%. Estimates for both Brier scores are
closer to 0, which corresponds to perfect accuracy, than to 1, which corresponds to
perfect inaccuracy. Both scores are similar, which may suggest the models are predicting
xG with a similar degree of accuracy. Similarly, the Brier skill scores for both models
are greater than 0, so both perform better than if xG were modelled using a common
value of 0.1371 for all shots. The skill score is slightly larger (see table 3.4) for the
model which uses distance and angle as covariates, which may suggest this model performs
slightly better than Rathke’s xG model with zonal covariates.
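The calculation described above can be sketched as follows. This is not the ‘BrierxG’ function of section C.1.2; the names `brier_score` and `brier_skill` are hypothetical, and the single-probability form of the Brier score is assumed here, which may differ from the multi-class form of equation 2.28 by a constant factor. The skill score is unaffected by such a factor.

```r
# Brier score for binary outcomes: mean squared difference between the
# predicted probability of a goal and the observed outcome (1 = goal, 0 = no goal).
brier_score <- function(pred, outcome) mean((pred - outcome)^2)

# Brier skill score against a constant reference forecast. By default the
# reference is the overall conversion rate, as described in the text.
brier_skill <- function(pred, outcome, ref = mean(outcome)) {
  bs_ref <- brier_score(rep(ref, length(outcome)), outcome)
  1 - brier_score(pred, outcome) / bs_ref
}

# Toy example with made-up xG values (hypothetical, not the dissertation data).
xg   <- c(0.05, 0.10, 0.35, 0.40, 0.80)
goal <- c(0, 0, 1, 0, 1)
bs   <- brier_score(xg, goal)
bss  <- brier_skill(xg, goal)
```

A skill score of 0 means the model does no better than the constant reference forecast, while positive values indicate an improvement over it.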
Lastly to compare these two models the mean squared error (MSE) can be calculated
using the observed number of goals scored by each team in a match and the xG for
each team in that match. The formula for MSE is given in equation 3.18, where ’y’
and ’ŷ’ denote observed and predicted values respectively.
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3.18} \]

For both models the MSE is calculated using the ‘aggregate’ function in R (R Core
Team 2020) over Match ID and Team ID, to find both the number of goals scored and
the sum of xG for each team in each game. For Rathke’s model this gave
MSE = 1.02540, in comparison to Gómez’ distance and angle model which had
MSE = 1.02468. These values are similar and suggest that both approaches model xG
to a similar accuracy.
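A minimal sketch of this aggregation step is given below. The data frame `shots` and its column names (`matchId`, `teamId`, `goal`, `xG`) are assumptions chosen for illustration, and the values are invented.

```r
# Hypothetical shot-level data: one row per shot.
shots <- data.frame(
  matchId = c(1, 1, 1, 1, 2, 2, 2),
  teamId  = c(10, 10, 20, 20, 10, 30, 30),
  goal    = c(1, 0, 0, 0, 0, 1, 1),
  xG      = c(0.30, 0.10, 0.05, 0.20, 0.15, 0.60, 0.40)
)

# Sum observed goals and xG for every match and team combination.
per_team <- aggregate(cbind(goal, xG) ~ matchId + teamId, data = shots, FUN = sum)

# Mean squared error between observed goals and summed xG (equation 3.18).
mse <- mean((per_team$goal - per_team$xG)^2)
```

Note that in this sketch a team which took no shots in a match would not appear in `per_team`, so such team and match combinations would need to be added separately if they are to count towards the MSE.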
If only one model were to be selected, the distance and angle logistic regression model
may be preferred, since it provides a similar, yet slightly better, fit to the data than
the zonal model while requiring only 3 parameters to be estimated instead of 8.
Alternative logistic regression models containing other covariates relating to shot
location were also tested, such as one using the absolute distance in the y direction
from the centre of the pitch and the distance to the goal line in the x direction, both
as an additive model and with an interaction term. These alternatives provided no
improvement in terms of ROC curve and AUC value over the distance and angle xG
model, and therefore were not explored further.
Figure 3.3: Heatmap of the zonal xG model.
[Panel title: Rathke Zonal xG Model; colour scale: xG value, 0.00 to 1.00.]


Figure 3.4: Heat map demonstrating estimated xG for logistic regression model with
covariates distance and angle to goal.
[Panel title: Logistic Regression xG Model with covariates distance & angle to goal;
colour scale: xG value, 0.00 to 1.00.]
Figure 3.5: ROC curve for both the zonal and distance & angle xG models.
[Plot title: ROC Curve for both xG Models; y-axis: Sensitivity, 0.0 to 1.0; x-axis:
Specificity, 1.0 to 0.0; curves: Distance & Angle, Rathke Zonal.]
Chapter 4

Football Interruptions

4.1 Methods and Modelling

4.1.1 Zhao & Zhang’s Gamma distributed GLM with log link

Zhao & Zhang (2021) propose modelling the duration of six types of stoppage: free
kicks, throw-ins, out-of-bounds balls, goal kicks, corner kicks, and penalties. In order
to do this they include 5 independent explanatory variables: timing, game state,
location, venue, and salary ratio. In addition to these independent variables, two
interaction terms are included which describe the interactions between location and
game state, and timing and game state. Timing and salary ratio are treated as
continuous variables, while the remaining variables are all categorical. Game state is
split into 5 categories: “drawing”, “one-goal behind/lead”, and “two-or-more behind/lead”.
League is also split into 5 categories, one for the country in which each league is
based, i.e. the Premier League has category “England”. Location had 6 zones equal in
size, whereby the pitch is divided so that each division line is parallel to the goal
line, producing 3 zones in the defensive half (1, 2, 3) and 3 in the attacking half (4,
5, 6). Venue recorded the home and away team using a dummy variable (1 = home,
0 = away).
The model for each interruption type could be written as a GLM following a similar
form to equation 3.5, but using a different distributional family and link function:
a gamma response distribution with log link. An example model for free kicks, as given
by Zhao & Zhang (2021), is equation 4.1.

\[ \log[\mathbb{E}(y_i)] = X_i \beta \tag{4.1} \]

Where:

• $y_i$ are independent response variables belonging to a probability distribution in
the gamma family.

• $X_i$ is a row vector of covariates for each observation $i$.

• $\beta$ is a column vector of unknown parameters.

The variance of response y depends on the mean according to the variance function:
$V(y_i) = \alpha\gamma^2$, $\mathbb{E}(y_i) = \alpha\gamma$, where $\alpha$ is the shape parameter and $\gamma$ is the scale parameter.
The response y for each interruption type is the duration of time for which play has
been interrupted. These values are strictly non-negative and can be calculated to
an accuracy of 0.001s using the ’eventSec’ variable from the Wyscout data released
by Pappalardo et al. (2019), hence could be considered approximately continuous.
Therefore using the gamma family for the responses appears to be a suitable choice.
Using the natural logarithm for the link function ensures that the mean is always
greater than zero, which is necessary since the response variables are times and cannot
take negative values.
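The properties described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the intercept, slope and shape values are hypothetical, the single covariate `timing` stands in for the full set of covariates, and none of the numbers come from the Wyscout data. With shape α and scale γ, the variance αγ² equals the squared mean divided by α, so longer mean durations are also more variable.

```r
set.seed(42)
n      <- 2000
timing <- runif(n, 0, 5400)              # hypothetical match time in seconds
shape  <- 4                              # assumed gamma shape parameter (alpha)
mu     <- exp(2.5 + 0.00005 * timing)    # log link: log(mean) is linear in timing

# Gamma responses with mean mu: the scale parameter (gamma) is mu / shape.
duration <- rgamma(n, shape = shape, scale = mu / shape)

# Gamma GLM with log link; the fitted means are exp(linear predictor).
fit <- glm(duration ~ timing, family = Gamma(link = "log"))
coef(fit)

# Every fitted mean is strictly positive because it is an exponential.
all_positive <- all(fitted(fit) > 0)
```

With an identity link, by contrast, a fitted mean could become negative for some covariate values, which would not be meaningful for a duration.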
A crucial limitation of the analysis is that video assistant refereeing (VAR) is only
mentioned as a feature for further study by Zhao & Zhang (2021) and is not considered
as a possible cause for differences in interruption duration across leagues. The data
used is from the 2017/18 league season, when both the Bundesliga (Germany) and Serie
A (Italy) were using VAR for their first entire season (Kohli 2017), while the Premier
League (England), Ligue 1 (France) and La Liga (Spain) did not have VAR in place for
all games until the 2018/19 season (Casamayor 2018, Labellarte 2018). This could
have considerable impact on the quality of analysis, since the durations of penalty,
offside and serious foul play interruptions across leagues with and without VAR are
not directly comparable unless it is assumed that VAR has a non-significant impact
on the duration of these interruptions. This assumption may appear reasonable given
VAR overturns just 0.32 decisions per game in the Premier League (Johnson 2022);
however, VAR interruptions may last considerably longer than traditional on-field
referee decisions, since the average overturned decision in the Premier League
using VAR takes 84 seconds (Johnson 2021). Moreover, the generalisability of findings
in the study to future seasons may be limited, given that minor rule changes are made
each season which may influence interruption durations. For example the ‘multi-ball’
system, whereby there are 9 additional footballs located just off the pitch to quickly
replace a ball which may have gone into the stand, has been newly introduced into the
Premier League for the 2022/23 season (King 2022). Therefore the duration of
throw-ins, corners, goal-kicks and out-of-bounds balls may decrease from previous
years if a football is more readily available than was previously the case.
An alteration to the way the location categorical variable is defined could be of
further interest. Similarly to Zhao and Zhang’s model the pitch could be divided into
six zones, but by splitting the pitch into defensive, midfield and attacking thirds,
and then into central and wide areas. This provides the opportunity to evaluate, for
free kicks, whether durations vary significantly between central and wide locations,
while also retaining the distinction between defensive, midfield and attacking zones.
Such a division of the pitch requires the assumption that interruption durations are
symmetric about the centre of the pitch, which will not be justified in this dissertation.
This method of partitioning the pitch can be visualised in figure B.4 in the appendix.

4.1.2 Alternative Approaches to Modelling Football Interruptions

Siegle & Lames (2012) aim to investigate the effect of variables including the location
and time of the interruption and the score. This poses a similar investigation to Zhao &
Zhang (2021), however using a data set of just 16 matches and with fewer covariates.
Siegle and Lames include all interruption types included by Zhao and Zhang except for
the out-of-bounds ball, yet include three further interruption types: substitutions,
injuries and drop balls. Interruption duration for each type is assumed to follow a normal
distribution, since analysis of variance (ANOVA) and multifactor analysis of variance
(MANOVA) for interaction terms are used to assess the significance of variables.
Crucially, the small sample of matches limits the number of levels by which each
categorical variable can be defined, and the combinations of these included in the model,
without it becoming overparameterised. As a result, conclusions drawn from this study
may be less reliable than if a larger data set had been used. The unjustified combination
of levels within categorical variables in the model, to avoid overparameterisation,
reduces its explanatory power and may complicate the interpretation of its results.

Riedl et al. (2015) use a linear regression model for additional injury time added onto
the end of the second half of football matches in the Bundesliga between 2000/01 and
2010/11. The additional injury time is taken as the response variable. This approach
differs from modelling the duration of interruptions themselves, instead using the
frequency of interruptions (amongst other covariates) in each game to model the time
added on by the referee after 90 minutes have been played. The main aim of this
approach is to assess which covariates are significant in the model and therefore
influence the decision made by referees on how much additional time should be
allocated. Their goal was to identify whether referee bias in injury time allocation
could be established.

4.2 Data and Processing


Similarly to section 3.2, the Wyscout events data (Pappalardo et al. 2019) is used
again, but now to model the interruptions. This is because few freely accessible data
sets provide sufficient variables to model the duration of interruptions, even with just
some of the covariates used in Zhao & Zhang’s model. To minimise model complexity
and the number of assumptions made regarding the similarities between international
knockout competitions and league club football, only the data for the ‘big 5’ European
Leagues is used. The interruption type ‘out-of-bounds’ is derived from the ‘ball out of
the field’ sub-event and occurs when the ball leaves play for a throw-in, corner or goal
kick (Wyscout n.d.). However, for the purpose of this study the duration of this event
will be defined as the time from the ball leaving the field to the time the ball is
provided to the player taking the throw-in, corner or goal kick, at which point the
duration of those interruptions begins.
As in section 3.2, the ‘jsonlite’ package (Ooms 2014) and its ‘fromJSON’ function are
used in R Studio (R Core Team 2020) to extract five data frames (one for each league)
with a complex structure: 10 of the 12 columns were of character, integer, or double
type, whereas each entry in the column ‘positions’ was a 2x2 data frame and the column
‘tags’ was a list in which each entry was a data frame of tags. An example of the
structure of the data frames is given in figure B.5.
The initial position in terms of x and y coordinates was extracted and stored in a data
frame using the ‘getpositions’ function in section C.1.3. The data frame containing
the x1 and y1 columns is then bound to the data frame for each league, and the ‘tags’
and original ‘positions’ columns are removed. To prepare for the ‘League’ covariate
later, a character vector is produced for each league and bound to the data frame. For
example, for the Premier League:

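# Append the extracted x1, y1 columns, then drop the 'tags' and 'positions' columns
eng <- cbind(eng, Positions)
eng <- eng[,-c(3,5)]
# Label every event with its league, ready for the 'League' covariate
eng <- cbind(eng, League = rep("England", nrow(eng)))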

All 5 leagues can be combined into a singular data frame using the ’rbind’ function.

LeagueEvents <- rbind(eng, fra, itl, ger, spa)

This provides a data frame containing all 3,071,395 events which took place in the
1826 matches across the Premier League, Ligue 1, Serie A, Bundesliga and La Liga
in the 2017/18 season. These numbers of events and matches agree with those reported
by Zhao & Zhang (2021), indicating the data has been appropriately processed.
Given the data available and the time restrictions, replicating the covariates game
state, venue and salary ratio from the interruptions model by Zhao and Zhang was not
feasible. However, the response durations and the covariates timing, location, and
league could still be replicated.
The durations of each of the six interruption types should be defined. This is not
specified clearly in Zhao & Zhang (2021); however, for the purpose of this dissertation,
duration shall be defined as the difference in ‘eventSec’ between each interruption
event and the event which preceded it, for example the difference in ‘eventSec’ between
a free kick and the foul or offside which preceded it. The ‘eventSec’ variable records
the time since the beginning of each half to an accuracy of 0.001 seconds. To calculate
the duration of each event type, 6 similar functions were created to iteratively assess
whether each event was the appropriate interruption type. Then the difference in
‘eventSec’ from the previous observation could be computed. For example, the ‘getfk’
function can be found in section C.1.3. Next, the continuous covariate ‘timing’ should
be calculated. To obtain it, the ‘dplyr’ package (Wickham et al. 2022) from
the ‘tidyverse’ (Wickham et al. 2019) was used to identify the ‘eventSec’ of the
last event of the first half for each match. Then, for all events, a mutating left join
was used to append a column of first-half lengths matched on ‘matchId’. This
provided the necessary information to calculate the timing within the game for each
relevant event, using the ‘Timing’ function in section C.1.3.

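library(dplyr)
# Length of the first half per match: the latest 'eventSec' recorded in period "1H"
fhtime <- leaguesevents %>% group_by(matchId) %>%
  filter(matchPeriod == "1H") %>% summarize(Fhtime = max(eventSec))
# Attach the first-half length to every event of the same match
leagueseventst1 <- leaguesevents %>% left_join(fhtime, by = "matchId")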
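The two derived quantities described above, the duration of an interruption and its timing within the match, can be sketched in base R as follows. This is not the ‘getfk’ or ‘Timing’ code of section C.1.3: the toy `events` data frame is invented, the sub-event codes other than 31 are placeholders, and the rule that second-half timing equals the first-half length plus ‘eventSec’ is an assumption about how the covariate is constructed.

```r
# Hypothetical events for a single match; 31 is the free kick sub-event code
# used later in this section, the other codes are placeholders.
events <- data.frame(
  matchId     = c(1, 1, 1, 1, 1, 1),
  matchPeriod = c("1H", "1H", "1H", "2H", "2H", "2H"),
  eventSec    = c(10.0, 25.5, 2700.0, 5.0, 40.0, 61.2),
  subEventId  = c(20, 31, 85, 85, 20, 31),
  stringsAsFactors = FALSE
)

# Duration: difference in eventSec from the previous event in the same half.
events <- events[order(events$matchId, events$matchPeriod, events$eventSec), ]
half   <- interaction(events$matchId, events$matchPeriod, drop = TRUE)
events$prevSec  <- ave(events$eventSec, half, FUN = function(x) c(NA, head(x, -1)))
events$duration <- events$eventSec - events$prevSec

# First-half length per match, joined back to every event (a left join).
fh <- aggregate(eventSec ~ matchId,
                data = events[events$matchPeriod == "1H", ], FUN = max)
names(fh)[2] <- "Fhtime"
events <- merge(events, fh, by = "matchId", all.x = TRUE)

# Timing: seconds since kick-off, assuming the second half continues the clock.
events$Timing <- ifelse(events$matchPeriod == "1H",
                        events$eventSec, events$Fhtime + events$eventSec)

fk <- events[events$subEventId == 31, c("duration", "Timing")]
```

The first event of each half has no preceding event in that half, so its duration is left as missing rather than being measured against the end of the previous half.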

The covariate ‘location’ is only relevant for free kicks, throw-ins and out-of-bounds
balls, since the remaining interruption types are each taken from only one area: goal
kicks (area 5), corners (area 2) and penalties (area 1). Following the areas depicted in
figure B.4, the function ‘fkarea’ in section C.1.3 demonstrates how each zone was
allocated for each free kick interruption. A full data frame denoted ‘LeagueEventsdata’
containing all 22 variables, including durations for each interruption type and the
areas for free kicks, throw-ins and out-of-bounds balls, was created for events data
across all five leagues.
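A zone allocation of this kind can be sketched as below. This is not the ‘fkarea’ function of section C.1.3. The function name `area_from_xy`, the boundaries at one third and two thirds of the pitch length, the width of the wide channels (`wide_margin`), and the assumption that x and y are percentages from 0 to 100 with x increasing towards the opponent's goal are all illustrative assumptions. The labels for areas 3 and 5 (central midfield and central defensive) are inferred from the descriptions of the other areas in this chapter.

```r
# Assumed numbering: 1 central attacking, 2 wide attacking, 3 central midfield,
# 4 wide midfield, 5 central defensive, 6 wide defensive.
area_from_xy <- function(x, y, wide_margin = 20) {
  # Split the pitch length into defensive, midfield and attacking thirds.
  third <- cut(x, breaks = c(-Inf, 100 / 3, 200 / 3, Inf),
               labels = c("defensive", "midfield", "attacking"))
  # Both flanks are treated as 'wide', reflecting the symmetry assumption.
  wide <- y < wide_margin | y > 100 - wide_margin
  central_area <- c(defensive = 5, midfield = 3, attacking = 1)
  # The wide area of each third is numbered one above its central area.
  area <- unname(central_area[as.character(third)]) + as.integer(wide)
  factor(area, levels = 1:6)
}

# One example location per area (hypothetical coordinates).
example_areas <- area_from_xy(x = c(90, 90, 50, 50, 10, 10),
                              y = c(50,  5, 50, 95, 50, 10))
```

Returning a factor with all six levels keeps the coding consistent across interruption types, even where some areas cannot occur for a given type.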
Prior to sub-setting the data frame containing all events, the variables ‘League’,
‘FK Area’, ‘TI Area’ and ‘OOB Area’ were converted to ‘factor’ type to ensure the
GLMs for each interruption type would have an appropriate structure.

LeagueEventsdata <- mutate_at(LeagueEventsdata, vars(League, FK_Area,
  TI_Area, OOB_Area), as.factor)

Six smaller data frames were created, one for each interruption type, and sorted to
ensure they followed the tidy data principles of Wickham & Grolemund (2017), discussed
in section 2.2. For example, again for free kicks:

FK1 <- subset(LeagueEventsdata, subEventId == 31)
FK2 <- subset(LeagueEventsdata, subEventId == 32)
FK3 <- subset(LeagueEventsdata, subEventId == 33)
FreeKicks <- as.data.frame(rbind(FK1, FK2, FK3))
FreeKicks <- FreeKicks[,-c(14:19)]

Columns which contained information regarding the zones or durations of other
interruption types, and which consisted entirely of missing values, were removed.
Checks for other missing values found only small numbers, such as 13 out-of-bounds
balls, which were removed; since 129,165 observations remained, the exclusion of these
is unlikely to have a substantial impact on the results.
Each of these tidy data frames was saved as an ‘Rda’ file so it can be easily and
efficiently read into R for modelling. For example, with free kicks:

save(FreeKicks, file = "FreeKicks.Rda")

The data comprised 80,305 throw-ins, 29,725 goal kicks, 18,181 corners, 53,716 free
kicks, 541 penalties and 129,165 out-of-bounds balls. The mean match had 44 throw-ins,
16 goal kicks, 10 corners, 29 free kicks, 0.3 penalties and 71 out-of-bounds balls. Using
the mean duration of each of these, a mean total interruption time for each game can
be calculated. This value is approximately 2176 seconds, or 36 minutes and 16 seconds.
Further detailed summary statistics for the frequency and duration of these
interruptions can be found in section 3.1 and table 1 of Zhao & Zhang (2021), since
the data appears to have been processed so that exactly the same data is obtained.
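The per-match frequencies and the conversion of the total into minutes and seconds can be checked with a few lines of R, using only the counts quoted above. The helper `total_time` is a hypothetical illustration of how the total would be formed; the mean durations themselves are not reproduced here, so it is defined but not evaluated.

```r
# Interruption counts across the 1826 matches, as quoted in the text.
counts <- c(throw_ins = 80305, goal_kicks = 29725, corners = 18181,
            free_kicks = 53716, penalties = 541, out_of_bounds = 129165)
matches <- 1826

# Mean number of each interruption type per match.
per_match <- counts / matches
round(per_match, 1)

# Mean total interruption time per match: sum over types of
# (mean count per match) x (mean duration in seconds).
total_time <- function(per_match, mean_duration) sum(per_match * mean_duration)

# Converting the quoted total of 2176 seconds into minutes and seconds.
total_sec <- 2176
c(minutes = total_sec %/% 60, seconds = total_sec %% 60)
```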
The GLMs for each interruption type, with the response duration following a gamma
distribution with log link, are fitted using the ‘glm’ function in R. For example, free
kicks were fitted as follows:

fitFK <- glm(FK_Durations ~ Timing + FK_Area + League, family =
  Gamma(link = "log"), data = FreeKicks)

4.3 Results and Discussion


Full details of the parameter estimates and their significance for each interruption type
can be found in tables B.5, B.6, B.7, B.8, B.9 and B.10. Significance is assessed at the
5% level and all t tests are two-sided.

4.3.1 Timing Effect

Timing was significantly positively associated with mean throw-in and goal kick
durations, yet negatively associated with mean corner duration, as was the case in
Zhao & Zhang (2021). However, timing does not have a significant effect on the duration
of free kicks, which contradicts the results of Zhao and Zhang. A possible explanation
for this is the difference in parameterisation of the models. Using the fitted models for
throw-ins and goal kicks, each additional second of match time would increase mean
throw-in duration by 0.0072% ($e^{\beta} = 1.000072$) and mean goal kick duration by
0.0035% ($e^{\beta} = 1.000035$), whereas corners decrease in mean duration by 0.0013%
($1 - e^{\beta} = 1 - 0.999987$) with each additional second.
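Because the model uses a log link, each coefficient acts multiplicatively on the mean duration. The short sketch below converts the quoted multipliers into percentage changes per second and, as an illustration only, compounds them over 5400 seconds (90 minutes). The compounding assumes the timing effect is constant throughout the match and ignores any interaction terms; the object names are hypothetical.

```r
# Multiplicative effect on mean duration of one additional second, exp(beta),
# using the values quoted in the text.
mult <- c(throw_in = 1.000072, goal_kick = 1.000035, corner = 0.999987)

# Percentage change in mean duration per additional second.
pct_per_sec <- (mult - 1) * 100

# Implied multiplier between kick-off and the 90th minute (5400 seconds),
# assuming a constant effect over the whole match.
mult_90 <- mult^5400
round(mult_90, 3)
```

Although the per-second effects look negligible, compounding them over a whole match shows that the implied difference between early and late interruptions need not be small.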

4.3.2 Area (Location) Effect

For free kicks, areas 2-6 all had a significant negative association relative to area 1
(central attacking), suggesting a decrease in mean duration when free kicks were taken
from other areas.
Throw-ins are only possible from areas 2 (wide attacking), 4 (wide midfield), and 6
(wide defensive). Both areas 4 and 6 had a significant positive association relative to
area 2, and the parameter estimate for area 6 is greater than that for area 4
(0.239 > 0.0349), suggesting an increase in mean throw-in duration as throw-ins are
taken from more defensive positions.
Out-of-bounds balls are only possible from areas which border the touch-line or
goal-line, therefore all zones excluding zone 3 are included in the model. All of the
remaining areas (2, 4, 5 and 6) were significantly negatively associated relative to
area 1, suggesting a decrease in the mean duration of out-of-bounds balls from areas
2, 4, 5 and 6 when compared with area 1.

4.3.3 League Effect

The mean duration of Premier League free kicks was significantly longer than in the
Bundesliga, Serie A and La Liga, but was not significantly different from Ligue 1.
Throw-ins had greater mean duration in the Premier League (12.3s) than in Ligue 1
(9.98s), the Bundesliga (10.5s), Serie A (10.3s), and La Liga (10.8s). Out-of-bounds
balls and goal kicks were also longer in the Premier League than in the other 4
European leagues.
Mean corner duration in the Bundesliga (22.3s) was significantly longer than in the
Premier League (21.3s); however, the model used, with only timing and league
covariates, did not detect the same significant difference between the Premier League
and Ligue 1 as found by Zhao & Zhang (2021).
Mean penalty duration was shortest in the Premier League (56.6s). The mean penalty
durations in Serie A (104.7s) and the Bundesliga (85.3s) were the largest of the 5
leagues, with Ligue 1 and La Liga having mean durations of 73.3s and 67.0s
respectively.

4.3.4 Area and Timing Interaction Effect

The area and timing interactions for areas 3, 4, 5 and 6 had significant positive
coefficient estimates for free kicks, suggesting an increase in mean free kick duration
in areas 3, 4, 5 and 6 as the match progressed, in comparison to area 1.
For mean throw-in duration, the interaction between timing and area 6 had a
significant negative coefficient estimate. This suggests that as the match develops,
mean throw-in duration in area 6 decreases by 0.00214% ($1 - e^{\beta} = 0.0000214$)
with each additional second, relative to throw-ins in the reference area.
The interactions between timing and areas 2, 4 and 6 were significant for the mean
duration of out-of-bounds balls. Since the estimated coefficients of these interaction
terms are positive, the model suggests there is an increase in out-of-bounds ball mean
duration as the game progresses in areas 2, 4, and 6 in comparison to area 1.

4.3.5 Interruption Results Discussion

Firstly, the increase in mean duration as the game progresses for throw-ins could be
related to player fatigue. As a game progresses, players are likely to have run longer
distances, which may reduce the speed at which they arrive at the touchline to take a
throw-in, or lead players to take more time in order to rest. Similarly, goal kicks may
provide an opportunity for a goalkeeper to allow outfield players a brief interval to
rest. In contrast, the mean duration of corners is seen to decrease. Although a corner
also provides an opportunity to rest, it also provides a goal scoring opportunity, with
3.4% of corners resulting in a goal in the Premier League during seasons
2010/11 - 2020/21 (Jones & Carey 2021). As a match progresses a team which is
trailing may be more desperate for a goal, thus taking a corner quickly, before the
defending team is afforded time to compose itself, may provide an increased
opportunity to score.
As discussed earlier in section 3.3.2, the closer and more central to the goal a
shot is taken from, the better the opportunity of scoring a goal. This may be a cause of
the increased mean duration of free kicks from area 1 (central attacking) compared to
the other areas, since there is an opportunity to shoot on goal and therefore a greater
amount of time is taken for the attacking team to compose the shot.
Zhao & Zhang (2021) highlight in section 4.4 that a possible reason for the significant
increase in mean duration of throw-ins in more defensive areas is the increased risk of
conceding a goal, should the ball be lost in a defensive position such as area 6.
However, an alternative supporting argument could be the increased opportunity for
the attacking team with a throw-in from area 2 to take the throw-in quickly, thus
minimising the time available for the defending team to compose itself.
Perhaps the most interesting of the analysis available, given the limited covariates in
this investigation, may be the effects of different leagues on the duration of each
interruption type. The mean duration of Premier League free kick, goal kick, throw-in
and out-of-bounds ball interruptions was significantly greater than for the remaining
European leagues. This agrees with the results in section 3.7 of Zhao & Zhang (2021),
and with their suggestion that faster paced play and greater distances run at high
intensities in the Premier League could result in more fatigued players. Therefore the
increased duration for these types of interruption, as a result of the increased demand
for rest among Premier League players, appears a reasonable explanation.
However, when discussing the differences in mean penalty duration, Zhao and Zhang
suggest the choice of penalty placement in different leagues as a possible explanation
for the differences in mean duration between leagues. A more plausible alternative
explanation for the significantly larger mean duration of penalties in Serie A and the
Bundesliga could be the introduction of VAR in those leagues, in comparison to
the remaining three which did not use VAR during the 2017/18 season. The smaller
differences between the Premier League, Ligue 1 and La Liga (maximum difference of
16.7s) may be explained in part by the theory of cultural attitudes towards penalty
taking discussed by Jamil et al. (2020), since it is concluded that longer run-ups are
preferred by players in La Liga and Serie A than in the Premier League and
Bundesliga. However, the substantially greater mean penalty durations in Serie A and
the Bundesliga than in La Liga and the Premier League for this data could be a result
of VAR checks and reviews on penalty decisions. An article published by The Stats
Zone (2019) stated that the median duration of a VAR check is 20 seconds, whereas a
VAR review has a median duration of 35 seconds. Therefore the differences in mean
penalty duration between leagues with and without VAR, which range from 12.0s
(Ligue 1 to the Bundesliga) to 48.1s (the Premier League to Serie A), may be a
consequence of the introduction of VAR.
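The differences quoted in this paragraph can be checked directly from the league means reported in section 4.3.3. The sketch below uses only those five values; the object names are hypothetical.

```r
# Mean penalty durations in seconds, as reported in section 4.3.3.
penalty_mean <- c(PremierLeague = 56.6, Ligue1 = 73.3, LaLiga = 67.0,
                  Bundesliga = 85.3, SerieA = 104.7)

var_leagues <- c("Bundesliga", "SerieA")              # used VAR in 2017/18
no_var      <- setdiff(names(penalty_mean), var_leagues)

# Pairwise differences: each VAR league minus each league without VAR.
diffs <- outer(penalty_mean[var_leagues], penalty_mean[no_var], "-")
round(diffs, 1)

# Smallest and largest gap between a VAR league and a league without VAR.
gap_range <- range(diffs)

# Largest gap among the three leagues without VAR.
max_no_var_gap <- diff(range(penalty_mean[no_var]))
```

The smallest gap is between the Bundesliga and Ligue 1, and the largest between Serie A and the Premier League, while the largest gap among the leagues without VAR is between Ligue 1 and the Premier League.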
A future study using data from more recent seasons than 2017/18, where all leagues
had VAR implemented, contrasted with the same leagues prior to the implementation
of VAR, could provide key insight into the magnitude of the effect of VAR on
interruption duration.
A different area for further study could seek to model the frequency and duration of
injuries and substitutions, with a view to assessing whether their frequency or
duration is used strategically by teams to manage the time in a game to gain an
advantage over opponents.
Chapter 5

Concluding Discussion

The Dixon & Coles (1997) model appears to effectively distinguish between the
offensive and defensive qualities of stronger and weaker teams, not only in English
football but also across other elite European leagues. Typically the teams which would
finish higher in the tables at the end of the 2021/22 season would have larger values of
α and smaller values of β. As team quality weakens, estimates of α would decrease and
estimates of β would increase, which corresponds to estimating a smaller number of
goals scored and a greater number of goals conceded by weaker teams. With the
forecasts tested there is a promising indication that a betting strategy could lead to a
positive financial gain for the bettor; however, due to the time limitations of this
dissertation this could not be assessed with a sufficiently large sample collected over
time.
A future area for studying the modelling of football scores using attack and defence
parameters could be a Bivariate Poisson model, with an explicitly defined discrete
distribution, to counteract the underestimation of low-scoring draws. The primary
benefit would be modelling the covariance between home and away goals within
the Bivariate Poisson distribution, as opposed to using a dependence function.

Both the zonal xG model described by Rathke (2017) and the distance and
angle-to-goal logistic regression model of Gómez (2020) provide an adequate fit to the
large data set of shots from elite European and world football. These models both
indicate that xG increases as shots are taken from closer and more central positions to
the goal. However, the logistic regression model with distance and angle as covariates
appears to fit the data more closely when tested on the test set of 1638 shots.


Future investigations may be interested in including covariates which quantify the
overall or attacking quality of a team, therefore challenging the assumption of
homogeneity across teams in current xG models.

The approach of Zhao & Zhang (2021) appears to effectively model the different types
of interruptions included in their study. Many interesting and significant differences
were detected by their investigation into interruption durations, in particular the
longer mean duration of many Premier League interruptions when compared with
other elite European leagues. The study could also provide more robust conclusions
than previous work, due to the large sample size available from the Wyscout data
(Pappalardo et al. 2019). However, the study is ultimately limited in terms of
interpretation without having data in which all or none of the leagues were using VAR.
Similarly, the investigation completed in this dissertation provides less powerful
insight, since covariates such as game state and salary ratio were not attainable under
the time constraint of submission.
Future approaches which instead investigate how the frequency of interruption
occurrences can be modelled using similar sets of covariates could provide a fuller view
of how teams strategically use interruptions to manage the time in a game.
References

Betfair (2022), ‘English Premier League’,
https://www.betfair.com/sport/football/english-premier-league/10932509. [Accessed 09/08/2022].

Boshnakov, G., Kharrat, T. & McHale, I. G. (2017), ‘A bivariate Weibull count
model for forecasting association football scores’, International Journal of
Forecasting 33(2), 458–466.

Brier, G. W. (1950), ‘Verification of forecasts expressed in terms of probability’,
Monthly Weather Review 78(1), 1–3.

Bryson, A., Frick, B. & Simmons, R. (2013), ‘The returns to scarce talent: footedness
and player remuneration in European soccer’, Journal of Sports Economics
14(6), 606–628.

Butler, R. & Massey, P. (2019), ‘Has competition in the market for subscription sports
broadcasting benefited consumers? The case of the English Premier League’, Journal
of Sports Economics 20(4), 603–624.

Carnicero, J. V. T. (2022), ‘Spain - list of champions’,
https://www.rsssf.org/tabless/spanchamp.html. [Accessed 22/08/2022].

Casamayor, J. (2018), ‘Tebas: With VAR, there will be more fairness in football’,
https://www.marca.com/en/football/spanish-football/2018/03/02/5a996dfc22601de9518b464f.html. [Accessed 18/08/2022].

Cox, M. (2018), ‘Football’s dark arts: Your guide to the set-piece trickery, diving
and sneaky fouls that make a difference’,
https://www.espn.co.uk/football/english-premier-league/23/blog/post/3710171/footballs-dark-arts-your-guide-to-the-set-piece-trickery-diving-and-sneaky-fouls-that-make-a-difference. [Accessed 18/08/2022].

Data Hub (2022), ‘Spanish La Liga (football)’,
https://datahub.io/sports-data/spanish-la-liga#resource-season-1718. [Accessed 04/07/2022].

Deutscher, C., Ötting, M., Schneemann, S. & Scholten, H. (2019), ‘The demand for
English Premier League soccer betting’, Journal of Sports Economics 20(4), 556–579.

Dixon, M. J. & Coles, S. G. (1997), ‘Modelling association football scores and
inefficiencies in the football betting market’, Journal of the Royal Statistical Society:
Series C (Applied Statistics) 46(2), 265–280.

Ekstrand, J., Waldén, M. & Hägglund, M. (2004), ‘A congested football calendar and
the wellbeing of players: correlation between match exposure of European footballers
before the World Cup 2002 and their injuries and performances during that World
Cup’, British Journal of Sports Medicine 38(4), 493–497.

FIFA (2012), ‘Interpretation of the laws of the game and guidelines for referees’,
http://www.fifa.com/mm/document/worldfootball/clubfootball/01/37/04/29/interpretation law13 en.pdf. [Accessed 03/08/2022].

Fixture Download (2022), ‘Download football/soccer fixtures, schedules and results’,
https://fixturedownload.com/sport/football. [Accessed 27/06/2022].

Football History (2022), ‘The football field and its dimensions’,
https://www.footballhistory.org/field.html. [Accessed 27/07/2022].

Glen, S. (2022), ‘Brier score: Definition, examples’,
https://www.statisticshowto.com/brier-score/. [Accessed 26/07/2022].

Green, S. (2012), ‘Assessing the performance of premier league goalscorers’,
https://www.statsperform.com/resource/assessing-the-performance-of-premier-league-goalscorers/. [Accessed 03/08/2022].

Gómez, I. (2020), ‘Fitting your own football xG model’,
https://www.datofutbol.cl/xg-model/. [Accessed 02/08/2022].
REFERENCES 71

Holgate, P. (1964), ‘Estimation for the bivariate Poisson distribution’, Biometrika


51(1-2), 241–287.

Jamil, M., Littman, P. & Beato, M. (2020), ‘Investigating inter-league and inter-
nation variations of key determinants for penalty success across European football’,
International Journal of Performance Analysis in Sport 20(5), 892–907.

Jay, M. (2019), generalhoslem: Goodness of Fit Tests for Logistic Regression Models.
R package version 1.3.4.
URL: https://CRAN.R-project.org/package=generalhoslem

Johnson, D. (2021), ‘The ultimate guide to VAR in the Premier League - all your ques-
tions answered’, https://www.espn.co.uk/football/english-premier-league/
story/3925549/the-ultimate-guide-to-var-in-the-premier-league-all-
your-questions-answered. [Accessed 18/08/2022].

Johnson, D. (2022), ‘How VAR has changed the Premier League, from penalties to
offside and handball’, https://www.espn.co.uk/football/english-premier-
league/story/4675887/how-var-has-changed-the-premier-leaguefrom-
penalties-to-offside-and-handball. [Accessed 18/08/2022].

Jones, A. & Carey, M. (2021), ‘Why outswinging corners lead to more chances but
inswingers lead to more goals’, https://theathletic.com/2911374/2021/10/28/
why-outswinging-corners-lead-to-more-chances-but-inswingers-lead-to-
more-goals/. [Accessed 20/08/2022].

Kaggle (2018), ‘Bundesliga results 1993-2018’, https://www.kaggle.com/datasets/
thefc17/bundesliga-results-19932018. [Accessed 27/06/2022].

Kaggle (2020), ‘Ligue 1 results - 1999 to 2019’, https://www.kaggle.com/datasets/
brunoo/ligue-1-results-1999-to-2019. [Accessed 01/07/2022].

Kaggle (2022), ‘Serie A matches dataset’, https://www.kaggle.com/datasets/
giovannicarlozzi/serie-a-matches-dataset?select=135_2021.csv. [Accessed
28/06/2022].

Karlis, D. & Ntzoufras, I. (2003), ‘Analysis of sports data by using bivariate Pois-
son models’, Journal of the Royal Statistical Society: Series D (The Statistician)
52(3), 381–393.

King, B. (2022), ‘Liverpool may have new advantage over rivals following quiet Premier
League rule change’, https://www.sportbible.com/football/liverpool-news-
premier-league-man-city-20220720. [Accessed 18/08/2022].

Kohli, S. (2017), ‘VAR: The good, the bad and the ugly’, https://edition.
cnn.com/2017/07/04/football/video-assistant-referee-technology-var-
chris-foy/index.html. [Accessed 18/08/2022].

La Liga (2022), ‘Fixture: Atlético de Madrid vs Levante UD’, https:
//www.laliga.com/en-GB/match/temporada-2021-2022-laliga-santander-
atletico-de-madrid-levante-ud-21. [Accessed 04/07/2022].

Labellarte, G. (2018), ‘Ligue de Football Professionnel approves Video Assistant
Referee use’, https://www.sportsmole.co.uk/football/news/french-football-
league-approves-var-use_314125.html. [Accessed 18/08/2022].

Maher, M. J. (1982), ‘Modelling association football scores’, Statistica Neerlandica
36(3), 109–118.

Ooms, J. (2014), ‘The jsonlite package: A practical and consistent mapping between
JSON data and R objects’, arXiv:1403.2805 [stat.CO].
URL: https://arxiv.org/abs/1403.2805

Owen, A. (2011), ‘Dynamic Bayesian forecasting models of football match outcomes
with estimation of the evolution variance parameter’, IMA Journal of Management
Mathematics 22(2), 99–113.

Pappalardo, L., Cintia, P., Rossi, A., Massucco, E., Ferragina, P., Pedreschi, D. &
Giannotti, F. (2019), ‘A public data set of spatio-temporal match events in soccer
competitions’, Scientific Data 6(1), 1–15.

Peeters, T. & van Ours, J. C. (2021), ‘Seasonal home advantage in English professional
football; 1974–2018’, De Economist 169(1), 107–126.

Premier League (2021), ‘Man City v Norwich, Premier League 2021/22’, https://
www.premierleague.com/match/66358. [Accessed 15/07/2022].

Premier League (2022a), ‘Liverpool FC scores, results & season archives - Premier
League’, https://www.premierleague.com/clubs/10/Liverpool/results?co=
1&se=418. [Accessed 21/08/2022].

Premier League (2022b), ‘Premier League: results’, https://www.premierleague.
com/. [Accessed 08/08/2022].

R Core Team (2020), R: A Language and Environment for Statistical Computing, R
Foundation for Statistical Computing, Vienna, Austria.

Rathke, A. (2017), ‘An examination of expected goals and shot efficiency in soccer’,
Journal of Human Sport and Exercise 12(2), 514–529.

Riedl, D., Strauss, B., Heuer, A. & Rubner, O. (2015), ‘Finale furioso: referee-biased
injury times and their effects on home advantage in football’, Journal of Sports
Sciences 33(4), 327–336.

Siegle, M. & Lames, M. (2012), ‘Game interruptions in elite soccer’, Journal of Sports
Sciences 30(7), 619–624.

Sky Sports (2022a), ‘Bundesliga table 2021/22 season’, https://www.skysports.
com/bundesliga-table/2021. [Accessed 15/07/2022].

Sky Sports (2022b), ‘La Liga table 2021/22 season’, https://www.skysports.com/
la-liga-table/2021. [Accessed 15/07/2022].

Sky Sports (2022c), ‘Ligue 1 table 2021/22 season’, https://www.skysports.com/
ligue-1-table/2021. [Accessed 15/07/2022].

Sky Sports (2022d), ‘Serie A table 2021/22 season’, https://www.skysports.com/
serie-a-table/2021. [Accessed 15/07/2022].

Stats Perform (2022), ‘Opta data from Stats Perform’, https://www.statsperform.
com/opta/. [Accessed 20/07/2022].

The Stats Zone (2019), ‘The statistics that show why VAR should be accepted
by the FA’, https://www.thestatszone.com/archive/the-statistics-that-
show-why-var-should-be-accepted-by-the-fa. [Accessed 20/08/2022].

Top Media Advertising (n.d.), ‘Football viewing figures’, https://
topmediadvertising.co.uk/football-viewing-figures/. [Accessed
20/08/2022].

Torvaney, B. (2020), ggsoccer: Plot Soccer Event Data. R package version 0.1.6.
URL: https://CRAN.R-project.org/package=ggsoccer

Torvaney, B. (2022), regista: Soccer Analytics.
URL: http://regista.statsandsnakeoil.com, https://github.com/Torvaney/regista

Trivedi, P. & Zimmer, D. (2017), ‘A note on identification of bivariate copulas for
discrete count data’, Econometrics 5(1), 10.

van Der Wurp, H., Groll, A., Kneib, T., Marra, G. & Radice, R. (2020), ‘Gener-
alised joint regression for count data: a penalty extension for competitive settings’,
Statistics and Computing 30(5), 1419–1432.

West, M., Harrison, P. J. & Migon, H. S. (1985), ‘Dynamic generalized linear
models and Bayesian forecasting’, Journal of the American Statistical Association
80(389), 73–83.

Whitmore, J. (2021), ‘What are expected goals (xG)?’, https://theanalyst.com/
eu/2021/07/what-are-expected-goals-xg/. [Accessed 03/08/2022].

Wickham, H. (2016), ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag
New York.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R.,
Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller,
E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V.,
Takahashi, K., Vaughan, D., Wilke, C., Woo, K. & Yutani, H. (2019), ‘Welcome to
the tidyverse’, Journal of Open Source Software 4(43), 1686.

Wickham, H., François, R., Henry, L. & Müller, K. (2022), dplyr: A Grammar of Data
Manipulation. R package version 1.0.9.
URL: https://CRAN.R-project.org/package=dplyr

Wickham, H. & Grolemund, G. (2017), R for Data Science: Import, Tidy, Transform,
Visualize, and Model Data, 1st edn, O’Reilly Media, Inc.

Wyscout (n.d.), ‘Wyscout main events description’, https://footballdata.
wyscout.com/events-manual/. [Accessed 20/08/2022].

Zhao, Y. & Zhang, H. (2021), ‘Investigating the inter-country variations in game
interruptions across the Big-5 European football leagues’, International Journal of
Performance Analysis in Sport 21(1), 180–196.
Appendix A

Appendix

A.1 Derivations
For simplicity the i,j subscripts are dropped to allow the reader to more clearly follow
equations.
Bivariate Poisson probability mass function (pmf) (Karlis & Ntzoufras 2003): Since
$(X, Y) = (V + W, U + W)$, the event $(X, Y) = (x, y)$ is the disjoint union over $w$ of
the events $(W, V, U) = (w, x - w, y - w)$. By the independence of $U$, $V$ and $W$ the
probabilities multiply, giving
\[
\mathbb{P}(X = x, Y = y) = \sum_{w=0}^{\min(x,y)} \mathbb{P}(W = w)\,\mathbb{P}(V = x - w)\,\mathbb{P}(U = y - w). \tag{A.1}
\]

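Substituting the Poisson probabilities gives:
\[
\mathbb{P}(X = x, Y = y) = \sum_{w=0}^{\min(x,y)}
\left(e^{-\eta}\,\frac{\eta^{w}}{w!}\right)
\left(e^{-(\lambda-\eta)}\,\frac{(\lambda-\eta)^{x-w}}{(x-w)!}\right)
\left(e^{-(\mu-\eta)}\,\frac{(\mu-\eta)^{y-w}}{(y-w)!}\right). \tag{A.2}
\]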

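Rearranging and simplifying the sum using the binomial coefficient
$\binom{x}{w} = \frac{x!}{(x-w)!\,w!}$, and similarly for $y$, the pmf can be described by:
\[
\mathbb{P}(X = x, Y = y) = \exp\big(-\eta - (\lambda-\eta) - (\mu-\eta)\big)\,
\frac{(\lambda-\eta)^{x}}{x!}\,\frac{(\mu-\eta)^{y}}{y!}
\sum_{w=0}^{\min(x,y)} \binom{x}{w}\binom{y}{w}\, w!
\left(\frac{\eta}{(\lambda-\eta)(\mu-\eta)}\right)^{w}. \tag{A.3}
\]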
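As a numerical check on the rearrangement, the following minimal R sketch evaluates the pmf both as the direct sum in (A.1) and in the rearranged form (A.3). The function names `bvp_pmf_direct` and `bvp_pmf_closed` and the parameter values are illustrative assumptions, not taken from the dissertation's code; it assumes $0 \le \eta < \min(\lambda, \mu)$.

```r
# Hypothetical helper names; parameter values are illustrative only.
# Direct sum over w, as in (A.1)/(A.2).
bvp_pmf_direct <- function(x, y, lambda, mu, eta) {
  w <- 0:min(x, y)
  sum(dpois(w, eta) * dpois(x - w, lambda - eta) * dpois(y - w, mu - eta))
}

# Rearranged form with binomial coefficients, as in (A.3).
bvp_pmf_closed <- function(x, y, lambda, mu, eta) {
  w <- 0:min(x, y)
  lead <- exp(-eta - (lambda - eta) - (mu - eta)) *
    (lambda - eta)^x / factorial(x) * (mu - eta)^y / factorial(y)
  lead * sum(choose(x, w) * choose(y, w) * factorial(w) *
               (eta / ((lambda - eta) * (mu - eta)))^w)
}

lambda <- 1.6; mu <- 1.1; eta <- 0.2
grid <- expand.grid(x = 0:25, y = 0:25)
args <- list(lambda = lambda, mu = mu, eta = eta)
direct <- mapply(bvp_pmf_direct, grid$x, grid$y, MoreArgs = args)
closed <- mapply(bvp_pmf_closed, grid$x, grid$y, MoreArgs = args)

max_diff <- max(abs(direct - closed))  # the two forms should agree
total    <- sum(direct)                # should be very close to 1
```

Both forms should agree to floating-point precision, and the probabilities over a sufficiently large grid of scores should sum to approximately one.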


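Expectations of home and away goals:
\[
\mathbb{E}[X] = \mathbb{E}[V + W] = \mathbb{E}[V] + \mathbb{E}[W]
= \lambda_{i,j} - \eta_{i,j} + \eta_{i,j} = \lambda_{i,j} = \exp(\alpha_i + \beta_j + \gamma),
\]
and similarly $\mathbb{E}[Y] = \mu_{i,j} = \exp(\alpha_j + \beta_i)$.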
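The expectation result can be illustrated by simulating the construction $X = V + W$, $Y = U + W$ directly. This is a minimal sketch with illustrative parameter values and a fixed seed (both assumptions); it also reports the sample covariance, which for this construction equals $\mathrm{Var}(W) = \eta$.

```r
# Illustrative values only; not estimates from the dissertation.
set.seed(1)
lambda <- 1.6; mu <- 1.1; eta <- 0.2
n <- 200000

W <- rpois(n, eta)            # shared component
V <- rpois(n, lambda - eta)   # home-only component
U <- rpois(n, mu - eta)       # away-only component
X <- V + W
Y <- U + W

sim_summary <- c(mean_X = mean(X), mean_Y = mean(Y), cov_XY = cov(X, Y))
sim_summary
```

The sample means should lie close to $\lambda$ and $\mu$, and the sample covariance close to $\eta$, up to simulation error.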


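Bivariate Poisson pmf in full (in terms of the parameters to be estimated):
$\sqrt{\exp(\alpha_i + \beta_j + \gamma + \alpha_j + \beta_i)}$ is written as $\Gamma$, with
$\eta_{i,j} = \rho\Gamma$, so the sum term can be written on one line.
\[
\begin{aligned}
\mathbb{P}(X_{i,j} = x_{i,j}, Y_{i,j} = y_{i,j}) ={}&
\exp\big(-(\exp(\alpha_i + \beta_j + \gamma) + \exp(\alpha_j + \beta_i) - \rho\Gamma)\big)\,
\frac{(\exp(\alpha_i + \beta_j + \gamma) - \rho\Gamma)^{x_{i,j}}}{x_{i,j}!}\,
\frac{(\exp(\alpha_j + \beta_i) - \rho\Gamma)^{y_{i,j}}}{y_{i,j}!} \\
&\times \sum_{w_{i,j}=0}^{\min(x_{i,j},\,y_{i,j})}
\binom{x_{i,j}}{w_{i,j}}\binom{y_{i,j}}{w_{i,j}}\, w_{i,j}!
\left(\frac{\rho\Gamma}{(\exp(\alpha_i + \beta_j + \gamma) - \rho\Gamma)(\exp(\alpha_j + \beta_i) - \rho\Gamma)}\right)^{w_{i,j}}
\end{aligned} \tag{A.4}
\]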

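Full bivariate Poisson likelihood function in terms of the parameters over which it is maximised:
\[
\begin{aligned}
L(\alpha, \beta, \rho, \gamma \mid x, y) = \prod_{i=1}^{n} \prod_{j \neq i} \Bigg[ &
\exp\big(-(\exp(\alpha_i + \beta_j + \gamma) + \exp(\alpha_j + \beta_i) - \rho\Gamma)\big)\,
\frac{(\exp(\alpha_i + \beta_j + \gamma) - \rho\Gamma)^{x_{i,j}}}{x_{i,j}!}\,
\frac{(\exp(\alpha_j + \beta_i) - \rho\Gamma)^{y_{i,j}}}{y_{i,j}!} \\
&\times \sum_{w_{i,j}=0}^{\min(x_{i,j},\,y_{i,j})}
\binom{x_{i,j}}{w_{i,j}}\binom{y_{i,j}}{w_{i,j}}\, w_{i,j}!
\left(\frac{\rho\Gamma}{(\exp(\alpha_i + \beta_j + \gamma) - \rho\Gamma)(\exp(\alpha_j + \beta_i) - \rho\Gamma)}\right)^{w_{i,j}} \Bigg]
\end{aligned} \tag{A.5}
\]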
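A likelihood of this form can be evaluated numerically as a negative log-likelihood. The sketch below is one possible way to do so and is not the code used in the dissertation: the function name `bvp_negloglik`, the toy fixtures and the parameter values are all assumptions made for illustration. As a sanity check, setting $\rho = 0$ should reduce the model to two independent Poisson distributions.

```r
# Hypothetical helper; home/away are integer team indices, x/y the goals scored.
bvp_negloglik <- function(alpha, beta, gamma, rho, home, away, x, y) {
  lambda <- exp(alpha[home] + beta[away] + gamma)  # expected home goals
  mu     <- exp(alpha[away] + beta[home])          # expected away goals
  eta    <- rho * sqrt(lambda * mu)                # rho * Gamma
  # The construction needs 0 <= eta < min(lambda, mu) for every match.
  if (any(eta < 0) || any(eta >= pmin(lambda, mu))) return(Inf)
  ll <- mapply(function(x, y, l, m, e) {
    w <- 0:min(x, y)
    log(sum(dpois(w, e) * dpois(x - w, l - e) * dpois(y - w, m - e)))
  }, x, y, lambda, mu, eta)
  -sum(ll)
}

# Toy example: three teams playing each other home and away (made-up scores).
alpha <- c(0.3, 0.0, -0.2)
beta  <- c(-0.1, 0.1, 0.2)
gamma <- 0.25
home <- c(1, 1, 2, 2, 3, 3)
away <- c(2, 3, 1, 3, 1, 2)
x <- c(2, 1, 0, 3, 1, 0)
y <- c(0, 1, 2, 1, 1, 4)

nll_rho0 <- bvp_negloglik(alpha, beta, gamma, rho = 0, home, away, x, y)
nll_indep <- -sum(dpois(x, exp(alpha[home] + beta[away] + gamma), log = TRUE) +
                  dpois(y, exp(alpha[away] + beta[home]), log = TRUE))
nll_rho01 <- bvp_negloglik(alpha, beta, gamma, rho = 0.1, home, away, x, y)
```

In practice such a function would be passed to a numerical optimiser such as `optim`, usually with an identifiability constraint on the attack or defence parameters; that step is omitted here.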
Appendix B

Appendix

B.1 Figures and Tables


[Figure: "Goal Distributions 2017/18 - 2021/22". Four panels (Bundesliga, La Liga, Serie A, Ligue 1) showing Count against Goals, split by Away/Home (A, H).]

Figure B.1: Distributions of home and away goals in 4 European leagues, seasons
2017/18 - 2021/22.

[Figure: "Full-Time Results 2017/18 - 2021/22". Four panels (Bundesliga, La Liga, Serie A, Ligue 1) showing Number of Away Goals against Number of Home Goals, with a Frequency legend (50, 100, 150).]

Figure B.2: Densities of full-time scores in 4 European leagues, seasons 2017/18 - 2021/22.

Team α β (α − β)
Augsburg 0.739 1.411 -0.672
Arminia Bielefeld 0.509 1.311 -0.803
Bayer 04 Leverkusen 1.496 1.233 0.263
Bayern Munich 1.809 0.981 0.828
Bochum 0.712 1.307 -0.595
Borussia Dortmund 1.601 1.365 0.236
Borussia Mönchengladbach 1.020 1.544 -0.524
Eintracht Frankfurt 0.853 1.250 -0.397
Freiburg 1.102 1.202 -0.100
Greuther Fürth 0.566 2.059 -1.493
Hertha BSC 0.719 1.771 -1.052
Hoffenheim 1.117 1.544 -0.427
Köln 0.963 1.219 -0.257
Mainz 05 0.933 1.144 -0.211
RB Leipzig 1.333 0.955 0.378
Stuttgart 0.785 1.502 -0.717
Union Berlin 0.929 1.117 -0.187
Wolfsburg 0.815 1.337 -0.523

Table B.1: Table of attack and defence parameter estimates for Bundesliga 2021/22.

Team α β (α − β)
Alavés 0.653 1.441 -0.788
Athletic Bilbao 0.894 0.821 0.073
Atlético Madrid 1.349 0.990 0.359
Barcelona 1.402 0.875 0.527
Cádiz 0.753 1.143 -0.391
Celta Vigo 0.899 0.979 -0.078
Elche 0.838 1.157 -0.319
Espanyol 0.843 1.193 -0.350
Getafe 0.695 0.924 -0.229
Granada 0.950 1.392 -0.442
Levante 1.113 1.737 -0.623
Mallorca 0.769 1.408 -0.639
Osasuna 0.789 1.150 -0.361
Rayo Vallecano 0.813 1.112 -0.299
Real Betis 1.297 0.925 0.372
Real Madrid 1.663 0.738 0.926
Real Sociedad 0.837 0.830 0.007
Sevilla 1.096 0.682 0.415
Valencia 1.031 1.211 -0.180
Villarreal 1.315 0.868 0.447

Table B.2: Table of attack and defence parameter estimates for La Liga 2021/22.

Team α β (α − β)
AC Milan 1.245 0.782 0.463
Atalanta 1.194 1.221 -0.027
Bologna 0.814 1.365 -0.551
Cagliari 0.635 1.671 -1.036
Empoli 0.938 1.743 -0.805
Fiorentina 1.077 1.273 -0.196
Genoa 0.504 1.472 -0.968
Inter Milan 1.519 0.825 0.693
Juventus 1.029 0.924 0.105
Lazio 1.428 1.485 -0.057
Napoli 1.328 0.786 0.542
Roma 1.076 1.077 -0.001
Salernitana 0.618 1.909 -1.291
Sampdoria 0.852 1.567 -0.716
Sassuolo 1.192 1.671 -0.479
Spezia 0.765 1.747 -0.982
Torino 0.830 1.013 -0.182
Udinese 1.125 1.463 -0.338
Venezia 0.633 1.692 -1.059
Verona 1.201 1.494 -0.293

Table B.3: Table of attack and defence parameter estimates for Serie A 2021/22.

Team α β (α − β)
Angers 0.823 1.254 -0.431
AS Monaco 1.207 0.931 0.276
Bordeaux 1.017 2.099 -1.082
Brest 0.920 1.301 -0.380
Clermont 0.722 1.557 -0.835
Lens 1.155 1.111 0.044
LOSC Lille 0.901 1.099 -0.198
Lorient 0.667 1.425 -0.758
Lyon 1.240 1.185 0.054
Marseille 1.169 0.881 0.288
Metz 0.671 1.561 -0.891
Montpellier 0.926 1.393 -0.467
Nantes 1.024 1.105 -0.081
OGC Nice 0.956 0.826 0.130
Paris Saint-Germain 1.667 0.860 0.806
Reims 0.803 1.006 -0.204
Rennes 1.520 0.943 0.577
Saint-Étienne 0.806 1.749 -0.943
Strasbourg 1.116 0.989 0.127
Troyes 0.690 1.195 -0.505

Table B.4: Table of attack and defence parameter estimates for Ligue 1 2021/22.

Figure B.3: Results for the Rathke 2017 zonal xG model.



Figure B.4: Pitch segmentation for the location covariate when modelling free kick
interruptions.

Figure B.5: Example structure of Wyscout data (Pappalardo et al. 2019) when loaded
into R using ’fromJSON’.

Free Kicks (N=53,716)

Parameters         β                S.E.            t        p
Intercept          4.00             0.0259          155      0.000***
Timing             -5.70 × 10^-6    7.26 × 10^-6    -0.785   0.432
Area 2             -0.198           0.0324          -6.12    0.000***
Area 3             -1.01            0.0284          -35.7    0.000***
Area 4             -1.04            0.0290          -36.0    0.000***
Area 5             -0.830           0.0282          -29.4    0.000***
Area 6             -0.907           0.0305          -29.7    0.000***
France             -0.0149          0.0106          -1.40    0.161
Germany            -0.0820          0.0111          -7.41    0.000***
Italy              -0.0573          0.0107          -5.37    0.000***
Spain              -0.0771          0.0105          -7.38    0.000***
Timing x Area 2    -3.72 × 10^-6    9.50 × 10^-6    -0.392   0.695
Timing x Area 3    5.70 × 10^-5     8.35 × 10^-6    6.83     0.000***
Timing x Area 4    6.34 × 10^-5     8.54 × 10^-6    7.42     0.000***
Timing x Area 5    5.98 × 10^-5     8.25 × 10^-6    7.24     0.000***
Timing x Area 6    3.67 × 10^-5     9.02 × 10^-6    4.07     0.000***

Table B.5: Table of parameter estimates for the fitted free kick interruptions GLM.
Significant at *95%, **99%, ***99.9%. All values taken to 3s.f.
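When reading these tables it can help to recompute the t column from the other two. The sketch below assumes that t is the usual ratio of the estimate to its standard error, which the table does not state explicitly; it uses four rows copied from Table B.5, and the object names are illustrative. Because the inputs are rounded to 3 s.f., small discrepancies are expected.

```r
# Four rows copied from Table B.5; t assumed to be beta / S.E.
tab <- data.frame(
  term       = c("Intercept", "Area 2", "Area 3", "Germany"),
  beta       = c(4.00, -0.198, -1.01, -0.0820),
  se         = c(0.0259, 0.0324, 0.0284, 0.0111),
  t_reported = c(155, -6.12, -35.7, -7.41)
)

tab$t_recomputed <- tab$beta / tab$se
# Relative difference between the recomputed and reported statistics.
tab$rel_diff <- abs(tab$t_recomputed - tab$t_reported) / abs(tab$t_reported)
tab
```

A row whose recomputed value differs from the reported one by much more than rounding error is worth checking against the fitted model output.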

Throw Ins (N=80,305)

Parameters         β                S.E.            t        p
Intercept          2.24             0.0147          152      0.000***
Timing             7.23 × 10^-5     3.86 × 10^-6    18.717   0.000***
Area 4             0.0349           0.0173          2.02     0.0437*
Area 6             0.239            0.0189          12.6     0.000***
France             -0.201           0.0117          -17.1    0.000***
Germany            -0.145           0.0125          -11.6    0.000***
Italy              -0.170           0.0120          -14.2    0.000***
Spain              -0.120           0.0120          -9.99    0.000***
Timing x Area 4    1.80 × 10^-8     5.39 × 10^-6    0.003    0.997
Timing x Area 6    -2.14 × 10^-5    5.85 × 10^-7    4.07     0.000***

Table B.6: Table of parameter estimates for the fitted throw in interruptions GLM.

Out of Bounds (N=129,165)

Parameters         β                S.E.            t        p
Intercept          1.99             0.0191          104      0.000***
Timing             -9.35 × 10^-6    5.40 × 10^-6    -1.73    0.0832
Area 2             -0.314           0.0217          -14.5    0.000***
Area 4             -0.521           0.0216          -24.1    0.000***
Area 5             -0.159           0.0241          -6.60    0.000***
Area 6             -0.374           0.0213          -17.6    0.000***
France             -0.113           0.00914         -12.4    0.000***
Germany            -0.0791          0.00975         -8.11    0.000***
Italy              -0.0798          0.00920         -8.67    0.000***
Spain              -0.0494          0.00929         -5.31    0.000***
Timing x Area 2    1.84 × 10^-5     6.52 × 10^-6    2.82     0.00486**
Timing x Area 4    3.13 × 10^-5     6.54 × 10^-6    4.78     0.000***
Timing x Area 5    9.68 × 10^-6     7.17 × 10^-6    1.35     0.177
Timing x Area 6    1.95 × 10^-5     6.41 × 10^-6    3.04     0.00234**

Table B.7: Table of parameter estimates for the fitted out of bounds ball interruptions
GLM.

Goal Kicks (N=29,175)

Parameters    β               S.E.            t        p
Intercept     3.10            0.0102          303      0.000***
Timing        3.49 × 10^-5    2.18 × 10^-6    16.0     0.000***
France        -0.172          0.0112          -15.4    0.000***
Germany       -0.174          0.0119          -14.6    0.000***
Italy         -0.216          0.0110          -19.6    0.000***
Spain         -0.123          0.0123          -10.9    0.000***

Table B.8: Table of parameter estimates for the fitted goal kick interruptions GLM.

Corners (N=18,181)

Parameters    β                S.E.            t        p
Intercept     3.10             0.0132          235      0.000***
Timing        -1.26 × 10^-5    2.87 × 10^-6    -4.41    0.000***
France        0.00160          0.0144          0.111    0.912
Germany       0.0451           0.0155          2.92     0.00353**
Italy         -0.00751         0.0142          -0.527   0.598
Spain         -0.0207          0.0144          -1.43    0.152

Table B.9: Table of parameter estimates for the fitted corner interruptions GLM.

Penalties (N=541)

Parameters    β                S.E.            t        p
Intercept     4.05             0.00698         58.0     0.000***
Timing        -3.69 × 10^-6    1.24 × 10^-5    -0.297   0.767
France        0.257            0.0671          3.83     0.000***
Germany       0.408            0.0718          5.69     0.000***
Italy         0.613            0.0672          9.13     0.000***
Spain         0.167            0.0686          2.43     0.0155*

Table B.10: Table of parameter estimates for the fitted penalty interruptions GLM.
Appendix C

Appendix

C.1 R Code

C.1.1 Football Scores Data Processing and Analysis


#Packages used in this section.
library(readxl)   #read_excel
library(dplyr)    #mutate, recode
library(ggplot2)  #plots
library(regista)  #dixoncoles
#Importing and processing PL21/22 data.
PL21 <- read_excel("Uni/MSc Statistics/Dissertation/Data/Scores Data/Premier League/epl-2021-GMTStandardTime - Fixture Download.xlsx")
homegoals21 <- substr(PL21$Result, 1, 2)
hgoals21 <- as.numeric(homegoals21)
awaygoals21 <- substr(PL21$Result, 5, 6)
agoals21 <- as.numeric(awaygoals21)
data21 <- data.frame(PL21$`Home Team`, PL21$`Away Team`, hgoals21,
  agoals21, PL21$`Round Number`)
#Function to determine match result ("H", "D" or "A") for each of n matches.
result <- function(n, hgoals, agoals){
  res <- character(n)
  for(i in 1:n) {
    if(hgoals[i] > agoals[i]){
      res[i] <- "H"
    } else if(hgoals[i] == agoals[i]){
      res[i] <- "D"
    } else if(hgoals[i] < agoals[i]){
      res[i] <- "A"
    }
  }
  res
}
Res21 <- result(380, hgoals21, agoals21)
Season21 <- rep("2021/22", 380)
fullPL21 <- data.frame(Season21, PL21$`Round Number`, PL21$`Home Team`,
  PL21$`Away Team`, hgoals21, agoals21, Res21)
colnames(fullPL21) <- c('Season', 'Matchweek', 'H.Team', 'A.Team',
  'H.goals', 'A.goals', 'Result')
#Importing and processing PL20/21 data.
PL20 <- read_excel("Uni/MSc Statistics/Dissertation/Data/Scores Data/Premier League/epl-2020-GMTStandardTime.xlsx")
#Taking two characters per score is allowable since no team has ever
#scored 100+ goals in one game.
homegoals20 <- substr(PL20$Result, 1, 2)
hgoals20 <- as.numeric(homegoals20)
awaygoals20 <- substr(PL20$Result, 5, 6)
agoals20 <- as.numeric(awaygoals20)
Res20 <- result(380, hgoals20, agoals20)
Season20 <- rep("2020/21", 380)
fullPL20 <- data.frame(Season20, PL20$`Round Number`, PL20$`Home Team`,
  PL20$`Away Team`, hgoals20, agoals20, Res20)
colnames(fullPL20) <- c('Season', 'Matchweek', 'H.Team', 'A.Team',
  'H.goals', 'A.goals', 'Result')
#Importing and processing PL19/20 data.
PL19 <- read_excel("Uni/MSc Statistics/Dissertation/Data/Scores Data/Premier League/epl-2019-GMTStandardTime.xlsx")
homegoals19 <- substr(PL19$Result, 1, 2)
hgoals19 <- as.numeric(homegoals19)
awaygoals19 <- substr(PL19$Result, 5, 6)
agoals19 <- as.numeric(awaygoals19)
Res19 <- result(380, hgoals19, agoals19)
Season19 <- rep("2019/20", 380)
fullPL19 <- data.frame(Season19, PL19$`Round Number`, PL19$`Home Team`,
  PL19$`Away Team`, hgoals19, agoals19, Res19)
colnames(fullPL19) <- c('Season', 'Matchweek', 'H.Team', 'A.Team',
  'H.goals', 'A.goals', 'Result')
#Importing and processing PL18/19 data.
PL18 <- read_excel("Uni/MSc Statistics/Dissertation/Data/Scores Data/Premier League/epl-2018-GMTStandardTime.xlsx")
homegoals18 <- substr(PL18$Result, 1, 2)
hgoals18 <- as.numeric(homegoals18)
awaygoals18 <- substr(PL18$Result, 5, 6)
agoals18 <- as.numeric(awaygoals18)
Res18 <- result(380, hgoals18, agoals18)
Season18 <- rep("2018/19", 380)
fullPL18 <- data.frame(Season18, PL18$`Round Number`, PL18$`Home Team`,
  PL18$`Away Team`, hgoals18, agoals18, Res18)
colnames(fullPL18) <- c('Season', 'Matchweek', 'H.Team', 'A.Team',
  'H.goals', 'A.goals', 'Result')
#Importing and processing PL17/18 data.
PL17 <- read_excel("Uni/MSc Statistics/Dissertation/Data/Scores Data/Premier League/epl-2017-GMTStandardTime.xlsx")
homegoals17 <- substr(PL17$Result, 1, 2)
hgoals17 <- as.numeric(homegoals17)
awaygoals17 <- substr(PL17$Result, 5, 6)
agoals17 <- as.numeric(awaygoals17)
Res17 <- result(380, hgoals17, agoals17)
Season17 <- rep("2017/18", 380)
fullPL17 <- data.frame(Season17, PL17$`Round Number`, PL17$`Home Team`,
  PL17$`Away Team`, hgoals17, agoals17, Res17)
colnames(fullPL17) <- c('Season', 'Matchweek', 'H.Team', 'A.Team',
  'H.goals', 'A.goals', 'Result')
#Combining all seasons 17/18-21/22.
PL17_21 <- rbind(fullPL17, fullPL18, fullPL19, fullPL20, fullPL21)
#Recode abbreviated home team names to full names.
fullPL17_21 <- mutate(PL17_21, H.Team = recode(H.Team,
  "Leicester" = "Leicester City", "Man City" = "Manchester City",
  "Man Utd" = "Manchester United", "Spurs" = "Tottenham Hotspur",
  "Stoke" = "Stoke City", "Swansea" = "Swansea City", "West Brom" =
  "West Bromwich Albion", "West Ham" = "West Ham United",
  "Brighton" = "Brighton & Hove Albion", "Huddersfield" =
  "Huddersfield Town", "Newcastle" = "Newcastle United",
  "Cardiff" = "Cardiff City", "Wolves" = "Wolverhampton Wanderers",
  "Sheffield Utd" = "Sheffield United", "Leeds" = "Leeds United"))
#Recode away team names, starting from fullPL17_21 so that the home team
#recoding above is kept.
fullPL17_21 <- mutate(fullPL17_21, A.Team = recode(A.Team, "Leicester" =
  "Leicester City", "Man City" = "Manchester City", "Man Utd" =
  "Manchester United", "Spurs" = "Tottenham Hotspur", "Stoke" =
  "Stoke City", "Swansea" = "Swansea City", "West Brom" =
  "West Bromwich Albion", "West Ham" = "West Ham United", "Brighton" =
  "Brighton & Hove Albion", "Huddersfield" = "Huddersfield Town",
  "Newcastle" = "Newcastle United", "Cardiff" = "Cardiff City",
  "Wolves" = "Wolverhampton Wanderers", "Sheffield Utd" =
  "Sheffield United", "Leeds" = "Leeds United"))
#Implementing Dixon and Coles for 2021/22 Season data.
PL2122 <- dixoncoles(H.goals, A.goals, H.Team, A.Team, fullPL21)
CRY <- exp(PL2122$par[7]) * exp(PL2122$par[21]) * exp(PL2122$par[41])
ARS <- exp(PL2122$par[1]) * exp(PL2122$par[27])
TOT <- exp(PL2122$par[17]) * exp(PL2122$par[36]) * exp(PL2122$par[41])
SOU <- exp(PL2122$par[16]) * exp(PL2122$par[37])
LEE <- exp(PL2122$par[9]) * exp(PL2122$par[40]) * exp(PL2122$par[41])
WOL <- exp(PL2122$par[29]) * exp(PL2122$par[20])
EVE <- exp(PL2122$par[8]) * exp(PL2122$par[26]) * exp(PL2122$par[41])
CHE <- exp(PL2122$par[6]) * exp(PL2122$par[28])
LEI <- exp(PL2122$par[10]) * exp(PL2122$par[23]) * exp(PL2122$par[41])
BRE <- exp(PL2122$par[3]) * exp(PL2122$par[30])
MUN <- exp(PL2122$par[13]) * exp(PL2122$par[24]) * exp(PL2122$par[41])
BHA <- exp(PL2122$par[4]) * exp(PL2122$par[33])
WHU <- exp(PL2122$par[19]) * exp(PL2122$par[32]) * exp(PL2122$par[41])
MCI <- exp(PL2122$par[12]) * exp(PL2122$par[39])
#Dixon and Coles pmf and sums to estimate H, D, A probabilities.
#tau adjusts the probabilities of the low scores 0-0, 0-1, 1-0 and 1-1.
tau <- function(x, y, lambda, mu, rho){
  t = 1
  if(x == 0 && y == 0) {t = 1 - lambda * mu * rho}
  if(x == 0 && y == 1) {t = 1 + lambda * rho}
  if(x == 1 && y == 0) {t = 1 + mu * rho}
  if(x == 1 && y == 1) {t = 1 - rho}
  t
}
dixcolpmf <- function(x, y, lambda, mu, rho) {
  p <- tau(x, y, lambda, mu, rho) * dpois(x, lambda) * dpois(y, mu)
  p
}
#Returns the probabilities of a home win, draw and away win, using
#scores of 0 to 10 goals for each side.
probresult <- function(Hatt, Hdef, Aatt, Adef, rho, gamma){
  resprob <- numeric(3)
  p <- matrix(0, nrow = 11, ncol = 11)
  lambda <- Hatt * Adef * gamma
  mu <- Hdef * Aatt
  for(i in 1:11) {
    for(j in 1:11) {
      p[i, j] <- dixcolpmf(i-1, j-1, lambda, mu, rho)
    }
  }
  #Home win: home goals (rows) exceed away goals (columns).
  resprob[1] <- sum(p[2,1], p[3,1:2], p[4,1:3], p[5,1:4], p[6,1:5],
    p[7,1:6], p[8,1:7], p[9,1:8], p[10,1:9], p[11,1:10])
  #Draw: diagonal entries.
  resprob[2] <- sum(p[1,1], p[2,2], p[3,3], p[4,4], p[5,5], p[6,6],
    p[7,7], p[8,8], p[9,9], p[10,10], p[11,11])
  #Away win: away goals exceed home goals.
  resprob[3] <- sum(p[1, 2], p[1:2,3], p[1:3,4], p[1:4,5], p[1:5,6],
    p[1:6,7], p[1:7,8], p[1:8, 9], p[1:9, 10], p[1:10,11])
  resprob
}
#Returns the full 11 x 11 matrix of score probabilities.
scorepmf <- function(Hatt, Hdef, Aatt, Adef, rho, gamma){
  p <- matrix(0, nrow = 11, ncol = 11)
  lambda <- Hatt * Adef * gamma
  mu <- Hdef * Aatt
  for(i in 1:11) {
    for(j in 1:11) {
      p[i, j] <- dixcolpmf(i-1, j-1, lambda, mu, rho)
    }
  }
  p
}
#Example for Man City vs Norwich.
MCINOR <- probresult(exp(PL2122$par[12]), exp(PL2122$par[32]),
exp(PL2122$par[15]),exp(PL2122$par[35]), PL2122$par[42],
exp(PL2122$par[41]))
pmfMCINOR <- scorepmf(exp(PL2122$par[12]), exp(PL2122$par[32]),
exp(PL2122$par[15]), exp(PL2122$par[35]), PL2122$par[42],
exp(PL2122$par[41]))
#Goal Distribution Plot PL17/18-21/22.
gol <- as.vector(c(PL17_21$H.goals,PL17_21$A.goals))
ind <- c(rep("H", 1900), rep("A", 1900))
plot1721 <- as.data.frame(cbind(gol, ind))
ggplot(plot1721, aes(x = as.factor(gol), group = ind, color = ind,
  fill = ind, show.legend = T)) + geom_bar(stat = "count",
  position = "dodge", colour = "black", show.legend = F) +
  geom_point(stat = "count", size = 4, color = "black",
  show.legend = F) + geom_line(stat = "count", size = 1) +
  ggtitle("Distribution of Home and Away Goals", subtitle =
  "Premier League 2017/18 - 2021/22") + scale_x_discrete("Goals") +
  ylab("Count") + labs(colour = "Away/Home")
#Density plot for match results.
#plyr::count returns one row per score with its frequency in column freq.
data <- plyr::count(fullPL17_21, vars = c("H.goals", "A.goals"))
resultscatter <- ggplot(data, aes(x = H.goals, y = A.goals, colour =
  freq)) + geom_point(size = data$freq/10) + labs(colour =
  "Frequency") + xlab("Number of Home Goals") +
  ylab("Number of Away Goals") +
  ggtitle("Premier League FT Results 2017/18 - 2021/22") +
  scale_x_continuous(breaks = seq(0,9,1)) +
  scale_y_continuous(breaks = seq(0,9,1))
#Example of result probability calculations:
CRYARS <- probresult(exp(PL2122$par[7]), exp(PL2122$par[27]),
exp(PL2122$par[1]),exp(PL2122$par[21]), PL2122$par[42],
exp(PL2122$par[41]))
pmfCRYARS <- scorepmf(exp(PL2122$par[7]), exp(PL2122$par[27]),
exp(PL2122$par[1]),exp(PL2122$par[21]), PL2122$par[42],
exp(PL2122$par[41]))
PL20_21 <- dixoncoles(H.goals, A.goals, H.Team, A.Team, fullPL20)
#Example of obtaining result probabilities for assessing Dixon and
#Coles accuracy.
MUNLEE <- probresult(exp(PL20_21$par[13]), exp(PL20_21$par[33]),
exp(PL20_21$par[9]), exp(PL20_21$par[29]), PL20_21$par[42],
exp(PL20_21$par[41]))
outcome_Hwin <- function(n, hgoals, agoals){
outcome <- numeric(n)
for(i in 1:n) {
if(hgoals[i] > agoals[i]){
outcome[i] <- 1
} else if(hgoals[i] == agoals[i]){
outcome[i] <- 0
} else if(hgoals[i] < agoals[i]){
outcome[i] <- 0
}
}

outcome
}
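#Brier score: squared errors summed over H, D, A and averaged over matches.
Brier <- function(obs, pred){
  #nrow() counts matches; length() would count the columns of a data frame.
  n <- nrow(obs)
  val <- matrix(0, ncol = 3, nrow = n)
  for(i in 1:n) {
    for(j in 1:3){
      val[i,j] <- (pred[i,j] - obs[i,j]) ^ 2
    }
  }
  (1/n) * sum(val)
}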
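#Matches without a prediction are filled with NA.
pred <- matrix(c(NA,NA,NA,MUNLEE,BURBHA,CHECRY,
  EVESOT,LEIWOL,NA,NA,NA,NA,NA,NA,NEWWHU,TOTMCI,
  LIVBUR,AVLNEW,CRYBRE,LEEEVE,MCINOR,BHAWAT,SOTMUN,WOLTOT,ARSCHE,
  WHULEI,MCIARS,AVLBRE,BHAEVE,NEWSOT,NORLEI,WHUCRY,LIVCHE,BURLEE,
  TOTWAT,WOLMUN), ncol=3, byrow = T)
#Draw and away win indicators, defined analogously to outcome_Hwin.
outcome_Draw <- function(n, hgoals, agoals) as.numeric(hgoals[1:n] == agoals[1:n])
outcome_Awin <- function(n, hgoals, agoals) as.numeric(hgoals[1:n] < agoals[1:n])
outcomeH <- outcome_Hwin(380, fullPL21$H.goals, fullPL21$A.goals)
outcomeD <- outcome_Draw(380, fullPL21$H.goals, fullPL21$A.goals)
outcomeA <- outcome_Awin(380, fullPL21$H.goals, fullPL21$A.goals)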
Prem21 <- cbind(fullPL21, outcomeH, outcomeD, outcomeA)
Wks13 <- cbind(Prem21[1:30,], pred)
wks13 <- Wks13[c(2:6,9:30),]
WK1_3 <- Brier(wks13[,8:10], wks13[,11:13])
eqprob13 <- matrix(rep(1/3, 81), ncol=3)
Wk13BSS <- Brier(wks13[,8:10], eqprob13)
BSS13 <- 1 - (WK1_3/Wk13BSS)
wks13surprisal <- -sum(wk1pmfsum, wk2pmfsum, wk3pmfsum)
#Matchweeks 20-22
wks20_22 <- Prem21[c(190:196,200:207,175,176,210:217,230,168:169),]
pred20_22 <- matrix(c(CRYNOR,SOTTOT,WATWHU,LEILIV,CHEBHA,BREMCI,MUNBUR,
ARSMCI,WATTOT,CRYWHU,BREAVL,EVEBHA,LEEBUR,CHELIV,MUNWOL,
SOTBRE,WHUNOR,BHACRY,MCICHE,NEWWAT,NOREVE,WOLSOT,AVLMUN,
LIVBRE,WHULEE,BHACHE,LEITOT,BREMUN),ncol=3,byrow=T)
WKS20_22 <- cbind(Prem21[c(190:196,200:207,175,176,210:217,230,
168:169),], pred20_22)
WKS20_22[c(1,3,6,9,11,16,17,20,21,24,28),]
8:19,22:23,25:27),]
WK20_22 <- Brier(WKS20_22[,8:10], WKS20_22[11:13])
eqprob20_22 <- matrix(rep(1/3, 84), ncol=3)
Wk20_22BSS <- Brier(WKS20_22[,8:10], eqprob20_22)
BSS20_22 <- 1 - (WK20_22/Wk20_22BSS)
wks20_22suprisal <- -sum(wk20pmfsum, wk21pmfsum, wk22pmfsum)
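
The loops used above for match results, outcome probabilities and the Brier score can also be written in vectorised form. The following is a minimal sketch of that alternative, not the code used to produce the results in this dissertation: the function names `match_result`, `outcome_probs` and `brier_score` are hypothetical, and the inputs are toy values (an independent Poisson score matrix stands in for the fitted Dixon and Coles pmf).

```r
# Hypothetical helper names; toy inputs, not the dissertation's data.
match_result <- function(hgoals, agoals) {
  # sign() is -1, 0 or 1 for away win, draw, home win
  c("A", "D", "H")[sign(hgoals - agoals) + 2]
}

outcome_probs <- function(p) {
  # p[i, j] = P(home scores i - 1, away scores j - 1)
  c(H = sum(p[lower.tri(p)]),  # rows > columns: home scores more
    D = sum(diag(p)),          # equal scores
    A = sum(p[upper.tri(p)]))  # columns > rows: away scores more
}

brier_score <- function(obs, pred) {
  obs <- as.matrix(obs)
  pred <- as.matrix(pred)
  sum((pred - obs)^2) / nrow(obs)  # averaged over matches
}

res <- match_result(c(2, 1, 0), c(1, 1, 3))

# Toy score matrix for 0 to 10 goals per side (independent Poisson).
p <- outer(dpois(0:10, 1.5), dpois(0:10, 1.1))
probs <- outcome_probs(p)

# Three matches ending H, D, A, scored against equal probabilities of 1/3.
obs <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))
bs_uniform <- brier_score(obs, matrix(1/3, nrow = 3, ncol = 3))
```

Using `lower.tri` and `upper.tri` avoids listing each row of the score matrix by hand, and `nrow` makes the Brier average independent of whether the inputs are matrices or data frames.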

C.1.2 xG Data Processing and Analysis


#Reading the xG model data:
library(jsonlite)
library(dplyr)
library(tidyr)
library(purrr)
library(ggsoccer)
library(ggplot2)
library(tibble)
#Example of reading and processing JSON data (European Championship events
#file, stored in objects named UCL):
UCL <- fromJSON("Uni/MSc Statistics/Dissertation/Data/xG/events_European_Championship.json")
UCL_shots <- as.list(subset(UCL, UCL$eventId == 10))
UCL_n <- length(UCL_shots$eventName)
as.data.frame(UCL_shots$tags)
#Gives an error whose message shows the max number (6) of tags for each shot.
#Functions to create binary vectors for goals and blocked shots.
is_goal <- function(n){
is_goal2 <- numeric(n)
for(i in 1:n){
for(j in 1:6){
if(is.na(UCL_shots$tags[[i]][j,1] == 101)) next
if(UCL_shots$tags[[i]][j,1] == 101){
is_goal2[i] = 1
}
}
}
is_goal2
}
is_blocked <- function(n){
is_block <- numeric(n)
for(i in 1:n){
for(j in 1:6){
if(is.na(UCL_shots$tags[[i]][j,1] == 2101)) next
if(UCL_shots$tags[[i]][j,1] == 2101){
is_block[i] = 1
}
}
}
is_block
}
#Extracting the location data for shots.
UCL_pos <- UCL_shots$positions
x1 <- numeric(UCL_n)
y1 <- numeric(UCL_n)
for(i in 1:UCL_n) {
x1[i] <- UCL_pos[[i]][1,2]
y1[i] <- UCL_pos[[i]][1,1]
}
goal <- is_goal(UCL_n)
block <- is_blocked(UCL_n)
#Creating data frames of shots and saving them as Rda files.
UCL_df <- cbind(UCL_shots$matchId,UCL_shots$teamId,UCL_shots$playerId,
UCL_shots$matchPeriod,UCL_shots$eventSec,x1,y1,goal,block)
colnames(UCL_df) <- c('MatchID', 'TeamID','PlayerID','MatchPeriod',
  'Time (s)','x1', 'y1', 'Goal', 'Blocked')
save(UCL_df, file="UCL.Rda")
#Load the saved files and create the full shots data frame.
load("UCL.Rda")
load("WC.Rda")
load("England.Rda")
load("France.Rda")
load("Italy.Rda")
load("Germany.Rda")
load("Spain.Rda")
shots_df <- as.data.frame(rbind(UCL_df, wc_df, eng_df, fra_df,
itl_df, ger_df, spa_df))
goal.prop <- sum(as.numeric(shots_df[,8]))/length(shots_df[,8])
block.prop <- sum(as.numeric(shots_df[,9]))/length(shots_df[,9])
unbl_shots <- shots_df[!(shots_df$Blocked == 1),]
#Visualisation of all unblocked shots, code based on Gómez (2020).
ggplot(data = unbl_shots, aes(y = y1, x = x1)) +
annotate_pitch(colour = "white", fill = "darkgreen",
limits = FALSE) +
theme_pitch() +
theme(plot.background = element_rect(fill = "black"),
title = element_text(colour = "white")) +
coord_flip(xlim = c(51, 101),
ylim = c(-1, 101)) +
geom_jitter(aes(fill = factor(Goal, levels = c(1, 0))),
alpha = 0.3, shape = 21, size = 0.8) +
facet_wrap(~Goal, nrow = 1) +
scale_fill_manual(values = c("red", "blue")) +
scale_colour_manual(values = c("red", "blue")) +
theme(legend.direction = "horizontal",
legend.text = element_text(color = "white", size = 8,
face = "plain"),
legend.background = element_rect(fill = "black"),
legend.key = element_rect(fill = "black"),
strip.background=element_rect(fill = "black"),
strip.text = element_text(colour = "black"),
plot.title = element_text(hjust = 0.07, face = "plain"),
plot.subtitle = element_text(hjust = 0.07, size = 10, face =
"italic"),
plot.caption = element_text(hjust = 0.95),
plot.margin = margin(1, 0.2, 0.5, 0.2, "cm")) +
labs(fill = "Goal") +
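guides(fill = guide_legend(override.aes = list(alpha = 0.8, size = 2),
reverse=T)) + ggtitle("Unblocked Open Play Shots",
"Top 5 European Leagues & European Championship and World Cup 2018")
#Calculate coordinates in yards.
#Columns built with cbind() above are stored as text, so convert them
#to numeric before doing arithmetic.
ydsx1 <- (100 - as.numeric(as.character(unbl_shots$x1))) * (114.829/100)
ydsy1 <- as.numeric(as.character(unbl_shots$y1)) * (74.3657/100)
xg_shots <- as.data.frame(cbind(unbl_shots, ydsx1, ydsy1))
xg_shots$Goal <- as.numeric(as.character(xg_shots$Goal))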
#Function to determine which xG zone each shot is taken from.
zone <- function(n){
Zone <- numeric(n)
for(i in 1:n){
if(xg_shots$ydsx1[i] < 6 && 33.1465 < xg_shots$ydsy1[i] &&
xg_shots$ydsy1[i] < 41.1465){
Zone[i] = 1
} else if(xg_shots$ydsx1[i] < 6 && 27.1465 < xg_shots$ydsy1[i] &&
xg_shots$ydsy1[i] < 33.1465) {
Zone[i] = 2
} else if(xg_shots$ydsx1[i] < 6 && 41.1465 < xg_shots$ydsy1[i] &&
xg_shots$ydsy1[i] < 47.1465) {
Zone[i] = 2
} else if(6 < xg_shots$ydsx1[i] && xg_shots$ydsx1[i] < 18 &&
33.1465 < xg_shots$ydsy1[i] && xg_shots$ydsy1[i] < 41.1465) {
Zone[i] = 3
} else if(xg_shots$ydsx1[i] < 6 && 15.1465 < xg_shots$ydsy1[i] &&
xg_shots$ydsy1[i] < 27.1465) {
Zone[i] = 4
} else if(xg_shots$ydsx1[i] < 6 && 47.1465 < xg_shots$ydsy1[i] &&
xg_shots$ydsy1[i] < 59.1465) {
Zone[i] = 4
} else if(6 < xg_shots$ydsx1[i] && xg_shots$ydsx1[i] < 18 &&
15.1465 < xg_shots$ydsy1[i] && xg_shots$ydsy1[i] < 33.1465) {
Zone[i] = 5
} else if(6 < xg_shots$ydsx1[i] && xg_shots$ydsx1[i] < 18 &&
41.1465 < xg_shots$ydsy1[i] && xg_shots$ydsy1[i] < 59.1465) {
Zone[i] = 5
} else if(18 < xg_shots$ydsx1[i] && xg_shots$ydsx1[i] < 29.707) {
Zone[i] = 6
} else if(xg_shots$ydsx1[i] < 18 && xg_shots$ydsy1[i] < 15.1465) {
Zone[i] = 7
} else if(xg_shots$ydsx1[i] < 18 && 59.1465 < xg_shots$ydsy1[i]) {
Zone[i] = 7
} else if(xg_shots$ydsx1[i] > 29.707) {
Zone[i] = 8
}
}
Zone
}
n <- length(xg_shots$ydsx1)
Zone <- as.factor(zone(n))
full_df <- as.data.frame(cbind(xg_shots, Zone))
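#Sketch only, not part of the original analysis: a vectorised version of
#the zone() loop above, assuming the same yard thresholds. The name zone_vec
#is hypothetical. As in zone(), a shot lying exactly on a boundary matches
#none of the strict inequalities and keeps the value 0.
zone_vec <- function(x, y){
  z <- numeric(length(x))
  in6 <- x < 6                 #inside the six-yard depth
  in18 <- x > 6 & x < 18       #between six and eighteen yards
  z[in6 & y > 33.1465 & y < 41.1465] <- 1
  z[in6 & ((y > 27.1465 & y < 33.1465) | (y > 41.1465 & y < 47.1465))] <- 2
  z[in18 & y > 33.1465 & y < 41.1465] <- 3
  z[in6 & ((y > 15.1465 & y < 27.1465) | (y > 47.1465 & y < 59.1465))] <- 4
  z[in18 & ((y > 15.1465 & y < 33.1465) | (y > 41.1465 & y < 59.1465))] <- 5
  z[x > 18 & x < 29.707] <- 6
  z[x < 18 & (y < 15.1465 | y > 59.1465)] <- 7
  z[x > 29.707] <- 8
  z
}
#Usage on made-up coordinates, one shot per zone:
zone_demo <- zone_vec(c(3, 3, 10, 3, 10, 25, 10, 40),
                      c(37, 30, 37, 20, 20, 37, 5, 37))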
#Partition the data into a training and test set.
split <- sort(sample(nrow(full_df), nrow(full_df) * 0.95))
train <- full_df[split,]
testset <- full_df[-split,]
#Creating subsets for each zone and computing empirical estimates.
zone1 <- train[(train$Zone == 1),]
zone2 <- train[(train$Zone == 2),]
zone3 <- train[(train$Zone == 3),]
zone4 <- train[(train$Zone == 4),]
zone5 <- train[(train$Zone == 5),]
zone6 <- train[(train$Zone == 6),]
zone7 <- train[(train$Zone == 7),]
zone8 <- train[(train$Zone == 8),]
xG1 <- sum(zone1$Goal) / length(zone1$Goal)
xG2 <- sum(zone2$Goal) / length(zone2$Goal)
xG3 <- sum(zone3$Goal) / length(zone3$Goal)
xG4 <- sum(zone4$Goal) / length(zone4$Goal)
xG5 <- sum(zone5$Goal) / length(zone5$Goal)
xG6 <- sum(zone6$Goal) / length(zone6$Goal)
xG7 <- sum(zone7$Goal) / length(zone7$Goal)
xG8 <- sum(zone8$Goal) / length(zone8$Goal)
Rathketrain <- c(xG1,xG2,xG3,xG4,xG5,xG6,xG7,xG8)
#Creating a vector of xG relating to each shot.
xGtrain <- numeric(length(train$Zone))
for(j in 1:length(train$Zone)){
for(i in 1:8){
if(train$Zone[j] == i) {
xGtrain[j] = Rathketrain[i]
}
}
}
rathtraindf <- as.data.frame(cbind(train, xGtrain))
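#Sketch only, not part of the original analysis: the per-zone conversion
#rates and the per-shot lookup above can also be written with tapply() and
#indexing by zone label. The toy_ objects are made-up illustration data.
toy_train <- data.frame(Zone = factor(c(1, 1, 2, 2, 2, 3), levels = 1:3),
                        Goal = c(1, 0, 0, 0, 1, 0))
#Empirical xG per zone = goals / shots within each zone.
toy_rates <- tapply(toy_train$Goal, toy_train$Zone, mean)
#Map each shot to the rate of its own zone.
toy_xG <- as.numeric(toy_rates[as.character(toy_train$Zone)])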
#Converting coordinates to metres and computing distance and angle.
x1m <- (100 - xG_df$x1) * (105/100)
y1m <- xG_df$y1 * (68/100)
dist_to_goal_line_centre_m <- sqrt((x1m)^2 + (34 - y1m)^2)
angle_of_goal <- (atan((37.66-y1m)/x1m) - atan((30.34-y1m)/x1m)) * 180/pi
df_angle <- as.data.frame(cbind(full_df, dist_to_goal_line_centre_m,
angle_of_goal))
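#Worked example (illustration only): the same distance and angle formulas
#applied to a single shot from the penalty spot, assumed here to be 11 m
#from the goal line and central (y = 34 m). The pen_ names are hypothetical.
pen_x <- 11
pen_y <- 34
pen_dist <- sqrt((pen_x)^2 + (34 - pen_y)^2)   #11 m to the goal-line centre
pen_angle <- (atan((37.66 - pen_y)/pen_x) -
  atan((30.34 - pen_y)/pen_x)) * 180/pi        #roughly 37 degrees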
#Fitting the xG logistic regression model.
fittedxG <- glm(Goal ~ dist_to_goal_line_centre_m + angle_of_goal,
family = binomial(link = "logit"), data = df_angle)
summary(fittedxG)
#Fitting Rathke’s xG model as a GLM.
fitRathkexG <- glm(Goal ~ Zone, family = binomial(link = "logit"),
data = full_df)
AIC(fittedxG)
BIC(fittedxG)
AIC(fitRathkexG)
BIC(fitRathkexG)
#Calculating xG for each shot using the fitted distance and angle model.
predxG <- function(n, data){
xG <- numeric(n)
for(i in 1:n){
xG[i] <- exp(fittedxG$coefficients[1] + fittedxG$coefficients[2] *
data[i,13] + fittedxG$coefficients[3] * data[i,14])/
(1 + exp(fittedxG$coefficients[1] + fittedxG$coefficients[2] *
data[i,13] + fittedxG$coefficients[3] * data[i,14]))
}
xG
}
xG <- predxG(length(df_angle$angle_of_goal), data = df_angle)
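#Sketch only, not part of the original analysis: the inverse-logit formula
#used in predxG() should agree with predict(..., type = "response").
#This is checked here on simulated data; the toy_ objects and the
#coefficients used to simulate them are assumptions for illustration.
set.seed(1)
toy <- data.frame(dist = runif(200, 5, 30), angle = runif(200, 5, 60))
toy$Goal <- rbinom(200, 1, plogis(-1 - 0.1 * toy$dist + 0.03 * toy$angle))
toy_fit <- glm(Goal ~ dist + angle, family = binomial(link = "logit"),
               data = toy)
toy_b <- coef(toy_fit)
#plogis(eta) is exp(eta) / (1 + exp(eta)), the expression in predxG().
toy_manual <- plogis(toy_b[1] + toy_b[2] * toy$dist + toy_b[3] * toy$angle)
toy_pred <- predict(toy_fit, newdata = toy, type = "response")
toy_same <- all.equal(unname(toy_manual), unname(toy_pred))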
#Heatmap figure 3.4 for distance and angle xG, based on Gómez (2020).
library(RColorBrewer)
logitheat <- ggplot(data = fittedxG_df, aes(x= x1, y = y1)) +
annotate_pitch(colour = "white", fill = "black", limits = FALSE) +
theme_pitch() + theme(plot.background = element_rect(fill = "black"),
title = element_text(colour = "white")) + coord_flip(xlim = c(50, 100),
ylim = c(0, 100)) + geom_tile(aes(fill = xG)) +
scale_fill_gradientn(colours = rev(brewer.pal(11, "Spectral")),
limits = range(0,1)) + theme(legend.position = c(0.8, 1.1),
legend.direction = "horizontal", legend.text = element_text(color =
"white", size = 8, face = "plain"), legend.background =
element_rect(fill = "black"), legend.key = element_rect(fill =
"black", color = "black"), plot.title = element_text(hjust = 0.07,
face = "plain"), plot.subtitle = element_text(hjust = 0.07, size = 10,
face = "italic"), plot.caption = element_text(hjust = 0.95),
plot.margin = margin(1, 0.2, 0.5, 0.2, "cm")) + labs(fill = "xG value") +
ggtitle("Logistic Regression xG Model",
"with covariates distance & angle to goal")
#Computing Brier score and skill score for both xG models.
BrierxG <- function(obs, pred){
n <- length(obs)
val <- numeric(n)
for(i in 1:n) {
val[i] <- sum((pred[i] - obs[i]) ^ 2)
}
(1/(n)) * sum(val)
}
BS1 <- BrierxG(fittedxG_df$Goal, xG_df$xG) # = 0.10779996
BS2 <- BrierxG(fittedxG_df$Goal, fittedxG_df$xG) # = 0.10560873
refxG <- rep(0.1371, 32760) #Empirical shot conversion rate.
refBS <- BrierxG(xG_df$Goal, refxG)
BSS1 <- 1 - (BS1/refBS) # = 0.087568
BSS2 <- 1 - (BS2/refBS) # = 0.106115
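#Worked example (illustration only): the Brier score is the mean squared
#difference between forecast and outcome, so it can be written in one line.
#The function name brier and the four toy shots are hypothetical.
brier <- function(obs, pred) mean((pred - obs)^2)
toy_obs <- c(1, 0, 0, 1)
toy_xGpred <- c(0.8, 0.1, 0.3, 0.6)
toy_BS <- brier(toy_obs, toy_xGpred)   #(0.04 + 0.01 + 0.09 + 0.16)/4 = 0.075
#Reference forecast: the empirical conversion rate for every shot.
toy_refBS <- brier(toy_obs, rep(mean(toy_obs), length(toy_obs)))   #0.25
toy_BSS <- 1 - (toy_BS/toy_refBS)      #0.7, positive = better than reference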
#ROC Plot and AUC.
library(verification)
library(pROC)
idealroc <- roc(fittedxG_df$Goal, fittedxG_df$Goal) #Ideal (perfect) classifier
plot(idealroc, lty = 2, xlim = c(1,0),
main = "ROC Curve for both xG Models")
modelroc <- roc(fittedxG_df$Goal, fittedxG_df$xG)
plot(modelroc, add = T)
model2roc <- roc(xG_df$Goal, xG_df$xG)
plot(model2roc, add = T, col = 2)
legend(x = 0.5, y = 0.1, legend = c("Distance & Angle", "Rathke Zonal"),
fill = c(1, 2))
auc(modelroc) # = 0.7382
auc(model2roc) # = 0.7095
#MSE calculations for each game and team combination.
MSE_fitdf <- aggregate(cbind(Goal, xG) ~ MatchID + TeamID, data =
fittedxG_df, FUN = sum)
MSE_rathdf <- aggregate(cbind(Goal, xG) ~ MatchID + TeamID, data =
xG_df, FUN = sum)
fitMSE <- mean((MSE_fitdf$Goal - MSE_fitdf$xG)^2)
rathMSE <- mean((MSE_rathdf$Goal - MSE_rathdf$xG)^2)

C.1.3 Interruptions Data Processing and Analysis


library(jsonlite)
#Example of processing event positions from an untidy format:
eng <- fromJSON("events_England.json")
getpositions <- function(n){
x1 <- numeric()
y1 <- numeric()
for(i in 1:n) {
x1[i] <- eng$positions[[i]][1,2]
y1[i] <- eng$positions[[i]][1,1]
}
Positions <- cbind(x1, y1)
Positions
}
Positions <- getpositions(length(eng[,1]))
#Example of how interruption durations were calculated:
getfk <- function(event, subevent, eventsec){
n <- length(event)
fkduration <- numeric()
for(i in 1:n){
if(subevent[i] == 31 || subevent[i] == 32 || subevent[i] == 33){
fkduration[i] <- as.numeric(eventsec[i]) -
as.numeric(eventsec[i-1])
} else if(is.na(subevent[i]) == T) {
fkduration[i] = 0
} else fkduration[i] = 0
}
fkduration
}
FK_durationENG <- getfk(eng$eventId,eng$subEventId, eng$eventSec)
FK_Durations <- c(FK_durationENG, FK_durationFRA, FK_durationITL,
FK_durationGER, FK_durationSPA)
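#Sketch only, not part of the original analysis: a vectorised version of
#the free-kick duration idea in getfk(), using diff() for the gap to the
#previous event. The name fk_duration_vec is hypothetical. Assumptions:
#the first event gets a gap of 0 (it has no predecessor), and %in% treats
#a missing subevent as "not a free kick". Like the loop, it does not treat
#period or match boundaries specially.
fk_duration_vec <- function(subevent, eventsec){
  gap <- c(0, diff(as.numeric(eventsec)))
  ifelse(subevent %in% c(31, 32, 33), gap, 0)
}
#Usage on four made-up events (two of them free kicks):
fk_demo <- fk_duration_vec(c(85, 31, NA, 33), c(2.0, 30.5, 33.0, 70.0))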
#How the timing variable was calculated:
library(dplyr)
fhtime <- leaguesevents %>% group_by(matchId) %>%
filter(matchPeriod == "1H") %>%
summarize(Fhtime = max(eventSec))
leagueseventst1 <- leaguesevents %>% left_join(fhtime, by = "matchId")
Timing <- function(eventsec, Fhtime, matchperiod) {
n <- length(eventsec)
timing <- numeric(n)
for(i in 1:n){
if(matchperiod[i] == "1H") {
timing[i] <- eventsec[i]
} else if(matchperiod[i] == "2H"){
timing[i] <- eventsec[i] + Fhtime[i]
}
}
timing
}
Timing <- Timing(leagueseventst1$eventSec, leagueseventst1$Fhtime,
leagueseventst1$matchPeriod)
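#Sketch only, not part of the original analysis: the same timing rule
#written with ifelse() instead of a loop. The name timing_vec is
#hypothetical. Events in any period other than "1H" or "2H" are set to 0,
#matching the initial value used by the loop version.
timing_vec <- function(eventsec, fhtime, matchperiod){
  ifelse(matchperiod == "2H", eventsec + fhtime,
         ifelse(matchperiod == "1H", eventsec, 0))
}
#Usage on three made-up events from a match whose first half lasted 2820 s:
timing_demo <- timing_vec(c(10, 2700, 5), c(2820, 2820, 2820),
                          c("1H", "1H", "2H"))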
#Example of how the area variable was computed, here for free kicks.
fkarea <- function(x, y, subevent){
n <- length(x)
Area <- numeric()
for(i in 1:n){
if(subevent[i] == 31 || subevent[i] == 32 || subevent[i] == 33){
if(x[i] > 66.7 && y[i] > 20.4 && y[i] < 79.6){
Area[i] <- 1
} else if(x[i] > 66.7 && y[i] < 20.4){
Area[i] <- 2
} else if(x[i] > 66.7 && y[i] > 79.6){
Area[i] <- 2
} else if(x[i] < 66.7 && x[i] > 33.3 && y[i] > 20.4 && y[i] <
79.6){
Area[i] <- 3
} else if(x[i] < 66.7 && x[i] > 33.3 && y[i] < 20.4){
Area[i] <- 4
} else if(x[i] < 66.7 && x[i] > 33.3 && y[i] > 79.6){
Area[i] <- 4
} else if(x[i] < 33.3 && y[i] > 20.4 && y[i] < 79.6){
Area[i] <- 5
} else if(x[i] < 33.3 && y[i] < 20.4){
Area[i] <- 6
} else if(x[i] < 33.3 && y[i] > 79.6){
Area[i] <- 6
}
} else if(is.na(subevent[i]) == T){
Area[i] = 0
} else Area[i] = 0
}
Area
}
FK_Area <- fkarea(fullevents$x1, fullevents$y1, fullevents$subEventId)
#Example of how mean penalty duration was calculated for each league.
engpen <- subset(Penalties, League == "England")
mean(engpen$Pen_Durations)
