
Introductory Statistics for Economics

Brian Krauth

Fall 2021
Contents

About this book
About the author

1 Introduction
1.1 Course goals and context
1.2 Expectations
1.3 SFU-specific information
1.4 Computer resources
1.5 Conventions of this book

2 Basic data cleaning with Excel
2.1 Application: Canadian employment data
2.2 Data cleaning principles
2.3 Introduction to Excel
2.4 Cleaning data
2.5 Data types
2.6 Version control
Chapter review
Practice problems

3 Probability and random events
3.1 Outcomes and events
3.2 Probabilities
3.3 Joint and conditional probabilities
Chapter review
Practice problems

4 Introduction to random variables
4.1 Random variables
4.2 The expected value
4.3 Other characteristics
4.4 Standard discrete distributions
Chapter review
Practice problems

5 More on random variables
5.1 Continuous random variables
5.2 The uniform distribution
5.3 The normal distribution
5.4 Multiple random variables
Chapter review
Practice problems

6 Basic data analysis with Excel
6.1 Exploratory data analysis
6.2 Univariate statistics in Excel
6.3 Univariate graphs in Excel
Chapter review
Practice problems

7 Statistics
7.1 Data and the data generating process
7.2 Statistics and their properties
7.3 Estimation
7.4 The law of large numbers
Chapter review
Practice problems

8 Statistical inference
8.1 Principles of inference
8.2 Hypothesis tests
8.3 The central limit theorem
8.4 Inference on the mean
8.5 Confidence intervals
Chapter review
Practice problems

9 An introduction to R
9.1 A brief tour of RStudio
9.2 The R language
9.3 Packages and the Tidyverse
9.4 Some examples
Chapter review
Practice problems

10 Advanced data cleaning
10.1 Data file formats
10.2 Combining data
10.3 Advanced data management
10.4 Reading and viewing data in R
Chapter review
Practice problems

11 Using R
11.1 Cleaning data in R
11.2 Data analysis in R
11.3 Graphs with ggplot
Chapter review
Practice problems

12 Multivariate data analysis
12.1 Covariance and correlation
12.2 Pivot tables
12.3 Graphical methods
Chapter review
Practice problems

A Math review
A.1 Sets
A.2 Functions
A.3 Sequences and summations
A.4 Limits
Chapter review
Practice problems

B Solutions to practice problems
2 Basic data cleaning with Excel
3 Probability and random events
4 Introduction to random variables
5 More on random variables
6 Basic data analysis with Excel
7 Statistics
8 Statistical inference
9 An introduction to R
10 Advanced data cleaning
11 Using R
12 Multivariate data analysis
Appendix A Math review
About this book

This book has been written for use as a textbook for ECON 233, the introductory statistics course for economics majors at Simon Fraser University.
The current version of the book can be obtained at https://bookdown.org/bkrauth/BOOK/.
The book itself is written using Bookdown, and its source code is available at https://github.com/bvkrauth/econ233.

About the author


Brian Krauth is Associate Professor of Economics at Simon Fraser University.

Chapter 1

Introduction

Before we dive into the day-to-day course material, it is important to understand the big picture of what we are trying to do here, and to get set up for our work.

Goals
Chapter goals
In this chapter we will:

• Review educational goals and establish course expectations.
• Gather resources and tools, including all needed computer software.

1.1 Course goals and context


This is an introductory course in statistics for economics. It is similar to courses
taught all over the world to first and second year university students in business,
economics, and other social sciences.
Course goals
By the end of this course:
You will develop computer skills:

1. Clean, analyze, and graph data in Excel.
2. Clean, analyze, and graph data in R.
3. Follow recommended practices for data management and reproducible analysis.

You will become familiar with basic statistical concepts:


4. Calculate and interpret probabilities and expected values.
5. Explain the relationship between population and sample.
6. Describe the properties of a statistic or estimator including its probability distribution, expected value, variance, bias, and mean squared error.
7. State and apply the law of large numbers and central limit theorem.

You will be able to apply these skills in combination to analyze real-world economic data:

8. Construct and interpret common charts including histograms, scatter plots, and time-series plots.
9. Construct and interpret frequency tables and cross-tabulations.
10. Construct and interpret common univariate and bivariate statistics, including mean, variance, standard deviation, covariance, and correlation.
11. Construct and interpret hypothesis tests and confidence intervals.

We will be switching back and forth between theory, data analysis, and applications. All three skill sets are valuable.
Hopefully you are in this course because you are fascinated by statistics and can’t wait to learn more about it. But most of you are taking it because it’s a required course.
So I’d like to motivate everyone to treat this course as an opportunity to learn some very useful skills. Today’s world is awash in data:

• retailers maintain databases of transactions
• manufacturers track product quality and costs
• marketers collect data on customers and potential customers
• government records everyone’s interactions with schools, tax authorities, social welfare, health care, and criminal justice
• employers maintain detailed personnel records.

These databases can be linked and analyzed in various ways, and many of the
world’s most successful companies rely heavily on the innovative gathering and
usage of data:

• Google’s core product (the search engine) is built on the innovative analysis of massive amounts of data.
• Both Google and the major social media companies are based on providing
valuable “free” services in order to gather data on consumers that can then
be sold (in some form) to other businesses.
• Amazon and other retailers use what is called A/B testing to fine-tune
product descriptions and set prices so as to maximize profits.

Some of this data analysis is done by computer scientists, but much of it is done
by economists: for example, Amazon is the second-largest employer of PhD
economists in the US (after the Federal Reserve System).
This course will not qualify you for those jobs, but it is a first step in that
direction.
Be the Mona Lisa
I always tell students thinking about the future to remember supply and demand
in the labour market. In the labour market your skills and effort are the product,
and you are the seller. Like all sellers, you want to be expensive. This requires
that you have skills that are both:

• Useful (high demand)


• Uncommon (low supply)

In other words, you need to be like the Mona Lisa. If your skills are useful but
common (like water), or rare but useless (like my one-of-a-kind drawing of the
Mona Lisa) your labour will sell at a low price.

               High demand     Low demand
Low supply     the Mona Lisa   my drawing of the Mona Lisa
High supply    water

The ability to analyze data in a sophisticated way, and to explain the results in
written or oral presentation, is an extremely useful and uncommon skill. Most of
you do not have the technical skills of your colleagues in Computer Science, but
if you can combine a reasonable level of computer skills with writing, knowledge
of the underlying statistical principles, and the ability to recognize the economic
considerations in a situation, you will do quite well.

1.2 Expectations
The course is constructed under the assumptions that:

1. You have taken introductory microeconomics and introductory macroeconomics.
• We will use ideas from those courses in applications and examples.
2. You have seen some probability and statistics content in high school.
• It’s OK if you do not remember much.

3. You can do high-school level math including algebra and basic set theory
and have taken or are currently taking an introductory calculus course.
• I will not ask you to take derivatives or solve integrals; instead I will
refer to concepts like functions, sequences and limits.
• The math review appendix provides material and practice problems
if you need to review these concepts and tools.
4. You have access to a desktop or laptop computer, and have basic computer skills.

This is not a class in introductory economics, high school math, or using a computer. If you are a little behind in those skills you will need to ask for help, but I am happy to help anyone who asks.

1.3 SFU-specific information


ECON 233 is the first course in the two-course econometrics sequence that
is required for all economics majors. If you’ve never seen the word before,
“econometrics” just means statistics and data analysis for economics. The course
Canvas page is available at https://canvas.sfu.ca/courses/62548. It includes
information on lectures, tutorials, quizzes, and assignments.
ECON 233 or BUS 232?
All economics majors have the option of taking ECON 233 or BUS 232, so you
may be wondering what the difference is. Either course is suitable preparation
for ECON 333, but there are some key differences:

• Tools: ECON 233 uses both Excel and R, while BUS 232 uses Excel.
– You are likely to use R in ECON 333 and other upper-division ECON
courses, so it is nice to get used to it now.
• Applications: ECON 233 emphasizes economics applications, while BUS
232 emphasizes business applications.

ECON 233 is part of the Social Data and Analytics (SDA) minor; if you are
an economics student and are interested in that minor you are recommended to
take ECON 233.
ECON 333 is the second course in the two-course econometrics sequence required
for all economics majors. In ECON 333, you will learn more advanced techniques
including linear regression, you will use R more extensively, and you will go
deeper into the theory.
Related courses
If you find you enjoy and/or do well in this course, I would strongly encourage
you to take further courses in econometrics:

• ECON 334: Data Visualization and Economic Analysis is an elective focusing on exploratory data analysis and visualization.
• ECON 335: Introduction to Causal Inference and Policy Evaluation is an elective focusing on the problem of inferring cause-and-effect from economic data, and using data to forecast the effects of economic policies.
• ECON 433: Financial and Time Series Econometrics is an advanced elective focusing on techniques for analyzing the kind of time series data that is used in macroeconomics and financial markets.
• ECON 435: Econometric Methods is an advanced course in statistics and econometrics that is part of our honours sequence. It gives you the opportunity and tools to write a serious empirical research paper. Non-honours students are eligible to take it if they have a 3.0 CGPA and the course prerequisites.

I would also encourage you to take courses outside of the economics department, and to consider a Statistics minor or the new interdisciplinary Social Data Analytics (SDA) minor.

1.4 Computer resources


To do the computer work you will need a computer with internet access and the
following software packages installed:

• Microsoft Excel
• R
• RStudio

Excel is a commercial application, while R and RStudio are both open-source (free). They are available for both Windows and macOS. The examples in the textbook use Windows.
The required software packages are available free of charge for SFU students, and are installed on all campus lab computers.

1.4.1 Installing Microsoft Excel

Microsoft Excel is a well-known spreadsheet program that is available for both Windows and macOS. Alternatives to Excel include Google Sheets and Apple Numbers.
Installing Excel at SFU
SFU has a licensing agreement with Microsoft that allows its students free installation of the entire Microsoft Office suite, including Excel. Installation instructions are available at https://www.sfu.ca/itservices/technical/software/office365.html.

Once you have installed Excel, you should confirm that it is working by starting
the program. You should see something that looks like this:

1.4.2 Installing R and RStudio

Later in the semester, we will also be using a more specialized statistical program
called R, and a related program called RStudio.

• R is a programming language used for statistical analysis.


• RStudio is an “Integrated Development Environment” for R; that is, it is an integrated set of tools for building and running R programs.

Both R and RStudio are open-source, and are available free of charge for both
Windows and macOS. Installation instructions are available at:

https://rstudio.com/products/rstudio/download/#download.

Be sure to install R first, then RStudio.

After installing R and RStudio, you should confirm that they are working by
opening RStudio. You should see something like this:

1.4.3 Installing the Tidyverse

One of the most useful features of R is that it allows users to write and distribute
packages that extend its capabilities.
One of the most popular and useful packages is called the Tidyverse. R is a
very powerful program, but it is also a very old one: the underlying language
(called “S”) was originally created in 1976. As a result, some of the original commands are outdated in design and aren’t well suited to modern capabilities or principles of software development. The Tidyverse solves this
problem by adding new, more modern versions of these commands. You can
learn more about the Tidyverse at https://www.tidyverse.org/.
To install the Tidyverse package:

1. Open RStudio if it isn’t already open.


2. Click in the Console window (you will see it towards the bottom of the
screen)
3. Enter install.packages("tidyverse") (i.e., type it and hit the <enter>
key)

Once the installation is concluded and the > prompt reappears you can test to
make sure the installation worked.

4. Enter library(tidyverse) in the console window.


• If you get an error message like Error in library(tidyverse) : there is no package called ‘tidyverse’, drop by office hours for help.

• If you don’t get an error message (you will get some message about
“Conflicts”), the installation worked.
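Putting the two steps together, the entire console session consists of the two commands described above (the installation only needs to be done once per computer, while library(tidyverse) must be run in each new R session where you want to use the package):

    # Install the Tidyverse (one time only; requires an internet connection)
    install.packages("tidyverse")
    # Load the Tidyverse (run in each R session that uses it)
    library(tidyverse)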

If you run into trouble here, don’t worry. We will not need the Tidyverse for a
few weeks, so there is plenty of time to get help.

1.5 Conventions of this book


This book uses consistent visual conventions to convey information.

• Organization:
– Each chapter corresponds to one full week of the course.
• Typography:
– Computer code or other inputs are shown like this.
– Math is usually shown like 𝑡ℎ𝑖𝑠.
– When new terminology is introduced, it is shown like this.
• Boxes:
– Pull-out information is shown in colored boxes.

Example 1.1. Boxes like this are for examples.

Goals

Boxes like this are for showing course or chapter goals.

Economics background

Boxes like this are for providing economic background.

FYI
Boxes like this are for providing optional information that might be of
interest to some students.
Chapter 2

Basic data cleaning with Excel

Before we can analyze data, we usually need to clean it. Cleaning data means
putting it into a form that is ready to analyze, and can include:

• Converting data files from one file format to another


• Reorganizing complex tables of data into simple tables
• Combining data from multiple tables into a single table
• Modifying existing variables
• Creating new variables

In this course, we will do most of our data cleaning using Excel.

This chapter will develop both some principles of data cleaning and the basic Excel tools needed to implement them. We will also apply this knowledge by using Excel to clean a real data set.


Goals
Chapter goals
In this chapter we will learn how to:

• Use basic Excel terminology and concepts including:


– Workbooks, worksheets, and cells
– Cell contents and display format
– Cell ranges
– Relative and absolute cell addresses
– Numeric, text, date, and logical data types
• Use Excel tools to view data, including:
– Sorting
– Filtering
– Freezing panes
– Changing cell size
• Use Excel tools to enter new data and “clean” existing data, including
– Fill and series
– Formulas
– Functions
– Number formats
• Follow good data management practices
– Practice data documentation and version control
– Identify the characteristics of tidy data

2.1 Application: Canadian employment data

We will demonstrate the key principles and tools in this chapter by cleaning
the November 2020 employment data for Canadian provinces. The data can be
found in the file
https://bookdown.org/bkrauth/BOOK/sampledata/CanEmpNov20.xlsx

Economics background

Economics review: Employment statistics


Employment statistics are typically covered in a Principles of Macroeconomics course (at SFU, this course is called ECON 105). To review the basics:

• These statistics are reported monthly by Statistics Canada and are based on the Labour Force Survey (LFS), a monthly survey of the civilian, non-institutionalized, working-age population of Canada.
– “civilian” excludes those on active military duty
– “non-institutionalized” excludes people in prison, hospitals,
nursing homes, etc.
– “working-age” excludes those under age 15
– People living on reserve are also not covered by the LFS
• The LFS population is grouped into three categories:
– Employed: worked for pay or profit in the previous week, or
had a job and was absent (e.g. due to illness or vacation).
– Unemployed: not employed in the previous week, but either
looking for work, on temporary layoff, or had a job to start
within the next four weeks.
– Not in the labour force: everyone else. This includes retirees, full-time students who aren’t working for pay, and anyone else who is neither working nor looking for work.
• These basic counts are then used to calculate several other statistics:
– The labour force is the total count of those who are employed or unemployed.
– The labour force participation rate is the proportion or percentage of the population that is in the labour force. The Canadian LFP rate is typically around 65%.
– The unemployment rate is the proportion of the labour force that is unemployed. The Canadian unemployment rate is typically 5-10%.

The unemployment rate and LFP rate are key indicators of labour market conditions. A higher-than-usual unemployment rate means that workers are having difficulty finding work, and a lower-than-usual LFP rate means that some workers have stopped looking for work.
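To see how these definitions fit together, here is a quick check with hypothetical round numbers (not actual LFS figures): suppose a population of 1,000,000 contains 600,000 employed and 50,000 unemployed people. Then:

    labour force      = 600,000 + 50,000 = 650,000
    LFP rate          = 650,000 / 1,000,000 = 0.65 (65%)
    unemployment rate = 50,000 / 650,000 ≈ 0.077 (about 7.7%)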

2.2 Data cleaning principles


Next, we describe some of the core principles we will follow in cleaning and
managing our data.

2.2.1 Reproducibility

When analyzing data it is important for the results of our analysis to be reproducible. What that means is that an interested reader should be able to figure out exactly how you got your results, and should be able to generate those results themselves. Note that the “interested reader” might be you, as it is common to return to an analysis you did earlier. Without reproducibility, you may not remember where you found the data, what you have done with it, or what it means.
Reproducibility requires that we treat our data with care. In particular, we
should:

1. Document the original sources for all data.


• If a data file is downloaded from the internet, the original URL should
be documented.
2. Keep an unmodified copy of all original data.
3. Avoid directly editing data. Add new data instead.
• Sometimes this is not possible or practical, which is why we keep an
unmodified copy of the original data.
4. Give all files and variables brief but informative names.
• Avoid spaces or special characters in variable or file names. They
can cause problems when moving data between operating systems,
file formats, or applications.

We will discuss the implementation of these principles later in this chapter.

2.2.2 Tidy data

To keep analysis easy and minimize errors, data should be organized as what
data scientists have come to call tidy data. Tidy data has the following seven
properties:

1. Data is arranged in one or more simple rectangular grids or tables.


2. Each column in a table represents a distinct variable or attribute.
3. Each variable has an informative and unique variable name.

• Variable names are typically displayed in the top row of the table.
4. Each row in a table (after the top row) represents a distinct observation,
data point or case.
5. All observations in a given table come from the same unit of observation
• For example, data on Canadian cities and data on Canadian
provinces should probably be in two separate tables.
6. One variable in a table serves as a unique identifier or ID for the observation.
• A unique identifier takes on a different value for each observation.
• The ID variable is often in the first column of the table.
7. The order in which observations or variables are listed is irrelevant to the
analysis.
• That is, the interpretation of the table would not change if its rows
or columns were in a different order.

Data that is not tidy is sometimes called messy data. One of the first steps in
cleaning data is rearranging it from a messy format to a tidy format.
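As a schematic illustration (using the column names from the employment application later in this chapter, with dots standing in for the actual values), a tidy table looks like this:

    ID  Province          Population  Employed  Unemployed
    1   Alberta           …           …         …
    2   British Columbia  …           …         …

Each row after the top row is one province (the unit of observation), each column is one variable with its name in the top row, and the ID column uniquely identifies each row.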

2.2.3 Observations and identifiers

Each tidy data set normally includes at least one unique identifier or ID variable.
By definition, an ID variable must take on a different value for each observation.
With this property, we can use ID variables to link and combine data from
multiple sources.
Example 2.1. ID variables at SFU
SFU is one of British Columbia’s largest organizations, and relies heavily on ID
variables to organize its information:

• Each person at SFU (faculty, staff or student) has a unique 16-digit ID number and a unique computing ID.
• Each semester at SFU has a 4-digit semester code. For example, Fall 2021
is assigned the semester code 1217.
• Each academic program at SFU has a 2-letter to 4-letter program code
such as IS, BUS, or ECON.
• Each course at SFU is uniquely identified by the combination of its program code (e.g. ECON), its numeric course code (e.g. 233), its section number (e.g. D100), and its semester number (e.g. 1217).

Your library records, grades, financial records, and almost any other information
SFU has about you includes one or more of these IDs.

The example above shows several common characteristics of ID variables:

1. They can be numbers (like your student ID number) or text strings (like
your computing ID).
2. They can be

• Assigned sequentially or arbitrarily (like your student ID number),


• Constructed by a formula (like semester codes).
• Standardized and listed in a table (like program codes)

3. An observation can sometimes be identified by combining identifiers. For


example, a specific course at SFU might be “ECON233-D100-1217”.

ID variables should ideally be portable across applications and systems. That is, our ID variable should still work if we move data from a Windows PC to a Linux or Apple system, and should still work if we move from Excel to R or Python. To maximize portability, keep in mind the following potential issues:

Issue: Some applications will round non-integer values (changing 1.23 to 1) or drop leading zeros (changing 00045 to 45).
Solution: Numeric ID variables should always be integers without leading zeros, or converted to text.

Issue: Some applications will reject or transform spaces or unusual characters.
Solution: Text ID variables should only use (Latin) letters and (Arabic) digits.

Issue: Some applications are case-sensitive (so that “hello” and “Hello” are different values) and others are not.
Solution: Text ID variables should typically use either all upper-case or all lower-case.
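One way to implement the first solution in Excel is the built-in TEXT() function, which converts a number to fixed-width text (a hypothetical formula; A2 here is whatever cell holds the numeric ID):

    =TEXT(A2,"00000")

Applied to the value 45, this returns the text string "00045", which keeps its leading zeros when moved between applications.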

It’s OK for an ID variable to be nearly-but-not-exactly identical to some other variable. For example, a Canadian data set might have both a Province variable that takes on values like “Québec” and “Nova Scotia” and a ProvinceID variable that takes on values like “QUEBEC” and “NOVASCOTIA.”

FYI
Names, IDs, and probabilistic matching
Proper names are typically not used as ID variables since they are not
necessarily unique. In addition, they are not always written consistently.
For example, the same person might be called “Doug” in one data set
and “Douglas” in another.
Occasionally a data set will not have a standardized ID variable, and
our only option is to match observations based on a proper name. For
example, in BC school data an ID number called a PEN while health
data uses a different ID number called a PHN. There is no direct way
to match PEN and PHN, so we have to match education and health
records on a combination of proper names and other information such as
year of birth. Matches made this way are called “probabilistic” matches,
meaning (roughly) that the records probably describe the same person
but might not.

2.3 Introduction to Excel

Excel is the most commonly-used example of a spreadsheet, a software program designed for the tabulation, analysis, and display of data. Its main competitors include Google Sheets and Apple Numbers. It is widely used in business and government, so good Excel skills are valuable in the labour market.

Many of you have probably used Excel before, but there are many features you
are probably not yet familiar with. We will start by going over some of its basic
characteristics and terminology.

2.3.1 Terminology and interface

Start Excel. Your screen should look something like this:

When giving instructions, I will refer to various elements of Excel’s user interface
by name. You may have been using these elements for years without ever
knowing their names, so I will list them here:

• The menu bar is at the top of the screen:

• The ribbon is below the menu bar

– The ribbon has buttons for performing simple actions.


– The buttons are grouped by function, for example the ribbon above
depicts groups called Clipboard, Font, etc.
∗ Within each function group, there is usually a little icon in
the lower-right corner that you can click on to access additional
options.
– If your ribbon is not visible:
∗ You can make it visible by selecting Home (or anything else) on
the menu bar.
∗ Once you have made the ribbon visible, you can keep it visible
by clicking on the little thumbtack icon in its lower right corner.
– The contents of the ribbon depend on the currently-active menu bar
option (usually but not always Home).

∗ I will assume that Home is the currently-active option; if it isn’t you can just select Home to make it the currently-active option.
• The formula bar is just below the ribbon.

– It shows the contents of the current cell.


– You can type in it to change the contents of the current cell.

• The insert function button is to the left of the formula bar.


– We will learn to use this later.
• Most of the screen displays a grid of cells that is called a spreadsheet
or worksheet:

– Columns are identified by letter.


– Rows are identified by number.
– Cells are identified by column and row. For example, the cell in
column A, row 2 is called cell A2.
• Below the worksheet is a row of tabs:

– You can click on a tab to switch to that worksheet.


– You can double-click on a tab to change its name.
– You can click on the “+” button to add a new worksheet.
• An Excel file is called a workbook.
– Each workbook contains one or more worksheets.
– in Windows, Excel files normally have the .xlsx extension.
– Older files sometimes have the .xls extension.

2.3.2 Viewing data

The first step in any data cleaning exercise is to look at the data and assess
what needs to be done.

Example 2.2. Overview of the employment data


Open the data file in Excel. As you can see, this Excel workbook contains three
worksheets:

• Employment Nov 2020 is our main data table:

It is in tidy format:
– There is a single rectangular table starting in cell A1.
– Each row represents a Canadian province
– Canadian provinces are the unit of observation
– Each column represents a variable describing that province
– The top row shows brief but clear names for each variable
– The provinces are listed in alphabetical order, but the interpretation
of the table would not change if they were listed in some other order
• Raw data is the original data as downloaded from Statistics Canada.

It is messy (the opposite of tidy):
– The table starts in cell A6 rather than cell A1.
– Variables (Population, Labour Force, etc.) are in rows rather than
columns.
– Observations (the unit of observation is the province-month) are in
both rows and columns.
– Cells are filled in “implicitly”. For example, there is nothing in row
17 that says what province it describes, but we know that it describes
Nova Scotia because it comes after row 16 (which does say it describes
Nova Scotia). As a result, the order of rows is very important.
• Source describes where the original data was obtained.

Note that we are following good data management practice by saving a copy
of the original data and creating a new data set based on it, rather than by
directly editing the original data. We are also documenting data sources. Both
of these practices will enhance the reproducibility and reliability of our analysis.

In the remainder of this section, we will learn a few tools for changing how our
data is displayed that do not change the content of the data.

2.3.3 Sorting, filtering, and freezing

Sorting allows us to re-order the rows of our data based on the value in one or
more of the columns. Since order does not matter with tidy data, we can sort
in whatever way we like without changing the content of our data.

Example 2.3. Sorting the employment data


The original data are in alphabetical order by province name. But suppose we
wanted it to be ordered by population (with the highest populations on top)
instead. Here’s all we need to do:

1. Select any cell in the Population column.


2. Select Home > Sort and Filter > Sort Largest to Smallest.

As you can see, the data set is now sorted by population. We can follow similar
steps to put the data set back in alphabetical order:

1. Select any cell in the Province column.


2. Select Home > Sort and Filter > Sort A to Z.

Notice that Excel can tell whether a column contains numbers or text, and will
sort accordingly.

You can sort in ascending or descending order, and you can sort on multiple
columns by selecting the “Custom sort” option.
Filtering allows us to hide some observations so we can look at a particular
subset of observations that we are interested in. The hidden observations are
still there.

Example 2.4. Filtering the employment data


Suppose we only want to see data for larger provinces. Here’s what we can do:

1. Select Home > Sort and Filter > Filter. If you look at the column
headers in your sheet you will see that they have become drop-down boxes.
2. Click on the drop-down box for Population, then select Number filters
> Greater Than...
3. Enter one million (1000000) in the box and select OK.

At this point, only the provinces with more than one million residents appear in the table. Don’t worry, the other ones haven’t gone anywhere.
We can undo the filter and remove the drop-down boxes by selecting Home >
Sort and Filter > Filter again.

You can filter on more complex criteria, and you can combine sorting and filtering.
Freezing panes keeps some rows and/or columns visible regardless of which
cell is currently selected. This allows us to work with large tables while keeping
the top row (variable names) and/or the first column (observation IDs) visible.

Example 2.5. Freezing panes


Go down to row 50 or so in your worksheet. Notice that you can’t see the variable names in row 1 any more. This is fine for our current data set, but it would be a problem with a data set too large to fit on the screen.

1. Go back to cell A1.


2. Select View > Freeze Panes > Freeze Top Row.

Now go back down to row 50 or so. You will see that the top row is still displayed
and you can see the variable names.
To undo this, select View > Freeze Panes > Unfreeze Panes.

Instead of freezing the first row, you can freeze the first column, or you can
freeze any number of rows and columns.

2.3.4 Cell formatting

Another way we can change the appearance of our data without changing its
content is to adjust the cell formatting for one or more cells. Cell formatting
characteristics include:

• Column width
• Row height
• Font
• Bold/italics/underline
• Text color
• Background color
• Cell borders
• Alignment (left/right/center as well as top/bottom/middle)

The procedure for modifying the cell format is straightforward if you regularly
use productivity applications like Microsoft Word.
Example 2.6. Changing cell size
You may notice that cell B8 (Ontario population) appears to contain
“######” rather than a number. The cause of this problem is that the cell
is not wide enough to display the correct number. So let’s make it wider.
We have several options for doing this:

• From the menu: Select any cell in column B, and then select Home > Format > Column Width.... A dialog box will appear that allows you to enter your preferred width. Try 10.
• With the mouse: Move your cursor to the line between column headers B and C until it changes to a two-headed resize arrow. Click and drag the cursor to resize the column.
• Auto-fit (this is what I usually do): Move your cursor to the line between column headers B and C, and double-click. The column width will automatically adjust to fit the data.

You can follow similar procedures to adjust the row height.

Another important cell formatting characteristic is the number format, which will be discussed in a later section.

2.4 Cleaning data


We are now ready to start actively cleaning our data.

2.4.1 Preparation

We start by making two copies of our data set:

1. The original file, exactly as we received it.


2. A working copy under a new name.

Recall that one of our core principles of reproducibility is to always keep an unmodified copy of the original data. We will do any modifications or additions in the working copy.
Example 2.7. Creating a working copy
Save your Excel workbook, giving it a different name from the name of the
original data file.
As we proceed through the example application, be sure to save your working
copy each time you make a significant modification or addition.
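For example (a hypothetical file name), you might save CanEmpNov20.xlsx as CanEmpNov20Working.xlsx, which keeps the name brief and informative, avoids spaces, and leaves the original file untouched.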

The next step is to look at the data and construct a cleaning plan. Your cleaning
plan should have several steps:

1. Convert all data into tidy-format tables.


2. Ensure that each table has a unique ID variable.
3. Identify and address problems in existing variables.
4. Create new variables that may be useful.
5. Link data across tables.

The data cleaning plan should be based around what we plan to do with the
data, but should preserve flexibility in case we want to use the data for other
purposes.
Example 2.8. A plan for cleaning the employment data
The first step in cleaning the employment data is to ensure we have tidy data.
This step has already been completed, so we can move on.
The second step is to ensure that each table has a unique ID variable. In our
employment data, the province name could serve as a unique identifier, but
names typically have some drawbacks:

• Names are not always unique. This is obviously not an issue with Cana-
dian provinces, but it is an issue in many other data sets. For example,
there are 41 cities and towns in the United States named “Springfield.”
• The same observation often appears with different names in different data
sets. For example, some older data sets call the province of Newfoundland
and Labrador just “Newfoundland” as that was the province’s name before
December 6, 2001.

As a result, it is typically not advisable to use names as ID variables. Instead, we could:

• Use a standardized code such as the two-letter postal abbreviation.


• Assign an arbitrary number to each observation.

We will do both.
The third step is to identify and address problems in our existing variables.
The fourth step is to consider whether there are any variables we would like to analyze that have not yet been constructed. In our employment data, we will want to construct:

• Labour force
• Labour force participation rate
• Unemployment rate

We will also want to include information on which specific month these observations are describing (November 2020). Although that information is in the Raw data worksheet and in the title of the main worksheet (Employment Nov 2020), it may be useful to have it in the table as well.

We will now add several variables to our working data set. Be sure to save the
file after adding each variable so that you don’t lose your work.

2.4.2 Inserting cells

Most of the time we will add data to the end of the existing table, adding new
variables after the last column or new observations after the last row. But
occasionally we will want to insert a column or row into our table.

Example 2.9. Adding a column for the ID variable


We will want our first column to include a not-yet-constructed ID variable, so
we need to insert a column to the left.

1. Select any cell in column A.


2. Select Home > Insert > Insert Sheet Columns from the menu.

This will shift all columns to the right, and insert a new blank column A.

We can also insert rows, delete rows or columns, and even insert or delete
individual cells.

2.4.3 Data entry, fill and series

The simplest way to add a variable to an Excel table is by typing data directly
into the cells.

Example 2.10. Entering data


As we discussed earlier, we would like to add the two-letter postal abbreviation
for each province:

Province or Territory ProvAbb


Alberta AB
British Columbia BC
Manitoba MB
New Brunswick NB
Newfoundland and Labrador NL
Northwest Territories NT
Nova Scotia NS
Nunavut NU
Ontario ON
Prince Edward Island PE
Quebec QC
Saskatchewan SK
Yukon YT

The postal abbreviation is useful for at least two reasons:

1. It is standardized, while the full province name may not be.


• Quebec is sometimes written Québec.
2. It is short, which can be useful in charts.

To add this variable:

1. Enter ProvAbb in cell F1 to name the variable.


2. Fill in cells F2:F11 with the correct postal abbreviation.

Excel has several tools available to speed the process of entering data. First,
you can copy-and-paste or cut-and-paste the contents of any cell into any other
cell.
Excel’s fill tool allows you to quickly copy the contents of a cell into a set of cells immediately above, below, or to the left or right.

Example 2.11. Using fill


Our data cleaning plan includes adding a variable indicating which month and
year these particular observations describe (November 2020).

1. Enter MonthYr in cell G1 to name the variable.


2. Enter “11/20” in cell G2.
• You may notice that Excel displays the date differently from how you
entered it - on my computer it displays as Nov-20. We will talk more
later about how Excel handles dates.

Now we could enter the exact same date in cells G3:G11, but we can save
ourselves some time by using Excel’s fill tool:

3. Select cells G2:G11.


4. Select Home > Fill > Down.

As you can see, Excel fills in all selected cells with the value in the top cell.
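If you want to be sure how Excel interprets the date you type (a side note, not a step in the original exercise), you can instead construct it explicitly with the built-in DATE() function:

    =DATE(2020,11,1)

This returns November 1, 2020 regardless of your computer’s regional date settings; how the date is displayed still depends on the cell’s number format.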

The series tool allows you to fill in a group of cells with an ascending or
descending sequence of numbers or dates.

Example 2.12. Using series


Let’s create a unique sequential ID variable in column A. We can enter these
ID numbers by hand, but there is an easier way using Excel’s series tool:

1. Enter ID in cell A1 to name the variable.


2. Enter 1 in cell A2.
3. Select cells A2:A11.
4. Select Fill > Series.... The Series dialog box will appear.
5. There are several options for constructing a series. Fortunately, the default
is exactly what we want, so select OK.

As you can see, column A now contains a unique identifier that numbers
provinces from 1 to 10.

2.4.4 Formulas

Most of our new variables will be calculated from existing variables using formulas. A formula is just a rule for calculating a value from some other values. Formulas always start with the equals sign = followed by a mathematical expression that can include any combination of:

• Specific values, for example =2 or =TRUE


• References to other cells, for example =D2
• Standard arithmetic operators, for example =2+2 or =D2/4
• Functions, for example =SQRT(D2) or =SUM(D2:D10).

Formulas can be simple, or they can be quite complex.
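For instance (an illustrative formula, not part of the exercise below), a single formula can nest functions, cell references, and arithmetic operators:

    =ROUND(SQRT(D2)/SUM(D2:D10),2)

This takes the square root of the value in D2, divides it by the sum of the range D2:D10, and rounds the result to two decimal places.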


Example 2.13. Calculating the labour force
Returning to our application, everyone who is either employed or unemployed
is in what economists call the “labour force.” So let’s add that variable.

1. Enter LabourForce in cell H1 to name the variable.


• Note that I avoid putting spaces in variable names. This is a habit of mine when working with data, because spaces can cause complications when moving data across systems, e.g. from Excel to R.
2. Enter = D2 + E2 in cell H2.

Cell H2 now displays 2,493,300 which is in fact the value in cell D2 plus the
value in cell E2.

As discussed earlier, it is important to distinguish between the contents of a cell and how those contents are displayed. For example, the true contents of cell H2 in our example above are the formula = D2 + E2 but the cell shows the number 2,493,300. If you change the number in cell D2 or cell E2, the number shown in cell H2 will adjust accordingly.
Formulas can use the results from other formulas, as shown in the example
below.
Example 2.14. Calculating unemployment and LFP rates
Let’s add a column for the labour force participation rate. To remind you, this
is the proportion or percentage of the population (column C) that is in the
labour force (column H).

1. Enter LFPRate in cell I1 to name the variable.


2. Enter =H2/C2 in cell I2 to calculate the variable.

Let’s also add a column for the unemployment rate. To remind you, this is the
proportion or percentage of the labour force (column H) that is unemployed
(column E).

3. Enter UnempRate in cell J1 to name the variable.


4. Enter =E2/H2 in cell J2 to calculate the variable.

Notice that both of these formulas use cell H2, which itself contains a formula.

2.4.5 Functions

Excel has about 500 built-in functions that we can use in formulas. Each
function has a name and a set of arguments whose values you can set.
To use a function, you simply include its name and its arguments as part of
the formula. For example, the SQRT() function takes a single numeric argument
and returns the square root of the argument. So if you enter =SQRT(2) in a cell,
the cell will display 1.414, the square root of 2.
Excel also has extensive tools for

• Finding the function you need for a particular calculation task.


• Figuring out what arguments to use in the function.

You can also use Google or any other search engine to find this information.

Example 2.15. A simple function


Suppose we want to add a new variable for the (natural) log of population.

1. Enter LogPop in cell K1 to give the variable a name.


2. Select cell K2.
3. If we already know the name of the function for taking the natural log,
we can just type in our formula. But suppose we don’t know the name of
the function, or even whether there is such a function.

a. Click on the insert function button to the left of the formula bar.
b. The Insert Function dialog box will appear. It will have a very
long list of functions. You will want to narrow this list down, and
you have several options for doing so:
• Search: Enter logarithm in the Search for a function text
box.
• Browse: Select Math & Trig from the Or select a category
drop-down box.
c. Once you have narrowed the list down, it is easy to find the function you want (LN). Select LN from the Select a function list box, and then select OK. You will now see the Function Arguments dialog box for the LN function.
4. As you can see, the LN function takes one argument (the number you want
the log of). You want to take the log of cell C2, so enter the text C2 here
or click on cell C2.
• the dialog box will display the value in cell C2 (3588700) as well as
the calculated value for the log of cell C2 (15.0933058).

5. Select OK.

You will see that cell K2 now contains =LN(C2) which displays as 15.0933. Note
that if you already knew the function and arguments you needed, you could have
just typed =LN(C2) into cell K2 instead.

2.4.6 Cell ranges

Some functions like SUM() and AVERAGE() operate on a range of cells rather
than a single cell. A range is just a rectangular set of cells, and is described by
its upper-left and lower-right cells, separated by a colon (“:”). For example:

• Range A2:A5 consists of cells A2, A3, A4 and A5.


• Range A2:C2 consists of cells A2, B2, and C2.
• Range A2:B3 consists of cells A2, B2, A3, and B3.

A single cell can also be thought of as a range of cells with one row and one
column.
Example 2.16. Total population
Suppose we want to create a new variable that reports the total population
across all observations in the data. The function to do that is SUM().

1. Enter TotPop in cell L1 to give the variable a name.


2. Enter =SUM(C2:C11) in cell L2.

Cell L2 should display 31,275,600 which is indeed the sum of cells C2 to C11.

2.4.7 Copying formulas

In Excel, you can copy-and-paste the contents of a cell to any other cell. This is
particularly handy when a cell contains a formula, as it would be inconvenient
to type the same formula into each cell.
Example 2.17. Copying a formula
To use copy-and-paste to copy the formula in cell H2 to the other cells in column
H:

1. Select cell H2 and copy it.


2. Select cells H3:H11 and paste.

You can also use fill for this purpose, and it is usually quicker.

Example 2.18. Filling a formula


To use fill to copy the formulas in cells I2:L2 to the other cells in columns I
through L:

3. Select cells I2:L11


4. Select Home > Fill > Down

2.4.8 Relative and absolute references

Excel is smart and normally treats cell addresses in formulas as relative references when copying and pasting cells. That is, when a formula is copied to another cell 𝑎 columns to the right and 𝑏 rows down, the column letters in the formula are increased by 𝑎 units, and the row numbers are increased by 𝑏 units. For example, suppose cell B5 contains the formula =A1. If we copy the contents of this cell to other cells, we get:

Cell Formula
B5 =A1
B6 =A2
B7 =A3
C5 =B1
D5 =C1
C6 =B2
D7 =C3

This is usually exactly what we want Excel to do.

Example 2.19. Relative references


Select cell H3 and take a look at the formula bar. You’ll notice that while
the original cell H2 contains =D2+E2, cell H3 actually contains =D3+E3. This is
exactly what we would want; each cell in column H calculates the province’s
labour force by adding together the unemployed and employed counts from the
same province (i.e. same row).

Sometimes we will want Excel to treat cell references as absolute references instead. That is, we want to copy the cell reference exactly as written.

Example 2.20. Total population, part 2


Take a look at the TotPop variable in column L. This variable is supposed to
represent the total Canadian population, but each cell in this column shows a
different number.

Because Excel treats cell references as relative, copying the cell L2 to the rest of column L produces:

• Cell L2 contains =SUM(C2:C11)
• Cell L3 contains =SUM(C3:C12)
• Cell L4 contains =SUM(C4:C13)

but in this case we want all of the cells to contain =SUM(C2:C11).

We can tell Excel to treat a given cell reference as absolute by adding the $
character to the cell reference. For example, suppose that we copy the formula
in cell C2 to cell D3. Then:

If cell C2 contains Then cell D3 contains


=A1 =B2
=$A1 =$A2
=A$1 =B$1
=$A$1 =$A$1

Note that the presence or absence of the $ does not affect the calculation in the
cell, it only affects how the formula is copied over to other cells.
Example 2.21. Total population, part 3
To fix the TotPop variable:

1. Go back to cell L2.


2. Change =SUM(C2:C11) to =SUM(C$2:C$11).
• Notice that the value in this cell does not change.
3. Copy/paste or fill cell L2 into cells L3:L11.
• Notice that all cells in this column display the same contents
(=SUM(C$2:C$11)) and value (31,275,600)

This is what we want.

Sometimes we will want to combine absolute and relative references in the same
formula. This typically will happen when we want to compare the current
observation to the other observations.
Example 2.22. Population rank
Suppose we want to create a new variable that is the province’s population rank.
That is, the province with the highest population has a rank of 1, second highest
has a rank of 2, etc. The function to do that is RANK.EQ(), which takes two
arguments: the value to rank, and the list of values to use for the ranking. We
want the first argument (the province’s population) to vary across provinces,
but we want the list of values (the populations of all of the provinces) to stay
the same.

1. Enter PopRank in cell M1 to give the variable a name.


2. Select cell M2.
3. Use the Insert Function tool to access the arguments for RANK.EQ().
Enter the appropriate arguments and select OK:
• Number is the number we wish to rank: enter C2 for Alberta’s population.
• Ref is the set of values we want to rank within: enter the range
C$2:C$11 for the list of all provinces’ populations.
• Order is an optional argument: leave it blank.
4. Copy/paste or fill cell M2 into cells M3:M11.

As you can see, Excel displays the correct ranks. We can check this by sorting
on Population and seeing if PopRank is also sorted.
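Equivalently, if you already know the syntax, you can skip the dialog box and type the formula directly into cell M2 (this simply restates the arguments entered above):

    =RANK.EQ(C2,C$2:C$11)

The absolute row references in the second argument keep the ranking range fixed when the formula is copied down the column.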

FYI
Advanced options
The RANK.EQ() function has several relatives with similar syntax:

• The RANK() and RANK.AVG() functions also return the rank, but
use slightly different rules for handling ties.
• The PERCENTRANK(), PERCENTRANK.EXC() and
PERCENTRANK.INC() functions return ranks in percentiles.

2.5 Data types


Most of the data we work with is numeric. However, there are other types of
information we regularly encounter:

• Text
• Logical (true/false) values
• Dates and times

Excel has various tools for entering, storing and processing these data types.

2.5.1 Numeric data

Most modern computer applications, including Excel, do numeric calculations in double-precision (64-bit) binary floating point format. Calculations in this format are typically accurate up to the 15th decimal place.

But the results of these calculations are typically displayed to fewer decimal
places than that. A cell’s contents are distinct from its numeric display
format, which is how it appears on the screen. We can change the display
format of a cell or group of cells without changing its contents.

Example 2.23. Displaying unemployment and LFP rates as percentages
Both UnempRate and LFPRate are calculated as proportions (a number
between 0 and 1). But we might prefer to display them as percentages (a
number between 0 and 100).
We could change the formulas (for example, change the LFPRate formula to
=100*H2/C2) or we could just change the cells’ display format:

1. Select cells I2:J11.


2. Select Home > Number Format (it is a little drop down box that currently
displays General), and then select Percentage.

You will see that the cell now displays the rates in percentage, to two decimal
places. I think it would be more readable if we round to just one decimal place.
To do that, select Home > Decrease Decimal.

An important thing to understand here: all we have done is change how the
numbers are displayed. If we do any calculations with these cells, the calculation
will use the original proportional value without rounding.

2.5.2 Text data

Text data is also called character data or string data. In Excel, text values
can be entered directly in a cell, can be used in a formula, and can be the result
of a formula.
Excel has many functions for working with text data. A particularly useful one
is the CONCAT() function, which allows you to join or concatenate two or more
strings. This is useful in building reports, in constructing ID variables, and in
many other applications.

Example 2.24. Using CONCAT()


We can use the CONCAT() function to create a sentence that describes our data,
as if we were writing a report.

1. Enter Description in cell N1 to name the variable


2. Enter =CONCAT(C2," people live in ",B2) in cell N2.
• The cell should display 3588700 people live in Alberta

3. Copy/paste/fill in cells N3:N11.

As you can see, CONCAT() is useful for converting data into human-readable
statements. It is also useful for creating ID variables.

FYI
Advanced options
There are many useful functions to manipulate text strings in Excel:

• LEN() calculates the length (number of characters) of a string. For


example:
– =LEN("Hello world!") returns the value of 12.
• MID() allows you to extract part of a string. For example
– =MID("Hello world!",2,3) returns a value of “ell”.

• UPPER(), LOWER() and PROPER() allow you to change the case of


a string. For example:
– =UPPER("Hello world!") returns “HELLO WORLD!”,
– =LOWER("Hello world!") returns “hello world!”, and
– =PROPER("Hello world!") returns “Hello World!”.
• FIND() allows you to find a particular substring within a larger
string, and
– =FIND("world!","Hello world!") returns 7
• REPLACE() allows you to replace part of a string.
– =REPLACE("Hello world!",7,5,"mom") returns “Hello
mom!”

2.5.3 Logical data

In addition to text and numbers, cells can also contain logical values (TRUE or
FALSE). Logical values can be entered directly in a cell, can be used in a formula,
and can be the result of a formula.
Mathematical expressions using the comparison operators = (equal), <> (not
equal), > (greater than), < (less than), >= (greater than or equal), and <= (less
than or equal) can be used to create logical values.
Example 2.25. Creating a logical variable
To create a logical variable that indicates whether a province has a labour force
participation rate below 64%.

1. Enter LowLFP in cell O1 to name the variable.


2. Enter =(I2 < 0.64) in cell O2.

• Notice that (I2 < 0.64) is a statement that is either true or false,
not a numeric expression.
• Logical statements can use other comparison operators, including =,
<, >, and <=.

3. Copy/paste or fill cell O2 into cells O3:O11.

As you can see, the cells display TRUE in the three provinces with LFP rates
below 64%, and FALSE in the other seven.

Logical values are particularly powerful in combination with the IF() function.
This function takes three arguments:

• a statement
• a value to return if the statement is true
• a value to return if the statement is false

Example 2.26. Using IF() to create an indicator variable

An indicator variable is the numerical version of a logical variable: it takes on


the value 1 if a particular statement is true, and 0 if the statement is false.

To create an indicator variable for low labour force participation:

1. Enter LowLFPInd in cell P1 to name the variable.


2. Enter =IF(I2 < 0.64,1,0) in cell P2.
3. Copy/paste or fill cell P2 into cells P3:P11.

As you can see, the cells display 1 in the three provinces with LFP rates be-
low 64%, and 0 in the other seven. When cleaning data we will typically use
indicator variables rather than logical variables.

FYI
Advanced options
Some additional functions that work with logical variables:

• NOT() returns TRUE if its argument is FALSE and FALSE if its argu-
ment is TRUE.
• AND() returns TRUE if all of its arguments are TRUE.
• OR() returns TRUE if any of its arguments are TRUE.
• SWITCH() and IFS() are extensions of IF() that take multiple
conditions.

2.5.4 Dates and times

In addition to numbers, text, and logical values, Excel can handle dates and
times. Dates and times are a surprisingly complex subject that can create all
sorts of problems on a computer, for several reasons:

• There are many ways of writing the same date:


– November 1, 2020
– 2020 November 1
– 14 Heshvan 5781 (Hebrew calendar)
– 11/1/2020
– 11/1/20
– 11-1-2020
– 2020/11/1
– etc.
• Customs vary across cultures and organizations,
– in some places 11/1 means November 1 and in others it means Jan-
uary 11.
• We want to be able to sort and rank
– January 10, 2020 comes before December 10, 2020.
• We want to be able to add and subtract
– January 4, 2021 comes 5 days after December 30, 2020.
• We want to handle time zones, daylight savings time, leap years, and even
leap seconds.
• We want to do this in a way that is perfectly accurate, but hides all of
these complexities in normal usage.

Each application has its own rules for handling dates and times, though there
are some standards that have developed. Excel handles these issues as follows:

1. Dates are stored as the number of days elapsed since some base date. In
Excel the base date is January 1, 1900, which means that
• January 1, 1900 is day 1.
• January 2, 1900 is day 2.
• November 1, 2020 is day 44136.
2. Dates are displayed according to the cell’s display formatting.
• The default display formatting varies across regions, so the same date
in the same Excel file might appear different on your computer and
my computer.

3. When you enter something that looks like a date, Excel does several things
behind the scenes:
• it guesses the date format for what you have entered
• it converts what you have entered to the internal storage form (days
since base date).
• it changes the display format to what Excel thinks it should be (based
on your location settings).

Most of the time, this system works seamlessly and you don’t even notice it.
But it can cause problems, and understanding the underlying structure can help
you to solve those problems.
Example 2.27. How dates are stored and displayed
The MonthYr variable in column G is a date.

1. Select cell G2. Notice that (at least on my computer):


• The cell displays Nov-20
• The formula bar displays 11/1/2020.
2. To see how Excel sees this date, change the cell’s number format from
Custom (on your computer it might be Date) to General
• Now the cell displays 44136.

Now while a date of 44136 is quite clear to Excel, we want to display dates in a
more human-readable way. I don’t like the Nov-20 display format, since it isn’t
obvious whether that means November 2020 or November 20. So let’s change
the formatting:

3. Select cells G2:G11.


4. Select Number Format > Short Date.

You can see even more options if you select More number formats

We can also do calculations with dates, and there are various functions using
dates.
Example 2.28. Some date calculations
To calculate how long ago November 2020 was:

1. Enter Today in cell Q1 to name the variable


2. Enter =TODAY() in cell Q2 to put in today’s date.
• This will display today’s date.

• Tomorrow, the cell will display tomorrow’s date.

3. Enter HowLongAgo in cell R1.


4. Enter =Q2-G2 in cell R2.

• This will display the number of days that have passed between
November 1, 2020 and today.

5. If you have not already done so, save your data file.

Most of the time, Excel’s handling of dates works seamlessly and is very clever.
But sometimes Excel guesses wrong, and this can create all sorts of problems.

FYI
Excel dates and genetics
Excel’s handling of dates caused a significant unanticipated problem in
the field of human genetics, where it is a widely used tool.
Each gene has a standard abbreviation like “TCEA1” or “CTCF” as-
signed by a scientific body called the HUGO Gene Nomenclature Com-
mittee (HGNC). Unfortunately, 27 of these genes have abbreviations
that Excel misinterprets as dates, for example “Membrane Associated
Ring-CH-Type Finger 2,” also known as “MARCH2”. If you enter the
text “MARCH2” in an Excel cell, Excel will automatically convert it to
the date of March 2 in the current year. A 2016 research paper found
that roughly 20% of published research articles in the field used data
that was affected by this problem.
Unfortunately, it is too late to “fix” Excel to keep this from happening.
Any change to its behavior would “break” Excel in thousands of other
applications that rely on its current behavior.
When you can’t fix a problem in a computer application, you need to
find a workaround: a modification to how you use the application that
avoids or minimizes the effects of the problem. So the HGNC changed
the names of these 27 genes in 2020. For example, the gene MARCH2 is
now called MARCHF2.

In addition to dates, Excel can also handle date-time values such as 11/1/2020
12:00:00 PM.

FYI
How Excel handles times
Excel treats times as partial days. For example, Excel will store
11/1/2020 12:00:00 PM as day number 44136.5 and 11/1/2020
1:00:00 PM as day number 44136.5416666667.
There are also functions that work with date-time values. For example,
we have already seen the function TODAY(), which returns the current
date, but there is also a function NOW() that returns the current date
and time.

2.6 Version control


The last step in cleaning data is to save your work. In fact you should be saving it
regularly so that you don’t lose work if something goes wrong. If you have not al-
ready done so, save your file now. You can compare it to my version of the file at
https://bookdown.org/bkrauth/BOOK/sampledata/CanEmpNov20Clean.xlsx
In a simple setting, this is all you need to do. But as your projects get more
complex, you may find yourself keeping multiple versions of your data file. There
are many reasons to do this:

• Maybe you are trying something new, and are keeping an earlier version
in case something goes wrong.
• Maybe you did an analysis a few weeks ago that you are no longer using,
but you don’t want to throw it away in case you change your mind.
• Maybe you and a classmate are working on a project together, and you
have each made changes to separate copies of the original file.

You will want to use some form of version control here, with a goal of keeping
everything you might need without making mistakes or spending a lot of time on
it. Version control is an important element in making your analysis reproducible.
Software developers and professional data analysts (like me) use a formal version
control system like Git/GitHub. For our purposes, we can just follow a few
simple rules:

1. The working copy is the file you are actively working on right now.
2. The master version is the most recently saved file that is “complete.”
• You should try to work in small and discrete projects so that you can
– Make a working copy of the master version.
– Complete a project in your working copy.
– Make this working copy the new master version.

• There should always be exactly one master version of a given document.
3. Archived versions are earlier master versions (or occasionally working
copies) that you wish to save.
• You may choose to save every previous master version.
• Or you may choose to save just a few important ones.
• Use some consistent naming convention to distinguish between
archived versions. For example, I add a date code to the end like
“210907” which means that this file was the master version as of
September 7, 2021.

At this level, you do not normally need to keep archived versions. But you
should at least distinguish between your master version and working copy.

Chapter review

Data cleaning is among the most important practical skills one can develop
in applied statistical analysis. Simple statistical methods like averages and
frequencies are all most people will ever use, but everyone who works with data
regularly encounters complex and messy data.
In this chapter we have learned some important data cleaning concepts, in-
cluding reproducible research, tidy data, ID variables, and version control. We
have also learned how to implement these concepts in Excel using tools such as
fill/series, sorting, formatting, formulas, and functions.
We will soon use Excel to do some basic statistical analysis and graphing us-
ing our cleaned data. Later on, we will learn more advanced data cleaning
concepts such as linking, aggregating, error validation/handling, importing and
exporting, as well as how to implement them in both Excel and R.

Practice problems

Each chapter will include a few simple practice problems to help you check your
knowledge. They are organized by the specific skill or area of knowledge you
are practicing.
Answers can be found in the appendix.
SKILL #1: Identify features of tidy data

1. Which of the following tables shows a tidy data set?



a.

PersonID    101     102
Name        Bob     Joe
Age         30      35
Occupation  Chef    Waiter

b.

Name  Age  Occupation
Bob   30   Chef
Joe   35   Waiter

c.

Variable     Value
Name         Bob
Age          30
Occupation   Chef
Name         Joe
Age          35
Occupation   Waiter

SKILL #2: Identify features of ID variables

2. Which of the following characteristics is necessary for a variable to function
as an ID variable?
a. It takes on a different value for each observation.
b. Each value it can take on has an associated observation.
c. It can take only one value.

SKILL #3: Use absolute and relative references in Excel

3. For each of the following formulas, suppose we copy that from cell C12 to
cell E15. What formula appears in cell E15?
a. =B2
b. =$B$2
c. =$B2
d. =B$2
e. =SUM(B2:B10)

f. =SUM($B$2:$B$10)
g. =SUM($B2,$B10)
h. =SUM(B$2,B$10)

SKILL #4: Construct Excel formulas

4. Construct a formula to do each of the following.


a. Find the square root of the number in cell A2.
b. Find the lowest value in cells A2:A100.
c. Find the absolute value of the difference between cells A2 and B2.

SKILL #5: Use logical and text data in Excel

5. Construct an Excel formula to do each of the following.


a. Display “Reject” if cell A2 contains a number less than 0.05, and
display “Fail to reject” otherwise.
b. Display the text “A2 =” followed by the value in cell A2.
c. Display the first two letters of the text in cell A2.

SKILL #6: Use dates in Excel

6. Construct an Excel formula to do each of the following:


a. Display the current month.
b. Display the date 100 days from today.
c. Display the number of days since your birth.
Chapter 3

Probability and random events

Probability is a method of mathematically modeling a random process so that


we can understand it and/or make predictions about its future results. Proba-
bility is an essential tool for casinos, as well as for banks, insurance companies,
and any other businesses that manage risks.

Goals
Chapter goals
In this chapter we will learn how to:

• Model random events using the tools of probability


• Calculate and interpret marginal, joint, and conditional probabili-
ties
• Interpret and use the assumptions of independence and equal out-
come probability

This chapter uses mathematical notation and terminology that you have seen
before but may need to review. If you have difficulty with the math, please refer
to the sections on Sets and on Functions in the Math Review appendix.

Example 3.1. Example application: Roulette

We will develop ideas by considering the casino game of Roulette. The picture
below shows what a roulette wheel looks like.


[Image: a roulette wheel and betting table. Source: Roulette Vectors by Vecteezy]


Here are the rules:

• It features
– a ball.
– a spinning wheel with numbered/colored slots.
– a table on which to place bets
• The slots are numbered from 0 to 36
– Slot number 0 is green
– 18 slots are red
– 18 slots are black.
– The picture above depicts an American roulette table, which has an
additional green slot labeled “00”,
– I will assume we have a European roulette table, which does not
include the “00” slot.
• Players can place various bets on the table including:
– Red (ball lands in a red slot) pays $1 per $1 bet
– Black (ball lands in a black slot) pays $1 per $1 bet
– A straight bet on any specific number (ball lands on that number)
pays $35 per $1 bet

Like other casino games, a roulette game is an example of a random process.


Something will happen; it matters (to the players and the casino) what will
happen; and we don’t know in advance what will happen.

3.1 Outcomes and events


To build a probabilistic model of a random process, we start by defining the
outcome we are interested in. An outcome can be a simple yes/no result, it can
be a number, or it can be a much more complex object. The outcome should
be a complete description of the random process, in the sense that everything
we are interested in can be defined in terms of the outcome.
Example 3.2. Outcomes in roulette
The outcome of a single game of roulette can be defined as the number of the
slot in which the ball lands. Call that number 𝑏.

The set of all possible outcomes is called the sample space.


Example 3.3. The sample space in roulette
The sample space for a game of roulette can be defined as the set of all numbers
the ball can land on:
Ω = {0, 1, 2, … , 36}
This sample space has |Ω| = 37 elements.

Next, we define a set of events that we are interested in. We can think of an
event as either:

• A statement that is either true or false OR


• A subset of the sample space

These two concepts are equivalent, though the subset concept makes the math
clearer.
Example 3.4. Events in roulette
These roulette events are well-defined for our sample space:

• Ball lands on 14:


𝑏 ∈ {14}
• Ball lands on red:

𝑏 ∈ 𝑅𝑒𝑑 = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}   (3.1)

• Ball lands on black:

𝑏 ∈ 𝐵𝑙𝑎𝑐𝑘 = {2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35}   (3.2)

• Ball lands on one of the first 12 numbers:


𝑏 ∈ 𝐹 𝑖𝑟𝑠𝑡12 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

We could define many more events, depending on what bets we are interested
in.

An event that contains only one outcome is called an elementary event.


Since events are sets, we can use the terminology and mathematical tools for
sets.
Example 3.5. Relationships among events
In our roulette example:

• Two events are identical (𝐴 = 𝐵) if they contain exactly the same out-
comes:
– The event “ball lands on 14” and “a bet on 14 wins” are identical
since {14} = {14}.
– Intuitively, identical means they are just two different ways of de-
scribing the same event.
• An event implies another event (𝐴 ⊂ 𝐵) if all of its outcomes are also in
the implied event
– The event “ball lands on 14” implies the event “ball lands on red”
since {14} ⊂ 𝑅𝑒𝑑.
– When an event happens, any event it implies also happens.
• Two events are disjoint (𝐴 ∩ 𝐵 = ∅) if they share no outcomes:
– The events “ball lands on red” and “ball lands on black” are disjoint
since 𝑅𝑒𝑑 ∩ 𝐵𝑙𝑎𝑐𝑘 = ∅.
– If two events are disjoint, they cannot both happen.
– But they can both fail to happen. For example, if the ball lands in
the green zero slot (𝑏 = 0), neither red nor black wins.
• Any two elementary events are either identical or disjoint
– The events “ball lands on 14” and “ball lands on 25” are disjoint
since {14} ∩ {25} = ∅.

3.2 Probabilities
Our final step is to define a probability distribution for this random process,
which is a function that assigns a number to each possible event. The number
is called the event’s probability.
Probabilities are normally between zero and one:

• If an event has probability zero, it definitely will not happen


• If an event has probability strictly between zero and one, it might happen.
• If an event has probability one, it definitely will happen.

3.2.1 The axioms of probability

All valid probability distributions must obey the following three conditions,
which are sometimes called the axioms of probability.

1. Probabilities are never negative:

Pr(𝐴) ≥ 0

2. One of the outcomes will definitely happen:

Pr(Ω) = 1

3. For any two disjoint events 𝐴 and 𝐵, the probability that 𝐴 or 𝐵 happen
is the sum of their individual probabilities:

Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵)

Probability distributions have many other properties, but they can all be derived
from these three axioms.

Example 3.6. Outcome probabilities for a fair roulette game


Let’s assume that the roulette wheel is “fair” in the sense that each outcome
has the same probability. Now, I should emphasize that this doesn’t have to be
the case, it’s just an assumption. But it’s a reasonable one in this case because
casinos are required by law to run fair roulette wheels and would be subject
to heavy penalties if they run unfair wheels. Later on, we will use statistics to
confirm that a roulette wheel is fair.
Call that probability 𝑝:

𝑝 = Pr(𝑏 = 0) = Pr(𝑏 = 1) = ⋯ = Pr(𝑏 = 36)

To find the value of 𝑝 we use the rules of probability. By rule #2 of probability,


one of the outcomes will happen:

Pr(Ω) = 1

Since the different outcomes are disjoint, rule #3 implies that:

Pr(Ω) = Pr({0}) + Pr({1}) + ⋯ + Pr({36}) = 𝑝 + 𝑝 + ⋯ + 𝑝

Summarizing this equation:


1 = 37𝑝

Solving for 𝑝 we get:


𝑝 = 1/37 ≈ 0.027

That is, each of the 37 elementary events has a probability of 1/37.

Since this is an introductory course, our sample space will usually contain a
finite number of outcomes, as in our roulette example. In that case, probability
calculations are pretty simple:

• Find the probability of each elementary event.


• To find the probability of a specific event, just add up the probabilities of
its elementary events.

Example 3.7. Event probabilities for a fair roulette game

In the roulette example, the probability of any event 𝐴 is just the number of
outcomes in 𝐴 times the probability of each outcome 1/37:

Pr(𝐴) = |𝐴| ∗ 1/37

The notation |𝐴| just means the size of (number of elements in) the set 𝐴.

For example:
Pr(𝑏 = 25) = |{25}| ∗ 1/37 = 1/37 ≈ 0.027

Pr(𝑅𝑒𝑑) = |𝑅𝑒𝑑| ∗ 1/37 = 18/37 ≈ 0.486

Pr(𝐸𝑣𝑒𝑛) = |𝐸𝑣𝑒𝑛| ∗ 1/37 = 18/37 ≈ 0.486

Pr(𝐹 𝑖𝑟𝑠𝑡12) = |𝐹 𝑖𝑟𝑠𝑡12| ∗ 1/37 = 12/37 ≈ 0.324
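
These finite-sample-space calculations are easy to reproduce on a computer. Here is a minimal sketch in R (the language we will use for data analysis later in the book); the object names omega and red are my own, not standard:

omega <- 0:36                  # the 37 slots on a European wheel
red <- c(1, 3, 5, 7, 9, 12, 14, 16, 18,
         19, 21, 23, 25, 27, 30, 32, 34, 36)

# each elementary event has probability 1/37, so Pr(A) = |A| * (1/37)
length(red) / length(omega)    # Pr(Red)     = 18/37, about 0.486
12 / length(omega)             # Pr(First12) = 12/37, about 0.324
1 / length(omega)              # Pr(b = 25)  =  1/37, about 0.027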

However, not all sample spaces contain a finite number of outcomes. For exam-
ple, suppose we are interested in using probability to model the unemployment
rate, or a person’s income. Those are real numbers, and can take on any of
an infinite number of values. This adds a few complications, and is the reason
that the probability axioms refer to events (sets of outcomes) and not individual
outcomes.

FYI
What do probabilities really mean?
What does it really mean to say that the probability of the ball landing
in a red slot is about 0.486? That’s actually a tough question. There are
two standard interpretations for probabilities:

• Frequentist or classical interpretation: we are thinking of the ran-


dom process as something that could be repeated many times, and
the probability of an event is the approximate fraction of times
that the event will occur. That is, if you go to a casino and bet
1000 times on Red, you will win about 486 times.
• Bayesian or subjectivist interpretation: the random process is a
one-time occurrence, but we have limited information about it and
the probability of event represents the strength of our belief that
the event will happen.

The frequentist interpretation of probability is well-suited for simple re-


peated settings like casino games or car insurance, while the Bayesian
interpretation makes more sense for things like predicting election re-
sults.

3.2.2 Additional rules for probabilities

Let 𝐴 and 𝐵 be two events. Then our three axioms of probability imply several
additional rules:

• Probabilities cannot be higher than one.

Pr(𝐴) ≤ 1

• Probabilities of identical events are identical:

𝐴 = 𝐵 ⟹ Pr(𝐴) = Pr(𝐵)

• Probabilities of implied events are larger:

𝐴 ⊂ 𝐵 ⟹ Pr(𝐴) ≤ Pr(𝐵)

• The probability of an event not happening is:

Pr(𝐴𝐶 ) = 1 − Pr(𝐴)

• The probability of nothing happening is:

Pr(∅) = 0

• The probability of either 𝐴 or 𝐵 happening is:

Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 ∩ 𝐵) (3.3)


≤ Pr(𝐴) + Pr(𝐵) (3.4)

These results are not hard to prove, but I will not go through the proofs. How-
ever, I will use these results so you should be familiar with them.

3.3 Joint and conditional probabilities

We are often interested in more than one event, and want to talk about how
they are related. For example:

• In some casino games like poker or blackjack, players take an additional


action after partial information about the outcome is revealed.
• Politicians often use polls to predict the winner of an election.
• Finance people often want to model multiple market scenarios and forecast
a company’s earnings under each of them.
• Economists often have data on current economic conditions and want to
predict future economic conditions.

This section will develop some tools for dealing with the relationship between
different random events.

3.3.1 Joint probabilities

The joint probability of two events 𝐴 and 𝐵 is the probability that they both
happen:
Pr(𝐴 ∩ 𝐵)
Remember that the intersection (∩) of 𝐴 and 𝐵 is the set of all outcomes that
are in both 𝐴 and 𝐵.

Example 3.8. Joint probabilities for roulette bets


Consider two events for a game of roulette:

𝑅𝑒𝑑 = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}   (3.5)

𝐸𝑣𝑒𝑛 = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36}   (3.6)

Suppose you are interested in the probability that the ball lands on a number
that is both red and even. This event is just the intersection of 𝑅𝑒𝑑 and 𝐸𝑣𝑒𝑛
so this joint probability is:

Pr(𝑅𝑒𝑑 ∩ 𝐸𝑣𝑒𝑛) = Pr({12, 14, 16, 18, 30, 32, 34, 36}) (3.7)
= 8/37 (3.8)
≈ 0.216 (3.9)

Joint probabilities are just probabilities, so they obey all of the axioms and rules
of probability described in Section 3.2.

3.3.2 Conditional probabilities

The conditional probability of an event 𝐴 given another event 𝐵 is defined as:

Pr(𝐴|𝐵) = Pr(𝐴 ∩ 𝐵) / Pr(𝐵)
The conditional probability answers the question: if we already know that 𝐵 is
true, what are the chances that 𝐴 is true?
Conditional probabilities are very important when playing poker. At the begin-
ning of the game, every player has equal chance of having a winning hand. But
that is no longer true after you see your cards - having “good” cards increases
your chance of winning, and having “bad” cards decreases that chance. In other
words, your bet should be based on Pr(𝑤𝑖𝑛|𝑐𝑎𝑟𝑑𝑠) rather than Pr(𝑤𝑖𝑛). Good
poker players have detailed knowledge of these conditional probabilities.
Example 3.9. Conditional probabilities in roulette
In our roulette example:

Pr(𝑅𝑒𝑑|𝐸𝑣𝑒𝑛) = Pr(𝑅𝑒𝑑 ∩ 𝐸𝑣𝑒𝑛) / Pr(𝐸𝑣𝑒𝑛) = (8/37) / (18/37) = 8/18 ≈ 0.444

Pr(𝑏 = 14|𝐸𝑣𝑒𝑛) = Pr(𝑏 = 14 ∩ 𝐸𝑣𝑒𝑛) / Pr(𝐸𝑣𝑒𝑛) = (1/37) / (18/37) = 1/18 ≈ 0.056

Like joint probabilities, conditional probabilities are just probabilities, so they


obey all of the axioms and rules of probability described in Section 3.2.

3.3.3 Independent events

One common “trick” in modeling joint and conditional probabilities is to assume


that certain events are unrelated to each other. This can simplify the math
significantly.

We say that two events 𝐴 and 𝐵 are independent if their joint probability is
just the two individual probabilities multiplied together:

Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵)

We usually express independence with the notation 𝐴⊥𝐵.


The definition of independence is not very intuitive, but we can clarify it by do-
ing a little math. Consider two independent events 𝐴 and 𝐵 that have nonzero1
probability. Then by the definition of independence:

Pr(𝐴|𝐵) = Pr(𝐴 ∩ 𝐵) / Pr(𝐵) = Pr(𝐴) Pr(𝐵) / Pr(𝐵) = Pr(𝐴)

By the same reasoning:


Pr(𝐵|𝐴) = Pr(𝐵)
In other words, knowing that one of these events is true tells you nothing useful
about whether the other event is true.
When would it be reasonable to assume events are independent? The typi-
cal scenario would be where there is simply no physical or logical relationship
between them, usually due to a separation in time and space.

Example 3.10. Independence across roulette games


We have already shown that events related to a single roulette game are not nec-
essarily independent. But the outcomes/events of two different roulette games
can be reasonably assumed to be independent of one another.
Suppose that I bring $100 to a casino this afternoon for a few games of roulette.
I bet all of my money on Red for the first game.

• If I lose, I am broke and stop playing.


• If I win, I keep all of my money (both my initial bet and my winnings)
on Red for the next spin.

• I keep playing until I run out of money.

After 3 games:

• If Red wins all 3 games, I have 𝑤 = $800.


• Otherwise, I have nothing (𝑤 = $0).
1 You may wonder: if it makes more sense to describe independence in terms of conditional

probabilities, why do we define it in terms of joint probabilities? The key is the require-
ment that the events have nonzero probability. When 𝐵 has zero probability the conditional
probability Pr(𝐴|𝐵) is not well defined since its denominator is zero.

What is the probability of each of these events? Since we can assume that each
game’s outcome is independent, this is an easy problem:

Pr(𝑤 = $800) = Pr(𝑅𝑒𝑑1 ∩ 𝑅𝑒𝑑2 ∩ 𝑅𝑒𝑑3 ) (3.10)


= Pr(𝑅𝑒𝑑1 ) × Pr(𝑅𝑒𝑑2 ) × Pr(𝑅𝑒𝑑3 ) (3.11)
= (18/37) × (18/37) × (18/37) ≈ 0.115 (3.12)
Pr(𝑤 = $0) = 1 − Pr(𝑤 = $800) (3.13)
≈ 0.885 (3.14)

So we have an 11.5% chance of winning big, and an 88.5% chance of going broke.
Very important: equation (3.11) only follows from the previous equation because
we have assumed the events 𝑅𝑒𝑑1 , 𝑅𝑒𝑑2 , and 𝑅𝑒𝑑3 are independent.
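
As a quick numerical check, here is the same calculation in R; it is just arithmetic, but it makes the role of the independence assumption visible, since each game contributes one factor of 18/37:

p_red <- 18/37        # Pr(Red) in a single fair game
p_800 <- p_red^3      # independence lets us multiply across the 3 games
c(p_800, 1 - p_800)   # about 0.115 and 0.885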

When is it not reasonable to assume that events are independent? In almost any
other case. Remember that events are defined in terms of the same underlying
outcome, so they are typically related unless you have some very specific reason
to assume otherwise.
Example 3.11. Independence within a roulette game?
Consider the roulette events “Red wins” and “Even wins”. We earlier showed
that the unconditional probability that Red wins is:

Pr(𝑅𝑒𝑑) = 18/37 ≈ 0.486

The conditional probability that Red wins given that Even wins is:

Pr(𝑅𝑒𝑑|𝐸𝑣𝑒𝑛) = 8/18 ≈ 0.444

Since 0.444 ≠ 0.486, these two events are not independent.

A common mistake by students who are new to probability and statistics is


to take results that only apply under independence and use them when there
is no reason to believe that independence holds. Don’t make this mistake:
independence is an assumption, and one that can easily be incorrect.

3.3.4 Law of total probability

In addition to the results we have already discussed, there are two important
results using conditional probabilities:
The first is the law of total probability which is a rule for determining un-
conditional probabilities from conditional probabilities:

Pr(𝐴) = Pr(𝐴|𝐵) Pr(𝐵) + Pr(𝐴|𝐵𝑐 ) Pr(𝐵𝑐 )



The law of total probability allows us to create a set of scenarios, calculate


probabilities under each scenario, and then add them up. It is useful when we
are modeling random outcomes that occur in multiple stages, for example a
poker game or an energy company making a series of investments to develop an
oil field.
Example 3.12. The law of total probability in poker
Suppose you are playing Texas hold’em poker with a few friends, and the hand
has one card left to deal (the “river”). If the last card has a heart on it (25%
probability) you will have a flush and win the hand with a probability you
estimate to be 90%. If not, you will win with a probability you estimate to be
10%. What are your overall chances of winning?
The answer can be calculated using the law of total probability:
Pr(𝑊𝑖𝑛) = Pr(Win|Hearts) Pr(Hearts) + Pr(Win|not Hearts) Pr(not Hearts)   (3.16)
= 0.9 ∗ 0.25 + 0.1 ∗ 0.75   (3.17)
= 0.3   (3.18)
So you have a 30% chance of winning.
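
A one-line check of this calculation in R, with the probabilities taken from the example above:

p_hearts <- 0.25                                   # Pr(Hearts)
p_win <- 0.90 * p_hearts + 0.10 * (1 - p_hearts)   # law of total probability
p_win                                              # 0.3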

3.3.5 Bayes’ law


The second is Bayes’ law, which is a rule for determining conditional probabilities:

Pr(𝐴|𝐵) = Pr(𝐵|𝐴) Pr(𝐴) / Pr(𝐵)
Bayes’ law is particularly useful in evaluating evidence, because it allows us to
restate one conditional probability in terms of another.
Both the law of total probability and Bayes’ law follow from the definition of
conditional probabilities. They are easy to prove, but I won’t prove them here.
Instead, I will use an example to show how they can be useful.
Example 3.13. False positives in medical testing
When someone is tested for a disease, the test comes back either “positive” (the
person has the disease) or “negative” (the person does not have the disease).
However, no test is perfect. Sometimes people who do not have the disease test
positive (“false positives”) and sometimes people who do have the disease test
negative (“false negative”).
Let the event 𝑇 mean a particular patient tests positive for a disease, and let
the event 𝐷 mean that this patient actually has the disease.
The sensitivity of the test is an infected patient’s probability of testing positive:
Pr(𝑇 |𝐷) = 𝑝

the specificity of the test is a healthy patient’s probability of testing negative:

Pr(𝑇 𝑐 |𝐷𝑐 ) = 𝑞

and the prevalence of the infection is the probability that a given patient has
the disease:
Pr(𝐷) = 𝑑
Suppose that a patient has tested positive. What is the probability that he has
the disease, i.e. what the value of Pr(𝐷|𝑇 )?
This is a classic probability question, as it makes use of Bayes’ law and the law
of total probability, and it has obvious practical usage.
Since we want a conditional probability, we start by stating Bayes’ law:

Pr(𝐷|𝑇 ) = Pr(𝑇 |𝐷) Pr(𝐷) / Pr(𝑇 )

Bayes’ law will allow us to calculate Pr(𝐷|𝑇 ) if we can find the components
of the right side of this equation. We already know that Pr(𝑇 |𝐷) = 𝑝 and
Pr(𝐷) = 𝑑, so all we need is to find Pr(𝑇 ).
Since Pr(𝑇 ) is an unconditional probability, we can use the law of total proba-
bility:
Pr(𝑇 ) = Pr(𝑇 |𝐷) Pr(𝐷) + Pr(𝑇 |𝐷𝑐 ) Pr(𝐷𝑐 ) = 𝑝𝑑 + (1 − 𝑞)(1 − 𝑑)

Plugging these results into our formula we get:

Pr(𝐷|𝑇 ) = 𝑝𝑑 / (𝑝𝑑 + (1 − 𝑞)(1 − 𝑑))

which is the result we need.


Now, let’s try out some numbers. Suppose that false positives are rare (𝑞 =
0.99), and false negatives never happen (𝑝 = 1).

• Suppose the disease itself is fairly common (𝑑 = 0.10). Then:

Pr(𝐷|𝑇 ) = (1 ∗ 0.1) / (1 ∗ 0.1 + (1 − 0.99) ∗ (1 − 0.1)) ≈ 0.917

• Suppose the disease itself is quite rare (𝑑 = 0.001). Then

Pr(𝐷|𝑇 ) = (1 ∗ 0.001) / (1 ∗ 0.001 + (1 − 0.99) ∗ (1 − 0.001)) ≈ 0.091

In other words, the exact same test has a very different meaning depending on
the prevalence in the population: when the disease is common a positive test

means a 91.7% chance of having the disease, and when the disease is rare a
positive test result means a 9.1% chance of having the disease.
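
Since the formula depends on three inputs, it is convenient to wrap it in a small function and try different prevalences. Here is a sketch in R; the function name post_prob is mine, not standard:

# Pr(D|T) from Bayes' law plus the law of total probability
post_prob <- function(p, q, d) {
  # p = sensitivity Pr(T|D), q = specificity, d = prevalence
  p * d / (p * d + (1 - q) * (1 - d))
}
post_prob(p = 1, q = 0.99, d = 0.10)    # common disease: about 0.917
post_prob(p = 1, q = 0.99, d = 0.001)   # rare disease:   about 0.091
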
This general issue (even a small false positive rate can have a big impact when
prevalence is low) appeared repeatedly in March and April of 2020. Several
studies by well-known researchers2 dramatically overestimated the early preva-
lence of the COVID-19 virus and thus dramatically underestimated its fatality
rate. These studies were regularly cited as support by those who wanted to
substantially relax public health restrictions in April 2020, and had substantial
real world consequences.

Chapter review
In this chapter we have learned the basic terminology and concepts of probabil-
ity. You may have seen a number of these terms and ideas in high school, but
we are approaching them at a higher level. Be sure to review these terms and
concepts in detail, and do the practice problems to test your knowledge.
Our next step is to take our general framework of outcomes and events, and
apply them to random variables - outcomes that are specifically numerical.

Practice problems
Answers can be found in the appendix.
Most of these practice problems will be based on the casino game of craps.
Craps is played with a pair of 6-sided dice.
Players take turns rolling the dice, and the player currently rolling the dice is
called the “shooter”. There are various bets - pass, don’t pass, come, don’t
come, field, place, buy - that can be placed on the results of multiple rolls of
the dice. These bets and their probability calculations can be quite complex, so
we will focus on “single roll” bets.

• A bet on “Snake Eyes” wins if the total showing on the dice is 2.


• A bet on “Yo” wins if the total showing on the dice is 11.
• A bet on “Boxcars” wins if the total showing on the dice is 12.
• A bet on “Field” wins if the total showing on the dice is 2, 3, 4, 9, 10, 11,
or 12.

For this example, assume that


2 If you are interested in learning more about this, an article in Science provides an overview

of the controversy, and a blog post by statistician Andrew Gelman provides a thorough dis-
cussion of the statistical issues.

• One die is red and the other is white.


• Both dice are fair, that is each side has equal probability
• The dice are independent of one another

An outcome for a single roll of the dice is a pair of numbers (𝑟, 𝑤) where 𝑟 is
the amount showing on the red die, and 𝑤 is the amount showing on the white
die. For example an outcome (2, 4) means that the red die is showing 2 and the
white die is showing 4.
SKILL #1: Define outcomes and sample space for a simple example

1. Let Ω be the sample space for the outcome of a single roll in craps.
a. Define Ω by enumeration.
b. Find the cardinality of Ω.
2. Using enumeration, define the following events:
a. Yo wins
b. Snake eyes wins
c. Boxcars wins
d. Field wins

SKILL #2: Use set theory to work with events

3. Which of the following statements are true?


a. The events “Yo wins” and “Boxcars wins” are identical.
b. The events “Yo wins” and (𝑟, 𝑤) = (5, 6) are identical.
c. The events “Boxcars wins” and (𝑟, 𝑤) = (6, 6) are identical.
4. Which of the following statements are true?
a. The events “Yo wins” and “Boxcars wins” are disjoint.
b. The events “Yo wins” and “Field wins” are disjoint.
c. The events “Yo wins” and “Boxcars loses” are disjoint.
d. The events “Yo wins” and “Field loses” are disjoint.
5. Which of the following statements are true?
a. The event “Yo wins” implies the event “Boxcars wins”.
b. The event “Yo wins” implies the event “Boxcars loses”.
c. The event “Yo wins” implies the event “Field wins”.
d. The event “Yo wins” implies the event “Field loses”.
6. Which of the following are elementary events?
a. Yo wins.
b. Yo loses.
c. Boxcars wins.

d. Boxcars loses.
e. Field wins.
f. Field loses.

SKILL #3: Calculate event probabilities from elementary event prob-


abilities

7. Calculate each of the following elementary event probabilities:


a. (𝑟, 𝑤) = (1, 1)
b. (𝑟, 𝑤) = (3, 4)
c. (𝑟, 𝑤) = (6, 6)
8. Find the probability of each of the following events:
a. A bet on Yo wins.
b. A bet on Snake eyes wins.
c. A bet on Boxcars wins.
d. A bet on Field wins.

SKILL #4: Calculate joint and conditional probabilities

9. Calculate each of the following joint probabilities:


a. Pr(Yo wins ∩ Boxcars wins)
b. Pr(Yo wins ∩ Field wins)
c. Pr(Yo wins ∩ Boxcars loses)
10. Calculate each of the following conditional probabilities:
a. Pr(Yo wins|Boxcars wins)
b. Pr(Yo wins|Field wins)
c. Pr(Yo wins|Boxcars loses)
d. Pr(Field wins|Yo wins)
e. Pr(Boxcars wins|Yo wins)
11. Which of the following pairs of events are independent?
a. Yo wins and Boxcars wins.
b. Yo wins and Field wins.
c. Yo wins and Yo wins.
d. 𝑟 = 3 and 𝑟 = 5.
e. 𝑟 = 3 and 𝑤 = 5.

SKILL #5: Apply the axioms of probability

12. Let 𝐴 be an event. Which of the following statements are true?


a. Pr(𝐴) ≥ 0.

b. Pr(𝐴) > 0.
c. Pr(𝐴) ≤ 1.
d. Pr(𝐴) < 1.
e. Pr(𝐴𝑐 ) ≥ 0.
f. Pr(𝐴𝑐 ) > 0.
g. Pr(𝐴𝑐 ) ≤ 1.
h. Pr(𝐴𝑐 ) < 1.
i. Pr(𝐴𝑐 ) = 1 − Pr(𝐴).
13. Let 𝐴 and 𝐵 be two events. Which of the following statements are true?
a. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵).
b. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 ∩ 𝐵).
c. Pr(𝐴 ∪ 𝐵) ≤ Pr(𝐴) + Pr(𝐵).
d. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵).
14. Let 𝐴 and 𝐵 be two disjoint events. Which of the following statements
are true?
a. Pr(𝐴 ∩ 𝐵) = 0.
b. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) + Pr(𝐵).
c. Pr(𝐴 ∪ 𝐵) = 0.
d. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵).
e. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 ∩ 𝐵).
f. Pr(𝐴 ∪ 𝐵) ≤ Pr(𝐴) + Pr(𝐵).
g. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵).
h. Pr(𝐴|𝐵) = 0
15. Let 𝐴 and 𝐵 be two events such that 𝐴 ⊂ 𝐵. Which of the following
statements are true?
a. Pr(𝐴) ≤ Pr(𝐵)
b. Pr(𝐴 ∩ 𝐵) = Pr(𝐴)
c. Pr(𝐴|𝐵) = 1
16. Let 𝐴 and 𝐵 be two independent events. Which of the following statements
are true?
a. Pr(𝐴 ∩ 𝐵) = 0.
b. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵).
c. Pr(𝐴|𝐵) = Pr(𝐴).
Chapter 4

Introduction to random variables

The previous chapter developed a general framework for modeling random out-
comes and events. This framework can be applied to any set of random out-
comes, no matter how complex.

This chapter develops additional tools for the case when the random outcomes
we are interested in are quantitative, that is, they can be described by a number.
Quantitative outcomes are also called “random variables.”

Goals
Chapter goals
In this chapter we will learn how to:

• Calculate and interpret the CDF and PDF of a discrete random


variable, or several random variables.
• Calculate and interpret the expected value of a discrete random
variable from its PDF.
• Calculate and interpret the variance and standard deviation of a
discrete random variable from its PDF.
• Work with common discrete probability distributions including the
Bernoulli, binomial, and discrete uniform.

The material in this chapter will use some mathematical notation (the summa-
tion operator) that provides a convenient way to represent long sums. Please
review the section on sequences and summations in the math appendix.


4.1 Random variables


A random variable is a number whose value depends on a random outcome.
The idea here is that we are going to use a random variable to describe some
(but not necessarily every) aspect of the outcome.

Example 4.1. Random variables in roulette


Here are a few random variables we could define in a roulette game:

• The original outcome 𝑏.


• An indicator for whether a bet on red wins:

𝑟 = 𝐼(𝑏 ∈ 𝑅𝑒𝑑) = { 1 if 𝑏 ∈ 𝑅𝑒𝑑
                  { 0 if 𝑏 ∉ 𝑅𝑒𝑑

• The net payout from a $1 bet on red:

𝑤𝑟𝑒𝑑 = 𝑤𝑟𝑒𝑑 (𝑏) = { 1 if 𝑏 ∈ 𝑅𝑒𝑑
                  { −1 if 𝑏 ∈ 𝑅𝑒𝑑𝑐

That is, a player who bets $1 on red wins $1 if the ball lands on red and
loses $1 if the ball lands anywhere else.
• The net payout from a $1 bet on 14:

𝑤14 = 𝑤14 (𝑏) = { 35 if 𝑏 = 14
               { −1 if 𝑏 ≠ 14

That is, a player who bets $1 on 14 wins $35 if the ball lands on 14 and
loses $1 if the ball lands anywhere else.

All of these random variables are defined in terms of the underlying outcome.

A random variable is always a function of the original outcome, but for conve-
nience, we usually leave its dependence on the original outcome implicit, and
write it as if it were an ordinary variable.

4.1.1 Implied distribution

A random variable has its own sample space (normally ℝ) and probability dis-
tribution. This probability distribution can be derived from the probability
distribution of the underlying outcome.

Example 4.2. Probability distributions for roulette



• The probability distribution for 𝑏 is:


Pr(𝑏 = 0) = 1/37 ≈ 0.027
Pr(𝑏 = 1) = 1/37 ≈ 0.027
⋮
Pr(𝑏 = 36) = 1/37 ≈ 0.027
All other values of 𝑏 have probability zero.
• The probability distribution for 𝑤𝑟𝑒𝑑 is:
Pr(𝑤𝑟𝑒𝑑 = 1) = Pr(𝑏 ∈ 𝑅𝑒𝑑) = 18/37 ≈ 0.486
Pr(𝑤𝑟𝑒𝑑 = −1) = Pr(𝑏 ∉ 𝑅𝑒𝑑) = 19/37 ≈ 0.514
All other values of 𝑤𝑟𝑒𝑑 have probability zero.
• The probability distribution for 𝑤14 is:
Pr(𝑤14 = 35) = Pr(𝑏 = 14) = 1/37 ≈ 0.027
Pr(𝑤14 = −1) = Pr(𝑏 ≠ 14) = 36/37 ≈ 0.973
All other values of 𝑤14 have probability zero.

Notice that these random variables are related to each other since they all
depend on the same underlying outcome. Section 5.4 will explain how we can
describe and analyze those relationships.

4.1.2 The support


The support of a random variable 𝑥 is the smallest1 set 𝑆𝑥 ⊂ ℝ such that
Pr(𝑥 ∈ 𝑆𝑥 ) = 1.
In plain language, the support is the set of all values in the sample space that
have some chance of actually happening.
Example 4.3. The support in roulette
The support of 𝑏 is 𝑆𝑏 = {0, 1, 2, … , 36}.
The support of 𝑤𝑅𝑒𝑑 is 𝑆𝑅𝑒𝑑 = {−1, 1}.
The support of 𝑤14 is 𝑆14 = {−1, 35}.

The random variables we will consider in this chapter have discrete support.
That is, the support is a set of isolated points each of which has a strictly positive
probability. In most examples the support will also have a finite number of
elements. All finite sets are also discrete, but it is also possible for a discrete
set to have an infinite number of elements. For example, the set of integers is
both discrete and infinite.
Some random variables have a support that is continuous rather than discrete.
Chapter 5 will cover continuous random variables.
1 Technically, it is the smallest closed set, but let’s ignore that for now.

4.1.3 The PDF

We can describe the probability distribution of a random variable with a function


called its probability density function (PDF).

The PDF of a discrete random variable is defined as:

𝑓𝑥 (𝑎) = Pr(𝑥 = 𝑎)

where 𝑎 is any number. By convention, we typically use a lower-case 𝑓 to


represent a PDF, and we use the subscript when needed to clarify which specific
random variable we are talking about.

Example 4.4. The PDF in roulette

Our three random variables are all discrete, and each has its own PDF:

𝑓𝑏 (𝑎) = Pr(𝑏 = 𝑎) = { 1/37 if 𝑎 ∈ {0, 1, … , 36}
                     { 0 if 𝑎 ∉ {0, 1, … , 36}

𝑓𝑟𝑒𝑑 (𝑎) = Pr(𝑤𝑟𝑒𝑑 = 𝑎) = { 19/37 if 𝑎 = −1
                          { 18/37 if 𝑎 = 1
                          { 0 if 𝑎 ∉ {−1, 1}

𝑓14 (𝑎) = Pr(𝑤14 = 𝑎) = { 36/37 if 𝑎 = −1
                        { 1/37 if 𝑎 = 35
                        { 0 if 𝑎 ∉ {−1, 35}

The figure below shows these three PDFs.



[Figure: Probability density functions (PDFs) for the roulette example, plotting 𝑓𝑏 (𝑎), 𝑓𝑟𝑒𝑑 (𝑎), and 𝑓14 (𝑎) against 𝑎.]

We can calculate any probability from the PDF by simple addition. That is:

Pr(𝑥 ∈ 𝐴) = ∑𝑠∈𝑆𝑥 𝑓𝑥 (𝑠) 𝐼(𝑠 ∈ 𝐴)

where 𝐴 ⊂ ℝ is any event defined for 𝑥.²
Example 4.5. Some event probabilities in roulette
Since the outcome in roulette is discrete, we can calculate any event probability
by adding up the probabilities of the event’s outcomes.
The probability of the event 𝑏 ≤ 3 can be calculated:
Pr(𝑏 ≤ 3) = ∑𝑠=0…36 𝑓𝑏 (𝑠) 𝐼(𝑠 ≤ 3)   (4.1)
= 𝑓𝑏 (0) + 𝑓𝑏 (1) + 𝑓𝑏 (2) + 𝑓𝑏 (3)   (4.2)
= 4/37   (4.3)

The probability of the event 𝑏 ∈ 𝐸𝑣𝑒𝑛 can be calculated:

Pr(𝑏 ∈ 𝐸𝑣𝑒𝑛) = ∑𝑠=0…36 𝑓𝑏 (𝑠) 𝐼(𝑠 ∈ 𝐸𝑣𝑒𝑛)   (4.4)
= 𝑓𝑏 (2) + 𝑓𝑏 (4) + ⋯ + 𝑓𝑏 (36)   (4.5)
= 18/37   (4.6)
2 If
you are unfamiliar with the notation here, please refer to Section A.3.3 in the Math
Review Appendix.
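
One way to make these sums concrete is to store the support and the PDF as parallel vectors, so that an event probability is a sum over selected elements. A sketch in R (the object names are mine):

support <- 0:36
f_b <- rep(1/37, 37)           # PDF of b on its support

sum(f_b)                       # sums to 1 over the support
sum(f_b[support <= 3])         # Pr(b <= 3) = 4/37
even <- seq(2, 36, by = 2)     # the Even numbers (0 is excluded)
sum(f_b[support %in% even])    # Pr(b in Even) = 18/37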

The PDF of a discrete random variable has several general properties:

1. It is always between zero and one:

0 ≤ 𝑓𝑥 (𝑎) ≤ 1

since it is a probability.
2. It sums up to one over the support:

∑𝑎∈𝑆𝑥 𝑓𝑥 (𝑎) = Pr(𝑥 ∈ 𝑆𝑥 ) = 1

since the support has probability one by definition.


3. It is strictly positive for all values in the support:

𝑎 ∈ 𝑆𝑥 ⟹ 𝑓𝑥 (𝑎) > 0

since the support is the smallest set that has probability one.

You can confirm that the examples above all satisfy these properties.

4.1.4 The CDF

Another way to describe the probability distribution of a random variable is


with a function called its cumulative distribution function (CDF). The
CDF is a little less intuitive than the PDF, but it has the advantage that it
always has the same definition whether or not the random variable is discrete.
The CDF of the random variable 𝑥 is the function 𝐹𝑥 ∶ ℝ → [0, 1] defined by:

𝐹𝑥 (𝑎) = Pr(𝑥 ≤ 𝑎)

where 𝑎 is any number. By convention, we typically use an upper-case 𝐹 to


indicate a CDF, and we use the subscript to indicate what random variable we
are talking about.
We can construct the CDF of a discrete random variable by just adding up the
PDF:

𝐹𝑥 (𝑎) = Pr(𝑥 ≤ 𝑎)   (4.7)
= ∑𝑠∈𝑆𝑥 𝑓𝑥 (𝑠) 𝐼(𝑠 ≤ 𝑎)   (4.8)

This formula leads to a “stair-step” appearance: the CDF is flat for all values
outside of the support, and then jumps up at all values in the support.

Example 4.6. CDFs for roulette



• The CDF of 𝑏 is:

𝐹𝑏 (𝑎) = { 0 if 𝑎 < 0
         { 1/37 if 0 ≤ 𝑎 < 1
         { 2/37 if 1 ≤ 𝑎 < 2
         { ⋮
         { 36/37 if 35 ≤ 𝑎 < 36
         { 1 if 𝑎 ≥ 36

• The CDF of 𝑤𝑟𝑒𝑑 is:

𝐹𝑟𝑒𝑑 (𝑎) = { 0 if 𝑎 < −1
           { 19/37 if −1 ≤ 𝑎 < 1
           { 1 if 𝑎 ≥ 1

• The CDF of 𝑤14 is:

𝐹14 (𝑎) = { 0 if 𝑎 < −1
          { 36/37 if −1 ≤ 𝑎 < 35
          { 1 if 𝑎 ≥ 35

The CDF has several properties. First, it is non-decreasing. That is, choose any
two numbers 𝑎 and 𝑏 so that 𝑎 ≤ 𝑏. Then

𝐹𝑥 (𝑎) ≤ 𝐹𝑥 (𝑏)

The reason for this is simple: the event 𝑥 ≤ 𝑎 implies the event 𝑥 ≤ 𝑏, so its
probability cannot be higher.
Second, it is a probability, which implies:

0 ≤ 𝐹𝑥 (𝑎) ≤ 1

just like for a discrete PDF.


Third, it runs from zero to one:

lim𝑎→−∞ 𝐹𝑥 (𝑎) = Pr(𝑥 ≤ −∞) = 0

lim𝑎→∞ 𝐹𝑥 (𝑎) = Pr(𝑥 ≤ ∞) = 1

The intuition is simple: all values in the support are between −∞ and ∞.
Example 4.7. CDF properties
Figure 4.1 below graphs the CDFs from the previous example:
Notice that they show all of the general properties described above.

[Figure: the three CDFs 𝐹𝑏 (𝑎), 𝐹𝑟𝑒𝑑 (𝑎), and 𝐹14 (𝑎) plotted against 𝑎.]

Figure 4.1: CDFs for the roulette example

• The CDF never goes down, only goes up or stays the same.
• The CDF runs from zero to one, and never leaves that range.

In addition, all of these CDFs have a distinctive “stair-step” shape, jumping up
at each point in 𝑆𝑥 and flat between those points. This is a general property of
CDFs for discrete random variables.

In addition to constructing the CDF from the PDF, we can also go the other
way, and construct the PDF of a discrete random variable from its CDF. Each
little jump in the CDF is a point in the support, and the size of the jump is
exactly equal to the PDF.

In more formal mathematics, the formula for deriving the PDF of a discrete
random variable from its CDF would be written:

𝑓𝑥 (𝑎) = lim𝜖→0 [𝐹𝑥 (𝑎) − 𝐹𝑥 (𝑎 − |𝜖|)]

but we can just think of it as the size of the jump.

Finally, we can use the CDF to calculate the probability that 𝑥 lies in any

interval. That is, let 𝑎 and 𝑏 be any two numbers such that 𝑎 < 𝑏. Then:

𝐹 (𝑏) − 𝐹 (𝑎) = Pr(𝑥 ≤ 𝑏) − Pr(𝑥 ≤ 𝑎) (4.9)


= Pr((𝑥 ≤ 𝑎) ∪ (𝑎 < 𝑥 ≤ 𝑏)) − Pr(𝑥 ≤ 𝑎) (4.10)
= Pr(𝑥 ≤ 𝑎) + Pr(𝑎 < 𝑥 ≤ 𝑏) − Pr(𝑥 ≤ 𝑎) (4.11)
= Pr(𝑎 < 𝑥 ≤ 𝑏) (4.12)

Notice that we have to be a little careful here to distinguish between the strict
inequality < and the weak inequality ≤, because it is always possible for 𝑥 to
be exactly equal to 𝑎 or 𝑏.

Example 4.8. Calculating interval probabilities

Consider the CDF for 𝑏 derived above. Then:

Pr(𝑏 ≤ 36) = 𝐹𝑏 (36) = 1   (4.13)


Pr(1 < 𝑏 ≤ 36) = 𝐹𝑏 (36) − 𝐹𝑏 (1) (4.14)
= 1 − 2/37 (4.15)
= 35/37 (4.16)
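
For a discrete random variable the CDF is just a running sum of the PDF, so these interval probabilities are easy to verify numerically. A sketch in R:

support <- 0:36
f_b <- rep(1/37, 37)
F_b <- cumsum(f_b)                       # F_b[i] = Pr(b <= support[i])

F_b[support == 36]                       # Pr(b <= 36) = 1
F_b[support == 36] - F_b[support == 1]   # Pr(1 < b <= 36) = 35/37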

4.1.5 Functions of a random variable

Any function of a random variable is also a random variable. So for example,
if 𝑥 is a random variable, so is 𝑥², ln(𝑥), or √𝑥. We can derive the PDF or
CDF of a function of a random variable directly from the PDF or CDF of the
original random variable.

We say that 𝑦 is a linear function of 𝑥 if:

𝑦 = 𝑎𝑥 + 𝑏

where 𝑎 and 𝑏 are constants. We will have many results that apply specifically
for linear functions.

Example 4.9. A linear function in roulette

The net payout from a $1 bet on red (𝑤𝑟𝑒𝑑 ) was earlier defined directly from
the underlying outcome 𝑏. However, we could have also defined it as a linear
function of the random variable 𝑟:

𝑤𝑟𝑒𝑑 = 2𝑟 − 1

That is, 𝑤𝑟𝑒𝑑 = −1 when red loses (𝑟 = 0) and 𝑤𝑟𝑒𝑑 = 1 when red wins (𝑟 = 1).

4.2 The expected value


The expected value of a random variable 𝑥 is written 𝐸(𝑥). When 𝑥 is discrete,
it is defined as:

𝐸(𝑥) = ∑𝑎∈𝑆𝑥 𝑎 Pr(𝑥 = 𝑎) = ∑𝑎∈𝑆𝑥 𝑎 𝑓𝑥 (𝑎)

The expected value is also called the mean, the population mean or the
expectation of the random variable.
The formula might look difficult if you are not used to the notation, but it is
actually quite simple to calculate:

1. Figure out the set of values in the support of 𝑥.


2. Multiply each value in the support by the PDF at that value.
3. Add these all up.

Example 4.10. Some expected values in roulette


The support of 𝑏 is {0, 1, 2, … , 36} so its expected value is:

𝐸(𝑏) = 0 ∗ 𝑓𝑏 (0) + 1 ∗ 𝑓𝑏 (1) + ⋯ + 36 ∗ 𝑓𝑏 (36)   (4.17)
= (0 + 1 + ⋯ + 36)/37 = 18   (4.18)

The support of 𝑟 is {0, 1} so its expected value is:

𝐸(𝑟) = 0 ∗ 𝑓𝑟 (0) + 1 ∗ 𝑓𝑟 (1)   (4.19)
= 0 ∗ (19/37) + 1 ∗ (18/37) = 18/37   (4.20)
≈ 0.486   (4.21)

The support of 𝑤14 is {−1, 35} so its expected value is:

𝐸(𝑤14 ) = −1 ∗ 𝑓14 (−1) + 35 ∗ 𝑓14 (35)   (4.22)
= −1 ∗ (36/37) + 35 ∗ (1/37) = −1/37   (4.23)
≈ −0.027   (4.24)

That is, each dollar bet on 14 leads to an average loss of 2.7 cents for the bettor.
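
Since the expected value is just a probability-weighted sum, each of these calculations is one line of R once the support and PDF are written down. A sketch:

# E(x) = sum over the support of a * f(a)
sum((0:36) * rep(1/37, 37))        # E(b)   = 18
sum(c(0, 1) * c(19/37, 18/37))     # E(r)   = 18/37, about 0.486
sum(c(-1, 35) * c(36/37, 1/37))    # E(w14) = -1/37, about -0.027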

We can think of the expected value as a weighted average of its possible values,
with each value weighted by the probability of observing that value.

Since the expected value is a sum, it has some of the same properties as sums.
In particular, the associative and distributive rules apply, which means that:
𝐸(𝑎 + 𝑏𝑥) = 𝑎 + 𝑏𝐸(𝑥)
That is, we can take the expected value “inside” any linear function. This will
turn out to be a very handy property.
Example 4.11. The expected value of a linear function in roulette
Earlier, we showed that 𝑤𝑟𝑒𝑑 is a linear function of 𝑟:
𝑤𝑟𝑒𝑑 = 2𝑟 − 1
so its expected value is:
𝐸(𝑤𝑟𝑒𝑑 ) = 𝐸(2𝑟 − 1) = 2𝐸(𝑟) − 1 = 2 ∗ (18/37) − 1 = −1/37 ≈ −0.027

We can verify this calculation is correct by deriving the expected value directly
from the PDF:

𝐸(𝑤𝑟𝑒𝑑 ) = −1 ∗ 𝑓𝑟𝑒𝑑 (−1) + 1 ∗ 𝑓𝑟𝑒𝑑 (1) = −1 ∗ (19/37) + 1 ∗ (18/37) = −1/37 ≈ −0.027

That is, each dollar bet on red leads to an average loss of 2.7 cents for the bettor,
as does each dollar bet on 14.

Unfortunately, this handy property applies only to linear functions. If 𝑔(⋅) is a
nonlinear function, then 𝐸(𝑔(𝑥)) ≠ 𝑔(𝐸(𝑥)) in general. For example:

𝐸(𝑥²) ≠ 𝐸(𝑥)²
𝐸(1/𝑥) ≠ 1/𝐸(𝑥)
Students frequently make this mistake, so try to avoid it.
Example 4.12. The expected value of a nonlinear function in roulette
We can define 𝑤14 as a function of the underlying outcome 𝑏:
𝑤14 = 36𝐼(𝑏 = 14) − 1
That is, 𝑤14 = 35 if a bet on 14 wins (𝑏 = 14) and 𝑤14 = −1 otherwise.
Although this is a linear function of the indicator variable 𝐼(𝑏 = 14), it is a
nonlinear function of 𝑏 itself.
If we could take the expected value inside of a nonlinear function we would get:
𝐸(𝑤14 ) = 36 𝐼(𝐸(𝑏) = 14) − 1   (can we do this?)   (4.25)
= 36 𝐼(18 = 14) − 1   (4.26)
= 36 ∗ 0 − 1 = −1   (4.27)
which is clearly wrong since we already calculated that 𝐸(𝑤14 ) = −0.027.
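
The mistake is easy to see numerically: the correct calculation applies the function to each point of the support before averaging, not to the average. A sketch in R:

support <- 0:36
f_b <- rep(1/37, 37)
g <- function(b) 36 * (b == 14) - 1   # w14 as a function of b

sum(g(support) * f_b)                 # E(g(b)) = -1/37, about -0.027
g(sum(support * f_b))                 # g(E(b)) = g(18) = -1: wrong!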

4.3 Other characteristics

The expected value is one way of describing something about a random variable,
but there are many others. We will describe a few of the most important ones.

4.3.1 Range

The range of a random variable is the interval from its lowest possible value
min(𝑆𝑥 ) to its highest possible value max(𝑆𝑥 ).

Example 4.13. The range in roulette

The support of 𝑤𝑟𝑒𝑑 is {−1, 1} so its range is [−1, 1].

The support of 𝑤14 is {−1, 35} so its range is [−1, 35].

The support of 𝑏 is {0, 1, 2, … , 36} so its range is [0, 36].

4.3.2 Quantiles and percentiles

Let 𝑞 be any number between zero and one. Then the 𝑞 quantile of a random
variable 𝑥 is defined as:

𝐹𝑥⁻¹(𝑞) = min{𝑎 ∶ Pr(𝑥 ≤ 𝑎) ≥ 𝑞} = min{𝑎 ∶ 𝐹𝑥(𝑎) ≥ 𝑞}

where 𝐹𝑥(⋅) is the CDF of 𝑥. The quantile function 𝐹𝑥⁻¹(⋅) is also called the inverse CDF.

The 𝑞 quantile of a distribution is also called the 100𝑞 percentile; for example
the 0.25 quantile of 𝑥 is also called the 25th percentile of 𝑥.

Example 4.14. Quantiles in roulette

The CDF of 𝑤𝑟𝑒𝑑 is:

𝐹𝑟𝑒𝑑(𝑎) = 0 if 𝑎 < −1, 0.514 if −1 ≤ 𝑎 < 1, and 1 if 𝑎 ≥ 1

We can plot this CDF below.


[Figure: CDF of 𝑤𝑟𝑒𝑑 (net winnings from a bet on red), with dashed lines marking the quantiles 𝑞𝑟𝑒𝑑(0.25) = −1, 𝑞𝑟𝑒𝑑(0.50) = −1, and 𝑞𝑟𝑒𝑑(0.75) = 1.]

We can use this graph to find any quantile. For example, the 0.25 quantile
(25th percentile) is:

𝐹𝑟𝑒𝑑⁻¹(0.25) = min{𝑎 ∶ Pr(𝑤𝑟𝑒𝑑 ≤ 𝑎) ≥ 0.25} = min{−1, 1} = −1

which is depicted by the lowest blue dashed line.

By the same method, we can find that the 0.5 quantile (50th percentile) is:

𝐹𝑟𝑒𝑑⁻¹(0.5) = min{𝑎 ∶ Pr(𝑤𝑟𝑒𝑑 ≤ 𝑎) ≥ 0.5} = min{−1, 1} = −1

and the 0.75 quantile (75th percentile) is:

𝐹𝑟𝑒𝑑⁻¹(0.75) = min{𝑎 ∶ Pr(𝑤𝑟𝑒𝑑 ≤ 𝑎) ≥ 0.75} = min{1} = 1

The formula for the quantile function may look intimidating, but it can be
constructed by just “flipping” the axes of the CDF.

Example 4.15. The whole quantile function

The quantile function for 𝑤𝑟𝑒𝑑 can be constructed by inverting 𝐹𝑟𝑒𝑑(⋅):

[Figure: Quantile function (inverse CDF) for 𝑤𝑟𝑒𝑑, plotting the value of 𝑤𝑟𝑒𝑑 at each quantile between 0 and 1.]

4.3.3 Median

The median of a random variable is its 0.5 quantile or 50th percentile. It can
be interpreted roughly as the “middle” of the distribution.

Example 4.16. The median in roulette


The median of 𝑤𝑟𝑒𝑑 is just its 0.5 quantile or 50th percentile:
𝑚𝑒𝑑𝑖𝑎𝑛(𝑤𝑟𝑒𝑑) = 𝐹𝑟𝑒𝑑⁻¹(0.5) = −1

4.3.4 Variance

The median and expected value both aim to describe a typical or central value
of the random variable. We are also interested in measures of how much the
random variable varies. We have already seen one - the range - but there are
others, including the variance and standard deviation.
The variance of a random variable 𝑥 is defined as:

𝜎𝑥² = 𝑣𝑎𝑟(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²)

Variance can be thought of as a measure of how much 𝑥 tends to deviate from its central tendency 𝐸(𝑥).

Example 4.17. Calculating variance from the definition


The variance of 𝑟 is:
𝑣𝑎𝑟(𝑟) = (0 − 𝐸(𝑟))² ∗ (19/37) + (1 − 𝐸(𝑟))² ∗ (18/37)   (𝐸(𝑟) = 18/37)   (4.28)
       ≈ 0.25   (4.29)

The variance of 𝑤𝑟𝑒𝑑 is:

𝑣𝑎𝑟(𝑤𝑟𝑒𝑑) = (−1 − 𝐸(𝑤𝑟𝑒𝑑))² ∗ (19/37) + (1 − 𝐸(𝑤𝑟𝑒𝑑))² ∗ (18/37)   (𝐸(𝑤𝑟𝑒𝑑) ≈ −0.027)   (4.30)
          ≈ 1.0   (4.31)

The variance of 𝑤14 is:

𝑣𝑎𝑟(𝑤14) = (−1 − 𝐸(𝑤14))² ∗ (36/37) + (35 − 𝐸(𝑤14))² ∗ (1/37)   (𝐸(𝑤14) ≈ −0.027)   (4.32)
         ≈ 34.1   (4.33)

That is, a bet on 14 has the same expected payout as a bet on red, but its
payout is much more variable.

The key to understanding the variance is that it is the expected value of a square
(𝑥 − 𝐸(𝑥))2 , and the expected value is just a (weighted) sum.
The first implication of this is that the variance is always positive (or more
precisely, non-negative):
𝑣𝑎𝑟(𝑥) ≥ 0
The intuition is straightforward. All squares are positive, and the expected
value is just a sum. If you add up several positive numbers, you will get a
positive number.
The second implication is that:

𝑣𝑎𝑟(𝑥) = 𝐸(𝑥²) − 𝐸(𝑥)²

The derivation of this is as follows:

𝑣𝑎𝑟(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²)   (4.34)
       = 𝐸((𝑥 − 𝐸(𝑥)) ∗ (𝑥 − 𝐸(𝑥)))   (4.35)
       = 𝐸(𝑥² − 2𝑥𝐸(𝑥) + 𝐸(𝑥)²)   (4.36)
       = 𝐸(𝑥²) − 2𝐸(𝑥)𝐸(𝑥) + 𝐸(𝑥)²   (4.37)
       = 𝐸(𝑥²) − 𝐸(𝑥)²   (4.38)

This formula is often an easier way of calculating the variance.



Example 4.18. Another way of calculating the variance

We already found that 𝐸(𝑤14) ≈ −0.027, so we can calculate 𝑣𝑎𝑟(𝑤14) by first finding:

𝐸(𝑤14²) = (−1)² 𝑓14(−1) + 35² 𝑓14(35)   (4.39)
        = 1 ∗ (36/37) + 1225 ∗ (1/37)   (4.40)
        ≈ 34.08   (4.41)

Combining these results we get:

𝑣𝑎𝑟(𝑤14) = 𝐸(𝑤14²) − 𝐸(𝑤14)²   (4.42)
         ≈ 34.08 − (−0.027)²   (4.43)
         ≈ 34.1   (4.44)

That is, a bet on 14 has the same expected payout as a bet on red, but its
payout is much more variable.

We can also find the variance of any linear function of a random variable. For
any constants 𝑎 and 𝑏:
𝑣𝑎𝑟(𝑎 + 𝑏𝑥) = 𝑏² 𝑣𝑎𝑟(𝑥)

This can be derived as follows:

𝑣𝑎𝑟(𝑎 + 𝑏𝑥) = 𝐸(((𝑎 + 𝑏𝑥) − 𝐸(𝑎 + 𝑏𝑥))²)   (4.45)
            = 𝐸((𝑎 + 𝑏𝑥 − 𝑎 − 𝑏𝐸(𝑥))²)   (4.46)
            = 𝐸((𝑏(𝑥 − 𝐸(𝑥)))²)   (4.47)
            = 𝐸(𝑏²(𝑥 − 𝐸(𝑥))²)   (4.48)
            = 𝑏² 𝐸((𝑥 − 𝐸(𝑥))²)   (4.49)
            = 𝑏² 𝑣𝑎𝑟(𝑥)   (4.50)

I do not expect you to remember how to derive these results, but I want you to
know them and use them.

Example 4.19. Variance of a linear function

The variance of 𝑤𝑟𝑒𝑑 is:

𝑣𝑎𝑟(𝑤𝑟𝑒𝑑) = 𝑣𝑎𝑟(2𝑟 − 1)   (4.51)
          = 2² 𝑣𝑎𝑟(𝑟)   (4.52)
          ≈ 4 ∗ 0.25   (4.53)
          ≈ 1.0   (4.54)

4.3.5 Standard deviation

The standard deviation of a random variable is defined as the (positive) square root of its variance:

𝜎𝑥 = 𝑠𝑑(𝑥) = √𝑣𝑎𝑟(𝑥)

The standard deviation is just another way of describing the variability of 𝑥.


In some sense, the variance and standard deviation are interchangeable since
they are so closely related. The standard deviation has the advantage that it
is expressed in the same units as the underlying random variable, while the
variance is expressed in the square of those units. This makes the standard
deviation somewhat easier to interpret.

Example 4.20. Standard deviation in roulette


The standard deviation of 𝑟 is:

𝑠𝑑(𝑟) = √𝑣𝑎𝑟(𝑟) ≈ 0.5

The standard deviation of 𝑤𝑟𝑒𝑑 is:

𝑠𝑑(𝑤𝑟𝑒𝑑 ) = √𝑣𝑎𝑟(𝑤𝑟𝑒𝑑 ) ≈ 1.0

The standard deviation of 𝑤14 is

𝑠𝑑(𝑤14 ) = √𝑣𝑎𝑟(𝑤14 ) ≈ 5.8

The standard deviation has analogous properties to the variance:

1. It is always non-negative:
𝑠𝑑(𝑥) ≥ 0
2. For any constants 𝑎 and 𝑏:

𝑠𝑑(𝑎 + 𝑏𝑥) = |𝑏| 𝑠𝑑(𝑥)

These properties follow directly from the corresponding properties of the vari-
ance.

4.4 Standard discrete distributions


In principle, there are an infinite number of possible probability distributions.
However, some probability distributions appear so often in applications that we
have given them names. This provides a quick way to describe a particular
distribution without writing out its full PDF, using the notation
𝑅𝑎𝑛𝑑𝑜𝑚𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒 ∼ 𝐷𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛𝑁𝑎𝑚𝑒(𝑃𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)
where 𝑅𝑎𝑛𝑑𝑜𝑚𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒 is the name of the random variable whose distribution is being described, the ∼ character can be read as “has the following probability distribution”, 𝐷𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛𝑁𝑎𝑚𝑒 is the name of the probability distribution, and 𝑃𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 is a list of numbers called parameters that provide additional information about the probability distribution.
Using a standard distribution also allows us to establish the properties of a
commonly-used distribution once, and use those results every time we use that
distribution.

4.4.1 Bernoulli
The Bernoulli probability distribution is usually written:
𝑥 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝)
It has discrete support 𝑆𝑥 = {0, 1} and PDF:
𝑓𝑥(𝑎) = 1 − 𝑝 if 𝑎 = 0, 𝑝 if 𝑎 = 1, and 0 for anything else

Note that the “Bernoulli distribution” isn’t really a (single) probability distri-
bution. Instead it is what we call a parametric family of distributions. That
is, the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) is a different distribution with a different PDF for each
value of the parameter 𝑝.
We typically use Bernoulli random variables to model the probability of some
random event 𝐴. If we define 𝑥 as the indicator variable 𝑥 = 𝐼(𝐴), then 𝑥 ∼
𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) where 𝑝 = Pr(𝐴).
Example 4.21. The Bernoulli distribution in roulette
The variable 𝑟 = 𝐼(𝑅𝑒𝑑) has the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(18/37) distribution.

The mean of a 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) random variable is:


𝐸(𝑥) = (1 − 𝑝) ∗ 0 + 𝑝 ∗ 1   (4.55)
     = 𝑝   (4.56)

and its variance is:

𝑣𝑎𝑟(𝑥) = 𝐸[(𝑥 − 𝐸(𝑥))²]   (4.57)
       = 𝐸[(𝑥 − 𝑝)²]   (4.58)
       = (1 − 𝑝)(0 − 𝑝)² + 𝑝(1 − 𝑝)²   (4.59)
       = 𝑝(1 − 𝑝)   (4.60)

4.4.2 Binomial

The binomial probability distribution is usually written:

𝑥 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝)

It has discrete support 𝑆𝑥 = {0, 1, 2, … , 𝑛} and its PDF is:

𝑓𝑥(𝑎) = [𝑛!/(𝑎!(𝑛 − 𝑎)!)] 𝑝^𝑎 (1 − 𝑝)^(𝑛−𝑎) if 𝑎 ∈ 𝑆𝑥, and 𝑓𝑥(𝑎) = 0 for anything else

You do not need to memorize or even understand this formula. The Excel
function BINOMDIST() can be used to calculate the PDF or CDF of the bino-
mial distribution, and the function BINOM.INV() can be used to calculate its
quantiles.
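For example, a minimal sketch of these functions for the 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(10, 0.43) distribution discussed below (in recent versions of Excel, BINOMDIST() is also available under the name BINOM.DIST(); the particular arguments here are just for illustration):

=BINOM.DIST(4, 10, 0.43, FALSE) returns the PDF at 4, i.e. Pr(𝑥 = 4).
=BINOM.DIST(4, 10, 0.43, TRUE) returns the CDF at 4, i.e. Pr(𝑥 ≤ 4).
=BINOM.INV(10, 0.43, 0.5) returns the 0.5 quantile (median).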
The binomial distribution is typically used to model frequencies or counts. We
can show that it is the distribution of how many times a probability-𝑝 event
happens in 𝑛 independent attempts.
For example, the basketball player Stephen Curry makes about 43% of his 3-
point shot attempts. If each shot is independent of the others, then the number
of shots he makes in 10 attempts will have the 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(10, 0.43) distribution.

Example 4.22. The binomial distribution in roulette


Suppose we play 50 games of roulette, and bet on red in every game. Let 𝑊𝐼𝑁50 be the number of times we win.
Since the outcome of a single bet on red is 𝑟 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(18/37), this means that 𝑊𝐼𝑁50 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(50, 18/37).
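As a sketch of how we might use this distribution in Excel, the probability of winning at least half of the 50 games could be calculated with a formula like:

=1 - BINOM.DIST(24, 50, 18/37, TRUE)

which returns Pr(𝑊𝐼𝑁50 ≥ 25) = 1 − Pr(𝑊𝐼𝑁50 ≤ 24).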

The mean and variance of a binomial random variable are:

𝐸(𝑥) = 𝑛𝑝

𝑣𝑎𝑟(𝑥) = 𝑛𝑝(1 − 𝑝)

The formula for the binomial PDF looks strange, but it can actually be derived
from a fairly simple and common situation. Let (𝑏1 , 𝑏2 , … , 𝑏𝑛 ) be a sequence of
𝑛 independent random variables from the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) distribution and let:
𝑥 = ∑_{𝑖=1}^{𝑛} 𝑏𝑖

count up the number of times that 𝑏𝑖 is equal to one (i.e., the event modeled by 𝑏𝑖 happened). Then it is possible to derive the PDF for 𝑥, and that is the PDF we call 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝). The derivation is not easy, but the intuition is simple:

• We can calculate the probability of the event 𝑥 = 𝑎 by adding up the probability of its component outcomes.
• The number of outcomes in the event 𝑥 = 𝑎 is 𝑛!/(𝑎!(𝑛 − 𝑎)!).
• The probability of each individual outcome in the event 𝑥 = 𝑎 is 𝑝^𝑎 (1 − 𝑝)^(𝑛−𝑎).

Therefore the probability of the event 𝑥 = 𝑎 is [𝑛!/(𝑎!(𝑛 − 𝑎)!)] 𝑝^𝑎 (1 − 𝑝)^(𝑛−𝑎).

4.4.3 Discrete uniform

The discrete uniform distribution

𝑥 ∼ 𝐷𝑖𝑠𝑐𝑟𝑒𝑡𝑒𝑈𝑛𝑖𝑓𝑜𝑟𝑚(𝑆𝑥)

is a distribution that puts equal probability on every value in a discrete set 𝑆𝑥 .


Its support is 𝑆𝑥 and its PDF is:

𝑓𝑥(𝑎) = 1/|𝑆𝑥| if 𝑎 ∈ 𝑆𝑥, and 𝑓𝑥(𝑎) = 0 if 𝑎 ∉ 𝑆𝑥

Discrete uniform distributions appear in gambling and similar applications.

Example 4.23. The discrete uniform distribution in roulette


In our roulette example, the outcome 𝑏 has a discrete uniform distribution on
Ω = {0, 1, … , 36}.

Chapter review
In this chapter we have learned various ways of describing the probability dis-
tribution of a simple random variable - a single random variable that takes on
values in a finite set. We have also learned three standard probability distributions for simple random variables.
In the next chapter, we will deal with more complex random variables including
random variables that take on values in a continuous set, or random variables
that are related to other random variables. We will then use the concept of a
random variable to understand both data and statistics calculated from data.

Practice problems
Answers can be found in the appendix.

The questions below continue our craps example. To review that example, we
have an outcome (𝑟, 𝑤) where 𝑟 and 𝑤 are the numbers rolled on a pair of fair
six-sided dice.
Let the random variable 𝑡 be the total showing on the pair of dice, and let the
random variable 𝑦 = 𝐼(𝑡 = 11) be an indicator for whether a bet on “Yo” wins.
SKILL #1: Define a random variable in terms of its underlying out-
come

1. Define 𝑡 in terms of the underlying outcome (𝑟, 𝑤).


2. Define 𝑦 in terms of the underlying outcome (𝑟, 𝑤).

SKILL #2: Find the support of a random variable

3. Find the support of the following random variables:


a. Find the support 𝑆𝑟 of the random variable 𝑟.
b. Find the support 𝑆𝑡 of the random variable 𝑡.
c. Find the support 𝑆𝑦 of the random variable 𝑦.

SKILL #3: Find the range from the support

4. Find the range of each of the following random variables


a. Find the range of 𝑟.
b. Find the range of 𝑡.
c. Find the range of 𝑦.

SKILL #4: Find the (discrete) PDF for a simple example

5. Find the following PDFs:


a. Find the PDF 𝑓𝑟 for the random variable 𝑟.
b. Find the PDF 𝑓𝑡 for the random variable 𝑡.
c. Find the PDF 𝑓𝑦 for the random variable 𝑦.

SKILL #5: Find the CDF from the (discrete) PDF

6. Using the PDFs you found earlier, find the following CDFs
a. Find the CDF 𝐹𝑟 for the random variable 𝑟.
b. Find the CDF 𝐹𝑦 for the random variable 𝑦.

SKILL #6: Find the expected value from the (discrete) PDF

7. Using the PDFs you found earlier, find the following expected values:
a. Find the expected value 𝐸(𝑟).
b. Find the expected value 𝐸(𝑟2 ).

SKILL #7: Find quantiles from the CDF

8. Using the CDFs you found earlier, find the following quantiles:
a. Find the median 𝑀 𝑒𝑑(𝑟).
b. Find the 0.25 quantile 𝐹𝑟−1 (0.25).
c. Find the 75th percentile of 𝑟.

SKILL #8: Calculate variance and standard deviation from the (dis-
crete) PDF

9. Let 𝑑 = (𝑦 − 𝐸(𝑦))2
a. Find the PDF 𝑓𝑑 of 𝑑.
b. Use this PDF to find 𝐸(𝑑)
c. Use the results above to find the variance 𝑣𝑎𝑟(𝑦).
d. Use the results above to find the standard deviation 𝑠𝑑(𝑦).

SKILL #9: Calculate variance and standard deviation from expected


values

10. In question (7) above, you calculated 𝐸(𝑟) and 𝐸(𝑟2 ) from the PDF.
a. Use these results to find 𝑣𝑎𝑟(𝑟)
b. Use these results to find 𝑠𝑑(𝑟)

SKILL #10: Identify and use random variables from standard dis-
crete distributions

11. The random variable 𝑦 can be described using a standard distribution.


a. What standard distribution describes 𝑦?
b. Use standard results for this distribution to find 𝐸(𝑦)
c. Use standard results for this distribution to find 𝑣𝑎𝑟(𝑦)
12. Let 𝑌10 be the number of times in 10 dice rolls that a bet on “Yo” wins.
a. What standard distribution describes 𝑌10 ?
b. Use existing results for this distribution to find 𝐸(𝑌10 ).
c. Use existing results for this distribution to find 𝑣𝑎𝑟(𝑌10 ).
d. Use Excel to calculate Pr(𝑌10 = 0).
e. Use Excel to calculate Pr(𝑌10 ≤ 10/16).

f. Use Excel to calculate Pr(𝑌10 > 10/16).

SKILL #11: Calculate mean and variance for a linear function of a


random variable

13. The “Yo” bet pays out at 15:1, meaning you win $15 for each dollar
bet. Suppose you bet $10 on Yo. Your net winnings in that case will be
𝑊 = 160 ∗ 𝑦 − 10.
a. Using earlier results, find 𝐸(𝑊 ).
b. Using earlier results, find 𝑣𝑎𝑟(𝑊 ).
c. The event 𝑊 > 0 (your net winnings are positive) is identical to the
event 𝑦 = 1. Using earlier results, find Pr(𝑊 > 0).
14. Suppose you bet $1 on Yo in ten independent rolls. Your net winnings in
that case will be 𝑊10 = 16 ∗ 𝑌10 − 10.
a. Using earlier results, find 𝐸(𝑊10 ).
b. Using earlier results, find 𝑣𝑎𝑟(𝑊10 ).
c. The event 𝑊10 > 0 (your net winnings are positive) is identical to
the event 𝑌10 > 10/16. Using earlier results, find Pr(𝑊10 > 0).

SKILL #12: Interpret means and variances

15. If you have $10 and care mostly about expected net winnings, which would
be your preferred betting strategy?
• Bet Yo $10 on a single roll.
• Bet Yo $1 on each of 10 rolls.
• Keep your $10 and not bet at all.
16. Which of the following two betting strategies produces the highest proba-
bility of walking away from the table with more money than you started
with?
• Bet Yo $10 on a single roll.
• Bet Yo $1 on each of 10 rolls.
17. Which of the following two betting strategies has more variable net win-
nings?
• Bet Yo $10 on a single roll.
• Bet Yo $1 on each of 10 rolls.
Chapter 5

More on random variables

The previous chapter developed the basic terminology and analytical tools for
a single discrete random variable. This chapter will extend those tools to work
with continuous random variables, and with multiple random variables.

Goals
Chapter goals
In this chapter we will:

• Interpret the CDF and PDF of a continuous random variable.


• Work with common continuous probability distributions including
the uniform and normal.
• Calculate and interpret joint, marginal and conditional distribu-
tions of two random variables.
• Calculate and interpret the covariance and correlation coefficient
of two discrete random variables.

5.1 Continuous random variables


So far we have considered random variables with a discrete support. However,
many random variables of interest have a continuous support.
For example, consider the Canadian labour force participation rate. It is defined
as:
(LFP rate) = (labour force)/(population) × 100%
so it can be any (rational) number between 0% and 100%:

𝑆𝐿𝐹𝑃𝑟𝑎𝑡𝑒 = [0%, 100%]


This introduces a complication: there is an infinite number of values in this support. There is even an infinite number of possible values between two numbers as close together as 63% and 63.0001%.
Because a continuous random variable has an infinite support, it has a seemingly
inconsistent pair of properties:

• The probability that 𝑥 is any specific value in the support is:

Pr(𝑥 = 𝑎) = 0

for all 𝑎 ∈ 𝑆𝑥 .
• The probability that 𝑥 is somewhere in the support is:

Pr(𝑥 ∈ 𝑆𝑥 ) = 1

This feature applies to ranges as well. For example, the Canadian labour force
participation rate has a high probability of being between 60% and 70%, but
zero probability of being exactly 65%.
The math for working with continuous random variables is a little different from
the math for working with discrete random variables, and is harder because it
requires calculus. In many cases, it requires integral calculus (MATH 152 or
MATH 158) which is not a prerequisite to this course, and which I do not
expect you to know.
But deep down, there are no really important differences between continuous
and discrete random variables. The intuition for why is straightforward: you
can make a continuous random variable into a discrete random variable by just
rounding it. For example, suppose you round the labour force participation rate
to the nearest percentage point. Then it becomes a discrete random variable,
with support:
𝑆𝑥 = {0%, 1%, … 99%, 100%}
The same point applies if you round to the nearest 1/100th of a percentage
point, or the nearest 1/1,000,000th of a percentage point.
Since discrete and continuous random variables are more alike than they first appear, most of
the results and intuition we have already developed for discrete random variables
can also be applied to continuous random variables. So:

• Most of my examples will be for the discrete case.


• I will briefly show you the math for the continuous case, but I will not
expect you to do it.
• Most of the results I give you will apply for both cases.

Any time you see an integral here, you can ignore it.

5.1.1 The continuous CDF

The CDF of a continuous random variable 𝑥 is defined exactly the same way as
for the discrete case:
𝐹𝑥 (𝑎) = Pr(𝑥 ≤ 𝑎)
The only difference is how it looks.
If you recall, the CDF of a discrete random variable takes on a stair-step form:
increasing in discrete jumps at every point in the discrete support, and flat
everywhere else.
In contrast, the CDF of a continuous random variable increases continuously.
It can have flat parts, but never jumps.

Example 5.1. The standard uniform distribution


Consider a random variable 𝑥 that has the standard uniform distribution.
What that means is that:

1. The support of 𝑥 is the range [0, 1].


2. All values in this range are equally likely.

The CDF of the standard uniform distribution is:

𝐹𝑥(𝑎) = Pr(𝑥 ≤ 𝑎) = 0 if 𝑎 < 0, 𝑎 if 𝑎 ∈ [0, 1], and 1 if 𝑎 > 1

Figure 5.1 below shows the CDF of the standard uniform distribution.
As you can see, the CDF is smoothly increasing between zero and one, and flat
everywhere else.

The CDF of a continuous random variable obeys all of the properties described
in section 4.1.4:

𝐹𝑥(𝑎) ≤ 𝐹𝑥(𝑏) if 𝑎 ≤ 𝑏
0 ≤ 𝐹𝑥(𝑎) ≤ 1
lim_{𝑎→−∞} 𝐹𝑥(𝑎) = Pr(𝑥 ≤ −∞) = 0
lim_{𝑎→∞} 𝐹𝑥(𝑎) = Pr(𝑥 ≤ ∞) = 1
𝐹(𝑏) − 𝐹(𝑎) = Pr(𝑎 < 𝑥 ≤ 𝑏)   (5.1)


Figure 5.1: CDF for the standard uniform distribution

In addition, the result on intervals applies to both strict and weak inequalities:
𝐹 (𝑏) − 𝐹 (𝑎) = Pr(𝑎 < 𝑥 ≤ 𝑏) (5.2)
= Pr(𝑎 < 𝑥 < 𝑏) (5.3)
= Pr(𝑎 ≤ 𝑥 ≤ 𝑏) (5.4)
= Pr(𝑎 ≤ 𝑥 < 𝑏) (5.5)
since a continuous random variable has probability zero of taking on any specific
value.

5.1.2 The continuous PDF


The PDF 𝑓𝑥 (𝑎) for a discrete random variable is defined as the size of the
“jump” in the CDF at 𝑎, or (equivalently) the probability Pr(𝑥 = 𝑎) of observing
that particular value. But the CDF of a continuous random variable has no
jumps, and the probability of observing any particular value is always zero. So
this particular function is useless in describing the probability distribution of a
continuous random variable.
Instead, we define the PDF of a continuous random variable 𝑥 as the slope
or derivative of the CDF:
𝑓𝑥(𝑎) = 𝑑𝐹𝑥(𝑎)/𝑑𝑎

In other words, instead of the amount the CDF increases (jumps) at 𝑎, it is the
rate at which it increases.

Example 5.2. The PDF of the standard uniform distribution


The PDF of a standard uniform random variable is:

𝑓𝑥(𝑎) = 0 if 𝑎 < 0, 1 if 𝑎 ∈ [0, 1], and 0 if 𝑎 > 1

which looks like this:

[Figure: PDF of the standard uniform distribution, equal to 1 on [0, 1] and 0 everywhere else.]

The PDF of a continuous random variable is a good way to visualize its prob-
ability distribution, and this is about the only way we will use the continuous
PDF in this class (since everything else requires integration).

Example 5.3. Interpreting the uniform PDF


The uniform PDF shows the key feature of this distribution: in some loose sense,
all values in the support are “equally likely”, much like in the discrete uniform
distribution described earlier. In fact, if you round a uniform random variable,
you get a discrete uniform random variable.

I have defined the PDF in terms of the CDF, but it is also possible to derive the
CDF from the PDF. This requires integral calculus, so I will give the definition
below but not expect you to use it.

FYI
Deriving the CDF from the PDF of a continuous random vari-
able
The formula for deriving the CDF of a continuous random variable from
its PDF is:
𝐹𝑥(𝑎) = ∫_{−∞}^{𝑎} 𝑓𝑥(𝑣) 𝑑𝑣

Unless you have taken MATH 152 or MATH 158, you may have no idea
what this is or how to solve it. That’s OK! All you need to know for this
course is that it can be solved.

The continuous PDF has many properties that are similar but not identical to
the properties of the discrete PDF.

FYI
Like the discrete PDF, the continuous PDF is non-negative for all values:

𝑓𝑥 (𝑎) ≥ 0

and is strictly positive for all values in the support:

𝑎 ∈ 𝑆𝑥 ⟹ 𝑓𝑥 (𝑎) > 0

but it is not a probability, and can be greater than one.


If you recall, we can calculate probabilities from the discrete PDF by
adding, and the discrete PDF sums to one. Similarly, we can calculate
probabilities from the continuous PDF by integrating:
Pr(𝑎 < 𝑥 < 𝑏) = ∫_𝑎^𝑏 𝑓𝑥(𝑣) 𝑑𝑣

and the continuous PDF integrates to one:

∫_{−∞}^{∞} 𝑓𝑥(𝑣) 𝑑𝑣 = 1

5.1.3 Means and quantiles

Quantiles and percentiles, including the range and median, have the same defi-
nition and interpretation whether the random variable is continuous or discrete.
The definition for the expected value of a continuous random variable is different
and uses integral calculus.

FYI
The expected value for a continuous random variable
When 𝑥 is continuous, its expected value is defined as:

𝐸(𝑥) = ∫_{−∞}^{∞} 𝑎 𝑓𝑥(𝑎) 𝑑𝑎

Notice that this looks just like the definition for the discrete case, but with the sum replaced by an integral sign. If you know much about integral calculus, you may recall that an integral is a sum (or more precisely, the limit of a sum). This is why the same properties we earlier found for the expected value of a discrete random variable also apply to continuous random variables.
There is even a general definition that covers both discrete and contin-
uous variables, as well as any mix between them:

𝐸(𝑥) = ∫_{−∞}^{∞} 𝑎 𝑑𝐹𝑥(𝑎)

This expression uses notation that is not typically taught in a first course
in integral calculus, so even if you have taken MATH 152 or MATH 158
you may not know how to interpret it. Again, I am only showing you so
that you know the formula exists, I am not asking you to remember or
use it.

More importantly, the expected value has the same interpretation as it does for
a discrete random variable, and it has all of the properties described earlier as
well.
The variance and standard deviation are both defined as expected values, so
they also have the same interpretation and properties for a continuous random
variable as they do for a discrete random variable.

5.2 The uniform distribution


The uniform probability distribution is usually written
𝑥 ∼ 𝑈 (𝐿, 𝐻)
where 𝐿 < 𝐻.

5.2.1 The uniform PDF


The uniform distribution is a continuous probability distribution with support:
𝑆𝑥 = [𝐿, 𝐻]

and PDF:

𝑓𝑥(𝑎) = 1/(𝐻 − 𝐿) if 𝑎 ∈ 𝑆𝑥, and 0 otherwise

For example, if 𝑥 ∼ 𝑈 (2, 5) its support is the range of all values from 2 to 5,
and its PDF looks like this:

[Figure: PDF of the U(2,5) distribution, equal to 1/3 on [2, 5] and 0 everywhere else.]

The uniform distribution puts equal probability on all values between 𝐿 and
𝐻. We have already seen the standard uniform distribution, which is just
the 𝑈 (0, 1) distribution.

5.2.2 The uniform CDF

The CDF of the 𝑈 (𝐿, 𝐻) distribution is

𝐹𝑥(𝑎) = 0 if 𝑎 ≤ 𝐿, (𝑎 − 𝐿)/(𝐻 − 𝐿) if 𝐿 < 𝑎 < 𝐻, and 1 if 𝑎 ≥ 𝐻

For example, if 𝑥 ∼ 𝑈 (2, 5) the CDF looks like this:

[Figure: CDF of the U(2,5) distribution, rising linearly from 0 at 𝑎 = 2 to 1 at 𝑎 = 5.]

5.2.3 Means and quantiles

The median of a uniform random variable 𝑥 ∼ 𝑈 (𝐿, 𝐻) is just the midpoint of


its support:
𝑀𝑒𝑑(𝑥) = 𝐹𝑥⁻¹(0.5) = (𝐿 + 𝐻)/2
Its mean can be found using integral calculus, and is also the midpoint:

𝐸(𝑥) = (𝐿 + 𝐻)/2

Its variance also requires integral calculus:

𝑣𝑎𝑟(𝑥) = (𝐻 − 𝐿)²/12

and its standard deviation is just the square root of the variance:

𝑠𝑑(𝑥) = √((𝐻 − 𝐿)²/12)

as always.
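For example, if 𝑥 ∼ 𝑈(2, 5) as above, then 𝑀𝑒𝑑(𝑥) = 𝐸(𝑥) = (2 + 5)/2 = 3.5, 𝑣𝑎𝑟(𝑥) = (5 − 2)²/12 = 0.75, and 𝑠𝑑(𝑥) = √0.75 ≈ 0.87.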

5.2.4 Functions of a uniform


A linear function of a uniform random variable also has a uniform distribution.
That is, if 𝑥 ∼ 𝑈 (𝐿, 𝐻) and 𝑦 = 𝑎 + 𝑏𝑥 where 𝑏 > 0, then:
𝑦 ∼ 𝑈 (𝑎 + 𝑏𝐿, 𝑎 + 𝑏𝐻)
A nonlinear function of a uniform random variable is generally not uniform, but
it has a very useful characteristic described below.

FYI
Uniform distributions in video games
Uniform distributions are important in many computer applications in-
cluding video games.
It is easy for a computer to generate a random number from the 𝑈 (0, 1)
distribution, and a 𝑈 (0, 1) has the unusual feature that its 𝑞 quantile is
equal to 𝑞.
As a result, you can generate a random variable with any probability
distribution you like by following these steps:

1. Generate a random variable 𝑞 ∼ 𝑈 (0, 1).


2. Calculate 𝑥 = 𝐹 −1 (𝑞) where 𝐹 (⋅) is the CDF of the distribution
you want.

Then 𝑥 is a random variable with the CDF 𝐹 (⋅)


If you have ever played a video game, that game is constantly generat-
ing 𝑈 (0, 1) random numbers and using them to determine the behavior
of non-player characters, the location of weapons and other resources,
etc. Without that element of randomness, these games would be far too
predictable to be much fun.
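This two-step recipe is easy to try in Excel, where RAND() generates a (pseudo-)random draw from the 𝑈(0, 1) distribution and the various ...INV() functions calculate quantiles. A minimal sketch, with the particular distributions chosen just for illustration:

=RAND() simulates a draw 𝑞 ∼ 𝑈(0, 1).
=NORM.INV(RAND(), 0, 1) simulates a draw from the standard normal distribution.
=BINOM.INV(10, 0.43, RAND()) simulates the number of 3-point shots made in 10 attempts by a 43% shooter.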

5.3 The normal distribution


The normal distribution is typically written as:
𝑥 ∼ 𝑁(𝜇, 𝜎²)
The normal distribution is also called the Gaussian distribution, and the
𝑁 (0, 1) distribution is called the standard normal distribution.

5.3.1 The normal PDF


The 𝑁(𝜇, 𝜎²) distribution is a continuous distribution with support 𝑆𝑥 = ℝ and PDF:
𝑓𝑥(𝑎) = (1/(𝜎√(2𝜋))) 𝑒^(−(𝑎−𝜇)²/(2𝜎²))

For example, the PDF for the 𝑁 (0, 1) distribution looks like this:

[Figure: PDF of the N(0,1) distribution, a symmetric bell curve centered at 0.]

As the figure shows, the 𝑁(0, 1) distribution is symmetric around 𝜇 = 0 and bell-shaped, meaning that it is usually close to zero but can occasionally be quite far.

For other values of 𝜇, the 𝑁(𝜇, 𝜎²) distribution is also symmetric around 𝜇 and bell-shaped, with the “spread” of the distribution depending on the value of 𝜎²:

[Figure: PDFs of the N(0,1), N(1,1), and N(0,2) distributions.]

5.3.2 The normal CDF

The CDF of the normal distribution can be derived by integrating the PDF.
There is no simple closed-form expression for this CDF, but it is easy to calculate
with a computer.
[Figure: CDFs of the N(0,1), N(1,1), and N(0,2) distributions.]

The Excel function NORM.DIST() can be used to calculate the PDF or CDF of
any normal distribution.
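As a sketch of its use (the arguments are the value, the mean, the standard deviation (not the variance), and a TRUE/FALSE flag selecting the CDF or the PDF):

=NORM.DIST(1, 0, 1, TRUE) returns the 𝑁(0, 1) CDF at 1, roughly 0.841.
=NORM.DIST(1, 0, 1, FALSE) returns the 𝑁(0, 1) PDF at 1.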

5.3.3 Means and quantiles

The Excel function NORM.INV() can be used to calculate the quantile (inverse
CDF) function of any normal distribution.
Since the 𝑁(𝜇, 𝜎²) distribution is symmetric around 𝜇, its median is also 𝜇.
The mean and variance of a 𝑁(𝜇, 𝜎²) random variable can be found by integration:
𝐸(𝑥) = 𝜇
𝑣𝑎𝑟(𝑥) = 𝜎²
and the standard deviation is just the square root of the variance:

𝑠𝑑(𝑥) = 𝜎

5.3.4 Functions of a normal

Any linear function of a normal random variable is also normal. That is, if

𝑥 ∼ 𝑁(𝜇, 𝜎²)

Then for any constants 𝑎 and 𝑏:


𝑎𝑥 + 𝑏 ∼ 𝑁(𝑎𝜇 + 𝑏, 𝑎²𝜎²)
Just to be clear, we already showed that 𝐸(𝑎𝑥 + 𝑏) = 𝑎𝐸(𝑥) + 𝑏 and 𝑣𝑎𝑟(𝑎𝑥 + 𝑏) = 𝑎²𝑣𝑎𝑟(𝑥) for any random variable 𝑥. What is new here is that 𝑎𝑥 + 𝑏 is normally distributed if 𝑥 is.
There are many other standard distributions that describe functions of one or
more normal random variables and are derived from the normal distribution.
These distributions include the 𝜒² distribution, the 𝐹 distribution and the 𝑇 distribution, and have various applications in statistical analysis.

5.3.5 The standard normal distribution


The standard normal distribution is so useful that we have a special symbol for its PDF:
𝜙(𝑎) = (1/√(2𝜋)) 𝑒^(−𝑎²/2)
and its CDF:
Φ(𝑎) = ∫_{−∞}^{𝑎} 𝜙(𝑏) 𝑑𝑏
All statistical programs, including Excel and R, provide a function that calcu-
lates the standard normal CDF.
Our result in the previous section about linear functions of a normal can be
used to restate the distribution of any normally distributed random variable in
terms of a standard normal. That is, suppose that
𝑥 ∼ 𝑁 (𝜇, 𝜎2 )
Then if we define
𝑧 = (𝑥 − 𝜇)/𝜎
our result implies that
𝑧 ∼ 𝑁((𝜇 − 𝜇)/𝜎, (1/𝜎)² 𝜎²)
or equivalently:
𝑧 ∼ 𝑁 (0, 1)
We can use this result to derive the CDF of 𝑥:
𝐹𝑥(𝑎) = Pr(𝑥 ≤ 𝑎)   (5.6)
      = Pr((𝑥 − 𝜇)/𝜎 ≤ (𝑎 − 𝜇)/𝜎)   (5.7)
      = Pr(𝑧 ≤ (𝑎 − 𝜇)/𝜎)   (5.8)
      = Φ((𝑎 − 𝜇)/𝜎)   (5.9)

Since the standard normal CDF is available as a built-in function in Excel or R, we can use this result to calculate the CDF of any normally distributed random variable.
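For example, suppose 𝑥 ∼ 𝑁(10, 4), so that 𝜎 = 2. Then Pr(𝑥 ≤ 11) = Φ((11 − 10)/2) = Φ(0.5) ≈ 0.691. In Excel, a minimal sketch of this calculation is:

=NORM.S.DIST(0.5, TRUE)

where the TRUE argument selects the CDF rather than the PDF.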

FYI
A very important result called the Central Limit Theorem tells us that
many “real world” random variables have a probability distribution that
is well-approximated by the normal distribution.
We will discuss the central limit theorem in much more detail later.

5.4 Multiple random variables


Almost all interesting data sets have multiple observations and multiple vari-
ables. So before we start talking about data, we need to develop some tools and
terminology for thinking about multiple random variables.
To keep things simple, most of the definitions and examples will be stated
in terms of two random variables. The extension to more than two random
variables is conceptually straightforward but will be skipped.

5.4.1 Joint distribution

Let 𝑥 = 𝑥(𝑏) and 𝑦 = 𝑦(𝑏) be two random variables defined in terms of the same
underlying outcome 𝑏.
Their joint probability distribution assigns a probability to every event that
can be defined in terms of 𝑥 and 𝑦, for example Pr(𝑥 = 6 ∩ 𝑦 = 0) or Pr(𝑥 < 𝑦).
This joint distribution can be fully described by the joint CDF:

𝐹𝑥,𝑦 (𝑎, 𝑏) = Pr(𝑥 ≤ 𝑎 ∩ 𝑦 ≤ 𝑏)

or by the joint PDF:

𝑓𝑥,𝑦(𝑎, 𝑏) = Pr(𝑥 = 𝑎 ∩ 𝑦 = 𝑏) if 𝑥 and 𝑦 are discrete, and 𝑓𝑥,𝑦(𝑎, 𝑏) = 𝜕²𝐹𝑥,𝑦(𝑎, 𝑏)/𝜕𝑎𝜕𝑏 if 𝑥 and 𝑦 are continuous

Example 5.4. The joint PDF in roulette


In our roulette example, the joint PDF of 𝑤𝑟𝑒𝑑 and 𝑤14 can be derived from the
original outcome.
If 𝑏 = 14, then both red and 14 win:

𝑓𝑟𝑒𝑑,14(1, 35) = Pr(𝑤𝑟𝑒𝑑 = 1 ∩ 𝑤14 = 35)   (5.10)
              = Pr(𝑏 ∈ {14}) = 1/37   (5.11)

If 𝑏 ∈ 𝑅𝑒𝑑 but 𝑏 ≠ 14, then red wins but 14 loses:

𝑓𝑟𝑒𝑑,14(1, −1) = Pr(𝑤𝑟𝑒𝑑 = 1 ∩ 𝑤14 = −1)   (5.12)
              = Pr(𝑏 ∈ {1, 3, 5, 7, 9, 12, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36})   (5.13)
              = 17/37   (5.14)

Otherwise both red and 14 lose:

𝑓𝑟𝑒𝑑,14(−1, −1) = Pr(𝑤𝑟𝑒𝑑 = −1 ∩ 𝑤14 = −1)   (5.15)
               = Pr(𝑏 ∈ {0, 2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35})   (5.16)
               = 19/37   (5.17)

All other values have probability zero.

5.4.2 Marginal distribution

The joint distribution tells you two things about these variables

1. The probability distribution of each individual random variable, sometimes called that variable’s marginal distribution.
• For example, we can derive each variable’s CDF from the joint CDF:

𝐹𝑥 (𝑎) = Pr(𝑥 ≤ 𝑎) = Pr(𝑥 ≤ 𝑎 ∩ 𝑦 ≤ ∞) = 𝐹𝑥,𝑦 (𝑎, ∞)

𝐹𝑦 (𝑏) = Pr(𝑦 ≤ 𝑏) = Pr(𝑥 ≤ ∞ ∩ 𝑦 ≤ 𝑏) = 𝐹𝑥,𝑦 (∞, 𝑏)


• We can also derive each variable’s PDF from the joint PDF
2. The relationship between the two variables.
• We will develop several ways of describing this relationship: condi-
tional distribution, covariance, correlation, etc.

Note that while you can always derive the marginal distributions from the joint
distribution, you cannot go the other way around unless you know everything
about the relationship between the two variables.
Example 5.5. Three joint distributions with identical marginal distri-
butions
The scatter plots in the figure below depict simulation results for a pair of
random variables (𝑥, 𝑦), with a different joint distribution in each graph. In all
three graphs, 𝑥 and 𝑦 have the same marginal distribution (standard normal).
The differences between the graphs are in the relationship between 𝑥 and 𝑦.

• In the first graph, 𝑥 and 𝑦 are unrelated, so the data looks like a “cloud” of random dots.

• In the second graph, 𝑥 and 𝑦 have something of a negative relationship.


High values of 𝑥 tend to go with low values of 𝑦.
• In the third graph, 𝑥 and 𝑦 are positively and closely related. In fact, they
are equal.

[Figure: Three scatter plots of simulated (𝑥, 𝑦) pairs, one for each joint distribution described above.]

5.4.3 Conditional distribution

The conditional distribution of a random variable 𝑦 given another random


variable 𝑥 assigns values to all probabilities of the form:

Pr(𝑦 ∈ 𝐴|𝑥 ∈ 𝐵) = Pr(𝑦 ∈ 𝐴 ∩ 𝑥 ∈ 𝐵)/Pr(𝑥 ∈ 𝐵)

Since a conditional probability is just the ratio of the joint probability to the
marginal probability, the conditional distribution can always be derived from
the joint distribution.
We can describe a conditional distribution with either the conditional CDF:

𝐹𝑦|𝑥 (𝑎, 𝑏) = Pr(𝑦 ≤ 𝑎|𝑥 = 𝑏)



or the conditional PDF


𝑓𝑦|𝑥(𝑎, 𝑏) = Pr(𝑦 = 𝑎|𝑥 = 𝑏) if 𝑥 and 𝑦 are discrete, and 𝑓𝑦|𝑥(𝑎, 𝑏) = (𝜕/𝜕𝑎)𝐹𝑦|𝑥(𝑎, 𝑏) if 𝑥 and 𝑦 are continuous

Example 5.6. Conditional PDFs in roulette


Let’s find the conditional PDF of the payout for a bet on 14 given the payout
for a bet on red.
Pr(𝑤14 = −1|𝑤𝑟𝑒𝑑 = −1) = Pr(𝑤14 = −1 ∩ 𝑤𝑟𝑒𝑑 = −1)/Pr(𝑤𝑟𝑒𝑑 = −1)   (5.18)
                        = (19/37)/(19/37) = 1   (5.19)

Pr(𝑤14 = 35|𝑤𝑟𝑒𝑑 = −1) = Pr(𝑤14 = 35 ∩ 𝑤𝑟𝑒𝑑 = −1)/Pr(𝑤𝑟𝑒𝑑 = −1)   (5.20)
                       = 0/(19/37) = 0   (5.21)

Pr(𝑤14 = −1|𝑤𝑟𝑒𝑑 = 1) = Pr(𝑤14 = −1 ∩ 𝑤𝑟𝑒𝑑 = 1)/Pr(𝑤𝑟𝑒𝑑 = 1)   (5.22)
                      = (17/37)/(18/37) ≈ 0.944   (5.23)

Pr(𝑤14 = 35|𝑤𝑟𝑒𝑑 = 1) = Pr(𝑤14 = 35 ∩ 𝑤𝑟𝑒𝑑 = 1)/Pr(𝑤𝑟𝑒𝑑 = 1)   (5.24)
                      = (1/37)/(18/37) ≈ 0.056   (5.25)

5.4.4 Functions of multiple random variables


As with a single random variable, we can take any function 𝑔(𝑥, 𝑦) of two or
more random variables, and we will have a new random variable that has a
well-defined CDF, PDF, expected value, etc.
As with a linear function of a single random variable, we can take the expected
value inside of a linear function of two or more random variables. That is:
𝐸(𝑎 + 𝑏𝑥 + 𝑐𝑦) = 𝑎 + 𝑏𝐸(𝑥) + 𝑐𝐸(𝑦)
where 𝑥 and 𝑦 are random variables and 𝑎, 𝑏 and 𝑐 are constants.
However, we cannot take the expected value inside a nonlinear function. For
example:
𝐸(𝑥𝑦) ≠ 𝐸(𝑥)𝐸(𝑦)
𝐸(𝑥/𝑦) ≠ 𝐸(𝑥)/𝐸(𝑦)
As with a single random variable, the reason for this is that the expected value
is a sum.

Example 5.7. Multiple bets in roulette


Suppose we bet $100 on red and $10 on 14. Our net payout will be:
𝑤𝑡𝑜𝑡𝑎𝑙 = 100 ∗ 𝑤𝑟𝑒𝑑 + 10 ∗ 𝑤14
which has expected value:
𝐸(𝑤𝑡𝑜𝑡𝑎𝑙) = 𝐸(100𝑤𝑟𝑒𝑑 + 10𝑤14)   (5.26)
          = 100𝐸(𝑤𝑟𝑒𝑑) + 10𝐸(𝑤14)   (each ≈ −0.027)   (5.27)
          ≈ −3   (5.28)
That is, we expect this betting strategy to lose an average of about $3 per game.

5.4.5 Covariance

The covariance of two random variables 𝑥 and 𝑦 is defined as:


𝜎𝑥𝑦 = 𝑐𝑜𝑣(𝑥, 𝑦) = 𝐸[(𝑥 − 𝐸(𝑥)) ∗ (𝑦 − 𝐸(𝑦))]
The covariance can be interpreted as a measure of how 𝑥 and 𝑦 tend to move
together:

• If the covariance is positive:


– (𝑥 − 𝐸(𝑥)) and (𝑦 − 𝐸(𝑦)) tend to have the same sign.
– 𝑥 and 𝑦 tend to move in the same direction.
• If the covariance is negative:
– (𝑥 − 𝐸(𝑥)) and (𝑦 − 𝐸(𝑦)) tend to have opposite signs.
– 𝑥 and 𝑦 tend to move in opposite directions.

If the covariance is zero, there is no simple pattern of co-movement for 𝑥 and 𝑦.


Example 5.8. Calculating the covariance from the joint PDF
The covariance of 𝑤𝑟𝑒𝑑 and 𝑤14 is:
𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14) = (1 − 𝐸(𝑤𝑟𝑒𝑑))(35 − 𝐸(𝑤14)) 𝑓𝑟𝑒𝑑,14(1, 35)
               + (1 − 𝐸(𝑤𝑟𝑒𝑑))(−1 − 𝐸(𝑤14)) 𝑓𝑟𝑒𝑑,14(1, −1)
               + (−1 − 𝐸(𝑤𝑟𝑒𝑑))(−1 − 𝐸(𝑤14)) 𝑓𝑟𝑒𝑑,14(−1, −1)   (5.29)
               ≈ 0.999   (5.30)

where 𝐸(𝑤𝑟𝑒𝑑) ≈ 𝐸(𝑤14) ≈ −0.027 and the joint PDF values are 1/37, 17/37 and 19/37 respectively.
That is, the returns from a bet on red and a bet on 14 are positively related.

As with the variance, we can derive an alternative formula for the covariance:
𝑐𝑜𝑣(𝑥, 𝑦) = 𝐸(𝑥𝑦) − 𝐸(𝑥)𝐸(𝑦)
The derivation of this result is as follows:
𝑐𝑜𝑣(𝑥, 𝑦) = 𝐸((𝑥 − 𝐸(𝑥))(𝑦 − 𝐸(𝑦)))   (5.31)
          = 𝐸(𝑥𝑦 − 𝑦𝐸(𝑥) − 𝑥𝐸(𝑦) + 𝐸(𝑥)𝐸(𝑦))   (5.32)
          = 𝐸(𝑥𝑦) − 𝐸(𝑦)𝐸(𝑥) − 𝐸(𝑥)𝐸(𝑦) + 𝐸(𝑥)𝐸(𝑦)   (5.33)
          = 𝐸(𝑥𝑦) − 𝐸(𝑥)𝐸(𝑦)   (5.34)
Again, this formula is often easier to calculate than using the original definition.
Example 5.9. Another way to calculate the covariance
The expected value of 𝑤𝑟𝑒𝑑 𝑤14 is:
𝐸(𝑤𝑟𝑒𝑑𝑤14) = 1 ∗ 35 ∗ 𝑓𝑟𝑒𝑑,14(1, 35) + 1 ∗ (−1) ∗ 𝑓𝑟𝑒𝑑,14(1, −1) + (−1) ∗ (−1) ∗ 𝑓𝑟𝑒𝑑,14(−1, −1)   (5.35)
           = 35/37 − 17/37 + 19/37   (5.36)
           = 1   (5.37)

So the covariance is:

𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14) = 𝐸(𝑤𝑟𝑒𝑑𝑤14) − 𝐸(𝑤𝑟𝑒𝑑)𝐸(𝑤14)   (5.38)
               = 1 − (−0.027) ∗ (−0.027)   (5.39)
               ≈ 0.999   (5.40)
which is the same result as we calculated earlier.

The key to understanding the covariance is that it is the expected value of a


product (𝑥 − 𝐸(𝑥))(𝑦 − 𝐸(𝑦)), and the expected value itself is just a sum.
The first implication of this insight is that the order does not matter:
𝑐𝑜𝑣(𝑥, 𝑦) = 𝑐𝑜𝑣(𝑦, 𝑥)
since 𝑥 ∗ 𝑦 = 𝑦 ∗ 𝑥.
A second implication is that the variance is just the covariance of a random
variable with itself:
𝑐𝑜𝑣(𝑥, 𝑥) = 𝑣𝑎𝑟(𝑥)
since 𝑥 ∗ 𝑥 = 𝑥².
Next, we can use our previous results on the expected value of a linear function
of one or more random variables to derive some results on the covariance:

• Covariances pass through sums:

𝑐𝑜𝑣(𝑥, 𝑦 + 𝑧) = 𝑐𝑜𝑣(𝑥, 𝑦) + 𝑐𝑜𝑣(𝑥, 𝑧)

• Constants can be taken out of covariances:

𝑐𝑜𝑣(𝑥, 𝑎 + 𝑏𝑦) = 𝑏 𝑐𝑜𝑣(𝑥, 𝑦)

These results can be combined in various ways, for example:

𝑣𝑎𝑟(𝑥 + 𝑦) = 𝑐𝑜𝑣(𝑥 + 𝑦, 𝑥 + 𝑦) (5.41)


= 𝑐𝑜𝑣(𝑥 + 𝑦, 𝑥) + 𝑐𝑜𝑣(𝑥 + 𝑦, 𝑦) (5.42)
= 𝑐𝑜𝑣(𝑥, 𝑥) + 𝑐𝑜𝑣(𝑦, 𝑥) + 𝑐𝑜𝑣(𝑥, 𝑦) + 𝑐𝑜𝑣(𝑦, 𝑦) (5.43)
= 𝑣𝑎𝑟(𝑥) + 2 𝑐𝑜𝑣(𝑥, 𝑦) + 𝑣𝑎𝑟(𝑦)   (5.44)

I do not expect you to remember all of these formulas, but be prepared to see me use them.
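For example, the variance of the combined payout from betting $1 on red and $1 on 14 in the same game is 𝑣𝑎𝑟(𝑤𝑟𝑒𝑑 + 𝑤14) = 𝑣𝑎𝑟(𝑤𝑟𝑒𝑑) + 2𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14) + 𝑣𝑎𝑟(𝑤14) ≈ 1.0 + 2 ∗ 0.999 + 34.1 ≈ 37.1, using the values calculated earlier in this chapter.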

5.4.6 Correlation

The correlation coefficient of two random variables 𝑥 and 𝑦 is defined as:

𝜌𝑥𝑦 = 𝑐𝑜𝑟𝑟(𝑥, 𝑦) = 𝑐𝑜𝑣(𝑥, 𝑦)/√(𝑣𝑎𝑟(𝑥)𝑣𝑎𝑟(𝑦)) = 𝜎𝑥𝑦/(𝜎𝑥𝜎𝑦)

Like the covariance, the correlation describes the strength of a (linear) relation-
ship between 𝑥 and 𝑦. But it is re-scaled in a way that makes it more convenient
for some purposes.
Example 5.10. Correlation in roulette
The correlation of 𝑤𝑟𝑒𝑑 and 𝑤14 is:

𝑐𝑜𝑟𝑟(𝑤𝑟𝑒𝑑, 𝑤14) = 𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14)/√(𝑣𝑎𝑟(𝑤𝑟𝑒𝑑) ∗ 𝑣𝑎𝑟(𝑤14))   (5.46)
                ≈ 0.999/√(1.0 ∗ 34.1)   (5.47)
                ≈ 0.17   (5.48)

The covariance and correlation always have the same sign, since standard deviations are always positive.¹ The key difference between them is that correlation is scale-invariant. That is:

¹ More precisely, either or both of 𝜎𝑥 and 𝜎𝑦 could be zero. In that case the covariance will also be zero, and the correlation will be undefined (zero divided by zero).

• It always lies between -1 and 1.


• It is unchanged by any re-scaling or change in units. That is, for any
positive constants 𝑎 and 𝑏:

𝑐𝑜𝑟𝑟(𝑎𝑥, 𝑏𝑦) = 𝑐𝑜𝑟𝑟(𝑥, 𝑦)

When 𝑐𝑜𝑟𝑟(𝑥, 𝑦) ∈ {−1, 1}, 𝑦 is an exact linear function of 𝑥. That is, we can write:
𝑦 = 𝑎 + 𝑏𝑥
where 𝑐𝑜𝑟𝑟(𝑥, 𝑦) = 1 if 𝑏 > 0, and 𝑐𝑜𝑟𝑟(𝑥, 𝑦) = −1 if 𝑏 < 0.   (5.49)

5.4.7 Independence

We say that 𝑥 and 𝑦 are independent if every event defined in terms of 𝑥 is independent of every event defined in terms of 𝑦:

Pr(𝑥 ∈ 𝐴 ∩ 𝑦 ∈ 𝐵) = Pr(𝑥 ∈ 𝐴) Pr(𝑦 ∈ 𝐵)

As before, independence of 𝑥 and 𝑦 implies that the conditional distribution is the same as the marginal distribution:

Pr(𝑥 ∈ 𝐴|𝑦 ∈ 𝐵) = Pr(𝑥 ∈ 𝐴)

Pr(𝑦 ∈ 𝐴|𝑥 ∈ 𝐵) = Pr(𝑦 ∈ 𝐴)


The first graph in the figure in Example 5.5 shows what independent random variables look like
in data: a cloud of unrelated points.
Independence also means that the joint and conditional PDF/CDF can be de-
rived from the marginal PDF/CDF:

𝑓𝑥,𝑦 (𝑎, 𝑏) = 𝑓𝑥 (𝑎)𝑓𝑦 (𝑏)

𝑓𝑦|𝑥 (𝑎, 𝑏) = 𝑓𝑦 (𝑎)

𝐹𝑥,𝑦 (𝑎, 𝑏) = 𝐹𝑥 (𝑎)𝐹𝑦 (𝑏)

𝐹𝑦|𝑥 (𝑎, 𝑏) = 𝐹𝑦 (𝑎)


As with independence of events, this will be very handy in simplifying the
analysis. But remember: independence is an assumption that we can only make
when it’s reasonable to do so.

Example 5.11. Independence in roulette


The winnings from a bet on red (𝑤𝑟𝑒𝑑 ) and the winnings from a bet on 14 (𝑤14 )
in the same game are not independent.
However the winnings from a bet on red and a bet on 14 in two different games
are independent since the underlying outcomes are independent.

When random variables are independent, their covariance and correlation are
both exactly zero. However, it does not go the other way around.
Example 5.12. Zero covariance does not imply independence
The figure below shows a scatter plot from a simulation of two random variables
that are clearly related (and therefore not independent) but whose covariance
is exactly zero.
Intuitively, covariance is a measure of the linear relationship between two vari-
ables. When variables have a nonlinear relationship as in this figure, the covariance may miss it.

[Figure: Scatter plot of simulated (𝑥, 𝑦) pairs with a clear nonlinear relationship but zero covariance.]

Chapter review
Over the course of this chapter and the previous one, we have learned the
basic terminology and tools for working with random variables. These are the
two most difficult chapters in the course, but if you work hard and develop your
understanding of random variables you will find the rest of the course somewhat
easier.
This is not a course on probability, so the next few chapters will be about data
and statistics. We will first learn to use Excel to calculate common statistics
from a cleaned data set. We will then use the tools of probability and random
variables to build a theoretical framework in which we can interpret each statistic
as a random variable and each data set as a collection of random variables. This
theory will allow us to use statistics not only as a way of describing data, but
as a way of understanding the process that produced that data.

Practice problems
Answers can be found in the appendix.
Questions 1-7 below continue our craps example. To review that example, we
have:

• An outcome (𝑟, 𝑤) where 𝑟 and 𝑤 are the numbers rolled on a pair of fair
six-sided dice.
• Several random variables defined in terms of that outcome:
– The total showing on the pair of dice: 𝑡 = 𝑟 + 𝑤
– An indicator for whether a bet on “Yo” wins: 𝑦 = 𝐼(𝑡 = 11).

In addition, let 𝑏 = 𝐼(𝑡 = 12) be an indicator of whether a bet on “Boxcars”


wins. Since it is an indicator variable 𝑏 has the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) distribution with
𝑝 = 1/36, so it has mean
𝐸(𝑏) = 𝑝 = 1/36
and variance:
𝑣𝑎𝑟(𝑏) = 𝑝(1 − 𝑝) = 1/36 ∗ 35/36 ≈ 0.027

SKILL #1: Derive a joint distribution from the underlying outcome

1. Let 𝑓𝑦,𝑏 (⋅) be the joint PDF of 𝑦 and 𝑏.


a. Find 𝑓𝑦,𝑏 (1, 1)
b. Find 𝑓𝑦,𝑏 (0, 1)
c. Find 𝑓𝑦,𝑏 (1, 0)
d. Find 𝑓𝑦,𝑏 (0, 0)

SKILL #2: Derive a marginal distribution from a joint distribution

2. Let 𝑓𝑏 (⋅) be the marginal PDF of 𝑏



a. Find 𝑓𝑏 (0) based on the joint PDF 𝑓𝑦,𝑏 (⋅).


b. Find 𝑓𝑏 (1) based on the joint PDF 𝑓𝑦,𝑏 (⋅).
c. Find 𝐸(𝑏) based on this marginal PDF you found in parts (a) and
(b).

SKILL #3: Derive a conditional distribution from a joint distribution

3. Let 𝑓𝑦|𝑏 (1, 1) be the conditional PDF of 𝑦 given 𝑏.


a. Find 𝑓𝑦|𝑏 (1, 1).
b. Find 𝑓𝑦|𝑏 (0, 1).
c. Find 𝑓𝑦|𝑏 (1, 0).
d. Find 𝑓𝑦|𝑏 (0, 0).

SKILL #4: Identify whether two random variables are independent

4. Which of the following pairs of random variables are independent?


• 𝑦 and 𝑡
• 𝑦 and 𝑏
• 𝑟 and 𝑤
• 𝑟 and 𝑦

SKILL #5: Find covariance from a PDF

5. Find 𝑐𝑜𝑣(𝑦, 𝑏) using the joint PDF 𝑓𝑦,𝑏 (⋅).

SKILL #6: Find covariance using the alternate formula

6. Find the following covariances using the alternate formula:


a. Find 𝐸(𝑦𝑏) using the joint PDF 𝑓𝑦,𝑏 (⋅).
b. Find 𝑐𝑜𝑣(𝑦, 𝑏) using your result in (a).
c. Is your answer in (b) the same as your answer to question 5 above?

SKILL #7: Find correlation from covariance

7. Find 𝑐𝑜𝑟𝑟(𝑦, 𝑏) using the results you found earlier.


8. We can find the correlation from the covariance, but we can also find the
covariance from the correlation. For example, suppose we already know
that
𝐸(𝑡) = 7 (5.50)
𝑣𝑎𝑟(𝑡) ≈ 5.83 (5.51)
𝑐𝑜𝑟𝑟(𝑏, 𝑡) ≈ 0.35 (5.52)
Using this information and the values of 𝐸(𝑏) and 𝑣𝑎𝑟(𝑏) provided above:

a. Find 𝑐𝑜𝑣(𝑏, 𝑡).


b. Find 𝐸(𝑏𝑡).

SKILL #8: Find correlation and covariance for independent random


variables

9. We earlier found that 𝑟 and 𝑤 were independent. Using this information:


a. Find 𝑐𝑜𝑣(𝑟, 𝑤).
b. Find 𝑐𝑜𝑟𝑟(𝑟, 𝑤).

SKILL #9: Work with linear functions of random variables

10. Your net winnings if you bet $1 on Yo and $1 on Boxcars can be written
16𝑦 + 31𝑏 − 2. Find the following expected values:
a. Find 𝐸(𝑦 + 𝑏)
b. Find 𝐸(16𝑦 + 31𝑏 − 2)
11. Find the following variances and covariances:
a. Find 𝑐𝑜𝑣(16𝑦, 31𝑏)
b. Find 𝑣𝑎𝑟(𝑦 + 𝑏)

SKILL #10: Interpret covariances and correlations

12. Based on your results, which of the following statements is correct?


• The result of a bet on Boxcars is positively related to the result of a
bet on Yo.
• The result of a bet on Boxcars is negatively related to the result of a
bet on Yo.
• The result of a bet on Boxcars is not related to the result of a bet on
Yo.

SKILL #11: Use the properties of the uniform distribution

13. Suppose that 𝑥 ∼ 𝑈 (−1, 1).


a. Find the PDF 𝑓𝑥 (⋅) of 𝑥
b. Find the CDF 𝐹𝑥 (⋅) of 𝑥
c. Find Pr(𝑥 = 0)
d. Find Pr(0 < 𝑥 < 0.5)
e. Find Pr(0 ≤ 𝑥 ≤ 0.5)
f. Find the median of 𝑥.
g. Find the 75th percentile of 𝑥.
h. Find 𝐸(𝑥)

i. Find 𝑣𝑎𝑟(𝑥)

SKILL #12: Work with linear functions of uniform random variables

14. Suppose that 𝑥 ∼ 𝑈 (−1, 1), and let 𝑦 = 3𝑥 + 5.


a. What is the probability distribution of 𝑦?
b. Find 𝐸(𝑦).

SKILL #13: Use the properties of the normal distribution

15. Suppose that 𝑥 ∼ 𝑁 (10, 4).


a. Find 𝐸(𝑥).
b. Find the median of 𝑥.
c. Find 𝑣𝑎𝑟(𝑥).
d. Find 𝑠𝑑(𝑥).
e. Use Excel to find Pr(𝑥 ≤ 11).

SKILL #14: Work with linear functions of a normal random variable

16. Suppose that 𝑥 ∼ 𝑁 (10, 4).


a. Find the distribution of 𝑦 = 3𝑥 + 5.
b. Find a random variable 𝑧 that is a linear function of 𝑥 and has the
standard normal distribution
c. Find an expression for Pr(𝑥 ≤ 11) in terms of the standard normal
CDF Φ(⋅).
d. Use the Excel function NORM.S.DIST and the previous result to find the value of Pr(𝑥 ≤ 11).
Chapter 6

Basic data analysis with Excel

In a previous chapter, we learned how to clean a simple data set. The next step
is to learn how to analyze it. In this chapter, we will use Excel to construct
univariate statistics and charts, i.e., statistics and charts that describe a single
variable. Later in the term, we will learn multivariate methods that describe
the relationship between two or more variables.

Goals
Chapter goals
In this chapter we will learn how to:

• Calculate and interpret the main univariate summary statistics in


Excel
• Construct and interpret frequency tables in Excel
• Construct and interpret bar graphs, histograms and time series
graphs in Excel

In the next chapter, we will use the tools of probability and random variables
to understand these statistics more deeply.

6.1 Exploratory data analysis

Our emphasis in this chapter, and in much of this course, will be on performing exploratory data analysis. Exploratory data analysis is the first step in any data analysis project: we use simple statistics and graphs to identify and understand patterns in the data. Our knowledge of these patterns can then inform our subsequent formal or model-based analysis.
The audience for exploratory data analysis is the analyst themselves. But we
will often be interested in presenting the patterns we have discovered to another
audience: your teacher, your boss, or your client. So we will also discuss how
to effectively present your results.

6.1.1 Looking at the data


The first step in any exploratory data analysis is to literally look at the data.
Example 6.1. Historical employment data for Canada
Our analysis in this chapter will use historical employment data for
Canada from January 1976 through January 2021. Download the file at
https://bookdown.org/bkrauth/BOOK/sampledata/CanEmpHist.xlsx, make a working copy, and open it.

Our main data set for analysis is the worksheet Data for Analysis, which includes
the following variables:

• MonthYr: the month and year of the observation.


• Population: the civilian, non-institutionalized working-age population of
Canada at that time, in thousands.
• Employed: the total number employed in the population, in thousands.
• Unemployed: the total number unemployed in the population, in thousands.
• LabourForce: the sum of Employed and Unemployed.
• NotInLabourForce: the difference between Population and Labour-
Force.

• UnempRate: the percentage of the labour force that is unemployed. As


before, it is calculated and stored as a decimal (ranging from 0.0 to 1.0)
but displayed as a percentage (ranging from 0% to 100%).
• LFPRate: the percentage of the population that is in the labour force.
As before, it is calculated and stored as a decimal (ranging from 0.0 to
1.0) but displayed as a percentage (ranging from 0% to 100%).
• Party: the political party in control of the Federal government. If the
party in control changed during the month, it is listed as “Transfer”.
• PrimeMinister: the name of the Prime Minister. If the prime minister
changed during the month, it is listed as “Transfer”.
• AnnPopGrowth: the rate of population growth over the previous 12
months, calculated as a proportion and displayed as a percentage. Note
that this variable is blank for the first 12 months of the data set.

In addition, there is a worksheet titled Raw data that contains the original
data as obtained from Statistics Canada. Source information is also in that
worksheet.

Our historical employment data set covers more than 500 months. Other data
sets are often much larger: large surveys from Statistics Canada can have hundreds of thousands of observations and hundreds of variables, and companies
and governments often work with transactions-level data that includes millions
of observations.
As humans, our brains are not large enough to fully understand a large data
set without some kind of simplification or “dimension reduction”: instead of
looking at millions of numbers and trying to identify patterns from that, we
calculate and view a relatively small number of statistics based on the data.
A statistic is just a number calculated from data.

6.2 Univariate statistics in Excel


We will start by calculating some commonly-used univariate statistics, which
are statistics that describe a single variable in isolation. In Section 6.3.1 we will
construct some commonly-used univariate graphs.
Multivariate statistics describe the relationships among multiple variables, and
will be covered in Chapter 12.

6.2.1 Summary statistics

A table of summary statistics reports various univariate statistics for each


variable in our data set. For example, our table of summary statistics might
look like this:

Statistic (variable name 1) (variable name 2) (etc.)


Count (count of valid observations)
Average (average)
StdDev (standard deviation)
Min (minimum value observed)
10th percentile
Median
90th percentile
Max (maximum value observed)

You saw many of these words - standard deviation, percentile, median, etc. -
in Chapter 4, but I need to be clear on something: even though the names are
the same, the concepts are not exactly the same.

1. The count or sample size is the number of observations with valid (nu-
meric) values for the variable.
2. The sample average is a measure of central tendency in data, and is
calculated:
𝑥̄ = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖

3. The sample standard deviation is a measure of variation in the data


and is calculated:
𝑠𝑥 = √((1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝑥̄)²)

4. The sample minimum is the lowest value observed in the data


5. The sample 10th percentile is a number that is above 10% of observations
and below the other 90% of observations.
6. The sample median is a number that is above half of observations and
below the other half.
7. The sample 90th percentile is a number that is above 90% of observations and below the other 10% of observations.
8. The sample maximum is the highest value observed in the data.

Notice the difference here: the sample average/median/percentiles/etc. describe the values observed in a set of data, and are calculated from that data set. The mean/median/percentiles we learned about in Chapter 4 describe the probability distribution of a random variable, and are calculated from that probability distribution.
As you might imagine, while these two sets of concepts are distinct, they are
related. We will discuss how they are related in Chapter 7. For now, just keep
in mind that they are distinct.

6.2.1.1 Constructing the table

We are going to create a nice table of summary statistics for some of our vari-
ables.

Example 6.2. Table setup


To set up the table:

1. Create a new blank worksheet

• You can do this by clicking on the button at the bottom of


the screen.
2. Rename the worksheet Summary Statistics
• You can do this by double-clicking on the tab for your new worksheet.
A dialog box will open for you to enter the new name.
3. Fill in the first column (cells A1:A9) as in the table above.

The next step is to fill in the row of variable names. We could type them in,
but let’s do something more sophisticated and flexible: use a formula to pull
the variable names in from the original data set.

Example 6.3. Getting variable names from the data


To fill in the variable name for column B:

1. Go to cell B1.
2. Type = but don’t hit <enter> yet.
3. Select the tab for the Data for Analysis worksheet, and then select cell G1
in the Data for Analysis worksheet.
• The formula bar now says ='Data for Analysis'!G1
4. Select <enter>.
• You should now be back in cell B1 in the Summary statistics work-
sheet.
• Cell B1 should display UnempRate (the contents of cell G1 in Data
for Analysis).

What have we done here? We have constructed a formula that successfully


references a cell in another worksheet. Note that the relative references are
even copied over correctly.

Excel has many built-in functions to calculate summary statistics.



Example 6.4. Summary statistics for UnempRate


To fill in the statistics for column B:

1. Use the COUNT() function to report the observation count in cell B2.
• We will want to use an absolute reference for the rows, and a relative
reference for the columns, so the formula should be =COUNT('Data
for Analysis'!G$2:G$542).
2. Use the AVERAGE() function to report the average unemployment rate in
cell B3
3. Use the STDEV.S() function to report the standard deviation of the un-
employment rate in cell B4
• There is another built-in Excel function called STDEV.P() that uses
a slightly different formula for the standard deviation:

$$s^p_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

We will discuss the difference between these two statistics later.


4. Use the MIN() function to report the minimum unemployment rate in cell
B5.
5. Use the PERCENTILE.INC() function to report the 10th percentile of the
unemployment rate in cell B6.
• Warning: Despite its name, the PERCENTILE.INC() function takes a
quantile (between 0 and 1) rather than a percentile (between 0 and
100) as its argument.
6. Use the MEDIAN() function to report the median unemployment rate in
cell B7
7. Use the PERCENTILE.INC() function to report the 90th percentile of the
unemployment rate in cell B8.
8. Use the MAX() function to report the maximum unemployment rate in cell
B9.
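
Putting these steps together, the formulas in cells B2:B9 might look like this
(a sketch, using absolute row references throughout as in step 1):

    B2: =COUNT('Data for Analysis'!G$2:G$542)
    B3: =AVERAGE('Data for Analysis'!G$2:G$542)
    B4: =STDEV.S('Data for Analysis'!G$2:G$542)
    B5: =MIN('Data for Analysis'!G$2:G$542)
    B6: =PERCENTILE.INC('Data for Analysis'!G$2:G$542,0.1)
    B7: =MEDIAN('Data for Analysis'!G$2:G$542)
    B8: =PERCENTILE.INC('Data for Analysis'!G$2:G$542,0.9)
    B9: =MAX('Data for Analysis'!G$2:G$542)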

We now have a table that reports all of the major summary statistics calculated
for the unemployment rate.

We would also like to calculate summary statistics for other variables. Fortu-
nately, we have set up our table in a way that makes that easy.
Example 6.5. Summary statistics for other variables
To fill in summary statistics for other variables:

1. Copy the contents of cells B1:B9 to cells C1:F9.



Now column C contains summary statistics for the labour force participation
rate (LFPRate), and columns D through F contain summary statistics for some
of the other variables in our data set.

If we wanted to, we could set up the table to calculate summary statistics for
all of the variables. But let’s just stop with these, and move on to making the
table look a little nicer.

6.2.1.2 Cleaning up the table

Our table now has all of the information we need, but it is still kind of ugly.
Let’s make it look nice and presentable.

Example 6.6. Cell size


The first problem is that the columns may be too narrow or too wide. If a
column is too narrow, some of its values will display as “####”. If it is too
wide, it will take up too much room.
There are many different ways to adjust row heights and column widths, but
here is the simplest:

1. Select the whole sheet by clicking on the Select All button in its upper left
corner.
2. Select Home > Format > AutoFit Column Width from the menu.

Not only will this make everything fit, it will automatically adjust the width as
necessary when anything changes.

Example 6.7. Error codes


The second problem is that the non-numeric variables report error codes like
#DIV/0! or #NUM! for several statistics, and nonsense values for others.
It is usually better to leave something blank than have it display meaningless,
confusing or incorrect information. So let’s just delete those columns:

1. Select columns D and E.


2. Select Home > Delete > Delete Sheet Columns.

Example 6.8. Display formatting


To make the display formatting a little nicer.

1. Put the top row in bold.


2. Put the first column in italics.

3. Adjust the number display formats to look nice. Remember that the
number display format has no effect on the number itself.
• Leave the counts as they are.
• Display the unemployment rate, LFP rate and population growth
rate in percentages, rounded to one decimal place.
4. Feel free to play around with colors, fonts, etc. to get a table that you
like.

We now have a nice table of summary statistics that we could put into a Word
or PowerPoint document and share with an audience.

6.2.2 Simple frequency tables

Another way of looking at a single variable is to construct a frequency table.


A frequency table describes the distribution of values for a single variable.
Frequency tables are at their simplest and most useful when the underlying
variable is discrete, categorical, or even non-numeric. The frequency table for
this kind of variable looks like this:

variablename Count Percentage


(value1) (# value1) (% value1)
(value2) (# value2) (% value2)
(value3) (# value3) (% value3)
etc.

where:

• The (variablename) column lists all possible values of the variable.


• The Count column reports the number of observations in which the vari-
able matches that value or range of values.
• The Percentage column reports the count as a percentage of the total
number of observations.

The Excel functions COUNTIF() and COUNTIFS() can be used to construct the
counts. I will use COUNTIFS() which has a pair of arguments:

• The first argument criteria_range gives the range containing the data
we want to describe.
• The second argument criteria gives the criteria we want to match.

The kind of criteria you can use are most easily described by examples:

• The formula =COUNTIFS(A1:A5,"Hello") returns a count of the number


of cells in the range A1:A5 that contain the string “Hello”.

• The formula =COUNTIFS(A1:A5,5) returns a count of the number of cells
in the range A1:A5 that contain the number 5.
• The formula =COUNTIFS(A1:A5,">0") returns a count of the number of
cells in the range A1:A5 that contain a number greater than zero.
• The formula =COUNTIFS(A1:A5,B1) returns a count of the number of cells
in the range A1:A5 that satisfy the criterion given in cell B1.

Example 6.9. A simple frequency table


Let’s create a frequency table for the Party variable in our data set. It should
look like this:

Party Count Percentage


Liberal (# Liberal) (% Liberal)
Conservative (# Conservative) (% Conservative)
NDP (# NDP) (% NDP)
Transfer (# Transfer) (% Transfer)

Note that I have included a political party (the NDP) that is not observed in
our data set. We start by setting up the table:

1. Create a new sheet named “Party control”


2. Fill in the top row and first column as in the table above.

Next we will fill in values for the Count column (B):

1. Fill in cell B2 using the Excel function COUNTIFS().


• The criteria_range will be the data in our main data set (Data for
Analysis worksheet) covering the Party variable (cell range I2:I542),
or 'Data for Analysis'!I2:I542.

• The criteria we want to match is the contents of the cell in the


current row of column A, or A2.
• So the formula to enter in cell B2 is =COUNTIFS('Data for
Analysis'!I2:I542,A2).
• The result displayed should be 302, the number of observations in
our data that have the Party identified as Liberal.
2. We will want to copy the formula in cell B2 to the other cells in column
B. But before we do that, we need to make some of the relative refer-
ences into absolute references. In this case, our table will keep the same
criteria_range but will want to change the criteria, so:
• Change the formula in cell B2 to =COUNTIFS('Data for
Analysis'!I$2:I$542,A2)
3. Copy cell B2 to cells B3:B5.
• Check the formulas in those cells to make sure they look right.
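
They should look something like this (a sketch):

    B2: =COUNTIFS('Data for Analysis'!I$2:I$542,A2)
    B3: =COUNTIFS('Data for Analysis'!I$2:I$542,A3)
    B4: =COUNTIFS('Data for Analysis'!I$2:I$542,A4)
    B5: =COUNTIFS('Data for Analysis'!I$2:I$542,A5)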

Next, we need to fill in the Percentage column (C).

1. To do that we just need to divide each cell in the Count column by the
sum of all the cells in that column. So the formula in cell C2 should be
=B2/SUM(B2:B5)
2. Change relative references to absolute references as needed.
3. Copy the formula in cell C2 to cells C3:C5.
4. By default, the percentages are displayed as proportions. Change the
display format to percentage, with one decimal place.
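
After steps 2 and 3, the Percentage formulas might look like this (a sketch):

    C2: =B2/SUM(B$2:B$5)
    C3: =B3/SUM(B$2:B$5)
    C4: =B4/SUM(B$2:B$5)
    C5: =B5/SUM(B$2:B$5)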

We now have a simple frequency table for the Party variable.

6.2.3 Binned frequency tables

We can also construct frequency tables for continuous variables or discrete vari-
ables with many possible values, but doing so is a little more complicated. We
cannot just construct a table with a row for each possible value, since there are
many possible values.
Instead, we divide the data’s range of possible or observed values into a set of
sub-ranges or bins. Then we can calculate and report the number or percentage

of observations that fall within each bin. A binned frequency table looks like
this:

From To Count Percentage


(from1) (to1)
(from2) (to2)
etc.

In constructing bins, we need to apply some good judgment, and keep in mind
a few requirements and considerations:

• We need the bins to cover the full range of the data. In particular:
– The lower bound of the lowest bin should be lower than the lowest
value in the data.
– The upper bound of the highest bin should be higher than the highest
value in the data.
– Each bin’s upper bound should be the lower bound of the next bin.
– Boundaries should be addressed in a consistent manner, so that each
observation falls into exactly one bin.
• We often want the bins to be equally sized.
– But that isn’t always the case. See the unemployment rate table in
the example below; if it used equally sized bins, most of the bins
would be empty.
• We often want the upper and lower bounds of the bins to be nice round
numbers.
• The number of bins is a matter for judgment, and depends on what kind
of patterns we are aiming to find in the data.
– Too many bins and we miss broad patterns.
– Too few bins and we miss potentially interesting details.
– The solution is to explore multiple options, and see what patterns
you can find.

Again, we will use COUNTIFS() to construct the counts. But we will need to take
advantage of a feature of COUNTIFS() I have not yet mentioned: it takes mul-
tiple criteria_range and criteria arguments, allowing it to make multiple
comparisons (that’s the difference between COUNTIF() and COUNTIFS()).
Example 6.10. A binned frequency table
Let’s create the following table for the unemployment rate variable:

From To Count Percentage


0.0% 5.0%
5.0% 7.5%
7.5% 10.0%
10.0% 15.0%
15.0% 100.0%

where Count is the number of observations for UnempRate that are greater
than that row’s From value and less than or equal to the row’s To value, and
Percentage is the count as a percentage of the total. We start by setting up the
table:

1. Create a new sheet titled Unemployment frequency


2. Copy the first row and first two columns from the table above.
• Enter the From and To values as decimals (0 to 1) and display them
as percentages (0% to 100%).

Next, we fill in the first count in cell C2. This will be a somewhat complex
formula, so we will build it in stages:

1. Let’s start by counting the observations with UnempRate greater than zero:
• The criteria_range1 value should be the data range for the Unem-
pRate variable, i.e, 'Data for Analysis'!G2:G542.
• The criteria1 value should be the string “>0”.
• So the formula should be =COUNTIFS('Data for Analysis'!G2:G542,
">0")
2. Next, let’s add the criterion that UnempRate is less than or equal to 5%.:
• The criteria_range2 value should be the same as criteria_range1.
• The criteria2 value should be the string “<=0.05”.
• So the formula should be =COUNTIFS('Data for Analysis'!G2:G542,
">0", 'Data for Analysis'!G2:G542, "<=0.05")
3. Finally, let’s have the formula retrieve the criteria1 and criteria2
values from cells A2 and B2.
• This enhancement is not strictly required but it has two big advan-
tages:
(1) we will be able to copy this formula into other cells
(2) we can change the bins in columns A and B, and the calculation
will automatically adjust.
• We can use the CONCAT() function to construct the criteria:

– CONCAT(">",A2) will return “>0”


– CONCAT("<=",B2) will return “<=0.05”
• So the formula is: =COUNTIFS('Data for Analysis'!G2:G542,CONCAT(">",A2),'Data for Analysis'!G2:G542,CONCAT("<=",B2))

Cell C2 should now display the number of months (zero) in which the Canadian
unemployment rate was between 0% and 5%. To finish up the table:

1. Update the formula in cell C2 to use absolute references where appropriate.


2. Copy the contents of cell C2 to cells C3:C6.
3. Fill in the Percentage column (D) with the appropriate formula and
change its display format to Percent.
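
Putting everything together, the formulas in the first data row might look like
this (a sketch; the Percentage column divides each count by the total count):

    C2: =COUNTIFS('Data for Analysis'!G$2:G$542,CONCAT(">",A2),'Data for Analysis'!G$2:G$542,CONCAT("<=",B2))
    D2: =C2/SUM(C$2:C$6)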

We now have our binned frequency table.

FYI
The FREQUENCY() function is another way to create a frequency table.
However, this function is a tricky one to learn, and Microsoft recently
changed how it works. So we will skip it.

6.3 Univariate graphs in Excel


Another way to explore our data is by visualizing it: constructing and viewing
graphs. In this section, we will construct and view three graphs that are useful
for understanding a single variable: the time series graph, the bar/column graph,
and the histogram. We will also discuss some basic principles for producing
presentation-quality graphs.

6.3.1 Charts in Excel

Excel calls graphs charts. Excel charts have three main components:

1. The data source. This is the table containing the data used to construct
the graph.

• The individual columns in the data source are called series.


2. The chart type. There are many chart types in Excel, but the ones we
will work on are:
• Line graphs have categories on the horizontal axis, and the value of
a variable/series on the vertical axis. Values are depicted by a line
connecting the values of the series at each category.
• Column charts have categories on the horizontal axis, and the value
of a variable or series on the vertical axis. Values are depicted by a
vertical bar for each value of the series.
• Bar charts are just column charts with the axes reversed.
• Scatter or XY charts plot the value of one series against the value
of another series. Values can be depicted as a set of points, as a
connected line, or both.
3. The chart elements, which are the different pieces of the chart
• Chart elements can be added, removed, or modified.
• The available elements vary across chart types, but can include axes,
titles, legends, gridlines, and text labels.

The usual workflow here is to select the data, choose a chart type, and then
modify chart elements until the chart looks the way you want it.

6.3.2 Time series (line) graphs

A time series graph plots one or more variables at multiple points in time.
Conventionally, time is on the horizontal axis and the variable’s value is on the
vertical axis. For example, we will generate a time series graph of the Canadian
unemployment rate.

In Excel, time series graphs can be implemented using either the Line chart
type or the Scatter chart type. Line graphs are simpler, so we will start with
that. We will learn how to make scatter plots in Chapter 11.
The first step in creating an Excel chart is to select the data source and graph
type.

Example 6.11. Creating a time series graph for UnempRate


We want to create a time series graph for our variable UnempRate.

1. Select any cell in the Data for analysis table.


2. Select Insert > Recommended Charts from the menu.
3. The dialog box will show a few recommended charts. None of them are
what we want, so select the All Charts tab at the top to see the full set
of options.
4. Select Line to see the line graph options.
5. The dialog box gives you several different types of line graph. Select the
basic line graph (the first option), and select OK.

We have our time series graph. Unfortunately, it has a line for every variable
in our data source. We only want to plot one variable, so let’s get rid of the
others:

1. Select Design > Select Data which will open the Select Data Source
dialog box.
2. Uncheck the check box next to every series in the “Legend Entries (Series)”
box except UnempRate.

This looks like the graph we want.



6.3.3 Creating presentation-quality graphs


The graph we have right now is perfectly fine for exploratory data analysis -
using graphs and other tools to better understand the data. However, if we want
to convey our understanding to others we need to fine-tune the presentation
of our data. Like other elements of a presentation, graphs should be clear,
informative, concise, and professional in appearance.
The field of data visualization was pioneered by the statistician Edward Tufte
and studies how best to visually convey information. A full discussion of data
visualization is beyond the scope of this course, but we can talk about and apply
a few basic ideas:

1. The information provided should be clear.


2. All necessary information should be provided.
3. Repeated, irrelevant or misleading elements (what Tufte calls “chartjunk”)
should be removed.
4. Related information should be visually “close” and required eye movement
should be minimized.
5. Color and formatting should be used to simplify interpretation rather than
to complicate it.
6. Accessibility should be taken into account. For example:
• Color is not always available to readers: Some readers are color
blind, projectors often get colors “wrong”, or your document may
get printed on a black-and-white printer.
– Always include non-color visual cues (for example dashed lines or
different points with varying shape) so that readers who cannot
distinguish color (for whatever reason) can tell different series
apart.
– Consider using a color palette that is colorblind-friendly.
∗ A simple way to do this is to avoid using red and green or
blue and yellow as contrasting colors, since red-green and
blue-yellow are the two most common forms of colorblindness.
∗ The geographer Cynthia Brewer has developed a set of
colorblind-friendly palettes known as ColorBrewer. They
are available at http://colorbrewer2.org/.
• Readers in their 50s and above often have difficulty reading small
text.
– Avoid squeezing important text into a tight space.
– Generate graphics that can be re-scaled.
• Readers may have more general and severe visual impairments.
– Provide alt text for all images.
Example 6.12. Preparing our graph for presentation
Let’s start by removing unnecessary chart elements.

1. Select the chart. This will cause the Chart Design and Format menu
items to appear in the menu.
2. To remove the horizontal gridlines, select Chart Design > Add Chart
Element > Gridlines > Primary Major Horizontal
3. To remove the legend, select Chart Design > Add Chart Element >
Legend > None

Next, let’s modify the title to be more informative. Right now, it is not clear
from the graph what country’s unemployment rate this is.

1. Select the title.


2. Type in the new title, and hit the <enter> key.
• I went with the title Canadian unemployment, 1976-present
• I also made it boldface.

Next, let’s modify the horizontal axis to be a little less busy.

1. Double-click on the horizontal axis. The Format Axis box should appear
to the right.
• It may take you more than one try to double-click on the correct
object.
2. Select Axis Options.
• You can see all of the choices Excel made here based on what it sees
in the data.
• Feel free to play around with these options. You can return each
option to its original/default state by clicking on that option’s Reset
button.
3. Change the major units to either 4 Years or 8 Years, whichever one you
like better.

Finally, let’s add a text label for our time series.

1. Select the chart.


2. Select Format > Insert Text Box from the menu. The cursor will
change into a text-box drawing cursor.
3. Draw the text box at the desired location on the chart and enter the text
“Unemployment rate”
4. Resize and move the box to just above the line for the time series.
5. Change the color of the text to the same color as the line.

This isn’t really necessary, and violates our principle of avoiding repeated infor-
mation. But it is useful when we are graphing multiple time series in the same
chart, as it is more direct than a legend.
Finally, we want to add alt text for the visually impaired.

1. Select the chart.


2. Select Format > Alt Text. The Alt Text dialog box will open to the
right.
3. Enter the alt text in the box.
• I used Graph of Canadian unemployment rate from January 1976 to
present.

The graph will now look like the one above. The graph here is visually clean
and simple, in part because I left out many elements that I could have included:
axis labels (not needed because the units are obvious from context), a fancy
background, etc.

One consideration that often comes up when constructing graphs is whether
the vertical axis should start at zero when the range of data does not reach
zero. In our time series graph of the unemployment rate, Excel included 0% by
default even though the variable never went below 5%. I could have changed
the vertical axis to start at 5%, but decided not to.
There is no strict rule for whether to include zero, but we can consider the
general principle described earlier: our graph should be informative and not
misleading.

FYI
For further reading
Data visualization skills are valuable in the academic world and in the
professional world. If you are interested in developing your skills fur-
ther, you might consider our course ECON 334: Data Visualization and
Economic Analysis.
You might also get a book on data visualization, either Kieran Healy’s
Data Visualization or Cole Nussbaumer Knaflic’s Storytelling with Data.
Healy’s book is aimed at a primarily academic audience while Knaflic’s
is aimed at a business audience. Both of them are practical and easy
reads, with many examples.

6.3.4 Frequency (bar/column) graphs


Bar graphs are used to represent the frequency distribution of a categorical
variable. Bar graphs use bars of different length to represent the value of some
aggregate variable in each category.

Bar graphs are produced from a frequency table, as they are a visual depiction
of the information in such a table.
They can be produced in Excel using either the Bar chart type or the Column
chart type. The difference between the two is that the bars are horizontal in
the Bar chart type and vertical in the Column chart type.
Example 6.13. A bar graph of the Party variable
To construct a bar graph of the Party variable:

1. Select any cell in the Party control worksheet.


2. Select Insert > Recommended Charts from the menu.
3. Select the All Charts tab to see the full list of options.
4. Select Column and then select OK.

Excel will create a basic column chart.

As you can see, basic bar graphs are quite simple. There is a bar for each
category (the first column of the table) and variable (the other columns), and
the length of each bar corresponds to its value.
As with our line graph, the current graph contains more information than we
actually want - it shows the count and the percentage, and we really only need
one of those.

1. Select the chart.


2. Select Chart Design > Select Data
3. Uncheck the check boxes for Percentage, NDP and Transfer.

We now have a bar graph that shows the number of months in which the two
largest federal parties were in government over the time frame of our data.

As with line graphs, we can prepare a bar graph for presentation by using Excel’s
tools with an eye towards the principles of effective presentation graphics.
Example 6.14. Cleaning up our bar graph
We can do a few easy things to simplify and clarify our bar graph:

1. Change the title.


2. Remove the gridlines.
3. Remove the legend.
4. Add the vertical axis title Months in control to clarify the units.

Another thing we can do is use color and branding to convey information: each
major Canadian political party has a distinctive color as part of its brand: red
for the Liberals, blue for the Conservatives, and orange for the NDP.

1. Double-click on the Liberal bar.


2. Change its color to red.
3. Double-click on the Conservative bar
4. Change its color to blue. The Conservatives use two shades of blue, so pick
whichever one you like best.

Finally, the purpose of a bar graph is to enable the viewer to compare magni-
tudes, which requires looking at the top of each bar. But you may notice that
the category labels are at the bottom of each bar. Following the principle that
we don’t want to make the reader’s eye do any extra work, let’s put those labels
on top.

1. Select the chart.


2. Select Chart Design > Add Chart Element > Data Labels > More
data label options.
You will notice the numbers 302 and 233 (the heights of the two bars)
appear above the bars, and the Format Data Labels box will appear to
the right.
3. Select Label options from the dialog box.
4. Uncheck Value and check Category Name. The category names Liberal
and Conservative will now appear above the respective bar.
5. Select Chart Design > Add Chart Element > Axes > Primary
Horizontal to eliminate the category names from the bottom of
each bar.
6. You might also consider changing the color of the two labels to match the
color of the corresponding bar.

We now have our final bar graph.

One of the keys to effective bar graph design is to keep the graph simple and
clean, and to avoid “chartjunk.”
First, bar graphs should always start at zero. The size of the bar is meant to
visually represent the value of the variable it is depicting. Using any origin other
than zero can cause relative sizes to be misleading.
Example 6.15. Why bar graphs should always start at zero
Suppose we start our Party bar graph at 200 rather than zero. The relative
size of the bars becomes very misleading: the bar for Liberal is now three
times as big as the bar for Conservative, even though the number it represents
is only 29% higher.

Pie graphs are an alternative way of depicting relative frequencies, and are
also available in Excel, along with 3-dimensional variations on both pie and
bar graphs (3D pie graphs and 3D bar graphs).

However, most data visualization experts recommend against their use. Research
on how people process visual information usually finds that these chart types
are less informative in practice. People are much better at evaluating the
relative size of two lines or rectangles than they are at evaluating the relative
size of two pie slices. They are also much better at assessing relative distances
in two dimensions than they are in three dimensions.

6.3.5 Histograms

A histogram is just a frequency bar/column graph for a continuous variable. It
is constructed by sorting the variable’s values into some number of equally-sized
bins and then plotting a bar for each bin that represents counts or percentages.
The histogram is a built-in chart type in Excel.
Example 6.16. Constructing a histogram for UnempRate
To construct a histogram of the UnempRate variable:

1. Select column G in the Data for Analysis worksheet.


• Notice that for the other charts we could just select any cell in this
worksheet. For some reason, this won’t work with the histogram
chart type.

2. Select Insert > Recommended Charts from the menu bar.


3. Select the All Charts tab to see the full set of options.
4. Select Histogram and then OK.

Our graph will appear:

There are two main choices in histogram design:

• How many bins and how should they be determined?


• Should we depict frequencies as counts or percentages?

Excel allows us to customize the bins in various ways. Unfortunately, Excel
histograms depict frequencies as counts, and do not easily allow for them to be
displayed as percentages. We will use R to generate those histograms.
Example 6.17. Changing the bins for a histogram
By default, Excel has chosen 12 equally-sized bins, and each covers 0.73 per-
centage points. This is a reasonable starting point, but we may find another
choice will produce a more informative histogram.
First, let’s try varying the number of bins.

1. Select or double-click on the horizontal axis to open the “Format Axis”
box.
2. Select Axis Options.
3. Select Number of bins and change the number from 12 to something else.
• Try a large number like 100 to see what a lot of small bins looks like.
• Try a small number like 5 to see what a few large bins look like.

Looking at these graphs, a relatively large number of bins is needed to accurately
convey the distribution of the unemployment rate. But using a lot of bins makes
the horizontal axis very busy and difficult to read. One alternative that will help
is to fix the bin size at some nice round number like 1% (0.01) or 0.5% (0.005).

1. Select or double-click on the horizontal axis to open the Format Axis box.

2. Select Axis Options.


3. Select Bin width and change the number to 0.01.

Now the bins are exactly one percentage point wide, which makes the horizontal
axis a little easier to read and interpret.
It would also be nice to modify the start and end points so that instead of the
bins being 5.4-6.4%, 6.4-7.4%, etc. they were 5-6%, 6-7%, etc. Unfortunately,
that does not seem to be possible in Excel. We can work around that limita-
tion by creating the frequency table, and making a column/bar chart of that
frequency table.

Finally, as with other graphs we will want to add, delete and modify various
elements to bring this histogram closer to presentation quality.
Example 6.18. Enhancing the quality of our histogram
To enhance the quality of this histogram:

1. Add/modify the title to Histogram of the Canadian unemployment
rate, 1976-present
2. Add the horizontal axis title Unemployment rate
3. Add the vertical axis title # of months

We now have our final histogram.

Chapter review
You can download the complete Excel file with all analysis from this chapter at
https://bookdown.org/bkrauth/BOOK/sampledata/CanEmpHistResults.xlsx
In this chapter, we learned how to calculate simple statistics in Excel. Calculat-
ing the statistics is typically the easiest part of a statistical analysis - cleaning
data and interpreting the results is much more challenging. So hopefully you
have found this chapter an easy break from the last few.
In the next chapter, we will bring this chapter together with the previously
developed theory of random events and random variables to interpret statistics
as random variables.

Practice problems
Answers can be found in the appendix.
SKILL #1: Calculate summary statistics by hand

1. Consider the following table of data

Name Age
Al 25
Betty 32
Carl 78

Use a calculator or pencil and paper to calculate the following statistics
(please skip parts c and d of this question):

a. Sample size.
b. Sample average age.
c. Sample median of age.

d. Sample 25th percentile of age.


e. Sample variance of age.
f. Sample standard deviation of age.

SKILL #2: Calculate summary statistics in Excel

2. Consider the following table of data

Name Age
Al 25
Betty 32
Carl 78

Enter this table in Excel, and use Excel functions to calculate the
following statistics:

a. Sample size.
b. Sample average age.
c. Sample median of age.
d. Sample 25th percentile of age.
e. Sample variance of age.
f. Sample standard deviation of age.

SKILL #3: Construct (simple and binned) frequency tables in Excel

3. Consider the following table of data

Name Age
Al 25
Betty 32
Carl 78

Use Excel to construct a binned frequency table of age, with bin widths
of 10 years.

SKILL #4: Construct time series (line) graphs in Excel

4. Consider the following table of data

Year Al’s age


2017 25
2018 26
2019 27
2020 28
2021 29

Use Excel to construct a time series graph of Al’s age.

SKILL #5: Construct bar graphs and histograms in Excel

5. Consider the following table of data

Name Age
Al 25
Betty 32
Carl 78

Use Excel to construct a histogram of age, with bin widths of 10 years.


Chapter 7

Statistics

In earlier chapters, we learned some techniques for using Excel to clean data
and to construct common statistics and charts. We also learned the basics of
probability theory, simple random variables, and more complex random vari-
ables.

Our next step is to bring these two sets of concepts together. In this chapter,
we will develop a framework for talking about data and the statistics calculated
from that data as a random process that can be described using the theory
of probability and random variables. We will also explore one of the most
important uses of statistics: to estimate, or guess at the value of, some unknown
parameter of the DGP.

Goals

Chapter goals
In this chapter we will:

• Model the random process generating a data set.


• Apply and interpret the assumption of simple random sampling,
and compare it to other sampling schemes.
• Use the theory of probability and random variables to model the
statistics we have calculated from a data set.
• Calculate and interpret the mean and variance of a statistic from
its sampling distribution.
• Calculate and interpret bias and mean squared error.
• Explain the law of large numbers, Slutsky’s theorem, and the idea
of a consistent estimator.


7.1 Data and the data generating process


Having invested in all of the probabilistic preliminaries, we can finally talk about
data. Suppose for the rest of this chapter that we have a data set called 𝐷𝑛 .
In this chapter, we will assume that 𝐷𝑛 = (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) is a data set with one
variable and 𝑛 observations. We use 𝑥𝑖 to refer to the value of our variable for
an arbitrary observation 𝑖.
In real-world analysis, data tends to be more complex:

• In most applications, it will be a simple table of numbers with 𝑛 observations
(rows) and 𝐾 variables (columns).
• However, it is occasionally something more abstract. For example, the
data set at https://www.kaggle.com/c/dogs-vs-cats is a big folder full of
dog and cat photos.
– A great deal of research in the statistical field of machine learning has
been focused on developing methods for determining if a particular
photo in this data set shows a dog or a cat.

Although our examples will all be based on simple data sets, many of our con-
cepts and results can be applied to more complex data.

Example 7.1. Data from 3 roulette games


Suppose we have a data set 𝐷3 providing the result of 𝑛 = 3 independent games
of roulette. Let 𝑏𝑖 be the outcome in game 𝑖, and let 𝑥𝑖 be the result of a bet
on red:
$$x_i = I(b_i \in RED) = \begin{cases} 1 & b_i \in RED \\ 0 & b_i \notin RED \end{cases}$$

Then 𝐷𝑛 = (𝑥1 , 𝑥2 , 𝑥3 ). For example, if red loses the first two games and wins
the third game we have 𝐷𝑛 = (0, 0, 1).

Our data set 𝐷𝑛 is a set of 𝑛 numbers, but we can also think of it as a set of 𝑛
random variables with unknown joint distribution 𝑃𝐷 . The distinction here is
a hard one for students to make, so give it some thought before proceeding.
The joint distribution of 𝐷𝑛 is called its data generating process or DGP.
The exact DGP is assumed to be unknown, but we usually have at least some
information about it.

Example 7.2. The DGP for the roulette data


The joint distribution of 𝐷𝑛 can be derived. Let

𝑝 = Pr(𝑏 ∈ 𝑅𝑒𝑑)

We showed in a previous chapter that 𝑝 ≈ 0.486 if the roulette wheel is fair.


But rather than assuming it is fair, let’s treat 𝑝 as an unknown parameter.
The PDF of 𝑥𝑖 is
$$f_x(a) = \begin{cases} 1-p & a = 0 \\ p & a = 1 \\ 0 & \text{otherwise} \end{cases}$$

Since the random variables in 𝐷𝑛 are independent, their joint PDF is:

$$\Pr(D_n) = f_x(x_1) f_x(x_2) f_x(x_3) = p^{x_1+x_2+x_3}(1-p)^{3-x_1-x_2-x_3}$$

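For instance, if red loses the first two games and wins the third, so that
𝐷𝑛 = (0, 0, 1), this formula gives:

$$\Pr(D_n = (0,0,1)) = p^{0+0+1}(1-p)^{3-0-0-1} = p(1-p)^2$$
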
Note that even with a small data set of a simple random variable, the joint PDF
is not easy to calculate. Once we get into larger data sets and more complex
random variables, it can get very difficult. That’s OK, we don’t usually need to
calculate it - we just need to know that it could be calculated.

7.1.1 Simple random sampling

In order to model the data generating process, we need to model the entire joint
distribution of 𝐷𝑛 . As mentioned earlier, this means we must model both:

• The marginal probability distribution of each 𝑥𝑖


• The relationship between the 𝑥𝑖 ’s

Fortunately, we often can simplify this joint distribution quite a bit by assum-
ing that 𝐷𝑛 is independent and identically distributed (IID) or a simple
random sample from a large population.
A simple random sample has two features:

1. Independent: Each 𝑥𝑖 is independent of the others.


2. Identically distributed: Each 𝑥𝑖 has the same (unknown) marginal
distribution.

This implies that its joint PDF can be written:

$$\Pr(D_n = (a_1, a_2, \ldots, a_n)) = f_x(a_1) f_x(a_2) \cdots f_x(a_n)$$

where 𝑓𝑥 (𝑎) = Pr(𝑥𝑖 = 𝑎) is just the marginal PDF of a single observation.


Independence allows us to write the joint PDF as the product of the marginal
PDFs for each observation, and identical distribution allows us to use the same
marginal PDF for each observation.
The reason we call this “independent and identically distributed” is hopefully
obvious, but what does it mean to say we have a “random sample” from a
“population”? Well, one simple way of generating an IID sample is to:

1. Define the population of interest, for example all Canadian residents.


2. Use some purely random mechanism1 to choose a small subset of cases
from this population.
• The subset is called our sample
• “Purely random” here means some mechanism like a computer’s ran-
dom number generator, which can then be used to dial random tele-
phone numbers or select cases from a list.
3. Collect data from every case in our sample, usually by contacting them
and asking them questions (survey).

It will turn out that a moderately-sized random sample provides surprisingly
accurate information on the underlying population.

Example 7.3. Our roulette data is a random sample


Each observation 𝑥𝑖 in our roulette data is an independent random draw from
the Bernoulli(𝑝) distribution, where 𝑝 = Pr(𝑏 ∈ 𝑅𝑒𝑑).
Therefore, this data set satisfies the criteria for a simple random sample.

7.1.2 Time series data

Our employment data set is an example of time series data; it is made of
observations of each variable at regularly-spaced points in time. Most
macroeconomic data - GDP, population, inflation, interest rates - are time series.
Time series have several features that are inconsistent with the random sampling
assumption:

• They usually have clear time trends.


– For example, Canada’s real GDP has been steadily growing for as
long as we have data.
– This violates “identically distributed” since 2010 GDP is drawn from
a distribution with a higher expected value than the distribution for
1910 GDP.
• They usually have clear recurring cyclical patterns or seasonality.
– For example, unemployment in Canada is usually lower from Septem-
ber through December.
– This also violates “identically distributed” since February unemploy-
ment has a higher expected value than November unemployment.
1 As a technical matter, the assumption of independence requires that we sample with

replacement. This means we allow for the possibility that we sample the same case more than
once. In practice this doesn’t matter as long as the sample is small relative to the population.

• They usually exhibit what is called autocorrelation.


– For example, shocks to the economy that affect GDP in one month
or quarter (think of COVID or a financial crisis) are likely to have a
similar (if smaller) effect on GDP in the next month or quarter.

– This violates “independence” since nearby time periods are positively
correlated.

We can calculate statistics for time series, and we already did in Chapter 6.
However, time series data often requires more advanced techniques than we will
learn in this class. ECON 433 addresses time series data.

7.1.3 Other sampling models

Not all useful data sets come from a simple random sample or a time series. For
example:

• A stratified sample is collected by dividing the population into strata
(subgroups) based on some observable characteristics, and then randomly
sampling a predetermined number of cases within each stratum.
– Most professional surveys are constructed from stratified samples
rather than random samples.
– Stratified sampling is often combined with oversampling of some
smaller strata that are of particular interest.
∗ The LFS oversamples residents of Prince Edward Island (PEI)
because a national random sample would not catch enough PEI
residents to accurately measure PEI’s unemployment rate.
∗ Government surveys typically oversample disadvantaged groups.
– Stratified samples can usually be handled as if they were from a
random sample, with some adjustments.
• A cluster sample is gathered by dividing the population into clusters,
randomly selecting some of these clusters, and sampling cases within the
cluster.
– Educational data sets are often gathered this way: we pick a random
sample of schools, and then collect data from each student within
those schools.
– Cluster samples can usually be handled as if they were from a random
sample, with some adjustments.
• A census gathers data on every case in the population.
– For example, we might have data on all 50 US states, or all 10 Cana-
dian provinces, or all of the countries of the world.

– Data from administrative sources such as tax records or school


records often cover the entire population of interest as well.
– Censuses are often treated as random samples from some hypothetical
population of “possible” cases.
• A convenience sample is gathered by whatever method is convenient.
– For example, we might gather a survey from people who walk by, or
we might recruit our friends to participate in the survey.
– Convenience samples are the worst-case scenario; in many cases they
simply aren’t usable for accurate statistical analysis.

Many data sets combine several of these elements. For example, Canada’s un-
employment rate is calculated using data from the Labour Force Survey (LFS).
The LFS is built from a stratified sample of the civilian non-institutionalized
working-age population of Canada. There is also some clustering: the LFS
will typically interview whole households, and will do some geographic cluster-
ing to save on travel costs. The LFS is gathered monthly, and the resulting
unemployment rate is a time series.

7.1.4 Sample selection and representativeness

Random samples and their close relatives have the feature that they are rep-
resentative of the population from which they are drawn. In a sense that will
be made more clear over the next few chapters, any sufficiently large random
sample “looks just like” the population.
Unfortunately, a simple random sample is quite difficult to collect from humans.
Even if we are able to randomly select cases, we often run into the following
problems:

• Nonresponse occurs when a sampled individual does not provide the
information requested by the survey.
– Survey-level nonresponse occurs when the sampled individual does
not answer any questions.
∗ This can occur if the sampled individual cannot be found, refuses
to answer, or cannot answer (for example, is incapacitated due
to illness or disability).
∗ Recent response rates to telephone surveys have been around 9%,
implying over 90% of those contacted do not respond.
– Item-level nonresponse occurs when the sampled individual does
not answer a particular question.
∗ This can occur if the respondent refuses to answer, or the ques-
tion is not applicable or has no valid answer.
∗ Item-level nonresponse is particularly common on sensitive ques-
tions including income.

• Censoring occurs when a particular quantity of interest cannot be observed
for a particular case. Censored outcomes are extremely common in
economics, for example:
– In labour market analysis, we cannot observe the market wage for
individuals who are not currently employed.
– In supply/demand analysis, we only observe quantity supplied and
quantity demanded at the current market price.

When observations are subject to nonresponse or censoring, we must interpret
the data carefully.
Example 7.4. Wald’s airplanes
Abraham Wald was a Hungarian/American statistician and econometrician who
made important contributions to both the theory of statistical inference and the
development of economic index numbers such as the Consumer Price Index.
Like many scientists of his time, he advised the US government during World
War II. As part of his work, he was provided with data on combat damage
received by airplanes, with the hopes that the data could be used to help make
the planes more robust to damage. The data showed the locations of damage
on each returning plane.
Seeing this data, some military analysts concluded that planes were mostly
being shot in the wings and in the middle of the fuselage (body), and that they
should be reinforced with additional steel in these locations.
Wald quickly realized that this was wrong: the data were taken from planes
that returned, which is not a random sample of planes that went out. Planes
were probably shot in the nose, the engines, and the back of the fuselage just
as often as anywhere else, but they did not appear often in the data because
they crashed. Wald’s insight led to a counter-intuitive policy recommendation:
reinforce the parts of the plane that rarely show damage.

There are two basic solutions to nonresponse and censoring:

• Imputation: we assume or impute values for all missing quantities. For
example, we might assume that the wage of each non-employed worker is
equal to the average wage among employed workers with similar charac-
teristics.
• Redefinition: we redefine the population so that our data can be cor-
rectly interpreted as a random sample from that population. For example,
instead of having a random sample of Canadians, we can interpret our
data as a random sample of Canadians who would answer these questions
if asked.

This is not an issue that has a purely technical solution, but requires careful
thought instead. If we are imputing values, do we believe that our imputation

method is reasonable? If we are redefining the population, is the redefined
population one we are interested in? There are no right or wrong answers to
these questions, and sometimes our data are simply not good enough to answer
them.

FYI
Nonresponse bias in recent US elections
Going into both the 2016 and 2020 US presidential elections, polls in-
dicated that the Democratic candidate had a substantial lead over the
Republican candidate:

• Hillary Clinton led Donald Trump by 4-6% nationally in 2016


• Joe Biden led Trump by 8% nationally in 2020.

The actual vote was much closer:

• Clinton won the popular vote (but lost the election) by 2%


• Biden won the popular vote (and won the election) by about 4.5%.

The generally accepted explanation among pollsters for the clear dispar-
ity between polls and voting is systematic nonresponse: for some reason,
Trump voters are less likely to respond to polls. Since most people do
not respond to standard telephone polls any more (response rates are
typically around 9%), it does not take much difference in response rates
to produce a large difference in responses. For example, suppose that:

• We call 1,000 voters


• These voters are equally split, with 500 supporting Biden and 500
supporting Trump.
• 10% of Biden voters respond (50 voters)
• 8% of Trump voters respond (40 voters)

The overall response rate is 9% (similar to what we usually see in
surveys), Biden has the support of 50/90 = 56% of the respondents while
Trump has the support of 40/90 = 44%. Actual support is even, but the
polls show a 12 percentage point gap in support, entirely because of the
small difference in response rates.
Polling organizations employ statisticians who are well aware of this
problem, and they made various adjustments after 2016 to address it.
For example, most now weight their analysis by education, since more
educated people tend to have a higher response rate. Unfortunately,
the 2020 results indicate that this adjustment was not enough. Some
pollsters have argued that it makes more sense to just assume the non-
response bias is 2-3% and adjust the numbers by that amount directly.

7.2 Statistics and their properties


Suppose we have some statistic 𝑠𝑛 = 𝑠(𝐷𝑛 ), i.e., a number that is calculated
from the data.

• Since the data is observed/known, the value of the statistic is
observed/known.
• Since the elements of 𝐷𝑛 are random variables, 𝑠𝑛 is also a random variable
with a well-defined (but unknown) probability distribution that depends
on the unknown DGP.
Example 7.5. Roulette wins
In our roulette example, the total number of wins is:
𝑅 = 𝑥 1 + 𝑥2 + 𝑥3
Since this is a number calculated from our data, it is a statistic.
Since 𝑥𝑖 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝), we can show that 𝑅 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(3, 𝑝).

7.2.1 Some important statistics

I will use 𝑠𝑛 to represent an abstract statistic, but we will often use other
notation to talk about specific statistics.
The most important statistic is the sample average which is defined as:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

We will also consider several other commonly-used univariate statistics:

• The sample variance of 𝑥𝑖 is defined as:

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

A closely-related statistic is the sample standard deviation $s_x = \sqrt{s_x^2}$,
which is the square root of the sample variance.
• The sample frequency or relative sample frequency of the event
𝑥𝑖 ∈ 𝐴 is defined as the proportion of cases in which the event occurs:

$$\hat{f}_A = \frac{1}{n}\sum_{i=1}^{n} I(x_i \in A)$$

A closely-related statistic is the absolute sample frequency $n\hat{f}_A$,
which is the number of cases in which the event occurs.

• The sample median of 𝑥𝑖 is defined as:

$$\hat{m}_x = m : \begin{cases} \hat{f}_{x<m} \le 0.5 \\ \hat{f}_{x>m} \le 0.5 \end{cases}$$

7.2.2 The sampling distribution


Since the data itself is a collection of random variables, any statistic calculated
from that data is also a random variable, with a probability distribution that
can be derived from the DGP.
Example 7.6. The sampling distribution of the sample frequency
Calculating the exact probability distribution of most statistics is quite difficult,
but it is easy to do for the sample frequency. Let 𝑝 = Pr(𝑥𝑖 ∈ 𝐴). Then:

$$n\hat{f}_A \sim Binomial(n, p)$$
In other words, we can calculate the exact probability distribution of the sample
frequency using the formula for the binomial distribution.
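
For instance, in our roulette data with 𝑛 = 3 and a fair wheel (𝑝 ≈ 0.486), the
probability of observing exactly two wins (a relative sample frequency of 2/3) is:

$$\Pr(n\hat{f}_A = 2) = \binom{3}{2} p^2 (1-p) \approx 3 \times 0.486^2 \times 0.514 \approx 0.36$$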

Unfortunately, most statistics typically have sampling distributions that are
quite difficult to calculate.

FYI
To see why the sampling distribution of a statistic is so difficult to cal-
culate, suppose we have a discrete random variable 𝑥𝑖 whose support 𝑆𝑥
has five elements. Then we need to calculate the sampling distribution
of our statistic by adding its probability up across the support of 𝐷𝑛 .
The support has 5𝑛 elements, a number that can quickly get very large.
For example, a typical data set in microeconomics has at least a few
hundred or a few thousand observations. With 100 observations, 𝐷𝑛 can
take on 5100 ≈ 7.9×1069 (that’s 79 followed by 68 zeros!) distinct values.
With 1,000 observations , 𝐷𝑛 can take on 51000 distinct values, a number
too big for Excel to even calculate.

7.2.3 The mean and variance


If our statistic has a probability distribution, it (usually) has a mean and vari-
ance as well. Under some circumstances, we can calculate them.
Example 7.7. The mean of the sample average
Let 𝜇𝑥 = 𝐸(𝑥𝑖 ) be the mean of 𝑥𝑖 . Then the mean of 𝑥̄ is:

$$E(\bar{x}_n) = E\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\sum_{i=1}^{n} \mu_x = \mu_x$$

This is an important and general result in statistics. The mean of the sample
average in a random sample is identical to the mean of the random variable
being averaged.
𝐸(𝑥𝑛̄ ) = 𝐸(𝑥𝑖 )
We have shown this property specifically for a random sample, but it holds
under many other sampling processes.
The variance of the sample average is not equal to the variance of the random
variable being averaged, but they are closely related.
Example 7.8. The variance of the sample average
To keep the math simple, suppose we only have 𝑛 = 2 observations. Then the
sample average is:

$$\bar{x} = \frac{1}{2}(x_1 + x_2)$$

By our earlier formula for the variance:

$$var(\bar{x}) = var\left(\frac{1}{2}(x_1 + x_2)\right) = \left(\frac{1}{2}\right)^2 var(x_1 + x_2)$$
$$= \frac{1}{4}\bigg(\underbrace{var(x_1)}_{\sigma_x^2} + 2\underbrace{cov(x_1, x_2)}_{0 \text{ (independence)}} + \underbrace{var(x_2)}_{\sigma_x^2}\bigg) = \frac{1}{4}(2\sigma_x^2) = \frac{\sigma_x^2}{2}$$

More generally, the variance of the sample average in a random sample of size
𝑛 is:

$$var(\bar{x}_n) = \frac{\sigma_x^2}{n}$$

where $\sigma_x^2 = var(x_i)$.
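
The general case follows by the same argument as for 𝑛 = 2: independence makes
every covariance term zero, so:

$$var(\bar{x}_n) = \frac{1}{n^2} var\left(\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} var(x_i) = \frac{n\sigma_x^2}{n^2} = \frac{\sigma_x^2}{n}$$
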
Other commonly-used statistics also have a mean and variance.
Example 7.9. The mean and variance of the sample frequency
Since the absolute sample frequency has the binomial distribution, we have
already seen its mean and variance. Let 𝑝 = Pr(𝑥𝑖 ∈ 𝐴). Then
$n\hat{f}_A \sim Binomial(n, p)$ and:

$$E(n\hat{f}_A) = np$$
$$var(n\hat{f}_A) = np(1-p)$$

Applying the usual rules for expected values, the mean and variance of the
relative sample frequency are:

$$E(\hat{f}_A) = \frac{E(n\hat{f}_A)}{n} = \frac{np}{n} = p$$
$$var(\hat{f}_A) = \frac{var(n\hat{f}_A)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$$

7.3 Estimation

Statistics are often used to estimate, or guess the value of, some unknown feature
of the population or DGP.

7.3.1 Parameters and estimators


A parameter is an unknown number characterizing a DGP.
Example 7.10. Examples of parameters
Sometimes a single parameter completely describes the DGP:

• In our roulette data set, the joint distribution of the data depends only
on the single parameter 𝑝 = Pr(𝑏 ∈ 𝑅𝑒𝑑).

Sometimes a group of parameters completely describe the DGP:

• If 𝑥𝑖 is a random sample from the 𝑈 (𝐿, 𝐻) distribution, then 𝐿 and 𝐻 are
both parameters.

And sometimes a parameter only partially describes the DGP

• If 𝑥𝑖 is a random sample from some unknown distribution with unknown
mean 𝜇𝑥 = 𝐸(𝑥𝑖 ), then 𝜇𝑥 is a parameter.
• If 𝑥𝑖 is a random sample from some unknown distribution with unknown
median 𝑚𝑥 = 𝑞0.5 (𝑥𝑖 ), then 𝑚𝑥 is a parameter.

Typically there will be particular parameters whose value we wish to know.
Such a parameter is called a parameter of interest. Our model may include
other parameters, which are typically called auxiliary parameters or nuisance
parameters.
An estimator is a statistic that is being used to estimate (guess at the value
of) an unknown parameter of interest. The distinction between estimator and
estimate is a subtle one: we use “estimate” when talking about our statistic as
a number calculated for a specific data set and “estimator” when talking about
it as a random variable.

Example 7.11. Commonly used estimators


Our four example statistics are commonly used as estimators:

1. The relative sample frequency 𝑓𝐴̂ is typically used as an estimator of the
probability 𝑝𝐴 = Pr(𝑥𝑖 ∈ 𝐴).
2. The sample average 𝑥̄ is typically used as an estimator of the population
mean 𝜇𝑥 = 𝐸(𝑥𝑖 ).
3. The sample variance 𝑠2𝑥 is typically used as an estimator of the population
variance 𝜎𝑥2 = 𝑣𝑎𝑟(𝑥𝑖 ).
4. The sample median 𝑚̂ 𝑥 is typically used as an estimator of the population
median 𝑚𝑥 = 𝑞0.5 (𝑥𝑖 ).

Estimators are statistics, so they have all the usual characteristics of a statistic,
including a sampling distribution, a mean, a variance, etc.
In addition, estimators have properties specific to their purpose as a statistic
that is supposed to take on a value close to the unknown parameter of interest.

7.3.2 Sampling error

Let 𝑠𝑛 be a statistic we are using as an estimator of some parameter of interest
𝜃. We can define its sampling error as:

𝑒𝑟𝑟(𝑠𝑛 ) = 𝑠𝑛 − 𝜃

In principle, we want 𝑠𝑛 to be a good estimator of 𝜃, i.e., we want the sampling
error to be as close to zero as possible.
There are several major complications to keep in mind:

1. Since 𝑠𝑛 is a random variable with a probability distribution, 𝑒𝑟𝑟(𝑠𝑛 ) is
also a random variable with a probability distribution.
2. Since the value of 𝜃 is unknown, the value of 𝑒𝑟𝑟(𝑠𝑛 ) is also unknown.

Always remember that 𝑒𝑟𝑟(𝑠𝑛 ) is not an inherent property of the statistic - it
depends on the relationship between the statistic and the parameter of interest.
A given statistic may be a good estimator of one parameter, and a bad estimator
of another parameter.

7.3.3 Bias

In choosing an estimator, we can consider several criteria.



The first is the bias of the estimator, which is defined as its expected sampling error:

$$\begin{aligned}
bias(s_n) &= E(err(s_n)) && (7.7) \\
&= E(s_n - \theta) && (7.8) \\
&= E(s_n) - \theta && (7.9)
\end{aligned}$$

Note that bias is always defined relative to the parameter we wish to estimate,
and is not an inherent property of the statistic.
Ideally we would want 𝑏𝑖𝑎𝑠(𝑠𝑛 ) to be zero, in which case we would say that 𝑠𝑛
is an unbiased estimator of 𝜃.
Example 7.12. Two unbiased estimators of the mean
Consider the sample average 𝑥𝑛̄ in a random sample as an estimator of the
parameter 𝜇𝑥 = 𝐸(𝑥𝑖 ). The bias is:

𝑏𝑖𝑎𝑠(𝑥𝑛̄ ) = 𝐸(𝑥𝑛̄ ) − 𝜇𝑥 = 𝜇𝑥 − 𝜇𝑥 = 0

That is, the sample average is an unbiased estimator of the population mean.
However, it is not the only unbiased estimator. For example, suppose we simply
take the value of 𝑥𝑖 in the first observation and throw away the rest of the data.
This “first observation estimator” is easier to calculate than the sample average,
and is also an unbiased estimator of 𝜇𝑥 :

𝑏𝑖𝑎𝑠(𝑥1 ) = 𝐸(𝑥1 ) − 𝜇𝑥 = 𝜇𝑥 − 𝜇𝑥 = 0

This example illustrates a general principle: there is rarely exactly one unbiased
estimator. There are either none, or many.
Example 7.13. An unbiased estimator of the variance
The sample variance is an unbiased estimator of the population variance:

𝐸(𝑠2𝑥 ) = 𝜎𝑥2 = 𝑣𝑎𝑟(𝑥𝑖 )

This is not hard to prove, but I will skip it for now.

If there are multiple unbiased estimators available for a given parameter, we


need to apply a second criterion to choose one. A natural second criterion is
the variance of the estimator:

$$var(s_n) = E[(s_n - E(s_n))^2]$$

If 𝑠𝑛 is unbiased, then a low variance means it is usually close to 𝜃, while a


high variance means that it is often either much larger or much smaller than 𝜃.
Clearly, low variance is better than high variance.
The minimum variance unbiased estimator (MVUE) of a parameter is the
unbiased estimator with the lowest variance.

Example 7.14. The variance of the sample average and first observation estimators
In our previous example, we found two unbiased estimators for the mean, the
sample average 𝑥𝑛̄ and the first observation 𝑥1 .
The variance of the sample average is:

𝑣𝑎𝑟(𝑥𝑛̄ ) = 𝜎2 /𝑛

and the variance of the first observation estimator is:

𝑣𝑎𝑟(𝑥1 ) = 𝜎2

For any 𝑛 > 1, the sample average 𝑥𝑛̄ has lower variance than the first observation estimator 𝑥1 . Since both are unbiased, the sample average is the preferred estimator of the two.
In fact, we can prove that 𝑥𝑛̄ is the minimum variance unbiased estimator of
𝜇𝑥 .
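To see this result in action, here is a small R simulation of my own (the sample size and distribution are illustrative choices, not from the text) comparing the two estimators:

```r
# Compare the sample average and the first observation estimator
# (illustrative choices: n = 10, x_i ~ N(5, 2^2))
set.seed(456)
n <- 10
sims <- replicate(10000, {
  x <- rnorm(n, mean = 5, sd = 2)
  c(x_bar = mean(x), x_1 = x[1])
})
apply(sims, 1, mean)   # both close to 5: both unbiased
apply(sims, 1, var)    # roughly 2^2/10 = 0.4 for x_bar, 2^2 = 4 for x_1
```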

7.3.4 Mean squared error

Unfortunately, once we move beyond the simple case of estimating the population mean, we run into several complications.

The first complication is that an unbiased estimator may not exist for a particular parameter of interest. If there is no unbiased estimator, there is no minimum variance unbiased estimator. So we need some other way of choosing an estimator.

Example 7.15. The sample median


There is no unbiased estimator of the median of a random variable with unknown
distribution. To see why, consider the simplest possible data set: a random
sample of size 𝑛 = 1 on the random variable 𝑥𝑖 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝), where 0 < 𝑝 < 1.
The median of 𝑥𝑖 in this case is:

𝑚𝑥 = 𝐼(𝑝 > 0.5)

First we show that the sample median is a biased estimator of 𝑚𝑥 . The sample
median is:
𝑚̂ 𝑥 = 𝑥1
and its expected value is:

𝐸(𝑚̂ 𝑥 ) = 𝐸(𝑥1 ) = 𝑝 ≠ 𝐼(𝑝 > 0.5)

So the sample median 𝑚̂ 𝑥 is a biased estimator for the population median 𝑚𝑥 .



More generally, any statistic calculated from this data set must take the form
𝑠 = 𝑎0 + 𝑎1 𝑥1 , where 𝑠 = 𝑎0 when 𝑥1 = 0 and 𝑠 = 𝑎0 + 𝑎1 is its value when
𝑥1 = 1. This statistic has expected value 𝐸(𝑎0 + 𝑎1 𝑥1 ) = 𝑎0 + 𝑎1 𝑝, so any
unbiased estimator would need to solve the equation:

$$a_0 + a_1 p = I(p > 0.5)$$

Since 𝑝 is unknown, this equation would have to hold for every 𝑝 ∈ (0, 1), and no choice of 𝑎0 and 𝑎1 accomplishes that: the left side is linear in 𝑝, while the right side is a step function.

The second complication is that we often have access to an unbiased estimator


and a biased estimator with lower variance.
Example 7.16. The relationship between age and earnings
One common question in labour economics is how earnings vary by various
characteristics such as age.
Suppose we have a random sample of 800 Canadians, and we want to estimate
the earnings of the average 35-year old Canadian. Assuming for simplicity
that ages are equally-spaced between 0 and 80, our data set will have only 10
Canadians at each age. So we have several options:

• Average earnings of the 10 35-year-olds in our data.

This estimator will be unbiased, but 10 observations isn’t very much and so its
variance will be high. We can reduce the variance by adding more observations
from people who are almost 35 years old:

• Average earnings of the 30 34-36 year olds in our data.


• Average earnings of the 100 30-39 year olds in our data.
• Average earnings of the 800 0-80 year olds in our data.

By including more data, these estimators will have lower variance but will introduce bias. My guess is that introducing 34 and 36 year olds is a good idea since they probably have similar earnings to 35 year olds, but including children and the elderly is not such a good idea.

This suggests that we need a criterion that

• Can be used to choose between biased estimators


• Can choose slightly biased estimators with low variance over unbiased
estimators with high variance.

The mean squared error of an estimator is defined as the expected value of the squared sampling error:

$$\begin{aligned}
MSE(s_n) &= E[err(s_n)^2] && (7.10) \\
&= E[(s_n - \theta)^2] && (7.11) \\
&= var(s_n) + [bias(s_n)]^2 && (7.12)
\end{aligned}$$

The MSE criterion allows us to choose a biased estimator with low variance over
an unbiased estimator with high variance, and also allows us to choose between
biased estimators when no unbiased estimator exists.
Example 7.17. The MSE of the sample mean and first observation
estimators
The mean squared error of the sample average is:

$$MSE(\bar{x}_n) = var(\bar{x}_n) + [bias(\bar{x}_n)]^2 = \frac{\sigma_x^2}{n} + 0^2 = \frac{\sigma_x^2}{n}$$
and the mean squared error of the first observation estimator is:

𝑀 𝑆𝐸(𝑥1 ) = 𝜎𝑥2

The sample average is the preferred estimator by the MSE criterion, so in this
case we get the same result as applying the MVUE criterion.

7.3.5 Standard error

Parameter estimates are typically reported along with their standard errors.
The standard error of a statistic is an estimate of its standard deviation.
Example 7.18. The standard error of the average
We have shown that the sample average provides a good estimate of the population mean, and that its variance is:

$$var(\bar{x}_n) = \frac{\sigma_x^2}{n} = \frac{var(x_i)}{n}$$

Since $s_x^2$ is an unbiased estimator of $var(x_i)$, we can use it to construct an unbiased estimator of $var(\bar{x}_n)$:

$$E\left(\frac{s_x^2}{n}\right) = \frac{E(s_x^2)}{n} = \frac{var(x_i)}{n} = var(\bar{x}_n)$$

We might also want to estimate the standard deviation of $\bar{x}_n$. A natural approach would be to take the square root of the estimator above, yielding:

$$se(\bar{x}_n) = \frac{s_x}{\sqrt{n}}$$

This is the conventional formula for the standard error of the sample average,
and is typically reported next to the sample average.

Standard errors are usually biased estimators of the statistic’s true standard
deviation, but the bias is typically small.
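In R, the standard error of the sample average can be computed directly from a data vector. A minimal sketch, using a made-up data vector:

```r
# Standard error of the sample average: s_x / sqrt(n)
x <- c(4.1, 3.8, 5.2, 4.7, 4.4)      # hypothetical data
se_xbar <- sd(x) / sqrt(length(x))
se_xbar
```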

7.4 The law of large numbers


So far, we have described statistics and estimators in terms of their probability
distribution and the mean, variance and mean squared error associated with
that probability distribution.
We are able to do this fairly easily with both sample averages and sample frequencies (which are also sample averages) because they are sums. Unfortunately, this is not so easy with other statistics (e.g., standard errors, medians, percentiles, etc.) that are nonlinear functions of the data.
In order to deal with those statistics, we need to construct approximations
based on the asymptotic properties of the statistics. Asymptotic properties
are properties that hold approximately, with the approximation getting closer
and closer to the truth as the sample size gets larger.
Properties that hold exactly for any sample size are sometimes called exact or
finite sample properties. All of the results we have discussed so far are finite
sample results.
We will state two main asymptotic results in this chapter: the law of large
numbers and Slutsky’s theorem. A third asymptotic result called the central
limit theorem will be stated in the next chapter.
All three results rely on the concept of a limit, which you would have learned in
your calculus course. If you need to review that concept, please see the section
on limits in the math appendix. However, I will not expect you to do any
significant amount of math with these results. Please focus on the intuition and
interpretation and don’t worry too much about the math.

7.4.1 Defining the LLN

The law of large numbers (LLN) says that for a large enough random sample,
the sample average is almost identical to the corresponding population mean.
In order to state the LLN, we need to introduce some concepts. Consider a data
set 𝐷𝑛 of size 𝑛, and let 𝑠𝑛 be some statistic calculated from 𝐷𝑛 . We say that
𝑠𝑛 converges in probability to some constant 𝑐 if:

$$\lim_{n \to \infty} \Pr(|s_n - c| < \epsilon) = 1$$

for any positive number 𝜖 > 0.


Intuitively, what this means is that for a sufficiently large 𝑛 (the lim𝑛→∞ part),
𝑠𝑛 is almost certainly (the Pr(⋅) = 1 part) very close to 𝑐 (the |𝑠𝑛 − 𝑐| < 𝜖 part).
We have a compact way of writing convergence in probability:

𝑤𝑛 →𝑝 𝑐

means that 𝑤𝑛 converges in probability to 𝑐.


Having defined our terms we can now state the law of large numbers.

FYI
LAW OF LARGE NUMBERS: Let 𝑥𝑛̄ be the sample average from
a random sample of size 𝑛 on the random variable 𝑥𝑖 with mean 𝐸(𝑥𝑖 ) =
𝜇𝑥 . Then
𝑥𝑛̄ →𝑝 𝜇𝑥

Example 7.19. The LLN in the economy


The law of large numbers is extremely powerful and important, as it is the basis
for the gambling industry, the insurance industry, and much of the banking
industry.
A casino works by taking in a large number of independent small bets. As we
have seen for the case of roulette, these bets have a small house advantage, so
their average benefit to the casino is positive. The casino can lose any bet, but
the LLN virtually guarantees that the gains will outweigh the losses as long as
the casino takes in a large enough number of independent bets.
An insurance company works almost the same as a casino. Each of us faces
a small risk of a catastrophic cost: a house that burns down, a car accident
leading to serious injury, etc. Insurance companies collect a little bit of money
from each of us, and pay out a lot of money to the small number of people who
have claims. Although the context is quite different, the underlying economics
are identical to those of a casino: the insurance company prices its products so
that its revenues exceed its expected payout, and takes on a large number of
independent risks.
Sometimes insurance companies do lose money, and even go bankrupt. The
usual cause of this is a big systemic event like a natural disaster, pandemic or
financial crisis that affects everyone. Here the independence needed for the LLN
does not apply.
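We can watch the LLN at work with a short R simulation (my own illustration, using the roulette win probability from earlier chapters): the running average of simulated bets drifts toward the population mean as the number of games grows.

```r
# LLN illustration: running average of Bernoulli(18/37) outcomes
set.seed(789)
p <- 18/37
x <- rbinom(5000, size = 1, prob = p)    # 5000 simulated games
running_avg <- cumsum(x) / seq_along(x)
running_avg[c(10, 100, 1000, 5000)]      # settles near p = 0.486
```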

7.4.2 Consistent estimation

The law of large numbers applies to the sample mean, but we are interested in
other estimators as well.
In general, we say that the statistic 𝑠𝑛 is a consistent estimator of a parameter 𝜃 if:

$$s_n \to_p \theta$$
It will turn out that most of the statistics we use are consistent estimators of
the thing we typically use them to estimate.

The key to this property is a result called Slutsky’s theorem. Slutsky’s theorem roughly says that if the law of large numbers applies to a statistic 𝑠𝑛 , it also applies to 𝑔(𝑠𝑛 ) for any continuous function 𝑔(⋅).

FYI

SLUTSKY THEOREM: Let 𝑔(⋅) be a continuous function. Then:

𝑠𝑛 →𝑝 𝑐 ⟹ 𝑔(𝑠𝑛 ) →𝑝 𝑔(𝑐)

What is the importance of Slutsky’s theorem? Most commonly used statistics can be written as continuous functions of a sample average (or several sample averages). Slutsky’s theorem extends the LLN to these statistics, and ensures that these commonly used statistics are consistent estimators of the corresponding population parameter. For example:

• The sample variance is a consistent estimator of the population variance:

𝑠2𝑥 →𝑝 𝑣𝑎𝑟(𝑥)

• The sample standard deviation is a consistent estimator of the population


standard deviation:
𝑠𝑥 →𝑝 𝑠𝑑(𝑥)
• The relative sample frequency is a consistent estimator of the population
probability:
𝑓𝐴̂ →𝑝 Pr(𝐴)
• The sample median is a consistent estimator of the population median, and
all other sample quantiles are consistent estimators of the corresponding
population quantile.

The math needed to make full use of Slutsky’s theorem and prove these results
is beyond the scope of this course, so all I am asking here is for you to know
that it can be used for this purpose.

Chapter review
In this chapter we have learned to model a data generating process, describe the probability distribution of a statistic, and interpret a statistic as an estimator of some unknown parameter of the underlying data generating process.
Almost by definition, estimators are rarely identical to the parameter of interest,
so any conclusions based on estimators have a degree of uncertainty. To describe
this uncertainty in a rigorous and quantitative manner, we will next learn some
principles of statistical inference.

Practice problems
Answers can be found in the appendix.
SKILL #1: Identify whether a data set is a random sample

1. Suppose we have a data set 𝐷𝑛 = (𝑥1 , 𝑥2 ) of size 𝑛 = 2. For each of the


following conditions, identify whether it implies that 𝐷𝑛 is (i) definitely
a random sample; (ii) definitely not a random sample; or (iii) possibly a
random sample.
a. The two observations are independent and have the same mean
𝐸(𝑥1 ) = 𝐸(𝑥2 ) = 𝜇𝑥 .
b. The two observations are independent and have the same mean
𝐸(𝑥1 ) = 𝐸(𝑥2 ) = 𝜇𝑥 and variance 𝑣𝑎𝑟(𝑥1 ) = 𝑣𝑎𝑟(𝑥2 ) = 𝜎𝑥2 .
c. The two observations are independent and have different means
𝐸(𝑥1 ) ≠ 𝐸(𝑥2 ).
d. The two observations have the same PDFs, and are independent.
e. The two observations have the same PDFs, and have 𝑐𝑜𝑟𝑟(𝑥1 , 𝑥2 ) = 0
f. The two observations have the same PDFs, and have 𝑐𝑜𝑣(𝑥1 , 𝑥2 ) > 0

SKILL #2: Classify data sets by sampling types

2. Identify the sampling type (random sample, time series, stratified sample,
cluster sample, census, convenience sample) for each of the following data
sets.
a. A data set from a survey of 100 SFU students who I found waiting
in line at Tim Horton’s.
b. A data set from a survey of 1,000 randomly selected SFU students.
c. A data set from a survey of 100 randomly selected SFU students from
each faculty.
d. A data set that reports total SFU enrollment for each year from
2005-2020.
e. A data set from administrative sources that describes demographic
information and postal code of residence for all SFU students in 2020.

SKILL #3: Describe the probability distribution of a very simple


data set

3. Suppose we have a data set 𝐷𝑛 = (𝑥1 , 𝑥2 ) that is a random sample of size


𝑛 = 2 on the random variable 𝑥𝑖 which has discrete PDF:

$$f_x(a) = \begin{cases} 0.4 & a = 1 \\ 0.6 & a = 2 \end{cases} \qquad (7.13)$$

Let 𝑓𝐷𝑛 (𝑎, 𝑏) = Pr(𝑥1 = 𝑎 ∩ 𝑥2 = 𝑏) be the joint PDF of the data set

a. Find the support 𝑆𝐷𝑛


b. Find 𝑓𝐷𝑛 (1, 1)
c. Find 𝑓𝐷𝑛 (2, 1)
d. Find 𝑓𝐷𝑛 (1, 2)
e. Find 𝑓𝐷𝑛 (2, 2)

SKILL #4: Find the sampling distribution of a very simple statistic

4. Suppose we have the data set described in question 3 above. Find the
support 𝑆 and sampling distribution 𝑓(⋅) for each of the following statis-
tics:
a. The sample frequency 𝑓1̂ = (𝐼(𝑥1 = 1) + 𝐼(𝑥2 = 1))/2.
b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.
c. The sample variance 𝜎̂𝑥2 = (𝑥1 − 𝑥)̄ 2 + (𝑥2 − 𝑥)̄ 2 .
d. The sample standard deviation 𝜎̂𝑥 = √𝜎̂𝑥2 .
e. The sample minimum 𝑥𝑚𝑖𝑛 = min(𝑥1 , 𝑥2 ).
f. The sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ).

SKILL #5: Find the mean and variance of a statistic from its sam-
pling distribution

5. Suppose we have the data set described in question 3 above. Find the
mean of each of the following statistics:
a. The sample frequency 𝑓1̂ = (𝐼(𝑥1 = 1) + 𝐼(𝑥2 = 1))/2.
b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.
c. The sample variance 𝜎̂𝑥2 = (𝑥1 − 𝑥)̄ 2 + (𝑥2 − 𝑥)̄ 2 .
d. The sample standard deviation 𝜎̂𝑥 = √𝜎̂𝑥2 .
e. The sample minimum 𝑥𝑚𝑖𝑛 = min(𝑥1 , 𝑥2 ).
f. The sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ).
6. Suppose we have the data set described in question 3 above. Find the
variance of the following statistics:
a. The sample frequency 𝑓1̂ = (𝐼(𝑥1 = 1) + 𝐼(𝑥2 = 1))/2.
b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.
c. The sample minimum 𝑥𝑚𝑖𝑛 = min(𝑥1 , 𝑥2 ).
d. The sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ).

SKILL #6: Find the mean and variance of a linear statistic

7. Suppose we have a data set 𝐷𝑛 = (𝑥1 , 𝑥2 ) that is a random sample of size


𝑛 = 2 on the random variable 𝑥𝑖 which has mean 𝐸(𝑥𝑖 ) = 1.6 and variance
𝑣𝑎𝑟(𝑥𝑖 ) = 0.24. Find the mean and variance of the following statistics:
a. The first observation 𝑥1 .

b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.


c. The weighted average 𝑤 = 0.2 ∗ 𝑥1 + 0.8 ∗ 𝑥2 .

SKILL #7: Distinguish between parameters and statistics

8. Suppose 𝐷𝑛 is a random sample of size 𝑛 = 100 on a random variable 𝑥𝑖


which has the 𝑁 (𝜇, 𝜎2 ) distribution. Which of the following are unknown
parameters of the DGP? Which are statistics calculated from the data?
a. 𝐷𝑛
b. 𝑛
c. 𝑥𝑖
d. 𝑖
e. 𝑁
f. 𝜇
g. 𝜎2
h. 𝐸(𝑥𝑖 )
i. 𝐸(𝑥3𝑖 )
j. 𝑣𝑎𝑟(𝑥𝑖 )
k. 𝑠𝑑(𝑥𝑖 )/√𝑛
l. 𝑥̄
9. Suppose we have the data set described in question 3 above. Find the true
value of each of the following parameters:
a. The probability Pr(𝑥𝑖 = 1).
b. The population mean 𝐸(𝑥𝑖 )
c. The population variance 𝑣𝑎𝑟(𝑥𝑖 )
d. The population standard deviation 𝑠𝑑(𝑥𝑖 )
e. The population minimum min(𝑆𝑥 ).
f. The population maximum max(𝑆𝑥 ).

SKILL #8: Calculate sampling error

10. Suppose we have the data set described in question 3 above. Suppose we use the sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ) as an estimator of the population maximum max(𝑆𝑥 ).
a. Find the support 𝑆𝑒𝑟𝑟 of the sampling error 𝑒𝑟𝑟 = max(𝑥1 , 𝑥2 ) −
𝑚𝑎𝑥(𝑆𝑥 ).
b. Find the PDF 𝑓𝑒𝑟𝑟 (⋅) for the sampling distribution of the sampling
error 𝑒𝑟𝑟.

SKILL #9: Calculate bias and classify estimators as biased or unbi-


ased

11. Suppose we have the data set described in question 3 above. Classify each
of the following estimators as biased or unbiased, and calculate the bias.
a. The sample frequency 𝑓1̂ as an estimator of the probability Pr(𝑥𝑖 =
1).
b. The sample average 𝑥̄ as an estimator of the population mean 𝐸(𝑥𝑖 )
c. The sample variance 𝜎̂𝑥2 as an estimator of the population variance
𝑣𝑎𝑟(𝑥𝑖 )
d. The sample standard deviation 𝜎̂𝑥 as an estimator of the population
standard deviation 𝑠𝑑(𝑥𝑖 )
e. The sample minimum 𝑥𝑚𝑖𝑛 as an estimator of the population mini-
mum min(𝑆𝑥 ).
f. The sample maximum 𝑥𝑚𝑎𝑥 as an estimator of the population max-
imum max(𝑆𝑥 ).
12. Suppose we are interested in the following parameters:
• The average earnings of Canadian men: 𝜇𝑀 .
• The average earnings of Canadian women: 𝜇𝑊 .
• The male-female earnings gap in Canada: 𝜇𝑀 − 𝜇𝑊 .
• The male-female earnings ratio in Canada: 𝜇𝑀 /𝜇𝑊 .
and we have calculated the following statistics from a random sample of
Canadians:
• The average earnings of men in our sample 𝑦𝑀̄
• The average earnings of women in our sample 𝑦𝑊 ̄
• The male-female earnings gap in our sample 𝑦𝑀 ̄ − 𝑦𝑊̄ .
• The male-female earnings ratio in our sample 𝑦𝑀̄ /𝑦𝑊
̄ .
We already know that 𝑦𝑀 ̄ is an unbiased estimator of 𝜇𝑀 and 𝑦𝑊
̄ is an
unbiased estimator of 𝜇𝑊 .
a. Is the sample earnings gap 𝑦𝑀̄ − 𝑦𝑊 ̄ a biased or unbiased estimator
of the population gap 𝜇𝑀 − 𝜇𝑊 ? Explain.
b. Is the sample earnings ratio 𝑦𝑀̄ /𝑦𝑊
̄ a biased or unbiased estimator
of the population earnings ratio 𝜇𝑀 /𝜇𝑊 ? Explain.

SKILL #10: Calculate mean squared error

13. Suppose we have the data set described in question 3 above. Calculate
the mean squared error for each of the following estimators.
a. The sample frequency 𝑓1̂ as an estimator of the probability Pr(𝑥𝑖 =
1).
b. The sample average 𝑥̄ as an estimator of the population mean 𝐸(𝑥𝑖 )
c. The sample minimum 𝑥𝑚𝑖𝑛 as an estimator of the population mini-
mum min(𝑆𝑥 ).
d. The sample maximum 𝑥𝑚𝑎𝑥 as an estimator of the population max-
imum max(𝑆𝑥 ).

SKILL #11: Apply MVUE and MSE criteria to select an estimator

14. Suppose you have a random sample of size 𝑛 = 2 on the random variable 𝑥𝑖 with mean 𝐸(𝑥𝑖 ) = 𝜇 and variance 𝑣𝑎𝑟(𝑥𝑖 ) = 𝜎2 . Two potential estimators of 𝜇 are the sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2 and the last observation 𝑥2 .
a. Are these estimators biased or unbiased?
b. Find 𝑣𝑎𝑟(𝑥)̄
c. Find 𝑣𝑎𝑟(𝑥2 )
d. Find 𝑀 𝑆𝐸(𝑥)̄
e. Find 𝑀 𝑆𝐸(𝑥2 )
f. Which estimator is preferred under the MVUE criterion?
g. Which estimator is preferred under the MSE criterion?

SKILL #12: Calculate the standard error for a sample average

15. Suppose that we have a random sample 𝐷𝑛 of size 𝑛 = 100 on the random
variable 𝑥𝑖 with unknown mean 𝜇 and unknown variance 𝜎2 . Suppose
that the sample average is 𝑥̄ = 12 and the sample variance is 𝜎̂ 2 = 4. Find
the standard error of 𝑥.̄

SKILL #13: Explain and interpret the law of large numbers

16. Suppose we have a random sample of size 𝑛 on the random variable 𝑥𝑖


with mean 𝐸(𝑥𝑖 ) = 𝜇. Which of the following statistics are consistent
estimators of 𝜇?
a. The sample average 𝑥̄
b. The sample median.
c. The first observation 𝑥1 .
d. The average of all even-numbered observations.
e. The average of the first 100 observations.
Chapter 8

Statistical inference

In a previous chapter, we learned about estimation: the use of data and statistics
to construct the best possible guess at the value of some parameter.

In this chapter, we will pursue a different goal. Instead of estimating the single
“most likely” value of the parameter, we will construct statistics that can be
used to classify particular parameter values as plausible (could be the true value)
or implausible (unlikely to be the true value).

• A hypothesis test determines whether a particular value can be ruled out.


• A confidence interval determines a range of values that cannot be ruled
out.

The set of procedures for constructing confidence intervals and hypothesis tests
is called statistical inference.

Goals
Chapter goals
In this chapter we will learn how to:

• Construct and perform a simple hypothesis test


• Construct a confidence interval
• Correctly interpret both hypothesis tests and confidence intervals


8.1 Principles of inference

8.1.1 Evidence

The purpose of statistical inference is to systematically account for the uncer-


tainty associated with limited evidence. That is, there are important aspects of
the data generating process we do not know. The data provide some evidence
about those unknown aspects, but the evidence they provide may not be strong.
Statistical inference asks us what statements about the data generating process
can be made with confidence based on the data.

Example 8.1. Fair and unfair roulette games


Suppose you work as a casino regulator for the BCLC (British Columbia Lottery
Corporation, the crown corporation that regulates all commercial gambling in
B.C.). You have been given data with recent roulette results from a particular
casino and are tasked with determining whether the casino is running a fair
game.
Before getting caught up in math, let’s think about how we might assess evi-
dence:

• We might have results from many games, or only a few games.

• Our results may have a win rate close to the expected rate for a fair game,
or far from that rate.

We can put those possibilities into a table:

| Win rate | Many games | Few games |
|---|---|---|
| Close to fair game rate | Probably fair | Could be fair or unfair |
| Far from fair game rate | Probably unfair | Could be fair or unfair |

That is, we can make a fairly confident conclusion if we have a lot of evidence,
and our conclusion depends on what the evidence shows. But if we do not have
a lot of evidence, we cannot make a confident conclusion either way.
In this chapter we will formalize these basic ideas about evidence.

8.1.2 A basic framework

For the remainder of this chapter, suppose we have a data set 𝐷𝑛 =


(𝑥1 , 𝑥2 , … , 𝑥𝑛 ) of size 𝑛. The data comes from an unknown data generating
process that includes an unknown parameter of interest 𝜃.

Example 8.2. DGP and parameter of interest for roulette


Let 𝐷𝑛 = (𝑥1 , … , 𝑥𝑛 ) be a data set of results from 𝑛 = 100 games of roulette at
a local casino. More specifically, let

𝑥𝑖 = 𝐼(Red wins)

Our parameter of interest is the probability that red wins:

𝑝𝑟𝑒𝑑 = Pr(𝑥𝑖 = 1) = 𝐸(𝑥𝑖 )

We know that red wins in a fair game with probability 𝑝𝑟𝑒𝑑 = 18/37 ≈ 0.486.

8.2 Hypothesis tests


We will start with hypothesis tests. The idea of a hypothesis test is to determine
whether the data rule out or reject a specific value of the unknown parameter
𝜃.
Intuitively, if we have no (useful) data we cannot rule anything out, but as we
obtain more data, we can rule out more values.

8.2.1 The null and alternative hypotheses

The first step in a hypothesis test is to define the null hypothesis. The null
hypothesis is a statement about our parameter 𝜃 that takes the form:

𝐻0 ∶ 𝜃 = 𝜃 0

where 𝜃0 is a specific number. This is the value of 𝜃 we are interested in ruling


out.
The next step is to define the alternative hypothesis. The alternative hy-
pothesis defines every other value of 𝜃 we are allowing, and is usually written
as:
𝐻1 ∶ 𝜃 ≠ 𝜃 0
where 𝜃0 is the same number as used in the null.
Example 8.3. Null and alternative for 𝑝𝑟𝑒𝑑
In our roulette example, our null hypothesis is that the game is fair:

𝐻0 ∶ 𝑝𝑟𝑒𝑑 = 18/37 ≈ 0.486

and the alternative hypothesis is that it is not fair:

𝐻1 ∶ 𝑝𝑟𝑒𝑑 ≠ 18/37

Notice that there is something of an asymmetry between the null and alternative
hypothesis: the null is typically (though not necessarily) a single value and the
alternative is every other possible value.

FYI
What null hypothesis to choose?
Our framework here assumes that you already know what null hypothesis
you wish to test, but we might briefly consider how we might choose a
null hypothesis to test.
In some applications there are null hypotheses that are of clear interest
for that specific case:

• In our roulette example, the natural null to test is whether the win
probability matches that of a fair game (𝑝 = 𝑝𝑓𝑎𝑖𝑟 ).
• When measuring the effect 𝛽 of one variable on another, the natural
null to test is “no effect at all” (𝛽 = 0).
• When comparing the mean of some characteristic or outcome across two groups (for example, average wages of men and women), the natural null to test is that they are the same (𝜇𝑀 = 𝜇𝑊 ).
• In epidemiology, a contagious disease will tend to spread if its re-
production rate 𝑅 is greater than one, and decline if it is less than
one, so 𝑅 = 1 is a natural null to test.

If there is no obvious null hypothesis, it may make sense to test many


null hypotheses and report all of the results. There is nothing wrong
with doing that.

8.2.2 The test statistic

Our next step is to construct a test statistic that can be calculated from our
data. A valid test statistic for a given null hypothesis is a statistic 𝑡𝑛 that has
the following two properties:

1. The probability distribution of 𝑡𝑛 under the null (i.e., when 𝐻0 is true)


is known.
2. The probability distribution of 𝑡𝑛 under the alternative (i.e., when 𝐻1
is true) is different from its probability distribution under the null.

The test statistic is usually based on an estimator of the parameter, and is


usually constructed so that it is typically close to zero when the null is true,
and far from zero when the null is false. But it does not need to be.

Example 8.4. A test statistic for roulette



A natural test statistic for the win probability of a bet on red would be the corresponding win frequency in our data. We could use either the relative win frequency (which also happens to be the sample average):

$$\hat{f}_{red} = \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

but it will be more convenient to use the absolute win frequency:

$$t_n = n\hat{f}_{red} = n\bar{x}_n = \sum_{i=1}^{n} x_i$$

Next we need to find the probability distribution of 𝑡𝑛 under the null, and under
the alternative.
In general, since 𝑥𝑖 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝𝑟𝑒𝑑 ) we have:

𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝𝑟𝑒𝑑 )

Remember that the 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝) distribution is the distribution corresponding to the number of times an event with probability 𝑝 happens in 𝑛 independent trials.
Under the null (when 𝐻0 is true), 𝑝𝑟𝑒𝑑 = 18/37 and so:

𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37)

Since this distribution does not involve any unknown parameters, our test statistic satisfies the requirement of having a known distribution under the null.
Under the alternative (when 𝐻1 is true), 𝑝𝑟𝑒𝑑 can take on any value other than
18/37. The sample size is still 𝑛 = 100, so the distribution of the test statistic
is:
𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 𝑝𝑟𝑒𝑑 ) where 𝑝𝑟𝑒𝑑 ≠ 18/37
Notice that the distribution of our test statistic under the alternative is not
known, since 𝑝𝑟𝑒𝑑 is not known. But the distribution is different under the
alternative, and that is what we require from our test statistic.

8.2.3 Significance and critical values

After choosing a test statistic 𝑡𝑛 , the next step is to choose critical values.
The critical values are two numbers 𝑐𝐿 and 𝑐𝐻 (where 𝑐𝐿 < 𝑐𝐻 ) such that

• 𝑡𝑛 has a high probability of being between 𝑐𝐿 and 𝑐𝐻 when the null is true.
• 𝑡𝑛 has a lower probability of being between 𝑐𝐿 and 𝑐𝐻 when the alternative
is true.

The range of values from 𝑐𝐿 to 𝑐𝐻 is called the critical range of our test.
Given the test statistic and critical values:

• We reject the null if 𝑡𝑛 is outside of the critical range.


– This means we have strong evidence that 𝐻0 is false.
– The reason we reject here is that we know we would be unlikely to
observe such a value of 𝑡𝑛 if 𝐻0 were true.
• We fail to reject the null if 𝑡𝑛 is inside of the critical range.
– This means we do not have strong evidence that 𝐻0 is false.
– This does not mean we have strong evidence that 𝐻0 is true. We
may just not have enough evidence to reach a conclusion.

Notice that there is an asymmetry here: in the absence of evidence, we will not
reject any null hypotheses.

Figure 8.1: Critical values and rejecting the null hypothesis

How do we choose critical values? You can think of critical values as setting a
standard of evidence, so we need to balance two considerations:

• The probability of rejecting a false null is called the power of the test.
– We want our test to reject the null when it is false, so power is good.
• The probability of rejecting a true null is called the size or significance
of a test.
– We do not want our test to reject the null when it is true, so size is
bad.
• There is always a trade off between power and size
– A narrower critical range (higher 𝑐𝐿 or lower 𝑐𝐻 ) will increase the
rejection rate, increasing both power (good) and size (bad).
– A wider critical range (lower 𝑐𝐿 or higher 𝑐𝐻 ) will reduce the rejection
rate, reducing both power (bad) and size (good).

Given this trade off between power and size, we might construct some criterion
that includes both (just like MSE includes both variance and bias) and choose
critical values to maximize that criterion. In practice, we do not typically do
that.
Instead, we follow a simple convention:

1. Set the size to a fixed value 𝛼.


• In general, the conventional size varies by field, and typically varies
with how much data is typical in that field.
• In economics and most other social sciences, the usual convention is
to use a size of 5% (𝛼 = 0.05).
• We sometimes see 1% (𝛼 = 0.01) when working with larger data sets
or 10% (𝛼 = 0.10) when working with small data sets.
• In physics or genetics, where data sets are much larger, the conven-
tional size is much lower.
2. Calculate critical values that imply the desired size.
• With a size of 5% (𝛼 = 0.05), we would:
– set 𝑐𝐿 to the 2.5 percentile (0.025 quantile) of the null distribu-
tion
– set 𝑐𝐻 to the 97.5 percentile (0.975 quantile) of the null distri-
bution
• With a size of 10% (𝛼 = 0.10), we would:
– set 𝑐𝐿 to the 5 percentile (0.05 quantile) of the null distribution
– set 𝑐𝐻 to the 95 percentile (0.95 quantile) of the null distribution
• With a size of 𝛼, we would:
– set 𝑐𝐿 to the 𝛼/2 quantile of the null distribution
– set 𝑐𝐻 to the 1 − 𝛼/2 quantile of the null distribution

In other words, we set size equal to a conventional value, and let the power be
whatever is implied by that.
Example 8.5. Critical values for roulette
We earlier showed that the distribution of 𝑡𝑛 under the null is:

𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37)

We can get a size of 5% by choosing:

𝑐𝐿 = 2.5 percentile of 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37)

𝑐𝐻 = 97.5 percentile of 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37)


We can then use Excel or R to calculate these critical values. In Excel, the
function you would use is BINOM.INV()

• The formula to calculate 𝑐𝐿 is =BINOM.INV(100,18/37,0.025)


• The formula to calculate 𝑐𝐻 is =BINOM.INV(100,18/37,0.975)

The calculations below were done in R:

## 2.5 percentile of binomial(100,18/37) = 39


## 97.5 percentile of binomial(100,18/37) = 58

In other words we reject the null (at 5% significance) that the roulette wheel is
fair if red wins fewer than 39 games or more than 58 games.
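In R, the analogous function is qbinom(); a sketch of the calculation that presumably produced the output above:

```r
# Critical values for the roulette test in R
cL <- qbinom(0.025, size = 100, prob = 18/37)   # 39
cH <- qbinom(0.975, size = 100, prob = 18/37)   # 58
```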

FYI
A general test for a single probability
We can generalize the test we have constructed so far to the case of the
probability of any event:

| Test component | Roulette example | General case |
|---|---|---|
| Parameter | 𝑝𝑟𝑒𝑑 = Pr(Red wins) | 𝑝 = Pr(event) |
| Null hypothesis | 𝐻0 ∶ 𝑝𝑟𝑒𝑑 = 18/37 | 𝐻0 ∶ 𝑝 = 𝑝0 |
| Alternative hypothesis | 𝐻1 ∶ 𝑝𝑟𝑒𝑑 ≠ 18/37 | 𝐻1 ∶ 𝑝 ≠ 𝑝0 |
| Test statistic | 𝑡 = 𝑛𝑓𝑟𝑒𝑑̂ | 𝑡 = 𝑛𝑓event̂ |
| Null distribution | 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37) | 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝0 ) |
| Critical value 𝑐𝐿 | 39 | 2.5 percentile of 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝0 ) |
| Critical value 𝑐𝐻 | 58 | 97.5 percentile of 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝0 ) |
| Decision | Reject if 𝑡 ∉ [39, 58] | Reject if 𝑡 ∉ [𝑐𝐿 , 𝑐𝐻 ] |

8.2.4 The power of a test

As mentioned above, the power of a test is defined as the probability of rejecting


the null when it is false, and the alternative is true.
The size of a test is a number, since the distribution of the test statistic is known
under the null. Since the alternative typically allows more than one value of the
parameter 𝜃, the power of a test is not a number but a function of the unknown
true value of 𝜃 (and sometimes other unknown features of the DGP):

𝑝𝑜𝑤𝑒𝑟(𝜃) = Pr(reject 𝐻0 )

In some cases we can actually calculate this function.

Example 8.6. The power curve for roulette



Power curves can be tricky to calculate, and I will not ask you to calculate them
for this course. But they can be calculated, and it is useful to see what they
look like.
Figure 8.2 below depicts the power curve for the roulette test we have just
constructed; that is, we are testing the null that 𝑝𝑟𝑒𝑑 = 18/37 at a 5% size. The
blue line depicts the power curve for 𝑛 = 100 as in our example, while the green
line depicts the power curve for 𝑛 = 20.

Figure 8.2: Power curves for the roulette example (power vs. true probability of winning, for n = 100 and n = 20; H0: Pr(red wins) = 18/37, significance = 0.05)

There are a few features I would like you to notice, all of which are common to
most regularly used tests:

• The power curve reaches its lowest value at the red point (18/37, 0.05).
Note that 18/37 is the parameter value under the null, and 0.05 is the size
of the test. In other words:
– The power is always at least as big as the size, and is usually bigger.
– We are more likely to reject the null when it is false than when it is
true. That’s good!
– When a test has this desirable property, we call it an unbiased test.
• The power increases as 𝜃 gets further from the null.
– That is, we are more likely to detect unfairness in a game that is very
unfair than when in one that is a little unfair.

• Power also increases with the sample size; the blue line (𝑛 = 100) is above
the green line (𝑛 = 20).

Power analysis is often used by researchers to determine how much data to


collect. Each additional observation increases power but costs money, so it is
important to spend enough to get clear results but not much more than that.

FYI
P values
The convention of always using a 5% significance level for hypothesis tests
is somewhat arbitrary and has some negative unintended consequences:

1. Sometimes a test statistic falls just below or just above the critical
value, and small changes in the analysis can change a result from
reject to cannot-reject.
2. In many fields, unsophisticated researchers and journal editors misinterpret “cannot reject the null” as “the null is true.”

One common response to these issues is to report what is called the p-value of a test. The p-value of a test is defined as the significance level at which one would switch from rejecting to not-rejecting the null. For example:

• If the p-value is 0.43 (43%) we would not reject the null at 10%,
5% or 1%.
• If the p-value is 0.06 (6%) we would reject the null at 10% but not
at 5% or 1%.
• If the p-value is 0.02 (2%) we would reject the null at 10% and 5%
but not at 1%.
• If the p-value is 0.001 (0.1%) we would reject the null at 10%, 5%,
and 1%.

The p-value of a test is simple to calculate from the test statistic and its
distribution under the null. I won’t go through that calculation here.
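For readers who want to try it anyway, here is a sketch of my own (not the book's calculation) using one common convention for a discrete two-sided test: double the smaller tail probability of the test statistic under the null. For the roulette test with, say, 40 observed red wins:

```r
# Two-tailed p-value sketch for t ~ Binomial(100, 18/37) under H0
# (doubles the smaller tail; other conventions exist for discrete tests)
t_obs <- 40
p_low <- pbinom(t_obs, size = 100, prob = 18/37)   # Pr(t <= 40)
p_value <- 2 * min(p_low, 1 - p_low)
p_value
```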

8.3 The central limit theorem

In order for a test statistic to work, its exact probability distribution must
be known under the null hypothesis. The example test in the previous section
worked because it was based on a sample frequency, a statistic whose probability
distribution is relatively easy to calculate. Unfortunately, most statistics do not
have a probability distribution that is easy to calculate.

Fortunately, we have a very powerful asymptotic result called the Central


Limit Theorem (CLT). The CLT roughly says that we can approximate the
entire probability distribution of the sample average 𝑥𝑛̄ by a normal distribution
if the sample size is sufficiently large.
As with the LLN, we need to invest in some terminology before we can state the
CLT. Let 𝑠𝑛 be a statistic calculated from 𝐷𝑛 and let 𝐹𝑛 (⋅) be its CDF. We say
that 𝑠𝑛 converges in distribution to a random variable 𝑠 with CDF 𝐹 (⋅), or
$$s_n \to_D s$$

if

$$\lim_{n \to \infty} |F_n(a) - F(a)| = 0$$

for every 𝑎 ∈ ℝ.
Convergence in distribution means we can approximate the actual CDF 𝐹𝑛 (⋅)
of 𝑠𝑛 with its limit 𝐹 (⋅). As with most approximations, this is useful whenever
𝐹𝑛 (⋅) is difficult to calculate and 𝐹 (⋅) is easy to calculate.

FYI
CENTRAL LIMIT THEOREM: Let 𝑥𝑛̄ be the sample average from a random sample of size 𝑛 on the random variable 𝑥𝑖 with mean 𝐸(𝑥𝑖 ) = 𝜇𝑥 and variance 𝑣𝑎𝑟(𝑥𝑖 ) = 𝜎𝑥2 , and let $z_n = (\bar{x}_n - \mu_x)/(\sigma_x/\sqrt{n})$. Then

$$z_n \to_D z \sim N(0, 1)$$

What does the central limit theorem mean?

• Fundamentally, it means that if 𝑛 is big enough then the probability distribution of 𝑥𝑛̄ is approximately normal no matter what the original distribution of 𝑥𝑖 looks like.
• In order for the CLT to apply, we need to re-scale 𝑥𝑛̄ so that it has zero mean (by subtracting 𝐸(𝑥𝑛̄ ) = 𝜇𝑥 ) and constant variance as 𝑛 increases (by dividing by 𝑠𝑑(𝑥𝑛̄ ) = 𝜎𝑥 /√𝑛). That re-scaled sample average is 𝑧𝑛 .
• In practice, we don’t usually know 𝜇𝑥 or 𝜎𝑥 so we can’t calculate 𝑧𝑛 from data. Fortunately, there are some tricks for getting around this problem that we will talk about later.
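A quick R simulation (my own illustration) shows the CLT in action: even though each 𝑥𝑖 here is Bernoulli, about as non-normal as a random variable gets, the standardized sample average 𝑧𝑛 behaves like a 𝑁 (0, 1) variable.

```r
# CLT illustration: standardized averages of Bernoulli(18/37) draws
set.seed(321)
n <- 200
p <- 18/37
mu <- p
sigma <- sqrt(p * (1 - p))
z <- replicate(10000, (mean(rbinom(n, 1, p)) - mu) / (sigma / sqrt(n)))
c(mean(z), sd(z))     # close to 0 and 1
hist(z, breaks = 50)  # roughly bell-shaped
```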

What about statistics other than the sample average? Well it turns out that
Slutsky’s theorem also extends to convergence in distribution.

FYI

SLUTSKY THEOREM: Let 𝑔(⋅) be a continuous function. Then:

𝑠𝑛 →𝐷 𝑠 ⟹ 𝑔(𝑠𝑛 ) →𝐷 𝑔(𝑠)

The implication here is that nearly all statistics have a sampling distribution
that can be approximated using the normal distribution if the sample size is
large enough.

8.4 Inference on the mean

Having described the general framework and a single example, we now move on
to the most common application: constructing hypothesis tests and confidence
intervals on the mean in a random sample.
Let 𝐷 = (𝑥1 , … , 𝑥𝑛 ) be a random sample of size 𝑛 on some random variable 𝑥𝑖
with unknown mean 𝐸(𝑥𝑖 ) = 𝜇𝑥 and variance 𝑣𝑎𝑟(𝑥𝑖 ) = 𝜎𝑥2 .
Let the sample average be:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

let the sample variance be:

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

and let the sample standard deviation be:

$$s_x = \sqrt{s_x^2}$$

These statistics are easily calculated from the data, and we have previously
discussed their properties in detail.

8.4.1 The null and alternative hypotheses

Suppose that you want to test the null hypothesis:

𝐻0 ∶ 𝜇 𝑥 = 1

against the alternative hypothesis

𝐻1 ∶ 𝜇 𝑥 ≠ 1

Having stated our null and alternative hypotheses, we need to construct a test
statistic.
Remember that our test statistic needs to have a known distribution under the
null, and a different distribution under the alternative.

8.4.2 The T statistic

The typical test statistic we use in this setting is called the T statistic, and takes the form:

$$t_n = \frac{\bar{x}_n - 1}{s_x/\sqrt{n}}$$

The idea here is that we take our estimate of the parameter ($\bar{x}_n$), subtract its expected value under the null (1), and divide by an estimate of its standard deviation ($s_x/\sqrt{n}$). We can add and subtract the unknown true mean $\mu_x$ to get:

$$t_n = \frac{\bar{x}_n - \mu_x + \mu_x - 1}{s_x/\sqrt{n}} \qquad (8.1)$$

$$= \frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} + \frac{\mu_x - 1}{s_x/\sqrt{n}} \qquad (8.2)$$

The first part of this expression is a random variable with a mean of zero and a
variance of (about) one. The second part of the expression is exactly zero when
𝐻0 is true, and not exactly zero when it is false.
Recall that we need the probability distribution of $t_n$ to be known when $H_0$ is true, and different when it is false. The second criterion is clearly met, and the first criterion is met if we can find the probability distribution of $\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}}$.

Unfortunately, if we don’t know the exact probability distribution of 𝑥𝑖 we don’t


know the exact probability distribution of statistics calculated from it. Once we
have a potential test statistic, there are two standard solutions to this problem:

1. Assume a specific probability distribution (usually a normal distribution) for 𝑥𝑖 . We can (or at least a professional statistician can) then mathematically derive the distribution of any test statistic from this distribution.
2. Use the central limit theorem to get an approximate probability distribution.

We will explore both of those options.

8.4.3 Asymptotic critical values

We will start with the asymptotic solution to the problem. As we learned in


Chapter 7, the Central Limit Theorem tells us that:
$$\frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}} \to N(0, 1)$$

Under the null our test statistic looks just like this, but with the sample standard
deviation 𝑠𝑥 in place of the population standard deviation 𝜎𝑥 . It turns out that

Slutsky’s theorem allows us to make this substitution, and it can be proved that:

$$\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} \to N(0, 1)$$

Therefore, under the null:

$$t_n \to N(0, 1)$$
In other words, while we do not know the exact (finite-sample) distribution of
𝑡𝑛 we know that 𝑁 (0, 1) provides a useful asymptotic approximation to that
distribution.

Figure 8.3: Asymptotic distribution of t_n under the null (PDF of t_n vs. value)

Therefore, if we want a test that has the asymptotic size of 5%, we can use
Excel or R to calculate critical values. In Excel, the function would be NORM.INV
or NORM.S.INV, and the formulas would be:

• 𝑐𝐿 : =NORM.S.INV(0.025) or =NORM.INV(0.025,0,1).
• 𝑐𝐻 : =NORM.S.INV(0.975) or =NORM.INV(0.975,0,1).

The calculations below were done in R:

## cL = 2.5 percentile of N(0,1) = -1.96


## cH = 97.5 percentile of N(0,1) = 1.96

These particular critical values are so commonly used that I want you to remember them.
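In R, the corresponding function is qnorm():

```r
# Asymptotic 5% critical values in R
qnorm(0.025)   # cL = -1.96
qnorm(0.975)   # cH =  1.96
```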

8.4.4 Exact critical values


Most economic data comes in sufficiently large samples that the asymptotic
distribution of 𝑡𝑛 is a reasonable approximation and the asymptotic test works
well. But occasionally we have samples that are small enough that it doesn’t.
Another option is to assume that the 𝑥𝑖 variables are normally distributed:
𝑥𝑖 ∼ 𝑁 (𝜇𝑥 , 𝜎𝑥2 )
where 𝜇𝑥 and 𝜎𝑥2 are unknown parameters.
Now at this point it is important to remind you: many interesting variables are
not normally distributed (for example, our roulette outcome is discrete uniform,
and the result of a given bet is Bernoulli) and so this assumption may very well
be incorrect.
If this was a more advanced course I would derive the distribution of 𝑡𝑛 under
the null. But for ECON 233, I will just ask you to understand that it can be
derived once we assume normality of the 𝑥𝑖 .
The null distribution of this particular test statistic under these particular assumptions was derived in the 1920s by William Sealy Gosset, a statistician working at the Guinness brewery. To avoid getting in trouble at work (Guinness did not want to give away trade secrets), Gosset published under the pseudonym “Student”. As a result, the family of distributions he derived is called “Student’s T distribution”.
When the null is true, the test statistic 𝑡𝑛 as described above has Student’s T
distribution with 𝑛 − 1 degrees of freedom:
𝑡𝑛 ∼ 𝑇𝑛−1
As always, it has a different distribution (sometimes called the “noncentral T
distribution”) when the null is false.
The 𝑇𝑛−1 distribution looks a lot like the 𝑁 (0, 1) distribution, but has slightly
higher probability of extreme positive or negative values. As 𝑛 increases the
𝑇𝑛−1 distribution converges to the 𝑁 (0, 1) distribution, just as predicted by the
central limit theorem.
Having found our test statistic and its distribution under the null, we can calculate our critical values:

𝑐𝐿 = 2.5 percentile of 𝑇𝑛−1
𝑐𝐻 = 97.5 percentile of 𝑇𝑛−1
We can obtain these percentiles using Excel or R. In Excel, the relevant function
is T.INV. For example, if we have 𝑛 = 5 observations, then:

Figure 8.4: Exact distribution of t_n under the null (PDF vs. value, for n = 5, 10, 30, and infinity; x assumed to be normally distributed)

• We would calculate 𝑐𝐿 by the formula =T.INV(0.025,5-1).


• We would calculate 𝑐𝐻 by the formula =T.INV(0.975,5-1).

The results (calculated below using R) would be:

## cL = 2.5 percentile of T_4 = -2.776


## cH = 97.5 percentile of T_4 = 2.776

In contrast, if we have 30 observations, then:

• We would calculate 𝑐𝐿 by the formula =T.INV(0.025,30-1).


• We would calculate 𝑐𝐻 by the formula =T.INV(0.975,30-1).

The results (calculated below using R) would be:

## cL = 2.5 percentile of T_29 = -2.045


## cH = 97.5 percentile of T_29 = 2.045

and if we have 1,000 observations:

• We would calculate 𝑐𝐿 by the formula =T.INV(0.025,1000-1).



• We would calculate 𝑐𝐻 by the formula =T.INV(0.975,1000-1).

The results (calculated below using R) would be:

## cL = 2.5 percentile of T_999 = -1.962


## cH = 97.5 percentile of T_999 = 1.962
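The R equivalent of Excel's T.INV is qt(); the values above can be reproduced with:

```r
# Exact critical values from Student's T distribution in R
# (cL is just the negative of cH, by symmetry)
qt(0.975, df = 5 - 1)      # 2.776  (n = 5)
qt(0.975, df = 30 - 1)     # 2.045  (n = 30)
qt(0.975, df = 1000 - 1)   # 1.962  (n = 1000)
```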

8.4.5 Which test to use?


Statisticians often call the finite-sample test the T test and the asymptotic
test the Z test, as a result of the notation typically used to represent the test
statistic. The two tests have the same underlying test statistic, but different
critical values.
Which test should we use in practice? The T test is a more conservative
test than the Z test, meaning it has larger critical values and is less likely to
reject the null. As a result it has lower power and lower size. But at some point
(around 𝑛 = 30) the difference between the two tests becomes too small to make
much of a difference.
As a result, statisticians typically recommend using the T test for smaller samples (less than 30 or so), and then using whichever test is more convenient with larger samples. Most data sets in economics have well over 30 observations, so economists tend to use asymptotic tests unless they have a very small sample.
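In practice, base R bundles the whole T test into a single function, t.test(). Here is a minimal sketch on a hypothetical data vector, testing 𝐻0 ∶ 𝜇𝑥 = 1:

```r
# Full T test of H0: mu = 1 on a hypothetical data vector
x <- c(0.8, 1.4, 1.1, 0.9, 1.6, 1.2)
t.test(x, mu = 1)   # reports the statistic, exact p-value, and 95% CI
# The same test statistic computed by hand:
(mean(x) - 1) / (sd(x) / sqrt(length(x)))
```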

8.5 Confidence intervals


Hypothesis tests have one very important limitation: although they allow us to
rule out 𝜃 = 𝜃0 for a single value of 𝜃0 , they say nothing about other values very
close to 𝜃0 .
For example, suppose you are a medical researcher trying to measure the effect of a particular treatment, let 𝜃 be the treatment effect, and suppose that you have tested the null hypothesis that the treatment has no effect (𝜃 = 0).

• If you reject this null, you have concluded that the treatment has some effect. However, that does not rule out the possibility that the effect of the treatment is very small.
• If you fail to reject this null, you cannot rule out the possibility that the
treatment has no effect. However, this does not rule out the possibility
that the effect is very large.

The solution to this would be to do a hypothesis test for every possible value of
𝜃, and classify them into values that were rejected and not rejected. This is the
idea of a confidence interval.

A confidence interval for the parameter 𝜃 with coverage rate 𝐶𝑃 is an interval


with lower bound 𝐶𝐼𝐿 and upper bound 𝐶𝐼𝐻 constructed from the data in such
a way that
Pr(𝐶𝐼𝐿 < 𝜃 < 𝐶𝐼𝐻 ) = 𝐶𝑃
In economics and most other social sciences, the convention is to report confi-
dence intervals with a coverage rate of 95%.

Pr(𝐶𝐼𝐿 < 𝜃 < 𝐶𝐼𝐻 ) = 0.95

Note that 𝜃 is a fixed (but unknown) parameter, while 𝐶𝐼𝐿 and 𝐶𝐼𝐻 are statis-
tics calculated from the data.
How do we calculate confidence intervals? It turns out to be entirely straightforward: confidence intervals can be constructed by inverting hypothesis tests:

• The 95% confidence interval is all values that cannot be rejected at a 5%


level of significance.
• The 90% confidence interval is all values that cannot be rejected at a 10% level of significance.
– It is narrower than the 95% confidence interval.
• The 99% confidence interval is all values that cannot be rejected at a 1%
level of significance.
– It is wider than the 95% confidence interval.

Example 8.7. A confidence interval for the win probability


Calculating a confidence interval for 𝑝𝑟𝑒𝑑 is somewhat tricky to do by hand, but
easy to do on a computer:

1. Construct a grid of many values between 0 and 1.


2. For each value 𝑝0 in the grid, test the null hypothesis 𝐻0 ∶ 𝑝𝑟𝑒𝑑 = 𝑝0
against the alternative hypothesis 𝐻1 ∶ 𝑝𝑟𝑒𝑑 ≠ 𝑝0 .
3. The confidence interval is the range of values for 𝑝0 that are not rejected.

For example, suppose that red wins on 40 of the 100 games. Then a 95%
confidence interval for 𝑝𝑟𝑒𝑑 is:

## 0.32 to 0.49

Notice that the confidence interval includes the fair value of 0.486 but it also
includes some very unfair values. In other words, while we are unable to rule
out the possibility that we have a fair game, the evidence that we have a fair
game is not very strong.
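Here is a sketch of that grid procedure in R (my own implementation of the steps above, with the same 40-wins-in-100-games numbers):

```r
# 95% CI for p_red by inverting the binomial test (40 wins in 100 games)
wins <- 40
n <- 100
grid <- seq(0, 1, by = 0.001)
not_rejected <- sapply(grid, function(p0) {
  cL <- qbinom(0.025, size = n, prob = p0)
  cH <- qbinom(0.975, size = n, prob = p0)
  wins >= cL && wins <= cH        # fail to reject H0: p_red = p0
})
range(grid[not_rejected])         # roughly 0.32 to 0.49
```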

8.5.1 Confidence intervals for the mean

Confidence intervals for the mean are very easy to calculate. Again we construct
them by inverting the hypothesis test.
Pick any 𝜇0 . To test the null

𝐻0 ∶ 𝜇 𝑥 = 𝜇 0

our test statistic is:

$$t_n = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s_x}$$
and we fail to reject the null if

𝑐𝐿 < 𝑡 𝑛 < 𝑐 𝐻

where 𝑐𝐿 and 𝑐𝐻 are our critical values.


Plugging 𝑡𝑛 into this expression, we fail to reject the null whenever:

$$c_L < \sqrt{n}\,\frac{\bar{x} - \mu_0}{s_x} < c_H$$

Solving for 𝜇0 we fail to reject whenever:

$$\bar{x} - c_H\,s_x/\sqrt{n} < \mu_0 < \bar{x} - c_L\,s_x/\sqrt{n}$$

All that remains is to choose a confidence/size level, and decide whether to use
an asymptotic or finite sample test.
If we are using the asymptotic approximation to construct a 95% confidence interval, then the 5% asymptotic critical values are 𝑐𝐿 ≈ −1.96 and 𝑐𝐻 ≈ 1.96 and the confidence interval is:

$$CI = \bar{x} \pm 1.96\,s_x/\sqrt{n}$$

In other words, the 95% confidence interval for 𝜇𝑥 is just the point estimate
plus or minus roughly 2 standard errors.
If we have a small sample, and choose to assume normality rather than using
the asymptotic approximation, then we need to use the slightly larger critical
values from the 𝑇𝑛−1 distribution. For example, if 𝑛 = 5, then 𝑐𝐿 ≈ −2.78,
𝑐𝐻 ≈ 2.78 and the 95% confidence interval is:

$$CI = \bar{x} \pm 2.78\,s_x/\sqrt{n}$$

As with hypothesis tests, finite sample confidence intervals are typically more
conservative (wider) than their asymptotic cousins, but the difference becomes
negligible as the sample size increases.
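As an illustration, here is how both intervals might be computed in R from summary statistics (the numbers are hypothetical):

```r
# 95% confidence intervals for the mean from summary statistics
x_bar <- 4; s_x <- 0.3; n <- 16                            # hypothetical
x_bar + c(-1, 1) * qnorm(0.975) * s_x / sqrt(n)            # asymptotic CI
x_bar + c(-1, 1) * qt(0.975, df = n - 1) * s_x / sqrt(n)   # exact CI
```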

Chapter review
In this chapter we have learned to formulate and test hypotheses, and to construct confidence intervals. The mechanics of doing so are complicated, but you should not let the various formulas distract you from the more basic idea of evidence: hypothesis testing is about how strong the evidence is in favor of (or against) a particular true/false statement about the data generating process, and confidence intervals are about finding a range of values for a parameter that are consistent with the observed data.

In practice, modern statistical packages automatically calculate and report confidence intervals for most estimates, and report the result of some basic hypothesis tests as well. When you need something more complicated, it is usually just a matter of looking up the command. I will ask you to do these calculations yourself so you get used to them, but it is more important that you can correctly interpret the results.
This is the last primarily theoretical chapter. The remaining chapters will be
oriented towards data and applications.

Practice problems
Answers can be found in the appendix.
SKILL #1: Identify parameter, null and alternative

1. Suppose we have a research study of the effect of the minimum wage on


employment. Let 𝛽 be the parameter defining that effect. Formally state
a null hypothesis corresponding to the idea that the minimum wage has
no effect on employment, and state the alternative hypothesis as well.

SKILL #2: Classify test statistics as valid or invalid

2. Which of the following characteristics do test statistics need to possess?


a. The distribution of the test statistic is known under the null.
b. The distribution of the test statistic is known under the alternative.
c. The test statistic has the same distribution whether the null is true
or false.
d. The test statistic has a different distribution when the null is true
versus when the null is false.
e. The test statistic needs to be a number that can be calculated from
the data.
f. The test statistic needs to have a normal distribution.
g. The test statistic’s value depends on the true value of the parameter.

SKILL #3: Find distribution of test statistic under the null

3. Suppose we have a random sample of size 𝑛 on the random variable 𝑥𝑖 with


unknown mean 𝜇 and unknown variance 𝜎2 . The conventional T-statistic
for the mean is defined as
$$t = \frac{\bar{x} - \mu_0}{\hat{\sigma}_x/\sqrt{n}}$$
where 𝑥̄ is the sample average, 𝜇0 is the value of 𝜇 under the null, and 𝜎̂𝑥
is the sample standard deviation.
a. What needs to be true in order for 𝑡 to have the 𝑇𝑛−1 distribution
under the null?
b. What is the asymptotic distribution of 𝑡?

SKILL #4: Describe distribution of test statistic under the alterna-


tive

4. Consider the setting from problem 3 above, and suppose that the true
value of 𝜇 is some number 𝜇1 ≠ 𝜇0 . Write an expression describing 𝑡 as
the sum of (a) a random variable that has the 𝑇𝑛−1 distribution and (b)
a random variable that is proportional to 𝜇1 − 𝜇0 .

SKILL #5: Find the size of a test

5. Suppose that we have a random sample of size 𝑛 = 14 on the random


variable 𝑥𝑖 ∼ 𝑁 (𝜇, 𝜎2 ). We wish to test the null hypothesis 𝐻0 ∶ 𝜇 = 0.
Suppose we use the standard t-statistic:
$$t = \frac{\bar{x} - \mu_0}{\hat{\sigma}_x/\sqrt{n}}$$

a. Suppose we use critical values 𝑐𝐿 = −1.96 and 𝑐𝐻 = 1.96. Use Excel


to calculate the exact size of this test.
b. Suppose we use critical values 𝑐𝐿 = −1.96 and 𝑐𝐻 = 1.96. Use Excel
to calculate the asymptotic size of this test.
c. Suppose we use critical values 𝑐𝐿 = −3 and 𝑐𝐻 = 2. Use Excel to
calculate the exact size of this test.
d. Suppose we use critical values 𝑐𝐿 = −3 and 𝑐𝐻 = 2. Use Excel to
calculate the asymptotic size of this test.
e. Suppose we use critical values 𝑐𝐿 = −∞ and 𝑐𝐻 = 1.96. Use Excel
to calculate the exact size of this test.
f. Suppose we use critical values 𝑐𝐿 = −∞ and 𝑐𝐻 = 1.96. Use Excel
to calculate the asymptotic size of this test.

SKILL #6: Find critical values



6. Suppose that we have a random sample of size 𝑛 = 18 on the random


variable 𝑥𝑖 ∼ 𝑁 (𝜇, 𝜎2 ). We wish to test the null hypothesis 𝐻0 ∶ 𝜇 = 0.
Suppose we use the standard t-statistic:
$$t = \frac{\bar{x} - \mu_0}{\hat{\sigma}_x/\sqrt{n}}$$

a. Use Excel to calculate the (two-tailed) critical values that produce
an exact size of 1%.
b. Use Excel to calculate the (two-tailed) critical values that produce
an exact size of 5%.
c. Use Excel to calculate the (two-tailed) critical values that produce
an exact size of 10%.
d. Use Excel to calculate the (two-tailed) critical values that produce
an asymptotic size of 1%.
e. Use Excel to calculate the (two-tailed) critical values that produce
an asymptotic size of 5%.
f. Use Excel to calculate the (two-tailed) critical values that produce
an asymptotic size of 10%.

SKILL #7: Construct and interpret a confidence interval

7. Suppose we have a random sample of size 𝑛 = 16 on the random variable
𝑥𝑖 ∼ 𝑁(𝜇, 𝜎²), and we calculate the sample average 𝑥̄ = 4 and the sample
standard deviation 𝜎̂ = 0.3.
a. Use Excel to calculate the 95% (exact) confidence interval for 𝜇.
b. Use Excel to calculate the 90% (exact) confidence interval for 𝜇.
c. Use Excel to calculate the 99% (exact) confidence interval for 𝜇.
d. Use Excel to calculate the 95% asymptotic confidence interval for 𝜇.
e. Use Excel to calculate the 90% asymptotic confidence interval for 𝜇.
f. Use Excel to calculate the 99% asymptotic confidence interval for 𝜇.

SKILL #8: Interpret test results and confidence intervals

8. Suppose you estimate the effect of a university degree on earnings at age
30, and you test the null hypothesis that this effect is zero. You conduct
a test at the 5% level of significance, and reject the null. Based on this
information, classify each of these statements as “probably true”, “possibly
true”, or “probably false”:
a. A university degree has no effect on earnings.
b. A university degree has some effect on earnings.
c. A university degree has a large effect on earnings.
9. Suppose you estimate the effect of a university degree on earnings at age
30, and you test the null hypothesis that this effect is zero. You conduct
a test at the 5% level of significance, and fail to reject the null. Based
on this information, classify each of these statements as “probably true”,
“possibly true”, or “probably false”:
a. A university degree has no effect on earnings.
b. A university degree has some effect on earnings.
c. A university degree has a large effect on earnings.
10. Suppose you estimate the effect of a university degree on earnings at age
30, and your 95% confidence interval for the effect is (0.10, 0.40), where
an effect of 0.10 means a degree increases earnings by 10% and an effect
of 0.40 means that a degree increases earnings by 40%. Based on this
information, classify each of these statements as “probably true”, “possibly
true”, or “probably false”:
a. A university degree has no effect on earnings.
b. A university degree has some effect on earnings.
c. A university degree has a large effect on earnings, where “large”
means at least 10%.
d. A university degree has a very large effect on earnings, where “very
large” means at least 50%.
Chapter 9

An introduction to R

As we have seen, Excel is a useful tool for both cleaning and analyzing data. R
is an application that has many of the same features as Excel, but is specially
designed for statistical analysis. It is a little more complex, but more powerful
in many important ways. This chapter will introduce you to some of the basic
concepts of R and associated tools such as R Markdown, RStudio, and the
Tidyverse. We will later use these tools to read and analyze data, and to create
publication-quality graphs that are well beyond what can be done in Excel.

Goals
Chapter goals
In this chapter we will learn how to:

• Write and execute some simple R commands in the console window,
a script, and an R Markdown document.
• Perform simple calculations in R.
• Manage R by installing and loading libraries, and opening and
closing files.

In this course, we will only have time to learn a little bit about R, so I will
not give a comprehensive treatment. My goal here is primarily to introduce
you to the terminology and concepts of R, and to show you a few applications
where R outshines Excel. You will learn much more about R in ECON 333 and
(if you take it) ECON 334.


9.1 A brief tour of RStudio

Start the program RStudio. You should see something that looks like this:

You may wonder what the difference is between R and RStudio.

• R is a programming language designed for statistical analysis.


• R is also the computer program that runs R commands.

– It can also run R scripts, which are just a series of R commands
written in a text file.

• RStudio is an integrated development environment (IDE) for R. It combines
R with a set of additional useful tools:

– an interactive session of R (running in the “Console” window).


– a text editor for writing R scripts and R Markdown documents.
– tools for managing files and packages used by R
– tools for comparing and combining scripts and other files
– help and documentation
– many other features

You can run commands and scripts in R itself, but without RStudio you won’t
have all these handy extra features. So most people these days use RStudio or
another IDE.
RStudio normally displays three or four open windows, each of which has tabs
you can select to access different features. We will not use most of them, but
some of them will be very handy indeed.

9.1.1 The console window


Like most programming languages, R is designed to execute a series of com-
mands provided by the user. The simplest way to have R execute a command
is by entering it into the Console window in the lower left corner.
Example 9.1. Using the console window
Move your cursor into the console window, type the command print("Hello
world!") and press the Enter key to execute the command.

print("Hello world!")
## [1] "Hello world!"

As you type your command in, you may notice that RStudio shows various
pop-ups with helpful information about the command. It can also auto-complete
your command for you.

R maintains a command history that remembers commands you have previously
entered. This is useful when you did something a while ago, but either
don’t remember exactly how you did it, or don’t want to type it all in from the
beginning.
The simplest way of accessing recent commands is to press the up-arrow key
while in the console window.
Example 9.2. Accessing the command history
Suppose you decide you want to say “Hello [your name]!” instead of “Hello
world”, and you don’t want to type in the whole command. Then you can:

1. Press the up-arrow key in the Console window to show the most recently
executed command. If you press it a second time it gives you the command
before that, and so on.
2. Look at the History window in the upper right corner to see a full list
of recently executed commands. You can double-click on any command
in the window to copy it to the Console window.

Once you have copied the previous command, you can edit it before pressing
<enter>.

9.1.2 Scripts
The Console window is ideal for simple tasks and experimentation, and we will
continue using it regularly. But in order to create reproducible research and take
full advantage of R’s capabilities, we will need to write and execute scripts.
A script is just a text file containing a sequence of R commands. By convention,
an R script should have the .R extension, but any text file will work.

Example 9.3. Creating an R script

To create an R script

1. Select File > New File > R Script from the menu.
2. Enter a valid command in the first line of the file, for example
print("Hello world!")
3. Enter another valid command in the second line of the file, for example
print("Goodbye world?")
4. Select File > Save to save your file.

• Name it Chapter9Example.R

To run your script:

1. Press the Source button.

You will see the results of your commands in the Console window.

9.1.3 R Markdown

RStudio can also run text files written in the R Markdown format. R Markdown
files have the .Rmd extension.

R Markdown is a language for producing documents - HTML files (web pages),
Microsoft Word documents, PDF files, etc. - that have R code and analysis
embedded in them. In fact, this book is written in R Markdown.

R Markdown is an implementation of the Markdown markup language in R.



FYI
What is Markdown?
Markdown is a markup language just like HTML, which means that it
is a way of writing documents in text files whose content is readable
directly but can also be formatted and displayed (rendered) in a visually
appealing way.
The original idea of HTML was that content creators could write their
content in text files (pages), with a few HTML tags sprinkled around
to give the browser information about structure, and then the browser
would display the page. However, as web users demanded fancy graphics,
custom colors, interactivity, and mobile-friendly display, HTML became
much more complicated.
Markdown was created as a radically simplified markup language. The
basic idea is to use common conventions for how to indicate structure in
a text file.

• Adjacent lines of text are interpreted as part of the same paragraph.


• A line of text following a blank line starts a new paragraph.
• A line of text that begins with “#” is a header, with “#” for level
one headers, “##” for level two, etc.
• A line of text that begins with “-” is a bullet point.
• A line of text that begins with a number is part of a numbered list.
• Text written like *this* is rendered in italics.
• Text written like **this** is rendered in bold.
• Text written like ***this*** is rendered in bold italics.

Markdown documents can also include links and pictures (by simply
providing the URL or file name), tables, and all sorts of other things.

In addition to ordinary text and Markdown information, R Markdown documents
can include pieces of executable R code. R code needs to be surrounded
by a code fence that identifies the text inside the fence as R code, and in some
cases provides additional information about how it should be executed. This
sounds complicated, but is easy to see in a real R Markdown file.
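For instance, here is a minimal sketch of a complete R Markdown file; the title and code are placeholders made up for this illustration, and the lines with three backticks form the code fence:

# A tiny report

Some *introductory* text, followed by an executable code chunk:

```{r}
mean(1:10)
```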

Example 9.4. Creating an R Markdown file

To create our first R Markdown file:

1. Select File > New File > R Markdown from the menu.

• You will see a dialog box that looks like this:



2. The default options are fine, so select OK.


3. Save the file.

RStudio has taken the liberty of creating an example R Markdown file that you
can use as a template.

You can run the R code in an R Markdown document in one of two ways.
First, you can run and display results for individual chunks of code. A chunk
is a few lines of R code surrounded by a code fence.
Example 9.5. Running code chunks
To run a code chunk in our R Markdown file:

1. Go to the code chunk that looks like this:

2. Press the green Run (play) button at the top right of the chunk.

As you can see, the code in the chunk will run and the results will be displayed
below.

You can also knit the entire R Markdown file into an HTML/word/PDF docu-
ment that includes both the text and the R results by pressing the Knit button.

Example 9.6. Knitting an R Markdown document


To knit an entire document:

1. Press the Knit button.

It will take a few moments to process the file, and then the HTML file will open
in a browser.

By default, R Markdown files usually knit to HTML, but we can knit to other
file formats including Word and PDF. We will stick to HTML in this course.

FYI
R Markdown resources
R Markdown is as simple or as complicated as you want to make it. A
plain text file with a few lines of content is a valid R Markdown file, and
like HTML, Markdown is designed so it still “works” if you do something
unexpected.
If you want to try something new in R Markdown, or have forgotten how
to do something, the most useful resource is the one-page R Markdown
Cheat Sheet. It is available directly in RStudio, or at
https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf. You can also
just search for “r markdown cheatsheet”.

9.1.4 Other RStudio features

RStudio has many other features, most of which we will not use. But I would
like to highlight a few that you may find useful.
In the lower right window:

• The Files tab gives you easy access to files in the current active folder.
• The Plots tab will display plots, when you create them.
• The Packages tab is useful for managing packages (more on them later)
• The Help tab allows you to access R’s help system.

In the upper right window

• The Environment tab allows you to view all currently-defined variables


and their values.
• The History tab shows the command history.

In the menu:

• You can select Session > Restart R to clear the memory and restart the
current R session.

We are done for now, so close RStudio. You may get a warning message that
looks something like this:

Never click on the Save button here, as it would cause R to save the current
state of its memory and re-load it next time you start R. In the interest of
reproducibility, you should start R “clean” every time. Click on the Don't
Save button, and you will exit RStudio.

9.2 The R language


Next, we will learn some basic features of the R language. Open RStudio and
go to the console window so we can enter commands and see what they do.

9.2.1 Expressions

An expression is any piece of R code that can be evaluated on its own. For
example:

• Any text, numerical or logical constant: "Hello world", 105, 1.34, or


TRUE.
• Any complete formula built from functions and arithmetic operators:
log(10) or 2+2

An expression needs to be complete: for example, log( is not an expression, nor
is 2+.
Every valid R expression returns a value, also called an object.

• An object can be a number, a text string, a date, or a logical value, just


like in Excel.
• Objects can also be much more complex

You can execute any valid R expression as a command, and have it display the
value it returns.

# This is a comment. R ignores everything in a line after the '#'


4 + 5
## [1] 9

You can also use any valid R expression within a larger expression.

sqrt(4 + 5)
## [1] 3

In addition, some expressions have a side effect. That is, they make something
happen: they cause something to appear on your computer screen, or change a
file, or change something in R’s memory.

# This expression has a side effect: it causes R to plot a histogram of 100
# N(0,1) random numbers
hist(rnorm(100))

[Figure: Histogram of rnorm(100), with Frequency on the vertical axis]

Although we call it a “side effect”, the side effect is often the main purpose of
the expression.

9.2.2 Assignment

We can use the <- or assignment operator to assign the results of an expression
to a named variable. We can then use that variable in later expressions.

For example, the R command x <- 2 assigns the value 2 to the variable x.
Any subsequent code can then refer to the variable x in its own calculations
or actions.

Example 9.7. Using the assignment operator

# This will cause the variable x to take on the value 2


x <- 2
# We can then use x in any expression
y <- x + 1
print(y)
## [1] 3
# We can change the value of x at any time
x <- 0
# But this will not change the result of any previous calculations
print(y)
## [1] 3

We can display the contents of an object using the print() function, or by
simply giving its name:

x <- 5
print(x)
## [1] 5
x
## [1] 5

9.2.3 Vectors

The primary data structure in R is a vector, which is just an ordered list of
elements.
The simplest type of vector is called an atomic vector - its elements are
normally from one of R’s basic or atomic data types:

• text strings
• numbers
• logical values (either TRUE or FALSE)

The elements of an atomic vector need to be all part of the same atomic type;
a single vector cannot contain both strings and numbers, for example.
We can construct a vector by enumeration using the c() function:

fruits <- c("Avocado", "Banana", "Cantaloupe")


print(fruits)
## [1] "Avocado" "Banana" "Cantaloupe"

There are many other functions that can be used to construct vectors. Two
particularly useful ones are rep which repeats something a particular number
of times, and seq which creates a sequence:

# REP repeats something (like Excel's Fill tool)


ones <- rep(1, times = 10)
print(ones)
## [1] 1 1 1 1 1 1 1 1 1 1
# SEQ creates a sequence (like Excel's Series tool)
evens <- seq(from = 2, to = 20, by = 2)
print(evens)
## [1] 2 4 6 8 10 12 14 16 18 20
# You can also create a sequence with the : operator:
print(1:10)
## [1] 1 2 3 4 5 6 7 8 9 10

Mathematical functions in R operate directly on vectors, and automatically
expand scalars (single numbers) to vectors as needed:

# This command subtracts 1 from every element in evens


odds <- evens - ones
print(odds)
## [1] 1 3 5 7 9 11 13 15 17 19
# This command does the same
odds <- evens - 1
print(odds)
## [1] 1 3 5 7 9 11 13 15 17 19
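
Ordinary mathematical functions work the same way, applying element-by-element to the whole vector:

# sqrt() is applied separately to each element of the vector
sqrt(c(1, 4, 9))
## [1] 1 2 3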

The subscript operator [] can be used to select part of a vector. You can
enumerate the indexes of the elements you want:

# You can give a single index: evens[2] is the 2nd element in evens
x <- evens[2]
print(x)
## [1] 4
# You can give a vector of indices: evens[c(2,5)] is a vector containing the 2nd
# and 5th elements in evens
x <- evens[c(2, 5)]
print(x)
## [1] 4 10
# You can give a range of indices: evens[2:5] is a vector containing the 2nd,
# 3rd, 4th and 5th elements in evens
x <- evens[2:5]
print(x)
## [1] 4 6 8 10

You can also provide logical values instead of numeric indices. R will then
operate on those elements whose corresponding item has the value TRUE:

print(evens)
## [1] 2 4 6 8 10 12 14 16 18 20
# This creates a vector of the same length as evens, that contains TRUE for all
# values less than 10, and FALSE for all other values
lessthan10 <- (evens < 10)
print(lessthan10)
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# This creates a vector that includes only those elements of evens for which
# lessthan10 is TRUE
x <- evens[lessthan10]
print(x)
## [1] 2 4 6 8
# This is a quicker way of accomplishing the same result
x <- evens[evens < 10]
print(x)
## [1] 2 4 6 8

Vector subscripting can be used on either side of the assignment operator:

x <- evens
print(x)
## [1] 2 4 6 8 10 12 14 16 18 20
# This assigns the number 1000 to the 2nd element in x
x[2] <- 1000
print(x)
## [1] 2 1000 6 8 10 12 14 16 18 20

9.2.4 Lists

The other type of vector is a list. A list is a vector whose elements are themselves
other vectors. These vectors can be any type, so we can use lists inside lists to
build very complex objects.
Lists can be built using the list() function:

everything <- list(fruits, evens, odds)


print(everything)
## [[1]]
## [1] "Avocado" "Banana" "Cantaloupe"
##
## [[2]]
## [1] 2 4 6 8 10 12 14 16 18 20
##
## [[3]]
## [1] 1 3 5 7 9 11 13 15 17 19

You can (and should) assign names to the elements of a list:

everything <- list(fruits = fruits, evens = evens, odds = odds)


print(everything)
## $fruits
## [1] "Avocado" "Banana" "Cantaloupe"
##
## $evens
## [1] 2 4 6 8 10 12 14 16 18 20
##
## $odds
## [1] 1 3 5 7 9 11 13 15 17 19

You can access part of a list by specifying its numerical index inside of the [[]]
operator:

print(everything[[2]])
## [1] 2 4 6 8 10 12 14 16 18 20

If the items in a list are named, you can also access them by name using either
[[]] or $ notation

print(everything[["evens"]])
## [1] 2 4 6 8 10 12 14 16 18 20
print(everything$fruits)
## [1] "Avocado" "Banana" "Cantaloupe"

You can also use the $ notation to add new items to an existing list:

# There is no element in everything called 'allnumbers'


everything$allnumbers <- c(evens, odds)
# But now there is...
print(everything)

## $fruits
## [1] "Avocado" "Banana" "Cantaloupe"
##
## $evens
## [1] 2 4 6 8 10 12 14 16 18 20
##
## $odds
## [1] 1 3 5 7 9 11 13 15 17 19
##
## $allnumbers
## [1] 2 4 6 8 10 12 14 16 18 20 1 3 5 7 9 11 13 15 17 19

9.2.5 Attributes
Any object can also have attributes. The attributes of an object are a list
associated with the object that provides additional information.
Let’s see if any of the objects we have created have attributes:

print(attributes(fruits))
## NULL
print(attributes(evens))
## NULL
print(attributes(everything))
## $names
## [1] "fruits" "evens" "odds" "allnumbers"

Note that:

• our two atomic vectors have attributes NULL. That’s R’s way of saying
they have no attributes.
• our list stores the names of its four elements in the $names attribute.

R has hundreds of standard object types that are built from atomic vectors,
lists, and attributes. These object types include matrices, arrays, data sets,
objects structured as the output of a particular statistical analysis, descriptions
of graphs, and so on. Users can also define their own object types, and there
is an extensive system for generic functions and object-oriented programming (if
you know what that is).

9.2.6 Functions and operators


There are hundreds of built-in mathematical and statistical functions in R,
and users can easily define their own functions. As you have seen, their format
and usage are quite similar to Excel’s, though there are a few important differences.
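For instance, here is a minimal sketch of a user-defined function; the name celsius_to_fahrenheit is just something made up for this illustration:

# A user-defined function that converts Celsius to Fahrenheit
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9/5 + 32
}
celsius_to_fahrenheit(20)
## [1] 68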

Let’s get to know the main features of functions in R by considering the seq()
function. We have already seen this function: it is used to create a vector with
a sequence of numbers, much like Excel’s Series tool.

1. Every function has a name.

• In our example, the function’s name is seq.

2. You can obtain help on any function by entering ? and its name in the
console window

• Try ? seq.

3. Most functions accept one or more arguments.

• The seq function’s arguments include from, to, by and length.out


• Every argument has a name and a position. For example, the from
argument is in position one, the to argument is in position two, etc.
• Arguments can be passed to the function by name or by position.
– Passing by name looks like this: seq(from=1,to=5)
– Passing by position looks like this: seq(1,5)
– You can mix both methods: seq(1,5,length.out=10)
– I recommend passing by position for simple functions, and pass-
ing by name for more complex functions, but it is really just a
matter of what works for you.
• Some arguments are required. They must be provided every time the
function is called, or else the function will return an error.
• Some arguments are optional. They can be provided, but have a
default value if not provided.
– All arguments to seq() are optional; execute the command
seq() to see what happens.

4. Every function returns a value. This is even true for functions like
print(). To see this:

y <- print("Hello world")


## [1] "Hello world"
print(y)
## [1] "Hello world"

As you can see, print("Hello world") returns “Hello world” as its value.
5. Some functions also produce side effects, as we have described earlier.

In addition to functions, R has the usual binary mathematical operators such
as +, -, * and /. Operators are just another way of expressing functions. For
example, the + operator gives the same result as the sum() function:

# These two statements are equivalent


2 + 2
## [1] 4
sum(2, 2)
## [1] 4

There are several other commonly used operators:

# Basic arithmetic operators


2 + 3
## [1] 5
2 - 3
## [1] -1
2 * 3
## [1] 6
2/3
## [1] 0.6666667
2^3
## [1] 8
# Comparison operators
2 < 3
## [1] TRUE
2 == 3
## [1] FALSE
2 > 3
## [1] FALSE
# Logical operators
2 == 3 & 2 < 3 # this is logical AND
## [1] FALSE
2 == 3 | 2 < 3 # this is logical OR
## [1] TRUE

The assignment arrow <- is itself an operator. It is equivalent to the assign()
function (note that assign() takes the variable name as a text string):

# These two statements are equivalent:
x <- 2
assign("y", 2)
print(x)
## [1] 2
print(y)
## [1] 2
# Assignment returns its own value, so you can chain assignments:
x <- y <- 3
print(x)

## [1] 3
print(y)
## [1] 3

9.3 Packages and the Tidyverse


R has many useful built-in functions and features. But one of its most useful
features is how easily it can be extended by users, and the fact that it has a large
user community who have provided packages of useful new functions and data.
There are thousands of packages available online. We will use a particularly
useful package called the Tidyverse.

FYI
What is the Tidyverse?
The Tidyverse was created by the data scientist Hadley Wickham (also
one of the key people behind RStudio) as a way of solving some long-
standing problems with R. The Tidyverse is both an R package contain-
ing a set of new functions and data structures as well as a philosophy
about how to analyze data.
The basic structure of R dates back to 1976 (R itself was created in the
early 1990s but is closely based on an earlier program called S). Computer
science has advanced a lot since 1976, so some design aspects of R seemed
like a good idea at the time but would be designed differently today.

• Too many different ways of doing the same thing.
• Too many rarely-used functions.
• Some functions that don’t do what they should.

Unfortunately, we can’t change any of the original functions without
causing thousands of existing programs to stop working.
The Tidyverse addresses this problem by replacing many Base R func-
tions with alternative versions that are easier to use, better-designed,
and usually faster. It does this in part by being “opinionated” - for ex-
ample, most data analysis tools in the Tidyverse expect data to be in a
tidy format. This reflects a philosophy that data cleaning should precede
and be separate from data analysis.

Most commonly-used packages, including the Tidyverse, are open-source, and are
available online from the Comprehensive R Archive Network (CRAN).
Before you can use any package, two steps must be followed:

1. The package needs to be installed on your computer using the


install.packages() function.

• This only needs to be done once for each package.


2. The package needs to be loaded into memory using the library() func-
tion.
• This needs to be done in every R session.

Once the package is installed and loaded, you can use its functions and other
features.

Example 9.8. Loading the Tidyverse


You can get a list of all available CRAN packages by simply executing the
install.packages() function with no arguments:

install.packages()

If you know the name of the CRAN package you want to install, you can provide
it as the argument:

install.packages("tidyverse")

You only need to install each package once.


However, installing a package only puts the files on your computer. In order to
actually use the features of a package you need to load it into memory during
your current R session using the library() function:

library("tidyverse")

You can then use the Tidyverse functions and other tools.
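
For example, once the Tidyverse is loaded you can call functions that are not part of base R. A quick sketch using str_to_upper(), a text function from the Tidyverse’s stringr package:

library("tidyverse")
# str_to_upper() is provided by the Tidyverse, not by base R
str_to_upper("hello world")
## [1] "HELLO WORLD"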

9.4 Some examples


I have explained some of the basic structure of R, but the best way to learn a
tool is by using it.

Example 9.9. Plotting a PDF


Suppose we want to plot the 𝑁 (0, 1) PDF. We can start by describing step-by-
step what we need to do:

1. Construct a vector 𝑥 of values at which to plot the PDF.


2. Calculate a vector 𝑝 = 𝜙(𝑥), where 𝜙(⋅) is the 𝑁 (0, 1) PDF.
3. Plot 𝑝 against 𝑥.

Then we need to figure out how to accomplish each step using R:

1. Our first step can be accomplished using the seq() function, which we
have already used. If you know the name of the function you want to
use, you can access its help page by executing the command ? [function
name here]:

# ? seq

As you can see, the seq() function takes arguments from= (for the starting
point), to= (for the end point), and length.out= (for the total number
of points). Let’s plot the function at 10 points between -4 and 4:

x <- seq(from = -4, to = 4, length.out = 10)


print(x)
## [1] -4.0000000 -3.1111111 -2.2222222 -1.3333333 -0.4444444 0.4444444
## [7] 1.3333333 2.2222222 3.1111111 4.0000000

Note that I’ve picked only 10 points here so that our code is easy to check.

2. The next step is to calculate the standard normal PDF at each of these
points. R is a program for statisticians, so it presumably has that PDF
available as a built-in function. But what if we don’t know its name? We
can just Google “normal pdf in r” and click on a page or two to find out
that the function we need is called dnorm().

p <- dnorm(x)
print(p)
## [1] 0.0001338302 0.0031560163 0.0337736510 0.1640100747 0.3614238299
## [6] 0.3614238299 0.1640100747 0.0337736510 0.0031560163 0.0001338302

3. Our last step is to plot 𝑝 against 𝑥. We could Google, or we could guess
that the function for creating plots is called plot() and try something out.
Don’t be scared to try things out; nothing bad can happen here.

plot(x, p)
[Figure: scatter plot of p against x]

You will see this plot in the Plots tab in the lower right corner of your
screen.

Well, that’s not too bad, but we might want to make some improvements:

1. Plot it at more points (1000 rather than 10, for example)


2. Connect the points with a line
3. Add a title

So we can read through the documentation for the plot() function, try a few
things out, and we can produce a much prettier graph by just adding a few
options:

x <- seq(from = -4, to = 4, length.out = 1000)


p <- dnorm(x)
plot(x, p, type = "l", ylab = expression(phi(x)), main = "PDF of N(0,1) distribution")
[Figure: line plot titled “PDF of N(0,1) distribution”, showing φ(x) against x]

As you can see, we have a much nicer and clearer looking plot.

Chapter review

In this chapter, we learned how to run R programs, whether in the console, in
a script, or in an R Markdown document. We also learned some basics of the
R language. We haven’t learned how to do much of anything useful yet with
data, but we will over the next few chapters. In particular, we will learn how
to move data from Excel to R, how to view data in R, and how to clean and
analyze data in R. We will also learn a sophisticated R graphing package called
ggplot.

Although you will be tested on specific knowledge, you should also keep in mind
the bigger picture: my real goal here is for you to develop some long-lasting skills
that you will find useful in the future. This should be your goal as well.

A year from now, or five years from now, you will probably not be able to
remember exactly what the format of the seq function is, nor will you need to.
Instead I want you to focus on learning how to think about a coding task, how
to find information, and how to design and implement your plans.

FYI
For more information on R
There are many free sources of useful information about R.

• A good short introduction is available at
https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf.
• A good longer book that focuses on the Tidyverse is Wickham and
Grolemund’s R for Data Science. It can be purchased as an actual
book from Amazon or your local book shop, and is also available
as a free e-book at https://r4ds.had.co.nz/index.html.

Practice problems
Answers can be found in the appendix.
SKILL #1: Perform basic tasks in RStudio

1. Open RStudio and do the following:


a. Execute a command in the console window.
b. Write and execute (source) a brief script.
c. Write and knit a brief R Markdown document.

SKILL #2: Use R expressions and vectors

2. Which of the following are valid R expressions?


a. "Hello world"
b. Hello
c. Hello"
d. 2+2
e. x <- 2 + 2
f. x <- 2 +
3. Write the R code to perform the following actions:
a. Create a vector named cookies that contains the elements “oat-
meal”, “chocolate chip”, and “shortbread”.
b. Create a vector named threes that contains all of the integers
between 1 and 100 that are divisible by 3.
c. Use the vector threes to find the 5th-lowest integer between 1 and
100 that is divisible by 3.
d. Create a list named threecookies that contains cookies and
threes.

SKILL #3: Use R packages

4. Load the tidyverse package (you will need to install it if you have not
already done so), and execute the R code below:

data("mtcars") # load data


ggplot(mtcars, aes(wt, mpg)) + geom_point(aes(colour=factor(cyl), size = qsec))
Chapter 10

Advanced data cleaning

In an earlier chapter, we learned some basic data cleaning skills using Excel, and
some introductory R skills. This chapter will build on these skills by teaching
some more advanced tools and concepts.

Goals
Chapter goals
In this chapter we will learn how to:

• Move data files across multiple file formats.


• Combine multiple data files into a single workbook.
• Link observations in Excel using XLOOKUP.
• Construct group-level aggregate variables in Excel.
• Avoid, detect and handle errors in Excel.
• Protect Excel data from unintentional modification.
• Read CSV files in R.
• View data tables (“tibbles”) in R.

We will do this while building a data set describing long-run economic growth
in a wide cross-section of countries.


Economics background

Data on long-run economic growth


Our main data source will be the Penn World Table (PWT), a cross-
country data set covering real GDP, population, and other macroeco-
nomic variables. The PWT is built from two distinct data sources:

• National income data from each country’s national statistical


agency.
• Systematic data comparing prices across countries, constructed by
the World Bank’s International Comparison Program (ICP).

The ICP data is needed to account for a simple economic reality: each
country’s GDP is calculated using local prices, but prices of key goods
and services vary dramatically across countries: housing is much more
expensive in Vancouver than in Houston, and a haircut is much cheaper
in Mumbai than in London. The PWT research team use the results of
the ICP to convert each country’s GDP data to comparable (PPP) units.
The current version of the PWT is available online at http://www.ggdc.
net/pwt.

Download the required file pwt100.xlsx from


https://bookdown.org/bkrauth/BOOK/sampledata/pwt100.xlsx

We also have a secondary data source with cross-country data on top
marginal tax rates. Most countries have progressive tax systems - that
means that residents with high income pay a higher tax rate than res-
idents with low income - but these higher rates only apply to marginal
income. For example if the marginal tax rate is 30% on taxable income
below $100,000 and 40% on taxable income above $100,000, a taxpayer
with $150,000 in taxable income would pay 30% on the first $100,000
and 40% on the remaining $50,000 for a total tax bill of $50,000, or an
average tax rate of 33%.
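
As a quick check of that arithmetic, here is the calculation in R (a sketch using the numbers from the example above):

# Tax owed on $150,000 of taxable income
tax <- 0.30 * 100000 + 0.40 * (150000 - 100000)
print(tax)
## [1] 50000
# Average tax rate
print(tax / 150000)
## [1] 0.3333333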
The data on the top marginal tax rate is obtained from the Tax Foundationᵃ,
a US-based policy and research organization. Our data set comes
from Table 1 in the report “Taxing High Incomes: A Comparison of 41
Countries”, available online at https://taxfoundation.org/taxing-high-
income-2019/.

Download the required file TaxRates.xlsx from


https://bookdown.org/bkrauth/BOOK/sampledata/TaxRates.xlsx

Finally, we have a “crosswalk” data file I have constructed to help with
linking the GDP and tax data (more about this below).

Download the required file CountryCodes.txt from


https://bookdown.org/bkrauth/BOOK/sampledata/CountryCodes.txt
This file may display in your browser rather than download-
ing like the other files. If that happens, you can right-click
on the link above and select “Save link as” to save the file
on your computer.

The rest of the chapter will use these three data files.
ᵃ I should mention that the Tax Foundation is not an entirely neutral organization

- it generally supports lower taxes - and so my use of their data here should not be
taken as expressing an opinion on their policy views. It is not unusual for useful data
to come from politically-motivated sources; for many years the best data on tobacco

10.1 Data file formats

Nearly every software application we might use for data analysis has a native
file format specifically designed for saving and reading data in that application.
For example, Excel’s native format is the .xlsx file format.

Applications work most seamlessly with their native format. However, most
modern applications can also import (read) and export (save) files in other
formats including native formats of other popular programs.

There are also several standard or open file formats that are commonly used to
share data across applications. Most of these open file formats are built from
text files.

• A text file is just a sequence of lines of plain text characters with no


formatting, graphics, or complex structure.

– Most plain text characters are standard visible letters, numbers,


spaces, and symbols.
– There may also be some non-display characters, including the end-
of-line character(s), the end-of-file character, and tab characters.

• You can view and edit text files using a simple application called a text
editor.

– Windows has a simple built-in text editor called Notepad


– MacOS has a built-in text editor called TextEdit.
– There are more advanced text editors available for free online; I use
one called Notepad++.

• Text files can be read by humans as well as computer programs.

Files that are not text files are usually called binary files. Binary files are
generally not human-readable.

Economics background

The economics of file formats


What determines the file format used by a given program? Technical
considerations play an important role, but so does economics.
As a technical matter, binary files can be more efficient for storage and
processing. To a computer, everything is a number, and so the number
123 can in principle be stored and handled more efficiently than the text
string “123”. The primary technical advantage of text files is interoper-
ability: sharing data across different users, applications, and devices.
In a market economy, companies can increase profits by efficiently using
scarce resources. Many well-known applications have moved from binary
to text file formats as computers have become more powerful (so reducing
storage and processing requirements becomes less valuable) and more
networked (so increasing interoperability becomes more valuable).
Another economically important difference is that a company can con-
trol use of its binary file format through some combination of intellectual
property rights (e.g. patents) and simply limiting access to information
about the format (trade secrets). Controlling a proprietary file format
in this manner can provide a competitive advantage to a software com-
pany. However, this advantage has become less valuable over time as
big companies demand the interoperability and customization potential
associated with open file formats and standards.
For example, Excel’s original .xls files were binary files in a proprietary
format whose details were known only by Microsoft. Its current
.xlsx files are (zipped) text files in an open format whose details are
publicly documented.
If you are interested in learning more about the economics of strate-
gic interaction among firms, you should consider taking our third-year
elective ECON 325.

10.1.1 Fixed-width files

Fixed-width text files represent a table of data by allocating the same number
of characters to each cell in the same column. They can be formatted to be
readable to humans like this:

Name                   Year of birth  Year of death
Mary, Queen of Scots   1542           1587
Mary I                 1516           1558
Elizabeth I            1533           1603

or they may be in a less-readable format like this:



Mary, Queen of Scots15421587
Mary I              15161558
Elizabeth I         15331603

The key feature of a fixed-width file is that each column begins at the same
point in each line. For example,

• In the first file, the year of birth always starts on the 24th character of
each line, and the year of death always starts on the 40th character.
• In the second file, the year of birth always starts on the 21st character of
each line, and the year of death always starts on the 25th character.

You can open fixed-width text files in Excel using the Text Import Wizard.

Example 10.1. Opening a fixed-width file

Open the file CountryCodes.txt in a text editor. It will look like this:

This is a fixed format file in which the CountryName variable starts on the
1st character of each line and the CountryCode variable starts on the 36th
character.

To open this file in Excel, use the Text Import Wizard:

1. Select File > Open > from the menu.


2. Select Browse to get to the usual OpenFile dialog box.
3. Find and select the file CountryCodes.txt. You may need to move to the
correct folder, and to change the file type to “All files (*.*)”.
4. The first dialog box of the Text Import Wizard allows you to specify the
overall structure of your data file:

Excel guesses about the structure of your data, but you may need to
correct its guesses. In this case, you should change the file type from
Delimited to Fixed width, and select Next>.
• Excel made a few other wrong guesses (it thinks that the file is written
in Japanese!) but you can ignore them.
5. The second dialog box of the Text Import Wizard allows you to specify
where each column starts:

Excel seems to have guessed correctly in this case, so go ahead and select
Next>.
6. The final dialog box of the Text Import Wizard provides some options for
changing the data format for each individual column:

We do not need to worry about any of these options, so go ahead and
select Finish.

Our worksheet now contains the imported data, correctly arranged into cells.
Keep this worksheet open.

10.1.2 CSV files

The most common and useful general-purpose format for tabular data is called
the comma separated values or CSV file format. A CSV file is just a text
file with the following features:

• Each line in the file represents a row in the table


• Within each row, cell values are separated or delimited by commas.
• Where necessary, text values are enclosed in quotes.

For example, this CSV file:

Name,Year of birth,Year of death


"Mary, Queen of Scots",1542,1547
Mary I,1516,1558
Elizabeth I,1533,1603

will appear in Excel as the table:

Name Year of birth Year of death


Mary, Queen of Scots 1542 1587
Mary I 1516 1558
Elizabeth I 1533 1603

Notice that the quotes around “Mary, Queen of Scots” are needed in order
for the comma to be interpreted as an actual comma rather than a delimiter
between cells.
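
As a preview of the R skills we will develop later in this chapter, a CSV file like this one can also be read directly into R. Here is a minimal sketch using the Tidyverse; the file name queens.csv is hypothetical:

library("tidyverse")
# read_csv() reads a CSV file and returns a data table (a "tibble")
queens <- read_csv("queens.csv")
print(queens)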
Example 10.2. Exporting a CSV file from Excel
To save the Excel worksheet we created in the previous section as a CSV file:

1. Select File > Save As. You will see:


• a text box giving the file name (CountryCodes.txt)
• a drop-down box giving the file format (Text (Tab delimited)
(*.txt)

2. Select CSV (Comma delimited) (*.csv) from the drop-down box and
3. Enter a file name (I suggest CountryCodes.csv) in the text box.
4. Select Save.
• You may get the warning message: The selected file type does not
support workbooks that contain multiple sheets. To save only the
active sheet click OK…. If so, select OK.
5. Close Excel.
• You may get another warning message: Want to save your
changes to CountryCodes.csv? If so, select Don't save.

You have created a CSV file.

You can see the exact contents of a CSV file by opening it in a text editor.
Example 10.3. Opening a CSV file as text
Use your preferred text editor to open the file CountryCodes.csv. It will look
something like this:

If a CSV file has .csv at the end of its file name, you can open it in Excel by
just double-clicking on it. Otherwise, you can use the Text Import Wizard.
CSV files (and text files in general) have several important limitations relative
to regular Excel (.xlsx) files:

1. A CSV file can only contain one table, while an Excel file can contain
multiple tables.
2. A CSV file can only contain cell values, and cannot contain:
• Formulas
• Formatting
• Graphs
• Any other fancy Excel features

These limitations are also the main advantage of the CSV format: CSV files are
simple ways of reporting tabular data and can be read by virtually any program,
or even by a human.

10.1.3 Other file formats


In addition to CSV files, we will run into files that use characters other than
the comma to delimit cells.

• Space-delimited text files are just like CSV files, but use spaces as the
delimiter. For example, a space-delimited version of our table might look
like this:

Name "Year of birth" "Year of death"


"Mary, Queen of Scots" 1542 1547
"Mary I" 1516 1558
"Elizabeth I" 1533 1603

• Tab-delimited text files are just like CSV files, but use a tab character
as the delimiter.

Like fixed-width and CSV files, space-delimited and tab-delimited files can be
imported into Excel using the Text Import Wizard.
Excel also has a wide variety of tools for obtaining data from various databases,
online services, etc. We do not have time to explore all of these tools, but you
can select Data from the menu bar and look around to see what is available.

10.1.4 Text to columns


Sometimes you will run into an Excel sheet that looks like this:

This can happen if a text file has been incorrectly imported, or if you have
copied-and-pasted data from a PDF file or web page. Fortunately, Excel has a
way of fixing that: the Text to Columns tool.

Example 10.4. Using the text-to-columns tool


Open the file TaxRates.xlsx. As you can see, it looks like the picture above. To
use the Text to Columns tool:

1. Select the column with the data (column A).


• Be sure to select the whole column, not just a single cell.
2. Select Data > Text to Columns from the menu.
3. The Text to Columns Wizard will appear.
• It is very similar to the Text Import Wizard
• Excel is usually smart enough to guess the structure of your data and
to offer the correct default options here.
• Assuming the default options are correct (you can see what happens
if you change the options), select Next>, then Next> again, then
Finish.

Your worksheet should now look like this:

Save this file and exit Excel.

10.2 Combining data


The data that we wish to analyze often comes in multiple tables from multi-
ple sources. In order to proceed with the analysis, we often need to combine
information in various ways:

1. We will want to combine multiple data files into a single Excel file.
2. We will want to link observations in one table with observations in another
table.
3. We will want to construct new variables that are group-level aggregates
(counts, averages, or sums) of existing variables.

This section will go through the most important tools and techniques of com-
bining data in these ways.

10.2.1 Combining Excel files

Data tables often come in multiple files, especially if they are from different
sources. It is possible for one Excel file to use data from another Excel file (or
even a non-Excel file or online data source), but doing so can lead to problems
if not done very carefully. For example, if one file has a formula that references
another file, that formula may stop working if either file is moved to another
folder.
An easier approach in most situations is to just combine everything into a single
Excel workbook.

Example 10.5. Combining our cross-country data files


We currently have three data files: pwt100.xlsx, CountryCodes.csv, and
TaxRates.xlsx. To combine them in a single file:

1. Open all three files.


2. Go to CountryCodes.csv, and right-click on the CountryCodes tab at the
bottom of the page. A menu will appear; select Move or Copy... and
the Move or Copy dialog box will appear:

3. Select pwt100.xlsx from the To book: drop-down box. Your worksheet
will now move to that workbook.
4. Repeat the same process with both worksheets in the TaxRates.xlsx work-
book.
5. At this point all of our worksheets are in a single workbook. Save the
workbook with name GrowthData.xlsx.

Before proceeding, it’s a good idea to make sure our data is tidy.

Example 10.6. Making our workbook tidy

1. Take a look at each of our three main worksheets and identify if they need
to be adjusted in any way to meet our criteria for tidy data:
• CountryCodes is tidy and does not need alteration.
• TaxRates is tidy and does not need alteration.

• Data has one issue: each observation (row) appears to describe a
particular country in a particular year. There is an identifier for
country and an identifier for year, but no combined identifier that
takes on a unique value for each individual observation. We will
need to create one.
2. Add a unique ID variable to the Data worksheet:
a. Insert a new column to the left of the current column A.
b. Enter CountryYear in cell A1 to name the variable.
c. Use the CONCAT() function to create the identifier in cell A2.
• The formula would be =CONCAT(B2,E2)
• For example, if row 2 depicts Aruba (ABW) in 1950, cell A2
should display “ABW1950”.
d. Copy/paste or fill this formula to the rest of column A.
3. Make a copy of the TaxRates worksheet, and name it GrowthData. This
sheet will be the starting point for the linked data that we will construct
in the next two sections.

Our data set is now tidy and ready to link.

10.2.2 Linking observations

One of the most common tasks in cleaning data is to combine variables from two
or more data sets. In order to do this, we need to match or link observations
in the two tables on the basis of one or more ID variables or keys.
All statistics packages have tools to link observations. In Excel, the table you are
obtaining data from is called a lookup table and the key tool is the XLOOKUP()
function:

• The XLOOKUP() function takes three main/required arguments:


1. The lookup_value is the value in the current table to look up.
2. The lookup_array is the range containing the column of IDs.
3. The return_array is the range containing the column of values to
return.

For example, the formula =XLOOKUP("California",A1:A10,C1:C10) looks for
a row/observation with the value “California” in column A (cells A1:A10), and
then returns the value in column C (cells C1:C10) for that row/observation.
Example 10.7. Adding country codes to the tax rate data
The TaxRates worksheet from the Tax Foundation and the Data worksheet from
the Penn World Table both provide information at the country level. But they
do not yet have a common key variable:

• Data provides two ways of identifying a country in each observation:


– The country name
– The three-letter ISO country code.
• TaxRates only provides the country name.

We could try to match on country name, but the country names are not exactly
the same in the two tables. For example, the same country is called “South
Korea” in TaxRates and “Republic of Korea” in Data. This is a common problem
with names, so people who work with data usually prefer standardized codes.
One option is to simply change the country names in one of our tables, but that
goes against our general principle that we avoid changing data. A better solution
is to use a crosswalk table that gives the country code for each country name,
including name variations like “Republic of Korea” and “South Korea.” The
CountryCodes worksheet is a crosswalk table I have created for this purpose.
If you take a look at it, you will notice that there are observations for both
“Republic of Korea” and “South Korea.”
Let’s use our crosswalk table and the XLOOKUP() function to add a country code
to the GrowthData worksheet.

1. Insert a new column to the left of the current column A.


• You can do that by selecting any cell in column A, and then selecting
Home > Insert > Insert Sheet Columns
2. Enter CountryCode in cell A1 to name the variable.
3. Construct the appropriate formula in cell A2.
• lookup_value should be the country name of the current observation
(B2)
• lookup_array should be the full list of country names (CountryCodes!A2:A185)
• return_array should be the full list of country codes (CountryCodes!B2:B185)
4. Change references in the formula from relative to absolute as appropriate.
The resulting formula will be =XLOOKUP(B2,CountryCodes!A$2:A$185,CountryCodes!B$2:B$185)
5. Copy/paste or fill the formula to the remaining cells in column A.

Column A should now display the ISO country code for each observation.

XLOOKUP() can also be used to match on multiple criteria.


Example 10.8. Matching on multiple criteria
Now suppose we want to create a variable that is each country’s population
(pop in the Data worksheet) in 2019. Since we need to match both country and
year, we will need to use the CONCAT() function to put the country and year
together in a single criterion.

1. Enter Pop2019 in cell K1 of GrowthData to name the variable.


2. Enter the correct formula in cell K2:

• lookup_value should combine the country code (cell A2) with the
year (2019), so it should be CONCAT(A2,"2019").
• lookup_array should be the full range of the CountryYear variable
in the Data worksheet (Data!A2:A12811).
• return_array should be the full range of the pop variable in the
Data worksheet (Data!H2:H12811).

3. Make the cell references absolute where needed, which will result in the for-
mula =XLOOKUP(CONCAT(A2,"2019"),Data!A$2:A$12811,Data!H$2:H$12811)
4. Copy/paste or fill the formula to the remaining cells in column K.

We can also create another variable whose value is the country’s population
in 1990.

1. Enter Pop1990 in cell L1 of GrowthData to name the variable.


2. Enter the correct formula in cell L2. It will be nearly identical
to the formula in cell K2, but with “2019” replaced by “1990”:
=XLOOKUP(CONCAT(A2,"1990"),Data!A$2:A$12811,Data!H$2:H$12811)
3. Copy/paste or fill the formula to the remaining cells in column L.

You may note that the PWT goes all the way back to 1950, and may wonder
why I picked 1990 as my starting point. The reason for this is that many countries
do not have data going back to 1950, especially those countries that were part
of or allied with the Soviet Union. By 1990, the PWT has data for almost all
countries.

Normally, XLOOKUP() looks for an exact match and returns an error if no match
is found. This is a good default, but the optional arguments if_not_found and
match_mode can be used if you want to do something other than that.
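
For example, a variation on our earlier formula (shown purely as an illustration) is =XLOOKUP(B2,CountryCodes!A$2:A$185,CountryCodes!B$2:B$185,"No match"), which displays the text “No match” instead of the #N/A error code whenever a country name cannot be found.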

FYI
Related functions
XLOOKUP() is a relatively new addition to Excel, so you may see files that
use the older functions VLOOKUP() and HLOOKUP(). The syntax of these
functions is somewhat different, but the underlying idea is the same.

10.2.3 Aggregating by group

Sometimes we will want to create a new variable that is the sum or average of
another variable within some group, or the count of the number of observations
in that group.
We can construct group-level average variables using the AVERAGEIFS()
function. It takes three arguments:

• average_range is the range of cells containing the variable that should


be averaged.
• criteria_range is the range of cells containing the group identifier vari-
able
• criteria is the cell containing the group identifier of the current obser-
vation.

For example, =AVERAGEIFS(C1:C100,A1:A100,"California") returns the average
value in column C of all of the observations whose value in column A is
“California”.

Example 10.9. Average investment share by country


Suppose we are interested in whether high-tax countries tend to have higher or
lower investment rates than low-tax countries. In order to do this, we might add
a variable to our GrowthData worksheet that describes the average investment
share of GDP (the variable csh_i) in each country over the full period of the
data.
We can do this using the AVERAGEIFS() function:

1. Enter AvgIShare in cell M1 to name the variable.


2. Construct the correct formula in cell M2 using the function AVERAGEIFS():
• average_range should be the full range of investment shares
(Data!AO2:AO12811)
• criteria_range should be the full range of country codes
(Data!A2:A12811)
• criteria should be the cell containing the country code of the cur-
rent observation (A2)
3. Adjust the formula to have the correct absolute references. The resulting
formula should be =AVERAGEIFS(Data!AO$2:AO$12811,Data!A$2:A$12811,A2)
4. Copy/paste or fill this formula into the remaining cells in column M.

The average investment share should range from a minimum of 0.13 for Bulgaria
to a maximum of 0.44 for Cyprus.

The AVERAGEIFS() function is part of a family of functions that calculate a
statistic for a subset of observations that satisfy a set of criteria:

• COUNTIFS() calculates the number of observations meeting the criteria.
– We used COUNTIFS() to construct a frequency table in Chapter 6.
• AVERAGEIFS() calculates the average of a variable for those observations
that meet the criteria.
• SUMIFS() calculates the sum of a variable for those observations that meet
the criteria.
• MINIFS() calculates the minimum of a variable for those observations that
meet the criteria.
• MAXIFS() calculates the maximum of a variable for those observations that
meet the criteria.

These functions can all be used to construct summary statistics (as in Chapter
6) or to construct group aggregate variables within our data set (as in this
chapter).
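For example, a formula along the lines of =COUNTIFS(Data!A$2:A$12811,A2),
entered in our GrowthData table (assuming the same ranges as in Example 10.9),
would count how many observations the Data worksheet contains for the country
in cell A2. Note that COUNTIFS() has no average_range-style argument; its
arguments begin directly with the criteria range.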

FYI
Related functions
The COUNTIFS(), AVERAGEIFS(), SUMIFS(), MINIFS() and MAXIFS()
functions all allow for multiple criteria to be used. Excel also includes a
set of older functions COUNTIF(), AVERAGEIF() and SUMIF() that allow
only a single criterion. You may see these in older worksheets.

10.3 Advanced data management

10.3.1 Error codes

Sometimes an Excel formula does not produce a valid result. When that hap-
pens, Excel will return an error code to indicate what has gone wrong. Excel’s
most commonly-used error codes are:

• #VALUE! means you have given a function the wrong type of argument.
– Example: =LN("THIS IS A STRING AND NOT A NUMBER")
• #NAME? means you have referenced a function that does not exist.
– Example: =NOTAREALFUNCTION(1)
• #DIV/0! means you have divided by zero.

– Example: =1/0
– This error also appears when you take the AVERAGE() of a range of
cells that does not include any numbers.

• #REF! means you have referenced a cell that does not exist.

– This usually happens when you delete a row or column that the
formula refers to.

• #NUM! means that the result of your numeric calculation is not a real
number.

– Example: =SQRT(-1)

• #N/A means that a lookup function such as XLOOKUP() was unable to find
a match.

If you aren’t sure what an error code means, Google it.


Although error codes are helpful in figuring out what has gone wrong with a
calculation, they should not be part of your final work product. Instead, you
should rewrite your formulas to catch and handle error conditions.

Example 10.10. Catching and handling an error condition


Suppose you want the variable in column B to be the square root of the variable
in column A, but the variable in column A is sometimes negative. As a result,
the simple formula =SQRT(A1) displays the square root in some cases and the
error code #NUM! in others.
We can use the IF() function to catch this error and handle it in whatever way
we want. For example, the formula =IF(A1>=0,SQRT(A1),"") will return the
square root of A1 if A1 is positive or zero, and a blank cell if A1 is negative.
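For errors that are harder to anticipate, Excel's IFERROR() function offers a
more compact alternative: =IFERROR(SQRT(A1),"") returns the square root when
the calculation succeeds and a blank cell when it produces any error code. The
IF() approach is more precise, though, since it only suppresses the specific
condition we expect.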

10.3.2 Data validation

One potential source of errors is the entry of invalid data. Invalid data means
that a particular cell in our data takes on a value that is not in the set of possible
values for that variable. For example:

• A numeric variable has a non-numeric value.


• A logical variable has a value other than TRUE or FALSE.
• A date variable is in the wrong format.
• A text variable is the wrong length.
• A numeric variable is outside of its expected range, for example an unem-
ployment rate of 200% or a negative value of GDP.

Invalid data can result from typos or other human error, or it can result from
imperfect translation of data between different data sources.
Excel has a set of data validation tools to help prevent and fix invalid data.
Data validation can be accessed by selecting Data from the menu and then
clicking on the Data Validation button.

Example 10.11. Adding data validation


The Pop2019 variable (column K in the GrowthData table) is a country’s pop-
ulation in 2019. Population cannot be negative, so let’s add this as a data
validation requirement:

1. Select column K in GrowthData.


2. Select Data > Data Validation. The Data Validation dialog box will
appear. By default, it allows "Any value", which means no restriction at all.

3. To add a restriction that negative values are not allowed:
a. Select Decimal from the Allow drop-down box.
b. Select greater than or equal to from the Data drop-down box.
c. Enter 0 in the Minimum box.
4. Select OK.

Although nothing appears to have changed, column K is now subject to data
validation.

Once we have added data validation to a range of cells, several things will
happen. First, Excel will not allow you to enter invalid data into a cell with
validation turned on. This feature will help avoid problems in the first place.

Example 10.12. Entering invalid data


Go to cell K43, and (try to) enter -1. You will see a dialog box telling you
that this is an invalid value.

Select Cancel.

The Data Validation tool also allows you to identify previously-entered obser-
vations with invalid data.

Example 10.13. Finding invalid data


All of our Pop2019 data is valid, so in order to “find” invalid data we will need
to cheat a little.

1. Select column K again.


2. Select Data > Data Validation again.
3. Change the Minimum value from “0” to “10”.

This will cause all values of Pop2019 below 10 to be (incorrectly) considered
invalid.
To identify invalid observations, select Data > Data Validation > Circle
Invalid Data. You will now see red circles around all values that violate the
validation criteria (there may be a slight delay).

To remove the circles, select Data > Data Validation > Clear Validation
Circles.

Finally, we can remove all data validation from a cell.

Example 10.14. Removing data validation


To remove data validation from the Pop2019 variable:

1. Select column K in GrowthData.



2. Select Data > Data Validation. The Data Validation dialog box will
appear.
3. Select Clear All and then OK.

You can confirm that data validation has been removed by entering an invalid
value (Excel will now let you do this) or by adding validation circles (there will
not be any).

10.3.3 Protecting data

One of the most common Excel problems occurs when someone who is analyzing
a data set unintentionally changes the data.

• This can happen because you accidentally touch your keyboard and over-
write the contents of a cell.
• It can also happen when an inexperienced analyst breaks the rule about
keeping the original data untouched, and makes a mistake.

This is to some extent an unavoidable consequence of Excel's fundamental
design: there is no separation between data and analysis. This design feature has
the positive implication that your data is always visible and accessible, but it
comes at a cost.
One way of avoiding this cost is to make a practice of making your original data
“read only.” Excel does this through the combined mechanisms of protecting
sheets and locking cells.

• Each worksheet is either protected or unprotected.
– By default, all worksheets are unprotected.
– You can change the status of any worksheet.
• Each cell is either locked or unlocked.
– By default, all cells are locked.
– You can change the status of a cell or cell range.
• Any locked cell in any protected sheet cannot be edited.

Example 10.15. Protecting an entire worksheet


The PWT data table Data contains the original raw data, which we do not want
changed. To prevent changes to that worksheet:

1. Right-click on the Data tab at the bottom of the page. A menu will
appear.

2. Select Protect Sheet. The Protect Sheet dialog box will appear with
several options.

3. Select OK.

Now that you have protected the Data worksheet, try to edit any cell. You will
get an error message.

In many cases, you will only want to protect certain cells in a given sheet. To
do this, remember the rules: all cells are initially locked and all worksheets are
unprotected. So we will need to unlock all cells except the ones we want locked,
and then protect the sheet.
Example 10.16. Protecting part of a worksheet
To protect and lock columns A through N in GrowthData but keep the other
columns unlocked:

1. Select all of the cells in the worksheet by clicking on the Select All
button in the worksheet's upper-left corner.


2. Unlock all of the cells by selecting Home > Format > Lock Cell.
3. Lock columns A through N by selecting them, and then selecting Home >
Format > Lock Cell.
4. Protect the worksheet by selecting Home > Format > Protect Sheet
5. The Protect Sheet dialog box will appear; select OK.

Note that you must do these steps in this order: Excel will not let you change
which cells are locked after you have protected the sheet.
You will now get an error message if you try to edit any of the locked cells, but
you can edit the unlocked cells in any way you like.

Finally, you can remove protection from any worksheet by simply selecting Home
> Format > Unprotect Sheet.

You can download the complete Excel file with all data cleaning from
this chapter at https://bookdown.org/bkrauth/BOOK/sampledata/GrowthData.xlsx

10.4 Reading and viewing data in R


Before doing any statistical or graphical analysis, we need to get the data into
R. We will also want to look at our data before diving into the analysis, and we
may want to do some data cleaning as well.
We will use numerous Tidyverse functions, so we need to load the Tidyverse
package:

# Load the tidyverse
library("tidyverse")

We will work with the employment data.

10.4.1 Reading a CSV file

Our first step will be reading the data in from the CSV file. The Tidyverse
function to do this is called read_csv(). It has one required argument: the
name of the CSV file.

• We can access the online data set directly in R.


• If you have a slow connection or want to work offline, you can download the
data set from https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv
and then access the local data set.

# The code below accesses the online data. You can also download the file
# 'EmploymentData.csv' and change the argument to read_csv() to the local file
# location
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv")
## Rows: 541 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): MonthYr, Party, PrimeMinister
## dbl (8): Population, Employed, Unemployed, LabourForce, NotInLabourForce, Un...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

As you can see, R guesses each variable’s data type, and reports its guess. It
is always a good idea to read this output and make sure that everything is the
way that we want it:

• The numeric variables are all stored as col_double().
– This means double-precision (64-bit) real number.
– This is what we want.
• The two text variables are both stored as col_character().
– This means character or text string.
– This is what we want.
• The MonthYr variable is also stored as col_character().
– This is not what we want.

– We will want MonthYr to be stored explicitly as a date variable, so


we make a note of that issue here and will fix it later.

We have assigned the data in the CSV file to the variable EmpData.
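As the message above suggests, we can re-display or control this column
specification. A quick sketch (using the same file):

# Re-display the full column specification that read_csv() guessed
spec(EmpData)
# Silence the column-type message on future reads
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv",
    show_col_types = FALSE)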

FYI
Additional options
Our data file happens to be a nice and tidy one, so read_csv() worked
just fine with its default options. Not all data files are so tidy, so
read_csv() has many optional arguments. There are also functions
for other delimited file types:

• read_csv2() for files delimited by semicolons rather than commas


• read_tsv() for tab-delimited files
• read_delim() for files delimited using any other character

Base R has a similar function called read.csv(), but
read_csv() is preferable for various reasons.

10.4.2 Viewing a data table

The read_csv() function creates an object called a tibble. A tibble is a
Tidyverse object type that describes a tidy data table.

• Each row in the tibble represents an observation.
• Each column represents a variable, and has a name.

The base R equivalent of a tibble is called a data frame. Tibbles
and data frames are interchangeable in most applications, but tibbles have some
additional features that make them work better with the Tidyverse.
We have several ways of viewing the contents of a tibble. We can start with the
print function, which we have already seen in other contexts:

print(EmpData)
## # A tibble: 541 x 11
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce UnempRate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 16852. 9637. 733 10370. 6483. 0.0707
## 2 2/1/1976 16892 9660. 730 10390. 6502. 0.0703
## 3 3/1/1976 16931. 9704. 692. 10396. 6535 0.0665
## 4 4/1/1976 16969. 9738. 713. 10451. 6518. 0.0682
## 5 5/1/1976 17008. 9726. 720 10446. 6562 0.0689
## 6 6/1/1976 17047. 9748. 721. 10470. 6577. 0.0689
## 7 7/1/1976 17086. 9760. 780. 10539. 6546. 0.0740
## 8 8/1/1976 17124. 9780. 744. 10524. 6600. 0.0707
## 9 9/1/1976 17154. 9795. 737. 10532. 6622. 0.0699
## 10 10/1/1976 17183. 9782. 783. 10565. 6618. 0.0741
## # ... with 531 more rows, and 4 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>

Tibbles can be quite large, so the print() function will usually show an abbre-
viated version of the table.
We can also see the whole table by executing the command View(EmpData) or
through RStudio:

1. Go to the Environment tab in the upper right window.


• You will see a list of all variables currently in memory, including
EmpData.
2. Double-click on EmpData.

You will see a spreadsheet-like display of EmpData. As in Excel, you can sort
and filter this table. Unlike Excel, you cannot edit it here.

10.4.3 Data table properties

There are several R functions available for exploring the properties of a data
table.
We can obtain the column names of a tibble using the names() function:

names(EmpData)
## [1] "MonthYr" "Population" "Employed" "Unemployed"
## [5] "LabourForce" "NotInLabourForce" "UnempRate" "LFPRate"
## [9] "Party" "PrimeMinister" "AnnPopGrowth"

and we can count the rows and columns with nrow() and ncol() respectively:

nrow(EmpData)
## [1] 541
ncol(EmpData)
## [1] 11

We can access any variable by name using the $ notation:

length(EmpData$UnempRate)
## [1] 541

As you can see, the length() function returns the length of a vector.
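Base R also provides a dim() function that returns both dimensions at once:

# DIM returns the number of rows and columns together
dim(EmpData)
## [1] 541  11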

Chapter review
In this chapter, we learned about many loosely-related topics. The underlying
theme that connects them all is that real data can be complicated. We will
often need to get data from multiple sources and in varying formats, and we
cannot trust either ourselves or others not to make mistakes. So we need to be
both disciplined in handling our data, and flexible in finding solutions to the
problems that pop up.
In the next two chapters, we will develop more advanced methods for data
analysis in both Excel and R.

Practice problems
Answers can be found in the appendix.
SKILL #1: Identify common data file formats

1. Identify each of these text files as fixed-width, tab/space separated, or
CSV format.
a. Name Age
Al 25
Betty 32
b. Name Age
Al 25
Betty 32
c. Name,Age
Al,25
Betty,32

SKILL #2: Explain and implement common data cleaning tasks

2. What is the purpose of each of the following:


a. A crosswalk table
b. Matching observations by keys
c. Aggregating data by groups

SKILL #3: Describe and use Excel data management tools

3. Under which of these scenarios can you edit cell A1?


a. You open a blank sheet.
b. You open a blank sheet, and protect the sheet.
c. You open a blank sheet, unlock cells A1:C9 and protect the sheet.
d. You open a blank sheet, lock cells A1:C9 and protect the sheet.
4. What will happen if you:
a. Add data validation to a column that contains invalid data.
b. Add data validation to a column, and then try to enter invalid data

SKILL #4: Import and view data in R

5. Use R (with the Tidyverse loaded) to open the data file
https://people.sc.fsu.edu/~jburkardt/data/csv/deniro.csv and count the
number of observations and variables in it.
Chapter 11

Using R

In a previous chapter, we learned some basic R terminology, how to run R, and
how to use R to read and view data. In this chapter, we will use this
knowledge to do some useful analysis of the Canadian employment data we have
already worked with. We will also use ggplot, R's powerful graphing system.

Goals
Chapter goals
In this chapter we will learn how to use R to:

• Use filter, select, arrange, and mutate to transform data tables.


• Calculate univariate statistics in R.
• Produce simple but informative histograms and line graphs using
ggplot.

To get started, open R, load the Tidyverse, and read in our employment data.

library(tidyverse)
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv")

11.1 Cleaning data in R


We will not spend a lot of time on data cleaning in R, as we can always clean
data in Excel and export it to R. However, we have a few Tidyverse tools that
are useful to learn.
The Tidyverse includes four core functions for modifying data:

• mutate() allows us to add or change variables.


• filter() allows us to select particular observations, much like Excel's
filter tool.
• arrange() allows us to sort observations, much like Excel's sort tool.
• select() allows us to select particular variables.

All four functions follow a common syntax that is designed to work with a
convenient Tidyverse tool called the “pipe” operator.

11.1.1 The pipe operator


The pipe operator is part of the Tidyverse and is written %>%. Recall that an
operator is just a symbol like + or * that performs some function on whatever
comes before it and whatever comes after it.
To see how it works, I’ll show you a few examples:

# This is equivalent to names(EmpData)


EmpData %>%
names()
## [1] "MonthYr" "Population" "Employed" "Unemployed"
## [5] "LabourForce" "NotInLabourForce" "UnempRate" "LFPRate"
## [9] "Party" "PrimeMinister" "AnnPopGrowth"
# This is equivalent to sqrt(2)
2 %>%
sqrt()
## [1] 1.414214
# This is equivalent to cat(sqrt(2),' is the square root of 2')
2 %>%
sqrt() %>%
cat(" is the square root of 2")
## 1.414214 is the square root of 2

As you can see, R’s rule for interpreting the pipe operator is that the object
before the %>% is taken as the first argument for the function after the %>%.
The pipe operator does not add any functionality to R; anything you can do
with it can also be done without it. But it addresses a common problem: we
often want to perform multiple transformations on a data set, but doing so in
the usual functional language can lead to code that is quite difficult to read.
The pipe operator can be used to create much more readable code, as we will
see in the examples below.

11.1.2 Mutate
The most important data transformation function is mutate, which allows us
to change or add variables. We will start by changing the MonthYr variable

from character (text) to date, using the as.Date() function:

# Change MonthYr to date format


EmpData %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y"))
## # A tibble: 541 x 11
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1976-01-01 16852. 9637. 733 10370. 6483.
## 2 1976-02-01 16892 9660. 730 10390. 6502.
## 3 1976-03-01 16931. 9704. 692. 10396. 6535
## 4 1976-04-01 16969. 9738. 713. 10451. 6518.
## 5 1976-05-01 17008. 9726. 720 10446. 6562
## 6 1976-06-01 17047. 9748. 721. 10470. 6577.
## 7 1976-07-01 17086. 9760. 780. 10539. 6546.
## 8 1976-08-01 17124. 9780. 744. 10524. 6600.
## 9 1976-09-01 17154. 9795. 737. 10532. 6622.
## 10 1976-10-01 17183. 9782. 783. 10565. 6618.
## # ... with 531 more rows, and 5 more variables: UnempRate <dbl>, LFPRate <dbl>,
## # Party <chr>, PrimeMinister <chr>, AnnPopGrowth <dbl>

As you can see, the MonthYr column is now labeled as a date rather than text.
Like Excel, R has an internal representation of dates that allows for correct
ordering and calculations, but displays dates in a standard human-readable
format.
Mutate can be used to add variables as well as to change them. For example,
suppose we also want to create versions of UnempRate and LFPRate that
are expressed in percentages rather than decimal units:

# Add UnempPct and LFPPct


EmpData %>%
mutate(UnempPct = 100 * UnempRate) %>%
mutate(LFPPct = 100 * LFPRate)
## # A tibble: 541 x 13
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce UnempRate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 16852. 9637. 733 10370. 6483. 0.0707
## 2 2/1/1976 16892 9660. 730 10390. 6502. 0.0703
## 3 3/1/1976 16931. 9704. 692. 10396. 6535 0.0665
## 4 4/1/1976 16969. 9738. 713. 10451. 6518. 0.0682
## 5 5/1/1976 17008. 9726. 720 10446. 6562 0.0689
## 6 6/1/1976 17047. 9748. 721. 10470. 6577. 0.0689
## 7 7/1/1976 17086. 9760. 780. 10539. 6546. 0.0740
## 8 8/1/1976 17124. 9780. 744. 10524. 6600. 0.0707
## 9 9/1/1976 17154. 9795. 737. 10532. 6622. 0.0699

## 10 10/1/1976 17183. 9782. 783. 10565. 6618. 0.0741


## # ... with 531 more rows, and 6 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>, UnempPct <dbl>, LFPPct <dbl>

If you look closely, you can see that the UnempPct and LFPPct variables are
now included in the data table.
Before we go any further, note that we haven’t yet changed the EmpData data
table:

print(EmpData)
## # A tibble: 541 x 11
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce UnempRate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 16852. 9637. 733 10370. 6483. 0.0707
## 2 2/1/1976 16892 9660. 730 10390. 6502. 0.0703
## 3 3/1/1976 16931. 9704. 692. 10396. 6535 0.0665
## 4 4/1/1976 16969. 9738. 713. 10451. 6518. 0.0682
## 5 5/1/1976 17008. 9726. 720 10446. 6562 0.0689
## 6 6/1/1976 17047. 9748. 721. 10470. 6577. 0.0689
## 7 7/1/1976 17086. 9760. 780. 10539. 6546. 0.0740
## 8 8/1/1976 17124. 9780. 744. 10524. 6600. 0.0707
## 9 9/1/1976 17154. 9795. 737. 10532. 6622. 0.0699
## 10 10/1/1976 17183. 9782. 783. 10565. 6618. 0.0741
## # ... with 531 more rows, and 4 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>

As you can see, the MonthYr variable is still listed as a character variable and
the new UnempPct and LFPPct variables do not seem to exist.
What has happened here? Our original commands simply created a new object
based on EmpData that was then displayed on the screen. In order to change
EmpData itself, we need to assign that new object back to EmpData:

# Make permanent changes to EmpData


EmpData <- EmpData %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y")) %>%
mutate(UnempPct = 100 * UnempRate) %>%
mutate(LFPPct = 100 * LFPRate)

We can confirm that now we have changed EmpData:

print(EmpData)
## # A tibble: 541 x 13
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce

## <date> <dbl> <dbl> <dbl> <dbl> <dbl>


## 1 1976-01-01 16852. 9637. 733 10370. 6483.
## 2 1976-02-01 16892 9660. 730 10390. 6502.
## 3 1976-03-01 16931. 9704. 692. 10396. 6535
## 4 1976-04-01 16969. 9738. 713. 10451. 6518.
## 5 1976-05-01 17008. 9726. 720 10446. 6562
## 6 1976-06-01 17047. 9748. 721. 10470. 6577.
## 7 1976-07-01 17086. 9760. 780. 10539. 6546.
## 8 1976-08-01 17124. 9780. 744. 10524. 6600.
## 9 1976-09-01 17154. 9795. 737. 10532. 6622.
## 10 1976-10-01 17183. 9782. 783. 10565. 6618.
## # ... with 531 more rows, and 7 more variables: UnempRate <dbl>, LFPRate <dbl>,
## # Party <chr>, PrimeMinister <chr>, AnnPopGrowth <dbl>, UnempPct <dbl>,
## # LFPPct <dbl>

11.1.3 Filter, arrange, and select

Now let’s suppose we want to know more about the months in our data set with
the highest unemployment rates. We can use filter() for this purpose:

# This will give all of the observations with unemployment rates over 12.5%
EmpData %>%
filter(UnempPct > 12.5)
## # A tibble: 8 x 13
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-10-01 19183. 10787. 1602. 12389. 6794.
## 2 1982-11-01 19203. 10764. 1600. 12364. 6839.
## 3 1982-12-01 19223. 10774. 1624. 12398. 6824.
## 4 1983-01-01 19244. 10801. 1573. 12374 6870.
## 5 1983-02-01 19266. 10818. 1574. 12392. 6875.
## 6 1983-03-01 19285. 10875. 1555. 12430. 6856.
## 7 2020-04-01 30994. 16142. 2444. 18586. 12409.
## 8 2020-05-01 31009. 16444 2610. 19054. 11955.
## # ... with 7 more variables: UnempRate <dbl>, LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>, UnempPct <dbl>, LFPPct <dbl>

As you can see, only 8 of the 541 months in our data have unemployment rates
over 12.5%: the worst months of the 1982-83 recession, and April and May of
2020.

Now let’s suppose that we only want to see a few pieces of information about
those months. We can use select() to choose variables:

# This will take out all variables except a few


EmpData %>%
filter(UnempPct > 12.5) %>%
select(MonthYr, UnempRate, LFPPct, PrimeMinister)
## # A tibble: 8 x 4
## MonthYr UnempRate LFPPct PrimeMinister
## <date> <dbl> <dbl> <chr>
## 1 1982-10-01 0.129 64.6 Pierre Trudeau
## 2 1982-11-01 0.129 64.4 Pierre Trudeau
## 3 1982-12-01 0.131 64.5 Pierre Trudeau
## 4 1983-01-01 0.127 64.3 Pierre Trudeau
## 5 1983-02-01 0.127 64.3 Pierre Trudeau
## 6 1983-03-01 0.125 64.5 Pierre Trudeau
## 7 2020-04-01 0.131 60.0 Justin Trudeau
## 8 2020-05-01 0.137 61.4 Justin Trudeau

Finally, suppose that we want to show these months sorted by unemployment
rate. We can use arrange() to sort rows in ascending order (to put the highest
unemployment rate first instead, we would wrap the sorting variable in desc(),
as in arrange(desc(UnempPct))):

# This will sort the rows by unemployment rate


EmpData %>%
filter(UnempPct > 12.5) %>%
select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
arrange(UnempPct)
## # A tibble: 8 x 4
## MonthYr UnempPct LFPPct PrimeMinister
## <date> <dbl> <dbl> <chr>
## 1 1983-03-01 12.5 64.5 Pierre Trudeau
## 2 1983-02-01 12.7 64.3 Pierre Trudeau
## 3 1983-01-01 12.7 64.3 Pierre Trudeau
## 4 1982-10-01 12.9 64.6 Pierre Trudeau
## 5 1982-11-01 12.9 64.4 Pierre Trudeau
## 6 1982-12-01 13.1 64.5 Pierre Trudeau
## 7 2020-04-01 13.1 60.0 Justin Trudeau
## 8 2020-05-01 13.7 61.4 Justin Trudeau

Hopefully you can see why the pipe operator is useful in making our code clear
and readable:

# This is what the same code looks like without the pipe
arrange(select(filter(EmpData, UnempPct > 12.5), MonthYr, UnempPct, LFPPct, PrimeMinister),
    UnempPct)
## # A tibble: 8 x 4
## MonthYr UnempPct LFPPct PrimeMinister

## <date> <dbl> <dbl> <chr>


## 1 1983-03-01 12.5 64.5 Pierre Trudeau
## 2 1983-02-01 12.7 64.3 Pierre Trudeau
## 3 1983-01-01 12.7 64.3 Pierre Trudeau
## 4 1982-10-01 12.9 64.6 Pierre Trudeau
## 5 1982-11-01 12.9 64.4 Pierre Trudeau
## 6 1982-12-01 13.1 64.5 Pierre Trudeau
## 7 2020-04-01 13.1 60.0 Justin Trudeau
## 8 2020-05-01 13.7 61.4 Justin Trudeau

Now I should probably say: these results imply nothing meaningful about the
economic policy of either Pierre or Justin Trudeau. The severe worldwide
recessions of 1982-83 (driven by US monetary policy) and 2020-2021 (driven by
the COVID-19 pandemic) were the result of world events largely outside the
control of Canadian policy makers.

11.1.4 Saving code and data

It is possible to save your data set in R’s internal format just like you would
save an Excel file. But I’m not going to tell you how to do that, because what
you really need to do is save your code.
Because it is command-based, R enables an entirely different and much more
reproducible model for data cleaning and analysis. In Excel, the original data,
the data cleaning, the data analysis, and the results are all mixed together
in a single file. This is convenient and simple to use in many applications, but
it can be a disaster in complex projects.
In contrast, R allows you to have three separate files or groups of files:

1. The original data, which you do not change.


2. The code to clean and analyze the data, which you maintain carefully
(including version control).
• Your code can be saved in an R script, or as part of an R Markdown
document.
• This code can be split into multiple files.
• You can have one script to clean the data and another to analyze it.
3. The results of your code, which you treat as temporary files that can be
deleted or replaced at any time.
• This can include the results of your analysis
• It can also include your cleaned data.

The key is to make sure that all of your cleaned data and results can be regen-
erated from the original data at any time by running your code.

Example 11.1. BC education data


My colleagues and I have a long-term research project that uses student records
from the British Columbia elementary school system. It requires linking a stu-
dent's records across multiple years, linking student records to school records,
calculating school-level averages of some variables, and many other tasks.
The code to clean this data is split into about 10 scripts, which must be run in
sequence and take about 30-45 minutes to run. I do not want to do that every
time I do something with the data, so the last script saves the cleaned data to
a file, and my analysis scripts use that cleaned data file.
But every time I make a major change, or every time I put results into a research
paper for others to read, I re-run everything from the beginning.
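A minimal sketch of this kind of workflow, with hypothetical file names, might
look like this:

# clean_data.R reads the original data and saves a cleaned data file
# analyze_data.R reads the cleaned data and produces the results
# Re-running both scripts regenerates everything from the original data
source("clean_data.R")
source("analyze_data.R")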

11.2 Data analysis in R


Having read and cleaned our data set, we can now move on to some summary
statistics.

11.2.1 The summary function

The summary() function will give a basic summary of any object. Exactly what
that summary looks like depends on the object. For tibbles, summary() produces
a set of summary statistics for each variable:

summary(EmpData)
## MonthYr Population Employed Unemployed
## Min. :1976-01-01 Min. :16852 Min. : 9637 Min. : 691.5
## 1st Qu.:1987-04-01 1st Qu.:20290 1st Qu.:12230 1st Qu.:1102.5
## Median :1998-07-01 Median :23529 Median :14064 Median :1265.5
## Mean :1998-07-01 Mean :23795 Mean :14383 Mean :1261.0
## 3rd Qu.:2009-10-01 3rd Qu.:27327 3rd Qu.:16926 3rd Qu.:1404.6
## Max. :2021-01-01 Max. :31191 Max. :19130 Max. :2609.8
##
## LabourForce NotInLabourForce UnempRate LFPRate
## Min. :10370 Min. : 6483 Min. :0.05446 Min. :0.5996
## 1st Qu.:13467 1st Qu.: 6842 1st Qu.:0.07032 1st Qu.:0.6501
## Median :15333 Median : 8162 Median :0.07691 Median :0.6573
## Mean :15644 Mean : 8151 Mean :0.08207 Mean :0.6564
## 3rd Qu.:18230 3rd Qu.: 9099 3rd Qu.:0.09369 3rd Qu.:0.6674
## Max. :20316 Max. :12409 Max. :0.13697 Max. :0.6766
##
## Party PrimeMinister AnnPopGrowth UnempPct
## Length:541 Length:541 Min. :0.007522 Min. : 5.446

## Class :character Class :character 1st Qu.:0.012390 1st Qu.: 7.032


## Mode :character Mode :character Median :0.013156 Median : 7.691
## Mean :0.013703 Mean : 8.207
## 3rd Qu.:0.014286 3rd Qu.: 9.369
## Max. :0.024815 Max. :13.697
## NA's :12
## LFPPct
## Min. :59.96
## 1st Qu.:65.01
## Median :65.73
## Mean :65.64
## 3rd Qu.:66.74
## Max. :67.66
##

11.2.2 Univariate statistics

The R function mean() calculates the sample average of any numeric vector:

# Mean of a single variable


mean(EmpData$UnempPct)
## [1] 8.207112

There are many other functions in R to calculate other univariate summary
statistics:

# VAR calculates the sample variance


var(EmpData$UnempPct)
## [1] 2.923088
# SD calculates the standard deviation
sd(EmpData$UnempPct)
## [1] 1.709704
# MEDIAN calculates the sample median
median(EmpData$UnempPct)
## [1] 7.691411

As you can see, they work just like mean().


In real-world data, some variables have missing values for one or more obser-
vations. For example, the AnnPopGrowth variable in our data set is missing for
the first year of data (1976), since calculating the growth rate for 1976 would
require data from 1975. In R, missing values are given the special value NA
which stands for “not available”:

EmpData %>%
select(MonthYr, Population, AnnPopGrowth)
## # A tibble: 541 x 3
## MonthYr Population AnnPopGrowth
## <date> <dbl> <dbl>
## 1 1976-01-01 16852. NA
## 2 1976-02-01 16892 NA
## 3 1976-03-01 16931. NA
## 4 1976-04-01 16969. NA
## 5 1976-05-01 17008. NA
## 6 1976-06-01 17047. NA
## 7 1976-07-01 17086. NA
## 8 1976-08-01 17124. NA
## 9 1976-09-01 17154. NA
## 10 1976-10-01 17183. NA
## # ... with 531 more rows

When we try to take the mean of this variable we also get NA:

mean(EmpData$AnnPopGrowth)
## [1] NA

This is because arithmetic in R follows the IEEE-754 standard, under which
any calculation involving a missing value (NA) also results in a missing value.
Some other applications silently drop missing data from the calculation.
Whenever you have missing values, you should investigate before proceeding.
Sometimes (as in our case here) missing values occur for a good reason; other
times they are the result of a mistake or problem that needs to be fixed.
Once we have investigated the missing values, we can tell R explicitly to exclude
them from the calculation by adding the na.rm = TRUE option:

mean(EmpData$AnnPopGrowth, na.rm = TRUE)


## [1] 0.01370259
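Before deciding how to handle missing values, it often helps to count them.
One quick check:

# Count the missing values of AnnPopGrowth (one for each month of 1976)
sum(is.na(EmpData$AnnPopGrowth))
## [1] 12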

11.2.3 Tables of statistics

Suppose we want to calculate the sample average for each column in our tibble.
We could just call mean() for each of them, but there should be a quicker way.
Here is the code to do that:

# Mean of each column


EmpData %>%

select(where(is.numeric)) %>%
lapply(mean, na.rm = TRUE)
## $Population
## [1] 23795.46
##
## $Employed
## [1] 14383.15
##
## $Unemployed
## [1] 1260.953
##
## $LabourForce
## [1] 15644.1
##
## $NotInLabourForce
## [1] 8151.352
##
## $UnempRate
## [1] 0.08207112
##
## $LFPRate
## [1] 0.6563653
##
## $AnnPopGrowth
## [1] 0.01370259
##
## $UnempPct
## [1] 8.207112
##
## $LFPPct
## [1] 65.63653

I would not expect you to come up with this code, but maybe it kind of makes
sense.

• The select(where(is.numeric)) step selects only the columns in
EmpData that are numeric rather than text.
• The lapply(mean, na.rm = TRUE) step calculates mean(x, na.rm = TRUE) for
each (numeric) column x in EmpData.

We can use this method with any function that calculates a summary statistic:

# Standard deviation of each column


EmpData %>%

select(where(is.numeric)) %>%
lapply(sd, na.rm = TRUE)
## $Population
## [1] 4034.558
##
## $Employed
## [1] 2704.267
##
## $Unemployed
## [1] 243.8356
##
## $LabourForce
## [1] 2783.985
##
## $NotInLabourForce
## [1] 1294.117
##
## $UnempRate
## [1] 0.01709704
##
## $LFPRate
## [1] 0.01401074
##
## $AnnPopGrowth
## [1] 0.00269365
##
## $UnempPct
## [1] 1.709704
##
## $LFPPct
## [1] 1.401074

11.2.4 Frequency tables

We can also construct frequency tables for both discrete and continuous vari-
ables:

# COUNT creates a frequency table for discrete variables


EmpData %>%
count(PrimeMinister)
## # A tibble: 10 x 2
## PrimeMinister n
## <chr> <int>
## 1 Brian Mulroney 104

## 2 Jean Chretien 120


## 3 Joe Clark 8
## 4 John Turner 2
## 5 Justin Trudeau 62
## 6 Kim Campbell 4
## 7 Paul Martin 25
## 8 Pierre Trudeau 91
## 9 Stephen Harper 116
## 10 Transfer 9
# COUNT and CUT_INTERVAL create a binned frequency table
EmpData %>%
count(cut_interval(UnempPct, 6))
## # A tibble: 6 x 2
## `cut_interval(UnempPct, 6)` n
## <fct> <int>
## 1 [5.45,6.82] 94
## 2 (6.82,8.2] 240
## 3 (8.2,9.57] 95
## 4 (9.57,10.9] 56
## 5 (10.9,12.3] 43
## 6 (12.3,13.7] 13

As you might imagine, there are various ways of customizing the intervals just
like in Excel.

11.2.5 Probability distributions in R

R has a family of built-in functions for each commonly-used probability
distribution. For example:

• The dnorm() function gives the normal PDF:

# The N(0,1) PDF, evaluated at 1.96


dnorm(1.96)
## [1] 0.05844094
# The N(1,4) PDF, evaluated at 1.96
dnorm(1.96, mean = 1, sd = 4)
## [1] 0.09690415

• The pnorm() function gives the normal CDF:



# The N(0,1) CDF, evaluated at 1.97
pnorm(1.97)
## [1] 0.9755808

• The qnorm() function gives the inverse normal CDF (or quantile function):

# The 97.5 percentile of the N(0,1) CDF


qnorm(0.975)
## [1] 1.959964

• The rnorm() function produces normal random numbers:

# Four random numbers from the N(0,1) distribution


rnorm(4)
## [1] 0.2427564 -0.1532671 0.9348635 -0.7583637

There is a similar set of functions available for the uniform distribution
(dunif, punif, qunif, runif), the binomial distribution (dbinom, pbinom,
qbinom, rbinom), and Student's T distribution (dt, pt, qt, rt), along with
many others.
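For example, a few illustrative calculations using these families:

# The standard uniform CDF, evaluated at 0.5
punif(0.5)
## [1] 0.5
# The 97.5 percentile of the T distribution with 20 degrees of freedom
qt(0.975, df = 20)
## [1] 2.085963
# The probability of exactly 5 successes in 10 trials with success probability 0.5
dbinom(5, size = 10, prob = 0.5)
## [1] 0.2460938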

11.3 Graphs with ggplot

The Tidyverse also contains a powerful graphics package called ggplot.1

11.3.1 Creating a graph

We can start by making a histogram of the unemployment rate:

ggplot(data = EmpData, mapping = aes(x = UnempPct)) + geom_histogram()


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1 The package is technically called ggplot2, since it is the second version of ggplot. But
everyone calls it "ggplot" anyway.


[Figure: histogram of UnempPct, with count on the vertical axis and the unemployment rate (5.0 to 12.5) on the horizontal axis]

We can also make a time series (line) graph:

ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) + geom_line()

[Figure: line graph of UnempPct (vertical axis) against MonthYr (horizontal axis, 1980-2020)]

The ggplot() function has a non-standard syntax, so I’d like to go over it.

• The first line sets up the basic characteristics of the graph:

– The data argument tells R which data set (tibble) will be used.
– The mapping argument describes the basic aesthetics of the graph,
i.e., the relationship in the data we will be graphing.
∗ For the histogram, our aesthetic includes only one variable.
∗ For the line graph, our aesthetic includes two variables.

• The rest of the command is one or more statements separated by a + sign.
These are called geometries and are geometric elements to be included
in the plot.

– The geom_histogram() geometry produces a histogram.
– The geom_line() geometry produces a line.

A graph can include multiple geometries, as we will see shortly.

11.3.2 Modifying a graph

As when making graphs in Excel, the basic graph gives us some useful informa-
tion but we can improve upon it in various ways.

11.3.2.1 Titles and labels

You can add a title and subtitle, and you can change the axis titles:

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + geom_line() +
    labs(title = "Unemployment rate", subtitle = "January 1976 - January 2021",
    caption = "Source: Statistics Canada, Labour Force Survey", tag = "Canada") +
    xlab("") + ylab("Unemployment rate, %")
[Figure: the line graph with title "Unemployment rate", subtitle "January 1976 - January 2021", tag "Canada", vertical axis "Unemployment rate, %", and caption "Source: Statistics Canada, Labour Force Survey"]

11.3.2.2 Color

You can change the color of any geometric element using the col= argument:

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + geom_line(col = "blue")


[Figure: the same line graph drawn with a blue line]

Colors can be given in ordinary English (or local language) words, or with
detailed color codes in RGB format.

Some geometric elements, such as the bars in a histogram, also have a fill color:

ggplot(data = EmpData, aes(x = UnempPct)) + geom_histogram(col = "red", fill = "blue")


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of UnempPct with red bar outlines and blue bar interiors]

As you can see, the col= argument sets the color for the exterior of each bar,
and the fill= argument sets the color for the interior.

11.3.3 Adding graph elements

We can include multiple geometries in the same graph. For example, we can
include lines for both unemployment and labour force participation:

ggplot(data = EmpData, aes(x = MonthYr, y = UnempPct)) + geom_line(col = "blue") +


geom_line(aes(y = LFPPct), col = "red")
[Figure: blue and red lines for the unemployment and LFP rates on one graph; the vertical axis runs from 20 to 60 but is still labeled UnempPct]

A few things to note here:

• The third line gives geom_line() an aesthetics argument aes(y=LFPPct).
This overrides the aesthetics in the first line.

• We have used color to differentiate the two lines, but there is no legend to
tell the reader which line is which. We will need to fix that.
• The vertical axis is labeled UnempPct. We will need to fix that.

We could add a legend here, but it is better (and friendlier to the color-blind)
to just label the lines. We can use the geom_text geometry to do this:

ggplot(data = EmpData, aes(x = MonthYr)) + geom_line(aes(y = UnempPct), col = "blue") +
    geom_text(x = as.Date("1/1/2000", "%m/%d/%Y"), y = 15, label = "Unemployment",
    col = "blue") + geom_line(aes(y = LFPPct), col = "red") +
    geom_text(x = as.Date("1/1/2000", "%m/%d/%Y"), y = 60, label = "LFP", col = "red")
[Figure: the same graph with the blue line labeled "Unemployment" and the red line labeled "LFP"]

The graphs below combine all of the features described above to yield clean and
clear graphs.

ggplot(data = EmpData, aes(x = UnempPct)) + geom_histogram(binwidth = 0.5, fill = "blue") +
    geom_density() + labs(title = "Unemployment rate",
    subtitle = paste("January 1976 - January 2021 (", nrow(EmpData), " months)",
    sep = "", collapse = ""), caption = "Source: Statistics Canada, Labour Force Survey",
    tag = "Canada") + xlab("Unemployment rate, %") + ylab("Count")
[Figure: histogram of the unemployment rate with title "Unemployment rate", subtitle "January 1976 - January 2021 (541 months)", tag "Canada", axes "Unemployment rate, %" and "Count", and caption "Source: Statistics Canada, Labour Force Survey"]

ggplot(data = EmpData, aes(x = MonthYr)) + geom_line(aes(y = UnempPct), col = "blue") +
    geom_text(x = as.Date("1/1/2000", "%m/%d/%Y"), y = 15, label = "Unemployment",
    col = "blue") + geom_line(aes(y = LFPPct), col = "red") +
    geom_text(x = as.Date("1/1/2000", "%m/%d/%Y"), y = 60, label = "LFP", col = "red") +
    labs(title = "Unemployment and LFP rates",
    subtitle = paste("January 1976 - January 2021 (", nrow(EmpData), " months)",
    sep = "", collapse = ""), caption = "Source: Statistics Canada, Labour Force Survey",
    tag = "Canada") + xlab("") + ylab("Percent")
[Figure: labeled line graph with title "Unemployment and LFP rates", subtitle "January 1976 - January 2021 (541 months)", tag "Canada", vertical axis "Percent", and caption "Source: Statistics Canada, Labour Force Survey"]

Chapter review
As we have seen, we can do many of the same things in Excel and R. R is
typically more difficult to use for simple analysis tasks, and there is nothing
wrong with using Excel when it is easier. But the usability gap gets smaller
with more complicated tasks, and there are many tasks where Excel doesn’t do
everything that R can do. You should think of them as complementary tools,
and be comfortable using both.

Practice problems
Answers can be found in the appendix.
SKILL #1: Use mutate to add or change a variable
SKILL #2: Use filter, arrange and select to modify a data table

1. Starting with the data table EmpData:

a. Add the numeric variable Year based on the existing variable MonthYr.
The formula for Year should be as.numeric(format(MonthYr, "%Y")).
b. Add the numeric variable EmpRate, which is the proportion of the
population (Population) that is employed (Employed), also called
the employment rate or employment-to-population ratio.

c. Drop all observations from years before 2010.
d. Drop all variables except MonthYr, Year, EmpRate, UnempRate,
and AnnPopGrowth.
e. Sort observations by EmpRate.
f. Give the resulting data table the name PPData.

SKILL #3: Calculate univariate statistics for a single variable


SKILL #4: Construct a table of summary statistics
SKILL #5: Recognize and handle missing data problems

2. Starting with the PPData data table you created in question (1) above:
a. Calculate and report the mean employment rate since 2010.
b. Calculate and report a table reporting the median for all variables in
PPData.
c. Did any variables in PPData have missing data? If so, how did you
decide to address it in your answer to (b), and why?

SKILL #6: Construct a simple or binned frequency table

3. Using the PPData data set, construct a frequency table of the employment
rate.

SKILL #7: Perform simple probability calculations in R

4. Calculate the following quantities in R:

a. The 45th percentile of the N(4, 6) distribution.
b. The 97.5 percentile of the T distribution with 8 degrees of freedom.
c. The value of the standard uniform CDF, evaluated at 0.75.
d. 5 random numbers from the Binomial(10, 0.5) distribution.

SKILL #8: Create a histogram with ggplot

5. Using the PPData data set, create a histogram of the employment rate.

SKILL #9: Create a line graph with ggplot

6. Using the PPData data set, create a time series graph of the employment
rate.
Chapter 12

Multivariate data analysis

So far, most of our emphasis has been on univariate analysis: understanding
the behavior of a single variable at a time. However, we are often interested in
the relationship among multiple variables. This will be the primary subject of
your next course in statistics (most likely ECON 333), but we will touch on a
few of the basics in this chapter, using a combination of theoretical concepts,
Excel, and R.

Goals
Chapter goals
In this chapter, we will learn how to:

• Calculate and interpret the sample covariance and correlation.


• Interpret frequency tables, cross-tabulations and conditional aver-
ages
• Construct Excel Pivot Tables, including frequency tables, cross-
tabulations, and conditional averages.
• Interpret scatter plots, binned-mean plots, smoothed-mean plots,
and linear regression plots.
• Construct scatter plots, smoothed-mean plots and linear regression
plots in R.

For the most part, we will focus on the case of a random sample of size $n$ on
two random variables $x_i$ and $y_i$.
Example 12.1. Obtaining the data
The primary application in this chapter will use our Canadian employment data.
We will be using both Excel and R in our examples.
For the Excel examples we will start with the file https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.xlsx.
This file is similar to the employment data file we used in Chapter 6.


For the R examples we will start with the EmploymentData.csv file we used in
Chapter 11. Execute the following R code to get started:

library(tidyverse)
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv")
# Make permanent changes to EmpData
EmpData <- EmpData %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y")) %>%
mutate(UnempPct = 100 * UnempRate) %>%
mutate(LFPPct = 100 * LFPRate)

12.1 Covariance and correlation

When both variables are numeric, we can summarize their relationship using
the sample covariance:

$$s_{x,y} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

and the sample correlation:

$$\rho_{x,y} = \frac{s_{x,y}}{s_x s_y}$$

where $\bar{x}$ and $\bar{y}$ are the sample averages and $s_x$ and $s_y$ are
the sample standard deviations. These univariate statistics are defined in
Chapter 7.
The sample covariance and sample correlation can be interpreted as estimates of
the corresponding population covariance and correlation as defined in Chapter
5.
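As a sanity check, here is a quick sketch that computes the sample covariance
of two variables directly from the formula above; it matches the result of R's
cov() function used in the next subsection:

# The sample covariance formula, written out directly
x <- EmpData$UnempPct
y <- EmpData$LFPPct
sum((x - mean(x)) * (y - mean(y)))/(length(x) - 1)
## [1] -0.6126071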

12.1.1 Covariance and correlation in R

The sample covariance and correlation can be calculated in R using the cov()
and cor() functions.
These functions can be applied to any two columns of data:

# For two specific columns of data


cov(EmpData$UnempPct, EmpData$LFPPct)
## [1] -0.6126071
cor(EmpData$UnempPct, EmpData$LFPPct)
## [1] -0.2557409

As you can see, unemployment and labour force participation are negatively
correlated: when unemployment is high, LFP tends to be low. This makes
sense given the economics: if it is hard to find a job, people will move into
other activities that take them out of the labour force: education, childcare,
retirement, etc.
Both cov() and cor() can also be applied to (the numeric variables in) an entire
data set. The result is what is called a covariance matrix or correlation
matrix:

# Correlation matrix for the whole data set (at least the numerical parts)
EmpData %>%
select(where(is.numeric)) %>%
cor()
## Population Employed Unemployed LabourForce NotInLabourForce
## Population 1.0000000 0.9905010 0.3759661 0.9950675 0.9769639
## Employed 0.9905010 1.0000000 0.2866686 0.9964734 0.9443252
## Unemployed 0.3759661 0.2866686 1.0000000 0.3660451 0.3846586
## LabourForce 0.9950675 0.9964734 0.3660451 1.0000000 0.9509753
## NotInLabourForce 0.9769639 0.9443252 0.3846586 0.9509753 1.0000000
## UnempRate -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPRate 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## AnnPopGrowth NA NA NA NA NA
## UnempPct -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPPct 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## UnempRate LFPRate AnnPopGrowth UnempPct LFPPct
## Population -0.4721230 0.4535956 NA -0.4721230 0.4535956
## Employed -0.5542043 0.5369032 NA -0.5542043 0.5369032
## Unemployed 0.6249095 0.1874114 NA 0.6249095 0.1874114
## LabourForce -0.4836022 0.5379437 NA -0.4836022 0.5379437
## NotInLabourForce -0.4315427 0.2568786 NA -0.4315427 0.2568786
## UnempRate 1.0000000 -0.2557409 NA 1.0000000 -0.2557409
## LFPRate -0.2557409 1.0000000 NA -0.2557409 1.0000000
## AnnPopGrowth NA NA 1 NA NA
## UnempPct 1.0000000 -0.2557409 NA 1.0000000 -0.2557409
## LFPPct -0.2557409 1.0000000 NA -0.2557409 1.0000000

Each element in the matrix reports the covariance or correlation of a pair of
variables. As you can see, the matrix is symmetric since $cov(x,y) = cov(y,x)$.
In addition, the diagonal elements of the covariance matrix are $cov(x,x) = var(x)$
and the diagonal elements of the correlation matrix are $cor(x,x) = 1$.
Every variable’s correlation with AnnPopGrowth is NA, so we will want to
exclude NA values from the calculation. Excluding missing values is more com-
plicated for covariance and correlation matrices because there are two different
ways to exclude them:

1. Pairwise deletion: when calculating the covariance or correlation of two
variables, exclude observations with a missing value for either of those two
variables.
2. Casewise or listwise deletion: when calculating the covariance or cor-
relation of two variables, exclude observations with a missing value for any
variable.

The use argument allows you to specify which approach you want to use:

# EmpData has missing data in 1976 for the variable AnnPopGrowth. Pairwise will
# only exclude 1976 from calculations involving AnnPopGrowth
EmpData %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")
## Population Employed Unemployed LabourForce NotInLabourForce
## Population 1.0000000 0.9905010 0.3759661 0.9950675 0.9769639
## Employed 0.9905010 1.0000000 0.2866686 0.9964734 0.9443252
## Unemployed 0.3759661 0.2866686 1.0000000 0.3660451 0.3846586
## LabourForce 0.9950675 0.9964734 0.3660451 1.0000000 0.9509753
## NotInLabourForce 0.9769639 0.9443252 0.3846586 0.9509753 1.0000000
## UnempRate -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPRate 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## AnnPopGrowth -0.5427605 -0.5239765 -0.5771164 -0.5618814 -0.4851752
## UnempPct -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPPct 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## UnempRate LFPRate AnnPopGrowth UnempPct LFPPct
## Population -0.47212303 0.4535956 -0.54276051 -0.47212303 0.4535956
## Employed -0.55420434 0.5369032 -0.52397653 -0.55420434 0.5369032
## Unemployed 0.62490950 0.1874114 -0.57711636 0.62490950 0.1874114
## LabourForce -0.48360222 0.5379437 -0.56188142 -0.48360222 0.5379437
## NotInLabourForce -0.43154270 0.2568786 -0.48517519 -0.43154270 0.2568786
## UnempRate 1.00000000 -0.2557409 -0.06513125 1.00000000 -0.2557409
## LFPRate -0.25574087 1.0000000 -0.48645089 -0.25574087 1.0000000
## AnnPopGrowth -0.06513125 -0.4864509 1.00000000 -0.06513125 -0.4864509
## UnempPct 1.00000000 -0.2557409 -0.06513125 1.00000000 -0.2557409
## LFPPct -0.25574087 1.0000000 -0.48645089 -0.25574087 1.0000000
# Casewise will exclude 1976 from all calculations
EmpData %>%
select(where(is.numeric)) %>%
cor(use = "complete.obs")
## Population Employed Unemployed LabourForce NotInLabourForce
## Population 1.0000000 0.9898651 0.32223097 0.9951165 0.9782320
## Employed 0.9898651 1.0000000 0.22300335 0.9964469 0.9443181
## Unemployed 0.3222310 0.2230034 1.00000000 0.3043132 0.3495771
## LabourForce 0.9951165 0.9964469 0.30431322 1.0000000 0.9529715
## NotInLabourForce 0.9782320 0.9443181 0.34957711 0.9529715 1.0000000

## UnempRate -0.5162732 -0.6032136 0.62908791 -0.5350956 -0.4601639


## LFPRate 0.3943552 0.4879065 0.05409547 0.4814461 0.1986298
## AnnPopGrowth -0.5427605 -0.5239765 -0.57711636 -0.5618814 -0.4851752
## UnempPct -0.5162732 -0.6032136 0.62908791 -0.5350956 -0.4601639
## LFPPct 0.3943552 0.4879065 0.05409547 0.4814461 0.1986298
## UnempRate LFPRate AnnPopGrowth UnempPct LFPPct
## Population -0.51627317 0.39435518 -0.54276051 -0.51627317 0.39435518
## Employed -0.60321359 0.48790649 -0.52397653 -0.60321359 0.48790649
## Unemployed 0.62908791 0.05409547 -0.57711636 0.62908791 0.05409547
## LabourForce -0.53509557 0.48144610 -0.56188142 -0.53509557 0.48144610
## NotInLabourForce -0.46016393 0.19862976 -0.48517519 -0.46016393 0.19862976
## UnempRate 1.00000000 -0.33577578 -0.06513125 1.00000000 -0.33577578
## LFPRate -0.33577578 1.00000000 -0.48645089 -0.33577578 1.00000000
## AnnPopGrowth -0.06513125 -0.48645089 1.00000000 -0.06513125 -0.48645089
## UnempPct 1.00000000 -0.33577578 -0.06513125 1.00000000 -0.33577578
## LFPPct -0.33577578 1.00000000 -0.48645089 -0.33577578 1.00000000

In most applications, pairwise deletion makes the most sense because it avoids
throwing out data. But it is occasionally important to use the same data for all
calculations, in which case we would use listwise deletion.

FYI
Covariance and correlation in Excel
The sample covariance and correlation between two variables (data
ranges) can be calculated in Excel using the COVARIANCE.S() and
CORREL() functions.
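For example, if the unemployment and LFP percentages occupied the
(hypothetical) ranges H2:H542 and I2:I542, the formulas would be
=COVARIANCE.S(H2:H542,I2:I542) and =CORREL(H2:H542,I2:I542).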

12.2 Pivot tables


Excel’s Pivot Tables are a powerful tool for the analysis of frequencies, condi-
tional averages, and various other aspects of the data. They are somewhat tricky
to use, and we will only scratch the surface here. But the more comfortable you
can get with them, the better.
The first step is to create a blank Pivot Table that is tied to a particular data
table. We can create as many Pivot Tables as we want.
Example 12.2. Creating a blank Pivot Table
To create a blank Pivot Table based on the employment data:

1. Open the Data for Analysis worksheet in EmploymentData.xlsx and select
any cell in the data table.
2. Select Insert > PivotTable from the menu.

3. Excel will display the Create PivotTable dialog box. The default
settings are fine, so select OK.

Excel will open a new worksheet. The Pivot Table itself is on the left side of
the new worksheet.

The next step is to add elements to the table. There are various tools available
to do that:

• the Pivot Table Fields box on the right side of the screen
• the PivotTable Analyze menu
• the Design menu.

These tools only appear in context, so they will disappear if you click a cell
outside of the Pivot Table. You can fix this by just clicking any cell in the Pivot
Table.

12.2.1 Simple frequencies


The simplest application of a Pivot Table is to construct a table of frequencies.
By default, Pivot Tables report absolute frequencies - a count of the number
of times we observe a particular value in the data.
Example 12.3. A simple frequency table
To create a simple frequency table showing the number of months in office for
each Canadian prime minister:

1. Check the box next to PrimeMinister.

2. Drag MonthYr into the box marked "Σ values".

The resulting table shows the number of observations for each value of the
PrimeMinister variable, which also happens to be the number of months in
office for each prime minister. It also shows a grand total.

In many applications, we are also interested in relative frequencies - the


fraction or percentage of observations that take on a particular value. As we
discussed earlier, a relative frequency can be interpreted as an estimate of the
corresponding relative probability.

Example 12.4. Reporting relative frequencies


To add a relative frequency column, we first need to add a second absolute
frequency column:

1. In the PivotTable Fields box, drag MonthYr to the “Σ values” box.

Then we convert it to a relative frequency column:

2. Right-click on the “Count of MonthYr2” column, and select Value Field


Settings...

3. Click on the Show Values As tab and select “% of Column Total” from
the Show Values As drop-down box.
4. Select OK.

The third column will now show the number of observations as a percentage of
the total:
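For comparison, here is a rough R sketch of the same frequency table, assuming EmpData is loaded as in the earlier chapters:

# Absolute frequencies: number of months for each prime minister
table(EmpData$PrimeMinister)
# Relative frequencies: each count as a share of all observations
prop.table(table(EmpData$PrimeMinister))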

12.2.2 Cross tabulations


We can also construct frequency tables for pairs of variables. There are various
ways of laying out such a table, but the simplest is to have one variable in rows
and the other variable in columns. When the table is set up this way, we often
call it a cross tabulation or crosstab. Crosstabs can be expressed in terms of
absolute frequency, relative frequency, or both.

Example 12.5. An absolute frequency table


Starting with a blank Pivot Table:

1. Drag PrimeMinister into the Rows box.


2. Drag Party into the Columns box.
3. Drag MonthYr into the Σ values box.

You will now have this table of absolute frequencies:

For example, this crosstab tells us Brian Mulroney served 104 months as prime
minister, with all of those months as a member of the (Progressive) Conservative
party.

We can also construct crosstabs using relative frequencies, but there is more than
one kind of relative frequency we can use here. A joint frequency crosstab
shows the count in each cell as a percentage of all observations. Joint frequency
tables can be interpreted as estimates of joint probabilities.
Example 12.6. A joint frequency crosstab
To convert our absolute frequency crosstab into a joint frequency crosstab:

1. Right click on “Count of MonthYr” and select Value Field Settings...


2. Select the Show Values As tab, and select “% of Grand Total” from the
Show Values As drop-down box.

Your table will now look like this:

For example, the table tells us that Brian Mulroney’s 104 months as prime
minister represent 19.22% of all months in our data.

A conditional frequency crosstab shows the count in each cell as a percentage


in that row or column. Conditional frequencies can be interpreted as estimates
of conditional probabilities.
Example 12.7. A conditional frequency crosstab
To convert our crosstab into a conditional frequency crosstab:

1. Right click on “Count of MonthYr” and select Value Field Settings


2. Select the Show Values As tab, and select “% of column total” from the
Show Values As drop-down box.

Your table will now look like this:

For example, Brian Mulroney’s 104 months as prime minister represent 44.64%
of all months served by a Conservative prime minister in our data.
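For readers working in R, the same crosstabs can be sketched with table() and prop.table(), again assuming EmpData is loaded:

# Absolute frequency crosstab
tab <- table(EmpData$PrimeMinister, EmpData$Party)
# Joint frequencies: each cell as a share of the grand total
prop.table(tab)
# Conditional frequencies: each cell as a share of its column total
prop.table(tab, margin = 2)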

12.2.3 Conditional averages

We can also use Pivot Tables to report conditional averages. A conditional


average is just the average of one variable, taken within a sub-population defined
by another variable. For example, we might be interested in average earnings
for men in Canada versus average earnings for women.
A conditional average can be interpreted as an estimate of the corresponding
conditional mean; for example average earnings for men in a random sample
of Canadians can be interpreted as an estimate of average earnings among all
Canadian men.
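In R, a conditional average like this can be sketched with dplyr (assuming EmpData and the tidyverse are loaded, as in the graphing examples; the column name AvgUnemp is just a placeholder):

# Average unemployment rate within each prime minister's months in office
EmpData %>%
  group_by(PrimeMinister) %>%
  summarize(AvgUnemp = mean(UnempRate, na.rm = TRUE))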
Example 12.8. Adding a conditional average
Suppose we want to add the average unemployment rate during each prime
minister’s time in office to this Pivot Table:

1. Drag UnempRate into the box marked “Σ values”. The table will now
look like this:

Unfortunately, we wanted to see the average unemployment rate for each prime
minister, but instead we see the sum of unemployment rates for each prime
minister. To change this:

2. Right-click “Sum of UnempRate”, then select Value Field Settings....

3. Select Average.

The table now looks like this:

We now have the average unemployment rate for each prime minister. It is not
very easy to read, so we will want to change the formatting later.

In addition to conditional averages, we can report other conditional statistics


including variances, standard deviations, minimum, and maximum.

12.2.4 Modifying a Pivot Table

As you might expect, we can modify Pivot Tables in various ways to make them
clearer, more informative, and more visually appealing.
As with other tables in Excel, we can filter and sort them. Filtering is par-
ticularly useful with Pivot Tables since there are often categories we want to
exclude.

Example 12.9. Filtering a Pivot Table


There is no Canadian prime minister named “Transfer.” If you recall, we used
that value to represent months in the data where the prime minister changed.
To exclude those months from our main table:

1. Click on the filter button next to the row labels. The sort and filter menu will appear:

2. Uncheck the check box next to “Transfer”, and select OK:

The table no longer includes the Transfer value:

Note that the grand total has also gone down from 541 to 532.

By default, the table is sorted on the row labels, but we can sort on any column.

Example 12.10. Sorting a Pivot Table


To sort our table by months in office:

1. Click on the filter button, and the sort and filter menu will appear.


2. Select More sort options; the Pivot Table sort dialog box will appear:

3. Select the Descending (Z to A) radio button and “Count of MonthYr”


from the drop-down box.

The table is now sorted by number of months in office:

We can change number formatting, column and row titles, and various other
aspects of the table’s appearance.
Example 12.11. Cleaning up a table’s appearance
Our table can be improved by making the column headers more informative and
reporting the unemployment rate in percentage terms and fewer decimal places:

1. Right-click on “Average of UnempRate”, and then select Value Field


Settings...
2. Enter “Average Unemployment” in the Custom Name text box.
3. Select Number Format, then change the number format to Percentage with
1 decimal place.
4. Select OK and then OK again. The table will now look like this:

5. Change the other three headers. You can do this through Value Field
Settings... but you can also just edit the text directly.
• Change “Row Labels” to “Prime Minister”
• Change “Count of MonthYr” to “Months in office”
• Change “Count of MonthYr2” to “% in office”

Our final table looks like this:



Finally, we can use Pivot Tables to create graphs.

Example 12.12. A Pivot Table graph

To create a simple bar graph depicting months in office, we start by cleaning


up the Pivot Table so that it shows the data we want to represent:

1. Select any cell in this table:

2. Use filtering to remove “Transfer” from the list of prime ministers.

3. Use sorting to sort by (grand total) number of months in office.


The table should now look like this:

Then we can generate the graph:

4. Select any cell in the table, then select Insert > Recommended Charts
from the menu.

5. Select Column, and then Stacked Column from the dialog box, and then
select OK.

Your graph will look like this:



As always, there are various ways we could customize this graph to be more
attractive and informative.

You can download the full set of Pivot Tables and associated charts generated in
this chapter at https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentDataPT.xlsx

12.3 Graphical methods

Bivariate summary statistics like the covariance and correlation provide a sim-
ple way of characterizing the relationship between any two numeric variables.
Frequency tables, cross tabulations, and conditional averages allow us to gain
a greater understanding of the relationship between two discrete or categorical
variables, or between a discrete/categorical variable and a continuous variable.

In order to develop a detailed understanding of the relationship between two


continuous variables (or discrete variables with many values), we need to de-
velop some additional methods. The methods that we will explore in this class
are primarily graphical. You will learn more about the underlying numerical
methods in courses like ECON 333.

12.3.1 Scatter plots

A scatter plot is the simplest way to view the relationship between two vari-
ables in data. The horizontal (𝑥) axis represents one variable, the vertical (𝑦)
axis represents the other variable, and each point represents an observation.

Scatter plots can be created in R using the geom_point() geometry:

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct)) + geom_point()


[Figure: scatter plot of LFPPct (vertical axis) against UnempPct (horizontal axis)]

In some sense, the scatter plot shows everything about the relationship between
the two variables, since it shows every observation. The negative relationship
between the two variables indicated by the correlation we calculated earlier
(-0.2557409) is clear, but it is also clear that this relationship is not very strong.

12.3.1.1 Jittering

If both of our variables are truly continuous, each point represents a single
observation. But if both variables are actually discrete, points can “stack” on top
of each other. In that case, the same point can represent multiple observations,
leading to a misleading scatter plot.
For example, suppose we had rounded our unemployment and LFP data to the
nearest percent:

# Round UnempPct and LFPPct to nearest integer
RoundedEmpData <- EmpData %>%
  mutate(UnempPct = round(UnempPct)) %>%
  mutate(LFPPct = round(LFPPct))

The scatter plot with the rounded data would look like this:

# Create graph using rounded data


ggplot(data = RoundedEmpData, aes(x = UnempPct, y = LFPPct)) + geom_point(col = "red")
[Figure: scatter plot of the rounded data, LFPPct against UnempPct; many observations stack on the same points]

As you can see from the graph, the scatter plot is misleading: there are 541
observations in the data set represented by only 40 points.

A common solution to this problem is to jitter the data by adding a small


amount of random noise so that every observation is at least a little different
and appears as a single point. We can use the geom_jitter() geometry to do
a jittered scatter plot:

ggplot(data = RoundedEmpData, aes(x = UnempPct, y = LFPPct)) + geom_point(col = "red") +
  geom_jitter(size = 0.5, col = "blue")
[Figure: scatter plot of the rounded data (large red dots) with jittered points overlaid (small blue dots)]

As you can see, the jittered rounded data (small blue dots) reflects the original unrounded data more accurately than the rounded data (large red dots) does.

12.3.1.2 Using color as a third dimension

We can use color to add a third dimension to the data. That is, we can color-
code points based on a third variable by including it as part of the aesthetic:

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct, col = Party)) + geom_point()


[Figure: scatter plot of LFPPct against UnempPct with points coloured by Party (Conservative, Liberal, Transfer)]

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct, col = MonthYr)) + geom_point()

[Figure: scatter plot of LFPPct against UnempPct with points coloured by MonthYr on a continuous scale (1980–2020)]

As these graphs show, R will use a discrete or continuous color scheme depending
on whether the variable is discrete or continuous.

As we discussed earlier, you want to make sure your graph can be read by a
reader who is color blind or is printing in black and white. So we can use shapes
in addition to color:

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct, col = Party)) + geom_point(aes(shape = Party))

[Figure: scatter plot of LFPPct against UnempPct with Party indicated by both colour and point shape]

We would also choose a color scheme other than red and green, since that is the
most common form of color blindness.

FYI
Scatter plots in Excel
Scatter plots can also be created in Excel, though it is more work and
produces less satisfactory results.

12.3.2 Binned averages


In section 12.2.3 we calculated a conditional average of the (continuous) variable
UnempRate for each observed value of the discrete variable PrimeMinister.
When both variables are continuous, this isn’t such a good idea: there are
as many values of each variable as there are observations, so the “conditional
average” ends up just being the original data.
When both variables are continuous, one solution is to divide the range for 𝑥𝑖
into a set of bins and then take averages within each bin. We can then plot the
average 𝑦𝑖 within each bin against the midpoint of the bin. This kind of plot is
called a binned scatterplot.

Binned scatterplots are not difficult to do in R, but the code is quite a bit more
complex than you are used to. As a result, I will not ask you to produce binned
scatter plots; I will only ask you to interpret them. Here is my binned scatter
plot with 20 bins:
[Figure: binned scatter plot of LFPPct against UnempPct with 20 bins]

The number of bins is an important choice. The graph below adds a red line
based on 4 bins and a green line based on 100 bins.

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct)) + geom_point(size = 0.5) +
  stat_summary_bin(fun = "mean", bins = 4, col = "red", size = 1, geom = "point") +
  stat_summary_bin(fun = "mean", bins = 4, col = "red", size = 0.5, geom = "line") +
  stat_summary_bin(fun = "mean", bins = 20, col = "blue", size = 1, geom = "point") +
  stat_summary_bin(fun = "mean", bins = 20, col = "blue", size = 0.5, geom = "line") +
  stat_summary_bin(fun = "mean", bins = 100, col = "green", size = 1, geom = "point") +
  stat_summary_bin(fun = "mean", bins = 100, col = "green", size = 0.5, geom = "line") +
  geom_text(x = 13.8, y = 64.1, label = "4 bins", col = "red") +
  geom_text(x = 13.8, y = 62.5, label = "20 bins", col = "blue") +
  geom_text(x = 13.8, y = 60, label = "100 bins", col = "green")
[Figure: binned averages of LFPPct against UnempPct with 4 bins (red), 20 bins (blue), and 100 bins (green) overlaid on the data]

As you can see, the binned scatterplot tends to be smooth when there are only
a few bins, and jagged when there are many bins. This reflects a trade-off
between bias (too few bins may lead us to miss important patterns in the data)
and variance (too many bins may lead us to see patterns in the data that aren’t
really part of the DGP).

12.3.3 Smoothing

An alternative to binned averaging is smoothing, which calculates a smooth


curve that fits the data as well as possible. There are many different techniques
for smoothing, but they are all based on taking a weighted average of 𝑦𝑖 near
each point, with high weights on observations with 𝑥𝑖 close to that point and low
(or zero) weights on observations with 𝑥𝑖 far from that point. The calculations
required for smoothing can be quite complex and well beyond the scope of this
course.

Fortunately, smoothing is easy to do in R using the geom_smooth() geometry:

ggplot(data = EmpData, aes(x = UnempRate, y = LFPRate)) + geom_point(size = 0.5) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[Figure: scatter plot of LFPRate against UnempRate with a smooth (loess) fit and its 95% confidence band]

Notice that by default, the graph includes both the fitted line (in blue) and a
95% confidence interval (the shaded area around the line). Also note that the
confidence interval is narrow in the middle (where there is a lot of data) and
wide at the ends (where there is less data).

12.3.4 Linear regression

Our last approach is to assume that the relationship between the two variables
is linear, and estimate it by a technique called linear regression. Linear
regression calculates the straight line that fits the data best.

You can include a linear regression line in your plot by adding the method=lm
argument to the geom_smooth() geometry:

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct)) + geom_point(size = 0.5) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
[Figure: scatter plot of LFPPct against UnempPct with a linear regression fit and its confidence band]

We can compare the linear and smoothed fits to see where they differ:

ggplot(data = EmpData, aes(x = UnempPct, y = LFPPct)) + geom_point(size = 0.5) +
  geom_smooth(col = "red") + geom_smooth(method = "lm", col = "blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
[Figure: scatter plot of LFPPct against UnempPct with the smoothed fit (red) and the linear fit (blue)]

As you can see, the two fits are quite similar for unemployment rates below
12%, but diverge quite a bit above that level. This is inevitable: above that
level the smooth fit becomes steeper, and a straight line cannot.
Linear regression is much more restrictive than smoothing, but it has several
important advantages:

• The relationship is much easier to interpret, as it can be summarized by
a single number: the slope of the line.
• The linear relationship is much more precisely estimated.

These advantages are not particularly important in this case, with only two
variables and a reasonably large data set. The advantages of linear regression
become overwhelming when you have more than two variables to work with. As
a result, linear regression is the most important tool in applied econometrics,
and you will spend much of your time in ECON 333 learning to use it.
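As a preview, the line itself can be estimated directly with base R's lm() function; a minimal sketch (the object name fit is just a placeholder):

# Fit the linear regression of LFPPct on UnempPct
fit <- lm(LFPPct ~ UnempPct, data = EmpData)
# The intercept and slope of the best-fitting straight line
coef(fit)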

Chapter review
Econometrics is mostly about the relationship between variables: price and
quantity, consumption and savings, labour and capital, today and tomorrow.
So most of what we do is multivariate analysis.
This chapter has provided a brief view of some of the main techniques for mul-
tivariate analysis. Our higher-level statistics courses (ECON 333, ECON 334,
ECON 335, ECON 433, ECON 435) are all about multivariate analysis, and
will develop both the theory behind these tools and the set of applications in
much greater detail.

Practice problems
Answers can be found in the appendix.
SKILL #1: Calculate and interpret covariance and correlation

1. Using the EmpData data set, calculate the covariance and correlation of
UnempPct and AnnPopGrowth. Based on these results, are periods
of high population growth typically periods of high unemployment?

SKILL #2: Distinguish between pairwise and casewise deletion of


missing values

2. In problem (1) above, did you use pairwise or casewise deletion of missing
values? Did it matter? Explain why.

SKILL #3: Construct and interpret a pivot table in Excel

3. The following tables are based on 2019 data for Canadians aged 25-34.
Classify each of these tables as simple frequency tables, crosstabs, or con-
ditional averages.

a.

Educational attainment      Percent
Below high school           6
High school                 31
Tertiary (e.g. university)  63

b.

Gender   Years of schooling
Male     14.06
Female   14.74

c.

Educational attainment   Male   Female
Below high school        7      5
High school              38     24
Tertiary                 55     71

SKILL #4: Construct and interpret a scatter plot in R

4. Using the EmpData data set, construct a scatter plot with annual popula-
tion growth on the horizontal axis and unemployment rate on the vertical
axis.

SKILL #5: Construct and interpret a linear or smoothed average


plot in R

5. Using the EmpData data set, construct the same scatter plot as in problem
(4) above, but add a smooth fit and a linear fit.
Appendix A

Math review

The math used in this textbook is all covered in high school or in introductory
calculus. However, you may have forgotten it, or never understood it very
well in the first place. This appendix provides a review of the most important
mathematical terms, concepts and methods. It can be used to review ideas
before starting the main text, or as a reference while going through the main
text.
If you have trouble with a few of these ideas, don’t panic. This is not a math
class. For example, if you forget what it means for two sets to be “disjoint”,
you can always ask.

A.1 Sets
The most fundamental notion in mathematics is the idea of a set. A set is
typically described as a collection or gathering of distinct objects. These objects
are called the elements of the set. Sets are not ordered, and elements cannot
be repeated.

A.1.1 Defining a set

We have several ways of defining or describing sets. The simplest is enumer-


ation, which means you just list the elements:

𝐴 = {1, 2, 3}

𝐵 = {𝐴𝑣𝑜𝑐𝑎𝑑𝑜, 𝐵𝑎𝑛𝑎𝑛𝑎}

Notice that the list of elements is surrounded by curly brackets.


A second way of defining a set is to use set-builder notation. Set builder


notation defines sets in terms of rules they must satisfy. For example:

𝐶 = {𝑥 ∈ 𝐵 ∶ 𝑥 is yellow}

We read this as “𝐶 is the set of all 𝑥 in the set 𝐵 such that 𝑥 is yellow.” In
other words:

• Avocados are not in set 𝐶 because they are not yellow.


• Bananas are in set 𝐶 because they are in set 𝐵 and they are yellow.
• Lemons are not in set 𝐶 because they are not in set 𝐵.

We sometimes leave the initial set implicit:

𝐶 = {𝑥 ∶ 𝑥 is yellow}

We would interpret this as saying that 𝐶 is the set of everything that is yellow.
There are a few special sets defined by convention:

• The empty set, usually written ∅, is a set with no elements.


• The universal set (usually written 𝕌) is the set of “everything” we might
be talking about in a given context. The universal set is usually implicit,
though we will occasionally need to define it explicitly.
• The set of integers is usually written as ℤ.
• The set of rational numbers is usually written as ℚ.
• The set of real numbers is usually written as ℝ.
• The set of positive integers is ℤ+ , the set of positive real numbers is ℝ+ ,
etc.

Finally we can just refer to an abstract set without specifying its contents, just
like we can refer to a variable in algebra without specifying its value.

A.1.2 Characteristics of a set

The size or cardinality of the set 𝐴, usually written |𝐴|, is simply the number
of elements it has.

• 𝐴 is a singleton if |𝐴| = 1.
• 𝐴 is a finite set if |𝐴| is a finite number. Otherwise it is an infinite set.

Example A.1. Cardinality (size)


The integers (ℤ) and real numbers (ℝ) are infinite sets:

|ℤ| = ∞

|ℝ| = ∞

These are finite sets:


|{1, 2, 3}| = 3
|{𝐴𝑣𝑜𝑐𝑎𝑑𝑜}| = 1
|∅| = 0
In addition, {𝐴𝑣𝑜𝑐𝑎𝑑𝑜} is a singleton.

A.1.3 Set algebra

Let 𝐴 and 𝐵 be two sets. We have several ways of describing how they are
related:

• 𝐴 and 𝐵 are identical (written 𝐴 = 𝐵) if they contain the same elements.


• 𝐴 and 𝐵 are disjoint if they have no elements in common.
• 𝐴 is a subset of 𝐵 (written 𝐴 ⊆ 𝐵 or 𝐴 ⊂ 𝐵 ) if all elements of 𝐴 are
also elements of 𝐵.

Example A.2. Set relationships


These sets are identical:
{1, 2} = {1, 2}
{1, 2} = {2, 1}
These sets are not identical:

{1, 2} ≠ {1, 2, 3}

{1, 2} ≠ {1, 3}
These sets are disjoint
{1, 2} and {3, 4}
These sets are not disjoint

{1, 2, 3} and {3, 4}

The first of these sets is a subset of the second set:

{1, 2} ⊂ {1, 2}

{1, 2} ⊂ {1, 2, 3}
The first of these sets is not a subset of the second set:

{1, 2} ⊄ {1, 3}

We can also perform various mathematical operations on sets.

• The intersection of 𝐴 and 𝐵 (usually written 𝐴 ∩ 𝐵) is the set of every-


thing that is an element of both 𝐴 and 𝐵.

𝐴 ∩ 𝐵 = {𝑥 ∶ 𝑥 ∈ 𝐴 and 𝑥 ∈ 𝐵}

• The union of 𝐴 and 𝐵 (usually written 𝐴 ∪ 𝐵) is the set of everything


that is in 𝐴 or 𝐵 (or both):

𝐴 ∪ 𝐵 = {𝑥 ∶ 𝑥 ∈ 𝐴 or 𝑥 ∈ 𝐵}

• The set difference between 𝐴 and 𝐵 is written as 𝐵 − 𝐴 or 𝐵 ∖ 𝐴 and
defined as the set of everything in 𝐵 that is not in 𝐴:

𝐵 − 𝐴 = {𝑥 ∈ 𝐵 ∶ 𝑥 ∉ 𝐴}

We will not use the set difference, but it helps to define the complement,
which we will use.
• The complement of 𝐴 is written 𝐴′ , ¬𝐴 or 𝐴𝑐 , and is simply everything
that is not in 𝐴:
𝐴𝑐 = 𝕌 − 𝐴 = {𝑥 ∶ 𝑥 ∉ 𝐴}

Example A.3. Set operations


Intersections take all common elements from the two sets:

{1, 2} ∩ {3, 4} = ∅

{1, 2, 3} ∩ {3, 4} = {3}


Unions take all elements from the two sets:

{1, 2} ∪ {3, 4} = {1, 2, 3, 4}

{1, 2, 3} ∪ {3, 4} = {1, 2, 3, 4}


Set differences take elements out of the first set if they are in the second set:

{1, 2} − {3, 4} = {1, 2}

{1, 2, 3} − {3, 4} = {1, 2}


The complement of a set is everything (in the universal set, whatever that is)
else:

• The complement of “people in the labour force” is “people not in the


labour force” (the universal set is “people”)
• The complement of “the Canucks win” is “the Canucks do not win”
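These set operations are also easy to check numerically in base R, which represents finite sets as vectors of distinct elements; a minimal sketch:

A <- c(1, 2, 3)
B <- c(3, 4)
intersect(A, B)  # 3
union(A, B)      # 1 2 3 4
setdiff(A, B)    # 1 2  (everything in A that is not in B)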

The combination of these basic definitions, relationships (identity, element, subset)
and operations (union, intersection, complement) is collectively called set
algebra.

FYI
Some standard results about sets
Given the basic components of set algebra, we can establish many useful
rules. This is not a course on set theory, so I will simply list some of the
most important rules for your reference. None of these rules is difficult
to prove, and most of them should make intuitive sense to you.

• Non-negative cardinality:

|𝐴| ≥ 0

• Cardinality of unions:

|𝐴 ∪ 𝐵| = |𝐴| + |𝐵| − |𝐴 ∩ 𝐵| ≤ |𝐴| + |𝐵|

• Cardinality of intersections:

|𝐴 ∩ 𝐵| ≤ 𝑚𝑖𝑛(|𝐴|, |𝐵|)

• Commutative laws:
𝐴∪𝐵 =𝐵∪𝐴
𝐴∩𝐵 =𝐵∩𝐴
• Associative laws:

(𝐴 ∪ 𝐵) ∪ 𝐶 = 𝐴 ∪ (𝐵 ∪ 𝐶)

(𝐴 ∩ 𝐵) ∩ 𝐶 = 𝐴 ∩ (𝐵 ∩ 𝐶)
• Distributive laws:

𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶)

𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶)
• Identity laws:
𝐴∪∅=𝐴
𝐴∩𝕌=𝐴
• Complement laws:
𝐴 ∪ 𝐴𝑐 = 𝕌
𝐴 ∩ 𝐴𝑐 = ∅
𝕌𝑐 = ∅
∅𝑐 = 𝕌
• Double-complement law:

(𝐴𝑐 )𝑐 = 𝐴

• Idempotent laws:
𝐴∪𝐴=𝐴
𝐴∩𝐴=𝐴
• Domination laws:
𝐴∪𝕌=𝕌
𝐴∩∅=∅

A.2 Functions

A.2.1 Definition of a function

A function is a rule that matches (“maps”) elements of one set (called the
domain of the function) to elements of another set (called the range of the
function). We use the notation

𝑓 ∶𝐷→𝑅

to declare that a particular function 𝑓 has domain 𝐷 and range 𝑅.

Example A.4. Domain and range of a function

• 𝑓 ∶ ℝ → ℝ means that the function 𝑓 takes a real number, and returns a


real number.
• 𝑔 ∶ ℤ+ → {0, 1} means that the function 𝑔 takes a positive integer, and
returns either zero or one.
• 𝑠 ∶ {𝐴𝑣𝑜𝑐𝑎𝑑𝑜, 𝐵𝑎𝑛𝑎𝑛𝑎, 𝐶𝑎𝑛𝑡𝑎𝑙𝑜𝑢𝑝𝑒} → ℤ means that the function 𝑠 takes
either 𝐴𝑣𝑜𝑐𝑎𝑑𝑜, 𝐵𝑎𝑛𝑎𝑛𝑎 or 𝐶𝑎𝑛𝑡𝑎𝑙𝑜𝑢𝑝𝑒 and returns an integer.

If a function has a finite domain, we can define the function by simple enumer-
ation.

Example A.5. A function defined by enumeration


Since the function 𝑠 ∶ {𝐴𝑣𝑜𝑐𝑎𝑑𝑜, 𝐵𝑎𝑛𝑎𝑛𝑎, 𝐶𝑎𝑛𝑡𝑎𝑙𝑜𝑢𝑝𝑒} → ℤ has a finite domain,
we can define it by listing all of its values:

𝑠(𝐴𝑣𝑜𝑐𝑎𝑑𝑜) = 1

𝑠(𝐵𝑎𝑛𝑎𝑛𝑎) = 0
𝑠(𝐶𝑎𝑛𝑡𝑎𝑙𝑜𝑢𝑝𝑒) = 500
We could also make a table:

𝑓𝑟𝑢𝑖𝑡 𝑠(𝑓𝑟𝑢𝑖𝑡)
Avocado 1
Banana 0
Cantaloupe 500

We can also define a function by a mathematical expression, or we can just refer


to a function without defining exactly what it is, just like we can talk about a
variable 𝑥 or a set 𝐴 without assigning it a particular value.

FYI
Function or multiplication?
Because of the way we conventionally write functions, students sometimes
confuse function application with multiplication. Suppose I write this:
𝑧 = 𝑓(𝑥 + 𝑦)
There are two possible interpretations of this statement:

1. 𝑓 is a number, so 𝑧 is equal to the number 𝑓 times the number


(𝑥 + 𝑦).
2. 𝑓 is a function, so 𝑧 is equal to the function 𝑓 applied to the number
(𝑥 + 𝑦).

It is sometimes clear from the context which of these interpretations is


correct in a given problem. But please ask if you aren’t sure.

A.2.2 The indicator function

The indicator function is a special function whose argument is a statement.


The indicator function is typically represented by 𝐼(⋅) or 𝟙(⋅). It returns a value
of 1 if the statement is true, and 0 if it is false.

Example A.6. Indicator functions

𝐼(3 < 5) = 1
𝐼(3 = 5) = 0
𝐼(Ottawa is the capital of Canada) = 1
𝐼(Ottawa is in Alberta) = 0

We use indicator functions all the time in probability and statistics because
they allow us to convert a qualitative statement like “Bob is employed” into a
quantitative statement like 𝐼(Bob is employed) = 1.
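R has no special indicator function, but logical comparisons return TRUE/FALSE values that coerce to 1/0, which amounts to the same thing; a minimal sketch:

as.numeric(3 < 5)   # 1, since the statement is true
as.numeric(3 == 5)  # 0, since the statement is false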

A.3 Sequences and summations

A.3.1 Cartesian products

The Cartesian product of two sets 𝐴 and 𝐵, usually written 𝐴 × 𝐵, is the set
of all ordered pairs of elements in the two sets.

Example A.7. Cartesian products


If 𝐴 = {1, 2} and 𝐵 = {3, 4} then

𝐴 × 𝐵 = {(1, 3), (1, 4), (2, 3), (2, 4)}

A particularly important example of a Cartesian product is:

ℝ2 = ℝ × ℝ

the set of ordered pairs of real numbers. For example (0, 3) and (0.427, 2000)
are both elements of ℝ2 .
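In R, the Cartesian product of two finite sets can be enumerated with base R's expand.grid(); a minimal sketch:

A <- c(1, 2)
B <- c(3, 4)
# All ordered pairs; rows appear as (1,3), (2,3), (1,4), (2,4)
expand.grid(A = A, B = B)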

A.3.2 Sequences

We will also be interested in the set of ordered sequences of 𝑛 real numbers


(where 𝑛 is some positive integer):

ℝ𝑛 = ℝ × ℝ × ⋯ × ℝ

We usually distinguish between sequences and sets by using parentheses for


sequences and curly brackets for sets.
A sequence is much like a set but with two important differences:

1. Order matters.
2. Elements can be repeated.

Like sets, sequences can be empty, finite, or infinite.


Example A.8. Sequences and sets
Order matters for sequences but not for sets:

{1, 2} = {2, 1}

(1, 2) ≠ (2, 1)
Sequences can include repeated elements, but sets cannot:

{1, 3, 1} is not a valid set.

(1, 3, 1) is a valid sequence.


Sequences have cardinality (size)

() is an empty sequence.

(2, 4, 6) is a finite sequence.


(2, 4, 6, …) is an infinite sequence.

We usually number each element in a sequence. The number assigned to a


given element is called its index. Typically, the first element in a sequence is
numbered either 0 or 1, and the remaining elements are numbered sequentially
after that.
We can define sequences by simple enumeration as in the examples above. These
are a few other ways of defining a sequence:

• As a sequence of variables with subscripts: (𝑥1 , 𝑥2 , 𝑥3 ). We will usually


use a variable as the subscript 𝑥𝑖 when we want to talk about an arbitrary
element of the sequence.
• As functions of other sequences: 𝑦𝑖 = ln(𝑥𝑖 ).
• As functions of the index itself 𝑥𝑖 = 𝑎𝑖

A.3.3 Summations

Many statistics we are calculating are constructed by adding up a sequence of


numbers. It will be convenient to use the summation operator, which looks
like this:
∑_{(index)∈(set)} (variable that depends on index)

Notice that an expression using the summation operator has several components:

• The summation sign ∑.


• An expression identifying the index variable, and what values it takes on.
The index variable is usually but not always called 𝑖.
• An expression identifying what is to be added up. It is usually but not
always a function of the index variable.

Example A.9. Summations

∑_{i∈A} means we add up over all of the values in the set A:

∑_{i∈{1,2,3}} xᵢ = x₁ + x₂ + x₃

∑_{i=start}^{end} means we add up over all of the integers between start and end:

∑_{i=1}^{3} xᵢ = x₁ + x₂ + x₃

∑_{i=1}^{n} xᵢ = x₁ + x₂ + ⋯ + xₙ

The index set can be infinite:

∑_{i=1}^{∞} xᵢ = x₁ + x₂ + ⋯

The summation can be of any expression that depends on the index:

∑_{i=1}^{3} ln(xᵢ) = ln(x₁) + ln(x₂) + ln(x₃)

∑_{j=1}^{3} βʲ = β + β² + β³

But the expression does not have to vary with the index:

∑_{i=1}^{3} 2x = 2x + 2x + 2x = 6x

∑_{i=1}^{3} 3 = 3 + 3 + 3 = 9

We can use multiple summation operators in an expression:

∑_{j=1}^{2} ∑_{i=1}^{2} xᵢyⱼ = ∑_{j=1}^{2} (x₁yⱼ + x₂yⱼ) = x₁y₁ + x₁y₂ + x₂y₁ + x₂y₂

The summation operator looks fancy, but remember it is just a concise way of
describing a sum. If you are ever struggling with understanding a summation,
write it out.

You learned the basic properties of addition and multiplication in Grade 3:

• Associative property:

(𝑎 + 𝑏) + 𝑐 = 𝑎 + (𝑏 + 𝑐)

(𝑎𝑏)𝑐 = 𝑎(𝑏𝑐)
• Commutative property:
𝑎+𝑏 =𝑏+𝑎
𝑎𝑏 = 𝑏𝑎
• Distributive property:
𝑎(𝑏 + 𝑐) = 𝑎𝑏 + 𝑎𝑐

An expression using the summation operator describes a sum, so it also obeys
these properties:

• The associative and commutative properties allow you to switch any two
summation operators:

∑_{i∈A} ∑_{j∈B} xᵢyⱼ = ∑_{j∈B} ∑_{i∈A} xᵢyⱼ

and to take the summation operator “into” or “out of” a sum:

∑_{i∈A} (xᵢ + yᵢ) = (∑_{i∈A} xᵢ) + (∑_{i∈A} yᵢ)

• The distributive property allows you to take any constant out of the
summation operator:

∑_{i=1}^{n} a ∗ xᵢ = a ∑_{i=1}^{n} xᵢ

We will use these three results later on.
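These rules are easy to verify numerically in R with sum(); a minimal sketch using arbitrary example values:

x <- c(1, 2, 3)
y <- c(4, 5, 6)
a <- 2
sum(x + y) == sum(x) + sum(y)  # TRUE: the sum splits apart
sum(a * x) == a * sum(x)       # TRUE: a constant factors out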

A.4 Limits
Let (𝑥1 , 𝑥2 , …) be a sequence of infinite length. We say that some number 𝑐 is
the limit of this sequence if 𝑥𝑖 gets closer and closer to 𝑐 as 𝑖 gets bigger and
bigger.

• The limit of a sequence can be a number.


• The limit of a sequence can be ∞ or −∞.
• Not all sequences have limits.

Example A.10. Limits

lim(1, 1, 1, …) = 1
lim(1, 1/2, 1/3, …) = 0
lim(1, 2, 3, …) = ∞
lim(−1, −2, −3, …) = −∞
lim(0, 1, 0, 1, 0, 1, …) does not exist
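The second sequence can be checked numerically in R: the terms 1/i get as close to 0 as we like once i is large enough.

i <- c(1, 10, 1000, 100000)
1 / i  # 1.00000 0.10000 0.00100 0.00001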

You will learn or have learned the formal definition of a limit in your calculus
course. I won’t make you re-learn it for this class, but the formal definition is
provided below for your reference.

FYI
Definition of a limit
Let (𝑥1 , 𝑥2 , …) be a sequence of infinite length. We say that the number
𝑐 is the limit of this sequence:

lim 𝑥𝑖 = 𝑐
𝑖→∞

if for any 𝛿 > 0 there exists an 𝑁𝛿 such that

|𝑥𝑖 − 𝑐| < 𝛿 for all 𝑖 > 𝑁𝛿

Chapter review
This appendix provides some of the basic mathematical background needed for
this course. There are many other useful sources if you need further information.

• Wikipedia’s coverage of mathematics and introductory statistics content


is usually quite reliable, and there is no shame in using it.
• Your calculus textbook will typically provide much more detail on func-
tions, sequences, and limits, and almost any undergraduate math textbook
will include an appendix on set theory.

Finally, you can just ask.

Practice problems
Answers can be found in the appendix.
SKILL #1: Define a set using enumeration or set-builder notation

1. Use enumeration to define 𝐴 as the set of “people who live in your house”.
2. Use set-builder notation to define 𝐵 as the set of integers between 1 and
1,000.

SKILL #2: Determine the cardinality of a set

3. Calculate the cardinality of the sets you defined in problems (1) and (2)
above.

SKILL #3: Identify whether sets are identical, disjoint, or subsets



4. Let 𝐴 = {1, 2, 3}, let 𝐵 = {2}, let 𝐶 = ∅, and let 𝐷 = {𝑥 ∈ ℤ ∶ 1 < 𝑥 < 3}
a. Which pairs of sets are identical?
b. Which sets are disjoint with 𝐴?
c. Which sets are subsets of 𝐴?
d. Which sets are subsets of 𝐵?

SKILL #4: Perform set algebra operations on simple sets

5. Let 𝐴 = {1, 2, 3}, let 𝐵 = {2, 4}, let 𝐶 = ∅, and let 𝐷 = {𝑥 ∈ ℤ ∶ 1 < 𝑥 < 3}
a. Find 𝐴 ∩ 𝐵.
b. Find 𝐴 ∪ 𝐵.
c. Find 𝐴 ∩ 𝐷.
d. Find 𝐴 ∪ 𝐷.
e. Suppose the universal set is 𝕌 = {1, 2, 3, 4, 5}. Find 𝐵ᶜ.
f. Suppose the universal set is the set of integers ℤ. Find 𝐵ᶜ.

SKILL #5: Use the indicator function

6. Let 𝑥 = 4. Find the value of each of the following:

a. 𝐼(𝑥 < 5)
b. 𝐼(𝑥 is an odd number)
c. 𝐼(𝑥 < 5) − 𝐼(√𝑥 is an integer).

SKILL #6: Use sequences and limits

7. Find the limit of each of these sequences, if it exists.

a. (1, 1/2, 1/3, 1/4, …)
b. (1, −1, 1, −1, …)
c. (6, 5 1/2, 5 1/3, 5 1/4, …)
d. (−1, 1/2, −1/3, 1/4, …)
e. (1, 2, 3, 4, …)

SKILL #7: Use the summation operator

8. Find the values of each of these summations

a. ∑_{x=1}^{5} x²
b. ∑_{i∈{1,3,5}} ln(i)
c. ∑_{i=1}^{∞} xᵢ I(i < 4)

SKILL #8: Apply the rules of arithmetic to a summation



9. Which of the following statements are true?

a. ∑_{i=1}^{n} (5 + 2xᵢ) = 5 + 2 ∑_{i=1}^{n} xᵢ
b. ∑_{i=1}^{n} (xᵢ × yᵢ) = ∑_{i=1}^{n} xᵢ × ∑_{i=1}^{n} yᵢ
c. ∑_{i=1}^{n} (xᵢ + yᵢ) = ∑_{i=1}^{n} xᵢ + ∑_{i=1}^{n} yᵢ
d. (∑_{i=1}^{n} xᵢ)² = ∑_{i=1}^{n} xᵢ²
Appendix B

Solutions to practice
problems

2 Basic data cleaning with Excel


Click here to see the problems

1. Table (b) shows a tidy data set.


2. Characteristic (a) is necessary for a variable to function as an ID variable.
3. Cell E15 is 2 columns and 3 rows away from cell C12. So we change all
relative references from column B to column D, and all relative references
from row 2 to row 5, and all relative references from row 10 to row 13.
Absolute references are not changed.
a. =D5
b. =$B$2
c. =$B5
d. =D$2
e. =SUM(D5:D13)
f. =SUM($B$2:$B$10)
g. =SUM($B5,$B13)
h. =SUM(D$2,D$10)
4. The formulas are:
a. =SQRT(A2)
b. =MIN(A2:A100)
c. =ABS(A2-B2)
5. The formulas are:
a. =IF(A2<0.05,"Reject","Fail to reject")


b. =CONCAT("A2 = ",A2)
c. =LEFT(A2,2)

6. The formulas are:

a. =MONTH(TODAY())
b. =TODAY()+100
c. =TODAY()-DATE(1969,11,20) - using my birthday; yours will obvi-
ously be different.

3 Probability and random events

Click here to see the problems

1. The sample space is the set of all possible outcomes for (𝑟, 𝑤), and its
cardinality is just the number of elements it has.

a. The sample space is:

Ω = { (1,1), (1,2), (1,3), (1,4), (1,5), (1,6),
      (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
      (3,1), (3,2), (3,3), (3,4), (3,5), (3,6),
      (4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
      (5,1), (5,2), (5,3), (5,4), (5,5), (5,6),
      (6,1), (6,2), (6,3), (6,4), (6,5), (6,6) }   (B.1)

b. Counting up the number of elements in the set, we get |Ω| = 36.

2. Each event is just a set listing the (𝑟, 𝑤) outcomes that satisfy the relevant
conditions.

a. Yo wins whenever 𝑟 + 𝑤 = 11.

𝑌 𝑜 = {(5, 6), (6, 5)} (B.2)

b. Snake eyes wins when 𝑟 + 𝑤 = 2:

𝑆𝑛𝑎𝑘𝑒𝐸𝑦𝑒𝑠 = {(1, 1)} (B.3)

c. Boxcars wins when 𝑟 + 𝑤 = 12:

𝐵𝑜𝑥𝑐𝑎𝑟𝑠 = {(6, 6)} (B.4)



d. Field wins when 𝑟 + 𝑤 ∈ {2, 3, 4, 9, 10, 11, 12}:

Field = { (1,1), (1,2), (1,3),
          (2,1), (2,2),
          (3,1), (3,6),
          (4,5), (4,6),
          (5,4), (5,5), (5,6),
          (6,3), (6,4), (6,5), (6,6) }   (B.5)

3. Statements (b) and (c) are true.


4. Statements (a) and (d) are true.
5. Statements (b) and (c) are true.
6. Event (c) is an elementary event.
7. All three elementary events have probability 1/36 ≈ 0.028
8. The probability of each event can be calculated by adding up the proba-
bilities of its elementary events:

a. The probability a bet on Yo wins is:

Pr(𝑌 𝑜) = Pr({(5, 6), (6, 5)}) (B.6)


= Pr({(5, 6)}) + Pr({(6, 5)}) (B.7)
= 2/36 (B.8)
≈ 0.056 (B.9)

b. The probability a bet on Snake eyes wins is:

Pr(𝑆𝑛𝑎𝑘𝑒𝐸𝑦𝑒𝑠) = Pr({(1, 1)}) (B.10)


= 1/36 (B.11)
≈ 0.028 (B.12)

c. The probability a bet on Boxcars wins is:

Pr(𝐵𝑜𝑥𝑐𝑎𝑟𝑠) = Pr({(6, 6)}) (B.13)


= 1/36 (B.14)
≈ 0.028 (B.15)

d. The probability a bet on Field wins is:

Pr(Field) = Pr({(1,1), (1,2), (1,3), (2,1), (2,2), (3,1), (3,6), (4,5),
                (4,6), (5,4), (5,5), (5,6), (6,3), (6,4), (6,5), (6,6)})   (B.16)
          = Pr({(1,1)}) + Pr({(1,2)}) + ⋯ + Pr({(6,6)})   (B.17)
          = 16/36   (B.18)
          ≈ 0.444   (B.19)

9. To calculate the joint probability, start by calculating the intersection of


the two events:
a. The joint probability is:

Pr(𝑌 𝑜 ∩ 𝐵𝑜𝑥𝑐𝑎𝑟𝑠) = Pr({(5, 6), (6, 5)} ∩ {(6, 6)}) (B.20)


= Pr(∅) (B.21)
=0 (B.22)

b. The joint probability is:

Pr(Yo ∩ Field) = Pr({(5,6), (6,5)} ∩ Field)   (B.23)
               = Pr({(5,6), (6,5)})   (B.24)
               = 2/36   (B.25)
               ≈ 0.056   (B.26)

c. The joint probability is:

Pr(𝑌 𝑜 ∩ 𝐵𝑜𝑥𝑐𝑎𝑟𝑠𝐶 ) = Pr({(5, 6), (6, 5)} ∩ {(6, 6)}𝐶 ) (B.27)


= Pr({(5, 6), (6, 5)}) (B.28)
= 2/36 (B.29)
≈ 0.056 (B.30)

10. The conditional probability is just the ratio of the joint probability to the
probability of the event we are conditioning on:

a. The conditional probability is:

Pr(Yo | Boxcars) = Pr(Yo ∩ Boxcars) / Pr(Boxcars)   (B.31)
                 = 0 / (1/36)   (B.32)
                 = 0   (B.33)

b. The conditional probability is:

Pr(Yo | Field) = Pr(Yo ∩ Field) / Pr(Field)   (B.34)
               = (2/36) / (16/36)   (B.35)
               = 0.125   (B.36)

c. The conditional probability is:

Pr(Yo | Boxcarsᶜ) = Pr(Yo ∩ Boxcarsᶜ) / Pr(Boxcarsᶜ)   (B.37)
                  = (2/36) / (1 − 1/36)   (B.38)
                  ≈ 0.057   (B.39)

d. The conditional probability is:

Pr(Field | Yo) = Pr(Field ∩ Yo) / Pr(Yo)   (B.40)
               = (2/36) / (2/36)   (B.41)
               = 1   (B.42)

e. The conditional probability is:

Pr(Boxcars | Yo) = Pr(Boxcars ∩ Yo) / Pr(Yo)   (B.43)
                 = 0 / (2/36)   (B.44)
                 = 0   (B.45)

11. The events in (e) are independent.


12. Statements (a), (c), (e), (g), and (i) are true.
13. Statements (b) and (c) are true.
14. Statements (a), (d), (e), (f), and (h) are true.
15. Statements (a) and (b) are true.
16. Statements (b) and (c) are true.

4 Introduction to random variables

Click here to see the problems

1. We can define 𝑡 = 𝑟 + 𝑤.
2. We can define 𝑦 = 𝐼(𝑟 + 𝑤 = 11).
3. The support of a random variable is the set of all values with positive
probability.
a. The support of 𝑟 is 𝑆𝑟 = {1, 2, 3, 4, 5, 6}.
b. The support of 𝑡 is 𝑆𝑡 = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
c. The support of 𝑦 is 𝑆𝑦 = {0, 1}.
4. The range of a random variable is just the interval defined by the support’s
minimum and maximum values.
a. The range of 𝑟 is [1, 6].
b. The range of 𝑡 is [2, 12].
c. The range of 𝑦 is [0, 1].
5. The PDF can be derived from the probability distribution of the underlying
outcome (𝑟, 𝑤), which was calculated in the previous chapter.

a. The PDF of 𝑟 is:

f_r(a) = 1/6 if a ∈ {1, 2, 3, 4, 5, 6}, and 0 otherwise.   (B.46)

b. The PDF of 𝑡 is:

f_t(a) = 1/36           if a ∈ {2, 12}
         2/36 (or 1/18) if a ∈ {3, 11}
         3/36 (or 1/12) if a ∈ {4, 10}
         4/36 (or 1/9)  if a ∈ {5, 9}
         5/36           if a ∈ {6, 8}
         6/36 (or 1/6)  if a = 7
         0              otherwise   (B.47)

c. The PDF of 𝑦 is:

f_y(a) = 17/18 if a = 0, 1/18 if a = 1, and 0 otherwise.   (B.48)

6. We can construct the CDF by cumulatively summing up the PDF:



a. The CDF of 𝑟 is:

F_r(a) = 0    if a < 1
         1/6  if 1 ≤ a < 2
         1/3  if 2 ≤ a < 3
         1/2  if 3 ≤ a < 4
         2/3  if 4 ≤ a < 5
         5/6  if 5 ≤ a < 6
         1    if a ≥ 6   (B.49)

b. The CDF of 𝑦 is:

F_y(a) = 0     if a < 0
         17/18 if 0 ≤ a < 1
         1     if a ≥ 1   (B.50)

7. We find the expected value by applying the definition:

a. The expected value is

E(r) = ∑_{a∈S_r} a Pr(r = a)   (B.51)
     = 1 ∗ (1/6) + 2 ∗ (1/6) + 3 ∗ (1/6) + 4 ∗ (1/6) + 5 ∗ (1/6) + 6 ∗ (1/6)   (B.52)
     = 21/6   (B.53)
     = 3.5   (B.54)

b. The expected value is

E(r²) = ∑_{a∈S_r} a² Pr(r = a)   (B.55)
      = 1² ∗ (1/6) + 2² ∗ (1/6) + 3² ∗ (1/6) + 4² ∗ (1/6) + 5² ∗ (1/6) + 6² ∗ (1/6)   (B.56)
      = 91/6   (B.57)
      = 15.17   (B.58)

8. The key here is to write down the definition of the specific quantile you
are looking for, then substitute the CDF you derived earlier.
a. The median of 𝑟 is:

𝐹𝑟−1 (0.5) = min{𝑎 ∶ Pr(𝑟 ≤ 𝑎) ≥ 0.5} (B.59)


= min{𝑎 ∶ 𝐹𝑟 (𝑎) ≥ 0.5} (B.60)
= min{𝑎 ∶ 𝑎 ≥ 3} (B.61)
=3 (B.62)

b. The 0.25 quantile of 𝑟 is

𝐹𝑟−1 (0.25) = min{𝑎 ∶ Pr(𝑟 ≤ 𝑎) ≥ 0.25} (B.63)


= min{𝑎 ∶ 𝐹𝑟 (𝑎) ≥ 0.25} (B.64)
= min{𝑎 ∶ 𝑎 ≥ 2} (B.65)
=2 (B.66)

c. The 75th percentile of 𝑟 is just its 0.75 quantile:

𝐹𝑟−1 (0.75) = min{𝑎 ∶ Pr(𝑟 ≤ 𝑎) ≥ 0.75} (B.67)


= min{𝑎 ∶ 𝐹𝑟 (𝑎) ≥ 0.75} (B.68)
= min{𝑎 ∶ 𝑎 ≥ 5} (B.69)
=5 (B.70)

9. Let 𝑑 = (𝑦 − 𝐸(𝑦))²

a. We can derive the PDF of 𝑑 from the PDF of 𝑦:

f_d(a) = 17/18 if a = (0 − 1/18)²
         1/18  if a = (1 − 1/18)²
         0     otherwise   (B.71)

b. The expected value is:

E(d) = (0 − 1/18)² ∗ (17/18) + (1 − 1/18)² ∗ (1/18)   (B.72)
     ≈ 0.0525   (B.73)

c. The variance is:

var(y) = E(d)   (B.74)
       ≈ 0.0525   (B.75)

d. The standard deviation is

sd(y) = √var(y) ≈ √0.0525 ≈ 0.229

10. Earlier, you found E(r) = 3.5 and E(r²) = 15.17. So we can apply our
result that var(x) = E(x²) − E(x)² for a simpler way of calculating the
variance.

a. The variance is:

var(r) = E(r²) − E(r)²   (B.76)
       = 15.17 − 3.5²   (B.77)
       = 2.92   (B.78)

b. The standard deviation is:

sd(r) = √var(r) = √2.92 ≈ 1.71   (B.79)

11. The Bernoulli distribution describes any random variable (like 𝑦) that has
a binary {0, 1} support.
a. 𝑦 has the Bernoulli(𝑝) distribution with 𝑝 = 1/18, or

𝑦 ∼ Bernoulli(1/18)

b. In the Bernoulli distribution E(y) = p = 1/18.
c. In the Bernoulli distribution, var(y) = p ∗ (1 − p) = (1/18) ∗ (17/18) ≈
0.0525.
12. The binomial distribution describes any random variable that describes
the number of times an event happens in a fixed number of independent
trials.
a. 𝑌10 has the binomial distribution with 𝑛 = 10 and 𝑝 = 1/18. We can
also write that as

𝑌10 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(10, 1/18)

b. For the Binomial distribution we have

𝐸(𝑌10 ) = 𝑛𝑝 = 10 ∗ 1/18 ≈ 0.556

c. For the binomial distribution we have

𝑣𝑎𝑟(𝑌10 ) = 𝑛𝑝(1 − 𝑝) = 10 ∗ 1/18 ∗ 17/18 ≈ 0.525

d. The Excel formula for Pr(𝑌10 = 0) would be =BINOM.DIST(0,10,1/18,FALSE)


which produces the result

Pr(𝑌10 = 0) ≈ 0.565

e. The Excel formula for Pr(𝑌10 ≤ 10/16) would be =BINOM.DIST(10/16,10,1/18,TRUE)


which produces the result

Pr(𝑌10 ≤ 10/16) ≈ 0.565

f. The Excel formula for Pr(𝑌10 > 10/16) would be =1-BINOM.DIST(10/16,10,1/18,TRUE)


which produces the result

Pr(𝑌10 > 10/16) ≈ 0.435

13. The key here is to apply the formulas for the expected value and variance
of a linear function of a random variable.

a. The expected value is:

𝐸(𝑊 ) = 𝐸(160 ∗ 𝑦 − 10) (B.80)


= 160 ∗ 𝐸(𝑦) − 10 (B.81)
= 160 ∗ 1/18 − 10 (B.82)
≈ −1.11 (B.83)

b. The variance is:

var(W) = var(160 ∗ y − 10)   (B.84)
       = 160² ∗ var(y)   (B.85)
       = 160² ∗ (1/18) ∗ (17/18)   (B.86)
       ≈ 1343.2   (B.87)

c. The event probability is

Pr(𝑊 > 0) = Pr(𝑦 = 1) = 1/18 ≈ 0.056 (B.88)

14. Applying the formulas for a linear function:

a. The expected value is:

𝐸(𝑊10 ) = 𝐸(16 ∗ 𝑌10 − 10) (B.89)


= 16 ∗ 𝐸(𝑌10 ) − 10 (B.90)
= 16 ∗ 10 ∗ 1/18 − 10 (B.91)
≈ −1.11 (B.92)

b. The variance is:

var(W₁₀) = var(16 ∗ Y₁₀ − 10)   (B.93)
         = 16² ∗ var(Y₁₀)   (B.94)
         = 16² ∗ 10 ∗ (1/18) ∗ (17/18)   (B.95)
         ≈ 134.32   (B.96)

c. The event probability is

Pr(𝑊10 > 0) = Pr(𝑌10 > 10/16) ≈ 0.435 (B.97)

15. If you care mostly about expected net winnings, you should just keep your
$10 and not bet at all. This strategy has an expected net winnings of zero,
while the others both have negative expected net winnings.
16. Betting $1 on ten rolls produces the highest probability of walking away
from the table with more money than you started with.
17. Betting $10 on one roll produces more variable net winnings.

5 More on random variables


Click here to see the problems

1. The key here is to redefine joint events for 𝑦 and 𝑏 as events for the single
random variable 𝑡.
a. The joint PDF is:
𝑓𝑦,𝑏 (1, 1) = Pr(𝑦 = 1 ∩ 𝑏 = 1) (B.98)
= Pr(𝑡 = 11 ∩ 𝑡 = 12) (B.99)
= Pr(∅) (B.100)
=0 (B.101)
b. The joint PDF is:
𝑓𝑦,𝑏 (0, 1) = Pr(𝑦 ≠ 1 ∩ 𝑏 = 1) (B.102)
= Pr(𝑡 ≠ 11 ∩ 𝑡 = 12) (B.103)
= Pr(𝑡 = 12) (B.104)
= 𝑓𝑡 (12) (B.105)
= 1/36 (B.106)
c. The joint PDF is:
𝑓𝑦,𝑏 (1, 0) = Pr(𝑦 = 1 ∩ 𝑏 = 0) (B.107)
= Pr(𝑡 = 11 ∩ 𝑡 ≠ 12) (B.108)
= Pr(𝑡 = 11) (B.109)
= 𝑓𝑡 (11) (B.110)
= 1/18 (B.111)
d. The joint PDF is:
𝑓𝑦,𝑏 (0, 0) = Pr(𝑦 = 0 ∩ 𝑏 = 0) (B.112)
= Pr(𝑡 ≠ 11 ∩ 𝑡 ≠ 12) (B.113)
= Pr(𝑡 ∉ {11, 12}) (B.114)
= 1 − Pr(𝑡 ∈ {11, 12}) (B.115)
= 1 − 1/36 − 1/18 (B.116)
= 11/12 (B.117)
2. The marginal PDF is constructed by adding together the joint PDFs.
a. The marginal PDF is:
𝑓𝑏 (0) = 𝑓𝑦,𝑏 (0, 0) + 𝑓𝑦,𝑏 (1, 0) (B.118)
= 11/12 + 1/18 (B.119)
= 35/36 (B.120)

b. The marginal PDF is:

𝑓𝑏 (1) = 𝑓𝑦,𝑏 (0, 1) + 𝑓𝑦,𝑏 (1, 1) (B.121)


= 1/36 + 0 (B.122)
= 1/36 (B.123)

c. The expected value is:

𝐸(𝑏) = 0 ∗ 𝑓𝑏 (0) + 1 ∗ 𝑓𝑏 (1) (B.124)


= 𝑓𝑏 (1) (B.125)
= 1/36 (B.126)

3. The conditional PDF is the ratio of the joint PDF to the marginal PDF:

a. The conditional PDF is:

f_{y|b}(1, 1) = Pr(y = 1 | b = 1)   (B.127)
              = f_{y,b}(1, 1) / f_b(1)   (B.128)
              = 0 / (1/36)   (B.129)
              = 0   (B.130)

b. The conditional PDF is:

f_{y|b}(0, 1) = Pr(y = 0 | b = 1)   (B.131)
              = f_{y,b}(0, 1) / f_b(1)   (B.132)
              = (1/36) / (1/36)   (B.133)
              = 1   (B.134)

c. The conditional PDF is:

f_{y|b}(1, 0) = Pr(y = 1 | b = 0)   (B.135)
              = f_{y,b}(1, 0) / f_b(0)   (B.136)
              = (1/18) / (35/36)   (B.137)
              ≈ 0.057   (B.138)

d. The conditional PDF is:

f_{y|b}(0, 0) = Pr(y = 0 | b = 0)   (B.139)
              = f_{y,b}(0, 0) / f_b(0)   (B.140)
              = (11/12) / (35/36)   (B.141)
              ≈ 0.943   (B.142)

4. 𝑟 and 𝑤 are independent.


5. Since 𝐸(𝑦) = 1/18 and 𝐸(𝑏) = 1/36 we have

𝑐𝑜𝑣(𝑦, 𝑏) = 𝐸((𝑦 − 𝐸(𝑦))(𝑏 − 𝐸(𝑏))) (B.143)


= (0 − 𝐸(𝑦)) ∗ (0 − 𝐸(𝑏)) ∗ 𝑓𝑦,𝑏 (0, 0) (B.144)
+ (1 − 𝐸(𝑦)) ∗ (0 − 𝐸(𝑏)) ∗ 𝑓𝑦,𝑏 (1, 0)
+ (0 − 𝐸(𝑦)) ∗ (1 − 𝐸(𝑏)) ∗ 𝑓𝑦,𝑏 (0, 1)
+ (1 − 𝐸(𝑦)) ∗ (1 − 𝐸(𝑏)) ∗ 𝑓𝑦,𝑏 (1, 1)
= (0 − 1/18) ∗ (0 − 1/36) ∗ (11/12) (B.145)
+ (1 − 1/18) ∗ (0 − 1/36) ∗ (1/18)
+ (0 − 1/18) ∗ (1 − 1/36) ∗ (1/36)
+ (1 − 1/18) ∗ (1 − 1/36) ∗ (0)
≈ −0.00154 (B.146)

6. The alternate formula for the covariance is cov(y, b) = E(yb) − E(y)E(b).

a. The expected value is:

E(yb) = 0 ∗ 0 ∗ f_{y,b}(0, 0) + 1 ∗ 0 ∗ f_{y,b}(1, 0)
        + 0 ∗ 1 ∗ f_{y,b}(0, 1) + 1 ∗ 1 ∗ f_{y,b}(1, 1)   (B.147)
      = f_{y,b}(1, 1)   (B.148)
      = 0   (B.149)

b. The covariance is:

cov(y, b) = E(yb) − E(y)E(b)   (B.150)
          = 0 − (1/18) ∗ (1/36)   (B.151)
          ≈ −0.00154   (B.152)

c. Yes, if you have done it right you should get the same answer.

7. The correlation is:

corr(y, b) = cov(y, b) / √(var(y) ∗ var(b))   (B.153)
           ≈ −0.00154 / √(0.0525 ∗ 0.027)   (B.154)
           ≈ −0.04   (B.155)

8. In this question we know the correlation and several values of the formula
defining it, and we use this to solve for the missing value.

a. We know that corr(b, t) = cov(b, t) / √(var(b) ∗ var(t)), so we can substitute
known values to get:

0.35 = cov(b, t) / √(0.027 ∗ 5.83)   (B.156)

Solving for cov(b, t) we get

cov(b, t) = 0.35 ∗ √(0.027 ∗ 5.83)   (B.157)
          = 0.139   (B.158)

b. We know that cov(b, t) = E(bt) − E(b)E(t), so we can substitute
known values to get:

0.139 = E(bt) − (1/36) ∗ 7   (B.159)

Solving for E(bt) we get:

E(bt) = 0.139 + (1/36) ∗ 7   (B.160)
      = 0.333   (B.161)

9. Remember that independent events are also uncorrelated. So:


a. The covariance is 𝑐𝑜𝑣(𝑟, 𝑤) = 0.
b. The correlation is 𝑐𝑜𝑟𝑟(𝑟, 𝑤) = 0.
10. This question uses the formulas for the mean of a linear function of two
random variables.
a. The expected value is:

𝐸(𝑦 + 𝑏) = 𝐸(𝑦) + 𝐸(𝑏) (B.162)


= 1/18 + 1/36 (B.163)
= 3/36 (B.164)
= 1/12 (B.165)

b. The expected value is:


𝐸(16𝑦 + 31𝑏 − 2) = 16𝐸(𝑦) + 31𝐸(𝑏) − 2 (B.166)
= 16/18 + 31/36 − 2 (B.167)
= −9/36 (B.168)
11. This question uses the formulas for the variance and covariance of a linear
function of two random variables.
a. The covariance is:
𝑐𝑜𝑣(16𝑦, 31𝑏) = 16 ∗ 31 ∗ 𝑐𝑜𝑣(𝑦, 𝑏) (B.169)
≈ 16 ∗ 31 ∗ (−0.00154) (B.170)
≈ −0.7638 (B.171)
b. The variance is:
𝑣𝑎𝑟(𝑦 + 𝑏) = 𝑣𝑎𝑟(𝑦) + 𝑣𝑎𝑟(𝑏) + 2𝑐𝑜𝑣(𝑦, 𝑏) (B.172)
≈ 0.0525 + 0.027 + 2 ∗ (−0.00154) (B.173)
≈ 0.07642 (B.174)
12. The result of a bet on Boxcars is negatively related to the result of a bet
on Yo.
13. This question uses various formulas for the uniform distribution.

a. The PDF is:

f_x(a) = 0 if a < −1, 0.5 if −1 ≤ a ≤ 1, and 0 if a > 1.   (B.175)

b. The CDF is:

F_x(a) = 0 if a < −1, (a + 1)/2 if −1 ≤ a ≤ 1, and 1 if a > 1.   (B.176)

c. 𝑥 is a continuous random variable so Pr(x = 0) = 0.
d. 𝑥 is a continuous random variable so

Pr(0 < x < 0.5) = F_x(0.5) − F_x(0)   (B.177)
                = (0.5 + 1)/2 − 1/2   (B.178)
                = 0.25   (B.179)

e. 𝑥 is a continuous random variable so

Pr(0 ≤ x ≤ 0.5) = F_x(0.5) − F_x(0)   (B.180)
                = (0.5 + 1)/2 − 1/2   (B.181)
                = 0.25   (B.182)

f. The median is:

F_x⁻¹(0.5) = min{a ∶ Pr(x ≤ a) ≥ 0.5}   (B.183)
           = min{a ∶ F_x(a) ≥ 0.5}   (B.184)
           = min{a ∶ (a + 1)/2 ≥ 0.5}   (B.185)
           = the a that solves the equation (a + 1)/2 = 0.5   (B.186)
           = 0   (B.187)

g. The 75th percentile is

F_x⁻¹(0.75) = min{a ∶ Pr(x ≤ a) ≥ 0.75}   (B.188)
            = min{a ∶ F_x(a) ≥ 0.75}   (B.189)
            = min{a ∶ (a + 1)/2 ≥ 0.75}   (B.190)
            = the a that solves the equation (a + 1)/2 = 0.75   (B.191)
            = 0.5   (B.192)

h. The mean of a Uniform(a, b) random variable is (a + b)/2, so

E(x) = (−1 + 1)/2   (B.193)
     = 0   (B.194)

i. The variance of a Uniform(a, b) random variable is (b − a)²/12, so

var(x) = (b − a)²/12   (B.195)
       = (1 − (−1))²/12   (B.196)
       ≈ 0.33   (B.197)

14. A linear function of a uniform random variable is also uniform.


a. When 𝑥 = −1, we have 𝑦 = 3 ∗ (−1) + 5 = 2. When 𝑥 = 1, we have
𝑦 = 3 ∗ (1) + 5 = 8. Therefore 𝑦 ∼ 𝑈 (2, 8)
b. The expected value is 𝐸(𝑦) = (2 + 8)/2 = 5.
15. This question applies various formulas for the normal distribution.
a. The expected value of a N(μ, σ²) random variable is μ, so E(x) = 10.
b. The median of a N(μ, σ²) random variable is μ, so Median(x) = 10.
c. The variance of a N(μ, σ²) random variable is σ², so var(x) = 4.
d. The standard deviation is sd(x) = √var(x) = √4 = 2.
e. The Excel formula would be =NORM.DIST(11,10,2,TRUE) which
yields the result Pr(x ≤ 11) ≈ 0.691.

16. A linear function of a normal random variable is also normal.

a. The distribution of 𝑦 will be normal with mean 3E(x) + 5 = 35 and
variance 3² ∗ var(x) = 36, so y ∼ N(35, 36).
b. If x ∼ N(10, 4) then z = (x − 10)/2 ∼ N(0, 1).
c. Solving for 𝑥 in terms of 𝑧 we get x = 2z + 10:

Pr(x ≤ 11) = Pr(2z + 10 ≤ 11)   (B.198)
           = Pr(z ≤ (11 − 10)/2)   (B.199)
           = Pr(z ≤ 0.5)   (B.200)
           = Φ(0.5)   (B.201)

d. The correct Excel formula would be =NORM.S.DIST(0.5,TRUE) which
yields the result Pr(x ≤ 11) ≈ 0.691.

6 Basic data analysis with Excel


Click here to see the problems

1. The answers are:

a. Sample size is 3
b. Sample average age is 45.
c. Sample median of age is 32.
d. Sample 25th percentile of age is 28.5.
e. Sample variance of age is 829.
f. Sample standard deviation of age is 28.8.

2. The numerical answers are the same as in question 1. Assuming the table
starts in cell A1, the Excel formulas are:

a. =COUNT(B2:B4)
b. =AVERAGE(B2:B4)
c. =MEDIAN(B2:B4)
d. =PERCENTILE.INC(B2:B4,0.25)
e. =VAR.S(B2:B4)
f. =STDEV.S(B2:B4)

3. The binned frequency table should look something like this:

Range   Frequency   % Freq
0-10    0           0
11-20   0           0
21-30   1           33
31-40   1           33
41-50   0           0
51-60   0           0
61-70   0           0
71-80   1           33

4. The time series graph should look something like this:

5. The histogram should look something like this:



7 Statistics

Click here to see the problems

1. A random sample needs to be independent (not just uncorrelated) and


identically distributed (not just the same mean and variance).

a. Possibly a random sample.


b. Possibly a random sample.
c. Not a random sample.
d. Definitely a random sample.
e. Possibly a random sample.
f. Not a random sample.

2. Identify the sampling type (random sample, time series, stratified sample,
cluster sample, census, convenience sample) for each of the following data
sets.

a. This is a convenience sample.


b. This is a random sample.
c. This is a stratified sample.
d. This is a time series.
e. This is a census.

3. Since we have a random sample, the joint PDF is just a product of the
marginal PDFs.

a. The support is just the set of all possible combinations:

𝑆𝐷𝑛 = {(1, 1), (1, 2), (2, 1), (2, 2)}

b. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:

𝑓𝐷𝑛 (1, 1) = 𝑓𝑥 (1)𝑓𝑥 (1) (B.202)


= 0.4 ∗ 0.4 (B.203)
= 0.16 (B.204)

c. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:

𝑓𝐷𝑛 (2, 1) = 𝑓𝑥 (2)𝑓𝑥 (1) (B.205)


= 0.6 ∗ 0.4 (B.206)
= 0.24 (B.207)

d. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:

𝑓𝐷𝑛 (1, 2) = 𝑓𝑥 (1)𝑓𝑥 (2) (B.208)


= 0.4 ∗ 0.6 (B.209)
= 0.24 (B.210)

e. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:

𝑓𝐷𝑛 (2, 2) = 𝑓𝑥 (2)𝑓𝑥 (2) (B.211)


= 0.6 ∗ 0.6 (B.212)
= 0.36 (B.213)

4. This is a long question but we can make it easier by preparing a simple


table mapping each of the four possible values of 𝐷𝑛 into a value for each
statistic:

x₁   x₂   f_Dn(⋅)   f̂₁    x̄     σ̂²_x   σ̂_x    xmin   xmax
1    1    0.16      1     1     0      0      1      1
2    1    0.24      0.5   1.5   0.5    0.71   1      2
1    2    0.24      0.5   1.5   0.5    0.71   1      2
2    2    0.36      0     2     0      0      2      2

This will make our lives much easier.

a. The support is S_f̂₁ = {0, 0.5, 1} and the sampling distribution is:

f_f̂₁(0) = Pr(f̂₁ = 0)   (B.214)
        = f_Dn(2, 2)   (B.215)
        = 0.36   (B.216)
f_f̂₁(0.5) = Pr(f̂₁ = 0.5)   (B.217)
          = f_Dn(2, 1) + f_Dn(1, 2)   (B.218)
          = 0.24 + 0.24   (B.219)
          = 0.48   (B.220)
f_f̂₁(1) = Pr(f̂₁ = 1)   (B.221)
        = f_Dn(1, 1)   (B.222)
        = 0.16   (B.223)

b. The support is 𝑆𝑥̄ = {1, 1.5, 2} and the sampling distribution is:

𝑓𝑥̄ (1) = Pr(𝑥̄ = 1) (B.224)


= 𝑓𝐷𝑛 (1, 1) (B.225)
= 0.16 (B.226)
𝑓𝑥̄ (1.5) = Pr(𝑥̄ = 1.5) (B.227)
= 𝑓𝐷𝑛 (2, 1) + 𝑓𝐷𝑛 (1, 2) (B.228)
= 0.24 + 0.24 (B.229)
= 0.48 (B.230)
𝑓𝑥̄ (2) = Pr(𝑥̄ = 2) (B.231)
= 𝑓𝐷𝑛 (2, 2) (B.232)
= 0.36 (B.233)

c. The support is 𝑆𝜎̂ 𝑥2 = {0, 0.5} and the sampling distribution is:

𝑓𝜎̂ 𝑥2 (0) = Pr(𝜎̂𝑥2 = 0) (B.234)


= 𝑓𝐷𝑛 (1, 1) + 𝑓𝐷𝑛 (2, 2) (B.235)
= 0.16 + 0.36 (B.236)
= 0.52 (B.237)
𝑓𝜎̂ 𝑥2 (0.5) = Pr(𝜎̂𝑥2 = 0.5) (B.238)
= 𝑓𝐷𝑛 (2, 1) + 𝑓𝐷𝑛 (1, 2) (B.239)
= 0.24 + 0.24 (B.240)
= 0.48 (B.241)

d. The support is 𝑆𝜎̂𝑥 = {0, 0.71} and the sampling distribution is:

   𝑓𝜎̂𝑥 (0) = Pr(𝜎̂𝑥 = 0)                 (B.242)
           = 𝑓𝐷𝑛(1, 1) + 𝑓𝐷𝑛(2, 2)       (B.243)
           = 0.16 + 0.36                  (B.244)
           = 0.52                         (B.245)
   𝑓𝜎̂𝑥 (0.71) = Pr(𝜎̂𝑥 = 0.71)           (B.246)
              = 𝑓𝐷𝑛(2, 1) + 𝑓𝐷𝑛(1, 2)    (B.247)
              = 0.24 + 0.24               (B.248)
              = 0.48                      (B.249)

e. The support is 𝑆𝑥𝑚𝑖𝑛 = {1, 2} and the sampling distribution is:


𝑓𝑥𝑚𝑖𝑛 (1) = Pr(𝑥𝑚𝑖𝑛 = 1) (B.250)
= 𝑓𝐷𝑛 (1, 1) + 𝑓𝐷𝑛 (2, 1) + 𝑓𝐷𝑛 (1, 2) (B.251)
= 0.16 + 0.24 + 0.24 (B.252)
= 0.64 (B.253)
𝑓𝑥𝑚𝑖𝑛 (2) = Pr(𝑥𝑚𝑖𝑛 = 2) (B.254)
= 𝑓𝐷𝑛 (2, 2) (B.255)
= 0.36 (B.256)
f. The support is 𝑆𝑥𝑚𝑎𝑥 = {1, 2} and the sampling distribution is:
𝑓𝑥𝑚𝑎𝑥 (1) = Pr(𝑥𝑚𝑎𝑥 = 1) (B.257)
= 𝑓𝐷𝑛 (1, 1) (B.258)
= 0.16 (B.259)
𝑓𝑥𝑚𝑎𝑥 (2) = Pr(𝑥𝑚𝑎𝑥 = 2) (B.260)
= 𝑓𝐷𝑛 (2, 2) + 𝑓𝐷𝑛 (2, 1) + 𝑓𝐷𝑛 (1, 2) (B.261)
= 0.36 + 0.24 + 0.24 (B.262)
= 0.84 (B.263)
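
The same sampling distributions can be found by brute-force enumeration in R;
a sketch for the sample average 𝑥̄ (the other statistics work the same way):

fx <- c(0.4, 0.6)                                  # Pr(x = 1), Pr(x = 2)
samples <- expand.grid(x1 = 1:2, x2 = 1:2)         # all possible samples of size 2
samples$prob <- fx[samples$x1] * fx[samples$x2]    # random sample: product of marginals
samples$xbar <- (samples$x1 + samples$x2) / 2      # the statistic for each sample
aggregate(prob ~ xbar, data = samples, FUN = sum)  # sampling distribution of xbar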
5. The mean can be calculated from each statistic’s PDF:

a. The mean is:

   𝐸(𝑓1̂) = 0 ∗ 𝑓𝑓1̂ (0) + 0.5 ∗ 𝑓𝑓1̂ (0.5) + 1 ∗ 𝑓𝑓1̂ (1)   (B.264)
         = 0 ∗ 0.36 + 0.5 ∗ 0.48 + 1 ∗ 0.16                 (B.265)
         = 0.4                                              (B.266)
b. The mean is:
𝐸(𝑥)̄ = 1 ∗ 𝑓𝑥̄ (1) + 1.5 ∗ 𝑓𝑥̄ (1.5) + 2 ∗ 𝑓𝑥̄ (2) (B.267)
= 1 ∗ 0.16 + 1.5 ∗ 0.48 + 2 ∗ 0.36 (B.268)
= 1.6 (B.269)
c. The mean is:
𝐸(𝜎̂𝑥2 ) = 0 ∗ 𝑓𝜎̂ 𝑥2 (0) + 0.5 ∗ 𝑓𝜎̂ 𝑥2 (0.5) (B.270)
= 0 ∗ 0.52 + 0.5 ∗ 0.48 (B.271)
= 0.24 (B.272)
d. The mean is:
𝐸(𝜎̂𝑥 ) = 0 ∗ 𝑓𝜎̂ 𝑥 (0) + 0.71 ∗ 𝑓𝜎̂ 𝑥 (0.71) (B.273)
= 0 ∗ 0.52 + 0.71 ∗ 0.48 (B.274)
= 0.34 (B.275)

e. The mean is:

𝐸(𝑥𝑚𝑖𝑛) = 1 ∗ 𝑓𝑥𝑚𝑖𝑛 (1) + 2 ∗ 𝑓𝑥𝑚𝑖𝑛 (2) (B.276)


= 1 ∗ 0.64 + 2 ∗ 0.36 (B.277)
= 1.36 (B.278)

f. The mean is:

𝐸(𝑥𝑚𝑎𝑥) = 1 ∗ 𝑓𝑥𝑚𝑎𝑥 (1) + 2 ∗ 𝑓𝑥𝑚𝑎𝑥 (2) (B.279)


= 1 ∗ 0.16 + 2 ∗ 0.84 (B.280)
= 1.84 (B.281)

6. We have already found the mean of each statistic, so we can calculate the
variance using the alternative formula 𝑣𝑎𝑟(𝑥) = 𝐸(𝑥²) − 𝐸(𝑥)².

a. To calculate the variance, we can use the alternative formula:

   𝐸(𝑓1̂²) = 0² ∗ 𝑓𝑓1̂ (0) + 0.5² ∗ 𝑓𝑓1̂ (0.5) + 1² ∗ 𝑓𝑓1̂ (1)   (B.282)
          = 0 ∗ 0.36 + 0.25 ∗ 0.48 + 1 ∗ 0.16                   (B.283)
          = 0.28                                                (B.284)
   𝑣𝑎𝑟(𝑓1̂) = 𝐸(𝑓1̂²) − 𝐸(𝑓1̂)²                                  (B.285)
           = 0.28 − 0.4²                                        (B.286)
           = 0.12                                               (B.287)

b. To calculate the variance, we can use the alternative formula:

   𝐸(𝑥̄²) = 1² ∗ 𝑓𝑥̄ (1) + 1.5² ∗ 𝑓𝑥̄ (1.5) + 2² ∗ 𝑓𝑥̄ (2)   (B.288)
          = 1 ∗ 0.16 + 2.25 ∗ 0.48 + 4 ∗ 0.36                 (B.289)
          = 2.68                                              (B.290)
   𝑣𝑎𝑟(𝑥̄) = 𝐸(𝑥̄²) − 𝐸(𝑥̄)²                                  (B.291)
           = 2.68 − 1.6²                                      (B.292)
           = 0.12                                             (B.293)

c. To calculate the variance, we can use the alternative formula:

   𝐸(𝑥𝑚𝑖𝑛²) = 1² ∗ 𝑓𝑥𝑚𝑖𝑛 (1) + 2² ∗ 𝑓𝑥𝑚𝑖𝑛 (2)   (B.294)
             = 1 ∗ 0.64 + 4 ∗ 0.36                  (B.295)
             = 2.08                                 (B.296)
   𝑣𝑎𝑟(𝑥𝑚𝑖𝑛) = 𝐸(𝑥𝑚𝑖𝑛²) − 𝐸(𝑥𝑚𝑖𝑛)²               (B.297)
              = 2.08 − 1.36²                        (B.298)
              = 0.23                                (B.299)

d. To calculate the variance, we can use the alternative formula:

   𝐸(𝑥𝑚𝑎𝑥²) = 1² ∗ 𝑓𝑥𝑚𝑎𝑥 (1) + 2² ∗ 𝑓𝑥𝑚𝑎𝑥 (2)   (B.300)
             = 1 ∗ 0.16 + 4 ∗ 0.84                  (B.301)
             = 3.52                                 (B.302)
   𝑣𝑎𝑟(𝑥𝑚𝑎𝑥) = 𝐸(𝑥𝑚𝑎𝑥²) − 𝐸(𝑥𝑚𝑎𝑥)²               (B.303)
              = 3.52 − 1.84²                        (B.304)
              = 0.134                               (B.305)

7. All three of these statistics are simple linear functions of the data, so their
mean and variance can be calculated without using the PDF:

a. The mean and variance are already given to us:

𝐸(𝑥1 ) = 1.6 (B.306)


𝑣𝑎𝑟(𝑥1 ) = 0.24 (B.307)

b. The mean and variance can be found using the standard formulas for
the sample average:

𝐸(𝑥)̄ = 𝐸(𝑥𝑖 ) (B.308)


= 1.6 (B.309)
𝑣𝑎𝑟(𝑥)̄ = 𝑣𝑎𝑟(𝑥𝑖 )/2 (B.310)
= 0.24/2 (B.311)
= 0.12 (B.312)

c. The mean and variance can be found using the standard formula for
a linear function of two random variables:

𝐸(𝑤) = 𝐸(0.2 ∗ 𝑥1 + 0.8 ∗ 𝑥2 ) (B.313)


= 0.2 ∗ 𝐸(𝑥1 ) + 0.8 ∗ 𝐸(𝑥2 ) (B.314)
= 0.2 ∗ 1.6 + 0.8 ∗ 1.6 (B.315)
= 1.6 (B.316)
𝑣𝑎𝑟(𝑤) = 𝑣𝑎𝑟(0.2 ∗ 𝑥1 + 0.8 ∗ 𝑥2)                                         (B.317)
        = 0.2² ∗ 𝑣𝑎𝑟(𝑥1) + 0.8² ∗ 𝑣𝑎𝑟(𝑥2) + 2 ∗ 0.2 ∗ 0.8 ∗ 𝑐𝑜𝑣(𝑥1, 𝑥2)   (B.318)
        = 0.2² ∗ 0.24 + 0.8² ∗ 0.24 + 2 ∗ 0.2 ∗ 0.8 ∗ 0                     (B.319)
        = 0.16                                                              (B.320)
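
As a sanity check, we can verify this mean and variance by simulation in R
(the results are approximate, since simulation error never vanishes entirely):

set.seed(1)
x1 <- sample(c(1, 2), 1e5, replace = TRUE, prob = c(0.4, 0.6))
x2 <- sample(c(1, 2), 1e5, replace = TRUE, prob = c(0.4, 0.6))
w <- 0.2 * x1 + 0.8 * x2
c(mean(w), var(w)) # approximately 1.6 and 0.16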

8. Unknown parameters: 𝜇, 𝜎², 𝐸(𝑥𝑖), 𝐸(𝑥𝑖³), 𝑣𝑎𝑟(𝑥𝑖), and 𝑠𝑑(𝑥𝑖)/√𝑛.
Statistics: 𝐷𝑛, 𝑥̄.

9. We can find the true values directly from the PDF 𝑓𝑥 (⋅):

a. The true probability is: Pr(𝑥𝑖 = 1) = 0.4


b. The population mean is: 𝐸(𝑥𝑖) = 1 ∗ 0.4 + 2 ∗ 0.6 = 1.6
c. The population variance is: 𝑣𝑎𝑟(𝑥𝑖) = (1 − 1.6)² ∗ 0.4 + (2 − 1.6)² ∗ 0.6 = 0.24
d. The population standard deviation is: 𝑠𝑑(𝑥𝑖) = √𝑣𝑎𝑟(𝑥𝑖) = √0.24 ≈ 0.49
e. The population minimum is: min(𝑆𝑥) = 1
f. The population maximum is: max(𝑆𝑥) = 2

10. Since max(𝑆𝑥 ) = 2, the error is either 𝑒𝑟𝑟 = 1−2 = −1 or 𝑒𝑟𝑟 = 2−2 = 0.

a. The sampling error has support 𝑆𝑒𝑟𝑟 = {−1, 0}


b. The sampling error has PDF:
𝑓𝑒𝑟𝑟 (−1) = 𝑓𝐷𝑛 (1, 1) (B.321)
= 0.16 (B.322)
𝑓𝑒𝑟𝑟 (0) = 𝑓𝐷𝑛 (2, 1) + 𝑓𝐷𝑛 (1, 2) + 𝑓𝐷𝑛 (2, 2) (B.323)
= 0.24 + 0.24 + 0.36 (B.324)
= 0.84 (B.325)
11. We have already calculated the mean of each statistic and the true value
of each parameter, so we can calculate the bias and then use the result to
classify each estimator:

a. Unbiased
𝑏𝑖𝑎𝑠(𝑓1̂ ) = 𝐸(𝑓1̂ ) − Pr(𝑥𝑖 = 1) (B.326)
= 0.4 − 0.4 (B.327)
=0 (B.328)
b. Unbiased
𝑏𝑖𝑎𝑠(𝑥)̄ = 𝐸(𝑥)̄ − 𝐸(𝑥𝑖 ) (B.329)
= 1.6 − 1.6 (B.330)
=0 (B.331)
c. Unbiased
𝑏𝑖𝑎𝑠(𝜎̂𝑥2 ) = 𝐸(𝜎̂𝑥2 ) − 𝑣𝑎𝑟(𝑥𝑖 ) (B.332)
= 0.24 − 0.24 (B.333)
=0 (B.334)
d. Biased
𝑏𝑖𝑎𝑠(𝜎̂𝑥 ) = 𝐸(𝜎̂𝑥 ) − 𝑠𝑑(𝑥𝑖 ) (B.335)
= 0.34 − 0.49 (B.336)
= −0.15 (B.337)

e. Biased

𝑏𝑖𝑎𝑠(𝑥𝑚𝑖𝑛) = 𝐸(𝑥𝑚𝑖𝑛) − min(𝑆𝑥 ) (B.338)


= 1.36 − 1 (B.339)
= 0.36 (B.340)

f. Biased

𝑏𝑖𝑎𝑠(𝑥𝑚𝑎𝑥) = 𝐸(𝑥𝑚𝑎𝑥) − max(𝑆𝑥 ) (B.341)


= 1.84 − 2 (B.342)
= −0.16 (B.343)

12. Remember that the expected value passes through linear functions, but
not nonlinear functions:

a. It is unbiased, as 𝐸(𝑦̄𝑀 − 𝑦̄𝑊) = 𝐸(𝑦̄𝑀) − 𝐸(𝑦̄𝑊) = 𝜇𝑀 − 𝜇𝑊.

b. It is biased, as 𝐸(𝑦̄𝑀/𝑦̄𝑊) ≠ 𝐸(𝑦̄𝑀)/𝐸(𝑦̄𝑊) = 𝜇𝑀/𝜇𝑊.

13. We have already calculated the variance and bias, so we can apply the
formula 𝑀𝑆𝐸 = 𝑣𝑎𝑟 + 𝑏𝑖𝑎𝑠²:

a. The MSE is:

   𝑀𝑆𝐸(𝑓1̂) = 𝑣𝑎𝑟(𝑓1̂) + 𝑏𝑖𝑎𝑠(𝑓1̂)²   (B.344)
            = 0.12 + 0²               (B.345)
            = 0.12                    (B.346)

b. The MSE is:

   𝑀𝑆𝐸(𝑥̄) = 𝑣𝑎𝑟(𝑥̄) + 𝑏𝑖𝑎𝑠(𝑥̄)²   (B.347)
           = 0.12 + 0²              (B.348)
           = 0.12                   (B.349)

c. The MSE is:

   𝑀𝑆𝐸(𝑥𝑚𝑖𝑛) = 𝑣𝑎𝑟(𝑥𝑚𝑖𝑛) + 𝑏𝑖𝑎𝑠(𝑥𝑚𝑖𝑛)²   (B.350)
              = 0.23 + 0.36²                  (B.351)
              = 0.36                          (B.352)

d. The MSE is:

   𝑀𝑆𝐸(𝑥𝑚𝑎𝑥) = 𝑣𝑎𝑟(𝑥𝑚𝑎𝑥) + 𝑏𝑖𝑎𝑠(𝑥𝑚𝑎𝑥)²   (B.353)
              = 0.13 + (−0.16)²               (B.354)
              = 0.16                          (B.355)

14. Again, these estimators are both linear functions of the data, so their
mean and variance are easy to calculate.

a. Both estimators are unbiased.
b. The variance is 𝑣𝑎𝑟(𝑥̄) = 𝜎²/2
c. The variance is 𝑣𝑎𝑟(𝑥2) = 𝜎²
d. The MSE is 𝑀𝑆𝐸(𝑥̄) = 𝜎²/2
e. The MSE is 𝑀𝑆𝐸(𝑥2) = 𝜎²
f. The estimator 𝑥̄ is preferred under the MVUE criterion.
g. The estimator 𝑥̄ is preferred under the MSE criterion.

15. The standard error is:

    𝑠𝑒(𝑥̄) = 𝜎̂/√𝑛      (B.356)
           = √4/√100    (B.357)
           = 0.2        (B.358)

16. Both the sample average (a) and the average of all even-numbered obser-
vations (d) are consistent estimators of 𝜇, because they keep using more
and more information as the sample size increases. In contrast, the esti-
mators based on the first observation and on the first 100 observations
are both unbiased, but they do not change as the sample size increases,
so the LLN does not apply and those estimators are not consistent. The
LLN does apply to the sample median because of Slutsky's theorem, but
that makes it a consistent estimator of the population median, which may
or may not be equal to the population mean.
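
To see consistency in action, here is a minimal simulation sketch (the
𝑁(5, 1) data-generating process is an assumption chosen purely for
illustration):

set.seed(123)
x <- rnorm(1e5, mean = 5)                          # assumed DGP: x ~ N(5, 1)
sapply(c(10, 1000, 1e5), function(n) mean(x[1:n])) # sample average approaches 5
x[1]                                               # first-observation estimator never changes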

8 Statistical inference
Click here to see the problems

1. The null is:

   𝐻0 ∶ 𝛽 = 0

   and the alternative is:

   𝐻1 ∶ 𝛽 ≠ 0
2. Test statistics need to possess characteristics (a), (d), and (e).
3. In this setting:

   a. 𝑥𝑖 must have a normal distribution.
   b. The asymptotic distribution is 𝑁(0, 1).

4. The expression is:

   𝑡 = (𝑥̄ − 𝜇1)/(𝜎̂𝑥/√𝑛) + (𝜇1 − 𝜇0)/(𝜎̂𝑥/√𝑛)
5. This question is about calculating the size given the critical values.

a. The size is:

𝑠𝑖𝑧𝑒 = 𝐹𝑇13 (−1.96) + 1 − 𝐹𝑇13 (1.96) (B.359)


= 0.035894296 + 1 − 0.964105704 (B.360)
≈ 0.07 (B.361)

and the formula for calculating this in Excel is =T.DIST(-1.96,13,TRUE)


+1 - T.DIST(1.96,13,TRUE).
b. The size is:

𝑠𝑖𝑧𝑒 = Φ(−1.96) + 1 − Φ(1.96) (B.362)


= 0.024997895 + 1 − 0.975002105 (B.363)
≈ 0.05 (B.364)

and the formula for calculating this in Excel is =NORM.S.DIST(-1.96,TRUE)


+1 - NORM.S.DIST(1.96,TRUE).
c. The size is:

   𝑠𝑖𝑧𝑒 = 𝐹𝑇13(−3) + 1 − 𝐹𝑇13(2)         (B.365)
        ≈ 0.005119449 + 1 − 0.966579821   (B.366)
        ≈ 0.04                            (B.367)

   and the formula for calculating this in Excel is =T.DIST(-3,13,TRUE)
   +1 - T.DIST(2,13,TRUE).
d. The size is:

𝑠𝑖𝑧𝑒 = Φ(−3) + 1 − Φ(2) (B.368)


≈ 0.001349898 + 1 − 0.977249868 (B.369)
≈ 0.02 (B.370)

and the formula for calculating this in Excel is =NORM.S.DIST(-3,TRUE)


+1 - NORM.S.DIST(2,TRUE).
e. The size is:

𝑠𝑖𝑧𝑒 = 𝐹𝑇13 (−∞) + 1 − 𝐹𝑇13 (2) (B.371)


≈ 0 + 1 − 0.966579821 (B.372)
≈ 0.03 (B.373)

and the formula for calculating this in Excel is =1 - T.DIST(2,13,TRUE).


f. The size is:

𝑠𝑖𝑧𝑒 = Φ(−∞) + 1 − Φ(2) (B.374)


≈ 0 + 1 − 0.977249868 (B.375)
≈ 0.02 (B.376)

and the formula for calculating this in Excel is =1 - NORM.S.DIST(2,TRUE).
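
For reference, the same sizes can be computed in R, where pt() and pnorm()
play the roles of T.DIST and NORM.S.DIST:

pt(-1.96, df = 13) + 1 - pt(1.96, df = 13) # (a) about 0.07
pnorm(-1.96) + 1 - pnorm(1.96)             # (b) about 0.05
pt(-3, df = 13) + 1 - pt(2, df = 13)       # (c) about 0.04
pnorm(-3) + 1 - pnorm(2)                   # (d) about 0.02
1 - pt(2, df = 13)                         # (e) about 0.03
1 - pnorm(2)                               # (f) about 0.02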



6. This problem asks you to find the critical values that deliver a given size.
a. The critical values are:

   𝑐𝐿 = 𝐹⁻¹𝑇17(0.005)   (B.377)
      ≈ −2.90           (B.378)
   𝑐𝐻 = 𝐹⁻¹𝑇17(0.995)   (B.379)
      ≈ 2.90            (B.380)

   and the formulas for calculating this in Excel are =T.INV(0.005,17)
   and =T.INV(0.995,17).

b. The critical values are:

   𝑐𝐿 = 𝐹⁻¹𝑇17(0.025)   (B.381)
      ≈ −2.11           (B.382)
   𝑐𝐻 = 𝐹⁻¹𝑇17(0.975)   (B.383)
      ≈ 2.11            (B.384)

   and the formulas for calculating this in Excel are =T.INV(0.025,17)
   and =T.INV(0.975,17).

c. The critical values are:

   𝑐𝐿 = 𝐹⁻¹𝑇17(0.05)    (B.385)
      ≈ −1.74           (B.386)
   𝑐𝐻 = 𝐹⁻¹𝑇17(0.95)    (B.387)
      ≈ 1.74            (B.388)

   and the formulas for calculating this in Excel are =T.INV(0.05,17)
   and =T.INV(0.95,17).
d. The critical values are:

𝑐𝐿 = Φ−1 (0.005) (B.389)


≈ −2.58 (B.390)
𝑐𝐻 = Φ−1 (0.995) (B.391)
≈ 2.58 (B.392)

and the formulas for calculating this in Excel are =NORM.S.INV(0.005)


and =NORM.S.INV(0.995).
e. The critical values are:

𝑐𝐿 = Φ−1 (0.025) (B.393)


≈ −1.96 (B.394)
−1
𝑐𝐻 = Φ (0.975) (B.395)
≈ 1.96 (B.396)

and the formulas for calculating this in Excel are =NORM.S.INV(0.025)


and =NORM.S.INV(0.975).
f. The critical values are:

𝑐𝐿 = Φ−1 (0.05) (B.397)


≈ −1.65 (B.398)
−1
𝑐𝐻 = Φ (0.95) (B.399)
≈ 1.65 (B.400)

and the formulas for calculating this in Excel are =NORM.S.INV(0.05)


and =NORM.S.INV(0.95).
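
The equivalent R calculations use qt() and qnorm(), the inverse CDFs of the
t and standard normal distributions:

qt(c(0.005, 0.995), df = 17) # (a) about -2.90 and 2.90
qt(c(0.025, 0.975), df = 17) # (b) about -2.11 and 2.11
qt(c(0.05, 0.95), df = 17)   # (c) about -1.74 and 1.74
qnorm(c(0.005, 0.995))       # (d) about -2.58 and 2.58
qnorm(c(0.025, 0.975))       # (e) about -1.96 and 1.96
qnorm(c(0.05, 0.95))         # (f) about -1.65 and 1.65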
7. The confidence interval for a mean always follows the same formula.

a. The confidence interval is:

   𝐶𝐼⁹⁵ = 𝑥̄ ± 𝐹⁻¹𝑇15(0.975) ∗ 𝜎̂/√𝑛   (B.401)
        ≈ 4 ± 2.13 ∗ 0.3/√16           (B.402)
        ≈ (3.84, 4.16)                 (B.403)

   and the formulas for calculating these numbers in Excel are
   =4-T.INV(0.975,15)*0.3/SQRT(16) and =4+T.INV(0.975,15)*0.3/SQRT(16)

b. The confidence interval is:

   𝐶𝐼⁹⁰ = 𝑥̄ ± 𝐹⁻¹𝑇15(0.95) ∗ 𝜎̂/√𝑛    (B.404)
        ≈ 4 ± 1.75 ∗ 0.3/√16           (B.405)
        ≈ (3.87, 4.13)                 (B.406)

   and the formulas for calculating these numbers in Excel are
   =4-T.INV(0.95,15)*0.3/SQRT(16) and =4+T.INV(0.95,15)*0.3/SQRT(16)

c. The confidence interval is:

   𝐶𝐼⁹⁹ = 𝑥̄ ± 𝐹⁻¹𝑇15(0.995) ∗ 𝜎̂/√𝑛   (B.407)
        ≈ 4 ± 2.95 ∗ 0.3/√16           (B.408)
        ≈ (3.78, 4.22)                 (B.409)

   and the formulas for calculating these numbers in Excel are
   =4-T.INV(0.995,15)*0.3/SQRT(16) and =4+T.INV(0.995,15)*0.3/SQRT(16)
d. The confidence interval is:

   𝐶𝐼⁹⁵ = 𝑥̄ ± Φ⁻¹(0.975) ∗ 𝜎̂/√𝑛   (B.410)
        ≈ 4 ± 1.96 ∗ 0.3/√16        (B.411)
        ≈ (3.85, 4.15)              (B.412)

   and the formulas for calculating these numbers in Excel are
   =4-NORM.S.INV(0.975)*0.3/SQRT(16) and =4+NORM.S.INV(0.975)*0.3/SQRT(16)

e. The confidence interval is:

   𝐶𝐼⁹⁰ = 𝑥̄ ± Φ⁻¹(0.95) ∗ 𝜎̂/√𝑛   (B.413)
        ≈ 4 ± 1.64 ∗ 0.3/√16       (B.414)
        ≈ (3.88, 4.12)             (B.415)

   and the formulas for calculating these numbers in Excel are
   =4-NORM.S.INV(0.95)*0.3/SQRT(16) and =4+NORM.S.INV(0.95)*0.3/SQRT(16)

f. The confidence interval is:

   𝐶𝐼⁹⁹ = 𝑥̄ ± Φ⁻¹(0.995) ∗ 𝜎̂/√𝑛   (B.416)
        ≈ 4 ± 2.58 ∗ 0.3/√16        (B.417)
        ≈ (3.81, 4.19)              (B.418)

   and the formulas for calculating these numbers in Excel are
   =4-NORM.S.INV(0.995)*0.3/SQRT(16) and =4+NORM.S.INV(0.995)*0.3/SQRT(16)
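
In R, the same intervals can be computed directly; a sketch using the numbers
above (𝑥̄ = 4, 𝜎̂ = 0.3, 𝑛 = 16):

xbar <- 4; sigma_hat <- 0.3; n <- 16
xbar + c(-1, 1) * qt(0.975, df = n - 1) * sigma_hat / sqrt(n) # (a) 95% t interval
xbar + c(-1, 1) * qt(0.95, df = n - 1) * sigma_hat / sqrt(n)  # (b) 90% t interval
xbar + c(-1, 1) * qnorm(0.975) * sigma_hat / sqrt(n)          # (d) 95% normal interval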
8. The only thing we have tested here is whether there is no effect. We
rejected that null, so we can rule out the possibility that there is no effect.
But that does not tell us whether the effect is small or large; to say
anything about that question we would need to test other nulls, or construct
a confidence interval.
a. Probably false.
b. Probably true.
c. Possibly true.
9. The only thing we have tested here is whether there is no effect, and we
haven’t been able to rule out that possibility. So we can’t really rule
anything out.
a. Possibly true.
b. Possibly true.
c. Possibly true.
10. In effect, the confidence interval simultaneously reports the results of every
possible null hypothesis. So anything in the range (0.10, 0.40) cannot be
ruled out but anything outside of that range can be ruled out.
a. Probably false.
b. Probably true.
c. Probably true.
d. Probably false.

9 An introduction to R

1. The chapter describes how to do each of these.


2. Items (a), (b), (d), and (e) are valid R expressions.
3. The R code for each action is as follows:
   a. cookies <- c("oatmeal", "chocolate chip", "shortbread")
   b. threes <- seq(3, 100, by = 3)
   c. threes[5]
   d. threecookies <- list(cookies = cookies, threes = threes)
4. This code will load a built-in data set of automobile gas mileage, and
produce a graph that depicts the relationship between weight and miles
per gallon:

data("mtcars") # load data


ggplot(mtcars, aes(wt, mpg)) + geom_point(aes(colour = factor(cyl), size = qsec))

[Scatter plot of mpg against wt, with point colour given by factor(cyl) and point size by qsec]

10 Advanced data cleaning



1. The file formats are:


a. Fixed width
b. Space or tab delimited

c. CSV
2. Here are my descriptions; yours may be somewhat different:
   a. A crosswalk table is a data table we can use to translate variables that
      are expressed in one way into another way. For example, we might use a
      crosswalk table to translate country names into standardized country
      codes, or to translate postal codes into provinces.
   b. When we have two data tables that contain information on related
      cross-sectional units, we can combine their information into a single
      table by matching observations based on a variable that (a) exists in
      both tables and (b) connects the observations in some way.
   c. Aggregating data by groups allows us to group observations according to
      a common characteristic, and describe those groups using data calculated
      from the individual observations.
3. You can edit cell A1 under scenarios (a) and (c).
4. If you do this:
a. Nothing will happen, but you can ask Excel to mark invalid data.
b. Excel will not allow you to enter invalid data.
5. The R code will be something like this:

library("tidyverse")
deniro <- read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/deniro.csv")
## Rows: 87 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Title
## dbl (2): Year, Score
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(deniro)
## [1] 87
ncol(deniro)
## [1] 3

11 Using R

1. The R code to create this data table is as follows:



PPData <- EmpData %>%
  mutate(Year = format(MonthYr, "%Y")) %>%
  mutate(EmpRate = Employed/Population) %>%
  filter(Year >= 2010) %>%
  select(MonthYr, Year, EmpRate, UnempRate, AnnPopGrowth) %>%
  arrange(EmpRate)
print(PPData)

## # A tibble: 133 x 5
## MonthYr Year EmpRate UnempRate AnnPopGrowth
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2020-04-01 2020 0.521 0.131 0.0132
## 2 2020-05-01 2020 0.530 0.137 0.0124
## 3 2020-06-01 2020 0.560 0.125 0.0118
## 4 2020-07-01 2020 0.573 0.109 0.0109
## 5 2020-08-01 2020 0.580 0.102 0.0104
## 6 2020-03-01 2020 0.585 0.0789 0.0143
## 7 2021-01-01 2021 0.586 0.0941 0.00863
## 8 2020-09-01 2020 0.591 0.0918 0.00992
## 9 2020-12-01 2020 0.593 0.0876 0.00905
## 10 2020-10-01 2020 0.594 0.0902 0.00961
## # ... with 123 more rows

2. The R code is as follows. None of the variables have missing data.

### A: Mean employment rate


mean(PPData$EmpRate)

## [1] 0.610809

### B: Table of medians


PPData %>%
select(where(is.numeric)) %>%
lapply(median)

## $EmpRate
## [1] 0.6139724
##
## $UnempRate
## [1] 0.07082771
##
## $AnnPopGrowth
## [1] 0.01171851

3. The employment rate is a continuous variable, so the appropriate kind of


frequency table here is a binned frequency table.

PPData %>%
count(cut_interval(EmpRate, 6))

## # A tibble: 5 x 2
## `cut_interval(EmpRate, 6)` n
## <fct> <int>
## 1 [0.521,0.537] 2
## 2 (0.554,0.57] 1
## 3 (0.57,0.587] 4
## 4 (0.587,0.603] 4
## 5 (0.603,0.62] 122

4. The R code for these calculations is as follows:

qnorm(0.45, mean = 4, sd = 6)

## [1] 3.246032

qt(0.975, df = 8)

## [1] 2.306004

pnorm(0.75)

## [1] 0.7733726

rbinom(5, size = 10, prob = 0.5)

## [1] 6 6 8 5 8

5. The code below is the minimal code needed to create the histogram us-
ing the default options. You could improve on the graph quite easily by
adjusting some of these options.

ggplot(data = PPData, mapping = aes(x = EmpRate)) + geom_histogram()


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Histogram of EmpRate, with count on the vertical axis]

6. The code below is the minimal code needed to create the graph using the
default options. You could improve on the graph quite easily by adjusting
some of these options.

ggplot(data = PPData, mapping = aes(x = MonthYr, y = EmpRate)) + geom_line()



[Time series plot of EmpRate against MonthYr]

12 Multivariate data analysis


1. The covariance and correlation are negative here, so periods of high
population growth tend to be periods of low unemployment.

cov(EmpData$UnempPct, EmpData$AnnPopGrowth, use = "complete.obs")


## [1] -0.0003017825
cor(EmpData$UnempPct, EmpData$AnnPopGrowth, use = "complete.obs")
## [1] -0.06513125

2. I used casewise deletion, but the choice does not matter here; it only
matters when you add a third variable.
3. The tables are
a. Simple frequency table
b. Conditional average
c. Crosstab
4. The scatter plot should look something like this:

ggplot(data = EmpData, mapping = aes(x = AnnPopGrowth, y = UnempRate)) + geom_point()


## Warning: Removed 12 rows containing missing values (geom_point).

[Scatter plot of UnempRate against AnnPopGrowth]

5. The plot should look something like this:

ggplot(data = EmpData, mapping = aes(x = AnnPopGrowth, y = UnempRate)) +
  geom_point() +
  geom_smooth(col = "green") +
  geom_smooth(method = "lm", col = "blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).

[Scatter plot of UnempRate against AnnPopGrowth, with a loess smoother (green) and a linear fit (blue)]

Appendix A Math review



1. In my case it would be 𝐴 = {𝐵𝑟𝑖𝑎𝑛, 𝐶ℎ𝑎𝑟𝑙𝑖𝑒, 𝐻𝑎𝑧𝑒𝑙, 𝑆𝑢𝑒}.


2. This would be 𝐵 = {𝑥 ∈ ℤ ∶ 1 ≤ 𝑥 ≤ 1000}
3. The cardinality is |𝐴| = 4 (in my case, yours may be different) and |𝐵| =
1000.
4. Remember that sets are subsets of themselves.
a. 𝐵 and 𝐷 are identical (𝐵 = 𝐷)
b. 𝐶 is disjoint with 𝐴
c. All four sets are subsets of 𝐴 (𝐴 ⊂ 𝐴, 𝐵 ⊂ 𝐴, 𝐶 ⊂ 𝐴, and 𝐷 ⊂ 𝐴)
d. 𝐶 and 𝐷 are subsets of 𝐵.
5. Remember that ∩ means "and", ∪ means "or", and the superscript 𝐶 means "not":
a. 𝐴 ∩ 𝐵 = {2}
b. 𝐴 ∪ 𝐵 = {1, 2, 3, 4}
c. 𝐴 ∩ 𝐷 = {2}
d. 𝐴 ∪ 𝐷 = {1, 2, 3}
e. 𝐵ᶜ = {1, 3, 5}
f. Enumeration is not an option here, so we use set-builder notation:
   𝐵ᶜ = {𝑥 ∈ ℤ ∶ 𝑥 ∉ {2, 4}}

6. Remember that the indicator function 𝐼(⋅) returns one for true statements
and zero for false statements.
   a. 𝐼(𝑥 < 5) = 𝐼(4 < 5) = 1
   b. 𝐼(𝑥 is an odd number) = 𝐼(4 is an odd number) = 0
   c. 𝐼(𝑥 < 5) − 𝐼(√𝑥 is an integer) = 𝐼(4 < 5) − 𝐼(√4 is an integer) = 1 − 1 = 0
7. The limits are:
a. The limit of this sequence is 0
b. This sequence has no limit.
c. The limit of this sequence is 5
d. The limit of this sequence is 0
e. The limit of this sequence is ∞
8. The summation values are:
   a. ∑⁵𝑥=1 𝑥² = 1² + 2² + 3² + 4² + 5² = 55
   b. ∑𝑖∈{1,3,5} ln(𝑖) = ln(1) + ln(3) + ln(5) ≈ 2.708
   c. ∑𝑖 𝑥𝑖 𝐼(𝑖 < 4) = 𝑥1 + 𝑥2 + 𝑥3
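
Both numerical results can be checked quickly in R:

sum((1:5)^2)         # (a) 55
sum(log(c(1, 3, 5))) # (b) about 2.708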
9. Statements (a) and (c) are true.
