Stats Book SFU
Brian Krauth
Fall 2021
Contents
1 Introduction
1.1 Course goals and context
1.2 Expectations
1.3 SFU-specific information
1.4 Computer resources
1.5 Conventions of this book
7 Statistics
7.1 Data and the data generating process
7.2 Statistics and their properties
7.3 Estimation
7.4 The law of large numbers
Chapter review
Practice problems
9 An introduction to R
9.1 A brief tour of RStudio
9.2 The R language
9.3 Packages and the Tidyverse
9.4 Some examples
Chapter review
Practice problems
11 Using R
11.1 Cleaning data in R
11.2 Data analysis in R
11.3 Graphs with ggplot
Chapter review
Practice problems
This book has been written for use as a textbook for ECON 233, the introductory statistics course for economics majors at Simon Fraser University.
The current version of the book can be obtained at https://bookdown.org/bkrauth/BOOK/.
The book itself is written using Bookdown, and its source code is available at https://github.com/bvkrauth/econ233.
Chapter 1
Introduction
Goals
Chapter goals
In this chapter we will:
You will be able to apply these skills in combination to analyze real-world economic data:
We will be switching back and forth between theory, data analysis and applications. All three skill sets are valuable.
Hopefully you are in this course because you are fascinated by statistics and
can’t wait to learn more about it. But most of you are taking it because it’s a
required course.
So I'd like to encourage everyone to treat this course as an opportunity to learn some very useful skills. Today's world is awash in data:
These databases can be linked and analyzed in various ways, and many of the
world’s most successful companies rely heavily on the innovative gathering and
usage of data:
• Google's core product (the search engine) is built on the innovative analysis of massive amounts of data.
• Both Google and the major social media companies are based on providing
valuable “free” services in order to gather data on consumers that can then
be sold (in some form) to other businesses.
• Amazon and other retailers use what is called A/B testing to fine-tune
product descriptions and set prices so as to maximize profits.
Some of this data analysis is done by computer scientists, but much of it is done
by economists: for example, Amazon is the second-largest employer of PhD
economists in the US (after the Federal Reserve System).
This course will not qualify you for those jobs, but it is a first step in that
direction.
Be the Mona Lisa
I always tell students thinking about the future to remember supply and demand
in the labour market. In the labour market your skills and effort are the product,
and you are the seller. Like all sellers, you want to be expensive. This requires
that you have skills that are both:
In other words, you need to be like the Mona Lisa. If your skills are useful but common (like water), or rare but useless (like my one-of-a-kind drawing of the Mona Lisa), your labour will sell at a low price.
The ability to analyze data in a sophisticated way, and to explain the results in
written or oral presentation, is an extremely useful and uncommon skill. Most of
you do not have the technical skills of your colleagues in Computer Science, but
if you can combine a reasonable level of computer skills with writing, knowledge
of the underlying statistical principles, and the ability to recognize the economic
considerations in a situation, you will do quite well.
1.2 Expectations
The course is constructed under the assumptions that:
3. You can do high-school level math including algebra and basic set theory
and have taken or are currently taking an introductory calculus course.
• I will not ask you to take derivatives or solve integrals; instead I will
refer to concepts like functions, sequences and limits.
• The math review appendix provides material and practice problems
if you need to review these concepts and tools.
4. You have access to a desktop or laptop computer, and have basic computer skills.
• Tools: ECON 233 uses both Excel and R, while BUS 232 uses Excel.
– You are likely to use R in ECON 333 and other upper-division ECON
courses, so it is nice to get used to it now.
• Applications: ECON 233 emphasizes economics applications, while BUS
232 emphasizes business applications.
ECON 233 is part of the Social Data Analytics (SDA) minor; if you are an economics student interested in that minor, we recommend that you take ECON 233.
ECON 333 is the second course in the two-course econometrics sequence required
for all economics majors. In ECON 333, you will learn more advanced techniques
including linear regression, you will use R more extensively, and you will go
deeper into the theory.
Related courses
If you find you enjoy and/or do well in this course, I would strongly encourage
you to take further courses in econometrics:
I would also encourage you to take courses outside of the economics department, and to consider a Statistics minor or the new interdisciplinary Social Data Analytics (SDA) minor.
• Microsoft Excel
• R
• RStudio
Excel is available to SFU students through Office 365; installation instructions are available at https://www.sfu.ca/itservices/technical/software/office365.html.
Once you have installed Excel, you should confirm that it is working by starting
the program. You should see something that looks like this:
Later in the semester, we will also be using a more specialized statistical program
called R, and a related program called RStudio.
Both R and RStudio are open-source, and are available free of charge for both
Windows and macOS. Installation instructions are available at:
https://rstudio.com/products/rstudio/download/#download.
After installing R and RStudio, you should confirm that they are working by
opening RStudio. You should see something like this:
One of the most useful features of R is that it allows users to write and distribute
packages that extend its capabilities.
One of the most popular and useful packages is called the Tidyverse. R is a
very powerful program, but it is also a very old one: the underlying language
(called “S”) was originally created in 1976. The result of this is that some of
the original commands are outdated in design and aren’t well suited for modern
capabilities or principles of software development. The Tidyverse solves this
problem by adding new, more modern versions of these commands. You can
learn more about the Tidyverse at https://www.tidyverse.org/.
To install the Tidyverse package:
Once the installation is concluded and the > prompt reappears you can test to
make sure the installation worked.
• If you don’t get an error message (you will get some message about
“Conflicts”), the installation worked.
If you run into trouble here, don’t worry. We will not need the Tidyverse for a
few weeks, so there is plenty of time to get help.
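For reference, here is a minimal sketch of the installation and test described above, typed at the RStudio console:

```r
# Install the Tidyverse package (this only needs to be done once):
install.packages("tidyverse")

# Load the package to test that the installation worked. Expect a
# startup message listing packages and some "Conflicts", but no error:
library(tidyverse)
```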
• Organization:
– Each chapter corresponds to one full week of the course.
• Typography:
– Computer code or other inputs are shown like this.
– Math is usually shown like 𝑡ℎ𝑖𝑠.
– When new terminology is introduced, it is shown like this.
• Boxes:
– Pull-out information is shown in colored boxes.
Goals
Economics background
FYI
Boxes like this are for providing optional information that might be of
interest to some students.
Chapter 2
Basic data cleaning with Excel
Before we can analyze data, we usually need to clean it. Cleaning data means putting it into a form that is ready to analyze, and can include:
This chapter will develop both some principles of data cleaning and the basic Excel tools needed to implement them. We will also apply this knowledge by using Excel to clean a real data set.
Goals
Chapter goals
In this chapter we will learn how to:
We will demonstrate the key principles and tools in this chapter by cleaning
the November 2020 employment data for Canadian provinces. The data can be
found in the file
https://bookdown.org/bkrauth/BOOK/sampledata/CanEmpNov20.xlsx
Economics background
The unemployment rate and LFP rate are key indicators of labour market conditions. A higher-than-usual unemployment rate means that workers are having difficulty finding work, and a lower-than-usual LFP rate means that some workers have stopped looking for work.
2.2.1 Reproducibility
When analyzing data it is important for the results of our analysis to be reproducible. What that means is that an interested reader should be able to figure out exactly how you got your results, and should be able to generate those results themselves. Note that the "interested reader" might be you, as it is common to return to an analysis you did earlier. Without reproducibility, you may not remember where you found the data, what you have done with it, or what it means.
Reproducibility requires that we treat our data with care. In particular, we
should:
To keep analysis easy and minimize errors, data should be organized as what
data scientists have come to call tidy data. Tidy data has the following seven
properties:
• Variable names are typically displayed in the top row of the table.
4. Each row in a table (after the top row) represents a distinct observation,
data point or case.
5. All observations in a given table come from the same unit of observation
• For example, data on Canadian cities and data on Canadian
provinces should probably be in two separate tables.
6. One variable in a table serves as a unique identifier or ID for the observation.
• A unique identifier takes on a different value for each observation.
• The ID variable is often in the first column of the table.
7. The order in which observations or variables are listed is irrelevant to the
analysis.
• That is, the interpretation of the table would not change if its rows
or columns were in a different order.
Data that is not tidy is sometimes called messy data. One of the first steps in
cleaning data is rearranging it from a messy format to a tidy format.
Each tidy data set normally includes at least one unique identifier or ID variable.
By definition, an ID variable must take on a different value for each observation.
With this property, we can use ID variables to link and combine data from
multiple sources.
Example 2.1. ID variables at SFU
SFU is one of British Columbia’s largest organizations, and relies heavily on ID
variables to organize its information:
Your library records, grades, financial records, and almost any other information
SFU has about you includes one or more of these IDs.
1. They can be numbers (like your student ID number) or text strings (like
your computing ID).
2. They can be
| Issue | Solution |
|-------|----------|
| Some applications will round non-integer values (changing 1.23 to 1) or drop leading zeros (changing 00045 to 45). | Numeric ID variables should always be integers without leading zeros, or converted to text. |
| Some applications will reject or transform spaces or unusual characters. | Text ID variables should only use (Latin) letters and (Arabic) digits. |
| Some applications are case-sensitive (so that "hello" and "Hello" are different values) and others are not. | Text ID variables should typically use either all upper-case or all lower-case. |
FYI
Names, IDs, and probabilistic matching
Proper names are typically not used as ID variables since they are not
necessarily unique. In addition, they are not always written consistently.
For example, the same person might be called “Doug” in one data set
and “Douglas” in another.
Occasionally a data set will not have a standardized ID variable, and our only option is to match observations based on a proper name. For example, in BC, school data uses an ID number called a PEN while health data uses a different ID number called a PHN. There is no direct way to match PEN and PHN, so we have to match education and health records on a combination of proper names and other information such as year of birth. Matches made this way are called "probabilistic" matches, meaning (roughly) that the records probably describe the same person but might not.
Many of you have probably used Excel before, but there are many features you
are probably not yet familiar with. We will start by going over some of its basic
characteristics and terminology.
When giving instructions, I will refer to various elements of Excel’s user interface
by name. You may have been using these elements for years without ever
knowing their names, so I will list them here:
The first step in any data cleaning exercise is to look at the data and assess
what needs to be done.
It is in tidy format:
– There is a single rectangular table starting in cell A1.
– Each row represents a Canadian province
– Canadian provinces are the unit of observation
– Each column represents a variable describing that province
– The top row shows brief but clear names for each variable
– The provinces are listed in alphabetical order, but the interpretation
of the table would not change if they were listed in some other order
• Raw data is the original data as downloaded from Statistics Canada.
Note that we are following good data management practice by saving a copy
of the original data and creating a new data set based on it, rather than by
directly editing the original data. We are also documenting data sources. Both
of these practices will enhance the reproducibility and reliability of our analysis.
In the remainder of this section, we will learn a few tools for changing how our
data is displayed that do not change the content of the data.
Sorting allows us to re-order the rows of our data based on the value in one or
more of the columns. Since order does not matter with tidy data, we can sort
in whatever way we like without changing the content of our data.
As you can see, the data set is now sorted by population. We can follow similar
steps to put the data set back in alphabetical order:
Notice that Excel can tell whether a column contains numbers or text, and will
sort accordingly.
You can sort in ascending or descending order, and you can sort on multiple
columns by selecting the “Custom sort” option.
Filtering allows us to hide some observations so we can look at a particular
subset of observations that we are interested in. The hidden observations are
still there.
1. Select Home > Sort and Filter > Filter. If you look at the column
headers in your sheet you will see that they have become drop-down boxes.
2. Click on the drop-down box for Population, then select Number filters
> Greater Than...
3. Enter one million (1000000) in the box and select OK.
At this point, only the provinces with at least one million residents appear in
the table. Don’t worry, the other ones haven’t gone anywhere.
We can undo the filter and remove the drop-down boxes by selecting Home >
Sort and Filter > Filter again.
You can filter on more complex criteria, and you can combine sorting and fil-
tering.
Freezing panes keeps some rows and/or columns visible regardless of which
cell is currently selected. This allows us to work with large tables while keeping
the top row (variable names) and/or the first column (observation IDs) visible.
Now go back down to row 50 or so. You will see that the top row is still displayed
and you can see the variable names.
To undo this, select View > Freeze Panes > Unfreeze Panes.
Instead of freezing the first row, you can freeze the first column, or you can
freeze any number of rows and columns.
Another way we can change the appearance of our data without changing its
content is to adjust the cell formatting for one or more cells. Cell formatting
characteristics include:
• Column width
• Row height
• Font
• Bold/italics/underline
• Text color
• Background color
• Cell borders
• Alignment (left/right/center as well as top/bottom/middle)
The procedure for modifying the cell format is straightforward if you regularly
use productivity applications like Microsoft Word.
Example 2.6. Changing cell size
You may notice that cell B8 (Ontario population) appears to contain
“######” rather than a number. The cause of this problem is that the cell
is not wide enough to display the correct number. So let’s make it wider.
We have several options for doing this:
• From the menu: Select any cell in column B, and then select Home > Format > Column Width.... A dialog box will appear that allows you to enter your preferred width. Try 10.
• With the mouse: Move your cursor to the line between column headers B and C until it changes to a resize cursor, then click and drag to resize the column.
• Auto-fit (this is what I usually do): Move your cursor to the line between
column headers B and C, and double-click. The column width will auto-
matically adjust to fit the data.
2.4.1 Preparation
The next step is to look at the data and construct a cleaning plan. Your cleaning
plan should have several steps:
The data cleaning plan should be based around what we plan to do with the
data, but should preserve flexibility in case we want to use the data for other
purposes.
Example 2.8. A plan for cleaning the employment data
The first step in cleaning the employment data is to ensure we have tidy data.
This step has already been completed, so we can move on.
The second step is to ensure that each table has a unique ID variable. In our
employment data, the province name could serve as a unique identifier, but
names typically have some drawbacks:
• Names are not always unique. This is obviously not an issue with Cana-
dian provinces, but it is an issue in many other data sets. For example,
there are 41 cities and towns in the United States named “Springfield.”
• The same observation often appears with different names in different data
sets. For example, some older data sets call the province of Newfoundland
and Labrador just “Newfoundland” as that was the province’s name before
December 6, 2001.
We will do both.
The third step is to identify and address problems in our existing variables.
The fourth step is to consider whether there are any variables we would like to analyze that have not yet been constructed. In our employment data, we will want to construct
• Labour force
• Labour force participation rate
• Unemployment rate
We will also want to include information on which specific month these observations are describing (November 2020). Although that information is in the Raw data worksheet and in the title of the main worksheet (Employment Nov 2020), it may be useful to have it in the table as well.
We will now add several variables to our working data set. Be sure to save the
file after adding each variable so that you don’t lose your work.
Most of the time we will add data to the end of the existing table, adding new
variables after the last column or new observations after the last row. But
occasionally we will want to insert a column or row into our table.
This will shift all columns to the right, and insert a new blank column A.
We can also insert rows, delete rows or columns, and even insert or delete
individual cells.
The simplest way to add a variable to an Excel table is by typing data directly
into the cells.
Excel has several tools available to speed the process of entering data. First,
you can copy-and-paste or cut-and-paste the contents of any cell into any other
cell.
Excel's fill tool allows you to quickly copy the contents of a cell into a set of cells immediately above, below, or to the left or right.
Now we could enter the exact same date in cells G3:G11, but we can save
ourselves some time by using Excel’s fill tool:
As you can see, Excel fills in all selected cells with the value in the top cell.
The series tool allows you to fill in a group of cells with an ascending or
descending sequence of numbers or dates.
As you can see, column A now contains a unique identifier that numbers
provinces from 1 to 10.
2.4.4 Formulas
Most of our new variables will be calculated from existing variables using formulas. A formula is just a rule for calculating a value from some other values. Formulas always start with the equals sign = followed by a mathematical expression that can include any combination of:
Cell H2 now displays 2,493,300 which is in fact the value in cell D2 plus the
value in cell E2.
Let’s also add a column for the unemployment rate. To remind you, this is the
proportion or percentage of the labour force (column H) that is unemployed
(column E).
Notice that both of these formulas use cell H2, which itself contains a formula.
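For reference, the formulas involved would look something like `=D2+E2` for the labour force in cell H2 (employed plus unemployed) and `=E2/H2` for the unemployment rate; the exact cell addresses are assumptions based on the column layout described above.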
2.4.5 Functions
Excel has about 500 built-in functions that we can use in formulas. Each
function has a name and a set of arguments whose values you can set.
To use a function, you simply include its name and its arguments as part of
the formula. For example, the SQRT() function takes a single numeric argument
and returns the square root of the argument. So if you enter =SQRT(2) in a cell,
the cell will display 1.414, the square root of 2.
Excel also has extensive tools for
You can also use Google or any other search engine to find this information.
5. Select OK.
You will see that cell K2 now contains =LN(C2) which displays as 15.0933. Note
that if you already knew the function and arguments you needed, you could have
just typed =LN(C2) into cell K2 instead.
Some functions like SUM() and AVERAGE() operate on a range of cells rather than a single cell. A range is just a rectangular set of cells, and is described by its upper-left and lower-right cells, separated by a colon (":"). For example, C2:C11 is the one-column range containing cells C2 through C11.
A single cell can also be thought of as a range of cells with one row and one
column.
Example 2.16. Total population
Suppose we want to create a new variable that reports the total population across all observations in the data. The function to do that is SUM(). Enter =SUM(C2:C11) in cell L2.
Cell L2 should display 31,275,600, which is indeed the sum of cells C2 to C11.
In Excel, you can copy-and-paste the contents of a cell to any other cell. This is
particularly handy when a cell contains a formula, as it would be inconvenient
to type the same formula into each cell.
Example 2.17. Copying a formula
To use copy-and-paste to copy the formula in cell H2 to the other cells in column
H:
You can also use fill for this purpose, and it is usually quicker.
Excel is smart and normally treats cell addresses in formulas as relative references when copying and pasting cells. That is, when a formula is copied to another cell 𝑎 columns to the right and 𝑏 rows down, the column letters in the formula are increased by 𝑎 units, and the row numbers are increased by 𝑏 units.
For example, suppose cell B5 contains the formula =A1. If we copy the contents of this cell to other cells, we get:
| Cell | Formula |
|------|---------|
| B5   | =A1     |
| B6   | =A2     |
| B7   | =A3     |
| C5   | =B1     |
| D5   | =C1     |
| C6   | =B2     |
| D7   | =C3     |
Because Excel treats cell references as relative, copying the cell L2 to the rest of column L produces:
• Cell L2 contains =SUM(C2:C11)
• Cell L3 contains =SUM(C3:C12)
• Cell L4 contains =SUM(C4:C13)
but in this case we want all of the cells to contain =SUM(C2:C11).
We can tell Excel to treat a given cell reference as absolute by adding the $ character to the cell reference. For example, suppose that we copy the formula in cell C2 to cell D3 (one column right and one row down). Then a relative reference like A1 becomes B2, while $A1 becomes $A2, A$1 becomes B$1, and $A$1 stays $A$1.
Note that the presence or absence of the $ does not affect the calculation in the
cell, it only affects how the formula is copied over to other cells.
Example 2.21. Total population, part 2
To fix the TotPop variable, change the formula in cell L2 to =SUM($C$2:$C$11) and then copy it to the rest of the column.
Sometimes we will want to combine absolute and relative references in the same
formula. This typically will happen when we want to compare the current
observation to the other observations.
Example 2.22. Population rank
Suppose we want to create a new variable that is the province’s population rank.
That is, the province with the highest population has a rank of 1, second highest
has a rank of 2, etc. The function to do that is RANK.EQ(), which takes two
arguments: the value to rank, and the list of values to use for the ranking. We
want the first argument (the province’s population) to vary across provinces,
but we want the list of values (the populations of all of the provinces) to stay
the same.
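For illustration, a formula along these lines might be `=RANK.EQ(C2, $C$2:$C$11)`, entered in the first data row and copied down; the specific cell references are an assumption based on the layout described earlier, but the key point is that the second argument uses absolute references so that the list of values stays the same.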
As you can see, Excel displays the correct ranks. We can check this by sorting
on Population and seeing if PopRank is also sorted.
FYI
Advanced options
The RANK.EQ() function has several relatives with similar syntax:
• The RANK() and RANK.AVG() functions also return the rank, but
use slightly different rules for handling ties.
• The PERCENTRANK(), PERCENTRANK.EXC() and
PERCENTRANK.INC() functions return ranks in percentiles.
In addition to numbers, Excel cells can contain several other data types:
• Text
• Logical (true/false) values
• Dates and times
Excel has various tools for entering, storing and processing these data types.
But the results of these calculations are typically displayed to fewer decimal
places than that. A cell’s contents are distinct from its numeric display
format, which is how it appears on the screen. We can change the display
format of a cell or group of cells without changing its contents.
You will see that the cells now display the rates as percentages, to two decimal places. I think it would be more readable if we round to just one decimal place. To do that, select Home > Decrease Decimal.
An important thing to understand here: all we have done is change how the
numbers are displayed. If we do any calculations with these cells, the calculation
will use the original proportional value without rounding.
Text data is also called character data or string data. In Excel, text values
can be entered directly in a cell, can be used in a formula, and can be the result
of a formula.
Excel has many functions for working with text data. A particularly useful one
is the CONCAT() function, which allows you to join or concatenate two or more
strings. This is useful in building reports, in constructing ID variables, and in
many other applications.
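For illustration, a formula along the lines of `=CONCAT(B2, " has a population of ", C2)` would join a province's name and population into a single readable sentence; the cell references here are assumptions based on the earlier layout.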
As you can see, CONCAT() is useful for converting data into human-readable
statements. It is also useful for creating ID variables.
FYI
Advanced options
There are many useful functions to manipulate text strings in Excel:
In addition to text and numbers, cells can also contain logical values (TRUE or
FALSE). Logical values can be entered directly in a cell, can be used in a formula,
and can be the result of a formula.
Mathematical expressions using the comparison operators = (equal), <> (not
equal), > (greater than), < (less than), >= (greater than or equal), and <= (less
then or equal) can be used to create logical values.
Example 2.25. Creating a logical variable
To create a logical variable that indicates whether a province has a labour force participation rate below 64%, enter the formula =(I2 < 0.64) in the first data row of a new column, and copy it down.
• Notice that (I2 < 0.64) is a statement that is either true or false,
not a numeric expression.
• Logical statements can use other comparison operators, including =,
<, >, and <=.
As you can see, the cells display TRUE in the three provinces with LFP rates
below 64%, and FALSE in the other seven.
Logical values are particularly powerful in combination with the IF() function. This function takes three arguments:
• a statement,
• a value to return if the statement is true, and
• a value to return if the statement is false.
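For example, an indicator version of the low-LFP variable from the previous example might be created with a formula like `=IF(I2 < 0.64, 1, 0)`, copied down the column.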
As you can see, the cells display 1 in the three provinces with LFP rates below 64%, and 0 in the other seven. When cleaning data we will typically use indicator variables rather than logical variables.
FYI
Advanced options
Some additional functions that work with logical variables:
• NOT() returns TRUE if its argument is FALSE and FALSE if its argu-
ment is TRUE.
• AND() returns TRUE if all of its arguments are TRUE.
• OR() returns TRUE if any of its arguments are TRUE.
• SWITCH() and IFS() are extensions of IF() that take multiple
conditions.
In addition to numbers, text, and logical values, Excel can handle dates and times. Dates and times are a surprisingly complex subject that can create all sorts of problems on a computer, for several reasons:
Each application has its own rules for handling dates and times, though there
are some standards that have developed. Excel handles these issues as follows:
1. Dates are stored as the number of days elapsed since some base date. In
Excel the base date is January 1, 1900, which means that
• January 1, 1900 is day 1.
• January 2, 1900 is day 2.
• November 1, 2020 is day 44136.
2. Dates are displayed according to the cell’s display formatting.
• The default display formatting varies across regions, so the same date
in the same Excel file might appear different on your computer and
my computer.
3. When you enter something that looks like a date, Excel does several things
behind the scenes:
• it guesses the date format for what you have entered
• it converts what you have entered to the internal storage form (days
since base date).
• it changes the display format to what Excel thinks it should be (based
on your location settings).
Most of the time, this system works seamlessly and you don’t even notice it.
But it can cause problems, and understanding the underlying structure can help
you to solve those problems.
Example 2.27. How dates are stored and displayed
The MonthYr variable in column G is a date.
Now while a date of 44136 is quite clear to Excel, we want to display dates in a
more human-readable way. I don’t like the Nov-20 display format, since it isn’t
obvious whether that means November 2020 or November 20. So let’s change
the formatting:
You can see even more options if you select More number formats.
We can also do calculations with dates, and there are various functions using
dates.
Example 2.28. Some date calculations
To calculate how long ago November 2020 was, we can subtract the stored date from today's date, for example with a formula along the lines of =TODAY() - G2 in an empty cell (the cell reference is an assumption based on the layout above).
• This will display the number of days that have passed between November 1, 2020 and today.
5. If you have not already done so, save your data file.
Most of the time, Excel’s handling of dates works seamlessly and is very clever.
But sometimes Excel guesses wrong, and this can create all sorts of problems.
FYI
Excel dates and genetics
Excel's handling of dates caused a significant unanticipated problem in the field of human genetics, where it is a widely used tool.
Each gene has a standard abbreviation like “TCEA1” or “CTCF” as-
signed by a scientific body called the HUGO Gene Nomenclature Com-
mittee (HGNC). Unfortunately, 27 of these genes have abbreviations
that Excel misinterprets as dates, for example “Membrane Associated
Ring-CH-Type Finger 2,” also known as “MARCH2”. If you enter the
text “MARCH2” in an Excel cell, Excel will automatically convert it to
the date of March 2 in the current year. A 2016 research paper found
that roughly 20% of published research articles in the field used data
that was affected by this problem.
Unfortunately, it is too late to “fix” Excel to keep this from happening.
Any change to its behavior would “break” Excel in thousands of other
applications that rely on its current behavior.
When you can’t fix a problem in a computer application, you need to
find a workaround: a modification to how you use the application that
avoids or minimizes the effects of the problem. So the HGNC changed
the names of these 27 genes in 2020. For example, the gene MARCH2 is
now called MARCHF2.
In addition to dates, Excel can also handle date-time values such as 11/1/2020
12:00:00 PM.
FYI
How Excel handles times
Excel treats times as partial days. For example, Excel will store
11/1/2020 12:00:00 PM as day number 44136.5 and 11/1/2020
1:00:00 PM as day number 44136.5416666667.
There are also functions that work with date-time values. For example, we have already seen the function TODAY(), which returns the current date, but there is also a function NOW() that returns the current date and time.
• Maybe you are trying something new, and are keeping an earlier version
in case something goes wrong.
• Maybe you did an analysis a few weeks ago that you are no longer using,
but you don’t want to throw it away in case you change your mind.
• Maybe you and a classmate are working on a project together, and you
have each made changes to separate copies of the original file.
You will want to use some form of version control here, with a goal of keeping
everything you might need without making mistakes or spending a lot of time on
it. Version control is an important element in making your analysis reproducible.
Software developers and professional data analysts (like me) use a formal version
control system like Git/GitHub. For our purposes, we can just follow a few
simple rules:
1. The working copy is the file you are actively working on right now.
2. The master version is the most recently saved file that is “complete.”
• You should try to work in small and discrete projects so that you can
– Make a working copy of the master version.
– Complete a project in your working copy.
– Make this working copy the new master version.
At this level, you do not normally need to keep archived versions. But you
should at least distinguish between your master version and working copy.
Chapter review
Data cleaning is among the most important practical skills one can develop
in applied statistical analysis. Simple statistical methods like averages and
frequencies are all most people will ever use, but everyone who works with data
regularly encounters complex and messy data.
In this chapter we have learned some important data cleaning concepts, including reproducible research, tidy data, ID variables, and version control. We have also learned how to implement these concepts in Excel using tools such as fill/series, sorting, formatting, formulas, and functions.
We will soon use Excel to do some basic statistical analysis and graphing using our cleaned data. Later on, we will learn more advanced data cleaning concepts such as linking, aggregating, error validation/handling, importing and exporting, as well as how to implement them in both Excel and R.
Practice problems
Each chapter will include a few simple practice problems to help you check your
knowledge. They are organized by the specific skill or area of knowledge you
are practicing.
Answers can be found in the appendix.
SKILL #1: Identify features of tidy data
a.
b.
| Variable   | Value  |
|------------|--------|
| Name       | Bob    |
| Age        | 30     |
| Occupation | Chef   |
| Name       | Joe    |
| Age        | 35     |
| Occupation | Waiter |
c.
3. For each of the following formulas, suppose we copy the formula from cell C12 to cell E15. What formula appears in cell E15?
a. =B2
b. =$B$2
c. =$B2
d. =B$2
e. =SUM(B2:B10)
f. =SUM($B$2:$B$10)
g. =SUM($B2,$B10)
h. =SUM(B$2,B$10)
Chapter 3
Probability and random events
Goals
Chapter goals
In this chapter we will learn how to:
This chapter uses mathematical notation and terminology that you have seen
before but may need to review. If you have difficulty with the math, please refer
to the sections on Sets and on Functions in the Math Review appendix.
We will develop ideas by considering the casino game of Roulette. The picture
below shows what a roulette wheel looks like.
• It features
– a ball.
– a spinning wheel with numbered/colored slots.
– a table on which to place bets
• The slots are numbered from 0 to 36
– Slot number 0 is green
– 18 slots are red
– 18 slots are black.
– The picture above depicts an American roulette table, which has an additional green slot labeled "00".
– I will assume we have a European roulette table, which does not
include the “00” slot.
• Players can place various bets on the table including:
– Red (ball lands in a red slot) pays $1 per $1 bet
– Black (ball lands in a black slot) pays $1 per $1 bet
– A straight bet on any specific number (ball lands on that number)
pays $35 per $1 bet
Next, we define a set of events that we are interested in. We can think of an event as either:
• a statement about the outcome that will turn out to be either true or false, or
• a subset of the sample space.
These two concepts are equivalent, though the subset concept makes the math clearer.
Example 3.4. Events in roulette
These roulette events are well-defined for our sample space:
We could define many more events, depending on what bets we are interested
in.
• Two events are identical (𝐴 = 𝐵) if they contain exactly the same outcomes:
– The event “ball lands on 14” and “a bet on 14 wins” are identical
since {14} = {14}.
– Intuitively, identical means they are just two different ways of de-
scribing the same event.
• An event implies another event (𝐴 ⊂ 𝐵) if all of its outcomes are also in the implied event:
– The event “ball lands on 14” implies the event “ball lands on red”
since {14} ⊂ 𝑅𝑒𝑑.
– When an event happens, any event it implies also happens.
• Two events are disjoint (𝐴 ∩ 𝐵 = ∅) if they share no outcomes:
– The events “ball lands on red” and “ball lands on black” are disjoint
since 𝑅𝑒𝑑 ∩ 𝐵𝑙𝑎𝑐𝑘 = ∅.
– If two events are disjoint, they cannot both happen.
– But they can both fail to happen. For example, if the ball lands in
the green zero slot (𝑏 = 0), neither red nor black wins.
• Any two elementary events are either identical or disjoint
– The events “ball lands on 14” and “ball lands on 25” are disjoint
since {14} ∩ {25} = ∅.
3.2 Probabilities
Our final step is to define a probability distribution for this random process,
which is a function that assigns a number to each possible event. The number
is called the event’s probability.
Probabilities are normally between zero and one:
All valid probability distributions must obey the following three conditions, which are sometimes called the axioms of probability.
1. The probability of any event 𝐴 is non-negative: Pr(𝐴) ≥ 0
2. The probability of the sample space is one: Pr(Ω) = 1
3. For any two disjoint events 𝐴 and 𝐵, the probability that 𝐴 or 𝐵 happen is the sum of their individual probabilities: Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵)
Probability distributions have many other properties, but they can all be derived
from these three axioms.
By the second axiom, Pr(Ω) = 1, and by the third axiom:
$$\underbrace{\Pr(\Omega)}_{1} = \underbrace{\Pr(\{0\})}_{p} + \underbrace{\Pr(\{1\})}_{p} + \cdots + \underbrace{\Pr(\{36\})}_{p}$$
so 𝑝 = 1/37.
Since this is an introductory course, our sample space will usually contain a
finite number of outcomes, as in our roulette example. In that case, probability
calculations are pretty simple:
In the roulette example, the probability of any event 𝐴 is just the number of outcomes in 𝐴 times the probability of each outcome 1/37:
$$\Pr(A) = |A| \times 1/37$$
The notation |𝐴| just means the size of (number of elements in) the set 𝐴. For example:
$$\Pr(b = 25) = |\{25\}| \times 1/37 = 1/37 \approx 0.027$$
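As a numerical check, here is a sketch of these calculations in R (a program we will introduce later in the book):

```r
# Each of the 37 roulette slots (0 through 36) has probability 1/37:
slots <- 0:36
p <- rep(1 / 37, length(slots))
sum(p)               # the axioms require the probabilities to sum to 1

# The probability of an event is |A| * (1/37), e.g. the event b = 25:
sum(p[slots == 25])  # 1/37, approximately 0.027
```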
However, not all sample spaces contain a finite number of outcomes. For exam-
ple, suppose we are interested in using probability to model the unemployment
rate, or a person’s income. Those are real numbers, and can take on any of
an infinite number of values. This adds a few complications, and is the reason
that the probability axioms refer to events (sets of outcomes) and not individual
outcomes.
FYI
What do probabilities really mean?
What does it really mean to say that the probability of the ball landing
in a red slot is about 0.486? That’s actually a tough question. There are
two standard interpretations for probabilities:
Let 𝐴 and 𝐵 be two events. Then our three axioms of probability imply several
additional rules:
Pr(𝐴) ≤ 1
𝐴 = 𝐵 ⟹ Pr(𝐴) = Pr(𝐵)
𝐴 ⊂ 𝐵 ⟹ Pr(𝐴) ≤ Pr(𝐵)
$$\Pr(A^c) = 1 - \Pr(A)$$
Pr(∅) = 0
These results are not hard to prove, but I will not go through the proofs. However, I will use these results so you should be familiar with them.
We are often interested in more than one event, and want to talk about how
they are related. For example:
This section will develop some tools for dealing with the relationship between
different random events.
The joint probability of two events 𝐴 and 𝐵 is the probability that they both
happen:
Pr(𝐴 ∩ 𝐵)
Remember that the intersection (∩) of 𝐴 and 𝐵 is the set of all outcomes that
are in both 𝐴 and 𝐵.
Suppose you are interested in the probability that the ball lands on a number
that is both red and even. This event is just the intersection of 𝑅𝑒𝑑 and 𝐸𝑣𝑒𝑛
so this joint probability is:
$$\Pr(Red \cap Even) = \Pr(\{12, 14, 16, 18, 30, 32, 34, 36\}) = 8/37 \approx 0.216$$
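A sketch of the same calculation in R, assuming the standard European wheel coloring for the red slots (consistent with the intersection listed above):

```r
# Red slots on a standard European roulette wheel (an assumption):
red  <- c(1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36)
even <- seq(2, 36, by = 2)

red_and_even <- intersect(red, even)  # 12, 14, 16, 18, 30, 32, 34, 36
length(red_and_even) / 37             # 8/37, approximately 0.216
```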
Joint probabilities are just probabilities, so they obey all of the axioms and rules
of probability described in Section 3.2.
We say that two events 𝐴 and 𝐵 are independent if their joint probability is
just the two individual probabilities multiplied together:
After 3 games:
(If independence is intuitively a statement about conditional probabilities, why do we define it in terms of joint probabilities? The key is the requirement that the events have nonzero probability: when 𝐵 has zero probability, the conditional probability Pr(𝐴|𝐵) is not well defined since its denominator is zero.)
What is the probability of each of these events? Since we can assume that each
game’s outcome is independent, this is an easy problem:
So we have an 11.5% chance of winning big, and an 88.5% chance of going broke.
Very important: equation (3.11) only follows from the previous equation because
we have assumed the events 𝑅𝑒𝑑1 , 𝑅𝑒𝑑2 , and 𝑅𝑒𝑑3 are independent.
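A quick sketch of the arithmetic in R: under independence, the probability that red wins three games in a row is just the product of the three individual probabilities.

```r
p_red <- 18 / 37  # probability that red wins a single game
p_red^3           # red wins all three games: about 0.115
1 - p_red^3       # red loses at least once: about 0.885
```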
When is it not reasonable to assume that events are independent? In almost any
other case. Remember that events are defined in terms of the same underlying
outcome, so they are typically related unless you have some very specific reason
to assume otherwise.
Example 3.11. Independence within a roulette game?
Consider the roulette events “Red wins” and “Even wins”. We earlier showed
that the unconditional probability that Red wins is:
The conditional probability that Red wins given that Even wins is:
In addition to the results we have already discussed, there are two important
results using conditional probabilities:
The first is the law of total probability which is a rule for determining unconditional probabilities from conditional probabilities:
$$\Pr(T^c|D^c) = q$$
and the prevalence of the infection is the probability that a given patient has
the disease:
Pr(𝐷) = 𝑑
Suppose that a patient has tested positive. What is the probability that he has
the disease, i.e. what the value of Pr(𝐷|𝑇 )?
This is a classic probability question, as it makes use of Bayes’ law and the law
of total probability, and it has obvious practical usage.
Since we want a conditional probability, we start by stating Bayes' law:
$$\Pr(D|T) = \frac{\Pr(T|D)\Pr(D)}{\Pr(T)}$$
Bayes' law will allow us to calculate Pr(𝐷|𝑇) if we can find the components of the right side of this equation. We already know that Pr(𝑇|𝐷) = 𝑝 and Pr(𝐷) = 𝑑, so all we need is to find Pr(𝑇).
Since Pr(𝑇) is an unconditional probability, we can use the law of total probability:
$$\Pr(T) = \underbrace{\Pr(T|D)}_{p}\underbrace{\Pr(D)}_{d} + \underbrace{\Pr(T|D^c)}_{1-q}\underbrace{\Pr(D^c)}_{1-d}$$
Substituting this into Bayes' law, we get:
$$\Pr(D|T) = \frac{pd}{pd + (1-q)(1-d)}$$
For example, with a perfectly sensitive test (𝑝 = 1), specificity 𝑞 = 0.99, and a common disease (𝑑 = 0.1):
$$\Pr(D|T) = \frac{1 \times 0.1}{1 \times 0.1 + (1 - 0.99) \times (1 - 0.1)} \approx 0.917$$
With the same test but a rare disease (𝑑 = 0.001):
$$\Pr(D|T) = \frac{1 \times 0.001}{1 \times 0.001 + (1 - 0.99) \times (1 - 0.001)} \approx 0.091$$
In other words, the exact same test has a very different meaning depending on
the prevalence in the population: when the disease is common a positive test
means a 91.7% chance of having the disease, and when the disease is rare a
positive test result means a 9.1% chance of having the disease.
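Here is a sketch of this calculation written as an R function, reproducing the two scenarios above; the function name and argument names are my own.

```r
# Posterior probability of disease given a positive test, by Bayes' law:
# p = Pr(T|D), q = Pr(T^c|D^c), d = Pr(D) (the prevalence)
post_prob <- function(p, q, d) {
  p * d / (p * d + (1 - q) * (1 - d))
}

post_prob(p = 1, q = 0.99, d = 0.1)    # common disease: about 0.917
post_prob(p = 1, q = 0.99, d = 0.001)  # rare disease:   about 0.091
```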
This general issue (even a small false positive rate can have a big impact when prevalence is low) appeared repeatedly in March and April of 2020. Several studies by well-known researchers² dramatically overestimated the early prevalence of the COVID-19 virus and thus dramatically underestimated its fatality rate. These studies were regularly cited as support by those who wanted to substantially relax public health restrictions in April 2020, and had substantial real world consequences.
Chapter review
In this chapter we have learned the basic terminology and concepts of probabil-
ity. You may have seen a number of these terms and ideas in high school, but
we are approaching them at a higher level. Be sure to review these terms and
concepts in detail, and do the practice problems to test your knowledge.
Our next step is to take our general framework of outcomes and events, and
apply them to random variables - outcomes that are specifically numerical.
Practice problems
Answers can be found in the appendix.
Most of these practice problems will be based on the casino game of craps.
Craps is played with a pair of 6-sided dice.
Players take turns rolling the dice, and the player currently rolling the dice is
called the “shooter”. There are various bets - pass, don’t pass, come, don’t
come, field, place, buy - that can be placed on the results of multiple rolls of
the dice. These bets and their probability calculations can be quite complex, so
we will focus on “single roll” bets.
of the controversy, and a blog post by statistician Andrew Gelman provides a thorough discussion of the statistical issues.
An outcome for a single roll of the dice is a pair of numbers (𝑟, 𝑤) where 𝑟 is
the amount showing on the red die, and 𝑤 is the amount showing on the white
die. For example an outcome (2, 4) means that the red die is showing 2 and the
white die is showing 4.
SKILL #1: Define outcomes and sample space for a simple example
1. Let Ω be the sample space for the outcome of a single roll in craps.
a. Define Ω by enumeration.
b. Find the cardinality of Ω.
2. Using enumeration, define the following events:
a. Yo wins
b. Snake eyes wins
c. Boxcars wins
d. Field wins
d. Boxcars loses.
e. Field wins.
f. Field loses.
b. Pr(𝐴) > 0.
c. Pr(𝐴) ≤ 1.
d. Pr(𝐴) < 1.
e. Pr(𝐴𝑐 ) ≥ 0.
f. Pr(𝐴𝑐 ) > 0.
g. Pr(𝐴𝑐 ) ≤ 1.
h. Pr(𝐴𝑐 ) < 1.
i. Pr(𝐴𝑐 ) = 1 − Pr(𝐴).
13. Let 𝐴 and 𝐵 be two events. Which of the following statements are true?
a. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵).
b. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 ∩ 𝐵).
c. Pr(𝐴 ∪ 𝐵) ≤ Pr(𝐴) + Pr(𝐵).
d. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵).
14. Let 𝐴 and 𝐵 be two disjoint events. Which of the following statements
are true?
a. Pr(𝐴 ∩ 𝐵) = 0.
b. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) + Pr(𝐵).
c. Pr(𝐴 ∪ 𝐵) = 0.
d. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵).
e. Pr(𝐴 ∪ 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 ∩ 𝐵).
f. Pr(𝐴 ∪ 𝐵) ≤ Pr(𝐴) + Pr(𝐵).
g. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵).
h. Pr(𝐴|𝐵) = 0
15. Let 𝐴 and 𝐵 be two events such that 𝐴 ⊂ 𝐵. Which of the following
statements are true?
a. Pr(𝐴) ≤ Pr(𝐵)
b. Pr(𝐴 ∩ 𝐵) = Pr(𝐴)
c. Pr(𝐴|𝐵) = 1
16. Let 𝐴 and 𝐵 be two independent events. Which of the following statements
are true?
a. Pr(𝐴 ∩ 𝐵) = 0.
b. Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵).
c. Pr(𝐴|𝐵) = Pr(𝐴).
Chapter 4
Introduction to random variables
The previous chapter developed a general framework for modeling random outcomes and events. This framework can be applied to any set of random outcomes, no matter how complex.
This chapter develops additional tools for the case when the random outcomes
we are interested in are quantitative, that is, they can be described by a number.
Quantitative outcomes are also called “random variables.”
Goals
Chapter goals
In this chapter we will learn how to:
The material in this chapter will use some mathematical notation (the summation operator) that provides a convenient way to represent long sums. Please review the section on sequences and summations in the math appendix.
$$r = I(b \in Red) = \begin{cases} 1 & b \in Red \\ 0 & b \notin Red \end{cases}$$

• The net payout from a $1 bet on red:

$$w_{red} = w_{red}(b) = \begin{cases} 1 & \text{if } b \in Red \\ -1 & \text{if } b \in Red^c \end{cases}$$
That is, a player who bets $1 on red wins $1 if the ball lands on red and
loses $1 if the ball lands anywhere else.
• The net payout from a $1 bet on 14:
$$w_{14} = w_{14}(b) = \begin{cases} 35 & \text{if } b = 14 \\ -1 & \text{if } b \neq 14 \end{cases}$$
That is, a player who bets $1 on 14 wins $35 if the ball lands on 14 and
loses $1 if the ball lands anywhere else.
All of these random variables are defined in terms of the underlying outcome. A random variable is always a function of the original outcome, but for convenience, we usually leave its dependence on the original outcome implicit, and write it as if it were an ordinary variable.
A random variable has its own sample space (normally ℝ) and probability distribution. This probability distribution can be derived from the probability distribution of the underlying outcome.
Notice that these random variables are related to each other since they all
depend on the same underlying outcome. Section 5.4 will explain how we can
describe and analyze those relationships.
The random variables we will consider in this chapter have discrete support.
That is, the support is a set of isolated points each of which has a strictly positive
probability. In most examples the support will also have a finite number of
elements. All finite sets are also discrete, but it is also possible for a discrete
set to have an infinite number of elements. For example, the set of integers is
both discrete and infinite.
Some random variables have a support that is continuous rather than discrete.
Chapter 5 will cover continuous random variables.
(Technically, it is the smallest closed set, but let's ignore that for now.)
The probability density function (PDF) of a discrete random variable 𝑥 is the function:
$$f_x(a) = \Pr(x = a)$$
Our three random variables are all discrete, and each has its own PDF:
$$f_b(a) = \begin{cases} 1/37 & a \in \{0, 1, \ldots, 36\} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{red}(a) = \Pr(w_{red} = a) = \begin{cases} 19/37 & a = -1 \\ 18/37 & a = 1 \\ 0 & a \notin \{-1, 1\} \end{cases}$$

$$f_{14}(a) = \Pr(w_{14} = a) = \begin{cases} 36/37 & a = -1 \\ 1/37 & a = 35 \\ 0 & a \notin \{-1, 35\} \end{cases}$$
[Figure: the PDFs $f_b(a)$ and $f_{red}(a)$ plotted against $a$.]
We can calculate any probability from the PDF by simple addition. That is:
$$\Pr(x \in A) = \sum_{s \in S_x} f_x(s) I(s \in A)$$
where $A \subset \mathbb{R}$ is any event defined for 𝑥.
Example 4.5. Some event probabilities in roulette
Since the outcome in roulette is discrete, we can calculate any event probability
by adding up the probabilities of the event’s outcomes.
The probability of the event 𝑏 ≤ 3 can be calculated:
$$\Pr(b \leq 3) = \sum_{s=0}^{36} f_b(s) I(s \leq 3) = f_b(0) + f_b(1) + f_b(2) + f_b(3) = 4/37$$
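A sketch of the same calculation in R, representing the PDF of 𝑏 as a vector over its support:

```r
support <- 0:36
f_b <- rep(1 / 37, length(support))  # PDF of the roulette outcome b

# Pr(b <= 3): add up the PDF over the outcomes satisfying the event
sum(f_b[support <= 3])               # 4/37, approximately 0.108
```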
A PDF always has the following properties:
1. It is always between zero and one, since it is a probability: $0 \leq f_x(a) \leq 1$
2. It sums up to one over the support:
$$\sum_{a \in S_x} f_x(a) = \Pr(x \in S_x) = 1$$
3. It is strictly positive on the support, since the support is the smallest set that has probability one: $a \in S_x \implies f_x(a) > 0$
You can confirm that the examples above all satisfy these properties.
The cumulative distribution function (CDF) of a random variable 𝑥 is defined as:
$$F_x(a) = \Pr(x \leq a)$$
This formula leads to a “stair-step” appearance: the CDF is flat for all values
outside of the support, and then jumps up at all values in the support.
$$F_b(a) = \begin{cases} 0 & a < 0 \\ 1/37 & 0 \leq a < 1 \\ 2/37 & 1 \leq a < 2 \\ \vdots & \vdots \\ 36/37 & 35 \leq a < 36 \\ 1 & a \geq 36 \end{cases}$$
• The CDF of $w_{red}$ is:

$$F_{red}(a) = \begin{cases} 0 & a < -1 \\ 19/37 & -1 \leq a < 1 \\ 1 & a \geq 1 \end{cases}$$
• The CDF of 𝑤14 is:
$$F_{14}(a) = \begin{cases} 0 & a < -1 \\ 36/37 & -1 \leq a < 35 \\ 1 & a \geq 35 \end{cases}$$
The CDF has several properties. First, it is non-decreasing. That is, choose any
two numbers 𝑎 and 𝑏 so that 𝑎 ≤ 𝑏. Then
$$F_x(a) \leq F_x(b)$$
The reason for this is simple: the event 𝑥 ≤ 𝑎 implies the event 𝑥 ≤ 𝑏, so its
probability cannot be higher.
Second, it is a probability, which implies:
$$0 \leq F_x(a) \leq 1$$
Third, it runs from zero to one: $\lim_{a \to -\infty} F_x(a) = 0$ and $\lim_{a \to \infty} F_x(a) = 1$. The intuition is simple: all values in the support are between $-\infty$ and $\infty$.
Example 4.7. CDF properties
Figure 4.1 below graphs the CDFs from the previous example:
Notice that they show all of the general properties described above.
[Figure 4.1: the CDFs $F_b(a)$, $F_{red}(a)$, and $F_{14}(a)$ plotted against $a$.]
• The CDF never goes down, only goes up or stays the same.
• The CDF runs from zero to one, and never leaves that range.
In addition to constructing the CDF from the PDF, we can also go the other
way, and construct the PDF of a discrete random variable from its CDF. Each
little jump in the CDF is a point in the support, and the size of the jump is
exactly equal to the PDF.
In more formal mathematics, the formula for deriving the PDF of a discrete random variable from its CDF would be written:
$$f_x(a) = F_x(a) - \lim_{b \uparrow a} F_x(b)$$
Finally, we can use the CDF to calculate the probability that 𝑥 lies in any interval. That is, let 𝑎 and 𝑏 be any two numbers such that 𝑎 < 𝑏. Then:
$$\Pr(a < x \leq b) = F_x(b) - F_x(a)$$
Notice that we have to be a little careful here to distinguish between the strict
inequality < and the weak inequality ≤, because it is always possible for 𝑥 to
be exactly equal to 𝑎 or 𝑏.
𝑦 = 𝑎𝑥 + 𝑏
where 𝑎 and 𝑏 are constants. We will have many results that apply specifically
for linear functions.
The net payout from a $1 bet on red (𝑤𝑟𝑒𝑑 ) was earlier defined directly from
the underlying outcome 𝑏. However, we could have also defined it as a linear
function of the random variable 𝑟:
𝑤𝑟𝑒𝑑 = 2𝑟 − 1
That is, 𝑤𝑟𝑒𝑑 = −1 when red loses (𝑟 = 0) and 𝑤𝑟𝑒𝑑 = 1 when red wins (𝑟 = 1).
The expected value is also called the mean, the population mean or the expectation of the random variable. For a discrete random variable 𝑥, it is defined as:
$$E(x) = \sum_{a \in S_x} a \, f_x(a)$$
The formula might look difficult if you are not used to the notation, but it is actually quite simple to calculate:
$$E(b) = 0 \times \underbrace{f_b(0)}_{1/37} + 1 \times \underbrace{f_b(1)}_{1/37} + \cdots + 36 \times \underbrace{f_b(36)}_{1/37} = 18$$

$$E(r) = 0 \times \underbrace{f_r(0)}_{19/37} + 1 \times \underbrace{f_r(1)}_{18/37} = 18/37 \approx 0.486$$

$$E(w_{14}) = -1 \times \underbrace{f_{14}(-1)}_{36/37} + 35 \times \underbrace{f_{14}(35)}_{1/37} = -1/37 \approx -0.027$$
That is, each dollar bet on 14 leads to an average loss of 2.7 cents for the bettor.
We can think of the expected value as a weighted average of its possible values,
with each value weighted by the probability of observing that value.
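These expected values are easy to verify in R as probability-weighted sums over each variable's support:

```r
# E(b): all 37 outcomes are equally likely
sum((0:36) * (1 / 37))              # 18

# E(r): support {0, 1} with probabilities 19/37 and 18/37
sum(c(0, 1) * c(19 / 37, 18 / 37))  # 18/37, about 0.486

# E(w14): support {-1, 35} with probabilities 36/37 and 1/37
sum(c(-1, 35) * c(36 / 37, 1 / 37)) # -1/37, about -0.027
```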
Since the expected value is a sum, it has some of the same properties as sums.
In particular, the associative and distributive rules apply, which means that:
𝐸(𝑎 + 𝑏𝑥) = 𝑎 + 𝑏𝐸(𝑥)
That is, we can take the expected value “inside” any linear function. This will
turn out to be a very handy property.
Example 4.11. The expected value of a linear function in roulette
Earlier, we showed that $w_{red}$ is a linear function of 𝑟:
$$w_{red} = 2r - 1$$
so its expected value is:
$$E(w_{red}) = E(2r - 1) = 2\underbrace{E(r)}_{18/37} - 1 = -1/37 \approx -0.027$$
We can verify this calculation is correct by deriving the expected value directly from the PDF:
$$E(w_{red}) = -1 \times \underbrace{f_{red}(-1)}_{19/37} + 1 \times \underbrace{f_{red}(1)}_{18/37} \approx -0.027$$
That is, each dollar bet on red leads to an average loss of 2.7 cents for the bettor,
as does each dollar bet on 14.
The expected value is one way of describing something about a random variable,
but there are many others. We will describe a few of the most important ones.
4.3.1 Range
The range of a random variable is the interval from its lowest possible value $\min(S_x)$ to its highest possible value $\max(S_x)$.
Let 𝑞 be any number between zero and one. Then the 𝑞 quantile of a random variable 𝑥 is defined as:
$$F_x^{-1}(q) = \min\{a : F_x(a) \geq q\}$$
where $F_x(\cdot)$ is the CDF of 𝑥. The quantile function $F_x^{-1}(\cdot)$ is also called the inverse CDF.
The 𝑞 quantile of a distribution is also called the 100𝑞 percentile; for example
the 0.25 quantile of 𝑥 is also called the 25th percentile of 𝑥.
$$F_{red}(a) = \begin{cases} 0 & a < -1 \\ 0.514 & -1 \leq a < 1 \\ 1 & a \geq 1 \end{cases}$$
[Figure: the CDF $F_{red}(a)$ with quantiles marked: $q_{red}(0.25) = -1$, $q_{red}(0.50) = -1$, and $q_{red}(0.75) = 1$.]
We can use this graph to find any quantile. For example, the 0.25 quantile (25th percentile) is:
$$F_{red}^{-1}(0.25) = \min\{a : \Pr(w_{red} \leq a) \geq 0.25\} = \min\{-1, 1\} = -1$$
By the same method, we can find that the 0.5 quantile (50th percentile) is:
$$F_{red}^{-1}(0.5) = \min\{a : \Pr(w_{red} \leq a) \geq 0.5\} = \min\{-1, 1\} = -1$$
and the 0.75 quantile (75th percentile) is:
$$F_{red}^{-1}(0.75) = \min\{a : \Pr(w_{red} \leq a) \geq 0.75\} = \min\{1\} = 1$$
The formula for the quantile function may look intimidating, but it can be
constructed by just “flipping” the axes of the CDF.
The quantile function for $w_{red}$ can be constructed by inverting $F_{red}(\cdot)$:
[Figure: the quantile function of 𝑤𝑟𝑒𝑑, obtained by flipping the axes of its CDF]
4.3.3 Median
The median of a random variable is its 0.5 quantile or 50th percentile. It can
be interpreted roughly as the “middle” of the distribution.
4.3.4 Variance
The median and expected value both aim to describe a typical or central value
of the random variable. We are also interested in measures of how much the
random variable varies. We have already seen one - the range - but there are
others, including the variance and standard deviation.
The variance of a random variable 𝑥 is defined as:

𝑣𝑎𝑟(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²)

For example, the variance of 𝑟 is:

𝑣𝑎𝑟(𝑟) = (0 − 𝐸(𝑟))² ∗ 𝑓𝑟(0) + (1 − 𝐸(𝑟))² ∗ 𝑓𝑟(1)
       = (0 − 18/37)² ∗ (19/37) + (1 − 18/37)² ∗ (18/37)
       ≈ 0.25   (4.29)
𝑣𝑎𝑟(𝑤𝑟𝑒𝑑) = (−1 − 𝐸(𝑤𝑟𝑒𝑑))² ∗ (19/37) + (1 − 𝐸(𝑤𝑟𝑒𝑑))² ∗ (18/37)   (4.30)
          ≈ 1.0   (4.31)

𝑣𝑎𝑟(𝑤14) = (−1 − 𝐸(𝑤14))² ∗ (36/37) + (35 − 𝐸(𝑤14))² ∗ (1/37)   (4.32)
         ≈ 34.1   (4.33)

where 𝐸(𝑤𝑟𝑒𝑑) = 𝐸(𝑤14) ≈ −0.027.
That is, a bet on 14 has the same expected payout as a bet on red, but its
payout is much more variable.
The key to understanding the variance is that it is the expected value of a square
(𝑥 − 𝐸(𝑥))2 , and the expected value is just a (weighted) sum.
The first implication of this is that the variance is always positive (or more
precisely, non-negative):

𝑣𝑎𝑟(𝑥) ≥ 0

The intuition is straightforward. All squares are non-negative, and the expected
value is just a weighted sum; if you add up several non-negative numbers, you
will get a non-negative number.
The second implication is that:

𝑣𝑎𝑟(𝑥) = 𝐸(𝑥²) − (𝐸(𝑥))²

This alternative formula is often easier to use. For example:

𝐸(𝑤14²) = (−1)² 𝑓14(−1) + 35² 𝑓14(35)   (4.39)
        = 1 ∗ (36/37) + 1225 ∗ (1/37)   (4.40)
        ≈ 34.08   (4.41)

𝑣𝑎𝑟(𝑤14) = 𝐸(𝑤14²) − 𝐸(𝑤14)²   (4.42)
         ≈ 34.08 − (−0.027)²   (4.43)
         ≈ 34.1   (4.44)
This matches the value we calculated earlier from the definition of the variance.
We can also find the variance of any linear function of a random variable. For
any constants 𝑎 and 𝑏:
𝑣𝑎𝑟(𝑎 + 𝑏𝑥) = 𝑏2 𝑣𝑎𝑟(𝑥)
I do not expect you to remember how to derive these results, but I want you to
know them and use them.
4.3.5 Standard deviation
The standard deviation of a random variable is the square root of its variance:

𝜎𝑥 = 𝑠𝑑(𝑥) = √𝑣𝑎𝑟(𝑥)

It has two properties worth knowing:
1. It is always non-negative:
𝑠𝑑(𝑥) ≥ 0
2. For any constants 𝑎 and 𝑏:

𝑠𝑑(𝑎 + 𝑏𝑥) = |𝑏| 𝑠𝑑(𝑥)
These properties follow directly from the corresponding properties of the vari-
ance.
4.4 Standard discrete distributions
Some probability distributions are used so often that statisticians have given
them names. This provides a quick way to describe a particular distribution
without writing out its full PDF, using the notation:

𝑅𝑎𝑛𝑑𝑜𝑚𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒 ∼ 𝐷𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛𝑁𝑎𝑚𝑒(𝑃𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠)

where 𝑅𝑎𝑛𝑑𝑜𝑚𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒 is the name of the random variable whose distribution is
being described, the ∼ character can be read as “has the following probability
distribution”, 𝐷𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛𝑁𝑎𝑚𝑒 is the name of the probability distribution,
and 𝑃𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 is a list of numbers called parameters that provide additional
information about the probability distribution.
Using a standard distribution also allows us to establish the properties of a
commonly-used distribution once, and use those results every time we use that
distribution.
4.4.1 Bernoulli
The Bernoulli probability distribution is usually written:
𝑥 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝)
It has discrete support 𝑆𝑥 = {0, 1} and PDF:
          ⎧ 1 − 𝑝   𝑎 = 0
𝑓𝑥(𝑎) =   ⎨ 𝑝       𝑎 = 1
          ⎩ 0       anything else
Note that the “Bernoulli distribution” isn’t really a (single) probability distri-
bution. Instead it is what we call a parametric family of distributions. That
is, the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) is a different distribution with a different PDF for each
value of the parameter 𝑝.
We typically use Bernoulli random variables to model the probability of some
random event 𝐴. If we define 𝑥 as the indicator variable 𝑥 = 𝐼(𝐴), then 𝑥 ∼
𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) where 𝑝 = Pr(𝐴).
Example 4.21. The Bernoulli distribution in roulette
The variable 𝑟 = 𝐼(𝑅𝑒𝑑) has the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(18/37) distribution.
4.4.2 Binomial
The binomial probability distribution is usually written:

𝑥 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝)

It has discrete support 𝑆𝑥 = {0, 1, … , 𝑛} and PDF:

𝑓𝑥(𝑎) =   ⎧ (𝑛!/(𝑎!(𝑛 − 𝑎)!)) 𝑝ᵃ (1 − 𝑝)ⁿ⁻ᵃ   𝑎 ∈ 𝑆𝑥
          ⎩ 0                                 anything else
You do not need to memorize or even understand this formula. The Excel
function BINOMDIST() can be used to calculate the PDF or CDF of the bino-
mial distribution, and the function BINOM.INV() can be used to calculate its
quantiles.
The binomial distribution is typically used to model frequencies or counts. We
can show that it is the distribution of how many times a probability-𝑝 event
happens in 𝑛 independent attempts.
For example, the basketball player Stephen Curry makes about 43% of his 3-
point shot attempts. If each shot is independent of the others, then the number
of shots he makes in 10 attempts will have the 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(10, 0.43) distribution.
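If you are curious, R (introduced in Chapter 9) has equivalent built-in functions. The shot counts below are illustrative values of mine, not from the original text:

    dbinom(4, size = 10, prob = 0.43)    # PDF: Pr(exactly 4 makes in 10 attempts)
    pbinom(4, size = 10, prob = 0.43)    # CDF: Pr(4 or fewer makes)
    qbinom(0.5, size = 10, prob = 0.43)  # inverse CDF: the median number of makes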
Its mean and variance are:

𝐸(𝑥) = 𝑛𝑝
𝑣𝑎𝑟(𝑥) = 𝑛𝑝(1 − 𝑝)
The formula for the binomial PDF looks strange, but it can actually be derived
from a fairly simple and common situation. Let (𝑏1 , 𝑏2 , … , 𝑏𝑛 ) be a sequence of
𝑛 independent random variables from the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) distribution and let:
𝑛
𝑥 = ∑ 𝑏𝑖
𝑖=1
count up the number of times that 𝑏𝑖 is equal to one (i.e., the number of times
the event modeled by 𝑏𝑖 happened). Then it is possible to derive the PDF of 𝑥,
and that is the PDF we call 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝). The derivation is not easy, but the
intuition is simple:
Any particular sequence of outcomes with 𝑎 ones and 𝑛 − 𝑎 zeros has probability
𝑝ᵃ(1 − 𝑝)ⁿ⁻ᵃ by independence, and there are 𝑛!/(𝑎!(𝑛 − 𝑎)!) such sequences.
Therefore the probability of the event 𝑥 = 𝑎 is (𝑛!/(𝑎!(𝑛 − 𝑎)!)) 𝑝ᵃ(1 − 𝑝)ⁿ⁻ᵃ.
4.4.3 Discrete uniform
The discrete uniform probability distribution is usually written:

𝑥 ∼ 𝐷𝑖𝑠𝑐𝑟𝑒𝑡𝑒𝑈𝑛𝑖𝑓𝑜𝑟𝑚(𝑆𝑥)

It puts equal probability on each value in its support 𝑆𝑥, so its PDF is:

𝑓𝑥(𝑎) =   ⎧ 1/|𝑆𝑥|   𝑎 ∈ 𝑆𝑥
          ⎩ 0        𝑎 ∉ 𝑆𝑥

where |𝑆𝑥| is the number of elements in 𝑆𝑥.
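For example, a fair six-sided die has the 𝐷𝑖𝑠𝑐𝑟𝑒𝑡𝑒𝑈𝑛𝑖𝑓𝑜𝑟𝑚({1, 2, … , 6}) distribution. A quick R sketch (mine, not the book's) simulates it directly:

    # A fair die: each value in S_x = {1,...,6} has probability 1/|S_x| = 1/6
    S_x <- 1:6
    sample(S_x, size = 10, replace = TRUE)   # ten simulated rolls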
Chapter review
In this chapter we have learned various ways of describing the probability dis-
tribution of a simple random variable - a single random variable that takes on
values in a finite set. We have also learned several standard probability distribu-
tions for simple random variables.
In the next chapter, we will deal with more complex random variables including
random variables that take on values in a continuous set, or random variables
that are related to other random variables. We will then use the concept of a
random variable to understand both data and statistics calculated from data.
Practice problems
Answers can be found in the appendix.
The questions below continue our craps example. To review that example, we
have an outcome (𝑟, 𝑤) where 𝑟 and 𝑤 are the numbers rolled on a pair of fair
six-sided dice.
Let the random variable 𝑡 be the total showing on the pair of dice, and let the
random variable 𝑦 = 𝐼(𝑡 = 11) be an indicator for whether a bet on “Yo” wins.
SKILL #1: Define a random variable in terms of its underlying out-
come
6. Using the PDFs you found earlier, find the following CDFs:
a. Find the CDF 𝐹𝑟 for the random variable 𝑟.
b. Find the CDF 𝐹𝑦 for the random variable 𝑦.
SKILL #6: Find the expected value from the (discrete) PDF
7. Using the PDFs you found earlier, find the following expected values:
a. Find the expected value 𝐸(𝑟).
b. Find the expected value 𝐸(𝑟2 ).
8. Using the CDFs you found earlier, find the following quantiles:
a. Find the median 𝑀 𝑒𝑑(𝑟).
b. Find the 0.25 quantile 𝐹𝑟−1 (0.25).
c. Find the 75th percentile of 𝑟.
SKILL #8: Calculate variance and standard deviation from the (dis-
crete) PDF
9. Let 𝑑 = (𝑦 − 𝐸(𝑦))2
a. Find the PDF 𝑓𝑑 of 𝑑.
b. Use this PDF to find 𝐸(𝑑).
c. Use the results above to find the variance 𝑣𝑎𝑟(𝑦).
d. Use the results above to find the standard deviation 𝑠𝑑(𝑦).
10. In question (7) above, you calculated 𝐸(𝑟) and 𝐸(𝑟2 ) from the PDF.
a. Use these results to find 𝑣𝑎𝑟(𝑟).
b. Use these results to find 𝑠𝑑(𝑟).
SKILL #10: Identify and use random variables from standard dis-
crete distributions
13. The “Yo” bet pays out at 15:1, meaning you win $15 for each dollar
bet. Suppose you bet $10 on Yo. Your net winnings in that case will be
𝑊 = 160 ∗ 𝑦 − 10.
a. Using earlier results, find 𝐸(𝑊 ).
b. Using earlier results, find 𝑣𝑎𝑟(𝑊 ).
c. The event 𝑊 > 0 (your net winnings are positive) is identical to the
event 𝑦 = 1. Using earlier results, find Pr(𝑊 > 0).
14. Suppose you bet $1 on Yo in ten independent rolls. Let 𝑌10 be the number
of those ten bets that you win; your net winnings in that case will be
𝑊10 = 16 ∗ 𝑌10 − 10.
a. Using earlier results, find 𝐸(𝑊10 ).
b. Using earlier results, find 𝑣𝑎𝑟(𝑊10 ).
c. The event 𝑊10 > 0 (your net winnings are positive) is identical to
the event 𝑌10 > 10/16. Using earlier results, find Pr(𝑊10 > 0).
15. If you have $10 and care mostly about expected net winnings, which would
be your preferred betting strategy?
• Bet Yo $10 on a single roll.
• Bet Yo $1 on each of 10 rolls.
• Keep your $10 and not bet at all.
16. Which of the following two betting strategies produces the highest proba-
bility of walking away from the table with more money than you started
with?
• Bet Yo $10 on a single roll.
• Bet Yo $1 on each of 10 rolls.
17. Which of the following two betting strategies has more variable net win-
nings?
• Bet Yo $10 on a single roll.
• Bet Yo $1 on each of 10 rolls.
Chapter 5

More on random variables

The previous chapter developed the basic terminology and analytical tools for
a single discrete random variable. This chapter will extend those tools to work
with continuous random variables, and with multiple random variables.
Goals
Chapter goals
In this chapter we will:
5.1 Continuous random variables

A continuous random variable takes on values in a continuous set, and its
probability distribution has some features that may seem strange at first:
• The probability that 𝑥 takes on any specific value is zero:

Pr(𝑥 = 𝑎) = 0

for all 𝑎 ∈ 𝑆𝑥.
• The probability that 𝑥 is somewhere in the support is:

Pr(𝑥 ∈ 𝑆𝑥) = 1
This feature applies to ranges as well. For example, the Canadian labour force
participation rate has a high probability of being between 60% and 70%, but
zero probability of being exactly 65%.
The math for working with continuous random variables is a little different from
the math for working with discrete random variables, and is harder because it
requires calculus. In many cases, it requires integral calculus (MATH 152 or
MATH 158) which is not a prerequisite to this course, and which I do not
expect you to know.
But deep down, there are no really important differences between continuous
and discrete random variables. The intuition for why is straightforward: you
can make a continuous random variable into a discrete random variable by just
rounding it. For example, suppose you round the labour force participation rate
to the nearest percentage point. Then it becomes a discrete random variable,
with support:
𝑆𝑥 = {0%, 1%, … , 99%, 100%}
The same point applies if you round to the nearest 1/100th of a percentage
point, or the nearest 1/1,000,000th of a percentage point.
Since discrete and continuous random variables are more alike than it first
appears, most of the results and intuition we have already developed for discrete
random variables can also be applied to continuous random variables. So:
Any time you see an integral here, you can ignore it.
The CDF of a continuous random variable 𝑥 is defined exactly the same way as
for the discrete case:
𝐹𝑥 (𝑎) = Pr(𝑥 ≤ 𝑎)
The only difference is how it looks.
If you recall, the CDF of a discrete random variable takes on a stair-step form:
increasing in discrete jumps at every point in the discrete support, and flat
everywhere else.
In contrast, the CDF of a continuous random variable increases continuously.
It can have flat parts, but never jumps.
For example, a random variable 𝑥 with the standard uniform distribution has
CDF:

                      ⎧ 0   𝑎 < 0
𝐹𝑥(𝑎) = Pr(𝑥 ≤ 𝑎) =   ⎨ 𝑎   𝑎 ∈ [0, 1]
                      ⎩ 1   𝑎 > 1
Figure 5.1 below shows the CDF of the standard uniform distribution.
As you can see, the CDF is smoothly increasing between zero and one, and flat
everywhere else.
The CDF of a continuous random variable obeys all of the properties described
in section 4.1.4:

𝐹𝑥(𝑎) ≤ 𝐹𝑥(𝑏) whenever 𝑎 ≤ 𝑏

0 ≤ 𝐹𝑥(𝑎) ≤ 1

lim(𝑎→−∞) 𝐹𝑥(𝑎) = Pr(𝑥 ≤ −∞) = 0

lim(𝑎→∞) 𝐹𝑥(𝑎) = Pr(𝑥 < ∞) = 1
[Figure 5.1: the CDF 𝐹𝑥(𝑎) of the standard uniform distribution]
In addition, the result on intervals applies to both strict and weak inequalities:
𝐹 (𝑏) − 𝐹 (𝑎) = Pr(𝑎 < 𝑥 ≤ 𝑏) (5.2)
= Pr(𝑎 < 𝑥 < 𝑏) (5.3)
= Pr(𝑎 ≤ 𝑥 ≤ 𝑏) (5.4)
= Pr(𝑎 ≤ 𝑥 < 𝑏) (5.5)
since a continuous random variable has probability zero of taking on any specific
value.
The PDF of a continuous random variable is defined as the derivative of its
CDF:

𝑓𝑥(𝑎) = 𝑑𝐹𝑥(𝑎)/𝑑𝑎

In other words, instead of the amount the CDF increases (jumps) at 𝑎, it is the
rate at which it increases.
For example, the PDF of the standard uniform distribution is:

          ⎧ 0   𝑎 < 0
𝑓𝑥(𝑎) =   ⎨ 1   𝑎 ∈ [0, 1]
          ⎩ 0   𝑎 > 1
[Figure: the PDF 𝑓𝑥(𝑎) of the standard uniform distribution]
The PDF of a continuous random variable is a good way to visualize its prob-
ability distribution, and this is about the only way we will use the continuous
PDF in this class (since everything else requires integration).
I have defined the PDF in terms of the CDF, but it is also possible to derive the
CDF from the PDF. This requires integral calculus, so I will give the definition
below but not expect you to use it.
FYI
Deriving the CDF from the PDF of a continuous random vari-
able
The formula for deriving the CDF of a continuous random variable from
its PDF is:

𝐹𝑥(𝑎) = ∫₋∞ᵃ 𝑓𝑥(𝑣) 𝑑𝑣
Unless you have taken MATH 152 or MATH 158, you may have no idea
what this is or how to solve it. That’s OK! All you need to know for this
course is that it can be solved.
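If you want to see this in action without doing any calculus, R can evaluate the integral numerically. The check below is a sketch of mine, using the standard uniform distribution and the arbitrary point 𝑎 = 0.7:

    # F_x(0.7) as the integral of the PDF from -infinity to 0.7
    integrate(dunif, lower = -Inf, upper = 0.7)$value   # about 0.7
    punif(0.7)                                          # the CDF directly: 0.7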
The continuous PDF has many properties that are similar but not identical to
the properties of the discrete PDF.
FYI
Properties of the continuous PDF
Like the discrete PDF, the continuous PDF is non-negative for all values:

𝑓𝑥(𝑎) ≥ 0

and it is strictly positive on the support:

𝑎 ∈ 𝑆𝑥 ⟹ 𝑓𝑥(𝑎) > 0
Quantiles and percentiles, including the range and median, have the same defi-
nition and interpretation whether the random variable is continuous or discrete.
The definition for the expected value of a continuous random variable is different
and uses integral calculus.
FYI
The expected value for a continuous random variable
When 𝑥 is continuous, its expected value is defined as:
𝐸(𝑥) = ∫₋∞^∞ 𝑎 𝑓𝑥(𝑎) 𝑑𝑎
Notice that this looks just like the definition for the discrete case, but
with the sum replaced by an integral sign. If you know much about
integral calculus, you may recall that an integral is a sum (or more
precisely, the limit of a sum). This is why the same properties we earlier
found for the expected value of a discrete random variable also apply
to continuous random variables.
There is even a general definition that covers both discrete and contin-
uous variables, as well as any mix between them:
𝐸(𝑥) = ∫₋∞^∞ 𝑎 𝑑𝐹𝑥(𝑎)
This expression uses notation that is not typically taught in a first course
in integral calculus, so even if you have taken MATH 152 or MATH 158
you may not know how to interpret it. Again, I am only showing you so
that you know the formula exists, I am not asking you to remember or
use it.
More importantly, the expected value has the same interpretation as it does for
a discrete random variable, and it has all of the properties described earlier as
well.
The variance and standard deviation are both defined as expected values, so
they also have the same interpretation and properties for a continuous random
variable as they do for a discrete random variable.
5.2 The uniform distribution

The (continuous) uniform distribution is usually written:

𝑥 ∼ 𝑈(𝐿, 𝐻)

where 𝐿 < 𝐻. It has support 𝑆𝑥 = [𝐿, 𝐻] and PDF:

𝑓𝑥(𝑎) =   ⎧ 1/(𝐻 − 𝐿)   𝑎 ∈ 𝑆𝑥
          ⎩ 0           otherwise
For example, if 𝑥 ∼ 𝑈 (2, 5) its support is the range of all values from 2 to 5,
and its PDF looks like this:
[Figure: the PDF of the 𝑈(2, 5) distribution]
The uniform distribution puts equal probability on all values between 𝐿 and
𝐻. We have already seen the standard uniform distribution, which is just
the 𝑈 (0, 1) distribution.
Its CDF is:

          ⎧ 0                 𝑎 ≤ 𝐿
𝐹𝑥(𝑎) =   ⎨ (𝑎 − 𝐿)/(𝐻 − 𝐿)   𝐿 < 𝑎 < 𝐻
          ⎩ 1                 𝑎 ≥ 𝐻
[Figure: the CDF of the 𝑈(2, 5) distribution]
Its mean and variance are:

𝐸(𝑥) = (𝐿 + 𝐻)/2

𝑣𝑎𝑟(𝑥) = (𝐻 − 𝐿)²/12

and its standard deviation is just the square root of the variance:

𝑠𝑑(𝑥) = √((𝐻 − 𝐿)²/12) = (𝐻 − 𝐿)/√12

as always.
FYI
Uniform distributions in video games
Uniform distributions are important in many computer applications in-
cluding video games.
It is easy for a computer to generate a random number from the 𝑈(0, 1)
distribution, and a 𝑈(0, 1) has the unusual feature that its 𝑞 quantile is
equal to 𝑞.
As a result, you can generate a random variable with any probability
distribution you like by following these steps:
1. Generate a random number 𝑢 from the 𝑈(0, 1) distribution.
2. Apply the desired distribution’s quantile function: 𝑥 = 𝐹𝑥⁻¹(𝑢) then
has the CDF 𝐹𝑥(⋅).
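Here is a minimal R sketch of these two steps; the exponential target distribution is my arbitrary choice for illustration:

    set.seed(1)
    u <- runif(5)   # step 1: draws from U(0,1)
    x <- qexp(u)    # step 2: push them through a quantile (inverse CDF) function
    x               # these five numbers behave like draws from an exponential(1)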
5.3 The normal distribution

The normal distribution is usually written:

𝑥 ∼ 𝑁(𝜇, 𝜎²)

where 𝜇 and 𝜎² are parameters. Its PDF is bell-shaped and symmetric around
𝜇. For example, the PDF for the 𝑁(0, 1) distribution looks like this:
[Figure: the PDF of the 𝑁(0, 1) distribution]
For other values of 𝜇, the 𝑁 (𝜇, 𝜎2 ) distribution is also symmetric around 𝜇 and
bell-shaped, with the “spread” of the distribution depending on the value of 𝜎2 :
[Figure: the PDFs of the 𝑁(0, 1), 𝑁(1, 1), and 𝑁(0, 2) distributions]
The CDF of the normal distribution can be derived by integrating the PDF.
There is no simple closed-form expression for this CDF, but it is easy to calculate
with a computer.
[Figure: the CDFs of the 𝑁(0, 1), 𝑁(1, 1), and 𝑁(0, 2) distributions]
The Excel function NORM.DIST() can be used to calculate the PDF or CDF of
any normal distribution.
The Excel function NORM.INV() can be used to calculate the quantile (inverse
CDF) function of any normal distribution.
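The equivalent functions in R (introduced in Chapter 9) are dnorm(), pnorm(), and qnorm(). The values below are illustrative choices of mine:

    dnorm(1.96, mean = 0, sd = 1)    # PDF of the N(0,1) at 1.96
    pnorm(1.96, mean = 0, sd = 1)    # CDF: Pr(x <= 1.96), about 0.975
    qnorm(0.975, mean = 0, sd = 1)   # inverse CDF (quantile): about 1.96

Note that R parameterizes the normal by its standard deviation 𝜎, not its variance 𝜎².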
Since the 𝑁 (𝜇, 𝜎2 ) distribution is symmetric around 𝜇, its median is also 𝜇.
The mean and variance of a 𝑁 (𝜇, 𝜎2 ) random variable can be found by integra-
tion:
𝐸(𝑥) = 𝜇
𝑣𝑎𝑟(𝑥) = 𝜎2
and the standard deviation is just the square root of the variance:
𝑠𝑑(𝑥) = 𝜎
Any linear function of a normal random variable is also normal. That is, if
𝑥 ∼ 𝑁(𝜇, 𝜎²) then for any constants 𝑎 and 𝑏:

𝑎 + 𝑏𝑥 ∼ 𝑁(𝑎 + 𝑏𝜇, 𝑏²𝜎²)
FYI
The central limit theorem
A very important result called the Central Limit Theorem tells us that
many “real world” random variables have a probability distribution that
is well-approximated by the normal distribution.
We will discuss the central limit theorem in much more detail later.
5.4 Multiple random variables

Let 𝑥 = 𝑥(𝑏) and 𝑦 = 𝑦(𝑏) be two random variables defined in terms of the same
underlying outcome 𝑏.
Their joint probability distribution assigns a probability to every event that
can be defined in terms of 𝑥 and 𝑦, for example Pr(𝑥 = 6 ∩ 𝑦 = 0) or Pr(𝑥 < 𝑦).
This joint distribution can be fully described by the joint CDF:

𝐹𝑥,𝑦(𝑎, 𝑏) = Pr(𝑥 ≤ 𝑎 ∩ 𝑦 ≤ 𝑏)

The joint distribution tells you two things about these variables:
1. The probability distribution of each individual variable (its marginal
distribution).
2. The relationship between the two variables.
Note that while you can always derive the marginal distributions from the joint
distribution, you cannot go the other way around unless you know everything
about the relationship between the two variables.
Example 5.5. Three joint distributions with identical marginal distri-
butions
The scatter plots in the figure below depict simulation results for a pair of
random variables (𝑥, 𝑦), with a different joint distribution in each graph. In all
three graphs, 𝑥 and 𝑦 have the same marginal distribution (standard normal).
The differences between the graphs are in the relationship between 𝑥 and 𝑦.
• In the first graph, 𝑥 and 𝑦 are unrelated, so the data looks like a
“cloud” of random dots.
[Figure: three scatter plots of (𝑥, 𝑦) pairs with identical standard normal
marginal distributions but different joint distributions]
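A simulation along these lines is easy to write in R. This sketch is mine; the construction of the related pair is one standard way to build a correlated pair that keeps standard normal marginals:

    set.seed(42)
    n <- 1000
    x  <- rnorm(n)
    y1 <- rnorm(n)                              # unrelated to x
    rho <- 0.8
    y2 <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # related to x, but still N(0,1) marginally
    plot(x, y1)   # a shapeless cloud
    plot(x, y2)   # an upward-sloping cloud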
Pr(𝑦 ∈ 𝐴|𝑥 ∈ 𝐵) = Pr(𝑦 ∈ 𝐴 ∩ 𝑥 ∈ 𝐵) / Pr(𝑥 ∈ 𝐵)
Since a conditional probability is just the ratio of the joint probability to the
marginal probability, the conditional distribution can always be derived from
the joint distribution.
We can describe a conditional distribution with either the conditional CDF:

𝐹𝑦|𝑥(𝑎|𝑏) = Pr(𝑦 ≤ 𝑎 | 𝑥 = 𝑏)

or the conditional PDF:

𝑓𝑦|𝑥(𝑎|𝑏) = Pr(𝑦 = 𝑎 | 𝑥 = 𝑏)   (in the discrete case)
5.4.5 Covariance
The covariance of two random variables 𝑥 and 𝑦 is defined as:

𝜎𝑥𝑦 = 𝑐𝑜𝑣(𝑥, 𝑦) = 𝐸((𝑥 − 𝐸(𝑥))(𝑦 − 𝐸(𝑦)))

For example, the covariance of 𝑤𝑟𝑒𝑑 and 𝑤14 is:

𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14) = (1 − 𝐸(𝑤𝑟𝑒𝑑))(35 − 𝐸(𝑤14)) 𝑓𝑟𝑒𝑑,14(1, 35)
  + (1 − 𝐸(𝑤𝑟𝑒𝑑))(−1 − 𝐸(𝑤14)) 𝑓𝑟𝑒𝑑,14(1, −1)
  + (−1 − 𝐸(𝑤𝑟𝑒𝑑))(−1 − 𝐸(𝑤14)) 𝑓𝑟𝑒𝑑,14(−1, −1)
  ≈ 0.999   (5.30)

where 𝐸(𝑤𝑟𝑒𝑑) = 𝐸(𝑤14) ≈ −0.027 and the joint PDF values are
𝑓𝑟𝑒𝑑,14(1, 35) = 1/37, 𝑓𝑟𝑒𝑑,14(1, −1) = 17/37, and 𝑓𝑟𝑒𝑑,14(−1, −1) = 19/37.
That is, the returns from a bet on red and a bet on 14 are positively related.
As with the variance, we can derive an alternative formula for the covariance:
𝑐𝑜𝑣(𝑥, 𝑦) = 𝐸(𝑥𝑦) − 𝐸(𝑥)𝐸(𝑦)
The derivation of this result is as follows:
𝑐𝑜𝑣(𝑥, 𝑦) = 𝐸((𝑥 − 𝐸(𝑥))(𝑦 − 𝐸(𝑦)))   (5.31)
          = 𝐸(𝑥𝑦 − 𝑦𝐸(𝑥) − 𝑥𝐸(𝑦) + 𝐸(𝑥)𝐸(𝑦))   (5.32)
          = 𝐸(𝑥𝑦) − 𝐸(𝑦)𝐸(𝑥) − 𝐸(𝑥)𝐸(𝑦) + 𝐸(𝑥)𝐸(𝑦)   (5.33)
          = 𝐸(𝑥𝑦) − 𝐸(𝑥)𝐸(𝑦)   (5.34)
Again, this formula is often easier to calculate than using the original definition.
Example 5.9. Another way to calculate the covariance
The expected value of 𝑤𝑟𝑒𝑑 𝑤14 is:

𝐸(𝑤𝑟𝑒𝑑 𝑤14) = 1 ∗ 35 ∗ 𝑓𝑟𝑒𝑑,14(1, 35) + 1 ∗ (−1) ∗ 𝑓𝑟𝑒𝑑,14(1, −1)
              + (−1) ∗ (−1) ∗ 𝑓𝑟𝑒𝑑,14(−1, −1)   (5.35)
            = 35/37 − 17/37 + 19/37
            = 1

so the covariance is:

𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14) = 𝐸(𝑤𝑟𝑒𝑑 𝑤14) − 𝐸(𝑤𝑟𝑒𝑑)𝐸(𝑤14) = 1 − (−1/37)(−1/37) ≈ 0.999
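As a check, the alternative formula is easy to evaluate in R (a sketch of mine, using the exact joint PDF from the example):

    # cov(w_red, w_14) = E(w_red * w_14) - E(w_red) * E(w_14)
    f    <- c(1/37, 17/37, 19/37)   # Pr of (red and 14), (red, not 14), (not red)
    wred <- c(1, 1, -1)
    w14  <- c(35, -1, -1)
    Exy  <- sum(wred * w14 * f)               # E(w_red * w_14) = 1
    Exy - sum(wred * f) * sum(w14 * f)        # covariance, about 0.999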
I do not expect you to remember all of these formulas, but be prepared to see
me use them.
5.4.6 Correlation
The correlation of two random variables 𝑥 and 𝑦 is defined as:

𝜌𝑥𝑦 = 𝑐𝑜𝑟𝑟(𝑥, 𝑦) = 𝑐𝑜𝑣(𝑥, 𝑦)/√(𝑣𝑎𝑟(𝑥) 𝑣𝑎𝑟(𝑦)) = 𝜎𝑥𝑦/(𝜎𝑥 𝜎𝑦)
Like the covariance, the correlation describes the strength of a (linear) relation-
ship between 𝑥 and 𝑦. But it is re-scaled in a way that makes it more convenient
for some purposes.
Example 5.10. Correlation in roulette
The correlation of 𝑤𝑟𝑒𝑑 and 𝑤14 is:
𝑐𝑜𝑟𝑟(𝑤𝑟𝑒𝑑, 𝑤14) = 𝑐𝑜𝑣(𝑤𝑟𝑒𝑑, 𝑤14)/√(𝑣𝑎𝑟(𝑤𝑟𝑒𝑑) ∗ 𝑣𝑎𝑟(𝑤14))   (5.46)
                ≈ 0.999/√(1.0 ∗ 34.1)   (5.47)
                ≈ 0.17   (5.48)
The covariance and correlation always have the same sign since standard devia-
tions are always¹ positive. The key difference between them is that correlation
is scale-invariant. That is, for any constants 𝑎, 𝑐 and any positive constants 𝑏, 𝑑:

𝑐𝑜𝑟𝑟(𝑎 + 𝑏𝑥, 𝑐 + 𝑑𝑦) = 𝑐𝑜𝑟𝑟(𝑥, 𝑦)

¹ More precisely, either or both of 𝜎𝑥 and 𝜎𝑦 could be zero. In that case the covariance will
also be zero, and the correlation will be undefined (zero divided by zero).
When 𝑐𝑜𝑟𝑟(𝑥, 𝑦) ∈ {−1, 1}, 𝑦 is an exact linear function of 𝑥. That is, we can
write it:

𝑦 = 𝑎 + 𝑏𝑥

where:

𝑐𝑜𝑟𝑟(𝑥, 𝑦) =   ⎧ 1    if 𝑏 > 0
               ⎩ −1   if 𝑏 < 0   (5.49)
5.4.7 Independence
Two random variables 𝑥 and 𝑦 are independent if every event defined in terms
of 𝑥 is independent of every event defined in terms of 𝑦. When random variables
are independent, their covariance and correlation are both exactly zero. However,
it does not go the other way around.
Example 5.12. Zero covariance does not imply independence
The figure below shows a scatter plot from a simulation of two random variables
that are clearly related (and therefore not independent) but whose covariance
is exactly zero.
Intuitively, covariance is a measure of the linear relationship between two vari-
ables. When variables have a nonlinear relationship, as in the figure, the covari-
ance may miss it.
[Figure: a scatter plot of two dependent random variables with zero covariance
(𝑦 is a nonlinear function of 𝑥)]
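A small R simulation makes the same point (my sketch; 𝑦 = 𝑥² is one standard example of this phenomenon):

    set.seed(7)
    x <- rnorm(100000)
    y <- x^2      # y is completely determined by x, so they are not independent
    cov(x, y)     # yet the sample covariance is near zero: the relationship is nonlinear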
Chapter review
Over the course of this chapter and the previous one, we have learned the
basic terminology and tools for working with random variables. These are the
two most difficult chapters in the course, but if you work hard and develop your
understanding of random variables you will find the rest of the course somewhat
easier.
This is not a course on probability, so the next few chapters will be about data
and statistics. We will first learn to use Excel to calculate common statistics
from a cleaned data set. We will then use the tools of probability and random
variables to build a theoretical framework in which we can interpret each statistic
as a random variable and each data set as a collection of random variables. This
theory will allow us to use statistics not only as a way of describing data, but
as a way of understanding the process that produced that data.
Practice problems
Answers can be found in the appendix.
Questions 1- 7 below continue our craps example. To review that example, we
have:
• An outcome (𝑟, 𝑤) where 𝑟 and 𝑤 are the numbers rolled on a pair of fair
six-sided dice
• Several random variables defined in terms of that outcome:
– The total showing on the pair of dice: 𝑡 = 𝑟 + 𝑤
– An indicator for whether a bet on “Yo” wins: 𝑦 = 𝐼(𝑡 = 11).
10. Your net winnings if you bet $1 on Yo and $1 on Boxcars can be written
16𝑦 + 31𝑏 − 2. Find the following expected values:
a. Find 𝐸(𝑦 + 𝑏)
b. Find 𝐸(16𝑦 + 31𝑏 − 2)
11. Find the following variances and covariances:
a. Find 𝑐𝑜𝑣(16𝑦, 31𝑏)
b. Find 𝑣𝑎𝑟(𝑦 + 𝑏)
i. Find 𝑣𝑎𝑟(𝑥)
Chapter 6

Basic data analysis with Excel

In a previous chapter, we learned how to clean a simple data set. The next step
is to learn how to analyze it. In this chapter, we will use Excel to construct
univariate statistics and charts, i.e., statistics and charts that describe a single
variable. Later in the term, we will learn multivariate methods that describe
the relationship between two or more variables.
Goals
Chapter goals
In this chapter we will learn how to:
In the next chapter, we will use the tools of probability and random variables
to understand these statistics more deeply.
Our emphasis in this chapter, and in much of this course, will be on performing
exploratory data analysis. Exploratory data analysis is the first step in any
data analysis project: we use simple statistics and graphs to identify and under-
stand patterns in the data. Our knowledge of these patterns can then inform
our subsequent formal or model-based analysis.
The audience for exploratory data analysis is the analyst themselves. But we
will often be interested in presenting the patterns we have discovered to another
audience: a teacher, a boss, or a client. So we will also discuss how to effectively
present your results.
Our main data set for analysis is the worksheet Data for Analysis, which includes
the following variables:
In addition, there is a worksheet titled Raw data that contains the original
data as obtained from Statistics Canada. Source information is also in that
worksheet.
Our historical employment data set covers more than 500 months. Other data
sets are often much larger: large surveys from Statistics Canada can have hun-
dreds of variables and hundreds of thousands of observations, and companies
and governments often work with transactions-level data that includes millions
of observations.
As humans, our brains are not large enough to fully understand a large data
set without some kind of simplification or “dimension reduction”: instead of
looking at millions of numbers and trying to identify patterns from that, we
calculate and view a relatively small number of statistics based on the data.
A statistic is just a number calculated from data.
You saw many of these words - standard deviation, percentile, median, etc. -
in Chapter 4, but I need to be clear on something: even though the names are
the same, the concepts are not exactly the same.
1. The count or sample size is the number of observations with valid (nu-
meric) values for the variable.
2. The sample average is a measure of central tendency in data, and is
calculated:
𝑥̄ = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑥ᵢ
We are going to create a nice table of summary statistics for some of our vari-
ables.
The next step is to fill in the row of variable names. We could type them in,
but let’s do something more sophisticated and flexible: use a formula to pull
the variable names in from the original data set.
1. Go to cell B1.
2. Type = but don’t hit <enter> yet.
3. Select the tab for the Data for Analysis worksheet, and then select cell G1
in the Data for Analysis worksheet.
• The formula bar now says ='Data for Analysis'!G1
4. Hit <enter>.
• You should now be back in cell B1 in the Summary statistics work-
sheet.
• Cell B1 should display UnempRate (the contents of cell G1 in Data
for Analysis).
1. Use the COUNT() function to report the observation count in cell B2.
• We will want to use an absolute reference for the rows, and a relative
reference for the columns, so the formula should be =COUNT('Data
for Analysis'!G$2:G$542).
2. Use the AVERAGE() function to report the average unemployment rate in
cell B3
3. Use the STDEV.S() function to report the standard deviation of the un-
employment rate in cell B4
• There is another built-in Excel function called STDEV.P() that uses
a slightly different formula for the standard deviation:
𝑠ᵖ𝑥 = √((1/𝑛) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̄)²)
We now have a table that reports all of the major summary statistics calculated
for the unemployment rate.
We would also like to calculate summary statistics for other variables. Fortu-
nately, we have set up our table in a way that makes that easy.
Example 6.5. Summary statistics for other variables
To fill in summary statistics for other variables:
Now column C contains summary statistics for the labour force participation
rate (LFPRate), and columns D through F contain summary statistics for some
of the other variables in our data set.
If we wanted to, we could set up the table to calculate summary statistics for
all of the variables. But let’s just stop with these, and move on to making the
table look a little nicer.
Our table now has all of the information we need, but it is still kind of ugly.
Let’s make it look nice and presentable.
1. Select the whole sheet by clicking on the button in its upper left
corner.
2. Select Home > Format > AutoFit Column Width from the menu.
Not only will this make everything fit, it will automatically adjust the width as
necessary when anything changes.
3. Adjust the number display formats to look nice. Remember that the
number display format has no effect on the number itself.
• Leave the counts as they are.
• Display the unemployment rate, LFP rate and population growth
rate in percentages, rounded to one decimal place.
4. Feel free to play around with colors, fonts, etc. to get a table that you
like.
We now have a clean, presentable table that we could put into a Word or
PowerPoint document, and share with an audience.
A frequency table reports, for each value of a variable, the number (count) and
percentage of observations taking on that value.
The Excel functions COUNTIF() and COUNTIFS() can be used to construct the
counts. I will use COUNTIFS() which has a pair of arguments:
• The first argument criteria_range gives the range containing the data
we want to describe.
• The second argument criteria gives the criteria we want to match.
The kind of criteria you can use are most easily described by examples:
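For instance (these particular formulas are illustrations of mine, assuming the data sits in A2:A100):
• =COUNTIFS(A2:A100, "Liberal") counts the cells in A2:A100 whose value is exactly Liberal.
• =COUNTIFS(A2:A100, ">5") counts the cells whose value is greater than 5.
• =COUNTIFS(A2:A100, ">5", A2:A100, "<=6") counts the cells whose value is greater than 5 and less than or equal to 6.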
Note that I have included a political party (the NDP) that is not observed in
our data set. We start by setting up the table:
1. To do that we just need to divide each cell in the Count column by the
sum of all the cells in that column. So the formula in cell C2 should be
=B2/SUM(B2:B5)
2. Change relative references to absolute references as needed, so that the formula in cell C2 becomes =B2/SUM(B$2:B$5).
3. Copy the formula in cell C2 to cells C3:C5.
4. By default, the percentages are displayed as proportions. Change the
display format to percentage, with one decimal place.
We can also construct frequency tables for continuous variables or discrete vari-
ables with many possible values, but doing so is a little more complicated. We
cannot just construct a table with a row for each possible value, since there are
many possible values.
Instead, we divide the data’s range of possible or observed values into a set of
sub-ranges or bins. Then we can calculate and report the number or percentage
of observations that fall within each bin. A binned frequency table has one row
per bin, with columns for the bin boundaries (From and To), the Count, and
the Percentage; we will build one in the example below.
In constructing bins, we need to apply some good judgment, and keep in mind
a few requirements and considerations:
• We need the bins to cover the full range of the data. In particular:
– The lower bound of the lowest bin should be lower than the lowest
value in the data.
– The upper bound of the highest bin should be higher than the highest
value in the data.
– Each bin’s upper bound should be the lower bound of the next bin.
– Boundaries should be addressed in a consistent manner, so that each
observation falls into exactly one bin.
• We often want the bins to be equally sized.
– But that isn’t always the case. See the unemployment rate table in
the example below; if it used equally sized bins, most of the bins
would be empty.
• We often want the upper and lower bounds of the bins to be nice round
numbers.
• The number of bins is a matter for judgment, and depends on what kind
of patterns we are aiming to find in the data.
– Too many bins and we miss broad patterns
– Too few bins and we miss potentially interesting details.
– The solution is to explore multiple options, and see what patterns
you can find.
Again, we will use COUNTIFS() to construct the counts. But we will need to take
advantage of a feature of COUNTIFS() I have not yet mentioned: it takes mul-
tiple criteria_range and criteria arguments, allowing it to make multiple
comparisons (that’s the difference between COUNTIF() and COUNTIFS()).
Example 6.10. A binned frequency table
Let’s create the following table for the unemployment rate variable:
where Count is the number of observations for UnempRate that are greater
than that row’s From value and less than or equal to the row’s To value, and
Percentage is the count as a percentage of the total. We start by setting up the
table:
Next, we fill in the first count in cell C2. This will be a somewhat complex
formula, so we will build it in stages:
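The finished formula will be something along these lines (an illustration of mine, assuming the row's From and To values are in cells A2 and B2; the worksheet's actual layout may differ):

=COUNTIFS('Data for Analysis'!G$2:G$542, ">"&A2, 'Data for Analysis'!G$2:G$542, "<="&B2)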
Cell C2 should now display the number of months (zero) in which the Canadian
unemployment rate was between 0% and 5%. To finish up the table:
FYI
The FREQUENCY() function is another way to create a frequency table.
However, this function is a tricky one to learn, and Microsoft recently
changed how it works. So we will skip it.
Excel calls graphs charts. Excel charts have three main components:
1. The data source. This is the table containing the data used to construct
the graph.
2. The chart type. This determines the basic form of the graph (line, bar,
pie, etc.).
3. The chart elements. These are the individual parts of the chart - titles,
axes, legends, gridlines, and so on - that can be added, removed, or modified.
The usual workflow here is to select the data, choose a chart type, and then
modify chart elements until the chart looks the way you want it.
A time series graph plots one or more variables at multiple points in time.
Conventionally, time is on the horizontal axis and the variable’s value is on the
vertical axis. For example, we will generate a time series graph that looks like
this:
In Excel, time series graphs can be implemented using either the Line chart
type or the Scatter chart type. Line graphs are simpler, so we will start with
that. We will learn how to make scatter plots in Chapter 11.
The first step in creating an Excel chart is to select the data source and graph
type.
We have our time series graph (it looks like the picture above). Unfortunately,
it has a line for every variable in our data source. We only want to plot one
variable, so let’s get rid of the others:
1. Select Design > Select Data which will open the Select Data Source
dialog box.
2. Uncheck the check box next to every series in the “Legend Entries (Series)”
box except UnempRate:
1. Select the chart. This will cause the Chart Design and Format menu
items to appear in the menu.
2. To remove the horizontal gridlines, select Chart Design > Add Chart
Element > Gridlines > Primary Major Horizontal
3. To remove the legend, select Chart Design > Add Chart Element >
Legend > None
Next, let’s modify the title to be more informative. Right now, it is not clear
from the graph what country’s unemployment rate this is.
1. Double-click on the horizontal axis. The Format Axis box should appear
to the right.
• It may take you more than one try to double-click on the correct
object.
2. Select Axis Options.
• You can see all of the choices Excel made here based on what it sees
in the data.
• Feel free to play around with these options. You can return each
option to its original/default state by clicking on that option’s Reset
button.
3. Change the major units to either 4 Years or 8 Years, whichever one you
like better.
This isn’t really necessary, and violates our principle of avoiding repeated infor-
mation. But it is useful when we are graphing multiple time series in the same
chart, as it is more direct than a legend.
Finally, we want to add alt text for the visually impaired.
The graph will now look like the one above. The graph here is visually clean
and simple, in part because I left out many elements that I could have included:
axis labels (not needed because the units are obvious from context), a fancy
background, etc.
FYI
For further reading
Data visualization skills are valuable in the academic world and in the
professional world. If you are interested in developing your skills fur-
ther, you might consider our course ECON 334: Data Visualization and
Economic Analysis.
You might also get a book on data visualization, either Kieran Healy’s
Data Visualization or Cole Nussbaumer Knaflic’s Storytelling with Data.
Healy’s book is aimed at a primarily academic audience while Knaflic’s
is aimed at a business audience. Both of them are practical and easy
reads, with many examples.
Bar graphs are produced from a frequency table, as they are a visual depiction
of the information in such a table.
They can be produced in Excel using either the Bar chart type or the Column
chart type. The difference between the two is that the bars are horizontal in
the Bar chart type and vertical in the Column chart type.
Example 6.13. A bar graph of the Party variable
To construct a bar graph of the Party variable:
As you can see, basic bar graphs are quite simple. There is a bar for each
category (the first column of the table) and variable (the other columns), and
the length of each bar corresponds to its value.
As with our line graph, the current graph contains more information than we
actually want - it shows the count and the percentage, and we really only need
one of those.
The resulting graph shows the number of months in which the two largest
federal parties were in government over the time frame of our data.
As with line graphs, we can prepare a bar graph for presentation by using Excel’s
tools with an eye towards the principles of effective presentation graphics.
Example 6.14. Cleaning up our bar graph
We can do a few easy things to simplify and clarify our bar graph:
Another thing we can do is use color and branding to convey information: each
major Canadian political party has a distinctive color as part of its brand: red
for the Liberals, blue for the Conservatives, and orange for the NDP.
Finally, the purpose of a bar graph is to enable the viewer to compare magni-
tudes, which requires looking at the top of each bar. But you may notice that
the category labels are at the bottom of each bar. Following the principle that
we don’t want to make the reader’s eye do any extra work, let’s put those labels
on top.
One of the keys to effective bar graph design is to keep the graph simple and
clean, and to avoid “chartjunk.”
First, bar graphs should always start at zero. The size of the bar is meant to
visually represent the value of the variable it is depicting. Using any origin other
than zero can cause relative sizes to be misleading.
Example 6.15. Why bar graphs should always start at zero
Suppose we start our Party bar graph at 200 rather than zero. We will get
this:
The relative size of these bars is very misleading. The bar for Liberal is three
times as big as the bar for Conservative, even though the number it represents
is only 29% higher.
Pie graphs are an alternative way of depicting relative frequencies, and are
also available in Excel along with 3-dimensional variations on both pie and bar
graphs:
[Figures: examples of a pie graph, a 3D pie graph, and a 3D bar graph]
However, most data visualization experts recommend against their use. Re-
search on how people process visual information usually finds that these charts
are less informative in practice. People are much better at evaluating the relative
size of two lines or rectangles than they are at evaluating the relative size of two
pie slices. They are also much better at assessing relative distances in two
dimensions than in three dimensions.
6.3.5 Histograms
1. Select or double-click on the horizontal axis to open the Format Axis box.
Now the bins are exactly one percentage point wide, which makes the horizontal
axis a little easier to read and interpret.
It would also be nice to modify the start and end points so that instead of the
bins being 5.4-6.4%, 6.4-7.4%, etc. they were 5-6%, 6-7%, etc. Unfortunately,
that does not seem to be possible in Excel. We can work around that limita-
tion by creating the frequency table, and making a column/bar chart of that
frequency table.
Finally, as with other graphs we will want to add, delete and modify various
elements to bring this histogram closer to presentation quality.
Example 6.18. Enhancing the quality of our histogram
To enhance the quality of this histogram:
Chapter review
You can download the complete Excel file with all analysis from this chapter at
https://bookdown.org/bkrauth/BOOK/sampledata/CanEmpHistResults.xlsx
In this chapter, we learned how to calculate simple statistics in Excel. Calculat-
ing the statistics is typically the easiest part of a statistical analysis - cleaning
data and interpreting the results is much more challenging. So hopefully you
have found this chapter an easy break from the last few.
In the next chapter, we will bring this chapter together with the previously
developed theory of random events and random variables to interpret statistics
as random variables.
Practice problems
Answers can be found in the appendix.
SKILL #1: Calculate summary statistics by hand
1. Consider the following data set, and calculate the following statistics
by hand:

Name    Age
Al      25
Betty   32
Carl    78
a. Sample size.
b. Sample average age.
c. Sample median of age.
Name Age
Al 25
Betty 32
Carl 78
Enter this table in Excel, and use Excel functions to calculate the
following statistics:
a. Sample size.
b. Sample average age.
c. Sample median of age.
d. Sample 25th percentile of age.
e. Sample variance of age.
f. Sample standard deviation of age.
Name Age
Al 25
Betty 32
Carl 78
Use Excel to construct a binned frequency table of age, with bin widths
of 10 years.
Name Age
Al 25
Betty 32
Carl 78
Chapter 7

Statistics
In earlier chapters, we learned some techniques for using Excel to clean data
and to construct common statistics and charts. We also learned the basics of
probability theory, simple random variables, and more complex random vari-
ables.
Our next step is to bring these two sets of concepts together. In this chapter,
we will develop a framework for talking about data and the statistics calculated
from that data as a random process that can be described using the theory
of probability and random variables. We will also explore one of the most
important uses of statistics: to estimate, or guess at the value of, some unknown
parameter of the DGP.
Goals
Chapter goals
In this chapter we will:
Although our examples will all be based on simple data sets, many of our con-
cepts and results can be applied to more complex data.
For example, suppose we observe 𝑛 = 3 games of roulette and record whether a
bet on red wins each game: 𝑥𝑖 = 𝐼(red wins game 𝑖). Then 𝐷𝑛 = (𝑥1, 𝑥2, 𝑥3). If
red loses the first two games and wins the third game, we have 𝐷𝑛 = (0, 0, 1).
Our data set 𝐷𝑛 is a set of 𝑛 numbers, but we can also think of it as a set of 𝑛
random variables with unknown joint distribution 𝑃𝐷 . The distinction here is
a hard one for students to make, so give it some thought before proceeding.
The joint distribution of 𝐷𝑛 is called its data generating process or DGP.
The exact DGP is assumed to be unknown, but we usually have at least some
information about it.
In our roulette example, each 𝑥𝑖 has the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) distribution, where:

𝑝 = Pr(𝑏 ∈ 𝑅𝑒𝑑)

If the games are independent, the joint PDF of the data set is:

Pr(𝐷𝑛) = 𝑓𝑥(𝑥1) 𝑓𝑥(𝑥2) 𝑓𝑥(𝑥3) = 𝑝^(𝑥1+𝑥2+𝑥3) (1 − 𝑝)^(3−𝑥1−𝑥2−𝑥3)
Note that even with a small data set of a simple random variable, the joint PDF
is not easy to calculate. Once we get into larger data sets and more complex
random variables, it can get very difficult. That’s OK, we don’t usually need to
calculate it - we just need to know that it could be calculated.
In order to model the data generating process, we need to model the entire joint
distribution of 𝐷𝑛 . As mentioned earlier, this means we must model both:
Fortunately, we often can simplify this joint distribution quite a bit by assum-
ing that 𝐷𝑛 is independent and identically distributed (IID) or a simple
random sample from a large population.
A simple random sample has two features:
1. Each observation 𝑥𝑖 is drawn from the same population distribution (the
observations are identically distributed).
2. The observations are drawn independently of one another, i.e., with
replacement.¹

¹ This means we allow for the possibility that we sample the same case more than
once. In practice this doesn’t matter as long as the sample is small relative to the population.
We can calculate statistics for time series, and we already did in Chapter 6.
However, time series data often requires more advanced techniques than we will
learn in this class. ECON 433 addresses time series data.
Not all useful data sets come from a simple random sample or a time series. For
example:
Many data sets combine several of these elements. For example, Canada’s un-
employment rate is calculated using data from the Labour Force Survey (LFS).
The LFS is built from a stratified sample of the civilian non-institutionalized
working-age population of Canada. There is also some clustering: the LFS
will typically interview whole households, and will do some geographic cluster-
ing to save on travel costs. The LFS is gathered monthly, and the resulting
unemployment rate is a time series.
Random samples and their close relatives have the feature that they are rep-
resentative of the population from which they are drawn. In a sense that will
be made more clear over the next few chapters, any sufficiently large random
sample “looks just like” the population.
Unfortunately, a simple random sample is quite difficult to collect from humans.
Even if we are able to randomly select cases, we often run into the following
problems:
This is not an issue that has a purely technical solution, but requires careful
thought instead. If we are imputing values, do we believe that our imputation
accurately reflects the values we would have observed?
FYI
Nonresponse bias in recent US elections
Going into both the 2016 and 2020 US presidential elections, polls in-
dicated that the Democratic candidate had a substantial lead over the
Republican candidate:
The generally accepted explanation among pollsters for the clear dispar-
ity between polls and voting is systematic nonresponse: for some reason,
Trump voters are less likely to respond to polls. Since most people do
not respond to standard telephone polls any more (response rates are
typically around 9%), it does not take much difference in response rates
to produce a large difference in responses. For example, suppose that:
I will use 𝑠𝑛 to represent an abstract statistic, but we will often use other
notation to talk about specific statistics.
The most important statistic is the sample average which is defined as:
𝑥̄𝑛 = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑥ᵢ
Another important statistic is the sample frequency of an event 𝐴, the fraction
of observations for which the event occurs:

𝑓𝐴̂ = (1/𝑛) ∑ᵢ₌₁ⁿ 𝐼(𝑥ᵢ ∈ 𝐴)
The sample median is the value 𝑚 such that no more than half of the observa-
tions lie below it and no more than half lie above it:

𝑚̂𝑥 = 𝑚 such that 𝑓̂𝑥<𝑚 ≤ 0.5 and 𝑓̂𝑥>𝑚 ≤ 0.5
FYI
To see why the sampling distribution of a statistic is so difficult to cal-
culate, suppose we have a discrete random variable 𝑥𝑖 whose support 𝑆𝑥
has five elements. Then we need to calculate the sampling distribution
of our statistic by adding its probability up across the support of 𝐷𝑛 .
The support has 5𝑛 elements, a number that can quickly get very large.
For example, a typical data set in microeconomics has at least a few
hundred or a few thousand observations. With 100 observations, 𝐷𝑛 can
take on 5100 ≈ 7.9×1069 (that’s 79 followed by 68 zeros!) distinct values.
With 1,000 observations , 𝐷𝑛 can take on 51000 distinct values, a number
too big for Excel to even calculate.
𝐸(𝑥̄𝑛) = 𝐸((1/𝑛) ∑ᵢ₌₁ⁿ 𝑥ᵢ) = (1/𝑛) ∑ᵢ₌₁ⁿ 𝐸(𝑥ᵢ) = (1/𝑛) ∑ᵢ₌₁ⁿ 𝜇𝑥 = 𝜇𝑥
This is an important and general result in statistics. The mean of the sample
average in a random sample is identical to the mean of the random variable
being averaged.
𝐸(𝑥𝑛̄ ) = 𝐸(𝑥𝑖 )
We have shown this property specifically for a random sample, but it holds
under many other sampling processes.
The variance of the sample average is not equal to the variance of the random
variable being averaged, but they are closely related.
Example 7.8. The variance of the sample average
To keep the math simple, suppose we only have 𝑛 = 2 observations. Then the
sample average is:
𝑥̄ = (𝑥1 + 𝑥2)/2

By our earlier formula for the variance:

𝑣𝑎𝑟(𝑥̄) = 𝑣𝑎𝑟((𝑥1 + 𝑥2)/2)   (7.1)
        = (1/2)² 𝑣𝑎𝑟(𝑥1 + 𝑥2)   (7.2)
        = (1/4)(𝑣𝑎𝑟(𝑥1) + 2 𝑐𝑜𝑣(𝑥1, 𝑥2) + 𝑣𝑎𝑟(𝑥2))   (7.3)
        = (1/4)(𝜎𝑥² + 0 + 𝜎𝑥²)   (7.4)
        = 𝜎𝑥²/2   (7.5)

since 𝑣𝑎𝑟(𝑥1) = 𝑣𝑎𝑟(𝑥2) = 𝜎𝑥² and 𝑐𝑜𝑣(𝑥1, 𝑥2) = 0 by independence.
More generally, the variance of the sample average in a random sample of size
𝑛 is:
𝑣𝑎𝑟(𝑥̄𝑛) = 𝜎𝑥²/𝑛
where 𝜎𝑥2 = 𝑣𝑎𝑟(𝑥𝑖 ).
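These two results are easy to verify by simulation in R (a sketch of mine, with arbitrary values 𝜇 = 5, 𝜎 = 2, and 𝑛 = 25):

    set.seed(123)
    xbar <- replicate(10000, mean(rnorm(25, mean = 5, sd = 2)))
    mean(xbar)   # close to mu = 5
    var(xbar)    # close to sigma^2 / n = 4/25 = 0.16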
Other commonly-used statistics also have a mean and variance.
Example 7.9. The mean and variance of the sample frequency
Since the absolute sample frequency has the binomial distribution, we have
already seen its mean and variance. Let 𝑝 = Pr(𝑥𝑖 ∈ 𝐴). Then 𝑛𝑓𝐴̂ ∼
𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝) and:
𝐸(𝑛𝑓𝐴̂ ) = 𝑛𝑝
𝑣𝑎𝑟(𝑛𝑓𝐴̂ ) = 𝑛𝑝(1 − 𝑝)
Applying the usual rules for expected values, the mean and variance of the
relative sample frequency are:

𝐸(𝑓𝐴̂) = 𝐸(𝑛𝑓𝐴̂)/𝑛 = 𝑛𝑝/𝑛 = 𝑝

𝑣𝑎𝑟(𝑓𝐴̂) = 𝑣𝑎𝑟(𝑛𝑓𝐴̂)/𝑛² = 𝑛𝑝(1 − 𝑝)/𝑛² = 𝑝(1 − 𝑝)/𝑛
7.3 Estimation
Statistics are often used to estimate, or guess the value of, some unknown feature
of the population or DGP.
• In our roulette data set, the joint distribution of the data depends only
on the single parameter 𝑝 = Pr(𝑏 ∈ 𝑅𝑒𝑑).
Estimators are statistics, so they have all the usual characteristics of a statistic,
including a sampling distribution, a mean, a variance, etc.
In addition estimators have properties specific to their purpose as a statistic
that is supposed to take on a value close to the unknown parameter of interest.
The sampling error of an estimator 𝑠𝑛 of a parameter 𝜃 is the difference between
the estimator and the parameter it is estimating:

𝑒𝑟𝑟(𝑠𝑛) = 𝑠𝑛 − 𝜃
7.3.3 Bias
The first is the bias of the estimator, which is defined as its expected sampling
error:

𝑏𝑖𝑎𝑠(𝑠𝑛) = 𝐸(𝑒𝑟𝑟(𝑠𝑛)) = 𝐸(𝑠𝑛) − 𝜃
Note that bias is always defined relative to the parameter we wish to estimate,
and is not an inherent property of the statistic.
Ideally we would want 𝑏𝑖𝑎𝑠(𝑠𝑛 ) to be zero, in which case we would say that 𝑠𝑛
is an unbiased estimator of 𝜃.
Example 7.12. Two unbiased estimators of the mean
Consider the sample average 𝑥𝑛̄ in a random sample as an estimator of the
parameter 𝜇𝑥 = 𝐸(𝑥𝑖 ). The bias is:
𝑏𝑖𝑎𝑠(𝑥𝑛̄ ) = 𝐸(𝑥𝑛̄ ) − 𝜇𝑥 = 𝜇𝑥 − 𝜇𝑥 = 0
That is, the sample average is an unbiased estimator of the population mean.
However, it is not the only unbiased estimator. For example, suppose we simply
take the value of 𝑥𝑖 in the first observation and throw away the rest of the data.
This “first observation estimator” is easier to calculate than the sample average,
and is also an unbiased estimator of 𝜇𝑥 :
𝑏𝑖𝑎𝑠(𝑥1 ) = 𝐸(𝑥1 ) − 𝜇𝑥 = 𝜇𝑥 − 𝜇𝑥 = 0
This example illustrates a general principle: there is rarely exactly one unbiased
estimator. There are either none, or many.
Example 7.13. An unbiased estimator of the variance
The sample variance:

𝑠𝑥² = (1/(𝑛 − 1)) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̄𝑛)²

is an unbiased estimator of the population variance: 𝐸(𝑠𝑥²) = 𝑣𝑎𝑟(𝑥ᵢ).
Example 7.14. The variance of the sample average and first ob-
servation estimators
In our previous example, we found two unbiased estimators for the mean: the
sample average 𝑥̄𝑛 and the first observation 𝑥1.
The variance of the sample average is:

𝑣𝑎𝑟(𝑥̄𝑛) = 𝜎²/𝑛

while the variance of the first observation estimator is:

𝑣𝑎𝑟(𝑥1) = 𝜎²
For any 𝑛 > 1, the sample average 𝑥𝑛̄ has lower variance than the first obser-
vation estimator 𝑥1 . Since they are both unbiased, it is the preferred estimator
of the two.
In fact, we can prove that 𝑥𝑛̄ is the minimum variance unbiased estimator of
𝜇𝑥 .
Unfortunately, once we move beyond the simple case of estimating the popula-
tion mean, we run into several complications:
The first complication is that an unbiased estimator may not exist for a par-
ticular parameter of interest. If there is no unbiased estimator, there is no
minimum variance unbiased estimator. So we need some other way of choosing
an estimator.
First we show that the sample median is a biased estimator of 𝑚𝑥. With a
single observation, the sample median is:

𝑚̂𝑥 = 𝑥1

and its expected value is:

𝐸(𝑚̂𝑥) = 𝐸(𝑥1) = 𝑝

which is generally not equal to the population median 𝑚𝑥 = 𝐼(𝑝 > 0.5).
More generally, any statistic calculated from this data set must take the form
𝑠 = 𝑎0 + 𝑎1 𝑥1 , where 𝑠 = 𝑎0 when 𝑥1 = 0 and 𝑠 = 𝑎0 + 𝑎1 is its value when
𝑥1 = 1. This statistic has expected value 𝐸(𝑎0 + 𝑎1 𝑥1 ) = 𝑎0 + 𝑎1 𝑝, so any
unbiased estimator would need to solve the equation:
𝑎0 + 𝑎1 𝑝 = 𝐼(𝑝 > 0.5)
and there is no such solution.
This estimator will be unbiased, but 10 observations isn’t very much and so its
variance will be high. We can reduce the variance by adding more observations
from people who are almost 35 years old:
By including more data, these estimators will have lower variance but will in-
troduce bias. My guess is introducing 34 and 36 year olds is a good idea since
they probably have similar earnings to 35 year olds, but including children and
the elderly is not such a good idea.
The mean squared error (MSE) of an estimator is its expected squared sampling
error, which equals its variance plus its squared bias:

𝑀𝑆𝐸(𝑠𝑛) = 𝐸(𝑒𝑟𝑟(𝑠𝑛)²) = 𝑣𝑎𝑟(𝑠𝑛) + [𝑏𝑖𝑎𝑠(𝑠𝑛)]²

The MSE criterion allows us to choose a biased estimator with low variance over
an unbiased estimator with high variance, and also allows us to choose between
biased estimators when no unbiased estimator exists.
Example 7.17. The MSE of the sample mean and first observation
estimators
The mean squared error of the sample average is:

𝑀𝑆𝐸(𝑥̄𝑛) = 𝑣𝑎𝑟(𝑥̄𝑛) + [𝑏𝑖𝑎𝑠(𝑥̄𝑛)]² = 𝜎𝑥²/𝑛 + 0² = 𝜎𝑥²/𝑛

and the mean squared error of the first observation estimator is:

𝑀𝑆𝐸(𝑥1) = 𝜎𝑥²
The sample average is the preferred estimator by the MSE criterion, so in this
case we get the same result as applying the MVUE criterion.
Parameter estimates are typically reported along with their standard errors.
The standard error of a statistic is an estimate of its standard deviation.
Example 7.18. The standard error of the average
We have shown that the sample average provides a good estimate of the popu-
lation mean, and that its variance is:
𝑣𝑎𝑟(𝑥̄𝑛) = 𝜎𝑥²/𝑛 = 𝑣𝑎𝑟(𝑥ᵢ)/𝑛

Since 𝑠𝑥² is an unbiased estimator of 𝑣𝑎𝑟(𝑥ᵢ), we can use it to construct an
unbiased estimator of 𝑣𝑎𝑟(𝑥̄𝑛):

𝑣𝑎𝑟̂(𝑥̄𝑛) = 𝑠𝑥²/𝑛
We might also want to estimate the standard deviation of 𝑥̄𝑛. A natural approach
would be to take the square root of the estimator above, yielding:

𝑠𝑒(𝑥̄𝑛) = 𝑠𝑥/√𝑛
This is the conventional formula for the standard error of the sample average,
and is typically reported next to the sample average.
Standard errors are usually biased estimators of the statistic’s true standard
deviation, but the bias is typically small.
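In R, this standard error is one line (my sketch with simulated data):

    set.seed(321)
    x <- rnorm(100, mean = 12, sd = 2)
    sd(x) / sqrt(length(x))   # s_x / sqrt(n), close to the true sd(xbar) = 2/10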
7.4 The law of large numbers

The law of large numbers (LLN) says that for a large enough random sample,
the sample average is almost identical to the corresponding population mean.
In order to state the LLN, we need to introduce some concepts. Consider a data
set 𝐷𝑛 of size 𝑛, and let 𝑠𝑛 be some statistic calculated from 𝐷𝑛. We say that
𝑠𝑛 converges in probability to some constant 𝑐 if, for every 𝜀 > 0:

lim(𝑛→∞) Pr(|𝑠𝑛 − 𝑐| > 𝜀) = 0

which we write:

𝑠𝑛 →𝑝 𝑐
FYI
LAW OF LARGE NUMBERS: Let 𝑥𝑛̄ be the sample average from
a random sample of size 𝑛 on the random variable 𝑥𝑖 with mean 𝐸(𝑥𝑖 ) =
𝜇𝑥 . Then
𝑥𝑛̄ →𝑝 𝜇𝑥
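A simulation shows the LLN at work in our roulette example (a sketch of mine; rbinom() simulates the wins):

    set.seed(2021)
    wins <- rbinom(10000, size = 1, prob = 18/37)   # 10,000 bets on red
    running_avg <- cumsum(wins) / seq_along(wins)
    plot(running_avg, type = "l")
    abline(h = 18/37, lty = 2)   # the running average settles at p = 18/37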
The law of large numbers applies to the sample mean, but we are interested in
other estimators as well.
In general, we say that the statistic 𝑠𝑛 is a consistent estimator of a parameter
𝜃 if:
𝑠𝑛 →𝑝 𝜃
It will turn out that most of the statistics we use are consistent estimators of
the thing we typically use them to estimate.
The key to this property is a result called Slutsky’s theorem. Slutsky’s the-
orem roughly says that if the law of large numbers applies to a statistic 𝑠𝑛 , it
also applies to 𝑔(𝑠𝑛 ) for any continuous function 𝑔(⋅).
FYI
SLUTSKY’S THEOREM: Let 𝑠𝑛 be a statistic with 𝑠𝑛 →𝑝 𝑐, and let
𝑔(⋅) be a continuous function. Then:

𝑠𝑛 →𝑝 𝑐 ⟹ 𝑔(𝑠𝑛) →𝑝 𝑔(𝑐)

For example, Slutsky’s theorem can be used to show that the sample
variance is a consistent estimator of the population variance:

𝑠𝑥² →𝑝 𝑣𝑎𝑟(𝑥ᵢ)
The math needed to make full use of Slutsky’s theorem and prove these results
is beyond the scope of this course, so all I am asking here is for you to know
that it can be used for this purpose.
Chapter review
In this chapter we have learned to model a data generating process, describe
the probability distribution of a statistic, interpret a statistic as an estimator
of some unknown parameter of the underlying data generating process.
Almost by definition, estimators are rarely identical to the parameter of interest,
so any conclusions based on estimators have a degree of uncertainty. To describe
this uncertainty in a rigorous and quantitative manner, we will next learn some
principles of statistical inference.
Practice problems
Answers can be found in the appendix.
SKILL #1: Identify whether a data set is a random sample
2. Identify the sampling type (random sample, time series, stratified sample,
cluster sample, census, convenience sample) for each of the following data
sets.
a. A data set from a survey of 100 SFU students who I found waiting
in line at Tim Horton’s.
b. A data set from a survey of 1,000 randomly selected SFU students.
c. A data set from a survey of 100 randomly selected SFU students from
each faculty.
d. A data set that reports total SFU enrollment for each year from
2005-2020.
e. A data set from administrative sources that describes demographic
information and postal code of residence for all SFU students in 2020.
3. Suppose we have a random sample 𝐷𝑛 = (𝑥1, 𝑥2) of size 𝑛 = 2 on a
random variable 𝑥ᵢ with PDF:

𝑓𝑥(𝑎) =   ⎧ 0.4   𝑎 = 1
          ⎩ 0.6   𝑎 = 2   (7.13)

Let 𝑓𝐷𝑛(𝑎, 𝑏) = Pr(𝑥1 = 𝑎 ∩ 𝑥2 = 𝑏) be the joint PDF of the data set.
4. Suppose we have the data set described in question 3 above. Find the
support 𝑆 and sampling distribution 𝑓(⋅) for each of the following statis-
tics:
a. The sample frequency 𝑓1̂ = (𝐼(𝑥1 = 1) + 𝐼(𝑥2 = 1))/2.
b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.
c. The sample variance 𝜎̂𝑥² = (𝑥1 − 𝑥̄)² + (𝑥2 − 𝑥̄)².
d. The sample standard deviation 𝜎̂𝑥 = √𝜎̂𝑥².
e. The sample minimum 𝑥𝑚𝑖𝑛 = min(𝑥1 , 𝑥2 ).
f. The sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ).
SKILL #5: Find the mean and variance of a statistic from its sam-
pling distribution
5. Suppose we have the data set described in question 3 above. Find the
mean of each of the following statistics:
a. The sample frequency 𝑓1̂ = (𝐼(𝑥1 = 1) + 𝐼(𝑥2 = 1))/2.
b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.
c. The sample variance 𝜎̂𝑥² = (𝑥1 − 𝑥̄)² + (𝑥2 − 𝑥̄)².
d. The sample standard deviation 𝜎̂𝑥 = √𝜎̂𝑥².
e. The sample minimum 𝑥𝑚𝑖𝑛 = min(𝑥1 , 𝑥2 ).
f. The sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ).
6. Suppose we have the data set described in question 3 above. Find the
variance of the following statistics:
a. The sample frequency 𝑓1̂ = (𝐼(𝑥1 = 1) + 𝐼(𝑥2 = 1))/2.
b. The sample average 𝑥̄ = (𝑥1 + 𝑥2 )/2.
c. The sample minimum 𝑥𝑚𝑖𝑛 = min(𝑥1 , 𝑥2 ).
d. The sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1 , 𝑥2 ).
10. Suppose we have the data set described in question 3 above. Suppose
we use the sample maximum 𝑥𝑚𝑎𝑥 = max(𝑥1, 𝑥2) to estimate the population
maximum max(𝑆𝑥).
a. Find the support 𝑆𝑒𝑟𝑟 of the sampling error 𝑒𝑟𝑟 = max(𝑥1, 𝑥2) −
max(𝑆𝑥).
b. Find the PDF 𝑓𝑒𝑟𝑟(⋅) for the sampling distribution of the sampling
error 𝑒𝑟𝑟.
11. Suppose we have the data set described in question 3 above. Classify each
of the following estimators as biased or unbiased, and calculate the bias.
a. The sample frequency 𝑓1̂ as an estimator of the probability Pr(𝑥𝑖 =
1).
b. The sample average 𝑥̄ as an estimator of the population mean 𝐸(𝑥𝑖 )
c. The sample variance 𝜎̂𝑥2 as an estimator of the population variance
𝑣𝑎𝑟(𝑥𝑖 )
d. The sample standard deviation 𝜎̂𝑥 as an estimator of the population
standard deviation 𝑠𝑑(𝑥𝑖 )
e. The sample minimum 𝑥𝑚𝑖𝑛 as an estimator of the population mini-
mum min(𝑆𝑥 ).
f. The sample maximum 𝑥𝑚𝑎𝑥 as an estimator of the population max-
imum max(𝑆𝑥 ).
12. Suppose we are interested in the following parameters:
• The average earnings of Canadian men: 𝜇𝑀 .
• The average earnings of Canadian women: 𝜇𝑊 .
• The male-female earnings gap in Canada: 𝜇𝑀 − 𝜇𝑊 .
• The male-female earnings ratio in Canada: 𝜇𝑀 /𝜇𝑊 .
and we have calculated the following statistics from a random sample of
Canadians:
• The average earnings of men in our sample: 𝑦̄𝑀.
• The average earnings of women in our sample: 𝑦̄𝑊.
• The male-female earnings gap in our sample: 𝑦̄𝑀 − 𝑦̄𝑊.
• The male-female earnings ratio in our sample: 𝑦̄𝑀/𝑦̄𝑊.
We already know that 𝑦̄𝑀 is an unbiased estimator of 𝜇𝑀 and 𝑦̄𝑊 is an
unbiased estimator of 𝜇𝑊.
a. Is the sample earnings gap 𝑦̄𝑀 − 𝑦̄𝑊 a biased or unbiased estimator
of the population gap 𝜇𝑀 − 𝜇𝑊? Explain.
b. Is the sample earnings ratio 𝑦̄𝑀/𝑦̄𝑊 a biased or unbiased estimator
of the population earnings ratio 𝜇𝑀/𝜇𝑊? Explain.
13. Suppose we have the data set described in question 3 above. Calculate
the mean squared error for each of the following estimators.
a. The sample frequency 𝑓1̂ as an estimator of the probability Pr(𝑥𝑖 =
1).
b. The sample average 𝑥̄ as an estimator of the population mean 𝐸(𝑥𝑖 )
c. The sample minimum 𝑥𝑚𝑖𝑛 as an estimator of the population mini-
mum min(𝑆𝑥 ).
d. The sample maximum 𝑥𝑚𝑎𝑥 as an estimator of the population max-
imum max(𝑆𝑥 ).
14. Suppose you have a random sample of size 𝑛 = 2 on the random variable 𝑥𝑖
with mean 𝐸(𝑥𝑖) = 𝜇 and variance 𝑣𝑎𝑟(𝑥𝑖) = 𝜎². Two potential estimators
of 𝜇 are the sample average 𝑥̄ = (𝑥1 + 𝑥2)/2 and the last observation 𝑥2.
a. Are these estimators biased or unbiased?
b. Find 𝑣𝑎𝑟(𝑥̄).
c. Find 𝑣𝑎𝑟(𝑥2).
d. Find 𝑀𝑆𝐸(𝑥̄).
e. Find 𝑀𝑆𝐸(𝑥2).
f. Which estimator is preferred under the MVUE criterion?
g. Which estimator is preferred under the MSE criterion?
15. Suppose that we have a random sample 𝐷𝑛 of size 𝑛 = 100 on the random
variable 𝑥𝑖 with unknown mean 𝜇 and unknown variance 𝜎2 . Suppose
that the sample average is 𝑥̄ = 12 and the sample variance is 𝜎̂² = 4. Find
the standard error of 𝑥̄.
Chapter 8
Statistical inference
In a previous chapter, we learned about estimation: the use of data and statistics
to construct the best possible guess at the value of some parameter.
In this chapter, we will pursue a different goal. Instead of estimating the single
“most likely” value of the parameter, we will construct statistics that can be
used to classify particular parameter values as plausible (could be the true value)
or implausible (unlikely to be the true value).
The set of procedures for constructing confidence intervals and hypothesis tests
is called statistical inference.
Goals
Chapter goals
In this chapter we will learn how to:
8.1.1 Evidence
• Our results may have a win rate close to the expected rate for a fair game,
or far from that rate.
That is, we can make a fairly confident conclusion if we have a lot of evidence,
and our conclusion depends on what the evidence shows. But if we do not have
a lot of evidence, we cannot make a confident conclusion either way.
In this chapter we will formalize these basic ideas about evidence.
In our roulette example, let 𝑥𝑖 indicate whether red wins game 𝑖:

𝑥𝑖 = 𝐼(Red wins)

We know that red wins in a fair game with probability 𝑝𝑟𝑒𝑑 = 18/37 ≈ 0.486.
The first step in a hypothesis test is to define the null hypothesis. The null
hypothesis is a statement about our parameter 𝜃 that takes the form:

𝐻0 ∶ 𝜃 = 𝜃0

We then define the alternative hypothesis as the set of values 𝜃 can take when
the null is false. In our roulette example, the null is that the game is fair, and
the alternative is every other win probability:

𝐻0 ∶ 𝑝𝑟𝑒𝑑 = 18/37
𝐻1 ∶ 𝑝𝑟𝑒𝑑 ≠ 18/37
Notice that there is something of an asymmetry between the null and alternative
hypothesis: the null is typically (though not necessarily) a single value and the
alternative is every other possible value.
FYI
What null hypothesis to choose?
Our framework here assumes that you already know what null hypothesis
you wish to test, but we might briefly consider how we might choose a
null hypothesis to test.
In some applications there are null hypotheses that are of clear interest
for that specific case:
• In our roulette example, the natural null to test is whether the win
probability matches that of a fair game (𝑝 = 𝑝𝑓𝑎𝑖𝑟 ).
• When measuring the effect 𝛽 of one variable on another, the natural
null to test is “no effect at all” (𝛽 = 0).
• When comparing the mean of some characteristic or outcome across
two groups (for example, average wages of men and women), the
natural null to test is that they are the same (𝜇𝑀 = 𝜇𝑊).
• In epidemiology, a contagious disease will tend to spread if its re-
production rate 𝑅 is greater than one, and decline if it is less than
one, so 𝑅 = 1 is a natural null to test.
Our next step is to construct a test statistic that can be calculated from our
data. A valid test statistic for a given null hypothesis is a statistic 𝑡𝑛 that has
the following two properties:
1. Its probability distribution is known when the null is true.
2. Its probability distribution is different when the alternative is true.
A natural test statistic for the win probability of a bet on red would be the
corresponding win frequency in our data. We could use either the relative win
frequency (which also happens to be the sample average):

𝑓̂𝑟𝑒𝑑 = 𝑥̄ = (1/𝑛) ∑𝑖 𝑥𝑖

or the absolute number of wins; we will use the absolute number of wins as our
test statistic:

𝑡𝑛 = ∑𝑖 𝑥𝑖 = 𝑛𝑥̄
Next we need to find the probability distribution of 𝑡𝑛 under the null, and under
the alternative.
In general, since 𝑥𝑖 ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝𝑟𝑒𝑑) we have:

𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝𝑟𝑒𝑑)

Under the null, with 𝑛 = 100 games, this becomes:

𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37)
Since this distribution does not involve any unknown parameters, our test statis-
tic satisfies the requirement of having a known distribution under the null.
Under the alternative (when 𝐻1 is true), 𝑝𝑟𝑒𝑑 can take on any value other than
18/37. The sample size is still 𝑛 = 100, so the distribution of the test statistic
is:
𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 𝑝𝑟𝑒𝑑 ) where 𝑝𝑟𝑒𝑑 ≠ 18/37
Notice that the distribution of our test statistic under the alternative is not
known, since 𝑝𝑟𝑒𝑑 is not known. But the distribution is different under the
alternative, and that is what we require from our test statistic.
After choosing a test statistic 𝑡𝑛 , the next step is to choose critical values.
The critical values are two numbers 𝑐𝐿 and 𝑐𝐻 (where 𝑐𝐿 < 𝑐𝐻 ) such that
• 𝑡𝑛 has a high probability of being between 𝑐𝐿 and 𝑐𝐻 when the null is true.
• 𝑡𝑛 has a lower probability of being between 𝑐𝐿 and 𝑐𝐻 when the alternative
is true.
The range of values from 𝑐𝐿 to 𝑐𝐻 is called the critical range of our test.
Given the test statistic and critical values, we apply a simple decision rule: we
reject the null if 𝑡𝑛 falls outside the critical range (𝑡𝑛 < 𝑐𝐿 or 𝑡𝑛 > 𝑐𝐻), and we
fail to reject the null otherwise.
Notice that there is an asymmetry here: in the absence of evidence, we will not
reject any null hypotheses.
How do we choose critical values? You can think of critical values as setting a
standard of evidence, so we need to balance two considerations:
• The probability of rejecting a false null is called the power of the test.
– We want our test to reject the null when it is false, so power is good.
• The probability of rejecting a true null is called the size or significance
of a test.
– We do not want our test to reject the null when it is true, so size is
bad.
• There is always a trade off between power and size
– A narrower critical range (higher 𝑐𝐿 or lower 𝑐𝐻 ) will increase the
rejection rate, increasing both power (good) and size (bad).
– A wider critical range (lower 𝑐𝐿 or higher 𝑐𝐻 ) will reduce the rejection
rate, reducing both power (bad) and size (good).
Given this trade off between power and size, we might construct some criterion
that includes both (just like MSE includes both variance and bias) and choose
critical values to maximize that criterion. In practice, we do not typically do
that.
Instead, we follow a simple convention: choose the critical values so that the
size of the test is 5% (or occasionally 1% or 10%). In other words, we set size
equal to a conventional value, and let the power be whatever is implied by that.
Example 8.5. Critical values for roulette
We earlier showed that the distribution of 𝑡𝑛 under the null is:
𝑡𝑛 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(100, 18/37)
Choosing 𝑐𝐿 and 𝑐𝐻 to be the 2.5% and 97.5% quantiles of this distribution
gives 𝑐𝐿 = 39 and 𝑐𝐻 = 58. In other words, we reject the null (at 5% significance)
that the roulette wheel is fair if red wins fewer than 39 games or more than 58
games.
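These quantiles can be checked in R; a minimal sketch using the null distribution:

# 2.5% and 97.5% quantiles of the Binomial(100, 18/37) null distribution
cL <- qbinom(0.025, size = 100, prob = 18/37)
cH <- qbinom(0.975, size = 100, prob = 18/37)
c(cL, cH)
## [1] 39 58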
FYI
A general test for a single probability
We can generalize the test we have constructed so far to the case of the
probability of any event:
The performance of a test is summarized by its power curve, which gives the
probability of rejecting the null as a function of the true parameter value 𝜃:

𝑝𝑜𝑤𝑒𝑟(𝜃) = Pr(reject 𝐻0)
Power curves can be tricky to calculate, and I will not ask you to calculate them
for this course. But they can be calculated, and it is useful to see what they
look like.
Figure 8.2 below depicts the power curve for the roulette test we have just
constructed; that is, we are testing the null that 𝑝𝑟𝑒𝑑 = 18/37 at a 5% size. The
blue line depicts the power curve for 𝑛 = 100 as in our example, while the green
line depicts the power curve for 𝑛 = 20.
[Figure 8.2: Power curves for the roulette test with 𝑛 = 100 (blue) and 𝑛 = 20
(green). Horizontal axis: true probability of winning; vertical axis: power.
H0: Pr(red wins) = 18/37, significance = 0.05.]
There are a few features I would like you to notice, all of which are common to
most regularly used tests:
• The power curve reaches its lowest value at the red point (18/37, 0.05).
Note that 18/37 is the parameter value under the null, and 0.05 is the size
of the test. In other words:
– The power is always at least as big as the size, and is usually bigger.
– We are more likely to reject the null when it is false than when it is
true. That’s good!
– When a test has this desirable property, we call it an unbiased test.
• The power increases as 𝜃 gets further from the null.
– That is, we are more likely to detect unfairness in a game that is very
unfair than in one that is only a little unfair.
• Power also increases with the sample size; the blue line (𝑛 = 100) is above
the green line (𝑛 = 20).
FYI
P values
The convention of always using a 5% significance level for hypothesis tests
is somewhat arbitrary and has some negative unintended consequences:
1. Sometimes a test statistic falls just below or just above the critical
value, and small changes in the analysis can change a result from
reject to cannot-reject.
2. In many fields, unsophisticated researchers and journal editors mis-
interpret “cannot reject the null” as “the null is true.”
A common solution is to report the p-value of the test: the smallest significance
level at which we would reject the null. For example:
• If the p-value is 0.43 (43%) we would not reject the null at 10%,
5%, or 1%.
• If the p-value is 0.06 (6%) we would reject the null at 10% but not
at 5% or 1%.
• If the p-value is 0.02 (2%) we would reject the null at 10% and 5%
but not at 1%.
• If the p-value is 0.001 (0.1%) we would reject the null at 10%, 5%,
and 1%.
The p-value of a test is simple to calculate from the test statistic and its
distribution under the null. I won’t go through that calculation here.
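For the curious, here is a sketch of one common convention (doubling the
smaller tail probability), applied to our roulette example with 40 wins in 100
games:

t_n <- 40
# Tail probabilities under the Binomial(100, 18/37) null distribution
p_lower <- pbinom(t_n, size = 100, prob = 18/37)          # Pr(t_n <= 40)
p_upper <- 1 - pbinom(t_n - 1, size = 100, prob = 18/37)  # Pr(t_n >= 40)
2 * min(p_lower, p_upper)                                 # two-sided p-value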
In order for a test statistic to work, its exact probability distribution must
be known under the null hypothesis. The example test in the previous section
worked because it was based on a sample frequency, a statistic whose probability
distribution is relatively easy to calculate. Unfortunately, most statistics do not
have a probability distribution that is easy to calculate.
FYI
CENTRAL LIMIT THEOREM: Let 𝑥̄𝑛 be the sample average from
a random sample of size 𝑛 on the random variable 𝑥𝑖 with mean 𝐸(𝑥𝑖) =
𝜇𝑥 and variance 𝑣𝑎𝑟(𝑥𝑖) = 𝜎²𝑥. Then the standardized sample average

𝑧𝑛 = √𝑛 (𝑥̄𝑛 − 𝜇𝑥)/𝜎𝑥

converges in distribution to a standard normal random variable:

𝑧𝑛 →𝐷 𝑧 ∼ 𝑁(0, 1)
What about statistics other than the sample average? Well it turns out that
Slutsky’s theorem also extends to convergence in distribution.
FYI
𝑠𝑛 →𝐷 𝑠 ⟹ 𝑔(𝑠𝑛) →𝐷 𝑔(𝑠) for any continuous function 𝑔(⋅)
The implication here is that nearly all statistics have a sampling distribution
that can be approximated using the normal distribution if the sample size is
large enough.
Having described the general framework and a single example, we now move on
to the most common application: constructing hypothesis tests and confidence
intervals on the mean in a random sample.
Let 𝐷 = (𝑥1 , … , 𝑥𝑛 ) be a random sample of size 𝑛 on some random variable 𝑥𝑖
with unknown mean 𝐸(𝑥𝑖 ) = 𝜇𝑥 and variance 𝑣𝑎𝑟(𝑥𝑖 ) = 𝜎𝑥2 .
Let the sample average be:

𝑥̄𝑛 = (1/𝑛) ∑𝑖 𝑥𝑖

the sample variance be:

𝑠²𝑥 = (1/(𝑛 − 1)) ∑𝑖 (𝑥𝑖 − 𝑥̄𝑛)²

and the sample standard deviation be:

𝑠𝑥 = √𝑠²𝑥
These statistics are easily calculated from the data, and we have previously
discussed their properties in detail.
Suppose we want to test the null hypothesis that the mean is one against the
alternative that it is not:

𝐻0 ∶ 𝜇𝑥 = 1
𝐻1 ∶ 𝜇𝑥 ≠ 1
Having stated our null and alternative hypotheses, we need to construct a test
statistic.
Remember that our test statistic needs to have a known distribution under the
null, and a different distribution under the alternative.
The typical test statistic we use in this setting is called the T statistic, and
takes the form:
𝑡𝑛 = (𝑥̄𝑛 − 1)/(𝑠𝑥/√𝑛)
The idea here is that we take our estimate of the parameter (𝑥̄𝑛), subtract its
expected value under the null (1), and divide by an estimate of its standard
deviation (𝑠𝑥/√𝑛). We can add and subtract the unknown true mean 𝜇𝑥 to
get:

𝑡𝑛 = (𝑥̄𝑛 − 𝜇𝑥 + 𝜇𝑥 − 1)/(𝑠𝑥/√𝑛)   (8.1)
   = (𝑥̄𝑛 − 𝜇𝑥)/(𝑠𝑥/√𝑛) + (𝜇𝑥 − 1)/(𝑠𝑥/√𝑛)   (8.2)
The first part of this expression is a random variable with a mean of zero and a
variance of (about) one. The second part of the expression is exactly zero when
𝐻0 is true, and not exactly zero when it is false.
Recall that we need the probability distribution of 𝑡𝑛 to be known when 𝐻0 is
true, and different when it is false. The second criterion is clearly met, and the
first criterion is met if we can find the probability distribution of
(𝑥̄𝑛 − 𝜇𝑥)/(𝑠𝑥/√𝑛).
Under the null our test statistic looks just like this, but with the sample standard
deviation 𝑠𝑥 in place of the population standard deviation 𝜎𝑥. It turns out that
this substitution does not matter much in large samples: by the central limit
theorem and Slutsky's theorem, 𝑡𝑛 is approximately standard normal under the
null when 𝑛 is large.

[Figure: PDF of 𝑡𝑛 under the null, which is approximately the standard normal
density.]
Therefore, if we want a test that has the asymptotic size of 5%, we can use
Excel or R to calculate critical values. In Excel, the function would be NORM.INV
or NORM.S.INV, and the formulas would be:
• 𝑐𝐿 : =NORM.S.INV(0.025) or =NORM.INV(0.025,0,1).
• 𝑐𝐻 : =NORM.S.INV(0.975) or =NORM.INV(0.975,0,1).
Both formulas give 𝑐𝐿 ≈ −1.96 and 𝑐𝐻 ≈ 1.96. These particular critical values
are so commonly used that I want you to remember them.
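The same critical values can be calculated in R with the qnorm() function:

qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964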
[Figure: PDF of the T statistic for 𝑛 = 5, 𝑛 = 10, 𝑛 = 30, and 𝑛 = ∞, with 𝑥
assumed to be normally distributed. As 𝑛 grows, the PDF approaches the
standard normal.]
• If you reject this null, you have concluded that the treatment has some
effect. However, that does not rule out the possibility that the effect of the
treatment is very small.
• If you fail to reject this null, you cannot rule out the possibility that the
treatment has no effect. However, this does not rule out the possibility
that the effect is very large.
The solution to this would be to do a hypothesis test for every possible value of
𝜃, and classify them into values that were rejected and not rejected. This is the
idea of a confidence interval.
A 95% confidence interval for a parameter 𝜃 is an interval (𝐶𝐼𝐿, 𝐶𝐼𝐻)
constructed so that Pr(𝐶𝐼𝐿 ≤ 𝜃 ≤ 𝐶𝐼𝐻) = 0.95. Note that 𝜃 is a fixed (but
unknown) parameter, while 𝐶𝐼𝐿 and 𝐶𝐼𝐻 are statistics calculated from the data.
How do we calculate confidence intervals? It turns out to be entirely straight-
forward: confidence intervals can be constructed by inverting hypothesis tests.
The 95% confidence interval is the set of parameter values that would not be
rejected by a test with 5% size.
For example, suppose that red wins on 40 of the 100 games. Then a 95%
confidence interval for 𝑝𝑟𝑒𝑑 is:
## 0.32 to 0.49
Notice that the confidence interval includes the fair value of 0.486 but it also
includes some very unfair values. In other words, while we are unable to rule
out the possibility that we have a fair game, the evidence that we have a fair
game is not very strong.
Confidence intervals for the mean are very easy to calculate. Again we construct
them by inverting the hypothesis test.
Pick any 𝜇0. To test the null

𝐻0 ∶ 𝜇𝑥 = 𝜇0

we would construct the test statistic 𝑡𝑛 = (𝑥̄𝑛 − 𝜇0)/(𝑠𝑥/√𝑛) and fail to reject
the null whenever

𝑐𝐿 < 𝑡𝑛 < 𝑐𝐻

The 95% confidence interval for 𝜇𝑥 is the set of all values of 𝜇0 that we fail to
reject.
All that remains is to choose a confidence/size level, and decide whether to use
an asymptotic or finite sample test.
If we are using the asymptotic approximation to construct a 95% confidence
interval, then the 5% asymptotic critical values are 𝑐𝐿 ≈ −1.96 and 𝑐𝐻 ≈ 1.96,
and the confidence interval is:

𝐶𝐼 = 𝑥̄ ± 1.96 𝑠𝑥/√𝑛
In other words, the 95% confidence interval for 𝜇𝑥 is just the point estimate
plus or minus roughly 2 standard errors.
If we have a small sample, and choose to assume normality rather than using
the asymptotic approximation, then we need to use the slightly larger critical
values from the 𝑇𝑛−1 distribution. For example, if 𝑛 = 5, then 𝑐𝐿 ≈ −2.78,
𝑐𝐻 ≈ 2.78 and the 95% confidence interval is:
𝐶𝐼 = 𝑥̄ ± 2.78 𝑠𝑥/√𝑛
As with hypothesis tests, finite sample confidence intervals are typically more
conservative (wider) than their asymptotic cousins, but the difference becomes
negligible as the sample size increases.
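As an illustration with made-up numbers (the values of xbar, s_x, and n below
are hypothetical), the asymptotic interval is easy to compute in R:

xbar <- 25    # sample average (hypothetical)
s_x <- 10     # sample standard deviation (hypothetical)
n <- 400      # sample size (hypothetical)
xbar + c(-1.96, 1.96) * s_x / sqrt(n)
## [1] 24.02 25.98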
Chapter review
In this chapter we have learned to formulate and test hypotheses, and to con-
struct confidence intervals. The mechanics of doing so are complicated, but you
should not let the various formulas distract you from the more basic idea of
evidence: hypothesis testing is about how strong the evidence is in favor of (or
against) a particular true/false statement about the data generating process,
and confidence intervals are about finding a range of values for a parameter
that are consistent with the observed data.
In practice, modern statistical packages automatically calculate and report con-
fidence intervals for most estimates, and report the result of some basic hypoth-
esis tests as well. When you need something more complicated, it is usually just
a matter of looking up the command. I will ask you to do these calculations
yourself so you get used to them, but it is more important that you can correctly
interpret the results.
This is the last primarily theoretical chapter. The remaining chapters will be
oriented towards data and applications.
Practice problems
Answers can be found in the appendix.
SKILL #1: Identify parameter, null and alternative
4. Consider the setting from problem 3 above, and suppose that the true
value of 𝜇 is some number 𝜇1 ≠ 𝜇0 . Write an expression describing 𝑡 as
the sum of (a) a random variable that has the 𝑇𝑛−1 distribution and (b)
a random variable that is proportional to 𝜇1 − 𝜇0 .
9. Suppose you estimate the effect of a university degree on earnings, run
a test of the null hypothesis of no effect at the 5% level of significance,
and fail to reject the null. Based on this information, classify each of these
statements as “probably true”, “possibly true”, or “probably false”:
a. A university degree has no effect on earnings.
b. A university degree has some effect on earnings.
c. A university degree has a large effect on earnings.
10. Suppose you estimate the effect of a university degree on earnings at age
30, and your 95% confidence interval for the effect is (0.10, 0.40), where
an effect of 0.10 means a degree increases earnings by 10% and an effect
of 0.40 means that a degree increases earnings by 40%. Based on this
information, classify each of these statements as “probably true”, “possibly
true”, or “probably false”:
a. A university degree has no effect on earnings.
b. A university degree has some effect on earnings.
c. A university degree has a large effect on earnings, where “large”
means at least 10%.
d. A university degree has a very large effect on earnings, where “very
large” means at least 50%.
Chapter 9
An introduction to R
As we have seen, Excel is a useful tool for both cleaning and analyzing data. R
is an application that has many of the same features as Excel, but is specially
designed for statistical analysis. It is a little more complex, but more powerful
in many important ways. This chapter will introduce you to some of the basic
concepts of R and associated tools such as R Markdown, RStudio, and the
Tidyverse. We will later use these tools to read and analyze data, and to create
publication-quality graphs that are well beyond what can be done in Excel.
Goals
Chapter goals
In this chapter we will learn how to:
In this course, we will only have time to learn a little bit about R, so my goal is
not to give a comprehensive treatment. My goal here is primarily to introduce
you to the terminology and concepts of R, and to show you a few applications
where R outshines Excel. You will learn much more about R in ECON 333 and
(if you take it) ECON 334.
Start the program RStudio. You should see a window divided into several
panes.
You can run commands and scripts in R itself, but without RStudio you won’t
have all these handy extra features. So most people these days use RStudio or
another IDE.
RStudio normally displays three or four open windows, each of which has tabs
you can select to access different features. We will not use most of them, but
some of them will be very handy indeed.
print("Hello world!")
## [1] "Hello world!"
As you type your command in, you may notice that RStudio shows various
pop-ups with helpful information about the command. It will also auto-complete
your command for you.
1. Press the up-arrow key in the Console window to show the most recently
executed command. If you press it a second time it gives you the command
before that, and so on.
2. Look at the History window in the upper right corner to see a full list
of recently executed commands. You can double-click on any command
in the window to copy it to the Console window.
Once you have copied the previous command, you can edit it before pressing
<enter>.
9.1.2 Scripts
The Console window is ideal for simple tasks and experimentation, and we will
continue using it regularly. But in order to create reproducible research and take
full advantage of R’s capabilities, we will need to write and execute scripts.
A script is just a text file containing a sequence of R commands. By convention,
an R script should have the .R extension, but any text file will work.
To create an R script
1. Select File > New File > R Script from the menu.
2. Enter a valid command in the first line of the file, for example
print("Hello world!")
3. Enter another valid command in the second line of the file, for example
print("Goodbye world?")
4. Select File > Save to save your file.
• Name it Chapter10Example.R
You will see the results of your commands in the Console window.
9.1.3 R Markdown
RStudio can also run text files written in the R Markdown format. R Mark-
down files have the .Rmd extension.
FYI
What is Markdown?
Markdown is a markup language just like HTML, which means that it
is a way of writing documents in text files whose content is readable
directly but can also be formatted and displayed (rendered) in a visually
appealing way.
The original idea of HTML was that content creators could write their
content in text files (pages), with a few HTML tags sprinkled around
to give the browser information about structure, and then the browser
would display the page. However, as web users demanded fancy graphics,
custom colors, interactivity, and mobile-friendly display, HTML became
much more complicated.
Markdown was created as a radically simplified markup language. The
basic idea is to use common conventions for how to indicate structure in
a text file.
Markdown documents can also include links and pictures (by simply
providing the URL or file name), tables, and all sorts of other things.
1. Select File > New File > R Markdown from the menu.
RStudio has taken the liberty of creating an example R Markdown file that you
can use as a template.
You can run the R code in an R Markdown document in one of two ways:
You can run and display results for individual chunks of code. A chunk is a
few lines of R code surrounded by a code fence.
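For example, a minimal chunk might look like this in the .Rmd file (an
illustration, not the template's own contents):

```{r}
x <- 2 + 2
print(x)
```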
Example 9.5. Running code chunks
To run a code chunk in our R Markdown file:
As you can see, the code in the chunk will run and the results will be displayed
below.
You can also knit the entire R Markdown file into an HTML/word/PDF docu-
ment that includes both the text and the R results by pressing the Knit button.
It will take a few moments to process the file, and then the HTML file will open
in a browser.
By default, R Markdown files usually knit to HTML, but we can knit to other
file formats including Word and PDF. We will stick to HTML in this course.
FYI
R Markdown resources
R Markdown is as simple or as complicated as you want to make it. A
plain text file with a few lines of content is a valid R Markdown file, and
like HTML, Markdown is designed so it still “works” if you do something
unexpected.
If you want to try something new in R Markdown, or have forgotten how
to do something, the most useful resource is the one-page R Markdown
Cheat sheet. It is available directly in RStudio, or at https://github.
com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf. You can also
just search for “r markdown cheatsheet”.
RStudio has many other features, most of which we will not use. But I would
like to highlight a few that may prove useful.
In the lower right window:
• The Files tab gives you easy access to files in the current active folder.
• The Plots tab will display plots, when you create them.
• The Packages tab is useful for managing packages (more on them later)
• The Help tab allows you to access R’s help system.
In the menu:
• You can select Session > Restart R to clear the memory and restart the
current R session.
We are done for now, so close RStudio. You may get a warning message that
looks something like this:
Never click on the Save button here, as it would cause R to save the current
state of its memory and re-load it next time you start R. In the interest of
reproducibility, you should start R “clean” every time. Click on the Don't
Save button, and you will exit RStudio.
9.2.1 Expressions
An expression is any piece of R code that can be evaluated on its own. For
example:
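A few simple illustrations:

2 + 3            # arithmetic: evaluates to 5
sqrt(16)         # a function call: evaluates to 4
"Hello world!"   # a string literal: evaluates to itself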
You can execute any valid R expression as a command, and have it display the
value it returns.
You can also use any valid R expression within a larger expression.
sqrt(4 + 5)
## [1] 3
In addition, some expressions have a side effect. That is, they make something
happen: they cause something to appear on your computer screen, or change a
file, or change something in R’s memory.
For example, the expression hist(rnorm(100)) draws 100 standard normal
random numbers, returns an object describing their histogram, and has the
side effect of displaying the histogram itself:

[Figure: Histogram of rnorm(100), with Frequency on the vertical axis.]
Although we call it a “side effect”, the side effect is often the main purpose of
the expression.
9.2.2 Assignment
We can use the <- or assignment operator to assign the results of an expression
to a named variable. We can then use that variable in later expressions.
For example, the R command x <- 2 assigns the value 2 to the variable x.
Any subsequent code can then refer to the variable x in its own calculations
or actions.
x <- 5
print(x)
## [1] 5
x
## [1] 5
9.2.3 Vectors
The most basic data objects in R are vectors, which come in two types: atomic
vectors and lists. The elements of an atomic vector can be:
• text strings
• numbers
• logical values (either TRUE or FALSE)
The elements of an atomic vector need to all be of the same atomic type;
a single vector cannot contain both strings and numbers, for example.
We can construct a vector by enumeration using the c() function:
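For example, the following commands create a vector of strings and a vector
of even numbers; these are the vectors used in the examples below:

fruits <- c("Avocado", "Banana", "Cantaloupe")
evens <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)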
There are many other functions that can be used to construct vectors. Two
particularly useful ones are rep(), which repeats something a given number
of times, and seq(), which creates a sequence:
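For example:

rep("abc", times = 3)
## [1] "abc" "abc" "abc"
seq(from = 2, to = 20, by = 2)
## [1] 2 4 6 8 10 12 14 16 18 20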
The subscript operator [] can be used to select part of a vector. You can
enumerate the indexes of the elements you want:
# You can give a single index: evens[2] is the 2nd element in evens
x <- evens[2]
print(x)
## [1] 4
# You can give a vector of indices: evens[c(2,5)] is a vector containing
# the 2nd and 5th elements in evens
x <- evens[c(2, 5)]
print(x)
## [1] 4 10
# You can give a range of indices: evens[2:5] is a vector containing the
# 2nd, 3rd, 4th and 5th elements in evens
x <- evens[2:5]
print(x)
## [1] 4 6 8 10
You can also provide logical values instead of numeric indices. R will then
operate on those elements whose corresponding item has the value TRUE:
print(evens)
## [1] 2 4 6 8 10 12 14 16 18 20
# This creates a vector of the same length as evens, that contains TRUE for all
# values less than 10, and FALSE for all other values
lessthan10 <- (evens < 10)
print(lessthan10)
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# This creates a vector that includes only those elements of evens for which
# lessthan10 is TRUE
x <- evens[lessthan10]
print(x)
## [1] 2 4 6 8
# This is a quicker way of accomplishing the same result
x <- evens[evens < 10]
print(x)
## [1] 2 4 6 8
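# Assigning evens to x replaces the previous contents of x with a full copy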
x <- evens
print(x)
## [1] 2 4 6 8 10 12 14 16 18 20
# This assigns the number 1000 to the 2nd element in x
x[2] <- 1000
print(x)
## [1] 2 1000 6 8 10 12 14 16 18 20
9.2.4 Lists
The other type of vector is a list. A list is a vector whose elements are themselves
other vectors. These vectors can be any type, so we can use lists inside lists to
build very complex objects.
Lists can be built using the list() function:
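For example, the following builds a list containing our two atomic vectors plus
a vector of odd numbers (these are the objects used in the examples below):

odds <- seq(from = 1, to = 19, by = 2)
everything <- list(fruits = fruits, evens = evens, odds = odds)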
You can access part of a list by specifying its numerical index inside of the [[]]
operator:
print(everything[[2]])
## [1] 2 4 6 8 10 12 14 16 18 20
If the items in a list are named, you can also access them by name using either
[[]] or $ notation
print(everything[["evens"]])
## [1] 2 4 6 8 10 12 14 16 18 20
print(everything$fruits)
## [1] "Avocado" "Banana" "Cantaloupe"
You can also use the $ notation to add new items to an existing list:
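For example:

everything$allnumbers <- c(evens, odds)
print(everything)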
## $fruits
## [1] "Avocado" "Banana" "Cantaloupe"
##
## $evens
## [1] 2 4 6 8 10 12 14 16 18 20
##
## $odds
## [1] 1 3 5 7 9 11 13 15 17 19
##
## $allnumbers
## [1] 2 4 6 8 10 12 14 16 18 20 1 3 5 7 9 11 13 15 17 19
9.2.5 Attributes
Any object can also have attributes. The attributes of an object are a list
associated with the object that provides additional information.
Let’s see if any of the objects we have created have attributes:
print(attributes(fruits))
## NULL
print(attributes(evens))
## NULL
print(attributes(everything))
## $names
## [1] "fruits" "evens" "odds" "allnumbers"
Note that:
• our two atomic vectors have attributes NULL. That’s R’s way of saying
they have no attributes.
• our list stores the names of its four elements in the $names attribute.
R has hundreds of standard object types that are built from atomic vectors,
lists, and attributes. These object types include matrices, arrays, data sets,
objects structured as the output of a particular statistical analysis, descriptions
of graphs, and so on. Users can also define their own object types, and there
is an extensive system for generic functions and object-based programming (if
you know what that is).
Let’s get to know the main features of functions in R by considering the seq()
function. We have already seen this function: it is used to create a vector with
a sequence of numbers, much like Excel’s Series tool.
2. You can obtain help on any function by entering ? and its name in the
console window
• Try ? seq.
4. Every function returns a value. This is even true for functions like
print(). To see this:
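# A minimal illustration: assign the result of a print() call to a variable
y <- print("Hello world")
## [1] "Hello world"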
As you can see, print("Hello world") returns “Hello world” as its value.
5. Some functions also produce side effects, as we have described earlier.
For example, the following command prints the number 3 (a side effect) and
also returns the value 3, which is saved in y:

y <- print(1 + 2)
## [1] 3
print(y)
## [1] 3
FYI
What is the Tidyverse?
The Tidyverse was created by the data scientist Hadley Wickham (also
one of the key people behind RStudio) as a way of solving some long-
standing problems with R. The Tidyverse is both an R package contain-
ing a set of new functions and data structures as well as a philosophy
about how to analyze data.
The basic structure of R dates back to 1976 (R itself was created in the
early 1990s but is closely based on an earlier program called S). Computer
science has advanced a lot since 1976, so some design aspects of R seemed
like a good idea at the time but would be designed differently today.
Most commonly-used packages including the Tidyverse are open-source, and are
available online from the Comprehensive R Archive Network (CRAN).
Before you can use any package, two steps must be followed:
1. The package must be installed on your computer. This only needs to be
done once per computer.
2. The package must be loaded into your current R session. This must be
done once per session.
Once the package is installed and loaded, you can use its functions and other
features.
Packages can be installed using the install.packages() function.
If you know the name of the CRAN package you want to install, you can provide
it as the argument:
install.packages("tidyverse")
Once installed, the package can be loaded with the library() function:

library("tidyverse")

You can then use the Tidyverse functions and other tools.
1. Our first step can be accomplished using the seq() function, which we
have already used. If you know the name of the function you want to
use, you can access its help page by executing the command ? [function
name here]:
# ? seq
As you can see, the seq() function takes arguments from= (for the starting
point), to= (for the end point), and length.out= (for the total number
of points). Let’s plot the function at 10 points between -4 and 4:
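One way to generate these points (any equivalent call works):

x <- seq(from = -4, to = 4, length.out = 10)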
Note that I’ve picked only 10 points here so that our code is easy to check.
2. The next step is to calculate the standard normal PDF at each of these
points. R is a program for statisticians, so it presumably has that PDF
available as a built-in function. But what if we don’t know its name? We
can just Google “normal pdf in r” and click on a page or two to find out
that the function we need is called dnorm().
p <- dnorm(x)
print(p)
## [1] 0.0001338302 0.0031560163 0.0337736510 0.1640100747 0.3614238299
## [6] 0.3614238299 0.1640100747 0.0337736510 0.0031560163 0.0001338302
plot(x, p)
[Figure: scatter plot of p against x, showing the bell shape of the standard
normal PDF.]
You will see this plot in the Plots tab in the lower right corner of your
screen.
Well, that’s not too bad, but we might want to make some improvements, such
as connecting the points with a line and adding a title and clearer axis labels.
So we can read through the documentation for the plot() function, try a few
things out, and we can produce a much prettier graph by just adding a few
options:
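Here is one possibility (the options used for the book's own figure may differ):

plot(x, p,
     type = "l",                         # connect the points with a line
     main = "The standard normal PDF",   # add a title
     xlab = "x",                         # label the horizontal axis
     ylab = "f(x)")                      # label the vertical axis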
[Figure: the improved plot of the standard normal PDF.]
As you can see, we have a much nicer and clearer looking plot.
Chapter review
Although you will be tested on specific knowledge, you should also keep in mind
the bigger picture: my real goal here is for you to develop some long-lasting skills
that you will find useful in the future. This should be your goal as well.
A year from now, or five years from now, you will probably not be able to
remember exactly what the format of the seq function is, nor will you need to.
Instead I want you to focus on learning how to think about a coding task, how
to find information, and how to design and implement your plans.
FYI
For more information on R
There are many free sources of useful information about R.
Practice problems
Answers can be found in the appendix.
SKILL #1: Perform basic tasks in RStudio
4. Load the tidyverse package (you will need to install it if you have not
already done so), and execute the R code below:
Chapter 10
Advanced data cleaning
In an earlier chapter, we learned some basic data cleaning skills using Excel, and
some introductory R skills. This chapter will build on these skills by teaching
some more advanced tools and concepts.
Goals
Chapter goals
In this chapter we will learn how to:
We will do this while building a data set describing long-run economic growth
in a wide cross-section of countries.
Economics background
The ICP data is needed to account for a simple economic reality: each
country’s GDP is calculated using local prices, but prices of key goods
and services vary dramatically across countries: housing is much more
expensive in Vancouver than in Houston, and a haircut is much cheaper
in Mumbai than in London. The PWT research team use the results of
the ICP to convert each country’s GDP data to comparable (PPP) units.
The current version of the PWT is available online at http://www.ggdc.
net/pwt.
The rest of the chapter will use these three data files.
a I should mention that the Tax Foundation is not an entirely neutral organization
- it generally supports lower taxes - and so my use of their data here should not be
taken as expressing an opinion on their policy views. It is not unusual for useful data
to come from politically-motivated sources; for many years the best data on tobacco
Nearly every software application we might use for data analysis has a native
file format specifically designed for saving and reading data in that application.
For example, Excel’s native format is the .xlsx file format.
Applications work most seamlessly with their native format. However, most
modern applications can also import (read) and export (save) files in other
formats including native formats of other popular programs.
There are also several standard or open file formats that are commonly used to
share data across applications. Most of these open file formats are built from
text files.
• You can view and edit text files using a simple application called a text
editor.
Files that are not text files are usually called binary files. Binary files are
generally not human-readable.
Fixed-width text files represent a table of data by allocating the same number
of characters to each cell in the same column. They can be formatted to be
readable to humans like this:
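For example, a small table might be stored like this (hypothetical rows,
following the first layout described below, with the year of birth starting on
the 24th character and the year of death on the 40th):

Name                   YearOfBirth     YearOfDeath
Elizabeth I            1533            1603
Mary, Queen of Scots   1542            1587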
The key feature of a fixed-width file is that each column begins at the same
point in each line. For example,
• In the first file, the year of birth always starts on the 24th character of
each line, and the year of death always starts on the 40th character.
• In the second file, the year of birth always starts on the 21st character of
each line, and the year of death always starts on the 25th character.
You can open fixed-width text files in Excel using the Text Import Wizard.
Open the file CountryCodes.txt in a text editor. It will look like this:
This is a fixed format file in which the CountryName variable starts on the
1st character of each line and the CountryCode variable starts on the 36th
character.
Excel seems to have guessed correctly in this case, so go ahead and select
Next>.
6. The final dialog box of the Text Import Wizard provides some options for
changing the data format for each individual column:
Our worksheet now contains the imported data, correctly arranged into cells.
Keep this worksheet open.
The most common and useful general-purpose format for tabular data is called
the comma separated values or CSV file format. A CSV file is just a text
file with the following features:
• Each line in the file represents one row of the table.
• The cells within each row are separated by commas (the delimiter).
• Cell values that contain commas or other special characters are enclosed
in double quotes.
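For example, the hypothetical fixed-width table from earlier would look like
this as a CSV file:

Name,YearOfBirth,YearOfDeath
Elizabeth I,1533,1603
"Mary, Queen of Scots",1542,1587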
Notice that the quotes around “Mary, Queen of Scots” are needed in order
for the comma to be interpreted as an actual comma rather than a delimiter
between cells.
Example 10.2. Exporting a CSV file from Excel
To save the Excel worksheet we created in the previous section as a CSV file:
1. Select File > Save As.
2. Select CSV (Comma delimited) (*.csv) from the drop-down box.
3. Enter a file name (I suggest CountryCodes.csv) in the text box.
4. Select Save.
• You may get the warning message: The selected file type does not
support workbooks that contain multiple sheets. To save only the
active sheet click OK…. If so, select OK.
5. Close Excel.
• You may get another warning message: Want to save your
changes to CountryCodes.csv? If so, select Don't save.
You can see the exact contents of a CSV file by opening it in a text editor.
Example 10.3. Opening a CSV file as text
Use your preferred text editor to open the file CountryCodes.csv. It will look
something like this:
If a CSV file has .csv at the end of its file name, you can open it in Excel by
just double-clicking on it. Otherwise, you can use the Text Import Wizard.
CSV files (and text files in general) have several important limitations relative
to regular Excel (.xlsx) files:
1. A CSV file can only contain one table, while an Excel file can contain
multiple tables.
2. A CSV file can only contain cell values, and cannot contain:
• Formulas
• Formatting
• Graphs
• Any other fancy Excel features
These limitations are also the main advantage of CSV files: they are a simple
way of reporting tabular data and can be read by virtually any program, or
even by a human.
• Space-delimited text files are just like CSV files, but use spaces as the
delimiter. For example, a space-delimited version of our table might look
like this:
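For example (quoting any value that itself contains spaces):

Name YearOfBirth YearOfDeath
"Elizabeth I" 1533 1603
"Mary, Queen of Scots" 1542 1587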
• Tab-delimited text files are just like CSV files, but use a tab character
as the delimiter.
Like fixed-width and CSV files, space-delimited and tab-delimited files can be
imported into Excel using the Text Import Wizard.
Excel also has a wide variety of tools for obtaining data from various databases,
online services, etc. We do not have time to explore all of these tools, but you
can select Data from the menu bar and look around to see what is available.
Sometimes data that should be spread across several columns ends up crammed
into a single column. This can happen if a text file has been incorrectly
imported, or if you have copied-and-pasted data from a PDF file or web page.
Fortunately, Excel has a way of fixing that: the Text to Columns tool.
1. We will want to combine multiple data files into a single Excel file.
2. We will want to link observations in one table with observations in another
table.
3. We will want to construct new variables that are group-level aggregates
(counts, averages, or sums) of existing variables.
This section will go through the most important tools and techniques of com-
bining data in these ways.
Data tables often come in multiple files, especially if they are from different
sources. It is possible for one Excel file to use data from another Excel file (or
even a non-Excel file or online data source), but doing so can lead to problems
if not done very carefully. For example, if one file has a formula that references
another file, that formula may stop working if either file is moved to another
folder.
An easier approach in most situations is to just combine everything into a single
Excel workbook.
Before proceeding, it’s a good idea to make sure our data is tidy.
1. Take a look at each of our three main worksheets and identify if they need
to be adjusted in any way to meet our criteria for tidy data:
• CountryCodes is tidy and does not need alteration.
• TaxRates is tidy and does not need alteration.
One of the most common tasks in cleaning data is to combine variables from two
or more data sets. In order to do this, we need to match or link observations
in the two tables on the basis of one or more ID variables or keys.
All statistics packages have tools to link observations. In Excel, the table you are
obtaining data from is called a lookup table and the key tool is the XLOOKUP()
function:

=XLOOKUP(lookup_value, lookup_array, return_array)

XLOOKUP() searches for lookup_value in the range lookup_array and returns
the corresponding element of the range return_array.
We could try to match on country name, but the country names are not exactly
the same in the two tables. For example, the same country is called “South
Korea” in TaxRates and “Republic of Korea” in Data. This is a common problem
with names, so people who work with data usually prefer standardized codes.
One option is to simply change the country names in one of our tables, but that
goes against our general principle that we avoid changing data. A better solution
is to use a crosswalk table that gives the country code for each country name,
including name variations like “Republic of Korea” and “South Korea.” The
CountryCodes worksheet is a crosswalk table I have created for this purpose.
If you take a look at it, you will notice that there are observations for both
“Republic of Korea” and “South Korea.”
Let’s use our crosswalk table and the XLOOKUP() function to add a country code
to the GrowthData worksheet.
Column A should now display the ISO country code for each observation.
• lookup_value should combine the country code (cell A2) with the
year (2019), so it should be CONCAT(A2,"2019").
• lookup_array should be the full range of the CountryYear variable
in the Data worksheet (Data!A2:A12811).
• return_array should be the full range of the pop variable in the
Data worksheet (Data!H2:H12811).
3. Make the cell references absolute where needed, which will result in the for-
mula =XLOOKUP(CONCAT(A2,"2019"),Data!A$2:A$12811,Data!H$2:H$12811)
4. Copy/paste or fill the formula to the remaining cells in column
K.
We can also create another variable whose value is the country’s population
in 1990.
You may note that the PWT goes all the way back to 1950, and may wonder
why I picked 1990 as my starting point. The reason for this is many countries
do not have data going back to 1950, especially those countries that were part
of or allied with the Soviet Union. By 1990, the PWT has data for almost all
countries.
Normally, XLOOKUP() looks for an exact match and returns an error if no match
is found. This is a good default, but the optional arguments if_not_found and
match_mode can be used if you want to do something other than that.
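For example (with hypothetical cell ranges), the if_not_found argument
supplies a value to return instead of an error:

=XLOOKUP(A2, CountryCodes!A:A, CountryCodes!B:B, "no match")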
FYI
Related functions
XLOOKUP() is a relatively new addition to Excel, so you may see files that
use the older functions VLOOKUP() and HLOOKUP(). The syntax of these
functions is somewhat different, but the underlying idea is the same.
Sometimes we will want to create a new variable that is the sum or average of
another variable within some group, or the count of the number of observations
in that group.
We can construct group-level averages using the AVERAGEIFS() function. It
takes three arguments:
• average_range: the range containing the values to average.
• criteria_range1: the range to check against the criterion.
• criteria1: the criterion that determines which observations are included.
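For example (hypothetical cell ranges), the formula below averages the values
in column C over the rows whose value in column A matches the value in cell
A2:

=AVERAGEIFS(C:C, A:A, A2)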
The average investment share should range from a minimum of 0.13 for Bulgaria
to a maximum of 0.44 for Cyprus.
These functions can all be used to construct summary statistics (as in Chapter
6) or to construct group aggregate variables within our data set (as in this
chapter).
FYI
Related functions
The COUNTIFS(), AVERAGEIFS(), SUMIFS(), MINIFS() and MAXIFS()
functions all allow for multiple criteria to be used. Excel also includes a
set of older functions COUNTIF(), AVERAGEIF(), SUMIF(), MINIF() and
MAXIF() that allow only a single criterion. You may see these in older
worksheets.
Sometimes an Excel formula does not produce a valid result. When that hap-
pens, Excel will return an error code to indicate what has gone wrong. Excel’s
most commonly-used error codes are:
• #VALUE! means you have given a function the wrong type of argument
– Example: =LN("THIS IS A STRING AND NOT A NUMBER")
• #NAME? means you have referenced a function that does not exist.
– Example: =NOTAREALFUNCTION(1)
• #DIV/0! means you have divided by zero.
– Example: =1/0
– This error also appears when you take the AVERAGE() of a range of
cells that does not include any numbers.
• #REF! means you have referenced a cell that does not exist.
– This usually happens when you delete a row or column that the
formula refers to.
• #NUM! means that the result of your numeric calculation is not a real
number.
– Example: =SQRT(-1)
• #N/A means that a lookup function such as XLOOKUP() was unable to find
a match.
One potential source of errors is the entry of invalid data. Invalid data means
that a particular cell in our data takes on a value that is not in the set of possible
values for that variable: for example, a negative number in a variable that can
only be positive, or a text string in a variable that should be numeric.
Invalid data can result from typos or other human error, or it can result from
imperfect translation of data between different data sources.
Excel has a set of data validation tools to help prevent and fix invalid data.
Data validation can be accessed by selecting Data from the menu and then
clicking on the Data Validation button.
Once we have added data validation to a range of cells, several things will
happen. First, Excel will not allow you to enter invalid data into a cell with
validation turned on. This feature will help avoid problems in the first place.
Select Cancel.
The Data Validation tool also allows you to identify previously-entered obser-
vations with invalid data.
To remove the circles, select Data > Data Validation > Clear Validation
Circles.
2. Select Data > Data Validation. The Data Validation dialog box will
appear.
3. Select Clear All and then OK.
You can confirm that data validation has been removed by entering an invalid
value (Excel will now let you do this) or by adding validation circles (there will
not be any).
One of the most common Excel problems occurs when someone who is analyzing
a data set unintentionally changes the data.
• This can happen because you accidentally touch your keyboard and over-
write the contents of a cell.
• It can also happen when an inexperienced analyst breaks the rule about
keeping the original data untouched, and makes a mistake.
1. Right-click on the Data tab at the bottom of the page. A menu will
appear.
2. Select Protect Sheet. The Protect Sheet dialog box will provide several
options:
3. Select OK.
Now that you have protected the Data worksheet, try to edit any cell. You will
get an error message.
In many cases, you will only want to protect certain cells in a given sheet. To
do this, remember the rules: all cells are initially locked and all worksheets are
unprotected. So we will need to unlock all cells except the ones we want locked,
and then protect the sheet.
Example 10.16. Protecting part of a worksheet
To protect and lock columns A through N in GrowthData but keep the other
columns unlocked:
Note that you have to do it in this order - Excel will not let you unlock locked
cells after you have protected the sheet.
You will now get an error message if you try to edit any of the locked cells, but
you can edit the unlocked cells in any way you like.
Finally, you can remove protection from any worksheet by simply selecting Home
> Format > Unprotect Sheet.
You can download the complete Excel file with all data cleaning from
this chapter at https://bookdown.org/bkrauth/BOOK/sampledata/GrowthData.xlsx
Our first step will be reading the data in from the CSV file. The Tidyverse
function to do this is called read_csv(). It has one required argument: the
name of the CSV file.
# The code below accesses the online data. You can also download the file
# 'EmploymentData.csv' and change the argument to read_csv() to the local file
# location
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv")
## Rows: 541 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): MonthYr, Party, PrimeMinister
## dbl (8): Population, Employed, Unemployed, LabourForce, NotInLabourForce, Un...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
As you can see, R guesses each variable’s data type, and reports its guess. It
is always a good idea to read this output and make sure that everything is the
way that we want it. For example, MonthYr has been read as a text (chr)
variable; in the next chapter we will convert it to a proper date.
We have assigned the data in the CSV file to the variable EmpData.
FYI
Additional options
Our data file happens to be a nice and tidy one, so read_csv() worked
just fine with its default options. Not all data files are so tidy, so
read_csv() has many optional arguments. There are also functions
for other delimited file types, including read_tsv() for tab-delimited
files and read_delim() for files with an arbitrary delimiter.
print(EmpData)
## # A tibble: 541 x 11
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce UnempRate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 16852. 9637. 733 10370. 6483. 0.0707
## 2 2/1/1976 16892 9660. 730 10390. 6502. 0.0703
## 3 3/1/1976 16931. 9704. 692. 10396. 6535 0.0665
## 4 4/1/1976 16969. 9738. 713. 10451. 6518. 0.0682
## 5 5/1/1976 17008. 9726. 720 10446. 6562 0.0689
## 6 6/1/1976 17047. 9748. 721. 10470. 6577. 0.0689
## 7 7/1/1976 17086. 9760. 780. 10539. 6546. 0.0740
## 8 8/1/1976 17124. 9780. 744. 10524. 6600. 0.0707
## 9 9/1/1976 17154. 9795. 737. 10532. 6622. 0.0699
## 10 10/1/1976 17183. 9782. 783. 10565. 6618. 0.0741
## # ... with 531 more rows, and 4 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>
Tibbles can be quite large, so the print() function will usually show an abbre-
viated version of the table.
We can also see the whole table by executing the command View(EmpData) or
through RStudio:
You will see a spreadsheet-like display of EmpData. As in Excel, you can sort
and filter this table. Unlike Excel, you cannot edit it here.
There are several R functions available for exploring the properties of a data
table.
We can obtain the column names of a tibble using the names() function:
names(EmpData)
## [1] "MonthYr" "Population" "Employed" "Unemployed"
## [5] "LabourForce" "NotInLabourForce" "UnempRate" "LFPRate"
## [9] "Party" "PrimeMinister" "AnnPopGrowth"
and we can count the rows and columns with nrow() and ncol() respectively:
nrow(EmpData)
## [1] 541
ncol(EmpData)
## [1] 11
We can select a single variable using the $ operator, and find its length using
the length() function:

length(EmpData$UnempRate)
## [1] 541

As you can see, the length() function returns the length of a vector.
Chapter review
In this chapter, we learned about many loosely-related topics. The underlying
theme that connects them all is that real data can be complicated. We will
often need to get data from multiple sources and in varying formats, and we
cannot trust either ourselves or others not to make mistakes. So we need to be
both disciplined in handling our data, and flexible in finding solutions to the
problems that pop up.
In the next two chapters, we will develop more advanced methods for data
analysis in both Excel and R.
Practice problems
Answers can be found in the appendix.
SKILL #1: Identify common data file formats
5. Use R (with the Tidyverse loaded) to open the data file https://people.sc.
fsu.edu/~jburkardt/data/csv/deniro.csv and count the number of obser-
vations and variables in it.
Chapter 11
Using R
Goals
Chapter goals
In this chapter we will learn how to use R to:
To get started, open R, load the Tidyverse, and read in our employment data.
library(tidyverse)
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv")
The Tidyverse provides four main functions for transforming a data table:
mutate(), filter(), select(), and arrange(). All four functions follow a
common syntax that is designed to work with a convenient Tidyverse tool
called the “pipe” operator.
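For example (a toy illustration not tied to our data), these two lines are
equivalent:

round(sqrt(10), 2)
## [1] 3.16
sqrt(10) %>% round(2)
## [1] 3.16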
As you can see, R’s rule for interpreting the pipe operator is that the object
before the %>% is taken as the first argument for the function after the %>%.
The pipe operator does not add any functionality to R; anything you can do
with it can also be done without it. But it addresses a common problem: we
often want to perform multiple transformations on a data set, but doing so in
the usual functional language can lead to code that is quite difficult to read.
The pipe operator can be used to create much more readable code, as we will
see in the examples below.
11.1.2 Mutate
The most important data transformation function is mutate(), which allows us
to change or add variables. We will start by changing the MonthYr variable
from a text string into a proper date.
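One way to do this is with the mdy() function from the lubridate package
(installed along with the Tidyverse); a sketch, since other date-parsing
approaches would also work:

EmpData %>%
  mutate(MonthYr = lubridate::mdy(MonthYr))  # parses "1/1/1976" as a date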
As you can see, the MonthYr column is now labeled as a date rather than text.
Like Excel, R has an internal representation of dates that allows for correct
ordering and calculations, but displays dates in a standard human-readable
format.
Mutate can be used to add variables as well as changing them. For example,
suppose we also want to create versions of UnempRate and LFPRate that
are expressed in percentages rather than decimal units:
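A sketch of one way to write this (the new percentage variables are just 100
times the originals):

EmpData %>%
  mutate(MonthYr = lubridate::mdy(MonthYr),
         UnempPct = 100 * UnempRate,
         LFPPct = 100 * LFPRate)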
If you look closely, you can see that the UnempPct and LFPPct variables are
now included in the data table.
Before we go any further, note that we haven’t yet changed the EmpData data
table:
print(EmpData)
## # A tibble: 541 x 11
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce UnempRate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/1976 16852. 9637. 733 10370. 6483. 0.0707
## 2 2/1/1976 16892 9660. 730 10390. 6502. 0.0703
## 3 3/1/1976 16931. 9704. 692. 10396. 6535 0.0665
## 4 4/1/1976 16969. 9738. 713. 10451. 6518. 0.0682
## 5 5/1/1976 17008. 9726. 720 10446. 6562 0.0689
## 6 6/1/1976 17047. 9748. 721. 10470. 6577. 0.0689
## 7 7/1/1976 17086. 9760. 780. 10539. 6546. 0.0740
## 8 8/1/1976 17124. 9780. 744. 10524. 6600. 0.0707
## 9 9/1/1976 17154. 9795. 737. 10532. 6622. 0.0699
## 10 10/1/1976 17183. 9782. 783. 10565. 6618. 0.0741
## # ... with 531 more rows, and 4 more variables: LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>
As you can see, the MonthYr variable is still listed as a character variable and
the new UnempPct and LFPPct variables do not seem to exist.
What has happened here? Our original commands simply created a new object
based on EmpData that was then displayed on the screen. In order to change
EmpData itself, we need to assign that new object back to EmpData:
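For example, a sketch combining the transformations above with assignment:

EmpData <- EmpData %>%
  mutate(MonthYr = lubridate::mdy(MonthYr),
         UnempPct = 100 * UnempRate,
         LFPPct = 100 * LFPRate)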
print(EmpData)
## # A tibble: 541 x 13
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce
Now let’s suppose we want to know more about the months in our data set with
the highest unemployment rates. We can use filter() for this purpose:
# This will give all of the observations with unemployment rates over 12.5%
EmpData %>%
filter(UnempPct > 12.5)
## # A tibble: 8 x 13
## MonthYr Population Employed Unemployed LabourForce NotInLabourForce
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-10-01 19183. 10787. 1602. 12389. 6794.
## 2 1982-11-01 19203. 10764. 1600. 12364. 6839.
## 3 1982-12-01 19223. 10774. 1624. 12398. 6824.
## 4 1983-01-01 19244. 10801. 1573. 12374 6870.
## 5 1983-02-01 19266. 10818. 1574. 12392. 6875.
## 6 1983-03-01 19285. 10875. 1555. 12430. 6856.
## 7 2020-04-01 30994. 16142. 2444. 18586. 12409.
## 8 2020-05-01 31009. 16444 2610. 19054. 11955.
## # ... with 7 more variables: UnempRate <dbl>, LFPRate <dbl>, Party <chr>,
## # PrimeMinister <chr>, AnnPopGrowth <dbl>, UnempPct <dbl>, LFPPct <dbl>
As you can see, only 8 of the 541 months in our data have unemployment rates
over 12.5%: the worst months of the 1982-83 recession, and April and May of
2020.
Now let’s suppose that we only want to see a few pieces of information about
those months. We can use select() to choose variables:
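# Keep only the high-unemployment months, select a few variables,
# and sort the result by the unemployment rate
EmpData %>%
  filter(UnempPct > 12.5) %>%
  select(MonthYr, UnempPct, LFPPct, PrimeMinister) %>%
  arrange(UnempPct)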
Hopefully you can see why the pipe operator is useful in making our code clear
and readable:
# This is what the same code looks like without the pipe
arrange(select(filter(EmpData, UnempPct > 12.5),
               MonthYr, UnempPct, LFPPct, PrimeMinister),
        UnempPct)
## # A tibble: 8 x 4
## MonthYr UnempPct LFPPct PrimeMinister
Now I should probably say: these results imply nothing meaningful about the
economic policy of either Pierre or Justin Trudeau. The severe worldwide
recessions of 1982-83 (driven by US monetary policy) and 2020-2021 (driven by
the COVID-19 pandemic) were caused by world events largely outside the control
of Canadian policy makers.
It is possible to save your data set in R’s internal format just like you would
save an Excel file. But I’m not going to tell you how to do that, because what
you really need to do is save your code.
Because it is command-based, R enables an entirely different and much more
reproducible model for data cleaning and analysis. In Excel, the original data,
the data cleaning, the data analysis, and the results are all mixed together
in a single file. This is convenient in many applications, but it can be a
disaster in complex projects.
In contrast, R allows you to keep three separate files or groups of files: the
original data, the code, and the results (including any cleaned data).
The key is to make sure that all of your cleaned data and results can be regen-
erated from the original data at any time by running your code.
11.2 Data analysis in R
The summary() function will give a basic summary of any object. Exactly what
that summary looks like depends on the object. For tibbles, summary() produces
a set of summary statistics for each variable:
summary(EmpData)
## MonthYr Population Employed Unemployed
## Min. :1976-01-01 Min. :16852 Min. : 9637 Min. : 691.5
## 1st Qu.:1987-04-01 1st Qu.:20290 1st Qu.:12230 1st Qu.:1102.5
## Median :1998-07-01 Median :23529 Median :14064 Median :1265.5
## Mean :1998-07-01 Mean :23795 Mean :14383 Mean :1261.0
## 3rd Qu.:2009-10-01 3rd Qu.:27327 3rd Qu.:16926 3rd Qu.:1404.6
## Max. :2021-01-01 Max. :31191 Max. :19130 Max. :2609.8
##
## LabourForce NotInLabourForce UnempRate LFPRate
## Min. :10370 Min. : 6483 Min. :0.05446 Min. :0.5996
## 1st Qu.:13467 1st Qu.: 6842 1st Qu.:0.07032 1st Qu.:0.6501
## Median :15333 Median : 8162 Median :0.07691 Median :0.6573
## Mean :15644 Mean : 8151 Mean :0.08207 Mean :0.6564
## 3rd Qu.:18230 3rd Qu.: 9099 3rd Qu.:0.09369 3rd Qu.:0.6674
## Max. :20316 Max. :12409 Max. :0.13697 Max. :0.6766
##
## Party PrimeMinister AnnPopGrowth UnempPct
## Length:541 Length:541 Min. :0.007522 Min. : 5.446
The R function mean() calculates the sample average of any numeric vector:
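For example:
mean(EmpData$UnempPct)
## [1] 8.207112
However, mean() requires some care when a variable has missing values. Notice
that the AnnPopGrowth variable is missing (NA) at the start of our data: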
EmpData %>%
select(MonthYr, Population, AnnPopGrowth)
## # A tibble: 541 x 3
## MonthYr Population AnnPopGrowth
## <date> <dbl> <dbl>
## 1 1976-01-01 16852. NA
## 2 1976-02-01 16892 NA
## 3 1976-03-01 16931. NA
## 4 1976-04-01 16969. NA
## 5 1976-05-01 17008. NA
## 6 1976-06-01 17047. NA
## 7 1976-07-01 17086. NA
## 8 1976-08-01 17124. NA
## 9 1976-09-01 17154. NA
## 10 1976-10-01 17183. NA
## # ... with 531 more rows
When we try to take the mean of this variable we also get NA:
mean(EmpData$AnnPopGrowth)
## [1] NA
This is because arithmetic in R follows a rule much like the IEEE-754
standard's treatment of undefined values: any calculation involving NA also
results in NA. Some other applications silently drop missing data from the
calculation.
Whenever you have missing values, you should investigate before proceeding.
Sometimes (as in our case here) missing values occur for a good reason; other
times they are the result of a mistake or problem that needs to be fixed.
Once we have investigated the missing values, we can tell R explicitly to exclude
them from the calculation by adding the na.rm = TRUE option:
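mean(EmpData$AnnPopGrowth, na.rm = TRUE)
## [1] 0.01370259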
Suppose we want to calculate the sample average for each column in our tibble.
We could just call mean() for each of them, but there should be a quicker way.
Here is the code to do that:
EmpData %>%
  select(where(is.numeric)) %>%
  lapply(mean, na.rm = TRUE)
## $Population
## [1] 23795.46
##
## $Employed
## [1] 14383.15
##
## $Unemployed
## [1] 1260.953
##
## $LabourForce
## [1] 15644.1
##
## $NotInLabourForce
## [1] 8151.352
##
## $UnempRate
## [1] 0.08207112
##
## $LFPRate
## [1] 0.6563653
##
## $AnnPopGrowth
## [1] 0.01370259
##
## $UnempPct
## [1] 8.207112
##
## $LFPPct
## [1] 65.63653
I would not expect you to come up with this code on your own, but hopefully it
makes some sense: we select the numeric columns and then apply mean() to each
of them.
We can use this method with any function that calculates a summary statistic:
EmpData %>%
  select(where(is.numeric)) %>%
  lapply(sd, na.rm = TRUE)
## $Population
## [1] 4034.558
##
## $Employed
## [1] 2704.267
##
## $Unemployed
## [1] 243.8356
##
## $LabourForce
## [1] 2783.985
##
## $NotInLabourForce
## [1] 1294.117
##
## $UnempRate
## [1] 0.01709704
##
## $LFPRate
## [1] 0.01401074
##
## $AnnPopGrowth
## [1] 0.00269365
##
## $UnempPct
## [1] 1.709704
##
## $LFPPct
## [1] 1.401074
We can also construct frequency tables for both discrete and continuous vari-
ables:
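# Frequency table for a categorical variable
EmpData %>%
  count(PrimeMinister)
# Frequency table for a continuous variable, sorted into
# six equal-width bins with cut_interval()
EmpData %>%
  count(cut_interval(UnempPct, 6))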
As you might imagine, there are various ways of customizing the intervals just
like in Excel.
• The qnorm() function gives the inverse normal CDF (or quantile function):
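# 97.5th percentile of the standard normal distribution
qnorm(0.975)
## [1] 1.959964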
11.3 Graphs with ggplot
1 The package is technically called ggplot2 since it is the second version of ggplot. But everyone just calls it ggplot, and so will we.
[Figure: histogram of UnempPct, with count on the vertical axis]
The ggplot() function has a non-standard syntax, so I’d like to go over it.
– The data argument tells R which data set (tibble) will be used.
– The mapping argument describes the basic aesthetics of the graph,
i.e., the relationship in the data we will be graphing.
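For example, a minimal command to create a histogram of the unemployment rate
looks like this:
ggplot(data = EmpData, mapping = aes(x = UnempPct)) +
  geom_histogram()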
A graph can also include multiple geometries, as we will see shortly.
As when making graphs in Excel, the basic graph gives us some useful informa-
tion but we can improve upon it in various ways.
You can add a title and subtitle, and you can change the axis titles:
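# A sketch using labs(); the exact titles here are guesses
# based on the figure below
ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) +
  geom_line() +
  labs(title = "Unemployment rate",
       subtitle = "January 1976 - January 2021",
       y = "Unemployment rate, %")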
[Figure: time series of the unemployment rate, titled "Canada: Unemployment rate, January 1976 − January 2021", with the vertical axis labeled "Unemployment rate, %"]
11.3.2.2 Color
You can change the color of any geometric element using the col= argument:
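# Draw the unemployment rate line in red
ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) +
  geom_line(col = "red")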
[Figure: the unemployment rate time series, drawn with a colored line]
Colors can be given as ordinary English words such as "red" or "blue", or as
detailed hexadecimal RGB color codes.
Some geometric elements, such as the bars in a histogram, also have a fill color:
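# col= sets the bar outlines, fill= sets the bar interiors
ggplot(data = EmpData, mapping = aes(x = UnempPct)) +
  geom_histogram(col = "blue", fill = "lightblue")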
[Figure: histogram of UnempPct with a colored outline and fill]
As you can see, the col= argument sets the color for the exterior of each bar,
and the fill= argument sets the color for the interior.
We can include multiple geometries in the same graph. For example, we can
include lines for both unemployment and labour force participation:
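# Two geom_line() layers in one graph; the second one overrides
# the y aesthetic to plot LFPPct
ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) +
  geom_line(col = "red") +
  geom_line(mapping = aes(y = LFPPct), col = "blue")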
[Figure: time series lines for UnempPct and LFPPct in different colors, with the vertical axis labeled UnempPct]
• We have used color to differentiate the two lines, but there is no legend to
tell the reader which line is which. We will need to fix that.
• The vertical axis is labeled UnempPct. We will need to fix that.
We could add a legend here, but it is better (and friendlier to the color-blind)
to just label the lines. We can use the geom_text geometry to do this:
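# A sketch: the label positions below are rough guesses
# that you would tune by eye
linelabels <- tibble(MonthYr = as.Date(c("1995-01-01", "1995-01-01")),
                     UnempPct = c(70, 15),
                     label = c("LFP", "Unemployment"))
ggplot(data = EmpData, mapping = aes(x = MonthYr, y = UnempPct)) +
  geom_line(col = "red") +
  geom_line(mapping = aes(y = LFPPct), col = "blue") +
  geom_text(data = linelabels, mapping = aes(label = label))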
[Figure: the two-line graph with "LFP" and "Unemployment" text labels in place of a legend]
The graphs below combine all of the features described above to yield clean and
clear graphs.
[Figure: histogram of the unemployment rate, titled "Canada: Unemployment rate, January 1976 − January 2021 (541 months)", with Count on the vertical axis]
[Figure: labeled time series of unemployment and LFP rates, titled "Canada: Unemployment and LFP rates, January 1976 − January 2021 (541 months)", with Percent on the vertical axis]
Chapter review
As we have seen, we can do many of the same things in Excel and R. R is
typically more difficult to use for simple analysis tasks, and there is nothing
wrong with using Excel when it is easier. But the usability gap gets smaller
with more complicated tasks, and there are many tasks where Excel doesn’t do
everything that R can do. You should think of them as complementary tools,
and be comfortable using both.
Practice problems
Answers can be found in the appendix.
SKILL #1: Use mutate to add or change a variable
SKILL #2: Use filter, arrange and select to modify a data table
2. Starting with the PPData data table you created in question (1) above:
a. Calculate and report the mean employment rate since 2010.
b. Calculate and report a table reporting the median for all variables in
PPData.
c. Did any variables in PPData have missing data? If so, how did you
decide to address it in your answer to (b), and why?
6. Using the PPData data set, create a time series graph of the employment
rate.
Chapter 12
Multivariate data analysis
Goals
Chapter goals
In this chapter, we will learn how to:
For the most part, we will focus on the case of a random sample of size 𝑛 on
two random variables 𝑥𝑖 and 𝑦𝑖 .
Example 12.1. Obtaining the data
The primary application in this chapter will use our Canadian employment data.
We will be using both Excel and R in our examples.
For the Excel examples we will start with the file https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentDa
This file is similar to the employment data file we used in Chapter 6.
For the R examples we will start with the EmploymentData.csv file we used in
Chapter 11. Execute the following R code to get started:
library(tidyverse)
EmpData <- read_csv("https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentData.csv")
# Make permanent changes to EmpData
EmpData <- EmpData %>%
mutate(MonthYr = as.Date(MonthYr, "%m/%d/%Y")) %>%
mutate(UnempPct = 100 * UnempRate) %>%
mutate(LFPPct = 100 * LFPRate)
When both variables are numeric, we can summarize their relationship using
the sample covariance:
$$s_{x,y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
and the sample correlation:
$$r_{x,y} = \frac{s_{x,y}}{s_x s_y}$$
where $\bar{x}$ and $\bar{y}$ are the sample averages and $s_x$ and $s_y$ are
the sample standard deviations. These univariate statistics are defined in
Chapter 7.
The sample covariance and sample correlation can be interpreted as estimates of
the corresponding population covariance and correlation as defined in Chapter
5.
The sample covariance and correlation can be calculated in R using the cov()
and cor() functions.
These functions can be applied to any two columns of data:
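cor(EmpData$UnempPct, EmpData$LFPPct)
## [1] -0.2557409
The cov() function is called in exactly the same way.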
As you can see, unemployment and labour force participation are negatively
correlated: when unemployment is high, LFP tends to be low. This makes sense
given the economics: if it is hard to find a job, people will move into other
activities that take them out of the labour force, such as education,
childcare, and retirement.
Both cov() and cor() can also be applied to (the numeric variables in) an entire
data set. The result is what is called a covariance matrix or correlation
matrix:
# Correlation matrix for the whole data set (at least the numerical parts)
EmpData %>%
select(where(is.numeric)) %>%
cor()
## Population Employed Unemployed LabourForce NotInLabourForce
## Population 1.0000000 0.9905010 0.3759661 0.9950675 0.9769639
## Employed 0.9905010 1.0000000 0.2866686 0.9964734 0.9443252
## Unemployed 0.3759661 0.2866686 1.0000000 0.3660451 0.3846586
## LabourForce 0.9950675 0.9964734 0.3660451 1.0000000 0.9509753
## NotInLabourForce 0.9769639 0.9443252 0.3846586 0.9509753 1.0000000
## UnempRate -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPRate 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## AnnPopGrowth NA NA NA NA NA
## UnempPct -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPPct 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## UnempRate LFPRate AnnPopGrowth UnempPct LFPPct
## Population -0.4721230 0.4535956 NA -0.4721230 0.4535956
## Employed -0.5542043 0.5369032 NA -0.5542043 0.5369032
## Unemployed 0.6249095 0.1874114 NA 0.6249095 0.1874114
## LabourForce -0.4836022 0.5379437 NA -0.4836022 0.5379437
## NotInLabourForce -0.4315427 0.2568786 NA -0.4315427 0.2568786
## UnempRate 1.0000000 -0.2557409 NA 1.0000000 -0.2557409
## LFPRate -0.2557409 1.0000000 NA -0.2557409 1.0000000
## AnnPopGrowth NA NA 1 NA NA
## UnempPct 1.0000000 -0.2557409 NA 1.0000000 -0.2557409
## LFPPct -0.2557409 1.0000000 NA -0.2557409 1.0000000
Notice that every correlation involving AnnPopGrowth is NA because of its
missing values. There are two standard ways to handle this: pairwise deletion
drops the missing observations from each pairwise calculation that needs them,
while casewise (listwise) deletion drops them from all calculations. The use
argument allows you to specify which approach you want to use:
# EmpData has missing data in 1976 for the variable AnnPopGrowth.
# Pairwise deletion will only exclude 1976 from calculations
# involving AnnPopGrowth.
EmpData %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")
## Population Employed Unemployed LabourForce NotInLabourForce
## Population 1.0000000 0.9905010 0.3759661 0.9950675 0.9769639
## Employed 0.9905010 1.0000000 0.2866686 0.9964734 0.9443252
## Unemployed 0.3759661 0.2866686 1.0000000 0.3660451 0.3846586
## LabourForce 0.9950675 0.9964734 0.3660451 1.0000000 0.9509753
## NotInLabourForce 0.9769639 0.9443252 0.3846586 0.9509753 1.0000000
## UnempRate -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPRate 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## AnnPopGrowth -0.5427605 -0.5239765 -0.5771164 -0.5618814 -0.4851752
## UnempPct -0.4721230 -0.5542043 0.6249095 -0.4836022 -0.4315427
## LFPPct 0.4535956 0.5369032 0.1874114 0.5379437 0.2568786
## UnempRate LFPRate AnnPopGrowth UnempPct LFPPct
## Population -0.47212303 0.4535956 -0.54276051 -0.47212303 0.4535956
## Employed -0.55420434 0.5369032 -0.52397653 -0.55420434 0.5369032
## Unemployed 0.62490950 0.1874114 -0.57711636 0.62490950 0.1874114
## LabourForce -0.48360222 0.5379437 -0.56188142 -0.48360222 0.5379437
## NotInLabourForce -0.43154270 0.2568786 -0.48517519 -0.43154270 0.2568786
## UnempRate 1.00000000 -0.2557409 -0.06513125 1.00000000 -0.2557409
## LFPRate -0.25574087 1.0000000 -0.48645089 -0.25574087 1.0000000
## AnnPopGrowth -0.06513125 -0.4864509 1.00000000 -0.06513125 -0.4864509
## UnempPct 1.00000000 -0.2557409 -0.06513125 1.00000000 -0.2557409
## LFPPct -0.25574087 1.0000000 -0.48645089 -0.25574087 1.0000000
# Casewise will exclude 1976 from all calculations
EmpData %>%
select(where(is.numeric)) %>%
cor(use = "complete.obs")
## Population Employed Unemployed LabourForce NotInLabourForce
## Population 1.0000000 0.9898651 0.32223097 0.9951165 0.9782320
## Employed 0.9898651 1.0000000 0.22300335 0.9964469 0.9443181
## Unemployed 0.3222310 0.2230034 1.00000000 0.3043132 0.3495771
## LabourForce 0.9951165 0.9964469 0.30431322 1.0000000 0.9529715
## NotInLabourForce 0.9782320 0.9443181 0.34957711 0.9529715 1.0000000
In most applications, pairwise deletion makes the most sense because it avoids
throwing out data. But it is occasionally important to use the same data for
all calculations, in which case we would use casewise (listwise) deletion.
FYI
Covariance and correlation in Excel
The sample covariance and correlation between two variables (data
ranges) can be calculated in Excel using the COVARIANCE.S() and
CORREL() functions.
12.2 Pivot tables
The Pivot Table itself is on the left side of the new worksheet.
The next step is to add elements to the table. There are various tools available
to do that:
• the Pivot Table Fields box on the right side of the screen
• the PivotTable Analyze menu
• the Design menu.
These tools only appear in context, so they will disappear if you click a cell
outside of the Pivot Table. You can bring them back by clicking any cell in
the Pivot Table.
1. Check the box next to PrimeMinister. The Pivot Table will look like
this:
2. Drag MonthYr into the box marked “Σ values”. The Pivot Table will
now look like this:
As we can see, the table shows the number of observations for each value of the
PrimeMinister variable, which also happens to be the number of months in
office for each prime minister. It also shows a grand total.
3. Click on the Show Values As tab and select “% of Column Total” from
the Show Values As drop-down box.
4. Select OK.
The third column will now show the number of observations as a percentage of
the total:
For example, this crosstab tells us Brian Mulroney served 104 months as prime
minister, with all of those months as a member of the (Progressive) Conservative
party.
We can also construct crosstabs using relative frequencies, but there is more than
one kind of relative frequency we can use here. A joint frequency crosstab
shows the count in each cell as a percentage of all observations. Joint frequency
tables can be interpreted as estimates of joint probabilities.
Example 12.6. A joint frequency crosstab
To convert our absolute frequency crosstab into a joint frequency crosstab:
For example, the table tells us that Brian Mulroney’s 104 months as prime
minister represent 19.22% of all months in our data.
A conditional frequency crosstab shows the count in each cell as a percentage
of its row or column total. Conditional frequency tables can be interpreted as
estimates of conditional probabilities. For example, Brian Mulroney's 104
months as prime minister represent 44.64% of all months served by a
Conservative prime minister in our data.
1. Drag UnempRate into the box marked “Σ values”. The table will now
look like this:
Unfortunately, we wanted to see the average unemployment rate for each prime
minister, but instead we see the sum of unemployment rates for each prime
minister. To change this:
We now have the average unemployment rate for each prime minister. It is not
very easy to read, so we will want to change the formatting later.
As you might expect, we can modify Pivot Tables in various ways to make them
clearer, more informative, and more visually appealing.
As with other tables in Excel, we can filter and sort them. Filtering is par-
ticularly useful with Pivot Tables since there are often categories we want to
exclude.
Note that the grand total has also gone down from 541 to 532.
By default, the table is sorted on the row labels, but we can sort on any column.
We can change number formatting, column and row titles, and various other
aspects of the table’s appearance.
Example 12.11. Cleaning up a table’s appearance
Our table can be improved by making the column headers more informative and
reporting the unemployment rate in percentage terms and fewer decimal places:
5. Change the other three headers. You can do this through Value Field
Settings... but you can also just edit the text directly.
• Change “Row Labels” to “Prime Minister”
• Change “Count of MonthYr” to “Months in office”
• Change “Count of MonthYr2” to “% in office”
4. Select any cell in the table, then select Insert > Recommended Charts
from the menu.
5. Select Column, and then Stacked Column from the dialog box, and then
select OK.
As always, there are various ways we could customize this graph to be more
attractive and informative.
You can download the full set of Pivot Tables and associated charts generated in
this chapter at https://bookdown.org/bkrauth/BOOK/sampledata/EmploymentDataPT.xlsx
Bivariate summary statistics like the covariance and correlation provide a sim-
ple way of characterizing the relationship between any two numeric variables.
Frequency tables, cross tabulations, and conditional averages allow us to gain
a greater understanding of the relationship between two discrete or categorical
variables, or between a discrete/categorical variable and a continuous variable.
12.3 Graphical methods
12.3.1 Scatter plots
A scatter plot is the simplest way to view the relationship between two
variables in data. The horizontal (𝑥) axis represents one variable, the
vertical (𝑦) axis represents the other variable, and each point represents an
observation.
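In R, a scatter plot uses the geom_point() geometry:
ggplot(data = EmpData, mapping = aes(x = UnempPct, y = LFPPct)) +
  geom_point()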
[Figure: scatter plot of LFPPct against UnempPct]
In some sense, the scatter plot shows everything about the relationship between
the two variables, since it shows every observation. The negative relationship
between the two variables indicated by the correlation we calculated earlier
(-0.2557409) is clear, but it is also clear that this relationship is not very strong.
12.3.1.1 Jittering
If both of our variables are truly continuous, each point represents a single
observation. But if both variables are actually discrete, points can “stack” on top
of each other. In that case, the same point can represent multiple observations,
leading to a misleading scatter plot.
For example, suppose we had rounded our unemployment and LFP data to the
nearest percent:
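# A sketch; the table name EmpDataRounded is just a choice
# for this example
EmpDataRounded <- EmpData %>%
  mutate(UnempPct = round(UnempPct), LFPPct = round(LFPPct))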
The scatter plot with the rounded data would look like this:
[Figure: scatter plot of the rounded LFPPct and UnempPct data, showing only 40 distinct points]
As you can see from the graph, the scatter plot is misleading: there are 541
observations in the data set represented by only 40 points.
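The standard solution is to jitter the points: add a small random offset to
each point so that stacked points become visible. A sketch using the
geom_jitter() geometry (the colors and sizes here are choices, not
requirements):
ggplot(data = EmpDataRounded, mapping = aes(x = UnempPct, y = LFPPct)) +
  geom_point(col = "red", size = 3) +
  geom_jitter(col = "blue", size = 1)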
[Figure: jittered rounded data (small blue dots) plotted over the rounded data (large red dots)]
As you can see, the jittered rounded data (small blue dots) reflect the
original unrounded data more accurately than the rounded data (large red dots)
do.
We can use color to add a third dimension to the data. That is, we can color-
code points based on a third variable by including it as part of the aesthetic:
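# Color-coding by a categorical variable
ggplot(data = EmpData, mapping = aes(x = UnempPct, y = LFPPct, col = Party)) +
  geom_point()
# Color-coding by a continuous variable
ggplot(data = EmpData, mapping = aes(x = UnempPct, y = LFPPct, col = MonthYr)) +
  geom_point()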
[Figure: scatter plot of LFPPct against UnempPct, with points colored by Party (Conservative, Liberal, Transfer)]
[Figure: the same scatter plot with points colored by MonthYr, using a continuous color scale]
As these graphs show, R will use a discrete or continuous color scheme depending
on whether the variable is discrete or continuous.
As we discussed earlier, you want to make sure your graph can be read by a
reader who is color blind or is printing in black and white. So we can use shapes
in addition to color:
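# Map Party to both the col and shape aesthetics
ggplot(data = EmpData,
       mapping = aes(x = UnempPct, y = LFPPct, col = Party, shape = Party)) +
  geom_point()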
[Figure: scatter plot of LFPPct against UnempPct with both color and shape indicating Party]
We would also want to choose a color scheme other than red and green, since
red-green color blindness is the most common form.
FYI
Scatter plots in Excel
Scatter plots can also be created in Excel, though it is more work and
produces less satisfactory results.
12.3.2 Binned scatterplots
One option is to sort the observations into bins based on the value of 𝑥𝑖 and
then plot the average 𝑦𝑖 within each bin against the midpoint of the bin. This
kind of plot is called a binned scatterplot.
[Figure: binned scatterplot of LFPPct against UnempPct]
The number of bins is an important choice. The graph below adds a red line
based on 4 bins and a green line based on 100 bins.
[Figure: binned scatterplots of LFPPct against UnempPct with 4 bins (red), 20 bins, and 100 bins (green)]
As you can see, the binned scatterplot tends to be smooth when there are only
a few bins, and jagged when there are many bins. This reflects a trade-off
between bias (too few bins may lead us to miss important patterns in the data)
and variance (too many bins may lead us to see patterns in the data that aren’t
really part of the DGP).
12.3.3 Smoothing
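We can add a smooth fit to a scatter plot with the geom_smooth() geometry. A
minimal sketch (the figure below uses the rate rather than percentage versions
of the variables):
ggplot(data = EmpData, mapping = aes(x = UnempRate, y = LFPRate)) +
  geom_point() +
  geom_smooth()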
[Figure: scatter plot of LFPRate against UnempRate with a smooth fit line and shaded 95% confidence interval]
Notice that by default, the graph includes both the fitted line (in blue) and
a 95% confidence interval (the shaded area around the line). Also note that
the confidence interval is narrow in the middle (where there is a lot of data)
and wide at the ends (where there is less data).
Our last approach is to assume that the relationship between the two variables
is linear, and estimate it by a technique called linear regression. Linear
regression calculates the straight line that fits the data best.
You can include a linear regression line in your plot by adding the method=lm
argument to the geom_smooth() geometry:
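ggplot(data = EmpData, mapping = aes(x = UnempPct, y = LFPPct)) +
  geom_point() +
  geom_smooth(method = lm)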
[Figure: scatter plot of LFPPct against UnempPct with a linear regression line]
We can compare the linear and smoothed fits to see where they differ:
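# Overlay both fits; drawing the linear fit in red is just a choice
ggplot(data = EmpData, mapping = aes(x = UnempPct, y = LFPPct)) +
  geom_point() +
  geom_smooth() +
  geom_smooth(method = lm, col = "red")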
[Figure: scatter plot of LFPPct against UnempPct with both the smooth fit and the linear fit]
As you can see, the two fits are quite similar for unemployment rates below
12%, but diverge quite a bit above that level. This is inevitable: the smooth
fit can become steeper at high unemployment rates, but a straight line cannot.
Linear regression is much more restrictive than smoothing, but has several im-
portant advantages:
These advantages are not particularly important in this case, with only two
variables and a reasonably large data set. The advantages of linear regression
become overwhelming when you have more than two variables to work with. As
a result, linear regression is the most important tool in applied econometrics,
and you will spend much of your time in ECON 333 learning to use it.
Chapter review
Econometrics is mostly about the relationship between variables: price and
quantity, consumption and savings, labour and capital, today and tomorrow.
So most of what we do is multivariate analysis.
This chapter has provided a brief view of some of the main techniques for mul-
tivariate analysis. Our higher-level statistics courses (ECON 333, ECON 334,
ECON 335, ECON 433, ECON 435) are all about multivariate analysis, and
will develop both the theory behind these tools and the set of applications in
much greater detail.
Practice problems
Answers can be found in the appendix.
SKILL #1: Calculate and interpret covariance and correlation
1. Using the EmpData data set, calculate the covariance and correlation of
UnempPct and AnnPopGrowth. Based on these results, are periods
of high population growth typically periods of high unemployment?
2. In problem (1) above, did you use pairwise or casewise deletion of missing
values? Did it matter? Explain why.
3. The following tables are based on 2019 data for Canadians aged 25-34.
Classify each of these tables as simple frequency tables, crosstabs, or con-
ditional averages.
a.
b.
c.
4. Using the EmpData data set, construct a scatter plot with annual popula-
tion growth on the horizontal axis and unemployment rate on the vertical
axis.
5. Using the EmpData data set, construct the same scatter plot as in problem
(4) above, but add a smooth fit and a linear fit.
Appendix A
Math review
The math used in this textbook is all covered in high school or in introductory
calculus. However, you may have forgotten it, or never understood it very
well in the first place. This appendix provides a review of the most important
mathematical terms, concepts and methods. It can be used to review ideas
before starting the main text, or as a reference while going through the main
text.
If you have trouble with a few of these ideas, don’t panic. This is not a math
class. For example, if you forget what it means for two sets to be “disjoint”,
you can always ask.
A.1 Sets
The most fundamental notion in mathematics is the idea of a set. A set is
typically described as a collection or gathering of distinct objects. These objects
are called the elements of the set. Sets are not ordered, and elements cannot
be repeated.
𝐴 = {1, 2, 3}
𝐵 = {𝐴𝑣𝑜𝑐𝑎𝑑𝑜, 𝐵𝑎𝑛𝑎𝑛𝑎}
𝐶 = {𝑥 ∈ 𝐵 ∶ 𝑥 is yellow}
We read this as “𝐶 is the set of all 𝑥 in the set 𝐵 such that 𝑥 is yellow.” In
other words:
𝐶 = {𝑥 ∶ 𝑥 is yellow}
We would interpret this as saying that 𝐶 is the set of everything that is yellow.
There are a few special sets defined by convention:
• The empty set ∅ is the set with no elements.
• ℤ is the set of all integers.
• ℝ is the set of all real numbers.
• The universe set 𝕌 is the set of all elements under consideration.
Finally we can just refer to an abstract set without specifying its contents, just
like we can refer to a variable in algebra without specifying its value.
The size or cardinality of the set 𝐴, usually written |𝐴|, is simply the number
of elements it has.
• 𝐴 is a singleton if |𝐴| = 1.
• 𝐴 is a finite set if |𝐴| is a finite number. Otherwise it is an infinite set.
|ℤ| = ∞
|ℝ| = ∞
Let 𝐴 and 𝐵 be two sets. We have several ways of describing how they are
related:
These sets are not identical:
{1, 2} ≠ {1, 2, 3}
{1, 2} ≠ {1, 3}
These sets are disjoint:
{1, 2} and {3, 4}
These sets are not disjoint:
{1, 2} and {2, 3}
The first of these sets is a subset of the second set:
{1, 2} ⊂ {1, 2}
{1, 2} ⊂ {1, 2, 3}
The first of these sets is not a subset of the second set:
{1, 2} ⊄ {1, 3}
• The intersection of 𝐴 and 𝐵 is the set of all elements that are in both sets:
𝐴 ∩ 𝐵 = {𝑥 ∶ 𝑥 ∈ 𝐴 and 𝑥 ∈ 𝐵}
• The union of 𝐴 and 𝐵 is the set of all elements that are in either set:
𝐴 ∪ 𝐵 = {𝑥 ∶ 𝑥 ∈ 𝐴 or 𝑥 ∈ 𝐵}
• The set difference 𝐵 − 𝐴 is the set of all elements of 𝐵 that are not in 𝐴:
𝐵 − 𝐴 = {𝑥 ∈ 𝐵 ∶ 𝑥 ∉ 𝐴}
We will not use the set difference, but it helps to define the complement,
which we will use.
• The complement of 𝐴 is written 𝐴′ , ¬𝐴 or 𝐴𝑐 , and is simply everything
that is not in 𝐴:
𝐴𝑐 = 𝕌 − 𝐴 = {𝑥 ∶ 𝑥 ∉ 𝐴}
{1, 2} ∩ {3, 4} = ∅
FYI
Some standard results about sets
Given the basic components of set algebra, we can establish many useful
rules. This is not a course on set theory, so I will simply list some of the
most important rules for your reference. None of these rules is difficult
to prove, and most of them should make intuitive sense to you.
• Non-negative cardinality:
|𝐴| ≥ 0
• Cardinality of unions:
|𝐴 ∪ 𝐵| ≤ |𝐴| + |𝐵|
• Cardinality of intersections:
|𝐴 ∩ 𝐵| ≤ 𝑚𝑖𝑛(|𝐴|, |𝐵|)
• Commutative laws:
𝐴∪𝐵 =𝐵∪𝐴
𝐴∩𝐵 =𝐵∩𝐴
• Associative laws:
(𝐴 ∪ 𝐵) ∪ 𝐶 = 𝐴 ∪ (𝐵 ∪ 𝐶)
(𝐴 ∩ 𝐵) ∩ 𝐶 = 𝐴 ∩ (𝐵 ∩ 𝐶)
• Distributive laws:
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶)
𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶)
• Identity laws:
𝐴∪∅=𝐴
𝐴∩𝕌=𝐴
• Complement laws:
𝐴 ∪ 𝐴𝑐 = 𝕌
𝐴 ∩ 𝐴𝑐 = ∅
𝕌𝑐 = ∅
∅𝑐 = 𝕌
• Double-complement law:
(𝐴𝑐 )𝑐 = 𝐴
• Idempotent laws:
𝐴∪𝐴=𝐴
𝐴∩𝐴=𝐴
• Domination laws:
𝐴∪𝕌=𝕌
𝐴∩∅=∅
A.2 Functions
A function is a rule that matches (“maps”) elements of one set (called the
domain of the function) to elements of another set (called the range of the
function). We use the notation
𝑓 ∶ 𝐷 → 𝑅
to describe a function 𝑓 with domain 𝐷 and range 𝑅.
If a function has a finite domain, we can define the function by simple enumer-
ation.
𝑠(𝐴𝑣𝑜𝑐𝑎𝑑𝑜) = 1
𝑠(𝐵𝑎𝑛𝑎𝑛𝑎) = 0
𝑠(𝐶𝑎𝑛𝑡𝑎𝑙𝑜𝑢𝑝𝑒) = 500
We could also make a table:
𝑓𝑟𝑢𝑖𝑡 𝑠(𝑓𝑟𝑢𝑖𝑡)
Avocado 1
Banana 0
Cantaloupe 500
FYI
Function or multiplication?
Students sometimes confuse functions and multiplication because of the way we
conventionally write functions.
Suppose I write this:
𝑧 = 𝑓(𝑥 + 𝑦)
There are two possible interpretations of this statement: 𝑓 could be a
function applied to the number (𝑥 + 𝑦), or 𝑓 could be a number multiplied by
the number (𝑥 + 𝑦). You usually need context to tell which one is meant.
A particularly useful function is the indicator function 𝐼(⋅), which returns
one when the statement in its argument is true and zero when it is false:
𝐼(3 < 5) = 1
𝐼(3 = 5) = 0
𝐼(Ottawa is the capital of Canada) = 1
𝐼(Ottawa is in Alberta) = 0
We use indicator functions all the time in probability and statistics because
they allow us to convert a qualitative statement like “Bob is employed” into a
quantitative statement like 𝐼(Bob is employed) = 1.
The Cartesian product of two sets 𝐴 and 𝐵, usually written 𝐴 × 𝐵, is the set
of all ordered pairs of elements in the two sets.
An important example is
ℝ² = ℝ × ℝ
the set of ordered pairs of real numbers. For example, (0, 3) and (0.427, 2000)
are both elements of ℝ².
A.3.2 Sequences
A sequence is an ordered list of elements. For example, ℝⁿ = ℝ × ℝ × ⋯ × ℝ is
the set of sequences of 𝑛 real numbers. Sequences differ from sets in two
ways:
1. Order matters. Sets are unordered, but sequences are ordered:
{1, 2} = {2, 1}
(1, 2) ≠ (2, 1)
2. Elements can be repeated. Sequences can include repeated elements, but sets
cannot.
A sequence can also be empty: () is an empty sequence.
A.3.3 Summations
Notice that an expression using the summation operator has several components:
the index (here 𝑖), the set or range of values the index runs over, and the
expression being added up. For example:
$$\sum_{i \in \{1,2,3\}} x_i = x_1 + x_2 + x_3$$
The notation $\sum_{i=start}^{end}$ means we add up over all of the integers
from $start$ to $end$:
$$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3$$
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
The expression being summed can be any function of the index:
$$\sum_{j=1}^{3} \beta^j = \beta + \beta^2 + \beta^3$$
But the expression does not have to vary with the index:
$$\sum_{i=1}^{3} 2x = 2x + 2x + 2x = 6x$$
$$\sum_{i=1}^{3} 3 = 3 + 3 + 3 = 9$$
The summation operator looks fancy, but remember it is just a concise way of
describing a sum. If you are ever struggling with understanding a summation,
write it out.
• Associative property:
(𝑎 + 𝑏) + 𝑐 = 𝑎 + (𝑏 + 𝑐)
(𝑎𝑏)𝑐 = 𝑎(𝑏𝑐)
• Commutative property:
𝑎+𝑏 =𝑏+𝑎
𝑎𝑏 = 𝑏𝑎
• Distributive property:
𝑎(𝑏 + 𝑐) = 𝑎𝑏 + 𝑎𝑐
• The associative and commutative properties allow you to switch any two
summation operators:
$$\sum_{i \in A} \sum_{j \in B} x_i y_j = \sum_{j \in B} \sum_{i \in A} x_i y_j$$
and to split a summation term-by-term:
$$\sum_{i \in A} (x_i + y_i) = \left(\sum_{i \in A} x_i\right) + \left(\sum_{i \in A} y_i\right)$$
• The distributive property allows you to take any constant out of the
summation operator:
$$\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i$$
A.4 Limits
Let (𝑥1 , 𝑥2 , …) be a sequence of infinite length. We say that some number 𝑐 is
the limit of this sequence if 𝑥𝑖 gets closer and closer to 𝑐 as 𝑖 gets bigger and
bigger.
$$\lim(1, 1, 1, \ldots) = 1$$
$$\lim\left(1, \tfrac{1}{2}, \tfrac{1}{3}, \ldots\right) = 0$$
$$\lim(1, 2, 3, \ldots) = \infty$$
$$\lim(-1, -2, -3, \ldots) = -\infty$$
$$\lim(0, 1, 0, 1, 0, 1, \ldots) \text{ does not exist}$$
You will learn or have learned the formal definition of a limit in your calculus
course. I won’t make you re-learn it for this class, but the formal definition is
provided below for your reference.
FYI
Definition of a limit
Let $(x_1, x_2, \ldots)$ be a sequence of infinite length. We say that the
number $c$ is the limit of this sequence:
$$\lim_{i \to \infty} x_i = c$$
if for every $\epsilon > 0$ there is some $N$ such that $|x_i - c| < \epsilon$
for all $i > N$.
Chapter review
This appendix provides some of the basic mathematical background needed for
this course. There are many other useful sources if you need further information.
Practice problems
Answers can be found in the appendix.
SKILL #1: Define a set using enumeration or set-builder notation
1. Use enumeration to define 𝐴 as the set of “people who live in your house”.
2. Use set-builder notation to define 𝐵 as the set of integers between 1 and
1,000.
3. Calculate the cardinality of the sets you defined in problems (1) and (2)
above.
4. Let 𝐴 = {1, 2, 3}, let 𝐵 = {2}, let 𝐶 = ∅, and let 𝐷 = {𝑥 ∈ ℤ ∶ 1 < 𝑥 < 3}
a. Which pairs of sets are identical?
b. Which sets are disjoint with 𝐴?
c. Which sets are subsets of 𝐴?
d. Which sets are subsets of 𝐵?
5. Let 𝐴 = {1, 2, 3}, let 𝐵 = {2, 4}, let 𝐶 = ∅, and let 𝐷 = {𝑥 ∈ ℤ ∶ 1 < 𝑥 <
3}
a. Find 𝐴 ∩ 𝐵.
b. Find 𝐴 ∪ 𝐵.
c. Find 𝐴 ∩ 𝐷.
d. Find 𝐴 ∪ 𝐷.
e. Suppose the universe set is 𝕌 = {1, 2, 3, 4, 5}. Find 𝐵ᶜ.
f. Suppose the universe set is the set of integers ℤ. Find 𝐵ᶜ.
Appendix B
Solutions to practice problems
b. =CONCAT("A2 = ",A2)
c. =LEFT(A2,2)
a. =MONTH(TODAY())
b. =TODAY()+100
c. =TODAY()-DATE(1969,11,20) - using my birthday; yours will obvi-
ously be different.
1. The sample space is the set of all possible outcomes for (𝑟, 𝑤), and its
cardinality is just the number of elements it has.
$$\Omega = \left\{ \begin{array}{l}
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), \\
(2,1), (2,2), (2,3), (2,4), (2,5), (2,6), \\
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), \\
(4,1), (4,2), (4,3), (4,4), (4,5), (4,6), \\
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6), \\
(6,1), (6,2), (6,3), (6,4), (6,5), (6,6)
\end{array} \right\} \tag{B.1}$$
2. Each event is just a set listing the (𝑟, 𝑤) outcomes that satisfy the relevant
conditions.
10. The conditional probability is just the ratio of the joint probability to the
probability of the event we are conditioning on:
$$\Pr(Yo \mid Boxcars^c) = \frac{\Pr(Yo \cap Boxcars^c)}{\Pr(Boxcars^c)} \tag{B.37}$$
$$= \frac{2/36}{1 - 1/36} \tag{B.38}$$
$$\approx 0.057 \tag{B.39}$$
1. We can define 𝑡 = 𝑟 + 𝑤.
2. We can define 𝑦 = 𝐼(𝑟 + 𝑤 = 11).
3. The support of a random variable is the set of all values with positive
probability.
a. The support of 𝑟 is 𝑆𝑟 = {1, 2, 3, 4, 5, 6}.
b. The support of 𝑡 is 𝑆𝑡 = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
c. The support of 𝑦 is 𝑆𝑦 = {0, 1}.
4. The range of a random variable is just the interval defined by the support’s
minimum and maximum values.
a. The range of 𝑟 is [1, 6].
b. The range of 𝑡 is [2, 12].
c. The range of 𝑦 is [0, 1].
5. The PDF can be derived from the probability distribution of the underly-
ing outcome (𝑟, 𝑤), which was calculated in the previous chapter.
a. The PDF of 𝑟 is:
$$f_r(a) = \begin{cases} 1/6 & a \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{otherwise} \end{cases} \tag{B.46}$$
The PDF of 𝑦 is:
$$f_y(a) = \begin{cases} 17/18 & a = 0 \\ 1/18 & a = 1 \\ 0 & \text{otherwise} \end{cases} \tag{B.48}$$
$$F_r(a) = \begin{cases} 0 & a < 1 \\ 1/6 & 1 \le a < 2 \\ 1/3 & 2 \le a < 3 \\ 1/2 & 3 \le a < 4 \\ 2/3 & 4 \le a < 5 \\ 5/6 & 5 \le a < 6 \\ 1 & a \ge 6 \end{cases} \tag{B.49}$$
$$F_y(a) = \begin{cases} 0 & a < 0 \\ 17/18 & 0 \le a < 1 \\ 1 & a \ge 1 \end{cases} \tag{B.50}$$
8. The key here is to write down the definition of the specific quantile you
are looking for, then substitute the CDF you derived earlier.
a. The median of 𝑟 is:
9. Let $d = (y - E(y))^2$.
a. We can derive the PDF of $d$ from the PDF of $y$:
$$f_d(a) = \begin{cases} 17/18 & a = (0 - 1/18)^2 \\ 1/18 & a = (1 - 1/18)^2 \\ 0 & \text{otherwise} \end{cases} \tag{B.71}$$
b. The variance of $y$ is then:
$$E(d) = (0 - 1/18)^2 \cdot \tfrac{17}{18} + (1 - 1/18)^2 \cdot \tfrac{1}{18} \tag{B.72}$$
$$\approx 0.0525 \tag{B.73}$$
10. Earlier, you found $E(r) = 3.5$ and $E(r^2) = 15.17$. So we can apply our
result that $var(x) = E(x^2) - E(x)^2$ for a simpler way of calculating the
variance.
a. The variance is:
$$var(r) = 15.17 - 3.5^2 \approx 2.92$$
11. The Bernoulli distribution describes any random variable (like 𝑦) that has
a binary {0, 1} support.
a. 𝑦 has the Bernoulli(𝑝) distribution with 𝑝 = 1/18, or
𝑦 ∼ Bernoulli(1/18)
$\Pr(Y_{10} = 0) \approx 0.565$
13. The key here is to apply the formulas for the expected value and variance
of a linear function of a random variable.
15. If you care mostly about expected net winnings, you should just keep your
$10 and not bet at all. This strategy has an expected net winnings of zero,
while the others both have negative expected net winnings.
16. Betting $1 on ten rolls produces the highest probability of walking away
from the table with more money than you started with.
17. Betting $10 on one roll produces more variable net winnings.
1. The key here is to redefine joint events for 𝑦 and 𝑏 as events for the single
random variable 𝑡.
a. The joint PDF is:
𝑓𝑦,𝑏 (1, 1) = Pr(𝑦 = 1 ∩ 𝑏 = 1) (B.98)
= Pr(𝑡 = 11 ∩ 𝑡 = 12) (B.99)
= Pr(∅) (B.100)
=0 (B.101)
b. The joint PDF is:
𝑓𝑦,𝑏 (0, 1) = Pr(𝑦 ≠ 1 ∩ 𝑏 = 1) (B.102)
= Pr(𝑡 ≠ 11 ∩ 𝑡 = 12) (B.103)
= Pr(𝑡 = 12) (B.104)
= 𝑓𝑡 (12) (B.105)
= 1/36 (B.106)
c. The joint PDF is:
𝑓𝑦,𝑏 (1, 0) = Pr(𝑦 = 1 ∩ 𝑏 = 0) (B.107)
= Pr(𝑡 = 11 ∩ 𝑡 ≠ 12) (B.108)
= Pr(𝑡 = 11) (B.109)
= 𝑓𝑡 (11) (B.110)
= 1/18 (B.111)
d. The joint PDF is:
𝑓𝑦,𝑏 (0, 0) = Pr(𝑦 = 0 ∩ 𝑏 = 0) (B.112)
= Pr(𝑡 ≠ 11 ∩ 𝑡 ≠ 12) (B.113)
= Pr(𝑡 ∉ {11, 12}) (B.114)
= 1 − Pr(𝑡 ∈ {11, 12}) (B.115)
= 1 − 1/36 − 1/18 (B.116)
= 11/12 (B.117)
2. The marginal PDF is constructed by adding together the joint PDFs.
a. The marginal PDF is:
𝑓𝑏 (0) = 𝑓𝑦,𝑏 (0, 0) + 𝑓𝑦,𝑏 (1, 0) (B.118)
= 11/12 + 1/18 (B.119)
= 35/36 (B.120)
3. The conditional PDF is the ratio of the joint PDF to the marginal PDF:
$$E(yb) = 0 \cdot 0 \cdot f_{y,b}(0,0) + 1 \cdot 0 \cdot f_{y,b}(1,0) + 0 \cdot 1 \cdot f_{y,b}(0,1) + 1 \cdot 1 \cdot f_{y,b}(1,1) \tag{B.147}$$
$$= f_{y,b}(1,1) \tag{B.148}$$
$$= 0 \tag{B.149}$$
c. Yes, if you have done it right you should get the same answer.
8. In this question we know the correlation and several values of the formula
defining it, and we use this to solve for the missing value.
a. We know that $corr(b,t) = \frac{cov(b,t)}{\sqrt{var(b)\,var(t)}}$, so we can
substitute known values to get:
$$0.35 = \frac{cov(b,t)}{\sqrt{0.027 \cdot 5.83}} \tag{B.156}$$
$$E(x) = \frac{1 + (-1)}{2} \tag{B.193}$$
$$= 0 \tag{B.194}$$
a. Sample size is 3
b. Sample average age is 45.
c. Sample median of age is 32.
d. Sample 25th percentile of age is 28.5.
e. Sample variance of age is 829.
f. Sample standard deviation of age is 28.8.
2. The numerical answers are the same as in question 1. Assuming the table
starts in cell A1, the Excel formulas are:
a. =COUNT(B2:B4)
b. =AVERAGE(B2:B4)
c. =MEDIAN(B2:B4)
d. =PERCENTILE.INC(B2:B4,0.25)
e. =VAR.S(B2:B4)
f. =STDEV.S(B2:B4)
7 Statistics
2. Identify the sampling type (random sample, time series, stratified sample,
cluster sample, census, convenience sample) for each of the following data
sets.
3. Since we have a random sample, the joint PDF is just a product of the
marginal PDFs.
b. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:
c. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:
d. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:
e. Since we have a random sample, the joint PDF is the product of the
marginal PDFs:
b. The support is $S_{\bar{x}} = \{1, 1.5, 2\}$ and the sampling distribution is:
c. The support is $S_{\hat{\sigma}^2_x} = \{0, 0.5\}$ and the sampling distribution is:
d. The support is $S_{\hat{\sigma}_x} = \{0, 0.71\}$ and the sampling distribution is:
6. We have already found the mean of each statistic, so we can calculate the
variance using the alternative formula 𝑣𝑎𝑟(𝑥) = 𝐸(𝑥2 ) − 𝐸(𝑥)2
7. All three of these statistics are simple linear functions of the data, so their
mean and variance can be calculated without using the PDF:
b. The mean and variance can be found using the standard formulas for
the sample average:
c. The mean and variance can be found using the standard formula for
a linear function of two random variables:
9. We can find the true values directly from the PDF 𝑓𝑥 (⋅):
10. Since max(𝑆𝑥 ) = 2, the error is either 𝑒𝑟𝑟 = 1−2 = −1 or 𝑒𝑟𝑟 = 2−2 = 0.
a. Unbiased
𝑏𝑖𝑎𝑠(𝑓1̂ ) = 𝐸(𝑓1̂ ) − Pr(𝑥𝑖 = 1) (B.326)
= 0.4 − 0.4 (B.327)
=0 (B.328)
b. Unbiased
𝑏𝑖𝑎𝑠(𝑥)̄ = 𝐸(𝑥)̄ − 𝐸(𝑥𝑖 ) (B.329)
= 1.6 − 1.6 (B.330)
=0 (B.331)
c. Unbiased
𝑏𝑖𝑎𝑠(𝜎̂𝑥2 ) = 𝐸(𝜎̂𝑥2 ) − 𝑣𝑎𝑟(𝑥𝑖 ) (B.332)
= 0.24 − 0.24 (B.333)
=0 (B.334)
d. Biased
𝑏𝑖𝑎𝑠(𝜎̂𝑥 ) = 𝐸(𝜎̂𝑥 ) − 𝑠𝑑(𝑥𝑖 ) (B.335)
= 0.34 − 0.49 (B.336)
= −0.15 (B.337)
e. Biased
f. Biased
12. Remember that the expected value passes through linear functions, but
not nonlinear functions:
13. We have already calculated the variance and bias, so we can apply the
formula 𝑀 𝑆𝐸 = 𝑣𝑎𝑟 + 𝑏𝑖𝑎𝑠2 :
14. Again, these estimators are both linear functions of the data, so their
mean and variance are easy to calculate.
16. Both the sample average (a) and the average of all even-numbered
observations (d) are consistent estimators of 𝜇, because they keep using more
and more information as the sample size increases. In contrast, the estimators
based on the first observation and on the first 100 observations are both
unbiased, but they do not change as the sample size increases; the LLN does
not apply, and these estimators are not consistent. The LLN does apply to the
sample median because of Slutsky's theorem, but that makes it a consistent
estimator of the population median, which may or may not be equal to the
population mean.
8 Statistical inference
6. This problem asks you to find the critical values that deliver a given size.
a. The critical values are:
$$c_L = F^{-1}_{T_{17}}(0.005) \approx -2.90 \tag{B.377, B.378}$$
$$c_H = F^{-1}_{T_{17}}(0.995) \approx 2.90 \tag{B.379, B.380}$$
b. The critical values are:
$$c_L = F^{-1}_{T_{17}}(0.025) \approx -2.11 \tag{B.381, B.382}$$
$$c_H = F^{-1}_{T_{17}}(0.975) \approx 2.11 \tag{B.383, B.384}$$
c. The critical values are:
$$c_L = F^{-1}_{T_{17}}(0.05) \approx -1.74 \tag{B.385, B.386}$$
$$c_H = F^{-1}_{T_{17}}(0.95) \approx 1.74 \tag{B.387, B.388}$$
9 An introduction to R
[Figure: scatter plot of mpg against wt, with color indicating factor(cyl) and size indicating qsec]
c. CSV
2. Here are my descriptions, yours may be somewhat different:
a. A crosswalk table is a data table we can use to translate variables that
are expressed in one way into another way. For example, we might
use a crosswalk table to translate country names into standardized
country codes, or to translate postal codes into provinces.
b. When we have two data tables that contain information on related
cross-sectional units, we can combine their information into a single table
by matching observations based on a variable that (a) exists in both tables
and (b) connects the observations in some way.
c. Aggregating data by groups allows us to group observations according
to a common characteristic, and describe those groups using data
calculated from the individual observations.
3. You can edit cell A1 under scenarios (a) and (c).
4. If you do this:
a. Nothing will happen, but you can ask Excel to mark invalid data.
b. Excel will not allow you to enter invalid data.
5. The R code will be something like this.
library("tidyverse")
deniro <- read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/deniro.csv")
## Rows: 87 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Title
## dbl (2): Year, Score
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(deniro)
## [1] 87
ncol(deniro)
## [1] 3
11 Using R
## # A tibble: 133 x 5
## MonthYr Year EmpRate UnempRate AnnPopGrowth
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2020-04-01 2020 0.521 0.131 0.0132
## 2 2020-05-01 2020 0.530 0.137 0.0124
## 3 2020-06-01 2020 0.560 0.125 0.0118
## 4 2020-07-01 2020 0.573 0.109 0.0109
## 5 2020-08-01 2020 0.580 0.102 0.0104
## 6 2020-03-01 2020 0.585 0.0789 0.0143
## 7 2021-01-01 2021 0.586 0.0941 0.00863
## 8 2020-09-01 2020 0.591 0.0918 0.00992
## 9 2020-12-01 2020 0.593 0.0876 0.00905
## 10 2020-10-01 2020 0.594 0.0902 0.00961
## # ... with 123 more rows
## [1] 0.610809
## $EmpRate
## [1] 0.6139724
##
## $UnempRate
## [1] 0.07082771
##
## $AnnPopGrowth
## [1] 0.01171851
PPData %>%
count(cut_interval(EmpRate, 6))
## # A tibble: 5 x 2
## `cut_interval(EmpRate, 6)` n
## <fct> <int>
## 1 [0.521,0.537] 2
## 2 (0.554,0.57] 1
## 3 (0.57,0.587] 4
## 4 (0.587,0.603] 4
## 5 (0.603,0.62] 122
qnorm(0.45, mean = 4, sd = 6)
## [1] 3.246032
qt(0.975, df = 8)
## [1] 2.306004
pnorm(0.75)
## [1] 0.7733726
## [1] 6 6 8 5 8
5. The code below is the minimal code needed to create the histogram us-
ing the default options. You could improve on the graph quite easily by
adjusting some of these options.
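A minimal sketch of the code (assuming PPData as constructed in question (1)):
ggplot(data = PPData, mapping = aes(x = EmpRate)) +
  geom_histogram()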
[Figure: histogram of EmpRate]
6. The code below is the minimal code needed to create the graph using the
default options. You could improve on the graph quite easily by adjusting
some of these options.
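A minimal sketch of the code:
ggplot(data = PPData, mapping = aes(x = MonthYr, y = EmpRate)) +
  geom_line()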
[Figure: time series graph of EmpRate]
2. I used casewise deletion but it does not matter. It only matters when you
add a third variable.
3. The tables are
a. Simple frequency table
b. Conditional average
c. Crosstab
4. The scatter plot should look something like this:
[Figure: scatter plot of UnempRate against AnnPopGrowth]
5. The same scatter plot with smooth and linear fits added should look
something like this:
[Figure: scatter plot of UnempRate against AnnPopGrowth with smooth and linear fit lines]
6. Remember that the indicator function 𝐼(⋅) returns one for true statements
and zero for false statements.
a. 𝐼(𝑥 < 5) = 𝐼(4 < 5) = 1
b. 𝐼(𝑥 is an odd number) = 𝐼(4 is an odd number) = 0
c. 𝐼(𝑥 < 5) − 𝐼(√𝑥 is an integer) = 𝐼(4 < 5) − 𝐼(√4 is an integer) = 1 − 1 = 0
7. The limits are:
a. The limit of this sequence is 0
b. This sequence has no limit.
c. The limit of this sequence is 5
d. The limit of this sequence is 0
e. The limit of this sequence is ∞
8. The summation values are:
a. $\sum_{x=1}^{5} x^2 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 55$
b. $\sum_{i \in \{1,3,5\}} \ln(i) = \ln(1) + \ln(3) + \ln(5) \approx 2.708$
c. $\sum_{i=1}^{\infty} x_i I(i < 4) = x_1 + x_2 + x_3$
9. Statements (a) and (c) are true.