READING 2
Organizing, Visualizing,
and Describing Data
by Pamela Peterson Drake, PhD, CFA, and Jian Wu, PhD
Pamela Peterson Drake, PhD, CFA, is at James Madison University (USA). Jian Wu, PhD,
is at State Street (USA).
LEARNING OUTCOMES

The candidate should be able to:
1 INTRODUCTION
Data have always been a key input for securities analysis and investment management,
but the acceleration in the availability and the quantity of data has also been driving
the rapid evolution of the investment industry. With the rise of big data and machine
learning techniques, investment practitioners are embracing an era featuring large
volume, high velocity, and a wide variety of data—allowing them to explore and exploit
this abundance of information for their investment strategies.
While this data-rich environment offers potentially tremendous opportunities for
investors, turning data into useful information is not so straightforward. Organizing,
cleaning, and analyzing data are crucial to the development of successful investment
strategies; otherwise, we end up with “garbage in and garbage out” and failed invest-
ments. It is often said that 80% of an analyst’s time is spent on finding, organizing,
cleaning, and analyzing data, while just 20% of her/his time is taken up by model
development. So, the importance of having a properly organized, cleansed, and well-
analyzed dataset cannot be over-emphasized. With this essential requirement met,
an appropriately executed data analysis can detect important relationships within
data, uncover underlying structures, identify outliers, and extract potentially valuable
insights. Utilizing both visual tools and quantitative methods, like the ones covered
in this reading, is the first step in summarizing and understanding data that will be
crucial inputs to an investment strategy.
This reading provides a foundation for understanding important concepts that are
an indispensable part of the analytical tool kit needed by investment practitioners,
from junior analysts to senior portfolio managers. These basic concepts pave the way
for more sophisticated tools that will be developed as the quantitative methods topic
unfolds and that are integral to gaining competencies in the investment management
techniques and asset classes that are presented later in the CFA curriculum.
Section 2 covers core data types, including continuous and discrete numerical
data, nominal and ordinal categorical data, and structured versus unstructured data.
Organizing data into arrays and data tables and summarizing data in frequency dis-
tributions and contingency tables are discussed in Sections 3–5. Section 6 introduces
the important topic of data visualization using a range of charts and graphics to
summarize, explore, and better understand data. Section 7 covers the key measures
of central tendency, including several variants of mean that are especially useful in
investments. Quantiles and their investment applications are the focus of Section
8. Key measures of dispersion are discussed in Sections 9 and 10. The shape of data
distributions—specifically, skewness and kurtosis—is covered in Section 11. Section
12 provides a graphical introduction to covariance and correlation between two vari-
ables. The reading concludes with a Summary.
2 DATA TYPES
Exhibit 1
Sector (Text Label)          Code (Numerical Label)
Energy                       10
Materials                    15
Industrials                  20
Consumer Discretionary       25
Consumer Staples             30
Health Care                  35
Financials                   40
Information Technology       45
Communication Services       50
Utilities                    55
Real Estate                  60
Text labels are a common format to represent nominal data, but nominal data
can also be coded with numerical labels. As shown below, the column named “Code”
contains a corresponding GICS code of each sector as a numerical value. However,
the nominal data in numerical format do not indicate ranking, and any arithmetic
operations on nominal data are not meaningful. In this example, the energy sector
with the code 10 does not represent a lower or higher rank than the real estate sector
with the code 60. Often, financial models, such as regression models, require input
data to be numerical; so, nominal data in the input dataset must be coded numerically
before applying an algorithm (that is, a process for problem solving) for performing
the analysis. This would be mainly to identify the category (here, sector) in the model.
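To make this concrete, the sketch below shows one common way to code nominal sector data numerically in Python with pandas; the tickers are hypothetical, and the sector-to-code mapping follows Exhibit 1. One-hot (dummy) encoding is shown as well, since it lets a regression use the categories without implying any ordering among the codes.

```python
import pandas as pd

# Hypothetical holdings labeled with GICS sector names (nominal data).
stocks = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "sector": ["Energy", "Health Care", "Real Estate"],
})

# Map each sector's text label to its numerical code, as in Exhibit 1.
gics_codes = {"Energy": 10, "Health Care": 35, "Real Estate": 60}
stocks["sector_code"] = stocks["sector"].map(gics_codes)

# For regression inputs, one-hot (dummy) encoding avoids implying any
# ordering among the codes: each sector becomes its own 0/1 column.
dummies = pd.get_dummies(stocks["sector"], prefix="sector")
print(stocks.join(dummies))
```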
Ordinal data are categorical values that can be logically ordered or ranked. For
example, the Morningstar and Standard & Poor’s star ratings for investment funds
are ordinal data in which one star represents a group of funds judged to have had
relatively the worst performance, with two, three, four, and five stars representing
groups with increasingly better performance or quality as evaluated by those firms.
Ordinal data may also involve numbers to identify categories. For example, in
ranking growth-oriented investment funds based on their five-year cumulative returns,
we might assign the number 1 to the top performing 10% of funds, the number 2 to
next best performing 10% of funds, and so on; the number 10 represents the bottom
performing 10% of funds. Although categories represented by ordinal data can be
ranked higher or lower relative to each other, the rankings do not establish a
numerical difference between the categories. Importantly, such investment
fund ranking tells us nothing about the difference in performance between funds
ranked 1 and 2 compared with the difference in performance between funds ranked
3 and 4 or 9 and 10.
Having discussed different data types from a statistical perspective, it is
important to note that identifying data types may seem straightforward at first
glance. In some situations, however, categorical data that are coded in numerical
format must be distinguished from numerical data. A sound rule of thumb: Meaningful
arithmetic operations can be performed on numerical data but not on categorical data.
EXAMPLE 1

Identify the data type for each of the following kinds of investment-related
information:

1 Number of coupon payments for a corporate bond.
2 Cash dividends per share paid by a public company. Note that cash
dividends are a distribution paid to shareholders based on the number of
shares owned.
3 Credit ratings for corporate bond issues. As background, credit ratings
gauge the bond issuer’s ability to meet the promised payments on the
bond. Bond rating agencies typically assign bond issues to discrete catego-
ries that are in descending order of credit quality (i.e., increasing probabil-
ity of non-payment or default).
4 Hedge fund classification types. Note that hedge funds are investment
vehicles that are relatively unconstrained in their use of debt, derivatives,
and long and short investment strategies. Hedge fund classification types
group hedge funds by the kind of investment strategy they pursue.
Solution to 1
The number of coupon payments is discrete data. For example, a newly-issued
5-year corporate bond paying interest semi-annually (quarterly) will make 10
(20) coupon payments during its life. In this case, coupon payments are limited
to a finite number of values; so, they are discrete.
Solution to 2
Cash dividends per share are continuous data since they can take on any
non-negative value.
Solution to 3
Credit ratings are ordinal data. A rating places a bond issue in a category, and
the categories are ordered with respect to the expected probability of default.
But arithmetic operations cannot be done on credit ratings, and the difference in
the expected probability of default between categories of highly rated bonds, for
example, is not necessarily equal to that between categories of lowly rated bonds.
Solution to 4
Hedge fund classification types are nominal data. Each type groups together
hedge funds with similar investment strategies. In contrast to credit ratings for
bonds, however, hedge fund classification schemes do not involve a ranking.
Thus, such classification schemes are not ordinal data.
Unstructured data are a relatively new classification driven by the rise of alterna-
tive data (i.e., data generated from unconventional sources, like electronic devices,
social media, sensor networks, and satellites, but also by companies in the normal
course of business) and its growing adoption in the financial industry. Unstructured
data are typically alternative data as they are usually collected from unconventional
sources. By indicating the source from which the data are generated, such data can
be classified into three groups:
■■ Produced by individuals (e.g., via social media posts, web searches);
■■ Generated by business processes (e.g., via credit card transactions, corporate
regulatory filings); and
■■ Generated by sensors (e.g., via satellite imagery, foot traffic measured by
mobile devices).
Unstructured data may offer new market insights not normally contained in data
from traditional sources and may provide potential sources of returns for investment
processes. Unlike structured data, however, utilizing unstructured data in investment
analysis is challenging. Typically, financial models are able to take only structured data
as inputs; therefore, unstructured data must first be transformed into structured data
that models can process.
Exhibit 3 shows an excerpt from Form 10-Q (Quarterly Report) filed by Company
XYZ with the US Securities and Exchange Commission (SEC) for the fiscal quarter
ended 31 March 20XX. The form is an unstructured mix of text and tables, so it can-
not be directly used by computers as input to financial models. The SEC has utilized
eXtensible Business Reporting Language (XBRL) to structure such data. The data
extracted from the XBRL submission can be organized into five tab-delimited TXT
format files that contain information about the submission, including taxonomy tags
(i.e., financial statement items), dates, units of measure (uom), values (i.e., for the tag
items), and more—making it readable by computer. Exhibit 4 shows an excerpt from
one of the now structured data tables downloaded from the SEC’s EDGAR (Electronic
Data Gathering, Analysis, and Retrieval) database.
Exhibit 3 Excerpt from 10-Q of Company XYZ for Fiscal Quarter Ended 31 March 20XX
Company XYZ
Form 10-Q
Fiscal Quarter Ended 31 March 20XX
Table of Contents

Part I                                                                  Page
Item 1     Financial Statements                                            1
Item 2     Management’s Discussion and Analysis of Financial
           Condition and Results of Operations                            21
Item 3     Quantitative and Qualitative Disclosures About Market Risk     32
Item 4     Controls and Procedures                                        32

Part II
Item 1     Legal Proceedings                                              33
Item 1A    Risk Factors                                                   33
Item 2     Unregistered Sales of Equity Securities and Use of Proceeds    43
Item 3     Defaults Upon Senior Securities                                43

[...]
Cost of sales:
Products 32,047
Services 4,147
Total cost of sales 36,194
Gross margin 21,821
Operating expenses:
Research and development 3,948
Selling, general and administrative 4,458
Total operating expenses 8,406
Source: EDGAR.
Exhibit 4 Structured Data Extracted from Form 10-Q of Company XYZ for Fiscal Quarter Ended 31
March 20XX
adsh tag ddate uom value
Source: EDGAR.
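As a minimal sketch of consuming such structured data, the Python snippet below reads one of the tab-delimited TXT files into a table; the file name num.txt and the "Revenues" tag query are illustrative assumptions, while the column names are those shown in Exhibit 4.

```python
import pandas as pd

# Load one of the tab-delimited TXT files extracted from the XBRL
# submission (the file name is illustrative; the column names are those
# shown in Exhibit 4).
num = pd.read_csv("num.txt", sep="\t", dtype={"adsh": str, "tag": str})

# With the data now structured, standard filtering works; for example,
# pull every reported value for a hypothetical "Revenues" taxonomy tag.
revenues = num.loc[num["tag"] == "Revenues", ["adsh", "ddate", "uom", "value"]]
print(revenues.head())
```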
EXAMPLE 2
Solution to 1
B is correct as daily closing prices constitute structured data. A is incorrect as
social media posts are unstructured data. C is incorrect as audio and video are
unstructured data.
Solution to 2
C is correct as it most accurately describes panel data. A is incorrect as it
describes time-series data. B is incorrect as it describes cross-sectional data.
Solution to 3
C is correct as dates are ordinal data that can be sorted by chronological order but
not by value. A and B are incorrect as both daily trading volumes and earnings
per share (EPS) are numerical data, so they can be sorted by values.
Solution to 4
A is correct since a time series is a sequence of observations of a specific variable
(XYZ stock price) collected over time (60 months) and at discrete intervals of
time (daily). B and C are both incorrect as they are cross-sectional data.
Day     Closing Price ($)
1       57.21
2       58.26
3       58.64
4       56.19
5       54.78
6       54.26
7       56.88
8       54.74
9       52.42
10      50.14
A two-dimensional rectangular array (also called a data table) is one of the most
popular forms for organizing data for processing by computers or for presenting data
visually for consumption by humans. Similar to the structure in an Excel spreadsheet,
a data table is comprised of columns and rows to hold multiple variables and multiple
observations, respectively. When a data table is used to organize the data of one single
observational unit (i.e., a single company), each column represents a different variable
(feature or attribute) of that observational unit, and each row holds an observation for
the different variables; successive rows represent the observations for successive time
periods. In other words, observations of each variable are a time-series sequence that
is sorted in either ascending or descending time order. Consequently, observations
of different variables must be sorted and aligned to the same time scale. Example 3
shows how to organize a raw dataset for a company collected online into a machine-
readable data table.
EXAMPLE 3

Suppose the following quarterly valuation data for ABC Inc. were collected
online, with figures reported for Year 1 and Year 2:

                        Year 1         Year 2
March
  Revenue               $3,784(M)      $4,097(M)
  EPS                   1.37           −0.34
  DPS                   N/A            N/A
June
  Revenue               $4,236(M)      $5,905(M)
  EPS                   1.78           3.89
  DPS                   N/A            0.25
September
  Revenue               $4,187(M)      $4,997(M)
  EPS                   −3.38          −2.88
  DPS                   N/A            0.25
December
  Revenue               $3,889(M)      $4,389(M)
  EPS                   −8.66          −3.98
  DPS                   N/A            0.25
Use the data to construct a two-dimensional rectangular array (i.e., data table)
with the columns representing the metrics for valuation and the observations
arranged in a time-series sequence.
Solution:
To construct a two-dimensional rectangular array, we first need to determine
the data table structure. The columns have been specified to represent the three
valuation metrics (i.e., variables): revenue, EPS and DPS. The rows should be
the observations for each variable in a time ordered sequence. In this example,
the data for the valuation measures will be organized in the same quarterly
intervals as the raw data retrieved online, starting from Q1 Year 1 to Q4 Year 2.
Then, the observations from the original table can be placed accordingly into the
data table by variable name and by filing quarter. Exhibit 7 shows the raw data
reorganized in the two-dimensional rectangular array (by date and associated
valuation metric), which can now be used in financial analysis and is readable
by a computer.
It is worth pointing out that when values are missing as the data are organized,
how they should be handled depends largely on why they are missing. In this
example, dividends (DPS) in the first five quarters are missing because ABC Inc.
did not authorize (and pay) any dividends. So, filling the dividend column with
zeros is appropriate. If revenue, EPS, and DPS of a given quarter are missing
due to particular data source issues, however, these missing values cannot be
simply replaced with zeros; this action would result in incorrect interpretation.
Instead, the missing values might be replaced with the latest available data or with
interpolated values, depending on how the data will be consumed or modeled.
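The two treatments described above can be sketched in Python with pandas; the quarter labels follow the exhibit, and the calls shown are one reasonable implementation rather than the only one:

```python
import pandas as pd

# Quarterly DPS for ABC Inc. from Example 3: missing in the first five
# quarters because no dividend was authorized, so filling with zero is correct.
quarters = [f"Q{q} Year {y}" for y in (1, 2) for q in (1, 2, 3, 4)]
dps = pd.Series([None, None, None, None, None, 0.25, 0.25, 0.25], index=quarters)

print(dps.fillna(0.0))   # appropriate here: the company paid no dividend

# Had a value been missing due to a data-source issue instead, carrying the
# latest available observation forward (or interpolating) would be more
# appropriate than substituting zero. Note that ffill cannot fill a gap
# that has no earlier observation.
print(dps.ffill())
```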
Sector                      Absolute Frequency    Relative Frequency
Industrials                 73                    15.2%
Information Technology      69                    14.4%
Financials                  67                    14.0%
Consumer Discretionary      62                    12.9%
Health Care                 54                    11.3%
Consumer Staples            33                    6.9%
Real Estate                 30                    6.3%
Energy                      29                    6.1%
Utilities                   26                    5.4%
Materials                   26                    5.4%
Communication Services      10                    2.1%
Total                       479                   100.0%
6 Determine the number of observations falling into each bin by counting the
number of observations whose values are equal to or greater than the bin's
minimum value but less than the bin's maximum value. The exception is the last
bin, whose maximum endpoint equals the maximum value of the data; the observation
with the maximum value is therefore included in that bin's count.
7 Construct a table of the bins listed from smallest to largest that shows the num-
ber of observations falling into each bin.
In Step 4, when rounding the bin width, round up (rather than down) to ensure
that the final bin includes the maximum value of the data.
These seven steps are basic guidelines for constructing frequency distributions. In
practice, however, we may want to refine the above basic procedure. For example, we
may want the bins to begin and end with whole numbers for ease of interpretation.
Another practical refinement that promotes interpretation is to start the first bin at
the nearest whole number below the minimum value.
As this procedure implies, a frequency distribution groups data into a set of bins,
where each bin is defined by a unique set of values (i.e., beginning and ending points).
Each observation falls into only one bin, and the total number of bins covers all the
values represented in the data. The frequency distribution is the list of the bins together
with the corresponding measures of frequency.
To illustrate the basic procedure, suppose we have 12 observations sorted in
ascending order (Step 1):
−4.57, −4.04, −1.64, 0.28, 1.34, 2.35, 2.38, 4.28, 4.42, 4.68, 7.16, and 11.43.
The minimum observation is −4.57, and the maximum observation is +11.43. So,
the range is +11.43 − (−4.57) = 16 (Step 2).
If we set k = 4 (Step 3), then the bin width is 16/4 = 4 (Step 4).
Exhibit 9 shows the repeated addition of the bin width of 4 to determine the end-
point for each of the bins (Step 5).
Thus, the bins are [−4.57 to −0.57), [−0.57 to 3.43), [3.43 to 7.43), and [7.43 to
11.43], where the notation [−4.57 to −0.57) indicates −4.57 ≤ observation < −0.57. The
parentheses indicate that the endpoints are not included in the bins, and the square
brackets indicate that the beginning points and the last endpoint are included in the
bin. Exhibit 10 summarizes Steps 5 through 7.
Note that the bins do not overlap, so each observation can be placed uniquely into
one bin, and the last bin includes the maximum value.
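The seven-step procedure can be reproduced in a few lines of Python using numpy, applied to the 12 observations above; np.histogram happens to match the bin convention of Step 6 (half-open bins, with the last bin closed on the right):

```python
import numpy as np

# Step 1: the 12 observations, sorted in ascending order.
x = np.array([-4.57, -4.04, -1.64, 0.28, 1.34, 2.35, 2.38,
              4.28, 4.42, 4.68, 7.16, 11.43])

rng = x.max() - x.min()                       # Step 2: range = 16
k = 4                                         # Step 3: number of bins
width = rng / k                               # Step 4: bin width = 4
edges = x.min() + width * np.arange(k + 1)    # Step 5: bin endpoints

# Step 6: np.histogram counts observations in [lo, hi) bins, except the
# last bin, which is closed on the right and so includes the maximum value.
counts, _ = np.histogram(x, bins=edges)

# Step 7: tabulate bins from smallest to largest with their counts.
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:6.2f} to {hi:6.2f}: {c}")     # 3, 4, 4, 1
```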
We turn to these issues in discussing the construction of frequency distributions
for daily returns of the fictitious Euro-Asia-Africa (EAA) Equity Index. The dataset
of daily returns of the EAA Equity Index spans a five-year period and consists of
1,258 observations with a minimum value of −4.1% and a maximum value of 5.0%.
Thus, the range of the data is 5% − (−4.1%) = 9.1%, approximately. [The mean daily
return—mean as a measure of central tendency will be discussed shortly—is 0.04%.]
The decision on the number of bins (k) into which we should group the observa-
tions often involves inspecting the data and exercising judgment. How much detail
should we include? If we use too few bins, we will summarize too much and may lose
pertinent characteristics. Conversely, if we use too many bins, we may not summarize
enough and may introduce unnecessary noise.
We can establish an appropriate value for k by evaluating the usefulness of the
resulting bin width. A large number of empty bins may indicate that we are attempting
to over-organize the data to present too much detail. Starting with a relatively small
bin width, we can see whether the bins are mostly empty and whether the value of k
associated with that bin width is too large. If the bins are mostly empty, implying
that k is too large, we can consider increasingly larger bins (i.e., smaller values
of k) until we have a frequency distribution that effectively summarizes the distribution.
Suppose that for ease of interpretation we want to use a bin width stated in whole
rather than fractional percentages. In the case of the daily EAA Equity Index returns,
a 1% bin width would be associated with 9.1/1 = 9.1 bins, which can be rounded up to
k = 10 bins. That number of bins will cover a range of 1% × 10 = 10%. By constructing
the frequency distribution in this manner, we will also have bins that end and begin
at a value of 0%, thereby allowing us to count the negative and positive returns in the
data. Without too much work, we have found an effective way to summarize the data.
Exhibit 11 shows the frequency distribution for the daily returns of the EAA Equity
Index using return bins of 1%, where the first bin includes returns from −5.0% to −4.0%
(exclusive, meaning < −4%) and the last bin includes daily returns from 4.0% to 5.0%
(inclusive, meaning ≤ 5%). Note that to facilitate interpretation, the first bin starts at
the nearest whole number below the minimum value (so, at −5.0%).
Exhibit 11 includes two other useful ways to present the data (which can be com-
puted in a straightforward manner once we have established the absolute and relative
frequency distributions): the cumulative absolute frequency and the cumulative relative
frequency. The cumulative absolute frequency cumulates (meaning, adds up) the
absolute frequencies as we move from the first bin to the last bin. Similarly, the
cumulative relative frequency is a sequence of partial sums of the relative frequencies.
For the last bin, the cumulative absolute frequency will equal the number of observations
in the dataset (1,258), and the cumulative relative frequency will equal 100%.
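A small numpy sketch illustrates these cumulative measures; only the two middle bin counts (470 and 555) are given in the text, so the remaining counts are illustrative values chosen to sum to 1,258:

```python
import numpy as np

# Absolute frequencies per 1% return bin from -5% to 5%. The two middle
# counts (470 and 555) are given in the text; the tail counts are
# illustrative values that sum with them to the 1,258 observations.
abs_freq = np.array([1, 3, 11, 45, 108, 470, 555, 48, 12, 5])

cum_abs = np.cumsum(abs_freq)          # cumulative absolute frequency
rel = abs_freq / abs_freq.sum()        # relative frequency
cum_rel = np.cumsum(rel)               # cumulative relative frequency

# The last entries equal the sample size and 100%, respectively.
print(cum_abs[-1], f"{cum_rel[-1]:.0%}")   # 1258 100%
```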
EXAMPLE 4

Suppose we have the returns shown in Exhibit 12 for 18 country market indexes:

Exhibit 12
Market Index     Return (%)
Country A        7.7
Country B        8.5
Country C        9.1
Country D        5.5
Country E        7.1
Country F        9.9
Country G        6.2
Country H        6.8
Country I        7.5
Country J        8.9
Country K        7.4
Country L        8.6
Country M        9.6
Country N        7.7
Country O        6.8
Country P        6.1
Country Q        8.8
Country R        7.9
Construct a frequency distribution table from these data and state some key
findings from the summarized data.
Solution:
The first step in constructing a frequency distribution table is to sort the return
data in ascending order:
Market Index Return (%)
Country D 5.5
Country P 6.1
Country G 6.2
Country H 6.8
Country O 6.8
Country E 7.1
Country K 7.4
Country I 7.5
Country A 7.7
Country N 7.7
Country R 7.9
Country B 8.5
Country L 8.6
Country Q 8.8
Country J 8.9
Country C 9.1
Country M 9.6
Country F 9.9
The second step is to calculate the range of the data, which is 9.9% − 5.5% = 4.4%.
The third step is to decide on the number of bins. Here, we will use k = 5.
The fourth step is to determine the bin width. Here, it is 4.4%/5 = 0.88%, which
we will round up to 1.0%.
The fifth step is to determine the bins, which are as follows:
5.0% + 1.0% = 6.0%
6.0% + 1.0% = 7.0%
7.0% + 1.0% = 8.0%
8.0% + 1.0% = 9.0%
9.0% + 1.0% = 10.0%
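To complete the solution programmatically, the numpy sketch below tallies the 18 sorted returns into these five bins and reports absolute and relative frequencies:

```python
import numpy as np

returns = np.array([5.5, 6.1, 6.2, 6.8, 6.8, 7.1, 7.4, 7.5, 7.7,
                    7.7, 7.9, 8.5, 8.6, 8.8, 8.9, 9.1, 9.6, 9.9])
edges = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Counts per bin: [5,6) -> 1, [6,7) -> 4, [7,8) -> 6, [8,9) -> 4, [9,10] -> 3.
counts, _ = np.histogram(returns, bins=edges)
rel = counts / counts.sum()
for lo, hi, c, r in zip(edges[:-1], edges[1:], counts, rel):
    print(f"{lo:.1f}% to {hi:.1f}%: {c} ({r:.1%})")
```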
The entries in the cells of the contingency table show the number of stocks of each
sector with a given level of market cap. For example, there are 275 small-cap health
care stocks, making it the portfolio’s largest subgroup in terms of frequency. These
data are also called joint frequencies because you are joining one variable from the
row (i.e., sector) and the other variable from the column (i.e., market cap) to count
observations. The joint frequencies are then added across rows and across columns,
and these corresponding sums are called marginal frequencies. For example, the
marginal frequency of health care stocks in the portfolio is the sum of the joint fre-
quencies across all three levels of market cap, so 435 (= 275 + 105 + 55). Similarly,
adding the joint frequencies of small-cap stocks across all five sectors gives the marginal
frequency of small-cap stocks of 575 (= 55 + 50 + 175 + 275 + 20).
Clearly, health care stocks and small-cap stocks have the largest marginal frequen-
cies among sector and market cap, respectively, in this portfolio. Note the marginal
frequencies represent the frequency distribution for each variable. Finally, the marginal
frequencies for each variable must sum to the total number of stocks (overall total)
in the portfolio—here, 1,000 (shown in the lower right cell).
Similar to the one-way frequency distribution table, we can express frequency in
percentage terms as relative frequency by using one of three options. We can divide
the joint frequencies by: a) the total count; b) the marginal frequency on a row; or c)
the marginal frequency on a column.
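The sketch below illustrates all three options in Python with pandas; the health care row and the energy counts implied by the text are used, while the utilities row is illustrative:

```python
import pandas as pd

# Joint frequencies of sector by market-cap level. The health care row
# (275/105/55) and the energy counts implied by the text (175/95/20) are
# from the discussion above; the utilities row is illustrative.
joint = pd.DataFrame(
    {"Small": [275, 175, 20], "Mid": [105, 95, 25], "Large": [55, 20, 10]},
    index=["Health Care", "Energy", "Utilities"],
)

rel_total = joint / joint.to_numpy().sum()          # option (a): overall total
rel_by_row = joint.div(joint.sum(axis=1), axis=0)   # option (b): row marginals
rel_by_col = joint.div(joint.sum(axis=0), axis=1)   # option (c): column marginals

print(rel_total.round(3))
```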
Exhibit 15 shows the contingency table using relative frequencies based on total
count. It is readily apparent that small-cap health care and energy stocks comprise the
largest portions of the total portfolio, at 27.5% (= 275/1,000) and 17.5% (= 175/1,000),
respectively, followed by mid-cap health care and energy stocks, at 10.5% and 9.5%,
respectively. Together, these two sectors make up nearly three-quarters of the portfolio
(43.5% + 29.0% = 72.5%).
In conclusion, the findings from these contingency tables using frequencies and
relative frequencies indicate that in terms of the number of stocks, the portfolio can
be generally described as a small- to mid-cap-oriented health care and energy sector
portfolio that also includes stocks of several other defensive sectors.
As an analytical tool, contingency tables can be used in different applications. One
application is for evaluating the performance of a classification model (in this case,
the contingency table is called a confusion matrix). Suppose we have a model for
classifying companies into two groups: those that default on their bond payments and
those that do not default. The confusion matrix for displaying the model’s results will
be a 2 × 2 table showing the frequency of actual defaults versus the model’s predicted
frequency of defaults. Exhibit 17 shows such a confusion matrix for a sample of 2,000
non-investment-grade bonds. Using company characteristics and other inputs, the
model correctly predicts 300 cases of bond defaults and 1,650 cases of no defaults.
We can also observe that this classification model incorrectly predicts default in
40 cases where no default actually occurred and also incorrectly predicts no default
in 10 cases where default actually did occur. Later in the CFA Program curriculum
you will learn how to construct a confusion matrix, how to calculate related model
performance metrics, and how to use them to evaluate and tune a classification model.
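For illustration, the confusion matrix described above can be laid out in pandas, and a simple summary metric computed from it; accuracy is shown here only as an example, since the curriculum introduces the full set of metrics later:

```python
import pandas as pd

# Confusion matrix for the 2,000-bond sample described above: rows are
# actual outcomes and columns are the model's predictions.
confusion = pd.DataFrame(
    {"Predicted Default": [300, 40], "Predicted No Default": [10, 1650]},
    index=["Actual Default", "Actual No Default"],
)

# One standard summary metric: overall accuracy = correct predictions / total.
accuracy = (300 + 1650) / confusion.to_numpy().sum()
print(confusion)
print(f"Accuracy: {accuracy:.1%}")   # 97.5%
```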
Another application of contingency tables is to investigate a potential association
between two categorical variables. For example, revisiting Exhibit 14, one may ask
whether the distribution of stocks by sector is independent of the level of market
capitalization. Given the dominance of small-cap and mid-cap health care and energy
stocks, the answer is likely no.
One way to test for a potential association between categorical variables is to per-
form a chi-square test of independence. Essentially, the procedure involves using
the marginal frequencies in the contingency table to construct a table with expected
values of the observations. The actual values and expected values are used to derive
the chi-square test statistic. This test statistic is then compared to a value from the
chi-square distribution for a given level of significance. If the test statistic is greater
than the chi-square distribution value, then there is evidence to reject the claim of
independence, implying a significant association exists between the categorical vari-
ables. The following example describes how a contingency table is used to set up this
test of independence.
EXAMPLE 5

Suppose an analyst is studying 315 investment funds classified by investment style
(growth vs. value) and risk level (low risk vs. high risk), with observed counts
as follows:

            Low Risk    High Risk
Growth      73          26
Value       183         33
1 Calculate the number of growth funds and number of value funds out of
the total funds.
2 Calculate the number of low-risk and high-risk funds out of the total
funds.
3 Describe how the contingency table is used to set up a test for indepen-
dence between fund style and risk level.
Solution to 1
The task is to calculate the marginal frequencies by fund style, which is done by
adding joint frequencies across the rows. Therefore, the marginal frequency for
growth is 73 + 26 = 99, and the marginal frequency for value is 183 + 33 = 216.
Solution to 2
The task is to calculate the marginal frequencies by fund risk, which is done by
adding joint frequencies down the columns. Therefore, the marginal frequency
for low risk is 73 + 183 = 256, and the marginal frequency for high risk is 26 +
33 = 59.
Solution to 3
Based on the procedure mentioned for conducting a chi-square test of indepen-
dence, we would perform the following three steps.
Step 1: Add the marginal frequencies and overall total to the contingency
table. We have also included the relative frequency table for observed values.
Step 2: Use the marginal frequencies to construct a table of expected values for
the observations. The expected value for each cell equals its row marginal frequency
multiplied by its column marginal frequency, divided by the overall total. For
example, the expected value for Value/High Risk is: (216 × 59)/315 = 40.46.
The table of expected values (and accompanying relative frequency table) is:
Step 3: Use the actual values and the expected values of observation counts
to derive the chi-square test statistic, which is then compared to a value from
the chi-square distribution for a given level of significance. If the test statistic
is greater than the chi-square distribution value, then there is evidence of a
significant association between the categorical variables.
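As a sketch of Step 3, scipy's chi2_contingency function performs this entire procedure on the observed counts from Example 5; it reproduces the expected-value table (including the 40.46 computed above) and returns the test statistic and p-value:

```python
from scipy.stats import chi2_contingency

# Observed joint frequencies from Example 5: fund style (rows) by
# risk level (columns).
observed = [[73, 26],    # Growth: low risk, high risk
            [183, 33]]   # Value:  low risk, high risk

# chi2_contingency computes the expected-value table from the marginal
# frequencies and returns the test statistic and p-value. (For 2x2 tables
# it applies a continuity correction by default.)
chi2, p_value, dof, expected = chi2_contingency(observed)

print(expected.round(2))   # expected[1][1] reproduces (216 x 59)/315 = 40.46
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}, dof = {dof}")
```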
6 DATA VISUALIZATION
e Describe ways that data may be visualized and evaluate uses of specific
visualizations
Visualization is the presentation of data in a pictorial or graphical format for the
purpose of increasing understanding and for gaining insights into the data. As has
been said, “a picture is worth a thousand words.” In this section, we discuss a variety
of charts that are useful for understanding distributions, making comparisons, and
exploring potential relationships among data. Specifically, we will cover visualizing
frequency distributions of numerical and categorical data, using plots that represent
multi-dimensional data for discovering relationships, and interpreting visuals that
display unstructured data.
A quick glance can tell us that the return bin 0% to 1% (exclusive) has the highest
frequency, with more than 500 observations (555, to be exact), and it is represented
by the tallest bar in the histogram.
An advantage of the histogram is that it can effectively present a large amount of
numerical data that has been grouped into a frequency distribution and can allow a
quick inspection of the shape, center, and spread of the distribution to better under-
stand it. For example, in Exhibit 20, despite the histogram of daily EAA Equity Index
returns appearing bell-shaped and roughly symmetrical, most bars to the right side
of the origin (i.e., zero) are taller than those on the left side, indicating that more
observations lie in the bins in positive territory. Remember that in the earlier dis-
cussion of this return distribution, it was noted that 54.1% of the observations are
positive daily returns.
As mentioned, histograms can also be created with relative frequencies—the choice
of using absolute versus relative frequency depends on the question being answered.
An absolute frequency histogram best answers the question of how many items are
in each bin, while a relative frequency histogram gives the proportion or percentage
of the total observations in each bin.
Exhibit 20 [histogram of daily EAA Equity Index returns; x-axis: Index Return (%),
from −5 to 5 in 1% bins; y-axis: frequency, 0 to above 500]
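The sketch below shows how both histogram variants might be drawn with matplotlib; because the EAA return series itself is not reproduced here, simulated returns with the same mean stand in for the data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(0.04, 1.0, 1258)   # simulated daily returns (%); a
                                        # stand-in for the EAA series

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
bins = np.arange(-5, 6)                 # 1% bins from -5% to 5%

# Absolute frequency histogram: how many observations fall in each bin.
ax1.hist(returns, bins=bins)
ax1.set(title="Absolute Frequency", xlabel="Index Return (%)")

# Relative frequency histogram: proportion of observations in each bin.
weights = np.full(len(returns), 1 / len(returns))
ax2.hist(returns, bins=bins, weights=weights)
ax2.set(title="Relative Frequency", xlabel="Index Return (%)")
plt.show()
```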
To construct a chart of a cumulative frequency distribution, we graph the returns
in the fourth (i.e., Cumulative Absolute Frequency) or fifth (i.e., Cumulative
Relative Frequency) column of Exhibit 11 against the upper limit of each return interval.
Exhibit 21 presents the graph of the cumulative absolute frequency distribution
for the daily returns on the EAA Equity Index. Notice that the cumulative distribu-
tion tends to flatten out when returns are extremely negative or extremely positive
because the frequencies in these bins are quite small. The steep slope in the middle
of Exhibit 21 reflects the fact that most of the observations—[(470 + 555)/1,258], or
81.5%—lie in the neighborhood of −1.0% to 1.0%.
Exhibit 21 [cumulative absolute frequency distribution of daily EAA Equity Index
returns; x-axis: Index Return (%), −4 to 5; y-axis: cumulative absolute frequency,
0 to about 1,200]
In Exhibit 22, the length of each bar represents the absolute frequency of each sector. Since sectors are nominal
data with no logical ordering, the bars representing sectors may be arranged in any
order. However, in the particular case where the categories in a bar chart are ordered
by frequency in descending order and the chart includes a line displaying cumulative
relative frequency, then it is called a Pareto Chart. The chart is often used to highlight
dominant categories or the most important groups.
Bar charts provide a snapshot to show the comparison between categories of data.
As shown in Exhibit 22, the sector in which the portfolio holds most stocks is the
health care sector, with 435 stocks, followed by the energy sector, with 290 stocks.
The sector in which the portfolio has the least number of stocks is utilities, with 55
stocks. To compare categories more accurately, in some cases we may add the fre-
quency count to the right end of each bar (or the top end of each bar in the case of
a vertical bar chart).
Exhibit 22 [horizontal bar chart of frequency by GICS sector: Communication
Services, Consumer Staples, Energy, Health Care, and Utilities]
The bar chart shown in Exhibit 22 can present the frequency distribution of only
one categorical variable. In the case of two categorical variables, we need an enhanced
version of the bar chart, called a grouped bar chart (also known as a clustered bar
chart), to show joint frequencies. Using the joint frequencies by sector and by level
of market capitalization given in Exhibit 14, for example, we show how a grouped
bar chart is constructed in Exhibit 23. While the y-axis still represents the same cat-
egorical variable (the distinct GICS sectors as in Exhibit 22), in Exhibit 23 three bars
are clustered side-by-side within the same sector to represent the three respective
levels of market capitalization. The bars within each cluster should be colored differ-
ently to distinguish between them, but the color schemes for the sub-groups must
be identical across the sector clusters, as shown by the legend at the upper right of
Exhibit 23. Additionally, the bars in each sector cluster must always be placed in the
same order throughout the chart. It is easy to see that the small-cap health care stocks
are the sub-group with the highest frequency (275), and we can also see that small-
cap stocks are the largest sub-group within each sector—except for utilities, where
mid cap is the largest.
Exhibit 23 [grouped bar chart of frequency by sector, with small-, mid-, and
large-cap bars clustered side-by-side within each sector]
An alternative form for presenting the joint frequency distribution of two cat-
egorical variables is a stacked bar chart. In the vertical version of a stacked bar
chart, the bars representing the sub-groups are placed on top of each other to form
a single bar. Each subsection of the bar is shown in a different color to represent the
contribution of each sub-group, and the overall height of the stacked bar represents
the marginal frequency for the category. Exhibit 23 can be replotted in a stacked bar
chart, as shown in Exhibit 24.
Exhibit 24 [stacked bar chart of frequency by sector; x-axis: Sector (Communication
Services, Consumer Staples, Energy, Health Care, Utilities); y-axis: frequency,
with market-cap sub-groups stacked within each bar]
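Both chart types can be sketched with matplotlib as follows. The health care (275/105/55) and energy (175/95/20) counts, the small-cap counts, and the utilities pattern follow the text; the split of the remaining counts between Communication Services and Consumer Staples is an assumption for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

sectors = ["Comm. Services", "Cons. Staples", "Energy", "Health Care", "Utilities"]
small = np.array([55, 50, 175, 275, 20])   # small-cap counts from the text
mid   = np.array([35, 30, 95, 105, 25])    # mid/large rows partly illustrative
large = np.array([30, 20, 20, 55, 10])

x = np.arange(len(sectors))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# Grouped (clustered) bars: one bar per market-cap level within each sector,
# identically colored and ordered across clusters.
w = 0.25
ax1.bar(x - w, small, w, label="Small")
ax1.bar(x, mid, w, label="Mid")
ax1.bar(x + w, large, w, label="Large")
ax1.set_title("Grouped Bar Chart")

# Stacked bars: sub-groups stacked so the total height of each bar equals
# the sector's marginal frequency.
ax2.bar(x, small, label="Small")
ax2.bar(x, mid, bottom=small, label="Mid")
ax2.bar(x, large, bottom=small + mid, label="Large")
ax2.set_title("Stacked Bar Chart")

for ax in (ax1, ax2):
    ax.set_xticks(x)
    ax.set_xticklabels(sectors, rotation=30, ha="right")
    ax.legend()
plt.show()
```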
We have shown that the frequency distribution of categorical data can be clearly
and efficiently presented by using a bar chart. However, it is worth noting that appli-
cations of bar charts may be extended to more general cases when categorical data are
associated with numerical data. For example, suppose we want to show a company’s
quarterly profits over the past one year. In this case, we can plot a vertical bar chart
where each bar represents one of the four quarters in a time order and its height
indicates the value of profits for that quarter.
6.3 Tree-Map
In addition to bar charts and grouped bar charts, another graphical tool for displaying
categorical data is a tree-map. It consists of a set of colored rectangles to represent
distinct groups, and the area of each rectangle is proportional to the value of the
corresponding group. For example, referring back to the marginal frequencies by
GICS sector in Exhibit 14, we plot a tree-map in Exhibit 25 to represent the frequency
distribution by sector for stocks in the portfolio. The tree-map clearly shows that
health care is the sector with the largest number of stocks in the portfolio, which is
represented by the rectangle with the largest area.
Exhibit 25 [tree-map of stock frequency by sector, with nested rectangles for
market-cap sub-groups; visible labels include Small (275) and Mid (35)]
Note that this example also depicts one more categorical variable (i.e., level of
market capitalization). The tree-map can represent data with additional dimensions
by displaying a set of nested rectangles. To show the joint frequencies of sub-groups
by sector and level of market capitalization, as given in Exhibit 14, we can split each
existing rectangle for sector into three sub-rectangles to represent small-cap, mid-cap,
and large-cap stocks, respectively. In this case, the area of each nested rectangle would
be proportional to the number of stocks in each market capitalization sub-group.
The exhibit clearly shows that small-cap health care is the sub-group with the largest
number of stocks. It is worth noting a caveat for using tree-maps: Tree-maps become
difficult to read if the hierarchy involves more than three levels.
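A tree-map like Exhibit 25 can be sketched with the third-party squarify package (not part of matplotlib itself); the health care, energy, and utilities counts are from the text, and the other two sector counts are illustrative:

```python
import matplotlib.pyplot as plt
import squarify  # third-party package: pip install squarify

# Marginal frequencies by sector; health care (435), energy (290), and
# utilities (55) are from the text, and the remaining two are illustrative.
sizes = [435, 290, 120, 100, 55]
labels = ["Health Care\n(435)", "Energy\n(290)", "Comm. Services\n(120)",
          "Cons. Staples\n(100)", "Utilities\n(55)"]

# Each rectangle's area is proportional to its sector's frequency.
squarify.plot(sizes=sizes, label=labels, pad=True)
plt.axis("off")
plt.title("Tree-Map of Stock Frequency by Sector")
plt.show()
```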
Exhibit 26 Excerpt of MDA Section in Form 10-Q of QXR Inc. for Quarter
Ended 31 March 20XX
MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL
CONDITION AND RESULTS OF OPERATIONS
Please read the following discussion and analysis of our financial condition and
results of operations together with our consolidated financial statements and
related notes included under Part I, Item 1 of this Quarterly Report on Form 10-Q
Executive Overview of Results
Below are our key financial results for the three months ended March 31, 20XX
(consolidated unless otherwise noted):
■■ Revenues of $36.3 billion and revenue growth of 17% year over year, con-
stant currency revenue growth of 19% year over year.
■■ Major segment revenues of $36.2 billion with revenue growth of 17% year
over year and other segments’ revenues of $170 million with revenue
growth of 13% year over year.
■■ Revenues from the United States, EMEA, APAC, and Other Americas
were $16.5 billion, $11.8 billion, $6.1 billion, and $1.9 billion, respectively.
■■ Cost of revenues was $16.0 billion, consisting of TAC of $6.9 billion and
other cost of revenues of $9.2 billion. Our TAC as a percentage of advertising
revenues was 22%.
■■ Operating expenses (excluding cost of revenues) were $13.7 billion,
including the EC AFS fine of $1.7 billion.
■■ Income from operations was $6.6 billion.
■■ Other income (expense), net, was $1.5 billion.
■■ Effective tax rate was 18%.
■■ Net income was $6.7 billion with diluted net income per share of $9.50.
■■ Operating cash flow was $12.0 billion.
■■ Capital expenditures were $4.6 billion.
Exhibit 28 Daily Closing Prices of ABC Inc.'s Stock and Its Sector Index
[line chart; x-axis: Day, 1 through 10; left y-axis: Price ($), about 50 to 58;
right y-axis: Sector Index, about 6,240 to 6,380]
A line chart is also capable of accommodating more than one set of data points,
which is especially helpful for making comparisons. We can add a line to represent
each group of data (e.g., a competitor’s stock price or a sector index), and each
line would have a distinct color or line pattern identified in a legend. For example,
Exhibit 28 also includes a plot of ABC’s sector index (i.e., the sector index for which
ABC stock is a member, like health care or energy) over the same period. The sector
index is displayed with its own distinct color to facilitate comparison. Note also that
because the sector index has a different range (approximately 6,230 to 6,390) than
ABCs’ stock ($50 to $59 per share), we need a secondary y-axis to correctly display
the sector index, which is on the right-hand side of the exhibit.
This comparison can help us understand whether ABC’s stock price movement
over the period is due to potential mispricing of its share issuance or instead due to
industry-specific factors that also affect its competitors’ stock prices. The comparison
shows that over the period, the sector index moved in a nearly opposite trend versus
ABC’s stock price movement. This indicates that the steep decline in ABC’s stock price
is less likely attributable to sector-specific factors and more likely due to potential
over-pricing of its IPO or to other company-specific factors.
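A dual-axis line chart of this kind can be sketched with matplotlib's twinx(); the ABC prices are the ten closing prices tabulated earlier, while the sector index levels are illustrative values within the range described:

```python
import matplotlib.pyplot as plt

days = range(1, 11)
# Daily closing prices of ABC Inc. from the one-dimensional array shown earlier.
abc = [57.21, 58.26, 58.64, 56.19, 54.78, 54.26, 56.88, 54.74, 52.42, 50.14]
# Sector index levels are illustrative values within the ~6,230-6,390 range
# described in the text.
sector = [6240, 6260, 6255, 6280, 6305, 6330, 6310, 6345, 6360, 6380]

fig, ax1 = plt.subplots()
ax1.plot(days, abc, color="tab:blue", label="ABC Price ($)")
ax1.set_xlabel("Day")
ax1.set_ylabel("Price ($)")
ax1.legend(loc="lower left")

# A secondary y-axis lets two series with very different ranges share a chart.
ax2 = ax1.twinx()
ax2.plot(days, sector, color="tab:orange", label="Sector Index")
ax2.set_ylabel("Sector Index")
ax2.legend(loc="upper right")
plt.show()
```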
When an observational unit (here, ABC Inc.) has more than two features (or
variables) of interest, it would be useful to show the multi-dimensional data all in
one chart to gain insights from a more holistic view. How can we add an additional
dimension to a two-dimensional line chart? We can replace the data points with
varying-sized bubbles, where the size of each bubble represents the value of a
third variable; the result is known as a bubble line chart.
[Exhibit 29: bubble line chart of ABC Inc.'s quarterly revenue ($ millions,
roughly 3,500 to 6,500) from Q1 Year 1 through Q4 Year 2, with each data point
drawn as a bubble sized and labeled by quarterly EPS (e.g., $1.78, $3.89, −$2.88,
−$3.38, −$3.98, −$8.66)]
As depicted, ABC’s earning were quite volatile during its initial two years as a
public company. Earnings started off as a profit of $1.37/share but finished the first
year with a big loss of −$8.66/share, during which time revenue experienced only small
fluctuations. Furthermore, while revenues and earnings both subsequently recovered
sharply—peaking in Q2 of Year 2—revenues then declined, and the company returned
to significant losses (−3.98/share) by the end of Year 2.
Suppose an analyst computes the returns of the IT and utilities sector indexes and
of the S&P 500 Index over the five years under investigation and plots the data
points in scatter plots, shown in Exhibit 30 for IT versus the S&P 500 returns and
in Exhibit 31 for utilities versus the S&P 500 returns.
Despite their relatively straightforward construction, scatter plots convey lots of
valuable information. First, it is important to inspect for any potential association
between the two variables. The pattern of the scatter plot may indicate no apparent
relationship, a linear association, or a non-linear relationship. A scatter plot with
randomly distributed data points would indicate no clear association between the
two variables. However, if the data points seem to align along a straight line, then
there may exist a significant relationship between the variables. A positive (negative)
slope for the line of data points indicates a positive (negative) association, meaning
the variables move in the same (opposite) direction. Furthermore, the strength of the
association can be determined by how closely the data points are clustered around
the line. Tight (loose) clustering signals a potentially stronger (weaker) relationship.
Exhibit 30 Scatter Plot of Information Technology Sector Index Return vs. S&P 500
Index Return
[scatter plot; both axes in percent, roughly −7.5 to 10.0]

Exhibit 31 Scatter Plot of Utilities Sector Index Return vs. S&P 500 Index Return
[scatter plot; both axes in percent]
Examining Exhibit 30, we can see the returns of the IT sector are highly positively
associated with S&P 500 Index returns because the data points are tightly clustered
along a positively sloped line. Exhibit 31 tells a different story for relative performance
of the utilities sector and S&P 500 index returns: The data points appear to be distrib-
uted in no discernible pattern, indicating no clear relationship between these variables.
Second, observing the data points located toward the ends of each axis, which represent
the maximum or minimum values, provides a quick sense of the data range. Third,
assuming that a relationship among the variables is apparent, inspecting the scatter
plot can help to spot extreme values (i.e., outliers). For example, an outlier data point
is readily detected in Exhibit 30, as indicated by the arrow. As you will learn later in
the CFA Program curriculum, finding these extreme values and handling them with
appropriate measures is an important part of the financial modeling process.
Scatter plots are a powerful tool for finding patterns between two variables, for
assessing data range, and for spotting extreme values. In practice, however, there
are situations where we need to inspect for pairwise associations among many vari-
ables—for example, when conducting feature selection from dozens of variables to
build a predictive model.
A scatter plot matrix is a useful tool for organizing scatter plots between pairs
of variables, making it easy to inspect all pairwise relationships in one combined
visual. For example, suppose the analyst would like to extend his or her investigation
by adding another sector index. He or she can use a scatter plot matrix, as shown in
Exhibit 32, which now incorporates four variables, including index returns for the
S&P 500 and for three sectors: IT, utilities, and financials.
The scatter plot matrix contains each combination of bivariate scatter plot (i.e.,
S&P 500 vs. each sector, IT vs. utilities, IT vs. financials, and financials vs. utilities) as
well as univariate frequency distribution histograms for each variable plotted along
the diagonal. In this way, the scatter plot matrix provides a concise visual summary of
each variable and of potential relationships among them. Importantly, the construction
of the scatter plot matrix is typically a built-in function in most major statistical soft-
ware packages, so it is relatively easy to implement. It is worth pointing out that the
upper triangle of the matrix is the mirror image of the lower triangle, so the compact
form of the scatter plot matrix that uses only the lower triangle is also appropriate.
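A scatter plot matrix can be produced with pandas' built-in scatter_matrix function, as sketched below; simulated returns stand in for the four index series of Exhibit 32:

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Simulated monthly index returns standing in for the four series in
# Exhibit 32 (S&P 500 and the IT, utilities, and financials sectors).
rng = np.random.default_rng(0)
sp500 = rng.normal(1, 4, 60)
returns = pd.DataFrame({
    "S&P 500": sp500,
    "IT": 1.2 * sp500 + rng.normal(0, 1.5, 60),          # strongly related
    "Utilities": rng.normal(0.8, 3, 60),                 # unrelated
    "Financials": 1.0 * sp500 + rng.normal(0, 2.5, 60),  # positively related
})

# Pairwise scatter plots with each variable's histogram on the diagonal.
scatter_matrix(returns, diagonal="hist", figsize=(8, 8))
plt.show()
```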
With the addition of the financial sector, the bottom panel of Exhibit 32 reveals the
following additional information, which can support sector allocation in the portfolio
construction process:
■■ Strong positive relationship between the returns of financials and the S&P 500;
Sector                     Small    Mid    Large
Communication Services     21       43     83
Consumer Staples           36       81     45
Energy                     99       95     29
Health Care                4        8      18
Utilities                  81       37     58
[heat map; cells shaded by frequency, with a color scale from 20 to 80 at the right]
Cells in the chart are color-coded to differentiate high values from low values by
using the color scheme defined in the color spectrum on the right side of the chart.
As shown by the heat map, this portfolio has the largest exposure (in terms of num-
ber of stocks) to small- and mid-cap energy stocks. It has substantial exposures to
large-cap communication services, mid-cap consumer staples, and small-cap utilities;
however, exposure to the health care sector is limited. In sum, the heat map reveals
this portfolio to be relatively well-diversified among sectors and market-cap levels.
Besides their use in displaying frequency distributions, heat maps are commonly used
for visualizing the degree of correlation among different variables.
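A heat map of the joint frequencies above can be sketched with matplotlib's imshow; treating the three columns as small-, mid-, and large-cap levels follows the discussion in the text:

```python
import matplotlib.pyplot as plt
import numpy as np

# Joint frequencies from the exhibit above (rows: sector; columns: small,
# mid, and large market cap).
data = np.array([[21, 43, 83],
                 [36, 81, 45],
                 [99, 95, 29],
                 [4, 8, 18],
                 [81, 37, 58]])
sectors = ["Comm. Services", "Cons. Staples", "Energy", "Health Care", "Utilities"]

fig, ax = plt.subplots()
im = ax.imshow(data, cmap="Blues")          # cells shaded by frequency
ax.set_xticks(range(3))
ax.set_xticklabels(["Small", "Mid", "Large"])
ax.set_yticks(range(5))
ax.set_yticklabels(sectors)

# Annotate each cell with its count and add the color scale at the right.
for i in range(5):
    for j in range(3):
        ax.text(j, i, data[i, j], ha="center", va="center")
fig.colorbar(im)
plt.show()
```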
EXAMPLE 6
Solution to 1
The slope of the graph of a cumulative absolute frequency distribution reflects
the change in the number of observations between two adjacent return bins. A
steep (flat) slope indicates a large (small) change in the frequency of observations
between adjacent return bins.
Solution to 2
Color can add an additional dimension to the information conveyed in the word
cloud. For example, red can be used for “losses” and other words conveying neg-
ative sentiment, and green can be used for “profit” and other words indicative
of positive sentiment.
Solution to 3
Besides the sign and degree of association of the stocks’ returns, the scatter
plot can provide a visual representation of whether the association is linear or
non-linear, the maximum and minimum values for the return observations, and
an indication of which observations may have extreme values (i.e., are potential
outliers).
Solution to 4
Typically, the heights of bars in a vertical bar chart are proportional to the values
that they represent. However, if the graph is using a truncated y-axis (i.e., one
that does not start at zero), then values are not accurately represented by the
height of bars. Therefore, we need to examine the y-axis of the bar chart before
concluding that sales in the fifth year were triple the sales of the prior years.
In selecting a chart, the first question is: What do we want to explore or present?

■■ Distribution:
   • Numerical data: histogram, frequency polygon, cumulative distribution chart
   • Categorical data: bar chart, tree-map, heat map
■■ Relationship: scatter plot (two variables), scatter plot matrix (multiple
variables), heat map (multiple variables)
■■ Comparison:
   • Among categories: bar chart, tree-map, heat map
   • Over time: line chart (two variables), bubble line chart (three variables)
Data visualization is a powerful tool for showing data and gaining insights into data.
However, we need to be cautious: A graph can be misleading if the data are
misrepresented or the graph is poorly constructed. There are numerous ways to
create a misleading graph. We list four typical pitfalls here that analysts should avoid.
First, an improper chart type is selected to present data, which would hinder the
accurate interpretation of data. For example, to investigate the correlation between
two data series, we can construct a scatter plot to visualize the joint variation between
two variables. In contrast, plotting the two data series separately in a line chart would
make it rather difficult to examine the relationship.
Second, data are selectively plotted in favor of the conclusion an analyst intends
to draw. For example, data presented for an overly short time period may appear to
show a trend that is actually noise—that is, variation within the data’s normal range
if examining the data over a longer time period. So, presenting data for too short a
time window may mistakenly point to a non-existent trend.
Third, data are improperly plotted in a truncated graph that has a y-axis that
does not start at zero. In some situations, the truncated graph can create the false
impression of significant differences when there is actually only a small difference.
For example, suppose a vertical bar chart is used to compare annual revenues of two
companies, one with $9 billion and the other with $10 billion. If the y-axis starts at
$8 billion, then the bar heights would inaccurately imply that the latter company’s
revenue is twice the former company’s revenue.
Last, but not least, is the improper scaling of axes. For example, given a line chart,
setting a higher than necessary maximum on the y-axis tends to compress the graph
into an area close to the x-axis. This causes the graph to appear less steep and
less volatile than if it were properly plotted. In sum, analysts need to avoid these
misuses of visualization when charting data and must ensure the ethical use of data visuals.
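The truncated-axis pitfall is easy to demonstrate; the matplotlib sketch below plots the $9 billion and $10 billion revenues from the example with and without a truncated y-axis:

```python
import matplotlib.pyplot as plt

revenues = [9, 10]                  # annual revenues (USD billions)
labels = ["Company A", "Company B"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3.5))

# Truncated y-axis: visible bar heights are 1 vs. 2, falsely suggesting
# that Company B's revenue is double Company A's.
ax1.bar(labels, revenues)
ax1.set_ylim(8, 10.5)
ax1.set_title("Misleading: y-axis starts at 8")

# Proper y-axis starting at zero: bar heights are proportional to values.
ax2.bar(labels, revenues)
ax2.set_ylim(0, 10.5)
ax2.set_title("Accurate: y-axis starts at 0")
plt.show()
```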
EXAMPLE 7
Solution to 1
The five-year history of daily trading volumes contains a large amount of
numerical data. Therefore, a histogram is the best chart for grouping these data
into frequency distribution bins and for showing a quick snapshot of the shape,
center, and spread of the data’s distribution.
Solution to 2
To inspect for a potential relationship between two variables, a scatter plot is
a good choice. But with 10 variables, plotting individual scatter plots is not an
efficient approach. Instead, utilizing a scatter plot matrix would give the analyst
a good overview in one comprehensive visual of all the pairwise associations
between the variables.
Solution to 3
Since the meeting minutes consist of textual data, a word cloud would be the
most suitable tool to visualize the textual data and facilitate the researcher’s
understanding of the topic of the text as well as the sentiment, positive or neg-
ative, it may convey.
Solution to 4
The best chart for making this comparison would be a bubble line chart using
two different color lines to represent the quarterly revenues for each company.
The bubble sizes would then indicate the magnitude of each company’s quarterly
earnings, with green bubbles signifying profits and red bubbles indicating losses.

7 MEASURES OF CENTRAL TENDENCY
Sample Mean Formula. The sample mean or average, $\bar{X}$ (read “X-bar”), is the
arithmetic mean value of a sample:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{2}$$

where n is the number of observations in the sample.
Equation 2 tells us to sum the values of the observations (Xi) and divide the sum
by the number of observations. For example, if a sample of market capitalizations for
six publicly traded Australian companies contains the values (in AUD billions) 35, 30,
22, 18, 15, and 12, the sample mean market cap is 132/6 = A$22 billion. As previously
noted, the sample mean is a statistic (that is, a descriptive measure of a sample).
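The calculation is trivial to verify in Python:

```python
# Market capitalizations (AUD billions) for the six Australian companies.
caps = [35, 30, 22, 18, 15, 12]

# Equation 2: sum the observations and divide by the number of observations.
x_bar = sum(caps) / len(caps)
print(x_bar)   # 22.0, i.e., A$22 billion
```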
Means can be computed for individual units or over time. For instance, the sample
might be the return on equity (ROE) in a given year for a sample of 25 companies in
the FTSE Eurotop 100, an index of Europe’s 100 largest companies. In this case, we
calculate the mean ROE in that year as an average across 25 individual units. When
we examine the characteristics of some units at a specific point in time (such as ROE
for the FTSE Eurotop 100), we are examining cross-sectional data; the mean of these
observations is the cross-sectional mean. If the sample consists of the historical
monthly returns on the FTSE Eurotop 100 for the past five years, however, then we
have time-series data; the mean of these observations is the time-series mean. We
will examine specialized statistical methods related to the behavior of time series in
the reading on time-series analysis.
Except in cases of large datasets with many observations, we should not expect any
of the actual observations to equal the mean; sample means provide only a summary
of the data being analyzed. Also, although in some cases the number of values below
the mean is quite close to the number of values above the mean, this need not be
the case. As an analyst, you will often need to find a few numbers that describe the
characteristics of the distribution, and we will consider several more such measures
later. The mean is generally the statistic that you use as a measure of the typical
outcome for a distribution.
You can then use the mean to compare the performance of two different markets.
For example, you might be interested in comparing the stock market performance of
investments in Asia Pacific with investments in Europe. You can use the mean returns
in these markets to compare investment results.
EXAMPLE 8
Exhibit 35 [dot plot of the returns for Countries A through K arrayed along a
number line from 1 to 12, with a fulcrum at the arithmetic mean illustrating that
the mean is the point on which the distribution balances]
As analysts, we often use the mean return as a measure of the typical outcome for
an asset. As in Example 8, however, some outcomes are above the mean and some are
below it. We can calculate the distance between the mean and each outcome, which is
the deviation. Mathematically, it is always true that the sum of the deviations around
the mean equals 0. We can see this by using the definition of the arithmetic mean
shown in Equation 2, multiplying both sides of the equation by n:
$n\bar{X} = \sum_{i=1}^{n} X_i$. The sum of the deviations from the mean is
calculated as follows:

$$\sum_{i=1}^{n}\left(X_i - \bar{X}\right) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\bar{X} = \sum_{i=1}^{n} X_i - n\bar{X} = 0$$
Deviations from the arithmetic mean are important information because they
indicate risk. The concept of deviations around the mean forms the foundation for the
more complex concepts of variance, skewness, and kurtosis, which we will discuss later.
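This zero-sum property is easy to verify numerically; the sketch below reuses the six market caps from the sample-mean example:

```python
import numpy as np

# The six market caps from the earlier sample-mean example.
x = np.array([35, 30, 22, 18, 15, 12])
deviations = x - x.mean()

# The deviations around the mean always sum to zero (up to floating-point
# rounding), so dispersion measures must square them or take absolute values.
print(deviations, deviations.sum())
```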
A property and potential drawback of the arithmetic mean is its sensitivity to
extreme values, or outliers. Because all observations are used to compute the mean and
are given equal weight (i.e., importance), the arithmetic mean can be pulled sharply
upward or downward by extremely large or small observations, respectively. For
example, suppose we compute the arithmetic mean of the following seven numbers:
1, 2, 3, 4, 5, 6, and 1,000. The mean is 1,021/7 = 145.86, or approximately 146. Because
the magnitude of the mean, 146, is so much larger than most of the observations (the
first six), we might question how well it represents the location of the data. Perhaps
the most common approach in such cases is to report the median, or middle value,
in place of or in addition to the mean.
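A quick check in Python confirms both figures:

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5, 6, 1000]

print(round(mean(data), 2))   # 145.86: pulled far upward by the outlier
print(median(data))           # 4: the middle value better locates the data
```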
7.1.3 Outliers
In practice, although an extreme value or outlier in a financial dataset may just repre-
sent a rare value in the population, it may also reflect an error in recording the value
of an observation or an observation generated from a different population from that
producing the other observations in the sample. In the latter two cases, in particu-
lar, the arithmetic mean could be misleading. So, what do we do? The first step is to
examine the data, either by inspecting the sample observations if the sample is not
too large or by using visualization approaches. Once we are comfortable that we have
identified and eliminated errors (that is, we have “cleaned” the data), we can then
address what to do with extreme values in the sample. When dealing with a sample
that has extreme values, there may be a possibility of transforming the variable (e.g.,
a log transformation) or of selecting another variable that achieves the same purpose.
However, if alternative model specifications or variable transformations are not pos-
sible, then here are three options for dealing with extreme values:
Option 1 Do nothing; use the data without any adjustment.
Option 2 Delete all the outliers.
Option 3 Replace the outliers with another value.
The first option is appropriate if the values are legitimate, correct observations, and
it is important to reflect the whole of the sample distribution. Outliers may contain
meaningful information, so excluding or altering these values may reduce valuable
information. Further, because identifying a data point as extreme leaves it up to the
judgment of the analyst, leaving in all observations eliminates that need to judge a
value as extreme.
The second option excludes the extreme observations. One measure of central
tendency in this case is the trimmed mean, which is computed by excluding a stated
small percentage of the lowest and highest values and then computing an arithmetic
mean of the remaining values. For example, a 5% trimmed mean discards the lowest
2.5% and the highest 2.5% of values and computes the mean of the remaining 95%
of values. A trimmed mean is used in sports competitions when judges’ lowest and
highest scores are discarded in computing a contestant’s score.
The third option involves substituting values for the extreme values. A measure
of central tendency in this case is the winsorized mean. It is calculated by assigning
a stated percentage of the lowest values equal to one specified low value and a stated
percentage of the highest values equal to one specified high value, and then it com-
putes a mean from the restated data. For example, a 95% winsorized mean sets the
bottom 2.5% of values equal to the value at or below which 2.5% of all the values lie
(as will be seen shortly, this is called the “2.5th percentile” value) and the top 2.5%
of values equal to the value at or below which 97.5% of all the values lie (the “97.5th
percentile” value).
In Exhibit 37, we show the differences among these options for handling outliers
using daily returns for the fictitious Euro-Asia-Africa (EAA) Equity Index in Exhibit 11.
The trimmed mean eliminates the lowest 2.5% of returns, which in this
sample is any daily return less than −1.934%, and it eliminates the highest 2.5%,
which in this sample is any daily return greater than 1.671%. The result of this
trimming is that the mean is calculated using 1,194 observations instead of the
original sample’s 1,258 observations.
The winsorized mean substitutes −1.934% for any return below −1.934% and
substitutes 1.671% for any return above 1.671%. The result in this case is that both the
trimmed and winsorized means are above the arithmetic mean.
Country   Return (%)   Rank
B           −1.5          1
G           −1.2          2
F           −1.0          3
J           −0.9          4
K            1.2          5
E            3.0          6 ← Median
I            3.2          7
H            3.4          8
C            3.5          9
A            6.1         10
D            6.2         11
[Exhibit 38: Returns for the 11 country indexes sorted from lowest (Country B) to highest (Country D) and plotted as horizontal bars, with the median return (Country E) labeled. Horizontal axis: Return (%), from −2 to 8.]
If a sample has an even number of observations, the median is the mean of the two
values in the middle. For example, if our sample in Exhibit 38 had 12 indexes instead
of 11, the median would be the mean of the values in the sorted array that occupy
the sixth and the seventh positions.
[Histogram of EAA Equity Index daily returns by daily return range (%), with bins running from −4.1% to 5.0%; the modal bin contains 493 of the 1,258 observations.]
The mode is the only measure of central tendency that can be used with nominal
data. For example, when we categorize investment funds into different styles and
assign a number to each style, the mode of these categorized data is the most frequent
investment fund style.
Weighted Mean Formula. The weighted mean $\bar{X}_w$ (read “X-bar sub-w”), for a
set of observations X1, X2, …, Xn with corresponding weights of w1, w2, …, wn,
is computed as:

$$\bar{X}_w = \sum_{i=1}^{n} w_i X_i \qquad (3)$$

where the sum of the weights equals 1; that is, $\sum_{i=1}^{n} w_i = 1$. When all the weights are
equal to 1/n, this expression reduces to $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, which
is the formula for the arithmetic mean. Therefore, the arithmetic mean is a special
case of the weighted mean in which all the weights are equal.
EXAMPLE 9
Using the information provided, calculate the returns on the portfolio for
each year.
Solution
Converting the percentage asset allocation to decimal form, we find the mean
return as the weighted average of the funds’ returns. We have:
Mean portfolio return for Year 1 = 0.25 (5.3) + 0.45 (12.7) + 0.30 (11.5)
= 10.49%
Mean portfolio return for Year 2 = 0.25 (1.2) + 0.45 (6.7) + 0.30 (3.4)
= 4.34%
Mean portfolio return for Year 3 = 0.25 (3.5) + 0.45 (−1.2) + 0.30 (1.2)
= 0.70%
This example illustrates the general principle that a portfolio return is a weighted
sum. Specifically, a portfolio’s return is the weighted average of the returns on the
assets in the portfolio; the weight applied to each asset’s return is the fraction of the
portfolio invested in that asset.
When we have computed $\ln \bar{X}_G$, then $\bar{X}_G = e^{\ln \bar{X}_G}$ (on most calculators, the key for
this step is $e^x$).
Risky assets can have negative returns up to −100% (if their price falls to zero), so
we must take some care in defining the relevant variables to average in computing a
geometric mean. We cannot just use the product of the returns for the sample and
then take the nth root because the returns for any period could be negative. We must
recast the returns to make them positive. We do this by adding 1.0 to the returns
expressed as decimals, where Rt represents the return in period t. The term (1 + Rt)
represents the year-ending value relative to an initial unit of investment at the begin-
ning of the year. As long as we use (1 + Rt), the observations will never be negative
because the biggest negative return is −100%. The result is the geometric mean of
1 + Rt; by then subtracting 1.0 from this result, we obtain the geometric mean of the
individual returns Rt.
An equation that summarizes the calculation of the geometric mean return, RG, is
a slightly modified version of Equation 4 in which Xi represents “1 + return in decimal
form.” Because geometric mean returns use time series, we use a subscript t indexing
time as well. We calculate one plus the geometric mean return as:
$$1 + \bar{R}_G = \sqrt[T]{\left(1 + R_1\right)\left(1 + R_2\right)\cdots\left(1 + R_T\right)}$$
We can represent this more compactly as:
$$1 + \bar{R}_G = \left[\prod_{t=1}^{T}\left(1 + R_t\right)\right]^{1/T}$$
where the capital Greek letter ‘pi,’ Π, denotes the arithmetical operation of mul-
tiplication of the T terms. Once we subtract one, this becomes the formula for the
geometric mean return.
For example, the returns on Country B’s index are given in Exhibit 35 as 7.8%, 6.3%,
and −1.5%. Putting the returns into decimal form and adding 1.0 produces 1.078,
1.063, and 0.985. Using Equation 4, we have $\sqrt[3]{1.078 \times 1.063 \times 0.985} = \sqrt[3]{1.128725} =
1.041189$. This number is 1 plus the geometric mean rate of return. Subtracting 1.0
from this result, we have 1.041189 − 1.0 = 0.041189, or approximately 4.12%. This is
lower than the arithmetic mean for Country B’s index of 4.2%.
Geometric Mean Return Formula. Given a time series of holding period
returns Rt, t = 1, 2, …, T, the geometric mean return over the time period
spanned by the returns R1 through RT is:
$$\bar{R}_G = \left[\prod_{t=1}^{T}\left(1 + R_t\right)\right]^{1/T} - 1 \qquad (5)$$
We can use Equation 5 to solve for the geometric mean return for any return data
series. Geometric mean returns are also referred to as compound returns. If the returns
being averaged in Equation 5 have a monthly frequency, for example, we may call the
geometric mean monthly return the compound monthly return. The next example
illustrates the computation of the geometric mean while contrasting the geometric
and arithmetic means.
EXAMPLE 10
In Example 10, the geometric mean return is less than the arithmetic mean return
for each country’s index returns. In fact, the geometric mean is always less than or
equal to the arithmetic mean. The only time that the two means will be equal is when
there is no variability in the observations—that is, when all the observations in the
series are the same.
In general, the difference between the arithmetic and geometric means increases
with the variability within the sample; the more disperse the observations, the greater
the difference between the arithmetic and geometric means. Casual inspection of the
returns in Exhibit 35 and the associated graph of means suggests a greater variability for
Country A’s index relative to the other indexes, and this is confirmed with the greater
deviation of the geometric mean return (−5.38%) from the arithmetic mean return
(−4.97%), as we show in Exhibit 40. How should the analyst interpret these results?
[Exhibit 40: Geometric and arithmetic mean returns for Country Indexes A through K, plotted side by side. Horizontal axis: Mean Return (%), from −6 to 8.]
The geometric mean return represents the growth rate or compound rate of return
on an investment. One unit of currency invested in a fund tracking the Country B
index at the beginning of Year 1 would have grown to (1.078)(1.063)(0.985) = 1.128725
units of currency, which is equal to 1 plus the geometric mean return compounded
over three periods: $(1 + 0.041189)^3 = 1.128725$, confirming that the geometric mean
is the compound rate of return. With its focus on the profitability of an investment
over a multi-period horizon, the geometric mean is of key interest to investors. The
arithmetic mean return, focusing on average single-period performance, is also of
interest. Both arithmetic and geometric means have a role to play in investment
management, and both are often reported for return series.
For reporting historical returns, the geometric mean has considerable appeal
because it is the rate of growth or return we would have to earn each year to match
the actual, cumulative investment performance. Suppose we purchased a stock for
€100 and two years later it was worth €100, with an intervening year at €200. The
geometric mean of 0% is clearly the compound rate of growth during the two years,
which we can confirm by compounding the returns: $[(1 + 1.00)(1 - 0.50)]^{1/2} - 1 =
0\%$. Specifically, the ending amount is the beginning amount times $(1 + \bar{R}_G)^2$. The
geometric mean is an excellent measure of past performance.
The arithmetic mean, which is [100% + −50%]/2 = 25% in the above example,
can distort our assessment of historical performance. As we noted previously, the
arithmetic mean is always greater than or equal to the geometric mean. If we want to
estimate the average return over a one-period horizon, we should use the arithmetic
mean because the arithmetic mean is the average of one-period returns. If we want
to estimate the average returns over more than one period, however, we should use
the geometric mean of returns because the geometric mean captures how the total
returns are linked over time. In a forward-looking context, a financial analyst calcu-
lating expected risk premiums may find that the weighted mean is appropriate, with
the probabilities of the possible outcomes used as the weights.
Dispersion in cash flows or returns causes the arithmetic mean to be larger than
the geometric mean. The more dispersion in the sample of returns, the more diver-
gence exists between the arithmetic and geometric means. If there is zero variance in
a sample of observations, the geometric and arithmetic return are equal.
Harmonic Mean Formula. The harmonic mean, $\bar{X}_H$, of a set of observations
X1, X2, …, Xn is:

$$\bar{X}_H = \frac{n}{\sum_{i=1}^{n}\left(1/X_i\right)}, \quad \text{with } X_i > 0 \text{ for } i = 1, 2, \ldots, n. \qquad (6)$$

The harmonic mean is the value obtained by summing the reciprocals of the observa-
tions—terms of the form 1/Xi—then averaging that sum by dividing it by the number
of observations n, and, finally, taking the reciprocal of the average.
The harmonic mean may be viewed as a special type of weighted mean in which an
observation’s weight is inversely proportional to its magnitude. For example, if there
is a sample of observations of 1, 2, 3, 4, 5, 6, and 1,000, the harmonic mean is 2.8560.
Compared to the arithmetic mean of 145.8571, we see the influence of the outlier (the
1,000) to be much less than in the case of the arithmetic mean. So, the harmonic mean
is quite useful as a measure of central tendency in the presence of outliers.
The harmonic mean is used most often when the data consist of rates and ratios,
such as P/Es. Suppose three peer companies have P/Es of 45, 15, and 15. The arithmetic
mean is 25, but the harmonic mean, which gives less weight to the P/E of 45, is 19.3.
EXAMPLE 11
Comparing the three types of means, we see the arithmetic mean is higher
than the geometric mean return, and the geometric mean return is higher than
the harmonic mean return. We can see the differences in these means in the
following graph:
Harmonic, Geometric, and Arithmetic Means of Selected Country
Indexes
Country
0.060
D 0.146
0.233
−1.428
E −1.381
−1.333
3.113
F 3.157
3.200
–2 –1 0 1 2 3 4
Mean Return (%)
Harmonic Geometric Mean Arithmetic
The harmonic mean is a relatively specialized concept of the mean that is appro-
priate for averaging ratios (“amount per unit”) when the ratios are repeatedly applied
to a fixed quantity to yield a variable number of units. The concept is best explained
through an illustration. A well-known application arises in the investment strategy
known as cost averaging, which involves the periodic investment of a fixed amount
of money. In this application, the ratios we are averaging are prices per share at
different purchase dates, and we are applying those prices to a constant amount of
money to yield a variable number of shares. An illustration of the harmonic mean to
cost averaging is provided in Example 12.
EXAMPLE 12
EXAMPLE 13
Stock 1 22.29
Stock 2 15.54
Stock 3 9.38
Stock 4 15.12
Stock 5 10.72
Stock 6 14.57
Stock 7 7.20
Stock 8 7.97
Stock 9 10.34
Stock 10 8.35
Solution
[Flowchart for selecting a measure of central tendency: Collect sample → Include all values, including outliers? If yes → Compounding? If yes, use the geometric mean; otherwise, use the arithmetic mean.]
8 QUANTILES
When dealing with actual data, we often find that we need to approximate the
value of a percentile. For example, if we are interested in the value of the 75th percen-
tile, we may find that no observation divides the sample such that exactly 75% of the
observations lie at or below that value. The following procedure, however, can help us
determine or estimate a percentile. The procedure involves first locating the position
of the percentile within the set of observations and then determining (or estimating)
the value associated with that position.
Let $P_y$ be the value at or below which y percent of the distribution lies, or the yth per-
centile. (For example, $P_{18}$ is the point at or below which 18% of the observations
lie; this implies that 100 − 18 = 82% of the observations are greater than $P_{18}$.) The
formula for the position (or location) of a percentile in an array with n entries sorted
in ascending order is:

$$L_y = (n + 1)\frac{y}{100} \qquad (7)$$

where y is the percentage point at which we are dividing the distribution, and $L_y$ is the
location of the percentile $P_y$ in the array sorted in ascending order. The value of
$L_y$ may or may not be a whole number. In general, as the sample size increases, the
percentile location calculation becomes more accurate; in small samples it may be
quite approximate.
To summarize:
■■ When the location, $L_y$, is a whole number, the location corresponds to an actual
observation. For example, if we are determining the third quartile (Q3) in a
sample of size n = 11, then $L_y$ would be $L_{75}$ = (11 + 1)(75/100) = 9, and the third
quartile would be $P_{75} = X_9$, where $X_i$ is defined as the value of the observation
in the ith (i = $L_{75}$, so 9th) position of the data sorted in ascending order.
■■ When $L_y$ is not a whole number or integer, $L_y$ lies between the two closest
integer numbers (one above and one below), and we use linear interpolation
between those two places to determine $P_y$. Interpolation means estimating an
unknown value on the basis of two known values that surround it (i.e., lie above
and below it); the term “linear” refers to a straight-line estimate.
Example 14 illustrates the calculation of various quantiles for the daily return on the
EAA Equity Index.
EXAMPLE 14
Bin   Cumulative Percentage (%)   Lower Bound (%)   Upper Bound (%)   Number of Observations
 1            5                      −4.108            −1.416               63
 2           10                      −1.416            −0.876               63
 3           15                      −0.876            −0.629               63
 4           20                      −0.629            −0.432               63
 5           25                      −0.432            −0.293               63
 6           30                      −0.293            −0.193               63
 7           35                      −0.193            −0.124               62
 8           40                      −0.124            −0.070               63
 9           45                      −0.070            −0.007               63
10           50                      −0.007             0.044               63
11           55                       0.044             0.108               63
12           60                       0.108             0.173               63
13           65                       0.173             0.247               63
14           70                       0.247             0.343               62
15           75                       0.343             0.460               63
16           80                       0.460             0.575               63
17           85                       0.575             0.738               63
18           90                       0.738             0.991               63
19           95                       0.991             1.304               63
20          100                       1.304             5.001               63
Note that because of the continuous nature of returns, it is not likely for a
return to fall on the boundary for any bin other than the minimum (Bin = 1)
and maximum (Bin = 20).
1 Identify the 10th and 90th percentiles.
2 Identify the first, second, and third quintiles.
3 Identify the first and third quartiles.
4 Identify the median.
5 Calculate the interquartile range.
Solution to 1
The 10th and 90th percentiles correspond to the bins or ranked returns that
include 10% and 90% of the daily returns, respectively. The 10th percentile
corresponds to the return of −0.876% (and includes returns of that much and
lower), and the 90th percentile corresponds to the return of 0.991% (and lower).
Solution to 2
The first quintile corresponds to the lowest 20% of the ranked data, or −0.432%
(and lower).
The second quintile corresponds to the lowest 40% of the ranked data, or
−0.070% (and lower).
The third quintile corresponds to the lowest 60% of the ranked data, or
0.173% (and lower).
Solution to 3
The first quartile corresponds to the lowest 25% of the ranked data, or −0.293%
(and lower).
The third quartile corresponds to the lowest 75% of the ranked data, or
0.460% (and lower).
Solution to 4
The median is the return for which 50% of the data lies on either side, which is
0.044%, the highest daily return in the 10th bin out of 20.
Solution to 5
The interquartile range is the difference between the third and first quartiles,
0.460% and −0.293%, or 0.753%.
One way to visualize the dispersion of data across quartiles is to use a diagram,
such as a box and whisker chart. A box and whisker plot consists of a “box” with
“whiskers” connected to the box, as shown in Exhibit 44. The “box” represents the
lower bound of the second quartile and the upper bound of the third quartile, with
the median or arithmetic average noted as a measure of central tendency of the entire
distribution. The whiskers are the lines that run from the box and are bounded by the
“fences,” which represent the lowest and highest values of the distribution.
[Exhibit 44: A box and whisker plot. The box spans the interquartile range, with the median marked by a line and the arithmetic average by an ×; the whiskers extend to the fences.]
There are several variations for box and whisker displays. For example, for ease
in detecting potential outliers, the fences of the whiskers may be a function of the
interquartile range instead of the highest and lowest values like that in Exhibit 44.
In Exhibit 44, visually, the interquartile range is the height of the box and the
fences are set at extremes. But another form of box and whisker plot typically uses
1.5 times the interquartile range for the fences. Thus, the upper fence is 1.5 times the
interquartile range added to the upper bound of Q3, and the lower fence is 1.5 times
the interquartile range subtracted from the lower bound of Q2. Observations beyond
the fences (i.e., outliers) may also be displayed.
We can see the role of outliers in such a box and whisker plot using the EAA
Equity Index daily returns, as shown in Exhibit 45. Referring back to Exhibit 43
(Example 13), we know:
■■ The maximum and minimum values of the distribution are 5.001 and −4.108,
respectively, while the median (50th percentile) value is 0.044.
Exhibit 45 Box and Whisker Chart for EAA Equity Index Daily Returns
[The plot of daily returns (%) shows: a maximum of 5.001%; an upper fence at 1.589%; a Q3 upper bound of 0.460%; a median of 0.044%; a Q2 lower bound of −0.293%; a lower fence at −1.422%; and a minimum of −4.108%.]
EXAMPLE 15
Quantiles
Consider the results of an analysis focusing on the market capitalizations of a
sample of 100 firms:
Bin   Cumulative Percentage (%)   Lower Bound   Upper Bound   Number of Observations
 1            5                      0.28          15.45              5
 2           10                     15.45          21.22              5
 3           15                     21.22          29.37              5
 4           20                     29.37          32.57              5
 5           25                     32.57          34.72              5
 6           30                     34.72          37.58              5
 7           35                     37.58          39.90              5
 8           40                     39.90          41.57              5
 9           45                     41.57          44.86              5
10           50                     44.86          46.88              5
11           55                     46.88          49.40              5
12           60                     49.40          51.27              5
13           65                     51.27          53.58              5
14           70                     53.58          56.66              5
Solution to 1
B is correct because the tenth percentile corresponds to the lowest 10% of the
observations in the sample, which are in bins 1 and 2.
Solution to 2
B is correct because the second quintile corresponds to the second 20% of
observations. The first 20% consists of bins 1 through 4. The second 20% of
observations consists of bins 5 through 8.
Solution to 3
C is correct because a quartile consists of 25% of the data, and the last 25% of
the 20 bins are 16 through 20.
Solution to 4
B is correct because this is the center of the 20 bins. The market capitalization
of 46.88 is the highest value of the 10th bin and the lowest value of the 11th bin.
Solution to 5
B is correct because the interquartile range is the difference between the lowest
value in the second quartile and the highest value in the third quartile. The lowest
value of the second quartile is 34.72, and the highest value of the third quartile
is 58.34. Therefore, the interquartile range is 58.34 − 34.72 = 23.62.
9 MEASURES OF DISPERSION
Definition of Range. The range is the difference between the maximum and
minimum values in a dataset:

Range = Maximum value − Minimum value. (8)

Mean Absolute Deviation Formula. The mean absolute deviation (MAD) for a sample is:

$$\text{MAD} = \frac{\sum_{i=1}^{n}\left|X_i - \bar{X}\right|}{n} \qquad (9)$$
EXAMPLE 16
For Year 3, for example, the sum of the absolute deviations from the arithmetic
mean ($\bar{X}$ = 2.0) is 26.8. We divide this by 11, with the resulting MAD of 2.44.
Sample Variance Formula. The sample variance, $s^2$, is:

$$s^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1} \qquad (10)$$

where $\bar{X}$ is the sample mean and n is the number of observations in the sample.
Given knowledge of the sample mean, we can use Equation 10 to calculate the sum
of the squared differences from the mean, taking account of all n items in the sample,
and then to find the mean squared difference by dividing the sum by n − 1. Whether
a difference from the mean is positive or negative, squaring that difference results in
a positive number. Thus, variance takes care of the problem of negative deviations
from the mean canceling out positive deviations by the operation of squaring those
deviations.
For the sample variance, by dividing by the sample size minus 1 (or n − 1) rather
than n, we improve the statistical properties of the sample variance. In statistical terms,
the sample variance defined in Equation 10 is an unbiased estimator of the population
variance (a concept covered later in the curriculum on sampling). The quantity n − 1 is
also known as the number of degrees of freedom in estimating the population variance.
To estimate the population variance with s2, we must first calculate the sample mean,
which itself is an estimated parameter. Therefore, once we have computed the sample
mean, there are only n − 1 independent pieces of information from the sample; that
is, if you know the sample mean and n − 1 of the observations, you could calculate
the missing sample observation.
Sample Standard Deviation Formula. The sample standard deviation, s, is:

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}} \qquad (11)$$

where $\bar{X}$ is the sample mean and n is the number of observations in the sample.
To calculate the sample standard deviation, we first compute the sample variance.
We then take the square root of the sample variance. The steps for computing the
sample variance and the standard deviation are provided in Exhibit 46.
We illustrate the process of calculating the sample variance and standard deviation
in Example 17 using the returns of the selected country stock indexes presented in
Exhibit 35.
EXAMPLE 17
9.3.3 Dispersion and the Relationship between the Arithmetic and the Geometric Means
We can use the sample standard deviation to help us understand the gap between the
arithmetic mean and the geometric mean. The relation between the arithmetic
mean X and geometric mean X G is:
$$\bar{X}_G \approx \bar{X} - \frac{s^2}{2}$$
In other words, the larger the variance of the sample, the wider the difference
between the geometric mean and the arithmetic mean.
Using the data for Country F from Example 8, the geometric mean return is 3.1566%,
the arithmetic mean return is 3.2%, and the factor s2/2 is 0.001324/2 = 0.0662%:
3.1566% ≈ 3.2% − 0.0662%
3.1566% ≈ 3.1338%.
This relation informs us that the more disperse or volatile the returns, the larger
the gap between the geometric mean return and the arithmetic mean return.
10 DOWNSIDE DEVIATION AND COEFFICIENT OF VARIATION
In practice, we may be concerned with values of return (or another variable) below
some level other than the mean. For example, if our return objective is 6.0% annually
(our minimum acceptable return), then we may be concerned particularly with returns
below 6.0% a year. The 6.0% is the target. The target downside deviation, also referred
to as the target semideviation, is a measure of dispersion of the observations (here,
returns) below the target. To calculate a sample target semideviation, we first specify
the target. After identifying observations below the target, we find the sum of the
squared negative deviations from the target, divide that sum by the total number of
observations in the sample minus 1, and, finally, take the square root.
Sample Target Semideviation Formula. The target semideviation, $s_{\text{Target}}$, is:

$$s_{\text{Target}} = \sqrt{\sum_{\text{for all } X_i \leq B}\frac{\left(X_i - B\right)^2}{n - 1}} \qquad (12)$$
where B is the target and n is the total number of sample observations. We illustrate
this in Example 18.
EXAMPLE 18
Month       Return (%)
January          5
February         3
March           −1
April           −4
May              4
June             2
July             0
August           4
September        3
October          0
November         6
December         5
1 Calculate the target downside deviation when the target return is 3%.
2 If the target return were 4%, would your answer be different from that for
question 1? Without using calculations, explain how it would be different.
Solution to 1
Month       Observation   Deviation from the 3% Target   Deviations below the Target   Squared Deviations below the Target
January          5               2                            —                             —
February         3               0                            —                             —
March           −1              −4                           −4                            16
April           −4              −7                           −7                            49
May              4               1                            —                             —
June             2              −1                           −1                             1
July             0              −3                           −3                             9
August           4               1                            —                             —
September        3               0                            —                             —
October          0              −3                           −3                             9
November         6               3                            —                             —
December         5               2                            —                             —
Sum                                                                                         84

Target semideviation = $\sqrt{84/11}$ = 2.7634%
Solution to 2
If the target return were 4%, the existing negative deviations would be larger in
magnitude and several additional observations would fall below the target, adding
further squared deviations; so, the target semideviation would be larger.
How does the target downside deviation relate to the sample standard deviation?
We illustrate the differences between the target downside deviation and the standard
deviation in Example 19, using the data in Example 18.
EXAMPLE 19
Solution to 1
The sample standard deviation is $\sqrt{96.2500/11}$ = 2.958%.
Solution to 2
Month       Observation   Deviation from the 2% Target   Deviations below the Target   Squared Deviations below the Target
January          5               3                            —                             —
February         3               1                            —                             —
March           −1              −3                           −3                             9
April           −4              −6                           −6                            36
May              4               2                            —                             —
June             2               0                            —                             —
July             0              −2                           −2                             4
August           4               2                            —                             —
September        3               1                            —                             —
October          0              −2                           −2                             4
November         6               4                            —                             —
December         5               3                            —                             —
Sum                                                                                         53

The target semideviation with a 2% target = $\sqrt{53/11}$ = 2.195%.
Solution to 3
The standard deviation is based on the deviation from the mean, which is 2.25%.
The standard deviation includes all deviations from the mean, not just those
below it. This results in a sample standard deviation of 2.958%.
Considering just the four observations below the 2% target, the target
semideviation is 2.195%. It is less than the sample standard deviation since
target semideviation captures only the downside risk (i.e., deviations below the
target). Considering target semideviation with a 3% target, there are now five
observations below 3%, so the target semideviation is higher, at 2.763%.
EXAMPLE 20
Observation   Industry A Return (%)   Industry B Return (%)
 1                  −5                       −10
 2                  −3                        −9
 3                  −1                        −7
 4                   2                        −3
 5                   4                         1
 6                   6                         3
 7                   7                         5
 8                   9                        18
 9                  10                        20
10                  11                        22
[Dot plots of the two return series on a common scale, showing the much wider dispersion of Industry B's returns.]
Solution to 1
The arithmetic mean for both industries is the sum divided by 10, or 40/10 = 4%.
Solution to 2
The standard deviation using Equation 11 for Industry A is 5.60, and for Industry
B the standard deviation is 12.12.
Solution to 3
11 THE SHAPE OF THE DISTRIBUTIONS
Mean and variance may not adequately describe an investment’s distribution of returns.
In calculations of variance, for example, the deviations around the mean are squared,
so we do not know whether large deviations are likely to be positive or negative.
We need to go beyond measures of central tendency and dispersion to reveal other
important characteristics of the distribution. One important characteristic of interest
to analysts is the degree of symmetry in return distributions.
If a return distribution is symmetrical about its mean, each side of the distribution
is a mirror image of the other. Thus, equal loss and gain intervals exhibit the same
frequencies. If the mean is zero, for example, then losses from −5% to −3% occur with
about the same frequency as gains from 3% to 5%.
[Exhibit: probability density plots of a positively skewed distribution (Panel A) and a negatively skewed distribution (Panel B). Vertical axis: density of probability; horizontal axis: standard deviation, from −5 to 5.]
Skewness is the name given to a statistical measure of skew. (The word “skewness” is
also sometimes used interchangeably for “skew.”) Like variance, skewness is computed
using each observation’s deviation from its mean. Skewness (sometimes referred
to as relative skewness) is computed as the average cubed deviation from the mean
standardized by dividing by the standard deviation cubed to make the measure free
of scale. A symmetric distribution has skewness of 0, a positively skewed distribution
has positive skewness, and a negatively skewed distribution has negative skewness,
as given by this measure.
We can illustrate the principle behind the measure by focusing on the numera-
tor. Cubing, unlike squaring, preserves the sign of the deviations from the mean. If
a distribution is positively skewed with a mean greater than its median, then more
than half of the deviations from the mean are negative and less than half are positive.
However, for the sum of the cubed deviations to be positive, the losses must be small
and likely and the gains less likely but more extreme. Therefore, if skewness is positive,
the average magnitude of positive deviations is larger than the average magnitude of
negative deviations.
The approximation for computing sample skewness when n is large (100 or
more) is:

$$\text{Skewness} \approx \left(\frac{1}{n}\right)\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^3}{s^3}$$
[Exhibit: probability densities of a normal distribution and a fat-tailed distribution with the same mean and standard deviation. Vertical axis: density, from 0 to 0.5; horizontal axis: standard deviation, from −5 to 5.]
The calculation for kurtosis involves finding the average of deviations from the
mean raised to the fourth power and then standardizing that average by dividing by
the standard deviation raised to the fourth power. A normal distribution has kurtosis
of 3.0, so a fat-tailed distribution has a kurtosis of above 3 and a thin-tailed distribu-
tion of below 3.0.
Excess kurtosis is the kurtosis relative to the normal distribution. For a large sam-
ple size (n = 100 or more), sample excess kurtosis ($K_E$) is approximately as follows:

$$K_E \approx \left[\left(\frac{1}{n}\right)\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^4}{s^4}\right] - 3$$
As with skewness, this measure is free of scale. Many statistical packages report
estimates of sample excess kurtosis, labeling this as simply “kurtosis.”
Excess kurtosis thus characterizes kurtosis relative to the normal distribution. A
normal distribution has excess kurtosis equal to 0. A fat-tailed distribution has excess
kurtosis greater than 0, and a thin-tailed distribution has excess kurtosis less than 0. A
return distribution with positive excess kurtosis—a fat-tailed return distribution—has
more frequent extremely large deviations from the mean than a normal distribution.
Summarizing:
If kurtosis is …   then excess kurtosis is …   Therefore, the distribution is …              And we refer to the distribution as being …
Above 3.0          Above 0                     Fatter-tailed than the normal distribution    Fat-tailed (leptokurtic)
Equal to 3.0       Equal to 0                  Similar in tail weight to the normal          Mesokurtic
Below 3.0          Below 0                     Thinner-tailed than the normal distribution   Thin-tailed (platykurtic)
Most equity return series have been found to be fat-tailed. If a return distribution
is fat-tailed and we use statistical models that do not account for the distribution,
then we will underestimate the likelihood of very bad or very good outcomes. Using
the data on the daily returns of the fictitious EAA Equity Index, we see the skewness
and kurtosis of these returns in Exhibit 50.
Exhibit 50
Measure of Symmetry      Value
Skewness                −0.4260
Excess kurtosis          3.7962
We can see this graphically, comparing the distribution of the daily returns with
a normal distribution with the same mean and standard deviation:
Exhibit 50 (Continued)
[Histogram of EAA Equity Index daily returns overlaid with a normal distribution having the same mean and standard deviation. Vertical axis: number of observations; horizontal axis: standard deviation, from −5 to 5.]
Using both the statistics and the graph, we see the following:
■■ The distribution is negatively skewed, as indicated by the negative calcu-
lated skewness of −0.4260 and the influence of observations below the
mean of 0.0347%.
■■ The highest frequency of returns occurs within the −0.5 to 0.0 standard
deviations from the mean (i.e., negatively skewed).
■■ The distribution is fat-tailed, as indicated by the positive excess kurtosis of
3.7962. We can see fat tails, a concentration of returns around the mean,
and fewer observations in the regions between the central region and the
two-tail regions.
EXAMPLE 21
[Histogram of daily trading volume (millions of shares), with bins running from 3.1 to 33.7 million and frequencies up to about 60; the tallest bar is the 4.6 to 6.1 million range.]
Solution to 1
The distribution appears to be skewed to the right, or positively skewed. This is
likely due to: (1) no possible negative trading volume on a given trading day, so the
distribution is truncated at zero; and (2) greater-than-typical trading occurring
relatively infrequently, such as when there are company-specific announcements.
The actual skewness for this distribution is 2.1090, which supports this
interpretation.
Solution to 2
The distribution appears to have excess kurtosis, with a right-side fat tail and
with the greatest frequency of shares traded in the 4.6 to 6.1 million range, exceeding what
would be expected if the distribution were normally distributed. There are also fewer
observations than expected between the central region and the tail.
The actual excess kurtosis for this distribution is 5.2151, which supports
this interpretation.
12 CORRELATION BETWEEN TWO VARIABLES
Now that we have some understanding of sample variance and standard deviation, we
can more formally consider the concept of correlation between two random variables
that we previously explored visually in the scatter plots in Section 6. Correlation is a
measure of the linear relationship between two random variables.
The first step is to consider how two variables vary together, their covariance.
Definition of Sample Covariance. The sample covariance (sXY) is a measure
of how two variables in a sample move together:
$$s_{XY} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n - 1} \qquad (14)$$
Equation 14 indicates that the sample covariance is the average value of the product
of the deviations of observations on two random variables (Xi and Yi) from their sample
means. If the random variables are returns, the units would be returns squared. Also,
note the use of n − 1 in the denominator, which ensures that the sample covariance
is an unbiased estimate of population covariance.
Stated simply, covariance is a measure of the joint variability of two random vari-
ables. If the random variables vary in the same direction—for example, X tends to be
above its mean when Y is above its mean, and X tends to be below its mean when Y is
below its mean—then their covariance is positive. If the variables vary in the opposite
direction relative to their respective means, then their covariance is negative.
By itself, the size of the covariance measure is difficult to interpret as it is not
normalized and so depends on the magnitude of the variables. This brings us to the
normalized version of covariance, which is the correlation coefficient.
Definition of Sample Correlation Coefficient. The sample correlation
coefficient is a standardized measure of how two variables in a sample move
together. The sample correlation coefficient (rXY) is the ratio of the sample cova-
riance to the product of the two variables’ standard deviations:
$$r_{XY} = \frac{s_{XY}}{s_X s_Y} \qquad (15)$$
Importantly, the correlation coefficient expresses the strength of the linear rela-
tionship between the two random variables.
We will make use of scatter plots, similar to those used previously in our discussion
of data visualization, to illustrate correlation. In contrast to the correlation coefficient,
which expresses the relationship between two data series using a single number, a
scatter plot depicts the relationship graphically. Therefore, scatter plots are a very
useful tool for the sensible interpretation of a correlation coefficient.
Exhibit 51 shows examples of scatter plots. Panel A shows the scatter plot of
two variables with a correlation of +1. Note that all the points on the scatter plot in
Panel A lie on a straight line with a positive slope. Whenever variable X increases by
one unit, variable Y increases by two units. Because all of the points in the graph lie
on a straight line, an increase of one unit in X is associated with exactly a two-unit
increase in Y, regardless of the level of X. Even if the slope of the line were different
(but positive), the correlation between the two variables would still be +1 as long as
all the points lie on that straight line. Panel B shows a scatter plot for two variables
with a correlation coefficient of −1. Once again, the plotted observations all fall on a
straight line. In this graph, however, the line has a negative slope. As X increases by
one unit, Y decreases by two units, regardless of the initial value of X.
Panel C shows a scatter plot of two variables with a correlation of 0; they have no
linear relation. This graph shows that the value of variable X tells us nothing about
the value of variable Y. Panel D shows a scatter plot of two variables that have a
EXAMPLE 22
Exhibit 52
Fund A Fund B Fund C
The covariances are represented in the upper-triangle (shaded area) of the matrix
shown in Exhibit 53.
Exhibit 53
Fund A Fund B Fund C
Exhibit 54
Fund A Fund B Fund C
1 Interpret the correlation between Fund A’s returns and Fund B’s returns.
2 Interpret the correlation between Fund A’s returns and Fund C’s returns.
3 Describe the relationship of the covariance of these returns and the cor-
relation of returns.
Solutions
When interpreting a correlation, the analyst must also judge whether outlier observations contain
information about the two variables’ relationship (and should thus be included in the
correlation analysis) or contain no information (and should thus be excluded). If they
are to be excluded from the correlation analysis, as we have seen previously, outlier
observations can be handled by trimming or winsorizing the dataset.
Importantly, keep in mind that correlation does not imply causation. Even if two
variables are highly correlated, one does not necessarily cause the other in the sense
that certain values of one variable bring about the occurrence of certain values of
the other.
Moreover, with visualizations too, including scatter plots, we must be on guard
against unconsciously making judgments about causal relationships that may or may
not be supported by the data.
The term spurious correlation has been used to refer to: 1) correlation between
two variables that reflects chance relationships in a particular dataset; 2) correlation
induced by a calculation that mixes each of two variables with a third variable; and
3) correlation between two variables arising not from a direct relation between them
but from their relation to a third variable.
As an example of the chance relationship, consider the monthly US retail sales of
beer, wine, and liquor and the atmospheric carbon dioxide levels from 2000–2018.
The correlation is 0.824, indicating that there is a positive relation between the two.
However, there is no reason to suspect that the levels of atmospheric carbon dioxide
are related to the retail sales of beer, wine, and liquor.
As an example of the second kind of spurious correlation, two variables that are
uncorrelated may be correlated if divided by a third variable. For example, consider a
cross-sectional sample of companies’ dividends and total assets. While there may be
a low correlation between these two variables, dividing each by market capitalization
may increase the correlation.
As an example of the third kind of spurious correlation, height may be positively
correlated with the extent of a person’s vocabulary, but the underlying relationships
are between age and height and between age and vocabulary.
Investment professionals must be cautious in basing investment strategies on high
correlations. Spurious correlations may suggest investment strategies that appear
profitable but actually would not be, if implemented.
A further issue is that correlation does not tell the whole story about the data.
Consider Anscombe’s Quartet, discussed in Exhibit 55, where very dissimilar graphs
can be developed with variables that have the same mean, same standard deviation,
and same correlation.
Exhibit 55 (Continued)
                         I             II            III           IV
Observation          X      Y      X      Y      X      Y      X      Y
N                   11     11     11     11     11     11     11     11
Mean                9.00   7.50   9.00   7.50   9.00   7.50   9.00   7.50
Standard deviation  3.32   2.03   3.32   2.03   3.32   2.03   3.32   2.03
Correlation            0.82          0.82          0.82          0.82
While the X variable has the same values for I, II, and III in the quartet of
datasets, the Y variables are quite different, creating different relationships.
The four datasets are:
I An approximate linear relationship between X and Y.
II A curvilinear relationship between X and Y.
III A linear relationship except for one outlier.
IV A constant X with the exception of one outlier.
Depicting the quartet visually,
Exhibit 55 (Continued)
[Four scatter plots, one per dataset (I, II, III, IV), of Variable Y (axis 0 to 14) against Variable X (axis 0 to 15; 0 to 20 for dataset IV), showing four visibly different patterns despite identical summary statistics.]
The bottom line? Knowing the means and standard deviations of the two variables, as
well as the correlation between them, does not tell the entire story.
Source: Francis John Anscombe, “Graphs in Statistical Analysis,” The American Statistician 27
(February 1973): 17–21.
SUMMARY
In this reading, we have presented tools and techniques for organizing, visualizing,
and describing data that permit us to convert raw data into useful information for
investment analysis.
■■ Data can be defined as a collection of numbers, characters, words, and text—as
well as images, audio, and video—in a raw or organized format to represent
facts or information.
■■ From a statistical perspective, data can be classified as numerical data and
categorical data. Numerical data (also called quantitative data) are values that
represent measured or counted quantities as a number. Categorical data (also
called qualitative data) are values that describe a quality or characteristic of a
group of observations and usually take only a limited number of values that are
mutually exclusive.
■■ Numerical data can be further split into two types: continuous data and discrete
data. Continuous data can be measured and can take on any numerical value in
a specified range of values. Discrete data are numerical values that result from a
counting process and therefore are limited to a finite number of values.
■■ Categorical data can be further classified into two types: nominal data and
ordinal data. Nominal data are categorical values that are not amenable to being
organized in a logical order, while ordinal data are categorical values that can be
logically ordered or ranked.
■■ Based on how they are collected, data can be categorized into three types:
cross-sectional, time series, and panel. Time-series data are a sequence of
observations for a single observational unit on a specific variable collected
over time and at discrete and typically equally spaced intervals of time. Cross-
sectional data are a list of the observations of a specific variable from multiple
observational units at a given point in time. Panel data are a mix of time-series
and cross-sectional data that consists of observations through time on one or
more variables for multiple observational units.
■■ Based on whether or not data are in a highly organized form, they can be classi-
fied into structured and unstructured types. Structured data are highly orga-
nized in a pre-defined manner, usually with repeating patterns. Unstructured
data do not follow any conventionally organized forms; they are typically alter-
native data as they are usually collected from unconventional sources.
■■ Raw data are typically organized into either a one-dimensional array or a two-
dimensional rectangular array (also called a data table) for quantitative analysis.
■■ A frequency distribution is a tabular display of data constructed either by
counting the observations of a variable by distinct values or groups or by tal-
lying the values of a numerical variable into a set of numerically ordered bins.
Frequency distributions permit us to evaluate how data are distributed.
■■ The relative frequency of observations in a bin (interval or bucket) is the num-
ber of observations in the bin divided by the total number of observations. The
cumulative relative frequency cumulates (adds up) the relative frequencies as
we move from the first bin to the last, thus giving the fraction of the observa-
tions that are less than the upper limit of each bin.
■■ A contingency table is a tabular format that displays the frequency distributions
of two or more categorical variables simultaneously. One application of contin-
gency tables is for evaluating the performance of a classification model (using a
confusion matrix). Another application of contingency tables is to investigate a
potential association between two categorical variables by performing a chi-
square test of independence.
■■ Visualization is the presentation of data in a pictorial or graphical format for
the purpose of increasing understanding and for gaining insights into the data.
■■ A histogram is a bar chart of data that have been grouped into a frequency
distribution. A frequency polygon is a graph of frequency distributions obtained
by drawing straight lines joining successive midpoints of bars representing the
class frequencies.
■■ A bar chart is used to plot the frequency distribution of categorical data, with
each bar representing a distinct category and the bar’s height (or length) pro-
portional to the frequency of the corresponding category. Grouped bar charts
or stacked bar charts can present the frequency distribution of multiple cate-
gorical variables simultaneously.
■■ The sample covariance and the sample correlation coefficient measure how two
variables move together. A positive correlation coefficient indicates that the two
variables tend to move together, whereas a negative coefficient indicates that
the two variables tend to move in opposite directions. Correlation does not
imply causation, simply association. Issues that arise in evaluating correlation
include the presence of outliers and spurious correlation.
PRACTICE PROBLEMS
1 Published ratings on stocks ranging from 1 (strong sell) to 5 (strong buy) are
examples of which measurement scale?
A Ordinal
B Continuous
C Nominal
2 Data values that are categorical and not amenable to being organized in a logi-
cal order are most likely to be characterized as:
A ordinal data.
B discrete data.
C nominal data.
3 Which of the following data types would be classified as being categorical?
A Discrete
B Nominal
C Continuous
4 A fixed-income analyst uses a proprietary model to estimate bankruptcy proba-
bilities for a group of firms. The model generates probabilities that can take any
value between 0 and 1. The resulting set of estimated probabilities would most
likely be characterized as:
A ordinal data.
B discrete data.
C continuous data.
5 An analyst uses a software program to analyze unstructured data—specifically,
management’s earnings call transcript for one of the companies in her research
coverage. The program scans the words in each sentence of the transcript and
then classifies the sentences as having negative, neutral, or positive sentiment.
The resulting set of sentiment data would most likely be characterized as:
A ordinal data.
B discrete data.
C nominal data.
6 Each individual column of data in the table can be best characterized as:
A panel data.
B time-series data.
C cross-sectional data.
7 Each individual row of data in the table can be best characterized as:
A panel data.
B time-series data.
C cross-sectional data.
Return Interval (%)   Absolute Frequency
−10.0 to −7.0                3
 −7.0 to −4.0                7
 −4.0 to −1.0               10
 −1.0 to +2.0               12
 +2.0 to +5.0               23
 +5.0 to +8.0                5
The cumulative relative frequency for the bin −1.71% ≤ x < 2.03% is closest to:
A 0.250.
B 0.333.
C 0.583.
Bond Rating
Sector A AA AAA
Communication Services 25 32 27
Consumer Staples 30 25 25
Energy 100 85 30
Health Care 200 100 63
Utilities 22 28 14
[Figures for the preceding problems: a histogram of returns by return interval (%), with intervals from −37% to 38% and frequencies from 0 to 8, and a frequency polygon of frequency (0 to 15) versus return interval midpoint (%), with midpoints from −5% to 3%.]
A Heat map
B Bubble line chart
C Scatter plot matrix
23 The annual returns for three portfolios are shown in the following exhibit.
Portfolios P and R were created in Year 1, Portfolio Q in Year 2.
Fund Y (%)
Year 1 19.5
Year 2 −1.9
Year 3 19.7
Year 4 35.0
Year 5 5.7
Year 1 62.00
Year 2 76.00
Year 3 84.00
Year 4 90.00
27 The fourth quintile return for the MSCI World Index is closest to:
A 20.65%.
B 26.03%.
C 27.37%.
28 For Year 6–Year 10, the mean absolute deviation of the MSCI World Index total
returns is closest to:
A 10.20%.
B 12.74%.
C 16.40%.
29 Annual returns and summary statistics for three funds are listed in the follow-
ing exhibit:
1 4.5%
2 6.0%
3 1.5%
4 −2.0%
5 0.0%
6 4.5%
7 3.5%
8 2.5%
9 5.5%
10 4.0%
[Chart for this problem: plotted values of 154.45, 114.25, 100.49, 79.74, and 51.51 on a vertical scale from 40 to 160.]
SOLUTIONS
1 A is correct. Ordinal scales sort data into categories that are ordered with
respect to some characteristic and may involve numbers to identify categories
but do not assure that the differences between scale values are equal. The buy
rating scale indicates that a stock ranked 5 is expected to perform better than a
stock ranked 4, but it tells us nothing about the performance difference between
stocks ranked 4 and 5 compared with the performance difference between
stocks ranked 1 and 2, and so on.
2 C is correct. Nominal data are categorical values that are not amenable to being
organized in a logical order. A is incorrect because ordinal data are categorical
data that can be logically ordered or ranked. B is incorrect because discrete
data are numerical values that result from a counting process; thus, they can be
ordered in various ways, such as from highest to lowest value.
3 B is correct. Categorical data (or qualitative data) are values that describe a
quality or characteristic of a group of observations and therefore can be used
as labels to divide a dataset into groups to summarize and visualize. The two
types of categorical data are nominal data and ordinal data. Nominal data are
categorical values that are not amenable to being organized in a logical order,
while ordinal data are categorical values that can be logically ordered or ranked.
A is incorrect because discrete data would be classified as numerical data (not
categorical data). C is incorrect because continuous data would be classified as
numerical data (not categorical data).
4 C is correct. Continuous data are data that can be measured and can take on
any numerical value in a specified range of values. In this case, the analyst is
estimating bankruptcy probabilities, which can take on any value between 0 and
1. Therefore, the set of bankruptcy probabilities estimated by the analyst would
likely be characterized as continuous data. A is incorrect because ordinal data
are categorical values that can be logically ordered or ranked. Therefore, the
set of bankruptcy probabilities would not be characterized as ordinal data. B is
incorrect because discrete data are numerical values that result from a counting
process, and therefore the data are limited to a finite number of values. The pro-
prietary model used can generate probabilities that can take any value between
0 and 1; therefore, the set of bankruptcy probabilities would not be character-
ized as discrete data.
5 A is correct. Ordinal data are categorical values that can be logically ordered or
ranked. In this case, the classification of sentences in the earnings call transcript
into three categories (negative, neutral, or positive) describes ordinal data,
as the data can be logically ordered from positive to negative. B is incorrect
because discrete data are numerical values that result from a counting process.
In this case, the analyst is categorizing sentences (i.e., unstructured data) from
the earnings call transcript as having negative, neutral, or positive sentiment.
Thus, these categorical data do not represent discrete data. C is incorrect
because nominal data are categorical values that are not amenable to being
organized in a logical order. In this case, the classification of unstructured data
(i.e., sentences from the earnings call transcript) into three categories (negative,
neutral, or positive) describes ordinal (not nominal) data, as the data can be
logically ordered from positive to negative.
6 B is correct. Time-series data are a sequence of observations of a specific vari-
able collected over time and at discrete and typically equally spaced intervals of
time, such as daily, weekly, monthly, annually, and quarterly. In this case, each
column is a time series of data that represents annual total return (the specific
variable) for a given country index, and it is measured annually (the discrete
interval of time). A is incorrect because panel data consist of observations
through time on one or more variables for multiple observational units. The
entire table of data is an example of panel data showing annual total returns
(the variable) for three country indexes (the observational units) by year. C is
incorrect because cross-sectional data are a list of the observations of a specific
variable from multiple observational units at a given point in time. Each row
(not column) of data in the table represents cross-sectional data.
7 C is correct. Cross-sectional data are observations of a specific variable from
multiple observational units at a given point in time. Each row of data in the
table represents cross-sectional data. The specific variable is annual total return,
the multiple observational units are the three countries’ indexes, and the given
point in time is the time period indicated by the particular row. A is incor-
rect because panel data consist of observations through time on one or more
variables for multiple observational units. The entire table of data is an exam-
ple of panel data showing annual total returns (the variable) for three country
indexes (the observational units) by year. B is incorrect because time-series data
are a sequence of observations of a specific variable collected over time and
at discrete and typically equally spaced intervals of time, such as daily, weekly,
monthly, annually, and quarterly. In this case, each column (not row) is a time
series of data that represents annual total return (the specific variable) for a
given country index, and it is measured annually (the discrete interval of time).
8 A is correct. Panel data consist of observations through time on one or more
variables for multiple observational units. A two-dimensional rectangular array,
or data table, would be suitable here, as it comprises columns to hold
the variable(s) for the observational units and rows to hold the observations
through time. B is incorrect because a one-dimensional (not a two-dimensional
rectangular) array would be most suitable for organizing a collection of data
of the same data type, such as the time-series data from a single variable. C is
incorrect because a one-dimensional (not a two-dimensional rectangular) array
would be most suitable for organizing a collection of data of the same data type,
such as the same variable for multiple observational units at a given point in
time (cross-sectional data).
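As an illustration of these layouts, here is a minimal Python sketch of panel data held in a two-dimensional rectangular array; the years, country labels, and return values are hypothetical:

    # Hypothetical panel: annual total returns (%) for three country indexes.
    # Rows = observations through time (years); columns = observational units.
    years = [2018, 2019, 2020]
    countries = ["Country A", "Country B", "Country C"]
    panel = [
        [-8.1, 24.3, 15.6],   # 2018 row: a cross-section across countries
        [ 4.2, 18.7,  9.9],   # 2019 row
        [11.0, -2.4,  7.3],   # 2020 row
    ]

    # Each column is a one-dimensional array: a time series for one country.
    country_b_series = [row[1] for row in panel]

    # Each row is a one-dimensional array: cross-sectional data at one point in time.
    cross_section_2019 = panel[years.index(2019)]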
9 B is correct. In a frequency distribution, the absolute frequency, or simply the
raw frequency, is the actual number of observations counted for each unique
value of the variable. A is incorrect because the relative frequency, which is
calculated as the absolute frequency of each unique value of the variable divided
by the total number of observations, presents the absolute frequencies in terms
of percentages. C is incorrect because the relative (not absolute) frequency
provides a normalized measure of the distribution of the data, allowing compar-
isons between datasets with different numbers of total observations.
10 A is correct. The relative frequency is the absolute frequency of each bin
divided by the total number of observations. Here, the relative frequency is cal-
culated as: (12/60) × 100 = 20%. B is incorrect because the relative frequency of
this bin is (23/60) × 100 = 38.33%. C is incorrect because the cumulative relative
frequency of the last bin must equal 100%.
11 C is correct. The cumulative relative frequency of a bin identifies the fraction of
observations that are less than the upper limit of the given bin. It is determined
by summing the relative frequencies from the lowest bin up to and including
the given bin. The following exhibit shows the relative frequencies for all the
bins of the data from the previous exhibit:
© CFA Institute. For candidate use only. Not for distribution.
168 Reading 2 ■ Organizing, Visualizing, and Describing Data
The bin −1.71% ≤ x < 2.03% has a cumulative relative frequency of 0.583.
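A minimal Python sketch of the cumulative relative frequency mechanics: the counts 12 and 23 come from solution 10, while the final bin count of 25 is assumed only so the total matches the 60 observations.

    # Absolute frequencies per bin; the last count (25) is assumed so that
    # the bins sum to the 60 observations cited in solution 10.
    counts = [12, 23, 25]
    total = sum(counts)                      # 60

    relative = [c / total for c in counts]   # 0.20, 0.3833, 0.4167
    cumulative = []
    running = 0.0
    for r in relative:
        running += r
        cumulative.append(running)

    print(cumulative[1])                     # 0.5833 -> the 0.583 in solution 11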
12 C is correct. The marginal frequency of energy sector bonds in the portfolio is
the sum of the joint frequencies across all three levels of bond rating, so 100 +
85 + 30 = 215. A is incorrect because 27 is the relative frequency for energy
sector bonds based on the total count of 806 bonds, so 215/806 = 26.7%, not
the marginal frequency. B is incorrect because 85 is the joint frequency for AA
rated energy sector bonds, not the marginal frequency.
13 A is correct. The relative frequency for any value in the table based on the total
count is calculated by dividing that value by the total count. Therefore, the rela-
tive frequency for AA rated energy bonds is calculated as 85/806 = 10.5%.
B is incorrect because 31.5% is the relative frequency for AA rated energy
bonds, calculated based on the marginal frequency for all AA rated bonds, so
85/(32 + 25 + 85 + 100 + 28), not based on total bond counts. C is incorrect
because 39.5% is the relative frequency for AA rated energy bonds, calculated
based on the marginal frequency for all energy bonds, so 85/(100 + 85 + 30),
not based on total bond counts.
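A short Python sketch reproducing the marginal and relative frequencies in solutions 12 and 13. Only the energy row counts, the AA column counts, and the 806-bond total come from the solutions; the rating labels other than AA are assumed for illustration.

    # Joint frequencies from the solutions: energy sector bonds by rating
    # (the "AAA" and "A" labels are assumed), and all AA rated bonds by sector.
    energy_by_rating = {"AAA": 100, "AA": 85, "A": 30}
    aa_by_sector = [32, 25, 85, 100, 28]
    total_bonds = 806

    marginal_energy = sum(energy_by_rating.values())            # 215
    rel_to_total  = energy_by_rating["AA"] / total_bonds        # 0.105 -> 10.5%
    rel_to_aa     = energy_by_rating["AA"] / sum(aa_by_sector)  # 0.315 -> 31.5%
    rel_to_energy = energy_by_rating["AA"] / marginal_energy    # 0.395 -> 39.5%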
14 C is correct. Because 50 data points are in the histogram, the median return
would be the mean of the 50/2 = 25th and (50 + 2)/2 = 26th positions. The sum
of the return bin frequencies to the left of the 13% to 18% interval is 24. As a
result, the 25th and 26th returns will fall in the 13% to 18% interval.
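Locating the median bin programmatically follows the same logic: accumulate bin frequencies until the ranks n/2 and n/2 + 1 are covered. The bin edges and frequencies below are assumed; they are chosen only to be consistent with the facts in solutions 14 and 15 (50 observations, 24 below the 13% to 18% bin, and a maximum bin frequency of 7).

    # Hypothetical histogram consistent with the solutions.
    bins  = [(-7, -2), (-2, 3), (3, 8), (8, 13),
             (13, 18), (18, 23), (23, 28), (28, 33)]  # bin edges in %, assumed
    freqs = [6, 5, 7, 6, 6, 7, 6, 7]                  # assumed; sums to 50

    n = sum(freqs)
    ranks = (n // 2, n // 2 + 1)                      # 25th and 26th ordered values

    def bin_of(rank):
        # Walk the cumulative frequencies until the rank is reached.
        cum = 0
        for edges, f in zip(bins, freqs):
            cum += f
            if rank <= cum:
                return edges

    print(bin_of(ranks[0]), bin_of(ranks[1]))         # (13, 18) (13, 18)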
15 C is correct. The mode of a distribution with data grouped in intervals is the
interval with the highest frequency. The three intervals of 3% to 8%, 18% to 23%,
and 28% to 33% all have a high frequency of 7.
16 A is correct. Twenty observations lie in the interval “0.0 to 2.0,” and six observa-
tions lie in the “2.0 to 4.0” interval. Together, they represent 26/48, or 54.17%, of
all observations, which is more than 50%.
17 A is correct. A bar chart that orders categories by frequency in descending
order and includes a line displaying cumulative relative frequency is called a
Pareto Chart. A Pareto Chart is used to highlight dominant categories or the
most important groups. B is incorrect because a grouped bar chart or clustered
bar chart is used to present the frequency distribution of two categorical variables. C is incorrect because a frequency polygon displays a frequency distribution by plotting frequencies against the midpoints of the bins and connecting the points with line segments; it neither ranks categories nor shows cumulative relative frequency.
18 C is correct. A word cloud, or tag cloud, is a visual device for representing
unstructured, textual data. It consists of words extracted from text with the size
of each word being proportional to the frequency with which it appears in the
given text. A is incorrect because a tree-map is a graphical tool for displaying
and comparing categorical data, not for visualizing unstructured, textual data. B
is incorrect because a scatter plot is used to visualize the joint variation in two
numerical variables, not for visualizing unstructured, textual data.
27 B is correct. Quintiles divide a distribution into fifths, with the fourth quintile
occurring at the point at which 80% of the observations lie below it. The fourth
quintile is equivalent to the 80th percentile. To find the yth percentile (Py), we first must determine its location. The formula for the location (Ly) of the yth percentile in an array with n entries sorted in ascending order is Ly = (n + 1) × (y/100). In this case, n = 10 and y = 80, so
L80 = (10 + 1) × (80/100) = 11 × 0.8 = 8.8.
With the data arranged in ascending order (−40.33%, −5.02%, 9.57%, 10.02%,
12.34%, 15.25%, 16.54%, 20.65%, 27.37%, and 30.79%), the 8.8th position would
be between the 8th and 9th entries, 20.65% and 27.37%, respectively. Using
linear interpolation, P80 = X8 + (L80 − 8) × (X9 − X8):
P80 = 20.65 + (8.8 − 8) × (27.37 − 20.65)
= 20.65 + (0.8 × 6.72) = 20.65 + 5.38
= 26.03%.
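The same location-and-interpolation procedure can be scripted. A minimal Python sketch using the ten sorted returns above (the function name percentile_linear is ours):

    returns = [-40.33, -5.02, 9.57, 10.02, 12.34,
               15.25, 16.54, 20.65, 27.37, 30.79]   # already sorted ascending

    def percentile_linear(sorted_x, y):
        # Location formula: Ly = (n + 1) * (y / 100).
        n = len(sorted_x)
        loc = (n + 1) * y / 100.0
        k = int(loc)            # rank of the lower neighboring observation
        frac = loc - k          # fractional distance toward the next one
        if k < 1:
            return sorted_x[0]
        if k >= n:
            return sorted_x[-1]
        # Interpolate linearly between the kth and (k+1)th observations.
        return sorted_x[k - 1] + frac * (sorted_x[k] - sorted_x[k - 1])

    print(percentile_linear(returns, 80))   # 26.026 -> 26.03%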
Column 1: Sum the annual returns and divide by n to find the arithmetic mean (X̄) of 16.40%.
Column 2: Calculate the absolute value of the difference between each year's return and the mean from Column 1. Sum the results and divide by n to find the mean absolute deviation (MAD).
These calculations are shown in the following exhibit:

Year    Return (Column 1)    |Xi − X̄| (Column 2)
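A generic Python sketch of the MAD computation described in Columns 1 and 2 (the function name is ours; the example returns are made up, chosen only so that their mean is 16.40%):

    def mean_abs_deviation(x):
        # Column 1: arithmetic mean of the observations.
        m = sum(x) / len(x)
        # Column 2: average absolute deviation from that mean.
        return sum(abs(xi - m) for xi in x) / len(x)

    # Made-up returns (in %) whose mean is 16.40.
    print(mean_abs_deviation([10.0, 22.0, 17.2]))   # 4.27, illustrative only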
The coefficient of variation (CV) is the ratio of the standard deviation to the arithmetic mean:
CVMATR = s/X̄ = 1.35%/1.25% = 1.08
CVINDU = s/X̄ = 1.52%/3.01% = 0.51
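A one-line helper makes the two calculations explicit (the function name is ours; the inputs are the rounded figures above):

    def coeff_of_variation(std_dev, mean):
        # CV: dispersion per unit of mean return.
        return std_dev / mean

    print(coeff_of_variation(1.35, 1.25))   # 1.08 (MATR)
    print(coeff_of_variation(1.52, 3.01))   # 0.505 with these rounded inputs,
                                            # reported as 0.51 (INDU)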
31 A is correct. The more dispersed a distribution, the greater the difference between the arithmetic mean and the geometric mean.
32 B is correct. The distribution is thin-tailed relative to the normal distribution
because the excess kurtosis is less than zero.
33 B is correct. The geometric mean compounds the periodic returns of every
period, giving the investor a more accurate measure of the terminal value of an
investment.
34 B is correct. The sum of the returns is 30.0%, so the arithmetic mean is
30.0%/10 = 3.0%.
35 B is correct.
Year    Return    1 + Return
1    4.5%    1.045
2    6.0%    1.060
3    1.5%    1.015
4    −2.0%    0.980
5    0.0%    1.000
6    4.5%    1.045
7    3.5%    1.035
8    2.5%    1.025
9    5.5%    1.055
10    4.0%    1.040

The product of the (1 + Return) terms is 1.34023, so the geometric mean return is 1.34023^(1/10) − 1 = 2.97%.
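A minimal Python sketch of the compounding, using the same ten-year return series shown in the solution 38 exhibit:

    returns = [0.045, 0.060, 0.015, -0.020, 0.000,
               0.045, 0.035, 0.025, 0.055, 0.040]

    growth = 1.0
    for r in returns:
        growth *= 1.0 + r            # compound each period's return

    geo_mean = growth ** (1.0 / len(returns)) - 1.0
    print(geo_mean)                  # about 0.0297 -> 2.97%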
37 B is correct.
Year    Return    Deviation    Deviation Squared
1    4.5%    1.5%    0.000225
2    6.0%    3.0%    0.000900
3    1.5%    −1.5%    0.000225
4    −2.0%    −5.0%    0.002500
5    0.0%    −3.0%    0.000900
6    4.5%    1.5%    0.000225
7    3.5%    0.5%    0.000025
8    2.5%    −0.5%    0.000025
9    5.5%    2.5%    0.000625
10    4.0%    1.0%    0.000100
Sum            0.005750

Deviations are taken from the arithmetic mean of 3.0%. The standard deviation is the square root of the sum of the squared deviations divided by n − 1:
s = √(0.005750/9) = 2.5276%.
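A minimal Python sketch of the sample standard deviation for the same return series (the function name is ours; note the n − 1 divisor):

    returns = [0.045, 0.060, 0.015, -0.020, 0.000,
               0.045, 0.035, 0.025, 0.055, 0.040]

    def sample_std(x):
        n = len(x)
        m = sum(x) / n                          # arithmetic mean: 0.03
        ss = sum((xi - m) ** 2 for xi in x)     # sum of squared deviations
        return (ss / (n - 1)) ** 0.5            # divide by n - 1, then take the root

    print(sample_std(returns))   # about 0.025276 -> 2.5276%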
38 B is correct.
Year    Return    Deviation Squared below Target of 2%
1    4.5%
2    6.0%
3    1.5%    0.000025
4    −2.0%    0.001600
5    0.0%    0.000400
6    4.5%
7    3.5%
8    2.5%
9    5.5%
10    4.0%
Sum        0.002025
The target semi-deviation is the square root of the sum of the squared deviations from the target, divided by n − 1:
sTarget = √(0.002025/9) = 1.5%.
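A minimal Python sketch of the target semi-deviation for the same series; only deviations below the 2% target enter the sum, but the divisor remains n − 1 with n the full sample size, as in the solution:

    returns = [0.045, 0.060, 0.015, -0.020, 0.000,
               0.045, 0.035, 0.025, 0.055, 0.040]

    def target_semideviation(x, b):
        # Square only the shortfalls below the target b; other returns contribute zero.
        shortfall_sq = sum((xi - b) ** 2 for xi in x if xi < b)
        return (shortfall_sq / (len(x) - 1)) ** 0.5

    print(target_semideviation(returns, 0.02))   # 0.015 -> 1.5%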
39 B is correct. The correlation coefficient is positive, indicating that the two series
move together.
40 C is correct. Both outliers and spurious correlation are potential problems with
interpreting correlation coefficients.
41 C is correct. The correlation coefficient is positive because the covariation is
positive.
42 A is correct. The correlation coefficient is negative because the covariation is
negative.
43 C is correct. The correlation coefficient is positive because the covariance is
positive. The fact that one or both variables have a negative mean does not
affect the sign of the correlation coefficient.
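Solutions 41 through 43 rest on the same fact: the correlation coefficient takes its sign from the covariance, since the standard deviations in its denominator are positive. A minimal Python sketch with made-up series (y has a negative mean, echoing solution 43):

    def sample_cov(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

    def sample_corr(x, y):
        # Standard deviations via cov(x, x) = variance; both are positive,
        # so the correlation's sign must match the covariance's sign.
        sx = sample_cov(x, x) ** 0.5
        sy = sample_cov(y, y) ** 0.5
        return sample_cov(x, y) / (sx * sy)

    # Made-up series: y's mean is negative, yet y co-moves positively with x.
    x = [0.02, -0.01, 0.04, 0.00, 0.03]
    y = [-0.03, -0.06, -0.01, -0.05, -0.02]
    print(sample_corr(x, y))   # positive, matching the positive covariance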
44 B is correct. The median is indicated within the box, which is 100.49 in this diagram.
45 C is correct. The interquartile range is the difference between 114.25 and 79.74,
which is 34.51.
46 B is correct. The coefficient of variation is the ratio of the standard deviation to the arithmetic average: √0.001723/0.09986 = 0.04151/0.09986 = 0.416.
47 C is correct. The skewness is positive, so it is right-skewed (positively skewed).
48 C is correct. The excess kurtosis is positive, indicating that the distribution is
“fat-tailed”; therefore, there is more probability in the tails of the distribution
relative to the normal distribution.