Quantitative Methods For Communication Students¹
Petro K. Poutanen²
10.2.2014
¹ This material can be seen at http://blogs.helsinki.fi/quantitative-communication/.
Note that all the materials are licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 Generic license.
² I want to thank Sam Kingsley for correcting the language, and Heikki Hyhkö and
Olli Parviainen for their comments on some sections of this work. These contributions
improved the quality of this work. However, any errors are mine, and feedback and
comments about this material can be sent to me at petro.k.poutanen@gmail.com.
Contents
1 Introduction
2 Quantitative Research
2.1 Starting to think statistically
2.2 Theories and models
2.3 Research design
2.4 Research process
2.5 Further resources
3 Methods
3.1 Survey
3.2 Content analysis
3.3 Network analysis
4 Data
4.1 Data and variables
4.2 Sampling
4.3 Getting the data
4.4 Preparing the data
5 Analysis
5.1 Descriptive statistics
5.1.1 Statistics for single variables
5.1.2 Statistics for two variables
5.2 Inferential statistics
5.2.1 Confidence intervals
5.2.2 Testing for a difference between means
5.3 Correlation and regression
Introduction
CHAPTER 1. INTRODUCTION 4
Quantitative Research
This section covers basic protocols and starting points for developing a research
design for quantitative inquiries in communication.
CHAPTER 2. QUANTITATIVE RESEARCH 6
firm” the observations. This is how human brains mostly work: we see patterns
(“evidence”) around us and try to explain them. However, we also see patterns
where there is actually nothing at all going on!
This confirmation bias can be harmful in scientific research, and statistical
methods are precisely a way to avoid succumbing to it. Good scientific research
is therefore based on some type of theoretical reasoning, which is either taken
as given and tested against empirical data, or developed over the course of data
exploration and carefully linked with an existing body of empirical research and
theories.
3. Defining what/who the observational units are (e.g. people), the variables
(what is to be measured/asked) and the sample (how the observational
units will be selected, i.e. the criteria for inclusion, and how many of
them there will be)
7. Making any required additional inquiries into the data and testing
cause-and-effect relations
Methods
3.1 Survey
A survey is one of the most popular data collection and research methods across
the social sciences and market research. So much so, in fact, that it has become
difficult to get people to answer survey questionnaires. Moreover, online surveys
have arrived and changed the game a little, making data collection easier and
cheaper, but at the same time subject to many biases.
In essence, a survey is a research instrument that usually takes the form of
a structured questionnaire. The questionnaire data can be collected through
face-to-face or phone interviews, or by sending an invitation to answer the
questions by post or via email. In research methodology jargon, a survey design
is called a cross-sectional study if it is conducted only once, or a longitudinal
study if data is collected more than once over time.
CHAPTER 3. METHODS 12
of the coder. An example of such a unit of analysis could be attractiveness, an
emotional tone of voice, the positivity or negativity of a speech act, etc.
Data
This section concerns issues related to data gathering and sources of data.
CHAPTER 4. DATA 15
Figure 4.1: An example of a data matrix in the SPSS program. The figure shows a
data matrix of World Economic Freedom data. The countries on each row are
observational units or "cases", and the columns of different freedom indexes,
such as business freedom, are variables. Each country then has its individual
cell scores for the variables.
4.2 Sampling
It is relatively easy to observe people’s behavior or ask them what they think
about something. However, it is another thing completely to claim that these
observations (e.g. ticks on a survey form) also apply to people other than those
we have asked. The question of how far our results can be generalized is
addressed by sampling methods. The size and representativeness of a sample define
the limits for inferences that are made on the basis of the sample data. In
general, researchers collect from a few hundred to thousands or even tens of
thousands of observations. But even with a sample as small as 30 cases some
conclusions can be drawn.
One of the most critical issues in data analysis is to understand the
difference between the sample and the population. Basically, a sample is
a subset of a population. It is not just any subset, however, as a researcher
usually has some idea of what kind of subset it is, i.e. how the sample
has been selected. In the ideal case, 1) each member of the population has the
same probability of being selected for the sample and 2) the selections are
independent of each other with regard to the characteristic being measured.
We then refer to a random sample. In this ideal case we can define exact limits
within which the results drawn from the random sample represent the whole
population.
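As a minimal sketch of condition 1), the following Python lines draw a simple random sample. The sampling frame of 10,000 numbered subscribers, the sample size of 200 and the seed are all hypothetical values chosen for illustration. `random.sample` draws without replacement, so every member of the frame has the same chance of being selected.

```python
import random

# Hypothetical sampling frame: subscriber IDs 1..10000 (illustrative only).
population = list(range(1, 10_001))

rng = random.Random(42)                  # fixed seed so the draw can be repeated
sample = rng.sample(population, k=200)   # simple random sample without replacement

print(sorted(sample)[:10])               # the ten smallest selected IDs
print(len(sample), len(set(sample)))     # sample size and number of distinct IDs
```

In a real study the list would be the actual sampling frame (for example a membership register), and how complete that frame is matters as much as the random draw itself.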
The fact that a random subset of a larger population can be representative
is based on the central limit theorem (CLT), a result from probability theory.
The central idea of the CLT is that if we drew a large number of random samples
of sufficient size from a population, their mean values would follow an
approximately bell-shaped normal distribution centered on the population mean,
with a spread that depends on the sample size and the variation in the
population. As we know the properties of this distribution, we can assume, under
certain conditions, that the mean of the sample we have is close to most of the
other samples’ means, meaning that we don’t need to take hundreds of samples but
only one, randomly selected. This finding forms the basis for statistical
estimation and testing, which are covered in section 5 (Analysis).
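The idea can be tried out with a small simulation. The sketch below assumes a made-up, right-skewed population (exponentially distributed values), draws 2,000 random samples of 50 cases each and compares the spread of the sample means with the value the CLT predicts, the population standard deviation divided by the square root of the sample size. The population, sample size and seed are illustrative assumptions, not data from this material.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical right-skewed "population" (exponential, mean about 1.0); not real data.
population = [rng.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50            # size of each sample
n_samples = 2000  # number of repeated samples

sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)
]

# The CLT predicts that the sample means cluster around mu with spread sigma / sqrt(n).
predicted_se = sigma / n ** 0.5
observed_se = statistics.stdev(sample_means)
within_2se = sum(abs(m - mu) <= 2 * predicted_se for m in sample_means) / n_samples

print(f"population mean {mu:.3f}, mean of sample means {statistics.fmean(sample_means):.3f}")
print(f"predicted SE {predicted_se:.3f}, observed SE {observed_se:.3f}")
print(f"share of sample means within 2 SE of mu: {within_2se:.1%}")
```

Even though the population itself is far from bell-shaped, roughly 95% of the sample means should land within two standard errors of the population mean, which is why a single random sample can be informative.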
In reality, a random sample is not easily achievable, and there are other
sampling strategies too. For example, in a convenience sample the members of
the population are selected on the basis of their availability. Results drawn from
studies based on this type of sample cannot be reliably generalized to any
population. If the study is conducted for a sufficiently small population, such
as a firm with 500 workers, sampling need not be used. All the workers can be
sent an email inquiry to participate in the study. Researchers then need to
evaluate how well those who participated represent the whole firm (for example,
are all the departments included) and whether there are any possible sources of
systematic bias (such as people without computers, or people who only rarely
read their emails).
refer to some examples of collecting digital data easily. For example, data on
real social networks can be collected from social network services such as Twitter
and Facebook. An Excel plug-in called NodeXL gives users direct connections to
social networks and services, such as Twitter, Facebook, YouTube, Flickr, wikis
and email, for mining data (take a look at these handouts on how to
use NodeXL). An easy option for mining network data from Facebook is also
provided by an app called Netvizz. Many more tools and techniques for data
collection can be found at DIRT: Digital Research Tools Wiki, which has gathered
together hundreds of tools for conducting research in the digital environment.
It lists tools from data mining to reference management, from quantitative to
qualitative data, and from commercial to open software.
If one is interested in extracting information from websites, web scraping is
the method to use. OutWit Hub is an example of a scraping tool. With this
tool a researcher can collect data from web pages directly and automatically
without painful copy-pasting. Scrapers can be used directly via web browsers.
OutWit Hub is available for Mozilla Firefox, and for Google Chrome there is
another option that can be used. Software requiring programming skills, such as
R, offers several possibilities for gathering different types of online content. Take
a look at this nice example of gathering data on consumer attitudes towards an
airline company through Twitter.
Last but not least, it goes without saying that there are tons of open and free
data online that have already been gathered by someone and made available for
anyone to analyze. Data collected by governments, NGOs, and think tanks is
readily available. Sources for open data sets can be found through governmental
data pages and, for example, through a platform maintained by the Open Knowledge
Foundation, an organization promoting the idea of open data.
Analysis
This section discusses some of the most common statistics and statistical
analysis methods.
resenting the top 10% of the observed values, are cut off to make the variable’s distribution
easier to analyze for the purposes of these examples.
CHAPTER 5. ANALYSIS 21
Figure 5.1: A histogram of salaries with the mean (red line) and median (green
line) and normal curve.
The horizontal axis shows salaries and the vertical axis the corresponding
frequencies (how many times a certain value has been observed). Not all observed
values are shown individually in a histogram; instead, adjacent values are
grouped together into bins that are represented by the bars. The important
thing to consider is how symmetrical the distribution is. The more symmetrical
it is, the easier it will be to describe using simple statistics. The distribution
of salaries is clearly skewed to the right: relatively few people earn a lot
(the tail of the distribution points to the right), and the bulk of the
observations are located near the mean (red vertical line) and the median (green
vertical line), a little to the left of the center of the distribution.
Among the simplest possible descriptive statistics are the measures of central
tendency and dispersion. The measurement scale (see section 4.1) and the
distribution of the variable generally determine which statistics best describe
the variable. For numerical variables we can calculate an arithmetic mean. The
symbol for the population mean is µ (the Greek letter “mu”) and for the sample
mean x̄. The sample mean is what we can calculate on the basis of our data,
and it serves as an estimate of the population mean, which is a theoretical “real”
value in the population. The sample mean is calculated by summing all the
individual values and dividing that sum by the number of observations:
$\bar{x} = (x_1 + x_2 + \dots + x_n)/n$.
For the distribution of salaries above, the mean is about 2782, which is located
slightly to the right of the peak of the frequencies, since the distribution is skewed
to the right (the extreme values “attract” the mean). It is easy to see that the
mean value does not capture the skewness of the distribution. In fact, the
mean best describes the central tendency of a variable when its distribution is
symmetrical. When it is not, we can use the median (Md), which is the
middle value when all the values are ranked from minimum to maximum (or the
mean of the two middle values when there is an even number of values). In
this case, the median is 2505 and probably a better measure of central tendency,
since it is robust to the effect of extreme values.
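The difference between the two measures can be seen with a few lines of Python. The ten salaries below are invented for illustration (they are not the data behind figure 5.1); one extreme value is enough to pull the mean upwards while the median barely moves.

```python
import statistics

# Hypothetical monthly salaries in euros; the last value is an extreme one.
salaries = [1900, 2100, 2300, 2400, 2500, 2600, 2800, 3100, 3400, 9000]

mean = sum(salaries) / len(salaries)    # (x1 + x2 + ... + xn) / n
median = statistics.median(salaries)    # mean of the two middle values (n is even)

# The same statistics without the extreme value.
trimmed = salaries[:-1]
mean_trimmed = sum(trimmed) / len(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with the extreme value:    mean {mean:.1f}, median {median:.1f}")
print(f"without the extreme value: mean {mean_trimmed:.1f}, median {median_trimmed:.1f}")
```

Dropping the single extreme salary changes the mean by several hundred euros but the median by only fifty, which is what "robust to extreme values" means in practice.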
The most basic measure of dispersion is the standard deviation. The
symbol for the population standard deviation is σ (the Greek letter “sigma”) and
for the sample it is s. The standard deviation describes how far the values are
on average from their mean value. The standard deviation of the salaries is 913,
meaning that the values are dispersed on average 913 euros away from the mean
value.
Symmetrical distributions are useful, because if we know the mean and the
standard deviation, we know quite a lot of other things about the distribution
as well. In the case of a symmetrical, bell-shaped (normal) distribution about
68% of all the values are within one standard deviation of the mean, and about
95% of all the values are within two standard deviations of the mean (see
figure 5.2).
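The 68% and 95% rule can be checked on simulated data. This is only a sketch: the values are generated from a normal distribution with an assumed mean of 2500 and standard deviation of 900, chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical, roughly normal data (mean 2500, SD 900); illustrative values only.
values = [rng.gauss(2500, 900) for _ in range(10_000)]

mean = statistics.fmean(values)
s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)

def share_within(k):
    """Proportion of values lying within k standard deviations of the mean."""
    return sum(abs(v - mean) <= k * s for v in values) / len(values)

print(f"mean {mean:.0f}, s {s:.0f}")
print(f"within 1 s: {share_within(1):.1%}, within 2 s: {share_within(2):.1%}")
```

For a clearly skewed variable, such as the salaries above, these shares can differ noticeably from 68% and 95%, so the rule should be applied with care.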
To describe the distribution of a categorical variable, a bar chart is used
instead of a histogram. The bar chart (see figure 5.3) displays the categories
of a variable measuring how often people say they use the Internet for
personal matters. The variable ranges from non-users to those who spend time
on the Internet daily. The measurement scale is ordinal, since the categories can
be meaningfully ordered but the intervals between them are not necessarily
equal. The first category, “no access”, differs from the other categories in
that it could be omitted from the analysis or handled separately.
For categorical variables, measures of central tendency are not usually
calculated, except for the mode (Mo). The mode is the category with the highest
frequency. In this case it is the category "Every day", which has 1012
observations in it. The bar chart could also be drawn with relative frequencies
on the vertical axis, in which case each category would have a value in
percentages instead of absolute frequencies. If the distribution is fully
symmetrical and has a single peak, all the measures of central tendency (mean,
median and mode) are situated in the middle of the distribution, where the peak
of the frequencies is.
one could be interested in finding out whether age and salary are somehow related. Do
the values of age get higher, for example, as salary increases? In the case of two
continuous variables we can draw a scatterplot (figure 5.4) and study how the
values of the two variables vary together, i.e. whether there is any recognizable
pattern. However, it is important to note that no causal claims (such as X
causes Y: X → Y) should be made on the basis of this kind of analysis.
In the scatterplot (see figure 5.4), the association seems to be relatively weak
across the different ages. Salary hovers around its median value (2505) for all
ages from 25 onwards. Some higher salaries can be seen across all ages. These
exceptional observations make up the tail that pointed to the right in the
histogram (figure 5.1). A very slight ascending pattern can be observed.
A scatterplot is the standard method for exploring the relationship between
two variables, but it works only for numerical continuous variables. For
exploring the relationship between two categorical variables we need a contingency
table or a special kind of bar chart. A contingency table is a table with absolute
or relative frequencies, and it can be used for 2–3 categorical variables. It depicts
the frequencies of one variable across all the classes of another variable. The
relative frequencies (i.e. percentages) are usually of most interest, since the
sample sizes of the classes that we are interested in comparing can vary.
Table 5.1: A contingency table of gender and the personal use of the Internet
with absolute and relative column frequencies.
An example will clarify this. Let us take “gender” as the first variable and
“personal use of the Internet” as the second variable. Now, we are interested
in finding out whether males and females (that is, the classes
of the first variable “gender”) are distributed similarly across the different
values of the personal use of the Internet. To do this we need to use relative
frequencies, as the absolute frequencies are not comparable due to the different
numbers of males and females in the sample. The relative frequencies are usually
presented so that they sum to 100% at the bottom of each column, i.e. separately
for each of the classes being compared (see table 5.1).
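The calculation of column percentages can be sketched as follows. The answers below are invented for illustration and are not the counts of table 5.1; the category names are also assumptions. Each cell is divided by the total of its own gender column, so that the columns can be compared even though the groups differ in size.

```python
from collections import Counter

# Hypothetical (gender, internet use) answers; invented for illustration.
responses = (
    [("male", "every day")] * 60 + [("male", "weekly")] * 25 + [("male", "no access")] * 15
    + [("female", "every day")] * 66 + [("female", "weekly")] * 30 + [("female", "no access")] * 24
)

counts = Counter(responses)                        # absolute cell frequencies
column_totals = Counter(g for g, _ in responses)   # number of males and females

genders = ["male", "female"]
uses = ["every day", "weekly", "no access"]

# Column percentages: each cell divided by the total of its gender column.
column_pct = {
    (g, u): 100 * counts[(g, u)] / column_totals[g] for g in genders for u in uses
}

print(f"{'':<12}" + "".join(f"{g:>16}" for g in genders))
for u in uses:
    cells = "".join(f"{counts[(g, u)]:>8} ({column_pct[(g, u)]:4.1f}%)" for g in genders)
    print(f"{u:<12}{cells}")
```

In this made-up data more females than males answer "every day" in absolute terms (66 against 60), yet the share of daily users is higher among males (60% against 55%), which is exactly why relative frequencies are needed.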
Figure 5.5: A bar chart of gender and the personal use of the Internet.
Relative frequencies can also be depicted visually with a stacked bar chart (see
figure 5.5). In this case, the bars represent the gender classes, and Internet use
is presented as relative areas within the bars, summing up to 100% for each
bar. On the basis of the table or the bar chart, we have good reason to suspect
that there are slightly more daily Internet users in the male category than in
the female category, and slightly more people with no access at all among females.
This relationship could, however, diminish or change if a third variable, such as
age group, is introduced to the table. Therefore it is often not enough
to study the relationship between two variables only; rather, several variables
should be explored together in different combinations. The results should
also be tested statistically to verify that the observed differences are
more than just random variation. Statistical testing is discussed in section 5.2.
There is a third form of visual presentation, the box-plot, that is useful when
a categorical variable needs to be studied against a numerical variable. The
box-plot has the categorical variable on the horizontal axis and the numerical
variable on the vertical axis. Each category of the categorical variable gets its
own box with two whiskers. The boxes and whiskers illustrate the distribution
of each class of the categorical variable across the values of the numerical
variable. The ends of the whiskers represent the largest and smallest values
that are not outliers, the stars and points are outliers, the box contains the
middle 50% of all the observations, and the black line within the box is the
median value. In figure 5.6 the salaries of males and females are explored by
plotting the categorical gender variable against the continuous salary variable.
This plot tells us that the median salary of males (3000 euros) is higher than
that of females and that there is more dispersion in the salaries of males (the
box including 50% of the observations is taller). We can also spot numerous
outliers (exceptional observations) at the high end of the female distribution.
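The numbers behind a box-plot can be computed directly. The sketch below uses invented salaries and assumes the common rule that values further than 1.5 times the box height (the interquartile range) from the box are drawn as outliers; the plotting software used for figure 5.6 may apply a slightly different rule.

```python
import statistics

# Hypothetical salaries (euros) by gender; invented for illustration.
salaries = {
    "male":   [2200, 2500, 2700, 2900, 3000, 3100, 3400, 3800, 4300],
    "female": [2000, 2200, 2300, 2400, 2500, 2600, 2700, 2900, 5200],
}

def box_stats(values):
    """Median, quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                                   # height of the box
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whiskers": (min(inside), max(inside)), "outliers": outliers}

for group, values in salaries.items():
    print(group, box_stats(values))
```

With these made-up values the male box is taller (a larger interquartile range) and the female group has one high outlier, mirroring the kind of reading made from the plot above.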
sample. These inferences are then generalizable within particular limits, called
confidence intervals, that need to be specified.
Figure 5.7: A plot of mean salaries for males and females with 95% confidence
intervals.
that if we drew 100 samples and built an interval from each in the same way,
about 95 of those intervals would contain the real population mean. For our
sample the interval runs from 2711 to 2853 euros, and yet we haven’t drawn more
than one sample! The 95% confidence intervals for the salaries of males and
females (these groups have different means, standard deviations, and sample
sizes, and hence different standard errors) are depicted in figure 5.7.
In figure 5.7 the T-bars (error bars) show the ranges within which the population
means for males and females are expected to fall with 95% confidence. The bars
show the observed sample means. An important observation is that the error bars
do not overlap, i.e. they do not cover the same range of values. This
observation makes us more confident in saying that the population means of males
and females are different from each other and that the difference is not due to
random variation between different samples.
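A confidence interval of this kind can be sketched in a few lines. The function below uses the normal approximation (mean plus or minus about 1.96 standard errors for 95%), which is reasonable for large samples; statistical packages typically use the t-distribution instead, which gives slightly wider intervals for small samples. The group means, standard deviations and sample sizes are hypothetical and are not the figures behind figure 5.7.

```python
import math
from statistics import NormalDist

def ci_from_summary(mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean from summary statistics."""
    se = sd / math.sqrt(n)                            # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # about 1.96 for 95%
    return mean - z * se, mean + z * se

# Hypothetical group summaries (mean, SD, n); not the figures behind figure 5.7.
groups = {"male": (3000.0, 950.0, 400), "female": (2500.0, 800.0, 400)}

intervals = {g: ci_from_summary(*summary) for g, summary in groups.items()}
for g, (low, high) in intervals.items():
    print(f"{g}: {low:.0f} to {high:.0f}")

# The intervals overlap if the upper end of the lower one reaches the other's lower end.
overlap = intervals["female"][1] >= intervals["male"][0]
print("intervals overlap:", overlap)
```

Note that the standard error shrinks with the square root of the sample size, so quadrupling the sample only halves the width of the interval.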
Figure 5.8: A sampling distribution with regions for acceptance and rejection
of the null hypothesis.
between the salaries, and tests how likely it would be to get 529.9 as the
difference given the null hypothesis. It appears that, given the sample sizes and
standard deviations, it would be nearly impossible to observe such a difference
by chance. This is indicated by the t-test value of 7.670, which is well above
the 95% critical value. According to the p-value (the "Sig. (2-tailed)" column),
less than 0.001% of all possible samples would generate such a difference if the
null hypothesis were true. The last columns of the table give the 95% confidence
interval of the difference, indicating that we can be 95% confident that the
difference in the population lies between 394.2 and 665.6. Since this interval
lies well above zero, we have very good reason to reject the null hypothesis and
hence find support for the alternative hypothesis.
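The reported figures fit together in a simple way, which the following arithmetic sketch makes visible. It uses only the numbers quoted above and assumes the usual relations for a t-test: the t-value is the observed difference divided by its standard error, and the confidence interval is the difference plus or minus the critical value times that standard error. The standard error itself is back-calculated here, not read from the output table.

```python
diff = 529.9                      # observed difference between the group means
t_value = 7.670                   # reported t statistic
ci_low, ci_high = 394.2, 665.6    # reported 95% confidence interval of the difference

se_diff = diff / t_value               # t = diff / SE, so SE = diff / t
half_width = (ci_high - ci_low) / 2    # the interval is diff +/- critical value * SE
critical = half_width / se_diff        # implied 95% critical value
midpoint = (ci_low + ci_high) / 2      # should equal the observed difference

print(f"SE of the difference: {se_diff:.2f}")
print(f"implied critical value: {critical:.3f}")
print(f"midpoint of the interval: {midpoint:.1f}")
```

The implied standard error is about 69 euros and the implied critical value about 1.96, so the observed t-value of 7.670 lies far out in the rejection region.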
It is good to note that the t-test for comparing means is just one
type of statistical test. There are many possibilities available, of which the most
important are the t-test, the F-test and the χ2 (“chi-square”) test. For example,
when analyzing a contingency table, the test used for examining the statistical
significance of the observed differences is the chi-square test. Tests may look
different and be calculated differently, but the basic idea behind statistical
testing is as presented here. The Decision Tree for Statistics is a helpful aid
for finding out which statistical test should be used given the data and the
measurement levels of the variables.
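As a sketch of how the chi-square test for a contingency table works, the function below compares the observed counts with the counts that would be expected if the two variables were unrelated. The table of counts is hypothetical, and the critical value 5.99 is the standard 5% value of the chi-square distribution with two degrees of freedom.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = Internet use (every day, weekly, no access),
# columns = gender (male, female).
table = [[60, 66], [25, 30], [15, 24]]

stat, df = chi_square(table)
print(f"chi-square = {stat:.2f}, df = {df}")
print("significant at the 5% level:", stat > 5.99)
```

For these invented counts the statistic is far below the critical value, so differences of this size could easily arise from random variation alone.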
Figure 5.10: A plot showing the regression line and equation for age and salary.
seems to be that education is the strongest one, having almost twice as big an
effect as age. The t-tests at the right end of the table show whether the
predictors are statistically significant in the model (p-values should be below
0.05).
From the other output table (table 5.4) we can also see that R² increased from
1.5% to 24.8% (the "adjusted R²" corrects for the inflation caused by increasing
the number of predictors), which is a big change in the goodness of fit of the
model. The standard error of the estimate is 791.659, and it is a measure of
error in the model. It can be interpreted as the typical deviation of the
predicted values (salaries given by the model) from the observed values (the
observed salaries in the data).
The difference between the observed and predicted values is called the residual
in regression analysis, and it can be marked with e for “error”:
$y = \beta_0 + \beta_1 x_1 + e$. This means that y is equal to the model plus the
rest of the variation of y that is not captured by the model. If we rearrange the
equation slightly, we can say that $e = y - (\beta_0 + \beta_1 x_1)$. Technically
speaking, the goal of a regression analysis is to minimize the total error. This
is usually done by a procedure called least squares estimation, which finds such
values for $\beta_0$ and $\beta_1$ that the sum of the squared differences
between y and $(\beta_0 + \beta_1 x_1)$ becomes as small as possible.
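For a model with one predictor, the least squares solution has a simple closed form, sketched below. The ages and salaries are hypothetical (they are not the data behind figure 5.10), and the function names are chosen for this illustration only.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

def sse(b0, b1, x, y):
    """Sum of squared errors, where each error is e = y - (b0 + b1 * x)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical ages and monthly salaries (illustrative only).
age = [25, 30, 35, 40, 45, 50, 55, 60]
salary = [2300, 2450, 2500, 2700, 2650, 2800, 2900, 2950]

b0, b1 = least_squares(age, salary)
print(f"salary = {b0:.1f} + {b1:.2f} * age")
print("SSE at the fitted line:", round(sse(b0, b1, age, salary), 1))
print("SSE with a slightly different slope:", round(sse(b0, b1 + 1, age, salary), 1))
```

Any other choice of intercept or slope gives a larger sum of squared errors than the fitted line, which is what "least squares" refers to. With several predictors the same principle applies, but the solution is computed with matrix algebra by the statistical software.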
Figure 5.11: A residual plot should show a homogeneous cloud around the center
line.
Finally, a researcher needs to assess how well the model actually fits the
data. This is typically done by examining the residual variance, i.e. the part of
the variation that was not explained by the model. For this purpose a residual
plot can be drawn (figure 5.11). If the residuals are approximately normally
distributed and vary in an unsystematic manner around the predicted values, we
can state that the model is quite a good predictor of the values of y.
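The quantities used in this check can be computed by hand. The sketch below takes hypothetical observed and predicted salaries, calculates the residuals and summarizes them with the standard error of the estimate, assuming the usual formula: the square root of the sum of squared residuals divided by n minus the number of predictors minus one.

```python
import math

def residual_summary(observed, predicted, n_predictors=1):
    """Residuals and the standard error of the estimate for a fitted model."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    sse = sum(e ** 2 for e in residuals)
    dof = len(observed) - n_predictors - 1      # degrees of freedom left for error
    return residuals, math.sqrt(sse / dof)

# Hypothetical observed salaries and model predictions (illustrative only).
observed = [2300, 2450, 2500, 2700, 2650, 2800, 2900, 2950]
predicted = [2340, 2430, 2520, 2610, 2700, 2790, 2880, 2970]

residuals, see = residual_summary(observed, predicted)
for p, e in zip(predicted, residuals):
    print(f"predicted {p}: residual {e:+d}")
print(f"standard error of the estimate: {see:.2f}")
```

In a residual plot these residuals would be placed on the vertical axis against the predicted values on the horizontal axis. A pattern such as residuals growing with the predicted value, or all positive at one end, would suggest that the model is missing something systematic.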