BRM Data Analysis Techniques

Data Analysis

In most social research the data analysis involves three major steps, done in roughly this order:
• Cleaning and organizing the data for analysis (Data Preparation)
• Describing the data (Descriptive Statistics)
• Testing hypotheses and models (Inferential Statistics)
Data Preparation
• involves checking or logging the data in;
checking the data for accuracy; entering
the data into the computer; transforming
the data; and developing and documenting
a database structure that integrates the
various measures.
Descriptive Statistics
• Used to describe the basic features of the
data in a study. They provide simple
summaries about the sample and the
measures. Together with simple graphics
analysis, they form the basis of virtually
every quantitative analysis of data. With
descriptive statistics you are simply
describing what is, what the data shows.
Inferential statistics
• investigate questions, models and hypotheses.
In many cases, the conclusions from inferential
statistics extend beyond the immediate data
alone.
• For instance, we use inferential statistics to try to
infer from the sample data what the population
thinks. Or, we use inferential statistics to make
judgments of the probability that an observed
difference between groups is a dependable one
or one that might have happened by chance in
this study.
Types of Statistical Analysis
• Univariate Statistical Analysis
– Tests of hypotheses involving only one
variable.
– Testing of statistical significance
• Bivariate Statistical Analysis
– Tests of hypotheses involving two variables.
• Multivariate Statistical Analysis
– Statistical analysis involving three or more
variables or sets of variables.

Statistical Analysis: Key Terms
• Hypothesis
  – Unproven proposition: a supposition that tentatively explains certain facts or phenomena.
  – An assumption about the nature of the world.
• Null Hypothesis
  – No difference between the sample and the population.
• Alternative Hypothesis
  – Statement that indicates the opposite of the null hypothesis.
Choosing the Appropriate Statistical Technique

• Choosing the correct statistical technique requires considering:
  – Type of question to be answered
  – Number of variables involved
  – Level of scale measurement
Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:
– the distribution
– the central tendency
– the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in our study.
The Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value.
Distributions may also be displayed using
percentages. For example, you could use
percentages to describe the:
• percentage of people in different income
levels
• percentage of people in different age ranges
• percentage of people in different ranges of
standardized test scores
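
As an illustration of a percentage distribution, here is a minimal Python sketch that tabulates counts and percentages with the standard library (the income-level labels are hypothetical, not from the slides):

```python
# Minimal sketch: frequency distribution with counts and percentages.
# The income-level labels below are hypothetical illustration data.
from collections import Counter

incomes = ["low", "middle", "middle", "high", "low", "middle", "low", "middle"]

counts = Counter(incomes)
n = len(incomes)
for value, count in counts.most_common():
    print(f"{value:>8}: {count} ({100 * count / n:.1f}%)")
```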
Central Tendency

The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency:
• Mean
• Median
• Mode
For example, take these 8 scores: 15, 20, 21, 20, 36, 15, 25, 15.
The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

If we order the 8 scores shown above, we get:
15, 15, 15, 20, 20, 21, 25, 36
There are 8 scores, and scores #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate to determine the median.
To determine the mode, you might again
order the scores as shown above, and
then count each one. The most frequently
occurring value is the mode. In our
example, the value 15 occurs three times
and is the mode.
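
A quick way to verify all three estimates is Python's standard library; this sketch reproduces the mean, median, and mode for the example scores:

```python
# Verify the worked example: mean, median, and mode of the 8 scores.
from statistics import mean, median, mode

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(mean(scores))    # 20.875
print(median(scores))  # 20.0 (both middle values of the sorted list are 20)
print(mode(scores))    # 15 (occurs three times)
```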
Dispersion

Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion: the range and the standard deviation.

The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.
The Standard Deviation

The standard deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The standard deviation shows the relation that the set of scores has to the mean of the sample.
Deviations of each score from the mean (20.875):

15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = +15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

N               8
Mean            20.8750
Median          20.0000
Mode            15.00
Std. Deviation  7.0799
Variance        50.1250
Range           21.00
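
These figures can be reproduced with the standard library; note that statistics.variance and statistics.stdev use the sample (N - 1) formulas, which is what the summary above reports:

```python
# Reproduce the dispersion measures for the example scores.
from statistics import stdev, variance

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(max(scores) - min(scores))  # range: 36 - 15 = 21
print(variance(scores))           # 50.125 (sample variance)
print(round(stdev(scores), 4))    # 7.0799 (sample standard deviation)
```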
The Basic Task of Most Research = Bivariate Analysis
A. What does that involve?
   – Analyzing the interrelationship of 2 variables
   – Null hypothesis = independence (unrelatedness)
B. Two analytical perspectives:
   1) Analysis of differences:
      • Select independent variable and dependent variable
      • Compare dependent variable across values of the independent variable
   2) Analysis of associations:
      • Covariation or correspondence of variables
      • Predictability of one variable from the other
      • Agreement between two variables
“Bivariate Analysis”
C. Analytical situations:
1) If both variables = categorical (either nominal or ordinal)?
   • Use cross-tabulations (contingency tables) to show the relationship
2) If one variable (dependent) = categorical and the other variable (independent) = continuous/numerical?
   • Use t-tests or ANOVA to test the relationship
3) What if both variables = numerical?
   • Then cross-tabs are no longer manageable and interpretable
   • t-tests and ANOVA don't really apply
   • ???
“Bivariate Analysis”
C. Analytical situations (continued):
3) If both variables = numerical/continuous?
   • We can graph their relationship → scatter plot
   • We need a statistical measure to index the inter-relationship between 2 numeric variables
   • This measure of the inter-relation of two numeric variables is called their “correlation”
“Bivariate Analysis”
D. Footnote: relevant questions about the relationship between variables
• Does a relationship exist, or are the variables independent? (significance test)
• What is the form of the inter-relationship?
  – Linear or non-linear (for numerical variables)
  – Monotonic or non-monotonic (for ordinal variables)
  – Positive or negative (for ordered variables)
• What is the strength of the relationship? (coefficient of association)
• What is the meaning (or explanation) of the correlation? (not a statistical question)
I. Correlation
A. A quantitative measure of the degree of association between 2 numeric variables
B. The analytical model? Several alternative views:
   • Predictability
   • Covariance (the model mostly emphasized here)
I. Correlation
B. The analytical model for correlations:
   • Key concept = covariance of two variables
   • This reflects how strongly or consistently two variables vary together in a predictable way
     – Whether they are exactly or just somewhat predictable
   • It presumes that the relationship between them is “linear”
     – Covariance reflects how closely points of the bivariate distribution (of scores on X and corresponding scores on Y) are bunched around a straight line
Formula for Covariance?

Cov(X,Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / N

Note the similarity with the formula for the variance of a single variable:

Var(X) = Σ(Xᵢ − X̄)(Xᵢ − X̄) / N
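
A minimal sketch of the formula, written out directly rather than through a library, makes the parallel explicit: Var(X) is just Cov(X, X).

```python
# Population covariance (dividing by N, as in the formula above).
def covariance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

x = [15, 20, 21, 20, 36, 15, 25, 15]
print(covariance(x, x))  # Cov(X, X) = Var(X): the population variance of x
```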
Correlation (continued)
Scatter Plot #1 (of moderate correlation):
Correlation (continued)
Scatter Plot #2 (of negative correlation):
Correlation (continued)
Scatter Plot #3 (of high correlation)
Correlation (continued)
Scatter Plot #4 (of very low correlation)
Correlation (continued)
C. How to compute a correlation coefficient?
• By hand:
  – Definitional formula (the familiar one): r = Cov(X,Y) / (Sx · Sy)
  – Computational formula (different but equivalent)
• By SPSS: Analyze → Correlate → Bivariate
Correlation Coefficient (r): Definitional Formula

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]

Correlation Coefficient (r): Computational Formula

r = (ΣXY − N·X̄·Ȳ) / √[ (ΣX² − N·X̄²) · (ΣY² − N·Ȳ²) ]
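
Both formulas give identical results; the following sketch (with small made-up data) implements each one directly so the equivalence can be checked:

```python
# Definitional vs. computational formula for r; both return the same value.
from math import sqrt

def r_definitional(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) *
               sum((yi - my) ** 2 for yi in y))
    return num / den

def r_computational(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my
    den = sqrt((sum(xi ** 2 for xi in x) - n * mx ** 2) *
               (sum(yi ** 2 for yi in y) - n * my ** 2))
    return num / den

x, y = [1, 2, 3, 4], [2, 4, 5, 4]
print(r_definitional(x, y), r_computational(x, y))  # identical (about 0.718)
```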
Correlation (continued)
D. How to test a correlation for significance?
   1. Test the null hypothesis that r = 0
   2. Use a t-test:

      t = r·√(N − 2) / √(1 − r²),   df = N − 2
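
As a worked instance of this formula (using r = .73 and N = 20, the values from the height/self-esteem example later in these slides):

```python
# t statistic for testing r = 0, per the formula above.
from math import sqrt

def t_for_correlation(r, n):
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return t, n - 2  # (t statistic, degrees of freedom)

t, df = t_for_correlation(0.73, 20)
print(round(t, 2), df)  # about 4.53 with 18 df
```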
Correlation (continued)
E. What are the assumptions/requirements of correlation?
1. Numeric variables (interval or ratio level)
2. Linear relationship between variables
3. Random sampling (for significance test)
4. Normal distribution of data (for significance test)
F. What to do if the assumptions do not hold:
1. May be able to transform variables
2. May use ranks instead of scores, as in the sketch below
   – Pearson Correlation Coefficient (scores)
   – Spearman Correlation Coefficient (ranks)
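
The "ranks instead of scores" point can be made concrete: the Spearman coefficient is simply the Pearson coefficient computed on the ranks of the data. A minimal scipy sketch with illustrative data:

```python
# Spearman rho equals Pearson r applied to the ranks of the data.
from scipy.stats import pearsonr, rankdata, spearmanr

x = [58, 60, 61, 62, 67, 71, 75, 68]
y = [3.2, 3.1, 3.6, 3.3, 3.8, 4.6, 4.4, 4.1]

print(spearmanr(x, y)[0])                     # Spearman rho
print(pearsonr(rankdata(x), rankdata(y))[0])  # same value
```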
Correlation (continued)
G. How to interpret correlations
1. Sign of coefficient?
2. Magnitude of coefficient (-1 ≤ r ≤ +1)
Usual scale (slightly different from textbook):
   +1.00 → perfect correlation
   +.75 → strong correlation
   +.50 → moderately strong correlation
   +.25 → moderate correlation
   +.10 → weak correlation
    .00 → no correlation (unrelated)
   -.10 → weak negative correlation
   (and so on for negative correlations)
Correlation (continued)
G. How to interpret correlations (continued)
• NOTE: A zero correlation may indicate that the relationship is nonlinear (rather than that there is no association between the variables)
H. Important to check the shape of the distribution: linearity; lopsidedness; weird “outliers”
– Scatterplots = usual method
– Line graphs (if a scatter plot is hard to read)
– May need to transform or edit the data:
  • Transforms to make a variable more “linear”
  • Exclusion or recoding of “outliers”
Correlation (continued)
– Scatterplots vs. Line graphs (example)
Correlation (continued)
I. How to report correlational results?
1. Single correlations (r and significance - in text)
2. Multiple correlations (matrix of coefficients in a separate table)
– Note the triangular-mirrored nature of the matrix

                             crc319   crc383   dth177   pvs500   pfh493
crc319: Violent Crime rate    -----     .614    -.048     .268     .034
crc383: Property Crime rate    .614    -----     .265     .224     .042
dth177: Suicide rate          -.048     .265    -----     .178     .304
pvs500: Poverty rate           .268     .224     .178    -----    -.191
pfh493: Alcohol Consumption    .034     .042     .304    -.191    -----
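
Such a matrix is typically produced in a single call; a minimal pandas sketch (with hypothetical data, not the values shown above):

```python
# Correlation matrix: note the mirrored (symmetric) structure.
import pandas as pd

df = pd.DataFrame({
    "violent_crime":  [5.2, 6.1, 4.8, 7.0, 5.9],
    "property_crime": [31.0, 35.2, 28.4, 38.1, 33.0],
    "poverty_rate":   [12.1, 14.0, 10.5, 15.2, 13.3],
})
print(df.corr().round(3))  # Pearson correlations by default
```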
Bivariate Analysis

The correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables.

Let's assume that we want to look at the relationship between two variables: height (in inches) and self esteem.
Height is measured in inches. Self esteem is measured as the average of 10 one-to-five rating items (where higher scores mean higher self esteem).

Person  Height  Self Esteem
1       68      4.1
2       71      4.6
3       62      3.8
4       75      4.4
5       58      3.2
6       60      3.1
7       67      3.8
8       68      4.1
9       71      4.3
10      69      3.7
11      68      3.5
12      67      3.2
13      63      3.7
14      62      3.3
15      60      3.4
16      63      4.0
17      65      4.1
18      67      3.8
19      63      3.4
20      61      3.6

Variable     Mean   StDev   Variance  Sum   Minimum  Maximum  Range
Height       65.4   4.4057  19.4105   1308  58       75       17
Self Esteem  3.755  0.4261  0.18155   75.1  3.1      4.6      1.5
Calculating the Correlation

The correlation for our twenty cases is .73, which is a fairly strong positive relationship.
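
This .73 can be reproduced from the table of twenty cases; a minimal sketch using statistics.correlation (available in Python 3.10+):

```python
# Reproduce the height/self-esteem correlation from the data table.
from statistics import correlation  # Python 3.10+

height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
          3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

print(round(correlation(height, esteem), 2))  # 0.73
```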
Testing the Significance of a Correlation

Once you've computed a correlation, you can determine the probability that the observed correlation occurred by chance. That is, you can conduct a significance test. Most often you are interested in determining the probability that the correlation is a real one and not a chance occurrence. In this case, you are testing the mutually exclusive hypotheses:

Null Hypothesis: r = 0
Alternative Hypothesis: r ≠ 0
First, you need to determine the significance level. Here, we use the common significance level of alpha = .05.

The df is simply N - 2 or, in this example, 20 - 2 = 18.

Finally, decide whether you are doing a one-tailed or two-tailed test. In this example, since there is no strong prior theory to suggest whether the relationship between height and self esteem would be positive or negative, we opt for the two-tailed test.

With these three pieces of information


-- the significance level (alpha = .05)), degrees of
freedom (df = 18), and type of test (two-tailed)
the critical value is .4438. This means that
if our correlation is greater than .4438 or less
than -.4438 (remember, this is a two-tailed test),
we can conclude that the odds are less than 5
out of 100 that this is a chance occurrence.
Since the correlation of .73 (higher), we conclude
that it is not a chance finding and that the
correlation is "statistically significant".

The null hypothesis is rejected and the


alternative is accepted
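
Instead of looking up the critical value in a table, the same decision can be reached by computing the p-value directly; a minimal scipy sketch:

```python
# Two-tailed significance test for r = .73 with N = 20.
from math import sqrt
from scipy import stats

r, n = 0.73, 20
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)  # about 4.53
p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-tailed p-value
print(p < 0.05)  # True: reject the null hypothesis
```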
Pearson Product-Moment Correlation Matrix for Salesperson
[slide figure: correlation matrix not reproduced]
Other Correlations

The specific type of correlation illustrated here is known as the Pearson Product-Moment Correlation. It is appropriate when both variables are measured at an interval level. However, there are a wide variety of other types of correlations for other circumstances. For instance, if you have two ordinal variables, you could use the Spearman Rank Order Correlation (rho) or the Kendall Rank Order Correlation (tau). When one measure is a continuous interval-level one and the other is dichotomous (i.e., two-category), you can use the Point-Biserial Correlation. For other situations, consult the web-based statistics selection program, Selecting Statistics, at http://trochim.human.cornell.edu/selstat/ssstart.htm.
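
scipy implements each of the alternatives named above; a minimal sketch with made-up data:

```python
# Rank-based and point-biserial correlations for non-interval data.
from scipy.stats import kendalltau, pointbiserialr, spearmanr

ranks_a = [1, 2, 3, 4, 5, 6]            # two ordinal (ranked) variables
ranks_b = [2, 1, 4, 3, 6, 5]
print(spearmanr(ranks_a, ranks_b)[0])   # Spearman rho
print(kendalltau(ranks_a, ranks_b)[0])  # Kendall tau

group = [0, 0, 0, 1, 1, 1]              # dichotomous variable
score = [3.1, 2.8, 3.4, 4.2, 4.0, 4.5]  # continuous variable
print(pointbiserialr(group, score)[0])  # point-biserial correlation
```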
Regression Analysis
• Simple (Bivariate) Linear Regression
  – A measure of linear association that investigates straight-line relationships between a continuous dependent variable and an independent variable that is usually continuous, but can be a categorical dummy variable.
• The Regression Equation (Y = α + βX), illustrated in the sketch below
  – Y = the continuous dependent variable
  – X = the independent variable
  – α = the Y intercept (where the regression line intercepts the Y axis)
  – β = the slope coefficient (rise over run)

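A minimal sketch of estimating α and β by ordinary least squares, on toy data, using the closed-form estimates β = Cov(X, Y) / Var(X) and α = Ȳ − βX̄:

```python
# Fit Y = a + b*X by least squares using the closed-form estimates.
def fit_simple_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = fit_simple_regression(x, y)
print(a, b)  # intercept 0.15, slope 1.95
```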
The Regression Equation
• Parameter Estimate Choices
  – β indicates the strength and direction of the relationship between the independent and dependent variables.
  – α (the Y intercept) is a fixed point that is considered a constant (how much of Y exists without any X).
• Standardized Regression Coefficient (β)
  – Estimated coefficient of the strength of the relationship between the independent and dependent variables.
  – Expressed on a standardized scale where higher absolute values indicate stronger relationships (range is from -1 to +1); see the sketch below.
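
To make the standardized coefficient concrete: in simple regression, rescaling the raw slope by the ratio of standard deviations, β·(Sx/Sy), gives exactly the correlation r. A minimal sketch continuing the toy data above:

```python
# Standardized beta: b * (Sx / Sy), which equals r in bivariate regression.
from statistics import correlation, stdev

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b = 1.95  # raw slope from the least-squares sketch above

print(round(b * stdev(x) / stdev(y), 3))  # standardized beta
print(round(correlation(x, y), 3))        # the same value: r
```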

Simple Regression Results Example
[slide figure: regression output table not reproduced]
