Introduction to Statistics

Dr. Smitabh Barik


Learning objectives
1. Data and its presentation
2. Scales of measurement
3. Central tendency
4. Normal distribution and measures of
dispersion
5. Sampling and sample size estimation
6. Probability
7. Difference between proportions
8. Difference between means
9. Correlation
10. Regression
Write True or False for the following statements:
1. Waist-hip ratio is an example of a simple variable.
F
2. Blood cholesterol is an example of continuous data.
T
3. Quantitative data such as blood sugar can be expressed as qualitative data (non-diabetic and diabetic).
T
4. Sector diagram is an example of quantitative data.
F
5. Epidemic curve is an example of a histogram.
T
6. The declining trend of diarrheal diseases is best represented by a line diagram.
T
7. The relationship between two variables can be represented by a stem and leaf plot.
F
8. Hospital data are basically secondary data.
F
9. A pictogram can be explained to a layman.
T
10. The general term for any unit which is measured in research is a variable.
T
Statistics
“Statistics is a way to get information from
data”
Data → Statistics → Information
Data: Facts, especially numerical facts, collected together for information.
Information: Knowledge communicated concerning some particular fact related to these data.
Definitions…
A variable is some characteristic of a
population or sample which is different for
different samples. (Quantitative
characteristics)
E.g. student grades, Student height

Attribute (Qualitative characteristics)


E.g. Gender, Religion
Types of variable
Simple and composite variable.

Qualitative and Quantitative variable

Dependent and Independent variable

Dependent----Resultant or Criterion variable

Independent----Explanatory or causal
variable
Latent variable---- can not be measured

directly. (Bright student, efficient worker)


Types of Statistics

I. Descriptive
II. Inferential
Types of data:-

1. Quantitative data
2. Qualitative data
3. Time series data (longitudinal
data)
Qualitative data
When a particular characteristic cannot be measured, but can be expressed as a frequency.
Non-numerical
Ex- gender, religion.
Enumeration data
Nominal and Ordinal
Quantitative data
Both the characteristics and frequency of a
variable can be measured.
Measurement data
Continuous and discrete data
DISCRETE/ categorical

whole number
Example: The no. of family members
The no. of heart beats
The no. of admissions in a day

CONTINUOUS/ interval/ dimensional

Example: Height, Weight, Age, BP, Serum


Cholesterol and BMI
Transformation of Qual and
Quant Data

Height (cm/feet/inch)
Weight
Hb
Blood sugar
Blood pressure
Time Series Data…
Observations measured at the same point
in time are called cross-sectional data.

Observations measured at successive


points in time are called time-series data.

Time-series data are graphed on a line chart,
which plots the value of the variable on the
vertical axis against the time periods on
the horizontal axis.
Scales of Measurements
To measure variables
4 types
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Nominal Data…
(Categorization)
• The values of nominal data are categories.
E.g. responses to questions about marital status, coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4

• Because the numbers are arbitrary, arithmetic operations
don’t make any sense (e.g. does Widowed ÷ 2 = Married?!)

• Nominal data are also called qualitative or categorical.

Q: Among the measure of central tendency, only ______ can


be applied in nominal data?
Q: _________ test is the most common test of significance
that can be utilized in this scale
Ordinal Data…( Rank
ordering)
Ordinal data appear to be categorical in nature,
but their values have an order; a ranking to them:

E.g. College course rating system:
• poor = 1, fair = 2, good = 3, very good = 4, excellent = 5

While it's still not meaningful to do arithmetic on
these data (e.g. does 2*fair = very good?!), we can
say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what
numeric values are assigned to each category.
Interval scale
Data are placed in meaningful order.
They have definite interval between them.
Interval can also be measured.
Ratios of scores are not meaningful.
Eg. Celsius scale
Difference between 100°C and 90°C is same
as the difference between 60°C and 50°C.
Interval scales do not have absolute zero.
Ratio scale (Numeric
scoring)
Same as interval scale
It has an Absolute Zero.
Most biomedical scales are in ratio scale.
Eg. weight, time, BP, Pulse

Kelvin scale
Eg.
IQ
Credit score
SES
How to represent the data ?

Tables and graphs

Data Presentation
Qual
• Bar
• Pie or sector diagram & doughnut chart
• Pictogram
• Map or spot diagram
• Venn diagram

Quant
• Histogram
• Frequency polygon & Frequency curve
• Line diagram
• Cumulative frequency polygon (ogive)
• Scatter diagram
Venn diagram
Shows degree of overlap and exclusivity for
2 or more characteristics or factors within a
sample, or population
Bar diagram
Simple
Multiple or compound
Component or proportional
Frequency polygon
Frequency Curve
Line diagram
Ogive…
Is a graph of a cumulative frequency
distribution.

We create an ogive in three steps…

1) Calculate relative frequencies.
2) Calculate cumulative relative
frequencies by adding the current
class’ relative frequency to the previous
class’ cumulative relative frequency.
(For the first class, its cumulative relative
frequency is just its relative frequency)
3) Graph the cumulative relative frequencies.
Cumulative Relative Frequencies…

first class: .355
next class: .355 + .185 = .540
:
:
last class: .930 + .070 = 1.00
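The two calculation steps can be sketched in a few lines of Python. The class frequencies below are an assumption for illustration only, since the slides show just the first, second and last cumulative values; they are chosen so that those three values come out as .355, .540 and 1.00.

```python
from itertools import accumulate

# Assumed class frequencies (illustrative only). They are chosen so that the
# first, second and last cumulative values match .355, .540 and 1.00.
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]

total = sum(frequencies)

# Step 1: relative frequency of each class.
relative = [f / total for f in frequencies]

# Step 2: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

for rel, cum in zip(relative, cumulative):
    print(f"relative = {rel:.3f}   cumulative = {cum:.3f}")
```

Plotting `cumulative` against the upper limit of each class would give the ogive itself.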
Ogive…
The ogive can be used
to answer questions
like:

What telephone bill


value is at the 50th
percentile?

“around $35”
Scatter Diagram…
• Dot diagram/ Correlation diagram/ Scatter plot
• Example :- A real estate agent wanted to
know to what extent the selling price of a
home is related to its size…

1) Collect the data
2) Determine the independent variable (X –
house size) and the dependent variable (Y
– selling price)
3) Use Excel to create a “scatter diagram”…
A linear relationship between
two variables is given by the equation

$\hat{y} = a + bX$

Y = Dependent variable
X = Independent variable.
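As a rough illustration of how the intercept a and slope b of such a line can be estimated, here is a minimal least-squares sketch. The house sizes and prices are made-up numbers, not the real estate agent's data, and the helper name `predict` is hypothetical.

```python
# Hypothetical (size, price) pairs; NOT the data from the slides.
sizes = [1000, 1500, 2000, 2500, 3000]   # X: house size (independent)
prices = [110, 155, 205, 250, 300]       # Y: selling price (dependent)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares estimates: b = Sxy / Sxx, a = mean(Y) - b * mean(X).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
s_xx = sum((x - mean_x) ** 2 for x in sizes)
b = s_xy / s_xx
a = mean_y - b * mean_x


def predict(x):
    """Return the fitted value y-hat = a + b * x."""
    return a + b * x


print(f"a = {a:.3f}, b = {b:.3f}, predicted price at 2200 = {predict(2200):.1f}")
```

A positive b corresponds to a positive linear relationship: larger houses go with higher predicted prices.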
Scatter Diagram…
It appears that in fact there is a
relationship, that is, the greater the house
size the greater the selling price…
Patterns of Scatter
Diagrams…
Linearity and Direction are two concepts we
are interested in

[Figure: three scatter diagram panels: Positive Linear Relationship, Negative Linear Relationship, Weak or Non-Linear Relationship]
Scatter diagram is the only diagram for
quantitative data in which the relationship
between two variables is shown.
Box plot / Box and
Whisker plot
Box plot is a representation of the quartiles
(25%, 50% and 75%) and the range of a
continuous and ordered data set.
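The quartiles and range behind a box plot can be computed with Python's standard `statistics` module, as in this small sketch. The data values are hypothetical, and quartile conventions differ slightly between textbooks and software, so other tools may report somewhat different cut points.

```python
import statistics

# Hypothetical observations (not from the slides).
data = [2, 4, 4, 5, 7, 9, 10]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

data_range = max(data) - min(data)   # whisker-to-whisker spread
iqr = q3 - q1                        # width of the box

print(q1, q2, q3, data_range, iqr)
```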
Stem and Leaf plot
1. The best method to show the association between
height and weight of children in a class is by

(a) Bar diagram

(b) Line diagram

(c) Scatter diagram

(d) Histogram

c
2. Two variables can be plotted together in which of
the following diagram?

(a) Pie chart

(b) Histogram

(c) Frequency polygon

(d) Scatter diagram

d
3. The age and sex structure of a population may be
represented by

(a) Life table

(b) Correlation coefficient

(c) Population pyramid

(d) Bar diagram

c
4. Between height and weight there is an

(a) Association

(b) Correlation

(c) Proportion

(d) Index

a & b
5. An analysis of the religion of populations, who reside in a
rural block, reveals that 45% are Hindu, 30% are Muslims,
15% are Christians, and 10% are Jains. These data would best
be depicted graphically by which of the following diagram?

(a) Normal curve

(b) Cumulative frequency diagram

(c) Venn diagram

(d) Histogram

(e) Pie chart


e
6. Which diagram is said to be a mixture of table and
diagram?

(a) Box and Whisker plot diagram

(b) Ogive

(c) Stem and Leaf plot diagram

(d) Proportional bar diagram


c
Contd..
Two Types of Statistics
Descriptive statistics of a POPULATION
Relevant notation (Greek):
μ mean
N population size
Σ sum

Inferential statistics of SAMPLES from
a population.
Assumptions are made that the sample
reflects the population in an unbiased
form. Roman Notation:
X̄ mean
n sample size
Σ sum
Measures of Central
Tendency
These measures tap into the average
distribution of a set of scores or values in the
data.
Mean
Median
Mode
What is “Mean”?
The “mean” of some data is the
average score or value, such as
the average age of an MBA
student or average weight of
professors that like to eat donuts.

Inferential mean of a sample: $\bar{X} = (\sum X)/n$
Mean of a population: $\mu = (\sum X)/N$
The Arithmetic Mean…

…is appropriate for describing


measurement data, e.g. heights of people,
marks of student papers, etc.

…is seriously affected by extreme values


called “outliers”. E.g. as soon as a
billionaire moves into a neighborhood, the
average household income increases
beyond what it was previously!
Measures of Central
Location…
The arithmetic mean, a.k.a. average,
shortened to mean, is the most popular &
useful measure of central location.

It is computed by simply adding up all the
observations and dividing by the total
number of observations:

Mean = Sum of the observations / Number of observations
Arithmetic Mean…

Sample Mean: $\bar{X} = \frac{\sum X}{n}$
Population Mean: $\mu = \frac{\sum X}{N}$
Statistics is a pattern
language…
          Population   Sample

Size      N            n

Mean      μ            X̄
Problem of being “mean”
The main problem associated with the
mean value of some data is that it is
sensitive to outliers.
The Median
Because the mean average can be
sensitive to extreme values, the median is
sometimes useful and more accurate.

The median is simply the middle value


among some scores of a variable. (no
standard formula for its computation)
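The billionaire example can be made concrete with a short sketch comparing how the mean and the median react to one extreme value. The income figures are invented for illustration.

```python
import statistics

# Hypothetical household incomes for one neighborhood (not from the slides).
incomes = [40_000, 50_000, 60_000, 70_000, 80_000]

# The same neighborhood after a billionaire moves in.
with_billionaire = incomes + [1_000_000_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_billionaire)
median_after = statistics.median(with_billionaire)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

The mean jumps by several orders of magnitude, while the median moves only from the third value to the midpoint of the third and fourth values.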
The Mode
The most frequent response or
value for a variable.
Equivalently, the value that occurs the most
times in a frequency distribution.
Multiple modes are possible:
bimodal or multimodal.
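A small sketch of finding the mode, including the bimodal case, using the standard library. The response values are hypothetical.

```python
import statistics

# Hypothetical responses; the values 2 and 3 each occur twice.
responses = [1, 2, 2, 3, 3, 4]

# multimode returns every value tied for the highest frequency.
modes = statistics.multimode(responses)

# With a single most frequent value, mode returns just that value.
single_mode = statistics.mode([1, 2, 2, 3])

print(modes, single_mode)
```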
Measures of Dispersion
Measures of dispersion tell us about variability
in the data. Also univariate.

Basic question: how much do values differ for a


variable from the min to max, and distance
among scores in between. We use:
Range
Mean deviation
Standard Deviation
Variance
IQR
Measures of Variability…
Measures of central location fail to tell the
whole story about the distribution; that is,
how much are the observations spread out
around the mean value?
For example, two sets of class grades
are shown. The mean (=50) is the
same in each case…

But, the red class has greater


variability than the blue class.
Range…
The range is the simplest measure of variability,
calculated as:

Range = Largest observation – Smallest


observation

E.g.
• Data: {4, 4, 4, 4, 50} Range = 46
• Data: {4, 8, 15, 24, 39, 50} Range = 46
• The range is the same in both cases,
but the data sets have very different
distributions…
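The two data sets from this example can be checked directly. The sketch below also prints the sample standard deviation of each set as one possible way to see that the spreads differ even though the ranges agree; the helper name `value_range` is hypothetical.

```python
import statistics

data_a = [4, 4, 4, 4, 50]
data_b = [4, 8, 15, 24, 39, 50]


def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


range_a = value_range(data_a)
range_b = value_range(data_b)

# Sample standard deviations, to show the sets are spread out differently.
sd_a = statistics.stdev(data_a)
sd_b = statistics.stdev(data_b)

print(range_a, range_b, round(sd_a, 2), round(sd_b, 2))
```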
Statistics is a pattern
language…
          Population   Sample

Size      N            n

Mean      μ            X̄

Variance  σ²           S²
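Variance…

The variance of a population is: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
where μ is the population mean and N is the population size.

The variance of a sample is: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
where X̄ is the sample mean.

Note! The denominator is sample size (n) minus one!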


The Standard Deviation
A standardized measure of distance from the
mean.

Very useful and something you do read about


when making predictions or other statements
about the data.
Formula for Standard
Deviation

$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

√ = square root
Σ = sum (sigma)
X = score for each point in data
X̄ = mean of scores for the variable
n = sample size (number of observations or cases)
Variance

$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
• Note that this is the same equation except that
no square root is taken.

• It is not often directly reported in research
but instead is a building block for other statistical
methods
Application…
Example :- The following sample consists of
the number of jobs six randomly selected
students applied for: 17, 15, 23, 7, 9, 13.
Find its mean and variance.

What are we looking to calculate?

The sample mean (X̄) and the sample variance (S²), as opposed to μ or σ²
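Sample Mean & Variance…

Sample Mean
$\bar{X} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$

Sample Variance
$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$

Sample Variance (shortcut method)
$S^2 = \frac{1}{n - 1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{5}\left[1342 - \frac{84^2}{6}\right] = \frac{166}{5} = 33.2$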
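As a quick numerical check of this example, the sample mean and variance of the six job counts can be computed with Python's `statistics` module, and the shortcut formula can be evaluated by hand. This is only a verification sketch; the variable names are arbitrary.

```python
import statistics

# Number of jobs applied for by the six sampled students (from the example).
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)

mean = statistics.mean(jobs)

# statistics.variance is the SAMPLE variance: it divides by n - 1.
variance = statistics.variance(jobs)

# Shortcut method: (sum of squares - (sum)^2 / n) / (n - 1).
shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

print(mean, variance, shortcut)
```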


Standard Deviation…
The standard deviation is simply the square
root of the variance, thus:

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $S = \sqrt{S^2}$


Thank you
Sampling
The Empirical Rule…
For a bell-shaped (approximately normal) distribution:
• Approximately 68% of all observations fall
within one standard deviation of the mean.

• Approximately 95% of all observations fall
within two standard deviations of the mean.

• Approximately 99.7% of all observations fall
within three standard deviations of the mean.
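These percentages can be reproduced for an exactly normal distribution with the standard library's `NormalDist`. This is a sketch under the assumption of perfect normality; real data will only follow the rule approximately. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```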
Chi-Squared Test
It is used to test the significance of the difference
between 2 proportions. “Test of significance”.
Categorical data
Ex..
A study was conducted for efficacy of 2
vaccines. The results of the study are-

Vaccine   Disease    Disease not   Total vaccinated   Disease
          occurred   occurred      population         occurrence rate
Vac-A     22         68            90                 24.4%
Vac-B     14         72            86                 16.3%
Total     36         140           176
Apparently Vac-B is superior to Vac-A.
Now question arises whether Vac-B is really
superior to Vac-A, or difference is merely
due to chance.
To solve this we have to follow the steps.

Step-1 : Null hypothesis- we assume that


there is no difference in the effect of Vac-A
and Vac-B.
Step-2 : Calculate expected values in each cell.

Expected value = (Column total) X (Row total) / Grand total

Attack rate of disease = 36 / 176 = 0.204

Non attack rate of disease = 140 / 176 = 0.795

Vac-A expected number of attacks = 90 x 0.204 = 18.36
Vac-A expected number of non attacks = 90 x 0.795 = 71.55

Vac-B expected number of attacks = 86 x 0.204 = 17.54
Vac-B expected number of non attacks = 86 x 0.795 = 68.37

Vaccine   Attacked              Non-Attacked
A         O = 22, E = 18.36     O = 68, E = 71.55
B         O = 14, E = 17.54     O = 72, E = 68.37

O = observed value
E = expected value
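Step-3 : Applying Chi-squared test

$\chi^2 = \sum \frac{(O - E)^2}{E}$
$= \frac{(22 - 18.36)^2}{18.36} + \frac{(68 - 71.55)^2}{71.55} + \frac{(14 - 17.54)^2}{17.54} + \frac{(72 - 68.37)^2}{68.37}$
$= 0.72 + 0.18 + 0.71 + 0.19$
$= 1.80$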
Step-4: finding the degree of freedom (df)
df= (C-1) (R-1)
= (2-1) (2-1)
=1

Step-5: Cut off point for type 1 error is 5% or
0.05.

On referring to the probability table, we see that in a 2x2
table, with 1 df, the value of χ² for a probability
of 0.05 is 3.84.
Since our observed value (1.80) is much lower,
we cannot reject the null hypothesis: the data do not
show that Vac-B is superior to Vac-A, and the observed
difference may be due to chance.
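The whole procedure for the vaccine table can be sketched in plain Python. This version uses unrounded expected values, for which the statistic comes out at about 1.80, and it takes the 3.84 cut-off from the probability table quoted in the text rather than computing it. No continuity correction is applied, which is an assumption matching the hand calculation.

```python
# Observed counts: rows = Vac-A, Vac-B; columns = attacked, not attacked.
observed = [[22, 68],
            [14, 72]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic = sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom = (columns - 1) * (rows - 1).
df = (len(observed[0]) - 1) * (len(observed) - 1)

# Critical value for p = 0.05 with 1 df, as quoted from the table in the text.
CRITICAL_VALUE = 3.84
reject_null = chi2 > CRITICAL_VALUE

print(f"chi2 = {chi2:.2f}, df = {df}, reject null hypothesis: {reject_null}")
```

Because the statistic is below the cut-off, the null hypothesis of no difference between the vaccines is not rejected.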
Q: A study was conducted to find out
effectiveness of ORS and Homebased
fluid, in children less than 5 years. Out of
100 under-5 taking ORS, 15 developed
dehydration. Out of 150 Under-5 taking
home based fluid, 35 developed
dehydration. Find out association of
dehydration with ORS and home based
fluid.
             Dehydration(+)   Dehydration(-)   Total   Attack rate
ORS          15               85               100     15%
Home Based   35               115              150     23.3%
Total        50               200              250
