Statistics
Introduction to Statistics: -
Statistics Definition: - Statistics is the science of collecting, organizing, and
analyzing data.
E.g.: - IQ of students in a classroom
Types of Statistics: -
1. Descriptive Statistics
2. Inferential Statistics
Types of Probability Distributions: -
i. Bernoulli Distribution
ii. Uniform Distribution
iii. Binomial Distribution
iv. Normal or Gaussian Distribution
v. Exponential Distribution
vi. Poisson Distribution
a. Descriptive Questions: -
• What is the average height of players in the camp?
• What is the spread (dispersion) of the data?
• How many standard deviations is 140 cm away from the mean?
b. Inferential Question: -
• Is the average height of players in camp 1 similar to that of camp 2?
• Nominal: - E.g.: - Gender, Blood Group, Colors, location, cities, days
• Ordinal: - E.g.: - Customer feedback {1, 2, 3, 4, 5}
• Discrete: - E.g.: - No. of children in a family, No. of bikes, No. of people working
• Continuous: - E.g.: - House price in Bengaluru, Length of river
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale
I. Nominal Scale data: - A nominal scale is the 1st level of measurement scale, in
which the numbers serve as “tags” or “labels” to classify or identify the objects. A
nominal scale usually deals with non-numeric variables or with numbers that do
not carry any numeric value
• Qualitative/ Categorical Data
• E.g.: - Gender, color, Labels
• Order or rank does not matter
II. Ordinal Scale data: - An ordinal scale is the 2nd level of measurement scale;
values can be ranked or ordered, but the differences between them are not meaningful.
• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table
III. Interval Scale data: - An interval scale is the 3rd level of measurement scale;
differences between values are meaningful, but there is no true zero point.
• E.g.: - IQ
IV. Ratio Scale Data: - The ratio scale is the 4th level of measurement
scale, which is quantitative. It is a type of variable measurement scale.
It allows researchers to compare the differences or intervals. The ratio
scale has a unique feature. It possesses the character of the origin or
zero points.
❖ Descriptive Statistics
➢ Mean: - The mean represents the average value of the dataset. It can be
calculated as the sum of all the values in the dataset divided by the number
of values.
➢ Median: - The median is the middle value of a dataset once it has been ordered.
Now, consider an example with an even number of observations arranged in
descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19,
and 17
When you look at the given dataset, the two middle values obtained are 27
and 29. Now, find out the mean value for these two numbers.
i.e., (27+29)/2 =28
Therefore, the median for the given data distribution is 28.
➢ Mode: - The mode represents the most frequently occurring value in the
dataset. Sometimes a dataset may contain multiple modes and, in
some cases, it may not contain any mode at all.
Since the mode represents the most common value, the most
frequently repeated value in the given dataset is 5.
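As a minimal sketch, Python's standard `statistics` module computes all three measures; it reuses the document's 14-value median dataset, plus a made-up list for the mode:

```python
import statistics

data = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

print(statistics.mean(data))    # sum of all values / number of values
print(statistics.median(data))  # average of the two middle values (27 and 29): 28.0

votes = [5, 2, 5, 3, 5, 1]      # hypothetical dataset in which 5 repeats most often
print(statistics.mode(votes))   # 5
```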
➢ Sets: -
A= {1,2,3,4,5,6,7,8}
B= {3,4,5,6,7}
I. Intersection: -
A ∩ B = {3,4,5,6,7}
II. Union: -
A ∪ B = {1,2,3,4,5,6,7,8}
III. Difference: -
A-B= {1,2,8}
IV. Subset: -
A ⊆ B = False
B ⊆ A = True
V. Superset: -
A ⊇ B = True
B ⊇ A = False
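The same set operations can be checked directly with Python's built-in set type, using the document's sets A and B:

```python
A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {3, 4, 5, 6, 7}

print(A & B)   # intersection: {3, 4, 5, 6, 7}
print(A | B)   # union: {1, 2, 3, 4, 5, 6, 7, 8}
print(A - B)   # difference: {1, 2, 8}
print(A <= B)  # is A a subset of B?   False
print(B <= A)  # is B a subset of A?   True
print(A >= B)  # is A a superset of B? True
```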
Histogram: -
Ages= {10,12,14,18,24,30,35,36,37,40,41,42,43,50,51}
Bins and bin size: with a bin size of 5 and values up to about 50,
No. of bins = 50/5 = 10
Bin size = 5
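A quick sketch of how a histogram groups the ages above into bins of size 5, using only the standard library (the text-based bar output is just for illustration):

```python
ages = [10, 12, 14, 18, 24, 30, 35, 36, 37, 40, 41, 42, 43, 50, 51]
bin_size = 5

# Count how many ages fall into each bin: [10, 15), [15, 20), ...
counts = {}
for age in ages:
    low = (age // bin_size) * bin_size  # lower edge of the bin this age falls in
    counts[low] = counts.get(low, 0) + 1

for low in sorted(counts):
    print(f"{low}-{low + bin_size}: {'*' * counts[low]}")
```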
Skewness: -
A. Left (negatively) skewed: - the tail extends to the left; mean < median
B. Right (positively) skewed: - the tail extends to the right; mean > median
❖ sampling Techniques: -
A. Simple random sampling:-
Example: Simple random sampling:- You want to select a simple random sample of
100 employees of a social media marketing company. You assign a number to every
employee in the company database from 1 to 1000, and use a random number
generator to select 100 numbers.
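That example can be sketched in a few lines with Python's `random.sample`, which draws without replacement (the seed and the 1–1000 IDs are illustrative):

```python
import random

population = list(range(1, 1001))  # hypothetical employee IDs 1..1000

random.seed(42)                          # fixed seed, for reproducibility only
sample = random.sample(population, 100)  # 100 IDs drawn without replacement

print(len(sample), len(set(sample)))  # 100 100 -> no duplicates
```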
B. Stratified sampling:-
Stratified sampling involves dividing the population into subpopulations that may
differ in important ways. It allows you to draw more precise conclusions by ensuring
that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata)
based on the relevant characteristic (e.g., gender identity, age range, income bracket,
job role).
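A rough sketch of stratified sampling, assuming a made-up population of 100 people tagged with one of two strata ("A" and "B"); roughly 20% is drawn from each stratum:

```python
import random

# Hypothetical population: 100 people, each tagged with a stratum label
population = [(i, "A" if i % 3 else "B") for i in range(1, 101)]

# Group individuals by the relevant characteristic (the strata)
strata = {}
for person, group in population:
    strata.setdefault(group, []).append(person)

random.seed(0)
# Draw ~20% from every stratum so each subgroup is represented
sample = []
for group, members in strata.items():
    sample.extend(random.sample(members, len(members) // 5))

print(len(sample))  # 19 -> 13 from stratum "A" (67 people) + 6 from "B" (33 people)
```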
C. Systematic sampling:-
Systematic sampling is similar to simple random sampling, but it is usually slightly
easier to conduct. Every member of the population is listed with a number, but instead
of randomly generating numbers, individuals are chosen at regular intervals.
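The regular-interval idea can be sketched as follows: list the population, compute the interval k = population size / sample size, pick a random start, and take every k-th member (the population of 100 is hypothetical):

```python
import random

population = list(range(1, 101))    # hypothetical numbered list of 100 people
sample_size = 10
k = len(population) // sample_size  # sampling interval: every 10th person

random.seed(1)
start = random.randrange(k)         # random starting point between 0 and k-1
sample = population[start::k]       # individuals chosen at regular intervals

print(sample)
```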
D. Convenience sampling:-
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher. This is an easy and inexpensive way to gather initial
data, but there is no way to tell if the sample is representative of the population, so it
can’t produce generalizable results. Convenience samples are at risk for both sampling
bias and selection bias.
E. Purposive sampling:-
Purposive sampling involves the researcher using their judgement to select a sample
that is most useful to the purposes of the research.
It is often used in qualitative research, where the researcher wants to gain detailed
knowledge about a specific phenomenon rather than make statistical inferences, or
where the population is very small and specific. An effective purposive sample must
have clear criteria and rationale for inclusion. Always make sure to describe
your inclusion and exclusion criteria and beware of observer bias affecting your
arguments.
Example: Purposive sampling:- You want to know more about the opinions and
experiences of disabled students at your university, so you purposefully select a
number of students with different support needs in order to gather a varied range of
data on their experiences with student services.
F. Cluster sampling:-
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of sampling
individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled
cluster. If the clusters themselves are large, you can also sample individuals from
within each cluster using one of the techniques above. This is called multistage
sampling.
This method is good for dealing with large and dispersed populations, but there is
more risk of error in the sample, as there could be substantial differences between
clusters. It’s difficult to guarantee that the sampled clusters are really representative
of the whole population.
Example: Cluster sampling: - The company has offices in 10 cities across the country
(all with roughly the same number of employees in similar roles). You don’t have the
capacity to travel to every office to collect your data, so you use random sampling to
select 3 offices – these are your clusters.
❖ Covariance and Correlation: -
• Covariance is a statistical term that refers to a systematic relationship
between two random variables, in which a change in one variable is
reflected by a change in the other.
• The covariance value can range from -∞ to +∞, with a negative value
indicating a negative relationship and a positive value indicating a
positive relationship.
• Positive covariance denotes a direct relationship and is represented by a
positive number.
• A negative number, on the other hand, denotes negative covariance,
which indicates an inverse relationship between the two variables.
Covariance is great for defining the direction of a relationship, but it's
terrible for interpreting the magnitude, because it is not scale-free.
A correlation coefficient between 0 and 1 indicates a positive correlation: when one
variable changes, the other variable changes in the same direction. E.g., baby length
& weight: the longer the baby, the heavier their weight.
Correlation (Pearson): ρ(X, Y) = cov(X, Y) / (σx · σy)
where
• cov is the covariance
• σx is the standard deviation of X
• σy is the standard deviation of Y
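A minimal sketch of both quantities computed from their definitions, using made-up baby length/weight pairs (population formulas, dividing by n):

```python
import math

# Hypothetical paired data: baby length (cm) and weight (kg)
x = [48, 50, 52, 54, 56]
y = [2.9, 3.2, 3.6, 3.9, 4.3]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Covariance: average product of deviations from the two means
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Correlation = cov / (sigma_x * sigma_y), always bounded in [-1, 1]
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
corr = cov / (sd_x * sd_y)

print(cov > 0)          # True: longer babies tend to be heavier
print(-1 <= corr <= 1)  # True: correlation, unlike covariance, is bounded
```

Note how the covariance (here 1.4) has units of cm·kg and depends on scale, while the correlation is a unitless number near 1.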
Let us say we are running an experiment of tossing a fair coin. The possible events
are Heads and Tails. If we use X to denote the outcome, the probability
distribution of X takes the value 0.5 for X = heads and 0.5 for X = tails.
➢ Discrete data is counted and can take only a limited number of values. It
makes no sense when written in decimal format. And the random variable
that holds discrete data is called the Discrete random variable.
1. Discrete Distributions
2. Continuous Distributions
1. Bernoulli Distribution: -
• It models a single trial with exactly two outcomes: success with
probability p and failure with probability q = 1 − p {PMF}
E.g.: - Tossing a fair coin once: Pr(H) = p = 0.5, Pr(T) = 1 − p = q = 0.5
2. Binomial Distribution: -
• It is concerned with discrete random variables {PMF}
• There are two possible outcomes: true or false, success or failure, yes
or no.
• The experiment is performed for n trials.
• Every trial is an independent trial, which means the outcome of one
trial does not affect the outcome of another trial.
E.g.: -
Tossing a Coin 10 times
PMF: P(X = x) = nCx · p^x · q^(n−x), where nCx = n!/(x!(n−x)!)
Where,
n = the number of experiments
x = 0, 1, 2, 3, 4, …
p = Probability of Success in a single experiment
q = Probability of Failure in a single experiment = 1 – p
Mean, μ = np
Variance, σ² = npq
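The PMF, mean, and variance formulas above can be sketched directly with `math.comb` for nCx, using the coin-tossing example (n = 10, p = 0.5):

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x): probability of x successes in n trials."""
    q = 1 - p
    return math.comb(n, x) * p**x * q**(n - x)

n, p = 10, 0.5                           # tossing a fair coin 10 times
print(round(binomial_pmf(5, n, p), 4))   # P(exactly 5 heads) = 252/1024 ≈ 0.2461

mean = n * p                             # μ = np
variance = n * p * (1 - p)               # σ² = npq
print(mean, variance)                    # 5.0 2.5
```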
3. Poisson Distribution: -
• It is concerned with discrete random variables {PMF}
• Describes the number of events occurring in a fixed time interval,
given a known average rate λ of events per interval
• PMF: P(X = k) = e^(−λ) · λ^k / k!
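A small sketch of the Poisson PMF from its formula; the rate λ = 3 (say, an average of 3 calls per minute) is a made-up example:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = e^(-λ) * λ^k / k!: probability of k events in one interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # hypothetical average: 3 events per interval
print(round(poisson_pmf(0, lam), 4))  # chance of zero events in an interval
print(round(poisson_pmf(3, lam), 4))  # chance of exactly the average count
# For the Poisson distribution, mean = variance = λ
```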
4. Normal or Gaussian Distribution: -
• It is concerned with continuous random variables {PDF}
• Normal distributions are symmetrical, but not all symmetrical
distributions are normal
Characteristics of Normal Distribution
• mean = median = mode
• Symmetrical about the center
• Unimodal
• 50% of values less than the mean and 50% greater than the mean
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Here, x is the value of the
variable; f(x) represents the
probability density function; μ
(mu) is the mean; and σ (sigma) is
the standard deviation.
E.g.: - Blood pressure
It is always good to know the standard deviation because we can say that
any value is:
• likely to be within 1 standard deviation (1σ)(68.3 out of 100 should be)
• very likely to be within 2 standard deviations (2σ) (95.5 out of 100
should be)
• almost certainly within 3 standard deviations (3σ) (997 out of 1000
should be)
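As a sketch, the PDF formula and the 1σ/2σ/3σ rule above can be checked with the standard library; `within_k_sigma` uses the error function `math.erf`, which gives the exact normal probability of landing within k standard deviations of the mean:

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (σ√(2π))) * exp(-(x-μ)² / (2σ²))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def within_k_sigma(k):
    """P(μ - kσ < X < μ + kσ) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

print(round(within_k_sigma(1) * 100, 1))  # ≈ 68.3
print(round(within_k_sigma(2) * 100, 1))  # ≈ 95.4 (often quoted as 95.5)
print(round(within_k_sigma(3) * 100, 1))  # ≈ 99.7
```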
5. Uniform Distribution: -
I. Continuous Uniform Distribution (PDF)
II. Discrete Uniform Distribution (PMF)
Standard Normal Distribution: -
• Symmetrical
• Bell-shaped
• Mean and median are equal; both located at the center of the
distribution
The mean of the normal distribution determines its location and the standard
deviation determines its spread.
A standard normal distribution has the following properties:
• About 68% of data falls within one standard deviation of the mean
• About 95% of data falls within two standard deviations of the mean
• About 99.7% of data falls within three standard deviations of the
mean
• What is a “Z-score”?
The number of standard deviations from the mean is also called the
“Standard Score”, “sigma” or “Z-score”. Simply, a Z-score describes the
position of a raw score in terms of its distance from the mean, when
measured in standard deviation units.
z = (x – μ) / σ
We can take any Normal Distribution and convert it to The Standard Normal
Distribution.
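A two-line sketch of the z-score formula, applied to the earlier 140 cm question; the mean of 150 cm and standard deviation of 5 cm are hypothetical:

```python
# z = (x - μ) / σ : converts a value on any normal distribution
# to the standard normal scale (mean 0, standard deviation 1)
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Hypothetical camper heights: mean 150 cm, std 5 cm
print(z_score(140, 150, 5))  # -2.0 -> 140 cm is 2 standard deviations below the mean
print(z_score(160, 150, 5))  #  2.0 -> 2 standard deviations above the mean
```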
Normalization vs Standardization: -
• Normalization scales values between [0, 1] or [-1, 1]; standardization is not
bounded to a certain range.
• Normalization squishes the n-dimensional data into an n-dimensional unit
hypercube; standardization translates the mean vector of the original data to
the origin and squishes or expands the data.
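The contrast between the two transformations can be sketched on a small made-up dataset: min-max normalization lands in [0, 1], while standardization centers the data at 0 with standard deviation 1 but has no fixed bounds:

```python
import statistics

data = [10, 20, 30, 40, 50]  # hypothetical raw values

# Min-max normalization: bounded to [0, 1]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

# Standardization (z-scores): mean 0, std 1, but not bounded to a range
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
standardized = [(x - mu) / sigma for x in data]

print(normalized)                    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(round(sum(standardized), 10))  # 0.0 -> the data is centered at the origin
```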
Assumptions of the Central Limit Theorem: -
1. The sample size is sufficiently large. This condition is usually met if the
size of the sample is n ≥ 30.
2. The samples are independent and identically distributed, i.e., random
variables. The sampling should be random.
3. The population’s distribution has a finite variance. The central limit
theorem doesn’t apply to distributions with infinite variance.
1. What is the Central Limit Theorem in Statistics?
The Central Limit Theorem states that whenever we take sufficiently large
samples from a population, the distribution of the sample mean
approximates a normal distribution, regardless of the shape of the
population’s distribution.
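The theorem can be illustrated with a small simulation: repeatedly draw samples of n = 30 from a uniform (clearly non-normal) population and look at the distribution of the sample means (the seed and repetition counts are arbitrary):

```python
import random
import statistics

random.seed(7)

# Population: uniform on [0, 1] -- not a normal distribution
def sample_mean(n):
    return statistics.mean(random.random() for _ in range(n))

# Distribution of the sample mean for n = 30, repeated 2000 times
means = [sample_mean(30) for _ in range(2000)]

# The sample means cluster near the population mean (0.5) ...
print(round(statistics.mean(means), 2))
# ... with a roughly bell-shaped, much narrower spread (≈ σ/√n)
print(statistics.pstdev(means) < 0.1)  # True
```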
- Null Hypothesis (H0):- The Null Hypothesis (H0) aims to nullify the
alternative hypothesis by implying that there exists no relation between
two variables in statistics. It states that the effect of one variable on the
other is solely due to chance and no empirical cause lies behind it.
The margin of error is equal to half the width of the entire confidence
interval.
1. Z-Test:-
• Population standard deviation is known
• Large sample size (n > 30)
• Z-Test = (x̅ – μ) / (σ / √n)
σ/√n----→ Standard Error
σ -----→ Population standard deviation
μ-----→ Population Mean
x̅-----→ Sample Mean
n----→ Sample size
• Degrees of freedom: not applicable
• We use the Z-test when the population standard deviation is known
and the sample size is large
A. A one-tailed z-test allows for the possibility of rejection of the Null Hypothesis
in only one direction, whereas a two-tailed z-test tests the possibility of rejection in
both directions (left and right).
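A sketch of a one-sample, two-tailed z-test from the formula above; the claimed mean 100, σ = 15, and sample of 50 with mean 104 are hypothetical, and 1.96 is the usual two-tailed critical value at the 5% significance level:

```python
import math

def one_sample_z_test(sample_mean, mu, sigma, n):
    """Z = (x̄ - μ) / (σ / √n); assumes σ is known and n > 30."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - mu) / standard_error

# Hypothetical: claimed population mean 100, σ = 15, sample of 50 with mean 104
z = one_sample_z_test(104, 100, 15, 50)
print(round(z, 2))    # 1.89

# Two-tailed test at 5% significance: reject H0 if |z| > 1.96
print(abs(z) > 1.96)  # False -> fail to reject the null hypothesis
```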
3. Chi-Square Test: -
• The Chi-Square test makes claims about population proportions.
• It is a non-parametric test performed on categorical (nominal or
ordinal) data.
4. ANOVA (F-Test): -
• ANOVA, which stands for Analysis of Variance, is a statistical
test used to analyze the difference between the means of more than
two groups.
• ANOVA compares the variation between group means to the
variation within the groups. If the variation between group means is
significantly larger than the variation within groups, it suggests a
significant difference between the means of the groups.
1. One-Way ANOVA: - One factor with at least 2 levels; these levels are
independent.
2. Repeated Measures ANOVA: - One factor with at least 2 levels; the levels are
dependent.
3. Factorial ANOVA: - Two or more factors (each with at least 2 levels);
levels can be either independent or dependent.
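The between-group vs within-group comparison can be sketched by computing the one-way ANOVA F-statistic from its definition; the three groups of scores are made up for illustration:

```python
import statistics

# Hypothetical one-way ANOVA: one factor (teaching method) with 3 independent levels
groups = [
    [85, 86, 88, 75, 78],  # method A
    [91, 92, 93, 85, 87],  # method B
    [79, 78, 88, 94, 92],  # method C
]

grand_mean = statistics.mean(x for g in groups for x in g)
k = len(groups)                  # number of groups
N = sum(len(g) for g in groups)  # total number of observations

# Between-group variation: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of observations around their own group mean
ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

# F = (between-group mean square) / (within-group mean square)
f_stat = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(f_stat, 2))  # 2.0 for this made-up data
```

A large F (between-group variation much bigger than within-group variation) suggests the group means genuinely differ; here F = 2.0 is modest.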