Sem 6 - DSV - Unit 4 - Sampling and Estimation

Unit 4 - Sampling and Estimation
A.Y. 2023-24 (Even) Semester: 6

(Information Technology)
Komal Rohit
Asst. Professor, IT Dept., GCET
Sampling
Sampling is a method that allows us to get information about
the population based on the statistics from a subset of the
population (sample), without having to investigate every
individual.
Why do we need Sampling
Sampling is done to draw conclusions about populations
from samples, and it enables us to determine a population’s
characteristics by directly observing only a portion (or
sample) of the population.
Sample selection is a cost-efficient method.
Analysis of the sample is less cumbersome and more
practical than an analysis of the entire population.
Steps involved in Sampling
The process of identifying a subset from a population of
elements (also called observations or cases) is called the
sampling process, or simply sampling.
The following steps are used in any sampling process:
• Identify the target population that is important
for the problem under study.
• Decide the sampling frame.
• Determine the sample size.
• Decide the sampling method.
Step 1 - Identification of target population that is important for
a given problem under study
For example,
Assume that we are interested in studying attrition among
young professionals in India. The term ‘young
professionals’ is vague (not precisely defined); we need a clear
identification of the target population.
A better definition of the population in this case would be
IT professionals in the age group 25–35 years in India. It is
important to clearly define the target population for correct
inference.
Step 2 - Decide the sampling frame
The sampling frame defines the source (or method/procedure) used for
identifying the elements of the target population. The choice of sampling
frame is important for the accuracy of the study.
For example,
To analyse attrition among IT professionals, sources such as LinkedIn and
job portals such as Naukri and Monster can be used.
However, these frames may not have important variables (features) that
are required such as information related to salary and other data
captured during exit interview.
So, ideally to understand the attrition behavior one has to use the data
captured by many human resource departments across multiple
companies.
Step 3 - Determine the sample size.
Determining the sample size for data collection is important: collecting
data can be expensive, while an insufficient sample results in
a lack of precision in the estimation of the parameters.
The sample size for analytics projects is determined using factors such as
effect size, standard deviation, desired level of confidence and margin of
error.
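As a rough illustration of how these factors interact, the sketch below uses one common textbook formula for estimating a mean, n = (z·σ / E)², where σ is an assumed standard deviation, E the margin of error and z the critical value for the chosen confidence level. The function name and the input values are hypothetical, and effect size (relevant when planning hypothesis tests) is not covered here.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Smallest n such that z * sigma / sqrt(n) <= margin_of_error."""
    # Two-sided critical value z for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)


# Hypothetical inputs: sigma = 15, margin of error = 2, 95% confidence.
n_required = sample_size_for_mean(sigma=15.0, margin_of_error=2.0)
print(n_required)
```

Halving the margin of error roughly quadruples the required sample size, which is why precision targets drive data collection cost.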
Step 4 - Decide Sampling method
The sampling method is the technique used for selecting individual
cases in the sample from the target population using the sampling
frame.
At a high level, sampling methods are classified into two major
categories: probabilistic sampling and non-probabilistic sampling.
Population Parameters and Sample Statistic
The population can be very large, making it impractical to
collect every feature of each case in the population.
Measures such as mean and standard deviation calculated
using the entire population are called population
parameters.
Calculating population parameters is almost impossible in most
practical situations, so we depend on samples to
estimate the population parameters.
Estimates of population parameters computed from a sample are
called sample statistics.
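The distinction can be made concrete with a small simulation. This is only a sketch: the population below is synthetic (normally distributed with mean 50 and standard deviation 10, both assumed values), whereas in practice the full population is usually not available.

```python
import random
import statistics

rng = random.Random(42)  # seeded only so the run is repeatable

# Synthetic population (assumption: normal, mean 50, sd 10).
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Population parameters: computed from every case.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Sample statistics: computed from a random subset of 200 cases.
sample = rng.sample(population, 200)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)

print(f"parameters: mu={mu:.2f}, sigma={sigma:.2f}")
print(f"statistics: x_bar={x_bar:.2f}, s={s:.2f}")
```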
Statistical Bias
Statistical bias refers to measurement or sampling errors that are
systematic and produced by the measurement or sampling process.
An important distinction should be made between errors
due to random chance and errors due to bias.
Consider the physical process of a gun shooting at a target.
It will not hit the absolute center of the target every time,
or even very often.
Statistical Bias
An unbiased process will produce error, but it is random and
does not tend strongly in any direction.
Statistical Bias
Biased process - there is still random error in both the x and
y direction, but there is also a bias.
Shots tend to fall in the upper-right quadrant.
Statistical Bias
Bias comes in different forms, and may be observable or
invisible.
When a result suggests bias (e.g., by reference to a
benchmark or actual values), it is often an indicator that a
statistical or machine learning model has been misspecified,
or an important variable left out.
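The target-shooting picture can be reproduced numerically. In this sketch the scatter is assumed to be normal with unit spread, and the biased gun is assumed to pull by one unit up and to the right; these numbers are illustrative only.

```python
import random
import statistics

rng = random.Random(11)  # seeded only so the run is repeatable


def shoot(n_shots, offset_x, offset_y, spread=1.0):
    """Return n_shots (x, y) hits: aim offset plus random scatter."""
    return [(offset_x + rng.gauss(0, spread), offset_y + rng.gauss(0, spread))
            for _ in range(n_shots)]


def mean_error(shots):
    """Average x and y miss distance from the target center at (0, 0)."""
    return (statistics.fmean(x for x, _ in shots),
            statistics.fmean(y for _, y in shots))


unbiased = shoot(1000, offset_x=0.0, offset_y=0.0)
biased = shoot(1000, offset_x=1.0, offset_y=1.0)  # pulls up and to the right

print("unbiased mean error:", mean_error(unbiased))
print("biased mean error:  ", mean_error(biased))
```

Both guns show random error in individual shots, but only the biased one has an average miss that stays away from zero as the number of shots grows.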
Types of Sampling techniques
Probability Sampling:
Every element of the population has a known, non-zero chance of
being selected (an equal chance in the simplest case). Probability
sampling gives us the best chance to create a sample that is truly
representative of the population.
Non-Probability Sampling:
Elements do not all have a known chance of being selected.
Consequently, there is a significant risk of ending up with a
non-representative sample which does not produce
generalizable results.
For example, let’s say our population consists of 20
individuals. Each individual is numbered from 1 to 20 and is
represented by a specific color (red, blue, green, or yellow).
With probability sampling in its simplest form, each person
has a 1 in 20 chance of being chosen.
With non-probability sampling, these chances are not equal. One
person might have a better chance of being chosen than
others.
Probabilistic Sampling
1 - Simple Random Sampling
2 - Systematic Sampling
3 - Stratified Sampling
4 - Cluster Sampling
Simple Random Sampling
Every individual is chosen entirely by chance and each
member of the population has an equal chance of being
selected.
Simple random sampling reduces selection bias.
Simple Random Sampling
Advantage - It is the most direct method of probability
sampling.
But it comes with a caveat:
it may not select enough individuals with the characteristics
we are interested in.
Monte Carlo methods use repeated random sampling for the
estimation of unknown parameters.
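A minimal sketch of simple random sampling on the 20-person population used in this unit, followed by a small Monte Carlo check. The seed, the sample size of 5 and the choice of individual 7 are arbitrary assumptions for illustration.

```python
import random

rng = random.Random(2024)          # seeded only so the run is repeatable
population = list(range(1, 21))    # individuals numbered 1 to 20

# Draw 5 distinct individuals; every subset of size 5 is equally likely.
srs = rng.sample(population, 5)
print(sorted(srs))

# Monte Carlo flavour: repeat the draw many times and count how often
# individual 7 is picked. The share should be close to 5/20 = 0.25.
trials = 10_000
hits = sum(7 in rng.sample(population, 5) for _ in range(trials))
print(hits / trials)
```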
Systematic Sampling
The first individual is selected randomly and the others are selected
using a fixed ‘sampling interval’.
Example -
Say our population size is x and we have to select a sample
of size n.
Then the sampling interval is x/n, and the next individual we
select is x/n positions after the first individual. We select the
rest in the same way.
Systematic Sampling
Suppose we begin with person number 3 in our population of 20,
and we want a sample size of 5.
So, the next individual that we select would be at an
interval of (20/5) = 4 from the 3rd person, i.e. 7 (3+4), and so
on.
3, 3+4=7, 7+4=11, 11+4=15, 15+4=19, giving the sample 3, 7, 11, 15, 19
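The arithmetic above can be sketched as a small function. The function name and its parameters are hypothetical; the interval is taken as the integer part of x/n, and the start is restricted to the first interval so the selection never runs past the end of the population.

```python
import random


def systematic_sample(population, n, start_index=None, rng=random):
    """Pick every k-th element, with k = len(population) // n."""
    k = len(population) // n
    if start_index is None:
        start_index = rng.randrange(k)   # random start within the first interval
    if not 0 <= start_index < k:
        raise ValueError("start_index must lie in the first sampling interval")
    return [population[start_index + i * k] for i in range(n)]


people = list(range(1, 21))
# Person number 3 sits at index 2 (0-based).
print(systematic_sample(people, n=5, start_index=2))  # [3, 7, 11, 15, 19]
```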
Systematic Sampling
Systematic sampling is more convenient than simple
random sampling.
However, it might also lead to bias if the order of the
population has an underlying pattern that lines up with the
sampling interval (though the chances of that happening are
quite rare).
Stratified Sampling
The population is divided into subgroups (called strata) based on different
traits like gender, category, etc. We then select the sample(s)
from these subgroups:
Stratified Sampling
Here, we first divided our population into subgroups
based on the colors red, yellow, green and blue.
Then, from each color, we selected individuals in
proportion to their numbers in the population.
We use this type of sampling when we want
representation from all the subgroups of the population.
However, stratified sampling requires proper knowledge
of the characteristics of the population.
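A minimal sketch of proportional stratified sampling. The slides do not state how many individuals have each color, so the split below (8 red, 6 blue, 4 green, 2 yellow) and the 50% sampling fraction are assumptions chosen only to make the proportions easy to see.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical color assignment for persons 1..20.
colors = ["red"] * 8 + ["blue"] * 6 + ["green"] * 4 + ["yellow"] * 2
color_of = dict(zip(range(1, 21), colors))

rng = random.Random(1)
sample = stratified_sample(list(color_of), color_of.get, fraction=0.5, rng=rng)
counts = Counter(color_of[p] for p in sample)
print(counts)
```

Each color keeps the same share in the sample as in the population, which is the property a simple random sample cannot guarantee.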
Cluster Sampling
In a clustered sample, we use the subgroups of the population as the
sampling unit rather than individuals. The population is divided into
subgroups, known as clusters, and a whole cluster is randomly
selected to be included in the study:
Cluster Sampling
In the above example, we have divided our population into 5 clusters.
Each cluster consists of 4 individuals and we have taken the 4th
cluster in our sample. We can include more clusters as per our sample
size.
This type of sampling is used when we focus on a specific region or
area.
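The cluster example can be sketched as follows. How the 20 individuals are grouped is not stated in the text, so consecutive numbering (1-4, 5-8, and so on) is an assumption, as is the seed.

```python
import random

people = list(range(1, 21))

# Assumed grouping: 5 clusters of 4 consecutive individuals.
clusters = [people[i:i + 4] for i in range(0, len(people), 4)]

rng = random.Random(7)
n_clusters = 1                               # raise this for a larger sample
chosen = rng.sample(clusters, n_clusters)    # whole clusters are the sampling unit

# Every member of a chosen cluster enters the sample.
sample = [person for cluster in chosen for person in cluster]
print(sample)
```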
Non-Probabilistic Sampling
1 - Convenience Sampling
2 - Quota Sampling
3 - Judgment Sampling
4 - Snowball Sampling
Convenience Sampling
This is the easiest method of sampling because individuals are
selected based on their availability and willingness to take
part.
Convenience Sampling
Convenience sampling is prone to significant bias,
because the sample may not represent
specific characteristics of the population, such as
religion or gender.
Quota Sampling
In this type of sampling, we choose items based on
predetermined characteristics of the population. Suppose,
for example, that we have to select the individuals whose
numbers are multiples of four for our sample.
Quota Sampling
In quota sampling, the chosen sample might not be the
best representation of the characteristics of the
population that weren’t considered.
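The multiples-of-four example can be written in a couple of lines. This mirrors only the illustration used here; it is not a general quota-filling procedure.

```python
population = list(range(1, 21))

# Predetermined characteristic from the example: the number is a multiple of four.
quota_sample = [p for p in population if p % 4 == 0]
print(quota_sample)  # [4, 8, 12, 16, 20]
```

No randomness is involved, which is why characteristics that were not part of the selection rule may be poorly represented.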
Judgment (Purposive) Sampling
It is also known as selective sampling. It depends on the
judgment of the experts when choosing whom to ask to
participate.
Judgment (Purposive) Sampling
As you can imagine, like quota sampling, it is prone to
bias introduced by the experts and may not necessarily be
representative.
Snowball Sampling
Existing people are asked to nominate further people
known to them so that the sample increases in size like a
rolling snowball. This method of sampling is effective
when a sampling frame is difficult to identify.
Snowball Sampling
Here, we randomly chose person 1 for our sample,
and then he/she recommended person 6, and person 6
recommended person 11, and so on.
1 -> 6 -> 11 -> 14 -> 19
There is a significant risk of selection bias in snowball
sampling, as the referred individuals will share
common traits with the person who recommends them.
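The referral chain can be sketched as a walk through a "who recommends whom" mapping. The mapping below is hypothetical and simply encodes the chain from the example; real referrals would usually branch.

```python
# Hypothetical referrals matching the chain in the example above.
referrals = {1: 6, 6: 11, 11: 14, 14: 19}


def snowball(seed, referrals, max_size):
    """Follow referrals from the seed until the sample is full or the chain ends."""
    sample = [seed]
    while len(sample) < max_size and sample[-1] in referrals:
        nominee = referrals[sample[-1]]
        if nominee in sample:      # avoid looping on circular referrals
            break
        sample.append(nominee)
    return sample


print(snowball(1, referrals, max_size=5))  # [1, 6, 11, 14, 19]
```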
Sampling Distribution
A sampling distribution is a probability distribution of a
statistic obtained from a large number of samples drawn
from a specific population.
The sampling distribution of a statistic is the
distribution of frequencies of the different
outcomes that could possibly occur for that statistic.
In short, it is the probability distribution of a statistic over
a large number of samples taken from a population.
Imagining an experiment may help you to understand
sampling distributions:
• Suppose that you draw a random sample from a
population and calculate a statistic for the sample, such
as the mean.
• Now you draw another random sample of the same
size, and again calculate the mean.
• You repeat this process many times, and end up with a
large number of means, one for each sample.
The distribution of the sample means is an example of a
sampling distribution.
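The thought experiment above can be run directly. In this sketch the population is synthetic and deliberately skewed (exponential with mean about 20); the sample size of 30 and the 2000 repetitions are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(0)  # seeded only so the run is repeatable

# Synthetic, skewed population (assumption: exponential, mean about 20).
population = [rng.expovariate(1 / 20) for _ in range(50_000)]


def sample_means(population, sample_size, repeats, rng):
    """Draw `repeats` random samples and return the mean of each one."""
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(repeats)]


means = sample_means(population, sample_size=30, repeats=2000, rng=rng)

print("population mean:     ", round(statistics.fmean(population), 2))
print("mean of sample means:", round(statistics.fmean(means), 2))
print("spread of the means: ", round(statistics.stdev(means), 2))
```

The list `means` is an empirical approximation of the sampling distribution of the mean for samples of size 30.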
A sampling distribution is a probability distribution of a statistic
that is obtained through repeated sampling of a specific
population.
It describes a range of possible outcomes for a statistic, such as
the mean or mode of some variable, computed from samples of a population.
The majority of data analyzed by researchers are actually
samples, not populations.
Sampling distribution refers to the probability distribution of a
statistic such as the sample mean or sample
standard deviation computed from several random samples of
the same size. Understanding the sampling distribution is
important for hypothesis testing.
Central Limit Theorem (CLT)
The central limit theorem states that, for sufficiently large samples,
the sampling distribution of a sample statistic (like the sample mean
or proportion) is nearly normal (bell-shaped) and is centered, on
average, on the true population parameter that is being estimated.
The sampling distribution is a theoretical distribution that we
do not observe directly; it describes all the possible values of a
sample statistic from random samples of the same size that are
taken from the same population.
Central Limit Theorem (CLT)
The parameters of the sampling distribution of the mean are determined by
the parameters of the population:
The mean of the sampling distribution is the mean of the population.
The standard deviation of the sampling distribution is the standard deviation
of the population divided by the square root of the sample size.
We can describe the sampling distribution of the mean using this notation:
$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Where,
X̄ is the sampling distribution of the sample means
~ means “follows the distribution”
N is the normal distribution
µ is the mean of the population
σ is the standard deviation of the population
n is the sample size
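The two statements about the mean and standard deviation of the sampling distribution can be checked by simulation. This is a sketch under assumptions: the population is synthetic and uniform on 0 to 100 (clearly not normal), the sample size is 50, and 3000 samples are drawn.

```python
import math
import random
import statistics

rng = random.Random(1)  # seeded only so the run is repeatable

# Synthetic non-normal population (assumption: uniform on 0..100).
population = [rng.uniform(0, 100) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50
means = [statistics.fmean(rng.sample(population, n)) for _ in range(3000)]

predicted_se = sigma / math.sqrt(n)     # sigma / sqrt(n) from the notation above
observed_se = statistics.stdev(means)   # spread of the simulated sample means

print(f"mu = {mu:.2f}, mean of sample means = {statistics.fmean(means):.2f}")
print(f"predicted sd = {predicted_se:.2f}, observed sd = {observed_se:.2f}")
```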
Estimation of population parameters
Estimation is a process used for making inferences about
population parameters based on samples.
For example, we may like to estimate population
parameters such as the mean and standard deviation, and
probability distribution parameters such as scale, shape, and
location parameters.
The following are the two types of estimates:
1. Point Estimate: The point estimate of a population
parameter is a single value calculated
from the sample (and is therefore a statistic). The sample mean
and variance are estimates of the population mean and
variance. Similarly, the sample proportion is an estimate of
the population proportion.
2. Interval Estimate: Instead of a specific value of the
parameter, an interval estimate states that the parameter lies
in an interval (say between points a and b) with a certain
probability (or confidence).
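The difference between the two kinds of estimate can be sketched with a confidence interval for a mean. The interval below uses the large-sample normal approximation x̄ ± z·s/√n, which is an assumption of this sketch (small samples would normally use the t distribution); the data are synthetic and the function name is hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Large-sample interval estimate for the mean: x_bar +/- z * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.fmean(sample)           # point estimate of the mean
    s = statistics.stdev(sample)               # sample standard deviation
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


rng = random.Random(5)
sample = [rng.gauss(100, 15) for _ in range(400)]   # synthetic data

low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
print(f"point estimate: {statistics.fmean(sample):.2f}")
print(f"95% interval:   ({low95:.2f}, {high95:.2f})")
print(f"99% interval:   ({low99:.2f}, {high99:.2f})")
```

A higher confidence level gives a wider interval for the same data.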
The estimation of parameters is usually carried out using the
following approaches:
1. Method of Moments
2. Maximum Likelihood Estimate (MLE)

1. Method of Moments -
The method of moments is a way to estimate population
parameters, like the population mean or the population
standard deviation.
The basic idea is that you take known facts about the
population, and extend those ideas to a sample.
For example, it’s a fact that within a population:
Expected value E(x) = μ
For a sample, the estimator of μ
is just the sample mean, x̄.
The formula for the sample mean is:
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
This is the first moment condition.
1. Method of Moments -
The second moment condition involves the variance.
The population variance is Var(x) = σ², so we just need to use the
method of moments to estimate the variance in the sample.
Use the fact that the population variance
Var(x) = σ² is the same as: E[(x – μ)²] = σ².
As in the first moment, replace the population expectation by the
sample equivalent (the sample average). However, while the first
moment was the sample mean of x, this time it’s the average of
(x – μ)², with μ replaced by its estimate x̄, giving the
formula:
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
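The two moment conditions can be sketched in a few lines: the first estimates μ by the sample mean, the second estimates σ² by the average squared deviation from that mean (dividing by n, not n − 1). The data values are made up for illustration.

```python
import statistics


def method_of_moments_normal(data):
    """Return (mu_hat, sigma2_hat) from the first two moment conditions."""
    n = len(data)
    mu_hat = sum(data) / n                                  # first moment
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n   # second central moment
    return mu_hat, sigma2_hat


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample
mu_hat, sigma2_hat = method_of_moments_normal(data)
print(mu_hat, sigma2_hat)  # 5.0 4.0

# The divide-by-n variance matches statistics.pvariance, not statistics.variance.
print(statistics.pvariance(data), statistics.variance(data))
```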
2. Maximum likelihood estimation –
Maximum Likelihood Estimation (MLE) is a statistical method used to
estimate the parameters of a probability distribution that best
describe a given dataset.
The fundamental idea behind MLE is to find the values of the
parameters that maximize the likelihood of the observed data,
assuming that the data are generated by the specified distribution.
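As a minimal sketch of the idea, assume the data come from an exponential distribution with an unknown rate (this choice of distribution is an assumption made only for illustration). The log-likelihood is n·log(rate) − rate·Σx, and the code maximizes it with a crude grid search, then compares the result with the known closed-form answer 1/x̄.

```python
import math
import random
import statistics


def exp_log_likelihood(rate, data):
    """Log-likelihood of i.i.d. exponential data for a given rate."""
    return len(data) * math.log(rate) - rate * sum(data)


rng = random.Random(3)
data = [rng.expovariate(0.5) for _ in range(500)]   # synthetic data, true rate 0.5

# Crude numerical maximization: try rates 0.001, 0.002, ..., 3.000.
grid = [i / 1000 for i in range(1, 3001)]
rate_grid = max(grid, key=lambda r: exp_log_likelihood(r, data))

# Closed-form MLE for the exponential rate: 1 / sample mean.
rate_closed = 1 / statistics.fmean(data)

print(f"grid search MLE: {rate_grid:.3f}")
print(f"closed-form MLE: {rate_closed:.3f}")
```

The grid search is only for illustration; in practice the maximum is found analytically, as here, or with a numerical optimizer.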
