Sem 6 - DSV - Unit 4 - Sampling and Estimation


Unit 4 - Sampling and Estimation

A.Y. 2023-24 (Even) Semester: 6


(Information Technology)

Komal Rohit
Asst. Professor, IT Dept., GCET
Sampling
Sampling is a method that allows us to get information about
the population based on the statistics from a subset of the
population (sample), without having to investigate every
individual.
Why do we need Sampling
Sampling is done to draw conclusions about populations
from samples, and it enables us to determine a population’s
characteristics by directly observing only a portion (or
sample) of the population.

Sample selection is a cost-efficient method.

Analysis of the sample is less cumbersome and more practical than an analysis of the entire population.
Steps involved in Sampling

The process of identifying a subset from a population of elements (aka observations or cases) is called the sampling process, or simply sampling.

The following steps are used in any sampling process:

• Identification of the target population that is important for a given problem under study.
• Decide the sampling frame.
• Determine the sample size.
• Decide the sampling method.
Step 1 - Identification of target population that is important for a given problem under study

For example, assume that we are interested in studying attrition among young professionals in India. The definition of ‘young professionals’ in India is vague (not defined); we need a clear identification of the target population.

A better definition of the population in this case would be to study the attrition among IT professionals in the age group 25–35 years in India. It is important to clearly define the target population for correct inference.
Step 2 - Decide the sampling frame

The sampling frame defines the source (or method/procedure) used for identifying the elements of the target population. The choice of sampling frame is important for the accuracy of the study.

For example, to analyse attrition among IT professionals, sources such as LinkedIn and job portals such as Naukri and Monster can be used.

However, these frames may not have important variables (features) that
are required such as information related to salary and other data
captured during exit interview.

So, ideally to understand the attrition behavior one has to use the data
captured by many human resource departments across multiple
companies.

Step 3 - Determine the sample size.

Determining the sample size for data collection is important: collecting data can be expensive, while an insufficient sample results in a lack of precision in the estimation of the parameters.

The sample size for analytics projects is determined using factors such as
effect size, standard deviation, desired level of confidence and margin of
error.
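
As a rough illustration of how these factors combine, one commonly used formula for estimating a population mean is n = (z·σ/E)², where z is the critical value for the desired confidence level, σ is the (assumed known) standard deviation and E is the margin of error. The sketch below assumes this formula only; the function name and the example numbers are hypothetical, and effect size (used in power calculations) is not covered.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Smallest n such that z * sigma / sqrt(n) <= margin_of_error."""
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Hypothetical inputs: standard deviation 15, margin of error 2, 95% confidence.
n = sample_size_for_mean(sigma=15, margin_of_error=2)
print(n)
```

Halving the margin of error roughly quadruples the required sample size, which is why precision requirements drive data collection cost.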

Step 4 - Decide Sampling method

The sampling method is the technique used for selecting individual cases in the sample from the target population using the sampling frame.

At a higher level, sampling methods are classified into two major categories: probabilistic sampling and non-probabilistic sampling.
Population Parameters and Sample Statistic

The population can be very large, making it impossible to collect every feature of each case in the population.

Measures such as mean and standard deviation calculated using the entire population are called population parameters.

Since calculating population parameters is almost impossible in most practical situations, we depend on samples to estimate them.

Measures calculated from a sample and used to estimate the population parameters are called sample statistics.
Statistical Bias
Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process.

An important distinction should be made between errors due to random chance and errors due to bias.

Consider the physical process of a gun shooting at a target. It will not hit the absolute center of the target every time, or even much at all.

An unbiased process will produce error, but it is random and does not tend strongly in any direction.

In a biased process, there is still random error in both the x and y direction, but there is also a bias: shots tend to fall in the upper-right quadrant.

Bias comes in different forms, and may be observable or invisible.

When a result suggests bias (e.g., by reference to a benchmark or actual values), it is often an indicator that a statistical or machine learning model has been misspecified, or an important variable left out.
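
The difference between random error and bias in the shooting analogy can be sketched with a small simulation. This is only an illustration under assumed numbers: the `shoot` helper, the Gaussian spread and the offset of 1.0 are hypothetical, not taken from the slides.

```python
import random
from statistics import mean

def shoot(n, bias_x=0.0, bias_y=0.0, spread=1.0, seed=0):
    """Simulate n shots at a target centered on (0, 0)."""
    rng = random.Random(seed)
    return [(rng.gauss(bias_x, spread), rng.gauss(bias_y, spread))
            for _ in range(n)]

def average_error(shots):
    """Mean horizontal and vertical miss distance."""
    return mean(x for x, _ in shots), mean(y for _, y in shots)

unbiased = shoot(10_000)                        # random error only
biased = shoot(10_000, bias_x=1.0, bias_y=1.0)  # random error plus an offset

print(average_error(unbiased))  # both components close to 0
print(average_error(biased))    # both components close to 1 (upper right)
```

Individual shots miss in both cases; only the biased process has an average miss that does not shrink toward zero as more shots are taken.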
Types of Sampling techniques
Probability Sampling:
Every element of the population has a known, non-zero chance of being selected (an equal chance in the simplest case, simple random sampling). Probability sampling gives us the best chance to create a sample that is truly representative of the population.

Non-Probability Sampling:
The elements do not have a known or equal chance of being selected, and some may have no chance at all. Consequently, there is a significant risk of ending up with a non-representative sample which does not produce generalizable results.

For example, let’s say our population consists of 20 individuals. Each individual is numbered from 1 to 20 and is represented by a specific color (red, blue, green, or yellow).

With probability sampling in its simplest form, each person would have the same chance of being chosen: 1 in 20 on any single draw.

With non-probability sampling, these chances are not equal. One person might have a better chance of being chosen than others.
Probabilistic Sampling
1 - Simple Random Sampling
2 - Systematic Sampling
3 - Stratified Sampling
4 - Cluster Sampling
Simple Random Sampling
Every individual is chosen entirely by chance and each
member of the population has an equal chance of being
selected.
Simple random sampling reduces selection bias.
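
A minimal sketch of simple random sampling, assuming a population of 20 numbered individuals and a sample size of 5 (the seed is arbitrary and only makes the run repeatable):

```python
import random

population = list(range(1, 21))       # individuals numbered 1 to 20
rng = random.Random(42)               # fixed seed for a repeatable draw

# Draw 5 distinct individuals; every individual is equally likely to be picked.
sample = rng.sample(population, k=5)
print(sorted(sample))
```

`random.sample` draws without replacement, so no individual can appear twice in the sample.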

Advantage - It is the most direct method of probability sampling.

But it comes with a caveat: it may not select enough individuals with our characteristics of interest.

Monte Carlo methods use repeated random sampling for the estimation of unknown parameters.
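
A classic illustration of the Monte Carlo idea is estimating π from random points in the unit square. This example is not from the slides; the function name and the number of points are arbitrary choices.

```python
import random

def estimate_pi(num_points, seed=0):
    """Estimate pi as 4 x (fraction of random points inside the quarter circle)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / num_points

pi_hat = estimate_pi(100_000)
print(pi_hat)
```

The estimate fluctuates from run to run (change the seed to see this), and the fluctuation shrinks as the number of random points grows.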
Systematic Sampling
The first individual is selected randomly and the others are selected using a fixed ‘sampling interval’.

Example -
Say our population size is x and we have to select a sample of size n.

Then the sampling interval is x/n: the next individual that we select is x/n positions after the first individual, and we select the rest in the same way.

Suppose our population has 20 individuals, we begin with person number 3, and we want a sample size of 5.
The sampling interval is 20/5 = 4, so the next individual that we select is person 7 (3 + 4), and so on:
3, 3+4 = 7, 7+4 = 11, 11+4 = 15, 15+4 = 19
The sample is therefore 3, 7, 11, 15, 19.
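
The same selection can be sketched in code. The helper below is hypothetical and makes two assumptions: the population size is a multiple of the sample size, and the random start is drawn from the first interval.

```python
import random

def systematic_sample(population, n, start=None, rng=random):
    """Pick every k-th element, where k = len(population) // n."""
    k = len(population) // n          # sampling interval
    if start is None:
        start = rng.randrange(k)      # random 0-based start within the first interval
    return [population[start + i * k] for i in range(n)]

population = list(range(1, 21))       # individuals numbered 1 to 20

# start=2 is the 0-based position of person number 3.
print(systematic_sample(population, n=5, start=2))  # → [3, 7, 11, 15, 19]
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined by the interval.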

Systematic sampling is more convenient than simple random sampling.

However, it might also lead to bias if there is an underlying pattern in the order of the population that coincides with the sampling interval (though the chances of that happening are quite rare).
Stratified Sampling
The population is divided into subgroups (called strata) based on different traits like gender, category, etc., and then we select the sample(s) from these subgroups.

In our example, we first divide the population into subgroups based on the colors red, yellow, green and blue. Then, from each color, we select individuals in proportion to their numbers in the population.

We use this type of sampling when we want representation from all the subgroups of the population.

However, stratified sampling requires proper knowledge of the characteristics of the population.
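
A minimal sketch of proportional stratified sampling. The color counts (8 red, 4 each of blue, green and yellow) and the 25% sampling fraction are assumptions made for illustration; the slides do not give the actual counts.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=0):
    """Sample the same fraction from every stratum (at least one per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)      # group items by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))      # simple random sample per stratum
    return sample

# Hypothetical population: (person number, color) pairs.
colors = ["red"] * 8 + ["blue"] * 4 + ["green"] * 4 + ["yellow"] * 4
people = list(enumerate(colors, start=1))

sample = stratified_sample(people, stratum_of=lambda p: p[1], fraction=0.25)
print(sample)
```

With these numbers the sample contains 2 red individuals and 1 of each other color, mirroring the population proportions.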
Cluster Sampling
In a clustered sample, we use the subgroups of the population as the sampling unit rather than individuals. The population is divided into subgroups, known as clusters, and a whole cluster is randomly selected to be included in the study.

In our example of 20 individuals, the population is divided into 5 clusters. Each cluster consists of 4 individuals, and we take the 4th cluster as our sample. We can include more clusters as per our sample size.

This type of sampling is used when we focus on a specific region or area.
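
A minimal sketch of cluster sampling for this example, assuming the 5 clusters are simply consecutive blocks of 4 individuals (the slides do not say how the clusters were formed):

```python
import random

population = list(range(1, 21))                    # individuals numbered 1 to 20

# Split the population into 5 clusters of 4 consecutive individuals each.
clusters = [population[i:i + 4] for i in range(0, len(population), 4)]

rng = random.Random(1)
chosen = rng.sample(clusters, k=1)                 # randomly select whole clusters

# Every individual in a chosen cluster enters the sample.
sample = [person for cluster in chosen for person in cluster]
print(sample)
```

Increasing `k` includes more clusters and therefore a larger sample, as the text notes.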
Non-Probabilistic Sampling
1 - Convenience Sampling
2 - Quota Sampling
3 - Judgment Sampling
4 - Snowball Sampling
Convenience Sampling
This is the easiest method of sampling, because individuals are selected based on their availability and willingness to take part.

Convenience sampling is prone to significant bias, because the sample may not represent specific characteristics of the population, such as religion or gender.
Quota Sampling
In this type of sampling, we choose items based on predetermined characteristics of the population. For example, suppose that from our 20 numbered individuals we have to select those whose number is a multiple of four.

In quota sampling, the chosen sample might not be the best representation of the characteristics of the population that weren’t considered.
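
The multiples-of-four rule can be sketched as follows. The `quota_sample` helper and the quota of 5 are hypothetical; the point is that selection follows a predetermined rule rather than chance.

```python
def quota_sample(population, meets_criterion, quota):
    """Take individuals in order until the quota of matching individuals is filled."""
    sample = []
    for person in population:
        if meets_criterion(person):
            sample.append(person)
        if len(sample) == quota:
            break
    return sample

population = list(range(1, 21))       # individuals numbered 1 to 20
sample = quota_sample(population, lambda p: p % 4 == 0, quota=5)
print(sample)  # → [4, 8, 12, 16, 20]
```

No randomness is involved, so any characteristic that happens to correlate with the rule is carried into the sample unchecked.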
Judgment (Purposive) Sampling
It is also known as selective sampling. It depends on the judgment of the experts when choosing whom to ask to participate.

Like quota sampling, it is prone to bias introduced by the experts and may not necessarily be representative.
Snowball Sampling
Existing people are asked to nominate further people
known to them so that the sample increases in size like a
rolling snowball. This method of sampling is effective
when a sampling frame is difficult to identify.

For example, we randomly choose person 1 for our sample; that person recommends person 6, person 6 recommends person 11, and so on:
1 -> 6 -> 11 -> 14 -> 19

There is a significant risk of selection bias in snowball sampling, as the referred individuals tend to share common traits with the person who recommends them.
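
The referral chain can be sketched as a walk through a table of recommendations. The `referrals` mapping below just encodes the chain from the example, and the helper name is hypothetical.

```python
def snowball_sample(first_person, referrals, size):
    """Follow referrals from the first person until the sample is large enough."""
    sample = [first_person]
    while len(sample) < size and sample[-1] in referrals:
        next_person = referrals[sample[-1]]
        if next_person in sample:     # stop if the chain loops back
            break
        sample.append(next_person)
    return sample

# Who recommends whom, as in the example chain.
referrals = {1: 6, 6: 11, 11: 14, 14: 19}
print(snowball_sample(1, referrals, size=5))  # → [1, 6, 11, 14, 19]
```

Everyone after the first person enters the sample only through an existing member, which is the source of the selection bias described above.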
Sampling Distribution
A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population.

It describes how often each of the different possible values of the statistic would occur across those samples.

Imagining an experiment may help you to understand sampling distributions:
• Suppose that you draw a random sample from a population and calculate a statistic for the sample, such as the mean.
• Now you draw another random sample of the same size, and again calculate the mean.
• You repeat this process many times, and end up with a large number of means, one for each sample.
The distribution of the sample means is an example of a sampling distribution.
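
The experiment described above can be sketched directly. The population here is made up (10,000 hypothetical ages between 20 and 60), and the sample size and number of repetitions are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical population: 10,000 ages drawn uniformly between 20 and 60.
population = [rng.randint(20, 60) for _ in range(10_000)]

def sample_means(population, sample_size, repetitions, rng):
    """Draw many random samples of the same size and record each sample mean."""
    return [mean(rng.sample(population, sample_size))
            for _ in range(repetitions)]

means = sample_means(population, sample_size=50, repetitions=1_000, rng=rng)

# The list `means` is an empirical approximation of the sampling distribution.
print(round(mean(population), 2), round(mean(means), 2))
```

The two printed values should be close: the sample means scatter around the population mean, and a histogram of `means` would show the shape of the sampling distribution.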

A sampling distribution describes a range of possible outcomes for a statistic, such as the mean or mode of some variable, of a population.

The majority of data analyzed by researchers are actually samples, not populations.

The statistic in question, such as the sample mean or sample standard deviation, is computed from several random samples of the same size. Understanding the sampling distribution is important for hypothesis testing.
Central Limit Theorem (CLT)
The central limit theorem states that, for a sufficiently large sample size, the sampling distribution of the sample mean (and of statistics built from means, such as the sample proportion) is nearly normal or bell-shaped, whatever the shape of the population distribution (provided it has a finite variance), and is centered on the true population parameter that is being estimated.

The sampling distribution is a theoretical distribution that we cannot observe directly; it describes all the possible values of a sample statistic from random samples of the same size that are taken from the same population.

The parameters of the sampling distribution of the mean are determined by the parameters of the population:

The mean of the sampling distribution is the mean of the population.

The standard deviation of the sampling distribution is the standard deviation of the population divided by the square root of the sample size.

We can describe the sampling distribution of the mean using this notation (the second argument is the standard deviation of the sampling distribution):

$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

Where,
X̄ is the sampling distribution of the sample means
~ means “follows the distribution”
N is the normal distribution
µ is the mean of the population
σ is the standard deviation of the population
n is the sample size
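
These two relationships can be checked numerically. The sketch below assumes a deliberately skewed population, an exponential distribution with µ = 1 and σ = 1; the sample size of 30 and the 5,000 repetitions are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)
n = 30                 # sample size
repetitions = 5_000    # number of samples drawn

# Each entry is the mean of one sample of size n from an exponential population
# with rate 1.0, so the population has mean 1 and standard deviation 1.
sample_means = [mean(rng.expovariate(1.0) for _ in range(n))
                for _ in range(repetitions)]

print("mean of sample means:", round(mean(sample_means), 3))   # close to mu = 1
print("sd of sample means:  ", round(stdev(sample_means), 3))  # close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", round(1 / sqrt(n), 3))
```

Although the population is far from normal, the sample means cluster symmetrically around 1 with a spread close to σ/√n.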
Estimation of population parameters
Estimation is a process used for making inferences about population parameters based on samples.

For example, we may like to estimate the population parameters such as mean and standard deviation and probability distribution parameters such as scale, shape, and location parameters.

The following are the two types of estimates:

1. Point Estimate: The point estimate of a population parameter is the single value (or specific value) calculated from the sample (thus called a statistic). The sample mean and variance are estimates of the population mean and variance. Similarly, the sample proportion is an estimate of the population proportion.

2. Interval Estimate: Instead of a specific value of the parameter, in an interval estimate the parameter is said to lie in an interval (say between points a and b) with a certain probability (or confidence).
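
Both kinds of estimate can be sketched for a population mean. The interval below uses the normal approximation x̄ ± z·s/√n, which is an assumption on our part; for small samples a t-based interval is usually preferred. The data values are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, confidence=0.95):
    """Return the point estimate and a normal-approximation interval for the mean."""
    n = len(data)
    x_bar = mean(data)                        # point estimate
    standard_error = stdev(data) / sqrt(n)    # estimated sd of the sample mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x_bar, (x_bar - z * standard_error, x_bar + z * standard_error)

# Hypothetical measurements.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]
point, (low, high) = mean_confidence_interval(data)
print(point, (low, high))
```

The point estimate is a single number, while the interval widens as the requested confidence level increases.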
The estimation of parameters is usually carried out using the
following approaches:

1. Method of Moments

2. Maximum Likelihood Estimate (MLE)



1. Method of Moments -
The method of moments is a way to estimate population parameters, like the population mean or the population standard deviation.

The basic idea is that you take known facts about the population, and extend those ideas to a sample.

For example, it’s a fact that within a population:

Expected value E(x) = μ

For a sample, the estimator of μ is just the sample mean, x̄. The formula for the sample mean is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

This is the first moment condition.

The second moment condition involves the variance.

The population variance is Var(x) = σ², so we just need to use the method of moments to estimate the variance in the sample.

Use the fact that the population variance Var(x) = σ² is the same as E[(x − μ)²] = σ².

As in the first moment, replace the population expectation by the sample equivalent (the sample mean). However, while the first moment was the sample mean of x, this time it’s the sample mean of (x − μ)², with μ replaced by its estimate x̄, giving the formula:

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
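
The two moment conditions translate into a few lines of code. This is a minimal sketch with made-up data; note that the method-of-moments variance divides by n, not by n − 1 as the usual unbiased sample variance does.

```python
def method_of_moments(data):
    """Estimate the mean and variance from the first two moment conditions."""
    n = len(data)
    mu_hat = sum(data) / n                                  # first moment: sample mean
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n      # second central moment
    return mu_hat, var_hat

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical observations
print(method_of_moments(data))        # → (5.0, 4.0)
```

The variance estimate matches `statistics.pvariance(data)`, which also divides by n.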

2. Maximum Likelihood Estimation (MLE) -

Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best describe a given dataset.

The fundamental idea behind MLE is to find the values of the parameters that maximize the likelihood of the observed data, assuming that the data are generated by the specified distribution.
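
As a small illustration, assume the data are independent 0/1 outcomes from a Bernoulli distribution with unknown success probability p. The sketch below searches a grid of candidate values for the one that maximizes the log-likelihood; the data and the grid resolution are made up, and in practice the maximum is usually found analytically or with a numerical optimizer.

```python
from math import log

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of 0/1 observations under success probability p."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]             # 7 successes in 10 trials

# Candidate values 0.001, 0.002, ..., 0.999 (0 and 1 excluded to avoid log(0)).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

print(p_hat, sum(data) / len(data))
```

The grid search lands on the sample proportion, which is the known closed-form maximum likelihood estimate for a Bernoulli parameter.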
