
DATA SCIENCE
Grade X
Chapter 3: Identifying Patterns
This chapter aims to teach students how to identify partiality, preference, and prejudice.
At the end of this chapter, students should be able to understand:

• How to identify partiality, preference, and prejudice
• What the Central Limit Theorem is
LEARNING OBJECTIVES
1. What is partiality, preference, and prejudice?
2. How to identify partiality, preference, and prejudice?
3. Probability for statistics
4. The Central Limit Theorem
5. Why is the Central Limit Theorem important?
What Is Partiality, Preference, and Prejudice?
We often come across situations where a special fondness for a particular thing makes us slightly partial towards it.
In most cases, this partiality affects the outcome, skewing it in favor of that thing.
Naturally, this is not the right way to deal with data at a larger scale.
This partiality, preference, and prejudice towards a set of data is called bias.
In Data Science, bias is a deviation from the expected outcome in the data. Fundamentally, you can also think of bias as error in the data.
Why does bias occur in the first place?
Bias occurs mainly because of sampling and estimation. If we knew everything about all the entities in our data and stored information on every possible entity, our data would never have any bias.
However, data science is often not conducted under carefully controlled conditions. It is mostly done on "found data", i.e. data that was collected for a purpose other than modelling. That is why such data is very likely to contain biases.
Why does bias really matter?
Predictive models consider only the data that is used to train them.
In fact, they know no reality other than the data that is fed into their system.
Naturally, if the data fed into the system is biased, model accuracy and fidelity are compromised.
Biased models can also discriminate against certain groups of people. It is therefore very important to eliminate bias to avoid these risks.
How To Identify Partiality, Preference, and Prejudice?
Bias is the tendency of a statistic to overestimate or underestimate a parameter.
Bias causes your results to sway from the accurate measure and thus causes sampling errors.
Partiality, preference, and prejudice in a given data set can be identified by categorizing them into the appropriate type of bias.
We can categorize common statistical and cognitive biases as follows:
1. Selection Bias
2. Linearity Bias
3. Confirmation Bias
4. Recall Bias
5. Survivor Bias
Selection Bias
This type of bias usually occurs when a model itself influences the creation of the data that is used to train it. Selection bias is said to occur when the sample data gathered is not representative of the true future population of cases the model will see. This bias occurs mostly in systems that rank content, such as recommendation systems, polls, or personalized advertisements. This is because user responses are collected only for the content that is displayed, while responses to content that is not displayed remain unknown.
Types of Selection Bias
o Sampling bias: occurs when randomization is not properly achieved during data collection.
o Convergence bias: occurs when data is not collected in a representative manner, e.g. when you collect data by surveying only customers who purchased your product, your dataset does not represent the group of people who did not purchase your product.
o Participation bias: occurs when the data is unrepresentative due to participation gaps in the data collection process.
So let's say Apple launched a new iPhone and on the same day Samsung launched a new Galaxy Note. You send out surveys to 1000 people to collect their reviews. Now, instead of randomly selecting the responses for analysis, you decide to choose the first 100 customers that responded to your survey. This will lead to sampling bias, since those first 100 customers are more likely to be enthusiastic about the product and likely to provide good reviews.
Next, if you decide to collect data by surveying only Apple customers and leaving out Samsung customers, you will induce a convergence bias in your dataset.
Lastly, you send the survey to 500 Apple and 500 Samsung customers. 400 Apple customers respond but only 100 Samsung customers respond. This dataset would underrepresent the Samsung customers and would count towards participation bias.
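To make this concrete, here is a minimal Python sketch of the sampling-bias scenario above; the population of 1000 respondents and the assumption that the most enthusiastic customers respond first are illustrative, not real survey data:

import random

random.seed(42)

# Hypothetical population of 1000 survey recipients, each with a satisfaction
# rating; assume the most enthusiastic customers respond first.
ratings = [random.gauss(3.0, 1.0) for _ in range(1000)]
first_100 = sorted(ratings, reverse=True)[:100]   # biased: earliest responders
random_100 = random.sample(ratings, 100)          # unbiased: random selection

print("True average rating: %.2f" % (sum(ratings) / len(ratings)))
print("First-100 average:   %.2f" % (sum(first_100) / 100))   # overestimates
print("Random-100 average:  %.2f" % (sum(random_100) / 100))  # near the truth

The first-100 average lands well above the true average, while the random sample stays close to it, which is exactly the distortion the example describes.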
Linearity Bias
Linearity bias assumes that a change in one quantity produces an equal and proportional change in another. Unlike selection bias, linearity bias is a cognitive bias: it is produced not through some statistical process but through how we mistakenly perceive the world around us.
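A small illustrative sketch, assuming a made-up quadratic relationship, shows how a linearity-biased guess goes wrong as we extrapolate:

# True relationship: y = x ** 2 (non-linear). We observed one data point and
# extrapolate proportionally -- the linearity-biased assumption.
observed_x, observed_y = 2, 4

def linear_guess(x):
    # "Double the input, double the output."
    return observed_y * (x / observed_x)

for x in (4, 8):
    print(f"x={x}: linear guess={linear_guess(x):.0f}, actual={x ** 2}")
# x=4: linear guess=8, actual=16
# x=8: linear guess=16, actual=64 -- the error grows as we extrapolate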
Confirmation Bias
Confirmation Bias, or Observer Bias, is an outcome of seeing what you want to see in the data. This can occur when researchers go into a project with subjective thoughts about their study, whether conscious or unconscious. We can also encounter it when labelers allow their subjective thoughts to control their labeling habits, which results in inaccurate data.
E.g. you trained a model to rank sports cars according to their speed using some features. Your model results show that the Ferrari was faster than the Ford. However, you remember watching a movie a few years back in which a Ford beats a Ferrari, so you believe that the Ford is faster and keep training and re-running the model until it gives you the results you believe.
Recall Bias
Recall Bias is a type of measurement bias. It is common at the data labeling stage of any project. This type of bias occurs when you label similar types of data inconsistently, resulting in lower accuracy. For example, let us say we have a team labeling images of damaged laptops. The laptops are tagged as damaged, partially damaged, or undamaged. Now, if someone in the team labels one image as damaged and a similar image as partially damaged, your data will obviously be inconsistent.
Survivor Bias
Survivorship bias is based on the concept that we tend to twist data sets by focusing on successful examples and ignoring the failures. This type of bias also occurs when we look at competitors. For example, while starting a business we usually take the examples of businesses in a similar sector that have performed well, and often ignore the businesses that have incurred heavy losses, gone bankrupt, merged, etc.
While it is arguable that we do not want to copy failure, we can still learn a lot by understanding a range of customer experiences. The only way to avoid survivor bias in our systems is to gather as many inputs as possible and study the failures as well as the average performers.
Probability for Statistics
Probability is all about quantifying randomness.
It is the basis of how we make predictions in statistics.
We can use probability to predict how likely or unlikely particular events may be.
We can also, if needed, make informal predictions beyond the scope of the data we have analyzed.
Probability is a very essential tool in statistics. There are two problems whose solutions illustrate the difference between probability and statistics.
Problem 1: Assume a coin is "fair".
Question: If the coin is tossed 10 times, how many times will we get "tails" on the top face?
Problem 2: You pick up a coin.
Question: Is this a fair coin? That is, does each face have an equal chance of appearing?
Problem 1 is a mathematical probability problem.
Problem 2 is a statistics problem that can use the mathematical probability model determined in Problem 1 as a tool to seek a solution.
Neither question has a deterministic answer. Tossing a coin produces random outcomes, which suggests that the answers are probabilistic.
The solution to Problem 1 starts with the assumption that the coin is fair. It then proceeds to logically deduce the numerical probability of each possible count of "tails" resulting from 10 tosses. The possible counts are 0, 1, …, 10.
The solution to Problem 2 starts with an unfamiliar coin; we do not know if it is fair or biased. The search for an answer is experimental: toss the coin, see what happens, and examine the resulting data to see whether they look as if they came from a fair coin or a biased one.
In Problem 1, we form our answer from logical deductions. In Problem 2, we form our answer by observing experimental results.
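A short Python sketch of both problems; the 10,000-toss experiment and the hidden tails probability of 0.6 are illustrative assumptions:

import random
from math import comb

random.seed(7)

# Problem 1 (probability): logically deduce P(k tails in 10 fair tosses).
n = 10
for k in range(n + 1):
    print(f"P({k} tails) = {comb(n, k) * 0.5 ** n:.4f}")  # binomial probabilities

# Problem 2 (statistics): toss an unknown coin many times and inspect the data.
true_p_tails = 0.6  # hidden from the experimenter in real life
tosses = sum(random.random() < true_p_tails for _ in range(10_000))
print(f"Observed fraction of tails: {tosses / 10_000:.3f}")  # far from 0.5 -> looks biased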
Central Limit Theorem
The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.
The Central Limit Theorem is a statistical theory stating that, given a sufficiently large sample size drawn from a population with finite variance, the mean of all samples from the same population will be roughly equal to the mean of the population. This holds true regardless of whether the source population is normal or skewed, provided that the sample size is sufficiently large.
A few points to note about the Central Limit Theorem:
✓ The Central Limit Theorem states that the distribution of sample means nears a normal distribution as the sample size gets bigger.
✓ Sample sizes equal to or greater than 30 are generally considered sufficient for the Central Limit Theorem to hold.
✓ A key aspect of the Central Limit Theorem is that the average of the sample means equals the population mean, while the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size.
✓ A sufficiently large sample size can predict the characteristics of a population very accurately.
Understanding the Central Limit Theorem with the help of an example
Consider that there are 50 houses in your area, and each house has 5 people. Our task is to calculate the average weight of the people in your area.
The usual approach that the majority would follow is:
1. Measure the weights of all the people in your area
2. Add up all the weights
3. Divide the total sum of the weights by the total number of people to calculate the average
However, the question here is: what if the size of the data is enormous? Does this way of calculating the average make sense? Of course, the answer is no. Measuring the weight of every person would be a very tiring and lengthy process.
Understanding the Central Limit Theorem with the help of an example
As a workaround, there is an alternative approach we can take:
1. To start with, draw groups of people at random from your area. Each group is called a sample. We will draw multiple samples, each consisting of 30 people
2. Calculate the individual mean of each sample
3. Calculate the mean of these sample means
4. In addition, a histogram of the sample mean weights will resemble a normal distribution
This is what the Central Limit Theorem is all about.
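A minimal Python sketch of this procedure, using a made-up population of 250 people with random weights:

import random
import statistics

random.seed(1)

# Hypothetical area: 50 houses x 5 people = 250 people with made-up weights (kg).
population = [random.uniform(30, 100) for _ in range(250)]

# Steps 1-2: draw many random samples of 30 people; record each sample's mean.
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(1_000)]

# Step 3: the mean of the sample means approximates the true average weight.
print("True average weight:  %.2f kg" % statistics.mean(population))
print("Mean of sample means: %.2f kg" % statistics.mean(sample_means))
# Step 4: a histogram of sample_means would resemble a normal (bell) curve.

Running this shows the two averages agreeing closely, even though each sample measures only 30 of the 250 people.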
Formula for Central Limit Theorem
Now let us move ahead and understand the formula for the Central Limit Theorem:

μx̄ = μ        σx̄ = σ / √n

Where,
μ = Population mean
σ = Population standard deviation
μx̄ = Mean of the sample means
σx̄ = Standard deviation of the sample means (the standard error)
n = Sample size
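For instance (with illustrative numbers, not from the chapter): if a population has mean μ = 70 and standard deviation σ = 12, then for samples of size n = 36 the sample means will center on μx̄ = 70 with σx̄ = 12 / √36 = 2.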
WHY IS THE CENTRAL LIMIT THEOREM SO IMPORTANT?
The Central Limit Theorem states that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size increases.
This is helpful because a researcher never knows whether the mean of any single sample matches the population mean; however, by selecting many random samples from the population, the sample means will cluster together, allowing the researcher to make a good estimate of the population mean.
Moreover, as the sample size increases, the error will decrease.
Practical implementations of the Central Limit Theorem
1. Voting polls estimate the number of people who support a particular election candidate. The confidence intervals that news channels report with their results are calculated using the Central Limit Theorem.
2. The Central Limit Theorem can also be used to estimate the mean family income for a specific region.
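As a sketch of the polling case, assuming invented poll numbers and the standard 95% normal quantile of 1.96:

import math

# Hypothetical poll: 1000 respondents, 540 support candidate A.
n, supporters = 1000, 540
p_hat = supporters / n  # sample proportion

# By the CLT, p_hat is approximately normal with standard error sqrt(p(1-p)/n),
# so a 95% confidence interval uses the normal quantile 1.96.
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se

print(f"Estimated support: {p_hat:.1%} +/- {margin:.1%}")
# Estimated support: 54.0% +/- 3.1%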
Exercises: Objective Type Questions
Please choose the correct option in the questions below.
1. What is the Data Science term used to describe partiality, preference, and prejudice?
a) Bias
b) Favoritism
c) Influence
d) Unfairness
Answer: a
2. Which of the following is NOT a type of bias?
a) Selection Bias
b) Linearity Bias
c) Recall Bias
d) Trial Bias
Answer: d
Exercises: Objective Type Questions
3. Which of the following is not a correct statement about probability?
a) It must have a value between 0 and 1
b) It can be reported as a decimal or a fraction
c) A value near 0 means that the event is not likely to occur
d) It is the collection of several experiments
Answer: d
4. The Central Limit Theorem states that the sampling distribution of the sample mean is approximately normal if
a) All possible samples are selected
b) The sample size is large
c) The standard error of the sampling distribution is small
Answer: b
Exercises: Objective Type Questions
5. The Central Limit Theorem says that the mean of the sampling distribution of the sample mean is
a) Equal to the population mean divided by the square root of the sample size
b) Close to the population mean if the sample size is large
c) Exactly equal to the population mean
Answer: c
6. Samples of size 25 are selected from a population with mean 40 and standard deviation 7.5. The mean of the sampling distribution of the sample mean is
a) 7.5
b) 8
c) 40
Answer: c
Standard Questions
1. Explain what bias is and why it occurs in data science.
Ans: We often come across situations where a special fondness for a particular thing makes us slightly partial towards it. In most cases, this partiality affects the outcome, skewing it in favor of that thing. Naturally, this is not the right way to deal with data at a larger scale.
This partiality, preference, and prejudice towards a set of data is called bias.
In Data Science, bias is a deviation from the expected outcome in the data. Fundamentally, you can also think of bias as error in the data. However, this error is often indistinct and goes unnoticed.
Standard Questions
2. Explain Selection Bias with the help of an example.
Ans: This type of bias usually occurs when a model itself influences the creation of the data that is used to train it. Selection bias is said to occur when the sample data gathered is not representative of the true future population of cases the model will see. This bias occurs mostly in systems that rank content, such as recommendation systems, polls, or personalized advertisements. This is because user responses are collected only for the content that is displayed, while responses to content that is not displayed remain unknown.
E.g. let's say Apple launched a new iPhone and on the same day Samsung launched a new Galaxy Note. You send out surveys to 1000 people to collect their reviews. Now, instead of randomly selecting the responses for analysis, you decide to choose the first 100 customers that responded to your survey. This will lead to sampling bias, since those first 100 customers are more likely to be enthusiastic about the product and likely to provide good reviews.
Standard Questions
3. Explain Recall Bias with the help of an example.
Ans: Recall Bias is a type of measurement bias. It is common at the data labeling stage of any project. This type of bias occurs when you label similar types of data inconsistently, resulting in lower accuracy. For example, let us say we have a team labeling images of damaged laptops. The laptops are tagged as damaged, partially damaged, or undamaged. Now, if someone in the team labels one image as damaged and a similar image as partially damaged, your data will obviously be inconsistent.
Standard Questions
4. Explain Linearity Bias with the help of an example.
Ans: Linearity bias assumes that a change in one quantity produces an equal and proportional change in another. Unlike selection bias, linearity bias is a cognitive bias: it is produced not through some statistical process but through how we mistakenly perceive the world around us. For example, expecting that doubling an advertising budget will double sales is a linearity-biased assumption.
Standard Questions
5. Explain Confirmation Bias with the help of an example.
Ans: Confirmation Bias, or Observer Bias, is an outcome of seeing what you want to see in the data. This can occur when researchers go into a project with subjective thoughts about their study, whether conscious or unconscious. We can also encounter it when labelers allow their subjective thoughts to control their labeling habits, which results in inaccurate data.
Standard Questions
6. What is the Central Limit Theorem?
Ans: The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.
It is a statistical theory stating that, given a sufficiently large sample size drawn from a population with finite variance, the mean of all samples from the same population will be roughly equal to the mean of the population. This holds true regardless of whether the source population is normal or skewed, provided that the sample size is sufficiently large.
Standard Questions
1. Explain what bias is and why it occurs in data science.
2. Explain Selection Bias with the help of an example.
3. Explain Recall Bias with the help of an example.
4. Explain Linearity Bias with the help of an example.
5. Explain Confirmation Bias with the help of an example.
6. What is the Central Limit Theorem?
7. What is the formula for the Central Limit Theorem?
8. What is a real-life application of the Central Limit Theorem?
9. Why is the Central Limit Theorem important?
10. The coaches of various sports around the world use probability to better their game and create gaming strategies. Can you explain how probability is applied in this case and how it helps players?
Standard Questions
7. What is the formula for the Central Limit Theorem?
Ans: The formula for the Central Limit Theorem is:
μx̄ = μ        σx̄ = σ / √n
Where,
μ = Population mean
σ = Population standard deviation
μx̄ = Mean of the sample means
σx̄ = Standard deviation of the sample means (the standard error)
n = Sample size
Standard Questions
8. What is a real-life application of the Central Limit Theorem?
Ans: Practical implementations of the Central Limit Theorem include:
1. Voting polls estimate the number of people who support a particular election candidate. The confidence intervals that news channels report with their results are calculated using the Central Limit Theorem.
2. The Central Limit Theorem can also be used to estimate the mean family income for a specific region.
Standard Questions
9. Why is the Central Limit Theorem important?
Ans: The Central Limit Theorem states that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size increases.
This is helpful because a researcher never knows whether the mean of any single sample matches the population mean; however, by selecting many random samples from the population, the sample means will cluster together, allowing the researcher to make a good estimate of the population mean.
Moreover, as the sample size increases, the error will decrease.
Standard Questions
10. The coaches of various sports around the world use probability to better their game and create gaming strategies. Can you explain how probability is applied in this case and how it helps players?
Ans: Coaches use probability to decide the best possible strategy to pursue in a game. When a particular batter goes up to bat in a baseball game, the players and coach can look up that player's batting average to estimate how the player is likely to perform. The coach can then plan their approach accordingly.
Higher Order Thinking Skills
1. As per reports, in October 2019, researchers found that an algorithm used on more than 200 million people in US hospitals to predict which patients would likely need extra medical care heavily favored white patients over black patients. Can you reason about what must have caused this bias and categorize it into the types of bias that you learnt in this chapter?
2. The recorded percentage of the population who speak English in India follows a normal distribution. The mean and the standard deviation are 62 and 5, respectively. If a person draws a sample of 50 people from the population, what would be the mean and the standard deviation of the chosen sample?
Applied Project
Consider that your friend is planning to open a clothing store in your area. With the help of the Central Limit Theorem, determine what type and collection of clothes would sell better in your area.
THANK YOU
