Data Collection
Define the Population
•Clear Definition: Specify who or what constitutes the population of interest. This includes
defining the characteristics that qualify individuals or items for inclusion in the study.
•Scope and Boundaries: Clearly outline the geographical, temporal, and demographic
boundaries of the population.
Determine the Sample Size
•Statistical Power: Larger samples provide more accurate estimates but are costlier and more time-consuming. Use
power analysis to determine the minimum sample size needed to detect an effect or difference.
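A minimal power-analysis sketch using statsmodels' TTestIndPower; the effect size (Cohen's d = 0.5), alpha (0.05), and power (0.8) are conventional illustrative choices, not values taken from any study here.

```python
# Minimal power-analysis sketch (illustrative values).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the per-group sample size needed to detect a medium effect
# (Cohen's d = 0.5) at alpha = 0.05 with 80% power -- common conventions.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Minimum sample size per group: {n_per_group:.0f}")  # ~64
```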
Choose a Sampling Method
•Probability Sampling: Methods where every member of the population has a known, non-zero chance of being
selected (a code sketch follows this list). This includes:
•Simple Random Sampling: Each member has an equal chance of selection.
•Stratified Sampling: Population is divided into strata, and random samples are drawn from each stratum.
•Cluster Sampling: Population is divided into clusters, some of which are randomly selected, and all members of
chosen clusters are sampled.
•Systematic Sampling: Every nth member of the population is selected after a random start.
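The sketch below illustrates all four probability-sampling methods on a toy population; the column names (stratum, cluster), the population size, and the sampling fractions are assumptions chosen for illustration.

```python
# Sketch of the four probability-sampling methods on a toy population.
import random

import pandas as pd

population = pd.DataFrame({
    "id": range(1000),
    "stratum": [random.choice(["A", "B", "C"]) for _ in range(1000)],
    "cluster": [i // 50 for i in range(1000)],  # 20 clusters of 50 members
})

# Simple random sampling: every member has an equal chance.
srs = population.sample(n=100, random_state=42)

# Stratified sampling: draw 10% from each stratum.
stratified = population.groupby("stratum").sample(frac=0.1, random_state=42)

# Cluster sampling: randomly pick 2 clusters, keep all of their members.
chosen = random.sample(sorted(population["cluster"].unique().tolist()), k=2)
cluster_sample = population[population["cluster"].isin(chosen)]

# Systematic sampling: every 10th member after a random start.
start = random.randrange(10)
systematic = population.iloc[start::10]
```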
Choose a Sampling Method
•Non-Probability Sampling: Methods where some members of the population may have no chance of
being selected. This includes:
•Convenience Sampling: Samples are chosen based on ease of access.
•Judgmental or Purposive Sampling: Samples are selected based on the researcher’s judgment
about which members are most useful or representative.
•Quota Sampling: Samples are selected to ensure certain characteristics are represented in specific
proportions.
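Quota sampling is the most mechanical of these, so a short sketch may help; the quota groups, target counts, and the gender attribute below are illustrative assumptions.

```python
# Quota-sampling sketch: fill fixed quotas per group as respondents arrive.
from collections import defaultdict

quotas = {"female": 50, "male": 50}  # target counts per group (illustrative)
counts = defaultdict(int)
sample = []

def try_add(respondent):
    """Accept a respondent only if their group's quota is not yet full."""
    group = respondent["gender"]
    if counts[group] < quotas.get(group, 0):
        counts[group] += 1
        sample.append(respondent)
        return True
    return False  # quota full: respondent is turned away

# Hypothetical stream of respondents:
for person in [{"gender": "female"}, {"gender": "male"}, {"gender": "female"}]:
    try_add(person)
```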
Bias
•Selection Bias
•Measurement Bias
•Response Bias
•Confirmation Bias
•Observer Bias
•Survivorship Bias
•Attrition Bias
•Confounding Bias
Selection Bias
•Selection bias occurs when the sample is not representative of the population due to non-random selection.
This leads to results that cannot be generalized to the entire population.
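A small simulation makes this concrete: drawing only from an "easy to reach" subgroup shifts the estimate no matter how large the sample gets. All numbers below are synthetic.

```python
# Simulating selection bias: sampling only reachable members skews the mean.
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=100_000)  # toy population

random_sample = rng.choice(income, size=500, replace=False)

# Non-random selection: suppose only the top half is reachable (e.g. an
# online survey that misses people without internet access).
reachable = income[income > np.median(income)]
biased_sample = rng.choice(reachable, size=500, replace=False)

print(f"Population mean: {income.mean():,.0f}")
print(f"Random sample:   {random_sample.mean():,.0f}")  # close to population
print(f"Biased sample:   {biased_sample.mean():,.0f}")  # systematically high
```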
Measurement Bias
Measurement bias occurs when the data collection instruments or procedures systematically favor certain
outcomes over others.
•Instrument Bias: Arises from faulty or biased measurement tools. For example, a poorly calibrated scale
that consistently gives incorrect weight readings.
•Interviewer Bias: Occurs when the interviewer's behavior or questioning influences the responses. For example,
leading questions can sway respondents' answers.
•Recall Bias: Happens when participants do not accurately remember past events or experiences. This is common
in retrospective studies.
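A short simulation contrasts random error with instrument bias: random noise averages out as measurements accumulate, but a calibration offset never does. All values below are illustrative.

```python
# Instrument-bias sketch: a miscalibrated scale adds a systematic offset.
import numpy as np

rng = np.random.default_rng(1)
true_weight = 70.0                                  # kg, illustrative
noise = rng.normal(0, 0.5, size=1000)               # random measurement error

good_scale = true_weight + noise                    # unbiased readings
bad_scale = true_weight + 2.0 + noise               # 2 kg calibration offset

print(f"Good scale mean: {good_scale.mean():.2f}")  # ~70.0, error shrinks with n
print(f"Bad scale mean:  {bad_scale.mean():.2f}")   # ~72.0, offset never shrinks
```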
Response Bias
•Response bias occurs when participants do not provide truthful or accurate responses, leading to distorted
data.
Confirmation Bias
Confirmation bias occurs when researchers selectively collect or interpret data in a way that confirms their
preexisting beliefs or hypotheses.
•Data Mining Bias: Occurs when researchers look for patterns in the data that support their hypothesis while
ignoring those that contradict it.
•Publication Bias: Tendency for studies with positive or significant results to be published more often than those
with null or negative results.
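The data-mining variant is easy to demonstrate: test enough unrelated features against a random outcome, and a predictable share will appear "significant" by chance alone. A minimal sketch, assuming SciPy is available; all data are synthetic.

```python
# Data-mining bias sketch: enough tests on random data yield "discoveries".
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
outcome = rng.normal(size=100)

false_positives = 0
for _ in range(100):                      # 100 unrelated candidate features
    feature = rng.normal(size=100)
    r, p = stats.pearsonr(feature, outcome)
    if p < 0.05:
        false_positives += 1

# With alpha = 0.05 we expect ~5 spurious "discoveries" out of 100 tests.
print(f"Spurious significant correlations: {false_positives}")
```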
Observer Bias
Observer bias happens when the researcher's expectations or knowledge influence their observations and
interpretations.
Attrition Bias
Attrition bias occurs when participants drop out of a study over time, and those who remain are systematically
different from those who leave.
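A brief simulation of attrition bias, assuming dropout probability rises as the outcome worsens; the completers' average then overstates the true effect. All data below are synthetic.

```python
# Attrition-bias sketch: dropout correlated with the outcome distorts results.
import numpy as np

rng = np.random.default_rng(3)
improvement = rng.normal(loc=0.0, scale=1.0, size=10_000)  # toy outcome

# Suppose participants who improve less are more likely to drop out.
p_dropout = 1 / (1 + np.exp(improvement))  # lower improvement -> higher dropout
stayed = rng.random(10_000) > p_dropout

print(f"True mean improvement: {improvement.mean():+.2f}")          # ~0.00
print(f"Mean among completers: {improvement[stayed].mean():+.2f}")  # inflated
```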
Case Study: The Impact of Biased Data on Decision-Making
Background
A prominent technology firm, TechSolutions,
implemented a machine learning algorithm to
streamline its hiring process. The goal was to
identify and recruit top talent more efficiently by
analyzing resumes and ranking candidates based
on predicted job performance.
The Problem
After six months of using the algorithm, it was
observed that the number of women and
minority candidates being hired had significantly
decreased. This discrepancy raised concerns
about potential biases in the hiring algorithm.
Data Analysis
•Historical Bias: The training data consisted primarily of resumes from the past
decade, during which the majority of employees hired were white males. This
historical bias was encoded into the algorithm, which learned to favor similar
profiles.
•Feature Selection: Certain features that correlated with higher hiring success,
such as attending specific universities or having particular job titles, were
disproportionately common among white male candidates, leading the
algorithm to favor these candidates.
•Lack of Diversity in Training Data: The dataset lacked sufficient representation
from women and minority groups, preventing the algorithm from accurately
assessing the potential of candidates from these backgrounds.
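A minimal sketch of the mechanism described above: when training labels encode past biased decisions, a model learns to score equally qualified candidates differently by group. The data, features, and model below are synthetic stand-ins, not TechSolutions' actual system.

```python
# Historical-bias sketch: biased past labels teach the model to discriminate.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5_000
group = rng.integers(0, 2, size=n)   # 0 = majority, 1 = minority (synthetic)
skill = rng.normal(size=n)           # true, group-independent ability

# Historical labels: past hiring rewarded skill but also favored group 0.
hired = (skill + 1.5 * (group == 0) + rng.normal(size=n) > 1.0).astype(int)

model = LogisticRegression().fit(np.column_stack([skill, group]), hired)

# Equally skilled candidates get different scores based on group alone.
print(f"P(hire | majority): {model.predict_proba([[0.5, 0]])[0, 1]:.2f}")
print(f"P(hire | minority): {model.predict_proba([[0.5, 1]])[0, 1]:.2f}")  # lower
```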
Consequences