
Data Collection

Topic 03 - Data Collection ST1131 1 / 48


1 Lurking and Confounding variables

2 Observational and Experimental Studies

3 Good Ways to Sample for Observational Studies

4 Experimental Studies



An Example: Education and Crime

Example 1 (Education and Crime Data)

Data collected on crimes in the different counties in Florida. (a)

Crime: the number of crimes in that county in the past year (Response).
Education rate: percentage of residents aged ≥ 25 who completed high
school.
Urbanization level: the percentage of residents living in a metropolitan area
was also recorded.

(a) Statistics: The Art and Science of Learning from Data, 4th edition, Agresti, Franklin and
Klingenberg, Chapter 3



Scatter plot: Crime vs. Education

Cor(Crime, Edu) = 0.47.


Question to explore: From the positive Cor(Crime, Edu), can we conclude
that having a more highly educated populace causes crime to go up?

Scatter plot: Crime vs. Urbanization

Another variable measured in the study was urbanization, the percentage of
residents living in a metropolitan area.



Think It Through

Cor(Urban, Crime) = 0.68.

Cor(Urban, Edu) = 0.79.

So, perhaps the reason for positive Cor(Crime, Edu) is that: education rate
tends to be higher in more highly urbanized counties, but crime also tends to
be higher in such counties.



Insight: Conditioning by Urbanization

Categorize urbanization into 3 levels: low (0 - 30), medium (30 - 75) and
high (> 75).

Make a scatter plot of crime (y-axis) vs. education (x-axis) for each level of
urbanization.



Conditional Scatter plots

Within each panel, the correlation between crime and education is 0.137,
0.177 and -0.351 respectively.



Conditional Scatter plots (cont)

When the urbanization level is low or medium, Cor(Crime, Edu) is very weak
(0.137, 0.177).

When the urbanization level is high, Cor(Crime, Edu) is somewhat stronger but
negative (-0.351).

If we had not considered urbanization, we might have wrongly concluded
that education causes crime!

Correlation does not imply causation.
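
The effect of conditioning can be reproduced on made-up data. The sketch below is only an illustration, not the Florida data set: it assumes synthetic counties in which urbanization drives both education and crime, while education has no direct effect on crime. The coefficients, the noise levels, the seed and the helper names (pearson, urban_level) are all assumptions, and the treatment of a county with urbanization of exactly 30 is a guess, since the slide does not say which level it belongs to.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def urban_level(u):
    """Bin urbanization (%) into low / medium / high.

    Assumption: exactly 30 counts as medium; exactly 75 is medium because
    the slide defines high as > 75.
    """
    if u < 30:
        return "low"
    if u <= 75:
        return "medium"
    return "high"


rng = random.Random(1131)  # fixed seed so the sketch is repeatable

# Synthetic counties (NOT the Florida data). Urbanization feeds into both
# variables; education never enters the formula for crime.
urban = [rng.uniform(0, 100) for _ in range(6000)]
edu = [40 + 0.3 * u + rng.gauss(0, 8) for u in urban]
crime = [150 + 1.0 * u + rng.gauss(0, 30) for u in urban]

overall = pearson(edu, crime)

within = {}
for level in ("low", "medium", "high"):
    idx = [i for i, u in enumerate(urban) if urban_level(u) == level]
    within[level] = pearson([edu[i] for i in idx], [crime[i] for i in idx])

print(f"overall Cor(Crime, Edu): {overall:.2f}")
for level, r in within.items():
    print(f"  urbanization {level}: {r:.2f}")
```

In this synthetic setting the overall correlation is clearly positive, while the correlation inside each urbanization level is much weaker, which mirrors the pattern in the conditional scatter plots.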



Lurking Variables

Definition 1 (Lurking Variable)

A lurking variable is a variable, usually unobserved, that influences the
association between the variables of primary interest.



Confounding Variables

Confounding occurs when two explanatory variables are associated with a
response variable, but are also associated with each other.

A confounding variable is a variable that is included in the dataset.

It's difficult to tell which of the two explanatory variables, if any, is
causing a change in the response.

In Example 1 (Education and Crime Data), Urbanization is a confounding
variable.



Lurking Variable vs Confounding Variables

A lurking variable is one that is typically not measured in the study; it has
the potential for confounding.

In Example 1, Urbanization was a confounding variable (not lurking) since it
was measured.



Cell Phone Usage

Suppose we wish to investigate if the use of cell phones is a health risk. Three
studies were conducted.

Example 2 (Cell Phone Study 1)

A German study in 2001 (Stang et al., 2001) compared 118 eye cancer
patients to 475 healthy subjects (who did not have eye cancer).
All participants were given a questionnaire, in which they were queried on
their cell phone usage.
On average, the eye cancer patients used cell phones more often.



Cell Phone Usage

Example 3 (Cell Phone Study 2)

A British study (Hepworth et al., 2006) compared 966 patients who had
brain cancer to 1716 people without brain cancer.
Again, they were interviewed on their cell phone use patterns via a
questionnaire.
The investigators found little difference between the groups with regard to
their cell phone usage.



Cell Phone Usage

Example 4 (Cell Phone Study 3)

In a study (Volkow et al., 2011), 47 participants had a cell phone attached
to their ears through which a 50-minute muted call was transmitted.
During the call, a specific type of brain activity was monitored by PET scan.
The 47 participants again had a cell phone attached to their ears, but the phone
was turned off. Their brain activity was monitored.
On both occasions, the participants did not know if the phone was on or off.
The study found a significant increase in brain activity in the region closest
to the antenna during the call.



Observational Studies

In the first two studies, the subjects were merely observed.

Definition 2 (Observational Study)

In an observational study, the values for the response variable and explanatory
variables are observed for the sampled subjects, without anything being done to
them.
Example: Cell Phone Study 1 and Study 2.



Experimental Studies

In the third study, cell phone usage was assigned to the participants and
monitored.

Definition 3 (Experimental Study)

An experiment is conducted by assigning subjects to certain experimental
conditions (treatments) and then observing the outcome on the response
variable.

Example: Cell Phone Study 3.



Advantage of Experimental Studies

In Study 1, increased computer usage could cause eye cancer and also
be positively associated with increased cell phone usage, so computer usage
is a possible lurking variable.

In an experimental study, we can control for lurking variables by
randomly assigning the treatment.

With an experimental study, we can be more confident in determining
causality between the explanatory variable and the response variable.



Disadvantages of Experimental Studies

Example 5 (An Ideal Cell Phone Study)

Randomly select 1000 students from NUS.
Randomly assign 500 of them to use a cell phone for at least 5 hours every
day for the next 30 years.
The other 500 are not allowed to use a cell phone at all for the next 30 years.
At the end of 30 years, measure the proportion with cancer in each group and
compare them.

It is not ethical to put those 500 students at any risk of cancer, even though
we do not know whether the risk is real.
Who can refrain from using a cell phone for 30 years?
30 years is a long time to wait.



Observational or Experimental?

We measure the candy consumption of various countries, along with the
population. One of the goals is to estimate the candy consumption of
Singapore.

We wish to investigate the ability of a new drug to reduce hypertension. We
recruit a set of participants, and randomly assign either a placebo or the new
drug to each of them.

We wish to investigate the use of a post-menopausal hormone drug and its
association with breast cancer. We obtain a list of breast cancer patients
currently undergoing treatment at NUH and telephone them to ask if they
had used this drug in the past 10 years. We also obtain a list of patients at
NUH who do not have breast cancer, and telephone them to ask the same
question.



Sample Surveys

Sample Survey: A study that asks questions of, or takes measurements on, the
subjects in a sample drawn randomly from the population.

Suppose we wish to study the preferences of NUS students on study hours,
part-time job preference, etc.

▶ Select a group of NUS students that is representative of all NUS
students.

▶ Record the information from them.

▶ The question is: how can we select a sample that represents the population well?



Steps of a Sample Survey

Step 1: Identify the Population

Step 2: Compile a list of subjects in the population from which the sample
will be taken, called the sampling frame.

Ideally, the sampling frame lists all subjects in the population.

Step 3: Specify a method for selecting subjects from the sampling frame,
called the sampling design.

Step 4: Collect data from the chosen sample.



Sampling Designs: by Convenience

Population = all NUS students. The sampling frame = all registered
students. From this sampling frame, we want to select a sample of 100
students.

If I select 100 students from this statistics class, do you think these 100
students would represent the population well in terms of gender, year of study, etc.?
It's doubtful!

The sample above was selected simply by convenience, and it might not be
representative of the population.



Sampling Designs: by Chance

If we use chance rather than convenience, we can get a better sample which
could represent the population well.

Good sampling designs employ randomization.



Simple Random Sampling

Definition 4 (Simple Random Sample (SRS))

A simple random sample of n subjects from a sampling frame is one in which
each possible sample of size n has the same chance of being selected.

A simple random sample tends to be representative of the population,
because no subject is favored in the selection.

A representative sample will allow us to make inferences about the
population.



Select a Simple Random Sample

Step 1: number the subjects in the sampling frame.

Step 2: generate a set of n distinct random numbers from that range. Nowadays,
this step can be done easily by computer.

Step 3: the subjects whose numbers appear in the set from Step 2 form the SRS.
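
The three steps can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the sampling frame of 40,000 student IDs is hypothetical, as are the function name simple_random_sample and the seed. It relies on the standard library's random.sample, which draws without replacement.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw an SRS of n subjects from a sampling frame (a list)."""
    rng = random.Random(seed)
    # Step 1: number the subjects 1..N.
    numbered = dict(enumerate(frame, start=1))
    # Step 2: generate n distinct random numbers between 1 and N.
    chosen = rng.sample(range(1, len(frame) + 1), n)
    # Step 3: the subjects carrying those numbers form the sample.
    return [numbered[k] for k in chosen]


# Hypothetical sampling frame: 40,000 registered students.
frame = [f"student_{i:05d}" for i in range(1, 40001)]
sample = simple_random_sample(frame, 100, seed=3)
print(len(sample), "students selected")
```

Fixing the seed makes the draw repeatable, which helps when a sample has to be audited; leaving it as None gives a fresh sample on each run.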



Other Ways of Getting a Random Sample

Besides simple random sampling, there are other sampling designs
that involve randomization, such as:

Cluster Random Sampling

Stratified Random Sampling
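
The slides only name these designs. As a rough sketch of one of them, stratified random sampling is commonly described as splitting the frame into groups (strata) and drawing a simple random sample inside each group. The code below follows that common description; the frame, the choice of year of study as the stratifying variable, the equal allocation of 25 per stratum and the function name are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of fixed size within every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject in frame:
        strata[stratum_of(subject)].append(subject)
    sample = []
    for name in sorted(strata):
        # SRS inside one stratum, drawn without replacement.
        sample.extend(rng.sample(strata[name], n_per_stratum))
    return sample


# Hypothetical frame of (student id, year of study) pairs, years 1 to 4.
frame = [(f"student_{i:04d}", 1 + i % 4) for i in range(2000)]
sample = stratified_sample(frame, stratum_of=lambda s: s[1],
                           n_per_stratum=25, seed=7)
print(len(sample), "students selected")
```

With this allocation every year of study is guaranteed 25 students in the sample, something a plain SRS of 100 would only achieve approximately.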



Collecting Data from a Chosen Sample

Given a sample of subjects, there are a few ways to collect data.

Personal face-to-face interview. An advantage of this is that it is easier to
get people to respond, but a disadvantage is that it is costly.

Telephone interview. This is cheaper to implement, but it is easier for
subjects to refuse to participate.

Self-administered questionnaire. This is cheaper and less labor-intensive
than the previous two, but many more subjects may fail to participate.



Possible Sources of Bias in Sample Surveys

Sampling bias. This arises from the sampling design step or the sampling frame step.

▶ When the sample is not random.

▶ When the sampling frame does not represent the full population (undercoverage).

E.g. a telephone survey will not include students without a registered phone.

Non-sampling bias. This does not arise from the sampling design. It includes
nonresponse bias and response bias.

▶ Nonresponse bias: some sampled subjects cannot be reached or refuse to
participate.

E.g. is it possible that students who do not respond are those who study
more hours? If so, our results will be biased low for the questions on study
hours per week.

▶ Response bias: this occurs when a participant answers a question dishonestly
or incorrectly.
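
The study-hours example of nonresponse bias can be made concrete with a tiny worked calculation. The numbers below are invented for illustration, and the rule that students studying 20 or more hours never reply is an assumption chosen to make the effect visible.

```python
from statistics import mean

# Hypothetical weekly study hours for the 10 students in a sample.
study_hours = [5, 8, 10, 12, 15, 18, 20, 25, 30, 37]

# Assumption for illustration: students who study 20+ hours do not respond.
respondents = [h for h in study_hours if h < 20]

full_mean = mean(study_hours)   # what we would get if everyone replied
resp_mean = mean(respondents)   # what the survey actually reports

print(f"mean over all sampled students: {full_mean:.1f} hours")
print(f"mean over respondents only:     {resp_mean:.1f} hours")
```

The respondent-only mean falls below the mean for the full sample, so the survey understates study hours even though the sample itself was drawn properly.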



Poor Alternatives to Simple Random Sampling

A convenience sample is one selected based on ease of access rather than
by chance. You may have encountered this outside a mall or at an MRT
station.

Volunteer samples, where people are encouraged to participate in the survey
via a flyer or email. This can yield incorrect inferences.



The US 1936 Presidential Election Poll

Example 6 (The US 1936 Presidential Election Poll)

Literary Digest (LD) magazine conducted a poll to predict the result of the
1936 presidential election, between Franklin Roosevelt (Democrat) and Alf
Landon (Republican).
The sampling frame was constructed from telephone directories, country
club memberships and automobile registrations.
LD mailed questionnaires to 10 million people, asking how they planned to
vote.
2.3 million people returned the questionnaire. LD then predicted that
the Republican would win, getting 57% of the vote.
In fact, the Republican got only 36% and the Democrat won the election. (a)

(a) Bryson, M. C. (1976), American Statistician, vol. 30, pp. 184-185



The US 1936 Presidential Election Poll (cont)

Population: all registered voters in the US in 1936.

This survey had 2 severe problems:

1 Sampling bias due to undercoverage.

2 Nonresponse bias. Of the 10 million people who were mailed a questionnaire, 7.7
million did not respond.

A large sample size does not guarantee an unbiased sample.
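
A small simulation shows why size alone does not help. The electorate below is synthetic: the 57% and 36% figures echo the example, but the population of 100,000, the 30% frame coverage and the seed are made-up assumptions, and nonresponse is ignored to keep the sketch short.

```python
import random

# Synthetic electorate: 1 = supports the candidate, 0 = does not.
# Voters covered by the frame (30,000 people, 57% support).
in_frame = [1] * 17_100 + [0] * 12_900
# Voters missing from the frame (70,000 people, 27% support).
not_in_frame = [1] * 18_900 + [0] * 51_100
population = in_frame + not_in_frame   # 100,000 voters in total

true_share = sum(population) / len(population)

# Very large sample, but taken only from the frame (undercoverage).
biased_estimate = sum(in_frame) / len(in_frame)

# Small simple random sample from the whole population.
rng = random.Random(1936)
srs = rng.sample(population, 1000)
srs_estimate = sum(srs) / len(srs)

print(f"true support:              {true_share:.2f}")
print(f"30,000 voters, frame only: {biased_estimate:.2f}")
print(f"SRS of 1,000 voters:       {srs_estimate:.2f}")
```

Under these assumptions the sample of 30,000 misses the true share by 21 percentage points, while the random sample of 1,000 lands close to it.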



Summary: Key Parts of a Sample Survey

Identify the population of all the subjects of interest

Define a sampling frame, which attempts to list all the subjects in the
population.

Use a random sampling design.

Be cautious about bias: nonresponse bias as well as response bias.



Experimental Studies

In an experiment, subjects are referred to as experimental units: humans,
mice, stores, schools, etc.

We assign each subject to an experimental condition, called a treatment. The
outcome on the response variable is then recorded.

The goal of the experiment is to investigate the association: how the
treatment affects the response.

The advantage of an experimental study over a non-experimental study is that
it provides stronger evidence for causation.



Control Groups And Randomization

Elements of a good experimental study are

1 Having a control comparison group.

2 Randomization: randomly assigning treatments to subjects.

3 Blinding the study: the subjects do not know whether they receive the actual
treatment or a placebo.



An Example: Antidepressants for Quitting Smoking

Example 7 (Antidepressants for Quitting Smoking)

Purpose: to know whether antidepressants help people quit smoking.
429 men and women (aged ≥ 18) had smoked ≥ 15 cigarettes per day for
the previous year. Subjects were highly motivated to quit.
Subjects were assigned to 1 of 2 groups: 1 group took 300 mg daily of an
antidepressant (bupropion). The other group did not take an antidepressant.
At the end of the year, the study observed whether each subject had
successfully abstained from smoking or had relapsed.



An Example: Antidepressants for Quitting Smoking (cont)

1 Identify: the experimental units; the response; treatment (explanatory).

2 How should the researcher assign the subjects to the two treatment groups?

3 Without knowing more about this study, what would you identify as a
potential problem with the study design?

4 What numerical descriptive statistics could be used for the analysis (to
compare the 2 groups after 1 year)?



Antidepressants for Quitting Smoking: Better Version

Example 8 (Antidepressants for Quitting Smoking)

Purpose: to know whether antidepressants help people quit smoking.
429 men and women (aged ≥ 18) had smoked ≥ 15 cigarettes per day for
the previous year. Subjects are highly motivated to quit.
Subjects are randomly assigned to 1 of 2 groups: 1 group takes 300 mg daily
of an antidepressant (bupropion). The other group takes a placebo, and all
participants are blinded to what they take.

After one year, the study observes whether each subject has successfully
quit smoking or still smokes (the number of cigarettes per day is
recorded).



The Role of Randomization in Experimental Studies

We use randomization for assigning subjects to treatments

to eliminate bias that may appear if we assign subjects by hand.

to balance the groups on lurking variables that we know affect the
response.

to balance the groups on lurking variables that may be unknown to us.
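
Random assignment and its balancing effect can be sketched as follows. The 429 subjects match the size of the smoking study, but everything else is an assumption: the ages are simulated, age merely stands in for a lurking variable the researcher never inspects, and the 215 / 214 split is one arbitrary way of dividing an odd number of subjects.

```python
import random
from statistics import mean

rng = random.Random(429)  # fixed seed so the sketch is repeatable

# Hypothetical subjects: an id plus an age that plays no part in the assignment.
subjects = [{"id": i, "age": rng.uniform(18, 70)} for i in range(429)]

# Random assignment: shuffle a copy, then split it into two groups.
shuffled = subjects[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:215], shuffled[215:]

# Compare the groups on the variable that was never used for assignment.
age_gap = mean(s["age"] for s in treatment) - mean(s["age"] for s in control)

print(f"treatment group: {len(treatment)} subjects")
print(f"control group:   {len(control)} subjects")
print(f"difference in mean age: {age_gap:.2f} years")
```

Because chance, not the researcher, decides who goes where, the two groups tend to have similar mean ages, and the same holds for variables nobody thought to record.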



Summary: Key Parts of a Good Experiment

Experimental units.

The treatments correspond to values of an explanatory variable.

A good experiment should have: control comparison groups, randomization
and blinding.

