Topic3 DataCollection
4 Experimental Studies
Crime: the number of crimes in that county in the past year (Response).
Education rate: percentage of residents aged ≥ 25 who completed high
school.
Urbanization level: the percentage of residents living in a metropolitan area
was also recorded.
a Statistics: The Art and Science of Learning from Data, 4th edition, Agresti, Franklin and
Klingenberg, Chapter 3
So, perhaps the reason for positive Cor(Crime, Edu) is that: education rate
tends to be higher in more highly urbanized counties, but crime also tends to
be higher in such counties.
Categorize urbanization into 3 levels: low (0-30), medium (30-75) and
high (>75).
Make a scatter plot of crime (y-axis) vs. education (x-axis) for each level of
urbanization.
Within each panel, the correlation between crime and education is 0.137,
0.177 and -0.351 respectively.
When the urbanization level is low or medium, Cor(Crime, Edu) is very low
(0.137, 0.177).
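The idea of comparing the overall correlation with the correlation inside each urbanization level can be sketched as follows. This is only an illustration: the county records are invented (they do not reproduce the 0.137, 0.177 and -0.351 values above), the helper names `pearson` and `urban_level` are hypothetical, and placing the boundary values 30 and 75 in the lower category is an assumption, since the cut points in the text overlap.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def urban_level(pct):
    """Cut points from the text; boundary handling (<=) is an assumption."""
    if pct <= 30:
        return "low"
    if pct <= 75:
        return "medium"
    return "high"

# Invented records: (education %, crime count, urbanization %).
counties = [
    (60, 20, 10), (65, 19, 20), (70, 18, 25),
    (70, 50, 40), (75, 49, 55), (80, 48, 70),
    (80, 90, 80), (85, 89, 90), (90, 88, 95),
]

# Correlation ignoring urbanization.
overall = pearson([c[0] for c in counties], [c[1] for c in counties])

# Correlation within each urbanization level.
within = {}
for level in ("low", "medium", "high"):
    group = [c for c in counties if urban_level(c[2]) == level]
    within[level] = pearson([c[0] for c in group], [c[1] for c in group])

print("overall:", round(overall, 3))
for level, r in within.items():
    print(level, round(r, 3))
```

In this made-up data the overall correlation is positive while every within-level correlation is negative, which shows how a third variable can change the apparent association.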
It's difficult to tell which of the two explanatory variables, if any, is causing a change
in the response.
A lurking variable is one that is typically not measured in the study; it has
a potential for confounding.
Suppose we wish to investigate if the use of cell phones is a health risk. Three
studies were conducted.
A German study in 2001 (Stang et al., 2001) compared 118 eye cancer
patients to 475 healthy subjects (who did not have eye cancer).
All participants were given a questionnaire, in which they were queried on
their cell phone usage.
On average, the eye cancer patients used cell phones more often.
A British study (Hepworth et al., 2006) compared 966 patients who had
brain cancer to 1716 people without brain cancer.
Again, they were interviewed on their cell phone use patterns via a
questionnaire.
The investigators found little difference between the groups with regard to
their cell phone usage.
In an observational study, the values for the response variable and explanatory
variables are observed for the sampled subjects, without anything being done to
them.
Example: Cell phone Study 1 and Study 2.
In the third study, cell phone usage was assigned to the participants and
monitored.
In Study 1, increased computer usage could cause eye cancer, and also
be positively associated with increased cell phone usage.
It is not ethical to put those 500 students at any risk of cancer, even though
we do not know whether the risk is real.
Who can refrain from using a cell phone for 30 years?
30 years is a long time to wait.
Suppose we wish to study the preferences of NUS students on: study hours,
part-time job preference, etc.
▶ The question is: how can we select a sample that represents the population well?
Step 2: Compile a list of subjects in the population from which the sample
will be taken, called the sampling frame.
Step 3: Specify a method for selecting subjects from the sampling frame,
called the sampling design.
If I select 100 students from this statistics class, do you think these 100
students would represent the population well in terms of gender, study year, etc.?
It's doubtful!
The sample above was selected simply by convenience, and it might not be
representative of the population.
If we use chance rather than convenience, we can get a better sample which
could represent the population well.
Step 3: the subjects whose numbers appear in the set of n numbers from Step 2
are selected to form the SRS.
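A minimal sketch of drawing a simple random sample from a sampling frame is given below. Only Step 3 is stated above, so the first two steps in the code (numbering the subjects, then drawing n distinct random numbers) are an assumed reading of the procedure. The function name `simple_random_sample` and the student labels are hypothetical.

```python
import random

def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n subjects from the frame, each subset of size n equally likely."""
    rng = random.Random(seed)
    # Assumed Step 1: number the subjects in the frame from 1 to N.
    numbered = dict(enumerate(sampling_frame, start=1))
    # Assumed Step 2: generate n distinct random numbers between 1 and N.
    chosen_numbers = rng.sample(range(1, len(numbered) + 1), n)
    # Step 3: the subjects holding those numbers form the SRS.
    return [numbered[i] for i in chosen_numbers]

# Hypothetical sampling frame of 500 students.
frame = [f"student_{i:03d}" for i in range(1, 501)]
srs = simple_random_sample(frame, 100, seed=1)
print(len(srs), srs[:5])
```

Passing a fixed `seed` only makes the draw reproducible for checking; in a real survey the selection would come from a fresh random draw.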
Besides simple random sampling, there are other sampling designs
that involve randomization, such as:
▶ The sampling frame does not represent the full population (undercoverage).
E.g. a telephone survey will not include students without a registered phone.
E.g. is it possible that students who do not respond are those who study
more hours? If so, our results will be biased low for the questions on study
hours per week.
▶ Response bias. This occurs when a participant answers a question
dishonestly or incorrectly.
Literary Digest (LD) magazine conducted a poll to predict the result of the
1936 presidential election, between Franklin Roosevelt (Democrat) and Alf
Landon (Republican).
The sampling frame was constructed from: telephone directories, country
club memberships and automobile registration.
LD mailed questionnaires to 10 million people, asking how they planned to
vote.
2.3 million people returned the questionnaire. LD then predicted that
the Republican candidate would win, getting 57% of the vote.
In fact, the Republican candidate got only 36% of the vote and the Democrat won that election. a
a Bryson, M. C. (1976), American Statistician, vol. 30, pp. 184-185
2 Nonresponse bias. Of the 10 million people in the chosen random sample, 7.7
million did not respond.
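The figures quoted for the Literary Digest poll can be checked with a few lines of arithmetic. All numbers come from the text above; the variable names are only illustrative.

```python
mailed = 10_000_000        # questionnaires mailed out
returned = 2_300_000       # questionnaires returned

non_respondents = mailed - returned
response_rate = returned / mailed

predicted_share = 57       # predicted Republican vote share (%)
actual_share = 36          # actual Republican vote share (%)
prediction_error = predicted_share - actual_share

print(f"non-respondents: {non_respondents:,}")
print(f"response rate: {response_rate:.0%}")
print(f"prediction error: {prediction_error} percentage points")
```

Fewer than one in four people responded, so the respondents may differ systematically from the rest of the sample.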
Define a sampling frame, which attempts to list all the subjects in the
population.
3 Blinding the study: the subjects do not know whether they receive the actual
treatment or a placebo.
2 How should the researcher assign the subjects to the two treatment groups?
3 Without knowing more about this study, what would you identify as a
potential problem with the study design?
4 What descriptive statistics (numerical) are used for the analysis (to compare
the 2 groups after 1 year)?
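For the question on assigning subjects to the two treatment groups, one common approach is to let chance decide. The sketch below shuffles the subjects and splits them in half. The subject labels, the group sizes and the function name `randomize` are hypothetical, and this is one possible design rather than the one used in the study.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)      # copy so the input list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
treatment_group, placebo_group = randomize(subjects, seed=42)
print(treatment_group)
print(placebo_group)
```

Because assignment depends only on the shuffle, neither the researcher nor the subjects influence who ends up in which group, which helps balance lurking variables across the groups.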
After one year, the study observes whether each subject has successfully
quit smoking or still smokes (the number of cigarettes per day is
recorded).
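One plausible way to summarize such outcomes numerically is the proportion of subjects who quit in each group, together with the mean number of cigarettes per day among those who still smoke. The sketch below uses invented data, and the group labels and the function name `summarize` are hypothetical. It is not the study's actual analysis.

```python
from statistics import mean

# Invented outcomes: cigarettes per day after one year (0 means the subject quit).
outcomes = {
    "treatment": [0, 0, 5, 0, 10, 0, 3, 0],
    "placebo": [0, 12, 8, 0, 15, 20, 6, 10],
}

def summarize(cigs_per_day):
    """Proportion who quit, and mean cigarettes/day among continuing smokers."""
    still_smoking = [c for c in cigs_per_day if c > 0]
    n_quit = len(cigs_per_day) - len(still_smoking)
    return {
        "proportion_quit": n_quit / len(cigs_per_day),
        "mean_cigs_among_smokers": mean(still_smoking) if still_smoking else 0.0,
    }

summary = {group: summarize(values) for group, values in outcomes.items()}
for group, stats in summary.items():
    print(group, stats)
```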
Experimental units