MD115 Wk01
• Help tab: a library of help files describing the features of R functions and
commands
• Viewer tab: Displays 3D graphics and textual outputs more elegantly compared to
the Console.
Objects
• Picture an object as a box in which you can store anything that can be created or
loaded in R
• Types of information that can be stored inside an object:
▪ A graph
▪ An algorithm
▪ A list
▪ A dataset and so on
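A minimal sketch of storing different kinds of information in objects with the assignment operator `<-`. The object names and the stored values are hypothetical; `Puromycin` is a dataset that ships with R.

```r
# Each object is a named "box" holding something created or loaded in R.
a_number     <- 42                                  # a single value
a_list       <- list(name = "patient A", age = 30)  # a list (hypothetical values)
an_algorithm <- function(x) x^2                     # a small function
a_dataset    <- Puromycin                           # a built-in dataset

# class() reports what kind of thing each box contains.
class(a_list)
class(an_algorithm)
class(a_dataset)
```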
❑ round(x=3.14159)
• Whatever goes inside the parentheses of the function round() is called an
argument; here it is the number that will be rounded
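As a small illustration, the call above can be run with and without the optional `digits` argument of `round()`. The variable names are only for this sketch.

```r
# x is the argument being rounded; digits defaults to 0.
rounded_default <- round(x = 3.14159)              # 3
rounded_two     <- round(x = 3.14159, digits = 2)  # 3.14

# Arguments can also be matched by position instead of by name.
rounded_pos <- round(3.14159, 2)                   # 3.14
```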
Objects with more than one Element
• Redeclare your object with the concatenate function c():
❑ my_object_to_begin <- c(5, 7, 8, 12, 56)
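A short sketch of what the redeclared object looks like once it holds several elements. Only base R functions are used.

```r
# c() combines the values into one vector with five elements.
my_object_to_begin <- c(5, 7, 8, 12, 56)

n_elements <- length(my_object_to_begin)  # 5
second     <- my_object_to_begin[2]       # 7
average    <- mean(my_object_to_begin)    # 17.6
```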
Another Object
• R can import many kinds of data (CSV files, files from other statistical packages,
databases, Excel worksheets, etc.)
How would you view the value observed in the 5th row and 3rd column?
• Use the indexing operator [ , ] (row first, then column):
• Puromycin[5,3]
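The same operator can select a whole row or a whole column by leaving one side of the comma empty. A minimal sketch using the built-in `Puromycin` data frame:

```r
# Row 5, column 3: a single cell.
cell <- Puromycin[5, 3]

# Leaving a position empty selects everything along that dimension.
row_5    <- Puromycin[5, ]   # all columns of the 5th row
column_3 <- Puromycin[, 3]   # all rows of the 3rd column

# Columns can also be addressed by name instead of position.
same_cell <- Puromycin[5, "state"]
```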
Vector indexing
Indexing = selecting elements from a vector or other object
3. Logical vector
• Of equal length to main vector (or recycled)
• Positions with TRUE we keep, positions with FALSE we
exclude
4. Character vector
• The names of our data vector (if defined)
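A minimal sketch of logical and character indexing. The vector `weights` and its names are hypothetical values made up for illustration.

```r
# A named numeric vector (hypothetical data).
weights <- c(anna = 61, ben = 82, carla = 74)

# Logical vector of equal length: TRUE positions are kept.
keep  <- weights > 70
heavy <- weights[keep]                   # ben, carla

# A shorter logical vector is recycled: TRUE FALSE TRUE.
recycled <- weights[c(TRUE, FALSE)]      # anna, carla

# Character vector: select elements by their names.
by_name <- weights[c("ben", "anna")]     # ben, anna
```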
Why biostatistics?
[Figure: chart with categories Worried 10%, Proud 1%, Calm 43%, Optimistic 10%, Happy 13%]
Methods:
▪ Data Harmonization of measures of hobby engagement and aspects of mental wellbeing
across 16 nations represented in five longitudinal studies
▪ Explored the association between changes in hobby engagement and changes in mental
wellbeing outcomes
Methods:
• Text mining framework of news stories from Upworthy.com
• Sentiment analysis on the basis of the Linguistic Inquiry and Word Count (LIWC)
• Counted the number of positive words 𝑛positive and the number of negative
words 𝑛negative in each headline.
• Examine the effects of the proportion of negative words in a headline on the
Click-through Rate (CTR)
$$\text{Positive}_{ij} = \frac{n_{\text{positive}}}{n_{\text{total}}} \quad\text{and}\quad \text{Negative}_{ij} = \frac{n_{\text{negative}}}{n_{\text{total}}}$$
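The two proportions can be sketched in a few lines of R. The word lists below are tiny, made-up stand-ins, not the LIWC dictionaries, and `headline_proportions()` is a hypothetical helper written only for this illustration.

```r
# Toy dictionaries (assumption: NOT the LIWC word lists).
positive_words <- c("love", "great", "happy")
negative_words <- c("bad", "worst", "sad")

# Hypothetical helper: share of positive and negative words in one headline.
headline_proportions <- function(headline) {
  words <- strsplit(tolower(headline), "[^a-z']+")[[1]]
  words <- words[words != ""]
  n_total <- length(words)
  c(positive = sum(words %in% positive_words) / n_total,
    negative = sum(words %in% negative_words) / n_total)
}

# 10 words, 2 positive ("great", "happy"), 1 negative ("worst").
props <- headline_proportions("The worst day turned into a great and happy story")
```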
A population is a set of all units (usually people, objects, or events) that we are
interested in studying.
A statistical inference is an estimate, prediction, or some other
generalization about a population based on information
contained in a sample.
Example 1
Mental Health
Problem: According to Our World in Data (https://ourworldindata.org/) (Sep.
10, 2023), on average, people experience the symptoms of depression for 5
years before they are diagnosed. Suppose a supervising psychiatrist
hypothesizes that in her department the average time from a patient
first exhibiting symptoms to being diagnosed is less than 5 years. To
test her hypothesis, she samples 50 patients in her health department
and determines the period of time from the first symptoms to diagnosis.
a. Describe the population
b. Describe the variable of interest
c. Describe the sample
d. Describe the inference
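One way the inference in part (d) might be carried out is a one-sided, one-sample t-test. The sketch below uses simulated values, not real patient data; the gamma distribution, its parameters, and the seed are arbitrary assumptions chosen only so that the code runs.

```r
# Simulated times (years) from first symptoms to diagnosis for 50 patients.
# These numbers are made up; a real analysis would use the sampled patients.
set.seed(115)
years_to_diagnosis <- rgamma(50, shape = 4, rate = 1)

# H0: mean = 5 years; H1: mean < 5 years.
test_result <- t.test(years_to_diagnosis, mu = 5, alternative = "less")
test_result$p.value
```

Here the population is all patients in the department, the variable is the time to diagnosis, the sample is the 50 selected patients, and the test result is the inference about the population mean.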
Example 2
Sports Drinks: Isotonic and electrolyte drinks
Gatorade and Powerade are two different commercial products that help in electrolyte
replenishment and hydration during workouts. Suppose, as part of a Gatorade marketing
campaign, 500 Powerade consumers are given a blind test (i.e., a test where the two brand
names are disguised). Each consumer is asked to state their overall preference
(workout performance and taste) for brand A or brand B.
The test shows that 59% of the sampled Powerade consumers preferred
Gatorade
▪ Does it mean that exactly 59% of all Powerade drinkers in the region
prefer Gatorade?
➢ No
We can use sound statistical reasoning to ensure that the sampling
procedure will generate estimates that are almost certainly within a
specified limit of the true percentage of all Powerade consumers who prefer
Gatorade.
Confidence Interval (59 ± 5)%: 54% to 64%
The estimate of the preference for Gatorade is almost certainly within 5 percentage
points of the preference in the population.
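The interval arithmetic can be checked directly. The second part of the sketch is an assumption: it uses the common normal-approximation formula for a 95% margin of error with n = 500, the sample size given in the description of the blind test. The slides do not say how the ± 5 was obtained.

```r
# Interval as stated: estimate plus/minus margin.
p_hat  <- 0.59
margin <- 0.05
ci <- p_hat + c(-1, 1) * margin   # 0.54 0.64

# Assumption: 95% normal-approximation margin of error for a proportion.
n <- 500
approx_margin <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
approx_margin
```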
Inferential Statistics:
The 5 Elements
• For example: How can we know that the COVID-19 vaccine is effective
against symptomatic COVID-19?
• Do an experiment: randomize some people to receive
either the vaccine OR placebo, follow them up over
some time period, and compare how many of them get
symptomatic COVID-19
• Use a study sample (the people randomized) to draw
inferences about a population (everybody out there)
Sample → population
The larger the study sample, and/or the more extreme the observed difference,
the more likely the difference is real and not due to chance
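A minimal sketch of the sample-size effect, using `prop.test()` on hypothetical counts. Both scenarios have the same proportions of symptomatic cases (10% in the vaccine arm vs 20% in the placebo arm); only the number of people randomized differs. The counts are invented for illustration.

```r
# Hypothetical trial: symptomatic cases in vaccine vs placebo arms.
small_trial <- prop.test(x = c(10, 20),   n = c(100, 100))
large_trial <- prop.test(x = c(100, 200), n = c(1000, 1000))

# Same observed difference, but the larger trial gives a much smaller p-value.
small_trial$p.value
large_trial$p.value
```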
• The data we are working with are usually rectangular (tabular).
• Unit of observation: the unit that is described by the data. Usually the patient.
• Observation (record): the rows of the table. A set of values that refer to a
particular unit of observation.
• Variable (field): the columns of the table. A set of values of the same type
that reflect a particular characteristic of the units of observation. Has a name.
• Primary key: the observation ID. A variable that uniquely defines a unit of
observation.
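In R, rectangular data are typically held in a data frame. The sketch below builds a tiny one with hypothetical patients; the column names and values are invented for illustration.

```r
# Each row is one observation (patient); each column is one variable.
patients <- data.frame(
  patient_id = c(101, 102, 103),   # primary key
  age        = c(34, 58, 71),
  sex        = c("F", "M", "F")
)

n_observations <- nrow(patients)    # 3 records
variable_names <- names(patients)   # the fields

# A primary key must identify each unit uniquely: no duplicated IDs.
key_is_unique <- anyDuplicated(patients$patient_id) == 0
```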
Types of Data
4 Patient completely disabled; not capable of any self-care; totally confined to bed or chair
• When we assign ranks, we consider only their relative position, irrespective of their magnitudes
Discrete Data
➢ Numbers that represent actual measurable quantities rather than mere labels
➢ Take only specified values (integers/counts)
o Number of motor vehicle accidents in Massachusetts in a specified month,
the number of times a woman has given birth, the number of new cases of
tuberculosis reported in the United States during a one-year period, and the
number of beds available in a particular hospital.
It is meaningful to measure the distance between possible data values for discrete
observations → arithmetic rules can be applied.
• Counts of things
• No units of measurement
Continuous Data
➢ Numbers that are not restricted to taking on certain specified values (such as
integers)
o Time, the serum cholesterol level of a patient, the concentration of a
pollutant, and temperature
Types of variables
One can convert between variable types (numeric variables → categorical variables)
Examples
• Age to age group: <18, 18-39, 40-64, 65+ years
• Body Mass Index: <20, 20-24.9, 25-29.9, 30+ kg/m²
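A sketch of the age example using base R's `cut()`. The ages are hypothetical; the break points follow the age groups listed above, with `right = FALSE` so that each interval includes its lower bound (for example, 18 falls in 18-39).

```r
# Hypothetical ages in years.
age <- c(10, 18, 39, 40, 64, 65)

# Intervals are [-Inf,18), [18,40), [40,65), [65,Inf).
age_group <- cut(
  age,
  breaks = c(-Inf, 18, 40, 65, Inf),
  labels = c("<18", "18-39", "40-64", "65+"),
  right  = FALSE
)

table(age_group)
```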
Formulating a plan for analysis
What we do when we receive a new dataset for analysis
• Values in a variable that have not been observed for some units - we do
not know what they are!
• You should NOT assume they are equal to zero!
• They require special handling
• Most statistical software packages have special representations for missing
values (NA in R)
• Many statistical methods assume NO missing values and by default use only
complete observations (complete case analysis)
• OK to discard some observations (5-10%) due to missing values; otherwise
special methods are required (e.g. multiple imputation)
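A minimal sketch of how missing values behave in R, using a hypothetical cholesterol variable. `NA` is R's missing-value marker; the numbers are invented.

```r
# Hypothetical measurements with two missing values.
cholesterol <- c(5.2, NA, 6.1, 4.8, NA)

# NA is not zero: a plain mean is itself missing.
mean_with_na <- mean(cholesterol)                # NA

# Explicitly drop missing values before summarising.
mean_observed <- mean(cholesterol, na.rm = TRUE)

# Proportion of missing values, to judge whether discarding is acceptable.
prop_missing <- mean(is.na(cholesterol))         # 0.4

# Complete case analysis on a data frame keeps only rows with no NA.
df <- data.frame(id = 1:5, cholesterol = cholesterol)
complete_df <- df[complete.cases(df), ]
```

Here 40% of the values are missing, well above the 5-10% guideline, so simply discarding rows would not be appropriate for these made-up data.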
Formulating a plan for analysis