MD115 Wk01

Introduction – Basic concepts and types of data; formulating a plan for analysis
Dr. Ioannis Alatsathianos, PhD
Molecular Biology DUTH, MSc & PhD Medical Faculty of
Athens NKUA, Hygiene, Epidemiology & Medical Statistics
Goals
➢ Understand the mathematical theory behind statistical models
➢ Embrace statistical and critical thinking
➢ Understand the assumptions and prerequisites of the statistical methods that you may use for specific applications.

At the end of this course, you will be able to:

1. Formulate a real-life situation as a statistical problem in mathematical terms
2. Choose the correct statistical methods for this problem after considering the possible implications and limitations of the various methods
Objectives
1. Importance of Statistics in Medicine
•Learning Outcome: Understand the role of statistics in generating medical knowledge and
recognize why a foundational understanding of statistics is essential for medical students,
particularly for interpreting research and data in clinical settings.
2. Study Sample Size and Validity of Results
•Learning Outcome: Explain how study sample size impacts the validity and reliability of
research results. Identify factors like unbiased sampling and accurate measurements as
critical for a study’s credibility.
3. Principles of Reproducible Research
•Learning Outcome: Describe the importance of reproducible research as a fundamental
principle of high-quality research. Understand how using code to process and analyze data
helps ensure reproducibility for the researcher, collaborators, and the wider research
community.
4. Sample Quantities vs. Population Quantities
•Learning Outcome: Distinguish between sample quantities and population quantities.
Understand that sample quantities (e.g., sample mean) are used to estimate population
values, especially when sample size is large and unbiased.
5. Understanding Variables in a Dataset
•Learning Outcome: Define a variable in a dataset as a characteristic of observation units. Identify the different types of
variables (categorical, numeric, continuous, discrete) and understand how each type reflects characteristics within a dataset.
6. Conversion and Measurement of Variables
•Learning Outcome: Explain that variables can sometimes be converted (e.g., numeric to categorical) and recognize that
certain variable types (e.g., discrete) represent counts, while continuous variables are often accompanied by units of
measurement.
7. Recognizing Ordinal Variables
•Learning Outcome: Identify examples of ordinal variables, such as educational level or military rank, which have a
meaningful order but no consistent difference between levels.
8. Identifying Continuous Numeric Variables
•Learning Outcome: Recognize examples of continuous numeric variables (e.g., temperature, body mass index), which can
take on a full range of values and often have units of measurement associated with them.
9. Identifying Nominal Variables
•Learning Outcome: Identify nominal variables, like blood type or occupation, which categorize observations without an
inherent order, useful for grouping and organizing data.
10. Data Analysis and Handling Outliers
•Learning Outcome: Understand key data analysis practices, including identifying and handling outliers and missing values.
Recognize when it is appropriate to discard data and the importance of logical checks for consistency in datasets.
11. R Statistical Environment Basics
•Learning Outcome: Become familiar with the basics of R, including vectors as the simplest object type, the importance of RStudio, handling missing values with NA, and extending R’s capabilities with packages.
Logistics
➢ Lectures: Fridays 08:00-10:30, Sigma
➢ Mandatory laboratories in the Stats Lab: Tuesdays 12:30-14:30 (Groups A & B), 14:30-16:30 (Groups C & D), and 16:30-18:30 (Groups E & F)
➢ Assignment: to be announced by 6 December and to be submitted by the beginning of January (20%)
➢ Midterms: 11/11 - 15/11, in class. Closed books and notes
➢ Final: TBD
Suggested Book(s)
Let’s go!
In a conversation between two friends, this phrase makes sense. The two friends will immediately understand and head to their destination…

The machine will not understand.

• It needs to know exactly where, with whom, when, how…
The common language between you and the computer will be R:

❑ Working with the R statistical environment
❑ But first, install R: https://www.r-project.org

Now your laptop is ready to speak the same language as you…

But before that, install RStudio (an Integrated Development Environment): https://www.rstudio.com

• for a friendlier interface
• Choose the free “RStudio Desktop” option and the version of RStudio

After the installation is complete, RStudio can be opened.

• File > New File > R Script
1. The Script Editor

You can write as much code as you want:

• Run it when you want and in the order that you want
• Add comments to it with #
• Share it with others
• Scripts don’t get lost just like that

The Enter key will only move you to a new line, but Ctrl+Enter or Command+Enter (Mac) executes the command!
2. The Console

This is where the R language runs:

• All R results are returned in the console, apart from graphs (and help queries)
• You cannot save or edit the commands, so a future reader cannot revisit them

The Enter key will work here!
3. The Environment, History, Connections…

• Environment tab: contains the objects
• History tab: contains the history of the commands
• Connections tab: your connections to external data sources (e.g. databases)
• Tutorial tab: tutorials about packages
4. Files, Plots, Packages,…

• Files tab: allows us to browse the folders and files on our own laptop
• Plots tab: displays two-dimensional graphics, which can be exported as PDF, JPEG..
• Packages tab: lists the installed R packages
• Help tab: a library of help files about the features and commands of the R language
• Viewer tab: displays 3D graphics and textual outputs more elegantly than the Console.
Objects
• Picture an object as a box in which you can store anything that can be created or loaded in the R language
• Types of information that can be stored inside an object:
▪ A graph
▪ An algorithm
▪ A list
▪ A dataset and so on

• You create an object using the assignment operator <- (the “less than” and “minus” signs):

my_object_to_begin <- 32

• What we commanded R to do is store that value in a “box” named “my_object_to_begin”

• “my_object_to_begin” is an atomic vector

• But what about objects with multiple elements?
Functions and Arguments

• Functions are like verbs that indicate an action (algorithms)
• They have an objective

❑ round(x=3.14159)

• You can recognize a function by the parentheses
• Inside the object round there are predefined instructions for rounding numbers

• What goes inside the parentheses of the function round() is called an argument; here it is the number that will be rounded
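
A minimal sketch of calling round() with one and with two arguments. The object names rounded_default and rounded_two are made up for this illustration; digits is a second argument of round() that the slide does not mention.

```r
# round() with only x: the second argument 'digits' defaults to 0
rounded_default <- round(x = 3.14159)
rounded_default                      # 3

# Naming both arguments makes the call explicit
rounded_two <- round(x = 3.14159, digits = 2)
rounded_two                          # 3.14

# args() shows which arguments a function accepts
args(round)
```

Arguments can be matched by name (x = ..., digits = ...) or by position; naming them is usually easier to read.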
Objects with more than one Element

• Redefine your object with the concatenate function c():

my_object_to_begin <- c(5, 7, 8, 12, 56)

Another Object

• NA is a special value in R that represents a missing value
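
A short sketch of a vector with several elements and of how NA behaves in a calculation. The vector with_missing is a made-up example, not course data.

```r
# A vector with five elements, built with c()
my_object_to_begin <- c(5, 7, 8, 12, 56)
length(my_object_to_begin)           # 5
mean(my_object_to_begin)             # 17.6

# The same idea with one value not observed (hypothetical example)
with_missing <- c(5, 7, NA, 12, 56)
is.na(with_missing)                  # TRUE only at the third position
mean(with_missing)                   # NA: R does not guess the missing value
mean(with_missing, na.rm = TRUE)     # 20: mean of the four observed values
```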


Packages
• So far, we have talked about installing R and about the base language
• But data science is multidisciplinary. Beyond medicine and biology, examples of data science applications are found in engineering, physics, education, psychology, pedagogy, law, politics, public security, economics, sociology, business, marketing, astronomy, anthropology, human resources, meteorology, geography, and history.
• The base language brings together a large number of basic functions, but when you need to specialize in medicine you need the relevant packages to extend the vocabulary of the machine
Installation of Packages

Packages are installed using the function:

• install.packages("name of the package to be installed")
• After installation, you must load the package by typing library(name of the installed package you wish to load)
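
A sketch of the install-then-load pattern. The install line is commented out because it needs an internet connection; the package tools is used for the live part only because it ships with every R installation, and the file name patients.xlsx is hypothetical.

```r
# Step 1 (once per computer): install from CRAN. Note the quotes.
# install.packages("readxl")

# Step 2 (every session): load the package. No quotes needed here.
library(tools)                        # 'tools' comes with R, nothing to install

# Functions of the loaded package are now available
ext <- file_ext("patients.xlsx")      # hypothetical file name
ext                                   # "xlsx"
```

Installing is done once; loading with library() is repeated at the start of each script that uses the package.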
A few important object types
Vector: A set of values (elements) of the same type (e.g. numeric, character, logical, factor). A variable of a dataset.
• Though vectors can stand on their own in R

List: A set of other R objects (potentially of different types).
• Usually named; if so, we can refer to named list elements with the $ operator (e.g. myList$element1)
• Many R functions return a list!

data.frame: A list of vectors of the same length. Represents rectangular data in R.
• Again, use the $ operator to refer to individual vectors (variables), e.g. dat$age

Factor: A vector representing a categorical variable
• "Levels" of the factor = the individual categories
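
The four object types can be sketched as follows. All values are made up for illustration; myList, element1 and dat follow the names used on the slide, while fit is a hypothetical name.

```r
# Vector: values of the same type
age <- c(34, 51, 28)

# Factor: a categorical variable with levels
sex <- factor(c("F", "M", "F"))
levels(sex)                           # "F" "M"

# List: objects of different types, referred to with $
myList <- list(element1 = age, element2 = "a note")
myList$element1

# data.frame: a list of equal-length vectors (rectangular data)
dat <- data.frame(id = 1:3, age = age, sex = sex)
dat$age
str(dat)

# Many functions return a list, e.g. the result of a t-test
fit <- t.test(dat$age)
is.list(fit)                          # TRUE
fit$estimate                          # the sample mean, extracted with $
```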
Loading datasets into R (reading them into data.frames)

• R can import many kinds of data (CSV files, files from other statistical packages, databases, Excel worksheets, etc.)
• Probably the simplest way: from Excel files
• We first install the 'readxl' package from CRAN
• Then we load it into the R workspace with library(), making the functions it contains available for use.
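
A self-contained sketch of reading a file into a data.frame. To run without extra packages it writes a small made-up CSV file and reads it back with base R; the readxl call is shown only as a comment, and the file name patients.xlsx is hypothetical.

```r
# Write a tiny made-up dataset to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, age = c(34, 51, NA)),
          csv_path, row.names = FALSE)

# Read it back into a data.frame (base R, no package needed)
dat <- read.csv(csv_path)
str(dat)

# For an Excel file the equivalent would be:
# library(readxl)
# dat <- read_excel("patients.xlsx")
```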
R understands the position of data in objects that contain datasets: you indicate first the row and then the column.

How would you view the value observed in the 5th row and 3rd column?
• Use the indexing operator [ , ]
• Puromycin[5,3]
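
A few variations on the same idea, using the Puromycin dataset that ships with R. The name value_5_3 is made up for this sketch.

```r
dim(Puromycin)                 # number of rows, number of columns
names(Puromycin)               # column names

value_5_3 <- Puromycin[5, 3]   # row 5, column 3
value_5_3

Puromycin[5, ]                 # leave the column empty: the whole 5th row
Puromycin[, 3]                 # leave the row empty: the whole 3rd column
Puromycin[5, "state"]          # a column can also be selected by name
```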
Vector indexing
Indexing = selecting elements from a vector or other object

Use the indexing operator [ ] with one of four kinds of index:

1. Positive integer vector
• Vector positions of the elements we “keep”

2. Negative integer vector
• Vector positions of the elements that we exclude

3. Logical vector
• Of equal length to the main vector (or recycled)
• We keep positions with TRUE and exclude positions with FALSE

4. Character vector
• The names of the elements of our data vector (if defined)
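
The four ways of indexing can be sketched on one small named vector. The element names a to e are made up for this illustration.

```r
x <- c(a = 5, b = 7, c = 8, d = 12, e = 56)

# 1. Positive integers: keep positions 1 and 3
keep_pos <- x[c(1, 3)]

# 2. Negative integers: exclude positions 1 and 3
drop_pos <- x[-c(1, 3)]

# 3. Logical vector: keep where TRUE
above_7 <- x[x > 7]
every_other <- x[c(TRUE, FALSE)]   # recycled to TRUE FALSE TRUE FALSE TRUE

# 4. Character vector: select by name
by_name <- x[c("b", "e")]
```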
Why biostatistics?

— Why should I learn biostatistics?


— You already know the importance of biomedical data in decision
making!
News
What is biostatistics?

Statistics is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing, presenting, and interpreting numerical and categorical information.

Biostatistics is the science of data in relation to biomedical research. This involves collecting, classifying, summarizing, organizing, analyzing, presenting, and interpreting numerical and categorical information.
Social Media Use in 2023 Globally

% of Internet users, when asked if they ever use some form of social media when looking for more information on brands
Yes                        78
No                         22

% of social media users, when asked if they worry that they overuse their chosen platforms
Yes                        28
No                         72

% of social media users outside of China who say the following is their favorite platform
WhatsApp                   21
Instagram                  21
Facebook                   20
TikTok                      9
Twitter                     5
Facebook Messenger          3
Telegram Messenger          3
Pinterest                   2
Snapchat                    2
Line                        2

% who trust product/brand recommendations made by social media influencers
Do not trust at all        21.75
Trust a little             44.25
Trust a lot/completely     25.75
I don't know/no opinion     8.25

% of Gen Z who trust product/brand recommendations made by social media influencers
Do not trust at all        10
Trust a little             50
Trust a lot/completely     33
I don't know/no opinion     7

Source: GWI (formerly GlobalWebIndex): https://www.gwi.com/
What can we do with all these numbers?
Sampling: selecting data from some larger set of data whose characteristics we want to estimate

Based on the previous example:

1. A public health specialist might collect data on Gen Z’s preferred social media platforms (describing sets of data)
2. and then use this statistic to focus a public health campaign addressed to young people on a few key platforms, rather than blasting the same creative across every channel (drawing conclusions)
Statistics involves two different processes
a) Descriptive statistics utilizes numerical and graphical methods to look for patterns in a data set, summarize the information revealed in a data set, and present that information in a convenient form.

b) Inferential statistics utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data.
a) Descriptive Statistics: Which of the following best describes how you're feeling at the moment?

[Figure: Excel graph of Mood (based on www.gwi.com, https://app.globalwebindex.com/)]
Other                   8%
Worried                10%
Proud                   1%
Calm                   43%
Optimistic             10%
Happy                  13%
Disappointed / Upset    5%
Frustrated              8%
Excited                 2%
Descriptive Statistics: The 4 Elements

1. The population or sample of interest
2. One or more variables (characteristics of the population or sample units) of interest
3. Tables, graphs, or numerical summary tools
4. Identification of patterns in the data
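
A minimal sketch of elements 3 and 4, using the mood percentages from the chart above as a named vector. The object names mood and most_common are made up for this illustration.

```r
# Percentages as read from the mood chart
mood <- c(Other = 8, Worried = 10, Proud = 1, Calm = 43, Optimistic = 10,
          Happy = 13, "Disappointed / Upset" = 5, Frustrated = 8, Excited = 2)

sum(mood)                            # the shares add up to 100

# A table-like summary, largest share first
sort(mood, decreasing = TRUE)

# Pattern: which answer is most common?
most_common <- names(mood)[which.max(mood)]
most_common                          # "Calm"
```

A call such as barplot(sort(mood), horiz = TRUE) would draw the same information as a graph.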
b) Inferential Statistics

Methods:
▪ Data harmonization of measures of hobby engagement and aspects of mental wellbeing across 16 nations represented in five longitudinal studies
▪ Explored the association between changes in hobby engagement and changes in mental wellbeing outcomes

Conclusions: Increased hobby engagement predicts subsequent decreases in depressive symptoms and increased self-reported health, happiness, and life satisfaction.
b) Inferential Statistics

Methods:
• Text mining framework of news stories from Upworthy.com
• Sentiment analysis on the basis of the Linguistic Inquiry and Word Count (LIWC)
• Counted the number of positive words $n_{\text{positive}}$ and the number of negative words $n_{\text{negative}}$ in each headline.
• Examined the effect of the proportion of negative words in a headline on the Click-through Rate (CTR)

$$\text{Positive}_{ij} = \frac{n_{\text{positive}}}{n_{\text{total}}} \quad \text{and} \quad \text{Negative}_{ij} = \frac{n_{\text{negative}}}{n_{\text{total}}}$$
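
The two proportions can be sketched for a single headline. The word lists and the headline below are made up for this illustration; they are NOT the LIWC dictionary or an Upworthy headline, and the object names are hypothetical.

```r
# Tiny made-up word lists (stand-ins for a real sentiment dictionary)
positive_words <- c("love", "win", "happy")
negative_words <- c("wrong", "bad", "worst")

headline <- "the worst thing that could go wrong on a happy day"

# Split the headline into lower-case words
words <- strsplit(tolower(headline), "[^a-z]+")[[1]]
words <- words[words != ""]

n_total    <- length(words)
n_positive <- sum(words %in% positive_words)
n_negative <- sum(words %in% negative_words)

# Proportion of positive and of negative words in this headline
positive_share <- n_positive / n_total
negative_share <- n_negative / n_total
c(positive = positive_share, negative = negative_share)
```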

Conclusion: “If it bleeds, it leads”

Inferential statistics was applied to arrive at this conclusion.
Fundamental Elements of Statistics

An experimental (or observational) unit is an object (e.g., person, thing, transaction, or event) about which we collect data.

A population is a set of all units (usually people, objects, or events) that we are interested in studying.

A variable is a characteristic or property of an individual experimental (or observational) unit in the population.

A sample is a subset of the units of a population.

[Figure: example showing a population, a sample, and a variable]
A statistical inference is an estimate, prediction, or some other
generalization about a population based on information
contained in a sample.
Example 1
Mental Health
Problem: According to Our World in Data (https://ourworldindata.org/) (Sep. 10, 2023), on average, people experience the symptoms of depression for 5 years before they are diagnosed. Suppose a psychiatrist and supervising physician hypothesizes that in her department the average time from a patient first exhibiting symptoms to being diagnosed is less than 5 years. To test her hypothesis, she samples 50 patients in her department and determines the period of time from the first symptoms to diagnosis.
a. Describe the population
b. Describe the variable of interest
c. Describe the sample
d. Describe the inference
Example 2
Sports Drinks: Isotonic and electrolyte drinks

Gatorade and Powerade are two commercial products that help with electrolyte replenishment and hydration during workouts. Suppose that, as part of a Gatorade marketing campaign, 500 Powerade consumers are given a blind test (i.e., a test in which the two brand names are disguised). Each consumer is asked to state their overall preference (based on workout performance and taste) for brand A or brand B.

a. Describe the population


b. Describe the variable of interest.
c. Describe the sample.
d. Describe the inference.
About intrusive thoughts:

How good is an inference, and why don’t we just measure the whole population?

➢ Resource constraints (not enough money or time)
➢ Usually we can’t work with the whole universe
Reliability of Inference:
Isotonic and electrolyte drinks

The test shows that 59% of the 500 Powerade consumers preferred Gatorade
▪ Does it mean that exactly 59% of all Powerade drinkers in the region prefer Gatorade?
➢ No
We can use sound statistical reasoning to ensure that the sampling procedure will generate estimates that are almost certainly within a specified limit of the true percentage of all Powerade consumers who prefer Gatorade.
Confidence Interval (59 ± 5)%: 54% to 64%
The estimate of the preference for Gatorade is almost certainly within 5 percentage points of the preference in the population.
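
The slide quotes a margin of ± 5% without showing a calculation. As a hedged illustration of how such a margin can be obtained, the sketch below uses the normal-approximation (Wald) interval for a proportion, which is one common choice and an assumption here, so it need not reproduce the slide's figure. The helper wald_ci is a made-up name, and the two sample sizes are chosen only to show the effect of n.

```r
# Normal-approximation (Wald) confidence interval for a proportion
wald_ci <- function(p_hat, n, conf_level = 0.95) {
  z  <- qnorm(1 - (1 - conf_level) / 2)      # about 1.96 for 95%
  se <- sqrt(p_hat * (1 - p_hat) / n)        # standard error of the proportion
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# 59% observed preference, with two possible sample sizes
ci_500  <- wald_ci(0.59, n = 500)
ci_1000 <- wald_ci(0.59, n = 1000)

round(100 * ci_500, 1)     # interval in percent for n = 500
round(100 * ci_1000, 1)    # narrower interval for n = 1000
```

The larger sample gives a narrower interval around the same 59%, which is the sense in which a bigger study is more reliable.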
Inferential Statistics: The 5 Elements

1. The population of interest
2. One or more variables (characteristics of the population units) of interest
3. The sample of population units
4. The inference about the population based on information contained in the sample
5. A measure of the reliability of the inference (e.g., a confidence interval)
Why biostatistics?

— “Bro I’m a medical doctor! Why should I learn biostatistics?”


— “Because that’s exactly how medical knowledge is generated!”
How medical knowledge is generated:
• Science (and medicine in particular) is empirical:
• natural and experimental observations → inductive reasoning → generalization

• Basic research ←→ Clinical research
• Any new medical knowledge is validated in actual people

• For example: How can we know that the COVID-19 vaccine is effective against symptomatic COVID-19?
• Do an experiment: randomize some people to receive either the vaccine OR placebo, follow them up over some time period, and compare how many of them get symptomatic COVID-19
• Use a study sample (the people randomized) to draw inferences about a population (everybody out there)
Sample → population

• Can we say that the vaccine is effective, if:
• out of 10 people who got the vaccine, 3 got COVID-19, and out of 10 people who got placebo, 6 got COVID-19?
• out of 100 people who got the vaccine, 30 got COVID-19, and out of 100 people who got placebo, 60 got COVID-19?
• out of 10 people who got the vaccine, 1 got COVID-19, and out of 10 people who got placebo, 8 got COVID-19?

There are only two possibilities. Either:

1. The vaccine is effective in reducing your risk of getting COVID-19
2. The vaccine is NOT effective in reducing your risk of getting COVID-19, and we got this difference just by chance
The larger the study sample, and/or the more extreme the difference, the more likely the difference is to be real and not due to chance

• These are the kinds of questions statistics deals with!
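
One way to put numbers on "just by chance" for the three vaccine scenarios is Fisher's exact test. The choice of this test is an assumption made for illustration (the slides do not name a method), and compare_groups is a made-up helper name.

```r
# p-value for "vaccine vs placebo" with equal group sizes
compare_groups <- function(cases_vaccine, cases_placebo, n_per_group) {
  counts <- matrix(c(cases_vaccine, n_per_group - cases_vaccine,
                     cases_placebo, n_per_group - cases_placebo),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("vaccine", "placebo"),
                                   c("covid", "no_covid")))
  fisher.test(counts)$p.value
}

p_small   <- compare_groups(3, 6, 10)      # 3/10 vs 6/10
p_large   <- compare_groups(30, 60, 100)   # 30/100 vs 60/100
p_extreme <- compare_groups(1, 8, 10)      # 1/10 vs 8/10

signif(c(small = p_small, large = p_large, extreme = p_extreme), 2)
```

A small p-value means the observed difference would be unusual if the vaccine had no effect. The first scenario gives a large p-value, while the same proportions in a larger sample, or a more extreme difference in a small sample, give small ones.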


Biostatistics in Medicine - Example
Clinical research
• Remember: we are working with a sample to infer about a population of interest
• It is the population that we’re really interested in – not the sample
• We need to clearly distinguish between population and sample
• Sample quantities (e.g. sample mean) are known, i.e. measured, whereas population quantities (e.g. population mean) are unknown and are being estimated

• Appropriate sample selection is essential, as is accurate data measurement – otherwise bias ensues
• We will discuss bias next year, in Epidemiology class
• Assuming unbiased samples and accurate measurements, statistics allows us to:
• Convert our data into meaningful results (analyze our data)
• Know how likely our results are to reflect real differences, or to be due to chance (random error)
How is clinical research done?

Steps 3-5 can be implemented using statistical software, such as the R statistical environment

• Facilitates reproducible research

Reproducible research: a basic principle of good research
Definition
Ability to reproduce the results for:
• the investigator himself/herself
• other collaborating investigators
• the wider research community

• A prerequisite is the use of code throughout:
• Processing raw data
• Analyzing data and generating results
• Presenting results in a report
Direct link between analysis and final result
• Nothing done “by hand”!

• Interactive use vs R scripts

Rectangular data

• The data we are working with are usually rectangular (tabular), like:

Unit of observation: The unit that is described by the data. Usually the patient.
Observation (Record): The rows of the table. A set of values that refer to a particular unit of observation.
Variable (Field): The columns of the table. A set of values of the same type that reflect a particular characteristic of the units of observation. Has a name.
Primary key: The observation ID. A variable that uniquely identifies a unit of observation.
Types of Data

Categorical (or qualitative or attribute) data consist of names or labels (not numbers that represent counts or measurements).

Quantitative (or numerical) data consist of numbers representing counts or measurements.
Types of variables

Categorical
• Nominal, e.g. sex, occupation (without an inherent ordering)
• Ordinal, e.g. educational level (with inherent ordering)
• Dichotomous, e.g. diseased/healthy, yes/no (with just two levels)

Numeric
• Continuous, e.g. temperature (measurements)
• Discrete, e.g. number of children (counts)

Different statistical methods are appropriate for different types of data!
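
A sketch of how each variable type might be stored in R. All values and object names are made up for illustration; other representations are possible (for example, a dichotomous variable can also be a two-level factor).

```r
# Nominal: factor without ordering
blood_type <- factor(c("A", "O", "AB", "O"))

# Ordinal: ordered factor, levels listed from lowest to highest
education <- factor(c("primary", "tertiary", "secondary", "primary"),
                    levels = c("primary", "secondary", "tertiary"),
                    ordered = TRUE)

# Dichotomous: logical vector
diseased <- c(TRUE, FALSE, FALSE, TRUE)

# Discrete: integer counts
children <- c(0L, 2L, 1L, 3L)

# Continuous: numeric measurements (degrees Celsius)
temp_c <- c(36.6, 38.2, 37.0, 36.9)

# The appropriate summary depends on the type
table(blood_type)                        # counts per category
below_tertiary <- education < "tertiary" # comparisons work for ordered factors
mean(diseased)                           # proportion diseased
summary(temp_c)                          # min, quartiles, mean, max
```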
Nominal Data

➢ Where values fall into unordered categories or classes


Table: Sample of outcomes indicating whether an individual had Kaposi's sarcoma for the first 2560 AIDS patients reported to the Centers for Disease Control and Prevention in Atlanta, Georgia

00000000 00010100 00000010 00001000 00000001 00000000 10000000 00000000

00101000 00000000 00000000 00011000 00100001 01001100 00000000 00000010

00000001 00000000 00000010 01100000 00000000 00000100 00000000 00000000

• Numbers used for convenience
• Males might be assigned the value of 1 and females the value of 0
• Categorical variables with NO inherent ordering

Dichotomous/Binary
• Two distinct values exclusively
• Yes/No, Head/Tail, Pass/Fail

More Categories
• Three or more possible categories
• Blood type: A, B, O, AB (represented as 1, 2, 3, 4)

The only meaningful arithmetic operations are proportions
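
As a sketch of "proportions" for nominal data, the 0/1 outcomes printed in the Kaposi's sarcoma table above can be counted in R. The coding 1 = had Kaposi's sarcoma, 0 = did not is an assumption here, and the object names are made up.

```r
# The three rows of 0/1 outcomes as printed in the table
kaposi_rows <- c(
  "00000000 00010100 00000010 00001000 00000001 00000000 10000000 00000000",
  "00101000 00000000 00000000 00011000 00100001 01001100 00000000 00000010",
  "00000001 00000000 00000010 01100000 00000000 00000100 00000000 00000000")

# Remove the spaces and split into single digits
digits <- unlist(strsplit(gsub(" ", "", kaposi_rows), ""))

# Store as a factor: the digits are labels, not quantities
kaposi <- factor(digits, levels = c("0", "1"), labels = c("no", "yes"))

table(kaposi)                # counts per category
prop.table(table(kaposi))    # proportions per category
```

Averaging the labels of a variable such as blood type coded 1 to 4 would be meaningless, whereas the counts and proportions above are not.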
Ordinal Data

➢ Where order among categories is important

Eastern Cooperative Oncology Group's classification of patient performance status
Status  Definition
0       Patient fully active, able to carry on all pre-disease performance without restriction
1       Patient restricted in physically strenuous activity but ambulatory and able to carry out work of a light or sedentary nature
2       Patient ambulatory and capable of all self-care but unable to carry out any work activities; up and about more than 50% of waking hours
3       Patient capable of only limited self-care; confined to bed or chair more than 50% of waking hours
4       Patient completely disabled; not capable of any self-care; totally confined to bed or chair

• Many arithmetic operations do not make sense because the magnitude of the difference between levels is not constant.
• Categorical variables with an inherent ordering

Ranked Data

➢ Observations are arranged by size and assigned consecutive ranks (serial numbers).
Ten leading causes of death in the United States, 2021 (FastStats - Leading Causes of Death (cdc.gov ))
Rank Cause of Death Total Deaths
1 Diseases of the heart 695,547
2 Malignant neoplasms 605,213
3 COVID-19 416,893
4 Accidents (unintentional injuries) 224,935
5 Stroke (cerebrovascular diseases) 162,890
6 Chronic lower respiratory diseases 142,342
7 Alzheimer’s disease 119,399
8 Diabetes 103,294
9 Chronic liver disease and cirrhosis 56,585
10 Nephritis, nephrotic syndrome, and nephrosis 54,358

• When we assign ranks, we consider only the relative position of the observations, irrespective of their magnitudes
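
A sketch of ranking with the death counts from the table above. The object names are made up; rank() is applied to the negated counts so that the largest count receives rank 1.

```r
# Total deaths as listed in the table (United States, 2021)
deaths <- c("Diseases of the heart" = 695547,
            "Malignant neoplasms" = 605213,
            "COVID-19" = 416893,
            "Accidents (unintentional injuries)" = 224935,
            "Stroke (cerebrovascular diseases)" = 162890,
            "Chronic lower respiratory diseases" = 142342,
            "Alzheimer's disease" = 119399,
            "Diabetes" = 103294,
            "Chronic liver disease and cirrhosis" = 56585,
            "Nephritis, nephrotic syndrome, and nephrosis" = 54358)

# Rank 1 = largest number of deaths
death_rank <- rank(-deaths)
death_rank

# Ranks hide magnitude: neighbouring ranks always differ by 1,
# but the gaps in the underlying counts vary widely
gap_1_2  <- deaths[1] - deaths[2]    # 90,334 deaths between ranks 1 and 2
gap_9_10 <- deaths[9] - deaths[10]   # 2,227 deaths between ranks 9 and 10
```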
Discrete Data

➢ Numbers that represent actual measurable quantities rather than mere labels
➢ Take only specified values (integers/counts)
o Number of motor vehicle accidents in Massachusetts in a specified month,
the number of times a woman has given birth, the number of new cases of
tuberculosis reported in the United States during a one-year period, and the
number of beds available in a particular hospital.

It is meaningful to measure the distance between possible data values for discrete
observations → arithmetic rules can be applied.

• Counts of things
• No units of measurement
Continuous Data

➢ Numbers that are not restricted to taking on certain specified values (such as
integers)
o Time, the serum cholesterol level of a patient, the concentration of a
pollutant, and temperature

Common Practice: When a lesser degree of accuracy is needed, we can transform continuous observations into discrete, ordinal, or even dichotomous ones, according to the degree of precision required (based on the question)

• Measurements (non-countable), always accompanied by some unit of measurement
• Can be converted to another unit of measurement
Types of variables
One can convert between variable types (between categorical variables and numeric variables)

Examples
• Age to age group: <18, 18-39, 40-64, 65+ years
• Body Mass Index: <20, 20-24.9, 25-29.9, 30+ kg/m²
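
The age-to-age-group conversion can be sketched with cut(). The ages are made up and assumed to be whole years; right = FALSE makes each interval include its lower limit and exclude its upper limit, so that 18 falls in 18-39 and 65 in 65+.

```r
age <- c(17, 18, 39, 40, 64, 65, 80)       # made-up ages in years

age_group <- cut(age,
                 breaks = c(-Inf, 18, 40, 65, Inf),
                 right  = FALSE,            # intervals are [lower, upper)
                 labels = c("<18", "18-39", "40-64", "65+"))

data.frame(age, age_group)                  # original and derived variable
table(age_group)                            # counts per age group
```

The result is a factor, i.e. a categorical variable derived from a numeric one. Every value belongs to exactly one group because the intervals do not overlap.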
Formulating a plan for analysis
What we do when we receive a new dataset for analysis

0. Refer to the study protocol

1. Understand what our data represent (unit of observation, variables)

2. Process the data
• Verify the types of variables (and convert if necessary, or recode into derived variables)
• Check for missing values
• Perform logical / consistency checks on each variable
• Check the distribution of values, including outliers
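
The processing checks in step 2 might look as follows on a small dataset. The data are made up and contain deliberate problems; the plausibility limits for age (0 to 120 years) are an assumption for this sketch.

```r
# Made-up dataset with a mistyped age, inconsistent coding, and a missing value
dat <- data.frame(id  = 1:6,
                  age = c(34, 51, 28, 430, 62, NA),
                  sex = c("F", "M", "F", "M", "f", "M"),
                  sbp = c(120, 135, 118, 142, 128, 131))

# Verify variable types, then recode sex into a clean factor
str(dat)
dat$sex <- factor(toupper(dat$sex))

# Missing values per variable
n_missing <- colSums(is.na(dat))
n_missing

# Logical / consistency checks
ids_unique      <- anyDuplicated(dat$id) == 0
implausible_age <- which(dat$age < 0 | dat$age > 120)   # row numbers to review

# Distribution and outliers
summary(dat$sbp)
age_outliers <- boxplot.stats(dat$age)$out   # values beyond the boxplot whiskers
```

Flagged rows should be checked against the source records before anything is changed or discarded.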
Missing values

• Values in a variable that have not been observed for some units: we do not know what they are!
• You should NOT assume they are equal to zero!
• They require special handling
• Most statistical software has special representations for missing values
• Statistical methods typically assume NO missing values (complete case analysis)
• OK to discard some observations (5-10%) due to missing values; otherwise special methods are required (e.g. multiple imputation)
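
A sketch of complete case analysis in R on made-up data. The object names are hypothetical.

```r
# Made-up dataset: two of five patients have a missing value
dat <- data.frame(id  = 1:5,
                  age = c(34, NA, 28, 45, 62),
                  sbp = c(120, 135, NA, 142, 128))

mean(dat$age)                    # NA: missing values propagate
mean(dat$age, na.rm = TRUE)      # 42.25: mean of the observed ages only

# Complete case analysis: keep only rows with no missing value
complete      <- complete.cases(dat)
share_dropped <- mean(!complete)         # proportion of rows that would be lost
dat_complete  <- dat[complete, ]

share_dropped                    # 0.4
nrow(dat_complete)               # 3
```

Here 40% of the rows would be discarded, far more than the 5-10% mentioned above, so simply dropping them would not be advisable in this example.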
Formulating a plan for analysis

What we do when we receive a new dataset for analysis

3. Analyze the data
• Descriptive analyses
• Simple univariate analyses
• Complex, multivariate analyses
• Highly dependent on the type of variables!

4. Output the results
• Tables, figures, etc.
Thank you!
