The Elements of Data Analytic Style
A guide for people who want to analyze data.
Jeff Leek
This book is for sale at http://leanpub.com/datastyle
Contents

1. Introduction
5. Exploratory analysis
8. Causality
9. Written analyses
10. Creating figures
11. Presenting data
12. Reproducibility
15. Additional resources
1. Introduction
The dramatic change in the price and accessibility of data
demands a new focus on data analytic literacy. This book is
intended for use by people who perform regular data analyses.
It aims to give a brief summary of the key ideas, practices, and
pitfalls of modern data analysis. One goal is to summarize
in a succinct way the most common difficulties encountered
by practicing data analysts. It may also serve as a guide for
peer reviewers, who can refer to specific section numbers
when evaluating manuscripts. As will become apparent, it is
modeled loosely in format and aim on the Elements of Style
by William Strunk.
The book includes a basic checklist that may be useful as a
guide for beginning data analysts or as a rubric for evaluating
data analyses. It has been used in the author’s data analysis
class to evaluate student projects. Both the checklist and this
book cover a small fraction of the field of data analysis, but the
author's experience is that once these elements are mastered,
data analysts benefit most from hands-on experience in
their own discipline of application, and that many principles
beyond the basics may not transfer across disciplines.
If you want a more complete introduction to the analysis
of data one option is the free Johns Hopkins Data Science
Specialization¹.
As with rhetoric, it is true that the best data analysts sometimes
disregard the rules in their analyses. Experts usually do so
deliberately, with a clear understanding of what the rules are
for and of what is lost by breaking them.
¹https://www.coursera.org/specialization/jhudatascience/1
2.2 Descriptive
A descriptive data analysis seeks to summarize the
measurements in a single data set without further interpretation. An
example is the United States Census. The Census collects data
on the residence type, location, age, sex, and race of all people
in the United States at a fixed time. The Census is descriptive
because the goal is to summarize the measurements in this
fixed data set into population counts and to describe how many
people fall into each category.
2.3 Exploratory
An exploratory data analysis builds on a descriptive analysis
by searching for discoveries, trends, correlations, or
relationships between the measurements of multiple variables to
generate ideas or hypotheses. An example is the discovery of a
four-planet solar system by amateur astronomers using public
astronomical data from the Kepler telescope. The data was
made available through the planethunters.org website, which
asked amateur astronomers to look for a characteristic pattern
of light indicating potential planets. An exploratory analysis
like this one seeks to make discoveries, but rarely can confirm
those discoveries. In the case of the amateur astronomers,
follow-up studies and additional data were needed to confirm
the existence of the four-planet system.
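The search for relationships described above can start very simply, for example by scanning pairwise correlations among variables. The sketch below (in Python, with invented data and variable names; the book itself does not prescribe a language) illustrates the spirit of an exploratory scan: it surfaces hypotheses to follow up on, not confirmed findings.

```python
import random

# Toy data standing in for a real data set; the variables and the
# planted relationship are invented for illustration.
random.seed(0)
n = 200
hours_studied = [random.uniform(0, 10) for _ in range(n)]
score = [5 * h + random.gauss(0, 10) for h in hours_studied]
shoe_size = [random.uniform(35, 47) for _ in range(n)]  # unrelated noise

def corr(u, v):
    """Pearson correlation, computed from scratch."""
    m = len(u)
    mu, mv = sum(u) / m, sum(v) / m
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

data = {"hours_studied": hours_studied, "score": score,
        "shoe_size": shoe_size}
names = list(data)
# Scan every pair of variables for candidate relationships.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = corr(data[names[i]], data[names[j]])
        print(f"{names[i]} vs {names[j]}: r = {r:.2f}")
```

Only the planted relationship should stand out; any pattern found this way still needs confirmation on new data, as the astronomy example shows.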
2.4 Inferential
An inferential data analysis goes beyond an exploratory
analysis by quantifying whether an observed pattern will likely
hold beyond the data set in hand. Inferential data analyses are
the most common statistical analysis in the formal scientific
literature. An example is a study of whether air pollution
correlates with life expectancy at the state level in the United
States. The goal is both to estimate the strength of the
relationship in the specific data set and to determine whether that
relationship will hold in future data. In non-randomized
studies, confounding makes the second step harder.
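A common way to quantify whether an observed pattern will hold beyond the data in hand is to attach an uncertainty interval to the estimate. The sketch below uses a bootstrap on invented data (the variable names are illustrative, not the pollution data discussed above): if the interval excludes zero, the correlation is unlikely to be a fluke of this particular sample.

```python
import random

# Invented data with a modest planted correlation between an
# exposure and an outcome.
random.seed(7)
n = 200
exposure = [random.gauss(0, 1) for _ in range(n)]
outcome = [0.4 * e + random.gauss(0, 1) for e in exposure]

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Bootstrap: re-estimate the correlation on resampled data sets to
# see how much it would vary across samples of the same size.
boot = []
for _ in range(1000):
    idx = [random.randrange(n) for _ in range(n)]
    boot.append(corr([exposure[i] for i in idx],
                     [outcome[i] for i in idx]))
boot.sort()
low, high = boot[25], boot[974]  # approximate 95% interval
print(f"correlation {corr(exposure, outcome):.2f}, "
      f"95% interval ({low:.2f}, {high:.2f})")
```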
2.5 Predictive
While an inferential data analysis quantifies relationships
among measurements at the population scale, a predictive data
analysis uses a subset of measurements (the features) to
predict another measurement (the outcome) for a single person
or unit. An example is when organizations like
FiveThirtyEight.com use polling data to predict how people will vote
on election day. In some cases, the set of measurements
used to predict the outcome will be intuitive. There is an
obvious reason why polling data may be useful for predicting
voting behavior. But a predictive data analysis only shows that
you can predict one measurement from another; it does not
necessarily explain why that prediction works.
2.6 Causal
A causal data analysis seeks to find out what happens to one
measurement if you change another. An example is a
randomized clinical trial to identify whether fecal transplants
reduce infections due to Clostridium difficile. In this study,
patients were randomized to receive either a fecal transplant
plus standard care or standard care alone. In the resulting
data, the researchers identified a relationship between
transplants and infection outcomes.
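Randomization is what makes the causal interpretation possible: because assignment to treatment is independent of everything else about the patients, a simple difference in group means estimates the average effect of the treatment. The toy simulation below (all numbers invented, not the trial's data) illustrates the idea.

```python
import random

# Each simulated patient has a different baseline infection risk,
# and the hypothetical treatment lowers that risk by 0.3.
random.seed(3)
true_effect = -0.3
treated, control = [], []
for _ in range(2000):
    baseline_risk = random.uniform(0.3, 0.7)
    if random.random() < 0.5:  # randomized assignment: a coin flip
        risk = baseline_risk + true_effect
        treated.append(1 if random.random() < risk else 0)
    else:
        control.append(1 if random.random() < baseline_risk else 0)

# Because assignment was random, the difference in infection rates
# estimates the average causal effect of the treatment.
estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"estimated effect: {estimate:.2f} (true effect: {true_effect})")
```

Without randomization, the same difference in means could reflect confounding rather than the treatment itself.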
2.7 Mechanistic
Causal data analyses seek to identify average effects between
often noisy variables. For example, decades of data show
a clear causal relationship between smoking and cancer. If
you smoke, it is a sure thing that your risk of cancer will
increase. But it is not a sure thing that you will get cancer. The
causal effect is real, but it is an effect on your average risk. A
mechanistic data analysis seeks to demonstrate that changing
one measurement always and exclusively leads to a specific,
deterministic behavior in another. The goal is not only to
understand that there is an effect, but also how that effect operates.
An example of a mechanistic analysis is analyzing data on
how wing design changes air flow over a wing, leading
to decreased drag. Outside of engineering, mechanistic data
analysis is extremely challenging and rarely undertaken.
2.8.2 Overfitting
Interpreting an exploratory analysis as predictive
2.8.3 n of 1 analysis
Descriptive versus inferential analysis.
You know the raw data is in the right format if you ran no
software on the data, did not manipulate any of the numbers
in the data, did not remove any data from the data set, and
did not summarize the data in any way.
If you did any manipulation of the data at all it is not the raw
form of the data. Reporting manipulated data as raw data is
a very common way to slow down the analysis process, since
the analyst will often have to do a forensic study of your data
to figure out why the raw data looks weird.
While these are the hard and fast rules, there are a number
of other things that will make your data set much easier to
handle.
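One concrete habit that supports the raw-data rule is to record a cryptographic checksum of the raw file the moment you receive it, and to verify it before each analysis. This is a sketch of the idea, not a procedure prescribed by this book; the file name is hypothetical.

```python
import hashlib

def file_sha256(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Record this digest when the raw data arrives; if it ever changes,
# someone has manipulated the "raw" file.
# expected = file_sha256("raw_data.csv")
```

Storing the expected digest alongside the data and asserting equality at the top of every analysis script makes "did anyone touch the raw data?" a checkable question rather than a forensic study.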
An analyst will also want to know any other information about how you did the
data collection/study design. For example, are these the first
20 patients that walked into the clinic? Are they 20 highly
selected patients by some characteristic like age? Are they
randomized to treatments?
A common format for this document is a Word file. There
should be a section called “Study design” that has a thorough
description of how you collected the data. There is a section
called “Code book” that describes each variable and its units.
• Continuous
• Ordinal
• Categorical
• Missing
• Censored
but if you overlay the actual data points you can see that they
have very different distributions.
Figure 5.3 Data sets with identical correlations and regression lines
²http://en.wikipedia.org/wiki/Anscombe%27s_quartet
but if you size the points by the skill of the student, you see
that more skilled students don't study as much. So it is likely
that skill is confounding the relationship.
Figure 5.6 Studying versus score with point size by skill level
Figure 6.2 The first step in inference is making a best estimate of what
is happening in the population
For example, children are less literate than adult
people and also have smaller shoes. Age is related to both
literacy and shoe size and is a confounder for that relationship.
When you observe a correlation or relationship in a data set,
consider the potential confounders: variables associated with
both of the variables you are trying to relate.
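The shoe size example can be made concrete with a small simulation (all numbers invented): age drives both variables, so they correlate with each other even though neither causes the other, and the correlation shrinks once you hold age roughly constant.

```python
import random

# Age drives both shoe size and literacy in this toy model.
random.seed(1)
n = 1000
age = [random.uniform(5, 40) for _ in range(n)]
shoe_size = [0.5 * a + random.gauss(0, 2) for a in age]  # grows with age
literacy = [2.0 * a + random.gauss(0, 5) for a in age]   # grows with age

def corr(u, v):
    m = len(u)
    mu, mv = sum(u) / m, sum(v) / m
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = sum((x - mu) ** 2 for x in u) ** 0.5
    sv = sum((y - mv) ** 2 for y in v) ** 0.5
    return cov / (su * sv)

overall = corr(shoe_size, literacy)

# Holding the confounder roughly constant (one narrow age band)
# makes the apparent relationship largely disappear.
band = [i for i in range(n) if 20 <= age[i] <= 25]
within = corr([shoe_size[i] for i in band],
              [literacy[i] for i in band])
print(f"overall: {overall:.2f}, within one age band: {within:.2f}")
```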
Figure 6.4 If you infer to the wrong population, bias will result.
• A title
• An introduction or motivation
• A description of the statistics or machine learning
models you used
• Results including measures of uncertainty
• Conclusions including potential problems
• References
Figure 10.5 Without logs, 99% of the data are in the lower left-hand
corner of this figure.
• Meet people
• Get people excited about your ideas/software/results
• Help people understand your ideas/software/results
• https://speakerdeck.com/
• http://www.slideshare.net/
• Data
– raw data
– processed data
• Figures
– Exploratory figures
– Final figures
• R code
– Raw or unused scripts
– Data processing scripts
– Analysis scripts
• Text
– README files explaining what all the components are
– Final data analysis products like presentations/writeups
14.5 Inference
1. Did you identify the larger population you are trying
to describe?
14.6 Prediction
1. Did you identify in advance your error measure?
2. Did you immediately split your data into training and
validation?
3. Did you use cross validation, resampling, or bootstrapping
only on the training data?
4. Did you create features using only the training data?
5. Did you estimate parameters only on the training data?
6. Did you fix all features, parameters, and models before
applying to the validation data?
7. Did you apply only one final model to the validation
data and report the error rate?
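Items 2 through 7 describe a single discipline: everything is estimated on the training data, and the validation data is touched exactly once. A minimal sketch (in Python, with an invented data set and a deliberately simple least-squares model; the book does not prescribe a language):

```python
import random

# Invented data: one feature x and an outcome y.
random.seed(42)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]

# Item 2: split into training and validation immediately.
idx = list(range(n))
random.shuffle(idx)
train, valid = idx[:70], idx[70:]

# Items 4-5: build features and estimate parameters on training
# data only (here, a least-squares slope and intercept).
xt = [x[i] for i in train]
yt = [y[i] for i in train]
xbar, ybar = sum(xt) / len(xt), sum(yt) / len(yt)
slope = (sum((a - xbar) * (b - ybar) for a, b in zip(xt, yt))
         / sum((a - xbar) ** 2 for a in xt))
intercept = ybar - slope * xbar

# Items 6-7: fix the model, apply it once to the validation data,
# and report the pre-chosen error measure (item 1: mean squared error).
mse = sum((y[i] - (intercept + slope * x[i])) ** 2
          for i in valid) / len(valid)
print(f"validation MSE: {mse:.3f}")
```

The point is the order of operations, not the model: any peeking at the validation data before the final step overstates how well the model will predict.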
14.7 Causality
1. Did you identify whether your study was randomized?
2. Did you identify potential reasons that causality may
not be appropriate such as confounders, missing data,
non-ignorable dropout, or unblinded experiments?
3. If the study was not randomized, did you avoid using
language that would imply cause and effect?
14.9 Figures
1. Does each figure communicate an important piece of
information or address a question of interest?
2. Do all your figures include plain language axis labels?
3. Is the font size large enough to read?
4. Does every figure have a detailed caption that explains
all axes, legends, and trends in the figure?
14.10 Presentations
1. Did you lead with a brief statement of your problem
that is understandable to everyone?
2. Did you explain the data, measurement technology, and
experimental design before you explained your model?
3. Did you explain the features you will use to model data
before you explain the model?
4. Did you make sure all legends and axes were legible
from the back of the room?
14.11 Reproducibility
1. Did you avoid doing calculations manually?
2. Did you create a script that reproduces all your
analyses?
3. Did you save the raw and processed versions of your
data?
4. Did you record all versions of the software you used to
process the data?
5. Did you try to have someone else run your analysis
code to confirm they got the same answers?
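Checklist item 4 can be automated. The sketch below records the version of the language and of whichever analysis packages happen to be installed (the package names are examples, not a required list).

```python
import platform

# Record the software environment so the analysis can be rerun later.
versions = {
    "python": platform.python_version(),
    "platform": platform.platform(),
}
# Illustrative analysis packages; skip any that are not installed.
for pkg in ("numpy", "pandas"):
    try:
        module = __import__(pkg)
        versions[pkg] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[pkg] = "not installed"

for name, version in sorted(versions.items()):
    print(f"{name}: {version}")
```

In R, the analogous one-liner is sessionInfo(), which reports the R version and the version of every loaded package; either way, the record belongs next to the analysis scripts.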
14.12 R packages
1. Did you make your package name “Googleable”?
2. Did you write unit tests for your functions?
3. Did you write help files for all functions?
4. Did you write a vignette?
5. Did you try to reduce dependencies to actively
maintained packages?
6. Have you eliminated all errors and warnings from R
CMD CHECK?
15. Additional resources
15.1 Class lecture notes
• Johns Hopkins Data Science Specialization¹ and Additional resources²
• Data wrangling, exploration, and analysis with R³
• Tools for Reproducible Research⁴
• Data carpentry⁵
15.2 Tutorials
• Git/github tutorial⁶
• Make tutorial⁷
• knitr in a knutshell⁸
• Writing an R package from scratch⁹
¹https://github.com/DataScienceSpecialization/courses
²http://datasciencespecialization.github.io/
³https://stat545-ubc.github.io/
⁴http://kbroman.org/Tools4RR/
⁵https://github.com/datacarpentry/datacarpentry
⁶http://kbroman.org/github_tutorial/
⁷http://kbroman.org/minimal_make/
⁸http://kbroman.org/knitr_knutshell/
⁹http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
15.4 Books
• An introduction to statistical learning¹³
• Advanced data analysis from an elementary point of
view¹⁴
• Advanced R programming¹⁵
• OpenIntro Statistics¹⁶
• Statistical inference for data science¹⁷
¹³http://www-bcf.usc.edu/~gareth/ISL/
¹⁴http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/
¹⁵http://adv-r.had.co.nz/
¹⁶https://www.openintro.org/stat/textbook.php
¹⁷https://leanpub.com/LittleInferenceBook