Exploratory Data Analysis

CSE 442 - Data Visualization
Exploratory Data Analysis
Jeffrey Heer University of Washington

What was the first data visualization?
~6200 BC Town Map of Çatal Hüyük, Konya Plain, Turkey
~950 AD Position of Sun, Moon and Planets
Sunspots over time, Scheiner 1626
Longitudinal distance between Toledo and Rome, van Langren 1644
The Rate of Water Evaporation, Lambert 1765
The Golden Age of Data Visualization, 1786-1900
The Commercial and Political Atlas, William Playfair 1786
Statistical Breviary, William Playfair 1801
1826(?) Illiteracy in France, Pierre Charles Dupin
“to affect thro’ the Eyes
what we fail to convey to
the public through their
word-proof ears”
1856 “Coxcomb” of Crimean War Deaths, Florence Nightingale
1864 British Coal Exports, Charles Minard
1884 Rail Passengers and Freight from Paris
1890 Statistical Atlas of the Eleventh U.S. Census
The Rise of Statistics, 1900-1950
Rise of formal methods in statistics and
social science: Fisher, Pearson, …
Little innovation in graphical methods
A period of application and popularization
Graphical methods enter textbooks,
curricula, and mainstream use

Data Analysis & Statistics, Tukey 1962
Four major influences act on data
analysis today:
1. The formal theories of statistics.
2. Accelerating developments in
computers and display devices.
3. The challenge, in many fields, of
more and larger bodies of data.
4. The emphasis on quantification
in a wider variety of disciplines.
The last few decades have seen the
rise of formal theories of statistics,
"legitimizing" variation by confining
it by assumption to random
sampling, often assumed to involve
tightly specified distributions, and
restoring the appearance of
security by emphasizing narrowly
optimized techniques and claiming
to make statements with "known"
probabilities of error.
While some of the influences of
statistical theory on data
analysis have been helpful,
others have not.
Exposure, the effective laying
open of the data to display the
unanticipated, is to us a major
portion of data analysis. Formal
statistics has given almost no
guidance to exposure; indeed, it is
not clear how the informality and
flexibility appropriate to the
exploratory character of exposure
can be fitted into any of the
structures of formal statistics so far
proposed.
Nothing - not the careful logic of
mathematics, not statistical models
and theories, not the awesome
arithmetic power of modern
computers - nothing can substitute
here for the flexibility of the
informed human mind.
Accordingly, both approaches and
techniques need to be structured so
as to facilitate human involvement
and intervention.
Set A         Set B         Set C         Set D
X     Y       X     Y       X     Y       X     Y
10    8.04    10    9.14    10    7.46    8     6.58
8     6.95    8     8.14    8     6.77    8     5.76
13    7.58    13    8.74    13    12.74   8     7.71
9     8.81    9     8.77    9     7.11    8     8.84
11    8.33    11    9.26    11    7.81    8     8.47
14    9.96    14    8.10    14    8.84    8     7.04
6     7.24    6     6.13    6     6.08    8     5.25
4     4.26    4     3.10    4     5.39    19    12.50
12    10.84   12    9.11    12    8.15    8     5.56
7     4.82    7     7.26    7     6.42    8     7.91
5     5.68    5     4.74    5     5.73    8     6.89
Summary Statistics
μX = 9.0 σX = 3.317
μY = 7.5 σY = 2.03
Linear Regression
Y = 3 + 0.5 X
R² = 0.67
[Anscombe 1973]
[Figure: scatter plots of Y against X for Set A, Set B, Set C, and Set D]
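
The point of the four sets can be checked numerically. The following is a minimal sketch, using only the Python standard library, that recomputes the summary statistics and the least-squares line for each set from the table above. The helper name `summarize` and the dictionary layout are illustrative choices, not part of the original slides.

```python
from statistics import mean, stdev

# Values copied from the Set A-D table above.
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.11, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, sample standard deviations, and the least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "sd_x": stdev(xs), "sd_y": stdev(ys),
        "slope": slope, "intercept": my - slope * mx,
        "r2": sxy ** 2 / (sxx * syy),
    }

stats = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in stats.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

All four sets print nearly the same numbers, which is why the statistics alone cannot tell them apart and the plots are needed.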
Topics
Exploratory Data Analysis

Data Wrangling
Exploratory Analysis Examples
Polaris / Tableau
Data Wrangling
I spend more than half of my
time integrating, cleansing and
transforming data without doing
any actual analysis. Most of the
time I’m lucky if I get to do any
“analysis” at all.
Anonymous Data Scientist
[Kandel et al. ’12]
DataWrangler
Wrangler: Interactive Visual Specification
of Data Transformation Scripts
Sean Kandel et al. CHI’11
Data Wrangling
One often needs to manipulate data prior to
analysis. Tasks include reformatting, cleaning,
quality assessment, and integration.
Approaches include:
Manual manipulation in spreadsheets
Custom code (e.g., dplyr in R, Pandas in Python)
Trifacta Wrangler http://www.trifacta.com/products/wrangler/
Open Refine http://openrefine.org/
Data Quality
“The first sign that a visualization is good is that it
shows you a problem in your data…
…every successful visualization that I've been
involved with has had this stage where you realize,
"Oh my God, this data is not what I thought it
would be!" So already, you've discovered
something.”
Martin Wattenberg
Violent Marauding
Infants! Centenarians!
???
Visualize Friends by School?
Berkeley |||||||||||||||||||||||||||||||
Cornell ||||
Harvard |||||||||
Harvard University |||||||
Stanford ||||||||||||||||||||
Stanford University ||||||||||
UC Berkeley |||||||||||||||||||||
UC Davis ||||||||||
University of California at Berkeley |||||||||||||||
University of California, Berkeley ||||||||||||||||||
University of California, Davis |||
Data Quality Hurdles
Missing Data: no measurements, redacted, …?
Erroneous Values: misspelling, outliers, …?
Type Conversion: e.g., zip code to lat-lon
Entity Resolution: diff. values for the same thing?
Data Integration: effort/errors when combining data
LESSON: Anticipate problems with your data.
Many research problems around these issues!
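
The "Friends by School" chart is an entity resolution problem: several labels name the same school. As a rough sketch, one simple approach is to normalize each label to a canonical key before counting. The rules in `canonical_school` below are hypothetical heuristics written for the labels on the slide only; they are not a general solution and would mis-handle many real school names.

```python
import re

# Labels as they appear in the "Visualize Friends by School?" chart.
LABELS = [
    "Berkeley", "Cornell", "Harvard", "Harvard University",
    "Stanford", "Stanford University", "UC Berkeley", "UC Davis",
    "University of California at Berkeley",
    "University of California, Berkeley",
    "University of California, Davis",
]

def canonical_school(label: str) -> str:
    """Hypothetical heuristic: strip punctuation and generic words, keep the rest."""
    s = label.lower()
    s = re.sub(r"[^a-z ]", " ", s)  # punctuation becomes whitespace
    s = re.sub(r"\b(university of california|uc|university|at)\b", " ", s)
    return " ".join(s.split())      # collapse repeated whitespace

canonical = {label: canonical_school(label) for label in LABELS}
for label, key in canonical.items():
    print(f"{label!r:45} -> {key!r}")
print(len(set(canonical.values())), "distinct schools")
```

With these rules the eleven labels collapse to five keys. In practice the mapping usually needs manual review, since string rules alone cannot know which values refer to the same thing.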
Analysis Example:
Motion Pictures Data
Motion Pictures Data
Title String (N)

IMDB Rating Number (Q)
Rotten Tomatoes Rating Number (Q)
MPAA Rating String (O)
Release Date Date (T)
Lesson: Exercise Skepticism
Check data quality and your assumptions.
Start with univariate summaries, then start to
consider relationships among variables.
Avoid premature fixation!
Analysis Example:
Antibiotic Effectiveness
Data Set: Antibiotic Effectiveness
Genus of Bacteria String (N)

Species of Bacteria String (N)
Antibiotic Applied String (N)
Gram-Staining? Pos / Neg (N)
Min. Inhibitory Concent. (g) Number (Q)
Collected prior to 1951.

What questions might we ask?
How do the drugs compare?
Original graphic by Will Burtin, 1951

Radius: 1 / log(MIC)
Bar Color: Antibiotic
Background Color: Gram Staining
Mike Bostock
Stanford CS448B, Winter 2009
X-axis: Antibiotic | log(MIC)

Y-axis: Gram-Staining | Species
Color: Most-Effective?
Bowen Li
Stanford CS448B, Fall 2009
Which antibiotic should one use?
Do the bacteria group by antibiotic resistance?
Wainer & Lysen, American Scientist, 2009
Not a streptococcus! (realized ~30 yrs later)
Really a streptococcus!
Do different drugs correlate?
Wainer & Lysen
Lesson: Iterative Exploration
Exploratory Process
1 Construct graphics to address questions
2 Inspect “answer” and assess new questions
3 Repeat…
Transform data appropriately (e.g., invert, log)
Show data variation, not design variation [Tufte]
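
To illustrate "invert, log": a minimum inhibitory concentration is smaller when a drug is more effective, and such values can span many orders of magnitude. The sketch below applies a base-10 log and then negates it so that larger numbers mean "more effective". The MIC values are made up for illustration and are not Burtin's measurements.

```python
import math

# Hypothetical MIC values (illustrative only, NOT taken from the 1951 data set).
mic_values = [0.001, 0.03, 1.0, 870.0]

# Log transform: compresses the range from a factor of ~870,000 to ~6 units.
log_mic = [math.log10(v) for v in mic_values]

# Invert: negate so that a lower MIC (more effective drug) gets a higher score.
effectiveness = [-v for v in log_mic]

for raw, lg, eff in zip(mic_values, log_mic, effectiveness):
    print(f"MIC={raw:>8}  log10={lg:6.2f}  -log10={eff:6.2f}")
```

On the raw scale the smallest values would be invisible next to the largest one; after the transform each order of magnitude takes the same amount of space.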

Administrivia
A2: Exploratory Data Analysis
Use visualization software to form & answer questions
First steps:
Step 1: Pick domain & data
Step 2: Pose questions
Step 3: Profile the data
Iterate as needed
Create visualizations
Interact with data
Refine your questions
Author a report
Screenshots of most insightful views (10+)
Include titles and captions for each view
Due by 5:00pm Monday, Oct 16
Tableau / Polaris
Polaris [Stolte et al.]
Tableau
Encodings
Data Display
Data Model
Tableau Demo
The dataset:
Federal Elections Commission Receipts
Every Congressional Candidate from 1996 to 2002
4 Election Cycles
9216 Candidacies
Dataset Schema
Year (Qi)
Candidate Code (N)
Candidate Name (N)
Incumbent / Challenger / Open-Seat (N)
Party Code (N) [1=Dem,2=Rep,3=Other]
Party Name (N)
Total Receipts (Qr)
State (N)
District (N)
This is a subset of the larger data set available from the FEC.
Hypotheses?
What might we learn from this data?

Correlation between receipts and winners?
Do receipts increase over time?
Which states spend the most?
Which party spends the most?
Margin of victory vs. amount spent?
Amount spent between competitors?
Tableau Demo
Tableau / Polaris Approach
Insight: can simultaneously specify both
database queries and visualization
Choose data, then visualization, not vice versa
Use smart defaults for visual encodings
Can also suggest encodings upon request
Specifying Table Configurations
Operands are the database fields

Each operand interpreted as a set {…}
Quantitative and Ordinal fields treated differently
Three operators:
concatenation (+)
cross product (x)
nest (/)  
Table Algebra: Operands
Ordinal fields: interpret domain as a set that partitions
table into rows and columns.
Quarter = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} ->
Quantitative fields: treat domain as single element set
and encode spatially as axes.
Profit = {(Profit[-410,650])} ->
Concatenation (+) Operator
Ordered union of set interpretations

Quarter + Product Type
= {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} + {(Coffee), (Espresso)}
= {(Qtr1),(Qtr2),(Qtr3),(Qtr4),(Coffee),(Espresso)}
Profit + Sales = {(Profit[-310,620]),(Sales[0,1000])}
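
A minimal sketch of the concatenation operator, assuming each set interpretation is modeled as an ordered Python list of tuples. The function name `concat` and this representation are illustrative assumptions, not how Polaris or Tableau implement it.

```python
# Set interpretations of two ordinal fields, as on the slide.
quarter = [("Qtr1",), ("Qtr2",), ("Qtr3",), ("Qtr4",)]
product_type = [("Coffee",), ("Espresso",)]

def concat(a, b):
    """Ordered union: all entries of a, then the entries of b not already present."""
    return a + [t for t in b if t not in a]

result = concat(quarter, product_type)
print(result)
```

The result keeps the quarters first and appends the product types, matching the six-element set shown above.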

Cross (x) Operator
Cross-product of set interpretations

Quarter x Product Type =
{(Qtr1,Coffee), (Qtr1, Tea), (Qtr2, Coffee), (Qtr2, Tea), (Qtr3,
Coffee), (Qtr3, Tea), (Qtr4, Coffee), (Qtr4,Tea)}
Product Type x Profit =
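
The cross operator on two ordinal fields can be sketched the same way, again modeling a set interpretation as an ordered list of tuples. The name `cross` and the representation are assumptions made for illustration.

```python
from itertools import product

quarter = [("Qtr1",), ("Qtr2",), ("Qtr3",), ("Qtr4",)]
product_type = [("Coffee",), ("Tea",)]

def cross(a, b):
    """Cross product: pair every entry of a with every entry of b, in order."""
    return [x + y for x, y in product(a, b)]

pairs = cross(quarter, product_type)
print(pairs)
```

This yields the eight (quarter, product type) pairs listed for Quarter x Product Type, in the same order.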

Nest (/) Operator
Cross-product filtered by existing records
Quarter x Month ->
creates twelve entries for each quarter, including
pairs that never occur, e.g., (Qtr1, December)
Quarter / Month ->
creates three entries per quarter based on
tuples in database (not semantics)
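
The difference between cross and nest can be sketched as a cross product followed by a filter on the pairs that actually occur in the data. The `records` set below is a hypothetical stand-in for the database (one record per calendar month), and the function names are illustrative.

```python
from itertools import product

month_names = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
quarters = [("Qtr1",), ("Qtr2",), ("Qtr3",), ("Qtr4",)]
months = [(m,) for m in month_names]

# Hypothetical database contents: each month appears only with its own quarter.
records = {(f"Qtr{i // 3 + 1}", m) for i, m in enumerate(month_names)}

def cross(a, b):
    """Cross product of two set interpretations."""
    return [x + y for x, y in product(a, b)]

def nest(a, b, existing):
    """Cross product filtered down to the pairs present in the records."""
    return [t for t in cross(a, b) if t in existing]

crossed = cross(quarters, months)
nested = nest(quarters, months, records)
print(len(crossed), len(nested))  # 48 12
```

The filter uses only which tuples exist, not any knowledge of calendars, which is what "not semantics" refers to.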
Table Algebra
The operators (+, x, /) and operands (O, Q) provide
an algebra for tabular visualization.
Algebraic statements are then mapped to:
Visualizations - trellis plot partitions, visual encodings
Queries - selection, projection, group-by aggregation
In Tableau, users make statements via drag-and-drop
Note that this specifies operands NOT operators!
Operators are inferred by data type (O, Q)
Ordinal-Ordinal
Quantitative-Quantitative
Ordinal-Quantitative
Querying the Database
BONUS TOPIC
Data Fraud
A Detective Story
You have accounting records for two firms
that are in dispute. One is lying. How to tell?
Firm A                  Firm B (LIARS!)
283.08     25.23        283.08     75.23
153.86     385.62       353.86     185.25
1448.97    12371.32     5322.79    9971.42
18595.91   1280.76      8795.64    4802.43
21.33      257.64       61.33      57.64
Amt. Paid: $34823.72    Amt. Rec’d: $29908.67
Benford’s Law (Benford 1938, Newcomb 1881)
The logarithms of the values (not the values
themselves) are uniformly randomly distributed.
Hence the leading digit 1 has a ~30% likelihood.

Larger digits are increasingly less likely.
Holds for many (but certainly not all) real-life data
sets: Addresses, Bank accounts, Building heights, …
Data must span multiple orders of magnitude.
Evidence that records do not follow Benford’s Law
is admissible in a court of law!
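
If the logarithms are uniformly distributed, the probability of leading digit d is log10(1 + 1/d), which gives about 30% for d = 1. The sketch below compares that expectation with the leading digits of the two firms' records from the detective story. The helper names are illustrative, and with only ten values per firm the comparison is suggestive rather than a statistical test.

```python
import math
from collections import Counter

# Records copied from the Firm A / Firm B slide.
firm_a = [283.08, 25.23, 153.86, 385.62, 1448.97,
          12371.32, 18595.91, 1280.76, 21.33, 257.64]
firm_b = [283.08, 75.23, 353.86, 185.25, 5322.79,
          9971.42, 8795.64, 4802.43, 61.33, 57.64]

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):e}"[0])

def benford_probability(d: int) -> float:
    """Expected share of values with leading digit d under Benford's Law."""
    return math.log10(1 + 1 / d)

counts_a = Counter(leading_digit(v) for v in firm_a)
counts_b = Counter(leading_digit(v) for v in firm_b)

print("digit  expected  firm A  firm B")
for d in range(1, 10):
    print(f"{d:>5}  {benford_probability(d):8.3f}  {counts_a[d]:>6}  {counts_b[d]:>6}")
```

Firm A's leading digits are concentrated on 1 and 2, as Benford's Law predicts, while Firm B's are spread almost evenly across 1 through 9.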
