Exploratory Data Analysis

Some of the earliest forms of data visualization date back to 6200 BC with maps. Many influential visualizations were created during the 'Golden Age' from 1786-1900. Modern data visualization has been influenced by developments in statistics, computing, and large datasets.

One of the earliest known data visualizations is the Town Map of Catal Hyük from around 6200 BC in Turkey.

Early examples mentioned include a 950 AD diagram showing the position of celestial bodies, sunspot observations from 1626, and charts from 1644-1765 showing longitudinal distances and rates of water evaporation.

CSE 442 - Data Visualization

Exploratory Data Analysis

Jeffrey Heer University of Washington


What was the first data visualization?

~6200 BC Town Map of Catal Hyük, Konya Plain, Turkey
~950 AD Position of Sun, Moon and Planets
Sunspots over time, Scheiner 1626
Longitudinal distance between Toledo and Rome, van Langren 1644
The Rate of Water Evaporation, Lambert 1765
The Golden Age of Data Visualization

1786-1900
The Commercial and Political Atlas, William Playfair 1786
Statistical Breviary, William Playfair 1801
Illiteracy in France, Pierre Charles Dupin 1826(?)
“to affect thro’ the Eyes what we fail to convey to the public through their word-proof ears”
“Coxcomb” of Crimean War Deaths, Florence Nightingale 1856
British Coal Exports, Charles Minard 1864
Rail Passengers and Freight from Paris 1884
Statistical Atlas of the Eleventh U.S. Census 1890
The Rise of Statistics

1900-1950

Rise of formal methods in statistics and social science: Fisher, Pearson, …

Little innovation in graphical methods

A period of application and popularization

Graphical methods enter textbooks, curricula, and mainstream use

Data Analysis & Statistics, Tukey 1962

Four major influences act on data analysis today:
1. The formal theories of statistics.
2. Accelerating developments in computers and display devices.
3. The challenge, in many fields, of more and larger bodies of data.
4. The emphasis on quantification in a wider variety of disciplines.

The last few decades have seen the rise of formal theories of statistics, "legitimizing" variation by confining it by assumption to random sampling, often assumed to involve tightly specified distributions, and restoring the appearance of security by emphasizing narrowly optimized techniques and claiming to make statements with "known" probabilities of error.

While some of the influences of statistical theory on data analysis have been helpful, others have not.

Exposure, the effective laying open of the data to display the unanticipated, is to us a major portion of data analysis. Formal statistics has given almost no guidance to exposure; indeed, it is not clear how the informality and flexibility appropriate to the exploratory character of exposure can be fitted into any of the structures of formal statistics so far proposed.

Nothing - not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers - nothing can substitute here for the flexibility of the informed human mind.

Accordingly, both approaches and techniques need to be structured so as to facilitate human involvement and intervention.
Set A          Set B          Set C          Set D
X     Y        X     Y        X     Y        X     Y
10    8.04     10    9.14     10    7.46     8     6.58
8     6.95     8     8.14     8     6.77     8     5.76
13    7.58     13    8.74     13    12.74    8     7.71
9     8.81     9     8.77     9     7.11     8     8.84
11    8.33     11    9.26     11    7.81     8     8.47
14    9.96     14    8.10     14    8.84     8     7.04
6     7.24     6     6.13     6     6.08     8     5.25
4     4.26     4     3.10     4     5.39     19    12.50
12    10.84    12    9.13     12    8.15     8     5.56
7     4.82     7     7.26     7     6.42     8     7.91
5     5.68     5     4.74     5     5.73     8     6.89

Summary Statistics           Linear Regression
μX = 9.0    σX = 3.317       Y = 3 + 0.5 X
μY = 7.5    σY = 2.03        R² = 0.67
[Anscombe 1973]
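
The shared summary statistics above can be checked directly. The following is a minimal sketch using only the Python standard library; the names `QUARTET`, `summarize`, and `SUMMARIES` are illustrative choices, and the data values are those of Anscombe's published quartet. It computes sample means, sample standard deviations, and the ordinary least-squares line for each set.

```python
from statistics import mean, stdev

# Anscombe's quartet; Sets A-C share the same x values.
# Note: Set B's y value at x = 12 is 9.13 in Anscombe (1973).
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "A": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return sample means, sample std devs, and the least-squares fit of y on x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "sd_x": stdev(xs), "sd_y": stdev(ys),
        "slope": slope, "intercept": my - slope * mx,
        "r2": sxy ** 2 / (sxx * syy),  # squared Pearson correlation
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four sets agree on these numbers to about two decimal places, which is why the summary alone cannot distinguish them; only plotting the points reveals how different they are.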
[Figure: scatter plots of Set A, Set B, Set C, and Set D]
Topics

Exploratory Data Analysis
Data Wrangling
Exploratory Analysis Examples
Polaris / Tableau
Data Wrangling
I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis. Most of the time I’m lucky if I get to do any “analysis” at all.
Anonymous Data Scientist
[Kandel et al. ’12]
DataWrangler

Wrangler: Interactive Visual Specification of Data Transformation Scripts
Sean Kandel et al. CHI’11
Data Wrangling

One often needs to manipulate data prior to analysis. Tasks include reformatting, cleaning, quality assessment, and integration.

Approaches include:
Manual manipulation in spreadsheets
Custom code (e.g., dplyr in R, Pandas in Python)
Trifacta Wrangler http://www.trifacta.com/products/wrangler/
Open Refine http://openrefine.org/
Data Quality

“The first sign that a visualization is good is that it shows you a problem in your data…
…every successful visualization that I've been involved with has had this stage where you realize, "Oh my God, this data is not what I thought it would be!" So already, you've discovered something.”
Martin Wattenberg
Violent Marauding
Infants! Centenarians!
???
Visualize Friends by School?

Berkeley |||||||||||||||||||||||||||||||
Cornell ||||
Harvard |||||||||
Harvard University |||||||
Stanford ||||||||||||||||||||
Stanford University ||||||||||
UC Berkeley |||||||||||||||||||||
UC Davis ||||||||||
University of California at Berkeley |||||||||||||||
University of California, Berkeley ||||||||||||||||||
University of California, Davis |||
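
The chart above splits the same school across several spellings, so a naive count understates each one. A minimal sketch of one way to merge them follows, assuming a hand-built alias table; `CANONICAL`, `resolve`, and `count_by_school` are hypothetical names, and which spellings belong together is an assumption that real data would need to have reviewed.

```python
from collections import Counter

# Hypothetical alias table: maps each normalized raw spelling to one canonical name.
CANONICAL = {
    "berkeley": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california at berkeley": "UC Berkeley",
    "university of california, berkeley": "UC Berkeley",
    "uc davis": "UC Davis",
    "university of california, davis": "UC Davis",
    "harvard": "Harvard",
    "harvard university": "Harvard",
    "stanford": "Stanford",
    "stanford university": "Stanford",
    "cornell": "Cornell",
}


def resolve(raw_name):
    """Map a raw school string to its canonical name; unknown names pass through."""
    key = " ".join(raw_name.lower().split())  # lowercase and collapse whitespace
    return CANONICAL.get(key, raw_name.strip())


def count_by_school(raw_names):
    """Count records per canonical school after resolving aliases."""
    return Counter(resolve(name) for name in raw_names)


if __name__ == "__main__":
    sample = ["Berkeley", "UC Berkeley", "University of California, Berkeley",
              "Harvard", "Harvard University", "Stanford  University", "MIT"]
    print(count_by_school(sample))
```

A lookup table is the simplest approach and keeps every merge decision visible; fuzzy string matching could propose candidates, but it can also merge names that should stay distinct.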
Data Quality Hurdles

Missing Data: no measurements, redacted, …?
Erroneous Values: misspelling, outliers, …?
Type Conversion: e.g., zip code to lat-lon
Entity Resolution: diff. values for the same thing?
Data Integration: effort/errors when combining data

LESSON: Anticipate problems with your data.
Many research problems around these issues!
Analysis Example:
Motion Pictures Data
Motion Pictures Data

Title                     String (N)
IMDB Rating               Number (Q)
Rotten Tomatoes Rating    Number (Q)
MPAA Rating               String (O)
Release Date              Date (T)
Lesson: Exercise Skepticism
Check data quality and your assumptions.
Start with univariate summaries, then consider relationships among variables.
Avoid premature fixation!
Analysis Example:
Antibiotic Effectiveness
Data Set: Antibiotic Effectiveness

Genus of Bacteria               String (N)
Species of Bacteria             String (N)
Antibiotic Applied              String (N)
Gram-Staining?                  Pos / Neg (N)
Min. Inhibitory Concent. (g)    Number (Q)

Collected prior to 1951.

What questions might we ask?
How do the drugs compare?

Original graphic by Will Burtin, 1951
Radius: 1 / log(MIC)
Bar Color: Antibiotic
Background Color: Gram Staining

How do the drugs compare?

Mike Bostock
Stanford CS448B, Winter 2009

How do the drugs compare?

X-axis: Antibiotic | log(MIC)
Y-axis: Gram-Staining | Species
Color: Most-Effective?
Bowen Li
Stanford CS448B, Fall 2009
Which antibiotic should one use?
Do the bacteria group by antibiotic resistance?

Not a streptococcus! (realized ~30 yrs later)
Really a streptococcus! (realized ~20 yrs later)

Wainer & Lysen
American Scientist, 2009

Do the bacteria group by resistance?
Do different drugs correlate?

Wainer & Lysen
American Scientist, 2009
Lesson: Iterative Exploration
Exploratory Process
1. Construct graphics to address questions
2. Inspect “answer” and assess new questions
3. Repeat…

Transform data appropriately (e.g., invert, log)

Show data variation, not design variation [Tufte]
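
To illustrate the log transform mentioned above: when values span several orders of magnitude, a linear scale squeezes the small ones together, while their base-10 logarithms spread out evenly. The sketch below is a minimal example; `log10_transform` is a hypothetical helper and the concentrations are made-up values, not taken from the antibiotic data set.

```python
import math


def log10_transform(values):
    """Return log10 of each value; inputs must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log10(v) for v in values]


# Hypothetical concentrations spanning several orders of magnitude.
concentrations = [0.001, 0.03, 1.0, 10.0, 850.0]
logged = log10_transform(concentrations)

if __name__ == "__main__":
    for raw, log_value in zip(concentrations, logged):
        print(f"{raw:>8} -> {log_value:6.2f}")
```

Negating the logged values (equivalently, taking the log of the reciprocal) gives the inverted form, which may suit cases where smaller raw values should read as larger marks.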


Administrivia
A2: Exploratory Data Analysis
Use visualization software to form & answer questions
First steps:
Step 1: Pick domain & data
Step 2: Pose questions
Step 3: Profile the data
Iterate as needed
Create visualizations
Interact with data
Refine your questions
Author a report
Screenshots of most insightful views (10+)
Include titles and captions for each view
Due by 5:00pm Monday, Oct 16
Tableau / Polaris
Polaris [Stolte et al.]
Tableau

[Screenshot: Tableau interface with labeled regions: Encodings, Data Display, Data Model]
Tableau Demo

The dataset:
Federal Election Commission Receipts
Every Congressional Candidate from 1996 to 2002
4 Election Cycles
9216 Candidacies
Dataset Schema

Year (Qi)
Candidate Code (N)
Candidate Name (N)
Incumbent / Challenger / Open-Seat (N)
Party Code (N) [1=Dem,2=Rep,3=Other]
Party Name (N)
Total Receipts (Qr)
State (N)
District (N)
This is a subset of the larger data set available from the FEC.
Hypotheses?

What might we learn from this data?


Correlation between receipts and winners?
Do receipts increase over time?
Which states spend the most?
Which party spends the most?
Margin of victory vs. amount spent?
Amount spent between competitors?
Tableau Demo
Tableau / Polaris Approach

Insight: can simultaneously specify both database queries and visualization
Choose data, then visualization, not vice versa
Use smart defaults for visual encodings
Can also suggest encodings upon request
Specifying Table Configurations

Operands are the database fields
Each operand interpreted as a set {…}
Quantitative and Ordinal fields treated differently

Three operators:
concatenation (+)
cross product (x)
nest (/)

[Figure: example using the x and + operators, with GROUP BY Category, Region, Segment]
Table Algebra: Operands

Ordinal fields: interpret domain as a set that partitions table into rows and columns.
Quarter = {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} ->

Quantitative fields: treat domain as single element set and encode spatially as axes.
Profit = {(Profit[-410,650])} ->
Concatenation (+) Operator

Ordered union of set interpretations


Quarter + Product Type
= {(Qtr1),(Qtr2),(Qtr3),(Qtr4)} + {(Coffee), (Espresso)}
= {(Qtr1),(Qtr2),(Qtr3),(Qtr4),(Coffee),(Espresso)}

Profit + Sales = {(Profit[-310,620]),(Sales[0,1000])}


Cross (x) Operator

Cross-product of set interpretations


Quarter x Product Type =
{(Qtr1, Coffee), (Qtr1, Tea), (Qtr2, Coffee), (Qtr2, Tea), (Qtr3, Coffee), (Qtr3, Tea), (Qtr4, Coffee), (Qtr4, Tea)}

Product Type x Profit =


Nest (/) Operator

Cross-product filtered by existing records

Quarter x Month -> creates twelve entries for each quarter, e.g., (Qtr1, December)

Quarter / Month -> creates three entries per quarter based on tuples in database (not semantics)
Table Algebra

The operators (+, x, /) and operands (O, Q) provide an algebra for tabular visualization.
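
The three operators can be sketched on ordinal operands as follows. This is a minimal illustration, not the Polaris or Tableau implementation: each operand is modeled as an ordered list of tuples, and `concat`, `cross`, `nest`, and the `records` list are hypothetical names introduced here. Quantitative operands are left out for brevity.

```python
from itertools import product


def concat(a, b):
    """Concatenation (+): ordered union of two set interpretations."""
    out = list(a)
    out.extend(item for item in b if item not in out)
    return out


def cross(a, b):
    """Cross (x): every pairing of an entry of a with an entry of b."""
    return [left + right for left, right in product(a, b)]


def nest(a, b, records):
    """Nest (/): cross product restricted to pairings that occur in records."""
    existing = set(records)
    return [pair for pair in cross(a, b) if pair in existing]


quarter = [("Qtr1",), ("Qtr2",), ("Qtr3",), ("Qtr4",)]
product_type = [("Coffee",), ("Tea",)]

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
month = [(name,) for name in months]
# Hypothetical database tuples: one (quarter, month) record per calendar month.
records = [(f"Qtr{index // 3 + 1}", name) for index, name in enumerate(months)]

if __name__ == "__main__":
    print(concat(quarter, product_type))
    print(cross(quarter, product_type))
    print(len(cross(quarter, month)), len(nest(quarter, month, records)))
```

Here the cross product of quarter and month yields 48 pairings, including meaningless ones such as (Qtr1, December), while nest keeps only the 12 pairings present in the records.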

Algebraic statements are then mapped to:
Visualizations - trellis plot partitions, visual encodings
Queries - selection, projection, group-by aggregation
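
As a rough illustration of the query side, a statement such as Quarter x Product Type with an aggregated Profit measure could correspond to a group-by query like the one below. This is a sketch under assumptions, not the query Tableau actually generates: the `sales` table, its columns, and the rows are made up, and Python's built-in `sqlite3` module stands in for the database.

```python
import sqlite3

# Hypothetical rows (quarter, product_type, profit); the values are made up.
ROWS = [
    ("Qtr1", "Coffee", 120.0), ("Qtr1", "Coffee", 80.0), ("Qtr1", "Tea", 40.0),
    ("Qtr2", "Coffee", 150.0), ("Qtr2", "Tea", -10.0), ("Qtr2", "Tea", 30.0),
]


def profit_by_pane(rows):
    """Sum profit for each (quarter, product_type) pane of the table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (quarter TEXT, product_type TEXT, profit REAL)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        query = """
            SELECT quarter, product_type, SUM(profit)   -- projection + aggregation
            FROM sales
            GROUP BY quarter, product_type              -- one group per pane
            ORDER BY quarter, product_type
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


PANES = profit_by_pane(ROWS)

if __name__ == "__main__":
    for pane in PANES:
        print(pane)
```

A selection step would add a WHERE clause to the same query; the ordinal fields supply the grouping columns and the quantitative field supplies the aggregated value.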

In Tableau, users make statements via drag-and-drop
Note that this specifies operands NOT operators!
Operators are inferred by data type (O, Q)
Ordinal-Ordinal
Quantitative-Quantitative
Ordinal-Quantitative
Querying the Database
BONUS TOPIC
Data Fraud
A Detective Story

You have accounting records for two firms that are in dispute. One is lying. How to tell?
Firm A                    Firm B                    LIARS!
283.08      25.23         283.08      75.23
153.86      385.62        353.86      185.25
1448.97     12371.32      5322.79     9971.42
18595.91    1280.76       8795.64     4802.43
21.33       257.64        61.33       57.64
Amt. Paid: $34823.72      Amt. Rec’d: $29908.67
Benford’s Law (Benford 1938, Newcomb 1881)
The logarithms of the values (not the values themselves) are uniformly randomly distributed.

Hence the leading digit 1 has a ~30% likelihood.
Larger digits are increasingly less likely.
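
The ~30% figure follows from the uniform-logarithm assumption: a value has leading digit d when the fractional part of its base-10 logarithm falls between log10(d) and log10(d + 1), an interval of width log10(1 + 1/d). A minimal sketch that tabulates these probabilities (the names `benford_probability` and `BENFORD` are illustrative):

```python
import math


def benford_probability(digit):
    """Probability that the leading digit equals `digit` (1-9) under Benford's Law."""
    if digit not in range(1, 10):
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


BENFORD = {d: benford_probability(d) for d in range(1, 10)}

if __name__ == "__main__":
    for digit, probability in BENFORD.items():
        print(f"{digit}: {probability:.3f}")
```

The nine probabilities sum to 1 and decrease steadily from digit 1 to digit 9.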
Holds for many (but certainly not all) real-life data sets: Addresses, Bank accounts, Building heights, …
Data must span multiple orders of magnitude.
Evidence that records do not follow Benford’s Law is admissible in a court of law!
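
Applied to the two firms' records from the detective story, a leading-digit tally might look like the sketch below. The assignment of columns to firms is an assumption, made because the first two columns sum to the stated Amt. Paid and the last two to the stated Amt. Rec'd; `leading_digit` and `leading_digit_counts` are hypothetical helper names. With only ten amounts per firm this is an illustration, not a statistical test.

```python
from collections import Counter

# Amounts from the example records; column-to-firm assignment is an assumption.
FIRM_A = [283.08, 153.86, 1448.97, 18595.91, 21.33,
          25.23, 385.62, 12371.32, 1280.76, 257.64]
FIRM_B = [283.08, 353.86, 5322.79, 8795.64, 61.33,
          75.23, 185.25, 9971.42, 4802.43, 57.64]


def leading_digit(amount):
    """Return the first nonzero digit in the decimal form of a nonzero amount."""
    for character in repr(abs(amount)):
        if character in "123456789":
            return int(character)
    raise ValueError("amount has no nonzero digit")


def leading_digit_counts(amounts):
    """Tally how often each leading digit 1-9 occurs."""
    counts = Counter(leading_digit(amount) for amount in amounts)
    return {digit: counts.get(digit, 0) for digit in range(1, 10)}


if __name__ == "__main__":
    print("Firm A:", leading_digit_counts(FIRM_A))
    print("Firm B:", leading_digit_counts(FIRM_B))
```

Firm A's amounts start with a 1 in five of ten cases and never with a digit above 3, while Firm B's leading digits are spread almost evenly across 1 through 9, which is the pattern Benford's Law flags as suspicious.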
