Exploratory Data Analysis
~6200 BC Town Map of Catal Hüyük, Konya Plain, Turkey
~950 AD Position of Sun, Moon and Planets
Sunspots over time, Scheiner 1626
Longitudinal distance between Toledo and Rome, van Langren 1644
The Rate of Water Evaporation, Lambert 1765
The Golden Age of Data Visualization (1786-1900)
The Commercial and Political Atlas, William Playfair 1786
Statistical Breviary, William Playfair 1801
1826(?) Illiteracy in France, Pierre Charles Dupin
“to affect thro’ the Eyes what we fail to convey to the public through their word-proof ears”
1856 “Coxcomb” of Crimean War Deaths, Florence Nightingale
1864 British Coal Exports, Charles Minard
1884 Rail Passengers and Freight from Paris
1890 Statistical Atlas of the Eleventh U.S. Census
The Rise of Statistics
[Table: Set C and Set D, X columns]
Topics
Approaches include:
Manual manipulation in spreadsheets
Custom code (e.g., dplyr in R, Pandas in Python)
Trifacta Wrangler http://www.trifacta.com/products/wrangler/
OpenRefine http://openrefine.org/
Data Quality
Berkeley |||||||||||||||||||||||||||||||
Cornell ||||
Harvard |||||||||
Harvard University |||||||
Stanford ||||||||||||||||||||
Stanford University ||||||||||
UC Berkeley |||||||||||||||||||||
UC Davis ||||||||||
University of California at Berkeley |||||||||||||||
University of California, Berkeley ||||||||||||||||||
University of California, Davis |||
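The tally above shows one institution appearing under several spellings. A minimal sketch of the kind of custom code that could merge them, assuming a hand-written alias table (the `ALIASES` mapping and the choice of canonical labels are illustrative assumptions, not something taken from the slides):

```python
from collections import defaultdict

# Name variants exactly as they appear in the tally above.
RAW_NAMES = [
    "Berkeley", "Cornell", "Harvard", "Harvard University", "Stanford",
    "Stanford University", "UC Berkeley", "UC Davis",
    "University of California at Berkeley",
    "University of California, Berkeley",
    "University of California, Davis",
]

# Hypothetical alias table: normalized spelling -> canonical label.
# Which label counts as canonical is an assumption made for illustration.
ALIASES = {
    "berkeley": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california at berkeley": "UC Berkeley",
    "university of california berkeley": "UC Berkeley",
    "uc davis": "UC Davis",
    "university of california davis": "UC Davis",
    "harvard university": "Harvard",
    "stanford university": "Stanford",
}


def canonical_name(raw):
    """Lowercase, drop commas, collapse whitespace, then look up the alias."""
    key = " ".join(raw.lower().replace(",", " ").split())
    return ALIASES.get(key, raw.strip())


def group_variants(names):
    """Map each canonical label to the list of raw spellings it absorbs."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_name(name)].append(name)
    return dict(groups)


GROUPS = group_variants(RAW_NAMES)
for canonical, variants in sorted(GROUPS.items()):
    print(f"{canonical}: {variants}")
```

An explicit alias table keeps every merge decision visible and reviewable, which matters when a short name such as "Berkeley" could plausibly refer to something else.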
Data Quality Hurdles
Radius: 1 / log(MIC)
Bar Color: Antibiotic
Background Color: Gram Staining
How do the drugs compare?
Mike Bostock
Stanford CS448B, Winter 2009
How do the drugs compare?
Not a streptococcus!
(realized ~30 yrs later)
Really a streptococcus!
(realized ~20 yrs later)
Encodings
Data Display
Data
Model
Tableau Demo
The dataset:
Federal Election Commission Receipts
Every Congressional Candidate from 1996 to 2002
4 Election Cycles
9216 Candidacies
Dataset Schema
Year (Qi)
Candidate Code (N)
Candidate Name (N)
Incumbent / Challenger / Open-Seat (N)
Party Code (N) [1=Dem,2=Rep,3=Other]
Party Name (N)
Total Receipts (Qr)
State (N)
District (N)
This is a subset of the larger data set available from the FEC.
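One way to make the schema above machine-checkable is to record each field's data type (N, Qi, Qr) and run a few sanity checks on every row. The sketch below is an illustration only: the dictionary-based row format, the function names, and the example record are assumptions, and the year check relies on the four cycles from 1996 to 2002 falling on even years.

```python
# Field name -> data type, as listed in the schema above
# (N = nominal, Qi = quantitative interval, Qr = quantitative ratio).
SCHEMA = {
    "Year": "Qi",
    "Candidate Code": "N",
    "Candidate Name": "N",
    "Incumbent / Challenger / Open-Seat": "N",
    "Party Code": "N",
    "Party Name": "N",
    "Total Receipts": "Qr",
    "State": "N",
    "District": "N",
}

# Party code labels given in the schema.
PARTY_CODES = {1: "Dem", 2: "Rep", 3: "Other"}

# The four election cycles between 1996 and 2002.
ELECTION_YEARS = {1996, 1998, 2000, 2002}


def fields_of_type(kind):
    """Return the field names with the given data type, in schema order."""
    return [name for name, data_type in SCHEMA.items() if data_type == kind]


def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    missing = [name for name in SCHEMA if name not in row]
    if missing:
        problems.append(f"missing fields: {missing}")
    if row.get("Year") not in ELECTION_YEARS:
        problems.append(f"unexpected year: {row.get('Year')!r}")
    if row.get("Party Code") not in PARTY_CODES:
        problems.append(f"unknown party code: {row.get('Party Code')!r}")
    return problems


# Made-up record for illustration; not taken from the FEC data.
EXAMPLE_ROW = {
    "Year": 1998,
    "Candidate Code": "X0000",
    "Candidate Name": "Example Candidate",
    "Incumbent / Challenger / Open-Seat": "Challenger",
    "Party Code": 3,
    "Party Name": "Other",
    "Total Receipts": 12500.0,
    "State": "CA",
    "District": "01",
}

print(fields_of_type("N"))
print(check_row(EXAMPLE_ROW))
```

Keeping the data type next to each field name also makes it easy to see which fields support arithmetic (Qr), which support only differences (Qi), and which support only equality tests (N).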
Hypotheses?
Three operators:
concatenation (+)
cross product (x)
nest (/)
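The three operators can be sketched on ordered lists of tuples. This is a rough reading of the table algebra rather than Tableau's implementation: the function names and sample rows are hypothetical, and sorted order stands in for whatever natural ordering a field would have.

```python
from itertools import product

# Made-up rows for illustration only.
ROWS = [
    {"quarter": "Q1", "month": "Jan"},
    {"quarter": "Q1", "month": "Feb"},
    {"quarter": "Q2", "month": "Apr"},
]


def domain(rows, field):
    """Ordered set of the distinct values of one field, as 1-tuples."""
    return sorted({(row[field],) for row in rows})


def concat(a, b):
    """A + B: the entries of A followed by the entries of B."""
    return list(a) + list(b)


def cross(a, b):
    """A x B: every entry of A paired with every entry of B."""
    return [x + y for x, y in product(a, b)]


def nest(rows, field_a, field_b):
    """A / B: only the (a, b) pairs that occur together in some row."""
    return sorted({(row[field_a], row[field_b]) for row in rows})


QUARTERS = domain(ROWS, "quarter")
MONTHS = domain(ROWS, "month")

print(concat(QUARTERS, MONTHS))
print(cross(QUARTERS, MONTHS))
print(nest(ROWS, "quarter", "month"))
```

The difference between cross and nest shows up in the sample data: cross produces all six quarter and month pairs, including combinations such as Q2 with Jan that never occur, while nest keeps only the three pairs present in the rows.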
GROUP BY Category, Region, Segment
Table Algebra: Operands