Data Analysis With Python
Analysis Platform
1
About me
Name: Deepankar Sharma
Email: dsharma@enthought.com
Email: deepankar.sharma@gmail.com
2
Question about audience
People who consider themselves programmers
People who write code on a daily basis
People who consider Python their primary language
People who write data driven applications
3
My goals for this talk
Increase development of data driven applications using Python
Increase the number of Python based stories on HN front page
Introduce users to Python libraries for analyzing / visualizing data
4
Trajectory of successful project
5
Trajectory of unsuccessful project
6
Let's Pick A Problem
7
8
Analyzing Weather Data
9
Source of data
10
http://noaa.gov
11
Data related fields
Temperature
Dew point
Sea level pressure
Station pressure
Visibility
Windspeed
Max windspeed
Max wind gust
Max temp
Min temp
Precipitation
Snow depth
Storing Your Data
13
Transient Storage
14
15
Holding data in NumPy arrays
NumPy -> N-dimensional homogeneous array implemented in C: fast & memory efficient
>>> import numpy as np
>>> a = np.random.randn(1000, 100) ; b = a[::2, :]
NumPy arrays are full featured: 60 methods out of the box (max, mean, conjugate, ...). SciPy packages add many more, as do the Scikits projects (Statsmodels, TimeSeries, ...).
Structured arrays offer a labeling of fields:
>>> dt = np.dtype([('Station name', 'S10'), ('Elevation', float), ('Lat', int)])
>>> arr = np.genfromtxt('station_db.txt', dtype=dt, ...)
>>> print(arr['Station name'])
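A minimal sketch of the structured-array idea, reading a small in-memory table instead of station_db.txt. The field names, station names and values are made up for illustration; the layout of the real file is not shown in the talk.

```python
import io
import numpy as np

# Hypothetical station table standing in for station_db.txt.
raw = io.StringIO(
    "AUSTIN,149.0,30\n"
    "BOSTON,6.0,42\n"
    "DENVER,1640.0,39\n"
)

# One label and one type per column.
dt = np.dtype([("name", "U10"), ("elevation", float), ("lat", int)])
stations = np.genfromtxt(raw, dtype=dt, delimiter=",", encoding="utf-8")

# Fields are addressed by label; boolean masks work as on any array.
high = stations[stations["elevation"] > 100.0]
print(stations["name"])
print(high["name"])
```

Each field comes back as a regular NumPy array, so the usual methods (max, mean, ...) apply per column.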
16
Big data: memmapped arrays
Memory mapping makes it possible to manipulate arrays of data that require more than the available RAM:
>>> import numpy as np
>>> from numpy import memmap
>>> image = memmap('some_file.dat',
...                dtype='uint16',
...                mode='r+',
...                shape=(5, 5),
...                offset=header_size)  # header_size: header length in bytes
>>> mean_value = image.mean()
>>> scaled_img = image * .5
>>> np.multiply(image, .5, scaled_img)
Very efficient thanks to (1) OS caching and (2) the implementation of NumPy arrays (typically 2-3x slower than in memory).
[Figure: some_file.dat holds a <header> followed by <data>; image is a 2D NumPy array (shape: 5, 5; dtype: uint16) mapped onto the <data> part.]
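A self-contained sketch of the same pattern, assuming a 16-byte header (an arbitrary value chosen for illustration) and a temporary file in place of some_file.dat.

```python
import os
import tempfile
import numpy as np

header_size = 16   # assumed header length in bytes (illustrative)
shape = (5, 5)

path = os.path.join(tempfile.mkdtemp(), "some_file.dat")

# Write a fake header followed by 25 uint16 values.
data = np.arange(25, dtype=np.uint16).reshape(shape)
with open(path, "wb") as f:
    f.write(b"\x00" * header_size)
    f.write(data.tobytes())

# Map only the data part; pages are loaded when they are accessed.
image = np.memmap(path, dtype=np.uint16, mode="r+", shape=shape,
                  offset=header_size)
mean_value = image.mean()

# In-place edits are written back to the file once flushed.
image[0, 0] = 100
image.flush()
del image

reread = np.memmap(path, dtype=np.uint16, mode="r", shape=shape,
                   offset=header_size)
print(mean_value, reread[0, 0])
```

The second mapping sees the edited value, which shows that the array and the file are the same storage.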
17
Limitations of memmap
NumPy's memmap module relies on Python's mmap, which carries OS-dependent limitations:
>>> from numpy import memmap
>>> a = memmap('some_file.dat', dtype='uint8',
...            mode='w+', shape=(N,))
Responses (python2.7, MacOS with 8GB RAM, 11GB free HD):
               Mac OS                   Win 7 & MacOS            Linux Ubuntu 11.04
               (32bit python)           (64bit, 3GB RAM)         (64bit, 3GB RAM)
N = 10**9      OK (du = 0.9G)           OK (du = 0.9G)           OK (du = 4K)
N = 3x10**9    Overflow error           OK (du = 3G)             OK (du = 4K)
N = 10**13     No space left on device  No space left on device  OK (du = 4K)
Holding data in Pandas I
18
Pandas (version 0.7.1 at the time of this talk) offers thin wrappers around 1D, 2D and 3D NumPy arrays.
Author: Wes McKinney, Lambda Foundry, http://pandas.sourceforge.net/
- Axis labeling, for example using datetime steps, and nice representation in IPython
- Data alignment, data merge (incl. priorities for the various datasets), management of missing data
- Many statistical tools (describe, moving average, covariance, correlation, ...)
- Easy visualization (line, bar chart, boxplot, ...) with Matplotlib
>>> import numpy as np
>>> from pandas import *
>>> a = [12.3, 15.3, 14.6, np.nan, 17.1, 13.6]
>>> ts = Series(a, index=DateRange('1/1/2000', periods=6,
...             offset=datetools.day), name='Temperature')  # 1D
>>> df = DataFrame(ts)  # 2D
>>> df['var'] = ts2  # Add another column
Access components: df.values (np.ndarray), df.index (pandas.Index)
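The slide uses the pandas 0.7.1 API. A rough equivalent for a recent pandas, where DateRange no longer exists and date_range fills that role, might look like the sketch below. The second column is random data standing in for ts2, which the slide does not define.

```python
import numpy as np
import pandas as pd

a = [12.3, 15.3, 14.6, np.nan, 17.1, 13.6]
index = pd.date_range("2000-01-01", periods=6, freq="D")
ts = pd.Series(a, index=index, name="Temperature")   # 1D

df = ts.to_frame()                                    # 2D
# Stand-in for ts2: random values, seeded for repeatability.
rng = np.random.default_rng(0)
df["var"] = rng.standard_normal(len(df))              # add another column

print(df.values.shape)            # shape of the underlying np.ndarray
print(type(df.index).__name__)    # index type built from the dates
```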
19
Holding data in Pandas II
Pretty representation:
>>> print(ts)
2000-01-01    12.3
2000-01-02    15.3
2000-01-03    14.6
2000-01-04     NaN
2000-01-05    17.1
Name: Temperature
>>> print(df)
            Temperature       var
2000-01-01         12.3    -1.452
2000-01-02         15.3     1.851
2000-01-03         14.6  -0.09037
2000-01-04          NaN   -0.3942
2000-01-05         17.1     1.446
Data alignment, data reduction, missing value management:
ts.align(ts2) ; ts.reindex(ts2.index) ; ts.groupby().apply()
ts.fillna(0.0) ; ts.dropna() ; ts.to_sparse()
Loading data from/to files:
>>> read_csv, read_table, ts.tofile, ts.to_csv
>>> HDFStore(), ExcelFile()
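A small sketch of alignment and missing-value handling, assuming a recent pandas. Here ts2 is a hypothetical second series that covers only part of the dates in ts.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=5, freq="D")
ts = pd.Series([12.3, 15.3, 14.6, np.nan, 17.1], index=idx, name="Temperature")
# Hypothetical second series covering only the last three dates.
ts2 = pd.Series([1.0, 2.0, 3.0], index=idx[2:], name="var")

# Arithmetic aligns on the index; dates missing on either side give NaN.
total = ts + ts2

filled = ts.fillna(0.0)          # replace missing values
dropped = ts.dropna()            # or drop them
on_ts2 = ts.reindex(ts2.index)   # keep only the dates ts2 has

print(total)
print(len(filled), len(dropped), len(on_ts2))
```

No manual bookkeeping of dates is needed: the index decides which values are combined.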
Persistent Storage
20
Some Options
Some universal file format (built into the data-structure):
- txt, csv
- binary (watch out!)
Some standard labeled file formats:
- json: json
- HDF: pytables, h5py, pyhdf
- netCDF: netCDF4, (also scipy.io.netcdf, Scientific.io.netcdf)
Some database options
- SQL: sqlalchemy, sqlite3, mysql-python, psycopg, ...
- NoSQL: couchdb, mongodb, cassandra, ...
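To make three of these options concrete, here is a sketch that writes the same records as csv, json and SQL using only the standard library. The readings and the table layout are made up for illustration.

```python
import csv
import io
import json
import sqlite3

# Made-up readings, for illustration only.
readings = [
    {"station": "AUSTIN", "date": "2000-01-01", "temp": 12.3},
    {"station": "AUSTIN", "date": "2000-01-02", "temp": 15.3},
]

# csv: universal, but every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["station", "date", "temp"])
writer.writeheader()
writer.writerows(readings)
csv_text = buf.getvalue()

# json: labeled, and numbers stay numbers.
json_text = json.dumps(readings)
round_trip = json.loads(json_text)

# SQL: queryable; an in-memory SQLite database needs no server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, date TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (:station, :date, :temp)", readings
)
max_temp = conn.execute("SELECT MAX(temp) FROM readings").fetchone()[0]
conn.close()
print(max_temp)
```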
21
22
Storing data to HDF5
HDF5 files are among the best ways to store large datasets during/after processing.
FEATURES
HDF5 file format is self-describing: good for complex data objects
HDF5 files are portable: cross-platform, cross-language (C, C++, Fortran, Java)
HDF5 is optimized: direct access to parts of the file without parsing the entire
contents.
See http://www.hdfgroup.org/HDF5
PYTHON LIBRARIES
h5py - "thin wrapper" around the C HDF5 library.
PyTables - Provides some higher level abstractions and efficient tools for
retrieval, compression and out-of-core functionalities.
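A minimal h5py sketch of the features listed above: a self-describing dataset with metadata, and a partial read. The file name, group name, attribute and values are hypothetical.

```python
import os
import tempfile
import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "weather.h5")
temps = np.linspace(10.0, 20.0, 1000)   # made-up temperature series

with h5py.File(path, "w") as f:
    dset = f.create_dataset("austin/temperature", data=temps,
                            compression="gzip")
    dset.attrs["units"] = "degC"         # metadata travels with the data

with h5py.File(path, "r") as f:
    dset = f["austin/temperature"]
    first_ten = dset[:10]                # reads only this slice from disk
    units = dset.attrs["units"]

print(first_ten.shape, units)
```

Slicing the dataset object rather than loading it whole is what gives direct access to part of the file.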
Benchmarking PyTables
23
Source: http://www.pytables.org/moin/PyTables
FAST!
EFFICIENT!
Out-of-core calculations with PyTables
24
Source: http://www.pytables.org/moin/ComputingKernel
FAST!
EFFICIENT
Visualizing Data
25
Wonder if there is a way
to see those stations
on a map.
26
27
28
Compare Weather From
Multiple Cities
29
30
Source code at http://www.github.com/jonathanrocher/climate_model/
Plot weather data
Comparing ...
Even More Data
31
Scatter plot matrix
32
Filename: scatter_matrix.py
Can I learn something
from this data?
33
Learning from data
Classify data into categories
Optimize a function with respect to its input parameters
Create predictive model from data
34
Support vector machines
Brief Interlude Into
Classifiers
35
Examples
Predict if a mail is spam or not
Sort incoming mail into folders
Predict if a transaction is fraudulent
Predict if a patient has a disease
36
Feature vectors
37
Mail #   Word1   Word2   Spam?
1        0       1       Y
2        0       1       Y
3        1       0       N
4        1       1       Y
5        1       0       N
6        1       1       N
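The table above maps directly onto a feature matrix and a label vector. A small sketch, encoding Y as 1 and N as 0 (an encoding chosen here for illustration):

```python
import numpy as np

# Rows are mails 1..6; columns are Word1 and Word2 (1 = word present).
X = np.array([[0, 1],
              [0, 1],
              [1, 0],
              [1, 1],
              [1, 0],
              [1, 1]])
# Labels: 1 = spam (Y), 0 = not spam (N).
y = np.array([1, 1, 0, 1, 0, 0])

# Fraction of spam among the mails that contain each word.
spam_rate = (X * y[:, None]).sum(axis=0) / X.sum(axis=0)
print(spam_rate)
```

Note that mails 4 and 6 have identical features but different labels, so no classifier can get all six right from these two words alone.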
Classifying data
38
Source: Berwick2003
Support vectors
39
Source: Berwick2003
Support Vector Regression
40
scikit-learn
41
http://scikit-learn.org
Slide showing predictor
from sklearn.svm import SVR
clf = SVR(epsilon=0.2)
clf.fit(X, y)
pred = clf.predict(test)
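The snippet above leaves X, y and test undefined. A runnable sketch with synthetic data (a noisy sine curve, made up for illustration and unrelated to the weather data) could look like this:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: y = sin(x) plus a little noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# Same calls as on the slide.
clf = SVR(epsilon=0.2)
clf.fit(X, y)

# Predict at a few new inputs; X and test must have the same number of columns.
test = np.array([[1.0], [2.5], [4.0]])
pred = clf.predict(test)
print(pred)
```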
42
Learn from weather data
43
Filename: ml_app.py
Applications for this analysis
Impact of sales campaign
Effect of hiring star athlete
Effect of upgrading computer
infrastructure
Predict stock prices ?
44
Source code repository
https://github.com/jonathanrocher/climate_model/tree/pygotham
45
Credits for talk
Jonathan Rocher: this talk builds upon his talk from PyCon
Naveen Michaud Agrawal: wrote code for mapping weather stations
Chris Colbert: helped debug several issues and gave Enaml advice
Sean Ross: feedback on this talk
46
Network IO
Urllib2
Requests
Paramiko
47
Python data structures
48
Numpy
Pandas
Blist
Bitarray
Python Visualization / Plotting
Chaco
Matplotlib
Networkx
49