12 Pandas
Always start by importing these Python modules
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
Series of data
* Typically, the column index (df.columns) is a list of strings (variable names) or (less
commonly) integers.
* Typically, the row index (df.index) holds integers (for case or row numbers), strings
(for case names), or a DatetimeIndex or PeriodIndex (for time series).
Series object: an ordered, one-dimensional array of data with an index. All the data in a
Series is of the same data type. Series arithmetic is vectorised after first aligning the Series
index for each of the operands.
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
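The example above uses two Series that share the same default index, so the alignment step is invisible. A minimal sketch of what happens when the indexes only partly overlap (the labels 'w' to 'z' are illustrative, not from the original):

```python
import pandas as pd

# Two Series whose index labels only partly overlap
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Addition aligns on index labels first; labels found in only one
# operand produce NaN in the result
total = a + b

# Series.add with fill_value treats the missing operand as 0 instead
total_filled = a.add(b, fill_value=0)
```

Only 'y' and 'z' exist in both operands, so only those two labels get a numeric value in total.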
df = DataFrame()
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
df = pd.read_sql_table('table', engine)
col0 col1
0 1.0 100
1 2.0 200
2 3.0 300
3 4.0 400
Saving a DataFrame
df.to_csv('name.csv', encoding='utf-8')
import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://' +
df.to_sql('TABLE',e, if_exists='replace')
d = df.to_dict() # to dictionary
text = df.to_string() # to string (avoid shadowing the built-in str)
m = df.to_numpy() # to NumPy array (as_matrix() was removed in pandas 1.0)
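The saving and conversion calls above can be tried without a database or a file on disk. A small sketch, assuming an in-memory text buffer stands in for 'name.csv' and reusing the col0/col1 frame shown earlier:

```python
import io

import pandas as pd

df = pd.DataFrame({'col0': [1.0, 2.0, 3.0, 4.0],
                   'col1': [100, 200, 300, 400]})

# Write CSV text to an in-memory buffer, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False skips the row labels
buffer.seek(0)
restored = pd.read_csv(buffer)

# Conversions to plain Python and NumPy objects
as_dict = df.to_dict()    # {column -> {row label -> value}}
as_array = df.to_numpy()  # 2-D NumPy array
```

Passing index=False is a choice made for this sketch; without it the row index is written as an extra, unnamed first column.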
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col0 4 non-null float64
1 col1 4 non-null int64
dtypes: float64(1), int64(1)
memory usage: 192.0 bytes
DataFrame non-indexing attributes
df = pd.DataFrame({'a':[7,9,8,10],'b':[4,3,2,1]})
a b
0 7 4
1 9 3
2 8 2
3 10 1
df = pd.DataFrame({'ac_mahi':[7,9,8,10],'ab':[4,3,2,1]})
ac_mahi ab
0 7 4
1 9 3
2 8 2
3 10 1
df = df.rename(columns={'old':'new','a':'1'})
df.columns = ['new1', 'new2', 'new3'] # etc.
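A short sketch of both renaming styles on the small a/b frame used above (the new names are placeholders chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10], 'b': [4, 3, 2, 1]})

# rename() returns a new frame; keys that match no column are ignored
renamed = df.rename(columns={'a': 'alpha', 'missing': 'ignored'})

# Assigning to df.columns replaces every name at once, in place;
# the list length must equal the number of columns
df.columns = ['first', 'second']
```

The dictionary form is usually safer because it does not depend on column order.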
Selecting columns
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b','c']] = df2[['e','f']]
df3 = pd.concat([df1, df2]) # DataFrame.append() was removed in pandas 2.0
df = df.drop('col1', axis=1)
df.drop('col1', axis=1, inplace=True)
df = df.drop(['col1','col2'], axis=1)
s = df.pop('col') # drops from frame
del df['col'] # even classic python works
df = df.drop(df.columns[0], axis=1) # first
df = df.drop(df.columns[-1:], axis=1) # last
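The adding and dropping snippets above can be combined into one runnable sketch. The column names 'row_number' and 'empty' are made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10], 'b': [4, 3, 2, 1]})

# Add columns by assignment
df['row_number'] = range(len(df))
df['empty'] = np.nan               # scalar is broadcast to every row

# pop() removes the column in place and returns it as a Series
popped = df.pop('empty')

# drop() returns a new frame and leaves df untouched
trimmed = df.drop(columns=['row_number'])

# Stack two frames row-wise with concat
stacked = pd.concat([df, df], ignore_index=True)
```

drop(columns=[...]) is equivalent to drop([...], axis=1) and may read more clearly.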
Vectorised arithmetic on columns
df['percent'] = df['proportion'] * 100.0
df['log_data'] = np.log(df['col1'])
label = df['col1'].idxmin()
label = df['col1'].idxmax()
s = df['col'].isnull()
s = df['col'].notnull() # not isnull()
s = df['col'].astype(float)
s = df['col'].abs()
s = df['col'].round(decimals=0)
s = df['col'].diff(periods=1)
s = df['col'].shift(periods=1)
s = pd.to_datetime(df['col']) # Series has no to_datetime() method
s = df['col'].fillna(0) # replace NaN with 0
s = df['col'].cumsum()
s = df['col'].cumprod()
s = df['col'].pct_change(periods=4)
s = df['col'].rolling(window=4,
min_periods=4, center=False).sum()
df['Total'] = df.sum(axis=1)
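To make the column methods above concrete, here is a sketch on a tiny made-up column. The values were chosen so the results are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({'col': [1.0, 2.0, 4.0, 8.0]})

diff = df['col'].diff(periods=1)          # change from the previous row
shifted = df['col'].shift(periods=1)      # values moved down one row
running = df['col'].cumsum()              # cumulative sum
growth = df['col'].pct_change(periods=1)  # fractional change per row
rolled = df['col'].rolling(window=2, min_periods=2).sum()

# to_datetime is a top-level pandas function, not a Series method
dates = pd.to_datetime(pd.Series(['2020-01-01', '2020-01-02']))
```

The first element of diff, shifted, growth and rolled is NaN because no earlier row exists to compare against.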
i = df.columns.get_loc('col_name')
df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row
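Dropping rows works on index labels, not positions. A minimal sketch with a string-labelled index (the labels 'r1' to 'r4' are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10]},
                  index=['r1', 'r2', 'r3', 'r4'])

# drop() defaults to axis=0, so labels refer to rows
without_one = df.drop('r1')
without_two = df.drop(['r2', 'r3'])
```

Both calls return new frames; df itself still has all four rows.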
Trap: a single integer without a colon is a column label for integer numbered columns.
Select a slice of rows by label/index
[inclusive-from : inclusive-to [ : step]]
Trap: cannot work for integer labelled rows – see previous code snippet on
integer position slicing.
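The difference between label slicing and integer-position slicing is easiest to see side by side. A sketch, assuming a frame with string row labels (the labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10]},
                  index=['r1', 'r2', 'r3', 'r4'])

# Label slice: both endpoints are included
by_label = df.loc['r2':'r3']

# Integer-position slice: the end position is excluded
by_position = df.iloc[1:3]

# A step is also accepted on label slices
every_other = df.loc['r1':'r4':2]
```

Here by_label and by_position select the same two rows even though the slice bounds look different.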
df = df.sort_values(df.columns[0],
df.sort_values(['col1', 'col2'], inplace=True)
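In current pandas, sorting rows by column values is done with sort_values. A small self-contained sketch; the data and the ascending=False argument are choices made for this example only:

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2, 1], 'col2': [4, 3, 2, 1]})

# Sort by the first column, largest values first
by_first = df.sort_values(by=df.columns[0], ascending=False)

# Sort by col1, breaking ties with col2
by_both = df.sort_values(['col1', 'col2'])
```

Both calls return new frames and keep the original row labels, so the index of the result is no longer in order.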
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :] # get copy
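As an alternative to building the selection by hand, pandas also provides DataFrame.sample. A sketch on a made-up 100-row frame; random_state is fixed here only to make the draw repeatable:

```python
import pandas as pd

df = pd.DataFrame({'a': range(100)})

k = 20  # number of rows to draw
# Sampling is without replacement by default
df_sample = df.sample(n=k, random_state=0)
```

The sampled rows keep their original index labels.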
i = df.index.get_loc('row_label')
Note: the indexing attributes (.loc, .iloc, .at, .iat) can be used to get and set values in the
DataFrame. The older .ix attribute was removed in pandas 1.0; use .loc or .iloc instead.
Note: the .loc and .iloc indexing attributes can accept Python slice objects, but .at and
.iat do not.
Note: .loc can also accept Boolean Series arguments
Avoid: chaining in the form df[col_indexer][row_indexer]
Trap: label slices are inclusive, integer slices exclusive.
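The notes on Boolean arguments and chaining can be illustrated with a short sketch on the a/b frame used earlier. The threshold 8 and the replacement values are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10], 'b': [4, 3, 2, 1]})

# A Boolean Series works as a row selector in .loc
mask = df['a'] > 8
selected = df.loc[mask, 'b']

# Set values through a single .loc call rather than the chained
# form df['b'][mask] = 0, which may assign to a temporary copy
df.loc[mask, 'b'] = 0

# .at gets or sets a single cell by row label and column label
df.at[0, 'a'] = 70
```

Putting the row and column selectors in one .loc call ensures the assignment reaches the original frame.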