12 Pandas
Preliminaries
Always start by importing these Python modules
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
[Figure: a DataFrame drawn as a set of columns, each labelled "Series of data"]
* Typically, the column index (df.columns) is a list of strings (variable names) or (less
commonly) integers
* Typically, the row index (df.index) is one of: integers, for case or row numbers; strings,
for case names; or a DatetimeIndex or PeriodIndex, for time series
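As a rough sketch of the three kinds of row index listed above (the column name `a`, the case names and the dates are made up for illustration):

```python
import pandas as pd

data = {'a': [1, 2, 3]}

# Integer row index (the default RangeIndex): row numbers 0, 1, 2
df_int = pd.DataFrame(data)

# String row index: case names
df_str = pd.DataFrame(data, index=['alice', 'bob', 'carol'])

# DatetimeIndex: one row per day, for time series
df_ts = pd.DataFrame(
    data, index=pd.date_range('2020-01-01', periods=3, freq='D'))
```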
Series object: an ordered, one-dimensional array of data with an index. All the data in a
Series is of the same data type. Series arithmetic is vectorised after first aligning the Series
index for each of the operands.
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
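The example above uses two Series with identical indexes, so the alignment step is invisible. A minimal sketch of what happens when the indexes only partly overlap (the labels here are hypothetical):

```python
import pandas as pd

# Two Series whose indexes share only the labels 'y' and 'z'
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Addition matches values by index label, not by position;
# labels present in only one operand produce NaN
total = a + b

matched = total.dropna()   # only 'y' and 'z' survive
```

Here `total['y']` is 2 + 10 and `total['z']` is 3 + 20, while `'x'` and `'w'` are NaN because each appears in only one of the operands.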
df = DataFrame()
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
+'USER:PASSWORD@HOST/DATABASE')
df = pd.read_sql_table('table', engine)
   col0  col1
0   1.0   100
1   2.0   200
2   3.0   300
3   4.0   400
Saving a DataFrame
df.to_csv('name.csv', encoding='utf-8')
import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://' +
'USER:PASSWORD@HOST/DATABASE')
df.to_sql('TABLE',e, if_exists='replace')
d = df.to_dict() # to dictionary
s = df.to_string() # to string (avoid shadowing the built-in str)
m = df.to_numpy() # to numpy array (as_matrix() was removed in pandas 1.0)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col0 4 non-null float64
1 col1 4 non-null int64
dtypes: float64(1), int64(1)
memory usage: 192.0 bytes
DataFrame non-indexing attributes
df = pd.DataFrame({'a':[7,9,8,10],'b':[4,3,2,1]})
df
    a  b
0   7  4
1   9  3
2   8  2
3  10  1
df = pd.DataFrame({'ac_mahi':[7,9,8,10],'ab':[4,3,2,1]})
df
   ac_mahi  ab
0        7   4
1        9   3
2        8   2
3       10   1
df = df.rename(columns={'old':'new','a':'1'})
df.columns = ['new1', 'new2', 'new3'] # etc.
Selecting columns
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b','c']] = df2[['e','f']]
df3 = pd.concat([df1, df2]) # DataFrame.append() was removed in pandas 2.0
df = df.drop('col1', axis=1)
df.drop('col1', axis=1, inplace=True)
df = df.drop(['col1','col2'], axis=1)
s = df.pop('col') # drops from frame
del df['col'] # even classic python works
df = df.drop(df.columns[0], axis=1) # first
df = df.drop(df.columns[-1:], axis=1) # last
Vectorised arithmetic on columns
df['proportion']=df['count']/df['total']
df['percent'] = df['proportion'] * 100.0
df['log_data'] = np.log(df['col1'])
label = df['col1'].idxmin()
label = df['col1'].idxmax()
s = df['col'].isnull()
s = df['col'].notnull() # not isnull()
s = df['col'].astype(float)
s = df['col'].abs()
s = df['col'].round(decimals=0)
s = df['col'].diff(periods=1)
s = df['col'].shift(periods=1)
s = pd.to_datetime(df['col']) # to_datetime is a pandas function, not a Series method
s = df['col'].fillna(0) # replace NaN w 0
s = df['col'].cumsum()
s = df['col'].cumprod()
s = df['col'].pct_change(periods=4)
s = df['col'].rolling(window=4,
min_periods=4, center=False).sum()
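To make a few of the Series methods above concrete, here is a small sketch on a made-up four-value Series (the values and the shorter window of 2 are chosen only so the results are easy to check by hand):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

d = s.diff(periods=1)          # NaN, 1, 2, 4  (value minus previous value)
sh = s.shift(periods=1)        # NaN, 1, 2, 4  (values moved down one row)
pc = s.pct_change(periods=1)   # NaN, 1, 1, 1  (each value doubles: +100%)
cs = s.cumsum()                # 1, 3, 7, 15

# Rolling sum over a window of 2 rows; the first row has too few
# observations (min_periods=2), so it is NaN
rs = s.rolling(window=2, min_periods=2, center=False).sum()  # NaN, 3, 6, 12
```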
df['Total'] = df.sum(axis=1)
i = df.columns.get_loc('col_name')
df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row
Trap: in df[...], a single integer without a colon is treated as a column label (for
integer-numbered columns), not as a row position.
Select a slice of rows by label/index
[inclusive-from : inclusive-to [ : step]]
Trap: this does not work for integer-labelled rows; see the previous code snippet on
integer position slicing.
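A minimal sketch of label slicing and of the integer-label trap, using hypothetical frames with a single column `v`:

```python
import pandas as pd

# String row labels: a label slice includes both end points
df = pd.DataFrame({'v': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])
by_label = df['b':'d']             # rows b, c and d
every_other = df.loc['a':'d':2]    # rows a and c (step of 2)

# Integer row labels: the same bracket syntax is read as positions
df_int = pd.DataFrame({'v': [10, 20, 30, 40]}, index=[10, 20, 30, 40])
by_position = df_int[1:3]          # positions 1 and 2, i.e. labels 20 and 30
```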
df = df.sort_values(by=df.columns[0],
ascending=False)
df.sort_values(by=['col1', 'col2'], inplace=True)
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :] # get copy
i = df.index.get_loc('row_label')
Note: the indexing attributes (.loc, .iloc, .at, .iat) can be used to get and set values in the
DataFrame. The older .ix attribute was removed in pandas 1.0.
Note: the .loc and .iloc indexing attributes can accept Python slice objects, but .at and
.iat do not.
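A short sketch of the two notes above, assuming a small frame with made-up row labels `w` to `z`: .loc and .iloc take slice objects, while .at and .iat address one cell with scalar keys.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10], 'b': [4, 3, 2, 1]},
                  index=['w', 'x', 'y', 'z'])

# Slice objects work with .loc (by label) and .iloc (by position)
rows = slice('x', 'y')
sub = df.loc[rows, 'a']              # rows x and y of column a
sub_pos = df.iloc[slice(0, 2), 0]    # positions 0 and 1 of the first column

# .at and .iat get or set a single cell using scalar keys
df.at['w', 'b'] = 40                 # set by row and column label
val = df.iat[3, 0]                   # get by row and column position
```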
Note: .loc can also accept Boolean Series arguments
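For instance, a Boolean Series built from a comparison can select rows, or select rows for assignment (the threshold of 8 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10], 'b': [4, 3, 2, 1]})

mask = df['a'] > 8        # Boolean Series: False, True, False, True
big = df.loc[mask]        # only the rows where a > 8

# The same mask can pick rows for assignment in one column
df.loc[mask, 'b'] = 0
```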
Avoid: chaining in the form df[col_indexer][row_indexer]
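The reason to avoid the chained form is that the first indexer may hand back a temporary copy, so an assignment through the second indexer can be lost. A sketch of the single-step alternative:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10], 'b': [4, 3, 2, 1]})

# Chained form (avoid): df['b'][0] = 99
# It may write to a temporary object rather than to df itself.

# Single-step form: row and column indexers go in one .loc call
df.loc[0, 'b'] = 99
```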
Trap: label slices are inclusive, integer slices exclusive.
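The difference is easiest to see on a frame with the default integer index, where the same slice bounds give different results under .loc and .iloc:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 8, 10]})   # row labels 0, 1, 2, 3

by_label = df.loc[1:2]       # labels 1 and 2: the end point is included
by_position = df.iloc[1:2]   # position 1 only: the end point is excluded
```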