Chapter 2 - Python Pandas II
2.1 Introduction
2.2 Iterating Over a DataFrame
2.3 Binary Operation in a DataFrame
2.4 Descriptive Statistics with Pandas
2.5 Some Other Essential Function and Functionality
2.6 Advanced Operations on DataFrame
2.7 Handling Missing Data
2.8 Combining DataFrames
2.9 Function groupby( )
2.1 Introduction
iterrows( ): the Series contains all column values for that row index.
iteritems( ): the Series contains all row values for that column index.
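Using the iterrows( ) and iteritems( ) Functions
iterrows( ) brings you the horizontal subsets of a dataframe, one row at a
time, in the form of a row index and a Series containing the values of all
columns in that row.
iteritems( ) brings you the vertical subsets of a dataframe, one column at a
time, in the form of a column index and a Series containing the values of all
rows in that column. In pandas 2.0 and later, iteritems( ) has been removed;
use items( ) instead, which behaves the same way.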
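The two iteration styles described above can be sketched as follows. The dataframe, its column names and its index labels are hypothetical sample data, and items( ) is used in place of iteritems( ) so that the sketch also runs on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame(
    {"maths": [80, 65], "science": [72, 90]},
    index=["Asha", "Ravi"],
)

# iterrows() yields (row index, Series of all column values in that row).
row_totals = {}
for row_label, row in df.iterrows():
    row_totals[row_label] = int(row.sum())

# items() yields (column index, Series of all row values in that column).
# Older pandas versions also offer the same behaviour as iteritems().
column_totals = {}
for col_label, col in df.items():
    column_totals[col_label] = int(col.sum())

print(row_totals)     # {'Asha': 152, 'Ravi': 155}
print(column_totals)  # {'maths': 145, 'science': 162}
```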
Note: Python integer types cannot store NaN values. To store a NaN value in a
column, the datatype of the column is changed to a suitable non-integer type.
Subtraction binary operation [ using - , sub() and rsub() ]
We can perform the subtraction binary operation on two dataframe objects using
either:
the - operator
the sub( ) function
the rsub( ) function, which means reverse subtract
<DF1>.sub(<DF2>) which means <DF1> - <DF2>
<DF1>.rsub(<DF2>) which means <DF2> - <DF1>
Note: When + is performed on string values, it concatenates them.
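A minimal sketch of the three ways of subtracting, using two small hypothetical dataframes. The second dataframe deliberately has one row fewer, so the result contains a NaN and the integer column changes to a floating-point type.

```python
import pandas as pd

# Hypothetical sample data: df2 has no row with index 2.
df1 = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 1, 2])
df2 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])

diff_op = df1 - df2        # operator form
diff_sub = df1.sub(df2)    # same as df1 - df2
diff_rsub = df1.rsub(df2)  # reverse subtract: df2 - df1

# Row 2 exists only in df1, so its result is NaN and the column becomes float.
print(diff_sub)
print(diff_sub["A"].dtype)  # float64
print(diff_rsub)
```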
Multiplication Binary operation [ using *, mul() and rmul() ]
We can perform the multiplication binary operation on two dataframe objects
using either:
the * operator
the mul( ) function
the rmul( ) function, which means reverse multiplication
Division Binary operation [ using /, div() ]
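The multiplication and division operations can be sketched in the same way as subtraction. The two dataframes below are hypothetical sample data with matching indexes and columns, so no NaN values appear in the results.

```python
import pandas as pd

# Hypothetical sample data with identical shape and labels.
df1 = pd.DataFrame({"A": [2, 4], "B": [6, 8]})
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

prod_op = df1 * df2        # operator form
prod_mul = df1.mul(df2)    # same as df1 * df2
prod_rmul = df1.rmul(df2)  # reverse multiplication: df2 * df1

quot_op = df1 / df2        # operator form
quot_div = df1.div(df2)    # same as df1 / df2

print(prod_mul)
print(quot_div)
```

Multiplication gives the same result in either order, so mul( ) and rmul( ) agree here; division always produces floating-point values.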
The min( ) and max( ) functions find the minimum or maximum value
respectively from a given set of data.
Parameters :
axis          0 or 1          By default, the minimum or maximum is calculated
                              along axis 0 (i.e. index (0), columns (1))
skipna        True or False   Exclude NA/null/NaN values when computing the
                              result
numeric_only  True or False   Include only float, int and boolean columns when
                              calculating min or max
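A short sketch of how the three parameters change the result. The dataframe and its column names are hypothetical sample data; the marks subset is introduced only so that the row-wise call works on numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "maths": [80, 65, np.nan],
    "science": [72, 90, 85],
})
marks = df[["maths", "science"]]

# Column-wise minimum (axis 0 is the default); the text column is left out.
col_min = df.min(numeric_only=True)

# Row-wise maximum; NaN values are skipped by default (skipna=True).
row_max = marks.max(axis=1)

# With skipna=False, a column containing NaN gives NaN as its result.
min_with_nan = marks.min(skipna=False)

print(col_min)
print(row_max)
print(min_with_nan)
```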
Index of Maximum and Minimum Values
● mode( ) : The function mode( ) returns the mode value (i.e. the value that
appears most often) from a set of values.
● count( ) : The function count( ) counts the non-NA entries for each
row or column. The values None, NaN etc. are considered NA in
pandas.
<dataframe>.count(axis = 0,numeric_only = False)
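A minimal sketch of mode( ) and count( ) on a small hypothetical dataframe. Both None and NaN are treated as NA and are therefore not counted.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row is entirely missing.
df = pd.DataFrame({
    "grade": ["A", "B", "A", None],
    "marks": [80, 65, 80, np.nan],
})

# mode() returns a Series because several values can tie for most frequent.
marks_mode = df["marks"].mode()

# Non-NA entries per column (axis=0) and per row (axis=1).
per_column = df.count(axis=0)
per_row = df.count(axis=1)

print(marks_mode.iloc[0])
print(per_column)
print(per_row)
```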
The quantile( ) function returns the values at the given quantiles over
requested axis (axis 0 or 1)
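As a sketch of quantile( ), the hypothetical single-column dataframe below is queried for one quantile and then for a list of quantiles. The function uses linear interpolation by default.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"marks": [1, 2, 3, 4, 5]})

# A single quantile returns a Series with one entry per column.
median = df.quantile(0.5)

# A list of quantiles returns a DataFrame indexed by the quantile values.
quartiles = df.quantile([0.25, 0.5, 0.75])

print(median["marks"])  # 3.0
print(quartiles)
```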
>>> res6
>>> res6.describe( )
2.5 Some Other Essential Function and Functionality
Inspection Function info( ) : We can use info( ) function to get basic
information about dataframe object.
<DF>.info( )
We can use head( ) and tail( ) to retrieve the top 5 or bottom 5 rows
respectively of a dataframe object if we pass no argument to them.
<DF>.head( [n = 5] )
<DF>.tail( [n = 5] )
Argument n is optional and its default value is 5.
You can retrieve any number of top rows or bottom rows with head( ) and
tail( ) respectively by specifying the value of n as the argument to head( ) or
tail( ), e.g. <DF>.head(n = 12) or <DF>.tail(n = 24)
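A short sketch of head( ) and tail( ) with and without the argument n, using a hypothetical dataframe of 20 rows.

```python
import pandas as pd

# Hypothetical sample data: 20 rows numbered 1 to 20.
df = pd.DataFrame({"n": range(1, 21)})

top_default = df.head()         # first 5 rows
bottom_default = df.tail()      # last 5 rows
top_twelve = df.head(n=12)      # first 12 rows
bottom_three = df.tail(n=3)     # last 3 rows

print(len(top_default), len(bottom_default), len(top_twelve))  # 5 5 12
print(bottom_three["n"].tolist())                              # [18, 19, 20]
```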
e.g. df1.cummax( ) and df2.cummax( )
Applying Functions on a Subset of Dataframe
<dataframe>.loc[<row index>,:]
eg. print(prodf.loc['Andhra P.',:].count( ) )
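The example above can be made runnable as follows. The contents of prodf are hypothetical (the original data is not shown here); only the name prodf and the row label 'Andhra P.' come from the example.

```python
import numpy as np
import pandas as pd

# Hypothetical production data; the real values of prodf are not given.
prodf = pd.DataFrame(
    {"Rice": [7452.4, 1930.0], "Wheat": [np.nan, 2737.0], "Pulses": [931.0, 818.0]},
    index=["Andhra P.", "Gujarat"],
)

# Select one row as a subset, then apply a function to that subset only.
andhra_count = prodf.loc["Andhra P.", :].count()

# The same idea works for a column subset.
rice_total = prodf.loc[:, "Rice"].sum()

print(andhra_count)  # 2, because the Wheat value is missing for this row
print(rice_total)
```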
<DF>.pivot(index=<columnname>, columns=<columnname>, values=<columnname>)
index   : specify here the column which is to be treated as the index (i.e. as
          rows)
columns : specify here the column whose values will become columns
values  : specify here the columns whose values are to be spread across the
          dataframe created as per the specified index and columns
eg. dfd.pivot(index = 'country' ,columns = 'Tutor' , values = 'classes')
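A runnable sketch of the pivot( ) call above. The column names come from the example; the rows of dfd are hypothetical, and each country-Tutor combination appears only once.

```python
import pandas as pd

# Hypothetical rows for dfd; only the column names come from the example.
dfd = pd.DataFrame({
    "Tutor": ["Tahira", "Gurjyot", "Gurjyot"],
    "country": ["USA", "UK", "USA"],
    "classes": [28, 36, 41],
})

# Countries become the index, tutors become the columns,
# and the number of classes fills the cells.
pivoted = dfd.pivot(index="country", columns="Tutor", values="classes")

print(pivoted)
```

A combination with no matching row, such as UK and Tahira here, is filled with NaN.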
The pivot_table( ) function differs from pivot( ) in the following ways:
● It does not raise errors for multiple entries of a row-column combination.
● It aggregates the multiple entries present for a row-column combination;
you can specify what type of aggregation you want (sum, mean etc.), and the
default is mean.
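The difference can be sketched with a hypothetical dataframe in which one country-Tutor combination occurs twice. Here pivot( ) fails with a ValueError, while pivot_table( ) aggregates the duplicates with the chosen function.

```python
import pandas as pd

# Hypothetical data: the combination (USA, Tahira) appears twice.
dfd = pd.DataFrame({
    "Tutor": ["Tahira", "Tahira", "Gurjyot"],
    "country": ["USA", "USA", "UK"],
    "classes": [28, 12, 36],
})

# pivot() cannot decide which of the duplicate entries to keep.
try:
    dfd.pivot(index="country", columns="Tutor", values="classes")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() combines the duplicate entries, here by summing them.
table = dfd.pivot_table(
    index="country", columns="Tutor", values="classes", aggfunc="sum"
)

print(pivot_failed)  # True
print(table)
```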
The mad( ) Function : This function is used to calculate the mean absolute deviation of the
values for the requested axis. The Mean Absolute Deviation (MAD) of a data set is the
average distance between each data value and the mean. Note that mad( ) has been removed
in pandas 2.0 and later.
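The definition of MAD can be worked out directly, which is also a way to compute it on pandas versions that no longer provide mad( ). The data below is hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the mean of these values is 5.
s = pd.Series([2, 4, 6, 8])

# Distance of each value from the mean: 3, 1, 1, 3.
deviations = (s - s.mean()).abs()

# MAD is the average of those distances.
mad_value = deviations.mean()

# The same formula applied column by column to a dataframe.
df = pd.DataFrame({"x": [2, 4, 6, 8], "y": [1, 1, 1, 1]})
mad_per_column = (df - df.mean()).abs().mean()

print(mad_value)  # 2.0
print(mad_per_column)
```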
● Values that cannot participate in computation constructively are known as
missing values.
● For instance, the ranking of colleges is based on several factors, such as
placement data for the past 10 years. For newer colleges, placement data may
not be available. Hence in such a dataset, the placement data will be missing
for newer colleges, and this will be signified by some empty value.
● Some other real-life examples where you may find missing data in a dataset:
a 2-bedroom house wouldn't include the size of a third bedroom, so it will be
a missing value in its dataset; in a survey, some respondents may choose not
to share their income, so this will also be a missing value in that dataset.
● Why handle missing data? The simple answer is that data science is all about
inferring accurate results/predictions from the given data, and if the missing
values are not handled properly, it may affect the overall result.
We can handle missing data in many ways; the most common ones are:
● Dropping missing data
● Filling missing data (imputation)
Detecting/Filtering Missing Data
We can use isnull( ) to detect missing values in a Pandas object. It returns
True or False for each value in the Pandas object, depending on whether it is
a missing value or not.
<PandasObject>.isnull( )
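A minimal sketch of isnull( ) on a hypothetical dataframe containing both NaN and None. Summing the resulting True/False values gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({"A": [1.0, np.nan], "B": ["x", None]})

# True where a value is missing, False otherwise.
mask = df.isnull()

# True counts as 1, so the sum is the number of missing values per column.
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```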
If you want to filter the data which is not a missing value, i.e. non-null data, then you
can use the following for a series object :
<series>[filter condition]
But this method will not work for a dataframe. For a dataframe object, you need to use
methods that handle missing values, like dropna( ).
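A sketch of filtering a series down to its non-null values. The filter condition used here, notnull( ), is an assumption about what the condition would typically be; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
ser = pd.Series([10, np.nan, 30])

# notnull() is True for values that are present; use it as the filter condition.
non_null = ser[ser.notnull()]

print(non_null.tolist())  # [10.0, 30.0]
```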
Handling Missing Data – Dropping Missing Values
To drop missing values you can use dropna( ) in the following three ways :
(a) <PandasObject>.dropna( ) : This will drop all the rows that have NaN
values in them, even a row with a single NaN value in it.
<DF>.dropna(axis = 1) : With the argument axis = 1, this will drop the columns
that have any NaN values in them.
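Both forms of dropna( ) described above can be sketched on a small hypothetical dataframe in which only column A has a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 1 of column A is missing.
df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, 6]})

# Default: drop every row that contains at least one NaN.
rows_dropped = df.dropna()

# axis=1: drop every column that contains at least one NaN.
columns_dropped = df.dropna(axis=1)

print(rows_dropped.index.tolist())       # [0, 2]
print(columns_dropped.columns.tolist())  # ['B']
```

Neither call changes df itself; each returns a new dataframe.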
Handling Missing Data – Filling Missing Values
Although dropna( ) removes the null values, you also lose other non-null
data along with them. To avoid this, you may want to fill the missing data with
some appropriate value of your choice. For this purpose you can use
fillna( ) in the following ways :
print(ser1)
print(pdf1)
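The simplest use of fillna( ) is to pass one value that replaces every missing value. The sketch below reuses the names ser1 and pdf1, but their contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical contents for ser1 and pdf1.
ser1 = pd.Series([1.0, np.nan, 3.0])
pdf1 = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# Replace every missing value with 0; the originals are left unchanged.
ser_filled = ser1.fillna(0)
pdf_filled = pdf1.fillna(0)

print(ser_filled.tolist())  # [1.0, 0.0, 3.0]
print(pdf_filled)
```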
(b) Using a dictionary with fillna( ) to specify fill values for each column
separately.
You can create a dictionary that defines fill values for each of the columns,
and then pass this dictionary as an argument to fillna( ). Pandas will leave
unchanged those columns that are not in the dictionary.
<DF>.fillna(<dictionary having fill values for columns>)
print(pdf5)
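A sketch of the dictionary form of fillna( ), using a hypothetical dataframe. Column C is deliberately left out of the dictionary, so its missing value stays as it is.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({
    "A": [1.0, np.nan],
    "B": [np.nan, 4.0],
    "C": [np.nan, 6.0],
})

# Fill values are given per column; column C is not in the dictionary.
fill_values = {"A": -1, "B": 0}
filled = df.fillna(fill_values)

print(filled)
```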
Filling missing values from another DataFrame
● We can also fill the missing values from a similar DataFrame, i.e. from a
DataFrame having the same index and columns. If you pass another
DataFrame's name to fillna( ), it will take the corresponding cells' values
for the missing values.
df1.fillna(df2)
● This will fill the missing values of DataFrame df1 from the corresponding
cells of df2. The condition here is that both dataframes should have a similar
structure; only their values differ.
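A minimal sketch of filling from another dataframe. Both dataframes are hypothetical and share the same index and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same structure in both dataframes.
df1 = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})
df2 = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]})

# Each missing cell of df1 takes the value from the same cell of df2;
# cells of df1 that already hold a value are kept.
result = df1.fillna(df2)

print(result)
```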
Combining DataFrames
pd.concat( [<DF1>,<DF2>] )
pd.concat( [<DF1>,<DF2>] , ignore_index = True)
pd.concat( [<DF1>,<DF2>] , axis = 1)
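The three forms of concat( ) listed above can be sketched as follows, with hypothetical dataframes. By default the rows are stacked and the original index labels are kept, ignore_index = True renumbers them, and axis = 1 places the dataframes side by side.

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})
df3 = pd.DataFrame({"B": [5, 6]})

stacked = pd.concat([df1, df2])                        # index 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)  # index 0, 1, 2, 3
side_by_side = pd.concat([df1, df3], axis=1)           # columns A and B

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side)
```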
Combining DataFrames using join()
The join( ) method of a DataFrame creates a new dataframe from two
dataframes by joining their rows on the index. In other words, joining two
dataframes means creating a third dataframe from the two dataframes.
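A sketch of join( ) with two hypothetical dataframes whose indexes only partly overlap. By default join( ) keeps all rows of the calling dataframe (a left join); the how argument selects other kinds of join.

```python
import pandas as pd

# Hypothetical sample data; only 'Ravi' appears in both indexes.
left = pd.DataFrame({"maths": [80, 65]}, index=["Asha", "Ravi"])
right = pd.DataFrame({"science": [72, 90]}, index=["Ravi", "Meena"])

# Default left join: every row of `left` is kept, unmatched cells become NaN.
joined = left.join(right)

# Inner join: only the index labels present in both dataframes are kept.
inner = left.join(right, how="inner")

print(joined)
print(inner)
```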
With the groupby( ) function, the duplicate values in the same field are
grouped together to form groups.
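The grouping idea can be sketched with a hypothetical dataframe in which one Tutor value is repeated. The rows sharing that value form one group, and an aggregate such as sum( ) is then computed per group.

```python
import pandas as pd

# Hypothetical sample data: 'Tahira' appears twice in the Tutor field.
df = pd.DataFrame({
    "Tutor": ["Tahira", "Gurjyot", "Tahira"],
    "classes": [28, 36, 12],
})

# Rows with the same Tutor value are placed in the same group.
grouped = df.groupby("Tutor")

# Total classes for each group.
totals = grouped["classes"].sum()

print(grouped.ngroups)  # 2
print(totals)
```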