Chapter 2 - Python Pandas II


Python Pandas II

2.1 Introduction
2.2 Iterating Over a DataFrame
2.3 Binary Operation in a DataFrame
2.4 Descriptive Statistics with Pandas
2.5 Some Other Essential Functions and Functionality
2.6 Advanced Operations on DataFrame
2.7 Handling Missing Data
2.8 Combining DataFrames
2.9 Function groupby( )
2.1 Introduction

In this chapter we shall talk more about dataframes: basic operations
on a dataframe, descriptive statistics, pivoting, handling missing
data, combining/merging dataframes, and more. So let's get started.
2.2 Iterating Over a DataFrame
To process all the data values of a dataframe we need to iterate over it.
There are many ways to iterate over a dataframe. The most common are
<DFobject>.iterrows( ) and <DFobject>.iteritems( ).

The iterrows( ) method iterates over a dataframe row-wise, where each
horizontal subset is in the form of (row-index, Series), and the Series
contains all column values for that row-index.

The iteritems( ) method iterates over a dataframe column-wise, where each
vertical subset is in the form of (col-index, Series), and the Series
contains all row values for that column-index.
Using pandas.iterrows( ) Function
iterrows( ) brings you the subsets from a dataframe in the form of row index
and a series containing values for all columns in that row.

Each row is taken one at a time in the form of (row, rowSeries), where
row stores the row-index and rowSeries stores all the values of that
row in the form of a Series object.
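As a minimal sketch, iterrows( ) can be used like this (the dataframe and its values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Maths': [90, 78], 'Science': [85, 92]},
                  index=['Aman', 'Bela'])

# Each iteration yields (row-index, Series of that row's column values)
totals = {}
for row, rowSeries in df.iterrows():
    totals[row] = rowSeries.sum()   # sum of all columns in this row

print(totals)   # {'Aman': 175, 'Bela': 170}
```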
Using pandas.iteritems( ) Function

iteritems( ) brings you the vertical subsets from a dataframe in the form of
column index and a series containing values for all rows in that column.

Each column is taken one at a time in the form of (col, colSeries), where
col stores the column-index and colSeries stores all the values of that
column in the form of a Series object.
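A matching sketch for column-wise iteration, with the same invented dataframe. Note that recent pandas versions renamed iteritems( ) to items( ); the behaviour is the same:

```python
import pandas as pd

df = pd.DataFrame({'Maths': [90, 78], 'Science': [85, 92]},
                  index=['Aman', 'Bela'])

# Each iteration yields (col-index, Series of that column's row values).
# items() is the current name; older pandas versions used iteritems().
col_max = {}
for col, colSeries in df.items():
    col_max[col] = colSeries.max()

print(col_max)   # {'Maths': 90, 'Science': 92}
```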
2.3 Binary Operation in a DataFrame
We can perform binary operations on DataFrames. These binary operations
can be addition, subtraction, multiplication and division.
In a binary operation, the data from the two dataframes are aligned on the
basis of their row and column indexes.
For matching (row, column) indexes, the given operation is performed,
and for non-matching (row, column) indexes, a NaN value is stored in the
result.

In order to understand binary operations on DataFrames, consider the


following DataFrames (df1,df2,df3,df4)

>>>df1 >>>df2 >>>df3 >>>df4


Addition binary operation [ using + , add() and radd() ]
We can perform the add binary operation on two dataframe objects using either:
the + operator
the add( ) function
the radd( ) function (radd means reverse add)

<DF1>.add(<DF2>) which means <DF1> + <DF2>
<DF1>.radd(<DF2>) which means <DF2> + <DF1>

Note:
1) When + is performed on string values, it concatenates them.
2) Python integer types cannot store NaN values. To store a NaN value in a
column, the datatype of the column is changed to a suitable non-integer
type.
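A minimal sketch of index alignment during addition, using two invented dataframes (column B exists only in df1 and column C only in df2, so both come out as NaN):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df2 = pd.DataFrame({'A': [10, 20], 'C': [5, 6]}, index=['x', 'y'])

s1 = df1 + df2          # same as df1.add(df2)
s2 = df1.radd(df2)      # same as df2 + df1

# Matching (row, column) labels are added; non-matching ones become NaN
print(s1)
```
Note how column A stays an integer column, while B and C are promoted to a float type to hold NaN.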
Subtraction binary operation [ using - , sub() and rsub() ]
We can perform the subtract binary operation on two dataframe objects using
either:
the - operator
the sub( ) function
the rsub( ) function (rsub means reverse subtract)

<DF1>.sub(<DF2>) which means <DF1> - <DF2>
<DF1>.rsub(<DF2>) which means <DF2> - <DF1>

Note:
1) Unlike +, the - operator is not defined for string values; subtracting
string columns raises an error.
2) Python integer types cannot store NaN values. To store a NaN value in a
column, the datatype of the column is changed to a suitable non-integer
type.
Multiplication Binary operation [ using *, mul() and rmul() ]
We can perform the multiplication binary operation on two dataframe objects
using either:
the * operator
the mul( ) function
the rmul( ) function (rmul means reverse multiplication)
Division Binary operation [ using /, div() ]

We can perform the division binary operation on two dataframe objects using
either:
the / operator
the div( ) function
2.4 Descriptive Statistics with Pandas
Python Pandas is a widely used data science library and it offers many
useful functions. Among these, Pandas also includes many useful
statistical functions.

(Reference dataframe for the examples below: pradt)
Functions min( ) and max( )

The min( ) and max( ) functions find out the minimum or maximum
value respectively from a given set of data.

<dataframe>.min(axis = None, skipna = None, numeric_only = None)
<dataframe>.max(axis = None, skipna = None, numeric_only = None)

Parameters:
axis         : 0 or 1; by default the minimum or maximum is calculated
               along axis 0 (i.e. index(0), columns(1))
skipna       : True or False; exclude NA/null/NaN values when computing
               the result
numeric_only : True or False; include only float, int, boolean columns
               for calculating min or max
Index of Maximum and Minimum Values

<DF>.idxmax( ) - This function returns the index (row label) of the first
occurrence of the maximum value in each column.

<DF>.idxmin( ) - This function returns the index (row label) of the first
occurrence of the minimum value in each column.
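A short sketch with an invented marks dataframe, showing that these functions return row labels rather than values:

```python
import pandas as pd

# An invented marks dataframe for illustration
marks = pd.DataFrame({'Maths': [56, 91, 73], 'Science': [88, 64, 79]},
                     index=['Amit', 'Bina', 'Chetan'])

print(marks.idxmax())   # row label of the maximum value in each column
print(marks.idxmin())   # row label of the minimum value in each column
```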
Function mode( ), mean( ) , median( )

mode( ) : Function mode( ) returns the mode value (i.e. the value that
appears most often) from a set of values.

<dataframe>.mode(axis = 0, numeric_only = False)

mean( ) : Function mean( ) returns the computed mean (average) from
a set of values.

<dataframe>.mean(axis = None, skipna = None, numeric_only = None)

median( ) : Function median( ) returns the middle number from a set of
numbers. It returns the median value that separates the higher half from
the lower half of a set of values.

<dataframe>.median(axis = None, skipna = None, numeric_only = None)
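The three functions above can be sketched on a small invented dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Wheat': [100, 200, 200, 300],
                   'Rice':  [ 50, 150, 250, 350]})

print(df.mean())     # column-wise averages
print(df.median())   # column-wise middle values
print(df.mode())     # most frequent value(s) per column
```
For 'Wheat' the mean, median and mode all happen to be 200; for 'Rice' every value is unique, so mode( ) returns all of them.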


Functions count( ) and sum( )

count( ) : The function count( ) counts the non-NA entries for each
row or column. The values None, NaN etc. are considered NA in pandas.

<dataframe>.count(axis = 0, numeric_only = False)

axis         : {index (0), columns (1)}, default 0
numeric_only : boolean (True or False), default False; include only
               float, int or boolean data.

sum( ) : The function sum( ) returns the sum of the values for the
requested axis.

<dataframe>.sum(axis = None, skipna = None, numeric_only = None,
min_count = 0)

axis         : {index (0), columns (1)}, default 0
skipna       : boolean, default True; exclude NA/null values
numeric_only : boolean, default None; include only float, int or boolean
min_count    : int, default 0; the required number of valid values to
               perform the operation.
Functions quantile( ), std( ) and var( )
quantile( )

The quantile( ) function returns the values at the given quantile over the
requested axis (axis 0 or 1).

<dataframe>.quantile(q = 0.5, axis = 0, numeric_only = True)

Parameters:
q            : default 0.5 (the 50% quantile); 0 <= q <= 1
axis         : {0, 1, 'index', 'columns'}, default 0
numeric_only : if False, the quantile of datetime and timedelta data will
               be computed as well.

quantile( ) returns a DataFrame of quantiles if q is an array; if q is a
float, a Series is returned.

It means, if we plot a distribution graph for Fruits, the 25% dividing
marker (Q1) will be at 1872.825 on the graph,
(Q2) at point 7491,
(Q3) at point 10920 and (Q4) at 140169.2 on the graph.
std( )
The std( ) function computes the standard deviation over the requested
axis.

<dataframe>.std(axis = None, skipna = None, numeric_only = None)

Parameters:
axis         : [index(0), columns(1)], default 0
skipna       : boolean, default True; exclude NA/null/NaN values.
numeric_only : boolean, default None; include only float, int, boolean
               columns

var( )
The var( ) function computes variance and returns the unbiased variance
over the requested axis.

<dataframe>.var(axis = None, skipna = None, numeric_only = None)

Parameters:
axis         : [index(0), columns(1)], default 0
skipna       : boolean, default True; exclude NA/null/NaN values.
numeric_only : boolean, default None; include only float, int, boolean
               columns
The describe( )Function
The describe( ) function provides the descriptive statistics details of the
dataframe.
<DF>.describe( )
The describe( ) gives following information for a dataframe object having
numeric columns
Count       : count of non-NA values in a column
Mean        : computed mean of values in a column
Std         : standard deviation of values in a column
Min         : minimum value in a column
25%,50%,75% : percentiles of values
Max         : maximum value in a column

For a dataframe with string type column(s), describe( ) will give the
following information:

Count  : count of non-NA values in a column
Unique : number of unique entries in the column
Top    : the most common entry of the column
Freq   : frequency of the most common entry displayed as Top above

>>> res6.describe( )
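A minimal sketch of both behaviours, using invented data (one numeric column and one string column):

```python
import pandas as pd

num = pd.DataFrame({'Qty': [10, 20, 30, 40]})
info = num.describe()
print(info)   # count, mean, std, min, 25%, 50%, 75%, max for Qty

txt = pd.DataFrame({'City': ['Delhi', 'Pune', 'Delhi']})
print(txt.describe())   # count, unique, top, freq for the string column
```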
2.5 Some Other Essential Function and Functionality
Inspection function info( ) : We can use the info( ) function to get basic
information about a dataframe object.

<DF>.info( )

It reports:
type                    : that the object is an instance of a DataFrame
Index values            : the assigned indexes
Number of rows          : number of rows in the dataframe object
Data columns and values : number of columns and a count of the non-NA
                          values in each of them
Datatypes               : datatype of each column
Memory usage            : approximate amount of memory used to hold the
                          DataFrame
DataFrame's Top and Bottom Rows using head( ) and tail( )

We can use head( ) and tail( ) to retrieve the top 5 or bottom 5 rows
respectively of a dataframe object if you pass no argument to these.

<DF>.head( [n = 5] )     Argument n is optional and its
<DF>.tail( [n = 5] )     default value is 5

You can retrieve any number of top or bottom rows with head( ) and tail( )
respectively by specifying a value of n as an argument,
e.g. <DF>.head(n = 12) or <DF>.tail(n = 24).
You can use any value of n as per your need.

You can also write res7.head(n = 3) as res7.head(3), and
res7.tail(n = 4) as res7.tail(4).
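A quick sketch on an invented ten-row dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Val': range(1, 11)})   # values 1..10

print(df.head())       # first 5 rows (default n = 5)
print(df.tail(3))      # last 3 rows
print(df.head(n=2))    # first 2 rows; same as df.head(2)
```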
Cumulative Calculation Functions

Cumulative calculation means that after applying a function to one row, its
output is also used in the next row's calculation.

cumsum( ) Function
cumsum( ) calculates the cumulative sum, i.e., in the output of this
function, the value of each row is replaced by the sum of all prior rows
including this row. String value rows use concatenation. It is used as:

<DF>.cumsum( [axis = None] )

By default, it calculates the sum of each column's values and displays the
cumulative sum in each successive row.

cumprod( ), cummax( ), cummin( ) Functions
cumprod( ) - gives the cumulative product of values.
cummax( ) - gives the cumulative maximum of values.
cummin( ) - gives the cumulative minimum of values.
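The running-total idea can be sketched on a single invented column:

```python
import pandas as pd

df = pd.DataFrame({'Sales': [10, 20, 30]})

print(df.cumsum())    # 10, 30, 60  (each row = sum of itself + prior rows)
print(df.cummax())    # 10, 20, 30  (running maximum)
print(df.cumprod())   # 10, 200, 6000  (running product)
```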
Applying Functions on a Subset of Dataframe

To apply a function on a column, you use the following in place of the
dataframe name:

<dataframe>[<column name>]

Applying Functions on Multiple Columns of a DataFrame

<dataframe>[ [<column name>, <column name>, ...] ]

e.g. print( prodf[ ['Wheat', 'Rice'] ].count( ) )

Applying Functions on a Row of a DataFrame

<dataframe>.loc[<row index>, :]
e.g. print( prodf.loc['Andhra P.', :].count( ) )

Applying Functions on a Range of Rows of a DataFrame


<dataframe>.loc[<start row label>:<end row label>,:]
or
<dataframe>.iloc[<start row>:<end row>,:]

Applying Functions to a subset of the DataFrame


<dataframe>.loc[<start row >:<end row>,<start column >:<end column>]
or
<dataframe>.iloc[<start row>:<end row>,<start column >:<end column>]
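The subset patterns above can be sketched together. The dataframe below is a stand-in for the chapter's prodf (the column and row names come from the text, but the values are invented):

```python
import pandas as pd

# A stand-in for the chapter's prodf dataframe (values invented)
prodf = pd.DataFrame({'Wheat': [120, 140, 160], 'Rice': [200, 210, 190]},
                     index=['Andhra P.', 'Bihar', 'Punjab'])

# On one column
print(prodf['Wheat'].max())

# On multiple columns
print(prodf[['Wheat', 'Rice']].count())

# On one row
print(prodf.loc['Andhra P.', :].count())

# On a range of rows, and on a rectangular subset
print(prodf.loc['Andhra P.':'Bihar', :].sum())
print(prodf.iloc[0:2, 0:1].mean())
```
Note that label slices with loc are inclusive of the end label, while position slices with iloc exclude the end position.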
2.6 Advanced Operations on DataFrame
Pivoting : What is pivoting and why is it used? Let's understand it with an
example in which the same data is listed in two different views.
(I) The second view is a summarised representation of the first view.
(II) The second view has converted row data to columns.

"Pivoting is a summary technique that works on tabular data. The pivoting
technique rearranges the data from rows and columns, by possibly rotating
rows and columns or by aggregating data from multiple sources, in a report
form."

It summarises extensive data.


It rotates or pivots data by transforming rows into columns.

Pivot( ) Function : The pivot( ) method creates a dataframe storing the
pivoted table, which is a simplified version of the original data and only
contains information as specified through the parameters to the pivot( )
method.

<DF>.pivot(index=<columnname>, columns=<columnname>,values=<columnname>)

pivot( index = ..., columns = ..., values = ... )

index   : specify here the column which is to be treated as the index
          (i.e. as rows)
columns : specify here the column whose values will become columns
values  : specify here the column whose values are to be spread across
          the dataframe created as per the specified index and columns

e.g. dfd.pivot(index = 'country', columns = 'Tutor', values = 'classes')

pivot( ) returns the result in the form of a newly created dataframe,
i.e. you may store the result in a dataframe.

Home work : Example 33
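The pivot( ) call above can be sketched end to end. The dataframe below is a stand-in for the chapter's dfd (column names from the text, values invented); note that every (country, Tutor) pair appears only once, as pivot( ) requires:

```python
import pandas as pd

# Stand-in for the chapter's dfd dataframe (values invented)
dfd = pd.DataFrame({'country': ['India', 'India', 'USA', 'USA'],
                    'Tutor':   ['Amy',   'Bob',   'Amy', 'Bob'],
                    'classes': [20,      15,      30,    25]})

# country values become the row index, Tutor values become the columns,
# and classes values fill the cells
pvt = dfd.pivot(index='country', columns='Tutor', values='classes')
print(pvt)
```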


pivot_table( ) Function
For data having multiple values for the same row and column combination,
we can use another pivoting function - the pivot_table( ) function.

It does not raise errors for multiple entries of a row, column combination.

It aggregates the multiple entries present for a row-column combination;
you need to specify what type of aggregation you want (sum, mean etc.).

pandas.pivot_table(<dataframe>, values = None, index = None,
columns = None, aggfunc = 'mean')
or
<dataframe>.pivot_table(values = None, index = None, columns = None,
aggfunc = 'mean')
Consider the following example, which creates a pivoted table for the same
full-year data of the online tutoring company.

PivT = df1.pivot_table(index = 'Tutor', columns='Country', values = 'Classes')

Since no aggfunc is specified, it will by default compute the mean (the
average) of the multiple values.
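A minimal sketch of this aggregation, with invented data in which one (Tutor, Country) pair repeats — exactly the case pivot( ) would reject:

```python
import pandas as pd

# Invented data; Amy/India appears twice, so pivot( ) would fail here
df1 = pd.DataFrame({'Tutor':   ['Amy',   'Amy',   'Bob'],
                    'Country': ['India', 'India', 'USA'],
                    'Classes': [10,      30,      25]})

# Default aggfunc is the mean: Amy's two India entries average to 20.0
PivT = df1.pivot_table(index='Tutor', columns='Country', values='Classes')
print(PivT)
```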
Sorting DataFrame Values
Sorting refers to arranging values in a particular order. The values can be
stored on the basis of a specified column or columns and can be in
ascending or descending order.Pandas makes available sort_values()
function for this purpose.

<DF>.sort_values(by, axis = 0, ascending = True,
inplace = False, kind = 'quicksort', na_position = 'last')

by          : name or list of names to sort by
axis        : [index = 0 or columns = 1], default 0
ascending   : bool or list of bool, default True
inplace     : bool, default False; if True, perform the operation in place
na_position : ['first', 'last'], default 'last'; 'first' puts NaNs at the
              beginning, 'last' puts NaNs at the end.
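A short sketch on an invented dataframe, sorting by one column in descending order:

```python
import pandas as pd

df = pd.DataFrame({'Name':  ['Chetan', 'Amit', 'Bina'],
                   'Marks': [73,       91,     85]})

# Highest marks first
by_marks = df.sort_values(by='Marks', ascending=False)
print(by_marks)
```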
Aggregation
Data Aggregation is a process of producing a summary statistics from a
dataset using statistical aggregation function.

The mad( ) Function : This function is used to calculate the mean absolute
deviation of the values for the requested axis. The Mean Absolute Deviation
(MAD) of a set of data is the average distance between each data value and
the mean.

<dataframe>.mad(axis = None, skipna = None)

Parameters:
axis   : {index (0), columns (1)}, default 0
skipna : boolean, default True; exclude NA/null values. If an entire
         row/column is NA, the result will be NA.
2.7 Handling Missing Data
The Pandas library is designed to deal with huge amounts of data, or big
data. In such a volume of data, there may be some NA values such as NULL,
NaN or None.

These are values that cannot participate in computation constructively.
These values are known as missing values.

For instance, the ranking of colleges may be based on placement data for
the past 10 years. For newer colleges, placement data may not be available.
Hence, in such a dataset, the placement data will be missing for newer
colleges, and this will be signified by some empty value.

Some other real-life examples where you may find missing data in a dataset:
a 2-bedroom house wouldn't include the size of a third bedroom, so it will
be a missing value in its dataset; in a survey, some respondents may choose
not to share their income, so this too will be a missing value in that
dataset.

Why handle missing data? The simple answer is that data science is all
about inferring accurate results/predictions from the given data, and if
the missing values are not handled properly, they may affect the overall
result.
We can handle missing data in many ways, most common ones are:

Dropping missing data

Filling missing data(Imputation)
Detecting/Filtering Missing Data
We can use isnull( ) to detect missing values in a Pandas object; it
returns True or False for each value in the object depending on whether it
is a missing value or not.

<PandaObject>.isnull( )

If you want to filter data which is not a missing value, i.e. non-null
data, you can use the following for a series object:

<series>[<filter condition>]

But this filtering method will not work for a dataframe. For a dataframe
object, you need to use methods that handle missing values, like dropna( ).
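Detection and filtering on a series can be sketched with one invented series:

```python
import pandas as pd
import numpy as np

ser = pd.Series([10, np.nan, 30])

print(ser.isnull())          # False, True, False
print(ser[ser.notnull()])    # keeps only the non-missing values
```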
Handling Missing Data - Dropping Missing Values
To drop missing values you can use dropna( ) in the following three ways:

(a) <PandaObject>.dropna( ) : This will drop all the rows that have NaN
values in them, even a row with a single NaN value in it.

(b) <DF>.dropna(how = 'all') : With the argument how = 'all', it will drop
only those rows that have all NaN values, i.e. no value is non-null in
those rows.

(c) <DF>.dropna(axis = 1) : With the argument axis = 1, it will drop the
columns that have any NaN values in them.
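The three forms can be compared on one invented dataframe whose last row is entirely NaN:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1,      np.nan, np.nan],
                   'B': [4,      5,      np.nan]})

print(df.dropna())             # keeps only row 0 (no NaN anywhere)
print(df.dropna(how='all'))    # drops only row 2 (all values NaN)
print(df.dropna(axis=1))       # drops both columns (each has a NaN)
```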
Handling Missing Data - Filling Missing Values
Though dropna( ) removes the null values, you also lose the other non-null
data with them. To avoid this, you may want to fill the missing data with
some appropriate value of your choice. For this purpose you can use
fillna( ) in the following ways:

(a) <PandaObject>.fillna(<n>) : This will fill all NaN values in a Pandas
object with the given <n> value.

(b) Using a dictionary with fillna( ) to specify fill values for each
column separately:
You can create a dictionary that defines a fill value for each of the
columns, and then pass this dictionary as an argument to fillna( ). Pandas
will leave untouched those columns that are not in the dictionary.

<DF>.fillna(<dictionary having fill values for columns>)
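Both forms can be sketched on one small invented dataframe:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 4]})

print(df.fillna(0))           # every NaN becomes 0

# Dictionary form: only column A is filled; column B is left untouched
filled = df.fillna({'A': -1})
print(filled)
```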
Filling missing values from another DataFrame

We can also fill the missing values from a similar dataframe, i.e. from a
dataframe having the same index and columns. If you pass another
dataframe's name to fillna( ), it will take the corresponding cell's value
for each missing value.

df1.fillna(df2)

will fill the missing values of dataframe df1 from the corresponding cells
of df2. The condition here is that both dataframes should have a similar
structure; only their values differ.
2.8 Combining DataFrames

When we have to access two or more similar dataframes containing more or
less similar data, we would want a way to combine those dataframes. In this
situation Pandas provides the following ways to combine dataframes:

concat( )
join( )
merge( )

Combining DataFrames using concat( )
concat( ) can concatenate two dataframes along the rows or along the
columns. This method is useful if the two dataframes have a similar
structure.

pd.concat( [<DF1>, <DF2>] )
pd.concat( [<DF1>, <DF2>], ignore_index = True )
pd.concat( [<DF1>, <DF2>], axis = 1 )
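The three calls can be compared on two tiny invented dataframes:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

rows  = pd.concat([df1, df2])                      # stacked; indexes repeat
renum = pd.concat([df1, df2], ignore_index=True)   # fresh 0..3 index
side  = pd.concat([df1, df2], axis=1)              # side by side

print(renum)
```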
Combining DataFrames using join( )
The join( ) method of a DataFrame basically creates a new dataframe from
two dataframes by joining their rows. Joining two dataframes means creating
a third dataframe from the two. This joining can be done using the join( )
function provided by the Pandas library.

<DataFrame1>.join( <DataFrame2>, [how = 'left'] )

Values of how can be 'left', 'right', 'outer', 'inner'.

Let us now understand the types of join by considering two dataframes.
(I) Inner Join: take the rows having common indexes from both the
dataframes and create a new dataframe from them.
(II) Left Join: take all the rows from the left (first) dataframe and join
with them only those rows from the second dataframe that have indexes in
common with dataframe1, and create a new dataframe from the result.
By default, join( ) performs a LEFT JOIN.
(III) Right Join: take all the rows from the right (second) dataframe and
join with them only those rows from the first dataframe that have indexes
in common with dataframe2, and create a new dataframe from the result.
(IV) Outer Join: take all the rows from both the dataframes and join them.
There will be values for all the columns only in the matching indexes'
rows; non-matching index rows will have some missing values.
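The join types above can be sketched with two invented dataframes that share only one index label ('Amit'):

```python
import pandas as pd

left  = pd.DataFrame({'Marks': [91, 85]},   index=['Amit', 'Bina'])
right = pd.DataFrame({'Grade': ['A', 'B']}, index=['Amit', 'Chetan'])

print(left.join(right))                 # left join (default): Amit, Bina
print(left.join(right, how='inner'))    # only the common index: Amit
print(left.join(right, how='outer'))    # all indexes: Amit, Bina, Chetan
```
In the left and outer joins, the non-matching rows get NaN in the columns that have no value for them.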
Joining on a Column
With the on argument in join( ), the specified column's values from the
left dataframe are joined with the indexes of the right dataframe.

df1.join( df2, on = <column name of df1> )


Combining DataFrames using merge( )
Pandas provides the merge( ) function, in which you can specify the field
on the basis of which you want to combine the two dataframes.

pd.merge( <DF1>, <DF2> )
pd.merge( <DF1>, <DF2>, on = <field_name> )
pd.merge( <DF1>, <DF2>, on = <field_name>, how = 'left' | 'right' | 'inner' | 'outer' )
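Merging on a common field can be sketched with two invented dataframes that share a 'City' column:

```python
import pandas as pd

sales = pd.DataFrame({'City': ['Delhi', 'Pune'], 'Sales': [100, 200]})
zone  = pd.DataFrame({'City': ['Delhi', 'Pune'], 'Zone':  ['North', 'West']})

# Rows are matched on the common City field
m = pd.merge(sales, zone, on='City')
print(m)
```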
2.9 Function groupby( )
The groupby( ) function allows us to create field-wise groups of values in
a dataframe and apply an aggregate function to each group.

In other words, duplicate values in the same field are grouped together to
form groups.

<dataframe>.groupby( by = None, axis = 0 )

by   : label, or list of labels
axis : index = 0, columns = 1
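A short sketch on an invented dataframe, grouping by one field and aggregating each group with sum:

```python
import pandas as pd

df = pd.DataFrame({'Dept':   ['IT', 'HR', 'IT', 'HR'],
                   'Salary': [50,   30,   70,   40]})

# One group per distinct Dept value; sum( ) aggregates within each group
g = df.groupby(by='Dept')['Salary'].sum()
print(g)
```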
