Python Pandas


Chapter 1

Data Handling using Pandas -1
Python Library – Matplotlib
Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python. It can be used to:
1. Develop publication-quality plots with just a few lines of code
2. Use interactive figures that can zoom, pan, and update
We can also customize and take full control of line styles, font
properties, and axes properties, and export and embed figures in a
number of file formats and interactive environments.
Python Library – Pandas
Pandas is one of the most widely used Python packages for data science. It offers
powerful and flexible data structures that make data analysis and
manipulation easy, and it makes importing and analyzing data
much easier. Pandas builds on packages like NumPy and
matplotlib to give us a single, convenient place for data analysis
and visualization work.
Pandas – Installation/Environment Setup
The pandas module does not come bundled with standard Python.
If we install the Anaconda Python distribution, pandas is installed by default.
Steps for Anaconda installation and use
1. Visit the site https://www.anaconda.com/download/
2. Download the appropriate Anaconda installer.
3. After the download, install it.
4. During installation, check the options for setting the path and installing for all users.
5. After installation, start the Spyder utility of Anaconda from the Start
menu.
6. Type import pandas as pd in the left pane (temp.py).
7. Then run it.
8. If no error is shown, pandas is installed.
Pandas – Installation/Environment Setup
Pandas can also be installed in the standard Python
distribution, using the following steps.
1. If we are using Windows, a service pack must be installed on the
computer. If it is not installed, we will not be able to install
pandas in the existing standard Python (which is already
installed), so install it first.
2. We can check this through the Properties option of the My Computer
icon.
3. Now install the latest version (any version above 3.4) of Python.
4. Now move to the Scripts folder of the Python distribution in the command
prompt (opened through the cmd command of Windows).
5. Execute the following commands in the command prompt, one after another.
>pip install numpy
>pip install six
>pip install pandas
Wait for each installation to finish before running the next command.
Now we will be able to use pandas in the standard Python
distribution.
6. Type import pandas as pd in the Python (IDLE) shell.
7. If it executes without error, pandas is installed on
your system.
Introduction
Pandas, or Python Pandas, is Python’s library for
data analysis. Pandas derives its name from
“panel data”, a term for
multidimensional, structured data sets.

Data analysis refers to the process of evaluating big
data sets using analytical and statistical tools so as
to discover useful information and conclusions to
support business decision making.
Why Pandas?
Pandas is capable of many tasks, including:
1. It can read and write data of many different types
(integer, float, double, etc.).
2. It can calculate in all the ways data is organized, i.e.,
across rows and columns.
3. It can easily select subsets of data from bulky data
sets and even combine multiple data sets together.
4. It has functionality to find and fill missing data.
5. It allows you to apply operations to independent groups
within the data.
6. It supports reshaping of data into different forms.
7. It supports advanced time-series functionality (time
series forecasting is the use of a model to predict future
values based on previously observed values).
8. It supports visualization by integrating libraries such as
matplotlib and seaborn.
9. It is well suited to handling huge tabular data sets
comprising different data formats.
10. It also supports the simplest tasks needed
with data, such as loading data or doing feature
engineering on time series data.
Using Pandas

Pandas offers high-performance, easy-to-use
data structures and data analysis tools.

>>> import pandas as pd

Pandas Data Structure
A data structure is a particular way of storing and
organizing data in a computer to suit a specific
purpose so that it can be accessed and worked
with in appropriate ways.
Data Structures in Pandas
Two important data structures of pandas are Series and DataFrame.
1. Series
A Series is a one-dimensional, array-like structure with
homogeneous data, for example a
collection of integers.

Basic features of a Series are
• Homogeneous data
• Size immutable
• Values of data mutable
Series Data Structure

Series is an important data structure of pandas. It
represents a one-dimensional array of indexed
data. A Series type object has two main
components:

• An array of actual data.
• An associated array of indexes or data
labels.
2. DataFrame
A DataFrame is a two-dimensional, array-like structure with heterogeneous
data.
Sr. No.  Admn No.  Student Name            Class  Section  Gender  Date Of Birth
1        001284    NIDHI MANDAL            I      A        Girl    07/08/2010
2        001285    SOUMYADIP BHATTACHARYA  I      A        Boy     24/02/2011
3        001286    SHREYAANG SHANDILYA     I      A        Boy     29/12/2010
Basic features of a DataFrame are
• Heterogeneous data
• Size mutable
• Data mutable
Pandas Series
It is like a one-dimensional array capable of holding data
of any type (integer, string, float, Python objects, etc.).
A Series can be created using the constructor.
Syntax: pandas.Series(data, index, dtype)
A Series can be created from
1. A Python sequence
2. An ndarray
3. A dict
4. A scalar value or constant
Pandas Series

I. Create an Empty Series

e.g.

import pandas as pd
# Passing dtype explicitly keeps the result the same across pandas versions
s = pd.Series(dtype="float64")
print(s)

Output
Series([], dtype: float64)
Pandas Series
Create a Series from a Python sequence

Without index
e.g.
import pandas as pd
s = pd.Series([4,6,8,10])
print(s)
Output
0     4
1     6
2     8
3    10
dtype: int64
Note : default index starts from 0

With index position
e.g.
import pandas as pd
s = pd.Series([4,6,8,10], index=[100,101,102,103])
print(s)
Output
100     4
101     6
102     8
103    10
dtype: int64
Note : index starts from 100
Pandas Series
Create a Series from ndarray

Without index
e.g.
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Output
0    a
1    b
2    c
3    d
dtype: object
Note : default index starts from 0

With index position
e.g.
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data, index=[100,101,102,103])
print(s)
Output
100    a
101    b
102    c
103    d
dtype: object
Pandas Series
Create a Series from dict

Eg.1 (without index)
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Output
a    0.0
b    1.0
c    2.0
dtype: float64

Eg.2 (with index)
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b','c','d','a'])
print(s)
Output
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
Note : index 'd' has no matching key in the dict, so its value is NaN
Create a Series from Scalar Value

e.g.
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Output
0    5
1    5
2    5
3    5
dtype: int64
Note : here 5 is repeated 4 times, once for each index label
3. Vector Operations on Series Object

Vector operations mean that if you apply a function or
expression, it is applied individually to each item of the
object.

Examples:

ob2+2, ob2*3, ob8**2, ob2>5
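The expressions above can be tried on a small Series. This is only a minimal sketch: the values and labels of ob2 below are made up for illustration and do not come from the slides.

```python
import pandas as pd

# Hypothetical sample Series standing in for ob2 from the slide
ob2 = pd.Series([3, 5, 7], index=['a', 'b', 'c'])

added = ob2 + 2      # adds 2 to every element
tripled = ob2 * 3    # multiplies every element by 3
squared = ob2 ** 2   # squares every element
mask = ob2 > 5       # element-wise comparison gives a boolean Series

print(added)
print(tripled)
print(squared)
print(mask)
```

Each result is a new Series with the same index as ob2; the original object is not changed.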

4. Arithmetic on Series Objects

We can perform arithmetic like addition, subtraction, division, etc.
with two Series objects, and the result is calculated on the
corresponding items of the two objects given in the expression. But
there is a condition: the operation is performed only on the
matching indexes.
Pandas Series
Arithmetic operations with Series Objects with matching indexes
e.g.
import pandas as pd
s = pd.Series([1,2,3])
t = pd.Series([1,2,4])
u = s + t   # addition operation
print(u)
u = s * t   # multiplication operation
print(u)
Output
0    2
1    4
2    7
dtype: int64
0     1
1     4
2    12
dtype: int64
Pandas Series
Arithmetic operations with Series Objects without matching indexes
e.g.
import pandas as pd
s = pd.Series([1,2,3])
t = pd.Series([1,2,4], index=[1,2,3])
u = s + t   # addition operation
print(u)
Output
0    NaN
1    3.0
2    5.0
3    NaN
dtype: float64

When you perform arithmetic operations on two Series type objects, the
data is aligned on the basis of matching indexes. This is called Data
Alignment in Pandas objects. Labels present in only one of the objects produce NaN.
Using a mathematical function/expression to create
data array in Series().
Eg:-

import pandas as pd
import numpy as np
a = np.arange(9,13)
obj = pd.Series(index=a, data=a*2)   # a*2 doubles each element of the ndarray
print(obj)

OUTPUT

9     18
10    20
11    22
12    24
Using a mathematical function/expression to create
data array in Series().
Eg:-

import pandas as pd
Lst = [9,10,11,12]
obj = pd.Series(data=2*Lst)   # 2*Lst repeats the list twice; it does not double the values
print(obj)

OUTPUT
0     9
1    10
2    11
3    12
4     9
5    10
6    11
7    12
Series Object Attributes
Syntax: < Series object >.<attribute name>
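The slide gives only the syntax, so here is a small sketch of a few commonly used Series attributes. The sample Series is made up for illustration.

```python
import pandas as pd

# Hypothetical sample Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.index)    # the index labels
print(s.values)   # the underlying data as an array
print(s.dtype)    # data type of the values
print(s.shape)    # tuple with the number of elements: (3,)
print(s.size)     # number of elements: 3
print(s.ndim)     # number of dimensions: 1
print(s.empty)    # False, because the Series has elements
print(s.hasnans)  # False, because no value is NaN
```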
Accessing a Series Object and its
Elements
1. Accessing Individual Elements.
Syntax: <Series Object name>[<valid index>]
2. Extracting Slices from Series Object.
Syntax: <Object>[start : end : step]
Accessing Data from Series with indexing and slicing
e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s['a'])   # Syntax: <Series Object name>[<valid index>]
print(s[:3])    # slicing with integers works position-wise, not index-wise, in a Series object
Output
1
a    1
b    2
c    3
dtype: int64
Operations on Series Object
1. Modifying Elements of Series Object:

(a) Syntax: <SeriesObject>[<index>] = <new_data_value>

The above assignment changes the data value at the given
index in the Series object.

(b) Syntax: <SeriesObject>[start:stop] = <new_data_value>

The above assignment replaces all the values falling in the given
slice.
Modifying Elements of Series Object:
e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
s['a'] = 0
print(s)

Output
a    0
b    2
c    3
d    4
e    5
dtype: int64
Modifying Elements of Series Object:
e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
s[:3] = 6   # replaces the first three values
print(s)
Output
a    6
b    6
c    6
d    4
e    5
dtype: int64
2. The head() and tail() functions

The head() function fetches the first n rows from a pandas
object, and the tail() function returns the last n rows from a pandas
object.

Syntax: <pandas object>.head([n])
or
<pandas object>.tail([n])

The default value of n is 5.
Pandas Series
head() function
e.g.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.head(3))

Output
a    1
b    2
c    3
dtype: int64
Returns the first 3 elements
Pandas Series
tail() function
e.g.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.tail(3))

Output
c    3
d    4
e    5
dtype: int64
Returns the last 3 elements
You can store the result of object arithmetic in another
object, which will also be a Series object.

i.e., if you give:

ob6 = ob1 + ob3

then ob6 will also be a Series object (if ob1 and ob3 are
pandas Series objects).

5. Filtering Entries

Syntax:

<Series Object>[<Boolean expression on Series Object>]

In[20]: s1
Out[20]:
a    0
b    2
c    3
d    4
e    5
dtype: int64

In[21]: s1>3
Out[21]:
a    False
b    False
c    False
d     True
e     True
dtype: bool

In[22]: s1[s1>3]
Out[22]:
d    4
e    5
dtype: int64
Sorting Series Values

Syntax:

<Series Object>.sort_values([ascending=True|False])
>>>s1
A 6700
B 5600
C 5000
D 5200
dtype : int64

>>>s1.sort_values()
C 5000
D 5200
B 5600
A 6700
dtype : int64
>>>s1.sort_values(ascending=False)
A 6700
B 5600
D 5200
C 5000
dtype : int64
Sorting Series Values Based on
Indexes
Syntax:

<Series Object>.sort_index([ascending=True|False])

>>>s1
A 6700
B 5600
C 5000
D 5200
dtype : int64

>>>s1.sort_index()
A 6700
B 5600
C 5000
D 5200
dtype : int64
>>>s1.sort_index(ascending=False)
D 5200
C 5000
B 5600
A 6700
dtype : int64
Some Additional operations on Series
Objects
1. Reindexing:

Syntax:

<Series Object> = <Object>.reindex(<sequence with new order of indexes>)

2. Dropping Entries from an Axis:

Syntax:

<Series Object>.drop(<index to be removed>)

Reindexing example:

In[20]: s1
Out[20]:
a    0
b    2
c    3
d    4
e    5
dtype: int64

In[21]: s1 = s1.reindex(['e','d','c','b','a'])

In[22]: s1
Out[22]:
e    5
d    4
c    3
b    2
a    0
dtype: int64
Here the order of indexes is now changed.
Altering series label

e.g. program
import pandas as pd
import numpy as np
data = np.array([54,76,88,99,34])
s1 = pd.Series(data,index=['a','b','c','d','e'])
print(s1)
s2 = s1.rename(index={'a':0,'b':1})
print(s2)

OUTPUT
a    54
b    76
c    88
d    99
e    34
dtype: int32
0    54
1    76
c    88
d    99
e    34
dtype: int32
Note : the dtype shown may be int64 instead of int32, depending on the platform.
Dropping example:

In[20]: s1
Out[20]:
a    0
b    2
c    3
d    4
e    5
dtype: int64

In[21]: s1 = s1.drop('c')

In[22]: s1
Out[22]:
a    0
b    2
d    4
e    5
dtype: int64
Here index 'c' is dropped.
Pandas DataFrame
It is a two-dimensional data structure, just like any table
(with rows & columns).
Basic Features of DataFrame
• Columns may be of different types
• Size can be changed (size-mutable) and values can be changed (value-mutable)
• Labeled axes (rows / columns)
• Arithmetic operations on rows and columns
Structure: [figure showing the rows of a DataFrame]

It can be created using the constructor
pandas.DataFrame(data, index, columns, dtype)
Pandas DataFrame
Create DataFrame
It can be created with the following:
• 2D dictionaries
• Series type object
• 2D NumPy ndarrays
• Another DataFrame object

Create an Empty DataFrame
import pandas as pd
df = pd.DataFrame()
print(df)
Output
Empty DataFrame
Columns: []
Index: []
Pandas DataFrame
1.Create a DataFrame from Dict of ndarrays / Lists
e.g.1
import pandas as pd
data1 = {'Name':['Freya', 'Mohak'],'Age':[9,10]}
df1 = pd.DataFrame(data1)
print (df1)
Output
Name Age
0 Freya 9
1 Mohak 10
Pandas DataFrame
Create a DataFrame from 2D Dictionary with value as dictionary
e.g.1
import pandas as pd
d1 = {'Sales' : {'name':'Rohit','age':24,'gender':'Male'},
      'Marketing' : {'name':'Neha','age':25,'gender':'Female'}}
df1 = pd.DataFrame(d1)
print(df1)
Output
        Sales Marketing
name    Rohit      Neha
age        24        25
gender   Male    Female
The outer keys become columns and the inner keys become row labels. Recent pandas versions keep the dictionary's insertion order; older versions sorted the labels.
Pandas DataFrame
Create a DataFrame from List of Dicts
e.g.1
import pandas as pd
data1 = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z': 5}]
df1 = pd.DataFrame(data1)
print(df1)
Output
   x  y    z
0  1  2  NaN
1  5  4  5.0

To use custom row labels, write the line below as the 3rd statement in the above program:
df1 = pd.DataFrame(data1, index=['first', 'second'])
Pandas DataFrame
Create a DataFrame from List of lists
e.g.1
import pandas as pd
data1 = [[25,45,60],[34,67,89],[88,90,56]]
df1 = pd.DataFrame(data1)
print(df1)
Output
    0   1   2
0  25  45  60
1  34  67  89
2  88  90  56
Pandas DataFrame
Create a DataFrame from Dict of Series
e.g.1
import pandas as pd
d1 = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(d1)
print(df1)
Output
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
Pandas DataFrame
Create a DataFrame from a 2D ndarray
e.g.1
import pandas as pd
import numpy as np
a = np.array([[1,2,3],[4,5,6]],np.int32)
df1 = pd.DataFrame(a)
print(df1)
Output
   0  1  2
0  1  2  3
1  4  5  6
Pandas DataFrame
Create a DataFrame from another DataFrame Object
e.g.1
import pandas as pd
import numpy as np
a = np.array([[1,2,3],[4,5,6]],np.int32)
df1 = pd.DataFrame(a)
>>>df1
   0  1  2
0  1  2  3
1  4  5  6
df2 = pd.DataFrame(df1)
>>>df2
   0  1  2
0  1  2  3
1  4  5  6
DataFrame Attributes
Syntax: <DataFrame object>.<attribute name>
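The slide gives only the syntax, so here is a small sketch of a few commonly used DataFrame attributes. The sample data reuses the names from an earlier example; the selection of attributes is illustrative, not a complete list.

```python
import pandas as pd

# Small sample DataFrame for illustration
df = pd.DataFrame({'Name': ['Freya', 'Mohak'], 'Age': [9, 10]})

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # data type of each column
print(df.shape)    # (rows, columns): (2, 2)
print(df.size)     # total number of elements: 4
print(df.ndim)     # number of dimensions: 2
print(df.empty)    # False, because the DataFrame has data
print(df.T)        # transpose: rows become columns
```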
Pandas DataFrame
Column addition
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
c = [7,8,9]
df['C'] = c     # adds a new column C
df.C = c        # attribute form; it updates column C only because C already exists

Column Deletion
(df1 is the DataFrame with columns 'one' and 'two' from the Dict of Series example)
del df1['one']    # Deleting the first column using the del statement
df1.pop('two')    # Deleting another column using the pop() method
Rename columns
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
Selecting / Accessing Multiple Columns

Syntax: <DataFrame object>[[<column name>, <column name>, <column name>]]

Selecting / Accessing a Subset from a
Dataframe using Row/Column Names

Syntax: <DataFrame object>.loc[<startrow>:<endrow>, <startcolumn>:<endcolumn>]
e.g.1
import pandas as pd
d1 = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(d1)
print(df1)
Output
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
• To access a row:
Syntax: <DF object>.loc[<rowlabel>, :]
• To access multiple rows:
Syntax: <DF object>.loc[<startrow>:<endrow>, :]
• To access selective columns:
Syntax: <DF object>.loc[:, <startcolumn>:<endcolumn>]
• To access a range of columns from a range of rows:
Syntax: <DF object>.loc[<startrow>:<endrow>, <startcolumn>:<endcolumn>]
Pandas DataFrame
Row Selection, Addition, and Deletion
e.g.1
import pandas as pd
d1 = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(d1)
print(df1.loc['b',:])   # select the row labelled 'b'; the result is a Series
Output
one    2.0
two    2.0
Name: b, dtype: float64
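The other loc forms (multiple rows, selected columns, and a block of rows and columns) follow the same pattern. The sketch below uses a made-up DataFrame with a third column, 'three', added only so that column ranges are visible.

```python
import pandas as pd

# Hypothetical DataFrame; the column 'three' is added purely for illustration
df1 = pd.DataFrame({'one': [1, 2, 3, 4],
                    'two': [10, 20, 30, 40],
                    'three': [100, 200, 300, 400]},
                   index=['a', 'b', 'c', 'd'])

row_b = df1.loc['b', :]                    # a single row, returned as a Series
rows_b_to_c = df1.loc['b':'c', :]          # label slices include both end labels
cols = df1.loc[:, 'one':'two']             # all rows, columns 'one' to 'two'
block = df1.loc['a':'b', 'two':'three']    # range of rows and range of columns

print(row_b)
print(rows_b_to_c)
print(cols)
print(block)
```

Unlike ordinary Python slices, label-based slices with loc include the end label.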
Obtaining a Subset/Slice from a Dataframe
using Row/Column Numeric Index/Position

Syntax: <DataFrame object>.iloc[<start row index>:<end row index>, <start col index>:<end col index>]
e.g.1
import pandas as pd
import numpy as np
a = np.array([[1,2,3],[4,5,6]],np.int32)
df1 = pd.DataFrame(a)
>>>df1
   0  1  2
0  1  2  3
1  4  5  6

print(df1.iloc[0:1,0:2])   # row position 0 only; column positions 0 and 1 (end positions are excluded)
OUTPUT
   0  1
0  1  2
Selecting/Accessing Individual value
(i) Either give the name of the row or its numeric index
in square brackets, i.e., as this:

<DF object>.<column>[<row name or row numeric index>]

(ii) You can use the at or iat attribute with the DF
object as shown below:

<DF object>.at[<row name>, <column name>]

Or

<DF object>.iat[<numeric row index>, <numeric column index>]
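A short sketch of the forms above, using a made-up DataFrame with row labels 'a' to 'c':

```python
import pandas as pd

# Hypothetical DataFrame for illustration
df1 = pd.DataFrame({'one': [1, 2, 3], 'two': [10, 20, 30]},
                   index=['a', 'b', 'c'])

v1 = df1.one['b']        # form (i): column as attribute, then row label
v2 = df1.at['b', 'two']  # at: row label and column label
v3 = df1.iat[2, 0]       # iat: row position 2, column position 0

print(v1, v2, v3)

# at can also be used to change a single value
df1.at['a', 'one'] = 99
print(df1)
```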
Deleting Columns:

del <DF object>[<column name>]

<DF>.drop(["Total Cost", "Order ID"], axis=1)

Deleting Rows:

<DF>.drop(<index or sequence of indexes>)

Eg:
df1.drop(range(4))
df1.drop([0,1,2,3])
drop() returns a new DataFrame; the original is unchanged unless the result is assigned back.
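The deletion forms above can be sketched on a small made-up order table. The column names "Total Cost" and "Order ID" come from the slide; the "Qty" column and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical order data; only the column names 'Total Cost' and 'Order ID' come from the slide
df1 = pd.DataFrame({'Order ID': [1, 2, 3],
                    'Qty': [5, 6, 7],
                    'Total Cost': [50, 60, 70]})

no_cost = df1.drop(['Total Cost', 'Order ID'], axis=1)  # new DataFrame without two columns
no_first_two = df1.drop([0, 1])                         # new DataFrame without rows 0 and 1

del df1['Qty']   # del removes the column from df1 itself

print(no_cost)
print(no_first_two)
print(df1)
```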
Iterating over a DataFrame

Two functions for iterating over a
dataframe are:

• iterrows()
• items() (called iteritems() in older pandas versions; iteritems() was removed in pandas 2.0)

<DF>.iterrows() views a dataframe in the
form of horizontal subsets, i.e., row-wise.

<DF>.items() views a dataframe in the
form of vertical subsets, i.e., column-wise.
Pandas DataFrame
Iterate over rows in a dataframe
e.g.
import pandas as pd
raw_data1 = {'name': ['freya', 'mohak'],
             'age': [9, 10]}
df1 = pd.DataFrame(raw_data1, columns = ['name', 'age'])
for (index, rowSeries) in df1.iterrows():
    print(index)
    print(rowSeries["name"], rowSeries["age"])
Output
0
freya 9
1
mohak 10
Pandas DataFrame
1.Create a DataFrame from Dict of ndarrays / Lists
e.g.1
import pandas as pd
data1 = {'name':['freya', 'mohak'],'age':[9,10]}
df1 = pd.DataFrame(data1)
print (df1)
Output
name age
0 freya 9
1 mohak 10
Pandas DataFrame
Iterate over columns in a dataframe
e.g.
import pandas as pd
raw_data1 = {'name': ['freya', 'mohak'],
             'age': [9, 10]}
df1 = pd.DataFrame(raw_data1, columns = ['name', 'age'])
for (col, colSeries) in df1.items():   # items() replaces iteritems(), which was removed in pandas 2.0
    print(col)
    print(colSeries[0], colSeries[1])
Output
name
freya mohak
age
9 10
Pandas DataFrame
Head & Tail
head() returns the first n rows (observe the index values). The default
number of rows to display is five, but you may pass a custom number.
tail() returns the last n rows.
e.g.1
import pandas as pd
data1 = {'name':['freya', 'mohak'],'age':[9,10]}
df1 = pd.DataFrame(data1)
print(df1.head(2))
print(df1.tail(2))
Output
    name  age
0  freya    9
1  mohak   10
    name  age
0  freya    9
1  mohak   10
Pandas DataFrame
Accessing a DataFrame with a boolean index:
In order to access a dataframe with a boolean index, we have to create a
dataframe whose index contains boolean values, that is,
“True” or “False”.
import pandas as pd
# named data rather than dict, to avoid hiding Python's built-in dict
data = {'name': ["Mohak", "Freya", "Roshni"],
        'degree': ["MBA", "BCA", "M.Tech"],
        'score': [90, 40, 80]}
df = pd.DataFrame(data, index = [True, False, True])
print(df)
print(df.loc[True])   # it will return the rows of Mohak and Roshni only
Output
         name  degree  score
True    Mohak     MBA     90
False   Freya     BCA     40
True   Roshni  M.Tech     80
        name  degree  score
True   Mohak     MBA     90
True  Roshni  M.Tech     80
Binary Operations in a DataFrame

ADDITION:

<DF1>.add(<DF2>) which means <DF1> + <DF2>

<DF1>.radd(<DF2>) which means <DF2> + <DF1>

SUBTRACTION:

<DF1>.sub(<DF2>) which means <DF1> - <DF2>

<DF1>.rsub(<DF2>) which means <DF2> - <DF1>


MULTIPLICATION:

<DF1>.mul(<DF2>) which means <DF1> * <DF2>

DIVISION:

<DF1>.div(<DF2>) which means <DF1> / <DF2>


Pandas DataFrame
Binary operation over dataframe with dataframe
import pandas as pd
x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
y = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
new_x = x.add(y, axis=0)
print(new_x)
Output
   0   1   2
0  2   8  14
1  4  10  16
2  6  12  18
Each element of x is added to the element of y that has the same row and column label.
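The other binary methods listed earlier work the same way. The sketch below uses two made-up DataFrames to show sub(), its reversed form rsub(), mul() and div().

```python
import pandas as pd

# Hypothetical DataFrames with matching row and column labels
df_a = pd.DataFrame({'p': [10, 20], 'q': [30, 40]})
df_b = pd.DataFrame({'p': [1, 2], 'q': [3, 4]})

diff = df_a.sub(df_b)    # df_a - df_b
rdiff = df_a.rsub(df_b)  # df_b - df_a (operands reversed)
prod = df_a.mul(df_b)    # df_a * df_b
quot = df_a.div(df_b)    # df_a / df_b

print(diff)
print(rdiff)
print(prod)
print(quot)
```

The reversed forms matter only for operations where order changes the result, such as subtraction and division.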
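Inspection functions info() and
describe()
info() prints a concise summary of a dataframe, and describe() reports descriptive statistics such as count, mean, min, max and percentiles.
For evenly spaced values, a percentile can be estimated as:
<min value> + (<max value> - <min value>) * <percentile to be calculated>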
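A minimal sketch of both inspection functions, assuming a small numeric DataFrame built from the Age and Score values used elsewhere in these slides:

```python
import pandas as pd

# Age and Score values as used in the aggregation examples of this chapter
df = pd.DataFrame({'Age': [26, 25, 25, 24, 31],
                   'Score': [87, 67, 89, 55, 47]})

df.info()                # prints index range, column names, non-null counts and dtypes

summary = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max for each numeric column
print(summary)
```

describe() returns a DataFrame, so individual statistics can be read with loc, for example summary.loc['50%', 'Score'] for the median score.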
Advanced operations on dataframes
(pivoting, sorting & aggregation/descriptive statistics)
Aggregation/Descriptive statistics - dataframe
Data aggregation –
Aggregation is the process of turning the values of a dataset (or a
subset of it) into one single value. In other words, an aggregation
function takes multiple values and returns a
single value as a result. There are a number of aggregations possible,
such as count, sum, min, max, median, and quartile.

For example, if we have a DataFrame like this:

      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47

…then a simple aggregation method is to calculate the sum of
the Score, which is 87+67+89+55+47 = 345. Or a different
aggregation method would be to count the number of Name values,
which is 5.
Aggregation/Descriptive statistics - dataframe
#e.g. program for data aggregation/descriptive statistics
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print(df)
print(df.count())
print("count age:",df['Age'].count())
print("sum of score:",df['Score'].sum())
print("minimum age:",df['Age'].min())
print("maximum score:",df['Score'].max())
print("mean age:",df['Age'].mean())
print("mode of age:",df['Age'].mode())
print("median of score:",df['Score'].median())
OUTPUT
Dataframe contents
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
Name     5
Age      5
Score    5
dtype: int64
count age: 5
sum of score: 345
minimum age: 24
maximum score: 89
mean age: 26.2
mode of age: 0    25
dtype: int64
median of score: 67.0

Finding the median: first sort the values (47 55 67 87 89).
If the count is odd, the median is the ((5+1)/2)th element;
here the 3rd element, 67, is the median.
If the count is even, take the average of the
middle numbers.
Eg:- 47 55 67 87
Median = (55+67)/2 = 61.
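The mad() function

It is used to calculate the mean absolute deviation of
the values for the requested axis:

MAD = \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|

where \bar{X} is the mean of the n values.

Syntax: <dataframe object>.mad()

Note: mad() was deprecated in pandas 1.5 and removed in pandas 2.0.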
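On pandas versions where mad() is not available, the same quantity can be computed by hand from its definition. This is a sketch under that assumption, using the Age and Score values from the aggregation example:

```python
import pandas as pd

# Age and Score values from the aggregation example
df = pd.DataFrame({'Age': [26, 25, 25, 24, 31],
                   'Score': [87, 67, 89, 55, 47]})

# Mean absolute deviation per column:
# subtract each column's mean, take absolute values, then average
mad = (df - df.mean()).abs().mean()
print(mad)
```

For Age the mean is 26.2, the absolute deviations are 0.2, 1.2, 1.2, 2.2 and 4.8, and their average is 1.92.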


Aggregation/Descriptive statistics - dataframe

Quantile -
Quantiles are cut points that divide a sorted data set into equal-sized parts. They are used to describe
data in a clear and understandable way.

Common Quantiles
Certain types of quantiles are used commonly enough to have
specific names. Below is a list of these:
• The 2 quantile is called the median
• The 3 quantiles are called terciles
• The 4 quantiles are called quartiles
• The 5 quantiles are called quintiles
• The 6 quantiles are called sextiles
• The 7 quantiles are called septiles
• The 8 quantiles are called octiles
• The 10 quantiles are called deciles
• The 12 quantiles are called duodeciles
• The 20 quantiles are called vigintiles
• The 100 quantiles are called percentiles
• The 1000 quantiles are called permilles
Quantiles
How to Find Quantiles?
Sample question: Find the number in the following set of data where 30 percent
of values fall below it, and 70 percent fall above:
2 4 5 7 9 11 12 17 19 21 22 31 35 36 44 45 55 68 79 80 81 88 90 91 92 100 112
113 114 120 121 132 145 148 149 152 157 170 180 190

Step 1: Order the data from smallest to largest. The data in the question is
already in ascending order.
Step 2: Count how many observations you have in your data set. This particular
data set has 40 items.
Step 3: Convert any percentage to a decimal for “q”. We are looking for the
number where 30 percent of the values fall below it, so convert that to .3.
Step 4: Insert your values into the formula:
ith observation = q (n + 1)
ith observation = .3 (40 + 1) = 12.3
Answer: The ith observation is at 12.3, so we round down to 12 (remembering
that this formula is an estimate). The 12th number in the set is 31, which is the
number where 30 percent of the values fall below it.
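The four steps above can be sketched in plain Python. The variable names are chosen for illustration only. Note that pandas' quantile() uses linear interpolation by default, so it can give a slightly different answer from this estimate.

```python
# The 40 data values from the sample question
data = [2, 4, 5, 7, 9, 11, 12, 17, 19, 21, 22, 31, 35, 36, 44, 45, 55, 68,
        79, 80, 81, 88, 90, 91, 92, 100, 112, 113, 114, 120, 121, 132, 145,
        148, 149, 152, 157, 170, 180, 190]

ordered = sorted(data)          # Step 1: order the data
n = len(ordered)                # Step 2: count the observations
q = 0.3                         # Step 3: 30 percent as a decimal
position = q * (n + 1)          # Step 4: q (n + 1), about 12.3
rank = int(position)            # round down to the 12th observation
value = ordered[rank - 1]       # list positions start at 0, so subtract 1

print(n, rank, value)
```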
Aggregation/Descriptive statistics - dataframe
#e.g. program on Quantile –
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100],
                            [4, 1000]]), columns=['a', 'b'])
print(df)
print(df.quantile(0.5))
OUTPUT
   a     b
0  1     1
1  2    10
2  3   100
3  4  1000
a     2.5
b    55.0
Name: 0.5, dtype: float64
Quantiles
How to Find Quantiles in python
In pandas series object->

import pandas as pd
import numpy as np
s = pd.Series([1, 2, 4, 5,6,8,10,12,16,20])
r=s.quantile(.3)
print(r)

OUTPUT
4.699999999999999
Note – It returns 30% quantile
Quantiles
How to Find Quantiles in python
In pandas dataframe object->

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[11, 1], [12, 10], [13, 100], [14, 100],
[15, 1000]]),columns=['a', 'b'])
r=df.quantile(.2)
print(r)

OUTPUT
a 11.8
b 8.2
Name: 0.2,
dtype: float64
Note – It returns 20% quantile
Aggregation/Descriptive statistics - dataframe
var() – The variance function in pandas is used to calculate the
variance of a given set of numbers: the variance of a data frame, of a
column, or of rows. An example is shown below.
#e.g.program
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print(df)
# numeric_only=True skips the non-numeric Name column (required in pandas 2.0 and later)
print(df.var(numeric_only=True))
print(df.std(numeric_only=True))
#df.loc[:,"Age"].var() for variance of specific column
#df.var(axis=0, numeric_only=True) column variance
#df.var(axis=1, numeric_only=True) row variance

std() – Standard deviation is the square root of variance.


Cumulative Calculations Functions

cumsum() – gives cumulative sum of values.
cumprod() – gives cumulative product of values.
cummax() – gives cumulative maximum of values.
cummin() – gives cumulative minimum of values.

>>>df1
   0  1  2
0  1  2  3
1  4  5  6

>>>df1.cumsum()
   0  1  2
0  1  2  3
1  5  7  9

>>>df1.cumsum(axis='columns')
   0  1   2
0  1  3   6
1  4  9  15

>>>df
     0    1
0  700  490
1  975  460
2  970  570
3  900  590

>>>df.cummax()
     0    1
0  700  490
1  975  490
2  975  570
3  975  590
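The slides show cumsum() and cummax(); the remaining two functions behave analogously. A small sketch using the same 2x3 values as df1 above:

```python
import pandas as pd

# Same values as df1 in the cumsum() example
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

cp = df1.cumprod()   # running product down each column
cm = df1.cummin()    # running minimum down each column

print(cp)
print(cm)
```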
Pivoting - dataframe

DataFrame – It is a 2-dimensional data structure with
columns of different types. It is similar to a
spreadsheet or SQL table, or a dict of Series objects. It
is generally the most commonly used pandas object.

Pivot – Pivot reshapes data and uses unique values from the
index/columns to form the axes of the resulting dataframe.
Index is the column name to use to make the new frame’s
index. Columns is the column name to use to make the new
frame’s columns. Values is the column name to use for
populating the new frame’s values.

Pivot table – A pivot table is used to summarize and
aggregate data inside a dataframe.
Pivoting - dataframe
Example of pivot:

DATAFRAME
ITEM  COMPANY   RUPEES  USD
TV    LG        12000   700
TV    VIDEOCON  10000   650
AC    LG        15000   800
AC    SONY      14000   750

PIVOT
COMPANY  LG     SONY   VIDEOCON
ITEM
AC       15000  14000  NaN
TV       12000  NaN    10000
Pivoting - dataframe

There are two functions available in pandas for pivoting a dataframe.
1. pivot()
2. pivot_table()

1. pivot() - This function is used to create a new derived
table (pivot) from an existing dataframe. It takes 3 arguments: index,
columns, and values. As a value for each of these parameters we
need to specify a column name in the original table (dataframe).
Then the pivot function will create a new table (pivot), whose row
and column indices are the unique values of the respective
parameters. The cell values of the new table are taken from the column
given as the values parameter.
Pivoting - dataframe
#pivot() e.g. program
import pandas as pd
table = {
    'ITEM': ['TV', 'TV', 'AC', 'AC'],
    'COMPANY': ['LG', 'VIDEOCON', 'LG', 'SONY'],
    'RUPEES': ['12000', '10000', '15000', '14000'],
    'USD': ['700', '650', '800', '750']}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
p = d.pivot(index='ITEM', columns='COMPANY', values='RUPEES')
print("\n\nDATA OF PIVOT")
print(p)
print(p[p.index=='TV'].LG.values)
#pivot() creates a new table/DataFrame whose columns are the unique values
#in COMPANY and whose rows are indexed with the unique values of
#ITEM. The last statement of the above program returns the value for the TV item of LG company, i.e.
#12000.
Pivoting - dataframe
#Pivot Table
pivot() raises an error when the same index/columns pair occurs more than once.
The pivot_table() method solves this problem. It works
like pivot, but it aggregates the values from rows with
duplicate entries for the specified columns.

ITEM  COMPANY   RUPEES  USD
TV    LG        12000   700
TV    VIDEOCON  10000   650
TV    LG        15000   800
AC    SONY      14000   750

COMPANY  LG                         SONY   VIDEOCON
ITEM
AC       NaN                        14000  NaN
TV       13500 = mean(12000,15000)  NaN    10000

d.pivot_table(index='ITEM', columns='COMPANY', values='RUPEES', aggfunc='mean')
In essence pivot_table is a generalisation of
pivot, which allows you to aggregate multiple values with the same
destination in the pivoted table.
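The pivot_table() call can be run on the four rows shown in the table. This sketch stores RUPEES and USD as numbers (not strings) so that the mean can be computed.

```python
import pandas as pd

# The four rows from the pivot table slide, with numeric RUPEES and USD
d = pd.DataFrame({
    'ITEM': ['TV', 'TV', 'TV', 'AC'],
    'COMPANY': ['LG', 'VIDEOCON', 'LG', 'SONY'],
    'RUPEES': [12000, 10000, 15000, 14000],
    'USD': [700, 650, 800, 750]})

# The two TV/LG rows are combined by taking their mean
p = d.pivot_table(index='ITEM', columns='COMPANY',
                  values='RUPEES', aggfunc='mean')
print(p)
```

Combinations that never occur in the data, such as AC with LG, appear as NaN in the result.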
Sorting - dataframe
Sorting means arranging the contents in ascending or
descending order. There are two kinds of sorting
available in pandas (DataFrame).
1. By value (column)
2. By index

1. By value - Sorting over dataframe column(s)
is supported by the sort_values() method. We
will cover here three aspects of sorting values of a
dataframe.
• Sort a pandas dataframe in ascending and
descending order
• Sort a pandas dataframe by a single column
• Sort a pandas dataframe by multiple columns
Sorting - dataframe
Sort the python pandas Dataframe by single column –
Ascending order
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,27,25,24,31]),
     'Score':pd.Series([87,89,67,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print(df)
df = df.sort_values(by='Score')
print("Dataframe contents after sorting")
print(df)
OUTPUT
Dataframe contents without sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   27     89
2    Virat   25     67
3    Rohit   24     55
4  Shikhar   31     47
Dataframe contents after sorting
      Name  Age  Score
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     67
0   Sachin   26     87
1    Dhoni   27     89
Sorting - dataframe
Sort the python pandas Dataframe by single column – Descending order
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,27,25,24,31]),
     'Score':pd.Series([87,89,67,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print(df)
df = df.sort_values(by='Score', ascending=False)
print("Dataframe contents after sorting")
print(df)
OUTPUT
Dataframe contents without sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   27     89
2    Virat   25     67
3    Rohit   24     55
4  Shikhar   31     47
Dataframe contents after sorting
      Name  Age  Score
1    Dhoni   27     89
0   Sachin   26     87
2    Virat   25     67
3    Rohit   24     55
4  Shikhar   31     47
Sorting - dataframe
Sort the pandas Dataframe by Multiple Columns
import pandas as pd
# Create a dictionary of Series
d = {'Name': pd.Series(['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar']),
     'Age': pd.Series([26, 25, 25, 24, 31]),
     'Score': pd.Series([87, 67, 89, 55, 47])}
# Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print(df)
# Age ascending first; ties on Age are broken by Score descending
df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("Dataframe contents after sorting")
print(df)

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
Dataframe contents after sorting
      Name  Age  Score
3    Rohit   24     55
2    Virat   25     89
1    Dhoni   25     67
0   Sachin   26     87
4  Shikhar   31     47
Sorting - dataframe

2. By index - Sorting over the dataframe index is supported by the
sort_index() method. We will cover two aspects of sorting the index
of a dataframe here.

• how to sort a pandas dataframe in python by index in Ascending order
• how to sort a pandas dataframe in python by index in Descending order

Visit : python.mykvs.in for regular updates


Sorting - dataframe
Sort the dataframe in python pandas by index in ascending order:
import pandas as pd
# Create a dictionary of Series
d = {'Name': pd.Series(['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar']),
     'Age': pd.Series([26, 25, 25, 24, 31]),
     'Score': pd.Series([87, 67, 89, 55, 47])}
# Create a DataFrame
df = pd.DataFrame(d)
# Shuffle the row order so that the index is no longer sorted
df = df.reindex([1, 4, 3, 2, 0])
print("Dataframe contents without sorting")
print(df)
df1 = df.sort_index()
print("Dataframe contents after sorting")
print(df1)

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
1    Dhoni   25     67
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     89
0   Sachin   26     87
Dataframe contents after sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
Sorting - dataframe
Sorting pandas dataframe by index in descending order:
import pandas as pd
# Create a dictionary of Series
d = {'Name': pd.Series(['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar']),
     'Age': pd.Series([26, 25, 25, 24, 31]),
     'Score': pd.Series([87, 67, 89, 55, 47])}
# Create a DataFrame
df = pd.DataFrame(d)
# Shuffle the row order so that the index is no longer sorted
df = df.reindex([1, 4, 3, 2, 0])
print("Dataframe contents without sorting")
print(df)
df1 = df.sort_index(ascending=False)
print("Dataframe contents after sorting")
print(df1)

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
1    Dhoni   25     67
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     89
0   Sachin   26     87
Dataframe contents after sorting
      Name  Age  Score
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     89
1    Dhoni   25     67
0   Sachin   26     87
Reindexing – Python Pandas
It is a fundamental operation over a pandas series or
dataframe. It is a process that makes the data in a
Series/DataFrame conform to a given set of labels. It is used
by pandas to perform much of the alignment process.
Changing the order of the rows and columns of a pandas
dataframe, or the order of the data in a series object, is
possible with the help of the reindex() function.
E.g. given below for a series: the first column is the label (index)
and the second column is the value.

Before reindex      After reindex
a   54              e   34
b   76              d   99
c   88              c   88
d   99              b   76
e   34              a   54
Reindexing – Python Pandas
Reindexing pandas series
The program given below creates a pandas series with some
numeric values and indexes it with the labels a,b,c,d,e. The index
is then changed to e,d,c,b,a with the help of the reindex() function.
e.g. program
import pandas as pd
import numpy as np
data = np.array([54, 76, 88, 99, 34])
s1 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s1)
s2 = s1.reindex(['e', 'd', 'c', 'b', 'a'])
print(s2)

OUTPUT
a    54
b    76
c    88
d    99
e    34
dtype: int32
e    34
d    99
c    88
b    76
a    54
dtype: int32
Reindexing – Python Pandas
Reindexing pandas series without label -
reindex() inserts NaN markers where no data exists for a label. In the
program below, f and g are not available as labels.
e.g. program
import pandas as pd
import numpy as np
data = np.array([54, 76, 88, 99, 34])
s1 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s1)
s2 = s1.reindex(['f', 'g', 'c', 'b', 'a'])
print(s2)

OUTPUT
a    54
b    76
c    88
d    99
e    34
dtype: int32
f     NaN
g     NaN
c    88.0
b    76.0
a    54.0
dtype: float64
Renaming – Python Pandas
Altering series label –

e.g. program
import pandas as pd
import numpy as np
data = np.array([54, 76, 88, 99, 34])
s1 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s1)
# Only the labels named in the mapping are changed
s2 = s1.rename(index={'a': 0, 'b': 1})
print(s2)

OUTPUT
a    54
b    76
c    88
d    99
e    34
dtype: int32
0    54
1    76
c    88
d    99
e    34
dtype: int32
Reindexing – Python Pandas
Reindexing Rows in pandas Dataframe
e.g.program
import pandas as pd
import numpy as np

table = {
    "name": ['vishal', 'anil', 'mayur', 'viraj', 'mahesh'],
    'age': [15, 16, 15, 17, 16],
    'weight': [51, 48, 49, 51, 48],
    'height': [5.1, 5.2, 5.1, 5.3, 5.1],
    'runsscored': [55, 25, 71, 53, 51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER REINDEX")
df=d.reindex([1,2, 3,4,0])
print(df)
Reindexing – Python Pandas

OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

DATA OF DATAFRAME AFTER REINDEX


name age weight height runsscored
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51
0 vishal 15 51 5.1 55



Reindexing – Python Pandas
Reindexing Columns in pandas Dataframe
e.g.program
import pandas as pd
import numpy as np

table = {
    "name": ['vishal', 'anil', 'mayur', 'viraj', 'mahesh'],
    'age': [15, 16, 15, 17, 16],
    'weight': [51, 48, 49, 51, 48],
    'height': [5.1, 5.2, 5.1, 5.3, 5.1],
    'runsscored': [55, 25, 71, 53, 51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER REINDEX")
# Keep only these columns, in this order
df = d.reindex(columns=['name', 'runsscored', 'age'])
print(df)
Reindexing – Python Pandas

OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

DATA OF DATAFRAME AFTER REINDEX


name runsscored age
0 vishal 55 15
1 anil 25 16
2 mayur 71 15
3 viraj 53 17
4 mahesh 51 16



The syntax for using reindex_like() is,

<dataframe>.reindex_like(<other dataframe>)
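A small sketch of reindex_like(), assuming two hypothetical dataframes df1 and df2 (these names and values are not from the slides). The result takes its row and column labels, and their order, from the other dataframe, while the data still comes from the calling dataframe.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=[0, 1, 2])
df2 = pd.DataFrame({'b': [0, 0], 'a': [0, 0]}, index=[2, 0])

# Rows and columns of df1 are rearranged to match the labels of df2
result = df1.reindex_like(df2)
print(result)
```

Labels that exist in df2 but not in df1 would be filled with NaN, as with reindex().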
Renaming – Python Pandas
Altering dataframe labels
e.g.program
import pandas as pd
import numpy as np

table = {
    "name": ['vishal', 'anil', 'mayur', 'viraj', 'mahesh'],
    'age': [15, 16, 15, 17, 16],
    'weight': [51, 48, 49, 51, 48],
    'height': [5.1, 5.2, 5.1, 5.3, 5.1],
    'runsscored': [55, 25, 71, 53, 51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER RENAME")
df=d.rename(index={0:'a',1:'b'})
print(df)
Renaming – Python Pandas

OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

DATA OF DATAFRAME AFTER RENAME


name age weight height runsscored
a vishal 15 51 5.1 55
b anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51



Pandas DataFrame
Rename columns
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df
   A  B
0  1  4
1  2  5
2  3  6
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
Function application
Pandas provides three important functions, namely pipe(),
apply() and applymap(), to apply our own function or some
other library’s function. Which one to use depends on whether the
function works on the entire dataframe, on rows/columns, or element wise.

• Dataframe wise Function Application: pipe()

• Row or Column Wise Function Application: apply()

• Element wise Function Application: applymap()


Sandwiching

np.multiply(sal_df.add(30),3)
Function application
Dataframe wise Function Application: pipe()
The pipe() function performs an operation on the entire dataframe with
the help of a user defined or library function. In the example below we
use pipe() to add 5 to every element of the dataframe and then multiply
the result by 3.
e.g. program
import pandas as pd
import numpy as np
# Create a dictionary of Series
d = {'science_marks': pd.Series([22, 55, 63, 85, 47]),
     'english_marks': pd.Series([89, 87, 67, 55, 47])}
df = pd.DataFrame(d)
# (value + 5) * 3 for every element
df1 = df.pipe(np.add, 5).pipe(np.multiply, 3)
print(df1)

OUTPUT
   science_marks  english_marks
0             81            282
1            180            276
2            204            216
3            270            180
4            156            156
Function application
Row or Column Wise Function Application: apply()
The apply() function performs the operation either row wise or column wise.
e.g. of Row wise Function in python pandas : apply()

import pandas as pd
import numpy as np
d = {'science_marks': pd.Series([22, 55]),
     'english_marks': pd.Series([89, 87])}
df = pd.DataFrame(d)
print(df)
# axis=1 applies the function to each row
r = df.apply(np.mean, axis=1)
print(r)

OUTPUT
   science_marks  english_marks
0             22             89
1             55             87
0    55.5
1    71.0
dtype: float64
Function application
e.g. of Column wise Function in python pandas : apply()

import pandas as pd
import numpy as np
d = {'science_marks': pd.Series([22, 55]),
     'english_marks': pd.Series([89, 87])}
df = pd.DataFrame(d)
print(df)
# axis=0 applies the function to each column
r = df.apply(np.mean, axis=0)
print(r)

OUTPUT
   science_marks  english_marks
0             22             89
1             55             87
science_marks    38.5
english_marks    88.0
dtype: float64
Function application
Element wise Function Application in python pandas: applymap()
The applymap() function performs the specified operation on every element of the
dataframe. (In pandas 2.1 and later, applymap() is deprecated in favour of
DataFrame.map(), which is called the same way.)

e.g. program
import pandas as pd
import numpy as np
d = {'science_marks': pd.Series([25, 16]),
     'english_marks': pd.Series([9, 4])}
df = pd.DataFrame(d)
print(df)
# Square root of every element
r = df.applymap(np.sqrt)
print(r)

OUTPUT
   science_marks  english_marks
0             25              9
1             16              4
   science_marks  english_marks
0            5.0            3.0
1            4.0            2.0
Function application
aggregation (group by) in pandas –
Data aggregation - Aggregation is the process of combining the
values of a dataset (or a subset of it) into one single value. Let
us go through a DataFrame like…

DATA OF DATAFRAME
Name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

…then a simple aggregation method is to calculate the sum of
runsscored, which is 55+25+71+53+51=255. A different aggregation
method would be to count the names, which gives 5. So
aggregation is not too complicated. Let’s see the rest in practice…
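The two aggregations just described can be sketched as follows, using the same dataframe as the slide. The variable names total_runs and name_count are only illustrative.

```python
import pandas as pd

d = pd.DataFrame({
    'name': ['vishal', 'anil', 'mayur', 'viraj', 'mahesh'],
    'age': [15, 16, 15, 17, 16],
    'weight': [51, 48, 49, 51, 48],
    'height': [5.1, 5.2, 5.1, 5.3, 5.1],
    'runsscored': [55, 25, 71, 53, 51],
})

# Each call reduces a whole column to one single value
total_runs = d['runsscored'].sum()
name_count = d['name'].count()
print(total_runs, name_count)
```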
Function application
Grouping in pandas
For data analysis we will often segment the data. For
instance, to know the maximum values for each age group, the
rows must first be grouped by age value.
e.g. given below.

DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

Maximum weight, height and runsscored in each age group
(for name, the alphabetically last value in the group):

      name  weight  height  runsscored
age
15  vishal      51     5.1          71
16  mahesh      48     5.2          51
17   viraj      51     5.3          53
Function application
Grouping in pandas e.g.program.
import pandas as pd
import numpy as np

table = {
    "name": ['vishal', 'anil', 'mayur', 'viraj', 'mahesh'],
    'age': [15, 16, 15, 17, 16],
    'weight': [51, 48, 49, 51, 48],
    'height': [5.1, 5.2, 5.1, 5.3, 5.1],
    'runsscored': [55, 25, 71, 53, 51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
# Largest value of every column within each age group
print(d.groupby('age').max())

# for an individual column we can use a statement like
# print(d.groupby('age').max().name)
OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

name weight height runsscored


age
15 vishal 51 5.1 71
16 mahesh 48 5.2 51
17 viraj 51 5.3 53
Function application
Transform –
Transform is an operation used in conjunction with groupby. It
follows the pattern given below.

Dataframe -> grouping -> aggregate function on each group
-> the aggregated value is written back to every row of its group.

e.g.
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

Result of the transform:
0 126
1 76
2 126
3 53
4 76
In the above example the sum of runsscored for each age group is
placed against every row of that group, in the original row order.
Function application
Transform –
e.g. program –
import pandas as pd
import numpy as np

table = {
"name":['vishal', 'anil', 'mayur', 'viraj','mahesh'], 'age':[15, 16,
15, 17,16], 'weight':[51, 48, 49, 51,48], 'height':[5.1, 5.2, 5.1,
5.3,5.1], 'runsscored':[55,25, 71, 53,51]}

d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print(d.groupby('age')["runsscored"].transform('sum'))
OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51

0 126
1 76
2 126
3 53
4 76
Handling Missing Data
You can handle missing data in many ways, most common ones are:

(i) Dropping missing data (ii) Filling missing data(Imputation)


>>>df
0 1
0 700.0 490.0
1 NaN NaN
2 NaN 570.0
3 900.0 590.0

>>>s
a 1.0
b 2.0
c NaN
d 4.0
e 5.0
Detecting or Filtering Missing Data:

<PandaObject>.isnull()

>>>s.isnull() >>>df.isnull()

a False 0 1
b False 0 False False
c True 1 True True
d False 2 True False
e False 3 False False
If you want to filter data which is not a missing value, i.e.,
non-null data, then you can use the following for a Series
object:

<Series>[filter condition]

>>>s[s.notnull()]
a 1.0
b 2.0
d 4.0
e 5.0
dtype: float64

The above method does not filter a dataframe in the same way:
df[df.notnull()] keeps every row and simply leaves NaN in the
missing cells.
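A short sketch of the difference, using the s and df shown earlier. For a dataframe, one possible approach (an assumption of this sketch, not taken from the slides) is to build a row-wise mask with all(axis=1), which keeps only the rows that have no missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({0: [700.0, np.nan, np.nan, 900.0],
                   1: [490.0, np.nan, 570.0, 590.0]})

# Series: a boolean mask drops the missing entry
non_null_s = s[s.notnull()]

# DataFrame: a boolean dataframe only masks cells, the shape is unchanged
masked = df[df.notnull()]

# DataFrame: one True/False per row keeps only the complete rows
complete_rows = df[df.notnull().all(axis=1)]
print(non_null_s)
print(masked)
print(complete_rows)
```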
Handling Missing Data – Dropping
Missing Values
(a)<PandaObject>.dropna()

(b)<DF>.dropna(how='all')

(c)<DF>.dropna(axis=1)

(d)<DF>.dropna(axis=1, how='all')
Handling Missing Data – Dropping
Missing Values
(a)<PandaObject>.dropna()

In[66] : s=s.dropna()
In[67] : s
Out[67]:
a 1.0
b 2.0
d 4.0
e 5.0
dtype : float64
Handling Missing Data – Dropping
Missing Values
(a)<PandaObject>.dropna()

In[66] : df1=df.dropna()
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
3 900.0 590.0
Handling Missing Data – Dropping
Missing Values
(b) <DF>.dropna(how='all')

In[66] : df1=df.dropna(how='all')
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
2 NaN 570.0
3 900.0 590.0
Handling Missing Data – Dropping
Missing Values
(c) <DF>.dropna(axis=1)

In[66] : df1=df.dropna(axis=1)
In[67] : df1
Out[67]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Handling Missing Data – Dropping
Missing Values
(d) <DF>.dropna(axis=1,how='all')

In[66] : df1=df.dropna(axis=1,how='all')
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
1 NaN NaN
2 NaN 570.0
3 900.0 590.0
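The four dropna() variants above can be reproduced in one short script. The dataframe is the df used throughout this section; the result variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [700.0, np.nan, np.nan, 900.0],
                   1: [490.0, np.nan, 570.0, 590.0]})

rows_any = df.dropna()                     # (a) drop rows with any NaN
rows_all = df.dropna(how='all')            # (b) drop rows that are entirely NaN
cols_any = df.dropna(axis=1)               # (c) drop columns with any NaN
cols_all = df.dropna(axis=1, how='all')    # (d) drop columns that are entirely NaN

for result in (rows_any, rows_all, cols_any, cols_all):
    print(result)
```

Both columns contain at least one NaN, so (c) removes them all, while neither column is entirely NaN, so (d) leaves the dataframe unchanged.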
Handling Missing Data – Filling Missing Values

(a)<PandaObject>.fillna(<n>)

(b)<DF>.fillna(<dictionary having fill values for columns>)
Handling Missing Data – Filling Missing Values

(a)<PandaObject>.fillna(<n>)

In[66] : df1=df.fillna(0)
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
1 0.0 0.0
2 0.0 570.0
3 900.0 590.0
Handling Missing Data – Filling Missing Values

(a)<PandaObject>.fillna(<n>)

In[66] : s1=s.fillna(0)
In[67] : s1
Out[67]:
a 1.0
b 2.0
c 0.0
d 4.0
e 5.0
Handling Missing Data – Filling Missing Values

(b)<DF>.fillna(<dictionary having fill values for columns>)
In[66] : fillValues={0:'a',1:'b'}
In[67] : df1=df.fillna(fillValues)
In[68] : df1
Out[68]:
0 1
0 700.0 490.0
1 a b
2 a 570.0
3 900.0 590.0
Handling Missing Data – Filling Missing Values

(b)<DF>.fillna(<dictionary having fill values for columns>)
In[66] : fillValues={0:'a'}
In[67] : df1=df.fillna(fillValues)
In[68] : df1
Out[68]:
0 1
0 700.0 490.0
1 a NaN
2 a 570.0
3 900.0 590.0
Handling Missing Data – Filling Missing Values

(b)<DF>.fillna(<dictionary having fill values for columns>)
In[66] : df1=df.fillna({0:'a',1:'b'})
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
1 a b
2 a 570.0
3 900.0 590.0
Handling Missing Data – Filling
Missing Values from another
DataFrame
(c) <DF1>.fillna(DF2)

It will fill the missing values of DataFrame DF1 from the
corresponding cells of DF2. The condition here is that both
dataframes should have a similar structure (the same index and
column labels); only their values differ.
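A minimal sketch of filling from another dataframe. df1 is the dataframe with missing values used in this section; df2 is a hypothetical dataframe with the same labels, invented here only to supply fill values.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [700.0, np.nan, np.nan, 900.0],
                    1: [490.0, np.nan, 570.0, 590.0]})
# Hypothetical fill values with the same index and column labels
df2 = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                    1: [10.0, 20.0, 30.0, 40.0]})

# Each NaN in df1 is replaced by the df2 cell with the same row and column label
filled = df1.fillna(df2)
print(filled)
```

Cells of df1 that already hold a value are left untouched.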
Combining Dataframes using concat()

pd.concat([<df1>,<df2>])
pd.concat([<df1>,<df2>],ignore_index=True)
pd.concat([<df1>,<df2>],axis=1)
Combining Dataframes
Concatenate two DataFrame objects with identical
columns.
>>>df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4

The concat() method is useful if the two dataframes have similar
structures.
>>>p= pd.concat([df1, df2])
>>>p
letter number
0 a 1
1 b 2
0 c 3
1 d 4

>>> p=pd.concat([df1, df2],ignore_index=True)
>>>p
letter number
0 a 1
1 b 2
2 c 3
3 d 4
>>>p= pd.concat([df1, df2],axis=1)
>>>p
letter number letter number
0 a 1 c 3
1 b 2 d 4

Joining dataframe
e.g.
import pandas as pd
df1 = pd.DataFrame({
    'Name': ['Harini', 'Dave', 'Simrat', 'Saqib']
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({
    'Competition': [5, 3, 3]
}, index=[1, 3, 7])
print(df1)
print(df2)
>>>df1 >>>df2
Name Competition
1 Harini 1 5
2 Dave 3 3
3 Simrat 7 3
4 Saqib
Syntax:
<DataFrame1>.join(<DataFrame2>,[how='left'])

>>>df1.join(df2)

Name Competition
1 Harini 5.0
2 Dave NaN
3 Simrat 3.0
4 Saqib NaN
>>>df1.join(df2,how='left')

Name Competition
1 Harini 5.0
2 Dave NaN
3 Simrat 3.0
4 Saqib NaN

>>>df1.join(df2,how='inner')

Name Competition
1 Harini 5
3 Simrat 3
>>>df1.join(df2,how='right')

Name Competition
1 Harini 5
3 Simrat 3
7 NaN 3
>>>df1.join(df2,how='outer')

Name Competition
1 Harini 5.0
2 Dave NaN
3 Simrat 3.0
4 Saqib NaN
7 NaN 3.0
Joining on a Column

Syntax:
<DF1>.join(<DF2>,on=<column name of DF1>)

>>>df2 >>>df1
C_id Competition P_id Name
1 1 5 1 1 Harini
3 3 3 2 3 Dave
7 4 3 3 7 Simrat
4 6 Saqib
>>>df2.join(df1,on='C_id')

C_id Competition P_id Name


1 1 5 1 Harini
3 3 3 7 Simrat
7 4 3 6 Saqib

The join() function can join only the left dataframe’s column values
with the indexes of the right dataframe.
Joining on a Column
Syntax:
<DF1>.join(<DF2>,on=<column name of DF1 which is
identical to DF2>)

>>>df2 >>>df1
C_id Competition C_id Name
1 1 5 1 1 Harini
3 3 3 2 3 Dave
7 4 3 3 7 Simrat
4 6 Saqib

>>>df2.join(df1,on='C_id') will give an error, because both dataframes
have a column named C_id and no suffix is specified.

>>>df2.join(df1,on='C_id',lsuffix='_CDF',rsuffix='_SDF')

C_id_CDF Competition C_id_SDF Name


1 1 5 1 Harini
3 3 3 7 Simrat
7 4 3 6 Saqib
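The behaviour above can be checked with a short sketch. The two dataframes are rebuilt from the tables on this slide; the try/except block and the variable names overlap_error and joined are additions made only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'C_id': [1, 3, 7, 6],
                    'Name': ['Harini', 'Dave', 'Simrat', 'Saqib']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'C_id': [1, 3, 4],
                    'Competition': [5, 3, 3]},
                   index=[1, 3, 7])

# Without suffixes the shared column name C_id makes join() fail
try:
    df2.join(df1, on='C_id')
    overlap_error = None
except ValueError as err:
    overlap_error = err
print(overlap_error)

# Suffixes rename the clashing columns; df2's C_id values are matched
# against the index of df1
joined = df2.join(df1, on='C_id', lsuffix='_CDF', rsuffix='_SDF')
print(joined)
```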
Combining Dataframes using merge()

pd.merge(<DF1>,<DF2>)
pd.merge(<DF1>,<DF2>, on=<field_name>)

pd.merge(<DF1>,<DF2>,[on=<field_name>],
[how='left'|'right'|'inner'|'outer'])
>>>df2 >>>df1
C_id Competition C_id Name
1 1 5 1 1 Harini
3 3 3 2 3 Dave
7 4 3 3 7 Simrat
4 6 Saqib

>>>pd.merge(df2,df1)

C_id Competition Name
0 1 5 Harini
1 3 3 Dave
>>>pd.merge(df2,df1,on='C_id')

C_id Competition Name
0 1 5 Harini
1 3 3 Dave

Note that merge() on a column discards the original indexes and
numbers the result rows from 0.

Merging dataframe(different styles)

pd.merge(df1, df2, on='C_id', how='left')   # left join
pd.merge(df1, df2, on='C_id', how='right')  # right join
pd.merge(df1, df2, on='C_id', how='outer')  # outer join
pd.merge(df1, df2, on='C_id', how='inner')  # inner join
>>>pd.merge(df1,df2,on='C_id',how='left')

C_id Name Competition
0 1 Harini 5.0
1 3 Dave 3.0
2 7 Simrat NaN
3 6 Saqib NaN
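A runnable sketch comparing the four merge styles on the df1 and df2 shown above. The result variable names are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'C_id': [1, 3, 7, 6],
                    'Name': ['Harini', 'Dave', 'Simrat', 'Saqib']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'C_id': [1, 3, 4],
                    'Competition': [5, 3, 3]},
                   index=[1, 3, 7])

left = pd.merge(df1, df2, on='C_id', how='left')    # every row of df1
right = pd.merge(df1, df2, on='C_id', how='right')  # every row of df2
inner = pd.merge(df1, df2, on='C_id', how='inner')  # C_id present in both
outer = pd.merge(df1, df2, on='C_id', how='outer')  # C_id present in either

for name, result in [('left', left), ('right', right),
                     ('inner', inner), ('outer', outer)]:
    print(name, len(result))
    print(result)
```

Only C_id 1 and 3 occur in both dataframes, so the inner merge has two rows and the outer merge has five.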
Modern technology has made things easier but
at the same time data has increased multifold.
In fact, data has grown so big that a
specific term has been coined,
'Big Data'
Data Visualization involves various disciplines
Visual representation of information and data
using charts, graphs, maps etc.
How to achieve
data visualization?

Mapping from
Data Space to
Graphic Space
Convert data to charts, graphs, maps etc.
• Schools
• Hospitals
• Banks
• Sports
Purpose of
Data visualization

• Better analysis
• Quick action
• Identifying patterns
• Finding errors
• Understanding the story
• Exploring business insights
• Grasping the Latest Trends
Plotting library
Matplotlib is the whole Python package/library used to create 2D graphs and plots
from Python scripts. pyplot is a module in matplotlib which supports a very
wide variety of graphs and plots, namely histograms, bar charts, power spectra,
error charts etc. Used along with NumPy, it provides an open-source environment
comparable to MATLAB.

Pyplot provides the state-machine interface to the plotting library in matplotlib. It
means that figures and axes are implicitly and automatically created to achieve the
desired plot. For example, calling plot from pyplot will automatically create the
necessary figure and axes to achieve the desired plot. Setting a title will then
automatically set that title on the current axes object. The pyplot interface is
generally preferred for non-interactive plotting (i.e., scripting).
Matplotlib –
pyplot features
Following features are provided in matplotlib library for data
visualization.
• Drawing – plots can be drawn based on passed data
through specific functions.
• Customization – plots can be customized as per
requirement after specifying it in the arguments of the
functions.Like color, style (dashed, dotted), width; adding
label, title, and legend in plots can be customized.
• Saving – After drawing and customization plots can be
saved for future use.
Types of charts
using matplotlib

• LINE CHART
• BAR GRAPH
• SCATTER CHART
Matplotlib –line plot
Line Plot
A line plot/chart is a graph that shows the frequency of
data occurring along a number line.
The line plot is represented by a series of datapoints
connected with a straight line. Generally line plots are
used to display trends over time. A line plot or line graph
can be created using the plot() function available in the pyplot
library. We can not only plot a line but also explicitly
define the grid, the x and y axis scale and labels,
title and display options etc.
Matplotlib –line plot

E.G.PROGRAM
import numpy as np
import matplotlib.pyplot as plt
year = [2014,2015,2016,2017,2018]
jnvpasspercentage = [90,92,94,95,97]
kvpasspercentage = [89,91,93,95,98]
plt.plot(year, jnvpasspercentage, color='g')
plt.plot(year, kvpasspercentage, color='orange')
plt.xlabel('Year')
plt.ylabel('Pass percentage')
plt.title('JNV KV PASS % till 2018')
plt.show()
Matplotlib –line plot

Line Plot customization


• Custom line color
plt.plot(year, kvpasspercentage, color='orange')
Change the value of the color argument, e.g. 'b' for blue, 'r', 'c', … or a
hex code such as '#008000'.
• Custom line style
plt.plot(year, kvpasspercentage, linestyle='-', linewidth=4)
Set linestyle to any of '-' for solid, '--' for dashed, '-.' for dashdot,
':' for dotted line.
• Custom line width
plt.plot(year, kvpasspercentage, linestyle='-', linewidth=4)
Set linewidth as required.

plt.figure(figsize=(15,7))
plt.grid(True)
• Title
plt.title('JNV KV PASS % till 2018') – Change it as per requirement.
• Label - plt.xlabel('Year') - change x or y label as per requirement
• Legend - plt.legend(loc='upper right')
The loc argument can take the values 1, 2, 3, 4, signifying
the position strings 'upper right', 'upper left', 'lower left', 'lower right'
respectively. Older Matplotlib versions default to 'upper right' (1);
current versions default to 'best'.
• Marker Type, Size and Color – The data points being plotted are called
markers. The marker types can be dots or crosses or diamonds etc.
plt.plot(year, kvpasspercentage, 'k', marker='d',
markersize=5, markeredgecolor='red')
plt.plot(year, kvpasspercentage, 'kd', linestyle='solid')
Line color and marker style are combined, so the marker takes the same
color as the line.
plt.plot(year, kvpasspercentage, 'kd', linestyle='solid', markeredgecolor='red')
Here the marker color is separately specified.

Note – If you do not specify a marker type, then data points
will not be marked specifically on the line chart and its
default type will be the same as that of the line type.

Note – To draw several lines, call the plot() function multiple
times with suitable arguments. If you skip the colour
information in plot(), Matplotlib will plot multiple lines in the
same plot with different colors, but these colors are decided
internally by Matplotlib.
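The customisations listed above can be combined in one script. This is only a sketch: the Agg backend line is an assumption so that the script runs without a display (remove it and call plt.show() when working interactively), and the figure size and label text are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window is opened
import matplotlib.pyplot as plt

year = [2014, 2015, 2016, 2017, 2018]
jnvpasspercentage = [90, 92, 94, 95, 97]
kvpasspercentage = [89, 91, 93, 95, 98]

plt.figure(figsize=(8, 4))
# Dashed green line with round markers
plt.plot(year, jnvpasspercentage, color='g', linestyle='--',
         linewidth=2, marker='o', label='JNV')
# Black solid line with diamond markers edged in red
plt.plot(year, kvpasspercentage, 'kd', linestyle='solid',
         markeredgecolor='red', label='KV')
plt.grid(True)
plt.xlabel('Year')
plt.ylabel('Pass percentage')
plt.title('JNV KV PASS % till 2018')
plt.legend(loc='upper right')
ax = plt.gca()  # the axes that pyplot created implicitly
```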
Scatter Chart

The scatter charts can be created through two functions
of the pyplot library:

(i) plot() function
(ii) scatter() function
(i) Scatter Charts using plot() Function

In the plot() function, whenever you specify a marker type/style,
whether with color or without color, and do not give the
linestyle argument, plot will create a scatter chart.
plt.plot(a, b, 'r+', markersize=8)
plt.plot(a, b, '+')
(ii) Scatter Charts using scatter() Function
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None)

Parameters:
x,y – The data positions.
s – The marker size (optional argument).
c – marker color or sequence of colors (optional
argument).
marker – marker style (optional argument).
plt.scatter(a, b, s=12, c='m', marker='D')

colarr=['r','b','m','g','k']
sarr=[20,60,100,45,25]
plt.scatter(a, b, s=sarr, c=colarr)
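A runnable sketch of the last call above. The slides do not define a and b, so the values below are hypothetical; the Agg backend line is likewise an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window is opened
import matplotlib.pyplot as plt

# Hypothetical data positions (not given in the slides)
a = [1, 2, 3, 4, 5]
b = [2, 4, 1, 5, 3]

colarr = ['r', 'b', 'm', 'g', 'k']   # one color per point
sarr = [20, 60, 100, 45, 25]         # one marker size per point

plt.figure()
points = plt.scatter(a, b, s=sarr, c=colarr)
plt.xlabel('a')
plt.ylabel('b')
```

The color and size lists must each have as many entries as there are data points.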
Matplotlib –Bar Graph
Bar Graph
A graph drawn using rectangular bars to show how large
each value is. The bars can be horizontal or vertical.
A bar graph makes it easy to compare data between
different groups at a glance. Bar graph represents
categories on one axis and a discrete value in the other.
The goal of bar graph is to show the relationship between
the two axes. Bar graph can also show big changes in data
over time.
Plotting with Pyplot
Plot bar graphs
e.g program
import matplotlib.pyplot as plt
import numpy as np
label = ['Anil', 'Vikas', 'Dharma', 'Mahen',
'Manish', 'Rajesh']
per = [94,85,45,25,50,54]
index = np.arange(len(label))
plt.bar(index, per)
plt.xlabel('Student Name', fontsize=5)
plt.ylabel('Percentage', fontsize=5)
plt.xticks(index, label, fontsize=5, rotation=30)
plt.title('Percentage of Marks achieved by student Class XII')
plt.show()
#Note – use barh () for horizontal bars
Matplotlib –Bar graph

Bar graph customization

• Custom bar color
plt.bar(index, per, color="green")
Change the value of the color argument, e.g. 'b' for blue, 'r', 'c', …
plt.bar(index, per, color=['r','g','black','c','m','orange'])
• Title
plt.title('Percentage of Marks achieved by student Class XII')
Change it as per requirement
• Label - plt.xlabel('Student Name', fontsize=5) - change x or y label
as per requirement
• Legend - plt.legend(loc='upper right')
• Custom bar width
plt.bar(index, per, width=1/2)
plt.bar(index, per, width=[0.5,.6,.7,.8,.6,.5])
(a list of widths needs one value per bar; six values here for the six bars)
Creating Multiple Bars Chart

import numpy as np
import matplotlib.pyplot as plt

Val=[[5.,25.,45.,20.],[4.,23.,49.,17.],[6.,22.,47.,19.]]

X=np.arange(4)

# Shift each series by the bar width so the bars sit side by side
plt.bar(X+0.00,Val[0],color='b',width=0.25)
plt.bar(X+0.25,Val[1],color='g',width=0.25)
plt.bar(X+0.50,Val[2],color='r',width=0.25)

plt.show()
Creating Horizontal Bar Chart
Note – To create horizontal bar chart, you need to use
barh() function(bar horizontal), in place of bar(). Also you
need to give x and y axis labels carefully – the label that you
gave to x axis in bar(), will become y axis label in barh() and
vice-versa.
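A sketch of the horizontal version of the earlier bar graph program, with the axis labels swapped as the note describes. The Agg backend line is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window is opened
import matplotlib.pyplot as plt
import numpy as np

label = ['Anil', 'Vikas', 'Dharma', 'Mahen', 'Manish', 'Rajesh']
per = [94, 85, 45, 25, 50, 54]
index = np.arange(len(label))

plt.figure()
bars = plt.barh(index, per)      # bar length now runs along the x axis
plt.yticks(index, label)         # student names move to the y axis
plt.xlabel('Percentage')
plt.ylabel('Student Name')
plt.title('Percentage of Marks achieved by student Class XII')
```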
Creating Pie Charts
The PyPlot interface offers the pie() function for creating a pie
chart.

Two important things to know about the pie() function are:

(i) The pie() function plots a single data range only. It will
calculate the share of individual elements of the data range
being plotted vs. the whole of the data range.
(ii) In older Matplotlib versions the default shape of a pie chart is
oval, but you can always change it to a circle by using axis() of
pyplot, sending "equal" as argument to it. Since Matplotlib 3.0,
pie() draws a circle by default.
matplotlib.pyplot.axis("equal")
Labels of Slices of Pie
Adding Formatted Slice
Percentages to Pie

To view the percentage share in a pie chart, you need to add the
argument autopct with a format string such as "%1.1f%%". It will show
the percent share of each slice to the whole.
Changing colors of the
Slices
Exploding a Slice
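The customisations named in the slide titles above (slice labels, formatted percentages, slice colors and exploding a slice) can be sketched together in one pie() call. The data, labels and colors below are hypothetical, and the Agg backend line is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window is opened
import matplotlib.pyplot as plt

# Hypothetical data: number of students per house
sizes = [30, 25, 20, 25]
labels = ['Red', 'Blue', 'Green', 'Yellow']
colors = ['r', 'b', 'g', 'y']
explode = [0, 0.1, 0, 0]   # pull the second slice out of the pie

plt.figure()
plt.pie(sizes, labels=labels, colors=colors, explode=explode,
        autopct='%1.1f%%')   # percentage text with one decimal place
plt.axis('equal')            # keep the pie circular
plt.title('Students per house')
```

The labels, colors and explode lists each need one entry per slice.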
Customizing the Plot
1. Title
plt.title('Percentage of Marks achieved by student Class XII')

2. Label
plt.xlabel('Student Name', fontsize=5) - change x or y label as per
requirement.

3. Legend
plt.legend(('jnv','kv'), loc='upper right')
The loc argument can take the values 1, 2, 3, 4,
signifying the position strings 'upper right', 'upper left',
'lower left', 'lower right' respectively. Older Matplotlib versions
default to 'upper right' (1); current versions default to 'best'.
4. Ticks
plt.xticks([0,1,2,3])

5. Setting Xlimits and Ylimits
plt.xlim(<xmin>,<xmax>)
plt.ylim(<ymin>,<ymax>)
plt.xlim(-2.0,4.0)

6. Saving a Figure
plt.savefig("multibar.pdf") (saves in current directory)
plt.savefig("c:\\data\\multibar.pdf") (saves at the given path)
plt.savefig("c:\\data\\multibar.png") (saves at the given path)
Histogram
A histogram is a powerful technique in data visualization. It
is an accurate graphical representation of the distribution of
numerical data.It was first introduced by Karl Pearson. It is
an estimate of the distribution of a continuous variable
(quantitative variable). It is similar to bar graph. To construct
a histogram, the first step is to “bin” the range of values —
means divide the entire range of values into a series of
intervals — and then count how many values fall into each
interval. The bins are usually specified as consecutive, non-
overlapping intervals of a variable. The bins (intervals) must
be adjacent, and are often (but are not required to be) of
equal size.
Matplotlib –Histogram

A histogram is a graphical representation which
organizes a group of data points into user-specified
ranges.
A histogram provides a visual interpretation of
numerical data by showing the number of data points
that fall within a specified range of values (“bins”). It
is similar to a vertical bar graph but without gaps
between the bars.
Histogram
Difference between a histogram and a bar chart / graph –

A bar chart mainly represents categorical data (data that has
some labels associated with it). The categories are usually represented
using rectangular bars with lengths proportional to the
values that they represent.

Histograms, on the other hand, are used to describe
distributions: given a set of data, how are its values distributed?
Syntax:
matplotlib.pyplot.hist(x, bins=None, cumulative=False, histtype='bar',
align='mid', orientation='vertical')

Parameters:

1. x : array or sequence of arrays.
2. bins : intervals (optional)
3. cumulative : bool, optional. Default is False.
4. histtype : bar, barstacked, step, stepfilled (optional)
5. orientation : horizontal, vertical (optional)
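A short sketch of hist() with explicit bins. The marks data is hypothetical, and the Agg backend line is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window is opened
import matplotlib.pyplot as plt

# Hypothetical marks of ten students
marks = [5, 7, 12, 18, 19, 25, 31, 33, 38, 39]

plt.figure()
# Four adjacent, equal-width bins: 0-10, 10-20, 20-30, 30-40
counts, edges, patches = plt.hist(marks, bins=[0, 10, 20, 30, 40],
                                  histtype='bar', orientation='vertical')
plt.xlabel('Marks')
plt.ylabel('Number of students')
print(counts)
```

hist() returns the count in each bin, the bin edges and the drawn bars, so the binning can be checked without looking at the picture.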
(i) Using PyPlot’s graph functions.
(ii) Using Dataframe’s plot() function.
