Python Pandas
Pandas Series
e.g.
import pandas as pd
s = pd.Series()
print(s)
Output
Series([], dtype: float64)
(pandas 2.0 and later print dtype: object for an empty Series)
Data Handling using
Pandas -1 Pandas Series
Create a Series from a Python sequence
Examples:
When you perform arithmetic operations on two Series type objects, the
data is aligned on the basis of matching indexes. This is called Data
Alignment in Pandas objects.
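A minimal sketch of data alignment, assuming two small Series whose indexes only partly overlap. The names ob1, ob3 and ob6 follow the ones used later in these notes; the values are illustrative.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values)
ob1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
ob3 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Values are added label by label; labels present in only one
# operand ('a' and 'd') produce NaN in the result.
ob6 = ob1 + ob3
print(ob6)
```

Because NaN is a float, the result has dtype float64 even though both inputs hold integers.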
Using a mathematical function/expression to create
data array in Series().
Eg:-
import pandas as pd
import numpy as np
a = np.arange(9,13)
print(a)
obj = pd.Series(index=a,data=a*2)
print(obj)
OUTPUT
[ 9 10 11 12]
9 18
10 20
11 22
12 24
dtype: int64
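Using a mathematical function/expression to create
data array in Series().
Note: 2*Lst on a Python list repeats the list; it does not double each
element (that needs a NumPy array, as in the previous example).
Eg:-
import pandas as pd
import numpy as np
Lst = [9,10,11,12]
obj = pd.Series(data=2*Lst)
print(obj)
OUTPUT
0 9
1 10
2 11
3 12
4 9
5 10
6 11
7 12
dtype: int64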
Series Object Attributes
Syntax: < Series object >.<attribute name>
Accessing a Series Object and its
Elements
1. Accessing Individual Elements.
Syntax: <Series Object name>[<valid index>]
2. Extracting Slices from Series Object.
Syntax: <Object>[start : end : step]
Accessing Data from Series with indexing and slicing
e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s['a']) # Syntax: <Series Object name>[<valid index>]
print (s[:3]) # slicing works position-wise, not index-wise, in a Series object
Output
1
a 1
b 2
c 3
dtype: int64
Operations on Series Object
1. Modifying Elements of Series Object:
(a)Syntax: <SeriesObject>[<index>]=<new_data_value>
(b)Syntax: <SeriesObject>[start:stop]=<new_data_value>
e.g. after s['a']=0 on the Series s above:
Output
a 0
b 2
c 3
d 4
e 5
dtype: int64
Modifying Elements of Series Object:
e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
s[:3]=6
print(s)
Output
a 6
b 6
c 6
d 4
e 5
dtype: int64
2. The head() and tail() functions
Syntax:<pandas object>.head([n])
or
<pandas object>.tail([n])
Default value is 5.
head() function e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.head(3))
Output
a 1
b 2
c 3
dtype: int64
Return first 3 elements
tail() function e.g.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.tail(3))
Output
c 3
d 4
e 5
dtype: int64
Return last 3 elements
You can store the result of Series arithmetic in another
object, which will also be a Series object.
ob6=ob1+ob3
Here ob6 is also a Series object (provided ob1 and ob3 are
pandas Series objects).
5. Filtering Entries
Syntax:
In[21]:s1>3
Out[21]:
a False
b False
c False
d True
e True
dtype : bool
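The Boolean result above becomes a filter when it is passed back inside the square brackets. A small sketch, assuming s1 holds the values 0, 2, 3, 4, 5 that appear elsewhere in these notes:

```python
import pandas as pd

s1 = pd.Series([0, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

mask = s1 > 3          # Boolean Series: True where the condition holds
filtered = s1[mask]    # keeps only the entries whose mask value is True
print(filtered)
```

The usual one-line form is s1[s1 > 3].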
Sorting Series Values
Syntax:
<Series Object>.sort_values([ascending=True|False])
>>>s1
A 6700
B 5600
C 5000
D 5200
dtype : int64
>>>s1.sort_values()
C 5000
D 5200
B 5600
A 6700
dtype : int64
>>>s1.sort_values(ascending=False)
A 6700
B 5600
D 5200
C 5000
dtype : int64
Sorting Series Values Based on
Indexes
Syntax:
<Series Object>.sort_index([ascending=True|False])
>>>s1
A 6700
B 5600
C 5000
D 5200
dtype : int64
>>>s1.sort_index()
A 6700
B 5600
C 5000
D 5200
dtype : int64
>>>s1.sort_index(ascending=False)
D 5200
C 5000
B 5600
A 6700
dtype : int64
Some Additional operations on Series
Objects
1. Reindexing:
Syntax:
In[21]: s1=s1.reindex(['e','d','c','b','a'])
In[22]: s1
Out[22]:
e 5
d 4
c 3
b 2
a 0
dtype : int64
Here the order of the indexes is changed.
Altering series label –
e.g. program
import pandas as pd
import numpy as np
data = np.array([54,76,88,99,34])
s1 = pd.Series(data,index=['a','b','c','d','e'])
print (s1)
s2=s1.rename(index={'a':0,'b':1})
print(s2)
OUTPUT
a 54
b 76
c 88
d 99
e 34
dtype: int32
0 54
1 76
c 88
d 99
e 34
dtype: int32
In[20]:s1
Out[20]:
a 0
b 2
c 3
d 4
e 5
dtype : int64
In[21]: s1=s1.drop('c')
In[22]: s1
Out[22]:
a 0
b 2
d 4
e 5
dtype : int64
Here index 'c' is dropped.
Pandas DataFrame
It is a two-dimensional data structure, just like any table
(with rows & columns).
Basic Features of DataFrame
Columns may be of different types
Size-mutable (rows and columns can be added or deleted) and value-mutable
Labeled axes (rows / columns)
Arithmetic operations on rows and columns
Structure
Rows
Column Deletion
del df1['one'] # Deleting the column 'one' using the del statement
df1.pop('two') # Deleting another column using the pop() method, which also returns it
Rename columns
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
Selecting / Accessing Multiple Columns
Syntax: <DataFrame object>.loc[<startrow>:<endrow>, <startcolumn>:<endcolumn>]
e.g.1
import pandas as pd
d1 = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(d1)
print (df1)
Output
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
• To access a row:
print (df1.loc['b',:])
Output
one 2.0
two 2.0
Name: b, dtype: float64
Obtaining a Subset/Slice from a Dataframe
using Row/Column Numeric Index/Position
print(df1.iloc[0:1,0:2])
OUTPUT
   one  two
a  1.0    1
Selecting/Accessing Individual value
(i) Either give name of row or numeric index
in square brackets with, ie., as this:
Or
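As an illustration of picking out one value, the sketch below uses the standard at and iat accessors of a DataFrame. The frame is a cut-down version of the df1 used above; that these two accessors are the forms intended here is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [1, 2, 3]},
                   index=['a', 'b', 'c'])

by_label = df1.at['b', 'two']     # row label, column label
by_position = df1.iat[1, 1]       # row position, column position
print(by_label, by_position)
```

Both expressions address the same cell, once by name and once by position.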
Deleting Rows:
• iterrows()
• iteritems() (called items() in pandas 2.0 and later, where iteritems() was removed)
<DF>.iterrows() views a dataframe in the
form of horizontal subsets i.e.,row-wise.
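A short sketch of both iteration styles on an illustrative two-row frame. It uses items(), the current name of iteritems(); the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])

row_totals = {}
# iterrows() yields (index label, row as a Series) pairs, one per row
for label, row in df.iterrows():
    row_totals[label] = int(row['one'] + row['two'])

# items() yields (column name, column as a Series) pairs, one per column
column_names = [name for name, column in df.items()]
print(row_totals, column_names)
```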
ADDITION:
SUBTRACTION:
DIVISION:
Binary operation over dataframe with dataframe
import pandas as pd
x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
y = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
new_x = x.add(y, axis=0)
print(new_x)
Output
   0   1   2
0  2   8  14
1  4  10  16
2  6  12  18
Each cell of the result is the sum of the cells at the same row and column label in x and y.
Inspection functions info() and
describe()
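Percentile value = <min value> + (<max value> - <min value>) * <percentile to be calculated>
(exact only for evenly spaced values; in general pandas interpolates linearly between the two nearest sorted values)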
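A minimal sketch of the two inspection functions, info() and describe(), assuming a small illustrative frame with two numeric columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 10, 100, 1000]})

df.info()                 # prints index range, column dtypes, non-null counts
summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max per column
print(summary)
```

describe() returns a DataFrame, so a single statistic can be read with, for example, summary.loc['50%', 'a'].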
Advanced operations on dataframes (pivoting, sorting & aggregation/descriptive statistics)
Aggregation/Descriptive statistics - dataframe
Data aggregation –
Aggregation is the process of turning the values of a dataset (or a
subset of it) into one single value. In other words, an aggregation is a
function that takes multiple values and returns a single value as a
result. Many aggregations are possible, such as
count, sum, min, max, median and quartile.
Quantile -
Quantiles are cut points that divide an ordered data set into
equal-sized parts. They are used to describe how the data is distributed
in a clear and understandable way.
Common Quantiles
Certain types of quantiles are used commonly enough to have
specific names. Below is a list of these:
• The 2 quantile is called the median
• The 3 quantiles are called terciles
• The 4 quantiles are called quartiles
• The 5 quantiles are called quintiles
• The 6 quantiles are called sextiles
• The 7 quantiles are called septiles
• The 8 quantiles are called octiles
• The 10 quantiles are called deciles
• The 12 quantiles are called duodeciles
• The 20 quantiles are called vigintiles
• The 100 quantiles are called percentiles
• The 1000 quantiles are called permilles
Quantiles
How to Find Quantiles?
Sample question: Find the number in the following set of data where 30 percent
of values fall below it, and 70 percent fall above:
2 4 5 7 9 11 12 17 19 21 22 31 35 36 44 45 55 68 79 80 81 88 90 91 92 100 112
113 114 120 121 132 145 148 149 152 157 170 180 190
Step 1: Order the data from smallest to largest. The data in the question is
already in ascending order.
Step 2: Count how many observations you have in your data set. This particular
data set has 40 items.
Step 3: Convert any percentage to a decimal for “q”. We are looking for the
number where 30 percent of the values fall below it, so convert that to .3.
Step 4: Insert your values into the formula:
ith observation = q (n + 1)
ith observation = .3 (40 + 1) = 12.3
Answer: The ith observation is at 12.3, so we round down to 12 (remembering
that this formula is an estimate). The 12th number in the set is 31, which is the
number where 30 percent of the values fall below it.
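The four steps above can be sketched in plain Python. The variable names are illustrative, and rounding down with int() follows the worked answer; pandas' own quantile() interpolates differently, so its result need not match this estimate.

```python
data = [2, 4, 5, 7, 9, 11, 12, 17, 19, 21, 22, 31, 35, 36, 44, 45, 55, 68,
        79, 80, 81, 88, 90, 91, 92, 100, 112, 113, 114, 120, 121, 132, 145,
        148, 149, 152, 157, 170, 180, 190]

data.sort()                 # Step 1: order the data
n = len(data)               # Step 2: number of observations
q = 0.3                     # Step 3: percentage as a decimal
position = q * (n + 1)      # Step 4: ith observation = q(n + 1)
i = int(position)           # round down to a whole position
answer = data[i - 1]        # positions are 1-based, list indexes 0-based
print(n, i, answer)
```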
Aggregation/Descriptive statistics - dataframe
#e.g. program on Quantile –
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100],
[4, 1000]]),columns=['a', 'b'])
print(df)
print(df.quantile(0.5))
OUTPUT
   a     b
0  1     1
1  2    10
2  3   100
3  4  1000
a     2.5
b    55.0
Name: 0.5, dtype: float64
Quantiles
How to Find Quantiles in python
In pandas series object->
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 4, 5,6,8,10,12,16,20])
r=s.quantile(.3)
print(r)
OUTPUT
4.699999999999999
Note – It returns 30% quantile
Quantiles
How to Find Quantiles in python
In pandas dataframe object->
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[11, 1], [12, 10], [13, 100], [14, 100],
[15, 1000]]),columns=['a', 'b'])
r=df.quantile(.2)
print(r)
OUTPUT
a 11.8
b 8.2
Name: 0.2, dtype: float64
Note – It returns 20% quantile
Aggregation/Descriptive statistics - dataframe
var() – The variance function in pandas calculates the variance of a
set of numbers: of a whole dataframe, of a column or of rows.
std() returns the standard deviation in the same way.
#e.g.program
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([26,25,25,24,31]),
'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print(df)
#numeric_only=True skips the text column 'Name' (required in pandas 2.0 and later)
print(df.var(numeric_only=True))
print(df.std(numeric_only=True))
#df.loc[:,"Age"].var() for variance of specific column
#df.var(axis=0, numeric_only=True) column variance
#df.var(axis=1, numeric_only=True) row variance
>>>df1.cumsum()
0 1 2
0 1 2 3
1 5 7 9
>>>df
0 1
0 700 490
1 975 460
2 970 570
3 900 590
>>>df.cummax()
0 1
0 700 490
1 975 490
2 975 570
3 975 590
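A small sketch of the cumulative functions, assuming the four-row frame shown above:

```python
import pandas as pd

df = pd.DataFrame({0: [700, 975, 970, 900], 1: [490, 460, 570, 590]})

running_total = df.cumsum()   # each cell = sum of the column down to that row
running_max = df.cummax()     # each cell = largest value seen so far in the column
print(running_total)
print(running_max)
```

Both calls work column by column by default; passing axis=1 accumulates across each row instead.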
Pivoting - dataframe
index
Reindexing – Python Pandas
It is a fundamental operation over a pandas Series or
dataframe. It is the process of making the data in a
Series/dataframe conform to a given set of labels. It is used
by pandas to perform much of the alignment process.
The reindex() function changes the order of the rows and
columns of a dataframe, or the order of the data in a
Series object. Labels that are not present in the original
object get NaN values.
E.g. Given below for series - first column is the label (as index) and second
column is the value
e.g. program
import pandas as pd
import numpy as np
data = np.array([54,76,88,99,34])
s1 = pd.Series(data,index=['a','b','c','d','e'])
print (s1)
s2=s1.rename(index={'a':0,'b':1})
print(s2)
OUTPUT
a 54
b 76
c 88
d 99
e 34
dtype: int32
0 54
1 76
c 88
d 99
e 34
dtype: int32
Reindexing – Python Pandas
Reindexing Rows in pandas Dataframe
e.g.program
import pandas as pd
import numpy as np
table = {
"name":['vishal', 'anil', 'mayur', 'viraj','mahesh'], 'age':[15, 16, 15,
17,16], 'weight':[51, 48, 49, 51,48],'height':[5.1, 5.2, 5.1, 5.3,5.1],
'runsscored':[55,25, 71, 53,51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER REINDEX")
df=d.reindex([1,2, 3,4,0])
print(df)
Reindexing – Python Pandas
OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51
table = {
"name":['vishal', 'anil', 'mayur', 'viraj','mahesh'], 'age':[15, 16, 15,
17,16], 'weight':[51, 48, 49, 51,48],'height':[5.1, 5.2, 5.1, 5.3,5.1],
'runsscored':[55,25, 71, 53,51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER REINDEX")
df=d.reindex(columns=['name','runsscored','age'])
print(df)
Reindexing – Python Pandas
OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51
<dataframe>.reindex_like(<other dataframe>)
Renaming – Python Pandas
Altering dataframe labels
e.g.program
import pandas as pd
import numpy as np
table = {
"name":['vishal', 'anil', 'mayur', 'viraj','mahesh'], 'age':[15, 16, 15,
17,16], 'weight':[51, 48, 49, 51,48],'height':[5.1, 5.2, 5.1, 5.3,5.1],
'runsscored':[55,25, 71, 53,51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER RENAME")
df=d.rename(index={0:'a',1:'b'})
print(df)
Renaming – Python Pandas
OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51
A B
0 1 4
1 2 5
2 3 6
a c
0 1 4
1 2 5
2 3 6
Function application
Pandas provides three important functions, namely pipe(),
apply() and applymap(), to apply our own function or some
other library's function. Which one to use depends on whether the
function works on the entire dataframe, row- or column-wise, or element-wise.
np.multiply(sal_df.add(30),3)
Function application
Dataframe wise Function Application: pipe()
The pipe() function performs the operation on the entire dataframe with
the help of a user-defined or library function. In the example below we
use pipe() to add 5 to every value of the dataframe and then
multiply the result by 3.
e.g.program.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'science_marks':pd.Series([22,55,63,85,47]),
'english_marks':pd.Series([89,87,67,55,47])}
df = pd.DataFrame(d)
df1=df.pipe(np.add,5).pipe(np.multiply,3)
print (df1)
OUTPUT
   science_marks  english_marks
0             81            282
1            180            276
2            204            216
3            270            180
4            156            156
Function application
Row or Column Wise Function Application: apply()
The apply() function performs the operation either row-wise or column-wise
on the data.
e.g. of Row wise Function in python pandas : apply()
import pandas as pd
import numpy as np
d = {'science_marks':pd.Series([22,55]),
'english_marks':pd.Series([89,87])}
df = pd.DataFrame(d)
print(df)
r=df.apply(np.mean,axis=1)
print (r)
OUTPUT
   science_marks  english_marks
0             22             89
1             55             87
0    55.5
1    71.0
dtype: float64
Function application
e.g. of Column wise Function in python pandas : apply()
import pandas as pd
import numpy as np
d = {'science_marks':pd.Series([22,55]),
'english_marks':pd.Series([89,87])}
df = pd.DataFrame(d)
print(df)
r=df.apply(np.mean,axis=0)
print (r)
OUTPUT
   science_marks  english_marks
0             22             89
1             55             87
science_marks    38.5
english_marks    88.0
dtype: float64
Function application
Element wise Function Application in python pandas: applymap()
The applymap() function performs the specified operation on every element of the
dataframe. (pandas 2.1 and later offer the same behaviour as DataFrame.map() and deprecate applymap().)
e.g.program
import pandas as pd
import numpy as np
d = {'science_marks':pd.Series([25,16]),
'english_marks':pd.Series([9,4])}
df = pd.DataFrame(d)
print(df)
r=df.applymap(np.sqrt)
print (r)
OUTPUT
   science_marks  english_marks
0             25              9
1             16              4
   science_marks  english_marks
0            5.0            3.0
1            4.0            2.0
Function application
aggregation (group by) in pandas –
Data aggregation - Aggregation is the process of combining the
values of a dataset (or a subset of it) into one single value. Let
us take the following DataFrame:
DATA OF DATAFRAME
Name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51
Maximum name, weight, height and runsscored in each age group:
      name  weight  height  runsscored
age
15  vishal      51     5.1          71
16  mahesh      48     5.2          51
17   viraj      51     5.3          53
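The per-age maximum table can be reproduced with groupby() followed by max(). This is a sketch on the same five-player data; the name per_age_max is illustrative.

```python
import pandas as pd

table = {'name': ['vishal', 'anil', 'mayur', 'viraj', 'mahesh'],
         'age': [15, 16, 15, 17, 16],
         'weight': [51, 48, 49, 51, 48],
         'height': [5.1, 5.2, 5.1, 5.3, 5.1],
         'runsscored': [55, 25, 71, 53, 51]}
d = pd.DataFrame(table)

# One row per age group; max() is taken column by column within each group
# (for the string column 'name' this is the alphabetically last name).
per_age_max = d.groupby('age').max()
print(per_age_max)
```

Each column is reduced independently, so the values in one result row can come from different players.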
Function application
Grouping in pandas e.g.program.
import pandas as pd
import numpy as np
table = {
"name":['vishal', 'anil', 'mayur', 'viraj','mahesh'], 'age':[15, 16,
15, 17,16], 'weight':[51, 48, 49, 51,48], 'height':[5.1, 5.2, 5.1,
5.3,5.1], 'runsscored':[55,25, 71, 53,51]}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print(d.groupby('age')["runsscored"].transform(np.sum))
OUTPUT
DATA OF DATAFRAME
name age weight height runsscored
0 vishal 15 51 5.1 55
1 anil 16 48 5.2 25
2 mayur 15 49 5.1 71
3 viraj 17 51 5.3 53
4 mahesh 16 48 5.1 51
0 126
1 76
2 126
3 53
4 76
Handling Missing Data
You can handle missing data in many ways, most common ones are:
>>>s
a 1.0
b 2.0
c NaN
d 4.0
e 5.0
Detecting or Filtering Missing Data:
<PandaObject>.isnull()
>>>s.isnull()
a False
b False
c True
d False
e False
>>>df.isnull()
       0      1
0  False  False
1   True   True
2   True  False
3  False  False
If you want to filter data which is not a missing value i.e.,
non-null data, then you can use following for Series
object:
<Series>[filter condition]
>>>s[s.notnull()]
a 1.0
b 2.0
d 4.0
e 5.0
dtype : float64
Handling Missing Data – Dropping
Missing Values
(a)<PandaObject>.dropna()
In[66] : s=s.dropna()
In[67] : s
Out[67]:
a 1.0
b 2.0
d 4.0
e 5.0
dtype : float64
Handling Missing Data – Dropping
Missing Values
(a)<PandaObject>.dropna()
In[66] : df1=df.dropna()
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
3 900.0 590.0
Handling Missing Data – Dropping
Missing Values
(b) <DF>.dropna(how='all')
In[66] : df1=df.dropna(how='all')
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
2 NaN 570.0
3 900.0 590.0
Handling Missing Data – Dropping
Missing Values
(c) <DF>.dropna(axis=1)
In[66] : df1=df.dropna(axis=1)
In[67] : df1
Out[67]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Handling Missing Data – Dropping
Missing Values
(d) <DF>.dropna(axis=1,how='all')
In[66] : df1=df.dropna(axis=1,how='all')
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
1 NaN NaN
2 NaN 570.0
3 900.0 590.0
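The four dropna() variants can be compared side by side. This sketch rebuilds the frame from the outputs above (two NaN in column 0, one in column 1); the result names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [700, np.nan, np.nan, 900],
                   1: [490, np.nan, 570, 590]})

any_row = df.dropna()                    # drop rows containing any NaN
all_row = df.dropna(how='all')           # drop rows where every value is NaN
any_col = df.dropna(axis=1)              # drop columns containing any NaN
all_col = df.dropna(axis=1, how='all')   # drop columns where every value is NaN
print(any_row.index.tolist(), all_row.index.tolist(),
      any_col.shape, all_col.shape)
```

None of these calls changes df itself; each returns a new frame.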
Handling Missing Data – Filling Missing Values
(a)<PandaObject>.fillna(<n>)
In[66] : df1=df.fillna(0)
In[67] : df1
Out[67]:
0 1
0 700.0 490.0
1 0.0 0.0
2 0.0 570.0
3 900.0 590.0
Handling Missing Data – Filling Missing Values
(a)<PandaObject>.fillna(<n>)
In[66] : s1=s.fillna(0)
In[67] : s1
Out[67]:
a 1.0
b 2.0
c 0.0
d 4.0
e 5.0
Combining DataFrames
pd.concat([<df1>,<df2>])
pd.concat([<df1>,<df2>],ignore_index=True)
pd.concat([<df1>,<df2>],axis=1)
Concatenate two DataFrame objects with identical
columns.
>>>df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4
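A brief sketch applying the three concat() forms to the df1 and df2 defined above. The result names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

stacked = pd.concat([df1, df2])                         # index: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)   # index: 0, 1, 2, 3
side_by_side = pd.concat([df1, df2], axis=1)            # 2 rows, 4 columns
print(stacked, renumbered, side_by_side, sep='\n\n')
```

Without ignore_index=True the original row labels are kept, so the index of the stacked frame contains duplicates.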
Joining dataframe
e.g.
import pandas as pd
df1 = pd.DataFrame({
'Name':['Harini', 'Dave',
'Simrat', 'Saqib']
},index=[1,2,3,4])
df2 = pd.DataFrame(
{'Competition':
[5,3,3],
},index=[1,3,7])
print(df1)
print(df2)
>>>df1
Name
1 Harini
2 Dave
3 Simrat
4 Saqib
>>>df2
Competition
1 5
3 3
7 3
Syntax:
<DataFrame1>.join(<DataFrame2>[, how='left'])
>>>df1.join(df2)
Name Competition
1 Harini 5.0
2 Dave NaN
3 Simrat 3.0
4 Saqib NaN
>>>df1.join(df2,how='left')
Name Competition
1 Harini 5.0
2 Dave NaN
3 Simrat 3.0
4 Saqib NaN
>>>df1.join(df2,how='inner')
Name Competition
1 Harini 5
3 Simrat 3
>>>df1.join(df2,how='right')
Name Competition
1 Harini 5
3 Simrat 3
7 NaN 3
>>>df1.join(df2,how='outer')
Name Competition
1 Harini 5.0
2 Dave NaN
3 Simrat 3.0
4 Saqib NaN
7 NaN 3.0
Joining on a Column
Syntax:
<DF1>.join(<DF2>,on=<column name of DF1>)
>>>df2
C_id Competition
1 1 5
3 3 3
7 4 3
>>>df1
P_id Name
1 1 Harini
2 3 Dave
3 7 Simrat
4 6 Saqib
>>>df2.join(df1,on="C_id")
The join() function can join only the left dataframe’s column values
with the indexes of the right dataframe.
Joining on a Column
Syntax:
<DF1>.join(<DF2>,on=<column name of DF1 which is
identical to DF2>)
>>>df2
C_id Competition
1 1 5
3 3 3
7 4 3
>>>df1
C_id Name
1 1 Harini
2 3 Dave
3 7 Simrat
4 6 Saqib
pd.merge(<DF1>,<DF2>)
pd.merge(<DF1>,<DF2>, on=<field_name>)
pd.merge(<DF1>,<DF2>[, on=<field_name>][, how='left'|'right'|'inner'|'outer'])
>>>df2
C_id Competition
1 1 5
3 3 3
7 4 3
>>>df1
C_id Name
1 1 Harini
2 3 Dave
3 7 Simrat
4 6 Saqib
>>>pd.merge(df2,df1)
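A runnable sketch of the merge, assuming the two frames shown above. The names inner and outer are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'C_id': [1, 3, 7, 6],
                    'Name': ['Harini', 'Dave', 'Simrat', 'Saqib']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'C_id': [1, 3, 4], 'Competition': [5, 3, 3]},
                   index=[1, 3, 7])

inner = pd.merge(df2, df1)                            # joins on the shared column C_id
outer = pd.merge(df2, df1, on='C_id', how='outer')    # keeps unmatched rows from both sides
print(inner)
print(outer)
```

Unlike join(), merge() matches on column values rather than on the index, and the result gets a fresh 0-based index.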
Mapping from
Data Space to
Graphic Space
Convert data to charts, graphs, maps etc.
Schools
Hospitals
Banks
Sports
Purpose of
Data visualization
• Better analysis
• Quick action
• Identifying patterns
• Finding errors
• Understanding the story
• Exploring business insights
• Grasping the Latest Trends
Plotting library
Matplotlib is a Python package/library used to create 2D graphs and plots
from Python scripts. pyplot is a module in matplotlib which supports a very
wide variety of graphs and plots, namely histograms, bar charts, power spectra,
error charts etc. Used along with NumPy, it provides an open-source alternative to MATLAB.
• LINE CHART
• BAR GRAPH
• SCATTER CHART
Matplotlib –line plot
Line Plot
A line plot/chart is a graph that shows how one quantity
changes against another.
The line plot is represented by a series of data points
connected with straight lines. Generally line plots are
used to display trends over time. A line plot or line graph
can be created using the plot() function available in the pyplot
library. Besides plotting the line itself, we can
explicitly define the grid, the x and y axis scale and labels,
title and display options etc.
Matplotlib –line plot
E.G.PROGRAM
import numpy as np
import matplotlib.pyplot as plt
year = [2014,2015,2016,2017,2018]
jnvpasspercentage = [90,92,94,95,97]
kvpasspercentage = [89,91,93,95,98]
plt.plot(year, jnvpasspercentage, color='g')
plt.plot(year, kvpasspercentage, color='orange')
plt.xlabel('Year')
plt.ylabel('Pass percentage')
plt.title('JNV KV PASS % till 2018')
plt.show()
Matplotlib –line plot
plt.figure(figsize=(15,7))
plt.grid(True)
• Title
plt.title('JNV KV PASS % till 2018') – Change it as per requirement.
• Label - plt.xlabel('Year') - change x or y label as per requirement
• Legend - plt.legend(loc='upper right')
The loc argument can take either the values 1,2,3,4 or
the position strings 'upper right', 'upper left', 'lower left', 'lower right'
respectively. If loc is omitted, current Matplotlib versions pick the 'best' (least overlapping) position.
• Marker Type, Size and Color – The data points being plotted are called
markers. The marker types can be dots or crosses or diamonds etc.
plt.plot(year, kvpasspercentage,'k',marker='d',
markersize=5,markeredgecolor='red')
plt.plot(year, kvpasspercentage,'kd',linestyle='solid')
Line color and marker style combined, so marker takes same color as line.
plt.plot(year, kvpasspercentage,'kd',linestyle='solid',markeredgecolor='red')
Here marker color is separately specified.
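The customisation options above can be combined into one script. This is a sketch using the pass-percentage lists from the earlier program; the label= texts and the headless Agg backend are assumptions made so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit this line to get a window
import matplotlib.pyplot as plt

year = [2014, 2015, 2016, 2017, 2018]
jnvpasspercentage = [90, 92, 94, 95, 97]
kvpasspercentage = [89, 91, 93, 95, 98]

plt.figure(figsize=(15, 7))
plt.grid(True)
# label= gives each line the name that legend() displays
jnv_line, = plt.plot(year, jnvpasspercentage, color='g', marker='o', label='JNV')
kv_line, = plt.plot(year, kvpasspercentage, 'kd', linestyle='solid',
                    markersize=5, markeredgecolor='red', label='KV')
plt.xlabel('Year')
plt.ylabel('Pass percentage')
plt.title('JNV KV PASS % till 2018')
legend = plt.legend(loc='upper right')
# call plt.show() here when working interactively
```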
Parameters:
x,y – The data positions.
s – The marker size(optional argument).
c – marker color or sequence of color(optional
argument).
marker – marker style(optional argument).
plt.scatter(a, b,s=12,c='m',marker='D')
colarr=['r','b','m','g','k']
sarr=[20,60,100,45,25]
plt.scatter(a, b,s=sarr,c=colarr)
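A self-contained sketch of the two scatter() calls. The notes do not give the x and y data, so a and b below are illustrative; the two-panel layout and the Agg backend are also assumptions.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit to get a window
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]            # illustrative x positions
b = [2, 4, 1, 5, 3]            # illustrative y positions
colarr = ['r', 'b', 'm', 'g', 'k']
sarr = [20, 60, 100, 45, 25]

fig, (left, right) = plt.subplots(1, 2)
# one size, one color and one marker style for every point
uniform = left.scatter(a, b, s=12, c='m', marker='D')
# a separate size and color for each of the five points
varied = right.scatter(a, b, s=sarr, c=colarr)
```

When s or c is a sequence, it needs one entry per data point.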
Matplotlib –Bar Graph
Bar Graph
A graph drawn using rectangular bars to show how large
each value is. The bars can be horizontal or vertical.
A bar graph makes it easy to compare data between
different groups at a glance. Bar graph represents
categories on one axis and a discrete value in the other.
The goal of bar graph is to show the relationship between
the two axes. Bar graph can also show big changes in data
over time.
Plotting with Pyplot
Plot bar graphs
e.g program
import matplotlib.pyplot as plt
import numpy as np
label = ['Anil', 'Vikas', 'Dharma', 'Mahen',
'Manish', 'Rajesh']
per = [94,85,45,25,50,54]
index = np.arange(len(label))
plt.bar(index, per)
plt.xlabel('Student Name', fontsize=5)
plt.ylabel('Percentage', fontsize=5)
plt.xticks(index, label, fontsize=5, rotation=30)
plt.title('Percentage of Marks achieved by students of Class XII')
plt.show()
#Note – use barh () for horizontal bars
Matplotlib –Bar graph
import numpy as np
import matplotlib.pyplot as plt
Val=[[5.,25.,45.,20.],[4.,23.,49.,17.],[6.,22.,47.,19.]]
X=np.arange(4)
plt.bar(X+0.00,Val[0],color='b',width=0.25)
plt.bar(X+0.25,Val[1],color='g',width=0.25)
plt.bar(X+0.50,Val[2],color='r',width=0.25)
plt.show()
Creating Horizontal Bar Chart
Note – To create horizontal bar chart, you need to use
barh() function(bar horizontal), in place of bar(). Also you
need to give x and y axis labels carefully – the label that you
gave to x axis in bar(), will become y axis label in barh() and
vice-versa.
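A sketch of the horizontal version of the student bar graph, reusing the label and per lists from the earlier program. It shows how the axis labels swap; the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit to get a window
import matplotlib.pyplot as plt
import numpy as np

label = ['Anil', 'Vikas', 'Dharma', 'Mahen', 'Manish', 'Rajesh']
per = [94, 85, 45, 25, 50, 54]
index = np.arange(len(label))

bars = plt.barh(index, per)          # bar lengths now run along the x axis
plt.yticks(index, label)             # category names move to the y axis
plt.ylabel('Student Name')           # was the x label with bar()
plt.xlabel('Percentage')             # was the y label with bar()
```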
Creating Pie Charts
The PyPlot interface offers pie() function for creating a pie
chart.
(i) The pie() function, plots a single data range only. It will
calculate the share of individual elements of the data range
being plotted vs. the whole of the data range.
(ii) In older Matplotlib versions the default shape of a pie chart is oval, but you can always
change it to a circle by using axis() of pyplot, sending "equal" as
argument to it (Matplotlib 3.0 and later draw circular pies by default).
matplotlib.pyplot.axis("equal")
Labels of Slices of Pie
Adding Formatted Slice
Percentages to Pie
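A minimal sketch of slice labels and formatted percentages, using pie()'s labels and autopct parameters. The category names are borrowed from earlier in these notes and the share values are illustrative.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit to get a window
import matplotlib.pyplot as plt

shares = [40, 30, 20, 10]                       # illustrative data
names = ['Schools', 'Hospitals', 'Banks', 'Sports']

# labels= names each slice; autopct= formats each slice's percentage
wedges, texts, autotexts = plt.pie(shares, labels=names, autopct='%1.1f%%')
plt.axis('equal')    # keep the pie circular
```

In the format string '%1.1f%%', the part '%1.1f' prints one decimal place and '%%' prints a literal percent sign.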
2.Label
plt.xlabel('Student Name', fontsize=5)- change x or y label as per
requirement.
3.Legend
plt.legend(('jnv','kv'),loc='upper right')
The loc argument can take either the values 1,2,3,4
or the position strings 'upper right', 'upper left',
'lower left', 'lower right' respectively. If loc is omitted,
current Matplotlib versions pick the 'best' position.
4. Ticks
plt.xticks([0,1,2,3])
6. Saving a Figure
plt.savefig("multibar.pdf") (saves in current directory)
plt.savefig("c:\\data\\multibar.pdf") (saves at the given path)
plt.savefig("c:\\data\\multibar.png") (saves at the given path)
Histogram
A histogram is a powerful technique in data visualization. It
is an accurate graphical representation of the distribution of
numerical data.It was first introduced by Karl Pearson. It is
an estimate of the distribution of a continuous variable
(quantitative variable). It is similar to bar graph. To construct
a histogram, the first step is to “bin” the range of values —
means divide the entire range of values into a series of
intervals — and then count how many values fall into each
interval. The bins are usually specified as consecutive, non-
overlapping intervals of a variable. The bins (intervals) must
be adjacent, and are often (but are not required to be) of
equal size.
Matplotlib –Histogram
Parameters:
1. x : array or sequence of arrays.
2. bins : number of bins or sequence of bin edges (optional)
3. cumulative : bool, optional. Default is False.
4. histtype : 'bar', 'barstacked', 'step', 'stepfilled' (optional)
5. orientation : 'horizontal', 'vertical' (optional)
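A short sketch of hist() with the parameters listed above, on illustrative marks data. The bin edges and the Agg backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend; omit to get a window
import matplotlib.pyplot as plt

marks = [22, 55, 63, 85, 47, 89, 87, 67, 55, 47]   # illustrative values

# bins given as explicit edges: 20-40, 40-60, 60-80, 80-100
counts, edges, patches = plt.hist(marks, bins=[20, 40, 60, 80, 100],
                                  cumulative=False, histtype='bar',
                                  orientation='vertical')
print(counts, edges)
```

hist() returns the count per bin, the bin edges and the drawn bars, so the binning can be checked without looking at the picture.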
(i) Using PyPlot’s graph functions.
(ii) Using Dataframe’s plot() function.