Unit I: Data Handling Using Pandas and Data Visualization: Marks:25

Data Handling using Pandas -I
Introduction to Python libraries- Pandas, Matplotlib. Data structures in Pandas -
Series and Data Frames. Series: Creation of Series from – ndarray, dictionary,
scalar value; mathematical operations; Head and Tail functions; Selection,
Indexing and Slicing.
Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV
files; display; iteration; Operations on rows and columns: add, select, delete,
rename; Head and Tail functions; Indexing using Labels, Boolean Indexing; Joining,
Merging and Concatenation.
Importing/Exporting Data between CSV files and Data Frames.
Data handling using Pandas – II
Descriptive Statistics: max, min, count, sum, mean, median, mode,
quartile, Standard deviation, variance.
DataFrame operations: Aggregation, group by, Sorting, Deleting and
Renaming Index, Pivoting. Handling missing values – dropping and filling.
Importing/Exporting Data between MySQL database and Pandas.
Introduction to Python Library : Pandas ( Python for Data Analysis)
Introduction to Python Libraries
• A Python library is a collection of ready-made modules that lets us perform many tasks without writing detailed programs for them.

• We have to import a library before calling its functions.

• NumPy, Pandas and Matplotlib are three well-established Python libraries. These libraries allow us to manipulate, transform and visualize data easily and efficiently.

• NumPy (Numerical Python) provides a multidimensional array object and functions for working with these arrays. It is used for numerical analysis and scientific computing.

• Pandas (the name comes from "panel data") is a high-level data manipulation tool used for analyzing data. It makes it very easy to import and export data. It is built on NumPy and works with Matplotlib for data analysis and visualization. Its data structures, Series and DataFrame (and, in older versions, Panel), make the process of analyzing data organized, effective and efficient.

• Matplotlib is used for plotting 2D graphs and for visualization. It is built on NumPy and is designed to work well with NumPy and Pandas.
Installing Pandas:
• Open the command prompt
• Type cd\ to move to the root directory
• Type pip install pandas
• pip is the Python package installer
Note: When Pandas is installed, NumPy (Numerical Python) is installed automatically as well, because Pandas relies on NumPy for handling arrays.
Testing Pandas
* Type import pandas as pd in the IDLE shell
>>> import pandas as pd
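A quick way to confirm that the installation worked is to print the version strings of both libraries. This is a minimal sketch, assuming Pandas and NumPy were installed with pip as described above:

```python
import pandas as pd
import numpy as np

# Both libraries expose their version number as a string attribute.
pandas_version = pd.__version__
numpy_version = np.__version__

print("pandas version:", pandas_version)
print("NumPy version:", numpy_version)
```

If either import fails with ModuleNotFoundError, the library is not installed for the Python interpreter being used.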
DATA STRUCTURES IN PANDAS
• A data structure is a way of storing and organizing data in a computer.
• Pandas has three types of data structures, namely
1) Series: a one-dimensional structure storing homogeneous (same data type), mutable (modifiable) data such as integers or strings.
2) DataFrame: a two-dimensional structure storing heterogeneous (multiple data types), mutable data.
3) Panel: a three-dimensional way of storing data (not in syllabus; Panel has been removed from recent versions of Pandas).
1. Series
• A Series is a one-dimensional array-like structure with homogeneous (same type of) data.
• The data label associated with a particular value is called its index.

For example, the following series is a collection of integers.

49 55 10 79 67

Basic features of a series are
• Homogeneous data
• Size of series data is immutable (we cannot change the size of series data)
• Series data is mutable

A Series is a one-dimensional labelled structure capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

A series can also be described as an ordered dictionary with a mapping of index values to data values.

Examples of series-type objects

Index  Data     Index    Data     Index        Data
0      22       'Jan'    31       'Sunday'     1
1      -14      'Feb'    28       'Monday'     2
2      52       'Mar'    31       'Tuesday'    3
3      100      'April'  30       'Wednesday'  4
How to create series in pandas:
• Using Series() method
• List or dictionary data can be converted into series using
this method.
2. DataFrame
A DataFrame is like a two-dimensional array with heterogeneous data.

Basic features of a DataFrame are
• Heterogeneous data
• Size mutable
• Data mutable
Create a series with the names of 3 of your friends.
>>> import pandas as pd
>>> data=['Abey','Bhasu','Charlie']
>>> s1=pd.Series(data)
>>> print(s1)
Output:
0 Abey
1 Bhasu
2 Charlie
dtype: object
Create a series with the names of 3 of your friends, with explicit index values.
>>> import pandas as pd
>>> data=['Abey','Bhasu','Charlie']
>>> s1=pd.Series(data, index=[3,5,1])
>>> print(s1)
Output:
3 Abey
5 Bhasu
1 Charlie
dtype: object
Home work
• Create a series with first four months as index and no of days in it as
data.
• Create a series having names of any five famous monuments of India
and assign their states as index values.
Creating an empty series using Series() Method:
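• It is created by calling Series() with no data.

# Example 1: Empty Series using Series() Method
>>> import pandas as pd
>>> s1 = pd.Series(dtype='float64')   # pd.Series() or pd.Series(None) also create an empty series
>>> print(s1)

Output:
Series([], dtype: float64)
Note: when no dtype is given, the dtype of an empty series depends on the Pandas version (float64 in older versions, object in Pandas 2.x).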
Creating a series using Series() method with Arguments
A series is created using the Series() method by passing the data elements and the index as arguments to it.
Syntax:
<Series object> = pandas.Series(data, index=idx)
* The series output has 2 columns: the index on the left and the data values on the right. If we don't specify an index, a default index from 0 to N-1 is used.
Create a Series using List:
# Example 2: creating a series using Series() with List as an argument
>>> import pandas as pd
>>> s1 = pd. Series([10,20,30,40])
>>> print(s1)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Creating a series using range method
>>>import pandas as pd
>>> s1 = pd.Series(range(5))
>>> print(s1)
0 0
1 1
2 2
3 3
4 4
dtype: int64
Creating a series with explicit index values:
>>> import pandas as pd
>>> s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1)
a 10
b 20
c 30
d 40
e 50
dtype: int64
Creating a Series from ndarray
Without index Argument
>>> import pandas as pd
>>> import numpy as np
>>> data = np.array(['a', 'b', 'c', 'd'])
>>> s1 = pd.Series(data)
>>> print(s1)
Output:
0 a
1 b
2 c
3 d
dtype: object
Creating a Series from ndarray
With index Argument
>>> import pandas as pd
>>> import numpy as np
>>> data = np.array(['a', 'b', 'c', 'd'])
>>> s1 = pd.Series(data, index=[100, 101, 102, 103])
>>> print(s1)
Output:
100 a
101 b
102 c
103 d
dtype: object
Create a Series from dict
Eg.1(without index)
>>> import pandas as pd
>>> data = {'a':0,'b':1,'c':2}
>>> s1 = pd.Series ( data)
>>> print(s1)
Output:
a 0
b 1
c 2
dtype: int64
Eg.2 (with index)
>>> import pandas as pd
>>> data = {'a':0,'b':1,'c':2}
>>> s1 =pd.Series( data, index= ['b' ,'c', 'd' ,'a'])
>>> print(s1)
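Output:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Note: NaN stands for "Not a Number". It marks the missing value for index 'd', which is not a key in the dictionary. Because NaN is a floating point value, the whole series becomes float64.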
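The NaN produced by a missing key can be detected and replaced. The following is a minimal sketch using the same dictionary; the replacement value -1 is only an illustrative assumption.

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2}
# 'd' is not a key of the dictionary, so its value becomes NaN.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

missing = s.isna()                      # True only for index 'd'
filled = s.fillna(-1).astype('int64')   # replace NaN, then convert back to integers

print(missing)
print(filled)
```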
Create a Series from Scalar
>>> import pandas as pd
>>> s1 =pd.Series(5, index=[1,2,3,4])
>>> print(s1)
Output:
1 5
2 5
3 5
4 5
dtype: int64
Note :- here 5 is repeated for 4 times (as per no of index)
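Creating a series using arange method of numpy
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.arange(10, 16, 1), index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> print(s1)
a 10
b 11
c 12
d 13
e 14
f 15
dtype: int32
Note: the integer dtype produced by np.arange() depends on the platform and the NumPy version; it is int32 on some Windows installations and int64 elsewhere.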
Accessing elements of a series
* There are 2 methods: indexing and slicing
A) Indexing
There are two types of indexes: positional index and labelled index. The positional index is the default index starting from 0, whereas a labelled index is a user-defined index.
Example 1:
>>> import pandas as pd
>>>s1 = pd.Series([ 10, 20,30, 40,50])
>>>print(s1[2] )
30
Example 2:
>>> import pandas as pd
>>>s1 = pd.Series([ 10, 20,30, 40,50],index = ['a','b','c','d','e'])
>>> print(s1['d'] )
40
>>> print(s1[['a','c','e']])
Output:
a 10
c 30
e 50
dtype: int64
Example 3:
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'London', 'Paris'], index=['India', 'UK', 'France'])
>>> print(sercap['India'])
NewDelhi
>>> print(sercap[['UK', 'France']])
UK London
France Paris
dtype: object
How to assign new index values to series
>>>sercap.index=[10,20,30]
>>>print(sercap)

10 NewDelhi
20 London
30 Paris
dtype: object
B) Slicing
• Similar to slicing with NumPy arrays
• Slicing can be done by specifying the starting and ending parameters.
• With positional indexes, the value at the end index position is excluded.
Example:
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'], index=['India', 'USA', 'UK', 'France'])
>>> print(sercap[1:3])
Output:
USA WashingtonDC
UK London
dtype: object
Example using labelled index (with labels, the value at the end label is included)
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'], index=['India', 'USA', 'UK', 'France'])
>>> print(sercap['USA':'France'])
USA WashingtonDC
UK London
France Paris
dtype: object
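The difference between the two kinds of slice can be seen side by side. This is a minimal sketch on the same sercap series; iloc and loc are used here only to make explicit which kind of slice is meant.

```python
import pandas as pd

sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                   index=['India', 'USA', 'UK', 'France'])

# Positional slice: position 3 (France) is excluded.
by_position = sercap.iloc[1:3]

# Label slice: the end label 'UK' is included.
by_label = sercap.loc['USA':'UK']

print(by_position)
print(by_label)
```

Both slices return the rows USA and UK, even though the positional slice names position 3 as its end and the label slice names 'UK'.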
Series in reverse order slicing
>>> import pandas as pd
>>> sercap = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'], index=['India', 'USA', 'UK', 'France'])
>>> print(sercap[::-1])
France Paris
UK London
USA WashingtonDC
India NewDelhi
dtype: object
How to modify the values of series using slicing
>>> import pandas as pd
>>> s1=pd.Series(range(10,16,1),index=['a','b','c','d','e','f'])
>>> s1[1:3]=50
>>> print(s1)
a 10
b 50
c 50
d 13
e 14
f 15
dtype: int64
Example 2: using index label
>>> import pandas as pd
>>> s1 = pd.Series(range(10, 16, 1), index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> s1['c' :'e']=500
>>> print(s1)
a 10
b 11
c 500
d 500
e 500
f 15
dtype: int64
Accessing Data from Series with indexing and slicing (using position)
e.g.
>>> import pandas as pd
>>> s1 = pd.Series([11, 12, 13, 14, 15], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s1.iloc[0])   # same element as print(s1['a'])
11
>>> print(s1[:3])
a 11
b 12
c 13
dtype: int64
>>> print(s1[-3:])
c 13
d 14
e 15
dtype: int64
In the first statement the element at position 0 is displayed. (Older code writes s1[0]; recent Pandas versions warn when an integer is used as a position on a labelled series, so iloc is preferred.)

In the second statement the first 3 elements of the series are displayed.

In the third statement the last 3 elements are displayed because of negative indexing.
Retrieve Data from selection :

There are three methods for data selection:

• loc is used for indexing or selecting based on name, i.e., by row name and column name. It refers to name-based indexing.

<object>.loc[<list of row names>, <list of column names>]

• iloc is used for indexing or selecting based on position, i.e., by row number and column number. It refers to position-based indexing.

<object>.iloc[<row number range>, <column number range>]

• ix tried to behave like loc but fell back to behaving like iloc if a label was not present in the index. ix was deprecated and has been removed from recent versions of Pandas (1.0 onwards); use loc and iloc instead.
>>> # usage of loc and iloc for accessing elements of a series
>>> import pandas as pd
>>> s = pd.Series([11, 12, 13, 14, 15], index=['a', 'b', 'c', 'd', 'e'])
>>> print(s.loc['b':'e'])
b 12
c 13
d 14
e 15
dtype: int64
>>> print(s.iloc[1:4])
b 12
c 13
d 14
dtype: int64
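The row and column form of loc and iloc applies to a two-dimensional object. The following is a minimal sketch; the small DataFrame df and its values are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data: three students with marks and age.
df = pd.DataFrame({'marks': [75, 98, 78], 'age': [15, 14, 18]},
                  index=['Anu', 'Aarish', 'Banu'])

# loc: select by row names and column names.
by_name = df.loc[['Anu', 'Banu'], ['marks']]

# iloc: select by row positions and column positions (end position excluded).
by_position = df.iloc[0:2, 0:1]

print(by_name)
print(by_position)
```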
Pandas Series Retrieve Data from selection
e.g.1
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
>>> print(s1.iloc[:3])   # slice the first three rows; print(s1.loc[49:47]) gives the same output
Output:
49 NaN
48 NaN
47 NaN
dtype: float64
e.g.2
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
>>> print(s1.loc[49:1])   # selects the data according to the index name; print(s1.iloc[:6]) gives the same output
Output:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
dtype: float64
Conditional Filtering Entries:
>>> import pandas as pd
>>> s1 = pd.Series([1.00000, 1.414214, 1.730751, 2.000000])
>>> print(s1)
Output:
0 1.000000
1 1.414214
2 1.730751
3 2.000000
dtype: float64
>>> print(s1 < 2)
Output:
0 True
1 True
2 True
3 False
dtype: bool
>>> print(s1[s1 >= 2])
Output:
3 2.0
dtype: float64
>>> print(s1[s1 < 2])
Output:
0 1.000000
1 1.414214
2 1.730751
dtype: float64
Note:
• The statement s1 < 2 performs a vectorized operation which checks every element in the series.
• The statement s1[s1 >= 2] performs a filtering operation and returns only those elements for which the expression is True.
Conditional Filtering Entries
Filtering entries from a series object can be done using expressions that are of
Boolean type.
<Series object> [ <Boolean expression on series object>]

Example:
Series object s11 stores the charity contribution made by each section

A 6700
B 5600
C 5000
D 5200
Write a program to display which section contributed more than Rs. 5500
Output:
Contribution >5500 are:
A 6700
B 5600
dtype: int64
Program:
>>> import pandas as pd
>>> s11= pd.Series([6700,5600,5000,5200],index=['A','B','C','D'])
>>> print("Contribution >5500 are:")
>>> print(s11[s11>5500])

Output:
Contribution >5500 are:
A 6700
B 5600
dtype: int64
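Boolean expressions can also be combined. This is a minimal sketch on the same series s11; the limits 5200, 6000, 5100 and 6500 are assumptions chosen for illustration. Each condition needs its own parentheses, and & (and) or | (or) is used instead of the Python keywords and/or.

```python
import pandas as pd

s11 = pd.Series([6700, 5600, 5000, 5200], index=['A', 'B', 'C', 'D'])

# Sections that contributed between 5200 and 6000 (both inclusive).
between = s11[(s11 >= 5200) & (s11 <= 6000)]

# Sections that contributed less than 5100 or more than 6500.
extreme = s11[(s11 < 5100) | (s11 > 6500)]

print(between)
print(extreme)
```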
Sorting Series values:
A Series object can be sorted based on values and indexes.

• Sorting on basis of values:
<Series object>.sort_values([ascending=True | False])
If s1 is
A 6700
B 5600
C 5000
D 5200
dtype: int64

>>> print(s1.sort_values())
Output:
C 5000
D 5200
B 5600
A 6700
dtype: int64
>>> print(s1.sort_values(ascending=False))
Output:
A 6700
B 5600
D 5200
C 5000
dtype: int64
• Sorting on basis of indexes
<Series object>.sort_index([ascending=True | False])
If s1 is
A 6700
B 5600
C 5000
D 5200
dtype: int64

>>> s1.sort_index()
Output:
A 6700
B 5600
C 5000
D 5200
dtype: int64
>>> s1.sort_index(ascending=False)
Output:
D 5200
C 5000
B 5600
A 6700
dtype: int64
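Both sorting methods return a new series and leave the original unchanged. A minimal sketch with the same data; the names ranked and top2 are introduced here only for illustration.

```python
import pandas as pd

s1 = pd.Series([6700, 5600, 5000, 5200], index=['A', 'B', 'C', 'D'])

# sort_values() returns a sorted copy; s1 keeps its original order.
ranked = s1.sort_values(ascending=False)

# The two largest contributions are the first two rows of the sorted copy.
top2 = ranked.head(2)

print(ranked)
print(top2)
print(s1)
```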
Deleting elements from a Series
• Elements of a series can be deleted using the drop() method by passing the index label as argument.
Example
>>> import pandas as pd
>>> s1 = pd.Series([1.00000, 1.414214, 1.730751, 2.000000], index=range(1, 5))
>>> print(s1)
Output
1 1.000000
2 1.414214
3 1.730751
4 2.000000
dtype: float64
To remove one element from the series
>>> print(s1.drop(3))   # returns a copy without the element; s1 itself is unchanged
>>> s1.drop(3, inplace=True)   # drops the element from s1 permanently
>>> print(s1)
Output:
1 1.000000
2 1.414214
4 2.000000
dtype: float64
To remove more than one element from the series (starting again from the original four-element series, because drop() raises a KeyError for a label that is no longer present)
>>> print(s1.drop([1,3]))
Output:
2 1.414214
4 2.000000
dtype: float64
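The behaviour of drop() with a label that does not exist can be sketched as follows. The errors='ignore' argument is part of drop(); the variable names are introduced here only for illustration.

```python
import pandas as pd

s1 = pd.Series([1.00000, 1.414214, 1.730751, 2.000000], index=range(1, 5))

# Without inplace=True, drop() returns a new series and s1 keeps all 4 elements.
smaller = s1.drop(3)

# Label 3 is no longer in 'smaller', so dropping it again raises KeyError.
try:
    smaller.drop(3)
    raised = False
except KeyError:
    raised = True

# errors='ignore' skips labels that are not present instead of raising.
safe = smaller.drop([1, 3], errors='ignore')

print(len(s1), raised)
print(safe)
```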
Methods of Series
• head()
• tail()
• count()
• Series.head(n) is a series function that fetches the first n elements of a series. By default it gives the first 5 elements.
• Series.tail(n) is a series function that displays the last n elements; the last five by default.

Example 1:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.head(3))
Output:
a 1
b 2
c 3
dtype: int64
Example 2:
>>> print(s1.head())
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
Pandas tail() function:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.tail(2))
Output:
d 4
e 5
dtype: int64
>>> print(s1.tail())
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
pandas count() function:
• Returns the number of non-NaN values in the series.
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> print(s1.count())
Output:
5
>>> import numpy as np
>>> s1 = pd.Series([1,2,np.nan,4,5], index=['a','b','c','d','e'])
>>> print(s1.count())
Output:
4
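count() can be compared with the total number of elements to find how many values are missing. A minimal sketch on the second series above:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, np.nan, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

non_null = int(s1.count())      # NaN is not counted
total = int(s1.size)            # size counts every element, including NaN
missing = int(s1.isna().sum())  # number of NaN values

print(non_null, total, missing)
```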
Homework
Consider the following code:
>>> import pandas as pd
>>> import numpy as np
>>> s1=pd.Series([12,np.nan,10])
>>> print(s1)
Find the output and write a Python statement to count and display only the non-null values in the above series.
Output
i)
0 12.0
1 NaN
2 10.0
dtype: float64
ii) >>> s1.count()
2
Series Object Attributes:
The properties of a series can be accessed through its associated attributes.
1) Series.index: returns the index of the series
2) Series.values: returns the data as an ndarray
3) Series.dtype: returns the dtype object of the underlying data
4) Series.shape: returns a tuple of the shape of the underlying data
5) Series.nbytes: returns the number of bytes of the underlying data
6) Series.ndim: returns the number of dimensions
7) Series.size: returns the number of elements
8) Series.hasnans: returns True if there is any NaN
9) Series.empty: returns True if the series object is empty
Naming the Series and the index column
>>> import pandas as pd
>>> s1 = pd.Series({'Jan':31,"Feb":28,"Mar":31,"Apr":30})
>>> s1.name="Days"
>>> s1.index.name="Months"
>>> print(s1)
Output:
Months
Jan 31
Feb 28
Mar 31
Apr 30
Name: Days, dtype: int64
>>> import pandas as pd
>>> s1 = pd.Series( range(1, 15, 3), index= [x for x in 'abcde'])
>>> s1.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> s1.values
array([ 1, 4, 7, 10, 13], dtype=int64)
>>> s1.dtype
dtype('int64')
>>> s1.shape
(5,)
>>> s1.nbytes
40
>>> s1.ndim
1
>>> s1.size
5
>>> s1.hasnans
False
>>> s1.empty
False
Sizes of integer dtypes (Sumita Arora, Class 11, page 297):
• int8: 1 byte
• int16: 2 bytes
• int32: 4 bytes
• int64: 8 bytes
This is why s1.nbytes above is 40: 5 elements of type int64 at 8 bytes each.
Mathematical operations with Series
e.g.1:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3])
>>> s2 = pd.Series([1,2,4])
>>> s3 = s1 + s2
>>> print(s3)
Output:
0 2
1 4
2 7
dtype: int64
e.g.2:
>>> import pandas as pd
>>> s1 = pd.Series([1,2,3])
>>> s2 = pd.Series([1,2,4])
>>> s3 = s1 * s2
>>> print(s3)
Output:
0 1
1 4
2 12
dtype: int64
Mathematical operations with Series
e.g. 3
>>> import pandas as pd
>>> import numpy as np
>>> s1 = np.arange(10, 15)
>>> s2 = pd.Series(index=s1, data=s1 * 4)
>>> print(s2)
Output:
10 40
11 44
12 48
13 52
14 56
dtype: int32
e.g. 4
>>> import pandas as pd
>>> import numpy as np
>>> s1 = np.arange(10, 15)
>>> s2 = pd.Series(index=s1, data=s1 ** 4)
>>> print(s2)
Output:
10 10000
11 14641
12 20736
13 28561
14 38416
dtype: int32
Mathematical operations with Series
e.g. 5
>>> import pandas as pd
>>> data = ['I', 'n', 'f', 'o', 'r']
>>> s1 = pd.Series(data + ['m', 'a', 't', 'i', 'c', 's'])
>>> s1
Output:
0 I
1 n
2 f
3 o
4 r
5 m
6 a
7 t
8 i
9 c
10 s
dtype: object
e.g. 6
>>> import pandas as pd
>>> s1 = ['a', 'b', 'c']
>>> s2 = pd.Series(data=s1 * 2)
>>> print(s2)
Output:
0 a
1 b
2 c
3 a
4 b
5 c
dtype: object
Exercise: concatenate your first name with your last name.
Note :
• Arithmetic operations between two series are performed on elements with matching index labels; a label that is present in only one of the series gives NaN in the result.
Homework:
• Differentiate between NumPy arrays and Series objects.
• Draw tables of subtraction showing the changes in the series elements and the corresponding output, without replacing the missing values and after replacing the missing values with 1000.
Homework
>>> import pandas as pd
>>> seriesA=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
>>> seriesB=pd.Series([10,20,-10,-50,100],index=['z','x','a','c','e'])
>>> print(seriesA-seriesB)
Output
a 11.0
b NaN
c 53.0
d NaN
e -95.0
x NaN
z NaN
dtype: float64
Subtraction after replacing the NaN (missing) values with 1000
>>> print(seriesA.sub(seriesB,fill_value=1000))
a 11.0
b -998.0
c 53.0
d -996.0
e -95.0
x 980.0
z 990.0
dtype: float64
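The same idea works for the other arithmetic methods. A minimal sketch, assuming addition with 0 as the fill value so that a label present in only one series keeps its own value:

```python
import pandas as pd

seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
seriesB = pd.Series([10, 20, -10, -50, 100], index=['z', 'x', 'a', 'c', 'e'])

# The + operator gives NaN for every label found in only one series.
plain = seriesA + seriesB

# add() with fill_value=0 treats the missing partner as 0.
filled = seriesA.add(seriesB, fill_value=0)

print(plain)
print(filled)
```

The methods sub(), mul() and div() accept fill_value in the same way.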
DataFrames:
• It can store 2 D heterogeneous data.
• It is a two-dimensional data structure , just like any table (with rows and columns).
• They are like Spreadsheets or SQL tables.
• It is the most used data structure in Pandas

Features of DataFrames:
i) Columns can be of different types.
ii) Size of DataFrame is mutable . i.e. no of rows and columns can be changed.
iii) Its data/ values are also mutable.
iv) Labelled axes( rows/ columns)
v) Arithmetic operations on rows and columns
vi) Indexes may constitute numbers, strings or characters.
Create DataFrame :
A DataFrame can be created from the following:
• Lists
• Dictionary
• Series
• NumPy ndarray
• Another DataFrame

Name     Marks  Index
Vijaya   80     B1
Rahul    92     A2
Meghna   67     C
Radhika  95     A1
i)Creation of DataFrame and display:
Creating a DataFrame begins with an empty dataframe.
Example:
>>>import pandas as pd
>>> df1 = pd.DataFrame()
>>> print (df1)
Output:
Empty DataFrame
Columns: []
Index: []
Creation of DataFrame options:
Syntax:

pandas.DataFrame(data, index, columns, dtype, copy)

data: can be a Series, list, dict, constant or another DataFrame.
index: row labels; optional. By default the index runs from 0 to n-1 if no index is passed.
columns: column labels; optional. By default the column labels run from 0 to n-1 if no columns are passed.
dtype: data type to force on the columns. If None (the default), the data type is inferred from the data.
copy: whether to copy the input data. The default is False in older Pandas versions; in recent versions it depends on the type of data passed.
iii)Creating DataFrame from Lists:
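A list is passed as argument to the DataFrame() method and gets converted into a DataFrame. Each element of the list becomes a row (a nested list gives one row per inner list), and the index and column labels are created automatically by Pandas unless they are supplied.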
Example 1:
>>> import pandas as pd
>>> data= [10,20,30,40]
>>> df1 = pd. DataFrame(data)
>>> print(df1)
Output:
0
0 10
1 20
2 30
3 40
Example 2:
>>> import pandas as pd
>>> data=[["Shreya",20],["Rakshit", 22],["Ajay", 18]]
>>> df1= pd. DataFrame(data, columns=['Name', 'Age'])
>>> print(df1)
Output:
Name Age
0 Shreya 20
1 Rakshit 22
2 Ajay 18
iv) Creation of DataFrame from list of Dictionaries

>>>import pandas as pd
>>>data=[{'a':10,'b':20},{'a':5,'b':10,'c':20}]
>>>df= pd.DataFrame(data)
>>>print(df)

Output:
a b c
0 10 20 NaN
1 5 10 20.0
v)Creating a DataFrame from dictionary of lists
Example : WAP to store the names, marks and sports of 5 students in a DataFrame using lists as values.
>>> import pandas as pd
>>> dict1 = {'students': ['Raj', 'Neha', 'Sunil', 'Jamaal', 'Ruchika'], 'marks': [98, 65.7, 45, 78, 79], 'sport': ['Tennis', 'Badminton', 'Football', 'Squash', 'Kabaddi']}
>>> df1 = pd.DataFrame(dict1)
>>> print(df1)
Output:
students marks sport
0 Raj 98.0 Tennis
1 Neha 65.7 Badminton
2 Sunil 45.0 Football
3 Jamaal 78.0 Squash
4 Ruchika 79.0 Kabaddi
The same dictionary with explicit row labels:
>>> df1 = pd.DataFrame(dict1, index=['I', 'II', 'III', 'IV', 'V'])
>>> print(df1)

Output:
students marks sport
I Raj 98.0 Tennis
II Neha 65.7 Badminton
III Sunil 45.0 Football
IV Jamaal 78.0 Squash
V Ruchika 79.0 Kabaddi
VI)Creating a dataframe from a 2d dictionary having values as dictionary objects:
Example:
Create and display a dataframe from a 2D dictionary Sales, which stores the quarter-wise sales as
inner dictionary for two years .
>>> import pandas as pd
>>> sales = {'yr1': {'Qtr1': 34500, 'Qtr2': 50000, 'Qtr3': 23000, 'Qtr4': 45000}, 'yr2': {'Qtr1': 44500, 'Qtr2': 55000, 'Qtr3': 25000, 'Qtr4': 55000}}
>>> dfsales = pd.DataFrame(sales)
>>> print(dfsales)
Output:
yr1 yr2
Qtr1 34500 44500
Qtr2 50000 55000
Qtr3 23000 25000
Qtr4 45000 55000
Homework:
WAP to create a DataFrame from a 2D dictionary as shown below, with row labels r1, r2, r3 and column labels c1, c2, c3.

    c1   c2   c3
r1  101  113  124
r2  130  140  200
r3  115  216  217
4) Create a DataFrame from Series:
Example 1:
>>> import pandas as pd
>>> student_marks=pd.Series({'Anu':75,'Aarish':98,'Banu':78,'Arpit':89,'Shaurya':97})
>>> student_age=pd.Series({'Anu':15,'Aarish':14,'Banu':18,'Arpit':19,'Shaurya':17})
>>> student_df=pd.DataFrame({'marks':student_marks,'Age':student_age})
>>> print(student_df)
Output:
marks Age
Anu 75 15
Aarish 98 14
Banu 78 18
Arpit 89 19
Shaurya 97 17
Example 2:

>>> import pandas as pd

>>> s1=pd.Series({'IP':'Sumita Arora','CS':"Preeti Arrora"})

>>> s2=pd.Series({'IP':200,'CS':500})

>>> df1=pd.DataFrame({'Author':s1,'No of Copies':s2})

>>> print(df1)
5) Creating a DataFrame object from another DataFrame Object:

>>> import pandas as pd


>>>df1=pd. DataFrame([[1,2,3],[4,5,6]])
>>> dfnew = pd. DataFrame(df1)
>>> print(dfnew)

Output:
0 1 2
0 1 2 3
1 4 5 6
Sorting Data in a dataframe:

>>> print( student_df. sort_values(by=['marks']))


marks Age
Anu 75 15
Banu 78 18
Arpit 89 19
Shaurya 97 17
Aarish 98 14
>>> print( student_df. sort_values(by=['marks'],ascending=False))
marks Age
Aarish 98 14
Shaurya 97 17
Arpit 89 19
Banu 78 18
Anu 75 15
6) Retrieving various properties of a DataFrame:
Let us consider the dataframe dfn

Marketing Sales
age 25 24
Name Neha Rohit
Gender Female Male

1) >>> print(dfn.index)
Index(['age', 'Name', 'Gender'], dtype='object')

2) >>> print(dfn.columns)
Index(['Marketing', 'Sales'], dtype='object')

3) >>> print(dfn.axes)
[Index(['age', 'Name', 'Gender'], dtype='object'), Index(['Marketing', 'Sales'], dtype='object')]
4) >>> print(dfn. dtypes)
Marketing object
Sales object
dtype: object

5) >>> print(dfn.T)   # transpose
age Name Gender
Marketing 25 Neha Female
Sales 24 Rohit Male
6) >>>print(dfn. shape)
(3, 2)
7) >>> print(dfn. shape[0]) # used to see the no of rows
3
8) >>> print(dfn. shape[1]) # used to see the no of columns
2
9) >>> print(dfn. size)
6
10) >>> print(dfn. ndim)
2
11) >>> print(dfn. empty)
False
12) >>> dfn.values   # typed without print() so that the array(...) form below is displayed
array([[25, 24],
['Neha', 'Rohit'],
['Female', 'Male']], dtype=object)
Methods in Dataframe:

1) >>> print(len(dfn))
3

2) >>> print(dfn.count())   # same as print(dfn.count(axis='index')); counts down the rows, giving one value per column
Marketing 3
Sales 3
dtype: int64

3) >>> print(dfn.count(1))   # same as print(dfn.count(axis='columns')); counts across the columns, giving one value per row
age 2
Name 2
Gender 2
dtype: int64
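The effect of the axis argument is easier to see when a value is missing. A minimal sketch, assuming a copy of dfn in which the Gender entry for Marketing is missing (this gap is introduced here only for illustration):

```python
import pandas as pd

# Hypothetical variant of dfn with one missing value (None).
dfn = pd.DataFrame({'Marketing': [25, 'Neha', None],
                    'Sales': [24, 'Rohit', 'Male']},
                   index=['age', 'Name', 'Gender'])

per_column = dfn.count()                # non-missing values in each column
per_row = dfn.count(axis='columns')     # non-missing values in each row

print(per_column)
print(per_row)
```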
4) head()
print(dfn.head(2))

Output:
Marketing Sales
age 25 24
Name Neha Rohit

5) tail()
print(dfn.tail(2))

Output:
Marketing Sales
Name Neha Rohit
Gender Female Male
7) Selecting / Accessing Data through indexing
Consider a DataFrame Df1

Population Hospitals Schools


Delhi 10927986 189 7916
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226
Chennai 4328063 157 7617
7) 1) Selecting/ Accessing a column through labelled indexing

>>> print(Df1["Population"])   # print(Df1.Population) gives the same output

Output:
Delhi 10927986
Mumbai 12691836
Kolkata 4631392
Chennai 4328063
Name: Population, dtype: int64
7) 2) Selecting/ Accessing multiple columns:

>>> print(Df1[['Schools', 'Hospitals']])

Output:
Schools Hospitals
Delhi 7916 189
Mumbai 8508 208
Kolkata 7226 149
Chennai 7617 157

>>> print(Df1[['Hospitals', 'Schools']])

Output:
Hospitals Schools
Delhi 189 7916
Mumbai 208 8508
Kolkata 149 7226
Chennai 157 7617
Note: Columns appear in the order given in the list in the square brackets
• To access selective columns using slicing:
>>> print(Df1.loc[:, "Population":"Schools"])

Output:
Population Hospitals Schools
Delhi 10927986 189 7916
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226
Chennai 4328063 157 7617

>>>print(Df1.loc[ : ,"Population" : "Hospitals"])


Output:
Population Hospitals
Delhi 10927986 189
Mumbai 12691836 208
Kolkata 4631392 149
Chennai 4328063 157
• To access range of columns from a range of rows:

>>>print(Df1.loc[ "Delhi" : "Mumbai" , "Population" : "Hospitals"])

Output:

Population Hospitals
Delhi 10927986 189
Mumbai 12691836 208
7) 3) Selecting/ Accessing a subset from a dataframe using row/column names:

• To access a row, give the row label / name. Don't forget to give a colon after the comma.

>>> print(Df1.loc["Delhi", :])


Output:
Population 10927986
Hospitals 189
Schools 7916
Name: Delhi, dtype: int64
• To access multiple rows
>>> print(Df1.loc["Mumbai":"Kolkata", :])

Output:
Population Hospitals Schools
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226

>>> print(Df1.loc["Mumbai":"Chennai", :])

Output:
Population Hospitals Schools
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226
Chennai 4328063 157 7617
7) 4) Selecting Rows / Columns of a DataFrame:

• iloc is used instead of loc
• iloc means integer location
• <start index> : <end index> given for rows and columns work like slices. The end index is excluded.

Example 1:
>>> print(Df1.iloc[0:2, 1:3])
Output:
Hospitals Schools
Delhi 189 7916
Mumbai 208 8508

Example 2:
>>> print(Df1.iloc[0:2, 1:2])
Output:
Hospitals
Delhi 189
Mumbai 208
7) 5) Selecting/ Accessing Individual Value:
• Give the column name, followed by the name of the row or its numeric index in square brackets

Example:
>>> print(Df1.Population['Delhi'])

Output:

10927986

• We can use the at[<row label>, <column label>] or iat[<row index number>, <column index number>] attribute with a DataFrame object.
Example:
>>> print(Df1.at['Chennai', 'Schools'])

Output:
7617

>>>print(Df1. iat [ 3 ,2 ])

Output:
7617
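The four ways of reaching a single cell can be compared directly. A minimal sketch that rebuilds Df1 from the table shown earlier:

```python
import pandas as pd

Df1 = pd.DataFrame({'Population': [10927986, 12691836, 4631392, 4328063],
                    'Hospitals': [189, 208, 149, 157],
                    'Schools': [7916, 8508, 7226, 7617]},
                   index=['Delhi', 'Mumbai', 'Kolkata', 'Chennai'])

by_at = int(Df1.at['Chennai', 'Schools'])     # labels, single value only
by_iat = int(Df1.iat[3, 2])                   # positions, single value only
by_loc = int(Df1.loc['Chennai', 'Schools'])   # labels, also accepts lists and slices
by_iloc = int(Df1.iloc[3, 2])                 # positions, also accepts lists and slices

print(by_at, by_iat, by_loc, by_iloc)
```

All four expressions refer to the same cell; at and iat are restricted to one value at a time.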
8) Adding/ Modifying Row’s / Column’s Values in DataFrames:
1) Adding / Modifying a Column:
* will modify it, if the column already exists.
* will add a new column, if it does not exist already.
• To change or add a column
>>> Df1["Density"] = 1219
>>> print( Df1 )
Output:
Population Hospitals Schools Density
Delhi 10927986 189 7916 1219
Mumbai 12691836 208 8508 1219
Kolkata 4631392 149 7226 1219
Chennai 4328063 157 7617 1219
Here all the rows in the new column get the same value.
We can assign a separate value to each row of the column by giving the values in the form of a list.
>>> Df1["Density"] = [1500,1219,1630,1050]
>>> print ( Df1 )
Output:
Population Hospitals Schools Density
Delhi 10927986 189 7916 1500
Mumbai 12691836 208 8508 1219
Kolkata 4631392 149 7226 1630
Chennai 4328063 157 7617 1050
8) 2) Adding / Modifying a row:
* will modify it, if the row already exists.
* will add a new row, if it does not exist already.
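• To change or add a row Bangalore to the DataFrame (written as df1 from here on; Python names are case-sensitive, so use one spelling of the name throughout)
>>> df1.loc['Bangalore'] = [135614, 267, 6889, 1500]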
>>> print(df1)
Output:
Population Hospitals Schools Density
Delhi 10927986 189 7916 1500
Mumbai 12691836 208 8508 1219
Kolkata 4631392 149 7226 1630
Chennai 4328063 157 7617 1050
Bangalore 135614 267 6889 1500
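To change or add a row Bangalore to the dataframe using the at method (at is meant for single values; if this form raises an error in your Pandas version, use loc as shown above)
>>> df1.at['Bangalore'] = [135614, 267, 6889, 1500]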
>>> print(df1)
Output:
Population Hospitals Schools Density
Delhi 10927986 189 7916 1500
Mumbai 12691836 208 8508 1219
Kolkata 4631392 149 7226 1630
Chennai 4328063 157 7617 1050
Bangalore 135614 267 6889 1500
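The add-or-modify behaviour of loc can be checked by counting the rows before and after. A minimal sketch that rebuilds the DataFrame with the Density column; the second Schools value, 6900, is an assumption used to show a modification.

```python
import pandas as pd

Df1 = pd.DataFrame({'Population': [10927986, 12691836, 4631392, 4328063],
                    'Hospitals': [189, 208, 149, 157],
                    'Schools': [7916, 8508, 7226, 7617],
                    'Density': [1500, 1219, 1630, 1050]},
                   index=['Delhi', 'Mumbai', 'Kolkata', 'Chennai'])

rows_before = len(Df1)

# 'Bangalore' is a new label, so a row is added.
Df1.loc['Bangalore'] = [135614, 267, 6889, 1500]
rows_after_add = len(Df1)

# 'Bangalore' now exists, so the same statement overwrites the row.
Df1.loc['Bangalore'] = [135614, 267, 6900, 1500]
rows_after_modify = len(Df1)

print(rows_before, rows_after_add, rows_after_modify)
print(Df1)
```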
8) 3) Modifying a single cell

To change or modify a single cell using ‘at’ method

Example:
>>> df1.at['Bangalore','Schools']=5678
>>> print(df1)

Output:
Population Hospitals Schools
Delhi 10927986 189 7916
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226
Chennai 4328063 157 7617
Bangalore 135614 267 5678
8) 4) Modifying a single cell

To change or modify a single cell using ‘loc’ method

>>> df1.loc['Bangalore','Schools']=5679
>>> print(df1)
Output:
Population Hospitals Schools
Delhi 10927986 189 7916
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226
Chennai 4328063 157 7617
Bangalore 135614 267 5679
Deleting Rows / Columns in a DataFrame:

del <DF object>[Column name]

Example:

>>> del df1['Schools']


>>> print(df1)

Output:
Population Hospitals
Delhi 10927986.0 189.0
Mumbai 12691836.0 208.0
Kolkata 4631392.0 149.0
Chennai 4328063.0 157.0
Bangalore 5678097.0 171.0
9) Deleting Rows/Columns

1) Deleting Rows / Columns in a DataFrame:

>>> df1.drop('Delhi', axis=0,inplace=True)


>>> print(df1)
Output:
Population Hospitals Schools
Mumbai 12691836 208 8508
Kolkata 4631392 149 7226
Chennai 4328063 157 7617
Bangalore 135614 267 5679
To delete a column (drop() returns a new DataFrame and leaves df1 unchanged unless inplace=True is given):
>>> df1.drop('Schools',axis=1)
Output:
Population Hospitals
Delhi 10927986 189
Mumbai 12691836 208
Kolkata 4631392 149
Chennai 4328063 157
Bangalore 135614 267
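drop() also accepts the keyword arguments index and columns, which avoids having to remember the axis numbers. A minimal sketch that rebuilds the DataFrame from the earlier table; the name trimmed is introduced only for illustration.

```python
import pandas as pd

Df1 = pd.DataFrame({'Population': [10927986, 12691836, 4631392, 4328063],
                    'Hospitals': [189, 208, 149, 157],
                    'Schools': [7916, 8508, 7226, 7617]},
                   index=['Delhi', 'Mumbai', 'Kolkata', 'Chennai'])

# Remove one row and one column in a single call; Df1 itself is not changed.
trimmed = Df1.drop(index=['Delhi'], columns=['Schools'])

print(trimmed)
print(Df1.shape)
```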
9) 2) Renaming Rows/ Columns:
To change the name of a row or column individually, the rename() function can be used.
<DF object>.rename(index={<names dictionary>}, columns={<names dictionary>}, inplace=False)
where
index argument: renames the rows
columns argument: renames the columns
For both the index and columns arguments, specify the name-change dictionary as
{old name : new name}
inplace: give True if you want to rename within the same DataFrame; if we skip this argument, a new DataFrame is created and the original remains the same.
Example consider a DataFrame topdf:
Roll No Name Marks
Sec A 115 Pavni 97.5
Sec B 236 Rishi 98.0
Sec C 307 Preet 98.5
Sec D 423 Paula 98.0
To change the row labels to 'A', 'B', 'C', 'D':
>>> topdf.rename(index={'Sec A': 'A', 'Sec B': 'B', 'Sec C': 'C', 'Sec D': 'D'})
Roll No Name Marks
A 115 Pavni 97.5
B 236 Rishi 98.0
C 307 Preet 98.5
D 423 Paula 98.0
This statement displays the changed index labels, but topdf itself is not modified. Displaying topdf afterwards still shows the original DataFrame:

Roll No Name Marks
Sec A 115 Pavni 97.5
Sec B 236 Rishi 98.0
Sec C 307 Preet 98.5
Sec D 423 Paula 98.0

To change the index labels of the original DataFrame, include inplace=True:

>>> topdf.rename(index={'Sec A': 'A', 'Sec B': 'B', 'Sec C': 'C', 'Sec D': 'D'}, inplace=True)
>>> print(topdf)
Output:
Roll No Name Marks
A 115 Pavni 97.5
B 236 Rishi 98.0
C 307 Preet 98.5
D 423 Paula 98.0
To change the column label 'Roll No' to 'Rno':

>>> topdf.rename(columns={'Roll No': 'Rno'}, inplace=True)
>>> print(topdf)

Output:
Rno Name Marks
A 115 Pavni 97.5
B 236 Rishi 98.0
C 307 Preet 98.5
D 423 Paula 98.0
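
The following sketch renames rows and a column in a single rename() call and confirms that the original DataFrame is untouched when inplace is left at its default. Renaming only two of the four rows is a choice made here to show that labels missing from the dictionary are simply kept.

```python
import pandas as pd

topdf = pd.DataFrame(
    {
        "Roll No": [115, 236, 307, 423],
        "Name": ["Pavni", "Rishi", "Preet", "Paula"],
        "Marks": [97.5, 98.0, 98.5, 98.0],
    },
    index=["Sec A", "Sec B", "Sec C", "Sec D"],
)

# index= and columns= can be combined; labels not listed stay as they are.
renamed = topdf.rename(
    index={"Sec A": "A", "Sec B": "B"},
    columns={"Roll No": "Rno"},
)

print(list(renamed.index))
print(list(renamed.columns))

# inplace was not set, so topdf keeps its original labels.
print(list(topdf.index))
print(list(topdf.columns))
```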
10) More on DataFrame Indexing – Boolean Indexing
Boolean indexing means using Boolean values, either (True or False) or (1 or 0), as the row index labels of a DataFrame. For example:
Days No of Classes
True Monday 6
False Tuesday 0
True Wednesday 3
False Thursday 0
True Friday 8
Using Boolean indexing, we can divide the DataFrame into two groups: the True rows and the False rows.
10) 1) Creating a DataFrame with Boolean Indexes:

Example 1:

import pandas as pd
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
classes = [6, 0, 3, 0, 8]
dc = {'Days': days, 'No of Classes': classes}
classdf = pd.DataFrame(dc, index=[True, False, True, False, True])
print(classdf)

Output:

Days No of Classes
True Monday 6
False Tuesday 0
True Wednesday 3
False Thursday 0
True Friday 8
Example2:

import pandas as pd
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
classes = [6, 0, 3, 0, 8]
dc = {'Days': days, 'No of Classes': classes}
classdf = pd.DataFrame(dc, index=[1, 0, 1, 0, 1])
print(classdf)

Output:

Days No of Classes
1 Monday 6
0 Tuesday 0
1 Wednesday 3
0 Thursday 0
1 Friday 8
10) 2) Accessing rows from DataFrames using Boolean Indexes:

Boolean indexing is useful for filtering the True-indexed or False-indexed rows using the loc attribute.

<DF object>.loc[True] → displays all records with index True.
<DF object>.loc[False] → displays all records with index False.
<DF object>.loc[1] → displays all records with index 1.
<DF object>.loc[0] → displays all records with index 0.
For example:
>>> print(classdf.loc[True])
Output:
Days No of Classes
True Monday 6
True Wednesday 3
True Friday 8
>>> print(classdf.loc[False])
Output:
Days No of Classes
False Tuesday 0
False Thursday 0
With the 0/1 index of Example 2:
>>> classdf.loc[0]

Output:
Days No of Classes
0 Tuesday 0
0 Thursday 0
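
A small runnable sketch of the same idea follows. It reuses the data of Example 1, but builds the True/False index from the class counts instead of typing it by hand; that derivation is an assumption added here for illustration and is not part of the original example.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
classes = [6, 0, 3, 0, 8]

# True where the day has at least one class (same labels as Example 1).
has_class = [n > 0 for n in classes]

classdf = pd.DataFrame({"Days": days, "No of Classes": classes}, index=has_class)

# loc[True] and loc[False] split the rows into the two groups.
true_rows = classdf.loc[True]
false_rows = classdf.loc[False]

print(list(true_rows["Days"]))
print(list(false_rows["Days"]))
```

Because several rows share the same label, each loc call returns a DataFrame containing every matching row rather than a single row.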
To Set and Reset the Index:

WAP to create a dataframe using the names of 5 students and their marks in 5 subjects.

>>> import pandas as pd
>>> student = {'Stud_Name': ['Ajay', 'Sanjay', 'Sunil', 'Amrita', 'Tom'],
...            'English': [56, 78, 89, 90, 100], 'IP': [78, 89, 90, 67, 90],
...            'Maths': [89, 90, 87, 86, 90], 'Accounts': [78, 89, 95, 78, 89],
...            'Phy': [78, 89, 90, 87, 89]}
>>> df1 = pd.DataFrame(student)
>>> print(df1)
Output:
Stud_Name English IP Maths Accounts Phy
0 Ajay 56 78 89 78 78
1 Sanjay 78 89 90 89 89
2 Sunil 89 90 87 95 90
3 Amrita 90 67 86 78 87
4 Tom 100 90 90 89 89
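>>> df1.set_index('Stud_Name', inplace=True)   # make Stud_Name the row index
>>> print(df1)

Output:
Stud_Name English IP Maths Accounts Phy
Ajay 56 78 89 78 78
Sanjay 78 89 90 89 89
Sunil 89 90 87 95 90
Amrita 90 67 86 78 87
Tom 100 90 90 89 89
Any other column can be made the index in the same way, e.g. df1.set_index('Accounts', inplace=True). If that call is run after the one above, Accounts replaces Stud_Name as the index and the Stud_Name labels are discarded.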
>>> df1.reset_index(inplace=True)   # move the index back into a column and restore 0, 1, 2, ...
>>> print(df1)

Output:
Stud_Name English IP Maths Accounts Phy
0 Ajay 56 78 89 78 78
1 Sanjay 78 89 90 89 89
2 Sunil 89 90 87 95 90
3 Amrita 90 67 86 78 87
4 Tom 100 90 90 89 89
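
The round trip can be checked with the sketch below, which uses a shortened, assumed version of the student data (three students, two subjects) so that the effect on the columns and on the index is easy to read.

```python
import pandas as pd

# Shortened version of the student data (assumed for illustration).
df1 = pd.DataFrame(
    {
        "Stud_Name": ["Ajay", "Sanjay", "Sunil"],
        "English": [56, 78, 89],
        "IP": [78, 89, 90],
    }
)

# After set_index, Stud_Name is the row index and no longer a column.
df1.set_index("Stud_Name", inplace=True)
cols_when_indexed = list(df1.columns)
ip_sanjay = df1.loc["Sanjay", "IP"]   # rows can now be selected by name
print(cols_when_indexed, ip_sanjay)

# reset_index moves Stud_Name back to a column and restores 0, 1, 2.
df1.reset_index(inplace=True)
print(list(df1.columns))
print(list(df1.index))
```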
Iterating over the DataFrame
• There are 2 methods: <dfobject>.iterrows() and <dfobject>.iteritems().
• iterrows() → views the dataframe in the form of horizontal subsets (rows).
• iteritems() → views the dataframe in the form of vertical subsets (columns).
• The iterrows() method iterates over the dataframe row-wise, where each horizontal subset is in the form (row-index, series) and series contains the column values for that row-index.
• The iteritems() method iterates over the dataframe column-wise, where each vertical subset is in the form (column-index, series) and series contains all row values for that column-index.
• Note: iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0. The items() method does the same job and works in both older and newer versions.
WAP to create a dataframe and iterate over its rows.
>>> import pandas as pd
>>> data = [["Virat", 55, 66, 31], ["Rohit", 88, 66, 43], ["Hardik", 99, 101, 68]]
>>> players = pd.DataFrame(data, columns=["Name", "Match-1", "Match-2", "Match-3"])
>>> print(players)
>>> print("Iterating by rows:")
>>> for (index, row) in players.iterrows():
...     print(index, row.values)
>>> print("Iterating by columns:")
>>> # this loop is still row-wise; each value is read by its column label
>>> for (index, row) in players.iterrows():
...     print(index, row["Name"], row["Match-1"], row["Match-2"], row["Match-3"])
Output:
Name Match-1 Match-2 Match-3
0 Virat 55 66 31
1 Rohit 88 66 43
2 Hardik 99 101 68
Iterating by rows:
0 ['Virat' 55 66 31]
1 ['Rohit' 88 66 43]
2 ['Hardik' 99 101 68]
Iterating by columns:
0 Virat 55 66 31
1 Rohit 88 66 43
2 Hardik 99 101 68
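
Since each row arrives as a Series labelled by column name, iterrows() can be used to compute something per row. The sketch below totals the three match scores for every player; the totals dictionary is a name introduced here only for illustration.

```python
import pandas as pd

data = [["Virat", 55, 66, 31], ["Rohit", 88, 66, 43], ["Hardik", 99, 101, 68]]
players = pd.DataFrame(data, columns=["Name", "Match-1", "Match-2", "Match-3"])

totals = {}
for index, row in players.iterrows():
    # row is a Series; its labels are the column names of the DataFrame.
    totals[row["Name"]] = int(row["Match-1"] + row["Match-2"] + row["Match-3"])

print(totals)
```

For large DataFrames a column-wise calculation is usually faster than looping with iterrows(), but the loop shows clearly what each (index, row) pair contains.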
WAP to create a dataframe and print each column along with its index using iteritems() (written as items() in current pandas).
>>> import pandas as pd
>>> sc_2yrs = {2016: {'Virat Kohli': 2595, 'Rohit Sharma': 2406, 'Shikhar Dhawan': 2378},
...            2017: {'Virat Kohli': 2818, 'Rohit Sharma': 2613, 'Shikhar Dhawan': 2295}}
>>> df = pd.DataFrame(sc_2yrs)
>>> print(df)
>>> print("-------------------------------------------")
>>> # items() behaves like iteritems(), which was removed in pandas 2.0
>>> for (year, runs) in df.items():
...     print("Year:", year)
...     print(runs)
Output:
2016 2017
Virat Kohli 2595 2818
Rohit Sharma 2406 2613
Shikhar Dhawan 2378 2295
-------------------------------------------
Year: 2016
Virat Kohli 2595
Rohit Sharma 2406
Shikhar Dhawan 2378
Name: 2016, dtype: int64

Year: 2017
Virat Kohli 2818
Rohit Sharma 2613
Shikhar Dhawan 2295
Name: 2017, dtype: int64
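
Because each column arrives as a Series indexed by the row labels, a column-wise loop can summarise one column at a time. The sketch below adds up the runs for each year using items(), which is available in both older and newer pandas versions; the yearly_total dictionary is a name introduced here for illustration.

```python
import pandas as pd

sc_2yrs = {
    2016: {"Virat Kohli": 2595, "Rohit Sharma": 2406, "Shikhar Dhawan": 2378},
    2017: {"Virat Kohli": 2818, "Rohit Sharma": 2613, "Shikhar Dhawan": 2295},
}
df = pd.DataFrame(sc_2yrs)

yearly_total = {}
for year, runs in df.items():
    # runs is a Series: player names are the index, runs are the values.
    yearly_total[year] = int(runs.sum())

print(yearly_total)
```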
