pandas notes
Library:
A library is a collection of modules (files containing Python code) that provide pre-written functions
and classes, so that common tasks can be performed without writing the code from scratch.
NumPy, Pandas and Matplotlib are three well-established Python libraries for scientific and analytical
use.
NumPy:
NumPy, which stands for ‘Numerical Python’, is a package used for numerical data
analysis and scientific computing. It provides a multidimensional array object along with functions and
tools for working with these arrays.
Pandas:
PANDAS (PANel DAta System) is a high-level data manipulation tool used for analysing data. It gives us
a single, convenient place to do most of our data analysis and visualisation work. Pandas originally had
three important data structures, namely Series, DataFrame and Panel, to make the process of analysing
data organised, effective and efficient. Panel has been removed from recent versions of pandas, so
these notes cover Series and DataFrame.
Matplotlib:
The Matplotlib library in Python is used for plotting graphs and visualisation. Using Matplotlib, with
just a few lines of code we can generate publication quality plots, histograms, bar charts,
scatterplots, etc.
Differences between Pandas and NumPy:
1. A NumPy array requires homogeneous data, while a Pandas DataFrame can have different data
types (float, int, string, datetime, etc.).
2. Pandas has a simpler interface for operations like file loading, plotting, selection, joining and GROUP
BY, which come in very handy in data-processing applications.
3. Pandas DataFrames (with column names) make it very easy to keep track of data.
4. Pandas is used when data is in tabular format, whereas NumPy is used for numeric, array-based
data manipulation.
5. Pandas can easily select subsets of data from bulky data sets and even combine multiple datasets
together.
6. Pandas has functionality to find and fill missing values.
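Points 1 and 6 can be illustrated with a short sketch. The sample names and values below are made up for illustration and are not part of the original notes.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type: mixing int and float
# upcasts every element to float
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame keeps a separate data type for each column
# (hypothetical sample data)
df = pd.DataFrame({
    'name': ['Asha', 'Ravi'],
    'age': [17, 18],
    'height': [1.62, 1.75],
})
print(df.dtypes)

# Missing values can be found with isna() and filled with fillna()
s = pd.Series([1.0, None, 3.0])
missing_count = s.isna().sum()
filled = s.fillna(0)
print(filled)
```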
A data structure is a collection of data values and the operations that can be applied to that data. It
enables efficient storage, retrieval and modification of the data. These notes cover two Pandas data
structures:
• Series
• DataFrame
Series:
A Series is a one-dimensional array containing a sequence of values of any data type (int, float, list,
string, etc.), which by default has numeric data labels starting from zero.
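As a minimal sketch of the definition above, the following creates one Series with the default labels and one with user-defined labels. The name series1 is an assumption; seriesCapCntry reuses the sample data that appears later in these notes.

```python
import pandas as pd

# Default labels: 0, 1, 2
series1 = pd.Series([10, 20, 30])
print(series1)

# User-defined labels
seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                           index=['India', 'USA', 'UK', 'France'])

# A single label returns one value
capital_uk = seriesCapCntry['UK']

# A list of labels returns a Series, in the order requested
two_capitals = seriesCapCntry[['UK', 'USA']]
print(two_capitals)
```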
Creation of Series:
Output:
France Paris
UK London
dtype: object
(4) >>> seriesCapCntry[['UK','USA']]
Output:
UK London
USA WashingtonDC
dtype: object
(B) Slicing:
We may need to extract a part of a series. This can be done through slicing.
We can define which part of the series is to be sliced by specifying the start and end
parameters [start:end] with the series name. When we use positional indices for slicing,
the value at the end index position is excluded.
(1) >>> import pandas as pd
>>> seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
index=['India', 'USA', 'UK', 'France'])
>>> seriesCapCntry[1:3]
Output:
USA WashingtonDC
UK London
dtype: object
(2) If labelled indexes are used for slicing, then the value at the end index label is also
included in the output, for example:
>>> import pandas as pd
>>> seriesCapCntry['USA' : 'France']
Output:
USA WashingtonDC
UK London
France Paris
dtype: object
(3) We can also get the series in reverse order, for example:
>>> import pandas as pd
>>> seriesCapCntry[::-1]
Output:
France Paris
UK London
USA WashingtonDC
India NewDelhi
dtype: object
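The three slicing cases above can also be written with the explicit .iloc (position) and .loc (label) accessors, which make it unambiguous which kind of slice is meant. This is a sketch only; the variable names by_position, by_label and reversed_series are assumptions.

```python
import pandas as pd

seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                           index=['India', 'USA', 'UK', 'France'])

# iloc slices by position: the end position (3) is excluded
by_position = seriesCapCntry.iloc[1:3]

# loc slices by label: the end label ('France') is included
by_label = seriesCapCntry.loc['USA':'France']

# A step of -1 returns the series in reverse order
reversed_series = seriesCapCntry.iloc[::-1]

print(by_position)
print(by_label)
print(reversed_series)
```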
Attributes of Series:
name: assigns a name to the Series.
>>> seriesCapCntry.name = 'Capitals'
>>> print(seriesCapCntry)
India NewDelhi
USA WashingtonDC
UK London
France Paris
Name: Capitals, dtype: object
index.name: assigns a name to the index of the Series.
>>> seriesCapCntry.index.name = 'Countries'
>>> print(seriesCapCntry)
Countries
India NewDelhi
USA WashingtonDC
UK London
France Paris
Name: Capitals, dtype: object
values: returns the values in the Series as an array.
>>> print(seriesCapCntry.values)
['NewDelhi' 'WashingtonDC' 'London' 'Paris']
size: returns the number of values in the Series object.
>>> print(seriesCapCntry.size)
4
empty: is True if the Series is empty, and False otherwise.
>>> seriesCapCntry.empty
False
# Create an empty series
>>> seriesEmpt = pd.Series()
>>> seriesEmpt.empty
True
Methods of Series:
>>> print(seriesTenTwenty)
Output:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int32
head(n) operation:
Returns the first n members of the series. If the value for n is not passed, then by default n takes
5 and the first five members are displayed.
>>> seriesTenTwenty.head(2)
Output:
0 10
1 11
dtype: int32
>>> seriesTenTwenty.head()
Output:
0 10
1 11
2 12
3 13
4 14
dtype: int32
tail(n) operation:
Returns the last n members of the series. If the value for n is not passed, then by default n takes
5 and the last five members are displayed.
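A short sketch of head() and tail() together. The notes do not show how seriesTenTwenty was created, so the construction below (ten consecutive integers from 10 to 19) is an assumption chosen to match the printed values.

```python
import numpy as np
import pandas as pd

# Assumed construction: the integers 10 to 19
seriesTenTwenty = pd.Series(np.arange(10, 20))

first_two = seriesTenTwenty.head(2)   # first two members
last_two = seriesTenTwenty.tail(2)    # last two members
last_five = seriesTenTwenty.tail()    # n defaults to 5

print(first_two)
print(last_two)
print(last_five)
```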
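Mathematical operations on Series:
When two Series are combined with an arithmetic operator such as +, values are matched by index
label. A label that is present in only one of the Series gives NaN in the output:
a -9.0
b NaN
c -47.0
d NaN
e 105.0
y NaN
z NaN
dtype: float64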
The second method is applied when we do not want to have NaN values in the output. We can
use the series method add() and a parameter fill_value to replace missing value with a
specified value.
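Addition using add() with fill_value=0:
a -9.0
b 2.0
c -47.0
d 4.0
e 105.0
y 20.0
z 10.0
dtype: float64
Subtraction using the - operator:
a 11.0
b NaN
c 53.0
d NaN
e -95.0
y NaN
z NaN
dtype: float64
Subtraction using sub() with fill_value=1000:
a 11.0
b -998.0
c 53.0
d -996.0
e -95.0
y 980.0
z 990.0
dtype: float64
Multiplication using the * operator:
a -10.0
b NaN
c -150.0
d NaN
e 500.0
y NaN
z NaN
dtype: float64
Multiplication using mul() with fill_value=0:
a -10.0
b 0.0
c -150.0
d 0.0
e 500.0
y 0.0
z 0.0
dtype: float64
Division using the / operator:
a -0.10
b NaN
c -0.06
d NaN
e 0.05
y NaN
z NaN
dtype: float64
Division using div() with fill_value=0:
a -0.10
b inf
c -0.06
d inf
e 0.05
y 0.00
z 0.00
dtype: float64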
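The outputs above are consistent with the two Series below, whose values match seriesA and seriesC as defined later in these notes. That these were the original operands is an assumption, as are the fill values used for subtraction, multiplication and division; the sketch only shows the operator form next to the method form with fill_value.

```python
import pandas as pd

# Assumed operands, chosen so that the results match the outputs listed above
seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
seriesC = pd.Series([10, 20, -10, -50, 100], index=['z', 'y', 'a', 'c', 'e'])

# Operator form: a label found in only one series gives NaN
total = seriesA + seriesC

# Method form: fill_value stands in for the missing operand
total_filled = seriesA.add(seriesC, fill_value=0)
diff_filled = seriesA.sub(seriesC, fill_value=1000)
prod_filled = seriesA.mul(seriesC, fill_value=0)
quot_filled = seriesA.div(seriesC, fill_value=0)

print(total)
print(total_filled)
```

Note that division with fill_value=0 divides by zero for labels missing from the second series, which is why inf appears in the last output.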
DataFrame:
A DataFrame is another Pandas data structure, which stores data in a two-dimensional way. It is
a two-dimensional (tabular, spreadsheet-like) labelled array: an ordered collection of columns,
where different columns may store different types of data, e.g. numeric, string or floating
point.
Creation of DataFrame:
(A) Creation of an empty DataFrame
>>> import pandas as pd
>>> dFrameEmt = pd.DataFrame()
>>> dFrameEmt
Output:
Empty DataFrame
Columns: []
Index: []
(B) Creation of DataFrame from NumPy ndarrays
Consider the following three NumPy ndarrays. Let us create a DataFrame from more than one
ndarray, passing the column labels explicitly. Each array becomes a row, and shorter arrays are
padded with NaN:
>>> import numpy as np
>>> import pandas as pd
>>> array1 = np.array([10,20,30])
>>> array2 = np.array([100,200,300])
>>> array3 = np.array([-10,-20,-30, -40])
>>> dFrame5 = pd.DataFrame([array1, array3, array2], columns=['A', 'B', 'C', 'D'])
>>> dFrame5
Output:
A B C D
0 10 20 30 NaN
1 -10 -20 -30 -40.0
2 100 200 300 NaN
(C) Creation of DataFrame from List of Dictionaries
>>> import pandas as pd
>>> listDict = [{'a':10, 'b':20}, {'a':5, 'b':10, 'c':20}]
>>> dFrameListDict = pd.DataFrame(listDict)
>>> dFrameListDict
Output:
a b c
0 10 20 NaN
1 5 10 20.0
Here, the dictionary keys are taken as column labels, and each dictionary supplies the values
for one row. A key that is missing from a dictionary gives NaN in that row.
(D) Creation of DataFrame from Dictionary of Lists:
>>> import pandas as pd
>>> dictForest = {'State': ['Assam', 'Delhi', 'Kerala'], 'GArea': [78438, 1483, 38852], 'VDF':
[2797, 6.72, 1663]}
>>> dFrameForest= pd.DataFrame(dictForest)
>>> dFrameForest
Output:
State GArea VDF
0 Assam 78438 2797.00
1 Delhi 1483 6.72
2 Kerala 38852 1663.00
(E) Creation of DataFrame from Series :
>>> import pandas as pd
>>> seriesA = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])
>>> seriesB = pd.Series([1000,2000,-1000,-5000,1000], index = ['a', 'b', 'c', 'd', 'e'])
>>> seriesC = pd.Series([10,20,-10,-50,100], index = ['z', 'y', 'a', 'c', 'e'])
>>> dFrame7 = pd.DataFrame([seriesA, seriesB])
>>> dFrame7
Output:
a b c d e
0 1 2 3 4 5
1 1000 2000 -1000 -5000 1000
(F) Creation of DataFrame from Dictionary of Series:
>>> import pandas as pd
>>> ResultSheet={'Arnab': pd.Series([90, 91, 97], index=['Maths','Science','Hindi']),
'Ramit': pd.Series([92, 81, 96], index=['Maths','Science','Hindi']), 'Samridhi': pd.Series([89,
91, 88], index=['Maths','Science','Hindi']), 'Riya': pd.Series([81, 71, 67],
index=['Maths','Science','Hindi']), 'Mallika': pd.Series([94, 95, 99],
index=['Maths','Science','Hindi'])}
>>> ResultDF = pd.DataFrame(ResultSheet)
>>> ResultDF
Output:
Arnab Ramit Samridhi Riya Mallika
Maths 90 92 89 81 94
Science 91 81 91 71 95
Hindi 97 96 88 67 99
Operations on rows and columns in DataFrames:
(A) Adding a New Column to a DataFrame:
>>> import pandas as pd
>>> ResultSheet={'Arnab': pd.Series([90, 91, 97], index=['Maths','Science','Hindi']),
'Ramit': pd.Series([92, 81, 96], index=['Maths','Science','Hindi']), 'Samridhi':
pd.Series([89, 91, 88], index=['Maths','Science','Hindi']), 'Riya': pd.Series([81, 71, 67],
index=['Maths','Science','Hindi']), 'Mallika': pd.Series([94, 95, 99],
index=['Maths','Science','Hindi'])}
>>> ResultDF = pd.DataFrame(ResultSheet)
>>> ResultDF['Preeti']=[89,78,86]
>>> ResultDF
Output:
Arnab Ramit Samridhi Riya Mallika Preeti
Maths 90 92 89 81 94 89
Science 91 81 91 71 95 78
Hindi 97 96 88 67 99 86
Note: Assigning values to a new column label that does not exist will create a new column
at the end. If the column already exists in the DataFrame, then the assignment statement
will update the values of the already existing column.
>>> ResultDF['Ramit']=[99, 98, 78]
>>> ResultDF
Output:
Arnab Ramit Samridhi Riya Mallika Preeti
Maths 90 99 89 81 94 89
Science 91 98 91 71 95 78
Hindi 97 78 88 67 99 86
Note: We can also change data of an entire column to a particular value in a DataFrame.
>>> ResultDF['Arnab']=90
>>> ResultDF
Output:
Arnab Ramit Samridhi Riya Mallika Preeti
Maths 90 99 89 81 94 89
Science 90 98 91 71 95 78
Hindi 90 78 88 67 99 86
(B) Adding a New Row to a DataFrame :
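One common way to add a row is to assign to a new row label with loc. The sketch below uses a shortened version of ResultDF, and the 'English' row and its marks are hypothetical values that are not part of the original notes.

```python
import pandas as pd

# Shortened sample of ResultDF with two students only
ResultDF = pd.DataFrame({'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96]},
                        index=['Maths', 'Science', 'Hindi'])

# Assigning to a row label that does not exist adds a new row at the end
ResultDF.loc['English'] = [85, 86]   # hypothetical marks

# Assigning to an existing row label overwrites that row
ResultDF.loc['Hindi'] = [95, 94]

print(ResultDF)
```

As with columns, the number of values assigned must match the number of columns in the DataFrame.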
(1)>>> ResultDF.loc['Science']
Output:
Arnab 91
Ramit 81
Samridhi 91
Riya 71
Mallika 95
Name: Science, dtype: int64
(3) When a single column label is passed, it returns the column as a Series.
>>> ResultDF.loc[:,'Arnab']
Output:
Maths 90
Science 91
Hindi 97
Name: Arnab, dtype: int64
Also, we can obtain the same result, that is, the marks of ‘Arnab’ in all the subjects,
by using the command:
>>> print(ResultDF['Arnab'])
(4) To read more than one row from a DataFrame, a list of row labels is used as
shown below. Note that using [[ ]] returns a DataFrame.
>>> ResultDF.loc[['Science', 'Hindi']]
Output:
Arnab Ramit Samridhi Riya Mallika
Science 91 81 91 71 95
Hindi 97 96 88 67 99
B) Boolean Indexing:
Boolean means a binary variable that can represent either of the two states - True
(indicated by 1) or False (indicated by 0).
>>> ResultDF
Output:
Arnab Ramit Samridhi Riya Mallika
Maths 90 92 89 81 94
Science 91 81 91 71 95
Hindi 97 96 88 67 99
>>> ResultDF.loc['Maths'] > 90
Output:
Arnab False
Ramit True
Samridhi False
Riya False
Mallika True
Name: Maths, dtype: bool
To check in which subjects ‘Arnab’ has scored more than 90, we can write:
>>> ResultDF.loc[:,'Arnab'] > 90
Output:
Maths False
Science True
Hindi True
Name: Arnab, dtype: bool
Accessing DataFrames Element through Slicing:
(1)>>> ResultDF.loc['Maths': 'Science']
Output:
Arnab Ramit Samridhi Riya Mallika
Maths 90 92 89 81 94
Science 91 81 91 71 95
Note that when a DataFrame is sliced by label with loc, the end label is included.
(2)>>> ResultDF.loc['Maths': 'Science', 'Arnab']
Output:
Maths 90
Science 91
Name: Arnab, dtype: int64
(3)>>> ResultDF.loc['Maths': 'Science', 'Arnab':'Samridhi']
Output:
Arnab Ramit Samridhi
Maths 90 92 89
Science 91 81 91
We may use a slice of labels with a list of column names to access values of those rows and
columns:
(4)>>> ResultDF.loc['Maths': 'Science', ['Arnab','Samridhi']]
Output:
Arnab Samridhi
Maths 90 89
Science 91 91
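The inclusive end applies to label slices with loc. Position slices with iloc exclude the end position, as the following sketch on a shortened ResultDF suggests. The variable names are assumptions.

```python
import pandas as pd

ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96], 'Samridhi': [89, 91, 88]},
    index=['Maths', 'Science', 'Hindi'])

# Label slices with loc include the end label
by_label = ResultDF.loc['Maths':'Science', 'Arnab':'Ramit']

# Position slices with iloc exclude the end position
by_position = ResultDF.iloc[0:2, 0:2]

# Both selections cover the same two rows and two columns
same = by_label.equals(by_position)
print(by_label)
print(same)
```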
Filtering Rows in DataFrames:
In order to select or omit particular row(s), we can use a Boolean list specifying ‘True’ for the
rows to be shown and ‘False’ for the ones to be omitted in the output.
>>> ResultDF.loc[[True, False, True]]
Output:
Arnab Ramit Samridhi Riya Mallika
Maths 90 92 89 81 94
Hindi 97 96 88 67 99
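In practice the Boolean list is usually produced by a comparison rather than typed by hand. A minimal sketch on a shortened ResultDF, with the variable names mask, high_scores and first_and_last assumed for illustration:

```python
import pandas as pd

ResultDF = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96], 'Riya': [81, 71, 67]},
    index=['Maths', 'Science', 'Hindi'])

# A comparison yields a Boolean Series with one True/False per row
mask = ResultDF['Arnab'] > 90

# Passing the mask to loc keeps only the rows marked True
high_scores = ResultDF.loc[mask]

# A plain Boolean list works the same way, one entry per row
first_and_last = ResultDF.loc[[True, False, True]]

print(high_scores)
print(first_and_last)
```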
Joining, Merging and Concatenation of DataFrames:
(A) Joining :
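We can use the pandas.DataFrame.append() method to merge two DataFrames. It
appends the rows of the second DataFrame at the end of the first DataFrame. Columns
not present in the first DataFrame are added as new columns. Note that append() was
deprecated in pandas 1.4 and removed in pandas 2.0, where pandas.concat() is used
instead, so the append() examples here apply to earlier versions.
>>> dFrame1=pd.DataFrame([[1, 2, 3], [4, 5], [6]], columns=['C1', 'C2', 'C3'],
index=['R1', 'R2', 'R3'])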
>>> dFrame1
C1 C2 C3
R1 1 2.0 3.0
R2 4 5.0 NaN
R3 6 NaN NaN
>>> dFrame2=pd.DataFrame([[10, 20], [30], [40, 50]], columns=['C2', 'C5'],
index=['R4', 'R2', 'R5'])
>>> dFrame2
C2 C5
R4 10 20.0
R2 30 NaN
R5 40 50.0
>>> dFrame1=dFrame1.append(dFrame2)
>>> dFrame1
C1 C2 C3 C5
R1 1.0 2.0 3.0 NaN
R2 4.0 5.0 NaN NaN
R3 6.0 NaN NaN NaN
R4 NaN 10.0 NaN 20.0
R2 NaN 30.0 NaN NaN
R5 NaN 40.0 NaN 50.0
If we instead append dFrame1 to dFrame2, the rows of dFrame2 precede the rows of
dFrame1. To make the column labels appear in sorted order, we can set the parameter
sort=True. With sort=False, the column labels appear in unsorted order.
verify_integrity:
The parameter verify_integrity of the append() method may be set to True when we want
to raise an error if the row labels are duplicated. By default, verify_integrity = False.
That is why we could append the duplicate row with label R2 when appending the
two DataFrames, as shown above.
ignore_index:
The parameter ignore_index of the append() method may be set to True when we do not
want to keep the row index labels; the rows are then numbered from zero. By default,
ignore_index = False.
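On pandas 2.0 and later, where DataFrame.append() is no longer available, the same behaviour can be sketched with pandas.concat(), which accepts the sort, verify_integrity and ignore_index parameters. The small DataFrames below are simplified stand-ins for dFrame1 and dFrame2, not the exact data used above.

```python
import pandas as pd

# Simplified stand-ins: they share column C2 and row label R2
dFrame1 = pd.DataFrame({'C1': [1, 4], 'C2': [2, 5]}, index=['R1', 'R2'])
dFrame2 = pd.DataFrame({'C2': [30, 40], 'C5': [35, 50]}, index=['R2', 'R5'])

# Rows of dFrame2 are placed after the rows of dFrame1; duplicate label R2 is kept
combined = pd.concat([dFrame1, dFrame2])

# sort=True puts the column labels in sorted order
sorted_cols = pd.concat([dFrame2, dFrame1], sort=True)

# ignore_index=True discards the row labels and numbers the rows from zero
renumbered = pd.concat([dFrame1, dFrame2], ignore_index=True)

# verify_integrity=True raises an error because label R2 occurs twice
duplicate_error = None
try:
    pd.concat([dFrame1, dFrame2], verify_integrity=True)
except ValueError as err:
    duplicate_error = str(err)

print(combined)
print(duplicate_error)
```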
Importing and Exporting Data between CSV Files and DataFrames:
CSV (Comma-Separated Values):
A CSV file is a plain text file that stores tabular data, like a spreadsheet, in a simple
format. Each line represents a record (row), and each row consists of one or more fields
(columns) separated by commas (or sometimes by other delimiters such as semicolons or
tabs). CSV files can be easily handled through a spreadsheet application.
In case we do not want the column names to be saved to the file we may use the
parameter header=False. Another parameter index=False is used when we do not
want the row labels to be written to the file on disk.
>>> ResultDF.to_csv('C:/NCERT/resultonly.txt', sep='@', header=False, index=False)
If we open the file resultonly.txt, we will find the following contents:
90@92@89@81@94
91@81@91@71@95
97@96@88@67@99
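A round trip can be sketched without touching the disk by writing to an in-memory text buffer. The use of io.StringIO, the shortened ResultDF and the column names passed to read_csv() are choices made for this illustration only.

```python
import io
import pandas as pd

ResultDF = pd.DataFrame({'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96]},
                        index=['Maths', 'Science', 'Hindi'])

# Write to an in-memory buffer instead of a file on disk,
# leaving out the column names and the row labels
buffer = io.StringIO()
ResultDF.to_csv(buffer, sep='@', header=False, index=False)
text = buffer.getvalue()
print(text)

# Read the data back: header=None because no column names were saved,
# and names= supplies them again
restored = pd.read_csv(io.StringIO(text), sep='@', header=None,
                       names=['Arnab', 'Ramit'])
print(restored)
```

Because index=False dropped the row labels, the restored DataFrame has the default labels 0, 1, 2 instead of the subject names.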
Difference between Pandas Series and NumPy Arrays:
Pandas Series | NumPy array
In a Series we can define our own labelled index to access elements. The labels can be numbers or letters. | NumPy arrays are accessed by their integer position, using numbers only.
The elements can be indexed in descending order also. | The indexing starts with zero for the first element and the index is fixed.
If two Series are not aligned, NaN (missing) values are generated. | There are no index labels to align on; if the shapes of two arrays do not match (and cannot be broadcast), the operation fails instead of producing NaN.
A Series requires more memory. | A NumPy array occupies less memory.
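The alignment difference can be seen in a small sketch. The values and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Series align on labels: 'a' has no partner in s2, so the result is NaN there
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])
aligned_sum = s1 + s2
print(aligned_sum)

# NumPy arrays are combined by position: shapes (3,) and (2,) do not match,
# so the addition raises an error instead of producing NaN
a1 = np.array([1, 2, 3])
a2 = np.array([10, 20])
shape_error = None
try:
    a1 + a2
except ValueError as err:
    shape_error = str(err)
print(shape_error)
```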