Unit 3
NumPy Libraries
NumPy stands for Numerical Python. It is a Python library for working with arrays.
It also provides functions for linear algebra, Fourier transforms, and matrices.
Python lists can serve the purpose of arrays, but they are slow to process. NumPy
aims to provide an array object that is much faster than traditional Python lists.
The array object in NumPy is called ndarray, and it comes with many supporting
functions that make it easy to work with. Arrays are used very frequently in data
science, where speed and resources are very important. Unlike lists, NumPy arrays are
stored in one contiguous block of memory, so processes can access and manipulate
them very efficiently. This behavior is called locality of reference in computer science,
and it is the main reason why NumPy is faster than lists. NumPy is also optimized for
modern CPU architectures through vectorized processing.
import numpy as np
#creating and displaying array
data=[[1,2,6],[3,5,9]]
data=np.array(data)
print("Array Data")
print(data)
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an
object describing the data type of the array.
Edited By: Dipak Dahal Prepared By: Arjun Singh Saud, Asst. Prof. CDCSIT
Creating ndarrays
The easiest way to create an array is to use the array function. It accepts any
sequence-like object (including other arrays) and produces a new NumPy array
containing the passed data.
import numpy as np
#creating and displaying array from a tuple
data=(2,6,9)
data=np.array(data)
print("Array Data")
print(data)
#creating and displaying 2D array from a nested list
data=[[1,2,6],[3,5,9]]
data=np.array(data)
print("Array Data")
print(data)
NumPy arrays have many attributes. ndim is the attribute that represents the number of
dimensions (axes) of the ndarray.
#displaying dimension
print(data.ndim)
Unless explicitly specified, np.array tries to infer a good data type for the array that it
creates. The data type is stored in a special dtype object.
data=[2.4,3.9,-1.2]
data=np.array(data)
#print data type of array elements
print(data.dtype)
We can also specify data type of array elements explicitly while creating ndarrays.
data=np.array([1,3,5,8],dtype='int64')
print(data)
In addition to np.array, there are a number of other functions for creating new arrays.
As examples, zeros and ones create arrays of 0’s or 1’s, respectively, with a given
length or shape. empty creates an array without initializing its values to any particular
value. To create a higher dimensional array with these methods, pass a tuple for the
shape.
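The creation functions described above can be sketched as follows. This is a minimal illustration; the shapes and the int32 dtype are arbitrary choices made for the example.

```python
import numpy as np

# 1D array of five zeros (the default dtype is float64)
z = np.zeros(5)
print(z)

# 2D array of ones: the shape is passed as a tuple (rows, columns)
o = np.ones((2, 3))
print(o)

# 3D array of zeros with an explicitly chosen integer dtype
z3 = np.zeros((2, 3, 4), dtype='int32')
print(z3.shape)

# empty only allocates memory, so the values it contains are arbitrary
e = np.empty((2, 2))
print(e.shape)
```

Because empty does not initialize its values, it should only be used when every element will be overwritten afterwards.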
The numpy.arange() function is used to generate an array with evenly spaced values
within a specified interval. The function returns a one-dimensional array of type
numpy.ndarray.
data=np.arange(1,10,2)
print(data)
array: Convert input data (list, tuple, array, or other sequence type) to an
ndarray either by inferring a dtype or explicitly specifying a dtype. Copies the
input data by default.
asarray: Convert input to ndarray, but do not copy if the input is already an
ndarray.
arange: Generate an array with evenly spaced values within a
specified interval.
ones, ones_like: Produce an array of all 1’s with the given shape and dtype.
ones_like takes another array and produces a ones array of the same shape and
dtype.
d=np.ones_like(data)
print(d)
zeros, zeros_like: Like ones and ones_like but producing arrays of 0’s instead.
d=np.zeros_like(data)
print(d)
empty, empty_like: Create new arrays by allocating new memory, but do
not populate with any values like ones and zeros.
eye, identity: Create a square N x N identity matrix (1’s on the diagonal and
0’s elsewhere)
Data Types for ndarrays
The data type or dtype is a special object containing the information the ndarray needs
to interpret a chunk of memory as a particular type of data.
data=np.array([1,3,5,8],dtype='int64')
print(data)
The numerical dtypes are named in the format: a type name, like float or int, followed
by a number indicating the number of bits per element. We can explicitly convert or
cast an array from one dtype to another using ndarray’s astype method.
data=np.array([1,3,5,8],dtype='int64')
print(data)
data=data.astype('float64')
print(data)
data=data.astype(np.int32)
print(data)
If we have an array of strings representing numbers, we can use astype to convert them
to numeric form.
data=np.array(['2.5','3.7','9.1'],dtype=np.bytes_)
print(data)
data=data.astype('float64')
print(data)
print(data.dtype)
If casting fails for some reason (like a string that cannot be converted to float64), a
ValueError will be raised.
data=np.array(['2.5','3.7','9.1f'],dtype=np.bytes_)
print(data)
data=data.astype('float64') #Error
print(data)
print(data.dtype)
Operations between Arrays and Scalars
Arrays are important because they enable you to express batch operations on data
without writing any for loops. This is usually called vectorization. Any arithmetic
operation between equal-size arrays is applied element-wise.
import numpy as np
a = np.array([[1., 2., 3.], [4., 5., 6.]])
print(a)
r=a*a
print("Element-wise multiplication of arrays:")
print(r)
r=a+a
print("Sum of arrays:")
print(r)
Arithmetic operations with scalars propagate the scalar value to each element in the
NumPy array.
import numpy as np
a = np.array([[1., 2., 3.], [4., 5., 6.]])
print(a)
r=a/2
print("Half of array elements:")
print(r)
r=a**0.5
print("Square root of array elements:")
print(r)
import numpy as np
a = np.arange(10)
print("Array Elements:")
print(a)
print("Element at index 3")
print(a[3])
print("Element from index 3-6")
print(a[3:7])
a[2]=10 #modifying element at index 2
a[4:6]=11 #modifying element from index 4 to 5
print("Array Elements:")
print(a)
If we assign a scalar value to a slice, as in a[4:6] = 11, the value is propagated (or
broadcast) to the entire selection. An important first distinction from lists is
that array slices are views on the original array. This means that the data is not copied,
and any modifications to the view will be reflected in the source array as demonstrated
below.
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9])
aslice=a[3:7]
aslice[1]=15 #modification will be reflected in original array
print("Array Elements:")
print(a)
a = [1,2,3,4,5,6,7,8,9]
aslice=a[3:7]
aslice[1]=15 #modification will not be reflected in original list
print("List Elements:")
print(a)
As NumPy has been designed with large data use cases in mind, we could imagine
performance and memory problems if NumPy copied data instead of creating views. If we
want a copy of a slice of an ndarray instead of a view, we need to explicitly copy the
array; for example arr[5:8].copy().
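The difference between a view and an explicit copy can be sketched as below. The variable names view and copied are chosen for this illustration only.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
view = a[3:7]            # a view: shares memory with a
copied = a[3:7].copy()   # an independent copy of the same elements

copied[1] = 15           # modifies only the copy
print(a)
view[1] = 99             # modifies a[4] through the view
print(a)
print(np.shares_memory(a, view), np.shares_memory(a, copied))
```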
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Array Element at index 2")
print(a[2])
print("Array Element at index 1,2")
print(a[1][2])
print(a[1,2])#Equivalent to a[1][2]
In multidimensional arrays, if you omit later indices, the returned object will be a
lower dimensional ndarray consisting of all the data along the higher dimensions, as
demonstrated below with a 2 × 2 × 3 array.
import numpy as np
a = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print("Array Elements")
print(a)
print("Array Element at index 1")
print(a[1])
print("Array Element at index 1,1")
print(a[1,1])
print("Array Element at index 1,1,2")
print(a[1,1,2])
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Array Elements:")
print(a)
print("First Two Rows")
print(a[0:2])# Or print(a[:2])
print("First Two Columns of array")
print(a[:,0:2])
print("2x2 slice in top-left corner")
print(a[0:2,0:2])
As with 1D arrays, we can take a slice of a 2D array and update it, and the update is
reflected in the original array.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
aslice=a[0:2,0:2]
aslice[:,:]=0
print("Array Elements")
print(a)
Boolean Indexing
In NumPy, Boolean indexing allows us to filter elements from an array based on a specific
condition. We use Boolean masks to specify the condition. A Boolean mask is a NumPy
array containing truth values (True/False) that correspond to each element in the array.
Suppose we have an array named ‘a’.
We can create a mask that selects all elements of a that are greater than 20 with the
expression a > 20. This expression creates a Boolean mask that evaluates to True for
elements that are greater than 20, and False for elements that are less than or equal to
20. The resulting mask is an array stored in the boolean_mask variable as below.
import numpy as np
a = np.array([12, 24, 16, 21, 32, 29, 7, 15])
boolean_mask = a > 20
print(boolean_mask)
print(a[boolean_mask])
a[boolean_mask]=0#sets all elements greater than 20 to zero
print(a)
Fancy Indexing
In NumPy, fancy indexing allows us to use an array of indices to access multiple array
elements at once. Fancy indexing can perform more advanced and efficient array
operations, including conditional filtering, sorting, and so on.
Example
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
simple_indexing = a[3] #select a single element by position
print("Simple Indexing:",simple_indexing) # 4
We can also use fancy indexing on multi-dimensional arrays. The concept of fancy indexing
is the same in multi-dimensional arrays.
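Fancy indexing on one-dimensional and two-dimensional arrays can be sketched as below. The index lists and the 3 x 4 array b are arbitrary choices for this illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
# select the elements at positions 1, 2, 5 and 7 in one step
fancy = a[[1, 2, 5, 7]]
print("Fancy Indexing:", fancy)

b = np.arange(12).reshape((3, 4))
# a list of row indices selects whole rows, in the given order
rows = b[[2, 0]]
print(rows)
# paired row and column indices select the elements at (0, 1) and (2, 3)
elems = b[[0, 2], [1, 3]]
print(elems)
```

Unlike slicing, fancy indexing always copies the selected data into a new array.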
Example
import numpy as np
a = np.arange(10)
print("Dataset:",a)
s=np.sqrt(a)#unary universal function
print("Square Roots:",s)
e=np.exp(a)
print("Exp(a):",e)
x=np.random.randn(10)
y=np.random.randn(10)
z=np.maximum(x,y)#binary universal function
print("x=",x)
print("y=",y)
print("z=",z)
m=np.max(x)
print("Maximum=",m)
Sort 2d array using custom datatype
import numpy as np
# Define a custom data type with fields 'name', 'age', and 'height'
user_dtype = np.dtype([
('name', 'U10'), # Unicode string of maximum length 10
('age', 'i4'), # 4-byte (32-bit) integer
('height', 'f4') # 4-byte (32-bit) float
])
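A custom dtype like the one above can be used to sort records by a field. The following is a minimal sketch; the sample records are hypothetical and only serve the illustration.

```python
import numpy as np

user_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
# hypothetical sample records, one tuple per person
people = np.array([('Ram', 25, 5.6), ('Sita', 22, 5.2), ('Hari', 30, 5.9)],
                  dtype=user_dtype)
# sort the records by the 'age' field
by_age = np.sort(people, order='age')
print(by_age)
print(by_age['name'])
```

The order argument also accepts a list of field names, in which case later fields are used to break ties.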
Array Functions
Note: Write down programs to demonstrate each of the above
methods
np.save and np.load are the main functions for efficiently saving and loading array
data on disk. Arrays are saved by
default in an uncompressed raw binary format with file extension .npy. If the file path
does not already end in .npy, the extension will be appended. The array on disk can
then be loaded using np.load. We can save multiple arrays in a zip archive using
np.savez and passing the arrays as keyword arguments. When loading an .npz file, we
get back a dictionary-like object which loads the individual arrays.
import numpy as np
a = np.arange(10)
print("a=",a)
np.save('some_array', a)
b=np.load('some_array.npy')
print("b=",b)
c = np.arange(20)
print("c=",c)
np.savez('array_archive.npz', x=a, y=c)
arch = np.load('array_archive.npz')
print("Arrays in Archive:")
for k in arch:
    print(arch[k])
Example
import numpy as np
a = np.loadtxt('/content/drive/My Drive/test.txt', delimiter=',')
print(a)
np.savetxt('/content/drive/My Drive/test1.txt', a)
print("File is saved")
The genfromtxt() function is used to load data in a program from a text file. It takes
multiple argument values to clean the data of the text file. It also has the ability to deal
with missing or null values through the processes of filtering, removing, and replacing.
import numpy as np
# invoking genfromtxt method to read test.txt file
content = np.genfromtxt("/content/drive/My Drive/test.txt", dtype=str,
encoding = None, delimiter=",")
# print file data on console
print("File data:", content)
Linear Algebra
Linear algebra operations, like matrix multiplication, decompositions, determinants,
and other square matrix math, are an important part of any array library. Unlike some
languages like MATLAB, multiplying two two-dimensional arrays with * is an element-wise
product instead of a matrix dot product. numpy.linalg has a standard set of matrix
decompositions and things like inverse and determinant. Commonly used
numpy.linalg functions are listed below.
#Matrix multiplication
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])
z=x.dot(y)
print(z)
r=np.dot(x, y)#equivalent to x.dot(y)
print(r)
#solving system of linear equations, finding determinant and inverse
#2x+3y-z=5
#x+3y-z=4
#3x-y+2z=7
import numpy as np
from numpy.linalg import inv, solve,det
a = np.array([[2,3,-1],[1,3,-1],[3,-1,2]])
b=np.array([5,4,7])
s=solve(a,b)
print(s)
d=det(a)
print("determinant of a=",d)
b=inv(a)
print("Inverse of a=",b)
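To join arrays, we pass a sequence of the arrays to the concatenate() or stack() method
along with the axis. concatenate() joins the arrays along an existing axis, while stack()
joins them along a new axis. If the axis is not explicitly passed, it is taken as 0.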
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2))
print(arr)
import numpy as np
np.random.seed(100)
d=np.random.randint(0,10)
print("d=",d)
samples = np.random.normal(size=(4, 4))
print(samples)
d=np.random.permutation([1,2,3])
print("d=",d)
l=[1,2,3,4,5]
np.random.shuffle(l) #shuffles the list in place and returns None
print("Shuffled List=",l)
Array Broadcasting
In NumPy, we can perform mathematical operations on arrays of different shapes. An
array with a smaller shape is expanded to match the shape of a larger one. This is
called broadcasting.
import numpy as np
a=np.array([1,10,3]) #size=3
b=np.array([[5],[10],[15]]) #size=3x1 so compatible for broadcasting
c=a+b
print(c) #output is 3x3 array
output:
[[ 6 15 8]
[11 20 13]
[16 25 18]]
Function        Description
add()           concatenates two strings
multiply()      repeats a string for a specified number of times
capitalize()    capitalizes the first letter of a string
lower()         converts all uppercase characters in a string to lowercase
upper()         converts all lowercase characters in a string to uppercase
join()          joins a sequence of strings
equal()         checks if two strings are equal or not
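The string functions in the table can be tried out as below. This sketch assumes that the table refers to the functions in the np.char module, which the text does not state explicitly; the sample strings are arbitrary.

```python
import numpy as np

a = np.array(['hello', 'numpy'])
b = np.array([' world', ' arrays'])

print(np.char.add(a, b))        # element-wise concatenation
print(np.char.multiply(a, 2))   # repeat each string twice
print(np.char.capitalize(a))    # capitalize the first letter of each string
print(np.char.upper(a))
print(np.char.lower(np.array(['HELLO', 'NumPy'])))
print(np.char.join('-', a))     # insert '-' between the characters of each string
print(np.char.equal(a, np.array(['hello', 'pandas'])))
```

Each function works element-wise, so the inputs follow the same shape rules as numeric arrays.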
Series
Example
import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3]) #series data structure
print(obj.values) #displaying values in the data structure
DataFrame
import pandas as pd
data = {'State': ['Bagmati', 'Koshi', 'Karnali', 'Lumbini', 'Gandaki'],
'Year': [2000, 2001, 2002, 2001, 2002]}
frame1 = pd.DataFrame(data)#creating dataframe
print(frame1)
frame2 = pd.DataFrame(data,columns=["State","Year","Debt"])
print(frame2)#creating data frame
print(frame2["State"])#displaying column State
obj=pd.Series([2,5,3,3,4])
frame2["Debt"]=obj
print(frame2)#displaying data frame
print(frame2.values)#displaying in 2D array format
Index Objects
Pandas’s Index objects are responsible for holding the axis labels and other metadata (like
the axis names). Any array or other sequence of labels used when constructing a Series
or DataFrame is internally converted to an Index. Index objects are immutable and thus
can’t be modified by the user.
import pandas as pd
s= pd.Series(range(3), index=[1, 2, 3])
print(s)
print(s.index)
print(pd.Index(s)) #pd.Int64Index was removed in pandas 2.0
#s.index[1]='d'# index is immutable
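The immutability of an Index can be checked with a short sketch like the one below. The replacement labels 'a', 'b', 'c' are arbitrary.

```python
import pandas as pd

s = pd.Series(range(3), index=[1, 2, 3])
try:
    s.index[1] = 'd'   # item assignment on an Index is not allowed
except TypeError as err:
    print("Cannot modify index:", err)

# the whole index can still be replaced by assigning a new one
s.index = ['a', 'b', 'c']
print(s)
```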
Essential Functionalities
This section discusses fundamental mechanics of interacting with the data contained in a
Series or DataFrame.
Reindexing
A critical method on pandas objects is reindex, which creates a new object
with the data conformed to a new index. Calling reindex on a Series rearranges the
data according to the new index, introducing missing values if any index values were
not already present.
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
obj1 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj1)
obj2=obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(obj2)
For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method
such as ffill which forward fills the values.
import pandas as pd
obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj)
obj1=obj.reindex(range(6), method='ffill')
print(obj1)
obj2=obj.reindex(range(6), method='bfill')
print(obj2)
Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an
axis.
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
obj1 = obj.drop('c')
print(obj1)
obj2 = obj.drop(['b','d'])
print(obj2)
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['r1', 'r2', 'r3',
'r4'],
columns=['c1', 'c2', 'c3', 'c4'])
print(data)
d=data.drop('r2') #drops row r2
print(d)
d=data.drop('c2',axis=1) #drops column c2
print(d)
Series indexing works analogously to NumPy array indexing, except we can use the
Series’s index values instead of only integers.
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj.iloc[2]) #positional access, same as obj['c']
print(obj['c'])
print(obj[1:3])
print(obj[['b','c','d']])
Slicing with labels behaves differently than normal Python slicing in that the endpoint is
inclusive and setting using these methods works just as we would expect.
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj.iloc[2]) #positional access, same as obj['c']
print(obj['c'])
print(obj[1:3])
print(obj['b':'d'])
obj['b':'c'] = 5
print(obj)
Indexing into a DataFrame is for retrieving one or more columns either with a single
value or sequence:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['r1', 'r2', 'r3',
'r4'],
columns=['c1', 'c2', 'c3', 'c4'])
print(data['c1'])
print(data[['c1','c3']])
print(data[:2])
print(data[data['c3'] > 5])
Another use case is in indexing with a Boolean DataFrame, such as one produced by a
scalar comparison.
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['r1', 'r2', 'r3',
'r4'],
columns=['c1', 'c2', 'c3', 'c4'])
print(data < 5)
data[data < 5] = 0
print(data)
One of the most important pandas features is the behavior of arithmetic between objects
with different indexes. When adding together objects, if any index pairs are not the same,
the respective index in the result will be the union of the index pairs.
import pandas as pd
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s3=s1+s1
print(s3)
s3=s1+s2
print(s3)
In the case of DataFrame, alignment is performed on both the rows and the columns:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
index=['1', '2', '3', '4'])
print(df1)
print(df2)
df=df1+df1
print(df)
df=df1+df2
print(df)
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'),
index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list('abcde'),
index=['1', '2', '3', '4'])
df=df1.add(df2, fill_value=0)
print(df)
df=df1.sub(df2, fill_value=0)
print(df)
df=df1.mul(df2, fill_value=0)
print(df)
df=df1.div(df2, fill_value=0)
print(df)
As with NumPy arrays, arithmetic between a DataFrame and a Series is well-defined. In such
cases, the operation is performed using broadcasting.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((3, 4)))
s=pd.Series([2,4,5,7])
df1=df+s
print(df)
print(s)
print(df1)
By default, arithmetic between a DataFrame and a Series matches the index of the Series on
the DataFrame’s columns, broadcasting down the rows. If an index value is not found in
either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form
the union.
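The reindexing to the union of labels can be sketched as below. The column labels and Series values are hypothetical and chosen so that only some labels overlap.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))
s = pd.Series([1, 2, 3], index=['a', 'b', 'e'])

# labels 'a' and 'b' match; 'c', 'd' and 'e' appear in only one object,
# so those columns are filled with NaN in the result
result = df + s
print(result)

# to match on the rows instead, use the arithmetic method with axis=0
col = df['a']
print(df.sub(col, axis=0))
```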
NumPy universal functions work fine with pandas objects.
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),index=['r1', 'r2', 'r3', 'r4'])
print(frame)
frame=np.abs(frame)
print(frame)
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),index=['r1', 'r2', 'r3', 'r4'])
print(frame)
frame=np.abs(frame)
print(frame)
f= lambda x: x.max() - x.min()
fr=frame.apply(f,axis=0)
print(fr)
The function passed to apply need not return a scalar value; it can also return a Series
with multiple values.
import numpy as np
import pandas as pd
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
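A function of this kind can be applied to a whole DataFrame as sketched below. The frame used here holds hypothetical values created with np.arange so that the result is easy to verify.

```python
import numpy as np
import pandas as pd

def f(x):
    # return two statistics per column as a labeled Series
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bde'), index=['r1', 'r2', 'r3', 'r4'])
fr = frame.apply(f)   # f is called once per column
print(fr)
```

The result is a DataFrame whose rows are labeled min and max and whose columns match those of frame.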
Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns a
new, sorted object.
import numpy as np
import pandas as pd
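Sorting a Series by its index can be sketched as below. The labels and values are hypothetical.

```python
import pandas as pd

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
sorted_obj = obj.sort_index()   # returns a new Series; obj is unchanged
print(sorted_obj)
```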
With a DataFrame, we can sort by index on either axis. The data is sorted in ascending
order by default, but can be sorted in descending order too.
import numpy as np
import pandas as pd
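Sorting a DataFrame by index on either axis can be sketched as below. The row and column labels are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
print(frame.sort_index())                          # sort by row labels
print(frame.sort_index(axis=1))                    # sort by column labels
print(frame.sort_index(axis=1, ascending=False))   # descending column order
```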
To sort a Series by its values, use its sort_values method. Any missing values are sorted
to the end of the Series by default.
import numpy as np
import pandas as pd
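Sorting a Series by value, including the handling of missing values, can be sketched as below. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
asc = obj.sort_values()
print(asc)
desc = obj.sort_values(ascending=False)   # NaN values still come last
print(desc)
```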
On a DataFrame, we may want to sort by the values in one or more columns. To do so,
pass one or more column names to the by option:
import numpy as np
import pandas as pd
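Sorting by one or several columns with the by option can be sketched as below. The column names and values are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_b = frame.sort_values(by='b')
print(by_b)
by_ab = frame.sort_values(by=['a', 'b'])   # sort by a, then by b within each a
print(by_ab)
```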
Ranking is closely related to sorting, assigning ranks from one through the number of
valid data points in an array. Ties are broken according to a rule. By default, rank
breaks ties by assigning each group the mean rank. We can also rank in descending order.
import numpy as np
import pandas as pd
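The tie-breaking behavior of rank on a Series can be sketched as below. The sample values are hypothetical and contain two pairs of ties.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj.rank())                  # ties get the mean of their ranks
print(obj.rank(method='first'))    # ties ranked in order of appearance
print(obj.rank(ascending=False))   # rank in descending order
```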
import numpy as np
import pandas as pd
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
print(frame)
fr=frame.rank(axis=1)
print(fr)
Axis indexes with Duplicate Values
Series may have duplicate indices. The index’s is_unique property can tell you whether
its values are unique or not. Data selection is one of the main things that behaves
differently with duplicates. Indexing a value with multiple entries returns a Series
while single entries return a scalar value.
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
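Selection with duplicate index labels can be sketched as below. The labels are hypothetical, with 'a' and 'b' repeated on purpose.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
print(obj.index.is_unique)   # False, because 'a' and 'b' repeat
print(obj['a'])              # a Series with two entries
print(obj['c'])              # a single scalar value
```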
Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained. Another method is describe, which
produces multiple summary statistics in one shot. Summary descriptive methods of
DataFrame are listed in the table given below.
Example
import numpy as np
import pandas as pd
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
print(df)
print(df.sum())
print(df.sum(axis=1))
print(df.mean())
print(df.describe())
Covariance
Covariance is a measure of the relationship between two random variables. It measures
the direction of the relationship between two variables. If the covariance of two
variables is positive, both variables move in the same direction. If the covariance of
two variables is negative, the variables move in opposite directions. It can be
calculated as below:
$$cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$
A square matrix that gives the covariance between each pair of components (or elements)
of a given random vector is called a covariance matrix.
# importing pandas as pd
import pandas as pd
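A covariance matrix can be computed with the cov method, as sketched below. The column names and values are hypothetical. Note that pandas divides by n - 1 by default; passing ddof=0 divides by n instead.

```python
import pandas as pd

# hypothetical sample data: y rises with x, z falls as x rises
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [8, 6, 4, 2]})
print(df.cov())               # covariance matrix, normalized by n - 1
print(df.cov(ddof=0))         # covariance matrix, normalized by n
print(df["x"].cov(df["y"]))   # covariance of a single pair of columns
```

The positive entry for x and y and the negative entry for x and z reflect the direction of each relationship.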
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It’s a common tool for
describing simple relationships without making a statement about cause and effect. The
sample correlation coefficient, r, quantifies the strength of the relationship. A
correlation coefficient close to 0, whether positive or negative, implies little or no
relationship between the two variables. A correlation coefficient close to +1 means a
positive relationship between the two variables, with increases in one of the variables
being associated with increases in the other variable. A correlation coefficient close to
-1 indicates a negative relationship between two variables, with an increase in one of the
variables being associated with a decrease in the other variable. The most common
formula is the Pearson correlation coefficient, used for linear dependency between the
data sets, and is given as below.
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - (\sum x)^2\right)\left(n \sum y^2 - (\sum y)^2\right)}}$$
Example
# importing pandas as pd
import pandas as pd
# creating the dataframe
df = pd.DataFrame({"C":[4, 3, 8, 5],
"D":[5, 4, 2, 8]})
print(df)
print(df.corr())
# importing pandas as pd
import pandas as pd
s=pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = s.unique()
print(uniques)
l=s.value_counts()
print(l)
m = s.isin(['b', 'c'])
print(m)
print(s[m])
Example
import pandas as pd
import numpy as np
s = pd.Series(['Orange', 'Mango', np.nan, 'Avocado'])
print(s)
print(s.isnull())
print(s.notnull())
s1=s.dropna()
print(s1)
s2=s.fillna(0)
print(s2)
Example 2
import pandas as pd
import numpy as np
data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],[np.nan,
np.nan, np.nan], [np.nan, 6.5, 3.]])
print(data)
d1=data.dropna()
print(d1)
d2=data.fillna(0)
print(d2)
d3=data.dropna(how='all')
print(d3)
d4=data.dropna(how='all',axis=1)
print(d4)
By calling fillna with a dict, you can use a different fill value for each column. fillna
returns a new object, but you can also modify the existing object in place. The same
interpolation methods available for reindexing can be used with fillna. With fillna you
can do lots of other things with a little creativity. For example, we might pass the mean
or median value of a Series.
import pandas as pd
import numpy as np
data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],[np.nan,
np.nan, np.nan], [np.nan, 6.5, 3.]])
d=data
d.fillna(0)
print(d)
d.fillna(0,inplace=True)
print(d)
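The dict form of fillna mentioned above can be sketched as below. The fill values -1, 0.5 and 0 are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
# keys are column labels, values are the fill value for that column
filled = data.fillna({0: -1, 1: 0.5, 2: 0})
print(filled)
```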
import pandas as pd
import numpy as np
data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],[np.nan,
np.nan, np.nan], [np.nan, 6.5, 3.]])
print(data)
d=data.ffill()
print(d)
d=data.ffill(limit=1)
print(d)
d=data.fillna(data.mean())
print(d)
Hierarchical Indexing
Hierarchical indexing, also known as multi-level indexing, is a way of organizing data
in Pandas with multiple levels of row or column labels. This allows you to work with
more complex data structures than a simple table with one row and one column of
labels. For example, imagine we have a dataset with sales data for a company, broken
down by region and by quarter. You could organize this data with a hierarchical index
that has two levels: one for the region and one for the quarter.
Example
import pandas as pd
index = [('Kathmandu', 'Q1'), ('Kathmandu', 'Q2'),('Kathmandu', 'Q3'),
('Kathmandu', 'Q4'),
('Pokhara', 'Q1'), ('Pokhara', 'Q2'), ('Pokhara', 'Q3'),
('Pokhara', 'Q4')]
sales = [350, 500,325, 475,200, 300,350,250]
sales_data = pd.Series(sales, index=index)
print(sales_data)
print(sales_data.index)
for x in index:
    if(x[1]=='Q2'):
        print(x,sales_data[x])
Another Example
import pandas as pd
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
index=index)
print(df)
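Selecting data from a hierarchical index can be sketched as below. This example rebuilds the sales data with MultiIndex.from_product; the level names Region and Quarter are assumptions added for readability.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['Kathmandu', 'Pokhara'],
                                    ['Q1', 'Q2', 'Q3', 'Q4']],
                                   names=['Region', 'Quarter'])
sales_data = pd.Series([350, 500, 325, 475, 200, 300, 350, 250], index=index)

print(sales_data['Kathmandu'])   # all quarters for one region
print(sales_data.loc[:, 'Q2'])   # Q2 sales for every region
print(sales_data.unstack())      # regions as rows, quarters as columns
```

Selecting on the inner level with loc avoids the explicit loop over index tuples.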
Panel Data
The Panel in pandas was used for working with three-dimensional data. It had three main
axes: items is axis 0, which corresponds to the data; major-axis is axis 1, for rows; and
minor-axis is axis 2, for columns. A panel could be created with the pandas Panel()
function from ndarrays or a dictionary of DataFrames, and data could be extracted from
panels using different methods. Panel is deprecated and has been removed from recent
pandas versions, so new code should not rely on it.
Group By
Pandas groupby is used for grouping the data according to the categories and
applying a function to the categories. It also helps to aggregate data efficiently. The
Pandas groupby() is a very powerful function with a lot of variations. It makes the task
of splitting the Dataframe over some criteria really easy and efficient.
Example:
import pandas as pd
df=pd.read_csv("/content/drive/MyDrive/Python Data/employees.csv")
df=df.dropna()
data=df.head(100)
data=data[['Team','First Name','Gender','Salary']]
gb=data.groupby(["Team","Gender"])
print(gb['Salary'].max())
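The same grouping can be tried without the CSV file. The sketch below uses a small hypothetical in-memory table in place of employees.csv; the teams and salaries are invented for the illustration.

```python
import pandas as pd

# hypothetical in-memory data standing in for employees.csv
data = pd.DataFrame({
    'Team': ['Sales', 'Sales', 'IT', 'IT', 'IT'],
    'Gender': ['F', 'M', 'F', 'M', 'M'],
    'Salary': [50000, 45000, 60000, 55000, 65000],
})
gb = data.groupby(['Team', 'Gender'])
top = gb['Salary'].max()                      # highest salary per group
print(top)
stats = gb['Salary'].agg(['mean', 'count'])   # several aggregates at once
print(stats)
```

Grouping by two columns produces a result with a hierarchical index of Team and Gender.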
set1=['a','c','d','e','y']
address=['ktm','htd','pkh','bkh','ktm']
set2=['g','a','b','d','h']
d1={"set":set1,"address":address}
d2={"set":set2,"address":address}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
result=pd.merge(df1,df2,on="set",how="outer")
print(result)
Matplotlib Library
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. Matplotlib makes difficult things possible and simple things easy. matplotlib.pyplot is
a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes
some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines
in a plotting area, decorates the plot with labels, etc. Once we are done, we can save it with
savefig() or display it with show().
Example
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the x and y-axis
plt.ylabel("Billions")
plt.xlabel("Years")
plt.show()
Bar Charts
A bar chart, often known as a bar graph, is a diagram that displays categorical data as rectangular
bars with heights or lengths proportional to the values they stand for. You can plot the bars either
vertically or horizontally. A vertical bar chart may also be referred to as a column chart.
Comparisons among distinct categories are displayed in a bar graph. The comparison categories
are shown on one axis of the chart, and a measured value is shown on the other axis.
Example
from matplotlib import pyplot as plt
Country = ["Nepal", "Sri Lanka", "Bangladesh", "India",
"Bhutan","Maldives","Pakistan","Afghanistan"]
GDP_growth_rate = [6.4, 4.5, 8.3, 7.4, 5.8,8.7,3.2,2.1]
# plot bars with Country as x-coordinate and GDP_growth_rate as height
plt.figure(figsize=(8,4))
plt.bar(Country, GDP_growth_rate)
plt.title("GDP Growth Rates of SAARC Countries") # add a title
plt.ylabel("GDP Growth Rate") # label the y-axis
plt.xlabel("Country") # label the x-axis
plt.show()
Calling plt.barh(y, width) instead of plt.bar() plots a horizontal bar chart: the first argument
gives the categories on the y-axis and the second gives the bar lengths.
Example
from matplotlib import pyplot as plt
Country = ["Nepal", "Sri Lanka", "Bangladesh", "India",
"Bhutan","Maldives","Pakistan","Afghanistan"]
GDP_growth_rate = [6.4, 4.5, 8.3, 7.4, 5.8,8.7,3.2,2.1]
# plot bars with Country as y-coordinate and GDP_growth_rate as bar length
plt.figure(figsize=(8,4))
plt.barh(Country, GDP_growth_rate)
plt.title("GDP Growth Rates of SAARC Countries") # add a title
plt.xlabel("GDP Growth Rate") # label the x-axis
plt.ylabel("Country") # label the y-axis
plt.show()
In a stacked bar chart, the bars of each series are stacked one on top of another. An unstacked
bar chart compares the groups with one another; a stacked chart also shows how each group's total
is divided among its parts, plotted as group counts or as relative proportions. Occasionally, it
is used to display relative proportions that sum to 100%.
Example
# importing package
import matplotlib.pyplot as plt
import numpy as np
# create data
x = ['A', 'B', 'C', 'D']
y1 = np.array([10, 20, 10, 30])
y2 = np.array([20, 25, 15, 25])
y3 = np.array([12, 15, 19, 6])
y4 = np.array([10, 29, 13, 19])
# plot each round on top of the running total of the rounds before it
plt.bar(x, y1)
plt.bar(x, y2, bottom=y1)
plt.bar(x, y3, bottom=y1+y2)
plt.bar(x, y4, bottom=y1+y2+y3)
plt.legend(["Round 1", "Round 2", "Round 3", "Round 4"])
plt.title("Scores by Teams in 4 Rounds")
plt.show()
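The 100% variant mentioned above can be sketched as follows. This is one possible approach, not the only one: each team's scores are first converted to percentages of that team's total, and the bars are then stacked with the bottom parameter of plt.bar(). The scores are the same sample values as in the example above.

```python
import matplotlib.pyplot as plt
import numpy as np

x = ['A', 'B', 'C', 'D']
# one row per round, one column per team
rounds = np.array([
    [10, 20, 10, 30],
    [20, 25, 15, 25],
    [12, 15, 19, 6],
    [10, 29, 13, 19],
])

# convert each team's scores to percentages of that team's total
percent = rounds / rounds.sum(axis=0) * 100

# stack the rounds: each bar starts where the previous rounds ended
bottom = np.zeros(len(x))
for i, row in enumerate(percent):
    plt.bar(x, row, bottom=bottom, label=f"Round {i + 1}")
    bottom += row

plt.ylabel("Share of total score (%)")
plt.title("Share of Scores by Round")
plt.legend()
plt.show()
```

Every bar now reaches exactly 100, so the chart compares the proportions of the rounds rather than the total scores.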
Line Charts
A line chart is a type of chart that represents data as points connected by straight line
segments. Line charts are a good choice for showing trends. They are used to show the relation
between two variables, X and Y, plotted on different axes.
Example
import matplotlib.pyplot as plt
quantity=[1123,1256,1289,1378,1456,1367,1256]
amount=[2246,2512,2588,2702,2912,3214,3250]
Month=["Jan","Feb","Mar","Apr","May","June","July"]
plt.figure(figsize=(8,4))
plt.plot(Month,quantity,marker='x')
plt.plot(Month,amount,marker='o')
plt.title('Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend(["Sales Quantity","Sales Amount"],loc="upper left")
plt.show()
Example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Load the dataset into a Pandas DataFrame
df = pd.read_csv("/content/drive/My Drive/HistoricalPrices.csv")
plt.figure(figsize=(8,4))
# the column names "Date", "Open" and "Close" are assumed here; change them to match the CSV header
plt.plot(df["Date"], df["Open"])
plt.plot(df["Date"], df["Close"])
plt.title('DJIA Open and Close Prices')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend(["Open Price","Close Price"],loc="upper left")
plt.show()
Scatterplots
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to observe
relationships between variables.
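A minimal sketch of a scatter plot is given below. It assumes synthetic data: the heights and weights are generated randomly for illustration only, with a loose linear relationship built in so that the pattern is visible in the plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: heights (cm) and weights (kg) with a loose linear relation
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)
weight = 0.9 * height - 90 + rng.normal(0, 5, 100)

plt.figure(figsize=(6, 4))
plt.scatter(height, weight)   # one dot per (height, weight) pair
plt.title("Height vs Weight")
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()
```

The dots rise from left to right, which is how a positive relationship between two variables appears in a scatter plot.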
Example
import numpy as np
import matplotlib.pyplot as plt
# histogram of 250 normally distributed values (mean 170, standard deviation 10)
x = np.random.normal(170, 10, 250)
num_bins = 7 # number of bins
plt.figure(figsize=(4,3))
plt.hist(x, num_bins, color='Blue', alpha=0.5)
plt.show()
Plotting Maps
Maps have been used for centuries to help people navigate and understand their
surroundings. In the age of big data, maps have become an essential tool for data
visualization. They allow us to visualize data in a way that is intuitive, interactive, and
easy to understand. Maps can help us identify patterns and relationships that might be
difficult to see in other types of visualizations.
Plotly is a powerful data visualization library for Python that allows you to create a
wide range of interactive visualizations, including maps. One of the advantages of
Plotly is that it is designed to work seamlessly with other Python libraries, such as
Pandas and NumPy. This makes it easy to import and manipulate data and to create
visualizations that are customized to your specific needs.
The Scattergeo trace (plotly.graph_objects.Scattergeo, or the scatter_geo() function in
Plotly Express) creates a scatter plot on a geographic map: each point represents a
specific geographic location, such as a city or a landmark. For example, if we have a
dataset that contains the latitude and longitude coordinates of different cities around
the world, we can use it to plot each city on a world map.
Example
import plotly.express as px
import pandas as pd
Line Plot
plt.plot(x,y)
color for the line color
marker for the marker style
markersize or ms
markeredgecolor or mec
markerfacecolor or mfc
linestyle: '-', '--', '-.', ':', 'None', ' ', '', 'solid', 'dashed', 'dashdot', 'dotted'
multiple lines: call plt.plot() once per line
plt.grid(True) for grid
xlabel, ylabel
title
legend
subplot: plt.subplot(nrows, ncols, index)
x=np.linspace(0,2*np.pi,100)
y,y1=np.sin(x),np.cos(x)
plt.plot(x,y)
plt.plot(x,y1)
plt.legend(["sinx","cosx"])
plt.show()
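The properties listed above can be combined as in the following sketch. The colors, marker settings, and the sine and cosine data are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 30)

fig = plt.figure(figsize=(10, 4))

# first plot: 1 row, 2 columns, position 1
plt.subplot(1, 2, 1)
plt.plot(x, np.sin(x), color="green", linestyle="--",
         marker="o", ms=6, mec="black", mfc="yellow")
plt.title("sin x")
plt.xlabel("x")
plt.ylabel("sin x")
plt.grid(True)

# second plot: two lines on the same axes, told apart by a legend
plt.subplot(1, 2, 2)
plt.plot(x, np.sin(x), linestyle="-")
plt.plot(x, np.cos(x), linestyle=":")
plt.legend(["sinx", "cosx"])
plt.grid(True)

plt.show()
```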
Scatter Plot
plt.scatter(x,y)
color for the color of the points
s for the size of the points
import numpy as np
import matplotlib.pyplot as plt
points=[(10,5),(5,6),(7,5)]
x,y=list(zip(*points))
plt.scatter(x,y,color="red")
plt.show()
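The s parameter can be sketched with the same three points. Here s and c are given one value per point; the sizes and colors are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

points = [(10, 5), (5, 6), (7, 5)]
x, y = zip(*points)              # split the pairs into x and y values

sizes = [50, 200, 400]           # marker areas, in points squared
colors = ["red", "green", "blue"]

# s sets the size and c sets the color of each point
sc = plt.scatter(x, y, s=sizes, c=colors)
plt.show()
```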
Bar Graph
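plt.bar(x,y) and plt.barh(y,width)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#random values for three programs over three years
bim=np.random.randint(10,60,3)
bba=np.random.randint(20,60,3)
bca=np.random.randint(10,60,3)
d={"bim":bim,"bba":bba,"bca":bca}
data=pd.DataFrame(d,index=["2011","2012","2013"])
#one group of bars per year, one bar per program
data.plot(kind='bar')
plt.show()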
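Pie Chart
plt.pie(values)
labels=[] for the wedge labels
startangle=number for the starting angle in degrees
explode=[same size as data] to pull wedges out from the center
autopct for percentage labels
colors=[] for the wedge colors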
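The pie chart parameters listed above can be combined as in the following sketch. The program names and values are hypothetical sample data.

```python
import matplotlib.pyplot as plt

# Hypothetical sample data for three programs
labels = ["BIM", "BBA", "BCA"]
values = [40, 35, 25]

fig = plt.figure(figsize=(4, 4))
plt.pie(values,
        labels=labels,
        startangle=90,               # start drawing from the top
        explode=[0.1, 0, 0],         # pull the first wedge out slightly
        autopct="%1.1f%%",           # show each share as a percentage
        colors=["gold", "skyblue", "lightgreen"])
plt.title("Students by Program")
plt.show()
```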