
NumPy Arrays and Vectorized Computation

0.0.1 NUMPY MODULE:


NumPy, short for Numerical Python, is a fundamental library for numerical computing in
Python. It provides powerful data structures, primarily the ndarray (n-dimensional array), which
enables efficient storage and manipulation of large datasets. With its support for multi-
dimensional arrays, NumPy allows users to perform complex mathematical operations with ease.
One of the key features of NumPy is its ability to perform element-wise operations on arrays,
which is significantly faster than using traditional Python lists. This efficiency stems from its
implementation in C, allowing for lower-level optimizations. NumPy also includes a comprehensive
set of mathematical functions that can operate on arrays, including linear algebra, Fourier
transforms, and random number generation. In addition to its array capabilities, NumPy provides
tools for integrating with other languages, such as C and Fortran, making it a versatile choice for
performance-critical applications. It serves as the backbone for many other scientific computing
libraries, including SciPy, pandas, and Matplotlib, establishing itself as an essential component
of the scientific Python ecosystem. NumPy’s array operations are broadcastable, meaning that
arrays of different shapes can still be used together in calculations, making it easier to handle
data of varying dimensions. This flexibility is particularly useful in data analysis and machine
learning tasks.
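The broadcasting behaviour mentioned above can be sketched briefly (a minimal illustration, not one of the original notebook cells):

```python
import numpy as np

# A (3, 1) column and a (3,) row are broadcast to a common (3, 3) shape:
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

result = col + row                  # both operands are virtually stretched
print(result)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]
```

Each output element is `col[i] + row[j]`, computed without either array being copied to the larger shape.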

1 1. Numpy Arrays from Python DataStructures, Intrinsic Numpy Objects and Random Functions

1.1 Arrays from python data structures

[1] : # To define a new ndarray using array() method
import numpy as np
a = np.array([1, 2, 3])
a
[1] : array([1, 2, 3])

[2] : #checking the type


type(a)
[2] : numpy.ndarray

[3] : #checking dtype


a.dtype

[3] : dtype('int32')

[4] : #ndim
a.ndim
[4]: 1

[5] : #size
a.size
[5]: 3

[6] : #shape
a.shape
[6]: (3,)

[7] : #numpy 1-D array


import numpy as np
a=[1,2,3,4,5]
b=np.array(a)
print(b)
[1 2 3 4 5]
[8] : #Creation of ndarrays using array() method
#2-D array
import numpy as np
x=[1,2,3]
y=[3,4,5]
z=np.array((x,y))
print(z)
[[1 2 3]
[3 4 5]]
[9] :
#Tuple 1-D
m=(1,2,3)
c=np.array(m)
print(c)
[1 2 3]
[10] : #Tuple 2-D array
import numpy as np
a=(1,2,3,4,5)
b=(6,7,8,9,1)
c=np.array((a,b))

print(c)

[[1 2 3 4 5]
[6 7 8 9 1]]
[11] : #set (np.array on a set gives a 0-d object array; use list(c) for a true 1-D array)
a=[1,2,3,4,5]
c=set(a)
np.array(c)
[11] : array({1, 2, 3, 4, 5}, dtype=object)

[12] : #Set
l={1,2,3,4}
print(np.array(l))
{1, 2, 3, 4}
[13] : #dictionary
import numpy as np
d={'a':1,'b':2,'c':3}
z=np.array(list(d.items()))
print(z)
a=np.array(list(d.keys()))
print(a)
[['a' '1']
 ['b' '2']
 ['c' '3']]
['a' 'b' 'c']

3 1.2 Intrinsic Numpy Objects


Intrinsic NumPy objects are fundamental data structures provided by the NumPy library, which
are optimized for numerical computations and provide efficient operations on large datasets.

[14] : #using arange() method
a=np.array(np.arange(9))
print(a)
[0 1 2 3 4 5 6 7 8]
[15] : #Zeros() method
a=np.zeros(3)
print(a)
[0. 0. 0.]

[16] : b=np.zeros([3,3])
print(b)

[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
[17] : #zeros_like()
x=np.array([[1,3,7],[2,5,9]])
x
[17] : array([[1, 3, 7],
[2, 5, 9]])

[18] : d=np.zeros_like(x)
d

[18] : array([[0, 0, 0],
       [0, 0, 0]])

[19] : #using ones() method


a=np.ones(4)
print(a)
[1. 1. 1. 1.]
[20] : b=np.ones([3,3])
print(b)

[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[21] :
#using eye() method
a=np.eye(3)
print(a)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
[22] :
c=np.eye(3,k=1)
print(c)
[[0. 1. 0.]
[0. 0. 1.]
[0. 0. 0.]]

[23] : #using identity() method
a=np.identity(3)
print(a)

[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
[24] :
#using full() method
d=np.full((2,2),7)
print(d)
[[7 7]
[7 7]]
[25] : np.arange(15)

[25] : array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

[26] : x=np.arange(6,dtype=int)
np.full_like(x,1)

[26] : array([1, 1, 1, 1, 1, 1])

[27] : c=np.full_like(x,0.1)   #0.1 is truncated to 0: full_like keeps x's int dtype
c

[27] : array([0, 0, 0, 0, 0, 0])

[28] : d=np.full_like(x,0.1,dtype=np.double)
d

[28] : array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

[29] : #using empty() method


a=np.empty((2,3))
print(a)

[[0.1 0.1 0.1]


[0.1 0.1 0.1]]
[30] : np.empty((2, 3, 2))

[30]: array([[[1.05337787e-311, 2.86558075e-322],
        [0.00000000e+000, 0.00000000e+000],
        [1.10343781e-312, 1.31370903e-076]],

       [[5.20093491e-090, 5.69847262e-066],
        [5.51292779e+169, 4.85649086e-033],
        [6.48224659e+170, 5.82471487e+257]]])

[31]: #empty_like()
a=([1,2,3],[4,5,6])
np.empty_like(a)

[31]: array([[1730487296,        496,          0],
       [         0,     131074,  168442489]])

[32]: #using diag() method

np.diag([1,2,3,4])

[32]: array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

[33]: #creating meshgrid()


x=np.array([1,2,3])
y=np.array([4,5,6])
x,y=np.meshgrid(x,y)
print(x)
print(y)
[[1 2 3]
[1 2 3]
[1 2 3]]
[[4 4 4]
[5 5 5]
[6 6 6]]

3.1 1.3 Random Functions


The random functions in NumPy are essential for simulations, statistical sampling, and generating
synthetic data. They help facilitate various operations in scientific computing, machine learning,
and data analysis.

[34] : #randint() method
from numpy import random
x = random.randint(100)
print(x)
15

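The draw above changes on every run. For reproducible experiments the generator can be seeded first; a small sketch using the legacy `np.random.seed` API (the notebook itself never sets a seed):

```python
import numpy as np

np.random.seed(42)                     # fix the seed so the sequence repeats
a = np.random.randint(0, 100, size=5)
np.random.seed(42)                     # same seed -> identical draws
b = np.random.randint(0, 100, size=5)
print(np.array_equal(a, b))
```

The same idea applies to `rand()`, `randn()`, `choice()` and the other functions used below.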
[35]: #bytes() and choice() methods
y=np.random.bytes(7)
print(y)
a=np.random.choice(['true','false'],size=(2,3))
print(a)
b"t'\n\x16\x14QB"
[['true' 'false' 'false']
 ['true' 'false' 'true']]
[36] : #complex number
x = random.rand(1) + random.rand(1)*1j
print(x)
print(x.real)
print(x.imag)
[0.08421058+0.69654499j]
[0.08421058]
[0.69654499]
[37] : #complex number using the rand() method
x = random.rand(1,5) + random.rand(1,5)*1j
print(x)

[[0.29653563+0.94629414j 0.56539718+0.58965768j 0.83340819+0.82456817j


0.16209606+0.15309722j 0.92519953+0.01018444j]]
[38] : #random() method
np.random.random(size=(2,2))+1j*np.random.random(size=(2,2))

[38] : array([[0.90898124+0.87349692j, 0.64895681+0.87327894j],


[0.7544518 +0.122983j, 0.4716534 +0.77610277j]])

[39] : #permutation()
np.random.permutation(5)
[39] : array([0, 3, 4, 1, 2])

[40] : a=np.array(5)
b=np.random.choice(a,size=5,p=[0.1,0.2,0.3,0.2,0.2])
print(b)

[4 3 0 3 4]
[41] : #randint()
np.random.randint(1,5)
[41]: 3

[42] : #randn()
a=np.random.randn(1,10)
print(a)

[[ 0.08009351 1.04758386 -0.15977457 0.60779634 0.12686552 -2.29032851


-0.53667358 -0.69266066 1.42867051 -0.34056088]]
[43] :
#choice()
a=np.array(['apple','bananaa','cherry'])
b=np.random.choice(a)
print(b)
bananaa
[44] : #shuffle()
np.random.shuffle(a)
print(a)
['cherry' 'apple' 'bananaa']

4 2.Manipulation Of Numpy Arrays

5 2.1 Indexing
Indexing in NumPy refers to accessing individual elements or groups of elements within an array.

[45] : #integer indexing
import numpy as np
x = np.array([[1, 2], [3, 4], [5, 6]])
y = x[[0,1,2], [0,1,0]]
print(x[0,1])
2
[46] : a=[3,4,5,6,7]
print(a[0])
3
[47] : arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d[0]
[47] : array([[1, 2, 3],
[4, 5, 6]])

[48] : #copy()
old_values = arr3d[0].copy()
arr3d[0] = 42

print(arr3d)

[[[42 42 42]
[42 42 42]]

 [[ 7  8  9]
  [10 11 12]]]
[49] : import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[2] + arr[3])
7
[50] : import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print( arr[0, 1])


2
[51] : import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print( arr[1, 4])


10
[52] : import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr[0, 1, 2])
6
[53] : import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print( arr[1, -1])


10

6 2.2 Slicing

Slicing in NumPy refers to the process of selecting a specific subset of elements from an array. It
allows you to create a new view of the original data without copying it, which can be very efficient
in terms of memory usage.

[54] : #slicing
import numpy as np
arr=np.array([5,6,7,8,9])
print(arr[1:3])
[6 7]
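Since a basic slice is a view of the original data, assignments through the slice are visible in the parent array; a brief illustrative sketch:

```python
import numpy as np

arr = np.array([5, 6, 7, 8, 9])
view = arr[1:3]      # basic slicing returns a view, not a copy
view[0] = 99         # writing through the view...
print(arr)           # ...changes the original array: [ 5 99  7  8  9]
```

To keep the original untouched, take an explicit copy with `arr[1:3].copy()`.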
[55] : import numpy as np
arr=np.array([5,6,7,3,6,8,9])
print(arr[1:])
[6 7 3 6 8 9]
[56] : import numpy as np
arr=np.array([1,2,3,4,5,8,9])
print(arr[1:])
[2 3 4 5 8 9]
[57] : arr=np.array([5,6,7,8,9])
print(arr[:3])

[5 6 7]
[58] : arr=np.array([5,6,7,8,9])
print(arr[-3:-1])
[7 8]
[59] : arr=np.array([5,6,7,8,9])
print(arr[:3])

[5 6 7]
[60] : arr=np.array([5,6,7,8,9])
print(arr[:3])

[5 6 7]
[61] : arr=np.array([5,6,7,8,9])
print(arr[-3:-1])
[7 8]

[62] : #slicing parameters separated by a
#colon : (start:stop:step) directly to the ndarray object
arr=np.array([5,6,7,8,4,5,6,7,9])
print(arr[1:5:2])
[6 8]
[63] : arr=np.array([5,6,7,8,4,5,6,7,9])
print(arr[-1:-5:-1])

[9 7 6 5]

[64]: import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[1, 1:4])

[7 8 9]

[65]: import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 2])

[3 8]

[66]: import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 1:4])

[[2 3 4]
[7 8 9]]
[67] : #string indexing
b = "Hello, World!"
print(b[2:5])

llo
[68]: b = "Hello, World!"
print(b[:5])

Hello

[69]: b = "Hello, World!"
print(b[2:])

llo, World!

7 2.3 Re-Shaping

Reshaping in NumPy is the process of changing the shape (i.e., dimensions) of an existing array
without altering the data. This is particularly useful when you need to transform an array to fit a
certain shape for further operations, such as machine learning or data processing tasks.

[70] : import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)
(2, 4)
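One reshape dimension can be left as -1 and NumPy infers it from the total size; a short sketch of this convenience (not shown in the cells below):

```python
import numpy as np

arr = np.arange(12)
a = arr.reshape(3, -1)    # -1 is inferred as 4, since 12 / 3 = 4
b = arr.reshape(-1, 6)    # -1 is inferred as 2, since 12 / 6 = 2
print(a.shape)            # (3, 4)
print(b.shape)            # (2, 6)
```

Only one dimension may be -1, and the remaining dimensions must divide the array's size evenly.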
[71]: import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('shape of array :', arr.shape)
[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
[72] : import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

arr1 = arr.reshape(4, 3)
print(arr1)
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[73]: import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

arr1 = arr.reshape(2, 2, 3)

print(arr1)

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
[74]: import numpy as np
a=np.arange(8)
print(a.reshape(4,2))

[[0 1]
 [2 3]
 [4 5]
 [6 7]]
[75] : a=np.arange(12).reshape(4,3)
print(a)
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]

8 2.4 Joining Arrays


Joining arrays in NumPy is a way of combining two or more arrays into a single array. There are
several ways to join arrays, depending on the desired result and the shape of the input arrays.

[76] : #concatenation
a1=np.arange(6).reshape(3,2)
a2=np.arange(6).reshape(3,2)
print(np.concatenate((a1,a2),axis=1))

[[0 1 0 1]
[2 3 2 3]
[4 5 4 5]]
[77] : #stack(); hstack() and vstack() appear below
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
print(np.stack((a,b)))

[[[1 2]
[3 4]]

[[5 6]
[7 8]]]
[78] : #stack()
print(np.stack((a,b),axis=0))

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
[79]: print(np.stack((a,b),axis=1))

[[[1 2]
  [5 6]]

 [[3 4]
  [7 8]]]
[80]: #hstack()
ch = np.hstack((a,b))
print(ch)

[[1 2 5 6]
[3 4 7 8]]
[81] : #vstack()
ch = np.vstack((a,b))
print(ch)

[[1 2]
[3 4]
[5 6]
[7 8]]

9 2.5 Splitting

Splitting in NumPy involves dividing an array into multiple sub-arrays. This can be useful when
you need to partition data for different processing purposes or when dealing with chunks of data
in a structured way.

[82] : import numpy as np
a = np.arange(9)
print(a)

[0 1 2 3 4 5 6 7 8]

[83]: #split()
b = np.split(a,3)
print(b)
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]
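np.split() requires that the array divide evenly; when it does not, np.array_split() can be used instead. A brief sketch of the difference:

```python
import numpy as np

a = np.arange(8)
# np.split(a, 3) would raise ValueError: 8 elements cannot be split 3 equal ways.
parts = np.array_split(a, 3)   # allows unequal sub-arrays
print([p.tolist() for p in parts])
# [[0, 1, 2], [3, 4, 5], [6, 7]]
```

array_split() distributes the remainder over the leading sub-arrays, so the first groups are one element longer.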
[84]: #horizontal split-hsplit()
a = np.arange(12).reshape(4,3)
b=np.hsplit(a,3)
print(b)
[array([[0],
       [3],
       [6],
       [9]]), array([[ 1],
       [ 4],
       [ 7],
       [10]]), array([[ 2],
       [ 5],
       [ 8],
       [11]])]
[85] : #vertical split-vsplit()
b=np.vsplit(a,2)
print(b)
[array([[0, 1, 2],
       [3, 4, 5]]), array([[ 6,  7,  8],
       [ 9, 10, 11]])]

10 3.Computation On Numpy Arrays Using Universal Functions

11 3.1 Unary Universal Functions


Unary Universal Functions (also known as unary ufuncs) in NumPy are mathematical functions
that operate on a single input array element-wise. These functions apply a specific mathematical
operation to each element of an array independently, resulting in an output array of the same shape.

[86] : arr = np.arange(10)
print(arr)
[0 1 2 3 4 5 6 7 8 9]
[87]: #sqrt() function
np.sqrt(arr)

[87] : array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])

[88] : #exp()
np.exp(arr)
[88] : array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

[89] : #min()
np.min(arr)
[89]: 0

[90] : #max()
np.max(arr)
[90]: 9

[91] : #average()
np.average(arr)
[91]: 4.5

[92] : #abs()
print(np.abs(arr))
[0 1 2 3 4 5 6 7 8 9]
[93] : #fabs()
arr=np.arange(0,-5,-0.5)
print(np.fabs(arr))
[0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]

12 3.2 Binary Universal Functions


Binary Universal Functions (also known as binary ufuncs) operate on two input arrays element-wise.
These functions require two arrays (or one array and one scalar) and perform a mathematical
operation between corresponding elements.

[94] : x = np.random.randn(8)
y = np.random.randn(8)
print(x)
[ 1.11097262 -0.26995231  0.0060993   1.04398907 -1.82141342  0.00998652
  0.08274781  0.82046885]
[95] :
print(y)

[-0.05342373 0.10817525 -0.4610533 0.5755554 -0.66695438 0.25344274


1.40395846 -0.87447163]
[96] :
np.maximum(x, y)
[96] : array([ 1.11097262, 0.10817525, 0.0060993 , 1.04398907, -0.66695438,
0.25344274, 1.40395846, 0.82046885])

[97] : arr = np.random.randn(7) * 5

remainder, whole_part = np.modf(arr)
print(remainder)

[-0.98958028  0.75318997  0.47148313  0.96309562 -0.84443205  0.60019609
  0.41412946]
[98] :
print(whole_part)
[-6. 7. 2. 1. -2. 5. 3.]

[99] : import numpy as np


a = np.arange(9).reshape(3,3)
b = np.array([[10,10,10],[10,10,10],[10,10,10]])
print(np.add(a,b))
[[10 11 12]
[13 14 15]
[16 17 18]]
[100] :
np.subtract(a,b)
[100] : array([[-10, -9, -8],
[ -7, -6, -5],
[ -4, -3, -2]])

[101] : np.multiply(a,b)

[101] : array([[ 0, 10, 20],
       [30, 40, 50],
       [60, 70, 80]])

[102] : np.divide(a,b)

[102] : array([[0. , 0.1, 0.2],
       [0.3, 0.4, 0.5],
       [0.6, 0.7, 0.8]])

[103] : import numpy as np
a = np.array([10,100,1000])
np.power(a,2)
[103] : array([ 100, 10000, 1000000], dtype=int32)

13 4. Compute Statistical and Mathematical Methods and Comparison Operations on rows/columns

13.1 4.1 Mathematical and Statistical methods on Numpy Arrays
NumPy provides a variety of mathematical and statistical methods to perform operations on arrays.

[104] : a = np.array([[3,7,5],[8,4,3],[2,4,9]])
a
[104] : array([[3, 7, 5],
       [8, 4, 3],
       [2, 4, 9]])

[105] : #sum()
a.sum()
[105]: 45

[106] : #percentile()
import numpy as np
a = np.array([[30,40,70],[80,20,10],[50,90,60]])
np.percentile(a,90)
[106]: 82.0

[107] : arr = np.random.randn(5, 4)

[108] : #mean()
arr.mean()
[108]: -0.14756616582071838

[109] : arr.mean(axis=1)

[109] : array([-0.93641711, 0.12758996, -0.44993246, 0.13099294, 0.38993583])

[110] : #median()
np.median(arr)
[110]: -0.28413298907449897

[111] : #standard deviation
np.std(arr)
[111]: 0.9329450218698545

[112] : #variance
np.var(arr)
[112]: 0.8703864138317433

[113] : #sum()
arr.sum(axis=0)
[113] : array([ 0.68253865, -2.88096912, -2.108008 , 1.35511515])

[114] : arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])


print(arr.cumsum())

[ 0 1 3 6 10 15 21 28]
[115] : arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print(arr.cumsum(axis=0))
[[ 0  1  2]
 [ 3  5  7]
 [ 9 12 15]]
[116] : print(arr.cumprod(axis=1))

[[ 0 0 0]
[ 3 12 60]
[ 6 42 336]]

13.2 4.2 Comparison Operations


Comparison operations in NumPy allow element-wise comparison between arrays or with scalars.
[117] :
#array_equal()
a=np.array([[1,2],[3,4]])
b=np.array([[1,2],[3,4]])
print(np.array_equal(a,b))
True
[118] : a=np.array([1,15,6,8])
b=np.array([11,12,6,4])

[119] : #greater()
print(np.greater(a,b))

[False True False True]


[120] : print(np.greater(a[0],b[2]))
False
[121] : #greater_equal()
print(np.greater_equal(a,b))

[False  True  True  True]


[122] : #less()
print(np.less(a[0],b[2]))

True
[123] : print(np.less(a,b))
[ True False False False]

[124] : #less_equal()
print(np.less_equal(a,b))
[ True False True False]

14 5. Computation on Numpy Arrays using Sorting, unique and Set Operations

14.1 5.1 Sorting

Sorting helps to arrange elements of an array in a particular order.

[125] : import numpy as np
a = np.array([[3,7],[9,1]])
print(a)
[[3 7]
[9 1]]
[126] : #sort()
np.sort(a)
[126] : array([[3, 7],
[1, 9]])

[127] : np.sort(a,axis=0)

[127] : array([[3, 1],


[9, 7]])

[128] : np.sort(a,axis=1)

[128] : array([[3, 7],


[1, 9]])

[129] : arr = np.random.randn(5, 3)


print(arr)

[[-0.92147727 -0.67857177 -0.04478315]


[-0.30378745 -0.95433394 -1.83418572]
[-0.48103436 -0.55413111 1.28233061]
[ 0.76260305  1.30994277  0.32818117]
 [ 1.87598839 -0.35057108  0.47603584]]
[130] : arr.sort(1)
print(arr)
[[-0.92147727 -0.67857177 -0.04478315]
[-1.83418572 -0.95433394 -0.30378745]
[-0.55413111 -0.48103436 1.28233061]
[ 0.32818117 0.76260305 1.30994277]
[-0.35057108 0.47603584 1.87598839]]

14.2 5.2 Unique Operation


NumPy provides functions that perform set operations on arrays.

[131] : #unique()
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
print(np.unique(names))
['Bob' 'Joe' 'Will']
[132] : # Contrast np.unique with the pure Python alternative:
sorted(set(names))
[132]: ['Bob', 'Joe', 'Will']
[133]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
print(np.unique(ints))
[1 2 3 4]

14.3 5.3 Set Operations

[134]: #in1d() method
import numpy as np
values = np.array([6, 0, 0, 3, 2, 5, 6])
print(np.in1d(values, [2, 3, 6]))

[ True False False True True False True]


[135]: arr1=np.array([1,2,3,4])
arr2=np.array([3,4,5,6])
[136]: #union1d()
print(np.union1d(arr1,arr2))
[1 2 3 4 5 6]
[137]: #intersect1d()
print(np.intersect1d(arr1,arr2))
[3 4]
[138]: #setdiff1d()
print(np.setdiff1d(arr1,arr2))
[1 2]
[139]: #setxor1d()
print(np.setxor1d(arr1,arr2))
[1 2 5 6]

15 6. Load an image file and do crop and flip operation using Numpy indexing

To load and manipulate images with NumPy, you can use the Pillow (PIL) library to load an
image and convert it into a NumPy array.

[8] : from PIL import Image
img=Image.open("img.jpg")
img.format
[8] : 'JPEG'

[9] : import numpy as np


a=np.array(img)
print(a)

[[[242 242 242]
[242 242 242]
[242 242 242]

[195 195 195]
[195 195 195]
[195 195 195]]

[[242 242 242]


[242 242 242]
[242 242 242]

[195 195 195]
[195 195 195]
[195 195 195]]

[[242 242 242]


[242 242 242]
[242 242 242]

[195 195 195]
[195 195 195]
[195 195 195]]

[[208 208 208]


[208 208 208]
[206 206 206]

[163 163 163]
[163 163 163]
[163 163 163]]

[[208 208 208]


[207 207 207]
[206 206 206]

[164 164 164]
[164 164 164]
[164 164 164]]

[[207 207 207]


[207 207 207]
[205 205 205]

[165 165 165]
[165 165 165]
[165 165 165]]]
[10] : from IPython.display import display
# Display the original, cropped and flipped images
display(Image.fromarray(a))

[11] : crop_img=a[100:900,100:900,:]
img_out=Image.fromarray(crop_img)
img_out
[11]:

[12] : flipped_img=np.flipud(a)
display(Image.fromarray(flipped_img))

[ ]:

Data Manipulation with Pandas

1 1.create pandas series from python List ,Numpy


Arrays and Dictionary

2 1.1 Pandas Series From Python List


[1] : import pandas as pd
import numpy as np
data=[4,7,-5,3]
a=pd.Series(data)
print(a)

0 4
1 7
2 -5
3 3
dtype: int64

[2] : # import pandas lib. as pd
import pandas as pd

# create Pandas Series with define indexes


x = pd.Series([10, 20, 30, 40, 50], index =['a', 'b', 'c', 'd', 'e'])

# print the Series


print(x)
a 10
b 20
c 30
d 40
e 50
dtype: int64
[3] : import pandas as pd

ind = [10, 20, 30, 40, 50, 60, 70]

lst = ['G', 'h', 'i', 'j', 'k', 'l', 'm']

# create Pandas Series with define indexes


x = pd.Series(lst, index = ind)

# print the Series


print(x)
10 G
20 h
30 i
40 j
50 k
60 l
70 m
dtype: object
3 1.2 Pandas Series From Numpy arrays

[4] : import pandas as pd
import numpy as np

# numpy array
data = np.array(['a', 'b', 'c', 'd', 'e'])

# creating series
s = pd.Series(data)
print(s)

0 a
1 b
2 c
3 d
4 e
dtype: object

[5] : # importing Pandas & numpy


import pandas as pd
import numpy as np

# numpy array
data = np.array(['a', 'b', 'c', 'd', 'e'])

# creating series
s = pd.Series(data, index =[1000, 1001, 1002, 1003, 1004])
print(s)

1000 a
1001 b
1002 c
1003 d
1004 e
dtype: object

[6] : numpy_array = np.array([1, 2.8, 3.0, 2, 9, 4.2])


# Convert NumPy array to Series
s = pd.Series(numpy_array, index=list('abcdef'))
print("Output Series:")
print(s)
Output Series:
a 1.0
b 2.8
c 3.0
d 2.0
e 9.0
f 4.2
dtype: float64
4 1.3 Pandas Series From Dictionary

[7] : import pandas as pd
# create a dictionary
dictionary = {'D': 10, 'B': 20, 'C': 30}

# create a series
series = pd.Series(dictionary)

print(series)
D 10
B 20
C 30
dtype: int64
[8] : # import the pandas lib as pd
import pandas as pd
# create a dictionary
dictionary = {'A': 50, 'B': 10, 'C': 80}

# create a series
series = pd.Series(dictionary, index=['B','C','A'])

print(series)

B 10
C 80
A 50
dtype: int64
[9] :
import pandas as pd

# create a dictionary
dictionary = {'A': 50, 'B': 10, 'C': 80}

# create a series
series = pd.Series(dictionary, index=['B', 'C', 'D', 'A'])
print(series)
B 10.0
C 80.0
D NaN
A 50.0
dtype: float64

4.1 2. Data Manipulation with Pandas Series

4.2 2.1 Indexing

[10] : import pandas as pd
import numpy as np

# creating simple array


data = np.array(['s','p','a','n','d','a','n','a'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17])
print(ser[16])
n
[11] : import pandas as pd

Date = ['1/1/2018', '2/1/2018', '3/1/2018', '4/1/2018']


Index_name = ['Day 1', 'Day 2', 'Day 3', 'Day 4']
sr = pd.Series(data = Date, index = Index_name)
print(sr)
Day 1 1/1/2018
Day 2 2/1/2018

Day 3 3/1/2018
Day 4 4/1/2018
dtype: object
[12] :
print(sr['Day 1'])
1/1/2018

[13] : import numpy as np


import pandas as pd
s=pd.Series(np.arange(5.),index=['a','b','c','d','e'])
print(s)
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
4.3 2.2 Selecting

[24] : import numpy as np
import pandas as pd
s=pd.Series(np.arange(5.),index=['a','b','c','d','e'])
print(s)
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
[19] : s['b']

[19]: 1.0

[26]: s[['b','a','d']]

[26] : b 1.0
a 0.0
d 3.0
dtype: float64

[27] : s['b':'e']

[27]: b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
[20] : s[1]

[20]: 1.0

[21] : s[2:4]

[21]: c 2.0
d 3.0
dtype: float64

[23]: s[[1,3]]

[23]: b 1.0
d 3.0
dtype: float64

[28] :
print(s[[0, 2, 4]])

a 0.0
c 2.0
e 4.0
dtype: float64
4.4 2.3 Filtering

[4] : import numpy as np
import pandas as pd
s=pd.Series(np.arange(5.),index=['a','b','c','d','e'])
print(s)
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
[32]: s[s<2]

[32]: a 0.0
dtype: float64

[36]: s[s>2]

[36]: b 5.0
d 3.0
e 4.0
dtype: float64

[35]: s[s!=2]
[35]: a 0.0
b 5.0
d 3.0
e 4.0
dtype: float64
[38]: s[(s>2)&(s<5) ]

[38]: d 3.0
e 4.0
dtype: float64

[33]: s['b':'c']

[33]: b 5.0
c 2.0
dtype: float64
[7] : print(s[1:2]==5)
b True
dtype: bool

[42] : s[s.isin([2,4])]

[42]: c 2.0
e 4.0
dtype: float64

4.5 2.4 Arithmetic Operations

[8] : import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])

[3]: series3 = series1 + series2


print(series3)

0 7
1 9
2 11
3 13
4 15
dtype: int64
[4] : series3 = series1 - series2
print(series3)
0 -5
1 -5
2 -5
3 -5
4 -5
dtype: int64
[5] : series3 = series1 * series2
print(series3)
0 6
1 14
2 24
3 36
4 50
dtype: int64
[6] : series3 = series1 / series2
print(series3)
0 0.166667
1 0.285714
2 0.375000
3 0.444444
4 0.500000
dtype: float64
[9] : series3 = series1 % series2
print(series3)
0 1
1 2
2 3
3 4
4 5
dtype: int64
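The two series above share the same default index; when indexes differ, pandas aligns on labels and fills unmatched positions with NaN. A brief sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2      # aligned on the union of the two indexes
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```

Labels present in only one operand ('a' and 'd') produce NaN, and the result is upcast to float64 to hold them.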

4.6 2.5 Ranking

[10] : import pandas as pd
s=pd.Series([121,211,153,214,115,116,237,118,219,120])
s.rank(ascending=True)

[10]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

[49]: s.rank(ascending=False)

[49]: 0 6.0
1 4.0
2 5.0
3 3.0
4 10.0
5 9.0
6 1.0
7 8.0
8 2.0
9 7.0
dtype: float64

[11]: s.rank(method='min')

[11]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

[12]: s.rank(method='max')

[12]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64

[50]: s.rank(method='first')

[50]: 0 5.0
1 7.0
2 6.0
3 8.0
4 1.0
5 2.0
6 10.0
7 3.0
8 9.0
9 4.0
dtype: float64
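All values in the series above are distinct, so method='min', 'max' and 'first' return identical ranks; the methods only differ when there are ties. A small sketch with hypothetical duplicate data:

```python
import pandas as pd

t = pd.Series([7, 3, 7, 1])                # the two 7s tie for ranks 3 and 4
print(t.rank().tolist())                   # average (default): [3.5, 2.0, 3.5, 1.0]
print(t.rank(method='min').tolist())       # ties share the lowest rank: [3.0, 2.0, 3.0, 1.0]
print(t.rank(method='max').tolist())       # ties share the highest rank: [4.0, 2.0, 4.0, 1.0]
print(t.rank(method='first').tolist())     # ties broken by position: [3.0, 2.0, 4.0, 1.0]
```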

4.7 2.6 Sorting

[52] : import pandas as pd
sr = pd.Series([19.5, 16.8, 22.78, 20.124, 18.1002])
print(sr)

0 19.5000
1 16.8000
2 22.7800
3 20.1240
4 18.1002
dtype: float64

[8] : sr.sort_values(ascending = False)

[8]: 2 22.7800
3 20.1240
0 19.5000
4 18.1002
1 16.8000
dtype: float64
[53]: sr.sort_values(ascending = True)
[53]: 1 16.8000
4 18.1002
0 19.5000
3 20.1240
2 22.7800
dtype: float64

[55]: sr.sort_index()
[55]: 0 19.5000
1 16.8000
2 22.7800
3 20.1240
4 18.1002

dtype: float64
[58] : print(sr.sort_values(kind='mergesort'))

1 16.8000
4 18.1002
0 19.5000
3 20.1240
2 22.7800
dtype: float64
4.8 2.7 Checking null values

[40] : s=pd.Series({'ohio':35000,'teyas':71000,'oregon':16000,'utah':5000})
print(s)
states=['california','ohio','Texas','oregon']
x=pd.Series(s,index=states)
print(x)

ohio 35000
teyas 71000
oregon 16000
utah 5000
dtype: int64
california NaN
ohio 35000.0
Texas NaN
oregon 16000.0
dtype: float64

[42] : x.isnull()
[42]: california True
ohio False
Texas True
oregon False
dtype: bool

[44]: x.notnull()

[44]: california False
ohio True
Texas False
oregon True
dtype: bool

4.9 2.8 Concatenation

[19] : # creating the Series
series1 = pd.Series([1, 2, 3])
series2 = pd.Series(['A', 'B', 'C'])
[65]: # concatenating
display(pd.concat([series1, series2]))
0 1
1 2
2 3
0 A
1 B
2 C
dtype: object
[66] : display(pd.concat([series1, series2], axis = 1))
0 1
0 1 A
1 2 B
[67] 2 3 C
:
display(pd.concat([series1, series2],
axis = 0))
0 1
1 2
2 3
0 A
1 B

2 C
dtype: object
[21] : print(pd.concat([series1, series2], ignore_index=True))

0 1
1 2
2 3
3 A
4 B
5 C
dtype: object
[22] : print(pd.concat([series1, series2], ignore_index=False))

0 1
1 2
2 3
0 A
1 B
2 C
dtype: object
[69] : print(pd.concat([series1, series2], keys=['series1', 'series2']))

series1 0 1
1 2
2 3
series2 0 A
1 B
2 C
dtype: object

4.10 3. Creating DataFrames from List and Dictionary

4.11 3.1 From List

[16] : data = [1, 2, 3, 4, 5]

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Numbers'])
print(df)
Numbers
0 1
1 2
2 3
3 4
4 5

[70]: import pandas as pd
nme = ["aparna", "pankaj", "sudhir", "Geeku"]
deg = ["MBA", "BCA", "M.Tech", "MBA"]
scr = [90, 40, 80, 98]
d = {'name': nme, 'degree': deg, 'score': scr}
df = pd.DataFrame(d)
print(df)
name degree score
0 aparna MBA 90
1 pankaj BCA 40
2 sudhir M.Tech 80
3 Geeku MBA 98
[38]: import pandas as pd
data = [['G', 10], ['h', 15], ['i', 20]]
# Create the pandas Dataframe
df = pd.DataFrame(data, columns = ['Name', 'Age'])
# print dataframe.
print(df)
Name Age
0 G 10
1 h 15
2 i 20

4.12 3.2 From Dictionary

[39] : df=pd.DataFrame({'a':[4,5,6],'b':[7,8,9],'c':[10,11,12]},index=[1,2,3])
print(df)
a b c
1 4 7 10
2 5 8 11
3 6 9 12
[13] : df=pd.DataFrame({'state':['AP','AP','AP','TS','TS','TS'],'year':[2000,2001,2002,2000,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9,3.2]})

print(df)

state year pop


0 AP 2000 1.5
1 AP 2001 1.7
2 AP 2002 3.6
3 TS 2000 2.4
4 TS 2001 2.9
5 TS 2002 3.2

[14] : df=pd.DataFrame({'a':[4,5,6],'b':[7,8,9]},index=pd.MultiIndex.from_tuples([('d',1),('d',2),('e',2)],names=['n','v']))
print(df)

a b
n v
d 1 4 7
2 5 8
e 2 6 9
[71]: df=pd.DataFrame({'ap':{'a':0.0,'c':3.0,'d':6.0},'ts':{'a':1.0,'c':4.0,'d':7.0},'tn':{'a':2.0,'c':5.0,'d':8.0}})

[71]: df.reindex(['a','b','c','d'])
ap ts tn
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

4.13 4. Import various file formats to pandas DataFrames and perform the following

4.14 4.1 Importing file

[10] : import pandas as pd
data=pd.read_csv('bird.csv')
data

[10]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
0 SW
1 SW
2 SW
3 SW

4 SW
.. …
415 SO
416 SO
417 SO
418 SO
419 SO

[420 rows x 12 columns]
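Only read_csv is demonstrated in this section; pandas offers analogous readers such as read_json (and read_excel, which needs an engine like openpyxl). A self-contained sketch using in-memory text in place of a file like bird.csv:

```python
import io
import pandas as pd

# Hypothetical miniature data in two formats, held in memory instead of on disk.
csv_text = "id,huml\n0,80.78\n1,88.91\n"
json_text = '[{"id": 0, "huml": 80.78}, {"id": 1, "huml": 88.91}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)             # (2, 2)
print(df_csv.equals(df_json))   # both readers yield the same DataFrame here
```

For real files, the same calls take a path: pd.read_json('bird.json'), and so on.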

4.15 4.2 Display top and bottom five rows

[15] : data.head(5)
[15] : id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw type
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84 SW
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01 SW
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34 SW
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41 SW
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13 SW

[16] : data.tail(5)

[16]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
415 SO
416 SO
417 SO
418 SO
419 SO

4.16 4.3 Get shape, data type, null values, index and column details

[17] : data.shape
[17]: (420, 12)

[18] : data.dtypes

[18] : id int64
huml float64
humw float64
ulnal float64
ulnaw float64
feml float64
femw float64
tibl float64
tibw float64
tarl float64
tarw float64
type object
dtype: object

[19] : data.isnull().sum()

[19]: id 0
huml 1
humw 1
ulnal 3
ulnaw 2
feml 2
femw 1
tibl 2
tibw 1
tarl 1
tarw 1
type 0
dtype: int64

[20] : data.columns

[20]: Index(['id', 'huml', 'humw', 'ulnal', 'ulnaw', 'feml', 'femw', 'tibl',


'tibw', 'tarl', 'tarw', 'type'],
dtype='object')

[21]: data.index

[21]: RangeIndex(start=0, stop=420, step=1)

4.17 4.4 Select/Delete the records rows/columns based on conditions

[24] : data.loc[data['huml']>4]

[24]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34

3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
0 SW
1 SW
2 SW
3 SW
4 SW
.. …
415 SO
416 SO
417 SO
418 SO
419 SO

[419 rows x 12 columns]

[25] : data.drop([0,3])

[25]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
5 5 61.92 4.78 50.46 3.47 49.52 4.41 56.95 2.73 29.07 2.83
6 6 79.73 5.94 67.39 4.50 42.07 3.41 71.26 3.56 37.22 3.64
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
1 SW
2 SW
4 SW
5 SW
6 SW
.. …

415 SO
416 SO
417 SO
418 SO
419 SO

[418 rows x 12 columns]

[27]: data.drop(data[data['huml']>4.3].index)

[27] : id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw type
342 342 NaN NaN NaN NaN 32.54 2.65 55.06 2.81 38.94 2.25 SO

[28] : data.loc[6,'ulnal']

[28]: 67.39

[29] : data.loc[11:15][['huml','humw']]

[29] : huml humw


11 186.00 9.83
12 172.00 8.44
13 148.91 6.78
14 149.19 6.98
15 140.59 6.59
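Note that `data.loc[11:15]` returned five rows, not four: label-based slicing with `.loc` includes both endpoints, unlike position-based slicing. A minimal check on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'a': range(20)})
by_label = df.loc[11:15]      # label-based: includes both 11 and 15
by_position = df.iloc[11:15]  # position-based: excludes 15
print(len(by_label), len(by_position))  # 5 4
```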

4.18 4.5 Sorting and Ranking operations in DataFrame

[30]: data

[30]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
.. … … … … … … … … … … …
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05

type
0 SW
1 SW
2 SW

3 SW
4 SW
.. …
415 SO
416 SO
417 SO
418 SO
419 SO

[420 rows x 12 columns]

[31] : data.sort_index(ascending=False)

[31]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
419 419 17.89 1.44 19.26 1.10 17.62 1.34 29.81 1.24 21.69 1.05
418 418 20.38 1.78 22.53 1.50 21.35 1.48 36.09 1.53 25.98 1.24
417 417 18.79 1.63 19.83 1.53 20.96 1.43 34.45 1.41 22.86 1.21
416 416 19.21 1.64 20.76 1.49 19.24 1.45 33.21 1.28 23.60 1.15
415 415 17.96 1.63 19.25 1.33 18.36 1.54 31.25 1.33 21.99 1.15
.. … … … … … … … … … … …
4 4 62.80 4.84 52.09 3.73 33.95 2.72 56.27 2.96 31.88 3.13
3 3 77.65 5.70 65.76 4.77 40.04 3.52 69.17 3.40 35.78 3.41
2 2 79.97 6.37 69.26 5.28 43.07 3.90 75.35 4.04 38.31 3.34
1 1 88.91 6.63 80.53 5.59 47.04 4.30 80.22 4.51 41.50 4.01
0 0 80.78 6.68 72.01 4.88 41.81 3.70 5.50 4.03 38.70 3.84

type
419 SO
418 SO
417 SO
416 SO
415 SO
.. …
4 SW
3 SW
2 SW
1 SW
0 SW

[420 rows x 12 columns]

[32] : data.sort_values(['ulnaw']).head(6)

[32] : id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
369 369 13.48 1.27 16.00 1.00 12.67 1.10 23.12 0.88 16.34 0.89
413 413 12.95 1.16 14.09 1.03 13.03 1.03 22.13 0.96 15.19 1.02
395 395 15.62 1.28 18.52 1.06 15.75 1.17 28.63 1.03 21.39 0.88

367 367 13.31 1.17 16.47 1.06 12.32 0.93 22.47 0.95 15.97 0.75
414 414 13.63 1.16 15.22 1.06 13.75 0.99 23.13 0.96 15.62 1.01
376 376 13.52 1.28 17.88 1.07 15.10 1.05 25.14 1.23 17.81 0.69

type
369 SO
413 SO
395 SO
367 SO
414 SO
376 SO

[33] : data.sort_values(by=['ulnaw','ulnal']).head(6)

[33]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
369 369 13.48 1.27 16.00 1.00 12.67 1.10 23.12 0.88 16.34 0.89
413 413 12.95 1.16 14.09 1.03 13.03 1.03 22.13 0.96 15.19 1.02
414 414 13.63 1.16 15.22 1.06 13.75 0.99 23.13 0.96 15.62 1.01
367 367 13.31 1.17 16.47 1.06 12.32 0.93 22.47 0.95 15.97 0.75
395 395 15.62 1.28 18.52 1.06 15.75 1.17 28.63 1.03 21.39 0.88
376 376 13.52 1.28 17.88 1.07 15.10 1.05 25.14 1.23 17.81 0.69

type
369 SO
413 SO
414 SO
367 SO
395 SO
376 SO
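With multiple sort keys, ties on the first key are broken by the second, and `ascending` can take a list so each key sorts in its own direction. A small sketch with made-up measurements:

```python
import pandas as pd

df = pd.DataFrame({'ulnaw': [1.06, 1.06, 1.00],
                   'ulnal': [16.47, 18.52, 16.00]})
# Sort ulnaw ascending; within ties, sort ulnal descending
out = df.sort_values(by=['ulnaw', 'ulnal'], ascending=[True, False])
print(out.index.tolist())  # [2, 1, 0]
```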

[34] : data.rank().head(10)

[34]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 1.0 289.0 344.0 275.0 325.5 289.0 295.0 1.0 302.0 272.0 328.0
1 2.0 308.0 343.0 284.0 343.0 312.0 320.0 308.0 327.5 285.0 333.0
2 3.0 286.0 336.0 268.0 334.0 295.0 303.5 292.0 303.5 271.0 305.5
3 4.0 284.0 308.0 255.0 313.5 279.0 288.0 270.0 272.5 247.0 310.5
4 5.0 248.0 281.0 227.5 258.0 224.0 225.5 231.0 250.0 211.0 294.0
5 6.0 246.0 275.0 223.0 242.0 326.0 322.0 234.0 234.0 181.0 268.5
6 7.0 285.0 321.0 262.0 304.0 292.0 282.5 279.0 280.5 259.0 320.0
7 8.0 304.0 306.0 278.0 306.0 300.0 299.0 296.0 295.5 266.0 324.0
8 9.0 362.0 370.0 354.0 362.0 365.0 356.5 363.5 359.0 352.0 346.0
9 10.0 387.0 399.0 381.5 383.0 382.0 398.0 382.0 397.0 392.0 377.0

type
0 274.5
1 274.5
2 274.5
3 274.5
4 274.5
5 274.5
6 274.5
7 274.5
8 274.5
9 274.5

[35] : data.rank().head(2)

[35]: id huml humw ulnal ulnaw feml femw tibl tibw tarl tarw \
0 1.0 289.0 344.0 275.0 325.5 289.0 295.0 1.0 302.0 272.0 328.0
1 2.0 308.0 343.0 284.0 343.0 312.0 320.0 308.0 327.5 285.0 333.0

type
0 274.5
1 274.5

[15]: data.rank(ascending=False).head(5)
[15]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 891.0 617.0 246.0 783.0 289.0 497.0 179.0 552.5 220.0
1 890.0 171.5 783.5 701.0 734.5 183.0 179.0 552.5 112.0
2 889.0 171.5 246.0 538.0 734.5 404.5 587.5 552.5 17.0
3 888.0 171.5 783.5 619.0 734.5 226.5 179.0 552.5 824.5
4 887.0 617.0 246.0 876.0 289.0 226.5 587.5 552.5 283.0

Fare Cabin Embarked
0 815.0 NaN 322.5
1 103.0 94.0 805.5
2 659.5 NaN 322.5
3 144.0 134.5 322.5
4 628.0 NaN 322.5
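By default `rank()` gives tied values the average of the ranks they occupy; other tie-breaking rules are available via `method`. A minimal example:

```python
import pandas as pd

s = pd.Series([7, 7, 3])
print(s.rank().tolist())                 # [2.5, 2.5, 1.0] — ties share the average rank
print(s.rank(method='min').tolist())     # [2.0, 2.0, 1.0] — ties take the lowest rank
print(s.rank(ascending=False).tolist())  # [1.5, 1.5, 3.0] — largest value ranked first
```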

4.18.1 4.6 Statistical operations


[23]: import pandas as pd
data = pd.read_csv('gym-track.csv')
data

[23]: Age Gender Weight (kg) Height (m) Max_BPM Avg_BPM Resting_BPM \
0 56 Male 88.3 1.71 180 157 60
1 46 Female 74.9 1.53 179 151 66
2 32 Female 68.1 1.66 167 122 54
3 25 Male 53.2 1.70 190 164 56
4 38 Male 46.1 1.79 188 158 68

.. … … … … … … …
968 24 Male 87.1 1.74 187 158 67
969 25 Male 66.6 1.61 184 166 56
970 59 Female 60.4 1.76 194 120 53
971 32 Male 126.4 1.83 198 146 62
972 46 Male 88.7 1.63 166 146 66

Session_Duration (hours) Calories_Burned Workout_Type Fat_Percentage \
0 1.69 1313.0 Yoga 12.6
1 1.30 883.0 HIIT 33.9
2 1.11 677.0 Cardio 33.4
3 0.59 532.0 Strength 28.8
4 0.64 556.0 Strength 29.2
.. … … … …
968 1.57 1364.0 Strength 10.0
969 1.38 1260.0 Strength 25.0
970 1.72 929.0 Cardio 18.8
971 1.10 883.0 HIIT 28.2
972 0.75 542.0 Strength 28.8

Water_Intake (liters) Workout_Frequency (days/week) Experience_Level \
0 3.5 4 3
1 2.1 4 2
2 2.3 4 2
3 2.1 3 1
4 2.8 3 1
.. … … …
968 3.5 4 3
969 3.0 2 1
970 2.7 5 3
971 2.1 3 2
972 3.5 2 1

BMI
0 30.20
1 32.00
2 24.71
3 18.41
4 14.39
.. …
968 28.77
969 25.69
970 19.50
971 37.74
972 33.38

[973 rows x 15 columns]

[25]: data['Age'].mean()

[25]: 38.68345323741007

[28]: data['Age'].median()

[28]: 40.0

[29]: data['Age'].std()

[29]: 12.180927866987108

[30] : data['Age'].sum()

[30]: 37639

[31] : data['Age'].var()

[31]: 148.37500370074312
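The same statistics can be checked by hand on a tiny Series. Note that pandas' `var()` and `std()` use the sample definition (ddof=1) by default:

```python
import pandas as pd

ages = pd.Series([10, 20, 30, 40])
print(ages.mean())           # 25.0
print(ages.median())         # 25.0
print(ages.sum())            # 100
print(round(ages.var(), 4))  # 166.6667  (= 500 / 3 with ddof=1)
```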

4.18.2 4.7 Count and uniqueness of given categorical values


[35]: data.count()

[35]: Age 973


Gender 973
Weight (kg) 973
Height (m) 973
Max_BPM 973
Avg_BPM 973
Resting_BPM 973
Session_Duration (hours) 973
Calories_Burned 973
Workout_Type 973
Fat_Percentage 973
Water_Intake (liters) 973
Workout_Frequency (days/week) 973
Experience_Level 973
BMI 973
dtype: int64
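`count()` only reports non-null entries; uniqueness of a categorical column is usually inspected with `nunique()`, `unique()`, and `value_counts()`. A sketch on a hypothetical workout-type column:

```python
import pandas as pd

workout = pd.Series(['Yoga', 'HIIT', 'Cardio', 'HIIT', 'Yoga', 'Yoga'])
print(workout.count())            # 6 non-null entries
print(workout.nunique())          # 3 distinct categories
print(workout.unique().tolist())  # ['Yoga', 'HIIT', 'Cardio'] in order of appearance
print(workout.value_counts())     # frequency of each category, most common first
```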

Data cleaning and preparation

a) Handling missing data by detecting, dropping, and replacing/filling missing values
import pandas as pd
import numpy as np

Student performance
Import any CSV file into a pandas DataFrame and perform the following
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('student.csv')
df

Hours_Studied Attendance Parental_Involvement


Access_to_Resources \
0 23 84 Low
High
1 19 64 Low
Medium
2 24 98 Medium
Medium
3 29 89 Low
Medium
4 19 92 Medium
Medium
... ... ... ... ...
6602 25 69 High
Medium
6603 23 76 High
Medium
6604 20 90 Medium
Low
6605 10 86 High
High
6606 15 67 Medium
Low
Extracurricular_Activities Sleep_Hours Previous_Scores \
0 No 7 73
1 No 8 59
2 Yes 7 91
3 Yes 8 98
4 Yes 6 65
... ... ... ...
6602 No 7 76

6603 No 8 81
6604 Yes 6 65
6605 Yes 6 91
6606 Yes 9 94

Motivation_Level Internet_Access Tutoring_Sessions Family_Income


\
0 Low Yes 0 Low

1 Low Yes 2 Medium

2 Medium Yes 2 Medium

3 Medium Yes 1 Medium

4 Medium Yes 3 Medium

... ... ... ... ...

6602 Medium Yes 1 High

6603 Medium Yes 3 Low

6604 Low Yes 3 Low

6605 High Yes 2 Low

6606 Medium Yes 0 Medium

Teacher_Quality School_Type Peer_Influence Physical_Activity \


0 Medium Public Positive 3
1 Medium Public Negative 4
2 Medium Public Neutral 4
3 Medium Public Negative 4
4 High Public Neutral 4
... ... ... ... ...
6602 Medium Public Positive 2
6603 High Public Positive 2
6604 Medium Public Negative 2
6605 Medium Private Positive 3
6606 Medium Public Positive 4

Learning_Disabilities Parental_Education_Level Distance_from_Home


\
0 No High School Near

1 No College Moderate

2 No Postgraduate Near

3 No High School Moderate

4 No College Near

... ... ... ...

6602 No High School Near

6603 No High School Near

6604 No Postgraduate Near

6605 No High School Far

6606 No Postgraduate Near

Gender Exam_Score
0 Male 67
1 Female 61
2 Male 74
3 Male 71
4 Female 70
... ... ...
6602 Female 68
6603 Female 69
6604 Female 68
6605 Female 68
6606 Male 64

[6607 rows x 20 columns]

# Display the first few rows of the DataFrame to understand the data
print("Original DataFrame:")
print(df.head())

Original DataFrame:
Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores

Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70

# 1. Detect missing data


missing_data = df.isnull()
print("\nMissing Data:")
print(missing_data.head(10))

Missing Data:
Hours_Studied Attendance Parental_Involvement
Access_to_Resources \
0 False False False
False
1 False False False

False
2 False False False
False
3 False False False
False
4 False False False
False
5 False False False
False
6 False False False
False
7 False False False
False
8 False False False
False
9 False False False
False

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 False False False
False
1 False False False
False
2 False False False
False
3 False False False
False
4 False False False
False
5 False False False
False
6 False False False
False
7 False False False
False
8 False False False
False
9 False False False
False

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality


\
0 False False False False

1 False False False False

2 False False False False

3 False False False False

4 False False False False

5 False False False False

6 False False False False

7 False False False False

8 False False False False

9 False False False False

School_Type Peer_Influence Physical_Activity


Learning_Disabilities \
0 False False False
False
1 False False False
False
2 False False False
False
3 False False False
False
4 False False False
False
5 False False False
False
6 False False False
False
7 False False False
False
8 False False False
False
9 False False False
False

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False False
9 False False False False

# No of null values
n=df.isnull().sum()
n

Hours_Studied 0
Attendance 0
Parental_Involvement 0
Access_to_Resources 0
Extracurricular_Activities 0
Sleep_Hours 0
Previous_Scores 0
Motivation_Level 0
Internet_Access 0
Tutoring_Sessions 0
Family_Income 0
Teacher_Quality 78
School_Type 0
Peer_Influence 0
Physical_Activity 0
Learning_Disabilities 0
Parental_Education_Level 90
Distance_from_Home 67
Gender 0
Exam_Score 0
dtype: int64

# 2. Drop rows with missing values


df_dropna = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropna.head(10))

DataFrame after dropping rows with missing values:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

5 19 88 Medium Medium

6 29 84 Medium Low

7 25 78 Low High

8 17 94 Medium High

9 23 98 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium
5 Yes 8 89
Medium
6 Yes 7 68
Low
7 Yes 6 50
Medium
8 No 6 80
High
9 Yes 8 71
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High
5 Yes 3 Medium Medium
6 Yes 1 Low Medium
7 Yes 1 High High
8 Yes 0 Medium Low
9 Yes 0 High High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

5 Public Positive 3 No

6 Private Neutral 2 No

7 Public Negative 2 No

8 Private Neutral 1 No

9 Public Positive 5 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70
5 Postgraduate Near Male 71
6 High School Moderate Male 67
7 High School Far Male 66
8 College Near Male 69
9 High School Moderate Male 72

# 3. Fill missing values with a specific value (e.g., mean, median, or a custom value)

# Fill missing values in the 'Attendance' column with that column's mean
mean_attendance = df['Attendance'].mean()
df_fillna = df.fillna({'Attendance': mean_attendance})
print("\nDataFrame after filling missing values:")
print(df_fillna.head(10))

DataFrame after filling missing values:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

5 19 88 Medium Medium

6 29 84 Medium Low

7 25 78 Low High

8 17 94 Medium High

9 23 98 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium
5 Yes 8 89
Medium
6 Yes 7 68
Low
7 Yes 6 50
Medium
8 No 6 80
High
9 Yes 8 71
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High
5 Yes 3 Medium Medium
6 Yes 1 Low Medium
7 Yes 1 High High
8 Yes 0 Medium Low
9 Yes 0 High High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

5 Public Positive 3 No

6 Private Neutral 2 No

7 Public Negative 2 No

8 Private Neutral 1 No

9 Public Positive 5 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70
5 Postgraduate Near Male 71
6 High School Moderate Male 67
7 High School Far Male 66
8 College Near Male 69
9 High School Moderate Male 72

# 4. Replace missing values with a custom value

# For example, replace missing values in 'Hours_Studied' with 'Unknown'
df_replace = df.fillna({'Hours_Studied': 'Unknown'})
print("\nDataFrame after replacing missing values:")
print(df_replace.head())

DataFrame after replacing missing values:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \

0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70
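The detect/drop/fill steps above can be reproduced end-to-end on a tiny frame with deliberately planted gaps (column names borrowed from the student dataset, values made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Attendance': [84.0, np.nan, 98.0, np.nan],
                   'Teacher_Quality': ['Medium', 'High', None, 'Low']})

print(len(df.dropna()))  # 1 — only row 0 is complete

# Fill each column with its own replacement value
filled = df.fillna({'Attendance': df['Attendance'].mean(),
                    'Teacher_Quality': 'Unknown'})
print(filled['Attendance'].tolist())  # [84.0, 91.0, 98.0, 91.0] — NaNs become the mean
print(filled.isnull().sum().sum())    # 0
```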

b) Transform data using the apply() and map() methods


# Load the CSV file into a Pandas DataFrame
# Replace 'data.csv' with the actual file path if needed
df=pd.read_csv('student.csv')

# Display the first few rows of the DataFrame to understand the data
print("Original DataFrame:")
print(df.head())

Original DataFrame:
Hours_Studied Attendance Parental_Involvement Access_to_Resources

\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No 7 73
Low
1 No 8 59
Low
2 Yes 7 91
Medium
3 Yes 8 98
Medium
4 Yes 6 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70

# 1. Transform using the apply() method

# Square the values in 'Previous_Scores' and store the result in 'Sleep_Hours'
df['Sleep_Hours'] = df['Previous_Scores'].apply(lambda x: x ** 2)
df

Hours_Studied Attendance Parental_Involvement


Access_to_Resources \
0 23 84 Low
High
1 19 64 Low
Medium
2 24 98 Medium
Medium
3 29 89 Low
Medium
4 19 92 Medium
Medium
... ... ... ... ...
6602 25 69 High
Medium
6603 23 76 High
Medium
6604 20 90 Medium
Low
6605 10 86 High
High
6606 15 67 Medium
Low

Extracurricular_Activities Sleep_Hours Previous_Scores \


0 No 5329 73
1 No 3481 59
2 Yes 8281 91
3 Yes 9604 98
4 Yes 4225 65
... ... ... ...
6602 No 5776 76
6603 No 6561 81
6604 Yes 4225 65
6605 Yes 8281 91
6606 Yes 8836 94

Motivation_Level Internet_Access Tutoring_Sessions Family_Income


\
0 Low Yes 0 Low

1 Low Yes 2 Medium

2 Medium Yes 2 Medium

3 Medium Yes 1 Medium

4 Medium Yes 3 Medium

... ... ... ... ...

6602 Medium Yes 1 High

6603 Medium Yes 3 Low

6604 Low Yes 3 Low

6605 High Yes 2 Low

6606 Medium Yes 0 Medium

Teacher_Quality School_Type Peer_Influence Physical_Activity \


0 Medium Public Positive 3
1 Medium Public Negative 4
2 Medium Public Neutral 4
3 Medium Public Negative 4
4 High Public Neutral 4
... ... ... ... ...
6602 Medium Public Positive 2
6603 High Public Positive 2
6604 Medium Public Negative 2
6605 Medium Private Positive 3
6606 Medium Public Positive 4

Learning_Disabilities Parental_Education_Level Distance_from_Home


\
0 No High School Near

1 No College Moderate

2 No Postgraduate Near

3 No High School Moderate

4 No College Near

... ... ... ...

6602 No High School Near

6603 No High School Near

6604 No Postgraduate Near

6605 No High School Far

6606 No Postgraduate Near

Gender Exam_Score
0 Male 67
1 Female 61
2 Male 74
3 Male 71
4 Female 70
... ... ...
6602 Female 68
6603 Female 69
6604 Female 68
6605 Female 68
6606 Male 64

[6607 rows x 20 columns]

# 2. Transform using the map() method

# Map 'Previous_Scores' through a category dictionary; scores without a
# matching key in the dictionary are mapped to NaN
category_map = {0: 'Low', 1: 'Medium', 2: 'High'}
df['Sleep_Hours'] = df['Previous_Scores'].map(category_map)

# Display the transformed DataFrame


print("\nDataFrame after transformation:")
print(df.head())

DataFrame after transformation:


Hours_Studied Attendance Parental_Involvement Access_to_Resources
\
0 23 84 Low High

1 19 64 Low Medium

2 24 98 Medium Medium

3 29 89 Low Medium

4 19 92 Medium Medium

Extracurricular_Activities Sleep_Hours Previous_Scores


Motivation_Level \
0 No NaN 73
Low
1 No NaN 59

Low
2 Yes NaN 91
Medium
3 Yes NaN 98
Medium
4 Yes NaN 65
Medium

Internet_Access Tutoring_Sessions Family_Income Teacher_Quality \


0 Yes 0 Low Medium
1 Yes 2 Medium Medium
2 Yes 2 Medium Medium
3 Yes 1 Medium Medium
4 Yes 3 Medium High

School_Type Peer_Influence Physical_Activity Learning_Disabilities


\
0 Public Positive 3 No

1 Public Negative 4 No

2 Public Neutral 4 No

3 Public Negative 4 No

4 Public Neutral 4 No

Parental_Education_Level Distance_from_Home Gender Exam_Score


0 High School Near Male 67
1 College Moderate Female 61
2 Postgraduate Near Male 74
3 High School Moderate Male 71
4 College Near Female 70
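The all-NaN `Sleep_Hours` column above is the expected result: `map()` with a dictionary returns NaN for any value that is not a key, and the actual score values (roughly 50–100) never appear in `{0, 1, 2}`. A minimal demonstration of this behaviour:

```python
import pandas as pd

scores = pd.Series([0, 1, 73])
category_map = {0: 'Low', 1: 'Medium', 2: 'High'}
out = scores.map(category_map)
print(out.tolist()[:2])     # ['Low', 'Medium'] — keys found in the dict
print(out.isna().tolist())  # [False, False, True] — 73 has no entry, so it maps to NaN
```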

c) Detect and filter outliers


# Load the CSV file into a Pandas DataFrame
# Replace 'data.csv' with the actual file path if needed
df = pd.read_csv('titanic.csv')
df

PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...

886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0
.. ... ... ...
...
886 Montvila, Rev. Juozas male 27.0
0
887 Graham, Miss. Margaret Edith female 19.0
0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN
1
889 Behr, Mr. Karl Howell male 26.0
0
890 Dooley, Mr. Patrick male 32.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]

Titanic
# Display the first few rows of the DataFrame to understand the data
print("Original DataFrame:")
print(df.head())

Original DataFrame:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

# Select the column to analyze for outliers
column_name = 'Fare'
# Calculate the z-scores for the selected column
z_scores = np.abs((df[column_name] - df[column_name].mean()) /
df[column_name].std())
z_scores.head(10)

0 0.502163
1 0.786404
2 0.488580
3 0.420494
4 0.486064
5 0.477848
6 0.395591
7 0.223957

8 0.424018
9 0.042931
Name: Fare, dtype: float64

# Define a threshold for outliers (e.g., z-score greater than 3)


z_score_threshold = 3

# Filter the DataFrame to keep rows without outliers


filtered_df = df[z_scores <= z_score_threshold]

# Display the DataFrame after filtering outliers


print("\nDataFrame after filtering outliers:")
print(filtered_df.head())

DataFrame after filtering outliers:


PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
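An alternative to z-scores (not used in the notebook above, shown here as a sketch) is the interquartile-range rule, which is less sensitive to the outliers themselves because it relies on quantiles rather than the mean and standard deviation:

```python
import pandas as pd

fares = pd.Series([1, 2, 3, 4, 100])
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
# Keep values inside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = (fares >= q1 - 1.5 * iqr) & (fares <= q3 + 1.5 * iqr)
print(fares[mask].tolist())  # [1, 2, 3, 4] — 100 falls outside the fences
```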

d) Perform vectorized string operations on a pandas Series
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('titanic.csv')
df

PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3

Name Sex Age


SibSp \
0 Braund, Mr. Owen Harris male 22.0
1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
1
2 Heikkinen, Miss. Laina female 26.0
0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
1
4 Allen, Mr. William Henry male 35.0
0
.. ... ... ...
...
886 Montvila, Rev. Juozas male 27.0
0
887 Graham, Miss. Margaret Edith female 19.0
0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN
1
889 Behr, Mr. Karl Howell male 26.0
0
890 Dooley, Mr. Patrick male 32.0
0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Vectorized string operations via the .str accessor

# Overwrite 'Name' with the uppercased 'Sex' values
df['Name'] = df['Sex'].str.upper()
df

PassengerId Survived Pclass Name Sex Age SibSp Parch


\
0 1 0 3 MALE male 22.0 1 0

1 2 1 1 FEMALE female 38.0 1 0

2 3 1 3 FEMALE female 26.0 0 0

3 4 1 1 FEMALE female 35.0 1 0

4 5 0 3 MALE male 35.0 0 0

.. ... ... ... ... ... ... ... ...

886 887 0 2 MALE male 27.0 0 0

887 888 1 1 FEMALE female 19.0 0 0

888 889 0 3 FEMALE female NaN 1 2

889 890 1 1 MALE male 26.0 0 0

890 891 0 3 MALE male 32.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C

890 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Calculate the length of each 'Sex' string
df['Name'] = df['Sex'].str.len()
df

PassengerId Survived Pclass Name Sex Age SibSp


Parch \
0 1 0 3 4 male 22.0 1 0

1 2 1 1 6 female 38.0 1 0

2 3 1 3 6 female 26.0 0 0

3 4 1 1 6 female 35.0 1 0

4 5 0 3 4 male 35.0 0 0

.. ... ... ... ... ... ... ... ...

886 887 0 2 4 male 27.0 0 0

887 888 1 1 6 female 19.0 0 0

888 889 0 3 6 female NaN 1 2

889 890 1 1 4 male 26.0 0 0

890 891 0 3 4 male 32.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C
890 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Split each 'Sex' string on a space and keep the first part
df['Name'] = df['Sex'].str.split(' ').str[0]
df

PassengerId Survived Pclass Name Sex Age SibSp Parch


\
0 1 0 3 male male 22.0 1 0

1 2 1 1 female female 38.0 1 0

2 3 1 3 female female 26.0 0 0

3 4 1 1 female female 35.0 1 0

4 5 0 3 male male 35.0 0 0

.. ... ... ... ... ... ... ... ...

886 887 0 2 male male 27.0 0 0

887 888 1 1 female female 19.0 0 0

888 889 0 3 female female NaN 1 2

889 890 1 1 male male 26.0 0 0

890 891 0 3 male male 32.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C
890 370376 7.7500 NaN Q

[891 rows x 12 columns]

# Display the transformed DataFrame


print("DataFrame after performing vectorized string operations:")
print(df.head())

DataFrame after performing vectorized string operations:


PassengerId Survived Pclass Name Sex Age SibSp
Parch \
0 1 0 3 male male 22.0 1 0

1 2 1 1 female female 38.0 1 0

2 3 1 3 female female 26.0 0 0

3 4 1 1 female female 35.0 1 0

4 5 0 3 male male 35.0 0 0

Ticket Fare Cabin Embarked


0 A/5 21171 7.2500 NaN S
1 PC 17599 71.2833 C85 C
2 STON/O2. 3101282 7.9250 NaN S
3 113803 53.1000 C123 S
4 373450 8.0500 NaN S
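The same `.str` methods work on any string Series; a self-contained sketch on two made-up names (so it runs without the CSV file):

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen', 'Heikkinen, Miss. Laina'])
print(names.str.upper()[0])                   # 'BRAUND, MR. OWEN'
print(names.str.len().tolist())               # [16, 22]
print(names.str.contains('Miss').tolist())    # [False, True]
print(names.str.split(', ').str[0].tolist())  # ['Braund', 'Heikkinen'] — surname only
```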

Data Wrangling

0.0.1 1. Concatenate / Join / Merge / Reshape DataFrames


pd.concat() is used to concatenate two or more DataFrame objects. Setting axis=0 concatenates vertically (stacking rows); setting axis=1 concatenates horizontally (adding columns).
[3] : import pandas as pd
# Create the first DataFrame (df1) with columns 'X' and 'Y' and two rows
df1 = pd.DataFrame({'X': ['X0', 'X1'],
                    'Y': ['Y0', 'Y1']})
# Create the second DataFrame (df2) with columns 'X' and 'Y' and two rows
df2 = pd.DataFrame({'X': ['X2', 'X3'],
                    'Y': ['Y2', 'Y3']})
# Concatenate df1 and df2 vertically (axis=0) to stack rows
# This combines the two DataFrames by adding the rows of df2 below the rows of df1
result = pd.concat([df1, df2], axis=0)
df1

[3] : X Y
0 X0 Y0
1 X1 Y1

[4] : df2

[4] : X Y
0 X2 Y2
1 X3 Y3

[5] : result

[5]: X Y
0 X0 Y0
1 X1 Y1
0 X2 Y2
1 X3 Y3

0.0.2 MERGE
Used to merge two DataFrames based on a key column, similar to SQL joins. Options include how='inner', how='outer', how='left', and how='right' for the different types of joins.
[8] : import pandas as pd
# Create DataFrame 1
df1 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value1': [1, 2, 3]})
# Create DataFrame 2
df2 = pd.DataFrame({'key': ['y', 'z', 'a'], 'value2': [4, 5, 6]})
# Merge DataFrames on the 'key' column using an inner join
result = pd.merge(df1, df2, on='key', how='inner')
df1

[8] : key value1


0 x 1
1 y 2
2 z 3

[9] : df2

[9] : key value2


0 y 4
1 z 5
2 a 6

[10] : result

[10] : key value1 value2


0 y 2 4
1 z 3 5
[11] : import pandas as pd
# Create DataFrame 1
df1 = pd.DataFrame({'key': ['x', 'y', 'z'], 'value1': [1, 2, 3]})
# Create DataFrame 2
df2 = pd.DataFrame({'key': ['y', 'z', 'a'], 'value2': [4, 5, 6]})
# Merge DataFrames on the 'key' column using an outer join
result = pd.merge(df1, df2, on="key", how='outer')
df1

[11] : key value1
0 x 1
1 y 2
2 z 3

[12] : df2

[12] : key value2
0 y 4
1 z 5
2 a 6

[13] : result

[13]: key value1 value2


0 x 1.0 NaN
1 y 2.0 4.0
2 z 3.0 5.0
3 a NaN 6.0
0.0.3 JOIN
A join is a way to combine data from two or more tables (or DataFrames) based on a common
column, known as the join key.
[18] : # Create DataFrame 1
df1 = pd.DataFrame({"x": ["x0", "x1", "x2"], "y": ["y0", "y1", "y2"]},
                   index=["j0", "j1", "j2"])
# Create DataFrame 2
df2 = pd.DataFrame({"z": ["z0", "z2", "z3"], "a": ["a0", "a2", "a3"]},
                   index=["K0", "K2", "K3"])
# Print DataFrame 1
print(df1)
# Print DataFrame 2
print(df2)
# Join DataFrames 1 and 2 on index (the default, as a left join)
df3 = df1.join(df2)
print(df3)

x y
j0 x0 y0
j1 x1 y1
j2 x2 y2
z a
K0 z0 a0
K2 z2 a2
K3 z3 a3
x y z a
j0 x0 y0 NaN NaN
j1 x1 y1 NaN NaN
j2 x2 y2 NaN NaN
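The all-NaN z and a columns in the output above are expected: `join()` aligns on index labels by default, and the j0/j1/j2 labels of df1 never match the K0/K2/K3 labels of df2. With one shared label the alignment becomes visible (a sketch with hypothetical labels):

```python
import pandas as pd

left = pd.DataFrame({'x': ['x0', 'x1']}, index=['k0', 'k1'])
right = pd.DataFrame({'z': ['z0', 'z2']}, index=['k0', 'k2'])
joined = left.join(right)  # left join on the index by default
print(joined.loc['k0', 'z'])           # 'z0' — shared index label matched
print(pd.isna(joined.loc['k1', 'z']))  # True — 'k1' has no counterpart in `right`
```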

0.0.4 INNER JOIN


Returns rows with matching keys in both DataFrames.

[21]: # inner join
# Create DataFrame 1
df1 = pd.DataFrame({"x": ["x0", "x1", "x2"], "y": ["y0", "y1", "y2"]},
                   index=["j0", "j1", "j2"])
# Create DataFrame 2
df2 = pd.DataFrame({"x": ["x0", "x1", "x3"], "z": ["z0", "z2", "z3"],
                    "a": ["a0", "a2", "a3"]},
                   index=["K0", "K2", "K3"])
df4 = df1.merge(df2, on="x", how='inner')
print(df4)
x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2
0.0.5 FULL OUTER JOIN

Returns all rows from both DataFrames.

[22] : # full outer join
df5 = df1.merge(df2, on="x", how='outer')
print(df5)
x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2
2 x2 y2 NaN NaN
3 x3 NaN z3 a3

0.0.6 LEFT OUTER JOIN


Returns all rows from the left DataFrame and matching rows from the right DataFrame.
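No left-join cell appears in the notebook; a sketch mirroring the frames used in the cells above (self-contained so it runs on its own):

```python
import pandas as pd

df1 = pd.DataFrame({"x": ["x0", "x1", "x2"], "y": ["y0", "y1", "y2"]})
df2 = pd.DataFrame({"x": ["x0", "x1", "x3"], "z": ["z0", "z2", "z3"]})
df6 = df1.merge(df2, on="x", how='left')
print(df6['x'].tolist())         # ['x0', 'x1', 'x2'] — every left row survives
print(df6['z'].isna().tolist())  # [False, False, True] — 'x2' has no match in df2
```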

0.0.7 RIGHT OUTER JOIN


Returns all rows from the right DataFrame and matching rows from the left DataFrame.
[25] : # right outer join
df7 = df1.merge(df2, on="x", how='right')
print(df7)

x y z a
0 x0 y0 z0 a0
1 x1 y1 z2 a2
2 x3 NaN z3 a3

0.0.8 RESHAPE
Reshaping functions like pivot and melt are used to transform the layout of data frames.

[30]: import pandas as pd
# Create Series 1
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
# Create Series 2
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
# Concatenate Series into DataFrame
df = pd.concat([s1, s2], keys=['one', 'two'])
print(df)
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
[31] : print(df.unstack())

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
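The section mentions pivot and melt but only demonstrates unstack; a small sketch of the wide-to-long round trip (column names are made up):

```python
import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'math': [90, 80], 'science': [70, 60]})
# melt: wide -> long, one row per (id, subject) pair
long = wide.melt(id_vars='id', var_name='subject', value_name='score')
print(len(long))  # 4

# pivot: long -> wide again
back = long.pivot(index='id', columns='subject', values='score')
print(back.loc[1, 'math'])  # 90
```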

