

B.E/CSE/IT - FOUNDATION OF DATA SCIENCE - R-2021 / II-III SEM

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


&
DEPARTMENT OF INFORMATION TECHNOLOGY

Prepared By,
Mrs.S.EZHILVANJI
AP/CSE
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING
Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data – missing data – hierarchical indexing – combining datasets – aggregation and grouping – pivot tables
Introduction:
NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data
buffers. NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more
efficient storage and data operations as the arrays grow larger in size.
Basics of NumPy Arrays.

A few categories of basic array manipulations:

1. Attributes of arrays
 Determining the size, shape, memory consumption, and data types of arrays
2. Indexing of arrays
 Getting and setting the value of individual array elements
3. Slicing of arrays
 Getting and setting smaller subarrays within a larger array
4. Reshaping of arrays
 Changing the shape of a given array
5. Joining and splitting of arrays
 Combining multiple arrays into one, and splitting one array into many

1.NumPy Array Attributes

 Define a one-dimensional, a two-dimensional, and a three-dimensional array

Syntax

import numpy as np

np.random.seed(0) # seed for reproducibility


x1 = np.random.randint(10, size=6) # One-dimensional array

x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array

x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

Each array has attributes

 ndim (the number of dimensions)


 shape (the size of each dimension)
 size (the total size of the array)
 dtype (the data type of the array)
 itemsize (which lists the size (in bytes) of each array element)
 nbytes (which lists the total size (in bytes) of the array)

Syntax
print("x3 ndim: ", x3.ndim)

print("x3 shape:",x3.shape)

print("x3 size: ", x3.size)

print("dtype:", x3.dtype)

print("itemsize:", x3.itemsize, "bytes")

print("nbytes:", x3.nbytes, "bytes")

Output:
x3 ndim: 3

x3 shape: (3, 4, 5)

x3 size: 60

dtype: int64

itemsize: 8 bytes

nbytes: 480 bytes



2. Array Indexing: Accessing Single Elements


 Indexing in NumPy will feel quite familiar to anyone who has used Python's list indexing,
 In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired
index in square brackets, just as with Python lists
 To index from the end of the array, you can use negative indices
 In a multidimensional array, you access items using a comma-separated tuple of indices
 Unlike Python lists, NumPy arrays have a fixed type. This means, for example, that if you attempt to
insert a floating-point value to an integer array, the value will be silently truncated.

In a one-dimensional array, access the ith value (counting from zero) by
specifying the desired index in square brackets.

Syntax

x1

Output: array([5, 0, 3, 3, 7, 9])

x1[0]

Output: 5

x1[4]

Output: 7

 Index from the end of the array using negative indices

x1[-1]

Output: 9

x1[-2]

Output: 7

 Access items in a multidimensional array

Syntax

x2
Output: array([[3, 5, 2, 4],

[7, 6, 8, 8],

[1, 6, 7, 7]])

x2[0, 0]

Output: 3

x2[2, 0]

Output: 1
x2[2, -1]

Output: 7

 Modifying values

Syntax

x2[0, 0] = 12

x2

Output: array([[12, 5, 2, 4],

[7, 6, 8, 8],

[1, 6, 7, 7]])

 Inserting a floating-point value into an integer array (the value is silently truncated)


Syntax

x1[0] = 3.14159

x1

Output: array([3, 0, 3, 3, 7, 9])


3. Array Slicing: Accessing Subarrays

 To access subarrays with the slice notation, marked by the colon (:) character.

Syntax

x[start:stop:step]

 By default values start=0, stop=size of dimension, step=1.



One-dimensional subarrays

x = np.arange(10)

x

Output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x[:5] # first five elements

Output: array([0, 1, 2, 3, 4])

x[5:] # elements after index 5

Output: array([5, 6, 7, 8, 9])

x[4:7] # middle subarray

Output: array([4, 5, 6])

x[::2] # every other element

Output: array([0, 2, 4, 6, 8])

x[1::2] # every other element, starting at index 1

Output: array([1, 3, 5, 7, 9])


 When the step value is negative, the defaults for start and stop are swapped.
Syntax

x[::-1] # all elements, reversed

Output: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

x[5::-2] # reversed every other from index 5

Output: array([5, 3, 1])



Multidimensional subarrays

 Multidimensional slices work in the same way, with multiple slices separated by commas.

Example:

Syntax

x2

Output: array([[12, 5, 2, 4],

[ 7, 6, 8, 8],

[ 1, 6, 7, 7]])

x2[:2, :3] # two rows, three columns

Output: array([[12, 5, 2],

[ 7, 6, 8]])

x2[:3, ::2] # all rows, every other column

Output: array([[12, 2],

[ 7, 8],

[ 1, 7]])

x2[::-1, ::-1]

Output: array([[ 7, 7, 6, 1],

[ 8, 8, 6, 7],

[ 4, 2, 5, 12]])

Accessing array rows and columns

Syntax

print(x2[:, 0]) # first column of x2

Output: [12 7 1]
print(x2[0, :]) # first row of x2

Output: [12 5 2 4]

 In the case of row access, the empty slice can be omitted for a more compact syntax:

print(x2[0]) # equivalent to x2[0, :]

Output: [12 5 2 4]

Subarrays as no-copy views

 A key feature of array slices is that they return views rather than copies of the array data

Syntax

print(x2)
Output: [[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
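The claim above that slices return views can be demonstrated with a short sketch (re-creating the x2 array from earlier); it also explains why x2 begins with 99 in the copy example that follows:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

x2_sub = x2[:2, :2]   # a 2x2 slice: a view, not a copy
x2_sub[0, 0] = 99     # modify the view...

print(x2[0, 0])       # ...and the original array sees the change: 99
```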

Creating copies of arrays

 Explicitly copy the data within an array or a subarray.

Syntax

x2_sub_copy = x2[:2, :2].copy()

print(x2_sub_copy)

Output: [[99 5]

[ 7 6]]

x2_sub_copy[0, 0] = 42

print(x2_sub_copy)

Output: [[42 5]

[ 7 6]]
print(x2)

Output: [[99 5 2 4]

[ 7 6 8 8]

[ 1 6 7 7]]
4. Reshaping of Arrays

Syntax

grid = np.arange(1, 10).reshape((3, 3))

print(grid)

Output: [[1 2 3]

[4 5 6]

[7 8 9]]

 Conversion of a one-dimensional array into a two-dimensional row or column matrix


Syntax
x = np.array([1, 2, 3])
# row vector via reshape
x.reshape((1, 3))

Output: array([[1, 2, 3]])

# row vector via newaxis


x[np.newaxis, :]

Output: array([[1, 2, 3]])

# column vector via reshape


x.reshape((3, 1))

Output: array([[1],

[2],

[3]])
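The same np.newaxis trick shown for row vectors also produces a column vector; a minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3])

# column vector via newaxis (equivalent to x.reshape((3, 1)))
col = x[:, np.newaxis]
print(col)   # a (3, 1) array: [[1] [2] [3]]
```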

5. Array Concatenation and Splitting

 To combine multiple arrays into one, and to conversely split a single array into multiple
arrays

Concatenation of arrays

 Concatenation, or joining of two arrays, in NumPy uses the routines np.concatenate, np.vstack, and np.hstack.

Syntax

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

np.concatenate([x, y])

Output: array([1, 2, 3, 3, 2, 1])

z = [99, 99, 99]

print(np.concatenate([x, y, z]))

Output: [ 1 2 3 3 2 1 99 99 99]

 np.concatenate

grid = np.array([[1, 2, 3],

[4, 5, 6]])

np.concatenate([grid, grid]) # concatenate along the first axis

Output: array([[1, 2, 3],

[4, 5, 6],

[1, 2, 3],

[4, 5, 6]])

np.concatenate([grid, grid], axis=1) # concatenate along the second axis

Output: array([[1, 2, 3, 1, 2, 3],

[4, 5, 6, 4, 5, 6]])

 np.vstack (vertical stack) and np.hstack (horizontal stack) functions


Syntax

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

Output: array([[1, 2, 3],
               [9, 8, 7],
               [6, 5, 4]])

# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

Output: array([[ 9, 8, 7, 99],
               [ 6, 5, 4, 99]])

Splitting of arrays

 The opposite of concatenation is splitting, implemented by the functions np.split, np.hsplit, and np.vsplit.

Syntax

x = [1, 2, 3, 99, 99, 3, 2, 1]

x1, x2, x3 = np.split(x, [3, 5])

print(x1, x2, x3)

Output: [1 2 3] [99 99] [3 2 1]

 Notice that N split points lead to N + 1 subarrays.

 np.hsplit and np.vsplit are similar

grid = np.arange(16).reshape((4, 4))

grid

Output: array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11],
       [12, 13, 14, 15]])

upper, lower = np.vsplit(grid, [2])

print(upper)

print(lower)

Output: [[0 1 2 3]

[4 5 6 7]]

[[ 8 9 10 11]

[12 13 14 15]]

left, right = np.hsplit(grid, [2])

print(left)

print(right)

Output: [[ 0 1]

[ 4 5]

[ 8 9]

[12 13]]

[[ 2 3]

[ 6 7]

[10 11]

[14 15]]
Computation on NumPy Arrays: Universal Functions

Introducing UFuncs
NumPy provides a convenient interface into just this kind of statically typed, compiled routine. This is
known as a vectorized operation.

Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute
repeated operations on values in NumPy arrays. Ufuncs are extremely flexible—before we saw an
operation between a scalar and an array, but we can also operate between two arrays
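As a quick sketch of a ufunc operating between two arrays, element by element:

```python
import numpy as np

a = np.arange(5)       # [0 1 2 3 4]
b = np.arange(1, 6)    # [1 2 3 4 5]

# the / operator dispatches to the np.divide ufunc, element by element
result = a / b         # approximately 0, 0.5, 0.667, 0.75, 0.8
print(result)
```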

Exploring NumPy’s UFuncs

Ufuncs exist in two flavors: unary ufuncs, which operate on a single input, and binary ufuncs, which
operate on two inputs. We’ll see examples of both these types of functions here.

Array arithmetic
NumPy’s ufuncs make use of Python’s native arithmetic operators. The standard addition, subtraction,
multiplication, and division can all be used.

x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)

Operator   Equivalent ufunc   Description
+          np.add             Addition (e.g., 1 + 1 = 2)
-          np.subtract        Subtraction (e.g., 3 - 2 = 1)
-          np.negative        Unary negation (e.g., -2)
*          np.multiply        Multiplication (e.g., 2 * 3 = 6)
/          np.divide          Division (e.g., 3 / 2 = 1.5)
//         np.floor_divide    Floor division (e.g., 3 // 2 = 1)
**         np.power           Exponentiation (e.g., 2 ** 3 = 8)
%          np.mod             Modulus/remainder (e.g., 9 % 4 = 1)
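These operators can be strung together, and the standard order of operations is respected; each step is executed by the corresponding ufunc. A small illustrative sketch:

```python
import numpy as np

x = np.arange(4)             # [0 1 2 3]

# equivalent to np.negative(np.power(np.add(np.multiply(0.5, x), 1), 2))
expr = -(0.5 * x + 1) ** 2
print(expr)                  # [-1.   -2.25 -4.   -6.25]
```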

Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also understands Python’s built-in
absolute value function.
 np.abs()
 np.absolute()
x = np.array([-2, -1, 0, 1, 2])

abs(x)

Output: array([2, 1, 0, 1, 2])

The corresponding NumPy ufunc is np.absolute, which is also available under the alias np.abs.

np.absolute(x)

Output: array([2, 1, 0, 1, 2])

np.abs(x)

Output: array([2, 1, 0, 1, 2])

Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist
are the trigonometric functions.
 np.sin()
 np.cos()
 np.tan()
inverse trigonometric functions
 np.arcsin()
 np.arccos()
 np.arctan()

Defining an array of angles:

theta = np.linspace(0, np.pi, 3)


Compute some trigonometric functions:

print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

Exponents and logarithms


Another common type of operation available in NumPy ufuncs is the exponentials.
 np.exp(x) – calculates the exponential of all elements in the input array, i.e., e**x (e ≈ 2.71828)
 np.exp2(x) – calculates 2**x for each element x of the array
 np.power(x, y) – calculates x**y

x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))

The inverse of the exponentials, the logarithms, are also available. The basic np.log gives the natural logarithm; base-2 and base-10 logarithms are available as well.
 np.log(x) – calculates the natural logarithm of each element of the input array
 np.log2(x) – calculates the base-2 logarithm of x
 np.log10(x) – calculates the base-10 logarithm of x

x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))

Specialized ufuncs
NumPy has many more ufuncs available like
 Hyperbolic trig functions,
 Bitwise arithmetic,
 Comparison operators,
 Conversions from radians to degrees,
 Rounding and remainders, and much more

More specialized and obscure ufuncs live in the submodule scipy.special. If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special.
 Gamma function (scipy.special.gamma)

Advanced Ufunc Features

Specifying output

Rather than creating a temporary array, you can write computation results directly to the memory location where you'd like them to be. For all ufuncs, you can do this using the out argument of the function.

x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)

[ 0. 10. 20. 30. 40.]
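The out argument can even target an array view; a short sketch reusing x, with a fresh 10-element y, writes results into every other element:

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# write 2**x directly into every other element of y, with no temporary array
np.power(2, x, out=y[::2])
print(y)   # [ 1.  0.  2.  0.  4.  0.  8.  0. 16.  0.]
```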

Aggregate functions
a. Summing the Values in an Array

 Consider computing the sum of all values in an array, Python can do this using the built-
in sum function:

Syntax

import numpy as np

L = np.random.random(100)

sum(L)

Output: 55.61209116604941

np.sum(L)

Output: 55.612091166049424

big_array = np.random.rand(1000000)

%timeit sum(big_array)
%timeit np.sum(big_array)

 np.sum executes in compiled code and is aware of multiple array dimensions, so it is much faster than Python's built-in sum.

b. Minimum and Maximum

Syntax

min(big_array), max(big_array)

Output: (1.1717128136634614e-06, 0.9999976784968716)

np.min(big_array), np.max(big_array)

Output: (1.1717128136634614e-06, 0.9999976784968716)

print(big_array.min(), big_array.max(), big_array.sum())

Output: 1.17171281366e-06 0.999997678497 499911.628197

c. Multidimensional aggregates

Two-dimensional array

Syntax

M = np.random.random((3, 4))

print(M)

Output: [[ 0.8967576 0.03783739 0.75952519 0.06682827]



[ 0.8354065 0.99196818 0.19544769 0.43447084]

[ 0.66859307 0.15038721 0.37911423 0.6687194 ]]

M.sum()

Output: 6.0850555667307118

 Aggregation functions take an additional argument specifying the axis along which the
aggregate is computed.
 find the minimum value within each column by specifying axis=0

M.min(axis=0)

Output: array([ 0.66859307, 0.03783739, 0.19544769, 0.06682827])


 find the maximum value within each row
M.max(axis=1)

Output: array([ 0.8967576 , 0.99196818, 0.6687194 ])

 The axis keyword specifies the dimension of the array that will be collapsed, rather
than the dimension that will be returned.
 So specifying axis=0 means that the first axis will be collapsed:
 For two-dimensional arrays, this means that values within each column will be
aggregated.

d. Aggregation functions available in NumPy

Function name    NaN-safe version    Description
np.sum           np.nansum           Compute sum of elements
np.prod          np.nanprod          Compute product of elements
np.mean          np.nanmean          Compute mean of elements
np.std           np.nanstd           Compute standard deviation
np.var           np.nanvar           Compute variance
np.min           np.nanmin           Find minimum value
np.max           np.nanmax           Find maximum value
np.argmin        np.nanargmin        Find index of minimum value
np.argmax        np.nanargmax        Find index of maximum value
np.median        np.nanmedian        Compute median of elements
np.percentile    np.nanpercentile    Compute rank-based statistics of elements
np.any           N/A                 Evaluate whether any elements are true
np.all           N/A                 Evaluate whether all elements are true


Python coding using NumPy: use array functions to compute the average height of US Presidents.

Example: What Is the Average Height of US Presidents?

 Aggregates available in NumPy can be extremely useful for summarizing a set of values.
 Data is available in the file president_heights.csv

order  name                    height (cm)
1      George Washington       189
2      John Adams              170
3      Thomas Jefferson        189
4      James Madison           163
5      James Monroe            183
6      John Quincy Adams       171
7      Andrew Jackson          185
8      Martin Van Buren        168
9      William Henry Harrison  173
10     John Tyler              183
11     James K. Polk           173
12     Zachary Taylor          173
13     Millard Fillmore        175
14     Franklin Pierce         178
15     James Buchanan          183
16     Abraham Lincoln         193
17     Andrew Johnson          178
18     Ulysses S. Grant        173
19     Rutherford B. Hayes     174
20     James A. Garfield       183
21     Chester A. Arthur       183
23     Benjamin Harrison       168
25     William McKinley        170
26     Theodore Roosevelt      178
27     William Howard Taft     182
28     Woodrow Wilson          180
29     Warren G. Harding       183
30     Calvin Coolidge         178
31     Herbert Hoover          182
32     Franklin D. Roosevelt   188
33     Harry S. Truman         175
34     Dwight D. Eisenhower    179
35     John F. Kennedy         183
36     Lyndon B. Johnson       193
37     Richard Nixon           182
38     Gerald Ford             183
39     Jimmy Carter            177
40     Ronald Reagan           185
41     George H. W. Bush       188
42     Bill Clinton            188
43     George W. Bush          182
44     Barack Obama            185

!head -4 data/president_heights.csv

order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189

Program
import numpy as np
import pandas as pd

data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)

Output

[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173

174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183

177 185 188 188 182 185]

Compute a variety of summary statistics:

print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())

Output:

Mean height: 179.738095238

Standard deviation: 6.93184344275

Minimum height: 163

Maximum height: 193

To compute quantiles

print("25th percentile: ", np.percentile(heights, 25))
print("Median:          ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))

Output:

25th percentile: 174.25


Median: 182.0

75th percentile: 183.0

Visual representation of the data

%matplotlib inline

import matplotlib.pyplot as plt



import seaborn; seaborn.set() # set plot style

plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

Output

Fig.4.1 Histogram of presidential heights

Computation on Arrays: Broadcasting


 Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.
 Another means of vectorizing operations is to use NumPy's broadcasting functionality.

a. Introducing Broadcasting

 For arrays of the same size, binary operations are performed on an element-by- element
basis:
Program

import numpy as np

a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

a + b

Output: array([5, 6, 7])

a + 5

Output: array([5, 6, 7])

 This operation stretches or duplicates the value 5 into the array [5, 5, 5], and adds the results.
 The advantage of NumPy's broadcasting is that this duplication of values does not actually take place
 When we add a one-dimensional array to a two-dimensional array

M = np.ones((3, 3))

M
Output: array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])

M + a

Output: array([[ 1., 2., 3.],

[ 1., 2., 3.],

[ 1., 2., 3.]])

 Here the one-dimensional array a is stretched, or broadcast, across the second dimension
in order to match the shape of M.

a = np.arange(3)

b = np.arange(3)[:, np.newaxis]

print(a)

print(b)

Output:

[0 1 2]

[[0]

[1]

[2]]

a+b

Output: array([[0, 1, 2],

[1, 2, 3],

[2, 3, 4]])

 Earlier we stretched, or broadcast, one value to match the shape of the other; here we stretched both a and b to match a common shape, and the result is a two-dimensional array.

b. Visualization of NumPy broadcasting

Fig. Visualization of NumPy broadcasting

The light boxes in the figure represent the broadcasted values.

c. Rules of Broadcasting
 Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with
fewer dimensions is padded with ones on its leading (left) side.
 Rule 2: If the shape of the two arrays does not match in any dimension, the array with
shape equal to 1 in that dimension is stretched to match the other shape.
 Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
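Rule 3 can be seen directly; this minimal sketch (not from the text above) deliberately triggers the error:

```python
import numpy as np

M = np.ones((3, 2))
a = np.arange(3)

# By rule 1, a.shape (3,) is padded to (1, 3); by rule 2 it stretches to (3, 3).
# The shapes (3, 2) and (3, 3) then disagree in the last dimension and neither
# size is 1, so rule 3 applies and NumPy raises a ValueError.
try:
    M + a
    error = None
except ValueError as e:
    error = e

print("broadcast failed:", error)
```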

Broadcasting example 1
Let’s look at adding a two-dimensional array to a one-dimensional array:

M = np.ones((2, 3))
a = np.arange(3)

Let’s consider an operation on these two arrays. The shapes of the arrays are:
M.shape = (2, 3)
a.shape = (3,)
We see by rule 1 that the array a has fewer dimensions, so we pad it on the left with ones:
M.shape -> (2, 3)
a.shape -> (1, 3)
By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:
M.shape -> (2, 3)
a.shape -> (2, 3)
The shapes match, and we see that the final shape will be (2, 3):
M + a

array([[ 1., 2., 3.],
       [ 1., 2., 3.]])

Broadcasting example 2
Let’s take a look at an example where both arrays need to be broadcast:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)

Again, we’ll start by writing out the shape of the arrays:


a.shape = (3, 1)
b.shape = (3,)

Rule 1 says we must pad the shape of b with ones:


a.shape -> (3, 1)
b.shape -> (1, 3)

And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:
a.shape -> (3, 3)
b.shape -> (3, 3)

Because the result matches, these shapes are compatible. We can see this here:

a + b

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])
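One practical use of broadcasting, sketched here on made-up random data rather than data from the text, is centering an array of observations by subtracting the mean of each column:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 3)          # 10 observations, 3 features

Xmean = X.mean(axis=0)       # shape (3,): one mean per column
X_centered = X - Xmean       # the (3,) array broadcasts across the (10, 3) array

# the centered columns now have mean (numerically close to) zero
print(X_centered.mean(axis=0))
```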
Comparisons, Masks, and Boolean Logic
 Masking comes up when you want to extract, modify, count, or otherwise
manipulate values in an array based on some criterion.

Comparison Operators as ufuncs

 NumPy also implements comparison operators such as < (less than) and >
(greater than) as element-wise ufuncs.
 All six of the standard comparison operations are available:
Syntax
x = np.array([1, 2, 3, 4, 5])

x < 3   # less than

Output: array([ True, True, False, False, False], dtype=bool)

x > 3   # greater than

Output: array([False, False, False, True, True], dtype=bool)

x <= 3  # less than or equal

Output: array([ True, True, True, False, False], dtype=bool)

x >= 3  # greater than or equal

Output: array([False, False, True, True, True], dtype=bool)

x != 3  # not equal

Output: array([ True, True, False, True, True], dtype=bool)

x == 3  # equal

Output: array([False, False, True, False, False], dtype=bool)

 It is also possible to do an element-by-element comparison of two arrays, and to


include compound expressions:

(2 * x) == (x ** 2)

Output: array([False, True, False, False, False], dtype=bool)

 the comparison operators are implemented as ufuncs in NumPy

For example, when you write x < 3, internally NumPy uses np.less(x, 3).

Summary of the comparison operators and their equivalent ufuncs:

Operator   Equivalent ufunc
==         np.equal
!=         np.not_equal
<          np.less
>          np.greater
<=         np.less_equal
>=         np.greater_equal

Two-dimensional example

rng = np.random.RandomState(0)

x = rng.randint(10, size=(3, 4))

x

Output: array([[5, 0, 3, 3],

[7, 9, 3, 5],

[2, 4, 7, 6]])

x < 6

Output: array([[ True, True, True, True],

[False, False, True, True],

[ True, True, False, False]], dtype=bool)


Working with Boolean Arrays

print(x)

Output: [[5 0 3 3]

[7 9 3 5]

[2 4 7 6]]

Counting entries

 To count the number of True entries in a Boolean array, np.count_nonzero is useful:

# how many values less than 6?

np.count_nonzero(x < 6)

Output: 8

Or

np.sum(x < 6)

Output: 8

# how many values less than 6 in each row?

np.sum(x < 6, axis=1)

Output: array([4, 2, 2])

 This counts the number of values less than 6 in each row of the matrix.
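Likewise, specifying axis=0 counts within each column; a small sketch reusing the same x as above:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])       # the same x as above

# how many values less than 6 in each column?
print(np.sum(x < 6, axis=0))      # [2 2 2 2]
```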

Checking whether any or all the values are true

# are there any values greater than 8?

np.any(x > 8)

Output: True

# are there any values less than zero?

np.any(x < 0)

Output: False

# are all values less than 10?


np.all(x < 10)

Output: True

# are all values equal to 6?

np.all(x == 6)

Output: False

# are all values in each row less than 8?

np.all(x < 8, axis=1)

Output: array([ True, False, True], dtype=bool)

Boolean operators

 Python's bitwise logic operators, &, |, ^, and ~, work element-wise on Boolean arrays. Using the inches rainfall array constructed in the "Counting Rainy Days" example below:

np.sum((inches > 0.5) & (inches < 1))

Output: 29

 The parentheses are important—because of operator precedence rules, with


parentheses removed this expression would be evaluated as follows, which results
in an error:
inches > (0.5 & inches) < 1
 Using the equivalence of A AND B and NOT (NOT A OR NOT B):

np.sum(~( (inches <= 0.5) | (inches >= 1) ))

Output: 29

 Combining comparison operators and Boolean operators on arrays can lead to a


wide range of efficient logical operations.

Bitwise Boolean operators and their equivalent ufuncs:

Operator   Equivalent ufunc
&          np.bitwise_and
|          np.bitwise_or
^          np.bitwise_xor
~          np.bitwise_not

Program

print("Number days without rain: ", np.sum(inches == 0))

print("Number days with rain: ", np.sum(inches != 0))



print("Days with more than 0.5 inches:", np.sum(inches > 0.5))

print("Rainy days with < 0.2 inches :", np.sum((inches > 0) & (inches < 0.2)))

Output:

Number days without rain: 215

Number days with rain: 150

Days with more than 0.5 inches: 37

Rainy days with < 0.2 inches : 75

Boolean Arrays as Masks

x

Output: array([[5, 0, 3, 3],

[7, 9, 3, 5],

[2, 4, 7, 6]])

x < 5

Output: array([[False, True, True, True],

[False, False, True, False],

[ True, True, False, False]], dtype=bool)

 To select the values from the array, we can simply index on this Boolean array;
this is known as a masking operation

x[x < 5]

Output: array([0, 3, 3, 3, 2, 4])

Example: computing statistics on our Seattle rain data:

# construct a mask of all rainy days

rainy = (inches > 0)


# construct a mask of all summer days (June 21st is the 172nd day)

summer = (np.arange(365) - 172 < 90) & (np.arange(365) - 172 > 0)

print("Median precip on rainy days in 2014 (inches):   ", np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ", np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ", np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):", np.median(inches[rainy & ~summer]))

Output:

Median precip on rainy days in 2014 (inches): 0.194881889764

Median precip on summer days in 2014 (inches): 0.0

Maximum precip on summer days in 2014 (inches): 0.850393700787

Median precip on non-summer rainy days (inches): 0.200787401575

Example: Counting Rainy Days

 A series of data representing the amount of precipitation each day for a year in a given city (histogram shown in Fig. 4.3).

Program

import numpy as np
import pandas as pd

# use Pandas to extract rainfall inches as a NumPy array
rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = rainfall / 254  # 1/10mm -> inches
inches.shape

Output: (365,)

Histogram of rainy days

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn; seaborn.set() # set plot styles



plt.hist(inches, 40);

Fig.4.3 Histogram of rainy days

FANCY INDEXING

 Fancy indexing is like the simple indexing we have already seen, but we pass arrays of indices in place of single scalars. This allows us to very quickly access and modify complicated subsets of an array's values.

 So far we have accessed and modified portions of arrays using simple indices (e.g., arr[0]), slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]).

Exploring Fancy Indexing

 It means passing an array of indices to access multiple array elements at once.

Types of fancy indexing.


 Indexing / accessing more values
 Array of indices
 In multi dimensional
 Standard indexing
Program

import numpy as np

rand = np.random.RandomState(42)

x = rand.randint(100, size=10)

print(x)

Output: [51 92 14 71 60 20 82 86 74 74]



Indexing / accessing more values


Suppose we want to access three different elements. We could do it like this:
[x[3], x[7], x[2]]

[71, 86, 14]

Array of indices
We can pass a single list or array of indices to obtain the same result.
ind = [3, 7, 4]
x[ind]

array([71, 86, 60])

In multi dimensional
Fancy indexing also works in multiple dimensions. Consider the following array.
X = np.arange(12).reshape((3, 4))
X

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

Standard indexing
Like with standard indexing, the first index refers to the row, and the second to the column.
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

X[row, col]

Output: array([ 2, 5, 11])

Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes
we’ve seen.
Example array
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
 Combine fancy and simple indices
X[2, [2, 0, 1]]
array([10, 8, 9])

 Combine fancy indexing with slicing


X[1:, [2, 0, 1]]

array([[ 6, 4, 5],
[10, 8, 9]])
 Combine fancy indexing with masking
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]

array([[ 0, 2],
[ 4, 6],
[ 8, 10]])

Modifying Values with Fancy Indexing


Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array.

Modify particular element by index


For example, imagine we have an array of indices and we’d like to set the corresponding items in an
array to some value.
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)

[ 0 99 99 3 99 5 6 7 99 9]

Using assignment operator


We can use any assignment-type operator for this. For example
x[i] -= 10
print(x)

[ 0 89 89 3 89 5 6 7 89 9]

Using at()
Use the at() method of ufuncs for other behaviors of modification.
x = np.zeros(10)
np.add.at(x, i, 1)   # i = [2, 1, 8, 4] from above
print(x)

[ 0. 1. 1. 0. 1. 0. 0. 0. 1. 0.]
 The shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed. With the original x ([51 92 14 71 60 20 82 86 74 74]):

ind = np.array([[3, 7],
                [4, 5]])

x[ind]

Output: array([[71, 86],
               [60, 20]])

 In the X[row, col] example earlier, the first value in the result is X[0, 2], the second is X[1, 1], and the third is X[2, 3].

 Combine a column vector and a row vector within the indices

X[row[:, np.newaxis], col]

Output: array([[ 2, 1, 3],

[ 6, 5, 7],

[10, 9, 11]])

 Each row value is matched with each column vector, exactly as we


saw in broadcasting of arithmetic operations.

Program

row[:, np.newaxis] * col

Output: array([[0, 0, 0],

[2, 1, 3],

[4, 2, 6]])

 With fancy indexing the return value reflects the broadcasted shape of
the indices, rather than the shape of the array being indexed.


 All of these indexing options combined lead to a very flexible set of


operations for accessing and modifying array values.

Example: Selecting Random Points

 One common use of fancy indexing is the selection of subsets of rows from a
matrix.
 For example, we might have an N by D matrix representing N points in D
dimensions, such as the following points drawn from a two-dimensional
normal distribution.
Program
mean = [0, 0]
cov = [[1, 2],
[2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape
Output: (100, 2)

Visualize these points as a scatter plot

Program

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn; seaborn.set() # for plot styling

plt.scatter(X[:, 0], X[:, 1]);



Fig.4.4 Normally distributed points

 Use fancy indexing to select 20 random points: first choose 20 random indices with no repeats, and then use these indices to select a portion of the original array:

Syntax

indices = np.random.choice(X.shape[0], 20, replace=False)

indices

Output: array([93, 45, 73, 81, 50, 10, 98, 94, 4, 64, 65, 89, 47, 84, 82,

80, 25, 90, 63, 20])

Syntax

selection = X[indices] # fancy indexing here

selection.shape

Output: (20, 2)

Plot large circles at the locations of the selected points

plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', s=200);




Figure 4.5. Random selection among points

 This sort of strategy (Figure 4.5) is often used to quickly partition datasets, as is needed in train/test splitting for validation of statistical models and in sampling approaches to answering statistical questions.
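A minimal sketch of that train/test idea, using a shuffled permutation of row indices (the 80/20 split ratio and variable names here are illustrative assumptions, not from the text above):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1, 2], [2, 5]], 100)

# shuffle all row indices, then cut them into two disjoint groups
perm = rng.permutation(X.shape[0])
train_idx, test_idx = perm[:80], perm[80:]

X_train = X[train_idx]   # fancy indexing selects whole rows
X_test = X[test_idx]

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```

Because the permutation contains each index exactly once, the two subsets never share a point.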
Modifying Values with Fancy Indexing

 It is used to modify parts of an array

Example

 Imagine we have an array of indices and like to set the corresponding items
in an array to some value:

Program

x = np.arange(10)

i = np.array([2, 1, 8, 4])

x[i] = 99

print(x)

Output: [ 0 99 99 3 99 5 6 7 99 9]

We can use any assignment-type operator for this:

x[i] -= 10
print(x)

Output: [ 0 89 89  3 89  5  6  7 89  9]

Repeated indices with these operations can cause some potentially unexpected results

Program

x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)

Output: [ 6. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

 The result of this operation is to first assign x[0] = 4, followed by x[0] = 6.


The result, of course, is that x[0] contains the value 6.

Program

i = [2, 3, 3, 4, 4, 4]

x[i] += 1

Output: array([ 6., 0., 1., 1., 1., 0., 0., 0., 0., 0.])

 You might expect that x[3] would contain the value 2 and x[4] the value 3, since that is how many times each index is repeated. This is not what happens: x[i] += 1 evaluates x[i] + 1 once and then assigns the result, so each repeated index is incremented only a single time.

Use the at() method of ufuncs

Program

x = np.zeros(10)

np.add.at(x, i, 1)

print(x)

Output: [ 0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]

 The at() method does an in-place application of the given operator at


the specified indices (here, i) with the specified value (here, 1)
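The contrast between the buffered `x[i] += 1` and the unbuffered `np.add.at` can be seen side by side in a short sketch:

```python
import numpy as np

i = [2, 3, 3, 4, 4, 4]

# buffered: x[i] += 1 computes x[i] + 1 once, so repeated indices are lost
x = np.zeros(10)
x[i] += 1
print(x[2], x[3], x[4])   # 1.0 1.0 1.0

# unbuffered: np.add.at applies the operation once per index occurrence
y = np.zeros(10)
np.add.at(y, i, 1)
print(y[2], y[3], y[4])   # 1.0 2.0 3.0
```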
Example: Binning Data

 Imagine we have 1,000 values and would like to quickly find where they fall within an array of bins.

Program

np.random.seed(42)
x = np.random.randn(100)

# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)

# find the appropriate bin for each x
i = np.searchsorted(bins, x)

# add 1 to each of these bins
np.add.at(counts, i, 1)

 The counts now reflect the number of points within each bin—in other words, a histogram (Figure 4.6).

# plot the results
plt.plot(bins, counts, linestyle='steps');

Figure 4.6. A histogram computed by hand

SORTING ARRAYS
Sorting in NumPy: np.sort and np.argsort
Python has built-in sort and sorted functions to work with lists, but we won't discuss them here because NumPy's np.sort function turns out to be much more efficient and useful for our purposes. By default np.sort uses an O[N log N] quicksort algorithm, though mergesort and heapsort are also available. For most applications, the default quicksort is more than sufficient.
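If a specific algorithm is needed, it can be requested through np.sort's `kind` keyword. A brief sketch:

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])

# the default kind is 'quicksort'; 'mergesort' (stable) and 'heapsort' are also accepted
print(np.sort(x, kind='mergesort'))  # [1 2 3 4 5]
print(np.sort(x, kind='heapsort'))   # [1 2 3 4 5]
```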

Sorting without modifying the input.


To return a sorted version of the array without modifying the input, you can use np.sort
x = np.array([2, 1, 4, 3, 5])
np.sort(x)

array([1, 2, 3, 4, 5])

Returns sorted indices


A related function is argsort, which instead returns the indices of the sorted elements
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)

[1 0 3 2 4]
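These indices can then be used via fancy indexing to construct the sorted array if desired:

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)        # indices of the elements in sorted order
sorted_x = x[i]          # fancy indexing reconstructs the sorted array

print(i)         # [1 0 3 2 4]
print(sorted_x)  # [1 2 3 4 5]
```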
Sorting along rows or columns
A useful feature of NumPy’s sorting algorithms is the ability to sort along specific rows or columns of a
multidimensional array using the axis argument. For example

rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)

[[6 3 7 4 6 9]
[2 6 7 4 3 7]
[7 2 5 4 1 7]
[5 1 4 0 9 5]]

np.sort(X, axis=0)

array([[2, 1, 4, 0, 1, 5],
       [5, 2, 5, 4, 3, 7],
       [6, 3, 7, 4, 6, 7],
       [7, 6, 7, 4, 9, 9]])

np.sort(X, axis=1)

array([[3, 4, 6, 6, 7, 9],
       [2, 3, 4, 6, 7, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 5, 9]])

Partial Sorts: Partitioning


Sometimes we’re not interested in sorting the entire array, but simply want to find the K smallest values
in the array. NumPy provides this in the np.partition function. np.partition takes an array and a number
K; the result is a new array with the smallest K values to the left of the partition, and the remaining
values to the right, in arbitrary order
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)

array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in the array, and the remaining
array positions contain the remaining values. Within the two partitions, the elements have arbitrary
order.
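Just as there is an argsort, there is an np.argpartition that computes indices of the partition rather than the partitioned values. A short sketch finding the three smallest values:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])

# indices arranged so the first 3 positions point at the 3 smallest values
idx = np.argpartition(x, 3)
smallest_three = x[idx[:3]]

# within the partition the order is arbitrary, so sort before displaying
print(sorted(smallest_three.tolist()))  # [1, 2, 3]
```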

Partitioning in multidimensional array


Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array.
np.partition(X, 2, axis=1)

array([[3, 4, 6, 7, 6, 9],
[2, 3, 4, 7, 6, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 9, 5]])

STRUCTURED ARRAYS
Structured Data: NumPy’s Structured Arrays
 Several categories of data on a number of people (name, age, and weight),
 To store these values for use in a Python program.
 It would be possible to store these in three separate arrays:

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

Create a simple array using an expression:

x = np.zeros(4, dtype=int)

Create a structured array using a compound data type specification

# Use a compound data type for structured arrays

data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

Output: [('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]

 'U10' translates to “Unicode string of maximum length 10,” 'i4'


translates to 4-byte (i.e., 32 bit) integer, and 'f8' translates to “8-byte
(i.e., 64 bit) float.”

Now that we have created an empty container array, we can fill it with our lists of values:

Syntax

data['name'] = name

data['age'] = age

data['weight'] = weight
print(data)

Output: [('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)

('Doug', 19, 61.5)]



Structured arrays let us refer to values either by index or by name:

# Get all names

data['name']

Output: array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

# Get first row of data

data[0]

Output: ('Alice', 25, 55.0)

# Get the name from the last row

data[-1]['name']

Output: 'Doug'

Using Boolean masking, we can perform more sophisticated operations such as filtering on age:

# Get names where age is under 30
data[data['age'] < 30]['name']

Output: array(['Alice', 'Doug'], dtype='<U10')

 Pandas provides a DataFrame object, which is a structure built on NumPy


arrays that offers a variety of useful data manipulation functionality

Creating Structured Arrays

Program

np.dtype({'names':('name', 'age', 'weight'),

'formats':('U10', 'i4', 'f8')})

Output: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

Numerical types

Program
np.dtype({'names':('name', 'age', 'weight'),

'formats':((np.str_, 10), int, np.float32)})



Output: dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])

Compound type

np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

Output: dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])

Comma-separated string

np.dtype('S10,i4,f8')

Output: dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])

 The first (optional) character is < or >, which means “little endian” or “big
endian,” respectively, and specifies the ordering convention for significant
bits.
 The next character specifies the type of data: characters, bytes, ints,
floating points, and so on.
 The last character or characters represents the size of the object in bytes.
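These format codes can be inspected directly on constructed dtypes. A small sketch confirming the type character and byte size for a few of the codes used above:

```python
import numpy as np

# 'i4': 4-byte integer; 'f8': 8-byte float; 'U10': 10-character Unicode string
print(np.dtype('i4').kind, np.dtype('i4').itemsize)   # i 4
print(np.dtype('f8').kind, np.dtype('f8').itemsize)   # f 8
print(np.dtype('U10').itemsize)                       # 40 (4 bytes per character)
```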

NumPy data types

Record Arrays: Structured Arrays with a Twist

 NumPy also provides the np.recarray class, which is almost identical to the
structured arrays, but with one additional feature: fields can be accessed as
attributes rather than as dictionary keys.

data['age']
Output: array([25, 45, 37, 19], dtype=int32)

Record array

data_rec = data.view(np.recarray)
data_rec.age

Output: array([25, 45, 37, 19], dtype=int32)



DATA MANIPULATION WITH PANDAS

 Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.
 DataFrames are essentially multidimensional arrays with attached row
and column labels, and often with heterogeneous types and/or missing
data.
 Pandas implements a number of powerful data operations familiar to users
of both database frameworks and spreadsheet programs.
 NumPy ndarray data structure provides essential features for the type of
clean, well-organized data seen in numerical computing tasks.

Installing and Using Pandas

 Installing Pandas on your system requires NumPy to be installed


 Once Pandas is installed, you can import it and check the version:

import pandas
pandas.__version__
Output: '0.18.1'

 Just as we import NumPy under the alias np, we import Pandas under the alias pd:

import pandas as pd

 To display all the contents of the pandas namespace:

pd.<TAB>

 To display the built-in Pandas documentation:

pd?

Introducing Pandas Objects

 Pandas objects as enhanced versions of NumPy structured arrays in which


the rows and columns are identified with labels rather than simple integer
indices.
 Pandas provide a host of useful tools, methods, and functionality on top of
the basic data structures.
 Three fundamental Pandas data structures:
1. Series
2. DataFrame
3. Index
 Standard NumPy and Pandas imports

import numpy as np
import pandas as pd

1. The Pandas Series Object

 A Pandas Series is a one-dimensional array of indexed data. It can be


created from a list or array

Syntax

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Output: 0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
 The Series wraps both a sequence of values and a sequence of indices,
which we can access with the values and index attributes.
Syntax
data.values
Output: array([ 0.25, 0.5 , 0.75, 1. ])
 The index is an array-like object of type pd.Index
Syntax
data.index
Output: RangeIndex(start=0, stop=4, step=1)

 data can be accessed by the associated index


Syntax
data[1]
Output: 0.5

Syntax

data[1:3]
Output: 1 0.50
2 0.75
dtype: float64

a. Series as generalized NumPy array

 NumPy array has an implicitly defined integer index used to access the values
 Pandas Series has an explicitly defined index associated with the values
 This explicit index definition gives the Series object additional capabilities
Program
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Output: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64

data['b']

Output: 0.5

 Noncontiguous or nonsequential indices can also be used:


Program
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
Output: 2 0.25
5 0.50

3 0.75
7 1.00
dtype: float64

data[5]

Output: 0.5

b. Series as specialized dictionary

 A dictionary is a structure that maps arbitrary keys to a set of arbitrary


values, and a Series is a structure that maps typed keys to a set of typed
values.
 Constructing a Series object directly from a Python dictionary

Program

population_dict = {'California': 38332521,

'Texas': 26448193,

'New York': 19651127,

'Florida': 19552860,

'Illinois': 12882135}

population = pd.Series(population_dict)
population

Output: California 38332521

Florida 19552860

Illinois 12882135

New York 19651127

Texas         26448193
dtype: int64

 By default, a Series will be created where the index is drawn from the sorted keys. Dictionary-style item access can be performed:
population['California']

Output: 38332521

 The Series also supports array-style operations such as slicing:

population['California':'Illinois']
[Type here] [Type here] [Type here]

Output: California    38332521
Florida       19552860
Illinois      12882135
dtype: int64

c. Constructing Series objects

 There are a few ways of constructing a Pandas Series from scratch; all are some version of:

pd.Series(data, index=index)
 where index is an optional argument, and data can be one of many entities.

Example

 data can be a list or NumPy array, in which case index defaults to an


integer sequence
pd.Series([2, 4, 6])
Output: 0    2
1    4
2    6
dtype: int64
 data can be a scalar, which is repeated to fill the specified index:

pd.Series(5, index=[100, 200, 300])
Output: 100 5
200 5
300 5
dtype: int64
 data can be a dictionary, in which case index defaults to the sorted dictionary keys:

pd.Series({2:'a', 1:'b', 3:'c'})
Output: 1    b
2    a
3    c
dtype: object
 the index can be explicitly set if a different result is preferred:

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
Output: 3    c
2    a
dtype: object

CONSTRUCTING DATAFRAME OBJECTS


a. DataFrame as a generalized NumPy array

 A Series is an analog of a one-dimensional array with flexible indices


 A DataFrame is an analog of a two-dimensional array with both flexible
row indices and flexible column names.
 From a single Series object.
 From a list of dicts.
 From a dictionary of Series objects.
 From a two-dimensional NumPy array.
 From a NumPy structured array.

 Construct a new Series listing the area of each of the five states
Program
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
Output: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas       695662
dtype: int64
 Use a dictionary to construct a single two-dimensional object containing this
information
states = pd.DataFrame({'population': population,'area': area})
states
Output: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
 DataFrame has an index attribute that gives access to the index labels:

states.index
Output:
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
 the DataFrame has a columns attribute, which is an Index object holding
the column labels
states.columns

Output: Index(['area', 'population'], dtype='object')


 the DataFrame can be thought of as a generalization of a two-dimensional
NumPy array, where both the rows and columns have a generalized index for
accessing the data

b. DataFrame as specialized dictionary

 A dictionary maps a key to a value, a DataFrame maps a column name


to a Series of column data.
states['area']
Output: California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
 In a two-dimensional NumPy array, data[0] will return the first row.
For a DataFrame, data['col0'] will return the first column.
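This row-versus-column difference can be demonstrated side by side. A sketch with a small illustrative array (the names col0/col1 are just placeholders):

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape((3, 2))
df = pd.DataFrame(arr, columns=['col0', 'col1'], index=['x', 'y', 'z'])

first_row_numpy = arr[0]        # NumPy: index 0 selects the first ROW
first_col_pandas = df['col0']   # Pandas: 'col0' selects the first COLUMN

print(first_row_numpy)            # [0 1]
print(first_col_pandas.tolist())  # [0, 2, 4]
```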

c. Constructing DataFrame objects

 A Pandas DataFrame can be constructed in a variety of ways


 From a single Series object. A Data Frame is a collection of Series objects,
and a single column Data Frame can be constructed from a single Series:
pd.DataFrame(population, columns=['population'])
Output: population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193

d. From a list of dicts

 Any list of dictionaries can be made into a DataFrame:

data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

Output:    a  b
0  0  0
1  1  2
2  2  4

 Even if some keys in the dictionary are missing, Pandas will fill them in with
NaN (i.e.,“not a number”) values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
Output: a b c
0 1.0 2 NaN
1 NaN 3 4.0

e. From a dictionary of Series objects

 A DataFrame can be constructed from a dictionary of Series objects:

pd.DataFrame({'population': population, 'area': area})

Output: area  population

California 423967 38332521

Florida 170312 19552860

Illinois 149995 12882135

New York 141297 19651127

Texas 695662 26448193

f. From a two-dimensional NumPy array

 Given a two-dimensional array of data, we can create a DataFrame with any


specified column and index names. If omitted, an integer index will be used
for each:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
Output: foo bar
a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718
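The last construction route listed earlier, from a NumPy structured array, is not shown above. A minimal sketch (the field names A and B are illustrative assumptions): the named fields of the structured array become the columns of the DataFrame.

```python
import numpy as np
import pandas as pd

# a structured array with named fields
A = np.zeros(3, dtype=[('A', '<i8'), ('B', '<f8')])
df = pd.DataFrame(A)

print(df.columns.tolist())  # ['A', 'B']
print(df.shape)             # (3, 2)
```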

Data Indexing and Selection

A Series object acts in many ways like a one dimensional NumPy array, and in many ways like a standard
Python dictionary. It will help us to understand the patterns of data indexing and selection in these arrays.
 Series as dictionary
 Series as one-dimensional array
 Indexers: loc, iloc, and ix

 Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of
values.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data

a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64

data['b']

0.5
Examine the keys/indices and values
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values
i. 'a' in data
True
ii. data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')

iii.list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Modifying series object


Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by
assigning to a new key, you can extend a Series by assigning to a new index value.

data['e'] = 1.25
data

a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64

 Series as one-dimensional array


A Series builds on this dictionary-like interface and provides array-style item selection via the same basic
mechanisms as NumPy arrays—that is, slices, masking, and fancy indexing.

Slicing by explicit index


data['a':'c']

a 0.25
b 0.50
c 0.75
dtype: float64

Slicing by implicit integer index


data[0:2]

a 0.25
b 0.50
dtype: float64

Masking
data[(data > 0.3) & (data < 0.8)]

b 0.50
c 0.75

dtype: float64

Fancy indexing
data[['a', 'e']]

a 0.25
e 1.25
dtype: float64

 Indexers: loc, iloc, and ix


Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are
not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1a
3b
5c
dtype: object

loc - the loc attribute allows indexing and slicing that always references the explicit index.
data.loc[1]
'a'

data.loc[1:3]
1a
3b
dtype: object
iloc - The iloc attribute allows indexing and slicing that always references the implicit Python-style index.
data.iloc[1]
'b'

data.iloc[1:3]
3b
5c
dtype: object

ix - ix is a hybrid of the two, and for Series objects is equivalent to standard [ ]-based indexing. (Note: the ix indexer was deprecated in later Pandas releases in favor of loc and iloc.)

Data Selection in DataFrame


 DataFrame as a dictionary
 DataFrame as two-dimensional array
 Additional indexing conventions

DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.


The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing
of the column name.

Dictionary-style indexing of the column name (here sub1 and sub2 are assumed to be Series of marks indexed by student name):

result = pd.DataFrame({'DS': sub1, 'FDS': sub2})
result['DS']

DS
sai 90
ram 85
kasim 92
tamil 89

Attribute-style access with column names that are strings


result.DS

DS
sai 90
ram 85
kasim 92
tamil 89

Comparing attribute style and dictionary style accesses


result.DS is result['DS']

True
Modify the object
Like with the Series objects this dictionary-style syntax can also be used to modify the object, in this case to add
a new column:

result['TOTAL'] = result['DS'] + result['FDS']
result

        DS  FDS  TOTAL
sai     90   91    181
ram     85   95    180
kasim   92   89    181
tamil   89   90    179

DataFrame as two-dimensional array


 Transpose
We can transpose the full DataFrame to swap rows and columns.
result.T

sai ram kasim tamil


DS 90 85 92 89
FDS 91 95 89 90
TOTAL 181 180 181 179

Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the
underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame
index and column labels are maintained in the result
 loc
result.loc[:'ram', :'FDS']

        DS  FDS
sai     90   91
ram     85   95
ram 85 95
 iloc
result.iloc[:2, :2]

        DS  FDS
sai     90   91
ram     85   95

 ix
result.ix[:2, :'FDS']

        DS  FDS
sai     90   91
ram     85   95

 Masking and Fancy indexing


In the loc indexer we can combine masking and fancy indexing as in the following:
result.loc[result['TOTAL'] > 180, ['DS', 'FDS']]

        DS  FDS
sai     90   91
kasim   92   89

 Modifying values
Indexing conventions may also be used to set or modify values; this is done in the standard way that
you might be accustomed to from working with NumPy.

result.iloc[1, 1] = 70
result

        DS  FDS  TOTAL
sai     90   91    181
ram     85   70    180
kasim   92   89    181
tamil   89   90    179
Additional indexing conventions
Slicing row wise

result['sai':'kasim']


        DS  FDS  TOTAL
sai     90   91    181
ram     85   70    180
kasim   92   89    181

Such slices can also refer to rows by number rather than by index:
result[1:3]

        DS  FDS  TOTAL
ram     85   70    180
kasim   92   89    181

Masking row wise


result[result['TOTAL'] > 180]

        DS  FDS  TOTAL
sai     90   91    181
kasim   92   89    181

Operating on Data in Pandas


Pandas inherits much of this functionality from NumPy, and the ufuncs. So Pandas having the ability to
perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.)
and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
For unary operations like negation and trigonometric functions, these ufuncs will preserve index and column
labels in the output.
For binary operations such as addition and multiplication, Pandas will automatically align indices when passing
the objects to the ufunc.
Here we are going to see how the universal functions are working in series and DataFrames by
 Index preservation
 Index alignment

Index Preservation
Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.
We can use all arithmetic and special universal functions as in NumPy on pandas. In outputs the index will
preserved (maintained) as shown below.
For series
x = pd.Series([1, 2, 3, 4])
x

0    1
1    2
2    3
3    4
dtype: int64
For DataFrame
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)),
                  columns=['a', 'b', 'c', 'd'])
df

   a  b  c  d
0 1 4 1 4
1 8 4 0 4
2 7 7 7 2

For universal functions (here we use the exponential as an example):


Ufuncs for Series
np.exp(x)

0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64

Ufuncs for Data Frame


np.exp(df)

a b c d

0 2.718282 54.598150 2.718282 54.598150

1 2980.957987 54.598150 1.000000 54.598150

2 1096.633158 1096.633158 1096.633158 7.389056

Index Alignment
Pandas will align indices in the process of performing the operation. This is very convenient when you are
working with incomplete data, as we’ll.

Index alignment in Series


suppose we are combining two different data sources, then the index will aligned accordingly.
x=pd.Series([2,4,6],index=[1,3,5])
y=pd.Series([1,3,5,7],index=[1,2,3,4])
x+y

1 3.0
2 NaN
3 9.0
4 NaN
5 NaN
dtype: float64

The resulting array contains the union of indices of the two input arrays, which we could determine using
standard Python set arithmetic on these indices.
Any item for which one or the other does not have an entry is marked with NaN, or “Not a Number,” which is
how Pandas marks as missing data.

Fill value in missing data (fill_value)


If using NaN values is not the desired behavior, we can modify the fill value using appropriate object methods
in place of the operators.

x.add(y,fill_value=0)

1 3.0
2 3.0
3 9.0
4 7.0
5 6.0
dtype: float64

Index alignment in DataFrame


A similar type of alignment takes place for both columns and indices when you are performing operations on
DataFrames.

A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A   B
0  1  11
1  5   1

B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6

A + B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN

Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result
are sorted. As was the case with Series, we can use the associated object’s arithmetic method and pass any
desired fill_value to be used in place of missing entries. Here we’ll fill with the mean of all values in A.


fill = A.stack().mean()
A.add(B, fill_value=fill)

A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5

Mapping between Python operators and Pandas methods.


Python operator Pandas method(s)
+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
** pow()

Operations between Data Frame and Series


When you are performing operations between a DataFrame and a Series, the index and column alignment is
similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-
dimensional and one-dimensional NumPy array.
A = rng.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])

A - A[0]
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
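The same row-wise broadcasting applies between a DataFrame and one of its rows; to operate column-wise instead, the object methods accept an axis keyword. A sketch (the column names QRST are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('QRST'))

# default: subtracting a row Series aligns on columns (row-wise)
row_wise = df - df.iloc[0]

# axis=0: subtracting a column Series aligns on the index (column-wise)
col_wise = df.subtract(df['R'], axis=0)

print((row_wise.iloc[0] == 0).all())   # the first row becomes all zeros
print((col_wise['R'] == 0).all())      # the 'R' column becomes all zeros
```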

Handling Missing Data


A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame.
Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or
choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation
of one bit in the data representation to locally indicate the null status of a value.

In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing
integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a
missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point
specification.

Missing Data in Pandas


The way in which Pandas handles missing values is constrained by its NumPy package, which does not have a
built-in notion of NA values for non floating- point data types.

NumPy supports fourteen basic integer types once you account for available precisions, signedness, and
endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an
unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new
fork of the NumPy package.

Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.

None: Pythonic missing data


The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in
Python code. Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only
in arrays with data type 'object' (i.e., arrays of Python objects)

This dtype=object means that the best common type representation NumPy could infer for the contents of the
array is that they are Python objects.

NaN: Missing numerical data


NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point
representation.
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')
You should be aware that NaN is a bit like a data virus—it infects any other object it touches. Regardless of the
operation, the result of arithmetic with NaN will be another NaN

1 + np.nan

nan
0 * np.nan

nan
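Because ordinary aggregates propagate this "virus", NumPy also supplies NaN-ignoring counterparts. A short sketch:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

print(np.sum(vals))      # nan: the NaN infects the ordinary sum
print(np.nansum(vals))   # 8.0: NaN entries are ignored
print(np.nanmax(vals))   # 4.0
```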

NaN and None in Pandas


NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably.
pd.Series([1, np.nan, 2, None])
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For types that don’t have an available sentinel value, Pandas automatically type-casts when NA values are
present. For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-
point type to accommodate the NA


x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64
x[0] = None
x
0 NaN
1 1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a
NaN value.

Pandas handling of NAs by type


Typeclass Conversion when storing NAs NA sentinel value
floating No change np.nan
object No change None or np.nan
integer Cast to float64 np.nan
boolean Cast to object None or np.nan

Note : In Pandas, string data is always stored with an object dtype.
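The casting rules in the table can be observed directly by constructing Series with and without missing entries. A brief sketch:

```python
import numpy as np
import pandas as pd

# an all-integer Series gets an integer dtype...
print(pd.Series([1, 2, 3]).dtype)        # int64

# ...but a single NaN forces an upcast to float64
print(pd.Series([1, np.nan, 3]).dtype)   # float64

# string data is stored with an object dtype
print(pd.Series(['a', 'b']).dtype)
```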

Operating on Null Values


there are several useful methods for detecting, removing, and replacing null values in Pandas data
structures. They are:
 isnull() - Generate a Boolean mask indicating missing values
 notnull() - Opposite of isnull()
 dropna() - Return a filtered version of the data
 fillna() - Return a copy of the data with missing values filled or imputed

Detecting null values


Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
isnull()
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0 False
1 True
2 False
3 True
dtype: bool

notnull()
data.notnull()

0 True
1 False
2 True


3 False
dtype: bool

Dropping null values

dropna()
data.dropna()

0        1
2    hello
dtype: object

Dropping null values in dataframe

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6

df.dropna()

     0    1  2
1  2.0  3.0  5

Drop values in column or row


We can drop NA values along a different axis; axis=1 drops all columns containing a null value.

df.dropna(axis='columns')

   2
0  2
1  5
2  6
Rows or columns having all null values
You can also specify how='all', which will only drop rows/columns that are all null values.

df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6

Specific no of null values (thresh)


The thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
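The thresh behavior above can be verified end to end; this self-contained sketch rebuilds the same DataFrame and keeps only rows with at least three non-null values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan  # add a column that is entirely NaN

# Only row 1 has three or more non-null values, so it alone survives:
kept = df.dropna(axis='rows', thresh=3)
print(kept)
```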

Filling null values


Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a
single number like zero, or it might be some sort of imputation or interpolation from the good values. You could
do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides
the fillna() method, which returns a copy of the array with the null values replaced.
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

Fill with single value


We can fill NA entries with a single value, such as zero
data.fillna(0)

a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64

Fill with previous value


We can specify a forward-fill to propagate the previous value forward
data.fillna(method='ffill')

a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64


Fill with next value


We can specify a back-fill to propagate the next values backward.

data.fillna(method='bfill')

a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
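Newer pandas releases deprecate the method= keyword of fillna(); the same fills are available directly as the ffill() and bfill() methods. A sketch using the same Series as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()   # propagate the previous value forward
backward = data.bfill()  # propagate the next value backward

print(forward.tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0]
print(backward.tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0]
```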

Hierarchical Indexing
 Hierarchical indexing (also known as multi-indexing) to incorporate multiple
index levels within a single index.
 In this way, higher dimensional data can be compactly represented within
the familiar one-dimensional Series and two-dimensional DataFrame
objects.

1. Multiply Indexed Series

 Represent two-dimensional data within a one-dimensional Series.

The better way: Pandas MultiIndex

 Our tuple-based indexing is essentially a rudimentary multi-index, and
the Pandas MultiIndex type gives us the type of operations we wish to have.
 We can create a multi-index from the tuples as follows:

Syntax

index = pd.MultiIndex.from_tuples(index)   # index here is the list of (state, year) tuples built earlier
index

Output: MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
                   labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

 The MultiIndex contains multiple levels of indexing - in this case, the state
names and the years, as well as multiple labels for each data point which
encode these levels.
Syntax

pop = pop.reindex(index)

pop

Output: California  2000    33871648
                    2010    37253956
        New York    2000    18976457
                    2010    19378102
        Texas       2000    20851820
                    2010    25145561
dtype: int64

 First two columns of the Series representation show the multiple index
values, while the third column shows the data.
 To access all data for which the second index is 2010, we can simply use
the Pandas slicing notation:

pop[:, 2010]

Output: California 37253956

New York 19378102

Texas

25145561

dtype: int64

2. Methods of MultiIndex Creation

 The most straightforward way to construct a multiply indexed Series or
DataFrame is to simply pass a list of two or more index arrays to the
constructor.

Example:

df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

Output:       data1     data2
      a 1  0.554233  0.356072
        2  0.925244  0.219474
      b 1  0.441759  0.610054
        2  0.171495  0.886688

 The work of creating the MultiIndex is done in the background.



 Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas
will automatically recognize this and use a MultiIndex by default.

3. Explicit MultiIndex constructors

 For more flexibility in how the index is constructed, you can instead use
the class method constructors available in the pd.MultiIndex.
 For example, as we did before, you can construct the MultiIndex
from a simple list of arrays, giving the index values within each
level:

Construct a MultiIndex from a list of arrays

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

Output: MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Construct a MultiIndex from a list of tuples

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

Output: MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

4. MultiIndex level names

 You can name the levels by passing the names argument to any of the above
MultiIndex constructors, or by setting the names attribute of the index.

Syntax

pop.index.names = ['state', 'year']
pop

Output: state       year
        California  2000    33871648
                    2010    37253956
        New York    2000    18976457
                    2010    19378102
        Texas       2000    20851820
                    2010    25145561
dtype: int64

5. MultiIndex for columns

 In a DataFrame, the rows and columns are completely symmetric, and just as
the rows can have multiple levels of indices, the columns can have multiple
levels as well.

Program

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Output: subject     Bob          Guido        Sue
        type        HR    Temp   HR    Temp   HR    Temp
year visit
2013 1             31.0   38.7  32.0   36.7  35.0   37.2
     2             44.0   37.7  50.0   35.0  29.0   36.7
2014 1             30.0   37.4  39.0   37.8  61.0   36.9
     2             47.0   37.8  48.0   37.3  51.0   36.5
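With hierarchical columns, indexing with a top-level column name recovers a sub-DataFrame of that subject's measurements. This runnable sketch rebuilds the frame with a fixed seed (the seed is our addition, so the numbers will differ from the table above):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

rng = np.random.RandomState(0)      # fixed seed for reproducibility
data = np.round(rng.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

health_data = pd.DataFrame(data, index=index, columns=columns)

# Selecting a top-level column label gives a DataFrame for that subject:
guido = health_data['Guido']
print(guido.shape)          # (4, 2)
print(list(guido.columns))  # ['HR', 'Temp']
```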

6. Methods of MultiIndex Creation



 If you pass a dictionary with appropriate tuples as keys, Pandas will
automatically recognize this and use a MultiIndex by default:

Syntax
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)

Output: California 2000 33871648

2010 37253956

New York 2000 18976457

2010 19378102

Texas 2000 20851820

2010 25145561

dtype: int64

7. Explicit MultiIndex constructors

 Construct the MultiIndex from a simple list of arrays



Syntax

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

Output: MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

 Construct the MultiIndex from a list of tuples

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

Output: MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

 Construct the MultiIndex from a Cartesian product of single indices

pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

Output: MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

 Construct the MultiIndex directly using its internal encoding by passing
levels and labels

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Output: MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
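The constructors above can be checked against each other; for this data, from_arrays, from_tuples, and from_product all build the same index (note that modern pandas renames the labels= keyword of the direct pd.MultiIndex constructor to codes=):

```python
import pandas as pd

m1 = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
m2 = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
m3 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs in the same order:
print(m1.equals(m2), m2.equals(m3))  # True True
```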

8. Indexing and Slicing a MultiIndex

Multiply indexed Series

Consider the multiply indexed Series of state populations:

pop

Output: state       year
        California  2000    33871648
                    2010    37253956
        New York    2000    18976457
                    2010    19378102
        Texas       2000    20851820
                    2010    25145561
dtype: int64

We can access single elements by indexing with multiple terms

pop['California', 2000]

Output: 33871648

9. MultiIndex partial indexing

pop['California']

Output: year

2000 33871648

2010 37253956

dtype: int64

10. Partial slicing

pop.loc['California':'New York']

Output: state year

California 2000 33871648

2010 37253956

New York 2000 18976457

2010 19378102

dtype: int64

 With sorted indices, we can perform partial indexing on lower levels by
passing an empty slice in the first index.

pop[:, 2000]

Output: state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
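The indexing patterns of this section can be reproduced end to end; this self-contained sketch builds the pop Series and exercises element access, partial indexing, and lower-level indexing:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102,
                 20851820, 25145561], index=index)

print(pop['California', 2000])  # single element
print(pop['California'])        # partial indexing: Series over years
print(pop[:, 2000])             # lower-level indexing: Series over states
```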

11. Data Aggregations on Multi-Indices

 Pandas has built-in data aggregation methods, such as mean(), sum(), and max().
 For hierarchically indexed data, these can be passed a level parameter
that controls which subset of the data the aggregate is computed on.

Example

health_data

Output: subject     Bob          Guido        Sue
        type        HR    Temp   HR    Temp   HR    Temp
year visit
2013 1             31.0   38.7  32.0   36.7  35.0   37.2
     2             44.0   37.7  50.0   35.0  29.0   36.7
2014 1             30.0   37.4  39.0   37.8  61.0   36.9
     2             47.0   37.8  48.0   37.3  51.0   36.5

data_mean = health_data.mean(level='year')
data_mean

Output: subject     Bob          Guido         Sue
        type        HR    Temp   HR    Temp    HR    Temp
year
2013               37.5   38.2  41.0   35.85  32.0   36.95
2014               38.5   37.6  43.5   37.55  56.0   36.70

data_mean.mean(axis=1, level='type')

Output: type         HR       Temp
year
2013      36.833333  37.000000
2014      46.000000  37.283333
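Note that passing level= to mean() was removed in pandas 2.0; the same aggregation is now expressed with groupby(level=...). A runnable sketch using Bob's readings from the table above as stand-in data:

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
# Bob's HR/Temp column from the health_data table above:
df = pd.DataFrame({'HR': [31.0, 44.0, 30.0, 47.0],
                   'Temp': [38.7, 37.7, 37.4, 37.8]}, index=index)

# Modern equivalent of df.mean(level='year'):
by_year = df.groupby(level='year').mean()
print(by_year)
```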

COMBINING DATASETS
CONCAT AND APPEND

 concatenation of Series and DataFrames with the pd.concat function

Program

import pandas as pd
import numpy as np

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

Output:     A   B   C
        0  A0  B0  C0
        1  A1  B1  C1
        2  A2  B2  C2

1. Concatenation of NumPy Arrays

Combine the contents of two or more arrays into a single array:

x = [1, 2, 3]

y = [4, 5, 6]

z = [7, 8, 9]

np.concatenate([x, y, z])

Output: array([1, 2, 3, 4, 5, 6, 7, 8, 9])

 The first argument is a list or tuple of arrays to concatenate
 The axis keyword allows you to specify the axis along which the result will
be concatenated

x = [[1, 2],

[3, 4]]

np.concatenate([x, x], axis=1)

Output: array([[1, 2, 1, 2],

[3, 4, 3, 4]])

2. Simple Concatenation with pd.concat

 Pandas has a function, pd.concat(), which has a similar syntax to
np.concatenate but contains a number of options.

pd.concat(objs, axis=0, join='outer', join_axes=None,

ignore_index=False, keys=None, levels=None,

names=None, verify_integrity=False, copy=True)

 pd.concat() can be used for a simple concatenation of Series or DataFrame objects,


 np.concatenate() can be used for simple concatenations of arrays

Program

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

Output: 1 A

2 B

3 C

4 D

5 E

6 F

dtype: object

 To concatenate higher-dimensional objects, such as DataFrames


Program
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))
Output

df1          df2          pd.concat([df1, df2])
    A   B        A   B        A   B
1  A1  B1    3  A3  B3    1  A1  B1
2  A2  B2    4  A4  B4    2  A2  B2
                          3  A3  B3
                          4  A4  B4

 pd.concat allows specification of an axis along which concatenation will
take place.

a. Duplicate indices

 A difference between np.concatenate and pd.concat is that Pandas
concatenation preserves indices, even if the result will have duplicate indices.
Example
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
print(x); print(y); print(pd.concat([x, y]))

Output

x           y           pd.concat([x, y])
    A   B       A   B       A   B
0  A0  B0   0  A2  B2   0  A0  B0
1  A1  B1   1  A3  B3   1  A1  B1
                        0  A2  B2
                        1  A3  B3
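Because pd.concat preserves indices, its signature (shown earlier) offers two remedies for duplicates: ignore_index=True renumbers the result, and verify_integrity=True raises an error instead. A sketch:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Plain concat keeps the duplicate index [0, 1, 0, 1];
# ignore_index=True renumbers the rows 0..3 instead:
renumbered = pd.concat([x, y], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]

# verify_integrity=True refuses to produce duplicate indices:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print('caught:', e)
```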

b. Adding MultiIndex keys

 the keys option to specify a label for the data sources; the result will
be a hierarchically indexed series containing the data
Program
print(x); print(y); print(pd.concat([x, y], keys=['x', 'y']))

Output

x           y           pd.concat([x, y], keys=['x', 'y'])
    A   B       A   B         A   B
0  A0  B0   0  A2  B2   x 0  A0  B0
1  A1  B1   1  A3  B3     1  A1  B1
                        y 0  A2  B2
                          1  A3  B3

c. Concatenation with joins

 data from different sources might have different sets of column names
 Consider the concatenation of the following two DataFrames
Program
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))

Output

df5             df6             pd.concat([df5, df6])
    A   B   C       B   C   D        A   B   C    D
1  A1  B1  C1   3  B3  C3  D3   1   A1  B1  C1  NaN
2  A2  B2  C2   4  B4  C4  D4   2   A2  B2  C2  NaN
                                3  NaN  B3  C3   D3
                                4  NaN  B4  C4   D4

 By default, the entries for which no data is available are filled with NA values.
 To change this, we can specify one of several options for the join and
join_axes parameters of the concatenate function.
 By default, the join is a union of the input columns (join='outer'), but we
can change this to an intersection of the columns using join='inner':
Program
print(df5); print(df6); print(pd.concat([df5, df6], join='inner'))

Output

df5             df6             pd.concat([df5, df6], join='inner')
    A   B   C       B   C   D        B   C
1  A1  B1  C1   3  B3  C3  D3   1   B1  C1
2  A2  B2  C2   4  B4  C4  D4   2   B2  C2
                                3   B3  C3
                                4   B4  C4

d. The append() method

 Series and DataFrame objects have an append method that can accomplish
the same thing in fewer keystrokes.
 pd.concat([df1, df2])

print(df1); print(df2); print(df1.append(df2))

df1          df2          df1.append(df2)
    A   B        A   B        A   B
1  A1  B1    3  A3  B3    1  A1  B1
2  A2  B2    4  A4  B4    2  A2  B2
                          3  A3  B3
                          4  A4  B4

 append() method in Pandas does not modify the original object, it creates a new
object with the combined data.

e. Combining Datasets: Merge and Join

 A key feature offered by Pandas is its high-performance, in-memory join and
merge operations.

Relational Algebra

 The behavior of pd.merge() is a subset of what is known as relational
algebra, which is a formal set of rules for manipulating relational data,
and forms the conceptual foundation of operations available in most databases.

Categories of Joins

 The pd.merge() function implements a number of types of joins:


o one-to-one
o many-to-one

o many-to-many joins
 All three types of joins are accessed via an identical call to the pd.merge() interface;
 The type of join performed depends on the form of the input data.

One-to-one joins

Consider the following two DataFrames, which contain information on several
employees in a company.

Program

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)

Output

df1 df2
employee group employee hire_date

0 Bob Accounting 0 Lisa 2004

1 Jake Engineering 1 Bob 2008

2 Lisa Engineering 2 Jake 2012


3 Sue HR 3 Sue 2014

 To combine this information into a single DataFrame, we can use the
pd.merge() function:
Program
df3 = pd.merge(df1, df2)
df3
Output:

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
 The pd.merge() function recognizes that each DataFrame has an “employee”
column, and automatically joins using this column as a key.

Many-to-one joins

 Many-to-one joins are joins in which one of the two key columns contains
duplicate entries.
Example
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3); print(df4); print(pd.merge(df3, df4))

df3 df4
employee group hire_date group supervisor
0 Bob Accounting 2008 0 Accounting Carly
1 Jake Engineering 2012 1 Engineering Guido
2 Lisa Engineering 2004 2 HR Steve
3 Sue HR 2014

pd.merge(df3, df4)

employee group hire_date supervisor


0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve

Many-to-many joins

 If the key column in both the left and right array contains duplicates, then
the result is a many-to-many merge.
 Consider the following, where we have a DataFrame showing one or more
skills associated with a particular group.
Program
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
print(df1); print(df5); print(pd.merge(df1, df5))
Output

df1                       df5
  employee        group          group        skills
0      Bob   Accounting   0  Accounting          math
1     Jake  Engineering   1  Accounting  spreadsheets
2     Lisa  Engineering   2  Engineering       coding
3      Sue           HR   3  Engineering        linux
                          4          HR  spreadsheets
                          5          HR  organization

Syntax

pd.merge(df1, df5)

Output

employee group Skills


0 Bob Accounting Math

1 Bob Accounting Spreadsheets

2 Jake Engineering Coding

3 Jake Engineering Linux

4 Lisa Engineering Coding

5 Lisa Engineering Linux

6 Sue HR Spreadsheets
7 Sue HR Organization
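pd.merge also accepts explicit on= and how= keywords to name the key column and the join type. A sketch using the employee frames from above, with how='left' keeping every row of the left frame:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# Join explicitly on the "employee" column, keeping all rows of df1:
merged = pd.merge(df1, df2, on='employee', how='left')
print(merged)
```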

AGGREGATION AND GROUPING


Computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight
into the nature of a potentially large dataset.

Simple Aggregation in Pandas


As with a one dimensional NumPy array, for a Pandas Series the aggregates return a single value.
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
Sum
ser.sum()
2.8119254917081569

Mean
ser.mean()
0.56238509834163142
The same operations can also be performed on a DataFrame.

Listing of Pandas aggregation methods


Aggregation          Description
count()              Total number of items
first(), last()      First and last item
mean(), median()     Mean and median
min(), max()         Minimum and maximum
std(), var()         Standard deviation and variance
mad()                Mean absolute deviation
prod()               Product of all items
sum()                Sum of all items
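Most of the methods in the table can be checked on a tiny Series. (Note that mad() listed above was removed in pandas 2.0.)

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])

print(ser.count(), ser.sum(), ser.mean())  # 4 10 2.5
print(ser.min(), ser.max(), ser.median())  # 1 4 2.5
print(ser.prod())                          # 24
```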

Planets Data

 It gives information on planets that astronomers have discovered around


other stars (known as extrasolar planets or exoplanets for short).
 It can be downloaded with a simple Seaborn command:
Program
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Output: (1035, 6)

planets.head()
Output: method number orbital_period mass distance year

0 Radial Velocity 1 269.300 7.10 77.40 2006


1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009

Simple Aggregation in Pandas


df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

Output: A B
0 0.155995 0.020584

1 0.058084 0.969910

2 0.866176 0.832443

3 0.601115 0.212339

4 0.708073 0.181825

df.mean()

Output: A 0.477888

B 0.443420

dtype: float64

df.mean(axis='columns')

Output: 0 0.088290
1 0.513997

2 0.849309

3 0.406727
4 0.444949

dtype: float64

 describe() computes several common aggregates for each column and
returns the result.
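describe() bundles count, mean, std, min, the quartiles, and max into one table; a sketch on a small frame of our own:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]})

# One row per summary statistic, one column per numeric column:
summary = df.describe()
print(summary)
```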

GroupBy: Split, Apply, Combine

Fig.4.7 A visual representation of a groupby operation

Figure 4.7 makes clear what the GroupBy accomplishes:

 The split step involves breaking up and grouping a DataFrame depending


on the value of the specified key.
 The apply step involves computing some function, usually an
aggregate, transformation, or filtering, within the individual groups.
 The combine step merges the results of these operations into an output array.

Program

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],

'data': range(6)}, columns=['key', 'data'])

df

Output:   key  data
       0    A     0
       1    B     1
       2    C     2
       3    A     3
       4    B     4
       5    C     5

Syntax

df.groupby('key')

Output: <pandas.core.groupby.DataFrameGroupBy object at 0x117272160>

 we can apply an aggregate to this DataFrameGroupBy object, which will


perform the appropriate apply/combine steps to produce the desired result:

Syntax

df.groupby('key').sum()

Output: data

key

A 3

B 5

C 7

The GroupBy object

 GroupBy are aggregate, filter, transform, and apply

Aggregate, filter, transform, apply

 GroupBy objects have aggregate(), filter(), transform(), and apply() methods that
efficiently implement a variety of useful operations before combining the grouped
data.
Program
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df

Output:   key  data1  data2
       0    A      0      5
       1    B      1      0
       2    C      2      3
       3    A      3      3
       4    B      4      7
       5    C      5      9

Aggregation

df.groupby('key').aggregate(['min', np.median, max])

Output:      data1             data2
        min  median  max  min  median  max
key
A         0     1.5    3    3     4.0    5
B         1     2.5    4    0     3.5    7
C         2     3.5    5    3     6.0    9

df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})
Output: data1 data2

key
A 0 5

B 1 7
C 2 9

Filtering

 A filtering operation allows you to drop data based on the group properties.

Program

def filter_func(x):

return x['data2'].std() > 4

print(df); print(df.groupby('key').std());

print(df.groupby('key').filter(filter_func))

Output

df                     df.groupby('key').std()
  key  data1  data2    key    data1     data2
0   A      0      5    A    2.12132  1.414214
1   B      1      0    B    2.12132  4.949747
2   C      2      3    C    2.12132  4.242641
3   A      3      3
4   B      4      7
5   C      5      9

df.groupby('key').filter(filter_func)
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9

 The filter() function should return a Boolean value specifying whether the
group passes the filtering.

Transformation

 Transformation can return some transformed version of the full data to recombine.

df.groupby('key').transform(lambda x: x - x.mean())

Output:    data1  data2
       0    -1.5    1.0
       1    -1.5   -3.5
       2    -1.5   -3.0
       3     1.5   -1.0
       4     1.5    3.5
       5     1.5    3.0
[Type here] [Type here] [Type here]

The apply() method

 apply() method apply an arbitrary function to the group results.


 The function should take a DataFrame, and return either a Pandas object (e.g.,
DataFrame, Series) or a scalar; the combine operation will be tailored to the type
of output returned.
 apply() that normalizes the first column by the sum of the second:

Program

def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

print(df); print(df.groupby('key').apply(norm_by_data2))

Output

df                     df.groupby('key').apply(norm_by_data2)
  key  data1  data2      key     data1  data2
0   A      0      5    0   A  0.000000      5
1   B      1      0    1   B  0.142857      0
2   C      2      3    2   C  0.166667      3
3   A      3      3    3   A  0.375000      3
4   B      4      7    4   B  0.571429      7
5   C      5      9    5   C  0.416667      9

Specifying the split key

 we split the DataFrame on a single column name.

A list, array, series, or index providing the grouping keys

 The key can be any series or list with a length matching that of the DataFrame.

L = [0, 1, 0, 1, 2, 0]
print(df); print(df.groupby(L).sum())

df                     df.groupby(L).sum()
  key  data1  data2       data1  data2
0   A      0      5    0      7     17
1   B      1      0    1      4      3
2   C      2      3    2      4      7
3   A      3      3
4   B      4      7
5   C      5      9

print(df); print(df.groupby(df['key']).sum())

df                     df.groupby(df['key']).sum()
  key  data1  data2         data1  data2
0   A      0      5    A        3      8
1   B      1      0    B        5      7
2   C      2      3    C        7     12
3   A      3      3
4   B      4      7
5   C      5      9

PIVOT TABLES
 A Pivot table is a similar operation that is commonly seen in spreadsheets and
other programs that operate on tabular data.

 The pivot table takes simple columnwise data as input, and groups the entries
into a two-dimensional table that provides a multidimensional summarization of
the data.

Motivating Pivot Tables

 Database of passengers on the Titanic, available through the Seaborn library


Program
import numpy as np
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head()

Output:
survived pclass sex age sibsp parch fare embarked class \\

0 0 3 male 22.0 1 0 7.2500 S Third


1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third

who adult_male deck embark_town alive alone


0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True

Pivot Tables by Hand

titanic.groupby('sex')[['survived']].mean()

Output:         survived
        sex
        female  0.742038
        male    0.188908

 We group by class and gender, select survival, apply a mean aggregate,


combine the resulting groups, and then unstack the hierarchical index to reveal
the hidden multidimensionality.

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

Output: class First Second Third


sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

Pivot Table Syntax


titanic.pivot_table('survived', index='sex', columns='class')

Output: class First Second Third


sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
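pivot_table aggregates with the mean by default, but the aggfunc keyword selects other aggregates. A sketch on a small frame of our own (the columns and values here are made-up stand-ins, not the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['female', 'female', 'male', 'male'],
                   'class': ['First', 'Third', 'First', 'Third'],
                   'fare': [70.0, 7.0, 50.0, 8.0]})

# Sum of fares per (sex, class) cell instead of the default mean:
table = df.pivot_table('fare', index='sex', columns='class', aggfunc='sum')
print(table)
```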

Multilevel pivot tables

 grouping in pivot tables can be specified with multiple levels, and via a number of
options

age = pd.cut(titanic['age'], [0, 18, 80])


titanic.pivot_table('survived', ['sex', age], 'class')
Output: class             First    Second     Third
sex     age
female  (0, 18]        0.909091  1.000000  0.511628
        (18, 80]       0.972973  0.900000  0.423729
male    (0, 18]        0.800000  0.600000  0.215686
        (18, 80]       0.375000  0.071429  0.133663

 add info on the fare paid using pd.qcut to automatically compute quantiles:

fare = pd.qcut(titanic['fare'], 2)

titanic.pivot_table('survived', ['sex', age], [fare, 'class'])



Output:
fare             [0, 14.454]
class                  First    Second     Third
sex    age
female (0, 18]           NaN  1.000000  0.714286
       (18, 80]          NaN  0.880000  0.444444
male   (0, 18]           NaN  0.000000  0.260870
       (18, 80]          0.0  0.098039  0.125000

fare             (14.454, 512.329]
class                  First    Second     Third
sex    age
female (0, 18]      0.909091  1.000000  0.318182
       (18, 80]     0.972973  0.914286  0.391304
male   (0, 18]      0.800000  0.818182  0.178571
       (18, 80]     0.391304  0.030303  0.192308
Prepared by Mrs.S.Ezhilvanji , AP/CSE ESCET
