UNIT IV FDS
Prepared By,
Mrs.S.EZHILVANJI
AP/CSE
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING
Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data – missing data – hierarchical indexing – combining datasets – aggregation and grouping – pivot tables
Introduction:
NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data
buffers. NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more
efficient storage and data operations as the arrays grow larger in size.
Basics of NumPy Arrays.
1. Attributes of arrays
Determining the size, shape, memory consumption, and data types of arrays
2. Indexing of arrays
Getting and setting the value of individual array elements
3. Slicing of arrays
Getting and setting smaller subarrays within a larger array
4. Reshaping of arrays
Changing the shape of a given array
5. Joining and splitting of arrays
Combining multiple arrays into one, and splitting one array into many
Syntax
import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # one-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
Output:
x3 ndim: 3
x3 shape: (3, 4, 5)
dtype: int64
itemsize: 8 bytes
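The attribute inspection above can be exercised as a runnable sketch; the array values are random, so only the attributes (not the contents) matter here:

```python
import numpy as np

rng = np.random.RandomState(0)
x3 = rng.randint(10, size=(3, 4, 5))  # three-dimensional array

print("ndim:    ", x3.ndim)      # number of dimensions
print("shape:   ", x3.shape)     # size of each dimension
print("size:    ", x3.size)      # total number of elements (3*4*5 = 60)
print("dtype:   ", x3.dtype)     # element type
print("itemsize:", x3.itemsize)  # bytes per element
print("nbytes:  ", x3.nbytes)    # total bytes = size * itemsize
```

Note that nbytes is always size times itemsize, so it can be derived from the other attributes.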
Syntax
x1
Output: array([5, 0, 3, 3, 7, 9])
x1[0]
Output: 5
x1[4]
Output: 7
x1[-1]
Output: 9
x1[-2]
Output: 7
To access items in a multidimensional array, use a comma-separated tuple of indices.
Syntax
x2
Output: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
x2[0, 0]
Output: 3
x2[2, 0]
Output: 1
x2[2, -1]
Output: 7
Modifying values
Syntax
x2[0, 0] = 12
x2
Output: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
Unlike Python lists, NumPy arrays have a fixed type, so a float assigned into an integer array is silently truncated:
x1[0] = 3.14159 # this will be truncated!
x1
Output: array([3, 0, 3, 3, 7, 9])
To access subarrays, use the slice notation, marked by the colon (:) character.
Syntax
x[start:stop:step]
One-dimensional subarrays
x = np.arange(10)
x
Output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
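The x[start:stop:step] pattern can be exercised on this array; a small sketch of the common cases:

```python
import numpy as np

x = np.arange(10)          # [0 1 2 3 4 5 6 7 8 9]
first_five = x[:5]         # elements before index 5
after_five = x[5:]         # elements from index 5 onward
middle     = x[4:7]        # subarray from index 4 up to (not including) 7
evens      = x[::2]        # every other element
reversed_x = x[::-1]       # all elements, reversed (negative step)
print(first_five, after_five, middle, evens, reversed_x)
```

A negative step swaps the roles of start and stop, which is why x[::-1] reverses the array.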
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas.
Example:
Syntax
x2
Output: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
x2[:2, :3] # two rows, three columns
Output: array([[12, 5, 2],
[ 7, 6, 8]])
x2[:3, ::2] # all rows, every other column
Output: array([[12, 2],
[ 7, 8],
[ 1, 7]])
x2[::-1, ::-1] # subarray dimensions reversed together
Output: array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])
print(x2[:, 0]) # first column of x2
Output: [12 7 1]
print(x2[0, :]) # first row of x2
Output: [12 5 2 4]
In case of row access, the empty slice can be omitted for a more compact syntax:
print(x2[0]) # equivalent to x2[0, :]
Output: [12 5 2 4]
One important thing to know about array slices is that they return views rather than copies of the array data.
Syntax
print(x2)
Output: [[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
x2_sub = x2[:2, :2] # extract a 2x2 subarray (a view)
x2_sub[0, 0] = 99
print(x2) # modifying the view changes the original array
Output: [[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Creating copies of arrays
Syntax
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
Output: [[99 5]
[ 7 6]]
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
Output: [[42 5]
[ 7 6]]
print(x2) # the original array is not touched
Output: [[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
4. Reshaping of Arrays
Syntax
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
Output: [[1 2 3]
[4 5 6]
[7 8 9]]
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix:
x = np.array([1, 2, 3])
x.reshape((3, 1)) # column vector via reshape
Output: array([[1],
[2],
[3]])
To combine multiple arrays into one, and to conversely split a single array into multiple
arrays
Concatenation of arrays
Syntax
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Output: array([1, 2, 3, 3, 2, 1])
z = np.array([99, 99, 99])
print(np.concatenate([x, y, z]))
Output: [ 1 2 3 3 2 1 99 99 99]
np.concatenate can also be used for two-dimensional arrays:
grid = np.array([[1, 2, 3],
[4, 5, 6]])
np.concatenate([grid, grid]) # concatenate along the first axis
Output: array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])
np.concatenate([grid, grid], axis=1) # concatenate along the second axis
Output: array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])
For arrays of mixed dimensions, np.vstack and np.hstack are clearer:
grid = np.array([[9, 8, 7],
[6, 5, 4]])
np.vstack([x, grid]) # vertically stack the arrays
Output: array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
y = np.array([[99],
[99]])
np.hstack([grid, y]) # horizontally stack the arrays
Output: array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])
Splitting of arrays
Syntax
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
Output: [1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related functions np.vsplit and np.hsplit split along rows and columns:
grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
Output: [[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
left, right = np.hsplit(grid, [2])
print(left)
print(right)
Output: [[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
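Besides a list of split points, np.split and its relatives also accept an integer count of equal-size sections; a short sketch:

```python
import numpy as np

x = np.arange(12)
a, b, c = np.split(x, 3)       # three equal pieces of length 4
print(a, b, c)

grid = np.arange(16).reshape((4, 4))
quads = np.hsplit(grid, 2)     # two (4, 2) column halves
print(quads[0].shape, quads[1].shape)
```

With an integer argument the array length must divide evenly, otherwise NumPy raises an error.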
Computation on NumPy Arrays: Universal Functions
Introducing UFuncs
NumPy provides a convenient interface into just this kind of statically typed, compiled routine. This is
known as a vectorized operation.
Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on values in NumPy arrays. Ufuncs are extremely flexible: above we saw an operation between a scalar and an array, but we can also operate between two arrays.
Ufuncs exist in two flavors: unary ufuncs, which operate on a single input, and binary ufuncs, which
operate on two inputs. We’ll see examples of both these types of functions here.
Array arithmetic
NumPy’s ufuncs make use of Python’s native arithmetic operators. The standard addition, subtraction,
multiplication, and division can all be used.
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
Output:
x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
Operator   Equivalent ufunc   Description
+          np.add             Addition (e.g., 1 + 1 = 2)
-          np.subtract        Subtraction (e.g., 3 - 2 = 1)
-          np.negative        Unary negation (e.g., -2)
*          np.multiply        Multiplication (e.g., 2 * 3 = 6)
/          np.divide          Division (e.g., 3 / 2 = 1.5)
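Each operator row above maps to a ufunc call, and the two spellings give identical results; a small sketch verifying the equivalence:

```python
import numpy as np

x = np.arange(4)
# operator syntax and the explicit ufunc are interchangeable
assert np.array_equal(x + 5, np.add(x, 5))
assert np.array_equal(x - 5, np.subtract(x, 5))
assert np.array_equal(-x,    np.negative(x))
assert np.array_equal(x * 2, np.multiply(x, 2))
assert np.array_equal(x / 2, np.divide(x, 2))
print("all operator/ufunc pairs agree")
```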
Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also understands Python’s built-in
absolute value function.
np.abs()
np.absolute()
x = np.array([-2, -1, 0, 1, 2])
abs(x)
Output: array([2, 1, 0, 1, 2])
The corresponding NumPy ufunc is np.absolute, which is also available under the alias np.abs:
np.absolute(x)
Output: array([2, 1, 0, 1, 2])
np.abs(x)
Output: array([2, 1, 0, 1, 2])
Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist
are the trigonometric functions.
np.sin()
np.cos()
np.tan()
inverse trigonometric functions
np.arcsin()
np.arccos()
np.arctan()
Exponential functions
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
The inverse of the exponentials, the logarithms, are also available. The basic np.log gives the natural logarithm; the base-2 and base-10 logarithms are available as well.
np.log(x) - is a mathematical function that helps user to calculate Natural logarithm of x where x belongs
to all the input array elements
np.log2(x) - to calculate Base-2 logarithm of x
np.log10(x) - to calculate Base-10 logarithm of x
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
Specialized ufuncs
NumPy has many more ufuncs available like
Hyperbolic trig functions,
Bitwise arithmetic,
Comparison operators,
Conversions from radians to degrees,
Rounding and remainders, and much more
More specialized and obscure ufuncs live in the submodule scipy.special. If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special.
One example is the gamma function (scipy.special.gamma), a generalization of the factorial.
Advanced Ufunc Features
Specifying output
Rather than creating a temporary array, you can use this to write computation results directly to the
memory location where you’d like them to be. For all ufuncs, you can do this using the out argument of the
function.
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
Output: [ 0. 10. 20. 30. 40.]
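The out argument also works with array views, which lets a ufunc write directly into, say, every other element of an output buffer without allocating a temporary array; a sketch:

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)
# write 2**x into the even slots of y; y[::2] is a view, so no temporary is made
np.power(2, x, out=y[::2])
print(y)  # [ 1.  0.  2.  0.  4.  0.  8.  0. 16.  0.]
```

Writing y[::2] = 2 ** x instead would first build a temporary array for 2 ** x and then copy it, which matters for very large arrays.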
Aggregates functions
a. Summing the Values in an Array
Consider computing the sum of all values in an array, Python can do this using the built-
in sum function:
Syntax
import numpy as np
L = np.random.random(100)
np.sum(L)
Output: 55.612091166049424
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
min(big_array), max(big_array)
np.min(big_array), np.max(big_array)
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column of a multidimensional array.
Syntax
M = np.random.random((3, 4))
M.sum()
Output: 6.0850555667307118
Aggregation functions take an additional argument specifying the axis along which the
aggregate is computed.
find the minimum value within each column by specifying axis=0
M.min(axis=0)
The axis keyword specifies the dimension of the array that will be collapsed, rather
than the dimension that will be returned.
So specifying axis=0 means that the first axis will be collapsed:
For two-dimensional arrays, this means that values within each column will be
aggregated.
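The axis behavior is easiest to see with a fixed matrix, so the collapsed dimension is unambiguous; a sketch:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
col_min = M.min(axis=0)        # collapse the rows: one value per column
row_sum = M.sum(axis=1)        # collapse the columns: one value per row
print(col_min)                 # [1 2 3]
print(row_sum)                 # [ 6 15]
print(col_min.shape, row_sum.shape)
```

The axis keyword names the dimension that disappears: axis=0 collapses shape (2, 3) to (3,), axis=1 collapses it to (2,).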
Aggregates available in NumPy can be extremely useful for summarizing a set of values.
Data is available in the file president_heights.csv
order  name                    height (cm)
1      George Washington       189
2      John Adams              170
3      Thomas Jefferson        189
4      James Madison           163
5      James Monroe            183
6      John Quincy Adams       171
7      Andrew Jackson          185
8      Martin Van Buren        168
9      William Henry Harrison  173
10     John Tyler              183
11     James K. Polk           173
12     Zachary Taylor          173
13     Millard Fillmore        175
14     Franklin Pierce         178
15     James Buchanan          183
16     Abraham Lincoln         193
17     Andrew Johnson          178
18     Ulysses S. Grant        173
19     Rutherford B. Hayes     174
20     James A. Garfield       183
21     Chester A. Arthur       183
23     Benjamin Harrison       168
25     William McKinley        170
26     Theodore Roosevelt      178
27     William Howard Taft     182
28     Woodrow Wilson          180
29     Warren G. Harding       183
30     Calvin Coolidge         178
31     Herbert Hoover          182
32     Franklin D. Roosevelt   188
33     Harry S. Truman         175
34     Dwight D. Eisenhower    179
35     John F. Kennedy         183
36     Lyndon B. Johnson       193
37     Richard Nixon           182
38     Gerald Ford             183
39     Jimmy Carter            177
40     Ronald Reagan           185
41     George H. W. Bush       188
42     Bill Clinton            188
43     George W. Bush          182
44     Barack Obama            185
!head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
Program
import numpy as np
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
Output
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
 177 185 188 188 182 185]
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())
Output:
Mean height:        179.738095238
Standard deviation: 6.93184344275
Minimum height:     163
Maximum height:     193
To compute quantiles:
print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))
Output:
25th percentile:    174.25
Median:             182.0
75th percentile:    183.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
a. Introducing Broadcasting
For arrays of the same size, binary operations are performed on an element-by- element
basis:
Program
import numpy as np
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
Output: array([5, 6, 7])
a+5
Output: array([5, 6, 7])
This operation stretches or duplicates the value 5 into the array [5, 5, 5], and adds the results.
The advantage of NumPy's broadcasting is that this duplication of values does not actually take place.
When we add a one-dimensional array to a two-dimensional array
M = np.ones((3, 3))
M
Output: array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
M + a
Output: array([[ 1., 2., 3.],
[ 1., 2., 3.],
[ 1., 2., 3.]])
Here the one-dimensional array a is stretched, or broadcast, across the second dimension
in order to match the shape of M.
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]
print(a)
print(b)
Output:
[0 1 2]
[[0]
[1]
[2]]
a + b
Output: array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Whereas before we stretched or broadcasted one value to match the shape of the other, here we have stretched both a and b to match a common shape, and the result is a two-dimensional array.
c. Rules of Broadcasting
Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with
fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with
shape equal to 1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
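The three rules can be applied mechanically to any pair of shapes. The helper below is a hypothetical illustration (not a NumPy API) that mirrors the rules, checked against NumPy itself:

```python
import numpy as np

def broadcast_shape(s1, s2):
    """Apply the three broadcasting rules to two shapes."""
    # Rule 1: pad the shorter shape with ones on its leading (left) side
    n = max(len(s1), len(s2))
    s1 = (1,) * (n - len(s1)) + tuple(s1)
    s2 = (1,) * (n - len(s2)) + tuple(s2)
    out = []
    for a, b in zip(s1, s2):
        if a == b or a == 1 or b == 1:
            out.append(max(a, b))  # Rule 2: stretch the size-1 dimension
        else:
            raise ValueError("shapes not compatible")  # Rule 3
    return tuple(out)

print(broadcast_shape((2, 3), (3,)))  # (2, 3)
print(broadcast_shape((3, 1), (3,)))  # (3, 3)
# NumPy agrees on the second case:
print((np.ones((3, 1)) + np.arange(3)).shape)  # (3, 3)
```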
Broadcasting example 1
Let’s look at adding a two-dimensional array to a one-dimensional array:
M = np.ones((2, 3))
a = np.arange(3)
Let’s consider an operation on these two arrays. The shapes of the arrays are:
M.shape = (2, 3)
a.shape = (3,)
We see by rule 1 that the array a has fewer dimensions, so we pad it on the left with ones:
M.shape -> (2, 3)
a.shape -> (1, 3)
By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:
M.shape -> (2, 3)
a.shape -> (2, 3)
The shapes match, and we see that the final shape will be (2, 3):
M + a
Output: array([[ 1., 2., 3.],
[ 1., 2., 3.]])
Broadcasting example 2
Let’s take a look at an example where both arrays need to be broadcast:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
Again, we start by writing out the shapes of the arrays:
a.shape = (3, 1)
b.shape = (3,)
Rule 1 says we must pad the shape of b with ones:
a.shape -> (3, 1)
b.shape -> (1, 3)
And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:
a.shape -> (3, 3)
b.shape -> (3, 3)
Because the result matches, these shapes are compatible. We can see this here:
a + b
Output: array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Comparisons, Masks, and Boolean Logic
Masking comes up when you want to extract, modify, count, or otherwise
manipulate values in an array based on some criterion.
Comparison Operators as ufuncs
NumPy also implements comparison operators such as < (less than) and >
(greater than) as element-wise ufuncs.
All six of the standard comparison operations are available:
Syntax
x = np.array([1, 2, 3, 4, 5])
x < 3 # less than
Output: array([ True, True, False, False, False], dtype=bool)
x > 3 # greater than
Output: array([False, False, False, True, True], dtype=bool)
(2 * x) == (x ** 2)
Output: array([False, True, False, False, False], dtype=bool)
The result of a comparison is always an array with Boolean data type; for example, x < 3 applies np.less element-wise.
Two-dimensional example
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x)
Output: [[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
x < 6
Output: array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
Counting entries
np.count_nonzero(x < 6)
Output: 8
Or
np.sum(x < 6)
Output: 8
np.sum(x < 6, axis=1)
Output: array([4, 2, 2])
This counts the number of values less than 6 in each row of the matrix.
np.any(x > 8)
Output: True
np.any(x < 0)
Output: False
np.all(x < 10)
Output: True
np.all(x == 6)
Output: False
Boolean operators
Python's bitwise logic operators &, |, ^, and ~ work element-wise on Boolean arrays, so compound conditions can be built up:
np.sum((inches > 0.5) & (inches < 1))
Output: 29
Program
print("Rainy days with < 0.2 inches :", np.sum((inches > 0) & (inches < 0.2)))
x
Output: array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 5
Output: array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]], dtype=bool)
To select the values from the array, we can simply index on this Boolean array; this is known as a masking operation:
x[x < 5]
Output: array([0, 3, 3, 3, 2, 4])
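A deterministic masking sketch, combining a comparison, Boolean selection, and mask-driven assignment on the same matrix:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
mask = x < 5
print(x[mask])    # 1-D array of values passing the criterion: [0 3 3 3 2 4]
x[x >= 7] = 0     # masks can also drive in-place assignment
print(x)
```

The selected values come back in row-major order, flattened into one dimension regardless of the input shape.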
A series of data that represents the amount of precipitation each day for a year in a given city is shown in fig 4.3.
Program
import numpy as np
import pandas as pd
# use Pandas to extract rainfall as a NumPy array
rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = rainfall / 254 # 1/10mm -> inches
inches.shape
Output: (365,)
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(inches, 40);
Construct masks of rainy days and summer days, then combine them with the data:
rainy = (inches > 0)
days = np.arange(365)
summer = (days > 172) & (days < 262) # June 21st is the 172nd day
print("Median precip on rainy days in 2014 (inches):", np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):", np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches):", np.max(inches[summer]))
FANCY INDEXING
Fancy indexing is like simple indexing, but we pass arrays of indices in place of single scalars. This allows us to very quickly access and modify complicated subsets of an array's values.
We have already seen how to access and modify portions of arrays using simple indices (e.g., arr[0]), slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]).
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
Output: [51 92 14 71 60 20 82 86 74 74]
Array of indices
We can pass a single list or array of indices to obtain the same result:
ind = [3, 7, 4]
x[ind]
Output: array([71, 86, 60])
In multi dimensional
Fancy indexing also works in multiple dimensions. Consider the following array.
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Standard indexing
Like with standard indexing, the first index refers to the row, and the second to the column.
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
Output: array([ 2, 5, 11])
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes
we’ve seen.
Example array
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Combine fancy and simple indices:
X[2, [2, 0, 1]]
Output: array([10, 8, 9])
Combine fancy indexing with slicing:
X[1:, [2, 0, 1]]
Output: array([[ 6, 4, 5],
[10, 8, 9]])
Combine fancy indexing with masking
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
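A sketch of row selection and broadcast indexing on the same example array, a common pattern for sampling a dataset:

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
rows = np.array([2, 0])
print(X[rows])    # whole rows, in the requested order

# index arrays broadcast like arithmetic operands:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
print(X[row[:, np.newaxis], col])  # (3, 1) x (3,) index shapes -> (3, 3) result
```

The result shape follows the broadcast of the index arrays, not the shape of X.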
To access three different elements, we can also pass a two-dimensional array of indices:
ind = np.array([[3, 7],
[4, 5]])
x[ind]
Output: array([[71, 86],
[60, 20]])
The shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed.
Multiple dimensions
X = np.arange(12).reshape((3, 4))
X
Output: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
With standard indexing, the first index refers to the row, and the second to the column:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
Output: array([ 2, 5, 11])
The first value in the result is X[0, 2], the second is X[1, 1], and the third is X[2, 3].
Pairing a column vector of row indices with a row vector of column indices broadcasts them:
X[row[:, np.newaxis], col]
Output: array([[ 2, 1, 3],
[ 6, 5, 7],
[10, 9, 11]])
Program
row[:, np.newaxis] * col
Output: array([[0, 0, 0],
[2, 1, 3],
[4, 2, 6]])
With fancy indexing the return value reflects the broadcasted shape of the indices, rather than the shape of the array being indexed.
One common use of fancy indexing is the selection of subsets of rows from a
matrix.
For example, we might have an N by D matrix representing N points in D
dimensions, such as the following points drawn from a two-dimensional
normal distribution.
Program
mean = [0, 0]
cov = [[1, 2],
[2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape
Output: (100, 2)
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1]);
Syntax
indices = np.random.choice(X.shape[0], 20, replace=False)
indices
Output: array([93, 45, 73, 81, 50, 10, 98, 94, 4, 64, 65, 89, 47, 84, 82, ...])
Syntax
selection = X[indices] # fancy indexing here
selection.shape
Output: (20, 2)
Example
Imagine we have an array of indices and like to set the corresponding items
in an array to some value:
Program
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
Output: [ 0 99 99 3 99 5 6 7 99 9]
Any assignment-type operator can be used for this, for example:
x[i] -= 10
print(x)
Output: [ 0 89 89 3 89 5 6 7 89 9]
Repeated indices with these operations can cause some potentially unexpected results
Program
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)
Output: [ 6. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Where did the 4 go? The assignment x[0] = 4 happens first, then x[0] = 6, so the result is that x[0] contains the value 6.
Program
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x
Output: array([ 6., 0., 1., 1., 1., 0., 0., 0., 0., 0.])
You might expect that x[3] would contain the value 2 and x[4] the value 3, as this is how many times each index is repeated. But x[i] += 1 is shorthand for x[i] = x[i] + 1: the expression x[i] + 1 is evaluated once, and the result is assigned to the indices in x, so the increment is not repeated.
Program
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
Output: [ 0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]
np.random.seed(42)
x = np.random.randn(100)
# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)
i = np.searchsorted(bins, x) # find the appropriate bin for each x
np.add.at(counts, i, 1)      # add 1 to each of these bins
plt.plot(bins, counts);
SORTING ARRAYS
Sorting in NumPy: np.sort and np.argsort
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
Output: array([1, 2, 3, 4, 5])
i = np.argsort(x)
print(i)
Output: [1 0 3 2 4]
Sorting along rows or columns
A useful feature of NumPy’s sorting algorithms is the ability to sort along specific rows or columns of a
multidimensional array using the axis argument. For example
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
[2 6 7 4 3 7]
[7 2 5 4 1 7]
[5 1 4 0 9 5]]
np.sort(X, axis=0)
Output: array([[2, 1, 4, 0, 1, 5],
[5, 2, 5, 4, 3, 7],
[6, 3, 7, 4, 6, 7],
[7, 6, 7, 4, 9, 9]])
np.sort(X, axis=1)
Output: array([[3, 4, 6, 6, 7, 9],
[2, 3, 4, 6, 7, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 5, 9]])
Partial sorts: partitioning
np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition point, and the remaining values to the right:
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
Output: array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.
We can also partition along an arbitrary axis of a multidimensional array:
np.partition(X, 2, axis=1)
Output: array([[3, 4, 6, 7, 6, 9],
[2, 3, 4, 7, 6, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 9, 5]])
STRUCTURED ARRAYS
Structured Data: NumPy’s Structured Arrays
Imagine we have several categories of data on a number of people (name, age, and weight), and we want to store these values for use in a Python program. It would be possible to store these in three separate lists:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
A simple array holds a single type, e.g. x = np.zeros(4, dtype=int). A compound data type lets one array hold all three fields:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
Output: [('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
Create an empty container array, we can fill the array with our lists of values:
Syntax
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
Output: [('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)
('Doug', 19, 61.5)]
Get all names:
data['name']
Output: array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
Get the first row of data:
data[0]
Output: ('Alice', 25, 55.0)
Get the name from the last row:
data[-1]['name']
Output: 'Doug'
Numerical types
Program
np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
Compound type specified as a comma-separated string:
np.dtype('S10,i4,f8')
The first (optional) character is < or >, which means “little endian” or “big
endian,” respectively, and specifies the ordering convention for significant
bits.
The next character specifies the type of data: characters, bytes, ints,
floating points, and so on.
The last character or characters represents the size of the object in bytes.
NumPy also provides the np.recarray class, which is almost identical to the
structured arrays, but with one additional feature: fields can be accessed as
attributes rather than as dictionary keys.
data['age']
Output: array([25, 45, 37, 19], dtype=int32)
Record array
data_rec = data.view(np.recarray)
data_rec.age
Output: array([25, 45, 37, 19], dtype=int32)
With Boolean masking, we can get the names of people under 30:
data[data['age'] < 30]['name']
Output: array(['Alice', 'Doug'], dtype='<U10')
Dictionary method
np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
List of tuples
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
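The pieces above can be pulled together into one runnable sketch, using the example names and values from this section:

```python
import numpy as np

# compound dtype: a 10-char unicode name, a 4-byte int, an 8-byte float
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

print(data['name'])                    # one field, all records
print(data[0])                         # one record, all fields
print(data[data['age'] < 30]['name'])  # masking works on fields too
```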
DATA MANIPULATION WITH PANDAS
import numpy as np
import pandas as pd
Syntax
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Output: 0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
Syntax
data[1:3]
Output: 1    0.50
2    0.75
dtype: float64
NumPy array has an implicitly defined integer index used to access the values
Pandas Series has an explicitly defined index associated with the values
This explicit index definition gives the Series object additional capabilities
Program
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Output: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
Output: 0.5
The index need not be sequential; we can even use non-contiguous integers:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
Output: 2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
data[5]
Output: 0.5
Program
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
Output: California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64
By default, a Series will be created where the index is drawn from the sorted keys. Dictionary-style item access can then be performed:
population['California']
Output: 38332521
Example
Construct a new Series listing the area of each of the five states
Program
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
Output: California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64
Use a dictionary to construct a single two-dimensional object containing this
information
states = pd.DataFrame({'population': population,'area': area})
states
Output: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
DataFrame has an index attribute that gives access to the index labels:
states.index
Output:
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
the DataFrame has a columns attribute, which is an Index object holding
the column labels
states.columns
A DataFrame can also be constructed from a list of dictionaries:
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Output:    a  b
0  0  0
1  1  2
2  2  4
Even if some keys in the dictionary are missing, Pandas will fill them in with
NaN (i.e.,“not a number”) values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
Output: a b c
0 1.0 2 NaN
1 NaN 3 4.0
A Series object acts in many ways like a one dimensional NumPy array, and in many ways like a standard
Python dictionary. It will help us to understand the patterns of data indexing and selection in these arrays.
Series as dictionary
Series as one-dimensional array
Indexers: loc, iloc, and ix
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of
values.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
Output: 0.5
Examine the keys/indices and values
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values
i. 'a' in data
True
ii. data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
iii.list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
data['e'] = 1.25
data
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
A Series also supports array-style slicing. Slicing by explicit index:
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64
Slicing by implicit integer index:
data[0:2]
a    0.25
b    0.50
dtype: float64
Masking
data[(data > 0.3) & (data < 0.8)]
b    0.50
c    0.75
dtype: float64
Fancy indexing
data[['a', 'e']]
a 0.25
e 1.25
dtype: float64
Consider a Series with an explicit integer index:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
loc - the loc attribute allows indexing and slicing that always references the explicit index.
data.loc[1]
'a'
data.loc[1:3]
1    a
3    b
dtype: object
iloc - The iloc attribute allows indexing and slicing that always references the implicit Python-style index.
data.iloc[1]
'b'
data.iloc[1:3]
3    b
5    c
dtype: object
ix - ix is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. (Note that ix has been deprecated and removed in recent versions of Pandas; prefer loc and iloc.)
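The explicit/implicit distinction is easiest to see with an integer index, where plain [] would be ambiguous; a sketch:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(data.loc[1])     # explicit label 1  -> 'a'
print(data.iloc[1])    # implicit position 1 -> 'b'
print(data.loc[1:3])   # label slice: inclusive of label 3
print(data.iloc[1:3])  # position slice: excludes position 3
```

Note the slicing conventions differ too: loc slices include the stop label, while iloc follows Python's half-open convention.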
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing
of the column name.
First construct the example DataFrame of marks used in this section (values taken from the tables below):
result = pd.DataFrame({'DS': {'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89},
'FDS': {'sai': 91, 'ram': 95, 'kasim': 89, 'tamil': 90}})
result['DS']
sai      90
ram      85
kasim    92
tamil    89
Name: DS, dtype: int64
Equivalently, attribute-style access works for string column names:
result.DS
sai      90
ram      85
kasim    92
tamil    89
Name: DS, dtype: int64
result.DS is result['DS']
Output: True
Modify the object
Like with the Series objects this dictionary-style syntax can also be used to modify the object, in this case to add
a new column:
result['TOTAL'] = result['DS'] + result['FDS']
result
       DS  FDS  TOTAL
sai    90   91    181
ram    85   95    180
kasim  92   89    181
tamil  89   90    179
Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the
underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame
index and column labels are maintained in the result
loc
result.loc[:'ram', :'FDS']
       DS  FDS
sai    90   91
ram    85   95
iloc
result.iloc[:2, :2]
       DS  FDS
sai    90   91
ram    85   95
ix
result.ix[:2, :'FDS']
       DS  FDS
sai    90   91
ram    85   95
Any of the familiar NumPy-style access patterns can be used within these indexers; for example, in the loc indexer we can combine masking and fancy indexing:
result.loc[result.DS > 89, ['DS', 'FDS']]
       DS  FDS
sai    90   91
kasim  92   89
Modifying values
Indexing conventions may also be used to set or modify values; this is done in the standard way that
you might be accustomed to from working with NumPy.
result.iloc[1, 1] = 70
result
       DS  FDS  TOTAL
sai    90   91    181
ram    85   70    180
kasim  92   89    181
tamil  89   90    179
Additional indexing conventions
Slicing row wise
result['sai':'kasim']
       DS  FDS  TOTAL
sai    90   91    181
ram    85   70    180
kasim  92   89    181
Such slices can also refer to rows by number rather than by index:
result[1:3]
       DS  FDS  TOTAL
ram    85   70    180
kasim  92   89    181
Similarly, direct masking operations are interpreted row-wise rather than column-wise:
result[result.DS > 89]
       DS  FDS  TOTAL
sai    90   91    181
kasim  92   89    181
Index Preservation
Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.
We can use all arithmetic and special universal functions as in NumPy on pandas. In outputs the index will
preserved (maintained) as shown below.
For a Series:
x = pd.Series([1, 2, 3, 4])
x
0    1
1    2
2    3
3    4
dtype: int64
[Type here] [Type here] [Type here]
For DataFrame
df=pd.DataFrame(np.random.randint(0,10,(3,4)),
columns=['a','b','c','d'])
df
   a  b  c  d
0  1  4  1  4
1  8  4  0  4
2  7  7  7  2
Applying a NumPy ufunc preserves the index; for example, with ser = pd.Series([9, 4, 6, 3]):
np.exp(ser)
0    8103.083928
1      54.598150
2     403.428793
3      20.085537
dtype: float64
Index Alignment
Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data, as we will see in some of the following examples.
For example, suppose x = pd.Series([2, 4], index=[1, 3]) and y = pd.Series([1, 3, 5, 7, 6], index=[1, 2, 3, 4, 5]) (values chosen to match the outputs shown):
x + y
1    3.0
2    NaN
3    9.0
4    NaN
5    NaN
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we could determine using
standard Python set arithmetic on these indices.
Any item for which one or the other does not have an entry is marked with NaN, or “Not a Number,” which is
how Pandas marks as missing data.
x.add(y,fill_value=0)
1 3.0
2 3.0
3 9.0
4 7.0
5 6.0
dtype: float64
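Index alignment also drives real computations; a self-contained sketch dividing population by area (figures are illustrative), where states present on only one side come back as NaN:

```python
import pandas as pd

# hypothetical figures, for illustration only
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
density = population / area
print(density)  # Alaska and New York are NaN: each appears on only one side
```

The result's index is the union of the two inputs' indices, exactly as with the Series addition above.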
The same alignment happens for DataFrames, on both columns and indices:
A = pd.DataFrame([[1, 11], [5, 1]], columns=list('AB'))
A
   A   B
0  1  11
1  5   1
B = pd.DataFrame([[4, 0, 9], [5, 8, 0], [9, 2, 6]], columns=list('BAC'))
B
   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6
A + B
      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result
are sorted. As was the case with Series, we can use the associated object’s arithmetic method and pass any
desired fill_value to be used in place of missing entries. Here we’ll fill with the mean of all values in A.
fill = A.stack().mean()
A.add(B, fill_value=fill)
A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Operations between a DataFrame and a Series follow the same broadcasting conventions as between a two-dimensional and a one-dimensional NumPy array. Consider the difference of an array and one of its rows:
A = np.array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
A - A[0]
Output: array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
HANDLING MISSING DATA
In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.
The way in which Pandas handles missing values is constrained by its NumPy package, which does not have a
built-in notion of NA values for non floating- point data types.
NumPy supports fourteen basic integer types once you account for available precisions, signedness, and
endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an
unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new
fork of the NumPy package.
Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.
vals1 = np.array([1, None, 3, 4])
vals1
Output: array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
NaN: missing numerical data
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Output: dtype('float64')
You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN.
1 + np.nan
nan
0 * np.nan
nan
x = pd.Series(range(2), dtype=int)
x
0    0
1    1
dtype: int64
dtype: int64
x[0] = None
x
0 NaN
1 1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a
NaN value.
Detecting null values: isnull()
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
0    False
1     True
2    False
3     True
dtype: bool
notnull()
data.notnull()
0 True
1 False
2 True
3 False
dtype: bool
dropna()
data.dropna()
0        1
2    hello
dtype: object
For a DataFrame, consider:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
df.dropna()
     0    1  2
1  2.0  3.0  5
df.dropna(axis='columns')
   2
0  2
1  5
2  6
Rows or columns having all null values
You can also specify how='all', which will only drop rows/columns that are all null values.
df[3] = np.nan
df
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
df.dropna(axis='columns', how='all')
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
The thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
     0    1  2   3
1  2.0  3.0  5 NaN
Filling null values: fillna()
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
We can fill NA entries with a single value, such as zero:
data.fillna(0)
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
We can specify a forward fill to propagate the previous value forward:
data.fillna(method='ffill')
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
Or we can specify a back fill to propagate the next value backward:
data.fillna(method='bfill')
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
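The three fill strategies and dropna can be compared side by side on the example Series; a sketch using the ffill()/bfill() method spellings, which are equivalent to method='ffill'/'bfill':

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data.fillna(0).tolist())  # constant fill:  [1.0, 0.0, 2.0, 0.0, 3.0]
print(data.ffill().tolist())    # forward fill:   [1.0, 1.0, 2.0, 2.0, 3.0]
print(data.bfill().tolist())    # back fill:      [1.0, 2.0, 2.0, 3.0, 3.0]
print(data.dropna().tolist())   # drop instead:   [1.0, 2.0, 3.0]
```

Note that a forward fill leaves a leading NaN unfilled, and a back fill leaves a trailing NaN unfilled, since there is no neighboring value to propagate.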
Hierarchical Indexing
Hierarchical indexing (also known as multi-indexing) to incorporate multiple
index levels within a single index.
In this way, higher dimensional data can be compactly represented within
the familiar one-dimensional Series and two-dimensional DataFrame
objects.
Syntax
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(index)
index
The MultiIndex contains multiple levels of indexing - in this case, the state
names and the years, as well as multiple labels for each data point which
encode these levels.
Syntax
pop = pop.reindex(index)
pop
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
The first two columns of the Series representation show the multiple index values, while the third column shows the data.
To access all data for which the second index is 2010, we can simply use
the Pandas slicing notation:
pop[:, 2010]
California    37253956
New York      19378102
Texas         25145561
dtype: int64
Example:
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
Output:
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
For more flexibility in how the index is constructed, you can instead use
the class method constructors available in the pd.MultiIndex.
For example, as we did before, you can construct the MultiIndex
from a simple list of arrays, giving the index values within each
level:
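As a sketch of those constructors, the following three calls build the same two-level index (the 'a'/'b' values are illustrative):

```python
import pandas as pd

# three equivalent ways to build the same two-level index
i1 = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
i2 = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
i3 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# all three describe the same index
print(i1.equals(i2) and i2.equals(i3))
```

from_product is the most compact when the index is a full Cartesian product of the levels; from_tuples is the most flexible.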
Syntax
pop.index.names = ['state', 'year']
pop
Output:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In a DataFrame, the rows and columns are completely symmetric, and just as
the rows can have multiple levels of indices, the columns can have multiple
levels as well.
Program
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some medical data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Syntax
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
Construct the MultiIndex directly using its internal encoding by passing levels and labels:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Any of these objects can be passed as the index argument when creating a Series or DataFrame:
pop = pd.Series(populations, index=index)
pop
Output:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
pop['California', 2000]
Output: 33871648
pop['California']
Output: year
2000 33871648
2010 37253956
dtype: int64
pop.loc['California':'New York']
Output:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64
pop[:, 2000]
Output: state
California 33871648
New York 18976457
Texas 20851820
dtype: int64
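A multiply indexed Series can also be converted to a conventional DataFrame and back with unstack() and stack(). A self-contained sketch reusing the state/year populations above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

df = pop.unstack()   # years become the columns of a 2D DataFrame
print(df)
back = df.stack()    # stack() reverses the operation
print((back == pop).all())
```

unstack() moves the innermost index level into the columns; stack() moves the columns back into the index.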
Example
health_data
data_mean = health_data.mean(level='year')
data_mean
This averages the measurements in the two visits each year. We can also use the axis keyword to take the mean among levels on the columns:
data_mean.mean(axis=1, level='type')
Output:
type         HR       Temp
year
2013  36.833333  37.000000
COMBINING DATASETS
CONCAT AND APPEND
Program
import pandas as pd
import numpy as np
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
Output:
    A   B   C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
Recall concatenation of NumPy arrays via np.concatenate:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
Output: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
It can also be used on two-dimensional arrays along a chosen axis:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
Output: array([[1, 2, 1, 2],
               [3, 4, 3, 4]])
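For arrays of mixed dimensions, it can be clearer to use the np.vstack and np.hstack convenience functions instead of an axis argument; a short sketch:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

# vstack stacks vertically (row-wise), hstack horizontally (column-wise)
v = np.vstack([x, x])   # shape (4, 2)
h = np.hstack([x, x])   # shape (2, 4)
print(v.shape, h.shape)
```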
Program
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
Output:
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
It also works to concatenate higher-dimensional objects, such as DataFrames:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))
Output:
df1           df2           pd.concat([df1, df2])
    A   B         A   B         A   B
1  A1  B1     3  A3  B3     1  A1  B1
2  A2  B2     4  A4  B4     2  A2  B2
                            3  A3  B3
                            4  A4  B4
a. Duplicate indices
Program
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
print(x); print(y); print(pd.concat([x, y]))
Output:
x            y            pd.concat([x, y])
    A   B        A   B        A   B
0  A0  B0    0  A2  B2    0  A0  B0
1  A1  B1    1  A3  B3    1  A1  B1
                          0  A2  B2
                          1  A3  B3
Notice the repeated 0 and 1 labels in the result: pd.concat preserves indices, even if the result will have duplicate indices.
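When the repeated indices are unwanted, pd.concat offers two remedies, sketched below with frames mirroring the x/y example: verify_integrity raises an error on duplicates, and ignore_index builds a fresh integer index.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})  # same 0, 1 index as x

# 1) catch the duplication as an error
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

# 2) discard the old indices and create a fresh integer index 0..3
print(pd.concat([x, y], ignore_index=True))
```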
Use the keys option to specify a label for the data sources; the result will be a hierarchically indexed object containing the data.
Program
print(x); print(y);
print(pd.concat([x, y], keys=['x', 'y']))
Output:
x            y            pd.concat([x, y], keys=['x', 'y'])
    A   B        A   B          A   B
0  A0  B0    0  A2  B2    x 0  A0  B0
1  A1  B1    1  A3  B3      1  A1  B1
                          y 0  A2  B2
                            1  A3  B3
b. Concatenation with joins
Data from different sources might have different sets of column names. Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:
Program
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6);
print(pd.concat([df5, df6]))
Output:
df5            df6            pd.concat([df5, df6])
    A   B   C      B   C   D        A   B   C    D
1  A1  B1  C1   3  B3  C3  D3   1   A1  B1  C1  NaN
2  A2  B2  C2   4  B4  C4  D4   2   A2  B2  C2  NaN
                                3  NaN  B3  C3   D3
                                4  NaN  B4  C4   D4
By default, the entries for which no data is available are filled with NA values.
To change this, we can specify one of several options for the join and
join_axes parameters of the concatenate function.
By default, the join is a union of the input columns (join='outer'), but we
can change this to an intersection of the columns using join='inner':
Program
print(df5);
print(df6);
print(pd.concat([df5, df6], join='inner'))
Output:
df5            df6            pd.concat([df5, df6], join='inner')
    A   B   C      B   C   D       B   C
1  A1  B1  C1   3  B3  C3  D3   1  B1  C1
2  A2  B2  C2   4  B4  C4  D4   2  B2  C2
                                3  B3  C3
                                4  B4  C4
Series and DataFrame objects have an append method that can accomplish
the same thing in fewer keystrokes.
df1.append(df2)   # same result as pd.concat([df1, df2])
Output:
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
append() method in Pandas does not modify the original object, it creates a new
object with the combined data.
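Note that DataFrame.append has since been deprecated and removed from recent pandas releases; pd.concat is the supported spelling. A minimal sketch (the frames here are reconstructed to match df1/df2 above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# equivalent to the older df1.append(df2); neither call modifies df1 or df2
combined = pd.concat([df1, df2])
print(combined)
```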
Relational Algebra
The behavior implemented in pd.merge() is a subset of what is known as relational algebra, a formal set of rules for manipulating relational data.
Categories of Joins
The pd.merge() function implements a number of types of joins:
o one-to-one joins
o many-to-one joins
o many-to-many joins
All three types of joins are accessed via an identical call to the pd.merge() interface;
The type of join performed depends on the form of the input data.
One-to-one joins
Program
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)
Output:
df1                        df2
  employee        group      employee  hire_date
0      Bob   Accounting    0     Lisa       2004
1     Jake  Engineering    1      Bob       2008
2     Lisa  Engineering    2     Jake       2012
3      Sue           HR    3      Sue       2014
df3 = pd.merge(df1, df2)
df3
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
The pd.merge() function recognizes that each DataFrame has an “employee”
column, and automatically joins using this column as a key.
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains
duplicate entries.
Example
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering',
'HR'], 'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3); print(df4); print(pd.merge(df3, df4))
df3 df4
employee group hire_date group supervisor
0 Bob Accounting 2008 0 Accounting Carly
1 Jake Engineering 2012 1 Engineering Guido
2 Lisa Engineering 2004 2 HR Steve
3 Sue HR 2014
pd.merge(df3, df4)
Output:
  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
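The merge key can also be named explicitly with the on keyword; a self-contained sketch reusing the same employee tables:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# same result as the automatic key detection, but the key is explicit
merged = pd.merge(df1, df2, on='employee')
print(merged)
```

This option works only if both DataFrames have the specified column name; for differently named key columns, pd.merge also accepts left_on and right_on keywords.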
Many-to-many joins
If the key column in both the left and right array contains duplicates, then
the result is a many-to-many merge.
Consider the following, where we have a DataFrame showing one or more
skills associated with a particular group.
Program
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
print(df1);
print(df5);
print(pd.merge(df1, df5))
Output
df1 df5
  employee        group          group        skills
0 Bob Accounting 0 Accounting math
1 Jake Engineering 1 Accounting spreadsheets
2 Lisa Engineering 2 Engineering coding
3 Sue HR 3 Engineering linux
4 HR spreadsheets
5 HR organization
Syntax
pd.merge(df1, df5)
Output:
  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
AGGREGATION AND GROUPING
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.
The same operations can also be performed on a DataFrame. The following table summarizes the built-in Pandas aggregations:
Aggregation          Description
count()              Total number of items
first(), last()      First and last item
mean(), median()     Mean and median
min(), max()         Minimum and maximum
std(), var()         Standard deviation and variance
mad()                Mean absolute deviation
prod()               Product of all items
sum()                Sum of all items
Planets Data
The Planets dataset (available through the Seaborn package) gives information on extrasolar planets:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Output: (1035, 6)
planets.head()
Output: method number orbital_period mass distance year
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
Output:
0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64
ser.sum()
Output: 2.8119254917081569
ser.mean()
Output: 0.56238509834163142
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df
Output: A B
0 0.155995 0.020584
1 0.058084 0.969910
2 0.866176 0.832443
3 0.601115 0.212339
4 0.708073 0.181825
df.mean()
Output: A 0.477888
B 0.443420
dtype: float64
df.mean(axis='columns')
Output: 0 0.088290
1 0.513997
2 0.849309
3 0.406727
4 0.444949
dtype: float64
describe() computes several common aggregates for each column and returns the result.
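A short sketch of describe() on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})

# count, mean, std, min, quartiles, and max for each numeric column
summary = df.describe()
print(summary)
```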
GroupBy
Program
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
Output:
  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5
Syntax
df.groupby('key')
This returns a DataFrameGroupBy object; no computation is done until an aggregation is applied to it.
Syntax
df.groupby('key').sum()
Output: data
key
A 3
B 5
C 7
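The GroupBy object also supports column indexing and lazy iteration over the groups; a sketch using the same key/data layout:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# select a single column from the grouped object, then aggregate it
print(df.groupby('key')['data'].median())

# iterate over the groups: each iteration yields (name, sub-DataFrame)
for name, group in df.groupby('key'):
    print(name, group.shape)
```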
GroupBy objects have aggregate(), filter(), transform(), and apply() methods that
efficiently implement a variety of useful operations before combining the grouped
data.
Program
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df
Output:
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3 A 3 3
4 B 4 7
5 C 5 9
Aggregation
The aggregate() method can take a string, a function, or a list thereof, and compute all the aggregates at once:
df.groupby('key').aggregate(['min', np.median, max])
Output:
    data1            data2
      min median max   min median max
key
A       0    1.5   3     3    4.0   5
B       1    2.5   4     0    3.5   7
C       2    3.5   5     3    6.0   9
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
Output: data1 data2
key
A 0 5
B 1 7
C 2 9
Filtering
A filtering operation allows you to drop data based on the group properties.
Program
def filter_func(x):
    return x['data2'].std() > 4

print(df); print(df.groupby('key').std());
print(df.groupby('key').filter(filter_func))
Output
df                       df.groupby('key').std()
  key  data1  data2            data1     data2
0   A      0      5      A   2.12132  1.414214
1   B      1      0      B   2.12132  4.949747
2   C      2      3      C   2.12132  4.242641
3   A      3      3
4   B      4      7
5   C      5      9
df.groupby('key').filter(filter_func)
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9
Group A is dropped because its data2 standard deviation (about 1.41) is not greater than 4.
The filter() function should return a Boolean value specifying whether the
group passes the filtering.
Transformation
Transformation can return some transformed version of the full data to recombine.
df.groupby('key').transform(lambda x: x - x.mean())
Output:
   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0
Apply
The apply() method lets you apply an arbitrary function to the group results.
Program
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
print(df);
print(df.groupby('key').apply(norm_by_data2))
Output
df                       df.groupby('key').apply(norm_by_data2)
  key  data1  data2        key     data1  data2
0   A      0      5      0   A  0.000000      5
1   B      1      0      1   B  0.142857      0
2   C      2      3      2   C  0.166667      3
3   A      3      3      3   A  0.375000      3
4   B      4      7      4   B  0.571429      7
5   C      5      9      5   C  0.416667      9
The key can be any series or list with a length matching that of the DataFrame.
L = [0, 1, 0, 1, 2, 0]
print(df);
print(df.groupby(L).sum())
Output
df                       df.groupby(L).sum()
  key  data1  data2         data1  data2
0   A      0      5      0      7     17
1   B      1      0      1      4      3
2   C      2      3      2      4      7
3   A      3      3
4   B      4      7
5   C      5      9
print(df);
print(df.groupby(df['key']).sum())
Output
df                       df.groupby(df['key']).sum()
  key  data1  data2           data1  data2
0   A      0      5      key
1   B      1      0      A        3      8
2   C      2      3      B        5      7
3   A      3      3      C        7     12
4   B      4      7
5   C      5      9
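Another grouping option is a dictionary mapping index values to group names; a sketch reusing the key/data1/data2 frame (the vowel/consonant mapping is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})
df2 = df.set_index('key')

# map index labels onto coarser group names, then aggregate
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
print(df2.groupby(mapping).sum())
```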
PIVOT TABLES
A Pivot table is a similar operation that is commonly seen in spreadsheets and
other programs that operate on tabular data.
The pivot table takes simple columnwise data as input, and groups the entries
into a two-dimensional table that provides a multidimensional summarization of
the data.
Example: Titanic passengers
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
Output:
survived pclass sex age sibsp parch fare embarked class \
titanic.groupby('sex')[['survived']].mean()
Output:
survived sex
female 0.742038
male 0.188908
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
This gives the survival rate broken down by both sex and class. The equivalent pivot-table call is:
titanic.pivot_table('survived', index='sex', columns='class')
Multilevel pivot tables
Grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we can add info on the fare paid, using pd.qcut to automatically compute quantiles:
fare = pd.qcut(titanic['fare'], 2)
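Since the Titanic data requires a seaborn download, here is a self-contained sketch of the same pivot_table pattern on a small invented sales frame (the region/product/units names are illustrative, not from the notes above):

```python
import pandas as pd

sales = pd.DataFrame({'region': ['East', 'East', 'West', 'West', 'East', 'West'],
                      'product': ['pen', 'book', 'pen', 'book', 'pen', 'pen'],
                      'units': [3, 2, 4, 1, 5, 2]})

# rows = region, columns = product, cell = mean units per (region, product) pair
table = sales.pivot_table('units', index='region', columns='product', aggfunc='mean')
print(table)
```

This one call is equivalent to grouping by region and product, averaging units, and unstacking the product level into columns.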
B.E/CSE - Foundation of Data Science CS3352 - II/III - R-2021