Unit 4 PPT Part2 - Pandas

Unit IV of the document covers Python libraries for data wrangling, focusing on NumPy and Pandas. It discusses data manipulation techniques using Pandas, including data indexing, selection, handling missing data, and operations on DataFrames. The document also highlights the advantages of Pandas over NumPy, particularly in managing structured data and performing complex data operations.


Unit-4 PANDAS

JAL1301 DATA SCIENCE USING PYTHON

UNIT IV: PYTHON LIBRARIES FOR DATA WRANGLING

SYLLABUS
Basics of NumPy arrays – Aggregations – Computations on arrays – Comparisons, masks, boolean logic – Fancy indexing – Sorting arrays – Structured data
Data manipulation with Pandas – Data indexing and selection – Operating on data – Handling missing data – Hierarchical indexing – Combining datasets – Aggregation and grouping – Pivot tables.
DATA MANIPULATION WITH PANDAS
PANDAS
 NumPy and its ndarray object provide efficient storage and manipulation of dense, typed arrays in Python.
 NumPy’s ndarray data structure provides the essential features for the clean, well-organized data typically seen in numerical computing tasks.
 Limitations:
   less flexibility (attaching labels to data, working with missing data, etc.), and
   operations that do not map well to element-wise broadcasting (groupings, pivots, etc.).
 Both are important pieces of analyzing the less structured data available in many forms in the world around us.

PANDAS
 The Pandas library provides efficient data structures.
 Pandas  a newer package built on top of NumPy  provides an efficient implementation of a DataFrame.
 DataFrames  essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
 Pandas, with its Series and DataFrame objects, builds on the NumPy array structure.
 Provides efficient access to the sorts of data munging (data wrangling) tasks that occupy much of a data scientist’s time.
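As a quick illustration (a minimal sketch, not from the slides), labeled axes and built-in missing-data handling are exactly what a plain ndarray lacks:

import numpy as np
import pandas as pd

# NumPy: positional indexing only; NaN must be managed by hand
arr = np.array([1.0, np.nan, 3.0])

# Pandas: labels attach to the data, and aggregations are NaN-aware
s = pd.Series(arr, index=['a', 'b', 'c'])
print(s['a'])     # access by label
print(s.mean())   # 2.0 -- NaN is skipped by default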

DATA INDEXING AND SELECTION

DATA INDEXING AND SELECTION
• Data Selection in Series
  • Series as dictionary
  • Series as 1-d array
  • Indexers: loc, iloc, and ix
• Data Selection in DataFrame
  • DataFrame as a dictionary
  • DataFrame as a 2-d array
  • Additional indexing conventions

DATA INDEXING AND SELECTION
• Methods and tools to access, set, and modify values in NumPy arrays  indexing, slicing, masking, fancy indexing, and combinations thereof.
• Similar means of accessing and modifying values are available for Pandas Series and DataFrame objects.
• Simpler case  data selection in a 1-d Series object
• Complicated case  data selection in a 2-d DataFrame object

DATA SELECTION IN SERIES – SERIES AS DICTIONARY
Series object
• Provides a mapping from a collection of keys to a collection of values

import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)
print(data['b'])

Output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
0.5

DATA SELECTION IN SERIES – SERIES AS DICTIONARY
Series object
• Dictionary-like Python expressions and methods are available to examine the keys/indices and values.
• A Series can be extended by assigning to a new index value:

print('b' in data)
print(data.keys())
print(list(data.items()))
data['e'] = 1.62

Output:
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

• Pandas makes the necessary decisions about memory layout and data copying behind the scenes.

DATA SELECTION IN SERIES – SERIES AS 1-DIMENSIONAL ARRAY
Series object
• Can also be used for slicing, masking, and fancy indexing

print(data['a' : 'c'])                    # slicing by explicit index
print(data[0 : 2])                        # slicing by implicit integer index
print(data[(data > 0.3) & (data < 0.8)])  # masking
print(data[['a', 'e']])                   # fancy indexing

Output 1:
a    0.25
b    0.50
c    0.75
dtype: float64

Output 2:
a    0.25
b    0.50
dtype: float64

Output 3:
b    0.50
c    0.75
dtype: float64

Output 4:
a    0.25
e    1.62
dtype: float64

DATA SELECTION IN SERIES – INDEXERS: LOC, ILOC, AND IX
Series object
• If a Series has an explicit integer index, indexing and slicing can be confusing: indexing uses the explicit index, while slicing uses the implicit positional index.

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])
print(data[1])      # explicit index when indexing
print(data[0 : 2])  # implicit index when slicing

Output 1:
0.25

Output 2:
1    0.25
2    0.50
dtype: float64

DATA SELECTION IN SERIES – INDEXERS: LOC, ILOC, AND IX
Series object
• To avoid confusion, loc (for explicit indexing) and iloc (for implicit indexing) can be used. (The ix indexer was used in the context of DataFrame objects.)

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])
print(data.loc[1])       # explicit index when indexing
print(data.loc[1 : 3])   # explicit index when slicing
print(data.iloc[1])      # implicit index when indexing
print(data.iloc[1 : 3])  # implicit index when slicing

Output 1:
0.25

Output 2:
1    0.25
2    0.50
3    0.75
dtype: float64

Output 3:
0.5

Output 4:
2    0.50
3    0.75
dtype: float64

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• A DataFrame acts like a two-dimensional or structured array  like a dictionary of Series structures sharing the same index.

import pandas as pd
area = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
pop = pd.Series([100, 150, 125, 85], index=['a', 'b', 'c', 'd'])
data = pd.DataFrame({'AREA' : area, 'POPULATION' : pop})
print(data)

Output:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
d  1.00          85

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• The same DataFrame can be built from Series constructed directly from Python dictionaries:

import pandas as pd
area = pd.Series({'a' : 0.25, 'b' : 0.5, 'c' : 0.75, 'd' : 1.0})
pop = pd.Series({'a' : 100, 'b' : 150, 'c' : 125, 'd' : 85})
data = pd.DataFrame({'AREA' : area, 'POPULATION' : pop})

Output:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
d  1.00          85

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• Accessing an individual Series (column):

print(data['AREA'])   (or)   print(data.AREA)

Output for both cases:
a    0.25
b    0.50
c    0.75
d    1.00
Name: AREA, dtype: float64

DATA SELECTION IN DATAFRAME – DATAFRAME AS A DICTIONARY
DataFrame object
• Modifying the object  e.g., adding a new column:

data['DENSITY'] = data['POPULATION'] / data['AREA']
print(data)

Output:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667
d  1.00          85   85.000000

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object
• The full DataFrame can be transposed to swap rows and columns:

print(data)
print(data.T)

Output 1:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667
d  1.00          85   85.000000

Output 2:
                 a       b           c     d
AREA          0.25    0.50    0.750000   1.0
POPULATION  100.00  150.00  125.000000  85.0
DENSITY     400.00  300.00  166.666667  85.0

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object
• The raw underlying data array can be accessed using the values attribute:

print(data.values)
print(data.values[0])  # similar to array indexing

Output:
[[  0.25        100.          400.        ]
 [  0.5         150.          300.        ]
 [  0.75        125.          166.66666667]
 [  1.           85.           85.        ]]
[   0.25  100.    400.  ]

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object

print(data['AREA'])         # a single "index" passed to a DataFrame accesses a column
print(data.iloc[:3, :2])    # array slicing – using implicit indexing
print(data.loc['a' : 'c'])  # array slicing – using explicit indexing

Output 1:
a    0.25
b    0.50
c    0.75
d    1.00
Name: AREA, dtype: float64

Output 2:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125

Output 3:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object

print(data.ix[:3, :'POPULATION'])  # array slicing – using HYBRID indexing

• Rows 0 to 2, columns 'AREA' to 'POPULATION'

Output:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125
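Note that the ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0. A minimal equivalent of the hybrid slice above, chaining iloc (positional rows) with loc (label-based columns):

import pandas as pd

area = pd.Series({'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0})
pop = pd.Series({'a': 100, 'b': 150, 'c': 125, 'd': 85})
data = pd.DataFrame({'AREA': area, 'POPULATION': pop})
data['DENSITY'] = data['POPULATION'] / data['AREA']

# positional row slice first, then label-based column slice
print(data.iloc[:3].loc[:, :'POPULATION'])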

DATA SELECTION IN DATAFRAME – DATAFRAME AS A 2-D ARRAY
DataFrame object
• To select the rows with population >= 100:

print(data.loc[data.POPULATION >= 100])  # array masking
print(data.loc[data.POPULATION >= 100, ['AREA', 'POPULATION']])

Output 1:
   AREA  POPULATION     DENSITY
a  0.25         100  400.000000
b  0.50         150  300.000000
c  0.75         125  166.666667

Output 2:
   AREA  POPULATION
a  0.25         100
b  0.50         150
c  0.75         125

OPERATING ON DATA
• Ufuncs: Index Preservation
• Ufuncs: Index Alignment
  • Index alignment in Series
  • Index alignment in DataFrame

OPERATING ON DATA
• Pandas inherits much of NumPy’s ufunc machinery to perform quick element-wise operations, both basic arithmetic (addition, subtraction, multiplication, etc.) and more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
• Pandas will preserve index and column labels in the output.
• Pandas will automatically align indices when passing the objects to a ufunc.

OPERATING ON DATA – UFUNCS: INDEX PRESERVATION
• Consider example Pandas Series objects:

import pandas as pd
import numpy as np
range1 = np.random.RandomState(42)
series1 = pd.Series(range1.randint(0, 10, 4))
series2 = pd.Series(np.random.randint(0, 10, 4))
print(series1)  # same set of values for every execution (seeded)
print(series2)  # different set of values for each execution

Output 1:
0    6
1    3
2    7
3    4
dtype: int32

Output 2 (varies):
0    3
1    2
2    9
3    3
dtype: int32

OPERATING ON DATA – UFUNCS: INDEX PRESERVATION
• Consider an example Pandas DataFrame object:

import pandas as pd
import numpy as np
range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
print(df1)

Output:
   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4

OPERATING ON DATA – UFUNCS: INDEX PRESERVATION
• When a ufunc is applied to these objects, the result is another Pandas object with the indices preserved – the indices appear in the output.

print(np.exp(series1))  # np.log, np.log2, np.log10 can also be used
print(np.sin(df1))

Output 1:
0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

Output 2:
          A         B         C         D
0 -0.279415  0.412118  0.909297 -0.279415
1  0.656987 -0.756802  0.141120  0.656987
2  0.656987  0.909297 -0.958924 -0.756802

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN SERIES
• For binary operations on two Series or DataFrame objects, Pandas aligns indices in the process of performing the operation:

area = pd.Series({'a' : 0.25, 'b' : 0.5, 'c' : 0.75, 'd' : 1.0}, name = 'AREA')
pop = pd.Series({'a' : 100, 'b' : 150, 'c' : 125, 'd' : 85}, name = 'POPULATION')
print(pop / area)
area1 = pd.Series({'a' : 0.25, 'b' : 0.5, 'd' : 1.0}, name = 'AREA')
pop1 = pd.Series({'b' : 150, 'c' : 125, 'a' : 100}, name = 'POP')
print(pop1 / area1)

Output 1:
a    400.000000
b    300.000000
c    166.666667
d     85.000000
dtype: float64

Output 2:
a    400.0
b    300.0
c      NaN
d      NaN
dtype: float64

• The result contains the union of all indices.
• NaN  Not a Number  missing data

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN SERIES
• For binary operations on two Series or DataFrame objects, Pandas aligns indices in the process of performing the operation:

A = pd.Series([2, 4, 6, 5], index=[1, 2, 3, 4])
B = pd.Series([1, 3, 5, 7], index=[1, 2, 3, 4])
print(A + B)
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)

Output 1:
1     3
2     7
3    11
4    12
dtype: int64

Output 2:
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

• The result contains the union of all indices; entries present in only one Series become NaN (missing data).

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN SERIES
• The fill value can be specified explicitly for any elements in A or B that might be missing:

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A.add(B, fill_value=0))

Output:
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN DATAFRAME
• For binary operations on DataFrames, Pandas aligns both row indices and column labels:

range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 20, (2, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(range1.randint(0, 10, (3, 3)), columns=list('BAC'))
print(df1)
print(df2)
print(df1 + df2)

Output 1 (df1):
    A   B
0   6  19
1  14  10

Output 2 (df2):
   B  A  C
0  7  4  6
1  9  2  6
2  7  4  3

Output 3 (df1 + df2):
      A     B   C
0  10.0  26.0 NaN
1  16.0  19.0 NaN
2   NaN   NaN NaN

OPERATING ON DATA – UFUNCS: INDEX ALIGNMENT – IN DATAFRAME
• Fill the missing values with the mean of all values in df1:

range1 = np.random.RandomState(42)
df1 = pd.DataFrame(range1.randint(0, 20, (2, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(range1.randint(0, 10, (3, 3)), columns=list('BAC'))
fill = df1.stack().mean()  # fill = 12.25 for this output
print(df1.add(df2, fill_value=fill))

Output:
       A      B      C
0  10.00  26.00  18.25
1  16.00  19.00  18.25
2  16.25  19.25  15.25

HANDLING MISSING DATA
• Trade-Offs in Missing Data Conventions
• Missing Data in Pandas
  • None: Pythonic missing data
  • NaN: Missing numerical data
  • NaN and None in Pandas
• Operating on Null Values
  • Detecting null values
  • Dropping null values
  • Filling null values

HANDLING MISSING DATA
• Real-world data is rarely clean and homogeneous.
• It will usually have some amount of data missing.
• Different data sources may indicate missing data in different ways.
• Pandas chooses to represent missing data as null, NaN, or NA values.
• Built-in Pandas tools are available for handling missing data in Python.

TRADE-OFFS IN MISSING DATA CONVENTIONS
• Two strategies to indicate the presence of missing data in a table or DataFrame:
1. Masking approach – using a mask that globally indicates missing values – a separate Boolean array, or one bit in the data representation to locally indicate the null status of a value.
2. Sentinel approach – choosing a sentinel value that indicates a missing entry – some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or a more global convention, such as indicating a missing floating-point value with the special value NaN (Not a Number).
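A minimal sketch contrasting the two conventions in NumPy (illustrative, not from the slides):

import numpy as np
import numpy.ma as ma

vals = np.array([1.0, -9999.0, 3.0])

# Masking approach: a separate Boolean mask marks the missing entry
masked = ma.masked_equal(vals, -9999.0)
print(masked.mean())         # 2.0 -- masked entries are ignored

# Sentinel approach: an in-band special value (here NaN) marks it
sentinel = np.where(vals == -9999.0, np.nan, vals)
print(np.nanmean(sentinel))  # 2.0 -- NaN-aware aggregation required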

TRADE-OFFS IN MISSING DATA CONVENTIONS
• Trade-offs between the two strategies:
• Masking approach – use of a separate mask array requires allocation of an additional Boolean array  adds overhead in both storage and computation.
• Sentinel approach – a sentinel value reduces the range of valid values, and special values like NaN are not available for all data types.

MISSING DATA IN PANDAS – NONE: PYTHONIC MISSING DATA
• None  a Python singleton object often used for missing data in Python code  can be used only in arrays with data type 'object' (i.e., arrays of Python objects).

import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
print(vals1)
print(sum(vals1))

Output:
[1 None 3 4]    # dtype=object
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

MISSING DATA IN PANDAS – NAN: MISSING NUMERICAL DATA
• NaN  acronym for Not a Number  a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
• NaN is a bit like a data virus  it infects any other object it touches  aggregates over the values are well defined (i.e., they don’t result in an error) but the result is nan  special aggregations exist that ignore these missing values (e.g., np.nansum).

import numpy as np
import pandas as pd
vals2 = np.array([1, np.nan, 3, 4])
print(vals2)
print(vals2.dtype)
print(sum(vals2))
print(1 + np.nan)
print(np.nansum(vals2))

Output:
[ 1. nan  3.  4.]
float64
nan
nan
8.0

MISSING DATA IN PANDAS – NAN AND NONE IN PANDAS
• Pandas is built to handle NaN and None nearly interchangeably, converting between them where appropriate.
• For example, if we set a value in an integer array to np.nan, it is automatically upcast to a floating-point type to accommodate the NA.
• Likewise, Pandas automatically converts None to a NaN value.

print(pd.Series([1, np.nan, 2, None]))

x = pd.Series(range(2), dtype=int)
print(x)
x[0] = None
print(x)

Output 1:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Output 2:
0    0
1    1
dtype: int64

Output 3:
0    NaN
1    1.0
dtype: float64
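As an aside (a newer pandas feature, beyond what the slides cover), nullable extension dtypes keep integers intact by using the pd.NA marker instead of upcasting to float:

import pandas as pd

# nullable integer dtype: the missing value becomes <NA>, dtype stays Int64
s = pd.Series([1, None, 3], dtype='Int64')
print(s)
print(s.dtype)  # Int64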

OPERATING ON NULL VALUES
• Pandas methods for detecting, removing, and replacing null values in Pandas data structures:
• isnull()  generate a Boolean mask indicating missing values
• notnull()  opposite of isnull()
• dropna()  return a filtered version of the data

# Detecting null values
data = pd.Series([10, np.nan, 20, None])
print(data.isnull())
print(data.notnull())

# Dropping null values in a Series
print(data.dropna())

Output 1 (isnull):
0    False
1     True
2    False
3     True
dtype: bool

Output 2 (notnull):
0     True
1    False
2     True
3    False
dtype: bool

Output 3 (dropna):
0    10.0
2    20.0
dtype: float64

OPERATING ON NULL VALUES
• dropna()  return a filtered version of the data

# Dropping null values in a DataFrame
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
print(df.dropna())        # rows containing NaN are dropped
print(df.dropna(axis=1))  # columns containing NaN are dropped
df.dropna(axis='columns', how='all')  # drops a column only if ALL values are NaN
df.dropna(axis='columns', how='any')  # drops a column if ANY value is NaN (default)

Output 1 (rows dropped):
     0    1  2
1  2.0  3.0  5

Output 2 (columns dropped):
   2
0  2
1  5
2  6
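dropna() also accepts a thresh parameter (standard pandas API), specifying the minimum number of non-null values a row or column must have to be kept; a small sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
# keep only rows with at least 3 non-null values (here, only row 1)
print(df.dropna(thresh=3))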

OPERATING ON NULL VALUES
• fillna()  return a copy of the data with missing values filled

# Filling null values in a Series
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data.fillna(0))
print(data.fillna(method='ffill'))  # forward-fill
print(data.fillna(method='bfill'))  # backward-fill

Output 1:
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

Output 2:
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Output 3:
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
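In recent pandas releases the method= keyword of fillna() is deprecated; the dedicated .ffill() and .bfill() methods are the forward-compatible spelling (a note beyond the slides):

import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data.ffill())  # same result as fillna(method='ffill')
print(data.bfill())  # same result as fillna(method='bfill')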

COMBINING DATASETS
• Concat and append
  • np.concatenate function
  • pd.concat method
  • pd.concat with inner join
  • append method
• Merge and join
  • One-to-one join
  • Many-to-one join

COMBINING DATASETS
• Combining different data sources ranges from:
  • simple concatenation of two different datasets, to
  • complicated database-style joins and merges that correctly handle any overlaps between the datasets.
• Pandas includes functions and methods that make this sort of data wrangling fast and straightforward – on both Series and DataFrames.
• Function used – pd.concat

COMBINING DATASETS
• Creating a DataFrame:

import numpy as np
import pandas as pd
cols = pd.Series(['A', 'B', 'C'])
ind = np.arange(3)
vals = {c: [str(c) + str(i) for i in ind] for c in cols}
data = pd.DataFrame(vals, ind)
print(data)

Output:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

COMBINING DATASETS
• Concatenation of NumPy arrays  done via the np.concatenate function:

x = [1, 2, 3]
y = [4, 5, 6]
data = np.concatenate([x, y])
print(data)
x = [[1, 2], [3, 4]]
data1 = np.concatenate([x, x], axis=0)
data2 = np.concatenate([x, x], axis=1)
print(data1, data2)

Output:
[1 2 3 4 5 6]

[[1 2]
 [3 4]
 [1 2]
 [3 4]]

[[1 2 1 2]
 [3 4 3 4]]

COMBINING DATASETS – CONCATENATE & APPEND
• Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays 
  • done via the pd.concat function
  • done via the append method

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
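Two commonly used options from this signature (standard pd.concat parameters): ignore_index=True discards the incoming indices and builds a fresh RangeIndex, while keys= labels each source with an outer MultiIndex level. A short sketch:

import pandas as pd

df1 = pd.DataFrame({'COL1': ['A1', 'A2']}, index=[1, 2])
df2 = pd.DataFrame({'COL1': ['A3', 'A4']}, index=[1, 2])

print(pd.concat([df1, df2], ignore_index=True))  # index becomes 0..3
print(pd.concat([df1, df2], keys=['x', 'y']))    # hierarchical ('x'/'y') index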

COMBINING DATASETS – CONCATENATE & APPEND
• To use the pd.concat function for row-wise concatenation, the column names must be the same in both DataFrames:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL1':['A3', 'A4'], 'COL2':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(pd.concat([df1, df2]))  # row-wise concatenation

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL1 COL2
3   A3   B3
4   A4   B4

Output 3:
  COL1 COL2
1   A1   B1
2   A2   B2
3   A3   B3
4   A4   B4

COMBINING DATASETS – CONCATENATE & APPEND
• To use the pd.concat function for column-wise concatenation (with axis=1), the row indices must be the same in both DataFrames:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL3':['A3', 'A4'], 'COL4':['B3', 'B4']}, index=[1,2])
print(df1)
print(df2)
print(pd.concat([df1, df2], axis=1))  # column-wise concatenation

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL3 COL4
1   A3   B3
2   A4   B4

Output 3:
  COL1 COL2 COL3 COL4
1   A1   B1   A3   B3
2   A2   B2   A4   B4

COMBINING DATASETS – CONCATENATE & APPEND
• With pd.concat row-wise, if the column names are not the same in both DataFrames, the union of the column names is taken, and NaN is used for missing values:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL2':['A3', 'A4'], 'COL3':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(pd.concat([df1, df2]))  # row-wise concatenation

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL2 COL3
3   A3   B3
4   A4   B4

Output 3:
  COL1 COL2 COL3
1   A1   B1  NaN
2   A2   B2  NaN
3  NaN   A3   B3
4  NaN   A4   B4

COMBINING DATASETS – CONCATENATE & APPEND
• To avoid NaN, an intersection of the columns can be taken using join='inner' in pd.concat:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL2':['A3', 'A4'], 'COL3':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(pd.concat([df1, df2], join='inner'))  # row-wise inner join

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL2 COL3
3   A3   B3
4   A4   B4

Output 3:
  COL2
1   B1
2   B2
3   A3
4   A4


COMBINING DATASETS – CONCATENATE & APPEND
• Concatenation using the append method – returns a new DataFrame object – not as efficient, since it creates a new index and data buffer.
• Multiple append() calls are equivalent to a single pd.concat()  prefer pd.concat to concatenate several DataFrames.

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']}, index=[1,2])
df2 = pd.DataFrame({'COL1':['A3', 'A4'], 'COL2':['B3', 'B4']}, index=[3,4])
print(df1)
print(df2)
print(df1.append(df2))

Output 1:
  COL1 COL2
1   A1   B1
2   A2   B2

Output 2:
  COL1 COL2
3   A3   B3
4   A4   B4

Output 3:
  COL1 COL2
1   A1   B1
2   A2   B2
3   A3   B3
4   A4   B4
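Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the supported replacement:

# equivalent, forward-compatible spelling
print(pd.concat([df1, df2]))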


COMBINING DATASETS – MERGE & JOIN
• pd.merge() implements a subset of what is known as relational algebra:
  • a formal set of rules for manipulating relational data, which forms the conceptual foundation of the operations available in most databases
  • several primitive operations, which become the building blocks of more complicated operations on any dataset
• Pandas provides the pd.merge() function and the related join() method of Series and DataFrames.
• The pd.merge() function implements a number of types of joins: one-to-one, many-to-one, and many-to-many joins.

COMBINING DATASETS – MERGE & JOIN
• One-to-one join:

df1 = pd.DataFrame({'COL1':['A1', 'A2'], 'COL2':['B1', 'B2']})
df2 = pd.DataFrame({'COL1':['A2', 'A1'], 'COL3':['B3', 'B4']})
print(df1)
print(df2)
print(pd.merge(df1, df2))

Output 1:
  COL1 COL2
0   A1   B1
1   A2   B2

Output 2:
  COL1 COL3
0   A2   B3
1   A1   B4

Output 3:
  COL1 COL2 COL3
0   A1   B1   B4
1   A2   B2   B3

COMBINING DATASETS – MERGE & JOIN
• One-to-one join  similar to the column-wise concatenation seen earlier.
• The number of key values in both DataFrames to be merged is the same.

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1)
print(df2)
df3 = pd.merge(df1, df2)
print(df3)

Output (df3):
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014


COMBINING DATASETS – MERGE & JOIN
• Many-to-one join  one of the two key columns contains duplicate entries:

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2)
print(df3)
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(pd.merge(df3, df4))

COMBINING DATASETS – MERGE & JOIN
• Many-to-one join  one of the two key columns contains duplicate entries.
• df3 and df4 merged (the 'group' key of df4 repeats for each matching row of df3):

Output:
  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
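For completeness (the third join type pd.merge supports, sketched on the same employee example): in a many-to-many join the key column contains duplicates in both DataFrames, so matching rows are combined pairwise.

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
print(pd.merge(df1, df5))  # each employee row repeats once per matching skill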

AGGREGATION AND GROUPING
• Simple Aggregation in Pandas
• GroupBy: Split, Apply, Combine

SIMPLE AGGREGATION IN PANDAS
• Efficient summarization – computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.
• For a Series object, each aggregate returns a single value, as sketched below.
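A minimal sketch of these aggregates on a Series and a DataFrame (illustrative values, not from the slides; on a DataFrame each aggregate returns one value per column):

import pandas as pd

ser = pd.Series([0.25, 0.5, 0.75, 1.0])
print(ser.sum())    # 2.5
print(ser.mean())   # 0.625

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
print(df.mean())      # column-wise: A 2.0, B 20.0
print(df.describe())  # count, mean, std, min, quartiles, max per column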

GROUPBY: SPLIT, APPLY, COMBINE
• The GroupBy accomplishes the following (see the sketch after this list):
   The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
   The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
   The combine step merges the results of these operations into an output array.
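A minimal sketch of split-apply-combine (hypothetical data, not from the slides):

import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15]})
# split on 'key', apply sum() within each group, combine into one result
print(df.groupby('key').sum())

Output:
     data
key
A       5
B      15
C      25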

PIVOT TABLES
• Pivot tables
• Multilevel pivot tables
• Additional pivot table options

PIVOT TABLES
• The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
• Pivot tables  a multidimensional version of GroupBy aggregation  the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.
• Method used  pivot_table()

PIVOT TABLES
• Sample dataset used  the database of passengers on the Titanic:

import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

PIVOT TABLES
• Example GroupBy operation  to find the survival rate by gender:

titanic.groupby('sex')[['survived']].mean()

• This finds the groups based on 'sex', then takes the mean of 'survived' within each group (female, male).

Output:
        survived
sex
female  0.742038
male    0.188908

• Roughly three of every four females on board survived (0.742 ≈ 3/4), while only about one in five males survived (0.189 ≈ 1/5).

PIVOT TABLES
• Example GroupBy operation  to find the survival rate by gender and class:
  group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack to reveal the hidden multidimensionality.

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

Output:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

• Class-wise, gender-wise survival  a multidimensional aggregation operation.

PIVOT TABLES
• Multidimensional aggregation can be handled simply using a pivot table  the method used is pivot_table():

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

• values  the column(s) to be aggregated in the pivot table
• index  row titles
• columns  column titles
• aggfunc  type of aggregation applied  'mean' by default
• fill_value & dropna  to handle missing values
• margins  if True  totals are computed along each grouping (row-wise and column-wise)
• observed  if True, shows only observed values for categorical groupers
• sort  if True, the results are sorted

PIVOT TABLES
• The multidimensional aggregation operation using groupby:

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

• The equivalent operation using pivot_table():

titanic.pivot_table('survived', index='sex', columns='class')

Output:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

• Class-wise, gender-wise survival  a two-dimensional pivot operation.

MULTILEVEL PIVOT TABLES
• The grouping in pivot tables can be specified with multiple levels, and via a number of options.
• Example – let age be a third dimension (apart from class & sex):

age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

• pd.cut() splits the age data into 2 bins  (0, 18] and (18, 80].

Output:
class               First    Second     Third
sex    age
female (0, 18]   0.909091  1.000000  0.511628
       (18, 80]  0.972973  0.900000  0.423729
male   (0, 18]   0.800000  0.600000  0.215686
       (18, 80]  0.375000  0.071429  0.133663

• Class-wise, gender-wise, age-wise survival  a three-dimensional operation.

MULTILEVEL PIVOT TABLES
• Example – let fare be a fourth dimension (apart from sex, age, class):

fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])

• pd.qcut() splits the 'fare' data into 2 bins based on quantiles (equal numbers of passengers per bin).
• Output: a four-dimensional aggregation (table omitted here).

ADDITIONAL PIVOT TABLE OPTIONS
• Example: different aggregation functions applied per column (when aggfunc is a dict, the values keyword is omitted  the dict keys name the columns):

titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': 'sum', 'fare': 'mean'})

ADDITIONAL PIVOT TABLE OPTIONS
• Example: computation of totals along each grouping – using margins=True  the margin labels (row name and column name for the totals) are 'All' by default.
• Instead of 'All', the margin label can be specified with the margins_name keyword.

titanic.pivot_table('survived', index='sex', columns='class', margins=True)
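A short sketch of the margins_name keyword (a standard pivot_table parameter):

# totals row/column labeled 'Total' instead of the default 'All'
titanic.pivot_table('survived', index='sex', columns='class',
                    margins=True, margins_name='Total')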
