
AD3301
DATA EXPLORATION AND VISUALIZATION
UNIT I

EXPLORATORY DATA ANALYSIS

Lesson Plan

Planned Hour | Description of Portion to be Covered                                         | Relevant CO Nos | Highest Cognitive Level
1            | EDA fundamentals - Understanding data science - Significance of EDA          | CO1             | K1
1            | Making sense of data                                                          | CO1             | K1
1            | Comparing EDA with classical and Bayesian analysis - Software tools for EDA  | CO1             | K1
1            | Visual Aids for EDA                                                           | CO2             | K1
1            | Visual Aids for EDA                                                           | CO2             | K1
1            | Data transformation techniques - merging database, reshaping and pivoting    | CO2             | K1
1            | Transformation techniques                                                     | CO2             | K1
1            | Grouping Datasets - data aggregation                                          | CO2             | K1
1            | Pivot tables and cross-tabulations                                            | CO2             | K1

CO1: Understand the fundamentals of exploratory data analysis.
T1: Suresh Kumar Mukhiya, Usman Ahmed, "Hands-On Exploratory Data Analysis with Python", Packt Publishing, 2020.


The Fundamentals of EDA
• Data encompasses a collection of discrete objects, numbers, words, events, facts,
measurements, observations, or even descriptions of things.

• Processing such data elicits useful information, and processing such information generates useful knowledge.

• How can we generate meaningful and useful information from such data?

• EDA is a process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.



Understanding data science
• CRoss-Industry Standard Process for Data Mining (CRISP-DM)



Stages Of Data Analysis And Data Mining
• Data requirements: what type of data is required by the organization, and how it will be collected, curated, and stored.

• Data collection: what are the ways we can collect the data? Data collected from several sources should be stored in the correct form.

• Data processing: preprocessing involves pre-curating the dataset before the actual analysis.

• Data cleaning: incompleteness check, duplicates check, error check, and missing value check.



• EDA: we actually start to understand the message contained in the data.

• Modelling and algorithm: the cause and effect - dependent and independent variables.

• Data product: the model developed during data analysis.

• Communication: disseminating the results to the end stakeholders, who use them for business intelligence.



The significance of EDA
• Different fields accumulate and store data primarily in electronic databases.

• EDA allows us to visualize data to understand it, as well as to create hypotheses for further analysis.

• Key components: summarizing data, statistical analysis, and visualization of data.

• Python provides exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.



Steps in EDA
• Problem definition
  • defining the main objective of the analysis,
  • defining the main deliverables,
  • outlining the main roles and responsibilities,
  • obtaining the current status of the data,
  • defining the timetable, and
  • performing cost/benefit analysis

• Data preparation
  • understanding the main characteristics of the data,
  • cleaning the dataset,
  • deleting non-relevant datasets,
  • transforming the data, and
  • dividing the data into required chunks for analysis.
• Data analysis
  • summarizing the data,
  • finding the hidden correlations and relationships among the data,
  • developing predictive models,
  • evaluating the models, and
  • calculating the accuracies.

• Development and representation of the results: presenting the dataset to the target audience in the form of
  • graphs,
  • summary tables,
  • maps, and diagrams.
Making sense of data
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.



Two Groups of Data
• Numerical data and Categorical data.

• Numerical data - this data has a sense of measurement involved in it

  • It is often referred to as quantitative data in statistics

• Discrete data
  • This is data that is countable and whose values can be listed out

• Continuous data
  • A variable that can have an infinite number of numerical values within a specific range is classified as continuous data; its value is obtained by measuring.



Guess the type of data



• Categorical data
  • This type of data represents the characteristics of an object
  • Examples: gender, marital status, type of address, categories of movies, blood type
  • Referred to as qualitative data in statistics

• Dichotomous variables - categorical variables (binary data) with exactly two possible values

• Polytomous variables - categorical variables with more than two possible values.



Measurement scales
• Four different types of measurement scales:
  • Nominal, Ordinal, Interval, and Ratio.

• Nominal
  • These are used for labeling variables without any quantitative value. The scales are generally referred to as labels.

  • Examples: the languages spoken in a particular country, hair color, gender, nationality, profession

  • Nominal data is summarized using frequency, proportion, and percentage, and can be visualized accordingly



• Ordinal

• In ordinal scales, the order of the values is a significant factor



• Interval
  • In interval scales, both the order and the exact differences between the values are significant.
  • Interval scales are widely used in statistics, for example, in the measures of central tendency - mean, median, mode - and standard deviations.

• Ratio data
  • It is just like interval data in that it can be categorized and ranked, and there are equal intervals between the data points.
  • Examples: income, height, weight, annual sales, market share
  • Zero means none, and it is not possible to have negative values.



Comparing EDA with classical and Bayesian analysis



Software tools available for EDA
• Python
• R programming language
• Weka
• KNIME



Visual Aids for EDA
• Line chart
• Bar chart
• Scatter plot
• Area plot and stacked plot
• Pie chart
• Table chart
• Polar chart
• Histogram
• Lollipop chart
• Choosing the best chart
• Other libraries to explore



Line chart
• A line chart is used to illustrate the relationship between two or more continuous variables.
• Line charts are the simplest form of representing quantitative data between two variables, shown with the help of a line that can be either straight or curved.
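
A minimal matplotlib sketch (the year/sales values are made-up illustration data):

import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021, 2022]
sales = [110, 125, 98, 140, 160]

plt.plot(years, sales, marker='o')   # a line connecting the (year, sales) points
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Yearly sales (line chart)')
plt.show()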



Bar Chart
• Bars can be drawn horizontally or vertically to represent categorical variables.

• A bar chart is a statistical approach to represent given data using vertical or horizontal rectangular bars.

• Each bar has a uniform width and a varying height.

• The length of each bar is proportional to the value it represents.

• It is basically a graphical representation of data with the help of horizontal or vertical bars of different heights.
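
A minimal matplotlib sketch (the courses and counts are made-up illustration data):

import matplotlib.pyplot as plt

courses = ['EDA', 'ML', 'DBMS']
students = [25, 32, 18]

plt.bar(courses, students)   # vertical bars; plt.barh() draws them horizontally
plt.xlabel('Course')
plt.ylabel('Number of students')
plt.show()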



Scatter Plots
• Scatter plots are used to observe relationships between variables, using dots to represent the relationship between them.
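
A minimal matplotlib sketch (the height/weight pairs are made-up illustration data):

import matplotlib.pyplot as plt

height = [151, 156, 160, 163, 172, 179, 181]
weight = [48, 52, 55, 60, 67, 74, 76]

plt.scatter(height, weight)   # one dot per (height, weight) observation
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()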



Area plot and stacked plot
• The stacked plot owes its name to the fact that it represents the area under a line plot
and that several such plots can be stacked on top of one another
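
A minimal sketch using matplotlib's stackplot (the monthly page-view numbers for the two sites are made-up illustration data):

import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5]
site_a = [10, 12, 8, 14, 16]
site_b = [5, 7, 6, 9, 11]

plt.stackplot(months, site_a, site_b, labels=['Site A', 'Site B'])  # areas stacked on top of one another
plt.legend(loc='upper left')
plt.xlabel('Month')
plt.ylabel('Page views')
plt.show()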



Pie chart
• A pie chart (or a circle chart) is a circular statistical graphic, which is divided into
slices to illustrate numerical proportion
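
A minimal matplotlib sketch (the budget shares are made-up illustration data):

import matplotlib.pyplot as plt

labels = ['Rent', 'Food', 'Travel', 'Savings']
shares = [40, 30, 10, 20]

plt.pie(shares, labels=labels, autopct='%1.0f%%')   # one slice per share, labelled with its percentage
plt.show()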



Table chart
• A table chart combines a bar chart and a table.



Polar chart
• Polar area charts are similar to pie charts, but each segment has the same angle -
the radius of the segment differs depending on the value.

• This type of chart is often useful when we want to show comparison data similar to a pie chart, but also show a scale of values for context.
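
A minimal matplotlib sketch on a polar axis (the five segment values are made-up illustration data):

import numpy as np
import matplotlib.pyplot as plt

values = [3, 7, 5, 9, 6]
theta = np.linspace(0, 2 * np.pi, len(values), endpoint=False)

ax = plt.subplot(projection='polar')
ax.bar(theta, values, width=2 * np.pi / len(values))   # equal angles; the radius varies with the value
plt.show()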



Histogram
• Histograms are a type of bar plot for numeric data that group the data into bins.
• Histogram plots are used to depict the distribution of any continuous variable
• It is used to represent grouped frequency distribution with continuous classes.
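
A minimal matplotlib sketch (200 normally distributed values stand in for any continuous variable):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=170, scale=10, size=200)

plt.hist(data, bins=15)   # group the values into 15 bins and count each bin
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()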



Lollipop chart
• The lollipop chart is a composite chart with bars and circles. It is a variant of the bar chart with a circle at the end to highlight the data value.
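
A minimal sketch built from matplotlib's stem plot, which draws exactly this bar-plus-circle shape (the categories and values are made-up illustration data):

import matplotlib.pyplot as plt

x = range(5)
values = [12, 25, 18, 30, 22]

plt.stem(x, values)   # vertical stems with a circle marker at each data value
plt.xticks(x, ['A', 'B', 'C', 'D', 'E'])
plt.ylabel('Value')
plt.show()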



Data Transformation
• Data wrangling - data cleaning, data remediation, or data munging
• Transform raw data into more readily used formats

• Data deduplication
• Key restructuring
• Data integration
• Data cleansing
• Data filtering
• Data validation
• Data joining
• Format revisioning
• Data derivation
• Data aggregation



Merging database-style dataframes

• Append / Concat

• Merge or Join



Append
• The pandas dataframe.append() function is used to append the rows of another dataframe to the end of the given dataframe, returning a new dataframe object.
df1SE                    df2SE                    Result of append
Student ID  ScoreSE      Student ID  ScoreSE      Student ID  ScoreSE
1           89           17          71           1           89
3           39           19          91           3           39
5           50           21          56           5           50
7           97           23          32           7           97
9           22           25          52           9           22
11          66           27          73           11          66
13          31           29          92           13          31
15          51                                    15          51
                                                  17          71
                                                  19          91
                                                  21          56
                                                  23          32
                                                  25          52
                                                  27          73
                                                  29          92
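
Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() achieves the same result. A minimal sketch (the df1SE/df2SE names and the sample rows are illustrative):

import pandas as pd

df1SE = pd.DataFrame({'Student ID': [1, 3], 'ScoreSE': [89, 39]})
df2SE = pd.DataFrame({'Student ID': [17, 19], 'ScoreSE': [71, 91]})

appended = pd.concat([df1SE, df2SE])   # same effect as the removed df1SE.append(df2SE)
print(appended)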
Concat
dataFrame1               dataFrame2
Student ID  ScoreSE      Student ID  ScoreSE
1           89           2           98
3           39           4           93
5           50           6           44
7           97           8           77
9           22           10          69
11          66           12          56
13          31           14          31
15          51           16          53
17          71           18          78
19          91           20          93
21          56           22          56
23          32           24          77
25          52           26          33
27          73           28          56
29          92
dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
print(dataframe)

• The argument ignore_index=True creates a new index; in its absence, concat keeps the original indices.



pd.concat([dataFrame1, dataFrame2], axis=1)

• A DataFrame object has two axes: "axis 0" and "axis 1". "axis 0" represents rows and "axis 1" represents columns.



First option for merging multiple datasets
• Concatenating along an axis

# Option 1
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = pd.concat([dfML, dfSE], axis=1)
print(df)



Using df.merge with an inner join

• dataframe.merge(right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)



# Option 2
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='inner')
df

Here, you perform an inner join between the two dataframes: an item is included in the new dataframe only if it exists in both.

This gives the list of students appearing in both courses. (Student ID 30 will not be in the list.)



StudentID ScoreSE ScoreML
9 22 52
11 66 86
13 31 41
15 51 77
17 71 73
19 91 51
21 56 86
23 32 82
25 52 92
27 73 23
29 92 49
2 98 93
4 93 44
6 44 78
8 77 97
10 69 87
12 56 89
14 31 39
16 53 43
18 78 88
20 93 78
• The inner join takes the intersection from two or more dataframes.
• It corresponds to the INNER JOIN in Structured Query Language (SQL).

• The outer join takes the union from two or more dataframes.
• It corresponds to the FULL OUTER JOIN in SQL.

• The left join uses the keys from the left-hand dataframe only.
• It corresponds to the LEFT OUTER JOIN in SQL.

• The right join uses the keys from the right-hand dataframe only.
• It corresponds to the RIGHT OUTER JOIN in SQL.



# Option 3
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='left')
df



StudentID ScoreSE ScoreML
2 98 93.0
4 93 44.0
6 44 78.0
8 77 97.0
10 69 87.0
12 56 89.0
14 31 39.0
16 53 43.0
18 78 88.0
20 93 78.0
22 56 NaN
24 77 NaN
26 33 NaN
28 56 NaN
30 27 NaN
# Option 4
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='right')
df



StudentID ScoreSE ScoreML
1 NaN 39
3 NaN 49
5 NaN 55
7 NaN 77
9 22.0 52
11 66.0 86
13 31.0 41
15 51.0 77
17 71.0 73
19 91.0 51
21 56.0 86



# Option 5
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='outer')
df.tail(10)



StudentID ScoreSE ScoreML
20 93 78
22 56 NaN
24 77 NaN
26 33 NaN
28 56 NaN
30 27 NaN
1 NaN 39
3 NaN 49
5 NaN 55
7 NaN 77



Reshaping and pivoting
• Stacking: stack rotates data from the columns into the rows.
• Unstacking: unstack rotates data from the rows into the columns.



import numpy as np
import pandas as pd

data = np.arange(15).reshape((3, 5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers,
                       columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
print(dframe1)

          Bergen  Oslo  Trondheim  Stavanger  Kristiansand
Rainfall       0     1          2          3             4
Humidity       5     6          7          8             9
Wind          10    11         12         13            14



stacked = dframe1.stack()
stacked

stacked.unstack()
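
For reference, a sketch of what these calls return here:

stacked             # a Series with a MultiIndex of (row label, column label)
# Rainfall  Bergen       0
#           Oslo         1
#           Trondheim    2
# ...
stacked.unstack()   # reverses the operation, restoring the 3x5 DataFrame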



Transformation techniques
• Data deduplication
• Replacing values
• Handling missing data
• NaN values in pandas objects
• Dropping missing values
• Mathematical operations with NaN
• Filling missing values
• Backward and forward filling
• Interpolating missing values
• Renaming axis indexes
• Discretization and binning



Data deduplication
• Remove the duplicate rows from the DataFrame.

frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})

frame3.duplicated()
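
For the frame above, duplicated() returns a boolean Series flagging each row that repeats an earlier one:

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool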



Remove the duplicates
frame4 = frame3.drop_duplicates()
frame4

frame3['column 3'] = range(7)
frame5 = frame3.drop_duplicates(['column 2'])
frame5



Replacing values
• Find and Replace some values inside a dataframe

import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200, 3000, -786, 3000, 234, 444, -786, 332, 3332],
                             'column 2': range(9)})

DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None,
                  regex=False, method='pad')



replaceFrame.replace(to_replace=-786, value=np.nan)

replaceFrame.replace(to_replace=[-786, 200], value=[np.nan, 2])

to_replace accepts: str, list, dict, Series, numeric, or None



Handling missing data
• dataframe.isnull()

replaceFrame.isnull()
replaceFrame.notnull()



Counting the null values
• replaceFrame.isnull().sum()         # null count per column

• replaceFrame.isnull().sum().sum()   # total null count, here 5

• replaceFrame.count()                # counts the number of reported (non-null) values

• What is an alternate way to count the reported values?
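
One possible answer (a hint, not from the slide): notnull() is the complement of isnull(), so its sum counts the reported values.

replaceFrame.notnull().sum()   # non-null count per column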



Dropping missing values
• If we want to display column_1 without null values:

replaceFrame.column_1[replaceFrame.column_1.notnull()]
replaceFrame.column_1.dropna()

• If we want to drop all rows containing null values in any column:

replaceFrame.dropna()



Dropping by rows and column
• replaceFrame.loc[replaceFrame.column_1 < 0, 'column_1'] = np.nan
• replaceFrame.loc[replaceFrame.column_2 >= 0, 'column_2'] = np.nan

• Suppose we want to drop the rows in which all values are NaN:

• replaceFrame.dropna(how='all', axis=0)   # rows 2 and 6 dropped



• Suppose we want to drop the columns in which all values are NaN:

• replaceFrame.dropna(how='all', axis=1)   # column_2 dropped



Mathematical operations with NaN
• replaceFrame.mean()

• replaceFrame.sum()



Filling missing values
fillnaDF = replaceFrame.fillna(0)
fillnaDF

fillnaDF.mean()



import numpy as np
import pandas as pd

x = np.random.randint(100, size=7)
ind = ['apple', 'banana', 'kiwi', 'grapes', 'mango', 'pineapple', 'guava']
dfx = pd.DataFrame(x, index=ind, columns=['store1'])

dfx.loc[dfx.store1 < 50, 'store1'] = np.nan   # treat low values as missing
print(dfx)



Backward and forward filling
• dfx["store1"].fillna(method='ffill',inplace = True)
• print(dfx)

• fillna(method=‘bfill)



Linear Interpolation
• ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
• ser3.interpolate()
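
Linear interpolation spaces the filled values evenly between the known endpoints, so ser3.interpolate() returns:

0    100.0
1    148.0
2    196.0
3    244.0
4    292.0
dtype: float64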



Renaming axis indexes

dframe1.index = ['Rain','Moisture','Breeze']



Outlier detection
• Outliers are data points that are far from other data points
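
A common detection rule (not spelled out on this slide) flags points beyond 1.5 * IQR from the quartiles; a minimal pandas sketch with made-up values:

import pandas as pd

s = pd.Series([12, 13, 12, 14, 15, 13, 98])   # 98 is the obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)   # first and third quartiles
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)                               # -> index 6, value 98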



Grouping Datasets
• It is often essential to cluster or group data together based on certain criteria.
• For example, an e-commerce store might want to group all the sales made at various stores.

Understanding groupby()
Groupby mechanics
Data aggregation
Pivot tables and cross-tabulations



Groupby()
• We can group the city dwellers into different gender groups and calculate their mean
weight



groupby()
• Categorizing a dataset into multiple categories or groups is often essential.

• Pandas groupby is used for grouping the data according to categories and applying a function to each category.

• Any groupby operation involves one of the following operations on the original
object. They are −
• Splitting the Object

• Applying a function

• Combining the results


Let's create a dataset:

import random
import pandas as pd

mylist = ['M', 'F']
myheight = random.sample(range(150, 185), 20)
mygender = random.choices(mylist, k=20)

data = {'Gender': mygender, 'Height': myheight}
df = pd.DataFrame(data)
df

    Gender  Height
0   M       160
1   F       181
2   M       151
3   F       156
4   M       157
5   M       155
6   F       163
7   F       172
8   M       173
9   F       179
10  F       162
• How many unique values does the Gender column have?

print(df.groupby('Gender').groups.keys())

• To find the count of each category:

print(df.groupby('Gender').count())



Splitting the data
style = df.groupby('Gender')

male = style.get_group("M")
print(male)

female = style.get_group("F")
print(female)



Applying the operation
• Average height in each category:

m_avg = male['Height'].mean()

f_avg = female['Height'].mean()

print(f_avg, m_avg)



Combine the results of Average
df_output = pd.DataFrame({'Gender':['F','M'],'Height':[f_avg,m_avg]})
df_output
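
Note that groupby can collapse the split-apply-combine steps into a single call; given the same df:

df.groupby('Gender')['Height'].mean()   # mean height per gender, as a Series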



Selecting a subset of columns
Let's create a dataset:

import random

mylist = ['M', 'F']
sub = ['EDA', 'ML', 'DBMS']
myscore = random.sample(range(0, 100), 20)
mygender = random.choices(mylist, k=20)
mysub = random.choices(sub, k=20)
data = {'Gender': mygender, 'Score': myscore, 'Subject': mysub}
df = pd.DataFrame(data)
df



double_grouping = df.groupby(["Gender","Subject"])
double_grouping.mean()

double_grouping.max()

double_grouping.min()



• How many people attended each course, gender-wise?

double_grouping['Subject'].count()



Data aggregation
• Aggregation is the process of implementing any mathematical operation on a dataset
or a subset of it.

• The Dataframe.aggregate() function is used to apply aggregation across one or more columns.

• Some of the most frequently used aggregations are as follows:


• sum: Returns the sum of the values for the requested axis
• min: Returns the minimum of the values for the requested axis
• max: Returns the maximum of the values for the requested axis



Let's create a dataset:

import random

myscore1 = random.sample(range(0, 100), 20)
myscore2 = random.sample(range(0, 100), 20)
myscore3 = random.sample(range(0, 100), 20)

data = {'DBMS': myscore1, 'AI': myscore2, 'EDA': myscore3}
df = pd.DataFrame(data)
df



• We can apply aggregation on a DataFrame, df, as df.aggregate() or df.agg():

df.agg('count')

df.agg('mean')

df.agg(['min', 'max', 'mean'])



Pivot tables and cross-tabulations
• Pandas offers several options for grouping and summarizing data.

• pivot_table and crosstab also provide groupby-style summaries.

• The pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.



Let's reuse the Gender/Score/Subject dataset created earlier:

mylist = ['M', 'F']
sub = ['EDA', 'ML', 'DBMS']
myscore = random.sample(range(0, 100), 20)
mygender = random.choices(mylist, k=20)
mysub = random.choices(sub, k=20)
data = {'Gender': mygender, 'Score': myscore, 'Subject': mysub}
df = pd.DataFrame(data)
df



table = pd.pivot_table(data=df, index=['Subject'])
table

table = pd.pivot_table(data=df, index=['Gender', 'Subject'])
table

By default, pivot_table aggregates with the mean.
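
The aggregation function can be changed with the aggfunc parameter; for example, using the same df:

table = pd.pivot_table(data=df, index=['Gender', 'Subject'], values='Score', aggfunc='max')
table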



Cross-tabulations
• Also known as contingency tables or cross tabs, cross-tabulation groups variables to understand the correlation between them.

• pd.crosstab(df["Gender"], df["Subject"])

• pd.crosstab(df["Gender"], df["Subject"], values=df["Score"], aggfunc='mean')

