Unit I
Lesson Plan

Planned Hour | Description of Portion to be Covered | Relevant CO Nos | Highest Cognitive Level
1 | EDA fundamentals - Understanding data science - Significance of EDA | CO1 | K1
1 | Making sense of data | CO1 | K1
1 | Comparing EDA with classical and Bayesian analysis - Software tools for EDA | CO1 | K1
1 | Visual Aids for EDA | CO2 | K1
1 | Visual Aids for EDA | CO2 | K1
1 | Data transformation techniques - merging databases, reshaping and pivoting | CO2 | K1
1 | Transformation techniques | CO2 | K1
1 | Grouping Datasets - data aggregation | CO2 | K1
1 | Pivot tables and cross-tabulations | CO2 | K1
CO1: Understand the fundamentals of exploratory data analysis.
T1: Suresh Kumar Mukhiya, Usman Ahmed, "Hands-On Exploratory Data Analysis with Python", Packt Publishing, 2020.
• Processing such data elicits useful information, and processing that information generates useful knowledge.
• How can we generate meaningful and useful information from such data?
• Data collection: What are the ways we can collect the data? Data collected from several sources should be stored in a correct form.
• Data cleaning: incompleteness check, duplicates check, error check, and missing-value check.
• Modelling and algorithm: the cause and effect - dependent and independent variables.
• Communication: disseminating the results to end stakeholders so they can use them for business intelligence.
• Data preparation (a minimal pandas sketch follows this list):
• understand the main characteristics of the data,
• clean the dataset,
• delete non-relevant datasets,
• transform the data, and
• divide the data into required chunks for analysis.
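As a rough sketch of the preparation and cleaning steps above, here is a minimal pandas example; the dataframe and its column names are invented for illustration and do not come from the slides.

import pandas as pd
import numpy as np

# Hypothetical raw data (not from the slides), just to exercise the checks.
raw = pd.DataFrame({
    'student_id': [1, 2, 2, 3, 4],
    'score': [88, 92, 92, np.nan, 75],
})

# Incompleteness / missing-value check
print(raw.isnull().sum())        # NaN count per column

# Duplicates check
print(raw.duplicated().sum())    # number of fully duplicated rows

# Cleaning: drop duplicates, fill missing scores with the column mean
clean = raw.drop_duplicates()
clean = clean.fillna({'score': clean['score'].mean()})

# Divide the data into chunks for analysis
first_half = clean.iloc[: len(clean) // 2]
second_half = clean.iloc[len(clean) // 2 :]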
• Data analysis (a short sketch follows this list):
• summarizing the data,
• finding the hidden correlations and relationships in the data,
• developing predictive models,
• evaluating the models, and
• calculating the accuracies.
• Development and representation of the results: presenting the dataset to the target audience in the form of
• graphs,
• summary tables,
• maps and diagrams.
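A minimal sketch of the summarizing and correlation steps, reusing the hypothetical clean dataframe from the preparation sketch above:

# Summarizing the data
print(clean.describe())                 # count, mean, std, min, quartiles, max

# Finding hidden correlations among the numeric columns
print(clean.corr(numeric_only=True))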
Making sense of data
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.
• Discrete data
• This is data that is countable, and its values can be listed out.
• Continuous data
• A variable that can take an infinite number of numerical values within a specific range is classified as continuous data. Its value is obtained by measuring.
• Nominal data
• These values are used for labeling variables without any quantitative value. The scales are generally referred to as labels.
• Examples: the languages spoken in a particular country, hair color, gender, nationality, profession.
• Ratio data
• It is just like interval data in that it can be categorized and ranked, and there are equal intervals between the data points.
• Examples: income, height, weight, annual sales, market share.
• Zero means none, and it is not possible to have negative values.
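To make the distinction concrete, here is a small sketch (column names and values invented for illustration) of how these types typically surface as pandas dtypes:

import pandas as pd

people = pd.DataFrame({
    'children': [0, 2, 1],                    # discrete: countable values
    'height_cm': [172.4, 180.1, 165.0],       # continuous: obtained by measuring
    'hair_color': ['black', 'brown', 'red'],  # nominal: labels, no order
    'income': [42000, 55000, 61000],          # ratio: true zero, no negatives
})
people['hair_color'] = people['hair_color'].astype('category')
print(people.dtypes)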
• A bar chart is a statistical approach to representing given data using vertical or horizontal rectangular bars.
• This type of chart is often useful when we want to show comparison data similar to a pie chart, but also show a scale of values for context.
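A minimal matplotlib sketch of a vertical bar chart; the subject names echo later slides, and the values are invented:

import matplotlib.pyplot as plt

subjects = ['EDA', 'ML', 'DBMS']
avg_score = [72, 65, 80]            # hypothetical values

plt.bar(subjects, avg_score)        # use plt.barh(...) for horizontal bars
plt.xlabel('Subject')
plt.ylabel('Average score')
plt.title('Average score per subject')
plt.show()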
• Append / Concat
• Merge or Join
# Option 1: stack the rows of df2SE below df1SE (append/concat)
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
This is the inner join behaviour: an item is included in the new dataframe only if it exists in both dataframes. This means we will get the list of students who appear in both courses (Student ID 30 will not be in the list).
• The outer join takes the union from two or more dataframes.
• It corresponds to the FULL OUTER JOIN in SQL.
• The left join uses the keys from the left-hand dataframe only.
• It corresponds to the LEFT OUTER JOIN in SQL.
• The right join uses the keys from the right-hand dataframe only.
• It corresponds to the RIGHT OUTER JOIN in SQL.
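The slides do not show how dfSE and dfML were built; here is a plausible reconstruction (student IDs and scores invented) so the merge calls below can be run. Student ID 30 appears in only one course, matching the inner-join note above.

import pandas as pd

# Hypothetical course rosters; only the structure matters here.
df1SE = pd.DataFrame({'StudentID': [10, 11, 12], 'SE_Score': [85, 78, 91]})
df2SE = pd.DataFrame({'StudentID': [13, 14], 'SE_Score': [66, 88]})
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)

dfML = pd.DataFrame({'StudentID': [11, 12, 13, 30], 'ML_Score': [72, 90, 59, 81]})

df = dfSE.merge(dfML, how='inner')   # inner join: Student ID 30 is dropped
df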
df = dfSE.merge(dfML, how='left')    # left join: keys from dfSE only
df
df = dfSE.merge(dfML, how='right')   # right join: keys from dfML only
df
df = dfSE.merge(dfML, how='outer')   # outer join: union of the keys
df.tail(10)
print(dframe1)
stacked = dframe1.stack()    # pivot the columns into the (inner) row index
stacked.unstack()            # invert: back to the original wide layout
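dframe1 itself is not defined on these slides; a minimal reconstruction (values and column names invented, index labels taken from a later slide) that makes the stack/unstack calls runnable:

import pandas as pd

dframe1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                       index=['Rain', 'Moisture', 'Breeze'],
                       columns=['city1', 'city2'])
stacked = dframe1.stack()    # Series indexed by (row label, column label)
stacked.unstack()            # back to the original DataFrame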
frame3.duplicated()    # Boolean Series: True for rows that repeat an earlier row
import pandas as pd
import numpy as np

replaceFrame = pd.DataFrame({'column_1': [200, 3000, -786, 3000, 234, 444, -786, 332, 3332],
                             'column_2': range(9)})
# The null checks below assume some entries were first replaced with np.nan,
# e.g. replaceFrame = replaceFrame.replace(to_replace=-786, value=np.nan)
replaceFrame.isnull()     # True where a value is NaN
replaceFrame.notnull()    # True where a value is present
replaceFrame.isnull().sum().sum()    # total number of missing values; 5 on the slide
replaceFrame.column_1[replaceFrame.column_1.notnull()]   # keep only the non-null values
replaceFrame.column_1.dropna()                           # equivalent: drop NaNs from the column
If we want to drop every row that contains a null value in any column:
replaceFrame.dropna()
replaceFrame.dropna(how='all', axis=0)   # drop rows where all values are NaN
# rows 2 and 6 dropped
replaceFrame.dropna(how='all', axis=1)   # drop columns where all values are NaN
# column_2 dropped
replaceFrame.sum()    # NaN values are skipped by default (skipna=True)
fillnaDF.mean()                    # column means (NaNs skipped)
fillnaDF.fillna(method='bfill')    # backward fill: propagate the next valid value upward
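fillnaDF is likewise not shown on these slides; a small stand-in with a few NaNs (values invented) to make the two calls above runnable:

import pandas as pd
import numpy as np

fillnaDF = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                         'b': [np.nan, 5.0, 6.0]})
fillnaDF.fillna(fillnaDF.mean())    # fill each column's NaNs with that column's mean
fillnaDF.fillna(method='bfill')     # or fill with the next valid value below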
dframe1.index = ['Rain', 'Moisture', 'Breeze']   # relabel the row index
Understanding groupby()
Groupby mechanics
Data aggregation
Pivot tables and cross-tabulations
• Pandas groupby is used for grouping the data according to categories and applying a function to each category.
• Any groupby operation involves a combination of the following operations on the original object:
• Splitting the object
• Applying a function
• Combining the results
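The df used in the groupby calls below is not shown on the slides; here is a plausible stand-in with Gender and Height columns (values invented):

import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F', 'M'],
                   'Height': [172, 160, 181, 165, 176]})
style = df.groupby('Gender')    # split: one group per Gender label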
print(df.groupby('Gender').groups.keys())   # the group labels, e.g. dict_keys(['F', 'M'])
print(df.groupby('Gender').count())         # non-null count per group

male = style.get_group("M")
print(male)
female = style.get_group("F")
print(female)

m_avg = male['Height'].mean()      # mean height of the male group
f_avg = female['Height'].mean()    # mean height of the female group
print(f_avg, m_avg)
import random
import pandas as pd

mylist = ['M', 'F']
sub = ['EDA', 'ML', 'DBMS']
myscore = random.sample(range(0, 100), 20)   # 20 distinct scores
mygender = random.choices(mylist, k=20)      # 20 genders, with repetition
mysub = random.choices(sub, k=20)            # 20 subjects, with repetition
data = {'Gender': mygender, 'Score': myscore, 'Subject': mysub}
df = pd.DataFrame(data)
df
double_grouping = df.groupby(['Gender', 'Subject'])   # not shown on the slides; presumably a two-key grouping
double_grouping.max()
double_grouping.min()
double_grouping['Subject'].count()   # rows per (Gender, Subject) group
# myscore1/2/3 are not defined on these slides; presumably three more random score lists:
myscore1 = random.sample(range(0, 100), 20)
myscore2 = random.sample(range(0, 100), 20)
myscore3 = random.sample(range(0, 100), 20)
data = {'DBMS': myscore1, 'AI': myscore2, 'EDA': myscore3}
df = pd.DataFrame(data)
df
df.agg('count')                    # non-null count per column
df.agg('mean')                     # mean per column
df.agg(['min', 'max', 'mean'])     # several aggregates at once
• The pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
# Reusing the Gender/Score/Subject dataframe created above.
table = pd.pivot_table(data=df, index=['Gender', 'Subject'])   # default aggfunc: mean Score per pair
table

pd.crosstab(df["Gender"], df["Subject"])                                       # frequency counts
pd.crosstab(df["Gender"], df["Subject"], values=df["Score"], aggfunc='mean')   # mean Score per cell