Balaji Capstone Project 1
A PROJECT REPORT
Submitted by
Outlier treatment: -
This dataset is a mix of continuous and categorical variables.
It does not make sense to perform outlier treatment on a
categorical variable, as each category denotes a type of customer. So,
we perform outlier treatment only on the variables that are continuous
in nature.
• Used box plots to determine the presence of outliers in a variable.
• The dots outside the upper or lower limit (the whiskers) of the
box plot represent the outliers in the variable.
• We have 8 continuous variables in the dataset, namely “Tenure”,
“CC_Contacted_LY”, “Account_user_count”, “cashback”,
“rev_per_month”, “Day_Since_CC_connect”,
“coupon_used_for_payment” and “rev_growth_yoy”.
• We used the upper and lower limits to treat the outliers.
Below is the pictorial representation of the variables before and
after outlier treatment.
Before After
Fig 7: - Before and after outlier treatment
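The limit-based treatment described above can be sketched as follows. This is only an illustration: the report does not state how the limits were computed or whether outlying values were dropped or capped, so the sketch assumes the conventional box-plot limits (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) and assumes capping. The helper names iqr_limits and cap_outliers and the toy values are hypothetical.

```python
import pandas as pd


def iqr_limits(series, k=1.5):
    """Return (lower, upper) limits as drawn by a standard box plot.

    Assumption: limits are Q1 - k*IQR and Q3 + k*IQR with k = 1.5.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(df, columns):
    """Cap values outside the limits to the nearest limit (assumed treatment)."""
    out = df.copy()
    for col in columns:
        lower, upper = iqr_limits(out[col])
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out


# Toy data, not taken from the report's dataset.
toy = pd.DataFrame({"cashback": [100.0, 110.0, 120.0, 130.0, 140.0, 1000.0]})
treated = cap_outliers(toy, ["cashback"])
print(treated["cashback"].tolist())
```

Capping keeps every row, which matters when the same rows are needed for the other variables; dropping rows instead would shrink the dataset once per treated column.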
Missing Value treatment and variable transformation: -
• Out of 19 variables, data anomalies are present in 17
variables and null values in 15 variables.
• “Median” is used to impute null values where the variable is
continuous, because the median is less sensitive to outliers than
the mean.
• “Mode” is used to impute null values where the variable is
categorical.
• We have treated null values variable by variable, as each variable
is unique.
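A minimal sketch of the imputation rule above, assuming the data sits in a pandas DataFrame. The function name impute_nulls, the column "customer_type" and the toy values are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def impute_nulls(df, continuous_cols, categorical_cols):
    """Fill nulls with the median (continuous) or the mode (categorical)."""
    out = df.copy()
    for col in continuous_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        # mode() may return several values on ties; the first one is used here.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data: 5000.0 is an extreme value that pulls the mean but not the median.
toy = pd.DataFrame({
    "cashback": [100.0, np.nan, 120.0, 5000.0],
    "customer_type": ["A", "B", None, "A"],  # hypothetical categorical column
})
imputed = impute_nulls(toy, ["cashback"], ["customer_type"])
print(imputed)
```

In the toy column the mean of the observed values is 1740 while the median is 120, which shows why the median is the safer fill value when outliers are present.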
Treating Variable “Tenure”
• Looking at the unique observations in the variable, we see that
“#” and “nan” are present in the data.
• Here “#” is an anomaly and “nan” represents a null value.
Fig 8: - before treatment
• Replaced “#” with “nan” and then replaced “nan” with the
calculated median of the variable; we no longer see any bad data or
null values.
• Converted the data type to integer, because the variable had been
read in as an object data type due to the presence of bad data.
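The steps for “Tenure” can be sketched as below, assuming a pandas DataFrame in which the column was read in as strings. The toy values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with one anomaly ("#") and one null value.
toy = pd.DataFrame({"Tenure": ["4", "0", "#", np.nan, "10", "2"]})

# Step 1: inspect the unique observations to spot bad data.
print(toy["Tenure"].unique())

# Step 2: turn the anomaly into a null value, then make the column numeric.
tenure = toy["Tenure"].replace("#", np.nan).astype(float)

# Step 3: impute nulls with the median of the remaining values.
tenure = tenure.fillna(tenure.median())

# Step 4: convert to integer now that no bad data or nulls remain.
toy["Tenure"] = tenure.astype(int)
print(toy["Tenure"].tolist())
```

The integer conversion has to come last, because a column that still contains null values cannot be cast to a plain integer type.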
Treating Variable “City_Tier”
• We look at the unique observations in the variable and the
presence of null values, as shown below.
• Replaced “$”, “*” and “#” with “nan”, and then replaced
“nan” with the calculated median of the variable; we no longer
see any bad data or null values.
• Then converted the variable to integer data type, as it will be
used for further model building.
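When several bad symbols occur in one variable, the same treatment can be wrapped in a small helper. This is a sketch under assumptions: the function name clean_numeric_column and the toy values are hypothetical, and the median is used as the fill value only because that is what the text above describes.

```python
import numpy as np
import pandas as pd


def clean_numeric_column(series, bad_tokens):
    """Blank out bad tokens, impute with the median, and return integers."""
    # mask() replaces every entry that matches a bad token with NaN.
    numeric = series.mask(series.isin(bad_tokens)).astype(float)
    return numeric.fillna(numeric.median()).astype(int)


toy = pd.DataFrame(
    {"City_Tier": ["1", "3", "$", "1", "*", "#", np.nan, "2", "1"]}
)
toy["City_Tier"] = clean_numeric_column(toy["City_Tier"], ["$", "*", "#"])
print(toy["City_Tier"].tolist())
```

Passing the bad symbols as a list keeps the helper reusable for other variables that contain a different set of anomalies.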
Treating Variable “Day_Since_CC_connect”
• Looking at the unique observations in the variable, we see the
presence of “$”, which denotes bad data, and also the presence of
null values.
• Replaced “$” with “nan”, and then replaced “nan” with the
calculated median of the variable; we no longer see any
bad data or null values.
• Then converted the variable to integer data type, as it will be
used for further model building.
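As an alternative sketch (not the method stated in the report), pandas can coerce every non-numeric entry to a null value in one step with pd.to_numeric and errors="coerce", which avoids listing each bad symbol by hand. The toy values are hypothetical.

```python
import pandas as pd

# Toy column with one bad symbol ("$") and one null value.
toy = pd.DataFrame({"Day_Since_CC_connect": ["5", "0", "$", None, "3", "8"]})

# Anything that cannot be parsed as a number becomes NaN.
days = pd.to_numeric(toy["Day_Since_CC_connect"], errors="coerce")

# Count how many entries need imputing (bad data plus original nulls).
n_missing = int(days.isna().sum())

# Impute with the median, then convert to integer.
toy["Day_Since_CC_connect"] = days.fillna(days.median()).astype(int)
print(n_missing, toy["Day_Since_CC_connect"].tolist())
```

Because coercion silently hides any unexpected token, it is still worth inspecting the unique observations first, as done for the variables above.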
Before After