PROJECT ANALYSIS: PREDICTIVE MODELING
Praveen Kumar Tiwari
PGP-DSBA Online
February 2022
Date: 05/02/2022
Table of Contents
Problem 1 – Linear Regression
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
1.2) Impute null values if present? Do you think scaling is necessary in this case?
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE.
1.4) Inference: Based on these predictions, what are the business insights and recommendations.
Problem 2 – Logistic Regression and Linear Discriminant Analysis
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it? Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models and write inferences, which model is best/optimized.
2.4) Inference: Based on these predictions, what are the insights and recommendations.
Problem 1: Linear Regression
Problem Statement:
You are part of an investment firm, and your job is to research these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest consciously. Also, provide the 5 attributes that are most important.
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Summary – This business report provides a detailed explanation of the approach to each problem given in the assignment, along with the relevant information used to solve it.
Data Description:
“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it
is of no use in the model.
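A minimal sketch of this loading-and-dropping step; the file name Firm_level_data.csv is an assumption for illustration:

import pandas as pd

# Load the firm-level dataset (file name assumed, not from the report)
df = pd.read_csv("Firm_level_data.csv")

# Drop the redundant index column
df = df.drop("Unnamed: 0", axis=1)
print(df.shape)   # 759 rows, per the problem statement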
Descriptive Statistics of the Dataset:
The above table gives information such as unique values, mean, median, standard deviation, five-point summary, min-max, count, etc. for all the variables present in the dataset.
Observation:
We can see that the column "sp500" has been converted from the object data type into a dummy variable, a step that is mandatory before proceeding with the modelling process.
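A sketch of this conversion, assuming the raw column holds yes/no strings; drop_first keeps the single sp500_yes dummy seen in the coefficient table later in this report:

# One-hot encode the object column 'sp500', keeping one dummy column
df = pd.get_dummies(df, columns=["sp500"], drop_first=True)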
Observation:
We can see that the feature "tobinq" has 21 null values; we will impute these null values with the mean in further steps.
Observation:
After imputation, we can see that no null values remain in the dataset.
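A minimal sketch of the mean imputation described above:

# Fill the 21 missing 'tobinq' values with the column mean
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())
print(df.isnull().sum())   # verify that no nulls remain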
Correlation Plot:
Observation:
From this plot we can see that features like “capital”, “randd”, “employment” and “value” have a relatively high correlation with the target variable “sales”. These may be the features with higher importance than the others when predicting the sales of firms. We will see after modelling whether this holds.
Outlier Checks:
Univariate Analysis:
Histogram Plot
Observation:
From the above histogram plots we can clearly see that none of the features follows a normal distribution curve; most of the features are highly skewed.
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
about its mean.
All variables are positively skewed.
Bivariate Analysis:
Get the Correlation Heatmap:
1.2) Impute null values if present? Do you think scaling is necessary in this case?
Scaling brings data that spans widely different ranges onto a similar relative scale, which helps the model perform at its best. We generally perform feature scaling when dealing with gradient-descent-based algorithms such as linear and logistic regression, as these are very sensitive to the range of the data points. In addition, scaling is very useful when checking for and reducing multicollinearity in the data. VIF (Variance Inflation Factor) is a value that indicates the presence of multicollinearity.
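As a sketch of how VIF can be checked with statsmodels; the threshold mentioned in the comment is a common rule of thumb, not a figure from this report:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor; values above ~5 are a common multicollinearity flag
X = df.drop("sales", axis=1)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))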
We have already imputed the null values for the "tobinq" feature in the steps above.
Scaling – Purpose: In this method, we convert variables with different scales of measurement into a single scale.
Since the attributes in the given dataset are measured on widely different scales, we should scale our data in this case. Accordingly, we have scaled the dataset after treating the outliers and encoding the categorical data. StandardScaler standardizes the data using the formula (x - mean) / standard deviation.
Applying Z-score method:
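A minimal sketch of the z-score scaling with sklearn's StandardScaler, as named above:

from sklearn.preprocessing import StandardScaler

# Standardize every column: (x - mean) / standard deviation
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)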
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train
and Test sets using R-square, RMSE.
1. Train-Test Split:
2. Copy all the predictor variables into the X data frame.
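A sketch of the split and fit described in these steps; the random_state value is an illustrative choice:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Predictors and target from the scaled data
X = df_scaled.drop("sales", axis=1)
y = df_scaled["sales"]

# 70:30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Fit ordinary least squares linear regression
model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Train R-square:", model.score(X_train, y_train))
print("Test R-square:", model.score(X_test, y_test))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, model.predict(X_test))))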
Let us check the intercept for the model:
The intercept for our model is 0.0031473199509150646.
R-square on training data: 0.9358806629736066
Regression coefficients (on the scaled data):

Intercept       0.003147
capital         0.254899
patents        -0.030257
randd           0.053242
employment      0.420876
tobinq         -0.044965
value           0.280758
institutions    0.003085
sp500_yes       0.049137
Summary:
Scatter Plot:
1.4) Inference: Based on these predictions, what are the business insights and recommendations.
Conclusion:
sales = (0.0) * Intercept + (0.25) * capital + (-0.03) * patents + (0.05) * randd + (0.42) * employment + (-0.04) * tobinq + (0.28) * value + (0.0) * institutions + (0.05) * sp500_yes
When capital increases by 1 unit, sales increase by 0.25 units, keeping all other predictors constant.
When employment increases by 1 unit, sales increase by 0.42 units, keeping all other predictors constant.
When value increases by 1 unit, sales increase by 0.28 units, keeping all other predictors constant.
As per the model, the most important attributes for predicting sales are 'capital', 'employment', 'value' and 'randd'. With this, our assumption from the correlation heatmap is confirmed.
There are also some negative coefficients, for instance (-0.04) for 'tobinq' and (-0.03) for 'patents'. This implies that these features are inversely related to sales.
Problem 2: Logistic Regression and Linear Discriminant Analysis
Problem Statement:
You are hired by the Government to do an analysis of car crashes. You are provided details of car
crashes, among which some people survived and some didn't. You have to help the government in
predicting whether a person will survive or not on the basis of the information given in the data set so as
to provide insights that will help the government to make stronger laws for car manufacturers to ensure
safety measures. Also, find out the important factors on the basis of which you made your predictions.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it? Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed, 5: unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number, and the vehicle number. Within each year, use this to uniquely identify the vehicle.
Data Description:
“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it
is of no use in the model.
Info of the Dataset:
The above table gives information such as unique values, mean, median, standard deviation, five-point summary, min-max, count, etc. for all the variables present in the dataset.
Check for Null Values:
Observation:
We can see that the feature "injSeverity" has 77 null values; we will impute these null values with the median or mode in further steps.
Observation:
After imputing the missing values with the median, we can see that no null values remain in the dataset.
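A minimal sketch of the median imputation described above; df here is the crash-data DataFrame:

# Fill the 77 missing 'injSeverity' values with the column median
df["injSeverity"] = df["injSeverity"].fillna(df["injSeverity"].median())
print(df.isnull().sum())   # verify that no nulls remain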
Check for Outliers:
Treating the outliers:
The lower and upper ranges used for capping each feature:

Feature      Lower Range           Upper Range
weight       -415.353999997256     767.7019999951584
ageOFocc     -17.0                 87.0
yearacc      1999.5                2003.5
yearVeh      1979.0                2011.0
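A sketch of IQR-based capping consistent with the ranges above, using the standard 1.5 × IQR rule beyond the quartiles:

# Cap each numeric feature at Q1 - 1.5*IQR and Q3 + 1.5*IQR
for col in ["weight", "ageOFocc", "yearacc", "yearVeh"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)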
Checking for Correlations:
Heatmap:
Pairplot:
Get the Correlation Heatmap:
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
1. Dropping the "caseid" feature, as it is a unique identifier and not important for modelling purposes in this case.
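A sketch of the encoding, split, and model fit leading up to the prediction call below; random_state and max_iter are illustrative choices, and the encoding shown is one reasonable way to handle the object columns:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Drop the identifier and encode the remaining object columns as numeric codes
df = df.drop("caseid", axis=1)
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

# 70:30 split with 'Survived' as the target
X = df.drop("Survived", axis=1)
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Fit logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)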
ytest_predict = model.predict(X_test)
Getting the Predicted Classes and Probabilities:
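A minimal sketch of retrieving the class probabilities alongside the predicted classes:

# Class probabilities on the test set; columns follow model.classes_ order
ytest_predict_prob = model.predict_proba(X_test)
print(ytest_predict_prob[:5])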
Confusion matrix on the test data:
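A sketch of computing the test-set confusion matrix and accuracy:

from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, ytest_predict))
print("Test accuracy:", accuracy_score(y_test, ytest_predict))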
Linear Discriminant Analysis:
Observation:
Data Split: We have split the data into train and test (70:30). We have applied Logistic Regression and LDA (linear discriminant analysis) on the given dataset, taking "Survived" as the target variable.
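A minimal sketch of fitting LDA on the same split:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
ytest_predict_lda = lda_model.predict(X_test)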
2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models
and write inferences, which model is best/optimized.
AUC and ROC for the testing data
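A sketch of plotting the ROC curve and computing the AUC score on the test data; the positive-class column assumes model.classes_ puts the positive label second:

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Probability of the positive class drives the ROC curve
probs = model.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, probs))

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()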
Classification report:
Classification report:
Linear Discriminant Analysis:
AUC and ROC for the testing data
Classification report:
Classification report:
2.4) Inference: Based on these predictions, what are the insights and recommendations.
1. The model accuracy of logistic regression on the training data and the testing data is almost the same, i.e. 98%.
2. Similarly, the AUC of logistic regression for the training and testing data is also similar.
3. The other confusion-matrix metrics of logistic regression are likewise similar across train and test, so we can presume that the model is not overfitted.
4. We also applied GridSearchCV to hyper-tune our model (see the sketch after this list), after which the F1 score on both the training and test data was 98%.
5. In the case of LDA, the AUC for the training and testing data is also the same, at 97%.
6. Besides this, the other confusion-matrix metrics of the LDA model were also similar, which shows that this model is not overfitted either.
7. Overall, we can conclude that the logistic regression model is best suited for this dataset, given its higher accuracy and AUC compared with Linear Discriminant Analysis.
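A minimal sketch of the GridSearchCV hyper-tuning mentioned in point 4; the parameter grid here is illustrative, not the report's actual search space:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid; the actual grid used in the report may differ
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)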
…………………………………………………………..THE END……………………………………………………………………….