
SMDM

PROJECT
ANALYSIS
PREDICTIVE
MODELING
Praveen Kumar Tiwari
PGP-DSBA Online
February’ 22
Date: 05/02/2022

Table of Contents

Problem 1 – Linear Regression

1.1) Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.

1.2) Impute null values if present. Do you think scaling is necessary in this case?

1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear Regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE.

1.4) Inference: Based on these predictions, what are the business insights and recommendations?

Problem 2 – Logistic Regression and Linear Discriminant Analysis

2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy and Confusion Matrix; plot the ROC curve and get the ROC_AUC score for each model. Compare both models and write inferences on which model is best/optimized.

2.4) Inference: Based on these predictions, what are the insights and recommendations?

Problem 1: Linear Regression

Problem Statement:
You are part of an investment firm, and your job is to research these 759 firms. You are provided with a
dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on
the basis of the details given in the dataset, so as to help your company invest consciously. Also, provide
the 5 attributes that are most important.

1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.

Summary – This business report provides a detailed explanation of the approach to each problem given in the
assignment and provides the relevant information for solving the problem.

Data Dictionary for Firm level data:


1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. S&P is a stock market index that measures the stock
performance of 500 large companies listed on stock exchanges in the United States
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's market
value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.

Data Description:

Exploratory Data Analysis:


Top 5 entries in the dataset.

“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it
is of no use in the model.
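A minimal sketch of this loading step is shown below, assuming pandas; the file name is hypothetical, as the report does not show it.

import pandas as pd

df = pd.read_csv("Firm_level_data.csv")   # hypothetical filename
df = df.drop(columns=["Unnamed: 0"])      # drop the redundant index column
print(df.head())                          # top 5 entries
print(df.shape)                           # expected (759, 9) after the drop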

Shape of the Dataset:


Number of rows: 759
Number of columns: 9

Info of the Dataset:

There are a total of 9 variables present in the dataset.

1 categorical variable: sp500.
8 numeric variables: sales, capital, patents, randd, employment, tobinq, value, institutions.

Descriptive Statistics of the Dataset:

The above table gives information such as unique values, mean, median, standard deviation, five-point
summary, min-max, count, etc. for all the variables present in the dataset.

Converting categorical to dummy variables :

Observation:

The column "sp500" has been converted from the object data type into a dummy variable. This step is
required before proceeding with the modelling process.
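A minimal sketch of this encoding step, assuming pandas.get_dummies with drop_first=True (which would produce the single 0/1 column "sp500_yes" used later in the model):

df = pd.get_dummies(df, columns=["sp500"], drop_first=True)   # "sp500" becomes "sp500_yes"
print(df.dtypes)                                              # all columns are now numeric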

Check for Null Values:

Observation:

The feature "tobinq" has 21 null values; we will impute these null values with the mean in a later step.

Imputing missing values:


Missing values are imputed with the mean of the corresponding feature.

Observation:

After imputation, there are no null values present in the dataset.

Checking for duplicates:

Number of duplicate rows = 0
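A minimal sketch of the imputation and the two checks above; the exact calls used in the report are not shown, so treat this as one plausible implementation.

df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())   # mean imputation for tobinq
print(df.isnull().sum())                                  # all null counts should now be 0
print("Number of duplicate rows =", df.duplicated().sum())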

Correlation Plot:

Observation:

From this plot we can see that features such as "capital", "randd", "employment", and "value" have a
relatively high correlation with the target variable "sales". These may be the features with higher
importance than the others when predicting the sales of firms.

We will see after modelling whether this holds.
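The correlation plot could be produced with a sketch like the one below; seaborn's heatmap is assumed here, as the report does not show the plotting code.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")   # pairwise correlations
plt.show()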

Outlier Checks:

After Removing all the outliers:

Univariate Analysis:
Histogram Plot

Observation:

From the histogram plots above, we can clearly see that none of the features follows a normal distribution;
most of the features are highly skewed.

Skewness of the Dataset:

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
about its mean.
All variables are positively skewed.
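A sketch of the univariate checks above, assuming pandas plotting for the histograms and DataFrame.skew() for the skewness table:

import matplotlib.pyplot as plt

df.hist(figsize=(12, 10), bins=30)       # one histogram per numeric feature
plt.tight_layout()
plt.show()

print(df.skew(numeric_only=True).sort_values(ascending=False))   # positive values indicate right skew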

Bivariate Analysis :

Get the Correlation Heatmap:

1.2) Impute null values if present? Do you think scaling is necessary in this case?

Scaling is done so that data spanning a wide variety of ranges can be brought to a similar relative range,
which brings out the best performance of the model. We generally perform feature scaling while dealing with
gradient-descent-based algorithms such as Linear and Logistic Regression, as these are very sensitive to the
range of the data points. In addition, it is very useful when checking for and reducing multicollinearity in
the data; VIF (Variance Inflation Factor) is a value that indicates the presence of multicollinearity.
We have already imputed the null values for the "tobinq" feature in the steps above.

Scaling - Purpose: in this method, we convert variables measured on different scales onto a single common
scale.

In the given dataset, the attributes are not on comparable scales, so we should scale the data in this case.
Accordingly, we scaled the dataset after treating the outliers and converting the categorical data into
numeric form. StandardScaler standardizes the data using the formula (x - mean) / standard deviation.
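As an illustration, the VIF values mentioned above could be computed with statsmodels; the report does not show the code it used, so this is only a sketch.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = df.drop(columns=["sales"])   # predictors only, all numeric at this point
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.sort_values(ascending=False))   # values above roughly 5-10 suggest multicollinearity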

Applying Z-score method:
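A sketch of the z-score scaling, assuming scipy.stats.zscore; StandardScaler from scikit-learn implements the same (x - mean) / std transformation and could be used instead.

from scipy.stats import zscore

num_cols = df.select_dtypes(include="number").columns
df_scaled = df.copy()
df_scaled[num_cols] = df[num_cols].apply(zscore)   # (x - mean) / standard deviation
print(df_scaled.describe().round(2))               # each column now has mean ~0 and std ~1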

1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train
and Test sets using R-square, RMSE.

1. Train-Test Split:
2. Copy all the predictor variables into X data frame

3. Copy target into the y data frame.


4. Split X and y into training and test set in 70:30 ratio

Linear Regression Model:


Invoke the Linear Regression function and find the best fit model on training data
LinearRegression()
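A sketch of the split and model fit described above; the random_state is an assumption, since the report does not state the seed used.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df_scaled.drop(columns=["sales"])   # all predictor variables
y = df_scaled["sales"]                  # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)   # 70:30 split

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

for col, coef in zip(X_train.columns, regression_model.coef_):
    print("The coefficient for", col, "is", coef)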

Let us explore the coefficients for each of the independent attributes:

The coefficient for capital is 0.25489940880326073


The coefficient for patents is -0.030256859387979826
The coefficient for randd is 0.05324174844070624
The coefficient for employment is 0.4208762388661018
The coefficient for tobinq is -0.04496498746335731
The coefficient for value is 0.28075824105632624
The coefficient for institutions is 0.003084523927916951
The coefficient for sp500_yes is 0.04913692311412488

Let us check the intercept for the model:
The intercept for our model is 0.0031473199509150646
R square on training data:
0.9358806629736066

R square on testing data:


0.9241294393352394

RMSE on Training data:


0.25830810681499167
RMSE on Testing data:
0.26166630564747606
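The R-square and RMSE figures above could be reproduced with a sketch along these lines, reusing the split from the previous step.

import numpy as np
from sklearn.metrics import mean_squared_error

print("R-square (train):", regression_model.score(X_train, y_train))
print("R-square (test): ", regression_model.score(X_test, y_test))

rmse_train = np.sqrt(mean_squared_error(y_train, regression_model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, regression_model.predict(X_test)))
print("RMSE (train):", rmse_train)
print("RMSE (test): ", rmse_test)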

Linear Regression using statsmodels:


Concatenate X and y into a single dataframe:

Import the statsmodels formula API and apply it to fit the model:

Intercept 0.003147
capital 0.254899
patents -0.030257
randd 0.053242
employment 0.420876
tobinq -0.044965
value 0.280758
institutions 0.003085
sp500_yes 0.049137
dtype: float64
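A sketch of the statsmodels fit; the concatenated training frame and the formula string are assumptions consistent with the coefficient table above.

import pandas as pd
import statsmodels.formula.api as smf

data_train = pd.concat([X_train, y_train], axis=1)
formula = "sales ~ " + " + ".join(X_train.columns)
ols_model = smf.ols(formula=formula, data=data_train).fit()
print(ols_model.params)      # should match the coefficients listed above
print(ols_model.summary())   # full regression summary table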

Summary:

Scatter Plot:

1.4) Inference: Based on these predictions, what are the business insights and recommendations.

Conclusion:

The final Linear Regression equation is:

sales = (0.0) * Intercept + (0.25) * capital + (-0.03) * patents + (0.05) * randd + (0.42) * employment + (-0.04) * tobinq + (0.28) * value + (0.0) * institutions + (0.05) * sp500_yes

When capital increases by 1 unit, sales increase by 0.25 units, keeping all other predictors constant.

When employment increases by 1 unit, sales increase by 0.42 units, keeping all other predictors constant.

When value increases by 1 unit, sales increase by 0.28 units, keeping all other predictors constant.

As per the model, the most important attributes for predicting sales are 'capital', 'employment', 'value',
and 'randd'. This confirms the assumption we drew from the correlation heatmap.

There are also some negative coefficient values, for instance -0.04 for 'tobinq' and -0.03 for 'patents'.
This implies that these attributes are inversely related to sales.

Problem 2: Logistic Regression and Linear Discriminant Analysis

Problem Statement:
You are hired by the Government to do an analysis of car crashes. You are provided details of car
crashes, in which some people survived and some did not. You have to help the government predict whether a
person will survive or not on the basis of the information given in the data set, so as to provide insights
that will help the government make stronger laws for car manufacturers to ensure safety measures. Also, find
out the important factors on the basis of which you made your predictions.

2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it? Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

Summary – This business report provides a detailed explanation of the approach to each problem given in the
assignment and provides the relevant information for solving the problem.

Data Dictionary for Car_Crash

1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed; 5:
unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number,
and the vehicle number. Within each year, use this to uniquely identify the vehicle.

Data Description:

Exploratory Data Analysis:


Top 5 entries in the dataset.

“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it
is of no use in the model.

Shape of the Dataset:


Number of rows: 11217
Number of columns: 15

Info of the Dataset:

There are a total of 15 variables present in the dataset.

The variables are a mix of object, int, and float data types. Most of them are categorical variables, and we
will convert them in later steps.

Descriptive Statistics of the Dataset:

The above table gives information such as unique values, mean, median, standard deviation, five-point
summary, min-max, count, etc. for all the variables present in the dataset.

Check for Null Values:

Observation:

The feature "injSeverity" has 77 null values; we will impute these null values with the median (or mode) in
a later step.

Imputing Missing Values:

Observation:

After imputing the missing values with the median, there are no null values present in the dataset.

Check for duplicate data:


Number of duplicate rows = 0
(11217, 15)

Check for Outliers:

Treating the outliers:
Lower & Upper Range
weight:-
Lower Range : -415.353999997256
Upper Range : 767.7019999951584

ageOFocc:-
Lower Range : -17.0
Upper Range : 87.0

yearacc:-
Lower Range : 1999.5
Upper Range : 2003.5

yearVeh:-
Lower Range : 1979.0
Upper Range : 2011.0
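The lower and upper ranges above follow the usual IQR rule (Q1 - 1.5*IQR, Q3 + 1.5*IQR). A sketch of computing the bounds and capping the values to them (one common treatment) is shown below, assuming the crash data is in a dataframe named cars.

def iqr_bounds(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for col in ["weight", "ageOFocc", "yearacc", "yearVeh"]:
    low, high = iqr_bounds(cars[col])
    print(col, ": Lower Range :", low, " Upper Range :", high)
    cars[col] = cars[col].clip(lower=low, upper=high)   # cap outliers at the bounds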

Checking for Correlations:

Heatmap:

Pairplot:

Get the Correlation Heatmap:

2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

1. Dropping the "caseid" feature, as it is not important for the modelling purpose in this case.

2. Converting the other 'object' type variables as dummy variables:

3. Train Test Split:


4. Copy all the predictor variables into X dataframe

5. Copy target into the y dataframe.


6. Split X and y into training and test set in 70:30 ratio

Logistic Regression Model:


LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg',
verbose=True)
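A sketch of the Problem 2 pipeline up to this point (dropping caseid, dummy-encoding, 70:30 split, logistic regression fit). The file name, the target encoding, and the random_state are assumptions; the estimator arguments mirror the output shown above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cars = pd.read_csv("Car_Crash.csv")                      # hypothetical filename
cars = cars.drop(columns=["Unnamed: 0", "caseid"])       # drop the index column and caseid
cars["Survived"] = (cars["Survived"] == "survived").astype(int)   # assumed level name for the positive class
cars = pd.get_dummies(cars, drop_first=True)             # encode the remaining object columns

X = cars.drop(columns=["Survived"])
y = cars["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)                # 70:30 split

model = LogisticRegression(max_iter=10000, n_jobs=2, penalty="none",
                           solver="newton-cg", verbose=True)
model.fit(X_train, y_train)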

Predicting on Training and Test dataset:


ytrain_predict = model.predict(X_train)

ytest_predict = model.predict(X_test)

Getting the Predicted Classes and Probs:

Applying GridSearchCV for Logistic Regression:

GridSearchCV(cv=3, estimator=LogisticRegression(max_iter=10000, n_jobs=2),


n_jobs=-1,
param_grid={'penalty': ['l2', 'none'], 'solver': ['sag', 'lbfgs'],
'tol': [0.0001, 1e-05]},
scoring='f1')

{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')
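The grid search shown above could be reproduced with a sketch like this; the parameter grid, cv, and scoring values are taken from the printed estimator.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"penalty": ["l2", "none"],
              "solver": ["sag", "lbfgs"],
              "tol": [0.0001, 1e-05]}

grid = GridSearchCV(estimator=LogisticRegression(max_iter=10000, n_jobs=2),
                    param_grid=param_grid, cv=3, n_jobs=-1, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_)            # e.g. {'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}
best_model = grid.best_estimator_   # refit on the full training data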

Getting the probabilities on the test set:

Confusion matrix on the training data:

Confusion matrix on the test data:

Linear Discriminant Analysis:

1. Build LDA Model:


2. Predicting on Training and Test dataset

3. Getting the Predicted Classes and Probs
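A sketch of the LDA fit and its predictions; default constructor arguments are assumed, since the report does not show them.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

lda_train_pred = lda_model.predict(X_train)               # predicted classes on the training set
lda_test_pred = lda_model.predict(X_test)                 # predicted classes on the test set
lda_test_probs = lda_model.predict_proba(X_test)[:, 1]    # predicted probabilities for the positive class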

Observation:

We have encoded the data (having string values) for modelling.

Data Split: We have split the data into train and test (70:30). We have applied Logistic Regression and
LDA (linear discriminant analysis) on the given dataset, taking Survived as the target variable.

2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models
and write inferences, which model is best/optimized.

Logistic Regression Model:-

Model Evaluation on Training Data:


0.9540185963571519

Model Evaluation on Testing Data


0.9539512774806892
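The evaluation figures in this section could be produced with a sketch along these lines, reusing the fitted logistic regression model and the earlier split (the same calls apply to the LDA model).

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    probs = model.predict_proba(X_part)[:, 1]
    print(name, "accuracy:", accuracy_score(y_part, pred))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
    print(name, "AUC:", roc_auc_score(y_part, probs))
    fpr, tpr, _ = roc_curve(y_part, probs)
    plt.plot(fpr, tpr, label=name + " ROC")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()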

AUC and ROC for the training data

AUC and ROC for the testing data

Confusion Matrix for the training data:


array([[ 733, 93],
[ 56, 6969]], dtype=int64)

Classification report:

Confusion Matrix for test data:


array([[ 315, 39],
[ 20, 2992]], dtype=int64)

Classification report:

Linear Discriminant Analysis:-

Model Evaluation on Training Data


0.9540185963571519

Model Evaluation on Testing Data


0.9539512774806892

AUC and ROC for the training data

AUC and ROC for the testing data

Confusion Matrix for the training data:


array([[ 733, 93],
[ 56, 6969]], dtype=int64)

Classification report:

Confusion Matrix for test data:


array([[ 315, 39],
[ 20, 2992]], dtype=int64)

Classification report:

2.4) Inference: Based on these predictions, what are the insights and recommendations.

1. The model accuracy of logistic regression on the training data and the testing data is almost
the same, approximately 95%.
2. Similarly, the AUC of logistic regression for the training data and the testing data is also similar.
3. The other parameters of the confusion matrix for logistic regression are also similar on the training
and test sets, so we can presume that the model is not overfitted.
4. We applied Grid Search CV to hyper-tune the model, and the resulting F1 score on both the training and
test data was 98%.
5. In the case of LDA, the AUC for the training and testing data is also the same, at 97%.
6. The other parameters of the confusion matrix of the LDA model are likewise similar across the training
and test sets, which shows that this model is not overfitted either.
7. Overall, we can conclude that the logistic regression model is best suited for this data set, given its
level of accuracy relative to the Linear Discriminant Analysis model.

THE END

