
SMDM

PROJECT
ANALYSIS
PREDICTIVE
MODELING
Praveen Kumar Tiwari
PGP-DSBA Online
February’ 22
Date: 05/02/2022

Table of Contents

Problem 1 – Linear Regression

1.1) Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.

1.2) Impute null values if present. Do you think scaling is necessary in this case?

1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear Regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE.

1.4) Inference: Based on these predictions, what are the business insights and recommendations?

Problem 2 – Logistic Regression and Linear Discriminant Analysis

2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy and Confusion Matrix; plot the ROC curve and get the ROC_AUC score for each model. Compare both models and write inferences on which model is best/optimized.

2.4) Inference: Based on these predictions, what are the insights and recommendations?

Problem 1: Linear Regression

Problem Statement:
You are part of an investment firm, and your job is to research these 759 firms. You are provided with a
dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on
the basis of the details given in the dataset, so as to help your company invest consciously. Also, provide
the 5 attributes that are most important.

1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.

Summary – This business report provides a detailed explanation of the approach to each problem given in the
assignment and provides the relevant information for solving the problem.

Data Dictionary for Firm level data:


1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. S&P is a stock market index that measures the stock
performance of 500 large companies listed on stock exchanges in the United States
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's market
value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.

Data Description:

Exploratory Data Analysis:


Top 5 entries in the dataset.

“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it
is of no use in the model.
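A minimal sketch of this loading step is shown below, assuming pandas; the file name is hypothetical, as the report does not show it.

import pandas as pd

df = pd.read_csv("Firm_level_data.csv")   # hypothetical filename
df = df.drop(columns=["Unnamed: 0"])      # drop the redundant index column
print(df.head())                          # top 5 entries
print(df.shape)                           # expected (759, 9) after the drop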

Shape of the Dataset:


Number of rows: 759
Number of columns: 9

Info of the Dataset:

There are a total of 9 variables present in the dataset.

1 categorical variable: sp500.
8 numeric variables: sales, capital, patents, randd, employment, tobinq, value, institutions.

Descriptive Statistics of the Dataset:

The above table gives information such as unique values, mean, median, standard deviation, five-point
summary, min-max, count, etc. for all the variables present in the dataset.

Converting categorical to dummy variables :

Observation:

The column "sp500" has been converted from the object data type into a dummy variable. This step is
required before proceeding with the modelling process.
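A minimal sketch of this encoding step, assuming pandas.get_dummies with drop_first=True (which would produce the single 0/1 column "sp500_yes" used later in the model):

df = pd.get_dummies(df, columns=["sp500"], drop_first=True)   # "sp500" becomes "sp500_yes"
print(df.dtypes)                                              # all columns are now numeric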

Check for Null Values:

Observation:

The feature "tobinq" has 21 null values; we will impute these null values with the mean in a later step.

Imputing missing values:


Missing values are imputed with the mean of the corresponding feature.

Observation:

After imputation, there are no null values present in the dataset.

Checking for duplicates:

Number of duplicate rows = 0
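A minimal sketch of the imputation and the two checks above; the exact calls used in the report are not shown, so treat this as one plausible implementation.

df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())   # mean imputation for tobinq
print(df.isnull().sum())                                  # all null counts should now be 0
print("Number of duplicate rows =", df.duplicated().sum())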

Correlation Plot:

Observation:

From this plot we can see that features such as "capital", "randd", "employment", and "value" have a
relatively high correlation with the target variable "sales". These may be the features with higher
importance than the others when predicting the sales of firms.

We will see after modelling whether this holds.
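The correlation plot could be produced with a sketch like the one below; seaborn's heatmap is assumed here, as the report does not show the plotting code.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")   # pairwise correlations
plt.show()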

Outlier Checks:

After Removing all the outliers:

Univariate Analysis:
Histogram Plot

Observation:

From the histogram plots above, we can clearly see that none of the features follows a normal distribution;
most of the features are highly skewed.

Skewness of the Dataset:

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
about its mean.
All variables are positively skewed.
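A sketch of the univariate checks above, assuming pandas plotting for the histograms and DataFrame.skew() for the skewness table:

import matplotlib.pyplot as plt

df.hist(figsize=(12, 10), bins=30)       # one histogram per numeric feature
plt.tight_layout()
plt.show()

print(df.skew(numeric_only=True).sort_values(ascending=False))   # positive values indicate right skew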

Bivariate Analysis :

Get the Correlation Heatmap:

1.2) Impute null values if present? Do you think scaling is necessary in this case?

Scaling is done so that data spanning a wide variety of ranges can be brought to a similar relative range,
which brings out the best performance of the model. We generally perform feature scaling while dealing with
gradient-descent-based algorithms such as Linear and Logistic Regression, as these are very sensitive to the
range of the data points. In addition, it is very useful when checking for and reducing multicollinearity in
the data; VIF (Variance Inflation Factor) is a value that indicates the presence of multicollinearity.
We have already imputed the null values for the "tobinq" feature in the steps above.

Scaling - Purpose: in this method, we convert variables measured on different scales onto a single common
scale.

In the given dataset, the attributes are not on comparable scales, so we should scale the data in this case.
Accordingly, we scaled the dataset after treating the outliers and converting the categorical data into
numeric form. StandardScaler standardizes the data using the formula (x - mean) / standard deviation.
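As an illustration, the VIF values mentioned above could be computed with statsmodels; the report does not show the code it used, so this is only a sketch.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = df.drop(columns=["sales"])   # predictors only, all numeric at this point
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.sort_values(ascending=False))   # values above roughly 5-10 suggest multicollinearity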

Applying Z-score method:
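A sketch of the z-score scaling, assuming scipy.stats.zscore; StandardScaler from scikit-learn implements the same (x - mean) / std transformation and could be used instead.

from scipy.stats import zscore

num_cols = df.select_dtypes(include="number").columns
df_scaled = df.copy()
df_scaled[num_cols] = df[num_cols].apply(zscore)   # (x - mean) / standard deviation
print(df_scaled.describe().round(2))               # each column now has mean ~0 and std ~1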

1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train
and Test sets using R-square, RMSE.

1. Train-Test Split:
2. Copy all the predictor variables into X data frame

3. Copy target into the y data frame.


4. Split X and y into training and test set in 70:30 ratio

Linear Regression Model:


Invoke the Linear Regression function and find the best fit model on training data
LinearRegression()
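A sketch of the split and model fit described above; the random_state is an assumption, since the report does not state the seed used.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df_scaled.drop(columns=["sales"])   # all predictor variables
y = df_scaled["sales"]                  # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)   # 70:30 split

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

for col, coef in zip(X_train.columns, regression_model.coef_):
    print("The coefficient for", col, "is", coef)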

Let us explore the coefficients for each of the independent attributes:

The coefficient for capital is 0.25489940880326073


The coefficient for patents is -0.030256859387979826
The coefficient for randd is 0.05324174844070624
The coefficient for employment is 0.4208762388661018
The coefficient for tobinq is -0.04496498746335731
The coefficient for value is 0.28075824105632624
The coefficient for institutions is 0.003084523927916951
The coefficient for sp500_yes is 0.04913692311412488

Let us check the intercept for the model:
The intercept for our model is 0.0031473199509150646
R square on training data:
0.9358806629736066

R square on testing data:


0.9241294393352394

RMSE on Training data:


0.25830810681499167
RMSE on Testing data:
0.26166630564747606
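The R-square and RMSE figures above could be reproduced with a sketch along these lines, reusing the split from the previous step.

import numpy as np
from sklearn.metrics import mean_squared_error

print("R-square (train):", regression_model.score(X_train, y_train))
print("R-square (test): ", regression_model.score(X_test, y_test))

rmse_train = np.sqrt(mean_squared_error(y_train, regression_model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, regression_model.predict(X_test)))
print("RMSE (train):", rmse_train)
print("RMSE (test): ", rmse_test)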

Linear Regression using statsmodels:


Concatenate X and y into a single dataframe:

Import the statsmodels formula API and apply it to fit the model:

Intercept 0.003147
capital 0.254899
patents -0.030257
randd 0.053242
employment 0.420876
tobinq -0.044965
value 0.280758
institutions 0.003085
sp500_yes 0.049137
dtype: float64
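A sketch of the statsmodels fit; the concatenated training frame and the formula string are assumptions consistent with the coefficient table above.

import pandas as pd
import statsmodels.formula.api as smf

data_train = pd.concat([X_train, y_train], axis=1)
formula = "sales ~ " + " + ".join(X_train.columns)
ols_model = smf.ols(formula=formula, data=data_train).fit()
print(ols_model.params)      # should match the coefficients listed above
print(ols_model.summary())   # full regression summary table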

Summary:

Scatter Plot:

1.4) Inference: Based on these predictions, what are the business insights and recommendations.

Conclusion:

The final Linear Regression equation is:

sales = (0.0) * Intercept + (0.25) * capital + (-0.03) * patents + (0.05) * randd + (0.42) * employment + (-0.04) * tobinq + (0.28) * value + (0.0) * institutions + (0.05) * sp500_yes

When capital increases by 1 unit, sales increase by 0.25 units, keeping all other predictors constant.

When employment increases by 1 unit, sales increase by 0.42 units, keeping all other predictors constant.

When value increases by 1 unit, sales increase by 0.28 units, keeping all other predictors constant.

As per the model, the most important attributes for predicting sales are 'capital', 'employment', 'value',
and 'randd'. This confirms the assumption we drew from the correlation heatmap.

There are also some negative coefficient values, for instance -0.04 for 'tobinq' and -0.03 for 'patents'.
This implies that these attributes are inversely related to sales.

Problem 2: Logistic Regression and Linear Discriminant Analysis

Problem Statement:
You are hired by the Government to do an analysis of car crashes. You are provided details of car
crashes, in which some people survived and some did not. You have to help the government predict whether a
person will survive or not on the basis of the information given in the data set, so as to provide insights
that will help the government make stronger laws for car manufacturers to ensure safety measures. Also, find
out the important factors on the basis of which you made your predictions.

2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it? Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

Summary – This business report provides a detailed explanation of the approach to each problem given in the
assignment and provides the relevant information for solving the problem.

Data Dictionary for Car_Crash

1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed; 5:
unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number,
and the vehicle number. Within each year, use this to uniquely identify the vehicle.

Data Description:

Exploratory Data Analysis:


Top 5 entries in the dataset.

“Unnamed: 0” is a variable that simply represents the index in the data. Hence, it should be dropped as it
is of no use in the model.

Shape of the Dataset:


Number of rows: 11217
Number of columns: 15

Info of the Dataset:

There are a total of 15 variables present in the dataset.

The variables are a mix of object, int, and float data types. Most of them are categorical variables, and we
will convert them in later steps.

Descriptive Statistics of the Dataset:

The above table gives information such as unique values, mean, median, standard deviation, five-point
summary, min-max, count, etc. for all the variables present in the dataset.

Check for Null Values:

Observation:

The feature "injSeverity" has 77 null values; we will impute these null values with the median (or mode) in
a later step.

Imputing Missing Values:

Observation:

After imputing the missing values with the median, there are no null values present in the dataset.

Check for duplicate data:


Number of duplicate rows = 0
(11217, 15)

Check for Outliers:

Treating the outliers:
Lower & Upper Range
weight:-
Lower Range : -415.353999997256
Upper Range : 767.7019999951584

ageOFocc:-
Lower Range : -17.0
Upper Range : 87.0

yearacc:-
Lower Range : 1999.5
Upper Range : 2003.5

yearVeh:-
Lower Range : 1979.0
Upper Range : 2011.0
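The lower and upper ranges above follow the usual IQR rule (Q1 - 1.5*IQR, Q3 + 1.5*IQR). A sketch of computing the bounds and capping the values to them (one common treatment) is shown below, assuming the crash data is in a dataframe named cars.

def iqr_bounds(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for col in ["weight", "ageOFocc", "yearacc", "yearVeh"]:
    low, high = iqr_bounds(cars[col])
    print(col, ": Lower Range :", low, " Upper Range :", high)
    cars[col] = cars[col].clip(lower=low, upper=high)   # cap outliers at the bounds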

Checking for Correlations:

Heatmap:

Pairplot:

Get the Correlation Heatmap:

2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

1. Dropping the "caseid" feature, as it is not important for the modelling purpose in this case.

2. Converting the other 'object' type variables as dummy variables:

3. Train Test Split:


4. Copy all the predictor variables into X dataframe

5. Copy target into the y dataframe.


6. Split X and y into training and test set in 70:30 ratio

Logistic Regression Model:


LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg',
verbose=True)
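A sketch of the Problem 2 pipeline up to this point (dropping caseid, dummy-encoding, 70:30 split, logistic regression fit). The file name, the target encoding, and the random_state are assumptions; the estimator arguments mirror the output shown above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cars = pd.read_csv("Car_Crash.csv")                      # hypothetical filename
cars = cars.drop(columns=["Unnamed: 0", "caseid"])       # drop the index column and caseid
cars["Survived"] = (cars["Survived"] == "survived").astype(int)   # assumed level name for the positive class
cars = pd.get_dummies(cars, drop_first=True)             # encode the remaining object columns

X = cars.drop(columns=["Survived"])
y = cars["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)                # 70:30 split

model = LogisticRegression(max_iter=10000, n_jobs=2, penalty="none",
                           solver="newton-cg", verbose=True)
model.fit(X_train, y_train)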

Predicting on Training and Test dataset:


ytrain_predict = model.predict(X_train)

ytest_predict = model.predict(X_test)

Getting the Predicted Classes and Probs:

Applying GridSearchCV for Logistic Regression:

GridSearchCV(cv=3, estimator=LogisticRegression(max_iter=10000, n_jobs=2),


n_jobs=-1,
param_grid={'penalty': ['l2', 'none'], 'solver': ['sag', 'lbfgs'],
'tol': [0.0001, 1e-05]},
scoring='f1')

{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')
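The grid search shown above could be reproduced with a sketch like this; the parameter grid, cv, and scoring values are taken from the printed estimator.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"penalty": ["l2", "none"],
              "solver": ["sag", "lbfgs"],
              "tol": [0.0001, 1e-05]}

grid = GridSearchCV(estimator=LogisticRegression(max_iter=10000, n_jobs=2),
                    param_grid=param_grid, cv=3, n_jobs=-1, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_)            # e.g. {'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}
best_model = grid.best_estimator_   # refit on the full training data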

Getting the probabilities on the test set:

Confusion matrix on the training data:

Confusion matrix on the test data:

Linear Discriminant Analysis:

1. Build LDA Model:


2. Predicting on Training and Test dataset

3. Getting the Predicted Classes and Probs
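A sketch of the LDA fit and its predictions; default constructor arguments are assumed, since the report does not show them.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

lda_train_pred = lda_model.predict(X_train)               # predicted classes on the training set
lda_test_pred = lda_model.predict(X_test)                 # predicted classes on the test set
lda_test_probs = lda_model.predict_proba(X_test)[:, 1]    # predicted probabilities for the positive class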

Observation:

We have encoded the data (having string values) for modelling.

Data Split: We have split the data into train and test (70:30). We have applied Logistic Regression and
LDA (linear discriminant analysis) on the given dataset, taking Survived as the target variable.

2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models
and write inferences, which model is best/optimized.

Logistic Regression Model:-

Model Evaluation on Training Data:


0.9540185963571519

Model Evaluation on Testing Data


0.9539512774806892
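The evaluation figures in this section could be produced with a sketch along these lines, reusing the fitted logistic regression model and the earlier split (the same calls apply to the LDA model).

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    probs = model.predict_proba(X_part)[:, 1]
    print(name, "accuracy:", accuracy_score(y_part, pred))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
    print(name, "AUC:", roc_auc_score(y_part, probs))
    fpr, tpr, _ = roc_curve(y_part, probs)
    plt.plot(fpr, tpr, label=name + " ROC")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()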

AUC and ROC for the training data

AUC and ROC for the testing data

Confusion Matrix for the training data:


array([[ 733, 93],
[ 56, 6969]], dtype=int64)

Classification report:

Confusion Matrix for test data:


array([[ 315, 39],
[ 20, 2992]], dtype=int64)

Classification report:

Linear Discriminant Analysis:-

Model Evaluation on Training Data


0.9540185963571519

Model Evaluation on Testing Data


0.9539512774806892

AUC and ROC for the training data

AUC and ROC for the testing data

Confusion Matrix for the training data:


array([[ 733, 93],
[ 56, 6969]], dtype=int64)

Classification report:

Confusion Matrix for test data:


array([[ 315, 39],
[ 20, 2992]], dtype=int64)

Classification report:

2.4) Inference: Based on these predictions, what are the insights and recommendations.

1. The model accuracy of logistic regression on the training data and the testing data is almost
the same, approximately 95%.
2. Similarly, the AUC of logistic regression for the training data and the testing data is also similar.
3. The other parameters of the confusion matrix for logistic regression are also similar on the training
and test sets, so we can presume that the model is not overfitted.
4. We applied Grid Search CV to hyper-tune the model, and the resulting F1 score on both the training and
test data was 98%.
5. In the case of LDA, the AUC for the training and testing data is also the same, at 97%.
6. The other parameters of the confusion matrix of the LDA model are likewise similar across the training
and test sets, which shows that this model is not overfitted either.
7. Overall, we can conclude that the logistic regression model is best suited for this data set, given its
level of accuracy relative to the Linear Discriminant Analysis model.

THE END

