Predictive Modelling Report

Uploaded by

akshaypankar907
AKSHAY PANKAR PREDICTIVE MODELLING REPORT

PREDICTIVE MODELLING REPORT

REGARDS,
AKSHAY PANKAR

GREAT LEARNING 1

Problem 1: Linear Regression

The comp-active database is a collection of computer systems activity measures.

The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running
in a multi-user university department. Users would typically be doing a large variety of tasks,
ranging from accessing the internet and editing files to running very CPU-bound programs.
As a budding data scientist, you want to find a linear equation to build a model
that predicts 'usr' (the portion of time (%) that CPUs run in user mode) and to find out how each
system attribute affects the time the system spends in 'usr' mode.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data
types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate
Analysis.

A. Most of the columns in the data are numeric ('int64' or 'float64' type). The
runqsz and user name columns are string columns ('object' type). We will be dropping
the 'runqsz' column for prediction purposes.
B. Replace the missing values with the median values of the columns. The column names
do not need to be specified: each column's missing values are replaced
with that column's own median.
C. Univariate Analysis

FIG 1 - Histplot of lread for univariate analysis

FIG 2 - Boxplot of lwrite with outliers for univariate analysis

FIG 3 - Plot of Scall with outliers for univariate analysis

D. Bivariate analysis among the different variables can be done using a scatter matrix plot.
The Seaborn library creates a dashboard reflecting useful information about the dimensions.

Output

<seaborn.axisgrid.PairGrid at 0x1aab1315e50>

The image in the Python file cannot be cropped.

E. Multivariate Analysis

FIG 4 – Heat Map for Multivariate analysis
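
The checks listed in 1.1 (data types, shape, five-point summary, and the correlations behind a heat map) could be sketched as follows. This is a minimal sketch on a synthetic frame, not the report's actual code: the column names `lread`, `lwrite`, `scall`, `runqsz` and `usr` come from the report, but every value, including the `runqsz` labels, is invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the comp-active data (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "lread": rng.poisson(20, n),
    "lwrite": rng.poisson(13, n),
    "scall": rng.integers(100, 5000, n),
    "runqsz": rng.choice(["CPU_Bound", "Not_CPU_Bound"], n),  # hypothetical labels
})
# Assumption for the demo: usr falls as system calls rise, plus noise.
df["usr"] = 95 - 0.01 * df["scall"] + rng.normal(0, 2, n)

print(df.shape)    # (rows, columns)
print(df.dtypes)   # integer / float / object type per column

# Five-point summary: min, 25%, 50%, 75%, max of each numeric column.
five_point = df.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)

# Pairwise Pearson correlations: the numbers a heat map would colour.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

`describe()` skips the string column by default, which is why `runqsz` does not appear in the summary or in the correlation matrix.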

1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Check for
the possibility of creating new features if required. Also check for outliers and
duplicates if there.

FIG 6 – Treating Null values.

Yes, there is a possibility of creating new features. Null values are imputed here by
filling the missing values of each column with that column's median, and the columns are
checked for values equal to zero.

FIG 7 – After Treating Null values.

The boxplot of lwrite in FIG 2 shows that outliers are present in the data set.
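
The checks asked for in 1.2 (median imputation, zero values, outliers, duplicates) could look like the sketch below. The frame is a tiny invented example that reuses the `lread` and `lwrite` column names. The 1.5 × IQR rule is the usual boxplot whisker convention and is an assumption here, since the report does not state which outlier rule was applied.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values are invented, column names follow the report.
df = pd.DataFrame({
    "lread":  [1.0, 5.0, np.nan, 7.0, 5.0, 300.0],
    "lwrite": [0.0, 2.0, 3.0, np.nan, 2.0, 4.0],
})

# Median imputation: DataFrame.median() returns one median per column and
# fillna() matches them up by column name, so no columns need to be listed.
filled = df.fillna(df.median(numeric_only=True))

# Zeros may be valid measurements (e.g. no writes) rather than missing data,
# so they are only counted here, not replaced.
zero_counts = (filled == 0).sum()

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
outliers = (filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)

# Rows that repeat an earlier row exactly.
n_duplicates = int(filled.duplicated().sum())

print("nulls left:", int(filled.isna().sum().sum()))
print("zeros per column:", zero_counts.to_dict())
print("outliers per column:", outliers.sum().to_dict())
print("duplicate rows:", n_duplicates)
```

Whether a zero is meaningful depends on the attribute, so counting zeros first and deciding per column is safer than replacing them wholesale.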

1.3 Encode the data (having string values) for Modelling. Split the data into train and
test (70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from statsmodel. Create multiple models and check
the performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Compare these models and select the best one with appropriate reasoning.

• Comparing the two regression results, both models have similar R-squared values:
0.598 for the first model and 0.602 for the second. Both models therefore explain
approximately the same amount of variability in the dependent variable (usr)
using the independent variables included.

• However, the second model has more independent variables (20) than the
first model, whose number of independent variables is not specified. The second
model also has a lower F-statistic (431.3) than the first model (608.4). Since the
additional variables raise R-squared by only 0.004, the first, simpler model may be
the better choice for the data.
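
A comparison of this kind could be sketched as below: a 70:30 split, two scikit-learn linear models, and R-squared, adjusted R-squared and RMSE on both splits. The data is synthetic and the two feature sets are arbitrary, so the printed numbers are unrelated to the 0.598 and 0.602 reported above. The significance checks from statsmodels are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the system attributes and 'usr'.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 80 + X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(0, 2, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

def evaluate(model, X_part, y_part):
    """Return (R-squared, adjusted R-squared, RMSE) for one data split."""
    pred = model.predict(X_part)
    n, p = X_part.shape
    r2 = r2_score(y_part, pred)
    # Adjusted R-squared penalises each extra predictor.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    return r2, adj_r2, rmse

# Model "full" uses every column; model "reduced" drops the two noise columns.
keep = [0, 1, 3]
full = LinearRegression().fit(X_train, y_train)
reduced = LinearRegression().fit(X_train[:, keep], y_train)

for name, model, cols in [("full", full, slice(None)), ("reduced", reduced, keep)]:
    for split, Xp, yp in [("train", X_train, y_train), ("test", X_test, y_test)]:
        r2, adj, rmse = evaluate(model, Xp[:, cols], yp)
        print(f"{name:8s}{split:6s} R2={r2:.3f} adjR2={adj:.3f} RMSE={rmse:.3f}")
```

Adjusted R-squared is the fairer yardstick when the two models differ in the number of predictors, because plain R-squared on the training data can only rise as variables are added.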

1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.

Overall, the business insights and recommendations from these regression analyses
would likely depend on the specific independent variables included in the models and
the specific goals of the analysis. Further analysis and interpretation of the results
would be necessary to provide more specific insights and recommendations.

Step 1 :- Load all libraries, then load the Excel file.

Step 2 :- EDA and variate analysis (univariate, bivariate and multivariate).
Step 3 :- Scaling of the data.
Step 4 :- Regression.
Step 5 :- Train-test procedure and comparison of models 1 and 2 to draw
conclusions.
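
The scaling in Step 3 could be sketched with scikit-learn's `StandardScaler`. The report does not say which scaler was used or whether scaling happened before or after the split, so fitting the scaler on the training part only is an assumption of this sketch, and the two feature columns are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales (values are invented).
X_train = np.column_stack([rng.normal(20, 5, 100), rng.normal(2300, 400, 100)])
X_test = np.column_stack([rng.normal(20, 5, 40), rng.normal(2300, 400, 40)])

# Learn each column's mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
# Reuse the training statistics so no test information leaks into the model.
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
```

After scaling, regression coefficients are expressed per standard deviation of each attribute, which makes their magnitudes comparable across attributes.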

Problem 2: Logistic Regression, LDA and CART


You are a statistician at the Republic of Indonesia Ministry of Health and you are provided
with data on 1473 women collected from a Contraceptive Prevalence Survey. The samples
are married women who were either not pregnant or did not know if they were at the time
of the survey.

The problem is to predict whether or not they use a contraceptive method of choice based on
their demographic and socio-economic characteristics.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, check for duplicates and outliers and write an inference on it. Perform Univariate and
Bivariate Analysis and Multivariate Analysis.

A. The columns in the data are of 'object' and 'float64' type, plus one 'int64'
column.
B. The null values were checked and treated.

FIG 7 – After Treating Null values

C. Outlier check using boxplots (the outliers have to be treated).

Fig 8 – Boxplot of Children born
Fig 9 – Boxplot of Husband Education

D. From the univariate analysis of wife's education: 150 women are at the minimum
education level, 330 at the 25 percent level, 398 at the 50 percent level, and 515 at the
75 percent and above level.
E. From the univariate analysis of husband's education: 44 men are at the minimum
education level, 175 at the 25 percent level, 347 at the 50 percent level, and 827 at the
75 percent and above level.
F. Bivariate Analysis :

Fig 10 – count plot

From the above graph we can conclude that as the use of contraceptives increases, the
birth rate decreases.

Fig 11 – count plot

From the above graph we can conclude that as the wife's education level increases, the use
of contraceptives increases.

G. Multivariate Analysis

Fig 12 – Heat Map for multivariate analysis.
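
The counts behind a count plot such as Fig 11 could be tabulated with `pd.crosstab`, as in the sketch below. The column names `Wife_education` and `Contraceptive_method_used` and all eight rows are hypothetical stand-ins, not values from the survey.

```python
import pandas as pd

# Invented rows; the column names are hypothetical stand-ins.
df = pd.DataFrame({
    "Wife_education": ["Primary", "Primary", "Secondary", "Tertiary",
                       "Tertiary", "Tertiary", "Secondary", "Primary"],
    "Contraceptive_method_used": ["No", "No", "Yes", "Yes",
                                  "Yes", "No", "No", "Yes"],
})

# Raw counts per (education level, usage) pair: the bar heights of a count plot.
counts = pd.crosstab(df["Wife_education"], df["Contraceptive_method_used"])

# Row-wise proportions make education levels with different sizes comparable.
shares = pd.crosstab(df["Wife_education"], df["Contraceptive_method_used"],
                     normalize="index")

print(counts)
print(shares.round(2))
```

Comparing the row-wise shares rather than the raw counts avoids mistaking a large education group for a high usage rate.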

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis) and CART.

Logistic Regression Output :-

Fig 13 - Logistic Regression.

LDA Output

Fig 14 - Linear Discriminant Analysis.

CART Output

Fig 15 - CART.
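
The workflow in 2.2 (encode string columns, split 70:30 without scaling, fit Logistic Regression, LDA and CART) could be sketched as follows. The data is synthetic, the column names are illustrative, and the model settings (`max_iter=1000`, an unrestricted tree) are assumptions, since the report does not list the hyperparameters it used.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic survey-like data; names and values are illustrative only.
rng = np.random.default_rng(7)
n = 600
df = pd.DataFrame({
    "Wife_age": rng.integers(16, 50, n).astype(float),
    "No_of_children_born": rng.integers(0, 9, n).astype(float),
    "Wife_education": rng.choice(["Primary", "Secondary", "Tertiary"], n),
})
# Demo assumption: usage probability rises with the number of children.
prob = 1 / (1 + np.exp(-(0.5 * df["No_of_children_born"] - 2.0)))
df["used"] = (rng.random(n) < prob).astype(int)

# Encode string columns as 0/1 dummy columns; no scaling is applied.
X = pd.get_dummies(df.drop(columns="used"), drop_first=True).astype(float)
y = df["used"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=1),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
        "roc_auc": roc_auc_score(y_test, proba),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Restricting the tree, for example with `max_depth` or `min_samples_leaf`, is the usual way to narrow a large gap between train and test accuracy for CART.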

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final
Model: Compare Both the models and write inference which model is best/optimized .

Logistic Regression: Accuracy = 0.6507177033492823, Precision = 0.6489510489510489, Recall = 0.6355996944232238, F1 score = 0.6345191040843214
LDA: Accuracy = 0.6483253588516746, Precision = 0.6469681397738951, Recall = 0.6324166030048383, F1 score = 0.6308285719435482
CART: Accuracy = 0.65311004784689, Precision = 0.6491240266963292, Recall = 0.6489686783804431, F1 score = 0.6490425538074917

Fig 16 - ROC for LR.

Fig 17 - ROC for LDA .

Fig 18 - ROC for CART.


Logistic Regression:
Accuracy (train): 0.6687179487179488
Accuracy (test): 0.65311004784689
Confusion Matrix:
[[ 92 95]
[ 50 181]]
ROC AUC Score: 0.6841678820288445

LDA:
Accuracy (train): 0.6707692307692308
Accuracy (test): 0.6483253588516746
Confusion Matrix:
[[ 90 97]
[ 50 181]]
ROC AUC Score: 0.6850707225038777

CART:
Accuracy (train): 0.9794871794871794
Accuracy (test): 0.65311004784689
Confusion Matrix:
[[114 73]
[ 72 159]]
ROC AUC Score: 0.6487487557006274

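In all three models the train accuracy is higher than the test accuracy. The gap is small
for Logistic Regression and LDA (about 0.02) but large for CART (0.979 on train versus 0.653
on test), which indicates that the CART model overfits the training data.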
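
The accuracy, precision, recall and F1 figures can be derived directly from a confusion matrix. The sketch below does this for the LDA test matrix reported above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes) and assuming the report's precision, recall and F1 are unweighted (macro) averages over the two classes. The report states neither, but the reported LDA values are consistent with both assumptions.

```python
import numpy as np

# LDA test confusion matrix from the report.
# Assumed layout: rows = actual class, columns = predicted class.
cm = np.array([[90, 97],
               [50, 181]])

# Accuracy: correctly classified cases over all cases.
accuracy = np.trace(cm) / cm.sum()

# Per-class precision: correct predictions / everything predicted as that class.
precision_per_class = np.diag(cm) / cm.sum(axis=0)
# Per-class recall: correct predictions / everything actually in that class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

# Unweighted (macro) averages over the two classes.
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()
macro_f1 = f1_per_class.mean()

print(f"accuracy={accuracy:.4f} precision={macro_precision:.4f} "
      f"recall={macro_recall:.4f} f1={macro_f1:.4f}")
```

The per-class recall values also show what the single accuracy figure hides: under the assumed layout, the model recovers class 1 far more often than class 0.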

2.4 Inference: Basis on these predictions, what are the insights and recommendations .

Overall, the business insights and recommendations from such analyses would likely
depend on the specific independent variables included in the models and the specific
goals of the analysis. Further analysis and interpretation of the results would be
necessary to provide more specific insights and recommendations.

Step 1 :- Load all libraries, then load the Excel file.

Step 2 :- EDA and variate analysis (univariate, bivariate and multivariate).
Step 3 :- Scaling of the data.
Step 4 :- Data encoding, train-test split, Logistic Regression, LDA and CART.
Step 5 :- Train-test procedure and comparison of the models to draw
conclusions from the performance metrics (train-test split, confusion matrix, AUC-ROC
curve).
