L3 Demo - Building A Linear Regression


Building a Linear Regression

This demonstration illustrates building, exploring, and refining a linear
regression model in SAS Visual Statistics using the VS_Bank data set.

Copyright © SAS Institute Inc. All rights reserved.


Target Variables

Categorical variables

Interval-valued variables

Demographic variables

Demonstration
• Create a new report
• Start with DATA
• Select VS_BANK

Choose data: VS_BANK
• Select the Objects pane
• Drag and drop Linear Regression onto the canvas

Assign data
Response
• Select the Roles pane
• Add tgt Interval New Sales as the response variable (you can also drag
the variable onto the Linear Regression canvas)
Continuous effect
• Add the 12 variables starting with logi_rfm as continuous effects
Classification effect

• Add category 1 Account Activity Level and category 2 Customer Value
Level as classification effects
Target and input variable roles:

Demonstration
• On the menu bar, click Menu and select Enable auto-refresh


• R² = 0.09333. The model explains less than 10% of the variability of the
response.
• The R² is low, but further model refinement may improve it
• The model uses 211,509 observations
Demonstration

Note:
• The number of observations used seems low
• The data set contains more than a million accounts
• But the data has a 20% response rate
• For the response variable used in this regression, non-responders are
coded with a missing value
• The linear regression is fitted using only the responders in the data
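
The effect of the missing-value coding on the observation count can be sketched in Python. The records and field names below are hypothetical and are not taken from VS_BANK; the sketch only shows that rows with a missing response drop out of the fit.

```python
# Hypothetical account records: non-responders have no sales amount (None).
accounts = [
    {"account_id": 1, "new_sales": 12000.0},
    {"account_id": 2, "new_sales": None},
    {"account_id": 3, "new_sales": None},
    {"account_id": 4, "new_sales": None},
    {"account_id": 5, "new_sales": None},
]

# A regression can only use rows where the response is observed.
used = [a for a in accounts if a["new_sales"] is not None]

response_rate = len(used) / len(accounts)
print(f"{len(used)} of {len(accounts)} rows used ({response_rate:.0%})")
```

With a response rate near 20%, only about one account in five contributes to the fit, which is consistent with the observation count reported by the model.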
Demonstration
• We can investigate other fit measures:
• Click R² and select Root MSE (Root Mean Square Error)
• RMSE = 8086.21
• This tells you that the typical difference between the predicted and actual
values is approximately $8,086.
• This seems imprecise and might be an artifact of how the data were collected
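
As a rough illustration of what the two fit measures summarize, here is a minimal Python sketch. The numbers are made up, and the `n_params` argument is an assumption: regression software typically divides the squared error by the error degrees of freedom rather than by the number of rows, and the source does not state which convention is used.

```python
import math

def r_squared(actual, predicted):
    """Share of the response variability explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def root_mse(actual, predicted, n_params=0):
    """Square root of the mean squared error.

    n_params is subtracted from the row count so that callers can
    divide by the error degrees of freedom if they wish (assumption).
    """
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(ss_res / (len(actual) - n_params))

# Hypothetical actual and predicted values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(round(r_squared(actual, predicted), 3))
print(round(root_mse(actual, predicted), 3))
```

Root MSE is expressed in the units of the response, which is why it can be read directly as a dollar amount.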

Demonstration
• In the Options pane, under Model Display, select General and change the
plot layout to Stack to expand the Fit Summary window on the canvas.
Fit Summary
• A default threshold of p-value < 0.05 is used for variable importance.
• Effects (or variables) with p-values < 0.05 have darker bars and are
important (significant)
• Effects with p-values > 0.05 have blue bars and are not significant
• A bar chart below shows the distribution of effects in various ranges of
p-values on the negative log10 scale
• Rfm variables (1, 2, 3, 4, 5, 8, 9 and 12) and both categorical inputs
are significant according to the p-value criterion
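
To make the negative log10 scale concrete, here is a small Python sketch. The effect names and p-values are hypothetical; only the 0.05 threshold comes from the slide. On this scale, smaller p-values give longer bars, and the threshold itself sits at about 1.30.

```python
import math

ALPHA = 0.05  # default significance threshold from the slide

def neg_log10(p):
    """Map a p-value onto the negative log10 scale."""
    return -math.log10(p)

# Hypothetical p-values for three effects.
p_values = {"effect_a": 0.0001, "effect_b": 0.03, "effect_c": 0.4}

significant = [name for name, p in p_values.items() if p < ALPHA]

for name, p in p_values.items():
    print(f"{name}: p={p}, -log10(p)={neg_log10(p):.2f}")
print("threshold on this scale:", round(neg_log10(ALPHA), 3))
print("significant:", significant)
```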
Residual Plot
• Click the Residual Plot tab. Right-click in the residual plot and change
the residual measure to Studentized Residual.
• A studentized residual is a raw residual that is divided by its estimated
standard deviation
• The variance of the residuals seems to increase with the predicted value
(heteroscedasticity)
• There are some large outliers (outside the black lines)
=> The influence plot can be used to explore the effect of the outliers
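
The definition above can be sketched in Python for the simplest case of a single predictor. This is an illustration under assumptions: it uses the internally studentized form (the residual variance is estimated from all observations), and the source does not say which variant the software reports. The data are made up.

```python
import math

def studentized_residuals(x, y):
    """Internally studentized residuals for a one-predictor regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    raw = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in raw) / (n - 2)  # residual variance, 2 parameters
    # Leverage of each observation; Var(residual_i) = s2 * (1 - h_i).
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / math.sqrt(s2 * (1.0 - h)) for e, h in zip(raw, leverage)]

# Hypothetical data: a perfect line with one shifted observation.
x = list(range(10))
y = [2.0 * xi + 1.0 for xi in x]
y[5] += 30.0

res = studentized_residuals(x, y)
largest = max(range(len(res)), key=lambda i: abs(res[i]))
print(largest, round(res[largest], 3))
```

Dividing by the estimated standard deviation puts all residuals on a common scale, so a fixed pair of reference lines can be used to flag outliers.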
Residual Plot
• Use the mouse to drag and select the large, positive residuals
Residual Plot
• Right-click in the Residual plot and select Show selected to open a
details table
Residual Plot
• Values of the variables associated with these observations are listed for examination.

• Close the Show selected window when you are done.


Assessment plot
• Click the Assessment tab
• The line plots show model predictions versus the actual response in the data.
To create the lines, both outcomes and predictions are binned into percentiles
• Although the model's predictions seem consistent with responder outcomes over
the middle range of the plot, the model under-predicts at the high and low ends
of the response range
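
One way to build this kind of comparison is sketched below in Python. The binning scheme (sort by predicted value, split into equal-sized bins, average each bin) is an assumption, since the source does not describe the exact procedure, and the values are hypothetical.

```python
def binned_means(actual, predicted, n_bins=4):
    """Average predicted and actual values within bins of the predictions."""
    pairs = sorted(zip(predicted, actual))  # order rows by predicted value
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_act = sum(a for _, a in chunk) / len(chunk)
        rows.append((b + 1, mean_pred, mean_act))
    return rows

# Hypothetical values: the predictions are too low at both ends.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
actual = [3.0, 4.0, 3.0, 4.0, 5.0, 6.0, 9.0, 12.0]

rows = binned_means(actual, predicted)
for bin_id, mean_pred, mean_act in rows:
    print(bin_id, mean_pred, mean_act)
```

In this toy output the two means agree in the middle bins and diverge in the first and last bins, which is the pattern described for the assessment plot.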

Influence plot
• In the Options pane, select Influence Plot/Variable Selection Plot.
• Select Influence Plot from the Plot to show pull-down

Influence plot

• Click on the Influence tab
• The bars correspond to observations (accounts) in the data
Influential observations
• Select the top 5 bars in the influence plot (Ctrl + left-click or ⌘ + click)

• These are the observations that have a lot of influence on the values of
the parameter estimates, i.e., they have high leverage
• When they are removed from the data, the parameter estimates change
significantly
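
The idea of an influential observation can be illustrated with Cook's distance, one common influence measure. Whether the influence plot uses this particular measure is an assumption, and the sketch below is restricted to a single predictor with made-up data.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope

def cooks_distance(x, y):
    """Cook's distance of each observation in a one-predictor regression."""
    n, p = len(x), 2
    intercept, slope = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    raw = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in raw) / (n - p)
    distances = []
    for xi, e in zip(x, raw):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of this row
        distances.append(e * e / (p * s2) * h / (1.0 - h) ** 2)
    return distances

# Hypothetical data: the last point is far out in x and off the line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2.0 * xi for xi in x]
y[-1] = 20.0

d = cooks_distance(x, y)
top = max(range(len(d)), key=lambda i: d[i])
_, slope_all = fit_line(x, y)
_, slope_without = fit_line(x[:top] + x[top + 1:], y[:top] + y[top + 1:])
print(top, round(slope_all, 3), round(slope_without, 3))
```

Removing the single most influential row moves the slope from roughly 0.55 back to 2, which is the kind of shift in parameter estimates the bullets describe.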
Influential observations
• To remove the influential observations, right-click on the influence plot
and select New filter from selection => Exclude selection
Influential observations
• Both RMSE and R² improved slightly
• However, the residuals are still problematic (more of them are positive
than negative, and their variability is not homogeneous)
Influential observations
• Warning!

No valid reason for removing these observations from the model was
identified. Removing observations solely because they violate
statistical norms can yield unreliable models.

• To remove the filter: right-click on the influence plot and select
Remove selection filters
Refining a Linear Regression

This demonstration illustrates how to refine the linear regression model in
SAS Visual Statistics using the VS_Bank data set.



Demonstration – Model Selection
• Click on Maximise to see the details table
Demonstration – Model Selection

• Click on Type III test
• The p-values (Pr > F column) measure the significance of each variable
• You can sort the table by clicking on the Pr > F column
Demonstration – Model Selection
Notes

• A p-value threshold of 0.05 is widely used.
• Above: variable is not significant
• Below: variable is significant
• We will use a stricter threshold (adapted to big data): 0.01



Demonstration – Model Selection
• Dealing with missing values: in Options, select Informative missingness
• Model selection, to select the most significant variables automatically:
• Select Backward in Variable selection method
• Change Selection criterion to Significance level
• Enter 0.01 as the significance level
Notes: Variable selection methods
Forward selection:
1. Start with no variables in the model
2. Test the addition of each variable using a chosen model fit criterion
3. Add the variable (if any) whose inclusion gives the most statistically
significant improvement of the fit
4. Repeat from step 2 until no remaining variable improves the model to a
statistically significant extent
Notes: Variable selection methods
Backward selection:
1. Start with all candidate variables
2. Test the deletion of each variable using a chosen model fit criterion
3. Delete the variable (if any) whose loss gives the most statistically
insignificant deterioration of the model fit
4. Repeat from step 2 until no further variables can be deleted without a
statistically significant loss of fit

Stepwise:
A combination of the forward and backward methods, testing at each step for
variables to be included or excluded
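
The control flow of backward selection with a significance-level criterion can be sketched in Python. This is a simplified illustration, not the software's implementation: `p_values_for` is a hypothetical callback that would refit the model on the given variables and return their p-values, and the stub used here returns fixed, made-up numbers so the loop can be followed by hand.

```python
def backward_select(candidates, p_values_for, alpha=0.01):
    """Drop the least significant variable until all p-values are <= alpha."""
    selected = list(candidates)
    while selected:
        p_values = p_values_for(selected)  # refit on the current variables
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        selected.remove(worst)
    return selected

# Hypothetical p-values. A real refit would change them after each
# removal; the stub keeps them fixed for readability.
FIXED_P = {"x1": 0.0001, "x2": 0.20, "x3": 0.004, "x4": 0.03}

def stub_p_values(variables):
    return {name: FIXED_P[name] for name in variables}

kept = backward_select(["x1", "x2", "x3", "x4"], stub_p_values, alpha=0.01)
print(kept)
```

With the 0.01 threshold, x2 is removed first, then x4, and the loop stops once the largest remaining p-value is below the threshold.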
Model Selection
New model with 4 fewer variables:
Model Selection
Note

Variable selection did little to address the main issues (residual patterns,
low R² and high Root MSE), but it did result in a simpler, more defensible,
and more useful specification
Export data
• To export data, click More and select Export data
• Select Formatted data and click OK to start downloading an Excel file
Interactions
• Domain experts suggest that there is a negative correlation between purchase amount and
purchase recency. That is, customers who purchased recently tend to buy smaller amounts
(obtain smaller loans) than customers who purchased less recently.

• The interaction functionality in SAS Visual Statistics can be used to test this hypothesis.

• Start by restoring the report:


Interactions
• In the Data pane, select New data item => Interaction Effect

Interactions
• Give the new interaction the name RFMInteraction
• Add logi_rfm4 Last Product Purchase Amount
• Add logi_rfm9 Months Since Last Purchase
• Click OK
Interactions
• In the Roles pane, add RFMInteraction as an interaction effect
Interactions
• The Fit Summary plot provides evidence in favour of the hypothesis. The
interaction term is included in the model.
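
For two continuous inputs, an interaction effect in a linear model amounts to adding their product as an extra regressor. The Python sketch below illustrates this with hypothetical values; the column names mirror the two variables used above, and how the software builds the term internally is an assumption.

```python
def add_interaction(rows, left, right, name):
    """Return copies of the rows with a product column added."""
    return [{**row, name: row[left] * row[right]} for row in rows]

# Hypothetical values for the two continuous inputs.
rows = [
    {"logi_rfm4": 2.0, "logi_rfm9": 3.0},
    {"logi_rfm4": 1.5, "logi_rfm9": 4.0},
    {"logi_rfm4": 0.0, "logi_rfm9": 7.0},
]

with_interaction = add_interaction(rows, "logi_rfm4", "logi_rfm9", "RFMInteraction")
for row in with_interaction:
    print(row)
```

A significant coefficient on the product column means that the effect of one input on the response depends on the level of the other.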
Variable Selection Plot
Filtering
• Most of the accounts in the data are consumers of short- to medium-term
loans, i.e., in the $2,000 to $50,000 range.
• However, some are outside this range. What happens if we remove the loans
outside this range from the regression analysis?
• The model could improve!
Filtering
• Select the Filter pane
• Select New filter => tgt Interval New Sales

Filtering
• The existing range for new sales amounts is $0 to $500,000. To be
consistent with the loan profile for this portfolio, modify the filter range.
• The default filter does not provide enough granularity
• Click on Options and select Advanced edit
Filtering
Enter 2000 and 50000 for the filter values and select OK
Filtering
• The filter greatly reduces the range of the target response variable
• But only 5% of the observations were excluded by the filter
• Root MSE has been reduced to 6,031.4372
• The problems with the residuals are partially mitigated.
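
The range filter itself is simple to express in Python. The sales amounts below are hypothetical, and treating both bounds as inclusive is an assumption about how the filter behaves; only the 2000 and 50000 limits come from the demonstration.

```python
LOW, HIGH = 2000, 50000  # filter values entered in the demonstration

# Hypothetical new sales amounts for ten responder accounts.
new_sales = [500, 2500, 8000, 15000, 49000, 120000, 30000, 4000, 2000, 50000]

# Keep only the accounts inside the range (bounds assumed inclusive).
kept = [value for value in new_sales if LOW <= value <= HIGH]

excluded_share = 1 - len(kept) / len(new_sales)
print(f"kept {len(kept)} of {len(new_sales)}, excluded {excluded_share:.0%}")
print("range before:", min(new_sales), "to", max(new_sales))
print("range after:", min(kept), "to", max(kept))
```

A filter like this can shrink the range of the response considerably while removing only a small share of the rows, because the excluded values sit in the sparse tails.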
Filtering
• Further exploration using filtering can help analysts understand whether homogeneous groups of
responder accounts exist within the distribution of the target variable.

• Subsequent prediction quality can be improved if separate modeling exercises are performed for
different groups of responders.
