L3 Demo - Building A Linear Regression
Categorical variables
Interval-valued variables
Demographic variables
Demonstration
• Create a new report
• Start with DATA
• Select VS_BANK
Choose data: VS_BANK
• Select the Objects pane
• Drag and drop Linear Regression onto the canvas
Assign data
Response
• Select the Roles pane
• Add tgt Interval New Sales as the response variable
Classification effect
Value Level as classification effects
Target and input variable roles:
Demonstration
• On the menu bar, click Menu and select Enable auto-refresh
• R² = 0.09333: the model explains less than 10% of the variability in the
response.
• This R² is low, but further model refinement may improve it
• The model uses 211,509 observations
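As a rough illustration of what R² measures, the sketch below computes it as 1 minus the ratio of the error sum of squares to the total sum of squares. The function name `r_squared` and the four data points are assumptions made for illustration; they are not values from VS_BANK or output from the software.

```python
def r_squared(actual, predicted):
    """Share of the variability in `actual` that `predicted` explains."""
    mean_actual = sum(actual) / len(actual)
    # Error sum of squares: what the model leaves unexplained.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variability around the mean of the response.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Hypothetical values, not taken from VS_BANK.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [24.0, 24.0, 26.0, 26.0]

r2 = r_squared(actual, predicted)
print(f"R2 = {r2:.3f}")
```

Predictions that barely move away from the mean of the response, as here, leave most of the variability unexplained and give an R² close to 0.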
Demonstration
Note:
• The number of observations used seems low
• The data set contains more than a million accounts
• But the data has a 20% response rate
• In the response variable used in this regression, non-responders are coded
with a missing value
• The linear regression is therefore fitted using only the responders in the data
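The effect of the missing-value coding can be sketched as follows. The account records and the field names `account_id` and `new_sales` are hypothetical; the point is only that rows with a missing response drop out before the model is fitted.

```python
# Hypothetical accounts: None marks a non-responder (missing response value).
accounts = [
    {"account_id": 1, "new_sales": 5200.0},
    {"account_id": 2, "new_sales": None},
    {"account_id": 3, "new_sales": None},
    {"account_id": 4, "new_sales": None},
    {"account_id": 5, "new_sales": None},
]

# Only rows with an observed response can contribute to the regression fit.
responders = [row for row in accounts if row["new_sales"] is not None]
response_rate = len(responders) / len(accounts)

print(f"{len(responders)} of {len(accounts)} accounts used "
      f"({response_rate:.0%} response rate)")
```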
Demonstration
• We can investigate other fit measures:
• Click R2 and select Root MSE (Root Mean Square Error)
• RMSE = 8086.21
• This tells you that predictions typically differ from the actual values by
approximately $8,086.
• This seems imprecise and might be an artifact of how the data were collected
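A minimal sketch of the RMSE calculation, using made-up numbers. It divides the squared errors by the number of observations; the software may divide by the error degrees of freedom instead, so treat this as an approximation of the reported Root MSE. The mean absolute error is shown alongside because RMSE weights large errors more heavily than a plain average of the differences.

```python
import math


def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Plain average of the absolute differences, for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, not taken from VS_BANK.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE = {rmse_value:.2f}, MAE = {mae_value:.2f}")
```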
Demonstration
• In the Options pane, under Model Display, select General and change the plot
layout to Stack to expand the Fit Summary window on the canvas.
Fit Summary
• The default significance threshold of p-value < 0.05 is used to assess
variable importance.
• Effects with p-values > 0.05 have blue bars and are not significant.
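The threshold rule can be sketched as a simple comparison. The effect names and p-values below are hypothetical placeholders, not results from the demonstration.

```python
ALPHA = 0.05  # default significance threshold named on the slide

# Hypothetical p-values for three effects.
p_values = {"effect_a": 0.001, "effect_b": 0.30, "effect_c": 0.04}

# An effect counts as significant when its p-value is below the threshold.
significant = sorted(name for name, p in p_values.items() if p < ALPHA)
not_significant = sorted(name for name, p in p_values.items() if p >= ALPHA)

print("significant:", significant)
print("not significant:", not_significant)
```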
Influence plot
• In the Options pane, select Influence Plot/Variable Selection Plot.
Influence plot
• The influence plot highlights the observations that have a lot of influence on
the parameter estimates, i.e., they have high leverage
• If they are removed from the data, the parameter estimates change
significantly
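To make the idea concrete, here is a minimal sketch for a regression with a single predictor, using five made-up points. It computes the standard leverage h_i = 1/n + (x_i - mean(x))² / Sxx and compares the slope fitted with and without the highest-leverage point. The diagnostic that the software actually plots is not stated on the slides, so this is an illustration of the concept, not a reproduction of the influence plot.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverages(xs):
    """Leverage of each observation in a one-predictor regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical data: the last point sits far from the others in x.
xs = [1.0, 2.0, 3.0, 4.0, 20.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

h = leverages(xs)
most_influential = h.index(max(h))

_, slope_all = fit_line(xs, ys)
# Refit after excluding the highest-leverage observation.
xs_trimmed = [x for i, x in enumerate(xs) if i != most_influential]
ys_trimmed = [y for i, y in enumerate(ys) if i != most_influential]
_, slope_trimmed = fit_line(xs_trimmed, ys_trimmed)

print(f"leverage of point {most_influential}: {h[most_influential]:.3f}")
print(f"slope with all points: {slope_all:.2f}, without it: {slope_trimmed:.2f}")
```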
Influential observations
• To remove the influential observations, right-click the influence plot
and select New filter from selection => Exclude selection
Influential observations
• Both RMSE and R² improved slightly
• However, no valid reason for removing these observations from the model was
identified. Removing observations solely because they violate
statistical norms can yield unreliable models.
Stepwise
A combination of forward and backward selection: each step tests which
variables to include in or exclude from the model.
Model Selection
New model with four fewer variables:
Model Selection
Note
Variable selection did little to fix the main issues (the residuals, the low
R², and the Root MSE), but it did result in a simpler, more defensible, and
more useful specification.
Export data
• To export data, click More and select Export data
• Confirm the Formatted data option and click OK to download an Excel file
Interactions
• Domain experts suggest that there is a negative correlation between purchase amount and
purchase recency. That is, customers who purchased recently tend to buy smaller amounts
(obtain smaller loans) than customers who purchased less recently.
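An interaction between two interval inputs is commonly modeled as their product, which lets the effect of one input depend on the level of the other. The sketch below assumes that form; the variable names and all coefficients are hypothetical and do not come from the fitted model.

```python
def predict(amount, recency, b0, b_amount, b_recency, b_inter):
    """Linear predictor with a product interaction term."""
    interaction = amount * recency  # the interaction effect as a product
    return b0 + b_amount * amount + b_recency * recency + b_inter * interaction


def amount_slope(recency, b_amount, b_inter):
    """Effect of one extra unit of amount at a given recency level."""
    return b_amount + b_inter * recency


# Hypothetical coefficients chosen only to show the mechanics.
coefs = {"b0": 100.0, "b_amount": 0.5, "b_recency": 10.0, "b_inter": -0.25}

slope_at_1 = amount_slope(1.0, coefs["b_amount"], coefs["b_inter"])
slope_at_2 = amount_slope(2.0, coefs["b_amount"], coefs["b_inter"])
prediction = predict(1000.0, 2.0, **coefs)

print(f"slope of amount at recency 1: {slope_at_1}")
print(f"slope of amount at recency 2: {slope_at_2}")
print(f"prediction for amount=1000, recency=2: {prediction}")
```

With a nonzero interaction coefficient, the slope of amount changes with recency, which is the kind of dependence the domain experts describe.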
Interactions
• Give the new interaction the name RFMInteraction
• Click OK
Interactions
• In the Roles pane, add RFMInteraction as an interaction effect
Interactions
• The Fit Summary plot provides evidence in favour of the hypothesis: the
interaction term is included in the model.
Variable Selection Plot
Filtering
• Most of the accounts in the data are consumers of short- to medium-term
loans, i.e., in the $2,000 to $50,000 range.
• However, some are outside this range. What happens if we remove the loans
outside this range from the regression analysis?
• The model could improve!
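The range filter can be sketched as below. The loan amounts are made up, and treating both bounds as inclusive is an assumption; the slides do not say how the filter handles the endpoints.

```python
LOW, HIGH = 2_000.0, 50_000.0  # range named on the slide

# Hypothetical loan amounts (values of the response variable).
loans = [1_500.0, 2_000.0, 12_500.0, 50_000.0, 75_000.0]

# Keep only loans inside the range; bounds are assumed to be inclusive.
in_range = [amount for amount in loans if LOW <= amount <= HIGH]
excluded = [amount for amount in loans if not (LOW <= amount <= HIGH)]

print("kept for the regression:", in_range)
print("filtered out:", excluded)
```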
Filtering
• Select the Filter pane
• Select New filter => tgt Interval New Sales
Filtering
• Subsequent prediction quality can be improved if separate modeling exercises are performed for
different groups of responders.