LR
Logistic regression
Logistic regression is used for binary outcome data, where Y = 0 (fail) or Y = 1 (success).
Now consider n independent experiments, each with only two possible outcomes, 1 or 0; this is a binomial experiment. Let p be the probability of outcome 1 in each experiment and let X be the number of experiments with outcome 1. X is called a binomial random variable with parameters n and p, denoted X ~ Binomial(n, p).
The probability mass function is given by
$$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n.$$
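As a quick check of the binomial probability mass function, the sketch below evaluates the formula by hand and compares it with base R's `dbinom()`. The values n = 10 and p = 0.3 are illustrative assumptions, not taken from the slides.

```r
# Binomial pmf written out by hand and compared with R's built-in dbinom().
n <- 10    # number of independent experiments (illustrative value)
p <- 0.3   # probability of outcome 1 (illustrative value)
x <- 0:n   # every possible count of outcome 1

pmf_manual  <- choose(n, x) * p^x * (1 - p)^(n - x)
pmf_builtin <- dbinom(x, size = n, prob = p)

all.equal(pmf_manual, pmf_builtin)  # the two computations should agree
sum(pmf_manual)                     # a pmf sums to 1 over its support
```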
Logit function
The logit function maps the unit interval onto the real line.
$$\operatorname{logit}(x) = \log\left(\frac{x}{1 - x}\right)$$
The inverse logit function maps the real line onto the unit interval.
$$\operatorname{logit}^{-1}(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)}$$
In logistic regression, the inverse logit function is used to map the linear predictor $\beta^{\top} X$ to a probability.
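The two maps can be sketched in a few lines of R. The name `logit.inv` follows the label used in the slides' plot; the input values are illustrative. Base R's `qlogis()` and `plogis()` compute the same two functions and are used here only as a cross-check.

```r
# Logit: unit interval -> real line. Inverse logit: real line -> unit interval.
logit     <- function(p) log(p / (1 - p))
logit.inv <- function(x) 1 / (1 + exp(-x))

p <- c(0.1, 0.5, 0.9)   # illustrative probabilities
logit(p)                # log-odds; logit(0.5) is 0
logit.inv(logit(p))     # applying the inverse recovers p

# Cross-check against the base R equivalents
all.equal(logit(p), qlogis(p))
all.equal(logit.inv(c(-4, 0, 4)), plogis(c(-4, 0, 4)))
```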
[Figure: curves of logit(x) and logit.inv(x).]
Logistic regression
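$$P(Y = 1 \mid X = x) = \frac{\exp(\beta^{\top} x)}{1 + \exp(\beta^{\top} x)} = \frac{1}{1 + \exp(-\beta^{\top} x)},$$
$$P(Y = 0 \mid X = x) = \frac{1}{1 + \exp(\beta^{\top} x)}.$$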
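To make the model concrete, here is a minimal simulation sketch: binary outcomes are generated from a logistic model with one covariate, and `glm()` with `family = binomial` is used to estimate the coefficients. The sample size, the seed and the "true" values `beta0` and `beta1` are illustrative assumptions.

```r
# Simulate from a logistic regression model and recover the coefficients.
set.seed(1)                  # arbitrary seed, for reproducibility only
n     <- 5000                # illustrative sample size
beta0 <- -1                  # illustrative "true" intercept
beta1 <- 2                   # illustrative "true" slope

x    <- rnorm(n)
prob <- 1 / (1 + exp(-(beta0 + beta1 * x)))  # P(Y = 1 | X = x)
y    <- rbinom(n, size = 1, prob = prob)     # Y = 1 (success) or 0 (fail)

fit <- glm(y ~ x, family = binomial)         # logit link is the default
coef(fit)                                    # should be near beta0 and beta1
```

With a sample this large the estimates are expected to land close to the values used in the simulation, though not exactly on them.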
Apartments example
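# Read the apartments data set
url <- 'https://raw.githubusercontent.com/fhernanb/datos/master/aptos2015'
datos <- read.table(file=url, header=TRUE)
# Recode balcon ('si'/'no') as a logical response
datos$balcon <- datos$balcon == 'si'
# Treat estrato as a factor
datos$estrato <- as.factor(datos$estrato)
head(datos)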
##   precio   mt2 ubicacion estrato alcobas banos balcon parqueadero
## 1     79 43.16     norte       3       3     1   TRUE          si
## 2     93 56.92     norte       2       2     1   TRUE          si
## 3    100 66.40     norte       3       2     2  FALSE          no
## 4    123 61.85     norte       2       3     2   TRUE          si
## 5    135 89.80     norte       4       3     2   TRUE          no
## 6    140 71.00     norte       3       3     2  FALSE          si
##   administracion   avaluo terminado
## 1          0.050 14.92300        no
## 2          0.069 27.00000        si
## 3          0.000 15.73843        no
## 4          0.130 27.00000        no
## 5          0.000 39.56700        si
## 6          0.120 31.14551        si
dim(datos)
## [1] 694  11
[Figure: price (millions) by presence of a balcony (FALSE, TRUE).]
[Figure: area (mt2) by presence of a balcony (FALSE, TRUE).]
Apartments example
The data are split into a training set and a validation set.
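The slides do not show how the split was made. One common approach is a simple random split, sketched below with 500 training rows out of 694, which matches the row count from `dim(datos)` and the number of observations in the fits. The seed, the stand-in data frame `toy` and the name `validation` are assumptions for illustration.

```r
# Random split into training and validation sets (illustrative sketch).
set.seed(123)          # hypothetical seed; the original seed is not shown
n_total <- 694         # rows reported by dim(datos)
n_train <- 500         # observations used in the fitted models

toy <- data.frame(id = seq_len(n_total))   # stand-in for datos

idx        <- sample(n_total, size = n_train)  # row indices for training
training   <- toy[idx, , drop = FALSE]
validation <- toy[-idx, , drop = FALSE]        # the remaining 194 rows

nrow(training)
nrow(validation)
```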
library(gamlss)
# Full model: binomial (BI) family with all covariates, fitted on the training set
mod1 <- gamlss(balcon ~ precio + mt2 + alcobas + banos + administracion +
                 avaluo + parqueadero + estrato + ubicacion + terminado,
               family=BI, data=training)
## GAMLSS-RS iteration 1: Global Deviance = 533.7589
## GAMLSS-RS iteration 2: Global Deviance = 533.7589
*******************************************************************
Family:  c("BI", "Binomial")

Call:
gamlss(formula = balcon ~ precio + mt2 + alcobas + banos + administracion +
    avaluo + parqueadero + estrato + ubicacion + terminado, family = BI,
    data = training)

Fitting method: RS()

------------------------------------------------------------------
Mu link function:  logit
Mu Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.356141   1.003835   0.355   0.7229
precio                   0.003447   0.001426   2.417   0.0160 *
mt2                     -0.003376   0.004125  -0.818   0.4136
alcobas                 -0.124762   0.145227  -0.859   0.3907
banos                    0.040984   0.208165   0.197   0.8440
administracion          -0.545263   1.084062  -0.503   0.6152
avaluo                   0.001807   0.001332   1.356   0.1756
parqueaderosi            0.402516   0.350807   1.147   0.2518
estrato3                 0.029470   0.934596   0.032   0.9749
estrato4                 0.780036   0.973970   0.801   0.4236
estrato5                 0.603801   1.001407   0.603   0.5468
estrato6                 0.684924   1.100046   0.623   0.5338
ubicacionbelen guayabal  0.461554   0.432466   1.067   0.2864
ubicacioncentro         -1.144698   0.519925  -2.202   0.0282 *
ubicacionlaureles       -1.096556   0.458645  -2.391   0.0172 *
ubicacionnorte           0.474508   0.864193   0.549   0.5832
ubicacionoccidente      -0.711255   0.391512  -1.817   0.0699 .
ubicacionpoblado        -1.211393   0.540687  -2.240   0.0255 *
terminadosi             -0.026951   0.293539  -0.092   0.9269
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  500
Call:
glm(formula = balcon ~ precio + mt2 + alcobas + banos + administracion +
parqueadero + estrato + ubicacion + terminado, family = binomial,
data = training)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4490  -1.0871   0.5816   0.8303   1.4960

Coefficients:
(Intercept)
precio
mt2
alcobas
banos
administracion
parqueaderosi
estrato3
estrato4
estrato5
estrato6
ubicacionbelen guayabal
ubicacioncentro
ubicacionlaureles
ubicacionnorte
ubicacionoccidente
ubicacionpoblado
terminadosi
---
Signif. codes:  0 '***'
summary(mod1.final)
## *******************************************************************
## Family:  c("BI", "Binomial")
##
## Call:  gamlss(formula = balcon ~ precio, family = BI, data = training,
##     trace = FALSE)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function:  logit
## Mu Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3111237  0.1839521   1.691   0.0914 .
## precio      0.0024384  0.0006045   4.034 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit:  500
## Degrees of Freedom for the fit:  2
##       Residual Deg. of Freedom:  498
##                       at cycle:  2
##
## Global Deviance:     565.8916
##             AIC:     569.8916
##             SBC:     578.3208
## *******************************************************************
Start:  AIC=572.01
balcon ~ precio + mt2 + alcobas + banos + administracion + parqueadero +
    estrato + ubicacion + terminado

                 Df Deviance    AIC
- estrato         4   540.59 568.59
- terminado       1   536.02 570.02
- administracion  1   536.03 570.03
- banos           1   536.04 570.04
- mt2             1   536.31 570.31
- alcobas         1   537.01 571.01
- parqueadero     1   537.29 571.29
<none>                536.01 572.01
- ubicacion       6   553.33 577.33
- precio          1   544.27 578.27

Step:  AIC=568.59
balcon ~ precio + mt2 + alcobas + banos + administracion + parqueadero +
    ubicacion + terminado
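The AIC column in the first step can be reproduced by hand. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2k, where k is the number of estimated coefficients. The sketch below checks this against two rows of the table; the helper name `aic_from_deviance` is hypothetical, and k = 18 is obtained by counting the coefficients of the starting model (intercept, 5 numeric terms, 1 for parqueadero, 4 for estrato, 6 for ubicacion, 1 for terminado).

```r
# AIC from deviance for a binary-response model: AIC = deviance + 2 * k.
aic_from_deviance <- function(deviance, k) deviance + 2 * k

k_full <- 1 + 5 + 1 + 4 + 6 + 1   # coefficients in the starting model: 18

aic_from_deviance(536.01, k_full)       # <none> row: 572.01
aic_from_deviance(540.59, k_full - 4)   # dropping estrato (4 df): 568.59
aic_from_deviance(544.27, k_full - 1)   # dropping precio (1 df): 578.27
```

Dropping estrato raises the deviance only slightly while saving four parameters, which is why it gives the lowest AIC and is removed first.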
Call:
glm(formula = balcon ~ precio + parqueadero + ubicacion, family = binomial,
data = training)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4130  -1.0516   0.6447   0.8211   1.5281

Coefficients:
(Intercept)
precio
parqueaderosi
ubicacionbelen guayabal
ubicacioncentro
ubicacionlaureles
ubicacionnorte
ubicacionoccidente
ubicacionpoblado
---
Signif. codes:  0 '***'

on 499 degrees of freedom
on 491 degrees of freedom
AIC
587.2380
569.8916
563.9745
560.4312
Call:
glm(formula = balcon ~ precio + ubicacion + parqueadero, family = binomial,
data = training)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4130  -1.0516   0.6447   0.8211   1.5281

Coefficients:
(Intercept)
precio
ubicacionbelen guayabal
ubicacioncentro
ubicacionlaureles
ubicacionnorte
ubicacionoccidente
ubicacionpoblado
parqueaderosi
---
Signif. codes:  0 '***'

on 499 degrees of freedom
on 491 degrees of freedom
Residuals of mod1.final
par(mfrow=c(2, 2))
plot(mod1.final)
[Figure: diagnostic plots for mod1.final. Panels: quantile residuals against fitted values, quantile residuals against index, density estimate of the quantile residuals, normal Q-Q plot.]
wp(mod1.final)
[Figure: worm plot of mod1.final (y-axis: Deviation).]
Residuals of mod2back
par(mfrow=c(2, 2))
plot(mod2back)
[Figure: diagnostic plots for mod2back. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance). Observations 301, 323 and 324 are labelled.]
Residuals of mod2forw
par(mfrow=c(2, 2))
plot(mod2forw)
[Figure: diagnostic plots for mod2forw. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance). Observations 301, 323 and 324 are labelled.]
Model to use
The estimated model is:
$$P(Y = \text{balcony} \mid X = x) = \frac{1}{1 + \exp(-0.3111 - 0.0024\,\text{Precio})}$$
$$\log\left(\frac{P(Y = \text{balcony} \mid X = x)}{P(Y = \text{no balcony} \mid X = x)}\right) = 0.3111 + 0.0024\,\text{Precio}$$
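One way to read the log-odds form of the estimated model is through odds ratios. The sketch below uses the coefficients printed by summary(mod1.final) and compares the odds of having a balcony at two prices that differ by 100 units (price is recorded in millions in the plots). The prices 100 and 200 and the helper name `odds` are illustrative assumptions.

```r
# Odds of a balcony under the estimated model, and the odds ratio for a
# 100-unit increase in precio.
b0 <- 0.3111237   # intercept from summary(mod1.final)
b1 <- 0.0024384   # precio coefficient from summary(mod1.final)

odds <- function(precio) exp(b0 + b1 * precio)

odds(200) / odds(100)   # odds ratio between the two illustrative prices
exp(100 * b1)           # same quantity, about 1.28, free of the baseline price
```

Because the model is linear on the log-odds scale, the odds ratio for a fixed price increase is the same at any starting price.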
    precio
501    285
607    130
527    115
613    165
315    865
115    260
224    435
350     82
500    285
678    370
24     105
105    235
502    287
1       79
267    520
   precio    mt2 balcon  prob.est
9     160  93.00   TRUE 0.7012728
11     47  43.00  FALSE 0.4257300
14     70  42.00   TRUE 0.4818139
17     80  45.35   TRUE 0.5061983
18     90 100.00   TRUE 0.5305826
20     90  40.00  FALSE 0.5305826
27    115  40.00  FALSE 0.5915434
31    130  54.00   TRUE 0.6281199
32    130  78.00  FALSE 0.6281199
35    143  56.00   TRUE 0.6598195
42    155  94.10   TRUE 0.6890807
43    155  74.00  FALSE 0.6890807
45    160  58.00   TRUE 0.7012728
47    180  66.00   TRUE 0.7500415
49    185  64.90   TRUE 0.7622336
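A note on scale: the prob.est values listed above agree to about four decimals with the linear predictor 0.3111237 + 0.0024384 × precio, not with its inverse logit. This is an inference from the printed numbers, not something the slides state; it would be consistent with a prediction made on the link scale, which is the default `type` in `predict.glm()`. The sketch below shows both scales for three prices from the table.

```r
# Linear predictor (log-odds) versus probability for prices in the table.
b0 <- 0.3111237   # intercept from summary(mod1.final)
b1 <- 0.0024384   # precio coefficient from summary(mod1.final)

precio <- c(160, 47, 185)
eta    <- b0 + b1 * precio   # link scale; compare with prob.est in the table
p      <- plogis(eta)        # inverse logit: estimated P(balcony)

round(cbind(precio, eta, p), 4)
```

If probabilities are wanted, applying the inverse logit as above, or requesting `type = "response"` at prediction time, puts the values on the (0, 1) scale; any cutoff then has to be chosen on that same scale.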
        balcon.est
         FALSE TRUE Sum
  FALSE      3   46  49
  TRUE       1  144 145
  Sum        4  190 194
        balcon.est
         FALSE TRUE Sum
  FALSE     19   30  49
  TRUE      39  106 145
  Sum       58  136 194
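The two classification tables can be summarised numerically. The sketch below re-enters the counts and computes the percentage of correct classifications, sensitivity and specificity. It assumes rows are the observed balcon values and columns the estimated ones (the row label did not survive in the tables); the helper names are hypothetical.

```r
# Classification summaries from the two tables (rows assumed to be observed).
make_tab <- function(counts) {
  matrix(counts, nrow = 2,   # filled column by column
         dimnames = list(balcon     = c("FALSE", "TRUE"),
                         balcon.est = c("FALSE", "TRUE")))
}
tab1 <- make_tab(c(3, 1, 46, 144))
tab2 <- make_tab(c(19, 39, 30, 106))

pct_correct <- function(tab) 100 * sum(diag(tab)) / sum(tab)
sensitivity <- function(tab) tab["TRUE", "TRUE"] / sum(tab["TRUE", ])
specificity <- function(tab) tab["FALSE", "FALSE"] / sum(tab["FALSE", ])

pct_correct(tab1)   # 147 of 194 correct, about 75.77
pct_correct(tab2)   # 125 of 194 correct, about 64.43
c(sensitivity(tab1), specificity(tab1))
c(sensitivity(tab2), specificity(tab2))
```

The first table classifies almost every apartment as having a balcony, so its high overall accuracy comes with very low specificity (3 of 49).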
[Figure: curve plotted against the cutoff point (x-axis: cutoff point).]
$maximum
[1] 0.4893854
$objective
[1] 75.7732
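The `$maximum` and `$objective` components above have the shape of the list returned by base R's `optimize()` when called with `maximum = TRUE`. Since the original code is not shown, the sketch below illustrates such a cutoff search on simulated data; `score`, `y` and `pct_correct` are hypothetical names, and the seed and sample size are arbitrary.

```r
# Search for the cutoff that maximises the percentage of correct classifications.
set.seed(2)                      # arbitrary seed
n     <- 200                     # illustrative sample size
score <- runif(n)                # stand-in for estimated probabilities
y     <- rbinom(n, 1, score) == 1   # simulated TRUE/FALSE outcomes

# Percentage of observations classified correctly at a given cutoff
pct_correct <- function(cut) 100 * mean((score > cut) == y)

res <- optimize(pct_correct, interval = c(0, 1), maximum = TRUE)
res$maximum     # cutoff found by the search
res$objective   # percentage correct at that cutoff

# pct_correct is a step function, so a coarse grid is a useful cross-check
cuts <- seq(0, 1, by = 0.01)
grid <- sapply(cuts, pct_correct)
cuts[which.max(grid)]
max(grid)
```

Because the objective is piecewise constant, `optimize()` may settle on a point that is not the global best; comparing with the grid result is a cheap safeguard.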
Deviance
Graphical illustration