Tobit Models - R Data Analysis Examples

The tobit model, also called a censored regression model, is designed to estimate linear
relationships between variables when there is either left- or right-censoring in the dependent
variable (also known as censoring from below and above, respectively). Censoring from above
takes place when cases with a value at or above some threshold all take on the value of that
threshold, so that the true value might equal the threshold but might also be higher. In
the case of censoring from below, values that fall at or below some threshold are
censored.
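To make the idea concrete, here is a minimal simulated sketch (hypothetical data, not the dataset analyzed below): a latent outcome is generated and then censored from above at an arbitrary threshold, so the observed variable records the threshold for every case at or above it.

# hypothetical illustration of censoring from above (right-censoring)
set.seed(1)
n <- 100
x <- rnorm(n)
ystar <- 2 + 3 * x + rnorm(n)  # latent (true) outcome
cap <- 4                       # arbitrary censoring threshold
y <- pmin(ystar, cap)          # observed outcome: values above cap are recorded as cap
mean(y == cap)                 # proportion of censored cases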
This page uses the following packages. Make sure that you can load them before trying to run
the examples on this page. If you do not have a package installed, run
install.packages("packagename"); if a package is out of date, run update.packages().

require(ggplot2)
require(GGally)
require(VGAM)

Version info: Code for this page was tested in R Under development (unstable) (2012-11-16 r61126)
On: 2012-12-15
With: VGAM 0.9-0; GGally 0.4.2; reshape 0.8.4; plyr 1.8; ggplot2 0.9.3; knitr 0.9
Please Note: The purpose of this page is to show how to use various data analysis commands. It
does not cover all aspects of the research process which researchers are expected to do. In
particular, it does not cover data cleaning and checking, verification of assumptions, model
diagnostics and potential follow-up analyses.
Examples of Tobit Analysis
Example 1. In the 1980s there was a federal law restricting speedometer readings to no more
than 85 mph. So if you wanted to try to predict a vehicle's top speed from a combination of
horsepower and engine size, you would get a reading no higher than 85, regardless of how fast
the vehicle was really traveling. This is a classic case of right-censoring (censoring from above) of
the data. The only thing we are certain of is that those vehicles were traveling at least 85 mph.
Example 2. A research project is studying the level of lead in home drinking water as a function
of the age of a house and family income. The water testing kit cannot detect lead concentrations
below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to be dangerous. These
data are an example of left-censoring (censoring from below).
Example 3. Consider the situation in which we have a measure of academic aptitude (scaled
200-800) that we want to model using reading and math test scores, as well as the type of
program the student is enrolled in (academic, general, or vocational). The problem here is that
students who answer all questions on the academic aptitude test correctly receive a score of
800, even though it is likely that these students are not "truly" equal in aptitude. The same is true
of students who answer all of the questions incorrectly: all such students would have a score of
200, although they may not all be of equal aptitude.
Description of the Data
For our data analysis below, we are going to expand on Example 3 from above. We have
generated hypothetical data, which can be obtained from our website from within R. Note that R
requires forward slashes, not back slashes when specifying a file location even if the file is on
your hard drive.
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/tobit.csv")

The dataset contains 200 observations. The academic aptitude variable is apt, and the reading and
math test scores are read and math, respectively. The variable prog is the type of program the
student is in; it is a categorical (nominal) variable that takes on three values: academic (prog = 1),
general (prog = 2), and vocational (prog = 3). The variable id is an identification variable.
Now let’s look at the data descriptively. Note that in this dataset, the lowest value of apt is 352.
That is, no students received a score of 200 (the lowest score possible), meaning that even
though censoring from below was possible, it does not occur in the dataset.
summary(dat)

## id read math prog


## Min. : 1.0 Min. :28.0 Min. :33.0 academic : 45
## 1st Qu.: 50.8 1st Qu.:44.0 1st Qu.:45.0 general :105
## Median :100.5 Median :50.0 Median :52.0 vocational: 50
## Mean :100.5 Mean :52.2 Mean :52.6
## 3rd Qu.:150.2 3rd Qu.:60.0 3rd Qu.:59.0
## Max. :200.0 Max. :76.0 Max. :75.0
## apt
## Min. :352
## 1st Qu.:576
## Median :633
## Mean :640
## 3rd Qu.:705
## Max. :800

# function that gives the density of the normal distribution
# for a given mean and sd, scaled to be on a count metric
# for the histogram: count = density * sample size * bin width
f <- function(x, var, bw = 15) {
  dnorm(x, mean = mean(var), sd = sd(var)) * length(var) * bw
}

# set up base plot
p <- ggplot(dat, aes(x = apt, fill = prog))

# histogram, coloured by proportion in different programs,
# with a normal distribution overlaid
p + stat_bin(binwidth = 15) +
  stat_function(fun = f, size = 1, args = list(var = dat$apt))

Looking at the above histogram, we can see the censoring in the values of apt; that is, there are
far more cases with scores of 750 to 800 than one would expect looking at the rest of the
distribution. Below is an alternative histogram that further highlights the excess of cases where
apt = 800. Setting binwidth = 1 produces a histogram in which each unique value of apt has its
own bar. Because apt is continuous, most values of apt are unique in the dataset, although close
to the center of the distribution a few values of apt have two or three cases. The spike on the far
right of the histogram is the bar for cases where apt = 800; the height of this bar relative to all
the others clearly shows the excess number of cases with this value.

p + stat_bin(binwidth = 1) +
  stat_function(fun = f, size = 1, args = list(var = dat$apt, bw = 1))
Next we’ll explore the bivariate relationships in our dataset.
cor(dat[, c("read", "math", "apt")])

## read math apt


## read 1.0000 0.6623 0.6451
## math 0.6623 1.0000 0.7333
## apt 0.6451 0.7333 1.0000

# plot matrix
ggpairs(dat[, c("read", "math", "apt")])
In the first row of the scatterplot matrix shown above, we see the scatterplots showing the
relationship between read and apt, as well as math and apt. Note the collection of cases at the
top of these two scatterplots; this is due to the censoring in the distribution of apt.
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered. Some of the methods listed
are quite reasonable while others have either fallen out of favor or have limitations.

Tobit regression, the focus of this page.


OLS Regression – You could analyze these data using OLS regression. OLS regression
will treat the 800 as actual values and not as the lower limit of the top academic
aptitude. A limitation of this approach is that when the variable is censored, OLS
provides inconsistent estimates of the parameters, meaning that the coefficients from
the analysis will not necessarily approach the “true” population parameters as the
sample size increases. See Long (1997, chapter 7) for a more detailed discussion of
problems of using OLS regression with censored data.
Truncated Regression – There is sometimes confusion about the difference between
truncated data and censored data. With censored variables, all of the observations are
in the dataset, but we don’t know the “true” values of some of them. With truncation
some of the observations are not included in the analysis because of the value of the

variable. When a variable is censored, regression models for truncated data provide
inconsistent estimates of the parameters. See Long (1997, chapter 7) for a more detailed
discussion of the problems of using regression models for truncated data to analyze
censored data.
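The distinction is easy to see in code. In the hypothetical sketch below, the censored version keeps all cases in the data with values capped at the threshold, while the truncated version drops those cases entirely.

# hypothetical illustration of censoring vs. truncation at an upper limit of 800
set.seed(2)
ystar <- rnorm(200, mean = 640, sd = 100)  # latent scores

censored <- pmin(ystar, 800)     # all 200 cases kept; values above 800 recorded as 800
truncated <- ystar[ystar < 800]  # cases at or above 800 removed from the data entirely

length(censored)   # 200
length(truncated)  # fewer than 200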

Tobit regression
Below we run the tobit model, using the vglm function of the VGAM package.
summary(m <- vglm(apt ~ read + math + prog, tobit(Upper = 800), data = dat))

##
## Call:
## vglm(formula = apt ~ read + math + prog, family = tobit(Upper = 800),
## data = dat)
##
## Pearson Residuals:
## Min 1Q Median 3Q Max
## mu -2.6 -0.76 -0.051 0.79 4.1
## log(sd) -1.1 -0.62 -0.369 0.25 5.4
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept):1 209.6 32.457 6.5
## (Intercept):2 4.2 0.053 79.4
## read 2.7 0.618 4.4
## math 5.9 0.705 8.4
## proggeneral -12.7 12.355 -1.0
## progvocational -46.1 13.770 -3.4
##
## Number of linear predictors: 2
##
## Names of linear predictors: mu, log(sd)
##
## Dispersion Parameter for tobit family: 1
##
## Log-likelihood: -1041 on 394 degrees of freedom
##
## Number of iterations: 4

In the output above, the first thing we see is the call; this is R reminding us what
model we ran, what options we specified, etc.
The table labeled coefficients gives the coefficients, their standard errors, and the z-
statistic. No p-values are included in the summary table, but we show how to calculate
them below. Tobit regression coefficients are interpreted in a similar manner to OLS
regression coefficients; however, the linear effect is on the uncensored latent variable,
not the observed outcome. See McDonald and Moffitt (1980) for more details.

For a one unit increase in read, there is a 2.6981 point increase in the predicted
value of apt.
A one unit increase in math is associated with a 5.9146 unit increase in the
predicted value of apt.
The terms for prog have a slightly different interpretation. The predicted value of
apt is 46.1419 points lower for students in a vocational program than for students
in an academic program.
The coefficient labeled "(Intercept):1" is the intercept or constant for the model.
The coefficient labeled "(Intercept):2" is an ancillary statistic. If we exponentiate this
value, we get a statistic that is analogous to the square root of the residual variance
in OLS regression. The value of 65.6773 can be compared to the standard deviation of
academic aptitude, which was 99.21, a substantial reduction (see the short check after this list).
The final log likelihood, -1041.0629, is shown toward the bottom of the output; it can be
used in comparisons of nested models.
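As a quick check of the ancillary statistic described above, we can exponentiate the second intercept directly; this reproduces the 65.6773 value discussed in the list.

# exponentiate the estimated log(sd), the coefficient named "(Intercept):2"
exp(coef(m)[["(Intercept):2"]])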

Below we calculate the p-values for each of the coefficients in the model. We calculate the p-
value for each coefficient using the z values and then display them in a table with the coefficients.
The coefficients for read, math, and prog = 3 (vocational) are statistically significant.
ctable <- coef(summary(m))
pvals <- 2 * pt(abs(ctable[, "z value"]), df.residual(m), lower.tail = FALSE)
cbind(ctable, pvals)

## Estimate Std. Error z value pvals


## (Intercept):1 209.555 32.45655 6.456 3.157e-10
## (Intercept):2 4.185 0.05268 79.432 1.408e-244
## read 2.698 0.61808 4.365 1.625e-05
## math 5.915 0.70480 8.392 8.673e-16
## proggeneral -12.716 12.35467 -1.029 3.040e-01
## progvocational -46.142 13.76971 -3.351 8.831e-04
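Because the reported statistics are z values, an asymptotically equivalent alternative is to use the standard normal distribution instead of the t distribution; with 394 residual degrees of freedom the two give essentially identical p-values. A minimal sketch:

# large-sample p-values from the standard normal distribution
pvals_z <- 2 * pnorm(abs(ctable[, "z value"]), lower.tail = FALSE)
cbind(ctable, pvals_z)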

We can test the significance of program type overall by fitting a model without program in it and
using a likelihood ratio test.

m2 <- vglm(apt ~ read + math, tobit(Upper = 800), data = dat)

(p <- pchisq(2 * (logLik(m) - logLik(m2)), df = 2, lower.tail = FALSE))

## [1] 0.003155

The LRT with two degrees of freedom is associated with a p-value of 0.0032, indicating that the
overall effect of prog is statistically significant.
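If your installed version of VGAM provides the lrtest() method for vglm objects (an assumption worth checking), the same comparison can be run in a single call:

# likelihood ratio test of the nested models, assuming VGAM's lrtest() is available
lrtest(m, m2)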
Below we calculate the upper and lower 95% confidence intervals for the coefficients.
b <- coef(m)
se <- sqrt(diag(vcov(m)))

cbind(LL = b - qnorm(0.975) * se, UL = b + qnorm(0.975) * se)

## LL UL
## (Intercept):1 145.941 273.169
## (Intercept):2 4.081 4.288
## read 1.487 3.909
## math 4.533 7.296
## proggeneral -36.930 11.499
## progvocational -73.130 -19.154
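Recent versions of VGAM also supply a confint() method for vglm objects that computes Wald-type intervals; if your installed version has it (an assumption, so verify before relying on it), this is a one-line alternative:

# Wald confidence intervals via VGAM's confint() method, if available
confint(m)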

We may also wish to examine how well our model fits the data. One way to start is with plots of
the residuals to assess their absolute as well as relative (Pearson) values and assumptions such
as normality and homogeneity of variance.
dat$yhat <- fitted(m)[,1]
dat$rr <- resid(m, type = "response")
dat$rp <- resid(m, type = "pearson")[,1]

par(mfcol = c(2, 3))

with(dat, {
plot(yhat, rr, main = "Fitted vs Residuals")
qqnorm(rr)
plot(yhat, rp, main = "Fitted vs Pearson Residuals")
qqnorm(rp)
plot(apt, rp, main = "Actual vs Pearson Residuals")
plot(apt, yhat, main = "Actual vs Fitted")
})
The graph in the bottom right shows the predicted, or fitted, values plotted against the actual values. This
can be particularly useful when comparing competing models. We can calculate the correlation
between these two, as well as the squared correlation, to get a sense of how accurately our model
predicts the data and how much of the variance in the outcome is accounted for by the model.
# correlation
(r <- with(dat, cor(yhat, apt)))

## [1] 0.7825

# variance accounted for
r^2

## [1] 0.6123

The correlation between the predicted and observed values of apt is 0.7825. If we square this
value, we get the squared multiple correlation; this indicates that the predicted values share 61.23% of
their variance with apt.
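Since this statistic is useful for comparing competing models, we can compute the same quantity for the reduced model m2 fit earlier; a short sketch:

# the same fit statistic for the reduced model (without prog), for comparison
yhat2 <- fitted(m2)[, 1]
cor(yhat2, dat$apt)^2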
References
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand
Oaks, CA: Sage Publications.
McDonald, J. F., and Moffitt, R. A. 1980. The Uses of Tobit Analysis. The Review of Economics and
Statistics 62(2): 318-321.
Tobin, J. 1958. Estimation of relationships for limited dependent variables. Econometrica 26: 24-36.
