MICHAEL CLARK
CENTER FOR SOCIAL RESEARCH
UNIVERSITY OF NOTRE DAME

INTRODUCTION TO R: A SECOND COURSE

Contents

Preface
What You're Getting with R
Dealing with Data
  Introduction to Dealing With Data
  Reading in Text Data
    Fixed Format
    Tab-Delimited Format
    Comma-Separated Values
  Reading in Data from Other Programs
  The R-commander Package
  Verification
    Checking random observations
    Descriptive Statistics
  Summary
Data Manipulation & Management
  Data Formats
    Vectors
    Lists
    Matrices
    Arrays
    Data Frames
  Initial Examination of Data
  Graphical Techniques
  Summary
Analysis
  A Simple Group Comparison
  Linear Regression Example
    Creating the Model Object
    Summary Results
    Plotting the Model Object
    Statistical Examination of the Model
    Prediction
  Beyond the Ordinary
  Using Other Functions & Packages
Other Topics
  Using the Parallel Package
  Using Other Programs
Conclusion
Appendix
  Scenario
  Question
  Goal
  Model

Preface

Current draft: November 28, 2012.

This is the second session of a short course providing an introduction to using the R statistical environment with an eye toward applied social science research. No knowledge beyond the previous course is assumed; however, the basics covered there (creating and manipulating objects, obtaining and using various R packages, and similar introductory details) are required. As a preliminary step, you should go through the first course handout, install R, install a couple of packages, and play around with the code provided. This handout was created in RStudio using R 2.15.2.

What You're Getting with R

Let's start as we did in the first course by just chatting a bit about R and what it is capable of. The following list was chosen to give a sense of the breadth of R's capabilities:

- Additions and advancements to all standard analyses, including many cutting-edge, just-published techniques
- Structural equation modeling
- Data mining/predictive analytics/machine learning and related
- Natural language processing and text mining
- Complex survey analysis
- Network analysis and graphs
- Image analysis
- Statistical genetics/bioinformatics/high-throughput screening
- Bayesian inference for a wide variety of analyses
- Reproducible research/live document updating
- Ability to easily use other statistical packages within the R environment
- Ability to take advantage of other programming languages (e.g. C, Python)
- Mapping and geospatial analysis
- Web capabilities and interactivity
- Great graphing capabilities

Examples of how specific some of the packages can get:

- A package oriented toward vegetation ecologists (vegan; this actually has some functionality that might be useful to anyone, and I have played with it in the past)
- A package that will allow one to create interactive Google Visualization charts (googleVis)
- Another that will automate the creation of Mplus syntax files, run them, extract the information from the model results, and display them, all without leaving R (MplusAutomation)
- One that will produce LaTeX code for highlighted R code, used throughout this document (highlight)

...and much, much more, and most of the list above has been available for quite some time. Most R packages are very well developed, and many even have their own sizable community devoted to further development. And perhaps best of all, it doesn't cost any more to add functionality.
While some of these may seem like things you wouldn't use, it is often the case that packages from other disciplines, or those built for other purposes, have functionality one would want for one's own project. Learning such functions can expose one to new ways of thinking about and exploring one's own data. In short, one can learn a lot just by using R.

Dealing with Data

Introduction to Dealing With Data

It's pretty well impossible to expect anyone to use a package they cannot get their data into easily, and the novice to R will find that starting it up provides no details on how to go about this. The closest thing in the File menu is the option to load an R workspace; the associated files are *.RData files, which are not the data sets people in the social sciences are used to. R can handle typical data formats with ease, as other stats packages do, and the following provides detail.

Reading in Text Data

Much of the data one comes across will start as a text file and then be imported into a statistical package. While this is convenient, since any computer will have some program that can read the data, importing it into a statistical package can be problematic. Issues can arise from various sources: the way the data was recorded, the information relating how the data is to be read (hopefully becoming less of an issue as standards are employed and requirements are put in place for data storage and access), the package being used to read in the data, and the various choices and knowledge the user is responsible for. Two common formats are fixed and delimited text files.

Fixed Format

Fixed format tends to throw some students for a loop: they download the data, open it up, and are confronted with a wall of text, i.e. just one big block of typically numeric information. A typical view of fixed format data:

01234567890123456789
12345678901234567890

On top of this, a single observation may span more than one row within the file. In general these files are not decipherable just by looking at them, and sadly, codebook quality is highly variable (regardless of source); a poor codebook can make a fixed format file unreadable. On the plus side, with the right information, any statistical package can read it in, and R is no exception. Furthermore, R can cut down on the tedium necessarily associated with these files. The key information one needs is how many columns pertain to each variable. In R, the function to read such data is read.fwf, and an example is given below. The raw file contains:

1234.56
987.654
12345.6

The code below will first create and then read this data. When trying this yourself you may find it easier to first create the file as a separate enterprise and then just use the last line, replacing ffdat with something like "C:/wheredaleat/ffdat.txt", as you will normally be dealing with a separate file.

ffdat <- tempfile()
cat(file = ffdat, "1234.56", "987.654", "12345.6", sep = "\n")
read.fwf(ffdat, widths = c(1, 2, 4))
##   V1 V2     V3
## 1  1 23  4.560
## 2  9 87  0.654
## 3  1 23 45.600

A brief summary before going too much further. The first line of the code creates an empty temporary file. The second line fills the file with the numeric information listed, using an appropriate line separator. The final line reads this data in and displays it. While this is fine for demonstration purposes, you'd actually be assigning the data to an object so that it can be used further.
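To make that last point concrete, a minimal sketch; the file path is the hypothetical one mentioned above, so substitute the location of your own file:

ffdata <- read.fwf("C:/wheredaleat/ffdat.txt", widths = c(1, 2, 4))
str(ffdata)  # verify what was read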
Tab-Delimited Format

Delimited files are the sort of text file one will more likely deal with, as data that is made publicly available often starts out in spreadsheet form or is assumed to end up there. If you open up a tab-delimited file, it's more obvious what's going on, unlike what you see in fixed format files. This makes an initial eyeball comparison between the post-import data and the original file easier, and in general these files are easier to import into statistical packages. To continue with the above, the tab-delimited form of the data might look like the following (often the file extension is .dat):

V1 V2 V3
1 23 4.560
9 87 0.654
1 23 45.600

Now we can see what's going on already, although often you will not get a header row of variable names, and it will be important to tell whatever statistical package you are using whether the first line is a header, or the data won't be read correctly. (You might also see some alignment difference between the header and subsequent rows, but this is inconsequential for the purposes of importing the data.) In R, the default function is read.table. So continuing with the previous example, assuming you have the data somewhere on your machine already, the code is simply something like the following:

mydat <- read.table("C:/myfile.dat", header = TRUE)

Comma-Separated Values

Yet another format one will likely use is one with comma-separated (comma-delimited) values. (As an aside, if you have MS Excel installed, .csv files are typically associated with it.) One can think of it as just like the tab-delimited type, but with commas separating the values instead of spaces or tabs. I often use it when transferring data between programs, as some programs will have issues reading spaces correctly, whereas a comma is hard to misinterpret. I won't spend much time with this, as importing is essentially the same as with read.table; the appropriate function is read.csv.

V1,V2,V3
1,23,4.560
9,87,0.654
1,23,45.600

Reading in Data from Other Programs

Dealing with data from other packages can be a fairly straightforward procedure compared to text files, though typically you may have to specify an option or two to get things the way you want. The foreign package is required here, and it might be a good idea to have it load when you start up R, as you will use it often. (In your R/etc folder you will find the Rprofile.site file; put the line library(foreign) somewhere in there.) You will use the functions from the package in the same way as read.table above; for example, to read a Stata file you'd use the function read.dta, and to read an SPSS file you'd use the read.spss function. Going from one statistical package to another will almost certainly result in some changes to the data or loss of information. For example, when reading in an SPSS file you may have to choose whether to import factors as numeric, Stata data files can hold meta-information that other programs don't read, and so on. This is not to say you will lose vital data, just that some choices will have to be made in the process, as different programs deal with data files in different ways and there is no one-to-one matching of file structure.
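To round out the functions just named, a brief sketch of typical calls; the file paths here are made up, and the to.data.frame argument to read.spss returns a data frame rather than a list:

# comma-separated values
mydat <- read.csv("C:/myfile.csv", header = TRUE)

# Stata and SPSS files via the foreign package
library(foreign)
statadat <- read.dta("C:/myfile.dta")
spssdat <- read.spss("C:/myfile.sav", to.data.frame = TRUE)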
The R-commander Package

The Rcmdr package can be a great help in getting used to the process of importing data for those new to R. Rcmdr allows one to use a menu system to do analysis and deal with data, and its layout will definitely feel familiar to those coming from some of the other standard stat packages. In the Data menu there are options for importing text, Stata, SPSS, Excel and other files, and it is usually straightforward.

Verification

It is always important to verify that the data you now have is what it should be. As with any statistical program, R isn't going to mistake a two for a five or anything like that, but there may be issues with the original format that might result in rounding problems, missing values, etc. As an example, if 99 represents missing on some variable, it's important that the variable is imported as such and not treated as a numeric value, or the change should be made immediately after import. Programs should be able to read data in appropriately most of the time, but it is a good idea to double check.

Checking random observations

A first step to verifying the data can be comparing random observations in the new data with those in the old. (A simple check for whether two objects are the same can be accomplished via the all.equal function.) Given a data set with 200 observations and 100 variables, the following would pull out five cases and their values on a random selection of five variables.

mydata[sample(1:200, 5), sample(1:100, 5)]

If there are specific cases you want to look at, you would note the exact row and column numbers, but we will be talking about that and related indexing topics in a bit.

Descriptive Statistics

When importing data from another statistical package, another way to make an initial check of the imported data is to run descriptive statistics on some of it. The results should be identical to those of the source data. One can use the summary function as a first step, for specific variables or the whole data set, once it is imported.

Summary

Importing data into any statistical package can actually be very difficult at times, so don't be surprised if you hit a snag. While there are even programs dealing with this specifically, you shouldn't feel the need to purchase them these days, as most statistical programs generally behave well with each other's data. Sometimes the easiest path will involve using multiple packages. I often save data from a particular statistical package as a text file, which is then easily imported into another package; for example, I might go from Excel to text to R. Whatever the format and whatever route you use, make sure to verify the data in some fashion after import.
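As a small illustration of these checks, a hedged sketch; the object names are hypothetical, with newdat the freshly imported data and olddat the same variable obtained from the source program:

summary(newdat$somevar)                    # compare against the source descriptives
all.equal(olddat$somevar, newdat$somevar)  # TRUE if the import matches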
Data Manipulation & Management

I won't spend a whole lot of time here, as one can review the notes for the first course for more detail, but there are a couple of things worth mentioning.

Data Formats

There are a variety of formats for dealing with data, each of which has its use in analysis. Brief descriptions follow.

Vectors

Single items of information, such as a number, are scalars, but usually we have many that are concatenated into a vector representing some variable of interest. (In math and stats texts these would just be numeric, but for our purposes I'm just distinguishing between one item and several.) Vectors in R can be of several types, though the elements of a vector must all be of the same kind. Unlike the way one typically interacts with a single data set, one can work with multiple vectors from any number of sources. In this sense, one could for example have a data set as one object and use variables from it to predict a vector of values that is a completely separate object. Types, or modes, of vectors include:

NUMERIC/INTEGER. Numeric data is typically stored as a numeric or integer mode object, and we can coerce vectors to these types using, e.g., the function as.numeric. It usually will not matter to the analysis how a truly numeric variable is stored.

x = rnorm(10)

LOGICAL. A vector taking on only TRUE or FALSE values, which is not the same as a factor or character. One will often use these in one's own functions, conditional statements, etc. Note that these can also be treated as though they were numeric, taking on values of 0 and 1 for FALSE and TRUE respectively.

x2 = x > 1
x2
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
sum(x2)
## [1] 3
x2 * 2
## [1] 0 0 0 0 0 2 2 0 2 0

CHARACTER. A vector of character strings. Note that these must be converted to a factor if they are to be used as a variable in an analysis.

as.character(x2)

FACTOR. In other packages we typically enter such information directly, or numerically with labels, and it is no different here.

as.factor(x2)
# Example from scratch:
x <- rep(c(1:3, 5))
x2 <- factor(x, levels = 1:3, labels = c("a", "b", "c"), ordered = T)

On the other hand, one may find R's default handling of ordered categorical information imported from other packages problematic, and there is often something that comes up in dealing with it. You may end up seeing things like blanks as factor levels, needing to do something with the level ordering for graphical display (the default ordering in plots is by alphabetical ordering of labels), converting a factor to a numeric variable for some purposes, etc. These issues don't seem like a big deal, and they aren't normally, but they will come up (and for those who work with social science data every day, they come up all the time) and require additional steps to get things just the way you want. Dealing with factor variables in R may take some getting used to, but generally one can go about it like one does elsewhere.

Lists

Lists are simply collections of objects, be they single items, several vectors, multiple datasets, or even combinations of those. While one probably doesn't typically deal with such a thing in other stats packages, lists can make iterative operations very easy and are definitely something to utilize in analysis within R. Furthermore, most output from an analysis will be a list object, so it is something you'll want to know how to access the elements of. The following shows the creation of a list, ways to extract components of the list and elements of the list components, and the use of the lapply function and its counterpart sapply (sapply converts the result to a vector or matrix if possible) on the list components.

xlist = list(1:3, 4:6)
xlist
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4 5 6
names(xlist) = c("A", "B")
xlist[["A"]]
## [1] 1 2 3
xlist$A
## [1] 1 2 3
xlist$A[2]
## [1] 2
# lapply(xlist, sum)  # not shown
sapply(xlist, sum)
##  A  B
##  6 15
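To make the iterative-operations point concrete, a small sketch with made-up data: split produces a list of vectors, and lapply/sapply then operate on each element.

datlist <- split(rnorm(30), rep(c("g1", "g2", "g3"), each = 10))  # a list of three vectors
lapply(datlist, summary)  # a summary for each group, returned as a list
sapply(datlist, mean)     # group means, simplified to a named vector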
Matrices

Matrices in R are like the data sets we usually deal with, but they hold vectors of only one mode (for simple vectors, mode is the same as class), and you may find that various functions only work with matrices (the function data.matrix will convert a data frame to a matrix). When you look at the x object after running the code below, note the row and column designations. You'll see, for example, [,1] for the first column and [5,] for the fifth row. This is how you extract rows and columns from a matrix, e.g. mymatrix[,2] or mymatrix[1:5,]. Use a number both before and after the comma to extract specific elements of the matrix. In the following we look at the first and fifth through eighth rows, and all but the first, third and fifth columns.

x = matrix(rnorm(100), ncol = 10, byrow = T)
x[c(1, 5:8), -c(1, 3, 5)]

Arrays

Researchers sometimes forget, or are simply unaware, that data is not restricted to one or two dimensions. Arrays can be of any dimension (a two-dimensional array is the same thing as a matrix, which could be converted to a data frame as well), and often the results of some function operation will be an array object. The first example creates an array of two 2x25 matrices; the second uses the apply function on the iris3 array (available in any installation of R) to get column means (with 20% trim) for each type of flower.

x = array(rnorm(100), dim = c(2, 25, 2))
apply(iris3, c(2, 3), mean, trim = 0.2)
##          Setosa Versicolor Virginica
## Sepal L.   5.00      5.910     6.547
## Sepal W.   3.41      2.797     2.963
## Petal L.   1.46      4.307     5.493
## Petal W.   0.22      1.340     2.023

Data Frames

At last we come to the object type of perhaps most interest in applied social science research. Data frames are the R objects we refer to as data sets in other packages. While they look like matrices whose columns may be vectors of any mode, in R they are treated more like a list of variables of the same length (for example, we can extract columns with the dollar sign operator: mydata$var1). It may not be obvious to those first introduced to R, but simply doing something like data.frame(x) just displays the result of making x a data frame; it does not change the x object permanently unless a new object is created or the old one overwritten:

x = data.frame(x)  # overwrites it
y = data.frame(x)  # creates a new object

As an exercise, rerun any of the previous examples that created a vector, list, matrix or array, and convert them to a data frame by typing data.frame(x); a minimal sketch follows.
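A sketch of that exercise using the matrix from above; the column names X1, X2, ... are what data.frame assigns by default:

x <- matrix(rnorm(100), ncol = 10)
xdf <- data.frame(x)  # a new object; x itself is unchanged
str(xdf)
xdf$X1  # columns can now be extracted with the dollar sign operator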
Initial Examination of Data

Let's get to messing around with some data, shall we? I will not spend a lot of time on initial examination of the data, as data indexing, subsetting, and summary statistics were covered in the first course and the previous section, but this will serve as a brief review. We will read in data regarding high school student test scores on different subjects. The following pulls it in from the UCLA ATS website (where the variable labels are described), examines the first few rows and the general structure, and obtains some descriptive statistics on the variables.

hs1 <- read.table("http://www.ats.ucla.edu/stat/data/hs1.csv", header = T, sep = ",")
# head(hs1)
str(hs1)
summary(hs1)

# the describe function is much better however
library(psych)
describe(hs1)
##           var   n   mean    sd median trimmed   mad min max range  skew
## female      1 200   0.55  0.50    1.0    0.56  0.00   0   1     1 -0.18
## id          2 200 100.50 57.88  100.5  100.50 74.13   1 200   199  0.00
## race        3 198   3.42  1.04    4.0    3.64  0.00   1   4     3 -1.55
## ses         4 200   2.06  0.72    2.0    2.07  1.48   1   3     2 -0.08
## schtyp      5 200   1.16  0.37    1.0    1.07  0.00   1   2     1  1.84
## prgtype*    6 200   1.73  0.84    1.0    1.66  0.00   1   3     2  0.55
## read        7 200  52.23 10.25   50.0   52.03 10.38  28  76    48  0.19
## write       8 200  52.77  9.48   54.0   53.36 11.86  31  67    36 -0.47
## math        9 200  52.65  9.37   52.0   52.23 10.38  33  75    42  0.28
## science    10 195  51.66  9.87   53.0   51.83 11.86  26  74    48 -0.18
## socst      11 200  52.41 10.74   52.0   52.99 13.34  26  71    45 -0.38
## prog       12 200   1.73  0.84    1.0    1.66  0.00   1   3     2  0.55
##          kurtosis   se
## female      -1.98 0.04
## id          -1.22 4.09
## race         0.80 0.07
## ses         -1.10 0.05
## schtyp       1.40 0.03
## prgtype*    -1.37 0.06
## read        -0.66 0.72
## write       -0.78 0.67
## math        -0.69 0.66
## science     -0.59 0.71
## socst       -0.57 0.76
## prog        -1.37 0.06

Often we want to get separate information for different groups. Here we'll look at several scores broken down by gender, and a single score broken down by two grouping variables, gender and program type. (At this point the 0/1 gender indicator is still named female, as imported.)

# describe.by(hs1[,7:11], group=hs1$female)
describe.by(hs1[, 7:11], group = hs1$female, mat = T)  # a different view
##          item group1 var   n  mean     sd median trimmed    mad min max
## read1       1      0   1  91 52.82 10.507     52   52.82 11.861  31  76
## read2       2      1   1 109 51.73 10.058     50   51.42 10.378  28  76
## write1      3      0   2  91 50.12 10.305     52   50.44 11.861  31  67
## write2      4      1   2 109 54.99  8.134     57   55.57  7.413  35  67
## math1       5      0   3  91 52.95  9.665     52   52.45 10.378  35  75
## math2       6      1   3 109 52.39  9.151     53   52.07 10.378  33  72
## science1    7      0   4  86 52.88 10.754     55   53.26 11.861  26  74
## science2    8      1   4 109 50.70  9.039     50   50.80  8.896  29  69
## socst1      9      0   5  91 51.79 11.334     51   52.45 14.826  26  71
## socst2     10      1   5 109 52.92 10.234     56   53.34  7.413  26  71
##          range     skew kurtosis     se
## read1       45  0.04598  -0.7867 1.1014
## read2       48  0.31898  -0.5456 0.9634
## write1      36 -0.17694  -1.1681 1.0803
## write2      32 -0.58190  -0.5024 0.7791
## math1       40  0.32034  -0.6947 1.0131
## math2       39  0.23145  -0.7569 0.8765
## science1    48 -0.31428  -0.6938 1.1597
## science2    40 -0.12892  -0.5350 0.8657
## socst1      45 -0.36525  -0.7158 1.1881
## socst2      45 -0.34843  -0.5268 0.9803

# describe.by(hs1$science, group=list(hs1$female, hs1$prgtype))

At this point I'll remind you to check out the first course notes regarding the various means of subsetting the data and examining different parts of it. Moving on, notice that some of the variables were imported as integers but would be better off treated as factors with associated labels (this will have other benefits in analysis, for example automatic dummy coding). Let's go ahead and change them:
# the 0/1 gender indicator is named 'female' on import; rename it before relabeling
names(hs1)[1] <- "gender"

hs1[, c(1, 3:6)] <- lapply(hs1[, c(1, 3:6)], as.factor)

# change the labels
levels(hs1$gender) = c("male", "female")
levels(hs1$race) = c("Hispanic", "Asian", "Black", "White", "Other")
levels(hs1$ses) = c("Low", "Med", "High")
levels(hs1$schtyp) = c("Public", "Private")
levels(hs1$prgtype) = c("Academic", "General", "Vocational")

# Alternate demo to give a sense of how one can use R
vars <- c("gender", "race", "ses", "schtyp", "prgtype")
labs <- list(gender = c("male", "female"),
             race = c("Hispanic", "Asian", "Black", "White", "Other"),
             ses = c("Low", "Med", "High"),
             schtyp = c("Public", "Private"),
             prgtype = c("Academic", "General", "Vocational"))
for (i in 1:5) {
    hs1[, vars[i]] = factor(hs1[, vars[i]], labels = labs[[i]])
}

str(hs1)
## 'data.frame': 200 obs. of 11 variables:
##  $ gender : Factor w/ 2 levels "male","female": 1 2 1 1 1 1 1 1 1 1 ...
##  $ id     : int 70 121 86 141 172 113 50 11 84 48 ...
##  $ race   : Factor w/ 5 levels "Hispanic","Asian",..: 4 4 4 4 4 4 3 1 4 3 ...
##  $ ses    : Factor w/ 3 levels "Low","Med","High": 1 2 3 3 2 2 2 2 2 2 ...
##  $ schtyp : Factor w/ 2 levels "Public","Private": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prgtype: Factor w/ 3 levels "Academic","General",..: 2 3 2 3 1 1 2 1 2 1 ...
##  $ read   : int 57 68 44 63 47 44 50 34 63 57 ...
##  $ write  : int 52 59 33 44 52 52 59 46 57 55 ...
##  $ math   : int 41 53 54 47 57 51 42 45 54 52 ...
##  $ science: int 47 63 58 53 53 63 53 39 NA 50 ...
##  $ socst  : int 57 61 31 56 61 61 61 36 51 51 ...

# summary(hs1)

Let's look at the correlation of the test scores. I use the second argument so that it will only use observations with complete data on all five (case-wise deletion). While not the case here, even moderately sized correlation matrices can get a bit unwieldy, with seven decimal places further cluttering up the view, so I provide an example of rounding also. (What rounding you see in output is package dependent.)

cormat <- cor(hs1[, 7:11], use = "complete")
round(cormat, 2)
##         read write math science socst
## read    1.00  0.60 0.65    0.62  0.62
## write   0.60  1.00 0.62    0.57  0.60
## math    0.65  0.62 1.00    0.62  0.53
## science 0.62  0.57 0.62    1.00  0.45
## socst   0.62  0.60 0.53    0.45  1.00

Now we are ready to start examining the data graphically. As a preview, examine the correlation matrix just created with the following code.

library(corrplot)
corrplot(cormat, "ellipse", type = "lower", diag = F)

[Figure: corrplot ellipse display of the correlation matrix]
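Before turning to graphics, a quick cross-tabulation is a simple way to confirm that the relabeling above took hold; a minimal check:

table(hs1$gender, hs1$prgtype)  # counts by the new factor labels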
Graphical Techniques

It is an odd time at present, in which we have all manner of technologies available to produce any number of visualizations, yet people are beholden to journals still pretending to save ink like it's 1950. This results in researchers still producing graphs that enlighten little more than what could be gained from a single sentence. (In perhaps one of the more extreme cases, I recently saw a journal article present a graphic of words from the classic Stroop task in different shades of black and white.) Good graphs are sufficiently, but not overly, complex, and are done in a way that plays on the reader's intuition and clarifies the material presented. This is not an easy task, but in my opinion R can make the task more easily attainable than other statistics packages, while allowing for possibilities that most would not think of or be able to do in other packages.

While you have a great deal of control with basic R plotting functions, by default the results may not look like much. For example, to examine the test score distributions we might do something like the following. Rather than write out a line of code for every histogram we might want, we'll set the dimensions of the graphics device to hold all five, then populate it with the histogram of each variable. We could have used apply or lapply here to accomplish the same thing in a single line of code, but by default the graphs would not retain the variable names. Note also that there are many, many options we could control that aren't tweaked here. In the following I use the sapply function to deal with the columns of interest, and within it I create a function on the fly to plot histograms of those variables. The console output left behind is meaningless for our purposes; it refers to the contents of objects created by the hist function.

par(mfrow = c(3, 2))  # set panels for the graph
sapply(7:11, function(x) hist(hs1[, x], main = colnames(hs1)[x], xlab = "", col = "lemonchiffon"))
par(mfrow = c(1, 1))  # reset

[Figure: base R histograms of read, write, math, science and socst]

For a comparison, we'll give a preview of ggplot2 (and reshape2) to accomplish the same thing. Note how the data is transformed to long format first. Again, many available options are not noted; this just shows the basics.

library(ggplot2)
library(reshape2)
graphdat = melt(hs1[, 7:11])
head(graphdat)
ggplot(aes(x = value), data = graphdat) + geom_histogram() + facet_wrap(~variable, scales = "free")

[Figure: ggplot2 faceted histograms of the five test scores]
To examine both univariate and bivariate relationships simultaneously, one might instead use the scatterplot matrix capability of the car package. First is an example of a single scatterplot (scatterplot), followed by the matrix (scatterplotMatrix). In the first, the univariate information is provided in the margins. With the matrix, the diagonals give us the univariate distributions, while the off-diagonal scatterplots are supplemented by loess curves to help spot potential curvilinear relationships.
The coplot function that follows provides another example that one can use to examine multivariate information in a fairly straightforward manner.

library(car)
scatterplot(hs1$math, hs1$science)
scatterplotMatrix(hs1[, 7:11], pch = 19, cex = 0.5, col = c("green", "red", "grey80"))
coplot(math ~ read | as.factor(race) * as.factor(gender), data = hs1)

[Figures: scatterplot of science against math with marginal distributions; scatterplot matrix of the five test scores; coplot of math against read conditioned on race and gender]

A powerful set of graphing tools is provided by the ggplot2 package (see the ggplot2 website; the lattice package is also a popular alternative). Along with associated packages, one can fairly easily reshape the data into a manageable format and quickly produce graphs of good quality. While we won't go into too much detail here, the things that will probably take some getting used to initially are thinking about the creation of your visualization via adding layers, each of which applies a different graphical nuance, and perhaps also dealing with the data in melted form (though the latter is not required for plotting). In the following, the data is first inspected with several plots to get a sense of its structure and to see if there are any notable problems. Afterwards the data is melted, which can make plotting easier. To begin, we first create the base plot g, which is actually empty. To it we add a plot of the overall means for each test. Just as a demonstration, we do a further breakdown by gender. Finally, the results are further conditioned on socio-economic status (note that most of the non-white cells have very low N).

library(ggplot2)
library(reshape2)

# visualize the data itself
ggstructure(hs1[, 7:11])
ggfluctuation(xtabs(~race + gender, hs1))
ggmissing(hs1)

[Figures: ggstructure display of the test scores; fluctuation plot of race by gender; missingness plot]

# melt the data
hs2 <- melt(hs1, id.vars = c("id", "gender", "race", "ses", "schtyp", "prgtype"),
            measure.vars = c("read", "write", "science", "math", "socst"),
            variable.name = "Test")
head(hs2)

# create base plot
g <- ggplot(data = hs2, aes(x = Test, y = value))

# gray dots are missing on race
g + stat_summary(fun.y = "mean", geom = "point")

# facet by gender
g + stat_summary(fun.y = "mean", geom = "point", size = 4) + facet_grid(. ~ gender)

# add another facet
g + stat_summary(fun.y = "mean", geom = "point", data = hs2, size = 4) + facet_grid(gender ~ ses)

Summary

Once you get the hang of it, great visualizations can be produced relatively easily within the R environment, and there are many packages specific to certain types of graphs (e.g. networks, maps), as well as interactive and dynamic graphics, web-related functionality, etc. In short, one doesn't need a separate program to produce high-quality graphics, and one has great control over every aspect of the production process. It is definitely worth getting used to R's capabilities; watch as your standard plots soon fall by the wayside.

Analysis

In this section we'll provide a few examples of how to go about getting some statistical results from your data. The basic approach illustrated should get you pretty far, in that many of the analyses in R packages work in a similar manner. For regression-type analyses you will get the same type of output, though it may look a little different.

A Simple Group Comparison

To begin, we'll just do a simple group comparison on one of the tests from the previous data.
For example, let's see if there is a statistical difference between male and female observations with regard to the reading test from the previous data. We'll use the t.test function. (By default a Welch correction for unequal variances is applied.)

t.test(read ~ gender, hs1)
##
##  Welch Two Sample t-test
##
## data:  read by gender
## t = 0.7451, df = 188.5, p-value = 0.4572
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.796  3.977
## sample estimates:
##   mean in group male mean in group female
##                52.82                51.73

As a further exercise in programming, we'll do this for each of the tests using the powerful apply function. (Other approaches would be more statistically appropriate.) Conceptually it's simply doing five t-tests, and the code is simple in the sense that it only requires one line and the apply function to pull off. But as we are new to R, I'll break it down a little further by first just calculating a simple mean for each column to show how apply works. So in the following, we apply, to the columns (2), the mean function (mean), with the additional argument to remove missing values (na.rm = T). I verify this with the built-in colMeans function after.

apply(hs1[, 7:11], 2, mean, na.rm = T)
##    read   write    math science   socst
##   52.23   52.77   52.65   51.66   52.41
colMeans(hs1[, 7:11], na.rm = T)
##    read   write    math science   socst
##   52.23   52.77   52.65   51.66   52.41

Back to the t-test. I do exactly what we just did, but in this case I create a function on the fly. This constructed function only requires one input (y), which here will be a single vector pertaining to the reading, writing, etc. test score columns. The y itself is fed to the t.test function as the outcome of interest. Once you start getting comfortable with the apply functions (it's not certain this is possible; just as soon as you feel like you've got the hang of it you'll find quirks, so check out the plyr package as an alternative), you can try even more succinct approaches, as in the second example.

myt <- function(y) {
    t.test(y ~ gender, data = hs1)
}
apply(hs1[, 7:11], 2, myt)

# alternate one-liner
# sapply(7:11, function(y) t.test(hs1[,y] ~ hs1$gender), simplify = F)
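If the full t-test printouts are more than you want, the same on-the-fly approach can pull out just a piece of each result, since the object t.test returns is a list; a sketch extracting only the p-values:

sapply(hs1[, 7:11], function(y) t.test(y ~ hs1$gender)$p.value)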
Linear Regression Example

Next we come to the familiar linear regression model, and as we will see, the way the function and its arguments work is very similar to the t.test function before. Here we will use the lm function and the associated object created by it.

Creating the Model Object

First we will create the model object, and we'll use a new data set. (I again use data from UCLA's ATS website, in this case an example they use for Stata, which allows you to compare and contrast. It's also a reminder to check their website when learning any program, as there are a lot of useful examples there to get you started.) We'll read this Stata file in directly using the foreign library. (You can disregard the warning; the district number variable just has an empty label.) It is a random sample of 400 elementary schools from the California Department of Education's API 2000 data set. This data file contains a measure of school academic performance as well as other attributes of the elementary schools, such as class size, enrollment, and poverty. After importing it, I suggest you use the str and summary functions we used before to start to get a feel for the data. Perhaps produce a simple graph.

library(foreign)
regdata <- read.dta("http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2.dta")

Here the goal is to predict academic performance (api00) from the percentage receiving free meals (meals), the percentage of English language learners (ell), and the percentage of teachers with emergency credentials (emer). Running the regression is fairly straightforward, as follows:

- Create a model object with the lm function. The minimal requirement here is the formula specification, as was done for the t-test, along with the data object name, but if interested you might type ?lm to see what else is possible.
- Examine the model object. This will only produce the coefficients.
- Summarize it. This produces the output one is used to seeing in articles and other packages.

Summary Results

Using the xtable function (from the package of the same name) on the summary object will produce LaTeX code for easy import of the output into a document in professional-looking form. Let's take a look at the summarized model object...

mod1 <- lm(api00 ~ meals + ell + emer, data = regdata)
mod1
##
## Call:
## lm(formula = api00 ~ meals + ell + emer, data = regdata)
##
## Coefficients:
## (Intercept)        meals          ell         emer
##      886.70        -3.16        -0.91        -1.57

summary(mod1)
##
## Call:
## lm(formula = api00 ~ meals + ell + emer, data = regdata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -185.47  -39.95   -3.66   36.45  178.48
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  886.703      6.260  141.65  < 2e-16 ***
## meals         -3.159      0.150  -21.10  < 2e-16 ***
## ell           -0.910      0.185   -4.93  1.2e-06 ***
## emer          -1.573      0.293   -5.37  1.4e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.8 on 396 degrees of freedom
## Multiple R-squared: 0.836, Adjusted R-squared: 0.835
## F-statistic: 673 on 3 and 396 DF, p-value: <2e-16

To begin, you get the actual call that produced the model. This is followed by quantiles of the residuals for initial inspection. Then come the coefficients, and this part is typical of output from other stat packages (the asterisks signify those places where magic was used in the analysis). You might as well get used to seeing < 2.2e-16; other statistical packages will put zero, but R reminds you that the p-value is just really small. The coefficients and related output are what I sometimes refer to as the micro level of the analysis, as we get details about the individual variables. At the macro level we see how the model does as a whole, and as you would see in other packages for standard regression output, you get the residual standard error (in Stata this is labeled Root MSE; in SPSS, the Standard Error of the Estimate), the F-statistic, and the R-squared value along with its adjusted value. Not surprisingly, all three variables are statistically notable predictors of academic achievement, and the model as a whole does well.

If you use the structure function (str) on the model object, you'll find that there are a lot of things already stored and ready to use for further processing. As an example, we can extract the coefficients and fitted values for the model, and at least in some cases one has access to them in one of two ways:

mod1$coef
coef(mod1)
mod1$fitted.values
fitted(mod1)
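Other components are just as accessible; a short sketch using functions mentioned above (confint is part of base R, xtable from the package noted earlier):

residuals(mod1)[1:5]   # the first few residuals
confint(mod1)          # confidence intervals for the coefficients
library(xtable)
xtable(summary(mod1))  # LaTeX code for the coefficient table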
Plotting the Model Object

Having such easy access to the output components is quite handy. Consider the following, which allows one to visualize change in coefficients across groups. It uses lmList from the nlme package to run regressions on separate groups, at which point we can extract the coefficients and plot them to see how they might change across groups. This is not really the best way to go about looking for such things, but it does illustrate the sorts of things one can do fairly easily with the model object; R makes deep exploration of data and models notably easier than other packages. Here we see how a model changes over the ordered categories representing the percentage of the school district receiving free meals.

library(nlme)
modlist2 <- lmList(api00 ~ growth + yr_rnd + avg_ed + full + ell + emer | mealcat,
                   data = regdata, na.action = na.omit)
plot(coef(modlist2), nrow = 2)

It should also be noted that the lm object comes with built-in plots to help one visually inspect the residuals for problems. Just using the basic plot function on an lm object produces four plots that allow one to inspect how the residuals behave (technically one is making use of the plot.lm function). For inspection of outliers, I generally suggest influencePlot from the car package, which produces a single plot that allows one to detect outliers via three different measures. In general, the lm plots have enhanced versions in the car package, along with additional visual diagnostics. Check out the crPlots and avPlots functions to obtain component-plus-residual and added-variable plots.

plot(mod1)
influencePlot(mod1)

Statistical Examination of the Model

Testing Assumptions. Along with visual inspection of the results, traditional tests regarding model behavior are also available. These include tests for normality (e.g. shapiro.test), the Breusch-Pagan test for heteroscedasticity (bptest), the RESET test (resettest), and production of variance inflation factors (vif), to name a few. Good packages to get you started are lmtest and car.

Model Comparison. Often one is also interested in comparing a model to a baseline or a subset of itself in order to draw attention to the unique contributions of certain predictors. This too is fairly straightforward in R, via the add1, drop1, and anova functions. As an example, we'll rerun mod1 with only two predictors and then add the third.

mod1a <- lm(api00 ~ meals + ell, data = regdata)
# (.) is interpreted as 'what is already there'
add1(mod1a, ~. + emer, test = "Chisq")
## Single term additions
##
## Model:
## api00 ~ meals + ell
##        Df Sum of Sq     RSS  AIC Pr(>Chi)
## <none>              1420232 3276
## emer    1     96343 1323889 3250  1.2e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As an alternative, we could have created two model objects, one with and one without emer, and compared them directly via anova, as sketched below.
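A minimal sketch of that alternative; anova performs the corresponding F test comparing the nested models:

anova(mod1a, mod1)  # test of the contribution of emer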
Variable Importance/Comparison. Variable importance is a tricky subject that unfortunately is typically handled poorly, and in some sense is perhaps a misguided endeavor in many settings. (More important would be overall patterns of effects and alignment with theory.) Substantive considerations should come first in determining whether something is important at all given the result. (As a hint, statistical significance is a fairly poor indicator of such a thing.) Some like to use a standardization of the coefficients, but this doesn't come free. If you're going to use statistical significance as an initial indicator of importance, it might seem a bit odd to abandon it when comparing relative importance among a set of predictors. Just because one standardized coefficient is larger than another doesn't mean it is larger statistically speaking, any more than just looking at two means would give you that information. For standard regression, the relaimpo package provides nice functionality for predictor comparison, including a metric, the average semi-partial squared, that nicely decomposes the model R-squared into the individual contributions of the predictors. However, better approaches to determining importance would include implementations of things like shrinkage and cross-validation.

Model Exploration. Whatever theoretical model one has come up with, no matter how sound the thinking that went into it, the simple fact is that nature is a lot more clever than we are. Whenever possible, which is almost always, one should incorporate a straightforward model search into one's analytical approach. Using theory as a basis for providing viable models, and the principle of parsimony to guide the search for the set of best models (one model based on one set of data should never be considered definitive, especially in the realm of social science), one can examine their relative performance given the data at hand. A straightforward approach for regression is provided by, for example, the stepAIC function in the MASS package, and depending on what package you're using, there may be something built in for such things. Others that come to mind are the BMA (Bayesian Model Averaging) package, and one might explore the CRAN Task View for Machine Learning for more exploratory approaches that go well beyond the OLS arena.

Prediction

Prediction allows for validation of a model under new circumstances, as well as more exploration in cases where we do not have new data and so will simulate responses at predictor values of interest. One can readily do so with the basic predict function in R. If one does nothing but apply it to the same data with no options, it is akin to producing the fitted values from the model. But the data we feed it can come from anywhere, and so it is far more useful in that sense. The following produces a predicted value for a school from the low end of the spectrum of values on the predictors, and subsequently from one on the upper end. Given that the predictors were all negatively associated with achievement, what prediction would you make about what to expect?

# x = T and y = T save the model matrix and response respectively
mod1 <- lm(api00 ~ meals + ell + emer, data = regdata, x = T, y = T)
basepred <- predict(mod1)
head(cbind(basepred, fitted(mod1)))  # same
##   basepred
## 1    629.1 629.1
## 2    547.1 547.1
## 3    508.2 508.2
## 4    560.5 560.5
## 5    557.8 557.8
## 6    852.4 852.4

# obtain 25th and 75th percentile values of the predictor variables
lowervals <- apply(mod1$x, 2, quantile, 0.25)
lowervals2 <- data.frame(t(lowervals))  # t transposes
uppervals <- apply(mod1$x, 2, quantile, 0.75)
uppervals2 <- data.frame(t(uppervals))

# -1 removes the intercept column
predlow <- predict(mod1, newdata = lowervals2[-1])
predhigh <- predict(mod1, newdata = uppervals2[-1])
c(predlow, predhigh)
##     1     1
## 775.2 526.8
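predict can also return interval estimates for such profiles; a brief sketch using the same newdata objects (see ?predict.lm for the distinction between the two interval types):

predict(mod1, newdata = lowervals2[-1], interval = "confidence")
predict(mod1, newdata = uppervals2[-1], interval = "prediction")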
Other models will have alternative possibilities for what exactly is predicted (e.g. probabilities), and other packages will often have their own built-in predict functions. As an example from packages about to be mentioned in the next section, a binomial logit model run with the glm function (part of base R's stats package) can provide predictions of probabilities, and the sim function in Zelig has a variety of possibilities depending on the model one chooses, and generally makes this process pretty easy.

Beyond the Ordinary

Most of the time one will likely need more than the standard linear model and the lm function. In general, the question with R is not "Can R do X analysis?" but "Which R package(s) do X analysis?" or simply "How would I do this in R?". Recall that R has over 4000 packages, the bulk of which are specific to particular types of analyses. However, even the base R installation comes with quite a bit of functionality, akin to what one would expect from typical stat packages, and we will start with an example there.

GLM example

The glm function in R provides functionality to examine generalized linear models. In this example (taken from the UCLA ATS website, so feel free to see it performed in other packages there for comparison) we will predict the number of fish a team of fisherfolk catch from how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). In the following, the glm function is used to fit a Poisson regression model to the data. As one can see, the basic format is the same as for lm, complete with summary functions.

glmdata <- read.dta("http://www.stata-press.com/data/r12/fish.dta")
pois_out <- glm(count ~ child + camper + persons, data = glmdata, family = poisson)
summary(pois_out)
##
## Call:
## glm(formula = count ~ child + camper + persons, family = poisson,
##     data = glmdata)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -6.810  -1.443  -0.906  -0.041  16.142
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.9818     0.1523   -13.0   <2e-16 ***
## child        -1.6900     0.0810   -20.9   <2e-16 ***
## camper        0.9309     0.0891    10.4   <2e-16 ***
## persons       1.0913     0.0393    27.8   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 2958.4 on 249 degrees of freedom
## Residual deviance: 1337.1 on 246 degrees of freedom
## AIC: 1682
##
## Number of Fisher Scoring iterations: 6

exp(pois_out$coef)  # exponentiated coefficients
## (Intercept)       child      camper     persons
##      0.1378      0.1845      2.5369      2.9780

This particular example actually exhibits zero inflation, which can be assessed using a variety of packages. For example, one can use a specific function (zeroinfl) from the pscl package, or keep essentially the same setup as glm using vglm (with the zipoisson family) from the VGAM package.
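A minimal sketch of the pscl approach just mentioned (by default, a single-part formula applies the same predictors to both the count and zero-inflation components):

library(pscl)
zip_out <- zeroinfl(count ~ child + camper + persons, data = glmdata)
summary(zip_out)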
LME example

The following shows an example of a mixed model using the lme4 package and its associated lmer function. In this case it is not just the addition of an argument that is required, but a different formula specification to note the random effects. (For generalized mixed models one would use glmer and specify a family, just as with the glm function.) Here is a model looking at reaction times over a period of 10 days in a sleep deprivation study.

library(lme4)
data(sleepstudy)
# lmList(Reaction ~ 1 | Subject, sleepstudy)  # subject means
# xyplot is from the lattice package; library(lattice) if it is not already loaded
xyplot(Reaction ~ Days | Subject, sleepstudy, lty = 1)  # reaction time over days for each subject

[Figure: lattice panel plot of Reaction (roughly 200-400 ms) against Days (0-8), one panel per Subject]

# random effect for subject (random intercepts)
lme_mod_1 <- lmer(Reaction ~ 1 + (1 | Subject), sleepstudy)
lme_mod_1  # note how just printing the lmer model object provides the 'summary' functionality

## Linear mixed model fit by REML 
## Formula: Reaction ~ 1 + (1 | Subject) 
##    Data: sleepstudy 
##   AIC  BIC logLik deviance REMLdev
##  1910 1920   -952     1911    1904
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1278     35.8    
##  Residual             1959     44.3    
## Number of obs: 180, groups: Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   298.51       9.05      33

# summary(lme_mod_1)

lme_mod_2 <- lmer(Reaction ~ Days + (1 + Days | Subject), sleepstudy)  # 'growth model'
lme_mod_2

## Linear mixed model fit by REML 
## Formula: Reaction ~ Days + (1 + Days | Subject) 
##    Data: sleepstudy 
##   AIC  BIC logLik deviance REMLdev
##  1756 1775   -872     1752    1744
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Subject  (Intercept) 612.1    24.74         
##           Days         35.1     5.92    0.066
##  Residual             654.9    25.59         
## Number of obs: 180, groups: Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   251.41       6.82    36.8
## Days           10.47       1.55     6.8
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

Using Other Functions & Packages

With R there are often many ways to accomplish a task across the different packages and programming capabilities, to the point where you might use one package for a particular piece of output, another for some graphic, and so on. The trick is to first get used to how things are done generally, then move to the packages that fulfill specific needs, especially since most of them use the more basic functions internally.

There are some general packages and associated functions that can help ease the social scientist's transition from other commonly used statistical packages. Examples include Zelig, from political scientist Gary King and others (along with the plm package for panel data), William Revelle's psych package, and car, rms, and MASS for general routines utilized across the social sciences. Those alone will get you through the analyses typically covered in graduate social science courses (a small taste follows below), but see the Task View for Social Sciences to find more.

However, one of the fun things about R is gaining experience with techniques more widely used in a variety of other disciplines, and using R regularly will expose you to quite a lot more than packages designed with a particular customer in mind ever could. There are great offerings from people in ecology, biology, computer science and so on, so as soon as you get comfortable, feel free to explore; such packages typically operate in the same way, just with different function names and options.
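As a small taste of one such package, the describe function in psych produces the sort of descriptives table familiar from other statistical packages. A minimal sketch, reusing the sleepstudy data loaded above (any data frame of numeric variables would do):

library(psych)
# n, mean, sd, median, skew, kurtosis, standard error, etc. for each variable
describe(sleepstudy[, c("Reaction", "Days")])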
Other Topics

The following section will have certain parts covered in the course, time permitting. As the handout will always be available, this section may be added to indefinitely, regardless of whether the topics are covered in the course.

Using the Parallel Package

Typical computers made in the past few years have a multi-processor/core setup that much software does not make use of by default, and even mid-range machines today might have up to 8 processors available. This may not mean much to many applied researchers, but it will when some analyses or operations end up taking several minutes, possibly hours, or even longer with simulation studies. Even so, there is no sense waiting any longer than you have to in order to get your work done.

As of version 2.14, R has built-in support for parallel processing. This was previously accomplished via other packages such as snow or multicore, and those packages are still useful. Now, however, one can use the parallel package that comes with base R to explicitly take control of computer resource utilization, though there will likely be many packages in the future that do so implicitly and/or via an argument to a particular function (there are functions in the plyr package with a parallel argument, for example).

The key is using this functionality with the apply family of vectorized operations, in the situations where one would normally use loops in other statistical languages. Once you specify the number of processors to use, it is merely a matter of using new functions like parApply, parLapply, etc. in the same way you have used apply, lapply, etc. in basic R programming, to get an even faster turnaround on those iterative operations. This will make use of as many of your machine's processors as you specify. For example, a modern desktop PC with Windows might have eight total processors available (all of this is available on other operating systems, though one may have to change or add some arguments), so if we wanted to use seven of them for some analysis:

library(parallel)
cl <- makeCluster(7)
...
stopCluster(cl)

The code of interest goes between the creation of the cluster and stopping it, and while there are more arguments one could specify in creating the cluster, the above would be fine in Windows.

In the first course we learned a bit about loops and iterative operations, and it is in these scenarios that making use of more of your computing power becomes noticeable (even on a standard desktop I have seen operations that took several minutes whittled down to seconds). As an overly simple example, the following just adds 3 to each element of the numbers 1 through 20.

add3 <- function(x) {
    x + 3
}
parSapply(cl, 1:20, add3)

Note that more processors to work with doesn't automatically mean a speed gain. Some R functions simply aren't built that way yet (e.g. plots), and in a general sense, if you can't feed data to the processors quickly enough you could even see a slowdown. However, if one uses the high-performance computing resources typically found on the campuses of research universities, familiarity with the parallel package will be required to take full advantage of those services. Furthermore, for a single machine the setup is relatively straightforward and takes little more than your normal usage of R; a complete worked sketch follows.
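Putting the pieces together, here is a minimal self-contained sketch. The slowsq function is an illustrative stand-in for an expensive computation, and detectCores (also in the parallel package) is used to pick a sensible cluster size.

library(parallel)

# an artificially slow function, standing in for a costly computation
slowsq <- function(x) {
    Sys.sleep(0.01)  # pretend this takes real work
    x^2
}

cl <- makeCluster(detectCores() - 1)  # leave one core free for the system
res <- parSapply(cl, 1:200, slowsq)   # used just like sapply(1:200, slowsq)
stopCluster(cl)                       # always release the workers when done
head(res)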
Using Other Programs

In some cases one may want to do the bulk of one's work within R but also use other programs for additional functionality. Typically other stats programs are not going to be as flexible as R, and you might be able to be more efficient by simply calling on other programs from within the R environment. This is especially the case with simulation studies, where one may want to see how results change over a variety of parameter settings, or similarly in exploratory endeavors where there may be many models to investigate.

One issue to recognize is that every program has its unique quirks to overcome, so there isn't a one-size-fits-all approach that will work every time for every purpose. However, the concept is consistent, so understanding the basics will get you on your way in such situations. It also helps to know what some R packages are doing when they utilize other programs, particularly if you want to tweak some aspect or customize it for your own purposes. To that end, the example at the end of this document comes from Andres Martinez, our data management consultant, and shows how to call on Mplus from within R. Note also that, at least in this case, there is a package specifically designed for this called MplusAutomation.

Conclusion

R has several strengths relative to other statistical packages. It makes data cleaning and exploration easier and more efficient. The number of analyses available via contributed packages is unmatched. Furthermore, because of its object-oriented nature, further extensions of analysis and functionality are made far easier. It is hoped that after this and the first set of notes it is fairly clear both what R has to offer and the sorts of things you can expect from it.

That said, learning it will take time, perhaps a lot of it, and you'll never get to a point where it won't throw a curve ball at you here and there. However, the more you use it, the more tools you will rapidly add to your analytical toolbox, the more concepts that were fuzzy in the past will become clear, and the more confidence you will gain in your approach. Best of luck in your endeavors!

Appendix

Demonstration of using another program from within R.

Scenario

There are two types of people in a population. People of the first type have a relationship defined by Y = 0.5 + 0.2x. People of the second type have a relationship defined by Y = 0.7x. By using a regression mixture model one assumes that the joint distribution of y and x is multivariate normal.

Question

How does violation of the normality assumption affect parameter estimation?

Goal

Determine the effect non-normality has on parameter estimation and class enumeration. Does non-normality occur in both classes?

Model

Type 1: Y1 = 0.5 + 0.2x + non-normal error. The first group has a regression weight of 0.2 and an intercept of 0.5, plus an error that is non-normally distributed.

Type 2: Y2 = 0.7x + normal error. The second group has a regression weight of 0.7, plus an error that is normally distributed.
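The script that follows is long, but its skeleton is simple: write the external program's input file, shell out to run it, then read the results back into R. In sketch form (the program and file names below are placeholders, not part of the actual example):

# hypothetical write-run-read skeleton
writeLines(c("TITLE: my model;", "DATA: FILE IS data1.txt;"), "input.inp")  # write the syntax file
shell("someprogram.exe < input.inp")  # run the external program (Windows; use system() elsewhere)
results <- read.table("results.txt")  # read its output back into R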
# This is an example of how to run Mplus from within R.
# Source the functions:
#   datagen  (to generate the data and save it in a file)
#   mplusin  (to build the Mplus input file and run the analysis)
#   collect  (to collect the results)
source("C:/csr/examples/r-mplus/ex01/r-mplus-functions.txt")

# Generate data and save it in a file using the datagen function:
for (i in 1:3) {
    datagen(ng1 = 3000, ng2 = 3000, rep = i,
            flnm = "C:/csr/examples/r-mplus/ex01/data")
}

# The working directory usually is the location of Mplus.exe:
setwd("C:/Program Files/Mplus")

# Create Mplus input files and run the analysis:
for (i in 1:3) {
    mplusin(infile = "C:/csr/examples/r-mplus/ex01/data", rep = i,
            saveloc = "C:/csr/examples/r-mplus/ex01/", mpinput = "input.inp")

    # call holds the lines of a text file that will be piped into Mplus.exe
    call <- "input.inp"
    # out.fln is the output file name
    out.fln <- paste("C:/csr/examples/r-mplus/ex01/out", i, ".txt", sep = "")
    call <- rbind(call, out.fln)
    # source.fln is the file read by Mplus.exe; it contains the information in call
    source.fln <- paste("infile.txt")
    write(call, source.fln)

    # Call DOS from R. The less-than sign (<) in the DOS command pipes
    # infile.txt into Mplus.exe: Mplus asks questions (the input and output
    # file names), and infile.txt contains the answers to those questions.
    # Some executables only require a file in a special format, so the
    # less-than sign is not needed; the structure of infile.txt naturally
    # depends on the executable being used. Watch out if multiple simultaneous
    # simulations are being run with the same executable; it is better to copy
    # the executable to each folder and use one executable per set of
    # simulations. shell or system may work, depending on how the executable
    # is set up.
    shell("Mplus < infile.txt")
}

# Compile the estimates into a csv (and a txt) file:
collect(est.loc = "C:/csr/examples/r-mplus/ex01/est", nruns = 3,
        sum.fln = "C:/csr/examples/r-mplus/ex01/results.csv")
collect(est.loc = "C:/csr/examples/r-mplus/ex01/est", nruns = 3,
        sum.fln = "C:/csr/examples/r-mplus/ex01/results.txt", sum.type = ".txt")
# est.loc is the location of the estimate files without the replication number
# nruns is the number of replications
# sum.fln is the file name to which the estimates will be stored
# sum.type is the file type of sum.fln (here '.csv' and '.txt')

The script above sources "C:/csr/examples/r-mplus/ex01/r-mplus-functions.txt" in its first line. You'd want all of the following functions in that r-mplus-functions.txt file.
# This is an example of how to run Mplus from within R.
# Some notes on Mplus:
#   It requires an input file (.inp).
#   It may require a data file (.dat).

# Example 01
# Generate data for 2 groups with an effect size of 0.2/0.7 and a main effect of 0.5.
# The combined data are fit by a simple linear regression.

##########
# Generate data

# Function to generate the data and save it in a file:
datagen <- function(ng1, ng2, rep, flnm) {
    # ng1 is the number of observations in group 1
    # ng2 is the number of observations in group 2
    # rep is the replication number
    # flnm is the file name of the dataset

    # initialize the data matrix for each group:
    dat1 <- matrix(NA, ncol = 2, nrow = ng1)
    dat2 <- matrix(NA, ncol = 2, nrow = ng2)

    # generate x for each group:
    dat1[, 1] <- rnorm(ng1)
    dat2[, 1] <- rnorm(ng2)

    # generate y for each group:
    dat1[, 2] <- dat1[, 1] * 0.2 + rnorm(ng1, sd = 0.98)
    dat2[, 2] <- 0.5 + dat2[, 1] * 0.7 + rnorm(ng2, sd = 0.714)

    # combine the group 1 and group 2 data:
    dat <- rbind(dat1, dat2)

    # determine the file name:
    file.str <- paste(flnm, rep, ".txt", sep = "")
    # flnm determines the file name, including location
    #   (e.g. flnm = "C:/csr/examples/r-mplus/ex01");
    #   note R uses a forward slash instead of the backward slash Windows uses
    # rep determines the repetition (e.g. rep = 1)
    # ".txt" determines the file extension
    # sep = "" determines how the pieces are separated
    write.table(dat, file.str, row.names = F, col.names = F)
}

##########
# Build the Mplus input file

# Function to build the Mplus input file (also called the control file) to run
# a regression mixture for two classes. If building this function from scratch
# it is usually best to start from a working control file, paste it into R,
# and modify it as needed (in R).
mplusin <- function(infile, rep, saveloc, mpinput) {
    # infile is the data file to be analyzed
    # rep is the replication number
    # saveloc is the file location to which the estimates will be written
    # mpinput is the file location to which the Mplus input file will be written

    # build the Mplus syntax line by line:
    mpmat <- 'title: latent class model assuming cross-sectional data;'
    file <- paste(infile, rep, '.txt', sep = '')
    mpmat <- rbind(mpmat, paste('data: file is ', file, ';', sep = ''))
    mpmat <- rbind(mpmat, 'variable:')
    mpmat <- rbind(mpmat, '')
    mpmat <- rbind(mpmat, 'names are x y;')
    mpmat <- rbind(mpmat, 'classes=c(2);')
    mpmat <- rbind(mpmat, '')
    mpmat <- rbind(mpmat, 'usevariables are')
    mpmat <- rbind(mpmat, 'x y;')
    mpmat <- rbind(mpmat, '')
    mpmat <- rbind(mpmat, 'analysis: type=mixture;')
    mpmat <- rbind(mpmat, ' estimator=mlr;')
    mpmat <- rbind(mpmat, '')
    mpmat <- rbind(mpmat, 'model:')
    mpmat <- rbind(mpmat, '%overall%')
    mpmat <- rbind(mpmat, 'y on x;')
    mpmat <- rbind(mpmat, '%c#1%')
    mpmat <- rbind(mpmat, 'y on x;')
    mpmat <- rbind(mpmat, 'y;')
    mpmat <- rbind(mpmat, '%c#2%')
    mpmat <- rbind(mpmat, 'y on x;')
    mpmat <- rbind(mpmat, 'y;')
    # note the space after 'are ' so the path is not run together with the text
    mpmat <- rbind(mpmat, paste('Savedata: results are ', saveloc, 'est', rep, '.txt;', sep = ''))
    write(mpmat, mpinput)
}

##########
# Collecting the results

# Function to collect the results:
collect <- function(est.loc, nruns, sum.fln, sum.type = '.csv') {
    # est.loc is the location of the estimate files without the replication number
    #   (e.g., "C:/csr/examples/r-mplus/ex01")
    # nruns is the number of replications
    # sum.fln is the file name (.txt or .csv) to which the estimates will be stored
    # sum.type is the file type of sum.fln (default is .csv)
    results <- NULL  # initialize the results object
    for (i in 1:nruns) {
        est.fln <- paste(est.loc, i, ".txt", sep = "")
        # Not all rows are of the same length, so it is better to read one line at a time;
        # read.table does not work well when the number of columns is not constant.
        # The parameter estimates are in line 1, the standard errors in line 2,
        # and the model fit statistics in line 3.
        pars <- read.table(est.fln, nrow = 1)                 # parameter estimates
        ses <- read.table(est.fln, nrow = 1, skip = 1)        # standard errors
        model.fit <- read.table(est.fln, nrow = 1, skip = 2)  # model fit statistics

        # Collect the class 1 and class 2 parameters:
        class1 <- c(pars[[1]], pars[[2]], pars[[3]])
        class2 <- c(pars[[4]], pars[[5]], pars[[6]])
        se.class1 <- c(ses[[1]], ses[[2]], ses[[3]])
        se.class2 <- c(ses[[4]], ses[[5]], ses[[6]])

        # Separate the high and low class based on slope (as generated): if the
        # slope for class 1 is smaller than the slope for class 2, EST holds the
        # class 1 parameters first and then the class 2 parameters; either way
        # the mean is at the end.
        if (class1[2] < class2[2]) {
            EST <- c(class1, class2, pars[[7]])
            SE <- c(se.class1, se.class2, ses[[7]])
        } else {
            EST <- c(class2, class1, pars[[7]])
            SE <- c(se.class2, se.class1, ses[[7]])
        }

        # collect the needed output information; i is the dataset number
        res <- c(i, EST, SE, model.fit)
        results <- rbind(results, res)
    }

    # assign column names:
    colnames(results) <- c("rep", "c1 Int", "c1 Slope", "c1 resid", "c2 Int", "c2 Slope",
                           "c2 resid", "mean", "se c1 Int", "se c1 Slope", "se c1 resid",
                           "se c2 Int", "se c2 Slope", "se c2 resid", "se mean",
                           "LL", "LLC", "Free", "AIC", "BIC", "ADJ BIC", "Entropy")

    # write the results depending on file type:
    if (sum.type == '.txt') { write.table(results, sum.fln, row.names = F) }
    if (sum.type == '.csv') { write.csv(results, sum.fln, row.names = F) }
}

# Determine which estimate files are available; useful when some simulations
# do not converge:
estgen <- function(location) {
    setwd(location)
    x <- as.matrix(shell("dir est*.*", intern = T))  # directory listing of the estimate files
    x <- as.matrix(x[6:(dim(x)[1] - 2), ])           # drop the header and footer lines of the listing
    splt1 <- strsplit(c(x), "est")                   # strip the 'est' prefix
    x1 <- sapply(splt1, "[", 2)
    splt2 <- strsplit(c(x1), ".txt")                 # strip the '.txt' extension
    x2 <- sort(as.numeric(sapply(splt2, "[", 1)))    # the replication numbers found
    return(x2)
}
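A hypothetical usage example for estgen, assuming the simulation above has been run; the returned vector lists the replication numbers for which an estimate file exists.

# which replications produced estimate files?
converged <- estgen("C:/csr/examples/r-mplus/ex01")
converged  # e.g. 1 2 3 if all replications converged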