Statistical Computing I 1
CREDIT HOURS: 3
By:
Editor;
Natnael Moges (M.Sc., Biostatistics), Staff member of Statistics department
Nigisti Gebremedhin (M.Sc., Biostatistics), Staff member of Statistics department
SPSS, which stands for Statistical Package for the Social Sciences, is a powerful, user-friendly
software package for the manipulation and statistical analysis of data. The package is particularly
useful for students and researchers in statistics, psychology, sociology, psychiatry, and other
behavioral sciences because it contains an extensive range of the univariate and multivariate
procedures widely used in these disciplines.
MINITAB is a powerful, easy-to-use statistical software package that provides a wide range
of basic and advanced data analysis capabilities. MINITAB's straightforward command structure
makes it accessible to users with a wide variety of backgrounds and experience. MINITAB runs
on PC and Macintosh computers, and most of the leading workstations, minicomputers and
mainframe computers. While MINITAB differs across releases and computer platforms, the core
of MINITAB (the worksheet and the commands) is the same. Thus, if you know how to use one
release of MINITAB on one platform, you can easily switch to another.
This module is not intended in any way to be an introduction to statistics; indeed, we assume that
most readers will have attended at least one statistics course and will be relatively familiar with
concepts such as linear regression, correlation, significance tests, and simple analysis of variance.
Our hope is that researchers and students with such a background will find this manual a relatively
self-contained means of using SPSS and MINITAB to analyze their data correctly and efficiently.
Each chapter ends with a number of exercises, some relating to the data sets introduced in the
chapter and others introducing further data sets. Working through these exercises will develop
both your SPSS and MINITAB skills and your statistical skills.
Table of contents
1. Introduction to SPSS
1.1. What is a statistical package?
1.2. Starting SPSS
1.3. Overview of SPSS for Windows
1.4. The menus and their use
1.5. Entering and saving data in SPSS
1.6. Data Importing from Microsoft Excel and ASCII files
2. Modifying and organizing data
2.1. Retrieving data
2.2. Inserting cases and variables
2.3. Deleting cases or variables
2.4. Transforming Variables with the Compute Command
2.5. Transforming Variables with the Recode Command
2.5.1. Banding Values
2.6. Keeping and dropping of cases
2.7. Collapsing and transposing Data
2.8. Listing Cases
3. DESCRIPTIVE STATISTICS USING SPSS
3.1. Summarizing Data
3.1.1. Producing Frequency distribution
3.1.2. Descriptive Statistics
3.1.3. Cross Tabulation
3.1.4. Diagrams and graphs
4. Customizing SPSS outputs and reporting
4.1. Customizing SPSS outputs
4.1.1. Modifying Tables
4.1.2. Exporting Tables in SPSS
4.1.3. Modifying scatter plot
4.1.4. Modifying and Exporting Graphs
5. INTRODUCTION TO MINITAB
5.1. How to start and exit Minitab
5.2. Minitab windows: worksheet, session and project
5.2.1. Worksheet Window
5.2.2. Session Window
5.2.3. Minitab Project
5.2.4. Moving between windows
5.2.5. Understanding the interface
5.3. The menu and their use
5.4. Type Data
5.4.1. The data in the spreadsheet
5.4.2. Creating a Set of Data in Minitab
5.5. Entering and saving data
5.5.1. Entering the Data
5.5.2. Saving Minitab data
5.6. Importing and Exporting data
5.6.1. Importing Data from Excel
5.6.2. Opening a text file
5.6.3. Export data
6. Descriptive statistics using Minitab
7. STATISTICAL ANALYSIS USING MINITAB AND SPSS
7.1. Inferential statistics Using Minitab
7.2. Inferential Statistics using SPSS
7.3. Regression and Correlation
7.3.1. Correlation Analysis in SPSS
7.3.2. Linear Regression
7.3.3. Regression Diagnostics using SPSS
7.3.4. Logistic Regression
7.3.5. Regression Diagnostics in Minitab
REFERENCES
1. Introduction to SPSS
The “Statistical Package for the Social Sciences” (SPSS) is a package of programs for
manipulating, analyzing, and presenting data; the package is widely used in the social and
behavioral sciences. SPSS enables you to perform intensive numerical calculations in a fraction
of the time they would otherwise take. SPSS is frequently used in both academic and business
environments. Much of SPSS's
popularity within academia and various industries can be attributed to its capacity for managing
data sets, a functionality that represents the bulk of the work done by professional statisticians. In
addition, SPSS allows you to create, with great ease, attractive graphics and tabular outputs.
However, despite these significant conveniences, it is important to remember that no statistical
software will relieve you of the need to think critically about the results any software package
produces.
1.2. Starting SPSS
When SPSS starts, a small window appears. This window asks the following question and offers
several options.
What would you like to do?
• Run tutorial
• Type in Data
• Run an existing query
• Create new query using an existing database
• Open an existing data source
If you choose Type in Data, the Data Editor window opens.
1.3. Overview of SPSS for Windows
SPSS for Windows consists of five different windows, each of which is associated with a
particular SPSS file type. This document concentrates on the two windows most frequently used in
analyzing data in SPSS, the Data Editor and the Output Viewer windows.
Data Editor
The Data Editor is the window that is open at start-up and is used to enter and store data in a
spreadsheet format.
It consists of two windows, Data View and Variable View; each can be
accessed by clicking the tabs at the bottom of the screen.
Data View window
Click the Data View tab at the bottom of the screen to open the Data View window. The window
is simply a grid with rows and columns that displays the contents of a data file.
Each row represents a case (one individual's data).
Each column represents a variable, whose name appears at the top of the column.
The intersection between a row and a column is known as a cell. Each cell contains the
score of a particular case on one particular variable.
Note: It is good practice to define all variables first, before entering data.
Variable View window
Click the Variable View tab at the bottom of the screen to open the Variable View window. The
Variable View window is also a simple grid of rows and columns. This is where you define the
structure of all your variables.
There are ten fixed columns in the Variable View:
Name: What you want the variable to be called. SPSS has rules for variable names: in older
versions of SPSS, names are limited to eight characters (later versions accept longer names);
names should always begin with a letter and should never include a full stop or a space.
Type: The kind of information SPSS should expect for the variable. Variables come in
different types, including Numeric, String, Currency, and Date, but the ones that you will
probably use the most are Numeric and String (text).
Width: The maximum number of characters to be entered for the variable.
Decimals: The number of decimal places you would like SPSS to display for the variable.
Label: A descriptive label for the variable, used to identify it in reports and charts. Labels
can be up to 256 characters long.
Values: Value labels provide a method for mapping your variable values to string labels. They
are mainly used for categorical variables. For example, if you have a variable called “Gender”,
there are two acceptable values for that variable: Female or Male. You can assign a code
to each category, for example f for Female and m for Male, or 1 for Female and 2 for Male.
Missing: It is important to define missing values; this will help you in your
data analysis. For example, you may want to distinguish data missing because a
respondent refused to answer from data missing because the question did not apply to the
respondent.
Columns: The width of the column in the Data Editor. Note that if the
actual width of a value is wider than the column, asterisks are displayed in the Data
View.
Align: The alignment of the values in the column (left, right, or center).
Measure: The level of measurement of the variable: scale,
ordinal, or nominal.
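The naming rules given under Name can be checked mechanically before a data set is defined. The following is a minimal Python sketch, not a feature of SPSS: the function name `is_valid_name` and the `max_len` parameter are hypothetical, and the default limit of eight characters is simply the figure quoted in this module, so pass a larger value if your version accepts longer names.

```python
def is_valid_name(name, max_len=8):
    """Check a proposed variable name against the rules stated in this module."""
    if not name or len(name) > max_len:
        return False              # empty, or longer than the allowed length
    if not name[0].isalpha():
        return False              # must begin with a letter
    if "." in name or " " in name:
        return False              # no full stop and no space
    return True


# Try a few candidate names (all of them made up for illustration).
for candidate in ["age", "marital", "2ndvisit", "house income", "householdincome"]:
    print(candidate, is_valid_name(candidate))
```

Running the check on a planned list of names before opening Variable View is one way to avoid retyping rejected names.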
Output Viewer
The Output Viewer opens automatically the first time you run a procedure that generates
output, whether you run it from a dialog box or from command syntax.
All statistical results, tables, and charts are displayed in the Viewer. You can edit the
output and save it for later use.
The Output Viewer is divided into two panes. The right-hand pane contains statistical tables,
charts, and text output. The left-hand pane contains a tree structure similar to those used in
Windows Explorer, which provides an outline view of the contents.
Output that is displayed in pivot tables can be modified in many ways with the Pivot Table
Editor. You can edit text, swap data in rows and columns, add color, create
multidimensional tables, and selectively hide and show results.
Syntax Editor
A text editor where you compose SPSS commands and submit them to the SPSS processor.
All output from these commands appears in the Output Viewer.
Chart Editor
You can modify high-resolution charts and plots in chart windows. You can change the
colors, select different type fonts or sizes, switch the horizontal and vertical axes, rotate
3-D scatter plots, and even change the chart type.
1.4. The menus and their use
Each window in SPSS has its own menus. The common menus are:
File (new, open, save, save as, etc.)
Edit (undo, redo, cut, copy, insert cases/variables, etc.)
View (value labels, etc.)
Analyze (descriptive statistics, tables, compare means, correlate, regression, etc.)
Graphs (bar, pie, scatter plot, histogram, etc.)
Window (split, minimize the window, etc.)
Help (topics, tutorial, etc.)
The menu bar provides easy access to most SPSS features. It consists of ten
drop-down menus.
Data Editor Toolbar
Clicking once on any of the toolbar buttons allows you to perform an action, such as opening a
data file or selecting a chart for editing.
1.5. Entering and saving data in SPSS
In Data View, you enter your data just as you would in a spreadsheet program. You can move
from cell to cell with the arrow keys on your keyboard or by clicking on the cell with the mouse.
To edit an existing data value, click in the cell, type the
new value, and press Tab, Enter, or one of the arrow keys.
Once one case (row) is complete, begin entering another case at the beginning of the next
row.
You can delete a row of data by clicking on the row number at the far left and pressing the
Delete key on your keyboard.
In a similar fashion, you can delete a variable (column) by clicking on the variable name so
that the entire column is highlighted and pressing the Delete key.
In the steps that follow, we will see how to type in data by defining different variable types.
Click the Variable View tab at the bottom of the Data Editor window. Define the variables that
are going to be used. In our case, let us consider three variables: namely age, marital status, and
income.
In the first row of the first column, type age.
In the second row, type marital.
In the third row, type income.
New variables are automatically given a numeric data type. If you don't enter variable names,
unique names are automatically created. However, these names are not descriptive and are not
recommended for large data files.
The names that you entered in Variable View are now the headings for the first three columns in
Data View.
Begin entering data in the first row, starting at the first column.
In the age column, type 55.
In the marital column, type 1.
In the income column, type 72000.
Move the cursor to the first column of the second row to add the next subject's data.
In the age column, type 53.
In the marital column, type 0.
In the income column, type 153000.
Currently, the age and marital columns display decimal points, even though their values are
intended to be integers. To hide the decimal points in these variables: Click the Variable View tab
at the bottom of the Data Editor window.
Select the Decimals column in the age row and type 0 to hide the decimal.
Select the Decimals column in the marital row and type 0 to hide the decimal.
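The grid built in these steps can also be pictured as a small table in code. The following Python sketch is only an illustration and is not part of SPSS; the names `variables`, `cases`, and `cell` are assumptions made for this example, while the values are the ones typed in above.

```python
# Column headings, as defined in Variable View.
variables = ["age", "marital", "income"]

# Each inner list is one case (row); each position is one variable (column).
cases = [
    [55, 1, 72000],
    [53, 0, 153000],
]


def cell(row, name):
    """Return the value at the intersection of a case (row) and a variable (column)."""
    return cases[row][variables.index(name)]


print(cell(0, "age"))      # first case, age column
print(cell(1, "income"))   # second case, income column
```

Thinking of the data this way makes it clear why every case must supply a value, or a missing-value code, for every variable.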
Non-numeric data, such as strings of text, can also be entered into the Data Editor.
Click the Variable View tab at the bottom of the Data Editor window.
In the first cell of the first empty row, type sex for the variable name.
Click the Type cell.
Click the button in the Type cell to open the Variable Type dialog box.
Select String to specify the variable type.
Click OK to save your changes and return to the Data Editor.
In addition to defining data types, you can also define descriptive labels for
variable names and data values. These descriptive labels are used in statistical reports and charts
to identify the different variables and values. Variable labels can be up to 256 characters long.
Click the Variable View tab at the bottom of the Data Editor window.
In the Label column of the age row, type Respondent's Age.
In the Label column of the marital row, type Marital Status.
In the Label column of the income row, type Household Income.
In the Label column of the sex row, type Gender.
The Type column displays the current data type for each variable. The most common are
numeric and string, but many other formats are supported.
In the current data file, the income variable is defined as a numeric type.
Click the Type cell for the income row, and then click the button to open the Variable Type dialog
box.
Select Dollar in the Variable Type dialog box. The formatting options for the currently selected
data type are displayed. Select the format of this currency. For this example, select $###,###,###.
Click OK to save your changes.
Value labels provide a method for mapping your variable values to a string label. In the case of
this example, there are two acceptable values for the marital variable. A value of “0” means that
the subject is single and a value of “1” means that he or she is married.
Click the Values cell for the marital row, and then click the button to open the Value Labels dialog
box.
Type 0 in the Value field, and then type Single in the Value Label field.
Click Add to add this label to the list.
Repeat the process, this time typing 1 in the Value field and Married in the Value Label field.
Click Add, and then click OK to save your changes and return to the Data Editor.
These labels can also be displayed in Data View, which can help to make your data more
readable.
Click the Data View tab at the bottom of the Data Editor window.
From the menus choose:
View
Value Labels
The labels are now displayed in a list when you enter values in the Data Editor. This has the
benefit of suggesting a valid response and providing a more descriptive answer.
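The effect of switching Value Labels on and off can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SPSS code: the function `show` and the variable names are hypothetical, and the codes 0 = Single and 1 = Married are the ones defined for the marital variable in this example.

```python
# Value labels for the marital variable, as defined in this example.
marital_labels = {0: "Single", 1: "Married"}

# The two marital values typed in earlier (first case 1, second case 0).
marital_values = [1, 0]


def show(values, labels, use_labels):
    """Return what the column would display with value labels off or on."""
    if use_labels:
        # A code without a defined label is shown as the code itself.
        return [labels.get(v, v) for v in values]
    return list(values)


print(show(marital_values, marital_labels, use_labels=False))
print(show(marital_values, marital_labels, use_labels=True))
```

The stored data never change; only the display does, which is why the codes remain available for analysis.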
String variables may require value labels as well. For example, your data may use single letters,
M or F, to identify the sex of the subject for male and female respectively.
Value labels can be used to specify that M stands for Male and F stands for Female.
Click the Variable View tab at the bottom of the Data Editor window.
Click the Values cell in the sex row, and then click the button to open the Value Labels dialog
box.
Type F in the value field, and then type Female in the Value Label field.
Click Add to add this label to your data file.
Repeat the process, this time typing M in the Value field and Male in the Value Label field.
Click Add, and then click OK to save your changes and return to the Data Editor.
Because string values are case sensitive, you should make sure that you are consistent. A
lowercase m is not the same as an uppercase M.
In a previous example, we chose to have value labels displayed rather than the actual data by
selecting Value Labels from the View menu. You can use these labels for data entry.
Click the Data View tab at the bottom of the Data Editor window. In the first row, select the
cell for sex and select Male from the drop-down list.
In the second row, select the cell for sex and select Female from the drop-down list. Only defined
values are listed, which helps to ensure that the data entered are in a format that you expect.
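Because string codes are case sensitive, a lookup against the defined values fails for a code typed in the wrong case. The short Python sketch below illustrates the point; the function `check_entry` is hypothetical and only mimics the behaviour described here.

```python
# Value labels defined for the string variable sex.
sex_labels = {"F": "Female", "M": "Male"}


def check_entry(value, labels):
    """Return the label for a defined value, or None if the value is not defined."""
    return labels.get(value)   # dictionary lookup is case sensitive


print(check_entry("M", sex_labels))   # a defined code
print(check_entry("m", sex_labels))   # lowercase m is a different, undefined code
```

Choosing from the drop-down list avoids this problem, because only the defined codes can be selected.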
Handling Missing Data
Missing or invalid data are generally too common to ignore. Survey respondents may refuse to
answer certain questions, may not know the answer, or may answer in an unexpected format.
If you don't take steps to filter or identify these data, your analysis may not provide accurate
results.
For numeric data, empty data fields or fields containing invalid entries are handled by
converting the fields to system missing, which is identifiable by a single period.
The reason a value is missing may be important to your analysis. For example, you may find it
useful to distinguish between those who refused to answer a question and those who didn't answer
a question because it was not applicable.
Click the Variable View tab at the bottom of the Data Editor window. Click the Missing cell in
the age row, and then click the button to open the Missing Values dialog box. In this dialog box,
you can specify up to three distinct missing values, or a range of values plus one additional
discrete value.
Select Discrete missing values. Type 999 in the first text box and leave the other two empty.
Click OK to save your changes and return to the Data Editor. Now that the missing data value has
been added, a label can be applied to that value. Click the Values cell in the age row, and then
click the button to open the Value Labels dialog box.
Type 999 in the Value field. Type No Response in the Value Label field. Click Add to add this
label to your data file. Click OK to save your changes and return to the Data Editor.
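To see why declaring 999 as missing matters, compare a mean computed with and without the declared code. The Python sketch below is a hedged illustration, not SPSS output: the helper `valid_values` is hypothetical, the ages 55 and 53 come from the example data, and the 999 entry and the blank entry are assumed extra cases added for the demonstration.

```python
from statistics import mean

USER_MISSING = {999}   # discrete missing value declared for age


def valid_values(values, user_missing):
    """Drop system-missing entries (None) and user-missing codes before analysis."""
    return [v for v in values if v is not None and v not in user_missing]


# Hypothetical age column: two real answers, one "No Response" code, one blank.
ages = [55, 53, 999, None]

naive = mean(v for v in ages if v is not None)     # wrongly treats 999 as a real age
correct = mean(valid_values(ages, USER_MISSING))   # excludes the declared code

print(naive, correct)
```

The naive figure is far too large for an average age, which is exactly the kind of distortion that defining missing values prevents.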
Missing values for string variables are handled similarly to those for numeric values.
Unlike numeric values, empty fields in string variables are not designated as system missing.
Rather, they are interpreted as an empty string. Click the Variable View tab at the bottom of the
Data Editor window.
Click the Missing cell in the sex row, and then click the button to open the Missing Values dialog
box. Select Discrete missing values. Type NR in the first text box.
Missing values for string variables are case sensitive, so a value of “nr” is not treated as a
missing value.
Click OK to save your changes and return to the Data Editor. Now you can add a label for the
missing value. Click the Values cell in the sex row, and then click the button to open the Value
Labels dialog box. Type NR in the Value field. Type “No Response” in the Value Label field.
Click Add to add this label to your project. Click OK to save your changes and return to the Data
Editor.
Once you've defined variable attributes for a variable, you can copy these attributes and
apply them to other variables.
In Variable View, type agewed in the first cell of the first empty row. In the Label column, type
Age Married. Click the Values cell in the age row. From the menus choose:
Edit
Copy
Click the Values cell in the agewed row
From the menus choose:
Edit
Paste
The defined values from the age variable are now applied to the agewed variable. To apply the
attribute to multiple variables, simply select multiple target cells (click and drag down the
column).
When you paste the attribute, it is applied to all of the selected cells. New variables are
automatically created if you paste the values into empty rows.
You can also copy all of the attributes from one variable to another. Click the row number in the
marital row.
From the menus choose:
Edit
Copy
Click the row number of the first empty row.
From the menus choose:
Edit
Paste
All of the attributes of the marital variable are applied to the new variable.
For categorical (nominal, ordinal) data, Define Variable Properties can help you define value
labels and other variable properties. Define Variable Properties:
Scans the actual data values and lists all unique data values for each selected variable.
Identifies unlabeled values and provides an "auto-label" feature.
Provides the ability to copy defined value labels from another variable to the selected
variable or from the selected variable to multiple additional variables.
This example uses the data file demo.sav. This data file already has defined value labels; so
before we start, let's enter a value for which there is no defined value label:
In Data View of the Data Editor, click the first data cell for the variable ownpc (you may have
to scroll to the right) and enter the value 99.
From the menus choose:
Data
Define Variable Properties...
In the initial Define Variable Properties dialog box, you select the nominal or ordinal variables for
which you want to define value labels and/or other properties.
Since Define Variable Properties relies on actual values in the data file to help you make good
choices, it needs to read the data file first. This can take some time if your data file contains a
very large number of cases, so this dialog box also allows you to limit the number of cases to
read, or scan.
Limiting the number of cases is not necessary for our sample data file. Even though it contains
over 6,000 cases, it doesn't take very long to scan that many cases.
Drag and drop Owns computer [ownpc] through Owns VCR [ownvcr] into the Variables to Scan
list.
You might notice that the measurement level icons for all of the selected variables indicate that
they are scale variables, not categorical variables. By default, all numeric variables are assigned
the scale measurement level, even if the numeric values are actually just codes that represent
categories.
All of the selected variables in this example are really categorical variables that use the numeric
values 0 and 1 to stand for No and Yes, respectively--and one of the variable properties that we'll
change with Define Variable Properties is the measurement level.
Click Continue
In the Scanned Variable List, select ownpc. The current level of measurement for the selected
variable is scale. You can change the measurement level by selecting one from the drop-down list
or you can let Define Variable Properties suggest a measurement level.
Click Suggest
Since the variable doesn't have very many different values and all of the scanned cases contain
integer values, the proper measurement level is probably ordinal or nominal. Select Ordinal and
then click Continue.
The measurement level for the selected variable is now ordinal.
The Value Labels Grid displays all of the unique data values for the selected variable, any defined
value labels for these values, and the number of times (count) each value occurs in the scanned
cases.
The value that we entered, 99, is displayed in the grid. The count is only 1 because we changed
the value for only one case, and the Label column is empty because we haven't defined a value
label for 99 yet.
An X in the first column of the Scanned Variable List also indicates that the selected variable has
at least one observed value without a defined value label. In the Label column for the value of 99,
enter No answer.
Then click (check) the box in the Missing column. This identifies the value 99 as user missing.
Data values specified as user missing are flagged for special treatment and are excluded from
most calculations.
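The scan that fills the Value Labels Grid can be sketched as a simple count of unique values. The Python below is an illustration only: `label_grid` is a hypothetical helper, and the short list of ownpc values is made up for the example (the real file has thousands of cases); the codes 0 = No, 1 = Yes, and 99 follow this walkthrough.

```python
from collections import Counter


def label_grid(values, labels, user_missing=frozenset()):
    """Return one row per unique value: (value, count, label, is_user_missing)."""
    counts = Counter(values)
    return [(v, counts[v], labels.get(v, ""), v in user_missing)
            for v in sorted(counts)]


# Hypothetical scanned values for ownpc, including the 99 typed in above.
ownpc = [99, 1, 0, 0, 1, 1, 0]
labels = {0: "No", 1: "Yes"}

for row in label_grid(ownpc, labels):
    print(row)

# Values that still lack a label, which the X in the Scanned Variable List signals.
unlabeled = [v for v, _, lab, _ in label_grid(ownpc, labels) if not lab]
print(unlabeled)
```

After a label is supplied for 99 and the value is marked as missing, the same scan would report it as labeled and user missing.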
Before we complete the job of modifying the variable properties for ownpc, let's apply the same
measurement level, value labels, and missing values definitions to the other variables in the list.
In the Copy Properties group, click To Other Variables.
In the Apply Labels and Level to dialog box, select all of the variables in the list, and then click
Copy. This copies the properties of the ownpc variable to the other five selected variables. If you
select any other variable in the list in the Define Variable Properties main dialog box
now, you'll see that they are all now ordinal variables, with a value of 99 defined as user missing
and a value label of No answer. Click OK to save all of the variable properties that you have
defined.
Exercise-1: The following small data set consists of four variables, namely agecat, gender,
accid and pop.
Where: agecat is a categorical variable created for age: 1 = 'Under 21', 2 = '21-25', and
3 = '26-30'; gender: 0 = 'Male' and 1 = 'Female';
accid and pop are numeric.
After defining these variables in the Data Editor window, enter the following data for the variables
agecat, gender, accid and pop respectively. Your data should appear as given below. Save the
data set as trial1.sav.
1 1 57997 198522
2 1 57113 203200
3 1 54123 200744
1 0 63936 187791
2 0 64835 195714
3 0 66804 208239
Exercise-2: Create a data set called Trial2.sav from the following data. The data set has the
following variables:
In addition, there are no value labels for any of the above variables. After completing the
definition of the above variables, type the following data into your Data Editor window so that
your data appear as given below.
1 1 1 18 1
1 1 1 14 2
1 1 1 12 3
1 1 1 6 4
2 1 1 19 1
2 1 1 12 2
2 1 1 8 3
2 1 1 4 4
3 1 1 14 1
3 1 1 10 2
3 1 1 6 3
3 1 1 2 4
4 1 2 16 1
4 1 2 12 2
4 1 2 10 3
4 1 2 4 4
5 1 2 12 1
5 1 2 8 2
5 1 2 6 3
5 1 2 2 4
6 1 2 18 1
6 1 2 10 2
6 1 2 5 3
6 1 2 1 4
7 2 1 16 1
7 2 1 10 2
7 2 1 8 3
7 2 1 4 4
8 2 1 18 1
8 2 1 8 2
8 2 1 4 3
8 2 1 1 4
9 2 1 16 1
9 2 1 12 2
9 2 1 6 3
9 2 1 2 4
10 2 2 19 1
10 2 2 16 2
10 2 2 10 3
10 2 2 8 4
11 2 2 16 1
11 2 2 14 2
11 2 2 10 3
11 2 2 9 4
12 2 2 16 1
12 2 2 12 2
12 2 2 8 3
Exercise-3: Given below is an example of a questionnaire. Suppose you have information from
several such questionnaires. Prepare a data entry format that will help you enter your data into
SPSS.
Examples of questionnaire Design
Name ____________________________________________________________
Age __________________________Sex ________________________________
City _____________________________________________________________
8. How much do you spend on eating out (one time)?
□ Below 200 □ 200-500 □ 500-800 □ More than 800
9. What did you normally order?
□ Pizza □ Burgers □ Curries and Breads □ Pasta
10. The price paid by you for the above is
10.1 Pizza: □ Very high □ A little bit high □ Just right
10.2 Burgers: □ Very high □ A little bit high □ Just right
10.3 Curries and Breads: □ Very high □ A little bit high □ Just right
10.4 Soups: □ Very high □ A little bit high □ Just right
10.5 Pasta: □ Very high □ A little bit high □ Just right
1.6. Data Importing from Microsoft Excel and ASCII files
Data can be directly entered in SPSS (as seen above), or a file containing data can be opened
in the Data Editor. From the menu in the Data Editor window, choose the following menu
options.
File
Open...
If the file you want to open is not an SPSS data file, you can often use the Open menu
item to import that file directly into the Data Editor.
If a data file is not in a format that SPSS recognizes, then try using the software
package in which the file was originally created to translate it into a format that can
be imported into SPSS.
Data can be imported into SPSS from Microsoft Excel with relative ease. If you are working with
a spreadsheet in another software package, you may want to save your data as an Excel file, then
import it into SPSS.
To open an Excel file, select the following menu options from the menu in the Data Editor
window in SPSS.
File
Open...
First, select the desired location on disk using the Look in option. Next, select Excel from the
Files of type drop-down menu. The file you saved should now appear in the main box in the
Open File dialog box. You can open it by double-clicking on it. You will see one more dialog
box which appears as follows.
This dialog box allows you to select a spreadsheet from within the Excel Workbook.
The drop-down menu in the example shown above offers two sheets from which to choose. As
SPSS only operates on one spreadsheet at a time, you can only select one sheet from this menu.
This box also gives you the option of reading variable names from the Excel Workbook directly
into SPSS. Click on the Read variable names box to read in the first row of your spreadsheet as
the variable names.
If the first row of your spreadsheet does indeed contain the names of your variables and you want
to import them into SPSS, these variable names should conform to SPSS variable naming
conventions (eight characters or fewer, not beginning with any special characters).
You should now see data in the Data Editor window. Check to make sure that all variables and
cases were read correctly. Next, save your dataset in SPSS format by choosing the Save option in
the File menu.
Example: Import an Excel data set called book1.xls into the SPSS Data Editor window from the
desktop.
The procedure is as follows:
File
Open... Data
After you select Data, you will see a window with the header “Open File”. In the same window,
select the desktop using the Look in option.
Then select Excel (*.xls) from the Files of type drop-down menu and open the file. Another small
window will appear. In this window you may see that there is only one worksheet. If the first row of
the book1.xls data set contains variable names, select the option “Read variable names from the
first row of data”. SPSS will then treat the elements of the first row as variable names.
If the first row of book1.xls does not contain variable names, leave the option unselected; SPSS
will then treat the elements of the first row as data values.
Data are often stored in an ASCII file format, alternatively known as a text or flat file format.
Typically, a space, tab, comma, or some other character separates columns of data in an ASCII
file. To import text files to SPSS we have two wizards to consider:
Read Text Data: If you know that your data file is an ASCII file, then you can open the
data file by opening the Read Text Data Wizard from the File menu. The Text Import
Wizard will first prompt you to select a file to import. After you have selected a file, you
will go through a series of about six dialog boxes that provide several
options for importing data.
Once we are through with importing of the data, we need to check for its accuracy. It is also
necessary to save a copy of the dataset in SPSS format by selecting the Save or Save As options
from the File menu.
Open Data: The second option to read an ASCII file to SPSS is by using File Open Data
option.
File
Open
Data
After you select Data, you will see a dialog box with the header “Open File”. In the same
window, select the desktop using the Look in option.
Then select Text (*.txt) from the Files of type drop-down menu. Select the file and click the Open
button. A series of dialog boxes will follow.
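To make the idea of a delimited ASCII file concrete, here is a minimal Python sketch that parses such a file into variable names and cases. The function name read_text_data, its parameters, and the sample contents are assumptions made for illustration; they are not part of SPSS. The names_in_first_row switch plays the same role as the option to read variable names from the first row.

```python
import csv
import io

def read_text_data(text, delimiter=",", names_in_first_row=True):
    """Parse delimited text into (variable_names, cases)."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if names_in_first_row:
        names, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names V1, V2, ...
        names = [f"V{i + 1}" for i in range(len(rows[0]))]
        data = rows
    # Every field is assumed to be numeric in this sketch.
    cases = [dict(zip(names, map(float, r))) for r in data]
    return names, cases

sample = "X,Y\n1,2.5\n3,4.5\n"  # hypothetical file contents
names, cases = read_text_data(sample)
print(names, cases)
```

If the first row holds data instead of names, calling the function with names_in_first_row=False keeps that row as a case.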
Exercise: Suppose there is a text file named mychap1 on the desktop under the subdirectory
training. Import this file to SPSS. Also name the first variable as X and the second as Y.
2. Modifying and organizing data
File
Open
Data
You will see the Open File dialog box. Assuming the data file you want is on a floppy
disk and has previously been saved by SPSS, open the drives drop-down list and click on the icon
for drive A. All the files on drive A ending with the .SAV extension will be listed in the files list.
Click on the name of the file you want to retrieve, and it will appear in the file name box. Click
on the OK button on the right-hand side of the dialog box. The file will then be put into the Data
Editor window, and its name will be the title of that window.
If instead the data file you want is on the hard disk and has previously been saved by SPSS
under the directory Program Files, open the drives drop-down list and click on the icon for
Program Files. All the files under Program Files ending with the .SAV extension will be listed in
the files list. Click on the name of the file you want to retrieve, and it will appear in the file
name box. Click on the OK button on the right-hand side of the dialog box. The file will
then be put into the Data Editor window, and its name will be the title of that window.
You may want to add new variables or cases to an existing dataset. The Data Editor provides
menu options that allow you to do that. For example, you may want to add data about
participants' ages to an existing dataset.
To insert a new variable, click on the variable name to select the column in which the
variable is to be inserted.
To insert a case, select the row in which the case is to be added by clicking on the row's
number. Clicking on either the row's number or the column's name will result in that row or
column being highlighted. Next, use the insert options available in the Data menu in the
Data Editor:
Data
Insert Variable
Insert case
If a row has been selected, choose Insert Case from the Data menu; if a column has been
selected, choose Insert Variable. This will produce an empty row or column in the
highlighted area of the Data Editor. The existing cases and variables will be shifted down and
to the right respectively.
You may want to delete cases or variables from a dataset. To do that, select a row or column
by highlighting as described above. Next, use the Delete key to delete the highlighted area. Or
you can use the Delete option in the Edit menu to do it.
In the Data Editor, you can use the COMPUTE or the RECODE command to create new
variables from existing variables.
The COMPUTE option allows you to arithmetically combine or alter variables and place the
resulting value under a new variable name. As an example, to calculate the area of shapes
based on their height and width, you compute a new variable "area" by multiplying "height"
and "width" with one another. See below.
The new variable created is area. This is specified under target variable. This target variable is
the product of the two existing variables height and width.
Another example may be a dataset that contains employees' beginning and current
salaries. Our interest is in the difference between the starting salary and the
present salary. A new variable can be computed by subtracting the starting salary from the
present salary, using the following menu options:
Transform
Compute...
In other situations, you may also want to transform an existing variable. For example, if data
were entered as months of experience and you wanted to analyze data in terms of years on the
job, then you could re-compute that variable to represent experience on the job in numbers of
years by dividing number of months on the job by 12.
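The three computations described above (a product, a difference, and a rescaling) can be sketched in Python as follows. The case values and the variable names salbegin, salary, months, saldiff and years are hypothetical and chosen only for illustration.

```python
# Hypothetical cases; each dictionary is one row of the data file.
shapes = [{"height": 2.0, "width": 3.0}, {"height": 4.0, "width": 1.5}]
staff = [{"salbegin": 12000, "salary": 21000, "months": 30}]

# Target variable area = height * width
for case in shapes:
    case["area"] = case["height"] * case["width"]

# Salary difference, and months of experience converted to years
for case in staff:
    case["saldiff"] = case["salary"] - case["salbegin"]
    case["years"] = case["months"] / 12

print(shapes)
print(staff)
```

In each case the existing variables are left unchanged and the result is stored under a new variable name, which is what the target variable does in the Compute dialog box.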
The RECODE option allows you to create discrete categories from continuous variables. As
an example, you may want to change the height variable where values can range from 0 to
over 100 into a variable that only contains the categories tall, medium, and short. We have to
pass through the following steps.
Select Transform/Recode/Into Different Variables.
A list of variables in the active data set will appear. Select the variable you wish to
change by clicking once on the variable name and clicking the arrow button.
Click the output box and enter a new variable name (8 characters maximum) and
click Change.
See the figure below. The variable to be recoded is the height.
NOTE: In dialog boxes that are used for mathematical or statistical operations, only those
variables that you defined as numeric will be displayed. String variables will not be displayed
in the variable lists.
Now the variable height_b is the new variable that will be obtained after recoding. The value
label for the new variable is “Height variable recoded”.
Select OLD AND NEW VALUES. This box presents several recoding options. You
identify one value or a range of values from the old variable and indicate how these
values will be coded in the new variable.
After identifying one value category or range, enter the value for the new variable in
the New Value box. In our example, the old values might be 0 through 10, and the
new value might be 1 (the value label for 1 would be "short", for 2 "medium", for 3
"tall").
Click ADD and repeat the process until each value of the new variable is properly
defined.
(See Figure Below). Recode: Old and new values
Caution: You also have the option of recoding a variable into the same name. If you did this
in the height example, the working data file would change all height data to the three
categories (a value of 1 for "short", 2 for "medium", or 3 for "tall"). If you then saved this file with
the same name, you would lose all of the original height data. The best way to avoid this is to
always use the recode option that creates a different variable. Saving the data file then keeps
the original height data intact while adding the new categorized variable to the data set for
future use.
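The recoding into a different variable can be sketched in Python as shown here. Only the first range (0 through 10 recoded to 1) comes from the example in the text; the other two ranges and the sample heights are assumptions made for illustration. In this sketch the rules are checked in order and the first matching range is used.

```python
# (low, high, new value); rules are checked in order, first match wins.
# Only the first rule is from the text, the other two are assumed.
RULES = [(0, 10, 1), (10, 50, 2), (50, float("inf"), 3)]
LABELS = {1: "short", 2: "medium", 3: "tall"}

def recode_height(height):
    for low, high, new_value in RULES:
        if low <= height <= high:
            return new_value
    return None  # values outside every range are left missing

cases = [{"height": 4}, {"height": 27}, {"height": 120}]
for case in cases:
    # Recode into a different variable so the original height is kept.
    case["height_b"] = recode_height(case["height"])

for case in cases:
    print(case["height"], case["height_b"], LABELS[case["height_b"]])
```

Because the result is written to height_b, the original height values remain available, which is the point of the caution above.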
The IF statement is an option to use within the Compute or Recode command. You can
choose to compute or recode values only if one of your variables satisfies a condition of your
choice. This condition, which is captured by means of the "IF" command, can be simple
(such as "if area=15"). To create more sophisticated conditions, you can employ logical
transformations using AND, OR, NOT. The procedure is as given below.
In the Compute and Recode dialog boxes click on the IF button.
The Include If Case Satisfies Condition dialog box pops up (see the Figure below).
Select the variable of interest and click the arrow button.
Use the key pad provided in the dialog box or type in the appropriate completion of
the IF statement.
When the IF statement is complete, click CONTINUE.
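A conditional transformation of this kind can be sketched in Python as follows. The cases, the condition, and the variable name flag are hypothetical; the condition combines the simple test area = 15 from the text with NOT and AND.

```python
# Hypothetical cases.
cases = [
    {"area": 15, "height": 3},
    {"area": 20, "height": 5},
    {"area": 15, "height": 8},
]

def condition(case):
    # Corresponds in spirit to: area = 15 AND NOT (height > 5)
    return case["area"] == 15 and not case["height"] > 5

for case in cases:
    # Cases that fail the condition get None, like a missing value
    # in a newly created variable.
    case["flag"] = 1 if condition(case) else None

print([case["flag"] for case in cases])
```

Only the first case satisfies both parts of the condition, so only that case receives a value.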
2.5.1. Banding Values
Banding is taking two or more contiguous values and grouping them into the same category.
The data you start with may not always be organized in the most useful manner for your
analysis or reporting needs. For example, you may want to:
Create a categorical variable from a scale variable.
Combine several response categories into a single category.
Create a new variable that is the computed difference between two existing variables.
Calculate the length of time between two dates.
In the initial Visual Bander dialog box, you select the scale and/or ordinal variables for which
you want to create new, banded variables.
Since the Visual Bander relies on actual values in the data file to help you make good banding
choices, it needs to read the data file first. Since this can take some time if your data file
contains a large number of cases, this initial dialog box also allows you to limit the number of
cases to read ("scan").
This is not necessary for our sample data file. Even though it contains more than 6,000 cases,
it does not take long to scan that number of cases.
Drag and drop Household income in thousands [income] from the Variables list into
the Variables to Band list, and then click Continue.
In the main Visual Bander dialog box, select Household income in thousands [income] in the Scanned
Variable List.
A histogram displays the distribution of the selected variable (which in this case is highly
skewed). Enter inccat2 for the new banded variable name and Income category (in thousands)
for the variable label.
Enter 25 for the first cut-point location, 3 for the number of cut-points, and 25 for the width.
The number of banded categories is one greater than the number of cut-points. So, in this
example, the new banded variable will have four categories, with the first three categories
each containing ranges of 25 (thousand) and the last one containing all values above the
highest cut-point value of 75 (thousand).
Click Apply.
The values now displayed in the grid represent the defined cut-points, which are the upper
endpoints of each category. Vertical lines in the histogram also indicate the locations of the
cut-points. By default, these cut-point values are included in the corresponding categories. For
example, the first value of 25 would include all values less than or equal to 25. But in this
example, we want categories that correspond to less than 25, 25–49, 50–74, and 75 or higher.
In the Upper Endpoints group, select Excluded (<). Then click Make Labels.
This automatically generates descriptive value labels for each category. Since the actual
values assigned to the new banded variable are simply sequential integers starting with 1, the
value labels can be very useful.
You can also manually enter or change cut-points and labels in the grid, change cut-point
locations by dragging and dropping the cut-point lines in the histogram, and delete cut-points
by dragging cut-point lines off of the histogram. Click OK to create the new, banded
variable.
The new variable is displayed in the Data Editor. Since the variable is added to the end of the
file, it is displayed in the far right column in Data View and in the last row in Variable View.
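The effect of the cut-point settings can be checked with a small Python sketch. It builds the cut-points from the first location, the number of cut-points and the width given above, and assigns band numbers with either excluded or included upper endpoints. The function name band and the sample incomes are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

first_cut, n_cuts, width = 25, 3, 25
cutpoints = [first_cut + i * width for i in range(n_cuts)]  # [25, 50, 75]

def band(value, upper_excluded=True):
    """Return the band number (1 .. number of cut-points + 1)."""
    if upper_excluded:
        # A value equal to a cut-point falls into the next band (<).
        return bisect_right(cutpoints, value) + 1
    # A value equal to a cut-point stays in the lower band (<=).
    return bisect_left(cutpoints, value) + 1

incomes = [12, 25, 49.5, 50, 75, 300]  # hypothetical incomes in thousands
print([band(x) for x in incomes])
print([band(x, upper_excluded=False) for x in incomes])
```

With excluded upper endpoints an income of exactly 25 falls into the second band (25–49), whereas with the default included endpoints it would fall into the first band.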
Sorting Cases
Sorting cases allows you to organize rows of data in ascending or descending order on the
basis of one or more variables. For instance, consider once again the Employee data set.
Suppose we want to sort the data based on the variable “Jobcat”, which refers to the
category of employment. The procedure for sorting will be as follows:
Data
Sort Cases...
A small dialog box with the header Sort Cases will pop up. This dialog box has a few options. If
you move Jobcat into the Sort by box, choose the ascending option and click OK, your data will
be sorted by Jobcat. All of the cases coded as job category 1 appear first in the dataset, followed
by all of the cases coded 2 and then 3.
The data could also be sorted by more than one variable. For example, within job category,
cases could be listed in order of their salary. Again we can choose
Data
Sort Cases...
In the small dialog box, select the variable jobcat followed by salary. The dialog
box appears as follows.
To choose whether the data are sorted in ascending or descending order, select the appropriate
button. Let us choose ascending so that the data are sorted in ascending order of magnitude
with respect to the values of the selected variables. The hierarchy of such a sorting is
determined by the order in which variables are entered in the Sort by box.
Data are sorted by the first variable entered, and then sorting takes place by the next
variable within that first variable. In our case, jobcat was the first variable entered, followed by
salary, so the data are first sorted by job category and then, within each of the job
categories, by salary.
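The sorting hierarchy can be illustrated with a short Python sketch that sorts on jobcat first and on salary within jobcat. The five employee records are hypothetical and do not come from the Employee data set.

```python
from operator import itemgetter

# Hypothetical employee records.
employees = [
    {"id": 1, "jobcat": 3, "salary": 57000},
    {"id": 2, "jobcat": 1, "salary": 40200},
    {"id": 3, "jobcat": 1, "salary": 21450},
    {"id": 4, "jobcat": 2, "salary": 30750},
    {"id": 5, "jobcat": 3, "salary": 45000},
]

# The order of the keys mirrors the order of entry in the Sort by box.
by_job_then_salary = sorted(employees, key=itemgetter("jobcat", "salary"))

print([e["id"] for e in by_job_then_salary])
```

Reversing the order of the two keys would sort by salary first and use jobcat only to break ties.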
Merging Files:
We can merge files in two different ways. The first option is “add variables” and the
second is “add cases”.
Add variables: The Add Variables option adds new variables on the basis of variables that are
common to both files. In this case, we need to have two data files. Each case in one file
corresponds to one case in the other file. In both files each case has an identifier, and the
identifiers match across cases. We want to match up records by identifiers. First, we must sort
the records in each file by the identifier. This can be done by clicking Data, then Sort Cases,
moving the identifier into the “Sort by” box, and clicking OK.
Example: Suppose we have a file containing dads and a file containing faminc.
We would like to merge the files together so that we have the dads observation on the same line
as the faminc observation, based on the key variable famid. The procedure to merge the two
files is as follows:
First sort both data sets by famid.
Retrieve the dads data set into the Data Editor window.
Select
Data Merge Files … Add Variables and select the file faminc.
dads
Famid name Inc
2 Art 22000
1 Bill 30000
3 Paul 25000
faminc
Famid faminc96 faminc97 faminc98
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
After merging the dads and faminc, the data would look like the following.
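As an illustration of the one to one merge, the following Python sketch sorts both files by famid and then matches the records pairwise. The values are those of the dads and faminc files; the list-of-dictionaries representation is only a stand-in for the two SPSS data files.

```python
dads = [
    {"famid": 2, "name": "Art", "inc": 22000},
    {"famid": 1, "name": "Bill", "inc": 30000},
    {"famid": 3, "name": "Paul", "inc": 25000},
]
faminc = [
    {"famid": 3, "faminc96": 75000, "faminc97": 76000, "faminc98": 77000},
    {"famid": 1, "faminc96": 40000, "faminc97": 40500, "faminc98": 41000},
    {"famid": 2, "faminc96": 45000, "faminc97": 45400, "faminc98": 45800},
]

# Step 1: sort both files by the key variable.
dads_sorted = sorted(dads, key=lambda r: r["famid"])
faminc_sorted = sorted(faminc, key=lambda r: r["famid"])

# Step 2: match the records pairwise on famid.
merged = []
for d, f in zip(dads_sorted, faminc_sorted):
    assert d["famid"] == f["famid"], "keys must match one to one"
    merged.append({**d, **f})

for row in merged:
    print(row)
```

Each merged row carries the name and income of one dad together with the three family income variables for the same famid.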
The next example considers a one to many merge where one observation in one file may have
multiple matching records in another file. Imagine that we had a file with dads like we saw in
the previous example, and we had a file with kids where a dad could have more than one kid.
It is clear why this is called a one to many merge, since we are matching one dad observation
to one or more (many) kids observations. Remember that the dads file is the file with one
observation per family, and the kids file is the one with many observations. Below, we create the
data file for the dads and for the kids.
We can also retrieve the data set dads2 into the Data Editor window and perform steps 4 to 6 for the
file kids2. This time, select “Working file is keyed table” and choose famid as the key variable.
The Data Editor window will appear as given below.
Here the correct choice of keyed table is what gives us correct results.
The key difference between a one to one merge and a one to many merge is that you need to
correctly identify the keyed table. That means we have to identify which file plays the role of “one”
(in one to many). That file should be chosen as the keyed table. In the above example the keyed table
is dads2, not kids2.
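The role of the keyed table can be sketched in Python with a lookup dictionary. The dad names follow the earlier example, but the kids records and the variable names dadname and kidname are hypothetical, since the contents of dads2 and kids2 are not reproduced here.

```python
# The "one" side: one record per famid. This plays the keyed table role.
dads2 = [
    {"famid": 1, "dadname": "Bill"},
    {"famid": 2, "dadname": "Art"},
    {"famid": 3, "dadname": "Paul"},
]
# The "many" side: hypothetical kids, several per family.
kids2 = [
    {"famid": 1, "kidname": "kid_a"},
    {"famid": 1, "kidname": "kid_b"},
    {"famid": 2, "kidname": "kid_c"},
    {"famid": 3, "kidname": "kid_d"},
]

keyed = {d["famid"]: d for d in dads2}
# The key must be unique in the keyed table.
assert len(keyed) == len(dads2)

# Every kid record looks up its matching dad record.
merged = [{**kid, **keyed[kid["famid"]]} for kid in kids2]

for row in merged:
    print(row)
```

The merged file has one row per kid, and the dad information is repeated for kids of the same family. Using kids2 as the lookup table instead would fail the uniqueness check, which is why the choice of keyed table matters.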
Merging files (add cases option)
The Add Cases option combines two files with different cases that have the same variables.
To merge files with this option, follow this procedure.
Data → Merge Files → Add Cases
All variables should be listed under the small window “new working data file”.
Click OK to complete merging.
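Adding cases amounts to appending the rows of one file to the rows of another file that has the same variables. A minimal Python sketch, using two hypothetical files with the variables id and score, is as follows.

```python
# Two hypothetical files with the same variables but different cases.
file_a = [{"id": 1, "score": 10}, {"id": 2, "score": 12}]
file_b = [{"id": 3, "score": 9}]

variables = list(file_a[0])
# Check that every case in both files has exactly the same variables.
assert all(list(row) == variables for row in file_a + file_b)

combined = file_a + file_b
print(len(combined), variables)
```

The check on variable names corresponds to making sure that all variables appear in the list for the new working data file before merging.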
Selecting Cases
You can analyze a specific subset of your data by selecting only certain cases in which you
are interested. For example, you may want to do a particular analysis on employees only if the
employees have been with the company for more than six years. This can be done by using
the Select Cases menu option, which will either temporarily or permanently remove the cases you
do not want from the dataset. The Select Cases option (or Alt+D+C) is available under the
Data menu item:
Data
Select Cases...
Selecting this menu item will produce the following dialog box. This box contains a list of the
variables in the active data file on the left and several options for selecting cases on the right.
The portion of the dialog box labeled “Unselected Cases Are” gives us the option of
temporarily or permanently removing data from the dataset.
If the “Filtered” option is selected, the unselected cases will be excluded from subsequent
analyses until the “All Cases” option is reset.
If the “Deleted” option is selected, the unselected cases will be removed from the
working dataset. If the dataset is subsequently saved, these cases will be permanently
deleted.
Selecting one of the options for selecting cases will produce a second dialog box that prompts us
for the particular specification in which we are interested. For example, if we choose the “If
condition is satisfied” option and click on the If button, a second dialog box will appear as
shown below.
The above example selects all of the cases in the dataset that meet a specific criterion:
employees that have worked at the company for more than six years (72 months) will be
selected. After this selection has been made, subsequent analyses will use only this subset of
the data. If you have chosen the Filtered option in the previous dialog box, SPSS will indicate
the inactive cases in the Data Editor by placing a slash over the row number. To select the
entire dataset again, return to the Select Cases dialog box and select the All Cases option.
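The difference between filtering and deleting unselected cases can be sketched in Python as follows. The employee records and the variable name jobtime (months with the company) are hypothetical. The indicator variable filter_ stands in for the filter variable that marks selected cases.

```python
# Hypothetical employee records; jobtime is months with the company.
employees = [
    {"id": 1, "jobtime": 64},
    {"id": 2, "jobtime": 98},
    {"id": 3, "jobtime": 72},
    {"id": 4, "jobtime": 81},
]

def selected(case):
    return case["jobtime"] > 72  # more than six years

# "Filtered": every case stays in the file but carries a 0/1 indicator.
for case in employees:
    case["filter_"] = 1 if selected(case) else 0
analysis_cases = [c for c in employees if c["filter_"] == 1]

# "Deleted": unselected cases are dropped from the working data.
remaining = [c for c in employees if selected(c)]

print([c["id"] for c in analysis_cases], len(employees), len(remaining))
```

With filtering, the full file of four cases is still there and the selection can be undone; with deletion, only the two selected cases remain.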
At times we might have data files that need to be collapsed to be useful to us. For instance,
we might have student data but really want classroom data, or we might have weekly data
but be interested in monthly data. Let us see how we can collapse data across kids to
make family-level data.
Aggregating Files
Aggregating files is one kind of data manipulation procedure. The Aggregate procedure
allows you to condense a dataset by collapsing the data on the basis of one or more variables.
For example, to investigate the characteristics of people in the company on the basis of the
amount of their education, you could collapse all of the variables you want to analyze into
rows defined by the number of years of education. To access the dialog boxes for aggregating
data, follow the following steps:
Select Data and then AGGREGATE
We will observe a dialogue box. This dialogue box has several options. These are as
follows.
Break variable: The top box, labeled Break Variable(s), contains the variable within which other
variables are summarized. This is something like a classification variable.
Aggregated Variables: contains the variables that will be collapsed.
Number of cases: This option allows us to save the number of cases that were collapsed at each
level of the break variable.
Example: Suppose we have a file containing information about the kids in three families.
There is one record per kid. Birth is the order of birth (i.e., 1 is first); age, wt and sex are the
child's age, weight and sex respectively. These data are saved as the file kid3.sav in the directory
desktop\training r. We will use this file for showing how to collapse data across
observations. If we use the Aggregate command under the Data menu, we can collapse
across all of the observations and make a single record with the average age of the kids. To do
so we need to create a break variable const = 1 using the Compute command.
Famid Kidname birth Age Wt Sex
1 Bekele 1 9 60 F
1 Bogale 2 6 40 M
1 Barbie 3 3 20 F
2 Anteneh 1 8 80 M
2 Alemayehu 2 6 50 M
2 Abush 3 2 20 F
3 Chapie 1 6 60 M
3 Chuchu 2 4 40 F
3 Mamush 3 2 20 M
The “age_mean” variable will be added to our working data. This is the mean age of all 9
children.
If we follow all of the above steps and change the last option to “Create new data file
containing aggregated variables only”, we will have the following output saved as aggr.sav.
CONST AVGAGE N_Break
1.00 5.11 9
If we use “famid” as the break variable, the aggregate option will compute the average age of the
kids in each family. The following output will be obtained.
FAMID AGE1
1.00 6.00
2.00 5.33
3.00 4.00
We can request averages for more than one variable. For instance, if we want to aggregate
both age and weight by famid we can follow the following steps.
Select Data and then AGGREGATE. In the dialog box that appears, select famid as
the break variable.
Choose “age” and “wt” for summaries of variables
Choose “Create new data file containing aggregated variables only”.
The following output will be produced. The variable N_Break is the count of the number of
kids in each family.
Famid Age_mean Wt_mean N_Break
1 6.00 40.00 3
2 5.33 50.00 3
3 4.00 40.00 3
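The aggregation by famid can be reproduced with a short Python sketch. The records are those of the kids file; the tuple layout and the lower-case output names are only conventions of this sketch.

```python
from collections import defaultdict
from statistics import mean

# (famid, kidname, birth, age, wt, sex) for each kid
kids = [
    (1, "Bekele", 1, 9, 60, "F"), (1, "Bogale", 2, 6, 40, "M"),
    (1, "Barbie", 3, 3, 20, "F"), (2, "Anteneh", 1, 8, 80, "M"),
    (2, "Alemayehu", 2, 6, 50, "M"), (2, "Abush", 3, 2, 20, "F"),
    (3, "Chapie", 1, 6, 60, "M"), (3, "Chuchu", 2, 4, 40, "F"),
    (3, "Mamush", 3, 2, 20, "M"),
]

# Group the cases by the break variable famid.
groups = defaultdict(list)
for famid, _name, _birth, age, wt, _sex in kids:
    groups[famid].append((age, wt))

# One aggregated record per level of the break variable.
aggregated = [
    {
        "famid": famid,
        "age_mean": round(mean(a for a, _ in rows), 2),
        "wt_mean": round(mean(w for _, w in rows), 2),
        "n_break": len(rows),
    }
    for famid, rows in sorted(groups.items())
]

# Using a constant break variable collapses everything into one record.
overall_age_mean = round(mean(k[3] for k in kids), 2)

for row in aggregated:
    print(row)
print(overall_age_mean)
```

The per-family means and counts agree with the aggregated output, and the overall mean age agrees with the value obtained with the constant break variable.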
We can create a variable “girls” that counts the number of girls in the family, and “boys” that
counts the number of boys in the family. You can also add a label after the new
variable name. If you save the output in SPSS, you can see the labels in SPSS data editor after
clicking on the "variable view" tab in the lower left corner of the editor window.
To have summary information which shows the number of boys and girls per family, we will
follow the following procedure. We create two dummy variables Sexdum1 for girls and
Sexdum2 for boys. The sum of sexdum1 is the number of girls in the family. The sum of
sexdum2 is the number of boys in the family.
For instance, if we save our file in the directory desktop\training r, our file will be saved as an
SPSS file. Our results look like the following output.
FamId Boys girls Numkid
1 1.00 2.00 3
2 2.00 1.00 3
3 2.00 1.00 3
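The dummy-variable approach can be sketched in Python as follows, using the famid and sex values of the kids file. Summing sexdum1 and sexdum2 within each family gives the number of girls and boys.

```python
# (famid, sex) for each kid, in the order of the kids file
kids = [
    (1, "F"), (1, "M"), (1, "F"),
    (2, "M"), (2, "M"), (2, "F"),
    (3, "M"), (3, "F"), (3, "M"),
]

summary = {}
for famid, sex in kids:
    sexdum1 = 1 if sex == "F" else 0  # dummy for girls
    sexdum2 = 1 if sex == "M" else 0  # dummy for boys
    row = summary.setdefault(famid, {"boys": 0, "girls": 0, "numkid": 0})
    row["girls"] += sexdum1
    row["boys"] += sexdum2
    row["numkid"] += 1

for famid in sorted(summary):
    print(famid, summary[famid])
```

The sums agree with the aggregated output: one boy and two girls in family 1, and two boys and one girl in families 2 and 3.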
Restructure Data: We use the Restructure Data Wizard to restructure our data.
In the first dialog box, we select the type of restructuring that we want to do. Suppose we have
data that are arranged in groups of related columns. Our interest is to restructure these data
into groups of rows in the new data file. Then we choose the option “Restructure selected
variables into cases”.
Example: Consider a small data set consisting of three variables as given below.
V1 V2 V3
8 63 82
9 62 87
10 64 89
12 66 85
15 67 86
The objective is then to restructure the above data into groups of rows in the new data file. In
other words, we want to convert the above data into one variable that has all the values of the
three variables and one factor variable that indicates the group. This procedure is known as the
restructuring of variables to cases. The procedure is as follows.
From the data menu select restructure, the dialogue box which says “Welcome to the
restructure Data wizard” will appear.
Choose the first option “Restructure selected variables into cases” and click next.
Another dialogue box which says “Variable to cases: Number of variable groups” will
appear. Choose the first option “One” and click next.
Give the name of the target variable; call it “all_inone”.
Move all three variables (V1, V2 and V3) into the Variables to be Transposed box and click Next.
Another dialogue box which says “Variable to cases: Create index variable” will appear.
Choose the first option “One” and click next.
Another dialogue box will appear; here, change the variable name “Index” to “group”.
Click finish and see your restructured data.
The data may appear as shown below.
Id Group All_inone
1 1 8
1 2 63
1 3 82
2 1 9
2 2 62
2 3 87
3 1 10
3 2 64
3 3 89
4 1 12
4 2 66
4 3 85
5 1 15
5 2 67
5 3 86
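The restructuring of variables into cases can be sketched in Python as follows. Each original row produces three new rows, one per variable, with an index variable recording which variable the value came from. The data are those of the V1, V2, V3 example.

```python
# Wide data: one row per case, three related columns.
wide = [
    {"V1": 8, "V2": 63, "V3": 82},
    {"V1": 9, "V2": 62, "V3": 87},
    {"V1": 10, "V2": 64, "V3": 89},
    {"V1": 12, "V2": 66, "V3": 85},
    {"V1": 15, "V2": 67, "V3": 86},
]

long = []
for case_id, row in enumerate(wide, start=1):
    for group, name in enumerate(("V1", "V2", "V3"), start=1):
        # id = original row position, group = index of source variable
        long.append({"id": case_id, "group": group, "all_inone": row[name]})

for row in long:
    print(row["id"], row["group"], row["all_inone"])
```

Five cases with three variables each give fifteen rows in the restructured file.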
The variable Id stands for the row position of each case before the data were restructured. We can
also restructure the data from cases to variables.
For instance, consider the following small data set on age of Nurses and Doctors.
The variable group 1 stands for nurses and 2 stands for doctors.
Id Age Group
1 23 1
2 25 1
3 26 1
4 35 1
5 42 1
6 22 1
1 60 2
2 36 2
3 29 2
4 56 2
5 32 2
6 54 2
The objective is to restructure the above age data into a data set having two separate variables
for Nurses and Doctors. To do so, we follow the following procedure.
From the data menu we select restructure
From the dialogue box we select “Restructure selected cases to variables”
We select Id for Identifier variable
We select group for Index variable and click next and respond to the dialogue box that
will appear.
When you see the dialogue box which says “Cases to variables: Options”,
select group by Index and click next.
Click finish.
Id Age.1 Age.2
1 23 60
2 25 36
3 26 29
4 35 56
5 42 32
6 22 54
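The reverse restructuring, from cases to variables, can be sketched in Python as follows. Id is used as the identifier and group as the index, so each Id ends up with one age variable per group. The data are those of the nurses and doctors example.

```python
# (id, age, group): group 1 = nurses, group 2 = doctors
long = [
    (1, 23, 1), (2, 25, 1), (3, 26, 1), (4, 35, 1), (5, 42, 1), (6, 22, 1),
    (1, 60, 2), (2, 36, 2), (3, 29, 2), (4, 56, 2), (5, 32, 2), (6, 54, 2),
]

wide = {}
for case_id, age, group in long:
    # One output row per identifier, one age column per index value.
    wide.setdefault(case_id, {"id": case_id})[f"age.{group}"] = age

rows = [wide[k] for k in sorted(wide)]
for row in rows:
    print(row["id"], row["age.1"], row["age.2"])
```

Twelve rows collapse into six, with age.1 holding the nurses' ages and age.2 the doctors' ages.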
Transpose all data. We choose this when we want to transpose our data. All rows will
become columns and all columns will become rows in the new data. The procedure is as
follows:
From the data menu we select restructure
From the dialogue box we select “Transpose all data” and click finish
The Transpose dialogue box will appear. We have to select all variables to transpose. (Note:
unselected variables will be lost.) Click OK.
The transposed data, with rows changed to columns and columns to rows, will appear. For
instance, consider the following data.
Id Age group
1 23 1
2 25 1
3 26 1
4 35 1
5 42 1
6 22 1
7 60 2
8 36 2
9 29 2
10 56 2
11 32 2
12 54 2
Applying the above procedure, the transposed form of this data is as given below.
Case_lbl V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Id 1 2 3 4 5 6 7 8 9 10 11 12
Age 23 25 26 35 42 22 60 36 29 56 32 54
Group 1 1 1 1 1 1 2 2 2 2 2 2
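Transposing can be sketched in Python with zip, which turns the columns of the original data into rows. The data are those of the twelve-case example; the dictionary keyed by variable name is only a convention of this sketch.

```python
names = ["id", "age", "group"]
data = [
    (1, 23, 1), (2, 25, 1), (3, 26, 1), (4, 35, 1),
    (5, 42, 1), (6, 22, 1), (7, 60, 2), (8, 36, 2),
    (9, 29, 2), (10, 56, 2), (11, 32, 2), (12, 54, 2),
]

# zip(*data) regroups the values column by column.
transposed = {name: list(col) for name, col in zip(names, zip(*data))}

for name in names:
    print(name, transposed[name])
```

Each original variable becomes one row of twelve values, which matches the layout of the transposed file with columns V1 to V12.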
The procedure for listing cases cannot be performed using dialog boxes and is available only
through command syntax. The syntax for generating a list of cases is shown in the Syntax
Editor window below. The variable names shown in lower case below tell SPSS which
variables to list in the output. Alternatively, you can type the keyword ALL in place of the variable
names, which will produce a listing of all of the variables in the file. The subcommand
/CASES FROM 1 TO 10 is an instruction to SPSS to print only the first ten cases. If this
instruction were omitted, all cases would be listed in the output.
To execute this command, first highlight the selection by pressing on your mouse button
while dragging the arrow across the command or commands that you want to execute. Next,
click on the icon with the black, right-facing arrow on it. Or, you can choose a selection from
the Run menu.
Executing the command will print the list of variables, gender and minority in the above
example, to the Output Viewer. The Output Viewer is the window in which all output will be
printed. The Output Viewer is shown below, containing the text that would be generated
from the above syntax.
3. DESCRIPTIVE STATISTICS USING SPSS
From the previous section, we have seen in the Output Viewer:
The results from running a statistical procedure are displayed in the Viewer.
The output produced can be statistical tables, charts or graphs, or text, depending on the
choices you make when you run the procedure.
The viewer window is divided into two panes.
The outline pane (left side): contains an outline of all of the information stored in the
Viewer.
The contents pane (right hand side): contains statistical tables, charts, and text output.
The icons in the outline pane can have two forms namely:
The open book icon: indicates that it is currently visible in the Viewer
The closed book icon: indicates that it is not currently visible in the Viewer.
It is often useful to investigate the number of cases that fall into various categories.
Frequency tables are useful for summarizing categorical variables -- variables with a
limited number of distinct categories.
From the menu bar choose:
Analyze
Descriptive Statistics
Frequencies...
In the Frequencies dialog box, the Statistics, Charts, and Format buttons let you request
additional output. The Charts button, for instance, offers bar charts, pie charts, and
histograms.
For example, selecting Histograms together with its sub-option With normal curve will
produce a histogram with a normal (bell-shaped) curve superimposed.
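As a rough cross-check on what a frequency table contains, the counts and percentages can be reproduced by hand. The following is a minimal Python sketch, not SPSS syntax; it uses the Group codes from the small data set listed earlier, and the function name frequency_table is invented for illustration.

```python
from collections import Counter

# Group codes for the twelve cases in the small data set listed earlier.
group = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]


def frequency_table(values):
    """Return (value, count, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        percent = 100.0 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], percent, cumulative))
    return rows


table = frequency_table(group)
print(f"{'Value':>5} {'Frequency':>9} {'Percent':>8} {'Cum. %':>7}")
for value, count, percent, cumulative in table:
    print(f"{value:>5} {count:>9} {percent:>8.1f} {cumulative:>7.1f}")
```

Each row corresponds to one category of the variable, which is why this kind of table suits variables with only a few distinct values.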
3.1.2. Descriptive Statistics
Analyze
Descriptive Statistics
Crosstabs...
After selecting Crosstabs from the menu, the dialog box shown below will appear on
your monitor.
The options available by selecting the Statistics and Cells buttons provide you with
several additional output features.
Selecting the Cells button will produce a menu that allows you to add additional values
to your table.
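To make the idea of cell counts and additional cell values concrete, the sketch below builds a two-way table by hand and adds row percentages. It is written in Python rather than SPSS, the eight cases are invented purely for illustration, and the helper names crosstab and row_percentages are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (gender, minority) pairs, invented for illustration only.
cases = [
    ("f", "no"), ("f", "no"), ("f", "yes"), ("m", "no"),
    ("m", "yes"), ("m", "no"), ("f", "no"), ("m", "no"),
]


def crosstab(pairs):
    """Count the cases that fall in every row/column combination."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return rows, cols, cells


def row_percentages(rows, cols, cells):
    """Express each cell count as a percentage of its row total."""
    result = {}
    for r in rows:
        row_total = sum(cells[(r, c)] for c in cols)
        for c in cols:
            result[(r, c)] = 100.0 * cells[(r, c)] / row_total
    return result


rows, cols, cells = crosstab(cases)
percents = row_percentages(rows, cols, cells)
for r in rows:
    line = "  ".join(f"{c}: {cells[(r, c)]} ({percents[(r, c)]:.1f}%)" for c in cols)
    print(f"{r}  {line}")
```

Row percentages are only one of the values that can be shown in a cell; column and total percentages follow the same pattern with a different denominator.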
Displaying Tables:
Tables
Much of the output in SPSS is displayed in a pivot table format.
To create a table, select from the menu bar:
Analyze
Tables
Custom Tables...
Then simply drag and drop variables to where you want them to appear in the table.
Summary Statistics:
Right-click on variable category on the canvas pane and select Summary Statistics
from the pop-up context menu.
In the Summary Statistics dialog box, select Row N % in the Statistics list and click
the arrow button to add it to the Display list.
Both the counts and row percentages will be displayed in the table.
Click Apply to Selection to save these settings and return to the table builder.
To insert totals and subtotals, click Categories and Totals in the Define section,
then click OK.
For scale variables, we can display summary statistics (mean, median, …) in the
cells of the table.
Stacking Variables:
Stacking means taking separate tables and pasting them together into the same display.
To Stack Variables:
In the variable list, select all of the variables you want to stack, then
drag and drop them together into the rows or columns of the canvas
pane. Or
Drag and drop variables separately, dropping each variable either
above or below existing variables in the rows or to the right or left
of existing variables in the columns.
3.1.4. Diagrams and graphs
A. Bar Chart
Bar charts are a common way to display graphically the frequency of each level of a
variable. From the menu bar choose:
Graphs
Bar...
To get started with the bar graph, click on the icon representing the type of graph that
you want, then click on the Define button to produce the following dialog box:
B. Pie Chart
A pie chart is used to present a categorical variable.
From the menu bar choose
Graphs
Pie...
C. Histograms
Making histograms is one of the best ways to check your data for normality.
From the Graphs menu, select "Histogram."
Put your variable in the "variable" box.
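A histogram is simply a count of how many values fall into each interval. The sketch below, in Python rather than SPSS, bins the Age values from the small data set listed earlier and prints a text histogram. The bin start (20) and bin width (10) are assumptions chosen for illustration; SPSS picks its own intervals.

```python
# Age values from the small data set listed earlier in this module.
age = [23, 25, 26, 35, 42, 22, 60, 36, 29, 56, 32, 54]

# Assumed binning for illustration: intervals 20-29, 30-39, and so on.
start = 20
width = 10


def bin_counts(values, start, width):
    """Count values in consecutive intervals [start, start + width), and so on."""
    counts = {}
    for v in values:
        lower = start + width * ((v - start) // width)
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


counts = bin_counts(age, start, width)
for lower, n in counts.items():
    print(f"{lower:>3}-{lower + width - 1:<3} {'*' * n}")
```

With so few cases the shape is only suggestive, but the pile-up in the lowest interval already hints that the values are not symmetric around the mean. Intervals with no cases are simply omitted by this sketch.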
D. Scatter Plots
Scatter plots give you a tool for visualizing the relationship between two or more
variables
Scatter plots are especially useful when you are examining the relationship between
continuous variables using statistical techniques such as correlation or regression.
Scatter plots are also often used to evaluate the bivariate relationships in regression
analyses.
Useful in the early stage of analysis when exploring data and determining whether a linear
regression analysis is appropriate
May show outliers in your data
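The strength of the linear relationship that a scatter plot displays is usually summarized by the Pearson correlation coefficient. The following Python sketch computes it from first principles; the two lists of scores are hypothetical values invented for illustration, and the function name pearson_r is not part of SPSS.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores, invented for illustration only.
confidence = [1, 2, 3, 4, 5]
performance = [2, 4, 5, 4, 5]

r = pearson_r(confidence, performance)
print(f"r = {r:.3f}")
```

A value near +1 corresponds to points rising from left to right, a value near -1 to points falling, and a value near 0 to no linear pattern.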
Example: Performance and Self-confidence
To obtain a scatter plot in SPSS
Graphs
Scatter...
Simple Scatter Plot
The Simple scatter plot graphs the relationship between two variables
When you select the Simple option from the initial dialog box, you will get the
following dialog box:
We can also have SPSS draw different colored markers for each group by entering a
group variable in the Set Markers by box.
In a matrix scatter plot, every combination of variables is plotted twice so that each
variable appears on both the X and Y axes.
Consider a matrix scatter plot with three variables, salary, salbegin, and jobtime;
you would receive the following scatterplot matrix:
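The statement that every combination is plotted twice can be checked by enumerating the panels. This short Python sketch lists the off-diagonal panels of a scatterplot matrix for the three variables named above; it only enumerates variable pairs and does not draw anything.

```python
from itertools import permutations

variables = ["salary", "salbegin", "jobtime"]

# Each off-diagonal panel is an ordered (Y-axis, X-axis) pair of distinct variables.
panels = list(permutations(variables, 2))
for y_var, x_var in panels:
    print(f"panel: {y_var} (Y) against {x_var} (X)")

# Every unordered pair therefore shows up twice, once in each orientation.
unordered = {frozenset(p) for p in panels}
print(len(panels), "panels for", len(unordered), "distinct pairs")
```

The two panels for a given pair are mirror images of each other, so in practice only the panels on one side of the diagonal need to be read.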
Exercise:
1. Let us consider a small data set given below.
x 400 675 475 350 425 600 550 325 675 450
y 1.8 3.8 2.8 1.7 2.8 3.1 2.6 1.9 3.2 2.3
After entering these data into SPSS, draw the scatter plot. What type of relationship do you
observe between x and y? Is an increase in x accompanied by an increase in y?
2. Produce a scatter plot for the following data and discuss the results.
x 400 675 475 350 425 600 550 325 675 450
y -1.8 -3.8 -2.8 -1.7 -2.8 -3.1 -2.6 -1.9 -3.2 -2.3
4. Customizing SPSS outputs and reporting
Much of the output in SPSS is displayed in a pivot table format. While these pivot tables are
professional quality in their appearance, you may wish to alter their appearance or export
them to another application. There are several ways in which you can modify tables. In this
section, we will discuss how you can alter text, modify a table's appearance, and export
information in tables to other applications.
To edit the text in any SPSS output table, you should first double-click that table. This will
outline the table with dashed lines, as shown in the figure below, indicating that it is ready
to be edited. Some of the most commonly used editing techniques are changing the width of rows
and columns, altering text, and moving text. Each of these topics is discussed below:
Changing column width and altering text. To change column widths, move the
mouse arrow above the lines defining the columns until the arrow changes to a double-
headed arrow facing left and right. When you see this new arrow, press down on your
left mouse button, then drag the line until the column is the width you want, then
release your mouse button.
Editing text. First double-click on the cell you wish to edit, then place your cursor on
that cell and modify or replace the existing text. For example, in the frequency table
shown below, the table was double-clicked to activate it, and then the pivot table's title
was double-clicked to activate the title. The original title, "Employment Category,"
was modified by adding the additional text, "as of August 1999."
Using basic editing commands, such as cut, copy, delete, and paste. When you cut
and copy rows, columns, or a combination of rows and columns by using the Edit
menu's options, the cell structure is preserved and these values can easily be pasted
into a spreadsheet or table in another application.
Aside from changing the text in a table, you may also wish to change the appearance of the table
itself. But first, it is best to have an understanding of the SPSS Table Look concept. A Table Look
is a file that contains all of the information about the formatting and appearance of a table,
including fonts, the width and height of rows and columns, coloring, etc. There are several
predefined Table Looks that can be viewed by first right-clicking on an active table, then selecting
the Table Looks menu item. Doing so will produce the following dialog box:
You can browse the available Table Looks by clicking on the file names in the Table Look
Files box, as shown above. This will show you a preview of the Table Look in the Sample
box.
While the Table Looks dialog box provides an easy way to change the look of your table, you
may wish to have more control of the look or create your own Table Look. To modify an
existing table, right-click on an active pivot table, then select the Table Properties menu item.
This will produce the following dialog box:
The above figure shows the Table Properties dialog box with the Cell Formats tab selected.
You can alternate between tabs (e.g., General, Footnotes, etc.) by clicking on the tab at the
upper left of the dialog box. While a complete description of the options available in the Table
Properties dialog box is beyond the scope of this document, there are a few key concepts that
are worth mentioning. Note the Area box at the upper right of the dialog box. This refers to
the portion of the table that is being modified by the options on the left side of the box. For
example, the color in the Background of the Data portion of the table was changed to black
and the color of the text was changed to white by first choosing Data from the Area box, then
selecting black from the Background drop-down menu and selecting white for the text by
clicking on the color palette icon in the Text area on the left side of the dialog box.
The Printing tab also has some useful options. For example, the default option for three-
dimensional tables containing several layers is that only the visible layer will be printed. One
of the options under the Printing tab allows you to request that all layers be printed as
individual tables. Another useful Printing option is the Rescale wide/long tables to fit page,
which will shrink a table that is larger than a page so that it will fit on a single page.
Any modifications to a specific table can be saved as a Table Look. By saving a Table Look,
you will be saving all of the layout properties of that table and can thus apply that look to
other tables in the future. To save a Table Look, click on the General tab in the Table
Properties dialog box. There are three buttons on the bottom right of this box. Use the Save
Look button to save a Table Look. That button will produce a standard Save As dialog box
with which you can save the Table Look you created.
In addition to modifying a table's appearance, you may also wish to export that table. There
are three primary ways to export tables in SPSS. To get a menu that contains the available
options for exporting tables, right-click on the table you wish to export. The three options for
exporting tables are: Copy, Copy object, and Export.
The Copy option copies the text and preserves the rows and columns of your table but does
not copy formatting, such as colors and borders. This is a good option if you want to modify
the table in another application. When you select this option, the table will be copied into your
system clipboard. Then, to paste the table, select the Paste command from the Edit menu in
the application to which you are importing the table. The Copy option is useful if you plan to
format your table in the new application; the disadvantage of this method is that only the text
and cell layout remain, so you will lose much of the formatting that you
observe in the Output Viewer.
The Copy object method will copy the table exactly as it appears in the SPSS Output Viewer.
When you select this option, the table will be copied into your clipboard and can be imported
into another application by selecting the Paste option from the Edit menu of that application.
When you paste the table using this option, it will appear exactly as it is in the Output Viewer.
The disadvantage of this method is that it can be more difficult to change the appearance of
the table once it has been imported.
The third method, Export, allows you to save the table as an HTML or an ASCII file. The
result is similar to the Copy command: you will have a table that retains the text and cell
layout of the table you exported, but it will retain little formatting. This method for exporting
tables to other applications is different from the above two methods in that it creates a file
containing the table rather than placing a copy in the system clipboard. When you select this
method, you will immediately be presented with a dialog box allowing you to choose the
format of the file you are saving and its location on disk. The primary advantage of this
method is that you can immediately create an HTML file that can be viewed in a Web
browser.
To add information to a scatter plot, choose Options from the Chart menu:
Chart
Options...
This will produce the following dialog box:
Some of the most useful options that will add information to your scatterplot are the
Fit Line options.
The Fit Line option will allow you to plot a regression line over your scatter plot.
Click on the Fit Options button to get this dialog box:
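The regression line drawn over a scatter plot is typically the ordinary least-squares line. The Python sketch below computes its slope and intercept by hand, assuming a simple linear fit; the data are hypothetical values invented for illustration, and the function name fit_line is not an SPSS command.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

The fitted line always passes through the point of means, which is a quick visual check when the line is overlaid on the plot.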
The primary tool for modifying charts in SPSS is the Chart Editor. The Chart Editor will open
in a new window, displaying a chart from your Output Viewer. The Chart Editor has several
tools for changing the appearance of your charts or even the type of chart that you are using.
To open the Chart Editor, double-click on an existing chart and the Chart Editor window will
open automatically. The Chart Editor shown below contains a bar graph of employment
categories:
While there are many useful features in the Chart Editor, we will concentrate on three of
them: changing the type of chart, modifying text in the chart, and modifying the graphs.
You can change the type of chart that you are using to display your data using the Chart
Editor. For example, if you want to compare how your data would look when displayed as a
bar graph and as a pie chart, you can do this from the Gallery menu:
Gallery
Pie...
Selecting this option will change the above bar graph into the following pie chart:
Once you have selected your graphical look, you can start modifying the appearance of your
graph. One aspect of the chart that you may want to alter is the text, including the titles,
footnotes, and value labels. Many of these options are available from the Chart menu. For
example, the Title option could be selected from the Chart menu to alter the chart's title:
Chart
Title...
Selecting this menu item will produce the following dialog box:
The title "Employment Categories" was entered in the box above and the default justification
was changed from left to center in the Title Justification box. Clicking OK here would cause
this title to appear at the top center of the above pie chart. Other text in the chart, such as
footnotes, legends, and annotations, can be altered similarly. The labels for the individual
slices of the pies can also be modified, although it may not be obvious from the menu items.
To alter the labels for areas of the pie, choose the Options item from the Chart menu.
Chart
Options...
In addition to providing some general options for displaying the slices, the Labels section
enables you to alter the text labelling slices of the pie chart as well as format that text. You
can click the Edit Text button to change the text for the labels. Doing so will produce the
following dialog box:
To edit individual labels, first click on the current label, which will be displayed below the
Label box, then alter the text in the Label box. When you finish, click the Continue button to
return to the Pie Options dialog box. You can make changes to the format of your labels by
clicking the Format button here. If you do not want to change formatting, click on OK to
return to the Chart Editor.
In addition to altering the text in your chart, you may also want to change the appearance of
the graph with which you are working. Options for changing the appearance of graphs can be
accessed from the Format menu. Many options available from this menu are specific to a
particular type of graph. There are some general options that are worth discussing here. One
such option is Fill Pattern, which changes the pattern of the graph. It can be obtained by
selecting the Fill Pattern option from the Format menu:
Format
Fill Pattern...
This will produce the following dialog box:
First, click on the portion of the graph where you want to change the pattern, then select the
pattern you want by clicking on the pattern sample on the left side of the dialog box. Then,
click the Apply button to change the appearance of your graph.
One other formatting option that is generally useful is the ability to change the colors of your
graphs. To do that, select the Color option from the Format menu:
Format
Color...
This will allow you to change the color of a portion of a graph and its border. First, select the
portion of the graph for which you would like to change its color, then select the Fill option if
you want to change the color of a portion of the graph and select the Border option if you
want to change the color of the border for a portion of the graph. Next, click on the color that
you want and click Apply. Repeat this process for each area or border in the graph that you
want to change.
Interactive Charts
Many of the standard graphs available through SPSS are also available as interactive charts.
Interactive charts offer more flexibility than standard SPSS graphics: you can add variables to
an existing chart, add features to the charts, and change the summary statistics used in the
chart. To obtain a list of the available interactive charts, select the Interactive option from the
Graphs menu:
Graphs
Interactive
Selecting one of the available options will produce a dialog for designing an interactive graph.
For example, if you selected the Boxplot option from the menu, you would get this dialog
box:
Dialog boxes for interactive charts have many of the same features as other SPSS dialog
boxes. For example, in the above dialog box, the variable type is represented by icons: scale
variables, such as the variable bdate, are represented by the icon that resembles a ruler, while
categorical variables, such as the variable educ, are represented by the icon that resembles a
set of blocks. Variables in the variable list on the left of the dialog box can be moved into the
boxes on the right side of the screen by dragging them with your mouse, in contrast to using
the arrow button used in other SPSS dialog boxes. Options in interactive graphs can be
accessed by clicking on the tabs. For example, clicking on the Boxes tab produces the
following dialog box:
Here, you have several choices about the look of your boxplot. The choice to display the
median line is selected here, but the options to indicate outliers and extremes are not selected.
The Titles and Options tabs offer several other choices for altering the look of your graph as
well, although a thorough discussion of these is beyond the scope of this document. When you
have finished the specifications for a graph, click the OK button to produce the graph you
have specified in the Output Viewer.
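The median line, outliers, and extremes mentioned above can be illustrated numerically. The Python sketch below assumes the conventional rule that points more than 1.5 box lengths from the box are outliers and points more than 3 box lengths away are extremes. The data are hypothetical, and the quartiles use Python's default method, which may differ slightly from the hinges SPSS uses, so treat the numbers as illustrative.

```python
import statistics


def boxplot_summary(values):
    """Median, quartiles, and points flagged by the 1.5 and 3 box-length rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    box_length = q3 - q1
    outliers, extremes = [], []
    for v in sorted(values):
        # Distance outside the box; negative or zero for points inside it.
        distance = max(q1 - v, v - q3)
        if distance > 3 * box_length:
            extremes.append(v)
        elif distance > 1.5 * box_length:
            outliers.append(v)
    return {"q1": q1, "median": median, "q3": q3,
            "outliers": outliers, "extremes": extremes}


# Hypothetical values, invented for illustration only.
values = [12, 15, 15, 16, 18, 19, 20, 21, 22, 24, 40, 75]
summary = boxplot_summary(values)
print(summary)
```

Leaving the outlier and extreme options unselected in the dialog box hides these points from the plot; it does not remove them from the data.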
Interactive graphs offer several choices for altering the look of the chart after you have a draft
in the Output Viewer. To get the menus for doing that, double-click on the interactive graph
that you want to alter. For example, double-clicking on the boxplot obtained through the
above dialog box will produce the following menus:
The icons immediately surrounding the graph provide you with several possibilities for
altering the look of your graph. The three leftmost items in the horizontal menu are worthy of
mention. The leftmost icon produces a dialog box that resembles the original Interactive
Graphs dialog box and contains many of the same options. For example, you could change the
variables that you are graphing using this dialog box. The next icon, the small bar graph, lets
you add additional graphical information. For example, you could overlay a line that graphed
the means of the three groups in the above graph by choosing the Dot-Line option from the
menu, or you could add circles representing individuals' salaries within each group by
choosing the Cloud option. The third icon provides several options for changing the look of
your chart. Selecting that icon will produce the following dialog box:
Each icon in this dialog box can be double-clicked to produce a dialog box that contains the
properties of the component of the chart represented by that icon. For example, you could
obtain the properties of the boxes in the interactive graph above by double-clicking on the
icon labelled Box. Doing so would produce this dialog box:
Changing the properties in this or any other dialog box that controls the properties of any
portion of the chart will change the look of the graph in the Output Viewer. For example, you
could change the colors of the boxes and their outlines by selecting a different
5. INTRODUCTION TO MINITAB
Minitab is statistical analysis software that makes it easy to analyze data. It is one of
the suggested software packages for the class. It is commonly used to enter, organize, present,
and analyze data. It can be used for learning about statistics as well as for undertaking
statistical research. It has the advantage of being accurate, reliable, and generally faster
than computing statistics and drawing graphs by hand. This guide is intended to take you
through the basics of Minitab and help you get started with it.
Starting Minitab
D. Opening an existing Worksheet (Minitab type file)
Within a project you can open one or more files that contain data. When you open a file, you
copy the contents of the file into the current Minitab project. Any changes you make to the
worksheet while in the project will not affect the original file. To open a Minitab type file
The Session window displays output and lets you type commands. To be able to type commands
in the Session window, you need to enable this option. To do so, go to
EDITOR->ENABLE COMMANDS.
The Session Window will now look like
Minitab has a large number of built-in routines that allow you to do most basic data analysis.
Commands can also be typed into the Session window, either to replicate the built-in routines or to
create a more tailored data analysis.
Projects are made up of commands, graphs, and worksheets. Every time you save a Minitab project, you
save its graphs, worksheets, and commands. However, each of these elements can also be saved
individually for use in other documents or Minitab projects. Likewise, you can print a project or its
individual elements.
The Project Manager contains several folders, each with its own function:
a) Session folder: manages the Session window.
b) History folder: lists the commands you have used in your session.
c) Graph folder: used for managing, arranging, and naming your graphs.
d) Report pad folder: used for creating, arranging, and editing reports of your project.
e) Related document folder: gives quick access to project-related, non-MINITAB
files for easy reference.
f) Graph window: displays graphs and charts; it is visible only once you have created a graph or
chart for your data.
You may keep the worksheet and session windows occupying half a screen each or you can maximize any
one of them to a full screen. Then you can move between the different windows:
Window
Then choose your desired window from the given list and click on it. The report pad is
accessible through the project manager.
Alternatively, each window is represented by an icon on the top bar. Clicking on the icon will
take you to the window right away. In particular, note the icons for worksheet, session, and
report pad.
After loading Minitab, you will either open an existing project or a new one. In either case, the
following window structure will appear.
[Figure: the Minitab window structure, with the following parts labelled: Close button, Title bar,
Menu bar, Standard toolbar, Session window, Column names, Row names, Worksheet, Project manager,
Status bar.]
5.3. The menus and their use
There are 4 areas on the screen: the Menu bar, the Toolbar, the Session window, and the
Worksheet window.
In the Menu bar you can open menus and choose commands. Here you can find the built-in routines.
File - use this menu to open and save worksheets and to import data.
Edit - use this menu to cut and paste text and data across windows.
Manip - use this menu to sort and recode your data.
Calc - use this menu to create new columns.
Stat - use this menu to analyze your data. This key menu performs many useful
statistical functions.
Graph - use this menu to graphically represent your data.
Editor - use this menu to edit and format your data.
Window - use this menu to change windows.
Help - this opens a standard Microsoft Help window containing information on how
to use the many features of Minitab.
This section discusses the types of data you can work with in MINITAB and the various forms
those data types can take. In MINITAB you can work with three types of data, stored in up to
three forms: columns, constants, or matrices. The data types are:
1. Numeric: made up of the digits 0, 1, …, 9 and the symbol *, which is reserved for missing
values. A number can have a - or + sign, and it can be written in exponential notation
if it is very large or very small, e.g. 3.2E12, which is equal to 3.2 × 10^12.
Numbers can be stored in columns, constants, or matrices. MINITAB stores and
computes numbers in double precision, which means that numbers can have up to 15 or
16 digits (depending on the number) without round-off error.
2. Text: text can be either a single character or a string. A character is a single letter,
digit (from 0 to 9), space, or punctuation mark such as >, ?, <, or !. A string is a series
of characters; typical string data are country names, personal names, and occupations.
The maximum number of characters that can be entered at a time is 80. Text can be stored in
columns or constants but not in matrices.
3. Date/Time: You can enter dates (such as Jan-1-1997 or 03/01/2011), times (such
as 24:23), or both (such as 24/11/2002; 10:30AM).
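The numeric conventions in item 1 can be checked in any language that uses double-precision numbers. The Python sketch below is not Minitab code: it reads exponential notation, treats the symbol * as a missing value (represented here by None, an assumption made for illustration), and reports how many decimal digits a double is guaranteed to hold. The function name parse_numeric is hypothetical.

```python
import sys


def parse_numeric(text):
    """Read one numeric entry; '*' stands for a missing value and becomes None."""
    text = text.strip()
    if text == "*":
        return None
    return float(text)


value = parse_numeric("3.2E12")
print(value == 3_200_000_000_000)   # exponential notation: 3.2 times 10 to the 12th
print(parse_numeric("*"))           # missing value
print(sys.float_info.dig)           # decimal digits a double always preserves
```

On the usual IEEE 754 platforms the last line prints 15, which matches the 15 or 16 digits quoted above.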
There are two main ways to enter data into the Minitab worksheet:
1. Typing in the values (of a given variable) one by one and pressing <Enter> after
each entry.
2. Opening an existing Minitab worksheet or Minitab project.
FILE + OPEN WORKSHEET or FILE + OPEN PROJECT
Then select the file from appropriate drive/folder
Minitab files have a yellow icon with MTB written on them.
Note (i) Data sets from the textbook are available on the CD-ROM attached to the
book. They are organized by chapter. The files may need to be unzipped.
(ii) A new spreadsheet is created each time you open a data file.
To merge the data into the current spreadsheet instead, you need to use
FILE + MERGE WORKSHEET
(iii) The “open file” icon defaults to Minitab project files only.
To open a spreadsheet only, you need to use FILE + OPEN WORKSHEET.
Entering Data into a Worksheet
There are various methods for entering data into a worksheet. The simplest approach is
to use the Data window to enter data directly into the worksheet by clicking your mouse
in a cell and then typing the corresponding data entry and hitting Enter. Remember that
you can make a Data window active by clicking anywhere in the window or by using
Window in the menu bar.
If you type any character that is not a number, Minitab automatically identifies the
column containing that cell as a text variable and indicates that by appending T to the
column name, e.g., C5-T in Display I.4. You do not need to append the T when
referring to the column. Also, there is a data direction arrow in the upper left corner of
the Data window that indicates the direction the cursor moves after you hit Enter.
Clicking on it alternates between row-wise and column-wise data entry. This
is an easy way to enter data when it is suitable.
Remember, columns are variables and rows are observations! Also, you can have
multiple data windows open and move data between them. Use the command to open a
new worksheet.
Quite often, you will want to save the results of all your work in creating a worksheet.
If you exit Minitab before you save your work, you will have to reenter everything. So
we recommend that you always save. To use the commands of this section, make sure
that the Worksheet window of the worksheet in question is active.
Use File > Save Current Worksheet to save the worksheet with its current name, or the
default name if it doesn't have one.
The Save in box at the top contains the name of the folder in which the worksheet will
be saved once you click on the Save button. Here the folder is called data, and you can
navigate to a new folder using the Up One Level button immediately to the right of this
box. The next button takes you to the Desktop and the third button allows you to create
a subfolder within the current folder. The box immediately below contains a list of all
files of type .mtw in the current folder.
You can select the type of file to display by clicking on the arrow in the Save as type
box, which we have done here, and click on the type of file you want to display that
appears in the drop-down list.
There are several possibilities, including saving the worksheet in other formats, such as
Excel. Currently, there is only one .mtw file in the folder data and it is called
marks.mtw. If you want to save the worksheet with a different name, type this name in the
File name box and click on the Save button. To retrieve a worksheet, use File > Open
Worksheet and fill in the dialog box, as depicted in Display I.20, appropriately. The
various windows and buttons in this dialog box work
as described for the File > Save Current Worksheet As command, with the exception that
we now type the name of the file we want to open in the File name box and click on the
Open button.
To set up a connection between Minitab and Excel, we need to tell Minitab the file path
(directories, folders, etc.) to where that Excel file lives. The simplest import of an Excel
file is by using the File > Open Worksheet command in Minitab.
In the Open Worksheet dialog box, the first step is to click the “Files of Type” drop-
down list and choose “All.” This lets us see all file types in the folder. Navigate to your
Excel file and select it.
But before you click “Open,” take a look at the buttons that appear at the bottom of the
dialog box after you select the Excel file. Click “Preview” to view how Minitab is
recognizing the data in the worksheet. Then you can click “Options” to specify which
data in the worksheet you want to import.
Since Excel is a general, cell-based spreadsheet, your document may have data in any
row or column with formulas scattered in between. Minitab, as a statistical software
package, requires the data to be in column-wise format (which is why it's easy to
manipulate data with the Data menu in Minitab). Because of this difference, you want
to avoid bringing over any header or footer information from Excel. Just focus on
bringing over the raw dataset into Minitab. Use the Open Worksheet > Options box to
specify exactly which rows to import.
4. Go to the “SINGLE CHARACTER SEPARATOR” option. The data in the text file
are usually separated by spaces or tabs. Choose the appropriate option. If you are unsure
how the data are separated, another option is to use the number of data rows: just
enter the number of data rows in the “NUMBER OF DATA ROWS” box.
5. Click OK.
6. The results will appear in the worksheet window.
Note: This can sometimes be a little tricky, as you can get a file that does not have the
data in the format that you want. If this happens, close the worksheet where the data is
placed and try importing it again, changing some of the options in step 4. This is a
trial-and-error procedure, so don't panic if you don't get it right at the first attempt.
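Why the separator choice in step 4 matters can be seen outside Minitab. The Python sketch below reads a small tab-separated text, first with the correct separator and then with the wrong one. The sample text is made up for illustration (it borrows three x and y values from the exercise in the scatter plot section), and read_delimited is a hypothetical helper, not a Minitab command.

```python
import csv
import io


def read_delimited(text, separator):
    """Split text into a header and data rows using a single-character separator."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)
    rows = [row for row in reader if row]
    return header, rows


# A made-up tab-separated file: one header line, then three data rows.
sample = "x\ty\n400\t1.8\n675\t3.8\n475\t2.8\n"

header, rows = read_delimited(sample, "\t")
print(header, rows)

# With the wrong separator everything lands in a single column.
wrong_header, wrong_rows = read_delimited(sample, " ")
print(wrong_header)
```

A single wide column after import is therefore a typical sign that the separator was chosen incorrectly and the import should be repeated.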
Copying data to Minitab works like copying data to any other type of spreadsheet (e.g.,
Excel).
1. Copy the data you wish to use in Minitab.
2. Go to the position in the desired Minitab worksheet where you want to paste the
data. If you wish to paste a cell with a header or name, make sure that the cursor is
in the variable name cell (the cell below the column number C1, C2, etc.).
3. Go to EDIT -> PASTE CELLS to paste the data.
4. Sometimes when you copy data, Minitab reads it in the wrong format, e.g., as text
when it is numeric. To solve this problem, select the problematic column(s), go to
DATA -> CHANGE DATA TYPE, and choose the desired format. The most
useful format is numeric.
The following dialog box appears. Choose the variables you want to modify and where
you want to store them. The storage variables can be the same variables as the ones you
are modifying. Then hit OK.
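What a text-to-numeric conversion does to a column can be sketched in a few lines. The Python code below is an illustration only, not Minitab's implementation: the pasted values are hypothetical, blanks and the * symbol are assumed to mean missing (shown as None), and text_to_numeric is an invented name.

```python
def text_to_numeric(column):
    """Convert text cells to floats; blanks and '*' become None (missing)."""
    converted = []
    for cell in column:
        cell = cell.strip()
        if cell in ("", "*"):
            converted.append(None)
        else:
            # float() raises ValueError for entries that are not numbers.
            converted.append(float(cell))
    return converted


# A hypothetical column that was pasted as text.
pasted = ["1.8", "3.8", " 2.8", "*", "1.7"]
numeric = text_to_numeric(pasted)
print(numeric)
```

If a cell contains something that is not a number, this sketch stops with an error, which is a useful reminder to inspect the column for stray characters before converting it.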
5.6.3. Export data
To export data, you can save the Minitab worksheet as a different file type. Choose File > Save
Current Worksheet as to save the following types of files in Minitab:
2. Import the Excel file into Access. Consult the Access Help for details.
When you save your worksheet as a text file, Minitab saves date/time data in the same format in
which it is displayed in the worksheet. Thus, if dates are displayed in the format mm/dd/yyyy,
then only the date is saved and not the hidden components, such as the time.
When you save your worksheet as a file type other than text, Minitab saves all the date/time
information. For example, if dates in a column are displayed in the format mm/dd/yyyy and you
save the worksheet as an Excel file, when you open that file in Excel, your spreadsheet will
include both the date and time information: mm/dd/yyyy h:mm:ss.ss.
If you use Save Current Worksheet As to save the worksheet as a text file, you cannot specify the
columns to save. You also cannot save your data in a custom format, for example, with line
breaks after certain columns. If you want to have more control over how text files are saved, use
File > Other Files > Export Special Text.
6. Descriptive statistics using Minitab
Descriptive Statistics
Displays N, N*, Mean, SE Mean, StDev, Min, Q1, Median, Q3, and Max
a) Descriptive Statistics for one variable
Stat→ Basic Statistics→ Display Descriptive Statistics→ Double-click on appropriate
variable (For Dell Data, double-click on Rates of Return so that it is displayed under
Variables).
As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Statistics button, this screen will appear:
The checked items will be displayed in the output. To check or uncheck an item, click in the
box to the left of the word.
If you click on the Graphs button, this screen will appear:
To display any of these graphs (in addition to the descriptive statistics displayed in the session
window), click in the box. (For purposes of this example, I have not clicked on any graphs since
graphs will be explained in the next section.)
To display the data, click on OK. For the Dell example, this information is displayed in the
session window:
Descriptive Statistics: Rates of Return
Variable          N  N*    Mean  SE Mean   StDev  Minimum       Q1  Median
Rates of Return  60   0  0.0907   0.0195  0.1511  -0.2175  -0.0304  0.0784

Variable             Q3  Maximum
Rates of Return  0.1931   0.4561
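For readers who want to see how the quantities in this output are obtained, here is a minimal Python sketch. It is not Minitab code. The function name describe and the sample values are made up for illustration, and we assume the quartiles are located at positions (n+1)/4 and 3(n+1)/4, which is what the "exclusive" method of Python's statistics module does.

```python
import math
import statistics

def describe(values):
    """Return the summary statistics listed above for one column of data."""
    data = [v for v in values if v is not None]   # None stands in for a missing value
    n = len(data)
    stdev = statistics.stdev(data)                # sample standard deviation (divisor n - 1)
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return {
        "N": n,
        "N*": len(values) - n,                    # number of missing values
        "Mean": statistics.mean(data),
        "SE Mean": stdev / math.sqrt(n),
        "StDev": stdev,
        "Minimum": min(data),
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "Maximum": max(data),
    }

# Made-up numbers, not the Dell data.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, None]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:8s} {value}")
```

Note that SE Mean is simply StDev divided by the square root of N, which agrees with the Dell output above (0.1511 / sqrt(60) is about 0.0195).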
b) Descriptive statistics for one variable, grouped by a second variable
Stat→ Basic Statistics→ Display Descriptive Statistics→ Double-click on
appropriate variable→ Click in By variables (optional) box and then double-click on
appropriate variable → OK. (For Auction Data, double-click on Auction Price so that it is
displayed under Variables. Then move the cursor into the By variables (optional) box and
double-click on No. of bidders so that it is displayed under By variables (optional).)
For the Auction Data example, this information is displayed in the session window:
Note: If you see a * in the output, that indicates that the value could not be calculated.
In this example, the numerous * appear because N is not large enough in each group to
calculate all the descriptive statistics. (e.g. There is only one instance where the number
of bidders equals 5, and thus SE Mean, StDev, Q1, and Q3 could not be calculated with
only one data point)
c) Store Descriptive Statistics
This feature adds the descriptive statistics to the data worksheet instead of displaying
the output in the session window:
Stat→ Basic Statistics→ Store Descriptive Statistics→ Double-click on appropriate
variable (For Dell Data, double-click on Rates of Return so that it is displayed under
Variables).
As you can see from the screen above, you are again given the option to alter the output
by clicking on the buttons. If you click on the Statistics button, this screen will appear:
d) Column Statistics
You can calculate various statistics on columns. Column statistics are displayed in the
Session window, and are optionally stored in a constant.
Calc→ Column Statistics→ Click by the Statistic you want calculated (For Auction
Data, click by Standard Deviation) → Double-click on appropriate column in Input
variable box (Double-click on No. of Bidders) →OK.
This output is displayed in the session window:
Standard Deviation of No. of Bidders
Standard deviation of No. of Bidders = 2.83963
e) Row Statistics
You can compute one value for each row in a set of columns. The statistic is calculated
across the rows of the column(s) specified and the answers are stored in the
corresponding rows of a new column.
Calc→ Row Statistics→ Click by the Statistic you want calculated→ Double-click
on appropriate variable(s) in Input variables box→ Type the name of the new
column that will be created→ OK.
Calculating Row Statistics does not make sense using the example data because it is not
meaningful in context. Thus, an example is not given here. However, in order to see
what row statistics are able to be calculated, the screen shot is shown below.
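Since no worked example is given, the idea of a row statistic can be illustrated with a small Python sketch. The worksheet, its column names and the helper row_statistic are hypothetical; the sketch only shows that one value is computed per row and stored in a new column.

```python
import statistics

# Hypothetical worksheet: each key is a column, each list position is a row.
worksheet = {
    "C1": [2.0, 4.0, 6.0],
    "C2": [3.0, 5.0, 7.0],
    "C3": [4.0, 9.0, 8.0],
}

def row_statistic(columns, func):
    """Apply func across each row of the given equal-length columns."""
    return [func(row) for row in zip(*columns)]

# Store the row means in a new column, one value per row.
worksheet["RowMean"] = row_statistic(
    [worksheet["C1"], worksheet["C2"], worksheet["C3"]], statistics.mean
)
print(worksheet["RowMean"])
```

Passing max, min or sum instead of statistics.mean gives the corresponding row statistic.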
Graphs
a) Histogram
Using the Dell Data that is now inserted into Minitab, a histogram can be made by
going to Graph→ Histogram. Then this screen will appear:
Click on appropriate graph and then click OK. (For this example, we
will display the simple histogram.) → Double-click on appropriate
variable (For Dell Data, double-click on Rates of Return so that it is
displayed under Graph Variables) → OK.
Note: You are able to edit the graph at this point. On the graph below, the arrows
represent where you can double-click to make changes to the graph. You can do this
type of editing on most graphs.
Let's say you wanted to edit the scale on the x-axis. By double-clicking on any of the x-axis
numbers (For this Dell example, you could double-click on -0.16), this screen will then appear:
This screen shows the Scale tab. Another way to edit the scale is to click on the
Binning tab. By doing so, this screen will appear:
(The default sets the Interval Definition to Automatic. However, for this Dell example,
click by Midpoint/Cut point positions and replace the numbers given with the new
numbers shown above.)
If you click on the Show tab, this screen will appear:
If you click on the Labels tab, this screen will appear:
(The default is set to Tahoma Font, Size 10. For this example, choose Lucida
Handwriting Font, Size 12.)
If you click on the Alignment tab, this screen will appear:
As you can see, the binning, size, and font have been changed in this example. Since
we originally double-clicked on one of the x-axis numbers, we were able to make
changes regarding that aspect of the graph. Likewise, you can make changes to other
parts of the graph by double-clicking on the appropriate spot. The details for all the
other arrows (displayed on page 26) are not going to be explained here.
Basically, you can change the way the text, bars, and background are displayed.
Another way to alter graphs is to use the buttons. If we go back to our original
histogram example, after going to Graph→ Histogram→ OK→ Double-clicking on
appropriate variable, we are back to this screen:
Here you are given the option to alter the output by clicking on the buttons. If you click
on the Scale button, this screen will appear:
This screen shows the Axes and Ticks tab. If you click on the Y-Scale Type tab,
this screen will appear:
(The default is set for Percent, but for this Dell example, click by Frequency.)
If you click on the Gridlines tab, this screen will appear:
If you click on the Reference Lines tab, this screen will appear:
(There are no reference lines by default, but for this example, type 6 to show a
reference line at y = 6.)
If you click on the Labels button, this screen will appear:
(The default is set for None, but click by Use y-value labels for this example.)
If you click on the Data View button, this screen will appear:
This screen shows the Data Display tab. If you click on the Distribution tab, this screen
will appear:
If you click on the Multiple Graphs button, this screen will appear:
This screen shows the Multiple Variables tab. If you click on the By Variables tab, this screen
will appear:
If you click on the Data Options button, this screen will appear:
This screen shows the Subset tab. If you click on the Group Options tab, this
screen will appear:
To display the graph, click on OK. The histogram will display:
b) Dot plot
Graph→ Dot plot. Then this screen will appear:
Click on appropriate graph and then click OK. (For this example, we will display the
simple dot plot). →Double-click on appropriate variable (For Dell Data, double-click
on Rates of Return so that it is displayed under Graph Variables) →OK.
c) Boxplot
Click on appropriate graph and then click OK. (For this example, we will display
the simple boxplot.) →Double-click on appropriate variable (For Dell Data, double-click
on Rates of Return so that it is displayed under Graph Variables) →OK.
d) Probability Plot
Click on appropriate graph and then click OK. (For this example, we will display the
single probability plot.) →Double-click on appropriate variable (For Dell Data,
double-click on Rates of Return so that it is displayed under Graph Variables) →
OK.
This Probability Plot will display:
e) Graphical Summary
Stat→ Basic Statistics→ Graphical Summary→ Double-click on appropriate
variable (For Dell Data, double-click on Rates of Return so that it is displayed under
Variables) → OK.
Note: The By variables option is used to create multiple graphical summaries based on
a grouping variable, called a by variable. For an example using the Auction
Data, use Auction Price as the Variable and No. of Bidders as the By variable.
The output will display a graphical summary for every group of number of bidders.
Here is one of the graphical summaries that is displayed:
Thus, only the auction prices for when the number of bidders = 9 are shown.
f) Bar Chart
Choose this graphical format if you have one or more columns of categorical data and
you want to chart the frequency of each category.
Graph→ Bar Chart→ Choose Counts of unique values from the drop box and Click
OK. (For this example, we will use the Student Data and show a simple Bar Chart.)
As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Chart Options button, this screen will
appear:
If we had chosen Decreasing Y instead of Default after clicking on the Chart
Options button, the bars would be ordered from tallest to shortest.
Bars representing a function of a variable
Choose if you have one or more columns of data and you want to chart a function of the
data. Quite a few of these functions are summary statistics.
Graph→ Bar Chart→ Choose A function of a variable from the drop box (Then, for
this example, we will click on Cluster under Multiple Y's) and Click OK.
Click on appropriate variable from the drop box to choose a function. (Here we'll
choose mean) →
The bar chart will display:
Choose if you have one or more columns of summary data and you want to chart the
summary value for each category.
Graph →Bar Chart →Choose Values from a table from the drop box (Then, for this
example, we will click on Simple under One column of values) and Click OK.
Double-click on appropriate variable in Graph variables box and then double-click
on appropriate variable in the Categorical variable. (For Student Data, age was put
under the graph variable and gender was put under the categorical variable.) →OK.
Although it does not provide much use in context to sum the ages of males versus
females, this example was completed to showcase the use of this function.
g) Pie Chart
Choose when each row in a column represents a single observation. Each slice in the
pie is proportional to the number of occurrences of a value in the column.
Graph→ Pie Chart →Click on Chart raw data →Double-click on appropriate
variable in Categorical variables box (For Student Data, double-click on Portfolio).
As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Pie Options button, this screen will appear:
ii) Chart values from a table
Choose when the category names are in one column and summary data are in another
column.
Let's look at how to use a pie chart if our data were organized differently. (Look at
Student (2) Data)
Graph →Pie Chart →Click on Chart values from a table →Double-click on
appropriate variable in Categorical variable box and double-click on appropriate
variable in the Summary variables box. (For Student (2) Data, double-click Gender
for Categorical variable and Count for Summary variables.)
7. STATISTICAL ANALYSIS USING MINITAB AND SPSS
Analysis in Minitab can be done in two ways: using the Built-In routines or using command
language in the Session window. These two can be used interchangeably.
Built-In routines
Most of the functions needed in basic and more advanced statistical analysis are found as
Minitab Built-In routines. These routines are accessed through the menu bar. To use the menu
commands, click on an item in the menu bar to open a menu, click on a menu item to execute a
command or open a submenu or dialog box.
Command Language
To be able to type commands in the Session window, you must obtain the “MTB>” prompt. All
commands are then entered after the “MTB>” prompt. All command lines are free format, in
other words, all text may be entered in upper or lower case letters anywhere in the line.
NOTE: This guide focuses mainly on using the Built-In routines. All the explanations and
examples that follow will be done using Minitab's Built-In routines. A brief introduction to
using Minitab commands is found in section.
INFERENTIAL STATISTICS
a. Confidence Intervals:
i. 1-Sample Z: Stat> Basic Statistics> 1-sample Z >check the alpha level in options.
ii. 1-Sample t: Stat> Basic Statistics> 1-sample t >check the alpha level in
options.
b. Hypothesis Testing:
i. 1-Sample Z: Stat> Basic Statistics> 1-sample Z >check the alpha level and alternative
hypothesis in options
ii. 1-Sample t: Stat> Basic Statistics> 1-sample t>check the alpha level and alternative
hypothesis in options
Point and interval estimation
2. Select the Stat menu, highlight Basic Statistics, then click 1-Sample Z . . .
3. If you have raw data, enter C1 in the cell marked “Samples in columns:”. If you have
summarized data, select the summarized data radio button and enter the summarized
values. Select Options and enter a confidence level. Click OK. In the cell marked
standard deviation, enter the value of σ. Click OK.
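The interval computed by the 1-Sample Z procedure is the sample mean plus or minus z times σ/√n. The following Python sketch illustrates the arithmetic only; it is not Minitab code, and the function name z_interval, the data and the value of σ are made up.

```python
import math
from statistics import NormalDist, mean

def z_interval(data, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    n = len(data)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value, about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    centre = mean(data)
    return centre - margin, centre + margin

# Hypothetical raw data (as if typed into C1) and an assumed known sigma.
c1 = [10, 12, 14, 16]
low, high = z_interval(c1, sigma=2.0, confidence=0.95)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

When σ is unknown, the same structure applies with the sample standard deviation and a t critical value in place of z; the t distribution is not part of Python's standard library, so it is not shown here.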
Confidence Intervals about μ, σ Unknown
1. If you have raw data, enter them in column C1.
2. Select the Stat menu; highlight Basic Statistics, then highlight 1-Sample t . . .
3. If you have raw data, enter C1 in the cell marked “Samples in columns”. If you have
summarized data, select the “summarized data” radio button and enter the summarized
data. Select Options . . . and enter a confidence level. Click OK twice.
Confidence Intervals about p
1. If you have raw data, enter the data in column C1.
2. Select the Stat menu, highlight Basic Statistics, and then highlight 1 Proportion . . .
3. Enter C1 in the cell marked “Samples in columns” if you have raw data. If you have
summary statistics, click “Summarized data” and enter the number of trials, n, and the
number of events (successes), x.
4. Click the Options . . . button. Enter a confidence level. Click “Use test based on a
normal distribution” (provided that the assumptions stated are satisfied). Click OK
twice.
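As a rough check on the output, the normal-approximation interval for a proportion can be computed by hand. The sketch below is ours (the function name and the counts are hypothetical) and assumes the sample is large enough for the normal approximation to be reasonable.

```python
import math
from statistics import NormalDist

def proportion_interval(events, trials, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = events / trials
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - margin, p_hat + margin

# Hypothetical summarized data: 40 events (successes) in 100 trials.
p_low, p_high = proportion_interval(events=40, trials=100)
print(f"95% CI for p: ({p_low:.4f}, {p_high:.4f})")
```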
Confidence Intervals about σ²
1. Enter raw data in column C1
2. Select the Stat menu, highlight Basic Statistics, and then highlight Graphical
Summary . . .
3. Enter C1 in the cell marked “Variables.”
4. Enter the confidence level desired. Click OK. The confidence interval for sigma is
reported in the output.
2. Select the Stat menu, highlight Basic Statistics, and then highlight 1-Sample Z . . .
3. Click Options. In the cell marked “Alternative,” select the appropriate direction for
the alternative hypothesis. Click OK.
1. If you have raw data, enter them in C1, using 0 for failure and 1 for success.
2. Select the Stat menu, highlight Basic Statistics, then highlight 1-Proportion.
3. If you have raw data, select the “Samples in columns” radio button and enter C1. If
you have summarized statistics, select “Summarized data.” Enter the number of trials
and the number of successes.
4. Click Options. Enter the value of the proportion stated in the null hypothesis. Enter
the direction of the alternative hypothesis. If (), check the box marked “Use test and
interval based on normal distribution.” Click OK twice.
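The normal-based test for a proportion described in these steps can be sketched in Python as follows. This is an illustration under our own assumptions, not Minitab's implementation: the function name, the wording of the alternative argument and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(events, trials, p0, alternative="not equal"):
    """z statistic and p-value for Ho: p = p0, using the normal approximation."""
    p_hat = events / trials
    se = math.sqrt(p0 * (1 - p0) / trials)   # standard error under the null hypothesis
    z = (p_hat - p0) / se
    std_normal = NormalDist()
    if alternative == "less than":
        p_value = std_normal.cdf(z)
    elif alternative == "greater than":
        p_value = 1 - std_normal.cdf(z)
    else:                                     # two-sided alternative
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, testing Ho: p = 0.5.
z_stat, p_val = one_proportion_z_test(60, 100, p0=0.5)
print(f"z = {z_stat:.3f}, p-value = {p_val:.4f}")
```

The direction of the alternative changes only the p-value, not the test statistic.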
1. Enter the raw data into column C1 if necessary. Select the Stat menu, highlight
Basic Statistics, and then highlight 1 Variance.
2. Make sure the pull-down menu has “Enter standard deviation” in the window. If you
have raw data, enter C1 in the window marked “Samples in columns” and make sure
the radio button is selected. If you have summarized data, select the “Summarized data”
radio button and enter the sample size and sample standard deviation.
3. Click Options and select the direction of the alternative hypothesis. Click OK.
4. Check the “Perform hypothesis test” box and enter the value of the standard deviation
in the null hypothesis. Click OK.
Comparisons of two population means and proportions
MINITAB will calculate the test statistic and p-value for the difference between the
means of two populations when the population standard deviations are unknown.
1. Enter the data into C1 and C2.
2. Select Stat > Basic Statistics > 2-Sample t.
3. Click the button for [Samples in different columns].
4. Click in the box for [First]. Double-click C1 in the list.
5. Click in the box for [Second], then double-click C2 in the list. Do not check the box
for [Assume equal variances]. Minitab will use the large sample formula. The completed
dialog box is shown.
6. Click [Options].
a. Type in 90 for the [Confidence level] and 0 for the [Test mean].
b. Select [greater than] for the [Alternative]. This option affects the p-value, so it must be
correct.
7. Click [OK] twice.
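To see what the unequal-variances calculation involves, here is a minimal Python sketch of the test statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula). The function name and the two samples are made up, and the p-value is left out because the t distribution is not available in Python's standard library.

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2, test_difference=0.0):
    """t statistic and approximate degrees of freedom, equal variances not assumed."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1        # squared standard error of the first mean
    v2 = variance(sample2) / n2        # squared standard error of the second mean
    t = (mean(sample1) - mean(sample2) - test_difference) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical data, as if entered in C1 and C2.
c1 = [5, 7, 9]
c2 = [1, 2, 3]
t_stat, df = welch_t(c1, c2)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```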
Calculates the value of the Chi-square (4) density curve at each value in C1 and stores these
values in C2. This is useful for plotting the density curve. The Calc > Probability Distributions >
Chi-Square command, or the session commands cdf and invcdf, can also be used to obtain values
of the Chi-square (k) cumulative distribution function and inverse distribution function,
respectively. We use the Calc > Random Data > Chi-Square command, or the session command
random, to obtain random samples from these distributions.
We will see applications of the chi-square distribution later in the book but we mention one
here. In particular, if $x_1, \ldots, x_n$ is a sample from a $N(\mu, \sigma)$ distribution, then
$(n-1)s^2/\sigma^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2/\sigma^2$ is known to follow a Chi-square (n − 1) distribution, and this fact is used
as a basis for inference about σ (confidence intervals and tests of significance). Because of the
non-robustness of these inferences to small deviations from normality, these inferences are not
recommended.
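The density calculation mentioned at the start of this passage can be mimicked outside Minitab. The following Python sketch evaluates the Chi-square (k) density from its formula; the function name and the values placed in c1 are our own, and for simplicity the density is returned as 0 for x ≤ 0.

```python
import math

def chi_square_pdf(x, k):
    """Density of the Chi-square(k) distribution at x (taken as 0 for x <= 0)."""
    if x <= 0:
        return 0.0
    half_k = k / 2
    return x ** (half_k - 1) * math.exp(-x / 2) / (2 ** half_k * math.gamma(half_k))

# Hypothetical values in C1; C2 receives the Chi-square(4) density at each value.
c1 = [0.5, 1.0, 2.0, 4.0, 8.0]
c2 = [chi_square_pdf(x, 4) for x in c1]
for x, density in zip(c1, c2):
    print(f"{x:4.1f}  {density:.5f}")
```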
Correlations
While a scatter plot is a convenient graphical method for assessing whether or not there is any
relationship between two variables, we would also like to assess this numerically. The
correlation coefficient provides a numerical summary of the degree to which a linear relationship
exists between two quantitative variables, and this can be calculated using the Stat > Basic
Statistics > Correlation command (session command: correlate E1 . . . Em).
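For reference, the correlation coefficient is the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. A minimal Python sketch, with a hypothetical function name and made-up data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y increases with x, but not perfectly linearly.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.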
2. Results are displayed in the Session window as presented below.
The following dialog box appears
This is basically a calculator that allows doing many calculations with the variables. Basic
functions are found in the number pad and more sophisticated ones are found in the functions
box to the right of the number pad.
To make sure that your result does not overwrite a variable, name a new variable in the
“STORE RESULT IN VARIABLE” field at the top of the calculator.
a. Adding variables
1. To add variables, name the variable where you want to store the results.
2. Select the first variable, press the “+” sign and select the second variable (and so on
for more than two variables). You should obtain something similar to the window
below.
3. The result will then be shown in the worksheet window.
Taking logarithms
Logical functions
Some statistical analysis will need to separate by groups according to characteristics that are
contained in the data. Logical functions are particularly useful in these cases. A simple
example on how to use them is described below.
1. Choose the variable you want to do the logical test to. Here we are looking at the
“SEX” variable.
2. Choose the logical test you want to use. Here we want to see which observations
have the variable “SEX” equal to 1, that is, which observations are males.
3. Make sure that you have indicated a variable in which to store your results, by typing
the name of your result variable in the “STORE RESULT IN VARIABLE” box.
4. The result variable will be a binary variable (variable of 1s and 0s) where 1
indicates the logical test is true and 0 that the test is false. The result variable will appear in
the Worksheet window.
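The effect of the logical test can be sketched in Python. The coding (1 = male, 0 = female) follows the example above, but the data values and the second column, score, are made up for illustration.

```python
# Hypothetical worksheet columns.
sex = [1, 0, 1, 1, 0]
score = [12.0, 15.0, 9.0, 14.0, 11.0]

# Result variable: 1 where the test SEX = 1 is true, 0 where it is false.
is_male = [1 if value == 1 else 0 for value in sex]

# The indicator can then be used to separate the data into groups.
male_scores = [s for s, flag in zip(score, is_male) if flag == 1]
female_scores = [s for s, flag in zip(score, is_male) if flag == 0]

print(is_male)
print(male_scores, female_scores)
```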
Determining the Least-Squares Regression Line
Regression is another technique for assessing the strength of a linear relationship existing
between two variables and it is closely related to correlation. For this, we use the Stat >
Regression command.
As noted in IPS, the regression analysis of two quantitative variables involves computing the
least-squares line y = a + bx, where one variable is taken to be the response variable y and the
other is taken to be the explanatory variable x.
It is very convenient to have a scatter plot of the points together with the least-squares line.
This can be accomplished using the Stat > Regression > Fitted Line Plot command.
1. With the explanatory variable in C1 and the response variable in C2, select the Stat
menu and highlight Regression. Highlight Regression . . .
2. Select the explanatory (predictor) variable and response variable and click OK.
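The least-squares line y = a + bx has slope b equal to the sum of cross-products of deviations divided by the sum of squared deviations of x, and intercept a equal to the mean of y minus b times the mean of x. A minimal Python sketch, with a hypothetical function name and made-up data:

```python
def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Made-up data: explanatory variable in c1, response variable in c2.
c1 = [1, 2, 3, 4]
c2 = [3, 5, 7, 9]
a, b = least_squares_line(c1, c2)
print(f"y = {a:.3f} + {b:.3f} x")

# Residuals (observed minus fitted), which are what a residual plot displays.
residuals = [yi - (a + b * xi) for xi, yi in zip(c1, c2)]
```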
The Coefficient of Determination, R²
This is provided in the standard regression output.
Residual Plots
Follow the same steps as those used to obtain the regression output (Section 4.2). Before
selecting OK, click GRAPHS. In the cell that says “Residuals versus the variables,” enter
the name of the explanatory variable. Click OK.
Simulation
1. Set the seed by selecting the Calc menu and highlighting Set Base . . . Insert any
seed you wish into the cell and click OK.
2. Select the Calc menu, highlight Random Data, and then highlight Integer.
3. Select the Stat menu, highlight Tables, and then highlight Tally . . . Enter C1 into
the variables cell. Make sure that the Counts box is checked and click OK.
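The same three steps (set a seed, generate random integers, tally the results) can be sketched in Python. The seed, the number of values (100) and the range (1 to 6) are arbitrary choices of ours, since the steps above do not specify them.

```python
import random
from collections import Counter

random.seed(1234)                                        # step 1: set the seed (any value)
c1 = [random.randint(1, 6) for _ in range(100)]          # step 2: 100 random integers from 1 to 6
tally = Counter(c1)                                      # step 3: count each distinct value

for value in sorted(tally):
    print(value, tally[value])
```

Setting the seed makes the simulation reproducible: running the sketch again with the same seed gives the same counts.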
The chi-squared (χ²) test statistic is widely used in the analysis of contingency
tables.
The chi-square test evaluates the hypothesis that the row and column variables in a
cross tabulation are independent.
It compares the actual observed frequency in each group with the expected frequency
(the latter is based on theory, experience or comparison groups).
The chi-squared test (Pearson's χ²) allows us to test for association between
categorical (nominal) variables.
The null hypothesis for this test is that there is no association between the variables.
Consequently, a significant p-value implies association.
After opening the Crosstabs dialog box as described in the preceding section, click the
Statistics button to get the following dialog box:
                  Variable A
                  A1      A2      Total
Variable B   B1   a       b       a+b
             B2   c       d       c+d
Total             a+c     b+d     N

Test Statistic: χ²-test for 2 x 2 Contingency table
$$\chi^2 = \frac{N(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)}$$
Test Statistic: χ²-test with d.f. = (r-1) x (c-1)
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where:
$O_{ij}$ = observed frequency,
$E_{ij}$ = expected frequency of the cell at the juncture of the i-th row and j-th column.
Assumptions of the χ²-test
Data must be categorical
The data should be a frequency data (counts for frequency, proportions
/difference of proportions for prevalence & incidence).
The chi-squared test assumes adequate sample size, that is, that the numbers in each
cell are 'not too small'.
No expected frequency should be less than 1, and no more than 20% of the
expected frequencies should be less than 5.
If some numbers are too small,
row or column variable categories can sometimes be combined to make
the expected frequencies larger, or the Yates correction can be used, or
Fisher's exact test should be used instead.
It assumes that measures are independent of each other, i.e. the categories
created are mutually exclusive.
The χ² test assumes that there is a theoretical basis for the
categorization of the variables.
Measures of Association
Test for a relationship between two categorical variables
                                          Frat or sorority?
                                          Yes       No        Total
Ever - Depression   yes   Count           681       7692      8373
                          Expected Count  715.6     7657.4    8373.0
                    no    Count           3744      39657     43401
                          Expected Count  3709.4    39691.6   43401.0
Total                     Count           4425      47349     51774
                          Expected Count  4425.0    47349.0   51774.0
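The expected counts in this table are (row total × column total) / grand total, and the test statistic sums (observed − expected)² / expected over the cells. The following Python sketch reproduces the expected counts from the observed counts above; the function name is ours, and the p-value is not computed because the chi-square distribution function is not in Python's standard library.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Observed counts from the table: rows = depression (yes, no), columns = frat/sorority (yes, no).
observed = [[681, 7692], [3744, 39657]]
chi2, df, expected = chi_square_statistic(observed)
print([[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {chi2:.3f}, d.f. = {df}")
```

For a 2 x 2 table this gives the same value as the shortcut formula based on a, b, c and d.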
Chi-Square Tests
T-tests
The t test is a useful technique for comparing the mean values of two sets of numbers.
The comparison will provide you with a statistic for evaluating whether the difference
between two means is statistically significant.
A t test can compare one sample with a fixed value, two independent samples, or paired samples.
These three types of t tests are all located under the Analyze menu item.
Analyze
Compare Means
One-Sample T test...
Independent-Samples T test...
Paired-Samples T test...
One-Sample T- Test:
Example: College students report drinking an average of 5 drinks the last time they
“partied”/socialized.
Hypotheses
Ho: µ = 5
HA: µ ≠ 5
Test: Two-tailed t-test
Result: Reject Ho
One-Sample Statistics
                   N       Mean   Std. Deviation   Std. Error Mean
How many drinks    53374   4.42   4.401            .019
One-Sample Test (Test Value = 5)
                                                                        95% Confidence Interval
                                                                        of the Difference
                   t         df      Sig. (2-tailed)   Mean Difference   Lower   Upper
How many drinks    -30.352   53373   .000              -.578             -.62    -.54
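The t statistic in this output is the mean difference divided by the standard error of the mean. A short Python sketch using the rounded summary values printed above (the function name is ours):

```python
import math

def one_sample_t(sample_mean, sample_sd, n, test_value):
    """t statistic and degrees of freedom from summary statistics."""
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - test_value) / standard_error
    return t, n - 1

# Rounded values from the One-Sample Statistics table.
t_drinks, df_drinks = one_sample_t(sample_mean=4.42, sample_sd=4.401, n=53374, test_value=5)
print(f"t = {t_drinks:.2f}, df = {df_drinks}")
```

Because the printed mean is rounded to two decimals, the result differs slightly from the -30.352 that SPSS computes from the unrounded data.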
The independent-samples t test:
Used to compare two groups' scores on the same variable.
In the Independent-Samples T Test dialog box we have to identify the grouping variable
or cut point by clicking on the Define Groups button after moving the grouping
variable into place.
Example: Men and women report significantly different numbers of sexual partners
over the past 12 months
Hypotheses
Ho: µ1 = µ2
HA: µ1 ≠ µ2
Test: Independent Samples t-test OR One-way ANOVA
Result: Reject null
Group Statistics
                    Sex      N       Mean   Std. Deviation   Std. Error Mean
Partners you had    female   32687   1.34   2.017            .011
                    male     18474   1.82   3.627            .027
Independent Samples Test
Group Statistics
                                        gender   N     Mean    Std. Deviation   Std. Error Mean
verbal fluency - animal naming score    female   855   15.24   5.711            .195
                                        male     580   15.95   5.493            .228
The group statistics table tells us the mean animal naming score among males and females.
The t-test tells us whether the difference in mean animal naming score between
males and females is statistically significant.
The paired - sample t test:
It compares the means of two variables that represent the same group at different
times (e.g. before and after an event) or related groups (e.g., husbands and
wives).
In the Paired-Samples T Test dialog box we have to move two variables from the
left side box to the Paired Variables box.
Analysis of Variance
The One-Way ANOVA compares the means of two or more groups based on one
independent variable (or factor).
It measures differences among group means.
In SPSS it can be performed as follows:
From the menus choose:
Analyze
Compare Means
One-Way ANOVA
Move all dependent variables into the box labeled "Dependent List"
Move the independent variable into the box labeled "Factor"
Click on the button labeled "Options"
Check off the boxes for Descriptive and Homogeneity of Variance
Click on the box marked "Post Hoc" and choose the appropriate post hoc comparison
Assumption: the groups have approximately equal variance on the dependent variable. You can
check this by looking at Levene's Test.
If Levene's statistic is significant, we have evidence that the homogeneity assumption has
been violated.
If it is a problem, you can re-run the analysis selecting the option for "Equal Variances
Not Assumed".
Hypotheses:
o Null: There are no significant differences between the groups' mean
scores.
o Alternate: There is a significant difference between the groups'
mean scores.
HA: μi ≠ μj for some i ≠ j
$$F = \frac{\text{Variation between the populations}}{\text{Variation within the populations}} = \frac{S_B^2}{S_W^2}$$
The steps can be summarized in an ANOVA table as:
Levene's Test
If the means are significantly different (reject Ho), we are interested in which pairs
of means are different. Consequently, we should use a method called multiple
comparisons.
For all tests, the hypotheses will be:
Ho: The pair of treatment means is equal (μi=μj for i≠ j)
H1: Not equal (μi≠μj for i≠ j).
Reject Ho if p-value < 0.05 or zero is not included in the confidence interval.
To do this in SPSS, click the Post Hoc button and select a method according to whether equal
variance is assumed or not (for this, see Levene's test of homogeneity of variance).
Descriptives
ANOVA
Blood Alcohol Content
                  Sum of Squares   df      Mean Square   F        Sig.
Between Groups    3.188            5       .638          92.123   .000
Within Groups     348.695          50376   .007
Total             351.884          50381
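To show where the entries of an ANOVA table come from, here is a minimal Python sketch that computes the F statistic from raw groups. The function name and the three small groups are made up; the p-value is omitted because the F distribution is not in Python's standard library.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-groups sum of squares: group size times squared deviation of each group mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: squared deviations from each group's own mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Made-up data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova(groups)
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

Applying the same ratio to the rounded entries of the table above, (3.188 / 5) / (348.695 / 50376), gives roughly 92.1, in line with the reported F of 92.123.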
Bivariate Correlation
Partial, and
Distances
The bivariate correlation is for situations where you are interested only in the
relationship between two variables
To obtain a bivariate correlation, choose the following menu option:
Analyze
Correlate
Bivariate...
Move the necessary variables to the Variables box.
The partial correlation measures an association between two variables with the
effects of one or more other variables factored out
To obtain a partial correlation, select the following menu item:
Analyze
Correlate
Partial...
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum x^2 - (\sum x)^2/n\right]\left[\sum y^2 - (\sum y)^2/n\right]}}$$
Ho: ρ = 0
HA: ρ≠0
Test: Pearson Product Moment Correlation
Correlations
Partial Correlation in SPSS
The partial correlation measures the strength of association between two variables while
controlling for the effects of one or more other variables. For example: current and
beginning salary, controlling for the effect of previous experience.
Partial correlations can be especially useful in situations where it is not obvious
whether several variables overlap with each other.
To obtain a partial correlation, select the following menu item:
Analyze
Correlate
Partial...
In the Partial Correlations dialog box we have to move the necessary variables to the
Variables box and the Controlling for box.
Example: Let us compare the strength of relationship between current salary and
beginning salary, after controlling for the effect of previous experience.
Regression is a technique that can be used to investigate the effect of one or more predictor
variables on an outcome variable
Fitting a simple linear regression model to the data allows us to explain or predict
the values of one variable (the dependent or outcome or response variable or y)
given the values of a second variable (called the independent or exposure or
explanatory variable or x).
The basic idea of simple linear regression is to find the straight line which best
fits the data.
For example, if we are interested in predicting under-five mortality rate from percentage of
children immunized against DPT we would treat immunization as independent variable and
mortality rate as dependent variable.
Equation of the fitted line
y = a + bx.
To conduct a regression analysis, select the following from the Analyze menu
Analyze
Regression
Linear...
This will produce the following dialog box:
R is the multiple correlation coefficient between all of the predictor variables and the
dependent variable.
R Square used to describe the goodness-of-fit or the amount of variance explained by a given
set of predictor variables.
Move the dependent variable to the 'Dependent' box and the independent
variable to the 'Independent(s)' box.
After clicking 'Statistics', choose 'Estimates', 'Model fit', 'Confidence
intervals' and 'R squared change', and click 'OK'.
This will give you an ANOVA table that splits the variation into regression and
residual parts; its significance is measured using the F-test.
It also gives you the regression coefficients (the intercept and the slope).
The sign of the slope (ß) tells you whether the relationship between the predictor
and the outcome variable is positive or negative.
It also gives you R2, which is the explanatory or predictive power of the model
in predicting the outcome variable.
After clicking 'Statistics', select 'Estimates' and 'Model fit'.
OUTPUT
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .193(a)   .037       .037                5.496

Change Statistics
Model   R Square Change   F Change   df1   df2    Sig. F Change
1       .037              52.271     1     1344   .000
a. Predictors: (Constant), marital status
The Model Summary shows R Square (R2), which tells us how much of the variance in the
outcome the model explains; here it is .037, or about 3.7%.
ANOVA(b)
Model           Sum of Squares   df     Mean Square   F        Sig.
1  Regression   1578.905         1      1578.905      52.271   .000(a)
   Residual     40597.181        1344   30.206
   Total        42176.086        1345
a. Predictors: (Constant), marital status
b. Dependent Variable: verbal fluency - animal naming score
The ANOVA table also tells us, through the F-test, whether the explanatory variable predicts
the outcome variable well.
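The figures in the Model Summary and ANOVA tables are tied together by simple arithmetic, which the sketch below reproduces from the sums of squares and degrees of freedom reported in the output. The variable names are chosen here for illustration; small differences in the last digit come from rounding in the printed table.

```python
# Values copied from the ANOVA table above.
ss_regression = 1578.905
ss_residual = 40597.181
df_regression = 1
df_residual = 1344

ss_total = ss_regression + ss_residual        # total sum of squares
r_squared = ss_regression / ss_total          # share of variance explained
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual       # mean square error
f_stat = ms_regression / ms_residual          # F = MS regression / MS residual

print(f"R Square = {r_squared:.3f}")
print(f"Mean Square (Residual) = {ms_residual:.3f}")
print(f"F = {f_stat:.2f}")
```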
Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient;
Lower and Upper are the bounds of the 95% confidence interval for B.)
Model               B        Std. Error   Beta    t        Sig.   Lower    Upper
1  (Constant)       17.779   .344                 51.718   .000   17.105   18.454
   marital status   -.808    .112         -.193   -7.230   .000   -1.027   -.589
a. Dependent Variable: verbal fluency - animal naming score
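The standardized coefficient (Beta) expresses the effect in standard deviation units, which
makes it useful for comparing predictors measured on different scales.
Student's t-test assesses the significance of each coefficient. Equivalently, a coefficient
is significant at the 5% level when the lower and upper bounds of its 95% CI are both
negative or both positive, that is, when the interval excludes zero.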
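The t statistic and the confidence bounds for the slope can be checked from B and its standard error. The sketch below uses the values for marital status from the Coefficients table and assumes a critical value of 1.96 (the normal approximation, which is very close to the t critical value when the residual df is 1344); the variable names are illustrative.

```python
# Values for "marital status" copied from the Coefficients table.
b = -0.808
se = 0.112
crit = 1.96  # assumption: normal approximation to the t critical value

t_stat = b / se
lower = b - crit * se
upper = b + crit * se

# Significant at the 5% level when the interval excludes zero,
# i.e. both bounds have the same sign.
significant = (lower > 0) or (upper < 0)

print(f"t = {t_stat:.2f}, 95% CI = ({lower:.3f}, {upper:.3f}), significant = {significant}")
```

The hand-computed t differs slightly from the printed -7.230 because B and its standard error are rounded to three decimals in the table.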
Without verifying that your data have met the regression assumptions, your results may be
misleading. This subtopic explores how you can use SPSS to test whether your data meet
the assumptions of linear regression. In particular, we will consider the following assumptions.
Linearity - the relationships between the predictors and the outcome variable
should be linear.
Normality - the errors should be normally distributed - technically normality is
necessary only for the t-tests to be valid, estimation of the coefficients only
requires that the errors be identically and independently distributed.
Homogeneity of variance (homoscedasticity) - the error variance should be
constant
Independence - the errors associated with one observation are not correlated
with the errors of any other observation
Model specification - the model should be properly specified (including all
relevant variables, and excluding irrelevant variables)
Additionally, there are issues that can arise during the analysis that, while strictly speaking
not assumptions of regression, are nonetheless of great concern to regression analysts.
Influence - individual observations that exert undue influence on the
coefficients
Collinearity - predictors that are highly collinear, i.e. linearly related, can cause
problems in estimating the regression coefficients.
Many graphical methods and numerical tests have been developed over the years for regression
diagnostics and SPSS makes many of these methods easy to access and use. In this chapter, we
will explore these methods and show how to verify regression assumptions and detect potential
problems using SPSS.
A single observation that is substantially different from all other observations can make a large
difference in the results of your regression analysis. If a single observation (or small group of
observations) substantially changes your results, you would want to know about this and
investigate further. There are three ways that an observation can be unusual.
Outliers: In linear regression, an outlier is an observation with large residual. In other words, it
is an observation whose dependent-variable value is unusual given its values on the predictor
variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or
other problem.
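Leverage: An observation with an extreme value on a predictor variable is called a point with
high leverage. Leverage is a measure of how far an observation deviates from the mean of that
variable. These leverage points can have an unusually large effect on the estimate of regression
coefficients.
Influence: An observation is influential if it exerts undue influence on the coefficients, that
is, if removing it substantially changes the estimates.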
How can we identify these three types of observations? Let's look at an example dataset called
crime. This dataset appears in Statistical Methods for Social Sciences, Third Edition by Alan
Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name
(state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the
percent of the population living in metropolitan areas (pctmetro), the percent of the population
that is white (pctwhite), percent of population with a high school education or above (pcths),
percent of population living under poverty line (poverty), and percent of population that are
single parents (single). Below we read in the file and compute some descriptive statistics on
these variables. The data are in the file crime.sav.
Descriptive Statistics
Variable             N    Minimum   Maximum   Mean      Std. Deviation
PCTWHITE             51   31.80     98.50     84.1157   13.25839
Valid N (listwise)   51
Let's say that we want to predict crime by pctmetro, poverty, and single. That is to say, we
want to build a linear regression model between the response variable crime and the
independent variables pctmetro, poverty and single. We will first look at the scatter plots of
crime against each of the predictor variables before the regression analysis so we will have
some ideas about potential problems. We can create a scatter plot matrix of these variables as
shown below.
graph
/scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty single.
The graphs of crime with other variables show some potential problems. In every plot, we see
a data point that is far away from the rest of the data points. Let's make individual graphs of
crime with pctmetro and poverty and single so we can get a better view of these scatterplots.
We will use BY state(name) to plot the state name instead of a point.
GRAPH /SCATTERPLOT(BIVAR) = single WITH crime BY state(name).
All the scatter plots suggest that the observation for state = "dc" is a point that requires extra
attention since it stands out away from all of the other points. We will keep it in mind when
we do our regression analysis.
Now let's try the regression command predicting crime from pctmetro poverty and single.
We will go step-by-step to identify all the potentially unusual or influential points afterwards.
regression
/dependent crime
/method = enter pctmetro poverty single.
[Output: Variables Entered/Removed(b), Model Summary(b), ANOVA(b) (Total sum of squares
9728474.745, df 50) and Coefficients(a) tables.]
Let's examine the standardized residuals as a first means for identifying outliers. Below we use
the /residuals=histogram subcommand to request a histogram for the standardized residuals.
As you see, we get the standard output that we got above, as well as a table with information
about the smallest and largest residuals, and a histogram of the standardized residuals. The
histogram indicates a couple of extreme residuals worthy of investigation.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram.
[Output: Variables Entered/Removed(b), Model Summary(b), ANOVA(b), Coefficients(a) and
Residuals Statistics(a) tables, followed by the histogram of the standardized residuals.]
Let's now request the same kind of information, but for the Studentized deleted residual.
The Studentized deleted residual is the residual that would be obtained if the regression were
rerun omitting that observation from the analysis. This is useful because some points are so
influential that, when they are included in the analysis, they pull the regression line close to
themselves and no longer appear to be outliers; once the observation is deleted, it becomes
more obvious how outlying it is. To save space, below we show just the output related to the
residual analysis.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid).
Residuals Statistics(a)
The histogram shows some possible outliers. We can use the outliers (sdresid) and id(state)
options to request the 10 most extreme values for the studentized deleted residual to be displayed
labeled by the state from which the observation originated. Below we show the output generated
by this option, omitting all of the rest of the output to save space. You can see that "dc" has the
largest value (3.766) followed by "ms" (-3.571) and "fl" (2.620).
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid) id (state) outliers (sdresid).
Outlier Statistics(a)
                          Rank   Case Number   STATE   Statistic
Stud. Deleted Residual    1      51            dc      3.766
                          2      25            ms      -3.571
                          3      9             fl      2.620
                          4      18            la      -1.839
                          5      39            ri      -1.686
                          6      12            ia      1.590
                          7      47            wa      -1.304
                          8      13            id      1.293
                          9      14            il      1.152
                          10     35            oh      -1.148
a Dependent Variable: CRIME
We can use the /casewise subcommand below to request a display of all observations where
the sdresid exceeds 2. To save space, we show just the new output generated by the /casewise
subcommand. This shows us that Florida, Mississippi and Washington DC have sdresid values
exceeding 2.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid) id (state) outliers (sdresid)
/casewise=plot(sdresid) outliers(2).
Now let's look at the leverage values to identify observations that will have potential great
influence on regression coefficient estimates. We can include lever with the histogram () and
the outliers () options to get more information about observations with high leverage. We
show just the new output generated by these additional subcommands below. Generally, a
point with leverage greater than (2k+2)/n should be carefully examined. Here k is the number
of predictors and n is the number of observations, so a value exceeding (2*3+2)/51 = .1568
would be worthy of further investigation.
As you see, there are 4 observations that have leverage values higher than .1568.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid lever) id(state) outliers (sdresid lever)
/casewise=plot(sdresid) outliers(2).
Outlier Statistics(a)
                          Rank   Case Number   STATE   Statistic
Stud. Deleted Residual    1      51            dc      3.766
                          2      25            ms      -3.571
                          3      9             fl      2.620
                          4      18            la      -1.839
                          5      39            ri      -1.686
                          6      12            ia      1.590
                          7      47            wa      -1.304
                          8      13            id      1.293
                          9      14            il      1.152
                          10     35            oh      -1.148
Centered Leverage Value   1      51            dc      .517
                          2      1             ak      .241
                          3      25            ms      .171
                          4      49            wv      .161
                          5      18            la      .146
                          6      46            vt      .117
                          7      9             fl      .083
                          8      26            mt      .080
                          9      31            nj      .075
                          10     17            ky      .072
a Dependent Variable: CRIME
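The leverage rule of thumb can be applied mechanically to the ten centered leverage values listed in the table. The sketch below is only a hand check under the cutoff (2k+2)/n used in the text, with k = 3 predictors and n = 51 observations; the dictionary and variable names are illustrative.

```python
k, n = 3, 51
cutoff = (2 * k + 2) / n  # rule-of-thumb leverage cutoff from the text

# Centered leverage values copied from the Outlier Statistics table.
leverage = {
    "dc": 0.517, "ak": 0.241, "ms": 0.171, "wv": 0.161, "la": 0.146,
    "vt": 0.117, "fl": 0.083, "mt": 0.080, "nj": 0.075, "ky": 0.072,
}

flagged = [state for state, h in leverage.items() if h > cutoff]
print(f"cutoff = {cutoff:.4f}")
print("high leverage:", flagged)
```

This reproduces the count of four high-leverage observations mentioned in the text.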
As we have seen, DC is an observation that has both a large residual and large leverage. Such
points are potentially the most influential. We can make a plot that shows the leverage by the
residual and look for observations that are high in leverage and have a high residual. We can
do this using the /scatterplot subcommand as shown below. This is a quick way of checking
potential influential observations and outliers at the same time. Both types of points are of
great concern for us. As we see, "dc" is both a high residual and high leverage point, and "ms"
has an extremely negative residual but does not have such a high leverage.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid lever) id(state) outliers (sdresid, lever)
/casewise=plot(sdresid) outliers (2)
/scatterplot (*lever, *sdresid).
Now let's move on to overall measures of influence, specifically Cook's D, which combines
information on the residual and leverage. The lowest value that Cook's D can assume is zero,
and the higher the Cook's D, the more influential the point. The conventional cut-off point is
4/n, or in this case 4/51 = .078. Below we add the cook keyword to the outliers option and
to the /casewise subcommand. For the 3 outliers flagged in the "Casewise Diagnostics" table,
the value of Cook's D exceeds this cutoff. In the "Outlier Statistics" table, we see that "dc",
"ms", "fl" and "la" are the 4 states that exceed this cutoff, all others falling below this
threshold.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid lever) id(state) outliers (sdresid, lever, cook)
/casewise=plot(sdresid) outliers (2) cook dffit
/scatterplot (*lever, *sdresid).
[Output: Casewise Diagnostics(a) table.]
Outlier Statistics(a)
                          Rank   Case Number   STATE   Statistic   Sig. F
Stud. Deleted Residual    1      51            dc      3.766
                          2      25            ms      -3.571
                          3      9             fl      2.620
                          4      18            la      -1.839
                          5      39            ri      -1.686
                          6      12            ia      1.590
                          7      47            wa      -1.304
                          8      13            id      1.293
                          9      14            il      1.152
                          10     35            oh      -1.148
Cook's Distance           1      51            dc      3.203       .021
                          2      25            ms      .602        .663
                          3      9             fl      .174        .951
                          4      18            la      .159        .958
                          5      39            ri      .041        .997
                          6      12            ia      .041        .997
                          7      13            id      .037        .997
                          8      20            md      .020        .999
                          9      6             co      .018        .999
                          10     49            wv      .016        .999
Centered Leverage Value   1      51            dc      .517
                          2      1             ak      .241
                          3      25            ms      .171
                          4      49            wv      .161
                          5      18            la      .146
                          6      46            vt      .117
                          7      9             fl      .083
                          8      26            mt      .080
                          9      31            nj      .075
                          10     17            ky      .072
a Dependent Variable: CRIME
Cook's D can be thought of as a general measure of influence. You can also consider more
specific measures of influence that assess how each coefficient is changed by including the
observation. Imagine that you compute the regression coefficients for the regression model
with a particular case excluded, then recompute the model with the case included, and you
observe the change in the regression coefficients due to including that case in the model. This
measure is called DFBETA, and a DFBETA value can be computed for each observation for
each predictor. As shown below, we use the /save sdbeta(sdfb) subcommand to save the
DFBETA values for each of the predictors. This saves 4 variables into the current data file,
sdfb1, sdfb2, sdfb3 and sdfb4, corresponding to the DFBETA for the Intercept and for
pctmetro, poverty and for single, respectively. We could replace sdfb with anything we like,
and the variables created would start with the prefix that we provide.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid lever) id(state) outliers (sdresid, lever, cook)
/casewise=plot(sdresid) outliers (2) cook dffit
/scatterplot (*lever, *sdresid)
/save sdbeta(sdfb).
The /save sdbeta(sdfb) subcommand does not produce any new output, but we can see
the variables it created for the first 10 cases using the list command below.
For example, by including the case for "ak" in the regression analysis (as compared to excluding
this case), the coefficient for pctmetro would decrease by .106 standard errors. Likewise, by
including the case for "ak" the coefficient for poverty decreases by .131 standard errors, and the
coefficient for single increases by .145 standard errors (as compared to a model excluding "ak").
Since the inclusion of an observation could contribute to either an increase or a decrease in a
regression coefficient, DFBETAs can be either positive or negative. A DFBETA whose absolute
value exceeds 2/sqrt(n) merits further investigation. In this example, we would be concerned
about absolute values in excess of 2/sqrt(51) or .28.
List
/variables state sdfb1 sdfb2 sdfb3
/cases from 1 to 10.
STATE   SDFB1     SDFB2     SDFB3
ak      -.10618   -.13134   .14518
al      .01243    .05529    -.02751
ar      -.06875   .17535    -.10526
az      -.09476   -.03088   .00124
ca      .01264    .00880    -.00364
co      -.03705   .19393    -.13846
ct      -.12016   .07446    .03017
de      .00558    -.01143   .00519
fl      .64175    .59593    -.56060
ga      .03171    .06426    -.09120
Number of cases read: 10 Number of cases listed: 10
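The 2/sqrt(n) rule can be applied to the ten listed cases by hand. The sketch below copies the three saved DFBETA columns from the listing and flags any state with at least one value beyond the cutoff in absolute size; the dictionary and variable names are illustrative.

```python
import math

n = 51
cutoff = 2 / math.sqrt(n)  # rule-of-thumb DFBETA cutoff from the text

# (sdfb1, sdfb2, sdfb3) for the first 10 cases, copied from the listing.
dfbeta = {
    "ak": (-0.10618, -0.13134, 0.14518),
    "al": (0.01243, 0.05529, -0.02751),
    "ar": (-0.06875, 0.17535, -0.10526),
    "az": (-0.09476, -0.03088, 0.00124),
    "ca": (0.01264, 0.00880, -0.00364),
    "co": (-0.03705, 0.19393, -0.13846),
    "ct": (-0.12016, 0.07446, 0.03017),
    "de": (0.00558, -0.01143, 0.00519),
    "fl": (0.64175, 0.59593, -0.56060),
    "ga": (0.03171, 0.06426, -0.09120),
}

flagged = [
    state for state, values in dfbeta.items()
    if any(abs(v) > cutoff for v in values)
]
print(f"cutoff = {cutoff:.3f}")
print("cases to investigate:", flagged)
```

Among these ten cases only "fl" exceeds the cutoff, on all three saved variables.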
We can plot all three DFBETA values for the 3 coefficients against the state id in one graph,
shown below, to help us see potentially troublesome observations. We changed the labels for
sdfb1, sdfb2 and sdfb3 so they would be shorter and more clearly labeled in the
graph. We can see that the DFBETA for single for "dc" is about 3, indicating that by including
"dc" in the regression model, the coefficient for single is 3 standard errors larger than it would
have been if "dc" had been omitted.
This is yet another bit of evidence that the observation for "dc" is very problematic.
The following table summarizes the general rules of thumb we use for the measures we have
discussed for identifying observations worthy of further investigation (where k is the number
of predictors and n is the number of observations).
Measure                                       Value
leverage                                      > (2k+2)/n
abs(Studentized deleted residual, sdresid)    > 2
Cook's D                                      > 4/n
abs(DFBETA)                                   > 2/sqrt(n)
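The four rules in the table can be combined into one check per observation. The sketch below is a hypothetical helper (the name `rules_violated` is not an SPSS function) applied to the values reported for "fl" in this section: Studentized deleted residual 2.620, centered leverage .083, Cook's D .174, and the three listed DFBETA values.

```python
import math


def rules_violated(sdresid, leverage, cooks_d, dfbetas, k, n):
    """Return the names of the rules of thumb that an observation exceeds."""
    violated = []
    if leverage > (2 * k + 2) / n:
        violated.append("leverage")
    if abs(sdresid) > 2:
        violated.append("sdresid")
    if cooks_d > 4 / n:
        violated.append("cooks_d")
    if any(abs(d) > 2 / math.sqrt(n) for d in dfbetas):
        violated.append("dfbeta")
    return violated


# Values for "fl" taken from the tables in this section (k = 3, n = 51).
fl = rules_violated(
    sdresid=2.620,
    leverage=0.083,
    cooks_d=0.174,
    dfbetas=(0.64175, 0.59593, -0.56060),
    k=3,
    n=51,
)
print(fl)
```

An observation that trips several rules at once, as "dc" does in the tables, deserves the closest look.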
We have shown a few examples of the variables that you can refer to in the /residuals, /casewise,
/scatterplot and /save sdbeta () subcommands. Here is a list of all of the variables that can be used
on these subcommands; however, not all variables can be used on each subcommand.
In addition to the numerical measures we have shown above, there are also several graphs that
can be used to search for unusual and influential observations. The partial-regression plot is
very useful in identifying influential points. Below we add the /partialplot
subcommand to produce partial-regression plots for all of the predictors. For example, in the
3rd plot below you can see the partial-regression plot showing crime by single after both
crime and single have been adjusted for all other predictors in the model. The line plotted has
the same slope as the coefficient for single. This plot shows how the observation for DC
influences the coefficient. You can see how the regression line is tugged upwards trying to fit
through the extreme value of DC. Alaska and West Virginia may also exert substantial
leverage on the coefficient of single. These plots are useful for seeing how a single
point may be influencing the regression line, while taking other variables in the model into
account.
Note that the regression line is not automatically produced in the graph. We double clicked on
the graph, and then chose "Chart" and the "Options" and then chose "Fit Line Total" to add a
regression line to each of the graphs below.
Regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram (sdresid lever) id(state) outliers (sdresid, lever, cook)
/casewise=plot(sdresid) outliers (2) cook dffit
/scatterplot (*lever, *sdresid)
/partialplot.
DC has appeared as an outlier as well as an influential point in every analysis. Since DC is
really not a state, we can use this to justify omitting it from the analysis saying that we really
wish to just analyze states. First, let's repeat our analysis including DC below.
Regression
/dependent crime
/method=enter pctmetro poverty single.
Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)
Model            B           Std. Error   Beta   t         Sig.
1  (Constant)    -1666.436   147.852             -11.271   .000
   PCTMETRO      7.829       1.255        .390   6.240     .000
   POVERTY       17.680      6.941        .184   2.547     .014
   SINGLE        132.408     15.503       .637   8.541     .000
a Dependent Variable: CRIME
Now, let's run the analysis omitting DC by using the filter command to omit "dc" from the
analysis. As we expect, deleting DC made a large change in the coefficient for single. The
coefficient for single dropped from 132.4 to 89.4. After having deleted DC, we would repeat
the process we have illustrated in this section to search for any other outlying and influential
observations.
compute filtvar = (state NE "dc").
filter by filtvar.
regression
/dependent crime
/method=enter pctmetro poverty single.
Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)
Model            B           Std. Error   Beta   t        Sig.
1  (Constant)    -1197.538   180.487             -6.635   .000
   PCTMETRO      7.712       1.109        .565   6.953    .000
   POVERTY       18.283      6.136        .265   2.980    .005
   SINGLE        89.401      17.836       .446   5.012    .000
a Dependent Variable: CRIME
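To see how much each slope moved when DC was dropped, the two sets of unstandardized coefficients can be compared directly. The sketch below copies the B values from the two Coefficients tables; the percentage-change summary is an illustrative choice made here, not part of the SPSS output.

```python
# Unstandardized coefficients (B) copied from the two Coefficients tables.
with_dc = {"pctmetro": 7.829, "poverty": 17.680, "single": 132.408}
without_dc = {"pctmetro": 7.712, "poverty": 18.283, "single": 89.401}

# Percentage change in each coefficient after omitting DC.
pct_change = {
    name: 100 * (without_dc[name] - with_dc[name]) / with_dc[name]
    for name in with_dc
}

for name, change in pct_change.items():
    print(f"{name}: {with_dc[name]} -> {without_dc[name]} ({change:+.1f}%)")
```

The coefficient for single falls by roughly a third, while the other two slopes change by only a few percent, which matches the partial-regression plot's picture of DC pulling mainly on single.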
Summary
In this section, we explored a number of methods of identifying outliers and influential points.
In a typical analysis, you would probably use only some of these methods. Generally speaking,
there are two types of methods for assessing outliers: statistics such as residuals, leverage, and
Cook's D, which assess the overall impact of an observation on the regression results, and
statistics such as DFBETA that assess the specific impact of an observation on the regression
coefficients. In our example, we found out that DC was a point of major concern. We
performed a regression with it and without it and the regression equations were very different.
We can justify removing it from our analysis by reasoning that our model is to predict crime
rate for states not for metropolitan areas.
One of the assumptions of linear regression analysis is that the residuals are normally
distributed. It is important to meet this assumption for the p-values for the t-tests to be valid.
Let's use the elemapi2 data file we saw in Chapter 1 for these analyses. Let's predict academic
performance (api00) from percent receiving free meals (meals), percent of English language
learners (ell), and percent of teachers with emergency credentials (emer). We then use the
/save command to generate residuals.
get file="c:\spssreg\elemapi2.sav".
regression
/dependent api00
/method=enter meals ell emer
/save resid (apires).
Variables Entered/Removed(b)
Model Variables Entered Variables Removed Method
1 EMER, ELL, MEALS(a) . Enter
a All requested variables entered.
b Dependent Variable: API00
[Output: Model Summary(b) table.]
ANOVA(b)
Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   6749782.747      3     2249927.582   672.995   .000(a)
   Residual     1323889.251      396   3343.155
   Total        8073671.997      399
a Predictors: (Constant), EMER, ELL, MEALS
b Dependent Variable: API00
[Output: Coefficients(a) table.]
Casewise Diagnostics(a)
Case Number   Std. Residual   API00
93            3.087           604
[Output: Residuals Statistics(a) table.]
We now use the examine command to look at the normality of these residuals. All of the
results from the examine command suggest that the residuals are normally distributed -- the
skewness and kurtosis are near 0, the "tests of normality" are not significant, the histogram
looks normal, and the Q-Q plot looks normal. Based on these results, the residuals from this
regression appear to conform to the assumption of being normally distributed.
examine
variables=apires
/plot boxplot stemleaf histogram npplot.
Tests of Normality
          Kolmogorov-Smirnov(a)         Shapiro-Wilk
          Statistic   df    Sig.        Statistic   df    Sig.
APIRES    .033        400   .200(*)     .996        400   .510
* This is a lower bound of the true significance.
a Lilliefors Significance Correction
[Figure: Unstandardized Residual Stem-and-Leaf Plot]
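The skewness and kurtosis mentioned above can be approximated by hand from a list of residuals. The sketch below uses the simple moment-based definitions; SPSS applies small-sample corrections, so its figures will differ slightly. The function name and the residual values are hypothetical, not taken from the elemapi2 output.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both near 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis


# Hypothetical residuals, for illustration only.
residuals = [-62.0, -35.5, -12.0, -4.5, 0.0, 3.5, 10.0, 28.5, 72.0]
skew, kurt = skewness_kurtosis(residuals)
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```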
Heteroscedasticity
Another assumption of ordinary least squares regression is that the variance of the residuals is
homogeneous across levels of the predicted values, also known as homoscedasticity. If the
model is well-fitted, there should be no pattern to the residuals plotted against the fitted values.
If the variance of the residuals is non-constant, then the residual variance is said to be
"heteroscedastic." Below we illustrate graphical methods for detecting heteroscedasticity. A
commonly used graphical method is to use the residual versus fitted plot to show the residuals
versus fitted (predicted) values. Below we use the /scatterplot subcommand to plot *zresid
(standardized residuals) by *pred (the predicted values). We see that the pattern of the data
points is getting a little narrower towards the right end, an indication of mild
heteroscedasticity.
regression
/dependent api00
/method=enter meals ell emer
/scatterplot (*zresid *pred).
Let's run a model where we include just enroll as a predictor and show the residual vs.
predicted plot. As you can see, this plot shows serious heteroscedasticity. The variability of the
residuals when the predicted value is around 700 is much larger than when the predicted value
is 600 or when the predicted value is 500.
regression
/dependent api00
/method=enter enroll
/scatterplot (*zresid *pred).
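A residual plot is judged by eye, but a crude numeric companion is to compare the spread of the residuals in the lower and upper halves of the fitted values. The sketch below is an informal check, not a formal test of heteroscedasticity; the function name `spread_ratio` and the data are hypothetical.

```python
def spread_ratio(fitted, residuals):
    """Mean squared residual in the upper half of fitted values
    divided by that in the lower half (about 1 if the spread is constant)."""
    pairs = sorted(zip(fitted, residuals))  # order cases by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]

    def mean_square(rs):
        return sum(r * r for r in rs) / len(rs)

    return mean_square(high) / mean_square(low)


# Hypothetical fitted values and residuals: spread grows with the fitted value.
fitted = [500, 520, 540, 560, 680, 700, 720, 740]
residuals = [5, -4, 6, -5, 40, -35, 45, -50]
print(round(spread_ratio(fitted, residuals), 1))
```

A ratio far from 1 in either direction suggests that the residual variance changes across the fitted values and that the plot deserves a closer look.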
As we saw in Chapter 1, the variable enroll was skewed considerably to the right, and we
found that by taking a log transformation, the transformed variable was more normally
distributed. Below we transform enroll, run the regression and show the residual versus fitted
plot. The distribution of the residuals is much improved. Certainly, this is not a perfect
distribution of residuals, but it is much better than the distribution with the untransformed
variable.
compute lenroll = ln(enroll).
regression
/dependent api00
/method=enter lenroll
/scatterplot (*zresid *pred).
Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       LENROLL(a)          .                   Enter
a. All requested variables entered.
b. Dependent Variable: API00
[Output: Model Summary(b) table.]
ANOVA(b)
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   609460.408       1     609460.408    32.497   .000(a)
   Residual     7464211.589      398   18754.300
   Total        8073671.997      399
a Predictors: (Constant), LENROLL
b Dependent Variable: API00
Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)
Model           B          Std. Error   Beta    t        Sig.
1  (Constant)   1170.429   91.966               12.727   .000
   LENROLL      -86.000    15.086       -.275   -5.701   .000
a Dependent Variable: API00
Residuals Statistics(a)
                       Minimum   Maximum   Mean   Std. Deviation   N
Std. Predicted Value   -2.816    2.666     .000   1.000            400
Finally, let's revisit the model we used at the start of this section, predicting api00 from meals,
ell and emer. Using this model, the distribution of the residuals looked very nice and even
across the fitted values. What if we add enroll to this model? Will this automatically ruin the
distribution of the residuals? Let's add it and see.
regression
/dependent api00
/method=enter meals ell emer enroll
/scatterplot (*zresid *pred).
Variables Entered/Removed(b)
Model   Variables Entered               Variables Removed   Method
1       ENROLL, MEALS, EMER, ELL(a)     .                   Enter
a All requested variables entered.
b Dependent Variable: API00
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .915(a)   .838       .836                57.552
a Predictors: (Constant), ENROLL, MEALS, EMER, ELL
b Dependent Variable: API00
ANOVA(b)
Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   6765344.050      4     1691336.012   510.635   .000(a)
   Residual     1308327.948      395   3312.223
   Total        8073671.997      399
a Predictors: (Constant), ENROLL, MEALS, EMER, ELL
b Dependent Variable: API00
Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)
Model           B            Std. Error   Beta    t         Sig.
1  (Constant)   899.147      8.472                106.128   .000
   MEALS        -3.222       .152         -.723   -21.223   .000
   ELL          -.768        .195         -.134   -3.934    .000
   EMER         -1.418       .300         -.117   -4.721    .000
   ENROLL       -3.126E-02   .014         -.050   -2.168    .031
a Dependent Variable: API00
Casewise Diagnostics(a)
Case Number   Std. Residual   API00
93            3.004           604
Residuals Statistics(a)
                       Minimum   Maximum   Mean   Std. Deviation   N
Std. Predicted Value   -1.665    1.847     .000   1.000            400
As you can see, the distribution of the residuals looks fine, even after we added the variable
enroll. When we had just the variable enroll in the model, we did a log transformation to
improve the distribution of the residuals, but when enroll was part of a model with other
variables, the residuals looked good so no transformation was needed. This illustrates how the
distribution of the residuals, not the distribution of the predictor, was the guiding factor in
determining whether a transformation was needed.
7.3.3.3. Collinearity
When there is a perfect linear relationship among the predictors, the estimates for a regression
model cannot be uniquely computed. The term collinearity implies that two variables are near
perfect linear combinations of one another. When more than two variables are involved it is
often called multicollinearity, although the two terms are often used interchangeably.
The primary concern is that as the degree of multicollinearity increases, the regression model
estimates of the coefficients become unstable and the standard errors for the coefficients can
get wildly inflated. In this section, we will explore some SPSS commands that help to detect
multicollinearity.
We can use the /statistics=defaults tol to request the display of "tolerance" and "VIF" values
for each predictor as a check for multicollinearity. The "tolerance" is an indication of the
percent of variance in the predictor that cannot be accounted for by the other predictors, hence
very small values indicate that a predictor is redundant, and values that are less than .10 may
merit further investigation.
The VIF, which stands for variance inflation factor, is (1 / tolerance) and as a rule of thumb, a
variable whose VIF value is greater than 10 may merit further investigation. Let's first look at
the regression we did from the last section, the regression model predicting api00 from meals,
ell and emer using the /statistics=defaults tol subcommand. As you can see, the "tolerance"
and "VIF" values are all quite acceptable.
regression
/statistics=defaults tol
/dependent api00
/method=enter meals ell emer.
<some output deleted to save space>
Coefficients(a)
Now let's consider another example where the "tolerance" and "VIF" values are more
worrisome. In the regression analysis below, we use acs_k3, avg_ed, grad_sch, col_grad and
some_col as predictors of api00. As you see, the "tolerance" values for avg_ed, grad_sch and
col_grad are below .10, and that for avg_ed is about 0.02, indicating that only about 2% of the
variance in avg_ed is not predictable given the other predictors in the model. All of these
variables measure education of the parents, and the very low "tolerance" values indicate that
these variables contain redundant information.
For example, after you know grad_sch and col_grad, you probably can predict avg_ed very
well. In this example, multicollinearity arises because we have put in too many variables that
measure the same thing, parent education.
We also include the collin keyword, which produces the "Collinearity Diagnostics" table below.
The very low eigenvalue for the 5th dimension (since there are 5 predictors) is another
indication of problems with multicollinearity. Likewise, the very high "Condition Index" for
dimension 5 similarly indicates problems with multicollinearity with these predictors.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 avg_ed grad_sch col_grad some_col.
Remark: <some output deleted to save space>
Coefficients(a)
Collinearity Diagnostics(a)
                                                  Variance Proportions
Model  Dimension  Eigenvalue  Condition Index  (Constant)  ACS_K3  AVG_ED  GRAD_SCH  COL_GRAD  SOME_COL
1      1          5.013       1.000            .00         .00     .00     .00       .00       .00
       2          .589        2.918            .00         .00     .00     .05       .00       .01
       3          .253        4.455            .00         .00     .00     .03       .07       .02
       4          .142        5.940            .00         .01     .00     .00       .00       .23
       5          .0028       42.036           .22         .86     .14     .10       .15       .09
       6          .00115      65.887           .77         .13     .86     .81       .77       .66
a Dependent Variable: API00
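The Condition Index column follows from the eigenvalues: each index is the square root of the largest eigenvalue divided by that dimension's eigenvalue. The sketch below recomputes the column under that definition. It takes the sixth eigenvalue as 1.15E-03, which is the value consistent with the reported index of 65.887 and with the eigenvalues summing to 6; small differences from the table come from rounding in the printed eigenvalues.

```python
import math

# Eigenvalues for dimensions 1-6 (sixth taken as 1.15E-03; see the note above).
eigenvalues = [5.013, 0.589, 0.253, 0.142, 0.0028, 0.00115]

largest = max(eigenvalues)
condition_index = [math.sqrt(largest / ev) for ev in eigenvalues]

for dim, (ev, ci) in enumerate(zip(eigenvalues, condition_index), start=1):
    print(f"dimension {dim}: eigenvalue = {ev}, condition index = {ci:.3f}")
```

Very small eigenvalues translate into very large condition indices, which is why the last dimensions stand out as signs of multicollinearity.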
Let's omit one of the parent education variables, avg_ed. Note that the VIF values in the analysis below
appear much better. Also, note how the standard errors are reduced for the parent education variables,
grad_sch and col_grad. This is because the high degree of collinearity caused the standard errors to be
inflated. With the multicollinearity eliminated, the coefficient for grad_sch, which had been non-
significant, is now significant.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 grad_sch col_grad some_col.
Remark: <some output omitted to save space>
Coefficients(a)
                 Unstandardized         Standardized                     Collinearity
                 Coefficients           Coefficients                     Statistics
Model            B          Std. Error  Beta           t        Sig.     Tolerance  VIF
1  (Constant)    283.745    70.325                     4.035    .000
   ACS_K3         11.713     3.665      .113           3.196    .002     .977       1.024
   GRAD_SCH        5.635      .458      .482          12.298    .000     .792       1.262
   COL_GRAD        2.480      .340      .288           7.303    .000     .783       1.278
   SOME_COL        2.158      .444      .173           4.862    .000     .967       1.034
a Dependent Variable: API00
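The Tolerance and VIF columns in the table above carry the same information: VIF is the reciprocal of tolerance, and tolerance is 1 minus the R Square obtained when that predictor is regressed on the other predictors. The following is a minimal Python sketch, outside SPSS, that recomputes VIF from the printed tolerances. The dictionary and function names are our own and are only illustrative.

```python
# Tolerances copied from the Coefficients table above (model without avg_ed).
tolerance = {
    "acs_k3": 0.977,
    "grad_sch": 0.792,
    "col_grad": 0.783,
    "some_col": 0.967,
}


def vif_from_tolerance(tol):
    """VIF is the reciprocal of tolerance, where tolerance is 1 - R^2 of the
    predictor regressed on the remaining predictors."""
    return 1.0 / tol


vif = {name: vif_from_tolerance(t) for name, t in tolerance.items()}

for name, value in vif.items():
    print(f"{name:10s} tolerance={tolerance[name]:.3f}  VIF={value:.3f}")
```

The recomputed values agree with the printed VIF column up to rounding, because the tolerances are shown to only three decimals.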
Collinearity Diagnostics(a)
                                               Variance Proportions
Model  Dimension  Eigenvalue  Condition Index  (Constant)  ACS_K3  GRAD_SCH  COL_GRAD  SOME_COL
1      1          3.970        1.000           .00         .00     .02       .02       .01
       2           .599        2.575           .00         .00     .60       .03       .04
       3           .255        3.945           .00         .00     .37       .94       .03
       4           .174        4.778           .00         .00     .00       .00       .92
       5           .00249     39.925           .99         .99     .01       .01       .00
a Dependent Variable: API00
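The "Condition Index" column is derived from the eigenvalues: for each dimension it is the square root of the largest eigenvalue divided by that dimension's eigenvalue. The Python sketch below, which is not part of the SPSS run, recomputes the indices for the model without avg_ed. One assumption is labeled in the code: the fifth eigenvalue is taken as 0.00249, the value implied by the printed condition index of 39.925.

```python
import math

# Eigenvalues from the Collinearity Diagnostics table for the model without avg_ed.
# Assumption: the fifth eigenvalue is 0.00249, the value implied by the printed
# condition index of 39.925.
eigenvalues = [3.970, 0.599, 0.255, 0.174, 0.00249]


def condition_indices(eigs):
    """Condition index of each dimension: sqrt(largest eigenvalue / eigenvalue)."""
    largest = max(eigs)
    return [math.sqrt(largest / e) for e in eigs]


indices = condition_indices(eigenvalues)

for dim, (eig, ci) in enumerate(zip(eigenvalues, indices), start=1):
    print(f"dimension {dim}: eigenvalue={eig:<8g} condition index={ci:.3f}")
```

Small differences from the printed column come from the rounding of the eigenvalues in the output.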
When we do linear regression, we assume that the relationship between the response variable
and the predictors is linear. If this assumption is violated, the linear regression will try to fit a
straight line to data that do not follow a straight line. Checking the linearity assumption in the
case of simple regression is straightforward, since we only have one predictor. All we have to
do is draw a scatter plot of the response variable against the predictor to see if nonlinearity is
present, such as a curved band or a big wave-shaped curve. For example, let us use a data file
called nations.sav that has data about a number of nations around the world. Let's look at the
relationship between GNP per capita (gnpcap) and births (birth).
If we look at the scatter plot between gnpcap and birth below, we can see that the
relationship between these two variables is quite non-linear. We added a regression line to the
chart by double clicking on it and choosing "Chart" then "Options" and then "Fit Line Total",
and you can see how poorly the line fits these data. Also, if we look at the plot of residuals by
predicted values, we see that the residuals are not homoscedastic, due to the non-linearity in the
relationship between gnpcap and birth.
Variables Entered/Removed(b)
Model  Variables Entered  Variables Removed  Method
1      GNPCAP(a)          .                  Enter
a All requested variables entered.
b Dependent Variable: BIRTH
Model Summary(b)
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .626(a) .392 .387 10.679
a Predictors: (Constant), GNPCAP
b Dependent Variable: BIRTH
ANOVA(b)
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression    7873.995          1   7873.995      69.047   .000(a)
   Residual     12202.152        107    114.039
   Total        20076.147        108
a Predictors: (Constant), GNPCAP
b Dependent Variable: BIRTH
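The figures in the Model Summary and ANOVA tables above are linked by simple arithmetic, which can be useful when checking output that has been retyped. The Python sketch below, outside SPSS, recomputes F, R Square, Adjusted R Square, and the standard error of the estimate from the sums of squares and degrees of freedom. The variable names are our own.

```python
import math

# Values copied from the ANOVA table above (birth regressed on gnpcap).
ss_regression = 7873.995
ss_residual = 12202.152
df_regression = 1
df_residual = 107

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * df_total / df_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"F                  = {f_statistic:.3f}")
print(f"R Square           = {r_square:.3f}")
print(f"Adjusted R Square  = {adj_r_square:.3f}")
print(f"Std. Error of Est. = {std_error_estimate:.3f}")
```

The results match the printed values (69.047, .392, .387, and 10.679) to the precision shown.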
[Coefficients(a) and Residuals Statistics(a) tables for the regression of birth on gnpcap: values not reproduced]
We modified the above scatter plot, changing the fit line from linear regression to
"lowess" by choosing "Chart" then "Options" then choosing "Fit Options" and choosing
"Lowess" with the default smoothing parameters. As you can see, the "lowess" smoothed curve
fits substantially better than the linear regression, further suggesting that the relationship
between gnpcap and birth is not linear.
We can see that the gnpcap scores are quite skewed, with most values being near 0 and a
handful of values of 10,000 and higher. This suggests to us that some transformation of the
variable may be necessary. One commonly used transformation is a log transformation, so let's
try that. As you see, the scatter plot between the logged variable (lgnpcap) and birth looks much
better, with the regression line going through the heart of the data. Also, the plot of the
residuals by predicted values looks much more reasonable.
compute lgnpcap = ln(gnpcap).
regression
/dependent birth
/method=enter lgnpcap
/scatterplot(*zresid *pred) /scat(birth lgnpcap)
/save resid(bres2).
Variables Entered/Removed(b)
Model Variables Entered Variables Removed Method
1 LGNPCAP(a) . Enter
a All requested variables entered.
b Dependent Variable: BIRTH
[Model Summary(b), ANOVA(b), Coefficients(a), and Residuals Statistics(a) tables for the regression of birth on lgnpcap: values not reproduced]
This section has shown how you can use scatter plots to diagnose problems of non-linearity,
both by looking at the scatter plots of the predictor and outcome variable, and by
examining the residuals by predicted values. These examples have focused on simple
regression, but similar techniques are useful in multiple regression. In multiple regression,
however, it is more useful to examine partial regression plots than the simple scatter plots
between each predictor variable and the outcome variable.
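The effect of the log transformation used in this section can be sketched outside SPSS. The Python example below uses hypothetical data, not the nations.sav values: the outcome is constructed to be exactly linear in the natural log of a skewed predictor, so the straight-line fit is poor on the raw scale and exact on the log scale. The helper simple_ols and all numbers are our own illustration.

```python
import math


def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical data (NOT the nations.sav values): a skewed predictor whose
# relationship with the outcome is linear only on the log scale.
gnpcap = [100, 300, 1000, 3000, 10000, 30000]
birth = [60 - 5 * math.log(g) for g in gnpcap]

# Same idea as the SPSS command: compute lgnpcap = ln(gnpcap).
lgnpcap = [math.log(g) for g in gnpcap]

raw_fit = simple_ols(gnpcap, birth)
log_fit = simple_ols(lgnpcap, birth)

print(f"raw predictor:    R Square = {raw_fit[2]:.3f}")
print(f"logged predictor: R Square = {log_fit[2]:.3f}, slope = {log_fit[1]:.3f}")
```

With real data the improvement will not be this clean, so the residual plots described above remain the main check.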
A model specification error can occur when one or more relevant variables are omitted from
the model or one or more irrelevant variables are included in the model. If relevant variables
are omitted from the model, the common variance they share with included variables may be
wrongly attributed to those variables, and the error term can be inflated. On the other hand, if
irrelevant variables are included in the model, the common variance they share with included
variables may be wrongly attributed to them. Model specification errors can substantially affect
the estimate of regression coefficients.
Consider the model below.
regression
/dependent api00
/method=enter acs_k3 full
/save pred(apipred).
[Coefficients(a) table for this model: values not reproduced]
This regression suggests that as class size increases the academic performance increases, with
p=0.053. Before we publish results saying that increased class size is associated with higher
academic performance, let's check the model specification.
SPSS does not have any tools that directly support the finding of specification errors; however,
you can check for omitted variables by using the procedure below. As you notice above, when
we ran the regression we saved the predicted value, calling it apipred. If we use the predicted
value and the predicted value squared as predictors of the dependent variable, apipred should
be significant since it is the predicted value, but apipred squared shouldn't be a significant
predictor because, if our model is specified correctly, the squared predictions should not have
much explanatory power above and beyond the predicted value.
That is, we wouldn't expect apipred squared to be a significant predictor if our model is
specified correctly. Below we compute apipred2 as the squared value of apipred and then
include apipred and apipred2 as predictors in our regression model, and we hope to find that
apipred2 is not significant.
compute apipred2 = apipred**2.
regression
/dependent api00
/method=enter apipred apipred2.
Coefficients(a)
                 Unstandardized          Standardized
                 Coefficients            Coefficients
Model            B           Std. Error  Beta           t        Sig.
1  (Constant)    858.873     283.460                    3.030    .003
   APIPRED       -1.869      .937        -1.088         -1.994   .047
   APIPRED2      2.344E-03   .001        1.674          3.070    .002
a Dependent Variable: API00
The above results show that apipred2 is significant, suggesting that we may have omitted
important variables in our regression. We therefore should consider whether we should add any
other variables to our model. Let's try adding the variable meals to the above model. We see
that meals is a significant predictor, and we save the predicted value calling it preda for
inclusion in the next analysis for testing to see whether we have any additional important
omitted variables.
regression
/dependent api00
/method=enter acs_k3 full meals
/save pred(preda).
Remark: <some output omitted to save space>
Coefficients(a)
                 Unstandardized          Standardized
                 Coefficients            Coefficients
Model            B           Std. Error  Beta           t         Sig.
1  (Constant)    771.658     48.861                     15.793    .000
   ACS_K3        -.717       2.239       -.007          -.320     .749
   FULL          1.327       .239        .139           5.556     .000
   MEALS         -3.686      .112        -.828          -32.978   .000
a Dependent Variable: API00
We now create preda2 which is the square of preda, and include both of these as predictors in
our model.
compute preda2 = preda**2.
regression
/dependent api00
/method=enter preda preda2.
Remark: <some output omitted to save space>
Coefficients(a)
                 Unstandardized          Standardized
                 Coefficients            Coefficients
Model            B            Std. Error  Beta          t        Sig.
1  (Constant)    -136.510     95.059                    -1.436   .152
   PREDA         1.424        .293        1.293         4.869    .000
   PREDA2        -3.172E-04   .000        -.386         -1.455   .146
a Dependent Variable: API00
We now see that preda2 is not significant, so this test does not suggest there are any other
important omitted variables. Note that after including meals and full, the coefficient for class
size is no longer significant.
acs_k3 does have a positive relationship with api00 when only full is included in the
model, but when we also include (and hence control for) meals, acs_k3 is no longer
significantly related to api00 and its relationship with api00 is no longer positive.
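The squared-prediction check used in this section can be sketched outside SPSS. The Python example below uses small hypothetical data sets, not elemapi2: it fits a simple regression, saves the predicted values, and then regresses the outcome on the prediction and its square. It reports only the coefficient on the squared prediction; the t test and Sig. value that SPSS prints are not computed here. All function names are our own.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def squared_prediction_coefficient(x, y):
    """Fit y on x, then refit y on the prediction and its square.
    Returns the coefficient on the squared prediction."""
    intercept, slope = ols([x], y)
    pred = [intercept + slope * xi for xi in x]
    pred2 = [p ** 2 for p in pred]
    return ols([pred, pred2], y)[2]


# Hypothetical data: one outcome that is linear in x and one that is curved.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y_linear = [1.0 + 2.0 * xi for xi in x]
y_curved = [xi ** 2 for xi in x]

coef_linear = squared_prediction_coefficient(x, y_linear)
coef_curved = squared_prediction_coefficient(x, y_curved)

print(f"correctly specified model: squared-prediction coefficient = {coef_linear:.4f}")
print(f"misspecified model:        squared-prediction coefficient = {coef_curved:.4f}")
```

When the straight-line model is correct, the squared prediction adds nothing and its coefficient is essentially zero; when curvature has been left out, the squared prediction picks it up.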
The statement of this assumption is that the errors associated with one observation are not
correlated with the errors of any other observation. Violation of this assumption can occur in a
variety of situations. Consider the case of collecting data from students in eight different
elementary schools. It is likely that the students within each school will tend to be more like
one another than students from different schools; that is, their errors are not independent.
Another way in which the assumption of independence can be broken is when data are
collected on the same variables over time. Let's say that we collect truancy data every semester
for 12 years. In this situation it is likely that the errors for observations between adjacent
semesters will be more highly correlated than for observations more separated in time -- this is
known as autocorrelation. When you have data that can be considered to be time-series you can
use the Durbin-Watson statistic to test for correlated residuals.
We don't have any time-series data, so we will use the elemapi2 dataset and pretend that snum
indicates the time at which the data were collected. We will sort the data on snum to order the
data according to our fake time variable and then we can run the regression analysis with the
durbin option to request the Durbin-Watson test. The Durbin-Watson statistic has a range
from 0 to 4 with a midpoint of 2. The observed value in our example is less than 2, which is not
surprising since our data are not truly time-series.
sort cases by snum.
regression
/dependent api00
/method=enter enroll
/residuals = durbin.
[Model Summary table with the Durbin-Watson statistic: values not reproduced]
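The Durbin-Watson statistic requested above is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below, with hypothetical residuals rather than the elemapi2 results, shows why values below 2 point to positive autocorrelation and values above 2 to negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, already sorted in time order.
persistent = [1.0, 1.0, -1.0, -1.0]    # neighbours tend to share a sign
alternating = [1.0, -1.0, 1.0, -1.0]   # neighbours tend to flip sign

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)

print(f"persistent residuals:  DW = {dw_persistent:.1f}")   # 1.0
print(f"alternating residuals: DW = {dw_alternating:.1f}")  # 3.0
```

The residuals must be in time order before the statistic is computed, which is why the SPSS example sorts the cases first.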
Assumptions. Logistic regression does not rely on distributional assumptions in the same sense
that discriminant analysis does. However, your solution may be more stable if your predictors
have a multivariate normal distribution. Additionally, as with other forms of regression,
multicollinearity among the predictors can lead to biased estimates and inflated standard errors.
The procedure is most effective when group membership is a truly categorical variable; if group
membership is based on values of a continuous variable (for example, "high IQ" versus "low IQ"),
you should consider using linear regression to take advantage of the richer information offered by
the continuous variable itself.
Procedure
a. File > Open > Data > bankloan.sav
b. Analyze > Regression > Binary Logistic
c. As shown below, select previously defaulted as the dependent variable and the rest as predictor
variables
d. Select Hosmer-Lemeshow (H-L) goodness-of-fit and confidence interval from Options
a. Variable(s) entered on step 1: age, ed, employ, address, income, debtinc, creddebt, othdebt, preddef1,
preddef2, preddef3.
Minitab does not explicitly produce partial regression plots. Fortunately, they can be created
easily (if tediously, for large models):
Name                     Also called                     Definition
Residuals                Residuals
Standardized residuals   Studentized residuals           residual divided by its standard error
Deleted t residuals      Studentized deleted residuals   deleted residual divided by its standard error
In the Graphs… window in the regression procedure, these three kinds of residuals are called
Regular, Standardized, and Deleted, respectively. The standardized residuals are what Minitab
uses to flag unusually large residuals (any observations with standardized residual greater than
2 in absolute value).
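The flagging rule described above can be sketched in Python for the simple regression case. The data are hypothetical. The leverage formula used, h = 1/n + (x - mean)^2 / Sxx, is the standard result for one predictor and is our assumption about the setting, not something taken from Minitab output.

```python
import math


def standardized_residuals(x, y):
    """Standardized residuals for the simple regression of y on x: each
    residual divided by its standard error s * sqrt(1 - h)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Leverage of each observation (simple-regression formula).
    leverages = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    standardized = [
        e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)
    ]
    return standardized, leverages


# Hypothetical data: y = 2x except for the sixth observation.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 30, 14, 16, 18, 20]

standardized, leverages = standardized_residuals(x, y)

# Flag observations whose standardized residual exceeds 2 in absolute value.
flagged = [i for i, r in enumerate(standardized) if abs(r) > 2]

for i in flagged:
    print(f"observation {i + 1}: standardized residual = {standardized[i]:.3f}")
```

Only the sixth observation is flagged here. In a simple regression the leverages sum to 2, the number of estimated coefficients, which gives a quick check on the calculation.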
Leverage and Influence
The Storage… window of the regression procedure provides three measures of leverage and
influence:
DFBETAS
Minitab does not explicitly produce DFBETAS statistics of influence on particular coefficients.
It can be calculated for a particular suspect observation i (perhaps flagged by the preceding
measures), and coefficient k, as follows:
Summary
This chapter has covered a variety of topics in assessing the assumptions of regression using
SPSS, and the consequences of violating these assumptions. As we have seen, it is not
sufficient to simply run a regression analysis, but it is important to verify that the assumptions
have been met. If this verification stage is omitted and your data does not meet the assumptions
of linear regression, your results could be misleading and your interpretation of your results
could be in doubt. Without thoroughly checking your data for problems, it is possible that
another researcher could analyze your data, uncover such problems, and question your
results, showing an improved analysis that may contradict your results and undermine your
conclusions.