
STATISTICAL COMPUTING I
STA 222 Lectures

3 Units
Bello O.A.( DOE, Computational Statistics & Applied Nonlinear Method)
Department of Statistics
Federal University of Technology, Minna
Nigeria
oyedele.bello@futminna.edu.ng
(Not For Sale!)
June 9, 2016
Objectives: This course assumes that students can write programmes in the FORTRAN
language. Students will be expected to prepare flow charts, write programmes to compute
descriptive statistics such as the mean, variance and correlation coefficient, estimate a simple
regression line, and run these programmes with data.
Contents
1 Introduction
1.1 What is Statistical Programming?
2 Advantages Of ‘Computer-Intensive Statistics’
3 Data Visualization
3.0.1 Data Mining
3.0.2 Exploratory Data Analysis (EDA)
4 Review of Basic Prerequisite Knowledge for Sta222
4.1 Location, Variability Parameters Estimation and the Formulas
4.2 Regression Coefficients and the Measures of Association
4.2.1 Class Exercise*
5 Micro-Soft Excel
5.1 Introduction to Micro-soft Excel Window Environment
5.1.1 Understanding Spreadsheets
5.1.2 Cell references
5.1.3 Typing data in multiple cells
5.1.4 Installing the Analysis Toolpak
5.1.5 Functions and Computation Syntax
5.1.6 Adding and deleting rows and columns
5.1.7 Calculating a Sum and an Average
5.1.8 Calculating a Mean, Median, Mode and a Standard Deviation
5.1.9 Linear Regression with Ms. Excel
5.1.10 Correlation, and CORREL.
6 Introduction to R Language
6.1 Basics
6.2 Brief History and Description of R
6.3 How to start
7 Programming with R
7.1 Writing functions in R
8 Statistical Modelling with R
8.1 Simple Regression Models
8.2 Regression with R
9 R and Others
1 Introduction
1.1 What is Statistical Programming?
The rapid growth of statistics has benefited from the development of more powerful electronic
machines and computers. Computer programming involves controlling computers, telling them
what calculations to do, what to display, etc. Statistical programming might be described as the
kind of computer programming statisticians do, but statisticians do all sorts of programming,
and statistics involves a wide variety of computing tasks. For example, statisticians are
concerned with collecting and analyzing data, and some statisticians would be involved in setting
up connections between computers and laboratory instruments, but we would not call that
statistical programming. Statisticians often oversee data entry from questionnaires, and may
set up programs to aid in detecting data entry errors.
Statistical programming involves doing computations to aid in statistical analysis. For
example, data must be summarized and displayed. Models must be fit to data, and the results
displayed. These tasks can be done in a number of different computer applications such as
Microsoft Excel, SAS, SPSS, S-PLUS, R, Stata, etc.[1] In this course we shall be introduced to
computing in Excel and R, and hopefully Minitab. Using any of these applications
is certainly statistical computing, and usually involves statistical programming. Since graphs
play an important role in statistical analysis, drawing graphics of one, two, or higher
dimensional data is also an aspect of statistical programming.
We shall also consider Exploratory Data Analysis procedures in statistical computing.
2 Advantages Of ‘Computer-Intensive Statistics’

“Computer-intensive” statistics is statistical methodology that makes use of a large amount
of computer time. Examples include the bootstrap re-sampling method, Markov Chain Monte
Carlo (MCMC) simulations, smoothing, image analysis and many uses of the ‘EM algorithm’.
The benefits of computing in statistics are:
• Working with (much) larger datasets is made easy.
• Using more realistic models and better ways to fit models.
• Exploring a (much) larger class of models.
• Attempting a more realistic analysis of existing simple models.
• Better visualization of data or fitted models or their combination.
Statistical software has revolutionized the way we approach data analysis, replacing
the calculators used by earlier generations. Software has often been an access barrier to many
statistical methods.
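As a small illustration of what “computer-intensive” means in practice, the sketch below applies the bootstrap re-sampling method to estimate the standard error of a sample mean. It is only a sketch under stated assumptions: the data are the eight weekly fuel consumption values from the class exercise in Section 4.2.1, and the number of re-samples `B`, the seed and the variable names are illustrative choices, not values prescribed by the course.

```r
# Bootstrap estimate of the standard error of the mean (illustrative sketch).
set.seed(222)                      # arbitrary seed, only for reproducibility

# Weekly fuel consumption values from the class exercise in Section 4.2.1
y <- c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5)

B <- 2000                          # number of bootstrap re-samples (assumed)

# Draw B samples of size n with replacement and record the mean of each
boot_means <- replicate(B, mean(sample(y, replace = TRUE)))

se_boot    <- sd(boot_means)             # bootstrap standard error
se_formula <- sd(y) / sqrt(length(y))    # textbook formula, for comparison

print(se_boot)
print(se_formula)
```

The two standard errors should be of similar size. The point of the exercise is that the bootstrap obtains its answer by repeating a simple calculation thousands of times, which is only practical on a computer.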
[1] Computer programming involves controlling computers, telling them what calculations to do
and what to display, to aid in statistical analysis.
(C) Bello O.A.
3 Data Visualization
The first step in statistical computing is to look at your data and ask researchable questions
about it.
“Statistics starts with a problem, continues with the collection of data, proceeds
with the data analysis and finishes with conclusions. It is a common mistake of
inexperienced Statisticians to plunge into a complex analysis without paying attention
to what the objectives are or even whether the data are appropriate for the proposed
analysis. Look before you leap!” (Julian J. Faraway, 2002)
Many different words can be used to describe graphic representations of data, but the overall
aim is always to visualize the information in the data and so the term Data Visualization is
the best universal term. Other terms have different connotations. At a seminar in Munich in
2004 where researchers from all fields interested in visualization met, one person thought the
word ‘plot’ was being used to describe the story in a statistical graphic, not the graphic itself.
Would that every plot had a good plot! Deciding on which graphics to use is often a matter of
taste. What one person thinks are good graphics for illustrating information may not appeal to
someone else.
3.0.1 Data Mining

The aim here is to look at ways of visualizing large datasets, whether large in numbers
of cases, large in numbers of variables, or large in both. You might ask, if a dataset is
so large, why not just take a sample? But a sample will not pick out outliers, local
structures, or systematic errors in the data, and samples will not be large enough for
multivariate categorical analyses. In applications where methods are needed for coping with
larger volumes of data, visualization is valuable for exploring data and for presenting
information in data, whether global summaries or local patterns. Why is visualizing data important?
Data visualization is good for data cleaning and for exploring data. As John Tukey put it (Tukey
and Tukey, 1985): “There is nothing better than a picture for making you think of questions
you had forgotten to ask about the data collected”. Data visualization, data mining and EDA
help in identifying outliers, data-entry errors, skewed or unusual distributions, trends and
clusters, in spotting local patterns, in evaluating modelling output and in presenting results.
Visualization is essential for Exploratory Data Analysis.
Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner. The relationships and summaries derived through a data mining exercise
are often referred to as models or patterns. Examples include linear equations, rules, clusters,
graphs, tree structures, and recurrent patterns in time series. The definition above refers to
“observational data,” as opposed to “experimental data.” Data mining typically deals with data
that have already been collected for some other purpose. This means that the objectives of the
data mining exercise play no role in the data collection strategy. This is one way in which data
mining differs from much of statistics, in which data are often collected by using efficient
strategies to answer specific questions. For this reason, data mining is often referred to as
“secondary” data analysis. The definition also mentions that the data sets examined in data
mining are often large. If only small data sets were involved, we would merely be discussing
classical exploratory data analysis as practised by statisticians.
3.0.2 Exploratory Data Analysis (EDA)

As the name suggests, the goal here is simply to explore the data without any clear idea of what
we are looking for. Typically, EDA techniques are interactive and visual, and there are many
effective graphical display methods for relatively small, low-dimensional data sets, e.g. scatter
plot, density plot, box plot (box-and-whisker plot), 3D plots, contour plot, etc.
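A first exploratory look of this kind can be sketched in R, the language introduced later in these notes. The example uses the temperature and fuel consumption values from the class exercise in Section 4.2.1. The variable names `temp` and `fuel` are illustrative, and the choice of plots is just one reasonable starting point, not a fixed recipe.

```r
# Exploratory look at a small data set (illustrative sketch).
# Values are taken from the class exercise in Section 4.2.1.
temp <- c(28, 28, 32.5, 39, 45.9, 57.8, 58.1, 62.5)   # average temperature
fuel <- c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5) # weekly fuel consumption

summary(fuel)              # minimum, quartiles, mean and maximum
five <- fivenum(fuel)      # Tukey's five-number summary
print(five)

# Scatter plot: is there a visible relationship between the two variables?
plot(temp, fuel, xlab = "Average temperature", ylab = "Fuel consumption")

# Box plot and density plot: shape, spread and possible outliers
boxplot(fuel, main = "Fuel consumption")
plot(density(fuel), main = "Density of fuel consumption")
```

Looking at the scatter plot before fitting anything is exactly the “look before you leap” advice quoted above: a clear downward trend suggests that a straight-line model may be a sensible next step.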
Tutorial Questions
1 Classify the following statistical software appropriately as (i) menu-driven, (ii) command-
driven, or (iii) both: Minitab, Stata, SPSS, R, SAS, S-PLUS, MS Excel, EViews, EasyFit
and Design-Expert.
2a Explain in detail any three (3) data processing methods you know.

2b Differentiate between manual data processing and computerized data processing.
3 Give five advantages of EDP over other methods.
4 What do you understand by “computer-intensive statistics”? Give five benefits of
computer-intensive statistics.
5 Relate and differentiate between data mining and EDA.
6 Give five benefits of data visualization.
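4 Review of Basic Prerequisite Knowledge for Sta222
4.1 Location, Variability Parameters Estimation and the Formulas
The measures of location (mean, median and mode) tell us about the shape of the distribution
our data came from.
Mean:
\[ \bar{x} = \frac{\sum x}{n} \]
Median (grouped data):
\[ \text{Median} = L + \left[ \frac{\frac{N}{2} - F}{f} \right] C \]
where L is the lower class boundary of the median class, N is the total frequency, F is the
cumulative frequency of the class just above (preceding) the median class, f is the frequency of
the median class and C is the size of the median class interval.
Mode (grouped data):
\[ \text{Mode} = L + \left[ \frac{f_m - f_a}{2f_m - f_a - f_b} \right] C \]
where L is the lower class boundary of the modal class, f_m is the frequency of the modal class,
f_a is the frequency of the class above the modal class, f_b is the frequency of the class below
the modal class and C is the size of the modal class interval.
Variance:
\[ \sigma^2 = E[(X - \mu)^2] \]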
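The grouped-data formulas above can be checked with a few lines of R. This is a minimal sketch on a hypothetical frequency table (the class boundaries and frequencies below are made up for illustration and are not course data). The variable names are also illustrative. The grouped mean uses class midpoints, a standard convention that is assumed here rather than stated in the notes, and the frequency of a missing neighbouring class is taken as 0.

```r
# Grouped-data mean, median and mode (illustrative sketch, hypothetical table).
lower <- c(9.5, 19.5, 29.5, 39.5)   # lower class boundaries L (hypothetical)
width <- 10                         # class interval size C (hypothetical)
f     <- c(3, 8, 6, 3)              # class frequencies (hypothetical)
N     <- sum(f)                     # total frequency

# Mean: assumes every observation sits at its class midpoint
mid      <- lower + width / 2
grp_mean <- sum(f * mid) / N

# Median = L + [(N/2 - F) / f] * C
cumf       <- cumsum(f)
k          <- which(cumf >= N / 2)[1]         # index of the median class
F_before   <- if (k > 1) cumf[k - 1] else 0   # cumulative frequency before it
grp_median <- lower[k] + ((N / 2 - F_before) / f[k]) * width

# Mode = L + [(fm - fa) / (2 fm - fa - fb)] * C
m        <- which.max(f)                          # index of the modal class
fa       <- if (m > 1) f[m - 1] else 0            # class above (preceding)
fb       <- if (m < length(f)) f[m + 1] else 0    # class below (following)
grp_mode <- lower[m] + ((f[m] - fa) / (2 * f[m] - fa - fb)) * width

print(c(mean = grp_mean, median = grp_median, mode = grp_mode))
```

For this hypothetical table the median class is 19.5 to 29.5, so the median works out as 19.5 + [(10 − 3)/8] × 10 = 28.25, which is a useful hand check of the code.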
4.2 Regression Coefficients and the Measures of Association

Regression analysis is used for explaining or modeling the relationship between a single variable
Y, called the response, output or dependent variable, and one or more predictor, input,
independent or explanatory variables, X1, ..., Xp. When p = 1 it is called simple regression, but
when p > 1 it is called multiple regression or sometimes multivariate regression. When there is
more than one Y, it is called multivariate multiple regression. Regression analyses have several
possible objectives, including:
• A general description of data structure.
• Assessment of the effect of, or relationship between, explanatory variables on the response.
• Prediction of future observations.
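Regression
Recall
\[ \beta = \frac{n\sum xy - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]
where
\[ \alpha = \bar{y} - \beta\bar{x} \]
or
\[ \beta = \frac{S_{xy}}{S_{xx}}, \quad \text{where } S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \text{ and } S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \]
Correlation
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]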
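The summation formulas for the slope, intercept and correlation coefficient translate almost line by line into R. The sketch below uses a tiny hypothetical data set (five made-up points, not course data) so that every sum can be checked by hand. The variable names are illustrative.

```r
# Slope, intercept and correlation from the summation formulas
# (illustrative sketch on hypothetical data).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# beta = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
beta  <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
alpha <- mean(y) - beta * mean(x)

# Equivalent form: beta = Sxy / Sxx
Sxy      <- sum(x * y) - sum(x) * sum(y) / n
Sxx      <- sum(x^2) - sum(x)^2 / n
beta_alt <- Sxy / Sxx

# Correlation coefficient r
r <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

print(c(beta = beta, alpha = alpha, r = r))
```

Here the sums are Σx = 15, Σy = 20, Σxy = 66 and Σx² = 55, so β = (330 − 300)/(275 − 225) = 0.6 and α = 4 − 0.6 × 3 = 2.2. The built-in `cor(x, y)` should agree with `r`.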
4.2.1 Class Exercise*

In the fuel consumption case study, a natural gas company in a city wants to predict the city’s
natural gas consumption. The weekly fuel consumption depends substantially on the average
temperature (in degrees Fahrenheit) measured in the city during the week.
The following data were collected by the gas company in its study of fuel consumption:

Week   Average temp. x (°F)   Weekly fuel consumption y (MMcf)
1      28                     12.4
2      28                     11.7
3      32.5                   12.4
4      39                     10.8
5      45.9                   9.4
6      57.8                   9.5
7      58.1                   8
8      62.5                   7.5
1. Calculate the least squares point estimates (coefficients).
2. Predict the fuel consumption when the temperature is 40 °F.
Solution
x      y      x²        y²       xy       ŷ = 15.84 − 0.1279x   y − ŷ      (y − ŷ)²
28     12.4   784       153.76   347.2    12.2588               0.1412     0.019937
28     11.7   784       136.89   327.6    12.2588               -0.5588    0.312257
32.5   12.4   1056.25   153.76   403      11.68325              0.71675    0.513731
39     10.8   1521      116.64   421.2    10.8519               -0.0519    0.002694
45.9   9.4    2106.81   88.36    431.46   9.96939               -0.56939   0.324205
57.8   9.5    3340.84   90.25    549.1    8.44738               1.05262    1.108009
58.1   8      3375.61   64       464.8    8.40901               -0.40901   0.167289
62.5   7.5    3906.25   56.25    468.75   7.84625               -0.34625   0.119889
\[ n = 8, \quad \sum x = 351.8, \quad (\sum x)^2 = 123763.2, \quad \sum y = 81.7 \]
\[ \sum x^2 = 16874.76, \quad \sum y^2 = 859.91, \quad \sum xy = 3413.11 \]
\[ SSE = \sum (y - \hat{y})^2 = 2.568011 \]
To calculate the means:
\[ \bar{y} = \frac{\sum y}{n} = \frac{81.7}{8} = 10.2125, \qquad \bar{x} = \frac{\sum x}{n} = \frac{351.8}{8} = 43.975 \]
To calculate the least squares point estimates, recall

\[ \beta = \frac{n\sum xy - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]
where
\[ \alpha = \bar{y} - \beta\bar{x} \]
Then
\[ \beta = \frac{8(3413.11) - (351.8)(81.7)}{8(16874.76) - (351.8)^2} = \frac{-1437.18}{11234.84} = -0.12792 \]
and
\[ \alpha = \bar{y} - \beta\bar{x} = 10.2125 - (-0.12792)(43.975) \approx 15.838 \]
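The hand calculation can be cross-checked in R with the built-in `lm()` function, which fits the same least squares line. This is a sketch for checking purposes only. The variable names `temp`, `fuel`, `fit`, `sse` and `pred40` are illustrative, and the last step shows one way of answering part 2 of the exercise (prediction at 40 °F) with `predict()`.

```r
# Cross-check of the class exercise with lm() (illustrative sketch).
temp <- c(28, 28, 32.5, 39, 45.9, 57.8, 58.1, 62.5)    # x: average temperature
fuel <- c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5)  # y: fuel consumption

fit <- lm(fuel ~ temp)        # least squares fit of y on x
print(coef(fit))              # intercept (alpha) and slope (beta)

sse <- sum(resid(fit)^2)      # sum of squared residuals, SSE
print(sse)

# Predicted fuel consumption when the temperature is 40 degrees
pred40 <- predict(fit, newdata = data.frame(temp = 40))
print(pred40)
```

The coefficients printed by `coef(fit)` should match α and β from the summation formulas up to rounding. The SSE may differ slightly in the later decimal places from the table value, because the table uses coefficients rounded to 15.84 and 0.1279.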
Assignments/Tutorial Questions (Students are advised to attempt the questions.)
1. Using the information above, calculate the correlation coefficient r.
2. What do you understand by the difference between the observed values and the predicted
values in regression analysis?
3. Give the statistical interpretation of all the results of the analyses above.
4a When is a regression called multivariate regression?
4b What is multivariate multiple regression?
5 Using the information in the Sta 222 Students’ Weight Table below, calculate the standard
deviation, mode and median values.
Weight interval (kg) of 40 Sta222 students    Frequency
48-52 8
53-57 12
58-62 10
63-67 6
68-72 4
1a. Define a flow chart and give four advantages of a flow chart.
1b. Define an algorithm and explain the following:
(i) Start and Stop operator
(ii) Processing operator symbol
(iii) Decision box
(iv) Connector symbols
2a. What problem is the algorithm below solving?

Step 1: Start
Step 2: Read N
Step 3: [Initialize all counters] Set FACT = 1, i = 1
Step 4: Compute FACT = FACT * i; increment i
Step 5: Check if i < N; if true repeat step 4, if false go to step 6
Step 6: Print FACT
2b. Draw a flow chart for the algorithm above.
4.* Write an algorithm with the appropriate flow chart to
a. Find the (i) mean, (ii) mode, (iii) median and (iv) variance of the weight of 40 students
using the information from the Sta 222 Students’ Weight Table in section 4 above. (Tip: recall
the formulas for mean, mode, median and variance.)
b. Also solve the above using the manual method.

c. Find the coefficients of a regression analysis using
\[ \beta = \frac{n\sum xy - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \quad \text{and} \quad \alpha = \bar{y} - \beta\bar{x} \]
5 Micro-Soft Excel
The development of MS Excel is similar to that of Lotus 1-2-3. Being software for data analysis,
it must follow the rules of mathematics. MS Excel makes use of arithmetic and logical operators
in returning the output of an expression involving two or more figures. The arithmetic operators
are ×, ÷, + and −, which are typed as *, /, + and -; the logical (comparison) operators are
<, >, ≤ and ≥, which are typed as <, >, <= and >=. The BODMAS rule also governs the order in
which MS Excel performs its calculations. To return the output of any formula or expression,
MS Excel requires the formula or expression to start with ‘=’. To execute a formula, press the
ENTER key. Excel is both menu-driven and command-driven.
5.1 Introduction to Micro-soft Excel Window Environment

5.1.1 Understanding Spreadsheets
Excel organizes numbers in rows and columns. An entire page of rows and columns is called a
spreadsheet or a worksheet. (A collection of one or more worksheets is stored in a file called a
workbook.) Each row is identified by a number, such as 1 or 249, and each column is identified
by letters, such as A, G, or BF. The intersection of each row and column defines a cell, which
contains one of three items: numbers, text (labels), or formulas.
5.1.2 Cell references

A formula can refer to cells in three ways:
• Single cell references, such as =ROUND(C4,2), which rounds the number found in cell C4 to
two decimal places.
• Contiguous (adjacent) cell ranges, such as =SUM(A4:A9), which adds all the numbers found
in cells A4, A5, A6, A7, A8, and A9.
• Non-contiguous cell ranges, such as =SUM(A4,B7,C11), which adds all the numbers found in
cells A4, B7, and C11.
(See figure 6 below.)
5.1.3 Typing data in multiple cells

After you type data in a cell, you can press one of the following four keystrokes to select a
different cell:
• Enter: Selects the cell below in the same column
• Tab: Selects the cell to the right in the same row
• Shift+Enter: Selects the cell above in the same column
• Shift+Tab: Selects the cell to the left in the same row
If you type data in cell A1 and press Enter, Excel selects the next cell below,which is A2. If you
type data in A2 and press Tab, Excel selects the cell to the right, which is B2.
Figure 1:
5.1.4 Installing the Analysis Toolpak

The data analysis tools in Microsoft Excel are provided as an “add-in” toolbox. This toolbox
contains additional functions enabling a variety of statistical analyses, including descriptive
statistics, regression, correlation, t-tests, the F-test, and analysis of variance (ANOVA). If you
do not see the Data Analysis option at the right side of the Data menu (the Tools menu in
older versions of Excel), you must activate it: click the Menu (Office) button −→ Excel
Options −→ Add-Ins, select Analysis ToolPak and click ‘Go’ −→ check the Analysis ToolPak
option and click OK. This installs the data analysis tools on the menu.[2]
Statistical functions in the Analysis ToolPak: Data Analysis and Statistics Add-ins. Excel
also has an add-in called “Data Analysis” which performs various mathematical tasks. The
available add-ins include:
1 Analysis ToolPak: Adds financial, statistical, and engineering analysis tools and functions.
2 Analysis ToolPak VBA: Allows users to publish financial, statistical, and engineering
analysis tools and functions using Analysis ToolPak syntax.
3 Conditional Sum Wizard: Creates a formula that sums data in a list if the data matches
criteria you specify.
4 Euro Currency Tools: Formats values as euros, and provides the EUROCONVERT
worksheet function to convert currencies.
5 Internet Assistant VBA: Allows developers to publish Excel data to the Web by using
Internet Assistant syntax.
6 Lookup Wizard: Creates a formula to look up data in a list by using another known value
in the list.
7 Solver Add-In: Calculates solutions to what-if scenarios based on adjustable cells and
constraint cells.
[2] See demonstrated Pictures 1-4 on how to install the Analysis ToolPak, and practise in the
computer lab.
5.1.5 Functions and Computation Syntax

Additional Statistical Functions: Other functions can also be accessed from the Σ
button on the toolbar. A partial list of statistical functions, adapted from the Excel help pages,
is given below. (Sta222: you should know at least 10 of them and write their commands.)

Function      Purpose
-------------------------------------------------------------------------
AVERAGE:      Returns the average of its arguments
CORREL:       Returns the correlation coefficient between two data sets
COUNT:        Counts how many numbers are in the list of arguments
DEVSQ:        Returns the sum of squares of deviations
FREQUENCY:    Returns a frequency distribution as a vertical array
INTERCEPT:    Returns the intercept of the linear regression line
KURT:         Returns the kurtosis of a data set
LARGE:        Returns the k-th largest value in a data set
LINEST:       Returns the parameters of a linear trend
MAX:          Returns the maximum value in a list of arguments
MEDIAN:       Returns the median of the given numbers
MIN:          Returns the minimum value in a list of arguments
MODE:         Returns the most common value in a data set
NORMINV:      Returns the inverse of the normal cumulative distribution
PEARSON:      Returns the Pearson product moment correlation coefficient
PERCENTILE:   Returns the k-th percentile of values in a range
QUARTILE:     Returns the quartile of a data set
RANK:         Returns the rank of a number in a list of numbers
RSQ:          Returns the square of the Pearson product moment correlation coefficient (R)
SKEW:         Returns the skewness of a distribution
SLOPE:        Returns the slope of the linear regression line
SMALL:        Returns the k-th smallest value in a data set
STDEV:        Estimates standard deviation based on a sample
VAR:          Estimates variance based on a sample
-------------------------------------------------------------------------
Examples: Calculating a sum. Numbers in an Excel spreadsheet can be added by writing
an equation referring directly to the cells to be added. To add the numbers 1, 2, and 3 in
cells A1:A3, type ‘=A1+A2+A3’ in cell A4 or any other cell of choice and press the Enter key.
5.1.6 Adding and deleting rows and columns

After you type in labels, numbers, and formulas, you may suddenly realize that you need to add
or delete extra rows or columns.
To add a row or column, follow these steps:
1. Click the Home tab.
2. Click the row or column heading where you want to add another row or column.
3. Click the Insert icon in the Cells group.
Inserting a row adds a new row above the selected row. Inserting a column adds a new column
to the left of the selected column.
To delete a row or column, follow these steps:

1. Click the Home tab.
2. Click the row or column heading that you want to delete.
3. Click the Delete icon in the Cells group.
Deleting a row or column deletes any data stored in that row or column.
Examples: (1) 1/3 of 9 is written =(1/3)*9 (2) 2/3 + 4/5 is written =(2/3)+(4/5)
Computation Using Cell References

Functions in Excel are implemented as macro programs that usually require one or more input
values and produce a corresponding output value. To see a list of functions available in Excel,
select the Insert Function menu option, or press the function button on the toolbar and select
More Functions. These actions bring up the ‘Insert Function’ dialog box, from which you can
select a function to use. When a function is selected, the ‘Function Arguments’ dialog box
provides a description of the function inputs (“arguments”) and use. For additional information
on any function, use the Microsoft Excel Help option on the Help menu or press the F1 key.
When using a function in an equation, the function name and its arguments in parentheses are
entered following an equal sign, as shown in the next section. Consider the data in the table
below. In this table y is the purity of oxygen produced in a chemical distillation process, and x
is the percentage of hydrocarbons that are present in the main condenser of the distillation unit.
Hydrocarbon Level x(%) Purity y (%)
0.99 90.01
1.02 89.05
1.15 91.43
1.29 93.74
1.46 96.73
1.36 94.45
0.87 87.59
1.23 91.77
1.55 99.42
1.4 93.65
1.19 93.54
1.15 92.52
0.98 90.56
1.01 89.54
1.11 89.85
1.2 90.39
1.26 93.65
1.32 93.41
1.43 94.98
0.95 87.3
For the purpose of this work, we shall use the variable purity (y) to demonstrate
how to calculate the sum, mean, median, mode, variance, and other statistics using MS Excel.
5.1.7 Calculating a Sum and an Average

Using the Command Prompt on the cell or fx Function
Step i. Type the observed data into the Excel spreadsheet starting from any desired cell. Let us
put ours in cells A1 to A20.
Step ii. Select cell C2 or any other desired cell and type ‘=SUM(A1:A20)’ to calculate the sum
of the values in cells A1 to A20. Note that this command is not case sensitive, and that
‘=SUM(A1,A2)’ adds the values in cells A1 and A2 only, unlike ‘=SUM(A1:A20)’.
Step iii. Press Enter to display the result.
Step iv. Select cell C3 or any other desired cell and type ‘=AVERAGE(A1:A20)’.
Step v. Press Enter to display the result.[3]
Using the Menu Prompt

Step i. As above.
Step ii. Select the cell where you want your result to be displayed and click on ‘Formulas’ on
the menu list.
Step iii. Select ‘Insert Function’ or ‘More Functions’.
Step iv. Select ‘Statistical’ from the category list.
Step v. Click on the desired function, SUM or AVERAGE.
Step vi. Specify in the dialogue box the range your data occupy in the spreadsheet, e.g. A1:A20.
Step vii. Click OK.
[3] Students should follow these practical instructions to find the sum, mean, median, mode,
range, variance and standard deviation of the variable (y) purity (%) from the above table.
Results are to be copied and printed for submission. Class Rep*, take note.
5.1.8 Calculating a Mean, Median, Mode and a Standard Deviation

Method 1. The method above can also be used; the student only needs to select the relevant
function at step v, and all other steps remain the same.
Method 2. This is a broad, one-stop-shop method.

Step i. As above.
Step ii. Select Data from the menu list.
Step iii. Click on ‘Data Analysis’ in the Data menu. Note: the Analysis ToolPak must have
been installed on your MS Excel.
Step iv. Select ‘Descriptive Statistics’.
Step v. Specify your data range in the dialogue box as ‘A1:A20’.
Step vi. Select the output range and the output options you want.
Step vii. Click OK for your results.
5.1.9 Linear Regression with Ms. Excel

Menu Prompt Method
Step i. Type the observed data into the Excel spreadsheet starting from any desired cell. Let us
put ours in cells A1 to A20.
Step ii. Select Data from the menu list.
Step iii. Click on ‘Data Analysis’ in the Data menu. Note: the Analysis ToolPak must have
been installed on your MS Excel.
Step iv. Select ‘Regression’; a dialogue box will pop up.
Step v. Specify the y and x data ranges and select the desired, or all possible, output.
Step vi. Click OK for the results.

Command Prompt Method[4]

(To be demonstrated during lecture using the Oxygen and Hydrocarbon Levels Research Data.)
5.1.10 Correlation, and CORREL.

DESCRIPTION: The Correlation tool in the Analysis ToolPak and the CORREL function
produce the same result: the correlation coefficient (R) for two or more data ranges consisting of
equal numbers of measurements arranged in columns. The correlation coefficient in both cases
is that associated with the fitted straight line with intercept and slope parameters. R ranges in
magnitude from 0 (uncorrelated) to 1 (perfect correlation); the sign of R indicates whether the
two sets of data are positively or negatively correlated, i.e., whether the values in the second
data set become larger or smaller as the values in the first data set become larger.
[4] Compare the menu and command prompt methods after practice.
USAGE:
1. To use the CORREL function, type ‘=CORREL(’, enter the two cell ranges separated by a
comma, close the bracket, and press the Enter key to execute the command.
2. The Correlation tool is accessed from the Analysis ToolPak. Enter the range
of cells encompassing the rows and columns of the two sets of variables, indicate whether the
variables are grouped in rows or columns, select the output range, and click on OK.
Tutorial Questions
1. Write the MS Excel format for the following mathematical expressions:
3
i. (25% of 30)-40% of 25) ii.0.4- 11 × 33 12 ÷ 0.3
5
2. If 2 is stored in cell B4, 4 is stored in cell B5, and 5 is stored in cell C3 (where the sides 2
and 4 are parallel and 5 is the height), write an MS Excel expression for the area of a trapezium
using the cell addresses stated above.
3*. The following data (23, 12, 19, 40, 34, 34, 31, 39, 40, 12, 24, 32, 35, 37, 11, 10, 23, 11, 7, 40)
show the scores of students in a Sta222 test. Let these data be stored in cells A3 to A22
respectively.
Write MS Excel expressions to
i. obtain the sum using arithmetic operations and store the result in cell D3;
ii. obtain the average from (i) and store it in cell D4;
iii. subtract (ii) from each of the data points in A3 to A22 and store the results in cells
B2 to B21.
iv. What do you expect the sum of (iii) to be?
v. Obtain the square of each value in cells B2 to B21, store the results in cells C2 to C21 and
sum the results.
vi. What have you calculated in (ii) and (v)?
4a. Mention two functions MS Excel can perform.

4b. Under which function can each of the sub-functions below be found, and what action does
each of them carry out?
i. AVERAGE ii. Regression iii. Descriptive Statistics iv. MODE v. STDEV vi. VAR
vii. Correlation viii. CORREL
[5] Questions 2 and 3 are to be carried out in the computer lab.
6 Introduction to R Language
6.1 Basics
We will concern ourselves with basic programming: how to tell a computer what to do. We do
this using the open source R statistical package. You start by installing R, which you can get at:
http://www.cran.r-project.org. After installation, click on the R icon and you can start
typing instructions. For your convenience, R is already installed in the department’s computer
lab for your use.
6.2 Brief History and Description of R

R began in the early 1990s as a project of Ross Ihaka and Robert Gentleman, who were both
at the time working at the University of Auckland (New Zealand). The R system implements
a dialect of the influential S language, developed at AT&T Bell Laboratories by Rick Becker,
John Chambers, and Allan Wilks, which is the basis for the commercial S-PLUS system. There
are versions for Unix, Linux, Windows and Mac. R is free, case sensitive and command-driven.
R can open several windows at a time: the R Graphics window, where graphs are displayed;
the R Editor, where you write and run your code; and the R Console, where the results are
displayed. R is maintained by the R Core Team, whose members are widely drawn
internationally. Especially important are the large number of packages that supplement base
R, and that anyone is free to contribute. Primarily, R is designed for scientific computing and
for graphics. Among the packages that have been added are many that are not obviously
statistical: for drawing and coloring maps, for map projections, for plotting data collected by
balloon-borne weather instruments, for creating color palettes, for working with bitmap images,
for solving sudoku puzzles, for creating magic squares, for reading and handling shapefiles, for
solving ordinary differential equations, for processing various types of genomic data, and so on.
Check through the list of R packages that can be found on any of the CRAN sites, and you may
be surprised at what you find![6]
6.3 How to start

When you start R for the first time, it is very good practice to type:
IMPORTANT NOTICES
ls()              # see the current contents of the workspace
rm(list = ls())   # remove any rubbish in case you are starting something new
getwd()           # show the current working directory
                  # this is also where R will store everything you save,
                  # so set it to the folder you want to work in
# Replace <your user name> with the user name on your computer.
setwd("C:/Users/<your user name>/Desktop/sta222")
                  # sta222 is a folder on the desktop holding your data
cityfuel <- read.csv("cityfuel.csv", header = TRUE)   # read your data into R
                  # R reads data saved from MS Excel as .csv
                  # cityfuel.csv is the file with the data
help(mean)        # help(yyyy) gives help on the command yyyy, here mean
help.start()      # open up a help window
q()               # quit R
# To run the code you type in the R Editor, highlight the code,
# right click and click on "Run line or selection" (see Fig. 1).
[6] Students are strongly advised to practise and follow each step of this manual on the computer.
R has very extensive documentation written by the statistical community. We will start this
course with simple commands. The # symbol marks a “comment”: R ignores anything that
follows # on a line. I will add comments to explain what is going on. What you need to do is
to type the commands and make sure you see the results of each command.
Fig. 1. REditor and RConsole Window Display
x = 10 ###### assign x the value 10

x ###### print x
print(x) ###### another way to print x
x <- 10 ###### you can also use <- to make assignments

y = 110
z = x + y ### what of this!
z # to see the content of z
y = sqrt(10) ### can you do this!

q() # to quit R
x = 1:10 ### the vector (1,2,3,4,5,6,7,8,9,10)

print(x) ### this shows the content of x!
x = seq(1,10,length=10) ### same thing as above #see fig 1 for the display
print(x)
x = c(5,4,3,2,0.1,53,44,3,2,14,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
y = c(25,14,13,21,15,55,0.4,3,20,11,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
z = x + y ; print(z) ### You can put two commands on one line, separated by ;
###### Graphics###########
plot(x,y,type="l",lwd=3,col=6,xlab="x",ylab="y")
hist(y,col="blue") ### histogram
######## Arithmetic Expressions ######
# example:
(2*3)+((4)-(5^2)) ## '*' denotes multiply, '+' plus, '-' minus, '^' raised to the power
# R prints -15
sqrt(10) # the square root of 10
#############DIRECT CODES FOR MEAN, VARIANCE,STD DEVIATION###############################
x=c(5,4,3,2,0.1,53,44,3,2,14,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
mean(x)
var(x)
sd(x) #see fig 2
##################Median#################
median(x)
Fig. 2. REditor and RConsole Window display for sta222 lecture
7 Programming with R
7.1 Writing functions in R
You can write your own functions in R. An example is given below.
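##################LET US NOW WRITE OUR OWN CODE FOR MEAN, VARIANCE#######

x = c(5,4,3,2,0.1,5,4,3,2,4,5)
sum(x)
length(x)
xmean = sum(x)/length(x) # sample mean; named xmean so that R's mean() is not masked
print(xmean)
#####################VARIANCE###########
x = c(5,4,3,2,0.1,5,4,3,2,4,5)
xmean = sum(x)/length(x)
sumsq = sum((x-xmean)^2) # sum of squared deviations from the mean
xvar = sumsq/(length(x)-1) # sample variance, divisor n-1
print(xvar)
##############STD DEVIATION FOR GROUPED DATA (values x with frequencies f)##########
x = c(5,4,3,2,0.1)
f = c(8,12,10,6,4)
sum(f*x) # total of all the observations
sum(f) # number of observations
fmean = sum(f*x)/sum(f) # grouped mean
sumsq = sum(f*(x-fmean)^2)
fvar = sumsq/(sum(f)-1)
fstd = sqrt(fvar)
print(fstd)
##############STD DEVIATION##########
x = c(5,4,3,2,0.1,5,4,3,2,4,5)
xmean = sum(x)/length(x) # recompute the mean for this x
sumsq = sum((x-xmean)^2)
xvar = sumsq/(length(x)-1)
xstd = sqrt(xvar)
print(xstd)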
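The step-by-step calculations above can also be wrapped into reusable functions, which is what "writing functions" means in R. The following is a minimal sketch, not the only way to do it: the names my_mean, my_var, my_sd and grouped_sd are made up for illustration and are not part of R. The grouped version assumes values x with frequencies f and computes the grouped mean as sum(f*x)/sum(f).

```r
# Sample mean: total divided by the number of observations
my_mean <- function(x) sum(x) / length(x)

# Sample variance with divisor n - 1
my_var <- function(x) {
  stopifnot(length(x) > 1)
  sum((x - my_mean(x))^2) / (length(x) - 1)
}

# Sample standard deviation
my_sd <- function(x) sqrt(my_var(x))

# Standard deviation for grouped data: values x with frequencies f
grouped_sd <- function(x, f) {
  stopifnot(length(x) == length(f), sum(f) > 1)
  m <- sum(f * x) / sum(f)                  # grouped mean
  sqrt(sum(f * (x - m)^2) / (sum(f) - 1))
}

x <- c(5, 4, 3, 2, 0.1, 5, 4, 3, 2, 4, 5)
print(my_mean(x)); print(my_var(x)); print(my_sd(x))

# Check against R's built-in functions
print(all.equal(my_mean(x), mean(x)))
print(all.equal(my_var(x), var(x)))
print(all.equal(my_sd(x), sd(x)))

# Grouped data: expanding the frequencies with rep() should give the same answer
xg <- c(5, 4, 3, 2, 0.1)
f  <- c(8, 12, 10, 6, 4)
print(all.equal(grouped_sd(xg, f), sd(rep(xg, times = f))))
```

Once a function is defined it can be called on any numeric vector, so the same code does not have to be retyped for each new data set.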
7 Each member of a group will submit a printout of all written code and results, with their names shown in the R console.
8 Statistical Modelling with R
R can be used to perform many modelling activities in Science and Engineering. However,
this course is limited to simple linear regression and the correlation coefficient. Regression
analysis is a tool with which you must be familiar. In its simplest form, it involves building
a predictive model to relate a predictor variable, X, to a response variable, Y, through a
relationship of the form Y = aX + b. For example, we might build a model which would allow
us to predict a city's fuel consumption for the week. You must check that the package NISTnls
is installed on your computer.
8.1 Simple Regression Models

A regression model relates the response variable y to the predictor variable x through a function f.
The function f is not completely known; it depends on unknown parameters, say β = (β0, β1).
For clarity, the relationship between the predictor and the response can be formulated as
y = f(x, β) (1)
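For the simple linear case, equation (1) can be written out explicitly. The block below is a sketch under the usual least-squares assumption, with e the error term, n the number of observations, and sums taken over all observations; the slope and intercept expressions are the same ones coded by hand in Section 8.2. In the notation Y = aX + b used earlier, a corresponds to β1 and b to β0.

```latex
\begin{align*}
f(x,\beta) &= \beta_0 + \beta_1 x, \qquad y = \beta_0 + \beta_1 x + e,\\[4pt]
\hat{\beta}_1 &= \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},\\[4pt]
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\,\bar{x}.
\end{align*}
```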
8.2 Regression with R

# Now if you have y, x1 and you need to fit a simple regression
# Y = b0 + b1*x1 + e
x1 = c(0.99,1.02,1.15,1.29,1.46,1.36,0.87,1.23,1.55,1.4,1.19,1.15,0.98,1.01,
1.11,1.2,1.26,1.32,1.43,0.95)
y =c(90.01,89.05,91.43,93.74,96.73,94.45,87.59,91.77,99.42,93.65,93.54,92.52,
90.56,89.54,89.85,90.39,93.65,93.41,94.98,87.3)
###### EDA#######
plot(x1,y,type="p",lwd=3,col=6,xlab="x1",ylab="y") # scatter plot of y against x1 (not the old x)
hist(y,col="blue") ### histogram
plot(density(y),col="green",xlab="y",ylab="density") # density() takes one variable at a time # see the graph below

abline(v=mean(y), lwd=1.5, col="gold") # vertical line at the mean of y # See FIG 4 below
Fig. 4. Density plot for x1 vs y for sta222 lecture
###### Continue Analysis#######

sta222 = data.frame(y=y,x1=x1)
out = lm(y ~ x1, data = sta222)
names(out)
extractAIC(out)
s = summary(out)
print(s)
names(s)
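Once the model is fitted, the pieces listed by names(out) and names(s) can be pulled out and used. The sketch below is a minimal illustration: it refits the same model on the lecture data, extracts the coefficients and R-squared, and predicts y for a few new x1 values. The values in new_x are made up for illustration and were chosen to lie inside the observed range of x1.

```r
# Lecture data: x1 is the predictor, y is the response
x1 <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.4, 1.19, 1.15,
        0.98, 1.01, 1.11, 1.2, 1.26, 1.32, 1.43, 0.95)
y  <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
        93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.65, 93.41, 94.98, 87.3)

sta222 <- data.frame(y = y, x1 = x1)
out <- lm(y ~ x1, data = sta222)

b  <- coef(out)                # named vector: "(Intercept)" and "x1"
r2 <- summary(out)$r.squared   # proportion of variation in y explained by x1

# Made-up new predictor values; the column name must match the model term x1
new_x <- data.frame(x1 = c(1.00, 1.25, 1.50))
pred  <- predict(out, newdata = new_x)

print(b)
print(r2)
print(pred)
```

The data frame passed to predict() must use the same column name as the predictor in the model formula, otherwise R cannot match the new values to the model.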
#############writing our own Program for regression#############

x1 = c(0.99,1.02,1.15,1.29,1.46,1.36,0.87,1.23,1.55,1.4,1.19,1.15,0.98,1.01,
1.11,1.2,1.26,1.32,1.43,0.95)
y =c(90.01,89.05,91.43,93.74,96.73,94.45,87.59,91.77,99.42,93.65,93.54,92.52,
90.56,89.54,89.85,90.39,93.65,93.41,94.98,87.3)
sum(x1)
sum(y)
sum(x1*y)
length(x1)
length(y)
sum(x1^2)
meanx1=sum(x1)/length(x1)###or use## mean(x1)
meany=sum(y)/length(y)#######or use## mean(y)
slope=((length(x1)*sum(x1*y))-(sum(x1)*sum(y)))/((length(x1)*sum(x1^2))-(sum(x1)^2))
intercept=meany-(slope*meanx1)
print(slope); print(intercept)
#############CORRELATION COEFFICIENT##########
r=((length(x1)*sum(x1*y))-(sum(x1)*sum(y)))/
sqrt(((length(x1)*sum(x1^2))-(sum(x1)^2))*((length(y)*sum(y^2))-(sum(y)^2)))
print(r)
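A useful habit is to check hand-written formulas against R's built-in functions. The sketch below wraps the slope, intercept and correlation formulas used above into one function and compares the results with lm() and cor(). The function name simple_ols is made up for illustration and is not part of R.

```r
# simple_ols is an illustrative name, not a base R function
simple_ols <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n   <- length(x)
  sxy <- n * sum(x * y) - sum(x) * sum(y)
  sxx <- n * sum(x^2) - sum(x)^2
  syy <- n * sum(y^2) - sum(y)^2
  slope     <- sxy / sxx
  intercept <- mean(y) - slope * mean(x)
  r         <- sxy / sqrt(sxx * syy)
  c(intercept = intercept, slope = slope, r = r)
}

x1 <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.4, 1.19, 1.15,
        0.98, 1.01, 1.11, 1.2, 1.26, 1.32, 1.43, 0.95)
y  <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
        93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.65, 93.41, 94.98, 87.3)

fit <- simple_ols(x1, y)
print(fit)

# Compare with the built-in functions
builtin <- coef(lm(y ~ x1))
print(builtin)
print(cor(x1, y))
print(all.equal(unname(fit["slope"]), unname(builtin["x1"])))
print(all.equal(unname(fit["intercept"]), unname(builtin["(Intercept)"])))
print(all.equal(unname(fit["r"]), cor(x1, y)))
```

If any of the three comparisons does not print TRUE, the hand-written formula contains a mistake.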
9 R and Others
R gives the programmer unending avenues for programming expression; it is command driven. SPSS
and Minitab are menu driven, while SAS and Stata are both. There have been very few comments
about SPSS versus R. This probably means that SPSS is perceived to be the most distant from R.
SPSS has been characterized as the "prototypical statistical package": it is inflexible, produces
voluminous output from individual rectangular datasets, and has poor programming tools. Having
seen R graphs, users seem to view SAS graphics as decidedly inferior. R is much more of a
complement to any of Stata, SAS or SPSS than any of these three is to another (Jonathan Baron;
personal communication, January 2006). R is free and will continue to exist. Nothing can make it
go away. Once you learn it, you are no longer subject to price increases.
Reference Materials
• David Hand, Heikki Mannila and Padhraic Smyth (2001): Principles of Data Mining. The MIT Press (546 pages). ISBN: 026208290x.
• Stephen L. Morgan and Stanley N. Deming, © 2006. All rights reserved.
• Krzanowski W. J (1998): An Introduction to Statistical Modelling. Hodder Arnold, London.
• Ritz C and Streibig J.C (2008): Non-linear Regression with R. Springer, New York.
• Robert Gentleman, Kurt Hornik and Giovanni Parmigiani (Series Editors): Use R!
• Seber G.A.F and Wild C.J (1989): Non-linear Regression. John Wiley and Sons, New York.
• T. K. Rajan (Selection Grade Lecturer in Mathematics, Govt Victoria College, Palakkad): Fortran 77.
• Joaquim P. Marques de Sá: Applied Statistics Using SPSS, STATISTICA, MATLAB and R.