Sta 222 - New (1) - 1-1
June 9, 2016
Objectives: This course assumes that students can write programmes in the FORTRAN
language. Students will be expected to prepare flow charts; write programmes to compute
descriptive statistics such as the mean, variance and correlation coefficient; estimate a simple
regression line; and run these programmes with data.
Contents
1 Introduction
  1.1 What is Statistical Programming?
3 Data Visualization
  3.0.1 Data Mining
  3.0.2 Exploratory Data Analysis (EDA)
5 Microsoft Excel
  5.1 Introduction to the Microsoft Excel Window Environment
    5.1.1 Understanding Spreadsheets
    5.1.2 Cell references
    5.1.3 Typing data in multiple cells
    5.1.4 Installing the Analysis ToolPak
    5.1.5 Functions and Computation Syntax
    5.1.6 Adding and deleting rows and columns
    5.1.7 Calculating a Sum and an Average
    5.1.8 Calculating a Mean, Median, Mode and a Standard Deviation
    5.1.9 Linear Regression with MS Excel
    5.1.10 Correlation and CORREL
6 Introduction to R Language
  6.1 Basics
  6.2 Brief History and Description of R
  6.3 How to start
7 Programming with R
  7.1 Writing functions in R
9 R and Others
1 Introduction
1.1 What is Statistical Programming?
The rapid growth of statistics has benefited from the development of more powerful electronic
machines and computers. Computer programming involves controlling computers: telling them
what calculations to do, what to display, etc. Statistical programming might be described as the
kind of computer programming statisticians do, but statisticians do all sorts of programming,
and statistics involves a wide variety of computing tasks. For example, statisticians are
concerned with collecting and analyzing data, and some statisticians are involved in setting
up connections between computers and laboratory instruments, but we would not call that
statistical programming. Statisticians often oversee data entry from questionnaires, and may
set up programs to aid in detecting data entry errors.
Statistical programming involves doing computations to aid in statistical analysis. For
example, data must be summarized and displayed. Models must be fitted to data, and the results
displayed. These tasks can be done in a number of different computer applications such as
Microsoft Excel, SAS, SPSS, S-PLUS, R, Stata, etc.¹ In this course we shall be introduced to
computing in Excel and R, and hopefully Minitab. Using any of these applications
is certainly statistical computing, and usually involves statistical programming. Since graphs
play an important role in statistical analysis, drawing graphics of one-, two- or
higher-dimensional data is also an aspect of statistical programming.
We shall also consider exploratory data analysis (EDA) procedures in statistical computing.
Statistical software has revolutionized the way we approach data analysis, replacing
the calculators used by earlier generations. Access to software has often been a barrier to the
use of many statistical methods.
1 Computer programming involves controlling computers, telling them what calculations to do and what to
display, to aid in statistical analysis.
(C) Bello O.A.
3 Data Visualization
The first step in statistical computing is to look at your data and ask researchable questions
about them.
"Statistics starts with a problem, continues with the collection of data, proceeds
with the data analysis and finishes with conclusions. It is a common mistake of
inexperienced Statisticians to plunge into a complex analysis without paying attention
to what the objectives are or even whether the data are appropriate for the proposed
analysis. Look before you leap!" (Julian J. Faraway, 2002)
Many different words can be used to describe graphic representations of data, but the overall
aim is always to visualize the information in the data and so the term Data Visualization is
the best universal term. Other terms have different connotations. At a seminar in Munich in
2004 where researchers from all fields interested in visualization met, one person thought the
word ‘plot’ was being used to describe the story in a statistical graphic, not the graphic itself.
Would that every plot had a good plot! Deciding on which graphics to use is often a matter of
taste. What one person thinks are good graphics for illustrating information may not appeal to
someone else.
Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner. The relationships and summaries derived through a data mining exercise
are often referred to as models or patterns. Examples include linear equations, rules, clusters,
graphs, tree structures, and recurrent patterns in time series. The definition above refers to
"observational data," as opposed to "experimental data." Data mining typically deals with data
that have already been collected for some other purpose. This means that the objectives of the data
mining exercise play no role in the data collection strategy. This is one way in which data mining
differs from much of statistics, in which data are often collected by using efficient strategies to
answer specific questions. For this reason, data mining is often referred to as "secondary" data
analysis. The definition also mentions that the data sets examined in data mining are often
large. If only small data sets were involved, we would merely be discussing classical exploratory
data analysis as practised by statisticians.
Tutorial Questions
1 Classify the following statistical software appropriately as (i) menu-driven, (ii) command-driven,
or (iii) both: Minitab, Stata, SPSS, R, SAS, S-PLUS, MS Excel, EViews, EasyFit
and Design-Expert.
4 What do you understand by "computer-intensive statistics"? Give five benefits of
computer-intensive statistics.
5 Relate and differentiate between data mining and EDA.
4 Review of Basic Prerequisite Knowledge for Sta 222
4.1 Location and Variability Parameter Estimation and the Formulas
The measures of location (mean, median and mode) tell us about the shape of the distribution
our data came from.
Mean −→
$$\bar{x} = \frac{\sum x}{n}$$
Median −→
$$L + \left[\frac{\frac{N}{2} - F}{f}\right] C$$
where L = lower class boundary of the median class, F = cumulative frequency of the classes
above (before) the median class, f = frequency of the median class, and C = size of the median
class interval.
Mode −→
$$L + \left[\frac{f_m - f_a}{2f_m - f_a - f_b}\right] C$$
where L = lower class boundary of the modal class,
$f_m$ = frequency of the modal class, $f_a$ = frequency of the class above the modal class,
$f_b$ = frequency of the class below the modal class, and C = size of the modal class interval.
Variance −→
$$E[(X - \mu)^2]$$
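The grouped-data formulas for the median and mode can be sketched in R as follows. This is only an illustrative sketch, not part of the original notes: the function names `grouped_median` and `grouped_mode` and the small frequency table are hypothetical, F is taken as the cumulative frequency of the classes before the median class, and all classes are assumed to share the same width C.

```r
# Grouped-data median: L + ((N/2 - F) / f) * C
# lower: lower class boundaries, f: class frequencies, C: common class width
grouped_median <- function(lower, f, C) {
  N <- sum(f)
  cumf <- cumsum(f)
  k <- which(cumf >= N / 2)[1]               # index of the median class
  F_before <- if (k > 1) cumf[k - 1] else 0  # cumulative frequency before it
  lower[k] + ((N / 2 - F_before) / f[k]) * C
}

# Grouped-data mode: L + ((fm - fa) / (2*fm - fa - fb)) * C
grouped_mode <- function(lower, f, C) {
  k <- which.max(f)                          # index of the modal class
  fa <- if (k > 1) f[k - 1] else 0           # frequency of the class above (before)
  fb <- if (k < length(f)) f[k + 1] else 0   # frequency of the class below (after)
  lower[k] + ((f[k] - fa) / (2 * f[k] - fa - fb)) * C
}

# Hypothetical frequency table (made up for illustration only)
lower <- c(9.5, 19.5, 29.5, 39.5)
f     <- c(3, 7, 6, 4)
print(grouped_median(lower, f, C = 10))  # 29.5
print(grouped_mode(lower, f, C = 10))    # 27.5
```

The same two functions can be applied to any equal-width frequency table by supplying its lower class boundaries, frequencies and class width.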
• Assessment of the effect of, or relationship between, explanatory variables on the response.
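Regression
Recall
$$\beta = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
where
$$\alpha = \bar{y} - \beta\bar{x}$$
or
$$\beta = \frac{S_{xy}}{S_{xx}}, \quad \text{where } S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \text{ and } S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
Correlation
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$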
Solution
x     y     x²       y²      xy      ŷ         y − ŷ     (y − ŷ)²
28    12.4  784      153.76  347.2   12.2588   0.1412    0.019937
28    11.7  784      136.89  327.6   12.2588   -0.5588   0.312257
32.5  12.4  1056.25  153.76  403     11.68325  0.71675   0.513731
39    10.8  1521     116.64  421.2   10.8519   -0.0519   0.002694
45.9  9.4   2106.81  88.36   431.46  9.96939   -0.56939  0.324205
57.8  9.5   3340.84  90.25   549.1   8.44738   1.05262   1.108009
58.1  8     3375.61  64      464.8   8.40901   -0.40901  0.167289
62.5  7.5   3906.25  56.25   468.75  7.84625   -0.34625  0.119889
where ŷ = 15.84 − 0.1279x.
$n = 8$, $\sum x = 351.8$, $(\sum x)^2 = 123763.2$, $\sum y = 81.7$,
$\sum x^2 = 16874.76$, $\sum y^2 = 859.91$, $\sum xy = 3413.11$
$$SSE = \sum (y - \hat{y})^2 = 2.568011$$
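To calculate the means:
$$\bar{y} = \frac{\sum y}{n} = \frac{81.7}{8} = 10.2125, \qquad \bar{x} = \frac{\sum x}{n} = \frac{351.8}{8} = 43.975$$
$$\beta = \frac{8(3413.11) - (351.8)(81.7)}{8(16874.76) - (351.8)^2}$$
$$\beta = -0.12792$$
and
$$\alpha = \bar{y} - \beta\bar{x}$$
$$\alpha = 10.2125 - (-0.12792)(43.975) \approx 15.838$$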
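The hand calculation above can be checked in R. The sketch below is not part of the original notes: it re-enters the eight (x, y) pairs from the solution table, applies the summation formulas for the slope and intercept, and compares them with the built-in `lm()` fit. The variable names are chosen here for illustration.

```r
# Data from the worked solution table
x <- c(28, 28, 32.5, 39, 45.9, 57.8, 58.1, 62.5)
y <- c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8, 7.5)
n <- length(x)

# Slope and intercept from the summation formulas
beta  <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
alpha <- mean(y) - beta * mean(x)
print(round(c(alpha = alpha, beta = beta), 5))

# Cross-check against R's built-in least-squares fit
fit <- lm(y ~ x)
print(coef(fit))
```

Both routes should agree to within rounding, which is a quick way to catch arithmetic slips in a hand-worked table.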
Assignments/Tutorial Questions (students are advised to attempt the questions)
1. Using the information above, calculate the correlation coefficient r.
2. What do you understand by the difference between the observed values and the predicted
values in regression analysis?
3. Give the statistical interpretation of all the results of the analyses above.
5 Using the information in the Sta 222 Students' Weight Table below, calculate the standard
deviation, mode and median values.
Weight (kg) of 40 Sta 222 Students
Interval  Frequency
48-52     8
53-57     12
58-62     10
63-67     6
68-72     4
1a. Define a flow chart and give four advantages of a flow chart.
(b) Define an algorithm and explain the following:
(iii) Decision box
(iv) Connector symbols.
a. Find the (i) mean, (ii) mode, (iii) median and (iv) variance of the weight of the 40 students
using the information from the Sta 222 Students' Weight Table in section 4 above. (Tip: recall
the formulas for mean, mode, median and variance.)
5 Microsoft Excel
The development of MS Excel is similar to that of Lotus 1-2-3. Since it is software for data analysis,
the rules of mathematics must be followed. MS Excel makes use of arithmetic and logical operators
in returning the output of an expression involving two or more figures. The operators are ×, ÷, + and −
for arithmetic (typed as *, /, + and -), and <, >, ≤ and ≥ for logical comparisons (typed as <, >, <=
and >=). The BODMAS rule also governs the order in which MS Excel performs its calculations.
To return the output of any formula or expression, MS Excel uses = to start it. To execute a
formula, press the ENTER key. Excel is both menu- and command-driven.
If you type data in cell A1 and press Enter, Excel selects the next cell below, which is A2. If you
type data in A2 and press Tab, Excel selects the cell to the right, which is B2.
Figure 1:
Statistical functions in the Analysis ToolPak (Data Analysis, Statistics Add-ins). Excel
also has an add-in called "Data Analysis" which performs various mathematical tasks. The add-ins are:
1 Analysis ToolPak: Adds financial, statistical, and engineering analysis tools and functions.
2 Analysis ToolPak VBA: Allows users to publish financial, statistical, and engineering
analysis tools and functions using Analysis ToolPak syntax.
3 Conditional Sum Wizard: Creates a formula that sums data in a list if the data matches
criteria you specify.
4 Euro Currency Tools: Formats values as euros, and provides the EUROCONVERT
worksheet function to convert currencies.
5 Internet Assistant VBA: Allows developers to publish Excel data to the Web by using
Internet Assistant syntax.
6 Lookup Wizard: Creates a formula to look up data in a list by using another known value
in the list.
7 Solver Add-In: Calculates solutions to what-if scenarios based on adjustable cells and
constraint cells.
2 See the demonstrated Pictures 1-4 on how to install the Analysis ToolPak, and practise in the Computer Lab.
1. Click the Home tab.
2. Click the row or column heading where you want to add another row or column.
3. Click the Insert icon in the Cells group.
Inserting a row adds a new row above the selected row. Inserting a column adds a new column
to the left of the selected column.
Examples: (1) 1/3 of 9 is written =(1/3)*9 (2) 2/3 + 4/5 is written =(2/3)+(4/5)
Hydrocarbon Level x(%) Purity y (%)
0.99 90.01
1.02 89.05
1.15 91.43
1.29 93.74
1.46 96.73
1.36 94.45
0.87 87.59
1.23 91.77
1.55 99.42
1.4 93.65
1.19 93.54
1.15 92.52
0.98 90.56
1.01 89.54
1.11 89.85
1.2 90.39
1.26 93.65
1.32 93.41
1.43 94.98
0.95 87.3
For the purpose of this work, we shall use the variable purity (y) to demonstrate
how to calculate the sum, mean, median, mode, variance and other statistics using MS Excel.
Step v. Click on the desired function (SUM or AVERAGE).
Step vi. In the dialogue box, specify the range your data occupies in the spreadsheet, e.g. A1:A20.
Step vii. Click OK.
is that associated with the fitted straight line with intercept and slope parameters. R ranges in
magnitude from 0 (uncorrelated) to 1 (perfect correlation); the sign of R indicates whether the
two sets of data are positively or negatively correlated, i.e., whether the values in the second
data set become larger or smaller as the values in the first data set become larger. USAGE:
1. To use the CORREL function, type =CORREL( followed by the two cell ranges separated by a
comma and a closing bracket, e.g. =CORREL(A1:A20,B1:B20), and press the Enter key to execute
the command.
2. The CORRELATION tool is accessed from the Analysis ToolPak. Enter the range
of cells encompassing the rows and columns of the two sets of variables, indicate whether the
variables are grouped in rows or columns, select the output range, and click OK.
1. Write the MS Excel format for the mathematical expressions for
3
i. (25% of 30)-40% of 25) ii.0.4- 11 × 33 12 ÷ 0.3
5
2. If 2 is stored in cell B4, 4 is stored in cell B5, and 5 is stored in cell C3 (where the side of
length 2 is parallel to the side of length 4, and 5 is the height), write an MS Excel expression
for the area of a trapezium using the cell addresses stated above.
5 Questions 2 and 3 are to be carried out in the computer lab.
6 Introduction to R Language
6.1 Basics
We will concern ourselves with basic programming: how to tell a computer what to do. We do
this using the open source R statistical package. You start by installing R, which you can get at:
http://www.cran.r-project.org. After installation, click on the R icon and you can start
typing instructions. For your convenience, it is already installed in the department's computer
lab for your use.
IMPORTANT NOTICES
ls() # To see the current contents of the workspace
rm(list=ls()) # To remove any rubbish in case you are starting something new!
getwd() # To know the current directory where R is
# This is also where it will store everything you do!
# This is very important, but it depends on where you want to work!
setwd("C:/Users/your-user-name/Desktop/sta222") # replace your-user-name with your own
# sta222 is a folder holding your data!
# it is on the desktop!
6 Students are strongly advised to practise and follow each step of this manual on the computer.
read.csv('cityfuel.csv',header=TRUE) #read your data into R!
R has extensive documentation written by the statistical community. We will start this
course with simple commands. The # symbol marks a "comment": R ignores everything that
follows # on a line. I will add comments to explain what is going on. What you need to do is
type the commands and make sure you see the results of each command.
q() # to quit R
x = c(5,4,3,2,0.1,53,44,3,2,14,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
y = c(25,14,13,21,15,55,0.4,3,20,11,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
z = x + y; print(z) ### You can put two commands on one line, separated by ;
###### Graphics###########
plot(x,y,type="l",lwd=3,col=6,xlab="x",ylab="y")
hist(y,col='blue') ### histogram
# example:
(2*3)+((4)-(5^2)) ## '*' denotes multiply, '+' plus, '-' minus, '^' raised to power
120
sqrt(10) # the square root of 10
#############DIRECT CODES FOR MEAN, VARIANCE,STD DEVIATION###############################
x=c(5,4,3,2,0.1,53,44,3,2,14,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
mean(x)
var(x)
sd(x) #see fig 2
##################Median#################
median(x)
Fig. 2. REditor and RConsole Window display for sta222 lecture
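R provides `mean()`, `var()`, `sd()` and `median()` directly, but it has no built-in function for the statistical mode (the base function `mode()` reports the storage type of an object instead). A minimal sketch of one way to find it is given below; the helper name `stat_modes` is hypothetical, and it returns every value that ties for the highest frequency.

```r
# All most-frequent values of x (there may be ties)
stat_modes <- function(x) {
  counts <- table(x)                                # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])  # values with the highest count
}

x <- c(5,4,3,2,0.1,53,44,3,2,14,5,4,3,2,0.1,53,44,3,2,14,53,44,3,2,14)
print(stat_modes(x))  # 2 and 3 each occur five times
```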
7 Programming with R
7.1 Writing functions in R
You can write your own functions in R. An example is given below.
##############STD DEVIATION (GROUPED DATA)##########
x =c(5,4,3,2,0.1) # class values
f=c(8,12,10,6,4) # class frequencies
sum(f*x)
sum(f)
mean=sum(f*x)/sum(f) # weighted mean of the grouped data
sumsq=sum(f*(x-mean)^2)
var=sumsq/(sum(f)-1)
std=sqrt(var)
print(std)
##############STD DEVIATION##########
x =c(5,4,3,2,0.1,5,4,3,2,4,5)
sum(x)
length(x)
mean=sum(x)/length(x)
sumsq=sum((c(x)-mean)^2)
var=sumsq/(length(x)-1)
std=sqrt(var)
print(std)
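The ungrouped calculation above runs as a script. To turn it into a function in the sense of this section, the same steps can be wrapped with `function()`. The sketch below is an illustration rather than the notes' own code, and the name `sample_sd` is hypothetical.

```r
# Sample standard deviation written as a reusable function
sample_sd <- function(x) {
  n <- length(x)
  xbar <- sum(x) / n                 # sample mean
  sqrt(sum((x - xbar)^2) / (n - 1))  # square root of the sample variance
}

x <- c(5,4,3,2,0.1,5,4,3,2,4,5)
print(sample_sd(x))
print(sd(x))  # built-in function, for comparison
```

Once defined, the function can be called on any numeric vector without retyping the individual steps.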
7 Each member of a group will submit a printout of all written codes and results, with their names in the R
console.
8 Statistical Modelling with R
R can be used to perform many modelling activities in Science and Engineering. However,
this course is limited to simple linear regression and the correlation coefficient. Regression
analysis is a tool with which you must be familiar. In its simplest form, it involves building
a predictive model to relate a predictor variable, X, to a response variable, Y, through a
relationship of the form Y = aX + b. For example, we might build a model which would allow
us to predict a city's fuel consumption for the week. You must check that the package NISTnls
is installed on your computer.
###### EDA#######
plot(x,y,type="l",lwd=3,col=6,xlab="x",ylab="y")
hist(y,col='blue') ### histogram
[Figure: density plot titled "density.default(x = x1, bw = y)"; vertical axis labelled y]
meanx1=mean(x1); meany=mean(y) # means used for the intercept
slope=((length(x1)*sum(x1*y))-(sum(x1)*sum(y)))/((length(x1)*sum(x1^2))-(sum(x1)^2))
intercept=meany-(slope*meanx1)
print(slope); print(intercept)
#############CORRELATION COEFFICIENT##########
r=((length(x1)*sum(x1*y))-(sum(x1)*sum(y)))/
sqrt(((length(x1)*sum(x1^2))-(sum(x1)^2))*((length(y)*sum(y^2))-(sum(y)^2)))
print(r)
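The correlation formula above can be checked against R's built-in `cor()`. In the sketch below, x1 and y are assumed, for illustration only, to be the hydrocarbon level and purity values from the table in the Excel section; the data actually used in the lecture may differ. The name `r_manual` is also chosen here.

```r
# Hydrocarbon level (x1, %) and purity (y, %) from the table in the Excel section
x1 <- c(0.99,1.02,1.15,1.29,1.46,1.36,0.87,1.23,1.55,1.4,
        1.19,1.15,0.98,1.01,1.11,1.2,1.26,1.32,1.43,0.95)
y  <- c(90.01,89.05,91.43,93.74,96.73,94.45,87.59,91.77,99.42,93.65,
        93.54,92.52,90.56,89.54,89.85,90.39,93.65,93.41,94.98,87.3)
n <- length(x1)

# Correlation coefficient from the summation formula
r_manual <- (n*sum(x1*y) - sum(x1)*sum(y)) /
  sqrt((n*sum(x1^2) - sum(x1)^2) * (n*sum(y^2) - sum(y)^2))
print(r_manual)

# Built-in Pearson correlation, for comparison
print(cor(x1, y))
```

The two printed values should agree to within floating-point rounding.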
9 R and Others
R gives the programmer unending avenues for programming expression; it is command-driven. SPSS and
Minitab are menu-driven, while SAS and Stata are both. There have been very few comments about SPSS
versus R. This probably means that SPSS is perceived to be the most distant from R. SPSS has
been characterized as the "prototypical statistical package": it is inflexible, produces voluminous
output from individual rectangular datasets, and has poor programming tools. Having seen R graphs,
users seem to view SAS graphics as decidedly inferior. R is much more of a complement to any
of Stata, SAS or SPSS than any of these three is to another (Jonathan Baron; personal
communication, January 2006). R is free and will continue to exist. Nothing can make it go away. Once
you learn it, you are no longer subject to price increases.
Reference Materials
• David Hand, Heikki Mannila and Padhraic Smyth (2001): Principles of Data Mining. The MIT Press (546 pages). ISBN: 026208290X.
• Ritz, C. and Streibig, J.C. (2008): Nonlinear Regression with R. Springer, New York.
• Seber, G.A.F. and Wild, C.J. (1989): Nonlinear Regression. John Wiley and Sons, New York.
• T K Rajan Slen Fortran 77 Grade Lecturer in Mathematics Govt Victoria College, Palakkad
• Joaquim P. Marques de Sá Applied Statistics Using SPSS, STATISTICA, MATLAB and R