Project of Statistical Packages

Ma’am Wajiha Nasir

5th

Project of statistical packages using Excel
Descriptive statistics
Descriptive statistics are brief descriptive coefficients that summarize a given data set,
which can be either a representation of the entire population or a sample of a population.

Types of descriptive statistics

There are 3 main types of descriptive statistics:
• The distribution concerns the frequency of each value.
• The central tendency concerns the averages of the values.
• The variability or dispersion concerns how spread out the values are.

Binomial distribution
The binomial distribution describes experiments in which each trial has two possible outcomes. It gives the probability of each possible number of successes in a fixed number of independent trials, when each trial has the same probability of success. A binomial distribution has only 2 parameters, namely n (the number of trials) and p (the probability of success).
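Procedure on Excel: =BINOMDIST(x, n, p, FALSE) gives P(X = x); =BINOMDIST(x, n, p, TRUE) gives the cumulative probability P(X ≤ x).
Example: n = 15, p = 0.4; find the cumulative binomial distribution. (The tabulated values correspond to p = 0.4 with the cumulative argument set to TRUE.)
x P(X ≤ x)
0 0.00047018
1 0.00517203
2 0.027114
3 0.0905019
4 0.21727771
5 0.40321555
6 0.60981316
7 0.78689682
8 0.90495259
9 0.9661667
10 0.99065234
11 0.99807223
12 0.9997211
13 0.99997477
14 0.99999893
15 1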
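As a cross-check outside Excel, the tabulated cumulative probabilities can be recomputed directly from the binomial formula. The sketch below is only an illustration: the function names are hypothetical, and p = 0.4 is assumed because it is the value that reproduces the table.

```python
from math import comb


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x), the analogue of BINOMDIST(x, n, p, FALSE)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x), the analogue of BINOMDIST(x, n, p, TRUE)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


n = 15
p = 0.4  # assumption: this value reproduces the tabulated probabilities

# Cumulative probabilities for x = 0, 1, ..., 15
cumulative = [binom_cdf(x, n, p) for x in range(n + 1)]
for x, prob in enumerate(cumulative):
    print(x, round(prob, 8))
```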
Poisson distribution
A Poisson distribution is a probability distribution that is used to show how many times
an event is likely to occur over a specified period.
Procedure on Excel: =POISSON.DIST(x, mean, TRUE)
x = 10, mean = 7
so P(X ≤ 10) = 0.901479
Because the cumulative argument is TRUE, this is a cumulative probability: the probability that at most 10 particles enter the counter is 0.901479.
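The same cumulative probability can be recomputed by summing the Poisson probabilities from 0 to 10. This is a minimal sketch with hypothetical function names, not a replacement for the Excel procedure.

```python
from math import exp, factorial


def poisson_pmf(x: int, mean: float) -> float:
    """P(X = x) for a Poisson distribution with the given mean."""
    return exp(-mean) * mean**x / factorial(x)


def poisson_cdf(x: int, mean: float) -> float:
    """P(X <= x): the cumulative form (third argument TRUE in Excel)."""
    return sum(poisson_pmf(k, mean) for k in range(x + 1))


prob_at_most_10 = poisson_cdf(10, 7)   # cumulative probability
prob_exactly_10 = poisson_pmf(10, 7)   # probability of exactly 10 events
print(round(prob_at_most_10, 6), round(prob_exactly_10, 6))
```

The two printed values differ, which shows why the choice between TRUE and FALSE in the last argument matters.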

The central tendency

The arithmetic mean of a dataset is the sum of all values divided by the total number of values.
It’s the most commonly used measure of central tendency because all values are used in the calculation.
Procedure on Excel: =AVERAGE(data)
Example: Finding the mean
Participant 1 2 3 4 5
Reaction time (milliseconds) 287 345 365 298 380
The mean reaction time is 335 milliseconds.
There are further types of mean, such as the geometric mean and the harmonic mean.

Geometric mean
The geometric mean multiplies the n values together and raises the product to the power 1/n.
Procedure on Excel: =GEOMEAN(data)
Participant 1 2 3 4 5
Reaction time (milliseconds) 287 345 365 298 380
The geometric mean is 332.96 milliseconds.

Harmonic mean
The harmonic mean is calculated by dividing the number of observations by the sum of the reciprocals of the values in the series. Equivalently, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
Procedure on Excel: =HARMEAN(data)
Participant 1 2 3 4 5
Reaction time (milliseconds) 287 345 365 298 380
The harmonic mean is 330.91 milliseconds.
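The three means can be compared on the same reaction-time data with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative only.

```python
import statistics

reaction_times = [287, 345, 365, 298, 380]  # milliseconds

arithmetic = statistics.mean(reaction_times)           # like =AVERAGE(data)
geometric = statistics.geometric_mean(reaction_times)  # like =GEOMEAN(data)
harmonic = statistics.harmonic_mean(reaction_times)    # like =HARMEAN(data)

print(arithmetic, round(geometric, 2), round(harmonic, 2))
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the three results give a quick consistency check.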

The median is the value in the middle of a data set that is ordered from low to high. For an even-numbered data set, find the two values in the middle of the data set: the values at the n/2 and (n/2) + 1 positions. Then, find their mean.
Procedure on Excel: =MEDIAN(data)

Reaction time (milliseconds) 287 298 345 357 365 380
The median is 351 milliseconds.

The mode is the most frequently occurring value in the data set. It’s possible to have no mode,
one mode, or more than one mode.
Procedure on Excel: =MODE.MULT(data)
Participant 1 2 3 4 5 6 7 8 9
Reaction time (milliseconds) 267 345 401 324 401 312 382 298 303
The mode is 401 milliseconds because it occurs twice in the data, more often than any other value.
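The median and mode examples can be reproduced with the standard `statistics` module. The sketch below uses `multimode`, which returns every most-frequent value in a list, similar in spirit to =MODE.MULT.

```python
import statistics

median_data = [287, 298, 345, 357, 365, 380]
mode_data = [267, 345, 401, 324, 401, 312, 382, 298, 303]

# Even number of values: the median is the mean of the two middle values.
median_value = statistics.median(median_data)

# multimode returns a list, so data with several modes is handled as well.
modes = statistics.multimode(mode_data)

print(median_value, modes)
```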

The variability
Variability summarizes how far apart the data points are. This is important because it tells you whether the
points tend to be clustered around the center or more widely spread out.

The range tells you the spread of your data from the lowest to the highest value in the
distribution. It’s the easiest measure of variability to calculate.
Procedure on Excel: =MAX(data) - MIN(data)
Data (minutes) 72 110 134 190 238 287 305 324
The range is 252 minutes.

Interquartile range
The interquartile range gives you the spread of the middle of your distribution. It is the difference between the third quartile (Q3) and the first quartile (Q1).
For any distribution that’s ordered from low to high, the interquartile range contains half of the values. While the first quartile (Q1) contains the first 25% of values, the fourth quartile (Q4) contains the last 25% of values.
Procedure on Excel: IQR = Q3 - Q1
where Q1 is equal to =QUARTILE(data, 1)
and Q3 is equal to =QUARTILE(data, 3)
Data (minutes) 72 110 134 190 238 287 305 324
Q1 = 128 and Q3 = 291.5, so the IQR is 163.5 minutes.
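The range and the interquartile range for this data can be recomputed as follows. The sketch assumes that the "inclusive" method of `statistics.quantiles` matches the interpolation used by Excel's =QUARTILE, which it does for this data set.

```python
import statistics

minutes = [72, 110, 134, 190, 238, 287, 305, 324]

# Range: highest value minus lowest value
data_range = max(minutes) - min(minutes)

# Quartiles with linear interpolation between data points ("inclusive" method)
q1, q2, q3 = statistics.quantiles(minutes, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```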

An outlier is a value that differs significantly from the others in a data set. Since all values are used to calculate the mean, outliers can significantly increase or decrease the mean when they are included in the calculation.
Procedure on Excel: =IF(OR(value<lower limit, value>upper limit), "outlier", "not outlier")
Participant 1 2 3 4 5
Reaction time (milliseconds) 832 345 365 298 380
Interpretation: 832 is an outlier in the data.
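The text does not state how the lower and upper limits are chosen. One common convention, assumed here purely for illustration, is the 1.5 × IQR rule: a value is flagged when it lies more than 1.5 interquartile ranges below Q1 or above Q3.

```python
import statistics

reaction_times = [832, 345, 365, 298, 380]  # milliseconds

q1, _, q3 = statistics.quantiles(reaction_times, n=4, method="inclusive")
iqr = q3 - q1

# Assumption: limits follow the 1.5 * IQR rule (not specified in the text).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

labels = {
    value: "outlier" if value < lower_limit or value > upper_limit else "not outlier"
    for value in reaction_times
}
print(lower_limit, upper_limit, labels)
```

Under this rule only 832 falls outside the limits, which agrees with the interpretation given for the example.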
Standard deviation
The standard deviation is the average amount of variability in your dataset.
Procedure on Excel: =STDEV(data)

Data (minutes) 72 110 134 190 238 287 305 324

The standard deviation of the data is 95.55 minutes. This means that, on average, each value deviates from the mean by about 95.55 minutes.
The variance is the average of squared deviations from the mean. A deviation from the mean is how far a value lies from the mean. (=STDEV and =VAR use the sample formulas, which divide by n - 1.)
Procedure on Excel: =VAR(data)
Data (minutes) 72 110 134 190 238 287 305 324
The variance of the data is 9129.14 square minutes.
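The sample variance and standard deviation can be recomputed step by step to show where the n - 1 divisor enters. This is a minimal sketch; the step-by-step variable names are illustrative.

```python
import statistics
from math import sqrt

minutes = [72, 110, 134, 190, 238, 287, 305, 324]

n = len(minutes)
mean = sum(minutes) / n
squared_deviations = [(x - mean) ** 2 for x in minutes]

# Sample formulas divide by n - 1, as =VAR and =STDEV do.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = sqrt(sample_variance)

# The standard library gives the same results.
library_variance = statistics.variance(minutes)
library_sd = statistics.stdev(minutes)

print(round(sample_variance, 2), round(sample_sd, 2))
```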

A statistical graph or chart is defined as the pictorial representation of statistical data in
graphical form. The statistical graphs are used to represent a set of data to make it easier to
understand and interpret statistical information.
The main types of charts are:
• Bar chart
• Pie chart
• Scatter chart
• Pareto chart
• Histogram chart

Bar chart
A bar graph is a chart that plots data using rectangular bars or columns that represent the total number of observations in the data for each category.

Table: Favorite Type of Movie

Comedy Action Romance Drama SciFi
4 5 6 1 4

[Bar chart: Favorite Type of Movie, with one bar each for Comedy, Action, Romance, Drama, and SciFi]

Pie chart
A pie chart, sometimes called a circle chart, is a way of summarizing a set of nominal data
or displaying the different values of a given variable (e.g. percentage distribution). This type of
chart is a circle divided into a series of segments. Each segment represents a particular category.
Expenses Amount
Rent 7000
Grocery 3000
Transport 800
Current 300
School fee 2000
Savings 1900

[Pie chart: share of each expense in the total amount]

Scatter plot
A scatterplot is a type of data display that shows the relationship between two numerical
variables. Each member of the dataset gets plotted as a point whose x-y coordinates relate to its
values for the two variables.
Ice Cream Sales vs Temperature
Temperature °C Ice Cream Sales
14.2° $215
16.4° $325
11.9° $185
15.2° $332
18.5° $406
22.1° $522
19.4° $412
25.1° $614
23.4° $544
18.1° $421
22.6° $445
17.2° $408
[Scatter plot: Ice Cream Sales ($) against Temperature (°C), with a linear trendline]

Pareto chart
A Pareto chart is a bar graph. The lengths of the bars represent frequency or cost and are
arranged with longest bars on the left and the shortest to the right. In this way the chart visually
depicts which situations are more significant.
Defect Frequency Percent Cumulative percent
Button Defect 23 39.0 39.0
Pocket Defect 16 27.1 66.1
Collar Defect 10 16.9 83.1
Cuff Defect 7 11.9 94.9
Sleeve Defect 3 5.1 100.0
Total 59 - -
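The percent and cumulative percent columns of a Pareto table follow from the frequencies alone. The sketch below recomputes them for the defect counts; the dictionary and variable names are illustrative.

```python
defects = {
    "Button Defect": 23,
    "Pocket Defect": 16,
    "Collar Defect": 10,
    "Cuff Defect": 7,
    "Sleeve Defect": 3,
}

total = sum(defects.values())

# A Pareto chart orders the categories from most to least frequent.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

rows = []
running_count = 0
for name, count in ordered:
    running_count += count
    percent = round(100 * count / total, 1)
    cumulative_percent = round(100 * running_count / total, 1)
    rows.append((name, count, percent, cumulative_percent))

for row in rows:
    print(*row)
```

The cumulative percent always ends at 100 for the last category, which is a quick check on a hand-built table.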

[Pareto chart: defect frequency (bars, left axis) and cumulative percentage (line, right axis: 38.98%, 66.10%, 83.05%, 94.92%, 100.00%) for Button, Pocket, Collar, Cuff, and Sleeve defects]

Histogram chart
A histogram is a graphical representation that organizes a group of data points into user-
specified ranges. Similar in appearance to a bar graph, the histogram condenses a data series into
an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.
Salary (in thousands of $) Number of employees
0–10 50
11–20 300
21–30 250
31–40 400
41–50 550
51–60 433
61–70 266
71–80 350
81–90 100
91+ 20
[Histogram: salaries of employees (x-axis) against number of employees (y-axis)]



A t-test is a type of inferential statistic used to determine if there is a significant difference
between the means of two groups, which may be related in certain features.
Procedure on Excel: go to Data → Data Analysis → t-Test: Paired Two Sample for Means → OK → select the data → set the level of significance → OK.

pre-treatment post-treatment
143 143
129 112
142 132
154 120
133 127
130 127
147 138
128 128
144 142
142 121
142 138
130 131
129 121
128 125
120 117
114 123
125 138
121 120
144 125
124 129
 pre-treatment post-treatment
Mean 133.45 127.85
Variance 112.47 73.40
Observations 20 20
Pearson Correlation 0.38
Hypothesized Mean Difference 0.00
Df 19.00
t Stat 2.31
P(T<=t) one-tail 0.02
t Critical one-tail 1.73
P(T<=t) two-tail 0.03
t Critical two-tail 2.09

H0: µd = 0

H1: µd ≠ 0
Level of significance:
α = 0.05
Test statistic:
$t = \dfrac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$

Calculations: t = 2.31
From the above output, we can see that our two-tail p-value is 0.03, which is less than 0.05, i.e., 0.03 < 0.05, so we reject H0 and conclude that there is a difference in blood pressure after the treatment.
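The paired t statistic can be recomputed from the differences between the pre-treatment and post-treatment values. This sketch covers only the test statistic; the p-value and critical values are left to Excel's output.

```python
import statistics
from math import sqrt

pre = [143, 129, 142, 154, 133, 130, 147, 128, 144, 142,
       142, 130, 129, 128, 120, 114, 125, 121, 144, 124]
post = [143, 112, 132, 120, 127, 127, 138, 128, 142, 121,
        138, 131, 121, 125, 117, 123, 138, 120, 125, 129]

# Paired design: work with the difference within each pair.
differences = [a - b for a, b in zip(pre, post)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)  # sample standard deviation

hypothesized_mean_diff = 0
t_stat = (mean_diff - hypothesized_mean_diff) / (sd_diff / sqrt(n))
degrees_of_freedom = n - 1

print(round(mean_diff, 2), round(t_stat, 2), degrees_of_freedom)
```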

Simple linear regression
Simple linear regression is a regression model that estimates the relationship between one
independent variable and one dependent variable using a straight line. Both variables should be quantitative.
Procedure on Excel: Data, then Data Analysis, then Regression; select the x (independent) variable and the y (dependent) variable, choose the output location, and press OK.
The number of pounds of steam used per month by a chemical plant is thought to be related to the average ambient temperature (in °F) for that month. The past year’s usage and temperature are shown in the following table:

Months Temp. Usage/1000

Jan. 21 185.79
Feb. 24 214.47
Mar. 32 288.03
Apr. 47 424.84
May 50 454.58
June 59 539.03
July 68 621.55
Aug. 74 675.06
Sept. 62 562.03
Oct. 50 452.93
Nov. 41 369.95
Dec. 30 273.98

Regression Statistics
Multiple R 0.999933
R Square 0.999865
Adjusted R Square 0.999852
Standard Error 1.942835
Observations 12

Source df SS MS F Significance F
Regression 1 280583.1 280583.1 74334.36 1.08E-20
Residual 10 37.74609 3.774609
Total 11 280620.9

Term Coefficients Standard Error t Stat P-value
Intercept -6.3355 1.667648 -3.79906 0.003491
Temp. 9.208362 0.033774 272.6433 1.08E-20
$\hat{y} = -6.34 + 9.21x$
From the above model we conclude that for each one-unit (1 °F) increase in temperature, the predicted usage increases by 9.21 units.
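The intercept and slope reported by Excel can be reproduced with the least-squares formulas, slope = Sxy / Sxx and intercept = ȳ - slope · x̄. The sketch below applies them to the temperature and usage data; the variable names are illustrative.

```python
temps = [21, 24, 32, 47, 50, 59, 68, 74, 62, 50, 41, 30]
usage = [185.79, 214.47, 288.03, 424.84, 454.58, 539.03,
         621.55, 675.06, 562.03, 452.93, 369.95, 273.98]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(usage) / n

# Sums of squares and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in temps)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, usage))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(round(intercept, 4), round(slope, 6))
```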

Goodness of fit of the model

As our calculated R² = 0.9999, more than 99% of the variation in Y is explained by X. It is a good fit.

Individual testing
H0: β1 = 0
H1: β1 ≠ 0
Level of significance:
α = 0.05
Test statistic:
$t = \dfrac{\hat{\beta}_1 - \beta_1}{\operatorname{se}(\hat{\beta}_1)}$

t = 272.64
As the P-value for the slope is 1.08E-20, which is less than 0.05, we reject H0 and conclude that there is a relationship between temperature and usage.

Overall significance
H0: β1 = 0
H1: β1 ≠ 0
Level of significance:
α = 0.05
Test statistic:
$F = \dfrac{MS_{reg}}{MS_{res}}$

Source df SS MS F Significance F
Regression 1 280583.1 280583.1 74334.36 1.08E-20
Residual 10 37.74609 3.774609
Total 11 280620.9
From the above table, the calculated F is 74334.36 and its p-value (Significance F) is 1.08E-20, which is less than 0.05, so we reject H0 and conclude that our regression parameters are significant.

Multiple linear regression

Multiple regression is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
Procedure on Excel: Data, then Data Analysis, then Regression; select the x’s (independent variables) and the y (dependent variable), choose the output location, and press OK.

An engineer at a semiconductor company wants to model the relationship between the device HFE (y) and three parameters: Emitter-RS (x1), Base-RS (x2), and Emitter-to-Base RS (x3). The data are shown in the following table.

Emitter-RS (x1) Base-RS (x2) Emitter-to-Base RS (x3) HFE-IM-5V (y)

14.62 226 7 128.4
15.63 220 3.375 52.62
14.62 217.4 6.375 113.9
15 220 6 98.01
14.5 226.5 7.625 139.9
15.25 224.1 6 102.6
16.12 220.5 3.375 48.14
15.13 223.5 6.125 109.6
15.5 217.6 5 82.68
15.13 228.5 6.625 112.6
15.5 230.2 5.75 97.52
16.12 226.5 3.75 59.06
15.13 226.6 6.125 111.8
15.63 225.6 5.375 89.09
15.38 229.7 5.875 101
14.38 234 8.875 171.9
15.5 230 4 66.8
14.25 224.3 8 157.1
14.5 240.5 10.87 208.4
14.62 223.7 7.375 133.4

Regression Statistics
Multiple R 0.996842
R Square 0.993695
Adjusted R Square 0.992513
Standard Error 3.479627
Observations 20

Source df SS MS F Significance F
Regression 3 30531.5 10177.17 840.5463 8.31E-18
Residual 16 193.7248 12.1078
Total 19 30725.23
Term Coefficients Standard Error t Stat P-value
Intercept 47.174 49.58148 0.951444 0.355532
Emitter-RS -9.7352 3.691625 -2.6371 0.017935
Base-RS 0.428287 0.223933 1.912564 0.073876
Emitter-to-Base RS 18.23745 1.311802 13.9026 2.37E-10
$\hat{y} = 47.17 - 9.74x_1 + 0.43x_2 + 18.24x_3$
From the above model we conclude that, holding the other variables fixed, a one-unit increase in Emitter-RS changes the predicted HFE by 9.74 units in the opposite direction, while a one-unit increase in Base-RS or in Emitter-to-Base RS changes it by 0.43 or 18.24 units, respectively, in the same direction.
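To see how the fitted equation is used, the sketch below plugs the first row of the data into the model, using the coefficients from the Excel output, and compares the prediction with the observed HFE. The function and dictionary names are hypothetical.

```python
# Coefficients copied from the Excel regression output
coefficients = {
    "intercept": 47.174,
    "emitter_rs": -9.7352,
    "base_rs": 0.428287,
    "emitter_to_base_rs": 18.23745,
}


def predict_hfe(emitter_rs: float, base_rs: float, emitter_to_base_rs: float) -> float:
    """Predicted HFE from the fitted multiple regression equation."""
    return (
        coefficients["intercept"]
        + coefficients["emitter_rs"] * emitter_rs
        + coefficients["base_rs"] * base_rs
        + coefficients["emitter_to_base_rs"] * emitter_to_base_rs
    )


# First observation in the table: x1 = 14.62, x2 = 226, x3 = 7, y = 128.4
predicted = predict_hfe(14.62, 226, 7)
residual = 128.4 - predicted  # observed minus predicted

print(round(predicted, 2), round(residual, 2))
```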

Goodness of fit of the model

As our calculated R² = 0.9937, about 99% of the variation in Y is explained by the X’s. It is a good fit.

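Individual testing
H0: βj = 0
H1: βj ≠ 0, for j = 1, 2, 3
Level of significance:
α = 0.05
Test statistic:
$t = \dfrac{\hat{\beta}_j - \beta_j}{\operatorname{se}(\hat{\beta}_j)}$

For Emitter-RS, t = -2.64
As the P-values for Emitter-RS (0.018) and Emitter-to-Base RS (2.37E-10) are less than 0.05, we reject H0 for these coefficients and conclude that there is a relationship between device HFE and each of these variables. The P-value for Base-RS (0.074) is greater than 0.05, so we do not reject H0 for Base-RS at the 5% level.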

Overall significance
H0: β1 = β2 = β3 = 0
H1: at least one βj ≠ 0
Level of significance:
α = 0.05
Test statistic:
$F = \dfrac{MS_{reg}}{MS_{res}}$

Source df SS MS F Significance F
Regression 3 30531.5 10177.17 840.5463 8.31E-18
Residual 16 193.7248 12.1078
Total 19 30725.23
From the above table, the calculated F is 840.55 and its p-value (Significance F) is 8.31E-18, which is less than 0.05, so we reject H0 and conclude that our regression parameters are significant.
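R Square, Adjusted R Square, and F in the Excel output all follow from the sums of squares in the ANOVA table. The sketch below recomputes them from the tabulated SS values; the variable names are illustrative.

```python
# Values copied from the ANOVA table of the multiple regression
ss_regression = 30531.5
ss_residual = 193.7248
n_observations = 20
n_predictors = 3

ss_total = ss_regression + ss_residual
df_regression = n_predictors
df_residual = n_observations - n_predictors - 1

# Share of the total variation explained by the model
r_squared = ss_regression / ss_total

# Adjusted R Square penalizes the number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / df_residual

# F = MS regression / MS residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual

print(round(r_squared, 6), round(adjusted_r_squared, 6), round(f_statistic, 2))
```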

ANOVA completely randomized design

CRD is the basic single factor design. In this design the treatments are assigned completely
at random so that each experimental unit has the same chance of receiving any one treatment. But
CRD is appropriate only when the experimental material is homogeneous.
Procedure on Excel: Data, then Data Analysis, then Anova: Single Factor; enter the data, choose the output location, and press OK.
In “The Effect of Nozzle Design on the Stability and Performance of Turbulent Water Jets” (Fire Safety Journal, Vol. 4, August 1981), C. Theobald describes an experiment in which a shape measurement was determined for several different nozzle types at different levels of jet efflux velocity. Interest in this experiment focuses primarily on nozzle type, and velocity is a nuisance factor. The data are as follows:

Nozzle type Jet efflux velocity

 11.73 14.37 16.59 20.43 23.46 28.74
1 0.78 0.8 0.81 0.75 0.77 0.78
2 0.85 0.85 0.92 0.86 0.81 0.83
3 0.93 0.92 0.95 0.89 0.89 0.83
4 1.14 0.97 0.98 0.88 0.86 0.83
5 0.97 0.86 0.78 0.76 0.76 0.75
Anova: Two-Factor Without Replication
Source of
Variation SS df MS F P-value F crit
Rows 0.10218 4 0.025545 8.91623 0.000266 2.866081
Columns 0.062867 5 0.012573 4.388598 0.007364 2.71089
Error 0.0573 20 0.002865
Total 0.222347 29
H0: Nozzle type does not affect the shape measurement.
H1: Nozzle type affects the shape measurement.

Level of significance
α = 0.05
Test statistic
$F = \dfrac{MS_{rows}}{MS_{error}}$, where the rows are the nozzle types

Source of
Variation SS df MS F P-value F crit
Rows 0.10218 4 0.025545 8.91623 0.000266 2.866081
Columns 0.062867 5 0.012573 4.388598 0.007364 2.71089
Error 0.0573 20 0.002865

Total 0.222347 29
As the P-value for nozzle type (rows), 0.000266, is less than 0.05, we reject H0. This means that nozzle type has an effect on the shape measurement.
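The sums of squares and F ratios in the "Anova: Two-Factor Without Replication" output can be recomputed from the nozzle data with the usual correction-factor formulas. This is a sketch of the arithmetic only; p-values and critical values are left to Excel.

```python
# Rows: nozzle types 1-5; columns: the six jet efflux velocities
shape = [
    [0.78, 0.80, 0.81, 0.75, 0.77, 0.78],
    [0.85, 0.85, 0.92, 0.86, 0.81, 0.83],
    [0.93, 0.92, 0.95, 0.89, 0.89, 0.83],
    [1.14, 0.97, 0.98, 0.88, 0.86, 0.83],
    [0.97, 0.86, 0.78, 0.76, 0.76, 0.75],
]

n_rows = len(shape)
n_cols = len(shape[0])

grand_total = sum(sum(row) for row in shape)
correction = grand_total**2 / (n_rows * n_cols)

ss_total = sum(v * v for row in shape for v in row) - correction
ss_rows = sum(sum(row) ** 2 for row in shape) / n_cols - correction
ss_cols = (
    sum(sum(row[j] for row in shape) ** 2 for j in range(n_cols)) / n_rows
    - correction
)
ss_error = ss_total - ss_rows - ss_cols

ms_rows = ss_rows / (n_rows - 1)
ms_cols = ss_cols / (n_cols - 1)
ms_error = ss_error / ((n_rows - 1) * (n_cols - 1))

f_rows = ms_rows / ms_error  # nozzle type
f_cols = ms_cols / ms_error  # jet efflux velocity

print(round(ss_rows, 5), round(ss_cols, 6), round(ss_error, 4))
print(round(f_rows, 3), round(f_cols, 3))
```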

ANOVA randomized complete block design

The RCBD is the standard design for agricultural experiments where similar experimental units
are grouped into blocks or replicates. It is used to control variation in an experiment by accounting
for spatial effects in field or greenhouse.
Procedure on Excel: Data, then Data Analysis, then Anova: Two-Factor Without Replication; enter the data, choose the output location, and press OK.
An experiment was conducted to investigate leakage current in a SOS MOSFETS device. The purpose of the experiment was to investigate how leakage current varies as the channel length changes. Four channel lengths were selected. For each channel length, five different widths were also used, and width is to be considered a nuisance factor. The data are as follows:
Channel length Width
 1 2 3 4 5
1 0.7 0.8 0.8 0.9 1
2 0.8 0.8 0.9 0.9 1
3 0.9 1 1.7 2 4
4 1 1.5 2 3 20
Anova: Two-Factor Without Replication

SUMMARY Count Sum Average Variance

Row 1 5 4.2 0.84 0.013
Row 2 5 4.4 0.88 0.007
Row 3 5 9.6 1.92 1.567
Row 4 5 27.5 5.5 66.25

Column 1 4 3.4 0.85 0.016667

Column 2 4 4.1 1.025 0.109167
Column 3 4 5.4 1.35 0.35
Column 4 4 6.8 1.7 1.02
Column 5 4 26 6.5 83

Source of
Variation SS df MS F P-value F crit
Rows 72.6575 3 24.21917 1.6072 0.239502 3.490295
Columns 90.518 4 22.6295 1.501709 0.262908 3.259167
Error 180.83 12 15.06917

Total 344.0055 19
H0: Leakage current does not depend on channel length.
H1: Leakage current depends on channel length.

Level of significance
α = 0.05
Test statistic
$F_1 = \dfrac{MS_{rows}}{MS_{error}}$, $F_2 = \dfrac{MS_{columns}}{MS_{error}}$

Source of
Variation SS df MS F P-value F crit
Rows 72.6575 3 24.21917 1.6072 0.239502 3.490295
Columns 90.518 4 22.6295 1.501709 0.262908 3.259167
Error 180.83 12 15.06917

Total 344.0055 19
As the P-value for channel length (rows), 0.2395, is greater than 0.05, we do not reject H0. This means that there is no significant evidence that leakage current depends on channel length.

A matrix (whose plural is matrices) is a rectangular array of numbers, symbols, or
expressions, arranged in rows and columns.

Determinant of matrix
The determinant of a matrix is the scalar value or number calculated using a square matrix. The square matrix could be 2×2, 3×3, 4×4, or any size n × n, where the number of columns and rows are equal.
Procedure on Excel: =MDETERM(array)
The matrix is given by $A = \begin{bmatrix} 3 & -1 \\ 4 & 3 \end{bmatrix}$. Find the value of |A|.
The determinant of the matrix is |A| = (3)(3) - (-1)(4) = 13.
Transpose of matrix
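The transpose of a matrix is an operator which flips a matrix over its diagonal, turning its rows into columns.
Procedure on Excel: =TRANSPOSE(array), entered with Ctrl+Shift+Enter
$A = \begin{bmatrix} -2 & 5 & 6 \\ 5 & 2 & 7 \end{bmatrix}$. Find the transpose of the given matrix.
The transpose of the matrix is
$A^T = \begin{bmatrix} -2 & 5 \\ 5 & 2 \\ 6 & 7 \end{bmatrix}$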

Inverse of matrix
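The concept of the inverse of a matrix is a multidimensional generalization of the concept of the reciprocal of a number: the product of a number and its reciprocal is equal to 1, and the product of a matrix and its inverse is the identity matrix.
Procedure on Excel: =MMULT(1st array, 2nd array), entered with Ctrl+Shift+Enter, multiplies the two matrices; =MINVERSE(array) returns the inverse directly.

$A = \begin{bmatrix} 2 & 3 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} -4 & 3 \\ 3 & -2 \end{bmatrix}$
The product of the two matrices is
$AB = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
This is the identity matrix, so B is the inverse of A.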
