Report 1
Section: E
Submitted by:
Before discussing the graphs, we need some numerical data to plot. Here we use the
marks obtained by 40 students in an English class exam. The data are listed below:
171, 57, 78, 159, 44, 23, 111, 17, 46, 143, 53, 96, 102, 14, 66, 9, 22, 156, 89,
117, 60, 39, 174, 24, 77, 108, 51, 132, 19, 168, 62, 67, 198, 124, 59, 5, 71, 165,
75, 74
Pie Graph:
A pie chart, also called a circle graph, divides a circle into sectors (slices) to depict
numerical data. Each sector represents a proportional share of the whole, so the pie chart
is particularly effective for showing the composition of a whole. In many situations a pie
chart can be used in place of other graphical representations such as bar graphs, line
plots, and histograms.
Steps taken:
1. Enter the data into a frequency table. In this case we use class intervals of width 25
to build the table.
2. Add all the frequencies in the table to get the total.
3. Divide each frequency by the total and multiply by 100 to get a percentage.
4. To find how many degrees each pie sector needs, take the full circle of 360° and
apply the formula below:
$$\text{Central angle of a component} = \frac{\text{value of the component}}{\text{sum of the values of all components}} \times 360^\circ$$
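The steps above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure used to draw the report's chart: the names `marks` and `pie_table` are introduced here, and the classes 1-25, 26-50, ... are assumed from the interval width of 25.

```python
# Marks of the 40 students, copied from the data list in the report.
marks = [
    171, 57, 78, 159, 44, 23, 111, 17, 46, 143, 53, 96, 102, 14, 66, 9, 22, 156, 89,
    117, 60, 39, 174, 24, 77, 108, 51, 132, 19, 168, 62, 67, 198, 124, 59, 5, 71, 165,
    75, 74,
]

WIDTH = 25  # class width used in step 1


def pie_table(values, width):
    """Return (class label, frequency, percent, central angle) for each class."""
    total = len(values)
    n_classes = (max(values) + width - 1) // width  # enough classes to cover the maximum
    rows = []
    for k in range(n_classes):
        low, high = k * width + 1, (k + 1) * width  # assumed classes: 1-25, 26-50, ...
        freq = sum(low <= v <= high for v in values)
        rows.append((f"{low}-{high}", freq, 100 * freq / total, 360 * freq / total))
    return rows


rows = pie_table(marks, WIDTH)
for label, freq, percent, angle in rows:
    print(f"{label:>8}  f={freq:2d}  {percent:5.1f}%  {angle:5.1f} degrees")
```

The central angles always add up to 360°, which is a quick check that no value was left out of the table.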
If we plot the frequency (the number of students) on the y-axis and the grades on the
x-axis, we get a graph that resembles the one below.
The rectangles here are called bars. Note that the bars have equal width and are equally
spaced, as mentioned above. This is a simple bar diagram.
Histogram:
A histogram is a non-cumulative frequency graph drawn on a natural scale, in which the
frequencies of the class intervals are shown by adjacent vertical rectangles. It makes the
mode, a measure of central tendency, easy to locate in the data.
Steps to create a histogram:
1. Plot class intervals on the X-axis and their frequencies on the Y-axis using a natural
scale.
2. Begin the X-axis with the lower limit of the lowest class interval. If this limit is far
from the origin, put a break in the X-axis to indicate the displacement.
3. Draw bars parallel to the Y-axis over each class interval, using the class width as
the base. Ensure the areas of the rectangles represent the frequencies of their
respective classes.
In this graph we take the class intervals on the X-axis and the frequencies on the Y-axis.
Before plotting the graph, we must convert the class limits into their exact limits.
Frequency Polygon:
In this graph we take the class intervals (marks in English) on the X-axis and the
frequencies (number of students) on the Y-axis. Before plotting the graph, we must convert
the class intervals (C.I.) into their exact limits and add one extra C.I. at each end with a
frequency of 0.
Steps to draw a frequency polygon:
1. Draw the X-axis with the class intervals marked. If the lowest score is large, put a
break in the axis to adjust. Add two extra points, one at each end.
2. Draw the Y-axis (OY) vertically, marking units for the class frequencies. Scale it so
that the highest frequency makes the height of the figure about 75% of its width.
3. Plot a point above the midpoint of each class interval at a height proportional to its
frequency.
4. Connect these points with straight line segments to form the frequency polygon. The
extra intervals at both ends, with a frequency of zero, close the graph on the X-axis.
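The preparation described above (exact limits, midpoints, and one zero-frequency interval at each end) can be sketched in Python. The class limits and frequencies are those of the report's frequency table of exam marks; the function names `exact_limits` and `polygon_points` are hypothetical helpers introduced for this example.

```python
# Class limits and frequencies from the report's frequency table of exam marks.
classes = [(1, 25), (26, 50), (51, 75), (76, 100),
           (101, 125), (126, 150), (151, 175), (176, 200)]
freqs = [8, 3, 11, 4, 5, 2, 6, 1]


def exact_limits(low, high):
    """Convert class limits such as 1-25 into exact limits 0.5-25.5."""
    return low - 0.5, high + 0.5


def polygon_points(classes, freqs):
    """Return the (midpoint, frequency) points of the frequency polygon,
    including one extra class with frequency 0 at each end."""
    width = classes[0][1] - classes[0][0] + 1           # class width (25 here)
    mids = [(low + high) / 2 for low, high in classes]  # class midpoints
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))


points = polygon_points(classes, freqs)
print(exact_limits(*classes[0]))
print(points)
```

Joining the points in order with straight segments gives the polygon; the first and last points lie on the X-axis.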
Arithmetic Mean:
The value obtained by dividing the sum of all observations by the number of observations
is called the arithmetic mean.
Calculating the arithmetic mean involves adding up all the values in a dataset and then
dividing that sum by the total number of values. This process gives the average or
typical value within the dataset.
Exam Marks    Midpoint (xi)    Frequency (fi)    fi·xi
1 - 25             13                8             104
26 - 50            38                3             114
51 - 75            63               11             693
76 - 100           88                4             352
101 - 125         113                5             565
126 - 150         138                2             276
151 - 175         163                6             978
176 - 200         188                1             188
Total                               40            3270
$$\text{Arithmetic mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{3270}{40} = 81.75$$
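As a check on the table, the grouped mean can be recomputed in a few lines of Python. This is a minimal sketch; the variable names are introduced here, and the midpoints and frequencies are copied from the table.

```python
# Midpoints (xi) and frequencies (fi) copied from the table of exam marks.
midpoints = [13, 38, 63, 88, 113, 138, 163, 188]
freqs = [8, 3, 11, 4, 5, 2, 6, 1]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))  # sum of fi * xi
n = sum(freqs)                                         # sum of fi
grouped_mean = sum_fx / n

print("sum fi*xi =", sum_fx)
print("sum fi    =", n)
print("mean      =", grouped_mean)
```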
The arithmetic mean is a fundamental statistical concept used for various purposes:
Central Tendency: It provides a central or representative value within a dataset, helping to
understand the "average" or typical value of a set of numbers.
Comparative Analysis: It allows for easy comparison between different sets of data, making
it useful in various fields like finance, economics, and science.
Prediction and Estimation: It's often used to predict future values or estimate missing values
within a dataset.
Basis for Further Analysis: The mean serves as a foundational statistic, often alongside other
measures like standard deviation, forming the basis for more advanced statistical analyses.
Understanding Patterns: It helps identify trends or patterns within data, aiding in decision-
making processes.
Harmonic Mean:
The harmonic mean is a type of average that is particularly useful when dealing with rates or
ratios. It's the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.
Here are the steps to calculate the harmonic mean:
1. Reciprocal of each number: Find the reciprocal of each number in the dataset.
2. Mean of the reciprocals: Calculate the arithmetic mean of these reciprocals.
3. Reciprocal of the result: Finally, take the reciprocal of the arithmetic mean obtained
in step 2 to find the harmonic mean.
For the 40 exam marks, the sum of the reciprocals is about 1.0195, so
$$\text{Harmonic Mean} = \frac{n}{\sum \frac{1}{x_i}} = \frac{40}{1.0195} \approx 39.23$$
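The three steps can be followed literally in Python and compared with the standard library's `statistics.harmonic_mean`. This is a sketch on the 40 exam marks listed at the start of the report; the variable names are introduced for the example.

```python
from statistics import harmonic_mean

# Marks of the 40 students, copied from the data list in the report.
marks = [
    171, 57, 78, 159, 44, 23, 111, 17, 46, 143, 53, 96, 102, 14, 66, 9, 22, 156, 89,
    117, 60, 39, 174, 24, 77, 108, 51, 132, 19, 168, 62, 67, 198, 124, 59, 5, 71, 165,
    75, 74,
]

reciprocals = [1 / x for x in marks]                    # step 1
mean_of_reciprocals = sum(reciprocals) / len(marks)     # step 2
hm_by_steps = 1 / mean_of_reciprocals                   # step 3

hm_library = harmonic_mean(marks)  # same quantity from the standard library

print(round(sum(reciprocals), 4), round(hm_by_steps, 2), round(hm_library, 2))
```

Because small marks have large reciprocals, the harmonic mean is pulled well below the arithmetic mean for this dataset.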
Median:
The median is the middle value in a dataset when the values are arranged in ascending or
descending order. If there's an odd number of values, the median is the middle number. If
there's an even number of values, the median is the average of the two middle numbers.
Steps to calculate the median:
Order the Data: Arrange the values in ascending or descending order.
Identify the Middle Value: For an odd number of values, the median is the middle
number. For an even number, it's the average of the two middle numbers.
Calculate the Median: Once the middle value or values are identified, that value (or
the average of the two middle values) is the median.
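Applied to the 40 exam marks, the steps look as follows in Python. This is a minimal sketch; `marks` is the data list from the start of the report, and the result is cross-checked against `statistics.median`.

```python
from statistics import median

# Marks of the 40 students, copied from the data list in the report.
marks = [
    171, 57, 78, 159, 44, 23, 111, 17, 46, 143, 53, 96, 102, 14, 66, 9, 22, 156, 89,
    117, 60, 39, 174, 24, 77, 108, 51, 132, 19, 168, 62, 67, 198, 124, 59, 5, 71, 165,
    75, 74,
]

ordered = sorted(marks)          # step 1: order the data
n = len(ordered)
if n % 2 == 1:                   # step 2: odd count, single middle value
    med = ordered[n // 2]
else:                            # even count, average the two middle values
    med = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered[n // 2 - 1], ordered[n // 2], med)
```

With 40 values the two middle values are the 20th and 21st in the ordered list.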
Quartile:
Quartiles divide a dataset into four equal parts. There are three quartiles: Q1, Q2 (which is
also the median), and Q3. Q1 represents the value below which 25% of the data falls, Q2 is
the median (50% of the data falls below and 50% above), and Q3 represents the value below
which 75% of the data falls.
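$$Q_i = l + \frac{h}{f}\left(\frac{i \times n}{4} - C\right)$$
where l is the lower exact limit of the quartile class, h is its width, f is its frequency,
n is the total frequency, and C is the cumulative frequency of the preceding class.
With l = 69.5, h = 10, f = 211 and C = 589:
$$Q_2 = 73.75$$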
Decile:
Deciles divide a dataset into ten equal parts. There are nine deciles in a dataset: D1, D2,
D3...D9. D1 represents the value below which 10% of the data falls, D2 represents the value
below which 20% of the data falls, and so on until D9, which represents the value below
which 90% of the data falls.
$$D_i = l + \frac{h}{f}\left(\frac{i \times n}{10} - C\right)$$
With l = 69.5, h = 10, f = 211 and C = 589:
$$D_8 = 75.11$$
Percentile:
Percentiles divide a dataset into one hundred equal parts. A percentile is a measure
indicating the value below which a given percentage of the points in a dataset fall. For
example, the 25th percentile represents the value below which 25% of the data falls.
$$P_i = l + \frac{h}{f}\left(\frac{i \times n}{100} - C\right)$$
With l = 49.5, h = 10, f = 211 and C = 589:
$$P_{18} = 74.182$$
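Quartiles, deciles and percentiles of grouped data all use the same interpolation formula, differing only in the divisor (4, 10 or 100). The sketch below applies it to the report's frequency table of the 40 exam marks, which is an assumption made for illustration: it is not the dataset behind the figures quoted above. The function name `grouped_quantile` is hypothetical.

```python
# Class limits and frequencies from the report's frequency table of exam marks.
classes = [(1, 25), (26, 50), (51, 75), (76, 100),
           (101, 125), (126, 150), (151, 175), (176, 200)]
freqs = [8, 3, 11, 4, 5, 2, 6, 1]


def grouped_quantile(classes, freqs, i, parts):
    """Return l + (h / f) * (i * n / parts - C) for grouped data.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    n = sum(freqs)
    target = i * n / parts
    cumulative = 0  # C: cumulative frequency before the current class
    for (low, high), f in zip(classes, freqs):
        if f > 0 and cumulative + f >= target:
            l = low - 0.5        # lower exact limit of the class
            h = high - low + 1   # class width
            return l + (h / f) * (target - cumulative)
        cumulative += f
    raise ValueError("i must satisfy 0 < i <= parts")


q2 = grouped_quantile(classes, freqs, 2, 4)
d8 = grouped_quantile(classes, freqs, 8, 10)
p18 = grouped_quantile(classes, freqs, 18, 100)
print(round(q2, 2), round(d8, 2), round(p18, 2))
```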
Mode:
In statistics, the mode refers to the value that appears most frequently in a dataset. It's a
measure of central tendency alongside the mean and median. Unlike the mean and median,
which are concerned with the average or middle value, the mode focuses on the most
common value or values within a dataset.
For example, consider a dataset representing the number of pets owned by households in a
neighborhood:
2,1,3,2,5,2,1,4,2,3
In this dataset, the number "2" appears most frequently—it occurs four times, more than any
other number. Therefore, the mode of this dataset is "2". If there were two values tied for the
most frequent occurrence, the dataset would be described as "bimodal" (two modes). If more
than two values occurred with equal frequency and more frequently than any other values, it
could be described as "multimodal."
The mode is particularly useful when dealing with categorical or nominal data, such as
colors, types of cars, or categories of products, where identifying the most common category
can be informative.
What is a measure of location? What is the purpose served by it? What are its desirable
qualities?
Measures of Dispersion
In the realm of statistics, effective data presentation and analysis play a pivotal role,
particularly when exploring measures of dispersion. Measures of dispersion, such as
range and standard deviation, provide crucial insights into the variability and spread of a
dataset, offering a deeper understanding beyond central tendencies like the mean.
Accurate depiction and interpretation of dispersion are essential in making informed
decisions, identifying patterns, and drawing meaningful conclusions from data. Whether
in scientific research, business analytics, or various other domains, a comprehensive
grasp of measures of dispersion enhances the ability to assess the reliability and
consistency of data, facilitating more robust and reliable statistical inferences. As data-
driven decision-making becomes increasingly prevalent, proficiency in presenting and
analyzing measures of dispersion becomes an indispensable skill for professionals and
researchers alike, contributing to the robustness and credibility of statistical findings.
We examined a dataset representing the daily commute times (in minutes) for a group of
individuals over the course of a week. Two key measures of dispersion, the range and
standard deviation, were employed to assess the variability and spread within the data.
1. Range:
o Definition: The range is the difference between the maximum and
minimum values in a dataset.
o Calculation: Range = Max − Min
o Result: For the given commute times, the range was found to be 20
minutes, indicating the span between the shortest and longest commute
durations.
2. Standard Deviation:
o Definition: Standard deviation quantifies the amount of variation or
dispersion in a set of values:
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
o Results:
Mean: 30 minutes
Calculation of squared differences and summation:
$$\sigma = \sqrt{\frac{(20-30)^2 + (25-30)^2 + (30-30)^2 + (35-30)^2 + (40-30)^2}{5}} = \sqrt{50} \approx 7.07 \text{ minutes}$$
Standard Deviation: The calculated standard deviation provides a
numerical measure of the typical deviation of each commute time
from the mean, offering insights into the overall variability within
the dataset.
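The two measures can be verified with a short Python sketch. The five commute times are the values that appear in the standard deviation calculation above; the variable names are introduced for the example, and the population form (dividing by n) is used to match the formula.

```python
from math import sqrt
from statistics import pstdev

# Commute times in minutes, taken from the worked calculation above.
commute = [20, 25, 30, 35, 40]

data_range = max(commute) - min(commute)          # Range = Max - Min
mean_time = sum(commute) / len(commute)
squared_diffs = [(t - mean_time) ** 2 for t in commute]
sigma = sqrt(sum(squared_diffs) / len(commute))   # population standard deviation

print(data_range, mean_time, round(sigma, 2))
print(round(pstdev(commute), 2))  # same value from the standard library
```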
This analysis not only highlights the spread of commute times but also provides a
foundation for informed decision-making and a deeper understanding of the dataset's
characteristics. Such clarity in measures of dispersion contributes to the robustness and
reliability of statistical insights, crucial for data-driven decision-making in various fields.
Dispersion: “The variability (spread) that exists among the values of a dataset is called
dispersion.”
Here n = ∑ fi = 969.
$$Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right) = 57.611$$
Here, l = 49.5 ; h = 10 ; f = 184 ; c = 93
$$Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - c\right) = 73.345$$
Here, l = 69.5 ; h = 10 ; f = 210 ; c = 646
Quartile deviation = (Q3 − Q1)/2 = (73.345 − 57.611)/2 = 7.867 (Ans).
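The quartile deviation can be recomputed from the stated values of l, h, f, c and n = 969. This is a sketch for checking the arithmetic only; the helper name `quartile` is hypothetical and the underlying frequency table is not reproduced in the report.

```python
n = 969  # total frequency, sum of fi


def quartile(i, l, h, f, c, n):
    """Grouped-data quartile: l + (h / f) * (i * n / 4 - c)."""
    return l + (h / f) * (i * n / 4 - c)


q1 = quartile(1, l=49.5, h=10, f=184, c=93, n=n)
q3 = quartile(3, l=69.5, h=10, f=210, c=646, n=n)
quartile_deviation = (q3 - q1) / 2

print(round(q1, 3), round(q3, 3), round(quartile_deviation, 3))
```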
Mean Absolute Deviation or Mean Deviation (Average Deviation): “The arithmetic
mean of the absolute deviations from an average (mean, median, etc.) is called the mean
deviation or average deviation.”
                     Grouped                                            Ungrouped
M.D. from Mean       $M.D. = \frac{\sum f_i |x_i - \bar{x}|}{n}$        $M.D. = \frac{\sum |x_i - \bar{x}|}{n}$
M.D. from Median     $M.D. = \frac{\sum f_i |x_i - Med|}{n}$            $M.D. = \frac{\sum |x_i - Med|}{n}$
Calculate the mean deviation and the coefficient of mean deviation from (i) the mean and
(ii) the median, in the ungrouped data case, for the following set.
xi: 46, 33, 38, 47, 40, 37, 42, 49, 37
n = 9
xi     xi − x̄      |xi − x̄|     |xi − Med|
       (x̄ = 41)                 (Med = 40)
33       -8           8            7
37       -4           4            3
37       -4           4            3
38       -3           3            2
40       -1           1            0
42        1           1            2
46        5           5            6
47        6           6            7
49        8           8            9
∑ xi = 369      ∑|xi − x̄| = 40      ∑|xi − Med| = 39
$$\bar{x} = \frac{\sum x_i}{n} = \frac{369}{9} = 41$$
Median = the $\left(\frac{n+1}{2}\right)$th value of the ordered data = the 5th value = 40.
M.D. from mean = 40/9 ≈ 4.44, so the coefficient of M.D. = 4.44/41 ≈ 0.108.
M.D. from median = 39/9 ≈ 4.33, so the coefficient of M.D. = 4.33/40 ≈ 0.108.
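The ungrouped formulas can be checked in Python on the nine values of the exercise. This is a minimal sketch; `mean_deviation` is a hypothetical helper, and the coefficient is taken as the mean deviation divided by the average it is measured from.

```python
from statistics import mean, median

x = [46, 33, 38, 47, 40, 37, 42, 49, 37]


def mean_deviation(values, centre):
    """Average absolute deviation of the values from the given centre."""
    return sum(abs(v - centre) for v in values) / len(values)


x_bar = mean(x)   # arithmetic mean
med = median(x)   # middle value of the ordered data

md_mean = mean_deviation(x, x_bar)
md_median = mean_deviation(x, med)

print(x_bar, med)
print(round(md_mean, 2), round(md_mean / x_bar, 3))     # M.D. and coefficient from the mean
print(round(md_median, 2), round(md_median / med, 3))   # M.D. and coefficient from the median
```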
Calculate the variance, standard deviation and coefficient of variation from the
following weights of 60 mangoes, given as continuous grouped data:
Weight       Midpoint (xi)    Frequency (fi)    fi·xi      fi·xi²
65 - 84          74.5               9            670.5      49952.25
85 - 104         94.5              10            945        89302.50
105 - 124       114.5              17           1946.5     222874.25
125 - 144       134.5              10           1345       180902.50
145 - 164       154.5               5            772.5     119351.25
165 - 184       174.5               4            698       121801.00
185 - 204       194.5               5            972.5     189151.25
∑ fi = 60      ∑ fi xi = 7350      ∑ fi xi² = 973335
Variance:
$$S^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2 = 16222.25 - 15006.25 = 1216$$
Standard deviation:
$$S = \sqrt{1216} = 34.87 \text{ units}$$
Coefficient of Variation:
$$C.V. = \frac{S.D.}{\bar{x}} \times 100 = \frac{34.87}{122.5} \times 100 \approx 28.47 \text{ (Ans)}$$
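The same three quantities can be recomputed from the midpoints and frequencies in the table. This is a sketch with variable names introduced for the example; it uses the population form of the variance, as in the formula above.

```python
from math import sqrt

# Midpoints (xi) and frequencies (fi) from the table of mango weights.
midpoints = [74.5, 94.5, 114.5, 134.5, 154.5, 174.5, 194.5]
freqs = [9, 10, 17, 10, 5, 4, 5]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))

mean_weight = sum_fx / n
variance = sum_fx2 / n - mean_weight ** 2
std_dev = sqrt(variance)
cv = std_dev / mean_weight * 100   # coefficient of variation in percent

print(n, sum_fx, sum_fx2)
print(variance, round(std_dev, 2), round(cv, 2))
```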
Measures of dispersion, such as range and standard deviation, have various practical
applications across different fields. Here are some key applications:
8. Sports Analytics:
o In sports, measures of dispersion are used to assess the consistency of
player performance. Coaches and analysts may use standard deviation to
evaluate how consistently a player performs over a series of games.
Measures of dispersion, such as range and standard deviation, play a crucial role in
various fields. In finance, standard deviation is employed to gauge stock price volatility,
aiding investors and analysts in risk assessment. In manufacturing, these measures
ensure product consistency by assessing the reliability of processes, while in education,
they evaluate the consistency of student scores, offering insights into performance
variations. In public health, measures of dispersion assist in understanding data
variability in epidemiological studies, guiding the planning of interventions. Market
researchers use them to analyze consumer preferences, project managers apply them to
assess project variability, and meteorologists utilize them in climate studies to
understand weather data variability. Even in sports analytics, measures of dispersion
help assess the consistency of player performance. Across these diverse applications,
measures of dispersion contribute to a comprehensive understanding of data variability,
facilitating informed decision-making, risk management, and tailored intervention
strategies.
Moments
In statistics, central moments (moments about the mean) quantify the shape of a
dataset through the deviations of the observations from the mean. The r-th central
moment, denoted M_r, is calculated as:
$$M_r = \frac{\sum (X_i - \bar{X})^r}{n}$$
where X_i are the observations, X̄ is their mean, and n is the number of
observations.
This report explores the calculation and significance of the first four central
moments.
$$M_1 = \frac{\sum (X_i - \bar{X})}{n}$$
This moment is the average deviation of the data points from the mean. It is
always zero, because the deviations above and below the mean cancel.
$$M_2 = \frac{\sum (X_i - \bar{X})^2}{n}$$
This moment is the variance, which measures the spread of the data about the mean.
$$M_3 = \frac{\sum (X_i - \bar{X})^3}{n}$$
This moment captures the skewness of the dataset, indicating whether the
distribution is symmetric or skewed.
$$M_4 = \frac{\sum (X_i - \bar{X})^4}{n}$$
This moment provides information about the kurtosis, highlighting the tails and
peakedness of the distribution.
Consider a larger dataset of exam scores: {60, 75, 80, 85, 90, 95, 100, 105, 110,
115}. Let's calculate the moments about the mean for this dataset using the formula
above, where X̄ = 915/10 = 91.5.
$$M_2 = \frac{(60 - \bar{X})^2 + (75 - \bar{X})^2 + \dots + (115 - \bar{X})^2}{10}$$
$$M_3 = \frac{(60 - \bar{X})^3 + (75 - \bar{X})^3 + \dots + (115 - \bar{X})^3}{10}$$
$$M_4 = \frac{(60 - \bar{X})^4 + (75 - \bar{X})^4 + \dots + (115 - \bar{X})^4}{10}$$
r-th Moment about the Origin (O)
$$M_r(O) = \frac{\sum X_i^r}{n}$$
These moments describe the distribution of data points with respect to the origin,
providing insights into symmetry and concentration.
r-th Moment about an Arbitrary Origin (A)
$$M_r(A) = \frac{\sum (X_i - A)^r}{n}$$
These moments offer insights into the distribution of data points relative to the
chosen origin A, allowing for a more flexible analysis.
The skewness (SK) and kurtosis (K) can be expressed in dimensionless form:
$$SK = \frac{M_3}{M_2^{3/2}}$$
$$K = \frac{M_4}{M_2^{2}} - 3$$
These dimensionless measures provide standardized indicators of skewness and
kurtosis, making them comparable across different datasets.
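1. Skewness (SK):
2. If SK = 0, the distribution is symmetric.
3. If SK > 0, the distribution is positively skewed (longer right tail).
4. If SK < 0, the distribution is negatively skewed (longer left tail).
5. Kurtosis (K):
6. If K = 0, the distribution has the same kurtosis as the normal distribution
(mesokurtic).
7. If K > 0, the distribution is leptokurtic (heavier tails and a sharper peak).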
Example:
Dataset of exam scores: {60, 75, 80, 85, 90, 95, 100, 105, 110, 115}. Let's
calculate the first four moments about the mean, the r-th moment about the origin
(O), the r-th moment about an arbitrary origin (A = 90), and the dimensionless forms
of skewness and kurtosis.
1. First-Order Moment about the Mean: $M_1 = \frac{\sum (X_i - \bar{X})}{10}$
2. Second-Order Moment about the Mean: $M_2 = \frac{\sum (X_i - \bar{X})^2}{10}$
3. Third-Order Moment about the Mean: $M_3 = \frac{\sum (X_i - \bar{X})^3}{10}$
4. Fourth-Order Moment about the Mean: $M_4 = \frac{\sum (X_i - \bar{X})^4}{10}$
5. r-th Moment about the Origin (O): $M_r(O) = \frac{\sum X_i^r}{10}$
6. r-th Moment about Arbitrary Origin (A = 90): $M_r(90) = \frac{\sum (X_i - 90)^r}{10}$
7. Dimensionless Skewness (SK): $SK = \frac{M_3}{M_2^{3/2}}$
8. Dimensionless Kurtosis (K): $K = \frac{M_4}{M_2^{2}} - 3$
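The eight quantities listed above can be evaluated with one small helper. This is a sketch under stated assumptions: `moment_about` is a hypothetical function introduced here, and the moments about the mean are obtained by passing the mean as the origin.

```python
scores = [60, 75, 80, 85, 90, 95, 100, 105, 110, 115]


def moment_about(values, r, a):
    """r-th moment about the point a: sum((x - a)**r) / n."""
    return sum((v - a) ** r for v in values) / len(values)


x_bar = sum(scores) / len(scores)

# Moments about the mean, orders 1 to 4.
m1, m2, m3, m4 = (moment_about(scores, r, x_bar) for r in (1, 2, 3, 4))

# First moments about the origin and about the arbitrary origin A = 90.
origin_1 = moment_about(scores, 1, 0)
about_90_1 = moment_about(scores, 1, 90)

# Dimensionless skewness and kurtosis.
sk = m3 / m2 ** 1.5
k = m4 / m2 ** 2 - 3

print(x_bar, m1, m2, m3, m4)
print(origin_1, about_90_1)
print(round(sk, 3), round(k, 3))
```

The first moment about the mean comes out as zero, the second is the variance, and the negative third moment shows that the low score of 60 pulls the distribution to the left.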
Box-and-Whisker Plot
2. Outliers: Outliers are easily identifiable, aiding in the detection of unusual data points
that might significantly impact the analysis.
3. Spread and Dispersion: The length of the box and whiskers provides insights into the
spread and dispersion of the data. A longer box and whiskers suggest greater variability.
4. Central Tendency: The position of the median within the box indicates the central
tendency. A median closer to one quartile than the other signifies skewness in the data.
Advantage:
1. Comparison: Facilitates the comparison of multiple datasets, enabling a quick
overview of their distributions.
[Figure: box-and-whisker plots with Q1 and Q2 marked]
Conclusion: In conclusion, the box-and-whisker plot is a valuable tool in data analysis, providing a clear
and concise representation of dataset characteristics. Its simplicity, ability to highlight outliers, and
effectiveness in comparing datasets make it an indispensable asset in statistical exploration. Understanding
and utilizing box-and-whisker plots enhance the interpretability of data, supporting informed decision-
making processes across various disciplines.
Stem and Leaf Plots
Id: 011221007
A stem and leaf plot uses the digits of the data values to organize a data set. The data are
placed in order from lowest to highest, so the plot shows how the data are distributed.
Each data value is split into a stem (the digit or digits to the left of the vertical line) and a
leaf (the digit or digits to the right of the vertical line). Here the stems represent the tens place.
26 45 32 27 29 30 40 36 37
Stem | leaves
2 | 6 7 9
3 | 0 2 6 7
4 | 0 5
(ii) Find least value, greatest value, mean, median, mode and range.
Sol:
Least value=26
Greatest value=45
$$\text{Mean} = \frac{\sum x}{N} = \frac{302}{9} \approx 33.56$$
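The plot and the summary values asked for in part (ii) can be produced with a short Python sketch. The function name `stem_and_leaf` is hypothetical, and stems are assumed to be the tens digit, as in the plot above.

```python
from collections import defaultdict
from statistics import mean, median

data = [26, 45, 32, 27, 29, 30, 40, 36, 37]


def stem_and_leaf(values):
    """Group sorted values by tens digit (stem); the units digit is the leaf."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)


plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

least, greatest = min(data), max(data)
print("least:", least, "greatest:", greatest, "range:", greatest - least)
print("mean:", round(mean(data), 2), "median:", median(data))
```

Every value in this data set occurs exactly once, so no single value stands out as the mode.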
The back-to-back stem and leaf plots are used to compare two distributions side-by-side. This type
of back-to-back stem and leaf plot contains three columns, each separated by a vertical line.
Example-4: The following stem and leaf diagrams show the times taken by some girls and
boys to complete a level of a computer game.
      Girls | Stem | Boys
      9 8 8 |  1   |
    6 4 3 1 |  2   | 3 5 7 8 8
  7 6 5 4 2 |  3   | 0 4 7
      3 0 0 |  4   | 0 1 2
(a) Compare the times taken to complete the level between the girls and the boys.
Sol:
For girls,
I.Q.R = Q3 − Q1 = 37 − 21 = 16
For boys,
I.Q.R = Q3 − Q1 = 40 − 27 = 13
The boys' I.Q.R. is 13 and the girls' I.Q.R. is 16. Since the boys' I.Q.R. is smaller, the boys'
times to complete the level of the computer game were more consistent.
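The interquartile ranges can be reproduced in Python. The two lists are read off the back-to-back plot, and the quartile positions are assumed to follow the (n + 1)/4 rule, which is what `statistics.quantiles` uses with its default `method="exclusive"`.

```python
from statistics import quantiles

# Times read from the back-to-back stem and leaf plot (stem = tens digit).
girls = [18, 18, 19, 21, 23, 24, 26, 32, 34, 35, 36, 37, 40, 40, 43]
boys = [23, 25, 27, 28, 28, 30, 34, 37, 40, 41, 42]


def iqr(values):
    """Return (Q1, Q3, Q3 - Q1) using the (n + 1)/4 position rule."""
    q1, _, q3 = quantiles(values, n=4)  # default method is "exclusive"
    return q1, q3, q3 - q1


girls_q1, girls_q3, girls_iqr = iqr(girls)
boys_q1, boys_q3, boys_iqr = iqr(boys)

print("girls:", girls_q1, girls_q3, girls_iqr)
print("boys: ", boys_q1, boys_q3, boys_iqr)
```

Other quartile conventions can give slightly different values, so the rule should be stated whenever I.Q.R. figures are compared.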
CORRELATION THEORY AND REGRESSION ANALYSIS
Types of Correlation
1. Positive or negative
2. Simple or multiple
3. Linear or non-linear
0 = No correlation
Application Problem-1: To control abnormally rapid heartbeats, a research physician
submerged the faces of ten small children in cold water and recorded the temperature of the
water and the reduction in pulse rate. The results are presented in the following table.
Calculate the correlation coefficient.
Temp. of water (x):           68  65  70  62  60  55  58  65  69  63
Reduction in pulse rate (y):   2   5   1  10   9  13  10   3   4   6
x       y       x^2       y^2      xy
68      2       4624        4      136
65      5       4225       25      325
70      1       4900        1       70
62     10       3844      100      620
60      9       3600       81      540
55     13       3025      169      715
58     10       3364      100      580
65      3       4225        9      195
69      4       4761       16      276
63      6       3969       36      378
Σx=635  Σy=63   Σx^2=40537  Σy^2=541  Σxy=3835
We know,
$$r_{xy} = r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}}$$
$$r = \frac{(10 \times 3835) - (635 \times 63)}{\sqrt{10 \times 40537 - 635^2}\;\sqrt{10 \times 541 - 63^2}} = \frac{-1655}{\sqrt{2145}\;\sqrt{1441}} \approx -0.94$$
The result, -0.94, indicates a strong negative correlation between the temperature of the
water and the reduction in pulse rate.
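The computation can be checked with a direct Python translation of the formula. This is a minimal sketch; the variable names are introduced for the example and the data are the ten pairs from the table.

```python
from math import sqrt

x = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]   # temperature of water
y = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]          # reduction in pulse rate

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 2))
```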
RANK CORRELATION
Rank correlation: In some situations it is difficult to measure the values of the variables of a
bivariate distribution numerically, but they can be ranked. The correlation coefficient between
the two sets of ranks is called the rank correlation coefficient, introduced by Spearman (1904).
It is denoted by R. It is useful for measuring the relationship between two qualitative variables
such as beauty, honesty, intelligence, and efficiency.
Interpretation of Rank Correlation Coefficient (R)
The value of the rank correlation coefficient R ranges from -1 to +1.
If R = +1, there is complete agreement in the order of the ranks, and the ranks are in the
same direction.
If R = -1, there is complete disagreement in the order of the ranks, and the ranks are in
opposite directions.
If R = 0, there is no correlation.
$$R = 1 - \frac{6\sum d^2}{N^3 - N}$$
A     B     R1    R2    d = R1 - R2    d^2
80    65    2     5        -3           9
75    70    3     4        -1           1
90    60    1     6        -5          25
70    75    4     3         1           1
65    85    5     1         4          16
60    80    6     2         4          16
                        Σd=0        Σd^2=68
$$R = 1 - \frac{6\sum d^2}{N^3 - N} = 1 - \frac{6 \times 68}{6^3 - 6} = -0.94$$
There is a strong negative relationship between A and B.
x     y     R1     R2    d = R1 - R2    d^2
20    30    3.5    4       -0.5         0.25
80    60    8      8        0           0
40    20    6      2        4          16
12    30    1      4       -3           9
28    50    5      7       -2           4
20    30    3.5    4       -0.5         0.25
15    40    2      6       -4          16
60    10    7      1        6          36
                                    Σd^2=81.5
$$R = 1 - \frac{6\left\{81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right\}}{8^3 - 8} = 1 - \frac{6 \times 84}{504} = 0$$
There is no correlation between x and y.
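Both rank correlation examples can be reproduced with one Python sketch. The helper names `average_ranks` and `spearman_r` are hypothetical; tied values receive the average of the ranks they occupy, and each group of m tied values adds (m³ - m)/12 to Σd², as in the second calculation. Ranks are assigned in ascending order for both variables, which gives the same R as ranking both in descending order.

```python
from collections import Counter


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    counts = Counter(values)
    first_position = {}
    for position, v in enumerate(sorted(values), start=1):
        first_position.setdefault(v, position)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def spearman_r(x, y):
    """Rank correlation with the tie correction added to the sum of d squared."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    correction = sum((m ** 3 - m) / 12
                     for values in (x, y)
                     for m in Counter(values).values() if m > 1)
    return 1 - 6 * (d2 + correction) / (n ** 3 - n)


a = [80, 75, 90, 70, 65, 60]
b = [65, 70, 60, 75, 85, 80]
print(round(spearman_r(a, b), 2))   # first example, no ties

x = [20, 80, 40, 12, 28, 20, 15, 60]
y = [30, 60, 20, 30, 50, 30, 40, 10]
print(round(spearman_r(x, y), 2))   # second example, with ties
```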
REGRESSION ANALYSIS
What is regression?
Ans: The probable movement of one variable in terms of the other variables is
called regression.
In other words, the statistical technique by which we can estimate the unknown
value of one variable (dependent) from the known value of another variable is
called regression.
Regression analysis.
Ans: Regression analysis is a mathematical measure of the average relationship
between two or more variables in terms of the original units of the data.
x       y       x^2       y^2      xy
68      2       4624        4      136
65      5       4225       25      325
70      1       4900        1       70
62     10       3844      100      620
60      9       3600       81      540
55     13       3025      169      715
58     10       3364      100      580
65      3       4225        9      195
69      4       4761       16      276
63      6       3969       36      378
Σx=635  Σy=63   Σx^2=40537  Σy^2=541  Σxy=3835
Line of x on y:
$$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{(10 \times 3835) - (635 \times 63)}{(10 \times 541) - 63^2} = \frac{-1655}{1441} \approx -1.15$$
x = a + by
$$\frac{\sum x}{n} = a + b\,\frac{\sum y}{n} \quad\Rightarrow\quad 63.5 = a - 1.1485 \times 6.3$$
a ≈ 70.74
x = 70.74 - 1.15y [regression line]
Line of y on x:
$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{(10 \times 3835) - (635 \times 63)}{(10 \times 40537) - 635^2} = \frac{-1655}{2145} \approx -0.77$$
y = a + bx
$$\frac{\sum y}{n} = a + b\,\frac{\sum x}{n} \quad\Rightarrow\quad 6.3 = a - 0.7716 \times 63.5$$
a ≈ 55.29
y = 55.29 - 0.77x [regression line]
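Both regression lines can be recomputed from the raw pairs in Python. This is a sketch with variable names introduced for the example; the intercepts are computed from the unrounded slopes, so they can differ in the second decimal place from values obtained with slopes rounded first.

```python
x = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]   # temperature of water
y = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]          # reduction in pulse rate

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y

# Regression of x on y: x = a_xy + b_xy * y
b_xy = numerator / (n * sum_y2 - sum_y ** 2)
a_xy = sum_x / n - b_xy * sum_y / n

# Regression of y on x: y = a_yx + b_yx * x
b_yx = numerator / (n * sum_x2 - sum_x ** 2)
a_yx = sum_y / n - b_yx * sum_x / n

print(f"x = {a_xy:.2f} + ({b_xy:.2f}) y")
print(f"y = {a_yx:.2f} + ({b_yx:.2f}) x")
print(round(b_xy * b_yx, 4))  # product of the slopes equals r squared
```

The product of the two slopes is about 0.886, the square of the correlation coefficient -0.94 found for the same data.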