
Report

Probability and Statistics


Fall 2023, MATH 2205

Section: E

Submitted by:

Sadat Reza Apon – 011221035


Sheikh Shakib Hossain – 011221031
Sheikh Md. Sajjad – 011221007
Sauda Binti Noor – 011221049
Abrar Zahin Arian - 011221018
Chapter 1: Graphical Representation

Graphical representation is an alternative method for analyzing numerical information. It involves using a graph, which displays statistical data visually by drawing lines or curves through points plotted on a coordinate plane.
Typically, there are four techniques for visually representing a frequency distribution: the histogram, the smoothed frequency graph, the ogive (cumulative frequency) graph, and the pie chart.
Various types of graphical representation include:
 Pie Chart
 Bar Diagram
 Histogram
 Frequency Polygon
 Cumulative Frequency Curve
 Cumulative Percentage Curve

Before discussing the graphs, we need some numerical data to represent. Here we will use the English exam marks of a total of 40 students. The list of data is given below:

171, 57, 78, 159, 44, 23, 111, 17, 46, 143, 53, 96, 102, 14, 66, 9, 22, 156, 89, 117, 60, 39, 174, 24, 77, 108, 51, 132, 19, 168, 62, 67, 198, 124, 59, 5, 71, 165, 75, 74

Pie Graph:

The "pie chart," alternatively termed the "circle graph," partitions a circular statistical illustration into segments or slices to depict numerical data. Each segment represents a proportional fraction of the whole. The pie chart proves particularly effective when examining the makeup of a whole. Pie charts often substitute for other graphical representations such as bar graphs, line plots, and histograms.
Steps taken:
1. Enter the data into the table. In this case we use intervals of 25 to build the table.
2. Add all the values in the table to get the total.
3. Divide each value by the total and then multiply by 100 to get a percentage.
4. Next, to find how many degrees each "pie sector" needs, we take a full circle of 360° and follow the calculation below:
The central angle of each component = (value of the component / sum of the values of all components) × 360°

Exam Marks    Number of Students    Percent
1 - 25        8                     (8/40) × 100 = 20
26 - 50       3                     (3/40) × 100 = 7.5
51 - 75       11                    (11/40) × 100 = 27.5
76 - 100      4                     (4/40) × 100 = 10
101 - 125     5                     (5/40) × 100 = 12.5
126 - 150     2                     (2/40) × 100 = 5
151 - 175     6                     (6/40) × 100 = 15
176 - 200     1                     (1/40) × 100 = 2.5
Bar Diagram:
A bar graph, also called a bar diagram, visually displays data using rectangular shapes. These
rectangles are evenly spaced apart and share the same width, key characteristics defining a
bar graph.

Using the following data, we can make a bar diagram.

Exam Marks    Exam Grade    Number of Students (frequency)
1 - 25        D             8
26 - 50       C             3
51 - 75       B-            11
76 - 100      B             4
101 - 125     B+            5
126 - 150     A-            2
151 - 175     A             6
176 - 200     A+            1

If we take frequency, which is the number of students to be represented on the y-axis and
the grades on the x-axis, we will get a graph that resembles the one below.
The rectangles here are called bars. Note that the bars have equal width and are equally
spaced, as mentioned above. This is a simple bar diagram.
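The same grade frequencies can be sketched as a rough text-based bar diagram, one `#` per student (illustrative only; the report's actual figure would normally be drawn with charting software):

```python
# Sketch: a text-based bar diagram of the grade frequencies above.
grades = [("D", 8), ("C", 3), ("B-", 11), ("B", 4),
          ("B+", 5), ("A-", 2), ("A", 6), ("A+", 1)]

for grade, freq in grades:
    # One row per grade: equal "width" bars, equally spaced.
    print(f"{grade:>2} | {'#' * freq} ({freq})")
```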

Histogram:
A histogram, a type of non-cumulative frequency graph, is constructed on a natural scale,
portraying frequencies of various value ranges using closely spaced vertical rectangles. This
graph facilitates the easy determination of the mode, a measure of central tendency, within
the data.
Steps to create a histogram:
1. Plot class intervals on the X-axis and their frequencies on the Y-axis using a natural
scale
2. Begin the X-axis with the lowest limit of the lowest class interval. If this limit is far
from the origin, create a break in the X-axis to indicate its displacement.
3. Draw bars aligned with the Y-axis over each class interval, using the class intervals as their bases. Ensure the areas of the rectangles represent the frequencies of their respective classes.
In this graph we take the class intervals on the X-axis and the frequencies on the Y-axis. Before plotting the graph, we must convert the classes into their exact limits.

Original Class    Frequency (f)
53.5 - 57.5       5
57.5 - 61.5       5
61.5 - 65.5       9
65.5 - 69.5       14
69.5 - 73.5       7

Frequency Polygon:

In this graph we take the class intervals (marks in English) on the X-axis and the frequencies (number of students) on the Y-axis. Before plotting the graph, we must convert the class intervals (C.I.) into their exact limits and extend one class interval at each end with a frequency of 0.
Steps to draw frequency polygon:

1. Draw the 'X' axis with class intervals marked. If the lowest score is large, create a break in the axis to adjust. Leave room for one extra class interval at each end.
2. Create the 'OY' axis vertically, marking units for class interval frequencies. Scale it to
make the highest frequency about 75% of the figure's width.
3. Plot points above midpoints of class intervals, proportional to their frequencies.
4. Connect these points with short lines to form the frequency polygon. Include extra
intervals at both ends with a frequency of zero to complete the graph.

Class Marks (x)    Frequency (f)
55.5               5
59.5               5
63.5               9
67.5               14
71.5               7

Cumulative Frequency Curve:


To plot this graph first we must convert the class intervals into their exact limits. Then we
must calculate the cumulative frequencies of the distribution.

Class Upper Boundary    CF (F)
57.5                    5
61.5                    10
65.5                    19
69.5                    33
73.5                    40

Chapter 2 : Measures of Central Tendency

Arithmetic Mean:
A value obtained by dividing the sum of all observations by the number of observations is
called arithmetic mean.
Calculating the arithmetic mean involves adding up all the values in a dataset and then
dividing that sum by the total number of values. This process gives you the average or
typical value within the dataset.
Exam Marks    Midpoint (xi)    Frequency (fi)    fixi
1 - 25        13               8                 104
26 - 50       38               3                 114
51 - 75       63               11                693
76 - 100      88               4                 352
101 - 125     113              5                 565
126 - 150     138              2                 276
151 - 175     163              6                 978
176 - 200     188              1                 188

Arithmetic mean = Σfixi / Σfi = 3270/40 = 81.75
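The grouped-mean calculation above can be verified in a few lines (a sketch using the midpoints and frequencies from the table):

```python
# Sketch: arithmetic mean of the grouped exam marks, using class
# midpoints as representative values of each interval.
midpoints = [13, 38, 63, 88, 113, 138, 163, 188]
freqs     = [ 8,  3, 11,  4,   5,   2,   6,   1]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))  # Σ fixi
n      = sum(freqs)                                    # Σ fi
mean   = sum_fx / n

print(sum_fx, n, mean)   # 3270 40 81.75
```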

The arithmetic mean is a fundamental statistical concept used for various purposes:
Central Tendency: It provides a central or representative value within a dataset, helping to
understand the "average" or typical value of a set of numbers.
Comparative Analysis: It allows for easy comparison between different sets of data, making
it useful in various fields like finance, economics, and science.
Prediction and Estimation: It's often used to predict future values or estimate missing values
within a dataset.
Basis for Further Analysis: The mean serves as a foundational statistic, often alongside other
measures like standard deviation, forming the basis for more advanced statistical analyses.
Understanding Patterns: It helps identify trends or patterns within data, aiding in decision-
making processes.

Harmonic Mean:
The harmonic mean is a type of average that is particularly useful when dealing with rates or
ratios. It's the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.
Here are the steps to calculate the harmonic mean:
 Reciprocal of Each Number: Find the reciprocal of each number in the dataset.
 Find the Mean of the Reciprocals: Calculate the arithmetic mean of these reciprocals.
 Reciprocal of the Result: Finally, take the reciprocal of the arithmetic mean obtained in
step 2 to find the harmonic mean.

Exam Marks    Midpoint (xi)    Frequency (fi)    fi/xi
1 - 25        13               8                 0.6154
26 - 50       38               3                 0.0789
51 - 75       63               11                0.1746
76 - 100      88               4                 0.0455
101 - 125     113              5                 0.0442
126 - 150     138              2                 0.0145
151 - 175     163              6                 0.0368
176 - 200     188              1                 0.0053

Harmonic Mean = n / Σ(fi/xi) = 40 / 1.0153 ≈ 39.40
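The three steps listed above (reciprocals, their weighted mean, then the reciprocal of that) can be sketched as:

```python
# Sketch: grouped harmonic mean HM = n / Σ(fi/xi), with class
# midpoints standing in for the values of each interval.
midpoints = [13, 38, 63, 88, 113, 138, 163, 188]
freqs     = [ 8,  3, 11,  4,   5,   2,   6,   1]

recip_sum = sum(f / x for f, x in zip(freqs, midpoints))  # Σ fi/xi
hm = sum(freqs) / recip_sum

print(round(recip_sum, 4), round(hm, 2))
```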

The harmonic mean offers several advantages in specific contexts:


 Dealing with Rates: It's particularly useful when dealing with rates, ratios, or
frequencies. For instance, it's beneficial in calculating average speeds, average rates
of return on investments, or average ratios.
 Balancing Extreme Values: Unlike the arithmetic mean, the harmonic mean tends to
give lower weight to extreme values. This property makes it more resistant to the
influence of outliers or extremely large/small values in a dataset.
 Accuracy in Averaging Rates: When averaging rates or ratios (like speed or
efficiency), the harmonic mean provides a more accurate representation than the
arithmetic mean. It considers the proportions within the dataset rather than just the
values themselves.
 Consistency in Relationships: In scenarios where relationships between values are
important, the harmonic mean ensures that the average retains the same relationship
as the original values. For instance, if A is to B as B is to C, then the harmonic mean
of A and C will be B.
 Weighted Averages: It's useful for calculating weighted averages when different
components have different weights, especially in fields like finance and economics.

Median:
The median is the middle value in a dataset when the values are arranged in ascending or
descending order. If there's an odd number of values, the median is the middle number. If
there's an even number of values, the median is the average of the two middle numbers.
Steps to calculate the median:
 Order the Data: Arrange the values in ascending or descending order.
 Identify the Middle Value: For an odd number of values, the median is the middle
number. For an even number, it's the average of the two middle numbers.
 Calculate the Median: Once the middle value(s) is identified that value or the
average of the two middle values is the median.

Exam Marks    Midpoint (xi)    Frequency (fi)    Cumulative frequency
1 - 25        13               8                 8
26 - 50       38               3                 11
51 - 75       63               11                22
76 - 100      88               4                 26
101 - 125     113              5                 31
126 - 150     138              2                 33
151 - 175     163              6                 39
176 - 200     188              1                 40

Median = l + (h/f)(n/2 − C)

Here n/2 = 40/2 = 20, so the median class is 51 - 75 (exact limits 50.5 - 75.5), with l = 50.5, h = 25, f = 11 and C = 11:

Median = 50.5 + (25/11)(20 − 11) ≈ 70.95
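The grouped-median formula can be sketched as a short routine that walks the cumulative frequencies to find the median class:

```python
# Sketch: grouped median via Median = l + (h/f)(n/2 - C).
# Classes use exact limits (e.g. the class 51-75 becomes 50.5-75.5).
lower_limits = [0.5, 25.5, 50.5, 75.5, 100.5, 125.5, 150.5, 175.5]
freqs        = [  8,    3,   11,    4,     5,     2,     6,     1]
h = 25.0
n = sum(freqs)

cf = 0
for l, f in zip(lower_limits, freqs):
    if cf + f >= n / 2:          # first class whose CF reaches n/2
        median = l + (h / f) * (n / 2 - cf)
        break
    cf += f

print(round(median, 2))   # ≈ 70.95
```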

Quartile, Percentile and Decile:

Exam Marks    Midpoint (xi)    Frequency (fi)    Cumulative frequency
1 - 25        13               8                 8
26 - 50       38               3                 11
51 - 75       63               11                22
76 - 100      88               4                 26
101 - 125     113              5                 31
126 - 150     138              2                 33
151 - 175     163              6                 39
176 - 200     188              1                 40

Quartile:
Quartiles divide a dataset into four equal parts. There are three quartiles: Q1, Q2 (which is
also the median), and Q3. Q1 represents the value below which 25% of the data falls, Q2 is
the median (50% of the data falls below and 50% above), and Q3 represents the value below
which 75% of the data falls.

Qi = l + (h/f)(i × n/4 − C)

For Q2: i × n/4 = 2 × 40/4 = 20, so the Q2 class is 51 - 75 (exact limits 50.5 - 75.5), with l = 50.5, h = 25, f = 11 and C = 11:

Q2 = 50.5 + (25/11)(20 − 11) ≈ 70.95

which, as expected, equals the median.

Decile:
Deciles divide a dataset into ten equal parts. There are nine deciles in a dataset: D1, D2, D3, ..., D9. D1 represents the value below which 10% of the data falls, D2 represents the value below which 20% of the data falls, and so on until D9, which represents the value below which 90% of the data falls.

Di = l + (h/f)(i × n/10 − C)

For D8: i × n/10 = 8 × 40/10 = 32, so the D8 class is 126 - 150 (exact limits 125.5 - 150.5), with l = 125.5, h = 25, f = 2 and C = 31:

D8 = 125.5 + (25/2)(32 − 31) = 138

Percentile:
Percentiles divide a dataset into hundred equal parts. A percentile is a measure indicating the
value below which a given percentage of points in a dataset fall. For example, the 25th
percentile represents the value below which 25% of the data falls.

Pi = l + (h/f)(i × n/100 − C)

For P18: i × n/100 = 18 × 40/100 = 7.2, so the P18 class is 1 - 25 (exact limits 0.5 - 25.5), with l = 0.5, h = 25, f = 8 and C = 0:

P18 = 0.5 + (25/8)(7.2 − 0) = 23
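Quartiles, deciles and percentiles all follow the same pattern, differing only in how many equal parts they split the data into; one helper covers all three (a sketch, using the grouped exam-marks table above):

```python
# Sketch: one function covering quartiles (parts=4), deciles (parts=10)
# and percentiles (parts=100) for the grouped exam marks.
LOWER = [0.5, 25.5, 50.5, 75.5, 100.5, 125.5, 150.5, 175.5]  # exact limits
FREQS = [  8,    3,   11,    4,     5,     2,     6,     1]
H = 25.0

def grouped_quantile(i, parts):
    """i-th quantile when the data is split into `parts` equal parts."""
    n = sum(FREQS)
    pos = i * n / parts           # i*n/4, i*n/10 or i*n/100
    cf = 0
    for l, f in zip(LOWER, FREQS):
        if cf + f >= pos:         # quantile class found
            return l + (H / f) * (pos - cf)
        cf += f
    raise ValueError("position beyond the data")

print(grouped_quantile(2, 4))     # Q2  ~ 70.95 (the median)
print(grouped_quantile(8, 10))    # D8  = 138.0
print(grouped_quantile(18, 100))  # P18 = 23.0
```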

Mode:
In statistics, the mode refers to the value that appears most frequently in a dataset. It's a
measure of central tendency alongside the mean and median. Unlike the mean and median,
which are concerned with the average or middle value, the mode focuses on the most
common value or values within a dataset.
For example, consider a dataset representing the number of pets owned by households in a
neighborhood:

2,1,3,2,5,2,1,4,2,3
In this dataset, the number "2" appears most frequently—it occurs four times, more than any
other number. Therefore, the mode of this dataset is "2". If there were two values tied for the
most frequent occurrence, the dataset would be described as "bimodal" (two modes). If more
than two values occurred with equal frequency and more frequently than any other values, it
could be described as "multimodal."
The mode is particularly useful when dealing with categorical or nominal data, such as
colors, types of cars, or categories of products, where identifying the most common category
can be informative.
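The pets example above can be checked with a simple frequency count; collecting every value tied for the top count also reveals bimodal or multimodal data:

```python
# Sketch: finding the mode(s) of the pets dataset with a frequency count.
from collections import Counter

pets = [2, 1, 3, 2, 5, 2, 1, 4, 2, 3]
counts = Counter(pets)
top = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == top)  # handles ties

print(modes)   # one value -> unimodal; two would mean "bimodal"
```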
What is a measure of location? What is the purpose served by it? What are its desirable
qualities?

A measure of location, also known as a measure of central tendency, is a statistic that


represents a single value that best describes the center of a dataset. The primary purpose of a
measure of location is to provide a representative or typical value around which the data is
centered.
The desirable qualities of a measure of location include:
 Representativeness: It should accurately represent the central value or tendency of
the dataset, giving a meaningful summary of the data.
 Robustness: It should be relatively unaffected by outliers or extreme values in the
dataset. A robust measure of location won't be heavily skewed by extreme
observations.
 Ease of Interpretation: A good measure of location should be easily understood and
interpreted, making it useful for conveying information about the dataset to others.
 Applicability: It should be applicable to different types of data distributions, whether
the data is normally distributed, skewed, or has other characteristics.
 Mathematical Properties: It should possess desirable mathematical properties that
allow for meaningful statistical calculations and analyses.
Common measures of location include the mean (arithmetic average), median (middle
value), mode (most frequent value), quartiles, percentiles, and deciles. Different measures
have their strengths and weaknesses, making them suitable for various scenarios based on the
nature of the dataset and the specific context of analysis. The choice of the measure of
location often depends on the characteristics of the data and the objectives of the analysis.
Name : Abrar Zahin Arian
ID: 011221018

Measures of Dispersion
In the realm of statistics, effective data presentation and analysis play a pivotal role,
particularly when exploring measures of dispersion. Measures of dispersion, such as
range and standard deviation, provide crucial insights into the variability and spread of a
dataset, offering a deeper understanding beyond central tendencies like the mean.
Accurate depiction and interpretation of dispersion are essential in making informed
decisions, identifying patterns, and drawing meaningful conclusions from data. Whether
in scientific research, business analytics, or various other domains, a comprehensive
grasp of measures of dispersion enhances the ability to assess the reliability and
consistency of data, facilitating more robust and reliable statistical inferences. As data-
driven decision-making becomes increasingly prevalent, proficiency in presenting and
analyzing measures of dispersion becomes an indispensable skill for professionals and
researchers alike, contributing to the robustness and credibility of statistical findings.

Measures of Dispersion Analysis: Daily Commute Times

We examined a dataset representing the daily commute times (in minutes) for a group of
individuals over the course of a week. Two key measures of dispersion, the range and
standard deviation, were employed to assess the variability and spread within the data.

1. Range:
o Definition: The range is the difference between the maximum and
minimum values in a dataset.
o Calculation: Range=Max−Min
o Result: For the given commute times, the range was found to be 20
minutes, indicating the span between the shortest and longest commute
durations.
2. Standard Deviation:
o Definition: Standard deviation quantifies the amount of variation or
dispersion in a set of values.


o Calculation:

σ = √( Σ (Xi − X̄)² / n ), summing i from 1 to n

 Results:
 Mean: 30 minutes
 Calculation of squared differences and summation:

σ = √( ((20−30)² + (25−30)² + (30−30)² + (35−30)² + (40−30)²) / 5 )
  = √(250/5) = √50 ≈ 7.07 minutes

 Standard Deviation: The calculated standard deviation provides a
numerical measure of the average deviation of each commute time
from the mean, offering insights into the overall variability within
the dataset.

This analysis not only highlights the spread of commute times but also provides a
foundation for informed decision-making and a deeper understanding of the dataset's
characteristics. Such clarity in measures of dispersion contributes to the robustness and
reliability of statistical insights, crucial for data-driven decision-making in various fields.
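The commute-time analysis above can be sketched in code; the five times (20, 25, 30, 35, 40 minutes) are those implied by the calculation, and the population formula for the standard deviation is used:

```python
# Sketch: range and (population) standard deviation of the commute times.
from math import sqrt

times = [20, 25, 30, 35, 40]        # commute times implied above
rng = max(times) - min(times)       # range = max - min
mean = sum(times) / len(times)
var = sum((x - mean) ** 2 for x in times) / len(times)
sd = sqrt(var)

print(rng, mean, round(sd, 2))      # 20 30.0 7.07
```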

Dispersion: “The variability (spread) that exists between the value of a data is called
dispersion”

Types of Measures of Dispersion: There are two types of measure of dispersion


I) Absolute Measure of Dispersion
II) Relative Measure of Dispersion
Absolute Measure of Dispersion: “An absolute measure of dispersion measures the
variability in terms of the same units of the data”
e.g., if the units of the data are Rs, meters, kg, etc., the units of the measure of dispersion will also be Rs, meters, kg, etc. The common absolute measures of dispersion are:
 Range
 Quartile Deviation or Semi Inter-Quartile Range
 Average Deviation or Mean Deviation
 Standard Deviation
Relative Measure of Dispersion: “A relative measure of dispersion compares the
variability of two or more data that are independent of the units of measurement”
The common relative measures of dispersion are:
 Coefficient of Dispersion or Coefficient of Range
 Coefficient of Quartile Deviation
 Coefficient of Mean Deviation
 Coefficient of Standard Deviation or Coefficient of Variation (C.V)
Coefficient of Range or Coefficient of Dispersion: The coefficient of range or coefficient
of dispersion is a relative measure of dispersion and is given by: Coefficient of Range =
(Xm - X0)/ (Xm+ X0)
Quartile Deviation or Semi-inter-quartile Range: “half of the difference between the
upper quartile and lower quartile is called the semi-inter quartile range or quartile
deviation.” i.e. Quartile deviation = (Q3-Q1)/2
Ex - Calculate quartile deviation for continuous grouped data.
Class boundaries    Midpoints (xi)    Frequency (fi)    Cumulative frequency (c.f.)
29.5 - 39.5         34.5              8                 8
39.5 - 49.5         44.5              85                93
49.5 - 59.5         54.5              184               277
59.5 - 69.5         64.5              369               646
69.5 - 79.5         74.5              210               856
79.5 - 89.5         84.5              89                945
89.5 - 99.5         94.5              24                969

Σfi = n = 969

Q1 = l + (h/f)(n/4 − c) = 49.5 + (10/184)(242.25 − 93) = 57.611
Here, l = 49.5; h = 10; f = 184; c = 93
Q3 = l + (h/f)(3n/4 − c) = 69.5 + (10/210)(726.75 − 646) = 73.345
Here, l = 69.5; h = 10; f = 210; c = 646
Quartile deviation = (Q3 − Q1)/2 = (73.345 − 57.611)/2 = 7.867 (Ans)
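The same grouped-quartile lookup can be written once and reused for Q1 and Q3 (a sketch against the table above):

```python
# Sketch: quartile deviation for the grouped data above (n = 969).
LOWER = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
FREQS = [   8,   85,  184,  369,  210,   89,   24]
H = 10.0

def quartile(k):
    n = sum(FREQS)
    pos = k * n / 4               # k*n/4-th position
    cf = 0
    for l, f in zip(LOWER, FREQS):
        if cf + f >= pos:         # quartile class found
            return l + (H / f) * (pos - cf)
        cf += f

q1, q3 = quartile(1), quartile(3)
qd = (q3 - q1) / 2
print(round(q1, 3), round(q3, 3), round(qd, 3))
```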
Mean Absolute Deviation or Mean Deviation (Average Deviation): “The arithmetic
mean of the absolute deviation from an average (mean, median etc .) is called mean
deviation or average deviation.”
                     Grouped                       Ungrouped
M.D from Mean:       M.D = Σ f|xi − x̄| / n         M.D = Σ |xi − x̄| / n
M.D from Median:     M.D = Σ f|xi − Med| / n       M.D = Σ |xi − Med| / n

(For grouped data, n = Σf.)

Calculate the mean deviation and coefficient of mean deviation from (i) the mean, (ii)
the median, in the ungrouped data case, of the following set.
Xi: 46, 33, 38, 47, 40, 37, 42, 49, 37
n = 9

Xi           Xi − x̄ (x̄ = 41)    |Xi − x̄|    |Xi − Med| (Med = 40)
33           −8                  8            7
37           −4                  4            3
37           −4                  4            3
38           −3                  3            2
40           −1                  1            0
42           1                   1            2
46           5                   5            6
47           6                   6            7
49           8                   8            9
ΣXi = 369                        Σ|Xi − x̄| = 40    Σ|Xi − Med| = 39

x̄ = ΣXi / n = 369/9 = 41

Median = ((n+1)/2)-th value of the ordered data = 5th value = 40

M.D (mean) = 40/9 ≈ 4.44 (Ans)

Coefficient of M.D (mean) = M.D / mean = 4.44/41 ≈ 0.108

M.D (Med) = 39/9 ≈ 4.33 (Ans)

Coefficient of M.D (median) = M.D / median = 4.33/40 ≈ 0.108
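A short sketch reproduces the ungrouped mean-deviation example above:

```python
# Sketch: mean deviation from the mean and from the median (ungrouped).
data = [46, 33, 38, 47, 40, 37, 42, 49, 37]
n = len(data)

mean = sum(data) / n             # 369/9 = 41
median = sorted(data)[n // 2]    # 5th of the 9 ordered values = 40

md_mean = sum(abs(x - mean) for x in data) / n
md_med  = sum(abs(x - median) for x in data) / n

print(mean, median, round(md_mean, 2), round(md_med, 2))
```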
Standard Deviation: “The positive square root of the variance is called standard
deviation.”
Coefficient of Standard Deviation OR Coefficient of Variation: The coefficient of
variation is a relative measure of dispersion and is given by:
C.V = (Standard Deviation / Mean) × 100

Calculate the Variance, Standard deviation and Coefficient of Variation from the
following weight of 60 mangoes in Continuous grouped data:
Weight       Midpoints (xi)    Frequency (fi)    fixi      fixi²
65 - 84      74.5              9                 670.5     49952.25
85 - 104     94.5              10                945       89302.50
105 - 124    114.5             17                1946.5    222874.25
125 - 144    134.5             10                1345      180902.50
145 - 164    154.5             5                 772.5     119351.25
165 - 184    174.5             4                 698       121801.00
185 - 204    194.5             5                 972.5     189151.25
             Σfi = 60                            Σfixi = 7350    Σfixi² = 973335

Variance:

S² = Σfixi²/Σfi − (Σfixi/Σfi)²
   = 973335/60 − (7350/60)²
   = 16222.25 − 15006.25
   = 1216 unit² (Ans)

Standard deviation:

S = √1216 = 34.87 unit

Coefficient of Variation:

C.V = (S.D / x̄) × 100 = (34.87/122.5) × 100 ≈ 28.47 (Ans)
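The mango-weight calculation can be verified with a sketch of the same shortcut formula S² = Σfixi²/Σfi − (Σfixi/Σfi)²:

```python
# Sketch: variance, SD and CV for the grouped mango weights.
from math import sqrt

mid   = [74.5, 94.5, 114.5, 134.5, 154.5, 174.5, 194.5]
freqs = [   9,   10,    17,    10,     5,     4,     5]

n = sum(freqs)                                       # 60 mangoes
s_fx  = sum(f * x for f, x in zip(freqs, mid))       # Σ fixi
s_fx2 = sum(f * x * x for f, x in zip(freqs, mid))   # Σ fixi²

var = s_fx2 / n - (s_fx / n) ** 2    # shortcut variance formula
sd = sqrt(var)
cv = sd / (s_fx / n) * 100           # coefficient of variation (%)

print(var, round(sd, 2), round(cv, 2))
```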
Measures of dispersion, such as range and standard deviation, have various practical
applications across different fields. Here are some key applications:

1. Risk Assessment in Finance:


o In finance, standard deviation is often used to measure the volatility of
stock prices. Higher standard deviation indicates greater price variability,
which can be seen as a measure of risk. Investors and financial analysts
use this information to assess and manage investment risk.
2. Quality Control in Manufacturing:
o Measures of dispersion are employed to assess the consistency and
reliability of manufacturing processes. For example, in the production of
goods, a low standard deviation in product dimensions indicates that
items are consistently manufactured to meet specified standards.
3. Education Assessment:
o In educational testing, measures of dispersion can be applied to evaluate
the consistency of student scores. A low standard deviation in test scores
suggests a more consistent performance across students, while a higher
standard deviation may indicate greater variability in student
performance.
4. Public Health and Epidemiology:
o In epidemiological studies, measures of dispersion help assess the
variability in health-related data, such as disease prevalence or patient
response to treatments. This information aids in understanding the range
of outcomes and planning public health interventions accordingly.
5. Market Research:
o Measures of dispersion are essential in market research to analyze
consumer preferences and behaviors. For instance, the range or standard
deviation of responses to a survey question can indicate the diversity of
opinions within a target population.
6. Project Management:
o In project management, measures of dispersion can be applied to evaluate
the variability in project timelines or costs. Understanding the range of
possible outcomes helps project managers make more accurate
predictions and set realistic expectations.
7. Climate Studies:
o Meteorologists use measures of dispersion to analyze weather data. For
example, the standard deviation of temperatures over a period can
provide insights into the variability of the climate in a specific region.

8. Sports Analytics:
o In sports, measures of dispersion are used to assess the consistency of
player performance. Coaches and analysts may use standard deviation to
evaluate how consistently a player performs over a series of games.

In all these applications, measures of dispersion contribute to a more comprehensive


understanding of data variability, helping professionals make informed decisions,
manage risks, and plan interventions or strategies tailored to the characteristics of the
data at hand.

Measures of dispersion, such as range and standard deviation, play a crucial role in
various fields. In finance, standard deviation is employed to gauge stock price volatility,
aiding investors and analysts in risk assessment. In manufacturing, these measures
ensure product consistency by assessing the reliability of processes, while in education,
they evaluate the consistency of student scores, offering insights into performance
variations. In public health, measures of dispersion assist in understanding data
variability in epidemiological studies, guiding the planning of interventions. Market
researchers use them to analyze consumer preferences, project managers apply them to
assess project variability, and meteorologists utilize them in climate studies to
understand weather data variability. Even in sports analytics, measures of dispersion
help assess the consistency of player performance. Across these diverse applications,
measures of dispersion contribute to a comprehensive understanding of data variability,
facilitating informed decision-making, risk management, and tailored intervention
strategies.
Moments
In statistics, central moments (moments about the mean) quantify the shape of a dataset by emphasizing deviations from the mean. The r-th central moment, denoted ( Mr ), is calculated as:

Mr = Σ(Xi − X̄)^r / n

where:

● ( Mr ) is the (r)-th central moment,

● ( n ) is the total number of data points,

● ( Xi ) is the (i)-th data point,

● ( X̄ ) is the mean of the dataset.

This report explores the calculation and significance of the first four central moments.

First-Order Central Moment

The first-order central moment ( r = 1 ) is given by:

M1 = Σ(Xi − X̄) / n

Because the deviations from the mean cancel out, this moment is always zero; it serves mainly as a consistency check on the calculations.

Second-Order Central Moment

The second-order central moment ( r = 2 ) is defined as:

M2 = Σ(Xi − X̄)² / n

This moment is the (population) variance: it quantifies the variability or spread of the dataset, emphasizing squared deviations from the mean.

Third-Order Central Moment

The third-order central moment ( r = 3 ) is calculated by:

M3 = Σ(Xi − X̄)³ / n

This moment captures the skewness of the dataset, indicating whether the distribution is symmetric or skewed.

Fourth-Order Central Moment

The fourth-order central moment ( r = 4 ) is expressed as:

M4 = Σ(Xi − X̄)⁴ / n

This moment provides information about the kurtosis, highlighting the tails and peakedness of the distribution.

Example: Exam Scores Dataset

Consider a larger dataset of exam scores: {60, 75, 80, 85, 90, 95, 100, 105, 110, 115}. Let's calculate the first four central moments for this dataset using the formula above (here n = 10 and X̄ = 91.5).

1. First-Order Central Moment:
M1 = ((60 − X̄) + (75 − X̄) + ... + (115 − X̄)) / 10
2. Second-Order Central Moment:
M2 = ((60 − X̄)² + (75 − X̄)² + ... + (115 − X̄)²) / 10
3. Third-Order Central Moment:
M3 = ((60 − X̄)³ + (75 − X̄)³ + ... + (115 − X̄)³) / 10
4. Fourth-Order Central Moment:
M4 = ((60 − X̄)⁴ + (75 − X̄)⁴ + ... + (115 − X̄)⁴) / 10
r-th Moment about the Origin (O)

The r-th moment about the origin is given by:

Mr(O) = Σ(Xi)^r / n

These moments describe the distribution of data points with respect to the origin,
providing insights into symmetry and concentration.

r-th Moment about Arbitrary Origin (A)

The r-th moment about an arbitrary origin A is expressed as:

Mr(A) = Σ(Xi − A)^r / n

These moments offer insights into the distribution of data points relative to the
chosen origin A, allowing for a more flexible analysis.

Dimensionless Forms: Skewness and Kurtosis

The skewness (SK) and kurtosis (K) can be expressed in dimensionless form:

SK = M3 / (M2)^(3/2)

K = M4 / (M2)² − 3
These dimensionless measures provide standardized indicators of skewness and
kurtosis, making them comparable across different datasets.

Conditions for Skewness and Kurtosis

1. Skewness (SK):
 If SK = 0, the distribution is perfectly symmetrical.
 If SK > 0, the distribution is positively skewed (tail on the right).
 If SK < 0, the distribution is negatively skewed (tail on the left).
2. Kurtosis (K):
 If K = 0, the distribution has the same kurtosis as a normal distribution (mesokurtic).
 If K > 0, the distribution is leptokurtic (heavier tails and a sharper peak).
 If K < 0, the distribution is platykurtic (lighter tails and a flatter peak).

Example:

Dataset of exam scores: {60, 75, 80, 85, 90, 95, 100, 105, 110, 115}. Let's calculate the first four central moments, the r-th moment about the origin (O), the r-th moment about an arbitrary origin (A = 90), and the dimensionless forms of skewness and kurtosis.

1. First-Order Central Moment: M1 = Σ(Xi − X̄) / 10
2. Second-Order Central Moment: M2 = Σ(Xi − X̄)² / 10
3. Third-Order Central Moment: M3 = Σ(Xi − X̄)³ / 10
4. Fourth-Order Central Moment: M4 = Σ(Xi − X̄)⁴ / 10
5. r-th Moment about the Origin (O): Mr(O) = Σ(Xi)^r / 10
6. r-th Moment about Arbitrary Origin (A = 90): Mr(90) = Σ(Xi − 90)^r / 10
7. Dimensionless Skewness (SK): SK = M3 / (M2)^(3/2)
8. Dimensionless Kurtosis (K): K = M4 / (M2)² − 3
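The whole calculation for the exam-score dataset can be sketched in one short routine:

```python
# Sketch: central moments M1..M4, dimensionless skewness and excess
# kurtosis for the exam-score dataset above.
scores = [60, 75, 80, 85, 90, 95, 100, 105, 110, 115]
n = len(scores)
mean = sum(scores) / n   # 91.5

def central_moment(r):
    return sum((x - mean) ** r for x in scores) / n

m1, m2, m3, m4 = (central_moment(r) for r in range(1, 5))
sk = m3 / m2 ** 1.5        # dimensionless skewness
k  = m4 / m2 ** 2 - 3      # excess kurtosis (0 for a normal curve)

print(m1, m2, m3, m4)
print(round(sk, 3), round(k, 3))  # negative -> left-skewed, platykurtic
```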
Box-and-Whisker Plot

Summary: The box-and-whisker plot, commonly referred to as a box plot, is a powerful


graphical tool used in statistical analysis to depict the distribution and central tendencies of a
dataset. It provides a visual summary that aids in understanding key statistical measures,
enabling researchers, analysts, and decision-makers to gain insights into the variability and
patterns within the data.

Components of a Box-and-Whisker Plot:


1. Box: Represents the interquartile range (IQR), encompassing the middle 50% of the data.
The length of the box indicates the spread of the central part of the data.
2. Whiskers: Extend from the box to the minimum and maximum values within a specified
range. Provide information about the overall range of the data.
3. Median (Q2): A line inside the box denotes the median, representing the middle value of
the dataset and dividing it into two equal halves.
4. Outliers: Individual data points beyond the whiskers are considered outliers and may be
marked separately.

Characteristics of a Box-and-Whisker Plot:


1. Symmetry and Skewness: Symmetry or skewness is evident from the position of the
box within the whiskers. Asymmetry indicates skewness in the data distribution.

2. Outliers: Outliers are easily identifiable, aiding in the detection of unusual data points
that might significantly impact the analysis.

3. Spread and Dispersion: The length of the box and whiskers provides insights into the
spread and dispersion of the data. A longer box and whiskers suggest greater variability.

4. Central Tendency: The position of the median within the box indicates the central
tendency. A median closer to one quartile than the other signifies skewness in the data.

Advantages:
1. Comparison: Facilitates the comparison of multiple datasets, enabling a quick
overview of their distributions.

2. Outlier Detection: Offers a straightforward method for identifying outliers, helping to pinpoint data points that deviate significantly from the norm.

3. Summary of Statistics: Provides a concise summary of key statistical measures, including median, quartiles, and potential outliers.

4. Visual Representation: Presents a visually intuitive representation of data, making it accessible to a broad audience.

Example: Exam Scores


Suppose you have the exam scores of two different classes, Class A and Class B, to compare
their performance. The scores are as follows:
Class A Scores: 65, 70, 72, 75, 78, 80, 82, 85, 90, 92, 95
Class B Scores: 55, 60, 68, 70, 75, 78, 82, 88, 92, 96, 98
Now, let's create a box-and-whisker plot to visualize and compare the distribution of scores
between the two classes.
1. Calculate Quartiles:
 Class A: Q1 = 72, Q2 (Median) = 80, Q3 = 90
 Class B: Q1 = 68, Q2 (Median) = 78, Q3 = 92
2. Interquartile Range (IQR):
 Class A: IQR = Q3 − Q1 = 90 − 72 = 18
 Class B: IQR = Q3 − Q1 = 92 − 68 = 24
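The quartiles here use the median-of-halves method (the median splits the sorted data, and Q1/Q3 are the medians of the two halves, excluding the middle value when n is odd); a sketch:

```python
# Sketch: quartiles by the median-of-halves method for the two classes.
def quartiles(data):
    s = sorted(data)
    n = len(s)

    def med(vals):
        m = len(vals) // 2
        return vals[m] if len(vals) % 2 else (vals[m - 1] + vals[m]) / 2

    q2 = med(s)
    lower, upper = s[:n // 2], s[(n + 1) // 2:]  # exclude median if n odd
    return med(lower), q2, med(upper)

class_a = [65, 70, 72, 75, 78, 80, 82, 85, 90, 92, 95]
class_b = [55, 60, 68, 70, 75, 78, 82, 88, 92, 96, 98]

print(quartiles(class_a))
print(quartiles(class_b))
```

Other quartile conventions (e.g. interpolated positions) can give slightly different Q1/Q3 values for the same data, which is worth stating whenever quartiles are reported.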
[Box-and-whisker plots for Class A and Class B, each marking Q1, Q2 and Q3]
Conclusion: In conclusion, the box-and-whisker plot is a valuable tool in data analysis, providing a clear
and concise representation of dataset characteristics. Its simplicity, ability to highlight outliers, and
effectiveness in comparing datasets make it an indispensable asset in statistical exploration. Understanding
and utilizing box-and-whisker plots enhance the interpretability of data, supporting informed decision-
making processes across various disciplines.
Stem and Leaf Plots

Id: 011221007

Stem and Leaf Plots:

A stem and leaf plot uses the digits of data values to organize a data set. In a stem and leaf plot, the data are placed in order from lowest to highest, and the plot shows how the data are distributed. Each data value is broken into a stem (the digit or digits to the left of the vertical line) and a leaf (the digit or digits to the right of the vertical line). Here the stems all represent the tens place and the leaves all represent the ones place.

Example-1: Draw a stem-and-leaf diagram for the following data

26 45 32 27 29 30 40 36 37

(i) Make an ordered list of the 9 values.


Sol:

Stem | leaves

2 | 6 7 9
3 | 0 2 6 7
4 | 0 5

[key: 2|6 means 26]

(ii) Find least value, greatest value, mean, median, mode and range.

Sol:

Least value=26

Greatest value=45

Mean = ΣXi / N = (26 + 45 + 32 + 27 + 29 + 30 + 40 + 36 + 37) / 9 = 302 / 9 ≈ 33.56

Median= (n+1)/2-th value= (9+1)/2=5th value =32


Mode: there is no mode.

Range = largest value − least value = 45 − 26 = 19.
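These summary statistics can be verified with a few lines of Python's standard library (a quick check, not part of the original report):

```python
from statistics import mean, median, multimode

data = [26, 45, 32, 27, 29, 30, 40, 36, 37]

avg = mean(data)               # 302 / 9
mid = median(data)             # middle (5th) ordered value
rng = max(data) - min(data)    # largest minus least value

# multimode returns every value here because each occurs exactly once,
# i.e. the data set has no mode.
modes = multimode(data)
```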

Back-to-Back Stem and Leaf Plot:

A back-to-back stem and leaf plot is used to compare two distributions side by side. It contains

three columns, each separated by a vertical line; the center column contains the stems.

Example-4: The following stem and leaf diagrams show the times taken by some girls and

boys to complete a level on a computer game.

Girls        | Stem | Boys

       9 8 8 |  1   |
     6 4 3 1 |  2   | 3 5 7 8 8
   7 6 5 4 2 |  3   | 0 4 7
       3 0 0 |  4   | 0 1 2

[key: for girls, 8|1 means 18 (leaves read right to left); for boys, 2|3 means 23]
(a) Compare the times taken to complete the level between the girls and the boys.
Sol:
For Girls,

Q1 = (1*(15+1))/4-th data value = 4th value = 21

Q2 = (2*(15+1))/4-th data value = 8th value = 32

Q3 = (3*(15+1))/4-th data value = 12th value = 37

I.Q.R = Q3 - Q1 = 37 - 21 = 16

For Boys,

Q1 = (1*(11+1))/4-th data value = 3rd value = 27

Q2 = (2*(11+1))/4-th data value = 6th value = 30

Q3 = (3*(11+1))/4-th data value = 9th value = 40

I.Q.R = Q3 - Q1 = 40 - 27 = 13

Here the boys' I.Q.R = 13 and the girls' I.Q.R = 16. Since a smaller I.Q.R means less spread, the
boys' times to complete the level on the computer game were more consistent.
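Reading the leaves back into raw times, the same comparison can be sketched in Python (an illustrative check; the two lists below are reconstructed from the stem and leaf diagram):

```python
from statistics import quantiles

# raw times read off the back-to-back stem and leaf plot
girls = [18, 18, 19, 21, 23, 24, 26, 32, 34, 35, 36, 37, 40, 40, 43]
boys = [23, 25, 27, 28, 28, 30, 34, 37, 40, 41, 42]

def iqr(data):
    # 'exclusive' uses the k*(n+1)/4 positions from the hand calculation
    q1, _, q3 = quantiles(data, n=4, method='exclusive')
    return q3 - q1

# the smaller IQR (boys) indicates the more consistent set of times
```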
CORRELATION THEORY AND REGRESSION ANALYSIS

Types of Correlation
1. Positive or negative
2. Simple or multiple
3. Linear or non-linear

Comment on Correlation Coefficient

c = 1 : Perfect positive correlation

0.7 ≤ c < 1 : Strong positive correlation

0.4 ≤ c < 0.7 : Fairly positive correlation

0 < c < 0.4 : Weak positive correlation

c = 0 : No correlation

-0.4 < c < 0 : Weak negative correlation

-0.7 < c ≤ -0.4 : Fairly negative correlation

-1 < c ≤ -0.7 : Strong negative correlation

c = -1 : Perfect negative correlation

Application Problem-1: A research physician submerged the faces of ten small children in cold
water to control abnormally rapid heartbeats, recording the water temperature and the reduction
in pulse rate for each child. The results are presented in the following table. Calculate the
correlation coefficient between the temperature of the water and the reduction in pulse rate.

Temp. of water:            68  65  70  62  60  55  58  65  69  63

Reduction in pulse rate:    2   5   1  10   9  13  10   3   4   6

x    y    x^2    y^2    xy
68 2 4624 4 136
65 5 4225 25 325
70 1 4900 1 70
62 10 3844 100 620
60 9 3600 81 540
55 13 3025 169 715
58 10 3364 100 580
65 3 4225 9 195
69 4 4761 16 276
63 6 3969 36 378
Σx=635 Σy=63 Σx^2=40537 Σy^2=541 Σxy=3835

We know,

                  n Σxy − Σx Σy
r = ----------------------------------------------
    sqrt{(n Σx^2 − (Σx)^2) * (n Σy^2 − (Σy)^2)}

            (10*3835) − (635*63)
r = -------------------------------------------------- = -0.94
    sqrt(10*40537 − (635)^2) * sqrt(10*541 − (63)^2)

The result, -0.94, indicates that the temperature of the water and the reduction in pulse rate

are strongly negatively correlated.
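The same computation in Python, using the raw-sums form of the Pearson formula from the text (a quick check, not part of the original report):

```python
from math import sqrt

x = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]  # temperature of water
y = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]         # reduction in pulse rate
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

# raw-sums form of the Pearson correlation coefficient
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
```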

RANK CORRELATION

Rank correlation: In some situations it is difficult to measure the values of the variables in a
bivariate distribution numerically, but they can be ranked. The correlation coefficient between
these two sets of ranks is called the rank correlation coefficient, given by Spearman (1904). It is
denoted by R, and it is the usual method for finding the relationship between two qualitative
variables such as beauty, honesty, intelligence, efficiency and so on.
Interpretation of Rank Correlation Coefficient (R)
The value of rank correlation coefficient, R ranges from -1 to +1
If R = +1, then there is complete agreement in the order of the ranks and the ranks are in the
same
direction
If R = -1, then there is complete agreement in the order of the ranks and the ranks are in the
opposite
direction
If R = 0, then there is no correlation

6 Σd^2
R=1 - --------
N^3-N

Application Problem-1: Obtain the rank correlation coefficient for the following data:
A 80 75 90 70 65 60
B 65 70 60 75 85 80
Sol:

A    B    R1   R2   D=R1-R2   D^2
80 65 2 5 -3 9
75 70 3 4 -1 1
90 60 1 6 -5 25
70 75 4 3 1 1
65 85 5 1 4 16
60 80 6 2 4 16
Σd=0 Σd^2=68

6 Σd^2
R=1 - --------
N^3-N

6 *68
R=1 - -------- = -0.94
6^3-6
Strongly negative relation between A and B.
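A short Python sketch of the same calculation, ranking from highest to lowest as in the table (this simple ranking assumes no tied values):

```python
def ranks(values):
    # rank 1 = largest value; assumes all values are distinct
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

a = [80, 75, 90, 70, 65, 60]
b = [65, 70, 60, 75, 85, 80]

n = len(a)
d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))

# Spearman's rank correlation coefficient
R = 1 - 6 * d2 / (n**3 - n)
```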

Application Problem-2: Obtain the rank correlation coefficient for the following data, which
contain tied values:

x    y    R1   R2   D=R1-R2   D^2
20 30 3.5 4 -0.5 0.25
80 60 8 8 0 0
40 20 6 2 4 16
12 30 1 4 -3 9
28 50 5 7 -2 4
20 30 3.5 4 -0.5 0.25
15 40 2 6 -4 16
60 10 7 1 6 36
Σ d^2=81.5

Since x contains a tie of 2 equal values (20, 20) and y a tie of 3 equal values (30, 30, 30), the
correction term (m^3 - m)/12 is added to Σd^2 for each tie of m values:

        6{81.5 + (2^3-2)/12 + (3^3-3)/12}       6 * 84
R = 1 - ---------------------------------- = 1 - ------ = 0
                    8^3 - 8                       504

No correlation between x and y.
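The tie-corrected calculation can be sketched in Python using average (mid) ranks (an illustrative check, not part of the original report):

```python
from collections import Counter

def avg_ranks(values):
    # smallest value gets rank 1; tied values share their average rank
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1           # first position of v
        count = ordered.count(v)               # how many times v occurs
        ranks.append(first + (count - 1) / 2)  # mid-rank of the tie group
    return ranks

def tie_term(values):
    # correction (m^3 - m)/12 for each group of m tied values
    return sum((m**3 - m) / 12 for m in Counter(values).values() if m > 1)

x = [20, 80, 40, 12, 28, 20, 15, 60]
y = [30, 60, 20, 30, 50, 30, 40, 10]

n = len(x)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(avg_ranks(x), avg_ranks(y)))
R = 1 - 6 * (d2 + tie_term(x) + tie_term(y)) / (n**3 - n)
```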
REGRESSION ANALYSIS
What is regression?
Ans: The probable movement of one variable in terms of the other variables is called regression.
In other words, regression is the statistical technique by which we can estimate the unknown
value of one (dependent) variable from the known value of another variable.

Regression analysis.
Ans: Regression analysis is a mathematical measure of the average relationship between two or
more variables, expressed in terms of the original units of the data.

Example-1: Find the regression line of x on y and of y on x for the following data.

X y X^2 Y^2 xy
68 2 4624 4 136
65 5 4225 25 325
70 1 4900 1 70
62 10 3844 100 620
60 9 3600 81 540
55 13 3025 169 715
58 10 3364 100 580
65 3 4225 9 195
69 4 4761 16 276
63 6 3969 36 378
Σx=635 Σy=63 Σx^2=40537 Σy^2=541 Σxy=3835
Line x on y:

         n Σxy - Σx Σy
b(xy) = -----------------
        n Σy^2 - (Σy)^2

        (10*3835) - (635*63)
b(xy) = ---------------------- = -1.15
          (10*541) - (63)^2

x = a + by
Σx/n = a + b*(Σy/n), so a = 63.5 + 1.15*6.3 = 70.75
x = 70.75 - 1.15y [regression line of x on y]

Line y on x:

         n Σxy - Σx Σy
b(yx) = -----------------
        n Σx^2 - (Σx)^2

         (10*3835) - (635*63)
b(yx) = ------------------------ = -0.77
         (10*40537) - (635)^2

y = a + bx
Σy/n = a + b*(Σx/n), so a = 6.3 + 0.77*63.5 = 55.20
y = 55.20 - 0.77x [regression line of y on x]
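Both slopes and intercepts can be computed in Python from the column sums (a quick check; the intercepts here use full-precision slopes, so they differ in the second decimal from a hand calculation done with rounded slopes):

```python
x = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]
y = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy          # shared numerator of both slopes

b_xy = num / (n * syy - sy**2)   # slope of x on y
a_xy = sx / n - b_xy * (sy / n)  # intercept: x = a_xy + b_xy * y

b_yx = num / (n * sxx - sx**2)   # slope of y on x
a_yx = sy / n - b_yx * (sx / n)  # intercept: y = a_yx + b_yx * x
```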
