Frequency Distribution & Data Visualisation

The document provides an overview of quantitative data, detailing the differences between continuous and discrete variables, and explains how to create frequency distribution tables for both types. It includes examples of how to present data through frequency tables and cumulative frequency distribution tables, as well as guidelines for visualizing data using graphs like pie charts and bar charts. The document emphasizes the importance of clear representation and understanding of data in statistical analysis.

Uploaded by

Nitin Patel
Copyright
© © All Rights Reserved

Data exploration

Quantitative Data or Numeric variables


A numeric variable (also called a quantitative variable) is a quantifiable
characteristic whose values are numbers (except numbers that are codes
standing in for categories). Numeric variables may be either continuous or
discrete.
Continuous variables
A variable is said to be continuous if it can assume an infinite number of real
values within a given interval. For instance, consider the height of a student. The
height cannot take just any value: it cannot be negative and it cannot be higher
than three metres. But between 0 and 3, the number of possible values is
theoretically infinite. A student may be 1.6321748755… metres tall. In practice,
the methods used and the accuracy of the measurement instrument restrict the
precision of the variable. The reported height would be rounded to the nearest
centimetre, so it would be 1.63 metres. Age is another example of a continuous
variable, one that is typically rounded down.
Discrete variables
As opposed to a continuous variable, a discrete variable can assume only a finite
number of real values within a given interval. An example of a discrete variable
would be the score given by a judge to a gymnast in competition: the range is
0 to 10 and the score is always given to one decimal (e.g. a score of 8.5). You can
enumerate all possible values (0, 0.1, 0.2, …) and see that the number of possible
values is finite: it is 101. Another example of a discrete variable is the number of
people in a household, for households of 20 people or fewer. The number of
possible values is 20, because a household cannot include a non-integer number
of people such as 2.27.
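The counting argument above can be checked with a short sketch. It assumes scores run from 0.0 to 10.0 in steps of 0.1 and household sizes from 1 to 20; the variable names are illustrative only.

```python
# Enumerate every score a judge can give: 0.0 to 10.0 in steps of 0.1.
# Whole tenths are divided by 10 to avoid drift from repeated float addition.
possible_scores = [tenths / 10 for tenths in range(0, 101)]

# Household sizes, assuming at least 1 and at most 20 people.
possible_household_sizes = list(range(1, 21))

print(len(possible_scores))           # 101
print(len(possible_household_sizes))  # 20
```

Both lists are finite and can be written out in full, which is what makes the variables discrete.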
Frequency Distribution
The frequency (f) of a particular value is the number of times the value occurs in
the data. The distribution of a variable is the pattern of frequencies, meaning the
set of all possible values and the frequencies associated with these values.
Frequency distributions are portrayed as frequency tables or charts.
Frequency distributions can show either the actual number of observations
falling in each range or the percentage of observations. In the latter instance, the
distribution is called a relative frequency distribution.
Frequency distribution tables can be used for both categorical and numeric
variables. Continuous variables should only be used with class intervals.

Example
A survey was taken on Maple Avenue. In each of 20 homes, people were asked how
many cars were registered to their households. The results were recorded as
follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Use the following steps to present this data in a frequency distribution table.

1. Divide the results (x) into intervals, and then count the number of results in
each interval. In this case, the intervals would be the number of households
with no car (0), one car (1), two cars (2) and so forth.
2. Make a table with separate columns for the interval numbers (the number of
cars per household), the tallied results, and the frequency of results in each
interval. Label these columns Number of cars, Tally and Frequency.
3. Read the list of data from left to right and place a tally mark in the
appropriate row. For example, the first result is a 1, so place a tally mark in
the row beside where 1 appears in the interval column (Number of cars).
The next result is a 2, so place a tally mark in the row beside the 2, and so
on. When you reach your fifth tally mark, draw a tally line through the
preceding four marks to make your final frequency calculations easier to
read.
4. Add up the number of tally marks in each row and record them in the final
column entitled Frequency.
5. The relative frequency of a particular observation or class interval is
found by dividing the frequency (f) by the number of observations (n):
Relative frequency = frequency ÷ number of observations = f ÷ n
6. The percentage frequency is found by multiplying each relative frequency
value by 100:
Percentage frequency = relative frequency × 100 = (f ÷ n) × 100
Your frequency distribution table for this exercise should look like this:

Table 1
Frequency table for the number of cars registered in each household

Number of   Frequency   Relative frequency     Cumulative     Cumulative
cars (x)    (f)         (%)                    frequency      percentage (%)
0           4           4/20 = 0.20 or 20%     4              20
1           6           6/20 = 0.30 or 30%     4 + 6 = 10     20 + 30 = 50
2           5           5/20 = 0.25 or 25%     10 + 5 = 15    50 + 25 = 75
3           3           3/20 = 0.15 or 15%     15 + 3 = 18    75 + 15 = 90
4           2           2/20 = 0.10 or 10%     18 + 2 = 20    90 + 10 = 100
Total       20          1.00 or 100%

By looking at this frequency distribution table quickly, we can see that out of
20 households surveyed, 4 households had no cars, 6 households had 1 car, etc.
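
The same table can be produced programmatically. The following is a minimal sketch in Python, assuming the survey results are held in a list; the function name frequency_table is illustrative and not part of the original exercise.

```python
from collections import Counter

# Survey results: number of cars registered in each of the 20 households.
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

def frequency_table(values):
    """Return rows of (value, f, relative f, cumulative f, cumulative %)."""
    n = len(values)
    counts = Counter(values)      # tally step: value -> frequency
    rows = []
    cumulative = 0
    for value in sorted(counts):
        f = counts[value]
        cumulative += f           # running total of frequencies
        rows.append((value, f, f / n, cumulative, 100 * cumulative / n))
    return rows

for row in frequency_table(cars):
    print(row)
```

Each printed row corresponds to one row of Table 1: the frequencies are 4, 6, 5, 3 and 2, and the cumulative frequency reaches 20 in the last row.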

Table-I A: Age of MBA-I(A)

Age (X) in years   Tally bars   f
20                              8
21                              11
22                              8
23                              9
Total                           36 (N)
Source: Authors’ own compilation
Class intervals
If a variable takes a large number of values, then it is easier to present
and handle the data by grouping the values into class intervals.
Continuous variables are more likely to be presented in class intervals,
while discrete variables can be grouped into class intervals or not.
To illustrate, suppose we set out age ranges for a study of young people,
while allowing for the possibility that some older people may also fall into
the scope of our study.

At a recent chess tournament, all 10 of the participants had to fill out a form
that gave their names, address and age. The ages of the participants were
recorded as follows:
36, 48, 54, 92, 57, 63, 66, 76, 66, 80
Use the following steps to present these data in a cumulative frequency
distribution table.

1. Divide the results into intervals, and then count the number of
results in each interval. In this case, intervals of 10 are appropriate.
Since 36 is the lowest age and 92 is the highest age, start the
intervals at 35 to 44 and end the intervals with 85 to 94.
2. Create a table similar to the frequency distribution table but with
three extra columns.
o In the first column or the Lower value column, list the lower
value of the result intervals. For example, in the first row, you
would put the number 35.
o The next column is the Upper value column. Place the upper
value of the result intervals. For example, you would put the
number 44 in the first row.
o The third column is the Frequency column. Record the number
of times a result appears between the lower and upper values.
In the first row, place the number 1.
o The fourth column is the Cumulative frequency column. Here
we add the cumulative frequency of the previous row to the
frequency of the current row. Since this is the first row, the
cumulative frequency is the same as the frequency. However, in
the second row, the frequency for the 35–44 interval (i.e. 1) is
added to the frequency for the 45–54 interval (i.e. 2). Thus, the
cumulative frequency is 3, meaning we have 3 participants in
the 35 to 54 age group.
1 + 2 = 3

o The next column is the Percentage column. In this column, list
the percentage of the frequency. To do this, divide the
frequency by the total number of results and multiply by 100. In
this case, the frequency of the first row is 1 and the total
number of results is 10. The percentage would then be 10.0.
(1 ÷ 10) × 100 = 10.0
o The final column is Cumulative percentage. In this column,
divide the cumulative frequency by the total number of results
and then, to make a percentage, multiply by 100. Note that the
last number in this column should always equal 100.0. In this
example, the cumulative frequency is 1 and the total number of
results is 10, therefore the cumulative percentage of the first
row is 10.0.
(1 ÷ 10) × 100 = 10.0
3. The cumulative frequency distribution table should look like this:

Table 2
Ages of participants at a chess tournament

Lower   Upper   Frequency   Cumulative   Percentage   Cumulative
value   value   (f)         frequency                 percentage
35      44      1           1            10.0         10.0
45      54      2           3            20.0         30.0
55      64      2           5            20.0         50.0
65      74      2           7            20.0         70.0
75      84      2           9            20.0         90.0
85      94      1           10           10.0         100.0
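
As a cross-check, the cumulative table can be computed with a short sketch. It assumes whole-number ages and inclusive intervals of width 10 starting at 35, as in the steps above; the function name cumulative_table and its parameters are illustrative.

```python
# Ages of the 10 chess tournament participants.
ages = [36, 48, 54, 92, 57, 63, 66, 76, 66, 80]

def cumulative_table(values, start, width, n_classes):
    """Return rows of (lower, upper, f, cumulative f, %, cumulative %)."""
    n = len(values)
    rows = []
    cumulative = 0
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1    # inclusive upper value for whole numbers
        f = sum(1 for v in values if lower <= v <= upper)
        cumulative += f
        rows.append((lower, upper, f, cumulative,
                     100 * f / n, 100 * cumulative / n))
    return rows

for row in cumulative_table(ages, start=35, width=10, n_classes=6):
    print(row)
```

The last row should always show a cumulative frequency equal to the number of results and a cumulative percentage of 100.0.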

The frequency of a class interval is the number of observations that occur
in a particular predefined interval. So, for example, if 20 people aged 5 to 9
appear in our study's data, the frequency for the 5–9 interval is 20.
The endpoints of a class interval are the lowest and highest values that a
variable can take. So, the intervals in our study are 0 to 4 years,
5 to 9 years, 10 to 14 years, 15 to 19 years, 20 to 24 years, and 25 years and
over. The endpoints of the first interval are 0 and 4 if the variable is
discrete, and 0 and 4.999 if the variable is continuous. The endpoints of the
other class intervals would be determined in the same way.
Class interval width is the difference between the lower endpoint of an
interval and the lower endpoint of the next interval. Thus, if our study's
continuous intervals are 0 to 4, 5 to 9, etc., the width of the first
five intervals is 5, and the last interval is open, since no higher endpoint is
assigned to it. The intervals could also be written as 0 to less than 5, 5 to
less than 10, 10 to less than 15, 15 to less than 20, 20 to less than 25, and
25 and over.
Rules for data sets that contain a large number of observations
In summary, follow these basic rules when constructing a frequency
distribution table for a data set that contains a large number of
observations:
• find the lowest and highest values of the variable,
• decide on the width of the class intervals,
• include all possible values of the variable.
In deciding on the width of the class intervals, you will have to find a
compromise between having intervals short enough so that not all of the
observations fall in the same interval, but long enough so that you do not
end up with only one observation per interval.
It is also important to make sure that the class intervals are mutually
exclusive and collectively exhaustive.
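
One way to check both properties is to confirm that every observation falls into exactly one interval. The sketch below is only an illustration: it assumes half-open intervals of the form "lower to less than upper", and it starts the first interval at a multiple of the width, which differs from the 35 to 44 start used in Table 2. The helper names build_intervals and count_matches are hypothetical.

```python
import math

def build_intervals(lowest, highest, width):
    """Half-open intervals [lower, upper) covering lowest..highest."""
    lower = math.floor(lowest / width) * width   # round start down to a multiple of width
    intervals = []
    while lower <= highest:
        intervals.append((lower, lower + width))
        lower += width
    return intervals

def count_matches(value, intervals):
    """Number of intervals that contain the value (should be exactly 1)."""
    return sum(1 for lower, upper in intervals if lower <= value < upper)

ages = [36, 48, 54, 92, 57, 63, 66, 76, 66, 80]
intervals = build_intervals(min(ages), max(ages), width=10)

# Mutually exclusive: no value matches more than one interval.
# Collectively exhaustive: no value matches zero intervals.
print(all(count_matches(age, intervals) == 1 for age in ages))
```

Because each upper endpoint is excluded, a boundary value such as 40 belongs to the 40 to less than 50 interval only.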

Exclusive vs Inclusive Series


Frequency distribution can be visualized using:

Knowing how to convey information graphically is important in presenting
statistics. The following is a list of general rules to keep in mind when
preparing graphs.
A good graph:
• accurately shows the facts,
• grabs the reader’s attention,
• complements or demonstrates arguments presented in the text,
• has a title and labels,
• is simple and uncluttered,
• shows data without altering the message of the data,
• clearly shows any trends or differences in the data,
• is visually accurate (i.e. if one chart value is 15 and another 30, then
30 should appear to be twice the size of 15).
Why use graphs to present data?
Because they…
• are quick and direct,
• highlight the most important facts,
• facilitate understanding of the data,
• can convince readers,
• can be easily remembered.
There are many different types of graphs that can be used to convey
information, including:
• pie charts (nominal variable)
• bar charts (nominal or ordinal variable)
• pictographs
• line charts (ordinal or discrete variable)
• scatterplots (two continuous variables)
• histograms (continuous variable)
A pie chart, sometimes called a circle chart, is a way of summarizing a set
of nominal data or displaying the different values of a given variable (e.g.
percentage distribution). This type of chart is a circle divided into a series
of segments. Each segment represents a particular category. The area of
each segment is the same proportion of a circle as the category is of the
total data set.

A pie chart usually shows the component parts of a whole. It is constructed by
converting the share of each component into the same proportion of 360 degrees.

Sometimes you will see a segment of the drawing separated from the rest of the
pie in order to emphasize an important piece of information. This is called an
exploded pie chart. Figure 1.1 is an example of an exploded pie chart.

Figure 1.1: Should we discontinue 8.30 AM Class

Precautions:
1. Use no more than 6 categories.
2. When drawing a pie chart, ensure that the segments are ordered by
size (largest to smallest) and in a clockwise direction.
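
The conversion of shares into angles can be sketched as follows. The vote counts are hypothetical (the data behind Figure 1.1 are not given here), and the function name pie_angles is illustrative.

```python
def pie_angles(counts):
    """Return (label, angle in degrees) pairs, largest segment first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Each segment takes the same proportion of 360 degrees as its share of the total.
    return [(label, 360 * value / total) for label, value in ordered]

# Hypothetical responses to a yes/no question.
votes = {"No": 12, "Yes": 18, "Undecided": 6}

for label, angle in pie_angles(votes):
    print(label, angle)
```

The angles always add up to 360 degrees, and sorting the segments from largest to smallest follows the second precaution above.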
Bar chart
A bar chart may be either horizontal or vertical. The important point to note
about bar charts is their bar length or height—the greater their length or
height, the greater their value. Bar charts are one of the many techniques
used to present data in a visual form so that the reader may readily
recognize patterns or trends.
Bar charts usually present categorical variables, discrete variables or
continuous variables grouped in class intervals. They consist of an axis and
a series of labelled horizontal or vertical bars. The bars depict frequencies
of different values of a variable or simply the different values themselves.
The numbers on the y-axis of a vertical bar chart or the x-axis of a
horizontal bar chart are called the scale.
When developing bar charts manually, draw a vertical or horizontal bar for
each category or value. The height or length of the bar will represent the
number of units or observations in that category (frequency) or simply the
value of the variable. Select an arbitrary but consistent width for each bar
as well.

Vertical bar charts


Bar charts should be used when you are showing segments of information.
Vertical bar charts are useful to compare different categorical or discrete
variables, such as age groups, classes, schools, etc., as long as there are not
too many categories to compare. They are also very useful for time series
data. The space for labels on the x-axis is small, but ideal for years, minutes,
hours or months. For example, Chart 2 below shows the number of police
officers in Crimeville for each year from 2011 to 2019.
[Chart 2: vertical bar chart of the number of police officers in Crimeville for each year from 2011 to 2019; y-axis scale from 0 to 70]

Grouped bar charts

The grouped bar chart is another effective means of comparing sets of data
about the same places or items. It gives two or more pieces of information
for each item on the x-axis instead of just one as in Chart 2. This allows you
to make direct comparisons on the same chart by age group, gender or
anything else you wish to compare. However, if a grouped bar chart has too
many series of data, the chart becomes cluttered and it can be confusing to
read.

Chart 3, a grouped vertical bar chart, compares two series of data: the
numbers of boys and girls that have a smartphone at CUHP from 2012 to
2019. The blue bar represents the number of boys, and the red bar
represents the number of girls.

Year   Number of boys   Number of girls
2012   110              85
2013   185              175
2014   240              225
2015   285              295
2016   305              280
2017   310              315
2018   315              305
2019   315              320

[Chart 3: grouped vertical bar chart, "The numbers of boys and girls that have a smartphone", 2012 to 2019; series: Number of boys, Number of girls; y-axis scale from 0 to 350]

Horizontal bar charts

One disadvantage of vertical bar charts, however, is that they lack space for
text labelling at the foot of each bar. When category labels in the chart are
too long, you might find a horizontal bar chart better for displaying
information, like the example in Chart 4.

Sport        Percentage of boys (%)   Percentage of girls (%)
Athletics    17                       17
Baseball     24                       17
Basketball   40                       25
Football     40                       2
Soccer       20                       17
Swimming     9                        12
Tennis       8                        8
Volleyball   10                       23
Wrestling    10                       5

[Chart 4: horizontal grouped bar chart of the percentage of boys and girls taking part in each sport; series: Percentage of girls (%), Percentage of boys (%)]
Stacked bar charts
There are several other types of bar chart that you may encounter.
The population pyramid is a special application of a grouped bar chart.
Another useful type of bar chart is the stacked bar chart.
The stacked bar chart is a preliminary data analysis tool used to show
segments of totals. The stacked bar chart can be very difficult to analyze if
too many items are in each stack. It can contrast values, but not necessarily
in the simplest manner.
In Chart 5, it is easy to analyze the data presented since there are only
three items in each stack: swimming, running and biking. It is easy to see at
a glance what percentage of time each woman spent on an event. Had this
been a chart representing a decathlon (with 10 events) the data would have
been significantly harder to analyze.
Name       Percentage of time    Percentage of time   Percentage of time
           spent swimming (%)    spent cycling (%)    spent running (%)
Averi      13                    50                   37
Bronwyn    32                    53                   15
Hillary    21                    28                   51
Jessa      41                    14                   45
Megan      9                     81                   10
Mercedes   28                    47                   25
Rosalyn    32                    40                   28
Tiiu       38                    24                   38
[Chart 5: horizontal stacked bar chart, one bar per athlete, x-axis from 0% to 100%; series: Percentage of time spent swimming (%), Percentage of time spent cycling (%), Percentage of time spent running (%)]
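
A stacked bar is drawn by placing each segment where the previous one ends. The sketch below shows how the segment boundaries could be computed for two of the athletes; the function name stacked_segments is illustrative.

```python
def stacked_segments(parts):
    """Return (start, end) positions of each segment along the bar."""
    segments = []
    start = 0
    for part in parts:
        segments.append((start, start + part))
        start += part              # the next segment begins where this one ends
    return segments

# Percentage of time spent swimming, cycling and running.
times = {"Averi": (13, 50, 37), "Megan": (9, 81, 10)}

for name, parts in times.items():
    print(name, stacked_segments(parts))
```

For Averi the segments run from 0 to 13, 13 to 63 and 63 to 100, so the bar fills the whole 0% to 100% axis.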

Advice for building bar charts

You should keep the following guidelines in mind when creating bar charts:
• Make bars and columns wider than the space between them.
• Use a single font type on a chart. Try to maintain a consistent font style
from chart to chart in a single presentation or document.
• Order your shade pattern from darkest to lightest.
• Avoid using a combination of red and green in the same display.
Pictograph
A pictograph uses picture symbols to illustrate statistical information. It is
often more difficult to visualize data precisely with a pictograph. This is why
pictographs should be used carefully to avoid misrepresenting data either
accidentally or deliberately.
Chart 6 shows a scale that represents the number of elementary students
who prefer chocolate chip cookies. This type of pictograph shows how a
symbol can be used to represent data. One cookie symbol represents two
students, and a half-cookie symbol is used to represent one student. These
data could easily have been presented in a bar chart using a scale to
present the figure rather than a symbol.

Data table for Chart 6

Group   Number of cookies
1       2
2       5
3       2.5
4       1
5       6
6       2
7       4
8       1.5
Note: Each cookie represents two students
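
The scale of the pictograph can be applied in both directions, as in the sketch below. The text symbols "O" (whole cookie) and "c" (half cookie) and the function names are illustrative stand-ins for the picture symbols.

```python
STUDENTS_PER_COOKIE = 2

# Number of cookie symbols shown for each group in the data table.
cookies_by_group = {1: 2, 2: 5, 3: 2.5, 4: 1, 5: 6, 6: 2, 7: 4, 8: 1.5}

def students(cookies):
    """Convert a number of cookie symbols into a number of students."""
    return int(cookies * STUDENTS_PER_COOKIE)

def symbols(student_count):
    """Draw a row of symbols: 'O' is a whole cookie, 'c' a half cookie."""
    whole, half = divmod(student_count, STUDENTS_PER_COOKIE)
    return "O" * whole + ("c" if half else "")

for group, cookies in cookies_by_group.items():
    count = students(cookies)
    print(group, count, symbols(count))
```

Group 3, for instance, shows 2.5 cookies, which stands for 5 students.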
Line charts
Line charts, especially useful in the fields of statistics and science, are
among the most popular graphs because their visual characteristics reveal
data trends clearly and these charts are easy to create.
A line chart is a visual comparison of how two variables—shown on the x-
and y-axes—are related or vary with each other. It shows related
information by drawing a continuous line between all the points on a grid.
Line charts compare two variables: one is plotted along the x-axis
(horizontal) and the other along the y-axis (vertical). The y-axis in a line
chart usually indicates quantity (e.g. dollars, litres) or percentage, while the
horizontal x-axis often measures units of time. As a result, the line chart is
often viewed as a time series graph. For example, if you wanted to graph
the height of a baseball pitch over time, you could measure the time
variable along the x-axis, and the height along the y-axis. Although they do
not present specific data as well as tables do, line charts are able to show
relationships more clearly than tables do. Line charts can also depict
multiple series and hence are usually the best candidate for time series data
and frequency distribution.
Vertical bar charts and line charts share a similar purpose. The vertical bar
chart, however, reveals a change in magnitude, whereas the line chart is
used to show a change in direction.
In summary, line charts:
• show specific values of data well,
• reveal trends and relationships between data,
• compare trends in different groups.
Graphs can give a distorted image of the data. If the scales on the axes of a
line graph force data to appear a certain way, then a graph can even suggest a
trend that is entirely different from the one intended. This happens when
the intervals between adjacent points along the axis are dissimilar, or
when the same data charted in two graphs using different scales appear
different.
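
The effect of the scale can be quantified with a small sketch. It uses the monthly student counts tabulated in this section and two y-axis ranges chosen purely for illustration; the function name share_of_axis is hypothetical.

```python
# Number of students from January to July.
students = [250, 250, 255, 260, 280, 290, 315]

def share_of_axis(values, axis_min, axis_max):
    """Fraction of the y-axis height covered by the rise from first to last value."""
    return (values[-1] - values[0]) / (axis_max - axis_min)

full_scale = share_of_axis(students, 0, 350)     # axis starts at zero
cropped = share_of_axis(students, 240, 320)      # axis cropped around the data

print(round(full_scale, 2), round(cropped, 2))
```

The same rise of 65 students covers under a fifth of the axis when the scale starts at zero, but about four fifths of a cropped axis, so the second chart would look far steeper.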

Month      Number of students
January    250
February   250
March      255
April      260
May        280
June       290
July       315

[Line chart: Number of students by month, January to July; y-axis scale from 0 to 350]
Age & donation ($)

Age   Average donation ($)
15    36
16    52
17    83
18    100
19    110

[Chart: Average donation ($) by age, 15 to 19; y-axis scale from 0 to 120]

Data table: Cellophane User

Year   Number of men   Number of women   Total number
       (thousands)     (thousands)       (thousands)
2012   150.0           147.5             297.5
2013   165.0           157.5             322.5
2014   177.5           160.0             337.5
2015   155.0           177.5             332.5
2016   162.5           182.5             345.0
2017   175.0           180.0             355.0
2018   195.0           187.5             382.5

[Chart: Number of men (thousands) and Number of women (thousands) by year, 2012 to 2018; y-axis scale from 0 to 250]
The scatter plot
In science, the scatter plot is widely used to present measurements of two
or more related variables. It is particularly useful when the values of the
variables of the y-axis are thought to be dependent upon the values of the
variable of the x-axis.
In a scatter plot, the data points are plotted but not joined. The resulting
pattern indicates the type and strength of the relationship between two or
more variables. Chart 5.6.1 is an example of a scatter plot. Car ownership
increases as the household income increases, showing that there is a
positive relationship between these two variables.
The pattern of the data points on the scatter plot reveals the relationship
between the variables. Scatter plots can illustrate various patterns and
relationships, such as:

• a linear or non-linear relationship,
• a positive (direct) or negative (inverse) relationship,
• the concentration or spread of data points,
• the presence of outliers.

Income ($)   Percentage of car owners (%)
20,000       60
30,000       55
40,000       75
50,000       85
60,000       82
70,000       97
80,000       87
90,000       90
100,000      95

[Scatter plot: Income on the x-axis, Car Owner percentage on the y-axis; y-axis scale from 0 to 120]
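
The direction and strength of the relationship can also be summarized with a correlation coefficient. The text does not name a measure; the sketch below assumes Pearson's r, computed from its definition, and the function name pearson_r is illustrative.

```python
import math

income = [20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]
car_owners = [60, 55, 75, 85, 82, 97, 87, 90, 95]   # percentage of car owners

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(income, car_owners)
print(r > 0)   # a positive value indicates a positive (direct) relationship
```

A value close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.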
The histogram

The histogram is a popular graphing tool. It is used to summarize discrete
or continuous data that are measured on an interval scale. It is often used to
illustrate the major features of the distribution of the data in a convenient
form. It is also useful when dealing with large data sets (greater than
100 observations). It can help detect any unusual observations (outliers) or
any gaps in the data.

A histogram divides up the range of possible values in a data set into
classes or groups. For each group, a rectangle is constructed with a base
length equal to the range of values in that specific group and a height equal
to the number of observations falling into that group. A histogram has an
appearance similar to a vertical bar chart, but there are no gaps between
the bars. Generally, a histogram will have bars of equal width. Chart 7 is an
example of a histogram that shows the distribution of salary, a continuous
variable, of the employees of a company.

Salary (in thousands of $)   Number of employees
0–10                         50
11–20                        300
21–30                        250
31–40                        400
41–50                        550
51–60                        433
61–70                        266
71–80                        350
81–90                        100
91+                          20

Differences between bar chart and histogram


Comparison terms     Bar chart                       Histogram
Usage                To compare different            To display the distribution of a
                     categories of data.             variable.
Type of variable     Categorical variables           Numeric variables
Rendering            Each data point is rendered     The data points are grouped and
                     as a separate bar.              rendered based on the bin value. The
                                                     entire range of data values is divided
                                                     into a series of non-overlapping
                                                     intervals.
Space between bars   Can have space.                 No space.
Reordering bars      Can be reordered.               Cannot be reordered.
