ST1009 Week2

Uploaded by

Anon son

4/17/2020

ST1009

EXPLORATORY DATA ANALYSIS

Week 2
1

FROM LAST WEEK …..

IN DESCRIPTIVE STATISTICS…

• Collect data
  - e.g. Survey
• Organise and Present data
  - e.g. Tables and graphs
• Analyse data
  - e.g. Sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
3

COLLECT DATA
• Primary: Data Collection
  - Observation
  - Survey
  - Experimentation
• Secondary: Data Compilation
  - Print or Electronic
4

ORGANISING AND PRESENTING DATA

• Raw data provide little information to the decision makers.
• We need to convert the raw data into useful information.
• Here, we will concentrate on some of the frequently used methods of presenting and organizing data.

WAYS OF ORGANIZING & PRESENTING DATA

• There are two methods of presenting data:
  - Tabular form: presenting information in a simple table is useful in finding the distribution of values for any given characteristic.
  - Pictorial form: the data are presented using diagrams, charts or graphs.

TABULAR FORM
ONE WAY TABLE – FREQUENCY TABLE
No. of students admitted to the University in an academic year (2019/2020), according to district.

District    No. of students admitted
Colombo
Gampaha

Total

RELATIVE FREQUENCY TABLE

• Raw frequency counts on their own are often not very helpful.
• With qualitative variables, the relative frequency of occurrence, found by dividing the frequency of a given category by the total number of observations, can be more informative (proportions!).
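Written out as a formula, the rule above looks as follows. The symbols $f_i$ (frequency of category $i$), $k$ (number of categories) and $n$ (total number of observations) are notation introduced here for illustration; they do not appear on the slides.

```latex
\[
\text{relative frequency}_i = \frac{f_i}{n},
\qquad n = \sum_{j=1}^{k} f_j,
\qquad \sum_{i=1}^{k} \frac{f_i}{n} = 1 .
\]
```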

TABULAR FORM
CROSS TABULATIONS – TWO-WAY TABLE
• Distribution of ethnicity in different age groups
• Distribution of gender of the employees in different employment categories

Gender * Employment Category Crosstabulation (Count)

                Employment Category
Gender    Clerical  Custodial  Manager  Total
Female         206          0       10    216
Male           157         27       74    258
Total          363         27       84    474
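The margins of the two-way table above can be recomputed from its cell counts. The following is a minimal Python sketch, not part of the original slides; the function names are hypothetical, and the row percentages are an extra step added for illustration.

```python
# Cell counts copied from the Gender * Employment Category table above.
counts = {
    "Female": {"Clerical": 206, "Custodial": 0, "Manager": 10},
    "Male": {"Clerical": 157, "Custodial": 27, "Manager": 74},
}


def row_totals(table):
    """Total count in each row (here: each gender)."""
    return {row: sum(cells.values()) for row, cells in table.items()}


def column_totals(table):
    """Total count in each column (here: each employment category)."""
    totals = {}
    for cells in table.values():
        for column, value in cells.items():
            totals[column] = totals.get(column, 0) + value
    return totals


def row_percentages(table):
    """Each cell as a percentage of its row total, rounded to 1 decimal."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {c: round(100 * v / total, 1) for c, v in cells.items()}
    return result


rows = row_totals(counts)
columns = column_totals(counts)
grand_total = sum(rows.values())
percentages = row_percentages(counts)

print(rows, columns, grand_total)
print(percentages)
```

Row percentages make the two genders comparable even though the row totals (216 and 258) differ.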
9

PICTURE FORMS

• Pictorial forms do not supply any additional information.
• They illustrate the important facts more clearly.
• Examples are:
  - Bar charts
  - Line charts
  - Scatter plots
  - Pie charts
10

BAR CHARTS
• The bar chart makes comparisons by means of parallel bars whose lengths are proportional to the values represented.
• There are many types of bar charts:
  - Simple bar charts
  - Component bar charts
  - Multiple bar charts
11

SIMPLE BAR CHART


• Example: Age group of visitors arriving at a wildlife park
[Figure: simple bar chart of Percentage (axis 0.0 to 40.0) by Age groups: Children, Young adults, Middle Aged adults, Senior citizens]
12

COMPONENT BAR CHART


• This is the same as the simple bar chart, except that the bars are subdivided into their component parts.
[Figure: component bar chart "Mode of Travel" with Local and Foreign components; Percentage (axis 0.0 to 80.0) by Mode of travel: Private vehicle, Rented vehicle, Tourist coach, Hired Jeep, Other]
13

PERCENTAGE COMPONENT BAR CHART


This chart is used when changes in the relative sizes of the component figures need to be shown.

14

MULTIPLE BAR CHARTS


• Here, component figures are shown as separate bars adjacent to each other.
[Figure: multiple bar chart "Mode of Travel" with separate Local and Foreign bars; Percentage (axis 0.0 to 45.0) by mode of travel: Private vehicle, Rented vehicle, Tourist coach, Hired Jeep, Other]
15

LINE CHART
This is particularly useful for emphasizing the changes in a variable over an interval of time.
[Figure: line chart "No. of students selected to Universities by Year of AL Exam"; Number of students (axis 0 to 16,000) by Year, 1993 to 2003]
16

PIE CHART
• This is useful in showing how a total divides into its component parts.
• Component parts are expressed as percentages of the total and are represented by segments of a circle whose sizes are proportional to the percentages.

Total export figures in 1993
Type          Rs. ('000)        %
Agriculture        14554    57.99
Industrial          8821    35.15
Mineral             1132     4.51
Other                589     2.35
Total              25096   100.00
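The percentage column of the export table, and the angle each pie segment would occupy, can be recomputed from the raw figures. This is a minimal sketch; the function name is hypothetical, and the use of 360 degrees per full circle for the segment angles is an addition not stated on the slide.

```python
# Export figures (Rs. '000) copied from the 1993 table above.
exports = {"Agriculture": 14554, "Industrial": 8821, "Mineral": 1132, "Other": 589}


def pie_segments(values):
    """Return {label: (percentage of total, segment angle in degrees)}."""
    total = sum(values.values())
    return {
        label: (round(100 * value / total, 2), round(360 * value / total, 1))
        for label, value in values.items()
    }


total = sum(exports.values())
segments = pie_segments(exports)

for label, (percent, angle) in segments.items():
    print(f"{label:<12}{percent:>7.2f}%{angle:>8.1f} degrees")
```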
17

PIE CHART CTD…

Total export figures in 2000
Type          (In '000)         %
Agriculture        14554    57.99
Industrial          8821    35.15
Mineral             1132     4.51
Other                589     2.35
Total              25096   100.00

[Figure: pie chart "Total export figures in 1993": Agriculture 58%, Industrial 35%, Mineral 5%, Other 2%]

18

SCATTER DIAGRAM
• Scatter diagrams are used to examine the relationship between two continuous variables.
• Example: Student marks for two subjects
[Figure: scatter plot "Relationship between subject-1 & subject-2"; subject-1 (axis 0 to 120) against subject-2 (axis 0 to 100)]
19

SCATTER DIAGRAM CTD…

• Using a scatter plot, you may observe how the values of one variable change with the values of another. We may observe:
  - an increasing trend
  - a decreasing trend
  - nonlinear trends
20

IMPORTANT!

• We have discussed many ways of presenting data, in both tabular and pictorial form.
• The way of presenting data has to be chosen according to the type of data: qualitative or quantitative.
• One-way and two-way tables, bar charts and pie charts can be used to present qualitative data.
• Line charts and scatter plots are for quantitative data.
21

EXAMPLES…

• A firm produces 10,000 flash-cubes every day. Since sales are adversely affected by defective cubes, it is necessary to set up a quality control procedure. A random sample of 150 flash-cubes is chosen, and 6 are found to be defective. Summarize the data in a suitable manner (tabular + graphical).
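One possible tabular summary of this example is a frequency table with relative frequencies. The sketch below is only one way to do it; the category labels and the function name are assumptions made for illustration.

```python
sample_size = 150  # flash-cubes in the random sample (from the example)
defective = 6      # defective cubes found in the sample (from the example)


def summarise(sample_size, defective):
    """Return {category: (count, percentage of the sample)}."""
    counts = {
        "Defective": defective,
        "Non-defective": sample_size - defective,
    }
    return {
        label: (count, round(100 * count / sample_size, 1))
        for label, count in counts.items()
    }


summary = summarise(sample_size, defective)

print(f"{'Category':<14}{'Count':>6}{'Percent':>9}")
for label, (count, percent) in summary.items():
    print(f"{label:<14}{count:>6}{percent:>8.1f}%")
print(f"{'Total':<14}{sample_size:>6}{100.0:>8.1f}%")
```

Because the variable is qualitative (defective or not), a bar chart or pie chart of these two percentages would be a matching graphical summary.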
22

EXAMPLE…
• The following are the ethnicities of the 200 persons in the sample:

Ethnicity   No. of persons
Sinhalese              155
Tamil                   25
Muslims                 15
Other                    5

Present these data graphically!
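As a rough illustration, the counts can be turned into relative frequencies and drawn as a simple text bar chart. This is a sketch only: the helper names and the scale of one `#` per two percentage points are arbitrary choices, and a plotting library would normally be used instead.

```python
# Counts copied from the example above.
ethnicity_counts = {"Sinhalese": 155, "Tamil": 25, "Muslims": 15, "Other": 5}


def relative_frequencies(counts):
    """Percentage of the total that falls in each category."""
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}


def text_bar_chart(percentages, chars_per_percent=0.5):
    """One line per category; bar length is proportional to the percentage."""
    lines = []
    for label, percent in percentages.items():
        bar = "#" * round(percent * chars_per_percent)
        lines.append(f"{label:<10} {bar} {percent:.1f}%")
    return lines


percentages = relative_frequencies(ethnicity_counts)
chart = text_bar_chart(percentages)
print("\n".join(chart))
```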

23

FREQUENCY DISTRIBUTION

• The easiest method of organizing data is a frequency distribution, which converts raw data into meaningful groups (or classes) of data.
• It shows the frequency of occurrence in each group.
• There are some steps that are usually followed in constructing a frequency distribution.
24

STEPS…

• Specify the number of class intervals.
  - There is no accepted rule for the number of groups.
  - Between 5 and 15 class intervals are generally recommended.
  - Classes must be selected so that an item can only fall into one class (classes cannot overlap).
25

STEPS…

• When all intervals have the same width, the following rule may be used to find the required class interval width:

W = (L - S) / K

where:
W = Class width, L = Largest value,
S = Smallest value, K = No. of classes
26

EXAMPLE
• Suppose the ages of a sample of 10 students are:

20.9, 18.1, 18.5, 21.3, 19.4, 25.3, 22.0, 23.1, 23.9, and 22.5

We select K = 4, so W = (25.3 - 18.1)/4 = 1.8, which is rounded up to 2.

The frequency table is constructed in the following way.
27

EXAMPLE
Class Interval   Frequency   Rel. Freq.
18 ≤ x < 20          3          30%
20 ≤ x < 22          2          20%
22 ≤ x < 24          4          40%
24 ≤ x < 26          1          10%
Total               10         100%

Note that the relative frequencies must add up to 1.00 or 100%. Here, we see that 40% of the students are at least 22 years old but younger than 24.
28
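The table above can be reproduced step by step from the ten ages. The sketch below applies W = (L - S) / K with K = 4; starting the first class at 18 (the smallest value rounded down) is an assumption read off the table, and the function name is hypothetical.

```python
import math

# Ages of the 10 students from the example.
ages = [20.9, 18.1, 18.5, 21.3, 19.4, 25.3, 22.0, 23.1, 23.9, 22.5]

k = 4                                           # chosen number of classes
width = math.ceil((max(ages) - min(ages)) / k)  # 1.8 rounded up to 2
start = math.floor(min(ages))                   # assumed lower limit: 18


def frequency_table(data, start, width, k):
    """Rows of (lower, upper, frequency, relative frequency in %).

    Each class is lower <= x < upper, so no value can fall in two classes.
    """
    rows = []
    for i in range(k):
        lower = start + i * width
        upper = lower + width
        frequency = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, frequency, 100 * frequency / len(data)))
    return rows


table = frequency_table(ages, start, width, k)
for lower, upper, frequency, percent in table:
    print(f"{lower} <= x < {upper}: {frequency:>2} {percent:5.0f}%")
```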

EXAMPLE

• The frequency distribution tells us:
  - how the observations cluster around a central value
  - the degree of difference between observations.

For example, in the above problem we know that no student is younger than 18 and that ages below 24 are most typical. The most common age group is 22 to 24.
29

CLASS LIMITS

• True classes are classes such that the upper true limit of one class is the same as the lower true limit of the next class.
• A true class limit is obtained by adding the upper class limit of one class interval to the lower class limit of the next higher class interval and dividing by two.
30

• The class mark is the mid-point of the class interval and is obtained by adding the lower and upper class limits and dividing by two.

Example:
Stated Limits           True Limits
Rs. 600 ≤ x ≤ 799       Rs. 599.50 ≤ x ≤ 799.50
Rs. 800 ≤ x ≤ 999       Rs. 799.50 ≤ x ≤ 999.50

Class mark of the first class = (600 + 799) / 2 = 699.5
31
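The arithmetic behind this example can be written out as follows. It is a worked illustration of the two definitions, using the stated limits of the first two classes.

```latex
\[
\text{true limit between the two classes} = \frac{799 + 800}{2} = 799.5,
\qquad
\text{class mark of the first class} = \frac{600 + 799}{2} = \frac{599.5 + 799.5}{2} = 699.5 .
\]
```

The class mark is the same whether it is computed from the stated limits or from the true limits.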

CUMULATIVE FREQUENCY
• The total frequency of all the values less than the upper class boundary of a given class interval is called the cumulative frequency up to and including that class interval.
• Cumulative frequency of a class interval = frequency of the class interval + frequencies of the preceding class intervals.
• The cumulative frequencies for the previous problem (Slide 28) are: 3, 5, 9, and 10.
32
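The running total described above is a one-line computation. A minimal sketch using Python's standard `itertools.accumulate`, applied to the class frequencies 3, 2, 4, 1 of the student-age example:

```python
from itertools import accumulate

# Class frequencies from the student-age frequency table.
frequencies = [3, 2, 4, 1]

# Each entry is the frequency of a class plus all preceding frequencies.
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.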

HISTOGRAM

• A histogram is a graphical representation of a frequency distribution. It consists of a set of rectangles having:
  - bases on the horizontal axis (X-axis), with centers at the class marks and lengths equal to the class widths
  - areas proportional to the class frequencies.
• If the class widths are equal for all classes, the heights of the rectangles are proportional to the class frequencies.
33

HISTOGRAM

• When class widths are unequal, the heights of the rectangles must be adjusted so that the areas of the rectangles are proportional to the frequencies. For this, frequency density is used instead of frequency:

frequency density = frequency / class width

34

EXAMPLE
Class Intervals   Mid Point   Width   U.C.B.   Freq.   Freq. Density   Cumul. Freq.
126-136              131        10     136       3         0.3              3
136-146              141        10     146       5         0.5              8
146-156              151        10     156       5         0.5             13
156-166              161        10     166       5         0.5             18
166-176              171        10     176       2         0.2             20

[Figure: histogram with equal class widths; Frequency (axis 0 to 6) against Weight of the students]
35

EXAMPLE; UNEQUAL WIDTH!


Mass     Mid Point   Width   U.C.B.   Freq.   Freq. Density   Cumul. Freq.
6-8          7         3       8.5      4         1.33             4
9-11        10         3      11.5      6         2.00            10
12-17       14.5       6      17.5     10         1.67            20
18-20       19         3      20.5      3         1.00            23
21-29       25         9      29.5     12         1.33            35
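The derived columns of this table follow from the stated class limits and frequencies. The sketch below assumes the masses are recorded to the nearest whole unit, so each class boundary lies half a unit beyond the stated limit (which is what the U.C.B. column implies); the function name and the `gap` parameter are hypothetical.

```python
# (lower stated limit, upper stated limit, frequency) from the table above.
classes = [(6, 8, 4), (9, 11, 6), (12, 17, 10), (18, 20, 3), (21, 29, 12)]


def density_rows(classes, gap=1):
    """Compute mid point, width, upper class boundary and frequency density.

    `gap` is the distance between one class's upper stated limit and the
    next class's lower stated limit (1 here), so boundaries sit gap/2 away.
    """
    rows = []
    for lower, upper, frequency in classes:
        lower_boundary = lower - gap / 2
        upper_boundary = upper + gap / 2
        width = upper_boundary - lower_boundary
        rows.append({
            "mid": (lower + upper) / 2,
            "width": width,
            "ucb": upper_boundary,
            "density": round(frequency / width, 2),  # freq. / class width
        })
    return rows


rows = density_rows(classes)
for (lower, upper, frequency), row in zip(classes, rows):
    print(f"{lower}-{upper}: width {row['width']}, density {row['density']:.2f}")
```

Note that the class 12-17 has the second-largest frequency (10) but not the second-tallest rectangle, because its width is twice that of its neighbours.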

36

FREQUENCY POLYGON

• A frequency polygon is a figure used to represent a frequency distribution.
• To construct a freq. polygon, plot the freq. density against the class mid point of each interval. Then join the points with straight lines.
• This provides a useful way of comparing two or more data sets.
37

EXAMPLE

[Figure: frequency polygon; Frequency Density (axis 0 to 0.6) against Class mid points (121, 131, 141, 151, 161, 171, 181)]
38
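The points joined in such a polygon can be listed directly from the equal-width weight table (mid points 131 to 171). In this sketch the polygon is closed with zero-density points one class width beyond each end, at 121 and 181, matching the axis range of the figure; the function name is hypothetical.

```python
# Mid points and frequency densities from the equal-width weight table.
mid_points = [131, 141, 151, 161, 171]
densities = [0.3, 0.5, 0.5, 0.5, 0.2]
class_width = 10


def polygon_points(mids, heights, width):
    """(x, y) vertices of the frequency polygon, closed on the X-axis."""
    xs = [mids[0] - width] + list(mids) + [mids[-1] + width]
    ys = [0.0] + list(heights) + [0.0]
    return list(zip(xs, ys))


points = polygon_points(mid_points, densities, class_width)
print(points)
```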

CUMULATIVE FREQUENCY POLYGON


• A cumulative frequency polygon (ogive) is a chart of a cumulative frequency distribution.
• To construct an ogive, plot the cumulative frequencies of the freq. distribution (instead of the absolute frequencies) on the Y-axis against the upper class boundaries on the X-axis.
39

OGIVE

[Figure: ogive; Cumulative frequency (axis 0 to 25) against Upper class boundaries (126, 136, 146, 156, 166, 176)]
40
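The plotted points of this ogive can be generated from the weight table. A minimal sketch: the starting point (126, 0) at the lowest class boundary is taken from the axis of the figure, and the function name is hypothetical.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from the equal-width weight table.
lowest_boundary = 126
upper_boundaries = [136, 146, 156, 166, 176]
frequencies = [3, 5, 5, 5, 2]


def ogive_points(lowest_boundary, upper_boundaries, frequencies):
    """(upper class boundary, cumulative frequency) pairs, starting at zero."""
    cumulative = accumulate(frequencies)
    return [(lowest_boundary, 0)] + list(zip(upper_boundaries, cumulative))


points = ogive_points(lowest_boundary, upper_boundaries, frequencies)
print(points)
```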

FREQUENCY CURVE
• When the no. of intervals gets large, the freq. polygon consists of a large no. of line segments and approaches a smooth curve known as a freq. curve, i.e. the freq. curve is obtained by smoothing the freq. polygon.
• The freq. curve is useful for getting some idea of the shape of the freq. distribution.
41

SMOOTHING OF DISTRIBUTION

42

SHAPE OF THE FREQUENCY CURVE

Two aspects of the shape:


1. Skewness is a measure of the lack of symmetry (asymmetry).
• A distribution is symmetric if it looks the same to the left and right of the center point.

43

[Figure: two frequency curves, one positively skewed and one negatively skewed]

• When a frequency curve is asymmetric, one tail of the curve is longer than the other. If it has a long right tail, then it is positively skewed.
• If it has a long left tail, then it is negatively skewed.
44

SHAPE OF THE FREQUENCY CURVE

2. Kurtosis:
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
• Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
• Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
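The slides do not give formulas for these two measures. One common choice, assumed here, is the moment-based coefficients: skewness as the third central moment divided by the 1.5th power of the variance, and excess kurtosis as the fourth central moment divided by the squared variance, minus 3 (so that a normal distribution scores 0). The two small data lists are made-up illustrative values, not data from the slides.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Moment coefficient of skewness: > 0 long right tail, < 0 long left tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment coefficient of kurtosis minus 3: < 0 flatter than a normal curve."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


# Illustrative values only (assumed, not taken from the slides).
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric list gives a skewness of zero, while the list with one large value gives a positive skewness, in line with the "long right tail" description.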

45

EXAMPLE

• A department store has its own credit card accounts. The credit department randomly selects 40 accounts and records the number of days within which the bill is paid:

16   9   5   8   6  10  16   4  11   4
 3  19  21  16  15  24  45  11   8  19
37  59  14  72   3  22  10   6  14  11
20   9  16   6  75  21   7  15  12  10
46

EXAMPLE

• Using a width of 10, set up a frequency distribution and a relative frequency distribution (using percentages).
• Construct a histogram, using frequency on the vertical axis.
• Currently no finance charge is levied on overdue accounts. The credit department decides to charge interest on accounts that are more than 30 days late. What percentage of these accounts will be affected?
• The credit department decides to terminate all accounts that are more than 60 days overdue. What percentage of these accounts will be terminated?
47
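One way to work through this exercise is sketched below. Two assumptions are made that the slides leave open: the classes start at 0 (0 ≤ x < 10, 10 ≤ x < 20, and so on), and an account counts as "more than 30 (or 60) days late" when its recorded number of days exceeds that limit. The function names are hypothetical.

```python
# Number of days within which each of the 40 sampled bills was paid.
days = [
    16, 9, 5, 8, 6, 10, 16, 4, 11, 4,
    3, 19, 21, 16, 15, 24, 45, 11, 8, 19,
    37, 59, 14, 72, 3, 22, 10, 6, 14, 11,
    20, 9, 16, 6, 75, 21, 7, 15, 12, 10,
]
WIDTH = 10


def frequency_distribution(data, width):
    """Rows of (lower, upper, frequency, percentage) for lower <= x < upper."""
    n_classes = max(data) // width + 1
    counts = [0] * n_classes
    for x in data:
        counts[x // width] += 1  # class index, assuming classes start at 0
    return [
        (i * width, (i + 1) * width, count, 100 * count / len(data))
        for i, count in enumerate(counts)
    ]


def percent_over(data, limit):
    """Percentage of observations strictly greater than `limit`."""
    return 100 * sum(1 for x in data if x > limit) / len(data)


distribution = frequency_distribution(days, WIDTH)
for lower, upper, count, percent in distribution:
    print(f"{lower:>2} <= x < {upper:>2}: {count:>2} {percent:5.1f}%")

print("Charged interest (> 30 days):", percent_over(days, 30), "%")
print("Terminated (> 60 days):", percent_over(days, 60), "%")
```

Most accounts fall in the first two classes and a few stretch out past 50 days, so the histogram has a long right tail, which is the positively skewed shape described earlier.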
