CS3352 Foundations of Data Science - Unit II
Syllabus
Types of Data - Types of Variables - Describing Data with Tables and Graphs -
Describing Data with Averages - Describing Variability - Normal Distributions
and Standard (z) Scores.
Types of Data
• Data is a collection of facts and figures that convey something specific but are not organized in any way. It can be numbers, words, measurements, observations or even just descriptions of things. We can say that data is the raw material in the production of information.
• A data set is a collection of data objects and their attributes. An attribute captures a basic characteristic of an object.
• Each row of a data set is called a record. Each data set also has multiple attributes, each of which gives information on a specific characteristic.
• Data can broadly be divided into the following two types: qualitative data and quantitative data.
Qualitative data:
• Qualitative data is data concerned with descriptions, which can be observed but
cannot be computed. Qualitative data is also called categorical data. Qualitative
data can be further subdivided into two types as follows:
1. Nominal data
2. Ordinal data
Quantitative data:
• Quantitative data is data that focuses on numbers and mathematical calculations and can be counted, measured and computed.
• There are two types of quantitative data: interval data and ratio data.
1. Advantages of qualitative data:
• Qualitative data helps market researchers understand the mindset of their customers.
• It avoids pre-judgments.
2. Disadvantages of qualitative data:
• It is time consuming to collect and analyse.
Ranked Data
• Ranked data is data whose values come from an ordered set and are recorded in order of magnitude. Ranked data is also called ordinal data.
• Along with the information provided by the nominal scale, ordinal scales give the rankings of those variables.
• Surveyors can quickly analyse the degree of agreement concerning the identified order of variables.
• Examples: finishing positions in a race (1st, 2nd, 3rd) and survey responses on an agreement scale (strongly disagree to strongly agree).
Scale of Measurement
• There are four different scales of measurement, and any data can be classified as belonging to one of them. The four scales are: nominal, ordinal, interval and ratio.
Nominal
• Nominal data is the first level of measurement, in which numbers serve only as "tags" or "labels" to classify or identify objects.
• Nominal data usually deals with non-numeric variables, or with numbers that do not carry any quantitative value. While developing statistical models, nominal data are usually transformed (encoded) before building the model.
• The numbers do not describe the characteristics of the objects; the only permissible operation on numbers in the nominal scale is counting.
• Examples: gender, blood group and eye colour.
Interval
a) Interval data is quantitative, as it can quantify the difference between values.
b) To find the difference between two variables, you can subtract one value from the other.
c) The interval scale is a preferred scale in statistics, as it allows numerical values to be assigned to arbitrary assessments such as feelings, calendar dates, etc. However, an interval scale has no true zero point, so ratios of interval values are not meaningful.
• Examples:
1. Celsius temperature
2. Fahrenheit temperature
Ratio
• Any variable for which ratios can be computed and are meaningful is called ratio data.
• Ratio data has unique and useful properties. One such property is that it allows unit conversion, for example between kilograms and grams or between calories and kilocalories.
Example 2.1.1: Indicate whether each of the following terms is qualitative, ranked or quantitative:
(c) age
(f) temperature
(i) IQ score
(j) gender
Solution: (c) age - quantitative; (f) temperature - quantitative; (i) IQ score - quantitative; (j) gender - qualitative.
Types of Variables
• A variable is a characteristic or property that can take on different values. Variables may be discrete or continuous.
Discrete variables:
• The word discrete means countable. For example, the number of students in a class is countable, or discrete. The value could be 2, 24, 34 or 135 students, but it cannot be 23½ or 12.23 students.
• The number of pages in a book is a discrete variable. Discrete data can only take on certain individual values.
Continuous variables:
• A continuous variable is one that can take any value within a given interval or range. A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
• Continuous data can take on any value in a certain range. The length of a file is a continuous variable.
Approximate Numbers
• Whenever values are rounded off, as is always the case with actual values for continuous variables, the resulting numbers are approximate, never exact. In contrast, numbers such as 2, 4 and 9 are exact, as they do not need any approximation.
Independent and Dependent Variables
• The two main variables in an experiment are the independent and the dependent variable. An experiment is a study in which the investigator decides who receives the special treatment.
1. Independent variables
• The independent variable is the one that the researcher intentionally changes or
controls.
2. Dependent variables
• The dependent variable is the factor that the researcher measures. It changes in response to the independent variable, or depends upon it.
Observational Study
• In an observational study, the researcher observes subjects and measures variables of interest without assigning any treatment. These studies are often qualitative in nature and can be used for both exploratory and explanatory research purposes. While quantitative observational studies exist, they are less common.
• Observational studies are generally used in hard science, medical and social science fields, often because ethical or practical concerns prevent the researcher from conducting a traditional experiment. However, the lack of control and treatment groups means that forming inferences is difficult, and there is a risk of confounding variables impacting the analysis.
Confounding Variable
• Confounding variables are those that affect other variables in a way that produces
spurious or distorted associations between two variables. They confound the "true"
relationship between two variables. Confounding refers to differences in outcomes
that occur because of differences in the baseline risks of the comparison groups.
• For example, if we have an association between two variables (X and Y) and that
association is due entirely to the fact that both X and Y are affected by a third
variable (Z), then we would say that the association between X and Y is spurious
and that it is a result of the effect of a confounding variable (Z).
• A difference between groups might be due not to the independent variable but to
a confounding variable.
• Consider, for example, research with the objective of finding out whether alcohol drinkers have more heart disease than non-drinkers. The observed association could be influenced by another factor: alcohol drinkers might, for instance, smoke more cigarettes than non-drinkers, so cigarette smoking acts as a confounding variable in the study of the association between drinking alcohol and heart disease.
• For example, suppose a researcher collects data on ice cream sales and shark
attacks and finds that the two variables are highly correlated. Does this mean that
increased ice cream sales cause more shark attacks? That's unlikely. The more
likely cause is the confounding variable temperature. When it is warmer outside,
more people buy ice cream and more people go in the ocean.
Frequency Distribution
• A frequency distribution shows how observations are sorted into classes, together with the number of observations (the frequency) in each class. For example, to find the frequency distribution of quantitative data we can construct a table giving information about "the number of smartphones owned per family."
Ungrouped data:
• When observations are sorted into classes of single values, the result is referred to as a frequency distribution for ungrouped data. It is the representation of ungrouped data and is typically used when we have a smaller data set.
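• As a small computational sketch (in Python), a frequency distribution for ungrouped data can be produced by counting how often each distinct value occurs. The smartphone counts below are assumed illustration values, not the survey data referred to above.

from collections import Counter

# Assumed illustration data: number of smartphones owned per family
smartphones = [1, 2, 1, 3, 2, 2, 1, 4, 2, 3, 1, 2]

# Frequency distribution for ungrouped data: one class per distinct value
freq = Counter(smartphones)

print("Value  Frequency")
for value in sorted(freq):
    print(f"{value:5d}  {freq[value]:9d}")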
Grouped data:
• Grouped data refers to data which is bundled together in different classes or categories.
• Data are grouped when the variable stretches over a wide range and there are a large number of observations, so that listing every value separately would consume a lot of time. Hence, it is pertinent to group the values into classes called class intervals.
• Suppose we conduct a survey in which we ask 15 families how many pets they have in their home. The results are as follows (a grouping sketch is given after the guideline below):
1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8
• Classes should be set up so that they do not overlap and so that each piece of data belongs to exactly one class.
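• The grouping sketch promised above: the 15 pet counts are bundled into non-overlapping class intervals of width 2 (the choice of intervals here is an assumption made for illustration).

pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]
intervals = [(1, 2), (3, 4), (5, 6), (7, 8)]   # non-overlapping classes

print("Class  Frequency")
for low, high in intervals:
    count = sum(low <= x <= high for x in pets)   # each value falls in exactly one class
    print(f"{low}-{high}  {count:5d}")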
• The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled
boundary.
• Table 2.3.4 gives a frequency distribution of the IQ test scores for 75 adults.
• If the lower class limit for the second class, 95, is added to the upper class limit for the first class, 94, and the sum divided by 2, the upper boundary for the first class and the lower boundary for the second class is determined. Table 2.3.5 gives all the boundaries for Table 2.3.4.
• If the lower class limit is added to the upper class limit for any class and the sum
divided by 2, the class mark for that class is obtained. The class mark for a class is
the midpoint of the class and is sometimes called the class midpoint rather than the
class mark.
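• The boundary and class-mark rules above can be expressed as a short Python sketch. Table 2.3.4 is not reproduced here, so the intervals below are assumed; only the 94/95 limits mentioned above are taken from the text.

classes = [(90, 94), (95, 99), (100, 104), (105, 109)]   # assumed tabled class limits
unit = 1   # scores recorded to the nearest whole number

print("Limits    Boundaries    Class mark")
for lower, upper in classes:
    lower_boundary = lower - unit / 2   # half a unit below the lower tabled limit
    upper_boundary = upper + unit / 2   # half a unit above the upper tabled limit
    class_mark = (lower + upper) / 2    # midpoint of the class
    print(f"{lower}-{upper}   {lower_boundary}-{upper_boundary}   {class_mark}")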
Example 2.3.1: Following table gives the frequency distribution for the
cholesterol values of 45 patients in a cardiac rehabilitation study. Give the
lower and upper class limits and boundaries as well as the class marks for
each class.
• Solution: The table below gives the limits, boundaries and marks for the classes.
Example 2.3.2: The IQ scores for a group of 35 school dropouts are as follows:
a) Construct a frequency distribution for grouped data.
b) Specify the real limits for the lowest class interval in this frequency distribution.
Solution:
a) Using the guideline of approximately 10 class intervals, the class width ≈ (123 − 69)/10 = 54/10 = 5.4 ≈ 5.
b) Real limits for the lowest class interval in this frequency distribution = 64.5 − 69.5.
Example 2.3.3: Given below are the weekly pocket expenses (in Rupees) of a
group of 25 students selected at random.
37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49, 45,
44, 37, 36
Solution:
• In the given data, the smallest value is 26 and the largest value is 49. So, the range of the weekly pocket expenses = 49 − 26 = 23.
Outliers
• In statistics, an outlier is an observation point that is distant from other observations.
• An outlier is a value that escapes normality and can cause anomalies in the results obtained through algorithms and analytical systems. Therefore, outliers always need some degree of attention.
• Understanding outliers is critical when analyzing data for at least two reasons: outliers may distort the overall results of an analysis, or their unusual behaviour may be precisely what is being investigated.
• The simplest way to find outliers in data is to look directly at the data table, or dataset, as data scientists call it. The following case clearly exemplifies a typing error, that is, an error in the input of the data.
• The age field for the individual Antony Smith certainly does not represent an age of 470 years. Looking at the table it is possible to identify the outlier, but it is difficult to say what the correct age would be. There are several possibilities for the right age, such as 47, 70 or even 40 years.
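• One simple way to flag such a value programmatically is to measure how far each observation lies from the mean relative to the standard deviation. The sketch below uses an assumed age list and a rough rule of thumb (more than two standard deviations from the mean); more robust methods, such as the IQR rule discussed later, are often preferred.

import statistics

# Assumed illustration data containing an obvious typing error (470)
ages = [47, 32, 55, 61, 470, 29, 38, 44]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)

# Rough rule of thumb: flag values more than 2 standard deviations from the mean
outliers = [x for x in ages if abs(x - mean) > 2 * sd]
print(outliers)   # [470]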
Relative frequency:
• A relative frequency distribution lists the data values along with the percent of all observations belonging to each group. These relative frequencies are calculated by dividing the frequencies for each group by the total number of observations.
• Example: Suppose we take a sample of 200 Indian families and record the number of people living in each home. We obtain the following:
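• Since the 200-family table itself is not reproduced here, the Python sketch below shows the relative-frequency calculation on a small assumed sample of household sizes.

from collections import Counter

household_sizes = [2, 3, 4, 4, 5, 3, 2, 4, 6, 3, 4, 5, 2, 3, 4]   # assumed sample

freq = Counter(household_sizes)
total = len(household_sizes)

print("Size  Frequency  Relative frequency")
for size in sorted(freq):
    relative = freq[size] / total   # frequency divided by total observations
    print(f"{size:4d}  {freq[size]:9d}  {relative:18.2%}")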
Cumulative frequency:
• A cumulative frequency distribution can be useful for ordered data (e.g. data
arranged in intervals, measurement data, etc.). Instead of reporting frequencies, the
recorded values are the sum of all frequencies for values less than and including
the current value.
• Example: Suppose we take a sample of 200 Indian families and record the number of people living in each home. We obtain the following:
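• Again using a small assumed sample in place of the missing table, the cumulative frequencies are simply running totals of the ordinary frequencies.

from collections import Counter
from itertools import accumulate

household_sizes = [2, 3, 4, 4, 5, 3, 2, 4, 6, 3, 4, 5, 2, 3, 4]   # assumed sample

freq = Counter(household_sizes)
values = sorted(freq)
cumulative = list(accumulate(freq[v] for v in values))   # running totals

print("Size  Frequency  Cumulative frequency")
for v, cum in zip(values, cumulative):
    print(f"{v:4d}  {freq[v]:9d}  {cum:20d}")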
Describing Data with Averages
• Averages consist of numbers (or words) about which the data are, in some sense, centered. They are often referred to as measures of central tendency. This is already covered in section 1.12.1.
• We look at various ways to measure the central tendency of data, including the mean, weighted mean, trimmed mean, median, mode and midrange.
1. Mean:
• The mean of a data set is the average of all the data values: the sum of the values divided by the number of values. The sample mean x̄ is the point estimator of the population mean μ. (A computational sketch covering the mean, median and mode follows the mode discussion below.)
2. Median :
• The median of a data set is the value in the middle when the data items are
arranged in ascending order. Whenever a data set has extreme values, the median is
the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean.
• Example: for the 7 observations 26 18 29 12 14 27 19, arranged in ascending order as 12 14 18 19 26 27 29, the middle value gives Median = 19.
• If an 8th observation, 30, is added (26 18 29 12 14 27 30 19), the ordered values are 12 14 18 19 26 27 29 30 and the median is the average of the two middle values: (19 + 26)/2 = 22.5.
3. Mode:
• The mode of a data set is the value that occurs with greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
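• The sketch mentioned under the mean: Python's statistics module computes all three measures directly; the scores are the eight observations from the median example above.

import statistics

scores = [26, 18, 29, 12, 14, 27, 30, 19]

print(statistics.mean(scores))        # 21.875
print(statistics.median(scores))      # 22.5, the average of the two middle values
print(statistics.multimode(scores))   # every value occurs once, so all are returned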
Trimmed mean: A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values. Even a small number of extreme values can corrupt the mean. The
trimmed mean is the mean obtained after cutting off values at the high and low
extremes.
• For example, we can sort the values and remove the top and bottom 2 % before
computing the mean. We should avoid trimming too large a portion (such as 20 %)
at both ends as this can result in the loss of valuable information.
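• A minimal sketch of a trimmed mean, assuming illustrative salary figures: sort the values, drop the same fraction at each end and average the rest (libraries such as SciPy also provide scipy.stats.trim_mean for the same purpose).

def trimmed_mean(data, proportion):
    values = sorted(data)
    k = int(len(values) * proportion)   # number of values cut from each end
    trimmed = values[k:len(values) - k] if k else values
    return sum(trimmed) / len(trimmed)

salaries = [28, 31, 29, 30, 33, 32, 27, 30, 29, 400]   # one extreme value

print(sum(salaries) / len(salaries))   # ordinary mean, pulled up to 66.9
print(trimmed_mean(salaries, 0.10))    # trimming 10% at each end gives 30.25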
• Holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset.
Describing Variability
• The goal for variability is to obtain a measure of how spread out the scores are in a distribution. A measure of variability usually accompanies a measure of central tendency as basic descriptive statistics for a set of scores.
• Central tendency describes the central point of the distribution and variability
describes how the scores are scattered around that central point. Together, central
tendency and variability are the two primary values that are used to describe a
distribution of scores.
• Variability can be measured with the range, the interquartile range and the
standard deviation/variance. In each case, variability is determined by measuring
distance.
Range
• The range is the total distance covered by the distribution, from the highest score
to the lowest score (using the upper and lower real limits of the range).
Merits:
a) It is easier to compute than the other measures of variability.
Variance
• The variance is the average of the squared deviations from the mean:
Variance = σ² = Σ(x − μ)² / N
• In the formula above, μ represents the mean of the data points, x is the value of an individual data point and N is the total number of data points.
• Data scientists often use variance to better understand the distribution of a data
set. Machine learning uses variance calculations to make generalizations about a
data set, aiding in a neural network's understanding of data distribution. Variance is
often used in conjunction with probability distributions.
Standard Deviation
• Standard deviation is simply the square root of the variance. Standard deviation
measures the standard distance between a score and the mean.
Standard deviation = √Variance
• The standard deviation is a measure of how the values in data differ from one
another or how spread out data is. There are two types of variance and standard
deviation in terms of sample and population.
• The standard deviation measures how far apart the data points in the observations are from each other. We can calculate it by subtracting the mean from each data point and then finding the mean of the squared differences; this is called the variance. The square root of the variance gives us the standard deviation.
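• The population/sample distinction can be seen directly with Python's statistics module; the scores here are the ones used in Example 2.8.2 below.

import statistics

scores = [1, 3, 7, 2, 0, 4, 7, 3]

print(statistics.pvariance(scores))   # population variance (divide by N), about 5.73
print(statistics.pstdev(scores))      # population standard deviation, about 2.39
print(statistics.variance(scores))    # sample variance (divide by n - 1), about 6.55
print(statistics.stdev(scores))       # sample standard deviation, about 2.56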
a) Adding a constant to each score shifts the center of the distribution (the mean), but the standard deviation remains the same.
b) Multiplying each score by a constant multiplies the distance between scores, and because the standard deviation is a measure of distance, it is also multiplied by that constant.
• If we are given numerical values for the mean and the standard deviation, we should be able to construct a visual image (or a sketch) of the distribution of scores. As a general rule, about 70% of the scores will be within one standard deviation of the mean and about 95% of the scores will be within two standard deviations of the mean.
• Standard deviation distances always originate from the mean and are expressed as
positive deviations above the mean or negative deviations below the mean.
SS = Σ(X − X̄)²
SS = ΣX² − (ΣX)²/n
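• A quick check that the definitional and computational formulas for SS agree, using the scores of Example 2.8.2 below.

X = [1, 3, 7, 2, 0, 4, 7, 3]
n = len(X)
mean = sum(X) / n

ss_definitional = sum((x - mean) ** 2 for x in X)              # SS = Σ(X − X̄)²
ss_computational = sum(x ** 2 for x in X) - sum(X) ** 2 / n    # SS = ΣX² − (ΣX)²/n

print(ss_definitional, ss_computational)   # both print 45.875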
Example 2.8.1: The heights of animals are: 600 mm, 470 mm, 170 mm, 430
mm and 300 mm. Find out the mean, the variance and the standard deviation.
Solution:
Mean = (600 + 470 + 170 + 430 + 300)/5 = 1970/5 = 394 mm
Variance = [(600 − 394)² + (470 − 394)² + (170 − 394)² + (430 − 394)² + (300 − 394)²]/5
= (42436 + 5776 + 50176 + 1296 + 8836)/5 = 108520/5 = 21704
Standard deviation = √21704 = 147.32 ≈ 147 mm
Example 2.8.2: Using the computation formula for the sum of squares,
calculate the population standard deviation for the scores: 1, 3, 7, 2, 0, 4, 7, 3.
Solution:
ΣX = 1 + 3 + 7 + 2 + 0 + 4 + 7 + 3 = 27, so the mean = 27/8 = 3.375 and ΣX² = 137
SS = ΣX² − (ΣX)²/n = 137 − (27)²/8 = 137 − 91.125 = 45.875
(equivalently, Σ(X − X̄)² = 5.64 + 0.14 + 13.14 + 1.89 + 11.39 + 0.39 + 13.14 + 0.14 ≈ 45.87)
Variance = SS/N = 45.875/8 ≈ 5.73
The population standard deviation is the square root of the variance = √5.73 ≈ 2.39
• The interquartile range (IQR) is the distance covered by the middle 50% of the distribution (the difference between Q3 and Q1).
• The first quartile, denoted Q1, is the value in the data set that holds 25% of the
values below it. The third quartile, denoted Q3, is the value in the data set that
holds 25% of the values above it.
Example 2.8.3: Determine the values of the range and the IQR for the
following sets of data.
(a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Solution:
a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Range = 70 − 45 = 25
IQR:
Ordered data: 45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70
Median = 63 (the 6th of the 11 ordered values)
Q1 = 60, Q3 = 65
IQR = Q3 − Q1 = 65 − 60 = 5
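• The quartile values in part (a) can be checked with Python's statistics module. Different textbooks and software locate quartiles with slightly different conventions, so other tools may give slightly different answers; here the default 'exclusive' method happens to agree with the values above.

import statistics

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

q1, q2, q3 = statistics.quantiles(ages, n=4)   # quartile cut points
iqr = q3 - q1
print(q1, q3, iqr)   # 60.0 65.0 5.0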