3 Data Description
2.0 45
3.0 78
4.0 68
5.0 25
6.0 20
7.0 5
8.0 1
9.0 0
10.0 2
Frequency Tables – Categorical Data
For this, we have used the crosstab() function of pandas.
21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24
• An individual value tells us the temperature in Goa on one day, but it does not summarize the month.
• We can find the average of the 31 values by adding all the values and dividing by the number of observations.
• In this case, the average is (21 + 24 + 22 + ⋯ + 23 + 24)/31 = 800/31 ≈ 25.81.
• From the average, we know what temperature to expect in Goa.
• To decide on the right clothing, we also need to know the variability in temperature.
• The temperature varies between 21 and 31. So we need three numbers to summarize the data: 25.81, 31, and 21.
Measure of central tendency
• Humans tend to use the “average” to make comparisons.
• Suppose you score 60% in an examination.
• In order to understand how well you did, you check the average score of
your class.
• If the average score of the class is 40%, you feel happy.
• If the average score of the class is 90%, you will not feel so happy.
• In our daily life, we tend to use several measures such as average
temperature in January of a city, average income, average height etc.
• The statistical functions that we use to describe the average or center of
the data are called measures of central tendency.
• The measure of central tendency tells us where most of the data is located
Measure of central tendency
• The most common measures of central tendency are the mean and
the median.
• For this study, we have not considered mode (most common value) to
be a suitable measure of central tendency.
• For a continuous data set, each observation is generally unique.
• In many cases, we found the mode to be near one end of the
distribution, not at the central region.
• The mode may not be unique for a given data set.
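The points about the mode can be checked with a minimal sketch. It assumes the Goa temperature data set used in this section and Python 3.8 or later for statistics.multimode(); the `readings` list is hypothetical and only illustrates continuous data.

```python
import statistics

# Daily temperatures for Goa, copied from the data set used in this section.
temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25,
               26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]

# multimode() returns every value tied for the highest count,
# in the order the values first appear in the data.
modes = statistics.multimode(temperature)
print("Modes:", modes)

# Hypothetical continuous readings: each value occurs once,
# so every value is returned as a mode and the mode tells us nothing.
readings = [25.13, 26.02, 24.87, 25.91]
continuous_modes = statistics.multimode(readings)
print("Modes of continuous readings:", continuous_modes)
```

Here two temperatures tie for the highest count, so the mode is not unique.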
The mean
• The mean, or arithmetic mean, is a simple and effective summary of the data set. It is an intuitive measure of
central tendency. For a sample of 𝑛 values 𝑥𝑖 , 𝑖 = 1, … , 𝑛, the mean is
$$\mu \ (\text{or } \bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n} \times (x_1 + x_2 + \cdots + x_n)$$
• The mean is easy to calculate: add the numbers and divide the sum by the number of observations in the
sample.
• A mathematical property of the mean: the mean of a sum is the sum of the means. For example, suppose
the total income of every person in a community has three components from three sources, 𝑢𝑖 , 𝑣𝑖 , 𝑤𝑖 . The
total income of each person is 𝑦𝑖 = 𝑢𝑖 + 𝑣𝑖 + 𝑤𝑖 . The average income of the community is
$$\bar{y} = \bar{u} + \bar{v} + \bar{w}$$
• Here $\bar{u}$, $\bar{v}$, $\bar{w}$ are the average incomes of the community from each source. The average income can also be
calculated as
$$\bar{y} = \frac{1}{n} \times (y_1 + y_2 + \cdots + y_n)$$
• Hence, whichever way we calculate the mean income of the community, we get the same answer.
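The property that the mean of a sum equals the sum of the means can be verified numerically. The sketch below uses hypothetical income figures for four people; the values in `u`, `v`, and `w` are illustrative only.

```python
import statistics

# Hypothetical incomes of four people from three sources (illustrative values).
u = [10, 20, 30, 40]
v = [5, 5, 10, 20]
w = [1, 2, 3, 4]

# Total income of each person: y_i = u_i + v_i + w_i
y = [ui + vi + wi for ui, vi, wi in zip(u, v, w)]

# Route 1: average the totals. Route 2: add the per-source averages.
mean_of_sum = statistics.mean(y)
sum_of_means = statistics.mean(u) + statistics.mean(v) + statistics.mean(w)

print("Mean of the sum  :", mean_of_sum)
print("Sum of the means :", sum_of_means)
```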
Combined Mean
• Means can be combined.
• The combined mean is the weighted average of the means of two or more separate groups, where the
weights are the sizes of the groups.
• For example, suppose the data comes from two sources: 𝑚 males with mean 𝑥1 and 𝑛 females with mean 𝑥2.
• The overall mean is the weighted mean of the male average and the female average, where the weights
are the proportions of males and females in the overall data set.
$$\bar{x} = \frac{m \times x_1 + n \times x_2}{m + n}$$
$$\bar{x} = \frac{m}{m+n} \times x_1 + \frac{n}{m+n} \times x_2$$
• The two most important reasons for using a weighted mean are:
• Some values in the dataset have more variability than others. We can give the more variable values less weight.
• During our sampling process, we may realize that one group was underrepresented. We can correct that by giving more
weight to the underrepresented group.
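The combined-mean formula can be illustrated with a short sketch. The two groups below are hypothetical (heights in cm, chosen only for illustration). The check is that weighting the group means by group size gives the same answer as averaging the pooled data.

```python
import statistics

# Hypothetical heights in cm for two groups (illustrative values only).
males = [170, 175, 180]
females = [160, 165]

m, n = len(males), len(females)        # group sizes act as the weights
mean_males = statistics.mean(males)
mean_females = statistics.mean(females)

# Combined mean: group means weighted by group size.
combined = (m * mean_males + n * mean_females) / (m + n)

# Direct mean of the pooled data, for comparison.
pooled = statistics.mean(males + females)

print("Combined mean:", combined)
print("Pooled mean  :", pooled)
```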
Mean as center of gravity
• The mean is the center of gravity of the numbers.
• If we place equal weights at each of the data points on a weightless
number line, the mean is the balance point.
• The main disadvantage of the mean is its sensitivity to outliers.
• A single observation that is much bigger or smaller than the rest of
the observations can have a big effect on the overall mean.
• For such cases, and for skewed data in general, the mean can be a
misleading summary.
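The effect of a single outlier can be seen in a small sketch. It reuses the Goa temperature data from this section and appends one hypothetical faulty reading of 60. That value is an assumption made purely for illustration.

```python
import statistics

# Daily temperatures for Goa, copied from the data set used in this section.
temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25,
               26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]

# Hypothetical outlier: one faulty sensor reading added to the data.
with_outlier = temperature + [60]

mean_before = statistics.mean(temperature)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(temperature)
median_after = statistics.median(with_outlier)

print("Mean   before/after:", mean_before, mean_after)
print("Median before/after:", median_before, median_after)
```

In this sketch the mean shifts by more than one degree, while the median stays where it was.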
Useful facts about mean
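• If you scale the data, the mean scales by the same factor:
$$\text{mean}(k \times x_i) = k \times \text{mean}(x_i)$$
• If you translate the data, the mean translates by the same amount:
$$\text{mean}(x_i + c) = \text{mean}(x_i) + c$$
• The sum of signed deviations from the mean is always zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
• The sum of squared distances of the data points from the mean is the minimum: no other value gives a smaller sum.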
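These four facts can be checked numerically. The sketch below uses a small hypothetical data set `x` and hypothetical constants `k` and `c`. The helper `sum_sq` is introduced only for this illustration.

```python
import statistics

# Hypothetical data and constants, chosen only for illustration.
x = [2, 4, 6, 8, 15]
k, c = 3, 10

mean_x = statistics.mean(x)

# Scaling and translating the data scales and translates the mean.
mean_scaled = statistics.mean([k * xi for xi in x])
mean_shifted = statistics.mean([xi + c for xi in x])

# Signed deviations from the mean cancel out.
deviation_sum = sum(xi - mean_x for xi in x)

def sum_sq(center):
    """Sum of squared distances of the data points from `center`."""
    return sum((xi - center) ** 2 for xi in x)

print("mean(k*x) =", mean_scaled, " k*mean(x) =", k * mean_x)
print("mean(x+c) =", mean_shifted, " mean(x)+c =", mean_x + c)
print("Sum of deviations:", deviation_sum)
print("Sum of squares at mean-1, mean, mean+1:",
      sum_sq(mean_x - 1), sum_sq(mean_x), sum_sq(mean_x + 1))
```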
Mean using Python
• Python supports data analysis and statistics. The statistics module of Python provides functions such as
mean(), median(), and mode(). The mean() function calculates the mean, or average, of a data
set.
• Example: Let’s calculate the mean temperature of Goa for a month, where the data set of daily temperatures
is as follows:
21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24
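A minimal sketch of this calculation with the statistics module follows. It also computes the median for comparison, since the median is the other measure of central tendency considered in this section.

```python
import statistics

# Daily temperatures for Goa for one month (31 observations).
temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25,
               26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]

# mean() adds the values and divides by the number of observations.
mean_temperature = statistics.mean(temperature)

# median() returns the middle value of the sorted data.
median_temperature = statistics.median(temperature)

print("Mean Temperature is :", round(mean_temperature, 2))
print("Median Temperature is :", median_temperature)
```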
Data in case (c) is more variable than in cases (b) and (a). Data in case (b) is
more variable than in case (a), even though the mean is the same in all three
cases.
Range
• The difference between the largest and the smallest observation
is called the range.
• Calculating the range is very easy.
• But the largest and the smallest observations may be outliers.
• The range is therefore heavily influenced by outliers.
Range using Python
• To calculate the range, we can use the built-in max() and min() functions of Python
• Example: Let’s calculate the range of temperature of Goa for a month where the data set of daily temperature is as follows
21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24
>>> temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]
>>> print("Temperature Range is :", max(temperature) - min(temperature))
Temperature Range is : 10
21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24
• Let’s compute the variance using the var() function of NumPy as well as the variance() and pvariance()
functions of the statistics module.
• The variance() function computes the sample variance and uses (𝑛 − 1) in the denominator, whereas
pvariance() computes the population variance and uses 𝑛 in the denominator.
• By default, the var() function of NumPy also computes the population variance and uses 𝑛 in the denominator.
• We can calculate the sample variance using NumPy by passing the ddof=1 parameter to the var() function. The
default value is ddof=0.
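For reference, the two denominators correspond to the standard formulas below. The symbols $s^2$ (sample variance) and $\sigma^2$ (population variance) are conventional notation assumed here, not taken from the slides. NumPy's ddof ("delta degrees of freedom") is the amount subtracted from 𝑛 in the denominator.

```latex
% Sample variance: denominator n - 1 (statistics.variance, numpy.var with ddof=1)
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

% Population variance: denominator n (statistics.pvariance, numpy.var with ddof=0)
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
```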
Variance using Python
>>> import statistics
>>> temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]
>>> print("Sample Variance is :", statistics.variance(temperature))
Sample Variance is : 7.894623655913978
>>> print("Population Variance is :", statistics.pvariance(temperature))
Population Variance is : 7.639958376690947
21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24
• Let’s compute the standard deviation using the std() function of NumPy as well as the stdev() and pstdev()
functions of the statistics module.
• The stdev() function computes the sample standard deviation and uses (𝑛 − 1) in the denominator, whereas pstdev()
computes the population standard deviation and uses 𝑛 in the denominator.
• By default, the std() function of NumPy also computes the population standard deviation and uses 𝑛 in the denominator.
• We can calculate the sample standard deviation using NumPy by passing the ddof=1 parameter to the std() function. The
default value is ddof=0.
Standard Deviation using Python
>>> import statistics
>>> temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25, 26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]
>>> print("Sample Standard Deviation is :", statistics.stdev(temperature))
Sample Standard Deviation is : 2.8097372930425326
>>> print("Population Standard Deviation is :", statistics.pstdev(temperature))
Population Standard Deviation is : 2.7640474628144407
Using NumPy
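The NumPy versions of the variance and standard deviation calculations can be sketched as follows. This assumes NumPy is installed and reuses the Goa temperature data. The ddof argument switches between the population (ddof=0, the default) and sample (ddof=1) versions.

```python
import numpy as np

# Daily temperatures for Goa, copied from the data set used in this section.
temperature = [21, 24, 22, 21, 23, 26, 24, 29, 28, 30, 31, 28, 24, 26, 27, 25,
               26, 26, 25, 27, 26, 29, 30, 29, 29, 28, 24, 23, 22, 23, 24]

# Default ddof=0: divide by n (population variance / standard deviation).
population_variance = np.var(temperature)
population_std = np.std(temperature)

# ddof=1: divide by n - 1 (sample variance / standard deviation).
sample_variance = np.var(temperature, ddof=1)
sample_std = np.std(temperature, ddof=1)

print("Population Variance is :", population_variance)
print("Sample Variance is :", sample_variance)
print("Population Standard Deviation is :", population_std)
print("Sample Standard Deviation is :", sample_std)
```

Up to floating-point rounding, these values should agree with the results of pvariance(), variance(), pstdev(), and stdev() from the statistics module.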
• In this example, we have drawn the scatter plots of two samples on a single plot. We can
infer that both samples show a similar relationship between the two variables.
Scatterplot Matrix
• For many experiments, our dataset consists of measurements of
multiple variables.
• Such a dataset is called multivariate data.
• We use a scatterplot matrix to study the relationship between
variables.
• We plot a scatterplot for each pair of variables.
• We show them in a matrix form.
Scatterplot Matrix
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> np.random.seed(100)
>>> N = 1000
>>> x1 = np.random.normal(0, 1, N)
>>> x2 = x1 + np.random.normal(0, 3, N)
>>> x3 = 2 * x1 - x2 + np.random.normal(0, 2, N)
>>> # Collect the three variables into a DataFrame, one column per variable
>>> df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
>>> df.head()
         x1        x2         x3
0 -0.224315 -8.840152  10.145993
1  1.337257  2.383882  -1.854636
2  0.882366  3.544989  -1.117054
3  0.295153 -3.844863   3.634823
4  0.780587 -0.465342   2.121288
>>> pd.plotting.scatter_matrix(df)
>>> plt.suptitle("Scatter Plot Matrix")
>>> plt.show()
Scatterplot Matrix
• On the diagonal, the plot shows the distribution of each of the
three variables in our dataset.
• In other cells, there is a scatterplot between two variables at
a time.
• Second column of the first row shows the scatterplot
between x1 and x2
• Third column of the first row shows the scatterplot between
x1 and x3
• First column of the second row shows the scatterplot
between x1 and x2
• Third column of the second row shows the scatterplot
between x2 and x3
• First column of the third row shows the scatterplot between
x1 and x3
• Second column of the third row shows the scatterplot
between x2 and x3
Scatterplot Matrix
• We can change the number of histogram bins by passing the hist_kwds argument:
>>> pd.plotting.scatter_matrix(df,
...     hist_kwds={'bins': 30})
• For this study, we will generate random data for our analysis. The
data represents observations for 200 days.
Exercise Python
>>> import pandas as pd
>>> import numpy as np
>>> import seaborn as sns
>>> np.random.seed(2500)
>>> df = pd.DataFrame(np.random.randint(low= 0, high= 20, size= (200, 2)),
... columns= ['Recreation Hours', 'Item Produced'])
>>> df
     Recreation Hours  Item Produced
0                   4             19
1                   0              0
2                   1              6
3                  11             19
4                   2             17
..                ...            ...
195                 7              5
196                 6             19
197                 3             17
198                13             19
199                 1              1