Stats Week 1 PDF
Random variable:
We often summarize the outcome of a random experiment by a single number. In many of
the examples of random experiments that we have considered, the sample space has been only a
description of the possible outcomes. A variable that associates a number with each outcome
of a random experiment is referred to as a random variable.
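As a small illustration of this definition, the sketch below maps each outcome of a hypothetical experiment (two coin tosses, an example that does not come from these notes) to the number of heads. The names sample_space and number_of_heads are illustrative assumptions.

```python
from itertools import product

# Hypothetical experiment (not from the notes): toss a coin twice.
# Each outcome is a description such as ('H', 'T'), not a number.
sample_space = list(product("HT", repeat=2))

def number_of_heads(outcome):
    """Random variable X: associates a number with each outcome."""
    return outcome.count("H")

for outcome in sample_space:
    print(outcome, "->", number_of_heads(outcome))
```

The function plays the role of the random variable: the experiment produces a description, and the function turns it into a number.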
Events:
An event is a subset of the sample space of a random experiment.
Sample Space:
To model and analyze a random experiment, we must understand the set of possible outcomes
from the experiment. The set of all possible outcomes of a random experiment is called the
sample space of the experiment. The sample space is denoted as S.
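A sample space and an event can be written out directly as sets. The sketch below assumes a hypothetical experiment (rolling two six-sided dice) that is not taken from these notes; the name sum_is_seven is illustrative.

```python
from itertools import product

# Hypothetical experiment (not from the notes): roll two six-sided dice.
# The sample space S is the set of all 36 ordered pairs of faces.
S = set(product(range(1, 7), repeat=2))

# An event is a subset of S; here, "the two faces sum to 7".
sum_is_seven = {outcome for outcome in S if sum(outcome) == 7}

print(len(S))             # number of possible outcomes
print(sum_is_seven <= S)  # subset check: every outcome in the event is in S
```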
Independence:
In probability, two events A and B are independent if any one of the following equivalent statements
is true:
(1) P(A|B) = P(A)
(2) P(B|A) = P(B)
(3) P(A ∩ B) = P(A) P(B)
Here, a mutually exclusive relationship between two events is based only on the outcomes that
comprise the events. However, an independence relationship depends on the probability model
used for the random experiment.
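The difference between independence and mutual exclusivity can be checked numerically. The sketch below assumes a hypothetical probability model (two fair dice, all 36 outcomes equally likely) and tests the product rule P(A ∩ B) = P(A) P(B); the events A, B and C are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical model (not from the notes): two fair dice, equally likely outcomes.
S = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event under the equally-likely model."""
    return Fraction(len(event & S), len(S))

A = {o for o in S if o[0] % 2 == 0}  # first die shows an even face
B = {o for o in S if sum(o) == 7}    # faces sum to 7
C = {o for o in S if sum(o) == 8}    # faces sum to 8

def independent(E, F):
    """Product-rule test for independence."""
    return prob(E & F) == prob(E) * prob(F)

print(independent(A, B))  # A and B satisfy the product rule
print(independent(B, C))  # B and C share no outcomes, so the rule fails
```

B and C are mutually exclusive, which can be seen from the outcomes alone, yet they are not independent. Whether A and B are independent depends on the probabilities assigned by the model.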
Bayes Theorem:
Conditional probabilities generally give the probability of an event given a condition. But
after a random experiment generates an outcome, we are naturally interested in the probability
that a condition was present given that outcome. Bayes' Theorem answers this question:
P(A|B) = P(B|A) P(A) / P(B), for P(B) > 0
This is a useful result that enables us to solve for P(A|B) in terms of P(B|A).
If E1, E2, ..., Ek are k mutually exclusive and exhaustive events and B is any event, then
P(E1|B) = P(B|E1) P(E1) / [P(B|E1) P(E1) + P(B|E2) P(E2) + ... + P(B|Ek) P(Ek)], for P(B) > 0
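The general form of Bayes' Theorem for mutually exclusive and exhaustive events can be sketched in a few lines. The function name bayes_posteriors and all numbers below are made up for illustration and do not come from these notes.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(Ei|B) for every i, given P(Ei) and P(B|Ei).

    Assumes the Ei are mutually exclusive and exhaustive,
    so the priors sum to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B|Ei) P(Ei)
    total = sum(joint)                                    # P(B), the denominator
    return [j / total for j in joint]

# Made-up numbers, for illustration only.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]            # P(E1), P(E2), P(E3)
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]  # P(B|Ei)

posteriors = bayes_posteriors(priors, likelihoods)
print(posteriors)
```

Exact fractions are used so that the posteriors sum to exactly 1, which is a quick sanity check on the arithmetic.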
1. Sample mean: If the n observations in a sample are denoted by x1, x2, ..., xn, the sample
mean is
x̄ = (x1 + x2 + ... + xn) / n
Although the sample mean is useful, it does not convey all of the information about a sample of
data. The variability or scatter in the data may be described by the sample variance or the sample
standard deviation.
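2. Sample variance: If the n observations in a sample are denoted by x1, x2, ..., xn, the sample variance is
s^2 = [(x1 - x̄)^2 + (x2 - x̄)^2 + ... + (xn - x̄)^2] / (n - 1)
where x̄ is the sample mean. The sample standard deviation, s, is the positive square root of the sample variance.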
3. Sample range: If the n observations in a sample are denoted by x1, x2, ..., xn, the sample range is
r = max(xi) - min(xi)
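The three sample statistics listed here can be computed directly from their definitions. This is a minimal sketch: the function name sample_summary and the data values are made up, and the variance uses the usual n - 1 divisor for a sample.

```python
import math

def sample_summary(xs):
    """Sample mean, variance (n - 1 divisor), standard deviation and range."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": math.sqrt(variance),  # positive square root of the variance
        "range": max(xs) - min(xs),
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
summary = sample_summary(data)
print(summary)
```

Python's standard library offers statistics.mean, statistics.variance and statistics.stdev for the same quantities; the explicit version is shown only to mirror the definitions.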
Scatter Plot
Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data
points. However, they have a very specific purpose: scatter plots show how strongly one variable is
related to another. The relationship between two variables is called their correlation.
Scatter plots usually consist of a large body of data. The closer the plotted data points come
to forming a straight line, the higher the correlation between the two variables, or the stronger the
relationship.
If the data points form a line rising from the origin out to high x- and y-values, then the
variables are said to have a positive correlation. If the line falls from a high value on the y-axis
down to a high value on the x-axis, the variables have a negative correlation.
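One common way to put a number on how close the points are to a straight line is the Pearson correlation coefficient. These notes do not name a specific measure, so treating "correlation" as Pearson's r is an assumption here, and the data are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # points on a rising straight line
falling = [10, 8, 6, 4, 2]  # points on a falling straight line

print(pearson_r(x, rising))   # close to +1: positive correlation
print(pearson_r(x, falling))  # close to -1: negative correlation
```

Values near +1 or -1 correspond to points that lie close to a straight line; values near 0 correspond to a shapeless cloud of points.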
Histogram:
Dividing each class frequency by the total number of observations, we obtain the proportion of
the set of observations in each of the classes. A table listing relative frequencies is called a relative
frequency distribution. The information provided by a relative frequency distribution in tabular
form is easier to grasp if presented graphically. Using the midpoint of each interval and the
corresponding relative frequency, we construct a relative frequency histogram.
Here is how the histogram looks after plotting.
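The construction of a relative frequency distribution can be sketched without a plotting library. The class edges and data below are made up, the name relative_frequencies is illustrative, and the data are assumed to lie within the outermost edges.

```python
def relative_frequencies(data, edges):
    """Relative frequency of each class defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Classes are [lo, hi), except the last, which also includes its upper edge.
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    # Divide each class frequency by the total number of observations.
    return [c / len(data) for c in counts]

data = [1, 2, 2, 3, 5, 6, 6, 7, 8, 9]  # made-up observations
edges = [0, 3, 6, 9]                   # made-up class boundaries
rel = relative_frequencies(data, edges)

# Text version of the histogram: class midpoint, bar, relative frequency.
for (lo, hi), rf in zip(zip(edges, edges[1:]), rel):
    midpoint = (lo + hi) / 2
    print(f"{midpoint:5.1f} | {'#' * round(rf * 20)} {rf:.2f}")
```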
Skewness of a plot:
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive or negative, or even undefined. For
example, consider the two distributions in the figure just below. Within each graph, the bars on
the right side of the distribution taper differently than the bars on the left side. These tapering
sides are called tails, and they provide a visual means for determining which of the two kinds of
skewness a distribution has:
Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of
the figure. The distribution is said to be left-skewed, left-tailed, or skewed to the left.
Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of
the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right.
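The sign of the skewness can be checked numerically. The sketch below uses the moment-based coefficient m3 / m2^(3/2), which is only one of several conventions and is an assumption here, since these notes do not give a formula. The data sets are made up.

```python
def sample_skewness(xs):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # long right tail
left_tailed = [-x for x in right_tailed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_tailed))  # positive skew
print(sample_skewness(left_tailed))   # negative skew
print(sample_skewness(symmetric))     # zero for symmetric data
```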
Box Plot:
The box plot is a graphical display that simultaneously describes several important features of a
data set, such as center, spread, departure from symmetry, and identification of unusual
observations or outliers. A box plot displays the three quartiles, the minimum, and the maximum
of the data on a rectangular box, aligned either horizontally or vertically. The box encloses the
interquartile range with the left (or lower) edge at the first quartile, q1, and the right (or upper)
edge at the third quartile, q3. A line is drawn through the box at the second quartile (which is the
50th percentile or the median). A line, or whisker, extends from each end of the box. The lower
whisker is a line from the first quartile to the smallest data point within 1.5 interquartile ranges
from the first quartile. The upper whisker is a line from the third quartile to the largest data point
within 1.5 interquartile ranges from the third quartile.
Data farther from the box than the whiskers are plotted as individual points. A point beyond a
whisker, but less than three interquartile ranges from the box edge, is called an outlier. A point
more than three interquartile ranges from the box edge is called an extreme outlier.
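The whisker and outlier rules described here can be sketched as follows. These notes do not say how the quartiles themselves are computed, so the use of statistics.quantiles with method="inclusive" is an assumption, and other conventions give slightly different quartiles. The data are made up.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker ends, outliers and extreme outliers of a data set."""
    # Quartile convention is an assumption; the notes do not specify one.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    inside = [x for x in data if low_fence <= x <= high_fence]
    beyond = [x for x in data if x < low_fence or x > high_fence]
    # More than three interquartile ranges from the box edge: extreme outlier.
    extreme = [x for x in beyond if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    outliers = [x for x in beyond if x not in extreme]

    return {
        "q1": q1, "median": q2, "q3": q3,
        "whiskers": (min(inside), max(inside)),  # extreme points within 1.5 IQR of the box
        "outliers": outliers,
        "extreme_outliers": extreme,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 30]  # made-up observations
box = box_plot_summary(data)
print(box)
```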
Mean:
It is the most well-known measure of central tendency and can be used with both discrete and
continuous data.
The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually
denoted by x̄, is
x̄ = (x1 + x2 + ... + xn) / n
Median:
The median is the numerical value separating the higher half of a data sample, a population or a
probability distribution from the lower half.
For an individual series (a small number of observations), first arrange all the observations in
order, and let n be the total number of observations in the data.
If n is odd, then Median (M) = value of the ((n + 1)/2)th item.
If n is even, then Median (M) = [value of the (n/2)th item + value of the (n/2 + 1)th item] / 2.
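The odd and even cases above translate directly into code. In this minimal sketch the positions in the rule are 1-based while Python lists are 0-based, which is why the indices are shifted by one. The function name and test values are illustrative.

```python
def median(values):
    """Median of a non-empty collection, following the odd/even rule."""
    ordered = sorted(values)  # step 1: arrange the observations in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1)/2)th item, which is index mid in 0-based terms
        return ordered[mid]
    # n even: average of the (n/2)th and (n/2 + 1)th items
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 1, 3]))     # odd number of observations
print(median([4, 1, 3, 2]))  # even number of observations
```

The standard library function statistics.median implements the same rule.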
Mode:
The mode is the most frequent score in our data set. On a bar chart or histogram it corresponds to the
highest bar. You can, therefore, sometimes consider the mode as being the most popular
option. An example of a mode is presented below
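Independently of the example the notes refer to, the mode can be found by counting how often each value occurs. The sketch below returns every value tied for the highest count, since a data set can have more than one mode. The function name modes and the data are made up.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

transport = ["bus", "car", "bus", "walk", "bus", "car"]  # made-up survey answers
print(modes(transport))        # the single most popular option
print(modes([1, 1, 2, 2, 3]))  # two values tie, so both are modes
```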