Statistics 1
In [ ]:
"""
OBSERVATIONAL AND EXPERIMENTAL DATA
Observational Data:
Observational data are collected by observing and recording the natural behavior of
individuals or groups without any intervention or manipulation by the researcher.
Experimental Data:
Experimental data are collected through controlled experiments where the researcher
deliberately manipulates one or more variables to observe the effect on another
variable.
"""
In [ ]:
"""
What is statistics?
"""
In [ ]:
"""
Difference between Descriptive and Inferential Statistics:
Descriptive Statistics:
Descriptive statistics involves organizing, summarizing, and presenting data in a
meaningful way. It helps to describe and visualize the main features of a dataset.
Descriptive statistics do not involve making inferences or generalizations beyond the
data at hand.
Example: Calculating the mean, median, and mode of a set of exam scores to understand
the central tendency and distribution of student performance.
Inferential Statistics:
Inferential statistics involves using sample data to make inferences or predictions
about a larger population. It allows researchers to draw conclusions and test
hypotheses based on sample data, extrapolating findings to the population from which
the sample was drawn.
"""
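The exam-score example above can be sketched with Python's standard `statistics` module. The scores are made-up values for illustration only.

```python
import statistics

# Hypothetical exam scores (illustrative values, not real data)
scores = [62, 75, 75, 81, 92]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted scores
mode_score = statistics.mode(scores)      # most frequent value

print(mean_score, median_score, mode_score)
```

These three numbers only describe the five scores at hand; saying anything about all students would be an inferential step.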
In [ ]:
"""
What topics do we cover in descriptive statistics?
Descriptive Statistics:
Measures of Central Tendency:
Mean, Median, Mode: You learn how to calculate and interpret these measures that
describe the center of a dataset.
Measures of Dispersion:
Variance, Standard Deviation, Range: You understand how to quantify the spread or
variability of data around the central tendency.
Data Visualization:
Histograms, Bar Charts, Pie Charts: You learn how to create and interpret graphical
representations of data to identify patterns and trends.
Summary Statistics:
Percentiles, Quartiles: You learn additional measures that provide insights into the
distribution of data beyond the mean and median.
"""
In [ ]:
"""
What data types are there?
1. Qualitative (Categorical) Data
Subtypes:
Nominal Data:
Description: Categories without any inherent order or ranking.
Examples: Gender (male, female), Eye color (blue, green, brown), Marital status
(single, married, divorced).
Ordinal Data:
Description: Categories with a specific order or ranking, but the intervals between
the ranks are not necessarily equal.
Examples: Education level (high school, bachelor’s, master’s, PhD), Customer satisfaction
(very dissatisfied, dissatisfied, neutral, satisfied, very satisfied).
2. Quantitative (Numerical) Data
Subtypes:
Discrete Data:
Description: Data that can only take specific, separate values, often counts of items.
Discrete data is usually counted and not measured.
Continuous Data:
Description: Data that can take any value within a given range. Continuous data is
usually measured and can be infinitely divided into smaller parts.
In [ ]:
"""
Data measurement scales:
Nominal Data:
This type of data represents categories or labels with no inherent order or ranking.
Examples: Gender (male, female), Marital Status (single, married, divorced), Colors
(red, blue, green).
Ordinal Data:
Interval Data:
Interval data represent values where the difference between any two values is
meaningful and consistent. However, there is no true zero point.
Examples: Temperature (measured in Celsius or Fahrenheit), Years (e.g., 2000, 2005,
2010), IQ Scores.
Ratio Data:
Ratio data have all the characteristics of interval data with the addition of a true
zero point. Ratios of values are meaningful.
Examples: Height, Weight, Distance, Time in seconds, Counts (number of cars, number
of people).
"""
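A small sketch of why ratios are meaningful on a ratio scale but not on an interval scale. The two temperatures are arbitrary illustrative values; Kelvin is used as the ratio-scale counterpart of Celsius.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low   # 2.0, yet 20 °C is not "twice as hot" as 10 °C

# Ratio scale: Kelvin has a true zero, so the ratio reflects the physical quantity.
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low    # roughly 1.035

# Differences are meaningful on both scales and agree with each other.
celsius_diff = c_high - c_low
kelvin_diff = k_high - k_low

print(celsius_ratio, round(kelvin_ratio, 3), celsius_diff, round(kelvin_diff, 2))
```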
In [ ]:
"""
Sample vs Population ?
Population:
The population refers to the entire group or set of individuals, objects, or events
that possess certain characteristics and are of interest to the researcher.
It represents the complete collection of elements under consideration in a study.
The population is often large and may be difficult or impractical to study in its
entirety.
Example: If you're studying the average height of adults in a country, the entire
adult population of that country would constitute the population.
Sample:
In [ ]:
"""
What is measure of central tendency?
The Mean:
Median:
The median is the middle value of a dataset when the values are arranged in ascending
or descending order. If the dataset has an odd number of values, the median is the
middle value. If the dataset has an even number of values, the median is the average
of the two middle values. The median is less affected by outliers compared to the mean,
making it a more robust measure of central tendency in skewed distributions.
Mode:
In [ ]:
"""
When is the median preferred over the mean?
Robustness to Outliers:
The median is less sensitive to outliers or extreme values compared to the mean.
Outliers can heavily skew the mean, affecting its representativeness of the central
tendency of the dataset. In such cases, the median provides a more robust estimate.
Skewed Distributions:
In skewed distributions, where the data is not symmetrically distributed around the
center, the median can provide a better representation of the typical value.
This is because the median is not influenced by the extreme values at the tails of
the distribution, as the mean might be.
Ordinal Data:
When dealing with ordinal data or ranked data, the median is often more appropriate.
For example, in ranking survey responses from least to most favorable, the median
represents the middle response, providing a clear indication of the central position.
"""
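The robustness argument above can be checked with a short sketch. The salary figures are hypothetical, and the last value is deliberately extreme.

```python
import statistics

# Hypothetical salaries (in thousands); the final value is a deliberate outlier
salaries = [30, 32, 35, 36, 38, 40, 400]

mean_with = statistics.mean(salaries)
median_with = statistics.median(salaries)

# Same data with the outlier removed
mean_without = statistics.mean(salaries[:-1])
median_without = statistics.median(salaries[:-1])

print("with outlier:   ", round(mean_with, 2), median_with)
print("without outlier:", round(mean_without, 2), median_without)
```

The single extreme value more than doubles the mean, while the median barely moves.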
In [ ]:
"""
When is the mean preferred over the median?
Symmetric Distributions:
In symmetric distributions with no outliers, the mean and median are usually very
close to each other, and the mean can provide a precise measure of central tendency.
For interval or ratio data that follow a normal distribution without outliers, the
mean is often the most appropriate measure of central tendency.
"""
In [ ]:
"""
Where do we use the concept of central tendency?
In [ ]:
"""
What is the use of Bessel's correction in statistics?
The reason for Bessel's correction lies in the fact that when you calculate the sample
variance and standard deviation, you are using the sample mean as an estimate of the
population mean. However, measuring deviations from the sample mean tends to slightly
underestimate the population variance.
Bessel's correction adjusts for this underestimation by dividing the sum of squared
deviations (used in the calculation of variance) by n−1 instead of n, where n is the
sample size.
"""
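A minimal sketch of the two divisors, using a made-up sample. The standard library's `statistics.pvariance` divides by n and `statistics.variance` divides by n−1, so they serve as a cross-check here.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(sample)
sample_mean = sum(sample) / n

# Sum of squared deviations from the sample mean
squared_deviations = sum((x - sample_mean) ** 2 for x in sample)

variance_n = squared_deviations / n              # no correction (divide by n)
variance_bessel = squared_deviations / (n - 1)   # Bessel's correction (divide by n-1)

print(variance_n, statistics.pvariance(sample))
print(variance_bessel, statistics.variance(sample))
```

The corrected value is always slightly larger, and the gap shrinks as n grows.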
In [ ]:
"""
A random variable is a variable whose possible values are outcomes of a random
phenomenon. In other words, it is a numerical quantity that can take on different
values as a result of a random process or experiment.
A discrete random variable is one that can only take on a countable number of distinct
values. The values of a discrete random variable are often integers.
Example: The number of heads obtained when flipping a coin three times is a discrete
random variable. Possible values include 0, 1, 2, or 3.
A continuous random variable is one that can take on any value within a specified
range or interval. The values of a continuous random variable are typically real numbers.
Example: The height of a person selected at random from a population is a continuous
random variable. It can take on any value within a range, such as 150 cm to 200 cm.
"""
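For the coin-flip example, the distribution of the discrete random variable can be enumerated exactly. This sketch assumes a fair coin, which the notes do not state explicitly.

```python
from collections import Counter
from itertools import product

# All 2**3 = 8 equally likely outcomes of three flips (fair coin assumed)
outcomes = list(product("HT", repeat=3))

# The random variable X maps each outcome to its number of heads
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability mass function: P(X = k) for k = 0, 1, 2, 3
pmf = {k: heads_counts[k] / len(outcomes) for k in sorted(heads_counts)}

print(pmf)
```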
In [ ]:
"""
How is missing data handled in statistics?
There are many ways to handle missing data in Statistics:
"""
In [ ]:
"""
Dispersion in statistics:
Dispersion in statistics refers to the extent to which data points in a dataset spread
out or vary from the central value, such as the mean or median. It provides information
about the spread, variability, or consistency of the data distribution.
Variance
Standard Deviation
Range
Interquartile Range (IQR)
"""
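The four measures listed above can be computed with the standard library. The data values are arbitrary, and the quartiles use the default "exclusive" method of `statistics.quantiles`; other quartile conventions give slightly different IQR values.

```python
import statistics

data = [4, 8, 15, 16, 23, 42]  # arbitrary illustrative values

variance = statistics.variance(data)   # sample variance (divides by n-1)
std_dev = statistics.stdev(data)       # square root of the sample variance
data_range = max(data) - min(data)     # largest value minus smallest value

# Quartiles with the default method="exclusive"
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # spread of the middle 50% of the data

print(round(variance, 2), round(std_dev, 2), data_range, iqr)
```

The range depends only on the two extreme values, while the IQR ignores them, which is why the IQR is the more outlier-resistant of the two.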
In [ ]:
"""
What is the meaning of the five-number summary in Statistics?
The five-number summary consists of five values that together cover the entire range
of the data:
Minimum
First quartile (Q1)
Median (Q2)
Third quartile (Q3)
Maximum
In [ ]:
"""
What are the range and the IQR, and how do we differentiate them?
In [ ]:
"""
What does a large dispersion value mean?
High Variability:
Spread of Data:
The larger the standard deviation, the greater the spread of data points around the
mean. Data points may be widely dispersed across the dataset, indicating a diverse
range of values and potential outliers.
Heterogeneity:
A large standard deviation suggests that the dataset is heterogeneous, with data points
varying widely in magnitude or value. This heterogeneity may reflect diverse
characteristics, behaviors, or conditions within the dataset.
Uncertainty:
Skewed Distribution:
In some cases, a large standard deviation may indicate a skewed distribution, where
data points are asymmetrically distributed around the mean. The presence of outliers
or extreme values can contribute to the skewness of the distribution.
Potential Outliers:
Large standard deviations may also suggest the presence of outliers or unusual data
points that significantly influence the variability of the dataset. These outliers
may represent rare events, measurement errors, or unique observations within the
dataset.
"""
In [ ]:
"""
Where can we use dispersion in the real world?
Quality Assurance:
The company must ensure that the dimensions of each microchip meet specific tolerance
levels to function correctly and fit within electronic devices. Understanding the
dispersion of the dimensions helps the company assess whether the manufacturing
process is producing microchips within acceptable limits.
Identifying Variability:
Process Improvement:
Understanding dispersion allows the company to identify areas for process improvement.
For example, if the standard deviation of microchip dimensions is consistently high,
the company may investigate factors such as machine calibration, material quality, or
operator training to reduce variability and improve product consistency.
"""
In [ ]:
"""
What is the use of a histogram?
A histogram is a graphical representation of the distribution of data. It consists of
a series of adjacent rectangles, or bars, where the width of each bar represents the
range of values for a particular interval, and the height of each bar represents the
frequency or count of data points within that interval. Histograms are widely used in
statistics and data analysis for visualizing the distribution of numerical data.
Uses of a histogram:
By examining the shape and pattern of the histogram, analysts can identify key
characteristics of the data distribution, such as whether it is symmetric, skewed,
bimodal, or uniform.
Outliers, or data points that significantly deviate from the rest of the dataset,
can often be identified in a histogram as isolated bars far away from the main body
of the distribution. This helps in detecting anomalies or errors in the data.
Histograms allow for a visual assessment of measures of central tendency (mean, median,
mode) and dispersion (range, standard deviation, variance) within the data set.
In [ ]:
"""
What is the use of skewness and kurtosis?
Skewness:
Skewness measures the asymmetry of the distribution around its mean. It quantifies the
degree to which the distribution is skewed to one side or the other.
A distribution can be positively skewed, negatively skewed, or approximately symmetric.
Positive skewness indicates that the tail of the distribution is longer on the right
side, while negative skewness indicates a longer tail on the left side.
A skewness of zero indicates perfect symmetry around the mean.
Kurtosis:
In [ ]:
"""
Skewness is a measure of the asymmetry of the probability distribution of a real-valued
random variable about its mean. It indicates the degree to which the data deviates from
symmetry. In simpler terms, skewness measures the lack of symmetry in a dataset's
distribution.
A distribution is positively skewed if the tail on the right-hand side (higher values)
is longer or fatter than the tail on the left-hand side (lower values).
In a positively skewed distribution, the mean is typically greater than the median.
Example: Income distributions often exhibit positive skewness, with a few individuals
earning exceptionally high incomes.
A distribution is negatively skewed if the tail on the left-hand side (lower values)
is longer or fatter than the tail on the right-hand side (higher values).
In a negatively skewed distribution, the mean is typically less than the median.
Example: The distribution of scores on a very easy exam might be negatively skewed,
as most students would score high, but a few might score very low.
Zero Skewness:
In [ ]:
"""
What is kurtosis?
Kurtosis measures how heavy the tails of a distribution are compared with a normal
distribution, that is, how prone the distribution is to producing extreme values.
The normal distribution has a kurtosis of 3 (an excess kurtosis of 0). As a rule of
thumb, skewness and excess kurtosis values between -2 and +2 are often considered
acceptable for treating data as approximately normal. Data sets with high kurtosis
have heavy tails, which usually signals the presence of outliers. Collecting more
data or investigating and, where justified, removing outliers can help with this.
Data sets with low kurtosis have light tails and few outliers.
"""
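Skewness and kurtosis can both be written in terms of central moments. The sketch below uses the population (uncorrected) moment formulas, which is an assumption; statistical packages often apply small-sample corrections and may report excess kurtosis (kurtosis minus 3) instead. The two data sets are made up.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]          # mirror image around 3
right_skewed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print(skewness(symmetric), round(kurtosis(symmetric), 2))
print(round(skewness(right_skewed), 2), round(kurtosis(right_skewed), 2))
```

The symmetric data has zero skewness, while the single large value in the second data set produces positive skewness and a higher kurtosis.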
In [ ]:
"""
What types of biases can you encounter while sampling?
Sampling bias occurs when a sample is not representative of the target population in
an investigation or a survey. The main types that one can encounter while sampling include:
Undercoverage bias: This type of bias occurs when some population members are
inadequately represented in the sample.
Survivorship bias: This type of bias occurs when a sample concentrates on the
‘surviving’ or existing observations and ignores those that have already ceased to
exist. This can lead to wrong conclusions in many different ways.
"""
In [ ]:
"""
What is the meaning of selection bias?
Selection bias occurs when individuals or groups are selected for analysis in a way
that is not random. Randomization plays a key role in performing analysis and
understanding model functionality better.
If correct randomization is not achieved, then the resulting sample will not accurately
represent the population.
"""
In [ ]:
"""
Sampling is a fundamental concept in statistics and data analysis that involves
selecting a subset of individuals, items, or observations from a larger population.
The purpose of sampling is to draw conclusions about the population based on the
characteristics of the sample, without having to study every individual in the
population, which may be impractical or impossible.
There are several types of sampling methods, each with its own advantages and
disadvantages.
Some common types of sampling include:
Simple Random Sampling: In this method, every individual in the population has an
equal chance of being selected, and each selection is independent of the others.
Simple random sampling is often conducted using random number generators or drawing
names from a hat.
Systematic Sampling: Systematic sampling involves selecting every nth individual from
the population after a random start. For example, if a population consists of 1000
individuals and a sample of 100 is desired, every 10th individual could be selected
after randomly choosing a starting point between 1 and 10.
"""
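Both sampling methods described above can be sketched with the standard `random` module. The population of 1000 numbered individuals mirrors the example in the text; the fixed seed is only there to make the run repeatable.

```python
import random

rng = random.Random(42)            # fixed seed, chosen arbitrarily
population = list(range(1, 1001))  # individuals numbered 1..1000
sample_size = 100

# Simple random sampling: every individual has the same chance of selection
simple_sample = rng.sample(population, sample_size)

# Systematic sampling: random start, then every 10th individual
step = len(population) // sample_size   # 10
start = rng.randrange(step)             # index 0..9, i.e. individual 1..10
systematic_sample = population[start::step]

print(sorted(simple_sample)[:5])
print(systematic_sample[:5])
```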
In [ ]:
"""
Box plots, also known as box-and-whisker plots, offer several benefits in data analysis
and visualization:
Identification of Outliers:
Box plots visually highlight potential outliers in the dataset, which are data points
that lie outside the whiskers of the plot. Outliers can be easily identified as
individual data points beyond the range of the whiskers.
Box plots are robust to the effects of skewed distributions and outliers.
They provide a clearer representation of the central tendency and variability,
even in the presence of extreme values.
"""
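The numbers behind a box plot can be computed without drawing anything. This sketch assumes the common 1.5 × IQR convention for whisker limits, which the notes do not specify, and uses made-up data with one unusual value.

```python
import statistics

# Hypothetical measurements; 120 is deliberately unusual
data = [52, 55, 57, 58, 60, 61, 63, 65, 66, 120]

# Quartiles (the box); method="inclusive" treats the data as the full population
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: whiskers reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as potential outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary, outliers)
```

The box (Q1, median, Q3) is unaffected by how extreme the largest value is, which is the robustness property described above.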
In [ ]:
"""
What can you do with an outlier?
Outliers affect A/B testing, and they can be either removed or kept depending on
what the situation demands or what the data set requires.
In [ ]:
"""
What is meant by mean imputation for missing data? Why is it bad?
Mean imputation is a rarely used practice where null values in a dataset are replaced
directly with the corresponding mean of the data.
In [ ]:
"""
Covariance is a statistical measure that quantifies the degree to which two variables
change together. In other words, it measures the directional relationship between two
random variables. If the covariance between two variables is positive, it indicates
that they tend to move in the same direction. Conversely, if the covariance is
negative, it suggests that the variables move in opposite directions.
"""
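A minimal sketch of the sample covariance, written out by hand so the formula is visible. The variable names and values are hypothetical.

```python
def sample_covariance(xs, ys):
    """Average product of paired deviations from the means (divides by n-1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: study hours, exam scores, and number of errors
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]   # rises with hours
errors = [9, 7, 6, 4, 2]        # falls with hours

cov_scores = sample_covariance(hours, scores)  # positive: move together
cov_errors = sample_covariance(hours, errors)  # negative: move oppositely

print(cov_scores, cov_errors)
```

Only the sign is directly interpretable here; the magnitude depends on the units of both variables.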
In [ ]:
"""
What is the meaning of covariance?
Covariance is a measure of how two variables vary together. It describes the
systematic relationship between a pair of random variables, i.e., whether a change in
one tends to be accompanied by a change in the other.
"""
In [ ]:
"""
What is the use of covariance ?
Covariance analysis helps in selecting relevant features for machine learning models.
Features with high covariance with the target variable may be good candidates for
predictive modeling, although covariance depends on the units of the variables.
Covariance matrices are also used in techniques like Principal Component Analysis
(PCA) for dimensionality reduction. PCA aims to find orthogonal components that
capture the maximum variance in the data, and it derives them from the covariance
matrix.
In [ ]:
"""
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It quantifies how closely related two variables
are and the direction of their relationship. Correlation analysis is crucial in
understanding the association between variables and is widely used in various fields,
including finance, economics, psychology, and scientific research.
Types of correlation:
r=0: No correlation
The strength of the correlation is determined by the magnitude of r, while the sign
(+ or -) indicates the direction of the relationship. The Pearson correlation
coefficient assumes that the relationship between variables is linear; significance
tests on r additionally assume that the data are approximately normally distributed.
In [ ]:
"""
Why do we use correlation?
Feature Selection:
Correlation analysis helps in selecting the most relevant features (variables) for
predictive models. Features that are highly correlated with the target variable are
often considered important predictors.
Multicollinearity Detection:
Data Exploration:
Dimensionality Reduction:
In [ ]:
"""
What are some of the properties of a normal distribution?
In [ ]:
"""
What general conditions must be satisfied for the central limit theorem to hold?
Here are the conditions that must be satisfied for the central limit theorem to hold:
Randomization: the data must be sampled randomly.
Independence: the sample values must be independent of each other.
Sample size: the sample must be sufficiently large. A common rule of thumb is a
sample size of at least 30, although strongly skewed populations may need more.
"""
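A small simulation can illustrate the sample-size condition. The choice of an exponential population, the sample sizes 2 and 30, the 2000 repetitions, and the seed are all assumptions made for this sketch.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Mean of n independent draws from an exponential population (mean 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])


def skewness(values):
    """Population skewness: third central moment over the cubed std deviation."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(x - mean) ** 2 for x in values])
    m3 = statistics.fmean([(x - mean) ** 3 for x in values])
    return m3 / m2 ** 1.5


# Distribution of the sample mean for a small and a larger sample size
means_small = [sample_mean(2) for _ in range(2000)]
means_large = [sample_mean(30) for _ in range(2000)]

for label, means in (("n=2", means_small), ("n=30", means_large)):
    print(label, round(statistics.fmean(means), 3),
          round(statistics.stdev(means), 3), round(skewness(means), 3))
```

The population itself is strongly right-skewed, yet the sample means for n=30 cluster tightly around 1 and are much less skewed than those for n=2, which is the behaviour the theorem describes.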
In [ ]:
"""
Where are long-tailed distributions used?
A long-tailed distribution is a type of distribution where the tail drops off
gradually toward the end of the curve.
The Pareto principle and the product sales distribution are good examples of
long-tailed distributions.
Long-tailed distributions also appear frequently in classification and regression
problems.
"""
In [ ]:
"""
What is exploratory data analysis?
Exploratory data analysis is the process of performing investigations on data to
understand the data better.
In [ ]:
"""
What is DOE?
DOE is an acronym for Design of Experiments in statistics. It is the systematic
planning of a study to describe how a response changes when the independent input
variables are changed.
"""
In [ ]:
"""
What is the meaning of KPI in statistics?
KPI stands for Key Performance Indicator. It is used as a reliable metric
to measure the success of a company with respect to achieving its required business
objectives.
In [ ]:
"""
What is the meaning of standard deviation?
Standard deviation represents the magnitude of how far the data points are from the
mean. A low value of standard deviation is an indication of the data being close to
the mean, and a high value indicates that the data is spread to extreme ends, far
away from the mean.
"""
In [ ]:
""" What is correlation?
Correlation is used to measure the relationship between quantitative variables (and,
with rank-based coefficients, ordinal variables). Unlike covariance, correlation tells
us how strong the relationship is, because it is standardized. The value of
correlation between two variables ranges from -1 to +1.
A value of -1 represents a perfect negative linear correlation, i.e., as the value of
one variable increases, the value of the other decreases along a straight line.
Similarly, +1 means a perfect positive correlation, where an increase in one variable
goes with an increase in the other. A value of 0 means there is no linear correlation.
If two predictor variables are strongly correlated with each other (multicollinearity),
they may have a negative impact on the statistical model, and one of them is often
dropped.
"""
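The three reference values of the correlation coefficient can be reproduced with a hand-written Pearson formula. The data are constructed for illustration; the last case shows that a value of 0 rules out only a linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # y = 2x, perfect positive
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # y = 12 - 2x, perfect negative
r_nonlinear = pearson_r(x, [4, 1, 0, 1, 4])   # y = (x - 3)**2, no linear trend

print(r_positive, r_negative, r_nonlinear)
```

In the last case y is completely determined by x, yet the coefficient is 0 because the relationship is not linear.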
In [ ]:
"""
What is the relationship between the confidence level and the significance level in
statistics?
Both significance and confidence level are related by the following formula:
Significance level (α) = 1 − Confidence level
In [ ]:
"""
What types of variables are used for Pearson’s correlation coefficient?
Variables used for Pearson’s correlation coefficient must be measured on either a
ratio or an interval scale.
Note that one variable can be on a ratio scale while the other is on an interval
scale.
"""
In [ ]:
"""
What are the examples of symmetric distribution?
Symmetric distribution means that the data on the left side of the median mirrors
the data on the right side of the median.
There are many examples of symmetric distribution, but the following three are the
most widely used ones:
Uniform distribution
Binomial distribution (symmetric when p = 0.5)
Normal distribution
"""
In [ ]:
"""
Where is inferential statistics used?
Inferential statistics is used for several purposes, such as research, in which we wish
to draw conclusions about a population using some sample data. This is performed in a
variety of fields, ranging from government operations to quality control and quality
assurance teams in multinational corporations.
"""
In [ ]:
"""
What are the scenarios where outliers are kept in the data?
There are not many scenarios where outliers are kept in the data, but there are some
important situations when they are kept. They are kept in the data for analysis if:
In [ ]:
"""
What is the meaning of degrees of freedom (DF) in statistics?
Degrees of freedom (DF) is the number of values in a calculation that are free to
vary. It is mostly used with the t-distribution and not with the z-distribution.
As DF increases, the t-distribution gets closer to the normal distribution. As a rule
of thumb, when DF > 30 the t-distribution is very close to the normal distribution.
"""
In [ ]:
"""
What are some of the techniques to reduce underfitting and overfitting during model
training?
Underfitting refers to a situation where the model has high bias and low variance, while
overfitting is the situation where the model has high variance and low bias.
"""
In [ ]:
"""
Does a symmetric distribution need to be unimodal?
A symmetric distribution does not need to be unimodal (having only one mode or one
value that occurs most frequently). It can be bi-modal (having two values that have
the highest frequencies) or multi-modal (having multiple or more than two values that
have the highest frequencies).
"""
In [ ]:
"""
What is the impact of outliers in statistics?
Outliers can have a strong negative impact because they skew the result of many
statistical calculations. For example, if we calculate the mean of a dataset that
contains outliers, the result can differ noticeably from the mean we would get once
the outliers are removed.
"""
In [ ]:
"""
When creating a statistical model, how do we detect overfitting?
In [ ]:
"""
What is the relationship between standard deviation and variance?
Standard deviation is the square root of the variance. Variance is the average of the
squared deviations from the mean, so it is expressed in squared units. Standard
deviation describes the same spread in the original units of the data, which makes it
easier to interpret.
"""
In [ ]:
"""
What is the difference between inferential and descriptive statistics?
Inferential statistics attempts to infer from some sample to the larger population.
"""
In [ ]:
"""
What is the difference between long format and wide format data?
Wide format is where we have a single row for every data point with multiple
columns to hold the values of various attributes.
The long format is where for each data point we have as many rows as the number
of attributes and each row contains the value of a particular attribute for a given
data point.
"""
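The two layouts can be illustrated with plain Python dictionaries. The student records and the helper name `wide_to_long` are hypothetical and exist only for this sketch; data-frame libraries offer their own reshaping functions.

```python
# Wide format: one row per data point, one column per attribute
wide = [
    {"student": "A", "math": 90, "physics": 85},
    {"student": "B", "math": 72, "physics": 78},
]


def wide_to_long(rows, id_key):
    """Hypothetical helper: emit one row per (data point, attribute) pair."""
    long_rows = []
    for row in rows:
        for attribute, value in row.items():
            if attribute != id_key:
                long_rows.append(
                    {id_key: row[id_key], "attribute": attribute, "value": value}
                )
    return long_rows


# Long format: as many rows per data point as there are attributes
long_rows = wide_to_long(wide, "student")
for row in long_rows:
    print(row)
```

Two students with two attributes each become four rows in the long format.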
In [ ]:
"""
What do you understand by the term Normal Distribution?
"""
In [ ]:
"""
What are some of the properties of a normal distribution?
Unimodal: a normal distribution has only one peak (i.e., one mode).
The mean, mode, and median are all located at the centre (i.e., they are all equal).
Asymptotic: normal distributions are continuous and have tails that are asymptotic.
The curve approaches the x-axis but never touches it.
"""
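These properties can be checked numerically with `statistics.NormalDist` from the standard library. The parameters (mean 100, standard deviation 15) are arbitrary choices for this sketch.

```python
import math
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)  # arbitrary illustrative parameters

# Mean, median, and mode coincide at the centre
center = (dist.mean, dist.median, dist.mode)

# Unimodal: the density is highest at the centre
peak = dist.pdf(dist.mean)
off_peak = dist.pdf(dist.mean + 1)

# The density is the same at equal distances on either side of the centre
mirrored = math.isclose(dist.pdf(85), dist.pdf(115))

# Asymptotic tails: ten standard deviations out, the density is tiny but positive
tail_density = dist.pdf(100 + 10 * 15)

print(center, mirrored, tail_density > 0)
```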
In [ ]: