Statistics 1

day1 questions
In [ ]:
"""
OBSERVATIONAL AND EXPERIMENTAL DATA
Observational Data:
Observational data are collected by observing and recording the natural behavior of
individuals or groups without any intervention or manipulation by the researcher.
Experimental Data:
Experimental data are collected through controlled experiments where the researcher
deliberately manipulates one or more variables to observe the effect on another
variable.
"""
In [ ]:
"""
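What is statistics?
Statistics is the science of collecting, organizing, analyzing, interpreting, and
presenting data.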
"""
In [ ]:
"""
Difference between Descriptive and Inferential Statistics:
Descriptive Statistics:
Descriptive statistics involves organizing, summarizing, and presenting data in a
meaningful way. It helps to describe and visualize the main features of a dataset.
Descriptive statistics do not involve making inferences or generalizations beyond the
data at hand.
Example: Calculating the mean, median, and mode of a set of exam scores to understand
the central tendency and distribution of student performance.
Inferential Statistics:
Inferential statistics involves using sample data to make inferences or predictions
about a larger population. It allows researchers to draw conclusions and test
hypotheses based on sample data, extrapolating findings to the population from which
the sample was drawn.
"""
In [ ]:
"""
What topics do we cover in descriptive statistics?
Descriptive Statistics:
Measures of Central Tendency:
Mean, Median, Mode: You learn how to calculate and interpret these measures that
describe the center of a dataset.
Measures of Dispersion:
Variance, Standard Deviation, Range: You understand how to quantify the spread or
variability of data around the central tendency.
Data Visualization:
Histograms, Bar Charts, Pie Charts: You learn how to create and interpret graphical
representations of data to identify patterns and trends.
Summary Statistics:
Percentiles, Quartiles: You learn additional measures that provide insights into the
distribution of data beyond the mean and median.
"""
In [ ]:
"""
What types of data are there?
1. Qualitative (Categorical) Data
Definition: Qualitative data represents categories or groups and describes non-numeric
characteristics or attributes.
Subtypes:
Nominal Data:
Description: Categories without any inherent order or ranking.
Examples: Gender (male, female), Eye color (blue, green, brown), Marital status
(single, married, divorced).
Ordinal Data:
Description: Categories with a specific order or ranking, but the intervals between
the ranks are not necessarily equal.
Examples: Education level (high school, bachelor’s, master’s, PhD), Customer satisfaction
(very dissatisfied, dissatisfied, neutral, satisfied, very satisfied).
2. Quantitative (Numerical) Data
Definition: Quantitative data represents numerical values and can be measured or counted.
Subtypes:
Discrete Data:
Description: Data that can only take specific, separate values, often counts of items.
Discrete data is usually counted and not measured.
Examples: Number of children in a family, Number of cars in a parking lot, Number
of defective products in a batch.
Continuous Data:
Description: Data that can take any value within a given range. Continuous data is
usually measured and can be infinitely divided into smaller parts.
Examples: Height, Weight, Temperature, Time, Distance.
"""
In [ ]:
"""
Data measurement scales:
Nominal Data:
This type of data represents categories or labels with no inherent order or ranking.
Examples: Gender (male, female), Marital Status (single, married, divorced), Colors
(red, blue, green).
Ordinal Data:
Ordinal data represents categories with a natural order or ranking.
The intervals between the categories may not be equal.
Examples: Educational Levels (high school, bachelor's degree, master's degree),
Socioeconomic Status (low, medium, high), Rating Scales (poor, fair, good, excellent).
Interval Data:
Interval data represent values where the difference between any two values is
meaningful and consistent. However, there is no true zero point.
Examples: Temperature (measured in Celsius or Fahrenheit), Years (e.g., 2000, 2005,
2010), IQ Scores.
Ratio Data:
Ratio data have all the characteristics of interval data with the addition of a true
zero point. Ratios of values are meaningful.
Examples: Height, Weight, Distance, Time in seconds, Counts (number of cars, number
of people).
"""
In [ ]:
"""
Sample vs. population?
Population:
The population refers to the entire group or set of individuals, objects, or events
that possess certain characteristics and are of interest to the researcher.
It represents the complete collection of elements under consideration in a study.
The population is often large and may be difficult or impractical to study in its
entirety.
Example: If you're studying the average height of adults in a country, the entire
adult population of that country would constitute the population.
Sample:
A sample is a subset or a smaller representative group selected from the population.
It is chosen in such a way that it reflects the characteristics and diversity of the
population it represents. The purpose of sampling is to gather data efficiently and
cost-effectively while maintaining the ability to generalize findings to the population.
Example: Instead of measuring the height of every adult in the country, you might
select a random sample of 1000 adults and measure their heights. This subset is the
sample.
"""
In [ ]:
"""
What is a measure of central tendency?
A measure of central tendency is a statistical summary of a dataset that represents a
central or typical value around which the data tend to cluster. It provides insight
into the center or average of the data distribution.
Mean:
The mean, often referred to as the average, is calculated by summing up all the values
in a dataset and dividing by the total number of values.
Formula: Mean = (Sum of all values) / (Number of values)
The mean is sensitive to outliers, meaning that extreme values can significantly
impact its value.
Median:
The median is the middle value of a dataset when the values are arranged in ascending
or descending order. If the dataset has an odd number of values, the median is the
middle value. If the dataset has an even number of values, the median is the average
of the two middle values. The median is less affected by outliers compared to the mean,
making it a more robust measure of central tendency in skewed distributions.
Mode:
The mode is the value that occurs most frequently in a dataset.
A dataset may have one mode (unimodal), two modes (bimodal), or more modes (multimodal).
Unlike the mean and median, the mode can be used with nominal (categorical) data.
"""
In [ ]:
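"""
A minimal sketch of the three measures above, using Python's built-in statistics
module. The exam scores and eye colors are made-up values for illustration only.
"""
```python
import statistics

# Hypothetical exam scores, invented for this example.
scores = [55, 60, 60, 70, 75, 80, 90]

mean_score = statistics.mean(scores)      # sum of all values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

# The mode also works for nominal (categorical) data. multimode returns
# every value tied for most frequent, which covers bimodal data.
eye_colors = ["blue", "brown", "brown", "green", "blue"]
color_modes = statistics.multimode(eye_colors)

print(mean_score, median_score, mode_score, color_modes)
```
In [ ]: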
"""
When is the median preferred over the mean?
Robustness to Outliers:
The median is less sensitive to outliers or extreme values compared to the mean.
Outliers can heavily skew the mean, affecting its representativeness of the central
tendency of the dataset. In such cases, the median provides a more robust estimate.
Skewed Distributions:
In skewed distributions, where the data is not symmetrically distributed around the
center, the median can provide a better representation of the typical value.
This is because the median is not influenced by the extreme values at the tails of
the distribution, as the mean might be.
Ordinal Data:
When dealing with ordinal data or ranked data, the median is often more appropriate.
For example, in ranking survey responses from least to most favorable, the median
represents the middle response, providing a clear indication of the central position.
"""
In [ ]:
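"""
To illustrate the robustness point, the sketch below adds a single extreme value to a
small list of salaries and compares how far the mean and the median move. The salary
figures are hypothetical.
"""
```python
import statistics

# Hypothetical salaries (in thousands); 400 plays the role of an outlier.
salaries = [30, 32, 35, 36, 40]
salaries_with_outlier = salaries + [400]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(salaries_with_outlier)
median_after = statistics.median(salaries_with_outlier)

# The mean is pulled strongly toward the outlier; the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```
In [ ]: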
"""
When is the mean preferred over the median?
Symmetric Distributions:
In symmetric distributions with no outliers, the mean and median are usually very
close to each other, and the mean can provide a precise measure of central tendency.
Interval or Ratio Data:
For interval or ratio data that follow a normal distribution without outliers, the
mean is often the most appropriate measure of central tendency.
"""
In [ ]:
"""
Where do we use the concept of central tendency?
Summary and Description
Data Interpretation
Comparison
Decision Making
Inference and Estimation
Modeling and Prediction
Quality Control
Communication
"""
In [ ]:
"""
What is the use of Bessel's correction in statistics?
It is used to provide an unbiased estimate of the population variance, and a less
biased estimate of the population standard deviation, based on a sample of the
population.
The reason for Bessel's correction lies in the fact that when you calculate the sample
variance and standard deviation, you are using the sample mean as an estimate of the
population mean. Because deviations are measured from the sample mean rather than the
true population mean, the result tends to slightly underestimate the population
variance and standard deviation, especially for smaller sample sizes.
Bessel's correction adjusts for this underestimation by dividing the sum of squared
deviations (used in the calculation of variance) by n − 1 instead of n, where n is the
sample size.
"""
In [ ]:
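"""
A small sketch comparing the two divisors on a made-up sample. It computes the sum of
squared deviations by hand and checks the results against statistics.pvariance
(divide by n) and statistics.variance (divide by n - 1).
"""
```python
import statistics

# Hypothetical sample, chosen so the arithmetic is easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
sample_mean = sum(sample) / n
sum_sq_dev = sum((x - sample_mean) ** 2 for x in sample)

var_divide_by_n = sum_sq_dev / n      # no correction (biased as an estimate)
var_bessel = sum_sq_dev / (n - 1)     # Bessel's correction

print(var_divide_by_n, statistics.pvariance(sample))
print(var_bessel, statistics.variance(sample))
```
In [ ]: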
"""
A random variable is a variable whose possible values are outcomes of a random
phenomenon. In other words, it is a numerical quantity that can take on different
values as a result of a random process or experiment.
Discrete Random Variable:
A discrete random variable is one that can only take on a countable number of distinct
values. The values of a discrete random variable are often integers.
Example: The number of heads obtained when flipping a coin three times is a discrete
random variable. The possible values are 0, 1, 2, and 3.
Continuous Random Variable:
A continuous random variable is one that can take on any value within a specified
range or interval. The values of a continuous random variable are typically real numbers.
Example: The height of a person selected at random from a population is a continuous
random variable. It can take on any value within a range, such as 150 cm to 200 cm.
"""
In [ ]:
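"""
The coin example can be worked out by enumeration. The sketch below lists all outcomes
of three flips, assuming a fair coin (so every outcome is equally likely), and builds
the probability of each possible number of heads.
"""
```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 outcomes of three flips, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

# The random variable maps each outcome to its number of heads.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# Assumption: a fair coin, so each outcome has probability 1/8.
pmf = {k: Fraction(v, len(outcomes)) for k, v in sorted(heads_counts.items())}

for heads, probability in pmf.items():
    print(heads, probability)
```
In [ ]: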
"""
How is missing data handled in statistics?
There are many ways to handle missing data in statistics:
Prediction of the missing values
Assignment of individual (unique) values
Deletion of rows that have missing data
Mean imputation or median imputation
Using models that support missing values, such as some random forest implementations
"""
In [ ]:
"""
Dispersion in statistics:
Dispersion in statistics refers to the extent to which data points in a dataset spread
out or vary from the central value, such as the mean or median. It provides information
about the spread, variability, or consistency of the data distribution.
Here are the topics related to dispersion in statistics:
Variance
Standard Deviation
Range
Interquartile Range (IQR)
"""
In [ ]:
"""
What is the meaning of the five-number summary in Statistics?
The five-number summary consists of five values that together describe the entire range
of the data, as shown below:
Low extreme (Min)
First quartile (Q1)
Median
Upper quartile (Q3)
High extreme (Max)
"""
In [ ]:
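"""
A sketch of computing a five-number summary with statistics.quantiles on made-up data.
The choice of method="inclusive" is an assumption; quartile conventions differ between
textbooks and libraries, so other tools may give slightly different Q1 and Q3.
"""
```python
import statistics

# Hypothetical data, invented for this example.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 splits the data into quartiles and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)
```
In [ ]: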
"""
What are the range and the IQR, and how do they differ?
Calculate the Range:
The range is the difference between the maximum and minimum values in the dataset.
Formula: Range = Maximum Value - Minimum Value
Calculate the Interquartile Range (IQR):
The interquartile range (IQR) is a measure of statistical dispersion, which represents
the range of the middle 50% of the dataset. To calculate the IQR, you first need to
find the first quartile (Q1) and the third quartile (Q3). Q1 is the value below which
25% of the data fall, and Q3 is the value below which 75% of the data fall.
Once you have Q1 and Q3, calculate the IQR by subtracting Q1 from Q3.
Formula: IQR = Q3 - Q1
Difference: The range uses only the two most extreme values, so it is very sensitive
to outliers. The IQR ignores the lowest 25% and the highest 25% of the data, so it is
much more robust.
"""
In [ ]:
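"""
A short sketch of both formulas on made-up data that contains one extreme value. The
quartiles come from statistics.quantiles with method="inclusive", which is one of
several quartile conventions and is an assumption here.
"""
```python
import statistics

# Hypothetical data; 50 is far from the other values.
data = [2, 4, 4, 5, 7, 8, 9, 11, 50]

data_range = max(data) - min(data)  # Range = Maximum Value - Minimum Value

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # IQR = Q3 - Q1

# The single extreme value inflates the range but not the IQR.
print("range:", data_range)
print("IQR:  ", iqr)
```
In [ ]: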
"""
What does a large dispersion value mean?
High Variability:
A large standard deviation indicates high variability or dispersion of data points
around the mean. This means that the data values deviate significantly from the
average value, resulting in a wider distribution.
Spread of Data:
The larger the standard deviation, the greater the spread of data points around the
mean. Data points may be widely dispersed across the dataset, indicating a diverse
range of values and potential outliers.
Heterogeneity:
A large standard deviation suggests that the dataset is heterogeneous, with data points
varying widely in magnitude or value. This heterogeneity may reflect diverse
characteristics, behaviors, or conditions within the dataset.
Uncertainty:
A large standard deviation implies greater uncertainty or unpredictability in the
dataset. It indicates that individual data points may vary considerably from the mean,
making it challenging to make precise predictions or draw conclusions about the dataset.
Skewed Distribution:
In some cases, a large standard deviation may indicate a skewed distribution, where
data points are asymmetrically distributed around the mean. The presence of outliers
or extreme values can contribute to the skewness of the distribution.
Potential Outliers:
Large standard deviations may also suggest the presence of outliers or unusual data
points that significantly influence the variability of the dataset. These outliers
may represent rare events, measurement errors, or unique observations within the
dataset.
"""
In [ ]:
"""
Where can we use dispersion in the real world?
Consider a company that manufactures microchips.
Quality Assurance:
The company must ensure that the dimensions of each microchip meet specific tolerance
levels to function correctly and fit within electronic devices. Understanding the
dispersion of the dimensions helps the company assess whether the manufacturing
process is producing microchips within acceptable limits.
Identifying Variability:
By analyzing measures of dispersion such as standard deviation or range, the company
can identify the variability in microchip dimensions across production batches or
manufacturing lines. High dispersion may indicate inconsistencies or errors in the
manufacturing process that need to be addressed.
Minimizing Defects and Waste:
High dispersion in microchip dimensions increases the likelihood of producing defective
or non-conforming products. By monitoring and controlling dispersion, the company can
minimize defects, reduce waste, and improve overall product quality.
Process Improvement:
Understanding dispersion allows the company to identify areas for process improvement.
For example, if the standard deviation of microchip dimensions is consistently high,
the company may investigate factors such as machine calibration, material quality, or
operator training to reduce variability and improve product consistency.
"""
In [ ]:
"""
What is the use of a histogram?
A histogram is a graphical representation of the distribution of data. It consists of
a series of adjacent rectangles, or bars, where the width of each bar represents the
range of values for a particular interval, and the height of each bar represents the
frequency or count of data points within that interval. Histograms are widely used in
statistics and data analysis for visualizing the distribution of numerical data.
Uses of a histogram:
Visualization of Data Distribution:
Histograms provide a visual summary of the distribution of data, allowing analysts to
quickly understand the shape, center, and spread of the data set.
Identification of Patterns and Trends:
By examining the shape and pattern of the histogram, analysts can identify key
characteristics of the data distribution, such as whether it is symmetric, skewed,
bimodal, or uniform.
Detection of Outliers and Anomalies:
Outliers, or data points that significantly deviate from the rest of the dataset,
can be identified in a histogram as isolated bars separated by a gap from the main
body of the data. This helps in detecting anomalies or errors in the data.
Understanding Central Tendency and Dispersion:
Histograms allow for a visual assessment of measures of central tendency (mean, median,
mode) and dispersion (range, standard deviation, variance) within the data set.
Comparison Between Groups or Datasets:
Histograms facilitate comparisons between different groups or datasets by displaying
their respective distributions side by side. This helps in identifying similarities,
differences, and patterns between the groups.
"""
In [ ]:
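"""
A rough sketch of what a histogram does under the hood: group values into equal-width
intervals and count how many fall in each. The scores and the bin width of 10 are
made-up choices, and the bars are drawn as text instead of with a plotting library.
"""
```python
from collections import Counter

# Hypothetical exam scores, invented for this example.
scores = [52, 58, 61, 64, 67, 68, 71, 73, 74, 75, 78, 79, 82, 85, 91]
bin_width = 10

# Map each score to the lower edge of its interval, e.g. 67 -> 60.
bin_counts = Counter((score // bin_width) * bin_width for score in scores)

# One text bar per interval; the bar length is the frequency.
for lower in sorted(bin_counts):
    upper = lower + bin_width - 1
    print(f"{lower}-{upper} | {'#' * bin_counts[lower]}")
```
In [ ]: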
"""
Use of skewness and kurtosis?
Skewness:
Skewness measures the asymmetry of the distribution around its mean. It quantifies the
degree to which the distribution is skewed to one side or the other.
A distribution can be positively skewed, negatively skewed, or approximately symmetric.
Positive skewness indicates that the tail of the distribution is longer on the right
side, while negative skewness indicates a longer tail on the left side.
A skewness of zero is what a perfectly symmetric distribution has.
Kurtosis:
Kurtosis quantifies the degree to which the distribution's tails are heavier or lighter
than those of a normal distribution. It is often described as peakedness or flatness,
but it is driven mainly by the tails.
A distribution with positive excess kurtosis has heavier tails, and often a sharper
peak, than a normal distribution (leptokurtic). A distribution with negative excess
kurtosis has lighter tails, and often a flatter peak, than a normal distribution
(platykurtic).
A normal distribution has a kurtosis of 3, which corresponds to an excess kurtosis
(kurtosis minus 3) of zero.
"""
In [ ]:
"""
Skewness is a measure of the asymmetry of the probability distribution of a real-valued
random variable about its mean. It indicates the degree to which the data deviates from
symmetry. In simpler terms, skewness measures the lack of symmetry in a dataset's
distribution.
Positive Skewness (Right Skewness):
A distribution is positively skewed if the tail on the right-hand side (higher values)
is longer or fatter than the tail on the left-hand side (lower values).
In a positively skewed distribution, the mean is typically greater than the median.
Example: Income distributions often exhibit positive skewness, with a few individuals
earning exceptionally high incomes.
Negative Skewness (Left Skewness):
A distribution is negatively skewed if the tail on the left-hand side (lower values)
is longer or fatter than the tail on the right-hand side (higher values).
In a negatively skewed distribution, the mean is typically less than the median.
Example: The distribution of scores on a very easy exam might be negatively skewed,
as most students would score high, but a few might score very low.
Zero Skewness:
A symmetric distribution has zero skewness. This means that the distribution is
balanced and evenly distributed around its mean. In a symmetric distribution with a
single peak, the mean, median, and mode are all equal.
Example: The standard normal distribution (bell-shaped curve) has zero skewness.
"""
In [ ]:
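"""
A minimal sketch of a moment-based skewness calculation, checked on made-up data. The
formula used here (third central moment divided by the 1.5 power of the second) is the
population version; libraries may apply small-sample corrections, so their values can
differ slightly.
"""
```python
import statistics


def skewness(values):
    """Population skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical datasets, invented for this example.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long tail toward high values
left_skewed = [-x for x in right_skewed]    # mirror image, tail toward low values

for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(name, round(skewness(data), 3),
          "mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data))
```
In [ ]: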
"""
What is kurtosis?
Kurtosis is a measure of how heavy the tails of a distribution are, that is, how often
extreme values occur compared with a normal distribution.
The normal distribution has a kurtosis of 3 (an excess kurtosis of 0). As a rule of
thumb, values of skewness and excess kurtosis between -2 and +2 are often considered
acceptable for treating data as approximately normal. Data sets with a high level of
kurtosis tend to contain outliers.
One may need to add data or investigate and remove outliers to overcome this problem.
Data sets with low kurtosis levels have light tails and few outliers.
"""
In [ ]:
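"""
A minimal sketch of a moment-based excess kurtosis calculation on two made-up datasets.
It uses the population formula m4 / m2**2 - 3, so a normal distribution would score 0.
Library functions may apply bias corrections and give somewhat different values.
"""
```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical datasets, invented for this example.
flat = list(range(1, 11))                       # evenly spread, light tails
heavy = [5, 5, 5, 5, 5, 5, 5, 5, -20, 30]       # mostly identical, two extremes

print("flat: ", round(excess_kurtosis(flat), 3))    # negative (platykurtic)
print("heavy:", round(excess_kurtosis(heavy), 3))   # positive (leptokurtic)
```
In [ ]: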
"""
What types of biases can you encounter while sampling?
Sampling bias occurs when a sample is not representative of a target population during
an investigation or a survey. The three main types that one can encounter while
sampling are:
Selection bias: It involves the selection of individual or grouped data in a way
that is not random.
Undercoverage bias: This type of bias occurs when some population members are
inadequately represented in the sample.
Survivorship bias: It occurs when a sample concentrates on the ‘surviving’ or existing
observations and ignores those that have already ceased to exist. This can lead to
wrong conclusions in many different ways.
"""
In [ ]:
"""
What is the meaning of selection bias?
Selection bias is a phenomenon that involves the selection of individual or grouped
data in a way that is not considered to be random. Randomization plays a key role in
performing analysis and understanding model functionality better.
If correct randomization is not achieved, then the resulting sample will not accurately
represent the population.
"""
In [ ]:
"""
Sampling is a fundamental concept in statistics and data analysis that involves
selecting a subset of individuals, items, or observations from a larger population.
The purpose of sampling is to draw conclusions about the population based on the
characteristics of the sample, without having to study every individual in the
population, which may be impractical or impossible.
There are several types of sampling methods, each with its own advantages and
disadvantages.
Some common types of sampling include:
Simple Random Sampling: In this method, every individual in the population has an
equal chance of being selected, and each selection is independent of the others.
Simple random sampling is often conducted using random number generators or drawing
names from a hat.
Stratified Sampling: In stratified sampling, the population is divided into distinct
subgroups, or strata, based on certain characteristics (such as age, gender,
income level, etc.). Then, random samples are drawn from each stratum in proportion
to their size in the population. This method ensures that each subgroup is adequately
represented in the sample.
Systematic Sampling: Systematic sampling involves selecting every nth individual from
the population after a random start. For example, if a population consists of 1000
individuals and a sample of 100 is desired, every 10th individual could be selected
after randomly choosing a starting point between 1 and 10.
Cluster Sampling: In cluster sampling, the population is divided into clusters, or
groups, based on some criterion (such as geographical location). Then, a random
sample of clusters is selected, and data is collected from all individuals within the
chosen clusters. Cluster sampling is often more practical and cost-effective than other
methods, especially when the population is widely dispersed.
Convenience Sampling: Convenience sampling involves selecting individuals who are
readily available or easy to access. While convenient, this method may introduce bias
into the sample, as it may not accurately represent the entire population.
Snowball Sampling: Snowball sampling is a non-probability sampling method where
existing study subjects recruit future subjects from among their acquaintances.
This method is often used in studies where the population is difficult to access.
"""
In [ ]:
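"""
A sketch of three of the probability sampling methods above, using Python's random
module. The population of 1000 numbered individuals matches the systematic sampling
example; the two strata "A" and "B" and the fixed seed are assumptions made only to
keep the example small and reproducible.
"""
```python
import random

rng = random.Random(42)            # fixed seed so repeated runs give the same samples
population = list(range(1, 1001))  # hypothetical individuals numbered 1..1000

# Simple random sampling: every individual is equally likely, no repeats.
simple_sample = rng.sample(population, 100)

# Systematic sampling: random start among the first 10, then every 10th individual.
step = 10
start = rng.randrange(step)
systematic_sample = population[start::step]

# Stratified sampling: draw from each stratum in proportion to its size.
strata = {"A": population[:700], "B": population[700:]}  # hypothetical strata
fraction = 0.1
stratified_sample = {
    name: rng.sample(members, round(len(members) * fraction))
    for name, members in strata.items()
}

print(len(simple_sample), len(systematic_sample),
      {name: len(chosen) for name, chosen in stratified_sample.items()})
```
In [ ]: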
"""
Box plots, also known as box-and-whisker plots, offer several benefits in data analysis
and visualization:
Summary of Data Distribution:
Box plots provide a concise summary of the distribution of a dataset, including
measures of central tendency (median) and variability (interquartile range).
They offer insights into the spread, skewness, and presence of outliers in the data.
Comparison Between Groups:
Box plots allow for easy comparison of distributions between different groups or
categories. They help identify similarities, differences, and patterns in the data
across groups.
Identification of Outliers:
Box plots visually highlight potential outliers in the dataset, which are data points
that lie outside the whiskers of the plot. Outliers can be easily identified as
individual data points beyond the range of the whiskers.
Robustness to Skewed Distributions:
Box plots are robust to the effects of skewed distributions and outliers.
They provide a clearer representation of the central tendency and variability,
even in the presence of extreme values.
"""
In [ ]:
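"""
A sketch of how the whisker limits of a box plot are often computed. The notes above do
not say where the whiskers end; the 1.5 * IQR rule used here is a common convention and
is an assumption, as are the data values and the "inclusive" quartile method.
"""
```python
import statistics

# Hypothetical measurements; 45 is far from the rest.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 45]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("box:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```
In [ ]: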
"""
What can you do with an outlier?
Outliers affect A/B testing, and they can be either removed or kept depending on
what the situation demands or what the data set requires.
Here are some ways to deal with outliers in data –
Filter out outliers, especially when we have loads of data.
If a data point is wrong, it is best to remove it.
Alternatively, two results can be provided – one with outliers and one without.
During post-test analysis, outliers can be removed or modified. The best way to
modify them is to trim the data set. If there are a lot of outliers and results are
critical, then it is best to change the value of the outliers to other values.
They can be changed to a value that is representative of the data set.
When outliers have meaning, they can be kept, especially in the case of mild
outliers.
"""
In [ ]:
"""
What is meant by mean imputation for missing data? Why is it bad?
Mean imputation is a practice where null values in a dataset are replaced
directly with the mean of the corresponding feature.
It is considered a bad practice because it ignores the correlation between features.
It also artificially lowers the variance of the data and increases bias, which can
reduce the accuracy of the model and produce confidence intervals that are too narrow.
"""
In [ ]:
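"""
A small sketch of the variance problem with mean imputation. The ages are made-up
values, with None standing in for missing entries. Filling the gaps with the mean keeps
the mean unchanged but shrinks the sample variance.
"""
```python
import statistics

# Hypothetical feature with two missing values.
ages = [22, 25, None, 31, 38, None, 45, 52]

observed = [a for a in ages if a is not None]
observed_mean = statistics.mean(observed)

# Mean imputation: replace every missing value with the observed mean.
imputed = [observed_mean if a is None else a for a in ages]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print("mean:", observed_mean, "->", statistics.mean(imputed))
print("variance:", round(var_observed, 2), "->", round(var_imputed, 2))
```
In [ ]: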
"""
Covariance is a statistical measure that quantifies the degree to which two variables
change together. In other words, it measures the directional relationship between two
random variables. If the covariance between two variables is positive, it indicates
that they tend to move in the same direction. Conversely, if the covariance is
negative, it suggests that the variables move in opposite directions.
"""
In [ ]:
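"""
A minimal sketch of the sample covariance, written out by hand so the formula is
visible. It divides by n - 1 (the sample version). The variable names and values are
hypothetical.
"""
```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data, invented for this example.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 72]   # tends to rise with hours studied
hours_gaming = [9, 7, 6, 3, 1]      # tends to fall as hours studied rise

cov_positive = sample_covariance(hours_studied, exam_score)
cov_negative = sample_covariance(hours_studied, hours_gaming)

print(cov_positive, cov_negative)
```
In [ ]: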
"""
What is the meaning of covariance?
Covariance is a measure of how two variables vary together. It captures the
systematic relationship between a pair of random variables, showing whether a
change in one is associated with a change in the other variable in the pair.
"""
In [ ]:
"""
What is the use of covariance ?
Feature Selection and Dimensionality Reduction:
Covariance analysis helps in selecting relevant features for machine learning models.
Features with high covariance with the target variable may be good candidates for
predictive modeling. Covariance matrices are also used in techniques like Principal
Component Analysis (PCA) for dimensionality reduction. PCA aims to find orthogonal
features that capture the maximum variance in the data, which is related to the
covariance matrix.
Understanding Relationships Between Variables:
Covariance helps in understanding the relationships between different variables in the
dataset. High positive covariance values between two variables indicate that they tend
to increase or decrease together, while negative covariance values indicate an inverse
relationship.
Multivariate Normal Distribution:
In some machine learning algorithms, such as Gaussian mixture models and linear or
quadratic discriminant analysis, covariance matrices are used to model the multivariate
normal distribution of features. Gaussian Naive Bayes, by contrast, assumes the
features are independent within each class, so it uses only per-feature variances.
"""
In [ ]:
"""
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It quantifies how closely related two variables
are and the direction of their relationship. Correlation analysis is crucial in
understanding the association between variables and is widely used in various fields,
including finance, economics, psychology, and scientific research.
Types of correlation:
Pearson Correlation Coefficient (Parametric Correlation):
The Pearson correlation coefficient, denoted by r, measures the linear relationship
between two continuous variables.
It ranges from -1 to +1, where:
r=1: Perfect positive correlation
r=−1: Perfect negative correlation
r=0: No linear correlation
The strength of the correlation is determined by the magnitude of r, while the sign
(+ or -) indicates the direction of the relationship. The Pearson correlation
coefficient assumes that the relationship between variables is linear, and significance
tests on r typically assume that the data are approximately normally distributed.
Spearman Rank Correlation (Non-parametric Correlation):
Spearman rank correlation, denoted by ρ (rho), measures the strength and direction of
the monotonic relationship between two variables. It does not assume a linear
relationship and is suitable for ordinal or non-normally distributed data. Spearman
correlation is calculated based on the ranks of the data points rather than their
actual values, making it robust to outliers and non-linear relationships.
"""
In [ ]:
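"""
A sketch contrasting the two coefficients on made-up data where y grows with x but not
in a straight line. Both functions are written by hand for illustration; the ranking
helper assumes there are no tied values, which real implementations handle with
average ranks.
"""
```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n by increasing value. Assumption: no ties in the data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Hypothetical data: monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson r:   ", round(pearson_r(x, y), 3))    # below 1, not a straight line
print("Spearman rho:", round(spearman_rho(x, y), 3)) # 1.0, perfectly monotonic
```
In [ ]: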
"""
Why do we use correlation?
Feature Selection:
Correlation analysis helps in selecting the most relevant features (variables) for
predictive models. Features that are highly correlated with the target variable are
often considered important predictors.
Multicollinearity Detection:
Correlation analysis helps in identifying multicollinearity, which occurs when two or
more predictor variables in a regression model are highly correlated. Multicollinearity
can affect the stability and interpretability of the model, so identifying and
addressing it is crucial.
Data Exploration:
Correlation analysis is an essential step in exploratory data analysis (EDA) during
the initial stages of model development. It provides insights into the relationships
between different features and the target variable, helping data scientists understand
the underlying patterns and dynamics of the data.
Dimensionality Reduction:
In high-dimensional datasets, correlation analysis can be used for dimensionality
reduction. Features that are highly correlated with each other may contain redundant
information. Removing one of the correlated features can reduce computational
complexity and improve model interpretability without sacrificing predictive
performance.
"""
In [ ]:
"""
What are some of the properties of a normal distribution?
The normal distribution, also known as the Gaussian distribution, describes data that
is symmetric about the mean, with values far from the mean occurring less frequently.
It appears as a bell-shaped curve in graphical form, which is symmetrical about the
mean.
The properties of a normal distribution are –
Symmetrical – The left and right halves of the curve mirror each other. The parameter
values (mean and standard deviation) change only the location and spread of the curve.
Unimodal – Has only one mode.
Mean – The measure of central tendency, which locates the centre of the curve.
Central tendency – The mean, median, and mode lie at the centre, which means that
they are all equal, and the curve is perfectly symmetrical at the midpoint.
"""
In [ ]:
"""
What general conditions must be satisfied for the central limit theorem to hold?
Here are the conditions that must be satisfied for the central limit theorem to hold –
Randomization: The data must be sampled randomly.
Independence: The sample values must be independent of each other.
Sample size: The sample must be sufficiently large. A common rule of thumb is a sample
size of 30 or more, although strongly skewed populations may need larger samples.
Finite variance: The population must have a finite variance.
"""
In [ ]:
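"""
A simulation sketch of the central limit theorem. It draws many random samples of size
30 from a strongly right-skewed population and looks at the distribution of the sample
means. The exponential population (mean 1, standard deviation 1), the seed, and the
number of repetitions are all assumptions chosen for illustration.
"""
```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible
sample_size = 30
repetitions = 2000


def one_sample_mean(size):
    # Independent random draws from an exponential distribution with rate 1.
    return statistics.mean([rng.expovariate(1.0) for _ in range(size)])


sample_means = [one_sample_mean(sample_size) for _ in range(repetitions)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory: the sample means centre on the population mean (1) with a
# standard deviation of about sigma / sqrt(n) = 1 / sqrt(30).
print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):     ", round(1 / math.sqrt(sample_size), 3))
```
In [ ]: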
"""
Where are long-tailed distributions used?
A long-tailed distribution is a type of distribution where the tail drops off
gradually toward the end of the curve.
The Pareto principle and the product sales distribution are good examples to denote
the use of long-tailed distributions.
Also, it is widely used in classification and regression problems.
"""
In [ ]:
"""
What is exploratory data analysis?
Exploratory data analysis is the process of performing investigations on data to
understand the data better.
In this, initial investigations are done to determine patterns, spot abnormalities,
test hypotheses, and also check if the assumptions are right.
"""
In [ ]:
"""
What is DOE?
DOE is an acronym for the Design of Experiments in statistics. It is the design of a
task that describes and explains how the output information changes in response to
changes in the independent input variables.
"""
In [ ]:
"""
What is the meaning of KPI in statistics?
KPI stands for Key Performance Indicator. It is used as a reliable metric
to measure the success of a company with respect to its achieving the required business
objectives.
There are many good examples of KPIs:
Profit margin percentage
Operating profit margin
Expense ratio
"""
In [ ]:
"""
What is the meaning of standard deviation?
Standard deviation measures how far the data points typically lie from the mean. A
low standard deviation indicates that the data are clustered close to the mean, and
a high value indicates that the data are spread widely, far away from the mean.
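As a small illustration, assuming two made-up datasets that share the same mean, the
sketch below uses the population standard deviation from Python's statistics module
to show how the value grows with the spread.

```python
import statistics

# Two hypothetical datasets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

tight_sd = statistics.pstdev(tight)    # population standard deviation
spread_sd = statistics.pstdev(spread)

print(statistics.mean(tight), tight_sd)
print(statistics.mean(spread), spread_sd)
```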
"""
In [ ]:
""" What is correlation?
Correlation measures the relationship between two variables, most commonly two
quantitative variables. Unlike covariance, correlation tells us how strong the
relationship is between the two variables. The value of correlation ranges from
-1 to +1.
A value of -1 represents a perfect negative linear correlation, i.e., as the value of
one variable increases, the value of the other decreases. Similarly, +1 means a
perfect positive correlation, where an increase in one variable goes with an increase
in the other. A value of 0 means there is no linear correlation.
If two predictor variables are strongly correlated (multicollinearity), they may have
a negative impact on the statistical model, and dropping one of them is a common
remedy.
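A minimal sketch of the Pearson correlation coefficient, written out by hand with
illustrative data (the helper name pearson_r and the sample values are assumptions
made for this example). It shows the three reference cases: +1, -1, and 0.

```python
import math


def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of the standard deviations.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises linearly with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls linearly with x
r_zero = pearson_r(x, [2, 1, 0, 1, 2])   # V shape: related, but no linear trend
print(r_pos, r_neg, r_zero)
```

The last case is a reminder that a correlation of 0 only rules out a linear
relationship; the two variables can still be related in another way.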
"""
In [ ]:
"""
What is the relationship between the confidence level and the significance level in
statistics?
The significance level (alpha) is the probability of rejecting the null hypothesis
when it is actually true. The confidence level is the proportion of confidence
intervals, built from repeated samples, that would contain the true population
parameter.
Both significance and confidence level are related by the following formula:
Significance level = 1 − Confidence level
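For instance, assuming a 95% confidence level and a two-sided test based on the
standard normal distribution (both are illustrative assumptions), the relationship
can be sketched as:

```python
from statistics import NormalDist

confidence_level = 0.95
alpha = 1 - confidence_level   # significance level

# Two-sided critical value: alpha is split equally between the two tails.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
print(alpha, z_critical)
```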

"""
In [ ]:
"""
What types of variables are used for Pearson’s correlation coefficient?
Variables to be used for the Pearson’s correlation coefficient must be measured on
either a ratio scale or an interval scale.
Note that the two variables do not need to share the same scale: one can be a ratio
variable while the other is an interval variable.
"""
In [ ]:
"""
What are the examples of symmetric distribution?
Symmetric distribution means that the data on the left side of the median is a mirror
image of the data on the right side of the median.
There are many examples of symmetric distribution, but the following three are the
most widely used ones:
Uniform distribution
Binomial distribution (symmetric only when the success probability p = 0.5)
Normal distribution
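The binomial distribution is symmetric only when the success probability is 0.5. A
small check of this, assuming n = 10 trials (the helper names binomial_pmf and
is_symmetric are made up for this sketch):

```python
from math import comb


def binomial_pmf(n, p):
    # Probability of k successes in n trials, for k = 0..n.
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def is_symmetric(pmf, tol=1e-12):
    # A distribution is symmetric if it reads the same forwards and backwards.
    return all(abs(a - b) < tol for a, b in zip(pmf, reversed(pmf)))


fair = binomial_pmf(10, 0.5)
biased = binomial_pmf(10, 0.2)
print(is_symmetric(fair), is_symmetric(biased))
```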
"""
In [ ]:
"""
Where is inferential statistics used?
Inferential statistics is used for several purposes, such as research, in which we wish
to draw conclusions about a population using some sample data. This is performed in a
variety of fields, ranging from government operations to quality control and quality
assurance teams in multinational corporations.
"""
In [ ]:
"""
What are the scenarios where outliers are kept in the data?
There are not many scenarios where outliers are kept in the data, but there are some
important situations when they are kept. They are kept in the data for analysis if:
Results are critical
Outliers add meaning to the data
The data is highly skewed
"""
In [ ]:
"""
What is the meaning of degrees of freedom (DF) in statistics?
Degrees of freedom (DF) is the number of values in a calculation that are free to
vary. It is mostly used with the t-distribution and not with the z-distribution.
As DF increases, the t-distribution gets closer to the normal distribution. As a
rule of thumb, when DF > 30 the t-distribution closely approximates a normal
distribution.
"""
In [ ]:
"""
What are some of the techniques to reduce underfitting and overfitting during model
training?
Underfitting refers to a situation where the model has high bias and low variance,
while overfitting is the situation where the model has high variance and low bias.
Following are some of the techniques to reduce underfitting and overfitting:
For reducing underfitting:
Increase model complexity
Increase the number of features
Remove noise from the data
Increase the number of training epochs
For reducing overfitting:
Increase training data
Stop early while training
Use regularization (for example, Lasso)
Use dropout
"""
In [ ]:
"""
Does a symmetric distribution need to be unimodal?
A symmetric distribution does not need to be unimodal (having only one mode or one
value that occurs most frequently). It can be bi-modal (having two values that have
the highest frequencies) or multi-modal (having multiple or more than two values that
have the highest frequencies).
"""
In [ ]:
"""
What is the impact of outliers in statistics?
Outliers can strongly distort statistics that are sensitive to extreme values. For
example, if we calculate the mean of a dataset that contains outliers, the result is
pulled toward the outliers and can differ considerably from the mean we would get
once the outliers are removed. The median is much less affected.
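A quick sketch with made-up numbers shows the effect: adding a single extreme value
shifts the mean substantially, while the median stays where it was.

```python
import statistics

values = [10, 12, 11, 13, 12]
with_outlier = values + [100]   # one extreme value added

mean_clean = statistics.mean(values)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(values)
median_outlier = statistics.median(with_outlier)

print(mean_clean, mean_outlier)       # the mean is pulled toward the outlier
print(median_clean, median_outlier)   # the median does not move
```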
"""
In [ ]:
"""
When creating a statistical model, how do we detect overfitting?
Overfitting can be detected by cross-validation. In cross-validation, we divide the
available data into multiple parts and iterate over them. In each iteration, one
part is used for testing, and the others are used for training. This way, the entire
dataset is used for both training and testing, and we can detect overfitting when the
model performs much better on the training parts than on the held-out part.
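The splitting step can be sketched as follows. This is a minimal illustration of
k-fold index generation only (the function name k_fold_indices is hypothetical, and
no model is trained here); in practice the data would usually be shuffled first and
a model would be fitted and scored on each split.

```python
def k_fold_indices(n_samples, k):
    # Yield (train_indices, test_indices) for each of the k folds.
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```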
"""
In [ ]:
"""
What is the relationship between standard deviation and variance?
Standard deviation is the square root of variance. Variance is the average of the
squared deviations of the data from the mean of the entire dataset, so it is
expressed in squared units. Standard deviation describes the same spread around the
mean, but in the original units of the data.
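A short numeric check of this relationship, using an illustrative dataset and the
population formulas from Python's statistics module:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values with mean 5

variance = statistics.pvariance(data)   # average squared deviation from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

print(variance, std_dev, math.sqrt(variance))
```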
"""
In [ ]:
"""
What is the difference between inferential and descriptive statistics?
Descriptive statistics describes some sample or population.
Inferential statistics attempts to infer from some sample to the larger population.
"""
In [ ]:
"""
What is the difference between long format and wide format data?
A dataset can be written in two different formats: wide and long.
Wide format is where we have a single row for every data point with multiple
columns to hold the values of various attributes.
The long format is where for each data point we have as many rows as the number
of attributes and each row contains the value of a particular attribute for a given
data point.
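As an illustration, the sketch below converts a small made-up table from wide to long
format in plain Python (the column names and the helper wide_to_long are assumptions
for this example).

```python
# Wide format: one row per data point, one column per attribute.
wide = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": 182, "weight": 80},
]


def wide_to_long(rows, id_key):
    # Emit one row per (data point, attribute) pair.
    long_rows = []
    for row in rows:
        for attribute, value in row.items():
            if attribute != id_key:
                long_rows.append(
                    {id_key: row[id_key], "attribute": attribute, "value": value}
                )
    return long_rows


long_format = wide_to_long(wide, id_key="id")
for row in long_format:
    print(row)
```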
"""
In [ ]:
"""
What do you understand by the term Normal Distribution?
Normal distribution, also known as the Gaussian distribution, is a bell-shaped
frequency distribution.
"""
In [ ]:
"""
What are some of the properties of a normal distribution?
Some of the properties of a Normal Distribution are as follows:
Unimodal: normal distribution has only one peak. (i.e., one mode)
Symmetric: a normal distribution is perfectly symmetrical around its centre.
(i.e., the right side of the centre is a mirror image of the left side)
The Mean, Mode, and Median are all located in the centre (i.e., are all equal)
Asymptotic: normal distributions are continuous and have tails that are asymptotic.
The curve approaches the x-axis, but it never touches it.
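These properties can be checked numerically. The sketch below assumes a standard
normal distribution (mean 0, standard deviation 1) and uses NormalDist from Python's
statistics module; the probe points are arbitrary.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)

# Mean, median, and mode coincide at the centre.
centre_equal = dist.mean == dist.median == dist.mode

# Symmetric: the density at -d matches the density at +d.
symmetric = all(abs(dist.pdf(-d) - dist.pdf(d)) < 1e-12 for d in (0.5, 1.0, 2.0, 3.0))

# Unimodal: the density at the centre is higher than at points away from it.
peak_at_centre = all(dist.pdf(0.0) > dist.pdf(d) for d in (0.1, 1.0, 3.0))

# Asymptotic: far in the tail the density is tiny but still positive.
tail_density = dist.pdf(10.0)

print(centre_equal, symmetric, peak_at_centre, tail_density)
```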
"""
In [ ]:
