PSLP notes
(a) Mean=Mode=Median
(b) Mean>Median>Mode
(c) Mode>Median>Mean
(d) Mean>Mode=Median
Solution: (c)
Explanation:
(a) Diagonal elements must be positive and other elements are always zero.
(b) Diagonal elements can never be negative and other elements are always positive.
(c) Diagonal elements can never be negative and other elements can be negative or positive.
(d) Diagonal elements can be negative and positive and other elements are always negative.
Solution: (c)
Explanation: In a covariance matrix, each diagonal entry is the covariance of a variable with itself,
which equals the variance of that variable (the square of its standard deviation). Since variance is never
negative, the diagonal entries can never be negative. The off-diagonal entries are covariances between pairs
of variables, which can be negative or positive.
(b) Range
(c) Mean
Solution: (d)
Explanation: The IQR is the range of the middle 50% of the data. Because it ignores the lowest and highest
25% of the values, it is not affected by outliers.
4. If X and Y are independent random variables, then which of the following is TRUE?
Solution: (d)
Explanation: If X and Y are independent, then Cov(X, Y) = 0, so Var(X+Y) = Var(X) + Var(Y) + 2Cov(X, Y) = Var(X) + Var(Y).
Solution: (d)
Explanation:
6. Let X and Y be normal random variables with their respective means 3 and 4 and variances
9 and 16, then 2X-Y will have normal distribution with parameters:
Solution: (d)
7. Suppose X and Y take values in {0,1} and are independent, with P(X=1)=1/2 and P(Y=1)=1/3.
What is P(X+Y=1)?
(a) 5/18
(b) 1/2
(c) 5/6
(d) 1/6
Solution: (b)
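The answer to question 7 can be checked by enumerating the four joint outcomes. The sketch below is an illustration rather than part of the original notes: the helper name `prob_sum_equals` is made up for this example, and the calculation relies on the independence of X and Y stated in the question.

```python
from fractions import Fraction
from itertools import product

# Marginal distributions from question 7 (values in {0, 1}).
p_x = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_y = {0: Fraction(2, 3), 1: Fraction(1, 3)}


def prob_sum_equals(target):
    """P(X + Y = target), using independence: P(x, y) = P(x) * P(y)."""
    return sum(
        p_x[x] * p_y[y]
        for x, y in product(p_x, p_y)
        if x + y == target
    )


print(prob_sum_equals(1))  # 1/2, matching option (b)
```

Exact fractions are used so the result can be compared directly with the listed options.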
8. Let X and Y be random variables with E(X)=μ/2 and E(Y)=μ, then which one is TRUE?
Solution: (c)
9. Suppose that X takes values between 0 and 1 and has probability density function (PDF)
f(x) = 2x. Then the variance of X² is:
(a) 1/12
(b) 1/18
(c) 1/6
(d) 5/18
Solution: (a)
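The answer to question 9 follows from the moments of X. A short derivation for the variance of X squared, using only the density f(x) = 2x on [0, 1] given in the question:

```latex
\begin{align*}
E[X^k] &= \int_0^1 x^k \cdot 2x \, dx = \frac{2}{k+2} \\
\operatorname{Var}(X^2) &= E[X^4] - \left(E[X^2]\right)^2
  = \frac{2}{6} - \left(\frac{2}{4}\right)^2
  = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}
\end{align*}
```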
10. For random variables X and Y, we have Var(X)=1, Var(Y)=4, and Var(2X-3Y)=34, then the
correlation between X and Y is:
(a) 1/2
(b) 1/4
(c) 1/3
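Solution: (b)
Explanation: Var(2X-3Y) = 4Var(X) + 9Var(Y) - 12Cov(X, Y)
⇒ 4(1) + 9(4) - 12Cov(X, Y) = 34
∴ Cov(X, Y) = 1/2
∴ Corr(X, Y) = Cov(X, Y)/(Std(X)·Std(Y)) = (1/2)/(1·2) = 1/4, where Std is the standard deviation.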
11. A fair die is rolled repeatedly until a number larger than 4 is observed. If K is the total
number of times that the die is rolled, then P(K=4) is equal to:
(a) 16/81
(b) 8/81
(c) 8/27
(d) 16/27
Solution: (b)
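Question 11 describes a geometric distribution: three failures followed by one success. A minimal check with exact fractions is sketched below; the function name `prob_first_success_on` is chosen for this example only.

```python
from fractions import Fraction

# A fair die shows a number larger than 4 (that is, 5 or 6) with probability 2/6.
p_success = Fraction(2, 6)


def prob_first_success_on(k, p=p_success):
    """P(K = k) for a geometric variable: k - 1 failures, then one success."""
    return (1 - p) ** (k - 1) * p


print(prob_first_success_on(4))  # 8/81, matching option (b)
```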
12. Let X and Y be independent uniform (0, 1) random variables. Define A=X+Y and B=X-Y.
Then,
Solution: (b)
Explanation: Cov(X+Y, X-Y) = Cov(X, X) – Cov(X, Y) + Cov(Y, X) – Cov(Y, Y) = Var(X) – Var(Y) = 0,
since X and Y have the same distribution and therefore the same variance.
Solution: (c)
14. Let X and Y be two random variables and let a, b, c, d be real numbers, then which one
of the following is FALSE?
Solution: (d)
Solution: (d)
Explanation: If X and Y follow a bivariate normal distribution, then any linear combination of X and Y is
also normally distributed.
16. Let X1, X2, X3, ..., Xn be a random sample from a distribution with E(Xi)=μ and
Var(Xi)=σ². Now, consider two estimators:
g1 = X1 and g2 = X̄ = (X1+X2+X3+...+Xn)/n
(a) g1
(b) g2
Solution: (a)
17. A random sample of n=6 taken from the population has the elements 6, 10, 13, 14, 18, 20.
Then, which option is FALSE?
Solution: (c)
18. True or False: If the Pearson’s correlation between 2 variables is zero, then they are
necessarily independent.
Solution: False
19. True or False: Let g be an unbiased estimator of X and U be a random variable with zero
mean, then h=g+U is also unbiased for X.
Solution: True.
Explanation: E(h) = E(g) + E(U) = X + 0 = X (∵ E(g) = X because g is unbiased, and E(U) = 0)
20. True or False: Let X and Y be two independent standard normal random variables and
T=XY²+X+1 and P=X-3, then Cov(T, P)=1.
Solution: False.
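A short derivation shows why the statement in question 20 is false. It uses the independence of X and Y and the standard normal moments E[X] = 0 and E[X²] = E[Y²] = 1, reading T as X times Y squared, plus X, plus 1:

```latex
\begin{align*}
\operatorname{Cov}(T, P) &= \operatorname{Cov}(XY^2 + X + 1,\; X - 3)
  = \operatorname{Cov}(XY^2, X) + \operatorname{Var}(X) \\
\operatorname{Cov}(XY^2, X) &= E[X^2 Y^2] - E[XY^2]\,E[X]
  = E[X^2]\,E[Y^2] - 0 = 1 \\
\operatorname{Cov}(T, P) &= 1 + 1 = 2 \neq 1
\end{align*}
```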
21. True or False: Let X have a normal distribution with parameters μ and σ², then X² follows
a chi-square distribution with parameter 1.
Solution: False.
Explanation: For the given statement to be True, X should follow the standard normal distribution (μ=0, σ²=1).
22. True or False: If the characteristic function of a random variable exists, then its
expectation and variance will also exist.
Solution: False.
23. True or False: Let X have a uniform distribution U(a, b) such that E(X)=2 and Var(X)=3/4, then
P(X<1)=1/6.
Solution: True.
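The result in question 23 can be verified by solving for a and b from the mean and variance formulas of the uniform distribution:

```latex
\begin{align*}
E(X) = \frac{a+b}{2} = 2 &\;\Rightarrow\; a + b = 4 \\
\operatorname{Var}(X) = \frac{(b-a)^2}{12} = \frac{3}{4} &\;\Rightarrow\; b - a = 3 \\
a = \tfrac{1}{2},\quad b = \tfrac{7}{2} &\;\Rightarrow\;
  P(X < 1) = \frac{1 - a}{b - a} = \frac{1/2}{3} = \frac{1}{6}
\end{align*}
```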
24. True or False: The correlation coefficient between X+Y and X-Y, where X and Y are
independent random variables with variances 36 and 16 respectively is 6/13.
Solution: False.
Explanation: Corr(X+Y, X-Y) = Cov(X+Y, X-Y)/(Std(X+Y)·Std(X-Y)) [Std = standard deviation]
= (Var(X) - Var(Y))/(Var(X) + Var(Y)) = (36 - 16)/(36 + 16) = 20/52 = 5/13, not 6/13.
25. True or False: In interval estimation, as the confidence level increases, the margin of error
decreases.
Solution: False.
For example, using words like ‘tall’ or ‘short’ to describe a person’s height.
The Central Limit Theorem states that as the sample size gets larger, the distribution of the sample mean
gets closer to a normal distribution, whatever the shape of the population distribution (provided its
variance is finite). The spread of the sample mean also shrinks as the sample size increases, so the
sampling error is reduced.
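A small simulation can make the theorem concrete. The sketch below is an illustration under stated assumptions: the exponential population (mean 1, standard deviation 1), the sample sizes, and the helper name `sample_means` are all choices made for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from an exponential population (mean 1, sd 1)
    and return the mean of each sample."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


for n in (2, 30, 200):
    means = sample_means(n)
    # The centre stays near the population mean (1.0) while the spread
    # of the sample means shrinks roughly like 1 / sqrt(n).
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The population here is strongly skewed, yet a histogram of `means` for the larger sample sizes would look close to a bell curve.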
We’ve already touched on this topic with some of the previous statistics and probability interview
questions. But since it’s a fundamental part of data analysis, we wish to cover it in more detail.
Hypothesis testing allows us to evaluate a hypothesis about the population based on sample data. It is
conducted as follows:
First, we formulate a null hypothesis (or H0)—assuming no difference or relationship between the variables.
For each null hypothesis, there’s an alternative one considering the opposite. If H0 is rejected, the
alternative hypothesis is supported.
We need to choose an appropriate statistical test to determine whether the data supports a particular
hypothesis. If the p-value from that test (the probability of seeing data at least as extreme as ours when
H0 is true) is below a predetermined significance level, we can reject H0.
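The steps above can be sketched with a two-sided one-sample z-test. This is only one possible test, chosen here because it needs nothing beyond the standard library; the function name `one_sample_z_test`, the data, the hypothesized mean, and the known standard deviation are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    # Probability, under H0, of a statistic at least as extreme as z.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Made-up measurements, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.7, 5.3, 5.9, 5.5, 5.2]
z, p_value, reject = one_sample_z_test(sample, mu0=5.0, sigma=0.5)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

The significance level `alpha` is fixed before looking at the data; the test only reports whether the p-value falls below it.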
If only a limited sample of the actual population is available, bootstrapping is used to resample
repeatedly, with replacement, from that sample. The sample mean will vary for each resample, and a sampling
distribution of the mean will be created.
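A minimal bootstrap sketch, assuming the statistic of interest is the mean and using a percentile interval (one of several ways to summarize the resampled means). The data are the six values from question 17; the helper name `bootstrap_means` and the number of resamples are choices made for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Small illustrative sample (the n = 6 sample from question 17).
sample = [6, 10, 13, 14, 18, 20]


def bootstrap_means(data, n_resamples=5000):
    """Resample `data` with replacement and return the mean of each resample."""
    return [
        statistics.fmean(random.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]


means = sorted(bootstrap_means(sample))
# Percentile interval: the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```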
L1 and L2 regularization
Collect more samples
Using K-fold cross-validation instead of a regular train-test split
31. How do you deal with missing data?
There are several ways you can handle missing data based on the number of missing values and type of
variable:
Variability measures are also crucial in describing data distribution. They show how spread-out data points
are and how far away they are from the mean.
Some basic questions during a statistics interview might require you to explain the meaning and usage of
the following measures:
Variance measures the average squared distance of data points from the mean. A small variance
corresponds to a narrow spread of the values, while a big variance implies that data points are far
from the mean.
Standard deviation is the square root of the variance. It shows the amount of variation of values in a
dataset.
Range is the difference between the maximum and minimum data value. It’s a good indicator of
variability when there are no outliers in a dataset, but when there are, it can be misleading.
Interquartile range (IQR) measures the spread of the middle part of a dataset. It’s essentially the
difference between the third and the first quartile.
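The four measures above can be computed with Python's standard `statistics` module. The dataset below is illustrative. Note that quartiles have several competing conventions, so other tools may report a slightly different IQR than `statistics.quantiles` does with its default method.

```python
import statistics

data = [6, 10, 13, 14, 18, 20]  # small illustrative dataset

variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # square root of the sample variance
data_range = max(data) - min(data)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"variance = {variance:.2f}, std dev = {std_dev:.2f}")
print(f"range = {data_range}, IQR = {iqr}")
```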
A/B testing is a mechanism used to test user experience with the help of a randomized experiment. For
example, a company wants to test two versions of their landing page with different backgrounds to
understand which version drives conversions. A controlled experiment is created, and two variations of the
landing page are shown to different sets of people.
Next on our list of statistics questions for a data science interview are the measures of the shape of data
distribution: skewness and kurtosis.
Skewness measures the asymmetry of a distribution and indicates which tail is more likely to contain extreme
values. With a symmetrical distribution, the mean and median coincide. If the data distribution is not
symmetrical, the skewness is either positive or negative:
Positive skew is when the right tail is longer. Most values are clustered on the left, and the median
is typically smaller than the mean.
Negative skew is when the left tail is longer. Most values are clustered on the right, and the
median is typically greater than the mean.
Kurtosis, on the other hand, reveals how heavy-tailed or light-tailed the data are compared to the normal
distribution. There are three types of kurtosis:
Knowing the meaning and calculations of these measures may be enough for an entry-level job. But statistics
interview questions for advanced data science positions may revolve around using these concepts in practice.
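As a practical illustration, skewness and kurtosis can be computed from standardized moments. The sketch below uses the population (moment-based) definitions and reports excess kurtosis, which is 0 for a normal distribution; libraries often apply small-sample corrections, so their numbers can differ slightly. The function name and the datasets are made up for this example.

```python
import statistics


def skewness_and_excess_kurtosis(data):
    """Moment-based (population) skewness and excess kurtosis."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    z = [(x - mean) / sd for x in data]
    skew = statistics.fmean([v ** 3 for v in z])
    excess_kurt = statistics.fmean([v ** 4 for v in z]) - 3
    return skew, excess_kurt


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    skew, kurt = skewness_and_excess_kurtosis(data)
    print(f"{name}: skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}, "
          f"mean = {statistics.fmean(data):.2f}, median = {statistics.median(data)}")
```

For the right-skewed data the mean is pulled above the median, which is the pattern described for positive skew.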
A confidence interval is a range of values, computed from sample data, that is likely to contain the true
population parameter. The level of confidence (for example, 95% or 99%) refers to the proportion of such
intervals that would contain the true parameter if samples were taken repeatedly and an interval were
computed from each one.
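A minimal sketch of a confidence interval for a mean, assuming a normal approximation (for small samples a t critical value is usually preferred, which the standard library does not provide). The data and the function name `mean_confidence_interval` are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    # Critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / sqrt(len(sample))
    centre = fmean(sample)
    return centre - margin, centre + margin


sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.7, 5.3, 5.9, 5.5, 5.2]  # made-up data
for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(sample, level)
    print(f"{level:.0%} CI: ({low:.3f}, {high:.3f}), width = {high - low:.3f}")
```

The interval widens as the confidence level rises, which is why the statement in question 25 is false.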
A p-value is the probability of obtaining a result at least as extreme as the observed one if the null
hypothesis were true. We set a threshold (the significance level) before running the test, and if the
p-value falls below this threshold, the observed result would be unlikely under the null hypothesis. This
gives us enough evidence to reject the null hypothesis.
38. What is standardization? Under which circumstances should data be standardized?
Standardization is the process of putting different variables on the same scale. Variables are rescaled to
have a mean of 0 and a standard deviation of 1.
Standardizing data can give us a better idea of extreme outliers, as it is easy to identify values that are
more than 2–3 standard deviations away from the mean. Standardization is also used as a pre-processing
technique before feeding data into machine learning models so that all variables are given the same weight.
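A minimal sketch of standardization (z-scores) and of using it to flag outliers. The data, the threshold of 2 standard deviations, and the function name `standardize` are choices made for this example.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population sd, so the z-scores have sd 1
    return [(x - mean) / sd for x in values]


heights_cm = [150, 160, 165, 170, 175, 180, 230]  # made-up data; 230 is unusual
z_scores = standardize(heights_cm)

# Flag values more than 2 standard deviations away from the mean.
outliers = [x for x, z in zip(heights_cm, z_scores) if abs(z) > 2]
print(outliers)  # [230]
```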
39. What are some properties of a normal distribution? Give some examples of data points
that follow a normal distribution.
The mean, median, and mode of a normal distribution are equal.
The distribution is symmetric about the mean: there is a 50% probability that a value will fall to the
left of the mean, and a 50% probability that it will fall to the right.
The total area under the curve is 1.
Normal distribution is a central concept in mathematics and data analysis. As such, it often appears in
statistics interview questions.
The normal (or Gaussian) distribution is the most important probability distribution in statistics. It’s often
called a “bell curve” because of its shape: tall in the middle, flat toward the ends.
A correlation coefficient is an indicator of how strong the linear relationship between two variables is. A
coefficient near +1 indicates a strong positive correlation, a coefficient of 0 indicates no linear
correlation, and a coefficient near -1 indicates a strong negative correlation.
Covariance is a statistical term that refers to a systematic relationship between two random variables in
which a change in one variable is reflected by a change in the other. The covariance value can range from
-∞ to +∞, with a negative value indicating a negative (inverse) relationship and a positive value indicating
a positive (direct) relationship. Its magnitude depends on the units of the variables, so covariance alone
does not show how strong the relationship is; the correlation coefficient is used for that.
A covariance matrix is a square matrix whose diagonal entries are the variances of the individual variables
and whose off-diagonal entries are the covariances between pairs of variables. Variance is a measure of
dispersion, defined as the spread of the data around its mean. Covariance between two variables measures
how the two variables fluctuate together.
A correlation matrix is a matrix of correlation coefficients among different variables. Each cell in the
table shows the correlation between two variables. A correlation matrix can be used to summarize data, as an
input to a more advanced analysis, or as a diagnostic for further studies.
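The relationship between covariance, the covariance matrix, and the correlation matrix can be sketched in plain Python. The three variables and their values are made up for illustration, and the helper names are chosen for this example; in practice a numerical library would normally be used.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def covariance_matrix(columns):
    """Entry [i][j] is the covariance of variable i with variable j."""
    return [[covariance(a, b) for b in columns] for a in columns]


def correlation_matrix(columns):
    """Covariance matrix rescaled by the standard deviations."""
    cov = covariance_matrix(columns)
    sds = [cov[i][i] ** 0.5 for i in range(len(columns))]
    return [
        [cov[i][j] / (sds[i] * sds[j]) for j in range(len(columns))]
        for i in range(len(columns))
    ]


# Made-up data: three variables observed five times.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 72]
hours_gaming = [9, 7, 6, 3, 2]
columns = [hours_studied, exam_score, hours_gaming]

for row in covariance_matrix(columns):
    print([round(v, 2) for v in row])
for row in correlation_matrix(columns):
    print([round(v, 2) for v in row])
```

The diagonal of the covariance matrix holds the variances (never negative), the off-diagonal entries can take either sign, and every correlation lies between -1 and +1.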
45. Explain the Difference Between Probability Distribution and Sampling Distribution
As noted, you may be asked various statistics interview questions regarding sampling and the
generalizability of results. The difference between probability and sampling distribution is just one example.
A probability distribution is a function used to calculate the probability of a random variable X taking
different values. There are two main types depending on the variable: discrete and continuous.
Examples of the former are the binomial and Poisson distributions, and of the latter: normal and
uniform distributions.
A sampling distribution is the probability distribution of a statistic (such as the sample mean) computed
from many random samples of the same size drawn from a population. The definition sounds confusing, but
it’s encountered often in practice.
For example, imagine you’re a clinical data analyst working on developing a new treatment for patients
with Alzheimer’s. You’ll likely be working with samples from the entire population of individuals with the
disease. So, you’ll use the sampling distribution during the data analysis.