
Statistics Interview Questions

Uploaded by

sairamesht
Copyright
© © All Rights Reserved

Statistics Interview Questions & Answers for Data Scientists

Questions

• Q1: Explain the central limit theorem and give examples of when you can use it in a real-world problem.

• Q2: Briefly explain A/B testing and its applications. What are some common pitfalls encountered in A/B testing?

• Q3: Briefly describe hypothesis testing and the p-value in layman's terms, and give a practical application for them.

• Q4: Given a left-skewed distribution that has a median of 60, what conclusions can we draw about the mean and the mode of the data?

• Q5: What is the meaning of selection bias and how to avoid it?

• Q6: Explain the long-tailed distribution and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

• Q7: What is the meaning of KPI in statistics?

• Q8: Say you flip a coin 10 times and observe only one head. What would be the null hypothesis and p-value for testing whether the coin is fair or not?

• Q9: You are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?

• Q10: What general conditions must be satisfied for the central limit theorem to hold?

• Q11: What is skewness? Discuss two methods to measure it.

• Q12: You sample from a uniform distribution [0, d] n times. What is your best estimate of d?

• Q13: Discuss the Chi-square, ANOVA, and t-test

Questions & Answers

Q1: Explain the central limit theorem and give examples of when you can use it in a real-world problem.

Answer:
The central limit theorem (CLT) states that if you repeatedly draw sufficiently large independent samples from a population with finite variance, the distribution of the sample mean is approximately normal, regardless of the shape of the population distribution. This makes it possible to draw inferences about the mean of almost any distribution, as long as the sample size is large enough.

Important remarks:

1. We can rely on the CLT for means only if summarizing the data by its mean makes sense. And it makes sense only for unimodal, symmetric data coming from additive processes. So forget skewed or multi-modal data, mixtures of distributions, data coming from multiplicative processes, and data with non-trivial mean-variance relationships. Those are the places where the arithmetic mean is meaningless. There, using the CLT (or, e.g., the bootstrap) will give a valid answer to an invalid question.

2. The distribution of the means isn't enough. Every kind of inference requires the entire test statistic to follow a certain distribution, and the test statistic also includes the estimate of the variance. Never assume that a sample size sufficient for the means will also suffice for the entire test statistic. See the attached excerpt from Rand Wilcox. In particular, never believe in magic numbers like N = 30.

3. Think first about how to sensibly describe your data, state the hypothesis of interest, and then apply a valid method.

Examples of real-world usage of the CLT:

1. The CLT can be used at any company with a large amount of data. Consider a company like Uber or Lyft that wants to test, using hypothesis testing, whether adding a new feature increases the number of booked rides. Each individual ride request X is a Bernoulli random variable (the rider either books a ride or does not), so with a large number of requests the CLT lets us treat the total number of bookings, or equivalently the booking rate, as approximately normal. Understanding and estimating these statistical properties is what makes it possible to apply hypothesis testing to the data and decide whether the new feature increases the number of booked rides.

2. Manufacturing plants often use the central limit theorem to estimate how many of the products produced by the plant are defective.
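The ride-booking example can be illustrated with a small simulation. This is only a sketch under assumed numbers: the booking probability `p`, the sample size, and the number of repeated samples are made up for illustration and do not come from any real dataset.

```python
import random
import statistics

def simulate_sample_means(p, sample_size, num_samples, seed=0):
    """Draw num_samples Bernoulli(p) samples and return the mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        bookings = sum(1 for _ in range(sample_size) if rng.random() < p)
        means.append(bookings / sample_size)
    return means

# Hypothetical values, chosen only for illustration.
p = 0.1              # assumed probability that a request ends in a booking
sample_size = 500    # requests per sample
means = simulate_sample_means(p, sample_size, num_samples=1_000)

observed_mean = statistics.mean(means)
observed_sd = statistics.stdev(means)
# Standard deviation of the sample mean of n independent Bernoulli(p) draws.
predicted_sd = (p * (1 - p) / sample_size) ** 0.5
# If the sample means are roughly normal, about 95% fall within 1.96 sd of p.
coverage = sum(1 for m in means if abs(m - p) <= 1.96 * predicted_sd) / len(means)

print(f"mean of sample means: {observed_mean:.4f} (true p = {p})")
print(f"sd of sample means:   {observed_sd:.4f} (predicted {predicted_sd:.4f})")
print(f"share within 1.96 sd: {coverage:.3f}")
```

Even though each individual observation is only 0 or 1, the sample means cluster around `p` in a roughly bell-shaped way, which is what the hypothesis test relies on.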

Q2: Briefly explain A/B testing and its applications. What are some common pitfalls encountered in A/B testing?

A/B testing helps us determine whether a change causes a statistically significant change in performance. In other words, you aim to statistically estimate the impact of a given change, for example within your digital product. You measure success metrics and counter metrics on at least one treatment group versus one control group (there can be more than one experiment group in multivariate tests).

Applications:

1. Consider a general store that has sold packets of bread, but not butter, for a year. To check whether bread sales depend on butter, suppose the store starts selling butter as well and sales are observed for the following year. We can then assess whether selling butter significantly increases, decreases, or does not affect the sale of bread. (Strictly speaking, this is a before/after comparison rather than a randomized A/B test, so other changes between the two years can confound the result.)

2. While developing the landing page of a website, you create two different versions of the page. You define a criterion for success, e.g. conversion rate. Then you define your hypotheses. Null hypothesis (H): there is no difference between the performance of the two versions. Alternative hypothesis (H'): version A performs better than version B.

NOTE: You have to split your traffic randomly between the two versions (to avoid sampling bias). The split doesn't have to be even; you just need to reach the minimum sample size for each version so that the test is not underpowered.

Even if version A gives better results than version B in the sample, we still have to show statistically that the difference is unlikely to be due to chance, so that it can be expected to hold for the entire population. A very common test for this is the two-sample t-test, where we compare the p-value with a chosen significance level (alpha). If p-value < alpha, H is rejected.
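As a rough sketch of that decision rule for conversion rates, the snippet below uses a two-proportion z-test, which is the usual large-sample counterpart of the two-sample t-test when the metric is a yes/no outcome. The visitor and conversion counts are invented for illustration, and the test is two-sided, which is an assumption; a one-sided test would match the alternative hypothesis H' stated above more closely.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H: both versions convert at the same rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts: 5,000 visitors per version.
z, p_value = two_proportion_z_test(conv_a=600, n_a=5_000, conv_b=520, n_b=5_000)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H" if p_value < alpha else "fail to reject H")
```

The pooled rate is used in the standard error because, under H, both versions share a single conversion rate.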

Common pitfalls:

1. Wrong success metrics that are inadequate for the business problem

2. Lack of a counter metric, as the change might add friction to the product alongside its positive impact

3. Sample mismatch: heterogeneous control and treatment groups, unequal variances

4. Underpowered test: sample too small or experiment run for too short a time

5. Not accounting for network effects (which introduce bias into the measurement)

Q3: Briefly describe hypothesis testing and the p-value in layman's terms, and give a practical application for them.

In layman's terms:

• A hypothesis test is where you have a current state (null hypothesis) and an alternative state (alternative hypothesis). You assess the results of both states and see some difference. You want to decide whether the difference is due to the alternative approach or just due to chance.

You use the p-value to decide this. The p-value is the probability of getting results at least as extreme as the ones the alternative approach achieved if the existing approach were still in effect, that is, if the null hypothesis were true. It is a tail probability under the distribution of results you would expect from the existing approach (often, but not necessarily, a Gaussian).

The rule of thumb is to reject the null hypothesis if the p-value < 0.05, which means that the probability of getting results this extreme from the existing approach is less than 5%. This threshold changes according to the task and domain.

To explain hypothesis testing in layman's terms with an example, suppose we have two drugs, A and B, and we want to determine whether they have the same effect or different effects. The procedure for deciding this from data is called hypothesis testing. The null hypothesis is that the drugs are the same, and the p-value helps us decide whether we should reject the null hypothesis or not.

p-values are numbers between 0 and 1. In this particular case, the p-value quantifies how surprising the observed difference between drug A and drug B would be if the drugs were actually the same. The closer the p-value is to 0, the stronger the evidence that drugs A and B are different.
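One way to make the drug example concrete is an exact permutation test: if the drugs were the same, the group labels would be interchangeable, so we can count how many relabelings produce a difference at least as large as the observed one. The recovery times below are hypothetical, and the permutation test is one possible choice of test, not one prescribed by the text above.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the absolute difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    count = 0
    # Enumerate every way of assigning n_a of the pooled values to "drug A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count

# Hypothetical recovery times in days for two small groups of patients.
drug_a = [5, 6, 7, 6, 5]
drug_b = [9, 10, 8, 11, 9]
p_value = exact_permutation_p_value(drug_a, drug_b)
print(f"p-value = {p_value:.4f}")
```

With these made-up numbers only 2 of the 252 possible relabelings are as extreme as the observed split, so the p-value is small and the null hypothesis that the drugs are the same would be rejected at the 0.05 level.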

Q4: Given a left-skewed distribution that has a median of 60, what conclusions can we draw about the mean and the mode of the data?

Answer: A left-skewed distribution has its tail on the left and its peak on the right. The mean is pulled towards outliers (very large or very small values), so it is shifted to the left, in other words towards the tail.

The mode (the most frequent value) lies at the peak, and the median (the middle value) is much less sensitive to the tail than the mean. The median therefore typically lies between the two: it is larger than the mean and smaller than the mode.

Mean < 60 < Mode (this ordering is typical for unimodal left-skewed data, although it is not guaranteed for every distribution)

Figure: The mean, median, and mode for distributions with different skews.
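A tiny numeric check of this ordering, using a made-up left-skewed sample whose median is 60 (the values are illustrative only):

```python
from statistics import mean, median, mode

# Hypothetical scores with a long tail towards small values.
scores = [20, 45, 55, 58, 60, 62, 65, 65, 65]

sample_mean = mean(scores)      # pulled towards the low outlier
sample_median = median(scores)  # middle value of the sorted data
sample_mode = mode(scores)      # most frequent value, at the peak

print(sample_mean, sample_median, sample_mode)
```

Here the single low value of 20 drags the mean below the median, while the mode sits at the peak on the right.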

Q5: What is the meaning of selection bias and how to avoid it?

Answer:

Selection bias, of which sampling bias is the most common form, occurs when a research study design fails to collect a representative sample of the target population. This typically happens because the selection criteria for respondents fail to capture a wide enough sampling frame to represent all viewpoints.

Sampling bias is almost always due to one of two conditions.

1. Poor methodology: In most cases, non-representative samples pop up when researchers set improper parameters for survey research. The most accurate and repeatable sampling method is simple random sampling, where a large number of respondents are chosen at random. When researchers stray from random sampling (also called probability sampling), they risk injecting their own selection bias into recruiting respondents.

2. Poor execution: Sometimes data researchers craft scientifically sound sampling methods, but their work is undermined when field workers cut corners. By reverting to convenience sampling (where the only people studied are those who are easy to reach) or giving up on reaching non-responders, a field worker can jeopardize the careful methodology set up by data scientists.

The best way to avoid sampling bias is to stick to probability-based sampling methods. These include simple random sampling, systematic sampling, cluster sampling, and stratified sampling. In these methodologies, respondents are chosen only through processes of random selection, even if they are sometimes sorted into demographic groups along the way.
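As an illustration of two of these methods, the sketch below draws a simple random sample and a proportionally stratified sample from a toy population. The population, the "urban"/"rural" strata, and the helper names are all hypothetical.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Every unit has the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Randomly sample the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 900 urban and 100 rural respondents.
population = [("urban", i) for i in range(900)] + [("rural", i) for i in range(100)]

srs = simple_random_sample(population, 50)
strat = stratified_sample(population, stratum_of=lambda unit: unit[0], fraction=0.05)
rural_in_strat = sum(1 for region, _ in strat if region == "rural")
print(len(srs), len(strat), rural_in_strat)
```

Both samples rely on random selection; the stratified version additionally guarantees that the small rural group is represented in proportion to its size, which a simple random sample only achieves on average.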

Q6: Explain the long-tailed distribution and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

Answer: A long-tailed distribution is a type of heavy-tailed distribution whose tail (or tails) drops off gradually, approaching zero asymptotically, so that a large number of rare values occur far from the head of the distribution.

Three examples of relevant phenomena that have long tails:

1. Frequencies of languages spoken

2. Population of cities

3. Pageviews of articles

All of these follow something close to the 80-20 rule: roughly 80% of outcomes (or outputs) result from 20% of all causes (or inputs). That 20% forms the head of the distribution; the remaining 80% of causes, each contributing only a little, form the long tail.

It's important to be mindful of long-tailed distributions in classification and regression problems because the rarely occurring values, taken together, can make up a large part of the population. This can change the way that you deal with outliers, and it conflicts with machine learning techniques that assume the data are normally distributed.
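The head-versus-tail split can be sketched with synthetic data. The snippet assumes Zipf-like pageviews, where the article at rank r receives views proportional to 1/r; this is an illustrative assumption, not a measurement of any real site.

```python
# Synthetic pageviews: the article at rank r gets 10_000 / r views (assumed shape).
num_articles = 100
views = [10_000 / rank for rank in range(1, num_articles + 1)]

total_views = sum(views)
head_size = num_articles // 5                       # the top 20% of articles
head_share = sum(views[:head_size]) / total_views   # their share of all views
tail_share = 1 - head_share                         # share of the remaining 80%

print(f"top 20% of articles: {head_share:.1%} of views")
print(f"long tail (80% of articles): {tail_share:.1%} of views")
```

A small head accounts for most of the views, yet the tail still holds a share that is too large to dismiss as outliers, which is why such data needs care in modelling.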
Q7: What is the meaning of KPI in statistics?

Answer:

KPI stands for key performance indicator, a quantifiable measure of performance over time for a specific objective. KPIs provide targets for teams to shoot for, milestones to gauge progress, and insights that help people across the organization make better decisions. From finance and HR to marketing and sales, key performance indicators help every area of the business move forward at the strategic level.

KPIs are an important way to ensure your teams are supporting the overall goals of the organization. Here are some of the biggest reasons why you need key performance indicators.

• Keep your teams aligned: Whether measuring project success or employee performance, KPIs keep teams moving in the same direction.

• Provide a health check: Key performance indicators give you a realistic look at the health of your organization, from risk factors to financial indicators.

• Make adjustments: KPIs help you clearly see your successes and failures so you can do more of what's working, and less of what's not.

• Hold your teams accountable: Make sure everyone provides value with key performance indicators that help employees track their progress and help managers move things along.

Types of KPIs

Key performance indicators come in many flavors. While some are used to measure monthly progress against a goal, others have a longer-term focus. The one thing all KPIs have in common is that they're tied to strategic goals. Here's an overview of some of the most common types of KPIs.

• Strategic: These big-picture key performance indicators monitor organizational goals. Executives typically look to one or two strategic KPIs to find out how the organization is doing at any given time. Examples include return on investment, revenue, and market share.

• Operational: These KPIs typically measure performance in a shorter time frame, and are focused on organizational processes and efficiencies. Some examples include sales by region, average monthly transportation costs, and cost per acquisition (CPA).

• Functional Unit: Many key performance indicators are tied to specific functions, such as finance or IT. While IT might track time to resolution or average uptime, finance KPIs track gross profit margin or return on assets. These functional KPIs can also be classified as strategic or operational.

• Leading vs Lagging: Regardless of the type of key performance indicator you define, you should know the difference between leading indicators and lagging indicators. While leading KPIs can help predict outcomes, lagging KPIs track what has already happened. Organizations use a mix of both to ensure they're tracking what's most important.

Q8: Say you flip a coin 10 times and observe only one head. What would be the null hypothesis and p-value for testing whether the coin is fair or not?

Answer:

The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is biased. The p-value is the probability, given that the null hypothesis is true (the coin is fair), of observing a result at least as extreme as the one obtained.

For 10 flips of a coin there are 2¹⁰ = 1024 equally likely outcomes. Exactly 10 of them contain one head and nine tails, and 1 contains no heads at all, so the probability of one head or fewer is 11/1024 ≈ 0.0107.

Because the alternative is two-sided (the coin may be biased in either direction), the equally extreme results with nine or ten heads count as well, which gives a p-value of 22/1024 ≈ 0.0215. Therefore, with a significance level set, for example, at 0.05, we can reject the null hypothesis.
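The arithmetic can be checked with an exact binomial calculation. This is a minimal sketch; the two-sided p-value here sums the probabilities of all outcomes that are no more likely than the observed one, which is one common convention for exact tests.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips, heads = 10, 1

# Probability of exactly one head: 10 / 1024.
p_exact = binomial_pmf(heads, n_flips)
# One-sided p-value: one head or fewer, 11 / 1024.
p_one_sided = sum(binomial_pmf(k, n_flips) for k in range(heads + 1))
# Two-sided p-value: all outcomes at most as likely as the observed one, 22 / 1024.
p_two_sided = sum(
    binomial_pmf(k, n_flips)
    for k in range(n_flips + 1)
    if binomial_pmf(k, n_flips) <= p_exact
)

print(f"P(exactly 1 head) = {p_exact:.4f}")
print(f"one-sided p-value = {p_one_sided:.4f}")
print(f"two-sided p-value = {p_two_sided:.4f}")
```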

Q9: You are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?

Answer:
The main consideration when we run a large number of tests is that the probability of getting at least one significant result due to chance alone increases. This inflates the overall type I error rate (rejecting a null hypothesis when it is actually true).

Therefore we need to account for the multiple comparisons problem, for example with the Bonferroni correction. If our significance level is 0.05 and we run 100 independent tests in which every null hypothesis is true, the probability of at least one false positive is 1 − 0.95¹⁰⁰ ≈ 0.994, not 0.05. To keep the overall error rate at about 0.05, each individual test is run at a stricter significance level, called alpha star = significance level / K, where K is the number of tests; here alpha star = 0.05 / 100 = 0.0005. The Bonferroni correction is conservative; procedures such as Holm or Benjamini-Hochberg are less strict alternatives.
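A short sketch of the numbers involved, assuming the tests are independent and all null hypotheses are true. The helper `bonferroni_reject` and the example p-values are hypothetical and only illustrate the rule alpha star = alpha / K.

```python
alpha = 0.05
num_tests = 100

# Chance of at least one false positive across all tests without any correction.
fwer_uncorrected = 1 - (1 - alpha) ** num_tests
# Bonferroni-adjusted per-test significance level.
alpha_star = alpha / num_tests
# Chance of at least one false positive when each test uses alpha_star.
fwer_bonferroni = 1 - (1 - alpha_star) ** num_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each p-value, whether its null hypothesis is rejected."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

print(f"uncorrected family-wise error rate: {fwer_uncorrected:.3f}")
print(f"alpha star: {alpha_star}")
print(f"corrected family-wise error rate:   {fwer_bonferroni:.3f}")
print(bonferroni_reject([0.0001, 0.01, 0.04, 0.2]))
```

In the last line there are four tests, so the threshold is 0.05 / 4 = 0.0125: the p-value of 0.04 would be significant on its own but no longer is after the correction.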

Q10: What general conditions must be satisfied for the central limit theorem to hold?

Answer:

In order to apply the central limit theorem, four conditions are commonly listed (in addition to the population having a finite variance):

1. Randomization: The data must be sampled randomly such that every member of the population has an equal probability of being selected for the sample.

2. Independence: The sample values must be independent of each other.

3. The 10% Condition: When the sample is drawn without replacement, the sample size should be no larger than 10% of the population.

4. Large Sample Condition: The sample size needs to be sufficiently large.

Q11: What is skewness? Discuss two methods to measure it.

Answer:

Skewness refers to a distortion or asymmetry that deviates from a symmetrical distribution, such as the bell curve of the normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness quantifies the extent to which a given distribution departs from symmetry. There are two main types of skewness: negative skew refers to a longer or fatter tail on the left side of the distribution, while positive skew refers to a longer or fatter tail on the right. These two skews refer to the direction or weight of the distribution.

The mean of positively skewed data will typically be greater than the median. In a negatively skewed distribution, the opposite is the case: the mean will typically be less than the median. If the data graph symmetrically, the distribution has zero skewness, regardless of how long or fat the tails are.

There are several ways to measure skewness. Pearson's first and second coefficients of skewness are two common methods. Pearson's first coefficient of skewness, or Pearson mode skewness, subtracts the mode from the mean and divides the difference by the standard deviation. Pearson's second coefficient of skewness, or Pearson median skewness, subtracts the median from the mean, multiplies the difference by three, and divides the product by the standard deviation.
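The two coefficients translate directly into code. The data below are made up, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption of this sketch.

```python
from statistics import mean, median, mode, stdev

def pearson_mode_skewness(data):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (mean(data) - mode(data)) / stdev(data)

def pearson_median_skewness(data):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(data) - median(data)) / stdev(data)

# Hypothetical sample with a long tail towards small values.
data = [20, 45, 55, 58, 60, 62, 65, 65, 65]

first = pearson_mode_skewness(data)
second = pearson_median_skewness(data)
print(f"mode skewness:   {first:.3f}")
print(f"median skewness: {second:.3f}")
```

Both coefficients come out negative for this sample, matching its longer left tail. The median-based version is often preferred when the mode is poorly defined, for example in small samples or continuous data.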

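Q12: You sample from a uniform distribution [0, d] n times. What is your best estimate of d?

Answer:
Intuitively, it is the maximum of the sample points, which is also the maximum-likelihood estimate of d. Because the sample maximum can never exceed d, it underestimates d on average; multiplying it by (n + 1)/n gives an unbiased estimate.

[Figure: mathematical proof, not recovered in this copy.]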
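One way to sketch the argument is shown below. It is a standard derivation and may differ from the proof in the original figure. The symbols x_1, ..., x_n for the sample points and M for their maximum are notation introduced here, and the samples are assumed to be independent.

```latex
% Likelihood of d for x_1, ..., x_n drawn i.i.d. from Uniform[0, d]
L(d) = \prod_{i=1}^{n} \frac{1}{d}\,\mathbf{1}\{0 \le x_i \le d\}
     = d^{-n}\,\mathbf{1}\{d \ge \max_i x_i\}

% L(d) is decreasing in d for d >= max_i x_i, so it is maximized at the smallest allowed d
\hat{d}_{\mathrm{MLE}} = \max_i x_i

% Distribution and expectation of M = max_i x_i
P(M \le m) = \left(\frac{m}{d}\right)^{n}, \qquad 0 \le m \le d

\mathbb{E}[M] = \int_{0}^{d} m \cdot \frac{n\,m^{n-1}}{d^{n}}\,dm = \frac{n}{n+1}\,d

% Rescaling the maximum removes the downward bias
\hat{d}_{\mathrm{unbiased}} = \frac{n+1}{n}\,\max_i x_i
```

Which estimate counts as "best" depends on the criterion: the sample maximum is the maximum-likelihood estimate, while the rescaled maximum is unbiased.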
Q13: Discuss the Chi-square, ANOVA, and t-test

Answer:

Chi-square test: A statistical method used to assess whether the observed frequencies of categorical variables differ from the frequencies expected under a hypothesis, for example the hypothesis that two categorical variables are independent.

Example: A food delivery company wants to find the relationship between gender, location, and food choices of people.

It is used to determine whether an observed association between two categorical variables is:

• Due to chance or

• Due to a relationship between the variables

Analysis of Variance (ANOVA) is a statistical method that compares the means of several groups (typically three or more) by comparing the variance between the groups with the variance within them. It is used in a range of scenarios to determine whether there is any difference between the means of different groups.

The t-test is a statistical method for comparing the means of two groups, assuming the samples are approximately normally distributed.

It comes in various types such as:

1. One-sample t-test:
Used to compare the mean of a sample with a known or hypothesized population mean.

2. Two-sample t-test:
Used to compare the means of two independent samples and test whether the means of their populations are statistically different.

3. Paired t-test:
Used to compare the means of two related samples, such as measurements taken on the same group at two different times.
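As a small illustration of the chi-square test of independence, the sketch below computes the statistic for a 2x2 table by hand. The counts are invented (for example, gender by preferred food category), no continuity correction is applied, and the p-value formula used is specific to one degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows are two groups of customers, columns two food choices.
table = [[30, 10], [10, 30]]
statistic, p_value = chi_square_2x2(table)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.6f}")
```

A large statistic with a small p-value suggests the association is unlikely to be due to chance alone. For larger tables the degrees of freedom change and a general chi-square distribution function is needed instead of the `erfc` shortcut.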
