Data Science Interview Questions - Statistics: Mohit Kumar Dec 12, 2018 11 Min Read
“People worry that computers will get too smart and take over the world, but the real
problem is that they’re too stupid and they’ve already taken over the world.” — by Pedro
Domingos
Are you preparing for data science interviews but unsure how to start? What kinds of
questions can be asked? What topics need to be covered?
In this post I have listed the top 10 data science interview questions from one broad
mathematical discipline, statistics, based on current interview trends and my interview
experience at my past 4 companies (check out the LinkedIn profile here):
Ans: Descriptive statistics uses 3 types of measures to summarize a distribution, as
shown in the picture below:
Make sure you are well versed in all three measures before entering the interview
room. Important points to remember:
• Median vs. mean for imputation of missing values: use the median when the data has
outliers, as it is robust to them; otherwise use the mean.
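Here is a minimal Python sketch of this point. The income values and the `impute` helper are made up for illustration; they do not come from any dataset in this post. It shows how a single outlier drags the mean-based fill value far away from the typical observation, while the median-based fill stays put.

```python
import statistics

def impute(values, strategy="median"):
    """Fill None entries with the median or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical incomes (in thousands); 400 is an outlier, None is missing.
incomes = [30, 32, 35, None, 31, 400]

median_filled = impute(incomes, "median")
mean_filled = impute(incomes, "mean")

print("median fill:", median_filled[3])  # stays close to the typical values
print("mean fill:", mean_filled[3])      # pulled upward by the outlier
```

With this toy data the median fill is 32, while the mean fill is about 105.6, larger than every observed value except the outlier.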
Ans: Before getting into the CLT, let’s understand what a sampling distribution is:
Central Limit Theorem: For a population with any distribution (with finite variance),
the sampling distribution of the sample mean approaches a Normal distribution as the
sample size increases.
CLT Theorem
hist(samp.means, col = "steelblue", main = "Sampling Distribution of Sample Means",
     xlab = "Sample Mean")
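The theorem can also be checked by simulation. Below is a rough Python sketch using only the standard library. It assumes an exponential population with mean 1 (my choice of a clearly skewed, non-Normal population, not something from the plot above); the names `sample_means` and `skewness` are illustrative helpers.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from Exponential(1) and return their means."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

def skewness(values):
    """Population skewness: close to 0 for a symmetric, Normal-like shape."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

samp_means_2 = sample_means(2)
samp_means_50 = sample_means(50)

for label, means in (("n = 2", samp_means_2), ("n = 50", samp_means_50)):
    print(label,
          "| mean of means:", round(statistics.fmean(means), 3),
          "| sd of means:", round(statistics.stdev(means), 3),
          "| skewness:", round(skewness(means), 3))
```

As the sample size grows, the mean of the sample means stays near the population mean of 1, their standard deviation shrinks towards 1/sqrt(n), and the skewness moves towards 0, which is what the CLT predicts.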
A confidence interval gives us a range of values that is likely to contain the population
parameter. A confidence interval is generally preferred, as it tells us how likely the
interval is to contain the population parameter. This likelihood, or probability, is
called the confidence level or confidence coefficient and is represented by 1 - alpha,
where alpha is the level of significance.
The following picture gives the formulas used to calculate the confidence interval for
the population mean under 2 scenarios.
A sample of 11 circuits from a normal population has a mean resistance of 2.20 ohms.
We know from past testing that the population standard deviation is 0.35 ohms.
Determine a 95% confidence interval for the true mean resistance of the population.
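Since the population standard deviation is known here, one way to work this out is the z-interval: sample mean ± z × sigma / sqrt(n). A minimal Python sketch with the numbers from the problem (variable names are mine):

```python
import math
from statistics import NormalDist

n = 11              # sample size
sample_mean = 2.20  # ohms
sigma = 0.35        # known population standard deviation, ohms
confidence = 0.95

# Two-sided critical value: z such that P(Z <= z) = 1 - alpha/2
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z * sigma / math.sqrt(n)

lower = sample_mean - margin
upper = sample_mean + margin
print(f"95% CI: ({lower:.3f}, {upper:.3f}) ohms")
```

The critical value is about 1.96, the margin of error about 0.207, and the interval comes out to roughly 1.99 to 2.41 ohms.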
5. What is a p-value and what does it signify?
Ans: Most candidates feel frustrated when asked to explain the most widely used term
in statistics, the “p-value”. The fundamental problem is that it is hard to explain what
exactly a p-value is without using statistical jargon. Let me try to explain it as simply
as possible:
Just imagine a scenario: Modi ji went to buy mangoes from a vendor. As usual, the vendor
claimed that his mangoes were sweet. Now, as our Modi ji is a statistician (don’t take
this seriously), he wanted to investigate the vendor’s hypothesis.
From the population of mangoes, Modi ji picked 1 sample and investigated it. He found
that the mango was not sweet and reasoned that, if the vendor’s claim were true, the
probability of picking a mango this unsweet or even less sweet would be very low (this
is the p-value), say less than 5%. So, Modi ji rejected the vendor’s claim and went to
another vendor.
In the general hypothesis testing procedure, we have some hypothesis about a population
parameter and we investigate it using a sample drawn from the population. The p-value is
the probability of observing such a sample (or a more extreme one) given that the null
hypothesis is true. If this probability is too small, we doubt the null hypothesis and
reject it; otherwise we fail to reject the null hypothesis, saying that we don’t have
enough evidence to reject it.
The p-value reflects the strength of evidence against the null hypothesis.
The p-value is defined as the probability that the data would be at least as extreme as
those observed, if the null hypothesis were true.
Suppose that we are interested in the burning rate of a solid propellant used to power
aircrew escape systems. The propellant producer claims that the mean burning rate of the
solid propellant is 50 cm/s.
To investigate the producer’s claim, imagine we have collected a sample of propellants
and found a sample mean of 51.3 cm/s; the symmetric value on the other side of 50 is 48.7.
P-Value Calculation
From the above plot, the p-value tells us that if the null hypothesis H0: μ = 50 is
true, the probability of obtaining a random sample whose mean is at least as far from 50
as 51.3 or 48.7 is 0.038. Therefore, an observed sample mean of 51.3 is a rare event if
the null hypothesis is true. So, we reject the null hypothesis at the 5% level.
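The two-sided p-value can be sketched in Python as below. Note that the text does not state the population standard deviation or the sample size, so `sigma = 2.5` and `n = 16` are my assumptions, picked because they reproduce the quoted value of about 0.038 under a z-test.

```python
import math
from statistics import NormalDist

mu0 = 50.0          # mean burning rate under H0 (cm/s)
sample_mean = 51.3  # observed sample mean (cm/s)
sigma = 2.5         # ASSUMPTION: population sd, not given in the text
n = 16              # ASSUMPTION: sample size, not given in the text

# Standardize the observed mean under H0
z = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Two-sided: probability of a mean at least this far from 50 in either direction
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Under these assumptions z is 2.08 and the p-value rounds to 0.038, which is below 0.05, so H0 is rejected at the 5% level.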
So keep in mind: the p-value helps the statistician draw conclusions about the null
hypothesis and is always between 0 and 1.
• P-value > 0.05 denotes weak evidence against the null hypothesis, which means
the null hypothesis cannot be rejected.
• P-value < 0.05 denotes strong evidence against the null hypothesis, which means
the null hypothesis can be rejected.
The following picture explains the steps followed to get the ANOVA results.
A limitation of ANOVA is that it does not tell us which pair has the significant
difference. In the above example, it is clear that there is a significant difference
between the mean Data Scientist salaries of these 3 cities, but ANOVA gives no
information on which pair of cities differs significantly. This problem is solved by
Tukey HSD. If you are interested, read more about it here.
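To make the limitation concrete, here is a small Python sketch of the one-way ANOVA F statistic. The salary figures are invented for illustration (they are not the data from the example above), and `one_way_anova_f` is a hypothetical helper name.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical salaries (made-up units) for three cities
salaries = {
    "city_a": [10, 11, 12],
    "city_b": [11, 12, 13],
    "city_c": [14, 15, 16],
}

f_stat = one_way_anova_f(list(salaries.values()))
print("F statistic:", round(f_stat, 2))
for city, values in salaries.items():
    print(city, "mean:", fmean(values))
```

For this toy data F comes out to 13 on (2, 6) degrees of freedom. The single F value says only that the group means are not all equal; it does not say whether city_c differs from city_a, from city_b, or from both. That is the question a post-hoc test such as Tukey HSD answers.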
Ans: Principal component analysis (PCA) and factor analysis can both be used to reduce
the dimensions of the data. But statisticians generally use factor analysis to
understand and simplify the patterns of relationships underlying measured variables.
The following picture depicts how the two approaches differ:
Factor analysis seeks linear combinations of variables, called factors, that represent
underlying fundamental quantities of which the observed variables are expressions.
More precisely, the manifest variables are linear combinations of the factors, plus
unique (or specific) factors. From the above picture, it is clear that the factor F
causes the responses on the 4 measured Y variables.
PCA, on the other hand, summarizes the common variation in many variables using just a
few variables. You can see in the above picture, from the direction of the arrows, that
the Y variables contribute to the component variable.
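The direction of the arrows can be written down as two small models. This is only a sketch for the case in the picture (one factor or component and 4 measured variables); the symbols λ_j (loadings), u_j (unique factors), w_j (weights) and C (component) are my notation, not the post's.

```latex
% Factor analysis: the latent factor F drives each observed variable
Y_j = \lambda_j F + u_j, \qquad j = 1, \dots, 4

% PCA: the component C is built from the observed variables
C = w_1 Y_1 + w_2 Y_2 + w_3 Y_3 + w_4 Y_4
```

In the first model the Y variables sit on the left-hand side as effects of F; in the second they sit on the right-hand side as inputs to C.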
Ans: This is one of the most frequently asked and simplest questions, and answering it
well can help you create a strong impression on the interviewer. Let’s crack it:
Linear Regression
Here, the errors are assumed to follow a Normal distribution with mean 0 and variance
sigma². Using this assumption, we can conclude that the dependent variable also follows
a Normal distribution, with the regression function as its mean and the same variance
sigma² as the errors.
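Written out for a simple linear regression with one predictor (an assumption made for brevity; the β notation is mine), the argument runs as follows:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)

% For a fixed x_i, the regression function is a constant, so adding it
% shifts the mean of the error and leaves the variance unchanged:
E[y_i \mid x_i] = \beta_0 + \beta_1 x_i,
\qquad \operatorname{Var}(y_i \mid x_i) = \sigma^2

y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \; \sigma^2)
```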
Ans: Correlation and covariance are statistical concepts that are generally used to
determine the relationship and measure the dependency between two random
variables. In fact, correlation is a special case of covariance: it is the covariance
obtained when the variables are standardized. This point will become clear from the
formulas:
Correlation Formula
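The same point can be verified numerically. The Python sketch below uses made-up data and hand-written helpers (`covariance`, `standardize` are illustrative names) to check that the correlation equals the covariance of the standardized variables.

```python
import math
from statistics import fmean

def covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x)(y - mean_y) over n - 1."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def standardize(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    m = fmean(xs)
    sd = math.sqrt(covariance(xs, xs))  # variance is the covariance with itself
    return [(x - m) / sd for x in xs]

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
corr_xy = cov_xy / math.sqrt(covariance(x, x) * covariance(y, y))
cov_standardized = covariance(standardize(x), standardize(y))

print("covariance:", cov_xy)
print("correlation:", round(corr_xy, 4))
print("covariance of standardized variables:", round(cov_standardized, 4))
```

For this data the covariance is 1.5, while both the correlation and the covariance of the standardized variables are about 0.7746. Covariance changes with the units of x and y; correlation does not.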
Ans: Yes, absolutely. Candidates often stumble on this question when they hear it for
the first time. It tests knowledge of logistic regression, Excel functions, and Excel
Solver options.
Download the Excel file here; it explains the different steps involved in performing
logistic regression in Excel using Solver.
Steps to be performed:
Now open Solver from the Data tab in Excel and estimate the coefficients that maximize
the log-likelihood function given in cell J9.
After providing the inputs to Solver as shown in the above snapshot, click on Solve. If
Solver is able to find a solution, a new window will pop up that looks like this:
Just click OK. The estimated coefficients will be displayed in their respective cells.
Solver Output
Apart from the estimated coefficients, Solver will show the maximized log-likelihood
value, which is -6.65456 in the current case.
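What Solver does here, searching for the coefficients that maximize the log-likelihood, can be mimicked outside Excel. The Python sketch below is only an illustration: the data is made up (it is not the data in the workbook, so the log-likelihood will not match -6.65456), and it uses Newton-Raphson steps as a stand-in for whatever optimization method Solver applies internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of y*log(p) + (1 - y)*log(1 - p), the quantity being maximized."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, iterations=25):
    """Maximize the log-likelihood over (b0, b1) with Newton-Raphson steps."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1 - p)
            g0 += y - p          # gradient with respect to b0
            g1 += (y - p) * x    # gradient with respect to b1
            h00 += w             # entries of the 2x2 matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(X'WX) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: one predictor and a 0/1 outcome (not perfectly separable)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("maximized log-likelihood:", round(log_likelihood(b0, b1, xs, ys), 4))
```

The starting point (all coefficients at 0) plays the role of the initial values you would type into the coefficient cells before running Solver, and the printed log-likelihood corresponds to the value in the objective cell after it finishes.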
******************************************************************
Apart from the questions explained above, I recommend preparing the following topics
before sitting for a Data Science/Data Analyst/Business Analyst interview.
• Hypothesis Testing
******************************************************************
What’s your story? Did this guide help you better prepare for your next interview? Do
you have any other statistics questions that you would like to see in this post? Let us
know in the comments below!