Interview Questions: Data Science

Stats:

1- You have two populations and a sample from each. How do you test for a
difference in proportion between the two populations?
a. Use a z-test for the difference of two population proportions.
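A minimal sketch of the two-proportion z-test, assuming the usual pooled-proportion form under the null hypothesis that both proportions are equal. The function name and the sample counts are illustrative, not from the original notes.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: 60 of 100 successes in sample 1, 45 of 100 in sample 2.
z, p = two_proportion_z_test(60, 100, 45, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The normal approximation behind this test is usually considered reasonable only when each sample has enough successes and failures.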
2- How to test the association between two discrete variables?
a. Use the chi-square test of independence. Cramér's V measures the
strength of association between two discrete variables and may be
used with variables having two or more levels.
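A small sketch of how the chi-square statistic and Cramér's V could be computed from a contingency table of counts. It assumes no continuity correction and does not compute a p-value; the function name and the table are made up for illustration.

```python
import math

def chi_square_and_cramers_v(table):
    """table is a list of rows of observed counts. Returns (chi2, cramers_v)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Cramér's V rescales chi2 to the range 0..1.
    k = min(len(table), len(table[0]))
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v

# Hypothetical 2x2 table of counts.
chi2, v = chi_square_and_cramers_v([[30, 10], [10, 30]])
print(chi2, v)
```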
3- What is multicollinearity? How to test for it?
a. When independent variables are highly correlated with one another, it
affects the regression model. It can increase the variance of the
coefficient estimates and make them very sensitive to small model
changes, which results in unstable estimates that can switch signs.
b. The variance inflation factor (VIF) is used to detect
multicollinearity. VIF measures how much the variance of an estimated
coefficient is inflated by multicollinearity. For predictor j,
VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-square from regressing
predictor j on the other predictors.
c. The square root of VIF tells how much larger the standard error is
compared with what it would be in the absence of multicollinearity.
d. VIF = 1 means no collinearity. As a rule of thumb, a VIF above 5
(some use 10) is considered high.
e. To reduce VIF, remove or combine the correlated variables. Centering
the variables helps only when the collinearity comes from interaction
or polynomial terms.
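A rough illustration of VIF = 1/(1 - R-square) for the simplest case of two predictors, where the R-square from regressing one predictor on the other equals their squared Pearson correlation. With more predictors, R-square would come from a multiple regression instead. The helper names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_from_r_squared(r2):
    """VIF for a predictor whose regression on the other predictors has R-square r2."""
    return 1.0 / (1.0 - r2)

# Hypothetical predictors that move together.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 5, 4, 5, 7]
r = pearson_r(x1, x2)
vif = vif_from_r_squared(r ** 2)
print(f"r = {r:.3f}, VIF = {vif:.3f}, sqrt(VIF) = {math.sqrt(vif):.3f}")
```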
4- What is the ROC curve?
a. ROC stands for Receiver Operating Characteristic.
b. It is used to assess how well a classification model separates events
from non-events across all cut-offs.
c. It plots sensitivity (true positive rate) against 1 - specificity
(false positive rate).
d. Sensitivity / (1 - specificity) = positive likelihood ratio
e. The 45-degree line (line of equality) corresponds to a model with
intercept only, i.e. random guessing (on this line sensitivity +
specificity = 1).
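As a sketch, the ROC points can be built by sweeping the cut-off over the predicted scores and recording (false positive rate, true positive rate) at each step. The area under the curve is then taken with the trapezoid rule. The labels and scores below are invented for illustration.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]            # 1 = event, 0 = non-event
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auc(points)
print(points, area)
```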
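5- What is the Gini index?
a. Ratio of the area between the line of equality and the Lorenz curve
(ROC curve) to the area of the whole triangle on that side of the
line of equality (0.5). For the ROC curve, Gini = 2 * AUC - 1.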
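6- What to do when the variables are not normal in regression?
a. Apply a transformation such as log, exp, etc. (The normality
assumption applies to the residuals, so check those first.)
b. Use the Box-Cox transformation.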
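A short sketch of the Box-Cox transformation for a given lambda: (x^lambda - 1)/lambda, with the log transform as the lambda = 0 case. Here lambda is passed in by hand; in practice it is usually estimated by maximum likelihood (for example, `scipy.stats.boxcox` can do that). The function name and values are illustrative.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox transformation with parameter lam to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

skewed = [1, 2, 4, 8, 16]           # hypothetical right-skewed data
print(box_cox(skewed, 0))           # log transform
print(box_cox(skewed, 0.5))         # square-root-like transform
```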

7- What is AIC?
a. Akaike information criterion
b. To estimate quality of each model
c. Relative estimate of information lost when a given model is used to
represent data
d. AIC = 2k - 2ln(L)
e. k is the number of parameters, L the maximized value of the likelihood
function
f. Min AIC is the best model
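A tiny worked example of AIC = 2k - 2ln(L), assuming the maximized log-likelihood ln(L) of each fitted model is already known. The model names, parameter counts, and log-likelihoods are made up.

```python
def aic(num_params, log_likelihood):
    """AIC = 2k - 2 ln(L), where log_likelihood is ln(L) at its maximum."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical candidates: name -> (number of parameters k, ln(L)).
candidates = {"model_a": (3, -120.5), "model_b": (5, -119.8)}
scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}
best = min(scores, key=scores.get)   # the lowest AIC is preferred
print(scores, best)
```

In this example model_b fits slightly better, but its two extra parameters cost more than the gain in likelihood, so model_a has the lower AIC.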

8- Concordance matrix?
a. Percent Concordant = (Number of concordant pairs)/Total number of pairs
Percent Discordant = (Number of discordant pairs)/Total number of pairs
Percent Tied = (Number of tied pairs)/Total number of pairs
Area under curve (c statistic) = Percent Concordant + 0.5 * Percent Tied
b. A pair is one actual 1 (event) and one actual 0 (non-event). It is
concordant when the 1 has a higher predicted probability of the event
than the 0
c. Discordant when the 1 has a lower predicted probability of the event
than the 0
d. Tied when the predicted probabilities of the 1 and the 0 are the same
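The pair counting above can be sketched as follows: every actual event is paired with every actual non-event, and their predicted probabilities are compared. The function name and data are illustrative only.

```python
def concordance(labels, probs):
    """Percent concordant, discordant, tied, and the c statistic over event/non-event pairs."""
    events = [p for y, p in zip(labels, probs) if y == 1]
    non_events = [p for y, p in zip(labels, probs) if y == 0]
    concordant = discordant = tied = 0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1
            elif pe < pn:
                discordant += 1
            else:
                tied += 1
    total = len(events) * len(non_events)
    pc, pd, pt = concordant / total, discordant / total, tied / total
    return {"concordant": pc, "discordant": pd, "tied": pt,
            "c_statistic": pc + 0.5 * pt}

# Hypothetical actual outcomes and predicted probabilities of the event.
result = concordance([1, 1, 0, 0], [0.8, 0.4, 0.4, 0.1])
print(result)
```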
9- What is discriminant analysis?
a. A classification technique used when the dependent variable is
categorical and the independent variables are continuous
10- Difference between logistic regression and discriminant analysis
a. Unlike discriminant analysis, logistic regression does not require
the independent variables to be normally distributed, linearly
related, or of equal variance within each group

Machine Learning
1- What is the difference between supervised and unsupervised learning?
Examples
a. In supervised learning we have a response (target) variable.
Examples: classification, regression
b. In unsupervised learning we don't have a target variable.
Example: clustering
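2- Is the IRIS data (data set in R) an example of supervised or
unsupervised learning?
a. The data set includes a species label, so predicting species from the
petal and sepal measurements is supervised (classification). It is
unsupervised only if the label is ignored and the flowers are
clustered on measurements such as petal length and width.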
3- What is the random forest algorithm?
a. Random forest builds multiple decision trees (typically CART-style),
each on a random sample of the data, and the decision is taken by
voting (classification) or averaging (regression) across all the
trees.
b. Random sampling is done on rows (bootstrap samples) and on attributes
(a random subset of features considered at each split)
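A deliberately simplified sketch of the idea: each "tree" here is only a one-split stump, fitted on a bootstrap sample of rows and a random subset of attributes, and the forest predicts by majority vote. Real random forests grow deeper trees and resample features at every split. All names and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) among feature_ids with the fewest misclassifications."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_label = Counter(left).most_common(1)[0][0]
            r_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_label, r_label)
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_label, r_label = stump
    return l_label if row[f] <= t else r_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]                       # bootstrap rows
        feats = rng.sample(range(n_features), max(1, n_features // 2))   # random attributes
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

rows = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [1, 1]), predict_forest(forest, [9, 9]))
```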
4- What is a feature vector?
a. An n-dimensional vector of numerical features that represents an
object
5- What is the CHAID algorithm?
a. Chi-square Automatic Interaction Detector
b. Tree-based decision algorithm
c. It creates non-binary decision trees. For classification (when the
dependent variable is categorical) splits are based on the chi-square
test, and for regression-type problems (when the dependent variable
is continuous) on the F test
6- CART?
a. Classification and Regression Trees
b. Only binary trees can be created, unlike CHAID where a node can split
into more than two branches
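To illustrate the binary nature of CART splits, here is a sketch that searches one numeric feature for the threshold giving the lowest weighted Gini impurity, the criterion commonly used by CART for classification. The function names and data are illustrative, and a full tree would apply this search recursively to each child node.

```python
def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    n = len(values)
    best = None
    # Skip the largest value: it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best

ages = [22, 25, 30, 35, 40, 50]                   # hypothetical feature
bought = ["no", "no", "no", "yes", "yes", "yes"]  # hypothetical target
split = best_binary_split(ages, bought)
print(split)
```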
7- What is precision and recall?
a. Precision = true pos / (true pos + false pos)
b. Recall = true pos / (true pos + false neg)
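The two formulas can be checked on a small set of predictions. This is a minimal sketch; the function name, the choice of returning 0.0 when a denominator is zero, and the data are assumptions for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 1, 1, 0, 1]   # hypothetical predicted labels
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```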
8- Explain the Central Limit Theorem.
9- What is the difference between SVD and PCA?
10- When do you use factor analysis and when PCA? What is the difference
between them?
11- What is the support vector machine algorithm, and when do you use it?
12- What is collaborative filtering?
13- What is a perceptron?
14- When do you use ANOVA, and where can ANOVA not be applied?
15- Why can we not use pairs of t-tests instead of ANOVA?
16-
