Interview Questions ML


Machine Learning Interview Questions:

Algorithms/Theory
These algorithms questions will test your grasp of the theory behind machine learning.

Q1- What’s the trade-off between bias and variance?

More reading: Bias-Variance Tradeoff (Wikipedia)

Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm
you’re using. This can lead to the model underfitting your data, making it hard for it to
have high predictive accuracy and for you to generalize your knowledge from the
training set to the test set.

Variance is error due to too much complexity in the learning algorithm you’re using. This
leads to the algorithm being highly sensitive to high degrees of variation in your training
data, which can lead your model to overfit the data. You’ll be carrying too much noise
from your training data for your model to be very useful for your test data.

The bias-variance decomposition essentially decomposes the learning error from any
algorithm by adding the bias, the variance and a bit of irreducible error due to noise in
the underlying dataset. Essentially, if you make the model more complex and add more
variables, you’ll lose bias but gain some variance — in order to get the optimally
reduced amount of error, you’ll have to tradeoff bias and variance. You don’t want either
high bias or high variance in your model.
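
To make this concrete, here is a minimal sketch (scikit-learn and NumPy assumed, with made-up synthetic data) in which a very simple model underfits and a very complex one overfits:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy sine curve

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

for degree in (1, 4, 15):   # underfit (high bias), reasonable, overfit (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(degree, round(train_err, 3), round(test_err, 3))

The low-degree model has high training and test error (bias), while the high-degree model drives training error down but test error back up (variance).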

Q2- What is the difference between supervised and unsupervised machine learning?

More reading: What is the difference between supervised and unsupervised machine
learning? (Quora)

Supervised learning requires training using labeled data. For example, in order to do
classification (a supervised learning task), you’ll need to first label the data you’ll use to
train the model to classify data into your labeled groups. Unsupervised learning, in
contrast, does not require labeling data explicitly.

Q3- How is KNN different from k-means clustering?

More reading: How is the k-nearest neighbor algorithm different from k-means
clustering? (Quora)

K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data you want to classify an unlabeled point into (thus the nearest-neighbor part). K-means clustering requires only a set of unlabeled points and a chosen number of clusters k: the algorithm will take the unlabeled points and gradually learn how to cluster them into groups by iteratively computing the mean (centroid) of each group.

The critical difference here is that KNN needs labeled points and is thus supervised
learning, while k-means doesn’t — and is thus unsupervised learning.
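
A minimal scikit-learn sketch (toy data assumed) makes the supervised/unsupervised distinction concrete: KNN needs the labels y to fit, k-means does not:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y = np.array([0, 0, 0, 1, 1, 1])   # labels: only KNN uses these

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)       # supervised: needs X and y
print(knn.predict([[0, 0], [12, 3]]))                     # [0 1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # unsupervised: X only
print(km.labels_)                                         # cluster ids, with no labels attached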

Q4- Explain how a ROC curve works.

More reading: Receiver operating characteristic (Wikipedia)

The ROC curve is a graphical representation of the trade-off between the true positive
rate and the false positive rate at various classification thresholds. It’s often used as a
proxy for the trade-off between the sensitivity of the model (true positives) and the
fall-out, or the probability it will trigger a false alarm (false positives).
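
As a short sketch, assuming scikit-learn and a toy set of true labels and predicted scores, the curve and the area under it are typically computed like this:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                       # ground-truth classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55]    # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve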

Q5- Define precision and recall.


More reading: Precision and recall (Wikipedia)

Recall is also known as the true positive rate: the number of positives your model claims
compared to the actual number of positives there are throughout the data. Precision is
also known as the positive predictive value: a measure of the number of accurate
positives your model claims compared to the total number of positives it claims. It can be
easier to think of recall and precision in the context of a case where you’ve predicted that
there were 10 apples and 5 oranges in a case of 10 apples. You’d have perfect recall
(there are actually 10 apples, and you predicted there would be 10) but 66.7% precision,
because out of the 15 events you predicted, only 10 (the apples) are correct.
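
Working through the apple example in code (the counts below come from the example above):

tp, fp, fn = 10, 5, 0        # 10 real apples found, 5 spurious predictions, none missed

recall = tp / (tp + fn)      # 10 / 10 = 1.0 (perfect recall)
precision = tp / (tp + fp)   # 10 / 15 = 0.667 (66.7% precision)
print(recall, round(precision, 3))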

Q6- What is Bayes’ Theorem? How is it useful in a machine learning context?

More reading: An Intuitive (and Short) Explanation of Bayes’ Theorem (BetterExplained)

Bayes’ Theorem gives you the posterior probability of an event given what is known as
prior knowledge.

Mathematically, it’s expressed as the true positive rate of the condition sample divided by
the sum of the true positive rate of the condition sample and the false positive rate of the
population. Say you had a 60% chance of actually having the flu after a flu test, but out of
people who had the flu, the test will be false 50% of the time, and the overall population
only has a 5% chance of having the flu. Would you actually have a 60% chance of
having the flu after having a positive test?

Bayes’ Theorem says no. It says that you have a (0.6 * 0.05) (true positive rate of the
condition sample) / [(0.6 * 0.05) (true positive rate of the condition sample) + (0.5 * 0.95)
(false positive rate of the population)] ≈ 0.0594, or a 5.94% chance of actually having the flu.

Bayes’ Theorem is the basis behind a branch of machine learning that most notably
includes the Naive Bayes classifier. That’s something important to consider when you’re
faced with machine learning interview questions.
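
Reproducing the flu example numerically (a sketch only, using the rates quoted above):

p_flu = 0.05         # prior: 5% of the population has the flu
p_pos_flu = 0.60     # P(positive test | flu), as quoted in the example
p_pos_noflu = 0.50   # P(positive test | no flu), i.e. the false-alarm rate

# Bayes' Theorem: P(flu | positive) = P(positive | flu) * P(flu) / P(positive)
posterior = (p_pos_flu * p_flu) / (p_pos_flu * p_flu + p_pos_noflu * (1 - p_flu))
print(round(posterior, 4))   # 0.0594, i.e. about a 5.94% chance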

Q7- Why is “Naive” Bayes naive?

More reading: Why is “naive Bayes” naive? (Quora)

Despite its practical applications, especially in text mining, Naive Bayes is considered
“Naive” because it makes an assumption that is virtually impossible to see in real-life
data: the conditional probability is calculated as the pure product of the individual
probabilities of components. This implies the absolute independence of features — a
condition probably never met in real life.

As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that
you liked pickles and ice cream would probably naively recommend you a pickle ice
cream.

Q8- Explain the difference between L1 and L2 regularization.

More reading: What is the difference between L1 and L2 regularization? (Quora)


L2 regularization tends to spread error among all the terms, while L1 is more
binary/sparse, with many variables either being assigned a 1 or 0 in weighting. L1
corresponds to setting a Laplacean prior on the terms, while L2 corresponds to a
Gaussian prior.
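
A quick scikit-learn sketch (synthetic data assumed) showing the sparsity difference: Lasso (L1) drives many coefficients exactly to zero, while Ridge (L2) shrinks them but keeps them all:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("coefficients set to zero by L1:", int(np.sum(lasso.coef_ == 0)))   # many exact zeros
print("coefficients set to zero by L2:", int(np.sum(ridge.coef_ == 0)))   # typically none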

Q9- What’s your favorite algorithm, and can you explain it to me in less than a
minute?

This type of question tests your understanding of how to communicate complex and
technical nuances with poise and the ability to summarize quickly and efficiently. Make
sure you have a choice and make sure you can explain different algorithms so simply
and effectively that a five-year-old could grasp the basics!

Q10- What’s the difference between Type I and Type II error?

More reading: Type I and type II errors (Wikipedia)


Don’t think that this is a trick question! Many machine learning interview questions will
be an attempt to lob basic questions at you just to make sure you’re on top of your
game and you’ve prepared all of your bases.

Type I error is a false positive, while Type II error is a false negative. Briefly stated,
Type I error means claiming something has happened when it hasn’t, while Type II error
means that you claim nothing is happening when in fact something is.

A clever way to think about this is to think of Type I error as telling a man he is
pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a
baby.

Q11- What’s a Fourier transform?

More reading: Fourier transform (Wikipedia)

A Fourier transform is a generic method to decompose generic functions into a
superposition of symmetric functions. Or as this more intuitive tutorial puts it, given a
smoothie, it’s how we find the recipe. The Fourier transform finds the set of cycle
speeds, amplitudes and phases to match any time signal. A Fourier transform converts
a signal from time to frequency domain — it’s a very common way to extract features
from audio signals or other time series such as sensor data.
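
For example, a minimal NumPy sketch (synthetic signal assumed) that recovers the frequencies present in a time-domain signal:

import numpy as np

fs = 100                           # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)        # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)   # 5 Hz + 20 Hz

spectrum = np.fft.rfft(signal)                     # frequency-domain representation
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]   # two strongest frequency components
print(sorted(peaks))                               # [5.0, 20.0]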

Q12- What’s the difference between probability and likelihood?

More reading: What is the difference between “likelihood” and “probability”? (Cross
Validated)
Q13- What is deep learning, and how does it contrast with other machine learning
algorithms?

More reading: Deep learning (Wikipedia)

Deep learning is a subset of machine learning that is concerned with neural networks:
how to use backpropagation and certain principles from neuroscience to more
accurately model large sets of unlabelled or semi-structured data. In that sense, deep
learning represents an unsupervised learning algorithm that learns representations of
data through the use of neural nets.

Q14- What’s the difference between a generative and discriminative model?

More reading: What is the difference between a Generative and Discriminative Algorithm? (Stack Overflow)

A generative model will learn categories of data while a discriminative model will simply
learn the distinction between different categories of data. Discriminative models will
generally outperform generative models on classification tasks.

Q15- What cross-validation technique would you use on a time series dataset?

More reading: Using k-fold cross-validation for time-series model selection (CrossValidated)

Instead of using standard k-folds cross-validation, you have to pay attention to the fact
that a time series is not randomly distributed data — it is inherently ordered by
chronological order. If a pattern emerges in later time periods for example, your model
may still pick up on it even if that effect doesn’t hold in earlier years!

You’ll want to do something like forward chaining where you’ll be able to model on past
data then look at forward-facing data.

 fold 1 : training [1], test [2]
 fold 2 : training [1 2], test [3]
 fold 3 : training [1 2 3], test [4]
 fold 4 : training [1 2 3 4], test [5]
 fold 5 : training [1 2 3 4 5], test [6]
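
scikit-learn's TimeSeriesSplit implements exactly this forward-chaining scheme; a minimal sketch with dummy data (note that the printed indices are 0-based):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six consecutive time periods

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X), 1):
    print(f"fold {fold}: training {list(train_idx)}, test {list(test_idx)}")
# fold 1: training [0], test [1] ... fold 5: training [0, 1, 2, 3, 4], test [5]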

Q16- How is a decision tree pruned?

More reading: Pruning (decision trees)

Pruning is what happens in decision trees when branches that have weak predictive
power are removed in order to reduce the complexity of the model and increase the
predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-
down, with approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is perhaps the simplest version: starting with the leaves, replace
each node with its most popular class; if predictive accuracy doesn’t decrease, keep the
change. While simple, this heuristic actually comes pretty close to an approach that
would optimize for maximum accuracy.

Q17- Which is more important to you– model accuracy, or model performance?

More reading: Accuracy paradox (Wikipedia)

This question tests your grasp of the nuances of machine learning model performance!
Machine learning interview questions often look towards the details. There are models
with higher accuracy that can perform worse in predictive power — how does that make
sense?
Well, it has everything to do with how model accuracy is only a subset of model
performance, and at that, a sometimes misleading one. For example, if you wanted to
detect fraud in a massive dataset with a sample of millions, a more accurate model
would most likely predict no fraud at all if only a tiny minority of cases were fraud.
However, this would be useless for a predictive model — a model designed to find fraud
that asserted there was no fraud at all! Questions like this help you demonstrate that
you understand model accuracy isn’t the be-all and end-all of model performance.

Q18- What’s the F1 score? How would you use it?

More reading: F1 score (Wikipedia)

The F1 score is a measure of a model’s performance. It is a weighted average of the
precision and recall of a model, with results tending to 1 being the best and those
tending to 0 being the worst. You would use it in classification tests where true
negatives don’t matter much.
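
A minimal sketch of computing it with scikit-learn (toy labels assumed):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)    # TP / (TP + FP)
r = recall_score(y_true, y_pred)       # TP / (TP + FN)
print(p, r, f1_score(y_true, y_pred))  # F1 is the harmonic mean of precision and recall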

Q19- How would you handle an imbalanced dataset?

More reading: 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset (Machine Learning Mastery)

An imbalanced dataset is one where you have, for example, a classification task and 90% of
the data is in one class. That leads to problems: an accuracy of 90% can be misleading if
you have no predictive power on the other category of data! Here are a few tactics to
get over the hump:

1- Collect more data to even the imbalances in the dataset.

2- Resample the dataset to correct for imbalances (see the sketch below).

3- Try a different algorithm altogether on your dataset.

What’s important here is that you have a keen sense for what damage an unbalanced
dataset can cause, and how to balance that.
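
As an illustration of tactic 2, here is a minimal resampling sketch using sklearn.utils.resample on a toy DataFrame (in practice you would resample only the training split):

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100),
                   "label": [0] * 90 + [1] * 10})   # 90/10 class imbalance

majority = df[df.label == 0]
minority = df[df.label == 1]

# Upsample the minority class with replacement until the classes are even.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced.label.value_counts())   # 90 rows of each class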

Q20- When should you use classification over regression?

More reading: Regression vs Classification (Math StackExchange)

Classification produces discrete values and maps your dataset to strict categories, while
regression gives you continuous results that allow you to better distinguish differences
between individual points. You would use classification over regression if you wanted your
results to reflect the belongingness of data points in your dataset to certain explicit
categories (e.g., if you wanted to know whether a name was male or female rather than
just how correlated it was with male and female names).

Q21- Name an example where ensemble techniques might be useful.

More reading: Ensemble learning (Wikipedia)

Ensemble techniques use a combination of learning algorithms to optimize better
predictive performance. They typically reduce overfitting in models and make the model
more robust (unlikely to be influenced by small changes in the training data).

You could list some examples of ensemble methods, from bagging to boosting to a
“bucket of models” method and demonstrate how they could increase predictive power.

Q22- How do you ensure you’re not overfitting with a model?

More reading: How can I avoid overfitting? (Quora)

This is a simple restatement of a fundamental problem in machine learning: the
possibility of overfitting training data and carrying the noise of that data through to the
test set, thereby providing inaccurate generalizations.

There are three main methods to avoid overfitting:

1- Keep the model simpler: reduce variance by taking into account fewer variables and
parameters, thereby removing some of the noise in the training data.

2- Use cross-validation techniques such as k-folds cross-validation.

3- Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting (see the sketch below).
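
A small sketch combining points 2 and 3, i.e. k-fold cross-validation of a LASSO model with scikit-learn (synthetic data assumed):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=50, noise=15.0, random_state=0)

for alpha in (0.01, 1.0, 10.0):   # regularization strength
    scores = cross_val_score(Lasso(alpha=alpha), X, y, cv=5)   # 5-fold CV R^2 scores
    print(alpha, round(scores.mean(), 3))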

Q23- What evaluation approaches would you work to gauge the effectiveness of a
machine learning model?

More reading: How to Evaluate Machine Learning Algorithms (Machine Learning Mastery)

You would first split the dataset into training and test sets, or perhaps use cross-
validation techniques to further segment the dataset into composite sets of training and
test sets within the data. You should then implement a choice selection of performance
metrics: here is a fairly comprehensive list. You could use measures such as the F1
score, the accuracy, and the confusion matrix. What’s important here is to demonstrate
that you understand the nuances of how a model is measured and how to choose the
right performance measures for the right situations.
Q24- How would you evaluate a logistic regression model?

More reading: Evaluating a logistic regression (CrossValidated)

A subsection of the question above. You have to demonstrate an understanding of what the typical goals of a logistic regression are (classification, prediction, etc.) and bring up a few examples and use cases.

Q25- What’s the “kernel trick” and how is it useful?

More reading: Kernel method (Wikipedia)

The kernel trick involves kernel functions that can operate in higher-dimensional spaces
without explicitly calculating the coordinates of points within that space: instead, kernel
functions compute the inner products between the images of all pairs of data in a feature
space. This gives them the very useful property of working with the coordinates of higher
dimensions while being computationally cheaper than the explicit calculation of said
coordinates. Many algorithms can be expressed in terms of inner products. Using the
kernel trick enables us to effectively run algorithms in a high-dimensional space with
lower-dimensional data.

(Learn about Springboard’s AI / Machine Learning Bootcamp, the first of its kind to
come with a job guarantee.)

Machine Learning Interview Questions: Programming


These machine learning interview questions test your knowledge of programming
principles you need to implement machine learning principles in practice. Machine
learning interview questions tend to be technical questions that test your logic and
programming skills: this section focuses more on the latter.

Q26- How do you handle missing or corrupted data in a dataset?

More reading: Handling missing data (O’Reilly)

You could find missing/corrupted data in a dataset and either drop those rows or
columns, or decide to replace them with another value.

In Pandas, there are two very useful methods: isnull() and dropna() that will help you
find columns of data with missing or corrupted data and drop those values. If you want
to fill the invalid values with a placeholder value (for example, 0), you could use the
fillna() method.

Q27- Do you have experience with Spark or big data tools for machine learning?
More reading: 50 Top Open Source Tools for Big Data (Datamation)

You’ll want to get familiar with the meaning of big data for different companies and the
different tools they’ll want. Spark is the big data tool most in demand now, able to
handle immense datasets with speed. Be honest if you don’t have experience with the
tools demanded, but also take a look at job descriptions and see what tools pop up:
you’ll want to invest in familiarizing yourself with them.

Q28- Pick an algorithm. Write the pseudo-code for a parallel implementation.

More reading: Writing pseudocode for parallel programming (Stack Overflow)

This kind of question demonstrates your ability to think in parallelism and how you could
handle concurrency in programming implementations dealing with big data. Take a look
at pseudocode frameworks such as Peril-L and visualization tools such as Web
Sequence Diagrams to help you demonstrate your ability to write code that reflects
parallelism.

Q29- What are some differences between a linked list and an array?

More reading: Array versus linked list (Stack Overflow)

An array is an ordered collection of objects. A linked list is a series of objects with
pointers that direct how to process them sequentially. An array assumes that every
element has the same size, unlike the linked list. A linked list can more easily grow
organically: an array has to be pre-defined or re-defined for organic growth. Shuffling a
linked list involves changing which pointers direct where — meanwhile, shuffling an array
is more complex and takes more memory.

Q30- Describe a hash table.

More reading: Hash table (Wikipedia)

A hash table is a data structure that produces an associative array. A key is mapped to
certain values through the use of a hash function. They are often used for tasks such as
database indexing.
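
In Python, the built-in dict is a hash-table-backed associative array; a tiny sketch (the keys and values are made up):

index = {}                      # Python dicts are hash tables under the hood
index["user_42"] = "row 1375"   # the key is hashed to find its storage slot
index["user_99"] = "row 2048"

print(index.get("user_42"))     # average O(1) lookup by key -> row 1375
print("user_7" in index)        # membership testing is also a hash lookup -> False
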
Q31- Which data visualization libraries do you use? What are your thoughts on
the best data visualization tools?

More reading: 31 Free Data Visualization Tools (Springboard)

What’s important here is to define your views on how to properly visualize data and your
personal preferences when it comes to tools. Popular tools include R’s ggplot, Python’s
seaborn and matplotlib, and tools such as Plot.ly and Tableau.

Related: 20 Python Interview Questions

Machine Learning Interview Questions: Company/Industry Specific
These machine learning interview questions deal with how to implement your general
machine learning knowledge to a specific company’s requirements. You’ll be asked to
create case studies and extend your knowledge of the company and industry you’re
applying for with your machine learning skills.

Q32- How would you implement a recommendation system for our company’s
users?

More reading: How to Implement A Recommendation System? (Stack Overflow)

A lot of machine learning interview questions of this type will involve implementation of
machine learning models to a company’s problems. You’ll have to research the
company and its industry in-depth, especially the revenue drivers the company has, and
the types of users the company takes on in the context of the industry it’s in.

Q33- How can we use your machine learning skills to generate revenue?
More reading: Startup Metrics for Startups (500 Startups)

This is a tricky question. The ideal answer would demonstrate knowledge of what drives
the business and how your skills could relate. For example, if you were interviewing for
music-streaming startup Spotify, you could remark that your skills at developing a better
recommendation model would increase user retention, which would then increase
revenue in the long run.

The startup metrics Slideshare linked above will help you understand exactly what
performance indicators are important for startups and tech companies as they think
about revenue and growth.

Q34- What do you think of our current data process?

More reading: The Data Science Process Email Course – Springboard


This kind of question requires you to listen carefully and impart feedback in a manner
that is constructive and insightful. Your interviewer is trying to gauge if you’d be a
valuable member of their team and whether you grasp the nuances of why certain
things are set the way they are in the company’s data process based on company- or
industry-specific conditions. They’re trying to see if you can be an intellectual peer. Act
accordingly.

Machine Learning Interview Questions: General Machine Learning Interest
This series of machine learning interview questions attempts to gauge your passion and
interest in machine learning. The right answers will serve as a testament for your
commitment to being a lifelong learner in machine learning.

Q35- What are the last machine learning papers you’ve read?

More reading: What are some of the best research papers/books for machine learning?

Keeping up with the latest scientific literature on machine learning is a must if you want
to demonstrate interest in a machine learning position. This overview of deep learning in
Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can
be a good reference paper and an overview of what’s happening in deep learning —
and the kind of paper you might want to cite.

Q36- Do you have research experience in machine learning?

Related to the last point, most organizations hiring for machine learning positions will
look for your formal experience in the field. Research papers, co-authored or supervised
by leaders in the field, can make the difference between you being hired and not. Make
sure you have a summary of your research experience and papers ready — and an
explanation for your background and lack of formal research experience if you don’t.

Q37- What are your favorite use cases of machine learning models?

More reading: What are the typical use cases for different machine learning algorithms?
(Quora)

The Quora thread above contains some examples, such as decision trees that
categorize people into different tiers of intelligence based on IQ scores. Make sure that
you have a few examples in mind and describe what resonated with you. It’s important
that you demonstrate an interest in how machine learning is implemented.

Q38- How would you approach the “Netflix Prize” competition?


More reading: Netflix Prize (Wikipedia)

The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a better
collaborative filtering algorithm. The winning team, called BellKor, achieved a 10%
improvement and used an ensemble of different methods to win. Some familiarity with
the case and its solution will help demonstrate you’ve paid attention to machine learning
for a while.

Q39- Where do you usually source datasets?

More reading: 19 Free Public Data Sets For Your First Data Science Project
(Springboard)

Machine learning interview questions like these try to get at the heart of your machine
learning interest. Somebody who is truly passionate about machine learning will have
gone off and done side projects on their own, and have a good idea of what great
datasets are out there. If you’re missing any, check out Quandl for economic and
financial data, and Kaggle’s Datasets collection for another great list.

Q40- How do you think Google is training data for self-driving cars?

More reading: Waymo Tech

Machine learning interview questions like this one really test your knowledge of different
machine learning methods, and your inventiveness if you don’t know the answer.
Google is currently using recaptcha to source labeled data on storefronts and traffic
signs. They are also building on training data collected by Sebastian Thrun at GoogleX
— some of which was obtained by his grad students driving buggies on desert dunes!

Q41- How would you simulate the approach AlphaGo took to beat Lee Sedol at Go?

More reading: Mastering the game of Go with deep neural networks and tree search
(Nature)

AlphaGo beating Lee Sedol, the best human player at Go, in a best-of-five series was a
truly seminal event in the history of machine learning and deep learning. The Nature
paper above describes how this was accomplished with “Monte-Carlo tree search with
deep neural networks that have been trained by supervised learning, from human
expert games, and by reinforcement learning from games of self-play.”

What is the difference between supervised and unsupervised machine learning?

Supervised learning requires training using labelled data. For example, in order to
do classification, which is a supervised learning task, you’ll first need to label the
data you’ll use to train the model to classify data into your labelled groups.
Unsupervised learning, in contrast, does not require labeling data explicitly.

What’s the trade-off between bias and variance?

Bias is error due to overly simplistic assumptions in the learning algorithm you are
using, which can lead to the model underfitting your data and make it hard for the
model to give accurate predictions.

Variance, on the other hand, is error due to far too much complexity in your
learning algorithm. Because of this complexity, the algorithm is highly sensitive to
high degrees of variation in the training data, which can lead your model to overfit
the data. In addition, you will be carrying too much noise from your training data
for your model to be useful.

The bias-variance decomposition essentially decomposes the learning error from any
algorithm by adding the bias, the variance and a bit of irreducible error due to noise
in the underlying data-set. Essentially, if you make the model more
complex and add more variables, you’ll lose bias but gain some variance — in order
to get the optimally reduced amount of error, you’ll have to trade-off bias and
variance. You don’t want either high bias or high variance in your model.

How is KNN different from k-means clustering?

The crucial difference between the two is that K-Nearest Neighbors is a supervised
classification algorithm, whereas k-means is an unsupervised clustering algorithm.
While the procedures may seem similar at first, what this really means is that for
K-Nearest Neighbors to work, you need labelled data into which you want to classify
an unlabeled point. K-means clustering requires only a set of unlabeled points and a
chosen number of clusters: the algorithm will take that unlabeled data and learn how
to cluster it into groups by computing the mean distance between the points.

What is Bayes’ Theorem? How is it useful?

Bayes’ Theorem gives you the posterior probability of an event given what is
known as prior knowledge. It is also the basis behind the Naive Bayes classifier.
That’s something important to know when you’re faced with machine learning
interview questions and answers.

What is the difference between L1 and L2 regularization?

First, regularization is a technique that helps solve the overfitting problem in
Machine Learning. L2 regularization tends to spread error among all the terms,
while L1 is more binary, with most variables either being assigned a 1 or 0 in
weighting. L1 corresponds to setting a Laplacean prior on the terms, while L2
corresponds to a Gaussian prior. As a rule of thumb, I would say one should
usually go with L2 in practice.
What is the difference between Type I and Type II error?

Don’t think of this as high-level stuff; interviewers ask questions in these terms
just to check that you have covered all the bases and are on top of your game.

Type I error is a false positive, while Type II error is a false negative. A Type I error
means claiming something has happened when it hasn’t; for instance, telling a man
he is pregnant. A Type II error, on the other hand, means you claim nothing has
happened when in fact something has; for example, you tell a pregnant lady she
isn’t carrying a baby.

What is the difference between Probability and Likelihood?

Without going too deep into the technicalities: probability quantifies the prediction
of an outcome, while likelihood quantifies trust in a model. For instance, suppose
someone challenges us to a ‘profitable gambling game’. Probabilities will serve us
to compute things like the expected profile of our gains and losses. In contrast,
likelihood will serve us to quantify whether we trust those probabilities in the first
place; or whether we smell a rat.

What exactly is Deep Learning?

Most people don’t know this, but Machine Learning and Deep Learning are not two
different things; Deep Learning is a subset of Machine Learning. It mostly deals
with neural networks: how to use backpropagation and certain other principles
from neuroscience to more accurately model large sets of unlabeled data. In a
nutshell, Deep Learning represents an unsupervised learning algorithm that learns
data representations mainly through neural networks. Explore a little about neural
nets to answer deep learning interview questions effectively.

What’s the difference between a generative and discriminative model?

A discriminative model will learn the distinction between different categories of
data, while a generative model will learn the categories of data themselves.
Discriminative models will predominantly outperform generative models on
classification tasks.

What is Time Series Analysis/Forecasting?

A Machine Learning data-set is a collection of observations. For example,

 Observation 1
 Observation 2
 Observation 3

But, a Time series data-set is different. Time series adds an explicit order
dependence between observations: a time dimension. This additional dimension is
both a constraint and a structure that provides a source of additional information.

 Time 1, Observation
 Time 2, Observation
 Time 3, Observation

How would you handle an imbalanced data-set?

An imbalanced data-set is one where, for example, 90% of the data is in one class
and 10% in the other. This leads to problems such as having no predictive power
on the other category of data. Here are a few techniques to get over it:

 Obviously, collect more data to balance the classes
 Try a different algorithm (not going to work effectively on its own)
 Correct the imbalance in the data-set by resampling

Explain Pruning in Decision Trees.

Pruning is the removal of branches that have weak predictive power, in order to
reduce the complexity of the model and, in addition, increase the predictive
accuracy of the decision tree model. There are several flavors, including bottom-up
and top-down pruning, with approaches such as reduced error pruning and cost
complexity pruning.

In your opinion, which one is more important: model accuracy or model performance?

Questions like these test your grasp of Machine Learning model performance
and often look towards the details. There are models with higher accuracy that can
perform worse in predictive power; how does that make sense?

Well, it has everything to do with how model accuracy is only a subset of model
performance, and at that, a sometimes misleading one. For example, if you wanted
to detect fraud in a massive data-set with a sample of millions, a more accurate
model would most likely predict no fraud at all if only a tiny minority of cases were
fraud. However, this would be useless for a predictive model — a model designed
to find fraud that asserted there was no fraud at all! Questions like this help you
demonstrate that you understand model accuracy isn’t the be-all and end-all of
model performance.

What’s the F1 score?

It is a measure of a model’s performance. More technically, it is a weighted average
of the precision and recall of the model, with results tending towards 1 being the
best and towards 0 being the worst.

When should you use classification over regression?

Classification and regression differ in what they produce. Classification produces
discrete values while regression gives you continuous results. You would use
classification over regression, for example, when you wanted to know whether a
name was male or female rather than just how correlated it was with male and
female names.
What is a convex hull?

The convex hulls represent the outer boundaries of the two groups of data points.
Once the convex hulls are created, we get the maximum margin hyperplane (MMH),
which attempts to create the greatest separation between the two groups, as a
perpendicular bisector between the two convex hulls.

General Questions

Do you have research experience in machine learning?

Machine Learning is an emerging field and no one wants novice players on their team.
Most employers hiring for a Machine Learning position will look for your experience
in the field. Research papers, co-authored or supervised by leaders in the field,
can set you apart from the herd. Make sure you are ready with a summary and
justification of the work you have done in past years.

What are the last Machine Learning papers you read? Why do you think they were important?

As this field is evolving day by day, it is crucial to keep up with the latest scientific
literature to show that you are really into Machine Learning and not here just
because it is the latest buzzword. Some good books to start with include Deep
Learning by Ian Goodfellow.

How would you approach the “Netflix Prize” competition?

The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a
better collaborative filtering algorithm. The winning team, called BellKor, achieved a
10% improvement and used an ensemble of different methods to win. Some
familiarity with the case and its solution will help demonstrate you’ve paid attention
to machine learning for a while.
What’s your favorite algorithm, and can you explain it to me in less than a minute?

This type of question mainly tests your ability to communicate complex and
technical nuances with poise and to summarize quickly and efficiently. Make sure
you have a choice of algorithm which you can explain easily. Try to explain different
algorithms so simply and effectively that a five-year-old could grasp the basics.

Where do you usually source data-sets?

Questions of this type are the real tie-breakers. Anyone going in for an interview
must know the drill for questions like this. It is questions like this that truly
illustrate your interest in Machine Learning. See my post for a detailed answer on
where to find machine learning data-sets.

How do you think Google is training data for self-driving cars?

Questions like this check your understanding of current affairs in the industry and
how things work at a certain level. Google is currently using reCAPTCHA to source
labelled data on storefronts and traffic signs. They are also building on training
data collected by Sebastian Thrun at GoogleX.
Industry Specific Questions

How would you implement a recommendation system for our company’s users?

There will be a lot of questions like this, which will involve the application of
machine learning models to the company’s problems. You should definitely study
the company’s profile and its products before going in. In addition, factors such as
the company’s financials, the industry in which it operates, and who its users are
will help you get a clearer picture.

How can we use your machine learning skills to generate revenue?

This is a tricky question, I would say. The ideal answer would demonstrate
knowledge of what drives the business and how your skills could relate. For
example, if you were interviewing for Spotify, you could remark that your skills
at developing a better recommendation model would markedly increase user
retention, which would then increase revenue in the long run.

Practical/Programming Questions

How will you handle missing data?

One can find missing data in a data-set and either drop those rows or columns, or
decide to replace them with another value. In the Python library Pandas, there are
two useful functions that will be helpful: isnull() and dropna().

Describe a hash table.

A hash table is a data structure that produces an associative array. A key is mapped
to certain values through the use of a hash function. They are often used for tasks
such as database indexing.
Which data visualization libraries do you use, and why are they useful?

What’s important here is to define your views on how to properly visualize data
and your personal preferences when it comes to tools. Popular tools include R’s
ggplot, Python’s seaborn and matplotlib, and tools such as Plot.ly and Tableau.

Do you have experience with Spark or big data tools for machine learning?

Spark is the big data tool most in demand now, able to handle immense data-sets
with speed. Be honest if you don’t have experience with the tools demanded, but
also take a look at job descriptions and see what tools pop up: you’ll want to invest
in familiarizing yourself with them.

Considering the long list of machine learning algorithms, given a data set, how do you decide which
one to use?

This is one more tricky question. You should frame your answer around the type of
data involved: discrete, time series, or continuous.

Explain machine learning to me like a 5-year-old. (This checks the ability to explain complex
concepts in simple terms.)

There are a lot of definitions of Machine Learning out there. Which one do you
think is the easiest while still covering the entire meaning of Machine Learning?
Tell us in the comments and let other readers gain insight into how you think.

Q1. You are given a train data set having 1000 columns and 1 million rows. The data
set is based on a classification problem. Your manager has asked you to reduce the
dimension of this data so that model computation time can be reduced. Your machine
has memory constraints. What would you do? (You are free to make practical
assumptions.)

Answer: Processing high-dimensional data on a limited-memory machine is a strenuous
task, and your interviewer would be fully aware of that. The following are methods you can
use to tackle such a situation:

1. Since we have lower RAM, we should close all other applications in our machine,
including the web browser, so that most of the memory can be put to use.
2. We can randomly sample the data set. This means, we can create a smaller data set,
let’s say, having 1000 variables and 300000 rows and do the computations.
3. To reduce dimensionality, we can separate the numerical and categorical variables
and remove the correlated variables. For numerical variables, we’ll use correlation.
For categorical variables, we’ll use chi-square test.
4. Also, we can use PCA and pick the components which can explain the maximum
variance in the data set.
5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible
option.
6. Building a linear model using Stochastic Gradient Descent is also helpful.
7. We can also apply our business understanding to estimate which predictors can
impact the response variable. But this is an intuitive approach; failing to identify useful
predictors might result in a significant loss of information.

Note: For points 5 & 6, make sure you read about online learning algorithms & Stochastic
Gradient Descent. These are advanced methods.

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the
components?

Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference
between the variance captured by the components. This makes the components easier to
interpret. Not to forget, that’s the motive of doing PCA, where we aim to select fewer
components (than features) which can explain the maximum variance in the data set.
Rotation doesn’t change the relative location of the components; it only changes the
actual coordinates of the points.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select
a larger number of components to explain the variance in the data set.
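
A small scikit-learn sketch (the iris sample data is assumed) of the usual workflow: inspect the variance explained per component and keep only as many components as needed:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # scale features before PCA

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))   # variance captured by each component

# Keep just enough components to explain, say, 95% of the variance.
print(PCA(n_components=0.95).fit(X).n_components_)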

Know more: PCA


Q3. You are given a data set. The data set has missing values which spread along 1
standard deviation from the median. What percentage of data would remain
unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since the data is spread
around the median, let’s assume it’s a normal distribution. We know that in a normal
distribution ~68% of the data lies within 1 standard deviation of the mean (or mode, or
median, which coincide here), which leaves ~32% of the data outside that range. Therefore,
~32% of the data would remain unaffected by missing values.

Q4. You are given a data set on cancer detection. You’ve build a classification model
and achieved an accuracy of 96%. Why shouldn’t you be happy with your model
performance? What can you do about it?

Answer: If you have worked on enough data sets, you should deduce that cancer detection
results in imbalanced data. In an imbalanced data set, accuracy should not be used as a
measure of performance, because 96% (as given) might only be predicting the majority class
correctly, while our class of interest is the minority class (4%), which is the people who
actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should
use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and the F measure to
determine the class-wise performance of the classifier. If the minority class performance is
found to be poor, we can undertake the following steps:

1. We can use undersampling, oversampling or SMOTE to make the data balanced.
2. We can alter the prediction threshold value by doing probability calibration and
finding an optimal threshold using the AUC-ROC curve.
3. We can assign weights to classes such that the minority classes get a larger weight.
4. We can also use anomaly detection.

Know more: Imbalanced Classification

Q5. Why is naive Bayes so ‘naive’ ?

Answer: Naive Bayes is so ‘naive’ because it assumes that all of the features in a data set
are equally important and independent. As we know, these assumptions are rarely true in
real-world scenarios.

Q6. Explain prior probability, likelihood and marginal likelihood in the context of the
naive Bayes algorithm.

Answer: Prior probability is nothing but the proportion of the dependent (binary) variable in
the data set. It is the closest guess you can make about a class without any further information.
For example: in a data set, the dependent variable is binary (1 and 0). The proportion of 1
(spam) is 70% and of 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance
that any new email would be classified as spam.

Likelihood is the probability of classifying a given observation as 1 in the presence of some
other variable. For example: the probability that the word ‘FREE’ is used in a previous spam
message is a likelihood. Marginal likelihood is the probability that the word ‘FREE’ is used in
any message.

Q7. You are working on a time series data set. Your manager has asked you to build a
high accuracy model. You start with the decision tree algorithm, since you know it
works fairly well on all kinds of data. Later, you tried a time series regression model
and got higher accuracy than decision tree model. Can this happen? Why?

Answer: Time series data is known to possess linearity. On the other hand, a decision tree
algorithm is known to work best at detecting non-linear interactions. The reason the decision
tree failed to provide robust predictions is that it couldn’t map the linear relationship as well
as a regression model did. Therefore, we learned that a linear regression model can provide
robust predictions provided the data set satisfies its linearity assumptions.

Q8. You are assigned a new project which involves helping a food delivery company
save more money. The problem is, the company’s delivery team isn’t able to deliver food
on time. As a result, their customers get unhappy. And, to keep them happy, they end
up delivering food for free. Which machine learning algorithm can save them?

Answer: You might have started hopping through the list of ML algorithms in your mind. But,
wait! Such questions are asked to test your machine learning fundamentals.

This is not a machine learning problem; this is a route optimization problem. A machine
learning problem consists of three things:

1. There exists a pattern.
2. You cannot solve it mathematically (even by writing exponential equations).
3. You have data on it.

Always look for these three factors to decide if machine learning is a tool to solve a particular
problem.
Q9. You came to know that your model is suffering from low bias and high variance.
Which algorithm should you use to tackle it? Why?

Answer: Low bias occurs when the model’s predicted values are near the actual values. In
other words, the model becomes flexible enough to mimic the training data distribution. While
that sounds like a great achievement, do not forget that a flexible model has no generalization
capabilities. It means that when this model is tested on unseen data, it gives disappointing
results.

In such situations, we can use a bagging algorithm (like random forest) to tackle the high
variance problem. Bagging algorithms divide a data set into subsets made with repeated
randomized sampling. Then, these samples are used to generate a set of models using a single
learning algorithm. Later, the model predictions are combined using voting (classification) or
averaging (regression).

Also, to combat high variance, we can:

1. Use a regularization technique, where higher model coefficients get penalized,
hence lowering model complexity.
2. Use the top n features from a variable importance chart. Maybe, with all the variables in
the data set, the algorithm is having difficulty finding the meaningful signal.

Q10. You are given a data set. The data set contains many variables, some of which
are highly correlated and you know about it. Your manager has asked you to run PCA.
Would you remove correlated variables first? Why?

Answer: Chances are, you might be tempted to say no, but that would be incorrect.
Discarding correlated variables has a substantial effect on PCA because, in the presence of
correlated variables, the variance explained by a particular component gets inflated.

For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on
this data set, the first principal component would exhibit twice the variance that it would
exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more
importance on those variables, which is misleading.

Q11. After spending several hours, you are now anxious to build a high accuracy
model. As a result, you build 5 GBM models, thinking a boosting algorithm would do
the magic. Unfortunately, none of the models could perform better than the benchmark
score. Finally, you decided to combine those models. Though ensembled models are
known to return high accuracy, you are unfortunate. What did you miss?

Answer: As we know, ensemble learners are based on the idea of combining weak learners
to create strong learners. But these learners provide superior results only when the combined
models are uncorrelated. Since we used 5 GBM models and got no accuracy improvement,
it suggests that the models are correlated. The problem with correlated models is that all the
models provide the same information.

For example: if model 1 has classified User1122 as 1, there is a high chance that model 2 and
model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners
are built on the premise of combining weak uncorrelated models to obtain better predictions.

Q12. How is kNN different from kmeans clustering?

Answer: Don’t get misled by the ‘k’ in their names. You should know that the fundamental
difference between these two algorithms is that kmeans is unsupervised in nature and kNN is
supervised in nature. kmeans is a clustering algorithm; kNN is a classification (or regression)
algorithm.

The kmeans algorithm partitions a data set into clusters such that each cluster formed is
homogeneous and the points in each cluster are close to each other. The algorithm tries to
maintain enough separability between these clusters. Due to its unsupervised nature, the
clusters have no labels.

The kNN algorithm tries to classify an unlabeled observation based on its k (which can be any
number) surrounding neighbors. It is also known as a lazy learner because it involves minimal
training of the model. Hence, it doesn’t use the training data to make generalizations on an
unseen data set.

Q13. How is True Positive Rate and Recall related? Write the equation.

Answer: True Positive Rate = Recall. Yes, they are equal, with the formula TP / (TP + FN).

Know more: Evaluation Metrics

Q14. You have built a multiple regression model. Your model R² isn’t as good as you
wanted. For improvement, you remove the intercept term, and your model R² becomes 0.8
from 0.3. Is it possible? How?

Answer: Yes, it is possible. We need to understand the significance of the intercept term in a
regression model. The intercept term represents the model prediction without any independent
variable, i.e. the mean prediction. The formula is R² = 1 - ∑(y - ŷ)² / ∑(y - ȳ)², where ŷ is the
predicted value and ȳ is the mean of y.

When the intercept term is present, the R² value evaluates your model with respect to the
mean model. In the absence of the intercept term, the model has no such baseline: the
denominator becomes ∑y² instead of ∑(y - ȳ)², which is much larger, so the ratio
∑(y - ŷ)² / ∑y² becomes smaller than it should be, resulting in a higher R².
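
A sketch of this effect with statsmodels (toy data assumed); note that statsmodels reports the uncentered R² when the model has no constant term, which is the behaviour described above:

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 100)
y = 50 + 0.5 * x + rng.normal(0, 5, 100)   # large intercept, weak slope

with_const = sm.OLS(y, sm.add_constant(x)).fit()   # centred R-squared
no_const = sm.OLS(y, x).fit()                      # uncentred R-squared (no intercept)
print(round(with_const.rsquared, 2), round(no_const.rsquared, 2))
# The no-intercept R-squared comes out far higher, even though the model is worse.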

Q15. After analyzing the model, your manager has informed that your regression
model is suffering from multicollinearity. How would you check whether this is true? Without
losing any information, can you still build a better model?

Answer: To check for multicollinearity, we can create a correlation matrix to identify & remove
variables having correlation above 75% (deciding the threshold is subjective). In addition, we
can calculate VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF
value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious
multicollinearity. Also, we can use tolerance as an indicator of multicollinearity.

But removing correlated variables might lead to loss of information. In order to retain those
variables, we can use penalized regression models like ridge or lasso regression. Also, we
can add some random noise to the correlated variables so that they become different from
each other. But adding noise might affect the prediction accuracy, so this approach should
be used carefully.
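
A minimal statsmodels sketch (toy data assumed) of the VIF check:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": 0.95 * x1 + rng.normal(scale=0.3, size=200),   # nearly collinear with x1
                   "x3": rng.normal(size=200)})

X = add_constant(df)   # VIF is usually computed with an intercept column included
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))
# x1 and x2 show inflated VIFs; x3 stays close to 1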

Know more: Regression

Q16. When is Ridge regression favorable over Lasso regression?

Answer: You can quote ISLR’s authors Hastie and Tibshirani, who assert that in the presence
of a few variables with medium / large sized effects, you should use lasso regression, and in
the presence of many variables with small / medium sized effects, you should use ridge
regression.

Conceptually, we can say that lasso regression (L1) does both variable selection and parameter
shrinkage, whereas ridge regression only does parameter shrinkage and ends up including
all the coefficients in the model. In the presence of correlated variables, ridge regression might
be the preferred choice. Also, ridge regression works best in situations where the least squares
estimates have higher variance. Therefore, it depends on our model objective.

Know more: Ridge and Lasso Regression

Q17. Rise in global average temperature led to decrease in number of pirates around
the world. Does that mean that decrease in number of pirates caused the climate
change?
Answer: After reading this question, you should have understood that this is a classic case
of “causation vs. correlation”. No, we can’t conclude that the decrease in the number of pirates
caused the climate change, because there might be other factors (lurking or confounding
variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of
pirates, but based on this information we can’t say that pirates died because of the rise in
global average temperature.

Know more: Causation and Correlation

Q18. While working on a data set, how do you select important variables? Explain
your methods.

Answer: Following are the methods of variable selection you can use:

1. Remove the correlated variables prior to selecting important variables
2. Use linear regression and select variables based on p values
3. Use Forward Selection, Backward Selection, Stepwise Selection
4. Use Random Forest, Xgboost and plot variable importance chart
5. Use Lasso Regression
6. Measure information gain for the available set of features and select top n features
accordingly.

Q19. What is the difference between covariance and correlation?

Answer: Correlation is the standardized form of covariance.

Covariances are difficult to compare. For example: if we calculate the covariance of salary
($) and age (years), we’ll get a number that can’t be compared across variables because of
their unequal scales. To combat such a situation, we calculate correlation, which gives a value
between -1 and 1 irrespective of the variables’ respective scales.
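
A quick NumPy sketch (made-up salary and age values) showing the scale problem and how correlation removes it:

import numpy as np

age = np.array([25, 32, 41, 50, 58], dtype=float)                      # years
salary = np.array([40000, 52000, 63000, 71000, 86000], dtype=float)    # dollars

print(np.cov(age, salary)[0, 1])       # covariance: a large, scale-dependent number
print(np.corrcoef(age, salary)[0, 1])  # correlation: the same relationship on a -1..1 scale

# Rescaling salary to thousands changes the covariance but not the correlation.
print(np.cov(age, salary / 1000)[0, 1])
print(np.corrcoef(age, salary / 1000)[0, 1])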

Q20. Is it possible to capture the correlation between a continuous and a
categorical variable? If yes, how?

Answer: Yes, we can use the ANCOVA (analysis of covariance) technique to capture the
association between continuous and categorical variables.
Q21. Both being tree-based algorithms, how is random forest different from the gradient
boosting algorithm (GBM)?

Answer: The fundamental difference is, random forest uses bagging technique to make
predictions. GBM uses boosting techniques to make predictions.

In the bagging technique, a data set is divided into n samples using randomized sampling.
Then, using a single learning algorithm, a model is built on all the samples. Later, the resulting
predictions are combined using voting or averaging. Bagging is done in parallel. In boosting,
after the first round of predictions, the algorithm weighs misclassified predictions higher, so
that they can be corrected in the succeeding round. This sequential process of giving higher
weights to misclassified predictions continues until a stopping criterion is reached.

Random forest improves model accuracy by reducing variance (mainly). The trees grown are
uncorrelated to maximize the decrease in variance. On the other hand, GBM improves
accuracy by reducing both bias and variance in a model.

Know more: Tree based modeling

Q22. Running a binary classification tree algorithm is the easy part. Do you know how
does a tree splitting takes place i.e. how does the tree decide which variable to split at
the root node and succeeding nodes?

Answer: A classification tree makes decisions based on the Gini index and node entropy. In
simple words, the tree algorithm finds the best possible feature which can divide the data
set into the purest possible child nodes.

The Gini index says that if we select two items from a population at random, then they must be
of the same class, and the probability of this is 1 if the population is pure. We can calculate Gini
as follows:

1. Calculate Gini for sub-nodes, using the formula sum of squares of the probabilities of success
and failure (p^2 + q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.

Entropy is the measure of impurity, given (for a binary class) by:

Entropy = -p*log2(p) - q*log2(q)

Here p and q are the probabilities of success and failure respectively in that node. Entropy is
zero when a node is homogeneous. It is at its maximum when both classes are present in a node
at 50% – 50%. Lower entropy is desirable.
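
A small sketch computing both node-impurity measures for a few assumed class probabilities:

import math

def gini_score(p):
    """The p^2 + q^2 'purity' score used above: 1.0 for a pure node, 0.5 at 50/50."""
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p):
    """Binary entropy -p*log2(p) - q*log2(q): 0 for a pure node, 1.0 at 50/50."""
    q = 1 - p
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)

for p in (1.0, 0.9, 0.5):   # probability of the 'success' class in a node
    print(p, round(gini_score(p), 3), round(entropy(p), 3))
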
Q23. You’ve built a random forest model with 10000 trees. You got delighted after
getting training error as 0.00. But, the validation error is 34.23. What is going on?
Haven’t you trained your model perfectly?

Answer: The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data patterns to such an extent that those patterns are not available in the unseen data. Hence, when this classifier was run on an unseen sample, it couldn't find those patterns and returned predictions with a much higher error. In a random forest, this happens when we use a larger number of trees than necessary. Hence, to avoid this situation, we should tune the number of trees using cross validation.

Q24. You've got a data set to work with having p (no. of variables) > n (no. of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?

Answer: In such high dimensional data sets, we can’t use classical regression techniques,
since their assumptions tend to fail. When p > n, we can no longer calculate a unique least
square coefficient estimate, the variances become infinite, so OLS cannot be used at all.

To combat this situation, we can use penalized regression methods like lasso, LARS, ridge
which can shrink the coefficients to reduce variance. Precisely, ridge regression works best
in situations where the least square estimates have higher variance.

Other methods include subset regression and forward stepwise regression.
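A minimal sketch of the idea with scikit-learn, on a made-up data set where p > n (50 observations, 200 predictors):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
n, p = 50, 200                       # p > n: OLS has no unique solution here
X = rng.randn(n, p)
y = X[:, :5] @ np.ones(5) + rng.randn(n) * 0.1

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # shrinks and sets many coefficients exactly to zero
print((lasso.coef_ != 0).sum(), "non-zero coefficients selected by lasso")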

Q25. What is convex hull ? (Hint: Think SVM)

Answer: In the case of linearly separable data, a convex hull represents the outer boundaries of the two groups of data points. Once the convex hulls are created, we get the maximum margin hyperplane (MMH) as a perpendicular bisector between the two convex hulls. The MMH is the line which attempts to create the greatest separation between the two groups.
Q26. We know that one hot encoding increases the dimensionality of a data set. But, label encoding doesn't. How?

Answer: Don’t get baffled at this question. It’s a simple question asking the difference
between the two.

Using one hot encoding, the dimensionality (i.e. the number of features) of a data set increases because it creates a new variable for each level present in a categorical variable. For example, let's say we have a variable 'color' with 3 levels, namely Red, Blue and Green. One hot encoding the 'color' variable will generate three new variables, Color.Red, Color.Blue and Color.Green, each containing 0 and 1 values.

In label encoding, the levels of a categorical variable get encoded as numbers (0 and 1 for a binary variable), so no new variable is created. Label encoding is mostly used for binary variables.
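A short pandas sketch of the same 'color' example (illustrative only):

import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

one_hot = pd.get_dummies(df, columns=["color"])    # adds color_Blue, color_Green, color_Red
labels = df["color"].astype("category").cat.codes  # a single integer column, no new variables

print(one_hot.shape[1], "columns after one hot encoding")   # 3
print(labels.tolist())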

Q27. What cross validation technique would you use on time series data set? Is it k-
fold or LOOCV?

Answer: Neither.

In a time series problem, k-fold can be troublesome because there might be some pattern in year 4 or 5 which is not present in year 3. Resampling the data set will break up these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy with 5 folds as shown below:

 fold 1 : training [1], test [2]


 fold 2 : training [1 2], test [3]
 fold 3 : training [1 2 3], test [4]
 fold 4 : training [1 2 3 4], test [5]
 fold 5 : training [1 2 3 4 5], test [6]

where 1,2,3,4,5,6 represents “year”.
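The same forward-chaining folds can be produced with scikit-learn's TimeSeriesSplit; a small sketch on six hypothetical yearly observations:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

years = np.arange(1, 7)    # "years" 1..6
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(years), 1):
    print(f"fold {fold}: training {years[train_idx]}, test {years[test_idx]}")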

Q28. You are given a data set consisting of variables having more than 30% missing values. Let's say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

Answer: We can deal with them in the following ways:


1. Assign a unique category to the missing values; who knows, the missing values might decipher some trend.
2. We can simply remove them.
3. Or, we can sensibly check their distribution against the target variable, and if we find any pattern we'll keep those missing values and assign them a new category while removing the others.

Q29. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

Answer: The basic idea for this kind of recommendation engine comes from collaborative
filtering.

The Collaborative Filtering algorithm considers “User Behavior” for recommending items. It exploits the behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users' behaviour and preferences over the items are used to recommend items to new users. In this case, features of the items are not known.

Know more: Recommender System

Q30. What do you understand by Type I vs Type II error ?

Answer: Type I error is committed when the null hypothesis is true and we reject it, also
known as a ‘False Positive’. Type II error is committed when the null hypothesis is false and
we accept it, also known as ‘False Negative’.

In the context of confusion matrix, we can say Type I error occurs when we classify a value
as positive (1) when it is actually negative (0). Type II error occurs when we classify a value
as negative (0) when it is actually positive(1).

Q31. You are working on a classification problem. For validation purposes, you’ve
randomly sampled the training data set into train and validation. You are confident that
your model will work incredibly well on unseen data since your validation accuracy is
high. However, you get shocked after getting poor test accuracy. What went wrong?

Answer: In case of a classification problem, we should always use stratified sampling instead of random sampling. Random sampling doesn't take into consideration the proportion of target classes. On the contrary, stratified sampling helps maintain the distribution of the target variable in the resulting samples as well.
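A minimal sketch of the difference, using scikit-learn's train_test_split with and without the stratify argument on an imbalanced, made-up target:

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)    # 10% positive class
X = np.arange(100).reshape(-1, 1)

_, _, _, y_random = train_test_split(X, y, test_size=0.3, random_state=1)
_, _, _, y_strat = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

print("positives in random validation split:    ", y_random.mean())
print("positives in stratified validation split:", y_strat.mean())   # stays close to 0.10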
Q32. You have been asked to evaluate a regression model based on R², adjusted R²
and tolerance. What will be your criteria?

Answer: Tolerance (1 / VIF) is used as an indicator of multicollinearity. It indicates the percent of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance are desirable.

We will consider adjusted R² as opposed to R² to evaluate model fit, because R² increases irrespective of improvement in prediction accuracy as we add more variables, whereas adjusted R² only increases if an additional variable improves the accuracy of the model and otherwise stays the same. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, compared to a stock market data set where a lower adjusted R² implies that the model is not good.

Q33. In k-means or kNN, we use euclidean distance to calculate the distance between
nearest neighbors. Why not manhattan distance ?

Answer: We don't use manhattan distance because it calculates distance only horizontally or vertically, so it has dimension restrictions. On the other hand, the euclidean metric can be used in any space to calculate distance. Since the data points can be present in any dimension, euclidean distance is a more viable option.

Example: Think of a chess board; the movement made by a bishop or a rook is calculated by manhattan distance because of their respective vertical & horizontal movements.
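For reference, a tiny NumPy comparison of the two metrics (illustrative only):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # straight-line distance
manhattan = np.abs(a - b).sum()             # sum of axis-wise moves
print(euclidean, manhattan)                 # 5.0 and 7.0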

Q34. Explain machine learning to me like a 5 year old.

Answer: It’s simple. It’s just like how babies learn to walk. Every time they fall down, they
learn (unconsciously) & realize that their legs should be straight and not in a bend position.
The next time they fall down, they feel pain. They cry. But, they learn ‘not to stand like that
again’. In order to avoid that pain, they try harder. To succeed, they even seek support from
the door or wall or anything near them, which helps them stand firm.

This is how a machine works & develops intuition from its environment.

Note: The interviewer is only trying to test whether you have the ability to explain complex concepts in simple terms.
Q35. I know that a linear regression model is generally evaluated using Adjusted R² or
F value. How would you evaluate a logistic regression model?

Answer: We can use the following methods:

1. Since logistic regression is used to predict probabilities, we can use AUC-ROC curve
along with confusion matrix to determine its performance.
2. Also, the analogous metric of adjusted R² in logistic regression is AIC. AIC is a measure of fit which penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.
3. Null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates the response predicted by a model on adding independent variables; the lower the value, the better the model.

Know more: Logistic Regression
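A minimal evaluation sketch with scikit-learn (AUC-ROC plus a confusion matrix, on a made-up binary classification set):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

print("AUC-ROC:", roc_auc_score(y_te, proba))
print(confusion_matrix(y_te, model.predict(X_te)))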

Q36. Considering the long list of machine learning algorithm, given a data set, how do
you decide which one to use?

Answer: You should say that the choice of machine learning algorithm solely depends on the type of data. If you are given a data set which exhibits linearity, then linear regression would be the best algorithm to use. If you are given images or audio to work on, then a neural network would help you build a robust model.

If the data comprises non-linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is to build a model which can be deployed, then we'll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

In short, there is no one master algorithm for all situations. We must be scrupulous enough
to understand which algorithm to use.

Q37. Do you suggest that treating a categorical variable as continuous variable would
result in a better predictive model?

Answer: For better predictions, a categorical variable can be treated as a continuous variable only when the variable is ordinal in nature.

Q38. When does regularization becomes necessary in Machine Learning?


Answer: Regularization becomes necessary when the model begins to overfit. This technique introduces a cost term for bringing in more features into the objective function. Hence, it tries to push the coefficients for many variables towards zero and hence reduce the cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).

Q39. What do you understand by Bias Variance trade off?

Answer: The error emerging from any model can be broken down mathematically into three components: bias error, variance error and irreducible error.

Bias error is useful to quantify how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing important trends. Variance, on the other hand, quantifies how much the predictions made on the same observation differ from each other. A high variance model will over-fit on your training population and perform badly on any observation beyond training.

Q40. OLS is to linear regression what maximum likelihood is to logistic regression. Explain the statement.

Answer: OLS and Maximum Likelihood are the methods used by the respective regression models to approximate the unknown parameter (coefficient) values. In simple words,

Ordinary least squares (OLS) is a method used in linear regression which approximates the parameters by minimizing the distance between actual and predicted values. Maximum Likelihood chooses the values of the parameters which maximize the likelihood of producing the observed data.

 What is data normalization and why do we need it? I felt this one would be important to highlight. Data normalization is a very important preprocessing step, used to rescale values to fit into a specific range to assure better convergence during backpropagation. In general, it boils down to subtracting the mean of each feature and dividing by its standard deviation. If we don't do this, then some of the features (those with high magnitude) will be weighted more in the cost function (if a higher-magnitude feature changes by 1%, then that change is pretty big, but for smaller features it's quite insignificant). Data normalization makes all features weighted equally.
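A tiny NumPy sketch of the standardization described above (subtract the mean, divide by the standard deviation; made-up numbers):

import numpy as np

X = np.array([[1000.0, 0.5],
              [1500.0, 0.3],
              [2000.0, 0.9]])          # one high-magnitude and one small feature

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0))             # approximately 0 for each feature
print(X_norm.std(axis=0))              # 1 for each feature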
 Explain dimensionality reduction, where it's used, and its benefits? Dimensionality reduction is the process of
reducing the number of feature variables under consideration
by obtaining a set of principal variables which are basically the
important features. Importance of a feature depends on how
much the feature variable contributes to the information
representation of the data and depends on which technique you
decide to use. Deciding which technique to use comes down to
trial-and-error and preference. It’s common to start with a
linear technique and move to non-linear techniques when
results suggest inadequate fit. Benefits of dimensionality
reduction for a data set may be: (1) Reduce the storage space
needed (2) Speed up computation (for example in machine
learning algorithms), less dimensions mean less computing,
also less dimensions can allow usage of algorithms unfit for a
large number of dimensions (3) Remove redundant features,
for example no point in storing a terrain’s size in both sq
meters and sq miles (maybe data gathering was flawed) (4)
Reducing a data’s dimension to 2D or 3D may allow us to plot
and visualize it, maybe observe patterns, give us insights (5)
Too many features or too complex a model can lead to
overfitting.
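As a small illustration (not from the original answer), reducing a 4-feature data set to 2 principal components with scikit-learn's PCA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)   # 4 features -> 2 principal components
print(X_2d.shape)                             # (150, 2), easy to plot and inspect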
 How do you handle missing or corrupted data in a
dataset? You could find missing/corrupted data in a dataset
and either drop those rows or columns, or decide to replace
them with another value. In Pandas, there are two very useful
methods: isnull() and dropna() that will help you find columns
of data with missing or corrupted data and drop those values. If
you want to fill the invalid values with a placeholder value (for
example, 0), you could use the fillna() method.
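A quick sketch of those pandas calls on a tiny frame with missing values (illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 60000, np.nan]})

print(df.isnull().sum())   # count missing values per column
print(df.dropna())         # drop rows containing missing values
print(df.fillna(0))        # or replace them with a placeholder value such as 0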
 Explain this clustering algorithm? I wrote a popular article, The 5 Clustering Algorithms Data Scientists Need to Know, explaining all of them in detail with some great visualizations.
 How would you go about doing an Exploratory Data
Analysis (EDA)? The goal of an EDA is to gather some
insights from the data before applying your predictive model i.e
gain some information. Basically, you want to do your EDA in
a coarse to fine manner. We start by gaining some high-level
global insights. Check out some imbalanced classes. Look at
mean and variance of each class. Check out the first few rows to
see what it’s all about. Run a pandas df.info() to see which
features are continuous, categorical, their type (int, float,
string). Next, drop unnecessary columns that won’t be useful in
analysis and prediction. These can simply be columns that look
useless, one’s where many rows have the same value (i.e it
doesn’t give us much information), or it’s missing a lot of
values. We can also fill in missing values with the most
common value in that column, or the median. Now we can start
making some basic visualizations. Start with high-level stuff.
Do some bar plots for features that are categorical and have a
small number of groups. Bar plots of the final classes. Look at
the most “general features”. Create some visualizations about
these individual features to try and gain some basic insights.
Now we can start to get more specific. Create visualizations
between features, two or three at a time. How are features
related to each other? You can also do a PCA to see which
features contain the most information. Group some features
together as well to see their relationships. For example, what
happens to the classes when A = 0 and B = 0? How about A = 1
and B = 0? Compare different features. For example, if feature
A can be either “Female” or “Male” then we can plot feature A
against which cabin they stayed in to see if Males and Females
stay in different cabins. Beyond bar, scatter, and other basic
plots, we can do a PDF/CDF, overlayed plots, etc. Look at some
statistics like distribution, p-value, etc. Finally it’s time to build
the ML model. Start with easier stuff like Naive Bayes and
Linear Regression. If you see that those suck or the data is
highly non-linear, go with polynomial regression, decision
trees, or SVMs. The features can be selected based on their
importance from the EDA. If you have lots of data you can use
a Neural Network. Then check the ROC curve, precision, and recall.
 How do you know which Machine Learning model you
should use? While one should always keep the “no free lunch
theorem” in mind, there are some general guidelines. I wrote
an article on how to select the proper regression model here.
This cheatsheet is also fantastic!
 Why do we use convolutions for images rather than
just FC layers? This one was pretty interesting since it’s not
something companies usually ask. As you would expect, I got
this question from a company focused on Computer Vision.
This answer has 2 parts to it. Firstly, convolutions preserve,
encode, and actually use the spatial information from the
image. If we used only FC layers we would have no relative
spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.
 What makes CNNs translation invariant? As explained
above, each convolution kernel acts as its own filter/feature
detector. So let’s say you’re doing object detection, it doesn’t
matter where in the image the object is since we’re going to
apply the convolution in a sliding window fashion across the
entire image anyways.
 Why do we have max-pooling in classification
CNNs? Again as you would expect this is for a role in
Computer Vision. Max-pooling in a CNN allows you to reduce
computation since your feature maps are smaller after the
pooling. You don’t lose too much semantic information since
you’re taking the maximum activation. There’s also a theory
that max-pooling contributes a bit to giving CNNs more translation invariance. Check out this great video from
Andrew Ng on the benefits of max-pooling.
 Why do segmentation CNNs typically have an encoder-
decoder style / structure? The encoder CNN can basically
be thought of as a feature extraction network, while the
decoder uses that information to predict the image segments by
“decoding” the features and upscaling to the original image
size.
 What is the significance of Residual Networks? The
main thing that residual connections did was allow for direct
feature access from previous layers. This makes information
propagation throughout the network much easier. One very
interesting paper about this shows how using local skip
connections gives the network a type of ensemble multi-path
structure, giving features multiple paths to propagate
throughout the network.
 What is batch normalization and why does it
work? Training Deep Neural Networks is complicated by the
fact that the distribution of each layer’s inputs changes during
training, as the parameters of the previous layers change. The
idea is then to normalize the inputs of each layer in such a way
that they have a mean output activation of zero and standard
deviation of one. This is done for each individual mini-batch at
each layer i.e compute the mean and variance of that mini-
batch alone, then normalize. This is analogous to how the
inputs to networks are standardized. How does this help? We
know that normalizing the inputs to a network helps it learn.
But a network is just a series of layers, where the output of one
layer becomes the input to the next. That means we can think
of any layer in a neural network as the first layer of a smaller
subsequent network. Thought of as a series of neural networks
feeding into each other, we normalize the output of one layer
before applying the activation function, and then feed it into
the following layer (sub-network).
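A bare-bones NumPy sketch of that per-mini-batch computation (the learnable scale and shift parameters gamma and beta are simplified to constants here):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature of the mini-batch to zero mean and unit variance
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 64) * 5 + 3            # mini-batch of 32 examples, 64 features
out = batch_norm(batch)
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # roughly 0 and 1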
 How would you handle an imbalanced dataset? I have
an article about this! Check out #3 :)
 Why would you use many small convolutional kernels
such as 3x3 rather than a few large ones? This is very
well explained in the VGGNet paper. There are 2 reasons: First,
you can use several smaller kernels rather than few large ones
to get the same receptive field and capture more spatial
context, but with the smaller kernels you are using fewer parameters and computations. Secondly, because with smaller
kernels you will be using more filters, you’ll be able to use more
activation functions and thus have a more discriminative
mapping function being learned by your CNN.
 Do you have any other projects that would be related
here? Here you’ll really draw connections between your
research and their business. Is there anything you did or any
skills you learned that could possibly connect back to their
business or the role you are applying for? It doesn’t have to be
100% exact, just somehow related such that you can show that
you will be able to directly add lots of value.
 Explain your current masters research? What
worked? What didn’t? Future directions? Same as the
last question!
Conclusion
There you have it! All of the interview questions I got when applying for roles in Data Science and Machine Learning. I hope you
enjoyed this post and learned something new and useful! If you
did, feel free to hit the clap button.
1: What is machine learning?
In answering this question, try to show your understanding of the broad applications
of machine learning, as well as how it fits into AI. Put it into your own words, but
convey your understanding that machine learning is a form of AI that automates
data analysis to enable computers to learn and adapt through experience to do
specific tasks without explicit programming.
2: What is your training in machine learning and what types of hands-on
experience do you have?
Your answer to this question will depend on your training in machine learning. Be
sure to emphasize any direct projects you’ve completed as part of your education.
Don’t fail to mention any additional experience that you have including certifications
and how they have prepared you for your role in the machine learning field.

3: What is deep learning?


This might or might not apply to the job you’re going after, but your answer will help
to show you know more than just the technical aspects of machine learning. Deep
learning is a subset of machine learning. It refers to using multi-layered neural
networks to process data in increasingly complex ways, enabling the software to
train itself to perform tasks like speech and image recognition through exposure to
these vast amounts of data. Thus the machine undergoes continual improvement
in the ability to recognize and process information. Layers of neural networks stacked on top of each other for use in deep learning are called deep neural networks.
4: How do deductive and inductive machine learning differ?
Deductive machine learning starts with a conclusion, then learns by deducing what
is right or wrong about that conclusion. Inductive machine learning starts with
examples from which to draw conclusions.
5: How do you choose an algorithm for a classification problem?
The answer depends on the degree of accuracy needed and the size of the training
set. If you have a small training set, you can use a low variance/high bias classifier.
If your training set is large, you will want to choose a high variance/low bias
classifier.
6: How do bias and variance play out in machine learning?
Both bias and variance are errors. Bias is an error due to flawed assumptions in
the learning algorithm. Variance is an error resulting from too much complexity in
the learning algorithm.
7: What are some methods of reducing dimensionality?
You can reduce dimensionality by combining features with feature engineering,
removing collinear features, or using algorithmic dimensionality reduction.
8: How do classification and regression differ?
Classification predicts group or class membership. Regression involves predicting
a response. Classification is the better technique when you need a more definite answer.
9: What is supervised versus unsupervised learning?
Supervised learning is a process of machine learning in which outputs are fed back
into a computer for the software to learn from for more accurate results the next
time. With supervised learning, the “machine” receives initial training to start. In
contrast, unsupervised learning means a computer will learn without initial training.
10: What is kernel SVM?
Kernel SVM is the abbreviated version of kernel support vector machine. Kernel
methods are a class of algorithms for pattern analysis and the most common one
is the kernel SVM.
11. What is decision tree classification?
A decision tree builds classification (or regression) models as a tree structure, with
datasets broken up into ever smaller subsets while developing the decision tree,
literally in a tree-like way with branches and nodes. Decision trees can handle both
categorical and numerical data.
12: What is a recommendation system?
Anyone who has used Spotify or shopped at Amazon will recognize a
recommendation system: It’s an information filtering system that predicts what a
user might want to hear or see based on choice patterns provided by the user.

1) What is Machine learning?

Machine learning is a branch of computer science which deals with system programming in order to automatically learn and improve with experience. For example, robots are programmed so that they can perform tasks based on the data they gather from sensors; the system automatically learns programs from data.

2) Mention the difference between Data Mining and Machine learning?

Machine learning relates to the study, design and development of the algorithms that give computers the capability to learn without being explicitly programmed. Data mining, on the other hand, can be defined as the process of extracting knowledge or unknown interesting patterns from unstructured data. During this process, machine learning algorithms are used.
3) What is ‘Overfitting’ in Machine learning?

In machine learning, 'overfitting' occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is normally observed when a model is excessively complex, because of having too many parameters with respect to the number of training data points. A model which has been overfit exhibits poor performance.

4) Why overfitting happens?

The possibility of overfitting exists because the criterion used for training the model is not the same as the criterion used to judge the efficacy of the model.

5) How can you avoid overfitting ?


Overfitting can be avoided by using a lot of data; overfitting tends to happen when you have a small dataset and you try to learn from it. But if you have a small database and are forced to build a model based on that, you can use a technique known as cross validation. In this method, the dataset is split into two sections, a testing and a training dataset: the testing dataset only tests the model, while the training dataset is used to build the model.

In this technique, a model is usually given a dataset of known data on which training (the training data set) is run and a dataset of unknown data against which the model is tested. The idea of cross validation is to define a dataset to "test" the model in the training phase.

6) What is inductive machine learning?

Inductive machine learning involves the process of learning by examples, where a system, from a set of observed instances, tries to induce a general rule.

7) What are the five popular algorithms of Machine Learning?

a) Decision Trees

b) Neural Networks (back propagation)

c) Probabilistic networks

d) Nearest Neighbor

e) Support vector machines

8) What are the different Algorithm techniques in Machine Learning?

The different types of techniques in Machine Learning are

a) Supervised Learning

b) Unsupervised Learning

c) Semi-supervised Learning
d) Reinforcement Learning

e) Transduction

f) Learning to Learn

9) What are the three stages to build the hypotheses or model in machine learning?

a) Model building

b) Model testing

c) Applying the model

10) What is the standard approach to supervised learning?

The standard approach to supervised learning is to split the set of example into the training
set and the test.

11) What is ‘Training set’ and ‘Test set’?

In various areas of information science like machine learning, the set of data used to discover potentially predictive relationships is known as the 'training set'. The training set is the set of examples given to the learner, while the test set is used to test the accuracy of the hypotheses generated by the learner; it is the set of examples held back from the learner. The training set is distinct from the test set.

12) List down various approaches for machine learning?

The different approaches in Machine Learning are

a) Concept Vs Classification Learning

b) Symbolic Vs Statistical Learning

c) Inductive Vs Analytical Learning

13) What is not Machine Learning?

a) Artificial Intelligence

b) Rule based inference

14) Explain what is the function of ‘Unsupervised Learning’?

a) Find clusters of the data

b) Find low-dimensional representations of the data

c) Find interesting directions in data

d) Interesting coordinates and correlations


e) Find novel observations/ database cleaning

15) Explain what is the function of ‘Supervised Learning’?

a) Classifications

b) Speech recognition

c) Regression

d) Predict time series

e) Annotate strings

16) What is algorithm independent machine learning?

Machine learning where the mathematical foundations are independent of any particular classifier or learning algorithm is referred to as algorithm independent machine learning.

17) What is the difference between artificial learning and machine learning?

Designing and developing algorithms according to behaviours based on empirical data is known as Machine Learning. Artificial intelligence, in addition to machine learning, also covers other aspects like knowledge representation, natural language processing, planning, robotics etc.

18) What is classifier in machine learning?

A classifier in Machine Learning is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.

19) What are the advantages of Naive Bayes?

A Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. A limitation, however, is that it can't learn interactions between features.

20) In what areas Pattern Recognition is used?

Pattern Recognition can be used in

a) Computer Vision

b) Speech Recognition

c) Data Mining

d) Statistics

e) Information Retrieval

f) Bio-Informatics
21) What is Genetic Programming?

Genetic programming is one of the two techniques used in machine learning. The model is based on testing and selecting the best choice among a set of results.

22) What is Inductive Logic Programming in Machine Learning?

Inductive Logic Programming (ILP) is a subfield of machine learning which uses logical
programming representing background knowledge and examples.

23) What is Model Selection in Machine Learning?

The process of selecting models among different mathematical models, which are used to
describe the same data set is known as Model Selection. Model selection is applied to the
fields of statistics, machine learning and data mining.

24) What are the two methods used for the calibration in Supervised Learning?

The two methods used for predicting good probabilities in Supervised Learning are

a) Platt Calibration

b) Isotonic Regression

These methods are designed for binary classification, and extending them is not trivial.

25) Which method is frequently used to prevent overfitting?

When there is sufficient data ‘Isotonic Regression’ is used to prevent an overfitting issue.

26) What is the difference between heuristic for rule learning and heuristics for
decision trees?

The difference is that the heuristics for decision trees evaluate the average quality of a
number of disjointed sets while rule learners only evaluate the quality of the set of
instances that is covered with the candidate rule.

27) What is Perceptron in Machine Learning?

In Machine Learning, the Perceptron is an algorithm for the supervised learning of binary classifiers: it decides whether an input, represented by a vector of numbers, belongs to one class or the other.

28) Explain the two components of Bayesian logic program?

A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one; it encodes the quantitative information about the domain.

29) What are Bayesian Networks (BN) ?


A Bayesian Network is used to represent the graphical model for probability relationships among a set of variables.

30) Why is the instance-based learning algorithm sometimes referred to as a lazy learning algorithm?

The instance-based learning algorithm is also referred to as a lazy learning algorithm because it delays the induction or generalization process until classification is performed.

31) What are the two classification methods that SVM ( Support Vector Machine) can
handle?

a) Combining binary classifiers

b) Modifying binary to incorporate multiclass learning

32) What is ensemble learning?

To solve a particular computational program, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.

33) Why ensemble learning is used?

Ensemble learning is used to improve the classification, prediction, function approximation etc. of a model.

34) When to use ensemble learning?

Ensemble learning is used when you build component classifiers that are more accurate and
independent from each other.

35) What are the two paradigms of ensemble methods?

The two paradigms of ensemble methods are

a) Sequential ensemble methods

b) Parallel ensemble methods

36) What is the general principle of an ensemble method and what is bagging and
boosting in ensemble method?

The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is an ensemble method for improving unstable estimation or classification schemes, while boosting methods are applied sequentially to reduce the bias of the combined model. Boosting and bagging can both reduce errors by reducing the variance term.

37) What is bias-variance decomposition of classification error in ensemble method?

The expected error of a learning algorithm can be decomposed into bias and variance. A
bias term measures how closely the average classifier produced by the learning algorithm
matches the target function. The variance term measures how much the learning
algorithm’s prediction fluctuates for different training sets.

38) What is an Incremental Learning algorithm in ensemble?

Incremental learning is the ability of an algorithm to learn from new data that may become available after the classifier has already been generated from an existing dataset.

39) What is PCA, KPCA and ICA used for?

PCA (Principal Components Analysis), KPCA ( Kernel based Principal Component Analysis)
and ICA ( Independent Component Analysis) are important feature extraction techniques
used for dimensionality reduction.

40) What is dimension reduction in Machine Learning?

In Machine Learning and statistics, dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.

41) What are support vector machines?

Support vector machines are supervised learning algorithms used for classification and
regression analysis.

42) What are the components of relational evaluation techniques?

The important components of relational evaluation techniques are

a) Data Acquisition

b) Ground Truth Acquisition

c) Cross Validation Technique

d) Query Type

e) Scoring Metric

f) Significance Test

43) What are the different methods for Sequential Supervised Learning?

The different methods to solve Sequential Supervised Learning problems are

a) Sliding-window methods

b) Recurrent sliding windows

c) Hidden Markov models

d) Maximum entropy Markov models


e) Conditional random fields

f) Graph transformer networks

44) What are the areas in robotics and information processing where sequential
prediction problem arises?

The areas in robotics and information processing where sequential prediction problem
arises are

a) Imitation Learning

b) Structured prediction

c) Model based reinforcement learning

45) What is batch statistical learning?

Statistical learning techniques allow learning a function or predictor from a set of observed
data that can make predictions about unseen or future data. These techniques provide
guarantees on the performance of the learned predictor on the future unseen data based
on a statistical assumption on the data generating process.

46) What is PAC Learning?

PAC (Probably Approximately Correct) learning is a learning framework that has been
introduced to analyze learning algorithms and their statistical efficiency.

47) What are the different categories you can categorized the sequence learning
process?

a) Sequence prediction

b) Sequence generation

c) Sequence recognition

d) Sequential decision

48) What is sequence learning?

Sequence learning is a method of teaching and learning in a logical manner.

49) What are two techniques of Machine Learning ?

The two techniques of Machine Learning are

a) Genetic Programming

b) Inductive Learning

50) Give a popular application of machine learning that you see on day to day basis?
The recommendation engine implemented by major e-commerce websites uses Machine Learning.

BASIC DATA SCIENCE INTERVIEW QUESTIONS

1. What is Data Science? Also, list the differences between supervised and
unsupervised learning.

Data Science is a blend of various tools, algorithms, and machine learning principles with
the goal to discover hidden patterns from the raw data. How is this different from what
statisticians have been doing for years?

The answer lies in the difference between explaining and predicting.


Supervised Learning vs Unsupervised Learning

Supervised Learning                       | Unsupervised Learning
1. Input data is labeled.                 | 1. Input data is unlabeled.
2. Uses a training dataset.               | 2. Uses the input data set.
3. Used for prediction.                   | 3. Used for analysis.
4. Enables classification and regression. | 4. Enables classification, density estimation, & dimension reduction.

2. What are the important skills to have in Python with regard to data analysis?

The following are some of the important skills to possess which will come handy when
performing data analysis using Python.

 Good understanding of the built-in data types especially lists, dictionaries, tuples, and sets.

 Mastery of N-dimensional NumPy Arrays.

 Mastery of Pandas dataframes.

 Ability to perform element-wise vector and matrix operations on NumPy arrays.

 Knowing that you should use the Anaconda distribution and the conda package manager.

 Familiarity with Scikit-learn.

 Ability to write efficient list comprehensions instead of traditional for loops.

 Ability to write small, clean functions (important for any developer), preferably pure

functions that don’t alter objects.

 Knowing how to profile the performance of a Python script and how to optimize

bottlenecks.

These skills will help you tackle any problem in data analytics and machine learning.

3. What is Selection Bias?


Selection bias is a kind of error that occurs when the researcher decides who is going to
be studied. It is usually associated with research where the selection of participants isn’t
random. It is sometimes referred to as the selection effect. It is the distortion of statistical
analysis, resulting from the method of collecting samples. If the selection bias is not taken
into account, then some conclusions of the study may not be accurate.

The types of selection bias include:

1. Sampling bias: It is a systematic error due to a non-random sample of a population


causing some members of the population to be less likely to be included than others
resulting in a biased sample.
2. Time interval: A trial may be terminated early at an extreme value (often for ethical
reasons), but the extreme value is likely to be reached by the variable with the largest
variance, even if all variables have a similar mean.
3. Data: When specific subsets of data are chosen to support a conclusion or rejection of
bad data on arbitrary grounds, instead of according to previously stated or generally
agreed criteria.
4. Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants)
discounting trial subjects/tests that did not run to completion.


STATISTICS INTERVIEW QUESTIONS

4. What is the difference between “long” and “wide” format data?

In the wide format, a subject’s repeated responses will be in a single row, and each
response is in a separate column. In the long format, each row is a one-time point per
subject. You can recognize data in wide format by the fact that columns generally
represent groups.
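A short pandas sketch of converting between the two formats (hypothetical repeated responses per subject):

import pandas as pd

wide = pd.DataFrame({"subject": ["A", "B"], "visit1": [5, 7], "visit2": [6, 8]})

long = wide.melt(id_vars="subject", var_name="visit", value_name="response")
print(long)          # one row per subject per time point

back_to_wide = long.pivot(index="subject", columns="visit", values="response")
print(back_to_wide)  # repeated responses back in a single row per subject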
5. What do you understand by the term Normal Distribution?

Data is usually distributed in different ways with a bias to the left or to the right or it can
all be jumbled up.
However, there are chances that data is distributed around a central value without any
bias to the left or right and reaches normal distribution in the form of a bell-shaped curve.

Figure: Normal distribution in a bell curve

The random variables are distributed in the form of a symmetrical bell-shaped curve.
Properties of Normal Distribution:

1. Unimodal -one mode


2. Symmetrical -left and right halves are mirror images
3. Bell-shaped -maximum height (mode) at the mean
4. Mean, Mode, and Median are all located in the center
5. Asymptotic

6. What is the goal of A/B Testing?

It is a statistical hypothesis testing for a randomized experiment with two variables A and
B.

The goal of A/B Testing is to identify any changes to the web page that maximize or increase the outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.

An example of this could be identifying the click-through rate for a banner ad.

7. What do you understand by the statistical power of sensitivity and how do you calculate it?

Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).

Sensitivity is nothing but "Predicted True events / Total events". True events here are the events which were true and which the model also predicted as true.

Calculation of sensitivity is pretty straightforward:

Sensitivity = (True Positives) / (Positives in Actual Dependent Variable)

where true positives are positive events which are correctly classified as positives.
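A one-line check with scikit-learn (sensitivity is the same quantity scikit-learn calls recall; toy labels only):

from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(recall_score(y_true, y_pred))   # true positives / actual positives = 3/4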

8. What are the differences between overfitting and underfitting?

In statistics and machine learning, one of the most common tasks is to fit a model to a set
of training data, so as to be able to make reliable predictions on general untrained data.
In overfitting, a statistical model describes random error or noise instead of the underlying
relationship. Overfitting occurs when a model is excessively complex, such as having too
many parameters relative to the number of observations. A model that has been overfit
has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture
the underlying trend of the data. Underfitting would occur, for example, when fitting a
linear model to non-linear data. Such a model too would have poor
predictive performance.


DATA ANALYSIS INTERVIEW QUESTIONS

9. Python or R – Which one would you prefer for text analytics?

We will prefer Python because of the following reasons:

 Python would be the best option because it has Pandas library that provides easy to use

data structures and high-performance data analysis tools.

 R is more suitable for machine learning than just text analysis.

 Python performs faster for all types of text analytics.


Figure: Python vs R

10. How does data cleaning play a vital role in analysis?

Data cleaning can help in analysis because:

 Cleaning data from multiple sources helps to transform it into a format that data analysts

or data scientists can work with.

 Data Cleaning helps to increase the accuracy of the model in machine learning.

 It is a cumbersome process because as the number of data sources increases, the time

taken to clean the data increases exponentially due to the number of sources and the

volume of data generated by these sources.

 It might take up to 80% of the time for just cleaning data making it a critical part of analysis

task.

11. Differentiate between univariate, bivariate and multivariate analysis.

Univariate analyses are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, a pie chart of sales based on territory involves only one variable, and the analysis can be referred to as univariate analysis.

The bivariate analysis attempts to understand the difference between two variables at a
time as in a scatterplot. For example, analyzing the volume of sale and spending can be
considered as an example of bivariate analysis.

Multivariate analysis deals with the study of more than two variables to understand the
effect of variables on the responses.

12. What is Cluster Sampling?

Cluster sampling is a technique used when it becomes difficult to study the target
population spread across a wide area and simple random sampling cannot be
applied. Cluster Sample is a probability sample where each sampling unit is a collection
or cluster of elements.

For eg., A researcher wants to survey the academic performance of high school
students in Japan. He can divide the entire population of Japan into different clusters
(cities). Then the researcher selects a number of clusters depending on his research
through simple or systematic random sampling.

Let’s continue our Data Science Interview Questions blog with some more statistics
questions.

13. What is Systematic Sampling?

Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is the equal probability method.

14. What are Eigenvectors and Eigenvalues?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.

An eigenvalue can be referred to as the strength of the transformation in the direction of its eigenvector, or the factor by which the compression occurs.
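A small NumPy sketch of computing both for a made-up correlation matrix:

import numpy as np

corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(corr)
print(eigenvalues)    # strength of the transformation along each direction
print(eigenvectors)   # columns are the corresponding directions (eigenvectors)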


15. Can you cite some examples where a false positive is more important than a false negative?

Let us first understand what false positives and false negatives are.

 False Positives are the cases where you wrongly classified a non-event as an event a.k.a

Type I error.

 False Negatives are the cases where you wrongly classify events as non-events, a.k.a

Type II error.

Example 1: In the medical field, assume you have to give chemotherapy to patients.
Assume a patient comes to that hospital and he is tested positive for cancer, based on
the lab prediction but he actually doesn’t have cancer. This is a case of false positive.
Here it is of utmost danger to start chemotherapy on this patient when he actually does
not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage
to his normal healthy cells and might lead to severe diseases, even cancer.

Example 2: Let’s say an e-commerce company decided to give $1000 Gift voucher to the
customers whom they assume to purchase at least $10,000 worth of items. They send
free voucher mail directly to 100 customers without any minimum purchase condition
because they assume to make at least 20% profit on sold items above $10,000. Now the
issue is if we send the $1000 gift vouchers to customers who have not actually purchased
anything but are marked as having made $10,000 worth of purchase.

16. Can you cite some examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats and
based on certain characteristics they identify whether a particular passenger can be a
threat or not. Due to a shortage of staff, they decide to scan passengers being predicted
as risk positives by their predictive model. What will happen if a true threat customer is
being flagged as non-threat by airport model?

Example 2: What if Jury or judge decides to make a criminal go free?

Example 3: What if you rejected to marry a very good person based on your predictive
model and you happen to meet him/her after a few years and realize that you had a
false negative?

17. Can you cite some examples where both false positive and false negatives are
equally important?

In the Banking industry giving loans is the primary source of making money but at the
same time if your repayment rate is not good you will not make any profit, rather you will
risk huge losses.

Banks don’t want to lose good customers and at the same point in time, they don’t want
to acquire bad customers. In this scenario, both the false positives and false negatives
become very important to measure.

18. Can you explain the difference between a Validation Set and a Test Set?

A Validation set can be considered as a part of the training set as it is used for parameter
selection and to avoid overfitting of the model being built.

On the other hand, a Test Set is used for testing or evaluating the performance of a
trained machine learning model.

In simple terms, the differences can be summarized as: the training set is used to fit the parameters (i.e. weights) and the test set is used to assess the performance of the model (i.e. evaluating its predictive power and generalization).

19. Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to define a data set to test the model in the training phase (i.e. a validation data set) in order to limit problems like overfitting and get an insight into how the model will generalize to an independent data set.
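A minimal 5-fold cross-validation sketch with scikit-learn on a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # estimate of how the model generalizes to unseen data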


MACHINE LEARNING INTERVIEW QUESTIONS

20. What is Machine Learning?

Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics and is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.
Figure: Applications of Machine Learning

21. What is the Supervised Learning?

Supervised learning is the machine learning task of inferring a function from labeled
training data. The training data consist of a set of training examples.

Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-


nearest Neighbor Algorithm and Neural Networks

E.g. If you built a fruit classifier, the labels will be “this is an orange, this is an apple and
this is a banana”, based on showing the classifier examples of apples, oranges and
bananas.


22. What is Unsupervised learning?

Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.

Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models

E.g. In the same example, a fruit clustering algorithm will categorize the fruits as "fruits with soft skin and lots of dimples", "fruits with shiny hard skin" and "elongated yellow fruits".
23. What are the various classification algorithms?

The below diagram lists the most important classification algorithms.

Figure: Various Classification algorithms

24. What is logistic regression? State an example when you have used logistic
regression recently.

Logistic Regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables.

For example, if you want to predict whether a particular political leader will win the election
or not. In this case, the outcome of prediction is binary i.e. 0 or 1 (Win/Lose). The predictor
variables here would be the amount of money spent for election campaigning of a
particular candidate, the amount of time spent in campaigning, etc.

25. What are Recommender Systems?

Recommender Systems are a subclass of information filtering systems that are meant
to predict the preferences or ratings that a user would give to a product. Recommender
systems are widely used in movies, news, research articles, products, social tags, music,
etc.

Examples include movie recommenders in IMDB, Netflix & BookMyShow, product recommenders in e-commerce sites like Amazon, eBay & Flipkart, YouTube video recommendations and game recommendations in Xbox.

26. What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
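A minimal scikit-learn sketch (made-up numbers) of predicting the criterion variable Y from the predictor X:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])    # predictor variable
y = np.array([2.1, 4.0, 6.2, 7.9])    # criterion variable

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope and intercept
print(model.predict([[5]]))           # prediction for a new value of X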
27. What is Collaborative filtering?

Collaborative filtering is the process of filtering used by most recommender systems to find patterns or information by collaborating viewpoints, various data sources and multiple agents.

An example of collaborative filtering can be to predict the rating of a particular user based
on his/her ratings for other movies and others’ ratings for all movies. This concept is
widely used in recommending movies in IMDB, Netflix & BookMyShow, product
recommenders in e-commerce sites like Amazon, eBay & Flipkart, YouTube video
recommendations and game recommendations in Xbox.
28. How can outlier values be treated?

Outlier values can be identified by using univariate or any other graphical analysis
method. If the number of outlier values is few then they can be assessed individually but
for a large number of outliers, the values can be substituted with either the 99th or the 1st
percentile values.

Not all extreme values are outlier values. The most common ways to treat outlier values are:

1. To change the value and bring it within a range.
2. To just remove the value.

29. What are the various steps involved in an analytics project?

The following are the various steps involved in an analytics project:

1. Understand the Business problem


2. Explore the data and become familiar with it.
3. Prepare the data for modeling by detecting outliers, treating missing values, transforming
variables, etc.
4. After data preparation, start running the model, analyze the result and tweak the approach.
This is an iterative step until the best possible outcome is achieved.
5. Validate the model using a new data set.
6. Start implementing the model and track the result to analyze the performance of the model
over the period of time.

30. During analysis, how do you treat missing values?

The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights.

If no patterns are identified, the missing values can be substituted with the mean or median (imputation) or simply ignored. Getting to know the data before choosing a strategy is important: for a roughly normally distributed numerical variable the mean is a reasonable default, otherwise the median, minimum or maximum may be used; for a categorical variable a default value (such as the most frequent category) is assigned.

If 80% of the values for a variable are missing, then you can answer that you would drop the variable instead of treating the missing values. A minimal imputation sketch follows below.
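A minimal pandas sketch of these strategies; the column names and values are hypothetical.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "city": ["NY", None, "SF", "NY", None],
})

df["age"] = df["age"].fillna(df["age"].median())      # numeric: mean/median imputation
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: most frequent value as default

# drop any column where more than 80% of the values are missing
df = df.loc[:, df.isnull().mean() <= 0.8]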

31. How will you define the number of clusters in a clustering algorithm?

Though the Clustering Algorithm is not specified, this question is mostly in reference to
K-Means clustering where “K” defines the number of clusters. The objective of clustering
is to group similar entities in a way that the entities within a group are similar to each other
but the groups are different from each other.

For example, a scatter plot of such data would show three visually distinct groups.

The Within Sum of Squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS against the number of clusters, you get what is generally known as the Elbow Curve:

- The point after which the WSS stops decreasing significantly (for instance K = 6 in a typical elbow plot) is known as the bending point.
- This bending point is taken as the value of K in K-Means.

This is the most widely used approach, but some data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
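A minimal sketch of the elbow approach with scikit-learn, assuming synthetic blob data; km.inertia_ is the within-cluster sum of squares for each K.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this K

# plotting wss against k and picking the "bend" gives the elbow point
for k, w in zip(range(1, 10), wss):
    print(k, round(w, 1))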

Now that we have seen the machine learning questions, let's continue our Data Science Interview Questions blog with some probability questions.
PROBABILITY INTERVIEW QUESTIONS

32. In any 15-minute interval, there is a 20% probability that you will see at least
one shooting star. What is the probability that you see at least one shooting star in
the period of an hour?
Probability of not seeing any shooting star in a 15-minute interval

= 1 – P(seeing at least one shooting star in 15 minutes)
= 1 – 0.2 = 0.8

Assuming the four 15-minute intervals in an hour are independent, the probability of not seeing any shooting star in the hour

= (0.8)^4 = 0.4096

Probability of seeing at least one shooting star in the hour

= 1 – P(not seeing any star)
= 1 – 0.4096 = 0.5904

33. How can you generate a random number between 1 and 7 with only a die?

- Any die has six sides, numbered 1-6. There is no way to get seven equally likely outcomes from a single roll of a die. If we roll the die twice and consider the pair of rolls as one event, we have 36 different outcomes.
- To get 7 equally likely outcomes we have to reduce 36 to a number divisible by 7, so we consider only 35 outcomes and exclude the remaining one.
- A simple scheme is to exclude the combination (6,6), i.e. to roll the two dice again whenever 6 appears twice.
- All the remaining combinations from (1,1) to (6,5) can be divided into 7 groups of 5 each, so the seven sets of outcomes are equally likely (see the sketch below).
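A minimal Python sketch of this rejection-sampling scheme.

import random

def roll():
    return random.randint(1, 6)

def rand7():
    while True:
        a, b = roll(), roll()
        idx = (a - 1) * 6 + (b - 1)  # 0..35, uniform over the 36 outcomes
        if idx < 35:                 # reject the single outcome (6,6) and re-roll
            return idx // 5 + 1      # 35 outcomes split into 7 equal groups of 5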

34. A certain couple tells you that they have two children, at least one of which is a
girl. What is the probability that they have two girls?

In the case of two children, there are 4 equally likely possibilities

BB, BG, GB and GG;

where B = Boy and G = Girl and the first letter denotes the first child.

From the question, we can exclude the first case, BB. Thus from the remaining 3 possibilities BG, GB & GG, we have to find the probability of the case with two girls.
Thus, P(Having two girls given one girl) = 1/3

35. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin
at random, and toss it 10 times. Given that you see 10 heads, what is the probability
that the next toss of that coin is also a head?

There are two ways of choosing the coin: pick one of the 999 fair coins, or pick the one double-headed coin.

Probability of selecting a fair coin = 999/1000 = 0.999
Probability of selecting the unfair (double-headed) coin = 1/1000 = 0.001

Let A be the event of selecting a fair coin and then getting 10 heads, and B the event of selecting the unfair coin and then getting 10 heads.

P(A) = 0.999 × (1/2)^10 = 0.999 × (1/1024) ≈ 0.000976
P(B) = 0.001 × 1 = 0.001

P(fair | 10 heads) = P(A) / (P(A) + P(B)) = 0.000976 / 0.001976 ≈ 0.4939
P(unfair | 10 heads) = P(B) / (P(A) + P(B)) = 0.001 / 0.001976 ≈ 0.5061

Probability that the next toss is also a head
= P(fair | 10 heads) × 0.5 + P(unfair | 10 heads) × 1
= 0.4939 × 0.5 + 0.5061 × 1 ≈ 0.7531
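A quick Python check of the arithmetic above.

p_fair, p_double = 999 / 1000, 1 / 1000
lik_fair, lik_double = 0.5 ** 10, 1.0        # P(10 heads | coin type)

post_fair = p_fair * lik_fair
post_double = p_double * lik_double
total = post_fair + post_double

p_next_head = (post_fair / total) * 0.5 + (post_double / total) * 1.0
print(round(p_next_head, 4))  # ~0.7531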


DEEP LEARNING INTERVIEW QUESTIONS

36. What do you mean by Deep Learning and Why has it become popular now?

Deep Learning is a paradigm of machine learning which has shown incredible promise in recent years. This is because Deep Learning shows a great analogy with the functioning of the human brain.

Now although Deep Learning has been around for many years, the major breakthroughs from these techniques came only in recent years. This is because of two main reasons:

- The increase in the amount of data generated through various sources
- The growth in hardware resources required to run these models

GPUs are many times faster than CPUs and help us build bigger and deeper deep learning models in comparatively less time than was previously required.

37. What are Artificial Neural Networks?

Artificial Neural networks are a specific set of algorithms that have revolutionized machine
learning. They are inspired by biological neural networks. Neural Networks can adapt to
changing input so the network generates the best possible result without needing to
redesign the output criteria.

38. Describe the structure of Artificial Neural Networks?

Artificial Neural Networks work on the same principle as biological neural networks. They consist of inputs that are processed as weighted sums plus a bias, passed through activation functions to produce outputs.
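A minimal NumPy sketch of a single artificial neuron: a weighted sum of the inputs plus a bias, passed through an activation function (a sigmoid here); all the numbers are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])  # inputs
w = np.array([0.4, 0.1, -0.7])  # weights
b = 0.2                         # bias

output = sigmoid(w @ x + b)     # weighted sum + bias, then activation
print(output)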

39. Explain Gradient Descent.

To understand Gradient Descent, let's first understand what a gradient is.

A gradient measures how much the output of a function changes if you change the inputs a little bit. In a neural network it measures the change in the error with respect to a change in the weights. You can also think of a gradient as the slope of a function.

Gradient Descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill, because it is a minimization algorithm: it minimizes a given cost (loss) function (see the sketch below).
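A minimal sketch of gradient descent minimizing the toy function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate and starting point are arbitrary.

def grad(w):
    return 2 * (w - 3)   # gradient of f(w) = (w - 3)^2

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)    # step downhill, opposite to the gradient

print(w)  # converges towards the minimum at w = 3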

40. What is Back Propagation? Explain its working.

Backpropagation is a training algorithm used for multilayer neural networks. In this method, we propagate the error from the output end of the network back to all the weights inside the network, which allows efficient computation of the gradient.

It has the following steps (see the sketch after this list):

- Forward propagation of the training data through the network
- Derivatives are computed using the output and the target
- Back propagation for computing the derivative of the error w.r.t. the output activation
- The previously calculated derivatives are propagated backwards through the hidden layers
- Update the weights
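A minimal NumPy sketch of these steps for a tiny network with one hidden layer, a sigmoid output and squared error; the data and layer sizes are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # training data
y = np.array([[1.0], [1.0], [0.0]])                 # targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)       # hidden layer (3 units)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)       # output layer
lr = 0.5

for _ in range(1000):
    # forward propagation of the training data
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # derivative of the error w.r.t. the output activation
    d_out = (out - y) * out * (1 - out)

    # back-propagate to the hidden layer using the previously computed derivatives
    d_h = (d_out @ W2.T) * h * (1 - h)

    # update the weights
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)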

41. What are the variants of Back Propagation?

- Stochastic Gradient Descent: a single training example is used to calculate the gradient and update the parameters.
- Batch Gradient Descent: the gradient is calculated on the whole dataset and one update is performed per iteration.
- Mini-batch Gradient Descent: one of the most popular optimization algorithms; a variant of stochastic gradient descent in which a mini-batch of samples is used instead of a single training example.


42. What are the different Deep Learning Frameworks?

- PyTorch
- TensorFlow
- Microsoft Cognitive Toolkit (CNTK)
- Keras
- Caffe
- Chainer

43. What is the role of Activation Function?

The activation function is used to introduce non-linearity into the neural network, helping it to learn more complex functions. Without it, the neural network would only be able to learn linear functions, i.e. linear combinations of its input data. An activation function is a function in an artificial neuron that delivers an output based on its inputs (see the sketch below).
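A minimal NumPy sketch of two common activation functions.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # non-linear: zero for negative inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # non-linear: squashes inputs to (0, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z))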

44. What is an Auto-Encoder?

Autoencoders are simple learning networks that aim to transform inputs into outputs
with the minimum possible error. This means that we want the output to be as close to
input as possible. We add a couple of layers between the input and the output, and the
sizes of these layers are smaller than the input layer. The autoencoder receives
unlabeled input which is then encoded to reconstruct the input.
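A minimal Keras sketch of this idea, assuming 784-dimensional inputs (e.g. flattened 28x28 images); the layer sizes are illustrative.

from tensorflow import keras

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(32, activation="relu"),      # encoder: compress to 32 units
    keras.layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct the input
])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10)  # note: the inputs are also the targets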

45. What is a Boltzmann Machine?

Boltzmann machines have a simple learning algorithm that allows them to discover interesting features that represent complex regularities in the training data. The Boltzmann machine is basically used to optimize the weights and related quantities for a given problem. The learning algorithm is very slow in networks with many layers of feature detectors. The “Restricted Boltzmann Machine” has only a single layer of feature detectors, which makes it much faster to train than a full Boltzmann machine.
1. Question 1. Explain How We Can Capture The Correlation Between Continuous And Categorical Variables?
Answer :
It is possible by using the ANCOVA (Analysis of Covariance) technique, which is used to measure the association between a continuous and a categorical variable.
2. Question 2. How To Handle Missing Or Corrupted Data In A Dataset?
Answer :
You can handle missing or corrupted data in a dataset either by dropping the affected rows or columns, or by replacing the values with another value (imputation).
In Pandas, two very useful methods for identifying missing data are isnull() and dropna() (see the sketch below).
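A minimal pandas sketch of the two methods mentioned above, on a tiny hypothetical DataFrame.

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, 5, 6]})

print(df.isnull())        # boolean mask of missing entries
print(df.isnull().sum())  # missing-value count per column
print(df.dropna())        # drop rows containing any missing value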
3. Question 3. Define What Is Fourier Transform In A Single Sentence?
Answer :
A process of decomposing generic functions into a superposition of symmetric
functions is considered to be a Fourier Transform.
4. Question 4. What Is Deep Learning?
Answer :
Deep learning is a subset of machine learning that uses multi-layered (deep) neural networks to learn representations directly from data.
5. Question 5. What Is The Difference Between An Array And Linked List?
Answer :
An array is an ordered collection of objects stored in contiguous memory and accessed by index. A linked list is a series of objects (nodes) linked together by pointers and processed in sequential order.
6. Question 6. Define A Hash Table?
Answer :
A hash table is a data structure that implements an associative array, mapping keys to values by means of a hash function. Hash tables are commonly used for database indexing.
7. Question 7. Mention Any One Of The Data Visualization Tools That You Are
Familiar With?
Answer :
This is another question where one has to be completely honest, and describing your personal experience with these types of tools is really important. Some common data visualization tools are Tableau, Plot.ly, and matplotlib.
8. Question 8. Is Rotation Necessary In Pca?
Answer :
Yes, the rotation is definitely necessary because it maximizes the differences
between the variance captured by the components.
9. Question 9. How Is The F1 Score Used?
Answer :
The F1 score is the harmonic mean of the Precision and Recall of a model. An F1 score of 1 is the best possible value and 0 is the worst (see the sketch below).
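A minimal sketch of the harmonic-mean formula, cross-checked against scikit-learn's f1_score; the labels are hypothetical.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * p * r / (p + r))       # harmonic mean of precision and recall
print(f1_score(y_true, y_pred))  # same value computed by scikit-learn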
10. Question 10. How Recall And True Positive Rate Are Related?
Answer :
The relation is
True Positive Rate = Recall.
11. Question 11. Assume That You Are Working On A Data Set, Explain How Would
You Select Important Variables?
Answer :
The following are a few methods that can be used to select important variables (a Lasso sketch follows this list):
o Use the Lasso Regression method.
o Using Random Forest, plot the variable importance chart.
o Use Linear Regression and select variables based on their p-values.
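A minimal sketch of Lasso-based variable selection on synthetic data: coefficients driven to exactly zero correspond to variables that can be dropped.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of the variables Lasso kept (non-zero coefficients)
print(selected)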
12. Question 12. Explain The Concept Of Machine Learning And Assume That You
Are Explaining This To A 5-year-old Baby?
Answer :
Machine learning works much the same way babies learn their day-to-day activities, such as walking. Babies cannot walk straight away; they fall, get up and try again. Machine learning is similar: the algorithm keeps running and refining itself every time, so that the end result is as good as possible.
13. Question 13. What Is The Difference Between Machine Learning And Data
Mining?
Answer :
Data mining is about working on unstructured data and then extract it to a level
where the interesting and unknown patterns are identified.
Machine learning is a process or a study whether it closely relates to design,
development of the algorithms that provide an ability to the machines to capacity to
learn.
14. Question 14. Please State Few Popular Machine Learning Algorithms?
Answer :
o Nearest Neighbour
o Neural Networks
o Decision Trees
o Support Vector Machines, etc.
15. Question 15. What Are The Three Stages To Build The Model In Machine
Learning?
Answer :
o Model building
o Model testing
o Applying the model
16. Question 16. What Is The Difference Between Supervised And Unsupervised
Machine Learning?
Answer :
Supervised learning requires labeled training data, whereas unsupervised learning does not require data labeling.
17. Question 17. What Is The Difference Between Bias And Variance?
Answer :
Bias: Bias can be defined as an error that occurs due to overly simplistic assumptions in the learning algorithm.
Variance: Variance is an error caused by too much complexity in the algorithm that is being used to analyze the data.
