Machine Learning Questions
and the path to becoming a data scientist, machine learning engineer, or data engineer.
Springboard created a free guide to data science interviews, so we know exactly how
they can trip up candidates! To help with that, we’ve curated a list of key questions
that you could see in a machine learning interview, with answers to go along with them
so you don’t get stumped. After reading through this piece, you’ll be well prepared for
a job interview, even one for a machine learning internship.
We’ve divided this guide to machine learning interview questions into the categories we
mentioned above so that you can more easily get to the information you need.
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm
you’re using. This can lead to the model underfitting your data, making it hard for it to
have high predictive accuracy and for you to generalize your knowledge from the
training set to the test set.
Variance is error due to too much complexity in the learning algorithm you’re using. This
leads to the algorithm being highly sensitive to high degrees of variation in your training
data, which can lead your model to overfit the data. You’ll be carrying too much noise
from your training data for your model to be very useful for your test data.
Q2- What is the difference between supervised and unsupervised machine learning?
More reading: What is the difference between supervised and unsupervised machine
learning? (Quora)
More reading: How is the k-nearest neighbor algorithm different from k-means
clustering? (Quora)
The critical difference here is that KNN needs labeled points and is thus supervised
learning, while k-means doesn’t — and is thus unsupervised learning.
Recall is also known as the true positive rate: the number of positives your model
correctly identifies compared to the actual number of positives there are throughout the
data. Precision is also known as the positive predictive value, and it is a measure of the
number of accurate positives your model claims compared to the total number of positives
it claims. It can be easier to think of recall and precision in the context of a case
where you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples.
You’d have perfect recall (there are actually 10 apples, and you predicted there would be
10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples)
are correct.
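To make the apples example concrete, here is a minimal Python sketch. The helper name precision_recall and its arguments are illustrative assumptions, not part of any library: the 10 correctly predicted apples are treated as true positives, the 5 wrongly predicted oranges as false positives, and no apples are missed.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw counts."""
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or nothing exists.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# The apples example: 15 predictions, 10 of them correct, no apples missed.
precision, recall = precision_recall(
    true_positives=10, false_positives=5, false_negatives=0
)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Running it should report a precision of about 0.667 and a recall of 1.0, matching the figures above.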
Bayes’ Theorem gives you the posterior probability of an event given what is known as
prior knowledge.
Mathematically, it’s expressed as the true positive rate of a condition sample divided by
the sum of the false positive rate of the population and the true positive rate of a
condition sample. Say a flu test comes back positive 60% of the time for people who have
the flu, but it also comes back positive 50% of the time for people who don’t, and the
overall population only has a 5% chance of having the flu. Would you actually have a 60%
chance of having the flu after a positive test?
Bayes’ Theorem says no. It says that you have a
(0.6 * 0.05) / ((0.6 * 0.05) + (0.5 * 0.95)) = 0.0594
or 5.94% chance of having the flu, where 0.6 * 0.05 is the true positive rate of a
condition sample and 0.5 * 0.95 is the false positive rate of the population.
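The arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes the flu numbers are read as a 5% prior, a 60% chance of a positive test for someone with the flu, and a 50% chance of a positive test for someone without it. The function and parameter names are made up for illustration.

```python
def posterior(prior, p_pos_given_condition, p_pos_given_no_condition):
    """Bayes' Theorem: P(condition | positive test)."""
    # Probability of having the condition AND testing positive.
    numerator = p_pos_given_condition * prior
    # Total probability of a positive test, with or without the condition.
    evidence = numerator + p_pos_given_no_condition * (1.0 - prior)
    return numerator / evidence


p_flu_given_positive = posterior(
    prior=0.05, p_pos_given_condition=0.6, p_pos_given_no_condition=0.5
)
print(round(p_flu_given_positive, 4))  # 0.0594
```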
Bayes’ Theorem is the basis behind a branch of machine learning that most notably
includes the Naive Bayes classifier. That’s something important to consider when you’re
faced with machine learning interview questions.
Despite its practical applications, especially in text mining, Naive Bayes is considered
“Naive” because it makes an assumption that is virtually impossible to see in real-life
data: the conditional probability is calculated as the pure product of the individual
probabilities of components. This implies the absolute independence of features — a
condition probably never met in real life.
As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that
you liked pickles and ice cream would probably naively recommend you a pickle ice
cream.
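Here is a small sketch of that "pure product" in Python. All of the probabilities below are invented for illustration, and the function name is hypothetical. The point is only that the score for each class is the prior multiplied by one likelihood per feature, as if the features never interacted.

```python
from math import prod


def naive_bayes_score(prior, feature_likelihoods):
    """Unnormalized class score: prior times the product of per-feature likelihoods."""
    return prior * prod(feature_likelihoods)


# Hypothetical likelihoods of the features "has pickles" and "is ice cream"
# among foods you like and foods you dislike.
likes = naive_bayes_score(prior=0.5, feature_likelihoods=[0.8, 0.9])
dislikes = naive_bayes_score(prior=0.5, feature_likelihoods=[0.2, 0.3])

# Normalize the two scores to get the probability that you like pickle ice cream.
posterior_like = likes / (likes + dislikes)
print(f"{posterior_like:.3f}")
```

Because each feature is scored separately, the sketch is confident you would like the combination, which is exactly the naive behavior described above.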
Q9- What’s your favorite algorithm, and can you explain it to me in less than a minute?
This type of question tests your understanding of how to communicate complex and
technical nuances with poise and the ability to summarize quickly and efficiently. Make
sure you have a choice and make sure you can explain different algorithms so simply
and effectively that a five-year-old could grasp the basics!
Don’t think that this is a trick question! Many machine learning interview questions will
be an attempt to lob basic questions at you just to make sure you’re on top of your
game and you’ve covered all of your bases.
Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type
I error means claiming something has happened when it hasn’t, while Type II error
means that you claim nothing is happening when in fact something is.
A clever way to think about this is to think of Type I error as telling a man he is pregnant,
while Type II error means you tell a pregnant woman she isn’t carrying a baby.
More reading: What is the difference between “likelihood” and “probability”? (Cross
Validated)
Q13- What is deep learning, and how does it contrast with other machine learning
algorithms?
Deep learning is a subset of machine learning that is concerned with neural networks:
how to use backpropagation and certain principles from neuroscience to more
accurately model large sets of unlabelled or semi-structured data. In that sense, deep
learning can serve as an unsupervised learning approach that learns representations of
data through the use of neural nets, although neural networks are also widely trained on
labeled data.
More reading: What is the difference between a Generative and Discriminative Algorithm?
(Stack Overflow)
A generative model will learn categories of data while a discriminative model will simply
learn the distinction between different categories of data. Discriminative models will
generally outperform generative models on classification tasks.
Q15- What cross-validation technique would you use on a time series dataset?
Instead of using standard k-fold cross-validation, you have to pay attention to the fact
that a time series is not randomly distributed data: it is inherently ordered
chronologically. If a pattern emerges in later time periods, for example, your model
may still pick up on it even if that effect doesn’t hold in earlier years!
You’ll want to do something like forward chaining, where you model on past data and then
evaluate on forward-facing data.
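Forward chaining can be sketched in a few lines of Python. The function below is a hypothetical helper written for this guide, not a library API: it cuts the data into equal chronological blocks and, for each fold, trains on everything before the test block. scikit-learn offers TimeSeriesSplit for the same idea if you would rather not roll your own.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs where training data always precedes test data."""
    block = n_samples // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for fold in range(1, n_folds + 1):
        train_end = fold * block
        # The last fold absorbs any leftover samples at the end of the series.
        test_end = n_samples if fold == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train, test in splits:
    print("train:", train, "test:", test)
# First fold: train [0, 1], test [2, 3]; last fold: train [0..7], test [8, 9]
```

Each fold’s training window grows while the test window slides forward, so the model is never evaluated on data older than what it was trained on.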
Pruning is what happens in decision trees when branches that have weak predictive
power are removed in order to reduce the complexity of the model and increase the
predictive accuracy of a decision tree model. Pruning can happen bottom-up and
top-down, with approaches such as reduced error pruning and cost complexity pruning.
Reduced error pruning is perhaps the simplest version: starting at the leaves, replace
each node with its most popular class. If that doesn’t decrease predictive accuracy, keep
it pruned. While simple, this heuristic actually comes pretty close to an approach that
would optimize for maximum accuracy.
This question tests your grasp of the nuances of machine learning model performance!
Machine learning interview questions often look towards the details. There are models
with higher accuracy that can perform worse in predictive power — how does that make
sense?
Well, it has everything to do with how model accuracy is only a subset of model
performance, and at that, a sometimes misleading one. For example, if you wanted to
detect fraud in a massive dataset with a sample of millions, a more accurate model
would most likely predict no fraud at all if only a tiny minority of cases were fraud.
However, this would be useless for a predictive model: a model designed to find fraud
that asserted there was no fraud at all! Questions like this help you demonstrate that
you understand model accuracy isn’t the be-all and end-all of model performance.
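A toy version of the fraud example shows the effect. The dataset here is invented (10 fraud cases out of 1,000 transactions), and the "model" simply predicts no fraud every time.

```python
# Invented toy data: 1 marks a fraudulent transaction, 0 a legitimate one.
labels = [1] * 10 + [0] * 990
# A "model" that always asserts there is no fraud.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

fraud_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = fraud_caught / sum(labels)

print(f"accuracy={accuracy:.2%}, recall on fraud={recall:.2%}")
```

The accuracy comes out at 99% even though not a single fraud case is caught, which is why metrics such as recall and precision matter alongside accuracy.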
An imbalanced dataset is when you have, for example, a classification task and 90% of
the data is in one class. That leads to problems: an accuracy of 90% can be misleading if
you have no predictive power on the other category of data! Here are a few tactics to get
over the hump:
What’s important here is that you have a keen sense for what damage an unbalanced
dataset can cause, and how to balance that.
You could list some examples of ensemble methods, from bagging to boosting to a
“bucket of models” method and demonstrate how they could increase predictive power.
1- Keep the model simpler: reduce variance by taking into account fewer variables and
parameters, thereby removing some of the noise in the training data.
3- Use regularization techniques such as LASSO that penalize certain model parameters
if they’re likely to cause overfitting.
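To give a feel for how an L1 penalty like LASSO’s behaves, here is a sketch of soft-thresholding, the shrinkage step that commonly sits at the heart of LASSO solvers. It is not a full LASSO implementation, and the coefficients and penalty value are made up for illustration.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero by `penalty`, setting small ones to exactly zero."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical unpenalized coefficients for four features.
unpenalized = [2.5, -0.3, 0.1, -1.2]
shrunk = [soft_threshold(c, penalty=0.5) for c in unpenalized]
print(shrunk)
```

The two weak coefficients are zeroed out entirely while the strong ones are only shrunk, which is how this kind of penalty ends up simplifying the model.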
The kernel trick involves kernel functions that let you operate in higher-dimensional
spaces without explicitly calculating the coordinates of points within that dimension:
instead, kernel functions compute the inner products between the images of all pairs of
data in a feature space. This gives them the very useful attribute of working as if in
higher dimensions while being computationally cheaper than the explicit calculation
of said coordinates. Many algorithms can be expressed in terms of inner products.
Using the kernel trick enables us to effectively run algorithms in a high-dimensional
space with lower-dimensional data.
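A quick numerical check makes the idea less abstract. The sketch below uses a degree-2 polynomial kernel on 2-D points as an illustrative choice (the text above does not name a specific kernel) and compares it against the inner product of an explicit 3-D feature map.

```python
from math import sqrt


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def explicit_feature_map(point):
    """Map a 2-D point into the 3-D space implied by the degree-2 kernel."""
    x1, x2 = point
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def polynomial_kernel(u, v):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(u, v) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(explicit_feature_map(x), explicit_feature_map(y))
via_kernel = polynomial_kernel(x, y)
print(explicit, via_kernel)  # both are 121, up to floating-point rounding
```

The kernel gets the same inner product without ever building the higher-dimensional coordinates, and the savings grow quickly as the implied feature space gets larger.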
In Pandas, there are two very useful methods: isnull() and dropna() that will help you
find columns of data with missing or corrupted data and drop those values. If you want
to fill the invalid values with a placeholder value (for example, 0), you could use the
fillna() method.
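Here is a short sketch of those three methods on a tiny, made-up DataFrame. The column names and values are invented for illustration.

```python
import pandas as pd

# Invented data with one missing value in each column.
df = pd.DataFrame(
    {
        "age": [25.0, None, 40.0],
        "income": [50000.0, 62000.0, None],
    }
)

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value.
complete_rows = df.dropna()

# Or keep every row and fill the gaps with a placeholder instead.
filled = df.fillna(0)

print(missing_per_column)
print(complete_rows)
print(filled)
```

Whether to drop or fill depends on the dataset: dropping loses rows, while a placeholder such as 0 can distort statistics like the mean.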
Q27- Do you have experience with Spark or big data tools for machine learning?
More reading: 50 Top Open Source Tools for Big Data (Datamation)
You’ll want to get familiar with the meaning of big data for different companies and the
different tools they’ll want. Spark is the big data tool most in demand now, able to
handle immense datasets with speed. Be honest if you don’t have experience with the
tools demanded, but also take a look at job descriptions and see what tools pop up:
you’ll want to invest in familiarizing yourself with them.
This kind of question demonstrates your ability to think in parallelism and how you
could handle concurrency in programming implementations dealing with big data. Take
a look at pseudocode frameworks such as Peril-L and visualization tools such as Web
Sequence Diagrams to help you demonstrate your ability to write code that reflects
parallelism.
Q29- What are some differences between a linked list and an array?
A hash table is a data structure that implements an associative array. A key is mapped to
certain values through the use of a hash function. Hash tables are often used for tasks
such as database indexing.
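If you’re asked to sketch one on a whiteboard, something like the following minimal Python version would do. It is an illustrative toy that resolves collisions by chaining; the class name, method names, and sample keys are all made up, and in practice you would just use Python’s built-in dict.

```python
class HashTable:
    """Toy hash table that resolves collisions by chaining within buckets."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket_for(self, key):
        # The hash function maps the key to one of the buckets.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        return default


# Hypothetical index from a user ID to the row where that user's record lives.
index = HashTable()
index.put("user_42", "row 1093")
index.put("user_7", "row 12")
index.put("user_42", "row 2048")  # updates the earlier entry
print(index.get("user_42"), index.get("missing"))
```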
Q31- Which data visualization libraries do you use? What are your thoughts on the best
data visualization tools?
What’s important here is to define your views on how to properly visualize data and your
personal preferences when it comes to tools. Popular tools include R’s ggplot, Python’s
seaborn and matplotlib, and tools such as Plot.ly and Tableau.
A lot of machine learning interview questions of this type will involve implementation of
machine learning models to a company’s problems. You’ll have to research the
company and its industry in-depth, especially the revenue drivers the company has, and
the types of users the company takes on in the context of the industry it’s in.
Q33- How can we use your machine learning skills to generate revenue?
This is a tricky question. The ideal answer would demonstrate knowledge of what drives
the business and how your skills could relate. For example, if you were interviewing for
music-streaming startup Spotify, you could remark that your skills at developing a better
recommendation model would increase user retention, which would then increase
revenue in the long run.
The startup metrics Slideshare linked above will help you understand exactly what
performance indicators are important for startups and tech companies as they think
about revenue and growth.
Q35- What are the last machine learning papers you’ve read?
More reading: What are some of the best research papers/books for machine learning?
Keeping up with the latest scientific literature on machine learning is a must if you want
to demonstrate interest in a machine learning position. This overview of deep learning
in Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun)
can be a good reference paper and an overview of what’s happening in deep learning —
and the kind of paper you might want to cite.
Related to the last point, most organizations hiring for machine learning positions will
look for your formal experience in the field. Research papers, co-authored or supervised
by leaders in the field, can make the difference between you being hired and not. Make
sure you have a summary of your research experience and papers ready — and an
explanation for your background and lack of formal research experience if you don’t.
Q37- What are your favorite use cases of machine learning models?
More reading: What are the typical use cases for different machine learning algorithms?
(Quora)
The Quora thread above contains some examples, such as decision trees that
categorize people into different tiers of intelligence based on IQ scores. Make sure that
you have a few examples in mind and describe what resonated with you. It’s important
that you demonstrate an interest in how machine learning is implemented.
The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a better
collaborative filtering algorithm. The winning team, BellKor, achieved a 10%
improvement and used an ensemble of different methods to win. Some familiarity with
the case and its solution will help demonstrate you’ve paid attention to machine
learning for a while.
Q39- Where do you usually source datasets?
More reading: 19 Free Public Data Sets For Your First Data Science Project (Springboard)
Machine learning interview questions like these try to get at the heart of your machine
learning interest. Somebody who is truly passionate about machine learning will have
gone off and done side projects on their own, and have a good idea of what great
datasets are out there. If you’re missing any, check out Quandl for economic and
financial data, and Kaggle’s Datasets collection for another great list.
Q40- How do you think Google is training data for self-driving cars?
Machine learning interview questions like this one really test your knowledge of
different machine learning methods, and your inventiveness if you don’t know the
answer. Google is currently using reCAPTCHA to source labeled data on storefronts and
traffic signs. They are also building on training data collected by Sebastian Thrun at
GoogleX — some of which was obtained by his grad students driving buggies on desert
dunes!
Q41- How would you simulate the approach AlphaGo took to beat Lee Sedol at Go?
More reading: Mastering the game of Go with deep neural networks and tree search
(Nature)
AlphaGo beating Lee Sedol, the best human player at Go, in a best-of-five series was a
truly seminal event in the history of machine learning and deep learning. The Nature
paper above describes how this was accomplished with “Monte-Carlo tree search with
deep neural networks that have been trained by supervised learning, from human expert
games, and by reinforcement learning from games of self-play.”