Data scientist interview questions ❓
Company: Splunk
Q. How should the weights of a neural network be initialized?
Ans: There are two methods here: we can either initialize the weights to zero or assign them randomly.
Initializing all weights to 0: This makes your model similar to a linear model. All the neurons
and every layer perform the same operation, giving the same output and making the deep
net useless.
Initializing all weights randomly: Here, the weights are assigned randomly by initializing them
very close to 0. It gives better accuracy to the model since every neuron performs different
computations. This is the most commonly used method.
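A minimal NumPy sketch of the symmetry problem (the layer sizes and the tanh activation are illustrative assumptions, not from the original answer):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                  # a toy batch of 4 examples, 3 features

W_zero = np.zeros((3, 5))                    # zero initialization: every hidden unit is identical
h_zero = np.tanh(x @ W_zero)
print(np.allclose(h_zero, h_zero[:, [0]]))   # True: all 5 units give the same output

W_rand = rng.normal(scale=0.01, size=(3, 5)) # small random initialization breaks the symmetry
h_rand = np.tanh(x @ W_rand)
print(np.allclose(h_rand, h_rand[:, [0]]))   # False: units compute different things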
Q. Explain the difference between Stochastic, Batch, and Mini-batch Gradient Descent.
Ans: Stochastic Gradient Descent: We use only a single training example to calculate the gradient and update the parameters.
Batch Gradient Descent: We calculate the gradient for the whole dataset and perform the update at each iteration.
Mini-batch Gradient Descent: One of the most popular optimization algorithms; it is a variant of Stochastic Gradient Descent where, instead of a single training example, a mini-batch of samples is used.
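A rough NumPy sketch of the mini-batch variant on a toy linear-regression problem (the data, learning rate, and batch size are made up for illustration; batch GD would pass the full X, y and SGD a single row):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def gradient(w, Xb, yb):
    # Gradient of mean squared error for a linear model
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr, batch_size = 0.1, 16
for epoch in range(50):
    # Mini-batch: a random slice; SGD would use size=1, batch GD the whole dataset
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w -= lr * gradient(w, X[idx], y[idx])
print(w)   # close to [1.0, -2.0, 0.5]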
3. What are the feature selection methods used to select the right variables?
The best analogy for selecting features is "bad data in, bad answer out." When we're limiting or selecting features, it's all about choosing the useful ones.
Filter Methods
This involves:
• ANOVA
• Chi-Square
Wrapper Methods
This involves:
• Forward Selection: We test one feature at a time and keep adding them until we get a good fit.
• Backward Selection: We test all the features and start removing them to see what works better.
• Recursive Feature Elimination: Recursively looks through all the different features and how they pair together.
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
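As a hedged illustration, here is a small scikit-learn sketch of Recursive Feature Elimination on synthetic data (the dataset sizes and the logistic-regression estimator are arbitrary choices, not from the original answer):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
# Recursively drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected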
· Joint sampling is done when there are equal numbers of events and non-events. It is not appropriate for imbalanced data.
· Separate sampling is done for imbalanced data. For a rare event, all observations are kept when target = 1 and only a few observations are kept when target = 0.
Date - 20-12-2023
Topic: sets & groups, bag of words, nested triggers, TPR vs FPR
Data is grouped using sets and groups according to predefined criteria. The primary distinction between the two is that although a set can have only two options—either in or out—a group can divide the dataset into several groups. A user should decide which sets or groups to apply based on the conditions.
Bag of Words is a text representation that describes the frequency with which words appear in a document.
Triggers may implement DML by using INSERT, UPDATE, and DELETE statements. Triggers that contain DML and fire other triggers for data modification are called Nested Triggers.
True Positive Rate or Recall: It gives us the percentage of true positives captured by the model out of all observations in the actual positive class.
TPR = TP/(TP+FN)
False Positive Rate: It gives us the percentage of false positives produced by the model out of all observations in the actual negative class.
FPR = FP/(FP+TN)
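A tiny Python helper computing both rates from the four confusion-matrix counts (the example counts are made up):

def rates(tp, fn, fp, tn):
    # TPR (recall) and FPR from the confusion-matrix counts
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

print(rates(tp=80, fn=20, fp=10, tn=90))   # (0.8, 0.1)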
Date: 12-12-2023
Ans: List elements can be removed using the pop() or remove() method. The difference between these two functions is that pop() removes the element at a given index and returns it, whereas remove() deletes the first occurrence of a given value and returns nothing.
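A quick demonstration on a Python list:

nums = [10, 20, 30, 20]
value = nums.pop(1)    # removes the element at index 1 and returns it
print(value, nums)     # 20 [10, 30, 20]
nums.remove(20)        # deletes the first occurrence of the value 20, returns None
print(nums)            # [10, 30]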
Advantages of Views:
As there is no physical location where the data in a view is stored, it generates output without wasting resources.
Data access is restricted, as a view does not allow commands like insert, update, and delete.
Disadvantages of Views:
Much memory space is occupied when a view is created for large tables.
4. Describe the Difference Between Window Functions and Aggregate Functions in SQL.
The main difference between window functions and aggregate functions is that aggregate
functions group multiple rows into a single result row; all the individual rows in the group are
collapsed and their individual data is not shown. On the other hand, window functions
produce a result for each individual row. This result is usually shown as a new column value
in every row within the window.
The Ribbon is basically your key interface with Excel and it appears at the top of the Excel
window. It allows users to access many of the most important commands directly. It consists
of many tabs such as File, Home, View, Insert, etc. You can also customize the ribbon to suit
your preferences. To customize the Ribbon, right-click on it and select the “Customize the
Ribbon” option.
———————————————————-
Date: 30-11-2023
Company: Facebook
Question 1 : How would you approach building a recommendation system for personalized
content on Facebook? Consider factors like scalability and user privacy.
Question 2 : Describe a situation where you had to navigate conflicting opinions within your
team. How did you facilitate resolution and maintain team cohesion?
Question 3 : How would you enhance the security of user data on Facebook, considering the
evolving landscape of cybersecurity threats?
- Answer: Enhancing the security of user data on Facebook involves implementing robust
encryption mechanisms, access controls, and regular security audits. Ensuring compliance
with privacy regulations and proactive threat monitoring are essential.
Question 4 : Design a real-time notification system for Facebook, ensuring timely delivery of
notifications to users across various platforms.
- Answer: Designing a real-time notification system for Facebook requires technologies like
WebSocket for real-time communication and push notifications. Ensuring scalability and
reliability through distributed systems is crucial for timely delivery.
Date: 27-11-2023
Company: Twitter
1: How would you preprocess and tokenize text data from tweets for sentiment analysis?
Discuss potential challenges and solutions.
- Answer: Preprocessing and tokenizing text data for sentiment analysis involves tasks like lowercasing, removing stop words, and stemming or lemmatization. Handling challenges such as emojis, slang, and noisy text is crucial. Tools like NLTK or spaCy can assist in these tasks.
3: Write a Python or Scala function to count the frequency of hashtags in a given collection
of tweets.
- Answer (Python):
def count_hashtags(tweet_collection):
    hashtags_count = {}
    for tweet in tweet_collection:
        # Count every whitespace-separated token that starts with '#'
        for token in tweet.split():
            if token.startswith("#"):
                hashtags_count[token] = hashtags_count.get(token, 0) + 1
    return hashtags_count
4: How does graph analysis contribute to understanding user interactions and content
propagation on Twitter? Provide a specific use case.
- Answer: Graph analysis on Twitter involves examining user interactions. For instance,
identifying influential users or detecting communities based on retweet or mention networks.
Algorithms like PageRank or Louvain Modularity can aid in these analyses.
Date: 26-11-2023
Company: Microsoft
1. Write a SQL query to find the top three products with the highest revenue in the last
quarter from a sales database.
Answer: A SQL query to find the top three products by revenue (assuming columns ProductID and Revenue, and that Sales holds last quarter's records):
SELECT TOP 3 ProductID, SUM(Revenue) AS TotalRevenue
FROM Sales
GROUP BY ProductID
ORDER BY TotalRevenue DESC;
2. Explain the concept of ensemble learning and provide examples of scenarios where
ensemble methods could be beneficial.
3. How would you build a language translation system using Microsoft's Cognitive Services
or other NLP tools?
Answer: To build a language translation system, I would use Azure Cognitive Services'
Translator API. By sending text to the API, it provides translated content. Preprocess text,
send requests to the API, and handle responses. Azure Cognitive Services also supports
neural machine translation for improved accuracy.
4. Explain the concept of transfer learning in the context of deep neural networks. How can
transfer learning be applied to tasks like image classification?
Answer: Transfer learning involves using a pre-trained model on a related task to improve
performance on a new task. For image classification, a pre-trained Convolutional Neural
Network (CNN) can be fine-tuned on a specific dataset, saving training time and leveraging
learned features.
Date: 23-11-2023
Company: Google
1. Explain the concept of transfer learning in the context of deep learning models. How can it
be beneficial in practical applications?
Ans- Transfer learning involves leveraging pre-trained models on large datasets and
adapting them to new, related tasks with smaller datasets. In deep learning, this is achieved
by reusing the knowledge gained during the training of one model on a different, but related,
task. This is particularly beneficial when the new task has limited labeled data.
Practical applications include image recognition, where a model pre-trained on a dataset like
ImageNet can be fine-tuned for a specific domain. Transfer learning accelerates model
convergence, requires less labeled data, and helps overcome the challenges of training
deep neural networks from scratch.
2. Given a large dataset, how would you efficiently sample a representative subset for model
training? Discuss the trade-offs involved.
Answer- To efficiently sample a representative subset, one can use techniques like random
sampling or stratified sampling. For random sampling, simple random sampling or
systematic sampling methods can be employed. For stratified sampling, data is divided into
strata, and samples are randomly selected from each stratum.
Trade-offs involve the choice between biased and unbiased sampling. Random sampling
may not capture rare events, while stratified sampling might introduce complexity but
ensures representation. The size of the sample is also crucial; a too-small sample may not
be representative, while a too-large sample may incur unnecessary computational costs.
3. How would you approach analyzing A/B test results to determine the effectiveness of a
new feature on a platform like Google Search?
Answer: A/B testing involves comparing the performance of two versions (A and B) to
determine the impact of a change. To analyze A/B test results:
- Define Metrics: Clearly define key metrics (e.g., click-through rate, user engagement)
before the test.
- Random Assignment: Ensure random assignment of users to control (A) and experimental
(B) groups.
- Statistical Significance: Use statistical tests (e.g., t-test) to determine if differences between groups are statistically significant (see the sketch after this list).
- Segmentation: Analyze results across different user segments for nuanced insights.
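A minimal sketch of the statistical-significance step with SciPy's independent-samples t-test (the metric values below are simulated for illustration, not real experiment data):

from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(loc=0.10, scale=0.05, size=1000)      # e.g. click-through rates, group A
treatment = rng.normal(loc=0.11, scale=0.05, size=1000)    # group B

t_stat, p_value = stats.ttest_ind(treatment, control)
print(p_value < 0.05)   # True here: the difference is statistically significant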
4. You have access to search query logs. How would you identify and address potential
biases in the search results?
Answer: To identify and address biases in search results:
- Query Intent: Understand user query intent and ensure diverse queries are
well-represented.
- Evaluate Results: Assess the diversity of results to avoid favoring specific perspectives.
- User Feedback: Gather feedback from users to identify biased or inappropriate results.
Date: 20-11-2023
1. How would you handle imbalanced datasets when building a predictive model, and what
techniques would you use to ensure model performance?
Answer: When dealing with imbalanced datasets, techniques like oversampling the minority
class, undersampling the majority class, or using advanced methods like SMOTE can be
employed. Additionally, adjusting class weights in the model or using ensemble techniques
like RandomForest can address imbalanced data challenges.
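A small sketch of the SMOTE technique mentioned above, using the imbalanced-learn library (assumes imblearn is installed; the synthetic dataset is illustrative):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print(Counter(y))                        # heavily imbalanced, e.g. roughly {0: 950, 1: 50}
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # classes balanced after oversampling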
2. Explain the K-means clustering algorithm and its applications. How would you determine
the optimal number of clusters?
Answer: The K-means clustering algorithm partitions data into 'K' clusters based on
similarity. The optimal 'K' can be determined using methods like the Elbow Method or
Silhouette Score. Applications include customer segmentation, anomaly detection, and
image compression.
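A short scikit-learn sketch of the Elbow Method: print the inertia (cost) for each K and look for the bend (the synthetic blobs are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia drops sharply until k=4, then flattens: the "elbow"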
3.Describe a scenario where you successfully applied time series forecasting to solve a
business problem. What methods did you use?
Answer: In time series forecasting, one would start with data exploration, identify seasonality
and trends, and use techniques like ARIMA, Exponential Smoothing, or LSTM for modeling.
Evaluation metrics like MAE, RMSE, or MAPE help assess forecasting accuracy.
4. Discuss the challenges and considerations involved in deploying machine learning models
to a production environment.
Answer: Model deployment involves converting a trained model into a format suitable for
production, using frameworks like Flask or Docker. Deployment considerations include
scalability, monitoring, and version control. Tools like Kubernetes can aid in managing
deployed models.
5. Explain the concept of ensemble learning, and how might ensemble methods improve the
robustness of a predictive model?
————————————————————
Date: 22-10-2023
Company: Accenture
The silhouette coefficient is a measure of how well clustered together a data point is with
respect to the other points in its cluster. It is a measure of how similar a point is to the points
in its own cluster, and how dissimilar it is to the points in other clusters. The silhouette
coefficient ranges from -1 to 1, with 1 being the best possible score and -1 being the worst
possible score.
Trends and seasonality are two characteristics of time series metrics that break many
models. Trends are continuous increases or decreases in a metric’s value. Seasonality, on
the other hand, reflects periodic (cyclical) patterns that occur in a system, usually rising
above a baseline and then decreasing again.
Bag of Words is a commonly used model that depends on word frequencies or occurrences
to train a classifier. This model creates an occurrence matrix for documents or sentences
irrespective of its grammatical structure or word order.
4. What is a Self-Join?
A self-join is a type of join that can be used to connect two tables. As a result, it is a unary
relationship. Each row of the table is attached to itself and all other rows of the same table in
a self-join. As a result, a self-join is mostly used to combine and compare rows from the
same database table.
5. Explain the Law of Large Numbers.
The 'Law of Large Numbers' states that if an experiment is repeated independently a large number of times, the average of the individual results gets close to the expected value. Likewise, the sample variance and sample standard deviation converge toward the population values they estimate.
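A quick simulation of the law (the seed and sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
for n in (100, 100_000):
    flips = rng.integers(0, 2, size=n)   # fair coin: 0 = tails, 1 = heads
    print(n, flips.mean())               # the mean drifts toward the expected value 0.5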
————————————————————-
Date - 15-10-2023
Ans: Decorators are used to add some design patterns to a function without changing its
structure. Decorators generally are defined before the function they are enhancing. To apply
a decorator we first define the decorator function. Then we write the function it is applied to
and simply add the decorator function above the function it has to be applied to. For this, we
use the @ symbol before the decorator.
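A minimal example of defining a decorator and applying it with the @ symbol (the logging behaviour is just an illustration):

def logged(func):
    # The decorator wraps func without changing its structure
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args}")
        return func(*args, **kwargs)
    return wrapper

@logged                 # the @ symbol applies the decorator defined above
def add(a, b):
    return a + b

print(add(2, 3))        # prints the log line, then 5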
4. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of
the given dataset?
————————————————————-
Date: 14-10-2023
1. Explain some cases where k-Means clustering fails to give good results
k-means has trouble clustering data where clusters are of various sizes and densities. Outliers will cause the centroids to be dragged, or the outliers might get their own cluster instead of being ignored; outliers should be clipped or removed before clustering. If the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples, so dimensions should be reduced before clustering.
2. If your Time-Series Dataset is very long, what architecture would you use?
If a time-series dataset is very long, LSTMs are ideal for it because they can process not only single data points but also entire sequences of data. For an even stronger representational capacity, stacking multiple LSTM layers is better. Another method for a long time-series dataset is to use CNNs to extract information.
Power BI is a strong business analytics tool that creates useful insights and reports by collating data from unrelated sources. This data can be extracted from any source like Microsoft Excel or hybrid data warehouses. Power BI delivers a high level of utility and purpose through its interactive graphical interface and visualizations.
When the KNN algorithm gets the training data, it does not learn and make a model; it just stores the data. Instead of finding any discriminative function with the help of the training data, it follows instance-based learning and uses the training data only when it actually needs to make a prediction on unseen datasets. As a result, KNN does not immediately learn a model but rather delays the learning, which is why it is referred to as a Lazy Learner.
5. Explain the difference between drop and truncate.
In SQL, the DROP command is used to remove an entire table or database, including its indexes, data, and more, whereas the TRUNCATE command is used to remove all the rows from a table while keeping the table structure.
————————————————————-
Date - 06-08-2023
Topic: k-means, CNN layers, constraints in SQL, Tableau, NLP, bagging and boosting
1. Explain some cases where k-Means clustering fails to give good results
k-means has trouble clustering data where clusters are of various sizes and densities.
Outliers will cause the centroids to be dragged, or the outliers might get their own cluster
instead of being ignored. Outliers should be clipped or removed before clustering.
1. Input Layer: The input layer in CNN should contain image data. Image data is represented
by a three-dimensional matrix. We have to reshape the image into a single column.
For example, suppose we have the MNIST dataset with images of dimension 28 x 28 = 784; you need to convert each image into a 784 x 1 vector before feeding it into the input. If we have "k" training examples in the dataset, then the dimension of the input will be (784, k).
2. Convolutional Layer: To perform the convolution operation, this layer is used which
creates several smaller picture windows to go over the data.
3. ReLU Layer: This layer introduces the non-linearity to the network and converts all the
negative pixels to zero. The final output is a rectified feature map.
3. What Would You Do If Some Countries/Provinces (Any Geographical Entity) are Missing
and Displaying a Null When You Use Map View in Tableau?
When working with maps and geographical fields, unknown or ambiguous locations are
identified by the indicator in the lower right corner of the view.
Edit Locations - correct the locations by mapping your data to known locations
Filter Data - exclude the unknown locations from the view using a filter. The locations will not
be included in calculations
Show Data at Default Position - show the values at the default position of (0, 0) on the map.
Constraints are used to specify the rules concerning data in the table. They can be applied to single or multiple fields in an SQL table, either when the table is created or afterwards using the ALTER TABLE command. The constraints include:
NOT NULL - Restricts NULL value from being inserted into a column.
DEFAULT - Automatically assigns a default value if no value has been specified for the field.
Bagging is an ensemble of homogeneous weak learners that learn independently from each other in parallel and are combined, for example by averaging, to determine the model output. Boosting is also an ensemble of homogeneous weak learners, but it works differently from Bagging: the learners are trained sequentially and adaptively, each one improving on the predictions made so far.
————————————————————-
Date: 01-08-2023
Time series can be phrased as supervised learning: given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem.
In the sliding window method, the previous time steps can be used as input variables, and the next time steps can be used as the output variable.
In statistics and time series analysis, this is called a lag or lag method. The number of previous time steps is called the window width or size of the lag. This sliding window is the basis for how we can turn any time series dataset into a supervised learning problem.
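A small sketch of the transformation (the window width of 3 is chosen arbitrarily):

import numpy as np

def sliding_window(series, width):
    # Each row of X holds `width` past values; y holds the next value
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = np.array(series[width:])
    return X, y

X, y = sliding_window([1, 2, 3, 4, 5, 6], width=3)
print(X)   # [[1 2 3] [2 3 4] [3 4 5]]
print(y)   # [4 5 6]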
A subquery is a query inside another query, defined to retrieve data or information back from the database. The outer query is called the main query, whereas the inner query is called the subquery. Subqueries are always executed first, and the result of the subquery is passed on to the main query. A subquery can be nested inside a SELECT, UPDATE, or any other query, and it can use comparison operators such as >, < or =.
4. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and Workbook?
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A worksheet contains a single view along with shelves, legends, and the Data pane.
Random forest outperforms decision trees, and it also does not have the habit of overfitting
the data as decision trees do.
A decision tree trained on a specific dataset will become very deep and cause overfitting. To
create a random forest, decision trees can be trained on different subsets of the training
dataset, and then the different decision trees can be averaged with the goal of decreasing
the variance.
Disadvantages of Naive Bayes:
It relies on a very big assumption that the independent variables are not related to each other.
It is generally not suitable for datasets with large numbers of numerical attributes.
If a rare case is not in the training dataset but is in the testing dataset, it will almost certainly be predicted wrongly.
————————————————————-
Date: 30-07-2023
1. You are given a data set consisting of variables with more than 30 percent missing values.
How will you deal with them?
Ans.
If the data set is large, we can simply remove the rows with missing data values; this is the quickest way, and we then use the rest of the data to predict the values.
For smaller data sets, we can substitute missing values with the mean or mode, or use forward or backward fill. There are different ways to do so, such as df.mean() combined with df.fillna(df.mean()).
Ans. Hypothesis testing is defined as the process of choosing hypotheses for a particular probability distribution on the basis of observed data. Hypothesis testing is a core and important topic in statistics.
The Alternative Hypothesis is a statistical hypothesis used in hypothesis testing which states that there is a significant difference between the set of variables. It is often referred to as the hypothesis other than the null hypothesis and is denoted by H1 (H-one). The acceptance of the alternative hypothesis depends on the rejection of the null hypothesis: unless the null hypothesis is rejected, the alternative hypothesis cannot be accepted.
Q3. Why use Decision Trees?
Ans. First, a decision tree is a visual representation of a decision situation (and hence aids
communication). Second, the branches of a tree explicitly show all those factors within the
analysis that are considered relevant to the decision (and implicitly those that are not).
Observational data comes from observational studies which are when you observe certain
variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you control certain
variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control
group lives their lives normally. The test group is told to drink a glass of wine every night for
30 days. Then research can be conducted to see how wine affects sleep.
Ans. The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.
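A quick simulation: sample means drawn from a clearly non-normal (exponential) population still cluster normally around the population mean (the population and sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal population
sample_means = [rng.choice(population, size=50, replace=True).mean() for _ in range(2000)]
print(np.mean(sample_means))   # close to the population mean (about 2.0)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2.0 / sqrt(50), about 0.28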
————————————————————-
Date: 25-07-2023
Company: Adobe
• In overfitting, the model performs well on the training data but fails to generalize to any new data. In underfitting, the model is too simple and is not able to identify the correct relationship. The bias and variance conditions are as follows.
• Overfitting – Low bias and High Variance results in the overfitted model. The decision tree
is more prone to Overfitting.
• Underfitting – High bias and Low Variance. Such a model doesn’t perform well on test data
also. For example – Linear Regression is more prone to Underfitting.
Ans: Complex models, like the Random Forest, Neural Networks, and XGBoost are more
prone to overfitting. Simpler models, like linear regression, can overfit too – this typically
happens when there are more features than the number of instances in the training data.
Ans: We need to perform Feature Scaling when we are dealing with Gradient Descent
Based algorithms (Linear and Logistic Regression, Neural Network) and Distance-based
algorithms (KNN, K-means, SVM) as these are very sensitive to the range of the data points.
The values of a logistic function will range from 0 to 1. The values of Z will vary from -infinity
to +infinity.
A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity.
————————————————————-
Date: 18-07-2023
Q. How will you measure the Euclidean distance between the two arrays in NumPy?
A. The Euclidean distance in mathematics is the magnitude or length of the line segment between two points. We first create two NumPy arrays, then compute the Euclidean distance directly using NumPy's linalg.norm() function.
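A minimal sketch of the two steps (the example coordinates are made up):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(np.linalg.norm(a - b))   # 5.0, the length of the segment between the two points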
A. Batch Gradient Descent is very slow on large training sets, since it performs calculations over the entire training set at each step; as a result, Batch GD becomes extremely computationally expensive. SGD is stochastic in nature: it picks a "random" instance of training data at each step and computes the gradient from it, which is significantly faster than Batch GD because there is much less data to process at once.
Q. What is root cause analysis?
————————————————————-
Date: 04-04-2023
Company: Reliance JIO
Topic: svm kernels, cross validation, relationship types, data types in tableau, NLP
Polynomial kernel - When you have discrete data that has no natural notion of smoothness.
Radial basis kernel - Create a decision boundary able to do a much better job of separating
two classes than the linear kernel.
2. What is Cross-Validation?
Cross-validation is a method of splitting all your data into three parts: training, testing, and validation data. Data is split into k subsets, and the model is trained on k-1 of those subsets.
The last subset is held for testing. This is done for each of the subsets. This is k-fold
cross-validation. Finally, the scores from all the k-folds are averaged to produce the final
score.
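As a sketch, scikit-learn's cross_val_score performs exactly this train-on-k-1, score-on-the-rest rotation (the iris dataset and logistic-regression model are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on 4 subsets, score on the held-out one, rotate, then average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())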
One-to-One - This can be defined as the relationship between two tables where each record
in one table is associated with the maximum of one record in the other table.
One-to-Many & Many-to-One - This is the most commonly used relationship where a record
in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are needed for
defining a relationship.
Data types in Tableau include: text (string) values, date values, date & time values, numerical values, boolean values, and geographic values.
The RNN is a stateful neural network, which means that it not only retains information from
the previous layer but also from the previous pass. Thus, this neuron is said to have
connections between passes, and through time.
For the RNN the order of the input matters due to being stateful. The same words with
different orders will yield different outputs.
RNN can be used for unsegmented, connected applications such as handwriting recognition
or speech recognition.
————————————————————-
Date - 05-07-2023
3. What is the meaning of KPI in statistics?
4. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of
the given dataset?
There are two main data structures supported by the pandas library: Series and DataFrame. Both are built on top of NumPy. A Series is a one-dimensional data structure, and a DataFrame is the two-dimensional data structure in pandas. There was one more axis label known as Panel, a three-dimensional data structure that included items, major_axis, and minor_axis; it has been removed from recent versions of pandas.
————————————————————-
Date: 30-06-2023
1. Can you explain how the memory cell in an LSTM is implemented computationally?
The memory cell in an LSTM is implemented with a forget gate, an input gate, and an output gate. The forget gate controls how much information from the previous cell state is forgotten. The input gate controls how much new information from the current input is allowed into the cell state. The output gate controls how much information from the cell state is allowed to pass out to the next hidden state.
A CTE (Common Table Expression) is a one-time result set that only exists for the duration
of the query. It allows us to refer to data within a single SELECT, INSERT, UPDATE,
DELETE, CREATE VIEW, or MERGE statement's execution scope. It is temporary because
its result cannot be stored anywhere and will be lost as soon as a query's execution is
completed.
3. List the advantages NumPy Arrays have over Python lists?
Python's lists, even though hugely versatile containers capable of a number of functions, have several limitations when compared to NumPy arrays. With lists it is not possible to perform vectorised operations, such as element-wise addition and multiplication. Lists also require that Python store the type information of every element, since they support objects of different types, which means type-dispatching code must be executed each time an operation on an element is performed.
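A small comparison on toy values:

import numpy as np

a_list, b_list = [1, 2, 3], [4, 5, 6]
print([x + y for x, y in zip(a_list, b_list)])   # lists need an explicit loop

a, b = np.array(a_list), np.array(b_list)
print(a + b)   # [5 7 9] element-wise addition in one vectorised call
print(a * b)   # [ 4 10 18] element-wise multiplication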
————————————————————
Date: 28-06-2023
A. The deviation is the difference between an observed value and the arithmetic mean of the group of values. The residual error is the difference between an observed value and the value predicted by the model.
A. The elbow approach is a popular way of determining the ideal value of K when using the K-Means clustering algorithm. The essential concept behind this method is to plot the cost for various values of K: as K grows there are fewer elements in each cluster and the cost falls, and the "elbow" point where the decrease levels off indicates a good choice of K.
A. Deleting rows with missing values; imputing missing values for continuous variables; imputing missing values for categorical variables; using algorithms that support missing values; predicting the missing values; imputation using the deep learning library Datawig; imputation using machine learning utilities like SimpleImputer, KNNImputer, etc.
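A minimal sketch using scikit-learn's SimpleImputer mentioned above (the tiny matrix is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy="mean")   # "median" or "most_frequent" also work
print(imputer.fit_transform(X))            # NaNs replaced by the column means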
————————————————————————-
Date: 25-06-2023
A. Bias is the difference between the average prediction of our model and the correct value
which we are trying to predict. Variance is the variability of model prediction for a given data
point or a value which tells us spread of our data. If our model is too simple and has very few
parameters then it may have high bias and low variance. On the other hand if our model has
large number of parameters then it’s going to have high variance and low bias. So we need
to find the right/good balance without overfitting and underfitting the data. This tradeoff in
complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more
complex and less complex at the same time.
Trimming/removing the outlier: In this technique, we remove the outliers from the dataset.
Quantile-based flooring and capping: In this technique, the outlier is capped at a certain value above the 90th percentile or floored at a value below the 10th percentile (see the sketch after this list).
Mean/Median imputation : As the mean value is highly influenced by the outliers, it is
advised to replace the outliers with the median value.
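A short pandas sketch of the quantile-based flooring and capping technique (the series values are made up; the 10th/90th percentiles follow the text above):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])   # 100 is an outlier
lower, upper = s.quantile(0.10), s.quantile(0.90)
print(s.clip(lower=lower, upper=upper))           # values floored/capped at the quantiles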
————————————————————————-
Date: 16-06-2023
Company - Hitachi
1. What is the difference between supervised learning and unsupervised learning? Give
concrete examples.
Supervised learning uses labeled data to learn a mapping from inputs to known outputs. For example, if I had a dataset with two variables, age (input) and height (output), I could implement a supervised learning model to predict the height of a person based on their age.
Unlike supervised learning, unsupervised learning is used to draw inferences and find
patterns from input data without references to labeled outcomes. A common use of
unsupervised learning is grouping customers by purchasing behavior to find target markets.
Ans: You would perform hypothesis testing to determine statistical significance. First, you
would state the null hypothesis and alternative hypothesis.
Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the significance level (alpha), and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.
Ans: The Law of Large Numbers is a theorem that states that as the number of trials increases, the average of the results will become closer to the expected value.
E.g., the proportion of heads when flipping a fair coin 100,000 times should be closer to 0.5 than when flipping it 100 times.
4.If a Company says that they want to double the number of ads in Newsfeed, how would
you figure out if this is a good idea or not?
Ans: You can perform an A/B test by splitting the users into two groups: a control group with the normal number of ads and a test group with double the number of ads. Then you would choose the metric to define what a "good idea" is. For example, we can say that the null hypothesis is that doubling the number of ads won't have any impact on the time spent on Facebook, and the alternative hypothesis is that doubling the number of ads will reduce the time spent on Facebook. However, you can choose a different metric like the number of active users or the churn rate. Then you would conduct the test and determine the statistical significance of the test to reject or not reject the null.
Date: 13-06-2023
Company: Cloudera
Q1 What are Loss Function and Cost Functions? Explain the key Difference Between them?
When calculating loss we consider only a single data point, then we use the term loss
function.
Whereas, when calculating the sum of error for multiple data then we use the cost function.
There is no major difference.
In other words, the loss function is to capture the difference between the actual and
predicted values for a single record whereas cost functions aggregate the difference for the
entire training dataset.
The most commonly used loss functions are mean squared error and hinge loss.
Mean Squared Error (MSE): In simple words, it measures how far the model's predicted values are from the actual values, on average.
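A one-function NumPy sketch of MSE (the example values are made up):

import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences between actual and predicted values
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))   # about 1.417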
SVM stands for support vector machine. They are used for classification and prediction
tasks. SVM consists of a separating plane that discriminates between the two classes of
variables. This separating plane is known as hyperplane. Some of the kernels used in SVM
are –
Polynomial Kernel
Gaussian Kernel
Sigmoid Kernel
Hyperbolic Kernel
Ans. The following are the four significant subsets of the SQL:
Data definition language (DDL): It defines the data structure that consists of commands like
CREATE, ALTER, DROP, etc.
Data manipulation language (DML): It is used to manipulate existing data in the database.
The commands in this category are SELECT, UPDATE, INSERT, etc.
Data control language (DCL): It controls access to the data stored in the database. The
commands in this category include GRANT and REVOKE.
Transaction Control Language (TCL): It is used to deal with the transaction operations in the
database. The commands in this category are COMMIT, ROLLBACK, SET TRANSACTION,
SAVEPOINT, etc.
Shallow copies are faster to make. However, they handle pointers and references in a "lazy" manner: a shallow copy just copies the pointer value rather than producing a fresh copy of the underlying data the pointer refers to. As a result, the original and the copy can both hold pointers that refer to the same underlying data. A deep copy clones the underlying data completely, so it is not shared between the original and the copy.
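A concrete Python illustration with the standard copy module:

import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)       # copies only the outer list; inner lists are shared
deep = copy.deepcopy(original)      # clones the underlying data completely

original[0].append(99)
print(shallow[0])   # [1, 2, 99]  the shallow copy sees the change
print(deep[0])      # [1, 2]      the deep copy does not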
Date: 30-05-2023
Company: Twitter
The goal in machine learning is to strike a balance between bias and variance to achieve good generalization performance. Ideally, we want a model that can capture the underlying patterns and complexity of the data while avoiding overfitting or underfitting. This balance is known as the bias-variance trade-off.
Increasing the complexity of a model, such as using more features or increasing the model's capacity (e.g., adding more layers in a neural network), typically reduces bias and increases variance. Conversely, reducing the complexity of a model, such as using simpler algorithms or reducing the number of features, generally increases bias and reduces variance.
To find the optimal trade-off, one can employ various techniques such as cross-validation, regularization, or ensemble methods. Cross-validation helps evaluate the model's performance on unseen data and aids in selecting the appropriate complexity level. Regularization techniques add constraints to the model to reduce overfitting. Ensemble methods combine multiple models to leverage the strengths of each model while mitigating their weaknesses. Understanding the bias-variance trade-off is crucial for selecting the appropriate models, optimizing hyperparameters, and ensuring the model's ability to generalize well to unseen data.
2. What are the differences between shallow neural networks and deep neural networks?
Shallow Neural Networks: Shallow neural networks have a small number of hidden layers
(typically one or two). They are relatively simpler and have fewer parameters to learn.
Shallow networks are suitable for simpler tasks or datasets with fewer complex patterns.
Deep Neural Networks: Deep neural networks have a large number of hidden layers (more
than two). They are more complex and have a significantly higher number of parameters.
Deep networks can learn intricate patterns and hierarchies of features from data. They are
more suitable for complex tasks such as image recognition, natural language processing,
and speech recognition.
To handle missing values in SQL queries, you can use the COALESCE or ISNULL function
to replace NULL values with a specified default value, or you can use the WHERE clause to
filter out rows with missing values using the IS NULL or IS NOT NULL condition.
4. How would you create a dashboard in Tableau to visualize sales performance over time?
To create a dashboard in Tableau to visualize sales performance over time, you would:
- Drag a sales measure to the Rows or Columns shelf to represent the sales value.
- Add any additional dimensions or measures to the visualization for more detailed insights.
- Customize the visual appearance, including axis labels, titles, and formatting.
- Create a new dashboard and add the sales performance visualization to it.
- Arrange and resize the visualizations on the dashboard to create an intuitive and visually
appealing layout.
- Save and publish the dashboard for others to view and interact with.
Date: 17-05-2023
2. What is the ACID property in a database?
• Atomicity means that if any part of a transaction fails, the whole transaction fails and the database state remains unchanged.
• Durability ensures that once a transaction is committed, it will occur regardless of what
happens in between such as a power outage, fire, or some other kind of disturbance.
Answer: One common approach is to impute the missing values using statistical techniques
such as mean, median, or mode imputation. Another approach is to use machine learning
algorithms that can handle missing values, such as decision trees or k-nearest neighbors. If
the missing values are related to a specific feature or subset of the data, we can also
consider using a more targeted approach, such as regression imputation or multiple
imputation.
2. Can you explain the difference between supervised and unsupervised learning?
Answer: Supervised learning is a type of machine learning where the algorithm learns from
labeled data, where the target variable or outcome is already known. The algorithm uses this
labeled data to identify patterns and relationships, and then applies this learning to new,
unseen data to make predictions or classifications. In contrast, unsupervised learning is
used when there is no target variable or outcome variable to be predicted, and the algorithm
is tasked with finding patterns and relationships within the data on its own.
3. How do you decide which machine learning algorithm to use for a particular problem?
Answer: The choice of machine learning algorithm depends on several factors, including the
type of problem, the nature of the data, and the available computational resources. Some
common factors to consider when selecting a machine learning algorithm include the size of
the dataset, the number of features, the desired level of interpretability, the complexity of the
relationships between the features and the target variable, and the computational resources
available for training and testing the model.
Date: 02-05-2023
Average pooling: Computes the average value of the patch of the feature map covered by the kernel or filter.
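A small NumPy sketch of non-overlapping average pooling (the kernel size of 2 and the 4 x 4 feature map are illustrative; dimensions are assumed divisible by the kernel size):

import numpy as np

def average_pool(feature_map, k=2):
    # Non-overlapping k x k average pooling with stride k
    h, w = feature_map.shape
    return feature_map.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)
print(average_pool(fm))   # each output value is the mean of a 2 x 2 patch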
The WHERE clause specifies the criteria which individual records must meet to be selected by a query; it can be used without the GROUP BY clause. The HAVING clause cannot be used without the GROUP BY clause. The WHERE clause selects rows before grouping, while the HAVING clause selects rows after grouping. The WHERE clause cannot contain aggregate functions; the HAVING clause can.
In relative referencing, a formula changes when it is copied from one cell to another, with respect to the destination cell's address; this type of referencing is the default, and it doesn't require a dollar sign in the formula. In absolute cell referencing, by contrast, the formula does not change when copied, irrespective of the destination cell.
Date: 26-04-2023
Click the drop down to the right of Dimensions on the Data pane and select “Create >
Calculated Field” to open the calculation editor.
Date - 04-04-2023
Company: Reliance JIO
Topic: svm kernels, cross validation, relationship types, data types in tableau
Polynomial kernel - When you have discrete data that has no natural notion of smoothness.
Radial basis kernel - Create a decision boundary able to do a much better job of separating
two classes than the linear kernel.
2. What is Cross-Validation?
Cross-validation is a method of assessing how well a model generalizes on limited data.
The data is split into k subsets (folds), and the model is trained on k-1 of those subsets.
The last subset is held out for testing. This is done for each of the subsets; this is k-fold
cross-validation. Finally, the scores from all the k folds are averaged to produce the final
score.
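A minimal scikit-learn sketch of k-fold cross-validation (the choice of k = 5 and of the model is an assumption):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # the final score: the average over the 5 folds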
One-to-One - This can be defined as the relationship between two tables where each record
in one table is associated with at most one record in the other table.
One-to-Many & Many-to-One - This is the most commonly used relationship where a record
in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are needed for
defining a relationship.
Data types in Tableau:
Text (string) values
Date values
Date & time values
Numerical values
Boolean values
Geographic values
Date - 01-04-2023
Ans: Decorators are used to add functionality to a function without changing its structure.
Decorators are generally defined before the function they enhance. To apply a decorator,
we first define the decorator function, then write the function it is applied to, and simply add
the decorator above that function using the @ symbol.
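A minimal sketch; the decorator log_calls and the function add are made up for illustration:

def log_calls(func):                       # the decorator function
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")  # behavior added around the call
        return func(*args, **kwargs)
    return wrapper

@log_calls                                 # applied with the @ symbol
def add(a, b):
    return a + b

print(add(2, 3))  # prints "calling add", then 5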
• Atomicity means that if any part of a transaction fails, the whole transaction fails and the
database state is left unchanged.
• Durability ensures that once a transaction is committed, it remains committed regardless
of what happens afterwards, such as a power outage, fire, or some other kind of disturbance.
4. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of
the given dataset?
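In brief: one-hot encoding creates one binary column per category, so it increases the dimensionality of the dataset, whereas label encoding replaces the categories with integers in a single column, leaving the dimensionality unchanged. A minimal pandas/scikit-learn sketch (the column values are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Goa"]})

one_hot = pd.get_dummies(df["city"])               # one-hot: 3 new columns, one per category
labels = LabelEncoder().fit_transform(df["city"])  # label encoding: 1 integer column

print(one_hot.shape)  # (4, 3) - dimensionality grows with the category count
print(labels)         # [2 0 2 1] - dimensionality unchanged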
Date- 24-03-2023
Company name: Kenstar
Autoencoders are artificial neural networks that learn without any supervision. These
networks learn automatically by mapping inputs to the corresponding outputs, and they
consist of two parts:
Encoder: Used to compress the input into an internal computational state (a latent
representation)
Decoder: Used to convert the computational state back into the output
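A compact Keras sketch of this encoder/decoder pairing (assuming TensorFlow; the 784 -> 32 -> 784 layer sizes are illustrative):

from tensorflow.keras import layers, models

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(32, activation="relu"),      # encoder: compress to a latent state
    layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct the input
])
# Trained to reproduce its own input, so no labels are needed.
autoencoder.compile(optimizer="adam", loss="mse")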
K-Means is unsupervised. K-Means is a clustering algorithm: the points in each cluster are
similar to each other, and each cluster is different from its neighboring clusters. KNN is
supervised in nature. KNN is a classification algorithm; it classifies an unlabeled observation
based on its K (can be any number) surrounding neighbors.
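A side-by-side scikit-learn sketch of the two (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)                # unsupervised: ignores y
predictions = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(X)  # supervised: needs y
print(clusters[:5], predictions[:5])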
A stored procedure that calls itself until a boundary condition is reached is called a recursive
stored procedure. This recursion helps programmers deploy the same set of code as many
times as required. Some SQL programming languages limit the recursion depth to prevent
an infinite loop of procedure calls from causing a stack overflow, which slows down the
system and may lead to system crashes.
4. SUM() vs SUMX(): What is the difference between the two DAX functions in Power BI?
The SUM() function aggregates an entire data column, whereas SUMX() is an iterator
function that evaluates an expression row by row, which lets you filter or transform the data
you are adding.
Syntax: SUMX(Table, Expression), where Table contains the rows for the calculation and
Expression is a calculation that will be evaluated on each row of the table.
Date: 20-03-2023
It combines multiple models together to get the final output or, to be more precise, it
combines multiple decision trees together to get the final output. So, decision trees are the
building blocks of the random forest model.
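A minimal scikit-learn sketch (the choice of 100 trees is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# Each of the 100 decision trees votes; the forest returns the majority class.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))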
2. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but are often
misunderstood. Both of them deal with data. Data Science is a broad field that deals with
large volumes of data and allows us to draw insights out of this voluminous data. Machine
Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals
with data, but here, we are solely focused on learning how to convert the processed data
into a functional model, which can be used to map inputs to outputs, e.g., a model that takes
an image as input and tells us whether that image contains a flower.
In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a
kernel function takes data as input and converts it into a required form. This transformation
of the data is based on something called a kernel trick, which is what gives the kernel
function its name. Using the kernel function, we can transform the data that is not linearly
separable (cannot be separated using a straight line) into one that is linearly separable.
Date: 17-03-2023
1. Which of the following machine learning algorithms can be used for inputting missing
values of both categorical and continuous variables?
K-means clustering
Linear regression
K-nearest neighbors (K-NN)
Decision trees
Ans - The K-nearest neighbor algorithm can be used, because it imputes a missing value by
computing the nearest neighbors based on all the other features.
With K-means clustering or linear regression, you need to handle missing values in your
pre-processing, otherwise they will crash. Decision trees have the same problem, although
there is some variance between implementations.
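scikit-learn also ships a KNN-based imputer for numeric features; a minimal sketch:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],  # missing value to be imputed
              [7.0, 6.0]])

# Each missing entry is filled in from the values of its nearest neighbors.
print(KNNImputer(n_neighbors=2).fit_transform(X))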
2. What is an RNN (recurrent neural network)?
Ans - An RNN is a neural network that works on sequential data. RNNs are used in language
translation, voice recognition, image captioning, etc. There are different types of RNN
architectures, such as one-to-one, one-to-many, many-to-one and many-to-many. RNNs are
used in Google's Voice Search and Apple's Siri.
Ans - This is a statistical hypothesis test for randomized experiments with two variants, A
and B. The objective of A/B testing is to detect the effect of any change to a web page so as
to maximize or increase the outcome of a strategy.
Ans - The star schema is the most fundamental and the simplest of the data mart schemas.
It is called a star schema because its physical model resembles a star shape, with a fact
table at its center and the dimension tables at its periphery representing the star's points.
Date: 24-02-2023
Time series can be phrased as supervised learning. Given a sequence of numbers for a time
series dataset, we can restructure the data to look like a supervised learning problem.
In the sliding window method, the previous time steps can be used as input variables, and
the next time steps can be used as the output variable.
In statistics and time series analysis, this is called a lag or lag method. The number of
previous time steps is called the window width or size of the lag. This sliding window is the
basis for how we can turn any time series dataset into a supervised learning problem.
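A minimal pandas sketch of the sliding window method with a window width of 1 (the series values are made up):

import pandas as pd

series = pd.Series([10, 20, 30, 40, 50])
df = pd.DataFrame({
    "X (t-1)": series.shift(1),  # the previous time step becomes the input
    "y (t)": series,             # the current time step becomes the output
}).dropna()                      # the first row has no lag value
print(df)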
A subquery is a query inside another query, defined to retrieve data or information from the
database. The outer query is called the main query, whereas the inner query is called the
subquery. Subqueries are always executed first, and the result of the subquery is passed
on to the main query. A subquery can be nested inside a SELECT, UPDATE or any other
query, and it can use comparison operators such as >, < or =.
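A small sketch using Python's built-in sqlite3 module; the salaries table is made up for illustration. The inner query runs first and its result feeds the outer (main) query:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, salary REAL)")
conn.executemany("INSERT INTO salaries VALUES (?, ?)",
                 [("Asha", 50), ("Ravi", 80), ("Meera", 65)])

rows = conn.execute(
    "SELECT name FROM salaries "
    "WHERE salary > (SELECT AVG(salary) FROM salaries)"  # subquery runs first
).fetchall()
print(rows)  # [('Ravi',)]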
4. Explain the Difference Between Tableau Worksheet, Dashboard, Story, and Workbook?
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A worksheet contains a single view along with shelves, legends, and the Data pane.
Date: 23-02-2023
The idea of stacking is to learn several different weak learners and combine them by training
a meta-model to output predictions based on the multiple predictions returned by these weak
models.
If a stacking ensemble is composed of L weak learners, then to fit the model the following
steps are followed:
Choose L weak learners and fit them to the data of the first fold.
For each of the L weak learners, make predictions for observations in the second fold.
Fit the meta-model on the second fold, using predictions made by the weak learners as
inputs.
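scikit-learn's StackingClassifier automates a cross-validated variant of these steps; a minimal sketch in which the choice of weak learners and meta-model is purely illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],       # the L weak learners
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-model
    cv=2,  # roughly the two-fold scheme described above
)
print(stack.fit(X, y).score(X, y))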
2. Can you provide me examples of when a scatter graph would be more appropriate than a
line chart or vice versa?
A scatter graph would be more appropriate than a line chart when you are looking to show
the relationship between two variables that are not linearly related. For example, if you were
looking to show the relationship between a person’s age and their weight, a scatter graph
would be more appropriate than a line chart. A line chart would be more appropriate than a
scatter graph when you are looking to show a trend over time. For example, if you were
looking at the monthly sales of a company over the course of a year, a line chart would be
more appropriate than a scatter graph.
When data is ingested into Power BI, it is basically stored in Fact and Dimension tables.
Fact tables: The central table in a star schema of a data warehouse, a fact table stores
quantitative information for analysis and is not normalized in most cases.
Dimension tables: It is just another table in the star schema that is used to store attributes
and dimensions that describe objects stored in a fact table.
After any variable declarations, DECLARE the cursor. A SELECT statement must always be
coupled with the cursor definition.
OPEN the cursor to populate the result set. The OPEN statement must be executed before
rows can be fetched from the result set.
Use the FETCH command to retrieve a row and move to the next row in the result set.
Finally, CLOSE the cursor and use the DEALLOCATE command to remove the cursor
definition and free up the resources connected with it.