Delta Table and PySpark Interview Questions

Data lakes allow organizations to store large amounts of raw data from various sources in a centralized, cost-effective manner. This data can then be accessed and analyzed by different teams to improve decision-making. While data warehouses are optimized for structured queries and reports, data lakes are designed to handle large volumes of raw data from diverse sources for a variety of use cases like machine learning. Some key benefits of data lakes include scalability, lower costs, and enabling self-service access to data for various users and teams.


Data Lakes Interview Questions

1. Why do we need a Data Lake?


Data is typically saved in raw form without being fine-tuned
or structured first. It can then be cleaned and optimized for
the intended purpose: a dashboard for interactive analytics,
downstream machine learning, or other analytics applications.
The data lake infrastructure also gives users and developers
self-service access to previously siloed information, and it
allows the data team to collaborate on the same data, which
can then be curated and secured for the appropriate team or
operation. For these reasons, a data lake is now a critical
component for businesses migrating to modern data platforms
to scale their data operations and machine learning
initiatives.

2. How are Data Lakes different from Data Warehouses?


While data lakes and warehouses both store data, they are
optimized for different purposes. Consider them complementary
rather than competing tools, as businesses may require both.
Data lakes hold large volumes of raw data in its native format
and suit exploratory analytics and machine learning. Data
warehouses, on the other hand, are frequently ideal for the
repeatable reporting and analysis common in business
practices, such as monthly sales reports, sales tracking by
region, or website traffic.

3. What are the advantages of using a Data Lake?


A data lake is a cost-effective and scalable way to store
large amounts of data. A data lake can also provide access to
data for analytics and decision-making.

4. Why do big tech companies use and invest in Data Lakes?


A data lake is a big data technology that allows businesses to
store large amounts of data centrally. This data can then be
accessed and analyzed by various departments within the
company, allowing for better decision-making and a more
comprehensive view of the company’s data.

5. How can Data Lakes be used for Data and Analytics?


Data Lakes are a critical component of any organization’s data
strategy. Data lakes make organizational data from various
sources available to end-users, such as business analysts,
data engineers, data scientists, product managers, executives,
etc. In turn, these personas use data insights to improve
business performance cost-effectively. Indeed, many types of
advanced analytics are currently only possible in data lakes.

6. Where should the metadata of a Data Lake be stored?


The metadata for a data lake should be kept centrally and
easily accessible to all users. This ensures that everyone can
find and use the metadata when needed.

7. What distinguishes the Data Lakehouse from a Data Lake?


A data lake is a central repository for almost any raw data.
Structured, unstructured, and semi-structured data can all be
dumped into a data lake quickly before being processed for
validation, sorting, summarisation, aggregation, analysis,
reporting, or classification.

A data lakehouse is a more recent data management architecture
that combines data lakes’ flexibility, open format, and
cost-effectiveness with data warehouses’ accessibility,
management, and advanced analytics support.

The lakehouse addresses the fundamental issues that turn data
lakes into data swamps. It includes ACID transactions to
ensure consistency when multiple parties read or write data
simultaneously. It supports DW schema architectures such as
star/snowflake schemas and directly offers strong governance
and auditing mechanisms on the data lake.

8. Can we deploy and run a data lake on the cloud?


Yes, a data lake can be deployed and run in the cloud. One
option is using a cloud-based data management platform, such
as Amazon Web Services (AWS) Data Pipeline. This platform can
collect, process, and store data from various sources,
including on-premises and cloud-based data sources. A
cloud-based data warehouse, such as Amazon Redshift, is
another option for deploying a data lake in the cloud. This
platform can store data from various sources, including
on-premises data centers and cloud-based data sources.

9. What are the various types of metadata for a Data Lake?


A data lake can contain three types of metadata: structure
metadata, business metadata, and technical metadata. Structure
metadata describes the data’s organization, business metadata
describes the data’s meaning, and technical metadata describes
how the data was generated.

10. Why is data governance important?


The process of ensuring that data is accurate, consistent, and
compliant with organizational standards and regulations is
known as data governance. It is significant because it ensures
that data is high quality and can be used to make sound
decisions.

11. What are the challenges of a Data Lake?


Data governance, quality, and security are the primary
challenges associated with implementing a data lake solution.
Data governance ensures that the data in the data lake is
accurate, consistent, and compliant with applicable
regulations. Data
quality is the process of ensuring that data is clean and
usable for its intended purpose. Data security is the
protection of data from unauthorized access and misuse.

12. What are a Data Lake’s security and privacy compliance requirements?


There are ways to ensure compliance with security and privacy
requirements when using a data lake. One method is to encrypt
all data stored in the data lake. Another approach is to use
role-based access controls to limit who has access to what
data. Finally, activity logs can be created to track who is
accessing data and when.

--------------------------------------------------------------

PySpark Interview Questions

1. What would happen if we lose RDD partitions due to the failure of the worker node?


If any RDD partition is lost, that partition can be recomputed
using the lineage of operations from the original
fault-tolerant dataset.

2. Why are Partitions immutable in PySpark?


In PySpark, every transformation generates a new partition
rather than modifying an existing one. Partitions use the HDFS
API so that they are immutable, distributed, and
fault-tolerant. Partitions are also aware of data locality.

3. What are the key differences between an RDD, a DataFrame, and a DataSet?


Following are the key differences between an RDD, a DataFrame,
and a DataSet:

RDD

RDD is an acronym that stands for Resilient Distributed
Dataset. It is a core data structure of PySpark.
RDD is a low-level object that is highly efficient in
performing distributed tasks.
RDD is best for low-level transformations, operations, and
control on a dataset.
RDD is mainly used to manipulate data with functional
programming constructs rather than domain-specific
expressions.
If you have a dataset that needs to be computed repeatedly, it
can be efficiently cached as an RDD.
All DataFrames and Datasets in PySpark are built on top of
RDDs.
DataFrame

A DataFrame is equivalent to a relational table in Spark SQL.
It organizes data into a structure of rows and named columns.
If you are working in Python, it is best to start with
DataFrames and then switch to RDDs if you need more
flexibility.
One of the biggest disadvantages of DataFrames is the lack of
compile-time type safety: if the structure of the data is
unknown, schema errors are only caught at runtime.
DataSet
A Dataset is a distributed collection of data. It is an
extension of the DataFrame API.
Dataset is an interface added in Spark 1.6 to provide RDD
benefits on top of DataFrames.
DataSet uses efficient encoders and provides compile-time
type safety, unlike DataFrames.
DataSet can be used if you want typed JVM objects, which is
why the Dataset API is available only in Scala and Java, not
in PySpark.
By using DataSet, you can take advantage of Catalyst
optimization. You can also use it to benefit from Tungsten's
fast code generation.
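
For illustration, here is a minimal sketch (assuming a local
SparkSession; the data and column names are made up)
contrasting the RDD and DataFrame APIs on the same small
dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

# RDD: low-level, functional-style transformations on raw tuples.
rdd = sc.parallelize([("Alice", 34), ("Bob", 45)])
print(rdd.filter(lambda row: row[1] > 40).collect())

# DataFrame: the same data with named columns and declarative expressions.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()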

4. What do you understand by PySpark SparkContext?


SparkContext acts as the entry point to any spark
functionality. When the Spark application runs, it starts the
driver program, and the main function and SparkContext get
initiated. After that, the driver program runs the operations
inside the executors on worker nodes. In PySpark, SparkContext
is known as PySpark SparkContext. It uses the Py4J library to
launch a JVM and then creates a JavaSparkContext. In the
PySpark shell, the SparkContext is available by default as
'sc', so you do not need to create a new one.
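
As a rough sketch (the application name and master URL are
arbitrary), a standalone script can create its own
SparkContext like this; in the interactive pyspark shell this
step is unnecessary because 'sc' already exists:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("sparkcontext-demo")
sc = SparkContext(conf=conf)

numbers = sc.parallelize(range(10))
print(numbers.sum())  # the driver receives the result of the distributed sum

sc.stop()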

5. What is the usage of PySpark StorageLevel?


The PySpark StorageLevel is used to control the storage of
RDD. It controls how and where the RDD is stored. PySpark
StorageLevel decides if the RDD is stored on the memory, over
the disk, or both. It also specifies whether we need to
replicate the RDD partitions or serialize the RDD.
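
A small illustrative sketch (the dataset is made up, and a
SparkContext named sc is assumed) of persisting an RDD with an
explicit StorageLevel; MEMORY_AND_DISK keeps partitions in
memory and spills to disk when memory runs out:

from pyspark import StorageLevel

squares = sc.parallelize(range(1000000)).map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_AND_DISK)  # where/how partitions are kept

print(squares.count())  # first action materializes and caches the RDD
print(squares.sum())    # later actions reuse the cached partitions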

6. What is PySpark SparkConf?


PySpark SparkConf is mainly used if we have to set a few
configurations and parameters to run a Spark application on
the local/cluster. In other words, we can say that PySpark
SparkConf is used to provide configurations to run a Spark
application.
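
A minimal sketch, assuming a local run; the application name
and memory setting are arbitrary examples of configurations
passed through SparkConf:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("sparkconf-demo")
        .setMaster("local[4]")
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.executor.memory"))  # prints "2g"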

7. What are the most frequently used Spark ecosystems?


The most frequently used Spark ecosystems are:

Spark SQL for structured data processing (its predecessor was
known as Shark).
Spark Streaming for processing live data streams.
GraphX for generating and computing graphs.
MLlib for machine learning algorithms.
SparkR for using the R programming language on the Spark
engine.

8. What machine learning API does PySpark provide?


Just like Apache Spark, PySpark also provides a machine
learning API known as MLlib. MLlib supports the following
types of machine learning algorithms:

mllib.classification: This machine learning API supports
different methods for binary or multiclass classification and
regression analysis, such as Random Forest, Decision Tree,
Naive Bayes, etc.
mllib.clustering: This machine learning API solves clustering
problems by grouping subsets of entities based on their
similarity to one another.
mllib.fpm: FPM stands for Frequent Pattern Mining. This
machine learning API is used to mine frequent items,
subsequences, or other structures used for analyzing large
datasets.
mllib.linalg: This machine learning API is used to solve
problems on linear algebra.
mllib.recommendation: This machine learning API is used for
collaborative filtering and recommender systems.
spark.mllib: This machine learning API is used to support
model-based collaborative filtering where small latent factors
are identified using the Alternating Least Squares (ALS)
algorithm used for predicting missing entries.
mllib.regression: This machine learning API solves problems by
using regression algorithms that find relationships and
variable dependencies.
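
As a hedged sketch of one of these APIs (the toy points are
made up, a SparkContext named sc is assumed, and the newer
DataFrame-based pyspark.ml API is generally preferred today),
mllib.clustering can be used like this:

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)  # fit two clusters

print(model.clusterCenters)       # the learned centroids
print(model.predict([0.5, 0.5]))  # cluster id assigned to a new point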

9. What is PySpark Partition? How many partitions can you make in PySpark?


PySpark Partition is a method of splitting a large dataset
into smaller datasets based on one or more partition keys. It
enhances execution speed because transformations on
partitioned data run in parallel across partitions. PySpark
supports both partitioning in memory (DataFrame) and
partitioning on disk (file system). When we create a DataFrame
from a file or table, PySpark creates the DataFrame in memory
with a specific number of partitions based on the specified
criteria.

It also allows us to create partitions on multiple columns
using partitionBy() by passing the columns you want to
partition by as arguments to this method.

Syntax:

partitionBy(self, *cols)
In PySpark, it is recommended to have about 4x as many
partitions as the number of cores available to the
application in the cluster.
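
For illustration, a minimal sketch of disk partitioning with
DataFrameWriter.partitionBy (the output path, column names,
and data are assumptions, and a SparkSession named spark is
assumed):

df = spark.createDataFrame(
    [("IN", 2023, 10), ("US", 2023, 20), ("IN", 2024, 30)],
    ["country", "year", "amount"])

(df.write
   .partitionBy("country", "year")  # one sub-directory per partition value
   .mode("overwrite")
   .parquet("/tmp/sales_partitioned"))

# In-memory partitioning is controlled separately, e.g. df.repartition(8, "country")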

10. What do you understand by PySpark DataFrames?


PySpark DataFrames are distributed collections of
well-organized data. They are similar to relational database
tables and are organized into named columns. PySpark
DataFrames are better optimized than R or pandas DataFrames,
and they can be created from different sources like Hive
tables, structured data files, existing RDDs, external
databases, etc.

The biggest advantage of a PySpark DataFrame is that its data
is distributed across the different machines in the cluster,
and operations performed on it run in parallel on all the
machines. This makes it possible to handle large collections
of structured or semi-structured data ranging up to petabytes.

11. What is a Parquet file in PySpark?


In PySpark, a Parquet file is a columnar file format supported
by several data processing systems. Spark SQL can perform both
read and write operations on Parquet files.

The columnar storage used by Parquet provides the following
advantages:

It is small and consumes less space.
It lets us fetch only the specific columns we need to access.
It follows type-specific encoding.
It offers better-summarized data.
It requires fewer I/O operations.
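
A short sketch of writing and then reading Parquet with Spark
SQL (the path and data are illustrative, and a SparkSession
named spark is assumed):

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

df.write.mode("overwrite").parquet("/tmp/demo.parquet")  # columnar, compressed storage

back = spark.read.parquet("/tmp/demo.parquet")
back.select("label").show()  # column pruning: only the 'label' column is read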

12. What do you understand by a cluster manager? What are the different cluster manager types supported by PySpark?


In PySpark, a cluster manager is the cluster-mode platform on
which Spark runs; it provides all the resources that worker
nodes need according to their requirements.

A Spark cluster manager ecosystem contains a master node and
multiple worker nodes. The master node provides the worker
nodes with resources such as memory and processor allocation,
according to the nodes' requirements, with the help of the
cluster manager.

PySpark supports the following cluster manager types:

Standalone: a simple cluster manager that ships with Spark.
Apache Mesos: a cluster manager that can run Hadoop MapReduce
and PySpark applications.
Hadoop YARN: the cluster manager used in Hadoop 2.
Kubernetes: an open-source cluster manager that helps automate
the deployment, scaling, and management of containerized
applications.
local: a mode for running Spark applications on
laptops/desktops.
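
As a rough sketch, the cluster manager is usually chosen
through the master URL when the session is built; the
commented alternatives assume the corresponding cluster is
actually reachable, so only the local line will run as-is:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")                   # local mode on a laptop/desktop
         # .master("spark://host:7077")        # Standalone cluster manager
         # .master("yarn")                     # Hadoop YARN
         # .master("mesos://host:5050")        # Apache Mesos
         # .master("k8s://https://host:6443")  # Kubernetes
         .getOrCreate())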

13. Why is PySpark faster than pandas?


PySpark is faster than pandas because it supports the parallel
execution of statements in a distributed environment. For
example, PySpark can execute work on multiple cores and
machines, which is not possible with pandas. This is the main
reason why PySpark is faster than pandas.

14. What are UDFs in PySpark and when are they used?


UDFs, or User-Defined Functions, in PySpark are used to extend
the native functions of Spark by allowing custom processing or
transformation of data. UDFs are useful for complex operations
that are not readily available in Spark's built-in functions.
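
A minimal sketch of a UDF (the function, column names, and
data are made-up examples, and a SparkSession named spark is
assumed):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def shout(s):
    # plain Python logic applied to each value of the column
    return s.upper() + "!" if s is not None else None

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df.withColumn("loud", shout("word")).show()

Because UDFs are opaque to the Catalyst optimizer, built-in
functions should be preferred whenever an equivalent exists.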

15) What is the difference between get(filename) and getrootdirectory()?


The main difference is that SparkFiles.get(filename) is used
to retrieve the correct path of a file that was added through
SparkContext.addFile(), whereas SparkFiles.getRootDirectory()
is used to get the root directory containing the files added
through SparkContext.addFile().

16) What do you understand by SparkSession in PySpark?


In PySpark, SparkSession is the entry point to the
application. Before version 2.0, SparkContext was used as the
entry point; since PySpark 2.0, SparkSession has replaced
SparkContext and acts as the starting point for accessing all
PySpark functionality related to RDDs, DataFrames, Datasets,
etc. It is also a unified API that replaces SQLContext,
StreamingContext, HiveContext, and all the other contexts in
PySpark.

The SparkSession internally creates a SparkContext and a
SparkConf according to the details provided to it. You can
create a SparkSession using the builder pattern.
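
A minimal builder-pattern sketch (the application name and
configuration values are arbitrary):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparksession-demo")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())  # returns the existing session if one is already active

sc = spark.sparkContext  # the underlying SparkContext
spark.range(5).show()    # DataFrame work through the unified entry point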

17) What are the key advantages of PySpark RDD?


Following is the list of key advantages of PySpark RDD:

Immutability: PySpark RDDs are immutable. Once you create
them, you cannot modify them later; you have to create a new
RDD whenever you apply any transformation operation to an RDD.

Fault Tolerance: PySpark RDDs provide fault tolerance.
Whenever an operation fails, lost data is automatically
recomputed from the lineage of the available partitions. This
provides a seamless experience when executing PySpark
applications.

Partitioning: When we create an RDD from any data, its
elements are partitioned across the available cores by
default.

Lazy Evaluation: PySpark RDDs follow lazy evaluation.
Transformation operations are not performed as soon as they
are encountered; they are stored in the DAG and evaluated only
when the first RDD action is found.

In-Memory Processing: PySpark RDDs help load data from disk
into memory, and you can persist RDDs in memory to reuse the
computations.

18) Explain the common workflow of a Spark program.


The common workflow of a Spark program can be described in the
following steps:

In the first step, we create the input RDDs from external
data, which can be obtained from different data sources.
After creating the PySpark RDDs, we run RDD transformation
operations such as filter() or map() to create new RDDs
according to the business logic.
If we need any intermediate RDDs for reuse later, we can
persist those RDDs.
Finally, if any action operations such as first(), count(),
etc., are present, Spark launches them to initiate the
parallel computation.
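
A hedged end-to-end sketch of this workflow (the log lines are
invented sample data, and a SparkContext named sc is assumed):

lines = sc.parallelize(["error: disk full", "info: ok", "error: timeout"])  # input RDD
errors = lines.filter(lambda line: line.startswith("error"))                # transformation
errors.persist()                                                            # keep for reuse

print(errors.count())  # action: triggers the parallel computation
print(errors.first())  # a second action reuses the persisted RDD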

19) How can you implement machine learning in Spark?


We can implement machine learning in Spark by using MLlib.
Spark provides a scalable machine learning library called
MLlib. It makes machine learning scalable and straightforward,
with common learning algorithms and use cases such as
clustering, collaborative filtering, dimensionality reduction,
etc.

20) What do you understand by custom profilers in PySpark?


PySpark supports custom profilers. Custom profilers are used
for building predictive models and for reviewing data to
ensure that it is valid and fit for consumption. A custom
profiler has to define some of the following methods:

stats: This is used to return the collected stats of
profiling.
profile: This is used to produce a system profile of some
sort.
dump: This is used to dump the profiles to a specified path.
dump(id, path): This is used to dump a specific RDD id to the
path given.
add: This is used for adding a profile to the existing
accumulated profile. The profiler class has to be selected at
the time of SparkContext creation.

21) What do you understand by Spark driver?


The Spark driver is the program that runs on the master node
of the cluster. It is mainly used to declare actions and
transformations on data RDDs.

22) What is PySpark SparkJobInfo?


The PySpark SparkJobInfo is used to get information about the
SparkJobs that are in execution.

Following is the code for using the SparkJobInfo:

class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):

23) What are the main functions of Spark core?


The main task of Spark Core is to implement several vital
functions such as memory management, fault tolerance,
monitoring jobs, job scheduling, and communication with
storage systems. It also contains additional libraries, built
on top of the core, that are used for diverse workloads such
as streaming, machine learning, and SQL.

The Spark Core is mainly used for the following tasks:

Fault tolerance and recovery.


To interact with storage systems.
Memory management.
Scheduling and monitoring jobs on a cluster.

24) What do you understand by PySpark SparkStageInfo?


The PySpark SparkStageInfo is used to get information about
the SparkStages available at that time. Following is the code
used for SparkStageInfo:

class SparkStageInfo(namedtuple("SparkStageInfo", "stageId currentAttemptId name numTasks numActiveTasks numCompletedTasks numFailedTasks")):
25) What is the use of the Spark execution engine?


The Apache Spark execution engine is a graph execution engine
that enables users to analyze massive datasets with high
performance. You should cache the data in memory if it will be
manipulated through multiple stages of processing, in order to
improve performance significantly.

26) What is the use of Akka in PySpark?


Akka is used in PySpark for scheduling. When a worker requests
a task from the master after registering, the master assigns
it a task. In this case, Akka handles the messages sent and
received between the workers and masters.

27) What do you understand by the startsWith() and endsWith() methods in PySpark?

The startsWith() and endsWith() methods in PySpark belong to
the Column class and are used to search DataFrame rows by
checking if the column value starts with some value or ends
with some value. Both are used for filtering data in
applications.

startsWith() method: This method is used to return a Boolean
value. It shows TRUE when the column's value starts with the
specified string and FALSE when the match is not satisfied in
that column value.
endsWith() method: This method is used to return a Boolean
value. It shows TRUE when the column's value ends with the
specified string and FALSE when the match is not satisfied in
that column value. Both methods are case-sensitive.
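
A small sketch (the sample data is invented, and a
SparkSession named spark is assumed); note that in the Python
API the methods are spelled in lower case as startswith() and
endswith() on the Column class:

from pyspark.sql.functions import col

df = spark.createDataFrame([("Spark",), ("PySpark",), ("pandas",)], ["name"])

df.filter(col("name").startswith("Py")).show()   # rows whose name starts with "Py"
df.filter(col("name").endswith("Spark")).show()  # rows whose name ends with "Spark"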

28) What do you understand by RDD Lineage?


RDD lineage is the procedure used to reconstruct lost data
partitions. Spark does not support data replication in memory;
if any data is lost, it has to be rebuilt using the RDD
lineage. This works because an RDD always remembers how it was
constructed from other datasets.
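
As a quick illustration (assuming a SparkContext named sc),
toDebugString() shows the lineage Spark would replay to
recompute lost partitions:

rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)
print(rdd.toDebugString().decode())  # prints the chain of parent RDDs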

29) Can we create a PySpark DataFrame from external data sources?


Yes, we can create a PySpark DataFrame from external data
sources. Real-time applications use external storage systems
such as the local file system, HDFS, HBase, MySQL tables,
Amazon S3, Azure, etc. The following example shows how to
create a DataFrame by reading data from a CSV file present on
the local system:

df = spark.read.csv("/path/to/file.csv")
PySpark supports CSV, text, Avro, Parquet, TSV, and many other
file formats.

30) What are the main attributes used in SparkConf?


Following is the list of main attributes used in SparkConf:

-set(key, value): This attribute is used for setting a
configuration property.
-setSparkHome(value): This attribute sets the Spark
installation path on worker nodes.
-setAppName(value): This attribute is used for setting
the application name.
-setMaster(value): This attribute is used to set the
master URL.
-get(key, defaultValue=None): This attribute supports
getting a configuration value of a key.

31) How can you associate Spark with Apache Mesos?


We can use the following steps to associate Spark with Mesos:

-First, configure the Spark driver program to connect
to Mesos.
-The Spark binary package must be in a location
accessible to Mesos.
-After that, install Apache Spark in the same location
as Apache Mesos and configure the property
"spark.mesos.executor.home" to point to the location where it
is installed.

32) What are the main file systems supported by Spark?


Spark supports the following three file systems:

-Local File system.


-Hadoop Distributed File System (HDFS).
-Amazon S3

33) How can we trigger automatic cleanups in Spark to handle accumulated metadata?


We can trigger automatic cleanups in Spark by setting the
parameter 'spark.cleaner.ttl', or by splitting long-running
jobs into different batches and writing the intermediate
results to disk.

34) How can you minimize data transfers when working with
Spark?
We can minimize data transfers when working with Spark in the
following ways:

-Using broadcast variables
-Using accumulators

35) How is Spark SQL different from HQL and SQL?


Hive uses HQL (Hive Query Language), while Spark SQL uses
Structured Query Language for processing and querying data. In
Spark SQL, we can easily join SQL tables and HQL tables. Spark
SQL is a special component on the Spark Core engine that
supports SQL and Hive Query Language without changing any
syntax.

36) What is DStream in PySpark?


In PySpark, DStream stands for Discretized Stream. It is a
continuous stream of data represented as a sequence of RDDs
divided into small batches. DStreams are built on Spark RDDs
and are used to enable Spark Streaming to integrate seamlessly
with other Apache Spark components such as Spark MLlib and
Spark SQL.
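
A hedged word-count sketch over a socket source (the host and
port are assumptions; DStreams are the legacy streaming API,
and Structured Streaming is recommended for new code):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # DStream of text lines
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()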
