Apache Spark: Core Concepts
• Apache Spark revolves around several core concepts that form the foundation of its distributed
computing model. Understanding these concepts is essential for effectively using Spark for various
data processing tasks. Here are the key core concepts:
• Resilient Distributed Dataset (RDD): RDD is the fundamental data structure in Spark, representing
an immutable, distributed collection of objects partitioned across multiple nodes in a cluster.
• RDDs support two types of operations: transformations and actions. Transformations are lazy and create new
RDDs from existing ones (e.g., map, filter), while actions trigger computation and return results to
the driver program (e.g., reduce, collect), as in the sketch below.
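A minimal PySpark sketch of the transformation/action split; the data and app name are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # local mode; app name is illustrative

nums = sc.parallelize([1, 2, 3, 4, 5])        # build an RDD from a local list
squares = nums.map(lambda x: x * x)           # transformation: returns a new RDD
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still no work done

# Actions trigger execution and ship results back to the driver.
print(evens.collect())                       # [4, 16]
print(squares.reduce(lambda a, b: a + b))    # 1 + 4 + 9 + 16 + 25 = 55

sc.stop()
```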
• Directed Acyclic Graph (DAG): Spark organizes the operations in an application into a
Directed Acyclic Graph (DAG) of stages. Each stage groups transformations that can be pipelined
together; stage boundaries fall where a shuffle (wide dependency) moves data between nodes.
• Because transformations are lazy, Spark sees the whole DAG before executing anything and uses
it to plan and optimize the stages it runs (see the lineage sketch below).
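The stage structure can be inspected through an RDD's lineage. A small sketch: reduceByKey introduces a shuffle, so toDebugString reports the stage split (the exact output format varies by Spark version):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-demo")

words = sc.parallelize(["spark", "dag", "spark", "stage"])
counts = (words.map(lambda w: (w, 1))             # narrow: pipelined in one stage
               .reduceByKey(lambda a, b: a + b))  # wide: shuffle starts a new stage

# The lineage shows the DAG Spark built; the indentation marks the stage split.
print(counts.toDebugString().decode("utf-8"))
print(counts.collect())  # [('spark', 2), ('dag', 1), ('stage', 1)] (order may vary)

sc.stop()
```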
• SparkContext:
• SparkContext is the main entry point for Spark functionality in a Spark application. It represents the connection to
the Spark cluster and is responsible for coordinating the execution of operations on the cluster.
• By convention the SparkContext is named sc; the interactive shells (spark-shell, pyspark) create it
automatically, while standalone Scala and Python applications create it explicitly, as sketched below.
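A minimal sketch of explicit creation from a SparkConf; the master URL and app name are placeholders:

```python
from pyspark import SparkConf, SparkContext

# The pyspark/spark-shell REPLs provide a ready-made `sc`;
# standalone applications build their own.
conf = (SparkConf()
        .setAppName("my-app")     # placeholder name
        .setMaster("local[2]"))   # replace with your cluster's master URL
sc = SparkContext(conf=conf)

print(sc.version)  # the Spark version the context is connected to
sc.stop()
```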
• DataFrames and Datasets:
• DataFrames and Datasets are high-level distributed data abstractions built on top of RDDs, introduced in Spark 1.3
and 1.6, respectively.
• DataFrames represent structured data with named columns, similar to tables in a relational database. They offer
optimizations and additional APIs for structured data processing.
• Datasets are strongly-typed distributed collections of objects, available in Scala and Java, providing compile-time
type safety and optimizations.
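A small DataFrame sketch, assuming Spark 2.x+ where a SparkSession (which wraps SparkContext) is the entry point for the structured APIs; the rows are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# A DataFrame: distributed rows with named columns, like a relational table.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 28)],
    ["name", "age"],   # column names; types are inferred
)

df.filter(df.age > 30).select("name").show()
# +-----+
# | name|
# +-----+
# |alice|
# +-----+

spark.stop()
```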
• Spark SQL:
• Spark SQL is a module in Spark for working with structured data, providing support for
querying structured data using SQL-like syntax and processing data with DataFrames and
Datasets.
• Spark SQL integrates seamlessly with other Spark components, allowing users to combine
SQL queries, DataFrame/Dataset operations, and RDD transformations/actions in a single
application.
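A sketch of that integration: a made-up DataFrame is queried with SQL through a temporary view, and the result, itself a DataFrame, is then processed with an RDD transformation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 28)], ["name", "age"])

# Expose the DataFrame to SQL under a table-like name.
df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name FROM people WHERE age > 30")

# SQL results are DataFrames, so the APIs compose freely.
print(adults.rdd.map(lambda row: row.name.upper()).collect())  # ['ALICE']

spark.stop()
```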
• Spark Streaming:
• Spark Streaming is a scalable, fault-tolerant stream processing engine built on top of Spark Core. It
processes live data streams in small micro-batches, exposed through a high-level abstraction called
discretized streams (DStreams).
• A DStream is a continuous sequence of RDDs, one per micro-batch, so users can apply familiar RDD
transformations and actions to streaming data.
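A classic DStream word-count sketch; it assumes a text source on localhost:9999 (e.g. started with `nc -lk 9999`), and the 5-second batch interval is arbitrary:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")  # >= 2 threads: 1 for the receiver
ssc = StreamingContext(sc, 5)                  # one RDD per 5-second micro-batch

lines = ssc.socketTextStream("localhost", 9999)     # assumed socket source
counts = (lines.flatMap(lambda line: line.split())  # RDD-style ops per batch
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```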
• Machine Learning Library (MLlib):
• MLlib is Spark's scalable machine learning library, providing a rich set of algorithms and tools for building
machine learning models at scale.
• MLlib supports common machine learning tasks such as classification, regression, clustering, collaborative
filtering, and dimensionality reduction.
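A minimal classification sketch with the DataFrame-based API (pyspark.ml); the four training rows are toy data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy (label, features) rows, purely illustrative.
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (0.0, Vectors.dense([0.1, 1.2])),
    (1.0, Vectors.dense([1.9, 0.8])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)  # fitting runs as distributed Spark jobs
model.transform(train).select("label", "prediction").show()

spark.stop()
```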