
RHadoop:
Data Analysis Using the MapReduce Technique in RHadoop

RHadoop
RHadoop is a collection of packages that allow users to interact with Hadoop, a
popular open-source distributed computing framework, using the R programming
language.
R is commonly used for statistical analysis and data visualization, while Hadoop is
designed for processing and storing large datasets across clusters of computers.
The RHadoop project includes several components:
1. rhdfs: This package provides R functions to interact with the Hadoop
Distributed File System (HDFS), allowing users to read and write data to and from
Hadoop.
2. RHive: RHive provides an interface to Apache Hive, a data warehouse infrastructure built
on top of Hadoop. It allows users to run Hive queries directly from R, enabling data
analysis and manipulation.
3. rmr2: rmr2 is an R package that allows users to write MapReduce programs in R
(the jobs are executed via Hadoop Streaming). MapReduce is a programming model used for
processing and generating large datasets in parallel across distributed clusters.
4. rhbase: rhbase is an interface to Apache HBase, a distributed, scalable, NoSQL
database built on top of Hadoop. It allows users to interact with HBase tables from R.
By integrating R with Hadoop, RHadoop enables data scientists and analysts to leverage the
power of Hadoop for processing large datasets, while still using the familiar R environment
for data analysis and visualization.
This combination is particularly useful for big data analytics tasks where the datasets are
too large to be processed on a single machine.
Data Analysis Using the MapReduce
Technique in RHadoop
• Performing data analysis using the MapReduce technique in RHadoop typically involves the following
steps:
1. Setup: Ensure that you have Hadoop and R installed on your system. Install the RHadoop packages
(rhdfs, rmr2, rhbase, and RHive) and any other necessary dependencies.
2. Data Preparation: Prepare your data for analysis. This may involve storing your data in HDFS or
HBase, depending on your data storage preferences.
3. MapReduce Programming: Write your MapReduce program using the rmr2 package. MapReduce
programs consist of two main functions: map() and reduce(). The map() function processes each
input record and emits key-value pairs, while the reduce() function aggregates the values
associated with each key (see the sketch after this list).
4. Execution: Submit your MapReduce job to the Hadoop cluster using the mapreduce() function
provided by rmr2. This function allows you to specify the input data, the mapper function,
the reducer function, and any other job configurations.
5. Data Analysis: Once the MapReduce job completes, you can analyze the output data using R. You
may read the output data from HDFS into R data structures using functions such as from.dfs()
provided by rmr2 or the hdfs.* functions provided by rhdfs.
6. Visualization and Interpretation: Visualize the results using R's plotting libraries or other
visualization tools, and interpret them in the context of the original analytics problem.
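
Although the slides use R's rmr2 for this workflow, the map()/reduce() key-value pattern itself is
language-agnostic. The following is a minimal, hypothetical word-count mapper and reducer in Hadoop
Streaming style, written in Python purely to illustrate the pattern; the script layout and
invocation are assumptions, not part of the original material:

#!/usr/bin/env python3
# Hypothetical Hadoop Streaming scripts: map() emits (word, 1) pairs,
# reduce() sums the counts for each word.
import sys

def mapper():
    # emit one "word<TAB>1" line per word of every input line
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive consecutively
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # run as "script.py map" for the mapper, anything else for the reducer
    mapper() if sys.argv[1:] == ["map"] else reducer()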
Understanding the data analytics
project life cycle
• When dealing with data analytics projects, there is a set of fixed tasks
that should be followed to obtain the expected output.
• Here we build a data analytics project life cycle: a set of standard,
data-driven processes that lead from raw data to insights
effectively.
• The defined processes of the project life cycle should be followed in
sequence to achieve the goal effectively using the input
datasets.
• The process includes identifying the data analytics problem, designing
and collecting datasets, performing the analytics, and visualizing the
data.
Data Analytics Project Life Cycle
Stages
Identifying the problem
• Today, business analytics trends are shifting toward performing data analytics over
web datasets to grow the business.
• Since data sizes increase gradually day by day, analytical applications need to be
scalable to collect insights from these datasets.
• With the help of web analytics, we can solve such business analytics problems.
• Let's assume that we have a large e-commerce website, and we want to
know how to increase the business. We can identify the important pages of
our website by categorizing them by popularity into high, medium, and
low. Based on these popular pages, their types, their traffic sources, and their
content, we will be able to decide the roadmap to improve business by
improving web traffic as well as content.
Designing data requirement

• To perform data analytics for a specific problem, datasets from related
domains are needed.
• Based on the domain and problem specification, the data source can
be decided, and based on the problem definition, the data attributes
of these datasets can be defined.
• For example, if we are going to perform social media analytics
(problem specification), we use Facebook or Twitter as the data
source. For identifying user characteristics, we need user profile
information, likes, and posts as data attributes.
Preprocessing data
• In data analytics, we do not use the same data sources, data attributes, data tools,
and algorithms all the time, and they do not all consume data in the same format.
• This leads to data operations such as data cleansing, data aggregation, data
augmentation, data sorting, and data formatting, which provide the data in a format
supported by all the data tools and algorithms that will be used in the analytics.
• In simple terms, preprocessing performs the data operations needed to translate raw data
into a fixed data format before providing it to algorithms or tools.
• The data analytics process is then initiated with this formatted data as the input.
• In the case of Big Data, the datasets need to be formatted and uploaded to the Hadoop
Distributed File System (HDFS), and are then used by the various nodes with Mappers and
Reducers in Hadoop clusters.
Performing analytics over data
• After data is available in the required format for data analytics algorithms, data analytics
operations will be performed.
• Data analytics operations are performed to discover meaningful information in the
data, using data mining concepts, so that better business decisions can be taken.
• It may either use descriptive or predictive analytics for business intelligence.
• Analytics can be performed with various machine learning as well as custom algorithmic
concepts, such as regression, classification, clustering, and model-based recommendation.
• For Big Data, the same algorithms can be translated into MapReduce jobs so that their
data analytics logic can run in parallel on Hadoop clusters.
• These models need to be further evaluated as well as improved by various evaluation
stages of machine learning concepts.
• Improved or optimized algorithms can provide better insights.
Visualizing data
• Data visualization is used for displaying the output of data analytics.
• Visualization is an interactive way to represent the data insights.
• This can be done with various data visualization software tools as well as R packages.
• R has a variety of packages for the visualization of datasets.
• They are as follows:
• ggplot2: This is an implementation of the Grammar of Graphics by Dr. Hadley Wickham
(http://had.co.nz/). For more information refer to
http://cran.r-project.org/web/packages/ggplot2/.
• rCharts: This is an R package by Ramnath Vaidyanathan to create, customize, and publish
interactive JavaScript visualizations from R using a familiar lattice-style plotting
interface. For more information refer to http://ramnathv.github.io/rCharts/.
Popular examples of visualization
with R
• Plots with facet scales (ggplot2): The following figure shows a
comparison of males and females on different measures, namely
education, income, life expectancy, and literacy, using ggplot2.
• Dashboard charts: These are an rCharts chart type. Using them, we can build
interactive, animated dashboards with R.
Spark:
Core Concepts, Spark’s Python and
Scala shells
Apache Spark
• Apache Spark is an open-source distributed computing system that
provides an interface for programming entire clusters with implicit data
parallelism and fault tolerance.
• While it's often associated with Hadoop due to its compatibility and
ability to run on Hadoop clusters, Spark is a separate project under the
Apache Software Foundation.
• Apache Spark is designed for speed and ease of use, offering APIs in
languages such as Scala, Java, Python, and R.
• It provides high-level APIs for programming in batch processing,
streaming, machine learning, and interactive SQL-like queries, making it
a versatile tool for various data processing tasks.
Key features of Apache Spark
1. Speed: Spark achieves high performance through in-memory computing and efficient data
processing. It can perform batch processing tasks up to 100 times faster than Hadoop MapReduce
due to its ability to cache data in memory.
2. Ease of Use: Spark offers high-level APIs in multiple languages, making it accessible to a wide range
of users, including data engineers, data scientists, and developers. It also provides a unified engine
for batch processing, interactive queries, streaming, and machine learning.
3. Versatility: Spark supports a variety of data processing workloads, including batch processing (using
the Spark Core API), interactive SQL-like queries (using Spark SQL), real-time streaming data
processing (using Spark Streaming), and machine learning (using MLlib).
4. Fault Tolerance: Spark provides fault tolerance through resilient distributed datasets (RDDs), which
are immutable distributed collections of objects. RDDs automatically recover from node failures,
ensuring reliable data processing.
5. Integration with Hadoop: Spark can run on Hadoop clusters, accessing data stored in Hadoop
Distributed File System (HDFS), HBase, Cassandra, and other data sources. It can also leverage
Hadoop's YARN resource manager for cluster resource management.
Apache Spark Core Concepts
• Apache Spark revolves around several core concepts that form the foundation of its distributed
computing model. Understanding these concepts is essential for effectively using Spark for various
data processing tasks. Here are the key core concepts:
• Resilient Distributed Dataset (RDD): RDD is the fundamental data structure in Spark, representing
an immutable, distributed collection of objects partitioned across multiple nodes in a cluster.
• RDDs support two types of operations: transformations and actions. Transformations create new
RDDs from existing ones (e.g., map, filter), while actions perform computation and return results to
the driver program (e.g., reduce, collect).
• Directed Acyclic Graph (DAG): Spark operations (transformations and actions) are organized into a
Directed Acyclic Graph (DAG) of stages. Each stage represents a set of transformations that can be
executed together, with dependencies between them.
• Spark optimizes execution by creating an optimized DAG of stages based on the operations
performed in the application.
• SparkContext:
• SparkContext is the main entry point for Spark functionality in a Spark application. It represents the connection to
the Spark cluster and is responsible for coordinating the execution of operations on the cluster.
• In the interactive shells, the SparkContext is automatically available as sc; in standalone Scala and
Python applications it is created explicitly (in Spark 2.0 and later, usually via a SparkSession).
• DataFrames and Datasets:
• DataFrames and Datasets are high-level distributed data abstractions built on top of RDDs, introduced in Spark 1.3
and 1.6, respectively.
• DataFrames represent structured data with named columns, similar to tables in a relational database. They offer
optimizations and additional APIs for structured data processing.
• Datasets are strongly-typed distributed collections of objects, available in Scala and Java, providing compile-time
type safety and optimizations.
• Spark SQL:
• Spark SQL is a module in Spark for working with structured data, providing support for
querying structured data using SQL-like syntax and processing data with DataFrames and
Datasets.
• Spark SQL integrates seamlessly with other Spark components, allowing users to combine
SQL queries, DataFrame/Dataset operations, and RDD transformations/actions in a single
application.
• Spark Streaming:
• Spark Streaming is a scalable, fault-tolerant stream processing engine built on top of Spark Core. It allows
real-time processing of data streams, supporting high-level abstractions like discretized streams
(DStreams).
• DStreams represent a sequence of RDDs representing data streams, allowing users to apply RDD
transformations and actions to process streaming data.
• Machine Learning Library (MLlib):
• MLlib is Spark's scalable machine learning library, providing a rich set of algorithms and tools for building
machine learning models at scale.
• MLlib supports common machine learning tasks such as classification, regression, clustering, collaborative
filtering, and dimensionality reduction.

• Understanding these core concepts is crucial for effectively leveraging Apache Spark's
capabilities for distributed data processing, analytics, and machine learning tasks.
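
To make several of these concepts concrete (SparkContext, lazy RDD transformations and actions,
DataFrames, and Spark SQL), here is a short, hedged PySpark sketch; the app name and the sample
data are illustrative assumptions, and in Spark 2.0 and later a SparkSession wraps the SparkContext:

# Minimal PySpark sketch tying several core concepts together (illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-concepts-demo").getOrCreate()
sc = spark.sparkContext                        # the SparkContext described above

# RDD: an immutable, partitioned collection; filter is a lazy transformation
rdd = sc.parallelize([("alice", 34), ("bob", 28), ("carol", 41)])
adults = rdd.filter(lambda kv: kv[1] >= 30)    # transformation (builds the DAG)
print(adults.collect())                        # action (triggers execution)

# DataFrame: structured data with named columns, built from the RDD
df = spark.createDataFrame(rdd, ["name", "age"])

# Spark SQL: register the DataFrame as a view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 30").show()

spark.stop()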
Spark’s Python and Scala shells
• Apache Spark provides interactive shells for both Python (PySpark)
and Scala (Spark shell), allowing users to interactively explore and
manipulate data using Spark APIs. These shells provide a
convenient environment for prototyping, testing code snippets,
and performing ad-hoc data analysis.

• Both PySpark and Spark shell offer similar capabilities for interacting
with Spark, but they use different programming
languages (Python and Scala, respectively). Users can choose the
shell based on their language preference and familiarity with
Python or Scala. Additionally, both shells provide access to the
SparkContext, allowing users to connect to a Spark cluster and
execute operations on distributed datasets.
PySpark (Python shell)
• PySpark is the Python API for Apache Spark, allowing users to interact
with Spark using Python.
• To start PySpark, you can simply run the pyspark command in the
terminal. This launches the PySpark interactive shell, where you can type
Python code and interact with Spark.
• PySpark provides full access to Spark's functionality, including RDDs,
DataFrames, Spark SQL, MLlib (machine learning library), and Spark
Streaming.
• PySpark is particularly popular among data scientists and Python
developers due to Python's widespread adoption and its ease of use for
data analysis and machine learning tasks.
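
For example, a short PySpark shell session might look like the following; the shell itself
creates the SparkContext and exposes it as sc (the sample data is an arbitrary illustration):

>>> data = sc.parallelize(range(1, 6))
>>> data.map(lambda x: x * x).collect()        # transformation followed by an action
[1, 4, 9, 16, 25]
>>> data.filter(lambda x: x % 2 == 0).count()
2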
Spark shell (Scala shell)
• The Spark shell is the interactive shell for Spark's Scala API, allowing
users to write and execute Spark applications using Scala.
• To start the Spark shell, you can run the spark-shell command in the
terminal. This launches the Scala REPL (Read-Eval-Print Loop) with Spark
pre-configured, allowing you to write Scala code and interact with Spark.
• The Spark shell provides access to all of Spark's features, including RDDs,
DataFrames, Spark SQL, MLlib, and Spark Streaming, using Scala syntax.
• Scala is the native language of Spark, and the Spark shell provides a
powerful environment for writing complex Spark applications and
leveraging Scala's features and libraries.
Programming with RDD:
RDD Operations,
Passing Functions to Spark,
Common Transformations and Actions
Programming with RDD
• Programming with RDDs (Resilient Distributed Datasets) in Apache
Spark involves using a combination of transformations and actions to
perform distributed data processing tasks. RDDs are the primary
abstraction in Spark for working with distributed data, and they
support various operations for data manipulation.
RDD Operations
• Transformations:
• Transformations create a new RDD from an existing one. They are lazy operations,
meaning they don't execute immediately but instead build up a DAG (Directed Acyclic
Graph) of transformations that will be executed when an action is called.
• Examples of transformations include map, filter, flatMap, reduceByKey, sortByKey,
groupByKey, join, union, intersection, distinct, etc.
• Actions:
• Actions are operations that trigger the execution of the previously defined
transformations on RDDs and return results to the driver program or write data to an
external storage system.
• Examples of actions include collect, count, take, reduce, foreach, saveAsTextFile,
saveAsSequenceFile, saveAsObjectFile, countByKey, foreachPartition, etc.
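
A small, hedged sketch of the two kinds of operations described above, assuming the sc provided
by the PySpark shell and an arbitrary sample list:

nums = sc.parallelize([1, 2, 3, 4, 5])          # create an RDD from a local list

squares = nums.map(lambda x: x * x)             # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)    # another lazy transformation

total = evens.reduce(lambda a, b: a + b)        # action: triggers execution of the DAG
print(total)                                    # 4 + 16 = 20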
Passing Functions to Spark
• Inline Functions:
• You can define functions inline using lambda expressions (anonymous functions)
directly in the transformation or action call. For example:
rdd.map(lambda x: x * 2)
• Named Functions:
• You can define named functions separately and pass them to transformations or
actions. This is particularly useful for complex functions or for reusing functions
across multiple operations. For example:
def double(x):
    return x * 2

rdd.map(double)
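
One hedged caution worth illustrating: passing a method bound to an object ships the whole object
to the executors, so it is often better to copy the needed field into a local variable first. The
WordHelper class below is a hypothetical example, and rdd is assumed to be an existing RDD of strings:

class WordHelper:
    def __init__(self, suffix):
        self.suffix = suffix
    def add_suffix(self, word):
        return word + self.suffix

helper = WordHelper("_tag")
# rdd.map(helper.add_suffix)            # would serialize the entire helper object
suffix = helper.suffix                   # copy only the small field we need
tagged = rdd.map(lambda w: w + suffix)   # the closure captures just the string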
Common Transformations
• map(func): Applies the function func to each element of the RDD and
returns a new RDD with the result.
• filter(func): Filters elements of the RDD based on the function func and
returns a new RDD with the filtered elements.
• flatMap(func): Similar to map, but flattens the result, so each input
element can be mapped to 0 or more output elements.
• reduceByKey(func): Aggregates the values for each key using the
provided function func.
• sortByKey(): Sorts the RDD by its keys.
• groupByKey(): Groups the values for each key in the RDD.
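
A hedged example applying the transformations listed above to a small, hypothetical RDD of text lines:

lines = sc.parallelize(["spark makes big data simple", "big data with spark"])

words = lines.flatMap(lambda line: line.split())   # flatMap: 0..n outputs per input
pairs = words.map(lambda w: (w, 1))                # map: one output per input
counts = pairs.reduceByKey(lambda a, b: a + b)     # reduceByKey: aggregate values per key
ordered = counts.sortByKey()                       # sortByKey: sort by the word
long_words = words.filter(lambda w: len(w) > 4)    # filter: keep matching elements
grouped = pairs.groupByKey()                       # groupByKey: (word, iterable of 1s)

print(ordered.collect())                           # an action materializes the result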
Common Actions
• collect(): Returns all the elements of the RDD to the driver program.
• count(): Returns the number of elements in the RDD.
• take(n): Returns the first n elements of the RDD.
• reduce(func): Aggregates the elements of the RDD using the function
func.
• foreach(func): Applies the function func to each element of the RDD.
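
And a hedged example of the actions listed above; the values in the comments assume the small
sample RDD shown here:

nums = sc.parallelize([5, 3, 8, 1])

print(nums.collect())                     # [5, 3, 8, 1] - all elements to the driver
print(nums.count())                       # 4
print(nums.take(2))                       # [5, 3]
print(nums.reduce(lambda a, b: a + b))    # 17
nums.foreach(lambda x: print(x))          # runs on the executors; output appears in worker logs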
Mining Data Streams:
Streams Concepts,
Stream Data Model and Architecture, Stream Computing,
Filtering Streams,
Estimating Moments,
Decaying Window,
Real-Time Analytics Platform (RTAP) Applications
Mining Data Streams
• Mining data streams involves analyzing continuously generated data
in real-time or near real-time to extract useful insights or patterns.
• Mining data streams is crucial for many real-time applications across
various domains, including IoT, finance, e-commerce, healthcare, and
more.
• Effective stream processing systems and algorithms are essential for
extracting timely insights and enabling proactive decision-making
based on continuously evolving data.
Streams Concepts
• Data Streams: Data streams represent continuously flowing,
potentially unbounded sequences of data records. Examples include
sensor data, social media feeds, financial transactions, etc.
• Stream Processing: Stream processing refers to the real-time or near
real-time analysis of data streams to extract insights, detect patterns,
or make decisions as the data arrives.
• Event Time vs. Processing Time: Event time refers to the time when
an event actually occurred, while processing time refers to the time
when an event is observed by the processing system.
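
A tiny, hedged illustration of the difference between event time and processing time; the
timestamp value is an arbitrary assumption:

import time

event = {"value": 42, "event_time": 1700000000.0}   # when the event actually occurred
processing_time = time.time()                        # when our system observes the event
lag = processing_time - event["event_time"]          # events may arrive late or out of order
print(f"event is {lag:.1f} seconds old at processing time")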
Stream Data Model and Architecture
• Event: An event represents a data record in the stream, typically
characterized by a timestamp and associated attributes.
• Stream Processing Architecture: Stream processing architectures
typically involve components for data ingestion, processing, analysis,
and output. These components may include messaging systems (e.g.,
Apache Kafka), stream processing engines (e.g., Apache Flink, Apache
Spark Streaming), and storage systems (e.g., databases, data lakes).
Stream Computing
• Stateful Stream Processing: Stateful stream processing involves
maintaining and updating state information across multiple events in
the stream. This enables tasks such as sessionization, pattern
detection, and anomaly detection.
• Windowing: Windowing divides the stream into finite segments or
windows, allowing computations to be performed over fixed time
intervals or based on a fixed number of events. Common window
types include tumbling windows, sliding windows, and session
windows.
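
A hedged sketch of a sliding window using Spark Streaming's DStream API, assuming the sc from the
PySpark shell; the socket source, host, port, and checkpoint directory are assumptions:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                      # 5-second micro-batches
ssc.checkpoint("/tmp/stream-checkpoint")           # required for windowed state

lines = ssc.socketTextStream("localhost", 9999)    # assumed source host/port
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

# sliding window: word counts over the last 30 seconds, recomputed every 10 seconds
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b,   # add values entering the window
                                      lambda a, b: a - b,   # subtract values leaving the window
                                      30, 10)
windowed.pprint()

ssc.start()
ssc.awaitTermination()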
Filtering Streams
• Filtering: Filtering involves selecting or excluding events from the
stream based on specified criteria. This can be done using predicates
or conditional expressions, as in the sketch below.
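
A hedged, standalone Python sketch of predicate-based filtering; the sensor readings are a
hypothetical, finite stand-in for an unbounded stream:

def sensor_stream():
    # hypothetical source of (sensor_id, temperature) readings
    for reading in [("s1", 21.5), ("s2", 98.0), ("s1", 22.1), ("s3", 105.3)]:
        yield reading

# keep only events whose temperature exceeds a threshold
alerts = (r for r in sensor_stream() if r[1] > 90.0)
for sensor_id, temp in alerts:
    print(f"ALERT {sensor_id}: {temp}")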
Estimating Moments
• Moments: Moments are statistical measures of the distribution of
data. Common moments include mean, variance, skewness, and
kurtosis.
• Estimating Moments in Data Streams: Estimating moments in data
streams involves continuously updating estimates of statistical
moments as new data arrives. Techniques such as reservoir sampling,
sketching algorithms (e.g., Count-Min Sketch), and approximate
algorithms are often used to efficiently compute moment estimates in
data streams.
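
Sketching algorithms such as Count-Min Sketch target frequency counts; for the simpler first and
second moments, a hedged sketch of Welford's online algorithm shows how the mean and variance can
be updated as each element arrives, without storing the stream (the sample values are arbitrary):

def running_moments(stream):
    # Welford's online algorithm: update count, mean, and sum of squared deviations
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        variance = m2 / n if n > 1 else 0.0
        yield n, mean, variance

for n, mean, var in running_moments([4.0, 7.0, 13.0, 16.0]):
    print(n, round(mean, 3), round(var, 3))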
Decaying Window
• Decaying Window: A decaying window gives more weight to recent
events in the stream while gradually reducing the influence of older
events. This is particularly useful for capturing trends or detecting
changes in data distributions over time.
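
A hedged sketch of an exponentially decaying window: each arriving event contributes a weight of 1,
and all previously accumulated weight is multiplied by (1 - c), so older events fade out gradually
(c = 0.1 and the sample events are arbitrary choices):

def decaying_count(stream, c=0.1):
    weight = 0.0
    for event in stream:
        weight = weight * (1.0 - c) + 1.0   # decay old contributions, add the new event
        yield event, round(weight, 3)

for event, w in decaying_count(["click", "click", "view", "click"]):
    print(event, w)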
Real-Time Analytics Platform (RTAP)
Applications
• Real-Time Monitoring and Alerting: RTAPs can be used for monitoring systems or processes
in real-time and generating alerts or notifications based on predefined conditions or
anomalies.
• Fraud Detection and Anomaly Detection: RTAPs enable real-time detection of fraudulent
activities or anomalous behavior by analyzing streaming data and identifying patterns that
deviate from normal behavior.
• Personalized Recommendations: RTAPs can analyze user interactions and behaviors in real-
time to generate personalized recommendations or content suggestions.
• Predictive Maintenance: RTAPs analyze sensor data from industrial equipment or machinery
in real-time to detect potential failures or maintenance needs before they occur, thereby
minimizing downtime and optimizing maintenance schedules.
• Financial Trading and Algorithmic Trading: RTAPs analyze market data and execute trading
strategies in real-time, enabling algorithmic trading and automated decision-making in
financial markets.
Case studies
• Real Time Sentiment Analysis
• Stock Market Predictions
Real Time Sentiment Analysis
Stock Market Predictions
Computing the frequency of stock market change: This data analytics MapReduce problem is
designed for calculating the frequency of stock market changes.
Identifying the problem
• This is a typical stock market data analytics problem: it calculates the frequency of past
changes for one particular stock market symbol, much as a Fourier transformation decomposes a
signal into frequencies. Based on this information, the investor can get more insight into changes
over different time periods. So the goal of this analysis is to calculate the frequencies of the
percentage changes.
Designing data requirement
• For this stock market analytics, we will use Yahoo! Finance as the input dataset.
• We need to retrieve the specific symbol's stock information.
• To retrieve this data, we will use the Yahoo! API with the following parameters:
• From month
• From day
• From year
• To month
• To day
• To year
• Symbol
Preprocessing data
• To perform the analytics over the extracted dataset, we will use R to fire the
following command:
stock_BP <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=BP")

# exporting to csv file
write.csv(stock_BP, "table.csv", row.names = FALSE)

• Alternatively, you can download the file directly via the terminal:

wget http://ichart.finance.yahoo.com/table.csv?s=BP

• Then upload it to HDFS by creating a specific Hadoop directory for it:

# creating /stock directory in hdfs
bin/hadoop dfs -mkdir /stock
# uploading table.csv to hdfs in /stock directory
bin/hadoop dfs -put /home/Vignesh/downloads/table.csv /stock/
Performing analytics over data
• To perform the data analytics operations, we will use streaming with R
and Hadoop (without the HadoopStreaming package).
• So, the development of this MapReduce job can be done without any
RHadoop integrated library/package.
• In this MapReduce job, we have defined Map and Reduce in different R
files to be provided to the Hadoop streaming function.

• The following code runs MapReduce in R without installing or using any additional R
library/package.
• R's system() function fires a system command from within the R console, which lets us
launch Hadoop jobs directly from R.
• It also returns the response of those commands to the R console.
• While running this program, the output at your R console or terminal will be as shown in the
following screenshots, which help us monitor the status of the Hadoop MapReduce job.
• Here we will look at the logs sequentially, divided into parts.
• Please note that we have separated the log output into parts to help you understand it better.
• The MapReduce log output contains the following (when run from the terminal):
• From the initial portion of the log, we can identify the metadata for the Hadoop MapReduce job. We can
also track the job status in a web browser by opening the given Tracking URL. This is how the MapReduce
job metadata is tracked.
• From the next portion of the log, we can monitor the status of the Mapper and Reducer
tasks being run on the Hadoop cluster and see details such as whether they succeeded or
failed. This is how we track the status of the Mapper and Reducer tasks.
• Once the MapReduce job is completed, its output location will be displayed at the end of
the logs. This is known as tracking the HDFS output location.

• From the terminal, the output of the Hadoop MapReduce program can be viewed using the
following command:

bin/hadoop dfs -cat /stock/outputs/part-00000

• The headers of the output of your MapReduce program will look as follows:

change    frequency

• The following figure shows a sample output of the MapReduce problem:
Visualizing data
• We can get more insights if we visualize our output with various graphs in R.
• Here, we have visualized the output with the help of the ggplot2 package.
• From the resulting graph, we can quickly see that, most of the time, the stock price has
changed by around 0 to 1.5 percent.
• So the stock's historical price movements can be helpful at the time of investing.
• The required code for generating this graph uses the ggplot2 package; a rough Python
equivalent is sketched below.
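
This hypothetical Python sketch assumes that the table.csv downloaded earlier contains a Close
column, which is used to compute and plot the frequency of day-to-day percentage changes; it is a
rough stand-in for the ggplot2 version, not the original code:

import pandas as pd
import matplotlib.pyplot as plt

stock = pd.read_csv("table.csv")                      # file from the preprocessing step
change = stock["Close"].pct_change().dropna() * 100   # day-to-day percentage change

plt.hist(change, bins=50)                             # frequency of each change bucket
plt.xlabel("Percentage change")
plt.ylabel("Frequency")
plt.title("Frequency of BP stock price changes")
plt.show()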
