Map Reduce
Across open-source and proprietary parallel computing platforms, there are generally three types
of parallelism available, which are discussed below:
1. Bit-level parallelism: The form of parallel computing that depends on the processor's word
size. When a task operates on data larger than the word size, bit-level parallelism reduces the
number of instructions the processor must execute; otherwise, the operation must be split into
a series of instructions. For example, if an 8-bit processor must perform an operation on
16-bit numbers, it first operates on the 8 lower-order bits and then on the 8 higher-order
bits, so two instructions are needed. A 16-bit processor can perform the same operation with
one instruction (a sketch of this splitting appears after this list).
2. Instruction-level parallelism: The processor decides, in a single CPU clock cycle, how many
instructions to execute at the same time; a processor exploiting instruction-level parallelism
can issue more than one instruction per clock cycle. The software approach to instruction-level
parallelism relies on static parallelism, where the compiler decides in advance which
instructions to execute simultaneously.
3. Task parallelism: The form of parallelism in which a task is decomposed into subtasks, each
subtask is allocated to a processor, and the subtasks are executed concurrently (see the
thread-pool sketch after this list).
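To make the bit-level example concrete, here is a minimal Java sketch (hypothetical class and method names, for illustration only) that emulates one 16-bit addition using only 8-bit operations, mirroring the two instructions an 8-bit processor would need:

public class BitLevelDemo {
    // Adds two 16-bit values using only 8-bit arithmetic: two add
    // "instructions" (low byte first, then high byte plus carry).
    static int add16On8BitMachine(int a, int b) {
        int lo = (a & 0xFF) + (b & 0xFF);                        // instruction 1: low bytes
        int carry = (lo >> 8) & 1;                               // carry out of the low byte
        int hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry;  // instruction 2: high bytes
        return ((hi & 0xFF) << 8) | (lo & 0xFF);
    }

    public static void main(String[] args) {
        // A 16-bit processor would need just one instruction: (a + b) & 0xFFFF.
        System.out.printf("0x%04X%n", add16On8BitMachine(0x1234, 0x0FCD)); // prints 0x2201
    }
}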
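And here is a minimal Java sketch of task parallelism (the subtasks are hypothetical placeholders): a job is decomposed into independent subtasks that a fixed pool of worker threads executes concurrently.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskParallelismDemo {
    public static void main(String[] args) throws Exception {
        // A pool of four worker threads executes the subtasks concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // The overall job, decomposed into three independent subtasks.
        List<Callable<String>> subtasks = List.of(
                () -> "subtask 1: parsed the input",
                () -> "subtask 2: computed the statistics",
                () -> "subtask 3: rendered the report");
        // invokeAll submits every subtask and waits until all have finished.
        for (Future<String> result : pool.invokeAll(subtasks)) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}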
Primary applications of parallel computing include:
Databases and data mining.
Real-time simulation of systems.
Networked video and multimedia technologies.
Science and engineering.
Collaborative work environments.
Advanced graphics, augmented reality, and virtual reality.
Fundamentals of Parallel Computer Architecture
Parallel computers exist in a wide variety of architectures, classified according to the level
at which the hardware supports parallelism. Parallel computer architecture and programming
techniques work together to utilize these machines effectively. The classes of parallel
computer architecture include:
Multi-core computing
A processor integrated circuit containing two or more distinct processing cores is known as a
multi-core processor; it can execute program instructions simultaneously. Cores may implement
architectures such as VLIW, superscalar, multithreading, or vector, and they are integrated on
a single die or onto multiple dies in a single chip package. Multi-core architectures are
classified as heterogeneous, consisting of cores that are not identical, or homogeneous,
consisting of only identical cores.
Symmetric multiprocessing
In symmetric multiprocessing (SMP), two or more identical processors share a single main
memory, are controlled by a single operating system instance, and can each be scheduled to run
any task.
Distributed computing
The components of a distributed system are located on different networked computers, which
coordinate their actions by communicating through HTTP, RPC-like connectors, and message
queues. The concurrency of components and the independent failure of components are
characteristic of distributed systems. Typically, distributed programming is classified as
peer-to-peer, client-server, three-tier, or n-tier architecture. The terms parallel computing
and distributed computing are sometimes used interchangeably, as there is much overlap between
the two.
In distributed computing, several computers are used simultaneously to execute a set of
instructions in parallel. Grid computing is a related approach in which numerous distributed
computer systems execute simultaneously and communicate over the Internet to solve a specific
problem.
Advantages of Parallel Computing
In parallel computing, more resources are used to complete a task, which decreases the time
taken and can cut costs. Parallel clusters can also be constructed from cheap, commodity
components.
Compared with serial computing, parallel computing can solve larger problems in a shorter
time.
For simulating, modeling, and understanding complex, real-world phenomena, parallel computing
is much more appropriate than serial computing.
When local resources are finite, it can take advantage of non-local resources.
Many problems are so large that it is impractical or impossible to solve them on a single
computer; parallel computing removes these kinds of limitations.
One of the best advantages of parallel computing is that it allows several things to be done
at a time by using multiple computing resources.
Furthermore, parallel computing makes better use of the hardware, as serial computing wastes
potential computing power.
Serial Computing
Serial computing, also known as sequential computing, refers to the use of a single processor
to execute a program: the program is divided into a sequence of instructions, and each
instruction is processed one at a time. Traditionally, software has been programmed
sequentially because it offers a simpler approach, but the processor's speed significantly
limits its ability to execute such a series of instructions. Uni-processor machines also use
sequential data structures, whereas the data structures for parallel computing environments
are concurrent.
A real-life example of this would be people standing in a queue waiting for a movie ticket
when there is only one cashier. The cashier issues tickets to the people one by one. The
complexity of this situation increases when there are two queues and still only one cashier.
Parallel computing, in contrast, handles larger problems and helps to solve them faster.
Cloud computing is a general term that refers to the delivery of scalable services, such as
databases, data storage, networking, servers, and software, over the Internet on an as-needed,
pay-as-you-go basis.
Cloud computing services can be public or private, are fully managed by the provider, and
facilitate remote access to data, work, and applications from any device in any place capable of
establishing an Internet connection. The three most common service categories are Infrastructure
as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
Cloud computing is a relatively new paradigm in software development that facilitates broader
access to parallel computing via vast, virtual computer clusters, allowing the average user and
smaller organizations to leverage parallel processing power and storage options typically
reserved for large enterprises.
Map Reduce:
MapReduce is a framework with which we can write applications that process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a
set of data and converts it into another set of data in which individual elements are broken
down into tuples (key/value pairs). The reduce task then takes the output from a map as its
input and combines those data tuples into a smaller set of tuples. As the name MapReduce
implies, the reduce task is always performed after the map job.
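The classic illustration is word count. The sketch below follows the widely published WordCount example for the Hadoop MapReduce Java API (input and output paths are taken from the command line): the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: one input line -> a (word, 1) pair for every token in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}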
Distributed System:
A distributed system is a system whose components are located on different networked
computers, which communicate and coordinate their actions by passing messages to one another.
These computers (nodes) are collectively known as a cluster if all nodes are on the same local
network and use similar hardware, or as a grid if the nodes are shared across geographically
and administratively distributed systems and use more heterogeneous hardware.
4.2 Map Reduce Model:
Algorithm:
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the
reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally, the
input data is in the form of a file or directory and is stored in the Hadoop
Distributed File System (HDFS). The input file is passed to the mapper function line
by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes must be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework. The input and output types of a MapReduce
job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
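For instance, in the word-count sketch above, these abstract types instantiate as: (Input)
<LongWritable byte offset, Text line> → map → <Text word, IntWritable 1> → shuffle/sort, which
groups the values by word → reduce → <Text word, IntWritable count> (Output).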
4.3 Usage of MapReduce
It can be used in various applications such as document clustering, distributed sorting, and
web link-graph reversal.
It can be used for distributed pattern-based searching.
We can also use MapReduce in machine learning.
It was used by Google to regenerate Google's index of the World Wide Web.
It can be used in multiple computing environments, such as multi-cluster, multi-core, and
mobile environments.
4.3.1 Entertainment:
Hadoop MapReduce assists end users in finding the most popular movies based on their
preferences and previous viewing history. It primarily concentrates on their clicks and logs.
Various OTT services, including Netflix, regularly release many web series and movies. It may
have happened to you that you couldn’t pick which movie to watch, so you looked at Netflix’s
recommendations and decided to watch one of the suggested series or films. Netflix uses Hadoop
and MapReduce to indicate to the user some well-known movies based on what they have
watched and which movies they enjoy. MapReduce can examine user clicks and logs to learn
how they watch movies.
4.3.2 E-Commerce
Several e-commerce companies, including Flipkart, Amazon, and eBay, employ MapReduce to
evaluate consumer buying patterns based on customers’ interests or historical purchasing
patterns. For various e-commerce businesses, it provides product suggestion methods by
analyzing data, purchase history, and user interaction logs.
Many e-commerce vendors use the MapReduce programming model to identify popular products
based on customer preferences or purchasing behavior. Making item proposals for e-commerce
inventory is part of it, as is looking at website records, purchase histories, user interaction logs,
etc., for product recommendations.
4.3.3 Social Media
Nearly 500 million tweets, or about 6,000 per second, are sent daily on the microblogging
platform Twitter. MapReduce processes Twitter data, performing operations such as
tokenization, filtering, counting, and aggregating counters, as sketched below.
Tokenization: Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.
Filtering: Removes unwanted terms from the token maps.
Counting: Creates a token counter for each word.
Aggregate counters: Groups comparable counter values into small, manageable units.
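Here is a minimal, single-process Java sketch of these four steps (the sample tweets and stop-word list are hypothetical; a real deployment would run them as MapReduce jobs over HDFS rather than with local streams):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TweetCountSketch {
    public static void main(String[] args) {
        // Hypothetical sample tweets and stop-word list, for illustration only.
        List<String> tweets = List.of("big data is big", "data rules");
        Set<String> stopWords = Set.of("is");

        Map<String, Long> counts = tweets.stream()
                .flatMap(t -> Arrays.stream(t.split("\\s+")))                    // tokenization
                .filter(word -> !stopWords.contains(word))                       // filtering
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));  // counting

        // Aggregating counters: group the words by their final count value.
        Map<Long, List<String>> byCount = counts.entrySet().stream()
                .collect(Collectors.groupingBy(Map.Entry::getValue,
                        Collectors.mapping(Map.Entry::getKey, Collectors.toList())));

        System.out.println(counts);  // e.g. {rules=1, data=2, big=2} (order varies)
        System.out.println(byCount); // e.g. {1=[rules], 2=[data, big]} (order varies)
    }
}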
4.3.4 Data Warehouse
Data warehouse systems handle enormous volumes of information. The star schema, which consists
of a fact table and several dimension tables, is the most popular data warehouse model. In a
shared-nothing architecture, storing all the necessary data on a single node is impossible, so
retrieving data from other nodes is essential.
This results in network congestion and slow query-execution speeds. If the dimensions are not
too big, users can replicate them over the nodes to get around this issue and maximize
parallelism, as in the map-side join sketched below. Using MapReduce, we can build specialized
business logic for data insights while analyzing enormous data volumes in data warehouses.
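As an illustration of the replicated-dimension idea, here is a sketch of a map-side join mapper for Hadoop. The dimension file name products.txt and both record formats are hypothetical; the file is assumed to have been shipped to every node, for example via the distributed cache.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side (replicated) join: a small dimension table is available on every
// node, so each fact row can be joined in the mapper with no shuffle at all.
// Assumes a hypothetical "products.txt" with lines "id,name" and fact rows
// of the form "productId,amount".
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> productNames = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the replicated dimension table once per mapper.
        try (BufferedReader in = new BufferedReader(new FileReader("products.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",", 2);
                productNames.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fact = value.toString().split(",");               // productId,amount
        String name = productNames.getOrDefault(fact[0], "UNKNOWN");
        context.write(new Text(name), new Text(fact[1]));          // joined output row
    }
}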
4.3.5 Fraud Detection
Conventional methods of preventing fraud are not always very effective. For instance, data
analysts typically manage inaccurate payments by auditing a tiny sample of claims and
requesting medical records from specific submitters. Hadoop is a system well suited for handling
large volumes of data needed to create fraud detection algorithms. Financial businesses,
including banks, insurance companies, and payment platforms, use Hadoop and MapReduce for
fraud detection, pattern recognition, and business analytics through transaction analysis.
Conclusion:
For years, MapReduce was a prevalent (and the de facto standard) model for processing high-
volume datasets. In recent years, it has given way to new systems like Google’s new Cloud
Dataflow. However, MapReduce continues to be used across cloud environments, and in June
2022, Amazon Web Services (AWS) made its Amazon Elastic MapReduce (EMR) Serverless
offering generally available. As enterprises pursue new business opportunities from big data,
knowing how to use MapReduce will be an invaluable skill in building data analysis
applications.
MapReduce is a programming model and processing technique designed for parallel and
distributed computing. It is commonly used in cloud computing environments to process large
datasets efficiently. The parallel efficiency of MapReduce in cloud computing is influenced by
several factors:
1. Data Distribution:
o Efficient data distribution across the nodes in the cluster is crucial. If data is
unevenly distributed, some nodes may finish their tasks quickly while others are
still processing, leading to idle time and reduced parallel efficiency.
2. Task Granularity:
o The size of the tasks assigned to each node (both map and reduce tasks) affects
parallelism. If tasks are too fine-grained, the overhead of task distribution may
overshadow the actual computation. If they are too coarse-grained, some nodes
may take longer to finish, reducing parallel efficiency.
3. Communication Overhead:
o The efficiency of data transfer and communication between nodes is essential.
Excessive communication can lead to bottlenecks, slowing down the overall
processing speed and reducing parallel efficiency.
4. Scalability:
o The ability to scale the number of nodes in the cluster is critical for achieving
high parallel efficiency. MapReduce frameworks should be able to dynamically
add or remove nodes based on the workload.
5. Fault Tolerance:
o The ability to handle node failures without affecting the entire computation is
crucial. MapReduce frameworks often incorporate fault tolerance mechanisms to
reroute tasks from failed nodes to healthy ones, ensuring continuous progress.
6. Resource Management:
o Efficient resource utilization is important for achieving high parallel efficiency.
Cloud computing platforms typically provide resource management tools to
allocate and deallocate resources dynamically based on the workload.
7. Algorithm Design:
o The design of the Map and Reduce functions plays a significant role. Well-
designed algorithms can exploit parallelism effectively, while poorly designed
ones may introduce unnecessary dependencies and limit parallel efficiency.
8. Data Locality:
o Minimizing data movement across the network by processing data on nodes
where it is stored (data locality) can significantly improve parallel efficiency.
Cloud storage systems often optimize data placement to enhance data locality.
9. Cluster Configuration:
o The overall configuration of the cloud cluster, including the number of nodes,
their computing power, and network speed, impacts parallel efficiency. A well-
configured cluster can efficiently handle the parallel processing demands of a
MapReduce job.
10. Hardware and Network Infrastructure:
o The underlying hardware and network infrastructure provided by the cloud
service play a role in determining the overall performance. High-speed networks
and powerful compute resources contribute to better parallel efficiency.
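These factors can be summarized with the standard definition of parallel efficiency (the
numbers below are hypothetical, for illustration). If a job takes time T1 on one node and Tp
on p nodes, the speedup is S = T1 / Tp and the parallel efficiency is E = S / p. For example,
a job that runs in 100 minutes on one node and 16 minutes on 8 nodes achieves S = 100 / 16 =
6.25 and E = 6.25 / 8 ≈ 78%; skew in data distribution, communication overhead, and straggler
nodes are precisely what keep E below 100%.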