9. Elastic MapReduce and Redshift

Elastic MapReduce

AMAZON EMR
EMR

 Analyze and process vast amounts of data
 Uses Apache Hadoop
 An EMR cluster consists of master and slave nodes
 Hadoop uses a distributed processing architecture
called MapReduce, in which a task is mapped to a set
of servers for processing. The results of the
computations performed by those servers are then
reduced down to a single output set. One node,
designated as the master node, controls the
distribution of tasks.
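The map and reduce steps described above can be sketched in a few lines of plain Python. This is only a single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and each input chunk stands in for the share of data that the master node would hand to one server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk plays the role of the input assigned to one server.
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big clusters", "data on clusters"])
print(counts)
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network; the sketch keeps only the data flow.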
Hadoop

 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data
sets across clusters of computers using simple
programming models
 It is designed to scale up from single servers to
thousands of machines
 Hadoop is designed to detect and handle failures at
the application layer, rather than relying on
hardware for high availability
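Handling failures at the application layer means, roughly, that a task lost on one node is tried again on another. The sketch below is a toy simulation of that idea, not Hadoop's actual scheduler: `run_with_retries`, `failing_worker`, and `healthy_worker` are hypothetical names, and a worker signals a failed node by raising `RuntimeError`.

```python
def run_with_retries(task, workers, max_attempts=3):
    """Try a task on successive workers until one of them succeeds.

    Returns a (result, attempts_used) tuple. Each worker is a callable that
    takes the task input and either returns a result or raises RuntimeError.
    """
    errors = []
    for attempt, worker in enumerate(workers[:max_attempts], start=1):
        try:
            return worker(task), attempt
        except RuntimeError as exc:
            # Record the failure and fall through to the next worker.
            errors.append(str(exc))
    raise RuntimeError(f"task failed on every attempt: {errors}")


def failing_worker(task):
    """Simulates a node that has gone away."""
    raise RuntimeError("node unreachable")


def healthy_worker(task):
    """Simulates a node that completes the work (here: summing numbers)."""
    return sum(task)


result, attempts = run_with_retries([1, 2, 3], [failing_worker, healthy_worker])
print(result, attempts)
```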
EMR

 Hadoop clusters running on Amazon EMR use EC2
instances as virtual Linux servers for the master and
slave nodes
 They use Amazon S3 for bulk storage of input and output data
 They use CloudWatch to monitor cluster performance and
raise alarms
 You can also move data into and out of DynamoDB
using Amazon EMR and Hive
EMR

 Open-source projects that run on top of the Hadoop
architecture can also be run on Amazon EMR
 Hive, Pig, HBase, DistCp, and Ganglia are already
integrated with Amazon EMR
EMR: Advantages

 The ability to provision clusters of virtual servers
within minutes.
 You can scale the number of virtual servers in your
cluster to manage your computation needs, and only
pay for what you use.
 Integration with other AWS services
EMR: Features

 Resizable Clusters
 When you run your Hadoop cluster on Amazon EMR, you can
easily expand or shrink the number of virtual servers in your
cluster depending on your processing needs
 Pay Only for What You Use
 Amazon's pay-as-you-go pricing applies: you are billed only
for the capacity you actually run
 Easy to Use
 When you launch a cluster on Amazon EMR, the web service
allocates the virtual server instances and configures them with
the needed software for you. Within minutes you can have a
cluster configured and ready to run your Hadoop application
EMR: Features

 Use Amazon S3 or HDFS
 You can store your input and output data in Amazon S3, on the
cluster in HDFS, or a mix of both. Amazon S3 can be accessed
like a file system from applications running on your Amazon
EMR cluster
 Parallel Clusters
 If your input data is stored in Amazon S3 you can have
multiple clusters accessing the same data simultaneously
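Because S3 can be addressed like a file system, a job typically tells the two storage options apart by the URI scheme of its input and output paths. The sketch below illustrates that with the standard library only. The function name `storage_backend`, the bucket name, and the paths are hypothetical, and the assumption is that S3 locations are written as `s3://` URIs and cluster-local locations as `hdfs://` URIs.

```python
from urllib.parse import urlparse


def storage_backend(uri):
    """Report which storage system a job path points at, based on its scheme."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "Amazon S3"
    if scheme == "hdfs":
        return "HDFS"
    raise ValueError(f"unsupported storage scheme: {scheme!r}")


# Hypothetical job configuration mixing both stores: read from S3, write to HDFS.
job_paths = {
    "input": "s3://example-bucket/input/logs/",
    "output": "hdfs:///user/hadoop/output/",
}

for role, uri in job_paths.items():
    print(role, "->", storage_backend(uri))
```

Keeping the input in S3 is what allows several clusters to read the same data at once, since the data does not live on any one cluster's disks.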
EMR: Features

 Hadoop Application Support
 You can use popular Hadoop applications such as Hive, Pig,
and HBase with Amazon EMR
 Save Money with Spot Instances
 Spot Instances cost less than On-Demand Instances, but
AWS can reclaim them when it needs the capacity back
 AWS Integration
 Amazon EMR is integrated with other Amazon Web Services
such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS,
CloudWatch, and AWS Data Pipeline
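To make the Spot Instances point above concrete, the following sketch compares the cost of the same cluster at two hourly rates. Both prices are invented for illustration and are not real AWS rates. Actual Spot prices vary over time and by instance type. Amounts are kept in whole cents to avoid floating-point rounding.

```python
def cluster_cost_cents(node_count, hours, price_cents_per_hour):
    """Total cost of running node_count instances for the given number of hours."""
    return node_count * hours * price_cents_per_hour


# Hypothetical prices, in cents per instance-hour (assumptions, not AWS rates).
ON_DEMAND_CENTS = 20
SPOT_CENTS = 6

on_demand = cluster_cost_cents(node_count=10, hours=5, price_cents_per_hour=ON_DEMAND_CENTS)
spot = cluster_cost_cents(node_count=10, hours=5, price_cents_per_hour=SPOT_CENTS)
savings_percent = 100 * (on_demand - spot) // on_demand

print(f"On-Demand: {on_demand} cents, Spot: {spot} cents, saved: {savings_percent}%")
```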
EMR: Features

 Instance Options
 When you launch a cluster on Amazon EMR, you specify the
size and capabilities of the virtual servers used in the cluster
 MapR Support
 Amazon EMR supports several MapR distributions
 Business Intelligence Tools
 Amazon EMR integrates with popular business intelligence
(BI) tools such as Tableau, MicroStrategy, and Datameer
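The Instance Options bullet above says that you specify the size and capabilities of the servers at launch. One way to picture this is as a request describing the instance groups. The sketch below only builds such a description as a plain dictionary. The key names follow the parameters of the `run_job_flow` call in the AWS SDK for Python (boto3). The helper name `build_cluster_request`, the cluster name, the instance types, and the release label are illustrative assumptions. No AWS call is made, and a real request would also need IAM role settings that are left out here.

```python
def build_cluster_request(name, master_type, core_type, core_count, release_label):
    """Describe a cluster with one master node and core_count worker nodes."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": master_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": core_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


# All argument values below are placeholders chosen for the example.
request = build_cluster_request(
    name="demo-cluster",
    master_type="m5.xlarge",
    core_type="m5.xlarge",
    core_count=4,
    release_label="emr-6.15.0",
)
total_nodes = sum(g["InstanceCount"] for g in request["Instances"]["InstanceGroups"])
print(total_nodes)
```

Resizing the cluster later amounts to changing the instance count of a group, which is why the size is expressed per group rather than as a single number.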
EMR: Features

 User Control
 When you launch a cluster using Amazon EMR, you have root
access to the cluster and can install software and configure the
cluster before Hadoop starts
 Management Tools
 You can manage your clusters using the Amazon EMR console
(a web-based user interface), a command line interface, web
service APIs, and a variety of SDKs
 Security
 You can run Amazon EMR in an Amazon VPC in which you
configure networking and security rules
EMR Lab

INTRODUCTION TO AMAZON EMR


Redshift
Redshift

 Amazon Redshift is a fast, fully managed, petabyte-scale
data warehouse service
 It is optimized for datasets ranging from a few
hundred gigabytes to a petabyte or more and costs
less than $1,000 per terabyte per year
 The first step in creating a data warehouse is to launch
a set of nodes, called an Amazon Redshift cluster.
After you provision your cluster, you can upload your
data set and then run data analysis queries
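The slides do not say how the data set is uploaded. One common route, assumed here, is to stage files in Amazon S3 and load them with Redshift's `COPY` command, then query the table with ordinary SQL. The sketch below only assembles the SQL text and does not connect to a cluster. The helper name `build_copy_statement`, the table name, the bucket, and the role ARN are placeholders.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Assemble a COPY statement that loads CSV files from S3 into a table.

    The values are interpolated directly, so this is only suitable for
    trusted, hard-coded inputs such as the placeholders used below.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV;"
    )


# Placeholder identifiers for illustration only.
copy_sql = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)

# A follow-up analysis query against the loaded table (column names are assumed).
query_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region;"

print(copy_sql)
print(query_sql)
```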
