9. Elastic MapReduce and Redshift

Elastic Map Reduce
AMAZON EMR
EMR
 Analyze and process vast amounts of data

 Uses Apache Hadoop
 EMR consists of Master and Slave Nodes
 Hadoop uses a distributed processing architecture
called MapReduce, in which a task is mapped to a set
of servers for processing. The results of the
computations performed by those servers are then
reduced to a single output set. One node,
designated as the master node, controls the
distribution of tasks.
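The map-then-reduce flow described above can be sketched in plain Python. This is only a single-process illustration of the idea, not how Hadoop or Amazon EMR is implemented; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk stands in for the input split one worker node would process.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_phase(chunk))
    return reduce_phase(shuffle(intermediate))


if __name__ == "__main__":
    print(word_count(["big data big", "data on EMR"]))
```

On a real cluster, the chunks would be processed in parallel on different nodes and the master node would coordinate which node handles which chunk.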
Hadoop
 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data
sets across clusters of computers using simple
programming models
 It is designed to scale up from single servers to
thousands of machines
 Hadoop is designed to detect and handle failures at
the application layer
EMR
 Hadoop clusters running on Amazon EMR use EC2
instances as virtual Linux servers for the master and
slave nodes,
 Amazon S3 for bulk storage of input and output data,
 and CloudWatch to monitor cluster performance and
raise alarms.
 You can also move data into and out of DynamoDB
using Amazon EMR and Hive
EMR
 Open-source projects that run on top of the Hadoop
architecture can also be run on Amazon EMR
 Hive, Pig, HBase, DistCp, and Ganglia are already
integrated with Amazon EMR
EMR: Advantages
 The ability to provision clusters of virtual servers
within minutes.
 You can scale the number of virtual servers in your
cluster to manage your computation needs, and only
pay for what you use.
 Integration with other AWS services
EMR: Features
 Resizable Clusters
 When you run your Hadoop cluster on Amazon EMR, you can
easily expand or shrink the number of virtual servers in your
cluster depending on your processing needs
 Pay Only for What You Use
 Amazon's pay-as-you-go model: you are billed only for the
resources your cluster uses
 Easy to Use
 When you launch a cluster on Amazon EMR, the web service
allocates the virtual server instances and configures them with
the needed software for you. Within minutes you can have a
cluster configured and ready to run your Hadoop application
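As a rough sketch of what "specifying a cluster" looks like in code, the helper below builds a request dictionary shaped like the keyword arguments of the `run_job_flow` call in boto3's EMR client. The helper name `build_cluster_request`, the instance type, the release label, and the default role names are assumptions for illustration; check them against your own account before use.

```python
def build_cluster_request(name, worker_count, instance_type="m5.xlarge"):
    """Return a dict shaped like the keyword arguments of boto3's EMR run_job_flow."""
    if worker_count < 1:
        raise ValueError("a cluster needs at least one worker node")
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "workers",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    # Changing this count is how the cluster is sized.
                    "InstanceCount": worker_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", worker_count=3)
    print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

With AWS credentials configured, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`; the sketch stops short of making the call so that it runs without an AWS account.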
EMR: Features
 Use Amazon S3 or HDFS
 You can store your input and output data in Amazon S3, on the
cluster in HDFS, or a mix of both. Amazon S3 can be accessed
like a file system from applications running on your Amazon
EMR cluster
 Parallel Clusters
 If your input data is stored in Amazon S3 you can have
multiple clusters accessing the same data simultaneously
EMR: Features
 Hadoop Application Support
 You can use popular Hadoop applications such as Hive, Pig,
and HBase with Amazon EMR
 Save Money with Spot Instances
 Spot Instances usually cost less than On-Demand
Instances
 AWS Integration
 Amazon EMR is integrated with other Amazon Web Services
such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS,
CloudWatch, and AWS Data Pipeline
EMR: Features
 Instance Options
 When you launch a cluster on Amazon EMR, you specify the
size and capabilities of the virtual servers used in the cluster
 MapR Support
 Amazon EMR supports several MapR distributions
 Business Intelligence Tools
 Amazon EMR integrates with popular business intelligence
(BI) tools such as Tableau, MicroStrategy, and Datameer
EMR: Features
 User Control
 When you launch a cluster using Amazon EMR, you have root
access to the cluster and can install software and configure the
cluster before Hadoop starts
 Management Tools
 You can manage your clusters using the Amazon EMR console
(a web-based user interface), a command line interface, web
service APIs, and a variety of SDKs
 Security
 You can run Amazon EMR in an Amazon VPC in which you
configure networking and security rules
EMR Lab
INTRODUCTION TO AMAZON EMR

Redshift
 Amazon Redshift is a fast, fully managed, petabyte-scale
data warehouse service
 It is optimized for datasets ranging from a few
hundred gigabytes to a petabyte or more and costs
less than $1,000 per terabyte per year
 The first step in creating a data warehouse is to launch
a set of nodes, called an Amazon Redshift cluster.
After you provision your cluster, you can upload your
data set and then run data analysis queries.
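The "upload your data set" step is commonly done with Redshift's SQL `COPY` command, which loads files from Amazon S3 into a table. The sketch below only renders such a statement as text; the helper name `copy_statement`, the table name, the bucket path, and the IAM role ARN are all hypothetical placeholders, and the helper does no escaping, so it is meant for illustration rather than for untrusted input.

```python
def copy_statement(table, s3_uri, iam_role_arn):
    """Render a Redshift COPY command that loads CSV files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV;"
    )


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    print(copy_statement(
        "sales",
        "s3://example-bucket/sales/",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    ))
    # Once the data is loaded, analysis is ordinary SQL, for example:
    print("SELECT region, SUM(amount) FROM sales GROUP BY region;")
```

The rendered statement would be run against the cluster through any SQL client connected to it.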
