Big Data HDP Introduction


Introduction to Hortonworks Data Platform (HDP)

© Copyright IBM Corporation 2021


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit objectives
• Describe the functions and features of HDP.
• List the IBM added value components.
• Describe the purpose and benefits of each added value component.

Introduction to Hortonworks Data Platform (HDP) © Copyright IBM Corporation 2021


Hortonworks Data Platform overview



Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components



Hortonworks Data Platform
• HDP is a platform for data at rest.
• It is a secure, enterprise-ready open-source Apache Hadoop distribution
that is based on a centralized architecture (YARN).
• HDP has the following attributes:
▪ Open
▪ Central
▪ Interoperable
▪ Enterprise-ready



Hortonworks Data Platform

[Figure: HDP component stack]
• Governance and Integration
▪ Data Lifecycle and Governance: Falcon, Atlas
▪ Data workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
• Tools: Zeppelin, Ambari User Views
• Security: Ranger, Knox, Atlas, HDFS Encryption
• Operations: Ambari, Cloudbreak, ZooKeeper, Oozie
• Data Access
▪ Batch: MapReduce
▪ Script: Pig
▪ SQL: Hive
▪ NoSQL: HBase, Accumulo, Phoenix
▪ Stream: Storm
▪ Search: Solr
▪ In-Mem: Spark
▪ Others: HAWQ, Partners, Db2 Big SQL
▪ Tez, Slider
• Data Management
▪ YARN: Data Operating System
▪ Hadoop Distributed File System (HDFS)



Data flow



Kafka

• Apache Kafka is a fast, scalable, durable, and fault-tolerant
publish-subscribe messaging system.
▪ Used for building real-time data pipelines and streaming apps

• Often used in place of traditional message brokers, such as those based
on JMS and AMQP, because of its higher throughput, reliability, and
replication.

• Kafka works in combination with a variety of Hadoop tools:
▪ Apache Storm
▪ Apache HBase
▪ Apache Spark
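
The publish-subscribe idea can be illustrated with a toy, in-memory model. This sketch is not Kafka's client API: the class ToyTopic and its publish and poll methods are hypothetical names chosen for illustration, and a real Kafka topic is partitioned, replicated, and persisted on brokers. It only shows the core behavior of an append-only log that independent consumer groups read at their own offsets.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (not Kafka's API)."""

    def __init__(self):
        self._log = []                      # append-only record log
        self._offsets = defaultdict(int)    # next offset to read, per consumer group

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._log.append(record)
        return len(self._log) - 1

    def poll(self, group, max_records=10):
        """Return unread records for `group` and advance that group's offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ("click", "view", "purchase"):
    topic.publish(event)

# Two independent subscriber groups each read the full stream at their own pace.
dashboard_batch = topic.poll("dashboard")
archiver_batch = topic.poll("archiver", max_records=2)
archiver_rest = topic.poll("archiver")
```

Because each group keeps its own offset, a slow consumer does not hold back a fast one, which is one reason a log-based design suits streaming pipelines.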



Sqoop

• Tool for easily importing data from structured databases (Db2, MySQL,
Netezza, Oracle, and more) into your Hadoop cluster, including related
Hadoop systems such as Hive and HBase

• Can also be used to extract data from Hadoop and export it to relational
databases and enterprise data warehouses

• Helps offload some tasks, such as ETL, from the enterprise data warehouse
to Hadoop for lower-cost, efficient execution
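
As a rough sketch of what an import looks like in practice, the helper below assembles a `sqoop import` command line without running it. The function name sqoop_import_args, the JDBC URL, the account, and all paths are assumptions made up for illustration; the flags shown (--connect, --username, --password-file, --table, --target-dir, --num-mappers) are standard Sqoop import options.

```python
import shlex


def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Assemble an argv list for `sqoop import`; nothing is executed here."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


args = sqoop_import_args(
    jdbc_url="jdbc:mysql://db.example.com/sales",   # hypothetical database
    username="etl_user",                            # hypothetical account
    password_file="/user/etl_user/.db-password",    # hypothetical HDFS path
    table="orders",
    target_dir="/data/raw/orders",
)
command_line = shlex.join(args)
print(command_line)
```

The export direction mentioned above uses `sqoop export` with `--export-dir` pointing at the HDFS data to push back to the database.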



Data access



Hive

• Apache Hive is a data warehouse system built on top of Hadoop.

• Hive facilitates easy data summarization, ad-hoc queries, and the
analysis of very large datasets that are stored in Hadoop.

• Hive provides SQL on Hadoop
▪ Provides a SQL interface, better known as HiveQL or HQL, which allows for
easy querying of data in Hadoop

• Includes HCatalog
▪ Global metadata management layer that exposes Hive table metadata to
other Hadoop applications.
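
To give a feel for the kind of summarization query this refers to, the sketch below runs a plain GROUP BY aggregate. It uses Python's built-in sqlite3 purely as a local stand-in so the example runs without a cluster; it is not Hive, and the page_views table and its data are made up. A SELECT of this shape is also valid HiveQL, although table creation and data loading differ in Hive.

```python
import sqlite3

# sqlite3 is only a local stand-in so the query can run without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, bytes_sent INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("DE", 120), ("US", 300), ("DE", 80), ("US", 50), ("FR", 10)],
)

# A summarization query: row count and total bytes per country.
summary_sql = """
    SELECT country, COUNT(*) AS views, SUM(bytes_sent) AS total_bytes
    FROM page_views
    GROUP BY country
    ORDER BY country
"""
summary = conn.execute(summary_sql).fetchall()
print(summary)
```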



Pig

• Apache Pig is a platform for analyzing large data sets.
• Pig consists of a high-level language called Pig Latin, which was
designed to simplify MapReduce programming.
• Pig's infrastructure layer consists of a compiler that produces
sequences of MapReduce programs from the Pig Latin code that you
write.
• The system is able to optimize your code and "translate" it into
MapReduce, allowing you to focus on semantics rather than efficiency.
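
To show what Pig Latin spares you from writing, here is a word count spelled out by hand as map, shuffle, and reduce steps in plain Python. This is only a single-process illustration of the MapReduce pattern, not the code Pig generates; the function names and input lines are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one word."""
    return key, sum(values)


lines = ["Pig compiles Pig Latin", "Latin scripts become jobs"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In Pig Latin the same job is a handful of declarative statements (LOAD, FOREACH, GROUP, and so on), and the compiler decides how to split the work into MapReduce stages.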



HBase

• Apache HBase is a distributed, scalable, big data store.

• Use Apache HBase when you need random, real-time read/write
access to your big data.
▪ The goal of the HBase project is to be able to handle very large tables of
data that are running on clusters of commodity hardware.

• HBase is modeled after Google's BigTable and provides BigTable-like
capabilities on top of Hadoop and HDFS. HBase is a NoSQL data store.



Accumulo

• Apache Accumulo is a sorted, distributed key/value store that provides
robust, scalable data storage and retrieval.

• Based on Google’s BigTable and runs on YARN
▪ Think of it as a "highly secure HBase"

• Features:
▪ Server-side programming
▪ Designed to scale
▪ Cell-based access control
▪ Stable
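
Cell-based access control means every stored cell carries its own visibility label, and a scan returns only the cells the reader is authorized to see. The sketch below is a toy model of that idea, not Accumulo's API: the Cell type, the scan function, and the sample data are hypothetical, and visibility is simplified to "the reader must hold all listed labels" (real Accumulo visibility expressions also allow OR and parentheses).

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str
    visibility: frozenset   # labels a reader must ALL hold (toy AND-only rule)
    value: str


def scan(cells, authorizations):
    """Return, in sorted key order, only the cells the reader may see."""
    auths = frozenset(authorizations)
    visible = [c for c in cells if c.visibility <= auths]
    return sorted(visible, key=lambda c: (c.row, c.column))


table = [
    Cell("patient42", "name", frozenset({"staff"}), "J. Doe"),
    Cell("patient42", "diagnosis", frozenset({"staff", "medical"}), "flu"),
    Cell("patient42", "room", frozenset(), "12B"),
]

clerk_view = [c.column for c in scan(table, {"staff"})]
doctor_view = [c.column for c in scan(table, {"staff", "medical"})]
public_view = [c.column for c in scan(table, set())]
```

Two readers scanning the same row get different results, which is the property that motivates the "highly secure HBase" description.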



Spark
• Apache Spark is a fast and general engine for large-scale data
processing.
• Spark has a variety of advantages including:
▪ Speed
− Run programs faster than MapReduce in memory
▪ Easy to use
− Write apps quickly with Java, Scala, Python, R
▪ Generality
− Can combine SQL, streaming, and complex analytics
▪ Runs on a variety of environments and can access diverse data sources
− Hadoop, Mesos, standalone, cloud…
− HDFS, Cassandra, HBase, S3…



Storm
• Apache Storm is an open source distributed real-time computation
system.
▪ Fast
▪ Scalable
▪ Fault-tolerant

• Used to process large volumes of high-velocity data

• Useful when milliseconds of latency matter and Spark isn't fast enough
▪ Has been benchmarked at over a million tuples processed per second per
node



Data lifecycle and governance



Atlas
• Apache Atlas is a scalable and extensible set of core foundational
governance services
▪ Enables enterprises to effectively and efficiently meet their compliance
requirements within Hadoop
• Exchanges metadata with other tools and processes within and outside
of Hadoop
▪ Allows integration with the whole enterprise data ecosystem
• Atlas Features:
▪ Data classification
▪ Centralized auditing
▪ Centralized lineage
▪ Security and policy engine



Security



Ranger
• Centralized security framework to enable, monitor, and manage
comprehensive data security across the Hadoop platform

• Manage fine-grained access control over Hadoop data access
components like Apache Hive and Apache HBase

• Using the Ranger console, you can easily manage policies for access to
files, folders, databases, tables, or columns

• Policies can be set for individual users or groups
▪ Policies enforced within Hadoop
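
The user-or-group policy idea can be sketched as a small allow-list check. This is a toy model under stated assumptions, not Ranger's policy format or API: the Policy class, the is_allowed function, and the resource naming scheme are all hypothetical, and real Ranger policies are considerably richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    resource: str                      # e.g. "hive:sales.orders" (hypothetical naming)
    access: set                        # allowed access types, e.g. {"select"}
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)


def is_allowed(policies, user, user_groups, resource, access_type):
    """Allow only if some policy covers the resource, access type, and caller."""
    for p in policies:
        if p.resource != resource or access_type not in p.access:
            continue
        if user in p.users or p.groups & set(user_groups):
            return True
    return False   # deny by default when no policy matches


policies = [
    Policy("hive:sales.orders", {"select"}, groups={"analysts"}),
    Policy("hive:sales.orders", {"select", "update"}, users={"etl_user"}),
]

alice_can_select = is_allowed(policies, "alice", ["analysts"], "hive:sales.orders", "select")
alice_can_update = is_allowed(policies, "alice", ["analysts"], "hive:sales.orders", "update")
etl_can_update = is_allowed(policies, "etl_user", [], "hive:sales.orders", "update")
```

Keeping the policies in one place and evaluating them at access time is what lets a single console govern several components at once.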



Operations



Ambari

• Tool for provisioning, managing, and monitoring Apache Hadoop clusters.

• Provides an intuitive, easy-to-use Hadoop management web UI backed by
its RESTful APIs

• Ambari REST APIs
▪ Allow application developers and system integrators to easily integrate
Hadoop provisioning, management, and monitoring capabilities in their own
applications
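
As a minimal sketch of calling the REST API from your own application, the helper below builds (but does not send) an authenticated request to list clusters. The host name, port, and credentials are placeholders, and the helper name ambari_request is made up; the /api/v1/clusters resource and HTTP basic authentication are standard for Ambari.

```python
import base64
import urllib.request


def ambari_request(base_url, path, username, password):
    """Build (but do not send) a GET request for an Ambari REST resource."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/{path.lstrip('/')}",
        headers={
            "Authorization": f"Basic {token}",
            # Ambari requires this header on POST/PUT/DELETE; it is harmless on GET.
            "X-Requested-By": "ambari",
        },
        method="GET",
    )


# Placeholder host and credentials, for illustration only.
req = ambari_request("http://ambari.example.com:8080", "clusters", "admin", "admin")
print(req.get_method(), req.full_url)
# To call a real server: urllib.request.urlopen(req), then parse the JSON body.
```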



Cloudbreak
• A tool for provisioning and managing Apache Hadoop clusters in the
cloud

• Automates launching of elastic Hadoop clusters

• Policy-based autoscaling on several cloud infrastructure platforms,
including:
▪ Microsoft Azure
▪ Amazon Web Services
▪ Google Cloud Platform
▪ OpenStack
▪ Platforms that support Docker containers



Oozie
• Oozie is a Java-based workflow scheduler system to manage Apache
Hadoop jobs

• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions

• Integrated with the Hadoop stack
▪ YARN is its architectural center
▪ Supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop
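
Why a workflow must be a DAG can be seen by ordering its actions so that each runs only after the actions it depends on. The sketch below uses Python's standard graphlib for that ordering. It is not Oozie itself (real Oozie workflows are defined in XML), and the action names are a made-up example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it waits for.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "hive_report": {"pig_clean"},
    "sqoop_export": {"hive_aggregate", "hive_report"},
}

# static_order() yields every action after all of its prerequisites.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two Hive actions have no dependency on each other, so a scheduler is free to run them in parallel. If the graph contained a cycle, no valid order would exist, and graphlib would raise a CycleError.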
