Electronic Presentation
D75058GC20 | Edition 2.0 | November 2016

Technical Contributors and Reviewers
Martin Gubar, Bill Beauregard, Melliyal Annamalai, Jean Ihm, Bob DiMeo, Jean-Pierre Dijcks, Ben Gelernter, Bob Stanoch, Frederick Kush

Publishers
Veena Narasimhan, Michael Sebastian Almeida, Raghunath M

This document contains proprietary information and is protected by copyright and other intellectual property laws. You may copy and print this document solely for your own use in an Oracle training course. The document may not be modified or altered in any way. Except where your use constitutes "fair use" under copyright law, you may not use, share, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit, or distribute this document in whole or in part without the express authorization of Oracle.

The information contained in this document is subject to change without notice. If you find any problems in the document, please report them in writing to: Oracle University, 500 Oracle Parkway, Redwood Shores, California 94065 USA. This document is not warranted to be error-free.

Restricted Rights Notice
If this documentation is delivered to the United States Government or anyone using the documentation on behalf of the United States Government, the following notice is applicable:
Introduction
Course Objectives
Course Road Map Lesson 1: Introduction
Questions About You
To ensure that the class can be customized to meet your specific needs and to encourage
interaction, answer the following questions:
Which organization do you work for?
What is your role in your organization?
If you are a DBA, what products have you worked on?
Have you used the Oracle BDA?
Have you used any Hadoop components?
Do you meet the course prerequisites?
What do you hope to learn from this course?
Lesson Objectives
Oracle Big Data Lite VM
Oracle Big Data Lite VM Home Page Sections
Introduction: Introduction to the Big Data Lite VM and the installed components
Download Oracle Big Data Lite VM: Contains the Deployment Guide document (important details), links to download the required (13) 7-zip files, and links to download the required Oracle VM VirtualBox plus its Extension Pack and the 7-zip utility
Previous Version: Download previous versions of the Big Data Lite VM and compare the available components with older versions
Getting Started: View information (and demos) about the Oracle MoviePlex demo data that is used in several of the big data courses
Hands-on Lab: Access to several hands-on labs
Web Sites/White Papers/EBook/Blogs: Resources that help you learn more about the Oracle big data platform
Components of the Oracle Big Data Lite VM
Downloading, Installing, and Using the Oracle Big Data Lite VM
1. Review the important Deployment Guide document for detailed installation steps.
2. Download and install Oracle VM VirtualBox, its Extension Pack, and 7-zip files.
3. Run the 7-zip extractor on the BigDataLite450.7z.001 file only. This will create the
BigDataLite450.ova VirtualBox appliance file.
4. In VirtualBox, import BigDataLite450.ova.
5. Start BigDataLite-4.5.0.
6. Log in as oracle/welcome1.
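On a Linux host, steps 3 and 4 might look like the following sketch (file names as listed above; the commands assume the 7-Zip and VirtualBox command-line tools are installed and on your PATH):

# Extract the multi-part archive; 7-Zip picks up the remaining .7z.0NN parts automatically
7z x BigDataLite450.7z.001

# Import the resulting appliance (or use File > Import Appliance in the VirtualBox GUI)
VBoxManage import BigDataLite450.ova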
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}
{"custId":1067283,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:55","recommended":null,"activity":9}
{"custId":1377537,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:58","recommended":null,"activity":9}
{"custId":1347836,"movieId":null,"genreId":null,"time":"2012-07-01:00:02:03","recommended":null,"activity":8}
{"custId":1137285,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:39","recommended":null,"activity":8}
{"custId":1354924,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:51","recommended":null,"activity":9}
{"custId":1036191,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:55","recommended":null,"activity":8}
{"custId":1143971,"movieId":1017161,"genreId":44,"time":"2012-07-01:00:04:00","recommended":"Y","activity":7}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:04:03","recommended":"Y","activity":5}
{"custId":1273464,"movieId":null,"genreId":null,"time":"2012-07-01:00:04:39","recommended":null,"activity":9}
{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
...
[Architecture diagram: the Oracle MoviePlex demo. Application log activity for the MoviePlex site is captured and streamed into HDFS using Flume; customer profile data and required recommendations (such as recommended movies) are served from Oracle NoSQL DB. MapReduce jobs process the data (ORCH for collaborative-filtering recommendations, Pig to sessionize, Hive for activities); the recommendations and the session and activity data are loaded through Oracle Big Data Connectors into Oracle Exadata, where Oracle Advanced Analytics (clustering/market basket, mood), Endeca Information Discovery, and Oracle Business Intelligence EE on Oracle Exalytics analyze the results.]
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
Field Description
custId The customer's ID
movieId The ID of the selected movie
genreId The genre of the selected movie
time The timestamp when the customer watched the movie
recommended? Whether the selected movie was recommended, Y or N
Activity 1: Rate movie
2: Completed movie
3: Not completed
4: Started movie
5: Browsed movie
6: List movies
7: Search
education.oracle.com
Big Data: A Strategic Information Management (IM) Perspective
Information Management
Information Management is the means by which an organization maximizes the efficiency and value with which it plans, collects, organizes, uses, controls, stores, and disseminates its information.
Big Data
Big Data is a term used to describe data sets whose size is beyond the capability of the
software tools commonly used to capture, manage, and process data.
Big Data can be generated from many different sources, including:
Social networks
Banking and financial services
E-commerce services
Web-centric services
Internet search indexes
Scientific and document searches
Medical records
Web logs
Characteristics of Big Data
[Diagram: the characteristics of big data, such as volume and velocity, with example sources such as social networks]
Big Data Opportunities: Some Examples
Big Data Challenges
Information Management Landscape
[Chart: business value versus time/effort across the information management landscape]
Agenda
Apache Hadoop
Types of Analysis That Use Hadoop
Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment
Apache Hadoop Ecosystem
Oracle Big Data Appliance (BDA)
Agenda
Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database
HDFS: Characteristics
Master-slave architecture
Fault-tolerant (HA)
Redundant
Supports MapReduce
Scalable
Cluster: A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop: A batch-processing infrastructure that stores and distributes files, and distributes work, across a group of servers (nodes).
Hadoop Cluster: A collection of racks containing master and slave nodes.
Blocks: HDFS breaks a data file down into blocks, or "chunks," and stores the data blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor: HDFS makes three copies of each data block (by default) and stores them on different DataNodes/racks in the Hadoop cluster.
NameNode (NN): A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the HDFS cluster.
DataNode (DN): Stores the blocks, or "chunks," of data for a set of files.
[Diagram: blocks A, B, and C of a file are each replicated across multiple DataNodes in the cluster]
The NameNode web UI is available at localhost:50070.
Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Hue
Hadoop client
WebHDFS
HttpFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database
Manage HDFS
Advantages:
Enables direct HDFS writes without intermediate file staging on Linux FS
Easy to scale:
Initiate concurrent puts for multiple files
HDFS will leverage multiple target servers and ingest faster
Disadvantages:
Additional software (the Hadoop client) needs to be installed on the source server.
hadoop fs <args>
The hadoop fs command is run as a Linux command and takes an HDFS file system command as its argument. For the full File System Shell reference, see:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
The ls command lists both files and directories.
For a file, it returns stat on the file with the following format:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory, it returns a list of its direct children, as in UNIX. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
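As an illustration of this output format (the directory and file names below are assumptions, not the actual lab content):

$ hadoop fs -ls /user/oracle/curriculum
Found 2 items
drwxr-xr-x   - oracle oracle          0 2016-07-26 19:20 /user/oracle/curriculum/wordcount
-rw-r--r--   3 oracle oracle      14231 2016-07-26 19:24 /user/oracle/curriculum/lab_05_01.txt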
Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:
Display the contents of the part-r-00000 HDFS file by using the cat command:
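A minimal sketch of the two commands (the HDFS paths are assumptions; substitute the directories used in your lab):

$ hadoop fs -copyFromLocal lab_05_01.txt /user/oracle/curriculum/
$ hadoop fs -cat /user/oracle/curriculum/wordcount/part-r-00000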
Advantages:
WebHDFS performance comparable with the Hadoop client
No additional software required on the client side
Disadvantages:
Complex syntax (comparable with the Hadoop client)
HttpFS utilizes a single gateway node that can be a potential bottleneck.
[Diagram: data loading is initiated on the client side with the curl command; no Hadoop client is required on the source server. Data flows from the source server's Linux file system to the HDFS nodes of the Big Data Appliance.]
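For illustration only (the host, port, and target path are assumptions), a WebHDFS upload with curl is a two-step operation:

# Step 1: ask the NameNode where to write; it responds with a 307 redirect
$ curl -i -X PUT "http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/data/file.txt?op=CREATE&user.name=oracle"

# Step 2: PUT the file content to the DataNode URL returned in the Location header of step 1
$ curl -i -X PUT -T file.txt "<Location header URL from step 1>"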
Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database
Is a distributed service for collecting, aggregating, and moving large amounts of data to a centralized data store
Was developed by Apache
Has the following features:
Simple
Reliable
Fault tolerant
Used for online analytic applications
[Diagram: a Flume agent receives events from a web server through a source, buffers them in a channel, and writes them to HDFS through a sink]
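A minimal sketch of such an agent, with assumed agent, source, and path names (not taken from the course labs): an exec source tails an application log, a memory channel buffers the events, and an HDFS sink writes them out.

# Define one source, one channel, and one sink for agent "a1"
cat > movieplex-agent.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Exec source: tail the web application log (path is an assumption)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/movieplex/activity.json
a1.sources.r1.channels = c1

# In-memory channel
a1.channels.c1.type = memory

# HDFS sink: write events into an HDFS directory (path is an assumption)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/oracle/moviework/applog
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

# Start the agent
flume-ng agent --name a1 --conf-file movieplex-agent.conf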
Benefits of Oracle NoSQL Database (a key-value store):
Easy to install and configure
Highly reliable
General-purpose database system
Scalable throughput and predictable latency
Configurable consistency and durability
[Diagram: Oracle NoSQL Database use cases, from systems that run the business (ERP, EAM, CRM, inventory control, accounting and payroll, business process management, business analytics) to systems that expand the business (customer portals, real-time event processing, globally distributed always-on data, mobile data management, time series and sensor data management, online banking), highlighting simple data management, the competitive advantages of fast data, and lower TCO through commodity hardware scale-out.]
[Diagram: the data structure spectrum, from no inherent structure, to simple data structures, to complex data structures with rich SQL.]
www.education.oracle.com
Agenda
MapReduce
Spark
YARN
Apache Hadoop Core Components
Apache Hadoop is a system for large-scale and distributed data processing. It has two
core components:
Distributed storage with HDFS
Distributed and parallel processing with MapReduce or Spark framework
MapReduce is a batch-oriented software framework that enables you to write
applications that will process large amounts of data in parallel, on large clusters of
commodity hardware, and in a reliable and fault-tolerant manner.
The Apache Spark framework is a cluster-computing platform designed to
be fast and general-purpose. It extends the MapReduce functionality (covered later).
Both MapReduce and Spark are managed by YARN (the acronym for Yet Another
Resource Negotiator).
MapReduce Framework Features
Integrates with HDFS and provides the same benefits for parallel data processing
Parallelizes and distributes computations to where the data is stored
The framework:
Schedules and monitors tasks, and re-executes failed tasks
Hides complex distributed computing tasks from the developer
Enables developers to focus on writing the Map and Reduce functions
MapReduce code can be written in Java, C, and scripting languages. Higher-level abstractions (such as Hive and Pig) enable easy interaction; their optimizers construct the MapReduce jobs.
MapReduce Job
MapReduce Jobs Flow
[Diagram: MapReduce job flow. Each input data set in HDFS is split across multiple Map tasks; the map output is shuffled and sorted and passed to Reduce tasks, which write the final output back to HDFS.]
Simple Word Count MapReduce: Example
Input text from HDFS:
1: Hello BigData World
2: Bye Bye BigData World
Mapper 1 processes input split 1 and emits: Hello: 1, BigData: 1, World: 1
Mapper 2 processes input split 2 and emits: Bye: 1, Bye: 1, BigData: 1, World: 1
After the shuffle and sort phase, each reducer sums the values for its keys and produces the final counts: BigData: 2, Bye: 2, Hello: 1, World: 2
How Do You Run a MapReduce Job?
hadoop jar: Instructs the client to submit the job to the ResourceManager
WordCount.jar: The JAR file that contains the Map and Reduce code
WordCount: The name of the class that contains the main method where processing starts
/user/oracle/wordcount/input_directory: The input directory
/user/oracle/wordcount/output_directory: A single HDFS output path; all final output is written to this directory
Submitting a WordCount MapReduce Job:
Review the Input Data Files
The MapReduce JobHistory Server archives job metrics and can be accessed through the JobHistory web UI (http://bigdatalite.localdomain:19888/jobhistory) or Hue.
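The same metrics are also exposed through the JobHistory Server REST API; for example (same host as above, output abbreviated):

$ curl http://bigdatalite.localdomain:19888/ws/v1/history/mapreduce/jobs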
MapReduce
Spark
YARN
Speed
Ease of use (higher level of abstraction than MapReduce)
Sophisticated analytics
Runs on various cluster managers, such as YARN
Native integration with Java, Python, and Scala
There are two interactive Spark shells available for executing Spark programs:
spark-shell is used for Scala.
pyspark is used for Python.
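For example, either shell can be launched against the cluster's YARN ResourceManager (a sketch; the exact master options may vary by Spark version):

$ spark-shell --master yarn    # Scala shell
$ pyspark --master yarn        # Python shell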
MapReduce
Spark
YARN
YARN is:
A subproject of Hadoop that separates the resource-management and processing components
A resource-management framework for Hadoop that is independent of execution engines
Able to run both MapReduce and non-MapReduce jobs
YARN runs on top of HDFS storage and provides:
Scalability
Compatibility with MapReduce
Improved cluster utilization
Support for workloads other than MapReduce
ResourceManager (RM): A dedicated scheduler that allocates resources to the requesting applications. It has two main components: Scheduler and ApplicationsManager. It is a critical component of the cluster and runs on a dedicated master node. High availability (HA) is provided by active and standby RMs with automatic failover.
NodeManager (NM): Each slave node in the cluster has an NM daemon, which acts as a slave for the RM. Each NM tracks the available data processing resources and usage (CPU, memory, disk, and network) on its slave node and sends regular reports to the RM.
ApplicationMaster (AM): The per-application AM is responsible for negotiating resources from the RM and working with the NMs to execute and monitor the tasks. It runs on a slave node.
Container: A collection of all the resources necessary to run an application: CPU, memory, network bandwidth, and disk space. It runs on a slave node in the cluster.
Job History Server: Archives jobs and metadata.
[Diagram: YARN architecture. The ResourceManager and Job History Server run on master nodes; NodeManagers on the slave nodes host the containers that perform the distributed processing.]
YARN provides a pluggable model for scheduling policies. The scheduler is responsible for deciding where and when to run tasks. YARN supports the following pluggable schedulers:
FIFO (First In, First Out)
Allocates resources based on submission time (first in, first out)
Resource requests for the first application in the queue are allocated first; once its requests have been satisfied, the next application in the queue is served, and so on.
Capacity Scheduler
Allocates resources to pools, with FIFO scheduling within each pool (default in
Hadoop)
Fair Scheduler
Allows YARN applications to share resources in large clusters fairly
We will focus on the fair scheduler in this course.
Default in CDH5 (used in Oracle BDA)
Cloudera Manager provides the following features to assist you with allocating cluster
resources to services:
Static allocation (percentage of cluster resources)
Dynamic allocation
[Diagram: service queues ("pools"). Gray boxes represent statically allocated resource pools, such as HDFS (30%) and Impala (15%); hrpool and marketingpool are examples of dynamically allocated pools.]
Data Unification
Introducing Data Unification Options
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
Unifying Data: A Typical Requirement
Oracle Big Data Management System
Oracle Big Data Connectors
[Diagram: Oracle Big Data Connectors linking the data lake and the data warehouse: Oracle Loader for Hadoop (load into the database), Oracle R Advanced Analytics for Hadoop (R client and R analytics), Oracle XQuery for Hadoop (XQuery and XML/XQuery), Oracle SQL Connector for HDFS and lightweight Big Data SQL (Oracle SQL query of HDFS data), and Oracle Data Integrator knowledge modules.]
Agenda
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
Oracle Loader for Hadoop (OLH)
Provides fast and efficient data loading from a Hadoop cluster into a table in an Oracle
Database
Pre-partitions the data if necessary and transforms it into a database-ready format
Sorts records by primary key or user-specified columns before loading the data or
creating output files
Uses the parallel processing framework of Hadoop (MapReduce) to perform these
preprocessing operations
Is a Java MapReduce application that balances the data across reducers to help
maximize performance
Reads from sources that have the data already in a record format, or can split the lines
of a text file into fields
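OLH itself is submitted like any other MapReduce job. A typical invocation might look like the following sketch, where the configuration file name and OLH_HOME location are assumptions (the file supplies the database connection details, input format, and target table):

$ hadoop jar ${OLH_HOME}/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf MyOLHConfig.xml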
Copy to Hadoop
[Diagram: Copy to Hadoop stores Oracle Database data in HDFS as Oracle Data Pump files, optionally converted to Parquet, ORC, or text, enabling optimized queries both with Big Data SQL and with Hadoop ecosystem tools.]
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
[Diagram: a processing layer running on commodity servers or in mixed deployments, above a storage layer consisting of the HDFS filesystem and NoSQL databases (Oracle NoSQL DB, HBase).]
Start by looking at click data in HDFS, which has rich information about customers' behavior.
Easily access that data from Oracle Database and apply techniques for making complex
documents easy to query.
Query any data including recommendation data in a NoSQL DB.
Problem: Personally Identifiable Information (PII) data must be safeguarded. Apply same
techniques to safeguard data in HDFS and NoSQL that you do with Oracle Database data.
We can now apply any type of analysis using rich SQL. We will turn those clicks into customer sessions (something that would otherwise take hundreds of lines of Java code). Based on customer segmentation, review sessions that convert to sales. Finally, look at recommendations by genre and see how they drive interest and sales.
Applications using Oracle REST Data Services (ORDS) can take advantage of all of this.
Regardless of the interface, applications can leverage the rich analytics, security, and ability
to query any data.
Analyze on Hadoop
Direct, parallel, fast, secure access to master data
[Diagram: a Hive external table defined with a storage handler (input format) gives Hadoop engines such as Hive, Spark, Impala, and HCatalog direct access to an Oracle Database table.]
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
After completing this lesson, you should be able to describe the following products:
Oracle Big Data Discovery
Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop
Oracle Unified Big Data Management and Analytics Strategy
[Diagram: Oracle's unified big data management and analytics components, in the cloud and on-premises: Big Data Preparation, Data Integrator, GoldenGate, and IoT; the Hadoop platform, Big Data SQL, NoSQL Database, and Oracle Database; Big Data Discovery, R on Hadoop, and Big Data Spatial and Graph; Data Visualization, Business Intelligence, Spatial and Graph, and Advanced Analytics.]
Agenda
Big Data Labs: Fueling Enterprise Innovation
The data lab enables organizations to think and act like startups.
Data Lab: Key Principles
Opening Up the Lab to a Broad Community: Difficulties
Unlocking the Data Lab: New Approach
Find and explore Big Data to understand its potential. Quickly transform and enrich Big Data to make it better. Unlock Big Data for anyone to discover and share new value.
Oracle Big Data Discovery: The Visual Face of Big Data
Catalog (Find)
Access a rich, interactive catalog of all data in Hadoop.
Use familiar search and guided navigation for ease of use.
See data set summaries, user annotation, and recommendations.
Provision personal and enterprise data to Hadoop via self service.
Share projects, bookmarks, and snapshots with others.
Build galleries and tell Big Data stories.
Collaborate and iterate as a team.
Publish blended data to HDFS for use in other tools.
Install into your existing CDH or Hortonworks Data Platform (HDP) cluster. Leverage existing infrastructure and technology standards. Take Big Data Discovery to the Data Lake.
Run on the Oracle Big Data Appliance. Avail of high-performance hardware optimized for Big Data. Save 21% in costs and time versus commodity hardware.
Run Big Data Discovery workloads in Oracle Cloud. Experience fully automated lifecycle management. Avail of the industry's most secure and complete Big Data Cloud Service.
Democratize value from the data lab: Increase the size, diversify the skills, and improve the efficiency of Big Data teams.
Publish, secure, and leverage: Integrate with Hadoop open standards and leverage the unified Oracle Big Data ecosystem.
Are things in the same location? Who is the nearest? What tax zone is this in? Where
can I deliver in 35 minutes? What is in my sales territory? Is this built in a flood zone?
Which supplier am I most dependent on? Who is the most influential customer? Do my
products appeal to certain communities? What patterns are there in fraudulent behavior?
Spatial Analysis:
Location Data Enrichment
Proximity and containment analysis, clustering
Spatial data preparation (vector, raster)
Interactive visualization
Property Graph for Analysis:
Social media relationships
E-commerce targeted marketing
Cyber security, fraud detection
IoT, industrial engineering
Multimedia Analysis:
Framework for processing video and image data, such as facial recognition
Recommend the most similar item purchased by similar people.
Find people that are central in the given network, such as for influencer marketing.
Identify groups of people that are close to each other, such as for target group marketing.
Find all the sets of entities that match a given pattern, such as for fraud detection.
[Diagram: example graphs built from customers, purchased items, purchase records, and communication streams such as tweets.]
Cyber security: community detection, network monitoring, predictive analysis
Critical/alternate path analysis and multiple-system impact analysis: transportation, utilities, finance
[Diagram: property graph architecture. Flexible interfaces (Java, JavaScript, Python, and Groovy APIs; Tinkerpop Blueprints and Gremlin; Apache Lucene and SolrCloud text search) sit on a graph data access layer API. An in-memory analyst provides 35 built-in analytics and smart filtering of large graphs. The massively scalable graph database provides scalable and persistent property graph storage on multiple back ends (Apache HBase, Oracle NoSQL Database, Oracle Database) and scales securely to tens of billions of nodes and edges.]
86% of insurance companies agree that analyzing multiple data sources together is crucial to making accurate predictions.
88% agree that linking information by location is key to combining disparate sources of Big Data.
[Diagram: actuarial and demographic data, accident data, call data, and customer data are enriched with postal codes, categorized by region, and turned into data products for rate structures and underwriting/risk analysis.]
Source: "Big data: How data analytics can yield underwriting gold," a survey conducted by Ordnance Survey and the Chartered Insurance Institute, 25 April 2013.
Make developers and data scientists more productive with pre-built componentry and
templates for applications.
Pre-built parallel MapReduce and Spark spatial
algorithms
Raster and vector processing frameworks
Comparison with Big Data Discovery:
Big Data Discovery: An interactive tool
Big Data Spatial and Graph:
A developer-centric framework
Install into your existing CDH or Hortonworks Data Platform (HDP) cluster. Leverage existing infrastructure and technology standards. Take graph, spatial, and multimedia analysis to the Data Lake.
Run on the Oracle Big Data Appliance. Avail of high-performance hardware optimized for Big Data. Save 21% in costs and time versus commodity hardware.
Run Big Data Spatial and Graph workloads in Oracle Cloud. Experience fully automated lifecycle management. Avail of the industry's most secure and complete Big Data Cloud Service.
Hands-on lab and demo for Big Data Spatial and Graph property graphs:
http://www.oracle.com/technetwork/database/options/spatialandgraph/learnmore/biwa-2016-more-session-information-2889878.html
Blog (technical examples and tips):
https://blogs.oracle.com/bigdataspatialgraph/
Oracle Big Data Lite Virtual Machine, a free sandbox to get started:
www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
[Diagram: Oracle XQuery for Hadoop processes data formats such as text, Avro, JSON, and XML and can load results into Oracle Database.]
In this lesson, you should have learned about the following products:
Oracle Big Data Discovery
Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop
After completing this lesson, you should be able to identify the benefits of the Oracle Big
Data Appliance (BDA), such as:
Simplified deployment of big data production clusters
High performance
Secure
Manageable
Open
Agenda
Big Data Management System
Oracle BDA
Core Design Principles for BDA
Configuring and Installing the Oracle BDA: Road Map
Configuring and Installing the Oracle BDA: Key Players
[Diagram: key players in Oracle BDA installation and configuration: the customer, the install coordinator, the Oracle field engineer, Oracle Advanced Customer Support (ACS), and the customer working together with Oracle ACS.]
Key Definitions
Oracle BDA Site Checklists: Provide a site checklist that the customer must complete before the installation of Oracle BDA.
Oracle BDA Configuration Generation Utility: Enables you to provide information, such as IP addresses and software preferences, that is required for deploying Oracle BDA. After guiding you through a series of pages, the utility generates a set of configuration files. These files help automate the deployment process and ensure that Oracle BDA is configured to your exact specifications.
Base Image: Includes the operating system, drivers, firmware, and so on.
Mammoth Software Deployment Bundle: Contains the installation files and the base image. Before you install the software, you must use the Oracle BDA Configuration Generation Utility to generate the configuration files.
Mammoth Utility: Mammoth is a command-line utility for installing and configuring the Oracle BDA software.
Accessing the Big Data Documentation Landing Page
https://docs.oracle.com/en/bigdata/
Logistics
Network Configuration
Auto Service Request
Oracle Enterprise Manager
Reracking
Collects all of the information required to install and configure the Oracle BDA Software.
Acquires information from you, such as IP addresses, security information, software
preferences, etc.
After guiding you through a series of pages, the utility generates a set of configuration
files.
The generated files help automate the deployment process and ensure that Oracle BDA
is configured to your specifications.
[Screenshots: Oracle BDA Configuration Generation Utility pages: Welcome, Customer Details, Hardware Selection, Rack Details, and Networking (general information, operating system, and administration network configuration setup), followed by pages for configuring cluster 1 as a CDH cluster and cluster 2 as a NoSQL cluster.]
The utility generates an archive such as bda-20160726-192435.zip containing:
bda1 folder: bda1-BdaDeploy.json and bda1-network.json (obsolete in Oracle BDA V4.2 and later versions)
bda1h1 folder: bda1h1-config
bda1h2 folder: bda1h2-config
The per-cluster configuration files are used by Mammoth during the software installation.
A CDH cluster is a minimum of 3 nodes, which is ideal for development, or 6 nodes for
production (recommended).
A starter rack contains 6 nodes.
A full rack contains 18 nodes.
Elastic configurations enable you to expand your system in one-node increments by adding a BDA X6-2 High Capacity (HC) node to a 6-node starter rack.
The rack can be multi-tenant; for example, you can have multiple clusters on a single rack.
You can also have a single cluster spanning multiple racks.
The Mammoth software deployment bundle contains the installation files and the OS base image.
You use the same Oracle BDA Mammoth Software Deployment Bundle to do the
following:
Install the software on a new rack
Add servers to a cluster
Upgrade the software on the Oracle BDA
Change the configuration of optional software
Reinstall the base image
Install a one-off patch
mammoth is the command-line utility that deploys the software on the Oracle BDA (across all servers in the rack) by using the files generated by the BDA Configuration Generation Utility.
You can use Mammoth to:
Set up the cluster by using the generated configuration files
Create a cluster on one or more racks
Create multiple clusters on an Oracle BDA rack
Extend a cluster to new servers
Update a cluster with new software
./mammoth -i bda1h1
In addition to installing the software across all servers in the rack, the Mammoth utility:
Creates the required user accounts
Starts the correct services
Sets the appropriate configuration parameters
You must run the Mammoth utility once for each rack.
For additional information about installing the software, see the Oracle BDA Owner's Guide.
Access and bookmark the Oracle Big Data Appliance Patch Set Master Note (Doc ID 1485745.1) on My Oracle Support (MOS), and download the Mammoth bundle patch and the Oracle BDA Base Image for Oracle Linux 6 (patch # 21109091).
# ./mammoth -i bda1h1
[Diagram: running the command above, where bda1h1 is the cluster name, runs all Mammoth steps and deploys the software across the nodes of rack bda1. Critical services are placed on separate nodes: the active and standby NameNodes (NN1, NN2) and the active and standby ResourceManagers (RM1, RM2).]
Once your CDH cluster is set up, you may want to update the configuration based on specific needs, such as adding a new Impala or HBase service.
You can use Cloudera Manager to manage the Hadoop cluster.
You can use Enterprise Manager to monitor the BDA, similar to how you would manage
your other Oracle products.
You can use Cloudera Manager to perform the following administrative tasks:
Manage cluster configuration
Monitor hosts, jobs, events, and services
Start and stop services
View detailed performance metrics
Monitor the health of the system
Set up resource management
Track resource usage
Manage Hadoop security
Generate alerts
1. Authentication
2. Authorization
3. Auditing
4. Encryption (at rest and over the network)
Secure Authorization
Ability to control access to data and/or privileges on data for authenticated users
Fine-grained Authorization
Ability to give users access to a subset of data
Includes access to a database, URI, table, or view
Role-based Authorization
Ability to create or apply template-based privileges based on functional roles
Cloudera Navigator:
Provides a deep level of auditing in Hadoop
Does not include auditing data from other sources
Includes lineage analysis
Cloudera Navigator audits the following activities:
HDFS data accessed through HDFS, Hive, HBase, and Impala services
Hive, HBase, and Impala operations
Hive metadata definition
Sentry access
Oracle BDA supports network encryption for key activities, thereby preventing network sniffing between computers. Mammoth automatically configures encryption for:
Cloudera Manager Server communicating with Agents
Hadoop HDFS data transfers
Hadoop internal RPC communications
Cloudera Manager web interface
Hadoop web UIs and web services
Hadoop YARN/MapReduce shuffle transfers
In this lesson, you should have learned about the following benefits of the Oracle BDA:
Simplified deployment of big data production clusters
High performance
Secure
Manageable
Open
Oracle Big Data Cloud Service (BDCS)
Oracle BDCS delivers the power of Hadoop as a secure, automated, elastic service,
which can also be fully integrated with existing enterprise data in Oracle Database.
A subscription to Oracle BDCS gives you access to the resources of a pre-configured
Oracle Big Data environment including CDH and Apache Spark.
Use BDCS to capture and analyze the massive volumes of data generated by social
media feeds, email, web logs, photographs, smart meters, sensors, and similar devices.
When you subscribe to Big Data Cloud Service:
You can select between 3 and 60 nodes, in one-node increments.
You can also burst by adding or removing up to 192 OCPUs (32 compute-only nodes).
Oracle manages the hardware and networking infrastructure as well as the initial setup, while you have complete administrative control of the software.
All servers in a BDCS instance form a cluster.
Oracle Big Data Cloud Service: Key Features
Software included
Oracle Big Data Cloud Service: Benefits
Dedicated: Dedicated instances that deliver high performance
Elastic: Scale up and down as needed.
Secure: Secure Hadoop cluster out of the box
Comprehensive analytic software toolset: Use the latest advances in Big Data processing, and unify data processing with Big Data SQL.
Elasticity: Dedicated Compute Bursting
Key Features
Burst nodes provide:
Self service, on demand from the Cluster Service Manager
Large expansion with 32 OCPUs and 256 GB of memory
Expansion nodes automatically instantiated as cluster nodes
Bursting nodes that share the InfiniBand fabric
Hourly billing rates
Retention of dedicated compute for performance
Key Benefits
Flexibility
Consistent high performance
[Diagram: a cluster of permanent nodes, each with 32 OCPUs, 256 GB RAM, and 48 TB of storage, extended with burst nodes.]
Automated Service
Security Made Easy
Key Features
Kerberos-ready cluster out of the box:
Apache Sentry enabled on secure clusters
Built-in data encryption:
At-rest through HDFS encryption
In-flight for all phases within Hadoop and Spark
Encrypted traffic to all client tools
VPN service
Key Benefits
Reduced risk
Faster time-to-value
Comprehensive Analytics Toolset Included
Key Features
Base platform with all Cloudera and Spark tools
Big Data Connectors provides:
Scale-out R capabilities
Scale-out Complex Document Parsing
Big Data Spatial and Graph provides:
Property Graph In-memory Engine
Property Graph Pre-built Analytics
Key Benefits
Faster time-to-value
Lower overall cost
Comprehensive Data Integration Toolset Included
Key Features
Base platform with all Cloudera and Spark tools
Big Data Connectors provides:
Oracle Data Integrator Enterprise Edition
Specific Loaders for Oracle Database integration
Oracle Loader for Apache Hadoop
Oracle SQL Connector for Apache Hadoop
Additional licensing can provide ODI EE Big Data
capabilities
Key Benefits
Faster time-to-value
Lower overall cost
[Diagram: Oracle Hadoop platform deployment options: Big Data Appliance, Big Data Cloud Machine, and Big Data Cloud Service.]
Note: The screens used in this lesson were the latest available when this lesson was developed; however, they might not match your screens.
[Screenshot: the Big Data Cloud Service console showing started clusters for the service instance named bursting1.]
Each node has 32 OCPUs, 256 GB RAM, and 48 TB of storage; therefore, this new 3-node cluster has the following resources:
96 OCPUs (32 x 3 nodes)
768 GB RAM (256 GB x 3 nodes)
144 TB HDFS storage (48 TB x 3 nodes)