Electronic Presentation
D75058GC20 | Edition 2.0 | November 2016

Technical Contributors and Reviewers
Martin Gubar, Bill Beauregard, Melliyal Annamalai, Jean Ihm, Bob DiMeo, Jean-Pierre Dijcks, Ben Gelernter, Bob Stanoch, Frederick Kush

Publishers
Veena Narasimhan, Michael Sebastian Almeida, Raghunath M

This document contains proprietary information and is protected by copyright and other intellectual property laws. You may copy and print this document solely for your own use in an Oracle training course. The document may not be modified or altered in any way. Except where your use constitutes "fair use" under copyright law, you may not use, share, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit, or distribute this document in whole or in part without the express authorization of Oracle.

The information contained in this document is subject to change without notice. If you find any problems in the document, please report them in writing to: Oracle University, 500 Oracle Parkway, Redwood Shores, California 94065 USA. This document is not warranted to be error-free.

Restricted Rights Notice
If this documentation is delivered to the United States Government or anyone using the documentation on behalf of the United States Government, the following notice is applicable:
Introduction
Course Objectives
Course Road Map Lesson 1: Introduction
Questions About You
To ensure that the class can be customized to meet your specific needs and to encourage
interaction, answer the following questions:
Which organization do you work for?
What is your role in your organization?
If you are a DBA, what products have you worked on?
Have you used the Oracle BDA?
Have you used any Hadoop components?
Do you meet the course prerequisites?
What do you hope to learn from this course?
Lesson Objectives
Oracle Big Data Lite VM
Oracle Big Data Lite VM Home Page Sections
Introduction: Introduction to the Big Data Lite VM and the installed components
Download Oracle Big Data Lite VM: Contains the Deployment Guide document (important details), links to download the required (13) 7-zip files, and links to download the required Oracle VM VirtualBox plus its Extension Pack and the 7-zip utility
Previous Version: Download previous versions of the Big Data Lite VM and compare the available components with older versions
Getting Started: View information (and demos) about the Oracle MoviePlex demo data that is used in several of the big data courses
Hands-on Lab: Access to several hands-on labs
Web Sites/White Papers/EBook/Blogs: Resources that help you learn more about the Oracle big data platform
Components of the Oracle Big Data Lite VM
Downloading, Installing, and Using the Oracle Big Data Lite VM
1. Review the important Deployment Guide document for detailed installation steps.
2. Download and install Oracle VM VirtualBox, its Extension Pack, and 7-zip files.
3. Run the 7-zip extractor on the BigDataLite450.7z.001 file only. This will create the
BigDataLite450.ova VirtualBox appliance file.
4. In VirtualBox, import BigDataLite450.ova.
5. Start BigDataLite-4.5.0.
6. Log in as oracle/welcome1.
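On a Linux host, steps 3 and 4 might look like the following sketch (file names as listed above; the commands assume the 7-Zip and VirtualBox command-line tools are installed and on your PATH):

# Extract the multi-part archive; 7-Zip picks up the remaining .7z.0NN parts automatically
7z x BigDataLite450.7z.001

# Import the resulting appliance (or use File > Import Appliance in the VirtualBox GUI)
VBoxManage import BigDataLite450.ova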
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}
{"custId":1067283,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:55","recommended":null,"activity":9}
{"custId":1377537,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:58","recommended":null,"activity":9}
{"custId":1347836,"movieId":null,"genreId":null,"time":"2012-07-01:00:02:03","recommended":null,"activity":8}
{"custId":1137285,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:39","recommended":null,"activity":8}
{"custId":1354924,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:51","recommended":null,"activity":9}
{"custId":1036191,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:55","recommended":null,"activity":8}
{"custId":1143971,"movieId":1017161,"genreId":44,"time":"2012-07-01:00:04:00","recommended":"Y","activity":7}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:04:03","recommended":"Y","activity":5}
{"custId":1273464,"movieId":null,"genreId":null,"time":"2012-07-01:00:04:39","recommended":null,"activity":9}
{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
...
[Architecture diagram: the Oracle MoviePlex demo. Application log activity for the MoviePlex site is captured and streamed into HDFS using Flume; customer profile data and required recommendations (such as recommended movies) are served from Oracle NoSQL DB. MapReduce jobs process the data (ORCH for collaborative-filtering recommendations, Pig to sessionize, Hive for activities); the recommendations and the session and activity data are loaded through Oracle Big Data Connectors into Oracle Exadata, where Oracle Advanced Analytics (clustering/market basket, mood), Endeca Information Discovery, and Oracle Business Intelligence EE on Oracle Exalytics analyze the results.]
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
Field Description
custId The customer's ID
movieId The ID of the selected movie
genreId The genre of the selected movie
time The timestamp when the customer watched the movie
recommended? Whether the selected movie was recommended, Y or N
Activity 1: Rate movie
2: Completed movie
3: Not completed
4: Started movie
5: Browsed movie
6: List movies
7: Search
education.oracle.com
Big Data: A Strategic Information Management (IM) Perspective
Information Management
Information Management is the means by which an organization maximizes the efficiency and value with which it plans, collects, organizes, uses, controls, stores, and disseminates its information.
Big Data
Big Data is a term used to describe data sets whose size is beyond the capability of the
software tools commonly used to capture, manage, and process data.
Big Data can be generated from many different sources, including:
Social networks
Banking and financial services
E-commerce services
Web-centric services
Internet search indexes
Scientific and document searches
Medical records
Web logs
Characteristics of Big Data
[Diagram: the characteristics of big data, such as volume and velocity, with example sources such as social networks]
Big Data Opportunities: Some Examples
Big Data Challenges
Information Management Landscape
[Chart: business value versus time/effort across the information management landscape]
Agenda
Apache Hadoop
Types of Analysis That Use Hadoop
Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment
Apache Hadoop Ecosystem
Oracle Big Data Appliance (BDA)
Agenda
Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database
HDFS: Characteristics
Master-slave architecture
Fault-tolerant (HA)
Redundant
Supports MapReduce
Scalable
Cluster: A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop: A batch-processing infrastructure that stores and distributes files, and distributes work, across a group of servers (nodes).
Hadoop Cluster: A collection of racks containing master and slave nodes.
Blocks: HDFS breaks a data file down into blocks, or "chunks," and stores the data blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor: HDFS makes three copies of each data block (by default) and stores them on different DataNodes/racks in the Hadoop cluster.
NameNode (NN): A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the HDFS cluster.
DataNode (DN): Stores the blocks, or "chunks," of data for a set of files.
[Diagram: blocks A, B, and C of a file are each replicated across multiple DataNodes in the cluster]
The NameNode web UI is available at localhost:50070.
Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Hue
Hadoop client
WebHDFS
HttpFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database
Manage HDFS
Advantages:
Enables direct HDFS writes without intermediate file staging on Linux FS
Easy to scale:
Initiate concurrent puts for multiple files
HDFS will leverage multiple target servers and ingest faster
Disadvantages:
Additional software (the Hadoop client) needs to be installed on the source server.
hadoop fs <args>
The hadoop fs command is run as a Linux command and takes an HDFS file system command as its argument. For the full File System Shell reference, see:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
The ls command lists both files and directories.
For a file, it returns stat on the file with the following format:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory, it returns a list of its direct children, as in UNIX. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
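As an illustration of this output format (the directory and file names below are assumptions, not the actual lab content):

$ hadoop fs -ls /user/oracle/curriculum
Found 2 items
drwxr-xr-x   - oracle oracle          0 2016-07-26 19:20 /user/oracle/curriculum/wordcount
-rw-r--r--   3 oracle oracle      14231 2016-07-26 19:24 /user/oracle/curriculum/lab_05_01.txt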
Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:
Display the contents of the part-r-00000 HDFS file by using the cat command:
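A minimal sketch of the two commands (the HDFS paths are assumptions; substitute the directories used in your lab):

$ hadoop fs -copyFromLocal lab_05_01.txt /user/oracle/curriculum/
$ hadoop fs -cat /user/oracle/curriculum/wordcount/part-r-00000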
Advantages:
WebHDFS performance comparable with the Hadoop client
No additional software required on the client side
Disadvantages:
Complex syntax (comparable with the Hadoop client)
HttpFS utilizes a single gateway node that can be a potential bottleneck.
[Diagram: data loading is initiated on the client side with the curl command; no Hadoop client is required on the source server. Data flows from the source server's Linux file system to the HDFS nodes of the Big Data Appliance.]
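For illustration only (the host, port, and target path are assumptions), a WebHDFS upload with curl is a two-step operation:

# Step 1: ask the NameNode where to write; it responds with a 307 redirect
$ curl -i -X PUT "http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/data/file.txt?op=CREATE&user.name=oracle"

# Step 2: PUT the file content to the DataNode URL returned in the Location header of step 1
$ curl -i -X PUT -T file.txt "<Location header URL from step 1>"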
Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database
Is a distributed service for collecting, aggregating, and moving large amounts of data to a centralized data store
Was developed by Apache
Has the following features:
Simple
Reliable
Fault tolerant
Used for online analytic applications
[Diagram: a Flume agent receives events from a web server through a source, buffers them in a channel, and writes them to HDFS through a sink]
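A minimal sketch of such an agent, with assumed agent, source, and path names (not taken from the course labs): an exec source tails an application log, a memory channel buffers the events, and an HDFS sink writes them out.

# Define one source, one channel, and one sink for agent "a1"
cat > movieplex-agent.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Exec source: tail the web application log (path is an assumption)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/movieplex/activity.json
a1.sources.r1.channels = c1

# In-memory channel
a1.channels.c1.type = memory

# HDFS sink: write events into an HDFS directory (path is an assumption)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/oracle/moviework/applog
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

# Start the agent
flume-ng agent --name a1 --conf-file movieplex-agent.conf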
Benefits of Oracle NoSQL Database (a key-value store):
Easy to install and configure
Highly reliable
General-purpose database system
Scalable throughput and predictable latency
Configurable consistency and durability
[Diagram: Oracle NoSQL Database use cases, from systems that run the business (ERP, EAM, CRM, inventory control, accounting and payroll, business process management, business analytics) to systems that expand the business (customer portals, real-time event processing, globally distributed always-on data, mobile data management, time series and sensor data management, online banking), highlighting simple data management, the competitive advantages of fast data, and lower TCO through commodity hardware scale-out.]
[Diagram: the data structure spectrum, from no inherent structure, to simple data structures, to complex data structures with rich SQL.]
www.education.oracle.com
Agenda
MapReduce
Spark
YARN
Apache Hadoop Core Components
Apache Hadoop is a system for large-scale and distributed data processing. It has two
core components:
Distributed storage with HDFS
Distributed and parallel processing with MapReduce or Spark framework
MapReduce is a batch-oriented software framework that enables you to write
applications that will process large amounts of data in parallel, on large clusters of
commodity hardware, and in a reliable and fault-tolerant manner.
The Apache Spark framework is a cluster-computing platform designed to
be fast and general-purpose. It extends the MapReduce functionality (covered later).
Both MapReduce and Spark are managed by YARN (the acronym for Yet Another
Resource Negotiator).
MapReduce Framework Features
Integrates with HDFS and provides the same benefits for parallel data processing
Parallelizes and distributes computations to where the data is stored
The framework:
Schedules and monitors tasks, and re-executes failed tasks
Hides complex distributed computing tasks from the developer
Enables developers to focus on writing the Map and Reduce functions
MapReduce code can be written in Java, C, and scripting languages. Higher-level abstractions (such as Hive and Pig) enable easy interaction; their optimizers construct the MapReduce jobs.
MapReduce Job
MapReduce Jobs Flow
[Diagram: MapReduce job flow. Each input data set in HDFS is split across multiple Map tasks; the map output is shuffled and sorted and passed to Reduce tasks, which write the final output back to HDFS.]
Simple Word Count MapReduce: Example
Input text from HDFS:
1: Hello BigData World
2: Bye Bye BigData World
Mapper 1 processes input split 1 and emits: Hello: 1, BigData: 1, World: 1
Mapper 2 processes input split 2 and emits: Bye: 1, Bye: 1, BigData: 1, World: 1
After the shuffle and sort phase, each reducer sums the values for its keys and produces the final counts: BigData: 2, Bye: 2, Hello: 1, World: 2
How Do You Run a MapReduce Job?
hadoop jar: Instructs the client to submit the job to the ResourceManager
WordCount.jar: The JAR file that contains the Map and Reduce code
WordCount: The name of the class that contains the main method where processing starts
/user/oracle/wordcount/input_directory: The input directory
/user/oracle/wordcount/output_directory: A single HDFS output path; all final output is written to this directory
Submitting a WordCount MapReduce Job:
Review the Input Data Files
The MapReduce JobHistory Server archives job metrics and can be accessed through the JobHistory web UI (http://bigdatalite.localdomain:19888/jobhistory) or Hue.
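The same metrics are also exposed through the JobHistory Server REST API; for example (same host as above, output abbreviated):

$ curl http://bigdatalite.localdomain:19888/ws/v1/history/mapreduce/jobs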
MapReduce
Spark
YARN
Speed
Ease of use (higher level of abstraction than MapReduce)
Sophisticated analytics
Runs on various cluster managers, such as YARN
Native integration with Java, Python, and Scala
There are two interactive Spark shells available for executing Spark programs:
spark-shell is used for Scala.
pyspark is used for Python.
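For example, either shell can be launched against the cluster's YARN ResourceManager (a sketch; the exact master options may vary by Spark version):

$ spark-shell --master yarn    # Scala shell
$ pyspark --master yarn        # Python shell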
MapReduce
Spark
YARN
YARN is:
A subproject of Hadoop that separates the resource-management and processing components
A resource-management framework for Hadoop that is independent of execution engines
Able to run both MapReduce and non-MapReduce jobs
YARN runs on top of HDFS storage and provides:
Scalability
Compatibility with MapReduce
Improved cluster utilization
Support for workloads other than MapReduce
ResourceManager (RM): A dedicated scheduler that allocates resources to the requesting applications. It has two main components: Scheduler and ApplicationsManager. It is a critical component of the cluster and runs on a dedicated master node. High availability (HA) is provided by active and standby RMs with automatic failover.
NodeManager (NM): Each slave node in the cluster has an NM daemon, which acts as a slave for the RM. Each NM tracks the available data processing resources and usage (CPU, memory, disk, and network) on its slave node and sends regular reports to the RM.
ApplicationMaster (AM): The per-application AM is responsible for negotiating resources from the RM and working with the NMs to execute and monitor the tasks. It runs on a slave node.
Container: A collection of all the resources necessary to run an application: CPU, memory, network bandwidth, and disk space. It runs on a slave node in the cluster.
Job History Server: Archives jobs and metadata.
[Diagram: YARN architecture. The ResourceManager and Job History Server run on master nodes; NodeManagers on the slave nodes host the containers that perform the distributed processing.]
YARN provides a pluggable model for scheduling policies. The scheduler is responsible for deciding where and when to run tasks. YARN supports the following pluggable schedulers:
FIFO (First In, First Out)
Allocates resources based on submission time (first in, first out)
Resource requests for the first application in the queue are allocated first; once its requests have been satisfied, the next application in the queue is served, and so on.
Capacity Scheduler
Allocates resources to pools, with FIFO scheduling within each pool (default in
Hadoop)
Fair Scheduler
Allows YARN applications to share resources in large clusters fairly
We will focus on the fair scheduler in this course.
Default in CDH5 (used in Oracle BDA)
Cloudera Manager provides the following features to assist you with allocating cluster
resources to services:
Static allocation (percentage of cluster resources)
Dynamic allocation
[Diagram: service queues ("pools"). Gray boxes represent statically allocated resource pools, such as HDFS (30%) and Impala (15%); hrpool and marketingpool are examples of dynamically allocated pools.]
Data Unification
Introducing Data Unification Options
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
Unifying Data: A Typical Requirement
Oracle Big Data Management System
Oracle Big Data Connectors
[Diagram: Oracle Big Data Connectors linking the data lake and the data warehouse: Oracle Loader for Hadoop (load into the database), Oracle R Advanced Analytics for Hadoop (R client and R analytics), Oracle XQuery for Hadoop (XQuery and XML/XQuery), Oracle SQL Connector for HDFS and lightweight Big Data SQL (Oracle SQL query of HDFS data), and Oracle Data Integrator knowledge modules.]
Agenda
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
Oracle Loader for Hadoop (OLH)
Provides fast and efficient data loading from a Hadoop cluster into a table in an Oracle
Database
Pre-partitions the data if necessary and transforms it into a database-ready format
Sorts records by primary key or user-specified columns before loading the data or
creating output files
Uses the parallel processing framework of Hadoop (MapReduce) to perform these
preprocessing operations
Is a Java MapReduce application that balances the data across reducers to help
maximize performance
Reads from sources that have the data already in a record format, or can split the lines
of a text file into fields
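OLH itself is submitted like any other MapReduce job. A typical invocation might look like the following sketch, where the configuration file name and OLH_HOME location are assumptions (the file supplies the database connection details, input format, and target table):

$ hadoop jar ${OLH_HOME}/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf MyOLHConfig.xml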
Copy to Hadoop
[Diagram: Copy to Hadoop stores Oracle Database data in HDFS as Oracle Data Pump files, optionally converted to Parquet, ORC, or text, enabling optimized queries both with Big Data SQL and with Hadoop ecosystem tools.]
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
[Diagram: a processing layer running on commodity servers or in mixed deployments, above a storage layer consisting of the HDFS filesystem and NoSQL databases (Oracle NoSQL DB, HBase).]
Start by looking at click data in HDFS, which has rich information about customers' behavior.
Easily access that data from Oracle Database and apply techniques for making complex
documents easy to query.
Query any data including recommendation data in a NoSQL DB.
Problem: Personally Identifiable Information (PII) data must be safeguarded. Apply same
techniques to safeguard data in HDFS and NoSQL that you do with Oracle Database data.
We can now apply any type of analysis using rich SQL. We will turn those clicks into customer sessions (something that would otherwise take hundreds of lines of Java code). Based on customer segmentation, review sessions that convert to sales. Finally, look at recommendations by genre and see how they drive interest and sales.
Applications using Oracle REST Data Services (ORDS) can take advantage of all of this.
Regardless of the interface, applications can leverage the rich analytics, security, and ability
to query any data.
Analyze on Hadoop
Direct, parallel, fast, secure access to master data
[Diagram: a Hive external table defined with a storage handler (input format) gives Hadoop engines such as Hive, Spark, Impala, and HCatalog direct access to an Oracle Database table.]
Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop
After completing this lesson, you should be able to describe the following products:
Oracle Big Data Discovery
Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop
Oracle Unified Big Data Management and Analytics Strategy
[Diagram: Oracle's unified big data management and analytics components, in the cloud and on-premises: Big Data Preparation, Data Integrator, GoldenGate, and IoT; the Hadoop platform, Big Data SQL, NoSQL Database, and Oracle Database; Big Data Discovery, R on Hadoop, and Big Data Spatial and Graph; Data Visualization, Business Intelligence, Spatial and Graph, and Advanced Analytics.]
Agenda
Big Data Labs: Fueling Enterprise Innovation
The data lab enables organizations to think and act like startups.
Data Lab: Key Principles
Opening Up the Lab to a Broad Community: Difficulties
Unlocking the Data Lab: New Approach
Find and explore Big Data to understand its potential. Quickly transform and enrich Big Data to make it better. Unlock Big Data for anyone to discover and share new value.
Oracle Big Data Discovery: The Visual Face of Big Data
Catalog (Find)
Access a rich, interactive catalog of all data in Hadoop.
Use familiar search and guided navigation for ease of use.
See data set summaries, user annotation, and recommendations.
Provision personal and enterprise data to Hadoop via self service.
Share projects, bookmarks, and snapshots with others.
Build galleries and tell Big Data stories.
Collaborate and iterate as a team.
Publish blended data to HDFS for use in other tools.
Install into your existing CDH or Hortonworks Data Platform (HDP) cluster. Leverage existing infrastructure and technology standards. Take Big Data Discovery to the Data Lake.
Run on the Oracle Big Data Appliance. Avail of high-performance hardware optimized for Big Data. Save 21% in costs and time versus commodity hardware.
Run Big Data Discovery workloads in Oracle Cloud. Experience fully automated lifecycle management. Avail of the industry's most secure and complete Big Data Cloud Service.
Democratize value from the data lab: Increase the size, diversify the skills, and improve the efficiency of Big Data teams.
Publish, secure, and leverage: Integrate with Hadoop open standards and leverage the unified Oracle Big Data ecosystem.
Are things in the same location? Who is the nearest? What tax zone is this in? Where
can I deliver in 35 minutes? What is in my sales territory? Is this built in a flood zone?
Which supplier am I most dependent on? Who is the most influential customer? Do my
products appeal to certain communities? What patterns are there in fraudulent behavior?
Spatial Analysis:
Location Data Enrichment
Proximity and containment analysis, clustering
Spatial data preparation (vector, raster)
Interactive visualization
Property Graph for Analysis:
Social media relationships
E-commerce targeted marketing
Cyber security, fraud detection
IoT, industrial engineering
Multimedia Analysis:
Framework for processing video and image data, such as facial recognition
Recommend the most similar item purchased by similar people.
Find people that are central in the given network, such as for influencer marketing.
Identify groups of people that are close to each other, such as for target group marketing.
Find all the sets of entities that match a given pattern, such as for fraud detection.
[Diagram: example graphs built from customers, purchased items, purchase records, and communication streams such as tweets.]
Cyber security: community detection, network monitoring, predictive analysis
Critical/alternate path analysis and multiple-system impact analysis: transportation, utilities, finance
[Diagram: property graph architecture. Flexible interfaces (Java, JavaScript, Python, and Groovy APIs; Tinkerpop Blueprints and Gremlin; Apache Lucene and SolrCloud text search) sit on a graph data access layer API. An in-memory analyst provides 35 built-in analytics and smart filtering of large graphs. The massively scalable graph database provides scalable and persistent property graph storage on multiple back ends (Apache HBase, Oracle NoSQL Database, Oracle Database) and scales securely to tens of billions of nodes and edges.]
86% of insurance companies agree that analyzing multiple data sources together is crucial to making accurate predictions.
88% agree that linking information by location is key to combining disparate sources of Big Data.
[Diagram: actuarial and demographic data, accident data, call data, and customer data are enriched with postal codes, categorized by region, and turned into data products for rate structures and underwriting/risk analysis.]
Source: "Big data: How data analytics can yield underwriting gold," a survey conducted by Ordnance Survey and the Chartered Insurance Institute, 25 April 2013.
Make developers and data scientists more productive with pre-built componentry and
templates for applications.
Pre-built parallel MapReduce and Spark spatial
algorithms
Raster and vector processing frameworks
Comparison with Big Data Discovery:
Big Data Discovery: An interactive tool
Big Data Spatial and Graph:
A developer-centric framework
Install into your existing CDH or Hortonworks Data Platform (HDP) cluster. Leverage existing infrastructure and technology standards. Take graph, spatial, and multimedia analysis to the Data Lake.
Run on the Oracle Big Data Appliance. Avail of high-performance hardware optimized for Big Data. Save 21% in costs and time versus commodity hardware.
Run Big Data Spatial and Graph workloads in Oracle Cloud. Experience fully automated lifecycle management. Avail of the industry's most secure and complete Big Data Cloud Service.
Hands-on lab and demo for Big Data Spatial and Graph property graphs:
http://www.oracle.com/technetwork/database/options/spatialandgraph/learnmore/biwa-2016-more-session-information-2889878.html
Blog (technical examples and tips):
https://blogs.oracle.com/bigdataspatialgraph/
Oracle Big Data Lite Virtual Machine, a free sandbox to get started:
www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
[Diagram: Oracle XQuery for Hadoop processes data formats such as text, Avro, JSON, and XML and can load results into Oracle Database.]
In this lesson, you should have learned about the following products:
Oracle Big Data Discovery
Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop
After completing this lesson, you should be able to identify the benefits of the Oracle Big
Data Appliance (BDA), such as:
Simplified deployment of big data production clusters
High performance
Secure
Manageable
Open
Agenda
Big Data Management System
Oracle BDA
Core Design Principles for BDA
Configuring and Installing the Oracle BDA: Road Map
Configuring and Installing the Oracle BDA: Key Players
[Diagram: key players in Oracle BDA installation and configuration: the customer, the install coordinator, the Oracle field engineer, Oracle Advanced Customer Support (ACS), and the customer working together with Oracle ACS.]
Key Definitions
Oracle BDA Site Checklists: Provide a site checklist that the customer must complete before the installation of Oracle BDA.
Oracle BDA Configuration Generation Utility: Enables you to provide information, such as IP addresses and software preferences, that is required for deploying Oracle BDA. After guiding you through a series of pages, the utility generates a set of configuration files. These files help automate the deployment process and ensure that Oracle BDA is configured to your exact specifications.
Base Image: Includes the operating system, drivers, firmware, and so on.
Mammoth Software Deployment Bundle: Contains the installation files and the base image. Before you install the software, you must use the Oracle BDA Configuration Generation Utility to generate the configuration files.
Mammoth Utility: Mammoth is a command-line utility for installing and configuring the Oracle BDA software.
Accessing the Big Data Documentation Landing Page
https://docs.oracle.com/en/bigdata/
Logistics
Network Configuration
Auto Service Request
Oracle Enterprise Manager
Reracking
Collects all of the information required to install and configure the Oracle BDA Software.
Acquires information from you, such as IP addresses, security information, software
preferences, etc.
After guiding you through a series of pages, the utility generates a set of configuration
files.
The generated files help automate the deployment process and ensure that Oracle BDA
is configured to your specifications.
[Screenshots: Oracle BDA Configuration Generation Utility pages: Welcome, Customer Details, Hardware Selection, Rack Details, and Networking (general information, operating system, and administration network configuration setup), followed by pages for configuring cluster 1 as a CDH cluster and cluster 2 as a NoSQL cluster.]
The utility generates an archive such as bda-20160726-192435.zip containing:
bda1 folder: bda1-BdaDeploy.json and bda1-network.json (obsolete in Oracle BDA V4.2 and later versions)
bda1h1 folder: bda1h1-config
bda1h2 folder: bda1h2-config
The per-cluster configuration files are used by Mammoth during the software installation.
A CDH cluster is a minimum of 3 nodes, which is ideal for development, or 6 nodes for
production (recommended).
A starter rack contains 6 nodes.
A full rack contains 18 nodes.
Elastic configurations enable you to expand your system in one-node increments by adding a BDA X6-2 High Capacity (HC) node to a 6-node starter rack.
The rack can be multi-tenant; for example, you can have multiple clusters on a single rack.
You can also have a single cluster spanning multiple racks.
The Mammoth software deployment bundle contains the installation files and the OS base image.
You use the same Oracle BDA Mammoth Software Deployment Bundle to do the
following:
Install the software on a new rack
Add servers to a cluster
Upgrade the software on the Oracle BDA
Change the configuration of optional software
Reinstall the base image
Install a one-off patch
mammoth is the command-line utility that deploys the software on the Oracle BDA (across all servers in the rack) by using the files generated by the BDA Configuration Generation Utility.
You can use Mammoth to:
Set up the cluster by using the generated configuration files
Create a cluster on one or more racks
Create multiple clusters on an Oracle BDA rack
Extend a cluster to new servers
Update a cluster with new software
./mammoth -i bda1h1
In addition to installing the software across all servers in the rack, the Mammoth utility:
Creates the required user accounts
Starts the correct services
Sets the appropriate configuration parameters
You must run the Mammoth utility once for each rack.
For additional information about installing the software, see the Oracle BDA Owner's Guide.
Access and bookmark the Oracle Big Data Appliance Patch Set Master Note (Doc ID 1485745.1) on My Oracle Support (MOS), and download the Mammoth bundle patch and the Oracle BDA Base Image for Oracle Linux 6 (patch # 21109091).
# ./mammoth -i bda1h1
[Diagram: running the command above, where bda1h1 is the cluster name, runs all Mammoth steps and deploys the software across the nodes of rack bda1. Critical services are placed on separate nodes: the active and standby NameNodes (NN1, NN2) and the active and standby ResourceManagers (RM1, RM2).]
Once your CDH cluster is set up, you may want to update the configuration based on specific needs, such as adding a new Impala or HBase service.
You can use Cloudera Manager to manage the Hadoop cluster.
You can use Enterprise Manager to monitor the BDA, similar to how you would manage
your other Oracle products.
You can use Cloudera Manager to perform the following administrative tasks:
Manage cluster configuration
Monitor hosts, jobs, events, and services
Start and stop services
View detailed performance metrics
Monitor the health of the system
Set up resource management
Track resource usage
Manage Hadoop security
Generate alerts
1. Authentication
2. Authorization
3. Auditing
4. Encryption (at rest and over the network)
Secure Authorization
Ability to control access to data and/or privileges on data for authenticated users
Fine-grained Authorization
Ability to give users access to a subset of data
Includes access to a database, URI, table, or view
Role-based Authorization
Ability to create or apply template-based privileges based on functional roles
Cloudera Navigator:
Provides a deep level of auditing in Hadoop
Does not include auditing data from other sources
Includes lineage analysis
Cloudera Navigator audits the following activities:
HDFS data accessed through HDFS, Hive, HBase, and Impala services
Hive, HBase, and Impala operations
Hive metadata definition
Sentry access
Oracle BDA supports network encryption for key activities, thereby preventing network sniffing between computers. Mammoth automatically configures encryption for:
Cloudera Manager Server communicating with Agents
Hadoop HDFS data transfers
Hadoop internal RPC communications
Cloudera Manager web interface
Hadoop web UIs and web services
Hadoop YARN/MapReduce shuffle transfers
In this lesson, you should have learned about the following benefits of the Oracle BDA:
Simplified deployment of big data production clusters
High performance
Secure
Manageable
Open
Oracle Big Data Cloud Service (BDCS)
Oracle BDCS delivers the power of Hadoop as a secure, automated, elastic service,
which can also be fully integrated with existing enterprise data in Oracle Database.
A subscription to Oracle BDCS gives you access to the resources of a pre-configured
Oracle Big Data environment including CDH and Apache Spark.
Use BDCS to capture and analyze the massive volumes of data generated by social
media feeds, email, web logs, photographs, smart meters, sensors, and similar devices.
When you subscribe to Big Data Cloud Service:
You can select between 3 and 60 nodes, in one-node increments.
You can also burst by adding or removing up to 192 OCPUs (32 compute-only nodes).
Oracle manages the hardware and networking infrastructure as well as the initial setup, while you have complete administrative control of the software.
All servers in a BDCS instance form a cluster.
Oracle Big Data Cloud Service: Key Features
Software included
Oracle Big Data Cloud Service: Benefits
Dedicated: Dedicated instances that deliver high performance
Elastic: Scale up and down as needed.
Secure: Secure Hadoop cluster out of the box
Comprehensive analytic software toolset: Use the latest advances in Big Data processing, and unify data processing with Big Data SQL.
Elasticity: Dedicated Compute Bursting
Key Features
Burst nodes provide:
Self service, on demand from the Cluster Service Manager
Large expansion with 32 OCPUs and 256 GB of memory
Expansion nodes automatically instantiated as cluster nodes
Bursting nodes that share the InfiniBand fabric
Hourly billing rates
Retention of dedicated compute for performance
Key Benefits
Flexibility
Consistent high performance
[Diagram: a cluster of permanent nodes, each with 32 OCPUs, 256 GB RAM, and 48 TB of storage, extended with burst nodes.]
Automated Service
Security Made Easy
Key Features
Kerberos-ready cluster out of the box:
Apache Sentry enabled on secure clusters
Built-in data encryption:
At-rest through HDFS encryption
In-flight for all phases within Hadoop and Spark
Encrypted traffic to all client tools
VPN service
Key Benefits
Reduced risk
Faster time-to-value
Comprehensive Analytics Toolset Included
Key Features
Base platform with all Cloudera and Spark tools
Big Data Connectors provides:
Scale-out R capabilities
Scale-out Complex Document Parsing
Big Data Spatial and Graph provides:
Property Graph In-memory Engine
Property Graph Pre-built Analytics
Key Benefits
Faster time-to-value
Lower overall cost
Comprehensive Data Integration Toolset Included
Key Features
Base platform with all Cloudera and Spark tools
Big Data Connectors provides:
Oracle Data Integrator Enterprise Edition
Specific Loaders for Oracle Database integration
Oracle Loader for Apache Hadoop
Oracle SQL Connector for Apache Hadoop
Additional licensing can provide ODI EE Big Data
capabilities
Key Benefits
Faster time-to-value
Lower overall cost
[Diagram: Oracle Hadoop platform deployment options: Big Data Appliance, Big Data Cloud Machine, and Big Data Cloud Service.]
Note: The screens used in this lesson were the latest available when this lesson was developed; however, they might not match your screens.
[Screenshot: the Big Data Cloud Service console showing started clusters for the service instance named bursting1.]
Each node has 32 OCPUs, 256 GB RAM, and 48 TB of storage; therefore, this new 3-node cluster has the following resources:
96 OCPUs (32 x 3 nodes)
768 GB RAM (256 GB x 3 nodes)
144 TB HDFS storage (48 TB x 3 nodes)