CC Unit - 5
Introduction to Hadoop
History of Hadoop
Hadoop was developed under the Apache Software Foundation; its co-founders
are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy
elephant. In October 2003, Google published its paper on the Google File
System. In January 2006, MapReduce development started within the Apache
Nutch project, with roughly 6,000 lines of code for MapReduce and about
5,000 lines for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big
data. It was created at the Apache Software Foundation in 2006, based on
papers published by Google describing the Google File System (GFS, 2003)
and the MapReduce programming model (2004). The Hadoop framework
allows for the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation
and storage. It is used by many organizations, including Yahoo, Facebook,
and IBM, for purposes such as data warehousing, log processing, and
research, and it has become a key technology for big data processing.
Features of Hadoop
Hadoop has several key features that make it well suited for big data
processing:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.
What is MapReduce?
MapReduce is a programming model for processing large data sets in
parallel: a Map function transforms input records into intermediate
key/value pairs, and a Reduce function merges all the intermediate values
associated with the same key.
Terminology
PayLoad − Applications implement the Map and the Reduce
functions, and form the core of the job.
Mapper − Maps the input key/value pairs to a set of
intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File
System (HDFS).
DataNode − Node where data is placed in advance, before
any processing takes place.
MasterNode − Node where the JobTracker runs and which accepts
job requests from clients.
SlaveNode − Node where the Map and Reduce programs run.
JobTracker − Schedules jobs and tracks the jobs assigned to
the TaskTracker.
TaskTracker − Tracks the task and reports status to the
JobTracker.
Job − An execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of
data.
Task Attempt − A particular instance of an attempt to execute
a task on a SlaveNode.
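The Mapper and Reducer roles above can be illustrated with a minimal
pure-Python word-count sketch. This only simulates the MapReduce flow in a
single process; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle phase: group intermediate values by key, as the
    # framework does between the Map and Reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: combine all values for one key into a single result.
    return (key, sum(values))

def run_job(lines):
    pairs = [p for line in lines for p in mapper(line)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2
```

In a real cluster, the Map and Reduce tasks run on many SlaveNodes and the
shuffle is handled by the framework; the logic per key/value pair is the same.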
The following GENERIC_OPTIONS are available to the hadoop job command:

-submit <job-file>      Submits the job.
-status <job-id>        Prints the map and reduce completion percentage and
                        all job counters.
-kill <job-id>          Kills the job.
-list [all]             Displays all jobs. -list displays only jobs which
                        are yet to complete.
-kill-task <task-id>    Kills the task. Killed tasks are NOT counted against
                        failed attempts.
-fail-task <task-id>    Fails the task. Failed tasks are counted against
                        failed attempts.
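Assuming a running cluster, these options are passed to the hadoop job
command. A brief sketch, where the job file and job ID are placeholder
values:

```shell
# Submit a job described by a job configuration file (placeholder path).
hadoop job -submit job.xml

# Check completion percentages and counters for a job (placeholder ID).
hadoop job -status job_201310191043_0004

# List all jobs, then kill one of them.
hadoop job -list all
hadoop job -kill job_201310191043_0004
```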
What is VirtualBox?
VirtualBox is a free and open-source software program for virtualizing
the x86 computing architecture, developed by Oracle Corporation. It works as a
hypervisor and creates virtual machines in which the user can run another
operating system. The "host" OS is the operating system on which VirtualBox
runs. The "guest" OS is the operating system running inside the Virtual
Machine. As the host OS, VirtualBox supports Windows, Linux, Solaris,
OpenSolaris, and macOS. When setting up a virtual machine, the user can
determine how many processor cores and how much RAM and disk space are
devoted to the VM. While the VM is running, it may be "paused".
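The resource settings described above can also be made from the command
line with VirtualBox's VBoxManage tool. A sketch, assuming VBoxManage is
installed; the VM name, OS type, and sizes are placeholders:

```shell
# Create and register a new VM (name and OS type are placeholders).
VBoxManage createvm --name "demo-vm" --ostype Ubuntu_64 --register

# Devote 2 processor cores and 2048 MB of RAM to the VM.
VBoxManage modifyvm "demo-vm" --cpus 2 --memory 2048

# Pause and resume a running VM.
VBoxManage controlvm "demo-vm" pause
VBoxManage controlvm "demo-vm" resume
```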
History of VirtualBox
Innotek GmbH originally developed VirtualBox, which was released as an
open-source software package on January 17, 2007. Sun Microsystems later
purchased the company, and Oracle Corporation took over VirtualBox
development when it acquired Sun on January 27, 2010.
Features of VirtualBox
There are various features of VirtualBox. Some of the essential features are as follows:
Portability
Guest Additions
These are collections of tools installed in the guest OS to optimize its
performance and offer extra integration and communication with the host system.
VM Groups
VirtualBox offers group functionality that allows the user to organize virtual
machines individually and collectively. Operations such as start, pause, close,
reset, shutdown, save state, and power off can generally be applied to a whole
VM group just as to an individual VM.
Hardware Support
Snapshot
With the save-snapshot function, VirtualBox records the state of the guest VM
at a point in time. We can later go back to that point and restore the virtual
machine to the saved state.
The hypervisor is implemented as a Ring 0 (kernel-mode) service. The kernel
includes a device driver known as vboxdrv. This device driver manages tasks
such as loading the hypervisor modules, allocating physical memory to the
guest virtual machine, and saving and restoring the guest system's context.
Whenever an interrupt occurs, control passes back to the host so it can
determine whether a VT-x or AMD-V event needs to be handled.
The guest OS manages its own scheduling during its execution: it operates on
the host system as a single process and is scheduled by the host. Apart from
this, additional device drivers are present that let the guest OS access
resources such as disks, network controllers, and other devices. Besides the
kernel modules, other processes run on the host to support a running guest. The
VBoxSVC process starts automatically in the background when a guest VM is
booted from the VirtualBox GUI.
Google App Engine
To create an application for App Engine, you can use Go, Java, PHP, or
Python. You can develop and test an app locally using the SDK's
deployment toolkit. Each language has its own SDK and runtime. Your
program runs in one of:
Java Runtime Environment version 7
Python runtime environment version 2.7
PHP runtime's PHP 5.4 environment
Go runtime 1.2 environment
Features in Preview
Experimental Features
These might or might not be made broadly accessible in future App Engine
updates, and they might be changed in ways that are incompatible with the
past. "Trusted tester" features, by contrast, are only accessible to a limited
user base and require registration in order to use them. The experimental
features include Prospective Search, Page Speed, OpenID, OAuth,
Datastore Admin/Backup/Restore, Task Queue Tagging, MapReduce, the
Task Queue REST API, and app metrics analytics.
Third-Party Services
App Engine applications can be written in the following languages:
Python
Java
Node.js
PHP
Ruby
Go
Instance classes
Each instance's memory and CPU allocations, as well as the amount of free
quota and the cost per hour after your program uses up the free quota, are
determined by the instance class.
Runtime generation affects the RAM limits. The memory cap applies to
all runtime generations and covers both the memory your program requires
and the memory the runtime needs to function. The Java runtimes consume
more memory when running your app than other runtimes. Use the
instance_class property in your app.yaml file to override the default instance
class.
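For example, the default class can be overridden with a minimal app.yaml
like the following. This is a sketch; the runtime and instance class shown
are assumptions for a standard-environment Python app:

```yaml
# Minimal App Engine standard-environment configuration (illustrative).
runtime: python39
# Override the default instance class to get more memory and CPU per instance.
instance_class: F2
```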
Quotas and limits
In the default setting you are given 1 GB of data storage and traffic for
free; you can enable paid applications to raise these limits. To ensure the
system's stability, however, some features impose restrictions that are
unrelated to quotas.
Support for the same gcloud commands and a similar GCP terminal interface
Free tier
Long-term support for legacy runtimes
Python 2.7
Java 8
Go 1.11
PHP 5.5
Commitment
In keeping with its more than ten-year tradition of supporting your apps
as you advance into the future at your own pace, Google is dedicated to
offering long-term support for these runtimes.
Google might, however, have to deprecate some of the APIs or development
tools that you are currently using.
Security updates
Your software may be vulnerable to flaws for which no publicly available patch
exists once communities discontinue support for versions of their languages. As
a result, switching to a runtime with a supported language is safer than
continuing to run your program on some App Engine runtimes.
Introduction to OpenStack
It is a free, open-standard cloud computing platform that first came into
existence on July 21, 2010. It was a joint project of Rackspace Hosting and
NASA to make cloud computing more ubiquitous. It is deployed as
Infrastructure-as-a-Service (IaaS) in both public and private clouds, where
virtual resources are made available to users. The software platform
consists of interrelated components that control multi-vendor hardware pools
of processing, storage, and networking resources throughout a data center. In
OpenStack, the tools used to build this platform are referred to as
"projects". These projects handle a large number of services, including
computing, networking, and storage. Whereas virtualization abstracts
resources such as RAM and CPU from the hardware using hypervisors,
OpenStack uses a number of APIs to abstract those resources so that users
and administrators can interact directly with the cloud services.
OpenStack components
Apart from the various projects which constitute the OpenStack platform, there
are nine major services, namely Nova, Neutron, Swift, Cinder, Keystone,
Glance, Horizon, Ceilometer, and Heat. Here is a basic definition of each
component, which will give us an idea of what these components do.
1. Nova (compute service): It manages the compute resources like
creating, deleting, and handling the scheduling. It can be seen as a
program dedicated to the automation of resources that are responsible for
the virtualization of services and high-performance computing.
2. Neutron (networking service): It is responsible for connecting all the
networks across OpenStack. It is an API driven service that manages all
networks and IP addresses.
3. Swift (object storage): It is an object storage service with high fault
tolerance, used to store and retrieve unstructured data objects through a
RESTful API. Being a distributed platform, it also provides redundant
storage within servers that are clustered together, and it can
successfully manage petabytes of data.
4. Cinder (block storage): It is responsible for providing persistent block
storage that is made accessible through a self-service API.
Consequently, it allows users to define and manage the amount of cloud
storage they require.
5. Keystone (identity service provider): It is responsible for all types of
authentications and authorizations in the OpenStack services. It is a
directory-based service that uses a central repository to map the correct
services with the correct user.
6. Glance (image service provider): It is responsible for registering,
storing, and retrieving virtual disk images from the complete network.
These images are stored in a wide range of back-end systems.
7. Horizon (dashboard): It is responsible for providing a web-based
interface for OpenStack services. It is used to manage, provision, and
monitor cloud resources.
8. Ceilometer (telemetry): It is responsible for metering and billing of
services used. Also, it is used to generate alarms when a certain
threshold is exceeded.
9. Heat (orchestration): It is used for on-demand service provisioning with
auto-scaling of cloud resources. It works in coordination with
Ceilometer.
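As a concrete example of Heat's role, an orchestration template (in the HOT
format) declares the resources to provision. A minimal sketch, where the
image, flavor, and network names are placeholder values for a given cloud:

```yaml
heat_template_version: 2018-08-31
description: Minimal Heat template that boots one Nova server (illustrative).

resources:
  demo_server:
    type: OS::Nova::Server
    properties:
      image: cirros-0.5.2     # placeholder image name
      flavor: m1.small        # placeholder flavor
      networks:
        - network: private    # placeholder network
```

Heat reads such a template and asks the other services (Nova, Neutron, and so
on) to create the declared resources as a single stack.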
These are the services around which the platform revolves. They individually
handle storage, compute, networking, identity, etc. They are the base on which
the rest of the projects rely, enabling service orchestration, bare-metal
provisioning, dashboards, and more.
Features of Federated Cloud
1. In the federated cloud, the users can interact with the architecture either
centrally or in a decentralized manner. In centralized interaction, the user
interacts with a broker to mediate between them and the organization.
Decentralized interaction permits the user to interact directly with the
clouds in the federation.
2. A federated cloud can serve various niches, both commercial and
non-commercial.
3. The visibility of a federated cloud helps the user understand the
organization of the several clouds in the federated environment.
4. Federated cloud can be monitored in two ways. MaaS (Monitoring as a
Service) provides information that aids in tracking contracted services to
the user. Global monitoring aids in maintaining the federated cloud.
5. The providers who participate in the federation publish their offers to a
central entity. The user interacts with this central entity to verify the prices
and propose an offer.
6. Marketed objects such as infrastructure, software, and platforms have to
pass through the federation when consumed in the federated cloud.
The technologies that aid the cloud federation and cloud services are:
1. OpenNebula
It is a cloud computing platform for managing heterogeneous distributed data
center infrastructures. Through its interoperability it can leverage existing
information technology assets, protecting existing investments and exposing
resources through application programming interfaces (APIs).
2. Aneka coordinator
The Aneka coordinator combines the Aneka services and Aneka peer
components (network architectures), which give the cloud the ability and
performance to interact with other cloud services.
3. Eucalyptus
Eucalyptus pools computational, storage, and network resources that can be
dynamically scaled up or down as application workloads change. It is an
open-source framework that provides storage, network, and other
computational resources for accessing the cloud environment.
Levels of Cloud Federation
Each level of the cloud federation poses unique problems and functions at a
different level of the IT stack. Then, several strategies and technologies are
needed. The answers to the problems encountered at each of these levels
when combined form a reference model for a cloud federation.
Conceptual Level
Infrastructure Level
The newest addition to the Radiant One package is the Cloud Federation
Service (CFS), which is powered by identity virtualization. Together with
Radiant One FID, CFS isolates your external and cloud applications from the
complexity of your identity systems by delegating the work of authenticating
against all of your identity stores to a single common virtual layer.
The future of the cloud is federated, and when you look at the broad categories of
apps moving to the cloud, the truth of this statement begins to become clear.
Gaming, social media, Web, eCommerce, publishing, CRM – these applications
demand truly global coverage, so that the user experience is always on, local and
instant, with ultra-low latency. That’s what the cloud has always promised to be.
The problem is that end users can’t get that from a single provider, no matter how
large. Even market giants like Amazon have limited geographic presence, with
infrastructure only where it’s profitable for them to invest. As a result, outside the
major countries and cities, coverage from today’s ‘global’ cloud providers is actually
pretty thin. Iceland, Jordan, Latvia, Turkey, Malaysia? Good luck. Even in the U.S.,
you might find that the closest access point to your business isn’t even in the same
state, let alone the same city.
As part of a cloud federation, even a small service provider can offer a truly global
service without spending a dime building new infrastructure. For companies with
spare capacity in the data center, the federation also provides a simple way to
monetize that capacity by submitting it to the marketplace for other providers to buy,
creating an additional source of revenue.
There are immediate benefits for end users, too. The federated cloud means that end
users can host apps with their federated cloud provider of choice, instead of choosing
from a handful of “global” cloud providers on the market today and making do with
whatever pricing, app support and SLAs they happen to impose. Cloud users can
choose a local host with the exact pricing, expertise and support package that fits
their need, while still receiving instant access to as much local or global IT resources
as they’d like. They get global scalability without restricted choice, and without
having to manage multiple providers and invoices.
The federated cloud model is a force for real democratization in the cloud market. It’s
how businesses will be able to use local cloud providers to connect with customers,
partners and employees anywhere in the world. It’s how end users will finally get to
realize the promise of the cloud. And, it’s how data center operators and other service
providers will finally be able to compete with, and beat, today’s so-called global cloud
providers.