CC-KML051-Unit V
Learn how to use Dataproc to run Apache Hadoop clusters on Google Cloud in a simpler, integrated, more cost-effective way.
Hadoop history
Hadoop has its origins in the early era of the World Wide Web. As the Web grew to millions and then billions
of pages, the task of searching and returning search results became one of the most prominent challenges.
Startups like Google, Yahoo, and AltaVista began building frameworks to automate search results. One
project called Nutch was built by computer scientists Doug Cutting and Mike Cafarella based on Google’s
early work on MapReduce (more on that later) and Google File System. Nutch was eventually moved to the
Apache open source software foundation and was split between Nutch and Hadoop. Yahoo, where Cutting
began working in 2006, open sourced Hadoop in 2008.
While Hadoop is sometimes referred to as an acronym for High Availability Distributed Object Oriented
Platform, it was originally named after Cutting’s son’s toy elephant.
Hadoop defined
Hadoop is an open source framework based on Java that manages the storage and processing of large amounts
of data for applications. Hadoop uses distributed storage and parallel processing to handle big data and
analytics jobs, breaking workloads down into smaller workloads that can be run at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:
Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a
distributed file system in which individual Hadoop nodes operate on data that resides in their local storage.
This keeps computation close to the data, reducing network latency and providing high-throughput access to application data. In addition,
administrators don’t need to define schemas up front.
Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for
managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling
and resource allocation across the Hadoop system.
MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model,
subsets of larger datasets and instructions for processing the subsets are dispatched to multiple different nodes,
where each subset is processed by a node in parallel with other processing jobs. After processing, the results from the individual subsets are combined into a smaller, more manageable dataset (see the sketch after this list of modules).
Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop
modules.
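To make the MapReduce model concrete, here is a minimal single-process sketch of a word count, the classic MapReduce example. Real Hadoop jobs implement the map and reduce functions against the Java MapReduce API and let the framework distribute them across nodes; this plain-Python simulation of the map, shuffle, and reduce phases is only illustrative.

```python
# A minimal single-process simulation of the MapReduce model:
# word count. In real Hadoop, these phases run on many nodes.
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for every word in an input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    # Reduce: combine the per-key values into a single result.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog and the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs))
print(counts)  # e.g. {'the': 3, 'fox': 2, ...}
```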
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues to grow and
includes many tools and applications to help collect, store, process, analyze, and manage big data. These
include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
Hadoop allows for the distribution of datasets across a cluster of commodity hardware. Processing is
performed in parallel on multiple servers simultaneously.
Software clients input data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce
then processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
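As a hedged illustration of that flow, the sketch below drives the standard Hadoop command-line tools from Python. The hdfs dfs and hadoop jar commands ship with a stock Hadoop install; the file names, HDFS paths, and the example-jar path are assumptions for illustration.

```python
# A hedged sketch of the client-side flow: load data into HDFS,
# submit a MapReduce job (scheduled by YARN), read the results.
import subprocess

# Copy a local file into the distributed file system.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/input"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "logs.txt", "/data/input/"], check=True)

# Submit the bundled word-count example; YARN schedules its tasks
# across the cluster. (The jar's location varies by install.)
subprocess.run([
    "hadoop", "jar", "hadoop-mapreduce-examples.jar",  # path is illustrative
    "wordcount", "/data/input", "/data/output",
], check=True)

# Read the reducer output back from HDFS.
subprocess.run(["hdfs", "dfs", "-cat", "/data/output/part-r-00000"], check=True)
```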
All Hadoop modules are designed with a fundamental assumption that hardware failures of individual
machines or racks of machines are common and should be automatically handled in software by the
framework.
MapReduce complexity
As a file-intensive system, MapReduce can be a difficult tool to use for complex jobs, such as interactive analytical tasks. MapReduce functions also need to be written in Java and can involve a steep learning curve. The MapReduce ecosystem is quite large, with many components for different functions, which can make it difficult to determine which tools to use.
Security
Data sensitivity and protection can be issues as Hadoop handles such large datasets. An ecosystem of tools
for authentication, encryption, auditing, and provisioning has emerged to help developers secure data in
Hadoop.
Governance and management
Hadoop does not have many robust tools for data management and governance, nor for data quality and standardization.
Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers with the
combined requisite skills in Java to program MapReduce, operating systems, and hardware can be difficult.
In addition, MapReduce has a steep learning curve, making it hard to get new programmers up to speed on its
best practices and ecosystem.
Research firm IDC estimated that 62.4 zettabytes of data were created or replicated in 2020, driven by the
Internet of Things, social media, edge computing, and data created in the cloud. The firm forecasted that data
growth from 2020 to 2025 was expected at 23% per year. While not all that data is saved (it is either deleted
after consumption or overwritten), the data needs of the world continue to grow.
Hadoop tools
Hadoop has a large ecosystem of open source tools that can augment and extend the capabilities of its core modules. Some of the main software tools used with Hadoop include:
Apache Hive: A data warehouse that allows programmers to work with data in HDFS using a query language called HiveQL, which is similar to SQL (see the query sketch after this list)
Apache HBase: An open source non-relational distributed database often paired with Hadoop
Apache Pig: A tool that serves as an abstraction layer over MapReduce for analyzing large sets of data, enabling functions like filter, sort, load, and join
Apache Impala: An open source, massively parallel processing SQL query engine often used with Hadoop
Apache Sqoop: A command-line interface application for efficiently transferring bulk data between relational
databases and Hadoop
Apache ZooKeeper: An open source server that enables reliable distributed coordination in Hadoop; a service for "maintaining configuration information, naming, providing distributed synchronization, and providing group services"
On Google Cloud, Dataproc is a fast, easy-to-use, and fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, integrated, more cost-effective way. It fully integrates with
other Google Cloud services that meet critical security, governance, and support needs, allowing you to gain
a complete and powerful platform for data processing, analytics, and machine learning.
Big data analytics tools from Google Cloud—such as Dataproc, BigQuery, Vertex AI Workbench,
and Dataflow—can enable you to build context-rich applications, build new analytics solutions, and turn
data into actionable insights.
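As a hedged sketch of what "easy-to-use" means in practice, the snippet below creates a small Dataproc cluster with the google-cloud-dataproc Python client library. The project ID, region, cluster name, and machine types are placeholders, and authentication is assumed to come from application default credentials.

```python
# A hedged sketch of creating a Dataproc cluster with the
# google-cloud-dataproc client library. Project, region, cluster
# name, and machine types are placeholders.
from google.cloud import dataproc_v1

project_id, region, cluster_name = "my-project", "us-central1", "demo-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks
# until the cluster is ready.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```

Deleting the cluster when a job finishes, or using Dataproc's per-job ephemeral clusters, is what makes this model cost-effective compared with a long-running, self-managed Hadoop cluster.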
Google App Engine
A scalable runtime environment, Google App Engine is mostly used to run web applications. These applications scale dynamically as demand changes over time, thanks to Google's vast computing infrastructure. Because it offers a secure execution environment in addition to a number of services, App Engine makes it easier to develop scalable, high-performance web apps. Applications scale up and down in response to shifting demand. These services include cron tasks, communications, scalable data stores, work queues, and in-memory caching.
The App Engine SDK facilitates the testing and staging of applications by emulating the production runtime environment, allowing developers to design and test applications on their own PCs. When an application is finished, developers can quickly migrate it to App Engine, put quotas in place to control the costs generated, and make the program available to everyone. Python, Java, and Go are among the languages currently supported.
The development and hosting platform Google App Engine, which powers anything from web programming
for huge enterprises to mobile apps, uses the same infrastructure as Google’s large-scale internet services.
It is a fully managed PaaS (platform as a service) cloud computing platform that uses in-built services to
run your apps. You can start creating almost immediately after receiving the software development kit
(SDK). You may immediately access the Google app developer’s manual once you’ve chosen the language
you wish to use to build your app.
After creating a Cloud account, you can start building your app using:
the Go template/HTML package
Python-based webapp2 with Jinja2 (a minimal sketch follows this list)
PHP and Cloud SQL
Java with Maven
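Here is a minimal webapp2 handler of the kind used on the App Engine Python 2.7 runtime (newer runtimes use standard WSGI frameworks such as Flask instead); the route and message are illustrative.

```python
# A minimal webapp2 app for the App Engine Python 2.7 runtime.
# App Engine routes requests to this WSGI application according
# to the app.yaml configuration file.
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond to GET / with a plain-text greeting.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, App Engine!')

app = webapp2.WSGIApplication([('/', MainPage)], debug=True)
```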
App Engine runs the programs on various servers while "sandboxing" them. It allows a program to use more resources in order to handle increased demand. App Engine powers applications like Snapchat, Rovio, and Khan Academy.
Features of App Engine
To create an application for App Engine, you can use Go, Java, PHP, or Python. You can develop and test an app locally using the SDK's deployment toolkit. Each language's SDK and runtime are unique. Your program is run in a:
Java Runtime Environment version 7
Python 2.7 runtime environment
PHP 5.4 runtime environment
Go 1.2 runtime environment
These are protected by the service-level agreement and deprecation policy of App Engine. The implementation of such a feature is often stable, and any changes made to it are backward compatible. These features include communications, process management, computing, data storage, retrieval, and search, as well as app configuration and management. The data storage, retrieval, and search category includes features such as the HRD migration tool, Google Cloud SQL, logs, datastore, dedicated Memcache, blobstore, and search.
Features in Preview
These features are expected to be made broadly accessible in a later version of App Engine. However, because they are in preview, their implementation may change in backward-incompatible ways. Sockets, MapReduce, and the Google Cloud Storage Client Library are a few of them.
Experimental Features
These might or might not be made broadly accessible in future App Engine updates, and they may change in backward-incompatible ways. "Trusted tester" features, however, are accessible only to a limited user base and require registration to use. The experimental features include Prospective Search, Page Speed, OpenID, OAuth, Datastore Admin/Backup/Restore, Task Queue Tagging, MapReduce, the Task Queue REST API, and app metrics analytics.
Third-Party Services
Google provides documentation and helper libraries that expand the capabilities of the App Engine platform, so your app can perform tasks that are not built into the core product. To do this, Google collaborates with other organizations. Along with the helper libraries, the partners frequently provide exclusive deals to App Engine users.
Advantages of Google App Engine
Google App Engine has a lot of benefits that can help you advance your app ideas. These include:
1. Infrastructure for Security: The Internet infrastructure that Google uses is arguably the safest in the entire world. Since the application data and code are hosted on extremely secure servers, there has rarely been any unauthorized access to date.
2. Faster Time to Market: For every organization, getting a product or service to market quickly is crucial. Easy development and maintenance of an app are essential to releasing a product quickly, and a firm can grow swiftly with Google Cloud App Engine's assistance.
3. Quick to Start: You don’t need to spend a lot of time prototyping or deploying the app to users
because there is no hardware or product to buy and maintain.
4. Easy to Use: The tools that you need to create, test, launch, and update the applications are
included in Google App Engine (GAE).
5. Rich set of APIs & Services: A number of built-in APIs and services in Google App Engine
enable developers to create strong, feature-rich apps.
6. Scalability: This is one of the deciding factors in the success of any software. When using App Engine to construct apps, you can access technologies like GFS, Bigtable, and others that Google uses to build its own apps.
7. Performance and Reliability: Google ranks among the top international brands, so you can bear that record in mind when considering performance and reliability.
8. Cost Savings: To administer your servers, you don’t need to employ engineers or even do it
yourself. The money you save might be put toward developing other areas of your company.
9. Platform Independence: Since the app engine platform only has a few dependencies, you can
easily relocate all of your data to another environment.
What is OpenStack?
OpenStack is a collection of open source software modules and tools that provides a framework to create and
manage both public cloud and private cloud infrastructure.
Businesses and service providers can deploy OpenStack on premises (in the data center to build a private
cloud), in the cloud to enable or drive public cloud platforms, and at the network edge for distributed
computing systems.
To create a cloud computing environment, an organization typically builds off of its existing virtualized
infrastructure, using a well-established hypervisor such as VMware vSphere, Microsoft Hyper-V or KVM.
However, cloud computing offers more than just virtualization: a public or private cloud also provides extensive provisioning, lifecycle automation, user self-service, cost reporting and billing, orchestration, and other features.
How does OpenStack work?
OpenStack is not an application in the traditional sense, but rather a platform composed of several dozen
separate components, called projects, which interoperate with each other through APIs. Each component is
complementary, but not all components are required to create a basic cloud. Organizations can install only
select components that build the features and functionality in a desired cloud environment.
OpenStack also relies on two additional foundation technologies: a base operating system, such as Linux, and
a virtualization platform, such as VMware or Citrix. The OS handles the commands and data exchanged from
OpenStack, while the virtualization engine manages the virtualized hardware resources used by OpenStack
projects.
Once the OS, virtualization platform and OpenStack components are deployed and configured properly,
administrators can provision and manage the instanced resources that applications require. Actions and
requests made through a dashboard produce a series of API calls, which are authenticated through a security
service and delivered to the destination component, which executes the associated tasks.
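To make that workflow concrete, here is a hedged sketch using the openstacksdk Python library to provision a VM. The cloud profile name and the image, flavor, and network names are assumptions; under the hood, each call is an authenticated API request to the relevant component.

```python
# A hedged sketch of provisioning a VM through OpenStack's APIs
# with the openstacksdk library. Profile, image, flavor, and
# network names are illustrative assumptions.
import openstack

# Credentials come from a clouds.yaml profile or environment variables.
conn = openstack.connect(cloud="my-private-cloud")

image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("private")

# Each call maps to a component: Keystone authenticates the token,
# Glance serves the image, Neutron the network, Nova the server.
server = conn.compute.create_server(
    name="demo-instance",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)
```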
The OpenStack cloud platform is an amalgam of software components. These components are shaped by open
source contributions from the developer community, and OpenStack adopters can choose to implement some
or all of these components as business needs dictate.
OpenStack setups vary, but typically start with a handful of central components: compute (Nova), VM images
(Glance), networking (Neutron), storage (Cinder or Swift), identity management (Keystone) and resource
management (Placement).
Many enterprises that deploy and maintain an OpenStack infrastructure enjoy several advantages, including
that it is:
Affordable. OpenStack is available freely as open source software released under the Apache 2.0
license. This means there is no upfront cost to acquire and use OpenStack.
Reliable. With almost a decade of development and use, OpenStack provides a comprehensive and
proven production-ready modular platform upon which an enterprise can build and operate a
private or public cloud. Its rich set of capabilities includes scalable storage, good performance and
high data security, and it enjoys broad acceptance across industries.
Vendor-neutral. Because of OpenStack's open source nature, some organizations also see it as a
way to avoid vendor lock-in, as an overall platform as well as its individual component functions.
But potential adopters must also consider some drawbacks, such as the following:
Complexity. Because of its size and scope, OpenStack requires an IT staff with significant
knowledge to deploy the platform and make it work. In some cases, an organization might require
additional staff or a consulting firm to deploy OpenStack, which adds time and cost.
Support. As open source software, OpenStack is not owned or directed by any one vendor or team.
This can make it difficult to obtain support for the technology, beyond the open source community.
Consistency. The OpenStack component suite is always in flux as new components are added and
others are deprecated.
To reduce the complexity of an OpenStack deployment, and to gain direct access to technical support, an
organization can select an OpenStack distribution from a vendor. This is a version of the open source platform
packaged with other components, such as an installation program and management tools. It often comes with
technical support options.
An organization has many OpenStack distributions to choose from, including the Red Hat OpenStack
platform, the Mirantis Cloud Platform and the Rackspace OpenStack private cloud.
OpenStack vs. other cloud platforms
Even simple clouds are complex and require extensive automation, orchestration and management to operate.
This means there are few direct alternatives to OpenStack that are practical and proven. However, there are
some options that can help organizations combine the benefits of cloud and on-premises capabilities to
simplify or speed an enterprise's adoption of next-generation technology.
Kubernetes (containers)
Organizations with small, dynamic container-based environments may balk at OpenStack's embrace of
traditional VMs. They may instead opt for a pure container-based approach using a platform such as
Kubernetes.
Public cloud on-premises offerings
The three major public cloud providers all provide managed offerings for on-premises clouds, with a strong emphasis on hybrid cloud adoption. AWS Outposts, Azure Stack and Google Anthos all offer appliances that sit within a local data center to facilitate a range of services that mimic the providers' public services and capabilities.
VMware vCloud
Given the vast enterprise investments in virtualization technology, it's natural to consider building a private
cloud based on VMware's vCloud Suite. VMware has partnerships with cloud providers, notably AWS, to
support such hybrid cloud projects. However, VMware software is proprietary and requires licensing, and it
may offer fewer capabilities and less flexibility than an open source platform such as OpenStack.
Plenty of organizations decide that the breadth and reliability of public cloud services fulfill their
requirements, thereby avoiding the need to invest financially and intellectually in a private cloud
infrastructure.
OpenStack adoption is a process, not an event. There are potentially dozens of components to understand,
install and employ. Organizations that seek to build a private cloud based on OpenStack need time, financial
investment and support from upper management.
Federated Cloud
Key characteristics of a federated cloud include the following:
1. In the federated cloud, users can interact with the architecture either centrally or in a decentralized manner. In centralized interaction, the user interacts with a broker that mediates between them and the organization. Decentralized interaction permits the user to interact directly with the clouds in the federation.
2. Federated cloud can serve various niches, both commercial and non-commercial.
3. The visibility of a federated cloud helps the user interpret the organization of the several clouds in the federated environment.
4. Federated cloud can be monitored in two ways. MaaS (Monitoring as a Service) provides
information that aids in tracking contracted services to the user. Global monitoring aids in
maintaining the federated cloud.
5. The providers who participate in the federation publish their offers to a central entity. The user
interacts with this central entity to verify the prices and propose an offer.
6. Marketed objects like infrastructure, software, and platform services have to pass through the federation when consumed in the federated cloud.
Cloud federation also poses challenges, including the following:
1. In cloud federation, it is common to have more than one provider processing incoming demands. In such cases, a scheme is needed to distribute the incoming demands equally among the cloud service providers.
2. Increasing requests in cloud federation result in a more heterogeneous infrastructure, making interoperability an area of concern. It becomes a challenge for cloud users to select relevant cloud service providers, which tends to tie them to a particular provider.
3. A federated cloud means constructing a seamless cloud environment that can interact with
people, different devices, several application interfaces, and other entities.
The technologies that aid the cloud federation and cloud services are:
1. OpenNebula
OpenNebula is a cloud computing platform for managing heterogeneous distributed data center infrastructures. It emphasizes interoperability, leverages existing information technology assets, and exposes application programming interfaces (APIs) for integration.
2. Aneka coordinator
The Aneka coordinator combines the Aneka services and Aneka peer components (network architectures), giving the cloud the ability to interact with other cloud services.
3. Eucalyptus
Eucalyptus pools computational, storage, and network resources that can be scaled up or down as application workloads change. It is an open-source framework that exposes storage, network, and other computational resources for access as a cloud environment.
Each level of the cloud federation poses unique problems and functions at a different level of the IT stack, so several strategies and technologies are needed. Combined, the answers to the problems encountered at each of these levels form a reference model for a cloud federation.
Conceptual Level
The difficulties in presenting a cloud federation as an advantageous option for using services rented from a
single cloud provider are addressed at the conceptual level. At this level, it’s crucial to define the new
opportunities that a federated environment brings in comparison to a single-provider solution and to
explicitly describe the benefits of joining a federation for service providers or service users.
At this level, the following factors need attention:
The reasons that cloud providers would want to join a federation.
Motivations for service users to use a federation.
Benefits for service providers who lease their services to other providers.
The obligations providers take on once they join the federation.
Agreements on trust between providers.
Transparency toward consumers.
The incentives of service providers and customers joining a federation stand out among these factors as
being the most important.
Logical and Operational Level
The obstacles in creating a framework that allows the aggregation of providers from various administrative domains within the context of a single overlay infrastructure, or cloud federation, are identified and addressed at the logical and operational level of a federated cloud.
Policies and guidelines for cooperation are established at this level. Additionally, this is the layer where
choices are made regarding how and when to use a service from another provider that is being leased or
leveraged. The operational component characterizes and molds the dynamic behavior of the federation as a
result of the decisions made by the individual providers, while the logical component specifies the context
in which agreements among providers are made and services are negotiated.
At this level, market-oriented cloud computing (MOCC) is put into practice and becomes a reality. At this stage, it is crucial to deal with the following difficulties:
How should a federation be represented?
How should a cloud service, a cloud provider, or an agreement be modeled and represented?
How should the regulations and standards that permit providers to join a federation be defined?
What procedures are in place to resolve disputes between providers?
What obligations does each supplier have to the other?
When should consumers and providers utilize the federation?
What categories of services are more likely to be rented than purchased?
What percentage of resources should be leased, and how should we value the resources that are leased?
The logical and operational level presents opportunities for both academia and industry.
Infrastructure Level
The technological difficulties in making it possible for various cloud computing systems to work together
seamlessly are dealt with at the infrastructure level. It addresses the technical obstacles keeping distinct
cloud computing systems from existing inside various administrative domains. These obstacles can be
removed by using standardized protocols and interfaces.
The following concerns should be addressed at this level:
What types of standards ought to be applied?
How should interfaces and protocols be created to work together?
Which technologies should be used for collaboration?
How can we design platform components, software systems, and services that support
interoperability?
Only open standards and interfaces allow for interoperability and composition among various cloud computing vendors. Additionally, the layers of the Cloud Computing Reference Model each have significantly different interfaces and protocols.
Services of Cloud Federation
Microsoft developed the Single Sign-On (SSO) system known as Active Directory Federation Services (ADFS). It serves as a component of Windows Server operating systems, giving users authenticated access through Active Directory (AD) to applications that cannot use Integrated Windows Authentication (IWA).
Through a proxy service located between Active Directory and the intended application, ADFS manages
authentication. Users’ access is granted through the usage of a Federated Trust, which connects ADFS and
the intended application. As a result, users no longer need to directly validate their identity on the federated
application in order to log on.
The authentication process typically follows these four phases (a sketch of the final phase follows the list):
The user accesses a URL that the ADFS service has provided.
The user is then verified by the AD service of the company through the ADFS service.
The ADFS service then gives the user an authentication claim after successful authentication.
The target application then receives this claim from the user’s browser and decides whether to
grant or deny access based on the Federated Trust service established.
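As a hedged illustration of the final phase, the sketch below shows a relying application validating a signed claim before granting access. ADFS classically issues SAML tokens, but it can also issue JWTs via OAuth 2.0; this example uses the third-party PyJWT library, and the issuer, audience, and signing key are placeholders.

```python
# A hedged illustration of the relying application's side of the
# last phase: validating a signed claim before granting access.
# Issuer, audience, and signing key are placeholders.
import jwt  # pip install PyJWT

def grant_or_deny(token, signing_key):
    try:
        # Verify the signature plus the issuer and audience claims,
        # i.e. that the token really came from the trusted federation
        # service and is intended for this application.
        claims = jwt.decode(
            token,
            signing_key,
            algorithms=["RS256"],
            issuer="https://adfs.example.com/adfs",
            audience="https://app.example.com/",
        )
    except jwt.InvalidTokenError:
        return False  # deny access
    return True  # grant access; claims identify the user
```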
Applications can assign user authentication duties to a different system through a process known as identity federation. You can accomplish single sign-on, where users only need to log in once to be able to access any number of their applications, by delegating access for all of your applications through a single federation system. But because federation enables organizations to centralize the access management function, it is far more significant than single sign-on. User experience, security, application onboarding, service logging and monitoring, operational efficiency in IT, and many other areas may all benefit from this.
The newest addition to the Radiant One package is the Cloud Federation Service (CFS), which is powered
by identity virtualization. Together with Radiant One FID, CFS isolates your external and cloud applications
from the complexity of your identity systems by delegating the work of authenticating against all of your
identity stores to a single common virtual layer.
A new approach to the cloud, one based on a federated model, will be increasingly important for cloud providers and users alike, writes Ditlev Bredahl, CEO and co-founder of OnApp.
It can be tempting to think of ‘the cloud’ as a ubiquitous global phenomenon: always on and always available,
everywhere to anyone. And, it’s easy to assume that cloud providers like Amazon are the only way you can
get access to that kind of global capability. The reality, however, is really quite different. That’s why a new
approach to the cloud – one based on a federated model – will be increasingly important for cloud providers
and users alike.
Why You Can’t Get ‘The Cloud’ From a Single Provider
The future of the cloud is federated, and when you look at the broad categories of apps moving to the cloud,
the truth of this statement begins to become clear. Gaming, social media, Web, eCommerce, publishing, CRM
– these applications demand truly global coverage, so that the user experience is always on, local and instant,
with ultra-low latency. That’s what the cloud has always promised to be.
The problem is that end users can’t get that from a single provider, no matter how large. Even market giants
like Amazon have limited geographic presence, with infrastructure only where it’s profitable for them to
invest. As a result, outside the major countries and cities, coverage from today’s ‘global’ cloud providers is
actually pretty thin. Iceland, Jordan, Latvia, Turkey, Malaysia? Good luck. Even in the U.S., you might find
that the closest access point to your business isn’t even in the same state, let alone the same city.
Of course, these locations aren’t devoid of infrastructure. There are hosting providers, telcos, ISPs and data
center operators pretty much everywhere. If you own infrastructure in one of these locations, you already have
a working business model for your local market. And, like most providers, you are likely to have spare capacity
almost all of the time.
The federated cloud connects these local infrastructure providers to a global marketplace that enables each
participant to buy and sell capacity on demand. As a provider, this gives you instant access to global
infrastructure on an unprecedented scale. If your customer suddenly needs a few hundred new servers, you
just buy the capacity they need from the marketplace. If a customer needs to accelerate a website or an
application in Hong Kong, Tokyo or Latvia, you simply subscribe to those locations and make use of the
infrastructure that’s already there.
As part of a cloud federation, even a small service provider can offer a truly global service without spending
a dime building new infrastructure. For companies with spare capacity in the data center, the federation also
provides a simple way to monetize that capacity by submitting it to the marketplace for other providers to buy,
creating an additional source of revenue.
There are immediate benefits for end users, too. The federated cloud means that end users can host apps with
their federated cloud provider of choice, instead of choosing from a handful of “global” cloud providers on
the market today and making do with whatever pricing, app support and SLAs they happen to impose. Cloud
users can choose a local host with the exact pricing, expertise and support package that fits their need, while
still receiving instant access to as much local or global IT resources as they’d like. They get global scalability
without restricted choice, and without having to manage multiple providers and invoices.
The federated cloud model is a force for real democratization in the cloud market. It’s how businesses will be
able to use local cloud providers to connect with customers, partners and employees anywhere in the world.
It’s how end users will finally get to realize the promise of the cloud. And, it’s how data center operators and
other service providers will finally be able to compete with, and beat, today’s so-called global cloud providers.