Hadoop
Pooja Pandit
… Java API as well as Thrift and REST gateway APIs.
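Assuming the fragment above refers to HBase, whose native client is the Java API mentioned alongside the Thrift and REST gateways, the following minimal sketch writes and reads a single cell through that Java API; the table name, column family, row key, and value are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for the cluster addresses.
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table
                // Write one cell: row "u1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("u1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Read the cell back.
                Result result = table.get(new Get(Bytes.toBytes("u1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }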
F. Data Intelligence

1) Mahout: Mahout is a library of scalable machine-learning algorithms implemented on top of Apache Hadoop. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes. Mahout provides data science tools to automatically find meaningful patterns in big data sets; the Apache Mahout project aims to make it faster and easier to turn big data into big information.
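As an illustration of the kind of algorithm Mahout packages, the sketch below builds a user-based collaborative-filtering recommender with Mahout's single-machine Taste API; the input file name and neighborhood size are assumptions, and the distributed variants of such algorithms run as Hadoop MapReduce jobs instead.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutRecommenderSketch {
        public static void main(String[] args) throws Exception {
            // ratings.csv (hypothetical): one "userID,itemID,preference" triple per line.
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top three item recommendations for user 1.
            List<RecommendedItem> items = recommender.recommend(1L, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }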
G. Data Integration

1) Apache Sqoop: Apache Sqoop is a tool designed for bulk data transfers between relational databases and Hadoop. The features are as follows:
• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import and export to HBase.
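As a concrete illustration of a bulk import into HDFS, the sketch below drives Sqoop programmatically; it assumes Sqoop 1's org.apache.sqoop.Sqoop.runTool entry point is available on the classpath, and the JDBC URL, table, and target directory are purely illustrative (the same arguments can be passed to the sqoop import command line).

    import org.apache.sqoop.Sqoop;

    public class SqoopImportSketch {
        public static void main(String[] args) {
            // Equivalent to running `sqoop import ...` from the shell.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",   // hypothetical source database
                "--username", "etl",
                "--password-file", "/user/etl/.sqoop.pwd",
                "--table", "orders",                        // relational table to copy
                "--target-dir", "/data/raw/orders",         // destination directory in HDFS
                "--num-mappers", "4"
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.exit(exitCode);
        }
    }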
2) Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms, and it uses a simple, extensible data model that allows for online analytic applications. The architecture is shown in Fig. 4.
• … compute infrastructure, the local R environment, and Oracle database tables.
• Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data in HDFS files.

III. PROPERTIES OF HADOOP DISTRIBUTED FILE SYSTEM

The properties of the HDFS system are as follows:
• Fault tolerance by detecting faults and applying quick, automatic recovery.
• Portability across heterogeneous commodity hardware and operating systems.
• Scalability to reliably store and process large amounts of data.
• Data storage reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failure.
• Staging to commit: When a client creates a file in HDFS, it first caches the data into a temporary local file and redirects subsequent writes to that file. When the temporary file accumulates enough data to fill an HDFS block, the client reports this to the name node, which commits the data as a permanent block on a data node (a small Java sketch of the client write path appears after this list).
• Data block rebalancing: HDFS data blocks might not always be placed uniformly across data nodes, meaning that the space on one or more data nodes can be underutilized. Therefore, HDFS supports rebalancing data blocks using various models. One model might move data blocks from one data node to another automatically if the free space on a data node falls too low. Another model might dynamically create additional replicas and rebalance other data blocks in a cluster if a sudden increase in demand for a given file occurs. HDFS also provides the hadoop balancer command for manual rebalancing tasks.
• Data integrity: HDFS goes to great lengths to ensure the integrity of data across clusters. It uses checksum validation on the contents of HDFS files, storing the computed checksums in separate, hidden files in the same namespace as the actual data. When a client retrieves file data, it can verify that the data received matches the checksum stored in the associated file.
• HDFS permissions for users, files, and directories: HDFS implements a permissions model for files and directories that has a lot in common with the Portable Operating System Interface (POSIX) model; for example, every file and directory is associated with an owner and a group. The HDFS permissions model supports read (r), write (w), and execute (x).
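The write path and the POSIX-like permission model described above are both exposed through the standard org.apache.hadoop.fs.FileSystem client API. The sketch below is a minimal illustration, assuming a Hadoop configuration (core-site.xml/hdfs-site.xml) is on the classpath; the path and permission bits are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS and related settings from the Hadoop configuration files.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Create a file; the client-side staging and block placement described
            // above happen behind this call as data is written and the stream is closed.
            Path file = new Path("/user/demo/sample.txt");   // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // POSIX-like permissions: owner rw-, group r--, others ---.
            fs.setPermission(file, new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.NONE));

            // HDFS maintains block checksums; a file-level checksum can be fetched as well.
            System.out.println(fs.getFileChecksum(file));

            fs.close();
        }
    }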
A. Pig

Pig deals with structured data using Pig Latin, which is a scripting language. Pig was developed by Yahoo. The data structure that it operates on is complex and nested. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
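To give a flavour of Pig Latin, the sketch below embeds two statements in a Java driver through Pig's PigServer class (scripts can equally be run standalone with the pig command); the input file, schema, and filter condition are hypothetical.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLatinSketch {
        public static void main(String[] args) throws Exception {
            // LOCAL mode runs against the local file system; MAPREDUCE targets a cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Pig Latin: load a comma-separated file, keep high-value orders, store the result.
            pig.registerQuery("orders = LOAD 'orders.csv' USING PigStorage(',') "
                    + "AS (id:int, customer:chararray, amount:double);");
            pig.registerQuery("big_orders = FILTER orders BY amount > 1000.0;");
            pig.store("big_orders", "big_orders_out");

            pig.shutdown();
        }
    }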
B. JAQL (Query Language for JSON)

JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured, and unstructured data. It was developed by IBM. The data structure that it operates on is JSON (JavaScript Object Notation), a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive.

VI. CONCLUSION

In this paper, we study the Hadoop ecosystem. Specifically, we first present the overall Hadoop architecture. We then study the various components of the Hadoop ecosystem, such as the Hadoop Distributed File System, MapReduce, and YARN. Next, we study the properties of the HDFS and the YARN MapReduce system.