Big Data IA Answers

1. What is Big Data? Give an example.
Definition: Big Data refers to large, complex datasets that traditional data processing software cannot handle.
• Example 1: Social media data, sensor data, e-mails, zipped files, web pages, etc.
• Example 2: Facebook. According to Facebook, its data system processes 500+ terabytes of data daily; the site generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily. It has 2.38 billion users and supports searching and recommendation over this data.

2. Explain the design features of the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) was designed for Big Data processing and is capable of supporting many users simultaneously. The design assumes a write-once/read-many model for large files. HDFS restricts data writing to one user at a time; all additional writes are "append-only," and there is no random writing to HDFS files.
• HDFS is designed for data streaming, where large amounts of data are read from disk in bulk. The HDFS block size is typically 64MB or 128MB; the configured value can be checked with the command shown after this list.
• There is no local caching mechanism. The large block and file sizes make it more efficient to reread data from HDFS than to try to cache the data.
• Hadoop MapReduce moves the computation to the data rather than moving the data to the computation. That is, converged data storage and processing happen on the same servers or DataNodes.
• A reliable file system maintains multiple copies of data across the cluster. Consequently, failure of a single node will not bring down the file system.
• A specialized file system is used, which is not designed for general use.
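As a quick check of these defaults on a running cluster, the configured block size and replication factor can be read with the hdfs getconf command. This is a minimal sketch; the property names dfs.blocksize and dfs.replication are the standard Hadoop 2 keys, and the values shown are only illustrative.

$ hdfs getconf -confKey dfs.blocksize    # block size in bytes (134217728 = 128MB)
134217728
$ hdfs getconf -confKey dfs.replication  # default replication factor
3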

3. Describe the main components of HDFS.


The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes. In a basic design, the NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes. The NameNode stores all metadata in memory; no data is actually stored on the NameNode. The design is a master/slave architecture in which the master (NameNode) regulates access to files by clients. File system operations such as opening, closing, and renaming files and directories are all managed by the NameNode. The NameNode also determines the mapping of blocks to DataNodes, handles DataNode failures, and manages block creation, deletion, and replication.

HDFS uses a master/slave architecture designed for large file reading/streaming.


• The NameNode is a metadata server or “data traffic cop.”
• HDFS provides a single namespace that is managed by the NameNode.
• Data is redundantly stored on DataNodes; there is no data on the
NameNode.
• The SecondaryNameNode performs checkpoints of the NameNode file
system’s state but is not a failover node.
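A quick way to see this division of roles on a live cluster is to ask HDFS which hosts hold each role. The commands below are standard hdfs getconf options; the hostnames in the sample output are only placeholders.

$ hdfs getconf -namenodes              # host(s) running the NameNode
namenode.example.com
$ hdfs getconf -secondaryNameNodes     # host running the SecondaryNameNode
secondary.example.com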
4. MapReduce Parallel Data Flow

• HDFS distributes and replicates data over multiple data nodes.


• Apache Hadoop MapReduce will try to move the mapping tasks to the DataNodes that contain the data slice. Results from each data slice are then combined in the reducer step.
• Parallel execution of MapReduce requires other steps in addition to the
mapper and reducer processes.
• The basic steps are as follows:

1. Input Splits.
• HDFS distributes and replicates data over multiple servers.
• The default data block size is 64MB. Thus, a 500MB file would be broken into 8 blocks (seven full 64MB blocks plus one 52MB block) and written to different machines in the cluster.
• The data are also replicated on multiple machines (typically three machines).

2. Map Step.
• The user provides the specific mapping process.
• MapReduce will try to execute the mapper on the machines where the block
resides.
• Because the file is replicated in HDFS, the least busy node with the data will
be chosen.
• If all nodes holding the data are too busy, MapReduce will try to pick a node
that is closest to the node that hosts the data block.

3. Combiner Step.
• It is possible to provide an optimization or pre-reduction as part of the map
stage where key–value pairs are combined prior to the next stage.
• The combiner stage is optional.

4. Shuffle Step.
• Before the parallel reduction stage can complete, all similar keys must be
combined and counted by the same reducer process.
• Therefore, results of the map stage must be collected by key–value pairs and
shuffled to the same reducer process.
• If only a single reducer process is used, the shuffle stage is not needed.

5. Reduce Step.
• The final step is the actual reduction. In this stage, the data reduction is
performed as per the programmer’s design.
• The results are written to HDFS. Each reducer will write its own output file. For example, a MapReduce job running four reducers will create files called part-r-00000, part-r-00001, part-r-00002, and part-r-00003.

The figure shows an example of a simple Hadoop MapReduce data flow for a word-count program. The map process counts the words in each split, and the reduce process calculates the total for each word.
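As a concrete illustration of this flow, the word-count example bundled with Hadoop can be run from the command line. This is a sketch only: the input file name, the wordcount directories, and the examples-jar path are assumptions and vary by distribution.

$ hdfs dfs -mkdir -p wordcount/input
$ hdfs dfs -put war-and-peace.txt wordcount/input
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount wordcount/input wordcount/output
$ hdfs dfs -ls wordcount/output                       # one part-r-* file per reducer, plus _SUCCESS
$ hdfs dfs -cat wordcount/output/part-r-00000 | head  # inspect the first few word counts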

5. What are some common HDFS user commands, and what are their purposes?
• The preferred way to interact with HDFS is through the hdfs command, which facilitates navigation within HDFS.
• The hdfs command supports a wide range of options; in the following, only portions of the dfs and dfsadmin options are shown.

General HDFS Commands


The version of HDFS can be found from the version option.
$ hdfs version
Hadoop 2.6.0.2.2.4.2-2
List Files in HDFS
To list the files in the root HDFS directory, enter the following command:
$ hdfs dfs -ls /
Output:
Found 2 items
drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps

To list files in your home directory, enter the following command:


$ hdfs dfs -ls
Output:
Found 2 items
drwxr-xr-x - hdfs hdfs 0 2015-05-24 20:06 bin
drwxr-xr-x - hdfs hdfs 0 2015-04-29 16:52

Make a Directory in HDFS


To make a directory in HDFS, use the following command.
$ hdfs dfs -mkdir stuff

Copy Files to HDFS


• To copy a file from your current local directory into HDFS, use the following command.
• If a full path is not supplied, your home directory is assumed.
• In this case, the file test is placed in the directory stuff that was created previously.
$ hdfs dfs -put test stuff
• The file transfer can be confirmed by using the -ls command:
$ hdfs dfs -ls stuff
Found 1 items
-rw-r--r-- 2 hdfs hdfs 12857 2015-05-29 13:12 stuff/test

Copy Files from HDFS


• Files can be copied back to your local file system using the following command.
• In this case, the file test from HDFS will be copied back to the current local directory with the name test-local.
$ hdfs dfs -get stuff/test test-local
Copy Files within HDFS
The following command will copy a file within HDFS:
$ hdfs dfs -cp stuff/test test.hdfs

Delete a File within HDFS


The following command will delete the HDFS file test.hdfs:
$ hdfs dfs -rm test.hdfs

Delete a Directory in HDFS


The following command will delete the HDFS directory stuff and all its
contents:
$ hdfs dfs -rm -r -skipTrash stuff
Deleted stuff

Get an HDFS Status Report


An HDFS status report can be obtained using the following command. Those with HDFS administrator privileges will get a full report. Note that this command uses dfsadmin instead of dfs to invoke administrative commands.
$ hdfs dfsadmin -report
Configured Capacity: 1503409881088 (1.37 TB)
Present Capacity: 1407945981952 (1.28 TB)
DFS Remaining: 1255510564864 (1.14 TB)
DFS Used: 152435417088 (141.97 GB)
DFS Used%: 10.83%
Under replicated blocks: 54
Blocks with corrupt replicas: 0
Missing blocks: 0

6. Illustrate the process of HDFS block replication.

HDFS Block Replication


• When HDFS writes a file, it is replicated across the cluster. For Hadoop
clusters containing more than eight DataNodes, the replication value is usually
set to 3.
• The HDFS default block size is often 64MB. If a file of size 80MB is written
to HDFS, a 64MB block and a 16MB block will be created.
• The figure provides an example of how a file is broken into blocks and replicated across the cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail, the replicated blocks will still be available on other nodes, and the lost blocks will then be re-replicated on other DataNodes.
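Replication can also be inspected and adjusted per file from the command line. The sketch below assumes the file stuff/test from the earlier examples exists under the hdfs user's home directory (/user/hdfs); the commands themselves are standard.

$ hdfs dfs -setrep 3 stuff/test    # set the replication factor of one file to 3
$ hdfs dfs -ls stuff               # the second column of the listing shows the replication factor
$ hdfs fsck /user/hdfs/stuff/test -files -blocks -locations   # report block placement across DataNodes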

7. Write a note on HDFS Safe Mode and Rack Awareness

HDFS Safe Mode


• When the NameNode starts, it enters a read-only safe mode where blocks cannot be replicated or deleted. Safe mode enables the NameNode to perform two important processes:
• The previous file system state is reconstructed by loading the fsimage file into memory and replaying the edit log.
• The mapping between blocks and DataNodes is created by waiting for enough of the DataNodes to register so that at least one copy of the data is available.
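Administrators can query or control safe mode directly with the dfsadmin options sketched below; the "Safe mode is OFF" line is illustrative output.

$ hdfs dfsadmin -safemode get     # report whether the NameNode is currently in safe mode
Safe mode is OFF
$ hdfs dfsadmin -safemode enter   # manually place the NameNode in safe mode
$ hdfs dfsadmin -safemode leave   # return the NameNode to normal read/write operation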

Rack Awareness
• Rack awareness is about knowing where data is stored in a Hadoop system. It deals with data locality, that is, moving computation to the node where the data resides.
• A Hadoop cluster will exhibit three levels of data locality:
• Data resides on the local machine.
• Data resides in the same rack.
• Data resides in a different rack.
• To protect against failures, the system makes copies of data and stores them
across different racks. So, if one rack fails, the data is still safe and available
from another rack, keeping the system running without losing data.
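The rack assignments the NameNode is using can be displayed with the -printTopology option; the rack name, IP addresses, and hostnames below are only placeholders.

$ hdfs dfsadmin -printTopology
Rack: /default-rack
   192.168.1.11:50010 (datanode1)
   192.168.1.12:50010 (datanode2)
   192.168.1.13:50010 (datanode3)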

8. Explain
i) NameNode High Availability
ii) HDFS NameNode Federation

iii) HDFS Checkpoints and Backup

HDFS Checkpoints
• The NameNode stores the metadata of the HDFS file system in a file called fsimage.
• File system modifications are written to an edits log file, and at startup the NameNode merges the edits into a new fsimage.
• The SecondaryNameNode or CheckpointNode periodically fetches edits from the NameNode, merges them, and returns an updated fsimage to the NameNode.

HDFS Backups
• An HDFS BackupNode maintains an up-to-date copy of the metadata both in
memory and on disk.
• The BackupNode does not need to download the fsimage and edits files from
the active NameNode because it already has an up-to-date metadata state in
memory.
• A NameNode supports one BackupNode at a time. No CheckpointNodes may be registered if a BackupNode is in use.
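A current copy of the NameNode metadata can also be captured on demand, which is a simple complement to checkpoints and backups. The sketch below assumes administrator privileges and an arbitrary local directory (/tmp/nn-backup); note that -saveNamespace requires the NameNode to be in safe mode.

$ hdfs dfsadmin -fetchImage /tmp/nn-backup   # download the most recent fsimage from the NameNode
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -saveNamespace               # force an immediate checkpoint of the namespace
$ hdfs dfsadmin -safemode leave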
9. Explain Apache Sqoop Import & Export Methods with a suitable diagram.
10. Explain Apache Pig with suitable examples.

The following Pig examples are run on data stored in HDFS.


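A minimal sketch of one such interactive Pig session is shown below. It assumes a small colon-delimited file named passwd has already been copied into the user's HDFS home directory with hdfs dfs -put; the alias names and field positions are illustrative.

$ pig                                             # start the Grunt shell (MapReduce mode)
grunt> A = LOAD 'passwd' USING PigStorage(':');   -- load the file, splitting fields on ':'
grunt> B = FOREACH A GENERATE $0 AS user;         -- keep only the first field (the user name)
grunt> DUMP B;                                    -- run the job and print the results
grunt> STORE B INTO 'userid.out';                 -- or write the results to a directory in HDFS
grunt> quit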

11. Explain Apache Hive with suitable examples.


12. Explain
i) Apache Sqoop version comparison

ii) Steps to be performed in Apache Sqoop
