Big Data IA Answers
1. Define Big Data and give an example.
Definition: Big Data refers to large, complex datasets that traditional data
processing software can't handle.
Example 1: Social media data, sensor data, e-mails, zipped files,
web pages, etc.
Example 2: Facebook. According to Facebook, its data system processes
500+ terabytes of data daily: users generate 2.7 billion Like actions
and upload 300 million new photos every day. With 2.38 billion users,
this data supports searching and recommendation.
MapReduce Parallel Data Flow
1. Input Splits.
• HDFS distributes and replicates data over multiple servers.
• The default data block size is 64MB. Thus, a 500MB file would be broken
into 8 blocks and written to different machines in the cluster.
• The data are also replicated on multiple machines (typically three machines).
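As a quick check on the arithmetic above, a minimal sketch in plain Java (file size, block size, and replication factor taken from the example; class name illustrative):

public class BlockCountDemo {
    public static void main(String[] args) {
        long fileSize = 500L * 1024 * 1024;   // 500 MB file
        long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
        // Number of blocks = ceil(fileSize / blockSize)
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println(blocks);           // prints 8
        // With a replication factor of 3, HDFS stores 8 x 3 = 24 block copies.
    }
}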
2. Map Step.
• The user provides the specific mapping process.
• MapReduce will try to execute the mapper on the machines where the block
resides.
• Because the file is replicated in HDFS, the least busy node with the data will
be chosen.
• If all nodes holding the data are too busy, MapReduce will try to pick a node
that is closest to the node that hosts the data block.
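To make the map step concrete, here is the classic word-count mapper written against the Hadoop MapReduce Java API; this is an illustrative sketch, not part of the original answer:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: emits a (word, 1) pair for every word in its input split.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // one key-value pair per word
        }
    }
}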
3. Combiner Step.
• It is possible to provide an optimization or pre-reduction as part of the map
stage where key–value pairs are combined prior to the next stage.
• The combiner stage is optional.
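In the Hadoop Java API a combiner is written as a Reducer subclass and registered on the job. A minimal sketch for the word-count example above (class name illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Optional combiner: locally sums the (word, 1) pairs produced by a
// mapper before they are sent over the network, shrinking shuffle traffic.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int partial = 0;
        for (IntWritable v : values) {
            partial += v.get();
        }
        context.write(key, new IntWritable(partial));
    }
}

Because addition is associative and commutative, many jobs simply register their reducer class as the combiner instead of writing a separate one.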
4. Shuffle Step.
• Before the parallel reduction stage can complete, all pairs with the same key
must be combined and counted by the same reducer process.
• Therefore, results of the map stage must be collected by key–value pairs and
shuffled to the same reducer process.
• If only a single reducer process is used, the shuffle stage is not needed.
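The "same key goes to the same reducer" guarantee comes from the partitioner. The sketch below mirrors the logic of Hadoop's default HashPartitioner (the demo class name is illustrative):

import org.apache.hadoop.io.Text;

public class PartitionDemo {
    // Mask off the sign bit, then take the key's hash modulo the
    // number of reducers; identical keys always get the same index.
    static int partitionFor(Text key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(new Text("hadoop"), 4));
        System.out.println(partitionFor(new Text("hadoop"), 4)); // same index again
    }
}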
5. Reduce Step.
• The final step is the actual reduction. In this stage, the data reduction is
performed as per the programmer’s design.
• The results are written to HDFS. Each reducer will write an output file. For
example, a MapReduce job running four reducers will create files called part-
0000, part-0001, part-0002, and part-0003.
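Completing the word-count sketch, a minimal reducer might look as follows (illustrative, not the only possible design):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: sums all counts shuffled to it for each word; every reducer
// instance writes its own part-* output file to HDFS.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

A job driver wires the steps together, reusing the mapper and combiner sketched earlier; setting four reduce tasks is what produces the four part-* files mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);  // map step
        job.setCombinerClass(SumCombiner.class);    // optional combiner step
        job.setReducerClass(IntSumReducer.class);   // reduce step
        job.setNumReduceTasks(4);                   // four part-* output files
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}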
Rack Awareness
• Rack awareness is about knowing where data is stored in a Hadoop cluster. It
supports data locality, which means moving computation to the node where the
data resides.
• A Hadoop cluster exhibits three levels of data locality:
• Data resides on the local machine.
• Data resides in the same rack.
• Data resides in a different rack.
• To protect against failures, the system makes copies of data and stores them
across different racks. So, if one rack fails, the data is still safe and available
from another rack, keeping the system running without losing data.
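Rack awareness is not automatic: HDFS learns the cluster topology from an administrator-supplied script that maps a host or IP to a rack path. A minimal sketch using the Hadoop configuration API (the script path is hypothetical; in a real cluster this property is normally set in core-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class TopologyConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The named script maps a host/IP to a rack path such as
        // /datacenter1/rack1. The path below is purely illustrative.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
        System.out.println(conf.get("net.topology.script.file.name"));
    }
}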
8. Explain
i) NameNode High Availability
ii) HDFS NameNode Federation
HDFS Checkpoints
• The NameNode stores the metadata of the HDFS file system in a file called
fsimage.
• File system modifications are written to an edits log file, and at startup the
NameNode merges the edits into a new fsimage.
• The SecondaryNameNode or CheckpointNode periodically fetches edits from
the NameNode, merges them, and returns an updated fsimage to the
NameNode.
HDFS Backups
• An HDFS BackupNode maintains an up-to-date copy of the metadata both in
memory and on disk.
• The BackupNode does not need to download the fsimage and edits files from
the active NameNode because it already has an up-to-date metadata state in
memory.
• A NameNode supports one BackupNode at a time. No CheckpointNodes may
be registered if a BackupNode is in use.
9. Explain Apache Sqoop import and export methods with a suitable
diagram.
10. Explain Apache Pig with suitable examples.