Hadoop Questions and Answers Part 100
1. IBM and ________ have announced a major initiative to use Hadoop to support
university courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google
Answer: d
Explanation: Google and IBM announced a university initiative to address Internet-scale computing.
Answer: b
Explanation: Data compression can be achieved using compression algorithms like
bzip2, gzip, LZO, etc. Different algorithms can be used in different scenarios based
on their capabilities.
Answer: a
Explanation: Hadoop is Open Source, released under Apache 2 license.
4. Sun also has the Hadoop Live CD ________ project, which allows running a fully
functional Hadoop cluster using a live CD.
a) OpenOffice.org
b) OpenSolaris
c) GNU
d) Linux
Answer: b
Explanation: The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM
image.
Answer: a
Explanation: The Hadoop Distributed File System (HDFS) is designed to store very
large data sets reliably, and to stream those data sets at high bandwidth to the user.
Answer: c
Explanation: The Hadoop framework itself is mostly written in the Java programming
language, with some native code in C and command-line utilities written as shell
scripts.
7. Which of the following platforms does Hadoop run on?
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
Answer: c
Explanation: Hadoop has cross-platform operating system support.
8. Hadoop achieves reliability by replicating the data across multiple hosts and
hence does not require ________ storage on hosts.
a) RAID
b) Standard RAID levels
c) ZFS
d) Operating system
Answer: a
Explanation: With the default replication value, 3, data is stored on three nodes: two
on the same rack, and one on a different rack.
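To illustrate the replication setting, here is a minimal Java sketch (not part of the original questions) that asks HDFS to keep three replicas of one file; the namenode address and file path are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication defaults to 3; it can also be adjusted per file.
        // "hdfs://namenode:8020" and "/data/events.log" are placeholder values.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        fs.setReplication(new Path("/data/events.log"), (short) 3);
        fs.close();
    }
}
```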
9. Above the file systems comes the ________ engine, which consists of one Job
Tracker, to which client applications submit MapReduce jobs.
a) MapReduce
b) Google
c) Functional programming
d) Facebook
Answer: a
Explanation: The MapReduce engine is used to distribute work around a cluster.
10. The Hadoop list includes the HBase database, the Apache Mahout ________
system, and matrix operations.
a) Machine learning
b) Pattern recognition
c) Statistical classification
d) Artificial intelligence
Answer: a
Explanation: The Apache Mahout project’s goal is to build a scalable machine
learning tool.
Answer: d
Explanation: Adding security to Hadoop is challenging because all the interactions
do not follow the classic client-server pattern.
c) In the Hadoop programming framework output files are divided into lines or
records
d) None of the mentioned
Answer: a
Explanation: Data warehousing integrated with Hadoop would give a better
understanding of data.
4. Hadoop is a framework that works with a variety of related tools. Common cohorts
include ____________
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
Answer: a
Explanation: To use Hive with HBase you’ll typically want to launch two clusters, one
to run HBase and the other to run Hive.
Answer: c
Explanation: The programming model, MapReduce, used by Hadoop is simple to
write and test.
6. What was Hadoop named after?
a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop development
Answer: c
Explanation: Doug Cutting, Hadoop creator, named the framework after his child’s
stuffed toy elephant.
Answer: b
Explanation: Apache Hadoop is an open-source software framework for distributed
storage and distributed processing of Big Data on clusters of commodity hardware.
Answer: a
Explanation: MapReduce is a programming model and an associated
implementation for processing and generating large data sets with a parallel,
distributed algorithm.
9. _________ has the world’s largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
Answer: c
Explanation: Facebook has many Hadoop clusters; the largest among them is the one used for data warehousing.
Answer: a
Explanation: Prism automatically replicates and moves data wherever it’s needed
across a vast network of computing facilities.
Answer: c
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs.
2. Point out the correct statement.
a) Hive is not a relational database, but a query engine that supports the parts of
SQL specific to querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned
Answer: a
Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates
data summarization, ad hoc queries, and the analysis of large datasets stored in
Hadoop-compatible file systems.
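As a rough illustration of Hive acting as a query engine over Hadoop data, the following Java sketch submits a HiveQL query through the HiveServer2 JDBC driver; the host, port, database, table and credentials are placeholders, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "user", "");
        try (Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
        con.close();
    }
}
```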
3. _________ hides the limitations of Java behind a powerful and concise Clojure
API for Cascading.
a) Scalding
b) HCatalog
c) Cascalog
d) All of the mentioned
Answer: c
Explanation: Cascalog also adds Logic Programming concepts inspired by Datalog.
Hence the name “Cascalog” is a contraction of Cascading and Datalog.
Answer: b
Explanation: Hive also supports custom extensions written in Java, including user-
defined functions (UDFs) and serializer-deserializers for reading and optionally
writing custom formats.
5. Point out the wrong statement.
a) Elastic MapReduce (EMR) is Facebook’s packaged Hadoop offering
b) Amazon Web Service Elastic MapReduce (EMR) is Amazon’s packaged Hadoop
offering
c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
d) All of the mentioned
Answer: a
Explanation: Rather than building Hadoop deployments manually on EC2 (Elastic
Compute Cloud) clusters, users can spin up fully configured Hadoop installations
using simple invocation commands, either through the AWS Web Console or through
command-line tools.
Answer: d
Explanation: Cascading hides many of the complexities of MapReduce programming
behind more intuitive pipes and data flow abstractions.
Answer: a
Explanation: MapReduce provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms.
8. The Pig Latin scripting language is not only a higher-level data flow language but
also has operators similar to ____________
a) SQL
b) JSON
c) XML
d) All of the mentioned
Answer: a
Explanation: Pig Latin, in essence, is designed to fill the gap between the declarative
style of SQL and the low-level procedural style of MapReduce.
Answer: d
Explanation: Hive Queries are translated to MapReduce jobs to exploit the scalability
of MapReduce.
10. ______ is a framework for performing remote procedure calls and data
serialization.
a) Drill
b) BigTop
c) Avro
d) Chukwa
Answer: c
Explanation: In the context of Hadoop, Avro can be used to pass data from one
program or language to another.
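For illustration, a small Java sketch of Avro's generic-record API; the schema and field names are made up, and writing the record out (to a file or an RPC layer) is only hinted at in the comments.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
    public static void main(String[] args) {
        // Hypothetical record schema with two fields.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);
        // The record can now be serialized with a DatumWriter and read back
        // by another program or language that shares the same schema.
        System.out.println(user);
    }
}
```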
Hadoop Questions and Answers Part-4
1. A ________ node acts as the Slave and is responsible for executing a Task
assigned to it by the JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
Answer: c
Explanation: The TaskTracker receives the information necessary for the execution of a Task from the JobTracker, executes the Task, and sends the results back to the JobTracker.
Answer: a
Explanation: Map Task in MapReduce is performed using the Map() function.
4. _________ function is responsible for consolidating the results produced by each
of the Map() functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
Answer: a
Explanation: Reduce function collates the work and resolves the results.
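A minimal word-count style reduce() sketch in the classic org.apache.hadoop.mapred API (an illustration, not code from the original questions) shows how the Reduce function consolidates the values produced for one key by the Map() tasks.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // collate the results from each map task
        }
        output.collect(key, new IntWritable(sum));
    }
}
```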
Answer: d
Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
Answer: a
Explanation: Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI based).
7. ________ is a utility which allows users to create and run jobs with any
executables as the mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned
Answer: b
Explanation: Hadoop streaming is one of the most important utilities in the Apache
Hadoop distribution.
Answer: a
Explanation: Maps are the individual tasks that transform input records into
intermediate records.
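For illustration, a matching word-count style map() sketch in the classic mapred API, showing input records being transformed into intermediate (word, 1) records; it is a sketch, not official sample code.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Each input record (one line of text) becomes a set of
    // intermediate (word, 1) records.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}
```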
Answer: a
Explanation: Total size of inputs means the total number of blocks of the input files.
Answer: c
Explanation: The default partitioner in Hadoop is the HashPartitioner which has a
method called getPartition to partition.
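The following Java sketch mirrors what the default hash partitioner does in spirit (it is an illustrative re-implementation, not Hadoop's source): getPartition maps each key's hash code to a reduce task index.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SimpleHashPartitioner<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }

    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```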
Answer: a
Explanation: Place the generic options before the streaming options, otherwise the
command will fail.
c) The class you supply for the output format should return key/value pairs of Text
class
d) All of the mentioned
Answer: d
Explanation: Required parameters are used for the input and output locations and for the mapper.
Answer: c
Explanation: Environment variables are set using the -cmdenv option.
Answer: c
Explanation: To use Aggregate, simply specify “-reducer aggregate”.
6. The ________ option allows you to copy jars locally to the current working
directory of tasks and automatically unjar the files.
a) archives
b) files
c) task
d) none of the mentioned
Answer: a
Explanation: The -archives option is also a generic option.
Answer: b
Explanation: The primary key is used for partitioning, and the combination of the
primary and secondary keys is used for sorting.
Answer: c
Explanation: Hadoop has a library class, KeyFieldBasedComparator, that is useful
for many applications.
Answer: b
Explanation: The map function defined in the class treats each input key/value pair
as a list of fields.
Answer: b
Explanation: The JobConfigurable.configure method is overridden so that implementations can initialize themselves.
Answer: a
Explanation: In the Shuffle phase the framework fetches the relevant partition of the
output of all the mappers, via HTTP.
Answer: d
Explanation: The right number of reduces seems to be 0.95 or 1.75 multiplied by the number of available reduce slots in the cluster.
Answer: a
Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.
6. The output of the _______ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned
Answer: d
Explanation: The output of the reduce task is typically written to the FileSystem. The
output of the Reducer is not sorted.
Answer: a
Explanation: The shuffle and sort phases occur simultaneously; while map-outputs
are being fetched they are merged.
8. Mapper and Reducer implementations can use the ________ to report progress or
just indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
Answer: c
Explanation: The Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update counters.
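For illustration, a Java sketch (a hypothetical mapper, not from the original questions) that uses the Reporter to signal liveness and update a counter during a long-running map task.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        // ... expensive per-record processing would go here ...
        reporter.setStatus("processing offset " + key.get());
        reporter.progress();                       // signal that the task is alive
        reporter.incrCounter("app", "records", 1); // custom counters are also supported
        output.collect(value, key);
    }
}
```

Without such calls, the framework might assume a slow task has timed out and kill it, as noted elsewhere in these questions.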
10. _________ is the primary interface for a user to describe a MapReduce job to
the Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
Answer: b
Explanation: JobConf represents a MapReduce job configuration.
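A minimal JobConf driver sketch (illustrative only) that describes a word-count job to the framework using the library mapper and reducer classes bundled with Hadoop; the input and output paths are supplied on the command line.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(TokenCountMapper.class);   // bundled library mapper
        conf.setReducerClass(LongSumReducer.class);    // bundled library reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);   // submits the job and waits for completion
    }
}
```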
Answer: a
Explanation: NoSQL systems make the most sense whenever the application is
based on data with varying data types and the data can be stored in key-value
notation.
Answer: a
Explanation: Hadoop, together with a relational data warehouse, can form a very effective data warehouse infrastructure.
3. Hadoop data is not sequenced and is in 64MB to 256MB block sizes of delimited
record values with schema applied on read based on ____________
a) HCatalog
b) Hive
c) Hbase
d) All of the mentioned
Answer: a
Explanation: Other means of tagging the values can also be used.
4. __________ are highly resilient and eliminate the single-point-of-failure risk with
traditional Hadoop deployments.
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned
Answer: b
Explanation: Enterprise data protection and security options including file system
auditing and data-at-rest encryption to address compliance requirements are also
provided by Isilon solution.
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to
____________
a) Scale out
b) Scale up
c) Both Scale out and up
d) None of the mentioned
Answer: a
Explanation: HDFS and NoSQL file systems focus almost exclusively on adding
nodes to increase performance (scale-out) but even they require node configuration
with elements of scale up.
7. Which is the most popular NoSQL database for scalable big data store with
Hadoop?
a) Hbase
b) MongoDB
c) Cassandra
d) None of the mentioned
Answer: a
Explanation: HBase is the Hadoop database: a distributed, scalable Big Data store
that lets you host very large tables — billions of rows multiplied by millions of
columns — on clusters built with commodity hardware.
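As an illustration of HBase as a scalable store, the following Java sketch writes one cell with the HBase client API; the table name, row key, column family and qualifier are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("web_pages"))) {
            // Write one cell: row "row-001", family "content", qualifier "title".
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("title"),
                          Bytes.toBytes("Hello HBase"));
            table.put(put);
        }
    }
}
```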
8. The ___________ can also be used to distribute both jars and native libraries for
use in the map and/or reduce tasks.
a) DataCache
b) DistributedData
c) DistributedCache
d) All of the mentioned
Answer: c
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
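A short Java sketch (with hypothetical HDFS paths) of distributing a jar and a native library to task nodes through the DistributedCache:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    public static void configureCache(JobConf conf) throws Exception {
        // Jar made available on the task classpath (placeholder path).
        DistributedCache.addFileToClassPath(new Path("/libs/lookup.jar"), conf);
        // Native library copied to the task's working directory, which is on
        // java.library.path and LD_LIBRARY_PATH for the child JVM.
        DistributedCache.addCacheFile(new URI("/libs/libparse.so#libparse.so"), conf);
    }
}
```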
Answer: c
Explanation: Google Bigtable leverages the distributed data storage provided by the
Google File System.
10. __________ refers to incremental costs with no major impact on solution design,
performance and complexity.
a) Scale-out
b) Scale-down
c) Scale-up
d) None of the mentioned
Answer: c
Explanation: Adding more CPU/RAM/Disk capacity to Hadoop DataNode that is
already part of a cluster does not require additional network switches.
Answer: b
Explanation: All the metadata related to HDFS including the information about data
nodes, files stored on HDFS, and Replication, etc. are stored and maintained on the
NameNode.
c) Data blocks are replicated across different nodes in the cluster to ensure a low
degree of fault tolerance
d) None of the mentioned
Answer: a
Explanation: The NameNode serves as the master and each DataNode serves as a worker/slave.
Answer: c
Explanation: The Secondary NameNode periodically merges the fsimage and edits log, which helps keep the NameNode metadata compact and recoverable.
Answer: d
Explanation: The NameNode is aware of the files to which the blocks stored on the DataNodes belong.
6. Which of the following scenario may not be a good fit for HDFS?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the
same file
b) HDFS is suitable for storing data related to applications requiring low latency data
access
c) HDFS is suitable for storing data related to applications requiring high latency data
access
d) None of the mentioned
Answer: a
Explanation: HDFS can be used for storing archive data since it is cheaper as HDFS
allows storing the data on low cost commodity hardware while ensuring a high
degree of fault-tolerance.
7. The need for data replication can arise in various scenarios like ____________
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Answer: d
Explanation: Data is replicated across different DataNodes to ensure a high degree
of fault-tolerance.
8. _______ is the slave/worker node and holds the user data in the form of Data
Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
Answer: a
Explanation: A DataNode stores data in the Hadoop File System. A functional filesystem has more than one DataNode, with data replicated across them.
Answer: b
Explanation: HDFS is implemented in Java and any computer which can run Java
can host a NameNode/DataNode on it.
10. For YARN, the ___________ Manager UI provides host and port information.
a) Data Node
b) NameNode
c) Resource
d) Replication
Answer: c
Explanation: The YARN ResourceManager web UI provides host and port information for the cluster, its nodes, and the running applications.
Answer: a
Explanation: HBase Master UI provides information about the number of live, dead
and transitional servers, logs, ZooKeeper information, debug dumps, and thread
stacks.
2. During start up, the ___________ loads the file system state from the fsimage and
the edits log file.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
Answer: b
Explanation: HDFS is implemented in Java, so any computer that can run Java can host a NameNode or DataNode.
Answer: a
Explanation: InputDataStream is used to read data from a file.
Answer: d
Explanation: If equivalence rules for keys while grouping the intermediates are
different from those for grouping keys before reduction, then one may specify a
Comparator.
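For illustration, a hypothetical grouping comparator in Java that groups composite "primary#secondary" Text keys by the primary part only, registered through JobConf; the composite key format is an assumption made for this sketch.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;

// Groups keys by their primary part only, so one reduce() call sees all
// secondary keys belonging to the same primary key.
public class PrimaryKeyGroupingComparator extends WritableComparator {
    protected PrimaryKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String left = a.toString().split("#", 2)[0];
        String right = b.toString().split("#", 2)[0];
        return left.compareTo(right);
    }

    public static void configure(JobConf conf) {
        // Grouping comparator controls which keys share one reduce() call.
        conf.setOutputValueGroupingComparator(PrimaryKeyGroupingComparator.class);
    }
}
```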
5. ______________ provides the method to copy bytes from an input stream to any other stream in Hadoop.
a) IOUtils
b) Utils
c) IUtils
d) All of the mentioned
Answer: a
Explanation: The IOUtils class provides static utility methods; IOUtils.copyBytes copies bytes from an input stream to an output stream.
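A minimal Java sketch (hypothetical namenode address and path) that copies an HDFS file to standard output with IOUtils.copyBytes:

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path("/data/events.log"));
            // Copy bytes from the input stream to stdout, 4 KB at a time.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```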
Answer: a
Explanation: The readFully method can also be used instead of the read method.
Answer: b
Explanation: The output of the Reducer is not re-sorted.
Answer: b
Explanation: Reducer implementations can access the JobConf for the job.
Answer: a
Explanation: In the shuffle phase the framework fetches, for each Reducer, the relevant partition of the output of all the Mappers via HTTP.
10. The output of the reduce task is typically written to the FileSystem via
____________
a) OutputCollector
b) InputCollector
c) OutputCollect
d) All of the mentioned
Answer: a
Explanation: In reduce phase the reduce(Object, Iterator, OutputCollector, Reporter)
method is called for each pair in the grouped inputs.
Answer: b
Explanation: In scenarios where the application takes a significant amount of time to
process individual key/value pairs, this is crucial since the framework might assume
that the task has timed-out and kill that task.
Answer: d
Explanation: The reporter parameter is for a facility to report progress.
Answer: b
Explanation: The name should have a *.har extension.
Answer: d
Explanation: A Hadoop archive directory contains metadata (in the form of _index
and _masterindex) and data (part-*) files.
Answer: c
Explanation: Hadoop Archives are exposed as a file system, so MapReduce is able to use all the logical input files in Hadoop Archives as input.
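To illustrate the archive being exposed as a file system, here is a Java sketch (with a hypothetical archive path) that lists the logical files inside a .har archive through a har:// URI.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // An archive is addressed with a har:// URI layered over HDFS;
        // the host, port and archive path below are placeholders.
        URI harUri = URI.create("har://hdfs-namenode:8020/user/demo/logs.har");
        FileSystem harFs = FileSystem.get(harUri, conf);
        for (FileStatus status : harFs.listStatus(new Path(harUri))) {
            System.out.println(status.getPath());
        }
    }
}
```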
6. The __________ guarantees that excess resources taken from a queue will be
restored to it within N minutes of its need for them.
a) capacitor
b) scheduler
c) datanode
d) none of the mentioned
Answer: b
Explanation: Free resources can be allocated to any queue beyond its guaranteed
capacity.
Answer: d
Explanation: All the fs shell commands in the archives work but with a different URI.
Answer: c
Explanation: The Capacity Scheduler supports multiple queues, where a job is
submitted to a queue.
Answer: c
Explanation: -archiveName <name> is the name of the archive to be created.
10. _________ identifies filesystem path names which work as usual with regular
expressions.
a) -archiveName <name>
b) <source>
c) <destination>
d) none of the mentioned
Answer: b
Explanation: <source> identifies the filesystem path names to be archived, which work as usual with regular expressions; <destination> identifies the destination directory which would contain the archive.