Week 0 To 8 Assignment

Week 0: Assignment 0

Your last recorded submission was on 2024-08-14, 10:43 IST

1 point

What defines Big Data?

Volume, Variety, Velocity

Veracity, Velocity, Value

Volume, Veracity, Value

Value, Versatility, Volume

1 point

Which technology is commonly used for processing and analyzing Big Data?

Hadoop

SQL

Python

Excel

1 point

Which of the following is a challenge associated with Big Data?

Low storage requirements

Limited data sources

Slow data processing

Predictable data patterns

1 point

Which programming language is commonly used in Hadoop development?

Java

Python

C++

Ruby

1 point

What is the primary purpose of Hadoop's HDFS?


Data visualization

Data storage

Data querying

Data modeling

1 point

Which component of Hadoop is responsible for job scheduling and resource management?

YARN

HDFS

MapReduce

Pig

1 point

What is Apache Zookeeper primarily used for in Big Data ecosystems?

Data storage

Data processing

Configuration management

Data visualization

1 point

What is the default block size in HDFS?

128 MB

256 MB

64 MB

512 MB

1 point

The CAP theorem states that a distributed system cannot simultaneously guarantee which of the following?

Consistency, Accessibility, Partition tolerance

Consistency, Atomicity, Partition tolerance

Consistency, Atomicity, Availability

Consistency, Availability, Reliability


1 point

Which of the following is NOT a role of Apache Zookeeper?

Data storage

Data processing

Configuration management

Data visualization

Week 1: Assignment 1

The due date for submitting this assignment has passed.

Due on 2024-08-28, 23:59 IST.

Assignment submitted on 2024-08-20, 13:24 IST

1 point

Which of the following best describes the concept of 'Big Data'?

Data that is physically large in size

Data that is collected from multiple sources and is of high variety, volume, and velocity

Data that requires specialized hardware for storage

Data that is highly structured and easily analyzable

Yes, the answer is correct.


Score: 1

Accepted Answers:

Data that is collected from multiple sources and is of high variety, volume,
and velocity

1 point

Which technology is commonly used for processing and analyzing Big Data in distributed computing environments?

MySQL
Hadoop

Excel

SQLite

Yes, the answer is correct.


Score: 1

Accepted Answers:

Hadoop

1 point

What is a primary limitation of traditional RDBMS when dealing with Big Data?

They cannot handle structured data

They are too expensive to implement

They struggle with scaling to manage very large datasets

They are not capable of performing complex queries

Yes, the answer is correct.


Score: 1

Accepted Answers:

They struggle with scaling to manage very large datasets

1 point

Which component of Hadoop is responsible for distributed storage?

YARN

HDFS

MapReduce

Pig

Yes, the answer is correct.


Score: 1
Accepted Answers:

HDFS

1 point

Which Hadoop ecosystem tool is primarily used for querying and analyzing
large datasets stored in Hadoop's distributed storage?

HBase

Hive

Kafka

Sqoop

Yes, the answer is correct.


Score: 1

Accepted Answers:

Hive

1 point

Which YARN component is responsible for coordinating the execution of tasks within containers on individual nodes in a Hadoop cluster?

NodeManager

ResourceManager

ApplicationMaster

DataNode

No, the answer is incorrect.


Score: 0

Accepted Answers:

NodeManager

1 point

What is the primary advantage of using Apache Spark over traditional MapReduce for data processing?

Better fault tolerance

Lower hardware requirements

Real-time data processing


Faster data processing

Yes, the answer is correct.


Score: 1

Accepted Answers:

Faster data processing

1 point

What is Apache Spark Streaming primarily used for?

Real-time data visualization

Batch processing of large datasets

Real-time stream processing

Data storage and retrieval

Yes, the answer is correct.


Score: 1

Accepted Answers:

Real-time stream processing

1 point

Which operation in Apache Spark GraphX is used to perform triangle counting on a graph?

connectedComponents

triangleCount

shortestPaths

pageRank

Yes, the answer is correct.


Score: 1

Accepted Answers:

triangleCount
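GraphX's triangleCount belongs to Spark's Scala API; the computation it performs, counting for each vertex the triangles it participates in, can be sketched in plain Python (an illustration of the semantics only, not the GraphX API):

```python
# Per-vertex triangle counting on an undirected graph: a vertex's count
# is the number of neighbor pairs that are themselves connected.
from itertools import combinations

def triangle_count(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj.get(a, set()):   # neighbors a and b are connected
                counts[v] += 1
    return counts

# A 4-cycle 1-2-3-4 with diagonal 1-3: triangles (1,2,3) and (1,3,4)
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(triangle_count(edges))  # {1: 2, 2: 1, 3: 2, 4: 1}
```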

1 point

Which component in Hadoop is responsible for executing tasks on individual nodes and reporting back to the JobTracker?

HDFS Namenode

TaskTracker
YARN ResourceManager

DataNode

Yes, the answer is correct.


Score: 1

Accepted Answers:

TaskTracker

Week 2: Assignment 2

The due date for submitting this assignment has passed.

Due on 2024-09-04, 23:59 IST.

Assignment submitted on 2024-08-24, 22:45 IST

1 point

Which statement best describes the data storage model used by HBase?

Key-value pairs

Document-oriented

Encryption

Relational tables

Yes, the answer is correct.


Score: 1

Accepted Answers:

Key-value pairs

1 point

What is Apache Avro primarily used for in the context of Big Data?

Real-time data streaming

Data serialization

Machine learning

Database management
Yes, the answer is correct.
Score: 1

Accepted Answers:

Data serialization

1 point

Which component in HDFS is responsible for storing actual data blocks on the DataNodes?

NameNode

DataNode

Secondary NameNode

ResourceManager

Yes, the answer is correct.


Score: 1

Accepted Answers:

DataNode

1 point

Which feature of HDFS ensures fault tolerance by replicating data blocks across multiple DataNodes?

Partitioning

Compression

Replication

Encryption

Yes, the answer is correct.


Score: 1

Accepted Answers:

Replication
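The cluster-wide replication default is commonly configured in hdfs-site.xml via the dfs.replication property (a minimal fragment; the value shown matches HDFS's default of 3):

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>  <!-- each block is stored on three DataNodes -->
</property>
```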

1 point

Which component in MapReduce is responsible for sorting and grouping the intermediate key-value pairs before passing them to the Reducer?

Mapper

Reducer
Partitioner

Combiner

Yes, the answer is correct.


Score: 1

Accepted Answers:

Partitioner

1 point

What is the default replication factor in Hadoop Distributed File System (HDFS)?

Yes, the answer is correct.

Score: 1

Accepted Answers:

3
1 point

In a MapReduce job, what is the role of the Reducer?

Sorting input data

Transforming intermediate data

Aggregating results

Splitting input data

Yes, the answer is correct.


Score: 1

Accepted Answers:

Aggregating results

1 point

Which task can be efficiently parallelized using MapReduce?

Real-time sensor data processing


Single-row database queries

Image rendering

Log file analysis

Yes, the answer is correct.


Score: 1

Accepted Answers:

Log file analysis

1 point

Which MapReduce application involves counting the occurrence of words in a large corpus of text?

PageRank algorithm

K-means clustering

Word count

Recommender system

Yes, the answer is correct.


Score: 1

Accepted Answers:

Word count
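The word-count pattern can be sketched as a single-process Python analogy of the map, shuffle, and reduce phases (an illustration, not Hadoop API code):

```python
# Map emits (word, 1) pairs, shuffle groups them by key, reduce sums each group.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(word, counts):
    return word, sum(counts)

def word_count(lines):
    shuffled = defaultdict(list)            # the "shuffle and sort" step
    for line in lines:
        for word, one in map_phase(line):
            shuffled[word].append(one)
    return dict(reduce_phase(w, c) for w, c in shuffled.items())

print(word_count(["big data is big", "data is everywhere"]))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```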

1 point

What does reversing a web link graph typically involve?

Removing dead links from the graph

Inverting the direction of edges

Adding new links to the graph

Sorting links based on page rank

Yes, the answer is correct.


Score: 1

Accepted Answers:

Inverting the direction of edges
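Inverting the edges turns "pages I link to" into "pages that link to me", the input PageRank needs. A small adjacency-list sketch (hypothetical domain names, plain Python):

```python
# Reverse a web link graph: source -> [targets] becomes target -> [sources].
from collections import defaultdict

def reverse_links(graph):
    reversed_graph = defaultdict(list)
    for src, targets in graph.items():
        for dst in targets:
            reversed_graph[dst].append(src)
    return dict(reversed_graph)

links = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
print(reverse_links(links))  # {'b.com': ['a.com'], 'c.com': ['a.com', 'b.com']}
```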


Week 3: Assignment 3

The due date for submitting this assignment has passed.

Due on 2024-09-11, 23:59 IST.

Assignment submitted on 2024-09-09, 15:36 IST

1 point

Which abstraction in Apache Spark allows for parallel execution and distributed data processing?

DataFrame

RDD (Resilient Distributed Dataset)

Dataset

Spark SQL

Yes, the answer is correct.


Score: 1

Accepted Answers:

RDD (Resilient Distributed Dataset)

1 point

What component resides on top of Spark Core?

Spark Streaming

Spark SQL

RDDs

None of the above

Yes, the answer is correct.


Score: 1

Accepted Answers:

Spark SQL

1 point
Which statements about Cassandra and its Snitches are correct?
Statement 1: In Cassandra, during a write operation, when a hinted
handoff is enabled and if any replica is down, the coordinator writes to all
other replicas and keeps the write locally until the down replica comes
back up. Statement 2: In Cassandra, Ec2Snitch is an important snitch for
deployments, and it is a simple snitch for Amazon EC2 deployments where
all nodes are in a single region. In Ec2Snitch, the region name refers to the
data center, and the availability zone refers to the rack in a cluster.

Only Statement 1 is correct.

Only Statement 2 is correct.

Both Statement 1 and Statement 2 are correct.

Neither Statement 1 nor Statement 2 is correct.

Yes, the answer is correct.


Score: 1

Accepted Answers:

Both Statement 1 and Statement 2 are correct.

1 point

Which of the following is a module for Structured data processing?

GraphX

MLlib

Spark SQL

Spark R

Yes, the answer is correct.


Score: 1

Accepted Answers:

Spark SQL
1 point

A healthcare provider wants to store and query patient records in a NoSQL database with high write throughput and low-latency access. Which Hadoop ecosystem technology is most suitable for this requirement?

Apache Hadoop

Apache Spark

Apache HBase

Apache Pig

Yes, the answer is correct.


Score: 1

Accepted Answers:

Apache HBase

1 point

The primary Machine Learning API for Spark is now the _____ based API.

DataFrame

Dataset

RDD

All of the above

Yes, the answer is correct.


Score: 1

Accepted Answers:

DataFrame

1 point

How does Apache Spark's performance compare to Hadoop MapReduce?


Apache Spark is up to 10 times faster in memory and up to 100 times faster on disk.

Apache Spark is up to 100 times faster in memory and up to 10 times faster on disk.

Apache Spark is up to 10 times faster both in memory and on disk compared to Hadoop MapReduce.

Apache Spark is up to 100 times faster both in memory and on disk compared to Hadoop MapReduce.

Yes, the answer is correct.

Score: 1

Accepted Answers:

Apache Spark is up to 100 times faster in memory and up to 10 times faster on disk.

1 point

Which DAG action in Apache Spark triggers the execution of all previously
defined transformations in the DAG and returns the count of elements in
the resulting RDD or DataFrame?

collect()

count()

take()

first()

Yes, the answer is correct.


Score: 1

Accepted Answers:

count()
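Transformations in Spark are lazy; only an action such as count() forces execution of the whole DAG. As a rough analogy (plain Python generators, not the Spark API), nothing below computes until the final "count" pulls data through the pipeline:

```python
# Build a lazy pipeline of "transformations"; the terminal sum() acts
# like Spark's count() action and triggers the entire chain at once.
data = range(10)                                 # stand-in for an RDD
pipeline = (x * x for x in data)                 # lazy "transformation"
pipeline = (x for x in pipeline if x % 2 == 0)   # another lazy step

count = sum(1 for _ in pipeline)                 # "action": runs everything
print(count)  # 5 (the even squares: 0, 4, 16, 36, 64)
```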

1 point

What is Apache Spark Streaming primarily used for?

Real-time processing of streaming data

Batch processing of static datasets


Machine learning model training

Graph processing

Yes, the answer is correct.


Score: 1

Accepted Answers:

Real-time processing of streaming data

1 point

Which of the following represents the smallest unit of data processed by Apache Spark Streaming?

Batch

Window

Micro-batch

Record

Yes, the answer is correct.


Score: 1

Accepted Answers:

Micro-batch

Week 4: Assignment 4

The due date for submitting this assignment has passed.

Due on 2024-09-18, 23:59 IST.

Assignment submitted on 2024-09-06, 17:09 IST

1 point

Which of the following statements about Bloom filters is true?

Bloom filters guarantee no false negatives

Bloom filters use cryptographic hashing functions

Bloom filters may produce false positives but no false negatives

Bloom filters are primarily used for sorting large datasets


Yes, the answer is correct.
Score: 1

Accepted Answers:

Bloom filters may produce false positives but no false negatives
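A minimal Bloom filter sketch (illustrative size and hashing scheme, not a production design) shows why false negatives are impossible: adding an item sets exactly the bits a later lookup of that item checks, so a stored item always tests positive; false positives arise only when other items happen to set the same bits.

```python
# Tiny Bloom filter: k hash positions per item over a fixed bit array.
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("hadoop")
print(bf.might_contain("hadoop"))  # True: an added item is never missed
```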

1 point

How does CAP theorem impact the design of distributed systems?

It emphasizes data accuracy over system availability

It requires trade-offs between consistency, availability, and partition tolerance

It prioritizes system performance over data security

It eliminates the need for fault tolerance measures

Yes, the answer is correct.


Score: 1

Accepted Answers:

It requires trade-offs between consistency, availability, and partition tolerance

1 point

Which guarantee does the CAP theorem consider as mandatory for a distributed system?

Consistency

Availability

Partition tolerance

Latency tolerance

Yes, the answer is correct.


Score: 1

Accepted Answers:

Partition tolerance

1 point
What consistency level in Apache Cassandra ensures that a write
operation is acknowledged only after the write has been successfully
written to all replicas?

ONE

LOCAL_ONE

LOCAL_QUORUM

ALL

Yes, the answer is correct.


Score: 1

Accepted Answers:

ALL

1 point

How does Zookeeper contribute to maintaining consistency in distributed systems?

By managing data replication

By providing a centralized configuration service

By ensuring data encryption

By optimizing data storage

Yes, the answer is correct.


Score: 1

Accepted Answers:

By providing a centralized configuration service

1 point

A ___________ server is a machine that keeps a copy of the state of the entire system and persists this information in local log files.

Master
Region

Zookeeper

All of the mentioned

Yes, the answer is correct.


Score: 1

Accepted Answers:

Zookeeper

1 point

What is Apache Zookeeper primarily used for in Big Data ecosystems?

Data storage

Data processing

Configuration management

Data visualization

Yes, the answer is correct.


Score: 1

Accepted Answers:

Configuration management

1 point

Which statement correctly describes CQL (Cassandra Query Language)?

CQL is a SQL-like language used for querying relational databases

CQL is a procedural programming language used for writing stored procedures in Cassandra

CQL is a language used for creating and managing tables and querying data in Apache Cassandra

CQL is a scripting language used for data transformation tasks in Cassandra

Yes, the answer is correct.


Score: 1

Accepted Answers:

CQL is a language used for creating and managing tables and querying
data in Apache Cassandra

1 point

Which aspect of the CAP theorem refers to a system's ability to continue operating despite network failures?

Consistency

Accessibility

Partition tolerance

Atomicity

Yes, the answer is correct.


Score: 1

Accepted Answers:

Partition tolerance

1 point

Why are tombstones used in distributed databases like Apache Cassandra?

To mark nodes that are temporarily unavailable

To mark data that is stored in multiple replicas

To mark data that has been logically deleted

To mark data that is actively being updated

Yes, the answer is correct.


Score: 1

Accepted Answers:
To mark data that has been logically deleted

Week 5: Assignment 5

The due date for submitting this assignment has passed.

Due on 2024-09-25, 23:59 IST.

Assignment submitted on 2024-09-13, 21:43 IST

1 point

What distributed graph processing framework operates on top of Spark?

MLlib

GraphX

Spark streaming

ALL

Yes, the answer is correct.


Score: 1

Accepted Answers:

GraphX

1 point

Which of the following frameworks is best suited for fast, in-memory data
processing and supports advanced analytics such as machine learning and
graph processing?

Apache Hadoop MapReduce

Apache Flink

Apache Storm

Apache Spark

Yes, the answer is correct.


Score: 1

Accepted Answers:
Apache Spark

1 point

A financial institution needs to analyze historical stock market data to predict market trends and make investment decisions. Which Big Data processing framework is best suited for this scenario?

Apache Spark

Apache Storm

Hadoop MapReduce

Apache Flume

Yes, the answer is correct.


Score: 1

Accepted Answers:

Apache Spark

1 point

A telecommunications company needs to process real-time call logs from millions of subscribers to detect network anomalies. Which combination of Big Data tools would be appropriate for this use case?

Apache Hadoop and Apache Pig

Apache Kafka and Apache HBase

Apache Spark and Apache Hive

Apache Storm and Apache Pig

Yes, the answer is correct.


Score: 1

Accepted Answers:

Apache Kafka and Apache HBase

1 point

Many people use Kafka as a substitute for which type of solution?

log aggregation

compaction

collection
all of the mentioned

Yes, the answer is correct.


Score: 1

Accepted Answers:

log aggregation

1 point

Which of the following features of Resilient Distributed Datasets (RDDs) in Apache Spark contributes to their fault tolerance?

DAG (Directed Acyclic Graph)

In-memory computation

Lazy-evaluation

Lineage information

Yes, the answer is correct.


Score: 1

Accepted Answers:

Lineage information

1 point

Point out the correct statement.

Hadoop needs specialized hardware to process the data

Hadoop allows live stream processing of real-time data

In the Hadoop MapReduce programming framework, output files are divided into lines or records

None of the mentioned

Yes, the answer is correct.


Score: 1

Accepted Answers:

In the Hadoop MapReduce programming framework, output files are divided into lines or records
1 point

Which of the following statements about Apache Pig is true?

Pig Latin scripts are compiled into HiveQL for execution.

Pig is primarily used for real-time stream processing.

Pig Latin provides a procedural data flow language for ETL tasks.

Pig uses a schema-on-write approach for data storage.

Yes, the answer is correct.


Score: 1

Accepted Answers:

Pig Latin provides a procedural data flow language for ETL tasks.

1 point

An educational institution wants to analyze student performance data stored in HDFS and generate personalized learning recommendations. Which Hadoop ecosystem components should be used?

Apache HBase for storing student data and Apache Pig for processing.

Apache Kafka for data streaming and Apache Storm for real-time
analytics.

Hadoop MapReduce for batch processing and Apache Hive for querying.

Apache Spark for data processing and Apache Hadoop for storage.

No, the answer is incorrect.


Score: 0

Accepted Answers:

Apache Spark for data processing and Apache Hadoop for storage.

1 point

A company is analyzing customer behavior across multiple channels (web, mobile app, social media) to personalize marketing campaigns. Which technology is best suited to handle this type of data processing?

Hadoop MapReduce

Apache Kafka

Apache Spark

Apache Hive
Yes, the answer is correct.
Score: 1

Accepted Answers:

Apache Spark

Week 6: Assignment 6

The due date for submitting this assignment has passed.

Due on 2024-10-02, 23:59 IST.

Assignment submitted on 2024-09-22, 20:48 IST

1 point

Point out the wrong statement.

Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level

Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode

User data is distributed across multiple DataNodes in the cluster and is managed by the NameNode.

DataNode is aware of the files to which the blocks stored on it belong

Yes, the answer is correct.


Score: 1

Accepted Answers:

DataNode is aware of the files to which the blocks stored on it belong

1 point

What is the primary technique used by Random Forest to reduce overfitting?

Boosting

Bagging

Pruning
Neural networks

Yes, the answer is correct.


Score: 1

Accepted Answers:

Bagging

1 point

Which statements accurately describe the Random Forest and Gradient Boosting ensemble methods?

S1: Both methods can be used for classification tasks

S2: Random Forest is used for regression whereas Gradient Boosting is used for classification tasks

S3: Random Forest is used for classification whereas Gradient Boosting is used for regression tasks

S4: Both methods can be used for regression

S1 and S2

S2 and S4

S3 and S4

S1 and S4

Yes, the answer is correct.


Score: 1

Accepted Answers:

S1 and S4

1 point

In the context of K-means clustering with MapReduce, what role does the
Map phase play in handling very large datasets?

It reduces the size of the dataset by removing duplicates

It distributes the computation of distances between data points and centroids across multiple nodes

It initializes multiple sets of centroids to improve clustering accuracy

It performs principal component analysis (PCA) on the data

Yes, the answer is correct.


Score: 1
Accepted Answers:

It distributes the computation of distances between data points and centroids across multiple nodes
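The Map phase's role can be sketched as follows: each mapper, given its chunk of points and the current centroids, emits (nearest-centroid-index, point) pairs, which the shuffle then groups per centroid for the reducers to average. A plain-Python illustration (not Hadoop API code):

```python
# One mapper's work in K-means: assign each point in its chunk to the
# nearest centroid using squared Euclidean distance.
def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def map_phase(points, centroids):
    return [(nearest(p, centroids), p) for p in points]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0), (0.5, -0.5)]
print(map_phase(points, centroids))
# [(0, (1.0, 1.0)), (1, (9.0, 8.0)), (0, (0.5, -0.5))]
```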

1 point

What is a common method to improve the performance of the K-means algorithm when dealing with large-scale datasets in a MapReduce environment?

Using hierarchical clustering before K-means

Reducing the number of clusters

Employing mini-batch K-means

Increasing the number of centroids

Yes, the answer is correct.


Score: 1

Accepted Answers:

Employing mini-batch K-means

1 point

Which similarity measure is often used to determine the similarity between two text documents by considering the angle between their vector representations in a high-dimensional space?

Manhattan Distance

Cosine Similarity

Jaccard Similarity

Hamming Distance

Yes, the answer is correct.


Score: 1

Accepted Answers:

Cosine Similarity
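Cosine similarity is the cosine of the angle between two vectors, so it ignores document length and depends only on direction. A minimal sketch over term-frequency vectors:

```python
# Cosine similarity: dot product divided by the product of the norms.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # parallel vectors -> ~1.0
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors -> 0.0
```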

1 point

Which distance measure calculates the distance along strictly horizontal and vertical paths, consisting of segments along the axes?

Minkowski distance

Cosine similarity
Manhattan distance

Euclidean distance

Yes, the answer is correct.


Score: 1

Accepted Answers:

Manhattan distance
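Manhattan (L1) distance sums movement along the axes only, in contrast to the straight-line Euclidean (L2) distance; the classic 3-4-5 triangle makes the difference concrete:

```python
import math

def manhattan(a, b):
    # Sum of absolute per-axis differences (axis-aligned path length).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(manhattan((0, 0), (3, 4)))   # 3 + 4 = 7
print(euclidean((0, 0), (3, 4)))   # 5.0
```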

1 point

What is the purpose of a validation set in machine learning?

To train the model on unseen data

To evaluate the model’s performance on the training data

To tune hyperparameters and prevent overfitting

To test the final model’s performance

Yes, the answer is correct.


Score: 1

Accepted Answers:

To tune hyperparameters and prevent overfitting

1 point

In K-fold cross-validation, what is the purpose of splitting the dataset into K folds?

To ensure that every data point is used for training only once

To train the model on all the data points

To test the model on the same data multiple times

To evaluate the model’s performance on different subsets of data

Yes, the answer is correct.


Score: 1

Accepted Answers:

To evaluate the model’s performance on different subsets of data
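The K-fold splitting idea can be sketched in a few lines: each of the K folds serves once as the validation subset while the remaining folds form the training set (a minimal illustration; real libraries also shuffle and stratify):

```python
# Generate (training, validation) splits for K-fold cross-validation.
def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]       # round-robin fold assignment
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i
                    for x in fold]
        yield training, validation

data = list(range(6))
for train, val in k_fold_splits(data, 3):
    print(train, val)   # each point appears in exactly one validation set
```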

1 point

Which of the following steps is NOT typically part of the machine learning
process?

Data Collection
Model Training

Model Deployment

Data Encryption

Yes, the answer is correct.


Score: 1

Accepted Answers:

Data Encryption

Lecture Materials

INTRODUCTION TO BIG DATA
BIG DATA ENABLING TECHNOLOGIES
HADOOP STACK FOR BIG DATA
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HADOOP MAPREDUCE 1.0
HADOOP MAPREDUCE 2.0 (PART-I)
HADOOP MAPREDUCE 2.0 (PART-II)
MAPREDUCE EXAMPLES
PARALLEL PROGRAMMING WITH SPARK
INTRODUCTION TO SPARK
SPARK BUILT-IN LIBRARIES
DESIGN OF KEY-VALUE STORES
DATA PLACEMENT STRATEGIES
CAP THEOREM
CONSISTENCY SOLUTIONS
DESIGN OF ZOOKEEPER
CQL (CASSANDRA QUERY LANGUAGE)
DESIGN OF HBASE
SPARK STREAMING AND SLIDING WINDOW ANALYTICS (PART-I)
SPARK STREAMING AND SLIDING WINDOW ANALYTICS (PART-II)
SLIDING WINDOW ANALYTICS
INTRODUCTION TO KAFKA
BIG DATA MACHINE LEARNING (PART-I)
BIG DATA MACHINE LEARNING (PART-II)
MACHINE LEARNING ALGORITHM K-MEANS USING MAP REDUCE FOR BIG DATA ANALYTICS
PARALLEL K-MEANS USING MAP REDUCE ON BIG DATA CLUSTER ANALYSIS
DECISION TREES FOR BIG DATA ANALYTICS
BIG DATA PREDICTIVE ANALYTICS (PART-I)
BIG DATA PREDICTIVE ANALYTICS (PART-II)
PARAMETER SERVERS
PAGERANK ALGORITHM IN BIG DATA
SPARK GRAPHX & GRAPH ANALYTICS (PART-I)
SPARK GRAPHX & GRAPH ANALYTICS (PART-II)
CASE STUDY: FLIGHT DATA ANALYSIS USING SPARK GRAPHX
