Important Da

This document discusses various topics related to big data analytics:
1. It defines big data and discusses the challenges of working with big data.
2. It discusses Hadoop and its ecosystem, including HDFS, MapReduce, Hive, and YARN.
3. It covers data mining techniques and how they are used in sectors such as marketing, finance, manufacturing, and government.
4. It discusses analyzing data streams and algorithms for handling streaming data.
5. It provides examples of how to write MapReduce programs and queries in MongoDB.

Uploaded by

Priyadarshini
Copyright
© All Rights Reserved

PART-A

UNIT-1

Define the term “Big Data”.
What are the two open source analytics tools?
What do you understand by reactive business intelligence?
What are the challenges with Big Data?
What do you mean by R analytics?
What is Statistica?
What is IBM SPSS Modeler?
Define Weka.

UNIT-2

What is meant by unstructured data? Give examples.
What are the major components of HDFS 2?
What are the features of HDFS 2?
What are the limitations of HDFS? What do you understand by HDFS Federation?
What are the three important classes of MapReduce?
What is the use of Hive in the Hadoop ecosystem?
What are the two MapReduce daemons?
List the daemons that are part of the YARN architecture.
How many blocks will be created for a file that is 300 MB? The default block size is 64 MB and
the replication factor is 3.
What are active and passive NameNodes?
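
The 300 MB block-count question in this unit comes down to simple arithmetic, which can be sketched as follows. This is a minimal sketch, assuming the file is split into fixed-size blocks with a smaller final block and that every block is stored once per replica; the function name `hdfs_block_counts` is hypothetical and not part of any Hadoop API.

```python
import math


def hdfs_block_counts(file_mb, block_mb=64, replication=3):
    """Return (logical_blocks, stored_block_replicas, last_block_mb).

    Assumption: fixed-size blocks, with the final block holding the remainder.
    """
    logical = math.ceil(file_mb / block_mb)          # blocks the file is split into
    last_block = file_mb - (logical - 1) * block_mb  # size of the final, partial block
    stored = logical * replication                   # copies kept across the cluster
    return logical, stored, last_block


blocks, replicas, last_mb = hdfs_block_counts(300)
print(blocks, replicas, last_mb)  # 5 15 44
```

Under these assumptions the file occupies 5 blocks (four of 64 MB and one of 44 MB), and the cluster stores 15 block replicas in total.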

UNIT-3

1. Mention the steps involved in EDA.
2. Consider the following headline: “Although Nintendo sales were already good last year,
this year they are even better. When the game Animal Crossing was released, the game
sold almost 12 million times in only two weeks.” Which types of data are mentioned in
this headline?
3. Two students are studying for the course Data Analytics for Engineers and have just
read the passage about interval data and ratio data. Student one says: “When students
take a test, they are graded from a scale of 0 to 10. For this score ratios make sense,
hence scores are ratio data.” Student two says: “Ratios do not make sense for IQ-
scores. Hence, scores are interval data.” Note that IQ-tests are scored relative to the
reference score of 100. Which of the two students is correct? And why?
4. What do you mean by Categorical data?
5. What is the function used to identify NaN values?
6. How do you perform random sampling without replacement?
7. Suppose a Hive table is created as an external table. If we drop the table, will the data
still be accessible?
8. Mention the types of NoSQL databases.
9. A Hive partitioned table is created, partitioned by a column, say yearofexperience. If
we create a directory, say yearofexperience=3, at the HDFS path of the table and load
a data set that matches the table structure, will the data be available when we execute
a select query on the table?
10. What is the Hive metastore?
11. What is a data lake?
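
Questions 5 and 6 of this unit can be illustrated with a short sketch. It uses only the Python standard library, which is an assumption about the intended tooling; in pandas the corresponding calls would be `isna()` (alias `isnull()`) and `DataFrame.sample()`, whose default is sampling without replacement. The sample values are made up for illustration.

```python
import math
import random

# Hypothetical measurements with two missing entries encoded as NaN.
values = [3.5, float("nan"), 7.0, float("nan"), 1.2]

# Question 5: math.isnan tests one float at a time, so build a mask over the list.
nan_mask = [math.isnan(v) for v in values]
clean = [v for v in values if not math.isnan(v)]

# Question 6: random.sample draws without replacement, so no element repeats.
rng = random.Random(42)  # fixed seed only so that reruns give the same draw
population = list(range(10))
sample = rng.sample(population, k=4)

print(nan_mask)
print(clean)
print(sample)
```

Note that `nan == nan` is false, which is why a dedicated function is needed rather than an equality test.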

UNIT-4

What are some examples of stream sources?
How is a hash function used in sampling a data stream?
What is Bloom filtering?
What is the Alon-Matias-Szegedy Algorithm for second moments?
What is the Flajolet-Martin Algorithm?
What is the Datar-Gionis-Indyk-Motwani (DGIM) Algorithm?
What do you mean by the count-distinct problem?
What are decaying windows?
What are the six rules that must be followed when representing a stream by buckets?
How do you deal with infinite streams?
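
As an illustration for the Bloom filtering question above, here is a minimal sketch of a Bloom filter. The class name, the bit-array size, the number of hash functions, and the choice of salted SHA-256 as the hash family are all assumptions made for the example, not values prescribed by the course.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash functions; may give false positives, never false negatives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


allowed = BloomFilter()
for address in ["alice@example.com", "bob@example.com"]:
    allowed.add(address)

print(allowed.might_contain("alice@example.com"))  # True
```

An item that was never added usually reports `False`, but it can report `True` when all of its positions happen to be set by other items; that false-positive rate is what the analysis of Bloom filtering quantifies.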

UNIT-5

How is data mining used in sales and marketing?
How is data mining technology used to acquire new customers?
How is data mining used to personalize online retailing?
How are data mining techniques used in the government sector?
How is data mining used in manufacturing?
What do you mean by churn or attrition?
What is the use of data mining in the finance sector?
What is credit card fraud detection?
How are fraudulent activities detected in telecommunications?
What is mining of retail transaction data?

PART-B

UNIT-1
Discuss the following in detail:

i) Conventional challenges in Big Data
ii) Nature of Big Data

Define Big Data. Explain the evolution of Big Data and its characteristics.

Define data, web data, and Big Data. Also explain structured, semi-structured and unstructured
data.

Analyse how unstructured data is processed. Explain the sources of unstructured
data. What are the challenges in handling Big Data?

Write short notes on:

i) In-Memory Analytics
ii) In-Database Processing
iii) Shared-nothing architecture

Discuss the 6 V’s of Big data.

Explain the terminologies used in big data environments.

What are the key questions to be answered by all organizations stepping into analytics? Justify
with an example.

Explain the process involved in data mining. Also explain the algorithms used in the data
mining process.

UNIT-2

What are the goals of the Hadoop framework? Discuss and illustrate the Hadoop ecosystem.

Explain the following with a neat diagram:

i) Hadoop Version 1.0
ii) Hadoop Version 2.0

Explain how data is processed in Hadoop.

i) Explain the issues faced with MapReduce in Hadoop 1.0.
ii) Analyse the alternative solutions to MapReduce in Hadoop 2.0.

Write in detail about the steps involved in MapReduce to achieve high throughput.

Explain HDFS operations in detail.

Explain the significance of the Hadoop Distributed File System and its applications.

Explain how BigQuery works, with a neat sketch.

Explain how matrix multiplication is carried out using MapReduce.
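
For the matrix multiplication question, the one-pass MapReduce formulation can be sketched in plain Python. This is an in-memory simulation under stated assumptions, not Hadoop code: the function names are hypothetical, matrices are dense lists of lists, and the shuffle step is imitated with a dictionary. Each element of M and N is sent to every output cell (i, k) that needs it, and the reducer for (i, k) pairs the values on the shared index j.

```python
from collections import defaultdict


def map_phase(M, N):
    """Emit ((i, k), (matrix_tag, j, value)) for every output cell an element feeds."""
    rows, inner, cols = len(M), len(N), len(N[0])
    for i in range(rows):
        for j in range(inner):
            for k in range(cols):
                yield (i, k), ("M", j, M[i][j])
    for j in range(inner):
        for k in range(cols):
            for i in range(rows):
                yield (i, k), ("N", j, N[j][k])


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle and sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Multiply M and N values that share index j, then sum the products."""
    m = {j: v for tag, j, v in values if tag == "M"}
    n = {j: v for tag, j, v in values if tag == "N"}
    return key, sum(m[j] * n[j] for j in m)


def matmul(M, N):
    product = [[0] * len(N[0]) for _ in range(len(M))]
    for key, values in shuffle(map_phase(M, N)).items():
        (i, k), total = reduce_phase(key, values)
        product[i][k] = total
    return product


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The design trades network traffic for simplicity: every input element is replicated once per output cell it contributes to, which is why a two-pass variant is often discussed for large matrices.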

UNIT-3

Explain the steps involved in Exploratory Data analysis.

Analyse the software tools available for EDA.

Analyse the usage of numerical and categorical data. Support your answer with
relevant examples.

Analyse the ways to handle missing data. Give examples to support your views.

Explain the difference between SQL and NoSQL.

A Hive table is created as an external table at a location, say hdfs://usr/data/table_name. If we
dump a data set that matches the table structure at that location, will you be able to fetch the
records from the table using a select query?

Explain the datatypes used in MongoDB. Explain with example queries.

What is NoSQL? Explain its types and advantages.

What is Hive? Explain the Hive architecture in detail.

Explain Hive Query Language with examples.

UNIT-4

What is Sampling Data in a Stream? Explain the General Sampling Problem.

What is Filtering Streams? Explain the Analysis of Bloom Filtering.

Explain Counting Distinct Elements in a Stream.

Explain Estimating Moments. How do you deal with Infinite Streams?

Explain the concept of Counting Ones in a Window.

Explain the Flajolet-Martin Algorithm.

Explain the Alon-Matias-Szegedy Algorithm for Second Moments and Higher order moments.

Elaborate on how Google and Yahoo handle streams.

Briefly explain the concept of combining estimates.
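
The Flajolet-Martin question in this unit can be illustrated with a single-hash sketch. It is a minimal sketch under assumptions: the function names are hypothetical, a truncated SHA-256 digest stands in for the hash function, and only one hash is used. Combining estimates from many hash functions (for example, a median of group averages) is what makes the estimate usable in practice.

```python
import hashlib


def trailing_zeros(x, width=32):
    """Number of trailing zero bits in x; by convention `width` when x is 0."""
    if x == 0:
        return width
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count


def fm_estimate(stream, width=32):
    """Estimate the number of distinct elements as 2**R, R = longest tail of zeros seen."""
    max_tail = 0
    for item in stream:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        max_tail = max(max_tail, trailing_zeros(h, width))
    return 2 ** max_tail


stream = [i % 50 for i in range(500)]  # 50 distinct values, each repeated 10 times
print(fm_estimate(stream))
```

Because the same element always hashes to the same value, repeats cannot change the result, so the memory needed is one small integer regardless of stream length. A single estimate is always a power of two and can be far from the true count.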

UNIT-5

Explain how data mining techniques are used in Sales and Marketing.

Describe the data mining techniques used in finance and manufacturing sectors.

Explain the role of data mining in Government and healthcare.

Write short notes on data mining approaches used in Telecommunications.

Create a case study to evaluate data mining and data analytics for the healthcare industry.

Explain the microRNA data analysis case study in detail.

Explain the credit scoring case study in detail.

Explain data mining of non-tabular data.

Analyse how innovative insurance organizations extract value from uncertain data.

PART-C

UNIT-1

You are at the university library. You see a few students browsing through the library catalogue
on a kiosk. You observe the librarians busy at work issuing and returning books. You see a
few students fill up the feedback form on the services offered by the library. Quite a few
students are learning using the e-learning content. Think about the different types of data that are
being generated in this scenario. Support your answer with logic.

Analyse the difference between the various types of analytics.

UNIT-2

Create a MapReduce program to count the occurrences of similar words across 50 files.

Create a MapReduce program to analyse the temperature dataset.

Consider a collection of literature surveys made by a researcher in the form of a text document
with respect to cloud and big data analytics. Using Hadoop and MapReduce, develop an
application to count the occurrences of predominant words.
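
The word-count tasks in this unit share one map, shuffle, reduce pattern, which can be sketched in plain Python before writing the actual Hadoop job. This is an in-memory simulation under assumptions: the function names and the two sample documents are hypothetical, tokenisation is a simple lowercase regular expression, and a real Hadoop solution would typically be written as Mapper and Reducer classes reading the input files from HDFS.

```python
import re
from collections import defaultdict


def mapper(text):
    """Emit (word, 1) for every word in one document."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    grouped = defaultdict(list)  # stands in for the shuffle-and-sort step
    for text in documents:
        for word, one in mapper(text):
            grouped[word].append(one)
    return dict(reducer(word, counts) for word, counts in grouped.items())


docs = ["Big data needs big tools", "Cloud and big data analytics"]
counts = word_count(docs)
print(counts["big"], counts["data"])  # 3 2

# Predominant words: the most frequent entries first.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)  # [('big', 3), ('data', 2)]
```

The same skeleton extends to the 50-file case by feeding each file's contents to `mapper`; only the input reading changes.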

UNIT-3

1. Here are the counts (in thousands) of earned degrees in the U.S. for a recent year, classified by
degree type and sex of degree recipient.

          Bachelor's   Master's   Professional   Doctorate
Female    616          194        30             16
Male      529          171        44             26

Problems:

i) If you choose a degree recipient at random, what is the probability you pick a woman?
ii) If you choose a male degree recipient at random, what is the probability that you pick
someone who earned a professional degree?
iii) If you pick a degree recipient at random, what is the probability you pick a woman with a
doctorate?
iv) If you pick a Bachelor's degree recipient at random, what is the probability you pick a
man?
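
The four probabilities above follow directly from the row, column, and grand totals of the table. A minimal sketch of the arithmetic, using exact fractions; the variable names are chosen for the example, and the counts are the ones given in the table (in thousands, which cancels out in every ratio).

```python
from fractions import Fraction

counts = {
    "Female": {"Bachelor's": 616, "Master's": 194, "Professional": 30, "Doctorate": 16},
    "Male": {"Bachelor's": 529, "Master's": 171, "Professional": 44, "Doctorate": 26},
}

total = sum(sum(row.values()) for row in counts.values())  # all recipients
female_total = sum(counts["Female"].values())
male_total = sum(counts["Male"].values())
bachelors_total = counts["Female"]["Bachelor's"] + counts["Male"]["Bachelor's"]

# i) marginal probability of a woman
p_woman = Fraction(female_total, total)
# ii) conditional probability of a professional degree, given male
p_prof_given_male = Fraction(counts["Male"]["Professional"], male_total)
# iii) joint probability of woman and doctorate
p_woman_and_doctorate = Fraction(counts["Female"]["Doctorate"], total)
# iv) conditional probability of a man, given Bachelor's degree
p_man_given_bachelors = Fraction(counts["Male"]["Bachelor's"], bachelors_total)

for label, p in [("i", p_woman), ("ii", p_prof_given_male),
                 ("iii", p_woman_and_doctorate), ("iv", p_man_given_bachelors)]:
    print(label, p, round(float(p), 4))
```

Parts ii) and iv) divide by a row or column total rather than the grand total because the question restricts the population before the random pick.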

2. How a bank turned challenges into opportunities to serve its customers using NoSQL
Database. Demonstrate with architectural and database design.

3. Structure of the 'restaurants' collection:

{
  "address": {
    "building": "1007",
    "coord": [ -73.856077, 40.848447 ],
    "street": "Morris Park Ave",
    "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
    { "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
    { "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
    { "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
    { "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
    { "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}

i) Write a MongoDB query to display all the documents in the collection restaurants.
ii) Write a MongoDB query to display the fields restaurant_id, name, borough and
cuisine for all the documents in the collection restaurants.
iii) Write a MongoDB query to display the fields restaurant_id, name, borough and
cuisine, but exclude the field _id, for all the documents in the collection restaurants.
iv) Write a MongoDB query to find the restaurants that achieved a score of more than 80
but less than 100.
v) Write a MongoDB query to find the restaurant_id, name, borough and cuisine for
those restaurants whose name starts with the three letters 'Wil'.
vi) Write a MongoDB query to find the restaurant_id, name, borough and cuisine for
those restaurants whose name contains the three letters 'Reg' anywhere.
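
One possible shape for the six queries above is sketched below as Python dictionaries holding the filter and projection documents. This is a sketch under assumptions, not a model answer: it runs without a database, and the commented `db.restaurants.find(...)` call assumes a hypothetical PyMongo database handle named `db`. The same filter and projection documents can be passed to `find` in the MongoDB shell.

```python
import re

# Fields requested in parts ii), iii), v) and vi).
PROJECTION = {"restaurant_id": 1, "name": 1, "borough": 1, "cuisine": 1}

# Each entry is (filter document, projection document or None for "all fields").
queries = {
    "i": ({}, None),
    "ii": ({}, PROJECTION),
    "iii": ({}, {**PROJECTION, "_id": 0}),
    # $elemMatch requires a single grades entry to satisfy both bounds.
    "iv": ({"grades": {"$elemMatch": {"score": {"$gt": 80, "$lt": 100}}}}, None),
    "v": ({"name": {"$regex": "^Wil"}}, PROJECTION),
    "vi": ({"name": {"$regex": "Reg"}}, PROJECTION),
}

# With a live connection (hypothetical handle `db`), each query would run as:
#     for doc in db.restaurants.find(flt, proj):
#         print(doc)

# The two patterns can be checked locally against the sample document's name.
sample_name = "Morris Park Bake Shop"
print(bool(re.search("^Wil", sample_name)), bool(re.search("Reg", sample_name)))  # False False
```

For part iv), writing the condition on `grades.score` without `$elemMatch` would also match a restaurant where one grade exceeds 80 and a different grade is below 100, which is usually not what the question intends.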

4. i) Create a MongoDB instance and create a collection named shoes with the following fields:
ModelNo, Brand, Color, Price, Size (Height, Width).
ii) Update the brand name from Adidas to Puma for the very first matching record.
iii) Insert one more brand of your choice.
iv) Print all shoes which are available in blue color and width 2 cm.
v) Print all shoes which are available in either blue or neon color using the $in expression.
vi) Delete all records for the Adidas brand from this collection.
vii) Update the height for Nike shoes to 12 cm.
viii) Drop the shoes collection.
