BDA List of Experiments For Practical Exam


Experiment No. 1: Implementation of WordCount program using MapReduce

Aim: To implement a word count program using MapReduce.

Theory:
WordCount is a simple program that counts the number of occurrences of each
word in a given text input data set. WordCount fits very well with the MapReduce
programming model, making it a great example for understanding the Hadoop
MapReduce programming style. The implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver

Algorithm:

Step-1. Write a Mapper

A Mapper overrides the map() function from the class
org.apache.hadoop.mapreduce.Mapper, which provides <key, value>
pairs as the input. A Mapper implementation may output <key, value>
pairs using the provided Context.
The input value of the WordCount map task is a line of text from the input
data file, and the key is the offset of that line within the file:
<line_offset, line_of_text>. The map task outputs <word, one> for each word in the line of text.

Pseudo-code
void Map(key, value)
{
    for each word x in value:
        output.collect(x, 1);
}

Step-2. Write a Reducer


A Reducer collects the intermediate <key, value> output from multiple map
tasks and assembles a single result. Here, the WordCount program sums up
the occurrences of each word and outputs pairs of the form <word, occurrence>.

Pseudo-code
void Reduce(keyword, <list of values>)
{
    for each x in <list of values>:
        sum += x;
    final_output.collect(keyword, sum);
}

Example: for the input line "Deer Bear River Deer", the mapper emits <Deer, 1>, <Bear, 1>, <River, 1>, <Deer, 1> and the reducer outputs <Bear, 1>, <Deer, 2>, <River, 1>.

Programme:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // emit <word, 1> for every token in the line
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // sum the partial counts emitted for this word
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // configuring the input/output paths from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);

        // deleting the output path automatically from HDFS so that we don't have to delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath, true);

        // exit with status 0 if the job completed successfully, 1 otherwise
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Experiment No. 2: Implementation of matrix multiplication using MapReduce

Aim: Implementing simple algorithms in MapReduce: matrix multiplication.

Theory:
Definitions:
M is a matrix with element m_ij in row i and column j. N is a matrix with element n_jk in
row j and column k.
P is the matrix MN with element p_ik in row i and column k, where p_ik = Σ_j m_ij * n_jk.

The mapper function does not have direct access to the ranges of i, j and k (the matrix
dimensions). An extra MapReduce job has to be run initially in order to retrieve these values.

The Map Function:

For each element m_ij of M, emit the key-value pair ((i, k), (M, j, m_ij)) for k = 1, 2, ..., number
of columns of N.

For each element n_jk of N, emit the key-value pair ((i, k), (N, j, n_jk)) for i = 1, 2, ..., number
of rows of M.

The Reduce Function:

For each key (i, k), emit the key-value pair ((i, k), p_ik), where p_ik = Σ_j m_ij * n_jk.

Algorithm:
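As an illustration only, the following is a minimal plain-Python sketch (not Hadoop code) of the one-step MapReduce matrix multiplication described above. The small 2×2 matrices and the explicit grouping step that stands in for the shuffle phase are assumptions chosen for the example.

from collections import defaultdict

# Sketch: one-step MapReduce matrix multiplication (plain Python, not Hadoop).
M = [[1, 2],
     [3, 4]]            # r x s matrix
N = [[5, 6],
     [7, 8]]            # s x t matrix
r, s, t = len(M), len(N), len(N[0])

def map_phase():
    """Emit ((i, k), ('M', j, m_ij)) and ((i, k), ('N', j, n_jk)) pairs."""
    pairs = []
    for i in range(r):
        for j in range(s):
            for k in range(t):
                pairs.append(((i, k), ('M', j, M[i][j])))
    for j in range(s):
        for k in range(t):
            for i in range(r):
                pairs.append(((i, k), ('N', j, N[j][k])))
    return pairs

def reduce_phase(pairs):
    """For each key (i, k), compute p_ik = sum over j of m_ij * n_jk."""
    groups = defaultdict(list)          # stands in for the shuffle/sort phase
    for key, value in pairs:
        groups[key].append(value)
    P = [[0] * t for _ in range(r)]
    for (i, k), values in groups.items():
        m_vals = {j: v for tag, j, v in values if tag == 'M'}
        n_vals = {j: v for tag, j, v in values if tag == 'N'}
        P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals)
    return P

print(reduce_phase(map_phase()))        # prints [[19, 22], [43, 50]]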

Experiment No. 3: Implementation of k-means clustering algorithm

Aim: Implementation of k-means clustering algorithm

Theory: K-means clustering is a classical clustering algorithm that uses an
expectation-maximization-like technique to partition a number of data points into k clusters.
K-means clustering is commonly used for a number of classification applications. Because
k-means is often run on very large data sets, and because of certain characteristics of the
algorithm, it is a good candidate for parallelization.

K-Means Algorithm:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids (they need not be taken from the input dataset).

Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.

Step-7: The model is ready. (A minimal Python sketch of these steps is given below.)
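As an illustration only, the sketch below implements the steps above in plain Python; the 2-D example points, the Euclidean distance measure and the random choice of initial centroids are assumptions chosen for the example.

import random
import math

def kmeans(points, k, max_iters=100):
    """Minimal k-means sketch: returns the final centroids and clusters."""
    # Step-2: pick K random points as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step-3: assign every point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Step-4: recompute each centroid as the mean of its cluster
        new_centroids = []
        for idx, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[idx])   # keep the old centroid for an empty cluster
        # Steps 5-6: stop when no centroid moves (no further reassignment will occur)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Example with two visually separated groups of 2-D points
data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(data, k=2)
print("Centroids:", centroids)
print("Clusters:", clusters)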

Experiment No. 4: Execution of NoSQL commands in MongoDB

Aim: Execution of NoSQL commands in MongoDB

Theory:
A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a
mechanism for the storage and retrieval of data that is modeled in means other than the
tabular relations used in relational databases. NoSQL databases are increasingly used in big
data and real-time web applications.
NoSQL Database Types

1. Document databases pair each key with a complex data structure known as a
document. Documents can contain many different key-value pairs, or key-array
pairs, or even nested documents.
2. Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
3. Key-value stores are the simplest NoSQL databases. Every single item in the
database is stored as an attribute name (or "key"), together with its value. Examples
of key-value stores are Riak and Berkeley DB. Some key-value stores, such as
Redis, allow each value to have a type, such as "integer", which adds functionality.
4. Wide-column stores such as Cassandra and HBase are optimized for queries over
large datasets, and store columns of data together, instead of rows.

Example:
● To show databases: show dbs
● To insert data: db.collection_name.insert()
db.collection.insertOne(): It is used to insert a single document into the collection.

db.collection.insertMany(): It is used to insert multiple documents into the collection.

Eg: db.RecordsDB.insertOne({
    name: "Marsh",
    age: "6 years",
    species: "Dog",
    ownerAddress: "380 W. Fir Ave"
})

● To drop the current database: db.dropDatabase()


● To create collection: db.createCollection(name, options)
Eg: db.createCollection("mycollection")
● To update data: db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
● To delete data: db.collection.deleteOne()
db.collection.deleteMany()
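The commands above use the mongo shell (JavaScript) syntax. Purely as an illustration, the same CRUD operations can also be issued from Python through the pymongo driver; the local connection URI, the database name petDB and the sample update below are assumptions, not part of the original example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # assumes a MongoDB server running locally
db = client["petDB"]                                 # database name chosen only for illustration
records = db["RecordsDB"]                            # collection, as in the example above

# insert, update, read and delete one document
records.insert_one({"name": "Marsh", "age": "6 years",
                    "species": "Dog", "ownerAddress": "380 W. Fir Ave"})
records.update_one({"name": "Marsh"}, {"$set": {"age": "7 years"}})
print(records.find_one({"name": "Marsh"}))
records.delete_one({"name": "Marsh"})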
Experiment No. 5: Implementation of Bloom Filter

Aim: Implementing Bloom Filter using Map-Reduce.

Theory:
Bloom filtering is similar to generic filtering in that it looks at each record and decides
whether to keep or remove it. However, there are two major differences that set it apart
from generic filtering. First, we want to filter each record based on some sort of
set-membership operation against the hot values; for example, keep or throw away this record
if the value in the user field is a member of a predetermined list of users. Second, the set
membership is evaluated with a Bloom filter.

Formula for the optimal size of a Bloom filter

OptimalBloomFilterSize (m) = -(number of members in the set × ln(desired false-positive rate)) / (ln 2)²

Formula to get the OptimalK (number of hash functions)

OptimalK (k) = (OptimalBloomFilterSize × ln 2) / number of members in the set
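As a quick sanity check of these two formulas (interpreting the logarithm as the natural logarithm, which is an assumption here but matches the output of the program below), for n = 20 members and a desired false-positive rate p = 0.05:

import math

n, p = 20, 0.05                                   # members in the set, desired false-positive rate
m = int(-n * math.log(p) / (math.log(2) ** 2))    # OptimalBloomFilterSize in bits
k = int(m * math.log(2) / n)                      # OptimalK, the number of hash functions
print(m, k)                                       # 124 4, matching the program output below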
Bloom filters can assist expensive operations by eliminating unnecessary ones.

For example, a Bloom filter can be trained beforehand with the IDs of all users that have a
salary of more than x, and can then be used as an initial test before querying the database to
retrieve more information about each employee. By eliminating unnecessary queries, we can
speed up processing time.

Programme:

# Bloom Filter demo
# BloomFilter is assumed to come from a local bloomfilter.py module (for example the
# widely used bitarray/mmh3-based implementation) that provides add(), check() and the
# size, fp_prob and hash_count attributes used below.
from bloomfilter import BloomFilter
from random import shuffle

n = 20      # number of items expected to be stored
p = 0.05    # desired false-positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:{}".format(bloomf.size))
print("False positive Probability:{}".format(bloomf.fp_prob))
print("Number of hash functions:{}".format(bloomf.hash_count))

# words to be added to the filter
word_present = ['abound','abounds','abundance','abundant','accessible',
                'bloom','blossom','bolster','bonny','bonus','bonuses',
                'coherent','cohesive','colorful','comely','comfort',
                'gems','generosity','generous','generously','genial']

# words that were never added to the filter
word_absent = ['bluff','cheater','hate','war','humanity',
               'racism','hurt','nuke','gloomy','facebook',
               'geeksforgeeks','twitter']

for item in word_present:
    bloomf.add(item)

shuffle(word_present)
shuffle(word_absent)

test_words = word_present[:10] + word_absent
shuffle(test_words)

for word in test_words:
    if bloomf.check(word):
        if word in word_absent:
            print("'{}' is a false positive!".format(word))
        else:
            print("'{}' is probably present!".format(word))
    else:
        print("'{}' is definitely not present!".format(word))

OUTPUT:
Size of bit array:124
False positive Probability:0.05
Number of hash functions:4
'war' is definitely not present!
'gloomy' is definitely not present!
'humanity' is definitely not present!
'abundant' is probably present!
'bloom' is probably present!
'coherent' is probably present!
'cohesive' is probably present!
'bluff' is definitely not present!
'bolster' is probably present!
'hate' is definitely not present!
'racism' is definitely not present!
'bonus' is probably present!
Experiment No. 6: Implementation of DGIM algorithm

Aim: Implementing the DGIM algorithm using any programming language

Theory:
The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm. To begin, each bit of the stream has
a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has
timestamp 2, and so on. Since we only need to distinguish positions within the window of
length N, we shall represent timestamps modulo N, so they can be represented by log₂ N
bits. If we also store the total number of bits ever seen in the stream (i.e.,
the most recent timestamp) modulo N, then we can determine from a timestamp modulo N
where in the current window the bit with that timestamp is.

We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1s in the bucket. This number must be a power of 2, and we refer to the
number of 1s as the size of the bucket.

To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its
right end. To represent the number of 1s we only need log₂ log₂ N bits. The reason is that
we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary.
Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to represent
a bucket. There are six rules that must be followed when representing a stream by buckets:

1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All bucket sizes must be powers of 2.
6. Buckets cannot decrease in size as we move to the left (back in time).

[Figure 7: A bit-stream divided into buckets following the DGIM rules: two buckets of size 4, two of size 2, and two of size 1.]


Programme:

class DGIM:
    def __init__(self, window_size):
        self.window_size = window_size
        self.buckets = [[] for _ in range(window_size)]

    def update(self, bit):
        # append the new bit to the first bucket and trim it to the window size
        self.buckets[0].append(bit)
        if len(self.buckets[0]) > self.window_size:
            self.buckets[0].pop(0)

        # merge neighbouring buckets whenever they hold the same number of bits
        # (a simplified stand-in for the DGIM merge rule)
        i = 0
        while i < len(self.buckets) - 1 and len(self.buckets[i]) == len(self.buckets[i + 1]):
            self.buckets[i] += self.buckets[i + 1]
            self.buckets.pop(i + 1)
            i += 1

        # keep the last bucket within the window size as well
        if len(self.buckets[-1]) > self.window_size:
            self.buckets[-1].pop(0)

    def count_ones(self, k):
        # estimate the number of 1s by weighting each bucket's count by 2^i
        # (k is not used in this simplified implementation)
        count = 0
        for i in range(len(self.buckets)):
            count += self.buckets[i].count(1) * (2 ** i)
        return count


if __name__ == "__main__":
    window_size = 16
    dgim = DGIM(window_size)

    stream = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]

    for bit in stream:
        dgim.update(bit)

    k = 4
    count = dgim.count_ones(k)
    print(f"Estimated count of 1s in the last {k} buckets: {count}")

Output:

Estimated count of 1s in the last 4 buckets: 9

Experiment No. 7: Implementation of the Flajolet-Martin algorithm

Aim: Implementation of the Flajolet-Martin algorithm

Theory:

The Flajolet-Martin algorithm is a probabilistic algorithm that is mainly used to count the
number of distinct elements in a stream or database. The algorithm was invented by Philippe
Flajolet and G. Nigel Martin in 1983 and has since been used in various applications such as
data mining and database management.

The basic idea the Flajolet-Martin algorithm is based on is to use a hash function to map the
elements of the given dataset to binary strings, and to use the length of the longest run of
trailing zeros in those binary strings as an estimator of the number of distinct elements.

The steps of the Flajolet-Martin algorithm are:

1. Choose a hash function that maps the elements of the dataset to fixed-length binary
strings. The length of the binary string can be chosen based on the desired accuracy.

2. Apply the hash function to each data item in the dataset to get its binary string
representation.

3. Determine the number of trailing zeros (the position of the rightmost set bit) in each binary string.

4. Compute the maximum number of trailing zeros over all binary strings.

5. Estimate the number of distinct elements in the dataset as 2 raised to the maximum
number of trailing zeros calculated in the previous step.

The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings
and the number of hash functions used. Generally, increasing the length of the binary
strings or using more hash functions increases the algorithm's accuracy.

Programme:

import random

def trailing_zeros(x):
    """Count the number of trailing zeros
    in the binary representation of x."""
    if x == 0:
        return 1
    count = 0
    while x & 1 == 0:
        count += 1
        x >>= 1
    return count

def flajolet_martin(dataset, k):
    """Estimate the number of distinct elements using
    the Flajolet-Martin Algorithm (k sampling passes);
    the element value itself plays the role of the hash value here."""
    max_zeros = 0
    for _ in range(k):
        hash_vals = [trailing_zeros(random.choice(dataset))
                     for _ in range(len(dataset))]
        max_zeros = max(max_zeros, max(hash_vals))
    # the estimate is 2 raised to the maximum number of trailing zeros seen
    return 2 ** max_zeros

# Example
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
dist_num = flajolet_martin(dataset, 10)
print("Estimated number of distinct elements:", dist_num)

Output:

Estimated number of distinct elements: 8

Experiment No. 8: Data Analysis using R (any one graph; assume suitable data)

1. Pie Chart

# Create data for the graph.


x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.


png(file = "city.png")

# Plot the chart.


pie(x,labels)
# Save the file.
dev.off()

2. Barchart

# Create the data for the chart


H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")

# Give the chart file a name


png(file = "barchart_months_revenue.png")

# Plot the bar chart


barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")

# Save the file


dev.off()
3. Histogram

# Create data for the graph.


v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.


png(file = "histogram.png")

# Create the histogram.


hist(v,xlab = "Weight",col = "yellow",border = "blue")

# Save the file.


dev.off()
4. Linegraph

# Create the data for the chart.


v <- c(7,12,28,3,41)

# Give the chart file a name.


png(file = "line_chart.jpg")

# Plot the bar chart.


plot(v,type = "o")

# Save the file.


dev.off()
Experiment No. 9: Statistical Analysis using R

Aim: Statistical analysis using R: to calculate the mean and median of a vector

Theory:

Mean: It is calculated by taking the sum of the values and dividing by the number of
values in a data series.

The function mean() is used to calculate this in R.

Median: The middle-most value in a data series is called the median. The median() function
is used in R to calculate this value.

1. To calculate Mean
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)

Output:
8.22

2. To calculate Median

# Create the vector.


x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.


median.result <- median(x)
print(median.result)

Output:
5.6
