BDA List of Experiments For Practical Exam
Experiment No. 1: Implementation of WordCount program using MapReduce
Theory:
WordCount is a simple program that counts the number of occurrences of each
word in a given text input data set. WordCount fits the MapReduce programming
model very well, making it a good example for understanding the Hadoop
MapReduce programming style. The implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
Algorithm:
Pseudo-code (Mapper):
void Map (key, value) {
    for each word x in value:
        output.collect(x, 1);
}
Pseudo-code (Reducer):
void Reduce (keyword, <list of values>) {
    sum = 0;
    for each x in <list of values>:
        sum += x;
    final_output.collect(keyword, sum);
}
Example:
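For instance, an input line such as "Deer Bear River Car Car River Deer Car Bear" would make the mapper emit the pairs (Deer, 1), (Bear, 1), (River, 1), (Car, 1), (Car, 1), (River, 1), (Deer, 1), (Car, 1), (Bear, 1); after the shuffle phase groups the values by word, the reducer produces the final counts Bear 2, Car 3, Deer 2, River 2.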
Program:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values)
                sum += x.get();
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the MapReduce job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Configuring the input/output path from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // Delete the output path if it already exists so the job does not have to
        // delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Experiment No. 2: Implementation of Matrix Multiplication using MapReduce
Theory:
Definitions:
M is a matrix with element m_ij in row i and column j. N is a matrix with element n_jk in row j and column k.
P = MN is a matrix with element p_ik in row i and column k, where p_ik = Σ_j m_ij * n_jk.
The Map function does not have access to the i, j, and k ranges directly; an extra MapReduce job has to be run initially in order to retrieve the matrix dimensions (the number of rows of M and the number of columns of N).
Map:
For each element m_ij of M, emit a key-value pair ((i, k), (M, j, m_ij)) for k = 1, 2, ..., number of columns of N.
For each element n_jk of N, emit a key-value pair ((i, k), (N, j, n_jk)) for i = 1, 2, ..., number of rows of M.
Reduce:
For each key (i, k), emit the key-value pair ((i, k), p_ik), where p_ik = Σ_j m_ij * n_jk.
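A minimal Python sketch of these Map and Reduce steps, simulating the MapReduce shuffle with an in-memory dictionary instead of running on Hadoop (the 2x2 example matrices are assumed purely for illustration):

from collections import defaultdict

# Example matrices, stored as {(row, col): value}; assumed for illustration
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
rows_M, cols_N = 2, 2

# Map: emit ((i, k), (tag, j, value)) pairs as described above
intermediate = defaultdict(list)
for (i, j), m_ij in M.items():
    for k in range(cols_N):
        intermediate[(i, k)].append(("M", j, m_ij))
for (j, k), n_jk in N.items():
    for i in range(rows_M):
        intermediate[(i, k)].append(("N", j, n_jk))

# Reduce: for each key (i, k), join the M and N values on j and sum m_ij * n_jk
P = {}
for (i, k), values in intermediate.items():
    m_vals = {j: v for tag, j, v in values if tag == "M"}
    n_vals = {j: v for tag, j, v in values if tag == "N"}
    P[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)

print(P)  # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}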
Experiment No. 3: Implementation of K-Means clustering
Algorithm:
K-Means Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid; these assignments form the K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster, i.e., recompute each centroid as the mean of the points assigned to it.
Step-5: Repeat Step-3, i.e., reassign each data point to its new closest centroid. If any assignment changed, go back to Step-4; otherwise the clusters are final.
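A minimal Python sketch of the K-Means steps above, using Euclidean distance on 2-D points (the sample data points, K = 2, and the iteration limit are assumptions made only for illustration):

import random
import math

def kmeans(points, k, max_iters=100):
    # Steps 1-2: K is given; pick K random points as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 3: assign each data point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = []
        for c, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(centroids[c])
        # Step 5: stop when no centroid moves any more
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Example usage with assumed sample points
data = [(1, 2), (1, 4), (2, 3), (8, 8), (9, 10), (10, 9)]
centroids, clusters = kmeans(data, k=2)
print("Centroids:", centroids)
print("Clusters:", clusters)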
Experiment No. 4: Implementation of NoSQL database (MongoDB) operations
Theory:
A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a
mechanism for storage and retrieval of data that is modeled by means other than the
tabular relations used in relational databases. NoSQL databases are increasingly used in big
data and real-time web applications.
NoSQL Database Types
1. Document databases pair each key with a complex data structure known as a
document. Documents can contain many different key-value pairs, or key-array
pairs, or even nested documents.
2. Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
3. Key-value stores are the simplest NoSQL databases. Every single item in the
database is stored as an attribute name (or "key"), together with its value. Examples
of key-value stores are Riak and Berkeley DB. Some key-value stores, such as
Redis, allow each value to have a type, such as "integer", which adds functionality.
4. Wide-column stores such as Cassandra and HBase are optimized for queries over
large datasets, and store columns of data together, instead of rows.
Example:
● To show databases: show dbs
● To insert data: db.collection_name.insert({ })
db.collection.insertOne(): It is used to insert a single document into the collection.
E.g.:
db.RecordsDB.insertOne({
    name: "Marsh",
    age: "6 years",
    species: "Dog",
    ownerAddress: "380 W. Fir Ave"
})
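The same insert can also be done from Python with the pymongo driver. A small sketch, assuming the pymongo package is installed and a MongoDB server is running locally on the default port (the database name test is assumed only for illustration):

from pymongo import MongoClient

# Connect to a local MongoDB server (assumed to run on the default port 27017)
client = MongoClient("mongodb://localhost:27017/")
db = client["test"]            # database name assumed for illustration
records = db["RecordsDB"]      # same collection name as in the shell example

# Insert a single document, mirroring db.RecordsDB.insertOne(...)
result = records.insert_one({
    "name": "Marsh",
    "age": "6 years",
    "species": "Dog",
    "ownerAddress": "380 W. Fir Ave",
})
print("Inserted document id:", result.inserted_id)

# List database names, analogous to "show dbs" in the mongo shell
print(client.list_database_names())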
Experiment No. 5: Implementation of Bloom Filter
Theory:
Bloom filtering is similar to generic filtering in that it looks at each record and decides
whether to keep or remove it. However, there are two major differences that set it apart
from generic filtering. First, we want to filter the record based on some sort of set-membership
operation against the hot values: for example, keep or throw away this record
if the value in the user field is a member of a predetermined list of users. Second, the set
membership is evaluated with a Bloom filter.
For example, a Bloom filter can be trained beforehand with the IDs of all users that have a
salary of more than x, and then used as an initial membership test before querying the
database to retrieve more information about each employee. By eliminating unnecessary
queries, we can speed up processing time.
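The program below imports a BloomFilter class from a local bloomfilter module. A minimal sketch of such a class, assuming the mmh3 and bitarray packages are installed, could look like this:

import math
import mmh3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, items_count, fp_prob):
        self.fp_prob = fp_prob
        # m = -(n * ln p) / (ln 2)^2 bits give the desired false-positive probability
        self.size = self.get_size(items_count, fp_prob)
        # k = (m / n) * ln 2 hash functions minimise the false-positive probability
        self.hash_count = self.get_hash_count(self.size, items_count)
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)

    def add(self, item):
        # Set the bit selected by each of the k hash functions
        for seed in range(self.hash_count):
            self.bit_array[mmh3.hash(item, seed) % self.size] = True

    def check(self, item):
        # If any of the k bits is 0, the item is definitely not present
        for seed in range(self.hash_count):
            if not self.bit_array[mmh3.hash(item, seed) % self.size]:
                return False
        return True

    @staticmethod
    def get_size(n, p):
        return int(-(n * math.log(p)) / (math.log(2) ** 2))

    @staticmethod
    def get_hash_count(m, n):
        return int((m / n) * math.log(2))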
Program:
# Bloom Filter
from bloomfilter import BloomFilter
from random import shuffle

n = 20      # number of items to add
p = 0.05    # desired false positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:{}".format(bloomf.size))
print("False positive Probability:{}".format(bloomf.fp_prob))
print("Number of hash functions:{}".format(bloomf.hash_count))

# words to be added to the filter
word_present = ['abound','abounds','abundance','abundant','accessible',
                'bloom','blossom','bolster','bonny','bonus','bonuses',
                'coherent','cohesive','colorful','comely','comfort',
                'gems','generosity','generous','generously','genial']

# words that are never added
word_absent = ['bluff','cheater','hate','war','humanity',
               'racism','hurt','nuke','gloomy','facebook',
               'geeksforgeeks','twitter']

for item in word_present:
    bloomf.add(item)

shuffle(word_present)
shuffle(word_absent)

test_words = word_present[:10] + word_absent
shuffle(test_words)

for word in test_words:
    if bloomf.check(word):
        if word in word_absent:
            print("'{}' is a false positive!".format(word))
        else:
            print("'{}' is probably present!".format(word))
    else:
        print("'{}' is definitely not present!".format(word))
OUTPUT:
Size of bit array:124
False positive Probability:0.05
Number of hash functions:4
'war' is definitely not present!
'gloomy' is definitely not present!
'humanity' is definitely not present!
'abundant' is probably present!
'bloom' is probably present!
'coherent' is probably present!
'cohesive' is probably present!
'bluff' is definitely not present!
'bolster' is probably present!
'hate' is definitely not present!
'racism' is definitely not present!
'bonus' is probably present!
Experiment No. 6: Implementation of DGIM algorithm
Theory:
The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
To begin, each bit of the stream has
a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has
timestamp 2, and so on. Since we only need to distinguish positions within the window of
length N, we shall represent timestamps modulo N, so they can be represented by log2 N
bits. If we also store the total number of bits ever seen in the stream (i.e.,
the most recent timestamp) modulo N, then we can determine from a timestamp modulo N
where in the current window the bit with that timestamp is.
We divide the window into buckets consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1's in the bucket. This number must be a power of 2, and we refer to the
number of 1's as the size of the bucket.
To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its
right end. To represent the number of 1's we only need log2 log2 N bits. The reason is that
we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary.
Since j is at most log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent
a bucket.
There are six rules that must be followed when representing a stream by buckets:
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All bucket sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left (back in time).
For example, a stream and one possible division of its recent bits into buckets:
..1011011000101110110010110
..101101 10001011 101 1001 01 10
class DGIM:
    def __init__(self, window_size):
        self.window_size = window_size
        # Newest bucket first; adjacent buckets of equal size are merged,
        # giving power-of-two bucket sizes as in DGIM (simplified: the
        # buckets store the actual bits instead of timestamps and sizes).
        self.buckets = []

    def add_bit(self, bit):
        # Each incoming bit starts a new bucket at the front of the window
        self.buckets.insert(0, [bit])
        i = 0
        # Merge adjacent buckets whenever two consecutive buckets have equal size
        while i < len(self.buckets) - 1 and len(self.buckets[i]) == len(self.buckets[i + 1]):
            self.buckets[i] += self.buckets[i + 1]
            self.buckets.pop(i + 1)
            i += 1

    def count_ones(self, k):
        # Estimate the number of 1s contained in the k most recent buckets
        return sum(bucket.count(1) for bucket in self.buckets[:k])

if __name__ == "__main__":
    window_size = 16
    dgim = DGIM(window_size)
    # Sample bit stream, assumed for illustration
    for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]:
        dgim.add_bit(bit)
    k = 4
    count = dgim.count_ones(k)
    print(f"Estimated count of 1s in the last {k} buckets: {count}")
Output:
Experiment No. 7: Implementation of Flajolet-Martin algorithm
Theory:
The basic idea behind the Flajolet-Martin algorithm is to use a hash function to map the
elements of the given dataset to binary strings, and to use the length of the longest run of
trailing zeros observed in these binary strings as an estimator of the number of distinct
elements.
The first step is to choose a hash function that maps the elements of the dataset to
fixed-length binary strings. The length of the binary string can be chosen based on the
desired accuracy.
The next step is to apply the hash function to each data item in the dataset to get its binary
string representation.
The next step is to determine, for each binary string, the number of trailing zeros (i.e., the
position of the rightmost set bit).
We then compute the maximum number of trailing zeros, R, over all binary strings.
Now we estimate the number of distinct elements in the dataset as 2^R, i.e., 2 raised to the
maximum number of trailing zeros found in the previous step. For example, if the largest
number of trailing zeros observed is R = 5, the estimate is 2^5 = 32 distinct elements.
The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary
strings and the number of hash functions it uses. Increasing the length of the binary strings
or using more hash functions generally increases the algorithm's accuracy.
Program:
import random

def trailing_zeros(x):
    # Number of trailing zero bits in the binary representation of x
    if x == 0:
        return 1
    count = 0
    while x & 1 == 0:
        count += 1
        x >>= 1
    return count

def flajolet_martin(dataset, k=10):
    # Estimate the number of distinct elements using k random hash functions;
    # each trial keeps the largest number of trailing zeros R observed over
    # the dataset, and the per-trial estimates 2**R are averaged.
    estimates = []
    for _ in range(k):
        a = random.randrange(1, 2 ** 31, 2)   # random odd multiplier
        b = random.randrange(0, 2 ** 31)
        max_zeros = 0
        for item in dataset:
            h = (a * hash(item) + b) & 0x7FFFFFFF
            max_zeros = max(max_zeros, trailing_zeros(h))
        estimates.append(2 ** max_zeros)
    return sum(estimates) / len(estimates)

# Example
dataset = [1, 2, 3, 2, 1, 4, 5, 4, 6, 7, 8, 3]
print("Estimated number of distinct elements:", flajolet_martin(dataset))
Output:
Experiment No. 8: Data visualization and statistical analysis using R
1. Pie chart
2. Bar chart
Theory:
Mean: It is calculated by taking the sum of the values and dividing by the number of
values in the data series. The mean() function is used in R to calculate this value.
Median: The middle-most value in a data series is called the median. The median() function
is used in R to calculate this value.
1. To calculate Mean
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
Output:
8.22
2. To calculate Median
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Median.
result.median <- median(x)
print(result.median)
Output:
5.6