Case Study
ABC.com is a website where you can watch original movie DVDs. It currently maintains
the list of visitors and the details of their visits. The website gets almost 1 billion visitors
every day, and at midnight it processes all of the information. It takes almost 5 hours to
process all the information, and the system remains down for that long, which causes the
company a huge loss. The company decided to buy a supercomputer for faster
analysis. The supercomputer has 10 processors. The need now is to design a parallel
algorithm for the following problems:
We now have the list of visitors for the day and the number of movies they watched.
Question 1: Design a parallel algorithm that would sort the names alphabetically.
Question 2: Write a parallel search algorithm that would find a visitor "John" in this
sorted list and show how many movies he watched.
The degree of increase in computational speed of a parallel algorithm over a
corresponding sequential algorithm is called speedup, and it is expressed as the ratio of
T(sequential) to T(parallel).
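As a worked instance (a sketch only: the 0.5-hour parallel time below is a hypothetical
figure, since the case study gives only the 5-hour sequential time):

\[
S \;=\; \frac{T_{\text{sequential}}}{T_{\text{parallel}}}
  \;=\; \frac{5\ \text{hours}}{0.5\ \text{hours}}
  \;=\; 10 \;=\; p,
\]

which on the company's p = 10 processors would be exactly linear speedup.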
If this ratio exceeds p, where p is the number of processors (cores) used,
superlinear speedup takes place. The most common reason for it is the cache effect: the
total cache size in a multiprocessor system is larger, which increases the effective
data-transfer rate between RAM and CPU; this is crucial when working with
large data sets.
Traditional parallel computer performance evaluation fixed the problem size and varied
the number of processors, the so-called fixed-size model. In the mid-1980s the scaled-size
model was developed and subsequently substantiated by experiments on a 1024-
processor hypercube. The scaled-size model specifies that the storage complexity
grows in proportion to the number of processors. A third model is the fixed-time model,
in which the problem is scaled to take a constant time as processors are added; it is
rarely used in real-world applications. The algorithm described here is optimized for the
fixed-size model. It is a modification of the Quicksort algorithm by C. A. R. Hoare (1962),
adapted for a system with several processors (or cores).
In the first step, the original data set is viewed as blocks of twice the size of the L1
cache (which is typically 32 or 64 kB). The processor with the smallest PID chooses the
pivot element. Then all processors in parallel invoke a "neutralization" function on the
leftmost and the rightmost remaining blocks, swapping elements according to their
relation to the pivot, which leaves at most P+1 blocks still to be sorted. After that, the
remaining non-neutralized blocks are swapped with neutralized ones and sorted
sequentially, as sketched below.
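The following C++ sketch illustrates one possible shape of such a neutralization step
(Block, Side and neutralize are illustrative names, not taken from the case study): one
block from the left half and one from the right half are scanned, misplaced elements are
swapped across the pivot, and at least one of the two blocks ends up fully "neutralized".

#include <algorithm>

// A minimal sketch of the "neutralization" step, assuming two blocks
// and an already chosen pivot.
struct Block { int* data; int size; };
enum class Side { Left, Right, Both };

// Scan one block from the left half and one from the right half,
// swapping misplaced elements; a block is "neutralized" once every
// element in it lies on the correct side of the pivot.
Side neutralize(Block left, Block right, int pivot) {
    int i = 0, j = 0;
    while (i < left.size && j < right.size) {
        while (i < left.size && left.data[i] <= pivot) ++i;   // already correct
        while (j < right.size && right.data[j] > pivot) ++j;  // already correct
        if (i >= left.size || j >= right.size) break;
        std::swap(left.data[i++], right.data[j++]);           // fix a misplaced pair
    }
    if (i >= left.size && j >= right.size) return Side::Both;
    return (i >= left.size) ? Side::Left : Side::Right;
}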
The next step is to split the given data set at the pivot point and assign processors
to each half in proportion to its size. A stack is used to keep track of the state of the
sorting algorithm, and the sequential steps of the recursion are turned into PUSH and POP
operations on this stack. Whenever a processor encounters a subarray small enough to
fit in its cache, it sorts it with insertion sort instead of PUSHing it onto the stack.
When a processor finishes its own job, it begins helping other processors by POPping
untaken (yet unsorted) subarrays from their stacks; a single-processor sketch of this
stack-driven phase follows.
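A minimal single-processor sketch in C++, assuming a Hoare-style partition and an
illustrative CACHE_ELEMS constant standing in for "fits in cache"; the work-stealing
part (other processors POPping from this stack) is omitted:

#include <algorithm>
#include <stack>
#include <utility>
#include <vector>

constexpr int CACHE_ELEMS = 8192; // illustrative assumption for "fits in cache"

void insertionSort(std::vector<int>& a, int lo, int hi) {
    for (int k = lo + 1; k <= hi; ++k) {
        int x = a[k], m = k - 1;
        while (m >= lo && a[m] > x) { a[m + 1] = a[m]; --m; }
        a[m + 1] = x;
    }
}

void stackQuicksort(std::vector<int>& a) {
    std::stack<std::pair<int, int>> work;          // recursion turned into PUSH/POP
    if (a.size() < 2) return;
    work.push({0, (int)a.size() - 1});
    while (!work.empty()) {
        auto [lo, hi] = work.top(); work.pop();
        if (hi - lo + 1 <= CACHE_ELEMS) {          // small subarray fits in cache:
            insertionSort(a, lo, hi);              // sort it directly, no PUSH
            continue;
        }
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;                        // Hoare-style partition
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        if (lo < j) work.push({lo, j});            // PUSH both halves
        if (i < hi) work.push({i, hi});
    }
}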
Such optimization brings the average time of the partition phase to O(N/P)
for N >> B, where N is the number of elements, B the number of elements in one block,
and P the number of processors; the sorting phase yields a speedup of O(P), since at
this stage all processors are largely independent from one another and no
synchronization is required. This brings the total speedup to T(s)/T(p) = P, i.e. linear
speedup.
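In symbols (a rough sketch that treats constants loosely and takes the sequential cost
as the standard O(N log N) of Quicksort):

\[
S \;=\; \frac{T_s}{T_p}
  \;=\; \frac{O(N \log N)}{O(N/P) + O\!\left(\tfrac{N \log N}{P}\right)}
  \;\approx\; \frac{O(N \log N)}{O\!\left(\tfrac{N \log N}{P}\right)}
  \;=\; P,
\]

since for large N the sorting term dominates the partition term.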
In addition, the reduced memory-access time due to the cache effect further decreases
overhead and can yield superlinear speedup.
Ans 1
Merge sort first divides the unsorted list into the smallest possible sub-lists, compares
each with its adjacent sub-list, and merges them in sorted order. It lends itself to
parallelism very naturally, since it follows the divide-and-conquer approach, as the
pseudocode and the C++ sketch below show.
procedure parallelmergesort(id, n, data, newdata)
begin
   data = sequentialmergesort(data)
   for dim = 1 to n
      data = parallelmerge(id, dim, data)
   endfor
   newdata = data
end
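A runnable C++ sketch of the same idea for Question 1, assuming the 10-processor
machine from the case study (NUM_PROCS and parallelMergeSort are illustrative names):
each thread sorts one chunk of the names, and the sorted chunks are then merged
pairwise until one sorted run remains.

#include <algorithm>
#include <string>
#include <thread>
#include <vector>

constexpr size_t NUM_PROCS = 10; // processors available in the case study

void parallelMergeSort(std::vector<std::string>& names) {
    const size_t n = names.size();
    if (n < 2) return;
    const size_t chunks = std::min(NUM_PROCS, n);

    std::vector<size_t> bounds;                     // chunk boundaries
    for (size_t c = 0; c <= chunks; ++c) bounds.push_back(c * n / chunks);

    std::vector<std::thread> pool;                  // sort the chunks in parallel
    for (size_t c = 0; c < chunks; ++c)
        pool.emplace_back([&, c] {
            std::sort(names.begin() + bounds[c], names.begin() + bounds[c + 1]);
        });
    for (auto& t : pool) t.join();

    // Merge sorted chunks pairwise until a single sorted run remains.
    for (size_t width = 1; width < chunks; width *= 2)
        for (size_t c = 0; c + width < chunks; c += 2 * width)
            std::inplace_merge(names.begin() + bounds[c],
                               names.begin() + bounds[c + width],
                               names.begin() + bounds[std::min(c + 2 * width, chunks)]);
}

Calling parallelMergeSort on the day's visitor list leaves the names sorted
alphabetically (note that std::string compares by byte order; locale-aware collation
would need extra work).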
Ans 2
In the conventional sequential BFS algorithm, two data structures are created to store
the frontier and the next frontier. The frontier contains the vertices that have the same
distance (also called the "level") from the source vertex; these vertices need to be
explored in BFS. Every neighbor of these vertices is checked, and the neighbors that
have not been explored yet are discovered and put into the next frontier. At
the beginning of the BFS algorithm, a given source vertex s is the only vertex in the
frontier. All direct neighbors of s are visited in the first step, and they form the next
frontier. After each layer traversal, the "next frontier" becomes the frontier, and newly
discovered vertices are stored in the new next frontier. The following pseudocode
outlines the idea, in which the data structures for the frontier and the next frontier are
called FS and NS respectively.
define bfs_sequential(graph(V, E), source s):
   for all v in V do
      d[v] = -1;
   d[s] = 0; level = 1; FS = {}; NS = {};
   push(s, FS);
   while FS !empty do
      for u in FS do
         for each neighbour v of u do
            if d[v] = -1 then
               push(v, NS);
               d[v] = level;
      FS = NS, NS = {}, level = level + 1;
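A compact C++ translation of this FS/NS scheme, assuming an adjacency-list graph
(bfsLevels is an illustrative name); parallelizing the loop over FS, e.g. one chunk of
the frontier per processor, is the natural next step but is omitted here:

#include <utility>
#include <vector>

// Returns the BFS level of every vertex, or -1 if unreachable from s.
std::vector<int> bfsLevels(const std::vector<std::vector<int>>& adj, int s) {
    std::vector<int> d(adj.size(), -1);   // -1 marks "not yet discovered"
    std::vector<int> FS{s}, NS;           // frontier and next frontier
    d[s] = 0;
    int level = 1;
    while (!FS.empty()) {
        NS.clear();
        for (int u : FS)                  // explore the whole frontier
            for (int v : adj[u])          // check every neighbour
                if (d[v] == -1) {         // newly discovered vertex
                    d[v] = level;
                    NS.push_back(v);
                }
        std::swap(FS, NS);                // next frontier becomes frontier
        ++level;
    }
    return d;
}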