CS462 Project Report
Name: Samuel Day

1. The Nature of the Project

My project is a slight modification of a pre-approved project idea, and is meant to measure statistical performance on the Newton supercluster: while there is a large number of different suites designed to aid programmers in parallelizing their code, there has not been much statistical analysis of the speed advantages that particular types of parallelization provide, or of whether there is a noticeable difference between them. Below is my original project proposal, which has been deprecated (too complicated to execute in the time-frame of this project) and should be ignored:
"My project proposal is to analyze the effects of intercommunication and speed between the cores of a multiprocessor application through the use of Beaglebone Black microcomputers. I will do this by creating a program for them that simulates a musicians' live jamming setup, where each core is set to play a single instrument alongside the other cores. The general algorithm I will be using is as follows: 1. A host computer will serve to connect the Beaglebone Blacks to a switch to communicate messages. 2. The host will establish each Beaglebone's instrument. One of the Beaglebones will be guaranteed to be a drum kit. In addition, one of the non-drum instruments will be a "lead" instrument, from which the other instruments will learn the key of the song. 3. The drum kit will begin to play a drum pattern; the lead instrument will listen to this pattern, detect the tempo, then begin to play songs in the chosen key; the other instruments will also detect the tempo from the drum pattern, but their key will be chosen by whatever the lead instrument plays. 4. The drummer can change tempo and the lead instrument can change pitch, and it will be up to the other instruments to analyze this and "react". This will primarily be done on the Beaglebone Blacks using C/C++ threading. I will be using the STK and aubio open source libraries to generate and detect sound, respectively. Lastly, I will probably have to borrow a few Beaglebones to run all of the tests for the project write-up."
2. Reasons Why the Project is Important or Worthwhile

There are plenty of reasons why this project is important to the overall study of parallel computing. For one, benchmarking these varying methods of parallel computing provides valuable statistical information for studying the benefits of parallel computing from an academic standpoint; it is one thing to theorize about the benefits of parallelizing code, but it is another to collect precise timing data from controlled tests and see those benefits directly. Secondly, it is important because there is an increasingly wide range of software suites, plugins, and methods for parallelizing code, and it is a worthy endeavor to analyze not only which is best in terms of speed, but also which is best in terms of usability. As scientists, mathematicians, and other users from non-computing fields become increasingly responsible for writing and maintaining their own code, they may not be familiar with the variety of techniques and paradigms involved in parallelizing code, such as dealing with deadlock, race conditions, and so on, so finding a middle ground
between functionality, usability, and speed is a practice worth investigating. Last, if nothing else, hopefully my implementations of the varying techniques will prove useful for others who are curious about parallelization aids but unsure how to use them.

3. Related Work

This project is intrinsically tied to the various implementations of the particles code that I have written for earlier assignments throughout the semester. Though this may seem somewhat lazy, and it means the timing data collected could be more open to variance because the problem has no fixed running time, I felt it was a relevant choice for two specific reasons. First, I know the initial code works. Having done a number of assignments with it, I am aware of its timing intricacies and can recognize faulty data more readily. Alternate implementations of the parallel code could produce a variety of results depending on the methods used in each program; by limiting all of my tests to a program with a known output, interpreting that output becomes much easier (and more accurate). Second, the nature of the program means there is a fair amount of variability in the results. This is good because execution time is almost never a given in any application, and it helps ensure that while the code is executing, each processor is performing actual work rather than simply sleeping for a number of seconds.

4. Key Steps I Took to Implement My Parallel Code

There were a number of modifications I had to make to run these tests. First, I had to obtain the particles code; this was simple, as I was able to re-use a lot of my code from this semester. Second, I had to modify some of the code so that it would output results in a fashion manageable for collecting data, which mostly meant modifying the print statements of each program. Third, I had to create an efficient way to run tests on the machine. This led to a series of Python scripts I wrote to obtain results for different values of n (the number of processes). I then created job files for each implementation of the particles code, along with another Python script called 'runALLJOBS.py' (and a companion, 'runALLJOBS16.py', for running on the 16-core machines) to submit all of the job files I had created. Last, for graphing my results I opted to use OpenOffice rather than gnuplot for my visualizations. While it was somewhat annoying to transfer results from the shell into OpenOffice without them losing any meaning, the graphs produced provide a much cleaner view of the results.
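To give a concrete picture of the print-statement changes described above, a minimal sketch follows. The function name, CSV-style output format, and hard-coded particle count are illustrative assumptions rather than the project's actual output code, but MPI_Wtime() is the standard MPI timer for this kind of measurement.

#include <stdio.h>
#include <mpi.h>

/* Hypothetical timing report: print one easily parsed line per run so a
 * driver script can collect (implementation, processes, particles, seconds). */
static void report_timing(const char *impl, int nprocs, int nparticles, double seconds)
{
    printf("%s,%d,%d,%f\n", impl, nprocs, nparticles, seconds);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double start = MPI_Wtime();
    /* ... run the particle simulation here ... */
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)                 /* only one process writes the result line */
        report_timing("mpi", nprocs, 40000, elapsed);

    MPI_Finalize();
    return 0;
}

Keeping one line per run in a fixed field order is what makes it painless for the Python driver scripts to collect the numbers and paste them into OpenOffice.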
5. The Nature of My Test Cases

My general schema for running the tests was as follows: implement the different types of test cases for my code, run the code to obtain the test results, then graph and analyze those results. I chose four different configurations to run tests on:

No parallelization: this serves as a control group, so that the parallelized results have something to be compared against.

Parallelization with MPI: this means manually spawning each process, passing data between the workers and a 'parent' process that collects the data obtained, then compiling all of the data received and calculating the results.

Parallelization with OpenMP: OpenMP allows the user to simply use pragmas to designate which loops and pieces of execution will be parallelized.

Parallelization with both OpenMP and MPI: for this implementation I took my MPI-parallelized code and added OpenMP pragmas to some of the loops (a sketch of this pattern appears at the end of this section).

When running the code to obtain the test results, I used particle counts of 40000, 20000, 10000, 5000, 2500, 1000, and 500 to graph data over a variety of execution lengths, and I analyzed results using 1 to 12 processes (the n values shown in the graphs).
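As promised above, here is a stripped-down sketch of the hybrid configuration. The particle arrays, the force computation, and the even split of particles across processes are placeholder assumptions rather than the actual particles code, but the overall shape matches the configurations described above: MPI processes each own a slice of the particles, a parent process gathers the results, and an OpenMP pragma parallelizes the rank-local loop.

#include <math.h>
#include <mpi.h>
#include <omp.h>

#define N 40000                 /* one of the tested particle counts */

static double x[N], y[N];       /* particle positions (placeholder state) */
static double fx[N], fy[N];     /* forces computed by this process        */
static double all_fx[N];        /* gathered results on the parent process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process owns a contiguous slice of the particles;
     * for brevity this assumes the process count divides N evenly. */
    int chunk = N / size;
    int lo = rank * chunk;

    /* The OpenMP pragma spreads the rank-local loop across the available cores.
     * Each iteration writes only its own fx[i]/fy[i], so there is no race.     */
    #pragma omp parallel for
    for (int i = lo; i < lo + chunk; i++) {
        fx[i] = fy[i] = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = x[j] - x[i], dy = y[j] - y[i];
            double r2 = dx * dx + dy * dy + 1e-9;     /* softened to avoid /0 */
            double inv = 1.0 / (r2 * sqrt(r2));
            fx[i] += dx * inv;
            fy[i] += dy * inv;
        }
    }

    /* The parent process collects every slice, mirroring the MPI-only setup
     * in which a parent gathers the data and computes the final results.    */
    MPI_Gather(&fx[lo], chunk, MPI_DOUBLE,
               all_fx, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Compiling with something like 'mpicc -fopenmp' enables both layers; dropping the pragma gives the MPI-only configuration, and running on a single process while keeping the pragma roughly corresponds to the OpenMP-only one.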
6. The Way a User Can Run My Code

Good news, user: I've done all of the hard work for you! If you look in the main directory you'll notice subdirectories for the executables used and the data collected. In the executables directory there are a number of different files for the different configurations of the particles code. To compile them, all one has to do is type 'make all' and the makefile will take care of compiling all of the different files for you. A bunch of error messages may or may not appear, but it is safe to ignore them; the code will compile and run just fine. After that you'll need to submit the jobs used to gather the data, which is simple as well: run 'python runALLJOBS.py' and the script will submit all of the jobs required to gather data. If you want to run the code on the 16-core Newton clusters, run 'python runALLJOBS16.py' instead. Easy. If you want to clean up the executables directory, type 'make pristine' and everything except the source code for each test will be deleted.

7. Test Results
[Figures: overall execution time vs. number of particles on the 8-core and 16-core machines]
As you can see, the execution time increases as the number of particles increases, which is to be expected. Two things are somewhat interesting: 1. The processors on the 16-core machines appear to be faster than those on the 8-core machines, given that the overall execution times are quicker. 2. The effects of process switching can be seen at the n = 8 values on the 8-core machines: a detailed view shows that once all of the available cores have been used up, there is some slowdown to account for switching processes in and out between the available cores.
And for the most part these trends hold in each implementation of the code:
[Figures: No Parallelization with 8 available cores and with 16 available cores; execution time vs. number of particles, one series per process count n = 1 to 12]
With no parallelization there is, of course, no speedup as the number of processes grows, but at the very least the benefit of staying within the number of available cores should be clear: in the tests with only 8 cores available, the execution time rose once the number of processes went above 8.
[Figures: MPI Parallelization with 8 and 16 available cores, and OpenMP with 8 and 16 available cores; execution time vs. number of particles, one series per process count n = 1 to 12]
Oh my, this looks exactly like the graphs without any parallelization, only slower! Something can't be right. Let's compare with Amdahl's Law.

8. Comparisons with Amdahl's Law
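For reference, the prediction being compared against is Amdahl's Law, which bounds the speedup of a program in which a fraction p of the work can be parallelized across n processes. With the p = 0.99 used in the graph titles, the predicted curve is simply the single-process time scaled down by that speedup:

S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad T_{\text{predicted}}(n) = T_1 \left( (1 - p) + \frac{p}{n} \right)

With p = 0.99 and n = 12 (the largest process count tested), S(12) = 1 / (0.01 + 0.0825) is about 10.8; the larger p = 0.99999 discussed below would push that to roughly 12.0, essentially the ideal linear speedup.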
[Figures: Amdahl's Law (p = 0.99) comparisons for the No Parallelization, OpenMP, MPI, and MPI/OpenMP implementations; measured test data vs. the Amdahl's Law prediction, execution time (in s) against number of particles]
So there are some predictable results and some unpredictable results. The predictable ones are fairly clear: no parallelization results in a far longer execution time than Amdahl's Law predicts for code that has been parallelized to a significant degree. Similarly, when we parallelize with MPI we see results that mimic Amdahl's Law, or even beat it for a couple of data points! In all honesty, this is probably because the value of p was too low, even though it already implies that nearly all of the code is being parallelized. Perhaps using a larger value (0.99999, or as close to 1 as I can get) would provide a more accurate prediction, but regardless we can still see the law in effect, as the actual results fall right on top of the estimated results.

With the OpenMP results, we can see that something is very wrong, probably an incorrect implementation of the code. It is here that I will make my point about OpenMP: despite appearing to be easier to use, and despite the code itself being much simpler, OpenMP is significantly harder to implement and make use of correctly than MPI. The lack of access to the under-the-hood pieces of the parallelization, in addition to various compilation issues and version checking, makes OpenMP difficult to use, especially in code that needs to be portable between different machines. MPI shares some portability problems as well, but its logic is similar to a library-less implementation of threads. With this information we can conclude that MPI is probably the safest way to parallelize this code: it is significantly faster than no parallelization, appears to shadow Amdahl's Law in its results, and is not as difficult to use correctly as OpenMP.
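As one small illustration of the portability point above (a generic pattern, not code from this project): the OpenMP specification requires conforming compilers to define the _OPENMP macro, so guarding OpenMP-specific headers and calls with it keeps a file compiling on machines or compiler versions without OpenMP support, which is the sort of version checking referred to here.

#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>   /* only available when compiling with OpenMP support */
#endif

int main(void)
{
#ifdef _OPENMP
    /* _OPENMP expands to the release date (yyyymm) of the supported spec,
     * which is one way to check which OpenMP version the compiler provides. */
    printf("OpenMP enabled, spec date %d, max threads %d\n",
           _OPENMP, omp_get_max_threads());
#else
    printf("Compiled without OpenMP; running serially\n");
#endif
    return 0;
}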
.. - n$ to tar/all (he tarball for this project can be found at: https:22www.dropbo".com2s2d+hp?g.wm%z?Df"2sdayGcs0:+project.tar