X. Mapping Techniques: 27 April, 2009
Mapping Techniques
Contents
Mapping classification
Schemes for static mapping
Schemes for dynamic mapping
Maximizing data locality
Overlapping computations with interactions
Replication
Optimized collective interactions
Processors
Processors are the hardware units that physically perform computations.
In most cases, there is a one-to-one correspondence between processes and processors.
The tasks into which a problem is decomposed run on physical processors.
A process
refers to a processing or computing agent that performs tasks.
It is an abstract entity that uses the code and data corresponding to a task to produce the output of that task within a finite amount of time after the task is activated by the parallel program.
In addition to performing computations, a process may synchronize or communicate with other processes, if needed.
To obtain any speedup over a sequential implementation, a parallel program must have several processes active simultaneously, working on different tasks.
Once a computation has been decomposed into tasks, the tasks are mapped onto processes with the objective that all tasks complete in the shortest amount of elapsed time.
Processes can be idle even before the overall computation is finished, for a variety of reasons:
Uneven load distribution may cause some processes to finish earlier than others.
All the unfinished tasks mapped onto a process may be waiting for tasks mapped onto other processes to finish, in order to satisfy the constraints imposed by the task-dependency graph.
To achieve a small execution time, the overheads of executing the tasks in parallel must be minimized.
Example:
Minimizing the interactions can easily be achieved by assigning sets of tasks that need to interact with each other onto the same process.
Such a mapping, however, will result in a highly unbalanced workload among the processes.
Following this strategy to the limit will often map all tasks onto a single process.
Processes with a lighter load will be idle while those with a heavier load are still finishing their tasks.
To balance the load among processes, it may therefore be necessary to assign tasks that interact heavily to different processes.
As the figure shows, two mappings, each with an overall balanced workload, can
result in different completion times.
Mapping classification
The choice of a good mapping depends on several factors, including the size of data associated with tasks, the characteristics of inter-task interactions, and even the parallel programming paradigm. Mapping techniques can be classified as static or dynamic.
Static Mapping:
Distribute the tasks among processes prior to the execution of the algorithm.
For statically generated tasks, either static or dynamic mapping can be used.
For many practical cases, relatively inexpensive heuristics provide fairly acceptable approximate
solutions to the optimal static mapping problem.
Algorithms that make use of static mapping are in general easier to design and program.
If task sizes are unknown, a static mapping can potentially lead to serious load imbalances.
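For the case where task sizes can be estimated in advance, one such inexpensive heuristic is the greedy "largest task first" rule sketched below (an illustration, not from the original notes; the task sizes and function name are hypothetical): each task, taken in decreasing order of size, is assigned to the currently least-loaded process.

```python
import heapq

def greedy_static_mapping(task_sizes, p):
    """Greedy heuristic: assign each task (largest first) to the process
    with the smallest load so far. Returns a list: task index -> process."""
    loads = [(0, proc) for proc in range(p)]   # (current load, process id)
    heapq.heapify(loads)
    mapping = [None] * len(task_sizes)
    for task in sorted(range(len(task_sizes)), key=lambda t: -task_sizes[t]):
        load, proc = heapq.heappop(loads)      # least-loaded process
        mapping[task] = proc
        heapq.heappush(loads, (load + task_sizes[task], proc))
    return mapping

# Example: 8 tasks of uneven size mapped onto 3 processes
print(greedy_static_mapping([7, 5, 4, 4, 3, 2, 2, 1], 3))
```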
Dynamic Mapping:
Distribute the work among processes during the execution of the algorithm.
If tasks are generated dynamically, then they must be mapped dynamically too.
If the amount of data associated with tasks is large relative to the computation, then a dynamic
mapping may entail moving this data among processes.
In a shared-address-space paradigm, dynamic mapping may work well even with large data
associated with tasks if the interaction is read-only.
Algorithms that require dynamic mapping are more complicated, particularly in the message-passing programming paradigm.
Static mapping
Block Distributions
Partition the array into p parts such that the k-th part contains rows kn/p, ..., (k+1)n/p - 1, where 0 <= k < p.
Each partition contains a block of n/p consecutive rows.
If A is instead partitioned along the second dimension, then each partition contains a block of n/p consecutive columns.
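A minimal sketch (not from the original notes) of the row-wise block distribution above, generalized to the case where p does not evenly divide n (the remainder rows are spread over the first few processes):

```python
def block_rows(n, p, k):
    """Rows owned by process k under a 1-D block distribution of n rows
    over p processes; the first n % p processes get one extra row."""
    base, extra = divmod(n, p)
    start = k * base + min(k, extra)
    size = base + (1 if k < extra else 0)
    return range(start, start + size)

# Example: 10 rows distributed over 3 processes
for k in range(3):
    print(f"process {k}: rows {list(block_rows(10, 3, k))}")
```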
Two-dimensional distributions
The array can also be partitioned along both of its dimensions, e.g. into a sqrt(p) x sqrt(p) grid of blocks, with each process receiving an (n/sqrt(p)) x (n/sqrt(p)) block of the matrix.
Example 2: LU factorization
The serial algorithm and the block version of LU factorization are illustrated in Fig. 3.
Computing the value of A1,1 requires only one task, Task 1.
Computing the value of A3,3 requires three tasks: Tasks 9, 13, and 14.
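For reference, a minimal sketch (not from the original notes) of the serial, in-place LU factorization without pivoting whose block version is decomposed into the tasks mentioned above:

```python
def lu_inplace(A):
    """In-place LU factorization without pivoting: after the call, the upper
    triangle of A holds U and the strict lower triangle holds L (unit diagonal)."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                    # multiplier, i.e. entry of L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]      # update trailing submatrix
    return A

A = [[4.0, 3.0], [6.0, 3.0]]
lu_inplace(A)   # L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
```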
Block-Cyclic Distributions
Partition the array into many more blocks than the number of available processes.
Assign the partitions (and the associated tasks) to processes in a round-robin manner, so that each process gets several non-adjacent blocks.
In a one-dimensional block-cyclic distribution, the rows (columns) of an n x n matrix are divided into alpha*p groups of n/(alpha*p) consecutive rows (columns), where 1 <= alpha <= n/p.
These blocks are distributed among the p processes in a wraparound fashion, such that block bi is assigned to process P(i % p) ('%' is the modulo operator).
This assigns blocks of the matrix to each process, but each subsequent block assigned to the same process is p blocks away.
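A minimal sketch (not from the original notes) of this wraparound assignment, assuming the rows have already been split into alpha*p blocks:

```python
def block_cyclic_owner(num_blocks, p):
    """Block-cyclic mapping: block i is owned by process i % p."""
    return [i % p for i in range(num_blocks)]

# Example: 8 blocks over 3 processes; each process gets several non-adjacent blocks
print(block_cyclic_owner(8, 3))   # [0, 1, 2, 0, 1, 2, 0, 1]
```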
Graph partitioning
There are many algorithms that operate on sparse data structures and for which the pattern of interaction among data elements is data-dependent and highly irregular.
Numerical simulations of physical phenomena provide a large source of such computations.
In these computations, the physical domain is discretized and represented by a mesh of elements.
Example: simulating a physical phenomenon such as the dispersion of a water contaminant in a lake involves computing the level of contamination at each vertex of this mesh at various intervals of time.
Random partitioning: each process will need to access a large set of points belonging to other processes to complete the computations for its assigned portion of the mesh.
Better: partition the mesh into p parts such that each part contains roughly the same number of mesh points or vertices, and the number of edges that cross partition boundaries is minimized.
NP-complete problem.
Algorithms that employ powerful heuristics are available to compute reasonable partitions.
Each process is assigned a contiguous region of the mesh such that the total number of
mesh points that needs to be accessed across partition boundaries is minimized.
An approximate solution is shown in the figure:
a mapping of the partitioned mesh for sparse matrix-vector multiplication onto three processes.
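To make the partitioning objective concrete, here is a small sketch (not from the original notes, with hypothetical helper names) that evaluates a candidate partition by the two quantities of interest: the balance of vertices per process and the number of edges crossing partition boundaries (the edge cut):

```python
from collections import Counter

def evaluate_partition(edges, part):
    """edges: list of (u, v) pairs of the mesh graph.
    part: dict mapping each vertex to its assigned process.
    Returns (edge_cut, vertices_per_process)."""
    edge_cut = sum(1 for u, v in edges if part[u] != part[v])
    return edge_cut, Counter(part.values())

# Example: a 6-vertex chain 0-1-2-3-4-5 split over 2 processes
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(evaluate_partition(edges, part))   # (1, Counter({0: 3, 1: 3}))
```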
Dynamic Mapping
Dynamic mapping is necessary in situations where a static mapping may result in a highly imbalanced distribution of work among processes, or where the task-dependency graph itself is dynamic.
Dynamic mapping schemes can be either centralized or distributed.
Centralized schemes: all executable tasks are maintained in a common central data structure, or they are maintained by a special process or a subset of processes.
If a special process is designated to manage the pool of available tasks, then it is often referred to as the master;
the other processes that depend on the master to obtain work are referred to as slaves.
whenever a process has no work, it takes a portion of available work from the central
data structure or the master process.
whenever a new task is generated, it is added to this centralized data structure or
reported to the master process.
centralized load-balancing schemes are usually easier to implement than distributed
schemes, but may have limited scalability.
The large number of accesses to the common data structure or to the master process tends to become a bottleneck.
Example: sorting the rows of a matrix, with the unsorted row indices kept in a central pool.
Whenever a process is idle, it picks up an available index, deletes it, and sorts the row with that index.
Scheduling the independent iterations of a loop among parallel processes is known as self scheduling.
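A minimal self-scheduling sketch (not from the original notes), using threads and a shared queue as stand-ins for processes and the central data structure:

```python
import queue, threading

def self_schedule(rows, num_workers=3):
    """Each worker repeatedly takes a row index from the central pool
    and sorts that row in place (self scheduling)."""
    pool = queue.Queue()
    for i in range(len(rows)):
        pool.put(i)

    def worker():
        while True:
            try:
                i = pool.get_nowait()   # pick up an available index
            except queue.Empty:
                return                  # no work left
            rows[i].sort()              # perform the task for that index

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

rows = [[3, 1, 2], [9, 7, 8], [6, 4, 5]]
self_schedule(rows)
print(rows)   # [[1, 2, 3], [7, 8, 9], [4, 5, 6]]
```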
The frequency of such interactions can be reduced by restructuring the algorithm such that shared data are accessed and used in large pieces.
Data replication
Example:
Multiple processes may require frequent read-only access to a shared data structure, such as a hash table, in an irregular pattern.
Replicating a copy of such read-only data on each process makes all subsequent accesses local and interaction-free.
The aggregate amount of memory required to store the replicated data, however, increases linearly with the number of concurrent processes.
This may limit the size of the problem that can be solved on a given parallel computer.
Computation replication
Example:
In the Fast Fourier Transform of an N-point series, N distinct powers of ω, known as "twiddle factors", are computed and used at various points in the computation.
In a parallel implementation of the FFT, different processes require overlapping subsets of these N twiddle factors.
Message-passing paradigm: each process can locally compute all the twiddle factors it needs.
Although the parallel algorithm may perform many more twiddle factor computations than the
serial algorithm, it may still be faster than sharing the twiddle factors.
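A minimal sketch (not from the original notes) of this computation replication: each process recomputes locally the twiddle factors ω^k = e^(-2πik/N) for the indices it needs, instead of receiving them from other processes:

```python
import cmath

def local_twiddle_factors(N, needed_indices):
    """Recompute locally the twiddle factors this process needs
    (computation replication), rather than communicating them."""
    return {k: cmath.exp(-2j * cmath.pi * k / N) for k in needed_indices}

# Example: the subset of an 8-point FFT's twiddle factors needed by one process
print(local_twiddle_factors(8, [0, 2, 4]))   # approximately {0: 1, 2: -1j, 4: -1}
```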
Optimized collective interactions
Often the interaction patterns among concurrent activities are static and regular.
A class of such static and regular interaction patterns are those that are performed by groups of tasks, and they are used to achieve regular data accesses or to perform certain types of computations on distributed data.
A number of such key collective interaction operations have been identified that appear frequently in many parallel algorithms.
Examples: broadcasting some data to all the processes, or adding up numbers each belonging to a different process (reduction).
The algorithm designer does not need to think about how these operations are implemented and can focus only on the functionality achieved by these operations.
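As an illustration (a sketch, not from the original notes, assuming the mpi4py bindings are available), a broadcast and a reduction are each invoked through a single call, with the optimized implementation hidden inside the library:

```python
# Run with, e.g.: mpiexec -n 4 python collectives.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast: the root sends the same data to every process
data = {"n": 1024} if rank == 0 else None
data = comm.bcast(data, root=0)

# Reduction: combine one value per process into a single result at the root
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print("broadcast payload:", data, "sum of ranks:", total)
```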