Microarchitecture Configurations and Floorplanning Co-Optimization
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007
I. INTRODUCTION
THE throughput of a computer system is the product of the average instructions per cycle (IPC) and the clock rate. To boost throughput, both microarchitecture configuration and floorplanning, which are strongly related, need to be optimized. The microarchitecture configuration, including issue width, branch prediction method, cache/TLB sizes, and the number of functional units, directly determines IPC. For example, as shown in [1], resizing critical components improves throughput by over 20% for Power5 processors. Microarchitecture configuration also decides the area of components in a physical layout and thus affects the floorplan of the microarchitecture. The floorplan of a microarchitecture, on the other hand, not only determines the clock rate of the microarchitecture but also has a significant impact on IPC due to pipelining of global interconnects [2]–[4]. However, the traditional design flow separates microarchitecture tuning and floorplan optimization. IPC was improved by the
Manuscript received September 8, 2005; revised September 10, 2006. This paper was supported in part by the National Science Foundation under CAREER Award CCR-0401682 and SRC Grant 1116. C. Long is with Synopsys Inc., Mountain View, CA 94043 USA (e-mail: longchb@synopsys.com). L. J. Simonson is with Intel Inc., Gaston, OR 97119 USA. W. Liao is with Nvidia Inc., Santa Clara, CA 95050 USA. L. He is with the Electrical Engineering Department, University of California, Los Angeles, CA 90095 USA (e-mail: lhe@ee.ucla.edu). Digital Object Identifier 10.1109/TVLSI.2007.899240
means of microarchitecture configuration alone without considering floorplanning, and floorplanning was employed only to determine the clock rate. Consequently, this separation may lead to inferior designs. Because multiple stages for global interconnects will be adopted in the future [5]–[8], it is necessary to optimize microarchitecture configuration and floorplanning simultaneously in order to avoid the throughput degradation caused by the separation of the design flow.
In this paper, we develop a method to optimize microarchitecture configuration and floorplanning simultaneously to maximize throughput for SuperScalar-like [9], [10] microarchitectures. First, we concentrate on floorplanning under given microarchitecture configurations. In addition to the objectives of conventional floorplanning methods, we also minimize the throughput degradation caused by pipelined global interconnects. The key is to develop efficient yet accurate models for microarchitecture throughput over pipeline stages of global interconnects during floorplanning. Note that the accurate evaluation of throughput for a microarchitecture requires cycle-accurate simulations over a set of benchmarks, and one simulation lasts for hours. A fast throughput model is therefore essential during floorplanning. For this purpose, we propose the trajectory piecewise-linear (TPWL) model for cycles per instruction (CPI)¹ over pipeline stages of global interconnects. Our results show that the TPWL model needs more offline setup time but obtains 13% higher throughput than a rough access ratio-based model. The floorplanning approach based on the two models can improve the throughput of the microarchitecture by up to 64.7% (for the TPWL model) compared with conventional floorplanning methods that do not consider the influence of pipelining global interconnects.
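The relation between IPC, CPI, and throughput used in this introduction can be illustrated with a short sketch. The function names and sample numbers below are hypothetical and are not taken from the paper's experiments; the sketch only assumes that throughput is IPC times clock rate and that, at a fixed clock rate, CPI = 1/IPC.

```python
def throughput_bips(ipc: float, clock_ghz: float) -> float:
    """Throughput in billions of instructions per second (BIPS).

    Assumes throughput = IPC * clock rate, with the clock rate in GHz.
    """
    return ipc * clock_ghz


def throughput_gain_from_cpi_reduction(cpi_reduction: float) -> float:
    """Relative throughput gain at a fixed clock rate.

    cpi_reduction is the fractional CPI decrease (0.5 means CPI is halved).
    Because CPI = 1/IPC at a fixed clock rate, the gain is 1/(1 - r) - 1.
    """
    if not 0.0 <= cpi_reduction < 1.0:
        raise ValueError("cpi_reduction must be in [0, 1)")
    return 1.0 / (1.0 - cpi_reduction) - 1.0


if __name__ == "__main__":
    # Hypothetical design points: a higher clock rate can win even if IPC drops.
    print(throughput_bips(ipc=1.2, clock_ghz=3.0))
    print(throughput_bips(ipc=0.9, clock_ghz=5.0))
    # Halving CPI doubles throughput, i.e., a 100% gain.
    print(throughput_gain_from_cpi_reduction(0.5))
```

The second helper makes explicit that a CPI reduction and the corresponding throughput increase are different percentages, which matters when comparing results reported in one metric or the other.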
Second, we build a unified throughput model parameterized for pipelined global interconnects and microarchitecture configurations based on the TPWL method and then apply this model to efficiently explore over one million microarchitecture configurations and corresponding floorplan variations. This is in sharp contrast with existing works [2], [11], which enumerated a limited number of microarchitecture configurations. Our experiments show that the average error of the TPWL model is about 3.0% compared with cycle-accurate estimations. We obtain microarchitecture configurations and corresponding floorplans that are less than 20.0% from the ideal IPC. Also, our solutions are 26.9% better than the manually chosen configurations in [2].
The idea of the TPWL model, as originally proposed in [12] to model nonlinear dynamic systems, is to build a piecewise-linear
¹In this paper, we assume that microprocessors with different clock rates have different configurations. For the same clock rate, we assume CPI = 1/IPC.
model along the trajectory of a typical system response excited by a training input. This model is particularly accurate for system responses whose trajectory is close to that of the training input. In Section III, we will show that the floorplanning optimization based on simulated annealing (SA) can be mathematically described in a form similar to the nonlinear systems described by the state-space approach in [12]. We will also show that multiple SA runs starting with different initial floorplans have close trajectories, which enables a highly accurate TPWL model. Moreover, we will improve the original TPWL model from [12] for higher accuracy.
Related to our work, floorplanning for interconnect optimization has been studied for ASIC and system-on-chip (SOC) designs. Early work on interconnect-driven floorplanning [13]–[16] for ASIC chips focuses on buffer block planning without considering interconnect pipelining. More recently, [17] minimizes the degradation of throughput caused by pipelined interconnects by adding throughput as one of the objectives during floorplanning in SOC designs. In [17], the throughput of SOCs is expressed as normalized values rather than the specific throughput values used in this paper.
Floorplanning optimization for microarchitecture has also been studied. The study in [11] developed a primitive co-optimization method for microarchitecture configuration and floorplanning without considering interconnect pipelining. Specifically, the IPC values of 32 microarchitecture configurations are obtained by cycle-accurate simulations and then stored in a lookup table (LUT). The best configuration is obtained by comparing the throughput of these configurations and corresponding floorplans. Microarchitecture floorplanning with interconnect pipelining was studied by [2] and the earlier version [3] of this paper. The work in [2] profiled module-to-module communication and solved an interconnect-pipelining-aware floorplanning problem using mixed integer nonlinear programming (MILP).
Iterations between profiling and MILP are needed to guarantee the convergence of the overall design flow. Again, the microarchitecture configurations were limited, as only four candidates were considered in that paper. The work in [18] evaluated a bus-driven floorplanning method to optimize the routability and timing of buses in a given microarchitecture configuration. To solve the microarchitecture floorplanning problem, the authors of [3] proposed a TPWL model for interconnect pipelining, which will be extended in this paper to consider microarchitecture and floorplanning co-optimization.
In the remainder of this paper, we introduce background knowledge in Section II and present the TPWL model in Section III. We develop methods for microarchitecture floorplanning considering pipelined interconnects and present experimental results in Section IV. We develop co-optimization of microarchitecture configuration and floorplanning with experimental results in Section V and conclude the paper in Section VI.
II. BACKGROUND
A. Bus Latency Vectors
We assume an out-of-order SuperScalar implementation of the MIPS instruction set. For microarchitecture floorplanning
under a given configuration, we summarize the configuration in Table I, which is similar to the Alpha 21264 with an issue width of four. We group the modules in this implementation into blocks that are each treated as an independent unit during floorplanning. We assume that interconnects between modules within the same block will not be pipelined. Blocks that are composed of multiple modules are the RUU block, including the Register Update Unit and Load Store Queue; the Decode block, including the Fetch Queue and the Decoder; the Branch block, including the Fetch Unit and Branch Predictor; the DL1 block, including the Level 1 Data Cache and the DTLB; and the IL1 block, including the Level 1 Instruction Cache and the ITLB. The L2 unified cache and all functional units are treated as independent blocks. We summarize the block area in Table II, which is obtained by scaling the area of corresponding components in the Alpha 21264 to the ITRS [8] 100-nm generation. The lengths of interconnects between two blocks in Table II are computed according to the Manhattan distance between the centers of the two blocks in the floorplan.² We treat the latency of each such interconnect as an independent variable. Changing the latency of one of these interconnects is effectively a change in the microarchitecture and will impact the performance. In Table III, we specify these interconnects with respect to their terminal blocks.³ We summarize all of the interconnects that affect CPI in Table III and form them into a vector, called the bus latency vector, which is used to characterize a floorplan. For example, in a floorplan, if Bus 1 has a latency of 3, Bus 2 has a latency of 4, Bus 3 has a latency of 7, and so on, the bus latency vector for the floorplan would be (3, 4, 7, ...). The latency of each interconnect is obtained by dividing the total wire length of the interconnect, measured from the floorplan, by a constant value called
²Note that our method can be applied to the exact bus length and forked bus if such design information is provided.
³The L2 cache is composed of three banks in our experiment and we consider the worst case latency.
flip-flop (FF) insertion length, which is computed based on the simultaneous buffer and FF insertion algorithm proposed in [7].
B. Cycle-Accurate Simulations and CPI Metrics
To measure the impact of pipelining stages of global interconnects on CPI, we use out-of-order issue, cycle-accurate simulations in the SimpleScalar 3.0 [19] framework. The latencies (stages) of various global interconnects, which are obtained from the floorplan and recorded in the bus latency vector, are first specified in SimpleScalar. In some cases, these latencies can be specified by simply modifying the configuration file of SimpleScalar. The interconnect between L1 and L2 data cache is a good example. Note that, because the interconnect lengths are different for the L2 instruction cache and the L2 data cache, the miss penalties for them are different. In more complicated cases, we insert queues between modules to buffer data to realize these latencies. These cases include the buses for instruction L1 cache, fetch to dispatch, and dispatch to issue. Specifically, the L1 instruction cache interconnect latency is modeled by a FIFO queue placed between the fetch unit and the cache. A branch cannot be identified by the fetch unit until it moves through the queue; therefore, prefetching proceeds speculatively to consecutive memory locations. When a branch is taken, the prefetched contents of the queue are flushed and fetch proceeds from the target location. The latencies between the fetch and dispatch units and between the dispatch and issue units are modeled by a FIFO queue between the fetch and dispatch units with the length equal to the sum of the latencies of the two buses. This is because the dispatch stage is completely self-contained and the effect of latency that comes immediately before dispatch is identical to that of latency that comes immediately after dispatch.
After all interconnect latencies in a bus latency vector are specified in SimpleScalar, the CPI value of this vector is computed as the arithmetic mean of the CPI of ten benchmarks, including equake, mesa, gzip, art, bzip2, parser, vpr, gcc, go, and mcf, representing both integer and floating-point workloads. The CPI value of each benchmark is obtained by cycle-accurate simulations, where the first 200 million instructions are fast-forwarded⁴ and the next 100 million instructions are actually simulated. Fast-forwarding the first 200 million instructions skips the initial setup stage and warms up architectural structures such as caches and branch predictors, which improves the accuracy of the CPI measurement. Simulating only the next 100 million instructions improves the efficiency of the measurement, because this window is long enough to obtain a steady-state estimation.
⁴Fast-forwarding is an option provided by SimpleScalar which skips a specified number of instructions by using functional simulation before starting cycle-accurate simulations to reduce runtime.
C. Floorplanning
The method for microarchitecture floorplanning used in this paper is based on traditional floorplanning approaches. The objective of traditional floorplanning is to determine the positions and shapes of blocks in a chip subject to the minimization of a cost function, which is usually a combination of area and total wire length. This is usually presented in the form
$$\text{cost} = \lambda_A \frac{A}{A_{\text{norm}}} + \lambda_L \frac{L}{L_{\text{norm}}} \tag{1}$$
where $A$ and $L$ are the area and total wire length of the floorplan, respectively, and $\lambda_A$ and $\lambda_L$ are user-defined weights. Because the metrics of area and wire length are of different magnitudes, they are normalized by typical values ($A_{\text{norm}}$ and $L_{\text{norm}}$) in the objective function.
A widely used floorplanning approach is based on simulated annealing (SA) [20]–[22]. SA starts with an initial floorplan and moves to a new one by changing the positions or shapes of blocks. In each iteration, the cost of the new floorplan is evaluated and the move is unconditionally accepted if the cost of the new floorplan is smaller than that of the old one. The move may also be accepted if the cost increases, but with a probability dictated by the simulated temperature of the annealing. A move that increases the cost is more likely to be accepted at a higher temperature. The temperature is decreased throughout the annealing based upon a schedule, so that by the end only moves that reduce the cost are likely to be accepted.
III. TRAJECTORY PIECEWISE-LINEAR MODEL
A. Overview of the TPWL Model
The TPWL model was originally proposed to model nonlinear dynamic systems [12] such as the following one described by a state-space approach:
$$\frac{d\,g(x(t))}{dt} = f(x(t)) + B(x(t))\,u(t), \qquad y(t) = C^{T} x(t) \tag{2}$$
where $x(t)$ is a vector of states at time $t$, $f$ and $g$ are nonlinear vector-valued functions, $B$ is a state-dependent input matrix, $u(t)$ is an input signal, $C$ is an output matrix, and $y(t)$ is the output signal. In essence, the TPWL model is a weighted combination of linear models at different linearization points in the state space, say $x_0, x_1, \ldots, x_{s-1}$, where the key is how to find these linearization points. By Taylor's expansion, it is clear that the weighted combination of linear models at the linearization points $x_i$ would be accurate for a given state $x$, which needs to be evaluated, if it is close to any of the $x_i$. However, it is difficult to obtain linearization points to guarantee that there is a close point for any given state $x$.
Therefore, [12] proposed to perform a single simulation of the nonlinear system for a fixed training input and initial state and find the linearization points along the trajectory
of this simulation. The reasoning is that system responses may have similar trajectories and the trajectory of a training input can guide us to find linearization points that are close to the points on the trajectories of other inputs. Experimental results show that this TPWL model is highly accurate, especially for trajectories close to the trajectory of the training input.
In this paper, we adopt the TPWL approach to model CPI over pipelining stages of global interconnects during floorplanning. Similar to the state-space approach, the SA optimization process could be described as
$$x_{i+1} = x_i + \Delta x_i \tag{3}$$
where the moves in SA are labeled by numbers $i = 0, 1, 2, \ldots$, $x_i$ is the bus latency vector of the floorplan after $i$ moves, $\Delta x_i$ is the change to $x_i$ in the next move, which is caused by the change to the floorplan, and $x_0$ is the bus latency vector of the initial floorplan before SA starts. The CPI value of the floorplan after $i$ moves is a function of $x_i$, which is represented by $\mathrm{CPI}(x_i)$. Similar to [12], we perform a single-start SA run for a fixed initial floorplan, which is the training input, to obtain linearization points along the trajectory by treating each move in the SA as a state in the nonlinear system. The weighted combination of linear models at these linearization points is the TPWL model for floorplanning.
We can see that this TPWL model would be accurate if the trajectories of SA starting from other initial floorplans are close to the trajectory of the training input. Fortunately, the trajectories starting from different initial floorplans are close to each other in floorplanning. We illustrate the trajectory of SA for microarchitecture floorplanning in Fig. 1(a). We are particularly interested in the trajectory of latencies for buses because these latencies impact the throughput. We represent these latencies on the $y$-axis by the distance between the corresponding bus latency vector (as defined in Section II-A) and the original point.⁵
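The view of SA as a sequence of bus latency vectors, each obtained from the previous one by a small change, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Parquet-based flow of the paper: moves act directly on the latency vector instead of on a floorplan, the cost function is supplied by the caller, the bound `MAX_LATENCY` and the geometric cooling schedule are hypothetical, and the acceptance test is the standard Metropolis rule rather than the paper's condition (20).

```python
import math
import random

MAX_LATENCY = 10  # assumed per-bus upper bound, for illustration only


def distance_to_origin(x):
    """Euclidean distance between a bus latency vector and the all-zero vector."""
    return math.sqrt(sum(v * v for v in x))


def anneal_trajectory(x0, cost, n_moves=200, t0=1.0, cooling=0.98, seed=0):
    """Run a toy annealing loop and return the visited vectors x_0 .. x_n.

    Each move changes one bus latency by +1 or -1 (clamped to the valid
    range), so consecutive trajectory points differ by a small delta.
    """
    rng = random.Random(seed)
    x = list(x0)
    temp = t0
    trajectory = [tuple(x)]
    for _ in range(n_moves):
        j = rng.randrange(len(x))
        step = rng.choice((-1, 1))
        candidate = list(x)
        candidate[j] = min(MAX_LATENCY, max(0, candidate[j] + step))
        delta = cost(candidate) - cost(x)
        # Metropolis rule (an assumption): always accept improvements,
        # accept degradations with a temperature-dependent probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        trajectory.append(tuple(x))
        temp *= cooling
    return trajectory


if __name__ == "__main__":
    # Toy cost: total latency over all buses.
    traj = anneal_trajectory([5, 5, 5], cost=sum, n_moves=100)
    for i in range(0, len(traj), 25):
        print(i, traj[i], round(distance_to_origin(traj[i]), 3))
```

Plotting `distance_to_origin` against the move index for several starting vectors gives a picture analogous in spirit to Fig. 1(a), although the numbers produced by this toy have no relation to the paper's data.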
Although this distance metric cannot fully represent bus latency vectors,⁶ it serves as a good example to illustrate the trend of bus latency vector changes in the SA procedure. The $x$-axis shows the progress of the SA procedure, where 0% and 100% represent the beginning and end of the SA procedure, respectively. Fig. 1(a) shows that the trajectory becomes greatly concentrated at the end of the SA. This is because the temperature of the SA is low at the end and fewer moves that change the floorplan are accepted than at the high temperature of the beginning stage of the SA procedure. This point is further demonstrated in Fig. 1(b), which shows the distribution of latencies in the SA trajectory. The latencies are heavily concentrated in a relatively small range. As shown in the figure, 86% of them fall between the values of 8 and 16. Because the SA procedure at the lower temperature explores a relatively small and concentrated solution space, as shown above, employing the TPWL model to
⁵The original point means that all bus latencies are zero.
⁶Two different bus latency vectors may have the same distance to the original point.
Fig. 1. (a) The trajectory of SA. (b) Distribution of latencies of buses; 86% of them have total bus latencies between 8 and 16.
build a model just in the vicinity of the solution space explored by SA is likely to be more effective than models targeting the whole solution space, such as the design of experiments in [4].⁷
In this study, we improve the original TPWL model by introducing a trajectory points collecting (TPC) problem and iterations. The TPC problem reduces the cost of building the piecewise-linear model by sampling the trajectory and then collecting key points among those sampled points, as discussed in Section III-B. The strategy of building the TPWL model based on several trajectories from multiple SA starts of floorplanning optimization, called iteration, is used because the trajectory of floorplanning is subject to small changes when the CPI metric is added into the objective function of floorplanning. This improves the accuracy of the CPI model (please refer to Section IV-B for the details of iterations).
B. Construction of the TPWL Model
The TPWL model is built in three phases.
⁷Note that the TPWL and design of experiments are orthogonal to each other and can be combined.
1) Sampling: Each move in the SA explores a new floorplan by a small change to the current one, and these explored floorplans define a trajectory in the solution space. We capture the trajectory by sampling one floorplan in every $k$ consecutive moves in SA. By sampling, it is meant that the bus latency vector is extracted from the floorplan and will be used in the next phase. Note that a large $k$ reduces the costs but may lose the details of the trajectory. In order to obtain a good tradeoff between cost and performance, we assume a small $k$ (3 or less) depending on the size of the floorplan.
2) Collecting: To capture the trend of the trajectory with as few sampling points as possible, we perform the collecting phase. Intuitively, we describe this phase as using as few balls as possible to cover all of the bus latency vectors obtained in the sampling phase, where a ball is defined as a spherical area in the solution space in which any point has a distance to the center smaller than a given radius. The center points of these balls are used to represent all others in the same ball and will be used in the simulation phase, while all others are discarded. We formulate the problem as follows.
Formulation 1: TPC: Given a set of points $P = \{p_1, \ldots, p_n\}$ and radius $r$, find $Q = \{q_1, \ldots, q_m\} \subseteq P$ with minimum $m$ while satisfying
$$\forall\, p \in P,\ \exists\, q \in Q:\ \|p - q\| < r \tag{4}$$
Note that each $q \in Q$ is the center of a ball and the minimum $m$ leads to the smallest number of balls. The TPC problem can be rephrased as follows.
Formulation 2: Given a set of points $P$ and radius $r$, for each $p_i \in P$, the set $S_i$ contains all points $p_j \in P$ satisfying
$$\|p_j - p_i\| < r \tag{5}$$
then find the smallest number of sets $S_i$ to cover all points in $P$.
One can see that the rephrased TPC problem falls into the category of a set-cover problem. We adopt the greedy algorithm proposed in [23] and [24] to solve the TPC problem. The idea is to iteratively find a ball which covers as many uncovered points in $P$ as possible. The implementation details are similar to those in [23] and [24]. The TPC problem was not explicitly presented in [12].
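The sampling and collecting phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [23] and [24]: the function names are hypothetical, the distance is taken to be Euclidean, ball centers are restricted to the sampled points, and ties between equally good balls are broken by first occurrence.

```python
import math


def sample_trajectory(trajectory, k):
    """Sampling phase: keep one bus latency vector out of every k moves."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return trajectory[::k]


def greedy_tpc(points, radius):
    """Collecting phase: greedy set cover over balls centered at the points.

    points is a sequence of tuples (bus latency vectors); radius must be
    positive. Returns the chosen ball centers, in the order selected.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    pts = list(dict.fromkeys(points))  # remove duplicates, keep order
    # balls[i] holds the indices of all points strictly inside ball i.
    balls = [
        {j for j, q in enumerate(pts) if math.dist(p, q) < radius}
        for p in pts
    ]
    uncovered = set(range(len(pts)))
    centers = []
    while uncovered:
        # Pick the ball that covers the most still-uncovered points.
        best = max(range(len(pts)), key=lambda i: len(balls[i] & uncovered))
        centers.append(pts[best])
        uncovered -= balls[best]
    return centers


if __name__ == "__main__":
    sampled = sample_trajectory(
        [(0, 0), (0, 0), (1, 0), (1, 0), (0, 1), (5, 5), (10, 10), (11, 10)], 2
    )
    print(greedy_tpc(sampled, radius=2.0))
```

Only the returned centers would then be passed to cycle-accurate simulation, which is where the cost saving of the collecting phase comes from. Because every point lies inside its own ball, each iteration covers at least one new point and the loop terminates.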
As one of the contributions of this paper to improve the TPWL approach, the introduction of the TPC problem can significantly improve the efficiency of the model.
3) Simulation: From the collecting phase, we obtain a set of bus latency vectors, which are the center points of the balls representing the trend of the trajectory. Each point in this set is a bus latency vector and the corresponding CPI can be estimated by cycle-accurate simulations (see Section II-B). We call the phase in which all of these bus latency vectors are simulated to obtain the corresponding CPIs the simulation phase. These bus latency vectors and corresponding CPI values form a table, called the CPI table, which is used to evaluate the CPI value for any given bus latency vector as described in Section III-C.
C. CPI Estimation Under the TPWL Model
By Taylor's expansion, the CPI value of any given bus latency vector $x$ could be estimated from each entry $x_i$ in the CPI table as
$$\mathrm{CPI}_i(x) = \mathrm{CPI}(x_i) + G_i^{T}\,(x - x_i) \tag{6}$$
where
$$G_i = \left[\, g_i^{1},\ g_i^{2},\ \ldots,\ g_i^{n} \,\right]^{T} \tag{7}$$
and
$$g_i^{j} = \left.\frac{\partial\, \mathrm{CPI}}{\partial x^{j}}\right|_{x = x_i} \tag{8}$$
Note that $G_i$ should be obtained from cycle-accurate simulation. However, it is time-consuming to compute $G_i$ for each entry in the CPI table. In this paper, we compute the average $\bar{G} = [\bar{g}^{1}, \ldots, \bar{g}^{n}]^{T}$ and use it for all entries in the CPI table. The average is obtained as follows:
$$\bar{g}^{j} = \frac{1}{2}\left[ \frac{\mathrm{CPI}(x_{\max}) - \mathrm{CPI}(x_{\max}^{j})}{\Delta} + \frac{\mathrm{CPI}(x_{\min}^{j}) - \mathrm{CPI}(x_{\min})}{\Delta} \right] \tag{9}$$
where $x_{\max}$ and $x_{\min}$ are the bus latency vectors with maximum and minimum latency on all buses, respectively. In this paper, we assume that the maximum and minimum latency of a bus is 10 and 0, respectively, and denote the difference between them as $\Delta$, i.e., $\Delta = 10$. Also, $x_{\max}^{j}$ is the same as $x_{\max}$ except that the latency of the $j$th bus is the minimum. Similarly, $x_{\min}^{j}$ is the same as $x_{\min}$ except that the latency of the $j$th bus is the maximum. It can be seen from above that $\bar{g}^{j}$ actually represents the sensitivity of the $j$th bus. The final estimation is computed as the weighted sum of the $\mathrm{CPI}_i(x)$ as
$$\mathrm{CPI}(x) = \sum_{i=1}^{s} w_i\, \mathrm{CPI}_i(x) \tag{10}$$
To determine the weight $w_i$ for each entry in the CPI table, we follow the method adopted in [12]. We first compute the distance between each entry $x_i$ and $x$ and then employ an exponential function of the distance as a weight to compute the average estimation. The distance from $x$ to each $x_i$ is computed as
$$d_i = \| x - x_i \|_2, \qquad i = 1, \ldots, s \tag{11}$$
where $s$ is the size of the CPI table. Then, we compute the weight of each entry as
$$\hat{w}_i = e^{-\beta d_i / m} \tag{12}$$
where
$$m = \min_{1 \le i \le s} d_i \tag{13}$$
Note that $\beta$ is a positive constant and is set to 25 [12]. Afterward, we compute the normalized weights as
$$w_i = \frac{\hat{w}_i}{\sum_{k=1}^{s} \hat{w}_k} \tag{14}$$
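The estimation procedure of this subsection (a first-order estimate from every table entry, combined with normalized exponential weights) can be sketched as below. This is a minimal sketch under stated assumptions: the function name and the data layout of the CPI table are hypothetical, a single average gradient is shared by all entries, and the special case of a zero minimum distance (the query coincides with a table entry) is handled by returning that entry's simulated CPI, which is an assumption added here to avoid a division by zero.

```python
import math

BETA = 25.0  # positive constant used in the exponential weights


def tpwl_estimate(x, cpi_table, avg_gradient, beta=BETA):
    """Estimate CPI for bus latency vector x.

    cpi_table is a list of (x_i, cpi_i) pairs obtained from cycle-accurate
    simulation; avg_gradient is the average CPI sensitivity per bus.
    """
    dists = [math.dist(x, xi) for xi, _ in cpi_table]
    m = min(dists)
    if m == 0.0:
        # Assumption: an exact table hit returns the simulated value.
        return cpi_table[dists.index(m)][1]
    # Unnormalized weights: entries close to x dominate the estimate.
    raw = [math.exp(-beta * d / m) for d in dists]
    total = sum(raw)
    estimate = 0.0
    for (xi, cpi_i), w in zip(cpi_table, raw):
        # First-order (Taylor) estimate from this table entry.
        linear = cpi_i + sum(
            g * (a - b) for g, a, b in zip(avg_gradient, x, xi)
        )
        estimate += (w / total) * linear
    return estimate


if __name__ == "__main__":
    # Hypothetical two-entry table over two buses.
    table = [((0, 0), 1.0), ((10, 0), 2.0)]
    gradient = (0.1, 0.0)
    print(tpwl_estimate((1, 0), table, gradient))
    print(tpwl_estimate((5, 0), table, gradient))
```

Because the weights decay as $e^{-\beta d_i / m}$ with $\beta = 25$, an entry that is even moderately farther away than the nearest one contributes almost nothing, so the estimate is close to the linear model of the nearest table entry unless several entries are nearly equidistant.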
For convenience, we summarize the computations to estimate CPI for a bus latency vector in Fig. 2. As stated in [12], computing the weights based on an exponential function of distance is a simple heuristic. However, it is suitable for CPI, which is a strongly nonlinear function of bus latencies. The estimation based on Taylor's expansion from a CPI table entry that is not close to the target bus latency vector can be error-prone because of this nonlinearity. Therefore, the exponential weight function is more accurate than others such as linear functions, because mainly the table entries that are close to the target bus latency vector contribute to the estimation.
IV. MICROARCHITECTURE FLOORPLANNING CONSIDERING PIPELINED INTERCONNECTS
Here, we study microarchitecture floorplanning to minimize the impact of interconnect pipelining on CPI for given microarchitecture configurations. In order to model the impact of interconnect pipelining on CPI, we employ the TPWL model and the access ratio-based model presented in this section and then compare them.
A. Microarchitecture Floorplanning
As shown in Section II-C, the objective of traditional floorplanning is the weighted sum of area and total wire length. To consider the microarchitecture performance during floorplanning, we add CPI to the objective function and obtain
$$\text{cost} = \lambda_A \frac{A}{A_{\text{norm}}} + \lambda_L \frac{L}{L_{\text{norm}}} + \lambda_C \frac{C}{C_{\text{norm}}} \tag{15}$$
where $A$ and $L$ are the area and total wire length of the floorplan, respectively, $\lambda_A$ and $\lambda_L$ are user-defined weights, and $C$, $C_{\text{norm}}$, and $\lambda_C$ are the CPI value of the floorplan, the normalization value of CPI, and the user-defined weight for CPI, respectively. For simplicity, we denote the objective combining area and total wire length as AL and the objective combining area, total wire length, and CPI as ALC.
The CPI value in (15) is computed by the TPWL model as shown in Section III and via an access ratio-based approach. The access ratio of an interconnect (bus) is defined as the number of bus accesses over the total clock cycles for a benchmark.
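The AL and ALC objectives can be written down as a small sketch. The class and function names, the normalization values, and the weights below are hypothetical and chosen only for illustration; the sketch assumes that each metric is divided by its normalization value and then weighted, as described for the objective function above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Norms:
    """Typical values used to normalize each metric (illustrative)."""
    area: float
    wire_length: float
    cpi: float


@dataclass(frozen=True)
class Weights:
    """User-defined weights for area, total wire length, and CPI."""
    area: float
    wire_length: float
    cpi: float


def alc_cost(area, wire_length, cpi, norms, weights):
    """ALC objective: weighted sum of normalized area, wire length, and CPI."""
    return (
        weights.area * area / norms.area
        + weights.wire_length * wire_length / norms.wire_length
        + weights.cpi * cpi / norms.cpi
    )


def al_cost(area, wire_length, norms, weights):
    """AL objective: the same sum without the CPI term."""
    return (
        weights.area * area / norms.area
        + weights.wire_length * wire_length / norms.wire_length
    )


if __name__ == "__main__":
    norms = Norms(area=2.0, wire_length=3.0, cpi=1.5)
    weights = Weights(area=0.4, wire_length=0.4, cpi=0.2)
    print(al_cost(2.2, 2.7, norms, weights))
    print(alc_cost(2.2, 2.7, 1.8, norms, weights))
```

In a floorplanner, the `cpi` argument would come from a fast estimator (for example, a TPWL lookup or an access ratio-weighted sum) rather than from a cycle-accurate simulation, since the objective is evaluated after every move.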
Intuitively, the latency of a bus has more impact on performance if it is accessed more frequently. Therefore, the access ratio metric is an indicator of the impact of a bus. The bus access counts are collected from cycle-accurate simulations, and we present the arithmetic mean of the data over all benchmarks in Table III. In the access ratio-based approach to compute the CPI term in (15), we use the access ratio-weighted sum of
the pipelining stages for all interconnects in Table III. Note that a similar idea has been used in an independent study [2].
B. Microarchitecture Floorplanning With the TPWL Model
We develop the microarchitecture floorplanning with the TPWL model based on the Parquet package [22]. Fig. 3 shows the overview of the flow. It starts with an initial floorplanning optimization with an objective function of the weighted sum of area and total wire length (as in traditional floorplanning) to obtain an SA trajectory. This trajectory is sampled, collected, and simulated to build an initial CPI table (refer to Section III). This initial CPI table constructs a TPWL model, which makes CPI estimation possible. Thereafter, a few iterations of the above processes may be needed to expand the CPI table and improve the accuracy of the model. Note that the initial SA trajectory is obtained without CPI included in the objective function. However, CPI is added into the objective function after the first round. The iterations stop when the change in the estimated CPI is smaller than a given threshold, which corresponds to the condition for accurate CPI estimation at the bottom of Fig. 3. Typically, the process converges in two to three iterations, and our experimental results show that these extra iterations can improve accuracy significantly. Note that the idea of using iterations to improve the accuracy of the TPWL model is proposed in this paper for the first time.
Regarding implementation details, we describe how to add CPI into the objective function in [22] as follows. As shown in [22], the objective of AL is a linear combination of area and total wire length. In [22], after each move in SA, the change $\Delta$ of the objective is computed as the weighted sum of the changes of area and total wire length. In this paper, we optimize ALC in the floorplanning by changing the objective function. Similar to [22], in our
floorplanning, we compute the change $\Delta$ of the objective as the weighted sum of the changes in area, total wire length, and CPI after each move, as given in (16)–(19), which involve the area of all blocks in the floorplan and the weights for area, total wire length, and CPI, respectively. Note that, in SA, a move is accepted if and only if condition (20) holds, which depends on a random value between 0 and 1 and on the initial temperature and current temperature.
C. Experiment Results
We have implemented the proposed methods for microarchitecture floorplanning based on the Parquet package [22]. According to [7], below we assume that the interconnect distance between two adjacent FFs is 2000 µm under a 3-GHz clock and 1200 µm under 5 GHz in the ITRS 100-nm generation.
1) Validation of the TPWL Model: The procedure to build a TPWL model follows the procedure outlined in Section III-B. Specifically, we conduct SA to minimize an objective function combining area and total wire length (AL) and build up an initial CPI table for the TPWL model by sampling this SA procedure and running cycle-accurate simulations. Then, SA is repeated two more times with CPI added to the objective function (ALC), where the CPI value is evaluated by the TPWL model that has already been established. Each time, we add more entries to the CPI table to improve the accuracy. The total number of cycle-accurate simulations for building a TPWL model is controlled by two factors. The first one is the radius $r$ of the balls in the collecting phase. A small value of $r$ leads to a large number of cycle-accurate simulations in the SA procedure. The second factor is the total number of SA procedures performed. In our experiment, we choose $r$ to be five and perform SA three times. This setting ends up with 90 simulations. Fig. 4 shows the accuracy of the TPWL model against the total number of simulations. As we can see from Fig. 4, the maximum error decreases from 15% to 4% and the average error decreases from 4% to 1% as the total number of cycle-accurate simulations increases from 15 to 90. The total number of simulations to build a TPWL model is scalable, as shown in Fig. 4. However, the accuracy of the TPWL model indeed affects the quality of the floorplans obtained by SA. For example, using a CPI model with 15% error, SA could
Fig. 4. Accuracy of the TPWL model versus total number of cycle-accurate simulations.
obtain a floorplan anywhere between 0% and 30%⁸ from the optimum even if the SA procedure is optimal. Considering that the SA procedure is suboptimal and cycle-accurate simulations are time-consuming, we believe that using 90 cycle-accurate simulations to build a TPWL model with 4% maximum error is a good balance between cost and quality. Note that each cycle-accurate simulation includes ten benchmarks and takes 3–4 h on a 2.8-GHz Xeon machine. Compared with cycle-accurate simulations, the cost of floorplanning optimization is negligible (2–3 min per run). Applying the TPWL model to evaluate CPIs is highly efficient and has almost no impact on the speed of floorplanning optimization once the TPWL tables are built, because the evaluation is based on formulas.
2) Comparison of Floorplanning With Different Objectives: We compare the floorplans obtained by SA subject to the objective functions of AL (area and total wire length) and ALC (area, total wire length, and CPI), respectively. Note that the AL objective is used by traditional floorplanning, with throughput degraded by pipelined global interconnects. The area of each block in the floorplan can be found in Table II. We summarize the results of floorplanning using these objectives in Table IV. For each objective in Table IV, we run floorplanning optimizations ten times, as the floorplanning algorithm in [22] is not deterministic. We show both best and average results from these ten runs for comparison in the table. We first present the white space rate of these floorplans in the first row. The low rate indicates the good quality of these floorplans. We then present the CPI of the floorplans in the second row. The ALC and AC solutions can reduce CPI over AL by up to 64.7%, which corresponds to a 183.0% increase in throughput. The results on total wire length show that there is no strong correlation between CPI and total wire length. This indicates
8For the model that has 15% error, a solution 30% worse than the optimal solution could be overestimated by 15% according to the model. If the optimal solution is underestimated by 15% by the model, the 30% worse solution could be chosen by SA to become the optimal solution.
837
Fig. 5. Comparison between oorplanning without (a) and with (b) CPI minimization.
that the objective of minimizing total wire length in traditional oorplanning does not maximize performance. Instead, latency of pipelined interconnects in CPI-critical paths should be considered to maximize performance. Similarly, the results in the fourth row show that minimizing total wire length does not necessarily reduce the total number of ip-ops and the maximum number of FFs in a single bus. We present the area in the fth row. The ALC results in the lowest CPI with only a small area overhead of less than 5.0% over AL. To demonstrate how CPI minimization affects the nal oorplan, we compare the oorplan obtained without CPI minimization and with CPI minimization in Fig. 5. In our experiment, the aspect ratio of the die is xed but those of blocks are exible. As shown in Fig. 5(b), with consideration of CPI minimization during oorplanning, the length of critical buses that have high
access ratios and intuitively signicant effect on CPI is generally shorter than that in Fig. 5(a). For example, in Fig. 5(b), the RUU module is placed close to the modules of IALU, FPAdd, and FPMul. Note that the buses between these modules are critical buses based on the access ratio in Table III. 3) Comparison of Floorplans Under Different CPI Models: In terms of computational time to set up, the access ratio model is more efcient than the TPWL model. As discussed above, it takes 90 simulations to build an accurate TPWL model while access ratio data can be collected in one simulation. However, because both models use formulas to evaluate CPI, they are equally efcient when used in oorplanning once the TPWL tables are built. To compare the two models, we choose the best and average results among ten optimization results and show them in Table V. We compare the oorplans with an ALC objective function under both 3 and 5 GHz. Metrics of white space ratio, CPI, total wire length (TWL), total/max number of FFs, and area are shown in the table for comparison. As shown in the table, in terms of objective function, the TPWL model is 7.8% and 5.5% better than the access ratio model under 3 and 5 GHz, respectively, on average. In terms of CPI, the TPWL model is 10.0% and 13.0% better. We have also shown the real throughput in Table V in BIPS as well. From Table V, we can see that, although IPC is degraded when the clock rate is increased, throughput is still signicantly improved. This result can be explained as follows. As we discussed in footnote 8, in an ideal SA procedure, a CPI model with X% error leads to a oorplan anywhere between 0% and 2X% away from the optimum. Because the access ratio model can be treated as a rough CPI model and the TPWL model is more accurate, the oorplan obtained by the TPWL model should be better on average. V. 
CO-OPTIMIZATION OF MICROARCHITECTURE CONFIGURATION AND FLOORPLANNING To develop the co-optimization method for microarchitecture conguration and oorplanning we build a unied throughput model to pipelined interconnects and microarchitecture congurations based on the TPWL approach. This model is then integrated in the SA engine to efciently explore over one million microarchitecture congurations and corresponding oorplan variations for each of them. Below, we rst introduce the TPWL model for pipelined interconnects and microarchitecture congurations and then discuss the co-optimization method based on the SA engine. Finally, we present our experiment results.
A. TPWL Model for Pipelined Interconnects and Microarchitecture Configurations

To build a TPWL model for pipelined interconnects, the interconnects that have a large impact on throughput are first identified and then recorded in the bus latency vector, as shown in Section II-A. To build a unified TPWL model of both pipelined interconnects and microarchitecture configurations, not only these interconnects but also the configuration components that affect throughput need to be identified. These components are described as follows.

1) Issue Width: Issue width, also called machine width, is one of the most important parameters determining IPC. It is defined as the number of instructions that can be issued to the execution stage of the pipeline in a single cycle. As shown in [25], CPI can be approximated as

$$\mathrm{CPI} \approx \frac{1}{\mathrm{IPC}_{\mathrm{ideal}}} + \mathrm{CPI}_{\mathrm{br}} + \mathrm{CPI}_{\mathrm{ic}} + \mathrm{CPI}_{\mathrm{dc}} \tag{21}$$

where $\mathrm{IPC}_{\mathrm{ideal}}$ is the maximum performance sustainable in the ideal case, and $\mathrm{CPI}_{\mathrm{br}}$, $\mathrm{CPI}_{\mathrm{ic}}$, and $\mathrm{CPI}_{\mathrm{dc}}$ are the additional CPI caused by branch misprediction, instruction cache miss, and data cache miss, respectively. Note that (21) shows that the upper bound of IPC is $\mathrm{IPC}_{\mathrm{ideal}}$. It is observed in [26] that the number of instructions that can issue per cycle is roughly the square root of the number of instructions in the issue window. Therefore, issue width determines the size of the RUU [27] and LSQ. In this paper, we explore three issue width values: 2, 4, and 8. This range covers most current architectures.

2) Branch Prediction: Equation (21) shows that branch misprediction events increase CPI. A branch misprediction causes useless instructions to be fetched into the pipeline, with a delay penalty of five to ten cycles in a pipeline with a front-end depth of five to nine, as shown in [25]. The miss rate of branch prediction depends on the prediction method. We assume a combination of the bimodal and two-level methods [28] in this study; [28] shows that this combined method performs well in practice. The BTB sizes of the bimodal and two-level methods are parameters of our model. In general, a larger BTB leads to more accurate prediction.

3) Cache and TLB Size: Cache misses are another major factor that increases CPI, as shown in (21). A cache miss stalls the pipeline while data are brought in from the next-level cache, which introduces a delay penalty. For example, the L1 instruction cache miss penalty is about eight cycles in a pipeline with a front-end depth of five to nine [25].
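As a rough numerical illustration of (21), the sketch below adds per-event penalties to the ideal CPI. It assumes that each penalty term can be approximated as events per instruction times cycles per event, which is a simplification of the model in [25]. The function name `cpi_first_order`, the ideal IPC of 4, the event rates, and the data cache penalty are hypothetical; only the eight-cycle branch and instruction cache penalties follow the ranges quoted above.

```python
def cpi_first_order(ipc_ideal, miss_events):
    """Evaluate (21): ideal CPI plus the sum of miss-event CPI terms.

    miss_events: iterable of (events_per_instruction, penalty_cycles) pairs.
    Assumption: each term is rate * penalty, a simplification of [25].
    """
    return 1.0 / ipc_ideal + sum(rate * penalty for rate, penalty in miss_events)


# Hypothetical event rates; they are not measurements from the paper.
events = [
    (0.010, 8.0),   # branch mispredictions per instruction, penalty in cycles
    (0.005, 8.0),   # L1 instruction cache misses per instruction
    (0.020, 10.0),  # data cache misses per instruction (assumed penalty)
]
cpi = cpi_first_order(ipc_ideal=4.0, miss_events=events)
ipc = 1.0 / cpi
print(f"CPI = {cpi:.3f}, IPC = {ipc:.3f}")
```

With any nonzero miss-event term, the resulting IPC stays below the ideal value, which is the upper bound implied by (21).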
The frequency of cache misses depends on the size and organization of the caches. In this study, we assume a smaller associativity for small cache/TLB sizes and a larger one for large sizes, and we treat the cache/TLB sizes as model parameters. The cache/TLB size ranges from 4 to 2048 K in this study. The physical size of each cache/TLB in the floorplan is calculated by Cacti [29] under the ITRS 100-nm generation. Cacti takes the size and organization of a cache/TLB as input and estimates its physical size with high accuracy.

4) Number of Functional Units: An insufficient number of functional units can significantly affect performance. Our study shows that one more or one fewer integer ALU can cause over 20% difference in IPC. However, redundant functional units waste area and power. To a certain degree, the purpose of exploring the number of functional units is to find the number that just satisfies the requirement for computation resources given other parameters such as issue width and cache/TLB size. Because the number of functional units cannot exceed the issue width, in this study it is limited to between one and eight. We consider four types of functional units: integer ALU (IALU), integer multiplier (IMult), floating-point ALU (FPALU), and floating-point multiplier (FPMult).

Together with the pipelined interconnects in Table III, these configuration components are recorded in a vector, which is similar to the bus latency vector. To build a TPWL model for throughput with respect to this vector, we use the same procedure described in Section III-B. Specifically, we sample an SA trajectory, cover the sample points with as few balls as possible, and then simulate the center points of these balls to build a CPI table. Note that here SA optimizes both floorplans and microarchitecture configurations: moves in SA may modify the floorplan as well as the microarchitecture configuration.
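The ball-covering step above can be sketched as follows. This is a minimal illustration, not the procedure of Section III-B: it assumes a Euclidean distance between sample vectors and a greedy set-cover heuristic that repeatedly picks the sample covering the most uncovered samples. The function name `greedy_ball_cover`, the radius, and the sample vectors are hypothetical.

```python
import math


def greedy_ball_cover(points, radius):
    """Choose ball centers among `points` so that every point lies within
    `radius` of at least one center (greedy set-cover heuristic)."""
    uncovered = set(range(len(points)))
    centers = []
    while uncovered:
        # Pick the candidate whose ball contains the most uncovered points.
        best = max(
            range(len(points)),
            key=lambda c: sum(
                1 for i in uncovered if math.dist(points[c], points[i]) <= radius
            ),
        )
        centers.append(points[best])
        uncovered -= {
            i for i in uncovered if math.dist(points[best], points[i]) <= radius
        }
    return centers


# Hypothetical samples from an SA trajectory: (bus latency, bus latency, IALU count).
samples = [(1, 1, 2), (1, 1, 3), (2, 1, 2), (8, 8, 4), (8, 9, 4)]
centers = greedy_ball_cover(samples, radius=1.5)
print(centers)
```

Only the returned centers would need cycle-accurate simulation, so the number of balls directly sets the cost of building the CPI table.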
The details of the SA engine are discussed in Section V-B.

B. SA-Based Co-Optimization Method

In floorplanning, the solution space is explored by SA moves that change the shape, position, or orientation of a module in the floorplan. To additionally explore the solution space of microarchitecture configurations, we add new moves that resize the branch predictor and the cache/TLB and change the number of functional units. We assign approximately 15% of the total number of SA moves to microarchitecture configuration moves, distributed evenly among the types mentioned above. At first glance, changing the issue width could also be a new move. However, to achieve high performance, most parameters should stay in a suitable range with respect to the issue width. As observed in our experiments, in most cases a change of issue width in SA introduces inconsistency among parameters and degrades the quality of the solution. To explore the solution space more efficiently, we instead employ multiple starts for issue width: in each start, the issue width is fixed, and the best solution is selected among all starts.

The explored microarchitecture configurations are summarized in Table VI. Note that each row in the table represents an independent variable, and the total number of microarchitecture configurations in this table is over one million, in sharp contrast with the previous work of [2] and [11], which enumerates no more than 32 candidates. Table VI is built by allowing each parameter to change within a range around the baseline configuration, which is shown in bold font and is close to those appearing in the literature [2]. We find that the microarchitecture throughput is sensitive to changes in these ranges. Specifically, we start with the baseline configuration and then scan one parameter at a time to determine its range. Note that, as suggested by [25], the RUU and LSQ sizes are determined by the given issue width.

C. Experiment Results

1) IPC Versus Area: We have integrated the TPWL model and the methodology for exploring microarchitecture configurations into Parquet. Because total wire length has no strong relation with throughput (as shown in Section IV-C), we use SA to minimize the objective function AC, i.e., area and CPI. We show the curve of IPC versus area in Fig. 6 for the purpose of validating the TPWL model. The data points are obtained by executing multiple SA runs with different weights assigned to area and IPC in the objective function. We also choose the best IPC value among solutions with close area values. We first validate the TPWL model by comparing the IPC values obtained by the model and by cycle-accurate SimpleScalar simulation in Fig. 6.
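One common way to quantify such a comparison is the mean and maximum relative error of the model against simulation. The sketch below assumes that definition; the function name `relative_errors` and the IPC values are hypothetical and are not data from Fig. 6.

```python
def relative_errors(model_ipc, simulated_ipc):
    """Return (mean, max) relative error of model IPC against simulated IPC."""
    errors = [abs(m - s) / s for m, s in zip(model_ipc, simulated_ipc)]
    return sum(errors) / len(errors), max(errors)


# Hypothetical IPC pairs for three floorplan/configuration solutions.
model = [1.00, 1.50, 2.10]
simulated = [1.00, 1.60, 2.00]
mean_err, max_err = relative_errors(model, simulated)
print(f"mean error = {mean_err:.2%}, max error = {max_err:.2%}")
```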
The data points in the figure indicate that the average error of the TPWL model is around 3.3%. We then show the trend of IPC with respect to increasing area. The dotted lines in the figure represent the ideal IPC value with no performance loss caused by insufficient instruction-level parallelism, miss events, lack of functional units, or interconnect pipelining. We approximate the ideal IPC value for each issue width by providing sufficiently large branch predictors, caches, and TLBs and enough functional units in the microarchitecture configurations and conducting SimpleScalar simulations. As shown in the figure, the optimization results are less than 20% away from the ideal IPC. We also compare our IPC values with those of the first three configurations enumerated in [2], which are also shown in Table VII as the last value. These three configurations are for issue widths of 2, 4, and 8, respectively. We show the IPC values of these three configurations without considering interconnect latency; therefore, what is shown in the figure is the ideal IPC for [2]. Also, we obtain the area of each configuration by mapping its configuration parameters to the corresponding physical sizes used in this paper. As we can see from the figure, the co-optimization methodology proposed in this paper leads to much better designs (26.9% higher IPC) than the configurations manually chosen in [2]. Note that the case of issue width 8 in [2] experiences a major performance degradation because its cache size is too small.

As shown in Fig. 6, increasing chip area helps to increase IPC mainly because of the increase in cache size and the number of functional units. However, there is a diminishing return for the same issue width, as it is difficult to improve performance by increasing the chip area when IPC is close to the ideal. This trend holds across different issue widths as well. In addition, compared to issue width 4, issue width 8 is less effective at improving IPC for the same amount of area increase. The primary reason is that there is not enough instruction-level parallelism (ILP) when the issue width increases. Note that the increase in global interconnect latency with the increase in area also contributes to this diminishing return. On the other hand, an area that is too small may lead to significant degradation in IPC. The leftmost data point for issue width 4 under the TPWL model is a good example. It has the smallest area among all data points for issue width 4, but its IPC value is much smaller than that of the issue width 2 solution with a similar area (the rightmost data point for issue width 2).

Fig. 6. IPC versus area (under TPWL model and cycle-accurate simulation).

2) Configuration Details: To reveal the microarchitecture configuration details, we show all microarchitecture configuration parameters for the average and best cases of multiple SA runs in Table VII. All of the IPC values in the table are obtained from cycle-accurate simulations with global interconnect latencies extracted from the corresponding floorplans. As shown in the table, the average values are fairly close to the best-case values, indicating good convergence of the SA optimization. Also, the configuration obtained by the proposed method has a significant performance advantage over the upper bound of the designated configurations used in [2], which is computed without considering the impact of global interconnects.

VI. CONCLUSION AND DISCUSSION

Considering the impact of interconnect pipelining on throughput, we have developed methods for microarchitecture floorplanning to minimize CPI for a given microarchitecture configuration. Compared with conventional floorplanning techniques that minimize area and wire length, the new floorplanning formulation obtains a floorplan with CPI reduced by up to 64.7% with a small area overhead of less than 5.0%. CPI optimization during floorplanning is achieved by shortening the lengths of CPI-critical buses.
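The relation between this CPI reduction and the throughput gain reported in Section IV-C can be checked with a short calculation. It assumes that the clock rate and instruction count are unchanged, so that throughput $T$ is inversely proportional to CPI; the symbols $T$ and $r$ (the fractional CPI reduction) are introduced here only for illustration.

$$\frac{T_{\mathrm{new}}}{T_{\mathrm{old}}} = \frac{\mathrm{CPI}_{\mathrm{old}}}{\mathrm{CPI}_{\mathrm{new}}} = \frac{1}{1-r}, \qquad r = 0.647 \;\Rightarrow\; \frac{1}{0.353} \approx 2.83$$

That is, the throughput increases by roughly 183%, in agreement with the figure quoted in Section IV-C up to rounding of $r$.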
At first glance, because the set of CPI-critical buses is a subset of all global interconnects, the traditional floorplanning objective of minimizing total wire length should lead to a floorplan with optimized system performance. However, we have shown that minimizing total wire length does not necessarily minimize CPI. On the other hand, using the bus access ratio as a weight and minimizing the weighted interconnect length is shown to be a good heuristic for microarchitecture floorplanning.
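The difference between the two objectives can be illustrated with a small sketch. It assumes Manhattan distance between block centers as the bus length; the function name `wire_length`, the block coordinates, and the access ratios are hypothetical and are not taken from Table III.

```python
def wire_length(positions, buses, weighted):
    """Sum Manhattan bus lengths, optionally weighted by bus access ratio.

    positions: block name -> (x, y) center; buses: (block_a, block_b, access_ratio).
    """
    total = 0.0
    for a, b, ratio in buses:
        (xa, ya), (xb, yb) = positions[a], positions[b]
        length = abs(xa - xb) + abs(ya - yb)
        total += ratio * length if weighted else length
    return total


# Hypothetical buses: one heavily used, one rarely used.
buses = [("RUU", "IALU", 0.90), ("RUU", "FPMul", 0.05)]
plan_a = {"RUU": (0, 0), "IALU": (1, 0), "FPMul": (5, 0)}  # hot bus kept short
plan_b = {"RUU": (0, 0), "IALU": (4, 0), "FPMul": (1, 0)}  # shorter total wire

for name, plan in (("A", plan_a), ("B", plan_b)):
    print(name, wire_length(plan, buses, False), wire_length(plan, buses, True))
```

In this made-up case, plan B has the shorter total wire length, but plan A has the shorter access-ratio-weighted length because it keeps the frequently used bus short.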
There are a few recent studies on microarchitecture floorplanning [4], [30], [31]. To reduce the number of microarchitecture simulations, [4] employs design of experiments to identify the key factors affecting throughput. This idea is orthogonal to our strategy of building a TPWL model around the solution space explored by SA.

We have developed a TPWL model for CPI that considers both microarchitecture configurations and interconnect pipelining, and we have further developed co-optimization of microarchitecture configuration and floorplanning based on it. We explore over one million candidate configurations and find microarchitecture configurations and corresponding floorplans with an IPC less than 20.0% away from the ideal IPC. Our solutions are 26.9% better than those of [2], which was able to consider only a limited number of microarchitecture configurations. There is a tradeoff between quality and simulation runtime in building the TPWL model. Tracing the SA trajectory reduces the simulation runtime without sacrificing accuracy around the trajectory. In addition, cycle-accurate simulation may be replaced by trace-based simulation [25] or incremental simulation to reduce model building time. Note that, once the models are built, the runtime is virtually the same for the TPWL model and the access ratio-based approach. The TPWL model is a general model and can be extended to consider more design freedoms and objectives, such as power, while still maintaining high quality.
REFERENCES
[1] J. Clabes, J. Friedrich, M. Sweet, J. Dilullo, S. Chu, D. Plass, J. Dawson, P. Muench, L. Powell, M. Floyd, B. Sinharoy, M. Lee, M. Goulet, J. Wagoner, N. Schwartz, S. Runyon, G. Gorman, P. Restle, R. Kalla, J. McGill, and S. Dodson, "Design and implementation of the Power5 microprocessor," in Proc. IEEE Int. Solid-State Circuits Conf., 2004, pp. 56–57.
[2] M. Ekpanyapong, J. R. Minz, T. Watewai, H.-H. S. Lee, and S. K. Lim, "Profile-guided microarchitectural floorplanning for deep submicron processor design," in Proc. Design Autom. Conf., 2004, pp. 634–639.
[3] C. Long, L. Simonson, W. Liao, and L. He, "Floorplanning optimization with trajectory piecewise-linear model for pipelined interconnects," in Proc. Design Autom. Conf., 2004, pp. 640–645.
[4] V. Nookala, Y. Chen, D. J. Lilja, and S. S. Sapatnekar, "Microarchitecture-aware floorplanning using a statistical design of experiments approach," in Proc. Design Autom. Conf., 2005, pp. 579–584.
[5] D. Matzke, "Will physical scalability sabotage performance gains?," Computer, vol. 30, pp. 37–39, 1997.
[6] P. Cocchini, "Concurrent flip-flop and repeater insertion for high performance integrated circuits," in Proc. Int. Conf. Comput.-Aided Design, Nov. 2002, pp. 268–273.
[7] W. Liao and L. He, "Full-chip interconnect power estimation and simulation considering concurrent repeater and flip-flop insertion," in Proc. Int. Conf. Comput.-Aided Design, 2003, pp. 574–580.
[8] International Technology Roadmap for Semiconductors (ITRS), 2003.
[9] J. Smith and G. Sohi, "The microarchitecture of superscalar processors," Proc. IEEE, vol. 83, no. 12, pp. 1609–1624, Dec. 1995.
[10] B. A. Gieseke, "A 600 MHz superscalar RISC microprocessor with out-of-order execution," in Proc. IEEE Int. Solid-State Circuits Conf., 1997, pp. 176–177.
[11] J. Cong, A. Jagannathan, G. Reinman, and M. Romesis, "Microarchitecture evaluation with physical planning," in Proc. Design Autom. Conf., 2003, pp. 32–35.
[12] M. Rewienski and J. White, "A trajectory piecewise-linear approach to model order reduction and fast simulation of nonlinear circuits and micromachined devices," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 22, no. 1, pp. 155–170, Jan. 2003.
[13] J. Cong, T. Kong, and D. Pan, "Buffer block planning for interconnect-driven floorplanning," in Proc. Int. Conf. Comput.-Aided Design, Nov. 1999, pp. 358–363.
[14] I.-R. Jiang, Y.-W. Chang, J.-Y. Jou, and K.-Y. Chao, "Simultaneous floor plan and buffer-block optimization," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 5, pp. 694–703, May 2004.
[15] K. Wong and E. Young, "Fast buffer planning and congestion optimization in interconnect-driven floorplanning," in Proc. Design Autom. Conf., 2003, pp. 411–416.
[16] C.-W. Sham, E. Young, and H. Zhou, "Interconnect-driven floorplanning by searching alternative packings," in Proc. Asia South Pacific Design Autom. Conf., 2003, pp. 417–422.
[17] M. R. Casu and L. Macchiarulo, "Floorplanning for throughput," in Proc. Int. Symp. Phys. Design, 2004, pp. 62–69.
[18] F. Rafiq, M. Chrzanowska-Jeske, H. Yang, M. Jeske, and N. Sherwani, "Integrated floorplanning with buffer/channel insertion for bus-based designs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 22, no. 6, pp. 730–741, Jun. 2003.
[19] D. Burger and T. Austin, The SimpleScalar Tool Set Version 2.0. Madison: Univ. of Wisconsin-Madison Press, 1997.
[20] N. Sherwani, Algorithms for VLSI Design Automation, 3rd ed. Boston, MA: Kluwer, 1999.
[21] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, "VLSI module placement based on rectangle-packing by the sequence pair," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 15, no. 12, pp. 1518–1524, Dec. 1996.
[22] S. N. Adya and I. L. Markov, "Fixed-outline floorplanning through better local search," in Proc. IEEE Int. Conf. Comput. Design, 2001, pp. 328–334.
[23] D. S. Johnson, "Approximation algorithms for combinatorial problems," J. Comput. Syst. Sci., vol. 9, pp. 256–278.
[24] L. Lovasz, "On the ratio of optimal integral and fractional covers," Discrete Math., vol. 13, pp. 383–390.
[25] T. Karkhanis and J. Smith, "A first-order superscalar processor model," in Proc. 31st Annu. Int. Symp. Comput. Architecture, Jun. 2004, pp. 338–349.
[26] E. Riseman and C. Foster, "The inhibition of potential parallelism by conditional jumps," IEEE Trans. Comput., vol. C-21, no. 12, pp. 1405–1411, Dec. 1972.
[27] G. Sohi, "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. Comput., vol. C-39, no. 3, pp. 349–359, Mar. 1990.
[28] McFarling, "Combining branch predictors," DEC WRL, Tech. Rep. TN-36, Jun. 1993.
[29] S. Wilton and N. Jouppi, "Cacti: An enhanced cache access and cycle time model," IEEE J. Solid-State Circuits, vol. 31, no. 5, pp. 677–688, May 1996.
[30] M. R. Casu and L. Macchiarulo, "Throughput-driven floorplanning with wire pipelining," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 5, pp. 663–675, May 2005.
[31] M. R. Casu and L. Macchiarulo, "Floorplan assisted data rate enhancement through wire pipelining: A real assessment," in Proc. Int. Symp. Phys. Design, 2005, pp. 121–128.

Changbo Long received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1999 and 2001, respectively, the M.S. degree in computer engineering from the University of Wisconsin-Madison in 2003, and the Ph.D. degree in electrical engineering from the University of California, Los Angeles, in 2006. He joined Synopsys Inc., Sunnyvale, CA, in June 2006. His research interests include computer-aided design of VLSI circuits and systems and power-efficient circuits and designs.
Lucanus J. Simonson received the B.S. degree in computer engineering from the Illinois Institute of Technology, Chicago, in 2001, and the M.S. degree in electrical engineering from the University of California, Los Angeles, in 2004. He is currently a Computer-Aided-Design Software Developer for Intel Corporation, Gaston, OR, focusing on physical design and computational geometry.

Weiping Liao (S'05–M'06) received the B.S. and M.S. degrees in physics from the University of Science and Technology of China, Hefei, in 1996 and 1999, respectively, the M.S. degree in computer engineering from the University of Wisconsin-Madison in 2002, and the Ph.D. degree in electrical engineering from the University of California, Los Angeles, in 2005. He is currently a Senior Architecture Engineer with Nvidia, Santa Clara, CA, where he is involved with next-generation graphics processor design.

Lei He (S'94–M'99) received the Ph.D. degree in computer science from the University of California, Los Angeles (UCLA), in 1999. He is an Associate Professor with the Electrical Engineering Department, UCLA, and was a faculty member with the University of Wisconsin-Madison between 1999 and 2001. He also held visiting or consulting positions with Intel, Hewlett-Packard, Cadence, Synopsys, Rio Design Automation, and Apache Design Solutions. His research interests include VLSI circuits and systems and electronic design automation. He has published over 140 technical papers and has been a technical program committee member for a number of conferences, including the Design Automation Conference, the International Conference on Computer-Aided Design, the International Symposium on Low Power Electronics and Design, and the International Symposium on Field Programmable Gate Array. Dr. He was the recipient of the National Science Foundation CAREER Award in 2000, the UCLA Chancellor's Faculty Career Development Award (highest class) in 2003, the IBM Faculty Award in 2003, the Northrop Grumman Excellence in Teaching Award in 2005, and the Best Paper Award at the 2006 International Symposium on Physical Design.