Exam - Chinese PDF
(2007.10) (2) Why have multiprocessor systems become popular nowadays? What are their advantages? What are the challenges in the design and use of a multiprocessor system?
Ans: (1) CPU multiprocessor CPU (2) processor independent processes performance (3) i. ii. iii. processor iv.

(2007.10) (5) What is availability? Given the mean time to failure (MTTF) and the mean time to repair (MTTR), calculate the mean time between failures (MTBF) and the availability. How can MTTF and availability be improved? Use a RAID disk array as an example to explain why it can have better availability than a single disk.
Ans:
(1) $$MTBF = MTTF + MTTR$$
$$Availability = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF}$$
Availability is the fraction of time in which the system delivers service rather than being down after a failure. It improves when MTTF is lengthened or MTTR is shortened.
(2) A RAID disk array stores data redundantly across several hard disks, so the failure of one data disk does not cause data loss. The array keeps delivering service while the failed disk is replaced, so a disk failure is not an array failure: the array's effective MTTR is short and its MTTF is longer than that of a single disk, and both raise availability.
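The two formulas above can be checked with a short sketch. The function names and the sample figures (999 hours MTTF, 1 hour MTTR) are illustrative assumptions, not values from the exam.

```python
def mtbf(mttf, mttr):
    """Mean time between failures: time up plus time under repair."""
    return mttf + mttr


def availability(mttf, mttr):
    """Fraction of time the system is delivering service."""
    return mttf / mtbf(mttf, mttr)


# Hypothetical figures: 999 hours between failures, 1 hour to repair.
baseline = availability(999.0, 1.0)
# Halving the repair time (as fast recovery in a redundant array would) raises availability.
faster_repair = availability(999.0, 0.5)
```

Shortening MTTR or lengthening MTTF both push the ratio toward 1, which is the argument made for the RAID array.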
(2007.03) (1a) (1b) The architecture team is considering enhancing a machine by adding vector hardware to it. When a computation is run in vector mode on the vector hardware, it is 10 times faster than the normal mode of execution. We call the percentage of time that could be spent using vector mode the percentage of vectorization.
(a) What percentage of vectorization is needed to achieve a speedup of 2?
(b) Suppose you have measured the percentage of vectorization for programs to be 70%. The hardware design group says they can double the speedup of the vector hardware with a significant additional engineering investment. You wonder whether the compiler crew could increase the use of vector mode as another approach to increase performance. How much of an increase in the percentage of vectorization (relative to current usage) would you need to obtain the same performance gain as doubling vector hardware speed?
Ans:
(a) Let x be the percentage of vectorization.
$$2 = \frac{1}{(1 - x) + \frac{x}{10}} \Rightarrow x \approx 0.5556 = 55.56\%$$
(b) Let x be the percentage of vectorization needed. Doubling the speedup of the vector hardware gives 10 × 2 = 20.
$$\frac{1}{0.3 + \frac{0.7}{20}} = \frac{1}{(1 - x) + \frac{x}{10}} \Rightarrow x \approx 0.7389 = 73.89\%$$
73.89% - 70% = 3.89%
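Both parts can be reproduced by solving Amdahl's law for the vectorized fraction. This is a minimal sketch; the function names are illustrative and not from the source.

```python
def amdahl_speedup(fraction, enhanced_speedup):
    """Overall speedup when `fraction` of the time is sped up by `enhanced_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / enhanced_speedup)


def fraction_for_speedup(target, enhanced_speedup):
    """Solve target = 1 / ((1 - x) + x / s) for x."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / enhanced_speedup)


# (a) fraction needed for an overall speedup of 2 with 10x vector hardware
x_a = fraction_for_speedup(2.0, 10.0)

# (b) speedup at 70% vectorization with 20x hardware, then the fraction
#     that gives the same speedup with the original 10x hardware
target_b = amdahl_speedup(0.70, 20.0)
x_b = fraction_for_speedup(target_b, 10.0)
increase_points = x_b - 0.70
```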
The percentage of vectorization must therefore rise by 3.89 percentage points (about 5.6% relative to the current 70%).

(2007.03) (8) (2003.10) (5) To design an instruction set, what factors do you need to consider in determining the number of registers?
Ans: The number of registers in an instruction set depends on:
(1) Addressing mode: different addressing modes name different numbers of registers.
i. Immediate addressing mode: ADD R1, #3 uses 1 register.
ii. Indexed addressing mode: ADD R3, (R1+R2) uses 3 registers.
(2) Code size and decode complexity tradeoff: the more registers there are, the more bits each operand needs and the more complex decoding becomes. The designer also has to weigh the register count against the instruction set style (CISC or RISC).

(2006.10) (5) Describe one technique for exploiting instruction level parallelism.
Ans: ILP can be exploited in hardware or in software.
(1) Hardware: pipelining. Machines that build on pipelining include superscalar and VLIW designs.
(2) Software: loop unrolling. Unrolling exposes independent instructions from different iterations.

(2006.03) (5a) (5b) (5c) (5d) The following figure shows the trend of microprocessors in terms of transistor count. The transistor count on a processor chip increases dramatically over time. This is also known as Moore's Law. (Ignore the figure)
(a) Based on the straight line plotted on the figure, estimate the number of transistors available on a chip in 2020.
(b) What does bit-level parallelism mean? How was bit-level parallelism utilized to increase performance of microprocessors in the 70s and 80s? Is bit-level parallelism still an effective way to increase performance of mainstream applications today? Why?
(c) Explain how instruction-level parallelism (ILP) can be exploited to improve processor performance. Is ILP still an effective way to increase processor performance for mainstream applications today? Why?
(d) What does thread-level parallelism (TLP) mean? Why is it important for processors to take advantage of TLP today? Discuss the advantages and disadvantages of TLP.
Ans:
(a) Reading the straight line at 2020 gives an exponent x = 10.75, i.e. about 10^10.75 transistors.
(b) i. Bit-level parallelism means widening the processor word size (and bus bandwidth) so that more bits are processed per operation, which raises performance. For example, a processor with an 8-bit word size has to add 16-bit values in two steps (lower 8 bits, then higher 8 bits), while a processor with a 16-bit word size needs one step.
ii. bandwidth
(c) i. Instruction-level parallelism raises processor performance by overlapping the execution of instructions, for example with pipelining, superscalar and VLIW designs.
ii. Structural, control and data hazards keep an ILP implementation from reaching full parallelism.
(d) i. Thread-level parallelism means the processor runs several threads, each with its own state (instructions, data, PC, registers).
ii. Advantage: multiple instruction streams raise throughput. For a multi-threaded program, exploiting TLP is more cost-effective than exploiting more ILP.
Disadvantage: unlike ILP, TLP is not transparent to the programmer.

(2005.10) (1a) (1b)
(a) Generally speaking, CISC CPUs have more complex instructions than RISC CPUs, and therefore need fewer instructions to perform the same tasks. However, typically one CISC instruction, since it is more complex, takes more time to complete than a RISC instruction. Assume that a certain task needs P CISC instructions and 2P RISC instructions, and that one CISC instruction takes 8T ns to complete, and one RISC instruction takes 2T ns. Under this assumption, which one has the better performance?
(b) Compare the pros & cons of two types of encoding schemes: fixed length and variable length.
Ans:
(a) CPU time: CISC: P × 8T = 8PT (ns); RISC: 2P × 2T = 4PT (ns). RISC has the better performance.
(b) Fixed length: Pros: simpler decode, lower CPI, easier to pipeline. Cons: fewer addressing modes, larger code size.
Variable length: Pros: more addressing modes, smaller code size. Cons: more complex decode, higher CPI, harder to pipeline.

(2004.10) (1) The benchmarks for your new RISC computer show the following instruction mix:
Instruction Type   Frequency   Clock cycle count
ALU                30%         1
Loads              30%         2
Stores             25%         2
Branches           15%         2
Assume there is one hardware enhancement that reduces the execution time of load/store instruction to 1 cycle. But it increases the execution time of a branch instruction to 3 cycles and the system clock cycle time by 15%. If you are the system designer, will you implement this hardware enhancement? Why?
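Ans: With the enhancement the instruction mix becomes:
Instruction Type   Frequency   Clock cycle count
ALU                30%         1
Loads              30%         1
Stores             25%         1
Branches           15%         3
CPU time = IC × CPI × CCT, so the speedup is
$$Speedup = \frac{CPUtime_{original}}{CPUtime_{enhanced}} = \frac{CPI_{original} \times CCT_{original}}{CPI_{enhanced} \times CCT_{enhanced}}$$
$$= \frac{(0.3 \times 1 + 0.3 \times 2 + 0.25 \times 2 + 0.15 \times 2) \times 1}{(0.3 \times 1 + 0.3 \times 1 + 0.25 \times 1 + 0.15 \times 3) \times 1.15} = \frac{1.7}{1.495} = 1.13712$$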
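The calculation above can be checked with a few lines. This is a sketch; the names `average_cpi`, `ORIGINAL` and `ENHANCED` are illustrative, and the instruction count is assumed unchanged by the enhancement, as in the answer.

```python
def average_cpi(mix):
    """mix is a list of (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


# (frequency, clock cycles) for ALU, loads, stores, branches
ORIGINAL = [(0.30, 1), (0.30, 2), (0.25, 2), (0.15, 2)]
ENHANCED = [(0.30, 1), (0.30, 1), (0.25, 1), (0.15, 3)]


def speedup(original, enhanced, clock_stretch):
    """Ratio of CPU times; clock_stretch is the relative clock cycle time (1.15 = 15% longer)."""
    return average_cpi(original) / (average_cpi(enhanced) * clock_stretch)


result = speedup(ORIGINAL, ENHANCED, 1.15)
```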
Because the speedup is greater than 1, the hardware enhancement should be implemented.

(2008.03) (5) Describe Amdahl's Law.
(2004.10) (4) What is the important design principle addressed by Amdahl's law?
(2003.03) (1) What is the architectural design principle based on Amdahl's Law?
Ans: Amdahl's law:
$$SpeedUp_{overall} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{SpeedUp_{enhanced}}}$$
The design principle is to make the common case fast: the designer should find the performance bottleneck and improve it, because that is where an enhancement improves overall performance the most.

(2004.10) (8) (2003.03) (4) Why can a SMT processor utilize CPU resources more efficiently than a superscalar processor?
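Ans: A superscalar processor exploits only ILP. Its hardware pipeline issues from a single thread, so dependent instructions stall the CPU and leave resources idle. Simultaneous multithreading exploits both ILP and TLP: like fine-grained multithreading it can switch threads every clock cycle, and within one cycle it can issue independent instructions from several threads, so the CPU stalls less and its resources are used more fully than in a superscalar CPU.

(2004.03) (2) Assume the instruction mix and the latency of various instruction types listed in Table 1. Determine the speedup obtained by applying a compiler optimization that converts 50% of multiplies into a sequence of shifts and adds with an average length of three instructions.
Instruction Type                    Frequency   Clock cycle count
ALU arithmetic (add, sub, shift)    30%         1
Loads                               25%         2
Stores                              15%         2
Branches                            15%         4
Multiply                            15%         10
Ans: Half of the multiplies (7.5%) are each replaced by 3 add/shift instructions:
Instruction Type                    Frequency            Clock cycle count
ALU arithmetic (add, sub, shift)    (30 + 7.5 × 3)%      1
Loads                               25%                  2
Stores                              15%                  2
Branches                            15%                  4
Multiply                            7.5%                 10
The new frequencies add up to (30 + 7.5 × 3)% + 25% + 15% + 15% + 7.5% = 115% of the original instruction count. Because they are expressed relative to the original instruction count, the weighted sums below are cycles per original instruction and can be compared directly:
$$Speedup = \frac{ExTime_{original}}{ExTime_{improved}} = \frac{0.3 \times 1 + 0.25 \times 2 + 0.15 \times 2 + 0.15 \times 4 + 0.15 \times 10}{(0.3 + 0.075 \times 3) \times 1 + 0.25 \times 2 + 0.15 \times 2 + 0.15 \times 4 + 0.075 \times 10} = \frac{3.2}{2.675} \approx 1.196$$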
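A sketch that counts total cycles per 100 original instructions makes the effect of the larger instruction count explicit. The dictionary keys and the `optimize` helper are illustrative names, not from the source.

```python
# Clock cycles per instruction type, from Table 1.
CPI = {"alu": 1, "load": 2, "store": 2, "branch": 4, "multiply": 10}

# Instruction counts per 100 original instructions.
ORIGINAL = {"alu": 30.0, "load": 25.0, "store": 15.0, "branch": 15.0, "multiply": 15.0}


def total_cycles(counts):
    """Total clock cycles for the given instruction counts."""
    return sum(counts[kind] * CPI[kind] for kind in counts)


def optimize(counts, converted=0.5, sequence_length=3):
    """Replace a fraction of the multiplies by shift/add sequences."""
    new_counts = dict(counts)
    moved = counts["multiply"] * converted
    new_counts["multiply"] -= moved
    new_counts["alu"] += moved * sequence_length
    return new_counts


optimized = optimize(ORIGINAL)
instruction_growth = sum(optimized.values()) / sum(ORIGINAL.values())
speedup = total_cycles(ORIGINAL) / total_cycles(optimized)
```

The clock cycle time is unchanged, so the ratio of total cycles is the ratio of execution times.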
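(2007.03) (4) (2004.03) (6) What is the problem existing in the modern superscalar processor that motivates the design of the simultaneous multithreading architecture?
Ans: A superscalar processor exploits only ILP. Its hardware pipeline issues from a single thread, so dependent instructions stall the CPU and CPU resources go unused. Simultaneous multithreading (SMT) combines ILP and TLP: in each cycle it can issue independent instructions from several threads, which reduces CPU stalls and keeps the CPU busy.

(2003.10) (4) What is an important design principle addressed by Amdahl's Law? Explain why it expresses the law of diminishing returns.
Ans:
(1) Amdahl's law:
$$SpeedUp_{overall} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{SpeedUp_{enhanced}}}$$
The design principle is to make the common case fast: the designer should find the performance bottleneck and improve it.
(2) By Amdahl's law, as SpeedUp_enhanced grows, SpeedUp_overall grows more and more slowly and can never exceed 1 / (1 - Fraction_enhanced). Each further increase in SpeedUp_enhanced therefore buys less (diminishing returns).

(2002.10) (1) Why is MIPS a wrong measure for comparing performance among computers?
Ans: MIPS is defined as
$$MIPS = \frac{IC}{ExTime \times 10^6} = \frac{IC}{IC \times CPI \times CCT \times 10^6} = \frac{CPU\ frequency}{CPI \times 10^6}$$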
Performance depends on instruction count, clocks per instruction and clock cycle time, but MIPS reflects only CPI and CPU frequency. Computers with different instruction set architectures (for example CISC and RISC) differ in exactly the missing term: CISC has a higher CPI but a lower IC, RISC has a lower CPI but a higher IC. MIPS therefore cannot compare machines with different ISAs.

(2002.03) (1a) (1b)
(a) You just made a fortune from your summer job. You want to use this money to buy the fastest PC available in the market. You are choosing between machine A and machine B. Machine A runs at 800 MHz and the reported peak performance is 500 MIPS. The clock rate of machine B is 1 GHz and the reported peak performance is 700 MIPS. Are you able to decide which machine to buy? Why?
(b) The benchmark for machine A shows the following instruction mix:
Instruction Type   Frequency   CPI
A                  40%         1
B                  30%         4
C                  15%         2
D                  15%         6
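The architectural development team is studying a hardware enhancement. According to the simulation result, the enhancement will reduce the instruction counts of instruction A, B, C and D by 10%, 20%, 50% and 5%, respectively. But it will lengthen the clock cycle time by 15%. Should they implement this new architectural feature in their next generation product?
Ans:
(a) $$MIPS = \frac{IC}{ExTime \times 10^6} = \frac{IC}{IC \times CPI \times CCT \times 10^6} = \frac{clock\ rate}{CPI \times 10^6}$$
MIPS depends only on the clock rate and the CPI and ignores the instruction count, so the peak MIPS figures alone do not tell which machine runs a given program faster.
(b) After the enhancement:
Instruction Type   Relative instruction count   CPI
A                  40% × 90%                    1
B                  30% × 80%                    4
C                  15% × 50%                    2
D                  15% × 95%                    6
$$Speedup = \frac{ExTime_{original}}{ExTime_{improved}} = \frac{0.4 \times 1 + 0.3 \times 4 + 0.15 \times 2 + 0.15 \times 6}{(0.4 \times 0.9 \times 1 + 0.3 \times 0.8 \times 4 + 0.15 \times 0.5 \times 2 + 0.15 \times 0.95 \times 6) \times 1.15} = \frac{2.8}{2.67375} \approx 1.047$$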
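As a cross-check, the speedup can be recomputed directly from the frequencies, CPIs and instruction-count reductions stated in the question. This is a sketch; the tuple layout and the name `enhancement_speedup` are illustrative.

```python
# (frequency, CPI, fraction of that type's instruction count removed) for A, B, C, D
MIX = [
    (0.40, 1, 0.10),
    (0.30, 4, 0.20),
    (0.15, 2, 0.50),
    (0.15, 6, 0.05),
]


def enhancement_speedup(mix, clock_stretch=1.15):
    """Ratio of execution times, per original instruction."""
    original = sum(freq * cpi for freq, cpi, _ in mix)
    improved = sum(freq * (1.0 - cut) * cpi for freq, cpi, cut in mix) * clock_stretch
    return original / improved


result = enhancement_speedup(MIX)
```

With these inputs the original cost is 2.8 cycles per instruction and the improved cost is 2.325 × 1.15 = 2.67375 in the same units.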
The new architectural feature should be implemented only if this speedup is greater than 1.

(2001.03) (1) Can you give one example of high-level optimization performed by an optimizing compiler?
(1999.03) (1) Can you give an example of high-level optimizations in modern compiler design?
Ans: Compiler optimization examples:
(1) Loop interchange: arrays are stored in memory in row-major order, so making j the innermost loop improves spatial locality.
Before:
    for (k = 0; k < 100; k = k+1)
      for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
          x[i][j] = 2 * x[i][j];
After:
    for (k = 0; k < 100; k = k+1)
      for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
          x[i][j] = 2 * x[i][j];
(2) Loop fusion: improves temporal locality, because a[i][j] and c[i][j] are reused while they are still in the cache.
Before:
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
After:
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
      }
(3) Blocking: work on submatrices of size B chosen so that a block fits within the cache capacity. The After version assumes x starts out as all zeros.
Before:
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = r;
      }
After:
    for (jj = 0; jj < N; jj = jj+B)
      for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
          for (j = jj; j < min(jj+B,N); j = j+1) {
            r = 0;
            for (k = kk; k < min(kk+B,N); k = k+1)
              r = r + y[i][k]*z[k][j];
            x[i][j] = x[i][j] + r;
          }
(2000.10) (1) Can you explain the following terms? (1) The diminishing return in Amdahl's law. (2) Convoy and chime with respect to vector processors. (3) Vector stride with respect to the execution of vector processors.
Ans:
(1) Amdahl's law:
$$SpeedUp_{overall} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{SpeedUp_{enhanced}}}$$
As SpeedUp_enhanced grows, SpeedUp_overall grows more and more slowly and is bounded by 1 / (1 - Fraction_enhanced) (diminishing returns).
(2) Convoy: a set of vector instructions that can begin execution together in one clock period; it must contain no structural hazards or data hazards.
Chime: the time taken to execute one vector operation (one convoy).
(3) Vector stride: the distance separating the elements that a vector operation gathers into one vector. The stride is held in a general purpose register; LVWS (load vector with stride) fetches the vector into a vector register and SVWS (store vector with stride) stores a nonunit stride vector.

(1999.10) (1) A CPU designer wants to change the pipeline design of a CPU. How will his/her action affect the 3 metrics on the right hand side of the following equation? CPU time = instruction count × clocks per instruction × clock cycle time. If the optimizing compiler developed for the CPU is rewritten, how will the metrics be affected?
Ans:
(1) CPU: changing the pipeline design of the CPU affects the clock cycle time metric. The instruction set matters as well: RISC has a larger code size, so IC is higher and CPI lower; CISC has a smaller code size, so IC is lower and CPI higher.
(2) Compiler: an optimizing compiler changes the code sequence and the number of cycles it takes to process, so it affects CPI, and the compiler also affects the instruction count.

(1998.10) (1) To evaluate a CPU performance, we use the following equation: Execution time = instruction count × CPI × clock cycle time. Can you explain how instruction set design, architecture design, hardware technology, and compiler technology affect the three components in the equation?
Ans:
(1) Instruction set: affects instruction count and CPI. RISC has a larger code size, so IC is higher and CPI lower; CISC has a smaller code size, so IC is lower and CPI higher.
(2) Architecture: affects CPI, for example single versus multi processor (CPI: single > multi).
(3) Hardware technology: affects clock cycle time. A higher processor clock rate means a shorter clock cycle time.
(4) Compiler technology: affects CPI and instruction count. An optimizing compiler changes the code sequence and the cycles it takes to process, so it affects CPI, and the compiler also affects the instruction count.
[Summary table: rows Program, Compiler, Instruction Set, Organization, Technology; columns Instruction Count, CPI, Clock Cycle Time]
(3) 2-bit predictor: in the transition diagram each state gives the prediction for the branch (taken or not taken), and the T/NT edges are the actual branch outcomes. Starting from the initial state Predict NT, one taken branch moves to the other Predict NT state, and a second taken branch moves to Predict T.

(2007.03) (5) Explain the differences between dynamic scheduling with and without speculation.
Ans: Dynamic scheduling with hardware speculation relies on branch prediction: instructions after a branch execute according to the prediction before the branch is resolved, and the speculated results update registers or memory only in the commit stage. Taking Tomasulo's algorithm as the example:
(1) Tomasulo's algorithm with speculation
Issue
Execution
Write result: write results to the CDB & store results in a HW buffer (Reorder Buffer)
Commit: update register file or memory (in-order commit)
(2) Tomasulo's algorithm without speculation
Issue
Execution
Write result: write results to the CDB & update register file or memory
(2007.03) (6) (2002.10) (6) Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the mis-prediction penalty is always 4 cycles, and the buffer miss penalty is always 3 cycles. Assume 90% hit rate and 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.
Ans:
$$Speedup = \frac{ExTime_{NOBTB}}{ExTime_{BTB}} = \frac{CPI_{NOBTB}}{CPI_{BTB}} = \frac{1 + 0.15 \times 2}{1 + 0.15 \times 0.1 \times 3 + 0.15 \times 0.9 \times 0.1 \times 4} = \frac{1.3}{1.099} = 1.18289$$
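The CPI model used in this answer can be written as a small function. This is a sketch; the parameter names are illustrative, and the model charges the miss penalty to every BTB miss and the mis-prediction penalty to every BTB hit that is predicted wrongly, as the formula above does.

```python
def btb_speedup(branch_freq, hit_rate, accuracy, mispredict_penalty,
                miss_penalty, fixed_penalty=2, base_cpi=1.0):
    """Speedup of a BTB design over a design with a fixed branch penalty."""
    cpi_without = base_cpi + branch_freq * fixed_penalty
    stall_per_branch = ((1.0 - hit_rate) * miss_penalty
                        + hit_rate * (1.0 - accuracy) * mispredict_penalty)
    cpi_with = base_cpi + branch_freq * stall_per_branch
    return cpi_without / cpi_with


# 15% branches, 90% hit rate, 90% accuracy, 4-cycle mis-prediction, 3-cycle miss
result = btb_speedup(0.15, 0.90, 0.90, 4, 3)
```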
(2006.10) (3) Increasing the size of a branch-prediction buffer means that it is less likely that two branches in a program will share the same predictor. A single predictor predicting a single branch instruction is generally more accurate than is that same predictor serving more than one branch instruction.
(a) List a sequence of branch taken and not taken actions to show a simple example of 1-bit predictor sharing that reduces mis-prediction rate.
(b) List a sequence of branch taken and not taken actions that show a simple example of how sharing a 1-bit predictor increases mis-prediction rate.
Ans: Let b1 and b2 be two branches and P the shared 1-bit predictor. b1p and b2p are the predictions made for b1 and b2 (the predictor is initially NT). In each iteration b1 executes before b2.
(a) Separate 1-bit predictors:
b1p    NT  T   T   T   T   T   NT  NT  NT  NT
b1     T   T   T   T   T   NT  NT  NT  NT  NT
Corr.  N   Y   Y   Y   Y   N   Y   Y   Y   Y
b2p    NT  T   T   T   T   T   NT  NT  NT  NT
b2     T   T   T   T   T   NT  NT  NT  NT  NT
Corr.  N   Y   Y   Y   Y   N   Y   Y   Y   Y
b1 mis-prediction rate: 20%, b2 mis-prediction rate: 20%
Shared 1-bit predictor:
b1p    NT  T   T   T   T   T   NT  NT  NT  NT
b1     T   T   T   T   T   NT  NT  NT  NT  NT
Corr.  N   Y   Y   Y   Y   N   Y   Y   Y   Y
b2p    T   T   T   T   T   NT  NT  NT  NT  NT
b2     T   T   T   T   T   NT  NT  NT  NT  NT
Corr.  Y   Y   Y   Y   Y   Y   Y   Y   Y   Y
b1 mis-prediction rate: 20%, b2 mis-prediction rate: 0%
When b1 and b2 share the 1-bit predictor, the mis-prediction rate goes down.
(b) Separate 1-bit predictors:
b1p    NT  T   T   T   T   T   T   T   T   T
b1     T   T   T   T   T   T   T   T   T   T
Corr.  N   Y   Y   Y   Y   Y   Y   Y   Y   Y
b2p    NT  NT  NT  NT  NT  NT  NT  NT  NT  NT
b2     NT  NT  NT  NT  NT  NT  NT  NT  NT  NT
Corr.  Y   Y   Y   Y   Y   Y   Y   Y   Y   Y
b1 mis-prediction rate: 10%, b2 mis-prediction rate: 0%
Shared 1-bit predictor:
b1p    NT  NT  NT  NT  NT  NT  NT  NT  NT  NT
b1     T   T   T   T   T   T   T   T   T   T
Corr.  N   N   N   N   N   N   N   N   N   N
b2p    T   T   T   T   T   T   T   T   T   T
b2     NT  NT  NT  NT  NT  NT  NT  NT  NT  NT
Corr.  N   N   N   N   N   N   N   N   N   N
b1 mis-prediction rate: 100%, b2 mis-prediction rate: 100%
When b1 and b2 share the 1-bit predictor, the mis-prediction rate goes up.
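The four tables above can be reproduced with a small simulation. This is a sketch under the same assumptions as the answer: the predictor starts at not taken, a 1-bit predictor simply remembers the last outcome, and b1 executes before b2 in every iteration. The function name and the boolean encoding (True = taken) are illustrative.

```python
def mispredictions(outcomes, shared):
    """Count mispredictions of b1 and b2 with 1-bit predictors.

    outcomes: list of (b1_taken, b2_taken) pairs, one per iteration.
    shared:   True if both branches use the same predictor entry.
    Returns (b1_misses, b2_misses).
    """
    state = {"b1": False, "b2": False}   # False = predict not taken (initial state)
    misses = {"b1": 0, "b2": 0}
    for b1_taken, b2_taken in outcomes:
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            entry = "b1" if shared else name   # shared: both branches index one entry
            if state[entry] != taken:
                misses[name] += 1
            state[entry] = taken               # 1-bit predictor: remember last outcome
    return misses["b1"], misses["b2"]


# (a) both branches taken 5 times, then not taken 5 times
case_a = [(True, True)] * 5 + [(False, False)] * 5
# (b) b1 always taken, b2 never taken
case_b = [(True, False)] * 10
```

Each count is out of 10 executions of the branch, so 2 misses correspond to 20% and 10 misses to 100%.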
(2006.10) (4) For the following code sequence, assuming the pipeline latency listed in the following table, show how it can be scheduled in a single-issue pipeline without delays. You could transform the code and unroll the loop as many times as needed.
Foo: L.D    F0, 0(R1)
     L.D    F4, 0(R2)
     MUL.D  F0, F0, F4
     ADD.D  F2, F0, F2
     DADDUI R1, R1, #-8
     DADDUI R2, R2, #-8
     BNEZ   R1, Foo
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Ans:
(2006.03) (1) For the code segment shown below, use standard MIPS 5-stage pipeline (IF, ID, EX, MEM, WB) with register forwarding and no branch prediction to answer the following questions. Assume the branch is resolved in the ID stage, and ld instructions hit in the cache. The processor has only one floating point ALU, the execution latency of the floating point ALU is three cycles, and the floating point ALU is pipelined.
Inst 1:         LD    F1, 45(R2)
Inst 2:         ADD.D F7, F1, F5
Inst 3:         SUB.D F8, F1, F6
Inst 4:         BNEZ  F7, target
Inst 5:         ADD.D F8, F8, F5
Inst 6: target: SUB.D F2, F3, F4
(a) Identify each dependency by type (data, and output dependency); list the two instructions involved. (b) Assume a non-taken branch. How many cycles does the code segment execute? Please show how the code segment scheduled in the pipeline.
Note: Please write down all the additional architectural assumptions that your answers are based on.
Ans:
(a) Data dependency: (1,2), (1,3), (2,4), (3,5)
Output dependency: (3,5)
(b) LD Mem Read SD WB Read
[Pipeline timing diagram for instructions 1 to 6 over cycles 1 to 18; the cell layout was lost in extraction]
Ans: Tournament branch predictor predictor selection algorithm local global predictor predictor branch predictor selection algorithm predictor state diagram predictor #1 / predictor #2 (0: incorrect, 1: correct)
(2006.03) (4) Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself. Show how the following code segment can be improved using a predicated form of LW (LWC).
First instruction slot      Second instruction slot
LW   R1, 40(R2)             ADD R3, R4, R5
                            ADD R6, R3, R7
BEQZ R10, L
LW   R8, 0(R10)
L: LW R9, 0(R8)
Ans:
First instruction slot      Second instruction slot
LW   R1, 40(R2)             ADD R3, R4, R5
Waste                       ADD R6, R3, R7
BEQZ R10, L
LW   R8, 0(R10)
L: LW R9, 0(R8)
In the original code one first instruction slot is wasted, and the pipeline has a true data dependency on R8 between the last two loads. LWC is an LW that executes only when its condition holds. Using LWC the code becomes:
First instruction slot      Second instruction slot
LW   R1, 40(R2)             ADD R3, R4, R5
LWC  R8, 0(R10), R10        ADD R6, R3, R7
BEQZ R10, L
L: LW R9, 0(R8)
In the case R10 = 0, LWC R8, 0(R10), R10 does nothing. Otherwise it loads R8 in the formerly wasted first instruction slot, so R8 is available earlier and the true data dependency no longer stalls the pipeline.

(2005.10) (2) For the code segment shown below, use standard MIPS 5-stage pipeline (IF, ID, EX, MEM, WB) with register forwarding and delayed branch semantics to answer the following questions (note the pipeline latency is listed in table 1):
Loop: L.D    F0, 0(R1)
      MULT.D F0, F0, F2
      L.D    F4, 0(R2)
      ADD.D  F0, F0, F4
      S.D    0(R2), F0
      (opcode missing) R1, R1, 8
      (opcode missing) R1, R2, 8
      (opcode missing) R1, Loop
Table 1:
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
(a) Perform loop unrolling and schedule the codes without any delay. (b) Perform loop unrolling and schedule the codes on a simple two-issue, statically scheduled superscalar MIPS pipeline without any delay. This processor can issue two instructions per clock cycle: one of the instructions can be a load, store, branch or integer ALU operation, and the other can be any floating-point operation. Ans:
(2005.03) (2) Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2: Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Be sure to include the start-up and clean-up code in your answer. (2002.10) (5) Show a software-pipelined version of the following code segment: Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop You may omit the start-up and clean-up code. Ans: [] http://developer.apple.com/hardwaredrivers/ve/software_pipelining.html P.248-250 iteration code L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) software-pipeline
Software-pipelined code:
Start-up code:
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      L.D    F0, -8(R1)
Loop: S.D    F4, 0(R1)
      ADD.D  F4, F0, F2
      L.D    F0, -16(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Clean-up code:
      S.D    F4, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
(2005.03) (3) Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the mis-prediction penalty is always 5 cycles, and the buffer miss penalty is always 3 cycles. Assume 95% hit rate and 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.
Ans:
$$Speedup = \frac{ExTime_{NOBTB}}{ExTime_{BTB}} = \frac{CPI_{NOBTB}}{CPI_{BTB}} = \frac{1 + 0.15 \times 2}{1 + 0.15 \times 0.05 \times 3 + 0.15 \times 0.95 \times 0.1 \times 5} = \frac{1.3}{1.09375} = 1.18857$$
(2005.03) (7) (2004.03) (4) Use code examples to illustrate three types of pipeline hazards. Ans: MIPS 5-stage pipeline (1) Structure hazard memory port code instr. 1 cycle 4 memory access instr. 4 cycle 4 instruction fetch memory port structure hazard
LD F0, 0(R2) SUB.D F1, F2, F3 ADD.D F4, F5, F6 SUB.D F7, F8, F9 (2) Control hazard Code branch branch taken not taken control hazard LOOP: LD F0, 0(R1) ADD.D F2, F0, F4 DADDUI R1, R1, #-8 BNEZ R1, LOOP R1 instruction BNEZ (3) Data hazard i. Read after write (RAW) hazard: data dependence ADD.D F0, F4, F2 SUB.D F8, F0, F6 ii. Write after read (WAR) hazard: anti dependence ADD.D F4, F0, F2 SUB.D F0, F8, F6 iii. Write after write (WAW) hazard: output dependence ADD.D F0, F4, F2 SUB.D F0, F8, F6 (2004.10) (5) What are correlating branch predictors? (2000.10) (4) Can you explain how correlated predictors, also called two-level predictors, work in branch prediction? Ans: Correlating branch predictor branch branch taken not taken (1) global last branch history branch (2) (m, n) predictor global m branch 2^m n-bit local global local predictor predictor predictor branch Lecture_3 P.42
(2004.10) (6) For the following code sequence:
L.D    F6, 34(R2)
L.D    F2, 45(R3)
MULT.D F0, F2, F4
SUB.D  F8, F6, F2
DIV.D  F10, F0, F6
ADD.D  F6, F8, F2
Identify instruction pairs with data, anti and output dependence.
Ans:
(1) Data dependence (RAW hazard):
i. L.D F6, 34(R2) and SUB.D F8, F6, F2
ii. L.D F6, 34(R2) and DIV.D F10, F0, F6
iii. L.D F2, 45(R3) and MULT.D F0, F2, F4
iv. L.D F2, 45(R3) and SUB.D F8, F6, F2
v. L.D F2, 45(R3) and ADD.D F6, F8, F2
vi. MULT.D F0, F2, F4 and DIV.D F10, F0, F6
vii. SUB.D F8, F6, F2 and ADD.D F6, F8, F2
(2) Anti dependence (WAR hazard):
i. SUB.D F8, F6, F2 and ADD.D F6, F8, F2
ii. DIV.D F10, F0, F6 and ADD.D F6, F8, F2
(3) Output dependence (WAW hazard):
i. L.D F6, 34(R2) and ADD.D F6, F8, F2
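The pairs listed above can be derived mechanically from each instruction's destination and source registers. The sketch below is illustrative: the (destination, sources) encoding and the function name are assumptions, and pairs are reported by instruction position (1 to 6) in the sequence.

```python
# Each instruction as (destination register, source registers).
CODE = [
    ("F6",  ("R2",)),        # 1: L.D    F6, 34(R2)
    ("F2",  ("R3",)),        # 2: L.D    F2, 45(R3)
    ("F0",  ("F2", "F4")),   # 3: MULT.D F0, F2, F4
    ("F8",  ("F6", "F2")),   # 4: SUB.D  F8, F6, F2
    ("F10", ("F0", "F6")),   # 5: DIV.D  F10, F0, F6
    ("F6",  ("F8", "F2")),   # 6: ADD.D  F6, F8, F2
]


def dependences(instrs):
    """Return (raw, war, waw) lists of 1-based instruction pairs (earlier, later)."""
    raw, war, waw = [], [], []
    for j, (dst_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            dst_i, srcs_i = instrs[i]
            # Registers written strictly between i and j hide older values.
            written_between = {instrs[k][0] for k in range(i + 1, j)}
            if dst_i in srcs_j and dst_i not in written_between:
                raw.append((i + 1, j + 1))   # j reads what i wrote
            if dst_j in srcs_i and dst_j not in written_between:
                war.append((i + 1, j + 1))   # j overwrites what i reads
            if dst_i == dst_j and dst_i not in written_between:
                waw.append((i + 1, j + 1))   # both write the same register
    return sorted(raw), sorted(war), sorted(waw)


raw_pairs, war_pairs, waw_pairs = dependences(CODE)
```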
(2004.03) (3) Consider the following code segment within a loop body: If (x is even) then /* branch b1 */ a++; If(x is a multiple of 10) then /* branch b2 */ b++; For the following list of 10 values of x to be processed by 10 iterations of this loop: 8, 9, 10, 11, 12, 20, 29, 30, 31, 32, determine the prediction accuracy of b1 and b2 using a 2-bit predictor as illustrated in Figure 1. Assume the initial state is non-taken in the branch history table. (Ignore the figure) Ans:
8 NT T2 NT T1
9 T N2 T N1
10 NT T2 T T1
(2003.10) (1) For the following code sequence:
L.D    F6, 34(R2)     (1)*
L.D    F2, 45(R3)     (1)*
MULT.D F0, F2, F4     (2)*
SUB.D  F8, F6, F2     (2)*
DIV.D  F10, F0, F6    (40)*
ADD.D  F6, F8, F2     (2)*
* the number represents the execution latency of the corresponding instruction. (For example, the L.D instruction starts execution at time t and completes execution at t+1.)
(a) Identify instruction pairs with data dependence, anti-dependence and output dependence.
(b) On the scoreboarding implementations, are there instructions that are stalled due to the WAR or WAW hazards? If yes, identify instruction pairs that cause the pipeline to stall.
(c) Describe how the Tomasulo algorithm resolves WAR and WAW hazards.
Ans:
(a) Data dependence: (1,4), (1,5), (2,3), (2,4), (2,6), (3,5), (4,6)
Anti-dependence: (4,6), (5,6)
Resource conflict: load
Output dependence: (1,6)
(b) Scoreboarding:
Instrs.                    Issue   Read OP.   Exe. Comp.   Write Result
L.D    F6, 34(R2)   (1)    1       2          3            4
L.D    F2, 45(R3)   (1)    5       6          7            8
MULT.D F0, F2, F4   (2)    6       9          11           12
SUB.D  F8, F6, F2   (2)    7       9          11           12
DIV.D  F10, F0, F6  (40)   8       13         53           54
ADD.D  F6, F8, F2   (2)    13      14         16           17
No WAR or WAW hazard stalls the CPU. The stalls are caused by RAW and resource hazards, not by WAR or WAW.
(c) The Tomasulo algorithm uses register renaming to remove WAR and WAW hazards, so they cause no stalls. With in-order issue, out-of-order execution and out-of-order completion it reduces the impact of hazards and improves performance.

(2003.10) (3) You are designing an embedded system where power, cost and complexity of implementations are the major design considerations. Which technique will you choose to exploit ILP, superscalar with dynamic instruction scheduling or VLIW? You need to justify your answer.
Ans: http://www.embedded.com/story/OEG20010222S0039
Superscalar with dynamic instruction scheduling: implementing Tomasulo's algorithm in the pipeline (in-order issue, out-of-order execution, out-of-order completion) adds hardware, which raises cost and power.
VLIW: scheduling is not done at run time. The compiler finds independent instructions and packs them into one large instruction (a fixed instruction packet), which lowers hardware complexity; the complexity moves into the compiling technique, for example branch scheduling.
Considering power, cost and complexity, superscalar with dynamic instruction scheduling consumes more power, so VLIW is the better fit for an embedded system.
(2003.10) (6) Explain how loop unrolling can be used to improve pipeline scheduling. Ans: Loop unrolling iteration independent instructions code stall pipeline stall performance (2003.03) (3) What is the problem that a Branch Target Buffer tries to solve? Ans: Branch target buffer (BTB) branch instruction address (branch PC) predicted address (predicted PC) BTB instruction decode stage instruction fetch stage branch fetch pipeline stall (2003.03) (7) Consider the execution of the following loop on a dual-issue, dynamically scheduling processor with speculation: Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Assume : (a) Both a floating-point and integer operations can be issued on every clock cycle even if they are dependent. (b) One integer functional unit is used for both the ALU operation and effective address calculations and a separate pipelined FP functional unit for each operation type. (c) A dynamic branch-prediction hardware (with perfect branch prediction), a separate functional unit to evaluate branch conditions and branch single issue (no delay branch). (d) The number of cycles of latency between a source instruction and an instruction consuming the result is one cycle for integer ALU, two cycles for loads, and three cycles for FP add. (e) There are four-entry ROB (reorder buffer) and unlimited reservation stations, two common data bus (CDB). (f) Up to two instructions of any type can commit per cycle. Create a table showing when each instruction issue, begin execution, write its result to the
CDB and commit for the first two iterations of the loop. Table Sample:
Iteration # Instruction Issue Executes Mem access Write-CDB Commit
Note: If your solution is based on the assumptions that are not listed above, please list them clearly. Ans: speculation
Iter.  Instruction           Issue  Executes  Mem access  Write-CDB  Commit
1      L.D    F0, 0(R1)      1      2         3           4          5
1      ADD.D  F4, F0, F2     1      5                     8          9
1      S.D    F4, 0(R1)      2      3                                9
1      DADDUI R1, R1, #-8    2      4                     5          10
1      BNE    R1, R2, Loop   6      7                                10
2      L.D    F0, 0(R1)      10     11        12          13         14
2      ADD.D  F4, F0, F2     10     14                    17         18
2      S.D    F4, 0(R1)      11     12                               18
2      DADDUI R1, R1, #-8    11     13                    14         19
2      BNE    R1, R2, Loop   15     16                               19
(2002.10) (3) List two architectural features that enhance the CPU performance by exploiting instruction level parallelism. Ans: [] http://en.wikipedia.org/wiki/Superscalar http://en.wikipedia.org/wiki/Very_long_instruction_word (1) Superscalar processor Superscalar clock rate CPU throughput CPI < 1 performance redundant functional units ( ALUmultiplier) cycle issue instructions
Tomasulos algorithm in-order issueout-of-order executionout-of-order completion compiler issue (2) Very long instruction word (VLIW) VLIW run time compiler independent instructions large instruction fixed instruction packet hardware complexity compiling technique branch scheduling complexity (2002.03) (2) Assume a five-stage integer pipeline with dedicated branch logic in the ID stage. Assume that writes to the register file occur in the 1st half of the clock cycle and reads to the register file occurs in the 2nd half. For the following code segment: Loop: LW R1, 0(R2) ADDI R3, R1, 100 SW 0(R2), R3 ADD R5, R5, R3 SUBI R4, R4, #4 SUBI R2, R2, #4 BNEZ R4, Loop (a) Show how the code sequence is scheduled on the described pipeline without register forwarding and branch prediction. Identify all the pipeline hazards and use s to indicate a stall cycle. Assume all memory accesses are cache hits. (b) Assume that the pipeline implements register forwarding and branch predict-taken scheme. If the initial value of R4 is 12 and the first instruction in the loop start execution at clock #1, how many cycle does it take to finish executing the entire loop? Note: Please write down all the additional architectural assumptions that your answers are based on. Ans:
(a) Without register forwarding and branch prediction (S = stall cycle):
                      Cycles
Instrs.  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
1        F  D  E  M  W
2           F  D  S  S  E  M  W
3              F  S  S  D  S  S  E  M  W
4                       F  S  S  D  E  M  W
5                                F  D  E  M  W
6                                   F  D  E  M  W
7                                      F  S  D  E  M  W
(b) With register forwarding and the branch predict-taken scheme:
                      Cycles
Instrs.  1  2  3  4  5  6  7  8  9 10 11 12
1        F  D  E  M  W
2           F  D  S  E  M  W
3              F  S  D  E  M  W
4                    F  D  E  M  W
5                       F  D  E  M  W
6                          F  D  E  M  W
7                             F  D  E  M  W
With R4 initially 12, the loop executes 3 times. The first iteration completes at cycle 12, and each later iteration adds 7 instructions plus 1 stall cycle: 12 + 2 × (7 + 1) = 28 cycles.

(2001.03) (2) In what condition can we move an instruction from one possible path of a branch to the basic block ahead of the branch without affecting the other part of the code or causing any false exception?
Ans: When the branch prediction accuracy is 100%, code from the predicted path can be moved ahead of the branch (prefetched) without causing a false exception or a pipeline stall.

(2001.03) (3) Can you describe a hardware mechanism for solving the false exception problem in problem 2?
Ans: No predictor is 100% accurate, so Tomasulo's algorithm with speculation is used to avoid false exceptions. Instructions after the branch execute speculatively according to the branch prediction, their results are held in the reorder buffer, and registers and memory are updated only at the commit stage, once the predicted code is known to be on the correct path.

(2000.10) (3) It seems that the performance of a pipelined processor increases as the number of pipeline stages increases. Can you discuss the major factors that limit the number of pipeline stages in processor design?
Ans: As the number of pipeline stages grows, pipeline hazards (structure hazard, control hazard, and data hazard) cost more stages per stall, so adding pipeline stages eventually stops improving performance.
(1999.10) (2) Can you describe how the reorder buffer works? How does the reorder buffer materialize the precise interrupt? Ans: (1) Reorder Buffer circular array i. program array tail ii. entry location iii. Buffer head non-speculative instruction commit Buffer (2) reorder buffer in-order commit commit register I/O interrupt reorder buffer head flush reservation stations interrupt (1999.03) (3) (a) What data dependences are checked by the scoreboard mechanism during instruction scheduling? (b) What data dependences are checked by the Tomasulos mechanism? Ans: (a) [] http://en.wikipedia.org/wiki/Scoreboarding Scoreboarding IssueRead operandsExecutionWrite results 4 stages i. Issue current instruction registers output dependencies (WAW hazards) registers stall issue fuctional units busy current instruction stall (structural hazard) ii. Read operands issue allocate hardware module operands available true data dependencies (RAW hazards) registers unavailable available iii. Execution operands fetch functional units scoreboard iv. Write results destination register delay destination register earlier instructions read operands anti dependencies (WAR
hazards) (b) [] http://en.wikipedia.org/wiki/Tomasulo_algorithm Tomasulos algorithm without speculation Issue ExecutionWrite results 3 stages i. Issue operands reservation stations ready issue stall register renaming WAR WAW hazards ii. Execution delay operands available RAW hazards iii. Write results ALU operations registers store operations memory (1999.03) (4) (a) If a superscalar microprocessor designer decides to use the scoreboard mechanism instead of the Tomasulos mechanism due to hardware costs consideration, what aer the designers main reasons? (b) Implementing the scoreboard mechanism instead of the Tomasulos mechanism means that performance of the superscalar microprocessor is sacrificed, can you find some ways to make up loss of performance, including possibly adding more hardware? Ans: (a) Tomasulos mechanism reservation stations register renaming WAR WAW hazards scoreboard mechanism designer scoreboard mechanism (b) scoreboard mechanism function units 1unit for store/load, 2units for multiplication, 1 unit for adding, 1 unit for division process (ex. load) free function unit structure hazard performance Function units structure hazards performance scoreboard mechanism register rename WAW hazard WAR hazard stall register rename stall performance
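(1998.10) (2) Can you explain why branch prediction is so important in modern microprocessor design?
Ans: Without branch prediction, instruction fetch cannot continue past a branch until the branch is resolved, so every branch costs cycles in which fetch stalls. Branch prediction lets the CPU keep fetching along the predicted path, which keeps the pipeline full and improves CPU performance.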
Memory Hierarchy
(2007.10) (3) Explain virtual memory and how a modern processor accesses memory space. What are the functions of MMU and TLB here?
Ans:
(1) Reference: http://en.wikipedia.org/wiki/Virtual_memory
i. Virtual memory is a computer system technique that gives a program the impression of a contiguous memory space, while the space may actually be split into fragments of physical memory, with inactive program data swapped out of physical memory.
ii. A modern processor accesses memory space in two steps: (i) address translation from the virtual address to the physical address; (ii) the physical address is used to access the cache and, on a cache miss, main memory.
(2) Memory management unit (MMU): the hardware device that maps virtual addresses to physical addresses.
Translation look-aside buffer (TLB): a cache of virtual address to physical address mappings. On a TLB hit the physical address is obtained directly; otherwise the page table must be consulted to find the physical address. The TLB therefore speeds up address translation.

(2007.10) (6) The following code multiplies two matrices B and C. (a) Compute the cache miss ratio for the code on a 16K, two-way set-associative cache with LRU replacement and write-allocate/write-back policy. The cache block size is 16 bytes. Assume that none of the array elements exists in the cache initially. (b) Which type of cache misses dominates in this case? (compulsory, conflict, or capacity). Can the cache miss ratio be reduced by rewriting the code? Briefly explain why the new code has fewer cache misses.
    int A[32][32]; /* 4KB, starting at address 0x20000 */
    int B[32][32]; /* 4KB, starting at address 0x30000 */
    int C[32][32]; /* 4KB, starting at address 0x40000 */
    for (i = 0; i < 32; i = i+1)
        for (j = 0; j < 32; j = j+1)
            A[i][j] = B[i][j] * C[i][j]; /* the memory access order is B, C, A */
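Ans:
(a) i. The cache is two-way set-associative, so the number of sets is 2^14 / (2^4 × 2) = 2^9 and the index is 9 bits.
ii. The three arrays total 4K × 3 = 12K < 16K (the cache size), so there are no capacity misses.
iii. The array start addresses decompose as follows:
    Array  Address  Tag     Index      Block offset
    A      0x20000  010000  000000000  0000
    B      0x30000  011000  000000000  0000
    C      0x40000  100000  000000000  0000
In every iteration the elements of A, B and C map to the same cache index, but a set holds only two blocks. With LRU replacement and the access order B, C, A, every access misses, either as a compulsory miss or as a conflict miss. One block (16B) holds 4 elements (4B each), so the cache index advances by 1 every 4 iterations. Each group of 4 iterations makes 12 accesses and all 12 miss: the first access to each block is a compulsory miss and the remaining ones are conflict misses. The miss ratio is 32 × 8 × 12 / (32 × 32 × 3) = 100%, where 32 × 8 is the number of 4-iteration groups and 32 × 32 × 3 is the total number of accesses to A, B and C over the i and j loops.
(b) Conflict misses dominate. The block size is 16B, so a block holds 4 array elements. The code can be rewritten with two extra arrays, temp1 and temp2, that receive 4 elements of B and of C at a time, so that each block is brought in once per 4 iterations and only the compulsory misses remain. This assumes temp1 and temp2 do not map to the same cache index as A, B and C, so they cause no conflict misses with them. The new code:
    for (i = 0; i < 32; i = i+1) {
        for (j = 0; j < 32/4; j = j+1) {
            for (k = 0; k < 4; k = k+1) temp1[i][j*4+k] = B[i][j*4+k];
            for (l = 0; l < 4; l = l+1) temp2[i][j*4+l] = C[i][j*4+l];
            for (m = 0; m < 4; m = m+1) A[i][j*4+m] = temp1[i][j*4+m] * temp2[i][j*4+m];
        }
    }
Each group of 4 elements now costs 5 misses (one each for temp1, temp2, A, B and C), so the miss ratio is 32 × 8 × 5 / (32 × 32 × 3) = 41.67%.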
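The 100% miss ratio in part (a) can be checked with a small simulation. The sketch below is illustrative only: the function names `simulate` and `matrix_accesses` are made up for this example, and it models nothing beyond tags, sets and LRU order for the cache parameters given in the question.

```python
from collections import OrderedDict

def simulate(accesses, cache_bytes=16 * 1024, ways=2, block_bytes=16):
    """Return (misses, total) for a set-associative cache with LRU replacement."""
    num_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = total = 0
    for addr in accesses:
        block = addr // block_bytes
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        total += 1
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[tag] = True
    return misses, total

def matrix_accesses(n=32, elem=4, base_a=0x20000, base_b=0x30000, base_c=0x40000):
    """Yield byte addresses in the order B, C, A for A[i][j] = B[i][j] * C[i][j]."""
    for i in range(n):
        for j in range(n):
            offset = (i * n + j) * elem
            yield base_b + offset
            yield base_c + offset
            yield base_a + offset

misses, total = simulate(matrix_accesses())
print(misses, total, misses / total)
```

Because the three base addresses differ by a multiple of the 8 KB covered by one way, every B, C, A triple lands in the same set, which is what drives the result.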
40
(2007.03) (2) For the following memory system parameters:
Cache: physically addressed, 32KB, direct-mapped, 32B block size
Virtual Memory: 4K page, 1 GB virtual address space
TLB: fully associative, 32 entries
64 MB of addressable physical memory
Sketch a block diagram of how to determine whether a memory request is a cache hit. Be sure to label exactly how many bits are in each field of the TLB/cache architecture and which/how many address bits go where. You only need to show the tag and data fields of the TLB/cache architectures.
Ans:
Page offset: 4K page = 2^12, so 12 bits.
Virtual address: 1 GB virtual space = 2^30, so 30 bits.
Physical page number: addressable physical memory / page size = 2^26 / 2^12 = 2^14, so 14 bits.
[Figure: block diagram of the TLB and cache lookup. Recoverable labels: Virtual Address with a 12-bit Page offset; Page Table; TLB Hit; Physical Address with a 14-bit physical page number, an 11-bit Physical Tag and a 5-bit block offset; tag comparator (=); Cache Hit; Cache.]
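The field widths for this kind of diagram follow mechanically from the memory system parameters. The sketch below is a hedged illustration: `field_widths` and `log2_exact` are hypothetical helper names, and all sizes are assumed to be powers of two. It is evaluated for the direct-mapped cache of this question and for the same parameters with a 2-way cache.

```python
def log2_exact(n):
    """Return log2(n) for a power of two, or raise if n is not one."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1

def field_widths(cache_bytes, ways, block_bytes, page_bytes, virt_bytes, phys_bytes):
    """Bit widths of the address fields for a physically addressed cache."""
    block_offset = log2_exact(block_bytes)
    index = log2_exact(cache_bytes // (block_bytes * ways))
    phys_bits = log2_exact(phys_bytes)
    page_offset = log2_exact(page_bytes)
    return {
        "page_offset": page_offset,
        "virtual_page_number": log2_exact(virt_bytes) - page_offset,
        "physical_page_number": phys_bits - page_offset,
        "block_offset": block_offset,
        "index": index,
        "physical_tag": phys_bits - index - block_offset,
    }

KB, MB, GB = 2**10, 2**20, 2**30
direct_mapped = field_widths(32 * KB, 1, 32, 4 * KB, 1 * GB, 64 * MB)
two_way = field_widths(32 * KB, 2, 32, 4 * KB, 1 * GB, 64 * MB)
print(direct_mapped)
print(two_way)
```

Doubling the associativity removes one index bit and adds it to the tag, which is why the tag grows from 11 to 12 bits.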
(2007.03) (7) (2004.10) (7) What are the advantages of replicating data across caches for shared-memory multiprocessor processors? What kind of problems does it introduce?
(2003.10) (7) In the shared-memory multiprocessor processor, what are the advantages of replicating data across caches? What problem does it introduce?
Ans:
(1) In a shared-memory multiprocessor, each processor can keep its own copy of shared data in its cache, which reduces cache read misses.
(2) It introduces the cache coherency problem. For example, if processors A and B both hold a copy of X in their caches and one processor's code modifies X in its cache, the other cache and memory still hold the old value. The designer must add a mechanism such as a snooping protocol to maintain cache coherency.

(2006.10) (1) (a) There are three organizations for memory (i) one word memory organization, (ii) wide memory organization and (iii) interleaved memory organization. Briefly describe the three organizations and compare the hardware they need. (b) Assume that the cache block size is 32 words, the width of organization (ii) is eight words, and the number of banks in organization (iii) is four. If sending an address is 1 clock, the main memory latency for a new address is 10 cycles, and the transfer time for one word is one cycle, what are the miss penalties for each organization?
Ans:
(a) Reference: http://en.wikipedia.org/wiki/Memory_organisation
(i) One word memory organization: a 1-word-wide memory connected to the cache through a 1-word-wide bus.
(ii) Wide memory organization: the memory and the bus to the low-level cache are wider than 1 word (e.g. 4 words). A multiplexer between the low-level cache and a 1-word-wide bus feeds the high-level cache.
(iii) Interleaved memory organization: several 1-word-wide memory banks share a 1-word-wide bus to the cache and can be accessed at the same time. The lower k bits of the memory address select the bank, and the higher-order m bits select the location within the bank.
(b) (i) One word memory organization: sending the address takes 1 cycle. The block size is 32 words, so the total latency is 32 × 10 cycles, and transferring from memory to the cache takes 32 × 1 cycles. The miss penalty of the one-word-wide memory is 1 + 32 × 10 + 32 × 1 = 353 cycles.
(ii) Wide memory organization: the memory is 8 words wide, so 32 words need 4 accesses. The miss penalty is 1 + (32/8) × 10 + (32/8) × 1 = 45 cycles.
(iii) Interleaved memory organization: the miss penalty is 1 + (32/4) × 10 + 32 × 1 = 113 cycles.

(2006.10) (6) Suppose we wished to solve a problem of the following form:
    for (i = 0; i <= M; i++)
        compute the sum of N^2 numbers
The total time to solve this problem in a single processor is MN^2. Now suppose we had P processors (P <= N^2/2) and that the work of each iteration (i.e. computing the sum of a portion of the N^2 numbers) could be divided evenly among the processors. However, each processor must now add the result of its computation (each iteration) to a sum in shared memory. Moreover, suppose the processors must wait until all processors have completed an iteration before beginning the next iteration. Thus, there is some delay a*P on each iteration that is due to synchronization and access to the sum in shared memory. (a) What is the time needed to solve this problem using P processors? Explain why. (b) What is the speedup using P processors? (c) Plot the speedup over the range 1 <= P <= N^2/2. Label the axes and the endpoints of the speedup curve. (d) At what value of P does the speedup reach a maximum?
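Ans:
(a) With P processors the computation takes $MN^2/P$ in total. Each of the M iterations of the for loop also incurs a delay of $a \cdot P$, for a total of $M \cdot a \cdot P$. The time needed is therefore
$$T_P = M a P + \frac{M N^2}{P}$$
(b)
$$\text{Speedup} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{M N^2}{M a P + \frac{M N^2}{P}} = \frac{N^2}{a P + \frac{N^2}{P}} = \frac{P N^2}{a P^2 + N^2}$$
(c) Let $S = \dfrac{N^2}{aP + \frac{N^2}{P}} = \dfrac{P N^2}{aP^2 + N^2}$. By the quotient rule $\dfrac{d}{dx}\left(\dfrac{q(x)}{p(x)}\right) = \dfrac{p(x)\,q'(x) - q(x)\,p'(x)}{p(x)^2}$,
$$\frac{\partial S}{\partial P} = \frac{(N^2 + aP^2)\,N^2 - N^2 P \cdot 2aP}{N^4 + 2aP^2N^2 + a^2P^4} = \frac{N^4 - aN^2P^2}{N^4 + 2aP^2N^2 + a^2P^4}$$
The speedup is largest at $P = \sqrt{N^2/a} = N/\sqrt{a}$ [from (d)], where
$$S = \frac{N^2}{a \cdot \frac{N}{\sqrt{a}} + \frac{N^2}{N/\sqrt{a}}} = \frac{N^2}{2N\sqrt{a}} = \frac{N}{2\sqrt{a}}$$
[Figure: plot of the speedup S (vertical axis) against P (horizontal axis). The curve starts at $(1, \frac{N^2}{N^2 + a})$, peaks at $S = \frac{N^2}{2N\sqrt{a}}$ for $P = N/\sqrt{a}$, and ends at $(\frac{N^2}{2}, \frac{N^2}{2 + aN^2/2})$.]
(d) Setting the derivative to zero:
$$\frac{\partial S}{\partial P} = \frac{N^4 - aN^2P^2}{N^4 + 2aP^2N^2 + a^2P^4} = 0 \;\Rightarrow\; P = \sqrt{\frac{N^2}{a}} = \frac{N}{\sqrt{a}}$$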
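The closed-form maximum can be sanity-checked numerically. The sketch below evaluates the speedup P·N²/(a·P² + N²) over the whole range 1 <= P <= N²/2. The values N = 100 and a = 4 are assumptions picked only so that N/√a is an integer; they are not part of the exam question.

```python
from math import sqrt

def speedup(p, n, a):
    """Speedup P*N^2 / (a*P^2 + N^2) of P processors over one processor."""
    return p * n**2 / (a * p**2 + n**2)

n, a = 100, 4.0                      # hypothetical values for illustration
p_max = n**2 // 2                    # upper end of the range in the question
best_p = max(range(1, p_max + 1), key=lambda p: speedup(p, n, a))

print(best_p, speedup(best_p, n, a))          # brute-force maximum
print(n / sqrt(a), n / (2 * sqrt(a)))         # closed form: P = N/sqrt(a), S = N/(2*sqrt(a))
print(speedup(1, n, a), speedup(p_max, n, a)) # endpoints of the curve
```

With these values the right-hand endpoint is below 1, so at P = N²/2 the synchronization delay makes the parallel version slower than a single processor.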
(2006.03) (2) To overlap the cache access with TLB access as shown in the figure, how do we design the cache?
[Figure: block diagram with the labels CPU, Cache, TLB, PA and MEM.]
Ans: To overlap the cache access with the TLB access, address translation and the cache lookup must proceed at the same time, which shortens the cache hit time. This is done with a virtually indexed and physically tagged cache: the virtual address is used to index the cache while the TLB translates. The cost is aliasing: several virtual addresses that map to the same physical address may occupy different cache blocks. Aliasing can be handled by hardware anti-aliasing or by OS page coloring.

(2005.10) (3) (a) Do we need a non-blocking cache for an in-order issue processor? Why? (b) What is a virtually-indexed, physically-tagged cache? What problem does it try to solve? (c) Describe one method to eliminate compulsory misses. (d) Describe one method to reduce cache miss penalty.
Ans:
(a) It depends on how the in-order issue processor executes. If it supports out-of-order execution and out-of-order completion, a non-blocking cache is useful: during a cache miss the cache keeps serving CPU requests, so the CPU does not stall and can execute independent instructions, which improves performance. If the processor uses in-order execution, a cache miss stalls the CPU until the data arrives from memory in the cache, so a non-blocking cache is not needed.
(b) A virtually indexed, physically tagged cache uses the virtual address to index the cache and the physical address as the cache tag. Reading the cache tag and address translation proceed in parallel, and the physical tag is then compared with the cache tag. It reduces the cache hit time.
(c) Exploit spatial locality (e.g. when accessing an array): a larger block size brings in neighboring data on each cache miss and reduces compulsory misses.
(d) Add a victim cache between the L1 and L2 caches. It holds the cache lines evicted from the L1 cache by conflict misses or capacity misses. On an L1 miss the victim cache is checked first; if the cache line is in the victim cache, the L2 cache does not have to be accessed, which reduces the L1 miss penalty.

(2005.10) (5) In small bus-based multiprocessors, what is the advantage of using a write-through cache over a write-back cache?
Ans: In small bus-based multiprocessors the bus bandwidth is sufficient. Under write invalidate, when a processor writes data, the copies in other processors' caches are invalidated. With a write-back cache, a later read miss by another processor requires the owning processor to write the dirty block back to memory first. With a write-through cache every update also updates memory, so a read miss can be served directly from memory. This lowers the read miss penalty at the cost of more bus traffic.

(2005.03) (1) Both machine A and B contain one-level on chip caches. The CPU clock rates and cache configurations for these two machines are shown in Table 1. The respective instruction/data cache miss rates in executing program P are also shown in Table 1. The frequency of load/store instructions in program P is 20%. On a cache miss, the CPU stalls until the whole cache block is fetched from the main memory. The memory and bus system have the following characteristics:
1. the bus and memory support 32-byte block transfer;
2. a 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking 1 bus clock cycle, and 1 bus clock cycle required to send an address to memory (assuming shared address and data lines);
3. assuming there is no cycle needed between each bus operation;
4. a memory access time for the first 4 words (16 bytes) is 250 ns, each additional set of four words can be read in 25 ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.
Table 1
                           Machine A                          Machine B
    CPU clock rate         800 MHz                            400 MHz
    I-cache configuration  Direct-mapped, 32-byte block, 8K   2-way, 32-byte block, 128K
    D-cache configuration  2-way, 32-byte block, 16K          4-way, 32-byte block, 256K
    I-cache miss rate      5%                                 1%
    D-cache miss rate      15%                                3%
    I-cache: Instruction cache, D-cache: Data cache
(a) What is the data cache miss penalty (in CPU cycles) for machine A? (b) Which machine is faster in executing program P and by how much? The CPI (Cycle per Instruction) is 1 without cache misses for both machine A and B.
Ans:
(a) One bus clock cycle = 1/200 MHz = 0.005 × 10^-6 s = 5 × 10^-9 s = 5 ns. On a cache miss a 32-byte cache line is fetched, as two memory accesses of 16 bytes each.
First 16 bytes: sending the address takes one clock cycle = 5 ns, the memory access takes 250 ns, and each 32-bit (4-byte) transfer takes 5 ns: 5 ns (send address) + 250 ns (memory access) + 4 × 5 ns (transfer time) = 275 ns.
Second 16 bytes: the read takes 25 ns and no new address is sent: 25 ns + 4 × 5 ns = 45 ns.
Bringing the block into the cache on a cache miss therefore takes 275 ns + 45 ns = 320 ns.
Machine A data cache miss penalty = 320 ns × 800 MHz = 256 cycles.
Machine B data cache miss penalty = 320 ns × 400 MHz = 128 cycles.
(b) Compare the CPU time of machines A and B:
CPU Time_A = IC_A × (1 + 5% × 256 cycles + 20% × 15% × 256 cycles) × 1/800 MHz
CPU Time_B = IC_B × (1 + 1% × 128 cycles + 20% × 3% × 128 cycles) × 1/400 MHz
(only the 20% load/store instructions access the D-cache). This gives 21.48 cycles × 1.25 ns = 26.85 ns per instruction on A and 3.048 cycles × 2.5 ns = 7.62 ns per instruction on B. If A and B have different instruction sets (e.g. RISC vs. CISC), IC_A and IC_B for program P differ and must be taken into account; if IC_A = IC_B, machine B is about 3.5 times faster.

(2005.03) (5) A virtual addressed cache removes the TLB from the critical path. However, it could cause aliases. Describe one technique to solve the aliasing problem.
Ans:
(1) Hardware: anti-aliasing guarantees that every cache block has a unique physical address. On a cache miss, the other cache locations where the same physical address could reside under a different virtual address are checked; a matching block is invalidated, so at most one copy of a physical address is in the cache.
(2) Software: the OS uses page coloring to restrict aliases. For a direct mapped cache of size 2^n, the OS forces the virtual address and the physical address to agree in their lowest n bits, so all virtual addresses that map to the same physical address fall in the same cache set.

(2005.03) (6) Why does the on-chip cache size continue to increase in modern processors?
Ans: A larger on-chip cache reduces capacity misses and raises the cache hit rate, so the CPU is served by the on-chip cache more often instead of accessing memory, which improves performance. This matters more as the speed gap between CPU and memory widens, although a larger cache tends to have a longer hit time.

(2004.10) (2) Describe how the victim cache scheme improves the cache performance.
Ans: A victim cache is a small buffer between the L1 cache and the L2 cache. It holds the cache blocks evicted from the L1 cache by conflict misses or capacity misses. On an L1 cache miss the victim cache is checked: if the block is there, the data is returned to the L1 cache without going to the L2 cache. Because of temporal locality, a block recently evicted from the L1 cache is likely to be needed again. An L1 cache miss that hits in the victim cache avoids the L2 cache access, which lowers the L1 cache miss penalty and improves performance.

(2004.10) (3) What is a non-blocking cache? Why is it important for a superscalar processor with dynamic instruction scheduling?
Ans:
(a) A non-blocking cache supports hit under miss or hit under multiple misses: during a cache miss it continues to serve CPU requests that are cache hits, which reduces the effective miss penalty.
(b) A superscalar processor with dynamic instruction scheduling uses in-order issue, out-of-order execution and out-of-order completion. During a cache miss the CPU can continue to execute independent instructions, which improves performance, but only with a non-blocking cache; otherwise the cache miss stalls the CPU.
(2004.03) (1) A virtual address cache is used to remove the TLB from the critical path on cache accesses. (a) To avoid flushing the cache during a context switch, one solution is to increase the width of the cache address tag with a process-identifier tag. Assume 64-bit virtual addresses, 8-bit process identifiers, and 32-bit physical addresses. Compare the tag size of a physical cache vs. a virtual cache for a 2-way, 64K cache with 32B blocks. (b) A virtual cache incurs the address aliasing problem. Explain what the address aliasing problem is. Explain how page coloring can be used to solve this problem for a direct-mapped cache.
Ans:
(a) block offset = $\log_2 32 = 5$ bits (32B blocks)
index = $\log_2 \dfrac{64 \times 2^{10}}{32 \times 2} = \log_2 \dfrac{2^{16}}{2^5 \times 2} = \log_2 2^{10} = 10$ bits
(1) The tag size of the physical cache = 32 − (10 + 5) = 17 bits.
(2) The tag size of the virtual cache = 64 − (10 + 5) + 8 = 57 bits.
(b) (1) Several virtual addresses can map to one physical address. If virtual addresses x and y map to the same physical address, a virtual cache may hold the data for x and y in two different cache lines. This is aliasing.
(2) For a direct mapped cache of size 2^n, page coloring forces the virtual address and the physical address to agree in their lowest n bits, so all virtual addresses that map to the same physical address fall in the same cache set.

(2003.10) (2) For the following memory system parameters:
Cache: physically addressed, 32KB, 2-way, 32B block size
Virtual Memory: 4K page, 1 GB virtual address space
TLB: fully associative, 40 entries
64 MB of addressable physical memory
Sketch a block diagram of how to determine whether a memory request is a cache hit. Be sure to label exactly how many bits are in each field of the TLB/cache architecture and which/how many address bits go where. You only need to show the tag and data fields of the TLB/cache architectures.
Ans:
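[Figure: block diagram of the TLB and cache lookup. Recoverable labels: Virtual Address with a 12-bit Page offset; Page Table; TLB Hit; Physical Address with a 14-bit physical page number and a 12-bit Physical Tag; block offset; tag comparator (=); Cache Hit; Cache.]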
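(2003.03) (2) What is the problem that a TLB tries to solve?
Ans: The TLB is a cache of page number translations: it records recently used virtual page number to physical page number mappings. Without a TLB, every virtual address would need a page table access to obtain the physical page number before the actual access. On a TLB miss the page table is consulted for the physical page number. The TLB thus speeds up address translation for both instruction and data accesses. Since translation still precedes a cache hit, a virtually indexed, physically tagged cache can overlap the TLB lookup with indexing the cache, so address translation does not lengthen the cache hit time.

(2003.03) (5) What is the aliasing problem using a virtual cache? Describe a mechanism to solve the aliasing problem.
Ans:
(1) Several virtual addresses can map to one physical address. If virtual addresses x and y map to the same physical address, a virtual cache may hold the data for x and y in two different cache lines. This is aliasing.
(2) The OS uses page coloring to restrict aliases. For a direct mapped cache of size 2^n, the virtual address and the physical address are forced to agree in their lowest n bits, so all virtual addresses that map to the same physical address fall in the same cache set.

(2003.03) (6) What is the memory wall problem? How does pre-fetching solve this problem? Should a pre-fetching mechanism be used with a blocking cache or a non-blocking cache? Why?
Ans:
(1) Memory access speed improves far more slowly than CPU speed, so the CPU increasingly waits on memory accesses. Even with a cache, instructions that access memory become the bottleneck. This is the memory wall problem.
(2) Pre-fetching can use a stream buffer: on a cache miss the missed block is fetched from memory and the following blocks are fetched into the stream buffer. On a later cache miss the data may be found in the stream buffer, which avoids the memory access. For sequential code or array accesses, the data needed after a cache miss is already in the stream buffer, which shortens the memory access time, reduces CPU stalls and improves performance.
(3) Pre-fetching brings data in ahead of a cache miss so that the CPU stalls less and performance improves. It should be used with a non-blocking cache: during a cache miss the CPU keeps executing independent instructions. With a blocking cache, a cache miss stalls the CPU until the data arrives, so pre-fetching cannot proceed in the background.

(2002.10) (4) What is a non-blocking cache?
Ans: A non-blocking cache supports hit under miss or hit under multiple misses: during a cache miss it continues to serve CPU requests that are cache hits, which reduces the effective miss penalty. With a non-blocking cache a cache miss does not stall the CPU, which can go on executing independent instructions. This fits processors with in-order issue, out-of-order execution and out-of-order completion.

(2002.10) (7) Find the cache miss rate for the following code segments. Assume an 8-KB direct-mapped data cache with 16-byte blocks. It is a write-back cache that does write allocate. Both a and b are double-precision floating-point arrays with 3 rows and 100 columns for a and 101 rows and 3 columns for b (each array element is 8B long). Let's also assume they are not in the cache at the start of the program.
    for (i = 0; i < 3; i++)
        for (j = 0; j < 100; j++)
            a[i][j] = b[j][0] * b[j+1][0];
Ans:
i. Total number of cache accesses: 3 × 100 × 3 = 900.
ii. Cache misses:
(1) Compulsory misses on a: with write-allocate, a miss on an element of a brings its block into the cache. A cache block is 16B and an element is 8B (double), so each miss covers 2 elements, giving 3 × 50 = 150 cache misses.
(2) Compulsory misses on b: misses occur only when i = 0. Over the 100 iterations, b[j][0] misses once and b[j+1][0] misses 100 times, giving 101 misses.
(3) Capacity misses: none. The cache has 8K / 16B = 512 cache lines, and a and b occupy only 150 + 101 = 251 lines.
(4) Conflict misses: none, assuming the placement of a and b in memory does not make them collide in the cache.
Miss rate: (150 + 101) / 900 = 27.89%.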
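The count of 251 misses out of 900 accesses can be reproduced with a direct-mapped cache model. The sketch below is illustrative only: it assumes a starts at address 0 and b follows it immediately (the question does not give addresses), so the two arrays never collide in the 512 cache lines. The helper names are made up for this example.

```python
def direct_mapped_misses(addresses, cache_bytes=8 * 1024, block_bytes=16):
    """Return (misses, total) for a direct-mapped cache with write allocate."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines              # None marks an empty line
    misses = total = 0
    for addr in addresses:
        block = addr // block_bytes
        line, tag = block % num_lines, block // num_lines
        total += 1
        if tags[line] != tag:
            misses += 1
            tags[line] = tag               # allocate on both read and write misses
    return misses, total

def loop_accesses(base_a=0, base_b=3 * 100 * 8, elem=8):
    """Addresses touched by a[i][j] = b[j][0] * b[j+1][0], with b being 101 x 3."""
    for i in range(3):
        for j in range(100):
            yield base_b + (j * 3) * elem          # read b[j][0]
            yield base_b + ((j + 1) * 3) * elem    # read b[j+1][0]
            yield base_a + (i * 100 + j) * elem    # write a[i][j]

misses, total = direct_mapped_misses(loop_accesses())
print(misses, total, round(100 * misses / total, 2))
```

A different placement of a and b could add conflict misses, which is why item (4) of the answer depends on the memory layout.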
(2002.03) (3) (a) A virtually indexed and physically tagged cache is often used to avoid address translation before accessing the cache. One design is to use the page offset to index the cache while sending the virtual part to be translated. The limitation is that a direct-mapped cache can be no bigger than the page size. Describe how we can use the page coloring technique to remove this limitation. (b) Your task is to choose the block size of an 8K direct-mapped cache to optimize the memory performance for the following code sequence:
    for (j = 0; j < 5000; j++)
        for (i = 0; i < 5000; i++)
            X[i][j] = 2 * X[i][j];
Assume that each array element is 4 bytes. The available cache block size is 16, 32, 64 and 256 bytes. Which one will you choose? Why? (c) Compute the cache miss ratio for the following code sequence on an 8K, two-way set-associative cache with LRU replacement and write-allocate/write-back policy. The cache block size is 16 bytes. Assume that none of the array elements exists in the cache initially.
    int a[32][32]; /* 4KB, at address 0x2000 */
    int b[32][32]; /* 4KB, at address 0x3000 */
    int c[32][32]; /* 4KB, at address 0x4000 */
    for (i = 0; i < 32; i = i+1)
        for (j = 0; j < 32; j = j+1)
            a[i][j] = b[i][j] * c[i][j]; /* the memory access order is b, c, a */
Ans:
(a) For a direct mapped cache of size 2^n, page coloring forces the virtual address and the physical address to agree in their lowest n bits. The bits beyond the page offset that are needed to index the cache are then identical in the virtual and the physical address, so more bits than the page offset can be used to index the cache. The cache can exceed the page size and is accessed as fast as a virtual cache while behaving like a physical cache.
(b) Choose 16 bytes. The code traverses the array in column major order while the array is stored in row major order, so there is no spatial locality to exploit. A larger cache block size would only fetch unused data and would not improve performance.
(c) i. The number of sets is 2^13 / (2^4 × 2) = 2^8, so the index is 8 bits.
ii. The array start addresses decompose as follows:
    Array  Address  Tag  Index     Block offset
    a      0x2000   010  00000000  0000
    b      0x3000   011  00000000  0000
    c      0x4000   100  00000000  0000
In every iteration the elements of a, b and c map to the same cache index, but a set holds only two blocks, so every access is either a compulsory miss or a conflict miss. The cache index advances by 1 every 4 iterations, and all 12 accesses of each group of 4 iterations miss. The miss ratio is 32 × 8 × 12 / (32 × 32 × 3) = 100%.

(1999.10) (3) Can you parallelize the following two loops? If yes, show how you do it. If no, give your reason.
(a) for (i = 1; i < 100; i++) {
        a[i] = b[i] + c[i];
        c[i+1] = a[i] * d[i+1];
    }
(b) for (i = 1; i <= 100; i++) {
        a[i] = b[i] + c[i];
        c[i+1] = a[i+1] * d[i+1];
    }
Ans: loop processor processor A B
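loop (a)(b) A B c[i+1]

(1998.10) (4) Can you discuss some software and compiler techniques that can improve the performance of a cache system?
Ans:
(1) Loop interchange: reorder nested loops so that memory is accessed in the order in which it is stored (row major), which reduces compulsory misses.
(2) Loop fusion: merge loops that touch the same data so that the data is reused while it is still in the cache, which reduces compulsory misses.
(3) Blocking: operate on a submatrix whose elements fit in the cache instead of on whole rows or columns, which reduces capacity misses. The submatrix size is chosen according to the cache size.
(4) Software prefetching: the compiler inserts pre-fetch instructions that bring data in before it is needed, which reduces compulsory misses.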
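The effect of loop interchange can be illustrated by counting misses for the two traversal orders of a row-major array. The sketch below is a toy model, not a measurement: the 1 KB direct-mapped cache, the 16-byte blocks and the 64 × 64 array of 4-byte elements are assumptions chosen to keep the numbers small.

```python
def count_misses(addresses, cache_bytes=1024, block_bytes=16):
    """Count misses in a direct-mapped cache for a sequence of byte addresses."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        line, tag = block % num_lines, block // num_lines
        if tags[line] != tag:
            misses += 1
            tags[line] = tag
    return misses

def traversal(n, row_major, elem=4):
    """Addresses of x[i][j] for a row-major n x n array, in the given loop order."""
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if row_major else (inner, outer)
            yield (i * n + j) * elem

n = 64
column_order = count_misses(traversal(n, row_major=False))  # j outer, i inner
row_order = count_misses(traversal(n, row_major=True))      # after loop interchange
print(column_order, row_order)
```

In this model the interchanged loop misses once per block, while the column-order loop misses on every access because consecutive accesses are a full row apart and keep evicting each other.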
A typical implementation of a barrier can be done with two spin locks: one to protect a counter that tallies the processes arriving at the barrier and one to hold the processes until the last process arrives at the barrier. To implement a barrier we usually use the ability to spin on a variable until it satisfies a test; we use the notation spin (condition) to indicate this. The code below is a typical implementation, assuming that lock and unlock provide basic spin locks and total is the number of processes that must reach the barrier.
    lock (counterlock);            /* ensure update atomic */
    if (count == 0) release = 0;   /* first => reset release */
    count = count + 1;             /* count arrivals */
    unlock (counterlock);          /* release lock */
    if (count == total) {          /* all arrived */
        count = 0;                 /* reset counter */
        release = 1;               /* release processes */
    }
(2008.10)(3) multi-core CPU?
Ans: A multi-core processor (or chip-level multiprocessor, CMP) combines two or more independent cores (normally a CPU) into a single package composed of a single integrated circuit (IC), called a die, or more dies packaged together. A dual-core processor contains two cores, and a quad-core processor contains four cores. A multi-core microprocessor implements multiprocessing in a single physical package and can execute parallel and independent instructions. Using a multi-core CPU raises the following issues:
Parallel overhead: due to thread creation and scheduling.
Synchronization: excessive use of global data, contention for the same synchronization object.
Load balance: improper distribution of parallel work.
cache line 6
2 cache hit
6 cache hit
(2008.10)(6)
shared-memory multiprocessor interconnection network
Ans: Centralized shared-memory multiprocessor; multiple-step networks.
Centralized shared-memory multiprocessor: the processors share a single centralized memory and are interconnected with the memory by a bus. It is also known as a uniform memory access (UMA) or symmetric (shared-memory) multiprocessor (SMP), that is, all processors have a symmetric relationship to the memory and the memory access time is uniform from any processor. It has a scalability problem and is less attractive for large-scale multiprocessors.
Configurations:
1. Common bus
2. Multiple buses
3. Crossbar
A crossbar switch may be used to allow any processor to be connected to any memory unit simultaneously; however, a memory unit can only be connected to one processor at a time. Each processor has a communications path associated with it. Similarly, each memory unit has a communications path associated with it. A set of switches connects processor communications paths to memory unit communications paths. The switches are configured such that a processor communications path is only ever connected to a single memory unit communications path and vice versa. The structure of a crossbar switch is shown in Figure
Multiple-step networks: distributed-memory multiprocessor, with memory modules associated with the CPUs.
Advantages:
- cost-effective way to scale memory bandwidth
- lower memory latency for local memory access
Drawbacks:
- longer communication latency for communicating data between processors
- software model more complex
(2008.03)(1) What is Multithreading? Describe at least 2 approaches to Multithreading and discuss their advantages and disadvantages.
Ans: Multithreading: multiple threads share the functional units of 1 processor via overlapping.
Fine-Grained Multithreading: switches between threads on each instruction, causing the execution of multiple threads to be interleaved. This is usually done in a round-robin fashion, skipping any stalled threads. The CPU must be able to switch threads every clock.
Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls.
Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads.
Coarse-Grained Multithreading: switches threads only on costly stalls, such as L2 cache misses.
Advantages: it relieves the need for very fast thread switching, and it does not slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall.
Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen, and the new thread must fill the pipeline before instructions can complete.
Simultaneous multithreading (SMT): a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. An inspection of the basic characteristics of earlier proposed architectures compared to SMT reveals that the throughput advantage of the latter comes from sharing (i.e. avoiding partitioning) the underutilized resources among threads. Disadvantages of resource partitioning and how these can limit available parallelism in superscalar processors.
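(2008.03)(2) In Shared Memory Multiprocessors, a lock operation is used to provide atomicity for a critical section of code. The following are assembly-level instructions for this attempt at a lock and unlock.
    1. Lock:   Ld  register, location   /* copy location to register */
    2.         Cmp register, #0         /* compare with 0 */
    3.         Bnz Lock                 /* if not zero, try again */
    4.         St  location, #1         /* store 1 into location to mark it locked */
    5.         Ret                      /* return control to caller of lock */
    6. Unlock: St  location, #0
    7.         Ret
What is the problem with this lock? How to modify it?
Ans: Reading the lock and setting it are separate instructions, so two processors can both find the lock free and both take it. The lock should use an atomic test-and-set. One interleaving that breaks the lock:
    cycle              1  2  3  4  5  6  7  8
    P1 executes line   1  2  3           4
    P2 executes line            1  2  3     4
    location value     0  0  0  0  0  0  1  1
P1 executes lines 1 to 3 and finds location = 0, so the branch is not taken (NT). Before P1 executes line 4, P2 executes lines 1 to 3 and also finds location = 0 (NT). P1 then executes line 4 and stores 1. P2 has already passed the Bnz, so P2 also executes line 4 and stores 1. Both processors reach line 5 and return, each believing it holds the lock.
Test-and-set: tests a value and sets it if the value passes the test, as one atomic operation.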
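The race in the lock above can be replayed deterministically. The sketch below is a simplified model, not real hardware: each entry in a hypothetical `schedule` list lets one processor execute one line of the lock routine, and the second function collapses lines 1 to 4 into one indivisible test-and-set step.

```python
def run_broken_lock(schedule):
    """Step processors through the non-atomic lock; return who 'acquired' it."""
    location = 0
    pc = {"P1": 1, "P2": 1}           # next source line for each processor
    reg = {"P1": None, "P2": None}
    acquired = []
    for p in schedule:
        if pc[p] == 1:                # Ld register, location
            reg[p] = location
            pc[p] = 2
        elif pc[p] == 2:              # Cmp register, #0
            pc[p] = 3
        elif pc[p] == 3:              # Bnz Lock: retry if the lock looked taken
            pc[p] = 1 if reg[p] != 0 else 4
        elif pc[p] == 4:              # St location, #1
            location = 1
            pc[p] = 5
        elif pc[p] == 5:              # Ret: the caller now believes it holds the lock
            acquired.append(p)
            pc[p] = 6
    return acquired

def run_test_and_set(schedule):
    """Same schedule, but test and set happen in one indivisible step."""
    location = 0
    acquired = []
    for p in schedule:
        if p in acquired:
            continue
        old, location = location, 1   # atomic read-modify-write
        if old == 0:
            acquired.append(p)
    return acquired

# The interleaving described in the answer above.
schedule = ["P1"] * 3 + ["P2"] * 3 + ["P1", "P2", "P1", "P2"]
print(run_broken_lock(schedule))      # both processors end up inside
print(run_test_and_set(schedule))     # only the first one gets the lock
```

On a real machine the atomic step would be a hardware instruction such as test-and-set or an atomic exchange; the Python tuple assignment only stands in for it within this single-threaded replay.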
(2008.03)(3) Describe a four-state (MESI) write-back invalidation protocol for cache coherence. The MESI consists of 4 states: Modified (M) or dirty, exclusive-clean (E), Shared (S), and invalid (I).
Ans:
Shared: one or more processors have the block cached, and the value in memory is up to date (as well as in all the caches).
Invalid (Uncached): no processor has a copy of the cache block.
Modified: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block.
Exclusive-clean: exactly one cache has a copy of the block, and the block has not been written (the copy in the cache is the same as in memory).

(2007.03) (3) Please explain why it is usually easier to pre-fetch array data structures than pointers.
(2006.10) (2) Why is it easier to design a pre-fetching scheme for array-based than pointer-based data structures?
Ans: An array data structure has good spatial locality: the next elements are at predictable, consecutive addresses, so pre-fetching them avoids CPU stalls on memory and improves performance. Pointer-based data structures lack this spatial locality: the next address is known only after the pointer itself has been loaded, and if that load is a cache miss the CPU stalls before any pre-fetch can be issued. Pre-fetching is therefore easier for array data structures than for pointer-based data structures.

(2006.10) (7) System Organization
Consider a computer system configuration with a CPU connected to a main memory bank, input devices and output devices through a bus, as shown in the figure below. The CPU is rated at 10 MIPS, i.e., has a peak instruction execution rate of 10 MIPS. On the average, to fetch and execute one instruction, the CPU needs 40 bits from memory, 2 bits from an input device and sends 2 bits to an output device. The peak bandwidth of the input and output devices is 3 Megabytes per second each. The bandwidth of the memory bank is 10 Megabytes per second, and the bandwidth of the bus is 40 Megabytes per second.
(a) What is the peak instruction execution rate of the system as configured below?
(b) What unit is the bottleneck in the system?
(c) Suggest several ways in which you might modify the bottleneck unit so as to improve the instruction execution rate.
(d) Using only units with specifications as given in the problem statement, show a system configuration (redraw Figure 1) where the CPU is likely to be the bottleneck. Briefly justify your answer.
[Figure 1: CPU, Main Memory, Input Device and Output Device connected through a bus.]
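Ans:
(a) The CPU needs 40 bits from memory per instruction and the memory delivers 10 Megabytes per second: 10 × 1024 × 1024 / (40/8) ≈ 2 × 10^6, i.e. 2 MIPS. The input and output devices each support 3 × 1024 × 1024 / (2/8) ≈ 12 × 10^6, i.e. 12 MIPS. The CPU peak execution rate is 10 MIPS. The peak instruction execution rate of the system is therefore 2 MIPS.
(b) CPU = 10 MIPS, Memory = 2 MIPS, Input/Output Device = 12 MIPS, so the bottleneck is the memory.
(c) i. Add a cache to reduce the miss penalty of going to memory and improve performance.
ii. Add a TLB to speed up virtual address to physical address translation.
(d) Rearrange the memory, bus and devices so that each of them can supply more than the CPU consumes. The system is then CPU bound: memory and the input and output devices all keep up, and the CPU becomes the bottleneck.
[Figure: redrawn configuration with Main Memory, CPU, Input Device and Output Device.]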
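The bottleneck analysis in (a) and (b) is a minimum over per-unit instruction rates. The sketch below redoes that arithmetic under the same reading as the answer above (1 Megabyte = 1024 × 1024 bytes, bus left out of the comparison); the dictionary names are illustrative only.

```python
MB = 1024 * 1024

bits_per_instruction = {"memory": 40, "input": 2, "output": 2}
bandwidth_bytes_per_s = {"memory": 10 * MB, "input": 3 * MB, "output": 3 * MB}

# Instructions per second each unit can sustain; the CPU is rated at 10 MIPS.
limits = {"cpu": 10e6}
for unit, bits in bits_per_instruction.items():
    limits[unit] = bandwidth_bytes_per_s[unit] / (bits / 8)

bottleneck = min(limits, key=limits.get)
for unit, rate in limits.items():
    print(f"{unit:6s} {rate / 1e6:6.2f} MIPS")
print("bottleneck:", bottleneck)
```

The same loop can be rerun with a different `bandwidth_bytes_per_s` to explore configurations for part (d).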
(2006.03) (6) Consider a bus-based symmetric shared-memory multiprocessor system which looks like the figure below.
Here we plan to use Sun's UltraSPARC III microprocessors to build a commercial server system which mainly runs an OLTP (On-Line Transaction Processing) application. The UltraSPARC III processor has a two-level cache hierarchy. Let us analyze the following simulated results to characterize the processor performance for this OLTP workload.
Note that the chart shows the sources for memory cycles per instruction (MCPI): true sharing, false sharing, cold (compulsory) misses, capacity misses, conflict misses, and instruction misses. The cycles per instruction (CPI) on a processor = Ideal_CPI + MCPI, where Ideal_CPI is the cycles per instruction under a perfect memory system (i.e. infinite fully-associative L1 cache, perfect pre-fetching, perfect caching, etc.)
(a) Explain what false sharing and true sharing mean. Why do they increase as the processor count increases?
(b) Assume the Ideal_CPI for the UltraSPARC III processor is 1. Suppose the OLTP workload is fully parallelized and has lots of threads (e.g. 256) to run in parallel on the system. Use the figure above to estimate the CPI and the speedup for the system running the OLTP workload with 1, 2, 4, and 8 processors.
(c) Based on your analysis, do you think the OLTP workload can scale beyond 8 processors? List all the reasons you can think of.
Ans:
(a) A cache line contains several words, and different processors may access the same block. Misses caused by this sharing are cache coherency misses, which are either true sharing or false sharing misses.
True sharing: processor A writes word X of a block and processor B then accesses word X. The cache coherency protocol must transfer the new value, so the miss is necessary.
False sharing: processor A writes word X of a block and processor B uses word Y of the same block. Processor B's copy of the block is invalidated although word Y did not change.
As the processor count increases, more processors share each block, so both false sharing and true sharing increase.
(b) From the figure, the MCPI for processor counts 1, 2, 4 and 8 is about 1.1, 1.5, 1.75 and 2.8. The CPI and speedup of the OLTP workload are:
    Processors  1              2               4                8
    CPI         1/1+1.1 = 2.1  1/2+1.5 = 2     1/4+1.75 = 2     1/8+2.8 = 2.925
    Speedup     2.1/2.1 = 1    2.1/2 = 1.05    2.1/2 = 1.05     2.1/2.925 = 0.72
where $\text{Speedup} = \dfrac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \dfrac{\text{CPI}_{old}}{\text{CPI}_{new}}$.
(c) No. As processors are added, true sharing misses and false sharing misses increase, and so does the MCPI. From (b), with 8 processors the CPI rises and the speedup is 0.72, so 8 processors are slower than 1 processor. Running the OLTP workload on more than 8 processors would lower performance further.

(2005.10) (4) Compare the pros & cons of a synchronous vs. asynchronous bus.
Ans:
(1) Synchronous bus:
Pros: the bus is fast and needs little logic.
Cons: (i) every device on the bus must run at the same clock rate; (ii) the bus must be short because of clock skew.
(2) Asynchronous bus:
Pros: (i) it accommodates devices of different speeds; (ii) the bus can be long because there is no clock skew problem.
Cons: it needs a handshaking protocol and buffers.

(2005.03) (4) Describe Flynn's classification of computers.
Ans: Flynn's classification:
(1) Single Instruction Single Data (SISD): a single processor.
(2) Single Instruction Multiple Data (SIMD): e.g. a vector processor; exploits data level parallelism.
(3) Multiple Instruction Single Data (MISD).
(4) Multiple Instruction Multiple Data (MIMD): e.g. a multiprocessor.
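(2004.03) (5) Describe one pre-fetching mechanism that works well for programs with sequential access pattern.
Ans: Use a stream buffer. On a cache miss, the missed block is fetched from memory into the cache and the following blocks are fetched into the stream buffer. On the next cache miss the stream buffer is checked; if the block is there, it moves from the buffer to the cache without a memory access. A program with sequential access then sees far fewer misses that go to memory.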
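The benefit for sequential access can be seen in a toy model of a stream buffer. The sketch below is a deliberate simplification: the cache never evicts, only the head of the buffer is checked, and `memory_stalls` and `depth` are names invented for this example. It counts how many misses have to wait for main memory.

```python
from collections import deque

def memory_stalls(blocks, depth):
    """Count misses served by main memory, with a stream buffer of given depth."""
    cache = set()            # simplification: the cache never evicts
    buffer = deque()         # prefetched block numbers, oldest first
    stalls = 0
    for b in blocks:
        if b in cache:
            continue
        if buffer and buffer[0] == b:
            # Miss in the cache but hit at the head of the stream buffer:
            # take the block from the buffer and prefetch one more.
            buffer.popleft()
            last = buffer[-1] if buffer else b
            buffer.append(last + 1)
        else:
            # Miss everywhere: fetch from memory and restart the stream.
            stalls += 1
            buffer = deque(range(b + 1, b + 1 + depth))
        cache.add(b)
    return stalls

sequential = list(range(100))
print(memory_stalls(sequential, depth=0))   # no stream buffer
print(memory_stalls(sequential, depth=4))   # 4-entry stream buffer
```

A non-sequential access pattern would keep restarting the stream, so the buffer helps mainly when the blocks are requested in increasing order.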
(2002.10) (2) Please describe how a snooping protocol maintains cache coherency in a multiprocessor environment.
Ans: In a snooping protocol, every processor monitors (snoops) the shared bus. Either a write-invalidate or a write-update protocol can be used to maintain cache coherency.
(1) Write-invalidate protocol
a. Write-through cache
i. Read miss: the requesting processor reads the data from memory into its cache.
ii. Write miss: the processor writes its cache and broadcasts an invalidate on the bus. Every other processor that holds a copy sees the invalidate on the bus and invalidates it. The requesting processor updates memory as well as its cache, so memory is always up-to-date.
b. Write-back cache
i. Read miss: the requesting processor places the request on the bus. If another processor's cache holds the data dirty, that cache supplies the data; otherwise memory supplies it.
ii. Write miss: the processor writes its cache and broadcasts an invalidate on the bus. Every other processor that holds a copy sees the invalidate on the bus and invalidates it. Only the requesting processor's cache line is updated; memory is updated when another processor has a read miss on the dirty line, so with a write-back cache a read miss is not always served by memory.
(2) Write-update protocol
a. Read miss: the requesting processor reads the data from memory.
b. Write miss: the requesting processor broadcasts the new data on the bus; memory and every processor that holds a copy take the broadcast data and update their copies.
(2002.03) (4) (a) Why is a snooping protocol more suitable for centralized shared-memory multiprocessor architectures than for distributed architectures? (b) P1, P2, and P3 are three nodes on a distributed shared-memory multiprocessor architecture, which implements a directory-based cache coherency protocol. Assume that the caches implement a write-back policy. P1 and P2 have cache block A initially. For the following event sequence, show the sharers and state transitions (shared, uncached and exclusive) of block A in the directory and the message types sent among the nodes:

Event               State      Sharer     Message-Type
Initial
P3 writes 10 to A
P1 read A

Ans: (a) A centralized shared-memory multiprocessor has uniform memory access (UMA): the processors reach memory over a shared bus. A distributed architecture has non-uniform memory access (NUMA): the processors are connected by an interconnection network and each processor has its own memory. Whether it is write-invalidate or write-update, a snooping protocol relies on every processor seeing each request that is broadcast on the bus, so it suits a centralized shared-memory multiprocessor, which is a small-scale machine.
(b)

Event               State      Sharer     Message-Type
Initial             Shared     {P1, P2}
P3 writes 10 to A   Exclusive  {P3}       Write Miss: P3 -> Home
                                          Invalidate: Home -> P1, P2
                                          Data Value Reply: Home -> P3
P1 read A           Shared     {P1, P3}   Read Miss: P1 -> Home
                                          Fetch: Home -> P3
                                          Data Write Back: P3 -> Home
                                          Data Value Reply: Home -> P1
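The directory transitions in answer (b) can be traced with a small model of the home node's directory entry for block A. This is an illustrative sketch only: the class `Directory` and its methods are hypothetical names, the message strings follow the ones used in the answer, and the handling of a write miss to an exclusive block (fetch/invalidate followed by a write back) is an assumption that the event sequence above does not exercise.

```python
class Directory:
    """Directory entry for one block, kept at the home node (illustrative)."""

    def __init__(self, sharers):
        self.sharers = set(sharers)
        self.state = "Shared" if self.sharers else "Uncached"

    def write(self, p):
        """Processor p writes the block; return the messages exchanged."""
        if self.state == "Exclusive" and self.sharers == {p}:
            return []                    # p already owns the block
        msgs = [f"Write Miss: {p} -> Home"]
        if self.state == "Exclusive":
            (owner,) = self.sharers      # assumed handling, see lead-in
            msgs.append(f"Fetch/Invalidate: Home -> {owner}")
            msgs.append(f"Data Write Back: {owner} -> Home")
        else:
            others = sorted(self.sharers - {p})
            if others:
                msgs.append(f"Invalidate: Home -> {', '.join(others)}")
        msgs.append(f"Data Value Reply: Home -> {p}")
        self.state, self.sharers = "Exclusive", {p}
        return msgs

    def read(self, p):
        """Processor p reads the block; return the messages exchanged."""
        if p in self.sharers:
            return []                    # hit in p's own cache
        msgs = [f"Read Miss: {p} -> Home"]
        if self.state == "Exclusive":
            (owner,) = self.sharers      # owner must write the dirty block back
            msgs.append(f"Fetch: Home -> {owner}")
            msgs.append(f"Data Write Back: {owner} -> Home")
        msgs.append(f"Data Value Reply: Home -> {p}")
        self.state = "Shared"
        self.sharers.add(p)
        return msgs

if __name__ == "__main__":
    a = Directory({"P1", "P2"})
    for event, msgs in (("P3 writes 10 to A", a.write("P3")),
                        ("P1 read A", a.read("P1"))):
        print(event, a.state, sorted(a.sharers), msgs)
```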
(2001.03) (4) What is the minimum number of states that a cache coherence protocol must have? Please describe the meanings of these states.
Ans: A cache coherence protocol needs at least three states:
(1) Exclusive state: when a processor writes a cache block in its cache, the block enters the exclusive state, and the copies of the block held elsewhere (in other processors' caches) are invalidated.
(2) Shared state: when a processor reads a data block of which other copies may exist, the cache block is in the shared state.
(3) Invalid state: when another processor writes the block, the copies in this processor's cache are invalidated; a block in the invalid state contains no data.
(2001.03) (5) Why do architecture designers employ cache coherence protocols that have more states than the answer you give in problem 4?
Ans: More states let the protocol distinguish more cases. Modified and Exclusive distinguish whether an exclusively held block differs from main memory; Owned and Shared distinguish whether a shared block differs from main memory. This avoids unnecessary bus transactions and write-backs.
[Note] (Wiki)
Modified: A cache line in the modified state holds the most recent, correct copy of the data. The copy in main memory is stale (incorrect), and no other processor holds a copy.
Owned: A cache line in the owned state holds the most recent, correct copy of the data. The owned state is similar to the shared state in that other processors can hold a copy of the most recent, correct data. Unlike the shared state, however, the copy in main memory can be stale (incorrect). Only one processor can hold the data in the owned state; all other processors must hold the data in the shared state.
Exclusive: A cache line in the exclusive state holds the most recent, correct copy of the data. The copy in main memory is also the most recent, correct copy of the data. No other processor holds a copy of the data.
Shared: A cache line in the shared state holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state as well. The copy in main memory is also the most recent, correct copy of the data, if no other processor holds it in the owned state.
Invalid: A cache line in the invalid state does not hold a valid copy of the data. Valid copies of the data can be either in main memory or in another processor's cache.
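The three-state protocol from the (2001.03) (4) answer can be sketched as a state machine for a single cache block under write-invalidate snooping. This is a minimal illustration under stated assumptions: the shared bus is modeled as a plain Python list that every cache scans, data values and write-backs are not modeled, and the class name `Cache` is hypothetical.

```python
class Cache:
    """State of one block in one processor's cache (illustrative)."""

    def __init__(self, name, bus):
        self.name = name
        self.state = "Invalid"
        self.bus = bus                   # list of all caches on the bus
        bus.append(self)

    def read(self):
        if self.state == "Invalid":      # read miss placed on the bus
            for other in self.bus:
                if other is not self and other.state == "Exclusive":
                    other.state = "Shared"   # owner supplies data, keeps a copy
            self.state = "Shared"

    def write(self):
        if self.state != "Exclusive":    # write miss, or write to a shared copy
            for other in self.bus:
                if other is not self:
                    other.state = "Invalid"  # snooped invalidate
            self.state = "Exclusive"

if __name__ == "__main__":
    bus = []
    p1, p2 = Cache("P1", bus), Cache("P2", bus)
    p1.read(); p2.read()
    print(p1.state, p2.state)            # both hold a shared copy
    p1.write()
    print(p1.state, p2.state)            # P1 exclusive, P2 invalidated
```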
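(2000.10) (2) Advanced vector processors may incorporate a mechanism called chaining to improve performance. Can you use an example to elaborate how chaining works and how it improves vector processor performance?
Ans: Chaining is forwarding applied to vector instructions (VIs), and it improves their performance. Consider two vector instructions:
(1) MULV.D V1,V2,V3
(2) ADDV.D V4,V1,V5
The second VI reads V1, which the first VI writes. Without chaining, the second VI cannot start until the first VI has finished writing all of V1. With chaining, each element of V1 is forwarded to the second VI as soon as the first VI produces it, so the second VI starts before the first VI completes, which improves performance. In the comparison of unchained and chained cycle counts, the 7 cycles and 6 cycles are the latencies of the functional units (the adder and the multiplier).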
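The gain from chaining can be estimated with a simple timing model. The sketch below is an assumption-laden illustration: it takes a vector length of 64 (not stated in the answer), uses the 7-cycle and 6-cycle latencies from the answer, and assumes each unit delivers one element per cycle after its start-up latency. Both formulas are symmetric in the two latencies, so it does not matter which unit has which latency.

```python
def unchained_cycles(n, mul_latency, add_latency):
    """ADDV.D starts only after MULV.D has produced all n elements."""
    return (mul_latency + n) + (add_latency + n)

def chained_cycles(n, mul_latency, add_latency):
    """Each product is forwarded to the adder as soon as it is available."""
    return mul_latency + add_latency + n

if __name__ == "__main__":
    n = 64                               # assumed vector length
    print("unchained:", unchained_cycles(n, 7, 6))
    print("chained:  ", chained_cycles(n, 7, 6))
```

Under these assumptions chaining hides one full pass over the vector, so the saving grows with the vector length.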
(1999.10) (4) Can you describe when the following bus transactions are issued in the MBus cache coherence protocol? (a) coherent read (b) coherent read and invalidate (c) invalidate
Ans:
(a) On a cache read miss, the cache issues a coherent read to load the cache line.
(b) With a write-allocate cache, on a write miss the cache issues a coherent read and invalidate, which invalidates the copies of the block in other caches and loads the cache line.
(c) On a cache write hit, the cache issues an invalidate to invalidate the copies of the block in other caches.
[Note] The five possible states of a data block are:
Invalid (I): Block is not present in the cache.
Clean exclusive (CE): The cached data is consistent with memory, and no other cache has it.
Owned exclusive (OE): The cached data is different from memory, and no other cache has it.
This cache is responsible for supplying this data instead of memory when other caches request copies of this data.
Clean shared (CS): The data has not been modified by the corresponding CPU since it was cached. Multiple CS copies and at most one OS copy of the same data could exist.
Owned shared (OS): The data is different from memory. Other CS copies of the same data could exist. This cache is responsible for supplying this data instead of memory when other caches request copies of this data. (Note: this state can only be entered from the OE state.)
The MBus transactions with which we are concerned are:
Coherent Read (CR): issued by a cache on a read miss to load a cache line.
Coherent Read and Invalidate (CRI): issued by a cache on a write-allocate after a write miss.
Coherent Invalidate (CI): issued by a cache on a write hit to a block that is in one of the shared states.
Block Write (WR): issued by a cache on the write-back of a cache block.
Coherent Write and Invalidate (CWI): issued by an I/O processor (DMA) on a block write (a full block at a time).
(1999.03) (6) A multiprocessor designer incorporates a write-invalidate cache coherence protocol. What does the designer need to do in order to guarantee sequential consistency?
Ans: [Ref] http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf
To guarantee sequential consistency on top of a write-invalidate cache coherence protocol, the designer must ensure that:
(1) memory operations are performed one at a time;
(2) each CPU issues its memory operations in program order;
(3) memory operations from different processors may be interleaved arbitrarily, but a memory access is not complete until all the invalidates it causes have completed.
(1998.10) (3) If an architecture designer implements a multiprocessor system that shares cache memory, what degree of parallel processing granularity is the machine designed for?
Ans: [Ref] http://en.wikipedia.org/wiki/Granularity
A multiprocessor system that shares cache memory is a chip multiprocessor (CMP). In parallel computing, granularity is the ratio of computation to communication. With fine granularity, little computation is done between communication events, which exposes more parallelism but raises synchronization and communication overheads. Because the communication cost in a CMP is low, the machine is designed for small granularity, i.e. fine-grained multithreading with frequent communication.
External Issues
(2008.10)(4) Power management power power management?
(2008.03)(4) Describe 4 approaches to reduce the power consumption of a microprocessor.
ANS: 2005 261 PDA power
- Thermal issue
- Environmental issue
Clock gating: Disabling the clock to a circuit whenever the circuit is not used.
Pipeline balancing: General-purpose processors contain more resources than a program needs. Exploit variations in IPC (instructions per clock cycle) to disable unused resources.
Power gating: Cut the supply to idle circuits through power gates (sleep transistors), as in Nehalem.
Resizable caches: Exploit the variability in applications' cache requirements to reduce the cache size and eliminate energy dissipation in the unused sections of the cache, with minimal impact on performance.
Phase detection & configuration selection: Detect the phases of a program and select a cache or core configuration for each phase.
Temporal approach: partition a program's execution into fixed intervals and monitor the phase at run time.
Positional approach: associate phases with code, e.g. loops or subroutines.
Block buffer: The most recently accessed cache line is kept in a line buffer, which is searched first, so that an access that hits in the line buffer does not have to search the cache memory.
Drowsy cache: A cache line can stay in a lower-power mode in which its content is preserved. Pros & cons: it does not reduce leakage as much as gated-Vdd, but has much smaller state-transition overheads.
(2008.10)(5) Digital signal processor (DSP) DSP General-purpose processor
ANS: A digital signal processor (DSP or DSP micro) is a specialized microprocessor designed specifically for digital signal processing, generally in real-time computing. Digital signal processing algorithms typically require a large number of mathematical operations to be performed quickly on a set of data. Signals are converted from analog to digital, manipulated digitally, and then converted again to analog form. Many DSP applications have constraints on latency; that is, for the system to work, the DSP operation must be completed within some time constraint.
DSP GPP 1 GPP GPP DSP DSP bits bits - DSP MAC 2 GPP . 4 DSP DSP MAC GPP GPP DSP DSP DSP DSP DSP 3 DSP DSP 1GPP GPP
4 DSP DSP DSP DSP 5 DSP FFT GPP 6 DSP GPP GPP DSP GPP DSP GPP DSP DSP 7 DSP DSP DSP DSP DSP MAC FIR GPP GPP C C++ DSP DSP C DSP DSP C DSP DSP DSP DSP
8 DSP DSP GPP GPP GPP DSP GPP
(2007.10) (1) System-on-Chip (SoC) is a recent trend in the design of computer systems. Please explain the concept of SoC and discuss the importance of SoC from all aspects (e.g. cost, time-to-development, etc.).
Ans: An SoC integrates processors (MPU, DSP), memories (RAM, ROM, flash) and other functional blocks on a single chip. Integrating what used to be separate ICs into one SoC lowers cost and shortens time to market for products such as DVD and MP3 players.
(2004.03) (7) How do disk arrays affect the performance, availability and reliability of the storage system compared to a single-disk system?
(2003.10) (8) How do disk arrays affect performance, reliability and availability?
Ans: [Ref] http://en.wikipedia.org/wiki/RAID
RAID improves reliability and I/O performance through mirroring, striping and error correction. Its effects on performance, availability and reliability are:
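(1) Performance vs. reliability: Mirroring writes the same data to more than one disk, which improves reliability. Striping spreads the data across the disks so that they are accessed in parallel, which improves performance but does not by itself improve reliability.
(2) Availability: When a failure occurs, a single-disk system must be shut down, whereas a disk array can use error correction (fault tolerance) to keep the machine running, which gives high availability.
(1999.03) (5) (a) Can you show how a level-5 RAID places the parity data? (b) When a level-5 RAID needs to write to a file block, how does it compute the new parity?
Ans: [Ref] http://web.mit.edu/rhel-doc/4/RH-DOCS/rhel-isa-zh_tw-4/s1-storage-adv.html
(a) RAID 5 stripes data across the disks as RAID 0 does, but in an array of n disks each stripe also holds one parity block, and the parity blocks are distributed across all n disks.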
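(b) [Ref] http://docs.sun.com/app/docs/doc/806-3204/6jccb3gad?l=zh_tw&a=view
1. When the write covers only some of the disk blocks of a stripe, RAID uses multiple I/Os (a read-modify-write sequence): the old data and the old parity are read into buffers, the new parity is computed from them and the new data, the new parity and data are written to a log, and then they are written to the stripe units.
2. When the write covers the disk blocks of a whole stripe, RAID XORs the new data blocks to compute the parity, writes the parity and data to the log, and then writes the parity and data to the stripe units.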
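The parity placement in (a) and the two parity computations in (b) can be sketched in Python. This is an illustration under assumptions rather than the layout of any particular RAID implementation: `parity_disk` uses one common rotation in which the parity block moves by one disk per stripe, logging is not modeled, and all function names are hypothetical.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_disk(stripe, n_disks):
    """Disk that holds the parity block of a stripe (assumed rotation)."""
    return (n_disks - 1 - stripe) % n_disks

def small_write_parity(old_data, new_data, old_parity):
    """Read-modify-write: new parity from old data, new data and old parity."""
    return xor_blocks(old_parity, old_data, new_data)

def full_stripe_parity(data_blocks):
    """Full-stripe write: parity is the XOR of all data blocks in the stripe."""
    return xor_blocks(*data_blocks)

if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = full_stripe_parity(stripe)
    new_block = b"\xaa\xbb"
    updated = small_write_parity(stripe[1], new_block, parity)
    # the small-write result matches recomputing parity over the whole stripe
    print(updated == full_stripe_parity([stripe[0], new_block, stripe[2]]))
    print([parity_disk(s, 4) for s in range(4)])
```

The small-write path touches only the target data block and the parity block, which is why it needs the extra reads of the old data and old parity.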