AMP Lab Manual
Advanced Microprocessor
T.E. Computer Semester: VI F.H. 2010
Experiment List
1. Study of internal components of CPU cabinet.
2. Write a program to simulate Pipelining Processing.
3. Write a program to simulate Superscalar / Super Pipeline Architecture.
4. Write a program to detect data dependency hazards.
5. Write a program to simulate Branch Prediction logic.
6. Write a program to implement Delayed Execution.
7. Write a program to implement Page replacement algorithm.
8. Write a program to implement CPUID instruction.
9. Study of SPARC Architecture (V8)
Experiment No: 1
Experiment name: Study of internal components of CPU cabinet.
Resources Required: P IV 2 GHz, 512 MB RAM, 40 GB HDD, 15" IBM Color Monitor, optical mouse, dot matrix printer.
Consumable: Printer paper.
Flowchart: Not applicable
Slot Packages: [Figure: processor packages. Recoverable labels: Intel 386 DX, Cyrix Cx486DLC, ... III (Slot 1)]
MotherBoard: It is the main unit inside the cabinet, on which all the components are mounted or to which they are connected. It is mainly described according to the processor slot/socket available on it. Motherboards come in several form factors such as AT, ATX, etc.

Processor slots/sockets:

Socket / Slot | Pin count | Type | Supported processors
Socket 1 | 169 | LIF/ZIF PGA | Intel i486; AMD Am5x86 133 (w/ voltage adaptor); Cyrix Cx5x86 100/120 (w/ voltage adaptor)
Socket 2 | 238 | LIF/ZIF PGA | Intel i486; Intel Pentium; AMD Am5x86 133 (w/ voltage adaptor); Cyrix 5x86 100/120 (w/ voltage adaptor)
Socket 3 | 237 | LIF/ZIF PGA | Intel i486; Intel Pentium; AMD Am5x86 133; Cyrix 5x86 100/120
Socket 4 | 273 | LIF/ZIF PGA | Intel Pentium P5 60/66; Intel Pentium OverDrive 120/133
Socket 5 | 296/320 | LIF/ZIF SPGA | Intel Pentium P54C 75-133; Intel Pentium MMX P55 166-233; AMD K5 PR75-133; AMD K6 166-300; Cyrix 6x86L PR120-166 (w/ voltage adaptor); Cyrix 6x86MX PR166-233 (w/ voltage adaptor); IDT Winchip
Socket 6 (uncommon) | - | - | Intel i486 DX4 75-120
Socket 7 / Super 7 | - | - | Intel Pentium P54C; Intel Pentium MMX P55; AMD K5 PR75-200; AMD K6; Cyrix 6x86; IDT Winchip
Socket 8 | 387 | LIF/ZIF PGA/SPGA dual pattern | Intel Pentium Pro 150-200; Intel Pentium II
Slot 1 | 242 | SECC, SECC2, SEPP | Intel Celeron; Intel Pentium Pro; Intel Pentium II; Intel Pentium III
Slot 2 | 330 | SECC | Intel Pentium II Xeon 400/450 (Drake); Intel Pentium III Xeon 500/550 (Tanner); Intel Pentium III Xeon 600-1GHz (Cascades)
Socket 370 | 370 | ZIF SPGA | Intel Celeron; Intel Pentium III; Cyrix III 533-667 (Samuel)
Slot A | 242 | SECC | AMD Athlon 500-700 (K7); AMD Athlon 550-1GHz (K75); AMD Athlon 700-1GHz (Thunderbird)
Socket A | 462 | ZIF SPGA | AMD Duron 600-950; AMD Athlon; AMD Sempron
Socket 423 | 423 | ZIF SPGA | Intel Pentium 4 1.3GHz; Intel Celeron 1.7GHz-1.8GHz
Socket 478 | 478 | ZIF PGA | Intel Celeron; Intel Pentium 4; Intel Pentium
LGA 775 | 775 | LGA | Intel Celeron D 325J (Prescott); Intel Pentium 4; Intel Pentium D; Intel Pentium Extreme
Socket 603/604 | 603/604 | ZIF PGA | Intel Xeon
PAC418 | 418 | VLIF | Intel Itanium 733-800MHz (Merced)
PAC611 | 611 | VLIF | Intel Itanium 2
Socket 754 | - | - | AMD Athlon 64; AMD Sempron 2600+-3300
Socket 940 | - | - | AMD Athlon 64 FX-51 - FX-53 (Sledgehammer); AMD Opteron 140-150 (Sledgehammer)
Socket 939 | - | - | AMD Athlon 64
Bus Slots: The various bus slots on the motherboard are:
ISA (Industry Standard Architecture)
PCI (Peripheral Component Interconnect)
AGP (Accelerated Graphics Port)
AMR (Audio Modem Riser)
The motherboard also carries the external connections for the onboard sound card, USB ports, serial and parallel ports, PS/2 ports for the keyboard and mouse, as well as network and Firewire connections.

RAM Slots: Several kinds of RAM module can be mounted on the motherboard:
a) SIMM (Single Inline Memory Module): supports EDO RAM
b) DIMM (Dual Inline Memory Module): supports SDRAM and DDR RAM
c) RIMM (Rambus Inline Memory Module): supports RDRAM

Cache Memory: Cache is an intermediate or buffer memory. The idea behind cache is that it should function as a nearby store of fast RAM from which the CPU can always be supplied. In practice there are always at least two such stores. They are called Level 1, Level 2, and (if applicable) Level 3 cache.

Level 1 cache is built into the actual processor core. It is a piece of RAM, typically 8, 16, 20, 32, 64 or 128 Kbytes, which operates at the same clock frequency as the rest of the CPU. Thus you could say the L1 cache is part of the processor. L1 cache is normally divided into two sections, one for data and one for instructions. For example, an Athlon processor may have a 32 KB data cache and a 32 KB instruction cache. If the cache is common to both data and instructions, it is called a unified cache.
The Level 2 cache is normally much bigger (and unified), such as 256, 512 or 1024 KB. The purpose of the L2 cache is to constantly read in slightly larger quantities of data from RAM, so that these are available to the L1 cache. The L2 cache is now integrated within the processor, which makes it function much better in relation to the L1 cache and the processor core. The Level 2 cache takes up a lot of the chip's die, as millions of transistors are needed to make a large cache. The integrated cache is made using SRAM (static RAM), as opposed to normal RAM, which is dynamic (DRAM).

Buses:

Bus | Description
PC-XT, from 1981 | Synchronous 8-bit bus which followed the CPU clock frequency of 4.77 or 6 MHz. Bandwidth: 4-6 MB/sec.
ISA (PC-AT), from 1984 | Simple, cheap I/O bus. Synchronous with the CPU. Bandwidth: 8 MB/sec.
MCA, from 1987 | Advanced I/O bus from IBM (patented). Asynchronous, 32-bit, at 10 MHz. Bandwidth: 40 MB/sec.
EISA, from 1988 | Simple, high-speed I/O bus. 32-bit, synchronised with the CPU's clock frequency: 33, 40, 50 MHz. Bandwidth: up to 160 MB/sec.
PCI, from 1993 | Advanced, general, high-speed I/O bus. 32-bit, asynchronous, at 33 MHz. Bandwidth: 133 MB/sec.
USB and Firewire, from 1998 | Serial buses for external equipment.
PCI Express, from 2004 | A serial bus for I/O cards with very high speed. Replaces PCI and AGP. 500 MB/sec per channel.
Experiment No:2
Subject: Advanced Microprocessor
Experiment name: Write a program in Java to simulate pipeline processing.
Aim: Simulation of pipeline
Resources Required: Internet, Books
Consumable: Printer paper.
Flowchart: Not applicable
Theory: Pipelining is the process of prefetching the next task while the current task is being executed. The task is divided into subtasks, and each stage of the pipeline executes one subtask. In an instruction pipeline, the next instruction is prefetched while the current instruction is executing. A high-level language can be used to simulate this behaviour.
Algorithm:
i) Start
ii) Display the vertical lines
iii) Display the instruction stages in the pipeline
iv) Move the instructions through the stages one by one
v) End
Conclusion: Thus we have successfully simulated the pipeline.
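The algorithm above can be sketched in Java as a text space-time diagram. This is only a minimal sketch: the five stage names (IF, ID, EX, MEM, WB), the class name PipelineSim and its method names are assumptions made for illustration, and the model assumes one instruction enters the pipeline per cycle with no stalls.

```java
import java.util.ArrayList;
import java.util.List;

class PipelineSim {
    // Assumed stage names (a classic five-stage pipeline); the manual does not fix them.
    static final String[] STAGES = {"IF", "ID", "EX", "MEM", "WB"};

    /** Stage occupied by instruction instr (0-based) in cycle cycle (0-based), or null. */
    static String stageAt(int instr, int cycle) {
        int s = cycle - instr; // instruction i enters the pipeline in cycle i
        return (s >= 0 && s < STAGES.length) ? STAGES[s] : null;
    }

    /** Cycles needed for n instructions: fill the pipeline once, then one finishes per cycle. */
    static int totalCycles(int n) {
        return n == 0 ? 0 : STAGES.length + n - 1;
    }

    /** One text row per instruction; cycles are separated by vertical lines. */
    static List<String> diagram(int n) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            StringBuilder row = new StringBuilder("I" + (i + 1) + " |");
            for (int c = 0; c < totalCycles(n); c++) {
                String s = stageAt(i, c);
                row.append(String.format("%-4s|", s == null ? "" : s));
            }
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : diagram(4)) {
            System.out.println(row);
        }
        System.out.println("Total cycles: " + totalCycles(4));
    }
}
```

Each printed row is one instruction and each column one clock cycle, so the staircase pattern shows the instructions moving through the stages one by one.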
Experiment No:3
Subject: Advanced Microprocessor
Experiment name: Write a program to simulate Superscalar and Super Pipeline.
Aim: Superscalar Execution
Resources Required: Internet, Books
Consumable: Printer paper.
Flowchart: Not applicable
Theory: In a superscalar architecture, the pipeline implementation provides parallelism: more than one instruction is executed at a time. A two-issue superscalar pipeline issues two instructions into the pipeline at a time, and a three-issue superscalar pipeline issues three. This type of pipelining increases the throughput of the processor. Nowadays, 8-issue superscalar designs have been developed. A superscalar processor fetches multiple instructions at a time and attempts to find nearby instructions that are independent of one another and can be executed in parallel. The essence of the superscalar approach is the ability to execute instructions independently in different pipelines.
In a super pipeline, many pipeline stages need less than half a clock cycle. Doubling the internal clock speed therefore completes two tasks per external clock cycle.
Algorithm:
i) Start
ii) Display the vertical lines
iii) Display the instruction stages in the pipeline
iv) Move two instructions at a time (2-issue superscalar)
v) In the super pipeline, each instruction takes less than one cycle per stage (each stage completes in half a cycle)
vi) End
Output
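The timing difference between the two schemes can be sketched with the usual idealized cycle-count model. This is a rough sketch under stated assumptions: the instructions are independent, the pipeline has k stages, n is at least 1, and the class name IssueModels and its method names are hypothetical, chosen only for this illustration.

```java
class IssueModels {
    /** Base scalar pipeline: k cycles to fill, then one instruction completes per cycle. */
    static int scalarCycles(int n, int k) {
        return k + n - 1;
    }

    /** m-issue superscalar: instructions enter in groups of m, one group per cycle. */
    static int superscalarCycles(int n, int k, int m) {
        int groups = (n + m - 1) / m; // ceil(n / m)
        return k + groups - 1;
    }

    /** Super pipeline of degree d: a new instruction enters every 1/d of a base cycle. */
    static double superpipelineCycles(int n, int k, int d) {
        return k + (n - 1) / (double) d;
    }

    /** Cycle (0-based) in which instruction i (0-based) is issued on an m-issue machine. */
    static int issueCycle(int i, int m) {
        return i / m;
    }

    public static void main(String[] args) {
        int n = 6, k = 4;
        System.out.println("Scalar:              " + scalarCycles(n, k) + " cycles");
        System.out.println("2-issue superscalar: " + superscalarCycles(n, k, 2) + " cycles");
        System.out.println("Super pipeline (x2): " + superpipelineCycles(n, k, 2) + " cycles");
        for (int i = 0; i < n; i++) {
            System.out.println("I" + (i + 1) + " issues in cycle " + issueCycle(i, 2));
        }
    }
}
```

With six instructions and four stages, the model gives 9 cycles for the scalar pipeline, 6 for the 2-issue superscalar pipeline and 6.5 base cycles for the degree-2 super pipeline.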
Experiment No:4
Experiment name: Write a program to detect data dependency hazards.
Resources Required: Internet, Books
Consumable: Printer paper.
Flowchart: Not applicable
Theory: Dependencies among instructions must be removed in order to implement instruction-level parallelism (ILP). There are three types of data dependency, which have to be identified and eliminated from the sequential flow of instructions:
True data dependency (flow dependency / RAW hazard). Eg:
R1 := R2 + R3
R4 := R1 - R5
Antidependency (WAR hazard). Eg:
R1 := R2 + R3
R2 := 6
Output dependency (WAW hazard). Eg:
R1 := R2 + R3
R1 := R5
Algorithm:
i) Start
ii) Accept the number of instructions
iii) Accept the source and destination for each instruction
iv) To check flow dependency, compare the destination of each instruction with the sources of the instructions that follow it, sequentially.
v) To check antidependency, compare the sources of each instruction with the destination of the instructions that follow it, sequentially.
vi) To check output dependency, compare the destination of each instruction with the destination of the others.
vii) Display the flow-dependent, anti-dependent and output-dependent instructions.
viii) Display the instruction stages in the pipeline
ix) Move two instructions at a time (2-issue superscalar); the three data dependency hazards are to be simulated
x) End
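The comparison steps of the algorithm can be sketched as follows. This is a minimal sketch, not a definitive implementation: the class name HazardDetector and its Instr type are hypothetical, each instruction is assumed to have one destination register and zero or more source registers, and every pair of instructions is compared, so a dependency is reported even when an intervening instruction overwrites the register.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HazardDetector {
    /** One instruction: a destination register and its source registers. */
    static class Instr {
        final String dest;
        final String[] srcs;

        Instr(String dest, String... srcs) {
            this.dest = dest;
            this.srcs = srcs;
        }

        boolean reads(String reg) {
            for (String s : srcs) {
                if (s.equals(reg)) return true;
            }
            return false;
        }
    }

    /** Compares every earlier instruction with every later one and lists the hazards found. */
    static List<String> detect(List<Instr> prog) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < prog.size(); i++) {
            for (int j = i + 1; j < prog.size(); j++) {
                Instr first = prog.get(i);
                Instr later = prog.get(j);
                String pair = " I" + (i + 1) + "->I" + (j + 1) + " on ";
                if (later.reads(first.dest)) found.add("RAW" + pair + first.dest);      // flow
                if (first.reads(later.dest)) found.add("WAR" + pair + later.dest);      // anti
                if (first.dest.equals(later.dest)) found.add("WAW" + pair + first.dest); // output
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // The three two-instruction examples from the theory section, in sequence.
        List<Instr> raw = Arrays.asList(new Instr("R1", "R2", "R3"), new Instr("R4", "R1", "R5"));
        List<Instr> war = Arrays.asList(new Instr("R1", "R2", "R3"), new Instr("R2"));
        List<Instr> waw = Arrays.asList(new Instr("R1", "R2", "R3"), new Instr("R1", "R5"));
        System.out.println(detect(raw));
        System.out.println(detect(war));
        System.out.println(detect(waw));
    }
}
```

An instruction that loads a constant, such as R2 := 6, is modelled with an empty source list.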
Output :
Experiment No:5
Experiment name: Write a program to simulate Branch Prediction logic.
Resources Required: Internet, Books
Consumable: Printer paper.
Flowchart: Not applicable
Theory: Prediction logic is used to minimize the penalty incurred due to branch instructions. Branch prediction reduces the time lost when the instruction queue has to be flushed and refilled again and again.
[Diagram: need for branch prediction logic]
The BTB (Branch Target Buffer) is a lookup table which has 256 entries (2^8 = 256, 2-way associative cache). Each entry holds:
Valid bit | Source Address | History bits | Target Address
The history bits can be in one of four states, on which the prediction is based:
00 ~ Strongly taken
01 ~ Weakly taken
10 ~ Weakly not taken
11 ~ Strongly not taken
Algorithm:
1. Look up the source address of the instruction in the lookup table.
a. if (Source Addr not found) // Instruction encountered for the first time
   Prediction is NO JUMP
   {
     if (branch) insert record into BTB with history bits 00
     else do nothing
   }
b. if (Source Addr found)
   Prediction is JUMP / NO JUMP // Based on history bits
   {
     if (branch) history bits are upgraded
     else history bits are degraded
   }
Output:
Instructions in program are:
cmp x1,x2
1000 Jump if x1 < x2

Enter x1, x2 value: 35 45
Prediction is NO JUMP
Branch taken
Incorrect Prediction. History bits are strongly taken

Enter x1, x2 value: 31 11
Prediction is JUMP
Branch not taken
Incorrect Prediction. History bits are weakly taken

Enter x1, x2 value: 63 10
Prediction is NO JUMP
Branch not taken
Correct Prediction. History bits are weakly not taken

Enter x1, x2 value: 74 95
Prediction is NO JUMP
Branch taken
Incorrect Prediction. History bits are weakly taken
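The lookup-and-update logic of the algorithm can be sketched in Java as below. It is a sketch under assumptions: the class name BranchPredictor and its methods are hypothetical, the table is an unbounded HashMap rather than a 256-entry 2-way associative buffer, "upgraded" is taken to mean one step towards 00 and "degraded" one step towards 11, and states 00 and 01 predict JUMP. It is a plain 2-bit saturating counter, so its trace need not match the sample output line for line.

```java
import java.util.HashMap;
import java.util.Map;

class BranchPredictor {
    // History-bit encoding from the manual: 00 strongly taken ... 11 strongly not taken.
    static final String[] NAMES = {
        "strongly taken", "weakly taken", "weakly not taken", "strongly not taken"
    };

    // Source address -> history bits (assumption: unbounded map stands in for the BTB).
    private final Map<Integer, Integer> btb = new HashMap<>();

    /** Returns true for a JUMP prediction; an address not in the table predicts NO JUMP. */
    boolean predict(int addr) {
        Integer h = btb.get(addr);
        return h != null && h <= 1;
    }

    /** Records the actual outcome of the branch at addr. */
    void update(int addr, boolean taken) {
        Integer h = btb.get(addr);
        if (h == null) {
            if (taken) btb.put(addr, 0); // first taken branch: insert with history bits 00
            return;
        }
        // Taken: upgrade (towards 00). Not taken: degrade (towards 11).
        btb.put(addr, taken ? Math.max(0, h - 1) : Math.min(3, h + 1));
    }

    String state(int addr) {
        Integer h = btb.get(addr);
        return h == null ? "no entry" : NAMES[h];
    }

    public static void main(String[] args) {
        BranchPredictor p = new BranchPredictor();
        int[][] inputs = {{35, 45}, {31, 11}, {63, 10}, {74, 95}}; // x1, x2 pairs
        for (int[] in : inputs) {
            boolean predicted = p.predict(1000);
            boolean taken = in[0] < in[1]; // "Jump if x1 < x2" at address 1000
            p.update(1000, taken);
            System.out.println("Prediction is " + (predicted ? "JUMP" : "NO JUMP")
                + ", branch " + (taken ? "taken" : "not taken")
                + ", " + (predicted == taken ? "correct" : "incorrect")
                + ", history bits are " + p.state(1000));
        }
    }
}
```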
Experiment No:6
Experiment name: Write a program to implement Delayed Execution.
Resources Required: Internet, Books
Consumable: Printer paper.
Flowchart: Not applicable
Theory: In normal execution, the instruction queue has to be flushed (cleared) many times because JUMP instructions occur at unexpected places in the instruction sequence. For example, consider the following sequence of instructions in a program:
100 ADD r1,r2
102 MUL r3
103 STR r4
105 CMP r3, r1
106 JMP 108
107 ADD r2,r3
108 SUB r3,r4
The instructions that follow the JMP are prefetched into the queue. When the JMP is actually taken, these prefetched instructions have to be flushed and new instructions fetched from the target address. The presence of a JMP instruction can therefore reduce the throughput of the processor. One solution is to arrange the instructions so that the instructions other than the JMP are delayed, that is, placed after the JMP instruction.
Delayed Execution:
100 ADD r1,r2
102 JMP 108
103 MUL r3
104 STR r4
105 CMP r3, r1
107 ADD r2,r3
108 SUB r3,r4
To simulate delayed execution, the following example can be referred to.
Algorithm:
Addition a = 2, 23
Subtraction b = 3, 5
Multiplication c = 4, 5
Divide d = 77, 35
If (a > 0) { /* Display of records which use value of a */ }

Delayed execution:
Addition a = 2, 23
Switch (a) { /* Display of records which use value of a */ }
Subtraction b = 3, 5
Multiplication c = 4, 5
Divide d = 77, 35
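The effect of placing instructions after the jump can be sketched with a small trace simulator. This is an illustrative sketch only: the class name DelayedJumpSim, the array encoding of the program (a target of -1 means "not a jump") and the choice of three delay slots are assumptions; the model handles forward jumps only and treats instructions in delay slots as ordinary instructions.

```java
import java.util.ArrayList;
import java.util.List;

class DelayedJumpSim {
    /**
     * Returns the addresses executed, in order. After a jump, the next delaySlots
     * instructions in program order still execute before control moves to the target.
     */
    static List<Integer> trace(int[] addrs, int[] targets, int delaySlots) {
        List<Integer> executed = new ArrayList<>();
        int pc = 0; // index into addrs
        while (pc < addrs.length) {
            executed.add(addrs[pc]);
            if (targets[pc] < 0) { // ordinary instruction
                pc++;
                continue;
            }
            int next = indexOf(addrs, targets[pc]);
            for (int s = 1; s <= delaySlots && pc + s < addrs.length; s++) {
                executed.add(addrs[pc + s]); // delay-slot instruction: useful work, no flush
            }
            pc = next;
        }
        return executed;
    }

    static int indexOf(int[] addrs, int addr) {
        for (int i = 0; i < addrs.length; i++) {
            if (addrs[i] == addr) return i;
        }
        throw new IllegalArgumentException("No instruction at address " + addr);
    }

    public static void main(String[] args) {
        // Original order: JMP at 106, no delay slots.
        int[] a1 = {100, 102, 103, 105, 106, 107, 108};
        int[] t1 = {-1, -1, -1, -1, 108, -1, -1};
        // Delayed order: JMP moved up to 102; MUL, STR and CMP fill three delay slots.
        int[] a2 = {100, 102, 103, 104, 105, 107, 108};
        int[] t2 = {-1, 108, -1, -1, -1, -1, -1};
        System.out.println("Normal : " + trace(a1, t1, 0));
        System.out.println("Delayed: " + trace(a2, t2, 3));
    }
}
```

Both orderings execute ADD, MUL, STR, CMP, JMP and SUB and skip the ADD at 107; in the delayed ordering, the instructions fetched right after the JMP are ones that must run anyway, so nothing fetched behind the jump is thrown away.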
Experiment No:7
Experiment name: Write a program to implement Page replacement algorithm.
Resources Required: Internet, Books
Consumable: Printer paper.
Flowchart: Not applicable
Theory: Whenever a page is required, it is first searched for in the cache. If it is not present, it is brought into the cache. If there is no free space in the cache, an existing page is replaced by the new page. Various techniques are used to choose the page to replace, such as FIFO, LRU, optimal, clock, etc.
FIFO: In this technique the page that entered first is replaced.
Optimal: Replace the page that will not be used for the longest period of time.
Lowest page-fault rate of all algorithms.
Never suffers from Belady's anomaly.
4 frames example: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, ...
How do you know which page will not be used for the longest time? The algorithm is difficult to implement, as it requires prior knowledge of the reference string (like SJF in CPU scheduling). It is therefore used for measuring how well your algorithm performs, mainly in comparison studies.
Conclusion: Thus we have successfully implemented the Page Replacement Algorithm.
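FIFO and optimal replacement can be compared with a short fault-counting sketch. The class name PageReplacement and its method names are hypothetical, and the reference string used in main is an assumption: it is the string from the 4-frame example above completed with a final reference to page 5, which is the form in which this example usually appears.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PageReplacement {
    /** FIFO: on a fault with all frames full, evict the page that entered first. */
    static int fifoFaults(int[] refs, int frames) {
        Deque<Integer> queue = new ArrayDeque<>();
        int faults = 0;
        for (int page : refs) {
            if (queue.contains(page)) continue; // hit
            faults++;
            if (queue.size() == frames) queue.removeFirst();
            queue.addLast(page);
        }
        return faults;
    }

    /** Optimal: evict the resident page whose next use lies farthest in the future. */
    static int optimalFaults(int[] refs, int frames) {
        List<Integer> mem = new ArrayList<>();
        int faults = 0;
        for (int i = 0; i < refs.length; i++) {
            if (mem.contains(refs[i])) continue; // hit
            faults++;
            if (mem.size() < frames) {
                mem.add(refs[i]);
                continue;
            }
            int victim = 0;
            int farthest = -1;
            for (int m = 0; m < mem.size(); m++) {
                int next = nextUse(refs, i + 1, mem.get(m));
                if (next > farthest) {
                    farthest = next;
                    victim = m;
                }
            }
            mem.set(victim, refs[i]);
        }
        return faults;
    }

    /** Index of the next reference to page at or after from, or MAX_VALUE if never used again. */
    static int nextUse(int[] refs, int from, int page) {
        for (int j = from; j < refs.length; j++) {
            if (refs[j] == page) return j;
        }
        return Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int[] refs = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5}; // final 5 is an assumption
        System.out.println("FIFO, 3 frames:    " + fifoFaults(refs, 3) + " faults");
        System.out.println("FIFO, 4 frames:    " + fifoFaults(refs, 4) + " faults");
        System.out.println("Optimal, 4 frames: " + optimalFaults(refs, 4) + " faults");
    }
}
```

On this string FIFO gives 9 faults with 3 frames but 10 with 4 frames, which is Belady's anomaly, while the optimal algorithm gives 6 faults with 4 frames.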
Experiment No:8
Subject: Advanced Microprocessor
Experiment name: Write an assembly program for finding out the processor id.
Resources Required: Pentium machine, Turbo Assembler, Intel Manual
Consumable: Printer paper.
Theory: The Pentium processor provides the CPUID instruction for checking the processor id. CPUID can be used if software is able to set and clear the ID flag (bit 21) in the EFLAGS register. Whenever the CPUID instruction is executed, information about the current processor such as model, family, id and version is transferred into the general purpose registers of the processor.
Algorithm:
i) Initialize the segments
ii) Initialize the registers
iii) Use the CPUID instruction (valid only on Pentium class processors!)
iv) Print three hex digits which correspond to the family, model, and stepping ID.
v) Terminate the program.
Conclusion: We have successfully executed the CPUID instruction.
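Step iv can be illustrated separately from the assembly program. Java cannot execute CPUID itself, so the sketch below only decodes a value assumed to have been returned in EAX by CPUID with EAX = 1, where bits 3:0 hold the stepping ID, bits 7:4 the model and bits 11:8 the family. The class name CpuidSignature and the sample value 0x0543 are hypothetical.

```java
class CpuidSignature {
    /** Stepping ID: bits 3:0 of the signature. */
    static int stepping(int eax) {
        return eax & 0xF;
    }

    /** Model: bits 7:4 of the signature. */
    static int model(int eax) {
        return (eax >>> 4) & 0xF;
    }

    /** Family: bits 11:8 of the signature. */
    static int family(int eax) {
        return (eax >>> 8) & 0xF;
    }

    /** The three hex digits printed in step iv: family, model, stepping. */
    static String threeHexDigits(int eax) {
        return String.format("%X%X%X", family(eax), model(eax), stepping(eax));
    }

    public static void main(String[] args) {
        int eax = 0x0543; // hypothetical signature value, not read from a real processor
        System.out.println("Family:   " + family(eax));
        System.out.println("Model:    " + model(eax));
        System.out.println("Stepping: " + stepping(eax));
        System.out.println("Digits:   " + threeHexDigits(eax));
    }
}
```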
Experiment No: 9
Experiment name: Study of SPARC Architecture (V8)
DESIGN GOALS
SPARC was designed as a target for optimizing compilers and easily pipelined hardware implementations. SPARC implementations provide exceptionally high execution rates and short time-to-market development schedules.

REGISTER WINDOWS
SPARC, formulated at Sun Microsystems in 1985, is based on the RISC I & II designs engineered at the University of California at Berkeley from 1980 through 1982. The SPARC register window architecture, pioneered in the UC Berkeley designs, allows for straightforward, high-performance compilers and a significant reduction in memory load/store instructions over other RISCs, particularly for large application programs. For languages such as C++, where object-oriented programming is dominant, register windows result in an even greater reduction in instructions executed.
Note that supervisor software, not user programs, manages the register windows. A supervisor can save a minimum number of registers (approximately 24) at the time of a context switch, thereby optimizing context switch latency.
One difference between SPARC and the Berkeley RISC I & II is that SPARC provides greater flexibility to a compiler in its assignment of registers to program variables. SPARC is more flexible because register window management is not tied to procedure call and return (CALL and JMPL) instructions, as it is on the Berkeley machines. Instead, separate instructions (SAVE and RESTORE) provide register window management.
systems. Note that they are invisible to nearly all user application programs and the interfaces to them can be limited to localized modules in an associated operating system.
SPARC includes the following principal features:
A linear, 32-bit address space.
Few and simple instruction formats: All instructions are 32 bits wide, and are aligned on 32-bit boundaries in memory. There are only three basic instruction formats, and they feature uniform placement of opcode and register address fields. Only load and store instructions access memory and I/O.
Few addressing modes: A memory address is given by either register + register or register + immediate.
Triadic register addresses: Most instructions operate on two register operands (or one register and a constant), and place the result in a third register.
A large windowed register file: At any one instant, a program sees 8 global integer registers plus a 24-register window into a larger register file. The windowed registers can be described as a cache of procedure arguments, local values, and return addresses.
A separate floating-point register file: Configurable by software into 32 single-precision (32-bit), 16 double-precision (64-bit), 8 quad-precision registers (128-bit), or a mixture thereof.
Delayed control transfer: The processor always fetches the next instruction after a delayed control-transfer instruction. It either executes it or not, depending on the control-transfer instruction's annul bit.
Fast trap handlers: Traps are vectored through a table, and cause allocation of a fresh register window in the register file.
Tagged instructions: The tagged add/subtract instructions assume that the two least-significant bits of the operands are tag bits.
Multiprocessor synchronization instructions: One instruction performs an atomic read-then-set-memory operation; another performs an atomic exchange-register-with-memory operation.
Coprocessor: The architecture defines a straightforward coprocessor instruction set, in addition to the floating-point instruction set.
The SPARC architecture also describes the following concepts: the instruction set, addressing modes, pipeline processing, FPU, interrupts, bus cycles, programming model, etc.
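The overlap between adjacent register windows described above can be sketched as a mapping from a window-relative register number to a slot in the windowed register file. This is only a sketch: the class name RegisterWindows, the choice of 8 windows and the particular slot numbering are assumptions made for illustration. What it relies on from the architecture is that r8-r15 are the outs, r16-r23 the locals and r24-r31 the ins of a window, that SAVE decrements the current window pointer (CWP) and RESTORE increments it, and that the outs of the caller's window are the ins of the callee's window.

```java
class RegisterWindows {
    static final int NWINDOWS = 8; // assumption: the window count is implementation-dependent

    /** Slot in the windowed register file for register r (8..31) of window cwp. */
    static int physical(int cwp, int r) {
        if (r < 8 || r > 31) {
            throw new IllegalArgumentException("r0-r7 are globals, outside the windows");
        }
        int total = 16 * NWINDOWS; // each window adds 8 outs + 8 locals
        if (r < 16) return (16 * cwp + (r - 8)) % total;      // outs   r8-r15
        if (r < 24) return (16 * cwp + 8 + (r - 16)) % total; // locals r16-r23
        return (16 * (cwp + 1) + (r - 24)) % total;           // ins    r24-r31 = outs of cwp+1
    }

    /** SAVE: allocate a new window by decrementing CWP (modulo NWINDOWS). */
    static int save(int cwp) {
        return (cwp - 1 + NWINDOWS) % NWINDOWS;
    }

    /** RESTORE: return to the previous window by incrementing CWP (modulo NWINDOWS). */
    static int restore(int cwp) {
        return (cwp + 1) % NWINDOWS;
    }

    public static void main(String[] args) {
        int caller = 3;
        int callee = save(caller);
        System.out.println("Caller out r8 -> slot " + physical(caller, 8));
        System.out.println("Callee in r24 -> slot " + physical(callee, 24));
        System.out.println("Caller local r16 -> slot " + physical(caller, 16));
        System.out.println("Callee local r16 -> slot " + physical(callee, 16));
    }
}
```

The caller's r8 and the callee's r24 map to the same slot, which is how arguments are passed without load/store instructions, while the locals of the two windows stay separate.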