A Novel Asynchronous Pipeline Architecture For CISC Type Embedded Controller, A8051
1. Introduction

While recent VLSI technology has reduced gate delay, it has also increased wire delay in relative terms. The more complicated a system becomes, the more difficult clock synchronization is, so clock skew tends to occur more frequently. In addition, power is wasted when the clock is supplied to modules that do not need to be active. To overcome these problems, asynchronous circuit design is now being actively researched.

The pipeline architecture is one of the most commonly used techniques for high-speed computing machinery. A processor pipeline can be clocked, when its control signals are activated by a widely distributed clock signal, or event-driven, when its stages are activated independently by a handshake protocol between stages instead of the system clock. The advantage of the asynchronous pipeline architecture over its synchronous counterpart is now well recognized [2]. The difference between the behavioral mechanisms of the synchronous pipeline and those of the asynchronous one is depicted in Fig. 1. In a conventional synchronous pipeline, every stage must finish within a time slot given by the global clock, whose length is set by the slowest stage. By contrast, the asynchronous pipeline architecture does not use a global clock, so each stage takes a variable execution time to finish. For this reason, the asynchronous pipeline delivers average-case performance. Whenever the execution of a state completes, its output is passed to the next state as an input without delay or waiting. It has a space time between each state

In this paper, the proposed A8051 presents several solutions to these problems. First, it proposes a simple way to predict branches through data-dependent control without a clock, in order to resolve the control hazards caused by branch instructions. Second, it constructs five stages with equal data path delay to improve pipeline performance. Third, it groups instructions by their simple execution schemes to remove the bubble states that occur during the execution of multi-cycle instructions. Finally, it verifies through simulation that the architecture handles variable-length instructions.

The paper is organized as follows: Section 2 presents the instruction set and the implementation architecture of Intel 80C51 and A8051. Section 3 evaluates the control mechanism and the asynchronous pipeline architecture. Section 4 describes the overall results and performance. The conclusions and some final remarks are offered in Section 5.

2. Architecture of Intel 80C51 and A8051

Intel 80C51 is a popular 8-bit processor with a complex instruction set that is classified into five classes: arithmetic, logical, data transfer, boolean and jump instructions. It is a memory-memory architecture with six addressing modes. The Intel 80C51 instruction set has 255 instructions, encoded in variable lengths of one, two or three bytes. Each instruction is executed in one, two or four machine cycles [3]. In addition, each machine cycle is divided into 6 states, each of which performs some communication or computation. As depicted in Fig. 2, the bus plays a central role in the synchronous architecture.
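The machine-cycle arithmetic of the synchronous 80C51 can be made concrete with a short sketch. It uses the six states per machine cycle and the 1, 2 or 4 machine cycles per instruction given above. The figure of two oscillator periods per state is an assumption taken from standard 8051 timing, not from this paper, and the function names are illustrative only.

```python
STATES_PER_MACHINE_CYCLE = 6   # from the text: one machine cycle = 6 states
OSC_PERIODS_PER_STATE = 2      # assumption: standard 8051 timing, not stated in the paper


def instruction_time_us(machine_cycles: int, osc_mhz: float) -> float:
    """Execution time in microseconds of one synchronous 80C51 instruction."""
    if machine_cycles not in (1, 2, 4):
        raise ValueError("80C51 instructions take 1, 2 or 4 machine cycles")
    periods = machine_cycles * STATES_PER_MACHINE_CYCLE * OSC_PERIODS_PER_STATE
    return periods / osc_mhz   # MHz = periods per microsecond


def peak_mips(osc_mhz: float) -> float:
    """Upper bound on throughput when every instruction takes one machine cycle."""
    return osc_mhz / (STATES_PER_MACHINE_CYCLE * OSC_PERIODS_PER_STATE)


if __name__ == "__main__":
    for cycles in (1, 2, 4):
        print(cycles, "machine cycle(s):", instruction_time_us(cycles, 12.0), "us")
    print("peak MIPS at 12 MHz:", peak_mips(12.0))
```

Under these assumptions every instruction pays for all six states of each machine cycle, whether or not each state does useful work. This fixed cost is what the asynchronous design aims to avoid.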
Figure 2. The architecture of Intel 80C51

[Figure 3 panels: (a) Example execution scheme of Intel 80C51, one machine cycle with redundancy states; (b) Example execution scheme of A8051, stages IF, ID and EX with handshake overhead]

Figure 3. Comparison of instruction execution schemes for the "INC A" instruction

In Intel 80C51, a machine cycle of an instruction goes through 6 state transitions, from S1 to S6. This scheme results in many redundant states because not all states are required for the instruction execution. The execution scheme of the proposed A8051 skips the redundancy states and connects directly to the execution of the next stage. In this case, it needs only the space time spent waiting for the next state.

A8051 consists of a 5-stage asynchronous pipeline architecture, as shown in Fig. 4. (a) The 1st stage is the IF stage, which is responsible for instruction fetch, pre-decode and checking for the existence of branch instructions. (b) The 2nd stage is the ID unit, which is responsible for mapping the opcode to an entry point of the microinstruction and checking data dependency for the previous

A8051 outline

A8051 consists of a 5-stage pipeline, IF, ID, OF, EX and WB, as shown in Fig. 4. In this study, we adopt the DI (delay-insensitive) delay model for asynchronous operation. This requires knowing when the combinational circuits have completed. The proposed A8051 uses a 4-phase handshake protocol for data transmission and dual-rail encoding to obtain the completion signal of the combinational circuits, as shown in Fig. 5 [5].

Redefine suitable execution scheme

As described in Section 2, Intel 80C51 uses redundancy stages and machine cycles for CISC operation. A8051 does not use redundancy states for synchronization between stages, but a stage has to wait until the next stage is available; this waiting time is called a space time. For this purpose we newly defined the instruction set of A8051 with new execution sequences. Table 1 shows the newly defined instruction groups for A8051. Each instruction group does not require all stages and connects only the necessary stages. Groups 5, 6 and 7 iterate the OF and EX stages once or twice, so A8051 requires an arbitration block and additional control signals for this iteration. Although A8051 has a multi-cycle execution scheme, it is based on a simple linear pipeline to which only the special cases of skipped stages and iterated stages are added. Because it uses only the asynchronous data path, it has less idle time than the synchronous version, which has dummy states.
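The completion detection mentioned in the A8051 outline can be illustrated with a small behavioral model. It is only a sketch of generic dual-rail encoding with a 4-phase (return-to-zero) handshake, not the circuit of Fig. 5. The names `encode`, `is_complete` and `four_phase_transfer` are hypothetical, and the model assumes that a valid code word acts as the request.

```python
from typing import List, Tuple

Rail = Tuple[int, int]      # (true_rail, false_rail) for one bit
SPACER: Rail = (0, 0)       # "no data" code between two transfers


def encode(bits: List[int]) -> List[Rail]:
    """Dual-rail encode: 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]


def decode(word: List[Rail]) -> List[int]:
    """Recover the bits of a complete dual-rail word."""
    return [t for t, _ in word]


def is_complete(word: List[Rail]) -> bool:
    """Completion detection: every bit has exactly one rail high."""
    return all(t ^ f == 1 for t, f in word)


def four_phase_transfer(bits: List[int]):
    """Model one transfer; returns the received bits and the (req, ack) trace."""
    trace = []
    word = encode(bits)
    req, ack = int(is_complete(word)), 0
    trace.append((req, ack))        # phase 1: valid data raises the request
    received = decode(word)
    ack = 1
    trace.append((req, ack))        # phase 2: receiver latches and acknowledges
    word = [SPACER] * len(bits)
    req = int(is_complete(word))
    trace.append((req, ack))        # phase 3: sender returns to the spacer
    ack = 0
    trace.append((req, ack))        # phase 4: receiver releases the acknowledge
    return received, trace


if __name__ == "__main__":
    data, trace = four_phase_transfer([1, 0, 1])
    print(data, trace)
```

Because the data itself signals when it is valid, no stage in this model needs a clock or a worst-case delay assumption to know that its input has arrived.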
Handling the variable instruction length

The instructions of A8051 vary in length from 1 to 3 bytes according to the number of operands. The IF stage checks each fetched instruction, determines its length, and saves it to the IR register. To do so, as represented in Fig. 7, a mapping table is required that holds in advance the number of operands implied by each opcode. The IR register can therefore hold a maximum of 3 bytes of data.

The instructions fetched from the program memory are sent to both the IR register and the mapping table. The mapping table selects the position where the input data is saved according to the input opcode and signals this information to the Mux. The Mux transmits Rin to the Latch-CTL block of the IR register. Finally, it handshakes with Aout for the next instruction fetch.
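The length lookup performed in the IF stage can be sketched in software as follows. This is a minimal model, not the hardware mapping table of the paper. `LENGTH_TABLE` is a deliberately partial table containing a few standard 8051 opcodes, and the function names are hypothetical.

```python
from typing import List, Tuple

# Partial, illustrative mapping table: opcode -> total instruction length in bytes.
LENGTH_TABLE = {
    0x00: 1,  # NOP
    0x04: 1,  # INC A
    0x74: 2,  # MOV A,#data
    0x80: 2,  # SJMP rel
    0x02: 3,  # LJMP addr16
    0x90: 3,  # MOV DPTR,#data16
}


def fetch(program: bytes, pc: int) -> Tuple[bytes, int]:
    """Fetch one instruction; returns the IR contents (1 to 3 bytes) and the next PC."""
    opcode = program[pc]
    length = LENGTH_TABLE[opcode]          # KeyError for opcodes not in this sketch
    ir = program[pc:pc + length]
    if len(ir) != length:
        raise ValueError("instruction runs past the end of program memory")
    return ir, pc + length


def fetch_all(program: bytes) -> List[bytes]:
    """Split a byte stream into instructions using only the opcode of each one."""
    pc, instructions = 0, []
    while pc < len(program):
        ir, pc = fetch(program, pc)
        instructions.append(ir)
    return instructions


if __name__ == "__main__":
    # MOV A,#5 ; INC A ; LJMP 0x0000
    print(fetch_all(bytes([0x74, 0x05, 0x04, 0x02, 0x00, 0x00])))
```

The point of the sketch is that the opcode alone determines how many bytes follow, so the fetch unit can fill the 3-byte IR without decoding the operands.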
[Figure 6 panel labels: Group 2, Group 3, Group 4, Group 5, Group 6, Group 7]

Figure 6. A logical layout for A8051 execution scheme

Figure 8. The BTA calculation unit in IF stage
The SEL signal chooses one address as the input to the Mux. The chosen address is stored in the Addr Latch and sent to the program memory by the handshake protocol. This operation thus overlaps with the IF stage of the next instruction, so no branch penalty occurs even though the execution time is longer. In addition, when a conditional branch is executed, it is resolved using static prediction. When the branch condition is detected, the processor checks the prediction. If the prediction is correct, IF continues the execution of the sequential path; if the prediction is incorrect, IF discards all speculative results of the instructions executed on the sequential path.

The most influential factor in the performance improvement of A8051 is branch instructions. Accordingly, the difference between the worst and the best operation speed is about 12 MIPS, as shown in Fig. 9. Overall, A8051 shows an average operation speed of 76 MIPS in 0.35 um CMOS technology.
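The static prediction scheme described above can be sketched as a simple trace-driven model. It assumes that the static prediction is always "not taken" (the sequential path), which is what the text implies. The parameter `resolve_delay`, the number of sequential instructions fetched before the branch condition is detected, is a hypothetical quantity introduced for illustration.

```python
from typing import Dict, Iterable


def run_static_not_taken(outcomes: Iterable[bool], resolve_delay: int) -> Dict[str, int]:
    """Count the cost of always predicting the sequential (not-taken) path.

    outcomes:      True where a conditional branch turned out to be taken.
    resolve_delay: hypothetical number of sequential instructions fetched
                   speculatively before the branch condition is detected.
    """
    correct = mispredicted = flushed = 0
    for taken in outcomes:
        if taken:
            # Prediction was wrong: the speculative sequential work is reset.
            mispredicted += 1
            flushed += resolve_delay
        else:
            # Prediction was right: IF simply continues on the sequential path.
            correct += 1
    return {"correct": correct, "mispredicted": mispredicted, "flushed": flushed}


if __name__ == "__main__":
    print(run_static_not_taken([False, True, False, True], resolve_delay=2))
```

In this model the cost of a misprediction grows with `resolve_delay`, which is one way to see why branch behavior dominates the spread between best and worst operation speed.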
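Stage                     IF      ID      OF      EX      WB
Average execution time    5.9 ns  4.8 ns  5.0 ns  5.1 ns  6.6 ns

[Figure 9 bar labels: Intel 8051; Async 80C51 ver. by [1]; proposed A8051, non-pipeline; proposed A8051, pipeline]

Figure 9. Comparison results using test-bench program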
5. Conclusions

In this paper, A8051 with a new pipeline architecture is proposed and implemented using an asynchronous data path and a control scheme based on the 4-phase handshake. The proposed architecture supports a multi-cycle pipeline to solve several problems that occur in CISC machines. A8051 has a number of special features, including many asynchronous processing techniques for handling the complex features of a CISC-type microcontroller, such as variable-length instruction handling, modification of the execution scheme of the Intel 8051 instruction set for A8051, and a multiple-looping pipeline architecture. Based on these circuit techniques, the proposed A8051 shows an operation speed about 18 times higher than that of the synchronous version and 5 times higher than that of the Async 80C51 of [1].
REFERENCES
[1] H. van Gageldonk et al., "An asynchronous low-power 80C51 microcontroller," in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 96-107, 1998.
[2] I. E. Sutherland, "Micropipelines," Communications of the ACM, vol. 32, no. 6, pp. 720-739, June 1989.
[3] Intel, "Microprocessor and Peripheral Handbook," 1987.
[4] Jamin M. C. Tse and Daniel P. K. Lun, "ASYNMPU: A Fully Asynchronous CISC Microprocessor," ISCAS'97, pp. 1816-1819, 1997.
[5] K. R. Cho, K. Okura, K. Asada, "Design a 32-bit Fully