Final Report
Project Report
On
Submitted in partial fulfilment of the requirements for the award of the degree of
Bachelor of Engineering
in
Electronics & Communication
Submitted by
CERTIFICATE
Certified that the project work entitled “Design and Implementation of DDR For 32-bit RISC V
Processor” is carried out by SAMARTH R BHARADWAJ USN: 1BI20EC126, SACHIN BHASKAR S
USN: 1BI20EC121, SAGAR G P USN: 1BI20EC123, and SIDDESH M G USN: 1BI20EC142,
bonafide students of Bangalore Institute of Technology, in partial fulfillment for the award of
Bachelor of Engineering in Electronics and Communication Engineering of the
Visvesvaraya Technological University, Belgaum during the year 2023-2024. It is certified that
all corrections/suggestions indicated for Internal Assessment have been incorporated in the Report
deposited in the departmental library. The project report has been approved as it satisfies the
academic requirements in respect of Project work prescribed for the said Degree.
External Viva
Name of the examiners Signature with date
1.
2.
ACKNOWLEDGEMENT
We take this opportunity to express our sincere gratitude and respect to the
Bangalore Institute of Technology, Bangalore, for providing us an opportunity to carry
out our final project.
We express our sincere regards and thanks to Dr. HEMANTH KUMAR A.R,
Professor and HOD, Electronics & Communication Engineering, BIT, for his valuable
suggestions.
We also extend our thanks to the entire faculty of the Department of ECE, BIT,
Bangalore, who have encouraged us throughout the course of our bachelor's degree.
The microprocessor industry has historically been dominated by complex proprietary technologies with
restrictive licensing, but there is a shift towards freely available, open-source alternatives like the RISC-V
Instruction Set Architecture (ISA). RISC-V is a license-free, modular, and extensible option that has gained
significant popularity across a wide range of applications, from microcontroller chips to supercomputing
initiatives. The RISC-V ISA provides designers the freedom to develop their own custom processors using
open-source or commercial resources as a starting point, and its modularity allows for a high degree of
customization, as users can choose to implement only the extensions they need, reducing the complexity
and power consumption of their hardware. The RISC-V architecture has been extensively tested and
verified, and its simplicity reduces the risk of bugs and vulnerabilities. The impact of RISC-V on the
processor market could be significant, as it has the potential to lower costs, increase innovation, and foster
competition in an industry traditionally dominated by a few key players. For hardware developers, RISC-
V offers the freedom to customize their processors to meet their specific needs, without the licensing fees
and restrictions associated with proprietary ISAs. For software developers, RISC-V offers a stable target
for software development, as the base ISA is frozen, meaning it will not change in future versions. Looking
ahead, the future of RISC-V appears promising: with a growing community of developers and a wide range
of applications, from microcontrollers to supercomputers, RISC-V is poised to continue its growth and
impact on the processor market.
A 5-stage pipelined RISC-V processor has been developed and implemented with DDR flip-flops,
enhancing efficiency by reducing the number of clock cycles per instruction. With DDR, the minimum
required clock frequency decreased from 221.047 MHz to 104.093 MHz, a reduction of about 53%. This
lower operating frequency led to a significant decrease in dynamic power, from 0.198 W to 0.001 W, a
reduction of roughly 99.5%. As a result, the total supply power fell from 0.281 W to 0.083 W, a reduction
of about 70%. However, because the power and clock-cycle savings are traded for area to improve the
existing design's performance, there is a proportional increase in area, from 244 LUTs to 402 LUTs in the proposed
design. This processor design, realized using Verilog HDL and simulated in Xilinx Vivado, presents a
promising alternative to conventional proprietary microprocessor technologies, offering the advantages of
open-source customization and lower entry barriers.
LIST OF CONTENTS
CHAPTER 1
INTRODUCTION
A central processor unit is referred to as a microprocessor when it consists of just one integrated circuit.
It can handle several instructions at once thanks to its millions of transistors and electronic components. All
of this is contained on a single silicon chip that supports the computer system with memory and other unique
capabilities. It can be programmed to read binary instructions from memory, carry out the operation, and
provide the desired result. It is helpful for concurrent data transmission and receiving, device interaction,
and data saving.
Transistors, registers, and diodes are just a few of the many parts that make up a microprocessor
and work together to complete tasks. As technology has advanced, chip capabilities have grown
increasingly sophisticated. Better functionality and faster speed have been achieved. These days, most
devices require a microprocessor in order to operate. It is the component that gives a device its
intelligence. Every device, whether a computer or a smartphone, needs an interface to manage data,
which only a microprocessor can supply. Furthermore, there is still a long way to go in the
advancement of artificial intelligence.
The first microprocessor generation was released by the Intel Corporation in 1971: the Intel 4004,
a 4-bit processor. Executing 60k instructions per second, the processor ran at a speed
of 740 kHz. It was constructed with 16 pins and 2300 transistors. Simple arithmetic and logical
processes could be performed with it, as it was constructed on a single chip. To interpret the commands
from memory and carry out the tasks, there was a control unit.
Intel introduced the first 8-bit microprocessor in 1973, marking the start of the second generation.
For arithmetic and logic operations on 8-bit words, it was helpful. With a clock speed of 500 kHz and
50k instructions per second, the 8008 was the first processor. In 1974, an 8080 microprocessor with a
2 MHz speed and 60k instructions per second came next. The 8085 microprocessor, which could
process up to 769,230 instructions per second at a speed of 3 MHz, was the last to arrive in 1976.
In 1978, the third generation of microprocessors debuted with the 8086-88, which had a speed
of 4.77, 8, and 10 MHz and a capacity of 2.5 million instructions per second. Other notable inventions
were the Zilog Z800 and the Intel 80286 (released in 1982 in a 68-pin package, capable of executing
instructions at a rate of 4 million per second).
Around 1986, a number of companies released 32-bit microprocessors, although Intel remained
the market leader. With 275k transistors inside, their clock speed ranged from 16 MHz to 33 MHz. The
Intel 80486 microprocessor, which had 1.2 million transistors, 16–100 MHz clock speed, and 8 KB of
cache memory, was one of the earliest. In 1993, the Pentium microprocessor was released, with
a clock speed of 66 MHz and 8 KB of cache memory.
The next generation brought 64-bit processors with clock speeds ranging from 1.2 GHz to 3 GHz,
successors to the Pentium line that debuted in 1995, containing up to 291 million transistors. Intel
followed with the i3, i5, and i7 microprocessors in 2007, 2009, and 2010 respectively. These
were a few of this generation's salient features.
1. CISC Microprocessors
In order to support the system, CISC (Complex Instruction Set Computer) processors can handle
complex orders in addition to other low-level tasks like downloading and uploading. With only a
command, they are also capable of carrying out intricate mathematical calculations. High-quality
personal computers built on them are compatible with simpler compilers. Their instructions take
multiple clock cycles. The Intel 386 and 486, the Pentium, and so on are a few instances.
2. RISC Microprocessors
RISC (Reduced Instruction Set Computer) processors carry out small, targeted commands with
excellent optimization and speed. Simple commands of uniform length result in a shorter instruction
set. By adding registers, they decrease memory references. The pipelining that RISC uses causes the
fetching and execution of instructions to overlap. Most instructions require one CPU cycle to
complete. The AMD K6, K7, and other models are a few examples.
3. EPIC Microprocessors
Combining the best qualities of both RISC and CISC processors, EPIC (Explicitly Parallel Instruction
Computing) is a hybrid. Without a set width, these processors execute instructions in parallel. They
allow compilers to use sequential semantics to interface with the hardware. The Intel IA-64 and
Itanium are a couple of examples.
4. Superscalar Microprocessors
Multiple tasks can be executed simultaneously by the superscalar processor. They are frequently
found in multipliers or ALUs because of their ability to carry multiple commands. To convey
instructions within the CPU, they make use of various operational units.
5. ASIC Microprocessors
ASICs (Application-Specific Integrated Circuits) are widely used in personal digital assistants and
automobile pollution-control systems. Although they are built from off-the-shelf components, their
architecture is extremely well defined.
1.5 PROBLEM STATEMENT
The current processors in the market face challenges in efficiently executing complex instructions,
resulting in increased delay and reduced overall performance. There is a need to explore alternative
solutions to enhance processor efficiency and address these issues. Introducing RISC-V processors,
known for their simplicity and flexibility, offers a potential solution. However, integrating DDR
flip-flops into existing RISC-V architectures presents technical challenges and requires careful
consideration. The problem statement aims to investigate the feasibility and benefits of replacing
existing processors with RISC-V processors while implementing DDR flip-flops to improve performance
and reduce delay in executing complex instructions.
1.6 OBJECTIVES
Analysis of the RISC-V processor to execute complex instructions.
Evaluation of the performance parameters of the proposed processor, to reduce the number of clock
cycles and improve efficiency.
CHAPTER 2
LITERATURE SURVEY
In order to gain a foothold and a basic understanding of the idea behind our proposed project, we need to
review and analyze previously published technical papers in the domain of the RISC-V processor. The list
below presents the details of the major such papers.
[1] Mehrdad Poorhosseini, Wolfgang Nebel, Kim Gruttner "A Compiler Comparison in the RISC-V
Ecosystem" In the context of the RISC-V environment, which is becoming more and more important
for embedded software development, the study compares the GCC and LLVM compilers. GCC has
long been the preferred compiler for embedded systems because of its broad support for RISC-V and
other instruction set architectures. Still, LLVM begs for comparison given its rising popularity and
recent support for RISC-V. The study reveals that LLVM compiles quicker in 88% of experiments,
whereas GCC and LLVM create similar binary sizes in 51%, with GCC winning in 37% of the
experiments and LLVM in 12%. The benchmarking framework evaluates compile time, binary size,
instruction count, and execution time. Remarkably, in 94% of situations, the binary size difference is
within +/- 5%. Similar clock cycles are found in 42% of the studies, according to execution time
analysis, with LLVM winning in 18% of the cases and GCC in 40%. Developers can use these data to
get insight into which compiler to choose based on project demands and optimization objectives. While
both compilers perform similarly in terms of binary size and execution time, LLVM has an advantage
in compilation speed. One of this paper's main contributions is the establishment of a compiler
benchmark approach designed exclusively for RISC-V ecosystem compiler evaluation. Comparing the
performance of the GCC and Clang/LLVM compilers is the main objective of the article. These
compilers' compilation efficiency, code optimization capabilities, and overall performance may be
systematically evaluated and compared with the help of this benchmark approach. In order to help
developers and organizations in the RISC-V community make well-informed decisions, the article uses
standardized benchmarking techniques to shed light on the advantages and disadvantages of the GCC
and Clang/LLVM compilers.
[2] Nguyen My Qui, Chang Hong Lin, and Poki Chen's paper "Design and Implementation of a 256-
Bit RISC-V-Based Dynamically Scheduled Very Long Instruction Word on FPGA" describes a novel
method of creating a very long instruction word (VLIW) microprocessor by utilizing the RISC-V
[3] Aaron Elson Phangestu, Dr. Ir. Totok Mujiono, M.I.Kom and Ahmad Zaini ST, M.T “Five-Stage
Pipelined 32-Bit RISC-V Base Integer Instruction Set Architecture Soft Microprocessor Core in
VHDL”, marks a significant achievement in the realm of open-source Instruction Set Architectures
(ISAs) and microprocessor development. By successfully implementing the core in VHDL and
simulating it using ModelSim, and further synthesizing it using FPGA techniques with Synopsys
Design Compiler, the study showcases the feasibility and effectiveness of utilizing open-source ISAs
for microprocessor creation. The achievement of reaching a maximum frequency of 62.95 MHz
demonstrates the core's efficiency and performance potential. Despite excluding certain instructions
like FENCE, ECALL, and CSR, the core effectively executes the majority of RV32I instructions,
indicating its versatility and compatibility with common instruction sets. Moreover, the study
highlights the potential applications of such a processor core, particularly in embedded digital signal
processing (DSP) applications. Its competitive performance coupled with low resource usage makes it
an attractive choice for scenarios where efficiency and processing power are crucial, such as in
embedded systems and IoT devices.
[4] Srikanth V. Devarapalli, Payman Zarkesh-Ha and Steven C. Suddarth, “A Robust and Low Power
Dual Data Rate (DDR) Flip-Flop Using C-Elements”, introduces a novel dual-edge triggered flip-flop
(DETFF) design called DDR-FF, which aims to address power consumption and delay issues compared
to existing designs like ep-DSFF. DDR-FF leverages direct clock pulses to achieve a notable reduction
CHAPTER 3
METHODOLOGY
The objective of incorporating DDR (Double Data Rate) flip-flops into the RISC-V datapath architecture
is to enhance the performance by reducing operation time and increasing efficiency. Traditional flip-flops
operate on a single clock edge, either rising or falling, which can limit the speed at which data can be
processed. By using DDR flip-flops, which are sensitive to both rising and falling edges of the clock signal,
we aim to exploit more opportunities for data processing within each clock cycle.
The methodology involves replacing conventional registers in the datapath with DDR flip-flops to enable
dual-edge sensitivity. This allows for data to be sampled and processed on both the rising and falling edges
of the clock signal, effectively doubling the data transfer rate compared to single-edge flip-flops.
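The dual-edge behavior described above can be sketched in Verilog. The following is an illustrative model, not the project's actual source code: a common synthesizable structure uses one rising-edge and one falling-edge flip-flop with an output multiplexer selected by the clock.

```verilog
// Illustrative DDR (dual-edge) flip-flop sketch, not the project's
// actual source. Two single-edge flip-flops sample the input on
// opposite clock edges; the clock selects the most recent sample.
module ddr_ff (
    input  wire clk,
    input  wire d,
    output wire q
);
    reg q_pos, q_neg;

    always @(posedge clk)
        q_pos <= d;   // sample on the rising edge

    always @(negedge clk)
        q_neg <= d;   // sample on the falling edge

    // While clk is high, the rising-edge sample is the newest;
    // while clk is low, the falling-edge sample is.
    assign q = clk ? q_pos : q_neg;
endmodule
```

With this structure, q can take a new value every half clock period, which is the doubling of the transfer rate that the methodology relies on.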
1. Increased Throughput: By utilizing both edges of the clock signal, more data can be transferred and
processed within a single clock cycle, leading to higher throughput.
2. Reduced Latency: With the ability to sample data on both edges of the clock, the latency of critical
datapath operations can be reduced, enhancing overall system responsiveness.
3. Improved Efficiency: By optimizing the datapath architecture to leverage DDR flip-flops, we aim to
achieve better utilization of hardware resources, resulting in improved efficiency and performance.
In summary, the objective of incorporating DDR flip-flops into the RISC-V datapath architecture is to
enhance performance, reduce operation time, and increase efficiency by leveraging dual-edge sensitivity
for data processing.
The RISC-V datapath architecture stands as the cornerstone of RISC-V processors, embodying
fundamental principles of simplicity, efficiency, and scalability. At its core, the architecture is built upon a
clean and streamlined instruction set, designed to execute instructions swiftly and with minimal complexity.
Within the datapath, various components collaborate seamlessly to facilitate instruction execution. The
instruction fetch unit retrieves instructions from memory, while the instruction decode unit interprets their
meaning. The register file serves as a repository for data operands, accessible by the arithmetic logic unit
(ALU) for computation. Additionally, the control unit orchestrates the flow of data and control signals,
ensuring smooth operation throughout the processor pipeline.
The RISC-V datapath architecture represents a paradigm of elegance and efficiency in processor design.
Its modular and streamlined nature enables RISC-V processors to execute instructions with remarkable
speed and efficiency, while also providing flexibility for future innovations. As the RISC-V ecosystem
continues to grow and evolve, the datapath remains a foundational element, driving progress and pushing
the boundaries of what is achievable in the realm of processor architecture.
It is aimed to enhance the sensitivity of sequential flip-flops, latches, and registers within the datapath
architecture by transitioning to DDR (Double Data Rate) flip-flops. This transition involves a systematic
approach, starting with an analysis of performance requirements and identifying target elements for
modification. Design adjustments are made to incorporate DDR functionality, enabling sampling on both
rising and falling clock edges. Thorough verification, integration into the datapath architecture, and
performance evaluation ensure the effectiveness of the transition. Optimization and fine-tuning iterations
are conducted to maximize performance benefits, resulting in an improved datapath architecture poised for
greater efficiency and throughput.
In summary, the methodology involves careful analysis, design modification, verification, integration, and
optimization to transition from sequential elements to DDR flip-flops within the datapath architecture. By
enhancing sensitivity to dual clock edges, this approach aims to unlock performance improvements, leading
to a more efficient and responsive computing system.
Data transport and memory were completely transformed by Double Data Rate (DDR) technology, which
allowed data to be transmitted on both the rising and falling edges of the clock signal. When compared to
conventional single data rate (SDR) systems, this invention essentially doubled the data transfer rate.
Because DDR memory offers better performance and efficiency, it is commonly employed in current
computing systems. It's especially important in sophisticated processors and high-speed memory interfaces,
because satisfying the demands of contemporary computer activities requires faster data transfer. DDR
technology has, in general, greatly increased computer systems' performance and capacities by speeding
up and streamlining data transfer.
DDR with control signals:
Control signals were introduced in the DDR flip-flops for combinational data synchronization.
Common control signals such as enable and reset are provided.
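A minimal sketch of this control-signal variant, assuming an active-high synchronous reset and a load enable (module and port names are illustrative, not taken from the project's source):

```verilog
// Illustrative DDR flip-flop with synchronous reset and enable.
module ddr_ff_ctrl (
    input  wire clk,
    input  wire rst,   // synchronous, active-high reset
    input  wire en,    // load enable
    input  wire d,
    output wire q
);
    reg q_pos, q_neg;

    always @(posedge clk) begin
        if (rst)     q_pos <= 1'b0;
        else if (en) q_pos <= d;
    end

    always @(negedge clk) begin
        if (rst)     q_neg <= 1'b0;
        else if (en) q_neg <= d;
    end

    assign q = clk ? q_pos : q_neg;
endmodule
```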
3. Execute (EX):
- The decoded instruction is executed in this stage, which involves performing arithmetic, logic, or data
transfer operations.
- For arithmetic or logical operations, the operands are typically sourced from registers or immediate
values, and the result is computed by the ALU (Arithmetic Logic Unit).
- Branch instructions may also be evaluated in this stage to determine whether a branch is to be taken.
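The arithmetic and logic work of the EX stage can be pictured with a small ALU sketch. The op encoding below is made up for illustration and is not the project's actual control encoding:

```verilog
// Minimal EX-stage ALU sketch; the op encoding is illustrative.
module alu (
    input  wire [31:0] a, b,      // operands from registers or immediates
    input  wire [2:0]  op,
    output reg  [31:0] y,
    output wire        zero       // used when evaluating branches (e.g. beq)
);
    always @(*) begin
        case (op)
            3'b000:  y = a + b;   // add
            3'b001:  y = a - b;   // sub
            3'b010:  y = a & b;   // and
            3'b011:  y = a | b;   // or
            3'b100:  y = a ^ b;   // xor
            default: y = 32'd0;
        endcase
    end
    assign zero = (y == 32'd0);
endmodule
```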
This structured approach not only simplifies the design and implementation of the processor but also
enables high performance and scalability. By dividing the instruction processing into distinct stages, the
processor can execute multiple instructions concurrently, leveraging parallelism to enhance throughput and
efficiency.
Instruction memory stands as a fundamental component within the RISC-V datapath architecture, serving
as the repository for program instructions. Its direct integration into the datapath ensures swift access to
instructions, thereby enhancing execution speed and overall efficiency. By storing instructions within the
datapath framework, the instruction memory facilitates sequential fetching, decoding, and execution of
instructions, thus ensuring smooth program flow and optimal performance. As a core element of the RISC-
V datapath, instruction memory plays a vital role in the seamless execution of instructions, contributing
significantly to the efficiency and effectiveness of RISC-V processors.
DRAWBACKS: In the RISC-V datapath, the program counter (PC) advances to the next instruction
address by incrementing by 4 during each clock cycle. This synchronous operation ensures consistent
and reliable updates to the PC value. With instructions typically aligned at byte boundaries in
memory, this incrementation scheme optimizes memory access and instruction fetching. Because the
update occurs on only a single clock edge, however, the rate at which the processor progresses
through the program sequence is bounded by the clock frequency.
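The PC update described above can be sketched as follows (module and port names are illustrative):

```verilog
// Illustrative program counter: advances by 4 bytes (one 32-bit
// instruction) on each rising clock edge.
module pc_reg (
    input  wire        clk,
    input  wire        rst,
    output reg  [31:0] pc
);
    always @(posedge clk) begin
        if (rst) pc <= 32'h0000_0000;
        else     pc <= pc + 32'd4;   // next sequential instruction
    end
endmodule
```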
By replacing conventional clocking blocks with DDR flip-flops in instruction memory, RISC-V
architectures optimize data throughput and processing speed. DDR flip-flops, capable of sampling data on
both rising and falling clock edges, significantly reduce latency and accelerate instruction fetching and
execution. This enhancement in memory bandwidth utilization not only improves overall system
performance but also enhances responsiveness. Moreover, the strategic integration of DDR flip-flops
ensures seamless compatibility and benefits across the entire RISC-V processor architecture.
The control unit within the RISC-V datapath serves as the conductor of instruction execution, generating
essential control signals to coordinate various components efficiently. It interprets instruction opcodes,
determining the sequence of operations required for optimal execution. Through seamless coordination
with the instruction decoder, the control unit facilitates proper data flow between registers, the ALU,
memory, and other functional units. By ensuring synchronization and coordination of datapath operations,
the control unit plays a pivotal role in optimizing performance and executing instructions accurately.
DRAWBACKS: Control signals are synchronized to consecutive clock edges in pipelined registers,
ensuring precise timing coordination. This synchronization enables seamless orchestration of
instruction execution and contributes to the smooth functioning of the RISC-V processor. Since each
signal is captured on only one clock edge per cycle, however, the control path inherits the
throughput limit of single-edge pipelined registers.
The hazard detection and resolution unit in the RISC-V pipeline identifies and resolves hazards, including
data, control, and structural issues. By implementing mechanisms like data forwarding or pipeline stalls, it
minimizes their impact, ensuring smooth operation. This unit also manages data flow to maintain coherence
and prevent incorrect data from affecting instruction execution, enhancing pipeline reliability.
The proposed methodology for enhancing the hazard unit using DDR (Double Data Rate) involves
integrating DDR flip-flops to enable dual-edge sensitivity for hazard detection. This integration allows for
more precise identification of hazards, including data, control, and structural issues. Enhanced hazard
detection mechanisms are implemented, leveraging the increased sensitivity to clock edges, while dynamic
resolution strategies optimize pipeline performance. Thorough evaluation and optimization ensure seamless
integration into the RISC-V pipeline architecture, ultimately improving hazard detection and resolution
capabilities.
The data memory component in the RISC-V datapath efficiently stores data accessed by instructions,
facilitating seamless data management. Through load and store operations, it enables smooth manipulation
of data, interfacing with the CPU to ensure efficient data exchange during execution. Additionally, the
inclusion of cache hierarchy optimizes performance by reducing latency, further enhancing the overall
efficiency of data processing within the RISC-V architecture.
DRAWBACKS: Introduces Propagation Delay: Pipelined registers with single-edge sensitivity may
introduce propagation delays as data moves through pipeline stages. This delay can affect the overall system
throughput and performance, potentially leading to slower execution of instructions.
Introducing DDR instead of pipelined registers for dual-edge sensitivity can enhance data transfer rates by
allowing data to be sampled on both rising and falling edges of the clock signal. This approach potentially
boosts system throughput and performance compared to single-edge sensitive pipelined registers.
The register file within the RISC-V datapath serves as a repository for a set of general-purpose registers,
essential for data manipulation and temporary storage during instruction execution. Comprising multiple
registers, each capable of holding fixed-size data values, commonly 32 or 64 bits long, it enables fast access
to operands required for arithmetic, logic, and data movement operations. This efficient architecture allows
instructions to swiftly read from or write to specific registers within the file, facilitating the rapid execution
of program instructions.
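A minimal sketch of such a register file for the 32-bit case, with register x0 hardwired to zero (names are illustrative, not the project's source):

```verilog
// Illustrative 32 x 32-bit register file with two read ports and
// one write port; register x0 always reads as zero.
module regfile (
    input  wire        clk,
    input  wire        we,        // write enable
    input  wire [4:0]  ra1, ra2,  // read addresses (rs1, rs2)
    input  wire [4:0]  wa,        // write address (rd)
    input  wire [31:0] wd,        // write data
    output wire [31:0] rd1, rd2
);
    reg [31:0] regs [31:0];

    always @(posedge clk)
        if (we && wa != 5'd0)      // writes to x0 are ignored
            regs[wa] <= wd;

    assign rd1 = (ra1 == 5'd0) ? 32'd0 : regs[ra1];
    assign rd2 = (ra2 == 5'd0) ? 32'd0 : regs[ra2];
endmodule
```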
DRAWBACKS: Data transactions with the register file in the RISC-V datapath are completed within a
single clock cycle, enabling instructions to promptly access operands stored in the register file.
With single-edge clocking, however, only one such transaction can complete per cycle, which caps the
achievable throughput.
DDR flip-flops optimize data throughput by sampling on both clock edges, enhancing instruction fetching
and execution efficiency. This dual-edge sensitivity reduces access latency, resulting in faster program
execution and enhanced memory bandwidth utilization. Strategically integrated into the instruction memory
module, DDR flip-flops replace traditional clocking blocks, ensuring seamless compatibility and efficiency
enhancement within the datapath architecture. This systematic upgrade process leads to improved efficiency
and reduced latency, ultimately enhancing overall system performance.
The pipelined registers in the RISC-V datapath are sequential storage elements, typically
flip-flops, used to stage data through the pipeline stages. These registers enable the smooth flow of
data and control signals, facilitating concurrent instruction execution and improving system
performance by reducing overall instruction latency. However, their single-edge sensitivity may limit
the throughput and efficiency achievable in modern processor designs.
DRAWBACKS: Pipelined registers with single-edge sensitivity introduce propagation delays as data
progresses through pipeline stages. These delays can have a notable impact on system throughput and
performance, potentially resulting in slower execution of instructions. As data must wait for the next clock edge
to propagate through each stage, the cumulative effect of these delays can hinder overall efficiency and
responsiveness. This highlights the importance of considering alternative approaches, such as DDR flip-flops
with dual-edge sensitivity, to mitigate propagation delays and optimize system performance within the RISC-
V Datapath architecture.
Integrating DDR (Double Data Rate) flip-flops instead of pipelined registers in the RISC-V datapath introduces
dual-edge sensitivity, enabling data sampling on both rising and falling edges of the clock signal. This approach
holds the potential to significantly enhance data transfer rates, as it allows for more frequent sampling and
processing of data within each clock cycle. By leveraging dual-edge sensitivity, DDR flip-flops can mitigate
the limitations of single-edge sensitive pipelined registers, potentially leading to improved system throughput
and performance. This enhancement in data processing efficiency aligns with the growing demands of modern
computing tasks and contributes to the overall optimization of RISC-V.
CHAPTER 4
4.1 RISC-V
The RISC-V (RV32I) instruction set, with a fixed length of 32 bits aligned to 32-bit boundaries,
is tailored to serve as a comprehensive compile target supporting modern operating systems. Crafted
to minimize hardware requirements, it adopts a little-endian format where the lowest address holds the
least significant byte of a word.
RV32I, a refined iteration of RISC-V, is optimized for constructing RISC machines, offering
broad support for contemporary operations and functionalities. Featuring 32 general-purpose registers
(reg0 to reg31), with reg0 inherently set to 0, it also includes a user-accessible program counter. This
counter, 32 bits in length, increments at the positive edge of the clock, typically by one in word-
addressable instruction memory configurations.
RISC-V was chosen primarily for its pipeline-friendly nature and efficient resource
consumption, making it appealing for software-centric applications. To enhance processor
performance, techniques such as loop unrolling, and compiler scheduling are utilized for runtime
optimization.
Within the RV32I instruction set, there are six distinct formats: R-type, U-type, I-type, B-type,
J-type, and S-type, each serving specific purposes in instruction encoding and execution.
The Register-type RV32I ISA V 2.0, illustrated in Figure 4.1, comprises six fields. The Opcode
field spans 7 bits, determining the instruction type. Source registers (rs1, rs2) and the destination
register (rd) are denoted by five-bit fields. A 10-bit Function field identifies the operation type.
Supported instructions include add, sub, sltu, sll, xor, and, sra, srl, or, and slt.
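The R-type field layout maps directly onto bit slices of the 32-bit instruction word; the 10-bit function field is the concatenation of funct7 and funct3. A sketch of the extraction (signal names are illustrative):

```verilog
// R-type field extraction from a 32-bit instruction word.
wire [31:0] instr;                    // fetched instruction (assumed input)
wire [6:0]  opcode = instr[6:0];      // instruction type
wire [4:0]  rd     = instr[11:7];     // destination register
wire [2:0]  funct3 = instr[14:12];    // operation select (low part)
wire [4:0]  rs1    = instr[19:15];    // first source register
wire [4:0]  rs2    = instr[24:20];    // second source register
wire [6:0]  funct7 = instr[31:25];    // operation select (high part)
```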
Figure 4.3 depicts the Immediate-type RV32I ISA V 2.0. Similar to the R-type format, the Opcode
width within this format is 7 bits. Source register (rs1) and destination register (rd) are denoted by five-bit
fields. A three-bit function field is utilized to specify the operation type. Additionally, there's a dedicated
12-bit field for holding immediate operands, crucial for immediate data operations.
Instructions supported by this format include jalr, lhu, lw, lb, lbu, lh, srai, srli, slli, slti, addi, andi, ori, xori,
and sltiu. Figure 4.4 illustrates the decoding logic of the I-type Instruction.
Figure 4.5 illustrates the Store-type RV32I ISA v2.0. Similar to the R-type format, the Opcode width is 7
bits. Source registers (rs1 and rs2) are identified by five-bit fields. A three-bit function field specifies the
size of the data to be stored. Additionally, there's a separate 12-bit field for holding the immediate operand,
which, when added to rs1, determines the address where the value from rs2 will be stored.
Instructions supported by this format include sw, sb, and sh. Figure 4.6 displays the decoding logic of the
S-type Instruction.
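The S-type immediate is split across two fields of the instruction word. A sketch of its reassembly and of the resulting store address (signal names are illustrative):

```verilog
// S-type immediate: bits [31:25] hold imm[11:5] and bits [11:7]
// hold imm[4:0]; the result is sign-extended to 32 bits.
wire [31:0] imm_s = {{20{instr[31]}}, instr[31:25], instr[11:7]};

// Effective address for sw/sh/sb: base register rs1 plus immediate.
wire [31:0] store_addr = rs1_val + imm_s;
```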
In Figure 4.7, the Branch-type RV32I ISA V 2.0 is depicted. Similar to other instructions, the Opcode width
is 7 bits. Source registers (rs1 and rs2) are represented by five-bit fields, serving as the basis for comparison
during branching operations. The function field, spanning 3 bits, determines the type of condition to be
evaluated for branching.
A separate space of 12 bits accommodates the immediate operand, which, when a branch is taken, is added
to the program counter. Instructions supported by this format include bne, bltu, blt, bgeu, bge, and beq.
Figure 4.8 illustrates the decoding logic of the B-type Instruction.
Figures 4.9 and 4.11 present the U-type and J-type RV32I ISA V 2.0, respectively; the two formats share
a similar structure. The Opcode, spanning 7 bits, distinguishes the instruction format, and the destination
register (rd) is identified by a five-bit field within these instructions.
Additionally, a 20-bit field holds the immediate operand, crucial for immediate data operations. In the
case of J-type instructions, however, the immediate bits are rearranged before being used for branching,
distinguishing this format from the others. Instructions supported by these formats include jal, lui, and
auipc, each serving a distinct purpose in program control and data manipulation.
Figures 4.10 and 4.12 provide insights into the decoding logic of the J-type and U-type instructions,
respectively, elucidating how these instructions are interpreted and executed within the RV32I ISA framework.
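The bit rearrangement of the J-type immediate mentioned above can be sketched in Python (an illustrative model, not the hardware decoder); the immediate fields sit in instruction bits [31:12] in the order imm[20|10:1|11|19:12]:

```python
def j_type_imm(instr: int) -> int:
    """Reassemble the rearranged J-type immediate from a 32-bit word."""
    imm = (((instr >> 31) & 0x1)   << 20) \
        | (((instr >> 12) & 0xFF)  << 12) \
        | (((instr >> 20) & 0x1)   << 11) \
        | (((instr >> 21) & 0x3FF) << 1)
    if imm & 0x100000:               # sign-extend from bit 20
        imm -= 0x200000
    return imm

# jal with offset -4: all upper immediate bits set, low bit implicit 0
word = (1 << 31) | (0x3FE << 21) | (1 << 20) | (0xFF << 12)
offset = j_type_imm(word)
```

Reassembling the scrambled bits of this word yields the branch offset -4.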
1) Arithmetic Operations:
These operations involve mathematical computations such as addition, subtraction, multiplication, and
division. They are crucial for manipulating numerical data and performing calculations within the
processor.
2) Logical Operation:
Logical operations entail bitwise manipulations on binary data. These operations include bitwise AND,
OR, XOR, and logical shifts. They are utilized for tasks such as masking, setting or clearing specific
bits, and logical comparisons.
3) Data Transfer Operations:
These operations move data between the register file and memory. They include load instructions
(e.g., lw, lh, lb) and store instructions (e.g., sw, sh, sb), detailed in Table 4.3.
4) Control Operations:
Control operations govern the flow of execution within the processor. These operations include
branching instructions (e.g., conditional branches like beq for branching if equal) and jump instructions
(e.g., jal for jumping to a specific address). They enable decision-making and looping structures within
programs.
In executing each of these operations, the processor relies on a series of interconnected stages, such as
instruction fetch, decode, execute, memory access, and write back. Each stage contributes to the overall
execution of an instruction, with dependencies between them ensuring the correct sequencing and
completion of operations. This interdependence underscores the intricate nature of processor operation and
the coordination required for efficient instruction execution. The specific details and behaviors of each
instruction type are documented in tables, providing comprehensive guidance for programmers and
hardware designers alike.
Table 4.1 provides a comprehensive list of arithmetic operations that the processor supports. These
operations are executed by the Arithmetic Logic Unit (ALU) during the execution stage of the processor's
pipeline. During execution, arithmetic operations involve two source operands, typically retrieved from
registers, and the resulting value is written back to the register file during the memory write-back stage.
It's important to note that immediate data, which are values directly encoded within the instruction, are
extended to 32 bits before being used in arithmetic operations. This ensures consistency in operand size, as
all operations are performed with respect to 32 bits. In these operations, register 1 always serves as the left-
hand side operand, while register 2 or the immediate data acts as the right-hand side operand.
In essence, arithmetic operations manipulate numerical data using basic mathematical functions such as
addition, subtraction, multiplication, and division. These operations play a fundamental role in data
processing within the processor, facilitating computations required for various tasks and algorithms. The
consistent handling of operand size and the sequential execution of these operations within the processor's
pipeline ensure efficient and reliable arithmetic computation.
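The two conventions the paragraph relies on — immediates extended to 32 bits and all results confined to 32 bits — can be sketched as follows (illustrative Python, not the RTL):

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value: int, bits: int) -> int:
    """Interpret a `bits`-wide two's-complement field as a signed integer."""
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

def alu_add(a: int, b: int) -> int:
    """Addition as the ALU performs it: the result wraps at 32 bits."""
    return (a + b) & MASK32
```

For example, the 12-bit field 0xFFF extends to -1, and 0xFFFFFFFF + 1 wraps around to 0 in 32-bit arithmetic.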
Table 4.2 presents the assortment of Logical operations that the processor supports. These operations are
executed by the Arithmetic Logic Unit (ALU) during the execution stage of the processor's pipeline. In this
stage, two source operands are utilized for the operation, and the resulting output is subsequently written
back to the register file during the memory write-back stage.
During execution, immediate data are extended to 32 bits, ensuring consistency across all operations, which
are conducted within the framework of 32-bit data. The operations are structured such that Register 1
always serves as the left-hand operand, while Register 2 or the immediate data acts as the right-hand
operand.
In simpler terms, the processor can perform a variety of logical operations, such as bitwise AND, OR,
XOR, and logical shifts, using two input sources. These operations occur within a specific stage of the
processor's operation, and the results are stored back into the processor's registers. Immediate data, when
used, are expanded to 32 bits to maintain uniformity in processing, and the operands are arranged such that
Register 1 is on the left side and Register 2 or immediate data is on the right side.
Table 4.3 outlines the data transfer operations supported by the processor. During execution, the ALU
(Arithmetic Logic Unit) handles the address calculation aspect. These operations involve two source
operands, and the resulting data is written back to subsequent stages for memory access. Immediate data,
sign-extended to 32 bits, ensures uniformity in operation, as all computations are performed with respect
to this length.
In these operations, a consistent convention is followed: register 1 is always positioned on the left-hand
side, while register 2 or immediate data occupies the right-hand side. Load operations are executed during
the write-back stage, whereas store operations take place during the memory access stage.
In essence, the table provides a comprehensive overview of how data is transferred within the processor,
detailing the stages involved and the specific procedures for load and store operations. This streamlined
process ensures efficient handling of data movement, contributing to the overall functionality and
performance of the processor.
Table 4.4 provides an overview of the Control Transfer operations supported by the processor. These
operations involve evaluating conditions for branching, a task executed by the Arithmetic Logic Unit
(ALU) during the execution stage of the processor. In this process, two source operands are used, and the
resulting outcome determines whether the branch is taken or not.
Any immediate data involved in these operations are extended to 32 bits, ensuring uniformity in data
handling across the processor. It's important to note that all operations are performed with respect to 32
bits. Additionally, a consistent convention is followed where register 1 is always on the left-hand side,
while register 2 or immediate data is placed on the right-hand side.
The outcome of the branching operation, captured by the taken branch flag, influences the subsequent
instructions' ability to modify the memory and register file of the processor. This mechanism helps maintain
the integrity and coherence of the processor's state during control transfer operations.
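As an illustration of how the ALU's comparison yields the taken-branch flag, the B-type conditions can be modeled in Python (the funct3 encodings follow the RV32I specification; this is a sketch, not the project's RTL):

```python
def branch_taken(funct3: int, a: int, b: int) -> bool:
    """Evaluate a B-type branch condition on two signed 32-bit operands."""
    u = lambda x: x & 0xFFFFFFFF      # unsigned 32-bit view of an operand
    return {
        0b000: a == b,                # beq
        0b001: a != b,                # bne
        0b100: a < b,                 # blt  (signed)
        0b101: a >= b,                # bge  (signed)
        0b110: u(a) < u(b),           # bltu (unsigned)
        0b111: u(a) >= u(b),          # bgeu (unsigned)
    }[funct3]
```

Note how -1 compares as less than 0 for blt but as the largest unsigned value for bltu.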
Pipelining is the methodical approach used by processors to retrieve instructions and execute them through
a sequence of stages. It facilitates the organized storage and execution of instructions, optimizing the
efficiency of processing tasks. Sometimes referred to as pipeline processing, this technique streamlines the
flow of instructions through the processor's stages.
In essence, pipelining enables the processor to concurrently handle multiple instructions by breaking down
the execution process into smaller, sequential stages. Each stage focuses on a specific task, such as
instruction fetch, decode, execute, memory access, and write back. As instructions progress through these
stages, new instructions can be fetched, allowing for continuous processing without waiting for the
completion of earlier instructions.
Fig 4.13 illustrates pipeline scheduling, demonstrating how multiple instructions can simultaneously
occupy different stages of the processor, enhancing overall throughput and efficiency.
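The overlap shown in Fig 4.13 can be sketched with a small Python model that reports which instruction occupies each stage on every cycle (illustrative only):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instr: int) -> list:
    """For each cycle, map every in-flight instruction to its stage."""
    n_cycles = n_instr + len(STAGES) - 1
    table = []
    for cycle in range(n_cycles):
        row = {}
        for i in range(n_instr):
            stage = cycle - i          # instruction i enters IF on cycle i
            if 0 <= stage < len(STAGES):
                row[f"I{i}"] = STAGES[stage]
        table.append(row)
    return table

sched = pipeline_schedule(3)
```

On cycle 2 all three instructions are in flight at once: I0 in EX, I1 in ID, and I2 in IF.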
Three types of hazards can arise in a pipelined processor:
1) Structural Hazards
2) Data Hazards
3) Control Hazards
Hardware Duplication: We can provide more than one resource for the same function. For example, having
separate instruction and data memories eliminates the structural hazard that occurs when an instruction
fetch and a data operation need to access memory at the same time.
Pipeline Scheduling: Compiler techniques can be used to schedule the pipeline so that simultaneous
resource requests do not occur.
Operand Forwarding: Also known as data bypassing, it involves forwarding the result of an operation
directly from the functional unit to the pipeline stage where it is needed, without having to write it to a
register and then read it back.
Pipeline Stalls: Also known as pipeline bubbles, these are deliberate delays inserted into the pipeline to
allow time for the data to be written and then read.
Register Renaming: This involves dynamically reassigning the registers used in the program to avoid
hazards.
Branch Prediction: This involves predicting the outcome of a branch operation and fetching instructions
accordingly. The prediction could be static (always taken or not taken) or dynamic (based on the history of
branch outcomes).
Branch Delay Slots: The compiler or hardware fills the slots following a branch instruction with
instructions that can be executed whether or not the branch is taken.
Loop Unrolling: This involves replicating the body of a loop to reduce the number of branches.
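Operand forwarding, listed above, reduces to a priority check per source register in the execute stage. The sketch below mirrors the common scheme (the select codes 0/1/2 follow the usual 3:1 mux convention; this Python model is illustrative, not the hazard unit's RTL):

```python
def forward_sel(rs_e: int, rd_m: int, regwr_m: bool,
                rd_w: int, regwr_w: bool) -> int:
    """Choose the ALU operand source for one register read in EX.
    0 = value from the register file, 1 = writeback result,
    2 = ALU result still in the memory stage (most recent)."""
    if regwr_m and rd_m == rs_e and rs_e != 0:
        return 2
    if regwr_w and rd_w == rs_e and rs_e != 0:
        return 1
    return 0
```

The memory-stage result takes priority because it is the younger, more recent write; register x0 is never forwarded since it is hardwired to zero.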
Computers and other electronic devices use a memory technology called Double Data Rate (DDR)
Synchronous Dynamic Random Access Memory (SDRAM) to improve performance. DDR SDRAM enables fast data
retrieval and storage by synchronizing its operations with the system clock while allowing random access
to any memory location. As technology advances, there is an increasing need for faster, more sophisticated
memory to manage massive volumes of data effectively. DDR memory has evolved over time to meet these
demands, yielding cost savings as well as notable gains in speed and storage density. One of the main
advantages of DDR is that, compared to conventional memory technologies, it can transfer data twice as
fast by taking advantage of both the rising and falling edges of the clock signal. Modern computing
systems rely on DDR memory because of its crucial role in improving overall system performance and
responsiveness through increased data transfer speed.
Faster data transfer rates: DDR memory allows for faster data transfer rates than SDR memory
because it can transfer data on the rising as well as the falling edges of the clock signal.
Increased bandwidth: Bandwidth, the quantity of data that can be moved in a given amount of time,
is boosted by the faster transfer rates of DDR memory. DDR memory can therefore handle larger
amounts of data simultaneously, which is very helpful for high-performance equipment such as
servers and game consoles.
Improved power efficiency: DDR memory is more power-efficient than SDR memory, so devices using
DDR consume less energy and produce less heat. This is especially helpful for mobile devices,
which must conserve power to operate for extended periods between charges.
Higher capacity: DDR memory has gone through several generations (DDR1, DDR2, DDR3, DDR4, and
DDR5), each with higher capacity specifications; a typical DDR4 module, for instance, offers far
more capacity than a DDR1 module (on the order of 16 GB versus 1 GB). Devices using DDR memory
can therefore store more data than those using SDR memory.
Synchronized with the system clock: DDR memory is also called DDR SDRAM (Double Data Rate
Synchronous Dynamic Random Access Memory), named for its synchronization with the system clock.
Widely used in various applications: Numerous devices, such as mobile phones, game consoles,
servers, and personal computers, use DDR memory. It is a flexible technology with a wide range of
applications.
A flip-flop can function in two modes: standard mode and double data-rate mode. In double data-rate
mode, the flip-flop outputs data on both edges of the applied clock; in standard mode, it outputs data
on only the rising or the falling edge. In double data-rate mode, the two internal latches alternate
roles: while one latch samples the input, the other holds and drives the output, and the roles swap on
each clock edge. Consequently, one of the latches provides output data on every rising and falling edge
of the clock.
Implicit pulsed Dual Edge Flip-Flop: In digital circuits, implicit-pulse dual-edge flip-flops are a
kind of clocking element that can capture data on both the rising and falling edges of a clock
signal. This approach uses two series devices in the logic branch that receive the clock and a
delayed clock, respectively. Unlike explicit pulsed designs, implicit-pulse dual-edge flip-flops
can be employed for dynamic logic but may perform worse because of a deeper nMOS stack, and
because pulse generators cannot be shared among flip-flops, their power overhead is larger.
Nonetheless, compared to explicit pulsed flip-flops, they can offer benefits such as a two-fold
reduction in clock dynamic power consumption and a simpler design with fewer transistors.
CHAPTER 5
IMPLEMENTATION
The following are the steps involved in implementing the proposed methodology:
Step 1: Development of the control and hazard units.
Step 2: Development of the datapath unit.
Step 3: Design and verification of the double data rate flip-flop along with its control signals.
Step 4: Identifying sequential elements and replacing them with DDR flip-flops.
Step 5: Developing test cases for immediate- and register-type instructions, encoding them in the
RISC-V 32I format, and loading them into the instruction ROM.
Step 6: Writing the testbench to simulate and verify the design, and extracting the post-synthesis report.
Control Unit:
The control unit of RISC-V, a modern open-source instruction set architecture (ISA), orchestrates the
execution of instructions within the processor. It decodes instructions, manages data flow, and
coordinates operations, adhering to the RISC philosophy of simplicity and efficiency. This unit is pivotal
in directing the processor's actions for streamlined performance. The control unit is a combinational
block consisting of the main decoder and the ALU control decoder.
ALU control signal generator: It accepts the part of the instruction that specifies the ALU function
(an arithmetic or logical operation) and outputs the ALUctrl and ALUSrcD control signals: ALUctrl
drives the ALU at the execute stage, and ALUSrcD drives the ALU mux that chooses the appropriate ALU input.
Hazard unit:
The pipeline architecture differs from the single-cycle design in that it has flip-flops between every
pair of adjacent stages so that data propagates as one unit, two 3:1 muxes to select the operands for
the ALU in the case of forwarding, and a hazard unit to control the forwarding conditions.
The Datapath is one of the modules; it consists of the ALU, muxes, program counter, sign extender, and
pipeline registers. This module accepts inputs from the control unit, hazard unit, and instruction
memory, and processes data according to the signals of the control and hazard units. The Datapath
encloses all five stages, namely fetch, decode, execute, memory, and writeback.
The Datapath consist of the following:
Pipeline registers: The Datapath contains pipeline registers, normally flip-flops, between each of
the five stages (fetch, decode, execute, memory, and writeback) to avoid metastability and support
data synchronization.
Multiplexers: There are 2:1 and 3:1 muxes in the fetch and execute stages to choose among different
outputs.
Program counters: The Program Counter (PC) is a register in the processor of a computer that holds
the address of the next instruction to be fetched and executed.
Register file: The Register File is a high-speed storage area within the processor of a computer that
stores a small set of data called registers. In the RISC-V architecture, the Register File typically
contains a set of general-purpose registers (GPRs), each capable of storing a fixed-size binary data
word, commonly 32 or 64 bits.
ALU: Performs arithmetic and logical operations on data, such as addition, subtraction, AND, OR,
XOR, etc.
Sign extender: A sign extender is a component in a computer processor that extends the sign bit of a
binary number to fill additional bits. In the context of RISC-V architecture, the sign extender is often
used when dealing with signed integer operations.
Simple adder: A combinational block that adds four (the instruction width in bytes) or an immediate
offset to the program counter.
We designed the DDR flip-flop using two flip-flops, one triggered on the positive clock edge and the
other on the negative edge; a 2:1 mux with the clock as the select line chooses between their outputs.
The DDR flip-flop was modeled structurally.
Fig 5.6 DDR flip-flop with enable and reset as control signals.
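A behavioral Python model of this structure — two edge-triggered flops and a clock-selected 2:1 mux — shows data being captured on both edges (a sketch for intuition, not the structural Verilog):

```python
class DDRFlipFlop:
    """Two flip-flops (positive- and negative-edge triggered) whose
    outputs feed a 2:1 mux with the clock as the select line."""
    def __init__(self):
        self.q_pos = 0      # positive-edge-triggered flop
        self.q_neg = 0      # negative-edge-triggered flop
        self.clk = 0

    def tick(self, clk: int, d: int) -> int:
        if clk == 1 and self.clk == 0:     # rising edge samples q_pos
            self.q_pos = d
        elif clk == 0 and self.clk == 1:   # falling edge samples q_neg
            self.q_neg = d
        self.clk = clk
        # mux: the clock level selects which flop drives the output
        return self.q_pos if clk else self.q_neg

ff = DDRFlipFlop()
outs = [ff.tick(1, 0xA), ff.tick(0, 0xB), ff.tick(1, 0xC)]
```

Each clock edge transfers a new value, so `outs` collects one datum per edge — two transfers per clock period.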
The verification plan for the DDR (Double Data Rate) flip-flop covers its functionality together with
its control signals, timing requirements, transition times, and data stability under various clock and
input conditions. Simulations validate proper functioning on both rising and falling clock edges.
Timing analysis ensures compliance with DDR specifications, while corner cases were covered by the
testbench by applying the appropriate control signals at specific simulation times.
The next step is to identify the sequential elements: each stage boundary consists of pipeline
registers whose conventional flip-flops are replaced with DDR flip-flops. The diagram above
illustrates the replacement of the pipeline registers with DDR flip-flops while the other constructs
remain the same. Similarly, all other pipeline registers between stages are replaced with DDR flip-flops.
We used the reference released by the University of California, Berkeley to encode our instructions;
the instruction format is as follows:
Here rs1 is the source register address, rd is the destination register address, funct3 selects the
ALU operation, and the opcode specifies the type of instruction.
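Packing the fields in the other direction gives the encoded words loaded into the instruction ROM; the sketch below (illustrative Python, not the toolflow we used) packs an I-type instruction:

```python
def encode_i_type(opcode: int, rd: int, funct3: int,
                  rs1: int, imm: int) -> int:
    """Pack I-type fields into a 32-bit RV32I instruction word."""
    return ((imm & 0xFFF) << 20) | ((rs1 & 0x1F) << 15) \
         | ((funct3 & 0x7) << 12) | ((rd & 0x1F) << 7) | (opcode & 0x7F)

# addi x5, x0, 7: opcode 0x13, funct3 0, rd 5, rs1 0, imm 7
word = encode_i_type(0x13, 5, 0, 0, 7)
```

This yields 0x00700293, the standard RV32I encoding of addi x5, x0, 7.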
Step 6: Writing the testbench to simulate and verify, and extracting the post-synthesis report:
We made the testbench the top-level module; from it we instantiated the top module, which consists of
the Datapath, control, and hazard units and is in turn connected to the instruction ROM and data
memory. The testbench drives the global clock and reset signals and runs the processor through the
elaborated design of the instruction ROM, top module, and data memory.
We used the Xilinx ISE power analyzer and synthesizer to synthesize our design, and we extracted the
post-synthesis results through the Xilinx ISE Design Suite.
The proposed architecture enhancement aims to significantly reduce the number of clock cycles per
operation and the operational time compared to the current RISC-V architecture; the maximum clock
frequency decreases from 221.074 MHz to 104.039 MHz, i.e., a 53% difference. Because the DDR flip-flops
transfer data on both clock edges, each clock cycle accomplishes twice the work, enabling faster
execution of tasks and improved system responsiveness. However, the addition of more multiplexers and
flip-flops in the proposed design increases the overall area and power consumption, highlighting a
trade-off between operational time and increased hardware complexity and size.
Property                              Existing RISC-V       Proposed RISC-V       % difference in
                                      Datapath              Datapath              synthesis results
                                      Architecture          Architecture          (existing to proposed)
Minimum time period of
clock required (ns)                   4.523                 9.607                 52
Maximum clock frequency
allowed (MHz)                         221.074               104.039               53
Dynamic power consumed (W)            0.198                 0.001                 99.9
Total power consumed (W)              0.281                 0.083                 97
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
The RISC-V architecture presents a promising solution to address the challenges faced by current
processors in efficiently executing complex instructions. As an open-source, modular, and extensible ISA,
RISC-V offers hardware developers the freedom to customize processors to meet their specific needs,
without the restrictions and licensing fees associated with proprietary architectures. The integration of DDR
flip-flops into the RISC-V processor design is a key enhancement that aims to improve performance and
reduce the delay in executing complex instructions. By leveraging the dual edge triggering of DDR, the
processor can effectively double the data transfer rate, leading to a significant reduction in the number of
clock cycles required for complex instruction execution.
The proposed architecture modifications demonstrate a substantial improvement in operational time, with
a notable decrease in the minimum time period of the clock from 9.607 ns to 4.523 ns. However, this
comes at the cost of increased hardware complexity and size due to the addition of more multiplexers
and flip-flops (see Table 5.1). Overall, the RISC-V processor with DDR flip-flops represents a compelling
alternative to conventional proprietary microprocessor technologies, offering the advantages of open-
source customization, lower entry barriers, and enhanced performance and efficiency. As the RISC-V
community continues to grow and evolve, this architecture is poised to reshape the landscape of the
processor market, driving innovation, and fostering competition across a wide range of applications, from
microcontrollers to supercomputing.
In robotics, the project unlocks new frontiers in precision and efficiency, and medical robotics stand to benefit as well.
Beyond its immediate applications, the project's emphasis on security and privacy underscores its
commitment to responsible innovation. By safeguarding sensitive data collected and processed within
various domains, the project fosters trust and reliability in the adoption of advanced technologies. As
research and development efforts continue to push the boundaries of what is possible, the project's impact
is poised to reshape industries, driving progress and prosperity in the global economy.
Fig 6.1 Rise in trend of RISC-V
Fig 6.2 Practical application of RISC-V processor
REFERENCES
[1] A. E. Phangestu, T. Mujiono, and A. Zaini, "Five-Stage Pipelined 32-Bit RISC-V Base Integer
Instruction Set Architecture Soft Microprocessor Core in VHDL," 2022.
[2] K. Asanovic and D. A. Patterson, "Instruction sets should be free: The case for RISC-V," EECS
Department, University of California, Berkeley, Aug. 2014.
[3] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, "The RISC-V instruction set manual,
Volume 1: User-level ISA, version 2.0," 2014.
[4] M. Poorhosseini, W. Nebel, and K. Grüttner, "A compiler comparison in the RISC-V ecosystem,"
Sep. 2020.
[5] N. M. Qui, C. H. Lin, and P. Chen, "Design and implementation of a 256-bit RISC-V-based
dynamically scheduled very long instruction word on FPGA," 2020.
[6] S. V. Devarapalli, P. Zarkesh-Ha, and S. C. Suddarth, "A Robust and Low Power Dual Data Rate
(DDR) Flip-Flop Using C-Elements," 2010.
APPENDIX
SOURCE CODE
datapath u_dp (
.clk(clk),
.rst(rst),
//instr memory inputs
.instrF(instrF),
//data memory inputs
.read_dataM(read_dataM),
//CU inputs
.immsrcD(immsrcD),
.ALUsrcD(ALUsrcD),
.ALUctrlD(ALUctrlD),
.resultsrcD(resultsrcD),
.regwrD(regwrD),
.jumpD(jumpD),
.jalrD(jalrD),
.branchD(branchD),
.memwrD(memwrD),
//hazard unit inputs
.forwardAE(forwardAE),
.forwardBE(forwardBE),
.stallF(stallF),
.stallD(stallD),
.flushE(flushE),
.flushD(flushD),
//CU outputs
.instrD(instrD)
);
control_unit u_cu (
.opD(instrD[6:0]),
.funct3D(instrD[14:12]),
.funct7_5D(instrD[30]),
//datapath outputs
.immsrcD(immsrcD),
.ALUsrcD(ALUsrcD),
.ALUctrlD(ALUctrlD),
.resultsrcD(resultsrcD),
.regwrD(regwrD),
.jumpD(jumpD),
.jalrD(jalrD),
.branchD(branchD),
//data memory output
.memwrD(memwrD)
);
hazard_unit u_hu (
.rst(rst),
.rs1E(rs1E),
.rs2E(rs2E),
.rdM(rdM),
.rdW(rdW),
.regwrM(regwrM),
.regwrW(regwrW),
//stalling inputs
.rs1D(rs1D),
.rs2D(rs2D),
.rdE(rdE),
.resultsrcE0(resultsrcE0),
//flushing inputs
.PCsrcE0(PCsrcE0),
//forwarding outputs
.forwardAE(forwardAE),
.forwardBE(forwardBE),
//stalling outputs
.stallF(stallF),
.stallD(stallD),
.flushE(flushE),
//flushing outputs
.flushD(flushD)
);
endmodule
//////////////////////////////////////////////////////////////////////
////////////////////////////////
module mux2x1 ( input wire sel , input wire [31:0] in0 , in1 , output
reg [31:0] out );
always@(in1,in0,sel)
begin
if(sel)
begin
out = in1 ;
end
else
begin
out = in0 ;
end
end
endmodule
//////////////////////////////////////////////////////////////////////
////////////////////////////////
module mux3x1 (
input wire [1:0] sel,
input wire [31:0] in0,
input wire [31:0] in1,
input wire [31:0] in2,
output reg [31:0] out );
always@(*)
begin
case(sel)
2'b10 : out = in2;
2'b01 : out = in1;
default : out = in0;
endcase
end
endmodule
//////////////////////////////////////////////////////////////////////
//////////////////////////////////////////
module adder ( input wire [31:0] in1, input wire [31:0] in2, output
wire [31:0] out );
assign out = in1 + in2 ;
endmodule
//////////////////////////////////////////////////////////////////////
////////////////////////////////////
module Sign_ext ( input wire [31:7] in, input wire [1:0] opcode,
output reg [31:0] out );
always@(opcode,in)
begin
case(opcode)
2'b00 : out = { {20{in[31]}} , in[31:20] } ; //I-type instruction
2'b01 : out = { {20{in[31]}} , in[31:25] , in[11:7] } ; //S-type instruction
2'b10 : out = { {20{in[31]}} , in[7] , in[30:25] , in[11:8] , 1'b0 } ; //B-type instruction
2'b11 : out = { {12{in[31]}} , in[19:12] , in[20] , in[30:21] , 1'b0 } ; //J-type instruction
default : out = 32'hxxxxxxxx ;
endcase
end
endmodule
//////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////
module Reg_file (
input wire clk, input wire [4:0] Addr1, input wire [4:0] Addr2, input
wire [4:0] Addr3, input wire [31:0] wd3, input wire we3, output reg
[31:0] rd1, output reg [31:0] rd2
);
//////////////////////////////////////////////////////////////////////
//////////////////////////////////////////
//////////////////////////////////////////////////////////////////////
/////////////////////////////
module main_decoder ( input wire [6:0] op, output reg jump,
output reg jalr,
output reg branch, output reg [1:0] immsrc, output reg ALUsrc,
output reg [1:0] ALUop, output reg [1:0] resultsrc, output reg regwr,
output reg memwr
);
always@(*) begin
case(op)
default :
begin
regwr = 1'bx ;
immsrc = 2'bxx ;
ALUsrc = 1'bx ;
memwr = 1'bx ;
resultsrc = 2'bxx ;
branch = 1'bx ;
ALUop = 2'bxx ;
jump = 1'bx ;
jalr = 1'bx ; end
endcase
end
endmodule
//////////////////////////////////////////////////////////////////////
///////////////////////
module Alu_decoder ( input wire [1:0] ALUop, input wire [2:0] funct3,
input wire funct7_5,
input wire op_5, output reg [1:0] ALUctrl
);
always@(*) begin
case(ALUop)
2'b00 : ALUctrl = 2'b00 ; //add for lw,sw,jalr
2'b01 : ALUctrl = 2'b01 ; //subtract for beq,bne
default : ALUctrl = 2'bxx ;
endcase
end
endmodule
//ALU_CTRL
//////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////
module datapath (
//global inputs
input wire clk, input wire rst,
//instr memory inputs
input wire [31:0] instrF,
//data memory inputs
input wire [31:0] read_dataM,
//CU inputs
input wire [1:0] immsrcD,
input wire ALUsrcD,
input wire [1:0] ALUctrlD,
input wire [1:0] resultsrcD,
input wire regwrD,
input wire jumpD,
input wire jalrD,
input wire branchD,
input wire memwrD,
//hazard unit inputs
input wire [1:0] forwardAE,
input wire [1:0] forwardBE,
input wire stallF,
input wire stallD,
input wire flushE,
input wire flushD,
//CU outputs
output wire [31:0] instrD,
//hazard unit outputs
output wire [4:0] rs1E,
output wire [4:0] rs2E,
output wire [4:0] rdM,
output wire [4:0] rdW,
output wire regwrM,
output wire regwrW,
output wire [4:0] rs1D,
output wire [4:0] rs2D,
output wire [4:0] rdE,
output wire resultsrcE0,
output wire PCsrcE0,
//instr memory outputs
output wire [31:0] PCF,
//data memory outputs
output wire [31:0] ALUoutM,
output wire [31:0] write_dataM,
output wire memwrM
);
////////////////////////////
//flip flops between fetch and decode
generate
for (genvar i = 0; i < 32; i = i + 1) begin : INST_LOOP1
ddre u_ff1 (
.clk(clk),
.rst_p(flushD),
.en(~stallD),
.din(instrF[i]),
.q(instrD[i])
);
end
endgenerate
generate
for (genvar i1 = 0; i1 < 32; i1 = i1 + 1) begin : INST_LOOP2
ddre u_ff2 (
.clk(clk),
.rst_p(flushD),
.en(~stallD),
.din(PCF[i1]),
.q(PCD[i1])
);
end
endgenerate
generate
for (genvar i2 = 0; i2 < 32; i2 = i2 + 1) begin : INST_LOOP3
ddre u_ff3 (
.clk(clk),
.rst_p(flushD),
.en(~stallD),
.din(PCplus4F[i2]),
.q(PCplus4D[i2])
);
end
endgenerate
//////////////////////////////
//flip flops between decode and excute
ddr u_ff4 (
.clk(clk),
.rst_p(flushE),
.din(regwrD),
.q(regwrE)
);
ddr u_ff5(
.clk(clk),
.rst_p(flushE),
.din(resultsrcD[1]),
.q(resultsrcE[1])
);
ddr u_ff51(
.clk(clk),
.rst_p(flushE),
.din(resultsrcD[0]),
.q(resultsrcE[0])
);
ddr u_ff6 (
.clk(clk),
.rst_p(flushE),
.din(memwrD),
.q(memwrE)
);
ddr u_ff7(
.clk(clk),
.rst_p(flushE),
.din(jumpD),
.q(jumpE)
);
ddr u_ff8(
.clk(clk),
.rst_p(flushE),
.din(jalrD),
.q(jalrE)
);
ddr u_ff9(
.clk(clk),
.rst_p(flushE),
.din(branchD),
.q(branchE)
);
generate
for (genvar i2 = 0; i2 < 2; i2 = i2 + 1) begin : INST_LOOP333
ddr u_ff6 (
.clk(clk),
.rst_p(flushE),
.din(ALUctrlD[i2]),
.q(ALUctrlE[i2])
);
end
endgenerate
ddr u_ff11(
.clk(clk),
.rst_p(flushE),
.din(ALUsrcD),
.q(ALUsrcE)
);
generate
for (genvar j = 0; j < 32; j = j + 1) begin : INST_LOOP4
ddr u_ff12 (
.clk(clk),
.rst_p(flushE),
.din(rd1D[j]),
.q(rd1E[j])
);
end
endgenerate
generate
for (genvar j1 = 0; j1 < 32; j1 = j1 + 1) begin : INST_LOOP5
ddr u_ff13 (
.clk(clk),
.rst_p(flushE),
.din(rd2D[j1]),
.q(rd2E[j1])
);
end
endgenerate
generate
for (genvar j2 = 0; j2 < 32; j2 = j2 + 1) begin : INST_LOOP6
ddr u_ff14 (
.clk(clk),
.rst_p(flushE),
.din(PCD[j2]),
.q(PCE[j2])
);
end
endgenerate
generate
for (genvar j31 = 0; j31 < 5; j31 = j31 + 1) begin : INST_LOOP8
ddr u_ff15 (
.clk(clk),
.rst_p(flushE),
.din(rs1D[j31]),
.q(rs1E[j31])
);
end
endgenerate
generate
for (genvar j21 = 0; j21 < 5; j21 = j21 + 1) begin : INST_LOOP9
ddr u_ff16 (
.clk(clk),
.rst_p(flushE),
.din(rs2D[j21]),
.q(rs2E[j21])
);
end
endgenerate
generate
for (genvar j11 = 0; j11 < 5; j11 = j11 + 1) begin : INST_LOOP10
ddr u_ff17 (
.clk(clk),
.rst_p(flushE),
.din(rdD[j11]),
.q(rdE[j11])
);
end
endgenerate
generate
for (genvar j3 = 0; j3 < 32; j3 = j3 + 1) begin : INST_LOOP7
ddr u_ff18 (
.clk(clk),
.rst_p(flushE),
.din(immextD[j3]),
.q(immextE[j3])
);
end
endgenerate
generate
for (genvar j4 = 0; j4 < 32; j4 = j4 + 1) begin : INST_LOOP11
ddr u_ff19 (
.clk(clk),
.rst_p(flushE),
.din(instrD[j4]),
.q(instrE[j4])
);
end
endgenerate
generate
for (genvar j5 = 0; j5 < 32; j5 = j5 + 1) begin : INST_LOOP12
ddr u_ff20 (
.clk(clk),
.rst_p(flushE),
.din(PCplus4D[j5]),
.q(PCplus4E[j5])
);
end
endgenerate
////////////////////////////////////////////
//flip flops between excute and memory
ddfr u_ff21(
.clk(clk),
.din(regwrE),
.q(regwrM)
);
ddfr u_ff221(
.clk(clk),
.din(resultsrcE[1]),
.q(resultsrcM[1])
);
ddfr u_ff222(
.clk(clk),
.din(resultsrcE[0]),
.q(resultsrcM[0])
);
ddfr u_ff23(
.clk(clk),
.din(memwrE),
.q(memwrM)
);
generate
for (genvar j32 = 0; j32 < 32; j32 = j32 + 1) begin : INST_LOOP20
ddfr u_ff24 (
.clk(clk),
.din(ALUoutE[j32]),
.q(ALUoutM[j32])
);
end
endgenerate
generate
for (genvar j33 = 0; j33 < 32; j33 = j33 + 1) begin : INST_LOOP21
ddfr u_ff25 (
.clk(clk),
.din(write_dataE[j33]),
.q(write_dataM[j33])
);
end
endgenerate
generate
for (genvar j17 = 0; j17 < 5; j17 = j17 + 1) begin : INST_LOOP23
ddfr u_ff26 (
.clk(clk),
.din(rdE[j17]),
.q(rdM[j17])
);
end
endgenerate
generate
for (genvar j34 = 0; j34 < 32; j34 = j34 + 1) begin : INST_LOOP22
ddfr u_ff27 (
.clk(clk),
.din(PCplus4E[j34]),
.q(PCplus4M[j34])
);
end
endgenerate
//////////////////////////////////////////
//flip flops between memory and writeback
ddfr u_ff28(
.clk(clk),
.din(regwrM),
.q(regwrW)
);
ddfr u_ff291(
.clk(clk),
.din(resultsrcM[1]),
.q(resultsrcW[1])
);
ddfr u_ff292(
.clk(clk),
.din(resultsrcM[0]),
.q(resultsrcW[0])
);
generate
for (genvar k = 0; k < 32; k = k + 1) begin : INS_LOOP22
ddfr u_ff30 (
.clk(clk),
.din(ALUoutM[k]),
.q(ALUoutW[k])
);
end
endgenerate
generate
for (genvar k4 = 0; k4 < 32; k4 = k4 + 1) begin : INS_LOOP1
ddfr u_ff31 (
.clk(clk),
.din(read_dataM[k4]),
.q(read_dataW[k4])
);
end
endgenerate
generate
for (genvar k3 = 0; k3 < 5; k3 = k3 + 1) begin : INST_LOOP40
ddfr u_ff32 (
.clk(clk),
.din(rdM[k3]),
.q(rdW[k3])
);
end
endgenerate
generate
for (genvar k1 = 0; k1 < 32; k1 = k1 + 1) begin : INS_LOOP2
ddfr u_ff33 (
.clk(clk),
.din(PCplus4M[k1]),
.q(PCplus4W[k1])
);
end
endgenerate
generate
for (genvar k2 = 0; k2 < 32; k2 = k2 + 1) begin : INS_LOOP3
ddre u_ff (
.clk(clk),
.rst_p(rst),
.en(~stallF),
.din(PCnext[k2]),
.q(PCF[k2])
);
end
endgenerate
/////////////////////////////////////
mux3x1 u_pcmux (
.sel(PCsrcE) ,
.in0(PCplus4F) ,
.in1(PCtargetE) ,
.in2( {ALUoutE[31:1],1'b0} ),
.out(PCnext)
);
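The u_pcmux instance above selects the next fetch address: PCsrcE = 0 keeps sequential flow (PC+4), 1 redirects to the branch/jump target, and 2 takes the JALR target with bit 0 forced to zero via {ALUoutE[31:1],1'b0}. A small behavioral sketch in Python (the function name next_pc is ours, not part of the RTL) illustrates the selection:

```python
def next_pc(pc_src: int, pc_plus4: int, pc_target: int, alu_out: int) -> int:
    """Model of the 3:1 PC mux: 0 -> PC+4, 1 -> branch/jump target,
    2 -> JALR target (ALU result with bit 0 forced to 0)."""
    if pc_src == 0:
        return pc_plus4
    if pc_src == 1:
        return pc_target
    return alu_out & ~1  # mirrors {ALUoutE[31:1], 1'b0}

# sequential fetch vs. taken branch vs. JALR to an odd ALU result
print(hex(next_pc(0, 0x104, 0x200, 0x305)))  # 0x104
print(hex(next_pc(1, 0x104, 0x200, 0x305)))  # 0x200
print(hex(next_pc(2, 0x104, 0x200, 0x305)))  # 0x304 (LSB cleared)
```

Clearing bit 0 matches the JALR semantics of RV32I, which requires the least-significant bit of the computed target to be zeroed before it is written to the PC.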
Reg_file u_regf (
.clk(clk),
.Addr1(rs1D),
.Addr2(rs2D),
.Addr3(rdW),
.wd3(resultW),
.we3(regwrW),
.rd1(rd1D),
.rd2(rd2D)
);
Sign_ext u_signext(
.in(instrD[31:7]),
.opcode(immsrcD),
.out(immextD)
);
mux3x1 u_forwardAEmux (
.sel(forwardAE) ,
.in0(rd1E) ,
.in1(resultW) ,
.in2(ALUoutM),
.out(SrcA)
);
mux3x1 u_forwardBEmux (
.sel(forwardBE) ,
.in0(rd2E) ,
.in1(resultW) ,
.in2(ALUoutM),
.out(write_dataE)
);
mux2x1 u_alumux (
.sel(ALUsrcE) ,
.in0(write_dataE) ,
.in1(immextE) ,
.out(SrcB)
);
adder u_adderplus4 (
.in1(PCF),
.in2(32'd4),
.out(PCplus4F)
);
adder u_addertarget (
.in1(PCE),
.in2(immextE),
.out(PCtargetE)
);
Alu u_ALU (
.ALUctrl(ALUctrlE) ,
.A(SrcA) ,
.B(SrcB) ,
.ALUout(ALUoutE) ,
.zero(zero)
);
mux3x1 u_resultmux (
.sel(resultsrcW) ,
.in0(ALUoutW) ,
.in1(read_dataW) ,
.in2(PCplus4W),
.out(resultW)
);
//////////////////////////////////////////////////////////////////////
//////////////
main_decoder u_md (
.op(opD),
.jump(jumpD),
.jalr(jalrD),
.branch(branchD),
.immsrc(immsrcD),
.ALUsrc(ALUsrcD),
.ALUop(ALUopD), //
.resultsrc(resultsrcD),
.regwr(regwrD),
.memwr(memwrD)
);
Alu_decoder u_ad (
.ALUop(ALUopD),
.funct3(funct3D),
.funct7_5(funct7_5D),
.op_5(opD[5]),
.ALUctrl(ALUctrlD)
);
endmodule //control unit
//////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////
module hazard_unit (
//Hazard
//forwarding inputs
input wire rst,
input wire [4:0] rs1E,
input wire [4:0] rs2E,
input wire [4:0] rdM,
input wire [4:0] rdW,
input wire regwrM,
input wire regwrW,
//stalling inputs
input wire [4:0] rs1D,
input wire [4:0] rs2D,
input wire [4:0] rdE,
input wire resultsrcE0,
//flushing inputs
input wire PCsrcE0,
//forwarding outputs
output reg [1:0] forwardAE,
output reg [1:0] forwardBE,
//stalling outputs
output reg stallF ,
output reg stallD,
output reg flushE,
//flushing outputs
output reg flushD
);
always@(*) begin
if( (rs1E == rdM) && regwrM && (rs1E != 0) ) begin
forwardAE = 2'b10 ; end
else if( (rs1E == rdW) && regwrW && (rs1E != 0) ) begin
forwardAE = 2'b01 ; end
else begin
forwardAE = 2'b00 ; end
//forwardBE mirrors forwardAE for the second ALU operand (rs2E)
if( (rs2E == rdM) && regwrM && (rs2E != 0) ) begin
forwardBE = 2'b10 ; end
else if( (rs2E == rdW) && regwrW && (rs2E != 0) ) begin
forwardBE = 2'b01 ; end
else begin
forwardBE = 2'b00 ; end
end
always@(*) begin
if(rst) begin
stallF = 1'b0 ;
stallD = 1'b0 ; end
else if(( (rdE == rs1D) || (rdE == rs2D) ) && resultsrcE0 ) begin
stallF = 1'b1 ;
stallD = 1'b1 ; end
else
begin
stallF = 1'b0 ;
stallD = 1'b0 ; end
end
always@(*) begin
if(rst) begin
flushD = 1'b0 ; end
else if(PCsrcE0) begin
flushD = 1'b1 ; end
else
begin
flushD = 1'b0 ; end
if(rst) begin
flushE = 1'b0 ; end
else if((( (rdE == rs1D) || (rdE == rs2D) ) && resultsrcE0 ) ||
PCsrcE0) begin
flushE = 1'b1 ; end
else
begin
flushE = 1'b0 ; end
end
endmodule //Hazard Unit
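The hazard unit's decisions can be summarized as follows: a source-register match against the MEM-stage destination forwards the newest value (2'b10) and takes priority over a WB-stage match (2'b01), register x0 is never forwarded, and a load in EX whose destination is a source register in ID stalls the front end for one cycle while flushing EX. A behavioral Python sketch of that decision logic (the helper names forward_sel and load_use_stall are ours, not part of the RTL):

```python
def forward_sel(rs_e: int, rd_m: int, regwr_m: bool, rd_w: int, regwr_w: bool) -> int:
    """Forwarding select for one ALU operand, mirroring the hazard unit:
    0b10 -> forward from MEM (newest result), 0b01 -> forward from WB,
    0b00 -> use the register-file read. x0 is never forwarded."""
    if rs_e == rd_m and regwr_m and rs_e != 0:
        return 0b10
    if rs_e == rd_w and regwr_w and rs_e != 0:
        return 0b01
    return 0b00

def load_use_stall(rd_e: int, rs1_d: int, rs2_d: int, resultsrc_e0: bool) -> bool:
    """Stall F/D (and flush E) when a load in EX writes a register
    that the instruction in ID needs (resultsrcE[0] marks a load)."""
    return (rd_e == rs1_d or rd_e == rs2_d) and resultsrc_e0

print(forward_sel(5, 5, True, 5, True))  # 2 -> MEM result wins over WB
print(load_use_stall(3, 3, 7, True))     # True -> bubble inserted
```

The MEM-over-WB priority matters when the same register is written by two in-flight instructions: the MEM-stage value is the younger, architecturally correct one.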
//////////////////////////////////////////////////////////////////////
///////////////////////////
module instr_rom ( input wire [9:0] addr,
output wire [31:0] read_data
);
reg [31:0] mem [0:255] ;
assign read_data = mem[addr[9:2]] ; //addr carries PCF[9:0], a byte address; index whole words
initial begin
$readmemh("testcasess.txt", mem); end
endmodule
//////////////////////////////////////////////////////////////////////
///////////////////////////
module ffp(input din,clk,rst_p, output reg q);
always @(posedge clk or posedge rst_p)
begin
if(rst_p==1)
q<=0;
// qb<=1;
else
q<=din;
// qb<=~din;
end
endmodule
//////////////////////////////////////////////////////////////////////
//
module ffn(input din,clk,rst_p, output reg q);
always @(negedge clk or posedge rst_p)
begin
if(rst_p==1)
q<=0;
// qb<=1;
else
q<=din;
// qb<=~din;
end
endmodule
//////////////////////////////////////////////////////////////////////
module mux2_1(input [1:0]in,input sel,output out);
assign out=(sel)?in[1]:in[0];
endmodule
module ddr(din,rst_p,clk,q);
wire qn,qp;
wire qbn,qbp;
input din,rst_p,clk;
output q;
// assign qb=~q;
ffn f1(din,clk,rst_p,qn);
ffp f2(din,clk,rst_p,qp);
mux2_1 m1({qp,qn},clk,q);
endmodule
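The ddr cell above builds a dual-edge register from one posedge flip-flop (ffp), one negedge flip-flop (ffn), and a 2:1 mux addressed by the clock, so q takes a new value on both clock edges. A behavioral Python model of the same structure (the class name DualEdgeReg is ours, not part of the RTL):

```python
class DualEdgeReg:
    """Behavioral model of the ddr cell: a posedge flop (ffp) and a negedge
    flop (ffn) both sample din; a mux driven by the clock picks whichever
    flop captured most recently, so q updates on BOTH clock edges."""
    def __init__(self):
        self.qp = 0  # value captured on the rising edge (ffp)
        self.qn = 0  # value captured on the falling edge (ffn)

    def edge(self, clk_after_edge: int, din: int, rst_p: int = 0) -> int:
        """Apply one clock edge; clk_after_edge is the clock level AFTER it."""
        if rst_p:
            self.qp = self.qn = 0     # asynchronous reset clears both flops
        elif clk_after_edge == 1:
            self.qp = din             # rising edge: ffp captures
        else:
            self.qn = din             # falling edge: ffn captures
        # mux2_1 m1({qp,qn}, clk, q): clk high selects qp, clk low selects qn
        return self.qp if clk_after_edge else self.qn

r = DualEdgeReg()
print(r.edge(1, 1))  # rising edge captures 1 -> q = 1
print(r.edge(0, 0))  # falling edge captures 0 -> q = 0
```

Because data moves on every edge, this register effectively doubles the pipeline transfer rate per clock period, which is the premise of the DDR pipeline registers used throughout the core.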
/////////////////////////////////////////////////////////////////////
module ffpe(input din,en,clk,rst_p, output reg q);
always @(posedge clk or posedge rst_p)
begin
if(rst_p==1)
q<=0;
// qb<=1;
else
begin
if(en)
q<=din;
// qb<=~din;
end
end
endmodule
//////////////////////////////////////////////////////////////////////
//
module ffne(input din,en,clk,rst_p, output reg q);
always @(negedge clk or posedge rst_p)
begin
if(rst_p==1)
q<=0;
// qb<=1;
else
begin
if(en)
q<=din;
// qb<=~din;end
end
end
endmodule
//////////////////////////////////////////////////////////////////////
module mux2_1e(input [1:0]in,input sel,output out);
assign out=(sel)?in[1]:in[0];
endmodule
/////////////////////////////////////////////////////
module ddre(din,en,rst_p,clk,q);
wire qn,qp;
wire qbn,qbp;
input din,rst_p,clk,en;
output q;
//assign qb=~q;
ffne f1(din,en,clk,rst_p,qn);
ffpe f2(din,en,clk,rst_p,qp);
mux2_1e m1({qp,qn},clk,q);
endmodule
///////////////////////////////////////////////////////
module fifp(input din,clk, output reg q);
always @(posedge clk )
begin
q<=din;
// qb<=~din;
end
endmodule
//////////////////////////////////////////////////////////////////////
//
module fifn(input din,clk, output reg q);
always @(negedge clk )
begin
q<=din;
// qb<=~din;
end
endmodule
//////////////////////////////////////////////////////////////////////
module muxf2_1(input [1:0]in,input sel,output out);
assign out=(sel)?in[1]:in[0];
endmodule
module ddfr(din,clk,q);
wire qn,qp;
wire qbn,qbp;
input din,clk;
output q;
// assign qb=~q;
fifn f1(din,clk,qn);
fifp f2(din,clk,qp);
muxf2_1 m1({qp,qn},clk,q);
endmodule
/////////////////////////////////////////////////////////
module data_ram ( input wire clk,
input wire rst,
input wire we,
input wire [9:0] addr,
input wire [31:0]write_data , output wire [31:0] read_data
);
//simple synchronous data memory; depth assumed from the 10-bit byte address
reg [31:0] mem [0:255] ;
always @(posedge clk) begin
if(we)
mem[addr[9:2]] <= write_data ; end
assign read_data = mem[addr[9:2]] ;
endmodule
//////////////////////////////////////////////////////////////////////
module tb_riscc ; //testbench
reg clk, rst ;
wire [31:0] instrF, instrD, PCF, write_dataM, read_dataM ;
wire [9:0] addr ;
wire memwrM ;
//instantiation
riscc u_top (
.clk(clk),
.rst(rst),
.instrF(instrF),
.addr(addr),
.write_dataM(write_dataM),
.memwrM(memwrM),
.read_dataM(read_dataM),
.PCF(PCF),
.instrD(instrD)
);
instr_rom u_ins_rom (
.addr(PCF[9:0]),
.read_data(instrF)
);
data_ram u_data_ram (
.clk(clk),
.rst(rst),
.we(memwrM),
.addr(addr),
.write_data(write_dataM),
.read_data(read_dataM)
);
initial begin
clk = 0 ;
forever #250 clk = ~clk ; //clk with period 500ps
end
initial begin
rst = 1'b1 ;
#500;
rst = 1'b0 ; end
initial begin
#100000 ;
$stop ; end //stop the simulation after 200 clock cycles
endmodule
TESTCASES
00500113 02728463 FF718393
005203B3 00100113
402383B3 00910133
0471AA23 0221A023
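The TESTCASES words above are RV32I machine code loaded into the instruction memory by $readmemh. As a sanity check, the first word, 00500113, decodes as addi x2, x0, 5. A Python sketch of the I-type field extraction (the helper name decode_itype is ours, not part of the project files):

```python
def decode_itype(word: int):
    """Field extraction for an RV32I I-type instruction (e.g. opcode 0010011)."""
    opcode = word & 0x7F          # bits [6:0]
    rd     = (word >> 7) & 0x1F   # bits [11:7]
    funct3 = (word >> 12) & 0x7   # bits [14:12]
    rs1    = (word >> 15) & 0x1F  # bits [19:15]
    imm    = word >> 20           # bits [31:20]
    if imm & 0x800:               # sign-extend the 12-bit immediate
        imm -= 0x1000
    return opcode, rd, funct3, rs1, imm

# first TESTCASES word: 00500113
op, rd, f3, rs1, imm = decode_itype(0x00500113)
print(hex(op), rd, f3, rs1, imm)  # 0x13 2 0 0 5  -> addi x2, x0, 5
```

The same extraction applied to FF718393 yields rd = 7, rs1 = 3, imm = -9, i.e. addi x7, x3, -9, exercising the sign-extension path of the Sign_ext unit.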