ACA Notes UNIT-1
Unit-1
1. Computer Architecture
Computer architecture can be defined as the set of rules and methods that describe the
functionality, organization, and implementation of computer systems. In short, it is the set of rules by
which a system operates and performs its work.
Sub-divisions
Computer Architecture can be divided into mainly three categories, which are as follows −
Instruction Set Architecture (ISA) − Defines the instructions that the processor reads and acts
upon, how memory is allocated to instructions, and the memory addressing modes that are
supported (e.g., direct addressing and indirect addressing).
Microarchitecture − It describes how a particular processor implements and executes the
instructions defined by the ISA.
System design − It includes all the remaining hardware components within the system, along with
system-level capabilities such as virtualization and multiprocessing.
The main role of Computer Architecture is to balance the performance, efficiency, cost and
reliability of a computer system.
For Example − Instruction set architecture (ISA) acts as a bridge between computer's software and
hardware. It works as a programmer's view of a machine.
Computers understand only binary (i.e., 0s and 1s), while users write programs in high-level
languages (if-else, while loops, conditions, etc.). The instruction set architecture therefore plays a
major role in bridging user and computer: high-level language programs are translated (compiled)
into the binary instructions defined by the ISA.
Structure (figure omitted)
2. Technology Trends
Computer technology has made incredible progress in the roughly 70 years since the first general-
purpose electronic computer was created. Today, less than $500 will purchase a cell phone that has as
much performance as the world’s fastest computer bought in 1993 for $50 million. This rapid
improvement has come both from advances in the technology used to build computers and from
innovations in computer design.
During the first 25 years of electronic computers, both forces made a major contribution,
delivering performance improvement of about 25% per year.
The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride
the improvements in integrated circuit technology led to a higher rate of performance
improvement—roughly 35% growth per year.
This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to
an increasing fraction of the computer business being based on microprocessors.
Two significant changes in the computer market place made it easier than ever before to succeed
commercially with a new architecture.
First, the virtual elimination of assembly language programming reduced the need for object-code
compatibility.
Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its
clone, Linux, lowered the cost and risk of bringing out a new architecture.
These changes made it possible to develop successfully a new set of architectures with simpler
instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s.
The RISC-based machines focused the attention of designers on two critical performance
techniques, the exploitation of instruction-level parallelism (initially through pipelining and later
through multiple instruction issue) and the use of caches (initially in simple forms and later using
more sophisticated organizations and optimizations).
The RISC-based computers raised the performance bar, forcing prior architectures to keep up or
disappear.
As transistor counts soared in the late 1990s, the hardware overhead of translating the more
complex x86 architecture became negligible.
In low-end applications, such as cell phones, the cost in power and silicon area of the x86-
translation overhead helped lead to a RISC architecture, ARM, becoming dominant.
The combination of architectural and organizational enhancements led to 17 years of sustained
growth in performance at an annual rate of over 50%—a rate that is unprecedented in the
computer industry.
Second, dramatic improvement in cost-performance led to new classes of computers. Personal
computers and workstations emerged in the 1980s with the availability of the microprocessor.
Third, improvement of semiconductor manufacturing as predicted by Moore’s law has led to the
dominance of microprocessor-based computers across the entire range of computer design.
The preceding hardware innovations led to a renaissance in computer design, which emphasized
both architectural innovation and efficient use of technology improvements.
Eventually, the limits of single-processor performance led the microprocessor industry to use
multiple efficient processors or cores instead of a single inefficient processor. Indeed, in 2004
Intel canceled its high-performance uniprocessor projects and joined others in declaring that the
road to higher performance would be via multiple processors per chip rather than via faster
uniprocessors.
3. Moore’s Law
Moore's law is a term used to refer to the observation made by Gordon Moore in 1965 that the
number of transistors in a dense integrated circuit (IC) doubles about every two years.
Originally, it was an observation made by Gordon Moore in 1965, while he was working at Fairchild
Semiconductor, that the number of transistors on a microchip (as they were called in 1965)
doubled about every year.
Moore went on to co-found Intel Corporation and his observation became the driving force behind
the semiconductor technology revolution at Intel and elsewhere.
How Does Moore’s Law Work?
Moore’s law is based on empirical observations made by Moore. The doubling every year of the
number of transistors on a microchip was extrapolated from observed data.
Over time, the details of Moore’s law were amended to better reflect actual growth of transistor
density. The doubling interval was first increased to two years and then decreased to about 18
months. The exponential nature of Moore’s law continued, however, creating decades of
significant opportunity for the semiconductor industry. The true exponential nature of Moore’s law
is illustrated by the figure below.
A straight line on a plot of the logarithm of a function indicates exponential growth of that function.
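As a quick illustration, here is a minimal Python sketch of this exponential behavior (the 1971 starting
point of roughly 2,300 transistors and the fixed two-year doubling period are assumptions used only
for illustration, not historical data from these notes):

import math

# Assumed starting point: roughly 2,300 transistors in 1971, doubling every 2 years.
start_year, start_count, doubling_period = 1971, 2300, 2

for year in range(1971, 2021, 10):
    count = start_count * 2 ** ((year - start_year) / doubling_period)
    # log2(count) grows by a fixed amount per period, so it plots as a straight line.
    print(year, f"{count:,.0f} transistors", f"log2 = {math.log2(count):.1f}")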
4. Classes of Parallelism and Parallel Architectures
Parallelism at multiple levels is now the driving force of computer design across all four classes of
computers, with energy and cost being the primary constraints. There are basically two kinds of
parallelism in applications:
1. Data-level parallelism (DLP) arises because there are many data items that can be operated on at
the same time.
2. Task-level parallelism (TLP) arises because tasks of work are created that can operate independently
and largely in parallel.
Computer hardware in turn can exploit these two kinds of application parallelism in four major
ways:
1. Instruction-level parallelism exploits data-level parallelism at modest levels with compiler help
using ideas like pipelining and at medium levels using ideas like speculative execution.
2. Vector architectures, graphic processor units (GPUs), and multimedia instruction sets exploit data-
level parallelism by applying a single instruction to a collection of data in parallel.
3. Thread-level parallelism exploits either data-level parallelism or task-level parallelism in a tightly
coupled hardware model that allows for interaction between parallel threads.
4. Request-level parallelism exploits parallelism among largely decoupled tasks specified by the
programmer or the operating system.
When Flynn (1966) studied the parallel computing efforts in the 1960s, he found a simple
classification whose abbreviations we still use today. They target data-level parallelism and task-level
parallelism. He looked at the parallelism in the instruction and data streams called for by the
instructions at the most constrained component of the multiprocessor and placed all computers in
one of four categories:
1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor. The
programmer thinks of it as the standard sequential computer, but it can exploit ILP.
2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by
multiple processors using different data streams. SIMD computers exploit data-level parallelism by
applying the same operations to multiple items of data in parallel. Each processor has its own data
memory (hence, the MD of SIMD), but there is a single instruction memory and control processor,
which fetches and dispatches instructions.
3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this type
has been built to date, but it rounds out this simple classification.
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own
instructions and operates on its own data, and it targets task-level parallelism. In general, MIMD is
more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than
SIMD. For example,
MIMD computers can also exploit data-level parallelism, although the overhead is likely to be higher
than would be seen in an SIMD computer. This overhead means that grain size must be sufficiently
large to exploit the parallelism efficiently.
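To make the SIMD/MIMD distinction concrete, here is a minimal Python sketch (purely illustrative;
real SIMD hardware applies one instruction to many data elements in lockstep, and real MIMD
machines run independent instruction streams on separate processors):

from concurrent.futures import ThreadPoolExecutor

data = [1, 2, 3, 4, 5, 6, 7, 8]

# Data-level parallelism (SIMD flavor): the same operation applied to every element.
squared = [x * x for x in data]   # a vector unit or GPU could do these element-wise in parallel

# Task-level parallelism (MIMD flavor): independent tasks, each with its own instruction stream.
def task_sum(xs):
    return sum(xs)

def task_max(xs):
    return max(xs)

with ThreadPoolExecutor() as pool:
    future_sum = pool.submit(task_sum, data)
    future_max = pool.submit(task_max, data)
    total, largest = future_sum.result(), future_max.result()

print(squared, total, largest)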
Instruction Set Architecture: The Myopic View of Computer Architecture
The ISA serves as the boundary between the software and hardware.
This quick review of ISA will use examples from 80x86, ARMv8, and RISC-V to illustrate the seven
dimensions of an ISA.
The most popular RISC processors come from ARM (Advanced RISC Machine).
In addition to a full software stack (compilers, operating systems, and simulators), there are
several RISC-V implementations freely available for use in custom chips or in field-programmable
gate arrays.
We use the integer core ISA of RISC-V as the example ISA.
1. Class of ISA—Nearly all ISAs today are classified as general-purpose register architectures,
where the operands are either registers or memory locations. The 80x86 has 16 general-purpose
registers and 16 that can hold floating-point data, while RISC-V has 32 general-purpose and 32
floating-point registers.
2. Memory addressing—Virtually all desktop and server computers, including the 80x86, ARMv8,
and RISC-V, use byte addressing to access memory operands. Some architectures, like ARMv8,
require that objects be aligned. An access to an object of size s bytes at byte address A is
aligned if A mod s = 0 (see the sketch after this list). The 80x86 and RISC-V do not require
alignment, but accesses are generally faster if operands are aligned.
3. Addressing modes—In addition to specifying registers and constant operands, addressing
modes specify the address of a memory object. RISC-V addressing modes are Register, Immediate
(for constants), and Displacement, where a constant offset is added to a register to form the
memory address. The 80x86 supports those three modes, plus three variations of displacement:
no register (absolute), two registers (based indexed with displacement), and two registers where
one register is multiplied by the size of the operand in bytes (based with scaled index and
displacement).
4. Types and sizes of operands—Like most ISAs, 80x86, ARMv8, and RISC-V support operand sizes
of 8-bit (ASCII character), 16-bit (Unicode character or half word), 32-bit (integer or word), 64-bit
(double word or long integer), and IEEE 754 floating point in 32-bit (single precision) and 64-bit
(double precision). The 80x86 also supports 80-bit floating point (extended double precision).
5. Operations—The general categories of operations are data transfer, arithmetic logical, control
(discussed next), and floating point. RISC-V is a simple and easy-to-pipeline instruction set
architecture, and it is representative of the RISC architectures being used in 2017.
6. Control flow instructions—Virtually all ISAs, including these three, support conditional
branches, unconditional jumps, procedure calls, and returns. All three use PC-relative addressing,
where the branch address is specified by an address field that is added to the PC. There are some
small differences. RISC-V conditional branches (BEQ, BNE, etc.) test the contents of registers, while
the 80x86 and ARMv8 branches test condition code bits set as side effects of arithmetic/logic
operations. The ARMv8 and RISC-V procedure call places the return address in a register, whereas
the 80x86 call (CALLF) places the return address on a stack in memory.
7. Encoding an ISA—There are two basic choices on encoding: fixed length and variable length. All
ARMv8 and RISC-V instructions are 32 bits long, which simplifies instruction decoding. Figure 1.7
shows the RISC-V instruction formats. The 80x86 encoding is variable length, ranging from 1 to 18
bytes. Variable-length instructions can take less space than fixed-length instructions, so a program
compiled for the 80x86 is usually smaller than the same program compiled for RISC-V.
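To make points 2, 3, and 7 concrete, here is a small Python sketch. The alignment test and the
displacement calculation follow directly from the definitions above; the field layout shown is the
standard RISC-V R-type format, and the example instruction word (an assumed encoding of
add x3, x1, x2) is constructed here purely for illustration:

# Point 2: an access of size s bytes at byte address A is aligned if A mod s == 0.
def is_aligned(address: int, size: int) -> bool:
    return address % size == 0

print(is_aligned(0x1000, 8))   # True: 8-byte aligned
print(is_aligned(0x1003, 4))   # False: misaligned 4-byte access

# Point 3: displacement addressing forms the memory address as register + constant offset.
def effective_address(base_register_value: int, displacement: int) -> int:
    return base_register_value + displacement

print(hex(effective_address(0x2000, 16)))   # 0x2010

# Point 7: fixed 32-bit RISC-V encoding. Field layout of the R-type format:
# funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
def decode_r_type(instr: int) -> dict:
    return {
        "opcode": instr & 0x7F,
        "rd":     (instr >> 7)  & 0x1F,
        "funct3": (instr >> 12) & 0x7,
        "rs1":    (instr >> 15) & 0x1F,
        "rs2":    (instr >> 20) & 0x1F,
        "funct7": (instr >> 25) & 0x7F,
    }

# Example 32-bit word assumed to encode add x3, x1, x2 (rd = 3, rs1 = 1, rs2 = 2).
print(decode_r_type(0x002081B3))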
Trends in Technology
Network technology—
Network performance depends both on the performance of switches and on the performance of
the transmission system.
These rapidly changing technologies shape the design of a computer that, with speed and
technology enhancements, may have a lifetime of 3–5 years. Key technologies such as Flash memory
change rapidly enough that the designer must plan for these changes.
Performance is the primary differentiator for microprocessors and networks, so they have seen
the greatest gains: 32,000–40,000× in bandwidth and 50–90× in latency. Capacity is generally
more important than performance for memory and disks, so capacity has improved more, yet
bandwidth advances of 400–2400× are still much greater than gains in latency of 8–9×.
A simple rule of thumb is that bandwidth grows by at least the square of the improvement in
latency. Computer designers should plan accordingly.
(Figure: log-log plot of bandwidth and latency milestones.)
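As a quick sanity check on this rule of thumb (the latency figure below is just an illustrative value, not
one measured in these notes):

# Rule of thumb: bandwidth improvement >= (latency improvement) squared.
latency_improvement = 9                   # e.g., latency became ~9x better (illustrative)
min_expected_bandwidth_gain = latency_improvement ** 2

print(min_expected_bandwidth_gain)        # 81
# The memory/disk numbers above (400-2400x bandwidth vs. 8-9x latency) satisfy this comfortably.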
Trends in Power and Energy
Today, energy is the biggest challenge facing the computer designer for nearly every class of
computer. First, power must be brought in and distributed around the chip, and modern
microprocessors use hundreds of pins and multiple interconnect layers just for power and ground.
Second, power is dissipated as heat and must be removed.
First, what is the maximum power a processor ever requires? Meeting this demand can be
important to ensuring correct operation. For example, if a processor attempts to draw more
power than a power-supply system can provide (by drawing more current than the system can
supply), the result is typically a voltage drop, which can cause devices to malfunction. Modern
processors can vary widely in power consumption with high peak currents; hence they provide
voltage indexing methods that allow the processor to slow down and regulate voltage within a
wider margin.
Second, what is the sustained power consumption? This metric is widely called the thermal
design power (TDP) because it determines the cooling requirement. TDP is neither peak power,
which is often 1.5 times higher, nor is it the actual average power that will be consumed during a
given computation, which is likely to be lower still. A typical power supply for a system is typically
sized to exceed the TDP, and a cooling system is usually designed to match or exceed TDP.
Modern processors provide two features to assist in managing heat, since the highest power (and
hence heat and temperature rise) can exceed the long-term average specified by the TDP. First, as
the temperature approaches the junction temperature limit, circuitry lowers the clock
rate, thereby reducing power. Should this technique not be successful, a second thermal overload
trap is activated to power down the chip.
The third factor that designers and users need to consider is energy and energy efficiency. Recall
that power is simply energy per unit time: 1 watt=1 joule per second. Which metric is the right
one for comparing processors: energy or power? In general, energy is always a better metric
because it is tied to a specific task and the time required for that task. In particular, the energy to
complete a workload is equal to the average power times the execution time for the workload.
Conclusion
Thus, to determine which of two processors is more efficient for a given task, we need to compare
energy consumption (not power) for executing the task. For example, processor A may have a 20%
higher average power consumption than processor B, but if A executes the task in only 70% of the
time needed by B, its relative energy consumption will be 1.2 × 0.7 = 0.84, which is clearly better.
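The same comparison as a tiny Python sketch (values normalized to processor B, using the 20% and
70% figures from the example above):

# Normalize processor B to average power 1.0 and execution time 1.0.
power_a, time_a = 1.2, 0.7    # A: 20% more power, finishes in 70% of the time
power_b, time_b = 1.0, 1.0

energy_a = power_a * time_a   # energy = average power x execution time
energy_b = power_b * time_b

print(round(energy_a, 2), energy_b)   # 0.84 vs 1.0: A uses less energy despite higher power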
Example
Some microprocessors today are designed to have adjustable voltage, so a 15% reduction in voltage
may result in a 15% reduction in frequency. What would be the impact on dynamic energy and on
dynamic power?
Answer
Dynamic energy is proportional to capacitive load × voltage², and dynamic power is proportional to
½ × capacitive load × voltage² × frequency switched. Because the capacitance is unchanged, the
answer for energy is the ratio of the voltages squared:

Energy_new / Energy_old = (Voltage × 0.85)² / Voltage² = 0.85² ≈ 0.72

so dynamic energy drops to about 72% of the original. For dynamic power, we also multiply by the
ratio of the frequencies:

Power_new / Power_old = 0.72 × (Frequency switched × 0.85) / Frequency switched = 0.85³ ≈ 0.61

so dynamic power drops to about 61% of the original.
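A small Python check of this answer (the 0.85 factors come from the 15% reductions in the example;
the capacitance cancels in both ratios):

voltage_scale = 0.85        # 15% reduction in supply voltage
frequency_scale = 0.85      # 15% reduction in clock frequency

# Dynamic energy ~ C * V^2, so the capacitance cancels out of the ratio.
energy_ratio = voltage_scale ** 2
# Dynamic power ~ 0.5 * C * V^2 * f, so the frequency ratio multiplies in as well.
power_ratio = voltage_scale ** 2 * frequency_scale

print(round(energy_ratio, 2), round(power_ratio, 2))   # 0.72 0.61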
Trends in Cost
Although costs tend to be less important in some computer designs—specifically supercomputers—
cost-sensitive designs are of growing significance.
This section discusses the major factors that influence the cost of a computer and how these factors
are changing over time.
The number of dies per wafer is approximately the area of the wafer divided by the area of the
die. It can be more accurately estimated by

Dies per wafer = (π × (Wafer diameter / 2)²) / Die area − (π × Wafer diameter) / √(2 × Die area)
The first term is the ratio of wafer area (πr2 ) to die area. The second compensates for the
“square peg in a round hole” problem—rectangular dies near the periphery of round wafers.
Dividing the circumference (πd) by the diagonal of a square die is approximately the number of
dies along the edge.
This formula gives only the maximum number of dies per wafer. The critical question is: What is the
fraction of good dies on a wafer, or the die yield? A simple model of integrated circuit yield, which
assumes that defects are randomly distributed over the wafer and that yield is inversely proportional
to the complexity of the fabrication process, leads to the following:

Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N

where wafer yield accounts for wafers that are completely bad and need not be tested, and N is a
parameter called the process-complexity factor, a measure of manufacturing difficulty.
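A small Python sketch of both formulas (the wafer diameter, die area, defect density, and N used
below are illustrative assumptions, not values given in these notes):

import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> int:
    """Maximum dies per wafer (ignoring yield)."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss)

def die_yield(wafer_yield: float, defects_per_cm2: float, die_area_cm2: float, n: float) -> float:
    """Fraction of good dies, using the simple yield model above."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

# Illustrative numbers: 30 cm (300 mm) wafer, 2.25 cm^2 die, 0.047 defects/cm^2, N = 12.
print(dies_per_wafer(30, 2.25))
print(round(die_yield(1.0, 0.047, 2.25, 12), 2))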
Performance Metrics and Evaluation
Defining Performance
Performance means different things to different people
Analogy from the airline industry:
Cruising speed (How fast)
Flight range (How far)
Passengers (How many)
Performance Metrics
Response (execution) time:
Time between the start and completion of a task
Measures user perception of the system speed
Common in reactive and time critical systems, single-user computer, etc.
Throughput:
Total number of tasks done in a given time
Most relevant to batch processing (billing, credit card processing, etc.)
Mainly used for input/output systems (disk access, printer, etc.)
To relate the performance of two different computers, say, X and Y, the phrase “X is faster than Y” is
used here to mean that the response time or execution time is lower on X than on Y for the given task.
In particular, “X is n times as fast as Y” will mean

n = Execution time_Y / Execution time_X

Since execution time is the reciprocal of performance, the following relationship holds:

n = Execution time_Y / Execution time_X = (1/Performance_Y) / (1/Performance_X) = Performance_X / Performance_Y
“the throughput of X is 1.3 times as fast as Y” signifies here that the number of tasks completed per
unit time on computer X is 1.3 times the number completed on Y.
Time is not always the metric quoted in comparing the performance of computers.
the only consistent and reliable measure of performance is the execution time of real programs,
and that all proposed alternatives to time as the metric or to real programs as the items
measured have eventually led to misleading claims or even mistakes in computer design.
Even execution time can be defined in different ways depending on what we count.
The most straightforward definition of time is called wall-clock time, response time, or elapsed
time, which is the latency to complete a task, including storage accesses, memory accesses,
input/output activities, operating system overhead—everything.
Benchmarks
The best choice of benchmarks to measure performance is real applications, such as Google
Translate.
Attempts at running programs that are much simpler than a real application have led to
performance pitfalls. Examples of such simpler programs include:
Kernels, which are small, key pieces of real applications.
Toy programs, which are 100-line programs from beginning programming assignments, such as
Quicksort.
Synthetic benchmarks, which are fake programs invented to try to match the profile and behavior
of real applications, such as Dhrystone.
One way to improve the performance of a benchmark has been with benchmark-specific compiler
flags; these flags often caused transformations that would be illegal on many programs or would
slow down performance on others.
To restrict this process and increase the significance of the results, benchmark developers
typically require the vendor to use one compiler and one set of flags for all the programs in the
same language (such as C++ or C).
In addition to the question of compiler flags, another question is whether source code
modifications are allowed. There are three different approaches to addressing this question:
1. No source code modifications are allowed.
2. Source code modifications are allowed but are essentially impossible. For example, database
benchmarks rely on standard database programs that are tens of millions of lines of code. The
database companies are highly unlikely to make changes to enhance the performance for one
particular computer.
3. Source modifications are allowed, as long as the altered version produces the same output.
The key issue that benchmark designers face in deciding to allow modification of the source is
whether such modifications will reflect real practice and provide useful insight to users, or
whether these changes simply reduce the accuracy of the benchmarks as predictors of real
performance.
The goal of a benchmark suite is that it will characterize the real relative performance of two
computers, particularly for programs not in the suite that customers are likely to run.
One of the most successful attempts to create standardized benchmark application suites has
been the SPEC (Standard Performance Evaluation Corporation), which had its roots in efforts in
the late 1980s to deliver better benchmarks for workstations.
Just as the computer industry has evolved over time, so has the need for different benchmark
suites, and there are now SPEC benchmarks to cover many application classes.
Desktop Benchmarks
Desktop benchmarks divide into two broad classes: processor-intensive benchmarks and graphics-
intensive benchmarks, although many graphics benchmarks include intensive processor activity.
SPEC originally created a benchmark set focusing on processor performance (initially called
SPEC89), which has evolved into its sixth generation.
(Figure: SPEC2017 programs and the evolution of the SPEC benchmarks over time, with integer
programs above the line and floating-point programs below the line.)
SPEC benchmarks are real programs modified to be portable and to minimize the effect of I/O on
performance.
The integer benchmarks vary from part of a C compiler to a go program to video compression.
The floating-point benchmarks include molecular dynamics, ray tracing, and weather forecasting.
The SPEC CPU suite is useful for processor benchmarking for both desktop systems and single-
processor servers.
Server Benchmark
Servers have multiple functions, so there are multiple types of benchmarks.
The simplest benchmark is perhaps a processor throughput-oriented benchmark.
SPEC CPU2017 uses the SPEC CPU benchmarks to construct a simple throughput benchmark
where the processing rate of a multiprocessor can be measured by running multiple copies
(usually as many as there are processors) of each SPEC CPU benchmark and converting the CPU
time into a rate.
This leads to a measurement called the SPECrate, and it is a measure of request-level parallelism
To measure thread-level parallelism, SPEC offers what they call high performance computing
benchmarks around OpenMP and MPI as well as for accelerators such as GPUs
Other than SPECrate, most server applications and benchmarks have significant I/O activity arising
from either storage or network traffic, including benchmarks for file server systems, for web
servers, and for database and transaction processing systems.
Transaction-processing (TP) benchmarks measure the ability of a system to handle transactions
that consist of database accesses and updates.
Airline reservation systems and bank ATM systems are typical simple examples of TP; more
sophisticated TP systems involve complex databases and decision-making.
In the mid-1980s, a group of concerned engineers formed the vendor-independent Transaction
Processing Council (TPC) to try to create realistic and fair benchmarks for TP.
The first TPC benchmark, TPC-A, was published in 1985 and has since been replaced and
enhanced by several different benchmarks.
TPC-C, initially created in 1992, simulates a complex query environment.
TPC-H models ad hoc decision support—the queries are unrelated and knowledge of past queries
cannot be used to optimize future queries.
The TPC-DI benchmark, a new data integration (DI) task also known as ETL, is an important part of
data warehousing.
TPC-E is an online transaction processing (OLTP) workload that simulates a brokerage firm’s
customer accounts.
All the TPC benchmarks measure performance in transactions per second. In addition, they
include a response time requirement so that throughput performance is measured only when
the response time limit is met.
The system cost for a benchmark system must be included as well to allow accurate comparisons
of cost-performance.
Let’s see an example, where AVG represents the arithmetic mean of the execution times and GEO
their geometric mean. SPEEDUP is the execution time on COMP_Y divided by the execution time on
COMP_X (i.e., the speedup of COMP_X over COMP_Y):

            COMP_X    COMP_Y    SPEEDUP
APP A          9        18        2
APP B         10         7        0.7
APP C          5        11        2.2
AVG            8        12        1.5
GEO            7.66     11.15     1.46

Note that the ratio of the geometric means (11.15 / 7.66 ≈ 1.46) equals the geometric mean of the
individual speedups, whereas the ratio of the arithmetic means (12 / 8 = 1.5) does not equal the
arithmetic mean of the speedups; this is one reason the geometric mean is used to summarize
benchmark ratios.
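A quick Python check of the table above (execution times copied from the COMP_X and COMP_Y
columns):

from math import prod

times_x = [9, 10, 5]
times_y = [18, 7, 11]

def avg(xs):
    return sum(xs) / len(xs)

def geo(xs):
    return prod(xs) ** (1 / len(xs))

print(avg(times_x), avg(times_y), avg(times_y) / avg(times_x))   # 8.0 12.0 1.5
print(round(geo(times_x), 2), round(geo(times_y), 2))            # 7.66 11.15

speedups = [y / x for x, y in zip(times_x, times_y)]
print(round(geo(speedups), 2))   # 1.46: geometric mean of speedups matches the ratio of the GEO row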
The three components of the CPU time allow us to think about different aspects of computer
architecture. This is called the Iron Law of performance. We can write this expression as

CPU time = IC × CPI × CCT

where
IC is the instruction count (the number of instructions executed),
CPI is the average number of clock cycles per instruction, and
CCT is the clock cycle time (which is the reciprocal of the clock rate).
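A tiny worked example of the Iron Law (the instruction count, CPI, and clock rate below are made-up
illustrative values, not figures from these notes):

instruction_count = 2_000_000_000     # IC: instructions executed (illustrative)
cpi = 1.5                             # average clock cycles per instruction (illustrative)
clock_rate_hz = 3e9                   # 3 GHz clock, so CCT = 1 / clock rate

cpu_time = instruction_count * cpi * (1 / clock_rate_hz)   # IC x CPI x CCT
print(cpu_time)                       # ~1.0 second for this made-up workload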
Amdahl’s Law
The performance gain that can be obtained by improving some portion of a computer can be
calculated using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained
from using some faster mode of execution is limited by the fraction of the time the faster mode can
be used.
Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on
two factors:
1. The fraction of the computation time in the original computer that can be converted to take
advantage of the enhancement—For example, if 40 seconds of the execution time of a program that
takes 100 seconds in total can use an enhancement, the fraction is 40/100. This value, which we call
Fraction_enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode, that is, how much faster the task
would run if the enhanced mode were used for the entire program—This value is the time of the
original mode over the time of the enhanced mode. If the enhanced mode takes, say, 4 seconds for a
portion of the program, while it is 40 seconds in the original mode, the improvement is 40/4 or 10. We
call this value, which is always greater than 1, Speedup_enhanced.
The execution time using the original computer with the enhanced mode will be the time spent using
the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
So what does Amdahl’s Law imply? Let’s think about two different enhancements. Enhancement #1
gives a speedup of 20 on 10% of the execution time; enhancement #2 gives a speedup of 1.6 on 80%
of the execution time. By Amdahl’s Law, the overall speedup for enhancement #1 is
1/(0.9 + 0.1/20) ≈ 1.105, while the overall speedup for enhancement #2 is 1/(0.2 + 0.8/1.6) ≈ 1.43.
This implies that if we put significant effort into speeding up something that is only a small part of
the execution time, we will still not get a very large improvement. In contrast, if we improve a large
part of the execution time, even a reasonably small speedup yields a large overall speedup.
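A minimal Python check of both enhancements (the fractions and speedups come from the example
above; the helper simply implements the Amdahl’s Law formula given earlier):

def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: overall speedup is limited by the fraction of time the enhancement applies."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(overall_speedup(0.10, 20), 3))   # enhancement #1: ~1.105
print(round(overall_speedup(0.80, 1.6), 2))  # enhancement #2: ~1.43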
Lhadma’s Law
This is jokingly called Lhadma’s law (“Lhadma” is simply “Amdahl” spelled backwards). Amdahl’s law
tells us that we have to speed up the most common cases in order to have a significant impact on the
overall speedup, while Lhadma’s law reminds us not to slow down the uncommon cases too badly in
the process.
Let’s see an example. Suppose we obtain a speedup of 2 on 90% of the execution time, but in doing so
slow the remaining 10% down to a speedup of 0.1 (i.e., 10 times slower). The overall speedup can be
calculated by Amdahl’s law:

Speedup_overall = 1 / (0.9/2 + 0.1/0.1) = 1 / (0.45 + 1) ≈ 0.69

Even though 90% of the execution time runs twice as fast, crippling the remaining 10% makes the
overall execution time about 45% longer than before, i.e., a net slowdown.