Dhrystone White Paper
ECL, LLC
6507 Jester Blvd, Suite 511, Austin, Texas 78750
2222 Francisco Drive, Suite 510-203, El Dorado Hills, California 95762
Inquiry@ebenchmarks.com
http://www.ebenchmarks.com
512-219-0302
Dhrystone Benchmark
History, Analysis, "Scores" and Recommendations
White Paper
Alan R. Weiss
November 1, 2002
Introduction and Disclosure
The EEMBC Certification Laboratories, LLC (ECL) is recognized as the premier¹ benchmarking and certification laboratory in the semiconductor and software industries, and is the authorized certification body for EEMBC. EEMBC (pronounced "embassy") is the industry-standard processor benchmark consortium, and was set up to create reliable application-based benchmarks to measure processor (and compiler) performance. Despite the growing adoption of EEMBC benchmarks, the Dhrystone benchmark is still misused in the industry. To help people and companies evaluate its usefulness, we decided to analyze Dhrystone for strengths and weaknesses and explain our findings based on real examples.

This White Paper will first explain what "benchmarking" is, how it is used, and offer a set of intended uses. Then, we will explain Dhrystone, exploring its creation, evolution, and intended purpose. From there, we dive into the technical details of Dhrystone, explaining how it works and what it measures. We then try to distill a reasonable set of run rules consistent with its creator's intent, report some interesting scores, and explore how Dhrystone is being used - and misused - by many in the industry. Finally, we compare and contrast Dhrystone with EEMBC's industry-standard benchmarks.
A useful way to characterize benchmarks is whether they are synthetic, or application ("real world") based. A synthetic benchmark is created with the intent to measure one or more features of a system, processor, or compiler. Synthetic benchmarks may try to mimic instruction mixes in real world applications, or they may be artificial. Synthetic benchmarks are useful in debugging specific features, but they cannot be easily related to how that feature will perform in an application. Because they are useful in debugging or isolating specific functionality, synthetic benchmarks tend to be small, though this is not a requirement.
Footnote 1: ECL defines Certification as the process of re-creating the benchmarking environment, verifying the processor and memory bus clock speed, verifying the compiler switches, re-creating the scores, rebuilding the code to ensure scores are re-creatable, and so on. ECL has over 50 separate steps in its benchmark score certification process.
Application benchmarks, also called "real world" benchmarks, use system- or user-level software code drawn from real algorithms or full applications. Application benchmarks are more common in system-level benchmarking and usually have large code and data storage requirements.

A third type of benchmark, called a derived benchmark (or "algorithm-based benchmark"), is a compromise between synthetic and application. As their name implies, derived benchmarks are created by extracting the key algorithms (software code) and generating realistic data sets from real world applications. This avoids the need to execute an entire application, and the benchmark can be used for debugging, internal engineering, and competitive analysis. Derived benchmarks, based on real application code, represent the best of both worlds and are well suited to embedded environments.
Throughout this white paper, it is very important to note Dr. Weicker's long-standing sentiment about his creation:

"Although the Dhrystone benchmark that I published in 1984 was useful at the time," said Weicker, "it cannot claim to be useful for modern workloads and CPUs because it is so short, it fits in on-chip caches, and fails to stress the memory system. Also, because it is so short and does not read from an input file, special compiler optimizations can benefit Dhrystone performance more than normal program performance. In embedded computing, EEMBC (www.eembc.org) is collecting larger real-life embedded-computing programs as the basis for benchmarks."

Dr. Reinhold P. Weicker, Siemens AG, Vice Chairman of the SPEC Open Systems Steering Committee. http://www.einsite.net/ednmag/index.asp?layout=article&articleId=CA46261& st EDN Magazine, 10/28/1999
Dr. Weicker long ago went on to bigger and better things. An important computer scientist, renowned in benchmarking and performance analysis, Weicker has been involved with the SPEC organization (http://www.specbench.org), recognizing the weaknesses inherent in Dhrystone.
traction.

Dhrystone is formally reported as "Dhrystone 2.1 MIPS".
Weakness: Dhrystone users employ confusing and ambiguous terminology such as DMIPS, DMIPS/MHz, Rounded Dhrystones/second, and Dhrystones/CPU cycle. Furthermore, a "MIP" is actually 1.75 DEC VAX MIPS.

Synthetic
Weakness: Dhrystone only measures a few mathematical and basic operations.
Strength: This makes it potentially useful for simple 8- and 16-bit microcontrollers, assuming people don't care about relating anything to real world applications.
Weakness: Does not measure multiply-accumulate, floating-point, SIMD, or any other type of operations.

Library-dependent performance
Weakness: Dhrystone's execution time is largely spent in standard C library functions, such as strcmp(), strcpy(), and memcpy(). Compiler vendors generally provide these libraries, which are typically optimized and hand-written in assembly language. While you may think you are benchmarking a processor, what you are really benchmarking is the compiler writers' optimizations of the C library functions for a particular platform.

No Evolution
Weakness: Compiler writers long ago worked out Dhrystone's functionality. The secret to good benchmarks, as SPEC and EEMBC have shown, is to stay ahead of the compiler writers to ensure that the processor and system are benchmarked, not just the compiler.

No Third-Party Certification
Weakness: Dhrystone's lack of an official certification process (as defined in Footnote #1) has eliminated this benchmark's credibility. Certification can only come from inherent value, and there is very little value in Dhrystone to modern processors or compilers.

No Source Control
Weakness: Dhrystone is available from multiple sources, and while most companies attempt to use Weicker's original source, some servers have "gone dark" as the age of the Web increases. There is great potential that a company, or an individual, has modified the code to its advantage. Some companies report Dhrystone 1.1 scores - an even older version of the code.

No Standard Run Rules
Weakness: Due to the lack of a standards organization, Dhrystone's original runtime rules have eroded into a state of confusion, thereby turning it into a performance measurement that is easily circumvented.
Weakness: The benchmarking environment, including processor and memory clock speed, compiler switches, and libraries, is neither disclosed nor required to be.
Weakness: Instructing the compiler to inline the code, which greatly increases the benchmark's susceptibility to code elimination, typically breaks Dhrystone's apocryphal "rules". The benchmark essentially vanishes and scores get unrealistically good.
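The code-elimination weakness described above can be illustrated with a minimal C sketch. The names kernel(), run_discarding(), run_consuming(), and sink are hypothetical and are not taken from the Dhrystone source; whether a particular compiler actually removes the first loop depends on the compiler and its switches, so treat this as an illustration of the mechanism rather than a statement about any specific toolchain.

```c
#include <assert.h>

/* Hypothetical stand-in for a small benchmark routine (not Dhrystone code). */
int kernel(int x)
{
    return x * 3 + 1;
}

/* The result of kernel() is discarded. Once kernel() is inlined, an
   optimizer may conclude the loop has no observable effect and delete it,
   so a timer around this call could measure almost nothing. */
void run_discarding(long loops)
{
    for (long i = 0; i < loops; i++) {
        (void)kernel((int)i);
    }
}

/* Writes to a volatile object are observable, so they must be performed. */
volatile int sink;

/* The result of every iteration is folded into acc and stored to sink,
   which forces the compiler to keep the work inside the loop. */
int run_consuming(long loops)
{
    int acc = 0;

    assert(loops >= 0);
    for (long i = 0; i < loops; i++) {
        acc ^= kernel((int)i);
        sink = acc;
    }
    return acc;
}
```

Dumping the generated assembly for both functions at a high optimization level is one way to check whether the timed work survived compilation.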
making it necessary to dump code to assembly language to verify what has occurred.

2. Separate compilation. Dhrystone tried to mimic how real programs are written: by linking separate modules together. This reflects 1970s-style "structured programming" techniques that are still used today. Dhrystone's two C source files and header file must not be combined and compiled as one step. That there are only two C source files means this is not a very difficult barrier for most compilers (and compiler writers, having extensively studied Dhrystone, don't find this terribly "real-world").

3. Because Dhrystone scores are so heavily dependent upon C language functions that copy and compare strings (called strcmp() and strcpy()), Dhrystone rules allow compilers or assembly programmers to optimize these functions. In fact, most smart compilers have these library functions written in assembly language. Another trick is to optimize strcpy() by making alignment and fixed-length assumptions for the input strings, but in no case can these functions be optimized in a Dhrystone-specific manner (such as assuming the content, positioning, or length of the strings). In point of fact, this "processor benchmark" can spend between 10% and 20% of its time in these functions!

4. You cannot use post-processing tools (after linkage) to optimize. These illegal optimizations typically fall under the heading of feedback-directed optimizations, and are particularly handy when used with an architecture that has branch prediction and speculative execution.
To understand how different vendors use the Dhrystone benchmark, ECL measured Dhrystone on cores from ARM Ltd and MIPS Technologies, two leaders in the embedded-processor industry. Both companies cooperated fully with ECL and provided all necessary tools and support.
boundaries. Another MIPS compiler vendor was less aggressive, and this points out that compiler selection matters - sometimes by as much as 10-40%! However, using Dhrystone as a benchmark for compilers is flawed: there simply aren't enough different kinds of instructions in Dhrystone, and, as we have seen, libraries matter a great deal more than they should.

The 5kc's inclusion of the multiply-divide unit also proves a point about comparing seemingly similar architectures. The ARM 1026EJ-S and the MIPS 5kc at first blush appear very similar: both are single-issue machines with similar L1 cache sizes, and so on. The MIPS part is a 64-bit part, and the ARM part is a 32-bit part, but architecturally they are more similar than dissimilar. The multiply-divide unit, when utilized with suitable instructions from the instruction-set architecture, gives the 5kc a slight performance boost, and so does the effect of the 64-bit fetches.

Note, however, a nearly fatal flaw with using Dhrystone as an embedded benchmark: nowhere are code and data sizes documented and reported, and nowhere is there a disclosure about the number of gates (transistors, die-size area) required by the processor. In the embedded world, memory is often the most expensive part of a design. Memory requirements have a significant effect on system cost and power consumption.

5kc processor, 40 MHz, TSMC process

Column 2: MIPS 32 Bit 5kc using 5kc binaries    92889.03451    1.321666368 / 1.37
Column 3: MIPS 32 Bit 5kc using 4kc binaries    87697.20141    1.247794665    1.25
Table 2 shows a number of performance metrics based on Dhrystone. Column 1 consists of the typical Dhrystone metrics and other derived calculations. Most scores are reported as Dhrystone MIPS/megahertz (abbreviated as DMIPS/MHz) and/or as VAX Dhrystone MIPS (sometimes just called DMIPS). The number of loops we ran Dhrystone through (in this case, 20,000) had little effect, indicating another Dhrystone weakness: its small size allows it to fit easily inside small L1 caches. After a few thousand loops the score is therefore constant and scales linearly with clock speed. Increasing the loop count to 50,000 had no effect, nor did decreasing it to 5,000.

Column 2 indicates that the Dhrystone code was compiled for 64 bits, which is the native word size of the MIPS 5kc. As can be seen by the resulting DMIPS score of 1374 and a rounded DMIPS/MHz score of 1.37, this option gave the best results. We believe that taking advantage of 64 bits had a significant effect on performance.

Column 3 indicates the effects of emitting 32-bit code for the 5kc, resulting in a noticeable negative effect on performance. We generated this by compiling for the MIPS Technologies 32-bit 4kc core, a different processor, and running the ensuing binary on the 5kc. This procedure validated MIPS Technologies' claim of code compatibility between processors. Performance suffered noticeably with the strcmp() function for the 4kc, though, because comparing 32-bit words takes longer than comparing 64-bit words when the strings are the same size.
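The word-size effect on string comparison can be sketched in C as follows. This is a simplified illustration, not the library strcmp() shipped by any vendor: the function name count_word_compares() and its interface are hypothetical, and it compares two equal-length buffers rather than NUL-terminated strings so that the number of word comparisons is easy to count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare two buffers of len bytes, word_size bytes at a time (4 or 8),
   and return how many word comparisons were performed. *equal is set to 1
   if the buffers match and 0 otherwise. Hypothetical helper for
   illustration only. */
size_t count_word_compares(const unsigned char *a, const unsigned char *b,
                           size_t len, size_t word_size, int *equal)
{
    size_t compares = 0;
    size_t i = 0;

    assert(word_size == 4 || word_size == 8);
    *equal = 1;
    while (i < len) {
        /* The final chunk may be shorter than a full word. */
        size_t chunk = (len - i < word_size) ? (len - i) : word_size;
        uint64_t wa = 0;
        uint64_t wb = 0;

        /* memcpy avoids unaligned-access and aliasing problems. */
        memcpy(&wa, a + i, chunk);
        memcpy(&wb, b + i, chunk);
        compares++;
        if (wa != wb) {
            *equal = 0;
            break;
        }
        i += chunk;
    }
    return compares;
}
```

For two identical 32-byte buffers, this sketch performs 8 comparisons with 4-byte words and 4 comparisons with 8-byte words, which is the kind of difference the paragraph above attributes to running 32-bit binaries on a 64-bit core.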
Analysis of ARM Code and Score
ARM did not violate the Distilled Dhrystone Run Rules. ECL investigated the ARM 1026EJ-S scores using a simulator, as a board with this processor on it was not available at the time of this study. Previously, we had verified on the ARM and MIPS platforms that using an RTL simulator with Dhrystone was practical from a time perspective, since the benchmark easily fits inside small L1 caches. We were suspicious of everything (being a certification laboratory, it's an occupational hazard), so we investigated everything that went into the set-up of the environment. ECL was able to re-create ARM's benchmark environment and obtain exactly the same scores. We hand-inspected the source code ARM provided and found that ARM did not change the code inside the timing loop of Dhrystone; in porting, ARM did nothing to alter the code. ARM ran "out of the box", except for allowed modifications to the printf() functionality to accommodate its particular environment (MIPS Technologies did the same).

The ARM10 architecture is 32 bits, not 64 bits, and lacks a multiply-divide unit. It is a single-issue machine with cache sizes similar to the MIPS 5kc. The Dhrystone benchmark doesn't measure the performance, power, and code size attributes of any processor. Dhrystone score reporting does not require the inclusion of code and data sizes, and nowhere is there a disclosure about the number of gates (transistors, die-size area) used. Further, this processor includes Jazelle, ARM's Java execution functionality. Dhrystone, of course, doesn't exercise any of that functionality. How could it? When it was written, the word "Java" only meant an island of Indonesia, or a variety of coffee. The world has changed since Dhrystone was written - but Dhrystone has not.

ECL had to manually calculate ARM's scores since the RTL simulator only produces a cycle count. After running the benchmark for 25 iterations, the system accumulated 11071 cycles. Similarly, after 24 iterations, the system accumulated 10649 cycles.
Subtracting the two cycle counts, we derived that 422 cycles are used to run one Dhrystone loop. To calculate Dhrystone MIPS / MHz:

    11071 - 10649 = 422 cycles
    1 / (422 cycles * 0.001757) = 1.3487 DMIPS/MHz

The ARM processor has a 16-bit instruction mode called Thumb, which ECL did not try to use. Remember, 1 million instructions per second is not the same as a Dhrystone MIP: a VAX 11/780 actually ran at 1757 Dhrystones per second, hence the adjustment in the equation.
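The arithmetic above can be written as a short C sketch. The function names cycles_per_loop() and dmips_per_mhz() are hypothetical helpers introduced for illustration; the only constant assumed is the 1757 Dhrystones-per-second VAX reference figure already used in the equation.

```c
#include <assert.h>

/* VAX reference score used to normalize Dhrystones/second into DMIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Cycles consumed by one Dhrystone loop, taken as the difference between
   the cycle counts of two simulator runs that differ by one iteration. */
long cycles_per_loop(long cycles_longer_run, long cycles_shorter_run)
{
    assert(cycles_longer_run > cycles_shorter_run);
    return cycles_longer_run - cycles_shorter_run;
}

/* At 1 MHz a core executes 1e6 cycles per second, so it completes
   1e6 / cycles Dhrystone loops per second. Dividing that rate by 1757
   gives DMIPS/MHz, which equals 1 / (cycles * 0.001757). */
double dmips_per_mhz(long cycles)
{
    assert(cycles > 0);
    return 1.0e6 / ((double)cycles * VAX_DHRYSTONES_PER_SEC);
}
```

With the cycle counts reported above, cycles_per_loop(11071, 10649) yields 422, and dmips_per_mhz(422) evaluates to approximately 1.3487.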
In Table 3, we round off and compare the results from the ARM 1026EJ-S and the MIPS Technologies 5Kc on that basis.

Processor        Dhrystones / MHz
MIPS 5KC         1.37
ARM 1026EJ-S     1.35
This data indicates that these processors, per MHz, are practically identical - if all you consider is Dhrystone. As we have seen, this comparison leaves you with a misguided view of the capabilities of each processor.
The Mission Statement for EEMBC is: "EEMBC will work collaboratively to develop a suite of performance benchmarks that will target key applications of embedded systems. These benchmarks will help provide customers an objective means of evaluating processors and controllers."

A problem can occur so long as there is no independent third-party certification, and a canonical benchmark score repository does not exist. EEMBC solves those problems. All scores are available on the EEMBC website, for free. Dhrystone has no such certification, and no canonical main repository of scores exists.

Currently EEMBC has over 180 scores available for free on its website, and has gained a huge amount of support over the years. In 2002 alone, ARC, ARM, Improv Systems, Infineon, Motorola, NEC, SuperH, Tensilica, and Toshiba have published certified scores, with more on the way. Each score has full disclosure of the environment, including the compiler and switch settings. ECL has re-created each score and passed a set of 50 checks to assure that the scores are trustworthy. Furthermore, each score has the associated code and data sizes.

Most importantly, EEMBC is an open process and consortium in which members resolve disputes, express opinions, vote, and include new benchmarks in new versions. It is not static - it moves as the industry moves. EEMBC 2.0 adds:

- 8/16 Bit Microcontroller Benchmarks (an entire suite)
- Java Benchmarks (an entire suite)
- MP3
- MPEG-2 Decode and MPEG-2 Encode
- MPEG-4 Decode
- Voice Over Internet Protocol (VoIP)
- Additional Networking benchmarks
- Cryptography benchmarks
- Ghostscript
A comparison of EEMBC vs. Dhrystone shows the following:

Attribute
Written in C language code
EEMBC
Yes - All code is in ANSI C, except the Java Benchmark Suite
No - both small and larger kernels and benchmarks - a mixture
Yes - there are aggregates such as AutoMark or TeleMark.
Yes - 35 in Version 1.0, another two-dozen coming in Version 2
No - based on application algorithms
No - no need, since over 100 scores available
No - mostly integer, but some floating point, and much that can be rewritten for "full-fury" in assembler, SIMD, etc. Good mixture
No - profiling suggests that good libraries help, but bad libraries do not hurt as much as with Dhrystone
Yes - DMIPS
No
Synthetic
Yes - completely
Yes
Attribute: Evolution
  EEMBC: Yes - EEMBC 1.0 in 1998, 1.1 in 2002, and soon EEMBC 2.0 in 2003
  Dhrystone: Not since 1988

Attribute: Third-Party Certification
  EEMBC: Yes - EEMBC Certification Laboratories, independent, nonbiased, fair
  Dhrystone: No

Attribute: Source Control
  EEMBC: Yes - strong. All code is available to EEMBC members, backed up by ECL using CVS and source management. A "correct version" always exists
  Dhrystone: No

  EEMBC: Yes - extremely strict, but there is a Full Fury (optimized) and an Out of the Box set (don't touch the code).
  Dhrystone: Yes - but open to interpretation, and the lack of certification means some companies can cheat

No
No
Yes: http://www.eembc.org

Attribute: Inlining or Excessive Compiler Optimization destroys the benchmark

Attribute: Full Fury Mode
  EEMBC: Yes - allows any optimization as long as you get the right answer. Helps highlight peripheral performance, and what the theoretical maximum performance of a part would be like
  Dhrystone: No
for their high-end processors, has published EEMBC scores. Motorola found that not publishing scores was a recipe for disaster, and in the words of one Motorola manager:
"We completely underestimated how important EEMBC scores are. We should have done this a long time ago." Chuck Corley, Applications Manager, PowerPC Motorola, Inc.