Active Benchmarking: Casual Benchmarking: You Benchmark A, But Actually Measure B, and Conclude You've Measured C

Active Benchmarking

Debugging benchmarks is something I've done for many years, and I've seen an amazing and comical
variety of failure modes. The problem is that benchmark tools are often run without understanding what
they are testing or checking that the results are valid. This can lead to poor development or
architectural choices that haunt you later on. I previously summarized this situation as:

casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Accurate benchmarking is important. Your company may choose a different server platform, programming
stack, database, or application vendor, based on benchmark results. Accurate results enable good
choices to be made: picking options that deliver the best price performance, and are most likely to
scale under load.

On this page I'll introduce what I call active and passive benchmarking, where active benchmarking
helps you accurately test the true target of the benchmark, and properly understand its results. It
requires more effort at the start, but can save much more time and money later on.

To see this in action, you can jump to the examples.

Summary
To perform active benchmarking:

1. If possible, configure the benchmark to run for a long duration in a steady state: eg, hours.
2. While the benchmark is running, analyze the performance of all components involved using other
   tools, to identify the true limiter of the benchmark.

The process of active benchmarking is similar to the performance analysis of any application. One
difference, which can make this process easier, is that you have a known workload to begin analyzing:
the benchmark applied.
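
As a rough illustration of step 2, the sketch below starts a set of observer tools, runs the benchmark while they sample the system, and stops them when the benchmark exits. The function name run_active, its parameters, and the log file layout are assumptions made for this example and are not part of any benchmark tool. Observer output goes to files rather than pipes so that a run lasting hours cannot stall on a full pipe buffer.

```python
import pathlib
import subprocess


def run_active(benchmark_cmd, observer_cmds, log_dir):
    """Run benchmark_cmd while every command in observer_cmds samples the system.

    observer_cmds maps a short name to an argv list; each observer's output is
    written to <log_dir>/<name>.log. Returns the benchmark's exit code.
    """
    log_dir = pathlib.Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    observers = []
    try:
        # Start the observers first so they cover the whole benchmark run.
        for name, cmd in observer_cmds.items():
            log = open(log_dir / f"{name}.log", "w")
            proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            observers.append((proc, log))
        # Block until the benchmark itself finishes.
        return subprocess.run(benchmark_cmd).returncode
    finally:
        # Stop the observers even if the benchmark failed to start.
        for proc, log in observers:
            proc.terminate()
            proc.wait()
            log.close()
```

On Linux, observer_cmds might be {"vmstat": ["vmstat", "1"], "iostat": ["iostat", "-x", "1"]}, with benchmark_cmd set to whatever command launches your benchmark. Reading the logs is still a manual step: the script only guarantees that the evidence was collected while the benchmark was running.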

Passive Benchmarking
Benchmarks are commonly executed and then ignored until they have completed. That is passive
benchmarking, where the main objective is the collection of benchmark data. Data is not Information.

A telltale sign is when the only technical results presented are the benchmark results. I've seen
countless slide decks, blog posts, and articles that present an impressive bar chart of comparative
results, but then no supporting technical evidence. It's been my job to get to the bottom of many of
these, and I find that they are wrong or misleading almost every time. The primary reason is
that they have been run passively, "fire and forget" style, with no additional analysis, and all problems
were overlooked.

Active Benchmarking
With active benchmarking, you analyze performance while the benchmark is still running (not just after
it's done), using other tools. You can confirm that the benchmark tests what you intend it to, and that
you understand what that is. Data becomes Information. This can also identify the true limiters of the
system under test, or of the benchmark itself.

To perform active benchmarking, you may use any performance analysis tool that your OS provides:
vmstat, iostat, mpstat, sar, top, tcpdump/snoop, perf, bcc+eBPF/DTrace/SystemTap, strace/truss, etc.
You can also follow a performance analysis methodology to guide your usage of these tools. The USE
Method is especially suited for this, since it identifies typical limiters: hardware and software
resources.

How to tell if someone else did active benchmarking:

Did they run other tools while the benchmark was running? Can they provide screenshots?
Can they explain why the benchmark result was X, and not 2X (twice as fast)? ie, what is the limiting factor?

Ideally, include the limiting factor (or suspected limiting factor) along with the benchmark results. For
example: "the file system result was limited by the CPU speed of the server, and the benchmark being
single-threaded". For evidence, this statement could include a screenshot showing that the benchmark
was single-threaded and CPU-bound: for example, on Linux, using "pidstat -t 1"; on Solaris, using
"prstat -mLc 1".
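
The "single-threaded and CPU-bound" check can also be written down as a rule applied to per-thread CPU numbers, such as averages read off the pidstat output. This is a minimal sketch: the function name, the input format, and the 90% and 10% thresholds are assumptions chosen for illustration, not standard values.

```python
def looks_single_threaded_cpu_bound(thread_cpu_pct, busy_pct=90.0, idle_pct=10.0):
    """Return True when exactly one thread is near 100% CPU and the rest are near idle.

    thread_cpu_pct maps a thread ID to its average %CPU over the run.
    busy_pct and idle_pct are illustrative thresholds, not standard values.
    """
    busy = [tid for tid, pct in thread_cpu_pct.items() if pct >= busy_pct]
    others_idle = all(
        pct <= idle_pct for tid, pct in thread_cpu_pct.items() if tid not in busy
    )
    return len(busy) == 1 and others_idle


# Hypothetical per-thread averages for a benchmark process with three threads.
samples = {4101: 99.0, 4102: 0.3, 4103: 0.0}
print(looks_single_threaded_cpu_bound(samples))  # True
```

A True result here supports a statement like the one quoted above; it does not replace the screenshot, which lets readers check the evidence themselves.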

Apart from analysis while the benchmark runs, you should also analyze its configuration beforehand.
Ideally, the benchmark is open source, allowing you to study the source code, as well as any Makefiles
and compiler options.

Examples
The following are worked examples of active benchmarking, showing the tools used for analysis:

Bonnie++: a detailed analysis on both SmartOS/illumos and Fedora/Linux.

Problem Checklist
Common pitfalls that active benchmarking can identify are when the benchmark is:

Perturbed by other system events, including neighbors.
Throttled by software-imposed resource controls.
Throttled by the network between the benchmark client and the server.
Limited by the benchmark software being single-threaded.
Testing different client or server software versions, when doing comparative benchmarking.
Testing disk I/O instead of file system I/O.
Applying an unrealistic workload.

The most common case is where a benchmark is not really testing what it claims to test, which can be
identified using active benchmarking. Sometimes the results are still useful, now that they can be
interpreted correctly.

Statistical Analysis
This is the statistical analysis of numerical benchmark results after the benchmark has completed. This
is often considered a useful exercise to develop new information from raw benchmark data, for better
understanding results, and for developing confidence. However, if the benchmark results were wrong
or misleading to begin with, statistical analysis can make matters worse. A sound statistical method can
make benchmark results seem trustworthy, when in fact, they are false. New information developed
may also be false, compounding the problem.

The only good outcome, given bad results, is that statistical analysis deems them untrustworthy (eg, the
coefficient of variation, CoV, is too high), and analysis moves to understanding what went wrong with the
actual benchmark. In practice, this doesn't happen as much as I'd like. Often, the wrong target has been
benchmarked, but the results are statistically sound.
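
For reference, the CoV is the standard deviation of the results divided by their mean. The sketch below computes it for two made-up sets of run results; the sample values and the function name are hypothetical. Note that a low CoV only says the runs agree with each other, not that they measured the intended target.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean, as a fraction.

    Needs at least two samples, since statistics.stdev is undefined for one.
    """
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return statistics.stdev(samples) / mean


# Hypothetical throughput results from five runs of the same benchmark.
steady = [100.0, 102.0, 98.0, 101.0, 99.0]
noisy = [100.0, 40.0, 160.0, 95.0, 30.0]

print(f"{coefficient_of_variation(steady):.3f}")  # 0.016
print(f"{coefficient_of_variation(noisy):.3f}")   # 0.617
```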

Statistical analysis is useful after active benchmarking – when you have valid numbers to work with.
iostat first, R later.

Updates
I gave a lightning talk at Surge 2013 titled Benchmarking Gone Wrong, which provides a
memorable anecdote for active benchmarking.

Last updated: 20-Apr-2014
