
2. Benchmarking procedures and interpretation of results

A few semi-obvious recommendations:

  1. First and foremost, identify your benchmarking goals. What exactly are you trying to benchmark? How will the benchmarking process help later in your decision making or in advancing Linux? How much time and how many resources are you willing to put into your benchmarking effort?
  2. Use standard tools. Use a current, stable kernel version, standard, current gcc and libc and a standard benchmark (e.g. the Linux Benchmarking Toolkit).
  3. Give a complete description of your setup (e.g. the LBT Report Form).
  4. Try to isolate a single variable. In all cases, comparative benchmarking is more informative than "absolute" benchmarking. I cannot stress this enough.
  5. Verify your results. Run your benchmarks a few times and check how much the results vary from run to run, if at all. Unexplained variations will invalidate your results.
  6. If you think your benchmarking effort produced meaningful information, share it with the Linux community in a precise and concise way.
  7. Please forget about BogoMips. I promise myself I shall someday implement a very fast ASIC with the BogoMips loop wired in. Then we shall see what we shall see!
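
Recommendation 5 can be partly automated. What follows is only a minimal Python sketch, not part of the LBT: it times a workload several times and reports the mean, the standard deviation and the relative variation between runs. The function names and the example workload are hypothetical placeholders; substitute whatever benchmark you actually run.

```python
import statistics
import time


def time_runs(workload, runs=5):
    """Call `workload` several times; return the wall-clock time of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    cv = stdev / mean if mean > 0 else 0.0
    return {"mean": mean, "stdev": stdev, "cv": cv}


def example_workload():
    # Hypothetical stand-in for a real benchmark run.
    sum(i * i for i in range(200_000))


if __name__ == "__main__":
    stats = summarize(time_runs(example_workload, runs=5))
    print(f"mean {stats['mean']:.4f} s, stdev {stats['stdev']:.4f} s, "
          f"variation {100 * stats['cv']:.1f} %")
```

How much variation is acceptable is a judgment call; the point is simply to look at it before quoting a single figure.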

2.1 Understanding benchmarking choices

Synthetic vs. applications benchmarks

Before spending any amount of time on benchmarking chores, a basic choice must be made between "synthetic" benchmarks and "applications" benchmarks.

Synthetic benchmarks are specifically designed to measure the performance of individual components of a computer system, usually by exercising the chosen component to its maximum capacity. An example of a well-known synthetic benchmark is the "Whetstone" suite, originally programmed in 1972 by Harold Curnow in FORTRAN and still in widespread use today. The Whetstone suite measures the floating-point performance of a CPU.

The main criticism that can be made of synthetic benchmarks is that they do not represent a computer system's performance in real-life situations. Take Whetstone, for example: the main loop is very short and will easily fit in the primary cache of a CPU, keeping the FPU pipeline constantly filled and so exercising the FPU at its maximum speed. Keeping in mind that it was programmed 25 years ago (its design dates from even earlier than that!), long before pipelined microprocessors existed, we must interpret its results with care when benchmarking modern RISC microprocessors.

Another very important point to note about synthetic benchmarks is that, ideally, they should tell us something about a specific aspect of the system being tested, independently of all other aspects: a synthetic benchmark for Ethernet card I/O throughput should give the same or similar figures whether it is run on a 386SX-16 with 4 MBytes of RAM or a Pentium 200 MMX with 64 MBytes of RAM. Otherwise, the test will be measuring the overall performance of the CPU/Motherboard/Bus/Ethernet card/Memory subsystem/DMA combination, which is not very useful, since changing the CPU will have a greater impact than changing the Ethernet card (this of course assumes we are using the same kernel/driver combination; changing that could cause an even greater variation)!

Finally, a very common mistake is to average various synthetic benchmarks and claim that such an average is a good representation of real-life performance for any given system. The resulting figure is absolutely useless for two very different reasons:

  1. If we are comparing the relative strengths/weaknesses of various configurations, the relevant information is totally lost in the averaging operation.
  2. The various synthetic tests tell us nothing about the performance of the various subsystems when put to work together in real-world tasks.
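
A small worked example may help with the first reason. The Python sketch below uses index scores that are invented purely for illustration (they are not measurements of any real system): two hypothetical configurations with opposite strengths end up with exactly the same average, while the per-test ratios show how different they really are.

```python
from statistics import mean

# Hypothetical index scores (higher is better), invented for illustration.
scores = {
    "system_a": {"integer": 150.0, "floating_point": 50.0, "disk_io": 100.0},
    "system_b": {"integer": 50.0, "floating_point": 150.0, "disk_io": 100.0},
}


def average_score(results):
    """Collapse all test results of one system into a single figure."""
    return mean(results.values())


def per_test_ratio(a, b):
    """Ratio of system a to system b for each individual test."""
    return {name: a[name] / b[name] for name in a}


if __name__ == "__main__":
    for name, results in scores.items():
        print(name, "average:", average_score(results))
    ratios = per_test_ratio(scores["system_a"], scores["system_b"])
    for test, ratio in ratios.items():
        print(f"{test}: system_a / system_b = {ratio:.2f}")
```

Both averages come out identical, yet one machine is three times faster on integer work and the other three times faster on floating point: exactly the information a buyer would want, and exactly what the average throws away.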

Here is a comment on FPU benchmarks quoted with permission from the Cyrix Corp. Web site:

"A Floating Point Unit (FPU) accelerates software designed to use floating point mathematics: typically CAD programs, spreadsheets, 3D games and design applications. However, today's most popular PC applications make use of both floating point and integer instructions. As a result, Cyrix chose to emphasize "parallelism" in the design of the 6x86 processor to speed up software that intermixes these two instruction types.

The x86 floating point exception model allows integer instructions to issue and complete while a floating point instruction is executing. In contrast, a second floating point instruction cannot begin execution while a previous floating point instruction is executing. To remove the performance limitation created by the floating point exception model, the 6x86 can speculatively issue up to four floating point instructions to the on-chip FPU while continuing to issue and execute integer instructions. As an example, in a code sequence of two floating point instructions (FLTs) followed by six integer instructions (INTs) followed by two FLTs, the 6x86 processor can issue all ten instructions to the appropriate execution units prior to completion of the first FLT. If none of the instructions fault (the typical case), execution continues with both the integer and floating point units completing instructions in parallel. If one of the FLTs faults (the atypical case), the speculative execution capability of the 6x86 allows the processor state to be restored in such a way that it is compatible with the x86 floating point exception model.

Examination of benchmark tests reveals that synthetic floating point benchmarks use a pure floating point-only code stream not found in real-world applications. This type of benchmark does not take advantage of the speculative execution capability of the 6x86 processor. Cyrix believes that non-synthetic benchmarks based on real-world applications better reflect the actual performance users will achieve. Real-world applications contain intermixed integer and floating point instructions and therefore benefit from the 6x86 speculative execution capability."

So, the recent trend in benchmarking is to choose common applications and use them to test the performance of complete computer systems. For example, SPEC, the non-profit corporation that designed the well-known SPECint and SPECfp synthetic benchmarks, has launched a project for a new applications benchmark suite. But then again, it is very unlikely that SPEC benchmarks will ever include any GPLed code: we have to look elsewhere for tests to include in the LBT.

Summarizing, synthetic benchmarks are valid as long as you understand their purposes and limitations. Applications benchmarks will better reflect a computer system's performance, but no standard applications benchmark suite exists for Linux systems.

High-level vs. low-level benchmarks

Low-level benchmarks directly measure the performance of the hardware: CPU clock, DRAM and cache SRAM cycle times, hard disk average access time, latency and track-to-track stepping time, etc. This can be useful if you bought a system and are wondering what components it was built with, but a better way to get this information would be to open the microcomputer case, list whatever part numbers you can find and somehow obtain the data sheet for each part (usually on the Net).

Another, better use for low-level benchmarks is to check that a kernel driver was correctly configured for a specific piece of hardware: if you have the data sheet for the component, you can compare the results of the low-level benchmarks to the theoretical, manufacturer specs.
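
As a rough illustration of this check, here is a minimal Python sketch that compares a measured figure against the data sheet figure. The function names and the 50 % threshold are my own assumptions for the example, not established rules; real hardware rarely reaches its theoretical specification, so the threshold should be chosen per component.

```python
def fraction_of_spec(measured, specified):
    """Return the measured value as a fraction of the data sheet value.

    Both values must use the same unit (for example MB/s) and describe a
    "higher is better" quantity such as a transfer rate.
    """
    if specified <= 0:
        raise ValueError("specified must be positive")
    return measured / specified


def looks_misconfigured(measured, specified, threshold=0.5):
    """Flag results far below the specification (threshold is an assumption)."""
    return fraction_of_spec(measured, specified) < threshold


if __name__ == "__main__":
    # Hypothetical numbers: a device specified at 16 MB/s measured at 2 MB/s.
    print(f"{100 * fraction_of_spec(2.0, 16.0):.1f} % of spec")
    print("suspect driver configuration:", looks_misconfigured(2.0, 16.0))
```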

High-level benchmarks are more concerned with the performance of the hardware/driver/OS/compiler combination for a specific aspect of a microcomputer system, for example file I/O performance, or even for a specific hardware/driver/OS/compiler/application performance, e.g. benchmarking a specific Web server package on different microcomputer systems, or different Web server packages on the same platform.

2.2 Standard benchmarks available for Linux

Kernel Compilation

IMHO a simple test that anyone can do while upgrading any component in his/her Linux box is to launch a kernel compile before and after the hard/software upgrade and compare compilation times. If all other conditions are kept equal (i.e. if you don't change the kernel configuration, for example), then the test is valid as a measure of compilation performance and one can say with confidence that:

"Changing A to B led to an improvement of x % in the compile time of the Linux kernel under such and such conditions".

No more, no less!
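
The arithmetic behind such a statement is simple enough to sketch in Python. This is only an illustration under stated assumptions: `time_command` and `improvement_percent` are hypothetical helper names, the command run in the example is a harmless placeholder, and the 1200 s and 900 s timings are invented. For the real test you would pass your own kernel build command, run from the kernel source tree with an unchanged configuration.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def improvement_percent(before_s, after_s):
    """Percentage reduction in run time going from `before_s` to `after_s`."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return 100.0 * (before_s - after_s) / before_s


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"command took {elapsed:.3f} s")
    # Hypothetical timings: 1200 s before the upgrade, 900 s after.
    print(f"{improvement_percent(1200.0, 900.0):.1f} % improvement")
```

A negative result means the change made compilation slower.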

Since kernel compilation is a very common task under Linux, and since it exercises most functions that get exercised by the usual benchmarks (except floating-point performance), it constitutes a rather good individual test. In most cases, however, results from such a test cannot be reproduced by other Linux users because of variations in hard/software configurations, and so this kind of test cannot be used as a "yardstick" to compare dissimilar systems (unless we all agree on a standard kernel to compile and a standard benchmarking procedure - see below).

Linux-specific benchmarking tools

There are no Linux-specific benchmarking tools yet. There are, however, many Unix benchmarking tools, for example, an improved, updated version of the Byte Unix Benchmarks put together by David C. Niemi. It is called UnixBench 4.10 to avoid confusion with earlier versions. Here is what David wrote about his mods:

"The original and slightly modified BYTE Unix benchmarks are broken in quite a number of ways which make them an unusually unreliable indicator of system performance. I intentionally made my "index" values look a lot different to avoid confusion with the old benchmarks."

The Byte Linux Benchmarks David refers to are a slightly modified version of the Byte Unix Benchmarks dating back to May 1991 (Linux mods by Jon Tombs, original authors Ben Smith, Rick Grehan and Tom Yager).

There is a central Web site for the Byte Linux Benchmarks, but I recommend you start using the new UnixBench benchmarks. If you have any questions on UnixBench, I suggest you contact David through a mailing list he has set up for discussion of benchmarking on Linux and other OS's. Join with "subscribe bench" sent in the body of a message to majordomo@wauug.erols.com.

Also recently, Uwe F. Mayer ported the BYTE Bytemark suite to Linux. This is a modern suite carefully put together by Rick Grehan at BYTE Magazine to test the CPU, FPU and memory system performance of modern microcomputer systems (these are strictly processor-performance oriented benchmarks; no I/O or system-wide performance measurement is taken into account).

Uwe has also put together a Web site with a database of test results for his version of the Linux BYTEmark benchmarks.

To test the relative speed of X servers and graphics cards, the xbench-0.2 suite by Claus Gittinger is available from sunsite.unc.edu, ftp.x.org and other sites. It is relatively old and IMHO does not correctly reflect the performance of modern accelerated X servers.

Quoting Jeremy Chatfield from Xi Graphics:

"Current benchmarks have many weaknesses. For example, they all fail to show "user-responsiveness", the ratio of how fast the screen responds to user changes at the mouse and keyboard. Single figure benchmarks don't help people who primarily use pre-computed graphics separate their needs from those who primarily use text, or who rely on the X Server to generate the image from graphics primitives. Most of what the current benchmarks show is the motherboard RAM->host-CPU->PCI chipset->graphics board bandwidth. That *is* a single figure number, but doesn't reflect what an accelerated X Server does."

XFree86.org refuses (wisely) to carry, support or recommend any X benchmarks.

The XFree86-benchmarks Survey is a Web site with a database of xbench results.

For pure disk I/O throughput, the hdparm program (included with most distributions, otherwise available from sunsite.unc.edu) will measure transfer rates if called with the -t and -T switches. This is a typical low-level benchmark.
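
To make the idea of a raw throughput measurement concrete, here is a minimal Python sketch that reads a file sequentially and reports MB/s. It is not how hdparm works internally and is no substitute for it: the function name and the temporary test file are hypothetical, and since the file was just written, the kernel will most likely serve it from memory, so the figure says more about cached reads than about the disk itself.

```python
import os
import tempfile
import time


def sequential_read_rate(path, block_size=1024 * 1024):
    """Read `path` from start to end; return (bytes_read, MB_per_second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffering; the kernel page cache
    # may still serve the data from memory.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__":
    # Hypothetical test file: 8 MB of zeros in a temporary location.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (8 * 1024 * 1024))
    try:
        size, rate = sequential_read_rate(tmp.name)
        print(f"read {size} bytes at {rate:.1f} MB/s")
    finally:
        os.remove(tmp.name)
```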

There are many other tools freely available on the Internet to test various performance aspects of your Linux box. The Linux Benchmarking Project Web site has links to most of them. This site was set up by the Washington Area Unix Users Group with the specific purpose of being the central repository for Linux benchmarking data on the Net. Note, however, that it is still very much a work in progress.

2.3 Other links and references

The comp.benchmarks FAQ by Dave Sill is the standard reference for benchmarking. It is not Linux-specific, but it is recommended reading for anybody serious about benchmarking. It is available from a number of FTP and Web sites and lists 46 different benchmarks, with links to FTP or Web sites that carry them. Not all benchmarks listed are freely available: some are even quite expensive (SPEC, for example), and few are GPLed.

I can't go through each one of the benchmarks mentioned in the comp.benchmarks FAQ, but there is at least one low-level suite which I would like to comment on: the lmbench suite, by Larry McVoy. Quoting David C. Niemi:

"Linus and David Miller use this a lot because it does some useful low-level measurements and can also measure network throughput and latency if you have 2 boxes to test with. But it does not attempt to come up with anything like an overall "figure of merit"..."

A rather complete FTP site for freely available benchmarks was put together by Alfred Aburto. The Whetstone suite used in the LBT can be found at this site.

There is a long, multipart FAQ by Eugene Miya that gets posted regularly to comp.benchmarks; it is very humorously written, nice reading for a rainy day. I can't resist the following quote:

BenchMARKETing: The Art of Selling Inferior Goods

John L. Larson CSRD, University of Illinois at Urbana-Champaign

...

Technique 8 - Keep nothing constant

* use A to compute matrix multiplication using an assembly language library routine

* use B to compute recurrences in FORTRAN

* measure the performance

* Conclusion: A is faster than B

* Corollary: Apples and oranges are both fruits

Technique 9 - Compare what A will be with what B is now

* announce the availability of A in 3 years

* run benchmarks on B

* compare execution speeds

* Conclusion: A is faster than B

* Corollary: All of tomorrow's problems were solved yesterday

Technique 10 - Compare A with B's predecessor

* run benchmarks on A

* recall performance tables from benchmark articles on the Illiac I

* compare the performance

* Conclusion: A is faster than the HAL-9000

* Corollary: All machines at the University of Illinois are slow

