Parallel Linpack Benchmark
WarpBench is a benchmark for evaluating FPU performance, based on the original Linpack test included in NETLIB. It is available for Windows and Linux systems and detects the number of installed CPUs so that the parallel code can measure the total computational power. If CPU power management is enabled (e.g. AMD's Cool'n'Quiet), the software disables CPU throttling to avoid wrong results caused by the variable clock.
1.1 About the Linpack benchmark
The Linpack benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack benchmark report. The computational power is expressed in Mflop/s, a rate of execution in millions of floating point operations per second. Whenever this term is used it refers to 64 bit floating point operations, and the operations are either additions or multiplications. Gflop/s refers to billions of floating point operations per second and Tflop/s refers to trillions of floating point operations per second.
2. Installation & usage
No installation is required: just run the warpbench command in the command shell. WarpBench can also be installed with VEGA ZZ as an option, in which case it can be executed by selecting VEGA ZZ WarpProject WarpBench.
If you are running the Linux version, the file permissions may not be set correctly. To change them, type at the command prompt:
chmod 755 warpbench
2.1 Command line options
WarpBench can also be run from the command shell. This is the help shown when you invoke the command with the -? option:
WarpBench 1.1.0 - Parallel Linpack Benchmark
Copyright 2006-2022, Alessandro Pedretti
Unrolled single precision Win32 version

Usage: WarpBench -c CPU_NUM -q

c -> Number of threads (default all).
q -> Quiet mode (show only the global performances).
You can set the number of CPUs manually and enable the quiet mode.
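For scripted runs, the two options from the help text can be assembled programmatically. The following is a minimal Python sketch and not part of WarpBench: the helper name build_warpbench_command is hypothetical, and the executable is assumed to be reachable under the name warpbench.

```python
def build_warpbench_command(threads=None, quiet=False, executable="warpbench"):
    """Assemble the argument list for a WarpBench run.

    Only the -c and -q options listed in the help text are used.
    The executable name is an assumption; adjust it to your installation.
    """
    command = [executable]
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be at least 1")
        command += ["-c", str(threads)]
    if quiet:
        command.append("-q")
    return command


# Two threads, quiet mode.
print(build_warpbench_command(threads=2, quiet=True))
# ['warpbench', '-c', '2', '-q']
```

The resulting list can be handed to a process launcher such as Python's subprocess.run. Leaving threads unset omits -c, so the program falls back to its default of using all threads.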
3.0 Benchmark results
This is the report generated by WarpBench Linpack benchmark:
WarpBench 1.1.0 - Parallel Linpack Benchmark
Copyright 2006-2022, Alessandro Pedretti
Unrolled single precision Win32 version
Performing the benchmark for 2 CPU(s) ...
Average values for one CPU:
norm resid  resid            machep           x-1               x[n-1]-1
1.9         4.52336171e-005  1.19209290e-007  -1.31130219e-005  -1.30534172e-00

Times are reported for matrices of order 100
1 pass times for array with leading dimension of 201

dgefa    dgesl    total    Mflops  unit    ratio
0.00068  0.00003  0.00071  972.52  0.0021  0.0126
Overhead for 1 matgen 0.00014 seconds
Matgen/dgefa passes used 1239 for 1 seconds

Times for array with leading dimension of 201

dgefa    dgesl    total    Mflops  unit    ratio
0.00067  0.00011  0.00077  888.80  0.0023  0.0138
0.00067  0.00011  0.00078  882.40  0.0023  0.0139
0.00067  0.00011  0.00078  880.13  0.0023  0.0139
0.00068  0.00011  0.00078  876.85  0.0023  0.0140
0.00072  0.00011  0.00082  834.61  0.0024  0.0147
Average 872.56

Calculating matgen2 overhead
Overhead for 1 matgen 0.00014 seconds

Times for array with leading dimension of 200

dgefa    dgesl    total    Mflops  unit    ratio
0.00071  0.00011  0.00082  839.49  0.0024  0.0146
0.00071  0.00011  0.00082  841.87  0.0024  0.0146
0.00072  0.00011  0.00082  834.09  0.0024  0.0147
0.00071  0.00011  0.00082  841.08  0.0024  0.0146
0.00071  0.00011  0.00082  838.77  0.0024  0.0146
Average 839.06
Total computational power 1678.12 Mflops
The norm resid is a measure of the accuracy of the computation. The value should be O(1). If the value is much greater than O(100), it suggests that the results are not correct.
The resid is the unnormalized quantity.
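The relation between the two quantities can be sketched as follows. The formula is the one used in the reference C translation of Linpack, residn = resid / (n * norma * normx * eps); that WarpBench keeps it unchanged is an assumption. norma and normx (the largest absolute matrix entry and the largest absolute solution component) are not printed in the report, so the values passed below are placeholders for illustration only.

```python
def normalized_residual(resid, n, norma, normx, eps):
    """Scale the raw residual by problem size, data magnitude and precision.

    Formula taken from the reference C Linpack; assumed, not confirmed,
    to be what WarpBench prints as "norm resid".
    """
    if min(n, norma, normx, eps) <= 0:
        raise ValueError("n, norma, normx and eps must be positive")
    return resid / (n * norma * normx * eps)


# resid and eps come from the sample report (single precision build);
# norma=2.0 and normx=1.0 are hypothetical placeholder values.
value = normalized_residual(4.52336171e-05, 100, 2.0, 1.0, 2.0 ** -23)
print(f"{value:.2f}")
```

A result of order 1 means the raw residual is as small as the working precision allows for a problem of this size.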
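The term machep measures the precision used to carry out the computation. On an IEEE floating point computer the value should be 2.22044605e-16 in double precision; the single precision build used for the report above gives 1.19209290e-007 instead.
The values of x-1 and x[n-1]-1 are the first and last components of the solution, each minus one. The problem is constructed so that all the values of the solution should be ones, so both quantities should be close to zero.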
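The machine epsilon can be estimated with a simple halving loop, shown here as a Python sketch. This is a common textbook approach and not necessarily the routine used inside the benchmark. Python floats are IEEE double precision, so the loop yields the double precision value; the single precision value is computed directly as 2^-23.

```python
import sys


def machine_epsilon():
    """Return the smallest power of two eps with 1.0 + eps > 1.0 (double precision)."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


double_eps = machine_epsilon()
single_eps = 2.0 ** -23  # IEEE single precision epsilon

print(f"{double_eps:.8e}")  # 2.22044605e-16
print(f"{single_eps:.8e}")  # 1.19209290e-07
print(double_eps == sys.float_info.epsilon)  # True
```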
Two sets of timings are performed, both on matrices of size 100. In the first set the 2-dimensional array containing the matrix has a leading dimension of 201; in the second set the leading dimension is 200. This is done to see what effect, if any, the placement of the arrays in memory has on the performance.
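To make the role of the leading dimension concrete, the sketch below computes where each column of the matrix starts in memory. It assumes the column-major layout of the reference C Linpack, where element (i, j) is stored at offset lda*j + i; the function names are hypothetical.

```python
def flat_index(i, j, lda):
    """Offset of element (i, j) in column-major storage with leading dimension lda."""
    return lda * j + i


def column_starts(columns, lda):
    """Offsets of the first element of each of the first `columns` columns."""
    return [flat_index(0, j, lda) for j in range(columns)]


print(column_starts(3, 201))  # [0, 201, 402]
print(column_starts(3, 200))  # [0, 200, 400]
```

The matrix itself is the same 100 x 100 block in both cases; only the stride between consecutive columns changes, which shifts how the columns map onto cache lines and memory banks.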
Times for dgefa and dgesl are reported. dgefa factors the matrix using Gaussian elimination with partial pivoting, and dgesl solves a system based on the factorization. dgefa requires 2/3 n^3 operations and dgesl requires on the order of n^2 operations. The value of total is the sum of the two times, and Mflops is the execution rate in millions of floating point operations per second. Here a floating point operation is taken to be a floating point addition or multiplication. Unit and ratio are obsolete and should be ignored. If the time reported is negative or zero, the clock resolution is not fine enough for the granularity of the work. In this case a different timing routine with better resolution should be used.
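The conversion from a measured time to an execution rate can be sketched in Python as follows. The reference C Linpack counts 2/3 n^3 + 2 n^2 operations in total; that WarpBench uses the same count is an assumption, although it agrees with the sample report to within the rounding of the printed times. The function names are hypothetical.

```python
def linpack_ops(n):
    """Operation count used by the reference C Linpack for order n."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2


def mflops(n, total_seconds):
    """Execution rate in millions of floating point operations per second."""
    if total_seconds <= 0.0:
        # Mirrors the warning in the text: the timer is too coarse for the work.
        raise ValueError("non-positive time: clock resolution is too coarse")
    return linpack_ops(n) / (1.0e6 * total_seconds)


# Order 100 and the rounded total of 0.00071 s from the first report line.
print(f"{mflops(100, 0.00071):.1f}")
```

The printed figure differs slightly from the 972.52 in the report because the report rounds the time to five decimals before displaying it.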
If the system has more than one CPU, all results are average values for one CPU, and the Total computational power is the sum of the results for all CPUs.
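The arithmetic behind the last line of the sample report can be reproduced with a short Python sketch. In that report the total equals the CPU count times the average of the second timing set (leading dimension 200); whether the program always uses that particular average is an assumption made here for illustration.

```python
def average(values):
    """Arithmetic mean of a non-empty list of rates."""
    return sum(values) / len(values)


def total_power(per_cpu_mflops, cpu_count):
    """Sum of identical per-CPU results, i.e. per-CPU rate times CPU count."""
    return per_cpu_mflops * cpu_count


# Mflops column of the second timing set in the sample report.
runs = [839.49, 841.87, 834.09, 841.08, 838.77]
per_cpu = round(average(runs), 2)

print(f"{per_cpu:.2f}")                  # 839.06
print(f"{total_power(per_cpu, 2):.2f}")  # 1678.12
```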
The following table reports some benchmark results:
|CPU type||CPUs||Cores per CPU||Threads||OS||Mflop/s|
|Intel Xeon Gold 6238R||2||28||112||Windows 10 x64 Professional||110946|
|AMD Ryzen 9 3900XT||1||12||24||Windows 10 x64 Professional||78507|
|AMD Ryzen 9 3900X||1||12||24||Windows 10 x64 Professional||78328|
|Intel Xeon Gold 5120||2||14||56||Windows 10 x64 Enterprise||55397|
|AMD EPYC 7281||1||16||32||Windows Server 2019 Standard||49686|
|Intel Xeon E5-2650 v3||2||10||40||Windows 10 x64 Professional||39484|
|Intel Xeon E5-2630 v3||2||8||32||Windows 10 x64 Professional||31249|
|Intel Xeon E5-2640 v2||2||8||32||Windows 7 x64||27211|
|AMD Opteron 875||8||2||16||CentOS 4.3 64 bit||19254|
|AMD Ryzen 2400G||1||4||8||Windows 10 x64 Professional||16298|
|AMD Opteron 875||8||2||16||Windows Server 2008 R2||13632|
|Intel Xeon E5-2620 v2||1||6||12||CentOS 6.4 64 bit||10586|
|Intel Xeon E5-1620 v3||1||4||8||Windows 10 x64 Professional||9990|
|Intel i5-8250U||1||4||8||Windows 10 x64 Professional||9013|
|AMD Phenom II X6 1090T||1||6||6||Windows 7 x64||7052|
|Intel i5-6400||1||4||4||Windows 10 x64||5227|
|AMD Phenom II X4 955||1||4||4||Windows 7 x64||4711|
|AMD Athlon II X4 640||1||4||4||Windows Server 2003 Enterprise||4606|
|AMD A8 3870K||1||4||4||Windows 7 x64||4528|
|AMD Athlon MP 2200+||2||1||2||Windows 2000||1678|
|Intel Core 2 Duo T5600||1||2||2||Windows 10 x64||1141|
|ARM Cortex A7 1.2 GHz (AllWinner H3, Orange Pi One)||1||4||4||ARMBIAN 3.4.113 RetroOrangePi||874|
|AMD Athlon 64 3200+ Venice||1||1||1||Windows XP||861|
|AMD Athlon 64 3200+||1||1||1||Windows XP||849|
|AMD Sempron 64 2800+||1||1||1||Windows XP||840|
|AMD Athlon 64 3000+||1||1||1||Windows XP||821|
|ARM Cortex A9 r3p0 1.6 GHz (RockChip RK3188)||1||4||4||Ubuntu 12.04.5 LTS||693|
|Intel Pentium III 550 MHz||1||1||1||Windows 2000||133|
|ARM Cortex A9 r3p0 1.6 GHz||1||4||4||Android 4.1.1 Jelly Bean||97|
|ARM Cortex A9 r2p10 1.0 GHz||1||1||1||Android 4.1.1 Jelly Bean||23|
The compiler can change the results significantly. Using a dual Athlon MP test PC with Windows 2000 as the operating system and several C compilers, these results were found:
|RedHat||Y||3.3.3||-O3 -march=pentium -malign-double -fomit-frame-pointer -ffast-math -funroll-loops -D WIN32||1678|
|GNU||Y||3.2.3||-O3 -march=pentium -malign-double -fomit-frame-pointer -ffast-math -funroll-loops||1678|
|gcc mingw32||GNU||Y||3.4.5||-O3 -march=pentium -malign-double -fomit-frame-pointer -ffast-math -funroll-loops||1643|
|pgcc||Portland Group||N||6.0||-O3 -tp p5 -Munroll=c:5 -Mnoframe -Mlre -Mnozerotrip -D __TINYC__||1632|
|bcc32||Borland||N||5.6.4||-O2 -Hc -Vx -Ve -ff -X- -a8 -5 -b- -k- -vi -tWC -tWM||1464|
|lcc-win32||Jacob Navia||Y||3.3||-O -D__TINYC__||1024|
5. Copyright and disclaimers
All trademarks and software directly or indirectly referred to in this document are copyrighted by their legal owners. WarpBench is a freeware program and can be spread through the Internet, BBS, CD-ROM and other electronic formats. The Authors of this program accept no responsibility for hardware/software damages resulting from the use of this package. No warranty is made about the software or its performance.
Use and copying of this software and the preparation of derivative works based on this software are permitted, so long as the following conditions are met:
The copyright notice and this entire notice are included intact and prominently carried on all copies and supporting documentation.
No fees or compensation are charged for use, copies, or access to this software. You may charge a nominal distribution fee for the physical act of transferring a copy, but you may not charge for the program itself.
If you want to include the WarpBench package in a commercial file collection, you must send a written request. The Authors can accept or deny the request at their own discretion.
If you change the source code to improve WarpBench performance, please contact the authors so that your modifications can be added to the official package.
Any work distributed or published that in whole or in part contains or is a derivative of this software or any part thereof is subject to the terms of this agreement. The aggregation of another unrelated program with this software or its derivative on a volume of storage or distribution medium does not bring the other program under the scope of these terms.
WarpBench is an enhanced version of the original Linpack benchmark
Copyright 2006-2022, Alessandro Pedretti & Giulio Vistoli
All rights reserved.
Dipartimento di Scienze Farmaceutiche
Università degli Studi di Milano
Via Mangiagalli, 25
I-20133 Milano - Italy
Tel. +39 02 503 19332
Fax. +39 02 503 19359