The benchmark case: 3D lid-driven cavity
The benchmarks on this page give a rough estimate of the execution speed of OpenLB programs. They are based on the lattice Boltzmann program for a 3D lid-driven cavity, which is included as an example in the OpenLB distribution. A D3Q19 lattice and double-precision floating-point arithmetic are used. Because this simulation runs on a regular lattice, the benchmark only measures the raw processing power of OpenLB in such a regular case and says nothing about the treatment of complex geometries. On the other hand, a non-trivial boundary condition, based on extrapolations of velocity gradients, is used on all walls of the cavity.
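To get a feeling for the problem sizes involved, the following minimal sketch estimates the memory needed just for the particle populations of a D3Q19 lattice in double precision. It assumes that exactly 19 values of 8 bytes are stored per lattice site and ignores any additional per-site data or communication buffers that OpenLB may allocate, so the result is a lower bound. The function name is hypothetical and not part of OpenLB.

```python
Q = 19                # number of populations per site on a D3Q19 lattice
BYTES_PER_DOUBLE = 8  # size of one double-precision value


def population_memory_bytes(nx, ny, nz, q=Q, bytes_per_value=BYTES_PER_DOUBLE):
    """Lower-bound memory estimate: populations only, no extra per-site data."""
    return nx * ny * nz * q * bytes_per_value


# Lattice sizes that appear in the benchmarks on this page.
for n in (100, 200, 1000):
    gigabytes = population_memory_bytes(n, n, n) / 1e9
    print(f"{n}x{n}x{n}: at least {gigabytes:.3f} GB")
```

Under these assumptions, a 200x200x200 lattice already needs more than 1 GB for the populations alone.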
In the following, results are presented from running the simulation on two parallel machines, one of which is distinctly outdated, while the other is at the high end of current technology. The outdated machine shows that good performance can be reached even with modest hardware, and that it is almost always worthwhile to parallelize OpenLB applications. The fast machine shows that, given an appropriate hardware platform, OpenLB scales well over thousands of cores, even with small lattices. It is easy to reach a regime in which billions of lattice sites are processed per second.
The performance of the programs is measured by the number of lattice sites that complete a collision-propagation cycle in one second. The units are lattice site updates per second (sus), Mega-sus (Msus, 10^6 sus) and Giga-sus (Gsus, 10^9 sus).
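As a rough illustration of this metric, the sketch below converts a measured wall-clock time into site updates per second. The function name and the sample numbers (time steps, elapsed seconds) are hypothetical and are not taken from the benchmark runs.

```python
def site_updates_per_second(nx, ny, nz, time_steps, elapsed_seconds):
    """Return lattice site updates per second (sus) on a regular nx*ny*nz lattice."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Every site completes one collision-propagation cycle per time step.
    total_updates = nx * ny * nz * time_steps
    return total_updates / elapsed_seconds


# Hypothetical run: 100 time steps on a 100x100x100 lattice in 50 seconds.
sus = site_updates_per_second(100, 100, 100, 100, 50.0)
print(f"{sus / 1e6:.2f} Msus")  # prints "2.00 Msus"
```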
Cluster with slow interconnect
The first test machine is a cluster of Intel Pentium 4 processors running at 1.5 GHz, each with 500 MB of main memory and 256 KB of L2 cache. The nodes are connected by a Fast Ethernet network with a bandwidth of 11 MB/s. Although both the processors and the interconnect are slow by current standards, the benchmark results for this problem are respectable. On a single processor, a performance of 0.57 Msus is obtained, regardless of whether the serial or the MPI version of the program is used. With 24 processors, the execution speed on a 200x200x200 lattice increases to 10 Msus, which corresponds to a parallel efficiency of 73%. A small 100x100x100 lattice scales less well, but still reaches an efficiency of 52%.
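The efficiency quoted for the 200x200x200 lattice can be checked with a short calculation. The sketch below assumes the usual definition of parallel efficiency, namely the measured parallel speed divided by the processor count times the single-processor speed. The page does not state the formula explicitly, but this definition reproduces the 73% figure. The function name is hypothetical.

```python
def parallel_efficiency(parallel_speed, single_speed, num_processors):
    """Measured speed divided by ideal (linearly scaled) speed; same units for both speeds."""
    ideal_speed = num_processors * single_speed
    return parallel_speed / ideal_speed


# Cluster figures from the text: 0.57 Msus on 1 processor, 10 Msus on 24 processors.
efficiency = parallel_efficiency(10.0, 0.57, 24)
print(f"{efficiency:.0%}")  # prints "73%"
```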
Parallel machine with fast interconnect
The second benchmark was performed on the "Ranger" computing system at the Texas Advanced Computing Center. This machine consists of quad-core 64-bit AMD Opteron processors, interconnected by a dedicated network based on a full-Clos fat-tree topology. The strength of this system lies in the number of available cores and the fast interconnect. The single-core performance is 1.08 Msus, and this time the simulations scale to thousands of cores. On 4096 cores and with a 1000x1000x1000 lattice, a processing speed of 3.5 Gsus is reached, and thus an efficiency of 80%. But even a small 200x200x200 lattice scales well up to 2048 cores, with a speed of 1.4 Gsus and an efficiency of 49%. In this regime, every core holds a lattice as small as approximately 15x15x15. Note that, in order to display all orders of magnitude in this large array of cores, a logarithmic scale is used for both axes.
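The per-core lattice size mentioned above follows from dividing the lattice evenly among the cores. The sketch below assumes a perfectly even distribution into cube-shaped blocks, which is a simplification; the actual domain decomposition used by OpenLB may produce blocks of a different shape. Function names are hypothetical.

```python
def sites_per_core(nx, ny, nz, cores):
    """Average number of lattice sites held by each core, assuming an even split."""
    return nx * ny * nz / cores


def equivalent_cube_edge(sites):
    """Edge length of a cube containing the given number of sites."""
    return sites ** (1.0 / 3.0)


# 200x200x200 lattice distributed over 2048 cores, as in the text.
sites = sites_per_core(200, 200, 200, 2048)
edge = equivalent_cube_edge(sites)
print(f"{sites:.0f} sites per core, cube edge of about {edge:.1f} sites")
```

With roughly 3900 sites per core, each block has an edge of between 15 and 16 sites, in line with the approximate 15x15x15 figure.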