12.3 TUFLOW HPC (Including Quadtree)
TUFLOW HPC uses an explicit finite volume scheme which is parallelised to utilise multiple CPU cores, or equally the thousands of cores available on GPUs. The explicit finite volume scheme is not as computationally efficient as the ADI finite difference scheme used in TUFLOW Classic. However, because it can be parallelised, benchmarking has found that using the HPC solution scheme on a modern GPU can run a typical model anywhere from 10 to 100 times faster than using the Classic scheme on a single core of a modern CPU. For further details on the TUFLOW HPC solution scheme refer to Section 1.1.2.
12.3.1 TUFLOW HPC Solver on CPU
TUFLOW HPC can be run using multiple CPU cores of a single or dual CPU mainboard, in which case it may offer solution times comparable with TUFLOW Classic on a single CPU core. However, most users prefer to utilise the performance of GPU compute (refer to Section 12.3.2). Note that GPU compute requires a GPU module licence (refer to Section 1.1.5).
12.3.2 TUFLOW HPC Solver on GPU
TUFLOW HPC uses the CUDA API that NVIDIA makes available for their GPUs, and only NVIDIA GPUs are supported. The GPU is a physically separate device that is connected to the mainboard at a PCIe slot. It has its own separate memory, which must be sufficient to hold the working data for the model solution process, and draws power from the main power supply unit through its own supply cables. The power supply unit must therefore be capable of supplying both the mainboard (with the CPUs) and the attached GPUs. The power draw of the GPUs and the cooling requirements of the whole system can become constraining issues in the overall system design.
Note that, due to the use of the NVIDIA CUDA API, AMD GPUs are not supported.
For further guidance, see the TUFLOW Wiki Hardware Selection Advice page, which provides general hardware recommendations for running TUFLOW. The TUFLOW Wiki Console Window GPU Usage page is also helpful for confirming whether the GPU is being utilised during a simulation, including how to use tools such as nvidia-smi to monitor performance.
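As an illustration of the kind of monitoring mentioned above, the following sketch parses the CSV output of an nvidia-smi utilisation query. The sample output string is fabricated for illustration only; on a real machine the query itself would be run separately (e.g. `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`).

```python
# Illustrative sample of nvidia-smi CSV output (index, GPU utilisation %,
# memory used MiB, memory total MiB). Fabricated values, not real output.
sample_output = """0, 97, 10240, 11264
1, 3, 512, 11264"""

def parse_gpu_status(csv_text):
    """Return a list of dicts, one per GPU, from nvidia-smi CSV output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "util_pct": int(util),          # GPU utilisation in percent
            "mem_used_mib": int(mem_used),  # GPU RAM currently in use
            "mem_total_mib": int(mem_total),
        })
    return gpus

status = parse_gpu_status(sample_output)
# In this fabricated sample, GPU 0 is busy and GPU 1 is nearly idle.
```

A GPU running a simulation would typically show sustained high utilisation and a large, steady memory allocation, as for GPU 0 in the sample.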
12.3.3 Types of GPU
In general, the price and performance of a GPU are related to the size of its memory, the number of compute cores, and importantly the internal architecture of the GPU. In this regard, the NVIDIA GPUs fall into two broad categories:
- Gaming GPUs. These usually offer excellent computational performance, but are often physically large and have high power draw. This means that it is often difficult to have more than two attached to a mainboard. Despite these drawbacks, installing gaming GPUs often provides the best overall performance relative to price for TUFLOW modelling.
- Scientific GPUs. The top-line scientific GPUs can offer superior computational performance to the top-line gaming GPUs, typically offer more GPU memory, and are usually physically smaller with lower power draw, so it is more feasible to have four or more attached to a mainboard. However, they are significantly more expensive - sometimes an order of magnitude more expensive than even the best gaming GPUs.
12.3.4 Utilising Multiple GPUs for One Model
TUFLOW HPC (including Quadtree) can utilise multiple GPUs (that are all connected to the same mainboard) to compute a single simulation. For larger models (typically more than 5,000,000 2D cells), useful reductions in solve times can be achieved with multiple GPUs. For smaller models (typically less than 100,000 cells), the solve time may actually increase when run across multiple GPUs due to the device-to-device communication latency.
Multiple GPUs may also be used to run very large models that require more GPU memory than is available on a single device.
12.3.5 Running Multiple Models on a Single GPU
A single simulation per GPU card will produce the fastest simulation speed from a single-model perspective. Running more than one simulation per GPU card will make each simulation run slower, though from an overall project perspective it may be a more efficient mode of operation. For example, hypothetically, if a model takes 10 hours to run on a single card, running the same model in parallel with another simulation may reduce the simulation efficiency to 70%. The models run in parallel would then finish in 14.3 hours (10 / 0.7). This is quicker than the 20 hours that would be required if the models were run in series.
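The hypothetical arithmetic above can be sketched as follows; the function names and the fixed 70% efficiency figure are illustrative assumptions only, since the actual efficiency depends on the card and the models.

```python
def concurrent_hours(single_run_hours, efficiency):
    """Wall-clock time for identical simulations sharing one GPU card.
    All runs finish together, each slowed to `efficiency` of full speed
    (1.0 = full speed). The efficiency value is a hypothetical input."""
    return single_run_hours / efficiency

def series_hours(single_run_hours, n_runs):
    """Wall-clock time to run the same simulations one after another."""
    return single_run_hours * n_runs

# The worked example from the text: two 10-hour models, 70% efficiency.
parallel = concurrent_hours(10.0, 0.7)  # ~14.3 hours for both to finish
series = series_hours(10.0, 2)          # 20 hours run back-to-back
```

If the per-run efficiency drops below 1/n for n concurrent runs, running in series becomes quicker overall, which is why the optimum load per card has to be found case by case.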
The optimum number of simulations per GPU card varies depending on the card specifications (CUDA cores, RAM, etc.) and the compute needs of the models.
12.3.6 Differences in Results between CPU and GPU
TUFLOW Classic uses a very different solution scheme to TUFLOW HPC, and there will be some differences between those solutions. The comparison discussed here, however, is between the HPC solver run on CPU and the same solver run on GPU. TUFLOW HPC is written as a single source code that, with the use of macros, can be compiled for CPU execution or for GPU execution without duplication of code. The lines of code are therefore the same, but the compilers are different, and the physical hardware that executes the binary instructions is different. In particular, the number of bits used for floating point representation within CPU cores and GPU cores can differ, though both meet the minimum IEEE standards for single (or double) precision, and the math libraries for computing square roots and logarithms are not absolutely identical.

These differences produce very subtle differences in depth calculations - perhaps at the 7th decimal place for single precision calculations. However, when an embankment is being over-topped by just a small amount, a 0.0001% difference in depth can become a 0.1% difference in over-topping flux, leading to what appear to be larger relative differences in depth at shallow locations. Such differences can grow to a few mm or even cm. They are still much smaller than the uncertainties involved in flood modelling, and much smaller than the differences that will arise from running the same model at a different resolution or (to the extent possible) in different software. It is important for all modellers to understand what constitutes real differences in results versus what is "numerical noise".
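A minimal sketch of this amplification effect, assuming a broad-crested weir-type flux relation (flux proportional to head to the power 1.5) and illustrative numbers that are not taken from any TUFLOW computation:

```python
# Illustrative numbers only: a tiny water-level perturbation at the
# 7th significant figure, over an embankment that is only just topped.
crest_level = 9.9999          # embankment crest elevation (m)
wl_a = 10.0                   # water level from one build (m)
wl_b = 10.0 + 1e-6            # same level perturbed by ~1e-7 relative

head_a = wl_a - crest_level   # overtopping head: only 0.1 mm
head_b = wl_b - crest_level

# Relative difference in water level: ~0.00001%
rel_wl_diff = (wl_b - wl_a) / wl_a

# Weir-type flux q ~ head**1.5 (a standard broad-crested weir assumption),
# so the relative difference in over-topping flux is far larger: ~1.5%
rel_flux_diff = (head_b**1.5 - head_a**1.5) / head_a**1.5
```

The shallower the overtopping head, the larger the amplification, which is consistent with the observation that apparent differences concentrate at shallow, just-overtopped locations.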
12.3.7 RAM
RAM is the computer memory required to store all of the model data used during the computation. A computer has CPU RAM, which is located on the motherboard and accessed by the CPU, and GPU RAM, which is located on the GPU device and accessed by the GPU. The two memory systems are physically separate, and both are required for TUFLOW HPC simulations when using a GPU for the compute. The amount of GPU RAM is one of two factors that determine the size of the model that can be run (the other being CPU RAM). As a general rule, approximately 5 million 2D cells can be simulated per gigabyte (GB) of GPU RAM, depending on the model features (e.g. a model with infiltration requires more memory due to the extra variables needed for the infiltration calculation).
TUFLOW HPC on GPU hardware still uses the CPU to compute and store data (in CPU RAM) during model initialisation and for all 1D calculations. During initialisation and simulation, a model will typically require 4-6 times the amount of CPU RAM relative to GPU RAM. For example, for a model that utilises 11GB of GPU RAM (typical for a high-end gaming card, corresponding to a model of about 50 million cells), the CPU RAM required during initialisation will typically be in the range of 44GB to 66GB.
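The rules of thumb above can be combined into a rough sizing sketch. The function and constant names are hypothetical, and the figures are only the general guidance from the text, not a substitute for testing an actual model.

```python
# Rule-of-thumb constants from the text: ~5 million 2D cells per GB of
# GPU RAM, and CPU RAM of 4-6x the GPU RAM during initialisation.
CELLS_PER_GB_GPU = 5_000_000
CPU_RAM_FACTOR = (4, 6)

def estimate_memory(n_cells):
    """Return (gpu_ram_gb, (cpu_ram_gb_low, cpu_ram_gb_high)) for a
    model of n_cells 2D cells, ignoring feature-dependent overheads
    such as infiltration, which increase memory per cell."""
    gpu_gb = n_cells / CELLS_PER_GB_GPU
    lo, hi = CPU_RAM_FACTOR
    return gpu_gb, (lo * gpu_gb, hi * gpu_gb)

# A ~50 million cell model, close to the 11GB card example in the text:
gpu, (cpu_lo, cpu_hi) = estimate_memory(50_000_000)
# ~10 GB of GPU RAM; ~40-60 GB of CPU RAM during initialisation
```

This is consistent with the worked example: a model filling an 11GB card needs roughly 44GB to 66GB of CPU RAM at initialisation.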