Section 12 Hardware and Operating System
12.1 Introduction
This chapter provides information and guidance to help users choose computer hardware that will suit their modelling needs. TUFLOW Classic and HPC utilise distinctly different solution schemes, and likewise have different requirements with regard to the choice of computer hardware. Descriptions of a number of hardware related commands and options that are specific to HPC are provided in Section 12.8.
Additional material on performance/benchmarking and hardware selection advice can be found on the TUFLOW Wiki.
12.2 Operating Systems
TUFLOW Classic and HPC are only available as 64-bit binaries for 64-bit Microsoft Windows operating systems. Unix/Linux and Mac operating systems are not currently supported. While TUFLOW Classic and HPC may run on older versions of Windows, support is only available for Windows 10 and later, and Windows Server 2016 and later.
12.3 TUFLOW Classic
TUFLOW Classic uses the ADI (alternating direction implicit) finite difference scheme of Stelling (1984), further refined by Syme (1991). For further details on the TUFLOW Classic solution scheme refer to Section 1.1.1. It is a highly efficient scheme, but one that is difficult to parallelise across multiple threads. As such it runs on a single core of the main CPU (central processing unit). General hardware advice for TUFLOW Classic is summarised below:
- As all modern CPUs have multiple cores, it is possible to run more than one model at a time on the computer, provided the user has access to sufficient licences.
- If running only one model at a time, the primary factor for model speed is the single core computational capability of the CPU, which is largely driven by the CPU clock speed: a CPU with fewer cores but a higher clock speed will likely provide better performance than one with many cores and a lower clock speed. CPU type and architecture also play a role in single core performance; for more information refer to Section 12.5. When running just one model at a time with Classic, a dual socket mainboard will be of little benefit. Dual socket mainboards are capable of holding two CPUs, doubling the total number of CPU cores available (e.g. 2 x 16 core CPUs), and are primarily found in server setups.
- If running multiple models simultaneously, the memory bandwidth between the CPUs and the mainboard RAM becomes quite important, in which case a dual CPU mainboard will usually offer better performance than a single CPU mainboard. Running multiple models simultaneously will also require more CPU RAM, which again typically favours a dual CPU mainboard.
- The mainboard memory required for a particular model will vary significantly due to a number of factors related to model setup. In general, the memory required will scale proportionally with the number of model cells.
- There is no requirement for a particular GPU (Graphics Processing Unit) to be installed. However, having a medium quality GPU installed may assist with graphics rendering in any GIS software being used for visualising model inputs and outputs.
- The amount of CPU RAM will determine the size of the model that can be run, or the number of models that can be run at one time. Faster RAM will result in quicker run times, however this is usually a secondary consideration to chip speed, cache size and architecture.
12.4 TUFLOW HPC (Including Quadtree)
TUFLOW HPC uses an explicit finite volume scheme which is parallelised to utilise multiple CPU cores, or equally the thousands of cores available on GPUs. The explicit finite volume scheme is not as computationally efficient as the ADI finite difference scheme used in TUFLOW Classic. However, because it can be parallelised, benchmarking has found that using the HPC solution scheme on a modern GPU can run a typical model anywhere from 10 to 100 times faster than using the Classic scheme on a single core of a modern CPU. For further details on the TUFLOW HPC solution scheme refer to Section 1.1.2.
12.4.1 TUFLOW HPC Solver on CPU
TUFLOW HPC can be run using multiple CPU cores of a single or dual CPU mainboard, in which case it may offer solution times comparable with TUFLOW Classic on a single CPU core. However, most users prefer to utilise the performance of GPU compute, refer to Section 12.4.2. Note that computing on GPU requires a GPU module licence, refer to Section 1.1.5.
12.4.2 TUFLOW HPC Solver on GPU
TUFLOW HPC uses the CUDA API that NVIDIA makes available for their GPUs, and therefore only NVIDIA GPUs are supported. The GPU is a physically separate device that connects to the mainboard via a PCIe slot. It has its own separate memory, which must be sufficient to hold the working data for the model solution process, and it draws power from the main power supply unit through its own supply cables. The computer’s power supply unit must be capable of supplying both the mainboard (with the CPUs) and the attached GPUs. The power draw of the GPUs and the cooling requirements of the whole system can become a constraining issue in the overall system design.
Note that due to the use of the NVIDIA CUDA API, AMD GPUs are not supported.
12.4.3 Types of GPU
In general, the price and performance of a GPU are related to the size of its memory, the number of compute cores, and importantly the internal architecture of the GPU. In this regard, the NVIDIA GPUs fall into two broad categories:
- Gaming GPUs. These usually offer excellent computational performance, but are often physically large and have high power draw. This means that it is often difficult to have more than two attached to a mainboard. Despite these drawbacks, installing gaming GPUs often provides the best overall performance relative to price for TUFLOW modelling.
- Scientific GPUs. The top-line scientific GPUs can offer superior computational performance to the top-line gaming GPUs, offer more GPU memory, and are usually physically smaller with lower power draw, making it more feasible to have four or more attached to a mainboard. However, they are significantly more expensive - sometimes an order of magnitude more expensive than even the best gaming GPUs.
12.4.4 Utilising Multiple GPUs for One Model
TUFLOW HPC (including Quadtree) can utilise multiple GPUs (that are all connected to the same mainboard) to compute a single simulation. For larger models (typically more than 5,000,000 2D cells), useful reductions in solve times can be achieved with multiple GPUs. For smaller models (typically less than 100,000 cells), the solve time may actually increase when run across multiple GPUs due to the device to device communication latency.
Multiple GPUs may also be used to run very large models that require more GPU memory than is available on a single GPU device.
12.4.5 Running Multiple Models on a Single GPU
It is also possible to run multiple TUFLOW HPC simulations on a single GPU card, provided it has sufficient memory to accommodate them all. The models will, however, solve more slowly as the GPU compute resources are shared between them. Occasionally it has been noted that the time taken for two models to complete when running simultaneously can be less than the sum of their times when run sequentially.
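As a minimal sketch, two simulations can be pinned to the same GPU from a Windows batch file using the “-pu” command line argument described in Section 12.8 (the executable name and paths below are illustrative only, and the “-b” argument runs TUFLOW in batch mode):

    start "TUFLOW run A" "C:\TUFLOW\TUFLOW_iSP_w64.exe" -b -pu0 model_A.tcf
    start "TUFLOW run B" "C:\TUFLOW\TUFLOW_iSP_w64.exe" -b -pu0 model_B.tcf

Both runs target device 0, so they share that GPU’s compute resources and memory as described above.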
12.4.6 Differences in Results between CPU and GPU
TUFLOW Classic uses a very different solution scheme to TUFLOW HPC, so some differences between their results are expected. The comparison discussed here, however, is between running the HPC solver on CPU and on GPU. TUFLOW HPC is written as a single source code that, with the use of macros, can be compiled for CPU execution or for GPU execution without duplication of code. The lines of code are therefore the same, but the compilers differ, and so does the physical hardware that executes the binary instructions. In particular, the number of bits used for floating point representation within CPU cores and GPU cores can differ, though both meet the minimum IEEE standards for single (or double) precision, and the math libraries for computing square roots and logarithms are not absolutely identical.

These differences produce very subtle differences in depth calculations - perhaps at the 7th decimal place for single precision. However, when an embankment is being over-topped by just a small amount, a tiny absolute difference in water level becomes a much larger relative difference in the shallow over-topping head: a 0.0001% difference in depth can become a 0.1% difference in over-topping flux, leading to what appear to be larger relative differences in depth at shallow locations. Such differences can grow to a few millimetres or even centimetres. They are nonetheless much smaller than the uncertainties involved in flood modelling, and much smaller than the differences that will arise from running the same model at a different resolution, or (to the extent possible) in different software. It is important for all modellers to understand what constitutes real differences in results versus what is “numerical noise”.
12.4.7 RAM
RAM is the computer memory required to store all of the model data used during the computation. A computer has CPU RAM, which is located on the motherboard and accessed by the CPU, and GPU RAM, which is located on the GPU device and accessed by the GPU. The two memory storage systems are physically separate, and both are required for TUFLOW HPC simulations when using GPU for the compute. The amount of GPU RAM is one of two factors that will determine the size of the model that can be run (the other being CPU RAM). As a general rule, approximately 5 million 2D cells can be simulated per gigabyte (GB) of GPU RAM, depending on the model features (e.g. a model with infiltration requires more memory due to the extra variables needed for the infiltration calculation).
TUFLOW HPC on GPU hardware still uses the CPU to compute and store data (in CPU RAM) during model initialisation and for all 1D calculations. During initialisation and simulation a model will typically require 4-6 times the amount of CPU RAM relative to GPU RAM. For example, for a model that utilises 11GB of GPU RAM (typical memory for a high-end gaming card, corresponding to about a 50 million cell model), the CPU RAM required during initialisation will typically be in the range of 44GB to 66GB.
12.5 Proxies for CPU and GPU Performance
When choosing hardware for running engineering software, it is important to consider performance relative to price, including an allowance for the cost of software licencing. With licencing costs included, this often favours choosing recent high performance hardware. Hardware prices can be well known in advance, but knowing what performance will be achieved on particular hardware, for specific software, can be difficult.
TUFLOW publishes computational performance results for a benchmark 2D hydraulic model - refer to the links in Section 12.1. For other hardware that is not listed on these pages, the TUFLOW development team has found that some published third-party hardware benchmarks offer a relative performance comparison that appears to be consistent with our own performance measurements. These may be found at:
https://www.videocardbenchmark.net/
At this site there is a “High End Video Card Chart”. The scores for each GPU are calculated from a number of different tests, mostly to do with graphics performance. There is also a “GPU Compute Video Card Chart”, but interestingly we have found the former chart to be a better proxy for how well a GPU will run a TUFLOW HPC model. This site also has excellent CPU charts for multi-core and single-core performance.
12.6 Virtualisation
It is possible to share high-performance compute resources amongst multiple users simultaneously by running virtual desktops on a machine that users connect to remotely. There are many reasons why IT professionals favour this solution for sharing expensive resources. However, the important question that must be asked in advance is: how will the GPU(s) be shared between users? Unless the Virtual Desktop Infrastructure (VDI) environment is implemented correctly, users may experience erratic model start up and solve times. There are different sharing mechanisms possible depending on the type of GPU and the choice of operating system. Some key points are:
- Solving the 2D Shallow Water Equations is a computationally intensive task. TUFLOW HPC is a well-optimised engine that will efficiently utilise all of a GPU’s compute resources for hours, sometimes days depending on the size of the model. For optimum performance, it is generally safest to only allow one HPC compute job per GPU device.
- In a VDI environment for modellers, where resources are typically pooled and assigned dynamically, GPU resources should instead be assigned as isolated and independent resources, dedicated to a user session.
- When opting for large NVIDIA GPUs, choosing a GPU model and a hypervisor that support Multi-Instance GPU (MIG) is recommended. Otherwise, select multiple smaller GPUs that can be exclusively assigned to individual user sessions.
12.7 Cloud Compute
Organisations may host their TUFLOW network licences on the cloud and run simulations on cloud virtual machines. There are numerous ways both licencing and simulation can be configured in a cloud environment, depending on the cloud provider (Microsoft, Google, Amazon, etc.) and internal company protocols. Configuration of your cloud environment is your own responsibility. The Cloud Execution page of the TUFLOW Wiki provides guidance on this subject.
12.8 Commands
The following commands, all optional, are available for the HPC solution scheme.
If an NVIDIA GPU is available, it may be selected using the Hardware command. If GPU hardware is specified but the system cannot find an NVIDIA GPU, ERROR 3005 will result. The memory required for the TUFLOW model is compared against the free memory available on the GPU; if the available memory is insufficient to run the model, ERROR 3017 will result. For models that only just fit within the available device memory, it is possible that a memory allocation will fail due to being unable to find the required memory as a contiguous block, in which case ERROR 3018 will be reported during model initialisation.
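For example, a minimal .tcf fragment selecting the HPC scheme on GPU hardware might read as follows (the Solution Scheme command is shown for context):

    Solution Scheme == HPC    ! select the HPC solver
    Hardware == GPU           ! compute on an NVIDIA GPU; ERROR 3005 results if none is found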
The default setting of “ERROR” causes TUFLOW to stop with ERROR 2420 (advising that it is recommended to use the single precision binary) if the HPC solution scheme is selected when running the double precision binary. Due to its explicit finite volume formulation and being depth based, the HPC 2D scheme does not generally need to be run in double precision (DP). There can also be substantial speed gains using single precision (SP) on some GPU cards, and the memory footprint is significantly smaller. However, 1D-2D linked models that use the ESTRY 1D engine may still require double precision in situations where the model is at high elevations. Accordingly, it is still possible to run the HPC solution scheme in double precision by explicitly setting HPC DP Check to “OFF”.
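For example, where the double precision binary must be used (such as for a 1D-2D linked model at high elevations), the check can be disabled with:

    HPC DP Check == OFF    ! suppress ERROR 2420 and allow HPC to run in double precision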
For computers with more than one GPU, the NVIDIA GPU driver will search all connected GPUs, and compile an enumerated list of NVIDIA GPUs. This list will range from 0 to n-1, where n is the number of attached NVIDIA GPUs. This command may be used to:
- Run a model on a particular GPU, e.g. “GPU Device IDs == 1” will run the model on the second NVIDIA GPU in the list of available NVIDIA GPUs.
- Distribute a model over two or more GPUs, e.g. “GPU Device IDs == 0 1” will run the model spread over the first two NVIDIA GPUs in the list. Note that the GPUs do not have to be consecutive and the device IDs can appear in any order.
- If you only have one GPU device, or you wish to use the primary device, this command is not needed.
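Putting the above together, a sketch of a .tcf fragment distributing a model over the second and third GPUs in the enumerated list would be:

    Hardware == GPU
    GPU Device IDs == 1 2    ! device IDs are numbered from 0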
If desired, the selection of GPU Device IDs can be specified on the command line when running the executable:
- To select a specific GPU, specify “-puN” where N is the device ID.
- To select multiple GPUs for a distributed run, supply a “-puN” argument for each device ID required, for example “-pu0 -pu1”.
- The “-pu” arguments will automatically override any GPU Device IDs specified in the tcf.
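For example, assuming an illustrative executable name, a run distributed over the first two GPUs could be launched with:

    TUFLOW_iSP_w64.exe -pu0 -pu1 my_model.tcf

This overrides any GPU Device IDs command in the .tcf.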
Also note:
- If the list of device IDs is longer than the number of available devices, then the list will be truncated to the number of available devices and WARNING 2784 issued.
- If a device ID is specified that is outside of the range 0…n-1 then ERROR 3005 will result.
- If a device ID is specified that already has a model running on it, then the requested GPU will be loaded up with the additional model, which will cause both the existing model and the new model to solve more slowly.
- A GPU module licence is required for each GPU.
- A CPU thread is created for managing the compute stream for each GPU device.
- Available memory checks are performed on all GPUs in the list.
- If Hardware target is CPU then this command is ignored.
These two commands are identical and control the number of CPU threads used by HPC (including Quadtree) when solving on CPU instead of GPU. For example, “CPU Threads == 6” runs the HPC 2D solver across 6 CPU cores. The number of threads may also be specified as a command line argument -nt[number_of_threads]. Using the command line argument will override any definition in the tcf file.
Notes:
- TUFLOW licences include four times as many HPC “thread” licences as TUFLOW engine licences. For example, a local or network licence with 4 engine licences has 16 thread licences available.
- The default number of threads that HPC will use for the CPU solver run is 4.
- The maximum number of threads possible is the lesser of the maximum number of cores available on the machine and the number of available TUFLOW thread licences.
- Pre-processing of SGS elevations can be computationally intensive, as can the compression of TIF output files. If the number of threads has not been specified then the number of threads used for these tasks will default to the maximum number of cores available without requiring any additional thread licences.
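For example, to run the HPC CPU solver across 6 threads, add the following to the .tcf:

    CPU Threads == 6    ! run the HPC 2D solver on 6 CPU cores

or equivalently, on the command line (the executable name is illustrative), which overrides the .tcf setting:

    TUFLOW_iSP_w64.exe -nt6 my_model.tcf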
When a model is spread over two or more GPUs (or CPU threads), the model is partitioned into vertical ribbons and each device solves its own ribbon, with boundary data synchronised between the devices at each timestep. By default the ribbons are of equal width. However, the computational burden for each GPU is rarely uniform, due to each ribbon having a different active cell or wet cell count. This command will vary the ribbon sizes in accordance with the load factors in the list, and can therefore improve the overall solution time for unbalanced models. Also, for models that require nearly all of the available GPU memory on systems with GPUs of differing memory capacity, this command can be used to apportion ribbon sizes accordingly. Additional notes:
- Load factors are mapped respectively to the devices in the same order in which they appear in the list of device IDs.
- Load factors are normalised after reading, if required, so that their average is one (see the worked example following these notes).
- Upon completion of the model, TUFLOW will report the approximate computational load split across the devices, and offer a suggested list of load factors that may improve the workload balance.
- When running on CPU, this command adjusts the ribbon size for each thread.
- For a Quadtree model, the decomposition is not performed in ribbons, instead the list of cells is partitioned.
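As an illustrative example of the normalisation, load factors of 1.5 and 0.9 for a two-GPU run (average 1.2) would be normalised to 1.25 and 0.75, so the first device in the device ID list receives a ribbon approximately 25% wider than an even split, and the second a ribbon approximately 25% narrower.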
Some models of NVIDIA GPUs allow for a direct communications link between them, either via a specific hardware connector or via the PCI bus controller. This is known as ‘peer to peer’ access. TUFLOW HPC will by default enable peer to peer access if the driver reports that it is available. This command may be used to specifically disable peer to peer communications if desired.