I am not involved directly, just following what's going on in the field. The reasons why they are looking at 16-bit reals are several:

- the application is extremely memory intensive: for each cell of a 3D spatial grid, you need to store 19 or even 27 DDFs (discrete distribution functions), plus any additional scalar fields (temperature, pressure, etc.)
- given how memory intensive it is, the available GPU memory quickly becomes a restriction on the domain size (or grid size); this also puts a cap on the Reynolds number you can simulate (without resorting to turbulence models)
- the algorithm is bandwidth-limited: a large share of the time is spent just shifting the DDFs in memory (think particles moving along a grid)

It turns out that, with some tricks, and assuming a stable simulation, the DDFs will fall between -2 and 2, mostly clustering around 0. Parts of the mantissa and exponent of a standard 16-bit float are therefore completely unused. As far as I can understand, the application was written in OpenCL, for hardware that had support for 16-bit floats (cl_khr_fp16), and not in Fortran.

As Ivan points out, any way to exploit half precision can give potential benefits on modern GPUs, which now seem to prioritise FP16 performance for machine learning applications.

I did some simple investigations into mixed-precision solvers for CFD on GPUs and got a 2x speedup using FP32 by alleviating the effective memory bottleneck. In fact, any way to exploit lower precision on GPUs is beneficial, due to the financial cost now required for full double-precision hardware support. Such hardware typically does not support native double precision (FP64) but emulates it with two FP32 operations, which is very slow.

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers (utk.edu)
Full article: Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (tandfonline.com)
Mixed-precision AMG as linear equation solver for definite systems - ScienceDirect
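To illustrate why the [-2, 2) range matters: if all 16 bits are spent on that interval, a fixed-point encoding recovers resolution that an IEEE half would waste on unused exponent values. The following is only a sketch of the idea; the module, the names, and the Q2.14 scaling are my own, not taken from the application described above:

```fortran
module ddf_pack
  use iso_fortran_env, only: int16, real32
  implicit none
  ! Hypothetical Q2.14 packing: values in [-2, 2) are scaled by 2**14 and
  ! stored in a 16-bit integer, so all 16 bits cover the interval where a
  ! stable simulation keeps its DDFs.
  real(real32), parameter :: scale = 2.0_real32**14
contains
  elemental function pack_ddf(f) result(q)
    real(real32), intent(in) :: f
    integer(int16) :: q
    ! clamp to the representable range before scaling
    q = nint(max(-2.0_real32, min(f, 2.0_real32 - 1.0_real32/scale))*scale, int16)
  end function pack_ddf

  elemental function unpack_ddf(q) result(f)
    integer(int16), intent(in) :: q
    real(real32) :: f
    f = real(q, real32)/scale
  end function unpack_ddf
end module ddf_pack
```

Storing the DDFs this way halves the memory traffic compared with real32, at the cost of a little pack/unpack arithmetic, which is cheap in a bandwidth-limited kernel.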
That would, however, still fit in a more or less ordered sequence of number formats - the precision is reduced along with the range. I have seen a Convex machine in years long gone by where there were two number formats, the main difference being a different interpretation of the exponent encoding and slightly faster processing with the native number format (which was not IEEE-compliant). You had to select the format via a compiler option. But that was before Fortran 90 was a serious option.

As it happens, there is a thread on the gfortran mailing list that concerns the handling of signalling NaNs. A large number of platforms are mentioned there, each with its own ideas about what a number should be (*). Perhaps one of them has something akin to the number format you mention.

(*) Is it my too poor mastering of the English language that I have to use the word "number" in at least two different meanings?

I am surprised that you are investigating a 16-bit real to look for improved performance. I have no idea of the hardware you have available, but improved efficiency of vector instructions must be preferable to software emulation of smaller-memory real calculations, given the ratio is only a 50% change in memory. I would have thought that utilising AVX-256 or AVX-512 could be more effective, combined with targeting the L1 cache, which is so important for efficient AVX. Surely a software real(2) would not achieve the vector real(4) performance. The alternative is to utilise a processor with a larger cache and increased memory bandwidth.

I have been trying to understand the "black art" of AVX inefficiency for a few years now. For me, multi-threaded computation has extra cores, but they share the same limited memory addressing capacity/bandwidth. My latest attempt with a Ryzen 5900X (more cache and faster memory) was only moderately successful, but I am hoping that newer hardware and DDR5 memory might show an improvement.
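For anyone who wants to test the real(2)-versus-real(4) trade-off where the compiler offers a 16-bit kind, kinds can be requested portably rather than hard-coded. A minimal sketch (the constant names are my own choice):

```fortran
module kinds_mod
  implicit none
  ! Request kinds by precision/range rather than by storage size.
  ! On compilers without a 16-bit real, hp resolves to the smallest kind
  ! that still satisfies the request (typically single precision), so the
  ! code keeps compiling; check storage_size(1.0_hp) to see what you got.
  integer, parameter :: hp = selected_real_kind(p=3, r=4)    ! half, if available
  integer, parameter :: sp = selected_real_kind(p=6, r=37)   ! single
  integer, parameter :: dp = selected_real_kind(p=15, r=307) ! double
end module kinds_mod
```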
Something that is of considerable interest in the data science community is half-precision reals, because they have little to gain from six decimals but everything to gain from less memory per number. (I have never seen anything of the sort.)

Using the method above of setting kinds, I have written a version of norm2 that appears to handle overflow better than Intel Fortran does:

    pure function my_norm2_extra(x) result(y)  ! use extra precision to calculate L2 norm

with the test program selecting it via

    use norm_mod, only: wp, my_norm2, my_norm2_extra

Similar results are obtained if the program uses integer, parameter :: n = 1000, nhuge = 100.
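The body of norm_mod is not quoted above; one way my_norm2_extra could be implemented, assuming wp is single precision and "extra precision" means accumulating the sum of squares in double, would be the following (the module body and logic are my guess at the approach, not the original code):

```fortran
module norm_mod
  use iso_fortran_env, only: real32, real64
  implicit none
  integer, parameter :: wp = real32   ! working precision
  integer, parameter :: ep = real64   ! extra precision for accumulation
contains
  pure function my_norm2(x) result(y)
    real(wp), intent(in) :: x(:)
    real(wp) :: y
    ! naive form: x(i)**2 overflows once |x(i)| exceeds sqrt(huge(1.0_wp))
    y = sqrt(sum(x**2))
  end function my_norm2

  pure function my_norm2_extra(x) result(y)
    real(wp), intent(in) :: x(:)
    real(wp) :: y
    ! accumulate in ep so intermediate squares of large wp values cannot
    ! overflow; only the final square root is rounded back to wp
    y = real(sqrt(sum(real(x, ep)**2)), wp)
  end function my_norm2_extra
end module norm_mod
```

With x filled with values near huge(1.0_wp), my_norm2 returns Infinity while my_norm2_extra still produces a finite result, as long as the true norm itself is representable in wp.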