A new ESCAPE-2 report surveys approaches to fault tolerance in numerical algorithms and to system resilience in parallel simulations, from the perspective of numerical weather and climate prediction systems.
A selection of existing strategies is analysed: interpolation-restart and compressed checkpointing on the numerical side, and in-memory checkpointing, methods based on user-level failure mitigation, and backup-based methods on the system side. Numerical examples showcase how well the techniques handle faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers.
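To make the idea of in-memory checkpointing for an iterative linear solver concrete, here is a minimal single-process sketch. It is not taken from the report: the function name `jacobi_with_checkpoints`, the choice of Jacobi iteration, the `checkpoint_every` and `fault_at` parameters, and the fault model (one entry overwritten with NaN) are all assumptions made purely for illustration.

```python
import math


def jacobi_with_checkpoints(A, b, tol=1e-10, max_iter=500,
                            checkpoint_every=5, fault_at=None):
    """Jacobi iteration with in-memory checkpoint and rollback (illustrative).

    fault_at is a hypothetical knob: the iteration at which one entry of the
    iterate is corrupted once, to mimic a soft fault.
    """
    n = len(b)
    x = [0.0] * n
    checkpoint = (0, list(x))  # (iteration number, copy of the iterate)
    rollbacks = 0
    k = 0
    while k < max_iter:
        # One Jacobi sweep: x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        k += 1

        # Assumed fault model: a single NaN written into the new iterate.
        if fault_at is not None and k == fault_at:
            x_new[0] = float("nan")
            fault_at = None  # inject only once

        # Fault detection: any non-finite value triggers a rollback.
        if not all(math.isfinite(v) for v in x_new):
            k, x = checkpoint[0], list(checkpoint[1])
            rollbacks += 1
            continue

        diff = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if diff < tol:
            break

        # Periodically refresh the checkpoint held in memory.
        if k % checkpoint_every == 0:
            checkpoint = (k, list(x))
    return x, k, rollbacks


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]  # exact solution is [1, 1, 1]
    x, k, rollbacks = jacobi_with_checkpoints(A, b, fault_at=7)
    print(x, k, rollbacks)
```

In this sketch the fault costs only the sweeps performed since the last checkpoint. A shorter `checkpoint_every` reduces the work lost on rollback but increases copying overhead, which is a small-scale version of the trade-off between performance and resilience. In a real parallel setting the checkpoint would typically be held in the memory of another process or node, which this single-process example does not attempt to model.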
The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between the performance, efficiency and effectiveness of resilience strategies are analysed, and some recommendations are outlined for future developments.
Photo by NOAA on Unsplash