Issue | 2014, SNA + MC 2013 - Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo | |
---|---|---|
Article Number | 04209 | |
Number of page(s) | 6 | |
Section | 4. Advanced Parallelism and HPC Strategies: b. Monte Carlo Methods, Parallelism and HPC | |
DOI | https://doi.org/10.1051/snamc/201404209 | |
Published online | 06 June 2014 |
Preliminary Studies on the Resiliency of Stochastic Linear Solvers
Massachusetts Institute of Technology, Department of Nuclear Science and Engineering, 77 Massachusetts Avenue, Cambridge, MA 02139
With the advent of exascale computing and the realization that memory errors will be an ever more important part of the high-performance computing landscape, this paper proposes reconsidering stochastic linear solvers for their inherent scalability and resiliency. It addresses the latter by analyzing the resiliency of stochastic solvers to randomly occurring memory errors that go undetected. The premise is that, in stochastic solvers, undetected errors can be considered part of the random process, while detectable errors can be filtered using basic statistics. Thus, the goal is not to detect all memory errors, but only those that matter, and to quantify their frequency, which will impact efficiency. A simple iterative stochastic solver was implemented, and a bit flip was imposed on each double-precision input variable to determine its impact on the final solution. Accepted batches contribute to the sample mean and variance that are used to decide whether to accept or reject the following batch. The test case indicated that 3-6 % of all soft memory errors in the double-precision variables were detected as exceeding the normal noise of the method. Since only a small fraction of memory errors truly matter, stochastic solvers offer a path that could potentially avoid error correction code altogether. Multiple consecutive bit errors were also analyzed for the specific case of five neighboring bit flips. The test on one variable produced a soft error detection rate 1 % higher than that of the single bit flip.
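The two mechanisms named in the abstract, bit flips in double-precision values and a statistical accept/reject test on batches, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `flip_bits` and `accept_batch`, the 3-sigma threshold `n_sigma`, and the warm-up count `min_batches` are all assumptions made for illustration; the abstract does not state the acceptance criterion that was actually used.

```python
import math
import statistics
import struct


def flip_bits(x, bits):
    """Return x with the given bits of its IEEE 754 binary64 pattern flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign. Hypothetical helper for illustration.
    """
    mask = 0
    for b in bits:
        if not 0 <= b < 64:
            raise ValueError("bit index must be in [0, 63]")
        mask |= 1 << b
    # Reinterpret the double as a 64-bit unsigned integer, XOR, and convert back.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ mask))
    return flipped


def accept_batch(value, accepted, n_sigma=3.0, min_batches=5):
    """Decide whether a new batch result is consistent with accepted batches.

    Assumed rule: reject non-finite values, accept unconditionally during a
    warm-up of `min_batches`, then accept only values within `n_sigma`
    sample standard deviations of the mean of the accepted batches.
    """
    if not math.isfinite(value):
        return False
    if len(accepted) < min_batches:
        return True
    mean = statistics.mean(accepted)
    std = statistics.stdev(accepted)
    return abs(value - mean) <= n_sigma * std


def filter_batches(batch_values, **kwargs):
    """Run the filter over a sequence; only accepted batches update the statistics."""
    accepted, rejected = [], []
    for value in batch_values:
        (accepted if accept_batch(value, accepted, **kwargs) else rejected).append(value)
    return accepted, rejected
```

Under these assumptions, a flip in a low mantissa bit (for example `flip_bits(1.02, [0])`) changes the value far less than the batch-to-batch noise and passes the filter as part of the random process, whereas a flip in a high exponent bit produces a value that the filter rejects. Five neighboring flips, as in the multi-bit test, would correspond to a call such as `flip_bits(x, range(k, k + 5))` for some starting bit `k`.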
Key words: Stochastic linear solvers / Resiliency / Soft Memory Errors / Exascale computing
© Owned by the authors, published by EDP Sciences, 2014