Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/120126
Type: Conference paper
Title: Fault-tolerant grid-based solvers: Combining concepts from sparse grids and MapReduce
Author: Larson, J.W.
Hegland, M.
Harding, B.
Roberts, S.
Stals, L.
Rendell, A.P.
Strazdins, P.
Ali, M.M.
Kowitz, C.
Nobes, R.
Southern, J.
Wilson, N.
Li, M.
Oishi, Y.
Citation: Procedia Computer Science, 2013 / Alexandrov, V., Lees, M., Krzhizhanovskaya, V., Dongarra, J., Sloot, P.M.A. (ed./s), vol.18, iss.Special issue, pp.130-139
Publisher: Elsevier
Issue Date: 2013
Series/Report no.: Procedia Computer Science
ISSN: 1877-0509
Conference Name: International Conference on Computational Science (ICCS) (5 Jun 2013 - 7 Jun 2013 : Barcelona, Spain)
Editor: Alexandrov, V.
Lees, M.
Krzhizhanovskaya, V.
Dongarra, J.
Sloot, P.M.A.
Statement of Responsibility: J. W. Larson, M. Hegland, B. Harding, S. Roberts, L. Stals, A. P. Rendell, P. Strazdins, M. M. Ali, C. Kowitz, R. Nobes, J. Southern, N. Wilson, M. Li, Y. Oishi
Abstract: A key issue confronting petascale and exascale computing is the growth in probability of soft and hard faults with increasing system size. A promising approach to this problem is the use of algorithms that are inherently fault tolerant. We introduce such an algorithm for the solution of partial differential equations, based on the sparse grid approach. Here, the solutions on multiple component grids are efficiently combined to achieve a solution on a full grid. The technique also lends itself to a (modified) MapReduce framework on a cluster of processors, with the map stage corresponding to allocating each component grid for solution over a subset of the processors, and the reduce stage corresponding to their combination. We describe how the sparse grid combination method can be modified to robustly solve partial differential equations in the presence of faults. This is based on a modified combination formula that can accommodate the loss of one or two component grids. We also discuss accuracy issues associated with this formula. We give details of a prototype implementation within a MapReduce framework using the dynamic process features and asynchronous message passing facilities of MPI. Results on a two-dimensional advection problem show that the errors after the loss of one or two sub-grids are within a factor of 3 of the sparse grid solution in the presence of no faults. They also indicate that the sparse grid technique with four times the resolution has approximately the same error as a full grid, while requiring (for a sufficiently high resolution) much lower computation and memory requirements. We finally outline a MapReduce variant capable of responding to faults in ways other than re-scheduling of failed tasks. We discuss the likely software requirements for such a flexible MapReduce framework, the requirements it will impose on users’ legacy codes, and the system's runtime behavior.
Keywords: Parallel computing; partial differential equations; fault-tolerance; sparse grids; MapReduce
Rights: © 2013 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND License.
DOI: 10.1016/j.procs.2013.05.176
Grant ID: http://purl.org/au-research/grants/arc/LP110200410
Published version: http://dx.doi.org/10.1016/j.procs.2013.05.176
Appears in Collections: Aurora harvest 4
Mathematical Sciences publications

Files in This Item:
File: hdl_120126.pdf
Description: Published Version
Size: 145.19 kB
Format: Adobe PDF

