Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

Hursey, Joshua. Indiana University ProQuest Dissertations Publishing, 2010. 3423687.