Checkpointing is a commonly used technique for reducing the execution time of long-running programs in the presence of failures. With checkpointing, the state of the program under execution is saved intermittently. Upon the occurrence of a failure, the program execution is restarted from the most recent checkpoint rather than from the beginning. Due to its overhead, checkpointing may not always be beneficial; when it is, an optimal checkpointing strategy can be determined so as to minimize the expected execution time or to maximize the probability of the execution time not exceeding a critical limit. In this chapter we consider several models of checkpointing and recovery in a program in order to derive the distribution of program execution time or its expectation. We make important extensions to some existing results as well as introduce and analyze new models. It is shown that the expected execution time increases linearly (exponentially) with the processing requirement in the presence (absence) of checkpointing. Furthermore, these models can be used to compare different checkpointing strategies and to determine an optimal interval between checkpoints. Throughout the chapter, and whenever appropriate, we attempt to relate to other work in the existing literature.
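The linear-versus-exponential contrast can be illustrated with a minimal sketch of a standard checkpointing model (not the chapter's specific models): failures arrive as a Poisson process with rate `lam`, a failure restarts work from the last checkpoint (or from the beginning if there is none), recovery is instantaneous, and each checkpoint costs a fixed overhead `c`. Under these assumptions the well-known expected completion time for a work requirement x is (e^{λx} − 1)/λ, so splitting the work into intervals of length `tau` makes the expectation grow roughly linearly in the total requirement.

```python
import math

def expected_time_no_ckpt(T, lam):
    """Expected completion time for work requirement T without
    checkpoints: restart from the beginning after each failure.
    Grows exponentially in T."""
    return (math.exp(lam * T) - 1.0) / lam

def expected_time_ckpt(T, lam, c, tau):
    """Expected completion time with equidistant checkpoints every
    tau units of work, each costing overhead c.  Each interval is an
    independent restart problem, so the total grows roughly linearly
    in T for a fixed interval tau."""
    n = math.ceil(T / tau)          # number of intervals
    seg = T / n                     # work per interval
    return n * (math.exp(lam * (seg + c)) - 1.0) / lam

def best_interval(T, lam, c, candidates):
    """Pick the interval from `candidates` minimizing expected time
    (a crude scan; the chapter derives optimal intervals analytically)."""
    return min(candidates, key=lambda tau: expected_time_ckpt(T, lam, c, tau))
```

For example, with `lam = 0.01`, `c = 1`, and `T = 1000`, the uncheckpointed expectation is on the order of e^{10}/λ, while checkpointing every 50 units keeps it within a small multiple of T. The names and the scanning helper are illustrative; the chapter's models also cover non-instantaneous recovery and distributional results, which this sketch omits.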
Title of host publication: Software Fault Tolerance
Editors: Michael R. Lyu
Place of publication: New York, NY
Number of pages: 24
Publication status: Published - 1995
Series: Trends in Software