Checkpointing and the modeling of program execution time

Victort F. Nicola

    Research output: Chapter in Book/Report/Conference proceedingChapterAcademic


    Checkpointing is a commonly used technique for reducing the execution time of longrunning programs in the presence of failures.With checkpointing, the status of the program under execution is saved intermittently. Upon the occurrence of a failure, the program execution is restarted from the most recent checkpoint rather than from the beginning. Due to its overhead, checkpointing may not always be beneficial, and, if it is, an optimal checkpointing strategy may be determined so as to minimize the expected execution time or to maximize the probability of the execution time not exceeding a critical limit. In this chapter we consider several models of checkpointing and recovery in a program in order to derive the distribution of program execution time or its expectation. We make important extensions to some existing results as well as introduce and analyze new models. It is shown that the expected execution time increases linearly (exponentially) with the processing requirement in the presence (absence) of checkpointing. Furthermore, these models can be used to compare different checkpointing strategies and to determine an optimal interval between checkpoints. Throughout the chapter, and whenever appropriate, we attempt to relate to other work in the existing literature.
    Original languageEnglish
    Title of host publicationSoftware Fault Tolerance
    EditorsMichael R. Lyu
    Place of PublicationNew York, NY
    Number of pages24
    ISBN (Print)9780471950684
    Publication statusPublished - 1995

    Publication series

    NameTrends in software

    Cite this