Work flows in life science

I. Wassink

    Research output: Thesis › PhD Thesis - Research UT, graduation UT



    The introduction of computer science technology in the life science domain has resulted in a new life science discipline called bioinformatics. Bioinformaticians are biologists who know how to apply computer science technology to perform computer-based experiments, also known as in-silico or dry lab experiments. Various tools, such as databases, web applications and scripting languages, are used to design and run in-silico experiments. As the size and complexity of these experiments grow, new types of tools are required to design and execute the experiments and to analyse the results. Workflow systems promise to fulfil this role. The bioinformatician composes an experiment by using tools and web services as building blocks and connecting them, often through a graphical user interface. Workflow systems, such as Taverna, provide uniform access to up to a few thousand resources. Although workflow systems are intended to make the bioinformaticians' work easier, bioinformaticians experience difficulties in using them. This thesis investigates which problems bioinformaticians experience when using workflow systems and provides solutions for these problems.

    This thesis consists of three parts. The first part discusses the daily working practices of bioinformaticians and the infrastructure they use to perform their computer-based experiments. Within a single in-silico experiment, scientists from different disciplines and organisations are often involved. The collaboration takes different forms, ranging from working in the same workplace to sharing knowledge by means of publications as well as tools and data. The collaborating scientists have different levels of experience in using computer tools. The bioinformatician is the expert. She knows how to use life science tools, how to program and how to connect different tools. The multidisciplinary collaborative work is reflected by the life science infrastructure. 
Bioinformaticians construct and share databases and tools and reuse those created by others. Many tools are currently available as web services. The bioinformatician writes scripts to access and connect such web services.

In the second part of this thesis, the difficulties bioinformaticians have using workflow systems are analysed and discussed. By using a workflow system, the bioinformatician should be able to easily construct the experiment without programming. This is, however, an idealised view of workflow systems. Workflow systems do not support the explorative research approach the bioinformatician normally uses. Data are sources of inspiration for the bioinformatician and are used to determine the next steps in the experiment. In traditional workflow systems, the bioinformatician needs to design the entire workflow before it can be run. She therefore has to make many of the design decisions without appropriate data. Another issue many workflow designers face is solving data incompatibility problems. Different organisations often use different data structures in their services, even to represent the same information. As a result, about 30% of the tasks in the Taverna workflows stored at myExperiment represent data transformations. Standardising data formats would be the best solution, but it is infeasible: if people are free to use their own data format, they will. It would be better if a workflow system provided the means to handle data transformations. It should support scripting tasks for various programming languages, to enable the bioinformatician to program in the language she is familiar with. Additionally, a workflow system can provide tasks to automatically compose and decompose complex data structures. Furthermore, the workflow system can suggest tasks that produce compatible output or that can consume the data available. 
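The idea of suggesting compatible tasks can be illustrated with a minimal sketch. It assumes that each task declares the data types it consumes and the type it produces. The `Task` class, the catalogue and all task and type names are hypothetical and are not taken from Taverna or e-BioFlow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description: a name, accepted input types, one output type."""
    name: str
    inputs: frozenset
    output: str


def suggest_consumers(available_type, catalogue):
    """Return the tasks that can consume data of the given type."""
    return [task for task in catalogue if available_type in task.inputs]


def suggest_producers(required_type, catalogue):
    """Return the tasks whose output matches the required type."""
    return [task for task in catalogue if task.output == required_type]


# Illustrative catalogue; the task and type names are invented for this sketch.
CATALOGUE = [
    Task("fetch_sequence", frozenset({"accession"}), "fasta"),
    Task("run_alignment", frozenset({"fasta"}), "alignment"),
    Task("fasta_to_genbank", frozenset({"fasta"}), "genbank"),
    Task("build_tree", frozenset({"alignment"}), "tree"),
]

# Tasks that could follow a step that has just produced FASTA data.
consumer_names = [t.name for t in suggest_consumers("fasta", CATALOGUE)]

# Tasks that could supply the alignment a later step needs.
producer_names = [t.name for t in suggest_producers("alignment", CATALOGUE)]
```

Exact type matching is the simplest possible rule. A real system would probably also need to offer transformation tasks when the types differ only in representation.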
Once finished, the workflow model is a knowledge representation of an experiment that can easily be shared with peers, for validation or to construct similar experiments. Thanks to portals such as myExperiment, workflow sharing has become popular. These portals, however, have introduced a new type of problem bioinformaticians have to deal with. The services used in a workflow can become unavailable. Services may be down, may have moved to another location or may have a changed interface. The workflow (re)user has no influence on the existence of services, because the services are often hosted by other organisations. As a result, at the time of writing approximately one in ten of the Taverna workflows at myExperiment is broken. In order to reuse these workflows, the tasks representing these dead services need to be replaced. Workflow systems that support late binding can solve this problem when a service has moved to another location: the bioinformatician will not even notice that the service has been moved. In all other situations, she has to replace the broken service with an alternative.

In the third part, we discuss our design solutions realised in our workflow system e-BioFlow. The system we propose supports the explorative working approach of the bioinformatician. It combines aspects of both data flow and control flow systems and is therefore a so-called hybrid workflow system. It supports more control flow patterns than many existing workflow systems for bioinformatics. Examples of these patterns are conditional branching and iteration through loops. Additionally, it supports late binding: the workflow designer can abstract from real resources. The engine performs the actual binding of tasks to resources at enactment time. This way, workflows designed in e-BioFlow are independent of the location of the resources at design time. e-BioFlow's provenance system interacts with the engine. 
It automatically captures the data produced during a workflow run and saves it in an Open Provenance Model compatible file format. Using this open standard, scientists can easily share their experiment results with peers for inspection and validation. The provenance archive can also be used as a cache to speed up future executions of tasks. The provenance system provides an interactive provenance browser and a query interface to explore and query provenance data. e-BioFlow provides an ad-hoc workflow editor to enable the bioinformatician to design and execute unfinished workflows. The main advantages are that data are explicitly present in the workflow and that workflows can be constructed step by step. The workflow editor helps the workflow designer by suggesting compatible tasks, not only to prevent data incompatibility problems, but also as a source of inspiration. In this interface, the workflow designer can use the explorative research approach. By means of a mock-up implementation, we presented our design ideas to 50 life scientists at an early design stage. Most participants were enthusiastic about the new interface and expected it to be much easier to use than traditional workflow editor interfaces. Workflow systems can fit in the explorative research of bioinformaticians. These systems can help bioinformaticians to design and run their experiments and to automatically capture and store the data generated at runtime. The next challenge will be an interface that brings workflow design closer to the conceptual model bioinformaticians have of an experiment. Bioinformaticians do not think in terms of web services, but in terms of actions they want to perform on the data. The workflow system is responsible for mapping these higher-level actions to the services available. Such a workflow system will be much easier to use and will better suit the bioinformaticians' needs.
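The late binding described in the abstract can be illustrated with a minimal sketch. It assumes a workflow is a list of abstract task names and that a registry maps each name to a concrete resource when the task is about to run. The `ServiceRegistry` class and the `enact` function are hypothetical and do not represent e-BioFlow's actual engine.

```python
class ServiceRegistry:
    """Hypothetical registry mapping abstract task names to concrete callables."""

    def __init__(self):
        self._bindings = {}

    def register(self, abstract_name, implementation):
        """Bind (or rebind) an abstract task name to a concrete resource."""
        self._bindings[abstract_name] = implementation

    def resolve(self, abstract_name):
        """Look up the resource currently bound to an abstract task name."""
        try:
            return self._bindings[abstract_name]
        except KeyError:
            raise LookupError(f"no resource bound to task '{abstract_name}'") from None


def enact(workflow, registry, data):
    """Run abstract tasks in order, resolving each one only when it is about to run."""
    for abstract_name in workflow:
        service = registry.resolve(abstract_name)  # binding happens at enactment time
        data = service(data)
    return data


# The workflow refers to abstract task names only, never to concrete resources.
workflow = ["normalise", "reverse"]

registry = ServiceRegistry()
registry.register("normalise", lambda seq: seq.strip().upper())
registry.register("reverse", lambda seq: seq[::-1])
first_result = enact(workflow, registry, " acgt ")

# The "reverse" resource is replaced: only the registry changes, not the workflow.
registry.register("reverse", lambda seq: "".join(reversed(seq)))
second_result = enact(workflow, registry, " acgt ")
```

Because the workflow stores only abstract names, moving or replacing a resource requires updating the registry alone. This mirrors the claim that the workflow user need not notice a relocated service.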
    Original language: Undefined
    Awarding Institution
    • University of Twente
    • van der Vet, P.E., Advisor
    • Nijholt, Anton, Supervisor
    • van der Veer, Gerrit Cornelis, Supervisor
    Thesis sponsors
    Award date: 14 Jan 2010
    Place of Publication: Enschede
    Print ISBNs: 978-90-365-2932-7
    Publication status: Published - 14 Jan 2010


    • IR-69304
    • web service
    • late binding
    • life science
    • METIS-270703
    • ad hoc
    • EWI-17081
    • HMI-HF: Human Factors
    • Workflow flow
