G6G Directory of Omics and Intelligent Software - Applied Bioinformatics Cyrille2

Cyrille2

Category Cross-Omics>Workflow Knowledge Bases/Systems/Tools

Abstract Cyrille2 is an extensible, modular, generic pipeline (workflow) system. Cyrille2 enables easy creation and execution of high throughput, flexible bioinformatics pipelines.

The system is modular in design and consists of three (3) functionally distinct parts:

1) A web based, graphical user interface (GUI) that enables a pipeline (workflow) operator to manage the system;

2) The Scheduler, which forms the functional core of the system: it tracks what data enters the system and determines which jobs must be scheduled for execution; and

3) The Executor, which searches for scheduled jobs and executes these on a compute cluster.

Cyrille2 System overview --

The Cyrille2 system architecture is composed of four (4) distinct layers. Layer 1 comprises the main functional and core software components. These core components make extensive use of a modular application programming interface (API) (layer 2). The API allows unified access to three system databases (layer 3).

The biological database and the end-user interface that connects to it are third-party systems that can be integrated with the Cyrille2 system (layer 4).

To allow tracking and debugging of a pipeline in operation, a centralized status and logging system is implemented. This provides a pipeline operator access to detailed information on the status of a pipeline run and errors that might have occurred.

A pipeline system needs to manage and store large amounts of diverse information. To keep different types of data separated, the system employs four (4) databases, namely the three system databases and the biological database:

1) The pipeline database, which stores pipeline definitions, node settings and associated parameters;

2) The status database, which stores the execution state of a pipeline at any given time, tracking all jobs and their respective input and output;

3) The biological database, which stores and provides access to the results of all analyses; and

4) The failover database, which employs a generic method to store all data that does not need to be stored in the biological database.

Similar to the functional division of the databases, the core software (as stated above) is divided into three (3) distinct functional parts: the Graphical User Interface (GUI), the Scheduler and the Executor. The GUI allows a pipeline operator to create, adapt, start and stop pipeline runs and fine-tune pipeline and tool settings.

The Scheduler is the core of the Cyrille2 system. It retrieves pipeline definitions from the pipeline database and schedules all jobs for execution, accounting for dependencies between nodes. A scheduled job is stored in the ‘status database’.

The Executor loops through all scheduled jobs and executes each of these. The results of each job are stored in the ‘biological database’ and are tracked with unique object identifiers in the status database. If the number of jobs to be executed is large, a ‘compute cluster’ is required to keep the total execution time within bounds.

To this end the Executor acts as a broker between the Cyrille2 system and third-party compute cluster software such as Sun Grid Engine (SGE). It is possible to employ multiple and different types of clusters by running multiple instances of the Executor.
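The division of labour between the Scheduler, the status database and the Executor can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `Job`, `StatusDB`, `run_executor` and `local_backend` are hypothetical and are not part of the Cyrille2 code base, the status database is reduced to an in-memory dictionary, and the cluster is replaced by a plain Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """A scheduled unit of work (hypothetical structure)."""
    job_id: int
    tool: str
    input_ids: List[str]
    state: str = "scheduled"  # scheduled -> running -> done / failed
    output_ids: List[str] = field(default_factory=list)


class StatusDB:
    """In-memory stand-in for the status database (assumption)."""

    def __init__(self) -> None:
        self.jobs: Dict[int, Job] = {}

    def add(self, job: Job) -> None:
        self.jobs[job.job_id] = job

    def scheduled(self) -> List[Job]:
        return [j for j in self.jobs.values() if j.state == "scheduled"]


def run_executor(status_db: StatusDB, submit: Callable[[Job], List[str]]) -> int:
    """One Executor pass: hand every scheduled job to a backend."""
    executed = 0
    for job in status_db.scheduled():
        job.state = "running"
        try:
            # The backend returns identifiers of the objects it produced.
            job.output_ids = submit(job)
            job.state = "done"
        except Exception:
            job.state = "failed"
        executed += 1
    return executed


def local_backend(job: Job) -> List[str]:
    """Stand-in for a cluster submission; derives fake output identifiers."""
    return [f"{job.tool}:{i}" for i in job.input_ids]


status_db = StatusDB()
status_db.add(Job(1, "blastp", ["prot1", "prot2"]))
status_db.add(Job(2, "repeatfinder", ["bac7"]))
executed_count = run_executor(status_db, local_backend)
```

Because the backend is passed in as a function, a second Executor instance could be given a different `submit` function, which loosely mirrors the idea of running several Executor instances against different cluster types.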

Cyrille2 Data flow and storage --

A major challenge for any pipeline system is to devise a fast and robust way to move data through a pipeline. This is not trivial: even a relatively simple pipeline may require many thousands of separate jobs to be scheduled and executed, which in turn may produce millions of objects.

Automated execution of a pipeline implies that each node needs to hold information on the nature and format of the objects that enter and leave it and that it has to process such streams in a manner unique for each type of data.

Several ‘data exchange formats’ exist and the manufacturers chose to implement BioMOBY as the data exchange format for the Cyrille2 system.

BioMOBY is emerging as an important data standard in bioinformatics and is already used by MOWserv and by Taverna (see G6G Abstract Number 20514) when handling BioMOBY (see G6G Abstract Number 20520) operations.

The BioMOBY standard contains a specification on how to describe data types, formats, and analysis types. It is a meta-data format, meaning that it does not describe data itself but defines how data are to be described.

BioMOBY employs a system of ‘object identification’ and classification, in which each BioMOBY object is identified with 1) an identification string (id), 2) an object type (articlename) and 3) a namespace.
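The three-part identification described above can be pictured as a small immutable record. This is only a sketch: the class name `MobyRef`, the `key` method and the example values are hypothetical, and the code does not use or reproduce the BioMOBY API or its XML serialization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MobyRef:
    """Hypothetical reference to an object, following the id / articlename /
    namespace triple described in the text."""
    id: str
    articlename: str
    namespace: str

    def key(self) -> str:
        # Assumption: namespace plus id is enough to look an object up.
        return f"{self.namespace}:{self.id}"


ref_a = MobyRef(id="seq001", articlename="sequence", namespace="demo_ns")
ref_b = MobyRef(id="seq001", articlename="sequence", namespace="demo_ns")
unique_refs = {ref_a, ref_b}  # frozen dataclasses are hashable
```

Making the record frozen means two references with the same three fields compare equal and collapse to one entry in a set or dictionary, which is convenient when tracking very large numbers of objects.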

BioMOBY encompasses the description of web services and facilitates interoperability with third-party servers.

Storage of all intermediate data generated during pipeline operation is guaranteed by the ‘failover database’, which automatically stores any object not stored in the biological database.
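The failover behaviour can be sketched as a simple routing rule. The class `ObjectStore`, its `store` method and the idea of deciding by object type are assumptions made for this example; the source does not say how Cyrille2 decides which objects the biological database accepts.

```python
from typing import Any, Dict, Set


class ObjectStore:
    """Hypothetical router between a biological and a failover store."""

    def __init__(self, bio_types: Set[str]) -> None:
        # Object types the biological database is assumed to accept.
        self.bio_types = bio_types
        self.biological: Dict[str, Any] = {}
        self.failover: Dict[str, Any] = {}

    def store(self, object_id: str, object_type: str, payload: Any) -> str:
        """Store the payload and report which store received it."""
        if object_type in self.bio_types:
            self.biological[object_id] = payload
            return "biological"
        # Anything else is kept generically so no intermediate data is lost.
        self.failover[object_id] = payload
        return "failover"


store = ObjectStore(bio_types={"gene_model", "repeat"})
first_target = store.store("g1", "gene_model", {"start": 10, "end": 250})
second_target = store.store("tmp1", "raw_tool_output", "unparsed text")
```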

To load data into the Cyrille2 system, dedicated start nodes are provided; these allow data to be uploaded through the user interface or harvested automatically from a file system.
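Harvesting from a file system can be sketched as a function that reports only files it has not seen before. The function name `harvest`, the `*.fasta` pattern and the use of file names as the "seen" key are all assumptions for illustration, not a description of the actual start nodes.

```python
from pathlib import Path
from typing import List, Set


def harvest(directory: Path, seen: Set[str], pattern: str = "*.fasta") -> List[Path]:
    """Return files matching the pattern that were not harvested before."""
    new_files: List[Path] = []
    for path in sorted(directory.glob(pattern)):
        if path.name not in seen:
            seen.add(path.name)  # remember it so the next pass skips it
            new_files.append(path)
    return new_files
```

Calling `harvest` repeatedly with the same `seen` set returns each file once, so a periodic pass over an upload directory would only introduce new data into the pipeline.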

Cyrille2 Scheduler --

The Scheduler is the core of the Cyrille2 system (as stated above). Based on a pipeline definition (from the ‘pipeline database’) it schedules all jobs for execution, taking mutual dependencies between nodes into account.
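Respecting dependencies between nodes amounts to ordering them so that every node comes after the nodes it depends on. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later) for this; the node names are hypothetical, and the source does not state which ordering algorithm the Cyrille2 Scheduler actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each node maps to the set of nodes it depends on.
pipeline = {
    "load_bac": set(),
    "gene_prediction": {"load_bac"},
    "repeat_detection": {"load_bac"},
    "blastp": {"gene_prediction"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
```

The relative order of independent nodes (here `gene_prediction` and `repeat_detection`) is not significant; only the dependency constraints are guaranteed.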

Various tools used in an analysis pipeline require different arrangements of incoming data. Scheduler functionality is embedded in the node classes.

This modular, object-oriented implementation of a node allows for complex scheduling strategies. A more complex node implemented in the Cyrille2 system schedules groups of objects which share a common grandparent, for example, all repeats that are predicted by several different repeat detection tools, grouped per BAC sequence (the grandparent).
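Grouping by a common grandparent can be illustrated as follows. The lineage table `parent_of`, the object identifiers and the function names are hypothetical; in particular, treating a tool run as the intermediate parent between a BAC sequence and a predicted repeat is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical lineage: object id -> parent id (None for top-level objects).
parent_of: Dict[str, Optional[str]] = {
    "bac1": None,
    "bac2": None,
    "run_a_bac1": "bac1",
    "run_b_bac1": "bac1",
    "run_a_bac2": "bac2",
    "rep1": "run_a_bac1",
    "rep2": "run_b_bac1",
    "rep3": "run_a_bac2",
}


def grandparent(object_id: str) -> Optional[str]:
    """Return the parent's parent, or None if there is no such object."""
    parent = parent_of.get(object_id)
    return parent_of.get(parent) if parent is not None else None


def group_by_grandparent(object_ids: List[str]) -> Dict[str, List[str]]:
    """Collect objects that share a grandparent into one group per job."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for object_id in object_ids:
        gp = grandparent(object_id)
        if gp is not None:
            groups[gp].append(object_id)
    return dict(groups)


repeat_groups = group_by_grandparent(["rep1", "rep2", "rep3"])
```

Here the repeats found by two different tools on `bac1` end up in one group, so a downstream node would receive them together as a single job.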

Cyrille2 Pipeline execution --

Execution of a pipeline can be considered at two (2) levels: execution of a separate node, and execution of an entire pipeline. A single node in the Cyrille2 system executes a variable number of distinct jobs.

Node operation in Cyrille2 is performed by executing three (3) different scripts in sequence, one for each of the following steps:

1) data is retrieved from the database;

2) the tool is executed, and;

3) the results are stored back in the database.

Communication with the database is handled by two ‘database connection scripts’. These two scripts access the database wrapper and provide generic communication with any database of choice.

A ‘tool wrapper’ is responsible for the execution of the tool and provides generic interaction with the Cyrille2 system. Tool wrappers are implemented in such a way that they can run standalone, be part of a BioMOBY web service, or function as a component of the Cyrille2 system.
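The three steps of a node job (retrieve, execute, store) and the separation between database connection code and tool wrapper can be sketched as below. All names are hypothetical: `DictDatabase` stands in for the database wrapper, `fetch_inputs` and `store_outputs` for the two database connection scripts, and `sequence_length_tool` for a wrapped tool. In Cyrille2 these are separate scripts; plain functions are used here only to keep the example short.

```python
from typing import Callable, Dict, List


class DictDatabase:
    """Hypothetical database wrapper; any backend with get/put would do."""

    def __init__(self) -> None:
        self.rows: Dict[str, object] = {}

    def get(self, object_id: str) -> object:
        return self.rows[object_id]

    def put(self, object_id: str, value: object) -> None:
        self.rows[object_id] = value


def fetch_inputs(db: DictDatabase, input_ids: List[str]) -> List[object]:
    """Step 1: retrieve the input objects from the database."""
    return [db.get(object_id) for object_id in input_ids]


def store_outputs(db: DictDatabase, outputs: Dict[str, object]) -> List[str]:
    """Step 3: store the results and return their identifiers."""
    for object_id, value in outputs.items():
        db.put(object_id, value)
    return list(outputs)


def run_job(db: DictDatabase,
            tool: Callable[[List[object]], Dict[str, object]],
            input_ids: List[str]) -> List[str]:
    data = fetch_inputs(db, input_ids)
    results = tool(data)  # Step 2: the wrapped tool knows nothing about the db
    return store_outputs(db, results)


def sequence_length_tool(sequences: List[object]) -> Dict[str, object]:
    """Toy stand-in for a wrapped analysis tool."""
    return {f"length_{n}": len(str(s)) for n, s in enumerate(sequences)}


node_db = DictDatabase()
node_db.put("seq_a", "MKTAYIAK")
node_db.put("seq_b", "MK")
stored_ids = run_job(node_db, sequence_length_tool, ["seq_a", "seq_b"])
```

Since the tool function only receives data and returns results, the same function could be called directly, outside any pipeline, which is the property that lets a wrapper run standalone.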

A further task of the tool wrapper is to register itself in the Cyrille2 system. Registration implies that the tool becomes available through the GUI, allowing a pipeline operator to integrate it into a pipeline and allowing the Scheduler to correctly schedule jobs for that tool.

The registration process communicates what ‘type of objects’ are required as input (e.g. protein sequences for BLASTP), what parameters are accepted (e.g. specification of a protein database) and with what ‘node type’ the tool must be associated. This is implemented in a ‘generic registration method’ where the wrapper registers all required information into the pipeline database.
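A registration record carrying the three pieces of information mentioned above (input object types, accepted parameters, node type) might look like the following sketch. The names `ToolRegistration`, `PipelineDB`, `register` and `tools_accepting`, as well as the example field values, are hypothetical and do not reflect the actual Cyrille2 registration method or database schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolRegistration:
    """What a wrapper is assumed to declare when it registers itself."""
    name: str
    input_types: List[str]      # object types required as input
    parameters: Dict[str, str]  # parameter name -> short description
    node_type: str              # node type the tool must be associated with


class PipelineDB:
    """In-memory stand-in for the pipeline database (assumption)."""

    def __init__(self) -> None:
        self.tools: Dict[str, ToolRegistration] = {}

    def register(self, registration: ToolRegistration) -> None:
        # Re-registering a tool simply replaces the earlier record.
        self.tools[registration.name] = registration

    def tools_accepting(self, object_type: str) -> List[str]:
        """Names of registered tools that take this object type as input."""
        return sorted(
            name for name, reg in self.tools.items()
            if object_type in reg.input_types
        )


pipeline_db = PipelineDB()
pipeline_db.register(ToolRegistration(
    name="blastp",
    input_types=["protein_sequence"],
    parameters={"database": "protein database to search"},
    node_type="one_job_per_object",
))
protein_tools = pipeline_db.tools_accepting("protein_sequence")
```

A lookup such as `tools_accepting` is the kind of query a GUI could use to offer only compatible tools when an operator connects nodes in a pipeline.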

In a rapidly evolving field like bioinformatics, it is of great importance that new tools can be implemented quickly. In the Cyrille2 system this requirement is met through a modular, object-oriented design of the tool wrapper code.

In brief, implementation of a novel tool in the Cyrille2 system involves the following procedure:

1) installation and configuration of the new tool on the execution server or cluster;

2) writing of the BioMOBY-compatible tool wrapper;

3) definition of new BioMOBY objects (if required);

4) confirmation of compatibility between object types and the biological database in use, and;

5) registration of the tool in the pipeline database.

System Requirements

Contact manufacturer.

Manufacturer

Applied Bioinformatics
Plant Research International
Wageningen University and Research Centre
PO Box 16
6700 AA, Wageningen, The Netherlands

Manufacturer Web Site Cyrille2

Price Contact manufacturer.

G6G Abstract Number 20532

G6G Manufacturer Number 104148
