Magallanes

Category Cross-Omics>Workflow Knowledge Bases/Systems/Tools

Abstract Magallanes (Magellan) (Multi-Architecture Resources Discovering) is a versatile, platform-independent Java library of algorithms aimed at discovering bioinformatics ‘web services’ and associated data types.

The Magallanes software library supplies an integrated framework to develop advanced ‘discovery engines’ that help researchers find web services and associated data types. Magallanes was designed for efficiency and usability.

There is consensus in the genomics research community that one of the biggest barriers to the integrated use of ‘remote resources’ is the difficulty of locating the appropriate resource. Several techniques have been proposed to solve this problem, with varying degrees of success.

Magallanes represents an advance in practical web-resource discovery tasks, regardless of application domain. Approximate ‘keyword matching’ and ‘user profiling’ have demonstrated the power of simple approaches similar to the most commonly used way to locate web pages: search engines.

A second important feature available in Magallanes is its capacity to build up ‘workflows’ by automatic and efficient analysis of alternative pathways. These pathways go from an initial type of data to a desired output by using a set of available and compatible services.

Rigorous evaluations of different algorithm implementations led to an efficient breadth-first pruning algorithm that runs from target to source, followed by a backtracking procedure.

The Magallanes client integrates different sources of resource metadata, outperforming current client search capabilities. Moreover, including indirect information from the ‘web page links’ usually embedded in description metadata extends the scope of discovery.

Various implementations of the Magallanes client have been deployed to demonstrate the potential utility of the Magallanes Application Programming Interface (API).

Different variations of the same client (web-based engines, desktop applications, etc.) demonstrate the versatility of the software library.

Several of these clients are being used in real installations, such as the National Institute of Bioinformatics (Spain) and the ACGT-EU project, to exploit BioMoby-based repositories.

Web services from the European Bioinformatics Institute (EBI) are also among the available service catalogues.

Magallanes' architecture --

Magallanes consists of a Java library with algorithms and data handling routines built using the Modular API. The Modular API uses specific wrappers called accesses to map different data types and web-services repositories into a unified model (e.g., parsing the WSDL to get the web service's description, name, etc.).
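The wrapper idea can be pictured with a small Java sketch. It is only an illustration of mapping several repositories into one model: the names `RepositoryAccess`, `ResourceDescription`, `InMemoryAccess` and `UnifiedCatalogue` are hypothetical and are not taken from the Magallanes API, and the toy access reads from memory rather than parsing WSDL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unified model: one entry per discovered service or data type.
class ResourceDescription {
    final String repository;
    final String name;
    final String description;

    ResourceDescription(String repository, String name, String description) {
        this.repository = repository;
        this.name = name;
        this.description = description;
    }
}

// Hypothetical "access" contract: each wrapper maps its repository's native
// metadata into ResourceDescription objects.
interface RepositoryAccess {
    String repositoryName();
    List<ResourceDescription> listResources();
}

// Toy access backed by an in-memory table instead of a remote repository.
class InMemoryAccess implements RepositoryAccess {
    private final String name;
    private final List<String[]> rows = new ArrayList<>();

    InMemoryAccess(String name) { this.name = name; }

    InMemoryAccess add(String serviceName, String description) {
        rows.add(new String[] {serviceName, description});
        return this;
    }

    @Override public String repositoryName() { return name; }

    @Override public List<ResourceDescription> listResources() {
        List<ResourceDescription> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(new ResourceDescription(name, r[0], r[1]));
        }
        return out;
    }
}

// Merges every registered access into a single searchable list.
class UnifiedCatalogue {
    private final List<RepositoryAccess> accesses = new ArrayList<>();

    void register(RepositoryAccess access) { accesses.add(access); }

    List<ResourceDescription> allResources() {
        List<ResourceDescription> all = new ArrayList<>();
        for (RepositoryAccess a : accesses) {
            all.addAll(a.listResources());
        }
        return all;
    }
}
```

Under this design, supporting an additional repository only means writing one more class that implements the access interface; the search code works against the unified list.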

Note: WSDL (Web Services Description Language) is an XML-based language for describing Web services and how to access them.

Magallanes can access and manage various ‘remote repositories’ using a standardized interface, and benefits from a cache system to reduce processing time.

Currently, the ‘Modular API’ can access the following repositories:

BioMoby (the BioMOBY project was established to address the problem of discovering and retrieving related pieces of biological data from multiple hosts and services by attempting to generate a standardized query and retrieval interface using consensus object models);

Spanish National Bioinformatics Institute (INB) (INB is a technological platform of Genome Spain: a national network for the coordination, integration and development of Spanish bioinformatics resources in genomics and proteomics projects);

Advancing Clinico-Genomic Clinical Trials on Cancer (ACGT) (ACGT is an open environment for supporting clinical trials and related research through the use of ‘grid-enabled’ tools and infrastructure); and standard WSDL repositories.

In order to support another repository, a new access must be implemented (e.g. the manufacturers are currently working to incorporate BioCatalogue in the list of available repositories).

Magallanes' API is organized in two (2) main modules: 1) Search engine and 2) Workflow composition --

Search engine -

The search engine module provides Google™-like methods for finding web resources using a scoring system based on the number of occurrences and relative word positions of matching hits. It currently supports AND/OR operators and regular expressions. The search space defined by the resource metadata is easily expandable.
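To make the scoring idea concrete, here is a minimal sketch in Java. The source does not give the actual formula, so the weighting below (one point per occurrence plus a bonus that shrinks with word position) is an assumption, as are the class and method names. It also reads "word position" as position within the description, which is only one possible interpretation.

```java
import java.util.Locale;

class KeywordScorer {
    // Scores one keyword against one description. Each occurrence adds 1 plus
    // a position bonus of 1/(index+1), so earlier hits weigh more.
    // This weighting is an illustrative assumption.
    static double score(String keyword, String description) {
        String[] words = description.toLowerCase(Locale.ROOT).split("\\W+");
        String key = keyword.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(key)) {
                total += 1.0 + 1.0 / (i + 1);
            }
        }
        return total;
    }

    // AND semantics: every keyword must hit, otherwise the score is 0.
    static double scoreAll(String[] keywords, String description) {
        double sum = 0.0;
        for (String k : keywords) {
            double s = score(k, description);
            if (s == 0.0) {
                return 0.0;
            }
            sum += s;
        }
        return sum;
    }

    // OR semantics: any keyword that hits contributes to the score.
    static double scoreAny(String[] keywords, String description) {
        double sum = 0.0;
        for (String k : keywords) {
            sum += score(k, description);
        }
        return sum;
    }
}
```

Resources would then be listed in descending score order, with zero-score resources dropped.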

The algorithm initially searches the metadata descriptions for words similar to the keywords. The similarity threshold can be set up as a configuration parameter. If no hits occur, it becomes necessary to fall back on approximate ‘expression matching’.

There are two widely used approaches for approximate expression matching: the ‘Hamming distance’, which compares strings of the same length, and the ‘Levenshtein distance’, which compares two strings that do not necessarily have the same length and is measured by the minimum number of insertions, deletions, and substitutions of characters required to transform one string into the other.
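Both distances are short to state in code. The following is a textbook sketch, not the Magallanes implementation; the class name `StringDistances` is hypothetical.

```java
class StringDistances {
    // Hamming distance: number of positions at which two strings of equal
    // length differ. It is undefined for strings of different lengths.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return differences;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions turning a into b. Uses the standard
    // dynamic-programming recurrence with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, "kitten" and "sitting" are three edits apart (two substitutions and one insertion), while the Hamming distance cannot be computed for them because their lengths differ.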

Levenshtein distance is also known as matching with k differences (or k errors). If the search does not generate hits, a “Did You Mean?” module in Magallanes pops up to aid the user.

This module offers plausible alternatives to the user's query by computing the Levenshtein distance automatically (and letting the user influence the suggestions) to identify words similar to each keyword, and to estimate the distance using multiple keywords.
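A suggestion step of this kind can be sketched as a filter over a vocabulary of known metadata words. This is an assumption about how such a module might work, not the Magallanes code: the names `DidYouMean`, `suggest` and `maxDistance` are hypothetical, and the snippet carries its own compact edit-distance helper so that it compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

class DidYouMean {
    // Compact Levenshtein distance (insertions, deletions, substitutions).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the vocabulary words within maxDistance edits of the query,
    // closest first; ties are broken alphabetically.
    static List<String> suggest(String query, List<String> vocabulary, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String word : vocabulary) {
            if (editDistance(query, word) <= maxDistance) {
                hits.add(word);
            }
        }
        hits.sort((x, y) -> {
            int dx = editDistance(query, x);
            int dy = editDistance(query, y);
            return dx != dy ? Integer.compare(dx, dy) : x.compareTo(y);
        });
        return hits;
    }
}
```

Exposing `maxDistance` as a parameter is one simple way to let the user influence how permissive the suggestions are.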

Magallanes uses a ‘feedback module’ to continually learn and refine its discovery capabilities. Any client software using Magallanes is able to access this feedback module, which records user selections of resources associated with specific keywords.

The module stores this information and records the ‘feedback’ value associated with the keyword-resource tuple (KR). This value is adjusted when the user selects another resource using the same keyword.
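One simple way to picture the keyword-resource bookkeeping is a nested map that counts selections. The source does not specify how the feedback value is adjusted, so the plain increment below is an assumption, and `FeedbackStore` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FeedbackStore {
    // keyword -> (resource -> feedback value)
    private final Map<String, Map<String, Integer>> values = new HashMap<>();

    // Called when a user picks a resource from the results for a keyword.
    // Assumption: each selection raises the KR value by one.
    void recordSelection(String keyword, String resource) {
        values.computeIfAbsent(keyword, k -> new HashMap<>())
              .merge(resource, 1, Integer::sum);
    }

    // Current feedback value for a keyword-resource pair (0 if never selected).
    int feedback(String keyword, String resource) {
        Map<String, Integer> forKeyword =
                values.getOrDefault(keyword, Collections.emptyMap());
        return forKeyword.getOrDefault(resource, 0);
    }

    // Resources previously selected for a keyword, most selected first;
    // ties are broken alphabetically.
    List<String> ranked(String keyword) {
        Map<String, Integer> forKeyword =
                values.getOrDefault(keyword, Collections.emptyMap());
        List<String> resources = new ArrayList<>(forKeyword.keySet());
        resources.sort((a, b) -> {
            int byValue = Integer.compare(forKeyword.get(b), forKeyword.get(a));
            return byValue != 0 ? byValue : a.compareTo(b);
        });
        return resources;
    }
}
```

A search client could combine this value with the keyword score so that resources users have chosen before rise in later result lists.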

Finally, Magallanes also allows the use of third-party discovery functionality. For instance, several repositories implement discovery strategies based on web service compatibility with a given data type (i.e., which services are able to process my data?).

Intuitively, the consecutive application of this strategy can be exploited to create a sequence of compatible services that connect a given input with another target data type, in “pipeline” fashion.

This motivates the next major area of functionality offered by Magallanes: the ‘automatic arrangement of services’ to connect differing data types, including the management of user interactions to refine results.

Automatic workflow composition -

The Workflow Management Coalition (WfMC) defines a workflow or workflow model as the complete or partial automation of a process in which information or tasks are passed from one participant to another according to a defined set of procedural rules.

Bioinformatics research can often benefit from connecting several applications in sequence to form a workflow (WF). Manual construction of WFs is complex and prone to error, particularly in bioinformatics where data comes in a multitude of formats. Combined with the difficulty of using distributed web services, composing a meaningful WF can present a challenge to life scientists.

Automatic ‘workflow generation’ (also called automatic service composition) aims to automate the task of connecting independent services. Two services can be connected if the output of one is compatible with the input of the other.

Therefore, the task of ‘automatic workflow generation’ is to find the shortest non-redundant sequence of services, meaningful to the research, that match outputs with inputs to link the source to the target data type.

Workflow generation support can be either semi-automatic, interactively giving advice on suitable services for each step in workflow construction, or fully automatic, where the scientist only provides input and output data sets and the algorithm generates the complete workflow.

In simplest terms, the automatic WF-builder in Magallanes first identifies all the services that produce the target data type as output.

All the data types used as input for those services then become the targets of the next step, and the process repeats until the source data type is reached.
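The target-to-source search can be sketched as a backward breadth-first pass followed by a trace from the source. This is a simplified illustration under stated assumptions, not the Magallanes algorithm: every service is assumed to have exactly one input and one output type, type compatibility is plain string equality, and the names `Service` and `WorkflowBuilder` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical service description: one input type, one output type.
class Service {
    final String name;
    final String input;
    final String output;

    Service(String name, String input, String output) {
        this.name = name;
        this.input = input;
        this.output = output;
    }
}

class WorkflowBuilder {
    // Returns a shortest chain of service names that converts source into
    // target, or an empty list if no chain exists or source equals target.
    static List<String> compose(List<Service> services, String source, String target) {
        List<String> chain = new ArrayList<>();
        if (source.equals(target)) {
            return chain;
        }
        // Backward breadth-first pass: for each data type reached, remember
        // the service that moves it one step closer to the target.
        Map<String, Service> nextStep = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(target);
        while (!frontier.isEmpty() && !nextStep.containsKey(source)) {
            String wanted = frontier.poll();
            for (Service s : services) {
                // Prune: a type already reached has an equal or shorter route.
                boolean alreadyReached =
                        s.input.equals(target) || nextStep.containsKey(s.input);
                if (s.output.equals(wanted) && !alreadyReached) {
                    nextStep.put(s.input, s);
                    frontier.add(s.input);
                }
            }
        }
        // Trace the recorded steps from the source forward to the target.
        String current = source;
        while (!current.equals(target)) {
            Service s = nextStep.get(current);
            if (s == null) {
                return new ArrayList<>(); // source was never reached
            }
            chain.add(s.name);
            current = s.output;
        }
        return chain;
    }
}
```

Because the pass expands one level at a time and records only the first service found for each data type, the chain it returns uses the fewest services among all possible routes.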

A well-defined data type hierarchy will provide the required semantics to generate meaningful workflows.
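The role of the hierarchy in deciding whether two services can be connected can be illustrated with a small compatibility check. This is a sketch under the assumption of a single-parent hierarchy; the class `TypeHierarchy` and the type names in the usage note are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

class TypeHierarchy {
    // child type -> parent type (single inheritance assumed)
    private final Map<String, String> parentOf = new HashMap<>();

    TypeHierarchy declare(String child, String parent) {
        parentOf.put(child, parent);
        return this;
    }

    // True when the produced type is the accepted type or one of its
    // descendants, so an output of that type can feed the input.
    boolean isCompatible(String produced, String accepted) {
        String current = produced;
        // The step limit guards against accidental cycles in the declarations.
        for (int steps = 0; current != null && steps <= parentOf.size(); steps++) {
            if (current.equals(accepted)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }
}
```

With `DNASequence` declared as a child of `GenericSequence`, a service that outputs `DNASequence` can feed one that accepts `GenericSequence`, but not the other way round.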

System Requirements

Contact manufacturer.

Manufacturer

Manufacturer Web Site Magallanes

Price Contact manufacturer.

G6G Abstract Number 20511

G6G Manufacturer Number 104129