Abstract --

The BioCreative MetaServer (BCMS) is one of the first meta-services for information extraction (IE) in molecular biology.

This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts.

Annotation types cover gene names, gene IDs, species, and protein-protein interactions (PPI). The meta-server distributes the annotations in both human-readable and machine-readable formats (HTML and XML, respectively).

This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing.

The platform allows direct comparison, unified access, and result aggregation of the annotations.

BioCreative MetaServer Overview --

The fundamental aim of the BCMS platform is to provide users with annotations on biomedical texts from different systems.

At the current prototype level, the dataset is restricted to a fixed number of approximately 22,800 PubMed/Medline abstracts.

The available annotations mark passages that are detected as gene or protein name mentions, assign gene/protein and taxonomic IDs to the articles (with hyperlinks to the corresponding database entries), and give a confidence score for whether the text contains protein-protein interaction information.

Expanding on stand-alone Information Extraction systems, this platform gathers the results of several systems developed by various research groups, unifies them, and allows the user to access abstracts and annotations in a combined view.

It is conceivable that collating classification results will often enhance performance, simply because an annotation on which several independent systems agree is more likely to be correct.
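The agreement idea can be illustrated with a minimal sketch. It assumes a simple vote count across systems; the function name, the `min_votes` threshold, the system names, and the gene IDs are illustrative only and do not describe how the BCMS itself aggregates results.

```python
from collections import Counter


def aggregate_by_agreement(predictions, min_votes=2):
    """Keep each annotation proposed by at least `min_votes` systems.

    `predictions` maps a (hypothetical) system name to the set of
    annotations that system produced for one abstract.
    """
    votes = Counter()
    for annotations in predictions.values():
        # set() guarantees one vote per system per annotation
        votes.update(set(annotations))
    return {ann: n for ann, n in votes.items() if n >= min_votes}


# Illustrative input: three systems normalizing genes in the same abstract.
predictions = {
    "system_a": {"GeneID:7157", "GeneID:672"},
    "system_b": {"GeneID:7157"},
    "system_c": {"GeneID:7157", "GeneID:675"},
}

consensus = aggregate_by_agreement(predictions)
print(consensus)
```

Only the annotation proposed by all three systems survives the default threshold; lowering `min_votes` to 1 would return the plain union of all results.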

The gathered data are accessible to the user both as human-readable hypertext and as machine-processable XML in the form of XML-Remote Procedure Call (XML-RPC) requests.

System design --

The platform is best regarded as a distributed system that requests, retrieves, and unifies textual annotations, and delivers these data to the user at different levels of granularity.

The BCMS can be divided into three (3) main units:

1) A static collection of text (a set of approximately 22,800 PubMed abstracts used in the BioCreative II challenge).

2) A set of annotation servers (ASs) that provide annotations for text upon request, such as EAGL Tools (Engine for question-Answering in Genomics Literature), GIANT (Gene Identification And Normalization Tool), iHOP (Information Hyperlinked over Proteins - see G6G Abstract Number 20225), and PIE (Protein Interaction information Extraction). These annotation servers interact only with the meta-server and not directly with each user.

3) A meta-server providing the combined data, namely both the annotations and the corresponding text. Therefore, users indirectly communicate with the annotation servers, using the meta-server as a proxy.
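The division into three units can be sketched in a few lines. This is a toy model under stated assumptions, not the BCMS implementation: the `MetaServer` class, the two stand-in annotators, the PMID, and the abstract text are all hypothetical, and real annotation servers would be remote services rather than local functions.

```python
from typing import Callable, Dict, List

# An annotation server is modelled as a plain callable: text in, annotations out.
AnnotationServer = Callable[[str], List[dict]]


class MetaServer:
    """Toy proxy: users talk to this object, never to the annotators."""

    def __init__(self, abstracts: Dict[str, str]) -> None:
        self.abstracts = abstracts                      # unit 1: static text
        self.servers: Dict[str, AnnotationServer] = {}  # unit 2: annotators

    def register(self, name: str, server: AnnotationServer) -> None:
        self.servers[name] = server

    def annotate(self, pmid: str) -> dict:
        # Unit 3: forward the text to every annotator and combine the results
        text = self.abstracts[pmid]
        annotations = {name: srv(text) for name, srv in self.servers.items()}
        return {"pmid": pmid, "text": text, "annotations": annotations}


def toy_mention_server(text: str) -> List[dict]:
    """Stand-in annotator: marks each occurrence of the token 'p53'."""
    hits, start = [], text.find("p53")
    while start != -1:
        hits.append({"type": "GM", "start": start, "end": start + 3})
        start = text.find("p53", start + 3)
    return hits


def toy_ppi_server(text: str) -> List[dict]:
    """Stand-in annotator: crude keyword rule for interaction content."""
    return [{"type": "PPI", "confidence": 0.9 if "binds" in text else 0.1}]


meta = MetaServer({"0000001": "MDM2 binds p53."})  # hypothetical PMID and text
meta.register("mentions", toy_mention_server)
meta.register("ppi", toy_ppi_server)
result = meta.annotate("0000001")
print(result["annotations"])
```

Because each annotator is registered under its own name, the combined result keeps the outputs of the individual systems apart, which is what makes a side-by-side comparison possible.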

The data can be provided by three (3) different means, which also correlate with the three (3) main components (units):

1) Via a web browser - The main intention of this access method is to allow end-users (biomedical researchers) to search for a specific piece of information, e.g., to identify or confirm interaction partners for a given gene or protein.

This view correlates with the meta-server unit (the third BCMS unit above) and offers the user a graphical interface to explore the text and annotations.

2) The second option is to use the XML-RPC protocol. This method is intended to provide developers with a means to integrate the platform data with their own applications, for example to use in combination with other annotation pipelines.

Therefore, this is the direct interface to the ASs (the second BCMS unit above), because the meta-server only acts as a proxy in this scenario.

The Application Programming Interface (API) of the XML-RPC service can be found online at the manufacturer's website.

3) The third option is to contact the authors for a database snapshot of the current state of the meta-server data.

This option is of interest for web browsers and text mining applications that make heavy use of the data, where online RPC would not be an option.

This roughly correlates with the static content of the platform (the first BCMS unit above).
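For the XML-RPC access method, the following sketch shows what a request and a response look like on the wire, using only Python's standard `xmlrpc.client` module. The method name `getAnnotations`, the PMID, and the response fields are assumptions made for illustration; the actual method names and return structures are defined by the published BCMS API. The example encodes and decodes messages locally and does not contact any server.

```python
import xmlrpc.client

# Hypothetical method name and argument; the real BCMS API may differ.
METHOD = "getAnnotations"
PMID = "12345678"

# Encode a method call as the XML document a client would POST to the server.
request_xml = xmlrpc.client.dumps((PMID,), methodname=METHOD)
print(request_xml)

# Decode it again, as a server would on receipt.
params, method = xmlrpc.client.loads(request_xml)

# Encode and decode an illustrative response (a struct with made-up fields).
response_xml = xmlrpc.client.dumps(
    ({"pmid": PMID, "ppi_confidence": 0.87},), methodresponse=True
)
(result,), _ = xmlrpc.client.loads(response_xml)
print(result)
```

In a real client, `xmlrpc.client.ServerProxy(url)` performs the same encoding and decoding behind ordinary method calls, where `url` is the service endpoint given in the API documentation.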

Annotation systems --

Annotating biomedical abstracts can be done at various levels of granularity. Currently, the service provides four (4) types of annotations:

1) Gene/protein mention (GM) - locate positions in the text that are detected as gene or protein names.

2) Gene/protein normalization (GN) - detect which genes or proteins are mentioned, assigning sequence database identifiers to the text.

3) Taxon classification - identify the organisms to which the text pertains, providing for each an ID from the National Center for Biotechnology Information (NCBI) taxonomic database together with a confidence score.

4) Protein-protein interaction (PPI) - classify whether the text contains PPI information and assign a confidence score to the classification.
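The four annotation types differ in what they attach to: a mention refers to a character span, whereas normalization, taxon, and PPI annotations refer to the abstract as a whole. One possible way to model them is sketched below. The class and field names are assumptions for illustration and are not taken from the BCMS data schema; the example values are likewise made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneMention:            # GM: a character span in the text
    start: int
    end: int
    text: str


@dataclass(frozen=True)
class GeneNormalization:      # GN: a sequence database identifier
    database: str
    identifier: str


@dataclass(frozen=True)
class TaxonClassification:    # NCBI taxonomy ID plus confidence score
    ncbi_taxon_id: int
    confidence: float


@dataclass(frozen=True)
class PPIClassification:      # document-level yes/no plus confidence score
    has_interaction: bool
    confidence: float


abstract = "MDM2 binds p53."  # made-up abstract text

mention = GeneMention(start=11, end=14, text="p53")
normalization = GeneNormalization(database="NCBI Gene", identifier="7157")
taxon = TaxonClassification(ncbi_taxon_id=9606, confidence=0.95)
ppi = PPIClassification(has_interaction=True, confidence=0.9)

# A mention's offsets must reproduce the marked passage exactly.
print(abstract[mention.start:mention.end] == mention.text)
```

Keeping the offsets and the surface string together makes it easy to detect annotations that no longer align with the text they were computed on.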

Future initiatives --

Future initiatives to expand the system, such as adding annotation types or opening the system for user-provided texts, are likely to be possible with little effort.

This implies that other research groups can join the platform and provide their own annotations, and that the system can be expanded to cover new annotation types, for example protein-interaction detection methods.

The three (3) main units of the system (the various annotation systems, the annotated data, and the access methods) as well as their components (data, communications, and application layer) are independent of each other, so that one of the parts can be manipulated or completely exchanged without affecting the platform as a whole.

