Abstract --

Merlin is a user-friendly system that performs functional genomic annotation of lists of genes. Merlin can also be used to reconstruct genome-scale metabolic models.

Merlin retrieves information on each homologue and automatically scores the results, allowing the user to change the score selection and dynamically (re-)annotate the genome.

Merlin expedites the transition from genome-scale data to SBML metabolic models, allowing the user to have a preliminary view of the biochemical network.

Merlin allows the user to perform similarity searches for any organism with a sequenced genome, perform semi-automated dynamic (re-)annotation of the genome, and generate new annotated GenBank genome files (.gbk) from the existing ones, for submission to the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL), and/or the DNA Data Bank of Japan (DDBJ).

The user may also combine the similarity data with the information previously loaded into a local database and export the results to a metabolic model in the Systems Biology Markup Language (SBML) format.

Merlin is composed of two modules: the Dynamic Annotation Tool and the Model Reconstruction Tool.

1) The Dynamic Annotation Tool automatically annotates lists of genes provided in FASTA format (files containing either nucleotide or amino acid sequences).

This module allows the user to define the initial parameters of the Basic Local Alignment Search Tool (BLAST) similarity searches, such as the e-value, the maximum number of hits, and the remote database.

The results of the BLAST search are then scored, allowing the user to dynamically (re-)annotate each gene, either by accepting the scorer's selection or by selecting another entry, supported by a quantifiable confidence level.

If none of the presented results satisfies the user, a manual record can also be added.

2) The Model Reconstruction Tool allows the user to load information from the Kyoto Encyclopedia of Genes and Genomes (KEGG), integrate it with information from the previous module, and then build the metabolic model and store it in SBML format.
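
The score-then-select step of the Dynamic Annotation Tool can be illustrated with a minimal Python sketch. The scoring rule used here (the fraction of BLAST hits supporting each candidate annotation) is an assumption made for illustration and is not Merlin's actual scorer; the function names, the threshold parameter, and the example hits are likewise hypothetical.

```python
from collections import Counter


def score_candidates(hits):
    """Score each candidate annotation by the fraction of hits supporting it.

    `hits` is a list of (homologue_id, annotation) pairs. This frequency
    score is a hypothetical stand-in, not Merlin's actual scorer.
    """
    counts = Counter(annotation for _, annotation in hits)
    total = sum(counts.values())
    return {annotation: n / total for annotation, n in counts.items()}


def annotate(hits, threshold=0.5, user_choice=None):
    """Return (annotation, confidence); a user choice overrides the scorer."""
    scores = score_candidates(hits)
    if user_choice is not None:
        # A manually added record may not appear among the hits at all,
        # in which case its confidence is reported as 0.0.
        return user_choice, scores.get(user_choice, 0.0)
    if not scores:
        return None, 0.0  # no hits: nothing to propose
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]  # below threshold: left for manual curation


# Made-up example hits for a single gene.
hits = [
    ("h1", "EC 2.7.1.1"),
    ("h2", "EC 2.7.1.1"),
    ("h3", "EC 2.7.1.2"),
    ("h4", "EC 2.7.1.1"),
]
print(annotate(hits))
print(annotate(hits, threshold=0.9))
print(annotate(hits, user_choice="EC 2.7.1.2"))
```

Changing the threshold and calling `annotate` again mirrors the dynamic (re-)annotation described above: the stored hits are re-scored without repeating the similarity search.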

Merlin's Architecture --

The (re-)annotation process in Merlin is based on similarity searches against the online GenBank databases. This process generates a set of files, one for each gene, containing similarity information.

Next, the information for all the homologues present in each file is retrieved from the Entrez Protein database and loaded into a local relational database.
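
As a rough illustration of the loading step, the sketch below stores homologue records in an in-memory SQLite database. The table layout, column names, function names, and example records are all assumptions made for this example; the description above does not specify Merlin's actual schema or database engine.

```python
import sqlite3


def load_homologues(conn, records):
    """Store homologue records in a local relational database.

    `records` holds (locus_tag, homologue_id, product, ec_number, e_value)
    tuples. The table layout is hypothetical.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS homologue (
               locus_tag    TEXT NOT NULL,
               homologue_id TEXT NOT NULL,
               product      TEXT,
               ec_number    TEXT,
               e_value      REAL,
               PRIMARY KEY (locus_tag, homologue_id)
           )"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO homologue VALUES (?, ?, ?, ?, ?)", records
    )
    conn.commit()


def homologues_for(conn, locus_tag):
    """Return the homologues of one gene, lowest e-value first."""
    rows = conn.execute(
        "SELECT homologue_id, product, ec_number, e_value FROM homologue "
        "WHERE locus_tag = ? ORDER BY e_value",
        (locus_tag,),
    )
    return rows.fetchall()


# Made-up records standing in for data retrieved for each homologue.
conn = sqlite3.connect(":memory:")
load_homologues(conn, [
    ("GENE_0001", "P1", "hexokinase", "2.7.1.1", 1e-50),
    ("GENE_0001", "P2", "glucokinase", "2.7.1.2", 1e-80),
    ("GENE_0002", "P3", "hypothetical protein", None, 1e-5),
])
for row in homologues_for(conn, "GENE_0001"):
    print(row)
```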

The acquired information is displayed for user appraisal and interaction. The user can then select the information based on the confidence scores provided by Merlin.

After the manual curation, the user can export a new annotated file and/or integrate the information with the previously loaded KEGG information.

The last stage is the SBML model generation.

Since the only metabolic information retrieved from a BLAST search is the EC number, the similarity information is integrated with the KEGG data, providing new reactions to be added to the metabolic model.

Hence, the reactions stored in the local database, which are catalyzed by the enzymes identified in the similarity search, along with the reactions already assigned for the case study by KEGG, are accepted for the generation of the metabolic model.
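
The selection rule just described amounts to a set union, sketched below in Python. The function name, the data structures, and the reaction identifiers are hypothetical and chosen only to illustrate the idea; they do not reflect how Merlin stores its data.

```python
def select_reactions(ec_to_reactions, similarity_ecs, kegg_assigned):
    """Return the reactions accepted for the model, sorted by identifier.

    ec_to_reactions: EC number -> set of reaction IDs (local database).
    similarity_ecs:  EC numbers identified in the similarity search.
    kegg_assigned:   reaction IDs already assigned to the organism by KEGG.
    """
    from_similarity = set()
    for ec in similarity_ecs:
        # EC numbers with no known reactions contribute nothing.
        from_similarity |= ec_to_reactions.get(ec, set())
    return sorted(from_similarity | set(kegg_assigned))


# Made-up identifiers, not real KEGG reaction IDs.
ec_to_reactions = {
    "2.7.1.1": {"RXN_A", "RXN_B"},
    "5.3.1.9": {"RXN_C"},
}
similarity_ecs = {"2.7.1.1", "1.1.1.1"}
kegg_assigned = {"RXN_C", "RXN_D"}

print(select_reactions(ec_to_reactions, similarity_ecs, kegg_assigned))
```

In this example the similarity search contributes `RXN_A` and `RXN_B`, which KEGG had not assigned to the organism, while `RXN_C` and `RXN_D` are kept from the existing KEGG assignment.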

Merlin's Operations and Implementation Features --

1) Merlin’s (re-)annotation - The purpose of this operation is the inference of candidate functions that could be assigned, by homology, to the proteins encoded by each gene in the genome.

2) Merlin’s Load Database - This operation loads several KEGG data files (compound, glycan, compound.inchi, reaction, ec.list, enzyme, organism enzyme.list, and organism.ent) and builds a local database that allows the user to later assemble a genome-scale model by selecting and editing the reactions to be included in it.

3) Merlin’s Views and Edition - The views of the local database enable the editing of any loaded information, except compound information.

Therefore, the user can edit genes, proteins and reactions. Moreover, new genes, proteins and/or reactions that are not available can be added to the local database.

4) Merlin’s Integrate - This operation compares the enzyme information retrieved by similarity with the data already available in the local database. The common unique identifier used for cross-referencing information is the locus tag.

In case of conflict between the local database information and the BLAST data, the user can select which data should be automatically preferred or whether the data should be merged.

In the latter case, the user will have to manually resolve each conflict that arises from the data integration.

5) Merlin’s SBML Builder - This operation allows the user to export the model, currently stored in a relational database, to the SBML format. This feature also allows the user to deploy the model to other software applications, such as OptFlux.
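
The conflict handling of the Integrate operation (item 4 above) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the policy names ("local", "blast", "merge"), and the representation of annotations as sets of EC numbers keyed by locus tag are hypothetical, not Merlin's implementation.

```python
def integrate(local, blast, policy="merge"):
    """Combine enzyme annotations cross-referenced by locus tag.

    local, blast: dict mapping locus tag -> set of EC numbers.
    policy: "local" or "blast" to prefer one source automatically,
            "merge" to leave each conflict for manual resolution.
    Returns (integrated, conflicts).
    """
    if policy not in {"local", "blast", "merge"}:
        raise ValueError(f"unknown policy: {policy!r}")
    integrated, conflicts = {}, {}
    for tag in sorted(local.keys() | blast.keys()):
        loc, bla = local.get(tag), blast.get(tag)
        if loc is None or bla is None or loc == bla:
            # Only one source knows the gene, or both agree: no conflict.
            integrated[tag] = set(bla if loc is None else loc)
        elif policy == "local":
            integrated[tag] = set(loc)
        elif policy == "blast":
            integrated[tag] = set(bla)
        else:
            # "merge": record the disagreement for the user to resolve.
            conflicts[tag] = {"local": set(loc), "blast": set(bla)}
    return integrated, conflicts


# Made-up locus tags and EC assignments.
local = {"G1": {"2.7.1.1"}, "G2": {"1.1.1.1"}}
blast = {"G1": {"2.7.1.1"}, "G2": {"1.1.1.2"}, "G3": {"5.3.1.9"}}

integrated, conflicts = integrate(local, blast, policy="merge")
print(integrated)
print(conflicts)
```

With the "merge" policy, only gene `G2` ends up in `conflicts`, since it is the only locus tag for which the two sources disagree; choosing "local" or "blast" instead resolves it automatically in favour of that source.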

