eggNOG

Category Cross-Omics>Knowledge Bases/Databases/Tools

Abstract eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes.

The orthologous groups are annotated with functional descriptions, which are derived by identifying a common denominator for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains.

eggNOG's database currently (as of Feb 2011) counts 224,847 orthologous groups in 630 species, covering 2,242,035 proteins (built from 2,590,259 proteins) of which 1,966,709 are annotated.
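
The counts above can be turned into coverage ratios with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and it assumes the annotated count refers to the proteins placed in orthologous groups.

```python
# Figures quoted in the abstract (as of Feb 2011).
total_proteins = 2_590_259      # proteins the database was built from
grouped_proteins = 2_242_035    # proteins covered by orthologous groups
annotated_proteins = 1_966_709  # assumed to be a subset of the grouped proteins
orthologous_groups = 224_847

def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

coverage_pct = percent(grouped_proteins, total_proteins)
annotated_pct = percent(annotated_proteins, grouped_proteins)
mean_group_size = grouped_proteins / orthologous_groups

print(f"grouped: {coverage_pct}% of input proteins")
print(f"annotated: {annotated_pct}% of grouped proteins")
print(f"mean proteins per group: {mean_group_size:.2f}")
```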

eggNOG:

(1) can be updated without manual curation,

(2) covers more genes and genomes than existing databases,

(3) contains a hierarchy of orthologous groups to balance phylogenetic coverage and resolution and

(4) provides automatic function annotation of similar quality to that obtained through manual inspection.

Construction of Hierarchical Orthologous Groups -- The manufacturer assembles proteins into orthologous groups using an automated procedure similar to the original COG/KOG approach.

When constructing coarse-grained orthologous groups across all three domains of life or for all eukaryotes, the manufacturer first assigns the proteins encoded by the genomes in eggNOG to the respective Clusters of Orthologous Groups of proteins (COGs) or euKaryotic clusters of Orthologous Genes (KOGs) based on best hits to the manually assigned sequences in the COG/KOG database.

Where multiple hits cover the same part of a sequence, only the best hit is considered.
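
The "best hit per sequence region" rule can be sketched as a greedy filter. This is an illustration only: the Hit record, its fields, and the identifiers in the usage note are hypothetical, and the source does not state how overlap between hits is defined, so any shared residue is treated as overlap here.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    target: str   # hypothetical COG/KOG identifier
    start: int    # first aligned residue on the query (inclusive)
    end: int      # last aligned residue on the query (inclusive)
    score: float  # similarity score, higher is better

def overlaps(a, b):
    """True if the two hits share at least one query residue."""
    return a.start <= b.end and b.start <= a.end

def best_hits_per_region(hits):
    """Keep the highest-scoring hit for each part of the query sequence."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # A lower-scoring hit survives only if it covers a different region.
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

With hits on residues 1-120 (score 250), 10-110 (score 180) and 200-320 (score 90), the sketch keeps the first and third and drops the second, which competes for the same region as a better hit.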

The many proteins that cannot be assigned to existing COGs or KOGs are subsequently assembled into non-supervised orthologous groups using the procedure described below.

When constructing more fine-grained orthologous groups, this initial step is skipped.

Briefly, the manufacturer first computes all-against-all Smith–Waterman similarities among all proteins in eggNOG.

The manufacturer then groups recently duplicated sequences into in-paralogous groups, which are subsequently treated as single units to ensure that they will be assigned to the same orthologous groups.

To form the in-paralogous groups, the manufacturer first assembles highly related genomes into clades, usually encompassing all sequenced strains of a particular species in a single clade, but also other close pairs such as human and chimpanzee.

In these clades, the manufacturer joins into in-paralogous groups all proteins that are more similar to each other (within the clade), than to any other protein outside the clade.

For this, there is no fixed similarity cutoff; instead the manufacturer starts with a stringent cutoff and relaxes it in a step-wise fashion until all in-paralogous proteins are joined, requiring that all members of a group align to each other over at least 20 residues.
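
The in-paralog grouping described above can be sketched as follows. This is a simplified illustration, not the manufacturer's implementation: the data layout (a dictionary of pairwise scores and aligned lengths), the function names, and the cutoff values in the usage note are assumptions.

```python
from itertools import combinations

MIN_ALIGNED = 20  # residues, as stated in the text

def pair(a, b):
    return frozenset((a, b))

def in_paralog_groups(clade, sims, cutoffs):
    """clade: set of protein ids belonging to one clade.
    sims: dict mapping frozenset({a, b}) -> (score, aligned_residues),
          covering pairs inside the clade and pairs reaching outside it.
    cutoffs: similarity cutoffs ordered from stringent to relaxed."""
    # Best score each clade protein reaches against any protein outside the clade.
    best_outside = {p: 0.0 for p in clade}
    for key, (score, _) in sims.items():
        a, b = tuple(key)
        for inside, other in ((a, b), (b, a)):
            if inside in clade and other not in clade:
                best_outside[inside] = max(best_outside[inside], score)

    group_of = {p: frozenset([p]) for p in clade}

    def all_align(members):
        # Every pair in the candidate group must align over >= MIN_ALIGNED residues.
        return all(sims.get(pair(x, y), (0.0, 0))[1] >= MIN_ALIGNED
                   for x, y in combinations(members, 2))

    for cutoff in cutoffs:  # step-wise relaxation
        for a, b in combinations(sorted(clade), 2):
            score, _ = sims.get(pair(a, b), (0.0, 0))
            if score < cutoff or group_of[a] == group_of[b]:
                continue
            if score <= best_outside[a] or score <= best_outside[b]:
                continue  # at least as similar to a protein outside the clade
            merged = group_of[a] | group_of[b]
            if all_align(merged):
                for member in merged:
                    group_of[member] = merged
    return set(group_of.values())
```

Called with cutoffs such as [800, 400, 200], the sketch joins the most similar pairs first and only then considers weaker ones, so a late, weak link cannot pull in a protein that fails the 20-residue requirement against existing members.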

After grouping in-paralogous proteins, the manufacturer starts assigning orthology between proteins, by joining triangles of reciprocal best hits involving three different species (here, in-paralogous groups are represented by their best-matching member).

Again, the manufacturer starts with a stringent similarity cutoff and relaxes it to identify groups of proteins that all align to each other by at least 20 residues.
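
The triangle step can be illustrated with a small sketch. It is not the manufacturer's code: the best-hit table layout and function names are assumptions, the step-wise cutoff and the in-paralog representatives are left out, and the rule for merging triangles (two shared proteins) is a simplification chosen for this example.

```python
from itertools import combinations

def rbh_pairs(best_hit, species_of):
    """best_hit: dict mapping (protein, target_species) -> best-matching protein.
    Returns the set of protein pairs that are each other's best hit."""
    pairs = set()
    for (query, _target_species), hit in best_hit.items():
        if best_hit.get((hit, species_of[query])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs

def rbh_triangles(best_hit, species_of):
    """Find triples from three different species linked by reciprocal best hits."""
    pairs = rbh_pairs(best_hit, species_of)
    proteins = sorted({p for pr in pairs for p in pr})
    triangles = []
    for a, b, c in combinations(proteins, 3):
        if len({species_of[a], species_of[b], species_of[c]}) < 3:
            continue
        if all(frozenset(edge) in pairs for edge in ((a, b), (a, c), (b, c))):
            triangles.append((a, b, c))
    return triangles

def join_triangles(triangles):
    """Merge triangles that share at least two proteins (assumed merge rule)."""
    groups = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for g1, g2 in combinations(groups, 2):
            if len(g1 & g2) >= 2:
                groups.remove(g1)
                groups.remove(g2)
                groups.append(g1 | g2)
                merged = True
                break  # restart, since the list of groups changed
    return groups
```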

Next, the manufacturer relaxes the triangle criterion and allows the remaining unassigned proteins to join a group by simple bidirectional best hits.
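
A sketch of this relaxed step, under the same caveats: the function name and data layout are hypothetical, and a protein with bidirectional best hits into more than one group simply joins the first one found, which the source does not specify.

```python
def join_by_bidirectional_best_hit(groups, unassigned, best_hit, species_of):
    """groups: list of sets of already grouped proteins.
    unassigned: proteins not yet in any group.
    best_hit: dict mapping (protein, target_species) -> best-matching protein.
    Returns (updated groups, proteins that remain unassigned)."""
    group_index = {p: i for i, g in enumerate(groups) for p in g}
    result = [set(g) for g in groups]
    leftover = set()
    for protein in sorted(unassigned):
        joined = False
        for (query, _target_species), hit in best_hit.items():
            if query != protein or hit not in group_index:
                continue
            # The grouped protein must point back at the unassigned one.
            if best_hit.get((hit, species_of[protein])) == protein:
                result[group_index[hit]].add(protein)
                joined = True
                break
        if not joined:
            leftover.add(protein)
    return result, leftover
```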

Finally, the manufacturer automatically identifies gene fusion events by searching for proteins that bridge otherwise unrelated orthologous groups.

In these cases, the different parts of the fusion protein are assigned to their respective orthologous groups.

This step is a distinguishing feature of the manufacturer's approach and is crucial for the analysis of eukaryotic multi-domain proteins, as these would otherwise cause unrelated orthologous groups to be fused.
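
One way to picture the fusion handling is to look at where on a protein the alignments to each orthologous group fall. The sketch below is an assumption-laden illustration: the group identifiers are hypothetical, and the source does not give the actual criterion for deciding that two groups are otherwise unrelated, so non-overlapping regions stand in for it here.

```python
def split_fusion(alignments):
    """alignments: list of (group_id, start, end) describing where one protein
    aligns to members of each orthologous group (residue coordinates, inclusive).
    Returns a list of (group_id, (start, end)) parts if the protein looks like a
    fusion of distinct groups, otherwise None."""
    by_group = {}
    for group_id, start, end in alignments:
        lo, hi = by_group.get(group_id, (start, end))
        by_group[group_id] = (min(lo, start), max(hi, end))
    parts = sorted(by_group.items(), key=lambda item: item[1][0])
    for (_, (_, prev_end)), (_, (next_start, _)) in zip(parts, parts[1:]):
        if next_start <= prev_end:
            return None  # regions overlap: not treated as a fusion in this sketch
    return parts if len(parts) > 1 else None
```

A protein whose residues 1-150 match one group and whose residues 180-400 match another would be split into two parts, each assigned to its own group, instead of causing the two groups to be merged.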

To construct a hierarchy of orthologous groups, the procedure described above was applied to several subsets of organisms.

To make a set of coarse-grained orthologous groups across all three domains of life, the manufacturer constructed non-supervised orthologous groups (NOGs) from the genes that could not be mapped to a COG or KOG.

Focusing on eukaryotic genes, the manufacturer constructed more fine-grained eukaryotic NOGs (euNOGs) from the genes that could not be mapped to a KOG.

Finally, the manufacturer builds sets of NOGs of increasing resolution for five eukaryotic clades, namely fungi (fuNOGs), metazoans (meNOGs), insects (inNOGs), vertebrates (veNOGs) and mammals (maNOGs).
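
The resulting hierarchy can be pictured as a simple parent map from finer to coarser levels. The nesting below follows the taxonomy of the named clades and is an assumption about how the levels relate, not a description of the database schema; the names PARENT_LEVEL and levels_from are illustrative.

```python
# Each level points to the next coarser level; None marks the top.
PARENT_LEVEL = {
    "maNOG": "veNOG",  # mammals within vertebrates
    "veNOG": "meNOG",  # vertebrates within metazoans
    "inNOG": "meNOG",  # insects within metazoans
    "meNOG": "euNOG",  # metazoans within eukaryotes
    "fuNOG": "euNOG",  # fungi within eukaryotes
    "euNOG": "NOG",    # eukaryotes within all three domains of life
    "NOG": None,
}

def levels_from(level):
    """List the levels from the given one up to the coarsest."""
    chain = []
    while level is not None:
        chain.append(level)
        level = PARENT_LEVEL[level]
    return chain

print(levels_from("maNOG"))
```

Walking up from a fine-grained level trades resolution for phylogenetic coverage, which is the balance the hierarchy is meant to offer.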

Automatic Annotation of Protein Function -- An important feature of eggNOG is that it provides functional annotations for the orthologous groups.

These annotations are produced by a pipeline, which summarizes the available functional information on the proteins in each cluster:

(1) the textual annotation for these proteins,

(2) their annotated Gene Ontology (GO) terms,

(3) their membership in KEGG pathways and

(4) the presence of protein domains from SMART and Pfam.

For each orthologous group, the manufacturer's pipeline also searches for overrepresented GO terms, KEGG pathways or protein domains.

As a single domain may not properly reflect the function of a complete protein, description lines are constructed based on overrepresented domains only if all other options have been exhausted.
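
The search for overrepresented terms can be sketched with a standard enrichment test. The source does not name the statistical test used; a hypergeometric tail probability is one common choice and is assumed here, along with the function names, the significance threshold, and the term labels in the usage note.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total

def overrepresented(group_terms, background, n_background, alpha=0.01):
    """group_terms: list with one set of terms per protein in the group.
    background: dict term -> number of proteins in the whole set carrying it.
    n_background: total number of proteins in the whole set.
    Returns dict term -> p-value for terms with p-value below alpha."""
    n = len(group_terms)
    counts = {}
    for terms in group_terms:
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    hits = {}
    for term, k in counts.items():
        p_value = hypergeom_tail(k, n, background[term], n_background)
        if p_value < alpha:
            hits[term] = p_value
    return hits
```

A term carried by every protein in a five-member group but by only 10 of 1,000 proteins overall would be reported, while a term carried by 900 of 1,000 proteins would not, even if it appears in the group.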

Quality Assessment -- To assess the quality of the function annotations provided by the manufacturer's automated pipeline, the manufacturer manually checks a random sample of 100 NOGs and 100 euNOGs and classifies their annotations into three categories:

87.5% were correct (i.e. they describe a function that the proteins have in common), 12.5% were uninformative (i.e. they do not describe a function) and, due to the manufacturer's stringent rule set, no wrong functions were assigned.

Uninformative annotations of orthologous groups are in many cases due to a lack of functional knowledge on the corresponding proteins.

Note: Statistics on the content of the eggNOG database are provided.

Note: Up-to-date genomes and proteomes were obtained from Ensembl, GenomeReviews and RefSeq.

System Requirements

Contact manufacturer.

Manufacturer

Bork Group
EMBL - Biocomputing
Meyerhofstraße 1
69117 Heidelberg
Germany
and
Institute of Molecular Biology
Y55-L76
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland

Manufacturer Web Site eggNOG

Price Contact manufacturer.

G6G Abstract Number 20306

G6G Manufacturer Number 100849

The G6G Directory of Omics and Intelligent Software
