STRING

Category Cross-Omics>Knowledge Bases/Databases/Tools

Abstract STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database of known and predicted protein interactions.

The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: 1) Genomic Context; 2) High-throughput Experiments; 3) (Conserved) Co-expression; and 4) Previous Knowledge.

STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 2,483,276 proteins from 630 organisms.

The database and web-tool STRING is a 'meta-resource' that aggregates most of the available information on protein-protein associations, scores and weights it, and augments it with predicted interactions, as well as with the results of automatic literature-mining searches.

Since its first release in 2000, it has grown into one of the most comprehensive resources of its type. It builds upon and extends the excellent, manual annotation efforts undertaken at primary 'protein interaction databases' and at databases of 'curated pathway knowledge'.

The basic interaction unit in STRING is the ‘functional association’, which is defined in this database as the specific and meaningful interaction between two proteins that jointly contribute to the same 'functional process'.

With respect to the interacting proteins, STRING does Not consider any specific splicing isoforms or post-translational modifications, but instead represents each protein-coding locus in a genome by a single protein (the longest isoform).

Thus, and because STRING aggregates data and predictions stemming from a wide spectrum of cell types and environmental conditions, it aims to represent the union of all possible protein-protein links.

From this union, the actual network for any given spatio-temporal snapshot of the cell can in principle be deduced by projection, for example by removing proteins known to be not expressed or not active under the conditions studied.
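The projection described above can be sketched in a few lines of Python. This is a minimal illustration only: the edge list, the protein identifiers, and the set of expressed proteins are hypothetical, and the function name project_network is not part of STRING.

```python
def project_network(edges, expressed):
    """Keep only edges whose two endpoints are both in `expressed`."""
    return [(a, b, score) for a, b, score in edges
            if a in expressed and b in expressed]

# Hypothetical union network: (protein_a, protein_b, confidence score).
union_edges = [
    ("P1", "P2", 0.9),
    ("P2", "P3", 0.7),
    ("P3", "P4", 0.8),
]

# Hypothetical set of proteins known to be expressed under the studied condition.
expressed_proteins = {"P1", "P2", "P3"}

# P4 is not expressed, so the P3-P4 link is removed from the snapshot.
snapshot = project_network(union_edges, expressed_proteins)
```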

In keeping with the above definitions, STRING imports protein association knowledge Not only from databases of physical interactions, but also from databases of curated 'biological pathway' knowledge.

Apart from the resources already included in the previous release(s) of STRING [MINT, HPRD, BIND, DIP, BioGRID, KEGG and Reactome (see G6G Abstract Number 20267)], a number of resources have been newly included [IntAct, EcoCyc (see G6G Abstract Number 20231), NCI-Nature Pathway Interaction Database (see G6G Abstract Number 20245) and Gene Ontology (GO) protein complexes].

For the full STRING release, this set of previously known and well-described interactions is then complemented by interactions that are predicted computationally, specifically for STRING, using a number of prediction algorithms.

First, the manufacturer conducts systematic searches for genes that are found in close proximity within prokaryotic chromosomes, which is a good indicator for functional linkage.

Second, the manufacturer searches for instances where genes have joined to encode a single fusion protein, which is indicative of functional linkage even in organisms where the two proteins have Not fused.

Third, the manufacturer searches for gene families that share above-random similarities in their evolutionary histories (i.e. they have similar ‘phylogenetic profiles’). This, again, predicts that they contribute to similar functional processes in the cell.
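As a rough illustration of comparing phylogenetic profiles, each gene family can be represented by the set of organisms in which it occurs, and two profiles can be compared by their overlap. The abstract does not state which similarity measure STRING uses; the Jaccard index below is a stand-in, and the family and organism names are hypothetical.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity between two sets of organisms in which a gene family occurs."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)

# Hypothetical presence profiles: gene family -> organisms containing it.
profiles = {
    "famA": {"org1", "org2", "org3", "org5"},
    "famB": {"org1", "org2", "org3"},
    "famC": {"org4"},
}

sim_ab = profile_similarity(profiles["famA"], profiles["famB"])  # 3 shared of 4 total
sim_ac = profile_similarity(profiles["famA"], profiles["famC"])  # no organisms shared
```

A high value for a pair of families suggests, but does not prove, that they were gained and lost together during evolution.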

Fourth, the manufacturer conducts searches for genes that display a similar transcriptional response across a variety of conditions (co-expression).
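One common way to quantify a "similar transcriptional response" is the Pearson correlation between two expression profiles measured over the same conditions. The sketch below assumes that measure for illustration; the abstract does not say which statistic STRING applies, and the expression values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)

# Hypothetical expression levels across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```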

Individually, the above predictors may Not always have the specificity of direct experimental interaction assays; however, when used in concert and integrated probabilistically, the performance even of relatively weak predictors can rival that of experimental data.
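To see why several weak predictors can add up to a strong one, consider a simple probabilistic combination that treats the evidence channels as independent: the combined confidence is one minus the probability that every channel is wrong. This is a sketch under that independence assumption, not the exact scoring scheme used by STRING, which the abstract does not spell out; the channel names and scores are hypothetical.

```python
def combine_scores(scores):
    """Combine per-channel confidences, assuming the channels are independent."""
    prob_all_wrong = 1.0
    for s in scores:
        prob_all_wrong *= 1.0 - s
    return 1.0 - prob_all_wrong

# Hypothetical confidences for one protein pair from three weak predictors.
channel_scores = {"neighborhood": 0.4, "fusion": 0.5, "coexpression": 0.3}

# 1 - (0.6 * 0.5 * 0.7) = 0.79, higher than any single channel.
combined = combine_scores(channel_scores.values())
```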

Lastly, two further sources of interactions in STRING actually provide the majority of associations: text-mining and interaction transfer between organisms.

For the former, the manufacturer parses a large body of scientific texts [SGD, OMIM, The Interactive Fly, and all abstracts from PubMed].

The manufacturer searches for statistically relevant co-occurrences of gene names and also extracts a subset of semantically specified interactions using Natural Language Processing (NLP).
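The co-occurrence part of this step can be illustrated by counting how often two gene names are mentioned in the same abstract. The sketch below only produces raw counts; judging whether a count is statistically relevant would additionally require comparing it with the count expected by chance, which is omitted here. The gene names and abstracts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstracts, gene_names):
    """Count, per gene pair, the number of abstracts that mention both genes."""
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        mentioned = sorted(tokens & gene_names)  # sorted so each pair has one key
        counts.update(combinations(mentioned, 2))
    return counts

# Hypothetical gene dictionary and abstracts.
genes = {"GENEA", "GENEB", "GENEC"}
abstracts = [
    "GENEA binds GENEB in yeast.",
    "GENEA and GENEB are co-regulated.",
    "GENEC is unrelated to GENEA.",
]

counts = cooccurrence_counts(abstracts, genes)
```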

For the transfer of interactions between organisms, the manufacturer estimates whether a pair of interacting proteins found conserved in another organism justifies the transfer of the interaction to that other organism.

The transferred interactions, as well as all predicted or imported interactions, are benchmarked and scored against a common reference of functional partnership [the manufacturer currently uses the joint membership of proteins in biological pathways, as annotated at KEGG, as its gold standard].

Together, the above sources of interactions, including predictions and transfers, result in a uniquely high coverage of the interaction networks stored in STRING, particularly for well-studied model organisms.

Since the previous release, STRING has almost doubled the number of supported organisms, which now stands at 630. The number of stored interactions has increased as well, to a total of more than 50 million.

Since the various subtypes of the interaction evidence are stored separately in the database, they can be disabled at will -- giving users the ability to adjust the scope and specificity of STRING towards their particular application.
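Because each evidence subtype is stored separately, a user-side filter can recompute a confidence from the enabled subtypes only and then apply a cutoff. The sketch below assumes per-channel scores are combined as if independent, which is an assumption made for illustration; the channel names, scores, and cutoff are hypothetical.

```python
def combined_score(channel_scores, enabled):
    """Combine the scores of enabled channels only, assuming independence."""
    prob_all_wrong = 1.0
    for channel, score in channel_scores.items():
        if channel in enabled:
            prob_all_wrong *= 1.0 - score
    return 1.0 - prob_all_wrong

# Hypothetical per-channel scores for two protein pairs.
edges = {
    ("P1", "P2"): {"neighborhood": 0.2, "coexpression": 0.5, "textmining": 0.6},
    ("P2", "P3"): {"textmining": 0.7},
}

enabled = {"neighborhood", "coexpression"}  # text-mining switched off
cutoff = 0.4

# Keep only pairs that still reach the cutoff without the disabled channel.
kept = {}
for pair, scores in edges.items():
    score = combined_score(scores, enabled)
    if score >= cutoff:
        kept[pair] = score
```

With text-mining disabled, the second pair has no remaining evidence and drops out, which narrows the network towards the evidence types the user trusts for a given application.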

Integration of Protein Structures -- For each update, STRING now parses all entries of the PDB database of protein structures. The use of protein structures is two-fold: first, to inform the user that a given protein - or a close homolog thereof - indeed has 3D structure information

New Programming Interface -- To facilitate the integration of STRING into network tools like Cytoscape (see G6G Abstract Number 20092) and workflow engines like Taverna, the manufacturer has created an application programming interface (API) that allows access to the interaction network in computer-readable formats.

Additionally, specific API functions allow retrieval of individual records from the database, for example to map a protein via its name onto a STRING entry. The manufacturer further envisions that the STRING API will be useful to developers of web services, who plan to make use of the STRING interaction network.
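A name-to-entry lookup through the API might look like the following sketch. The URL pattern, the method name get_string_ids, and the parameters identifiers and species follow the current public STRING API as best recalled; they are assumptions here and may differ from the interface of the release described in this abstract. The code only builds the request URL and parses a tab-separated reply; it does not contact the server, and the sample reply is a made-up placeholder, not real STRING output.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed base URL of the STRING API (tab-separated output format).
API_BASE = "https://string-db.org/api/tsv"

def build_lookup_url(name, species):
    """Build a URL asking the API to map a protein name onto STRING entries."""
    query = urlencode({"identifiers": name, "species": species})
    return f"{API_BASE}/get_string_ids?{query}"

def parse_tsv(text):
    """Parse a tab-separated reply with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

url = build_lookup_url("p53", 9606)  # 9606 is the NCBI taxonomy ID for human

# Placeholder reply for illustration only; the column names are assumptions.
sample_reply = "stringId\tpreferredName\n0000.PLACEHOLDER1\tEXAMPLE1\n"
records = parse_tsv(sample_reply)
```

Fetching the URL with any HTTP client and passing the body to parse_tsv would complete the round trip; the column names of a real reply should be checked against the API documentation.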

Use Scenarios -- Apart from the ad hoc and barrier-free access through the website, STRING can be downloaded and used locally, either in the form of concise flat-files or as a mirror installation of the complete relational database back-end (some of the downloads do require a free, non-redistribution license applicable to academic nonprofit users).

The interacting entities in STRING can be set to be either proteins, or groups of orthologs spanning multiple organisms (‘COG-mode’). For the latter, STRING relies on an updated and extended version of the COGs (‘Clusters of Orthologous Groups’), which is being maintained at the eggNOG database.

A variety of other databases use STRING networks as a basis for further computations/annotations, for example by augmenting the networks with small molecules (STITCH), or by using the network to increase the power of kinase-substrate predictions (NetworKIN).

STRING has also been integrated into third-party tools such as NeAT (Network Analysis Tools), which provides various ways to analyze the interaction network, or Gaggle (see G6G Abstract Number 20222), which enables automated data transfer into other tools via a browser add-on.

Medusa -- Medusa (see G6G Abstract Number 20299) is a front end (interface) to the STRING protein interaction database. It is also a general graph visualization tool.

System Requirements

Web-based.

Manufacturer

1) European Molecular Biology Laboratory (EMBL)
2) Swiss Institute of Bioinformatics
3) University of Zurich

Manufacturer Web Site STRING

Price Contact manufacturer.

G6G Abstract Number 20298

G6G Manufacturer Number 100869

The G6G Directory of Omics and Intelligent Software
