Supervised Inference of Regulatory Networks (SIRENE)

Category Cross-Omics>Pathway Analysis/Gene Regulatory Networks/Tools and Genomics>Gene Expression Analysis/Profiling/Tools

Abstract SIRENE (Supervised Inference of REgulatory NEtworks) is a new method to infer gene regulatory networks on a genome scale from a compendium of gene expression data.

The method decomposes the problem of gene regulatory network (GRN) inference into a large number of local binary classification problems that focus on separating target genes from non-targets for each transcription factor. SIRENE is thus conceptually simple and computationally efficient.

The manufacturer tested it on a benchmark experiment aimed at predicting regulations in Escherichia coli, where it retrieved on the order of six times more known regulations than other state-of-the-art inference methods (see the benchmark results below).

SIRENE differs fundamentally from other approaches --

SIRENE differs fundamentally from other approaches in that it requires as inputs not only gene expression data, but also a list of known regulatory relationships between transcription factors (TFs) and target genes.

In machine-learning terminology, the method is supervised in the sense that it uses a partial knowledge of the information one wants to predict in order to guide the ‘inference engine’ for the prediction of new information.

The need to input some known regulations is not a serious restriction in many applications, as many regulations have already been characterized in model organisms and can be inferred by homology in newly sequenced genomes.

Known regulations allow the manufacturer to use a natural induction principle to predict new regulations: if a gene A has an expression profile similar to that of a gene B known to be regulated by a given TF, then gene A is also likely to be regulated by that TF.

The fact that genes with similar expression profiles are likely to be co-regulated has been used for a long time in the construction of groups of genes by unsupervised clustering of expression profiles.

The novelty in the manufacturer’s approach is to use this principle in a supervised classification paradigm.

This inference paradigm has the advantage that no particular hypothesis is made regarding the relationship between the expression data of a TF and those of the genes it regulates.

In fact, expression data for the TF are not even needed in the manufacturer’s approach.

Support vector machine (SVM) algorithm is used --

Many algorithms for supervised classification can be used to transform this inference principle into a working algorithm. In their experiments, the manufacturer used a support vector machine (SVM) algorithm, a state-of-the-art method for supervised classification.

The idea of casting the inference of gene or protein networks as a supervised classification problem, using known interactions as inputs, has previously been proposed and investigated for the reconstruction of protein-protein interaction (PPI) and metabolic networks.

A simple method has been proposed, where a local model is estimated to predict the interacting partners of each protein in the network, and all local models are then combined together to predict edges throughout the network.

It has been shown that this method gives an important improvement in accuracy over more elaborate methods on both PPI and metabolic networks.

The manufacturer adapted this strategy for the reconstruction of gene regulatory networks. For each TF, they estimate a local model that discriminates, based on expression profiles, the genes regulated by the TF from all other genes.

All local models are then combined to rank candidate regulatory relationships between TFs and all genes in the genome.
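The per-TF decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the manufacturer's implementation: a nearest-centroid linear scorer stands in for the SVM so that the example needs only the standard library, and the gene names, TF name, expression values, and function names (train_local_model, rank_candidates) are all hypothetical.

```python
from statistics import fmean


def centroid(profiles):
    """Component-wise mean of a list of equal-length expression profiles."""
    return [fmean(values) for values in zip(*profiles)]


def train_local_model(expression, targets):
    """Fit one binary model for a single TF: known targets vs. all other genes.

    Stand-in for the SVM: a linear scorer along the direction that separates
    the centroid of known targets from the centroid of the remaining genes.
    """
    pos = [p for g, p in expression.items() if g in targets]
    neg = [p for g, p in expression.items() if g not in targets]
    mu_pos, mu_neg = centroid(pos), centroid(neg)
    weights = [a - b for a, b in zip(mu_pos, mu_neg)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def score(model, profile):
    """Positive scores mean the profile looks more like the known targets."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, profile)) + bias


def rank_candidates(expression, known_regulations):
    """Train one local model per TF, then pool all (TF, gene) scores into one ranking."""
    ranked = []
    for tf, targets in known_regulations.items():
        known = targets & expression.keys()
        if not known or len(known) == len(expression):
            continue  # no positives or no negatives: no local model can be trained
        model = train_local_model(expression, known)
        for gene, profile in expression.items():
            if gene not in known:
                ranked.append((score(model, profile), tf, gene))
    ranked.sort(reverse=True)
    return ranked


# Hypothetical toy compendium: gene -> expression profile across three conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [3.0, 2.0, 1.0],
    "geneD": [2.9, 1.9, 1.2],
    "geneE": [1.2, 1.9, 3.1],
}
known_regulations = {"TF1": {"geneA", "geneB"}, "TF2": set()}

ranked = rank_candidates(expression, known_regulations)
for value, tf, gene in ranked:
    print(f"{tf} -> {gene}: {value:+.3f}")
```

In this toy data, geneE tracks the known targets geneA and geneB, so it ranks first as a candidate target of TF1. TF2 has no known targets and is skipped, since no local model can be trained for it. Note that the expression profile of the TF itself is never used.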

SIRENE is conceptually simple, easy to implement and computationally scalable to whole genomes because each local model only involves the training of a supervised classification algorithm on a few hundred or thousands of examples.

SIRENE tested on benchmark experiment --

The manufacturer tested SIRENE on the benchmark experiment proposed by (Faith JJ, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles; PLoS Biol 2007;5:8), which aims at reconstructing known regulations within Escherichia coli genes from a compendium of gene expression data.

On this benchmark experiment, SIRENE strongly outperforms the best results reported by Faith et al. (2007), which were obtained with the CLR algorithm.

For example, at a precision of 60% (the fraction of predicted regulations that are known to be true), CLR identifies 7.5% of all known regulatory relationships (recall), while SIRENE has a recall of 44.5% at the same precision level using expression profiles.
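To make the precision and recall figures concrete, the sketch below computes both along a ranked list of predictions and reads off the best recall achievable at a required precision. The ranked labels are made-up toy values and the function names are hypothetical; only the 7.5% and 44.5% recall figures come from the text above.

```python
def precision_recall_curve(ranked_labels, total_positives):
    """Return (precision, recall) after each prediction in a ranked list.

    ranked_labels[k] is True when the (k+1)-th ranked prediction is a known regulation.
    """
    curve = []
    true_positives = 0
    for k, is_true in enumerate(ranked_labels, start=1):
        true_positives += bool(is_true)
        curve.append((true_positives / k, true_positives / total_positives))
    return curve


def recall_at_precision(ranked_labels, total_positives, min_precision):
    """Largest recall reached at any cutoff whose precision is at least min_precision."""
    curve = precision_recall_curve(ranked_labels, total_positives)
    eligible = [recall for precision, recall in curve if precision >= min_precision]
    return max(eligible, default=0.0)


# Toy ranking: 8 predictions, 5 known regulations in total (one is never retrieved).
ranked_labels = [True, True, False, True, False, False, True, False]
toy_recall = recall_at_precision(ranked_labels, total_positives=5, min_precision=0.6)
print(f"toy recall at 60% precision: {toy_recall:.2f}")

# Fold improvement implied by the recall figures quoted in the text.
fold = 44.5 / 7.5
print(f"SIRENE vs. CLR recall at 60% precision: {fold:.1f}x")
```

The ratio 44.5 / 7.5 is about 5.9, which is the "on the order of six times" improvement quoted in the abstract.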

SIRENE is easy to implement --

SIRENE is easy to implement and scales well to large-scale inference. Indeed, the main idea behind SIRENE is to decompose the network inference into a set of local binary classification problems, aimed at discriminating targets from non-targets of each TF.

Although the manufacturer used an SVM as a basic algorithm to solve these local problems, any algorithm for ‘pattern recognition’ may be used instead.

Each local problem involves a training set of at most a few thousand genes, which is easily manageable by most machine-learning algorithms.

This strategy also paves the way to the use of other genomic data to predict regulation.

Indeed, local models for gene classification often improve in performance when several types of data, such as phylogenetic or subcellular localization information, are available, and SVMs provide a convenient framework for performing this data integration in practice.

Another interesting feature of SIRENE is its ability to predict self-regulation, which other methods have generally had difficulty handling.
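One common way to integrate several data types in a kernel method such as an SVM is to compute one kernel (similarity) matrix per data source and add them together. The source does not state which integration scheme is meant, so the sketch below is an assumption for illustration only; the gene names, feature values, and function names are hypothetical.

```python
def linear_kernel(features):
    """Pairwise dot products between the feature vectors of all genes."""
    genes = sorted(features)
    return {
        (a, b): sum(x * y for x, y in zip(features[a], features[b]))
        for a in genes
        for b in genes
    }


def combine_kernels(*kernels):
    """Sum several kernels defined over the same gene pairs.

    A sum of valid kernels is itself a valid kernel, so the result can be
    passed to any kernel-based classifier in place of a single-source kernel.
    """
    keys = kernels[0].keys()
    return {key: sum(kernel[key] for kernel in kernels) for key in keys}


# Hypothetical inputs: expression profiles and binary phylogenetic profiles.
expression = {"g1": [1.0, 0.0], "g2": [0.0, 2.0]}
phylogeny = {"g1": [1, 1, 0], "g2": [1, 0, 0]}

combined = combine_kernels(linear_kernel(expression), linear_kernel(phylogeny))
print(combined[("g1", "g2")])
```

Here g1 and g2 have no similarity in expression but share one phylogenetic feature, so the combined kernel still records a nonzero similarity between them.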

SIRENE limitation(s) --

An important limitation of SIRENE is its inability to predict targets of TFs with no a priori known target. More generally, the performance of SIRENE tends to decrease when few targets are known.

Thus, for example, it cannot be used to discover new transcription factors. An interesting direction of future research is therefore to extend the predictions to TFs with no known target.

One possible direction is to combine the supervised approach with unsupervised approaches in some meaningful way.

Finally, the manufacturer notes that the evaluation criterion used in the benchmark experiment described above, the global precision/recall curve, is more relevant than the Receiver Operating Characteristic (ROC) curve but still leaves room for improvement.

Because this criterion is biased towards TFs for which many regulations are known (the hubs), the method is less likely to propose reliable new interactions for TFs with few known neighbors.

This is also a significant disadvantage in applications where one is interested in ‘orphan’ TFs.

System Requirements

Contact manufacturer.

Manufacturer

Manufacturer Web Site Supervised Inference of Regulatory Networks (SIRENE)

Price Contact manufacturer.

G6G Abstract Number 20728

G6G Manufacturer Number 104298