Multi-Experiment Matrix (MEM)

Category Genomics>Gene Expression Analysis/Profiling/Tools

Abstract MEM is a web-based ‘multi-experiment’ gene expression query and visualization tool. MEM can perform ‘gene expression’ similarity searches across many datasets.

MEM features large collections of microarray datasets and utilizes ‘rank aggregation’ to merge information from different datasets into a single global ordering with simultaneous statistical significance estimation.

Unique features of MEM include automatic detection, characterization and visualization of datasets that include the strongest ‘co-expression patterns’.

MEM detects co-expressed genes in large platform-specific microarray collections. Its Affymetrix microarray data originates from ArrayExpress (see G6G Abstract Number 20012) and also includes datasets submitted to GEO (see G6G Abstract Number 20013) and automatically uploaded to ArrayExpress.

MEM encompasses a variety of conditions, tissues and disease states and incorporates nearly a thousand datasets for both human and mouse, as well as hundreds of datasets for other model organisms.

MEM co-expression search requires two (2) types of input: first, the user types in a ‘gene ID’ of interest, and second, chooses a collection of relevant datasets. The user may pick the datasets manually by browsing multiple annotations, or allow MEM to make an automatic selection based on statistical criteria such as ‘gene variability’.

MEM performs the co-expression analysis individually for each dataset and assembles the final list of similar genes using a novel statistical rank ‘aggregation algorithm’.

Efficient programming guarantees rapid performance of the computationally intensive ‘real-time analysis’ that does not rely on precomputed or indexed data. The results are presented in a highly interactive graphical format with strong emphasis on further data mining.

Query results and datasets can be ordered by significance or clustered. The MEM visualization method helps highlight datasets with the highest co-expression with the input gene and helps the user distinguish evidence with poor or negative correlation.

Datasets are additionally characterized with the ‘automatic text analysis’ of experiment descriptions, and they are represented as ‘word clouds’ that highlight predominant terms.

With MEM, the manufacturers' aim is to make ‘multi-experiment co-expression’ analysis accessible to a wider community of researchers.

MEM web interface --

1) MEM Input - Primary input - The primary input of MEM is a single query gene that acts as the template pattern for the co-expression search.

The tool recognizes common gene identifiers and automatically retrieves the corresponding probe-sets. The conversion is based on g:Profiler (a web-based toolset for functional profiling of gene lists from large-scale experiments) and Ensembl ID mappings. When several probe-sets link to a gene, the user needs to choose one of the probe-sets for further analysis.

Second, the user needs to select the collection of datasets where similarities between ‘expression profiles’ are detected (the search space).

ArrayExpress datasets are organized into platform-specific collections and the user may choose to perform the search over all datasets of a specific platform. The search space may be further narrowed by browsing dataset annotations and composing a collection that covers a specific disease or tissue type.

2) Dataset selection - In multi-experiment co-expression analysis, some individual datasets may produce noisy or even entirely random results, caused either by poor data quality or by low expression levels of the query gene.

The manufacturers have included a standard deviation filter in the MEM interface that allows users to detect and disregard datasets where the variability of the query gene is low. Based on extensive simulations, the manufacturers conclude that a standard deviation of sigma = 0.29 is a reasonable threshold for distinguishing informative datasets.

The same threshold can be applied across the entire analysis because all related datasets are normalized and preprocessed using the same algorithm.
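
As an illustration only, the filter described above can be sketched in a few lines of Python. This is not MEM's implementation: the function name, the dictionary layout and the example values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption. Only the 0.29 threshold comes from the text.

```python
from statistics import pstdev

SIGMA_THRESHOLD = 0.29  # threshold quoted in the text above


def informative_datasets(query_profiles, threshold=SIGMA_THRESHOLD):
    """Return names of datasets where the query gene varies enough.

    query_profiles maps a dataset name to the query gene's expression
    values in that dataset (hypothetical structure for illustration).
    Population standard deviation is an assumption of this sketch.
    """
    return [
        name
        for name, values in query_profiles.items()
        if len(values) > 1 and pstdev(values) > threshold
    ]


# Made-up expression values, not real data.
profiles = {
    "dataset_A": [7.1, 7.9, 6.4, 8.2],  # clearly variable
    "dataset_B": [5.0, 5.1, 5.0, 4.9],  # nearly flat
}
print(informative_datasets(profiles))  # prints ['dataset_A']
```

A single shared threshold is only meaningful here because the values are assumed to be on one common scale, which is the point the text makes about uniform normalization.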

3) Search algorithm parameters - The first step of MEM multi-experiment co-expression analysis detects the most similar candidate genes for each individual dataset. The most important parameter for this stage is the ‘distance measure’ that defines the similarity between ‘expression profiles’ and has a significant impact on the contents and interpretation of results.

Pearson correlation is the default distance measure in MEM. It evaluates the dynamic similarity of expression profiles and has become a standard method of measuring co-expression.

Another useful measure is the ‘anti-correlation distance’ that detects inverse ‘expression patterns’, such as genes responding to repressor activity.
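
The two measures can be illustrated with a short Python sketch. The definitions used here (1 - r for the correlation distance and 1 + r for the anti-correlation distance, where r is the Pearson coefficient) are common conventions and an assumption of this example. The text does not state the exact formulas MEM uses, and the profiles below are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Small when the profiles rise and fall together (assumed form: 1 - r)."""
    return 1.0 - pearson(x, y)


def anti_correlation_distance(x, y):
    """Small when one profile rises as the other falls (assumed form: 1 + r)."""
    return 1.0 + pearson(x, y)


query = [1.0, 2.0, 3.0, 4.0]
activated = [2.0, 4.0, 6.0, 8.0]  # follows the query
repressed = [8.0, 6.0, 4.0, 2.0]  # mirrors the query

print(correlation_distance(query, activated))
print(anti_correlation_distance(query, repressed))
```

Under these definitions both printed distances are close to zero: the first pair is nearest under the correlation distance, the second under the anti-correlation distance.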

After detecting the most similar genes in individual datasets, the manufacturers apply a novel rank ‘aggregation algorithm’ that merges candidates of different datasets and creates the final list of co-expressed genes.

The rank aggregation algorithm assigns a p-value to each gene, in order to evaluate its similarity to the query gene across the given collection of datasets.

Statistically, the p-value reflects the likelihood of the gene appearing with certain observed ranks in the datasets if the similarity lists were shuffled randomly. Selecting the expression profiles with the most significant p-values accurately retrieves genes with ‘high expression similarity’ and functional relevance to the query gene.
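
The idea of scoring ranks against randomly shuffled lists can be sketched with order statistics. The Python example below is an assumption-laden illustration and should not be read as MEM's actual algorithm, whose details are not given here. It assumes that, under random shuffling, a gene's normalized rank in each dataset is uniform between 0 and 1, and it reports the smallest order-statistic probability as a score. Function names and rank values are hypothetical.

```python
from math import comb


def order_statistic_tail(x, k, n):
    """P(k-th smallest of n independent Uniform(0,1) values <= x)."""
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))


def aggregation_score(ranks, n_genes):
    """Smallest order-statistic probability over the sorted normalized ranks.

    ranks holds the gene's rank in each dataset (1 = most similar to the
    query); n_genes is the length of each similarity list.
    """
    normalized = sorted(rank / n_genes for rank in ranks)
    n = len(normalized)
    return min(
        order_statistic_tail(x, k, n)
        for k, x in enumerate(normalized, start=1)
    )


# Made-up ranks across five datasets with 1000 genes each.
consistent = aggregation_score([1, 3, 2, 900, 5], n_genes=1000)
scattered = aggregation_score([400, 650, 120, 900, 510], n_genes=1000)
print(consistent < scattered)  # prints True
```

Because the minimum is taken over several candidate cutoffs, the score in this sketch is not itself a calibrated p-value; a correction for that choice would be needed before interpreting it as one. Note that a single poor rank (900 in the first example) does not prevent a gene from scoring well.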

4) MEM Output - The principal output of MEM is a ranked list of genes that are co-expressed with the query gene in the provided datasets. For each resulting gene, MEM provides a p-value that reflects the significance of its expression similarity to the query gene across the collection of analyzed datasets.

A wealth of interesting information is presented in the graphical rank matrix. Each column of the matrix stands for a dataset, each row represents a gene, and each ‘matrix element’ reflects the individual similarity rank for the given gene in the given dataset.
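
The layout of the rank matrix can be made concrete with a small Python sketch. The data structures, function names and similarity values are hypothetical, and the example assumes that every dataset measures the same set of genes.

```python
def ranks_by_similarity(similarity):
    """Map each gene to its rank (1 = most similar) within one dataset."""
    ordered = sorted(similarity, key=similarity.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}


def rank_matrix(per_dataset_similarity):
    """Build the matrix: one row per gene, one column per dataset.

    per_dataset_similarity maps dataset name -> {gene: similarity to query}.
    Assumes every dataset contains the same genes.
    """
    datasets = list(per_dataset_similarity)
    per_dataset_ranks = [
        ranks_by_similarity(per_dataset_similarity[d]) for d in datasets
    ]
    genes = sorted(per_dataset_ranks[0])
    matrix = [[ranks[g] for ranks in per_dataset_ranks] for g in genes]
    return genes, datasets, matrix


# Made-up similarity scores to the query gene.
scores = {
    "ds1": {"geneA": 0.9, "geneB": 0.2, "geneC": 0.5},
    "ds2": {"geneA": 0.1, "geneB": 0.8, "geneC": 0.7},
}
genes, datasets, matrix = rank_matrix(scores)
for gene, row in zip(genes, matrix):
    print(gene, row)
# prints:
# geneA [1, 3]
# geneB [3, 1]
# geneC [2, 2]
```

Each row of this matrix is the list of per-dataset ranks that a rank aggregation step would take as input for one gene.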

Visual inspection of the ‘rank matrix’ allows the researcher to detect patterns of correlation across datasets and spot significantly stronger co-expression profiles.

The ‘rank aggregation’ algorithm provides a natural cutoff between informative and non-informative ranks for each gene. Colors and cell sizes are used to highlight datasets where the given gene was particularly similar to the query gene and hence contributed significantly to the final p-value.

Genes with the greatest similarity rankings are frequently in ‘strong correlation’ only within a relatively small fraction of datasets that are biologically relevant to gene function.

If the contributing datasets can be related in the context of ‘experimental design’, one may learn additional information about the query gene and its association to the resulting genes.

Columns of the ‘rank matrix’ are clustered hierarchically, so that datasets with similar ‘correlation patterns’ are grouped together using a tree visualization, and datasets with the most impact are aligned to the left.

While the default policy is to ‘filter datasets’ based on the standard deviation criterion, one may take advantage of the high contribution of a few datasets and manually remove experiments that have little impact on the final list of correlated genes. Single clicks on datasets or tree nodes toggle whether selected experiments or entire ‘experiment groups’ are included in downstream analysis.

A text mining technique called ‘word cloud’ gives a compact semantic overview of a selected group of datasets through the descriptions of experimental designs.

The word cloud detects keywords that are enriched in the ‘experimental descriptions’ of the group and it uses different font sizes to highlight terms with strong ‘statistical significance’. One may study the experiment descriptions of single datasets and ‘dataset clusters’ by moving the mouse over the dataset clustering tree.
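
One way to test whether a keyword is enriched in a group of dataset descriptions is a hypergeometric tail probability, sketched below in Python. The choice of test is an assumption of this example, since the text does not name the statistic MEM uses. The function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(total, with_term, selected, selected_with_term):
    """P(at least selected_with_term hits) under random selection.

    total:              number of dataset descriptions overall
    with_term:          descriptions overall that contain the keyword
    selected:           descriptions in the selected group
    selected_with_term: descriptions in the group that contain the keyword
    """
    upper = min(with_term, selected)
    hits = sum(
        comb(with_term, i) * comb(total - with_term, selected - i)
        for i in range(selected_with_term, upper + 1)
    )
    return hits / comb(total, selected)


# Made-up counts: a keyword appears in 10 of 100 descriptions overall,
# but in 4 of the 5 descriptions in the selected group.
p = enrichment_p_value(total=100, with_term=10, selected=5, selected_with_term=4)
print(p < 0.001)  # prints True
```

In a word cloud built this way, a smaller probability would translate into a larger font size for the keyword.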

Additional features of the tool reveal finer details of underlying data and create multiple pointers for further analysis. Besides co-expression associations in the rank matrix, MEM also displays standard Heat-maps with ‘expression profiles’ and experimental details of individual datasets.

The Heat-maps provide an easy visual validation of detected ‘co-expression patterns’. MEM includes filters that constrain the output to certain genes and allow the researcher to seek answers to interesting problems.

For instance, one may study the association of the query gene with a certain pathway or biological process by comparing its ‘expression pattern’ with those of the members of that pathway or process.

The URL Map feature provides easy access to external resources, as it automatically links resulting genes to multiple genomic databases. Co-expressed genes can be directed to the g:Profiler toolset for functional enrichment analysis of Gene Ontology (GO) terms, pathways and cis-regulatory motifs.

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site Multi-Experiment Matrix (MEM)

Price Contact manufacturer.

G6G Abstract Number 20554

G6G Manufacturer Number 104027