The PathOlogist

Category Cross-Omics>Pathway Analysis/Gene Regulatory Networks/Tools and Genomics>Gene Expression Analysis/Profiling/Tools

Abstract The PathOlogist is an automated tool for pathway-centric analysis. This tool is designed to transform large sets of gene expression data into quantitative descriptors of pathway-level behavior.

The tool aims to provide a robust alternative to the search for single-gene-to-phenotype associations by accounting for the complexity of molecular interactions. The PathOlogist provides a straightforward means to identify the functional processes, rather than individual molecules, that are altered in disease.

The PathOlogist is designed to automatically analyze genetic data within the context of molecular pathways. The tool aims to facilitate both a quantitative and qualitative analysis of pathway behavior in a format accessible to both laboratory researchers and informatics analysts.

The PathOlogist uses RNA expression data to calculate two (2) descriptive metrics, ‘activity’ and ‘consistency’, for each pathway in a set of more than 500 canonical pathways on a sample-by-sample basis.

These two (2) metrics have been shown to be more effective than individual gene expression at distinguishing samples of different tumor grades and predicting disease outcome in cancer samples. The metrics make use of the structure of gene relationships within the pathway rather than treating the genes as a uniform set of entities.

A pathway is defined as a network of molecular interactions; each interaction consists of one or more input genes (acting as promoters or inhibitors) and one or more output genes.

An activity score and a consistency score are calculated for each interaction based on the expression of all input and output genes. Activity scores provide a measure of how likely the interactions are to occur, while consistency scores determine whether these interactions follow the logic of the defined network structure.

Depending on the nature of the samples, these scores can reveal various types of information. For example, one may compare activity scores calculated from expression data collected at different time points to identify functional processes that have been activated or de-activated over time.

Comparing consistency scores calculated from sets of tumor and matched normal samples can reveal pathways whose ordinary behavior has been altered by disease.

The PathOlogist facilitates such analyses through a number of features:

1) A clustered heatmap of pathway scores can be generated to provide an overview of the metrics and quickly identify any inherent groupings of samples or sets of pathways that act in concert.

2) The network structure of a pathway and metrics for individual interactions can be viewed as a color-coded graphic, which proves useful for direct comparison of samples and identification of specific areas within the pathway that deviate from normal behavior.

3) The tool also provides an interface for conducting a number of statistical tests to detect associations between pathway scores and additional sample information (for example, disease grade or response to treatment).

Implementation -- The PathOlogist is a MATLAB-based application, which can be run as a GUI in the MATLAB environment or as a standalone executable (with slightly more limited functionality). The main objective of the PathOlogist is to transform standard gene or molecule-based data into meaningful, quantifiable information at the pathway level.

Input - The PathOlogist is designed to analyze normalized abundance data from any gene-based microarray platform; however, special features are included to accept Affymetrix data in its raw state as well.

The user may upload a set of .cel files reporting probe-level hybridization readings in an arrangement specific to the microarray chip used in the experiment. These .cel files as well as a chip-specific mapping file are the sole input to the PathOlogist necessary to carry out the process of pathway analysis.

Once loaded, raw data can be summarized into probe-sets and normalized using the robust multi-chip average (RMA) method.

Up-Down Normalization - Once the data is defined at the probe-set level, a unique algorithm is applied which calculates the probability that each sample is in an ‘up’ (highly expressed) or ‘down’ (minimally expressed) state, by fitting the set of intensity readings for a probe-set to a mixture of two (2) gamma distributions.
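The second half of this step, turning a fitted mixture into an ‘up’ probability, can be sketched as follows. This is only an illustration: it assumes the mixture has already been fitted (the fitting procedure itself is not described here), that the probability is the posterior of the ‘up’ component, and all names and parameter values are hypothetical.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution with the given shape and scale at x."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def prob_up(x, down, up, weight_up):
    """Posterior probability that intensity x came from the 'up' component.

    `down` and `up` are (shape, scale) pairs; `weight_up` is the mixing
    weight of the 'up' component. All are assumed to come from a prior fit.
    """
    f_up = weight_up * gamma_pdf(x, *up)
    f_down = (1 - weight_up) * gamma_pdf(x, *down)
    total = f_up + f_down
    return f_up / total if total > 0 else 0.5

# Hypothetical fitted parameters for one probe-set (not taken from the tool).
down_params = (2.0, 50.0)    # low-intensity component, mean 100
up_params = (20.0, 50.0)     # high-intensity component, mean 1000

for intensity in (80.0, 400.0, 1200.0):
    print(intensity, round(prob_up(intensity, down_params, up_params, 0.5), 3))
```

Each probe-set is then represented by one value between 0 and 1 per sample instead of a raw intensity.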

Pathway Metrics Calculations - The PathOlogist uses normalized expression data to calculate the two (2) descriptive metrics described above for each pathway selected. For this purpose, a pathway is defined as a connected set of interactions, each consisting of one or more input molecules and one or more output molecules.

Source of pathway data - Currently, the PathOlogist uses the PID (Pathway Interaction Database) as the source of pathway structure data.

This database is a collection of over 500 canonical pathways, including pathways curated by Nature Publishing Group editors and pathways imported from BioCarta and KEGG.

The network structure for each pathway is contained within the tool, and can be updated as new pathways are added to the database.

Mapping probes to genes - Probe-level intensity values are mapped to molecules within a pathway using a platform-specific text file that lists the Entrez gene ID associated with each probe.

This data is contained within the tool for a number of commonly used platforms. An option also exists allowing the addition of new user-created mapping files, extending the tool’s capabilities to virtually any platform.
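A user-created mapping file might be consumed along the following lines. The layout of the tool's mapping files is not specified here, so this sketch assumes one tab-separated "probe ID, Entrez gene ID" pair per line and assumes that several probes for the same gene are averaged; the function names and IDs are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def load_probe_map(lines):
    """Parse 'probe_id<TAB>entrez_id' lines; skip blank and unmapped entries."""
    mapping = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) == 2 and parts[1]:
            mapping[parts[0]] = parts[1]
    return mapping

def probes_to_genes(probe_values, mapping):
    """Average the values of all probes that map to the same Entrez gene ID."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = mapping.get(probe)
        if gene is not None:
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}

# Made-up probe and gene IDs, for illustration only.
example_map = load_probe_map(["p1\t10", "p2\t10", "p3\t20", "p4\t", ""])
example_values = {"p1": 0.2, "p2": 0.6, "p3": 0.9, "p4": 0.5}
print(probes_to_genes(example_values, example_map))
```

Probes without a gene ID (here `p4`) are simply dropped, so they do not contribute to any pathway.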

Calculations - Users can select any subset of samples and pathways to include in metrics calculations. ‘Activity’ and ‘consistency’ metrics are calculated for each pathway selected, based on the normalized expression of input and output elements for each sample.

Calculations are first performed at the interaction level and then averaged over all the interactions in a pathway to generate a final pathway score. At the end of the process, each sample will have two (2) scores describing the behavior of each pathway. These scores can then replace individual gene expression values in any desired informatics or statistical analysis.
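The two-stage calculation (interaction level first, then a pathway average) can be sketched as below. The scoring formulas are not given in this description, so those used here are assumptions chosen for illustration: activity is taken as the joint probability that every promoter is ‘up’ and every inhibitor is ‘down’, and consistency as the probability that the output state agrees with that activity. All names are hypothetical.

```python
from math import prod
from statistics import mean

def interaction_activity(promoters, inhibitors):
    """Assumed activity: all promoters 'up' and all inhibitors 'down'."""
    return prod(promoters) * prod(1 - p for p in inhibitors)

def interaction_consistency(activity, outputs):
    """Assumed consistency: outputs are 'up' when the interaction is active
    and 'down' when it is not, averaged over the output genes."""
    return mean(activity * p + (1 - activity) * (1 - p) for p in outputs)

def pathway_scores(interactions):
    """Average interaction-level scores into one (activity, consistency) pair."""
    activities, consistencies = [], []
    for item in interactions:
        act = interaction_activity(item["promoters"], item["inhibitors"])
        activities.append(act)
        consistencies.append(interaction_consistency(act, item["outputs"]))
    return mean(activities), mean(consistencies)

# 'Up' probabilities for one sample in a made-up two-interaction pathway.
example_pathway = [
    {"promoters": [0.9, 0.8], "inhibitors": [0.1], "outputs": [0.7]},
    {"promoters": [0.2], "inhibitors": [], "outputs": [0.1, 0.3]},
]
activity, consistency = pathway_scores(example_pathway)
print(round(activity, 4), round(consistency, 4))
```

Running this for every sample and pathway yields the two score matrices that the later visualization and statistics steps operate on.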

Metrics Visualizations -- After pathway metrics have been calculated, the PathOlogist facilitates a detailed investigation of the results.

Heatmap - A heatmap feature displays activity and/or consistency scores as a bi-dimensionally clustered heatmap. This can be used as a summary view, to quickly identify subgroups within the data. Specific subsets of pathways and samples can be selected for a more directed view as well.

Network Graphic - For a specific pathway of interest, a pathway-drawing feature generates a directed network graphic for each of the samples selected, detailing the structure and behavior of the pathway. In the graphic, metrics for individual interactions are displayed visually using node color and size. If desired, the drawing may overlay gene-specific data such as copy number alterations or methylation status.

An option also exists to extend the network to include all interactions from other pathways that involve genes within the pathway of interest. Graphics for multiple samples can be compared to identify specific points of differentiation within the pathway.

Clicking on any gene within the pathway will link to more detailed information, courtesy of the CGAP Gene database. The network structure and individual interaction metrics can also be generated in text format.

Identifying Important Pathways - The PathOlogist performs statistical analyses to determine the relationship between pathway behavior and sample features such as class, survival, etc. Sample data is entered through a simple copy-paste procedure or by uploading a two-column text file.

Four (4) types of analysis are possible:

1) Binary classification - finds pathways whose scores can be used to differentiate two classes of samples (e.g. cancer vs. normal). For each pathway, a two-sample rank-sum test is performed to evaluate the null hypothesis that the pathway scores of class A and class B are samples from continuous distributions with equal medians.

Significant pathways are those for which the null hypothesis is highly unlikely, indicating that the pathway behaves differently in these two groups of samples.
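For a single pathway, a rank-sum comparison of two classes can be sketched as follows. This is a generic illustration rather than the tool's own routine: it uses the large-sample normal approximation with average ranks for ties and no tie or continuity correction, so p-values for very small groups are only approximate. Function names are hypothetical.

```python
import math

def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def rank_sum_test(class_a, class_b):
    """Return (z, two-sided p) for the rank sum of class_a (normal approx.)."""
    n1, n2 = len(class_a), len(class_b)
    ranks = midranks(list(class_a) + list(class_b))
    w = sum(ranks[:n1])
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return z, math.erfc(abs(z) / math.sqrt(2))

# Made-up pathway scores for two classes of samples.
normal_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
tumor_scores = [0.6, 0.7, 0.8, 0.9, 1.0]
z, p = rank_sum_test(normal_scores, tumor_scores)
print(round(z, 3), round(p, 4))
```

Repeating the test for every pathway and sorting by p-value gives a ranked list of candidate pathways.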

2) Linear correlation - finds pathways whose scores correlate well with a continuous variable (e.g. response to treatment, measured as concentration of drug required to initiate cell death). For each pathway, the Pearson’s correlation coefficient is calculated for the linear relationship between the set of pathway scores and the set of sample data.

A p-value is then calculated for each pathway, using a Student’s t-distribution to evaluate the null hypothesis that the correlation coefficient is zero. Significant pathways are those which show either highly positive or highly negative correlation with the associated variable.
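The correlation step for one pathway can be sketched as below. The code computes the Pearson coefficient r and the statistic t = r * sqrt((n - 2) / (1 - r^2)), which is compared against a Student's t-distribution with n - 2 degrees of freedom; the final p-value lookup is left to a statistics library. Names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_statistic(r, n):
    """t statistic for H0: correlation is zero (undefined when |r| == 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Made-up pathway scores and a continuous sample variable.
pathway_scores_example = [1.0, 2.0, 3.0, 4.0, 5.0]
drug_response_example = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(pathway_scores_example, drug_response_example)
t = correlation_t_statistic(r, len(pathway_scores_example))
print(round(r, 4), round(t, 4))
```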

3) Survival - finds pathways that influence sample survival. The set of scores for each pathway is partitioned into two (2) groups using k-means clustering, which minimizes the squared Euclidean distance between each score and the centroid of its group. (A minimum group size can be set by the user.)

Cumulative survival distributions are calculated separately for these two groups of samples using the Kaplan-Meier algorithm, and a log rank test is performed to evaluate the null hypothesis that the two (2) sample groups are drawn from the same population.

Significant pathways are those for which pathway behavior can be used as a marker dividing samples into groups with highly differentiated survival curves.
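The survival analysis for one pathway can be sketched as below, under stated assumptions. Because the scores are one-dimensional, the two-group split is found by an exhaustive search over cut points rather than iterative k-means (both target the same within-group squared-distance objective). The Kaplan-Meier and log-rank functions are textbook versions, not the tool's code, and all names and data are hypothetical.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def split_two_groups(scores, min_size=1):
    """Label each score 0 (low) or 1 (high); min_size must be at least 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    best_cut, best_cost = None, float("inf")
    for cut in range(min_size, len(order) - min_size + 1):
        cost = (sse([scores[i] for i in order[:cut]])
                + sse([scores[i] for i in order[cut:]]))
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    if best_cut is None:
        raise ValueError("too few samples for the requested minimum group size")
    labels = [0] * len(scores)
    for i in order[best_cut:]:
        labels[i] = 1
    return labels

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time."""
    survival, curve = 1.0, []
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

def log_rank_test(times, events, labels):
    """Two-group log-rank test; events: 1 = death, 0 = censored."""
    observed = expected = variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        risk_labels = [g for ti, g in zip(times, labels) if ti >= t]
        n, n1 = len(risk_labels), sum(risk_labels)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        observed += sum(1 for ti, e, g in zip(times, events, labels)
                        if ti == t and e and g == 1)
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed - expected) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 d.o.f.

# Made-up scores, survival times and event flags for six samples.
scores = [0.1, 0.2, 0.15, 0.9, 0.8, 0.85]
times = [2, 3, 4, 10, 12, 14]
events = [1, 1, 1, 1, 1, 1]
groups = split_two_groups(scores)
chi2, p = log_rank_test(times, events, groups)
print(groups, round(chi2, 3), round(p, 4))
```

Calling `kaplan_meier` separately on the samples of each group gives the two curves that would be plotted for visual confirmation.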

4) Gene hits targeting - finds pathways whose molecules are the target of some alteration (e.g. copy number, mutations). Gene-specific alterations are uploaded as a matrix of logical values describing which of the assayed genes were altered in each sample.

For each pathway, a hypergeometric cumulative distribution function is computed for each sample to estimate the probability that genes within the pathway are altered more often than would be expected, given the overall distribution of gene alterations for that sample.

An overall p-value for the pathway is calculated by applying a Fisher’s Omnibus test to the set of probabilities across all samples. Pathways with a significant p-value are those which comprise a set of genes that are disproportionately altered in multiple samples (although the specific genes altered are Not necessarily the same in different samples).
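The two stages of this test can be sketched as follows, as an illustration rather than the tool's own code. It assumes the per-sample probability is the hypergeometric upper tail (at least k altered genes among the pathway's genes) and combines the per-sample values with Fisher's method, whose statistic -2 * sum(ln p) follows a chi-square distribution with 2m degrees of freedom for m samples. Names and counts are hypothetical.

```python
import math

def hypergeom_upper_tail(k, assayed, altered, pathway_size):
    """P(X >= k): chance of at least k altered genes among `pathway_size`
    genes drawn from `assayed` genes, of which `altered` are altered."""
    upper = min(altered, pathway_size)
    total = math.comb(assayed, pathway_size)
    hits = sum(math.comb(altered, i)
               * math.comb(assayed - altered, pathway_size - i)
               for i in range(k, upper + 1))
    return hits / total

def fisher_combined_p(p_values):
    """Fisher's method; closed-form chi-square tail for even d.o.f."""
    m = len(p_values)
    half = -sum(math.log(max(p, 1e-300)) for p in p_values)  # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Made-up counts: (altered genes in pathway, assayed genes,
# altered genes in the sample, pathway size) for three samples.
samples = [(2, 10, 3, 4), (3, 10, 3, 4), (1, 10, 2, 4)]
per_sample = [hypergeom_upper_tail(*s) for s in samples]
print([round(p, 4) for p in per_sample], round(fisher_combined_p(per_sample), 4))
```

A pathway can reach a small combined p-value even when a different subset of its genes is altered in each sample, which matches the behavior described above.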

Each test can be performed on all samples or specific classes of samples, and returns a list of all pathways ordered by significance, along with corresponding p-values. These results can be plotted for visual confirmation of association, and then written to text files.

System Requirements

The full version of the tool requires a copy of MATLAB as well as MATLAB’s Bioinformatics and Statistics Toolboxes. The tool is most stable when using MATLAB version 2009b and later.

The standalone version does Not require MATLAB but the RMA normalization and single-pathway graphics features are Not currently available.

Manufacturer

Manufacturer Web Site The PathOlogist

Price Contact manufacturer.

G6G Abstract Number 20757

G6G Manufacturer Number 104338