Platform for the Analysis, Translation, and Organization of Large Scale data (PLATO)

Category Genomics>Genetic Data Analysis/Tools

Abstract PLATO is a computational framework that analyzes single nucleotide polymorphisms (SNPs) and other independent variables using a variety of filters in an effort to identify a subset of interesting SNPs from a much larger set.

A filter in this case is defined as an analytical method or knowledge-based approach which mediates a reduction in the number of SNPs to a smaller subset.

PLATO allows the flexibility of applying filters in series, parallel, or individually and also allows the specification of filters for different disease models (additive, dominant, etc).
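The difference between running filters in series and in parallel can be sketched as follows. This is a minimal illustration, not PLATO's C++ API: a filter is modeled as a function from a set of SNP identifiers to a subset, and the names `apply_in_series`, `apply_in_parallel`, `FILTERS`, and the toy p-values and allele frequencies are all hypothetical.

```python
from typing import Callable, Dict, Set

# Hypothetical model: a filter maps a set of SNP ids to a (smaller) subset.
Filter = Callable[[Set[str]], Set[str]]


def apply_in_series(snps: Set[str], filters: Dict[str, Filter]) -> Set[str]:
    """Each filter only sees the SNPs that survived the previous filter."""
    remaining = set(snps)
    for snp_filter in filters.values():
        remaining = snp_filter(remaining)
    return remaining


def apply_in_parallel(snps: Set[str], filters: Dict[str, Filter]) -> Dict[str, Set[str]]:
    """Every filter sees the full input; results are kept separately per filter."""
    return {name: snp_filter(set(snps)) for name, snp_filter in filters.items()}


# Toy inputs (made-up values, for illustration only).
P_VALUES = {"rs1": 0.001, "rs2": 0.20, "rs3": 0.03, "rs4": 0.0004}
MAF = {"rs1": 0.25, "rs2": 0.30, "rs3": 0.01, "rs4": 0.12}

FILTERS: Dict[str, Filter] = {
    "maf": lambda s: {x for x in s if MAF[x] >= 0.05},        # allele frequency filter
    "assoc": lambda s: {x for x in s if P_VALUES[x] < 0.05},  # association filter
}

series_result = apply_in_series(set(P_VALUES), FILTERS)
parallel_result = apply_in_parallel(set(P_VALUES), FILTERS)
```

In series, later filters do less work because they only process survivors. In parallel, each filter's output can be inspected or compared on its own.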

Furthermore, PLATO is extensible, allowing users to easily implement their own analytical methods as filters using a modular C++ library. Narrowing down the number of SNPs with various filters may make it feasible to look for interactions between the remaining variables.

An important consideration when applying multiple analytical filters to a dataset is the potential for redundancy among the filters. Within PLATO, many of the filters are highly correlated; however, the different filters are options for analysis to accommodate user preferences.

Since some of the filters are correlated, it is not necessary to analyze datasets with all of them.

By grouping filters into classes according to their tendency to identify overlapping subsets of putatively important SNPs, and then running filters from these distinct classes, it may be possible to remove the most SNPs with the fewest filters and thereby reduce computation time.

It is also possible that by running multiple distinct filters, “noise” SNPs can be removed and the truly significant effects can be found by singling out SNPs that repeatedly appear highly ranked across multiple filters.
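One simple way to single out SNPs that repeatedly rank highly is to count how many filters place each SNP in their top k. The sketch below assumes each filter reports a p-value per SNP (lower is better). The function names, the choice of k, and the toy scores are hypothetical and do not describe how PLATO itself combines rankings.

```python
from collections import Counter
from typing import Dict, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k SNPs with the smallest p-values for one filter."""
    return set(sorted(scores, key=scores.get)[:k])


def consensus_snps(filter_scores: Dict[str, Dict[str, float]],
                   k: int, min_filters: int) -> Set[str]:
    """Keep SNPs that appear in the top k of at least `min_filters` filters."""
    counts: Counter = Counter()
    for scores in filter_scores.values():
        counts.update(top_k(scores, k))
    return {snp for snp, n in counts.items() if n >= min_filters}


# Toy p-values from three hypothetical filters.
FILTER_SCORES = {
    "filter_a": {"rs1": 0.001, "rs2": 0.5, "rs3": 0.02, "rs4": 0.3},
    "filter_b": {"rs1": 0.004, "rs2": 0.01, "rs3": 0.6, "rs4": 0.4},
    "filter_c": {"rs1": 0.002, "rs2": 0.7, "rs3": 0.03, "rs4": 0.9},
}

in_two_or_more = consensus_snps(FILTER_SCORES, k=2, min_filters=2)
in_all_three = consensus_snps(FILTER_SCORES, k=2, min_filters=3)
```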

Data Simulation --

To determine which of the PLATO filters yield unique results, a simulation study was performed.

The genomeSIMLA software was used to conduct the data simulations. genomeSIMLA generates datasets using a forward-time population simulator that relies on random mating, genetic drift, recombination, and population growth to allow a population to naturally acquire linkage disequilibrium (LD) features.

Simulations, where the true location and size of the genetic effect are known, prove indispensable for evaluating new analytical techniques.

Genomic data with a known effect was simulated, specifying disease prevalence and a disease variant. The resulting data was then analyzed using all twenty-four (24) PLATO filters individually.

The Kappa & MAX Statistic --

A kappa statistic was used as a measure of comparison to provide a mechanism for grouping filters into subsets that yield similar results.
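As an illustration of how a kappa statistic can compare two filters, the sketch below computes Cohen's kappa on binary calls (1 if a filter flags a SNP, 0 otherwise). Treating filter output as binary calls and using Cohen's form of kappa are assumptions made for this example; the source does not specify the exact variant used.

```python
from typing import Sequence


def cohen_kappa(calls_a: Sequence[int], calls_b: Sequence[int]) -> float:
    """Cohen's kappa for two filters' 0/1 calls on the same SNPs."""
    n = len(calls_a)
    if n == 0 or n != len(calls_b):
        raise ValueError("both filters must score the same non-empty SNP list")
    # Observed agreement: fraction of SNPs on which the filters agree.
    p_observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Agreement expected by chance, from each filter's flagging rate.
    rate_a = sum(calls_a) / n
    rate_b = sum(calls_b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:
        return 1.0  # both filters are constant and identical
    return (p_observed - p_expected) / (1 - p_expected)


# Toy calls for ten SNPs from two hypothetical filters.
filter_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
filter_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
kappa_ab = cohen_kappa(filter_a, filter_b)
```

Filters with pairwise kappa near 1 flag nearly the same SNPs and are candidates for the same group, while values near 0 indicate agreement no better than chance.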

One filter from each resulting group was chosen as a representative filter for the group based on ease of use and interpretation. These filter sets were then further subdivided into filter classes by their tendency to rank embedded genetic effects similarly.

Once a set of filter classes had been determined, the manufacturers implemented a MAX statistic in an additional simulation. Here, one filter from each of the four (4) filter classes was applied to each SNP in the simulated dataset, and the lowest p-value among the four (4) tests was taken for each SNP.

Permutation was then performed on the entire analysis procedure to create an empirical null distribution and the results were compared with those found from running the four (4) filters individually.
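The minimum p-value and permutation procedure can be sketched for a single SNP as below. This is an assumption-laden illustration: it uses only two tests (an additive trend test and a dominant-model chi-square) instead of the four representative filters, codes genotypes as 0/1/2 and phenotypes as 0/1, and all function names are hypothetical. The trend statistic uses the identity that the Cochran-Armitage statistic with weights 0, 1, 2 equals N times the squared genotype-phenotype correlation.

```python
import math
import random
from typing import List, Sequence


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def trend_pvalue(genotypes: Sequence[int], phenotypes: Sequence[int]) -> float:
    """Additive (trend) test computed as N * r^2."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_pp = sum((p - mean_p) ** 2 for p in phenotypes)
    if s_gg == 0 or s_pp == 0:
        return 1.0  # no variation, nothing to test
    s_gp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    return chi2_1df_pvalue(n * s_gp ** 2 / (s_gg * s_pp))


def dominant_pvalue(genotypes: Sequence[int], phenotypes: Sequence[int]) -> float:
    """2x2 chi-square test: carrier (genotype > 0) versus case status."""
    pairs = list(zip(genotypes, phenotypes))
    a = sum(1 for g, p in pairs if g > 0 and p == 1)
    b = sum(1 for g, p in pairs if g == 0 and p == 1)
    c = sum(1 for g, p in pairs if g > 0 and p == 0)
    d = sum(1 for g, p in pairs if g == 0 and p == 0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    return chi2_1df_pvalue((a + b + c + d) * (a * d - b * c) ** 2 / denom)


TESTS = (trend_pvalue, dominant_pvalue)  # stand-ins for one filter per class


def min_p(genotypes: Sequence[int], phenotypes: Sequence[int]) -> float:
    """Lowest p-value across all tests for one SNP."""
    return min(test(genotypes, phenotypes) for test in TESTS)


def max_stat_pvalue(genotypes: Sequence[int], phenotypes: Sequence[int],
                    n_perm: int = 1000, seed: int = 0) -> float:
    """Empirical p-value: rerun the whole min-p procedure on shuffled phenotypes."""
    rng = random.Random(seed)
    observed = min_p(genotypes, phenotypes)
    shuffled: List[int] = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if min_p(genotypes, shuffled) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Permuting the phenotype and repeating the full procedure accounts for the correlation among the tests, which a simple Bonferroni correction over the number of filters would not.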

The PLATO approach using the MAX statistic (PLATO_MAX) outperformed all of the individual filters and shows promise for future applications to multiple types of analyses, in particular the search for epistasis.

Motivation for PLATO --

The motivation for PLATO is twofold. First, any single underlying analytical scheme will reveal only some important results, whereas multiple filters will reveal different subsets of important results.

Once results are obtained, they can be viewed in light of the results from other filters to best understand the full meaning of the genetic data.

The potential to use multiple filters forces no a priori assumptions about the mode of action of the genetic components of a phenotype, allowing the most general possible analysis and interpretation. This is critical, as it is rare that one knows what type of effect one is attempting to detect in disease gene association studies.

The ability to evaluate the association in the context of many different models and select the optimum solution for the dataset at hand, while controlling the Type I error rate, is therefore a key advantage.

Second, it is hypothesized that the genetic architecture of complex disease will include interactions between many genes as well as the environment.

In genome-wide association study (GWAS) scale datasets, searching for interactions is a computational challenge; thus, filtering the full set of GWAS SNPs to a smaller subset will be critical in the quest for detecting interactions. PLATO accomplishes both of these goals.

There are a large number of possible filters that one can envision for the PLATO framework.

Currently, PLATO has the following tests implemented: Cochran-Armitage trend test, chi-square, likelihood ratio, logistic regression, multifactor dimensionality reduction (MDR), normalized mutual information, odds ratio, and uncertainty coefficient. It also includes a thorough quality control filter covering sample and SNP efficiency, Hardy-Weinberg Equilibrium (HWE), allele frequency, rates of homozygosity, concordance checks, gender errors, and Mendelian errors.

In addition, PLATO currently has the following filters under development: the Biofilter (see G6G Abstract Number 20681), data transformations, conditional logistic regression, MDR-pedigree disequilibrium test (PDT), generalized MDR, Cochran-Mantel-Haenszel analysis, linkage disequilibrium (r^2), linear regression, and Transmission Disequilibrium Test (TDT).

Whole-genome Association Study Pipeline (WASP) --

The Whole-genome Association Study Pipeline (WASP) has recently been absorbed into PLATO.

WASP was designed to aid in retrieving, evaluating, formatting, and analyzing genotypic and clinical data from the latest large-scale genotyping studies. WASP implements a battery of quality control procedures to assess the data.

Among the currently available procedures are the examination of marker and sample genotyping efficiency, allele frequency calculations, checks of Mendelian error (if applicable) and gender discrepancies (based on available chromosome X and Y genotypes), and tests of Hardy-Weinberg Equilibrium.
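Three of the quality control checks listed above can be sketched in a few lines. The functions below are hypothetical and assume genotype counts for a biallelic marker (alleles A and B) and a list of calls where `None` marks a missing genotype. The Hardy-Weinberg check uses the chi-square goodness-of-fit form with 1 degree of freedom, which is an assumption for this example; an exact test is often preferred when counts are small.

```python
import math
from typing import Optional, Sequence


def genotyping_efficiency(calls: Sequence[Optional[str]]) -> float:
    """Fraction of non-missing genotype calls for a marker or a sample."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(call is not None for call in calls) / len(calls)


def allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of allele A from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    return (2 * n_aa + n_ab) / (2 * n)


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker, test is not informative
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail p-value for chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


efficiency = genotyping_efficiency(["AA", "AB", None, "BB"])
in_equilibrium = hwe_chi2_pvalue(25, 50, 25)      # matches HWE proportions exactly
heterozygote_deficit = hwe_chi2_pvalue(40, 20, 40)  # far too few heterozygotes
```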

Additionally, depending on the nature of the samples and the depth of examination the user desires to pursue, WASP can retrieve and format data for other software programs such as the Graphical Representation of Relationships (GRR) program or STRUCTURE. GRR is a Windows-based application for detecting pedigree errors by graphically inspecting the distribution of marker allele sharing among pairs of family members or all pairs of individuals in a study. STRUCTURE is a free software package that uses multi-locus genotype data to investigate population structure.

Beyond the quality control aspect of this application, WASP can perform standard tests of association using the Transmission Disequilibrium Test (TDT) for family-based datasets and the chi-square test of association for case-control datasets.
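For reference, the TDT statistic for a biallelic marker has a simple closed form: with b transmissions and c non-transmissions of an allele from heterozygous parents to affected offspring, the statistic is (b - c)^2 / (b + c), compared against a chi-square distribution with 1 degree of freedom. The sketch below is a generic illustration of that formula; the function name and example counts are hypothetical and it does not reflect WASP's actual implementation.

```python
import math
from typing import Tuple


def tdt(transmitted: int, untransmitted: int) -> Tuple[float, float]:
    """Return the TDT chi-square statistic and its 1-df p-value."""
    total = transmitted + untransmitted
    if total == 0:
        return 0.0, 1.0  # no informative (heterozygous) parents
    stat = (transmitted - untransmitted) ** 2 / total
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Toy counts: allele transmitted 60 times and not transmitted 40 times.
tdt_stat, tdt_p = tdt(60, 40)
```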

The manufacturers are also working on a graphical user interface (GUI) data manager for PLATO called PLATO Viewer.

PLATO Viewer will offer the same functionality as the command-line version of PLATO, but it will be more "user friendly" by providing point-and-click batch setups as well as plotting capabilities and data interaction.

System Requirements

Contact manufacturer.

Manufacturer

Manufacturer Web Site PLATO

Price Contact manufacturer.

G6G Abstract Number 20684

G6G Manufacturer Number 104259