g:Profiler

Category Genomics>Gene Expression Analysis/Profiling/Tools

Abstract g:Profiler is a web-based toolset for the ‘functional profiling’ of gene lists from large-scale experiments. This public web server can be used for characterizing and manipulating gene lists resulting from mining high-throughput genomic data.

g:Profiler has a simple, user-friendly web interface with advanced visualization for capturing Gene Ontology (GO), pathway, or transcription factor binding site (TFBS) enrichments down to the level of individual genes.

Besides standard multiple testing corrections, a new, improved method for estimating the ‘true effect’ of multiple testing over complex structures like GO has been introduced. Interpreting ‘ranked gene lists’ is supported from the same interface with efficient algorithms.

Such ordered lists may arise when studying the most significantly affected genes from high-throughput data or ‘genes co-expressed’ with the query gene.
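As an illustration of how a ranked list might be interpreted, the sketch below scans every prefix of an ordered gene list and reports the prefix in which a given annotation term is most over-represented. The source does not name the statistical test g:Profiler uses; the hypergeometric tail probability here is an assumption (a common choice for enrichment analysis), and the function names are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def best_prefix(ranked_genes, annotated, background_size):
    """Scan prefixes of a ranked list (unique genes assumed).

    Returns (prefix_length, p_value) for the most enriched prefix,
    or (0, 1.0) if no prefix contains an annotated gene.
    """
    K = len(annotated)
    best = (0, 1.0)
    hits = 0
    for n, gene in enumerate(ranked_genes, start=1):
        if gene in annotated:
            hits += 1
        p = hypergeom_tail(hits, n, K, background_size)
        if p < best[1]:
            best = (n, p)
    return best

# Toy example: two annotated genes sit at the top of a five-gene ranking.
length, p_value = best_prefix(["g1", "g2", "g3", "g4", "g5"], {"g1", "g2"}, 20)
print(length, p_value)
```

Because every prefix is tested, the minimum p-value found this way would itself need a multiple testing correction before being interpreted.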

Other important aspects of practical data analysis are supported by modules tightly integrated with g:Profiler.

These are: g:Cocoa for ‘comparative functional profiling’; g:Convert for converting between different database identifiers; g:Sorter for searching a large body of public ‘gene expression’ data for co-expression; and g:Orth for finding ‘orthologous genes’ from other species.

g:Profiler supports 31 different species, and the underlying data is updated regularly from sources such as the Ensembl database.

g:Profiler - g:Convert module --

g:Convert is a ‘gene identifier’ tool that converts between identifiers for genes, proteins, microarray probes, standard names, various database identifiers, etc. A mix of IDs of different types may be entered into g:Convert.

The user needs to select a target database; all input IDs are then converted into the target database format. Input IDs that have no corresponding entry in the target database are displayed as N/A.

g:Convert is based on the Ensembl database. Every alias is mapped through a three-level index of gene, transcript, and protein Ensembl IDs. For every level of index, all corresponding aliases are added to the output.
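The mapping idea can be sketched with a toy three-level index. Everything in this example is invented for illustration: the IDs are placeholders rather than real Ensembl identifiers, the namespace names are hypothetical, and the real g:Convert data structures are not described in the source.

```python
from collections import defaultdict

# Toy three-level index: gene -> transcripts and proteins (placeholder IDs).
GENE_LEVELS = {
    "G1": {"transcripts": ["T1", "T2"], "proteins": ["P1"]},
    "G2": {"transcripts": ["T3"], "proteins": ["P3"]},
}
# Aliases attached to each level ID, grouped by (hypothetical) namespace.
ALIASES = {
    "G1": {"SYMBOL": ["ABC1"]},
    "T1": {"REFSEQ": ["NM_0001"]},
    "P1": {"UNIPROT": ["Q00001"]},
    "G2": {"SYMBOL": ["XYZ2"]},
    "T3": {"PROBE": ["probe_42"]},
}

def level_ids(gene):
    """All gene, transcript, and protein IDs belonging to one gene."""
    levels = GENE_LEVELS[gene]
    return [gene, *levels["transcripts"], *levels["proteins"]]

def build_lookup():
    """Map every level ID and every alias to the gene(s) it belongs to."""
    lookup = defaultdict(set)
    for gene in GENE_LEVELS:
        for level_id in level_ids(gene):
            lookup[level_id].add(gene)
            for names in ALIASES.get(level_id, {}).values():
                for name in names:
                    lookup[name].add(gene)
    return lookup

def convert(ids, target, lookup):
    """Convert mixed IDs to the target namespace; unmatched input gives N/A."""
    result = {}
    for raw in ids:
        found = set()
        for gene in lookup.get(raw, ()):
            # Collect target-namespace aliases from every level of the index.
            for level_id in level_ids(gene):
                found.update(ALIASES.get(level_id, {}).get(target, []))
        result[raw] = sorted(found) or ["N/A"]
    return result

LOOKUP = build_lookup()
print(convert(["NM_0001", "probe_42", "unknown"], "SYMBOL", LOOKUP))
# {'NM_0001': ['ABC1'], 'probe_42': ['XYZ2'], 'unknown': ['N/A']}
```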

g:Convert is well integrated with the other modules in g:Profiler: input aliases in all related tools are automatically converted into the necessary internal format.

g:Profiler - g:Sorter module --

g:Sorter is a tool for ‘gene expression’ similarity search. For a selected single gene, protein or probe ID, g:Sorter finds a number of the most similar co-expressed (correlated) expression profiles in a specified dataset. Most ‘dissimilar reversely expressed’ (anti-correlated) profiles may also be selected.
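A similarity search of this kind can be sketched as a ranking of expression profiles by their correlation with the query profile. The source says only that profiles are ‘correlated’ or ‘anti-correlated’; the use of the Pearson coefficient, the function names, and the toy profiles below are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def most_similar(query_id, profiles, top_n=2, anti_correlated=False):
    """Rank all other profiles by correlation with the query profile.

    With anti_correlated=True the most negatively correlated profiles come first.
    """
    query = profiles[query_id]
    scored = [
        (probe_id, pearson(query, profile))
        for probe_id, profile in profiles.items()
        if probe_id != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=not anti_correlated)
    return scored[:top_n]

# Toy expression profiles across four samples (invented values).
PROFILES = {
    "probe_a": [1, 2, 3, 4],
    "probe_b": [2, 4, 6, 8],
    "probe_c": [4, 3, 2, 1],
    "probe_d": [1, 3, 2, 4],
}

print(most_similar("probe_a", PROFILES))
print(most_similar("probe_a", PROFILES, top_n=1, anti_correlated=True))
```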

A large number of public gene expression datasets from Gene Expression Omnibus (GEO) - (see G6G Abstract Number 20013) are available for analysis for each selected organism.

The result of a g:Sorter analysis is one or more sorted lists of microarray probe IDs. Several lists are retrieved when several probes correspond to a given input gene or protein. g:Sorter also finds an intersection list that contains the ‘common elements’ of all retrieved lists.
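The intersection list can be sketched in a few lines. Keeping the order of the first retrieved list is an assumption; the source does not say how the intersection is ordered, and the function name is hypothetical.

```python
def intersection_list(ranked_lists):
    """Return the probe IDs present in every list, in the order of the first list."""
    if not ranked_lists:
        return []
    common = set(ranked_lists[0]).intersection(*ranked_lists[1:])
    return [probe_id for probe_id in ranked_lists[0] if probe_id in common]

# Two toy result lists, as might be retrieved for two probes of one gene.
print(intersection_list([["p1", "p2", "p3"], ["p3", "p1", "p9"]]))
# ['p1', 'p3']
```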

g:Profiler - g:Orth module --

g:Orth is a tool for retrieving orthologs for a selected organism. g:Orth accepts input of genes, proteins, microarray probes, standard names, various database identifiers, etc. A mix of IDs of different types may be entered.

The user also needs to select a target organism; all entries in the input query are then mapped to the target organism and the corresponding orthologs are fetched. Input IDs that have no corresponding entry in either the given organism or the target organism are displayed as N/A.

g:Orth ortholog mappings are based on Ensembl data. Orthologs are automatically mapped via Ensembl gene IDs using internal g:Convert methods.
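The two-step lookup described for g:Orth (input ID to gene ID in the source organism, then gene ID to orthologs in the target organism) can be sketched as follows. The tables, IDs, and function name are invented placeholders, not real Ensembl data or g:Profiler internals.

```python
# Toy tables, invented for illustration (placeholder IDs).
ALIAS_TO_GENE = {"ABC1": "HS_G1", "probe_7": "HS_G2", "XYZ2": "HS_G3"}
ORTHOLOGS = {
    ("HS_G1", "mouse"): ["MM_G1"],
    ("HS_G2", "mouse"): ["MM_G2a", "MM_G2b"],
}

def find_orthologs(ids, target_organism):
    """Map each input ID to its orthologs in the target organism.

    An ID yields N/A when it is unknown in the source organism or when
    its gene has no ortholog in the target organism.
    """
    result = {}
    for raw in ids:
        gene = ALIAS_TO_GENE.get(raw)
        hits = ORTHOLOGS.get((gene, target_organism), []) if gene else []
        result[raw] = list(hits) or ["N/A"]
    return result

print(find_orthologs(["ABC1", "probe_7", "XYZ2", "unknown"], "mouse"))
```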

g:Profiler resources --

g:Profiler supports 31 different organisms (as stated above) and offers an interface to a range of databases for functional classification, as well as to tasks such as namespace conversion, expression analysis, and orthology search.

g:Profiler resources are kept up-to-date on a regular basis as new Ensembl versions are released. A short description of all resources is given below.

1) Genomes for 31 organisms, respective namespace mappings, orthology matches, and GO annotations are all retrieved from the Ensembl database via the BioMart interface - (see G6G Abstract Number 20517). The manufacturer keeps their resources up to date and adds new organisms as data becomes available.

Since Ensembl lacks mappings for a noticeable number of microarray probes, the manufacturer fetches additional microarray probe-set data from the Gene Expression Omnibus (GEO).

2) Gene Ontology (GO) is the primary resource for ‘annotating gene groups’ for three (3) types of knowledge -- cellular components (cc), molecular functions (mf), and biological processes (bp).

GO is a structured vocabulary in the form of a directed acyclic graph (DAG). Hierarchical relations hold true within GO; vocabulary terms are related to one or several more general ‘parent’ terms. Any term therefore implicitly covers all the terms below it, along every relational path.

Therefore, genes annotated to a specific term in g:Profiler are also added to all associated ‘parents’, and the profiling is performed at all ‘hierarchical levels’ simultaneously. g:Profiler strips out GO annotations that apply the ‘NOT’ qualifier.
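The propagation of annotations to parent terms, together with the removal of ‘NOT’ annotations, can be sketched on a toy DAG. The term names, the annotation tuples, and the function names are invented for illustration; this is a sketch of the general idea and not g:Profiler's actual implementation.

```python
# Toy DAG: each term maps to its direct parents (placeholder term names).
PARENTS = {
    "GO:C": ["GO:A", "GO:B"],
    "GO:A": ["GO:ROOT"],
    "GO:B": ["GO:ROOT"],
    "GO:ROOT": [],
}

def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(annotations, parents):
    """Build term -> genes, adding each gene to the annotated term and all
    of its ancestors. Annotations carrying the 'NOT' qualifier are skipped.

    `annotations` is an iterable of (gene, term, qualifier) tuples.
    """
    term_to_genes = {}
    for gene, term, qualifier in annotations:
        if qualifier == "NOT":
            continue
        for t in {term} | ancestors(term, parents):
            term_to_genes.setdefault(t, set()).add(gene)
    return term_to_genes

ANNOTATIONS = [("g1", "GO:C", ""), ("g2", "GO:A", ""), ("g3", "GO:B", "NOT")]
for term, genes in sorted(propagate(ANNOTATIONS, PARENTS).items()):
    print(term, sorted(genes))
```

With the gene sets expanded in this way, a single enrichment pass over all terms covers every hierarchical level at once.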

3) The Kyoto Encyclopedia of Genes and Genomes (KEGG) database provides g:Profiler with ‘functional annotations’ for metabolic and information processing pathways, cellular processes, human diseases, and drug development data. KEGG classifications are available for 15 organisms.

4) Reactome - (see G6G Abstract Number 20267) is a mammalian-specific pathway database with thorough annotations of numerous well-studied biological processes, ranging from intermediary metabolism to signal transduction to cell cycle and apoptosis. Reactome annotations are available for eight organisms.

5) Putative transcription factor binding sites (TFBS) from the TRANSFAC database - (see G6G Abstract Number 20121) are available for nine organisms; they are retrieved and loaded into g:Profiler through a special prediction pipeline.

First, TFBS are found by matching TRANSFAC position-specific matrices using the program MATCH™ (a tool for searching transcription factor binding sites in DNA sequences) on 1,000-bp upstream regions, as provided by the UCSC genome database - (see G6G Abstract Number 20197).

A cut-off value provided by TRANSFAC is then applied to remove spurious motifs. Remaining matches are split into five (5) hierarchical and inclusive groups based on ‘match score’.

In most cases, motif matches from the deepest level of the hierarchy are perfect representations of the initial matrix. This hierarchical approach allows TFBS profiling in greater detail and allows the user to distinguish between high- and low-credibility matches.
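The cut-off and grouping steps can be sketched as follows. The threshold values, cut-off, site names, and scores are hypothetical; the source states only that there are five inclusive groups based on match score. ‘Inclusive’ is read here as meaning that a match in a deeper (stricter) group also belongs to every shallower group.

```python
# Hypothetical values; the source does not give the actual thresholds.
CUTOFF = 0.80
THRESHOLDS = [0.80, 0.85, 0.90, 0.95, 1.00]  # level 1 (loosest) to level 5 (deepest)

def group_matches(matches, cutoff, thresholds):
    """Assign (site, score) matches to inclusive score groups.

    Matches scoring below `cutoff` are discarded as spurious. A surviving
    match is added to every level whose threshold it reaches, so each
    deeper level is a subset of the level above it.
    """
    groups = {level: [] for level in range(1, len(thresholds) + 1)}
    for site, score in matches:
        if score < cutoff:
            continue
        for level, threshold in enumerate(thresholds, start=1):
            if score >= threshold:
                groups[level].append(site)
    return groups

# Toy matches: (site name, match score), invented for illustration.
MATCHES = [("site_1", 1.0), ("site_2", 0.92), ("site_3", 0.81), ("site_4", 0.5)]
for level, sites in group_matches(MATCHES, CUTOFF, THRESHOLDS).items():
    print(level, sites)
```

In this sketch only a perfect-scoring match reaches level 5, which mirrors the remark that the deepest group mostly holds perfect representations of the matrix.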

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site g:Profiler

Price Contact manufacturer.

G6G Abstract Number 20555

G6G Manufacturer Number 104027