Abstract --

CANDID (a CANDidate gene IDentification program) is a data mining system that prioritizes candidate genes for genetic analysis.

It can combine literature analysis, protein domain information, sequence conservation, gene expression studies, linkage studies, association studies, and other custom data to score and rank all human genes.

CANDID's main strength is its versatility. Users may input linkage data to weight the analysis toward, or limit it to, certain areas of the genome.

Users may also decide whether to prioritize genes of known function, by heavily weighting genes' literature scores, or to avoid missing genes of unknown function, by emphasizing scores based on sequence alone, such as protein domain scores and conservation scores.

CANDID does not require a training set or testing set of genes for its analysis; it analyzes and ranks all human genes.

By ranking the genes, CANDID allows investigators to choose how many of the top genes they wish to study further, which limits problems of multiple testing.

Reasons for CANDID --

The sequencing of the human genome has provided investigators with a map of our genes, and projects such as the HapMap have pinpointed locations of variants in this map. To date, many variants have been associated with disease, but many phenotypes have yet-to-be-identified genetic causes.

Candidate genes for a phenotype are usually positional or biological candidates. Positional candidates include those implicated in deletions, amplifications, and linkage studies.

Biological candidates are often chosen because of prior knowledge linking them to the phenotype. Selecting candidates is often an arbitrary process, and in the case of biological candidates, it is limited by the knowledge of the selector.

Well-characterized genes tend to be examined before uncharacterized genes, leading to a bias in candidate selection.

CANDID was designed with several advantages over existing systems:

1) Sequence-based analysis: A variety of criteria are used to analyze genes, including some that are only sequence-based. This allows for higher scoring of uncharacterized genes.

2) Weighting: Users can weight each criterion in order to emphasize well- or poorly-characterized genes.

3) Gene rankings: Delivering the output as a ranked list lets the user examine each gene in turn, limiting multiple testing issues.

4) Whole genome: All human genes are analyzed by CANDID. The user may also limit the analysis to protein-coding genes.

5) No training set: Whereas other algorithms require training sets of genes to be entered, CANDID does not, eliminating a source of bias from the user.
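
The weighting and ranking ideas in points 2 and 3 can be illustrated with a minimal sketch. It assumes that each gene has a numeric score per criterion and that scores are combined as a weighted average; the criterion names, the gene names, the score values, and the functions weighted_score and rank_genes are all hypothetical and are not taken from CANDID itself.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores.

    A criterion missing from `scores` counts as 0.0. This combination
    rule is an assumption for illustration, not CANDID's formula.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    weighted_sum = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return weighted_sum / total_weight


def rank_genes(gene_scores, weights, top_n=None):
    """Return gene names sorted from highest to lowest weighted score."""
    ranked = sorted(
        gene_scores,
        key=lambda gene: weighted_score(gene_scores[gene], weights),
        reverse=True,
    )
    return ranked if top_n is None else ranked[:top_n]


# Made-up scores: GENE_A is well characterized, GENE_B is not.
genes = {
    "GENE_A": {"literature": 0.9, "domains": 0.2, "conservation": 0.3},
    "GENE_B": {"literature": 0.0, "domains": 0.8, "conservation": 0.9},
}

# Emphasize genes of known function (heavy literature weight).
favor_known = {"literature": 5.0, "domains": 1.0, "conservation": 1.0}
# Emphasize sequence-only criteria so uncharacterized genes are not missed.
favor_unknown = {"literature": 0.0, "domains": 1.0, "conservation": 1.0}

print(rank_genes(genes, favor_known))
print(rank_genes(genes, favor_unknown, top_n=1))
```

Changing only the weights reorders the same two genes, and top_n shows how an investigator might cap the number of genes carried forward for testing.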

CANDID Data Sources --

To evaluate candidate genes, CANDID accesses a set of custom-built databases that are derived from the data sources described below. Each data source is used to compile a separate, CANDID-specific database.

Though the data sources may change on a day-to-day basis, CANDID's databases will not.

Instead, builds of CANDID databases will be released periodically. If users submit queries days or weeks apart, as long as the queries are identical and the same CANDID database builds are chosen each time, users will receive identical results.

Data source descriptions are as follows:

Gene - NCBI's Gene database is the cornerstone of CANDID's database organization. A list of EntrezGene ID numbers corresponding to human genes is used to populate all CANDID databases.

CANDID also uses the following information from the NCBI Gene database in its analyses and in the formatting of results:

Gene symbol; Gene synonyms; Gene location (start and stop, in base pairs; chromosome; cytogenetic locus); Gene type (protein-coding, tRNA, etc.); Gene summary; Gene description; and Gene product interactions (from BIND, HPRD, BioGRID, etc.).
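
The fields listed above could be held in a record like the one sketched below. This is an illustration only: the class name GeneRecord, the field names, and the example values are hypothetical and do not reflect CANDID's actual database layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneRecord:
    """One gene, keyed by its EntrezGene ID (hypothetical layout)."""
    gene_id: int                      # EntrezGene ID
    symbol: str
    synonyms: List[str] = field(default_factory=list)
    chromosome: str = ""
    start_bp: Optional[int] = None    # start position in base pairs
    stop_bp: Optional[int] = None     # stop position in base pairs
    cytogenetic_locus: str = ""
    gene_type: str = ""               # e.g. "protein-coding", "tRNA"
    summary: str = ""
    description: str = ""
    # EntrezGene IDs of genes whose products interact with this one.
    interactions: List[int] = field(default_factory=list)

    def is_protein_coding(self) -> bool:
        return self.gene_type == "protein-coding"


# Placeholder records, not real genes.
records = [
    GeneRecord(gene_id=1, symbol="EXAMPLE1", chromosome="1",
               start_bp=1000, stop_bp=5000, gene_type="protein-coding"),
    GeneRecord(gene_id=2, symbol="EXAMPLE2", chromosome="1",
               start_bp=9000, stop_bp=9100, gene_type="tRNA"),
]

# Restrict an analysis to protein-coding genes.
coding_only = [r for r in records if r.is_protein_coding()]
print([r.symbol for r in coding_only])
```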

PubMed - PubMed is an index of medically-relevant scientific literature. All publications with a PubMed ID (PMID) number are analyzed by CANDID.

Conserved Domain Database - The Conserved Domain Database (CDD) describes and indexes protein domains. CANDID analyzes the descriptions of all CDD domains that have an ID number.

HomoloGene - NCBI's HomoloGene database groups homologous genes from organisms with sequenced genomes. These groups are described by using phylogenetic descriptors of the earliest common ancestor that contains the gene. CANDID's conservation analysis uses these descriptors.

GeneAtlas - The Genomics Institute of the Novartis Research Foundation (GNF) has created a dataset with expression profiles of approximately 19,000 human genes in 79 tissues. The normalized data are used in CANDID's expression analysis.

dbSNP - NCBI's dbSNP contains unique IDs for known single nucleotide polymorphisms (SNPs). SNPs associated with human genes are identified and used in CANDID's association analysis.

CANDID MapConverter utility --

This utility assists users in translating genomic positions to centiMorgan locations on the Marshfield genetic map. Linkage files for CANDID must use Marshfield centiMorgans, so this utility is designed with CANDID linkage file formatting in mind.
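
One common way to translate a base-pair position into a genetic-map position is linear interpolation between flanking markers whose physical and genetic positions are both known. The sketch below shows that technique under stated assumptions: it is not the MapConverter's documented method, the function name bp_to_cm is hypothetical, and the marker values are made up rather than taken from the Marshfield map.

```python
from bisect import bisect_right


def bp_to_cm(position_bp, markers):
    """Interpolate a centiMorgan position for a base-pair position.

    `markers` is a list of (bp, cM) pairs for one chromosome, sorted by bp.
    Positions outside the marker range are clamped to the nearest marker.
    """
    if len(markers) < 2:
        raise ValueError("need at least two markers to interpolate")
    bps = [bp for bp, _ in markers]
    if position_bp <= bps[0]:
        return markers[0][1]
    if position_bp >= bps[-1]:
        return markers[-1][1]
    # Index of the first marker strictly to the right of the position.
    i = bisect_right(bps, position_bp)
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    fraction = (position_bp - bp0) / (bp1 - bp0)
    return cm0 + fraction * (cm1 - cm0)


# Made-up markers: (physical position in bp, genetic position in cM).
example_markers = [(1_000_000, 0.0), (3_000_000, 4.0), (5_000_000, 5.0)]

print(bp_to_cm(2_000_000, example_markers))
print(bp_to_cm(4_000_000, example_markers))
```

Because recombination rates vary along a chromosome, the same physical distance maps to different genetic distances in the two marker intervals of the example.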

CANDID Linkage Info --

Linkage files from a number of sources may be submitted. Output files from GENEHUNTER (variance components, Haseman-Elston, and maximum likelihood output files), MERLIN and SOLAR can be uploaded directly and will be parsed by CANDID.

Linkage information from other sources must be submitted as custom linkage files. Multiple linkage files can be combined into a meta linkage file using the MetaMaker utility on the CANDID website.

System Requirements --

CANDID is web-based and was written entirely in Perl (version 5.8.6). It uses the modules DBM::Deep and DB_File (version 1.815) and BerkeleyDB (version 4.005020), and it runs on an Apache web server. Batch mode versions of CANDID are also available.


