Abstract --

NGSmethDB is a database of single-cytosine resolution methylation data. It stores and retrieves methylation data derived from next-generation sequencing (NGS), and it considers two (2) cytosine methylation contexts (CpG and CAG/CTG).

Next-generation sequencing (NGS) together with bisulphite conversion allows the generation of whole genome methylation maps at single-cytosine resolution.

Such maps make it possible to study the absence of methylation in a particular genome region over a range of tissues, the differential methylation between tissues, or the changes that occur under pathological conditions. NGSmethDB addresses these requirements.

NGSmethDB uses a web interface based on GBrowse and coupled to a MySQL backend, which allows the user to visualize the methylation data in a genomic context together with many other annotations. Full data downloads are also available.

In addition, a set of advanced data mining tools is implemented, so that the user can filter, analyze and retrieve data in many different ways.

For example, the user can search for unmethylated or differentially methylated cytosines in a selected set of tissues, or display and analyze the promoter methylation of RefSeq genes.

Finally, the database extends the commonly used focus on CpG dinucleotides to the recently discovered non-CpG targets for DNA methylation in undifferentiated tissues.

NGSmethDB Features and Capabilities --

The NGSmethDB database can be divided into two (2) parts.

First, the content can be visualized, together with many other common annotations, by means of a web interface based on GBrowse coupled to a MySQL backend (as stated above).

GBrowse, the Generic Genome Browser, is a genome annotation viewer: a combination of database and interactive web page for manipulating and displaying annotations on genomes.

Second, several user-friendly data mining tools are provided so the average user can easily generate their own data sets.

Currently (August 2010), NGSmethDB holds information on three (3) species (human, mouse and Arabidopsis) and 52 different tissues (21 unique tissues).

Furthermore, two (2) different methylation contexts are considered, CpG (cytosine-phosphate-guanine) and CWG (CAG or CTG); other non-CpG contexts, such as CAH or CHH, will soon be available.

Currently (August 2010), the database holds methylation data of 696,599,217 cytosines for human (hg18), 69,459,481 cytosines for mouse (mm8), and 16,321,229 cytosines for Arabidopsis (TAIR8).

A detailed and updated table of database statistics is maintained on-line, together with a summary of the publications from which the data were derived.

Note: The NGSmethDB developers encourage submissions of new methylation data in order to keep the database populated and up to date.

For most data, the methylation information for the cytosines is directly available for the three (3) mentioned genome assemblies. In these cases, the developers populate the database with these processed data.

For other cases, the developers either used the LiftOver tool, which converts genome coordinates and genome annotation files between assemblies, to convert the coordinates from other assemblies, or developed scripts to process the raw data (such as FASTQ files) in order to obtain the methylation information for all covered cytosines.

All methylation values for both CpG and CWG contexts are calculated taking into account both strands. The assigned methylation value is therefore a weighted mean of the values for the context on the direct and reverse strands.

That is, it is the sum of the reads that indicate methylation (cytosine not converted to uracil/thymine) mapped to the specific position on the ‘+’ strand and those mapped to the ‘-’ strand, divided by the total number of reads mapped to the position, regardless of the strand.
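The calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and its four arguments are hypothetical and are not taken from the NGSmethDB code base.

```python
def methylation_value(meth_plus, total_plus, meth_minus, total_minus):
    """Pooled methylation ratio of one cytosine context over both strands.

    meth_*  : reads indicating methylation (cytosine not converted) on that strand
    total_* : all reads mapped to the position on that strand
    (hypothetical argument names, chosen for this sketch)
    """
    total = total_plus + total_minus
    if total == 0:
        return None  # position not covered, so no value can be assigned
    return (meth_plus + meth_minus) / total


# Made-up counts: 8 of 10 reads methylated on '+', 1 of 5 on '-'
value = methylation_value(8, 10, 1, 5)
print(value)  # 0.6
```

Pooling the reads in this way is the same as averaging the two per-strand ratios (0.8 and 0.2 in the example) weighted by their read counts (10 and 5).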

Genomic browser interface --

The GBrowse genome viewer connected to a MySQL backend (as stated above) is used to set up a web browser interface for NGSmethDB.

Features of the browser include the ability to scroll and zoom through arbitrary regions of a genome, to enter a region of the genome by searching for a landmark or performing a full text search of features, as well as the ability to enable and disable feature tracks and change their relative order and appearance.

The user can also upload private annotations to view them in the context of existing ones at the NGSmethDB web site.

Apart from the methylation data, the following related annotations are currently available on the NGSmethDB browser:

1) CpGcluster CpG islands;

2) Takai-Jones CpG islands;

3) RefSeq genes;

4) HMR conserved Transcription Factor Binding Sites (TFBSs);

5) CisRED regulatory elements; and

6) The chromosome sequence (hg18, mm8 and TAIR8 genome assemblies) and G + C content.

The methylation information of a given context is represented by the coordinate of the cytosine on the direct strand.

To display the methylation values of the cytosines, the developers use a color gradient from white (methylation value = 0, unmethylated in all reads) to red (methylation value = 1, methylated in all reads).
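As an illustration of such a gradient, the sketch below maps a methylation value to an RGB color. It assumes a linear interpolation in RGB space, which the text does not specify, so the browser's actual color scale may differ. The function name is hypothetical.

```python
def methylation_to_rgb(value):
    """Map a methylation value in [0, 1] to a white-to-red RGB triple.

    Assumes linear interpolation: 0 -> white (255, 255, 255), 1 -> red (255, 0, 0).
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("methylation value must lie in [0, 1]")
    fade = round(255 * (1.0 - value))  # green and blue fade out as methylation rises
    return (255, fade, fade)


print(methylation_to_rgb(0.0))  # (255, 255, 255)
print(methylation_to_rgb(1.0))  # (255, 0, 0)
```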

NGSmethDB Data mining tools --

Currently, five (5) different ways are implemented to retrieve raw data from the database.

For all five possibilities, two (2) different sequence contexts and three (3) coverage levels exist. The developers determined the methylation values not just of CpG dinucleotides but also of the cytosines in a CWG (CAG or CTG) context.

The methylation value at a given position (cytosine) is calculated taking both strands into consideration (as stated above). The developers also stored three (3) different coverage levels in the database: cytosines covered by at least 1, 5, and 10 reads.

1) Dump download -

This option shows an overview of current database content, including a short description of the tissue, the genome coverage as a percentage, a link to PubMed, and raw data files for #reads = 1, #reads = 5 and #reads = 10 coverage.

The files show the chromosome, the chromosome-start and chromosome-end coordinates, the sequence methylation context (either CpG or CWG), the number of reads and the cytosine methylation ratio.
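A reader for such files might look like the sketch below. It assumes tab-separated columns in the order listed above; the delimiter, the column order, the record type and the sample line are all assumptions made for illustration, so check a downloaded file before relying on them.

```python
from typing import NamedTuple


class MethRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    context: str   # "CpG" or "CWG"
    reads: int     # number of reads covering the position
    ratio: float   # cytosine methylation ratio in [0, 1]


def parse_dump_line(line):
    """Parse one line of a dump file (assumed tab-separated, six columns)."""
    chrom, start, end, context, reads, ratio = line.rstrip("\n").split("\t")
    return MethRecord(chrom, int(start), int(end), context, int(reads), float(ratio))


# Made-up example line, not real NGSmethDB data
record = parse_dump_line("chr1\t10468\t10469\tCpG\t12\t0.75\n")
print(record.context, record.reads, record.ratio)
```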

2) Retrieve unmethylated contexts -

This tool can be used to retrieve all unmethylated cytosines in a given set of tissues. The user has to select the sequence context (CpG or CWG), the read coverage, the threshold for unmethylation (often a threshold of 0.2 is used, i.e. all cytosines with values ≤0.2 are considered to be unmethylated) and the tissues.

The tool will detect all cytosine contexts showing lower methylation ratios than the chosen threshold in all selected tissues. The provided output file holds the chromosome, chromosome start- and end-coordinates and the methylation values in all selected tissues.

Note that this tool can also be used to retrieve all CpGs that are present in every single analyzed tissue by setting the threshold to one (1).

In doing so, cytosines with methylation data in all tissues will be reported regardless of their methylation state; cytosines that are not covered by at least the chosen number of reads (1, 5, or 10) in any one of the analyzed tissues will not be reported in the output.
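The selection logic of this tool can be sketched as follows. The nested-dictionary input, the function name and the toy data are hypothetical. The comparison with the threshold is assumed to be inclusive, and the tool itself presumably runs against the MySQL backend rather than in-memory structures.

```python
def unmethylated_in_all(tissues, min_reads=5, threshold=0.2):
    """Return positions that are unmethylated in every tissue.

    tissues: {tissue_name: {(chrom, start): (reads, ratio)}}  (hypothetical layout)
    A position is kept only if every tissue covers it with at least
    `min_reads` reads and a methylation ratio <= `threshold`.
    """
    names = list(tissues)
    if not names:
        return {}
    # Only positions that have data in all selected tissues can qualify
    shared = set.intersection(*(set(tissues[name]) for name in names))
    hits = {}
    for pos in sorted(shared):
        values = [tissues[name][pos] for name in names]
        if all(reads >= min_reads and ratio <= threshold for reads, ratio in values):
            hits[pos] = [ratio for _, ratio in values]
    return hits


# Made-up toy data for two tissues
toy = {
    "tissueA": {("chr1", 100): (10, 0.1), ("chr1", 200): (10, 0.9), ("chr1", 300): (2, 0.0)},
    "tissueB": {("chr1", 100): (8, 0.2), ("chr1", 200): (7, 0.05), ("chr1", 300): (9, 0.0)},
}
print(unmethylated_in_all(toy))                 # only ("chr1", 100) qualifies
print(unmethylated_in_all(toy, threshold=1.0))  # every position covered in both tissues
```

With the threshold set to 1.0, the methylation filter no longer excludes anything, so the result reduces to the positions that meet the coverage requirement in all tissues.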

3) Retrieve differentially methylated contexts -

By means of this tool all differentially methylated cytosine contexts can be determined in a given set of tissues.

All parameters of the ‘Retrieve unmethylated contexts’ tool (see above) are available here, plus one additional parameter: the threshold for the methylation value which defines whether a cytosine is considered to be methylated (often a threshold of 0.8 is used, i.e. all cytosines with values ≥0.8 are considered to be methylated).

The developers define a cytosine as differentially methylated if it is unmethylated in at least one tissue and methylated in at least one other tissue.

The tool reports only those differentially methylated cytosine contexts that are either methylated or unmethylated in each of the analyzed tissues, i.e. contexts that show intermediate methylation in even a single tissue will not be reported.
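The rule above can be expressed as a small classification function. This is a sketch under stated assumptions: the function name is hypothetical, the input is one methylation ratio per selected tissue, and both threshold comparisons are assumed to be inclusive.

```python
def is_differentially_methylated(ratios, unmeth=0.2, meth=0.8):
    """Decide whether one cytosine context is differentially methylated.

    ratios: one methylation ratio per selected tissue.
    The context must be clearly unmethylated (<= unmeth) or clearly
    methylated (>= meth) in every tissue, with both states present.
    """
    seen_unmethylated = False
    seen_methylated = False
    for ratio in ratios:
        if ratio <= unmeth:
            seen_unmethylated = True
        elif ratio >= meth:
            seen_methylated = True
        else:
            return False  # intermediate methylation in a tissue: not reported
    return seen_unmethylated and seen_methylated


print(is_differentially_methylated([0.1, 0.9, 0.95]))  # True
print(is_differentially_methylated([0.1, 0.5, 0.9]))   # False
```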

4) Get methylation states of promoter regions -

This tool depicts the methylation states of all cytosine contexts within the promoter region of RefSeq genes. The developers define the promoter region as beginning 1.5 kb upstream of the Transcription Start Site (TSS) and ending 500 bp downstream of the TSS.

The user needs to provide a valid RefSeq name (NM_*) or a unique TAIR gene id (ATxGxxxxx) and the desired coverage.

The output is displayed by default as an overview table that summarizes how the methylation fluctuates along the promoter as well as over the different tissues.
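The promoter window defined above (1.5 kb upstream to 500 bp downstream of the TSS) depends on the strand of the gene. The sketch below shows one way to derive it. The function name, the 1-based inclusive coordinates and the clamping at position 1 are assumptions for illustration, not details given in the text.

```python
def promoter_region(tss, strand, upstream=1500, downstream=500):
    """Return (start, end) of the promoter window around a TSS.

    Coordinates are assumed to be 1-based and inclusive. "Upstream" means
    lower coordinates on the '+' strand and higher coordinates on the '-' strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 1), end  # do not run off the start of the chromosome


print(promoter_region(10000, "+"))  # (8500, 10500)
print(promoter_region(10000, "-"))  # (9500, 11500)
```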

5) Retrieve methylation data for chromosome region -

All methylation values for a selected set of tissues can be retrieved for a given chromosomal region, once the user provides the start and end chromosome coordinates.
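A region lookup of this kind can be sketched with a binary search over sorted coordinates. The parallel-list layout, the function name and the toy values are hypothetical, and the end coordinate is assumed to be inclusive. The database itself would answer such a query through its MySQL backend.

```python
from bisect import bisect_left, bisect_right


def values_in_region(positions, ratios, start, end):
    """Return (position, ratio) pairs with start <= position <= end.

    positions: sorted cytosine coordinates on one chromosome
    ratios:    methylation ratios, parallel to `positions`
    """
    lo = bisect_left(positions, start)   # first index with position >= start
    hi = bisect_right(positions, end)    # first index with position > end
    return list(zip(positions[lo:hi], ratios[lo:hi]))


# Made-up toy data for one tissue and one chromosome
positions = [100, 250, 400, 900]
ratios = [0.1, 0.8, 0.5, 1.0]
print(values_in_region(positions, ratios, 200, 400))  # [(250, 0.8), (400, 0.5)]
```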

