BioWarehouse

Category Cross-Omics>Knowledge Bases/Databases/Tools

Abstract BioWarehouse is an open-source software environment for integrating a set of biological databases (DBs) into a single physical database management system for data management, mining, and exploration.

BioWarehouse is a component of the Bio-SPICE project. Bio-SPICE is an open-source framework and software toolset for systems biology, intended to assist biological researchers in modeling and simulating spatio-temporal processes in living cells.

Key features/capabilities of BioWarehouse are:

1) A relational database schema that models important bioinformatics data-types.

2) BioWarehouse instances can be implemented using either the Oracle or MySQL database management systems.

3) A collection of loader programs that populate the warehouse with data from public biological databases.

4) The loader programs transform the syntax of the source databases into relational form and transform the diverse semantics of the source databases into the common semantics of the BioWarehouse schema.

BioWarehouse Loaders -- The BioWarehouse is populated using loader programs that translate the flat-file representation of a source database into the warehouse schema.

A loader is provided for each source database supported by BioWarehouse.

Once loaded into a BioWarehouse instance (running on MySQL, for example), a set of source DBs can be queried together.

Some loaders are specific to a data format rather than to a single source database. For example, the BioPAX and MAGE-ML loaders can load any database that is in BioPAX or MAGE-ML format, respectively.
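The first half of a loader's job, turning flat-file records into relational rows, can be sketched as follows. This is only an illustration and not BioWarehouse code: the real loaders are written in C and Java, and the record layout (two-letter line codes, "//" as record terminator), the names parse_records and to_rows, and the (ec_number, name) row shape are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical flat-file excerpt: two-letter line codes, records ended by "//".
SAMPLE = """\
ID   1.1.1.1
DE   Alcohol dehydrogenase.
//
ID   1.1.1.2
DE   Alcohol dehydrogenase (NADP(+)).
//
"""


def parse_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per flat-file record, keyed by line code."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: emit it and start a new one.
            if record:
                yield record
            record = {}
            continue
        if not line.strip():
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        # Repeated line codes are joined into a single field.
        record[code] = f"{record[code]} {value}" if code in record else value
    if record:
        yield record


def to_rows(records: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Map parsed records onto (ec_number, name) tuples for a relational table."""
    return [(r["ID"], r.get("DE", "").rstrip(".")) for r in records]


rows = to_rows(parse_records(SAMPLE))
```

Keeping parsing separate from row mapping mirrors the split described above between syntactic and semantic transformation; a format-specific loader would replace only the parser.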

BioWarehouse Loaders include:

1) BioCyc DBs (see G6G Abstract Number 20230) -- Genomes, genes, proteins, metabolic pathways, reactions, compounds.

2) BioPAX format -- The BioPAX format describes biological pathway and protein interaction data. Currently this loader can process only BioPAX Level 2 protein interaction data.

3) Comprehensive Microbial Resource (CMR) -- Genomes, genes, proteins, reactions.

4) ENZYME DB -- Reactions, proteins.

5) Eco2dbase -- E. coli 2D protein gel database.

6) GenBank (bacteria only) -- Bacterial genes and proteins.

7) Gene Ontology (GO) -- A controlled vocabulary to describe gene and gene product attributes.

8) Kyoto Encyclopedia of Genes and Genomes (KEGG) -- Genomes, genes, proteins, metabolic pathways, reactions, compounds.

9) MetaCyc Ontology (see G6G Abstract Number 20232) -- The MetaCyc ontology of metabolic pathways and the MetaCyc ontology of chemical compounds.

10) MAGE-ML format -- The MAGE-ML file format describes gene expression datasets.

11) Taxonomy DB -- Taxonomical organism classification.

12) UniProt (Swiss-Prot and TrEMBL) -- Protein Knowledge Base.

Typically, many of the source database attributes are copied into the warehouse either verbatim or with minor transformations (e.g., converting from the dalton unit stored in a source database to the kilodalton unit used within the BioWarehouse).

The few source attributes that are not represented in the warehouse are generally ignored, although some attributes are inferred from the raw data, for example, in cases where a gene is clearly present but not stated explicitly in the source data.
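As a small illustration of such a minor transformation, the unit conversion mentioned above might look like the sketch below. The function name and the sample mass are hypothetical; only the conversion factor (1 kilodalton = 1000 daltons) is fixed.

```python
DALTONS_PER_KILODALTON = 1000.0


def daltons_to_kilodaltons(mass_da: float) -> float:
    """Convert a molecular mass from daltons to kilodaltons."""
    return mass_da / DALTONS_PER_KILODALTON


# Hypothetical molecular weight as it might appear in a source record.
source_mass_da = 36849.0
warehouse_mass_kda = daltons_to_kilodaltons(source_mass_da)
```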

Current BioWarehouse loaders are implemented in both the C and Java languages. C-based MySQL loaders interface with MySQL using its C Application Programming Interface (API).

Similarly, the C-based Oracle loaders interface with Oracle using the Oracle Pro*C precompiler. Java-based loaders use the Java Database Connectivity (JDBC) API to interface with the database management system (DBMS).

Each of these APIs allows SQL to be embedded and/or generated within its source language.
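The same pattern of embedding and generating SQL from a host language can be sketched with Python's standard DB-API. This is a stand-in chosen so the example runs without a database server: the actual loaders use the MySQL C API, Pro*C, or JDBC, and the in-memory SQLite database, the enzyme_reaction table, and its columns are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL or Oracle instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enzyme_reaction (ec_number TEXT PRIMARY KEY, name TEXT)"
)

# Rows as a loader might produce them after parsing a source file.
rows = [
    ("1.1.1.1", "Alcohol dehydrogenase"),
    ("1.1.1.2", "Alcohol dehydrogenase (NADP(+))"),
]

# Embedded SQL: a fixed statement with placeholders, executed once per row.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO enzyme_reaction (ec_number, name) VALUES (?, ?)", rows
    )

# Generated SQL: the statement text is assembled by the host program.
table = "enzyme_reaction"  # trusted constant, never user input
count_sql = f"SELECT COUNT(*) FROM {table}"
loaded = conn.execute(count_sql).fetchone()[0]
```

Placeholders keep data values out of the statement text, so generated SQL should be limited to identifiers the program itself controls.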

BioWarehouse Schema -- The BioWarehouse schema is designed to capture as much of the data of each component DB as possible within a uniform representation.

For example, in encoding data from a set of source DBs pertaining to proteins, BioWarehouse uses a single set of schema definitions that spans all attributes of proteins found across this set of DBs.

This approach eliminates the semantic heterogeneity present in these DBs, allowing users to query all protein sequence DBs using the same schema.

Such sharing of tables is applied wherever practical. The translation from the component DB to the warehouse is achieved by the DB loaders, which convert the conceptualization used in each component DB into the conceptualization used by the warehouse schema.
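A minimal sketch of this idea follows, assuming a much simplified two-table layout. The table names, column names, source field names, and sample records are hypothetical and are not the actual BioWarehouse schema; SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    wid  INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE protein (
    wid         INTEGER PRIMARY KEY,
    dataset_wid INTEGER NOT NULL REFERENCES dataset(wid),
    name        TEXT NOT NULL,
    aa_sequence TEXT
);
""")

# Two hypothetical sources that name the same attributes differently.
source_a = [{"protein_name": "Alcohol dehydrogenase", "seq": "MSTAG"}]
source_b = [
    {"DE": "Alcohol dehydrogenase", "SQ": "MSTAGKVIK"},
    {"DE": "Enolase", "SQ": "MSKIV"},
]


def load(conn, dataset_name, records, name_key, seq_key):
    """Register a dataset, then map its records onto the shared protein table."""
    cur = conn.execute("INSERT INTO dataset (name) VALUES (?)", (dataset_name,))
    dataset_wid = cur.lastrowid
    conn.executemany(
        "INSERT INTO protein (dataset_wid, name, aa_sequence) VALUES (?, ?, ?)",
        [(dataset_wid, r[name_key], r[seq_key]) for r in records],
    )


load(conn, "SourceA", source_a, "protein_name", "seq")
load(conn, "SourceB", source_b, "DE", "SQ")

# One query spans both sources because they share a single table.
hits = conn.execute(
    """SELECT d.name, p.name, LENGTH(p.aa_sequence)
       FROM protein p JOIN dataset d ON d.wid = p.dataset_wid
       WHERE p.name = ?
       ORDER BY d.name""",
    ("Alcohol dehydrogenase",),
).fetchall()
```

Each loader only needs to know how its own field names map onto the shared columns; the query side never sees the source-specific names.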

Documentation -- For each loader, there are two pieces of documentation: how to build and run the loader, and a manual for developers describing the details of the loader implementation and schema mappings.

Usage and Obtaining the BioWarehouse --

BioWarehouse can be used in two ways:

1) Users can query the public BioWarehouse server, PublicHouse, which is maintained by SRI International, by issuing SQL queries over the Internet.

Note: PublicHouse is a publicly queryable set of biological databases constructed using the BioWarehouse. It provides an environment for large-scale data mining using SQL statements issued across the Internet.

2) Users can also download the BioWarehouse software distribution to create their own BioWarehouse instance containing the subset of supported BioWarehouse DBs that are of interest. This approach allows access to DBs that SRI cannot redistribute and lets each user control when new DB versions are loaded.

Users can also run their BioWarehouse instance on large hardware configurations and add proprietary data to it.

The Open Source release of the BioWarehouse is distributed as a zip file.

System Requirements

Contact manufacturer for complete information.

Oracle load times are for a 2.66 GHz Pentium with 2 GB of memory, with C loaders running locally on the server and Java loaders running remotely from a 1.5 GHz Pentium 4 client with 1 GB of memory.

MySQL load times are for a 1.5 GHz Pentium 4 client with 1 GB of memory running Debian Linux version 3.1, networked with a similar server.

Manufacturer

Manufacturer Web Site BioWarehouse

Price Contact manufacturer.

G6G Abstract Number 20238

G6G Manufacturer Number 102506