Textpresso
Category Cross-Omics>Data/Text Mining Systems/Tools
Abstract Textpresso is a text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine.
Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and a set of categories of terms for which a database of articles and individual sentences can be searched.
The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype) and classes that relate two objects (e.g., association, regulation) or describe one (e.g., biological process).
Together they form a catalog of types of objects and concepts called an ontology.
After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms.
A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document. Because the ontology allows word meaning to be queried, it is possible to formulate semantic queries.
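The sentence-level markup and category search described above can be illustrated with a minimal sketch. This is not Textpresso's actual code or ontology: the three-category `ONTOLOGY`, the sample text, and the functions `split_sentences`, `mark_up`, and `search` are all hypothetical names chosen for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical mini-ontology (illustrative only; the real ontology has 33 categories).
ONTOLOGY = {
    "gene": {"lin-12", "glp-1", "let-23"},
    "association": {"interacts", "binds"},
    "regulation": {"represses", "activates"},
}

def tokenize(sentence):
    """Lowercase the sentence and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9-]+", sentence.lower()))

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def mark_up(sentence):
    """Return {category: set of matched terms} for one sentence."""
    tokens = tokenize(sentence)
    tags = {}
    for category, terms in ONTOLOGY.items():
        hits = tokens & terms
        if hits:
            tags[category] = hits
    return tags

def search(text, categories, keywords=()):
    """Return sentences that contain at least n distinct terms of each
    requested category and every requested keyword."""
    results = []
    for sentence in split_sentences(text):
        tags = mark_up(sentence)
        tokens = tokenize(sentence)
        has_categories = all(
            len(tags.get(category, ())) >= n for category, n in categories.items()
        )
        has_keywords = all(k.lower() in tokens for k in keywords)
        if has_categories and has_keywords:
            results.append(sentence)
    return results

# Made-up sample text, not taken from any corpus.
TEXT = (
    "The gene lin-12 interacts with glp-1 in the gonad. "
    "Mutants of let-23 show a vulvaless phenotype. "
    "We thank our colleagues."
)

# Semantic query: two distinct gene names plus one association term in a sentence.
interaction_hits = search(TEXT, {"gene": 2, "association": 1})

# Mixed query: one gene name plus the plain keyword "phenotype".
phenotype_hits = search(TEXT, {"gene": 1}, keywords=["phenotype"])
```

In this sketch a query names categories rather than specific words, so the first query matches any sentence mentioning two genes and an association term without the user listing the gene names.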
Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators at identifying sentences. In searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase in search efficiency.
Textpresso is useful as a search engine for researchers as well as a curation tool. It was developed as a part of WormBase and is used extensively by C. elegans curators.
Textpresso has currently been implemented for 17 different literatures, and can readily be extended to other corpora of text.
1) Textpresso for C. elegans -- WormBase is the database of the model organism Caenorhabditis elegans.
2) Textpresso for D. melanogaster (FlyBase) -- Currently, it contains approximately 20,000 full text papers and 39,000 abstracts.
3) Textpresso for Neuroscience -- A collaboration/contract work that is part of the Neuroscience Information Framework (NIF), supported by the NIH Neuroscience Blueprint via NIDA.
4) Textpresso for Arabidopsis -- Developed in collaboration with The Arabidopsis Information Resource (TAIR), which maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana.
5) Textpresso for Dicty -- Developed in collaboration with dictyBase, a central resource for the biology and genomics of the social amoeba Dictyostelium discoideum.
6) Textpresso for Rat --
7) Textpresso for Zebrafish --
8) Textpresso for Nematode --
9) Textpresso for Alzheimers --
10) Textpresso for S. cerevisiae -- SGD is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.
11) Textpresso for T. thermophila -- TGD is a web-accessible database of information about the Tetrahymena thermophila genome sequence determined at The Institute for Genomic Research (TIGR), now part of the J. Craig Venter Institute.
12) Textpresso for O. sativa --
13) Textpresso for Brucella --
14) Textpresso for Regulon DB --
15) Textpresso for OHSU Fungal --
16) Textpresso for Vaxpresso (for vaccine research) -- Vaxpresso is a Textpresso-powered vaccine literature mining program. It contains all vaccine-related papers on selected pathogens extracted from PubMed. The natural language processing (NLP) and ontology-based Textpresso system is then used to process the literature data.
17) Textpresso for Pharmspresso -- Pharmspresso is an information retrieval and extraction system for pharmacogenomic-related literature.
Textpresso was initially developed by Hans-Michael Muller, Eimear Kenny and Paul W. Sternberg, with contributions from Juancarlos Chan and David Chen.
The current version (officially known as Textpresso 2.0) was developed by Hans-Michael Muller with contributions from Arun Rangarajan and Tracy K. Teal.
System Requirements
Contact manufacturer.
Manufacturer
- California Institute of Technology
- WormBase
- 1200 East California Boulevard
- Pasadena, California 91125
- E-mail: textpresso@caltech.edu
Manufacturer Web Site Textpresso
Price Contact manufacturer.
G6G Abstract Number 20254
G6G Manufacturer Number 100458