Whatizit
Category Cross-Omics>Workflow Knowledge Bases/Systems/Tools and Cross-Omics>Data/Text Mining Systems/Tools
Abstract Whatizit is a suite of modules that analyze text, such as scientific publications or Medline abstracts, for the information it contains.
It lets you run ‘text mining’ tasks on text you supply. The tasks are defined by the pipelines (workflows) listed in the drop-down list, and the text can be pasted into the ‘text area’ provided.
The description of each ‘individual task/pipeline’ can be found by following the link next to the submit button (see ‘Whatizit Pipeline – Description’ below).
Whatizit is also a Medline abstracts retrieval/search engine. Instead of providing the text by ‘Copy & Paste’, you can launch a Medline search. The abstracts that match your ‘search criteria’ are retrieved and processed by a pipeline (workflow) of your choice.
Whatizit is great at identifying ‘molecular biology’ terms and linking them to publicly available databases. Identified terms are wrapped with Extensible Markup Language (XML) tags that carry additional information, such as the ‘primary keys’ to the databases where all the relevant information is kept.
The wrapping XML is translated into HTML hypertext links. This service is highly appreciated by researchers who are reading literature and need to quickly find more information about a particular term, e.g. its UniProt id.
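The translation from annotation tags to hypertext links can be pictured with a minimal Python sketch. The namespace URI, the `z:uniprot` tag, the `ids` attribute, the key `ID123` and the link prefix are all assumptions made for illustration; the actual Whatizit output schema may differ.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical namespace URI, tag name and attribute name (not taken from Whatizit).
Z_NS = "http://example.org/whatizit"

# Hand-written sample of annotated text; "ID123" is a placeholder primary key.
SAMPLE = (
    f'<text xmlns:z="{Z_NS}">Expression of '
    '<z:uniprot ids="ID123">NPY</z:uniprot> was measured.</text>'
)

def to_html(xml_string, url_prefix="https://www.uniprot.org/uniprot/"):
    """Replace each annotated term with an HTML link built from its primary key."""
    root = ET.fromstring(xml_string)
    parts = [escape(root.text or "")]
    for child in root:
        key = child.get("ids", "")
        term = escape(child.text or "")
        if key:
            parts.append(f'<a href="{escape(url_prefix + key)}">{term}</a>')
        else:
            parts.append(term)  # no key available: keep the plain term
        parts.append(escape(child.tail or ""))  # text following the annotation
    return "".join(parts)

print(to_html(SAMPLE))
```

The sketch only handles annotations that sit directly under the root element; nested annotations would need a recursive walk.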
Whatizit is also available as 1) a Web Service and as 2) a Streamed Servlet. The Web Service allows you to enrich content within your website in a way similar to Wikipedia. The Streamed Servlet allows you to process large amounts of text.
1) Web Services -- Web Services is an integration technology whose underlying idea is to ensure that software from various sources works well together.
This technology is built on open standards, such as Simple Object Access Protocol (SOAP), a messaging protocol for transporting information, and Web Services Description Language (WSDL), a standard method of describing Web Services and their capabilities.
For the transport layer itself, Web Services uses most of the commonly available network protocols, especially Hypertext Transfer Protocol (HTTP).
2) Whatizit Streamed Servlet -- Whatizit can be accessed from your favorite programming language, although the manufacturer only provides support for Java (they are Java gurus).
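As a rough illustration of the Web Service route (SOAP carried over HTTP), the Python sketch below builds a SOAP 1.1 envelope and wraps it in an HTTP POST request without sending it. The endpoint URL, the service namespace, the operation name `annotate` and the parameter names are hypothetical; the real operations are the ones described by the service's WSDL.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/whatizit"    # hypothetical service namespace
ENDPOINT = "http://example.org/whatizit/ws"   # hypothetical endpoint URL

def build_envelope(pipeline, text):
    """Serialize a SOAP envelope carrying a pipeline name and the text to annotate."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}annotate")  # hypothetical operation
    ET.SubElement(call, "pipelineName").text = pipeline      # hypothetical parameter
    ET.SubElement(call, "text").text = text                  # hypothetical parameter
    return ET.tostring(envelope, encoding="utf-8")

def build_request(pipeline, text):
    """Wrap the envelope in an HTTP POST request; the request is not sent here."""
    return urllib.request.Request(
        ENDPOINT,
        data=build_envelope(pipeline, text),
        headers={"Content-Type": "text/xml; charset=utf-8", "SOAPAction": '""'},
        method="POST",
    )

request = build_request("whatizitSwissprot", "NPY levels were elevated.")
print(request.get_method(), request.full_url)
```

In practice a SOAP client library that reads the WSDL would generate these messages for you; the manual version is shown only to make the layering visible.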
In general, any vocabulary of up to 500k terms can be easily integrated into a Whatizit pipeline. Whatizit is very good at identifying formalized language patterns, that is, specialized, syntactically formalized technical notation.
The annotation speed of a given pipeline is almost independent of the size of the vocabulary behind it. Annotation is currently based on ‘pattern matching’; as a result, quite a few spurious matches are highlighted, because many terms (e.g. protein names) resemble normal English words or acronyms that also have other meanings. The manufacturers are actively working on the disambiguation of these terms.
In addition, several vocabularies can be integrated in a ‘single pipeline’, as is the case with Swissprot and GO terms in the whatizitSwissprotGo pipeline.
Examples of already integrated vocabularies are Swissprot, the Gene Ontology (GO), the NCBI's taxonomy and Medline Plus.
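One way to see why dictionary matching can be nearly independent of vocabulary size, and why it produces spurious hits, is the Python sketch below. It is not Whatizit's implementation: it assumes a hash-table lookup of word n-grams, and the vocabulary and identifiers are invented for the example.

```python
import re

def build_index(vocabulary):
    """Map each lower-cased term (as a tuple of words) to its identifier."""
    index = {tuple(term.lower().split()): ident for term, ident in vocabulary.items()}
    max_len = max((len(key) for key in index), default=0)
    return index, max_len

def annotate(text, index, max_len):
    """Greedy longest-match lookup; each lookup is a constant-time hash probe."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in index:
                hits.append((" ".join(tokens[i:i + n]), index[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return hits

# Invented vocabulary; the identifiers are placeholders, not database keys.
vocab = {"neuropeptide Y": "PROT:1", "CAT": "PROT:2"}
index, max_len = build_index(vocab)
print(annotate("The cat study measured neuropeptide Y levels.", index, max_len))
```

The work per token depends on the longest term, not on the number of terms, which is why adding vocabulary barely affects speed. The ordinary word "cat" is matched as well, which is the kind of spurious hit described above.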
Whatizit Pipeline - Description --
1) whatizit_Abner - Annotation done with the Abner package. [ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for ‘molecular biology text mining’. It automatically tags genes, proteins and other entity names in text. At its core is a ‘machine learning’ system using conditional random fields with a variety of orthographic and contextual features].
2) whatizitSwissprotGo - Combination of whatizitSwissprot and whatizitGo.
3) whatizitSwissprot - Pipeline for the annotation of Swissprot protein/gene names, linking the findings to (www.uniprot.org) for further exploration. The pipeline includes some disambiguation based on acronym resolution and term frequency.
The context of Protein/Gene names that resemble acronyms, for example NPY (neuropeptide Y), is analyzed for features that help determine whether the target name is really a Protein/Gene name.
If it is unclear from the context, then the pipeline will assume that names which have a high frequency in the British National Corpus (www.natcorp.ox.ac.uk) are common enough Not to be considered relevant in the biomedical field. A well-chosen Protein/Gene name is therefore more likely to be picked up by text mining tools.
4) whatizitProteinInteraction - Protein Interaction. It is the pipeline used in Protein Corral (www.ebi.ac.uk/Rebholz-srv/pcorral). The results are pure XML and are Not transformed into HTML by the application, so there is No user level presentation.
5) whatizitUkPmcGenesProteins - whatizitSwissprot with a maximum-likelihood-based (ML-based) filter.
6) whatizitOscar3 - Chemical entities annotated by Oscar3.
7) whatizitSwissprotFilter - whatizitSwissprot with a ML-based filter.
8) whatizitDisease - Pipeline for the annotation of ‘disease names’ linking the findings to (www.healthcentral.com).
9) whatizitProteinInteractionPMID - Protein Interaction. It is the pipeline used in Protein Corral (www.ebi.ac.uk/Rebholz-srv/pcorral). The results are pure XML and are Not transformed into HTML by the application, so there is No user level presentation.
10) whatizitQbmarsdf – ‘Medline Abstract Retrieval Engine’ based on the Text Mining Index. Input: Lucene query, Output: a list of Medline Abstracts.
11) whatizitProteinBiolexHuman - Protein tagger based on the Biolexicon and the dictionary filter for human species (to be used with a ‘Lucene Query’).
12) whatizitUkPmcGoterms - Pipeline for the annotation of terms belonging to the Gene Ontology (GO) and its three (3) branches, Molecular Function, Biological Process and Cellular Component.
13) whatizitCheponer - Cheponer (Chemical European Patent Office Named Entity Recognition). A NER (Named Entity Recognition) system trained to recognize chemical entities similar to those found in the European Patent Office documents. Text to be tagged should be nested in cners_to_tag tags. If the system finds a likely chemical entity, it will nest it in a z:cheponer tag.
14) whatizitCALBCFilterTerm - CALBC module to retrieve sentences by an annotated term.
15) whatizitDiseaseUMLSDict - Disease tagger based on the Unified Medical Language System (UMLS) lexicon and a dictionary approach.
16) whatizitMetamap – A tagger based on MetaMap. (MetaMap is an online application that maps text to UMLS Metathesaurus concepts, which supports interoperability among different languages and systems within the biomedical domain).
17) whatizitUkPmcAll - This pipeline is a ‘merged pipeline’ of whatizitUkPmcGenesProteins, whatizitUkPmcSpecies, and whatizitUkPmcGoterms.
18) whatizitOrganisms - Pipeline for the annotation of terms belonging to the NCBI Taxonomy. The findings are linked to (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=taxonomy).
19) whatizitEBIMed - Combination of whatizitSwissprot, whatizitGo, whatizitOrganisms and whatizitDrugs. It is the pipeline used in EBIMed (www.ebi.ac.uk/Rebholz-srv/ebimed).
20) whatizitMeshUp - MeshUp annotation. (The annotation of biomedical texts using controlled vocabularies such as MeSH).
21) whatizitPathwaywiki - Tags pathways.
22) whatizitOrganismsFilter - whatizitOrganisms with a filter.
23) whatizitGORanked - Gene Ontology (GO) terms ranking at the sentence level based on an information theoretic approach.
24) whatizitUkPmcSpecies - whatizitOrganisms with a filter.
25) whatizitDrugs - Pipeline for the annotation of drug names linking the findings to (http://www.nlm.nih.gov/medlineplus/druginformation.html).
26) whatizitCALBCFilterId - Collaborative Annotation of a Large Biomedical Corpus (CALBC) module to retrieve sentences by id.
27) whatizitSwissprotDisease - Combination of whatizitSwissprot and whatizitDisease.
28) whatizitEBIMedDiseaseChemicals - Combination of whatizitSwissprot, whatizitGo, whatizitOrganisms and whatizitDrugs. It is the pipeline used in EBIMed (www.ebi.ac.uk/Rebholz-srv/ebimed) that, in addition, annotates ‘diseases and chemical entities’.
29) whatizitISCN - Identification of karyotypes, annotations available from the web services. z:karyo marks the whole string identified as a terminology for a karyotype, while z:iscn marks the parts of the ‘whole terminology’.
30) whatizitChebiDict - ChEBI entities annotated based on a dictionary approach.
31) whatizitChemicals - Chemical entities, Drugs and Protein names for EBIMed (chemicals).
32) whatizitSwissprotGo2 - Swissprot protein names and Gene Ontology (GO) terms Version 2: Combination of whatizitSwissprot and whatizitGO2.
33) whatizitGODict - Pipeline for the annotation of terms belonging to the Gene Ontology (GO) and its three (3) branches, Molecular Function, Biological Process and Cellular Component. The findings are linked to www.geneontology.org.
34) whatizitProteinDiseaseUMLS - Protein tagger based on Swissprot and disease tagger based on the UMLS lexicon.
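The disambiguation rule described for whatizitSwissprot (look at the context first, then fall back on general-English frequency) might be sketched in Python as follows. The cue words, the threshold and the frequency counts are invented for illustration and are not taken from Whatizit or from the British National Corpus.

```python
# Hypothetical context features that suggest a Protein/Gene reading.
CUE_WORDS = {"protein", "gene", "receptor", "expression"}

def keep_annotation(term, context, general_freq, threshold=100):
    """Decide whether a candidate Protein/Gene name should stay annotated.

    1. If the context contains a cue word, keep the annotation.
    2. Otherwise drop names that are frequent in a general-English corpus.
    """
    context_words = {w.lower().strip(".,;()") for w in context.split()}
    if context_words & CUE_WORDS:
        return True
    return general_freq.get(term.lower(), 0) < threshold

# Toy frequency table (made-up counts standing in for a general corpus).
toy_freq = {"cat": 5000, "npy": 0}

print(keep_annotation("CAT", "the cat sat on the mat", toy_freq))
print(keep_annotation("CAT", "CAT gene expression was elevated", toy_freq))
print(keep_annotation("NPY", "levels of NPY rose", toy_freq))
```

With these toy values, the first call rejects the annotation and the other two keep it. A real filter would use richer context features than a fixed cue-word list.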
System Requirements
Web-based.
Manufacturer
- European Bioinformatics Institute (EBI)
- Wellcome Trust Genome Campus
- Hinxton
- Cambridge
- CB10 1SD
- UK
- Tel: +44 (0)1223 494 444
- Fax: +44 (0)1223 494 468
- E-mail: textmining-support@ebi.ac.uk
Manufacturer Web Site Whatizit
Price Contact manufacturer.
G6G Abstract Number 20539
G6G Manufacturer Number 104154