PLant ANnotation to Literature (PLAN2L)

Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract PLAN2L is an automatic bio-text mining system developed for the plant model organism Arabidopsis thaliana. It aims to enable more efficient retrieval of biologically relevant information related to protein interactions, regulatory events, and some of the prominent biological processes.

PLAN2L incorporates information extraction of individual entities, together with retrieval of protein interaction relations and gene regulatory associations. It ranks each of these biological objects according to its relevance for central developmental processes studied in higher plants, namely flowering, leaf, root, and seed development.

At the cellular level, the manufacturer prioritizes each Arabidopsis gene by its implication in the cell cycle process, providing ranked links to the corresponding evidence texts together with co-mentioned cell cycle terms.

Spatial information in terms of sub-cellular location of proteins can be useful to understand the functional properties and interaction network of a particular protein; therefore PLAN2L integrates a localization retrieval module for finding location evidence descriptions.

PLAN2L Preprocessing and article retrieval -- In order to extract more fine-grained information at the level of bio-entities, the literature collection relevant to the studied model organism needs to be gathered first.

This was accomplished using a document retrieval pipeline that takes into account several sources of evidence for determining whether a given article is associated with A. thaliana:

1) External references derived from multiple databases providing annotations and literature references for A. thaliana genes;

2) Organism and taxonomic name tagging using dictionary look-up based on a species lexicon derived from the NCBI Taxonomy that was automatically extended using a rule-based approach to account for typographical variants and abbreviations of species names;

3) Keyword-based retrieval from PubMed and PubMed Central.

The fraction of Arabidopsis mentions out of all tagged organism mentions co-occurring in the article is used to score how specific the article is for this plant model organism.
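The specificity score described above can be sketched roughly as follows. This is a minimal illustration, not the manufacturer's implementation: the tiny species lexicon, the function names tag_species and arabidopsis_specificity, and the choice to count every mention equally are all assumptions made for the example.

```python
import re

# Hypothetical mini-lexicon: surface form -> canonical species name.
# A real lexicon would be derived from the NCBI Taxonomy and extended with variants.
SPECIES_LEXICON = {
    "arabidopsis thaliana": "Arabidopsis thaliana",
    "a. thaliana": "Arabidopsis thaliana",
    "arabidopsis": "Arabidopsis thaliana",
    "oryza sativa": "Oryza sativa",
    "o. sativa": "Oryza sativa",
    "zea mays": "Zea mays",
}

# Longest surface forms first, so "arabidopsis thaliana" wins over "arabidopsis".
_FORMS = sorted(SPECIES_LEXICON, key=len, reverse=True)
_SPECIES_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(form) for form in _FORMS) + r")(?!\w)",
    re.IGNORECASE,
)


def tag_species(text):
    """Return the canonical species name for every lexicon hit in the text."""
    return [SPECIES_LEXICON[m.group(1).lower()] for m in _SPECIES_PATTERN.finditer(text)]


def arabidopsis_specificity(text):
    """Fraction of tagged organism mentions that refer to Arabidopsis thaliana."""
    mentions = tag_species(text)
    if not mentions:
        return 0.0
    return mentions.count("Arabidopsis thaliana") / len(mentions)
```

An article that mentions only Arabidopsis scores 1.0 under this sketch, while an article that mentions it alongside several other species scores proportionally lower.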

Additionally, a full-text collection of Arabidopsis-related articles was constructed from a local repository of open access full-text articles and by using an in-house retrieval system to collect articles.

Plain text conversion was carried out through a combination of systems including pdftotext, an open source command-line utility for converting PDF files to plain text files.
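As an illustration of the conversion step, the sketch below wraps a pdftotext call in Python. The helper names, the file paths, and the choice of the -enc and -nopgbrk options are assumptions for the example; the source does not say which options were used or which other systems were combined with pdftotext.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    """Build the argument list for one pdftotext invocation."""
    # -enc UTF-8 sets the output text encoding;
    # -nopgbrk suppresses the form-feed characters inserted between pages.
    return ["pdftotext", "-enc", "UTF-8", "-nopgbrk", pdf_path, txt_path]


def convert_pdf(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
```

Calling convert_pdf("article.pdf", "article.txt") requires pdftotext to be installed and available on the PATH.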

Both abstracts and full-text articles were then further processed using a rule-based sentence boundary detection module implemented in Python, specifically adapted to handle biomedical articles.
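A rule-based sentence splitter of this kind might look like the sketch below. The rules shown (a short abbreviation list and a check for single-letter initials such as genus abbreviations) are assumptions chosen for illustration; the actual rules of the PLAN2L module are not described in the source.

```python
import re

# Hypothetical abbreviation list; a real module would use a much larger one.
ABBREVIATIONS = {"fig", "figs", "e.g", "i.e", "cf", "vs", "ref", "no"}

# Candidate boundary: ., ! or ? followed by whitespace and an upper-case
# letter, a digit, or an opening parenthesis.
CANDIDATE = re.compile(r"([.!?])\s+(?=[A-Z0-9(])")


def split_sentences(text):
    """Split text into sentences, skipping periods that end abbreviations."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        if match.group(1) == ".":
            words = text[start:match.start()].split()
            last = words[-1].lstrip("(") if words else ""
            # Single upper-case letters are treated as initials (e.g. "A. Smith").
            is_initial = len(last) == 1 and last.isupper()
            if last.lower() in ABBREVIATIONS or is_initial:
                continue
        # Keep the punctuation mark with the sentence it closes.
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Species names such as "A. thaliana" are left intact here because the word after the period starts with a lower-case letter, so no candidate boundary is proposed at all.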

Gene/protein mention normalization -- An important step for the extraction of protein and gene annotations is the detection of links between the literature and concrete biological entities (for instance, as provided in annotation databases). This step is often referred to as protein or gene mention normalization.

The manufacturer’s protein normalization approach is based on the construction and look-up of a gene and protein lexicon, followed by a protein normalization scoring/disambiguation approach.

The gene dictionary integrated A. thaliana gene names and symbols from several sources: multiple databases, including TAIR and SwissProt; a collection of gene and protein names identified by a 'machine learning named entity recognition program' (ABNER); and a rule-based approach that considers morphological cues and name length to identify potential Arabidopsis gene symbols.
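The combination of lexicon look-up and rule-based symbol detection can be sketched as follows. Everything specific here is an assumption for illustration: the four-entry lexicon, the stoplist, and the symbol pattern (two to five capital letters followed by at most two digits) are not taken from PLAN2L. Only the AGI locus identifier format (for example AT5G10140) is a widely used Arabidopsis convention.

```python
import re

# Hypothetical lexicon: lower-cased name or symbol -> locus identifier.
# Illustrative entries only; a real lexicon would be built from TAIR, SwissProt, etc.
GENE_LEXICON = {
    "ap1": "AT1G69120",
    "apetala1": "AT1G69120",
    "flc": "AT5G10140",
    "flowering locus c": "AT5G10140",
}

# Upper-case tokens that look like gene symbols but are not (assumed stoplist).
STOPLIST = {"DNA", "RNA", "PCR", "GFP"}

# AGI locus identifiers: AT + chromosome (1-5, C or M) + G + five digits.
AGI_LOCUS = re.compile(r"\bAT[1-5CM]G\d{5}\b", re.IGNORECASE)

# Assumed morphological cue: 2-5 capital letters, optionally followed by 1-2 digits.
SYMBOL_LIKE = re.compile(r"\b[A-Z]{2,5}\d{0,2}\b")


def find_gene_mentions(sentence):
    """Return (mention, locus_or_None, evidence_type) tuples for one sentence."""
    hits = [(m.group(0), m.group(0).upper(), "locus-id") for m in AGI_LOCUS.finditer(sentence)]
    lowered = sentence.lower()
    for name, locus in GENE_LEXICON.items():
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", lowered):
            hits.append((name, locus, "lexicon"))
    for match in SYMBOL_LIKE.finditer(sentence):
        token = match.group(0)
        if token.lower() in GENE_LEXICON or token in STOPLIST:
            continue
        # Symbol-shaped token with no lexicon entry: keep as an unnormalized candidate.
        hits.append((token, None, "candidate"))
    return hits
```

Candidates without a lexicon entry would still need the scoring and disambiguation step mentioned above before being linked to a database record.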

PLAN2L Gene regulation -- Regulation of ‘gene expression’ is a fundamental cellular control process that involves complex interactions between genes, transcription factors (proteins) and other biological entities.

To extract such complex relations, where the correct identification of directionality of the event (i.e. regulator and regulated gene) plays an important role, the manufacturer adapted an Information Extraction (IE) architecture relying on a pipeline of semantic/syntactic rules.

The manufacturer applied part-of-speech (POS) tagging to each word using a GENIA-trained version of TreeTagger.

Then a module was used that substituted some of the POS tags with more semantically oriented labels, such as org (organism), nnpg (protein/gene name), actv (activation verb), etc.
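The tag substitution step can be pictured with the short sketch below. The label names org, nnpg, and actv come from the description above; the input format (token and Penn Treebank-style POS tag pairs), the small dictionaries, and the function name relabel are assumptions made for the example.

```python
# Hypothetical dictionaries; real ones would come from the species and gene lexicons.
ORGANISMS = {"arabidopsis"}
GENE_NAMES = {"lfy", "ap1"}
ACTIVATION_VERBS = {"activate", "activates", "activated", "induce", "induces", "induced"}


def relabel(tagged_tokens):
    """Replace selected POS tags with semantic labels, leaving the rest unchanged."""
    relabeled = []
    for token, pos in tagged_tokens:
        key = token.lower()
        if key in GENE_NAMES:
            label = "nnpg"   # protein/gene name
        elif key in ORGANISMS:
            label = "org"    # organism
        elif pos.startswith("VB") and key in ACTIVATION_VERBS:
            label = "actv"   # activation verb
        else:
            label = pos      # keep the syntactic tag
        relabeled.append((token, label))
    return relabeled
```

The output keeps one label per token, so a downstream grammar can match patterns such as "nnpg actv nnpg" while still seeing ordinary POS tags for the remaining words.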

For this Named-Entity Recognition task the manufacturer used dictionaries that describe the gene lexicon. The text with mixed syntactic and semantic tags was fed into a SCOL parser that generated a tree-like structure by applying a modified CASS grammar originally developed for the STRING-IE system.

Additionally, the manufacturer constructed a 'high recall system' for ranking sentences related to transcription, gene regulation and expression.

This system is based on a Support Vector Machine (SVM) with a radial basis kernel. It uses a bag-of-words representation and is trained on a collection of sentences labeled as relevant or non-relevant to gene regulation.
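To make the two ingredients named above concrete, the sketch below builds a bag-of-words representation and evaluates a radial basis kernel between two sentences. It covers only the feature representation and the kernel, not SVM training or ranking. The tokenization rule and the gamma value are assumptions for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Count lower-cased alphanumeric tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def rbf_kernel(bag_a, bag_b, gamma=0.1):
    """Radial basis kernel exp(-gamma * ||a - b||^2) over two token-count bags."""
    vocabulary = set(bag_a) | set(bag_b)
    # Counter returns 0 for tokens that are missing from a bag.
    squared_distance = sum((bag_a[t] - bag_b[t]) ** 2 for t in vocabulary)
    return math.exp(-gamma * squared_distance)
```

Identical sentences give a kernel value of 1.0, and the value decays toward 0 as the word counts diverge. An SVM with this kernel scores a new sentence by a weighted sum of such similarities to its support vectors.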

PLAN2L Protein Interaction -- There is increasing interest in characterizing the Arabidopsis thaliana protein interactome from a systems biology perspective.

The extraction of protein interaction evidence associations was addressed using a machine learning sentence classifier approach relying on manually selected interaction evidence sentences.

The sentence classifier relies on an SVM algorithm trained on a set of manually classified interaction evidence passages derived from a collection used at the second BioCreative challenge.

PLAN2L Sub-cellular location evidence -- To retrieve protein localization description sentences, the manufacturer explored two approaches: semantic-syntactic frames for extracting fine-grained associations between proteins and subcellular location mentions, and a 'machine learning sentence classifier' for retrieving protein localization description sentences in general.

PLAN2L Cellular and developmental processes -- A central component of PLAN2L is the scoring of each evidence sentence according to its relevance for complex temporal biological events (topics), at the cellular level (cell cycle) as well as at the level of developmental processes.

The manufacturer therefore implemented a classifier for scoring cell cycle relevant abstracts and document passages.

The full-text passage classifier models were applied to classify and score the Arabidopsis full-text passages using a sliding window approach, resulting in collections of 2,987,342 cell cycle-scored windows of 5 sentences and 2,971,840 cell cycle-scored windows of 7 sentences.
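The sliding window approach can be sketched as below. The step of one sentence and the handling of documents shorter than the window (a single shorter passage) are assumptions; the source states only the window sizes of 5 and 7 sentences.

```python
def sliding_windows(sentences, size):
    """Return overlapping passages of `size` consecutive sentences, step 1."""
    if size <= 0:
        raise ValueError("window size must be positive")
    if not sentences:
        return []
    if len(sentences) <= size:
        # Assumed behavior: a short document yields one (shorter) passage.
        return [" ".join(sentences)]
    return [
        " ".join(sentences[i:i + size])
        for i in range(len(sentences) - size + 1)
    ]
```

Each passage returned by sliding_windows would then be scored by the passage classifier, so a given sentence contributes to several overlapping windows.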

The SVM text classifier was trained on a collection of cell cycle relevant abstracts and non-relevant abstracts and then applied to a literature collection of abstracts and full text articles mentioning A. thaliana genes.

Additionally, four (4) specific sentence classifiers cover the most relevant developmental processes in higher plants, namely flowering, leaf, root, and seed development.

PLAN2L provides a comprehensive approach to assist in the selection and ranking of genes, proteins, documents and terms relevant to a specific biological process for this model organism.

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site PLAN2L

Price Contact manufacturer.

G6G Abstract Number 20498

G6G Manufacturer Number 104119