GeneTUKit

Category Cross-Omics>Data/Text Mining Systems/Tools

Abstract GeneTUKit is a document-level gene normalization software system for full-text articles.

This software employs both local context surrounding gene mentions and global context from the whole full-text document.

It can normalize genes of different species simultaneously. Given a target article, the software outputs a list of normalized genes, and each predicted gene is associated with a confidence score.

When participating in BioCreAtIvE III, the system obtained good results among 37 runs: it ranked first, fourth and seventh in terms of TAP-20, TAP-10 and TAP-5, respectively, on the 507 full-text test articles.

GeneTUKit departs from previous systems --

The system differs from earlier ones in two (2) aspects:

1) First, it combines local and global contexts to normalize genes at the document-level.

The goal of this software is Not to normalize every “mention” correctly, but to suggest a list of normalized genes given a target document, to assist human annotators.

Most previous systems normalize genes at the mention level and employ only the local context surrounding a mention (e.g. the sentence where the mention was recognized).

However, because gene names are highly ambiguous, local context alone may be insufficient; inter-sentential or document-level context can help with this task.

2) Second, GeneTUKit is designed to normalize genes of many different species simultaneously in full-text articles.

It is Not limited to any specific organism, but rather deals with all species present in a gene database (such as Entrez Gene).

GeneTUKit has four (4) main modules --

The first module is for gene mention recognition, the second one for gene ID candidate generation and the third one for gene ID disambiguation.

In the fourth module, the software generates a confidence score for each gene ID, where the confidence score indicates the strength of the association between a gene ID and the document.

1) First module - The manufacturers used three (3) methods for recognizing gene mentions in the first module.

a) The first method is a conditional random field-based approach, which was trained on the training dataset of BioCreAtIvE II Gene Mention Recognition Task.
b) The second method is a dictionary-based recognition approach where the dictionary was compiled from Entrez Gene.
c) The third method is ABNER, an open source named entity recognition system for biomedical literature.

The input text is processed by each of these methods separately, and a mention is kept only if it is recognized by at least two methods.

If two mentions are similar but have different boundaries, the overlapping part is taken as the final mention.
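The voting rule above can be sketched as follows. This is a minimal illustration, not the manufacturers' code: the `Span` type and function names are hypothetical, "similar" is assumed to mean that two character spans overlap, and the source does Not state how a three-way boundary disagreement is resolved.

```python
from collections import namedtuple

# Hypothetical span type: character offsets, end exclusive.
Span = namedtuple("Span", ["start", "end"])


def overlap(a, b):
    """Return the overlapping part of two spans, or None if they are disjoint."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Span(start, end) if start < end else None


def vote_mentions(outputs):
    """Keep mentions found by at least two recognizers.

    `outputs` holds one list of Span objects per recognizer. Any pair of
    spans from two different recognizers that overlap counts as agreement
    (an assumption), and only the overlapping part is kept.
    """
    kept = set()
    for i, spans_a in enumerate(outputs):
        for spans_b in outputs[i + 1:]:
            for a in spans_a:
                for b in spans_b:
                    common = overlap(a, b)
                    if common is not None:
                        kept.add(common)
    return sorted(kept)


# Three recognizers: two roughly agree on one mention, the third finds another.
crf, dictionary, abner = [Span(4, 9)], [Span(5, 9)], [Span(20, 25)]
print(vote_mentions([crf, dictionary, abner]))
```

In this example only the span shared by the first two recognizers survives; the mention seen by a single recognizer is dropped.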

2) Second module - The second module generates gene ID candidates for a recognized mention. In this module, an open-source indexing package, Lucene, was used to index all the genes in Entrez Gene.

Each mention was then queried and the top 50 gene IDs were returned as candidates.

The text of the mentions and of the Entrez Gene entries was processed by the following rules, applied in sequence:

a) Removing special characters such as dashes and underscores;
b) Removing stop words;
c) Changing words such as ‘hBCL’ into ‘h BCL’;
d) Separating digits, Greek and Roman letters from alphabetic letters; and
e) Converting the text to lowercase letters.
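A rough sketch of rules a) to e) is given below. The stop-word list, the Greek-letter list and the regular expressions are assumptions chosen for illustration; the source does Not give the exact lists, and the separation of Roman numerals is omitted here because the source does Not say how they are told apart from ordinary letters.

```python
import re

# Illustrative lists only; the actual lists used by the system are not given.
STOP_WORDS = {"the", "of", "and", "a", "an", "in"}
GREEK = r"alpha|beta|gamma|delta|kappa"


def normalize(text):
    # a) Replace special characters such as dashes and underscores with spaces.
    text = re.sub(r"[-_]", " ", text)
    # b) Remove stop words.
    text = " ".join(t for t in text.split() if t.lower() not in STOP_WORDS)
    # c) Split a one-letter lowercase prefix from an uppercase symbol: hBCL -> h BCL.
    text = re.sub(r"\b([a-z])([A-Z]{2,})", r"\1 \2", text)
    # d) Separate digits and Greek letter names from alphabetic letters.
    text = re.sub(r"([A-Za-z])(\d)", r"\1 \2", text)
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)
    text = re.sub(GREEK, lambda m: f" {m.group()} ", text, flags=re.IGNORECASE)
    # e) Lowercase and collapse repeated whitespace.
    return " ".join(text.lower().split())


print(normalize("hBCL-2"))
print(normalize("NF-kappaB"))
```

Applying the same function to both the mention text and the gene database entries keeps the two sides of the index query comparable.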

3) Third module - The third module is for disambiguating gene IDs, which is accomplished by a ranking algorithm. The algorithm was trained on the 32 full-text articles provided by BioCreAtIvE III.

Each article has a list of tuples (gene mention, gene ID and species); however, the annotations did Not give the positions where a gene mention was recognized.

The training samples were generated as follows: for each gene ID candidate, if the gene ID appears in the manual annotation list, the candidate is taken as positive, otherwise negative.
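The labeling rule can be illustrated with a short sketch. The function name, the data layout and the gene IDs ("G1", "G2", "G3") are hypothetical placeholders, Not real Entrez Gene identifiers.

```python
def label_candidates(candidates_by_mention, gold_gene_ids):
    """Label each (mention, gene ID) candidate: 1 = positive, 0 = negative.

    A candidate is positive when its gene ID appears in the article's
    manual annotation list; mention positions are not needed.
    """
    samples = []
    for mention, candidate_ids in candidates_by_mention.items():
        for gene_id in candidate_ids:
            label = 1 if gene_id in gold_gene_ids else 0
            samples.append((mention, gene_id, label))
    return samples


# Placeholder IDs for illustration only.
candidates = {"BCL2": ["G1", "G2"], "p53": ["G3"]}
gold = {"G1", "G3"}
print(label_candidates(candidates, gold))
```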

For each gene ID candidate and its corresponding mention, the manufacturers extracted features from local and global contexts. Some local context features are as follows:

a) The ranking score of the gene ID given by the Lucene index.
b) Whether the species of the gene ID is implied by the gene mention, such as hBCL.
c) The edit distance between the mention and the official symbol of the gene ID.
d) The minimal edit distance between the mention and all synonyms of the gene ID.
e) Whether at least one word indicating gene functions of a gene ID appears in the sentences from which the mention was recognized.

The words indicating gene functions are obtained from the corresponding gene symbols after removing common words (such as protein, gene, etc.) and words containing capital letters or digits (e.g. VDR, p65).
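The two edit distance features can be sketched as follows, assuming the standard Levenshtein distance (the source does Not name the exact variant). The function names, feature keys and the fallback value used when a gene has no synonyms are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_features(mention, official_symbol, synonyms):
    """Distance to the official symbol and minimal distance over all synonyms."""
    return {
        "symbol_distance": edit_distance(mention, official_symbol),
        # Fallback for an empty synonym list is an assumption.
        "min_synonym_distance": min(
            (edit_distance(mention, s) for s in synonyms), default=len(mention)
        ),
    }


print(distance_features("bcl2", "bcl2l1", ["bclx", "bcl2"]))
```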

The document-level, global context features are partly listed as follows:

a) Whether the species of the gene ID appears in the document.
b) Whether the species of the gene ID appears in the title.
c) Whether the species of the gene ID is the nearest species in the same paragraph where the mention is recognized.
d) If the mention has a full (or abbreviated) name throughout the document, compute the minimal edit distance between synonyms of the gene ID and the full (or abbreviated) name of the mention.

In constructing these features, the manufacturers used dictionary-based matching to recognize species, as such a simple method can produce fairly good performance.
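Features a) to c) can be sketched with a toy species dictionary. The dictionary contents, function names and the use of character offsets to measure "nearest" are all assumptions; the source only states that dictionary-based matching was used.

```python
import re

# Hypothetical toy dictionary: surface form -> species name.
SPECIES_TERMS = {
    "human": "Homo sapiens",
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "yeast": "Saccharomyces cerevisiae",
}


def find_species(text):
    """Return (character offset, species) pairs found by dictionary lookup."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        species = SPECIES_TERMS.get(match.group().lower())
        if species:
            hits.append((match.start(), species))
    return hits


def species_features(candidate_species, title, document, paragraph, mention_offset):
    """Global species features for one gene ID candidate.

    `mention_offset` is the mention's character offset inside `paragraph`.
    """
    in_paragraph = find_species(paragraph)
    nearest = min(in_paragraph, key=lambda hit: abs(hit[0] - mention_offset),
                  default=None)
    return {
        "in_document": any(s == candidate_species for _, s in find_species(document)),
        "in_title": any(s == candidate_species for _, s in find_species(title)),
        "nearest_in_paragraph": nearest is not None and nearest[1] == candidate_species,
    }


title = "Role of BCL2 in mouse development"
paragraph = "In mice, BCL2 is induced, unlike in human cells."
document = title + " " + paragraph
offset = paragraph.index("BCL2")
print(species_features("Mus musculus", title, document, paragraph, offset))
```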

For finding full/abbreviated name mappings, the manufacturers adopted the method from: Schwartz A.S., Hearst M.A. A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of the 8th Pacific Symposium on Biocomputing. Kauai, Hawaii: World Scientific Publishing Co. Pte. Ltd; 2003.

Once the features were obtained, the manufacturers used the ListNet ranking algorithm to rank gene IDs for each mention, and only the top-ranked gene ID was kept for further processing.
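The idea behind ListNet can be sketched in a few lines: candidate scores for one mention are turned into top-one probabilities with a softmax, and the model is trained to minimize the cross entropy between the probabilities derived from the labels and those derived from the predicted scores. The sketch below assumes a linear scoring function trained by plain gradient descent on two made-up features (a retrieval score and an edit distance); the model, features, learning rate and data are illustrative assumptions, Not the manufacturers' configuration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def listnet_loss(weights, candidates, labels):
    """Cross entropy between top-one distributions of labels and predictions."""
    p_true = softmax(labels)
    p_pred = softmax([score(weights, f) for f in candidates])
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


def train(lists, n_features, lr=0.1, epochs=200):
    """Gradient descent; d(loss)/d(w) = sum_i (p_pred_i - p_true_i) * x_i."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for candidates, labels in lists:
            p_true = softmax(labels)
            p_pred = softmax([score(weights, f) for f in candidates])
            for f, t, p in zip(candidates, p_true, p_pred):
                for k in range(n_features):
                    weights[k] -= lr * (p - t) * f[k]
    return weights


def top_candidate(weights, candidates):
    """Index of the highest-scoring candidate for one mention."""
    scores = [score(weights, f) for f in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# One list per mention: feature vectors [retrieval score, edit distance], labels.
training_lists = [
    ([[0.9, 0.0], [0.5, 3.0], [0.2, 5.0]], [1, 0, 0]),
    ([[0.4, 4.0], [0.8, 1.0]], [0, 1]),
]
weights = train(training_lists, n_features=2)
print(top_candidate(weights, [[0.3, 4.0], [0.7, 1.0]]))
```

Only the index returned by `top_candidate` would be carried forward, mirroring the step where the top gene ID per mention is kept.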

4) Fourth module - The fourth module generates a confidence score for each predicted gene ID to measure the association of the given gene ID and the document using a support vector machine (SVM) classifier.

The training examples were constructed in the same way as in the third module.

The features were constructed as follows:

a) The best value of each feature used in the third module, since each gene ID may correspond to many mentions. For the edit distance features, ‘best’ means ‘minimal’; for the ranking score feature, ‘best’ means ‘maximal’;
b) The total number of gene mentions associated with the gene ID; and
c) The highest rank of the gene ID among all the mentions associated with the gene ID.
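The aggregation of per-mention features into one document-level feature vector per gene ID can be sketched as below. The record keys and function name are hypothetical, and rank 1 is assumed to denote the top position; the resulting values would then be passed to the SVM classifier, which is Not shown here.

```python
def aggregate(mention_records):
    """Collapse the per-mention features of one gene ID into document-level features.

    Each record describes the gene ID for one mention, with the keys
    'lucene_score', 'symbol_distance', 'min_synonym_distance' and 'rank'.
    """
    return {
        # 'best' means maximal for the ranking score ...
        "best_lucene_score": max(r["lucene_score"] for r in mention_records),
        # ... and minimal for the edit distance features.
        "best_symbol_distance": min(r["symbol_distance"] for r in mention_records),
        "best_synonym_distance": min(r["min_synonym_distance"] for r in mention_records),
        # Total number of mentions associated with the gene ID.
        "mention_count": len(mention_records),
        # Highest rank among all mentions (rank 1 = top, so take the minimum).
        "highest_rank": min(r["rank"] for r in mention_records),
    }


records = [
    {"lucene_score": 7.2, "symbol_distance": 2, "min_synonym_distance": 1, "rank": 2},
    {"lucene_score": 9.5, "symbol_distance": 0, "min_synonym_distance": 0, "rank": 1},
]
print(aggregate(records))
```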

System Requirements

Contact manufacturer.

Manufacturer

Department of Computer Science and Technology
Tsinghua University
Beijing, China

Manufacturer Web Site GeneTUKit

Price Contact manufacturer.

G6G Abstract Number 20791

G6G Manufacturer Number 104364

The G6G Directory of Omics and Intelligent Software
