Query Expansion Tool (QuExT)

Abstract: QuExT (Query Expansion Tool) is a document indexing and retrieval application that obtains, from the MEDLINE database, a ranked list of the publications most relevant to a particular set of genes.

Document retrieval and ranking follow a concept-based methodology that broadens the resulting set of documents to include those focusing on concepts related to the input genes.

Each gene in the input list is expanded to its various synonyms and to a network of biologically associated terms. Currently, the expansion is based on proteins, metabolic pathways and diseases (diseases are used only when the selected organism is Homo sapiens).

The retrieved documents are ranked according to user-definable weights for each of these concept classes.

By simply changing these weights, users can alter the order of the documents and obtain, for example, documents that focus more on the metabolic pathways in which the initial genes are involved than on the genes themselves.

How QuExT works --

QuExT receives as input a list of genes and a corresponding organism. The gene list can be typed into the input box or uploaded in a text file. Genes can be separated by commas or spaces. The organism to consider is selected from a drop-down menu.
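The input handling described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `parse_gene_list` is hypothetical, and dropping duplicate entries is an assumption that the description does not state.

```python
import re


def parse_gene_list(raw: str) -> list[str]:
    """Split free-text input on commas and/or whitespace.

    Empty tokens are dropped. Removing duplicates while keeping the
    original order is an assumption made for this sketch.
    """
    tokens = re.split(r"[,\s]+", raw.strip())
    seen = set()
    genes = []
    for token in tokens:
        if token and token not in seen:
            seen.add(token)
            genes.append(token)
    return genes


# Works the same for text typed into the box or read from an uploaded file.
print(parse_gene_list("BRCA1, TP53 EGFR,\nTP53"))  # ['BRCA1', 'TP53', 'EGFR']
```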

When the user submits the form, gene names or identifiers in the input are checked against a database and mapped to an internal identifier corresponding to the selected organism. Genes that are not found in the database are excluded from further analysis.
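A minimal sketch of this validation step is shown below. The lookup dictionary, its numeric identifiers and the function name `map_genes` are all hypothetical, and case-insensitive matching is an assumption; the real database schema is not described here.

```python
def map_genes(genes, lookup):
    """Map input names to internal IDs; collect names that are not found.

    `lookup` stands in for the organism-specific database table and is
    keyed by upper-cased name or identifier (an assumption of this sketch).
    """
    accepted = {}
    rejected = []
    for name in genes:
        internal_id = lookup.get(name.upper())
        if internal_id is None:
            rejected.append(name)  # excluded from further analysis
        else:
            accepted[name] = internal_id
    return accepted, rejected


# Hypothetical lookup for one organism; synonyms share an internal ID.
HUMAN_LOOKUP = {"BRCA1": 101, "TP53": 102, "P53": 102}

accepted, rejected = map_genes(["BRCA1", "p53", "FOO1"], HUMAN_LOOKUP)
print(accepted)  # {'BRCA1': 101, 'p53': 102}
print(rejected)  # ['FOO1']
```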

QuExT then creates an expanded query and searches a local index of the PubMed database for documents matching this query.

Query expansion is performed as follows: for each gene in the query, the algorithm obtains, from a term expansion table corresponding to the selected organism, all the alternative gene, protein, pathway and disease names corresponding to that gene’s internal ID.

The full list of terms from all input genes is then accumulated in four separate query strings, one for each concept type (genes, proteins, pathways and diseases). Every term obtained from expanding the input genes is used to search the index.
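The expansion stage can be illustrated with the sketch below. The table contents are toy entries for illustration only, the names `EXPANSION_TABLE` and `build_queries` are hypothetical, and joining quoted phrases with OR is an assumed query syntax; the actual syntax depends on the index engine, which is not specified here.

```python
CONCEPT_TYPES = ("genes", "proteins", "pathways", "diseases")

# Hypothetical term expansion table for one organism:
# internal gene ID -> concept type -> list of terms.
EXPANSION_TABLE = {
    101: {
        "genes": ["BRCA1", "RNF53"],
        "proteins": ["breast cancer type 1 susceptibility protein"],
        "pathways": ["homologous recombination"],
        "diseases": ["breast cancer"],
    },
    102: {
        "genes": ["TP53"],
        "proteins": ["cellular tumor antigen p53"],
        "pathways": ["p53 signaling pathway"],
        "diseases": ["Li-Fraumeni syndrome"],
    },
}


def build_queries(internal_ids, table, include_diseases=True):
    """Accumulate the terms of all genes into one query string per concept type."""
    terms = {concept: [] for concept in CONCEPT_TYPES}
    for gene_id in internal_ids:
        entry = table.get(gene_id, {})
        for concept in CONCEPT_TYPES:
            if concept == "diseases" and not include_diseases:
                continue  # disease terms apply to Homo sapiens only
            for term in entry.get(concept, []):
                if term not in terms[concept]:
                    terms[concept].append(term)
    # Assumed syntax: quoted phrases joined with OR.
    return {c: " OR ".join(f'"{t}"' for t in terms[c]) for c in CONCEPT_TYPES}


queries = build_queries([101, 102], EXPANSION_TABLE)
for concept, query in queries.items():
    print(f"{concept}: {query}")
```

Passing `include_diseases=False` leaves the disease query empty, which mirrors the rule that disease terms are used only for Homo sapiens.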

QuExT runs four index searches, one for each of the query strings obtained in the query expansion stage.

For each search, the documents that match the query and the corresponding scores are obtained. Resulting documents and corresponding scores are kept on separate lists, one for each concept class.

Note that while term expansion takes the selected organism into account, to avoid going from a gene in one organism to a related term in a different organism, this is not true for document retrieval.

Since the indexing does not distinguish between the different species referred to in the articles, a search for a gene name in H. sapiens may return results referring to the same gene in mice, for example.

Finally, the results from the document retrieval stage are assembled and the documents are re-ranked according to the weights defined for each concept. The final score for each document is obtained as a weighted sum of its four concept-based scores.
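The weighted-sum re-ranking can be sketched as below. The function name `rerank`, the document identifiers and the score values are made up for illustration, and treating a document that is missing from a concept's result list as scoring zero for that concept is an assumption.

```python
def rerank(scores_by_concept, weights):
    """Combine per-concept scores into one weighted sum per document.

    `scores_by_concept` maps concept type -> {document ID: score}.
    A document absent from a concept's list contributes 0 for that
    concept (an assumption of this sketch).
    """
    combined = {}
    for concept, scores in scores_by_concept.items():
        weight = weights.get(concept, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores from the four index searches.
scores = {
    "genes": {"PMID_A": 0.8, "PMID_B": 0.2},
    "proteins": {"PMID_A": 0.4},
    "pathways": {"PMID_B": 0.6, "PMID_C": 0.4},
    "diseases": {"PMID_C": 0.2},
}

equal = {"genes": 0.25, "proteins": 0.25, "pathways": 0.25, "diseases": 0.25}
pathways_only = {"genes": 0.0, "proteins": 0.0, "pathways": 1.0, "diseases": 0.0}

print([doc for doc, _ in rerank(scores, equal)])          # ['PMID_A', 'PMID_B', 'PMID_C']
print([doc for doc, _ in rerank(scores, pathways_only)])  # ['PMID_B', 'PMID_C', 'PMID_A']
```

Only the weights change between the two calls; the four searches do not need to be repeated, which is what makes re-ranking cheap.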

QuExT query expansion procedure --

The query expansion and document ranking procedure is as follows:

QuExT User interface --

The query expansion and document retrieval method described above was implemented as a web-based application.

The user interface is divided into two simple forms. The first one allows the user to enter a list of gene identifiers and select the organism of study.

There is also an option to upload a text file with the list of genes. After submitting the query, the retrieved documents are presented in the results explorer interface.

The titles of the retrieved documents are displayed in the left panel, while the right-side panel shows the input genes and the gene-related concepts used in the expanded query.

The user can expand each individual abstract or navigate to the corresponding entry in PubMed, GoPubMed or iHOP by using the corresponding button.

The abstracts are not saved locally; instead, they are obtained from PubMed using the Entrez e-utilities once the query results are returned.
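As an illustration of on-demand retrieval, the sketch below builds a request to the E-utilities EFetch endpoint for a list of PubMed IDs. That QuExT uses EFetch specifically, and these particular parameters, is an assumption; the helper names `abstract_url` and `fetch_abstracts` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def abstract_url(pmids):
    """Build an EFetch URL that returns plain-text abstracts for the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstracts(pmids, timeout=10):
    """Download the abstracts; requires network access."""
    with urlopen(abstract_url(pmids), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder PMIDs, used only to show the URL shape.
print(abstract_url([12345, 67890]))
# text = fetch_abstracts([12345, 67890])  # uncomment to perform the request
```

Fetching abstracts only for the documents the user expands keeps the local index small, at the cost of one remote request per view.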

The right-side panel also contains sliders that change the concept weights used for ranking the documents. Setting the value of each slider changes the relative weight of the corresponding concept type used for expanding the query.

For example, setting the weight of the concept type ‘Disease’ to 100% (and the remaining weights to 0%) will show the documents ranked by their scores for this concept type only.
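One way to turn slider positions into relative weights is to divide each value by the total, as in the sketch below. The normalisation itself, the fallback to equal weights when every slider is at zero, and the name `normalise_weights` are assumptions made for illustration.

```python
def normalise_weights(sliders):
    """Convert slider values (e.g. 0-100) into weights that sum to 1."""
    total = sum(sliders.values())
    if total == 0:
        # Assumed fallback: treat all concept types equally.
        return {concept: 1 / len(sliders) for concept in sliders}
    return {concept: value / total for concept, value in sliders.items()}


# 'Disease' at 100% and the rest at 0%: only disease scores affect the ranking.
print(normalise_weights({"genes": 0, "proteins": 0, "pathways": 0, "diseases": 100}))
# {'genes': 0.0, 'proteins': 0.0, 'pathways': 0.0, 'diseases': 1.0}
```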

QuExT Main innovation --

The main innovation of this application is the possibility to modify the weights of the four concept classes used in the query expansion: gene names and symbols, protein names, metabolic pathways and diseases.

This gives the user control over how the expanded search terms affect the final ranking of the documents.

The application of ‘concept weights’ to modify the order of the returned documents represents a significant advantage when compared with the available methods.

QuExT Future enhancements --

Four classes of concepts are currently used for query expansion, but the flexibility of the concept-based expansion and weighting scheme allows more concepts to be included in a straightforward manner.

For example, new concepts such as Biological Process terms (from the Gene Ontology), may be included to enrich the expansion.

Another possibility is the inclusion of MeSH terms and resource identifiers for the concepts that appear in the document, such as UMLS concept IDs or UniProt accession numbers, in order to categorize the documents and offer links to the primary data sources describing each concept.

Likewise, seven reference organisms are supported at present: Homo sapiens, Mus musculus, Rattus norvegicus, Candida albicans, Saccharomyces cerevisiae, Drosophila melanogaster and Apis mellifera.

Including a new organism only requires updating the information in the database, which can be done readily if users request it.

