Abstract: The Text Analytics Collection is a set of components for mining PubMed, US Patent Office, NIH grant, and personal documents. It brings together the utility of literature search and text mining with the power of process automation and data integration in Pipeline Pilot (see Note 1). The product can help you achieve your text mining objectives by linking search and characterization steps into automated routines. It also helps you integrate literature mining with your existing scientific protocols, which can then be run interactively or automatically every night.

Product features/capabilities include:

Search Online Documents - The Text Analytics Collection gives you the ability to extract knowledge from important online document resources such as PubMed, US Patents, the National Institutes of Health (NIH) Computer Retrieval of Information on Scientific Projects (CRISP) grant database, and Google (user-extendable to other remote text data sources). Search these databases with interactive queries, or mine them with large-scale document retrieval and characterization routines.

Search Local Documents - You can also search and mine your locally stored documents. The Text Analytics Collection indexes and searches folders that contain Portable Document Format (PDF), Word, Hypertext Markup Language (HTML), or text files on your local disk (extendable to other file formats). You can also store the results of online searches in local repositories for speedy retrieval and post-processing. Local document databases stay current automatically because the folder contents are monitored for new or edited documents.
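
As a rough illustration of how a folder-monitoring step might detect new or edited documents, the Python sketch below compares two snapshots of file modification times. This is not the collection's actual mechanism, which is not described here; the function names snapshot_folder and changed_documents are hypothetical.

```python
import os

def snapshot_folder(folder):
    """Map each file path under `folder` to its last-modified time (ns)."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            state[path] = os.stat(path).st_mtime_ns
    return state

def changed_documents(previous, current):
    """Return paths that are new, or whose modification time differs,
    relative to the previous snapshot (assumed change criterion)."""
    return sorted(
        path for path, mtime in current.items()
        if previous.get(path) != mtime
    )
```

Taking a snapshot before and after a time interval and passing both to changed_documents yields the files that would need re-indexing.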

Annotate Scientific Results - When reporting the results of pipelined data analyses, it is often useful to include additional information about the output data points. With the Text Analytics Collection, you can add a few steps at the end of any Pipeline Pilot protocol and have each data point serve as a query to search a database of literature. For example, after clustering a set of genes with the Sequence Analysis Collection (an additional product; see G6G Abstract Number 20069), you can annotate each gene with summary information from its top reference in PubMed (and a link to further search results).

Identify Emerging Trends - The Text Analytics Collection can be used to monitor the scientific literature for topics of interest, and it can alert you when new concepts are emerging for those topics. The latter is achieved by searching for new articles about your topic of interest and detecting the concept words they contain. The association of each concept with the topic of interest is calculated over time to detect emerging new relationships. This can help you to stay on top of a broader class of topics, and learn about breakthroughs before they become widely known.
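
The idea of tracking a concept's association with a topic over time can be sketched in Python as follows. The association measure is not specified here, so this sketch assumes a simple one: the fraction of a period's topic-matching articles that mention the concept. The function names and the min_rise threshold are hypothetical.

```python
from collections import Counter, defaultdict

def concept_association_by_period(articles):
    """articles: iterable of (period, concept_words) pairs, one per article
    already matched to the topic of interest.
    Returns {period: {concept: fraction of that period's articles
    mentioning the concept}} (an assumed association measure)."""
    totals = Counter()
    hits = defaultdict(Counter)
    for period, concepts in articles:
        totals[period] += 1
        for concept in set(concepts):
            hits[period][concept] += 1
    return {
        period: {c: n / totals[period] for c, n in hits[period].items()}
        for period in totals
    }

def emerging_concepts(assoc, earlier, later, min_rise=0.25):
    """Concepts whose association rose by at least `min_rise`
    between two periods, strongest rise first."""
    rises = {}
    for concept, score in assoc.get(later, {}).items():
        rise = score - assoc.get(earlier, {}).get(concept, 0.0)
        if rise >= min_rise:
            rises[concept] = rise
    return sorted(rises, key=lambda c: (-rises[c], c))
```

A concept that is absent in the earlier period and common in the later one produces the largest rise and is reported first.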

Mine Patent Databases - The Text Analytics Collection provides you with the tools necessary to characterize research and intellectual property trends in a field of interest. You can search and process the US patent databases (extendable to other patent databases) for trends in the quantity of patents, application areas, companies engaged, and more. For example, by building a protocol to process patents in the field of fuel cells, you can discover how rapidly this emerging field is growing. You can also see that applications for automobiles have come to dominate the area and that Honda and General Motors are leading innovators.
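
The kind of trend summary described above can be sketched in Python as simple counts over retrieved patent records. The record fields (year, assignee, application_area) and the function name patent_trends are assumptions made for illustration, not the collection's data model.

```python
from collections import Counter

def patent_trends(patents):
    """Summarize patent records (dicts with hypothetical keys
    'year', 'assignee', 'application_area') as three tallies:
    patents per year, per assignee, and per application area."""
    per_year = Counter(p["year"] for p in patents)
    per_assignee = Counter(p["assignee"] for p in patents)
    per_area = Counter(p["application_area"] for p in patents)
    return per_year, per_assignee, per_area
```

Sorting the per-year tally by year shows growth in the field, and most_common on the other two tallies ranks the leading companies and application areas.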

Note 1: Pipeline Pilot Overview - Pipeline Pilot streamlines the integration and analysis of the vast quantities of data flooding the research informatics world. It makes the most of your information resources through industrial-scale data flow control and advanced mining capabilities. You can graphically compose data processing networks, known as 'protocols', using hundreds of different configurable components for operations such as data retrieval, manipulation, computational filtering, and display. These protocols are automatically captured as you create them, and you can publish them for project or enterprise use. From a Web interface, your colleagues can invoke your protocols and run them using their own data.

