Text Mining of the Scientific Literature to Identify Pharmacogenomic Interactions
Author: Yael Garten
Publisher: Stanford University
Published: 2010
Total Pages: 221
ISBN-13:
Pharmacogenomics is the study of how variation in the human genome impacts drug response in patients. It is a major driving force of "personalized medicine," in which drug choice and dosing decisions are informed by individual information such as DNA genotype. The field of pharmacogenomics is in an era of explosive growth; massive amounts of data are being collected and knowledge discovered, which promises to push forward the reality of individualized clinical care. However, this large amount of data is dispersed across many journals in the scientific literature, and pharmacogenomic findings are discussed in a variety of non-standardized ways. It is thus challenging to identify important associations between drugs and molecular entities, particularly genes and gene variants. As a result, these critical connections are not easily available to investigators or clinicians who wish to survey the state of knowledge for any particular gene, drug, disease, or variant. Manual efforts have attempted to catalog this information; however, the rapid expansion of the pharmacogenomic literature has made this approach infeasible. Natural language processing and text mining techniques allow us to convert free-style text to a computable, searchable format in which pharmacogenomic concepts such as genes, drugs, polymorphisms, and diseases are identified, and important links between these concepts are recorded. My dissertation describes novel computational methods to extract and predict pharmacogenomic relationships from text. In one project, we extract pharmacogenomic relationships from the primary literature using text mining. We process information at the fine-grained sentence level, using full text when available. In a second project, we investigate the use of these extracted relationships in place of manually curated relationships as input into an algorithm that predicts pharmacogenes for a drug of interest.
We show that for this application we can perform as well with text-mined relationships as with manually curated information. This approach holds great promise because it is cheaper, faster, and more scalable than manual curation. Our method provides us with interesting drug-gene relationship predictions that warrant further experimental investigation. In the third project, we describe knowledge inference in the context of pharmacogenomic relationships. Using cutting-edge natural language processing tools and automated reasoning, we create a rich semantic network of 40,000 pharmacogenomic relationships distilled from 17 million Medline abstracts. This network connects over 200 entity types with clear semantics using more than 70 unique types of relationships. We use this network to create collections of precise and specific types of knowledge, and to infer relationships that are not stated explicitly in the text but are instead derived from the large number of related sentences found in the literature. This is exciting because it demonstrates that we are able to overcome the heterogeneity of written language and infer the correct semantics of the relationship described by authors. Finally, we can use this network to identify conflicting facts described in the literature, to study change in language use over time, and to predict drug-drug interactions. These achievements provide us with new ways of interacting with the literature and the knowledge embedded within it. They help ensure that we do not bury the knowledge embodied in the publications, but rather connect the often fragmented and disconnected pieces of knowledge spread across millions of articles in hundreds of journals. This brings us one step closer to the realization of personalized medicine and ensures that, as scientists, we continue to build on the knowledge discovered by past generations and truly stand on the shoulders of giants.
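
As a rough illustration of what sentence-level extraction of drug-gene relationships can look like, the Python sketch below counts how often a drug name and a gene name appear in the same sentence. It is a toy example and not the method used in the dissertation: the lexicons DRUGS and GENES, the naive sentence splitter, and the sample text are all assumptions made up for illustration, and a real system would rely on curated dictionaries and proper linguistic parsing.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons (assumptions for this sketch only).
DRUGS = {"warfarin", "clopidogrel"}
GENES = {"CYP2C9", "VKORC1", "CYP2C19"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrences(text):
    """Count (drug, gene) pairs that are mentioned in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        lowered = {t.lower() for t in tokens}
        drugs = sorted(DRUGS & lowered)   # drug names matched case-insensitively
        genes = sorted(GENES & tokens)    # gene symbols matched case-sensitively
        for pair in product(drugs, genes):
            counts[pair] += 1
    return counts


# Made-up sample text, written only to exercise the sketch.
SAMPLE_TEXT = (
    "Variants in CYP2C9 and VKORC1 alter warfarin dose requirements. "
    "CYP2C19 genotype affects response to clopidogrel. "
    "Warfarin dosing algorithms often include VKORC1 genotype."
)

pair_counts = cooccurrences(SAMPLE_TEXT)
for (drug, gene), n in sorted(pair_counts.items()):
    print(drug, gene, n)
```

Restricting matches to a single sentence trades recall for precision: a drug and a gene named in the same sentence are more likely to be related than two names that merely share an abstract. Co-occurrence counts alone say nothing about the type of relationship, which is where the typed semantic network described above goes further.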