
BiographTA: Large scale biomedical relation extraction from unstructured data

Vincent Van Asch, Roser Morante, Walter Daelemans
CLiPS - University of Antwerp
Prinsstraat 13, B-2000 Antwerpen, Belgium
{Vincent.VanAsch,Roser.Morante,Walter.Daelemans}@ua.ac.be

Web link: http://www.clips.ua.ac.be/biographTA
Keywords: machine learning, biomedical text mining, relation extraction, semantic processing.

Purpose of the research

We present BiographTA, a machine learning application that CLiPS has developed to extract biomedical relations from the PubMed database of abstracts in the framework of the BIOGRAPH project.¹ The general goal of the project is to put forward a new methodology for mining data from heterogeneous information sources in order to discover new relations between genes and phenotypes. A more specific goal is to develop text mining techniques that make large scale relation extraction possible, starting from the smallest possible amount of manually annotated data and obtaining the highest possible precision.

¹ This research is funded by the GOA BOF project BIOGRAPH of the University of Antwerp.

Approach

BiographTA processes abstracts in which biological relations from multiple databases have been annotated automatically based on an in-sentence co-occurrence criterion. It performs a relation identification task, learning from noisy data, since a proportion of the automatically annotated relations is incorrect. A distinctive characteristic of the system is that it incorporates a module that processes negation and speculation cues. The output of this module is used to filter out false positives. BiographTA processes abstracts in the following steps:

- Tokenization of the raw abstracts.
- Lemmatization of the tokenized sentences with the GENIA dependency parser GDep (Sagae and Tsujii, 2007).
- Named entity tagging with UMLS (Bodenreider, 2004). The tagger matches token sequences with entries in the UMLS database, which integrates over 2 million names for some 900 000 concepts from more than 60 families of biomedical vocabularies.
- Syntactic parsing with GDep.
- Machine learning of biological relations with SVMLight (Joachims, 1999). An instance is created for every pair of named entities that is detected in the sentence. The feature vectors encode information about the named entities and their context in the string of tokens and in the dependency tree.
- Labelling scopes of negation and speculation cues. This module (Morante et al., 2010) takes a parsed sentence as input and outputs the parsed sentence with the negation and speculation cues identified and their scopes marked.
- Filtering out biological relations based on the scope labelling step: if at least one of the entities is under the scope of a negation or speculation cue, the relation is removed.

Results and future perspectives

The system cannot be evaluated on the noisy, automatically annotated data. We therefore evaluate it on the binarized version of the BioInfer corpus (Pyysalo et al., 2007). In 10-fold cross-validation experiments we obtain a precision of 72.54% and a recall of 46.61%. The system will be integrated in the Biograph knowledge discovery server.²
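Two steps of the approach, creating one instance per pair of named entities in a sentence and removing relations with an entity under a negation or speculation scope, can be illustrated with a minimal Python sketch. It is not the BiographTA implementation: the `Entity` class, the function names, the token offsets and the example sentence are all hypothetical, the classifier is skipped entirely, and the scope is simply given as a set of token positions rather than produced by a scope labeller.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Entity:
    """A named entity mention as token offsets (end is exclusive)."""
    start: int
    end: int
    name: str  # hypothetical label, used here only for readability


Pair = Tuple[Entity, Entity]


def candidate_pairs(entities: List[Entity]) -> List[Pair]:
    """Create one candidate instance for every pair of entities in a sentence."""
    return list(combinations(entities, 2))


def in_scope(entity: Entity, scope_tokens: Set[int]) -> bool:
    """True if any token of the entity lies inside a cue's scope."""
    return any(i in scope_tokens for i in range(entity.start, entity.end))


def filter_by_scope(pairs: List[Pair], scope_tokens: Set[int]) -> List[Pair]:
    """Remove a pair if at least one of its entities is under a scope."""
    return [
        (a, b)
        for a, b in pairs
        if not in_scope(a, scope_tokens) and not in_scope(b, scope_tokens)
    ]


# Invented example sentence (token index: token):
# 0:GeneA 1:binds 2:GeneB 3:but 4:does 5:not 6:activate 7:GeneC
gene_a = Entity(0, 1, "GeneA")
gene_b = Entity(2, 3, "GeneB")
gene_c = Entity(7, 8, "GeneC")

# Assumed scope of the negation cue "not": tokens 5 to 7.
negation_scope = {5, 6, 7}

pairs = candidate_pairs([gene_a, gene_b, gene_c])
kept = filter_by_scope(pairs, negation_scope)
```

In this toy sentence three candidate pairs are created, and only the pair (GeneA, GeneB) survives the filter because GeneC falls inside the assumed negation scope. In the system described above the filter is applied to relations that the classifier has accepted; here it is applied directly to the candidate pairs to keep the sketch short.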

References
O. Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl. 1):D267–D270.
Th. Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning, pages 169–184. MIT Press.
R. Morante, V. Van Asch, and W. Daelemans. 2010. Memory-based resolution of in-sentence scopes of hedge cues. In Proc. of the Shared Task of CoNLL 2010, pages 40–47, Uppsala, Sweden. ACL.
S. Pyysalo, F. Ginter, J. Heimonen, J. Björne, J. Boberg, J. Järvinen, and T. Salakoski. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8(50).
K. Sagae and J. Tsujii. 2007. Dependency parsing and domain adaptation with LR models and parser ensembles. In Proc. of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 1044–1050, Prague, Czech Republic, June. ACL.
² Web site of the Biograph server: http://www.biograph.be/
