id: Vol-3194/paper16
title: Exploiting Curated Databases to Train Relation Extraction Models for Gene-Disease Associations
pdfUrl: https://ceur-ws.org/Vol-3194/paper16.pdf
dblpUrl: https://dblp.org/rec/conf/sebd/0001S22
volume: Vol-3194
Exploiting Curated Databases to Train Relation Extraction Models for Gene-Disease Associations* (Discussion Paper)

Stefano Marchesin, Gianmaria Silvello
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Padova, Via Gradenigo 6/b, 35131, Padova, Italy

Abstract

Databases are pivotal to advancing biomedical science. Nevertheless, most of them are populated and updated by human experts with a great deal of effort. Biomedical Relation Extraction (BioRE) aims to shift these expensive and time-consuming processes to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges. Despite this, few resources have been devoted to training – and evaluating – models for GDA extraction. Moreover, such resources are limited in size, preventing models from scaling effectively to large amounts of data. To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction: TBGA. TBGA is generated from more than 700K publications and consists of over 200K instances and 100K gene-disease pairs. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging dataset for the task. The dataset and models are publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

Keywords: Weak Supervision, Relation Extraction, Gene-Disease Association

1. Introduction

Curated databases, such as UniProt [2], DrugBank [3], or CTD [4], are pivotal to the development of biomedical science. Such databases are usually populated and updated with a great deal of effort by human experts [5], thus slowing down the biological knowledge discovery process.
To overcome this limitation, the Biomedical Information Extraction (BioIE) field aims to shift population and curation processes to machines by developing effective computational tools that automatically extract meaningful facts from the vast unstructured scientific literature [6, 7, 8]. Once extracted, machine-readable facts can be fed to downstream tasks to ease biological knowledge discovery. Among the various tasks, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges to advance precision medicine and drug discovery [9], as it helps to understand the genetic causes of diseases [10]. Thus, the automatic extraction and curation of GDAs is key to advancing precision medicine research and providing knowledge to assist disease diagnostics, drug discovery, and therapeutic decision-making.

* The full paper has been originally published in BMC Bioinformatics [1]. SEBD 2022: The 30th Italian Symposium on Advanced Database Systems (SEBD 2022), June 19–22, 2022, Pisa, Italy. Contact: stefano.marchesin@unipd.it (S. Marchesin); gianmaria.silvello@unipd.it (G. Silvello). ORCID: 0000-0003-0362-5893 (S. Marchesin); 0000-0003-4970-4554 (G. Silvello). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Most datasets used to train and evaluate Relation Extraction (RE) models for GDA extraction are hand-labeled corpora [11, 12, 13]. However, hand-labeling data is an expensive process requiring large amounts of time from expert biologists and, therefore, all of these datasets are limited in size. To address this limitation, distant supervision has been proposed [14]. Under the distant supervision paradigm, all the sentences mentioning the same pair of entities are labeled with the corresponding relation stored within a source database.
The assumption is that if two entities participate in a relation, at least one sentence mentioning them conveys that relation. As a consequence, distant supervision generates a large number of false positives, since not all sentences express the relation between the considered entities. To counter false positives, the RE task under distant supervision can be modeled as a Multi-Instance Learning (MIL) problem [15, 16, 17, 18]. With MIL, the sentences containing two entities connected by a given relation are collected into bags labeled with that relation. Grouping sentences into bags reduces noise, as a bag of sentences is more likely to express a relation than a single sentence. Thus, distant supervision alleviates manual annotation efforts, and MIL increases the robustness of RE models to noise.

Since the advent of distant supervision, several datasets for RE have been developed under this paradigm for the news and biomedical domains [14, 19, 6]. Among the biomedical ones, the most relevant datasets are BioRel [19], a large-scale dataset for domain-general Biomedical Relation Extraction (BioRE), and DTI [6], a large-scale dataset developed to extract Drug-Target Interactions (DTIs). In the wake of such efforts, we created TBGA: a novel large-scale, semi-automatically annotated dataset for GDA extraction based on DisGeNET. We chose DisGeNET as the source database since it is one of the most comprehensive databases for GDAs [20], integrating several expert-curated resources. Then, we trained and tested several state-of-the-art RE models on TBGA to create a large and realistic benchmark for GDA extraction. We built models using OpenNRE [21], an open and extensible toolkit for Neural Relation Extraction (NRE). The choice of OpenNRE eases the re-use of the dataset and the models developed for this work by future researchers.
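The combination of distant supervision and MIL described above can be sketched as follows. This is a minimal illustration: the mini knowledge base, the sentences, and the build_bags helper are all hypothetical examples, not TBGA data or OpenNRE code.

```python
from collections import defaultdict

# Hypothetical mini knowledge base: (gene, disease) -> relation.
kb = {("BRCA1", "breast carcinoma"): "biomarker"}

# Hypothetical sentences with their tagged gene and disease mentions.
sentences = [
    ("BRCA1 mutations increase the risk of breast carcinoma.",
     "BRCA1", "breast carcinoma"),
    ("BRCA1 was sequenced in a cohort of breast carcinoma patients.",
     "BRCA1", "breast carcinoma"),
    ("TP53 expression was measured in lung adenocarcinoma samples.",
     "TP53", "lung adenocarcinoma"),
]

def build_bags(sentences, kb, na_label="NA"):
    """Group all sentences mentioning the same (gene, disease) pair into a bag,
    then label the bag with the relation stored in the source database; pairs
    absent from the database become negative (NA) bags."""
    bags = defaultdict(list)
    for text, gene, disease in sentences:
        bags[(gene, disease)].append(text)
    return {pair: {"relation": kb.get(pair, na_label), "sentences": texts}
            for pair, texts in bags.items()}

bags = build_bags(sentences, kb)
```

Here the first two sentences end up in one two-sentence "biomarker" bag, while the TP53 pair, absent from the knowledge base, yields an NA bag.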
Finally, we publicly released TBGA on Zenodo,[1] and we stored the source code and scripts to train and test RE models in a publicly available GitHub repository.[2]

[1] https://doi.org/10.5281/zenodo.5911097
[2] https://github.com/GDAMining/gda-extraction/

2. Dataset

TBGA is the first large-scale, semi-automatically annotated dataset for GDA extraction. The dataset consists of three text files, corresponding to train, validation, and test sets, plus an additional JSON file containing the mapping between relation names and IDs. Each record in the train, validation, or test files corresponds to a single GDA extracted from a sentence, and it is represented as a JSON object with the following attributes:

• text: sentence from which the GDA was extracted.
• relation: relation name associated with the given GDA.
• h: JSON object representing the gene entity, composed of:
  ∘ id: NCBI Entrez ID associated with the gene entity.
  ∘ name: NCBI official gene symbol associated with the gene entity.
  ∘ pos: list consisting of the starting position and length of the gene mention within text.
• t: JSON object representing the disease entity, composed of:
  ∘ id: UMLS Concept Unique Identifier (CUI) associated with the disease entity.
  ∘ name: UMLS preferred term associated with the disease entity.
  ∘ pos: list consisting of the starting position and length of the disease mention within text.

If a sentence contains multiple gene-disease pairs, the corresponding GDAs are split into separate data records. Overall, TBGA contains over 200,000 instances and 100,000 bags. Table 1 reports per-relation statistics for the dataset. Notice the large number of Not Associated (NA) instances. Regarding gene and disease statistics, the most frequent genes are tumor suppressor genes, such as TP53 and CDKN2A, and (proto-)oncogenes, like EGFR and BRAF. Among the most frequent diseases, we have neoplasms such as breast carcinoma, lung adenocarcinoma, and prostate carcinoma.
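A record following this schema can be read with the standard json module. The record below is a made-up illustration of the attribute layout (the sentence and the pos offsets are invented for this example, not taken from TBGA):

```python
import json

# Hypothetical TBGA-style record: "h" holds the gene entity, "t" the disease
# entity, and each "pos" is a [start, length] span into "text".
record = json.loads("""
{"text": "BRCA1 mutations predispose women to breast carcinoma.",
 "relation": "biomarker",
 "h": {"id": "672", "name": "BRCA1", "pos": [0, 5]},
 "t": {"id": "C0678222", "name": "Breast Carcinoma", "pos": [36, 16]}}
""")

def mention(record, key):
    """Recover an entity mention from its [start, length] span in the text."""
    start, length = record[key]["pos"]
    return record["text"][start:start + length]

gene_mention = mention(record, "h")      # "BRCA1"
disease_mention = mention(record, "t")   # "breast carcinoma"
```

Note that pos encodes start and length, not start and end, so the mention is recovered with text[start:start + length].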
As a consequence, the most frequent GDAs are gene-cancer associations.

Table 1: Per-relation statistics for TBGA. Statistics are reported separately for each data split.

Granularity     Split       Therapeutic  Biomarker  Genomic Alterations  NA
Sentence-level  Train       3,139        20,145     32,831               122,149
                Validation  402          2,279      2,306                15,206
                Test        384          2,315      2,209                15,608
Bag-level       Train       2,218        13,372     12,759               56,698
                Validation  331          2,019      1,147                6,994
                Test        308          2,068      1,122                6,996

3. Experimental Setup

3.1. Benchmarks

We performed experiments on three different datasets: TBGA, DTI, and BioRel. We used TBGA as a benchmark to evaluate RE models for GDA extraction under the MIL setting. On the other hand, we used DTI and BioRel only to validate the soundness of our implementation of the baseline models.

3.2. Evaluation Measures

We evaluated RE models using the Area Under the Precision-Recall Curve (AUPRC). AUPRC is a popular measure for evaluating distantly-supervised RE models, which has been adopted by OpenNRE [21] and used in several works, such as [6, 19]. For experiments on TBGA, we also computed Precision at k items (P@k).

3.3. Aggregation Strategies

We adopted two different sentence aggregation strategies to use RE models under the MIL setting: average-based (AVE) and attention-based (ATT) [22]. The average-based aggregation assumes that all sentences within the same bag contribute equally to the bag-level representation. In other words, the bag representation is the average of all its sentence representations. On the other hand, the attention-based aggregation represents each bag as a weighted sum of its sentence representations, where the attention weights are dynamically adjusted for each sentence.

3.4. Relation Extraction Models

We considered the main state-of-the-art RE models to perform experiments: CNN [23], PCNN [24], BiGRU [25, 19, 6], BiGRU-ATT [26, 6], and BERE [6]. All models use pre-trained word embeddings to initialize word representations. On the other hand, Position Features (PFs), Position Indicators (PIs), and unknown words are initialized using the normal distribution, whereas blank words are initialized with zeros. We adopted pre-trained BioWordVec [27] embeddings to perform experiments on TBGA. Two versions of pre-trained BioWordVec embeddings are available: "Bio_embedding_intrinsic" and "Bio_embedding_extrinsic". We chose the "Bio_embedding_extrinsic" version as it is the most suitable for BioRE. As for the experiments on DTI and BioRel, we adopted the pre-trained word embeddings used in the original works [6, 19] – that is, the word embeddings from Pyysalo et al. [28] for DTI, and the "Bio_embedding_extrinsic" version of BioWordVec for BioRel.

For TBGA experiments, we used grid search to determine the best combination of optimizer and learning rate. As combinations, we tested Stochastic Gradient Descent (SGD) with learning rate in {0.1, 0.2, 0.3, 0.4, 0.5} and Adam [29] with learning rate set to 0.0001. For all RE models, we set the rest of the hyper-parameters empirically. For DTI and BioRel experiments, we relied on the hyper-parameter settings reported in the original works [6, 19].

4. Experimental Results

We report the results for two different experiments. The first experiment aims to validate the soundness of the implementation of the considered RE models. To this end, we trained and tested the RE models on the DTI and BioRel datasets, and we compared the AUPRC scores we obtained against those reported in the original works [6, 19]. For this experiment, we only compared the RE models and aggregation strategies that were used in the original works.

Table 2: Results of the baselines validation on the DTI [6] and BioRel [19] datasets. The "–" symbol means that the RE model, for the given aggregation strategy, has not been originally evaluated on the specific dataset.
Model      Strategy  Implementation  DTI    BioRel
CNN        AVE       Reproduced      –      0.800
                     Original        –      0.790
           ATT       Reproduced      –      0.790
                     Original        –      0.780
PCNN       AVE       Reproduced      0.234  0.860
                     Original        0.160  0.820
           ATT       Reproduced      0.408  0.820
                     Original        0.359  0.790
BiGRU      AVE       Reproduced      –      0.870
                     Original        –      0.800
           ATT       Reproduced      0.379  0.850
                     Original        0.390  0.780
BiGRU-ATT  ATT       Reproduced      0.383  –
                     Original        0.457  –
BERE       AVE       Reproduced      0.407  –
                     Original        0.384  –
           ATT       Reproduced      0.525  –
                     Original        0.524  –

The second experiment uses TBGA as a benchmark to evaluate RE models for GDA extraction. In this case, we trained and tested all the considered RE models using both aggregation strategies. For each RE model, we report the AUPRC and P@k scores.

4.1. Baselines Validation

The results of the baselines validation are reported in Table 2. We can observe that the RE models we use from – or implement within – OpenNRE achieve performance higher than or comparable to that reported in the DTI and BioRel original works. The only exceptions are BiGRU and BiGRU-ATT on DTI, where the AUPRC scores of our implementations are lower than those reported in the original work. However, Hong et al. [6] report the optimal hyper-parameter settings for BERE, but not for the baselines. Thus, we attribute the negative difference between our implementations and theirs to the lack of information about optimal hyper-parameters. Overall, the results confirm the soundness of our implementations. Therefore, we can consider them competitive baseline models for benchmarking GDA extraction.

4.2. GDA Benchmarking

Table 3 reports the AUPRC and P@k scores of RE models on TBGA.

Table 3: RE models performance on the TBGA dataset. For each measure, bold values represent the best scores.

Model      Strategy  AUPRC  P@50   P@100  P@250  P@500  P@1000
CNN        AVE       0.422  0.780  0.760  0.744  0.696  0.625
           ATT       0.403  0.780  0.760  0.788  0.710  0.624
PCNN       AVE       0.426  0.780  0.780  0.744  0.720  0.664
           ATT       0.404  0.760  0.750  0.744  0.700  0.628
BiGRU      AVE       0.437  0.620  0.720  0.724  0.730  0.678
           ATT       0.423  0.760  0.750  0.748  0.726  0.666
BiGRU-ATT  AVE       0.419  0.740  0.740  0.748  0.694  0.615
           ATT       0.390  0.680  0.760  0.756  0.702  0.631
BERE       AVE       0.419  0.700  0.710  0.720  0.704  0.620
           ATT       0.445  0.780  0.780  0.800  0.764  0.709

Given the RE models' performance, we make the following observations. First, the AUPRC scores achieved by RE models on TBGA indicate the high complexity of the GDA extraction task. The task complexity is further supported by the lower performance obtained by top-performing RE models on TBGA compared to DTI and BioRel (cf. Table 2). Second, the CNN, PCNN, BiGRU, and BiGRU-ATT RE models behave similarly. Among them, BiGRU-ATT has the worst performance. This suggests that replacing the BiGRU max pooling layer with an attention layer proves less effective. Overall, the best AUPRC and P@k scores are achieved by BERE when using the attention-based aggregation strategy. This highlights BERE's effectiveness in fully exploiting sentence information from both semantic and syntactic aspects [6]. Third, in terms of AUPRC, the attention-based aggregation proves less effective than the average-based one. On the other hand, attention-based aggregation provides mixed results on the P@k measures. Although in contrast with the results obtained in general-domain RE [22], this trend is in line with the results found by Xing et al. [19] on BioRel, where RE models using an average-based aggregation strategy achieve performance comparable to or higher than that of models using an attention-based one. The only exception is BERE, whose performance with the attention-based aggregation outperforms that with the average-based strategy. Thus, the obtained results suggest that TBGA is a challenging dataset for GDA extraction.

5. Conclusions

We have created TBGA, a large-scale, semi-automatically annotated dataset for GDA extraction.
Automatic GDA extraction is one of the most relevant tasks of BioRE. We have used TBGA as a benchmark to evaluate state-of-the-art BioRE models on GDA extraction. The results suggest that TBGA is a challenging dataset for this task and, in general, for BioRE.

Acknowledgments

The work was supported by the EU H2020 ExaMode project, under Grant Agreement no. 825292.

References

[1] S. Marchesin, G. Silvello, TBGA: a large-scale gene-disease association dataset for biomedical relation extraction, BMC Bioinform. 23 (2022) 111.
[2] A. Bairoch, R. Apweiler, The SWISS-PROT protein sequence data bank and its supplement TrEMBL, Nucleic Acids Res. 25 (1997) 31–36.
[3] D. S. Wishart, C. Knox, A. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang, J. Woolsey, DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. 34 (2006) 668–672.
[4] C. J. Mattingly, G. T. Colby, J. N. Forrest, J. L. Boyer, The Comparative Toxicogenomics Database (CTD), Environ. Health Perspect. 111 (2003) 793–795.
[5] P. Buneman, J. Cheney, W. C. Tan, S. Vansummeren, Curated Databases, in: Proc. of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9–11, 2008, Vancouver, BC, Canada, ACM, 2008, pp. 1–12.
[6] L. Hong, J. Lin, S. Li, F. Wan, H. Yang, T. Jiang, D. Zhao, J. Zeng, A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories, Nat. Mach. Intell. 2 (2020) 347–355.
[7] E. P. Barracchia, G. Pio, D. D'Elia, M. Ceci, Prediction of new associations between ncRNAs and diseases exploiting multi-type hierarchical clustering, BMC Bioinform. 21 (2020) 70.
[8] S. Alaimo, R. Giugno, A. Pulvirenti, ncPred: ncRNA-disease association prediction through tripartite network-based inference, Front. Bioeng. Biotechnol. 2 (2014).
[9] S. Dugger, A. Platt, D. Goldstein, Drug development in the era of precision medicine, Nat. Rev. Drug Discov. 17 (2018) 183–196.
[10] J. P. González, J. M. Ramírez-Anguita, J. Saüch-Pitarch, F. Ronzano, E. Centeno, F. Sanz, L. I. Furlong, The DisGeNET knowledge platform for disease genomics: 2019 update, Nucleic Acids Res. 48 (2020) D845–D855.
[11] E. M. van Mulligen, A. Fourrier-Réglat, D. Gurwitz, M. Molokhia, A. Nieto, G. Trifirò, J. A. Kors, L. I. Furlong, The EU-ADR corpus: Annotated drugs, diseases, targets, and their relationships, J. Biomed. Informatics 45 (2012) 879–884.
[12] D. Cheng, C. Knox, N. Young, P. Stothard, S. Damaraju, D. S. Wishart, PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites, Nucleic Acids Res. 36 (2008) 399–405.
[13] H. J. Lee, S. H. Shim, M. R. Song, H. Lee, J. C. Park, CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations, BMC Bioinform. 14 (2013) 323.
[14] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL 2009) and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2–7 August 2009, Singapore, ACL, 2009, pp. 1003–1011.
[15] T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the Multiple Instance Problem with Axis-Parallel Rectangles, Artif. Intell. 89 (1997) 31–71.
[16] S. Riedel, L. Yao, A. McCallum, Modeling Relations and Their Mentions without Labeled Text, in: Proc. of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20–24, 2010, volume 6323 of LNCS, Springer, 2010, pp. 148–163.
[17] R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, D. S. Weld, Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations, in: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19–24 June, 2011, Portland, Oregon, USA, ACL, 2011, pp. 541–550.
[18] M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance Multi-label Learning for Relation Extraction, in: Proc. of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12–14, 2012, Jeju Island, Korea, ACL, 2012, pp. 455–465.
[19] R. Xing, J. Luo, T. Song, BioRel: towards large-scale biomedical relation extraction, BMC Bioinform. 21-S (2020) 543.
[20] Z. Tanoli, U. Seemab, A. Scherer, K. Wennerberg, J. Tang, M. Vähä-Koskela, Exploration of databases and methods supporting drug repurposing: a comprehensive survey, Briefings Bioinform. 22 (2021) 1656–1678.
[21] X. Han, T. Gao, Y. Yao, D. Ye, Z. Liu, M. Sun, OpenNRE: An Open and Extensible Toolkit for Neural Relation Extraction, in: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, ACL, 2019, pp. 169–174.
[22] Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural Relation Extraction with Selective Attention over Instances, in: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7–12, 2016, Berlin, Germany, Volume 1: Long Papers, ACL, 2016, pp. 2124–2133.
[23] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation Classification via Convolutional Deep Neural Network, in: Proc. of COLING 2014, 25th International Conference on Computational Linguistics, Technical Papers, August 23–29, 2014, Dublin, Ireland, ACL, 2014, pp. 2335–2344.
[24] D. Zeng, K. Liu, Y. Chen, J. Zhao, Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks, in: Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, ACL, 2015, pp. 1753–1762.
[25] D. Zhang, D. Wang, Relation Classification via Recurrent Neural Network, CoRR abs/1508.01006 (2015).
[26] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, in: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7–12, 2016, Berlin, Germany, Volume 2: Short Papers, ACL, 2016, pp. 207–212.
[27] Y. Zhang, Q. Chen, Z. Yang, H. Lin, Z. Lu, BioWordVec, improving biomedical word embeddings with subword information and MeSH, Sci Data 6 (2019) 1–9.
[28] S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, S. Ananiadou, Distributional Semantics Resources for Biomedical Text Processing, Proc. of LBM (2013) 39–44.
[29] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Proc. of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, 2015, pp. 1–15.