Vol-3194/paper16

Paper

id  Vol-3194/paper16
title  Exploiting Curated Databases to Train Relation Extraction Models for Gene-Disease Associations
pdfUrl  https://ceur-ws.org/Vol-3194/paper16.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/0001S22
volume  Vol-3194


Exploiting Curated Databases to Train Relation
Extraction Models for Gene-Disease Associations*
(Discussion Paper)

Stefano Marchesin, Gianmaria Silvello

Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Padova, Via Gradenigo 6/b, 35131, Padova, Italy


Abstract

Databases are pivotal to advancing biomedical science. Nevertheless, most of them are populated and updated by human experts with a great deal of effort. Biomedical Relation Extraction (BioRE) aims to shift these expensive and time-consuming processes to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges. Despite this, few resources have been devoted to training – and evaluating – models for GDA extraction. Besides, such resources are limited in size, preventing models from scaling effectively to large amounts of data. To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction: TBGA. TBGA is generated from more than 700K publications and consists of over 200K instances and 100K gene-disease pairs. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging dataset for the task. The dataset and models are publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

Keywords

Weak Supervision, Relation Extraction, Gene-Disease Association




1. Introduction
Curated databases, such as UniProt [2], DrugBank [3], or CTD [4], are pivotal to the develop-
ment of biomedical science. Such databases are usually populated and updated with a great deal
of effort by human experts [5], thus slowing down the biological knowledge discovery process.
To overcome this limitation, the Biomedical Information Extraction (BioIE) field aims to shift
population and curation processes to machines by developing effective computational tools that
automatically extract meaningful facts from the vast unstructured scientific literature [6, 7, 8].
Once extracted, machine-readable facts can be fed to downstream tasks to ease biological knowl-
edge discovery. Among the various tasks, the discovery of Gene-Disease Associations (GDAs)
is one of the most pressing challenges to advance precision medicine and drug discovery [9],
as it helps to understand the genetic causes of diseases [10]. Thus, the automatic extraction

* The full paper has been originally published in BMC Bioinformatics [1]
SEBD 2022: The 30th Italian Symposium on Advanced Database Systems (SEBD 2022), June 19–22, 2022, Pisa, Italy
stefano.marchesin@unipd.it (S. Marchesin); gianmaria.silvello@unipd.it (G. Silvello)
ORCID: 0000-0003-0362-5893 (S. Marchesin); 0000-0003-4970-4554 (G. Silvello)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
and curation of GDAs is key to advancing precision medicine research and providing knowledge to
assist disease diagnostics, drug discovery, and therapeutic decision-making.
   Most datasets used to train and evaluate Relation Extraction (RE) models for GDA extraction
are hand-labeled corpora [11, 12, 13]. However, hand-labeling data is an expensive process
requiring large amounts of time from expert biologists and, therefore, all of these datasets are
limited in size. To address this limitation, distant supervision has been proposed [14]. Under the
distant supervision paradigm, all the sentences mentioning the same pair of entities are labeled
by the corresponding relation stored within a source database. The assumption is that if two
entities participate in a relation, at least one sentence mentioning them conveys that relation.
As a consequence, distant supervision generates a large number of false positives, since not
all sentences express the relation between the considered entities. To counter false positives,
the RE task under distant supervision can be modeled as a Multi-Instance Learning (MIL)
problem [15, 16, 17, 18]. With MIL, the sentences containing two entities connected by a given
relation are collected into bags labeled with that relation. Grouping sentences into bags reduces
noise, as a bag of sentences is more likely to express a relation than a single sentence. Thus,
distant supervision alleviates manual annotation efforts, and MIL increases the robustness of
RE models to noise.
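The distant supervision plus MIL setup described above can be sketched in a few lines. The function name `build_bags` and the toy knowledge base below are illustrative assumptions, not part of TBGA's actual construction code:

```python
from collections import defaultdict

def build_bags(sentences, knowledge_base):
    """Distant supervision with MIL: group all sentences mentioning the
    same gene-disease pair into one bag, and label the bag with the
    relation stored in the source database (NA if the pair is absent)."""
    grouped = defaultdict(list)
    for gene_id, disease_id, text in sentences:
        grouped[(gene_id, disease_id)].append(text)
    return {
        pair: {"relation": knowledge_base.get(pair, "NA"), "sentences": texts}
        for pair, texts in grouped.items()
    }
```

Note that every sentence mentioning a pair ends up in its bag, which is exactly why false positives arise: the bag label comes from the database, not from reading the sentences.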
   Since the advent of distant supervision, several datasets for RE have been developed under
this paradigm for news and biomedical science domains [14, 19, 6]. Among biomedical ones,
the most relevant datasets are BioRel [19], a large-scale dataset for domain-general Biomedical
Relation Extraction (BioRE), and DTI [6], a large-scale dataset developed to extract Drug-Target
Interactions (DTIs). In the wake of such efforts, we created TBGA: a novel large-scale, semi-automatically annotated dataset for GDA extraction based on DisGeNET. We chose DisGeNET as the source database since it is one of the most comprehensive databases for GDAs [20], integrating several expert-curated resources.
   Then, we trained and tested several state-of-the-art RE models on TBGA to create a large
and realistic benchmark for GDA extraction. We built models using OpenNRE [21], an open
and extensible toolkit for Neural Relation Extraction (NRE). The choice of OpenNRE eases the
re-use of the dataset and the models developed for this work by future researchers. Finally, we
publicly released TBGA on Zenodo,1 whereas we stored source code and scripts to train and
test RE models in a publicly available GitHub repository.2


2. Dataset
TBGA is the first large-scale, semi-automatically annotated dataset for GDA extraction. The
dataset consists of three text files, corresponding to train, validation, and test sets, plus an
additional JSON file containing the mapping between relation names and IDs. Each record in
train, validation, or test files corresponds to a single GDA extracted from a sentence, and it is
represented as a JSON object with the following attributes:

    • text : sentence from which the GDA was extracted.

1. https://doi.org/10.5281/zenodo.5911097
2. https://github.com/GDAMining/gda-extraction/
    • relation : relation name associated with the given GDA.
    • h : JSON object representing the gene entity, composed of:
          ∘ id : NCBI Entrez ID associated with the gene entity.
          ∘ name : NCBI official gene symbol associated with the gene entity.
          ∘ pos : list consisting of starting position and length of the gene mention within text.
    • t : JSON object representing the disease entity, composed of:
          ∘ id : UMLS Concept Unique Identifier (CUI) associated with the disease entity.
          ∘ name : UMLS preferred term associated with the disease entity.
          ∘ pos : list consisting of starting position and length of the disease mention within
            text.

If a sentence contains multiple gene-disease pairs, the corresponding GDAs are split into separate
data records.
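A minimal record in this format might look as follows. The sentence, IDs, and the exact relation string are invented for illustration and do not come from TBGA itself; the point is that `pos` stores [start, length], so each mention can be recovered by slicing `text`:

```python
import json

# Hypothetical TBGA-style record; all values are invented for illustration.
record = json.loads("""
{
  "text": "BRCA1 mutations increase the risk of breast carcinoma.",
  "relation": "biomarker",
  "h": {"id": "672", "name": "BRCA1", "pos": [0, 5]},
  "t": {"id": "C0678222", "name": "Breast Carcinoma", "pos": [37, 16]}
}
""")

def mention(rec, entity):
    # pos = [start, length] -> slice the mention out of the sentence
    start, length = rec[entity]["pos"]
    return rec["text"][start:start + length]

gene_mention = mention(record, "h")      # "BRCA1"
disease_mention = mention(record, "t")   # "breast carcinoma"
```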
    Overall, TBGA contains over 200,000 instances and 100,000 bags. Table 1 reports per-relation
statistics for the dataset. Notice the large number of Not Associated (NA) instances. Regarding
gene and disease statistics, the most frequent genes are tumor suppressor genes, such as TP53
and CDKN2A, and (proto-)oncogenes, like EGFR and BRAF. Among the most frequent diseases,
we have neoplasms such as breast carcinoma, lung adenocarcinoma, and prostate carcinoma.
As a consequence, the most frequent GDAs are gene-cancer associations.

Table 1
Per-relation statistics for TBGA. Statistics are reported separately for each data split.
      Granularity       Split         Therapeutic    Biomarker     Genomic Alterations         NA
                        Train                3,139        20,145                  32,831    122,149
      Sentence-level    Validation             402         2,279                   2,306     15,206
                        Test                   384         2,315                   2,209     15,608
                        Train                2,218        13,372                  12,759     56,698
      Bag-level         Validation             331         2,019                   1,147      6,994
                        Test                   308         2,068                   1,122      6,996




3. Experimental Setup
3.1. Benchmarks
We performed experiments on three different datasets: TBGA, DTI, and BioRel. We used TBGA
as a benchmark to evaluate RE models for GDA extraction under the MIL setting. On the other
hand, we used DTI and BioRel only to validate the soundness of our implementation of the
baseline models.
3.2. Evaluation Measures
We evaluated RE models using the Area Under the Precision-Recall Curve (AUPRC). AUPRC
is a popular measure to evaluate distantly-supervised RE models, which has been adopted by
OpenNRE [21] and used in several works, such as [6, 19]. For experiments on TBGA, we also
computed Precision at k items (P@k).
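Both measures can be computed directly from a ranked list of (score, label) pairs. The following dependency-free sketch uses step-wise integration for AUPRC, which is one common convention and not necessarily the exact implementation used by OpenNRE:

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring predictions."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

def auprc(scores, labels):
    """Area under the precision-recall curve via step-wise integration:
    sweep the ranked predictions and accumulate precision * delta-recall."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_pos = sum(labels)
    area, tp, prev_recall = 0.0, 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        recall = tp / total_pos
        precision = tp / rank
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```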

3.3. Aggregation Strategies
We adopted two different sentence aggregation strategies to use RE models under the MIL
setting: average-based (AVE) and attention-based (ATT) [22]. The average-based aggregation
assumes that all sentences within the same bag contribute equally to the bag-level representation.
In other words, the bag representation is the average of all its sentence representations. On
the other hand, the attention-based aggregation represents each bag as a weighted sum of
its sentence representations, where the attention weights are dynamically adjusted for each
sentence.
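In vector form, the two strategies differ only in how sentence representations are weighted. This sketch scores sentences with a plain dot product against a relation query vector, a deliberate simplification of the learned selective attention of [22]:

```python
import math

def aggregate_ave(sentence_reprs):
    """AVE: the bag representation is the unweighted mean of its sentence vectors."""
    dim = len(sentence_reprs[0])
    return [sum(v[i] for v in sentence_reprs) / len(sentence_reprs)
            for i in range(dim)]

def aggregate_att(sentence_reprs, relation_query):
    """ATT: softmax-weighted sum of sentence vectors, where each weight
    scores a sentence against a relation query vector (simplified here)."""
    scores = [sum(a * b for a, b in zip(v, relation_query))
              for v in sentence_reprs]
    m = max(scores)                         # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_reprs[0])
    return [sum(w * v[i] for w, v in zip(weights, sentence_reprs))
            for i in range(dim)]
```

Sentences that align with the relation query thus dominate the ATT bag representation, which is what lets attention down-weight noisy sentences within a bag.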

3.4. Relation Extraction Models
We considered the main state-of-the-art RE models to perform experiments: CNN [23], PCNN [24],
BiGRU [25, 19, 6], BiGRU-ATT [26, 6], and BERE [6]. All models use pre-trained word embed-
dings to initialize word representations. On the other hand, Position Features (PFs), Position
Indicators (PIs), and unknown words are initialized using the normal distribution, whereas
blank words are initialized with zeros.
   We adopted pre-trained BioWordVec [27] embeddings to perform experiments on TBGA. Two
versions of pre-trained BioWordVec embeddings are available: “Bio_embedding_intrinsic” and
“Bio_embedding_extrinsic”. We chose the “Bio_embedding_extrinsic” version as it is the most
suitable for BioRE. As for the experiments on DTI and BioRel, we adopted the pre-trained word
embeddings used in the original works [6, 19] – that is, the word embeddings from Pyysalo et
al. [28] for DTI, and the “Bio_embedding_extrinsic” version of BioWordVec for BioRel.
   For TBGA experiments, we used grid search to determine the best combination of optimizer and learning rate: we tested Stochastic Gradient Descent (SGD)
with learning rate among {0.1, 0.2, 0.3, 0.4, 0.5} and Adam [29] with learning rate set to 0.0001.
For all RE models, we set the rest of the hyper-parameters empirically.
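The search described above is a small exhaustive loop over six configurations. In this hypothetical sketch, `train_and_validate` stands in for a full training run that returns the validation AUPRC:

```python
# Search space mirroring the setup described above: SGD with several
# learning rates, plus Adam with a fixed learning rate of 0.0001.
SEARCH_SPACE = [("sgd", lr) for lr in (0.1, 0.2, 0.3, 0.4, 0.5)] + [("adam", 1e-4)]

def grid_search(train_and_validate):
    """Return (best validation AUPRC, optimizer name, learning rate)."""
    best = None
    for optimizer, lr in SEARCH_SPACE:
        score = train_and_validate(optimizer, lr)
        if best is None or score > best[0]:
            best = (score, optimizer, lr)
    return best
```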
   For DTI and BioRel experiments, we relied on the hyper-parameter settings reported in the
original works [6, 19].


4. Experimental Results
We report the results for two different experiments. The first experiment aims to validate the
soundness of the implementation of the considered RE models. To this end, we trained and
tested the RE models on DTI and BioRel datasets, and we compared the AUPRC scores we
obtained against those reported in the original works [6, 19]. For this experiment, we only
compared the RE models and aggregation strategies that were used in the original works. The
Table 2
Results of the baselines validation on DTI [6] and BioRel [19] datasets. The “–” symbol means that the
RE model, for the given aggregation strategy, has not been originally evaluated on the specific dataset.

      Model       Strategy   Implementation     DTI    BioRel
      CNN         AVE        Reproduced           –     0.800
                             Original             –     0.790
                  ATT        Reproduced           –     0.790
                             Original             –     0.780
      PCNN        AVE        Reproduced       0.234     0.860
                             Original         0.160     0.820
                  ATT        Reproduced       0.408     0.820
                             Original         0.359     0.790
      BiGRU       AVE        Reproduced           –     0.870
                             Original             –     0.800
                  ATT        Reproduced       0.379     0.850
                             Original         0.390     0.780
      BiGRU-ATT   ATT        Reproduced       0.383         –
                             Original         0.457         –
      BERE        AVE        Reproduced       0.407         –
                             Original         0.384         –
                  ATT        Reproduced       0.525         –
                             Original         0.524         –


second experiment uses TBGA as a benchmark to evaluate RE models for GDA extraction. In
this case, we trained and tested all the considered RE models using both aggregation strategies.
For each RE model, we reported the AUPRC and P@k scores.

4.1. Baselines Validation
The results of the baselines validation are reported in Table 2. We can observe that the RE
models we use from – or implement within – OpenNRE achieve performance higher than or
comparable to that reported in the original DTI and BioRel works. The only exceptions are BiGRU
and BiGRU-ATT on DTI, where the AUPRC scores of our implementations are lower than those
reported in the original work. However, Hong et al. [6] report the optimal hyper-parameter
settings for BERE, but not for the baselines. Thus, we attribute the negative difference between
our implementations and theirs to the lack of information about optimal hyper-parameters.
Overall, the results confirm the soundness of our implementations. Therefore, we can consider
them as competitive baseline models to use for benchmarking GDA extraction.

4.2. GDA Benchmarking
Table 3 reports the AUPRC and P@k scores of RE models on TBGA. Given the RE models
performance, we make the following observations. First, the AUPRC performances achieved by
Table 3
RE models performance on TBGA dataset. For each measure, bold values represent the best scores.
        Model         Strategy   AUPRC      P@50    P@100    P@250     P@500    P@1000
                      AVE           0.422   0.780    0.760     0.744    0.696      0.625
        CNN
                      ATT           0.403   0.780    0.760     0.788    0.710      0.624
                      AVE           0.426   0.780    0.780     0.744    0.720      0.664
        PCNN
                      ATT           0.404   0.760    0.750     0.744    0.700      0.628
                      AVE           0.437   0.620    0.720     0.724    0.730      0.678
        BiGRU
                      ATT           0.423   0.760    0.750     0.748    0.726      0.666
                      AVE           0.419   0.740    0.740     0.748    0.694      0.615
        BiGRU-ATT
                      ATT           0.390   0.680    0.760     0.756    0.702      0.631
                      AVE          0.419    0.700    0.710    0.720     0.704     0.620
        BERE
                      ATT          0.445    0.780    0.780    0.800     0.764     0.709


RE models on TBGA indicate a high complexity of the GDA extraction task. The task complexity
is further supported by the lower performances obtained by top-performing RE models on
TBGA compared to DTI and BioRel (cf. Table 2). Secondly, CNN, PCNN, BiGRU, and BiGRU-
ATT RE models behave similarly. Among them, BiGRU-ATT has the worst performance. This
suggests that replacing the BiGRU max pooling layer with an attention layer is less effective. Overall, the best AUPRC and P@k scores are achieved by BERE when using the attention-based aggregation strategy. This highlights BERE's effectiveness in fully exploiting sentence information from both semantic and syntactic aspects [6]. Thirdly, in terms of AUPRC, the
attention-based aggregation proves less effective than the average-based one. On the other hand,
attention-based aggregation provides mixed results on P@k measures. Although in contrast
with the results obtained in general-domain RE [22], this trend is in line with the results found
by Xing et al. [19] on BioRel, where RE models using an average-based aggregation strategy
achieve performance comparable to or higher than those using an attention-based one. The
only exception is BERE, whose performance using the attention-based aggregation outperforms
the one using the average-based strategy. Thus, the obtained results suggest that TBGA is a
challenging dataset for GDA extraction.


5. Conclusions
We have created TBGA, a large-scale, semi-automatically annotated dataset for GDA extraction.
Automatic GDA extraction is one of the most relevant tasks of BioRE. We have used TBGA as a
benchmark to evaluate state-of-the-art BioRE models on GDA extraction. The results suggest
that TBGA is a challenging dataset for this task and, in general, for BioRE.


Acknowledgments
The work was supported by the EU H2020 ExaMode project, under Grant Agreement no. 825292.
References
 [1] S. Marchesin, G. Silvello, TBGA: a large-scale gene-disease association dataset for biomed-
     ical relation extraction, BMC Bioinform. 23 (2022) 111.
 [2] A. Bairoch, R. Apweiler, The SWISS-PROT protein sequence data bank and its supplement
     TrEMBL, Nucleic Acids Res. 25 (1997) 31–36.
 [3] D. S. Wishart, C. Knox, A. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang,
     J. Woolsey, DrugBank: a comprehensive resource for in silico drug discovery and explo-
     ration, Nucleic Acids Res. 34 (2006) 668–672.
 [4] C. J. Mattingly, G. T. Colby, J. N. Forrest, J. L. Boyer, The Comparative Toxicogenomics
     Database (CTD), Environ. Health Perspect. 111 (2003) 793–795.
 [5] P. Buneman, J. Cheney, W. C. Tan, S. Vansummeren, Curated Databases, in: Proc. of the
     Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
     Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada, ACM, 2008, pp. 1–12.
 [6] L. Hong, J. Lin, S. Li, F. Wan, H. Yang, T. Jiang, D. Zhao, J. Zeng, A novel machine learn-
     ing framework for automated biomedical relation extraction from large-scale literature
     repositories, Nat. Mach. Intell. 2 (2020) 347–355.
 [7] E. P. Barracchia, G. Pio, D. D’Elia, M. Ceci, Prediction of new associations between ncrnas
     and diseases exploiting multi-type hierarchical clustering, BMC Bioinform. 21 (2020) 70.
 [8] S. Alaimo, R. Giugno, A. Pulvirenti, ncpred: ncrna-disease association prediction through
     tripartite network-based inference, Front. Bioeng. Biotechnol. 2 (2014).
 [9] S. Dugger, A. Platt, D. Goldstein, Drug development in the era of precision medicine, Nat.
     Rev. Drug. Discov. 17 (2018) 183–196.
[10] J. P. González, J. M. Ramírez-Anguita, J. Saüch-Pitarch, F. Ronzano, E. Centeno, F. Sanz, L. I.
     Furlong, The DisGeNET knowledge platform for disease genomics: 2019 update, Nucleic
     Acids Res. 48 (2020) D845–D855.
[11] E. M. van Mulligen, A. Fourrier-Réglat, D. Gurwitz, M. Molokhia, A. Nieto, G. Trifirò, J. A.
     Kors, L. I. Furlong, The EU-ADR corpus: Annotated drugs, diseases, targets, and their
     relationships, J. Biomed. Informatics 45 (2012) 879–884.
[12] D. Cheng, C. Knox, N. Young, P. Stothard, S. Damaraju, D. S. Wishart, PolySearch: a
     web-based text mining system for extracting relationships between human diseases, genes,
     mutations, drugs and metabolites, Nucleic Acids Res. 36 (2008) 399–405.
[13] H. J. Lee, S. H. Shim, M. R. Song, H. Lee, J. C. Park, CoMAGC: a corpus with multi-faceted
     annotations of gene-cancer relations, BMC Bioinform. 14 (2013) 323.
[14] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without
     labeled data, in: Proc. of the 47th Annual Meeting of the Association for Computational
     Linguistics (ACL 2009) and the 4th International Joint Conference on Natural Language
     Processing of the AFNLP, 2-7 August 2009, Singapore, ACL, 2009, pp. 1003–1011.
[15] T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the Multiple Instance Problem
     with Axis-Parallel Rectangles, Artif. Intell. 89 (1997) 31–71.
[16] S. Riedel, L. Yao, A. McCallum, Modeling Relations and Their Mentions without Labeled
     Text, in: Proc. of Machine Learning and Knowledge Discovery in Databases, European
     Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, volume 6323 of
     LNCS, Springer, 2010, pp. 148–163.
[17] R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, D. S. Weld, Knowledge-Based Weak
     Supervision for Information Extraction of Overlapping Relations, in: The 49th Annual
     Meeting of the Association for Computational Linguistics: Human Language Technologies,
     Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, ACL, 2011, pp.
     541–550.
[18] M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance Multi-label Learning
     for Relation Extraction, in: Proc. of the 2012 Joint Conference on Empirical Methods in
     Natural Language Processing and Computational Natural Language Learning, EMNLP-
     CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, ACL, 2012, pp. 455–465.
[19] R. Xing, J. Luo, T. Song, BioRel: towards large-scale biomedical relation extraction, BMC
     Bioinform. 21-S (2020) 543.
[20] Z. Tanoli, U. Seemab, A. Scherer, K. Wennerberg, J. Tang, M. Vähä-Koskela, Exploration of
     databases and methods supporting drug repurposing: a comprehensive survey, Briefings
     Bioinform. 22 (2021) 1656–1678.
[21] X. Han, T. Gao, Y. Yao, D. Ye, Z. Liu, M. Sun, OpenNRE: An Open and Extensible Toolkit
     for Neural Relation Extraction, in: Proc. of the 2019 Conference on Empirical Methods
     in Natural Language Processing and the 9th International Joint Conference on Natural
     Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, ACL,
     2019, pp. 169–174.
[22] Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural Relation Extraction with Selective Attention
     over Instances, in: Proc. of the 54th Annual Meeting of the Association for Computational
     Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, ACL,
     2016, pp. 2124–2133.
[23] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation Classification via Convolutional Deep
     Neural Network, in: Proc. of COLING 2014, 25th International Conference on Computa-
     tional Linguistics, Technical Papers, August 23-29, 2014, Dublin, Ireland, ACL, 2014, pp.
     2335–2344.
[24] D. Zeng, K. Liu, Y. Chen, J. Zhao, Distant Supervision for Relation Extraction via Piecewise
     Convolutional Neural Networks, in: Proc. of the 2015 Conference on Empirical Methods
     in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015,
     ACL, 2015, pp. 1753–1762.
[25] D. Zhang, D. Wang, Relation Classification via Recurrent Neural Network, CoRR
     abs/1508.01006 (2015).
[26] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional Long
     Short-Term Memory Networks for Relation Classification, in: Proc. of the 54th Annual
     Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016,
     Berlin, Germany, Volume 2: Short Papers, ACL, 2016, pp. 207–212.
[27] Y. Zhang, Q. Chen, Z. Yang, H. Lin, Z. Lu, BioWordVec, improving biomedical word
     embeddings with subword information and MeSH, Sci Data 6 (2019) 1–9.
[28] S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, S. Ananiadou, Distributional Semantics
     Resources for Biomedical Text Processing, Proc. of LBM (2013) 39–44.
[29] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Proc. of the 3rd
     International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA,
     May 7-9, 2015, 2015, pp. 1–15.
�

Exploiting Curated Databases to Train Relation Extraction Models for Gene-Disease Associations[edit]

load PDF

Exploiting Curated Databases to Train Relation
Extraction Models for Gene-Disease Associations*
(Discussion Paper)

Stefano Marchesina , Gianmaria Silvelloa
a
    Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Padova, Via Gradenigo 6/b, 35131, Padova, Italy


                                         Abstract
                                         Databases are pivotal to advancing biomedical science. Nevertheless, most of them are populated and
                                         updated by human experts with a great deal of effort. Biomedical Relation Extraction (BioRE) aims to
                                         shift these expensive and time-consuming processes to machines. Among its different applications, the
                                         discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges. Despite this,
                                         few resources have been devoted to training – and evaluating – models for GDA extraction. Besides,
                                         such resources are limited in size, preventing models from scaling effectively to large amounts of data.
                                         To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-
                                         automatically annotated dataset for GDA extraction: TBGA. TBGA is generated from more than 700K
                                         publications and consists of over 200K instances and 100K gene-disease pairs. We have evaluated state-
                                         of-the-art models for GDA extraction on TBGA, showing that it is a challenging dataset for the task. The
                                         dataset and models are publicly available to foster the development of state-of-the-art BioRE models for
                                         GDA extraction.

                                         Keywords
                                         Weak Supervision, Relation Extraction, Gene-Disease Association




1. Introduction
Curated databases, such as UniProt [2], DrugBank [3], or CTD [4], are pivotal to the develop-
ment of biomedical science. Such databases are usually populated and updated with a great deal
of effort by human experts [5], thus slowing down the biological knowledge discovery process.
To overcome this limitation, the Biomedical Information Extraction (BioIE) field aims to shift
population and curation processes to machines by developing effective computational tools that
automatically extract meaningful facts from the vast unstructured scientific literature [6, 7, 8].
Once extracted, machine-readable facts can be fed to downstream tasks to ease biological knowl-
edge discovery. Among the various tasks, the discovery of Gene-Disease Associations (GDAs)
is one of the most pressing challenges to advance precision medicine and drug discovery [9],
as it helps to understand the genetic causes of diseases [10]. Thus, the automatic extraction

                 * The full paper has been originally published in BMC Bioinformatics [1]
SEBD 2022: The 30th Italian Symposium on Advanced Database Systems (SEBD 2022), June 19–22, 2022, Pisa, Italy
Envelope-Open stefano.marchesin@unipd.it (S. Marchesin); gianmaria.silvello@unipd.it (G. Silvello)
Orcid 0000-0003-0362-5893 (S. Marchesin); 0000-0003-4970-4554 (G. Silvello)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
�and curation of GDAs is key to advance precision medicine research and provide knowledge to
assist disease diagnostics, drug discovery, and therapeutic decision-making.
   Most datasets used to train and evaluate Relation Extraction (RE) models for GDA extraction
are hand-labeled corpora [11, 12, 13]. However, hand-labeling data is an expensive process
requiring large amounts of time to expert biologists and, therefore, all of these datasets are
limited in size. To address this limitation, distant supervision has been proposed [14]. Under the
distant supervision paradigm, all the sentences mentioning the same pair of entities are labeled
by the corresponding relation stored within a source database. The assumption is that if two
entities participate in a relation, at least one sentence mentioning them conveys that relation.
As a consequence, distant supervision generates a large number of false positives, since not
all sentences express the relation between the considered entities. To counter false positives,
the RE task under distant supervision can be modeled as a Multi-Instance Learning (MIL)
problem [15, 16, 17, 18]. With MIL, the sentences containing two entities connected by a given
relation are collected into bags labeled with such relation. Grouping sentences into bags reduces
noise, as a bag of sentences is more likely to express a relation than a single sentence. Thus,
distant supervision alleviates manual annotation efforts, and MIL increases the robustness of
RE models to noise.
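The bag-construction step described above can be sketched as follows. This is a toy illustration with a hypothetical two-sentence corpus and a one-entry knowledge base, not TBGA data:

```python
# Distant supervision with Multi-Instance Learning (MIL): sentences mentioning
# the same (gene, disease) pair are grouped into a bag, and the bag inherits
# the relation stored for that pair in the source database.
from collections import defaultdict

# Toy knowledge base: (gene, disease) -> relation (illustrative, not DisGeNET)
kb = {("BRCA1", "breast carcinoma"): "biomarker"}

# Toy corpus: sentences annotated with the entity pair they mention
corpus = [
    ("BRCA1 mutations increase breast carcinoma risk.", ("BRCA1", "breast carcinoma")),
    ("BRCA1 was first cloned in 1994.", ("BRCA1", "breast carcinoma")),  # noisy match
]

def build_bags(corpus, kb, na_label="NA"):
    """Group sentences by entity pair; label each bag via the KB (distant supervision)."""
    grouped = defaultdict(list)
    for sentence, pair in corpus:
        grouped[pair].append(sentence)
    # A bag gets the KB relation of its pair, or NA if the pair is not in the KB
    return {pair: (kb.get(pair, na_label), sents) for pair, sents in grouped.items()}

bags = build_bags(corpus, kb)
# The single bag holds both sentences and is labeled "biomarker", even though
# the second sentence does not express the relation (a distant-supervision false positive)
```

The example also shows why MIL helps: the noisy second sentence is diluted inside the bag rather than acting as a stand-alone mislabeled training instance.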
   Since the advent of distant supervision, several datasets for RE have been developed under
this paradigm for news and biomedical science domains [14, 19, 6]. Among biomedical ones,
the most relevant datasets are BioRel [19], a large-scale dataset for domain-general Biomedical
Relation Extraction (BioRE), and DTI [6], a large-scale dataset developed to extract Drug-Target
Interactions (DTIs). In the wake of such efforts, we created TBGA: a novel large-scale, semi-
automatically annotated dataset for GDA extraction based on DisGeNET. We chose DisGeNET as
source database since it is one of the most comprehensive databases for GDAs [20], integrating
several expert-curated resources.
   Then, we trained and tested several state-of-the-art RE models on TBGA to create a large
and realistic benchmark for GDA extraction. We built models using OpenNRE [21], an open
and extensible toolkit for Neural Relation Extraction (NRE). The choice of OpenNRE eases the
re-use of the dataset and models developed for this work by future researchers. Finally, we
publicly released TBGA on Zenodo,1 and made the source code and scripts to train and
test RE models available in a public GitHub repository.2


2. Dataset
TBGA is the first large-scale, semi-automatically annotated dataset for GDA extraction. The
dataset consists of three text files, corresponding to train, validation, and test sets, plus an
additional JSON file containing the mapping between relation names and IDs. Each record in
train, validation, or test files corresponds to a single GDA extracted from a sentence, and it is
represented as a JSON object with the following attributes:

    • text : sentence from which the GDA was extracted.

   1 https://doi.org/10.5281/zenodo.5911097
   2 https://github.com/GDAMining/gda-extraction/
�    • relation : relation name associated with the given GDA.
    • h : JSON object representing the gene entity, composed of:
          ∘ id : NCBI Entrez ID associated with the gene entity.
          ∘ name : NCBI official gene symbol associated with the gene entity.
          ∘ pos : list consisting of starting position and length of the gene mention within text.
    • t : JSON object representing the disease entity, composed of:
          ∘ id : UMLS Concept Unique Identifier (CUI) associated with the disease entity.
          ∘ name : UMLS preferred term associated with the disease entity.
          ∘ pos : list consisting of starting position and length of the disease mention within
            text.

If a sentence contains multiple gene-disease pairs, the corresponding GDAs are split into separate
data records.
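Concretely, a record with the attributes above can be parsed and used as follows. This is a hypothetical example, not an actual TBGA entry: the IDs, offsets, and relation name are illustrative only.

```python
import json

# Hypothetical record following the schema described above (illustrative values)
record = json.loads("""
{
  "text": "TP53 mutations are frequent in breast carcinoma.",
  "relation": "genomic_alterations",
  "h": {"id": "7157", "name": "TP53", "pos": [0, 4]},
  "t": {"id": "C0678222", "name": "Breast Carcinoma", "pos": [31, 16]}
}
""")

# Recover each mention from its [start, length] span within the sentence
g_start, g_len = record["h"]["pos"]
d_start, d_len = record["t"]["pos"]
gene_mention = record["text"][g_start:g_start + g_len]        # "TP53"
disease_mention = record["text"][d_start:d_start + d_len]     # "breast carcinoma"
```

Note that the `name` fields carry the normalized identifiers (gene symbol, UMLS preferred term), while the `pos` spans point at the surface mentions in the sentence, which may differ in casing or wording.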
    Overall, TBGA contains over 200,000 instances and 100,000 bags. Table 1 reports per-relation
statistics for the dataset. Notice the large number of Not Associated (NA) instances. Regarding
gene and disease statistics, the most frequent genes are tumor suppressor genes, such as TP53
and CDKN2A, and (proto-)oncogenes, like EGFR and BRAF. Among the most frequent diseases,
we have neoplasms such as breast carcinoma, lung adenocarcinoma, and prostate carcinoma.
As a consequence, the most frequent GDAs are gene-cancer associations.

Table 1
Per-relation statistics for TBGA. Statistics are reported separately for each data split.
      Granularity       Split         Therapeutic    Biomarker     Genomic Alterations         NA
                        Train                3,139        20,145                  32,831    122,149
      Sentence-level    Validation             402         2,279                   2,306     15,206
                        Test                   384         2,315                   2,209     15,608
                        Train                2,218        13,372                  12,759     56,698
      Bag-level         Validation             331         2,019                   1,147      6,994
                        Test                   308         2,068                   1,122      6,996




3. Experimental Setup
3.1. Benchmarks
We performed experiments on three different datasets: TBGA, DTI, and BioRel. We used TBGA
as a benchmark to evaluate RE models for GDA extraction under the MIL setting. On the other
hand, we used DTI and BioRel only to validate the soundness of our implementation of the
baseline models.
3.2. Evaluation Measures
We evaluated RE models using the Area Under the Precision-Recall Curve (AUPRC). AUPRC
is a popular measure to evaluate distantly-supervised RE models, which has been adopted by
OpenNRE [21] and used in several works, such as [6, 19]. For experiments on TBGA, we also
computed Precision at k items (P@k).
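The two measures can be sketched as follows, under assumed standard definitions (not the exact OpenNRE implementation), treating each prediction as a (score, is_correct) pair:

```python
# P@k and a step-wise AUPRC (average precision) over scored predictions
def precision_at_k(preds, k):
    """P@k: fraction of correct predictions among the top-k by score."""
    top = sorted(preds, key=lambda p: p[0], reverse=True)[:k]
    return sum(correct for _, correct in top) / k

def auprc(preds):
    """Area under the precision-recall curve via step-wise summation
    (i.e., average precision over the ranked prediction list)."""
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    total_pos = sum(correct for _, correct in ranked)
    area, tp = 0.0, 0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            tp += 1
            area += (tp / rank) / total_pos  # precision at this recall step
    return area

# Toy scored predictions: (model score, 1 if the extracted relation is correct)
preds = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 0)]
```

With these toy predictions, `precision_at_k(preds, 2)` is 0.5 (one of the top two is correct), and `auprc(preds)` averages the precision values at each correctly ranked positive.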

3.3. Aggregation Strategies
We adopted two different sentence aggregation strategies to use RE models under the MIL
setting: average-based (AVE) and attention-based (ATT) [22]. The average-based aggregation
assumes that all sentences within the same bag contribute equally to the bag-level representation.
In other words, the bag representation is the average of all its sentence representations. On
the other hand, the attention-based aggregation represents each bag as a weighted sum of
its sentence representations, where the attention weights are dynamically adjusted for each
sentence.
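The two aggregation strategies can be sketched over plain Python lists. This is a simplified illustration: real models operate on learned tensor representations, and the fixed attention query below stands in for a learned relation query.

```python
import math

def ave_bag(sentence_reps):
    """AVE: bag representation = element-wise mean of sentence representations."""
    n = len(sentence_reps)
    return [sum(col) / n for col in zip(*sentence_reps)]

def att_bag(sentence_reps, query):
    """ATT: bag representation = softmax-weighted sum of sentence representations,
    with weights given by each sentence's dot-product score against the query."""
    scores = [sum(q * x for q, x in zip(query, rep)) for rep in sentence_reps]
    exps = [math.exp(s - max(scores)) for s in scores]  # stable softmax
    weights = [e / sum(exps) for e in exps]
    return [sum(w * c for w, c in zip(weights, col)) for col in zip(*sentence_reps)]

bag = [[1.0, 0.0], [0.0, 1.0]]   # two toy 2-dimensional sentence representations
ave_bag(bag)                     # [0.5, 0.5]: both sentences weigh equally
att_bag(bag, [2.0, 0.0])         # skewed toward the first (higher-scoring) sentence
```

The sketch makes the trade-off visible: AVE treats every sentence as equally informative, while ATT can down-weight sentences that score poorly against the relation query, i.e., likely false positives within the bag.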

3.4. Relation Extraction Models
We considered the main state-of-the-art RE models for our experiments: CNN [23], PCNN [24],
BiGRU [25, 19, 6], BiGRU-ATT [26, 6], and BERE [6]. All models use pre-trained word embeddings
to initialize word representations, whereas Position Features (PFs), Position Indicators (PIs),
and unknown words are initialized from a normal distribution, and blank words are initialized
with zeros.
   We adopted pre-trained BioWordVec [27] embeddings to perform experiments on TBGA. Two
versions of pre-trained BioWordVec embeddings are available: “Bio_embedding_intrinsic” and
“Bio_embedding_extrinsic”. We chose the “Bio_embedding_extrinsic” version as it is the most
suitable for BioRE. As for the experiments on DTI and BioRel, we adopted the pre-trained word
embeddings used in the original works [6, 19] – that is, the word embeddings from Pyysalo et
al. [28] for DTI, and the “Bio_embedding_extrinsic” version of BioWordVec for BioRel.
   For TBGA experiments, we used grid search to determine the best combination of optimizer
and learning rate: Stochastic Gradient Descent (SGD) with learning rate in {0.1, 0.2, 0.3, 0.4,
0.5} and Adam [29] with learning rate set to 0.0001.
For all RE models, we set the rest of the hyper-parameters empirically.
   For DTI and BioRel experiments, we relied on the hyper-parameter settings reported in the
original works [6, 19].
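The search described above can be sketched as a simple loop over the configuration grid. Here `train_and_validate` is a placeholder standing in for actual model training plus validation-set AUPRC, not an OpenNRE call, and its returned scores are invented for illustration:

```python
# Grid of (optimizer, learning rate) configurations mirroring the search space:
# SGD with lr in {0.1, ..., 0.5}, plus Adam with lr fixed at 1e-4
grid = [("sgd", lr) for lr in (0.1, 0.2, 0.3, 0.4, 0.5)] + [("adam", 1e-4)]

def train_and_validate(optimizer, lr):
    """Placeholder: train a model with this configuration and return its
    validation AUPRC. The scores below are fabricated for the sketch."""
    return {"sgd": 0.40 + 0.1 * min(lr, 0.3), "adam": 0.42}[optimizer]

# Pick the configuration with the best validation score
best = max(grid, key=lambda cfg: train_and_validate(*cfg))
```

Selecting on the validation split, as done here, keeps the test set untouched until the final evaluation.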


4. Experimental Results
We report the results for two different experiments. The first experiment aims to validate the
soundness of the implementation of the considered RE models. To this end, we trained and
tested the RE models on DTI and BioRel datasets, and we compared the AUPRC scores we
obtained against those reported in the original works [6, 19]. For this experiment, we only
compared the RE models and aggregation strategies that were used in the original works. The
Table 2
Results of the baselines validation on the DTI [6] and BioRel [19] datasets. The “–” symbol means that
the RE model, for the given aggregation strategy, was not originally evaluated on the specific dataset.
                     Model          Strategy   Implementation       DTI    BioRel
                     CNN            AVE        Reproduced             –     0.800
                                               Original               –     0.790
                                    ATT        Reproduced             –     0.790
                                               Original               –     0.780
                     PCNN           AVE        Reproduced          0.234    0.860
                                               Original            0.160    0.820
                                    ATT        Reproduced          0.408    0.820
                                               Original            0.359    0.790
                     BiGRU          AVE        Reproduced              –    0.870
                                               Original                –    0.800
                                    ATT        Reproduced          0.379    0.850
                                               Original            0.390    0.780
                     BiGRU-ATT      ATT        Reproduced          0.383        –
                                               Original            0.457        –
                     BERE           AVE        Reproduced          0.407        –
                                               Original            0.384        –
                                    ATT        Reproduced          0.525        –
                                               Original            0.524        –


second experiment uses TBGA as a benchmark to evaluate RE models for GDA extraction. In
this case, we trained and tested all the considered RE models using both aggregation strategies.
For each RE model, we reported the AUPRC and P@k scores.

4.1. Baselines Validation
The results of the baselines validation are reported in Table 2. We can observe that the RE
models we use from – or implement within – OpenNRE achieve performance higher than or
comparable to that reported in the original DTI and BioRel works. The only exceptions are BiGRU
and BiGRU-ATT on DTI, where the AUPRC scores of our implementations are lower than those
reported in the original work. However, Hong et al. [6] report the optimal hyper-parameter
settings for BERE, but not for the baselines. Thus, we attribute the negative difference between
our implementations and theirs to the lack of information about optimal hyper-parameters.
Overall, the results confirm the soundness of our implementations. Therefore, we can consider
them as competitive baseline models to use for benchmarking GDA extraction.

4.2. GDA Benchmarking
Table 3 reports the AUPRC and P@k scores of RE models on TBGA. Given the RE models’
performance, we make the following observations. First, the AUPRC scores achieved by
Table 3
RE models’ performance on the TBGA dataset. For each measure, bold values represent the best scores.
        Model         Strategy   AUPRC      P@50    P@100    P@250     P@500    P@1000
                      AVE           0.422   0.780    0.760     0.744    0.696      0.625
        CNN
                      ATT           0.403   0.780    0.760     0.788    0.710      0.624
                      AVE           0.426   0.780    0.780     0.744    0.720      0.664
        PCNN
                      ATT           0.404   0.760    0.750     0.744    0.700      0.628
                      AVE           0.437   0.620    0.720     0.724    0.730      0.678
        BiGRU
                      ATT           0.423   0.760    0.750     0.748    0.726      0.666
                      AVE           0.419   0.740    0.740     0.748    0.694      0.615
        BiGRU-ATT
                      ATT           0.390   0.680    0.760     0.756    0.702      0.631
                      AVE          0.419    0.700    0.710    0.720     0.704     0.620
        BERE
                      ATT          0.445    0.780    0.780    0.800     0.764     0.709


RE models on TBGA indicate the high complexity of the GDA extraction task. This complexity
is further supported by the lower performance obtained by top-performing RE models on
TBGA compared to DTI and BioRel (cf. Table 2). Secondly, the CNN, PCNN, BiGRU, and
BiGRU-ATT RE models behave similarly. Among them, BiGRU-ATT performs worst, which
suggests that replacing the BiGRU max pooling layer with an attention layer is less effective.
Overall, the best AUPRC and P@k scores are achieved by BERE when using the attention-
based aggregation strategy. This highlights BERE’s effectiveness in fully exploiting sentence
information from both the semantic and the syntactic aspects [6]. Thirdly, in terms of AUPRC, the
attention-based aggregation proves less effective than the average-based one. On the other hand,
attention-based aggregation provides mixed results on P@k measures. Although in contrast
with the results obtained in general-domain RE [22], this trend is in line with the results found
by Xing et al. [19] on BioRel, where RE models using an average-based aggregation strategy
achieve performance comparable to or higher than those using an attention-based one. The
only exception is BERE, which performs better with the attention-based aggregation than with
the average-based one. Thus, the obtained results suggest that TBGA is a
challenging dataset for GDA extraction.


5. Conclusions
We have created TBGA, a large-scale, semi-automatically annotated dataset for GDA extraction.
Automatic GDA extraction is one of the most relevant tasks of BioRE. We have used TBGA as a
benchmark to evaluate state-of-the-art BioRE models on GDA extraction. The results suggest
that TBGA is a challenging dataset for this task and, in general, for BioRE.


Acknowledgments
The work was supported by the EU H2020 ExaMode project, under Grant Agreement no. 825292.
References
 [1] S. Marchesin, G. Silvello, TBGA: a large-scale gene-disease association dataset for biomed-
     ical relation extraction, BMC Bioinform. 23 (2022) 111.
 [2] A. Bairoch, R. Apweiler, The SWISS-PROT protein sequence data bank and its supplement
     TrEMBL, Nucleic Acids Res. 25 (1997) 31–36.
 [3] D. S. Wishart, C. Knox, A. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang,
     J. Woolsey, DrugBank: a comprehensive resource for in silico drug discovery and explo-
     ration, Nucleic Acids Res. 34 (2006) 668–672.
 [4] C. J. Mattingly, G. T. Colby, J. N. Forrest, J. L. Boyer, The Comparative Toxicogenomics
     Database (CTD), Environ. Health Perspect. 111 (2003) 793–795.
 [5] P. Buneman, J. Cheney, W. C. Tan, S. Vansummeren, Curated Databases, in: Proc. of the
     Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
     Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada, ACM, 2008, pp. 1–12.
 [6] L. Hong, J. Lin, S. Li, F. Wan, H. Yang, T. Jiang, D. Zhao, J. Zeng, A novel machine learn-
     ing framework for automated biomedical relation extraction from large-scale literature
     repositories, Nat. Mach. Intell. 2 (2020) 347–355.
 [7] E. P. Barracchia, G. Pio, D. D’Elia, M. Ceci, Prediction of new associations between ncrnas
     and diseases exploiting multi-type hierarchical clustering, BMC Bioinform. 21 (2020) 70.
 [8] S. Alaimo, R. Giugno, A. Pulvirenti, ncpred: ncrna-disease association prediction through
     tripartite network-based inference, Front. Bioeng. Biotechnol. 2 (2014).
 [9] S. Dugger, A. Platt, D. Goldstein, Drug development in the era of precision medicine, Nat.
     Rev. Drug. Discov. 17 (2018) 183–196.
[10] J. P. González, J. M. Ramírez-Anguita, J. Saüch-Pitarch, F. Ronzano, E. Centeno, F. Sanz, L. I.
     Furlong, The DisGeNET knowledge platform for disease genomics: 2019 update, Nucleic
     Acids Res. 48 (2020) D845–D855.
[11] E. M. van Mulligen, A. Fourrier-Réglat, D. Gurwitz, M. Molokhia, A. Nieto, G. Trifirò, J. A.
     Kors, L. I. Furlong, The EU-ADR corpus: Annotated drugs, diseases, targets, and their
     relationships, J. Biomed. Informatics 45 (2012) 879–884.
[12] D. Cheng, C. Knox, N. Young, P. Stothard, S. Damaraju, D. S. Wishart, PolySearch: a
     web-based text mining system for extracting relationships between human diseases, genes,
     mutations, drugs and metabolites, Nucleic Acids Res. 36 (2008) 399–405.
[13] H. J. Lee, S. H. Shim, M. R. Song, H. Lee, J. C. Park, CoMAGC: a corpus with multi-faceted
     annotations of gene-cancer relations, BMC Bioinform. 14 (2013) 323.
[14] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without
     labeled data, in: Proc. of the 47th Annual Meeting of the Association for Computational
     Linguistics (ACL 2009) and the 4th International Joint Conference on Natural Language
     Processing of the AFNLP, 2-7 August 2009, Singapore, ACL, 2009, pp. 1003–1011.
[15] T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the Multiple Instance Problem
     with Axis-Parallel Rectangles, Artif. Intell. 89 (1997) 31–71.
[16] S. Riedel, L. Yao, A. McCallum, Modeling Relations and Their Mentions without Labeled
     Text, in: Proc. of Machine Learning and Knowledge Discovery in Databases, European
     Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, volume 6323 of
     LNCS, Springer, 2010, pp. 148–163.
[17] R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, D. S. Weld, Knowledge-Based Weak
     Supervision for Information Extraction of Overlapping Relations, in: The 49th Annual
     Meeting of the Association for Computational Linguistics: Human Language Technologies,
     Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, ACL, 2011, pp.
     541–550.
[18] M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance Multi-label Learning
     for Relation Extraction, in: Proc. of the 2012 Joint Conference on Empirical Methods in
     Natural Language Processing and Computational Natural Language Learning, EMNLP-
     CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, ACL, 2012, pp. 455–465.
[19] R. Xing, J. Luo, T. Song, BioRel: towards large-scale biomedical relation extraction, BMC
     Bioinform. 21-S (2020) 543.
[20] Z. Tanoli, U. Seemab, A. Scherer, K. Wennerberg, J. Tang, M. Vähä-Koskela, Exploration of
     databases and methods supporting drug repurposing: a comprehensive survey, Briefings
     Bioinform. 22 (2021) 1656–1678.
[21] X. Han, T. Gao, Y. Yao, D. Ye, Z. Liu, M. Sun, OpenNRE: An Open and Extensible Toolkit
     for Neural Relation Extraction, in: Proc. of the 2019 Conference on Empirical Methods
     in Natural Language Processing and the 9th International Joint Conference on Natural
     Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, ACL,
     2019, pp. 169–174.
[22] Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural Relation Extraction with Selective Attention
     over Instances, in: Proc. of the 54th Annual Meeting of the Association for Computational
     Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, ACL,
     2016, pp. 2124–2133.
[23] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation Classification via Convolutional Deep
     Neural Network, in: Proc. of COLING 2014, 25th International Conference on Computa-
     tional Linguistics, Technical Papers, August 23-29, 2014, Dublin, Ireland, ACL, 2014, pp.
     2335–2344.
[24] D. Zeng, K. Liu, Y. Chen, J. Zhao, Distant Supervision for Relation Extraction via Piecewise
     Convolutional Neural Networks, in: Proc. of the 2015 Conference on Empirical Methods
     in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015,
     ACL, 2015, pp. 1753–1762.
[25] D. Zhang, D. Wang, Relation Classification via Recurrent Neural Network, CoRR
     abs/1508.01006 (2015).
[26] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional Long
     Short-Term Memory Networks for Relation Classification, in: Proc. of the 54th Annual
     Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016,
     Berlin, Germany, Volume 2: Short Papers, ACL, 2016, pp. 207–212.
[27] Y. Zhang, Q. Chen, Z. Yang, H. Lin, Z. Lu, BioWordVec, improving biomedical word
     embeddings with subword information and MeSH, Sci Data 6 (2019) 1–9.
[28] S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, S. Ananiadou, Distributional Semantics
     Resources for Biomedical Text Processing, Proc. of LBM (2013) 39–44.
[29] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Proc. of the 3rd
     International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA,
     May 7-9, 2015, 2015, pp. 1–15.