Vol-3194/paper33
Paper
Paper | |
---|---|
description | |
id | Vol-3194/paper33 |
wikidataid | →Q117344916 |
title | Expanding the Citation Graph for Data Citations |
pdfUrl | https://ceur-ws.org/Vol-3194/paper33.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/BunemanDLS22 |
volume | Vol-3194→Vol-3194 |
session | → |
Expanding the Citation Graph for Data Citations
Expanding the Citation Graph for Data Citations*
Peter Buneman¹, Dennis Dosso², Matteo Lissandrini³ and Gianmaria Silvello²
¹ University of Edinburgh, UK
² University of Padua, Italy
³ Aalborg University, Denmark
Abstract
The Citation Graph (CG) is a computational artifact widely used to represent the domain of published literature. There is an increasing demand to treat the publication of data in the same way that we treat conventional publications. It should be possible to cite data for the same reasons that it is necessary to cite other publications. In this paper we examine some of the limitations of the citation graph, and we discuss how some implementation-agnostic extensions may solve them, thus also allowing the introduction of data and the management of data citations within the CG.
Keywords
Data Citation, Bibliometrics, Citation Graph
1. Introduction
Citations are essential to all forms of research: they are necessary, among other things, to identify the cited material, to retrieve it, and to give credit to its creators. The Citation Graph (CG) is a model that describes how citations link research entities, typically papers, journals, and books [2, 3]. It supports activities such as authorship tracking, discovery of new publications, and the computation of bibliometrics. Several implementations of the CG are available, such as Google Scholar, Microsoft Academic Graph (MAG)¹, Scopus², and Web of Science (WoS)³.
Much research now relies on curated databases, which have largely replaced traditional reference works and now play a crucial role in science [4]. There is now a strong demand to give databases the same scholarly status as traditional scientific works [5, 6], and it is a common opinion that citations to data should be given the same scholarly status as traditional citations and that they should contribute to bibliometric indicators [7, 8].
However, no citation-based system currently takes scientific databases and data citations properly into account.
SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
* The full paper was published in the Quantitative Science Studies journal [1].
Email: opb@inf.ed.ac.uk (P. Buneman); dennis.dosso@unipd.it (D. Dosso); matteo@cs.aau.dk (M. Lissandrini); gianmaria.silvello@unipd.it (G. Silvello)
Web: http://www.dei.unipd.it/~dosso/ (D. Dosso); https://disi.unitn.it/~lissandrini/ (M. Lissandrini); http://www.dei.unipd.it/~silvello/ (G. Silvello)
ORCID: 0000-0001-7307-4607 (D. Dosso); 0000-0001-7922-5998 (M. Lissandrini); 0000-0003-4970-4554 (G. Silvello)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
¹ https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
² https://www.scopus.com/home.uri
³ https://clarivate.com/webofsciencegroup/solutions/web-of-science/
We claim that the current model of the citation graph cannot properly accommodate data citations. Two significant features are missing or poorly represented in CGs: (i) the representation of the context of a citation, and (ii) the versioning of publications (e.g., papers, databases, and data subsets, for which updated versions or corrections are published over time). These shortcomings limit the management of data objects and their citations, as well as the representation of traditional publications and their connections. To overcome these limitations and to introduce data citations in the CG, we propose and discuss two extensions of this model. The first one, reference annotation, models the information that may accompany a citation, such as its context for a traditional citation or the query used to obtain the data for a data citation.
Through the information contained in the reference annotations, it is possible to add data and count data citations. The second extension is a new relationship among publications called subsumption. A publication subsumes another one when it takes its place in receiving citations, for instance when a new version of the publication is released. Moreover, using subsumption, it is possible to model citations to an evolving database.
The paper is organized as follows: Section 2 presents in further detail some core concepts and the existing limits of the CG; Section 3 presents our proposed extensions to overcome these limits and shows how they can be exploited to add data and data citations to the CG; finally, Section 3.4 presents our conclusions.
2. Core concepts and limits of the CG
2.1. Core concepts
Citable Unit. A Citable Unit (CU) is a published entity, such as a paper, a chapter, or a portion of data or software, which presents all the necessary qualities to be considered a “citable work”. The characterization we adopt requires the CU to be: uniquely and unambiguously identifiable and citable; available in perpetuity and in unchanged form; accessible; self-contained and complete [9]. Although some of these requirements are subjective, and not straightforward for databases, they still provide a workable starting point. One of the most problematic aspects for a CU belonging to a database is the requirement that it remain unchanged, since databases evolve and creating a CU for each version may be counterproductive. We also note that a CU may contain other citable units, such as the chapters forming a book. There is thus a “part-of” relationship between CUs that we discuss later.
Reference. Scientific papers usually present a list of references at their end. Traditionally, a reference is a pointer to another publication in the literature, together with a brief description of it.
It is a short text composed of fields representing metadata such as the title of the publication, its authors, the publication year, and its venue. This information enables us to identify and find the entity. We note that the contents of a reference are determined by the cited entity. That is, up to stylistic variation, the contents of a reference will be the same in any paper that cites the entity. In the citation graph, the presence of a reference determines a directed edge between the nodes representing the citing CU and the cited CU.
Reference pointer, reference annotation, and citation. Generally speaking, there is no universal agreement on the distinction between the concepts of reference and citation, and these two terms are often used interchangeably [10, 11, 12]. In the body of a paper we may find, for example, a textual artifact such as “Austen, J. (2004). pp 101-104”. We call the first part of this artifact, “Austen, J. (2004)”, the reference pointer. It is used to denote a single bibliographic reference in the reference section when mentioned in the body of the paper. Such pointers may be accompanied by some additional information. In our example, this additional information is provided by the text “pp 101-104”, which specifies the exact location in the cited work that the citation refers to. The same reference pointer may therefore appear many times in the same paper, each time with different additional information. We call this additional information the reference annotation; it is not part of the reference itself and depends on how and where the reference is used. This information can be thought of as a form of context. Finally, we define a citation to be the combination of a reference pointer with the (optional) reference annotation.
Part-of. A paper may appear in a collection of papers or in the proceedings of a conference, where both the paper and the collection are CUs.
Databases and datasets often have a hierarchical structure; for example, many datasets have a simple directory structure. In addition to accommodating reference pointers, the citation graph may need to represent a part-of relationship between citable units in the graph.
2.2. Limits of the Citation Graph
The basic model of the CG consists of a directed graph $\mathcal{G} = \{V, E\}$, where $V$ is the set of papers and $E \subseteq V \times V$ is the set of citations; an edge $\langle p_1, p_2 \rangle$ signals the presence of a citation from paper $p_1$ to paper $p_2$. The following limitations of this model can already be seen in the more traditional systems that model citations among papers, and they are also a hindrance to the introduction of data citations.
Lack of context. The nodes of the citation graph often contain information about the entity they represent, but not about what we called the context of the citations in this entity. In fact, the only information provided by the edge $\langle p_1, p_2 \rangle$ is that $p_1$ cites $p_2$, whereas we may need to know, for instance, which parts of $p_2$ are being referred to by $p_1$.
Versions. Ideally, the entities in the citation graph should be clearly distinguishable from each other. However, this is not always the case, and some entities in $V$ may be quite similar to each other. This happens for many reasons, including: an “abstract” version first published in conference proceedings and later published again in its “full version” in a journal; or a paper first published in an online archive and later released as a full-fledged, peer-reviewed publication in a conference. For these reasons, it is not always clear which of the versions of a paper should be considered the one that receives the citations. Generally speaking, it is in the authors’ interest to have these documents conflated into one, so that the citations scattered among the different versions accumulate on one representative entity.
With Google Scholar, for example, it appears that all the versions found on the Internet are clustered together into one unique “main” version, which becomes the recipient of all references coming from other entities.
[Figure 1: Exemplification of the use of references and reference annotations. The paper nodes P1 and P2 (Stauch, B. et al., 2019, DOI 10.1038/s41586-019-1141-3, and Johansson, L. C. et al., 2019, DOI 10.1038/s41586-019-1144-0) are connected through the reference node reference_1 (Type: paper_to_paper), which has two reference annotation nodes: ra_1 (page_numbers: 33-36) and ra_2 (page_numbers: 101-111).]
Citations to data. One of the primary roles of data citation is to give credit and attribution to the work of data creators and curators [5]. If integrated into the citation graph, data citations can be represented and analyzed as if they were conventional citations, with data CUs and corresponding authors receiving citations and thus credit for their work. The existing services present limitations that make data citation de facto unfeasible or limited [13]. Systems like Google Dataset Search⁴ allow us to search for indexed datasets, but they do not keep track of citations to data or of other statistics, such as clicks or downloads. WoS models data citations, but only at the whole-database level, as does Zenodo⁵. Therefore, when multiple authors contribute to the same curated database, it is impossible to distinguish to which of the authors the credit should go.
3. How to extend the graph and deal with data citations
The two extensions to the citation graph proposed in this paper already exist in limited forms in some implementations. For example, the MAG already allows a citation context to be included, but it lacks the ability to accurately cite data and manage their versions. The extensions proposed in this section are independent of any specific implementation of the citation graph and, for the most part, they can be incorporated directly into the existing implementations of the CG rather than requiring a completely new implementation of the supporting database.
3.1. Reference Annotation
To address the problem of lack of context, it is necessary to annotate the CG edges that represent the references. Most data models currently implemented do not support data on edges⁶, so for consistency with these models, for our first extension we propose two new node types, rather than new kinds of edges: the reference and reference annotation nodes.
⁴ https://datasetsearch.research.google.com/
⁵ https://zenodo.org/
⁶ Property graphs are an exception to this observation.
[Figure 2: The subsumption relation between two CUs.]
Consider Figure 1, where the paper $P_1$ is citing $P_2$. We can imagine a reference in the “References” section of $P_1$ as something similar to “Johansson, L. C. et al. (2019). XFEL structures of the human MT2 melatonin receptor reveal the basis of subtype selectivity. Nature, 569 (7755), 289–292. doi: 10.1038/s41586-019-1144-0.”. The existence of this reference is reflected by the presence of an edge between the two nodes. We add the reference node reference_1 between $P_1$ and $P_2$. This new type of node contains information such as the edge type (reference), the timestamp of when the citation was registered by the system, and the reference type (in this case, from a paper to another paper). The actual information contained in a reference node can be modeled according to whatever model we decide to follow.
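The step from a plain edge to a reference node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CitationGraph`, `ReferenceNode`, `add_unit`, `add_reference`, and `plain_edges` are hypothetical and do not belong to any existing CG implementation; the field names loosely follow Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceNode:
    """Reified citation edge: a node sitting between the citing and the cited CU."""
    id_r: str
    citing: str      # id of the citing CU, e.g. "P1"
    cited: str       # id of the cited CU, e.g. "P2"
    ref_type: str    # e.g. "paper_to_paper"
    timestamp: str   # when the citation was registered (ISO 8601 string)


class CitationGraph:
    """Hypothetical in-memory CG whose citations are stored as reference nodes."""

    def __init__(self):
        self.units = {}        # CU id -> metadata dict
        self.references = {}   # reference id -> ReferenceNode

    def add_unit(self, cu_id, **metadata):
        self.units[cu_id] = metadata

    def add_reference(self, id_r, citing, cited, ref_type, timestamp):
        # Both endpoints must already be nodes of the graph.
        for cu_id in (citing, cited):
            if cu_id not in self.units:
                raise KeyError(f"unknown citable unit: {cu_id}")
        node = ReferenceNode(id_r, citing, cited, ref_type, timestamp)
        self.references[id_r] = node
        return node

    def plain_edges(self):
        """Recover the basic model: the set E of (citing, cited) pairs."""
        return {(r.citing, r.cited) for r in self.references.values()}


cg = CitationGraph()
cg.add_unit("P1", doi="10.1038/s41586-019-1141-3")
cg.add_unit("P2", doi="10.1038/s41586-019-1144-0")
cg.add_reference("reference_1", "P1", "P2", "paper_to_paper", "2019-10-30T10:40:28")
print(cg.plain_edges())
```

Since the basic edge set can be recomputed from the reference nodes, algorithms written against the plain graph should, under this design, continue to work unchanged.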
Let us now suppose that $P_1$ cites $P_2$ twice, that is, in the body of $P_1$ there are two reference pointers to $P_2$, each with its own context. To model this, we add two neighboring reference annotation nodes to reference_1, namely ra_1 and ra_2. These two nodes contain the information describing the context of the two citations found in $P_1$, which may consist of references to particular pages, tables or images, comments on the nature of the citations, or some other type of metadata. The reference annotation node therefore acts as a container for all the context information. As we shall see, this new node is extremely helpful in introducing data citations to the CG.
3.2. Subsumption
To deal with the different versions of a CU, we propose introducing into the citation graph a new relation, called subsumption.
Consider Figure 2, which shows two subsequent moments in time, $t'$ and $t''$. At time $t'$, the paper $P_1$ cites paper $P_2$, as shown by the reference node r_1. Let us now consider a new version of the same paper, $P_2'$, which is published and inserted in the citation graph at time $t''$. The statement “$P_2'$ subsumes $P_2$” indicates that the former is a new version of the latter and is, from now on, the paper to consult and reference. For consistency with our approach to the reference and reference annotation nodes, we model the subsumption property as a new type of node. From time $t''$ onward, each new paper, such as $P_3$, cites $P_2'$. Moreover, through subsumption, the citations that were previously assigned to $P_2$ can now be “moved” to its latest version. However, we do not want to destroy the original link from $P_1$ to $P_2$: to do so would be to “rewrite history” and remove information from the graph. Thus, we add a new edge from the reference node r_1 to $P_2'$. This edge describes the fact that the old citation is now “re-assigned” to the new node $P_2'$. A timestamp on the edges indicates the moment in time at which they were added to the graph.
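The re-assignment step, in which a reference node gains a new timestamped edge without losing the old one, can be sketched as follows. The names (`point_to`, `current_target`, `subsume`, `citation_counts`) and the dates are hypothetical, timestamps are simplified to ISO date strings, and the subsumption node itself is omitted for brevity. Counting only the latest edge of each reference is the simplest possible policy, assumed here for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ReferenceNode:
    id_r: str
    citing: str
    # Outgoing edges as (timestamp, cited CU id); entries are only ever appended.
    targets: list = field(default_factory=list)

    def point_to(self, cited, timestamp):
        self.targets.append((timestamp, cited))

    def current_target(self):
        # ISO 8601 strings sort chronologically, so max() picks the latest edge.
        return max(self.targets, key=lambda edge: edge[0])[1]


def subsume(references, old_cu, new_cu, timestamp):
    """Record that new_cu subsumes old_cu by re-assigning citations to it."""
    for ref in references:
        if ref.current_target() == old_cu:
            ref.point_to(new_cu, timestamp)


def citation_counts(references):
    """Count citations per CU using only the most recent edge of each reference."""
    return Counter(ref.current_target() for ref in references)


r_1 = ReferenceNode("r_1", citing="P1")
r_1.point_to("P2", "2019-10-30")            # time t': P1 cites P2
refs = [r_1]

subsume(refs, old_cu="P2", new_cu="P2'", timestamp="2020-05-01")  # time t''

r_2 = ReferenceNode("r_2", citing="P3")
r_2.point_to("P2'", "2020-06-15")           # later papers cite the new version
refs.append(r_2)

print(citation_counts(refs))
print(r_1.targets)                           # the original edge to P2 is still there
```

Keeping the edge list append-only is what preserves history in this sketch: the earlier edge from r_1 to P2 remains available for any analysis that needs the state of the graph at time t'.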
The most recent edge leaving a reference node can thus be considered the “valid” one and used, for instance, by algorithms that compute bibliometric measures such as the h-index or the impact factor.
We note that while one paper may subsume more than one other paper (e.g., a book node with its chapters), it does not make sense for one paper to be subsumed by more than one other CU. Thus, the subsumption relation is many-to-one and acyclic, and it creates a forest within the CG. The roots of the trees of this forest are the ones receiving all the credit, and we call them Primary Citable Units (PCU).
It may not always be desirable or correct for citations to be “inherited” through the subsumption relation, for example when a new version presents a different list of authors. In this case, different algorithms can be considered to decide how to transfer the citations, depending on the nature of the CU.
3.3. Data in the Citation Graph
To deal with data citations in the CG, we use the term “database” in its most general sense: it may be a relational database, an ontology, some form of graph database, or a collection of files [14]. The whole database may be considered a CU; however, parts of a database may also be CUs. In fact, different parts of a database may present different types of information that the user may want to cite explicitly. Also, different parts of a database can be curated by different sets of authors (this is particularly true for curated databases such as GtoPdb [15]), making an accurate citation crucial for the correct attribution of credit.
Here, by “part” of a database we mean a view [14]. A view is a query, a notion we again generalize to cover anything from a relational query over a relational database, to a directory path or a URI for a collection of files, to a query in one of the several available query languages. We assume that it is the task of the Database Administrator (DBA) to define these views and, consequently, the corresponding CUs.
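To make the idea of view-defined CUs more concrete, the sketch below registers a few views over a database, each with its own curators, and attributes credit per curator from a list of cited views. Everything in it is illustrative: the `ViewCU` type, the view identifiers, the query strings, and the curator names are invented and do not describe the schema of GtoPdb or of any real database.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewCU:
    """A citable unit defined by a view over a database (hypothetical type)."""
    cu_id: str
    query: str        # a relational query, a directory path, a URI, ...
    curators: tuple   # the people credited when this view is cited


def credit_per_curator(views, cited_ids):
    """Give each curator one unit of credit per citation to one of their views."""
    by_id = {v.cu_id: v for v in views}
    credit = Counter()
    for cu_id in cited_ids:
        credit.update(by_id[cu_id].curators)
    return credit


# Views as a DBA might define them; all values are made up for the example.
views = [
    ViewCU("db/family_a", "SELECT * FROM target WHERE family = 'A'",
           ("curator_1", "curator_2")),
    ViewCU("db/family_b", "SELECT * FROM target WHERE family = 'B'",
           ("curator_3",)),
    ViewCU("db/docs", "/docs/overview/", ("curator_1",)),
]

# Three citations: two to the first view, one to the documentation directory.
cited = ["db/family_a", "db/family_a", "db/docs"]
credit = credit_per_curator(views, cited)
print(credit)
```

With citations recorded only at the whole-database level, the three curators would be indistinguishable; with view-level CUs, the uncited view contributes no credit to its curator.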
Let us start, for simplicity, by considering a static database, i.e., a database that does not evolve over time. By defining the CUs as views, we immediately obtain a part-of relationship. Given two views $V_1$ and $V_2$, we say that $V_1$ is part of $V_2$ if it can be answered from the result of $V_2$, i.e., if there is a query $Q$ such that, for each database instance $D$, $V_1(D) = Q(V_2(D))$. In this context we observe that the query that defines the view being cited is a fundamental part of the data citation itself. As such, it can be inserted as metadata in the reference annotation node of the corresponding citation.
Many approaches can be defined to decide how to introduce a new CU corresponding to a view in the CG. We discuss two of them.
A first solution is to keep the whole database as the only CU and recipient of citations. Every time a paper cites data in the database, it cites that CU, while the reference annotation contains the query used to identify the cited data. With this solution the number of citations to a database may become very high, but, on the other hand, it may happen that the rightful authors and curators of the parts of the database actually being cited do not receive any credit for their work. The queries contained in the reference annotations can be used, though, to compute accurate bibliometric measures when required.
A second strategy creates a new CU every time a paper cites a data subset extracted through a new query. With this solution, new views are created whenever necessary. On the one hand, this may produce an explosion in the number of nodes in the CG, many of which may receive only one citation. On the other hand, it makes it possible to cite the exact set of data extracted by the query and to give credit to the rightful authors.
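The two strategies can be compared on a toy example. The following is a minimal sketch under our own assumptions: citations are given as (citing paper, query) pairs, references are plain dictionaries, and the function names and the `DB/view_n` identifier scheme are hypothetical.

```python
from collections import Counter


def cite_whole_database(citations, db_id="DB"):
    """Strategy 1: the database is the only CU; the query lives in the annotation."""
    refs = [
        {"citing": paper, "cited": db_id, "annotation": {"query": query}}
        for paper, query in citations
    ]
    return refs, Counter(r["cited"] for r in refs)


def cite_view_per_query(citations, db_id="DB"):
    """Strategy 2: every new query yields a new CU, linked part-of to the database."""
    view_ids = {}   # query -> CU id of the view it defines
    part_of = {}    # view CU id -> database CU id
    refs = []
    for paper, query in citations:
        if query not in view_ids:
            view_ids[query] = f"{db_id}/view_{len(view_ids) + 1}"
            part_of[view_ids[query]] = db_id
        refs.append({"citing": paper, "cited": view_ids[query],
                     "annotation": {"query": query}})
    return refs, Counter(r["cited"] for r in refs), part_of


# Three papers cite the database; two of them use the same query.
citations = [("P1", "q_a"), ("P2", "q_a"), ("P3", "q_b")]

refs_1, counts_1 = cite_whole_database(citations)
refs_2, counts_2, part_of = cite_view_per_query(citations)

print(counts_1)   # all citations land on the database node
print(counts_2)   # citations are spread over one node per distinct query
print(part_of)
```

In both variants the query is kept in the annotation, so the finer-grained counts of the second strategy could in principle be recomputed from the references produced by the first.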
If the CUs created under this second strategy are connected to the main database through the part-of relationship, the whole database may still inherit these citations, depending on the strategy defined by the DBAs.
Most databases are not static and, unlike documents, they are expected to evolve over time. If a new version of a database is released periodically, one option would be to treat each version as a new CU. However, a database may appear in the CG as a hierarchy of CUs connected through the part-of relationship. While the database may change, some of the individual CUs in its hierarchy may not. Moreover, even when a CU changes, it may not change in its entirety; thus we may want to treat it as a new version rather than as an entirely unrelated new CU. To solve these problems, one option is to let the DBA decide. Every time a new version of the database is released, the administrators go through the different CUs that compose the part-of hierarchy and decide which CUs need a new version. These new versions will be connected to the old ones via subsumption if the DBAs deem that the new CU can be considered a new version of the old one, rather than a completely different entity (e.g., when there is some structural change, or the number of curators has changed).
3.4. Conclusions
The current basic model of the citation graph is unsuited to the representation of published data and data citations. We propose an implementation-agnostic model that includes reference annotations and a new subsumption property. The annotations contain the context of a citation, such as page numbers or the query issued to obtain the data. The subsumption property indicates that one citable unit version “takes the place” of another and inherits its citations if needed. Using these extensions, we discussed how to represent data and data citations in the graph, addressing problems such as the correct attribution of a data citation to authors and the evolution of citable units over time.
Acknowledgments
The work was partially supported by the ExaMode project, as part of the European Union H2020 program under Grant Agreement no. 825292. Matteo Lissandrini is supported by the European Union H2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 838216. Peter Buneman was partly supported by the Huawei Edinburgh Research Laboratory.
References
[1] P. Buneman, D. Dosso, M. Lissandrini, G. Silvello, Data citation and the citation graph, Quantitative Science Studies 2 (2022) 1399–1422. doi:10.1162/qss_a_00166.
[2] A.-W. K. Harzing, R. Van der Wal, Google Scholar as a new source for citation analysis, Ethics in Science and Environmental Politics 8 (2008) 61–73.
[3] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 990–998.
[4] L. Candela, D. Castelli, P. Manghi, A. Tani, Data Journals: A Survey, Journal of the Association for Information Science and Technology 66 (2015) 1747–1762. doi:10.1002/asi.23358.
[5] Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data, volume 12, CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013.
[6] FORCE-11, Data Citation Synthesis Group: Joint Declaration of Data Citation Principles, FORCE11, San Diego, CA, USA, 2014.
[7] C. W. Belter, Measuring the Value of Research Data: A Citation Analysis of Oceanographic Data Sets, PLoS ONE 9 (2014) e92590.
[8] I. Peters, P. Kraker, E. Lex, C. Gumpenberger, J. Gorraiz, Research data explored: An extended analysis of citations and altmetrics, Scientometrics 107 (2016) 723–744.
[9] C. Wilke, What constitutes a citable scientific work?, 2015. https://serialmentor.com/blog/2015/1/2/what-constitutes-a-citable-scientific-work.
[10] M. Altman, M. Crosas, The evolution of data citation: from principles to implementation, IASSIST Quarterly 37 (2014) 62–62.
[11] M. Daquino, S. Peroni, D. Shotton, G. Colavizza, B. Ghavimi, A. Lauscher, P. Mayr, M. Romanello, P. Zumstein, The OpenCitations data model, in: International Semantic Web Conference, Springer, 2020, pp. 447–463.
[12] F. Osareh, Bibliometrics, citation analysis and co-citation analysis: A review of literature I, Libri 46 (1996) 149–158.
[13] P. Buneman, G. Christie, J. A. Davies, R. Dimitrellou, S. D. Harding, A. J. Pawson, J. L. Sharman, Y. Wu, Why data citation isn’t working, and what to do about it, Database (2020). doi:10.1093/databa/baaa022.
[14] P. Buneman, S. Davidson, J. Frew, Why data citation is a computational problem, Communications of the ACM 59 (2016) 50–57.
[15] C. Southan, J. L. Sharman, H. E. Benson, E. Faccenda, A. J. Pawson, S. P. Alexander, O. P. Buneman, A. P. Davenport, J. C. McGrath, J. A. Peters, et al., The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands, Nucleic Acids Research 44 (2015) D1054–D1068.