Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper23 |
wikidataid | Q117344884→Q117344884 |
title | Explaining Link Prediction with Kelpie |
pdfUrl | https://ceur-ws.org/Vol-3194/paper23.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/0002FMT22 |
volume | Vol-3194→Vol-3194 |
session | → |
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper23 |
wikidataid | Q117344884→Q117344884 |
title | Explaining Link Prediction with Kelpie |
pdfUrl | https://ceur-ws.org/Vol-3194/paper23.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/0002FMT22 |
volume | Vol-3194→Vol-3194 |
session | → |
Explaining Link Prediction with Kelpie Andrea Rossi1 , Donatella Firmani2 , Paolo Merialdo1 and Tommaso Teofili1 1 Roma Tre University, Rome, Italy 2 Sapienza University, Rome, Italy Abstract Link Prediction (LP) is the problem of inferring new facts in a Knowledge Graph from the already known ones. Recent advances in Machine Learning have led researchers to develop LP models that represent Knowledge Graph elements as vectors in an embedding space. This approach has led to greatly encouraging results, often outperforming traditional approaches; its main shortcoming is its opaqueness, hindering both our understanding and our trust in these models. In this context, we discuss the Kelpie explainability framework. Kelpie can be applied to any embedding-based LP model and can identify the combinations of training facts that have enabled the prediction of a given link. Kelpie can extract two complementary types of explanations, that we dub necessary and sufficient. Keywords Knowledge Graph Embeddings, Link Prediction, Explainable AI 1. Introduction Knowledge Graphs (KGs) provide a natural way to model structured real-world information. The nodes of a KG represent real-world entities, and they are connected by directed edges denoted by labels conveying semantic relations. Each edge connects two entities, forming a triple that is called a fact. In time, several KGs have achieved web-scale size, e.g., DBPedia, Yago, Wikidata, and, in industry, the Google KG; nonetheless, all KGs suffer from incompleteness, as they miss large portions of the information they should contain. For instance, it has been observed that, in the FreeBase KG, over 70% of the person entities have unknown birthplace, and over 99% have unknown ethnicity [1]. Link Prediction (LP) aims at inferring new, missing facts by analyzing those already present in the KG. For instance, in the graph in Figure 1 (left), knowing ⟨Barack_Obama, born_in, Honolulu⟩ and ⟨Honolulu, located_in, USA⟩, we can probably infer ⟨Barack_Obama, nationality, USA⟩. In the last decade, LP researchers have achieved extremely promising results with the use of KG embeddings, i.e., numerical vectors mapped to each element in the KG, and learned with Machine Learning (ML) techniques leveraging the known facts. KG Embeddings have rapidly become the dominant approach to LP in literature, with dozens of new models proposed every year (see [2] for a survey). Like most ML systems, LP models based on KG Embeddings are inherently opaque, as they do not provide any insights on the reasons why they predict specific links. We argue that interpretability in LP systems is vital. Explanations reveal which patterns our models SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ andrea.rossi3@uniroma3.it (A. Rossi); donatella.firmani@uniroma1.it (D. Firmani); paolo.merialdo@uniroma3.it (P. Merialdo); tommaso.teofili@uniroma3.it (T. Teofili) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) � supported supported supported Barack Obama Bill Gates Homologous Bill Gates Non-Homologous Bill Gates Mimic Mimic born_in born_in born_in born_in nationality nationality born_in nationality born_in Honolulu Seattle Honolulu Seattle Honolulu Seattle president_of president_of president_of located_in located_in located_in located_in located_in located_in Figure 1: A small KG example (left), together with an example of homologous (center) and non- homologous (right) mimic. The missing fact 〈Barack_Obama, nationality, USA〉 is denoted in red. are leveraging to perform their predictions: in the LP field, this would provide a pathway to gain a deeper understanding of our models, to identify biases or errors in our KGs, and most importantly to decide whether our models can be trusted or not. In this paper we discuss the Kelpie framework for explaining embedding-based link predic- tions, that we originally present in [3]. Accordingly to the taxonomy for explainable AI methods in [4], Kelpie is a local post-hoc interpretability method; it can be applied to any LP model based on embeddings, independently of its architecture. Given a prediction, Kelpie explains it by identifying the subset of training facts that have enabled it. Given any prediction to explain, Kelpie can interpret it in two scenarios, dubbed necessary and sufficient. A necessary explanation is the minimal set of facts in absence of which the model becomes unable to yield the prediction: in doing so, necessary explanations can suggest why a specific entity has been predicted in a certain way. A sufficient explanation is the minimal set of facts that, if added to other entities, lead the model to yield the same prediction for those entities too: sufficient explanations can thus suggest the rules that would extend the same prediction to any entity. As observed in [5], necessity and sufficiency are the building blocks of any successful explanation. The Kelpie code and resources are publicly available. 1 2. Problem definition A Knowledge Graph (KG) is a labeled directed graph 𝐾𝐺 = (ℰ, ℛ, 𝒢) where ℰ is the set of nodes, or entities, in the graph; ℛ is its set of labels, or relations; and 𝒢 ⊆ ℰ × ℛ × ℰ is its set of edges linking entities via relations, i.e., its set of facts. In any fact 〈h, r, t〉 the first entity h is the head, r is the relation, and the second entity t is the tail. Models for embedding-based Link Prediction (LP) typically define a scoring function 𝜑 that estimates the plausibility of any fact using the embeddings of its head, relation and tail. In training, models learn embeddings to optimize the plausibility of the known facts: in prediction, new links are identified by finding which entities, if added to incomplete triples as heads or tails, yield the best scores. A tail prediction ⟨ℎ, 𝑟, 𝑡⟩ is the process that finds 𝑡 to be the best scoring tail for the incomplete triple ⟨ℎ, 𝑟, ?⟩, i.e., 𝑡 = argmax𝑒∈ℰ 𝜑(ℎ, 𝑟, 𝑒). Head predictions are defined symmetrically. In the following we focus on tail predictions; all our formulations 1 https://github.com/AndRossi/Kelpie �can be adapted for head predictions analogously. LP literature usually relies on datasets sampled from real-world KGs, and in which the set of extracted facts 𝒢 is split into a training set 𝒢𝑡𝑟𝑎𝑖𝑛 , a validation set 𝒢𝑣𝑎𝑙𝑖𝑑 , and a test set 𝒢𝑡𝑒𝑠𝑡 . In evaluation, head and tail predictions are performed on each fact in 𝒢𝑡𝑒𝑠𝑡 ; in each prediction, the target entity, i.e, the expected answer, is ranked against all the others in ℰ. The obtained ranks are then aggregated into standard global metrics: Hits@K (H@K), that measures the fraction of ranks 𝑇 with value ≤ 𝑘, and Mean Reciprocal Rank (MRR), that averages the inverse of all the obtained ranks 𝑇 . Both H@K and MRR are always between 0 and 1; higher values convey better results. Discussions on the pros and cons of such methodologies can be found in [6, 7, 8]. For some predictions there can exist more than one correct answer (e.g., in case of 1-to-𝑛 relations). In this event we follow the common practice called filtered setting of excluding such alternative answers when computing the rank of the target entity. For a given prediction Kelpie defines two explanation types: necessary and sufficient. We indicate with 𝑋 the generic candidate explanation (either necessary or sufficient) and with 𝑋 * the explanation we want to identify and return. We use the subscripts 𝑛 (e.g., 𝑋𝑛 ) and 𝑠 to denote necessary and sufficient explanations respectively. We give definitions of 𝑋𝑛* and 𝑋𝑠* w.r.t. tail predictions 〈ℎ, 𝑟, 𝑡〉, as head predictions can be handled analogously. ∙ 𝑋𝑛* is the smallest set of training facts featuring ℎ such that, removing 𝑋𝑛* from 𝒢𝑡𝑟𝑎𝑖𝑛 and retraining the model from scratch, the top ranking tail for ⟨ℎ, 𝑟, ?⟩ changes to any 𝑒 ̸= 𝑡. ∙ 𝑋𝑠* is the smallest set of training facts featuring ℎ such that, given a set of random entities 𝐶 ⊆ ℰ such that 𝑡 is not the top ranking tail for any ⟨𝑐, 𝑟, ?⟩ prediction, if we add the facts in 𝑋𝑠* to any 𝑐 ∈ 𝐶 and retrain the model from scratch, the top ranking tail for ⟨𝑐, 𝑟, ?⟩ changes to 𝑡. When this happens, we say that the 𝑐 entities have been converted. For instance, in the previous example of the tail prediction ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑛𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝑖𝑡𝑦, 𝑈 𝑆𝐴⟩, the set {⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑏𝑜𝑟𝑛_𝑖𝑛, 𝐻𝑜𝑛𝑜𝑙𝑢𝑙𝑢⟩, ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡, 𝑈 𝑆𝐴⟩} is a necessary explanation if removing these facts from Barack_Obama the pre- dicted 𝑛𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝑖𝑡𝑦 changes from 𝑈 𝑆𝐴 to any other entity. Analogously, the single-sized set {⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡_𝑜𝑓, 𝑈 𝑆𝐴⟩} constitutes a sufficient explanation if adding it to any non-American entities, e.g., Édith_Piaf, changes their predicted nationality to USA. 3. System Overview The high-level architecture of Kelpie is shown in Figure 2. We briefly describe the Kelpie modules w.r.t. tail predictions ⟨ℎ, 𝑟, 𝑡⟩; head predictions can be handled analogously. Pre-filter. This module analyzes the set of training facts mentioning ℎ, that we call 𝒢𝑡𝑟𝑎𝑖𝑛ℎ , in order identify and discard the least promising ones: its overall goal is to reduce the search space for the following steps, preventing combinatorial explosion when ℎ is featured in too many facts. The Pre-Filter achieves this goal by computing for each fact in 𝒢𝑡𝑟𝑎𝑖𝑛 ℎ a promisingness value based on the graph topology, and by returning the subset ℱ𝑡𝑟𝑎𝑖𝑛 of the top promising facts. Intuitively, ℎ topologically closer entities often bear a stronger semantic relationship: so, for any ⟨ℎ, 𝑠, 𝑞⟩ (or ⟨𝑞, 𝑠, ℎ⟩) connecting ℎ to an entity 𝑞 close to 𝑡 we compute promisingness as the length of the shortest non-oriented path connecting 𝑞 to 𝑡; lower values convey better promisingness � & 𝒢!"#$% & ℱ!"#$% 𝑋∗ Pre-Filter Explanation Builder 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑐𝑒 𝑋 Relevance Engine Figure 2: High-level Kelpie architecture, highlighting the interactions among its three modules. (and thus higher priority in the filtering). Consider again the example in Figure 1 (left) and suppose we want to explain the tail prediction ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑛𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝑖𝑡𝑦, 𝑈 𝑆𝐴⟩: ∙ ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡_𝑜𝑓, 𝑈 𝑆𝐴⟩ has promisingness 0 (the best possible value). ∙ ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑏𝑜𝑟𝑛_𝑖𝑛, 𝐻𝑜𝑛𝑜𝑙𝑢𝑙𝑢⟩ has promisingness 1. ∙ ⟨𝐵𝑖𝑙𝑙_𝐺𝑎𝑡𝑒𝑠, 𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑, 𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎⟩ has promisingness 2, as the shortest path from 𝐵𝑖𝑙𝑙_𝐺𝑎𝑡𝑒𝑠 to 𝑈 𝑆𝐴 has length 2: [〈Bill_Gates, born_in, Seattle〉, 〈Seattle, located_in, USA〉]. Explanation Builder. This module explores the space of the candidate explanations 𝑋 that can be obtained by combining the Pre-Filtered facts, with the goal of identifying the smallest 𝑋 that is relevant enough for the prediction to explain. The notion of relevance of an explanation embodies the likelihood of yielding changes in predictions, and it is quantified by the Relevance Engine described in the next paragraph. Suppose that in the example in Figure 1 (left) only the facts 𝑓1 = ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡_𝑜𝑓, 𝑈 𝑆𝐴⟩ and 𝑓2 = ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑏𝑜𝑟𝑛_𝑖𝑛, 𝐻𝑜𝑛𝑜𝑙𝑢𝑙𝑢⟩ survive the Pre-Filtering. The Explanation Builder starts by exploring short explanations, such as, 𝑋1 = {𝑓1 } and 𝑋2 = {𝑓2 }; then, it iteratively builds longer ones, such as 𝑋3 = {𝑓1 , 𝑓2 }, until either 𝑋 * is found with relevance higher than a user-specified threshold, or early stopping policies are enacted. In the latter case, the best explanation seen so far is returned. Among same-sized explanations (e.g., 𝑋1 and 𝑋2 ), a prioritization mechanism based on an off-line heuristic is used. For details about early stopping and exploration priority, we refer the reader to Algorithm 3 in [3]. Relevance Engine. This module aims at estimating how adding or removing training facts from ℎ would affect the prediction to explain. Ideally, the relevance of any candidate explanation 𝑋 could be computed by retraining the model after adding or removing its facts from 𝒢𝑡𝑟𝑎𝑖𝑛 ; in practice this is unfeasible, so Kelpie introduces the novel ML process of Post-Training (PT). PT consists in using an already trained model to obtain an alternate version, i.e., a mimic, of an entity 𝑒 and of the corresponding embedding. Given 𝑒, PT amounts to: (i) replicating the training facts mentioning 𝑒, 𝒢𝑡𝑟𝑎𝑖𝑛 𝑒 , and potentially injecting additions/removals; (ii) training a new embedding to optimize the plausibility of those facts, while keeping all the other embeddings and parameters in the model frozen. Compared to a full training, PT is a lightweight process as it optimizes only one embedding and uses a training set that is several orders of magnitude smaller than the entire 𝐺𝑡𝑟𝑎𝑖𝑛 . For any entity 𝑒 Kelpie can post-train two types of mimics: ∙ A homologous mimic 𝑒′ has its embedding initialized randomly, and then post-trained on an exact replica of 𝒢𝑡𝑟𝑎𝑖𝑛 𝑒 . As a result, we expect 𝑒′ to behave similarly to the original entity 𝑒. ∙ A non-homologous mimic 𝑒′−𝑀 (or 𝑒′+𝑀 ) initialized randomly as well, and then post-trained � Necessary Explanations Effectiveness Sufficient Explanations Effectiveness FB15k WN18 FB15k-237 WN18RR YAGO3-10 FB15k WN18 FB15k-237 WN18RR YAGO3-10 ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR TransE Kelpie -0.490 -0.321 -0.920 -0.857 -0.540 -0.356 -0.920 -0.904 -0.740 -0.580 +0.319 +0.516 +0.259 +0.434 +0.128 +0.218 +0.117 +0.169 +0.048 +0.109 DP -0.380 -0.258 -0.900 -0.859 -0.460 -0.303 -0.770 -0.701 -0.670 -0.533 +0.273 +0.461 +0.183 +0.350 +0.051 +0.080 +0.082 +0.116 +0.036 +0.117 Kelpie -0.850 -0.695 -0.910 -0.827 -0.590 -0.413 -0.980 -0.913 -0.960 -0.858 +0.945 +0.920 +0.953 +0.962 +0.377 +0.490 +0.834 +0.877 +0.858 +0.892 ComplEx DP -0.540 -0.458 -0.800 -0.742 -0.340 -0.185 -0.750 -0.650 -0.810 -0.714 +0.910 +0.888 +0.931 +0.940 +0.245 +0.334 +0.836 +0.878 +0.835 +0.878 Criage -0.030 -0.020 -0.050 -0.045 -0.090 -0.050 -0.180 -0.150 -0.05 -0.030 +0.068 +0.069 +0.105 +0.137 +0.035 +0.038 +0.110 +0.147 +0.000 +0.000 Kelpie -0.710 -0.516 -0.930 -0.860 -0.430 -0.284 -0.980 -0.914 -0.980 -0.884 +0.677 +0.649 +0.903 +0.900 +0.225 +0.203 +0.827 +0.856 +0.799 +0.848 ConvE DP -0.350 -0.232 -0.790 -0.752 -0.290 -0.195 -0.850 -0.750 -0.880 -0.799 +0.234 +0.199 +0.396 +0.412 +0.202 +0.169 +0.373 +0.419 +0.366 +0.391 Criage -0.080 -0.056 -0.160 -0.146 -0.290 -0.196 -0.170 -0.156 -0.070 -0.042 +0.106 +0.065 +0.164 +0.166 +0.132 +0.094 +0.150 +0.164 +0.019 +0.020 Table 1 Effectiveness of necessary and sufficient explanations on a set of facts that replicates 𝒢𝑡𝑟𝑎𝑖𝑛 𝑒 with the removal (or addition) of a set 𝑀 of facts. After the post-training is over, the mimic is expected to approximate the behaviour that the original entity 𝑒 would have displayed if the injected variation had been present since the beginning. Examples of homologous and non-homologous mimics are in Figure 1 (center and right respec- tively). When explaining any tail prediction ⟨ℎ, 𝑟, 𝑡⟩, we can thus estimate the effect of adding (or removing) from ℎ the facts of any candidate explanation 𝑋 by comparing the outcomes obtained using a homologous mimic ℎ′ and a non-homologous mimic mimic ℎ′+𝑋 (or ℎ′+𝑋 ). 4. Experiments In this section we briefly report the main experimental results of Kelpie. For a deeper report on the experimental evaluation of the system, see [3]. Setup. Our experiments have been run on a server with 88 CPUs Intel Core(TM) i7-3820, 516GB RAM and 4 16GB NVIDIA Tesla P100 GPUs. We consider the 5 best-established dataset in LP literature: FB15k and FB15k-237, sampled from Freebase; WN18 and WN18RR, sampled from WordNet; and YAGO3-10, sampled from YAGO3. We consider 3 models representative for the 3 different families of the taxonomy in [2]: the geometric model TransE [9], the matrix factorization model ComplEx [10], and the deep learning model ConvE [11]. End-to-end performance. We compute the performance of Kelpie as the H@1 and MRR variation caused by applying the extracted explanation to the predictions to disable (in the necessary scenario) or to enable (in the sufficient scenario). More specifically: ∙ In our necessary scenario, for each model and dataset we randomly sample a set of 100 correct test tail predictions ⟨ℎ, 𝑟, 𝑡⟩. We extract their necessary explanations and remove the explanation facts from 𝒢𝑡𝑟𝑎𝑖𝑛 ; we then re-train the model and check how the removals have worsened the ranks of the target tails 𝑡. We measure such worsenings in terms of ∆𝐻@1 and ∆𝑀 𝑅𝑅: more negative values correspond to higher effectiveness. ∙ In our sufficient scenario, we once again sample 100 correct test tail predictions ⟨ℎ, 𝑟, 𝑡⟩ to explain for each model and dataset. We extract their sufficient explanations, each with the � Necessary Explanations Sufficient Explanations FB15k WN18 FB15k-237 WN18RR YAGO3-10 FB15k WN18 FB15k-237 WN18RR YAGO3-10 AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD TransE 2.78 1.13 2.17 1.24 2.5 1.15 1.67 0.96 1.99 1.20 1.89 1.03 1.02 0.14 3.43 0.83 1.67 0.79 1.45 0.73 ComplEx 3.40 1.10 3.39 1.00 3.83 0.57 3.16 1.08 2.27 1.31 1.40 0.75 1.00 0.00 2.51 1.20 1.00 0.00 1.04 0.31 ConvE 3.36 0.89 2.91 1.20 2.26 1.24 2.66 1.21 1.73 0.97 1.83 1.23 1.01 0.10 3.09 1.10 1.04 0.24 1.18 0.48 Table 2 Lengths of the necessary and sufficient explanations goal of converting a different set 𝐶 of 10 random entities. Then, for each prediction ⟨ℎ, 𝑟, 𝑡⟩ we add the explanation facts to all 𝑐 entities in the corresponding 𝐶 set, we re-train the model, and we measure how the addition of the explanation facts has improved the tail rank of 𝑡 in each ⟨𝑐, 𝑟, 𝑡⟩ prediction. We measure such improvements in terms of ∆𝐻@1 and ∆𝑀 𝑅𝑅: more positive values correspond to higher effectiveness. We use as baselines two data poisoning systems, namely Criage [12] and the framework by Zhang et al. [13], that we denote as DP. The goal of these systems is to disable the selected ⟨ℎ, 𝑟, 𝑡⟩ predictions by removing/adding individual training facts. In the necessary scenario we compare directly to their removal setting. In the sufficient scenario, that they do not support natively, we adapt their formulations to identify which facts would not only disable ⟨ℎ, 𝑟, 𝑡⟩ but also enable the selected ⟨𝑐, 𝑟, 𝑡⟩ predictions. We do not run Criage on TransE, as the code provided by the authors only supports multiplicative models (e.g., ConvE and ComplEx). Results by Criage, DP and Kelpie are reported in Table 1. Kelpie almost always outperforms baselines, by tackling predictions that others fail to explain. Since Criage and DP are limited to explanations with only 1 fact, we have also experimented with a single-fact version of Kelpie; we have obtained results comparable to the best baselines but almost always worse than "full" Kelpie. This demonstrates the effectiveness of post-training at identifying the most relevant facts, as well as the utility of supporting fact combinations. Explanations Lengths. We report in Table 2, for each scenario, model and dataset, the average (AVG) and the standard deviation (STD) of the lengths of our explanations. We observe that necessary explanations tend to always be longer than sufficient ones, for the same datasets and models. Necessary explanations should encompass all the facts supporting a prediction, so that removing them disables the prediction. Sufficient explanations, on the other hand, just need to to extend the same prediction to other entities: so, it is usually enough for them to include just a few facts (or even just one) as pieces of evidence for the prediction. We also observe that, in many cases, the average explanation lengths are lesser than 2, suggesting that Kelpie can identify minimal explanations, as expected (extensive experiments on minimality are in [3]). Explanations and Bias. LP datasets have been found to suffer from various forms of data bias [14]; we now report a brief example showing how Kelpie explanations can unveil previously unknown instances of bias in the training data. We have observed that the correctly predicted YAGO3-10 test facts that convey that a person born_in a city are generally explained by that person being a soccer player playing for a football team from the same city. This is surprising: in the real world, being member of a football team does not imply having been born in its city. � Guided by our explanations, we have found that, in YAGO3-10, people playing for a football team are born in same city unnaturally often (thus providing data bias) and that personal data are significantly scarce, making birthplace prediction very challenging: the correct predictions, in this regard, seem to be affected even by the slight preference that football players may have towards teams from their birthplace. This type of observations can allow researchers to intervene on LP datasets to make them better adhere to the semantics of the real world. 5. Related Works So far few works have addressed directly the interpretability of embedding-based LP models. The works most related to ours are Criage [12] and the framework by Zhang et al. [13]. Both of them follow a data poisoning approach: given a prediction ⟨ℎ, 𝑟, 𝑡⟩, their goal is to find which is the individual fact that, if added to or removed from 𝒢𝑡𝑟𝑎𝑖𝑛 , worsens the score 𝜑(ℎ, 𝑟, 𝑡) the most. Criage [12] uses Influence Functions [15] to estimate how the addition or removal of any fact would affect 𝜑(ℎ, 𝑟, 𝑡); unfortunately, the adaptability of their formulation to the scoring functions of non-multiplicative models is unclear. The framework by Zhang et al. [13] uses gradient analysis to perturb the embedding of ℎ in the direction that worsens 𝜑(ℎ, 𝑟, 𝑡) the most. Other than having empirical differences with Kelpie (as illustrated in Section 4) these methods have an inherently different goal, as they do not aim directly at explaining predictions, but rather they investigate how models withstand adversarial modifications. One of the aspects that make the interpretability of LP models so challenging is the large variety of ML architectures. In time, researchers have tried to bypass this problem in a variety of ways. Some authors have chosen to support specific architectures [16], while other have proposed inherently interpretable LP models [17]. Finally, a few frameworks explain predictions by focusing on the dataset topology more than on the model and its embeddings [18]. We refer the reader to [3] for further discussion about the application of general purpose explanation methods like LIME [19], SHAP [20] and ANCHOR [21] to the LP setting, and their shortcomings. 6. Conclusions We have discussed Kelpie, a full-fledged explainability framework for embedding-based LP models. Kelpie explanations highlight the most relevant training samples that have enabled our predictions; they are based on the fundamental notions of necessity and sufficiency, and they can encompass combinations of multiple training samples. In the sparkling topic of LP on Knowledge Graphs, interpretability is a strikingly desirable property; yet, it is hardly ever guaranteed by current state-of-the-art systems and frameworks. The effectiveness of Kelpie explanations, measured through extensive experiments, shows that Kelpie largely surpasses pre-existing methods across almost all scenarios in literature; furthermore, as demonstrated, Kelpie explanations can be precious in identifying unbalances and biases in LP datasets. Acknowledgments The work by Donatella Firmani has been supported in part by SEED PNR FLOWER grant 2021. �References [1] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, D. Lin, Knowledge base completion via search-based question answering, in: WWW, 2014. [2] A. Rossi, D. Barbosa, D. Firmani, A. Matinata, P. Merialdo, Knowledge graph embedding for link prediction: A comparative analysis, TKDD (2021). [3] A. Rossi, D. Firmani, P. Merialdo, T. Teofili, Explaining link prediction systems based on knowledge graph embeddings, in: SIGMOD, 2022. [4] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. (2018). [5] D. S. Watson, L. Gultchin, A. Taly, L. Floridi, Local explanations via necessity and suffi- ciency: Unifying theory and practice, in: UAI, 2021. [6] A. Rossi, A. Matinata, Knowledge Graph Embeddings: Are Relation-Learning Models Learning Relations?, in: PIE, 2020. [7] Y. Wang, D. Ruffinelli, R. Gemulla, S. Broscheit, C. Meilicke, On Evaluating Embedding Models for Knowledge Base Completion, in: RepL4NLP@ACL, 2019. [8] A. Rossi, Interpreting link prediction on knowledge graphs., in: SEBD, 2020. [9] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: NIPS, 2013. [10] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex Embeddings for Simple Link Prediction, in: ICML, 2016. [11] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2d knowledge graph embeddings, in: AAAI, 2018. [12] P. Pezeshkpour, Y. Tian, S. Singh, Investigating robustness and interpretability of link prediction via adversarial modifications, in: NAACL-HLT, 2019. [13] H. Zhang, T. Zheng, J. Gao, C. Miao, L. Su, Y. Li, K. Ren, Data poisoning attack against knowledge graph embedding, in: IJCAI, 2019. [14] A. Rossi, D. Firmani, P. Merialdo, Knowledge graph embeddings or bias graph embeddings? a study of bias in link prediction models, in: DL4KG, 2021. [15] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: D. Precup, Y. W. Teh (Eds.), ICML, 2017. [16] Z. Ying, D. Bourgeois, J. You, M. Zitnik, J. Leskovec, Gnnexplainer: Generating explanations for graph neural networks, in: NIPS, 2019. [17] W. Zhang, S. Deng, H. Wang, Q. Chen, W. Zhang, H. Chen, Xtranse: Explainable knowledge graph embedding for link prediction with lifestyles in e-commerce, in: JIST, 2019. [18] W. Zhang, B. Paudel, W. Zhang, A. Bernstein, H. Chen, Interaction embeddings for prediction and explanation in knowledge graphs, in: WSDM, 2019. [19] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier, in: SIGKDD, 2016. [20] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: NIPS, 2017. [21] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: AAAI, 2018. �
Explaining Link Prediction with Kelpie Andrea Rossi1 , Donatella Firmani2 , Paolo Merialdo1 and Tommaso Teofili1 1 Roma Tre University, Rome, Italy 2 Sapienza University, Rome, Italy Abstract Link Prediction (LP) is the problem of inferring new facts in a Knowledge Graph from the already known ones. Recent advances in Machine Learning have led researchers to develop LP models that represent Knowledge Graph elements as vectors in an embedding space. This approach has led to greatly encouraging results, often outperforming traditional approaches; its main shortcoming is its opaqueness, hindering both our understanding and our trust in these models. In this context, we discuss the Kelpie explainability framework. Kelpie can be applied to any embedding-based LP model and can identify the combinations of training facts that have enabled the prediction of a given link. Kelpie can extract two complementary types of explanations, that we dub necessary and sufficient. Keywords Knowledge Graph Embeddings, Link Prediction, Explainable AI 1. Introduction Knowledge Graphs (KGs) provide a natural way to model structured real-world information. The nodes of a KG represent real-world entities, and they are connected by directed edges denoted by labels conveying semantic relations. Each edge connects two entities, forming a triple that is called a fact. In time, several KGs have achieved web-scale size, e.g., DBPedia, Yago, Wikidata, and, in industry, the Google KG; nonetheless, all KGs suffer from incompleteness, as they miss large portions of the information they should contain. For instance, it has been observed that, in the FreeBase KG, over 70% of the person entities have unknown birthplace, and over 99% have unknown ethnicity [1]. Link Prediction (LP) aims at inferring new, missing facts by analyzing those already present in the KG. For instance, in the graph in Figure 1 (left), knowing ⟨Barack_Obama, born_in, Honolulu⟩ and ⟨Honolulu, located_in, USA⟩, we can probably infer ⟨Barack_Obama, nationality, USA⟩. In the last decade, LP researchers have achieved extremely promising results with the use of KG embeddings, i.e., numerical vectors mapped to each element in the KG, and learned with Machine Learning (ML) techniques leveraging the known facts. KG Embeddings have rapidly become the dominant approach to LP in literature, with dozens of new models proposed every year (see [2] for a survey). Like most ML systems, LP models based on KG Embeddings are inherently opaque, as they do not provide any insights on the reasons why they predict specific links. We argue that interpretability in LP systems is vital. Explanations reveal which patterns our models SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ andrea.rossi3@uniroma3.it (A. Rossi); donatella.firmani@uniroma1.it (D. Firmani); paolo.merialdo@uniroma3.it (P. Merialdo); tommaso.teofili@uniroma3.it (T. Teofili) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) � supported supported supported Barack Obama Bill Gates Homologous Bill Gates Non-Homologous Bill Gates Mimic Mimic born_in born_in born_in born_in nationality nationality born_in nationality born_in Honolulu Seattle Honolulu Seattle Honolulu Seattle president_of president_of president_of located_in located_in located_in located_in located_in located_in Figure 1: A small KG example (left), together with an example of homologous (center) and non- homologous (right) mimic. The missing fact 〈Barack_Obama, nationality, USA〉 is denoted in red. are leveraging to perform their predictions: in the LP field, this would provide a pathway to gain a deeper understanding of our models, to identify biases or errors in our KGs, and most importantly to decide whether our models can be trusted or not. In this paper we discuss the Kelpie framework for explaining embedding-based link predic- tions, that we originally present in [3]. Accordingly to the taxonomy for explainable AI methods in [4], Kelpie is a local post-hoc interpretability method; it can be applied to any LP model based on embeddings, independently of its architecture. Given a prediction, Kelpie explains it by identifying the subset of training facts that have enabled it. Given any prediction to explain, Kelpie can interpret it in two scenarios, dubbed necessary and sufficient. A necessary explanation is the minimal set of facts in absence of which the model becomes unable to yield the prediction: in doing so, necessary explanations can suggest why a specific entity has been predicted in a certain way. A sufficient explanation is the minimal set of facts that, if added to other entities, lead the model to yield the same prediction for those entities too: sufficient explanations can thus suggest the rules that would extend the same prediction to any entity. As observed in [5], necessity and sufficiency are the building blocks of any successful explanation. The Kelpie code and resources are publicly available. 1 2. Problem definition A Knowledge Graph (KG) is a labeled directed graph 𝐾𝐺 = (ℰ, ℛ, 𝒢) where ℰ is the set of nodes, or entities, in the graph; ℛ is its set of labels, or relations; and 𝒢 ⊆ ℰ × ℛ × ℰ is its set of edges linking entities via relations, i.e., its set of facts. In any fact 〈h, r, t〉 the first entity h is the head, r is the relation, and the second entity t is the tail. Models for embedding-based Link Prediction (LP) typically define a scoring function 𝜑 that estimates the plausibility of any fact using the embeddings of its head, relation and tail. In training, models learn embeddings to optimize the plausibility of the known facts: in prediction, new links are identified by finding which entities, if added to incomplete triples as heads or tails, yield the best scores. A tail prediction ⟨ℎ, 𝑟, 𝑡⟩ is the process that finds 𝑡 to be the best scoring tail for the incomplete triple ⟨ℎ, 𝑟, ?⟩, i.e., 𝑡 = argmax𝑒∈ℰ 𝜑(ℎ, 𝑟, 𝑒). Head predictions are defined symmetrically. In the following we focus on tail predictions; all our formulations 1 https://github.com/AndRossi/Kelpie �can be adapted for head predictions analogously. LP literature usually relies on datasets sampled from real-world KGs, and in which the set of extracted facts 𝒢 is split into a training set 𝒢𝑡𝑟𝑎𝑖𝑛 , a validation set 𝒢𝑣𝑎𝑙𝑖𝑑 , and a test set 𝒢𝑡𝑒𝑠𝑡 . In evaluation, head and tail predictions are performed on each fact in 𝒢𝑡𝑒𝑠𝑡 ; in each prediction, the target entity, i.e, the expected answer, is ranked against all the others in ℰ. The obtained ranks are then aggregated into standard global metrics: Hits@K (H@K), that measures the fraction of ranks 𝑇 with value ≤ 𝑘, and Mean Reciprocal Rank (MRR), that averages the inverse of all the obtained ranks 𝑇 . Both H@K and MRR are always between 0 and 1; higher values convey better results. Discussions on the pros and cons of such methodologies can be found in [6, 7, 8]. For some predictions there can exist more than one correct answer (e.g., in case of 1-to-𝑛 relations). In this event we follow the common practice called filtered setting of excluding such alternative answers when computing the rank of the target entity. For a given prediction Kelpie defines two explanation types: necessary and sufficient. We indicate with 𝑋 the generic candidate explanation (either necessary or sufficient) and with 𝑋 * the explanation we want to identify and return. We use the subscripts 𝑛 (e.g., 𝑋𝑛 ) and 𝑠 to denote necessary and sufficient explanations respectively. We give definitions of 𝑋𝑛* and 𝑋𝑠* w.r.t. tail predictions 〈ℎ, 𝑟, 𝑡〉, as head predictions can be handled analogously. ∙ 𝑋𝑛* is the smallest set of training facts featuring ℎ such that, removing 𝑋𝑛* from 𝒢𝑡𝑟𝑎𝑖𝑛 and retraining the model from scratch, the top ranking tail for ⟨ℎ, 𝑟, ?⟩ changes to any 𝑒 ̸= 𝑡. ∙ 𝑋𝑠* is the smallest set of training facts featuring ℎ such that, given a set of random entities 𝐶 ⊆ ℰ such that 𝑡 is not the top ranking tail for any ⟨𝑐, 𝑟, ?⟩ prediction, if we add the facts in 𝑋𝑠* to any 𝑐 ∈ 𝐶 and retrain the model from scratch, the top ranking tail for ⟨𝑐, 𝑟, ?⟩ changes to 𝑡. When this happens, we say that the 𝑐 entities have been converted. For instance, in the previous example of the tail prediction ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑛𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝑖𝑡𝑦, 𝑈 𝑆𝐴⟩, the set {⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑏𝑜𝑟𝑛_𝑖𝑛, 𝐻𝑜𝑛𝑜𝑙𝑢𝑙𝑢⟩, ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡, 𝑈 𝑆𝐴⟩} is a necessary explanation if removing these facts from Barack_Obama the pre- dicted 𝑛𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝑖𝑡𝑦 changes from 𝑈 𝑆𝐴 to any other entity. Analogously, the single-sized set {⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡_𝑜𝑓, 𝑈 𝑆𝐴⟩} constitutes a sufficient explanation if adding it to any non-American entities, e.g., Édith_Piaf, changes their predicted nationality to USA. 3. System Overview The high-level architecture of Kelpie is shown in Figure 2. We briefly describe the Kelpie modules w.r.t. tail predictions ⟨ℎ, 𝑟, 𝑡⟩; head predictions can be handled analogously. Pre-filter. This module analyzes the set of training facts mentioning ℎ, that we call 𝒢𝑡𝑟𝑎𝑖𝑛ℎ , in order identify and discard the least promising ones: its overall goal is to reduce the search space for the following steps, preventing combinatorial explosion when ℎ is featured in too many facts. The Pre-Filter achieves this goal by computing for each fact in 𝒢𝑡𝑟𝑎𝑖𝑛 ℎ a promisingness value based on the graph topology, and by returning the subset ℱ𝑡𝑟𝑎𝑖𝑛 of the top promising facts. Intuitively, ℎ topologically closer entities often bear a stronger semantic relationship: so, for any ⟨ℎ, 𝑠, 𝑞⟩ (or ⟨𝑞, 𝑠, ℎ⟩) connecting ℎ to an entity 𝑞 close to 𝑡 we compute promisingness as the length of the shortest non-oriented path connecting 𝑞 to 𝑡; lower values convey better promisingness � & 𝒢!"#$% & ℱ!"#$% 𝑋∗ Pre-Filter Explanation Builder 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑐𝑒 𝑋 Relevance Engine Figure 2: High-level Kelpie architecture, highlighting the interactions among its three modules. (and thus higher priority in the filtering). Consider again the example in Figure 1 (left) and suppose we want to explain the tail prediction ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑛𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝑖𝑡𝑦, 𝑈 𝑆𝐴⟩: ∙ ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡_𝑜𝑓, 𝑈 𝑆𝐴⟩ has promisingness 0 (the best possible value). ∙ ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑏𝑜𝑟𝑛_𝑖𝑛, 𝐻𝑜𝑛𝑜𝑙𝑢𝑙𝑢⟩ has promisingness 1. ∙ ⟨𝐵𝑖𝑙𝑙_𝐺𝑎𝑡𝑒𝑠, 𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑, 𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎⟩ has promisingness 2, as the shortest path from 𝐵𝑖𝑙𝑙_𝐺𝑎𝑡𝑒𝑠 to 𝑈 𝑆𝐴 has length 2: [〈Bill_Gates, born_in, Seattle〉, 〈Seattle, located_in, USA〉]. Explanation Builder. This module explores the space of the candidate explanations 𝑋 that can be obtained by combining the Pre-Filtered facts, with the goal of identifying the smallest 𝑋 that is relevant enough for the prediction to explain. The notion of relevance of an explanation embodies the likelihood of yielding changes in predictions, and it is quantified by the Relevance Engine described in the next paragraph. Suppose that in the example in Figure 1 (left) only the facts 𝑓1 = ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑝𝑟𝑒𝑠𝑖𝑑𝑒𝑛𝑡_𝑜𝑓, 𝑈 𝑆𝐴⟩ and 𝑓2 = ⟨𝐵𝑎𝑟𝑎𝑐𝑘_𝑂𝑏𝑎𝑚𝑎, 𝑏𝑜𝑟𝑛_𝑖𝑛, 𝐻𝑜𝑛𝑜𝑙𝑢𝑙𝑢⟩ survive the Pre-Filtering. The Explanation Builder starts by exploring short explanations, such as, 𝑋1 = {𝑓1 } and 𝑋2 = {𝑓2 }; then, it iteratively builds longer ones, such as 𝑋3 = {𝑓1 , 𝑓2 }, until either 𝑋 * is found with relevance higher than a user-specified threshold, or early stopping policies are enacted. In the latter case, the best explanation seen so far is returned. Among same-sized explanations (e.g., 𝑋1 and 𝑋2 ), a prioritization mechanism based on an off-line heuristic is used. For details about early stopping and exploration priority, we refer the reader to Algorithm 3 in [3]. Relevance Engine. This module aims at estimating how adding or removing training facts from ℎ would affect the prediction to explain. Ideally, the relevance of any candidate explanation 𝑋 could be computed by retraining the model after adding or removing its facts from 𝒢𝑡𝑟𝑎𝑖𝑛 ; in practice this is unfeasible, so Kelpie introduces the novel ML process of Post-Training (PT). PT consists in using an already trained model to obtain an alternate version, i.e., a mimic, of an entity 𝑒 and of the corresponding embedding. Given 𝑒, PT amounts to: (i) replicating the training facts mentioning 𝑒, 𝒢𝑡𝑟𝑎𝑖𝑛 𝑒 , and potentially injecting additions/removals; (ii) training a new embedding to optimize the plausibility of those facts, while keeping all the other embeddings and parameters in the model frozen. Compared to a full training, PT is a lightweight process as it optimizes only one embedding and uses a training set that is several orders of magnitude smaller than the entire 𝐺𝑡𝑟𝑎𝑖𝑛 . For any entity 𝑒 Kelpie can post-train two types of mimics: ∙ A homologous mimic 𝑒′ has its embedding initialized randomly, and then post-trained on an exact replica of 𝒢𝑡𝑟𝑎𝑖𝑛 𝑒 . As a result, we expect 𝑒′ to behave similarly to the original entity 𝑒. ∙ A non-homologous mimic 𝑒′−𝑀 (or 𝑒′+𝑀 ) initialized randomly as well, and then post-trained � Necessary Explanations Effectiveness Sufficient Explanations Effectiveness FB15k WN18 FB15k-237 WN18RR YAGO3-10 FB15k WN18 FB15k-237 WN18RR YAGO3-10 ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR ΔH@1 ΔMRR TransE Kelpie -0.490 -0.321 -0.920 -0.857 -0.540 -0.356 -0.920 -0.904 -0.740 -0.580 +0.319 +0.516 +0.259 +0.434 +0.128 +0.218 +0.117 +0.169 +0.048 +0.109 DP -0.380 -0.258 -0.900 -0.859 -0.460 -0.303 -0.770 -0.701 -0.670 -0.533 +0.273 +0.461 +0.183 +0.350 +0.051 +0.080 +0.082 +0.116 +0.036 +0.117 Kelpie -0.850 -0.695 -0.910 -0.827 -0.590 -0.413 -0.980 -0.913 -0.960 -0.858 +0.945 +0.920 +0.953 +0.962 +0.377 +0.490 +0.834 +0.877 +0.858 +0.892 ComplEx DP -0.540 -0.458 -0.800 -0.742 -0.340 -0.185 -0.750 -0.650 -0.810 -0.714 +0.910 +0.888 +0.931 +0.940 +0.245 +0.334 +0.836 +0.878 +0.835 +0.878 Criage -0.030 -0.020 -0.050 -0.045 -0.090 -0.050 -0.180 -0.150 -0.05 -0.030 +0.068 +0.069 +0.105 +0.137 +0.035 +0.038 +0.110 +0.147 +0.000 +0.000 Kelpie -0.710 -0.516 -0.930 -0.860 -0.430 -0.284 -0.980 -0.914 -0.980 -0.884 +0.677 +0.649 +0.903 +0.900 +0.225 +0.203 +0.827 +0.856 +0.799 +0.848 ConvE DP -0.350 -0.232 -0.790 -0.752 -0.290 -0.195 -0.850 -0.750 -0.880 -0.799 +0.234 +0.199 +0.396 +0.412 +0.202 +0.169 +0.373 +0.419 +0.366 +0.391 Criage -0.080 -0.056 -0.160 -0.146 -0.290 -0.196 -0.170 -0.156 -0.070 -0.042 +0.106 +0.065 +0.164 +0.166 +0.132 +0.094 +0.150 +0.164 +0.019 +0.020 Table 1 Effectiveness of necessary and sufficient explanations on a set of facts that replicates 𝒢𝑡𝑟𝑎𝑖𝑛 𝑒 with the removal (or addition) of a set 𝑀 of facts. After the post-training is over, the mimic is expected to approximate the behaviour that the original entity 𝑒 would have displayed if the injected variation had been present since the beginning. Examples of homologous and non-homologous mimics are in Figure 1 (center and right respec- tively). When explaining any tail prediction ⟨ℎ, 𝑟, 𝑡⟩, we can thus estimate the effect of adding (or removing) from ℎ the facts of any candidate explanation 𝑋 by comparing the outcomes obtained using a homologous mimic ℎ′ and a non-homologous mimic mimic ℎ′+𝑋 (or ℎ′+𝑋 ). 4. Experiments In this section we briefly report the main experimental results of Kelpie. For a deeper report on the experimental evaluation of the system, see [3]. Setup. Our experiments have been run on a server with 88 CPUs Intel Core(TM) i7-3820, 516GB RAM and 4 16GB NVIDIA Tesla P100 GPUs. We consider the 5 best-established dataset in LP literature: FB15k and FB15k-237, sampled from Freebase; WN18 and WN18RR, sampled from WordNet; and YAGO3-10, sampled from YAGO3. We consider 3 models representative for the 3 different families of the taxonomy in [2]: the geometric model TransE [9], the matrix factorization model ComplEx [10], and the deep learning model ConvE [11]. End-to-end performance. We compute the performance of Kelpie as the H@1 and MRR variation caused by applying the extracted explanation to the predictions to disable (in the necessary scenario) or to enable (in the sufficient scenario). More specifically: ∙ In our necessary scenario, for each model and dataset we randomly sample a set of 100 correct test tail predictions ⟨ℎ, 𝑟, 𝑡⟩. We extract their necessary explanations and remove the explanation facts from 𝒢𝑡𝑟𝑎𝑖𝑛 ; we then re-train the model and check how the removals have worsened the ranks of the target tails 𝑡. We measure such worsenings in terms of ∆𝐻@1 and ∆𝑀 𝑅𝑅: more negative values correspond to higher effectiveness. ∙ In our sufficient scenario, we once again sample 100 correct test tail predictions ⟨ℎ, 𝑟, 𝑡⟩ to explain for each model and dataset. We extract their sufficient explanations, each with the � Necessary Explanations Sufficient Explanations FB15k WN18 FB15k-237 WN18RR YAGO3-10 FB15k WN18 FB15k-237 WN18RR YAGO3-10 AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD AVG STD TransE 2.78 1.13 2.17 1.24 2.5 1.15 1.67 0.96 1.99 1.20 1.89 1.03 1.02 0.14 3.43 0.83 1.67 0.79 1.45 0.73 ComplEx 3.40 1.10 3.39 1.00 3.83 0.57 3.16 1.08 2.27 1.31 1.40 0.75 1.00 0.00 2.51 1.20 1.00 0.00 1.04 0.31 ConvE 3.36 0.89 2.91 1.20 2.26 1.24 2.66 1.21 1.73 0.97 1.83 1.23 1.01 0.10 3.09 1.10 1.04 0.24 1.18 0.48 Table 2 Lengths of the necessary and sufficient explanations goal of converting a different set 𝐶 of 10 random entities. Then, for each prediction ⟨ℎ, 𝑟, 𝑡⟩ we add the explanation facts to all 𝑐 entities in the corresponding 𝐶 set, we re-train the model, and we measure how the addition of the explanation facts has improved the tail rank of 𝑡 in each ⟨𝑐, 𝑟, 𝑡⟩ prediction. We measure such improvements in terms of ∆𝐻@1 and ∆𝑀 𝑅𝑅: more positive values correspond to higher effectiveness. We use as baselines two data poisoning systems, namely Criage [12] and the framework by Zhang et al. [13], that we denote as DP. The goal of these systems is to disable the selected ⟨ℎ, 𝑟, 𝑡⟩ predictions by removing/adding individual training facts. In the necessary scenario we compare directly to their removal setting. In the sufficient scenario, that they do not support natively, we adapt their formulations to identify which facts would not only disable ⟨ℎ, 𝑟, 𝑡⟩ but also enable the selected ⟨𝑐, 𝑟, 𝑡⟩ predictions. We do not run Criage on TransE, as the code provided by the authors only supports multiplicative models (e.g., ConvE and ComplEx). Results by Criage, DP and Kelpie are reported in Table 1. Kelpie almost always outperforms baselines, by tackling predictions that others fail to explain. Since Criage and DP are limited to explanations with only 1 fact, we have also experimented with a single-fact version of Kelpie; we have obtained results comparable to the best baselines but almost always worse than "full" Kelpie. This demonstrates the effectiveness of post-training at identifying the most relevant facts, as well as the utility of supporting fact combinations. Explanations Lengths. We report in Table 2, for each scenario, model and dataset, the average (AVG) and the standard deviation (STD) of the lengths of our explanations. We observe that necessary explanations tend to always be longer than sufficient ones, for the same datasets and models. Necessary explanations should encompass all the facts supporting a prediction, so that removing them disables the prediction. Sufficient explanations, on the other hand, just need to to extend the same prediction to other entities: so, it is usually enough for them to include just a few facts (or even just one) as pieces of evidence for the prediction. We also observe that, in many cases, the average explanation lengths are lesser than 2, suggesting that Kelpie can identify minimal explanations, as expected (extensive experiments on minimality are in [3]). Explanations and Bias. LP datasets have been found to suffer from various forms of data bias [14]; we now report a brief example showing how Kelpie explanations can unveil previously unknown instances of bias in the training data. We have observed that the correctly predicted YAGO3-10 test facts that convey that a person born_in a city are generally explained by that person being a soccer player playing for a football team from the same city. This is surprising: in the real world, being member of a football team does not imply having been born in its city. � Guided by our explanations, we have found that, in YAGO3-10, people playing for a football team are born in same city unnaturally often (thus providing data bias) and that personal data are significantly scarce, making birthplace prediction very challenging: the correct predictions, in this regard, seem to be affected even by the slight preference that football players may have towards teams from their birthplace. This type of observations can allow researchers to intervene on LP datasets to make them better adhere to the semantics of the real world. 5. Related Works So far few works have addressed directly the interpretability of embedding-based LP models. The works most related to ours are Criage [12] and the framework by Zhang et al. [13]. Both of them follow a data poisoning approach: given a prediction ⟨ℎ, 𝑟, 𝑡⟩, their goal is to find which is the individual fact that, if added to or removed from 𝒢𝑡𝑟𝑎𝑖𝑛 , worsens the score 𝜑(ℎ, 𝑟, 𝑡) the most. Criage [12] uses Influence Functions [15] to estimate how the addition or removal of any fact would affect 𝜑(ℎ, 𝑟, 𝑡); unfortunately, the adaptability of their formulation to the scoring functions of non-multiplicative models is unclear. The framework by Zhang et al. [13] uses gradient analysis to perturb the embedding of ℎ in the direction that worsens 𝜑(ℎ, 𝑟, 𝑡) the most. Other than having empirical differences with Kelpie (as illustrated in Section 4) these methods have an inherently different goal, as they do not aim directly at explaining predictions, but rather they investigate how models withstand adversarial modifications. One of the aspects that make the interpretability of LP models so challenging is the large variety of ML architectures. In time, researchers have tried to bypass this problem in a variety of ways. Some authors have chosen to support specific architectures [16], while other have proposed inherently interpretable LP models [17]. Finally, a few frameworks explain predictions by focusing on the dataset topology more than on the model and its embeddings [18]. We refer the reader to [3] for further discussion about the application of general purpose explanation methods like LIME [19], SHAP [20] and ANCHOR [21] to the LP setting, and their shortcomings. 6. Conclusions We have discussed Kelpie, a full-fledged explainability framework for embedding-based LP models. Kelpie explanations highlight the most relevant training samples that have enabled our predictions; they are based on the fundamental notions of necessity and sufficiency, and they can encompass combinations of multiple training samples. In the sparkling topic of LP on Knowledge Graphs, interpretability is a strikingly desirable property; yet, it is hardly ever guaranteed by current state-of-the-art systems and frameworks. The effectiveness of Kelpie explanations, measured through extensive experiments, shows that Kelpie largely surpasses pre-existing methods across almost all scenarios in literature; furthermore, as demonstrated, Kelpie explanations can be precious in identifying unbalances and biases in LP datasets. Acknowledgments The work by Donatella Firmani has been supported in part by SEED PNR FLOWER grant 2021. �References [1] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, D. Lin, Knowledge base completion via search-based question answering, in: WWW, 2014. [2] A. Rossi, D. Barbosa, D. Firmani, A. Matinata, P. Merialdo, Knowledge graph embedding for link prediction: A comparative analysis, TKDD (2021). [3] A. Rossi, D. Firmani, P. Merialdo, T. Teofili, Explaining link prediction systems based on knowledge graph embeddings, in: SIGMOD, 2022. [4] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. (2018). [5] D. S. Watson, L. Gultchin, A. Taly, L. Floridi, Local explanations via necessity and suffi- ciency: Unifying theory and practice, in: UAI, 2021. [6] A. Rossi, A. Matinata, Knowledge Graph Embeddings: Are Relation-Learning Models Learning Relations?, in: PIE, 2020. [7] Y. Wang, D. Ruffinelli, R. Gemulla, S. Broscheit, C. Meilicke, On Evaluating Embedding Models for Knowledge Base Completion, in: RepL4NLP@ACL, 2019. [8] A. Rossi, Interpreting link prediction on knowledge graphs., in: SEBD, 2020. [9] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: NIPS, 2013. [10] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex Embeddings for Simple Link Prediction, in: ICML, 2016. [11] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2d knowledge graph embeddings, in: AAAI, 2018. [12] P. Pezeshkpour, Y. Tian, S. Singh, Investigating robustness and interpretability of link prediction via adversarial modifications, in: NAACL-HLT, 2019. [13] H. Zhang, T. Zheng, J. Gao, C. Miao, L. Su, Y. Li, K. Ren, Data poisoning attack against knowledge graph embedding, in: IJCAI, 2019. [14] A. Rossi, D. Firmani, P. Merialdo, Knowledge graph embeddings or bias graph embeddings? a study of bias in link prediction models, in: DL4KG, 2021. [15] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: D. Precup, Y. W. Teh (Eds.), ICML, 2017. [16] Z. Ying, D. Bourgeois, J. You, M. Zitnik, J. Leskovec, Gnnexplainer: Generating explanations for graph neural networks, in: NIPS, 2019. [17] W. Zhang, S. Deng, H. Wang, Q. Chen, W. Zhang, H. Chen, Xtranse: Explainable knowledge graph embedding for link prediction with lifestyles in e-commerce, in: JIST, 2019. [18] W. Zhang, B. Paudel, W. Zhang, A. Bernstein, H. Chen, Interaction embeddings for prediction and explanation in knowledge graphs, in: WSDM, 2019. [19] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier, in: SIGKDD, 2016. [20] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: NIPS, 2017. [21] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: AAAI, 2018. �