Difference between revisions of "Vol-3184/paper7"
Line 4:
|id=Vol-3184/paper7
|wikidataid=Q117040470
|authors=Diego Rincon-Yanez, Sabrina Senatore
|pdfUrl=https://ceur-ws.org/Vol-3184/TEXT2KG_Paper_7.pdf
|volume=Vol-3184
|storemode=property
+ |title=FAIR Knowledge Graph Construction from Text, an Approach Applied to Fictional Novels
+ |description=
}}
=Freitext=
Latest revision as of 16:54, 30 March 2023
| Paper | |
|---|---|
| id | Vol-3184/paper7 |
| wikidataid | Q117040470 |
| title | FAIR Knowledge Graph Construction from Text, an Approach Applied to Fictional Novels |
| pdfUrl | https://ceur-ws.org/Vol-3184/TEXT2KG_Paper_7.pdf |
| dblpUrl | |
| volume | Vol-3184 |
| session | |
| description | |
Freitext
FAIR Knowledge Graph construction from text, an approach applied to fictional novels
Diego Rincon-Yanez¹, Sabrina Senatore¹
¹ University of Salerno, Via Giovanni Paolo II, 132 - 84084 Fisciano SA, Italy

Abstract

A Knowledge Graph (KG) is a form of structured human knowledge depicting relations between entities, intended to reflect cognition and human-level intelligence. Large and openly available knowledge graphs (KGs) like DBpedia, YAGO and WikiData are universal cross-domain knowledge bases and are also accessible within the Linked Open Data (LOD) cloud, according to the FAIR principles that make data findable, accessible, interoperable and reusable. This work proposes a methodological approach to construct domain-oriented knowledge graphs by parsing natural language content to extract simple triple-based sentences that summarize the analyzed text. The triples, coded in RDF, are in the form of subject, predicate and object. The goal is to generate a KG that, through the main identified concepts, is navigable and linked to existing KGs, so that it can be automatically found and used on the Web LOD cloud.

Keywords

Knowledge Graph, Natural Language Processing, Semantic Web Technologies, Knowledge Graph Embeddings, FAIR

Text2KG 2022: International Workshop on Knowledge Graph Generation from Text, co-located with ESWC 2022, May 05-30, 2022, Hersonissos, Crete, Greece. Contact: drinconyanez@unisa.it (D. Rincon-Yanez); ssenatore@unisa.it (S. Senatore). ORCID: 0000-0002-8982-1678 (D. Rincon-Yanez); 0000-0002-7127-4290 (S. Senatore). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Due to the complex nature of real-world data, defining a data knowledge model within the specification of a domain is not a trivial task. Semantic Web Technologies (SWT) promise formal paradigms to structure data in a machine-processable way by connecting resources and defining semantic relations among them. The result is a rich information representation model whose information is well defined, uniquely identifiable, and accessible as a knowledge base. However, the knowledge representation (KR) task is becoming more challenging every day due to the increasing need to improve, extend, and re-adapt that knowledge to the evolution of the real world.

Knowledge Graphs (KGs) provide a way of representing knowledge at different levels of abstraction and granularity; a KG is generally described [1] as composed of a knowledge base (KB) and a reasoning system fed by multiple data sources. The KB is a directed network model whose edges represent semantic properties; traditionally, such network models can be classified, depending on the nature of the relations, into (1) homogeneous graphs [2], G = (V, E), whose edges (E) represent the same type of relationship between the vertices (V), and (2) heterogeneous graphs, G = (V, E, R) [3, 2], whose edges (E) describe a set of relations (R), i.e., natural connections between the nodes.

Thanks to the W3C Resource Description Framework (RDF) standard¹, knowledge can be expressed as a network of atomic assertions, also called facts, described as triples composed of Subject (S), Predicate (P) and Object (O): T = (S, P, O).
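To make the triple notation concrete, a single fact can be written as an RDF statement. A minimal Python sketch using rdflib is shown below (illustrative only: the namespace and the example fact are made up, not taken from the paper's KB):

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

# Hypothetical namespace for locally minted resources (not from the paper).
EX = Namespace("http://example.org/kb/")

g = Graph()
g.bind("ex", EX)

# One fact T = (S, P, O): (Frodo, wounded, spirit)
s, p, o = EX.Frodo, EX.wounded, EX.spirit
g.add((s, p, o))
g.add((s, RDFS.label, Literal("Frodo", lang="en")))

print(g.serialize(format="turtle"))
```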
In the light of the traditional definition of a heterogeneous graph G = (V, E, R), a KB can be described by mapping V ≡ (S ∪ O) and R ≡ P as the semantic relations between nodes, so that each edge is described as ∀e ∈ E : e = (V_x, R_z, V_y). On the other hand, the KG Reasoning System (KRS) can be interpreted as the combination of one or more computing or reasoning techniques (r) that read or write the statements in the KB, so that KG = (KRS, KB) or KG = (KRS, (V, E, R)), where KRS = {r_1, r_2, ..., r_n} with n ≥ 1; for example, a KGE model is one such technique, mapping each statement to an embedded triple e = (e_s, e_p, e_o). In a nutshell, a Knowledge Graph (KG) is composed of a Knowledge Base (KB) and a Knowledge Reasoning System (KRS): KG = ({r_1, r_2, ..., r_n}, (V, E, R)).

The FAIR principles [4] (Findability, Accessibility, Interoperability and Reusability) refer to guidelines for good data management practice: resources must be more easily found, accessed, understood, exchanged and reused by machines. In the context of SWT, and particularly in the field of Knowledge Graphs, these principles encourage communities such as the Open Linked Data community (OpenLOD) and the Semantic Web community to join efforts in building a FAIR Knowledge Graph topology [4]. On the other hand, Open Knowledge Graphs (OpenKGs) are semantically available on the Web [2, 5]; they can be focused on universal domains, e.g., DBpedia [6], WikiData [7], or on domain-specific knowledge, such as YAGO [8] or WordNet²; most of them are integrated and can be found through the OpenLOD Cloud³, and they can typically be queried through SPARQL interfaces. Nowadays, Linked Open Data and the FAIR principles have played an essential role in the spread of KGs. These proposals have boosted the Knowledge Base Construction (KBC) process, providing enormous leverage to cross-reference and improve the accuracy of newly generated KBs and increasing the number of queryable sources available on the Web.

This work proposes a methodological approach, implemented as a library, to build domain-specific knowledge graphs compliant with the FAIR principles from an automated perspective, without using knowledge-based semantic models. The method leverages traditional Natural Language Processing (NLP) techniques to generate simple triples as a starting point and then enhances them using external OpenKGs, e.g., DBpedia, to annotate, normalize, and consolidate an RDF-based KG. As a case study, novels in literature were explored to validate the proposed approach, exploiting the features of narrative construction and the relationships between characters, places and events in the story.

The remainder of this work is structured as follows: Section 2 provides an overview of relevant approaches, knowledge base construction foundations, and KG construction approaches. Section 3 presents the proposed methodological approach at a high abstraction level. Then, Section 4 describes details of the developed tool and the case study results. Experimental evidence and evaluation are reported in Section 5. Finally, conclusions and future work are highlighted in Section 6.
¹ W3C RDF Official Website - https://www.w3.org/RDF/
² WordNet Official site - https://wordnet.princeton.edu
³ Open LOD Cloud - https://lod-cloud.net/

Table 1: KBC process classification, adapted from [13]. Knowledge Source defines the characteristics of where the knowledge is located (e.g., human, source design); Schema indicates whether a fixed lexicon of entities and relations is used to store the data.

| Knowledge Source | Method | Schema |
|---|---|---|
| Curated | Manual | Yes |
| Collaborative | Manual | Yes |
| Structured | Automated | Yes |
| Semistructured | Automated | Yes |
| Unstructured | Automated | No |

2. Related Work

The Knowledge Base (KB) lifecycle [9] can be described by three main phases: Knowledge Creation, in charge of discovering the initial knowledge; in this phase, initial statements are generated systematically to feed the KB. Then, the Knowledge Curation phase annotates and validates the generated triples. Finally, the Knowledge Deployment phase is accomplished once the KB is mature and can be used together with a reasoning system (KRS) to perform inference operations.

2.1. Knowledge Base Construction

Knowledge base construction (KBC) can be achieved by performing the first two stages of the KB life cycle: populating a KB [10] and then annotating the knowledge with semantic information [11]. The annotation process can be approached from an automated or a human-supervised perspective. More specifically, the human-centered methods are categorized as (1) curated, i.e., carried out by a closed group of experts, or (2) collaborative, i.e., carried out by an open group of volunteers; these two methods must be sustained by one or more knowledge schemes (e.g., ontology-based population). Meanwhile, automated approaches can follow the previous schema-based setting [12] but also a schema-free approach [13]; this categorization is detailed in Table 1.

The KB is often Commonsense Knowledge (CSK), since it refers to generic concepts from daily life; classifying entities and obtaining more specific information can enhance the CSK by adding new, semantically annotated statements. In [14] a three-step architecture is proposed, consisting of (1) Retrieval, (2) Extraction and (3) Consolidation steps aimed at fact construction, to extract multiple key-value statement pairs from the consolidated assertions. Open KGs are also used to match and connect terms from existing ontologies. Since Open KGs like DBpedia are composed of large amounts of triples, retrieving data through SPARQL queries or other retrieval systems is time- and resource-consuming. Working with DBpedia On-Demand [15] allows accessing a lightweight version of the KB by using the semi-structured fields of DBpedia articles called info-boxes (structured knowledge from the ontology). Ontological reasoning, on the other hand, needs synergistic integration: in [16], for example, lexical-syntactic patterns are extracted from sentences in the text by exploiting a Support Vector Machine (SVM) algorithm and then matched against DBpedia; the resulting matches are then integrated by a Statistical Type Inference (STI) algorithm to create a homogeneous RDF-based KB. Furthermore, Open Information Extraction (Open IE) provides atomic units of information with individual references (URIs) that simplify conceivable queries on external databases such as DBpedia [17].

KB construction has evolved along with the knowledge representation requirements for data ingestion as well as for human consumption of data [18]. This evolution has increased the complexity in terms of data dimensionality, number of sources, and data type diversity [19].
For these additional challenges [20], semantic tools [21] can enhance the collected data by providing cross-context, allowing inference from native data relationships or from external data sources, such as status reports or supply chain data. Once an initial KB has been deployed, Knowledge Completion can be accomplished under the open-world assumption (OWA), also by using masking functions [22] to connect unseen entities to the KG. Specifically, a masking model uses text features with the structure T = (S, P, ?) to learn embeddings of entities and relations as high-dimensional semantic representations, based on known Knowledge Graph Embedding (KGE) models such as the TransE neural network [23]. This approach can also be used for link prediction, exploiting existing KGE models such as DistMult and ComplEx to perform inference using the masked form T = (S, ?, O).

2.2. Knowledge Graph Embeddings (KGE)

Knowledge Graph Embedding (KGE), also called Knowledge Graph Representation Learning, has become a powerful tool for link prediction and triple classification over an existing KG. These approaches encode the entities and relations of a knowledge graph into a low-dimensional embedding vector space; the underlying networks are based on Deep Learning techniques and arrive at a score by minimizing a loss function L. KGE provides a comprehensive framework to perform operations [24] over a KB, usually provided in the KB deployment phase.

TransE [23], one of the most studied KGE models, with scoring function f_TransE = −||e_s + r_p − e_o||_n, computes the similarity between the subject embedding e_s, translated by the predicate embedding r_p, and the object embedding e_o. The TransE model is often exploited in applications such as knowledge validation and achieves state-of-the-art prediction performance. TorusE is a TransE-based embedding method that, combined with knowledge graph embedding on Lie groups (KGLG) [25], is used to validate an inferred KB; it has been tested on de-facto standard datasets to evaluate the multidimensional distance between spaces of computed embeddings. Other KGE methods, such as DistMult [26], with f_DistMult = <r_p, e_s, e_o>, use a trilinear dot product to capture and extract semantic relations, while ComplEx [27], with f_ComplEx = Re(<r_p, e_s, e_o>), uses a Hermitian dot product.

3. The proposed approach

The proposed approach provides an incremental unsupervised methodology, compliant with the FAIR principles, to build a knowledge graph on a specific domain without exploiting any ontology schema, achieving the discovery, construction, and integration of triples T = (S, P, O) extracted from unstructured text with zero initial knowledge.

Figure 1: Proposed methodology from a high-level view. Pipeline steps, positioned in relation to the Knowledge Graph definition.

The approach is described by the pipeline of Figure 1, which shows the interaction of the main components of the proposed framework. From a high-level perspective, the input is represented by the natural language text sources, which feed the Triple Construction component, in charge of extracting triples from the analysis of sentences in the text. Text parsing and triple extraction are tailored to exploit the specific features of prosaic text, starting from the narrative properties and the implicit relations created by the recurring entities described in the text. The facts extracted from the textual data sources are plain triples <subject, predicate, object> that are incorporated as formal statements in the KB.
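To illustrate this extraction step, a simplified dependency-based extractor can pull one plain triple out of a sentence. The sketch below uses spaCy and is only a rough illustration of the idea, not the Stanford Open IE + spaCy pipeline used in the implementation (Section 4):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_plain_triple(sentence: str):
    """Return a naive (subject, predicate, object) tuple, or None."""
    doc = nlp(sentence)
    for token in doc:
        # Look for a verb with an explicit nominal subject and an object.
        if token.pos_ == "VERB":
            subj = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj = [c for c in token.children if c.dep_ in ("dobj", "attr", "pobj")]
            if subj and obj:
                return (subj[0].text, token.lemma_, obj[0].text)
    return None

print(extract_plain_triple("The hobbits encounter a Ranger."))
# Expected (roughly): ('hobbits', 'encounter', 'Ranger')
```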
The triple elements are annotated semantically by querying external semantic knowledge bases such as DBpedia or Wikidata. This task is accomplished by the Triple Enrichment component, generating an RDF-compliant KB. Afterward, this triple set is input to the Embedding Training component, responsible for KB completion: it accomplishes the reasoning on the KB, exploiting a KGE model, to infer and validate the collected KB. The next subsections provide additional details about the introduced components of the framework pipeline of Figure 1.

3.1. Triple Construction

NLP-based parsing tasks are applied to extract plain triple-based sentences. The Triple Construction component initially carries out the Basic POS Annotation on the input raw text, consisting of traditional preliminary text analysis, namely tokenization, stop-word removal and POS-tagging routines, to identify and annotate the basic part-of-speech entities. Once the Basic POS Annotation is accomplished, parallel activities, namely Information Extraction, Chunking Rules and Tagging Pattern Execution, are carried out, as shown in Figure 2.

Figure 2: Basic plain Triple Construction: from raw text to POS annotation, chunking rules and grammar dependency rules. The dashed lines represent a process that is not executed; this applies to the Tagging Pattern Execution step.

3.1.1. Information Extraction

This block is composed of three basic tasks necessary for the further processing of the tagged text yielded by the Basic POS Annotation. The text is given to the Named Entity Recognizer (NER), which identifies named entities according to some specific entity types: person, organization, location, event. Then a Dependency Parsing task allows locating named entities as subject or object and, finally, the Relation Extraction task is in charge of the triple composition, starting from the detected entities. Let us remark that the named entities have a crucial role in triple identification and generation.

3.1.2. Tagging Pattern

This task accomplishes ad-hoc text processing, considering the nature of the text from a linguistic perspective. Linguistic patterns were defined specifically for processing prosaic narrative and, for this reason, some stop words are not discarded. These patterns can be implemented to discriminate additional triple-based statements to feed the KB; in particular, a finite-state machine is proposed, implementing three evaluation steps, one for each triple component of T = (S, P, O). Figure 3 sketches a few of the implemented state transitions used in the development of this methodology.

Figure 3: Directed Acyclic Graph (DAG) describing the Tagging Pattern task.

3.1.3. Chunking Rules

This task achieves a parallel shallow syntactic analysis where certain sub-sequences of words are identified as forming special phrases. In particular, a regular expression has been defined over the sequence of word tags to select and extract basic triples for enriching the KB, such as the following tag string TP: {<DT>? <JJ*> <NN>? <VB*> <DT>? <JJ*> <NN>}. This regular expression pattern says that a triple pattern (TP) chunk should be formed by the subject chunker, which finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN); then the predicate chunker, composed of the verb (in any of its tenses); and finally the object chunker, which is similar to the subject chunker except that it could be a pronoun besides the noun.
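A chunking rule of this kind can be prototyped with NLTK's RegexpParser. The sketch below is a simplified rendering of the TP pattern above; the grammar and tag handling in the actual implementation may differ:

```python
from nltk import RegexpParser, pos_tag, word_tokenize

# Requires NLTK tokenizer and tagger resources, e.g.:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# (resource names may vary slightly across NLTK versions)

# Simplified version of the TP pattern:
# optional DT, adjectives, noun (subject); verb in any tense (predicate);
# optional DT, adjectives, noun or pronoun (object).
grammar = r"""
TP: {<DT>?<JJ.*>*<NN.*><VB.*><DT>?<JJ.*>*<NN.*|PRP>}
"""
parser = RegexpParser(grammar)

sentence = "The young hobbit carried a heavy ring"
tree = parser.parse(pos_tag(word_tokenize(sentence)))

for subtree in tree.subtrees(filter=lambda t: t.label() == "TP"):
    print(subtree)  # the chunk covering subject, predicate and object words
```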
Subject and object chunkers are typical noun phrases. An example of rule application is shown in Figure 4.

Figure 4: Chunking Rule Evaluation Example.

As the final result of the Triple Construction component, all the generated data are integrated into a set of plain triples, T_Plain ≡ T_IE ∪ T_Pattern ∪ T_Chunking, where T_IE represents the triples resulting from the Information Extraction task, T_Pattern the result of the Tagging Pattern task and T_Chunking the Chunking Rules result, once redundant and duplicated triples are discarded. Finally, the resulting triple set T_Plain is stored in a plain-format file.

3.2. Triple Enrichment

Once the construction of the triples has been completed and stored, the Triple Enrichment component (Figure 1) is in charge of semantically annotating the triple elements. First, each element of a triple is a candidate to be enriched semantically by assigning a URI to it; for this purpose, SPARQL queries whose arguments are the individual subject and object of the triples are submitted to OpenKGs such as DBpedia or Wikidata to retrieve the relative URIs, when they are available. Similarly, the predicate is looked up in the semantically enhanced lexical database WordNet, once the verb has been transformed into its infinitive form.

Algorithm 1: Triple enrichment algorithm
Input: Plain triple data set
Output: RDF triple data set
1: for all [S, P, O] : AllTriples do
2:   for (S and O) as Entity do
3:     [URI, Label] = QueryOpenKG(Entity)
4:     triples.replace(Entity, URI)
5:     triples.append((URI, rdfs:label, Label))
6:   end for
7:   P = QueryWordNet(tokenization(P))
8: end for
9: return RDF Triples

The general process of the Triple Enrichment component is described in Algorithm 1. It takes the triple collection as input and, for each triple subject and object, searches for the corresponding URI in an OpenKG; once retrieved, the URI replaces the entity in the triple, together with the associated rdfs:label. Let us notice that not only the external OpenKG URI is extracted from the query, but also the rdfs:label of each entity added to the KB. A similar query is carried out on WordNet for the predicate: a simple disambiguation task is accomplished by considering the other triple elements and the similarity between terms in the WordNet definition and the triple words; otherwise, the first sense is retrieved by default. Future developments will select the right sense after a more accurate disambiguation task. The output of the Triple Enrichment component is a semantic version of the collected triples: T_RDF-KB ≡ T_RDF-IE ∪ T_RDF-Pattern ∪ T_RDF-Chunking, where T_RDF-IE, T_RDF-Pattern and T_RDF-Chunking are the RDF-annotated triples coming from the corresponding tasks of the Triple Construction component.

3.3. KG Embeddings Training

The last component of the described pipeline is the Embedding Training, which is responsible for training the KGE model on the processed triples. The KGE model provides a mechanism to perform operations on the discovered KB, such as fact checking, question answering, link prediction and entity linking. According to the closed-world assumption (CWA) [22] on the knowledge graph, in general no new statements (triples) can be added to an existing KG, except by predicting relationships between existing, well-connected entities. In the light of this consideration, it is essential to integrate some reasoning mechanism (KRS) that allows the KG to perform operations within the statements inside the KB, to reach the so-called closed-world Knowledge Graph Completion.
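For reference, the scoring functions recalled in Section 2.2, which this component trains and then uses for completion, can be written compactly. A minimal NumPy sketch, purely illustrative (real models learn the embeddings; here they are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50  # embedding dimensionality (placeholder value)

# Placeholder embeddings for one triple (subject, predicate, object).
e_s, r_p, e_o = rng.normal(size=(3, dim))

def transe(e_s, r_p, e_o, n=1):
    # f_TransE = -|| e_s + r_p - e_o ||_n
    return -np.linalg.norm(e_s + r_p - e_o, ord=n)

def distmult(e_s, r_p, e_o):
    # f_DistMult = <r_p, e_s, e_o>  (trilinear dot product)
    return float(np.sum(r_p * e_s * e_o))

def complex_score(e_s, r_p, e_o):
    # f_ComplEx = Re(<r_p, e_s, conj(e_o)>) on complex-valued embeddings
    c_s, c_p, c_o = (v.astype(complex) for v in (e_s, r_p, e_o))
    return float(np.real(np.sum(c_p * c_s * np.conj(c_o))))

print(transe(e_s, r_p, e_o), distmult(e_s, r_p, e_o), complex_score(e_s, r_p, e_o))
```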
At the same time, once all the possible statements are saturated, additional new statements could be added by considering external information sources, in order to enrich the KB.

4. Implementation detail: a case study

The approach has been implemented and validated considering a case study in the narrative domain, specifically books and article synopses describing short stories, fairy tales and adventure stories. This section describes the methodology at a technical level: the goal is to identify the main characters and capture the relationships among them and the places and events in which they are involved, by exploiting linguistic features and qualities of the narrative prose. An enlarged view of the proposed pipeline of Figure 1 is shown in Figure 5, where the whole high-level framework is described. A synergy between different methodologies and technologies has been achieved: NLP activities are integrated with Machine Learning models, designed according to the principles of Information Retrieval and FAIR, through Semantic Web technologies (RDF, Turtle, OWL and SPARQL).

Figure 5: Methodology implementation detailed overview.

The implementation is based on the Python language and covers the core system shown in Figure 5. Input textual resources are downloaded from Kaggle narrative books, available in plain text format; at the same time, the synoptic version of the book was retrieved from Wikipedia. Our case study is applied to the book The Lord of the Rings. The text was extracted using a simple web crawler that retrieves the content of the URL page; the data is cleaned using a standard HTML/XML parser (Data Extraction). As shown in Figure 5, the Triple Construction component carries out the traditional POS-tagging activities (Figure 2). The information extraction is performed by the Stanford Open IE server tool combined with the spaCy library (en_core_web pre-trained models for English) for the extraction of triples. The Tagging Pattern and Chunking Rules tasks of Triple Construction are ad-hoc designed; these tasks were also implemented by exploiting WordNet.

To give an example of the pattern extraction described in Figure 3, the text "But Frodo is still wounded in body and spirit" is analyzed according to the regular expression T = (NN*, VB*, NN*), where the wildcard character is a placeholder for possible variations of those POS tags. The generated triple is T = (Frodo, wounded, spirit). The triple extraction also aims at generating the widest set of triples from the sentence analysis. Along with the definition of regular expression patterns for discovering simple phrase chunks, co-reference resolution is also achieved: primary sentences sharing the same subject with secondary sentences, or simply sentences with a pronoun, are re-arranged to obtain triples with the noun instead of the pronoun. The rule (PRP, VB, NN) in Figure 4 allows us to obtain the pronoun (PRP) that will be replaced by the right noun (in the previous sentence, the NN subject S = hobbits), to get the triple T2 = (hobbits, encounter, Ranger).

Once the Triple Construction process is concluded, the generated triples are rearranged, discarding duplicates (coming from the parallel tasks). The remaining triples are ordered by subject and then by predicate, and stored in a plain CSV file for further use. The Triple Enrichment component uses the CSV-generated file to carry out Algorithm 1.
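A minimal sketch of such an enrichment loop is given below, using SPARQLWrapper against the public DBpedia endpoint and WordNet via NLTK. The query, the fallback URI scheme and the helper names are simplified illustrations, not the exact ones used by the authors:

```python
from SPARQLWrapper import SPARQLWrapper, JSON
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

DBPEDIA = SPARQLWrapper("https://dbpedia.org/sparql")

def query_open_kg(entity: str):
    """Look up a DBpedia URI and English label for an entity; fall back to a local URI."""
    DBPEDIA.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?uri ?label WHERE {{
          VALUES ?uri {{ <http://dbpedia.org/resource/{entity}> }}
          ?uri rdfs:label ?label .
          FILTER (lang(?label) = 'en')
        }} LIMIT 1
    """)
    DBPEDIA.setReturnFormat(JSON)
    bindings = DBPEDIA.query().convert()["results"]["bindings"]
    if bindings:
        return bindings[0]["uri"]["value"], bindings[0]["label"]["value"]
    return f"http://example.org/kb/{entity}", entity  # customized URI when nothing is found

def normalize_predicate(verb: str) -> str:
    """Reduce a verb to its base (infinitive-like) form using WordNet's morphy."""
    return wn.morphy(verb, wn.VERB) or verb

for s, p, o in [("Frodo", "encounters", "Gandalf")]:
    s_uri, s_label = query_open_kg(s)
    o_uri, o_label = query_open_kg(o)
    print((s_uri, normalize_predicate(p), o_uri), s_label, o_label)
```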
An initial Named Entity Recognition task allows discriminating the important entities among the subjects and objects of the triples, in order to discard irrelevant triples (i.e., triples with no named entity) and, more importantly, to use those entities as parameters for the query, according to Algorithm 1. The individual entities are looked up in DBpedia by submitting SPARQL queries to retrieve the semantic annotation (URI and label) corresponding to each entity. The goal is to make the subject and object of each triple accessible resources, according to the Linked Data principles.

Algorithm 2: SPARQL query example for URI and label extraction

SELECT * WHERE {
  {
    dbr:{word} rdf:type dbo:FictionalCharacter .
    dbr:{word} dbp:name ?label .
  }
  UNION
  {
    {
      { dbr:{word} dbo:wikiPageRedirects ?URI }
      UNION
      { dbr:{word} dbo:wikiPageDisambiguates ?URI }
    } .
    ?URI rdf:type dbo:FictionalCharacter .
    ?URI rdfs:label ?label
  }
  FILTER ( lang(?label) = 'en' )
}

The query shown in Algorithm 2 was submitted to the public SPARQL endpoint⁴ through the SPARQLWrapper library, to retrieve the DBpedia URI of the selected entities and the corresponding rdfs:label. Let us notice that when the URI of an entity is not available, i.e., no result is returned by the query, a customized URI is generated. In a similar way, the predicate is processed using WordNet after reducing it to the infinitive form, e.g., the tenses kills, killed, killing become the infinitive form kill. After these activities, a semantic knowledge base (called RDF KB) is available. The resulting RDF KB is passed to the KGE models (Embedding Training in Figure 5) to get the embeddings of the RDF triple-based KB. Embedding models such as ComplEx, TransE and DistMult were trained to compare and evaluate the results across this implementation. The training and evaluation were carried out by using the evaluation protocol provided by the KG Embedding framework AmpliGraph [28].

4.0.1. Triple Storage

An important role is played by the storage component, which allows storing all the intermediate versions of the generated triples. In a nutshell, the triples generated by the Triple Construction component (Section 3.1) are stored in CSV format (to reuse them later); then, after the Triple Enrichment (Section 3.2), the resulting triples are stored semantically, namely in RDF/Turtle. RDF-based triples can be imported into Neo4j⁵ through the neosemantics plugin, and the resulting knowledge graph can then be graphically displayed in the GUI. This last storage task is crucial for reusing the KB for later validation, chart generation, query execution, and further operations.

4.0.2. Source Code

A demo implementation of this project is available on GitHub. The usage instructions are provided in the repository⁶ and the produced results (KBs) should be published under the CC-BY-SA license.

⁴ DBpedia SPARQL end-point - https://dbpedia.org/sparql
⁵ https://neo4j.com/v2/
⁶ Github Repository - https://github.com/d1egoprog/Text2KG

4.1. Knowledge Graph Completion Evaluation

The KB validation is performed by implementing a cross-graph representation learning approach [29], embedding triples based on the semantic meaning of the KB; the representation is used to estimate a confidence score for each statement based on a degree of correctness. This estimation is calculated by exploiting the KGE models and the corresponding evaluation metrics over the discovered KB.
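A minimal sketch of training and scoring with AmpliGraph is given below; it follows the AmpliGraph 1.x API (class and function names may differ in other versions) and uses toy triples rather than the paper's KB:

```python
import numpy as np
from ampligraph.latent_features import TransE
from ampligraph.evaluation import evaluate_performance, mrr_score, hits_at_n_score

# Toy triple set standing in for the RDF KB (subject, predicate, object).
X = np.array([
    ["Frodo", "memberOf", "Fellowship"],
    ["Gandalf", "memberOf", "Fellowship"],
    ["Frodo", "carries", "Ring"],
    ["Aragorn", "memberOf", "Fellowship"],
    ["Gandalf", "mentors", "Frodo"],
])

# Small embedding size and a single batch, sized for the toy example.
model = TransE(k=50, epochs=50, batches_count=1, verbose=False)
model.fit(X)

# Rank a handful of held-out statements against the training graph.
X_test = np.array([["Aragorn", "mentors", "Frodo"]])
ranks = evaluate_performance(X_test, model=model, filter_triples=X, verbose=False)
print("MRR:", mrr_score(ranks), "Hits@3:", hits_at_n_score(ranks, n=3))
```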
First, the validation is accomplished by considering the output of each task of the Triple Construction component and setting up a pair-wise comparison among T_RES-IE, T_RES-Pattern and T_RES-Chunking through the trained KGE models, with the purpose of measuring the correspondence between the discovered knowledge sets. Then, to assess the generated knowledge base, new triples are added to the KGE models, once trained, to estimate the trustworthiness (i.e., how true they are) of these triples. These new triples, called "synthetic" because they are manually generated, are used to perform link prediction queries of the form T = (?, P, ?). The synthetic triples are divided into two groups: positive (ST_Pos, existing) and negative (ST_Neg, not existing, or unlikely to be true).

Well-known learning-to-rank metrics are used to assess the performance of the KGE models:

MRR = (1/|Q|) · Σ_{i=1}^{|Q|} 1/q_i    (1)

HITS@N = |{q ∈ Q : q ≤ N}| / |Q|    (2)

The Mean Reciprocal Rank (MRR) is a statistical measure for assessing the quality of a list of possible answers to a sample of queries Q, sorted by probability of correctness; MRR assumes a value equal to 1 if a relevant answer is always retrieved at rank 1. The intention is to measure the correctness of each of the proposed scenarios, using MRR to validate the source knowledge against the validation dataset. HITS@N measures the percentage of triples that were correctly ranked in the top N; it is used here with two different coefficients, N = 5 and N = 3, to measure the ranking of the entities of the statement (S, O) in the validation dataset at link prediction time.

5. Experiment Design and Results

5.1. Triple Discovery

For the validation of the KBC process, let us focus on the KB generated by the Triple Construction component described in Section 3.1. After the basic POS tagging tasks, the three sub-tasks Information Extraction, Chunking Rules and Tagging Pattern Execution generate statements for the graph construction; the statements are grouped into T_IE, T_Pattern and T_Chunking, respectively. The triples generated by the Information Extraction task (T_IE) are approximately 75% of the discovered triples. The remaining 25% is split between the triples (about 10% of the total generated KB) coming from the Chunking Rules (T_Chunking) and those (around 15%) from the Tagging Pattern component (T_Pattern). Overlapping statements represent 1% of the discovered triples.

5.2. KG Completion evaluation approach

Table 2 shows the results of the evaluation process, obtained by pair-wise comparison of the statements generated by the components and tasks, as described in Section 4.1. The first two columns of the table show the two sets of triples (statements) compared with each other: the first column represents the reference triple set used to validate the triple set in the second column. The remaining columns report the metrics and the corresponding values obtained for each of the three KGE models analyzed. Let us notice that, in general, there is an important dependence between the ranked triple list and the expected ordering of the reference triple set for most of the pairs reported in the table. The worst MRR values are returned, for all three models, for the ranked triples T_RES-Pattern with respect to T_RES-Chunk, and vice-versa.
This is due to the small number of triples generated by the corresponding tasks, compared to the equally small reference dataset. The highest MRR scores are instead returned for the triples generated by the Information Extraction task (T_IE) and also for the Chunking Rules task, namely T_RES-Chunk. The percentages of top-three and top-five ranked triples are also shown in the table, confirming that using triples from the Information Extraction task (T_IE) for both reference and validation often produces good triple rankings.

To consider the validation of external knowledge, the last rows of the table show the result of the comparison with the synthetic triples ST_Pos and ST_Neg. The results suggest that the validation of the synthetic triples produces high MRR values (about 0.95) for the positive ST_Pos triples and values close to 0 for the negative ST_Neg triples. The synthetic statements appear to accurately reflect the knowledge discovered by the methodology and the corresponding implementation, and the positive statements are potential candidates to feed into the existing knowledge base.

Table 2: KG Completion approach results, exploiting the three KGE models TransE, DistMult and ComplEx. Src. is the reference data; Val. are the data validated against Src. All the KBs are the final T_RES of each process.

| Src. | Val. | TransE MRR | TransE H@3 | TransE H@5 | DistMult MRR | DistMult H@3 | DistMult H@5 | ComplEx MRR | ComplEx H@3 | ComplEx H@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| T_IE | T_CH | 0.69 | 0.53 | 0.85 | 0.63 | 0.58 | 0.71 | 0.65 | 0.59 | 0.71 |
| T_IE | T_PT | 0.55 | 0.40 | 0.74 | 0.48 | 0.41 | 0.56 | 0.55 | 0.47 | 0.65 |
| T_CH | T_IE | 0.53 | 0.42 | 0.67 | 0.52 | 0.48 | 0.57 | 0.57 | 0.54 | 0.59 |
| T_CH | T_PT | 0.33 | 0.28 | 0.42 | 0.30 | 0.28 | 0.31 | 0.33 | 0.31 | 0.36 |
| T_PT | T_IE | 0.54 | 0.40 | 0.70 | 0.53 | 0.47 | 0.58 | 0.55 | 0.49 | 0.61 |
| T_PT | T_CH | 0.34 | 0.28 | 0.42 | 0.33 | 0.31 | 0.33 | 0.29 | 0.28 | 0.31 |

| Src. | Val. | TransE MRR | TransE H@1 | TransE H@3 | DistMult MRR | DistMult H@1 | DistMult H@3 | ComplEx MRR | ComplEx H@1 | ComplEx H@3 |
|---|---|---|---|---|---|---|---|---|---|---|
| T_KB | ST_Pos | 0.95 | 0.92 | 0.99 | 0.92 | 0.85 | 0.99 | 0.99 | 0.98 | 0.99 |
| T_KB | ST_Neg | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |

6. Conclusions and Future Work

This paper proposes an unsupervised construction of domain-specific knowledge from natural language text. In particular, a case study is designed on a narrative text, exploiting features unique to prosaic text. The text was parsed and basic triples, representing the core parts (subject, predicate and object) of the main sentences, were extracted. Through Semantic Web technologies, the triples are semantically annotated to be linked and navigable in the Linked Data cloud. At that point, knowledge graph embedding models have been exploited to extend the potential of the collected triples by verifying and validating the knowledge thus generated. Performance is assessed by cross-validation, scoring functions and accuracy on unseen triples.

The work provides a general methodological approach to automatically build a knowledge base on a preferred domain given textual resources. It allows the creation of different types of KBs and can be leveraged to build ad-hoc knowledge bases tailored to a specific domain. Question-answering systems can act as validation systems when verification of the veracity and reliability (degree of truth) of a given statement is required, for example for the discovery of fake news or simply for fact-checking. As future work, the generation and enrichment of triples by querying existing Open Knowledge bases will be improved, along with the more challenging goal of enhancing the degree of confidence in the KB by implementing an automatic validation method.

References
[1] L. Ehrlinger, W. Wöß, Towards a definition of knowledge graphs, in: Proceedings of the 12th International Conference on Semantic Systems (SEMANTiCS 2016), volume 1695, CEUR-WS, 2016, pp. 1–4.
[2] A. Hogan, E. Blomqvist, M. Cochez, C. D'Amato, G. de Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, number 2 in Synthesis Lectures on Data, Semantics, and Knowledge, Morgan & Claypool, 2021. doi:10.2200/S01125ED1V01Y202109DSK022.
[3] B. Lee, S. Zhang, A. Poleksic, L. Xie, Heterogeneous Multi-Layered Network Model for Omics Data Integration and Analysis, Frontiers in Genetics (2020) 1381. doi:10.3389/FGENE.2019.01381.
[4] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. 't Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. Van Der Lei, E. Van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, Comment: The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016) 1–9. doi:10.1038/sdata.2016.18.
[5] P. N. Mendes, M. Jakob, C. Bizer, DBpedia: A multilingual cross-domain knowledge base, in: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 1813–1817.
[6] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A Nucleus for a Web of Open Data, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 4825 LNCS, Springer, Berlin, Heidelberg, 2007, pp. 722–735. doi:10.1007/978-3-540-76298-0_52.
[7] H. Turki, T. Shafee, M. A. Hadj Taieb, M. Ben Aouicha, D. Vrandečić, D. Das, H. Hamdi, Wikidata: A large-scale collaborative ontological medical database, Journal of Biomedical Informatics 99 (2019) 103292. doi:10.1016/j.jbi.2019.103292.
[8] F. M. Suchanek, G. Kasneci, G. Weikum, YAGO: A Core of Semantic Knowledge, in: Proceedings of the 16th International Conference on World Wide Web - WWW '07, ACM Press, New York, NY, USA, 2007, pp. 697–706. URL: https://dl.acm.org/doi/10.1145/1242572.1242667. doi:10.1145/1242572.1242667.
[9] U. S. Simsek, K. Angele, E. Kärle, J. Opdenplatz, D. Sommer, J. Umbrich, D. Fensel, Knowledge Graph Lifecycle: Building and Maintaining Knowledge Graphs, in: Proceedings of the 2nd International Workshop on Knowledge Graph Construction, CEUR-WS, 2021, p. 16.
[10] C. Ré, A. A. Sadeghian, Z. Shan, J. Shin, F. Wang, S. Wu, C. Zhang, Feature Engineering for Knowledge Base Construction, 2014. arXiv:1407.6439.
[11] V. Leone, G. Siragusa, L. Di Caro, R. Navigli, Building semantic grams of human knowledge, in: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings, 2020, pp. 2991–3000.
[12] D. R. Yanez, F. Crispoldi, D. Onorati, P. Ulpiani, G. Fenza, S. Senatore, Enabling a Semantic Sensor Knowledge Approach for Quality Control Support in Cleanrooms, in: CEUR Workshop Proceedings, volume 2980, CEUR-WS, 2021, pp. 1–3. URL: http://ceur-ws.org/Vol-2980/paper409.pdf.
[13] M. Nickel, K. Murphy, V. Tresp, E. Gabrilovich, A Review of Relational Machine Learning for Knowledge Graphs, Proceedings of the IEEE 104 (2016) 11–33. doi:10.1109/JPROC.2015.2483592. arXiv:1503.00759.
[14] T.-P. Nguyen, S. Razniewski, G. Weikum, Advanced Semantics for Commonsense Knowledge Extraction, in: Proceedings of the Web Conference 2021, ACM, New York, NY, USA, 2021, pp. 2636–2647. doi:10.1145/3442381.3449827.
[15] M. Brockmeier, Y. Liu, S. Pateer, S. Hertling, H. Paulheim, On-Demand and Lightweight Knowledge Graph Generation – a Demonstration with DBpedia, in: Proceedings of the Semantics 2021, volume 2941, CEUR-WS, 2021, p. 5. arXiv:2107.00873.
[16] T. Kliegr, O. Zamazal, LHD 2.0: A text mining approach to typing entities in knowledge graphs, Journal of Web Semantics 39 (2016) 47–61. doi:10.1016/j.websem.2016.05.001.
[17] J. L. Martinez-Rodriguez, I. Lopez-Arevalo, A. B. Rios-Alvarado, OpenIE-based approach for Knowledge Graph construction from text, Expert Systems with Applications 113 (2018) 339–355. doi:10.1016/j.eswa.2018.07.017.
[18] Y. Liu, T. Zhang, Z. Liang, H. Ji, D. L. McGuinness, Seq2RDF: An end-to-end application for deriving Triples from Natural Language Text, in: Proceedings of the 17th International Semantic Web Conference, CEUR-WS, 2018, p. 4. URL: http://ceur-ws.org/Vol-2180/paper-37.pdf.
[19] A. Ratner, C. Ré, P. Bailis, Research for practice, Communications of the ACM 61 (2018) 95–97. URL: https://dl.acm.org/doi/10.1145/3233243. doi:10.1145/3233243.
[20] S. Tiwari, F. N. Al-Aswadi, D. Gaurav, Recent trends in knowledge graphs: theory and practice, Soft Computing 25 (2021) 8337–8355. URL: https://link.springer.com/10.1007/s00500-021-05756-8. doi:10.1007/s00500-021-05756-8.
[21] G. Fenza, M. Gallo, V. Loia, D. Marino, F. Orciuoli, A cognitive approach based on the actionable knowledge graph for supporting maintenance operations, in: 2020 IEEE Conference on Evolving and Adaptive Intelligent Systems, EAIS 2020, Bari, Italy, May 27-29, 2020, IEEE, 2020, pp. 1–7. doi:10.1109/EAIS48028.2020.9122759.
[22] B. Shi, T. Weninger, Open-World Knowledge Graph Completion, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018).
[23] A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Advances in Neural Information Processing Systems, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2787–2795.
[24] K. Scharei, F. Heidecker, M. Bieshaar, Knowledge Representations in Technical Systems – A Taxonomy, 2020. arXiv:2001.04835.
[25] T. Ebisu, R. Ichise, Generalized translation-based embedding of knowledge graph, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 941–951. doi:10.1109/TKDE.2019.2893920.
[26] B. Yang, W.-t. Yih, X. He, J. Gao, L. Deng, Embedding Entities and Relations for Learning and Inference in Knowledge Bases, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, p. 15.
[27] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, G. Bouchard, Complex Embeddings for Simple Link Prediction, in: M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, PMLR, New York, New York, USA, 2016, pp. 2071–2080.
[28] L. Costabello, S. Pai, C. L. Van, R. McGrath, N. McCarthy, AmpliGraph: a Library for Representation Learning on Knowledge Graphs, GitHub (2019). URL: https://zenodo.org/record/2595049. doi:10.5281/ZENODO.2595049.
[29] Y. Wang, F. Ma, J. Gao, Efficient Knowledge Graph Validation via Cross-Graph Representation Learning, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, ACM, New York, NY, USA, 2020, pp. 1595–1604. doi:10.1145/3340531.3411902. arXiv:2008.06995.