Vol-3184/paper4

id: Vol-3184/paper4
wikidataid: Q117040467
title: Generating Domain-Specific Knowledge Graphs: Challenges with Open Information Extraction
pdfUrl: https://ceur-ws.org/Vol-3184/TEXT2KG_Paper_4.pdf
volume: Vol-3184
Generating Domain-Specific Knowledge Graphs: Challenges with Open Information Extraction

Nitisha Jain1, Alejandro Sierra-Múnera1, Julius Streit1, Simon Thormeyer1, Philipp Schmidt1, Maria Lomaeva2 and Ralf Krestel3,4

1 HPI - Hasso Plattner Institute, Potsdam, Germany
2 University of Potsdam, Potsdam, Germany
3 ZBW - Leibniz Centre for Economics, Kiel, Germany
4 Kiel University, Kiel, Germany

Abstract
Knowledge Graphs (KGs) are a popular way to structure and represent knowledge in a machine-readable way. While KGs serve as the foundation for many applications, the automatic construction of these KGs from texts is a challenging task in which Open Information Extraction techniques are prominently leveraged. In this paper, we focus on generating a domain-specific knowledge graph based on art-historic texts from a digitized text collection. We describe the combined use and adaptation of existing open information extraction methods to build an art-historic KG that can facilitate data exploration for domain experts. We discuss the challenges that were faced at each step and present a detailed error analysis to identify the limitations of existing methods when working with domain-specific corpora.

Keywords
Knowledge graphs, Open information extraction, Domain-specific texts

1. Introduction

Knowledge Graphs (KGs) have gained considerable popularity in both academia and industry. They are employed to represent information in a structured format after extraction from large collections of heterogeneous, diverse, and unstructured documents [1]. These KGs can then be used for downstream tasks such as question answering, logical inference, recommendation, or information retrieval. Besides general KGs that aim to capture generic knowledge about real-world data, such as DBpedia [2] and Wikidata [3], domain-specific KGs have become important for targeted domains [4].
They have been leveraged to support multiple information-based applications, e.g., in the context of health and life sciences [5], news search [6], or fact checking [7]. There have been several efforts towards the automatic construction of general-purpose knowledge graphs from the Web based on machine learning techniques [8, 9]. In the absence of a pre-specified list of relations for performing pattern-based extractions, Open Information Extraction (Open IE) is a popular approach, where a large set of relational triples can be extracted from text without any human input or domain expertise [10]. Several Open IE techniques have been proposed to build and populate knowledge graphs from free-form texts [11, 12, 13, 14, 15, 16]. However, these methods for automated knowledge base construction suffer from a number of shortcomings in terms of their coverage [17] and applicability to specific domains [4]. Existing techniques that exhibit state-of-the-art results on standard, clean datasets fail to achieve comparable performance on domain-specific datasets, e.g., in the art-historic domain, where the data often consists of highly heterogeneous and noisy collections [18].

Text2KG 2022: International Workshop on Knowledge Graph Generation from Text, co-located with ESWC 2022, May 30, 2022, Hersonissos, Crete, Greece. Nitisha.Jain@hpi.de (N. Jain); Alejandro.Sierra@hpi.de (A. Sierra-Múnera); R.Krestel@zbw.eu (R. Krestel); https://nitishajain.github.io/ (N. Jain). ORCID: 0000-0002-7429-7949 (N. Jain); 0000-0003-3637-4904 (A. Sierra-Múnera); 0000-0002-5036-8589 (R. Krestel). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

KG for Art. The art and cultural heritage domain provides a plethora of opportunities for knowledge graph applications.
An art knowledge graph can enable art historians, as well as interested users, to explore interesting information that is hidden in large volumes of text in a structured manner. With a large variety of diverse information sources and manifold application scenarios, the (automated) construction of task-specific and domain-specific knowledge graphs becomes even more crucial for this domain. In contrast to general-purpose KGs, a KG for the art domain could comprise a specific set of entity types, such as artworks and galleries, as well as relevant relations, such as influenced_by, part_of_movement, etc., depending on the specific task and the specific text collection. The important entities and relations might also differ across document types, such as auction catalogues, exhibition catalogues, or art magazines. On the one hand, a general-purpose, art-oriented ontology may not be well-suited and comprehensive enough for specific data collections. On the other hand, designing a custom ontology for the different art corpora would be a challenging and expensive task due to the need for significant domain expertise. In the past, several attempts have been made at creating KGs for art and related domains [19, 20, 21], with the most recent one by Castellano et al. [22]. However, a systematic method for constructing a knowledge graph from a collection of art-related documents without a well-defined ontology has not been proposed thus far.

Goals. In this paper, we describe an ongoing project1 for the automatic construction of a knowledge graph based on a large, private archive of art-historic documents. Instead of relying on existing ontologies to dictate the information extraction process (which might restrict the scope of the entities and relations that can be extracted from the text when the ontology is not hand-crafted for the specific dataset), we decided to pursue the schema-less Open IE approach in this work.
We present the results of our exploration of existing Open IE techniques to generate structured information and discuss our insights in terms of their shortcomings and limited applicability when deployed on noisy, digitized data in the art domain. We make the following contributions in this paper: (i) we construct a domain-specific knowledge graph based on a collection of digitized art-historic documents; (ii) we describe the process of automated construction of the KG with Open IE techniques; (iii) we analyze and discuss the challenges and limitations of adapting Open IE tools to domain-specific datasets.

1 https://hpi.de/naumann/projects/web-science/ai4art.html

2. Related Work

With the availability of digitized cultural data, several previous works have proposed KGs for art-related datasets [19, 20, 21, 23]. ArCo [21] is a large Italian cultural heritage graph with a pre-defined ontology that was developed in a collaborative fashion with contributions from domain experts all over the country. While the ArCo KG is quite broad in its coverage, ArDo [24] pertains to a very specific use case of multimedia archival records. Similarly, the Linked Stage Graph [25] was developed as a KG specifically for storing historical data about the Stuttgart State Theater. Increasingly, the principles of linked open data2 have also been widely adopted within the cultural heritage domain, enabling researchers, practitioners, and general users to study and consume cultural objects. Notable examples include CIDOC-CRM [26], the Rijksmuseum collection [27], the Zeri Photo Archive3, and OpenGLAM [28], among many others. Most related to our work is ArtGraph [22], where the authors integrated the art resources from DBpedia and WikiArt and constructed a KG with a well-defined schema centered around artworks and artists.
While all these works are concerned with KGs and ontologies for specific art-related corpora, they leverage a schema for representing the information and are not concerned with the challenges of a schema-free extraction process, which is the main focus of this work.

Open IE approaches extract triples directly from text, without an explicit ontology or schema behind the extraction process. Several such systems have been proposed in the past. TextRunner [12] relies on a self-supervised classifier that determines trustworthy relationships between pairs of entities, while Reverb [11] uses syntactic and lexical constraints to overcome incoherent and uninformative relationships. ClausIE [14] relies heavily on dependency parsing to construct clauses from which the propositions are extracted. In this work, we have leveraged the Stanford CoreNLP OpenIE implementation [29, 13], which uses dependency parsing to minimize the phrases of the resulting clauses and was originally evaluated on a slot-filling task.

The construction of domain-specific KGs has been the subject of investigation in previous works for various domains, e.g., software engineering [30], academic literature [31], and, more prominently, the biomedical domain [32, 33, 34]. However, the previously proposed automated methods are not directly applicable to the arts and cultural heritage domain, where unique challenges with respect to the heterogeneity and quality of data are prevalent. This work identifies and discusses the particular difficulties encountered while applying existing information extraction techniques to art-related corpora.

3. Automated Construction of Art-historic KG

In this section, we describe our underlying art-historic dataset as well as the steps employed for the automated extraction of information (in the form of triples) to construct an art-historic knowledge graph. Fig. 1 shows an overview of this process.
2 Linked Open Data: http://www.w3.org/DesignIssues/LinkedData
3 https://fondazionezeri.unibo.it/en

Figure 1: Construction of art-historic KG.

3.1. Dataset

For this work, we are working with a large collection of recently digitized art-historical texts provided by our project partners. This collection consists of a variety of heterogeneous documents, including auction catalogs, exhibition catalogs, art books, etc., that contain semi-structured as well as unstructured texts describing artists, artworks, exhibitions, and so on. Art historians regularly study these data collections for art-historical analysis. Therefore, a systematic representation of this data in the form of a KG would be a valuable resource for them to explore the data swiftly and efficiently. Since the whole collection is quite large (≈ 1 TB of data), a subset pertaining to information about the artist Picasso was chosen in order to restrict the size of the dataset for a proof of concept of our KG construction process. Choosing an artist-oriented subset of the collection enabled us to better understand the context and evaluate the triples that were obtained throughout the process of KG construction. The data was filtered by querying the document collection with the keyword query ‘Picasso’, resulting in 224,469 entries (where each entry corresponds to a page of the original digitized corpus) containing the term ‘Picasso’. Due to the filtering, each entry is an independent document, in the sense that the neighboring entries do not always represent the correct context. This led to some of the entries in our dataset containing incomplete sentences at the beginning or the end of a page. One such example is an entry starting with ‘to say47—Picasso never belittled his work, until . . . 
’ where the tokens ‘to say’ belong to a sentence that started in a different entry, which might no longer be part of the dataset under consideration. It is important to note that the same example shows further noise, e.g., numbers mixed in between words in the digitized version of the text. This noise was introduced by the optical character recognition (OCR) process during the digitization of the documents (performed in a prior step by the data providers). In general, the dataset contains full sentences, such as ‘Matisse’s return to the study of ancient and Renaissance sculpture is significant in itself.’, as well as short descriptive phrases, figure captions, or footnotes such as ‘G. Bloch, Pablo Picasso, Bern, 1972, vol. III, p.142’.

3.2. Finding Named Entities

As a first step, it was interesting to inspect whether the named entities present in the corpus could be easily identified. A dictionary-based approach to finding named entities would identify the mentions with high precision, but at the cost of very low recall, ignoring many potentially interesting entities to be discovered in the corpus. Therefore, we chose to follow a machine learning approach to named entity recognition (NER). Generic NER tools work very well for the common entity types, such as person, location, organization, and so on, though fine-grained or domain-specific entities are harder to identify [35]. We employed the SpaCy library4 for finding named entities since its pre-trained models include a Work_Of_Art category that could potentially identify the entities that are important in the art domain (this could encompass mentions of paintings, books, statues, etc.).
Excluding cardinal entities in order to reduce noise, the SpaCy library with the pre-trained ‘en_core_web_trf’ model was used to identify the following entity types: Work_Of_Art, Person, Product, ORG, LOC, GPE, and NORP, which showed reasonably good results. The NER process enabled us to filter out any sentences without an entity mention, since such sentences were likely to hold no useful information for the KG construction. Thus, the NER step helped with pruning the dataset for further processing, as well as improving the quality of the resulting KG.

3.3. Triple Extraction

After obtaining informative sentences from the previous step, we employed Open IE tools to extract triples from them. It is important to note that while some art-related ontologies have been proposed in previous works, such as ArCo [21] and ArDo [24], none of them are suitable for our corpus since they are very specific to the datasets they were designed for. Other general ontologies such as CIDOC-CRM are, on the other hand, too broad and would not be able to capture novel and interesting facts from a custom and heterogeneous corpus such as ours, where the entities and the relations among them are not known beforehand. In the absence of an ontology specifically designed for the description of art-historic catalogs, we chose to employ open information extraction techniques for the construction of our KG in order to broaden the scope and utility of the extracted information. To this end, we ran the Stanford CoreNLP OpenIE annotator [29, 36] to extract ⟨subject, predicate, object⟩ triples from the sentences. A total of 5,057,488 triples were extracted in this process, where multiple triples could be extracted from a single sentence. Another round of filtering was performed at this stage, where any triples that did not contain a named entity in the subject or object phrase were removed. Additionally, duplicate entries and triples with serial numbers as entities were also ignored.
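The two pruning steps described above (keeping only sentences that mention a relevant named entity, and keeping only triples grounded in a named entity) can be sketched as follows. This is a minimal stdlib sketch under assumed interfaces: entity mentions are taken as spaCy-style (text, label) pairs, and the function names are hypothetical, not the paper's actual code.

```python
import re

# Entity types retained in the pipeline; CARDINAL is deliberately excluded.
RELEVANT_TYPES = {"WORK_OF_ART", "PERSON", "PRODUCT", "ORG", "LOC", "GPE", "NORP"}

def keep_sentence(entities):
    """Keep a sentence only if it mentions at least one relevant named entity.
    entities: spaCy-style (text, label) pairs for one sentence."""
    return any(label in RELEVANT_TYPES for _, label in entities)

def filter_triples(triples, entity_mentions):
    """Keep <subject, predicate, object> triples whose subject or object
    contains a recognized entity mention; drop duplicate extractions and
    triples with serial numbers in place of entities."""
    seen, kept = set(), []
    for subj, pred, obj in triples:
        if (subj, pred, obj) in seen:
            continue                      # duplicate extraction
        seen.add((subj, pred, obj))
        if re.fullmatch(r"[\d.\s]+", subj) or re.fullmatch(r"[\d.\s]+", obj):
            continue                      # serial number posing as an entity
        if any(m in subj or m in obj for m in entity_mentions):
            kept.append((subj, pred, obj))
    return kept
```

Under this sketch, a triple like ⟨we, have, good relationship⟩ is dropped (no entity mention) while ⟨P. Picasso, is, artiste⟩ survives.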
Some examples of triples that were removed are: ⟨we, have, good relationship⟩, ⟨i, be, director⟩, ⟨brothel, be in, evening⟩, ⟨drawings, acquired, work⟩. A total of 160,000 triples remained; a valid triple at this stage looked like ⟨P. Picasso, is, artiste⟩.

4 https://spacy.io/usage/v3

3.4. Entity Linking

Once the triples were extracted, the entity linking component of the Stanford CoreNLP pipeline [29] was used to link the entities. This component uses WikiDict as a resource, matching the entity mention text against the dictionary to find a specific entity in Wikipedia. Since the entities in our dataset were present in multiple different surface forms, this step allowed us to partially normalize the entities and identify the unique ones. Though the number of entities was reduced as a result, the total number of triples remained the same. Note that this linking could only map an entity to its Wikipedia counterpart if the entity was found as a subject or object in a triple. In many cases, though, the subject and object were noun phrases instead of obvious entities, for which this kind of linking did not really work. The process was still quite useful, as around 108,841 out of 337,100 entities were successfully linked to their Wikipedia form (leading to 8,369 unique entities). Some of the most frequent entities found in the dataset (along with their frequencies) were: (Pablo_Picasso, 11219), (Paris, 2178), (Artist, 1904), (Henri_Matisse, 1769), (Georges_Braque, 1352).

3.5. Canonicalization

One of the main challenges when constructing a KG through Open IE techniques is that of canonicalization. Multiple surface forms of the same entity or relation might be observed in the triples extracted with Open IE techniques, in the form of noun phrases or verb phrases that need to be identified and mapped to a single semantic entity or relation in the KG.
Since the triples extracted from our dataset via Open IE comprised many noisy phrases, as well as new entities, such as titles of artworks, that may not be available for mapping in existing databases, entity linking techniques do not suffice in this case. Different from entity linking (which can only link entities already present in external KGs), canonicalization is able to cluster the entities and relations that may not be present in existing KGs, labelling them as OOV (out-of-vocabulary) instances. In this work, we chose to perform canonicalization with the help of CESI [37], a popular and openly available approach for this task. CESI performs clustering over the non-canonicalized forms of noun phrases for entities and verb phrases for relations. It leverages different sources of side information for noun phrases and relation phrases, such as entity linking, word senses, and rule-mining systems, to learn embeddings for these phrases using the HolE [38] knowledge graph embedding technique. The clustering is then performed using hierarchical agglomerative clustering (HAC) based on the cosine similarity of the phrase embeddings in vector space. In this manner, different phrases for the same entity or relation were mapped to one canonicalized form for inclusion in the KG. In total, we obtained 3,789 entity clusters and 3,778 relation clusters from the CESI approach that contained two or more terms.

Representative Selection. An important step in the CESI approach is the assignment of representatives for the clusters obtained for the noun and relation phrases. This is decided by calculating a mean of all the cluster members’ embeddings, weighted by their frequency of occurrence. The phrase closest to this mean is selected as the representative.
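This representative-selection rule can be sketched as follows: a frequency-weighted mean of the member embeddings is computed, and the member closest to it by cosine similarity becomes the cluster representative. The toy two-dimensional embeddings used below are purely illustrative; CESI learns HolE embeddings from side information.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def representative(cluster):
    """cluster: list of (phrase, embedding, frequency) tuples.
    Returns the phrase whose embedding is closest to the
    frequency-weighted mean of all member embeddings."""
    total = sum(freq for _, _, freq in cluster)
    dim = len(cluster[0][1])
    mean = [sum(emb[i] * freq for _, emb, freq in cluster) / total
            for i in range(dim)]
    return max(cluster, key=lambda member: cosine(member[1], mean))[0]
```

Note that with skewed frequencies a frequent but semantically different member can dominate the weighted mean and be (wrongly) chosen as representative, e.g. representative([("picasso", (1.0, 0.0), 10), ("olga", (0.0, 1.0), 2), ("khokhlova", (0.1, 0.9), 1)]) returns "picasso" under these toy embeddings.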
However, this technique did not work well for our domain-specific and noisy dataset, and many undesirable errors were noticed. For example, one entity cluster obtained from CESI was: Olga_Khokhlova, olga, khokhlova, picasso. Since Picasso is the most frequent entity in the dataset, it was chosen as the representative by CESI, but this is clearly wrong since Picasso and Olga are different entities. Several other errors were observed; e.g., all days of the week were clustered together in one cluster. This could be a result of the embeddings and contexts of the days of the week being quite similar, so that their vectors end up close together in the vector space. In other cases, the color blue occasionally showed up in a cluster of phrases related to the color red, certain dates got clustered together, and certain related but not interchangeable words were grouped (kill vs. murder vs. shot). In some cases, a first name was replaced by an incorrect full name (not every david is david johnson). To mitigate these errors, we performed manual vetting of the clusters for verification and selection of the correct cluster representatives, which took around 2-3 person-hours. During this process, certain clusters whose members were in fact different entities (such as the cluster with the days of the week) were removed. After this, the entities and relations were canonicalized as per their chosen cluster representatives, leading to a total of 35,305 unique entities and 33,448 unique relations in the final KG5.

3.6. Entity Typing

Since a schema or ontology was not employed to extract the triples from text, the entities in our KG do not have any entity types assigned to them. Therefore, we attempted to identify the types of as many entities in our graph as possible. With the help of NER, we assigned types to the entities that were recognized in the triples.
A total of 14,960 entities were typed with this technique, with generic types such as Person, Product, ORG, LOC, GPE, NORP, and Work_Of_Art, as well as numeric types such as Date, Time, and Ordinal. Note that Work_of_Art is quite a broad category that includes artworks but also movies, books, and various other art forms. Since artworks such as paintings and sculptures are among the most important entities in our art-historic KG, it is worthwhile to identify the mentions and types of these entities. However, the generic NER process is neither equipped nor optimized to correctly identify such mentions. Thus, we additionally applied dictionary-based matching. This was done by compiling a large gazetteer of artwork titles: querying Wikidata via the Wikidata Query Service6 for the names of paintings and sculptures retrieved approximately 15,000 artwork titles. In addition, we augmented our dictionary with the names of the artwork entities from the ArtGraph dataset [22], which contains more than 60,000 artworks derived from DBpedia and WikiArt. If a match was found in the compiled dictionary for an entity in our KG, the type artwork was assigned accordingly. This led to the tagging of a further 1,397 entities in our KG as artworks. The dictionary-based matching for artworks was particularly useful in cases where it correctly identified entities that had been wrongly assigned the Person type by NER, such as la_donna_gravida, portrait_of_mary_cassatt and st._paul_in_prison.

5 It is to be noted that existing canonicalization techniques such as CESI are largely optimized for the canonicalization of entities, and their performance is considerably worse for relations. We observed similar results during our analysis.
6 https://query.wikidata.org/

Table 1: Statistics of the KG

Attribute | Total Triples | Unique Entities | Unique Relations | Artworks | Artists
Count     | 147,510       | 35,305          | 33,448           | 1,397    | 656
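The gazetteer lookup described above can be sketched as follows. The normalization rule (lowercase, underscores for spaces) is an assumption made only to match the surface forms shown in the examples, and the function names are hypothetical.

```python
def normalize(title):
    """Assumed normalization: lowercase, underscores instead of spaces."""
    return title.strip().lower().replace(" ", "_")

def assign_types(entities, ner_types, artwork_titles):
    """Type KG entities: a gazetteer hit overrides the generic NER type,
    so that e.g. a painting title mis-typed as Person becomes artwork."""
    gazetteer = {normalize(t) for t in artwork_titles}
    types = {}
    for ent in entities:
        if normalize(ent) in gazetteer:
            types[ent] = "artwork"        # dictionary match wins
        elif ent in ner_types:
            types[ent] = ner_types[ent]   # fall back to the NER label
    return types
```

For instance, an entity like la_donna_gravida that NER labels as Person would be re-typed as artwork once its normalized form matches a gazetteer title.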
Similar to artworks, we attempted to additionally identify the names of artists in our triples. While NER could only tag entities as Person, we used a dictionary of artist names from Wikidata to identify 656 unique artist entities in our data. These included names of artists such as Piet Mondrian, Edvard Munch, and Rembrandt. However, the process of entity typing described above is only able to identify and tag around half of the entities in our KG. Several domain- and corpus-specific challenges acted as bottlenecks during this process. For example, even after filtering, some triples extracted with Open IE contained subject or object noun phrases that were generic and did not correspond to any named entity. Examples of such phrases include essay, anthology, periodical, or album, present in triples such as ⟨album, be_shown_in, Paris⟩. Without designing a custom ontology for this corpus, such entities cannot be correctly typed. The categorization of the relations in the KG is a particularly complicated task due to the wide variety of relations extracted by the Open IE process. A few of the most frequent relations in the KG are will, be_in, have, show, paint, and work. We estimated that the types of the entities could be utilized to find patterns and link the most popular edges in the KG to the relations in existing graphs such as Wikidata or ArtGraph. However, preliminary analysis led to some interesting observations. Firstly, we noted the presence of multiple relations between pairs of entities in the KG. For example, Picasso and June are connected by various relations such as will_be, work, and take_trip_in that were extracted from different contexts in the corpus and represent separate meaningful facts. Furthermore, in general, there are several different types of semantic relations between the popular entity types in our KG.
For instance, two entities of the type artist can be connected by several relations, including work, meet, know_well, be_with, friend_of, and be_admirer_of. While this variety indicates that a large number of interesting facts have been derived by Open IE in the absence of a fixed and limiting schema, normalizing the relations to improve the quality of the KG is a difficult task that is part of ongoing and future work.

4. Art-historic Knowledge Graph

The statistics of the KG generated from the steps described in the previous section are shown in Table 1.

4.1. Graph Features

After obtaining the refined set of triples for the first version of the art-historic KG, we performed a preliminary analysis of the graph to derive useful insights with the help of the NetworkX7 package. To understand the graph structure, the number of disconnected components of the graph was measured before and after the canonicalization step. The number of disconnected components was reduced to around 1,500 (down from 2,500) after clustering with CESI. This indicates that canonicalization of entities and relations improved the quality of the knowledge graph by removing unnecessary disconnected parts that had been created through redundant triples. Additionally, we performed node centrality analysis on the graph using eigenvector centrality [39] and link analysis using PageRank [40]. For both measures, the node for Pablo Picasso was the most central. This confirms the property of the underlying dataset, which is focused on Picasso. Other central nodes corresponded to popular words in the corpus such as work, artist, and painting. Overall, it is promising to see that the centrality analysis of the generated KG conforms well to the main entities and topics of the underlying corpus. A hand-picked example of a subset of the neighborhood of the entity Picasso is shown in Fig. 2.
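The paper uses NetworkX for this analysis; the before/after component count can equally be sketched with a small union-find over the triples. This is a stdlib illustration with toy entity names, not the project's actual code:

```python
def count_components(triples):
    """Number of weakly connected components in the triple graph (union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for subj, _pred, obj in triples:
        parent[find(subj)] = find(obj)     # union the two endpoints
    return len({find(node) for node in list(parent)})

def apply_canonicalization(triples, canon):
    """Replace surface forms with their cluster representatives."""
    return [(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples]
```

Redundant surface forms such as P. Picasso and Pablo_Picasso initially sit in separate components; mapping them to one representative merges those components, mirroring the drop from around 2,500 to around 1,500 components reported above.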
Figure 2: Illustration of a subset of the KG.

4.2. Evaluation

Due to the lack of a gold standard for direct comparison, the evaluation of the resulting KG proved challenging. While an absolute measure of the coverage of any KG is a non-trivial task due to the open-world assumption [17], we attempted to perform a limited evaluation of the coverage of the KG in a semi-automated fashion. For this, we first created a subset of Wikidata [3] by querying for triples about the entity Picasso and used this as the knowledge graph for comparison. This is motivated by the fact that Wikidata contains high-quality information about Picasso, and the entity linking used in our pipeline links to Wikipedia (hence, Wikidata) entities.

7 https://pypi.org/project/networkx/

Figure 3: Examples of query results on the KG (node colours assigned by Neo4j). (a) Artists exhibited at various venues (corresponding query: MATCH p=(:Artist)-[r:exhibitat]->() RETURN p). (b) Picasso involved in various art schools (corresponding query: MATCH p=(s)-[r:involvedin]->() WHERE s.name="Pablo Picasso" RETURN p).
Therefore, a match between the surface forms of entities in our KG and the Wikipedia entities was more likely than with other datasets such as DBpedia. From the obtained Wikidata subset, 100 triples were randomly selected that related to information about Picasso as well as about museums that own his works. Upon careful manual inspection (independently by three annotators) and resolution of conflicts through discussion, it was measured that the facts represented in 43% of these triples were also present in our KG, either as a direct match or in a different form with the same meaning. Notably, our KG was missing information about the museums that own Picasso’s works; this is because our underlying corpus also lacks comprehensive information on this topic. Therefore, triples relating to museums from Wikidata could not be matched. Additionally, we checked how many of our entities and entity pairs are written in exactly the same way as in the Wikidata graph. Overall, around 12% of entities and 10% of entity pairs in our graph have exact matches in Wikidata. These preliminary results are promising and point towards the need for a domain-oriented construction process for further improvement of the art-historic KG. In particular, the precision of the triples in the art-historic KG is most important to its users; therefore, factual verification of the triples that were extracted from our dataset but are not found in Wikidata needs to be conducted by enlisting the help of domain experts.

4.3. Implementation

Taking a cue from related work [22], we have encoded our KG data into Neo4j8, a NoSQL graph database that provides an efficient way of capturing the diverse connections between the different entities of our knowledge graph. Additionally, the knowledge graph stored in the Neo4j database can be queried easily with the help of the Cypher language, enabling data exploration and knowledge discovery. Fig. 3 shows the results of a few example queries that can be executed on the KG: venues where Picasso and other artists exhibited their work, and various art schools and movements in which Picasso was involved. Further, Fig. 4 shows the persons and/or art styles that Picasso influenced or was influenced by. In some cases, interesting connections with other relevant entities are also retrieved, providing useful cues for further exploration of the data in the KG for domain experts as well as interested users.

5. Discussion and Error Analysis

Due to the source corpus being heterogeneous and noisy, the Open IE process led to a number of incorrect triples in the KG despite our best efforts to eliminate the noise at each step. Here, we perform a critical analysis and look deeper into the quality of the triples in the first version of the KG. For this, we sample a few of the incorrectly extracted triples to understand the nature of the mistakes committed by the automated KG generation process.

8 https://neo4j.com

Table 2 presents some triples in the KG and the corresponding text snippets in the input data from which they were extracted. In T1, even though the triple appears to be syntactically correct, the actual entity corresponds to the entire phrase The Third of May 1808 in Madrid, which is an artwork, and thus the correct triples should relate this artwork to the corresponding artist Francisco de Goya, perhaps including the date 1814 as well. This example illustrates the difficulty of recognizing artwork titles, given that they usually contain other entities like Madrid (a location). A similar mistake can be seen in T6. Here Appel was incorrectly recognized as a location instead of the surname of Karel Appel (a person), and thus the triple represents the information as an influence of an artist on a location, instead of between two artists.
Examples T2 to T6 show the triples, with their supporting text snippets, for the results of the query depicted in Figure 4; they contain a mixture of factually correct, factually incorrect, and speculative facts. In T2, a relation was correctly extracted from the text, but the head entity was incorrectly recognized as ‘American’. This example speaks for the need for additional work on co-reference resolution, in order to properly follow the connections in the text. A more precise triple would have been ⟨Gorky, beInfluenceBy, Pablo Picasso⟩. T3 is an example in which the lack of context in the syntactic analysis of the sentence results in the assumption that the statement is true, although it is a suggestion by a specific person and therefore not necessarily a true fact. A similar example is T4, in which the source text explains a potential influence relation between the artists, but it cannot be directly assumed to be a fact. These two examples illustrate that the context of the actual text might get lost during the extraction process, which may lead to erroneous facts being represented in the KG. Thus, it is important to take into account the provenance information that can help the user understand the full context for obtaining the correct information. A different scenario is depicted in T5, in which the text clearly confirms the validity of the fact. One interesting observation regards the syntactic structure of the relation phrase: the word ‘doubtless’ acts as an adverb emphasizing the validity of the fact, and although it divides the relation phrase ‘was influenced by’, the syntactic analyzer and the canonicalization step were able to normalize the relation to a canonical form.

Figure 4: Illustration of a subset of the KG, depicting the influence of and on Picasso (corresponding query: MATCH p=(s)-[:beinfluenceby]-(o) WHERE s.name="Pablo Picasso" RETURN p)

Table 2: Examples of triples in the KG with their corresponding source texts

T1 ⟨The Third of May 1808, beIn, Madrid⟩
At the center of the show, a room containing Francisco de Goya’s The Third of May 1808 in Madrid (1814), Édouard Manet’s The Execution of Emperor Maximilian of Mexico (1868-69). . .

T2 ⟨American, beInfluenceBy, Pablo Picasso⟩
The more one examines Gorky’s early works, the more they appear like Gorkys rather than like Picassos. Moreover, his unabashed borrowings can be seen as forward-looking: for an American to be influenced by Picasso in the heyday of American Scene painting was, art historian Meyer Schapiro points out, “an act of originality.”

T3 ⟨Pablo Picasso, beInfluenceBy, Morris Louis⟩
. . . to Andrew Hudson, art critic of The Washington Post, for suggesting that Pablo Picasso has been influenced by Morris Louis and Kenneth Noland, two leaders of the “post-painterly” Washington, D.C.

T4 ⟨Guevara, beInfluenceBy, Pablo Picasso⟩
It is probable that Guevara was influenced by Picasso to experiment with the encaustic technique, which had been practised in antiquity. Hot wax was used as a medium for mixing floral and vegetable dyes.

T5 ⟨Pablo Picasso, beInfluenceBy, Aubrey Beardsley⟩
Picasso was influenced doubtless by Aubrey Beardsley, who had died in 1899 at the age of twenty-six, but then what an excellent influence it proved to be for this portrait !

T6 ⟨Appel, beInfluenceBy, Pablo Picasso⟩
In artistic respect, one could also see, that Karel Appel was strongly influenced in this period, by Picasso and Miro.
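The normalization behaviour observed for T5 can be illustrated with a small, purely hypothetical stand-in for the actual pipeline (which uses a syntactic analyzer plus CESI): a single pattern that maps several surface variants of the influence relation, including one interrupted by an adverb, onto the canonical `beInfluenceBy`.

```python
import re

# Illustrative stand-in for relation-phrase normalisation: surface variants of
# the influence relation, in different tenses and with an intervening adverb
# (as in T5), are mapped onto the canonical relation `beInfluenceBy`.
PATTERN = re.compile(
    r"(?P<head>[A-Z]\w+(?:\s+[A-Z]\w+)*)\s+"      # capitalised head entity
    r"(?:was|were|is|are|has been|had been)\s+"   # auxiliary, any tense
    r"(?:\w+\s+)?influenced(?:\s+\w+)?\s+by\s+"   # optional adverb around verb
    r"(?P<tail>[A-Z]\w+(?:\s+[A-Z]\w+)*)"         # capitalised tail entity
)

def extract_influence(sentence: str):
    """Return a canonicalised ⟨head, beInfluenceBy, tail⟩ triple, or None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("head"), "beInfluenceBy", m.group("tail"))
```

A pattern like this covers T4 and T5 but, tellingly, not T6, where an adverb and a prepositional phrase separate the relation phrase further; this is exactly the kind of variation that motivates syntactic rather than surface-level extraction.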
This is also evident in the diversity of relation phrases in this sample of texts. They are expressed in different tenses, with auxiliary verbs, and sometimes spread within a more complex sentence, as seen in T5. Examples T3 to T6 illustrate the need for fact-checking in our KG. In particular, the facts in the KG could be presented to domain experts, who would be able to easily inspect the information in a user-friendly manner and then investigate further to either corroborate or contradict the triples in the automatically generated KG. We envision easy access to and scrutiny of the information stored in large text collections to be the primary use case of this automatically generated art-historic KG.

6. Lessons Learned and Future Work

This work presented a first attempt at constructing a domain-oriented knowledge graph for the art domain in an automated fashion with Open IE techniques. Due to the noisy and heterogeneous dataset that is typical of digitized art-historic collections, we encountered challenges at various steps of the KG construction process. During the very first step, it was difficult to correctly identify the mentions of artworks (i.e., titles of paintings) in the dataset due to the noise and inherent ambiguities. This domain-specific issue needs further attention in order to improve the quality as well as the coverage of the resulting KG, as discussed in detail by previous work [35]. In addition, a co-reference resolution tool [41] could also help with the identification and linking of relevant entities. While the Open IE approach allowed for the extraction of a wide variety of entities and relations, this made canonicalization a complicated task. We observed that existing techniques for canonicalization on generic datasets, such as CESI, do not show comparable performance on domain-specific datasets.
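The clustering idea behind CESI-style canonicalization can be sketched as follows; the character-trigram "embedding" is a deliberately crude stand-in for the KG embeddings CESI uses (or the FastText/BERT alternatives mentioned below), and the similarity threshold is an arbitrary assumption.

```python
import math
from collections import Counter

def embed(phrase: str) -> Counter:
    """Toy character-trigram vector; a stand-in for learned embeddings."""
    padded = f"  {phrase.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def canonicalize(phrases, threshold=0.5):
    """Greedy single-pass clustering: a phrase joins the first cluster whose
    representative (first member) is similar enough, else opens a new one."""
    clusters = []
    for phrase in phrases:
        for cluster in clusters:
            if cosine(embed(phrase), embed(cluster[0])) >= threshold:
                cluster.append(phrase)
                break
        else:
            clusters.append([phrase])
    return clusters
```

Under this toy similarity, "was influenced by" and "is influenced by" fall into one cluster while "exhibited at" opens another; real canonicalization must additionally exploit side information, which is precisely where the domain-specific difficulties arise.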
It would be interesting to investigate whether large pretrained language models such as FastText and BERT could compete with the relatively older KG embeddings that were employed in CESI for obtaining better clusters. There are other recent works on canonicalization [42, 43] that demonstrate better results and would be worth exploring further for our use case in future work. Another important aspect is the incomplete typing of the various entities obtained from Open IE. Owing yet again to the noise in the process, as well as to the lack of any underlying schema, many entities could not be assigned their correct type. This task needs further exploration for the enrichment of the KG. Moreover, we have so far only considered English texts in this work, since the existing methods perform best on English texts. However, our art-historic collection comprises texts in multiple languages, and we would like to expand the pipeline to process multilingual texts. Taking into account the existing limitations of the methods with domain-specific corpora, this seems to be an arduous but interesting research challenge. With regard to the implementation of the KG pipeline, while we have so far used off-the-shelf tools and libraries such as SpaCy, Stanford CoreNLP, and CESI, we plan to further fine-tune them for the task of domain-specific KG construction. It will also be worthwhile to explore and evaluate the performance of other available tools such as Flair [44] and Blink [45] for entity recognition, linking, and typing, as well as OpenIE [16] and MinIE [15] for the extraction of triples. The scalability of these approaches and the completeness of the resulting KG in the presence of new and expanding cultural heritage datasets are also open research questions to be looked into. The evaluation of the art-historic KG is also a crucial task worth discussing.
While we have performed a semi-automated evaluation for the first version of our KG, a more rigorous and thorough evaluation of the correctness of the facts is certainly imperative before this KG can be useful to a non-expert user (as discussed in Section 5). One way to ensure this would be to maintain the provenance of the facts in the KG, in terms of both their source documents and their confidence measures. This could also facilitate a fair and complementary manual evaluation in terms of precision and recall, which could provide further insights. For this, we plan to closely collaborate with domain experts and enlist their help in the near future.

7. Conclusion

In this work, we have presented our approach to constructing an art-historic KG from digitized texts in an automated manner. We have leveraged existing Open IE tools for various stages of the KG construction process and discussed the limitations and challenges of adapting these generic tools to domain-specific datasets. We have presented these insights with the hope of encouraging interesting dialogue and further progress along these lines. While our limited initial analysis and evaluation have shown encouraging results, they have also clearly indicated points of improvement for creating a more refined and comprehensive version of an art-historic KG, which could be used for downstream tasks such as search and querying.

References

[1] A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, et al., Knowledge graphs, ACM Computing Surveys (CSUR) 54 (2021) 1–37.
[2] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al., DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6 (2015) 167–195.
[3] D. Vrandečić, M.
Krötzsch, Wikidata: A free collaborative knowledge base, Communications of the ACM 57 (2014) 78–85.
[4] M. Kejriwal, Domain-specific knowledge graph construction, Springer, 2019.
[5] P. Ernst, C. Meng, A. Siu, G. Weikum, Knowlife: A knowledge graph for health and life sciences, in: Proceedings of the 30th International Conference on Data Engineering, IEEE, 2014, pp. 1254–1257.
[6] C. Rudnik, T. Ehrhart, O. Ferret, D. Teyssou, R. Troncy, X. Tannier, Searching news articles using an event knowledge graph leveraged by Wikidata, in: Companion Proceedings of The 2019 World Wide Web Conference, 2019, pp. 1232–1239.
[7] G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, A. Flammini, Computational fact checking from knowledge networks, PloS one 10 (2015) e0128193.
[8] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, C. Ré, Incremental Knowledge Base Construction using Deepdive, in: Proceedings of the VLDB Endowment International Conference on Very Large Data Bases, volume 8, 2015, p. 1310.
[9] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, T. M. Mitchell, Toward an Architecture for Never-Ending Language Learning, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010, pp. 1306–1313.
[10] O. Etzioni, M. Banko, S. Soderland, D. S. Weld, Open information extraction from the web, Communications of the ACM 51 (2008) 68–74.
[11] A. Fader, S. Soderland, O. Etzioni, Identifying Relations for Open Information Extraction, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 1535–1545.
[12] A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland, Textrunner:
Open Information Extraction on the Web, in: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2007, pp. 25–26.
[13] G. Angeli, M. J. J. Premkumar, C. D. Manning, Leveraging linguistic structure for open domain information extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 344–354.
[14] L. D. Corro, R. Gemulla, ClausIE: Clause-based open information extraction, Proceedings of the 22nd International Conference on World Wide Web (2013).
[15] K. Gashteovski, R. Gemulla, L. del Corro, MinIE: Minimizing facts in open information extraction, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2630–2640. URL: https://aclanthology.org/D17-1278. doi:10.18653/v1/D17-1278.
[16] K. Kolluru, V. Adlakha, S. Aggarwal, Mausam, S. Chakrabarti, OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 3748–3761. URL: https://aclanthology.org/2020.emnlp-main.306. doi:10.18653/v1/2020.emnlp-main.306.
[17] L. Galárraga, S. Razniewski, A. Amarilli, F. M. Suchanek, Predicting Completeness in Knowledge Bases, in: Proceedings of the 10th ACM International Conference on Web Search and Data Mining, 2017, pp. 375–383.
[18] N. Jain, Domain-Specific Knowledge Graph Construction for Semantic Analysis, in: Proceedings of the Extended Semantic Web Conference (ESWC) 2020 Satellite Events, Springer International Publishing, Cham, 2020, pp. 250–260.
[19] H. Wu, S. Y. Liu, W. Zheng, Y. Yang, H.
Gao, PaintKG: The painting knowledge graph using biLSTM-CRF, in: Proceedings of the 2020 International Conference on Information Science and Education (ICISE-IE), 2020, pp. 412–417. doi:10.1109/ICISE51755.2020.00094.
[20] J. Hunter, S. Odat, Building a Semantic Knowledge-base for Painting Conservators, in: Proceedings of the 2011 IEEE Seventh International Conference on eScience, 2011, pp. 173–180. doi:10.1109/eScience.2011.32.
[21] V. A. Carriero, A. Gangemi, M. L. Mancinelli, L. Marinucci, A. G. Nuzzolese, V. Presutti, C. Veninata, ArCo: The Italian cultural heritage knowledge graph, in: International Semantic Web Conference, Springer, 2019, pp. 36–52.
[22] G. Castellano, G. Sansaro, G. Vessio, ArtGraph: Towards an Artistic Knowledge Graph, arXiv e-prints (2021) arXiv–2105.
[23] S. Oramas, L. Espinosa-Anke, M. Sordo, H. Saggion, X. Serra, Information extraction for knowledge base construction in the music domain, Data and Knowledge Engineering 106 (2016) 70–83. URL: https://www.sciencedirect.com/science/article/pii/S0169023X16300416. doi:10.1016/j.datak.2016.06.001.
[24] O. Vsesviatska, T. Tietz, F. Hoppe, M. Sprau, N. Meyer, D. Dessì, H. Sack, ArDO: An ontology to describe the dynamics of multimedia archival records, in: Proceedings of the 36th Annual ACM Symposium on Applied Computing, 2021, pp. 1855–1863.
[25] T. Tietz, J. Waitelonis, K. Zhou, P. Felgentreff, N. Meyer, A. Weber, H. Sack, Linked Stage Graph, in: SEMANTICS Posters&Demos, 2019.
[26] D. Oldman, C. Labs, The CIDOC Conceptual Reference Model (CIDOC-CRM): PRIMER, CIDOC-CRM official web site (2014).
[27] C. Dijkshoorn, L. Jongma, L. Aroyo, J. Van Ossenbruggen, G. Schreiber, W. ter Weele, J. Wielemaker, The Rijksmuseum Collection as Linked Data, Semantic Web 9 (2018) 221–230.
[28] S. Van Hooland, R. Verborgh, Linked Data for Libraries, Archives and Museums: How to clean, link and publish your metadata, Facet publishing, 2014.
[29] C. D. Manning, M.
Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. URL: http://www.aclweb.org/anthology/P/P14/P14-5010.
[30] X. Zhao, Z. Xing, M. A. Kabir, N. Sawada, J. Li, S.-W. Lin, HDSKG: Harvesting domain specific knowledge graph from content of webpages, in: Proceedings of the 24th International Conference on Software Analysis, Evolution and Re-engineering (SANER), IEEE, 2017, pp. 56–67.
[31] S. Huang, X. Wan, AKMiner: Domain-specific knowledge graph mining from academic literatures, in: Proceedings of the International Conference on Web Information Systems Engineering, Springer, 2013, pp. 241–255.
[32] J. Yuan, Z. Jin, H. Guo, H. Jin, X. Zhang, T. Smith, J. Luo, Constructing biomedical domain-specific knowledge graph with minimum supervision, Knowledge and Information Systems 62 (2020) 317–336.
[33] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, J. Morissette, Bio2RDF: Towards a mashup to build bioinformatics knowledge systems, Journal of biomedical informatics 41 (2008) 706–716.
[34] P. Ernst, A. Siu, G. Weikum, Knowlife: A versatile approach for constructing a large knowledge graph for biomedical sciences, BMC bioinformatics 16 (2015) 1–13.
[35] N. Jain, R. Krestel, Who is Mona L.? Identifying Mentions of Artworks in Historical Archives, in: Proceedings of the International Conference on Theory and Practice of Digital Libraries, Springer International Publishing, Cham, 2019, pp. 115–122.
[36] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[37] S. Vashishth, P. Jain, P.
Talukdar, CESI: Canonicalizing open knowledge bases using embeddings and side information, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1317–1327.
[38] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[39] P. Bonacich, Power and centrality: A family of measures, American journal of sociology 92 (1987) 1170–1182.
[40] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report 1999-66, Stanford InfoLab, 1999. URL: http://ilpubs.stanford.edu:8090/422/, previous number = SIDL-WP-1999-0120.
[41] K. Clark, C. D. Manning, Deep reinforcement learning for mention-ranking coreference models, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016. URL: https://nlp.stanford.edu/pubs/clark2016deep.pdf.
[42] T. Jiang, T. Zhao, B. Qin, T. Liu, N. V. Chawla, M. Jiang, Canonicalizing Open Knowledge Bases with Multi-Layered Meta-Graph Neural Network, CoRR abs/2006.09610 (2020). URL: https://arxiv.org/abs/2006.09610. arXiv:2006.09610.
[43] S. Dash, G. Rossiello, N. Mihindukulasooriya, S. Bagchi, A. Gliozzo, Open knowledge graphs canonicalization using variational autoencoders, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 10379–10394. URL: https://aclanthology.org/2021.emnlp-main.811. doi:10.18653/v1/2021.emnlp-main.811.
[44] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R.
Vollgraf, FLAIR: An easy-to-use framework for state-of-the-art NLP, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 54–59. URL: https://aclanthology.org/N19-4010. doi:10.18653/v1/N19-4010.
[45] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettlemoyer, Scalable Zero-shot Entity Linking with Dense Entity Retrieval, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6397–6407. URL: https://aclanthology.org/2020.emnlp-main.519. doi:10.18653/v1/2020.emnlp-main.519.