Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper2 |
wikidataid | Q117344957 |
title | Extracting a Unified Database from Heterogeneous Bills of Materials via Data Integration Approaches |
pdfUrl | https://ceur-ws.org/Vol-3194/paper2.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/LuciaF0O22 |
volume | Vol-3194 |
session | |
Extracting a Unified Database from Heterogeneous Bills of Materials via Data Integration Approaches

Luigi De Lucia1,2, Guglielmo Faggioli1, Nicola Ferro1 and Fabio Orobelli2
1 University of Padova, Padova, Italy
2 Bedeschi S.p.A., Limena, Italy

Abstract
Since the beginning of the information era, manufacturing companies have collected a huge amount of data and acknowledged the importance of proper collection and curation policies. Nevertheless, an open problem is the exploitation of historical and legacy data. Such data is heterogeneous: it has often been accumulated throughout decades by different curators with different degrees of attention toward the data. In this work, we focus on the historical data collected by Bedeschi S.p.A., an international company dealing with the design and production of turn-key solutions for Bulk Material Handling, Container Logistics and Bricks. Our task is to extract the knowledge available in the form of partially structured Bills of Materials (BoMs) and distil it into a unified database over a mediated schema that designers and the sales department can use to rapidly get a global overview of the products developed over the years. We present a feasibility study investigating the performance of several traditional data integration approaches applied to our specific task. In particular, we treat BoMs as sources that need to be aggregated into the unified mediated schema. We achieved satisfactory results indicating the direction the company should follow to properly extract knowledge from its historical data.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
Data is arguably becoming, day by day, one of the most valuable assets for an increasing number of companies; large companies have digitalized several production processes, making it easy to collect, curate and exploit their data. This is not true for historical and legacy data, which is often overlooked and poorly used. Three main challenges impair the handling of historical data. First, legacy data is fragmented and scattered over different sources, formats and media, making it challenging to organize it as unitary knowledge. Secondly, historical data is unstructured and weakly linked to other information; it also contains dirty values and information expressed in natural language [1]. Finally, historical data is characterized by a high degree of heterogeneity: several annotators collected it in digital archives over decades, while the underlying production processes and paradigms changed. These challenges make legacy and historical datasets hard to exploit automatically. Our work stems from a specific industrial need: exploiting historical data to help the design and sales departments by providing them with a unifying overview of all the products built over the years. We exploit historical data collected by Bedeschi S.p.A., an international company specialized in the production of turn-key solutions for Bulk Material Handling, Container Logistics and Bricks. We aim at extracting information from a set of heterogeneous semi-structured Bills of Materials (BoMs) to organize
it in a uniform database over a global schema providing an overview of the different models of Bedeschi machines developed over decades. This work is a feasibility study to determine whether it is possible to process these BoMs automatically. In particular, to determine the most promising approaches, we focus on a single Bedeschi machine type, the BEL-C, in production at Bedeschi S.p.A. for decades. Furthermore, machines of this type have rich BoMs that allow us to recognize emerging challenges and critical issues. Moreover, different designers and curators have produced, collected, and organized the data about this type of machine over the years using different media, starting from paper sheets – now digitalized – to modern information access systems. We interpret our task under the data integration framework. In particular, we consider BoMs associated with different machines as our data sources. These data sources are then integrated to obtain a global schema. Our contributions are the following:
• We formalize the unified database extraction task from an industrial perspective as a data integration task;
• We identify a set of techniques drawn from the schema matching and instance matching domains to address the task;
• We empirically evaluate the performance of such techniques.
The remainder is organized as follows: Section 2 describes the main efforts in the Data Integration domain. Section 3 contains the description of the task at hand and the data available. Section 4 details our methodology, while Section 5 contains its evaluation. Finally, in Section 6 we draw some conclusions and outline future steps.

2. Related Work
Data integration aims to provide unified access to data that usually resides on multiple, autonomous data sources, like different databases or data retrieved from internet websites [2]. One of the major challenges in data integration is data ambiguity, which arises whenever there is some uncertainty about the intended meaning of the data. Among the data integration systems developed, some handle the whole data integration process, such as Do and Rahm [3], Fagin et al. [4], Seligman et al. [5], while others, e.g., Bernstein et al. [6], Falconer and Noy [7], focus only on the schema mapping task. Among the most prominent endeavours, we list the following. COMA [3, 8, 9] automatically generates matchings between source and target schemas, by exploiting multiple matching approaches to compute the matching. Clio, developed by Fagin et al. [4], supports a visual matching representation, and is capable of working with schemas without knowing the mediated schema. Cupid [10] is a schema matcher algorithm that exploits similarities of different aspects of the schema elements: keys, referential constraints, and views. Several different endeavours have been devoted to the schema matching task. Rahm [11] presents an overview of the possible approaches based on different types of matching to detect the correspondences between the schema elements. When the instances are available, it is possible to state that two schema elements are similar if they have close instances (Bernstein et al. [6], Madhavan et al. [12]).

Figure 1: (a) A BEL-C machine and (b) a part of its Bill of Materials (BoM).

Additionally, different works exploit the information provided by instance values.
COMA++, an extension of COMA proposed by Engmann and Massmann [8], considers the average character length of instances, average values and standard deviations. A second approach, proposed by Bilke and Naumann [13], exploits duplicate instances to find matches. Despite the popularity of the task, there is no universal benchmark to evaluate schema matching approaches [14]. Among the ancillary techniques needed to achieve data integration, we can list data profiling, reverse database engineering, and the combination of matching approaches. Data profiling is the set of processes and actions performed to retrieve specific information, also called metadata, about a given dataset [15]. Generally, profiling tasks support manual data exploration and improve the data gazing [16] process. Data profiling further helps data integration by revealing inconsistencies and errors in the data [17]. Reverse database engineering allows identifying different characteristics of the database entities, like relations, attributes, domain semantics, foreign keys and cardinalities, that can be introduced into the data integration process via explicit rules [18]. Combining different matching approaches allows detecting a higher number of correct matches compared to using a single matcher [19]; some such approaches can be seen in [12, 10, 20]. Hybrid and composite matchings are two different ways of combining matching approaches. The former uses multiple matching criteria together to compute the score. The latter, which we employ in our work, computes the matching scores independently and then combines them.

3. The Task and Dataset
3.1. Task
In the design process of a new product, it is common practice to look at the design of previous ones to retrieve reference specifications and understand the design choices. To this end, it is necessary to have an overview of products realized previously. In this work, we focus on a specific product: a machine type, called BEL-C, which has been in production at the Bedeschi company for decades and has multiple versions. Figure 1a depicts an example of a BEL-C machine: this machine includes several “parts”, called assemblies, e.g., the mechanical arm on the right side is an assembly composed of multiple articles such as wheels and buckets. The description of each realization of a BEL-C machine is available in the form of a Bill of Materials (BoM). A BoM is the list of the components, also called articles, needed to manufacture a certain product. Multiple articles can be further aggregated into assemblies. Thus, the elements in a BoM are organized hierarchically as a tree: the root node represents the final product, while internal and leaf nodes represent assemblies and articles, respectively. Figure 1b illustrates the top portion of the BoM for the machine in Figure 1a. BoMs referring to different products of the same type, as in our case, might be highly heterogeneous. In fact, they might have been realized decades apart, can be the digitalization of old analog production data and might include heterogeneous descriptions. This results in data ambiguities that need to be resolved. We aim to realize a tool that summarizes all the relevant information of the different BEL-C machines produced during the years. More in detail, we have a manually compiled example of a database constructed over a mediated schema and we want to learn how to extract a similar database automatically from the BoMs. Such an example unified database is based on recent realizations of the BEL-C.
Therefore, mapping historical BoMs into the database over the mediated schema presents two main challenges: i) the example data is expressed in natural language and it is not possible to find the very same attribute values in our data; ii) the modern BoMs used to construct the example unified database are not among the historical ones and thus are not in our dataset. Therefore, we cannot use traditional entity resolution and linkage approaches. Even though they share a similar structure, the heterogeneity between different BoMs makes them closer to different data sources than to different instances of the same database. In this sense, the schema underlying the unified database that combines the information of the different BoMs can be seen as a mediated schema. Therefore, we model our task in a data integration framework as follows: given a set of BoMs, the description of the articles they contain and an example of a unified database over the mediated schema, we have to: 1) understand how BoMs are mapped into the mediated schema; 2) automatically construct the unified database for previously unseen BoMs.

3.2. Dataset
To address our task, we have a set of BoMs, a database of the articles, and a manually built example of the required final mediated schema.

BoMs. Different BEL-C machines might have diverse specifications, i.e., the same component can have different dimensions or distinct motors can be installed. Each BoM, identified by a unique serial number, is an instance representing a specific realization of a BEL-C machine with its characteristics. The BoM contains the IDs of the articles used in building the machine. Inside each BoM, articles might be organized into assemblies. Each assembly has a unique identifier and a description; the assembly description allows understanding what part of the machine that portion of the tree refers to. In our historical data, we have BoMs associated with 16 distinct machines. BoMs contain from 6 to 375 articles.

Articles. The article database contains the description of the characteristics of the articles listed in BoMs. Articles are organized into classes – e.g., chains, engines, etc. – uniquely identified by a mnemonic code. Articles belonging to different classes have distinct sets of attributes. The domains of such attributes are not formally described: in most cases, attribute values are expressed via natural language strings even when only typed (e.g., numeric or Boolean) values are expected. The only attribute common to all classes is “description”: a free-text field containing the description of the article and its usage. Several articles have multiple null values – such values are not really empty: they have not been correctly reported, or they were lost while moving from paper to digital archives. In this sense, the “description” attribute is often the only one (ab-)used, containing also the values for all the other attributes. Figure 2 reports the ideal Entity–Relationship (ER) schema for our dataset: BoMs are composed of assemblies or articles, and assemblies include either other assemblies or articles.

Figure 2: The ideal Entity–Relationship schema of our data. Each BoM includes multiple assemblies and/or single articles. An assembly can be composed either of articles or of other assemblies. Articles are divided into several different classes (e.g., Engine, Chain).
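To make the hierarchical structure of BoMs and the ER schema above concrete, the following minimal Python sketch models a BoM as a tree of assemblies and articles. The class and attribute names are illustrative assumptions and do not reflect the company's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Article:
    """Leaf node of the BoM tree: a single component."""
    article_id: str
    article_class: str    # mnemonic class code, e.g. a chain or engine class
    description: str      # free-text field, often the only populated attribute
    attributes: dict = field(default_factory=dict)  # class-specific attributes, possibly null

@dataclass
class Assembly:
    """Internal node: groups articles and/or other assemblies."""
    assembly_id: str
    description: str
    children: List[Union["Assembly", Article]] = field(default_factory=list)

@dataclass
class BoM:
    """Root node: one realization of a machine, identified by its serial number."""
    serial_number: str
    root_assemblies: List[Union[Assembly, Article]] = field(default_factory=list)

    def articles(self):
        """Yield all articles in the tree (depth-first)."""
        stack = list(self.root_assemblies)
        while stack:
            node = stack.pop()
            if isinstance(node, Article):
                yield node
            else:
                stack.extend(node.children)
```

A traversal of this kind is what lets the articles of every historical BoM be collected and later compared against the fields of the mediated schema.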
Example Database over the Global Schema. To understand which are the relevant attributes and their domains, domain experts provided us with an example of the database constructed over the global schema that we would like to distil from the BoMs. This database was constructed manually and often the values of its fields are not in the domain of those in the articles dataset – e.g., different units of measure, strings instead of numbers. Furthermore, it is based on recent BEL-C machines that are not among the historical BoMs. The main challenge is understanding how to map articles' attributes to the global schema fields by comparing their domains and relations. The example database organizes the data as follows: each column contains a different machine realization – 20 of them – while each row represents a machine specification. Our dataset includes 47 unique fields for the machine specification. Rows are divided into groups – e.g., technical features, chain – that provide semantic meaning to the fields they include.

Manual mapping. As aforementioned, the historical BoMs and those available in the example unified database do not overlap. Furthermore, the values of the fields in the example unified database are manually set and not available in the domain of the attributes in the articles dataset. Therefore, we cannot directly determine whether an automatic mapping produced using data integration techniques is correct. Thus, with the help of domain experts, we built a manual mapping between the global schema fields and the articles' attributes. Such a mapping allows us to define a query that provides the mediated schema immediately. The construction of this manual mapping took several hours and has two main drawbacks: it does not work on machines other than the BEL-C, nor does it work if new attributes are taken into account. Therefore, we use the manual mapping to evaluate different automatic integration approaches, so that they can be used on previously unseen scenarios for which we do not have the time or resources to construct a manual mapping.

4. Methodology
Our objective is to understand the mapping between the fields of the target schema and the article attributes. Given ℱ, the set of fields of the mediated schema, and 𝒜, the set of article attributes, for each 𝑓 ∈ ℱ our objective is to select the article attribute 𝑎 ∈ 𝒜 that maps into 𝑓. To do so, we compute a similarity measure 𝑠(𝑓, 𝑎) which describes how similar 𝑓 and 𝑎 are. Then the value of the field 𝑓 is extracted from the attribute 𝑎 such that:

arg max_{𝑎 ∈ 𝒜} 𝑠(𝑓, 𝑎)

Our task becomes identifying the best similarity function 𝑠 to describe the relation between a field 𝑓 and an article attribute 𝑎. To do so, we model three aspects: the similarity function 𝑠, the representation of the field 𝑓 and the representation of the article attribute 𝑎. Depending on the representation used for 𝑓 and 𝑎, there are two main categories of matching approaches: schema-based and instance-based matching approaches. The former relies on schema information about both field and article, such as their names, group or class. Instance-based matching approaches represent articles and fields using their instances. A third class of approaches is the one based on composite matchings, where the outputs of multiple matchers are combined. Finally, using the manual mapping developed with the contribution of two domain experts, we can evaluate the performance of each similarity function or mapping strategy.
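As an illustration of the selection rule above, the following minimal Python sketch assigns to each mediated-schema field the article attribute with the highest similarity score. The token-overlap similarity used here is only a stand-in for the matchers introduced in the next subsections, and all field and attribute names are hypothetical.

```python
def token_dice(a: str, b: str) -> float:
    """Toy similarity: Dice coefficient over whitespace tokens (stand-in for s(f, a))."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def best_mapping(fields, attributes, similarity=token_dice):
    """For each field f, return the attribute a maximizing similarity(f, a)."""
    return {f: max(attributes, key=lambda a: similarity(f, a)) for f in fields}

# Hypothetical field and attribute names, for illustration only.
fields = ["chain pitch", "engine power"]
attributes = ["pitch of chain", "motor power kw", "bucket width"]
print(best_mapping(fields, attributes))
# {'chain pitch': 'pitch of chain', 'engine power': 'motor power kw'}
```

The whole methodology then reduces to choosing a good similarity function and good representations of 𝑓 and 𝑎 to plug into this rule.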
4.1. Data Normalization
Following common practices in the schema matching domain [10, 21], we start by normalizing the entities in our dataset. The normalization pipeline includes converting characters to lowercase, removing trailing spaces, and removing non-informative tokens (e.g., single characters and stopwords).

4.2. Schema-Based Matching
Using schema information, such as labels, data types, and structural properties, is a widely used approach [3, 10, 12, 22, 23]. In this scenario, we can exploit elements from the schema of the unified database, the mediated schema, and the schema of the articles database. For the mediated schema fields 𝑓 we consider the following representations:
• Field name (F): the name of the attribute of the target schema.
• Group name (G): the name of the group the field is in.
• Concatenation between the field and the group it belongs to (GF).
Concerning the article attributes 𝑎, on the other hand, we can represent them as follows:
• Article attribute name (A): the name of the article attribute in the articles database.
• Article class (C): the name of the article class. Being a mnemonic code, it contains relevant semantic information.
• Concatenation between the article class and the attribute name (CA).
Once we have defined the representations for fields and articles, we need to define the similarity 𝑠. In particular, we employ the following matching approaches:
• Dice's coefficient (CW): based on the size of the intersection between the sets of terms of the two strings.
• Edit Distance (ED): the number of edit operations required to transform one string into the other.
• Pairwise Edit Distance (EDP): two strings containing identical words in a different order have a high edit distance; EDP instead computes the edit distance between each possible pair of tokens from the two strings, returning the average distance over all the combinations.
• N-Grams (NG𝑛): the number of common n-grams between the two strings [24].
Except for the pairwise edit distance, which is normalized by default, all the above-mentioned matching approaches are employed in both their original and normalized forms.

4.3. Instance-Based Matching
In the case of instance-based matching, we represent the field 𝑓 and the article attribute 𝑎 using the values that they assume for the different instances. In particular, given a field 𝑓, we can concatenate the values of that field for each BoM in the example global schema and obtain a document 𝑑𝑓. Similarly, taking all the values of the attribute 𝑎 in the articles database, we concatenate them and define a document 𝑑𝑎 that represents the article attribute. Given the size of 𝑑𝑓 and 𝑑𝑎, the similarities used in the schema-based scenario are computationally too demanding. Therefore, we propose to use document similarity measures, such as those typically used in Information Retrieval (IR). Given a field document 𝑑𝑓, we rank all the attribute documents 𝑑𝑎 based on their similarity, and select the attribute that maximizes the similarity with the field document. Among the similarity measures 𝑠, we investigate the following: Jaccard's coefficient [25, 26], cosine similarity [25] and BM25 [27].

4.4. Composite Matchings
To further improve matching performance, we also investigate the fusion of matching strategies. In particular, similarly to what is typically done in IR, for each field we rank the attributes based on their similarity with the field. Then, we fuse the different rankings using CombSUM and CombMNZ [28].
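To illustrate the two fusion rules just mentioned, the following sketch fuses the per-attribute scores produced by different matchers: CombSUM sums the scores, while CombMNZ further multiplies the sum by the number of matchers assigning a non-zero score. This is a minimal illustration; the dictionaries and toy values are assumptions, not data from the paper.

```python
from collections import defaultdict

def comb_sum(score_lists):
    """CombSUM: sum the scores each matcher assigns to an attribute."""
    fused = defaultdict(float)
    for scores in score_lists:          # one dict {attribute: score} per matcher
        for attr, s in scores.items():
            fused[attr] += s
    return dict(fused)

def comb_mnz(score_lists):
    """CombMNZ: CombSUM multiplied by the number of matchers with a non-zero score."""
    fused = comb_sum(score_lists)
    support = defaultdict(int)
    for scores in score_lists:
        for attr, s in scores.items():
            if s > 0:
                support[attr] += 1
    return {attr: s * support[attr] for attr, s in fused.items()}

# Toy scores from two matchers for one mediated-schema field (illustrative values).
ngram_scores = {"pitch of chain": 0.5, "motor power kw": 0.25}
bm25_scores = {"pitch of chain": 2.5, "bucket width": 0.5}
print(comb_mnz([ngram_scores, bm25_scores]))
# {'pitch of chain': 6.0, 'motor power kw': 0.25, 'bucket width': 0.5}
```

In practice, score scales differ across matchers, so the fused ranking (rather than the raw fused score) is what matters.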
We also propose an approach, dubbed weighted average (WA), that weights the similarities computed by the different matching approaches as follows:

WA(𝑓, 𝑎) = (𝛼𝑠1(𝑓, 𝑎) + 𝛽𝑠2(𝑓, 𝑎) + ... + 𝛾𝑠𝑀(𝑓, 𝑎)) / (𝛼 + 𝛽 + ... + 𝛾)   (1)

where 𝑠1, 𝑠2, ..., 𝑠𝑀 is the set of similarity matrices.

5. Empirical Evaluation
This section contains the empirical analysis of the proposed methodology. We adopt Mean Reciprocal Rank (MRR) [29] to evaluate our matching strategies. Furthermore, we also include the proportion of exact matches (precision at 1, P@1), using two distinct tie-breaking policies: “first”, which considers the first match in alphabetical order, and “any”, which measures whether, among the ties, one of them is the correct match. To understand the statistical significance of the differences between the matching strategies' performances, we conduct an ANOVA test.

5.1. Schema-Based Matching Evaluation
The performances of the schema-based matching strategies are reported in Table 1. Concerning the MRR, the best performing strategy is the 3-gram algorithm with the “Field / Class and Attribute” (F, CA) input. This configuration achieves the best MRR score (33.6%) among the set of tested matching strategies, but it is statistically different from only 9 other configurations. Table 1 illustrates how the normalized versions of the algorithms outperform the non-normalized ones. Furthermore, 3-gram and 2-gram usually achieve better MRR scores, but Dice's coefficient has a high number of matches with the “any” policy – it exhibits high recall.

Table 1: Schema-based matchers' performances. † indicates statistical equivalence with the best performing approach.

input | matcher | P@1 first (orig.) | P@1 first (norm.) | P@1 any (orig.) | P@1 any (norm.) | MRR (orig.) | MRR (norm.)
---|---|---|---|---|---|---|---
F, A | CW | .177 | .200 | .400 | .377 | .221† | .259†
F, A | ED | .111 | .088 | .133 | .200 | .166† | .153
F, A | EDP | .177 | — | .377 | — | .245† | —
F, A | NG2 | .111 | .111 | .022 | .288 | .181† | .215†
F, A | NG3 | .133 | .155 | .333 | .355 | .199† | .248†
F, CA | CW | .177 | .200 | .400 | .377 | .221† | .259†
F, CA | ED | .022 | .022 | .088 | .088 | .059 | .048
F, CA | EDP | .022 | — | .022 | — | .280† | —
F, CA | NG2 | .200 | .244 | .266 | .311 | .261† | .291†
F, CA | NG3 | .244 | .288 | .333 | .377 | .302† | .336
GF, A | CW | .133 | .200 | .400 | .377 | .199† | .259†
GF, A | ED | .022 | .022 | .088 | .088 | .047 | .047
GF, A | EDP | .133 | — | .266 | — | .182† | —
GF, A | NG2 | .022 | .044 | .111 | .155 | .075 | .120
GF, A | NG3 | .088 | .111 | .288 | .311 | .160† | .209†
GF, CA | CW | .133 | .200 | .400 | .377 | .199† | .259†
GF, CA | ED | .088 | .088 | .133 | .133 | .138 | .138
GF, CA | EDP | .155 | — | .155 | — | .223† | —
GF, CA | NG2 | .133 | .222 | .177 | .222 | .209† | .274†
GF, CA | NG3 | .244 | .244 | .288 | .288 | .316† | .333†

5.2. Instance-Based Matching Evaluation
The performances of the three instance-based matching strategies are reported in Table 2. Notice that these similarity functions produce real-valued scores that are much more widely spread. This dramatically reduces the chance of ties: thanks to this, we no longer need a tie-breaking policy. BM25 and cosine are statistically equivalent, while the Jaccard similarity is significantly worse than the others. It is interesting to observe that different matching strategies identify different correct matches. This suggests that a composite matching strategy might improve the overall quality of the approach.

Table 2: Instance-based matchers' performances. The symbol † indicates the top tier of systems.

matcher | P@1 | MRR
---|---|---
cos | .178 | .275†
BM25 | .155 | .257†
Jaccard | .022 | .058

5.3. Composite Matching Strategies
When considering composite matching strategies, almost any combination of matchers and representations is a suitable candidate – therefore, in the following analysis we report the results for the best performing or most interesting composite matching techniques, selected also with the help of domain experts. Among the composition strategies that we considered, the first approach consists in using simple matching strategies and combining them using traditional rank fusion techniques. Based on Tables 1 and 2, we select the following matchers as fusion candidates: normalized 2-gram and 3-gram, normalized CW, EDP, BM25 and cosine similarity. To represent fields and attributes in the schema-based approaches, for each matcher we use the best performing representation according to Table 1. CombMNZ and CombSUM are then used to aggregate the similarity scores computed by the single matching strategies. Furthermore, we experiment with the WA fusion strategy. In particular, we use and test the following similarity functions:
• NG2(F, A): normalized 2-gram similarity between the field and attribute names.
• NG2(GF, C): normalized 2-gram similarity between the concatenation of group and field names and the article class name.
• NG3(F, CA): normalized 3-gram similarity between the field name and the concatenation of class and attribute names.
• cos(𝑓, 𝑎): cosine similarity between the document representations of field 𝑓 and article attribute 𝑎.
• BM25(𝑓, 𝑎): BM25 similarity between the document representations of field 𝑓 and article attribute 𝑎.
Among all the different combinations we experimented with, the above-mentioned matchers have been selected due to their diversity. To find the best weight combination, we perform a grid search through a subset of parameters. Each weight can assume values between 0 and 1 with a step of 0.2, resulting in 16807 dispositions with repetition. The resulting formula for the weighted average approach is:

WA(𝑓, 𝑎) = (𝛼 NG2(F, A) + 𝛽 NG2(GF, C) + 𝛾 NG3(F, CA) + 𝛿 cos(𝑓, 𝑎) + 𝜁 BM25(𝑓, 𝑎)) / (𝛼 + 𝛽 + 𝛾 + 𝛿 + 𝜁)   (2)

The empirical analysis, carried out following a 5-fold cross-validation strategy, shows that the top performing set of parameters is: 𝛼 = 0.2, 𝛽 = 1, 𝛾 = 1, 𝛿 = 0.6, 𝜁 = 0.8.¹ Table 3 illustrates the results of the composite approach using the CombSUM, CombMNZ, and Weighted Average strategies. It is possible to observe that WA is the best performing strategy, achieving 46.7% correct matches and an MRR score of 51.9%. We observe a general improvement of performance with respect to the single matching strategies (see Tables 1 and 2). CombMNZ aggregation provides better results than CombSUM.

Table 3: Matcher combination performances.

fusion | matchers | P@1 | MRR
---|---|---|---
CombMNZ | BM25+cos+NG2+EDP | .288 | .379†
CombMNZ | BM25+cos | .200 | .288
CombMNZ | BM25+NG2 | .222 | .342†
CombMNZ | cos+NG2 | .266 | .366
CombMNZ | EDP+NG2 | .244 | .291
CombMNZ | cos+EDP | .244 | .357†
CombMNZ | cos+CW | .177 | .314
CombMNZ | NG2+NG3 | .244 | .304†
CombMNZ | BM25+cos+NG2+NG3+CW+EDP | .311 | .414†
CombSUM | BM25+cos+NG2+EDP | .288 | .375†
CombSUM | BM25+cos | .200 | .288
CombSUM | BM25+NG2 | .222 | .338†
CombSUM | cos+NG2 | .266 | .365
CombSUM | EDP+NG2 | .244 | .291
CombSUM | cos+EDP | .244 | .357†
CombSUM | cos+CW | .177 | .314
CombSUM | NG2+NG3 | .244 | .304†
CombSUM | BM25+cos+NG2+NG3+CW+EDP | .311 | .409†
WA | cos+BM25+NG2(F,A)+NG2(GF,C)+NG3(F,CA) | .467 | .519†
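To make the weighted-average fusion of Equation (2) and the grid search over its weights concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions (a plain count of exact matches against the manual mapping instead of the cross-validated MRR/P@1 selection used in the paper; matcher, field and attribute names supplied by the caller), not the authors' actual implementation.

```python
from itertools import product

def weighted_average(sim_scores, weights):
    """Eq. (2): normalized weighted sum of the similarity scores of the single matchers."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[m] * sim_scores[m] for m in weights) / total

def grid_search(matchers, fields, attributes, gold):
    """Try every weight combination on a step-0.2 grid in [0, 1] and keep the one
    that yields the most correct field-to-attribute assignments."""
    grid = [i / 5 for i in range(6)]                 # 0.0, 0.2, ..., 1.0
    best_weights, best_correct = None, -1
    for combo in product(grid, repeat=len(matchers)):
        weights = dict(zip(matchers, combo))
        correct = 0
        for f in fields:
            # score every candidate attribute with the weighted-average fusion
            scores = {a: weighted_average({m: fn(f, a) for m, fn in matchers.items()}, weights)
                      for a in attributes}
            if max(scores, key=scores.get) == gold[f]:
                correct += 1
        if correct > best_correct:
            best_weights, best_correct = weights, correct
    return best_weights, best_correct
```

In the paper the weights are selected with 5-fold cross-validation; the sketch reduces the selection criterion to a simple count of exact matches for brevity.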
Finally, we observe that, as a general trend, if we include more matching strategies, the quality of the aggregation improves.

6. Conclusion
This work stems from the need to exploit historical production data to obtain a mediated schema of the machines built by a company throughout several years. We interpreted the problem as a data integration task that required us to join multiple semi-structured BoMs into a unified database over the global schema. We therefore focused on lexical schema- and instance-based matching strategies. Among the schema-based matching strategies, we considered matchers that exploit the lexical similarity of the element names. For instance-based matching, we used document similarity measures. Finally, we investigated the combination of the developed matchers, which showed a performance increase with respect to the single matching strategies: the number of correct matches increases if we combine multiple matching approaches. Results are satisfactory and considered promising by the Bedeschi company. Our study highlighted challenges and issues emerging when approaching this task and indicated promising research paths. Future work will analyze the performance of our proposed solutions on different datasets, as well as the use of additional similarity measures.

¹ Due to space reasons, we omit the full analysis of the parametrization.

References
1. F. Azzalini, S. Jin, M. Renzi, L. Tanca, Blocking techniques for entity linkage: A semantics-based approach, Data Science and Engineering 6 (2021) 20–38.
2. A. Doan, A. Halevy, Z. Ives, Principles of data integration, Elsevier, 2012.
3. H.-H. Do, E. Rahm, COMA – a system for flexible combination of schema matching approaches, in: Proceedings of the VLDB, Elsevier, 2002, pp. 610–621.
4. R. Fagin, L. M. Haas, M. Hernández, R. J. Miller, L. Popa, Y. Velegrakis, Clio: Schema mapping creation and data exchange, in: Conceptual modeling: foundations and applications, Springer, 2009, pp. 198–236.
5. L. Seligman, P. Mork, A. Halevy, K. Smith, M. J. Carey, K. Chen, C. Wolf, J. Madhavan, A. Kannan, D. Burdick, OpenII: an open source information integration toolkit, in: Proceedings of the 2010 ACM SIGMOD, 2010, pp. 1057–1060.
6. P. A. Bernstein, J. Madhavan, E. Rahm, Generic schema matching, ten years later, Proceedings of the VLDB Endowment 4 (2011) 695–701.
7. S. M. Falconer, N. F. Noy, Interactive techniques to support ontology matching, in: Schema Matching and Mapping, Springer, 2011, pp. 29–51.
8. D. Engmann, S. Massmann, Instance matching with COMA++, in: BTW Workshops, volume 7, 2007, pp. 28–37.
9. S. Massmann, S. Raunich, D. Aumüller, P. Arnold, E. Rahm, Evolution of the COMA match system, Ontology Matching 49 (2011) 49–60.
10. J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB, volume 1, Citeseer, 2001, pp. 49–58.
11. E. Rahm, Towards large-scale schema and ontology matching, in: Schema matching and mapping, Springer, 2011, pp. 3–27.
12. J. Madhavan, P. A. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: 21st International Conference on Data Engineering (ICDE'05), IEEE, 2005, pp. 57–68.
13. A. Bilke, F. Naumann, Schema matching using duplicates, in: 21st International Conference on Data Engineering (ICDE'05), IEEE, 2005, pp. 69–80.
14. Z. Bellahsene, A. Bonifati, F. Duchateau, Y. Velegrakis, On evaluating schema matching and mapping, in: Schema matching and mapping, Springer, 2011, pp. 253–291.
15. Z. Abedjan, L. Golab, F. Naumann, T. Papenbrock, Data profiling, Synthesis Lectures on Data Management 10 (2018) 1–154.
16. A. Maydanchik, Data quality assessment, Technics Publications, 2007.
17. S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, J. Heer, Profiler: Integrated statistical analysis and visualization for data quality assessment, in: Proceedings of AVI, 2012.
18. Z. Abedjan, L. Golab, F. Naumann, Profiling relational data: a survey, The VLDB Journal 24 (2015) 557–581.
19. E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (2001) 334–350.
20. R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos, iMAP: Discovering complex semantic matches between database schemas, in: Proceedings of the ACM SIGMOD, 2004, pp. 383–394.
21. P. Mork, L. Seligman, A. Rosenthal, J. Korb, C. Wolf, The Harmony integration workbench, in: Journal on Data Semantics XI, Springer, 2008, pp. 65–93.
22. F. Giunchiglia, M. Yatskevich, Element level semantic matching (2004).
23. W. H. Gomaa, A. A. Fahmy, et al., A survey of text similarity approaches, International Journal of Computer Applications 68 (2013) 13–18.
24. G. Kondrak, N-gram similarity and distance, in: International Symposium on String Processing and Information Retrieval, Springer, 2005, pp. 115–126.
25. A. Huang, et al., Similarity measures for text document clustering, in: Proceedings of NZCSRSC, volume 4, 2008, pp. 9–56.
26. R. Ferdous, et al., An efficient k-means algorithm integrated with Jaccard distance measure for document clustering, in: Proceedings of the AH-ICI, IEEE, 2009, pp. 1–6.
27. S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Now Publishers Inc, 2009.
28. J. H. Lee, Analyses of multiple evidence combination, in: Proc. SIGIR, 1997, pp. 267–276.
29. E. M. Voorhees, et al., The TREC-8 question answering track report, in: TREC, 1999, pp. 77–82.
Extracting a Unified Database from Heterogeneous Bills of Materials via Data Integration Approaches Luigi De Lucia1,2 , Guglielmo Faggioli1 , Nicola Ferro1 and Fabio Orobelli2 1 University of Padova, Padova, Italy 2 Bedeschi S.p.A., Limena, Italy Abstract Since the beginning of the information era, manufacturing companies have collected a huge amount of data and acknowledged the importance of proper collection and curation policies. Nevertheless, an open problem is the exploitation of historical and legacy data. Such data is heterogeneous: it has often been accumulated throughout decades by different curators with different degrees of attention toward the data. In this work, we focus on the historical data collected by Bedeschi S.p.A., an international company dealing with the design and production of turn-key solutions for Bulk Material Handling, Container Logistics and Bricks. Our task is to extract knowledge, available under the form of partially structured Bills of Materials (BoMs), to distil a unified database over a mediated schema that can be used by designers and sales department to rapidly have a global overview of the products developed over the years. We present a feasibility study investigating the performance of several traditional approaches of data integration applied to our specific task. In particular, we treat BoMs as sources that need to be aggregated in the unified mediated schema. We achieved satisfactory results indicating the direction that the company should follow to properly extract knowledge from their historical data. 1. Introduction Data is arguably becoming day by day one of the most valuable assets for an increasing number of companies; large companies have digitalized several production processes, making it easy to collect, curate and exploit their data. This is not true for what concerns historical and legacy data, which is often underlooked and poorly used. Three main challenges impair the handling of historical data. First, legacy data is fragmented and scattered over different sources, formats and media, making it challenging to organize it as unitary knowledge. Secondly, historical data is unstructured and weakly linked to the other information, it also contains dirty values and information expressed via natural language [1]. Finally, historical data is characterized by high degree of heterogeneity: several annotators collected it on digital archives over decades, with underlying production processes and paradigms changing. These challenges make legacy and historical datasets unexploitable automatically. Our work stems from a specific industrial need: exploiting historical data to help the design and sales departments by providing them with a unifying overview of all the products built over the years. We exploit historical data collected by Bedeschi S.p.A, an international company specialized in the production of turn-key solutions for Bulk Material Handling, Container Logistics and Bricks. We aim at extracting information from a set of heterogeneous semi-structured Bills of Materials (BoMs) to organize SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 
CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) �it in a uniform database over a global schema providing an overview of the different models of Bedeschi machines developed over decades. This work is a feasibility study to determine whether it is possible to process these BoMs automatically. In particular, to determine the most promising approaches, we focus on a single Bedeschi machine type, the BEL-C, in production at Bedeschi S.p.A. for decades. Furthermore, machines of this type have rich BoMs that allow us to recognize emerging challenges and critical issues. Moreover, different designers and curators have produced, collected, and organized the data about this type of machine over the years using different media, starting from paper sheets – now digitalized – to modern information access systems.. We interpret our task under the data integration framework. In particular, we consider BoMs associated with different machines as our data sources. These data sources are then integrated to obtain a global schema. Our contributions are the following: • We formalize the unified database extraction task from an industrial perspective as a data integration task; • We identify a set of techniques drawn from the schema matching and instance matching domains to address the task; • We empirically evaluate the performance of such techniques. The remainder is organized as follows: Section 2 describes the main efforts in the Data Integration domain. Section 3 contains the description of the task at hand and the data available. Section 4 details on our methodology, while Section 5 contains its evaluation. Finally, in Section 6 we draw same conclusions and outline future steps. 2. Related Work Data integration aims to provide unified access to data that usually resides on multiple, au- tonomous data sources, like different databases or data retrieved by internet websites [2]. One of the major challenges to data integration is data ambiguity. This ambiguity arises whenever there is some uncertainty about the meant data information. Among data integration systems developed, some handle the whole data integration process, such as Do and Rahm [3], Fagin et al. [4], Seligman et al. [5], while others, e.g., Bernstein et al. [6], Falconer and Noy [7], focus only on the schema mapping task. Among the most prominent endeavours, we list the following. COMA [3, 8, 9] automatically generates matchings between source and target schemas, by exploiting multiple matching approaches to compute the matching. Clio, developed by Fagin et al. [4], supports a visual matching representation, and is capable of working with schemas without knowing the mediated schema. Cupid [10] is a schema matcher algorithm that exploits similarities of different aspects of the schema elements: keys, referential constraints, and views. Several different endeavours have been devoted to the schema matching task. Rahm [11] presents an overview of the possible approaches based on different types of matching to detect the correspondences between the schema elements. When the instances are available, it is possible to state that two schema elements are similar if they have close instances Bernstein et al. [6], Madhavan et al. [12]. � (a) BEL-C (b) Bill of Materials (BoM) Figure 1: A BEL-C machine and – a part of – its BoM Additionally, different works exploit the information provided by instance values. 
COMA++, an extension of COMA proposed by Engmann and Massmann [8], considers the average length of instance characters, average values and standard deviations. A second approach has been proposed by Bilke and Naumann [13] and exploits duplicate instances to find matches. Despite the popularity of the task, there is no universal benchmark to evaluate schema matching approaches [14]. Among the ancillary techniques needed to achieve data integration, we can list data profiling, reverse database engineering, and matching approaches combination. Data profiling is the set of processes and actions performed to retrieve specific information, also called metadata, about a given dataset [15]. Generally, profiling tasks support manual data exploration and improve the data gazing [16] process. Data profiling further helps data integration by revealing inconsisten- cies and errors in the data errors [17]. Reverse database engineering allows identifying different characteristics of the database entities like relations, attributes, domain semantics, foreign keys and cardinalities that can be introduced into the data integration process via explicit rules [18]. Combining different matching approaches allows detecting a higher number of correct matches with respect using a single matcher [19]. We can see some approaches in [12, 10, 20]. Hybrid and composite matchings are two different ways of combining matching approaches. The former uses multiple matching criteria together to compute the score. The latter, which we employ in our work, computes the matching scores independently and then combine them. 3. The Task and Dataset 3.1. Task In the design process of a new product, it is common practice to look at the design of previous ones to retrieve reference specifications and understand the design choices. To this end, it is necessary to have an overview of products realized previously. In this work, we focus on a specific product: a machine type, called BEL-C, which has been in production for the Bedeschi company for decades and has multiple versions. Figure 1a depicts an example of BEL-C machine: this machine includes several “parts”, called assemblies, e.g., the mechanical arm on the right �side is an assembly composed of multiple articles such as wheels and buckets. The description of each realization of a BEL-C machine is available under the form of Bill of Materials (BoM). A BoM is the list of the components, also called articles, needed to manufacture a certain product. Multiple articles can be further aggregated into assemblies. Thus, the elements in a BoM are organized hierarchically as a tree. The root node represents the final product, while internal and leaves nodes represent respectively assemblies and articles. Figure 1b illustrates the apical area of the BoM for the machine on Figure 1a. BoMs referring to different products of the same type, as in our case, might be highly heterogeneous. In fact, they might have been realized decades apart, can be the digitalization of old analogical production data and might include heterogeneous descriptions. This results in data ambiguities that need to be resolved. We aim to realize a tool that resumes all the relevant information of the different BEL-C machines produced during the years. More in detail, we have a manually compiled example of a database constructed over a mediated schema and we want to learn how to extract a similar database automatically from the BoMs. Such example unified database is based on recent realizations of the BEL-C. 
Therefore, mapping historical BoMs into the database over the mediated schema presents two main challenges: i) example data is expressed in natural language and it is not possible to find the very same attribute values among our data; ii) modern BoMs used to construct the example of unified database are not among the historical ones and thus are not in our dataset. Therefore, we cannot use traditional entity resolution and linkage approaches. Even though they share similar structure, the heterogeneity between different BoMs makes them closer to different data sources rather than to different instances of the same database. In this sense, the schema underlying the unified database that combines the information on different BoMs can be seen as a mediated schema. Therefore, We model our task in a data integration framework as follows: given a set of BoMs, the description of the articles they contain and an example of unified database over the mediated schema, we have to: 1) understand how BoMs are mapped into the mediated schema; 2) construct automatically the unified database for previously unseen BoMs. 3.2. Dataset To address our task, we have a set of BoMs, a database of the articles, and a manually built example of the required final mediated schema. BoMs Different BEL-C machines might have diverse specifications, i.e., the same component can have different dimensions or distinct motors can be installed. Each BoM, identified by a unique serial number, is an instance representing a specific realization of a BEL-C machine with its characteristics. The BoM contains the articles’ ID used in building the machine. Inside each BoM articles might be organized into assemblies. Each assembly has a unique identifier and a description. The assembly description allows understanding what part of the machine the tree portion is referring to. In our historical data, we have BoMs associated with 16 distinct machines. BoMs contain from 6 to 375 articles. Articles The article database contains the description of the characteristics of articles listed in BoMs. Articles are organized into classes – e.g., chains, engines, etc – uniquely identified � BoM Engine ... ... Chain (1, N) (0, N) includes Article (1, N) (0, N) (0, N) (0, N) Assembly is_composed (0, N) contains Figure 2: The ideal Entity-Relationship Schema of our data. Each BoM includes multiple assemblies and/or single articles. An assembly can be composed either by articles or by other assemblies. Articles are divided into several different classes. by a mnemonic code. Articles belonging to different classes have distinct sets of attributes. The domains of such attributes are not formally described: in most cases, attributes values are expressed via natural language strings even when only typed (e.g., numeric or Boolean) values are expected. The only attribute common to all classes is “description”: a free text field containing the description of the article and its usage. Several articles have multiple null values – such values are not really empty: they have not been correctly reported or lost while moving from paper to digital archives. In this sense, the “description” attribute is often the only one (ab-)used, containing also the values for all the other attributes. Figure 2 reports the ideal Entity–Relationship (ER) schema for our dataset, BoMs are composed of assemblies or articles and assemblies include either other assemblies or articles. 
Example Database over the Global Schema To understand which are the relevant at- tributes and their domain, domain experts provided us an example of the database constructed over the global schema that we would like to distil from the BoMs. This database was constructed manually and often the values for its fields are not in the domain of those in the articles dataset – e.g., different units of measure, strings instead of numbers. Furthermore, it is based on recent BEL-C machines that are not among historical BoMs. The main challenge is understanding how to map articles’ attributes to the global schema fields comparing their domains and relations. The example database organizes the data as follows: each column contains a different machine realization – 20 of them – while each row represents a machine specification. Our dataset includes 47 unique fields for the machine specification. Rows are divided into groups – e.g. technical features, chain – that provide semantic meaning to the fields they include. Manual mapping As aforementioned, historical BoMs and those available in the example of unified database do not overlap. Furthermore, values for fields in the example unified database are manually set and not available in the domain of the attributes in the article’s data set. Therefore, we cannot determine whether the automatic mapping produced using data integration techniques is correct. Thus, with the help of domain experts, we built a manual mapping between global schema fields and articles’ attributes. Such mapping allows us to define a query that provides the mediated schema immediately. The construction of such �manual mapping took several hours and has two main drawbacks: it does not work on different machines besides the BEL-C, nor it works if new attributes are taken into account. Therefore, we use the manual mapping to evaluate different automatic integration approaches, so that they can be used on previously unseen scenarios for which we do not have time or resources to construct the manual mapping. 4. Methodology Our objective is to understand the mapping between fields of the target schema and article attributes. Given ℱ the set of fields of the mediated schema and 𝒜 the set of articles’ attribute, for each 𝑓 ∈ ℱ our objective is to select the articles attribute 𝑎 ∈ 𝒜 that maps into 𝑓 . To do so, we compute a similarity measure 𝑠(𝑓, 𝑎) which describe how similar are 𝑓 and 𝑎. Then the value of the field 𝑓 is extracted from the attribute 𝑎 such that: argmax 𝑠(𝑓, 𝑎) 𝑎∈𝒜 Our task becomes identifying the best similarity function 𝑠 to describe the relation between a field 𝑓 and an article attribute 𝑎. To do so, we model three aspects: the similarity function 𝑠, the representation for the field 𝑓 and the representation for the article attribute 𝑎. Depending on the representation used for 𝑓 and 𝑎, there are two main categories of matching approaches: schema- based and instance-based matching approaches. The former relies on schema information about both field and article, such as their names, group or class. Instance-based matching approaches represent articles and fields using their instances. A third class of approaches is the one based on composite matchings, where the outputs of multiple matchers are combined. Finally, using the manual mapping developed with the contribution of two domain experts, we can evaluate the performance of each similarity function or mapping strategy. 4.1. 
Data Normalization Following common practices in the schema matching domain [10, 21], we start by normalizing the entities in our dataset. The normalization pipeline includes converting characters to lower- case, removing trailing spaces, removing non-informative tokens (e.g., single characters and stopwords). 4.2. Schema-Based Matching Using the schema information, such as labels, data types, and structural properties is a widely used approach [3, 10, 12, 22, 23]. In this scenario, we can exploit elements from the schema of the unified database, the mediated schema, and the schema of the articles database. For the mediated schema fields 𝑓 we consider the following representations: • Field name (F): names of the attributes of the target schema. • Group name (G): name of group the field is in. � • Concatenation between the field and the group it belongs to (GF). Concerning the articles attribute 𝑎, on the other hand, we can represent them as follows: • Article attribute name (A): name of the article attribute in the articles database. • Article class (C): Name of the article class. Being a mnemonic code, it contains relevant semantic information. • Concatenation between the article class and attribute name (CA). Once we have defined the representation for the fields and articles, we need to define the similarity 𝑠. In particular, we employ the following matching approaches: • Dice’s coefficient (CW): size of the intersection between the terms of two strings. • Edit Distance (ED): edit operations required to transform one string into the other. • Pairwise Edit Distance (EDP): if two strings contain identical words in diverse order, they have a high edit distance. EDP computes the edit distance between each possible pair of tokens from the two strings, returning the average distance of all the combinations. • N-Grams (NG𝑛 ): number of common n-grams between two strings [24]. Besides the pairwise edit distance, normalized by default, all the above-mentioned matching approaches are employed in both their original and normalized forms. 4.3. Instance-Based Matching In the case of instance-based matching, we represent the field 𝑓 and the article attribute 𝑎 using the values that they assume for the different instances. In particular, given a field 𝑓 , we can concatenate the values of that field for each BoM in the example global schema and obtain a document 𝑑𝑓 . Similarly, taking all the values for the attribute 𝑎 in the articles database, we concatenate them and define a document 𝑑𝑎 , that represents the article. Given the size of 𝑑𝑓 and 𝑑𝑎 , the similarities used for the schema-based scenario are computationally too demanding. Therefore, we propose to use documents’ similarity measures, such as those typically used in Information Retrieval (IR). Given a document field 𝑑𝑓 we rank all the attribute documents 𝑑𝑎 , based on their similarity, and select the attribute that maximizes the similarity with the document field. Among the similarity measures 𝑠, we investigate the following: Jaccard’s coefficient [25, 26], cosine similarity [25] and BM25 [27]. 4.4. Composite Matchings To further improve matching performance, we also investigate matching strategies fusion. In particular, similarly to what typically done in IR, for each field, we rank attributes based on their similarity with the field. Then, we fuse different rankings using CombSUM and CombMNZ [28]. 
�We also propose an approach, dubbed weighted average (WA), that weights the similarity computed by different matching approaches as follows: 𝛼𝑠1 (𝑓, 𝑎) + 𝛽𝑠2 (𝑓, 𝑎) + ... + 𝛾𝑠𝑀 (𝑓, 𝑎) WA(𝑓, 𝑎) = (1) 𝛼 + 𝛽 + ... + 𝛾 where 𝑠1 , 𝑠2 , ..., 𝑠𝑀 is the set of similarity matrices. 5. Empirical Evaluation The subsequent section contains the empirical analysis of the proposed methodology. We adopt Mean Reciprocal Rank (MRR) [29] to evaluate our matching strategies. Furthermore, we also include the proportion of exact matches (precision at 1), using two distinct tie-breaking policies: “first” which considers the first match in alphabetical order and “any”, which measures whether, among the ties, one of them is the correct match. To understand the statistical relevance of the different matching strategies performances, we conduct an ANOVA test. 5.1. Schema-Based Matching Evaluation The performances of schema-based matching strategies are available in Table 1. Concerning the MRR, the best performing strategy is 3-gram algorithm with “Field/Class and Attribute” input. This configuration achieves the best MRR score 33.6% among the set of tested matching strategies, but it’s statistically different only from 9 other configurations. Table 1 illustrates how the normalized version of the algorithms outperform the not-normalized ones. Furthermore, 3-gram and 2-gram usually achieve better MRR scores, but Dice’s coefficient has high number of matches with the "Any" policy – it exhibits high recall. 5.2. Instance-Based Matching Evaluation The performances of the three instance-based matching strategies are reported in Table 2. Notice that, the new similarity functions are based on real numbers much more widely spread. This dramatically reduces the chance of ties: thanks to this, we do not need anymore a tie-breaking policy. BM25 and cosine are statistically significantly equivalent, while Jaccard similarity is significantly worse than the others. It is interesting to observe that different matching strategies identify different correct matches. This suggests that a composite matching strategy might improve the overall quality of the approach. 5.3. Composite Matching Strategies When considering the composite matching strategies, almost any combination of matchers and representations is a suitable candidate - therefore, in the following analysis we report the results for the best performing or most interesting composite matching techniques, selected also with the help of experts of the domain. Among the composition strategies that we considered, the first approach consists in using simple matching strategies and combine them using traditional rank fusion techniques. Based �Table 1 Schema based matchers performances. † indicates statistical equivalence with the best performing approach. P@1 - first P@1 - any MRR input matcher orig. norm. orig. norm. orig. norm. 
† CW .177 .200 .400 .377 .221 .259† ED .111 .088 .133 .200 .166† .153 F, A EDP .177 — .377 — .245† — NG2 .111 .111 .022 .288 .181† .215† NG3 .133 .155 .333 .355 .199† .248† CW .177 .200 .400 .377 .221† .259† ED .022 .022 .088 .088 .059 .048 F, CA EDP .022 — .022 — .280† — NG2 .200 .244 .266 .311 .261† .291† NG3 .244 .288 .333 .377 .302† .336 CW .133 .200 .400 .377 .199† .259† ED .022 .022 .088 .088 .047 .047 GF, A EDP .133 — .266 — .182† — NG2 .022 .044 .111 .155 .075 .120 NG3 .088 .111 .288 .311 .160† .209† CW .133 .200 .400 .377 .199† .259† ED .088 .088 .133 .133 .138 .138 GF, CA EDP .155 — .155 — .223† — NG2 .133 .222 .177 .222 .209† .274† NG3 .244 .244 .288 .288 .316† .333† Table 2 Instance based matchers performances. The symbol † indicates the top tier of systems. matcher P@1 MRR cos .178 .275† BM25 .155 .257† Jaccard .022 .058 on Tables 1 and 2, we select the following matchers as fusion candidate: 2-gram and 3-gram normalized, CW normalized, EDP, BM25 and Cosine similarity. To represent fields and attributes in the schema-based approaches, for each matcher, we use the best performing representation according to Table 1. CombMNZ and CombSUM are then used to aggregate the similarity scores computed by the single matching strategies. Furthermore, we experiment with the WA fusion strategy. In particular we use the following similarity function and test: • NG2 (F,A): normalized 2-gram similarity between field and attribute names. • NG2 (GF,C): normalized 2-gram similarity between field name and group name concate- �Table 3 Matcher combination performances. Fusion matchers P@1 MRR BM25+cos+NG2 +EDP .288 .379† BM25+cos .200 .288 BM25+NG2 .222 .342† cos+NG2 .266 .366 CombMNZ EDP+NG2 .244 .291 cos+EDP .244 .357† cos+CW .177 .314 NG2 +NG3 .244 .304† BM25+cos+NG2 +NG3 +CW+EDP .311 .414† BM25+cos+NG2 +EDP .288 .375† BM25+cos .200 .288 BM25+NG2 .222 .338† cos+NG2 .266 .365 CombSUM EDP+NG2 .244 .291 cos+EDP .244 .357† cos+CW .177 .314 NG2 +NG3 .244 .304† BM25+cos+NG2 +NG3 +CW+EDP .311 .409† WA cos+BM25+NG2 (F,A)+NG2 (GF,C)+NG3 (F,CA) .467 .519† nation and attribute name. • NG3 (F, CA): normalized 3-gram similarity between field name and attribute and class names concatenation. • cos(𝑓, 𝑎): cosine similarity between document representation for field 𝑓 and article’s attribute 𝑎. • BM25(𝑓, 𝑎): BM25 similarity between document representation for field 𝑓 and article’s attribute 𝑎. Among all the different combinations we experiment with, the above-mentioned matchers have been selected due to their diversity. To find the best weight combination, we perform a grid search through a subset of parameters. Each weight can assume the values between 0 and 1 with a pace of 0.2, resulting in 16807 dispositions with repetition. The resulting formula for the weighted average approach is: WA(𝑓, 𝑎) = 𝛼 NG2 (𝐹,𝐴)+𝛽 NG2 (𝐺𝐹,𝐶)+𝛾 NG3 (𝐹,𝐶𝐴)+𝛿 cos(𝑓,𝑎)+𝜁 BM25(𝑓,𝑎) 𝛼+𝛽+𝛾+𝛿+𝜁 (2) �The empirical analysis, carried out following a 5-fold cross-validation strategy, shows that the top performing set of parameters are: 𝛼 = 0.2, 𝛽 = 1,𝛾 = 1, 𝛿 = 0.6, 𝜁 = 0.81 . Table 3 illustrates the results of the composite approach using the CombSUM, CombMNZ, and Weighted Average strategies. It is possible to observe that WA is the best performing strategy, achieving 46.7% correct matches and an MRR score of 51.9%. We observe a general improvement of performances with respect to single matching strategies – Tables 1 and 2). CombMNZ aggregation provides better results than CombSUM. 
6. Conclusion

This work stems from the need to exploit historical production data to obtain a mediated schema of the machines built by a company over several years. We interpreted the problem as a data integration task that required us to join multiple semi-structured BoMs into a unified database over the global schema. We therefore focused on lexical schema- and instance-based matching strategies. Among the schema-based strategies, we considered matchers that exploit the lexical similarity of element names; for instance-based matching, we used document similarity measures. Finally, we investigated the combination of the developed matchers, which showed a performance increase with respect to the single matching strategies: the number of correct matches increases when multiple matching approaches are combined. The results are satisfactory and considered promising by Bedeschi S.p.A. Our study highlighted the challenges and issues that emerge when approaching this task and indicated promising research paths. Future work will analyze the performance of our proposed solutions on different datasets, as well as the use of additional similarity measures.

References

[1] F. Azzalini, S. Jin, M. Renzi, L. Tanca, Blocking techniques for entity linkage: A semantics-based approach, Data Science and Engineering 6 (2021) 20–38.
[2] A. Doan, A. Halevy, Z. Ives, Principles of Data Integration, Elsevier, 2012.
[3] H.-H. Do, E. Rahm, COMA: a system for flexible combination of schema matching approaches, in: Proceedings of the VLDB, Elsevier, 2002, pp. 610–621.
[4] R. Fagin, L. M. Haas, M. Hernández, R. J. Miller, L. Popa, Y. Velegrakis, Clio: Schema mapping creation and data exchange, in: Conceptual Modeling: Foundations and Applications, Springer, 2009, pp. 198–236.
[5] L. Seligman, P. Mork, A. Halevy, K. Smith, M. J. Carey, K. Chen, C. Wolf, J. Madhavan, A. Kannan, D. Burdick, OpenII: an open source information integration toolkit, in: Proceedings of the 2010 ACM SIGMOD, 2010, pp. 1057–1060.
[6] P. A. Bernstein, J. Madhavan, E. Rahm, Generic schema matching, ten years later, Proceedings of the VLDB Endowment 4 (2011) 695–701.
[7] S. M. Falconer, N. F. Noy, Interactive techniques to support ontology matching, in: Schema Matching and Mapping, Springer, 2011, pp. 29–51.
[8] D. Engmann, S. Massmann, Instance matching with COMA++, in: BTW Workshops, volume 7, 2007, pp. 28–37.
[9] S. Massmann, S. Raunich, D. Aumüller, P. Arnold, E. Rahm, Evolution of the COMA match system, Ontology Matching 49 (2011) 49–60.
[10] J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB, volume 1, Citeseer, 2001, pp. 49–58.
[11] E. Rahm, Towards large-scale schema and ontology matching, in: Schema Matching and Mapping, Springer, 2011, pp. 3–27.
[12] J. Madhavan, P. A. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: 21st International Conference on Data Engineering (ICDE'05), IEEE, 2005, pp. 57–68.
[13] A. Bilke, F. Naumann, Schema matching using duplicates, in: 21st International Conference on Data Engineering (ICDE'05), IEEE, 2005, pp. 69–80.
[14] Z. Bellahsene, A. Bonifati, F. Duchateau, Y. Velegrakis, On evaluating schema matching and mapping, in: Schema Matching and Mapping, Springer, 2011, pp. 253–291.
[15] Z. Abedjan, L. Golab, F. Naumann, T. Papenbrock, Data Profiling, Synthesis Lectures on Data Management 10 (2018) 1–154.
[16] A. Maydanchik, Data Quality Assessment, Technics Publications, 2007.
[17] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, J. Heer, Profiler: Integrated statistical analysis and visualization for data quality assessment, in: Proceedings of AVI, 2012.
[18] Z. Abedjan, L. Golab, F. Naumann, Profiling relational data: a survey, The VLDB Journal 24 (2015) 557–581.
[19] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (2001) 334–350.
[20] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos, iMAP: Discovering complex semantic matches between database schemas, in: Proceedings of the ACM SIGMOD, 2004, pp. 383–394.
[21] P. Mork, L. Seligman, A. Rosenthal, J. Korb, C. Wolf, The Harmony integration workbench, in: Journal on Data Semantics XI, Springer, 2008, pp. 65–93.
[22] F. Giunchiglia, M. Yatskevich, Element level semantic matching (2004).
[23] W. H. Gomaa, A. A. Fahmy, et al., A survey of text similarity approaches, International Journal of Computer Applications 68 (2013) 13–18.
[24] G. Kondrak, N-gram similarity and distance, in: International Symposium on String Processing and Information Retrieval, Springer, 2005, pp. 115–126.
[25] A. Huang, et al., Similarity measures for text document clustering, in: Proceedings of NZCSRSC, volume 4, 2008, pp. 9–56.
[26] R. Ferdous, et al., An efficient k-means algorithm integrated with Jaccard distance measure for document clustering, in: Proceedings of the AH-ICI, IEEE, 2009, pp. 1–6.
[27] S. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Now Publishers Inc, 2009.
[28] J. H. Lee, Analyses of multiple evidence combination, in: Proceedings of SIGIR, 1997, pp. 267–276.
[29] E. M. Voorhees, et al., The TREC-8 question answering track report, in: TREC, 1999, pp. 77–82.