Vol-3194/paper17
Jump to navigation
Jump to search
Paper
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper17 |
wikidataid | →Q117344914 |
title | Imputation of Missing Values through Profiling Metadata |
pdfUrl | https://ceur-ws.org/Vol-3194/paper17.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/BreveCDP22 |
volume | Vol-3194→Vol-3194 |
session | → |
Imputation of Missing Values through Profiling Metadata
Imputation of Missing Values through Profiling Metadata Bernardo Breve, Loredana Caruccio, Vincenzo Deufemia and Giuseppe Polese University of Salerno, via Giovanni Paolo II, 132, Fisciano (SA), 84084, Italy Abstract Among the several problems related to the management of database instances, missing values represents a crucial factor that could severely compromise the integrity and the meaningfulness of such data representations. Thus, the data imputation research field focuses its efforts on solutions for filling missing values by means of plausible candidates, while still preserving the overall semantic integrity the database instance is characterized by. To keep imputation times low while still keeping high accuracy, the employment of metadata has made its way through research proposals. This discussion paper presents our effort in the definition of RENUVER, a novel data imputation algorithm relying on Relaxed Functional Dependencies (rfds) for identifying value candidates best guaranteeing the semantic integrity of data. Experimental results on real-world datasets highlighted the effectiveness of RENUVER in terms of both filling accuracy and imputation times, also compared to other well-known approaches. Keywords Data imputation, Profiling metadata, Relaxed Functional Dependencies, Data quality 1. Introduction With the advent of big data, the presence of missing values inside database instances has been widely recognized as a complex problem to handle, especially for Relational Database Management Systems [1]. Moreover, several application contexts might require the absence of this data quality issue inside their datasets. For instance, machine learning processes could not provide good accuracy scores if trained on data with many missing values. In general, it is not possible to infer reliable knowledge using datasets with incomplete information [2]. The identification of the best values in a dataset to impute the missing ones is an extremely complex task, since it entails the evaluation of all possible combinations in the value distribution. Most of the approaches proposed in the literature focus on maximizing the number of imputed values, overshadowing the accuracy of single imputations. This discussion paper presents the data imputation algorithm proposed in [3], namely RENUVER (Rfd basEd NUll ValuE Repairer), which relies on Relaxed Functional Dependencies (rfds) for imputing missing values within a relational database instance. By adopting the concept of rfds as metadata for supporting the imputation process, we are able to perform a broader analysis of the correlations among SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ bbreve@unisa.it (B. Breve); lcaruccio@unisa.it (L. Caruccio); deufemia@unisa.it (V. Deufemia); gpolese@unisa.it (G. Polese) � 0000-0002-3898-7512 (B. Breve); 0000-0002-2418-1606 (L. Caruccio); 0000-0002-6711-3590 (V. Deufemia); 0000-0002-8496-2658 (G. Polese) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) �attributes, yielding an accurate and somewhat fast solution for the imputation of missing values within relational database instances. In fact, rfds are still widely considered for detecting and repairing many types of errors, such as duplicates, outliers, and constraint violations [4]. Thus, we made use them for identifying suitable candidate values for replacing missing ones in the data imputation process. RENUVER exploits rfds for: i) identifying the candidate tuples useful for the imputation of missing values, ii) ranking candidate tuples based on their similarity with respect to the tuples containing missing values, and iii) evaluating each imputation to guarantee the semantic consistency of the whole dataset. In particular, RENUVER generates candidate tuples and rank them, according to rfds implying the attribute on which a value is missing. Moreover, the imputation strategy of RENUVER does not alter value consistency with respect to the ones in the original dataset. Finally, RENUVER exploits rfds to also judge whether it is possible to impute a missing value, in order to preserve the integrity of data and to avoid the insertion of inconsistent information. The effectiveness of RENUVER has been evaluated on real-world datasets1 in terms of accuracy, and execution time. In order to extract rfds, we relied on an existing rfd discovery algorithm [5], since the problem of discovering rfds is out of the scope of this paper. Moreover, we introduce a novel method for the automatic evaluation of data imputation results, which permits to judge the imputed values even with different syntactical representations. Evaluation results demonstrate that RENUVER outperforms other data imputation approaches [6, 7, 8]. The paper is organized as follows: Section 2 provides preliminary notions on rfds. Section 3 introduces RENUVER’s logic through the employment of the rfds in the data imputation problem. An experimental evaluation measuring the effectiveness RENUVER is presented in Section 4. Finally, conclusions and further research are reported in Section 5. 2. Preliminaries Before describing how we approached the imputation problem through the employment of rfds, let us introduce some propaedeutics notions to our methodology. Functional Dependency. Given a relational database schema ℛ, and 𝑅 = {𝐴1 , . . . , 𝐴𝑚 } one of its relation schemas, and a tuple 𝑡 ∈ 𝑟, we use 𝑡[𝐴𝑖 ], with 0 ≤ 𝑖 ≤ 𝑚, to denote the projection of 𝑡 onto 𝐴𝑖 ; similarly, for a set of attributes 𝑋 = {𝐴𝑖1 , . . . , 𝐴𝑖𝑘 }, with 1 ≤ 𝑘 ≤ 𝑚, 𝑡[𝑋] ∈ 𝑑𝑜𝑚(𝐴𝑖1 ) × . . . × 𝑑𝑜𝑚(𝐴𝑖𝑘 ) represents the projection of 𝑡 onto 𝑋, also denoted with Π𝑋 (𝑡). An fd on ℛ is a statement 𝑋 → 𝑌 (𝑋 implies 𝑌 ), with 𝑋, 𝑌 ⊆ 𝑎𝑡𝑡𝑟(𝑅), such that, given an instance 𝑟 of 𝑅, 𝑋 → 𝑌 is satisfied in 𝑟 if and only if for each pair of tuples (𝑡1 , 𝑡2 ) in 𝑟, whenever 𝑡1 [𝑋] = 𝑡2 [𝑋], then 𝑡1 [𝑌 ] = 𝑡2 [𝑌 ]. The sets of attributes 𝑋 and 𝑌 are named Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the fd, respectively. With respect to fd definition, the rfd generalizes the comparison paradigm, by including similarity/distance-based comparisons between tuple projections, also admitting the possibility for a dependency to hold only on a subset of tuples. The latter can be defined through either a coverage measure, quantifying the portion of the dataset on which a dependency holds or a condition restricting the domain on which a dependency can hold [9]. Since the proposed 1 https://github.com/DastLab/RENUVER-evaluation-datasets �Table 1 A sample of the Restaurant dataset. Name City Phone Type Class 𝑡1 Granita Malibu 310/456-0488 Californian 6 𝑡2 Chinois Main LA 310-392-9025 French 5 𝑡3 Citrus Los Angeles 213/857-0034 Californian 6 𝑡4 Citrus Los Angeles _ Californian 6 𝑡5 Fenix Hollywood 213/848-6677 _ 5 𝑡6 Fenix Argyle _ 213/848-6677 French (new) 5 𝑡7 C. Main Los Angeles _ French 5 approach exploits only rfds relying on a similarity/distance-based tuple comparison method, in what follows we provide only the definition of this type of rfds, known as rfdc . For a more general definition of rfd, see [9]. rfdc . Given a relational database schema ℛ, and 𝑅 = {𝐴1 , . . . , 𝐴𝑚 } one of its relation schemas, an rfdc 𝜙 on ℛ 𝑋Φ1 → 𝑌Φ2 (1) where • 𝑋, 𝑌 ⊆ 𝑎𝑡𝑡𝑟(𝑅); • Φ1 contains (for each attribute 𝑋𝑖 ∈ 𝑋) a constraint 𝜑𝑖 [𝑋𝑖 ] that can be used to determine whether pair of tuples with values in 𝑑𝑜𝑚(𝑋𝑖 ) are “similar” enough (likewise for each attribute 𝑌𝑗 ∈ 𝑌 with 𝜑𝑗 [𝑌𝑗 ] ∈ Φ2 ). More specifically, each 𝜑𝑖 [𝑋𝑖 ] (𝜑𝑗 [𝑌𝑗 ] resp.) requires the specification of a similarity/distance function defined on the domain of 𝑋𝑖 (𝑌𝑗 , resp.), an operator, and a threshold setting the boundaries for the satisfaction of the constraint. holds on a relation instance 𝑟 (denoted by 𝑟 |= 𝜙) if and only if for each pair of tuples (𝑡1 , 𝑡2 ) ∈ 𝑟 for which 𝑡1 [𝑋] and 𝑡2 [𝑋] satisfy the constraint 𝜑𝑖 [𝑋𝑖 ] for each 𝑋𝑖 ∈ 𝑋, then 𝑡1 [𝑌 ] and 𝑡2 [𝑌 ] satisfy the constraint 𝜑𝑖 [𝑌𝑖 ] for each 𝑌𝑖 ∈ 𝑌 . For sake of simplicity, in the following, we apply a more compact notation for the constraints, showing only the operator and the numeric threshold associated with each attribute. Example. Let us consider the sample relation shown in Table 1, derived from a database of restaurants in USA. Within this database, each tuple represents a restaurant providing information about its name, address, city, phone number, type of cuisine, and class. The latter is a numeric id associated to the type of cuisine. On such dataset, the following rfdc holds: Name(≤4) → − Phone(≤1) which states that, if two restaurants have a similar name, then they also have a similar phone number. This should be true despite the names and/or the phone numbers of restaurants being written in different ways or using different abbreviations. From a theoretical point of view, rfdc s permit to use any type of similarity/distance functions, e.g., edit distance, abs differences, and so forth. However, they are usually inherited from the functions involved in the automatic rfdc discovery process [5]. For the scope of this proposal, without loss of generality, we can consider rfdc s with a single attribute on the RHS, and the associated constraint 𝜑2 . In particular, we considered 𝜑2 composed of a distance function, the operator ≤, and a distance threshold. A particular type of rfdc is the key-rfdc , which is defined in the following. Key rfdc . Given a relation schema 𝑅, and an instance 𝑟 of 𝑅, an rfdc 𝜙 : 𝑋Φ1 → 𝐴𝜑2 is said to be key if and only if 𝜙 holds on 𝑟 (𝑟 |= 𝜙), but there is no pair of distinct tuples (𝑡1 , 𝑡2 ) ∈ 𝑟, for which 𝑡1 [𝑋] and 𝑡2 [𝑋] satisfy all the constraints in Φ1 [𝑋]. �3. The RENUVER imputation approach In this section, we formalize the data imputation problem by defining some of its underlying concepts, then describing the basics of the proposed imputation approach. Let us start defining the concept of missing value. Missing value. Given a relation schema 𝑅, defined over a set of attributes 𝑎𝑡𝑡𝑟(𝑅), an instance 𝑟 of 𝑅, an attribute 𝐴 ∈ 𝑎𝑡𝑡𝑟(𝑅), and a tuple 𝑡 ∈ 𝑟, a missing value of tuple 𝑡 on the attribute 𝐴, denoted as 𝑡[𝐴] = _, is such that 𝑡[𝐴] is null. Here, 𝑟 is said to be an incomplete instance, and ˆ𝑟 ⊆ 𝑟 contains only incomplete tuples. The general missing value imputation problem is formally defined as follows. Missing value imputation problem. Given a relation schema 𝑅, and an instance 𝑟 of 𝑅, for every tuple 𝑡 ∈ 𝑟 and every attribute 𝐴 ∈ 𝑎𝑡𝑡𝑟(𝑅) for which 𝑡[𝐴] = _, the imputation problem consists of finding a plausible value 𝑎 ∈ 𝑑𝑜𝑚(𝐴), such that the database instance 𝑟′ resulting from the imputation process does not contain inconsistent values. A missing value imputation approach also requires the application of constraints for evalu- ating the consistency of values at the end of the imputation process. The proposed approach exploits rfds to both guarantee the verification of the semantic consistency, and to drive the searching of meaningful candidates for all missing values. Semantically consistent imputation. Given a relation schema 𝑅, defined over a set of attributes 𝑎𝑡𝑡𝑟(𝑅), an instance 𝑟 of 𝑅, and a set of rfdc s, Σ, holding on 𝑟 (𝑟 |= Σ), an instance 𝑟′ of 𝑅 resulting from an imputation process 𝐼 over the instance 𝑟, denoted as 𝑟′ = 𝐼(𝑟), is semantically consistent iff 𝑟′ |= Σ. One of the possible strategies that could guarantee the semantic consistency of the imputation process is to find candidate values for 𝑡[𝐴] = _ by considering a set 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 ⊆ 𝑟 of plausible candidate tuples for imputing 𝑡[𝐴], such that ∀𝑡𝑘 ∈ 𝑇𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 , 𝑡𝑘 [𝐴] ̸= _ and 𝑡𝑘 is similar to 𝑡 on some attributes beyond 𝐴. In what follows we define the criteria used by RENUVER for deciding when a tuple can be considered as a plausible candidate, which is based on rfdc s. Plausible candidate tuple. Given a missing value 𝑡[𝐴]=_ over a database instance 𝑟 of a relation schema 𝑅, and an rfdc 𝜙 : 𝑋Φ1 → 𝐴𝜑2 holding on 𝑟, a tuple 𝑡′ ∈ 𝑟 can be considered as a plausible candidate tuple for imputing 𝑡[𝐴] according to 𝜙 iff 𝑡 and 𝑡′ , are similar according to the constraints in Φ1 . The candidate tuple generation process performed according to the definition presented above, has to be generalized in order to perform the imputation process on tuples containing more than one missing value, and for each 𝑡 ∈ ˆ𝑟. Missing value imputation for a tuple. Let 𝑅 be a relational schema defined over a set of attributes 𝑎𝑡𝑡𝑟(𝑅), 𝑟 an instance of 𝑅, 𝑡 a tuple of 𝑟, 𝑍 ⊂ 𝑎𝑡𝑡𝑟(𝑅) a set of attributes such that for each 𝐴 ∈ 𝑍 𝑡[𝐴] = _, and Σ a set of rfdc s holding on 𝑟. An imputation process for 𝑡 consists of selecting a plausible candidate tuple 𝑡𝑗 for each 𝐴 ∈ 𝑍 such that 𝑡[𝐴] = _, so that 𝑡[𝐴] can be set equal to 𝑡𝑗 [𝐴]. However, when for a 𝑡[𝐴] = _ it is not possible to identify a plausible candidate tuple guaranteeing a semantic consistent imputation, it is better to leave 𝑡[𝐴] unimputed. Although this strategy has been widely applied in other approaches [7], it � a) Data pre-processing Name(≤ 8), Phone(≤ 0), Class(≤ 1) ➝ Type(≤ 0) Name City Phone Type Class Class(≤ 0) ➝ Type(≤ 5) t1 Granita Malibu 310/456-0488 Californian 6 City(≤ 2) ➝ Phone(≤ 2) t2 Chinos Main LA 310-932-9025 French 5 Name(≤ 4) ➝ Phone(≤ 1) t3 Citrus Los Angeles 213/857-0034 Californian 6 Name(≤ 8), Phone(≤ 0) ➝ City(≤ 9) t4 Citrus Los Angeles _ Californian 6 Name(≤ 6), City(≤ 9) ➝ Phone(≤ 0) t5 Fenix Hollywood 213/848-6677 _ 5 t6 Fenix Argyle _ 213/848-6677 French (new) 5 Phone(≤ 1) ➝ Class(≤ 0) ... ... t7 C. Main Los Angeles _ French 5 b) RFDc selection Name City Phone Type Class t1 Granita Malibu 310/456-0488 Californian 6 t2 Chinos Main LA 310-932-9025 French 5 0 : Name (≤ 6), City : Name (≤ 6) (≤ 9) ➝ ➝ , City (≤ 9) Phone Phone (≤ 0) (≤ 0) t3 Citrus Los Angeles 213/857-0034 Californian 6 Phone t4 Citrus Los Angeles _ Californian 6 : Name(≤ 4) ➝ Phone(≤ 1) : Name(≤ 4) ➝ Phone(≤ 1) t5 Fenix Hollywood 213/848-6677 _ 5 Phone : City(≤ 2) ➝ Phone(≤ 2) t6 Fenix Argyle _ 213/848-6677 French (new) 5 : City(≤ 2) ➝ Phone(≤ 2) t7 C. Main Los Angeles _ French 5 Phone c) Imputing missing values Name City Phone Type Class t1 Granita Malibu 310/456-0488 Californian 6 : Name(≤ 6), City(≤ 9) ➝ Phone(≤ 0) t2 Chinos Main LA 310-932-9025 French 5 Phone t3 Citrus Los Angeles 213/857-0034 Californian 6 t4 Citrus Los Angeles _ Californian 6 : Name(≤ 4) ➝ Phone(≤ 1) Phone t5 Fenix Hollywood 213/848-6677 _ 5 t6 Fenix Argyle _ 213/848-6677 French (new) 5 : City(≤ 2) ➝ Phone(≤ 2) C. Main Los Angeles _ Phone t7 French 5 Imputing t7[Phone] with t3[Phone] Name City Phone Type Class t1 Granita Malibu 310/456-0488 Californian 6 t2 Chinos Main LA 310-932-9025 French 5 : Phone(≤ 1) ➝ Class(≤ 0) t3 Citrus Los Angeles 213/857-0034 Californian 6 t4 Citrus Los Angeles _ Californian 6 violated! t5 Fenix Hollywood 213/848-6677 _ 5 t6 Fenix Argyle _ 213/848-6677 French (new) 5 t7 C. Main Los Angeles 213/857-0034 French 5 Imputing t7[Phone] with t2[Phone] Name City Phone Type Class t1 Granita Malibu 310/456-0488 Californian 6 : Phone(≤ 1) ➝ Class(≤ 0) t2 Chinos Main LA 310-932-9025 French 5 t3 Citrus Los Angeles 213/857-0034 Californian 6 NOT violated! t4 Citrus Los Angeles _ Californian 6 t5 Fenix Hollywood 213/848-6677 _ 5 t6 Fenix Argyle _ 213/848-6677 French (new) 5 t7 C. Main Los Angeles 310-932-9025 French 5 Figure 1: An example of RENUVER imputation on the Restaurant dataset of Table 1. yields to another important issue that RENUVER deals with, i.e., minimizing the number of non-imputed values. Figure 1 summarizes the imputation logic of RENUVER2 through an example. In particular, we show how the aforesaid definitions empower the imputation of a missing value in the Restaurant dataset, previously introduced. In details, we can identify three major phases yielding the imputation of certain missing value, that are: • Pre-processing: during this phase, missing values within a database instance are identi- 2 A deep overview of RENUVER, together with a more exhaustive evaluation has been carried out in [3]. � fied and isolated. Furthermore, RENUVER excludes all key-rfdc s from the set of the rfdc s which can be employed for the imputation of any missing value (see Figure 1.a). • rfdc selection: following the selection of a missing value to impute, during this phase RENUVER identifies all the rfdc s that can be useful for its imputation. rfdc s are then organized in a set of clusters according to their threshold on the RHS (see Figure 1.b). • Imputing missing values: during this phase, RENUVER performs a series of operations leading to the imputation of a missing value by retrieving the value from a set of plausible candidate tuples relying on the same database instance (see Figure 1.c). In particular, RENUVER iteratively performs the following operations: – generates a set of plausible candidate tuples that satisfy the LHS constraints of an rfdc s belonging to one of the clusters previously generated. – computes a distance value for each plausible candidate tuple with respect to the tuple having the missing value. The evaluation is performed by considering the LHS attributes of the rfdc s selected. Finally the candidate tuple having the minimum distance is the exploited for the imputation of the missing value. – verifies whether the imputed value causes a violation of holding rfdc s. In this case, RENUVER selects the next plausible candidate tuple with the lowest distance value. These operations are repeated for each cluster as long as the imputation is not successful. 4. Experimental Evaluation In this section, we present a comparative evaluation of RENUVER w.r.t. other approaches exploiting different imputation strategies. In particular, we benchmarked RENUVER against an holistic-machine learning-based approach, namely Holoclean [6], (considering its attention- based expansion module AimNet [10]) and a differential dependencies guided approach [7] named Derand, for which we employed the same rfdc s as RENUVER. All evaluations were performed under the same conditions on an iMac Pro with an 8-core CPU and 32GB RAM. Datasets. The considered algorithms have been evaluated on two real-world datasets 2 in order to perform a stress test on RENUVER and all compared imputation approaches, aiming to determine their time and memory requirements. To this end, we stopped the executions exceeding 48 hours of execution time and/or 30GB of memory consumption, respectively. Furthermore, in order to obtain an accurate comparison between the imputed values and the expected ones, missing values have been artificially injected in a random manner. Moreover, to avoid an arrangement of missing values over one algorithm, for each missing injection we produced five different datasets, yielding a total of twenty-five variants of the same dataset. The metrics adopted for the comparison are then averaged over each missing rate. Evaluation metrics. The effectiveness of the data imputation approaches have been evaluated by considering three different metrics: precision, recall, F1-measure. Which can be formally defined as: ⋂︀ ⋂︀ precision = |true|imputed| imputed| recall = |true|missing| missing| F1-measure = 2 × precision×recall precision+recall where true represents the correctly imputed missing values at the end of the imputation process, imputed represents all the imputed missing values, and missing the missing values in the dataset. �Table 2 Comparative evaluation of RENUVER on the Restaurants and Physician datasets. Dataset #Tuples #Attributes #Missing val. #rfdc s #DCs Dataset #Tuples #Attributes #Missing val. #rfdc s #DCs 259 (5%) 104 (0.05%) 13 (1%) 1430 518 (10%) 208 (0.1%) 27 (1%) 2553 Restaurant 864 6 1037 (20%) 1961 9 Physician 1036 (0.5%) 13 135 (1%) 3895 74 1555 (30%) 2072 (1%) 269 (1%) 5708 2074 (40%) 10359 (5%) 1319 (1%) 6137 Dataset Approach Recall Precision F1-Meas. Time Mem. Dataset Approach Recall Precision F1-Meas. Time Mem. 0.329 0.864 0.476 14m 29s 1.38 GB 0.338 1 0.505 470ms 1.48 GB (varying the number of tuples) 0.296 0.832 0.437 23m 21s 1.31 GB 0.328 0.547 0.410 3s 1.79 GB (varying the missing rate) RENUVER 0.294 0.845 0.436 33m 20s 1.36 GB RENUVER 0.326 0.607 0.424 1m 19s 0.71 GB 0.258 0.828 0.394 36m 37s 1.37 GB 0.254 0.483 0.333 15m 1s 1.30 GB 0.232 0.726 0.349 30m 23s 1.38 GB - - - TL - 0.295 0.419 0.345 47h 13m 7.21 GB 0.121 0.210 0.151 1h 10s 1.25 GB Restaurant Physician - - - TL - 0.125 0.190 0.150 9h 49m 3.32 GB Derand - - - TL - Derand 0.110 0.121 0.115 25h 40m 8.21 GB - - - TL - - - - TL - - - - TL - - - - TL - 0.275 0.544 0.362 14s 0.99 GB 0.230 0.300 0.599 7s 3.95 GB 0.099 0.218 0.131 15s 0.99 GB 0.115 0.120 0.117 12s 5.15 GB Holoclean 0.071 0.153 0.095 14s 0.99 GB Holoclean 0.097 0.114 0.104 1m 8s 6.16 GB 0.064 0.192 0.095 11s 0.78 GB 0.156 0.167 0.161 8m 21s 26.89 GB 0.165 0.419 0.237 10s 0.79 GB - - - - ML TL: time limit of 48 hours exceeded − ML: memory limit of 30 GB exceeded TL: time limit of 48 hours exceeded − ML: memory limit of 30 GB exceeded Results. The first evaluation session is focused on the Restaurant dataset by considering the following missing rates: [5%, 10%, 20%, 30%, 40%] (see Table 2). We can notice that the fastest approach is Holoclean, whereas Derand registered severely higher execution times, exceeding the 48h time limit starting from the 10% of missing rate. The faster execution times of Holoclean can be justified by the conspicuously lower number of metadata to be processed during the imputation process, i.e., 9 Denial of Constraints, compared to 1961 rfdc s. Nevertheless, RENUVER registered the best performances on all the considered qualitative metrics. The second evaluation session is focused on the Physician dataset, by fixing the missing rate and by varying the number of tuples to be considered. This dataset is particularly complex to analyze, since it also contains a high number of attributes (i.e., 13 attributes). In fact, this dataset allowed us to catch a time and/or memory limit for all considered approaches (i.e., RENUVER, Derand, and Holoclean), as shown in Table 2. In particular, we can notice that, on average, both RENUVER and Holoclean registered faster execution times than Derand. In fact, the latter exceeds the time limit of 48h on the datasets having 2072 and 10359 tuples, respectively. On the other hand, Holoclean manages to achieve reasonable executions times, but the huge amount of consumed memory makes it exceed the 30GB memory limit on the dataset having 10359 tuples. Finally, RENUVER also exceeds the time limit on the largest dataset, despite a more reasonable memory consumption. This evaluation session proved the capability of RENUVER to outperform the compared approaches on the considered qualitative metrics. It also emphasized that Derand’s execution times are strongly dependent on the number of missing values, whereas although Holoclean provided overall faster execution times, it resulted heavily memory-consuming. 5. Conclusion In this paper, we proposed RENUVER, a data imputation algorithm that exploits relaxed func- tional dependencies. The latter enables RENUVER to select and evaluate tuple candidates to be used during the imputation process. The whole imputation process preserves the semantic �consistency of the data, by guaranteeing that no imputation can violate any rfdc . Evaluation results demonstrated that RENUVER outperforms recent approaches using different imputation strategies: machine learning-based (Holoclean) and dependency-based (Derand). In the future, we would like to extend RENUVER with the possibility of selecting plausible candidate tuples among multiple datasets. Finally, we would like to study the applicability of RENUVER over incremental scenarios, like for example those related to the imputation of time series [11], which would require the usage of incremental rfdc discovery algorithms [12, 13]. References [1] M. V. Martinez, C. Molinaro, J. Grant, V. Subrahmanian, Customized policies for handling partial information in relational databases, IEEE Transactions on Knowledge and Data Engineering 25 (2012) 1254–1271. [2] B. Montesdeoca, J. Luengo, J. Maillo, D. García-Gil, S. García, F. Herrera, A first approach on big data missing values imputation, in: Proceedings of 5th International Conference on Internet of Things, Big Data and Security (IoTBDS), SciTePress, 2019, pp. 315–323. [3] B. Breve, L. Caruccio, V. Deufemia, G. Polese, RENUVER: A missing value imputation algorithm based on relaxed functional dependencies, in: To appear in Proceedings of the 25th International Conference on Extending Database Technology, (EDBT), OpenProceed- ings.org, 2022. [4] I. F. Ilyas, X. Chu, et al., Trends in cleaning relational data: consistency and deduplication, Foundations and Trends® in Databases 5 (2015) 281–393. [5] L. Caruccio, V. Deufemia, F. Naumann, G. Polese, Discovering relaxed functional depen- dencies based on multi-attribute dominance, IEEE Transactions on Knowledge and Data Engineering 33 (2021) 3212–3228. [6] T. Rekatsinas, X. Chu, I. F. Ilyas, C. Ré, Holoclean: holistic data repairs with probabilistic inference, Proceedings of VLDB Endowment 10 (2017) 1190–1201. [7] S. Song, Y. Sun, A. Zhang, L. Chen, J. Wang, Enriching data imputation under similarity rule constraints, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 275–287. [8] C.-C. Huang, H.-M. Lee, A grey-based nearest neighbor approach for missing attribute value prediction, Applied Intelligence 20 (2004) 239–252. [9] L. Caruccio, V. Deufemia, G. Polese, Relaxed functional dependencies—A survey of ap- proaches, IEEE Transactions on Knowledge and Data Engineering 28 (2016) 147–165. [10] R. Wu, A. Zhang, I. Ilyas, T. Rekatsinas, Attention-based learning for missing data imputa- tion in holoclean, Proceedings of Machine Learning and Systems 2 (2020) 307–325. [11] M. Khayati, A. Lerner, Z. Tymchenko, P. Cudré-Mauroux, Mind the gap: An experimental evaluation of imputation of missing values techniques in time series, Proceedings VLDB Endowment 13 (2020) 768–782. [12] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Incremental discovery of functional depen- dencies with a bit-vector algorithm, in: Proceedings of Italian Symposium on Advanced Database Systems, volume 2400 of SEBD ’19, CEUR-WS.org, 2019, pp. 1–12. [13] L. Caruccio, S. Cirillo, Incremental discovery of imprecise functional dependencies, Journal of Data and Information Quality (JDIQ) 12 (2020) 1–25. �