Vol-3194/paper18
Jump to navigation
Jump to search
Paper
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper18 |
wikidataid | Q117345099→Q117345099 |
title | Supporting the Design of Data Preparation Pipelines |
pdfUrl | https://ceur-ws.org/Vol-3194/paper18.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/SancriccaC22 |
volume | Vol-3194→Vol-3194 |
session | → |
Supporting the Design of Data Preparation Pipelines
Supporting the Design of Data Preparation Pipelines (Discussion Paper) Camilla Sancricca1 , Cinzia Cappiello1 1 Politecnico di Milano - Dipartimento di Elettronica, Informazione e Bioingegneria Abstract The availability of a large amount of data facilitates spreading a data-driven culture in which data are used and analyzed to support decision-making. However, data-based decisions are effective only if the considered input data sources are not affected by poor quality and biases. For this reason, the data preparation phase is crucial for guaranteeing an appropriate output quality. There is a strong evidence in the literature that dealing with data preparation is not simple: it is the most resource consuming step in data analysis and most of the times it is performed using a trial and error approach. Considering this, we aim to support users in the design of data preparation pipelines by identifying the most suitable data transformation/cleaning operations to apply and the order in which they have to be executed. In order to achieve such a goal, using different datasets and ML algorithms, we conducted a series of experiments designed to assess the impact of different types of errors on the quality of the output. The idea is to develop a framework that provides users with guidelines that recommend to address the data quality issues with the highest negative impact first. A preliminary validation has confirmed that following the system suggestions yields better results. Keywords Data Quality, Data Preparation, Bias, Decision-making 1. Introduction Nowadays, organizations are increasingly adopting a data-driven culture in which data are used and analyzed to support decisions. However, in order to get valuable results from data analysis, input data should be reliable to avoid the well known Garbage In Garbage Out (GIGO) effect. Unfortunately, real world data are often affected by errors, inconsistencies, incomplete values, or biases (e.g., the data source does not exactly represent the considered population). In order to avoid having erroneous and unusable outputs, data preparation has become a crucial step in the data analysis pipeline for guaranteeing an appropriate quality output. It is worth noting that data preparation is not a simple task: usually it is a time consuming activity and it is usually performed by using a trial and error approach. This paper aims to propose a framework to support users in the identification of the most appropriate data transformation/cleaning activities to perform. Recommendations are designed to address the data quality issues that affect more the reliability of the analysis results first. Data Quality (DQ) is often defined as “fitness for use”, i.e., the ability of a data collection to meet user requirements [1]. Data quality SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy " camilla.sancricca@polimi.it (C. Sancricca); cinzia.cappiello@polimi.it (C. Cappiello) � 0000-0001-6062-5174 (C. Cappiello) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) �is a multi-dimensional concept that refers to several aspects that affect data from various perspectives. The most used data quality dimensions are accuracy, completeness, consistency and timeliness [2]. We base our approach on the fact a specific data quality issue can be solved by using one or more data preparation actions. Therefore, once that a preliminary data quality assessment reveals the dimensions that need to be improved, it is possible to identify the related data preparation actions to perform. Moreover, a series of experiments allowed us to discover that the impact of different quality issues is dependent on the context of the analysis, where the context is modeled as the analytics application and the characteristics of the considered data sources. These findings helped us in designing an adaptive system able to provide users with guidelines about the sequence of data preparation activities to perform in a particular context to maximize the output quality. The effectiveness of such guidelines has been proven testing them with different combinations of data sets and analytics algorithms. The tests confirmed that applying the suggested sequence of data preparation tasks yields better final results. The paper is organized as follows. Section 2 presents the proposed approach and shows the experiments conducted to understand the impact of the data quality errors on the results of data analysis. Section 3 shows a preliminary validation of the effectiveness of the presented system. Section 4 discusses previous literature contributions. Finally, Section 5 draws conclusion and discusses future work. 2. A framework to support the design of data preparation pipelines This section aims to describe the framework we designed for supporting users in the data preprocessing phase. Sections 2.1 and 2.3 present the architecture and the experiments conducted for feeding the knowledge base used to provide valuable recommendations. Section 2.2 clarifies the data quality aspects and biases we consider in this work. 2.1. The general architecture A data analytics pipeline is usually composed of two main phases: data preprocessing and data analysis. The former collects and processes the data for guaranteeing a certain level of quality while the latter performs the data analysis tasks. This paper focuses on the data preprocessing Figure 1: The Data Preparation framework: the high level architecture �phase and proposes a framework that aims to guide the user through the design of the data preparation pipeline, suggesting the most suitable activities to perform. As depicted in Figure 1, the system allows the user to specify the context of analysis in terms of the data sources to consider and the analytics application to use. Once the data are collected, they are inspected and analyzed by using data profiling techniques and data quality and bias assessment algorithms. The considered data quality and bias metrics are described in Section 2.2. Users can access the results of this phase in order to understand the content of the datasets and their initial suitability for the task at hand. The results are sent to the data preparation module that has to identify the most appropriate task to perform and support its execution. The Data Preparation module is supported by a knowledge base that contains information to infer the data preparation tasks to suggest. It contains, for each data quality dimension, the association between the considered dimension and the data preparation activities able to improve it. Moreover, it registers the impact of the issues related to a quality dimension on the results of an analytics application. Note that such an impact depends on the chosen data analysis algorithm and the data source profile. These relationships together with the data provided by the data profiling and quality/bias assessment module are the input for the design of the data preparation pipeline that defines the data preparation techniques to consider and the order with which they have to be executed. In details, the most suitable data preparation tasks are identified on the basis of a ranking that sorts the data quality dimensions to improve. Such ranking is obtained by merging the quality level of each dimensions with its impact of the quality results. Note that in the envisioned approach, we assume that the users are free to follow the suggestions or not, letting them building their own data preparation pipeline and executing it. When the data preparation phase is completed, the data analysis phase can start. 2.2. Data Quality and Bias As stated in the Section 1, Data Quality is a multidimensional concept: a DQ model is composed of DQ dimensions that represent the different aspects to consider. In our work, we focused on the accuracy and completeness dimensions. Accuracy is defined as the closeness between a data value v and a data value v’, considered as the correct representation of the real-life phenomenon that the data value v aims to represent [2]. In the literature two types of accuracy are defined: Syntactic accuracy and Semantic accuracy. Syntactic accuracy is the closeness of a value v to the elements of the corresponding definition domain D [2]. If v is one of the values in D, then v is accurate otherwise it is not accurate. Semantic accuracy is defined as the distance between a data value v and a data value v’. In this case v is a value of the domain D. Semantic inaccurate values are those that, despite belonging to the domain, are not the correct ones. Completeness characterizes the extent to which the table represents the corresponding real world [2]. It is related to the presence of null values and a simple way to assess completeness in a table is to calculate the ratio between the number of non-null values and the number of cells of the table. The proposed framework also aims to alert the user of the presence of potential biases that can affect the data. In fact, a high data quality level does not guarantee the representativeness of the database: a bias analysis is needed. Currently, we focus on three metrics for quantifying the bias: coverage, density and diversity [3] [4]. Coverage is defined as the degree to which the dataset is �Figure 2: Results related to the impact of data quality issues on Decision Tree results representative of the real-world. It can be measured as the ratio between the number of different entities populating the dataset, and the total number of real-world entities, for which the value can be approximated or given by the user. Density is the degree to which different entities occur into the dataset. Given an attribute, for each distinct value, it is defined as its occurrence in one column. Density can be also computed for the whole attribute, representing the degree to which distinct values in one column are uniformly distributed. Diversity is associated with the concept of entropy. In the area of relational databases, entropy relies on how much an attribute is informative. To assess diversity the Shannon entropy, also known as Shannon’s diversity index [5], can be used. 2.3. Evaluation of the impact of poor Data Quality on Machine Learning This section describes the experiments carried on to understand the impact of the data quality dimensions on the quality of the analytics results. So far we focused on two data quality dimensions, i.e., Accuracy and Completeness and five ML algorithms, i.e., Decision Tree, k- Nearest Neighbors and Naïve Bayes for classification, k-Means as clustering algorithm and Ordinary Least Squares as linear regression method. Note that, as regards the accuracy, we analyzed the impact of errors related to the syntactic accuracy (e.g., typos or values outside the admissible domain) and the semantic accuracy (e.g., an admissible value that is not the correct one). The method we used to evaluate the discussed impact is based on a fault injection approach. Starting from a dataset and a quality dimension we introduced errors with a uniform distribution varying their quantity (from 0 to 90% with a step of 10%). In this way, we obtained nine instances of the same dataset with which we fed a ML algorithm and collected the data related to the performances of the results. We reiterated this procedure for all the considered data quality dimensions. Figure 2 shows an example of the results obtained by applying the method described above on two datasets (i.e., Iris1 and Soybean2 datasets) that are characterized by low dimensionality (i.e., limited number of attributes) but different data types: Iris dataset is numerical and Soybean dataset is categorical. As ML application, a Decision Tree classifier is 1 https://archive.ics.uci.edu/ml/datasets/iris 2 https://archive.ics.uci.edu/ml/datasets/Soybean+(Large) �Figure 3: Evaluation of data preparation guidelines on the Iris dataset applying different ML applications considered and its results are evaluated by using the F1 Score. The results of the experiments show that the quality impact depends on the analyzed datasets that in this case differ in the data types. In this example, on the basis of the discussed impact it is possible to understand that both for Iris and Soybean datasets the first dimension to improve is the Semantic Accuracy while the second one is the Syntactic Accuracy for Iris and the Completeness for Soybean. In summary, these experiments allowed us to identify the data quality dimensions of which issues have a greater impact on the results. Such dimensions are the ones that should be improved at first. Once identified the quality dimension to improve we can retrieve from the knowledge base the data preparation techniques able to improve it and suggest them to the user. 3. Experimental Results This section provides an example of the experiments conducted for evaluating the effectiveness of the data preparation guidelines provided by the proposed system. Considering a dataset, a ML application and the related data quality dimensions ranking, we conducted the experiments using the following process: (i) we injected errors in the dataset creating nine instances following the same procedure described in the previous section. Each instance contains the same quantity of errors for all the considered quality dimensions . (ii) the ML application is executed on the created instances and the output quality is registered; (iii) following a specific ranking of the data quality dimensions, one dimension at a time is improved and the ML application is executed on the enhanced dataset instances.This method has been applied both following the impact ranking described in Section 2.1 and also using different rankings obtained changing the order of the quality dimensions. The results confirmed the usefulness of our approach. In fact, it has been discovered that, performing data preparation following the suggestions yields better final results than applying the same preparation activities in a different order. Figure 3 shows an example of the results obtained running the above method for the Iris dataset. The considered quality dimensions are the same as before: Semantic and Syntactic Accuracy and Completeness. Instead, the selected ML applications are k-Nearest Neighbors for classification and Ordinary Least Squares for regression. The quality of the results has been evaluated by using the F1 �Score. Both algorithms are characterized by the same impact ranking: Syntactic Accuracy, Completeness and Semantic Accuracy. Figure 3 shows the results of the experiments where each curve represents the trend of the performance obtained every time a quality dimension is improved, following a specific order. It is possible to notice that the highest performance is reached by improving the first dimension of the extracted ranking. Data preparation actions performed subsequently bring only minimal improvements. It means that improving the most impacting dimension leads to obtain quickly good results. Therefore, the suggestions might help the user in saving some time in performing data preparation, since the suggestions avoid the adoption of a trial and error approach and the first suggested actions are already effective on the obtained results. 4. Related work Data-driven decision making relies on the use of advanced analysis tools able to deal with high volume of data. Unfortunately such data are often affected by poor quality that might negatively impact on the quality of the data analysis results and, thus, on the decisions outcome. To address data quality issues, within the processing pipeline a set of data preparation tasks can be performed. The main commercial data preparation tools are surveyed in [6],where their features are collected illustrating the data preparation task in which they are involved, for example, "Locate missing values", "Locate outliers" or "Change date & time format". Then, these features are re-organized in six categories, i.e., data discovery, data validation, data structuring, data enrichment, data filtering and data cleaning. Moreover, recent studies [7] [8] started to analyze the impact of data quality issues on the output quality of ML algorithms. These contributions investigate the effect of missing, inconsistent and conflicting data on the results of different ML tasks and quantify such an impact with an evaluation metric, called sensibility, which measure the sensibility of an algorithm to the data quality. They provide guidelines on: (i) detecting the errors rates (e.g., missing rate, inconsistent rate, conflicting rate) in the given data (ii) selecting the least sensitive ML algorithm according to the error types that have higher rates and (iii) clean each type of dirty data until reaching its corresponding keeping point, a metric that identifies at which point the corresponding error rate is acceptable for the selected ML algorithm. However, this approach does not provide any guidelines to clean the poor quality data. Moreover, the evaluation metrics used in this work are Precision, Recall and F-measure for all the ML algorithms. This can become a problem in testing some ML methods because the computation of such metrics requires the original class labels which, in some cases, may not be provided in the final prediction. In our approach, in fact, we consider different metrics to evaluate the output quality of different ML methods, e.g., F1 Score for classification or Silhouette Score for clustering. Several studies focus instead on the design of data preparation pipelines with the goal to get better results. [9] proposes a reinforcement learning method that finds out the optimal sequence of data preparation tasks able to maximize the performance of a ML model. The approach takes as input a data source, a model and a performance metric to optimize and, supported by Q-learning, explores all the possible data preparation tasks, determining at each step the next preprocessing method to execute to maximize the given metric. Our approach differs from this method since we have a knowledge base containing information �about the impact that several types of quality errors have on different ML applications, and the corresponding data preparation tasks that should be performed to address such quality issues. This allow us to suggest an optimal sequence of preparation activities, knowing in advance the problems which can mainly compromise the results. Instead, [9] does not have any knowledge and it needs to compute the optimal sequence running the system from scratch every time. A method aimed to decrease the time needed for the data preparation phase is described in [10]. This work offers a toolkit, which provides a set of key quality metrics related to the context of a ML project. The goal is to assess a set of metrics that are able to detect the specific data issues connected to this field, e.g., class overlap, feature relevance or data homogeneity. Then, they consider the identified issues and build a specific pipeline to detect, explain and address such problems. Recently, data quality has been also connected to the ethics context, since the fact that data are correct is a typical ethical requirement and, moreover, data should conform high ethical standard to be considered of high quality [4]. [4] has combined the most widely used ethical requirements with data quality, defining a set of ethical quality dimensions, i.e., data transparency, diversity and data protection and fairness. Another relevant work is [11] in which a wide variety of aspects related to bias and fairness in the field of machine learning are discussed. This work investigates different real-world examples that has been affected by biases and the different sources of bias in AI systems. Moreover, it creates a taxonomy grouping many existing definitions for fairness. This contribution highlights the fact that biases need to be considered in a data analysis pipeline as our approach suggests. 5. Conclusions and Future Work This work addresses the problem of defining reliable guidelines to support users in the data preparation process. Results of the conducted experiments above led to an important conclusions: (i) it is confirmed that different data quality issues have different impact on the quality of the results depending both on the algorithm applied and the dataset features; (ii) following the guidelines leads to quality results in an efficient and effective way. In fact, results show that, following the proposed suggestions yields better results than applying the same preparation activities in a different order. This work opens up a number of new research opportunities. Human-in-the-loop (HITL) techniques should be better exploited in order to further improve the data preparation process and the provided guidelines. Future work will also better investigate issues related to bias: we aim to extend the data preparation techniques by considering bias mitigation techniques (e.g., data enrichment) in addition to data quality improvement techniques. References [1] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, Journal of Management Information Systems 12 (1996) 5–33. URL: https://doi.org/10.1080/07421222.1996.11518099. doi:10.1080/07421222.1996. 11518099. arXiv:https://doi.org/10.1080/07421222.1996.11518099. � [2] C. Batini, M. Scannapieco, Data and Information Quality - Dimensions, Principles and Techniques, Data-Centric Systems and Applications, Springer, 2016. URL: https://doi.org/ 10.1007/978-3-319-24106-7. doi:10.1007/978-3-319-24106-7. [3] F. Naumann, J. C. Freytag, U. Leser, Completeness of integrated information sources, Inf. Syst. 29 (2004) 583–615. URL: https://doi.org/10.1016/j.is.2003.12.005. doi:10.1016/j.is. 2003.12.005. [4] D. Firmani, L. Tanca, R. Torlone, Ethical dimensions for data quality, ACM J. Data Inf. Qual. 12 (2020) 2:1–2:5. URL: https://doi.org/10.1145/3362121. doi:10.1145/3362121. [5] L. Jost, Entropy and diversity, Oikos 113 (2006) 363–375. [6] M. Hameed, F. Naumann, Data preparation: A survey of commercial tools, SIGMOD Rec. 49 (2020) 18–29. URL: https://doi.org/10.1145/3444831.3444835. doi:10.1145/3444831. 3444835. [7] Z. Qi, H. Wang, Dirty-data impacts on regression models: An experimental evaluation, in: C. S. Jensen, E. Lim, D. Yang, W. Lee, V. S. Tseng, V. Kalogeraki, J. Huang, C. Shen (Eds.), Database Systems for Advanced Applications - 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part I, volume 12681 of Lecture Notes in Computer Science, Springer, 2021, pp. 88–95. URL: https://doi.org/10.1007/ 978-3-030-73194-6_6. doi:10.1007/978-3-030-73194-6\_6. [8] Z. Qi, H. Wang, A. Wang, Impacts of dirty data on classification and clustering models: An experimental evaluation, J. Comput. Sci. Technol. 36 (2021) 806–821. URL: https: //doi.org/10.1007/s11390-021-1344-6. doi:10.1007/s11390-021-1344-6. [9] L. Berti-Équille, Active reinforcement learning for data preparation: Learn2clean with human-in-the-loop, in: 10th Conference on Innovative Data Systems Re- search, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceed- ings, www.cidrdb.org, 2020. URL: http://cidrdb.org/cidr2020/gongshow2020/gongshow/ abstracts/cidr2020_abstract59.pdf. [10] N. Gupta, H. Patel, S. Afzal, N. Panwar, R. S. Mittal, S. C. Guttula, A. Jain, L. Nagalapatti, S. Mehta, S. Hans, P. Lohia, A. Aggarwal, D. Saha, Data quality toolkit: Automatic assess- ment of data quality and remediation for machine learning datasets, CoRR abs/2108.05935 (2021). URL: https://arxiv.org/abs/2108.05935. arXiv:2108.05935. [11] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, CoRR abs/1908.09635 (2019). URL: http://arxiv.org/abs/1908.09635. arXiv:1908.09635. �