Vol-3194/paper66
Paper
Paper | |
---|---|
id | Vol-3194/paper66 |
wikidataid | Q117344898 |
title | Functional Dependencies to Mitigate Data Bias |
pdfUrl | https://ceur-ws.org/Vol-3194/paper66.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/AzzaliniCT22 |
volume | Vol-3194 |
Functional Dependencies to Mitigate Data Bias
Functional Dependencies to Mitigate Data Bias (Discussion Paper)

Fabio Azzalini1,2, Chiara Criscuolo1 and Letizia Tanca1
1 Politecnico di Milano - Dipartimento di Elettronica, Informazione e Bioingegneria
2 Human Technopole - Center for Analysis, Decisions and Society

Abstract

Technologies based on data are frequently adopted in many sensitive environments to build models that support important and life-changing decisions. As a result, for an application to be ethically reliable, it should be associated with tools to discover and mitigate bias in data, in order to avoid (possibly unintentional) unethical behaviors and the associated consequences. In this paper we propose a novel solution that, exploiting the notion of Functional Dependency and its variants - well-known data constraints - aims at enforcing fairness by discovering and solving discrimination in datasets. Our system first identifies the attributes of a dataset that encompass discrimination (e.g. gender, ethnicity or religion), generating a list of dependencies; then, based on this information, it determines the smallest set of tuples that must be added or removed to mitigate such bias in the dataset. Experimental results on two real-world datasets demonstrate that our approach can greatly improve the ethical quality of data sources.

Keywords

Fairness, Data Bias, Functional Dependencies

1. Introduction

In recent years fairness has become an important topic of interest in the Data Science community. Computers and algorithms have made our lives more efficient and easier, but among the prices we risk paying is the possible presence of discrimination and unfairness in the decisions we make with their support. As a result, algorithmic decision systems should work on unbiased data to obtain fair results, since learning from historical data might also mean learning the traditional prejudices that are endemic in society, producing unethical decisions.

Last year we presented FAIR-DB [1] [2], a framework that, by discovering and analyzing a particular type of database integrity constraints, can find unfair behaviors in a dataset. Specifically, the system employs approximate conditional functional dependencies (ACFDs) to recognize the cases where the values of certain attributes (e.g. gender, ethnicity or religion) frequently determine the value of another one (such as the range of the proposed salary or social state).

In this paper we enhance the framework with an additional functionality to repair datasets found unfair by the discovery phase of FAIR-DB. The new module, the ACFD-Repair method, is a procedure that, based on the discovered dependencies, determines the smallest set of tuples to be added or removed from the dataset to mitigate, or completely remove, the previously discovered bias.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (Pisa), Italy
fabio.azzalini@polimi.it (F. Azzalini); chiara.criscuolo@polimi.it (C. Criscuolo); letizia.tanca@polimi.it (L. Tanca)
ORCID: 0000-0003-0631-2120 (F. Azzalini); 0000-0003-2607-3171 (L. Tanca)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: Framework workflow

ACFD | Support | Difference
---|---|---
(Sex = 'Female') → Income = '≤50K' | 0.287 | 0.135
(Race = 'Black') → Income = '≤50K' | 0.081 | 0.119
(Race = 'Amer-Indian-Eskimo') → Income = '≤50K' | 0.008 | 0.129
(Native-Country = 'NC-Hispanic') → Income = '≤50K' | 0.044 | 0.158

Table 1: A few user-selected dependencies from the U.S. Census Adult Dataset

2. Framework Overview

Before presenting the methodology, we first introduce some fundamental notions that will accompany us along our discussion. A protected attribute is a characteristic for which non-discrimination should be established, like age, race, sex, etc. [3]; the target variable is the feature of a dataset about which the user wants to gain a deeper understanding, for example the income, or a boolean label that indicates whether a loan is authorized or not. Figure 1 shows the framework of the enhanced version of FAIR-DB.

We now give a brief presentation of the Data Bias discovery step; for a deeper understanding of this phase, we refer the reader to [1] and [2]. The Data Bias discovery module operates according to the following steps: (i) we import the data and perform (if needed) data cleaning and integration; (ii) we extract the ACFDs from the dataset and discard the ones that do not contain protected attributes and the target variable; (iii) to select the ACFDs that show unfair behaviours, we compute for each dependency some metrics capturing its "ethical level"; (iv) to facilitate user interaction, we rank the ACFDs in descending order of importance; finally, (v) the user selects the ACFDs she perceives as the most problematic and the system computes some metrics that summarize the level of unfairness of the dataset.

The core of the discovery phase is the Difference, a novel metric that indicates how much 'unethical' the behaviour highlighted by a dependency is. The higher this metric, the stronger the alert for an unfair behavior. For each dependency 𝜙, we define the Difference metric of 𝜙 as the difference between the dependency confidence and the confidence of the dependency computed without the protected attributes of the Left-Hand-Side (LHS) of the ACFD. Given a dependency in the form 𝜙: (𝑋 → 𝑌, 𝑡𝑝), let 𝑍 = (𝑋 − {𝑃𝑟𝑜𝑡𝑒𝑐𝑡𝑒𝑑𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝑠}), that is, the LHS of the ACFD without its protected attributes, and 𝜙′: (𝑍 → 𝑌, 𝑡𝑝). We define the Difference as:

Difference(𝜙) = Confidence(𝜙) − Confidence(𝜙′)

The Difference metric gives us an idea of how much the values of the protected attributes influence the value of 𝑌. In order to assess the unfair behavior of a dependency, we also take into consideration its support, which indicates the pervasiveness of the ACFD; unethical dependencies with high support will impact many tuples, and thus will be more important.

Figure 2: ACFD-Repair pipeline

Table 1 reports a few user-selected ACFDs mined from the U.S. Census Adult Dataset (https://archive.ics.uci.edu/ml/datasets/Adult). An analysis of the dependencies shows that the most discriminated groups are: 'Female', 'Black', 'NC-Hispanic' and 'Amer-Indian-Eskimo'.
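To make the definition concrete, the sketch below shows how the support, confidence and Difference of an ACFD could be computed on a tabular dataset with pandas. It is a minimal illustration, not the FAIR-DB implementation: the helper names and the Adult-style column names are our assumptions.

```python
import pandas as pd

def support(df: pd.DataFrame, pattern: dict) -> float:
    """Fraction of tuples matching every attribute=value pair in `pattern`."""
    mask = pd.Series(True, index=df.index)
    for attr, value in pattern.items():
        mask &= df[attr] == value
    return mask.mean()

def confidence(df: pd.DataFrame, lhs: dict, rhs: dict) -> float:
    """Confidence of the ACFD lhs -> rhs, i.e. Sup(lhs ∪ rhs) / Sup(lhs)."""
    return support(df, {**lhs, **rhs}) / support(df, lhs)

def difference(df: pd.DataFrame, lhs: dict, rhs: dict, protected: set) -> float:
    """Difference(phi) = Conf(phi) - Conf(phi'), where phi' drops the protected attributes from the LHS."""
    reduced_lhs = {a: v for a, v in lhs.items() if a not in protected}
    return confidence(df, lhs, rhs) - confidence(df, reduced_lhs, rhs)

# Hypothetical usage on the Adult dataset (file name and column names assumed):
# df = pd.read_csv("adult.csv")
# difference(df, {"Sex": "Female", "Workclass": "Private"}, {"Income": "<=50K"}, protected={"Sex"})
```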
3. ACFD-Repair

The final objective of the repair phase is to create a fair dataset where the unethical behaviors are greatly mitigated or completely removed. The procedure has to conform to a fundamental requirement: since the repaired dataset has to be used later on in a data science pipeline, the number of modifications, consisting of deletions or additions of tuples, has to be minimized in order to preserve the original distributions of values. Note also that, given the variety of decision algorithms, the repair procedure must also work for attributes with non-binary values. Figure 2 presents the ACFD-Repair methodology: we now give a detailed description of each phase, using the U.S. Census Adult dataset as a running example.

3.1. Unfair Tuple Count Computation

We propose to try to reduce, or eliminate, unfairness starting from the list of ACFDs computed in the Data Bias discovery step. The repair is performed by bringing the value of the Difference of each user-selected dependency below a minimum threshold 𝛿; this can be achieved in two ways: (i) removing tuples matching the dependency so that, after removal, the Difference value of the dependency becomes lower than 𝛿; (ii) adding tuples that, combined with the elements that match the dependency, lower its initial Difference value below 𝛿.

The Unfair Tuple Count computation step is therefore responsible for finding, for each dependency, the number of tuples that have to be added or removed. For each ACFD 𝜙, we define its Unfair Tuple Count as F(𝜙) = (𝛼, 𝛽), where 𝛼 represents the number of tuples that should be added, and 𝛽 the number of tuples that should be removed, to repair the discriminatory behaviour shown by the dependency.

To explain how the Unfair Tuple Count value of each dependency is computed, we make use of the dependency 𝜙1: (Sex = 'Female', Workclass = 'Private') → Income = '≤50K'. The Difference of 𝜙1 can be computed as Diff(𝜙1) = Conf(𝜙1) − Conf(𝜙1′), where 𝜙1′: (Workclass = 'Private') → (Income = '≤50K'). Rewriting the confidence as a ratio of supports we get the following formula:

Diff(𝜙1) = Sup(𝜙1) / Sup(LHS(𝜙1)) − Sup(𝜙1′) / Sup(LHS(𝜙1′)) = 𝑎/𝑏 − 𝑐/𝑑

As already mentioned, to repair 𝜙1 we can either add or remove tuples: let us first focus on the former. To remove the unfair behaviour highlighted by 𝜙1: (Sex = 'Female', Workclass = 'Private') → Income = '≤50K', intuitively, we can either add males that earn less than 50K $/year and work in the private sector, or add females that earn more than 50K $/year and work in the private sector. Therefore we have to add to the dataset tuples that satisfy the following two dependencies:

• 𝜙2: (Sex = 'Male', Workclass = 'Private') → Income = '≤50K'
• 𝜙3: (Sex = 'Female', Workclass = 'Private') → Income = '>50K'

The two dependencies above define the tuples that, if added to the initial dataset, will lower the initial Difference value below the threshold 𝛿; 𝜙2 and 𝜙3, being the "antagonists" of 𝜙1, form the Opposite Set (OS) of 𝜙1. Specifically, 𝜙2 has been obtained by changing, one at a time, the values of the protected attributes: we name the set of such ACFDs the Protected attribute Opposite Set (POS). In the example we have only one, binary protected attribute, therefore the POS contains only the dependency 𝜙2; in case the protected attribute is non-binary, or there is more than one protected attribute, the POS would include more than one dependency. On the other hand, 𝜙3 has been obtained by reversing the value of the target class. This ACFD is in the Target attribute Opposite Set (TOS) which, since the target attribute is required to be binary, always contains exactly one dependency.

In case we have multiple discriminated minorities, adding tuples that satisfy the dependencies contained in the POS could result in enhancing discrimination. If we consider 𝜙: (Race = 'Black') → Income = '≤50K', the POS contains the ACFDs that involve all the other values of the protected attribute 'Race': 'White', 'Asian-Pac-Islander', 'Other' and 'Amer-Indian-Eskimo'; therefore we would add tuples referring to 'Amer-Indian-Eskimo' people that earn less than 50K $/year to the dataset, thus aggravating their already discriminatory condition. Given what we have just presented, and the fact that we want to repair the dataset by improving the condition of discriminated groups, we decide to add, for each ACFD, only the tuples that satisfy the dependency contained in each respective TOS.
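The following is a minimal sketch of how the Opposite Set could be enumerated, assuming an ACFD is represented as a pair of attribute→value dictionaries and the domains of the protected and target attributes are known; the representation and function names are illustrative assumptions, not part of FAIR-DB.

```python
from typing import Dict, List, Tuple

ACFD = Tuple[Dict[str, str], Dict[str, str]]  # (LHS, RHS), constant patterns only

def protected_opposite_set(acfd: ACFD, domains: Dict[str, List[str]], protected: set) -> List[ACFD]:
    """POS: change the value of one protected LHS attribute at a time."""
    lhs, rhs = acfd
    pos = []
    for attr, value in lhs.items():
        if attr in protected:
            for other in domains[attr]:
                if other != value:
                    pos.append(({**lhs, attr: other}, dict(rhs)))
    return pos

def target_opposite_set(acfd: ACFD, domains: Dict[str, List[str]]) -> List[ACFD]:
    """TOS: flip the (binary) target value on the RHS; always a single dependency."""
    lhs, rhs = acfd
    (target_attr, target_val), = rhs.items()
    other_val = next(v for v in domains[target_attr] if v != target_val)
    return [(dict(lhs), {target_attr: other_val})]

# Hypothetical usage:
# phi1 = ({"Sex": "Female", "Workclass": "Private"}, {"Income": "<=50K"})
# domains = {"Sex": ["Female", "Male"], "Income": ["<=50K", ">50K"]}
# target_opposite_set(phi1, domains)
```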
We now show how to compute the number of tuples to be added to the dataset for the dependency belonging to the TOS: we start by analyzing 𝜙3, obtained by reversing the value of the target attribute. Adding 𝑘 tuples that satisfy 𝜙3 will impact the Difference of 𝜙1 in the following way:

Diff(𝜙1) = 𝑎/(𝑏 + 𝑘) − 𝑐/(𝑑 + 𝑘)

Now, to repair 𝜙1, we have to impose that Diff(𝜙1) be lower than 𝛿 and, solving the inequality for 𝑘, we find the number of tuples that need to be added.

Now that we have demonstrated how to find the number of tuples to be added in order to repair a dependency, we show how to find the number of tuples that need to be removed to accomplish the same task. To repair an ACFD by removing tuples, we simply need to remove tuples that satisfy the dependency. To compute the number of tuples to be deleted from the dataset, we start by noticing that removing from the dataset 𝑘 tuples that satisfy 𝜙1 will modify the Difference of 𝜙1 in the following way:

Diff(𝜙1) = (𝑎 − 𝑘)/(𝑏 − 𝑘) − (𝑐 − 𝑘)/(𝑑 − 𝑘)

Now, to repair 𝜙1, we again impose that Diff(𝜙1) be lower than 𝛿 and, similarly to the previous case, find the number of tuples that should be removed by solving for 𝑘.
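The two inequalities above can be solved in closed form, but the Unfair Tuple Count can also be obtained by searching for the smallest integer 𝑘 that brings the Difference below 𝛿, as in the sketch below. The four support counts a, b, c, d are assumed to be absolute counts (e.g. obtained with the helpers sketched earlier); this is an illustration, not the authors' implementation.

```python
def tuples_to_add(a: int, b: int, c: int, d: int, delta: float) -> int:
    """Smallest k such that a/(b+k) - c/(d+k) < delta,
    i.e. the number of TOS tuples to add to repair the ACFD (assumes delta > 0)."""
    k = 0
    while a / (b + k) - c / (d + k) >= delta:
        k += 1
    return k

def tuples_to_remove(a: int, b: int, c: int, d: int, delta: float) -> int:
    """Smallest k such that (a-k)/(b-k) - (c-k)/(d-k) < delta,
    i.e. the number of tuples satisfying the ACFD to remove (capped at a)."""
    k = 0
    while (a - k) / (b - k) - (c - k) / (d - k) >= delta:
        k += 1
        if k >= a:  # cannot remove more matching tuples than exist
            return a
    return k

# Hypothetical counts for an ACFD phi1 and a threshold delta:
# alpha = tuples_to_add(a=4000, b=5000, c=15000, d=22000, delta=0.05)
# beta = tuples_to_remove(a=4000, b=5000, c=15000, d=22000, delta=0.05)
```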
ACFD | Tuples to add | Tuples to remove
---|---|---
(Sex = 'Female') → Income = '≤50K' | 1141 | 4883
(Race = 'Black') → Income = '≤50K' | 185 | 852
(Race = 'Amer-Indian-Eskimo') → Income = '≤50K' | 21 | 96
(Native-Country = 'NC-Hispanic') → Income = '≤50K' | 162 | 734

Table 2: A few user-selected dependencies from the U.S. Census Adult Dataset with the corresponding values of (𝛼, 𝛽)

Example 1. For the repair phase, we continue using the U.S. Census Adult Dataset. Table 1 reports some examples of ACFDs selected to repair the dataset. Table 2 reports, for each selected ACFD, the number of tuples that should be added, according to the TOS computation, or removed, to compensate the Difference metric.

3.2. Greedy Hit-Count algorithm

A basic solution to repair the dataset could simply consist in adding, for each negative ACFD, the tuples that satisfy its TOS, using, for each tuple, its Unfair Tuple Count as multiplicity. Unfortunately, this is not a valid option, because it would result in adding too many tuples to the dataset, which would not respect the key requirement to limit the modifications of the original dataset. To add tuples to the initial dataset in an optimized way, we use a modified version of the Greedy Hit-Count algorithm presented in [4].

From each dependency in a TOS, we generate a pattern, which is an array whose dimension is equal to the number of attributes in the dataset and where each cell represents the value of a specific attribute. Given a dataset 𝐷 with 𝑑 categorical attributes and an ACFD 𝜙, a pattern 𝑃 is a vector of size 𝑑 generated from 𝜙, where 𝑃[𝑘] is either 𝑋 (meaning that its value is unspecified) or the value of the corresponding attribute in 𝜙. The pattern computation step is useful for two reasons: (i) to transform an ACFD into a tuple, deciding the values of the free attributes (i.e. the non-instantiated attributes in the dependency); and (ii) to add tuples to the dataset in an optimized way, exploiting the Greedy Hit-Count algorithm.

The Greedy Hit-Count algorithm takes as input the set of patterns computed at the previous step and returns the set of tuples needed to repair the dataset. The idea behind this algorithm is that a value combination (i.e. a generated tuple) can cover multiple patterns simultaneously, allowing us to add fewer tuples than the basic solution, and thus minimizing the changes in the final, repaired dataset. At each iteration the algorithm returns the value combination that hits the maximum number of uncovered patterns, and it stops when all the patterns are covered. To summarize, given the set of uncovered patterns as input, this step finds the minimum set of tuples to repair the dataset, along with the indication of which patterns each tuple can cover. Finally, to decide how many copies of each tuple we insert in the dataset, we take the average of the Unfair Tuple Counts of the patterns covered by that tuple.

The current implementation of the algorithm does not prevent the generation of unrealistic tuples (e.g. combining "Sex" = "Male" with "Pregnant" = "Yes"), since there is no constraint on the value combinations. We plan, in future work, to improve the system with the addition of a validation oracle [4] to identify and prevent the insertion of inconsistent tuples.
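A compact sketch of the greedy covering step described above follows, assuming each pattern is a dictionary holding only its instantiated attributes (missing keys play the role of 𝑋). It is our illustrative reconstruction inspired by [4], not the authors' code, and it naively enumerates all value combinations, which a practical implementation would avoid.

```python
from itertools import product
from typing import Dict, List

Pattern = Dict[str, str]  # only instantiated attributes; absent keys are free ("X")

def covers(candidate: Dict[str, str], pattern: Pattern) -> bool:
    """A full value combination covers a pattern if it agrees on every instantiated attribute."""
    return all(candidate.get(attr) == val for attr, val in pattern.items())

def greedy_hit_count(patterns: List[Pattern], domains: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Repeatedly pick the value combination that hits the most uncovered patterns."""
    attrs = list(domains)
    # WARNING: exhaustive candidate enumeration, for illustration only
    candidates = [dict(zip(attrs, values)) for values in product(*(domains[a] for a in attrs))]
    uncovered = list(patterns)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda t: sum(covers(t, p) for p in uncovered))
        hit = [p for p in uncovered if covers(best, p)]
        if not hit:  # remaining patterns cannot be covered by any candidate
            break
        chosen.append(best)
        uncovered = [p for p in uncovered if p not in hit]
    return chosen
```

Each returned combination would then be inserted with multiplicity equal to the average Unfair Tuple Count of the patterns it covers, as described above.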
Workclass | Race | Sex | Hours-Per-Week | NC | Age-Range | Education | Income
---|---|---|---|---|---|---|---
X | X | Female | X | X | X | X | >50K
X | Black | X | X | X | X | X | >50K
X | A-I-E | X | X | X | X | X | >50K
X | X | X | X | NC-Hisp | X | X | >50K

Table 3: Pattern generation in the U.S. Census Adult Dataset

Tuple | Add
---|---
'Private', 'Black', 'Female', '0-20', 'NC-Hispanic', '75-100', 'HS-College', '>50K' | 427
'Private', 'Amer-Indian-Eskimo', 'Female', '0-20', 'NC-US', '75-100', 'Assoc', '>50K' | 184

Table 4: The tuples, with their cardinalities, returned by the algorithm to repair the U.S. Census Adult Dataset

Example 2. For each ACFD in Table 2, the system determines the TOS; then, from these new ACFDs, it generates the corresponding patterns (reported in Table 3). Note that, in the patterns, the target attribute value is changed in order to compensate the Difference of the original ACFDs. The computed set of patterns is the input of the Greedy Hit-Count algorithm. Table 4 reports the output, composed of 2 tuples with their cardinalities, that should be added to the dataset. The first tuple derives from 6 uncovered patterns and the second tuple derives from 2 uncovered patterns. The two tuples identified by the algorithm have no inconsistencies.

3.3. Correction algorithm

The aim of the Greedy Hit-Count algorithm phase was to enhance the fairness of the initial dataset by adding a very low number of external elements. Using a greedy approach, and setting a limit on the maximum number of tuples that can be added/removed, we prevent the overloading of the final dataset, but this could potentially not be enough to repair the dataset completely. During this phase, if any of the initial ACFDs is still present, we remove from the dataset some tuples matching it. To determine the number of tuples to remove, for each ACFD still present in the dataset we use 𝛽, the second value of its Unfair Tuple Count.

Example 3. After adding tuples in the previous step, only one of the original ACFDs is still found. If the user chooses to apply the Correction algorithm, the system computes again, according to 𝛿, the number of tuples to remove for each ACFD and generates the corresponding patterns that, converted into tuples, are removed from the dataset. In this case, to solve the ACFD we need to remove 1733 tuples, obtaining a final dataset with 29088 tuples.

3.4. Statistics Computation

We now present a set of metrics to analyze the quality of the repair procedure. Specifically, for the ACFDs given as input and still present in the repaired dataset, we compute: the Cumulative Support, the percentage of tuples in the dataset involved by the selected ACFDs, and the Mean Difference, the mean of all the 'Difference' scores of the selected ACFDs. Moreover, to make sure that the dataset can still be used for data science tasks, we compare the distribution of values of each attribute in the initial dataset with its counterpart in the repaired one. Finally, we apply the Data Bias discovery procedure to the repaired dataset to inspect the ghost ACFDs, i.e. new ACFDs that were not mined from the original dataset and might appear after this phase, possibly introducing new unfair behaviours.
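The two summary metrics can be computed directly from the selected ACFDs. A minimal sketch follows, reusing the hypothetical support and difference helpers introduced earlier and interpreting "tuples involved by the selected ACFDs" as tuples matching at least one of them; both the interpretation and the helper names are our assumptions.

```python
import pandas as pd

def cumulative_support(df: pd.DataFrame, acfds) -> float:
    """Fraction of tuples matched by at least one selected ACFD (LHS and RHS together)."""
    matched = pd.Series(False, index=df.index)
    for lhs, rhs in acfds:
        mask = pd.Series(True, index=df.index)
        for attr, value in {**lhs, **rhs}.items():
            mask &= df[attr] == value
        matched |= mask
    return matched.mean()

def mean_difference(df: pd.DataFrame, acfds, protected: set) -> float:
    """Average of the Difference scores of the selected ACFDs.
    `difference` is the hypothetical helper sketched in the Framework Overview section."""
    return sum(difference(df, lhs, rhs, protected) for lhs, rhs in acfds) / len(acfds)
```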
ACFD | Support | Difference
---|---|---
(Sex = 'Female', Class = 1) → Survived = 1 | 0.103 | 0.579
(Survived = 0, Sex = 'Female') → Class = 3 | 0.082 | 0.332
(Sex = 'Male') → Survived = 0 | 0.518 | 0.197

Table 5: A few dependencies from the Titanic Dataset

Example 4. Using the corrected version of the dataset we perform again the Data Bias discovery procedure; no ACFD is in common with the ones selected from the input dataset, so the dataset can be considered repaired. The metrics are indeed lower: the Cumulative Support, representing the percentage of tuples originally involved by unfair dependencies (around 35%), is 0 after the repair procedure, and the Mean Difference goes from 0.13 to 0. We compare the value distribution of each attribute between the initial and the corrected dataset (we do not report the plots for brevity): the distributions remain almost unchanged, apart from a slight increase in the number of females and private workers. Moreover, no unfair ghost ACFD is discovered.

4. Experimental Results

We now present the results obtained by our system on another real-world dataset: the Titanic dataset (https://www.kaggle.com/c/titanic). The discovery procedure returns the ACFDs that show unfairness; Table 5 reports some of them. The dataset is unfair with respect to all the protected attributes; specifically, the most discriminated groups are 'Male' and 'Third-Class' passengers. Due to this discrimination, we decide to mitigate the bias by applying the ACFD-Repair procedure. According to the user-selected dependencies, the algorithm computes the number of tuples that should be added to the dataset to compensate their Difference value; specifically, we add 233 tuples to the dataset. Since no initial dependency is extracted from the repaired dataset, the correction procedure is skipped and we consider this version of the dataset as corrected. The obtained metrics for the final dataset show that the mitigation procedure worked: the Cumulative Support, which was around 86%, is 0 after the repair procedure, and the Mean Difference goes from 0.35 to 0.

5. Comparison with similar systems

Three possible approaches can be adopted when trying to enforce fairness in a data analysis application: (i) preprocessing techniques, i.e. procedures that, before the application of a prediction algorithm, make sure that the learning data are fair; (ii) inprocessing techniques, i.e. procedures that ensure that, during the learning phase, the algorithm does not pick up the bias present in the data; and (iii) postprocessing techniques, i.e. procedures that correct the algorithm's decisions with the scope of making them fair.

One of the first preprocessing techniques is [5]. The process, on the basis of discrimination measures used in the legal literature, can identify potentially discriminatory itemsets by discovering association rules. Furthermore, the authors propose a set of sanitization methods: given a discriminatory rule, it is sanitized by modifying the itemset distribution in order to prevent discrimination. For each discrimination measure, they propose a method to achieve a fair dataset by introducing a reasonable (controlled) pattern distortion. Unfortunately, this system does not involve user interaction, therefore the user cannot discard the rules that are not interesting for the specific investigation. Our approach, instead, provides as a final step a set of summarizing metrics that describes the overall degree of unfairness of the dataset.

In the Machine Learning context, the majority of works that try to enforce fairness are related to a prediction task, and more specifically to classification algorithms. One of these works is AI Fairness 360 [6], an open-source framework whose aim is to reach algorithmic fairness for classifiers. It tries to mitigate data bias, quantified using different statistical measures, by exploiting pre-processing, in-processing and post-processing techniques. The results obtained by AI Fairness 360 are in complete accordance with ours. The competitor checks the fairness property only for one binary attribute at a time, while, since the ACFD technique can involve more than one attribute at a time, our tool can detect unfair behaviors at a finer level of granularity. The main difference between these approaches and our framework, which resolves unfairness by adopting a preprocessing technique, is that our system does not need a classifier to work, because it is based on finding conditions (in the form of approximate constraints) that are already present in the data, even though possibly with some level of approximation.

Another related topic regards rule-based data-cleaning techniques [7]. The scope of these approaches is to eliminate the inconsistencies present in a dataset by making sure that all the rules (integrity constraints) hold on the final, clean one. In our case, the final objective is quite different: first, we don't modify the values of specific records present in the dataset, but mainly focus on adding tuples to make it more fair; indeed, we are not interested in ensuring that all the tuples respect the discovered dependencies, but, rather, that the final repaired dataset does not contain unfair dependencies.
6. Conclusion and Future Works

We presented a novel framework that, through the extraction of a particular type of Functional Dependencies, can discover and mitigate bias and discrimination present in datasets. Future work will include: (i) the study of data equity, operationalizing this definition and studying how it augments or contradicts existing definitions of fairness; (ii) the development of a graphical user interface to facilitate the interaction of the user with the system; and (iii) the study of other classes of functional dependencies [8].

References

[1] F. Azzalini, et al., FAIR-DB: Functional dependencies to discover data bias, Workshop Proceedings of the EDBT/ICDT (2021).
[2] F. Azzalini, et al., A short account of FAIR-DB: a system to discover data bias, SEBD (2021).
[3] S. Verma, J. Rubin, Fairness definitions explained, in: FairWare, IEEE, 2018.
[4] A. Asudeh, et al., Assessing and remedying coverage for a given dataset, in: ICDE, 2019.
[5] S. Hajian, et al., Discrimination- and privacy-aware patterns, Data Min. Knowl. Discov. 29 (2015) 1733–1782.
[6] R. K. Bellamy, et al., AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias, IBM Journal of Research and Development, 2019.
[7] I. F. Ilyas, X. Chu, Data cleaning, Morgan & Claypool, 2019.
[8] L. Caruccio, et al., Relaxed functional dependencies — a survey of approaches, TKDE (2015).