Paper

Paper
edit
description
id	Vol-3194/paper19
wikidataid	Q117344918→Q117344918
title	Counting Database Repairs Entailing a Query: The Case of Functional Dependencies
pdfUrl	https://ceur-ws.org/Vol-3194/paper19.pdf
dblpUrl	https://dblp.org/rec/conf/sebd/CalauttiLPS22
volume	Vol-3194→Vol-3194
session	→
Counting Database Repairs Entailing a Query: The Case of Functional Dependencies

Counting Database Repairs Entailing a Query: The
Case of Functional Dependencies
(Discussion Paper)

Marco Calautti1 , Ester Livshits2 , Andreas Pieris2,3 and Markus Schneider2
1
  DISI, University of Trento, Italy
2
  University of Edinburgh, UK
3
  University of Cyprus, Cyprus


                                         Abstract
                                         A key task in the context of consistent query answering is to count the number of repairs that entail the
                                         query, with the ultimate goal being a precise data complexity classification. This has been achieved in
                                         the case of primary keys and self-join-free conjunctive queries (CQs) via an FP/♯P-complete dichotomy.
                                         We lift this result to the more general case of functional dependencies (FDs). Another important task in
                                         this context is whenever the counting problem in question is intractable, to classify it as approximable,
                                         i.e., the target value can be efficiently approximated with error guarantees via a fully polynomial-time
                                         randomized approximation scheme (FPRAS), or as inapproximable. Although for primary keys and CQs
                                         (even with self-joins) the problem is always approximable, we prove that this is not the case for FDs. We
                                         show, however, that the class of FDs with a left-hand side chain forms an island of approximability. We
                                         see these results, apart from being interesting in their own right, as crucial steps towards a complete
                                         classification of approximate counting of repairs in the case of FDs and self-join-free CQs.




1. Introduction
A database is inconsistent if it does not satisfy its integrity constraints. There is a consensus
that inconsistency is a real-life phenomenon that arises due to many reasons such as integration
of conflicting sources. With the aim of obtaining conceptually meaningful answers to queries
posed over inconsistent databases, Arenas, Bertossi, and Chomicki introduced in the late 1990s
the notion of Consistent Query Answering (CQA) [1]. The key elements underlying CQA are
(i) the notion of (database) repair of an inconsistent database 𝐷, that is, a consistent database
whose difference with 𝐷 is somehow minimal, and (ii) the notion of query answering based on
certain answers, that is, answers that are entailed by every repair.

Example 1. Consider the relation name Employee(id, name, dept) that comes with the constraint
that the attribute id functionally determines name and dept. Consider also the database 𝐷 con-
sisting of the tuples: (1, Bob, HR), (1, Bob, IT), (2, Alice, IT), (2, Tim, IT). It is easy to see that
𝐷 is inconsistent since we are uncertain about Bob’s department, and the name of the employee
with id 2. To devise a repair, we need to keep one tuple from each conflicting pair, which leads to a
SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
" marco.calautti@unitn (M. Calautti); ster.livshits@ed.ac.uk (E. Livshits); apieris@inf.ed.ac.uk (A. Pieris);
m.schneider@ed.ac.uk (M. Schneider)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
�maximal subset of 𝐷 that is consistent. Observe now that the query that asks whether employees 1
and 2 work in the same department is true only in two out of four repairs, and thus, not entailed.

Counting Repairs Entailing a Query. A key task in this context is to count the number of
repairs of an inconsistent database 𝐷 w.r.t. a set Σ of constraints that entail a given query 𝑄;
for clarity, we base our discussion on Boolean queries. Given a set Σ of constraints and a query
𝑄, the problem ♯Repairs(Σ, 𝑄) that takes as input a database 𝐷, and asks for the number of
repairs of 𝐷 w.r.t. Σ that entail 𝑄, can be tractable or intractable depending on the shape of Σ
and 𝑄. This leads to the natural question whether we can establish a complete classification,
i.e., for every Σ and 𝑄, classify ♯Repairs(Σ, 𝑄) as tractable or intractable by simply inspecting
Σ and 𝑄. We already know from [2] that for a set Σ of primary keys, and a self-join-free
conjunctive queries (SJFCQs) query 𝑄, ♯Repairs(Σ, 𝑄) is either in FP or ♯P-complete, and we
can determine in polynomial time, by simply analyzing Σ and 𝑄, which complexity statement
holds. However, the dichotomy result in [2] does not apply when we consider the more general
class of functional dependencies (FDs). This brings us to the following question:
Question 1: Can we lift the dichotomy result for primary keys and SJFCQs to the more general
case of functional dependencies?
   The closest known result related to Question 1 is for the problem ♯Repairs(Σ), where Σ is
a set of FDs, that given a database 𝐷, asks for the number of repairs of 𝐷 w.r.t. Σ (without
considering a query). In particular, we know from [3] that whenever Σ has a so-called left-hand
side (LHS, for short) chain (up to equivalence), ♯Repairs(Σ) is in FP; otherwise, it is ♯P-complete.
Approximate Counting. Another key task is to classify ♯Repairs(Σ, 𝑄), whenever it is
intractable, as approximable via a so-called fully polynomial-time randomized approximation
scheme (FPRAS), or as inapproximable.
   For a set Σ of primary keys, and a CQ 𝑄 (even with self-joins), ♯Repairs(Σ, 𝑄) is always
approximable [4]. The next question is whether we can obtain a full classification for FDs:
Question 2: For a set Σ of FDs, and an SJFCQ 𝑄, can we determine whether ♯Repairs(Σ, 𝑄)
admits an FPRAS by inspecting Σ and 𝑄?
Summary of Contributions. Concerning Question (1), we lift the dichotomy of [2] for primary
keys and SJFCQs to the general case of FDs (Theorem 4). Concerning Question (2), although we
do not establish a complete classification (which would resolve a challenging open problem), we
show that, for every set Σ of FDs with an LHS chain (up to equivalence) and a CQ 𝑄 (even with
self-joins), ♯Repairs(Σ, 𝑄) admits an FPRAS. On the other hand, we show that there is a very
simple set Σ of FDs such that, for every SJFCQ 𝑄, ♯Repairs(Σ, 𝑄) does not admit an FPRAS
(under a standard complexity assumption).


2. Preliminaries
We consider the disjoint countably infinite sets C and V of constants and variables, respectively.
For 𝑛 > 0, let [𝑛] be the set {1, . . . , 𝑛}, and for a finite set 𝑆, let ♯𝑆 be the cardinality of 𝑆.
Relational Databases. A schema S is a finite set of relation names with associated arity; we
write 𝑅/𝑛 to denote that 𝑅 has arity 𝑛 > 0. Each relation name 𝑅/𝑛 is associated with a tuple
�of distinct attribute names (𝐴1 , . . . , 𝐴𝑛 ); we write att(𝑅) for the set {𝐴1 , . . . , 𝐴𝑛 }. A position
of S is a pair (𝑅, 𝐴), where 𝑅 ∈ S and 𝐴 ∈ att(𝑅), that essentially identifies the attribute 𝐴 of
𝑅. A atom 𝛼 over S is an expression of the form 𝑅(𝑡1 , . . . , 𝑡𝑛 ), where 𝑅/𝑛 ∈ S, and 𝑡𝑖 ∈ C ∪ V
for each 𝑖 ∈ [𝑛]. A fact is an atom mentioning only constants; we may say 𝑅-atom/fact to
indicate that the relation name is 𝑅. For an atom 𝛼 = 𝑅(𝑡1 , . . . , 𝑡𝑛 ), with (𝐴1 , . . . , 𝐴𝑛 ) being
the tuple of attribute names of 𝑅, we write 𝛼[𝐴𝑖 ] for the term 𝑡𝑖 . A database 𝐷 over S is a finite
set of facts over S. We write 𝐷𝑅 , for 𝑅 ∈ S, for the database {𝑅(𝑡¯) | 𝑅(𝑡¯) ∈ 𝐷}. The active
domain of 𝐷, denoted dom(𝐷), is the set of constants occurring in 𝐷.
Functional Dependencies. A functional dependency (FD) 𝜑 over a schema S is an expression
of the form 𝑅 : 𝑋 → 𝑌 , where 𝑅/𝑛 ∈ S and 𝑋, 𝑌 ⊆ att(𝑅); we say 𝑅-FD to indicate that
the relation name is 𝑅. We call 𝜑 a key if 𝑋 ∪ 𝑌 = att(𝑅). Given a set Σ of FDs over S, we
write Σ𝑅 , for 𝑅 ∈ S, for the set {𝜑 ∈ Σ | 𝜑 is an 𝑅-FD}. We call Σ a set of primary keys if it
consists only of keys, and |Σ𝑅 | ≤ 1 for each 𝑅 ∈ S. A database 𝐷 over S satisfies an FD 𝜑
over S, denoted 𝐷 |= 𝜑, if, for every two 𝑅-facts 𝑓, 𝑔 ∈ 𝐷 the following holds: 𝑓 [𝐴] = 𝑔[𝐴]
for every 𝐴 ∈ 𝑋 implies 𝑓 [𝐵] = 𝑔[𝐵] for every 𝐵 ∈ 𝑌 . We say that 𝐷 is consistent w.r.t. a set
Σ of FDs, written 𝐷 |= Σ, if 𝐷 |= 𝜑 for every 𝜑 ∈ Σ; otherwise, 𝐷 is inconsistent w.r.t. Σ. Two
sets of FDs Σ, Σ′ are equivalent if, for every database 𝐷, 𝐷 |= Σ iff 𝐷 |= Σ′ .
Conjunctive Queries. A (Boolean) conjunctive query (CQ) 𝑄 over S is an expression of the
form ∃𝑦¯1 , . . . , ∃𝑦¯𝑛 𝑅1 (𝑦¯1 ) ∧ · · · ∧ 𝑅𝑛 (𝑦¯𝑛 ), where each 𝑅𝑖 (𝑦¯𝑖 ), for 𝑖 ∈ [𝑛], is an atom over S.
With abuse of notation we may treat 𝑄 as the set of its atoms. The query 𝑄 is a self-join-free
CQ (SJFCQ) if it mentions every relation name of S at most once. Let var(𝑄) and const(𝑄) be
the set of variables and constants in 𝑄, respectively. A homomorphism from 𝑄 to a database
𝐷 is a function ℎ : var(𝑄) ∪ const(𝑄) → dom(𝐷), which is the identity over C, such that
𝑅𝑖 (ℎ(𝑦¯𝑖 )) ∈ 𝐷 for 𝑖 ∈ [𝑛]. We say that 𝐷 entails 𝑄, and write 𝐷 |= 𝑄 if there exists a
homomorphism from 𝑄 to 𝐷. We point out that in this paper we are only dealing with Boolean
queries, but all the presented results can be generalized to non-Boolean CQs.
Database Repairs. Given a database 𝐷 and a set Σ of FDs, both over a schema S, a repair of
𝐷 w.r.t. Σ is a maximal subset 𝐷′ of 𝐷 such that 𝐷′ |= Σ. Let repΣ (𝐷) be the set of repairs
of 𝐷 w.r.t Σ. Given a CQ 𝑄 over S, we write repΣ,𝑄 (𝐷) for {𝐷′ ∈ repΣ (𝐷) | 𝐷′ |= 𝑄}. The
following is a useful observation that we will exploit later in the paper:
Lemma 2. Consider a database 𝐷, a set Σ of FDs, and a Boolean SJFCQ 𝑄. We can compute in
polynomial time a database 𝐷′ such that ♯repΣ (𝐷) = ♯repΣ,𝑄 (𝐷′ ).

Problem Definition. For each set Σ of FDs and CQ 𝑄, we focus on the problem ♯Repairs(Σ, 𝑄),
which takes as input a database 𝐷, and asks for the number ♯repΣ,𝑄 (𝐷). The goal is to classify
it as tractable (place it in FP) or as intractable (show that is ♯P-complete).
   Another important task is whenever ♯Repairs(Σ, 𝑄) is intractable to classify it as approx-
imable via a so-called fully polynomial-time randomized approximation scheme (FPRAS, for
short), or as inapproximable. Formally, an FPRAS for ♯Repairs(Σ, 𝑄) is a randomized algorithm
A that takes as input a database 𝐷, and numbers 𝜖 > 0 and 0 < 𝛿 < 1, runs in polyno-
mial(︀ time in ||𝐷||, 1/𝜖 and log(1/𝛿), and produces  )︀ a random variable A(𝐷, 𝜖, 𝛿) such that
Pr |A(𝐷, 𝜖, 𝛿) − ♯repΣ,𝑄 (𝐷)| ≤ 𝜖 · ♯repΣ,𝑄 (𝐷) ≥ 1 − 𝛿.
�LHS Chain FDs. Consider a set Σ of FDs over S, and a relation name 𝑅 ∈ S. We say that
Σ𝑅 has a left-hand side chain (LHS chain) if the FDs of Σ𝑅 can be arranged in a sequence
𝑅 : 𝑋1 → 𝑌1 , . . . , 𝑅 : 𝑋𝑛 → 𝑌𝑛 such that 𝑋1 ⊆ 𝑋2 ⊆ · · · ⊆ 𝑋𝑛 [3]. We call such a sequence
an LHS chain of Σ𝑅 , and say that Σ has an LHS chain if, for every 𝑅 ∈ S, Σ𝑅 has an LHS chain.
We know that for sets of FDs with an LHS chain, counting the number of repairs is tractable:
Proposition 3 ([3]). Given a database 𝐷, and a set Σ of FDs with an LHS chain (up to equiva-
lence), ♯repΣ (𝐷) is computable in polynomial time in ||𝐷||.

3. Exact Counting
The goal of this section is to discuss the key ideas behind our main result on exact counting:
Theorem 4. For a set Σ of FDs, and an SJFCQ 𝑄, ♯Repairs(Σ, 𝑄) is either in FP or ♯P-complete,
and we can decide in polynomial time in ||Σ|| + ||𝑄|| which of the two cases hold.
   We observe that for every set Σ of FDs, and an SJFCQ 𝑄, ♯Repairs(Σ, 𝑄) is in ♯P: simply guess
(in polynomial time) a subset 𝐷′ of the given database 𝐷, and verify (again in polynomial time)
that 𝐷′ is a repair of 𝐷 w.r.t. Σ that entails 𝑄. We also note that the non-existence of an LHS
chain (up to equivalence) is a preliminary boundary for the hard side of the target dichotomy,
regardless of the CQ. We know from [3] that, for a set Σ of FDs, the problem ♯Repairs(Σ),
which given a database 𝐷, asks for ♯repΣ (𝐷), is ♯P-hard whenever Σ does not have an LHS
chain (up to equivalence). From the above discussion, and Lemma 2, we get the following result:
Proposition 5. Consider a set Σ of FDs without an LHS chain (up to equivalence), and an SJFCQ
𝑄. ♯Repairs(Σ, 𝑄) is ♯P-complete.
   From Proposition 5, and the fact that checking whether a set Σ of FDs has an LHS chain (up
to equivalence) is feasible in polynomial time [3], to obtain Theorem 4, it suffices to provide an
analogous result for FDs with an LHS chain (up to equivalence). To this end, following what
was done for primary keys in [2], we are going to concentrate on the problem of computing
the relative frequency of the query. The relative frequency of a CQ 𝑄 w.r.t. a database 𝐷 and a
                                                      ♯rep   (𝐷)
set Σ of FDs is defined as the ratio rfreq𝐷,Σ (𝑄) = ♯repΣ,𝑄(𝐷) that computes the percentage of
                                                           Σ
repairs that entail the query 𝑄. We then define the problem RelFreq(Σ, 𝑄) that takes as input a
database 𝐷, and asks for rfreq𝐷,Σ (𝑄). We can indeed focus on RelFreq(Σ, 𝑄) since we know
from Proposition 3 that ♯Repairs(Σ) is in FP whenever Σ has an LHS chain (up to equivalence),
which implies that ♯Repairs(Σ, 𝑄) and RelFreq(Σ, 𝑄) have the same complexity, that is, both
are either in FP or ♯P-hard. Consequently, to obtain Theorem 4 it suffices to show the following:
Theorem 6. For a set Σ of FDs with an LHS chain (up to equivalence), and an SJFCQ 𝑄,
RelFreq(Σ, 𝑄) is either in FP or ♯P-hard, and we can decide in polynomial time in ||Σ|| + ||𝑄||
which of the two cases hold.

  In the rest, we give some hints on the proof of Theorem 6; we first need some new notions.
Canonical Covers. If Σ has an LHS chain (up to equivalence), we know that it has a single
canonical cover1 Σ′ , and Σ′ has an LHS chain [3]. Furthermore, it can be verified that, for every
    1
    That is, (i) the FDs in Σ are not redundant and have no redundant attributes, and (ii) for a relation name 𝑅 and
𝑋 ⊆ att(𝑅), there is at most one FD in Σ of the form 𝑅 : 𝑋 → 𝑌 .
�relation name 𝑅 occurring in Σ′ , there exists a unique sequence 𝑅 : 𝑋1 → 𝑌1 , . . . , 𝑅 : 𝑋𝑛 →
𝑌𝑛 of the FDs of Σ′𝑅 , which we call the LHS chain of Σ′𝑅 , such that (i) 𝑋𝑖 ⊊ 𝑋𝑖+1 for each
𝑖 ∈ [𝑛 − 1], (ii) 𝑋𝑖 ∩ 𝑌𝑗 = ∅ for each 𝑖, 𝑗 ∈ [𝑛], and (iii) 𝑌𝑖 ∩ 𝑌𝑗 = ∅ for each 𝑖, 𝑗 ∈ [𝑛] with
𝑖 ̸= 𝑗. Since a canonical cover of a set of FDs can be computed in polynomial time, to establish
Theorem 6 it suffices to show it for sets of FDs with an LHS chain that are canonical. Therefore,
in the rest of the section, we assume that sets of FDs are canonical.
Primary FDs and Positions. Consider now an SJFCQ 𝑄. Let 𝛼𝑅 be the 𝑅-atom in 𝑄, and let
Λ𝑅 = 𝑅 : 𝑋1 → 𝑌1 , . . . , 𝑅 : 𝑋𝑛 → 𝑌𝑛 be the LHS chain of Σ𝑅 . We call an FD 𝑅 : 𝑋𝑖 → 𝑌𝑖 , for
some 𝑖 ∈ [𝑛], the primary FD of Σ𝑅 w.r.t. 𝑄 if 𝑋𝑖 ∪𝑌𝑖 contains an attribute 𝐴 such that at position
(𝑅, 𝐴) in 𝛼𝑅 we have a variable, whereas at each position of {(𝑅, 𝐵) | 𝐵 ∈ 𝑋𝑗 ∪ 𝑌𝑗 for 𝑗 < 𝑖}
in 𝛼𝑅 we have a constant. Note that there is no guarantee that the primary FD of Σ𝑅 w.r.t. 𝑄
exists. If the primary FD 𝑅 : 𝑋𝑖 → 𝑌𝑖 of Σ𝑅 w.r.t. 𝑄 exists, for some 𝑖 ∈ [𝑛], then the primary-
lhs positions of 𝛼𝑅 (w.r.t. Σ) are the positions {(𝑅, 𝐴) | 𝐴 ∈ 𝑋𝑖 }, while the non-primary-lhs
positions of 𝛼𝑅 (w.r.t. Σ) are the positions {(𝑅, 𝐵) | 𝐵 ∈ att(𝑅) ∖ 𝑋𝑖 }. We further call the
sequence of FDs 𝑅 : 𝑋1 → 𝑌1 , . . . , 𝑅 : 𝑋𝑖−1 → 𝑌𝑖−1 the primary prefix of Σ𝑅 w.r.t. 𝑄. If the
primary FD of Σ𝑅 w.r.t. 𝑄 does not exist, then, by convention, the primary-lhs positions of 𝛼𝑅
are the positions {(𝑅, 𝐴) | 𝐴 ∈ att(𝑅)}, i.e., all the positions in 𝛼𝑅 are primary-lhs, which
in turn implies that 𝛼𝑅 has no non-primary-lhs positions. Moreover, the primary prefix of
Σ𝑅 w.r.t. 𝑄 is defined as the sequence Λ𝑅 itself. We denote by pvarΣ (𝛼𝑅 ) the set of variables
occurring in 𝛼𝑅 at primary-lhs positions of 𝛼𝑅 .
Query Complex Part. A variable 𝑥 ∈ var(𝑄) is a liaison variable (of 𝑄) if it occurs more than
once in 𝑄. The complex part of 𝑄 w.r.t. Σ, denoted compΣ (𝑄), is the set of atoms 𝛼 of 𝑄 in
which we have a constant or a liaison variable at a non-primary-lhs position (𝑅, 𝐴), with 𝛼
being the 𝑅-atom of 𝑄, such that, for every FD 𝑅 : 𝑋 → 𝑌 in the primary prefix of Σ𝑅 w.r.t. 𝑄,
𝐴 ̸∈ 𝑋 ∪ 𝑌 . We observe that when compΣ (𝑄) = ∅, the number of repairs of 𝐷 w.r.t. Σ that
                                                                                     Σ,𝑄
entail 𝑄 coincides with the number of repairs (without any query) of the subset 𝐷conf    of 𝐷
obtained after removing the facts of 𝐷 that are somehow in a conflict with 𝑄, and thus, they
                                                  Σ,𝑄
cannot appear in a repair that entails 𝑄. Since 𝐷conf can be computed in polynomial time in
||𝐷||, by Proposition 3 we obtain the following:

Proposition 7. Given a database 𝐷, a set Σ of FDs with an LHS chain, and an SJFCQ 𝑄 with
compΣ (𝑄) = ∅, it holds that rfreq𝐷,Σ (𝑄) is computable in polynomial time in ||𝐷||.

3.1. The Tractable Side
We start by defining some properties that, when satisfied by a SJFCQ, allow to restate its relative
frequency in terms of the relative frequency of simpler queries.
   We write 𝑄1 ⊎ 𝑄2 for the CQ consisting of the atoms of two non-empty CQs 𝑄1 , 𝑄2 that
share no atoms and variables, i.e., 𝑄1 ∩ 𝑄2 = ∅ and var(𝑄1 ) ∩ var(𝑄2 ) = ∅. We also write
𝑄𝑥↦→𝑐 , where 𝑥 ∈ var(𝑄) and 𝑐 ∈ C, for the CQ obtained from 𝑄 after replacing 𝑥 with 𝑐.
Theorem 8. Consider a set Σ of FDs with an LHS chain, and a SJFCQ 𝑄. For every database 𝐷,
the following hold:

   1. If 𝑄 = 𝑄1 ⊎ 𝑄2 , then rfreq𝐷,Σ (𝑄) = rfreq𝐷,Σ (𝑄1 ) × rfreq𝐷,Σ (𝑄2 ).
�    2. If compΣ (𝑄) ̸= ∅, and there exists a variable 𝑥 ∈ var(𝑄) such that 𝑥 ∈ pvarΣ (𝛼) for every
       𝛼 ∈ compΣ (𝑄), then there is 𝐷* ⊆ 𝐷, computable in polynomial time in ||𝐷||, such that
                            ⎛                                           ⎞         (︁            )︁
                                                                                           Σ,𝑄
                                     ∏︁ (︀                           )︀     ♯repΣ    𝐷 ∖ 𝐷 conf
         rfreq𝐷,Σ (𝑄) = ⎝1 −                  1 − rfreq𝐷* ,Σ (𝑄𝑥↦→𝑐 ) ⎠ ×                          .
                                                                                 ♯repΣ (𝐷)
                                        𝑐∈dom(𝐷)


    3. If compΣ (𝑄) ̸= ∅, and there exists an atom 𝛼 = 𝑅(𝑡¯) ∈ compΣ (𝑄) such that pvarΣ (𝛼) =
       ∅, and a variable 𝑥 occurs in 𝛼 at a position of {(𝑅,
                                                          ∑︀𝐴) | 𝐴 ∈ 𝑌 }, where 𝑅 : 𝑋 → 𝑌 is the
       primary FD of Σ𝑅 w.r.t. 𝑄, then rfreq𝐷,Σ (𝑄) = 𝑐∈dom(𝐷) rfreq𝐷,Σ (𝑄𝑥↦→𝑐 ).

   Item (1) of Theorem 8 is straightforward, and implies that when 𝑄 = 𝑄1 ⊎ 𝑄2 , one can
compute the relative frequency of 𝑄 in polynomial time, if the same can be done for the relative
frequencies of 𝑄1 and 𝑄2 . Items (2) and (3) are much more involved, and we refer the reader
to the full version of the paper [5]. The main message is that when a SJFCQ 𝑄 satisfies the
conditions of item (2), then its relative frequency is polynomial-time computable if so is for the
relative frequency of 𝑄𝑥↦→𝑐 , for each constant 𝑐 of the database, over a certain easily computable
subset 𝐷* of 𝐷, where 𝑥 is as defined in item (2). Similarly, when 𝑄 satisfies the conditions
of item (3), then its relative frequency is polynomial-time computable, if so is for the relative
frequency of 𝑄𝑥↦→𝑐 , for each constant 𝑐 of the database, where 𝑥 is as defined in item (3).
   For a set Σ of FDs with an LHS chain and an SJFCQ 𝑄, we say that 𝑄 is Σ-safe if either its
complex part is empty, i.e., compΣ (𝑄) = ∅, or after recursively applying the conditions of items
(1)-(3) of Theorem 8 to 𝑄, we are only left with queries having an empty complex part.2 By
definition of Σ-safe queries, and by Proposition 7 and Theorem 8, we obtain the following:
Theorem 9. Consider a set Σ of FDs with an LHS chain, and an SJFCQ 𝑄. If 𝑄 is Σ-safe, then
RelFreq(Σ, 𝑄) is in FP.
   Interestingly, safety exhausts the tractable side of the dichotomy stated in Theorem 6. That
is, if 𝑄 is not Σ-safe, then RelFreq(Σ, 𝑄) is ♯P-hard, as we discuss in the next section. Let us
stress that checking whether 𝑄 is Σ-safe is feasible in polynomial time in ||Σ|| + ||𝑄||, which
establishes the second part of Theorem 6.

3.2. The Hard Side
We conclude this section by briefly discussing how the intractable side of the dichotomy stated
in Theorem 6 is obtained. We know from [2] that, for every set Σ of primary keys, and an
SJFCQ 𝑄 that is not Σ-safe, RelFreq(Σ, 𝑄) is ♯P-hard. Hence, with Theorem 9 in place, stating
that query safety ensures tractability, we can obtain the hard side of the dichotomy by showing:
Theorem 10. There is a set Σ′ of primary keys, and a non Σ′ -safe SJFCQ 𝑄′ such that, for each set
Σ of FDs with an LHS chain and non Σ-safe SJFCQ 𝑄, RelFreq(Σ′ , 𝑄′ ) reduces to RelFreq(Σ, 𝑄).


    2
     We remark that for items (2) and (3), it is enough to recursively apply the conditions of Theorem 8 to 𝑄𝑥↦→𝑐 for
only one arbitrarily chosen constant from C, and thus the notion of Σ-safety is database independent.
�Theorem 10 is shown in two main steps. We first prove that there exists a set Σ* of FDs, and
an SJFCQ 𝑄* such that RelFreq(Σ* , 𝑄* ) can be reduced to RelFreq(Σ, 𝑄) by means of a set of
rewriting rules that extend the ones of [2]. Second, we show that there is a set Σ′ of primary
keys, and an SJFCQ 𝑄′ that is not Σ′ -safe such that, for every database 𝐷, rfreq𝐷,Σ* (𝑄* ) =
rfreq𝐷,Σ′ (𝑄′ ), which then implies that RelFreq(Σ′ , 𝑄′ ) reduces to RelFreq(Σ* , 𝑄* ), as needed.


4. Approximate Counting
We now briefly discuss our results on approximate counting. The main contribution of this
section is the following theorem; BPP is the class of decision problems that are efficiently
solvable via a randomized algorithm with a bounded two-sided error [6].

Theorem 11. For every set Σ of FDs with an LHS chain (up to equivalence), and a CQ 𝑄,
♯Repairs(Σ, 𝑄) admits an FPRAS. Moreover, for Σℎ = {𝑅 : 𝐴1 → 𝐴2 , 𝑅 : 𝐴3 → 𝐴4 }, where
att(𝑅) = {𝐴𝑖 | 𝑖 ∈ [4]}, ♯Repairs(Σℎ , 𝑄) admits no FPRAS, for every SJFCQ 𝑄, unless NP ⊆ BPP.

   To prove the approximability result we observe that ♯Repairs(Σ, 𝑄) can be seen as an in-
stantiation of the so-called union of sets problem ⋃︀[7], which given a succinct representation
of 𝑛 ≥ 1 sets 𝑆1 , . . . , 𝑆𝑛 , asks for the number | 𝑖∈[𝑛] 𝑆𝑖 |. Indeed, given a database 𝐷, let
hom𝐷,Σ (𝑄) be the set of all homomorphic images of 𝑄 in 𝐷 that are consistent w.r.t. Σ.
Formally, hom𝐷,Σ (𝑄) ⃒= {ℎ(𝑄) | ℎ is a homomorphism
                                              ⃒              from 𝑄 to 𝐷 such that ℎ(𝑄) |= Σ}.
Hence, ♯repΣ,𝑄 (𝐷) = ⃒ 𝑖∈[𝑛] repΣ,𝐻𝑖 (𝐷)⃒, assuming that hom𝐷,Σ (𝑄) = {𝐻1 , . . . , 𝐻𝑛 }.3
                          ⃒⋃︀                 ⃒
   From the above discussion, and the results of [7] on the approximability of the union of sets
problem, showing the existence of an FPRAS for ♯Repairs(Σ, 𝑄) boils down to showing that
for each 𝐻 ∈ hom𝐷,Σ (𝑄) we can (i) compute ♯repΣ,𝐻 (𝐷), (ii) sample elements of repΣ,𝐻 (𝐷)
uniformly at random, and (iii) check whether a database is in repΣ,𝐻 (𝐷), all in polynomial time
in ||𝐷||. We prove that the above properties hold when Σ has an LHS chain (up to equivalence),
and the first part of Theorem 11 follows.
   The proof of the inapproximability of ♯Repairs(Σℎ , 𝑄), for every SJFCQ 𝑄, employs a so-
called gap preserving reduction from the promise problem Gap3SAT𝛾 , for some fixed 𝛾 ∈ (0, 18 ),
i.e., the problem that given a Boolean formula 𝜙 in 3CNF, asks whether 𝜙 is satisfiable. The
promise is that if 𝜙 is unsatisfiable, then every truth assignment makes at most 78 + 𝛾 of the
clauses of 𝜙 true. For every 𝛾 ∈ (0, 81 ), Gap3SAT𝛾 is NP-complete [8].
   For a fixed 𝛾 ∈ (0, 81 ), the reduction maps each formula 𝜙 to a database db(𝜙) such that the
following holds: for any yes instance 𝜙𝑌 of Gap3SAT𝛾 , and any no instance 𝜙𝑁 of Gap3SAT𝛾 ,
the “gap” (i.e., the ratio) between the numbers ♯repΣℎ ,𝑄 (db(𝜙𝑌 )) and ♯repΣℎ ,𝑄 (db(𝜙𝑁 )) is so
large that an FPRAS for ♯Repairs(Σℎ , 𝑄) could be used, by employing a small enough error
𝜖, to distinguish between yes and no instances of Gap3SAT𝛾 with high probability, and thus,
placing the NP-complete problem Gap3SAT𝛾 in BPP.
The Difficulty Underlying a Dichotomy. Despite our efforts, we could not get a complete
approximability/inapproximability classification. However, in the full version of the paper [5]
   3
       By abuse of terminology, we treat each 𝐻𝑖 as a CQ consisting of facts.
�we show that obtaining such a classification is as hard as solving the challenging open question
whether ♯MaxMatch admits an FPRAS, i.e., the problem that given a bipartite graph 𝐺, asks for
the number of maximal matchings in 𝐺; please refer to the full version for more details.


5. Conclusion
The obvious problems that remain open are the following: (1) lift the dichotomy of Theo-
rem 4 for FDs and self-join-free CQs to arbitrary CQs with self-joins, and (2) establish an
approximability/inapproximability dichotomy for FDs and SJFCQs.
   We point out that approximate versions of CQA have already been considered, both in
the database and ontological setting. In the database setting, to the best of our knowledge
[4, 9, 10] are the only works that focus on approximations in CQA. In particular, [4] studies
approximation algorithms for the relative frequency under primary keys only, which have also
been experimentally evaluated in [9]. In [10], FDs are considered, but the notion of repair is
based on value updates, rather than tuple deletions, as we do. In the ontological setting, different
approximations have been considered. However, in this setting, approximations are understood
as subsets (i.e., sound but not complete) of the standard consistent answers (e.g., see [11, 12]).


References
 [1] M. Arenas, L. E. Bertossi, J. Chomicki, Consistent query answers in inconsistent databases,
     in: PODS, 1999, pp. 68–79.
 [2] D. Maslowski, J. Wijsen, A dichotomy in the complexity of counting database repairs, J.
     Comput. Syst. Sci. 79 (2013) 958–983.
 [3] E. Livshits, B. Kimelfeld, J. Wijsen, Counting subset repairs with functional dependencies,
     J. Comput. Syst. Sci. 117 (2021) 154–164.
 [4] M. Calautti, M. Console, A. Pieris, Counting database repairs under primary keys revisited,
     in: PODS, 2019, pp. 104–118.
 [5] M. Calautti, E. Livshits, A. Pieris, M. Schneider, Counting database repairs entailing a
     query: The case of functional dependencies, in: PODS (to appear), 2022, pp. 91–103.
 [6] S. Arora, B. Barak, Computational Complexity - A Modern Approach, Cambridge University
     Press, 2009.
 [7] R. M. Karp, M. Luby, N. Madras, Monte-carlo approximation algorithms for enumeration
     problems, J. Algorithms 10 (1989) 429–448.
 [8] J. Håstad, Some optimal inapproximability results, J. ACM 48 (2001) 798–859.
 [9] M. Calautti, M. Console, A. Pieris, Benchmarking approximate consistent query answering,
     in: PODS, 2021, pp. 233–246.
[10] S. Greco, C. Molinaro, Probabilistic query answering over inconsistent databases, Ann.
     Math. Artif. Intell. 64 (2012).
[11] S. Greco, C. Molinaro, I. Trubitsyna, Computing approximate query answers over incon-
     sistent knowledge bases, in: IJCAI, 2018, pp. 1838–1846.
[12] T. Lukasiewicz, E. Malizia, C. Molinaro, Complexity of approximate query answering
     under inconsistency in datalog+/-, in: IJCAI, 2018, pp. 1921–1927.
�
Vol-3194/paper19

Paper

Counting Database Repairs Entailing a Query: The Case of Functional Dependencies

Navigation menu

Search