=Paper=
{{Paper
|id=Vol-3194/paper40
|storemode=property
|title=Ambiguity Detection and Textual Claims Generation from Relational Data
|pdfUrl=https://ceur-ws.org/Vol-3194/paper40.pdf
|volume=Vol-3194
|authors=Enzo Veltri,Donatello Santoro,Gilbert Badaro,Mohammed Saeed,Paolo Papotti
|dblpUrl=https://dblp.org/rec/conf/sebd/VeltriSB0P22
|wikidataid=Q117345157
}}
==Ambiguity Detection and Textual Claims Generation from Relational Data==
<pdf width="1500px">https://ceur-ws.org/Vol-3194/paper40.pdf</pdf>
<pre>
Ambiguity Detection and Textual Claims Generation
from Relational Data
(Discussion Paper)

Enzo Veltri¹, Donatello Santoro¹, Gilbert Badaro², Mohammed Saeed² and Paolo Papotti²

¹ Università degli Studi della Basilicata (UNIBAS), Potenza, Italy
² EURECOM, Biot, France

Abstract
Computational fact checking (given a textual claim and a table, verify whether the claim holds
w.r.t. the given data) and data-to-text generation (given a subset of cells, produce a sentence
describing them) both exploit the relationship between relational data and natural language
text. Despite promising results in these areas, state-of-the-art solutions simply fail to manage
“data ambiguity”, i.e., the case where there are multiple interpretations of the relationship
between the textual sentence and the relational data. To tackle this problem, we present Pythia,
a system that, given a relational table 𝐷, generates textual sentences that contain factual
ambiguities w.r.t. the data in 𝐷. Such sentences can then be used to train target applications
in handling data ambiguity.
  In this paper, we discuss how Pythia generates data-ambiguous sentences for a given table in
an unsupervised fashion using data profiling and query generation. We then show how two existing
downstream applications, namely data-to-text generation and computational fact checking, benefit
from Pythia’s generated sentences, improving on state-of-the-art results without manual user
effort.

Keywords
Unsupervised Text Generation, Fact Checking, Data to Text Generation, Data Ambiguity

1. Introduction

Ambiguity is common in natural language in many forms [1, 2, 3]. We focus on the ambiguity
of a factual sentence w.r.t. the data in a table, or relation. The problem of data ambiguities
in sentences is relevant for many natural language processing (NLP) applications that use
relational data, from computational fact checking [4, 5, 6] to data-to-text generation [7, 8], and
question answering in general [9, 10].
  Consider a fact checking application, where the goal is to verify a textual claim against
relational data, and the sentence “Carter LA has higher shooting than Smith SF” to be checked
against the dataset 𝐷 in Table 1. Even as humans, it is hard to state whether the sentence is true
or false w.r.t. the data in 𝐷. The challenge is due to the two different meanings that can be
matched to shooting percentage: the claim can refer to statistics about Field Goal (FG%) or
3-point Field Goal (3FG%). We refer to this issue as data ambiguity, i.e., the existence of more
than one interpretation of a factual sentence w.r.t. the data for a human reader.

                          Player   Team   FG%   3FG%   fouls   apps
                     𝑡1   Carter    LA     56    47      4       5
                     𝑡2   Smith     SF     55    50      4       7
                     𝑡3   Carter    SF     60    51      3       3
Table 1
Dataset 𝐷. The sentence “Carter has higher shooting than Smith” is data ambiguous w.r.t. 𝐷.
  Sentences with data ambiguities are not present in existing corpora with sentences annotated
w.r.t. the data, which leads target applications to simply fail in these scenarios. In existing fact
checking applications, the sentence above leads to a single interpretation, which is incorrect in
50% of the cases. This problem is important in practice: in the log of an online application for
fact checking¹, we found that more than 20 thousand claims submitted by the users are data
ambiguous, over 80% of the submitted sentences!
  While high-quality corpora of annotated text sentences traditionally come from extensive and
expensive manual work, in this work we demonstrate that, by exploiting deep learning over the
input table, we can automatically generate SQL scripts that output data-ambiguous annotated
sentences. SQL enables the efficient generation of a large number of annotated sentences for
a given table, which in turn can be used as training data to significantly advance the target
downstream applications.
  In this short paper we describe Pythia, a system for the automatic generation of sentences
with data ambiguities, which effectively solves the problem of crafting training examples at
scale [11, 12]. More precisely, we first introduce our solution based on data profiling, query
generation, and a novel algorithm to detect ambiguous attributes (Section 2). We then illustrate
how users engage with Pythia through interactive parameters, including the ability to load
their own datasets and automatically obtain data-ambiguous sentences. We finally show how
the generated sentences are used as training data to improve the coverage and quality of two
target applications (Section 3).

Figure 1: Pythia takes as input any relational table and generates data ambiguous sentences.

  ¹ https://coronacheck.eurecom.fr

2. System Overview

Consider again the sentence about the higher shooting percentage. The sentence can be obtained
from a SQL query over 𝐷 comparing the FG% and 3FG% attributes for pairs of players.

q1: SELECT CONCAT(b1.Player, b1.Team, 'has higher
          shooting than', b2.Player, b2.Team)
    -- self-join of D to compare every pair of distinct players
    FROM D b1, D b2
    WHERE b1.Player <> b2.Player AND
          b1.Team <> b2.Team AND
          -- the two metrics disagree, so any reading of 'shooting'
          -- flips the truth value: a contradictory sentence
          b1.FG% > b2.FG% AND b1.3FG% < b2.3FG%
  Another example is “Carter has higher 3FG% than Smith”. There are two players named
‘Carter’, and the sentence is true for the one in team ‘SF’ but false for the one in team ‘LA’.
This sentence can also be generated with a query over 𝐷, as sketched below.
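
  The paper does not spell out this second query; the following is a minimal sketch in the
style of 𝑞1 (the row-count subquery is our own illustration), built on the observation that
projecting only part of the composite key (Player without Team) yields a row ambiguous subject:

q2: SELECT CONCAT(b1.Player, 'has higher 3FG% than', b2.Player)
    FROM D b1, D b2
    WHERE b1.Player <> b2.Player AND
          b1.3FG% > b2.3FG% AND
          -- the subject must match more than one row for the
          -- sentence to be row ambiguous
          1 < (SELECT COUNT(*) FROM D b3 WHERE b3.Player = b1.Player)

Over Table 1, this sketch returns exactly the sentence above: Carter SF satisfies the 3FG%
comparison against Smith, and ‘Carter’ matches two rows.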

  Given a relation, the key idea is to model a sentence as the result of a query over the data.
However, coming up with the query automatically from a given relational table is challenging.
Consider again query 𝑞1 above. First, the ambiguous attribute pair is not given with the table,
e.g., FG% and 3FG% in 𝑞1 must be identified. Second, an ambiguous sentence requires a label:
words in the SELECT clause that plausibly refer to the two attributes, e.g., “shooting” in 𝑞1 is
the label for FG% and 3FG%. To handle these challenges, we restrict the query generation search
space by making use of templates, which are instantiated with deep learning models that predict
the query parameters from the given table, as detailed next.
Data Ambiguity with A-Queries. Sentences can be data ambiguous in different ways. For
example, the claim 𝑠1: “Carter LA has better shooting than Smith SF” is ambiguous w.r.t. relation
𝐷 in Table 1 because even as humans we cannot state for sure whether it refers to regular Field
Goal statistics or to performance in the 3-point range. This is an example of attribute ambiguity
over two attributes, but the same sentence can be ambiguous over more than two attributes.
  Sentence 𝑠2: “Carter committed 4 fouls” shows a different kind of data ambiguity. As rows
are identified by two attributes in 𝐷, a reference to only part of the composite key makes it
impossible to identify the right player. This is a row ambiguity over two rows, but the same
sentence can be ambiguous over more rows.
  A sentence can also be ambiguous in both dimensions, over the attributes and the rows at
once. With more dimensions, the number of possible interpretations also increases. We therefore
distinguish between attribute, row, and full for the structure of the ambiguity. We further
distinguish between contradictory and uniform sentences. The first type leads to interpretations
that have opposite factual match w.r.t. the data. Sentences 𝑠1 and 𝑠2 are both contradictory, as
one interpretation is confirmed by the data and the other is refuted. For uniform sentences, all
interpretations have equal matching w.r.t. the data, e.g., the sentence “Carter has more
appearances than Smith” is refuted for both Carter players in 𝐷. We found the distinction between
contradictory and uniform sentences crucial in target applications, as it guarantees more control
over the output corpora.
  An ambiguity query (or a-query) executed over table 𝐷 returns sentences 𝑠1, . . . , 𝑠𝑛 with
data ambiguity w.r.t. 𝐷. Consider again 𝑞1 defined over 𝐷: it returns all the pairs of distinct
players that lead to contradictory sentences. The same query can be modified to obtain uniform
sentences by changing one operator in one of the WHERE clauses comparing statistics, e.g.,
change ‘<’ to ‘>’ in the last clause. Finally, by removing the last two WHERE clauses, the query
returns both contradictory and uniform sentences.
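
  For concreteness, the uniform true variant of 𝑞1 under that modification would read as
follows (this rewriting is implied by the text above rather than listed in the paper):

q1': SELECT CONCAT(b1.Player, b1.Team, 'has higher
           shooting than', b2.Player, b2.Team)
     FROM D b1, D b2
     WHERE b1.Player <> b2.Player AND
           b1.Team <> b2.Team AND
           -- both metrics now agree, so every interpretation of
           -- 'shooting' makes the sentence true
           b1.FG% > b2.FG% AND b1.3FG% > b2.3FG%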

From the Table to the Sentences. A-queries can be obtained from parametrized SQL A-Query
Templates. The parameters identify the elements of the query and can be assigned to (i) relation
and attribute names, (ii) a set of labels, i.e., words whose meaning applies to two or more
attributes, and (iii) comparison operators.
  Given an A-Query Template, it is possible to generate multiple a-queries. Pythia already stores
multiple handcrafted SQL templates that cover different ambiguity types and can be widely used
with different datasets, but it also allows users to add custom templates using a predefined
syntax, as sketched below.
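
  Pythia’s template syntax is not shown in this paper; the following hypothetical template,
with <...> placeholders standing for the parameters listed above, illustrates the idea (the
placeholder notation is ours):

tA: SELECT CONCAT(b1.<key>, 'has <op-word> <label> than', b2.<key>)
    FROM <R> b1, <R> b2
    WHERE b1.<key> <> b2.<key> AND
          -- <op1>/<op2> select the strategy: opposite operators yield
          -- contradictory sentences, equal operators uniform ones
          b1.<A1> <op1> b2.<A1> AND b1.<A2> <op2> b2.<A2>

Instantiating <R> = D, <key> = Player, <A1> = FG%, <A2> = 3FG%, <label> = shooting,
<op1> = ‘>’, and <op2> = ‘<’ recovers a query close to 𝑞1.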

  Figure 2 summarizes the overall process. Starting from dataset 𝐷, key/FD profiling methods
and our ambiguous attribute discovery algorithm identify the metadata to fill in the A-Query
Templates provided in Pythia. All compatible operators are used by default. Metadata, operators,
and templates can be refined and extended manually by users. Once the template is instantiated,
the A-Query is ready for execution over 𝐷 to obtain the data ambiguous sentences, each annotated
with the relevant cells in 𝐷.
  Parameter values to instantiate the templates are automatically populated by Pythia using
existing profiling methods and a novel deep learning method that finds attribute ambiguities
together with the corresponding ambiguous words. We remark that our focus is on generating
sentences with genuine data ambiguity, in contrast to cases where the correct meaning of the
text is entirely clear to a human but an algorithm detects more than one interpretation in the
data. For example, “Carter SF has played 3 times” clearly refers to the player appearances (apps),
but an algorithm could be uncertain between the attributes fouls and apps. We refer the reader
to the full paper for details about our deep learning solution [12].


3. Experiments

The section is organized in two parts. In the first part, we focus on generating data-ambiguous
sentences from a given table. In the second part, we show how the generated corpus can improve
three target applications.

3.1. Dataset Generation

Pythia provides a visual interface that allows users to work with preloaded text generation
scenarios or to create a new one. Figure 2a shows the configuration of a new scenario. First, the
user uploads a dataset. Then metadata such as primary keys, composite keys, and FDs can be
manually provided or obtained with profiling [13]. The core of Pythia is built on top of A-Query
Templates. Pythia comes with a set of three generic default templates that were manually crafted
from the analysis of an online fact checking application’s log. The three templates cover more
than 90% of user-submitted claims with data ambiguities. More A-Query Templates can be provided
by the user. Parameter values for the templates are automatically populated by Pythia using the
schema metadata and a novel deep learning module that finds ambiguous attributes and the word
that describes both attributes (the label). Pythia allows the user to submit more ambiguous
attributes and labels, ultimately providing feedback that further improves the underlying ML
module.

Figure 2: Pythia configuration of a scenario (a) and output visualization with data ambiguous
sentences (b).

  After the scenario is defined, the user generates ambiguous sentences as depicted in Fig-
ure 2b. To control the types of ambiguities in the generated dataset, it is possible to select the
structure type, i.e., the A-Query Templates to use, the strategies (Contradictory, Uniform True,
Uniform False), and the number of output sentences per a-query. For every generated sentence,
the system outputs the pair of template and a-query that generated it, together with the data in
the table that relates to the sentence. In this step, the users can also fine-tune templates and
queries interactively. The generated corpus can then be exported in CSV, JSON, and application-
specific formats.
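
  As an illustration, one exported record could look as follows; this JSON shape is hypothetical
(the field names are ours, not Pythia’s documented format), but it reflects the outputs listed
above: the sentence, the template/a-query pair that generated it, and the related data:

{
  "sentence": "Carter LA has higher shooting than Smith SF",
  "strategy": "contradictory",
  "template": "tA",
  "a_query": "q1",
  "related_cells": [
    {"row": "t1", "attributes": ["Player", "Team", "FG%", "3FG%"]},
    {"row": "t2", "attributes": ["Player", "Team", "FG%", "3FG%"]}
  ]
}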

3.2. Use cases

In this section we use seven datasets: Iris, Abalone, Adults, and Mushroom from the UCI
repository [14]; two versions of the Basket dataset, one with acronyms and one with full names
for the attributes [15]; and a version of Soccer crawled from the web. Training data for the
downstream applications is generated using three datasets (Iris, Basket with full names, and
Soccer); the remaining datasets are used as test data. For each dataset, we generate sentences
with the contradictory, uniform true, and uniform false strategies. We denote with 𝑃𝑡 and 𝑃𝑑
the training and development (test) data generated with Pythia, respectively.

                                                  BLEU     ROUGE
    Original (ToTTo [7])                          0.547    0.202
    ToTTo fine-tuned with Pythia’s output        37.631    0.631

Table 2
Impact of 𝑃𝑡 data on table-to-text generation.

  ToTTo. ToTTo [7] is a dataset for table-to-text generation, i.e., given a set of cells, the goal
is to generate a sentence describing the input data. ToTTo contains annotated sentences that
involve reasoning and comparisons among rows, columns, or cells. We test a T5 model [16]
trained on the original ToTTo for sentence generation. The results show that ambiguities are
not handled in ToTTo, and thus the model simply fails when it makes predictions over 𝑃𝑑. We
use 𝑃𝑡 to fine-tune the model. This fine-tuning step improves the performance of the model
in terms of BLEU and ROUGE metrics, according to the results in Table 2 (higher is better).

                                             CLASS       P      R      F1
                                           Ambiguous    n.a.   n.a.   n.a.
    Feverous Baseline                       Supports    0.33   1.00   0.49
                                             Refutes    0.00   0.00   0.00
                                           Ambiguous    0.92   0.85   0.89
    Feverous Baseline trained on 𝑃𝑡         Supports    0.96   0.95   0.95
                                             Refutes    0.88   0.95   0.92

Table 3
Impact of 𝑃𝑡 data on fact checking (Feverous).

  Feverous. Feverous [17] is a dataset for fact checking containing textual claims. Every
claim is annotated with the data identified by human annotators to Support or Refute it.
Unfortunately, a model trained with the Feverous baseline pipeline² fails on the classification
of ambiguous claims from 𝑃𝑑, as the Ambiguous label is not present in the original dataset.
Training the same model with 𝑃𝑡, which also contains examples of ambiguity, radically improves
the performance for all classes. Table 3 reports the results in terms of precision, recall, and
F-measure.
  CoronaCheck. As a second fact checking application, we consider the verification of
statistical claims related to COVID-19 [5]. We incorporate the contributions of Pythia to improve
an existing system³. The original system was not able to handle attribute ambiguity. Also, row
ambiguity caused the original system to hallucinate, with lower classification accuracy for
ambiguous claims. One of the tables used for verifying the statistical claims contains, among
others, the following attributes: country, date, total_confirmed, new_confirmed,
total_fatality_rate, and total_mortality_rate. From the analysis of the log of claims submitted
by users, an example of a common attribute ambiguity is between the attributes total_fatality_rate
and total_mortality_rate for sentences containing “death rate”. Another example of attribute
ambiguity is between total_confirmed and new_confirmed when sentences that only mention “cases”
are verified. Examples of row ambiguity are claims that refer to two records with the same
location but different timestamps, such as “In France, 10k confirmed cases have been reported”
(today or yesterday?).

  ² https://github.com/Raldir/FEVEROUS
  ³ https://coronacheck.eurecom.fr

                          Users’    Accuracy       Accuracy
                          Claims    original    original + Pythia
    Row Amb.                44        25/44          36/44
    Attribute Amb.           9         0/9            6/9
    No Ambiguity             6         6/6            6/6
    Total                   50        31/50          42/50

Table 4
Impact of 𝑃𝑡 data on fact checking (CoronaCheck).

  Pythia enabled us to improve CoronaCheck by automatically generating ambiguity-aware
training data. The new corpora have been used to train new classifiers that allowed the
extension of the original system. We consider 50 claims from the log of CoronaCheck and
we study the types of ambiguities they include. As shown in Table 4, 88% include row or
attribute ambiguities; 83% of those ambiguous claims have row ambiguity, and 17% have both
attribute and row ambiguity. The original CoronaCheck fails to classify claims with ambiguity.
However, training the system with the output of Pythia leads to handling most of the row and
attribute ambiguities. To show the benefit of using Pythia compared to the original system, we
consider the following example: “June 2021: France has 111,244 Covid-19 deaths” (True for
total_deaths, False for new_deaths). The original system returns a False decision by checking
against new_deaths only. With Pythia support, the decision is True for total_deaths and False
otherwise.

References

 [1] [n.d.], Notes on ambiguity, https://cs.nyu.edu/~davise/ai/ambiguity.html, 2021.
 [2] D. Newman-Griffis, et al., Ambiguity in medical concept normalization: An analysis of
     types and coverage in electronic health record datasets, Journal of the American Medical
     Informatics Association 28 (2021) 516–532.
 [3] P. Lapadula, G. Mecca, D. Santoro, L. Solimando, E. Veltri, Humanity is overrated. Or not.
     Automatic diagnostic suggestions by Greg, ML, in: European Conference on Advances in
     Databases and Information Systems, Springer, 2018, pp. 305–313.
 [4] N. Hassan, et al., Claimbuster: The first-ever end-to-end fact-checking system, Proc. VLDB
     Endow. 10 (2017) 1945–1948.
 [5] G. Karagiannis, M. Saeed, P. Papotti, I. Trummer, Scrutinizer: A mixed-initiative approach
     to large-scale, data-driven claim verification, Proc. VLDB Endow. 13 (2020) 2508–2521.
 [6] Y. Wu, P. K. Agarwal, C. Li, J. Yang, C. Yu, Computational fact checking through query
     perturbations, ACM Trans. Database Syst. 42 (2017) 4:1–4:41.
 [7] A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, D. Das, ToTTo: A
     controlled table-to-text generation dataset, in: EMNLP, ACL, 2020, pp. 1173–1186.
 [8] I. Trummer, Data vocalization with CiceroDB, in: CIDR, www.cidrdb.org, 2019.
 [9] W. Chen, M. Chang, E. Schlinger, W. Y. Wang, W. W. Cohen, Open question answering
     over tables and text, in: ICLR, OpenReview.net, 2021.
[10] J. Thorne, M. Yazdani, M. Saeidi, F. Silvestri, S. Riedel, A. Y. Levy, From natural language
     processing to neural databases, Proc. VLDB Endow. 14 (2021) 1033–1039.
[11] E. Veltri, D. Santoro, G. Badaro, M. Saeed, P. Papotti, Pythia: Unsupervised generation of
     ambiguous textual claims from relational data, in: SIGMOD, 2022.
     doi:10.1145/3514221.3520164.
[12] E. Veltri, G. Badaro, M. Saeed, P. Papotti, Pythia: Generating Ambiguous Sentences From
     Relational Data, https://www.eurecom.fr/~papotti/pythiaTr.pdf, 2021.
[13] Z. Abedjan, L. Golab, F. Naumann, T. Papenbrock, Data Profiling, Synthesis Lectures on
     Data Management, Morgan & Claypool Publishers, 2018.
[14] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[15] S. Wiseman, S. Shieber, A. Rush, Challenges in data-to-document generation, in:
     Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
     Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2253–2263.
     URL: https://aclanthology.org/D17-1239. doi:10.18653/v1/D17-1239.
[16] C. Raffel, et al., Exploring the limits of transfer learning with a unified text-to-text
     transformer, Journal of Machine Learning Research 21 (2020).
[17] R. Aly, et al., FEVEROUS: Fact extraction and VERification over unstructured and
     structured information, in: Thirty-fifth Conference on Neural Information Processing
     Systems (NeurIPS Datasets and Benchmarks), 2021.
</pre>
