Vol-3194/paper45

From BITPlan ceur-ws Wiki

Paper
description  
id  Vol-3194/paper45
wikidataid  Q117344938
title  Improving Conversational Evaluation via a Dependency-Aware Permutation Strategy
pdfUrl  https://ceur-ws.org/Vol-3194/paper45.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/FaggioliF00T22
volume  Vol-3194
session  

Improving Conversational Evaluation via a Dependency-Aware Permutation Strategy


Improving Conversational Evaluation via a Dependency-Aware Permutation Strategy
Guglielmo Faggioli¹, Marco Ferrante¹, Nicola Ferro¹, Raffaele Perego² and Nicola Tonellotto³

¹ University of Padova, Padova, Italy
² ISTI-CNR, Pisa, Italy
³ University of Pisa, Pisa, Italy


Abstract

The rapid growth in number and complexity of conversational agents has highlighted the need for suitable evaluation tools to describe their performance. Current offline conversational evaluation approaches rely on collections composed of multi-turn conversations, each including a sequence of utterances. Such sequences represent a snapshot of reality: a single dialog between the user and a hypothetical system on a specific topic. We argue that this paradigm is not realistic enough: multiple users will ask diverse questions in variable order, even for a conversation on the same topic. In this work¹ we propose a dependency-aware utterance sampling strategy to augment the data available in conversational collections while maintaining the temporal dependencies within conversations. Using the sampled conversations, we show that the current evaluation framework favours specific systems while penalizing others, leading to biased evaluation. We further show how to exploit dependency-aware utterance permutations in our current evaluation framework and increase the power of statistical evaluation tools such as ANOVA.




1. Introduction
The conversational search domain has recently drawn increasing attention from the Information
Retrieval (IR) community. A conversational agent is expected to interact seamlessly with the
user through natural language, either written (i.e. text chat-bots) or spoken (i.e. vocal assistants).
Following the development of conversational systems, their evaluation is also receiving a lot of
attention. According to the best practices proposed by TREC CAsT [2], the principal evaluation
campaign in the conversational domain, the evaluation process is very similar to the one used in
ad-hoc retrieval. It follows the Cranfield paradigm, with a corpus of passage documents, a set of
conversations representing various information needs, and a set of relevance judgements. Each
conversation is a sequence of utterances – i.e., phrases issued by the user during the conversation
– and the relevance judgements are collected for each utterance. Several works [3, 4, 5] have
already recognized the drawbacks of using traditional evaluation approaches in a (multi-turn)
conversational setup. Conversations in the current evaluation collections represent a single
interaction between a user and a hypothetical system. Therefore, when we evaluate using a
conversation represented as a sequence of utterances, we consider a snapshot of reality [4].
Since we have a unique sequence of utterances, we cannot generalize to conversations
   ¹ This is an extended abstract of Faggioli et al. [1].
SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
on the same topic, not present in the collection, that could have happened between the user and the system. We show a series of experiments meant to demonstrate the poor generalizability
of results obtained using offline evaluation collections. Our work can be formalized with the
following research questions:
RQ1 What is the effect of including dependency-aware permuted conversations in the compar-
    ison between systems?

RQ2 Can we improve the evaluation of conversational agents using permuted dialogues?
   By answering the first question, we obtain a sound process to permute utterances of a
conversation, producing new conversations to test conversational systems. We, therefore, use
such conversations to compare models under the current evaluation paradigm, highlighting and
measuring its flaws. Finally, we propose a new strategy to include the permuted conversations
in the evaluation methodology. We do not propose a new evaluation measure – as done for
example in [5, 4] – but show how, by adapting our current instruments, we could partially
mitigate the limitations associated with the evaluation of the conversational systems. Our
main contributions are the following. We show that: i) Modeling a conversation using a single
sequence of utterances only favours some systems, while penalizing others; ii) If we consider
multiple valid permutations of the conversations, the performance of conversational agents
moves from point estimates to distributions of performance (in which the default sequence is
an arbitrary point); iii) By including multiple permutations in the evaluation, we obtain more
reliable and generalizable statistical inference.


2. Related Work
In this work, we focus on the evaluation of Multi-turn Task-driven Conversational search systems.
One of the most peculiar aspects related to the multi-turn conversational task is the role played by
the concept of “context” [6, 7]. The context corresponds to the system’s internal representation
of the conversation state that evolves through time. Correctly maintaining and updating such an internal belief is essential to effectively approach the multi-turn conversational task. Multi-turn
conversational search is also the main focus of the TREC Conversational Assistance Track
(CAsT) campaign [2]. Currently, the track has reached its third edition: a further demonstration
of the interest shown by the community. The evaluation aspect of conversational agents is
consequently drawing increasing interest [8, 3, 4, 5]. Even though several efforts have aimed at developing proper techniques to evaluate conversational systems [5, 3], there is a consensus on the fact that we still lack the proper statistical tools to correctly evaluate such systems. Faggioli
et al. [5] propose to model a conversation through a graph: utterances in a conversation are
linked if they concern the same entities. The authors argue that current evaluation approaches introduce biases in the comparison of systems by considering utterances as independent events. Lipani et al. [4] propose to simulate users through a stochastic process, similar to what is done
in [3]. In particular, each topic is modelled as a set of subtopics (collected manually and using
the available experimental collections). Using crowd assessors, Lipani et al. [4] define a Markov
chain process that should model how users present utterances to the system when interacting
with a conversational agent. This allows producing new simulated conversations. Such a solution partially solves the low-generalizability problem. Nevertheless, the need for online data makes it infeasible for purely offline scenarios, where no users are available.


3. Dependence-aware Utterance Permutation Strategy
Several works [8, 3, 5, 4] recognize the need to increase the variety of conversations to improve the generalizability of offline conversational evaluation. As observed by [4], when conversing with a system about a specific topic, distinct users tend to traverse subtopics in different orders. Generalization would require observing how distinct users interact with the systems when investigating a specific topic: this is not possible in an offline scenario. A possible approach to simulate how users would experience a system consists in permuting the utterances of a given conversation and measuring how the system performs in the new scenario. We cannot, however, permute utterances completely at random. In fact, we might lose the temporal dependency between the moment an entity is mentioned for the first time and its subsequent references. To overcome this limitation, we would have to re-gather the relevance judgements to fit the newly defined anaphoras in the randomly built conversation. This is prohibitive and not suited to an offline evaluation scenario. A better permutation strategy consists in permuting utterances while respecting the temporal dependencies. To this end, we could rely on classification labels (we dub this approach class-based permutation) to identify such dependencies. Following the work by Mele et al. [9], we manually annotate the data using four classes of utterances. We also identify constraints to permute the utterances of a conversation while preserving temporal dependencies. The utterance classes and constraints are the following (a code sketch of the resulting sampling procedure follows the list):
    • First utterance: it expresses the main topic of the conversation and cannot be moved.

    • Self-Explanatory (SE) utterances do not contain any semantic omission. Non-contextual
      retrieval systems can answer such utterances. Being independent of other utterances,
      they can appear in any position inside the conversation.

    • Utterances that depend on the First Topic of the conversation (FT): they contain an
      (often implicit) reference to the general topic of the conversation, subsumed by the first
      utterance. Since FT utterances depend only on the global topic of the dialogue, they
      can be issued at any moment after the first one.

    • Utterances that depend on a Previous Topic (PT): the previous SE utterance contains the
      entity to solve the semantic omission in the current one. PT utterances have to appear
      immediately after their SE utterance but they can be permuted with other PT utterances
      referring to the same SE.
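
To make the constraints above concrete, the following Python sketch shows one possible way to sample valid class-based permutations via rejection sampling. It is only an illustration of the constraints, not the authors' implementation: the class labels, the anchor mapping from each PT utterance to its SE utterance, and the sampling budget are all hypothetical.

import random

# Sketch of the class-based permutation constraints (hypothetical data model):
#   classes[i] in {"FIRST", "SE", "FT", "PT"} is the class of utterance i
#   anchor[i]  is the index of the SE utterance that PT utterance i depends on
#              (None for non-PT utterances)

def is_valid(order, classes, anchor):
    """Check that a candidate order respects the temporal constraints."""
    if classes[order[0]] != "FIRST":
        return False  # the first utterance cannot be moved
    for pos, utt in enumerate(order):
        if classes[utt] != "PT":
            continue  # SE and FT utterances can go anywhere after the first
        se_pos = order.index(anchor[utt])
        if se_pos > pos:
            return False  # a PT utterance must follow its SE utterance
        # only PT utterances attached to the same SE may sit in between
        if any(classes[u] != "PT" or anchor[u] != anchor[utt]
               for u in order[se_pos + 1:pos]):
            return False
    return True

def sample_permutations(utterances, classes, anchor, n=100, seed=42, max_tries=50000):
    """Rejection-sample up to n distinct valid permutations of a conversation."""
    rng = random.Random(seed)
    rest = list(range(1, len(utterances)))
    seen, valid = set(), []
    for _ in range(max_tries):
        if len(valid) >= n:
            break
        order = tuple([0] + rng.sample(rest, len(rest)))
        if order in seen:
            continue
        seen.add(order)
        if is_valid(order, classes, anchor):
            valid.append([utterances[i] for i in order])
    return valid

Since conversations with many SE and FT utterances admit a very large number of valid orders, the experiments below sample a fixed number of permutations (100) per conversation.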


4. Experimental Analysis
In our experimental analysis, we consider the Conversational Assistance Track (CAsT) 2019 [2]. This collection contains 50 multi-turn conversations, each composed of 9 utterances on average. The utterances in their original formulation contain semantic omissions: anaphoras, ellipses and co-references. Among all the conversations, we consider only the 20 test conversations, as their relevance judgements are much more significant. The corpus is composed of approximately 38 million paragraphs from the TREC Complex Answer Retrieval Paragraph Collection (CAR) [10] and the MS MARCO collection. Regarding the relevance judgements, CAsT 2019 contains graded judgements on a scale from 0 to 4. We adopt Normalized Discounted Cumulated Gain (nDCG) with a cutoff at 3, as it is the most widely adopted evaluation measure for this specific scenario [2].
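
For reference, a minimal sketch of nDCG with a cutoff, using the exponential-gain formulation commonly applied to graded judgements (the exact gain function used by the official CAsT evaluation may differ); the example judgements are made up.

import math

def dcg(gains, k):
    # exponential gain with a log2 position discount
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(run_gains, judged_gains, k=3):
    """run_gains: graded judgements (0-4) of retrieved passages, in rank order.
    judged_gains: judgements of all judged passages, used for the ideal ranking."""
    ideal = dcg(sorted(judged_gains, reverse=True), k)
    return dcg(run_gains, k) / ideal if ideal > 0 else 0.0

# hypothetical utterance: a highly relevant passage at rank 1, nothing useful at rank 2
print(ndcg_at_k([4, 0, 2], [4, 3, 2, 2, 1, 0], k=3))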

4.1. Conversational Models
As commonly done [4], we select a set of archetypal conversational models to observe what happens with conversation permutations. If not specified otherwise, we use BM25 as the ranker.
Non-contextual Baseline Models. We consider three non-contextual baseline models, used as a comparison with the other approaches: the Okapi BM25 model with default Terrier parameters (𝑘 = 1.2 and 𝑏 = 0.75); a Query Language Model with Bayesian Dirichlet smoothing and 𝜇 = 2500; a model based on pseudo-relevance feedback RM3 rewriting [11] that considers the 10 most popular terms of the 10 highest-ranked documents.
   Concatenation-Based Models. A simple approach to enrich utterances with context and address the multi-turn conversational challenges consists in concatenating them with one (or more) of the previous ones. We consider three concatenation-based strategies, previously adopted as baselines in the literature [9]. First Utterance (FU): each utterance u_j is concatenated with u_1, the first utterance of the conversation; Context Utterance (CU): each utterance u_j is concatenated with u_1 and u_{j−1}, the previous utterance; Linear Previous (LP): we concatenate u_j with u_{j−1}, linearly weighting the terms: q_j = λ·u_j + (1 − λ)·u_{j−1}, with λ ∈ [0, 1]. We use λ = 0.6, since it provides the best empirical results.
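
As an illustration, a minimal sketch of the three rewriting strategies; the per-term weight dictionary for LP is one plausible reading of q_j = λ·u_j + (1 − λ)·u_{j−1} (e.g. as term boosts passed to the ranker), not necessarily the authors' exact implementation.

def rewrite_fu(utts, j):
    # First Utterance (FU): concatenate u_j with u_1
    return f"{utts[0]} {utts[j]}"

def rewrite_cu(utts, j):
    # Context Utterance (CU): concatenate u_j with u_1 and u_{j-1}
    prev = utts[j - 1] if j > 0 else ""
    return " ".join(t for t in (utts[0], prev, utts[j]) if t)

def rewrite_lp(utts, j, lam=0.6):
    # Linear Previous (LP): weight the terms of u_j by lambda and of u_{j-1} by 1-lambda
    weights = {}
    for term in utts[j].split():
        weights[term] = weights.get(term, 0.0) + lam
    if j > 0:
        for term in utts[j - 1].split():
            weights[term] = weights.get(term, 0.0) + (1.0 - lam)
    return weights  # per-term boosts for the downstream BM25 ranker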
   Pseudo-Relevance Feedback Based Models. We consider two approaches based on pseudo-relevance feedback (PRF) that account for the "multi-turn" aspect. RM3-previous (RM3p): it concatenates the current utterance with the RM3 expansion of the previous one (using BM25 as the first-stage retrieval model); RM3-sequential (RM3s): it takes the relevance feedback from the ranked list retrieved for the previous utterance and uses it to expand the current one. The difference between the two models is that, for RM3p, the ranked list depends only on the previous utterance and the one at hand, whereas RM3s considers the sequence of utterances observed up to the current one.
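
The difference between the two PRF variants can be sketched as follows; retrieve() and rm3_expand() stand for a hypothetical first-stage BM25 run and RM3 term expansion, they are not a real API.

def run_rm3p(utts, retrieve, rm3_expand):
    """RM3-previous: each query is expanded with RM3 terms derived from a run
    on the previous utterance only."""
    results = []
    for j, utt in enumerate(utts):
        if j == 0:
            query = utt
        else:
            prev_ranking = retrieve(utts[j - 1])
            query = utt + " " + rm3_expand(utts[j - 1], prev_ranking)
        results.append(retrieve(query))
    return results

def run_rm3s(utts, retrieve, rm3_expand):
    """RM3-sequential: feedback comes from the ranked list of the previous
    (already expanded) query, so context accumulates across the turns."""
    results, prev_ranking = [], None
    for utt in utts:
        query = utt if prev_ranking is None else utt + " " + rm3_expand(utt, prev_ranking)
        prev_ranking = retrieve(query)
        results.append(prev_ranking)
    return results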
   Language Model-Based Models Among the neural language models, we consider coref-spanBERT
(anCB). This method relies on the Higher-order Coreference Resolution model, as defined in
[12], but employs the spanBERT [13] embeddings to represent the words. In particular, we use
the pre-trained version of the approach available in the AllenNLP framework¹.

4.2. RQ1: Conversational Systems Performance on Permuted Conversations
Table 1 reports the nDCG@3 observed for the different archetypal conversational retrieval baselines, either considering only the original order of the utterances as defined in CAsT 2019 or considering the average over multiple permutations of each conversation. To grant a fair comparison between different conversations, since they can have a different number of valid class-based permutations, we sample only 100 permutations for each of them.
   ¹ https://docs.allennlp.org
Table 1
Performance measured with nDCG@3 for the selected archetypal conversational models. Baseline results do not depend on the order of the utterances. We report the mean both for the standard order of the utterances and over all permuted conversations. For permuted conversations, we also report the minimum and maximum mean over all conversations that can be observed using different permutations.

                                                     orig. order          permutations
                                          model                     min      mean     max
                    baselines             BM25         0.0981      0.0981   0.0981   0.0981
                                          DLM          0.0794      0.0794   0.0794   0.0794
                                          RM3          0.1064      0.1064   0.1064   0.1064
                    concatenation-based   FU           0.1692      0.1692   0.1692   0.1692
                                          CU           0.1687      0.1185   0.1481   0.1809
                                          LP           0.1464      0.0906   0.1279   0.1671
                    PRF-based             RM3p         0.1451      0.1019   0.1353   0.1709
                                          RM3s         0.1639      0.1108   0.1482   0.1857
                    neural LM based       anCB         0.1640      0.1410   0.1553   0.1645


An interesting insight that can be drawn from Table 1 is that the best-performing system is "First Utterance" (FU). The first utterance of the original conversation is often the most generic: if we concatenate it with the other utterances, it boosts their recall, helping them obtain better results. The FU approach obtains the same results even when we permute conversations: since we force the first utterance to remain in its position, the order does not influence this algorithm. If we consider the results achieved with permuted conversations, we observe a general decrease in the average performance, due to the increased variance caused by the permutations. If we consider the maximum achievable performance, interestingly, all the methods can outperform the results achieved with the original order, indicating that there are situations in which different orders are preferable. The change in performance occurs because of the different information flow. The selected conversational models – as the majority of common conversational strategies – exploit the context to resolve the anaphoras and rewrite the utterances. Such context derives from the previous turns. By changing the previous turns, we also change the context, and thus the information used by the system. This aims at mimicking a real-world scenario, where we do not know whether the previous utterances provided good context. Furthermore, such context might change depending on the path followed by the user.
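
The permutation statistics of Table 1 can be obtained with a simple aggregation like the sketch below. This is one plausible reading of the min/mean/max columns (per-conversation extremes averaged over conversations); the data layout passed to the function is hypothetical.

import statistics

def table1_row(scores_by_conv):
    """scores_by_conv: {conversation id: [mean nDCG@3 of a system for each sampled
    permutation of that conversation, with index 0 holding the original order]}."""
    orig = statistics.mean(s[0] for s in scores_by_conv.values())
    mean = statistics.mean(statistics.mean(s) for s in scores_by_conv.values())
    low  = statistics.mean(min(s) for s in scores_by_conv.values())
    high = statistics.mean(max(s) for s in scores_by_conv.values())
    return {"orig. order": orig, "min": low, "mean": mean, "max": high}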
   Figure 1 plots, for each CAsT 2019 conversation, the distribution over the permutations of the average performance of all systems. The yellow diamond represents the mean performance using the default order of the utterances. It is insightful to notice that the default order rarely gives the best performance: using a different order of the utterances strongly influences performance. Such a pattern is also observable for each system individually². Notice that, with the new permuted conversations, it is always possible to cherry-pick conversation permutations to make any model the best in a pairwise comparison.


    ² We do not report the figure for each system, to avoid clutter.
Figure 1 (per-conversation boxplots; nDCG@3 on the y-axis, conversation id on the x-axis; plot not reproduced here): Distributions of the average system performance over different permutations of the conversations, considering the original CAsT 2019 utterances. The yellow diamond is the average performance achieved using the original order of utterances. In most cases the original order of the utterances does not yield the best performance.


 Table 2
 Summary statistics for ANOVA model MD0. This model considers only one permutation for each conversation (the original one, presented in CAsT 2019). Different models do not show significant differences. ω̂²_model is not reported, since ω² is ill-defined for non-significant factors.

                          Source       SS     DF      MS         F       p-value    ω̂²⟨fact⟩
                          topic      1.052     19    0.055    17.454      0.0000       0.758
                          model      0.010      4    0.002     0.762      0.5532          —
                          Error      0.241     76    0.003
                          Total      1.302     99


 4.3. RQ2: Comparing Systems via ANOVA
 We are now interested in assessing the effect of using utterance permutations in the current evaluation scenario. We therefore compare the different retrieval models using ANalysis Of VAriance (ANOVA). If we were to apply ANOVA in the current evaluation setup, we would likely rely on the following model:

                                    y_{ik} = μ_{..} + τ_i + α_k + ε_{ik}                             (MD0)

 where y_{ik} is the mean performance over all utterances of conversation i, obtained using retrieval model k; μ_{..} is the grand mean; τ_i is the contribution to the performance of the i-th conversation; α_k is the effect of the k-th system; finally, ε_{ik} is the error. Table 2 reports the summary statistics for ANOVA when applied to the CAsT 2019 conversations, using Model MD0. We observe that the effect of the "conversation" factor is significant and large-sized (ω² ≥ 0.14).
Table 3
Summary statistics for ANOVA model MD1. This model considers 100 unique permutations for each conversation plus the original one. Observe that the model factor now has a significant effect. We report the Sum of Squares (SS), the Degrees of Freedom (DF), the Mean Squares (MS), the F statistic, the p-value and the Strength of Association (SOA), measured according to the ω² measure.

                       Source          SS       DF       MS         F        p-value    ω̂²⟨fact⟩
                    topic            38.594      19     2.031    657.983      <1e-3        0.722
                    perm(topic)       2.438     940     0.003      0.840       0.999          —
                    model             0.472       4     0.118     38.230      <1e-3        0.030
                    Error            11.842    3836     0.003
                    Total            53.347    4799


This pattern is often observed in many IR scenarios, such as ad-hoc retrieval [14] or Query
Performance Prediction (QPP) [15]. Conversely, the effect of the Model factor is not significant:
none of the models is significantly the best. This indicates the low discriminative power
associated with this evaluation approach.
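
Model MD0 can be fitted, for instance, with statsmodels; the sketch below uses a toy long-format data frame in place of the real per-conversation scores.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# toy data: one mean nDCG@3 value per (conversation, model) pair
df0 = pd.DataFrame({
    "topic": ["t31", "t31", "t32", "t32", "t33", "t33"],
    "model": ["CU",  "LP",  "CU",  "LP",  "CU",  "LP"],
    "ndcg":  [0.17,  0.15,  0.12,  0.10,  0.20,  0.18],
})

md0 = ols("ndcg ~ C(topic) + C(model)", data=df0).fit()
print(sm.stats.anova_lm(md0, typ=2))  # ANOVA table for MD0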
  If we include permutations for each conversation, we can use the following ANOVA model:

                                   y_{i(j)k} = μ_{..} + τ_i + ν_{j(i)} + α_k + ε_{ijk}                                   (MD1)

where, compared to Model MD0, ν_{j(i)} is the nested factor that represents the effect of the j-th permutation of the i-th conversation. Table 3 reports the summary statistics for ANOVA with model MD1. By looking at Table 3 we can see the first huge advantage of including permutations in our evaluation framework: the Model factor is now significant, although small (0.01 < ω² < 0.06). As a side note, Tukey's post-hoc analysis shows that anCB is the best model, followed by RM3s, which belongs to the same tier. Subsequently, we have RM3p and CU, which again are statistically not different from each other, but worse than the previous ones. Finally, LP is the only member of the worst-quality tier. We have moved from having all models equal in Table 2 to a sorting of the models into tiers in Table 3. The Permutation factor is not significant, and this highlights that there is not a single best permutation for every system, but rather an interaction between systems and permutations: distinct models behave differently according to the permutation at hand. Table 3 shows that, if we use the permutations as additional evidence of the quality of a model, we discriminate better between models. Furthermore, we do not know in which order the users will pose their utterances. Including permutations allows us to better model reality: what we observe in our offline experiment is more likely to generalize to a real-world scenario. Permutations enable robust statistical inference without requiring the gathering of new conversations, utterances and relevance judgements.
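
The nested model MD1 and the post-hoc comparison can be sketched in the same way; the data frame below is synthetic and only illustrates the formula, with the patsy nesting operator C(topic)/C(perm) expanding to C(topic) + C(topic):C(perm).

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# synthetic scores: one mean nDCG@3 value per (topic, permutation, model) triple
rng = np.random.default_rng(0)
effect = {"CU": 0.00, "LP": -0.02, "RM3s": 0.01}
rows = [{"topic": t, "perm": p, "model": m,
         "ndcg": 0.15 + effect[m] + rng.normal(0, 0.01)}
        for t in ["t31", "t32", "t33"]
        for p in range(5)
        for m in effect]
df1 = pd.DataFrame(rows)

# MD1: the permutation factor is nested within the conversation (topic) factor
md1 = ols("ndcg ~ C(topic) / C(perm) + C(model)", data=df1).fit()
print(sm.stats.anova_lm(md1, typ=2))

# Tukey HSD post-hoc test to group the models into tiers
print(pairwise_tukeyhsd(df1["ndcg"], df1["model"]))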


5. Conclusions and Future Works
In this work, we showed that traditional evaluation is seldom reliable when applied to conversational search. We proposed a methodology to permute the utterances of the conversations used to evaluate conversational systems, enlarging conversational collections. We showed that it is hard to determine the best system when considering multiple conversation permutations. Consequently, any system can be deemed the best, according to specific permutations of the conversations. Finally, we showed how to use permutations of the evaluation dialogues, obtaining far more reliable and trustworthy system comparisons. As future work, we plan to study how to estimate the distribution of system performance without actually having the permutations and the models at hand.


References
 1. G. Faggioli, M. Ferrante, N. Ferro, R. Perego, N. Tonellotto, A Dependency-Aware Utterances
    Permutation Strategy to Improve Conversational Evaluation, in: Proc. ECIR, 2022.
 2. J. Dalton, C. Xiong, J. Callan, TREC CAsT 2019: The Conversational Assistance Track
    Overview, in: TREC, 2020.
 3. S. Zhang, K. Balog, Evaluating conversational recommender systems via user simulation,
    in: Proc. SIGKDD, 2020, pp. 1512–1520.
 4. A. Lipani, B. Carterette, E. Yilmaz, How Am I Doing?: Evaluating Conversational Search
    Systems Offline, TOIS 39 (2021).
 5. G. Faggioli, M. Ferrante, N. Ferro, R. Perego, N. Tonellotto, Hierarchical Dependence-Aware
    Evaluation Measures for Conversational Search, in: Proc. SIGIR, 2021, pp. 1935–1939.
 6. J. Li, C. Liu, C. Tao, Z. Chan, D. Zhao, M. Zhang, R. Yan, Dialogue history matters!
    personalized response selection in multi-turn retrieval-based chatbots, TOIS 39 (2021).
 7. I. Mele, C. I. Muntean, F. M. Nardini, R. Perego, N. Tonellotto, O. Frieder, Topic Propagation
    in Conversational Search, in: Proc. SIGIR, 2020, pp. 2057–2060.
 8. G. Penha, C. Hauff, Challenges in the evaluation of conversational search systems, in:
    Workshop on Conversational Systems Towards Mainstream Adoption, KDD-Converse, 2020.
 9. I. Mele, C. I. Muntean, F. M. Nardini, R. Perego, N. Tonellotto, O. Frieder, Adaptive utterance
    rewriting for conversational search, IPM 58 (2021) 102682.
10. L. Dietz, M. Verma, F. Radlinski, N. Craswell, TREC Complex Answer Retrieval Overview.,
    in: TREC, 2017.
11. Y. Lv, C. Zhai, Positional relevance model for pseudo-relevance feedback, in: Proc. SIGIR,
    2010, pp. 579–586.
12. K. Lee, L. He, L. Zettlemoyer, Higher-order Coreference Resolution with Coarse-to-fine
    Inference, in: Proc. NAACL-HLT, 2018, pp. 687–692.
13. M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, O. Levy, SpanBERT: Improving pre-
    training by representing and predicting spans, TACL 8 (2020) 64–77.
14. D. Banks, P. Over, N.-F. Zhang, Blind Men and Elephants: Six Approaches to TREC data,
    IRJ 1 (1999) 7–34.
15. G. Faggioli, O. Zendel, J. S. Culpepper, N. Ferro, F. Scholer, An Enhanced Evaluation
    Framework for Query Performance Prediction, in: Proc. ECIR, 2021, pp. 115–129.