Vol-3197/paper11


Paper
description  scientific paper published in CEUR-WS Volume 3197
id  Vol-3197/paper11
wikidataid  Q117341837
title  There and Back Again: Combining Non-monotonic Logical Reasoning and Deep Learning on an Assistive Robot
pdfUrl  https://ceur-ws.org/Vol-3197/paper11.pdf
dblpUrl  https://dblp.org/rec/conf/nmr/SridharanBFG22
volume  Vol-3197


There and Back Again: Combining Non-monotonic Logical Reasoning and Deep Learning on an Assistive Robot


Mohan Sridharan1,*, Chloé Benz2, Arthur Findelair3 and Kévin Gloaguen4

1 Intelligent Robotics Lab, School of Computer Science, University of Birmingham, UK
2 Illinois Institute of Technology, USA
3 Illinois Institute of Technology, USA
4 École Nationale Supérieure de Mécanique et d'Aérotechnique, France


Abstract
This paper describes the development of an architecture that combines non-monotonic logical reasoning and deep learning in virtual (simulated) and real (physical) environments for an assistive robot. As an illustrative example, we consider a robot assisting in a simulated restaurant environment. For any given goal, the architecture uses Answer Set Prolog to represent and reason with incomplete commonsense domain knowledge, providing a sequence of actions for the robot to execute. At the same time, reasoning directs the robot's learning of deep neural network models for human face and hand gestures made in the real world. These learned models are used to recognize and translate human gestures to scenarios that mimic real-world situations in the simulated environment, and to goals that need to be achieved by the robot in the simulated environment. We report the challenges faced in the development of such an integrated architecture, as well as the insights gained from the design, implementation, and evaluation of this architecture by a distributed team of researchers during the ongoing pandemic.

Keywords
Non-monotonic logical reasoning, Probabilistic reasoning, Interactive learning, Robotics



1. Motivation
Consider the motivating example of a mobile robot (Pepper) waiter in a simulated restaurant, as shown in Figure 1. The robot has to perform tasks such as seating customers at suitable tables, taking and delivering food orders, and collecting payment. To perform these tasks, the robot extracts and reasons with the information from different sensors (e.g., camera, range finder) and incomplete commonsense domain knowledge. This knowledge includes relational descriptions of the domain objects and their attributes (e.g., size, number, and relative positions of tables, chairs, and people). It also includes axioms governing actions and change in the domain (e.g., the preconditions and effects of seating a group of people at a particular table), including default statements that hold in all but a few exceptional circumstances (e.g., "customers typically need some time to look at the menu before they place an order"). Since the domain description is incomplete and can change over time, the robot also reasons with its knowledge and sensor observations to revise its knowledge (e.g., revise the number of people seated at different tables, learn the effects of different gestures). Furthermore, to promote better interaction with humans in the restaurant, the robot provides on-demand relational descriptions of its decisions and the evolution of beliefs.

Figure 1: Illustrative snapshot of an assistive robot operating as a waiter in a simulated restaurant scenario.

Realizing the motivating scenario described above poses fundamental challenges in knowledge representation, reasoning, and learning. State of the art robot architectures often seek to address these challenges by using logics and probabilistic methods to represent and reason with domain knowledge and observations, and by using data-driven (deep) learning methods to extract knowledge from large, labeled datasets (e.g., of noisy sensor observations).

NMR 2022: 20th International Workshop on Non-Monotonic Reasoning, August 07–09, 2022, Haifa, Israel
*Corresponding author: m.sridharan@bham.ac.uk (M. Sridharan); chloe.c.benz@gmail.com (C. Benz); arthfind@gmail.com (A. Findelair); k.gloaguen1303@gmail.com (K. Gloaguen)
https://www.cs.bham.ac.uk/~sridharm/ (M. Sridharan); ORCID 0000-0001-9922-8969 (M. Sridharan)
However, practical domains make it difficult to provide a comprehensive encoding of domain knowledge, or the computational resources and examples needed to augment or revise the robot's knowledge. Furthermore, circumstances such as the ongoing pandemic make it rather challenging for a distributed team of researchers to design and evaluate such architectures for integrated robot systems.

This paper makes a two-fold contribution towards addressing the above-mentioned challenges. First, it uses the motivating example to describe the development of an architecture that adapts knowledge representation (KR) tools to achieve transparent, reliable, and efficient knowledge-based reasoning and data-driven learning on an assistive robot. Second, it highlights the advantages of using KR tools, and of formally coupling representation, reasoning, and learning, to design such an architecture. More specifically, our architecture:

  • Represents and performs non-monotonic logical reasoning with incomplete commonsense domain knowledge using Answer Set Prolog (ASP) to obtain a plan of abstract actions for any given goal;
  • Executes each abstract action as a sequence of concrete actions by automatically identifying and reasoning probabilistically about the relevant domain knowledge at a finer granularity;
  • Reasons with domain knowledge to allow humans making hand gestures in the physical world to interact with the simulated robot in a manner that mimics interaction in the physical world; and
  • Reasons with domain knowledge to guide the learning of models for new hand gestures and the corresponding axioms, and for providing on-demand relational descriptions as explanations of the robot's decisions and beliefs.

The interactive interface between the virtual and physical world helped the three undergraduate student authors design, implement, and evaluate the architecture remotely over different time intervals during the pandemic. It also helped us explore the interplay between reasoning and learning. The "there and back again" in the title thus refers to the architecture's on-demand ability to traverse different points in space and time, and to transition between the physical and virtual world for human-robot collaboration. We demonstrate the capabilities of our architecture through experimental results and execution traces of use cases in our motivating restaurant domain.

The remainder of this paper is organized as follows. We begin by discussing related work in Section 2. Next, we describe our architecture and its components in Section 3. The execution traces and results of evaluating our architecture's components are described in Section 4, and the conclusions are described in Section 5.

2. Related Work

There is a well-established history of the use of logics in different AI and robotics applications. The non-monotonic logical reasoning paradigm used in this paper, ASP, has been used by an international community of researchers for many applications in robotics [1] and other fields [2]. There has also been a lot of work over multiple decades on integrating logical and probabilistic reasoning [3, 4, 5], and on using different logics for guiding probabilistic sequential decision making [6]. Our focus here is on building on this work to support transparent knowledge-based reasoning and data-driven learning in integrated robot systems.

There are many methods for learning logic-based representations of domain knowledge. This includes the incremental revision of action operators in first-order logic [7], the inductive learning of domain knowledge encoded as an Answer Set Prolog program [8], and the work on coupling non-monotonic logical reasoning with inductive learning or relational reinforcement learning to learn axioms [9, 10]. Our approach in this architecture is inspired by work in interactive task learning [11]; unlike methods that learn from many training examples, our approach seeks to identify and learn from a limited number of relevant training examples.

Given the use of deep networks in different applications, there is much interest in understanding their operation in terms of the features influencing network outputs [12, 13]. There is also work on neuro-symbolic systems that reason with learned symbolic structure or a scene graph in conjunction with deep networks to answer questions about images [14, 15]. Work in the broader areas of explainable AI and explainable planning can be categorized into two groups. Methods in one group modify or map learned models or reasoning systems to make their decisions more interpretable [16] or easier for humans to understand [17]. Methods in the other group provide descriptions that make a reasoning system's decisions more transparent [18], help humans understand plans [19], and help justify solutions obtained by non-monotonic logical reasoning [20]. Recent survey papers indicate that existing methods: (i) do not fully integrate reasoning and learning to inform and guide each other; (ii) do not fully exploit the available commonsense domain knowledge for reliable, efficient, and transparent reasoning and learning; and (iii) are often agnostic to how an explanation is structured or assume comprehensive domain knowledge [21, 22].

Our work focuses on transparent, reliable, and efficient reasoning and learning in integrated robot systems that combine reasoning with incomplete commonsense domain knowledge and data-driven learning from limited examples. We seek to demonstrate that this objective can be achieved by building on KR tools. To do so, we build on some of the prior work of the lead author with others.
Figure 2: Overview of our architecture combining non-monotonic logical reasoning, probabilistic reasoning, and deep learning for reliable, efficient, and transparent reasoning and learning.

In particular, we build on work on: (i) a refinement-based architecture for representation and reasoning [23]; (ii) explainable agency and theory of explanations [24, 25]; and (iii) combining non-monotonic logical reasoning and deep learning for axiom learning and scene understanding [9, 26]. The novelty is in bringing these different strands together in an architecture, and in facilitating the interactive interface between the virtual and physical worlds for design and evaluation.

3. Architecture Description

Figure 2 presents an overview of the main components of our architecture. As stated earlier, the architecture uses ASP to represent and reason with commonsense domain knowledge, e.g., to reason about object and robot attributes to compute a plan to achieve a given goal. For more complex domains, this reasoning can take place using transition diagrams at two different resolutions, with the fine-resolution diagram defined as a refinement of the coarse-resolution diagram. Execution of the actions by a robot can then involve probabilistic reasoning with a relevant part of the fine-resolution transition diagram. Reasoning informs and guides both the interactive learning of previously unknown domain knowledge (which is used for subsequent reasoning), and the interface for interaction between a human in the physical world and the robot in the virtual world. Reasoning is also used to identify relevant literals and axioms to provide an on-demand description of the robot's decisions and beliefs. The individual components are described below using the following example domain.

Example Domain 1. [Robot Waiter (RW) Domain]
A Pepper robot operates as a waiter in a restaurant. Its tasks include: (i) greeting and seating customers; (ii) taking food orders and delivering food to specific tables; (iii) providing a bill and collecting payment; and (iv) responding to requests from the customer(s) and the designer. The robot uses probabilistic algorithms to model and account for the uncertainty experienced during perception and actuation. Interactions of the robot with a human supervisor are handled through the interface that interprets hand gestures made by a human in the physical world. The robot has incomplete (and potentially imprecise) domain knowledge, which includes the number, size, and location of tables and chairs; spatial relations between objects; and some axioms governing domain dynamics such as:

  • If the robot allocates a group of customers to a table, all members of the group are considered to be seated at that table.
  • The robot cannot seat customers at a table that is not empty, i.e., is occupied.
  • No customer can be allocated to more than one table at a time.

This knowledge, e.g., the axioms describing dynamic changes and the values of some attributes of the domain or robot, may need to be revised over time.

3.1. Representation and Reasoning

To represent and reason with domain knowledge, we use CR-Prolog, an extension of Answer Set Prolog (ASP) that introduces consistency restoring (CR) rules [27]. ASP is based on stable model semantics, and supports default negation and epistemic disjunction, e.g., unlike "¬a", which implies that a is believed to be false, "not a" only implies that a is not believed to be true, and unlike "p ∨ ¬p" in propositional logic, "p or ¬p" is not tautologous. ASP can represent recursive definitions and constructs that are difficult to express in classical logic formalisms, and it supports non-monotonic logical reasoning, i.e., the ability to revise previously held conclusions based on new evidence. We use the terms "CR-Prolog" and "ASP" interchangeably in this paper.

Knowledge representation. A domain's description in ASP comprises a system description 𝒟 and a history ℋ. 𝒟 comprises a sorted signature Σ and axioms encoding the domain's dynamics. Σ comprises basic sorts; statics, i.e., domain attributes that do not change over time; fluents, i.e., domain attributes whose values can be changed; and actions. Note that statics, fluents, and actions are described in terms of the sorts of their arguments. In the RW domain, the robot needs to reason about spatial relations between objects, and to plan and execute actions that change the domain. Such a dynamic domain is modeled in our architecture by first describing Σ and the domain's transition diagram in action language 𝒜ℒd [28]; this description is then translated to ASP statements.
The basic sorts of the RW domain include table, robot, customer, employee, waiter, furniture, gesture, gesture_category, and step for temporal reasoning. The sorts may be organized hierarchically, e.g., chair and table are subsorts of the sort furniture, and the sort employee includes robot and supervisor as subsorts.

Statics of the RW domain include relations edge(node, node) and linked(node, furniture); the former is a graph-based encoding of regions, e.g., see Figure 3, and the latter associates particular tables with particular nodes. Fluents include relations such as location(robot, node), iswaiting(customer), attable(customer, table), occupancy(table, num), and haspaid(customer). Actions of the RW domain include move(robot, node), which causes the robot to move to a particular node; seat(robot, customer, table), which causes the robot to seat particular customer(s) at a particular table; and givebill(robot, table), which causes the robot to give the bill to a customer at a particular table. In addition, the relation holds(fluent, step) implies that a particular fluent holds true at a particular timestep, and occurs(action, step) implies the occurrence of a particular action at a particular timestep of the plan.

Figure 3: Example layout of the RW domain, which organizes the available space into nodes representing regions with specific tables.

Given the signature Σ, axioms describing a domain's dynamics consist of causal laws, state constraints, and executability conditions. For the RW domain, these are translated to statements in ASP such as:

    holds(loc(R, N), I+1) ← occurs(move(R, N), I)                          (1a)
    holds(attable(C, T), I+1) ← occurs(seat(R, C, T), I)                   (1b)
    ¬holds(attable(C, T2), I) ← holds(attable(C, T1), I), T1 ≠ T2          (1c)
    ¬holds(occupancy(T, X2), I) ← holds(occupancy(T, X1), I), X1 ≠ X2      (1d)
    ¬occurs(move(R, N), I) ← holds(loc(R, M), I), ¬edge(M, N)              (1e)
    ¬occurs(givebill(R, T), I) ← ¬holds(wantsbill(T), I)                   (1f)

which encode two causal laws, two state constraints, and two executability conditions respectively. For example, Statement 1(a) is a causal law that implies that executing the move action causes the robot's location to be the desired node in the next time step, Statement 1(c) is a state constraint stating that a customer can only be at one table at a time, and Statement 1(e) is an executability condition that implies that a move to a target location is not possible if it is not connected to the robot's current location. The axioms also encode some default statements that hold in all but a few exceptional situations. For example, in the RW domain, we may want to encode that "clean plates are usually in the kitchen" unless stated otherwise:

    holds(loc(P, kitchen), I) ← holds(clean(P), I), plate(P), not ¬holds(loc(P, kitchen), I)    (2)

where "not" denotes default negation. One potential exception to this axiom is that some clean plates may also be placed near the buffet table; these exceptions can also be encoded. In addition to axioms, information extracted from the sensor inputs (e.g., different hand gestures) is also converted to ASP statements at that time step. Each gesture is also associated with the corresponding axioms; more specific details are provided in Section 3.3.

A dynamic domain's history ℋ typically comprises records of: (a) fluents observed to be true or false at a particular time step; and (b) the actual execution of particular actions at particular time steps:

    obs(fluent, boolean, step)
    hpd(action, step)

Prior work demonstrated that this notion of history can be expanded to include defaults describing the values of fluents in the initial state, along with exceptions [23].

Reasoning. Given the representation of domain knowledge described above, the robot still needs to reason with this knowledge and observations to perform tasks such as inference, planning, and diagnostics. In our architecture, we automatically construct the CR-Prolog program Π(𝒟, ℋ), which includes Σ and the axioms of 𝒟, inertia axioms, reality check axioms, closed world assumptions for actions, and the observations, actions, and defaults from ℋ; a basic version of this program can be viewed online [29]. For planning and diagnostics, this program also includes helper axioms that define a goal, and require the robot to search until a consistent model of the world is constructed and a plan is computed to achieve the goal. Planning, diagnostics, and inference are then reduced to computing answer sets of Π; we use the SPARC system [30] to compute the answer set(s). Each answer set represents the robot's beliefs in a possible world; the literals of fluents and statics at a time step represent the domain's state at that time step. As stated earlier, our architecture's non-monotonic reasoning ability supports recovery from incorrect inferences due to incomplete knowledge or noisy sensor inputs.
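To make this workflow concrete, the sketch below encodes a tiny, self-contained fragment in SPARC-style syntax (the default from Statement (2) with one exception) and calls the solver as an external process. This is only an illustration: it is not the authors' program (a basic version of which is available online [29]), the sort and predicate names are ours, and the assumption that SPARC is invoked as a Java jar with the -A flag should be checked against the solver installation actually used.

```python
import subprocess
import tempfile

# Illustrative SPARC-style fragment: "clean plates are usually in the kitchen"
# (cf. Statement (2)), with one recorded exception. The syntax is a simplified
# stand-in for the RW program and should be checked against the SPARC manual.
RW_FRAGMENT = r"""
sorts
#plate = {p1, p2}.
#place = {kitchen, buffet}.

predicates
clean(#plate).
loc(#plate, #place).

rules
clean(p1).
clean(p2).
% default: a clean plate is in the kitchen unless it is known not to be
loc(P, kitchen) :- clean(P), not -loc(P, kitchen).
% exception: p2 is observed near the buffet table
-loc(p2, kitchen).
loc(p2, buffet).
"""

def compute_answer_sets(program: str, sparc_jar: str = "sparc.jar") -> str:
    """Write the program to a file and ask SPARC for its answer sets."""
    with tempfile.NamedTemporaryFile("w", suffix=".sp", delete=False) as handle:
        handle.write(program)
        path = handle.name
    result = subprocess.run(["java", "-jar", sparc_jar, path, "-A"],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Expected: loc(p1, kitchen) is in the answer set, while loc(p2, kitchen) is not.
    print(compute_answer_sets(RW_FRAGMENT))
```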
Prior work by the lead author and others resulted in an architecture for reasoning with transition diagrams at two resolutions, with the fine-resolution diagram formally defined as a refinement of the coarse-resolution diagram [23]. This definition differs from recent work on refinement and abstraction of ASP programs and other logics [31, 32] in how the transition diagrams are coupled formally to satisfy the requirements in the challenging context of integrated robot systems. This relation guarantees the existence of a path in the fine-resolution transition diagram implementing each coarse-resolution transition. The robot can then use non-monotonic logical reasoning to compute a sequence of abstract actions for any given goal, implementing each abstract action as a sequence of fine-resolution actions by automatically zooming to and reasoning probabilistically with the part of the fine-resolution diagram relevant to the coarse-resolution transition. We build on that notion of relevance to automatically: (a) constrain the robot's attention to the nodes and regions relevant to any given transition or plan that the robot has to execute—this supports selective grounding; (b) limit recognition of hand gestures to the subset relevant to the task at hand, e.g., gestures for placing an order once customers are seated, and limit learning to previously unknown hand gestures and related axioms—see Section 3.3; and (c) provide relational descriptions of decisions by tracing the evolution of relevant beliefs and the application of relevant axioms—see Section 3.3. For ease of understanding, we define the notion of relevance for a given transition; similar definitions can be provided for a given goal or literal.

Definition 1. [Relevant object constants]
Let T = ⟨σ1, a_tg, σ2⟩ be the transition of interest. Let relCon(T) be the set of object constants of the signature Σ of 𝒟 identified using the following rules:

  • Object constants from a_tg are in relCon(T);
  • If f(x1, ..., xn, y) is a literal formed of a domain attribute, and the literal belongs to σ1 or σ2, but not both, then x1, ..., xn, y are in relCon(T);
  • If the body B of an axiom of a_tg contains f(x1, ..., xn, Y), a term whose domain is ground, and f(x1, ..., xn, y) ∈ σ1, then x1, ..., xn, y are in relCon(T).

Object constants from relCon(T) are said to be relevant to T. For example, consider an initial state σ1 with loc(rob1, n1) and loc(waiter, kitchen), and action a_tg = move(rob1, n2). The object constants relevant to this transition then include rob1, n1, n2, and kitchen.

Definition 2. [Relevant system description]
The system description relevant to a transition T = ⟨σ1, a_tg, σ2⟩, i.e., 𝒟(T), is defined by signature Σ(T) and axioms. Σ(T) is constructed to comprise:

  • Basic sorts of Σ that produce a non-empty intersection with relCon(T);
  • All object constants of basic sorts of Σ(T) that form the range of a static attribute;
  • The object constants of basic sorts of Σ(T) that form the range of a fluent, or the domain of a fluent or a static, and are in relCon(T);
  • Domain attributes restricted to Σ(T)'s basic sorts.

Axioms of 𝒟(T) are those of 𝒟 restricted to Σ(T). It can be shown that for each transition in the transition diagram of 𝒟, there is a transition in the transition diagram of 𝒟(T). States of 𝒟(T), i.e., literals comprising fluents and statics in the answer set of the ASP program, and ground actions of 𝒟(T), are candidates for further exploration. Continuing with the example in Definition 1, for a_tg = move(rob1, n2), 𝒟(T) will not include axioms corresponding to other actions, e.g., for seating customers at a table or giving the bill to a customer. If the robot has to perform fine-resolution probabilistic reasoning for action execution, only the refinement of the relevant system description will be considered.
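As a rough illustration of Definition 1, the sketch below collects relevant object constants from a toy transition in which states are plain sets of ground literals. It implements only the first two rules of the definition, and the tuple-based encoding of literals is our simplification rather than the paper's actual data structures.

```python
from typing import FrozenSet, Tuple

# A ground literal is represented here as (attribute_name, (arg1, ..., argN, value));
# a state is a frozenset of such literals. This encoding is an assumption for illustration.
Literal = Tuple[str, Tuple[str, ...]]
State = FrozenSet[Literal]

def relevant_constants(sigma1: State, action: Tuple[str, Tuple[str, ...]],
                       sigma2: State) -> set:
    """Collect object constants relevant to the transition <sigma1, a_tg, sigma2>.

    Implements the first two rules of Definition 1: constants of the action,
    and constants of literals that appear in exactly one of the two states.
    """
    rel = set(action[1])                 # rule 1: object constants from the action
    for literal in sigma1 ^ sigma2:      # rule 2: literals in sigma1 or sigma2 but not both
        rel.update(literal[1])
    return rel

# Example based on the move(rob1, n2) transition discussed above.
sigma1 = frozenset({("loc", ("rob1", "n1")), ("loc", ("waiter", "kitchen"))})
sigma2 = frozenset({("loc", ("rob1", "n2")), ("loc", ("waiter", "kitchen"))})
print(relevant_constants(sigma1, ("move", ("rob1", "n2")), sigma2))
# -> {'rob1', 'n1', 'n2'}; the paper's example also includes 'kitchen',
#    which comes from the third rule of Definition 1 that this sketch omits.
```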
A robot waiter equipped with the representation and reasoning module described above still needs to interact with humans. To support design and evaluation when in-person interaction with the robot is not possible, we incorporated the interactive simulation module, as described below.

3.2. Interactive Simulation and Hand Gestures

We developed a simulation environment and interface for the design and evaluation of our architecture. We used PyBullet [33], a Python-based module for simulating games and domains for machine learning and robotics. It enables us to quickly load different articulated bodies and provides built-in support for forward and inverse kinematics, collision detection, and simulation of domain dynamics.

In our architecture, PyBullet is used to automatically generate a restaurant layout, e.g., see Figure 4, based on the domain information encoded in the ASP program, e.g., Figure 3. Using the built-in blender of PyBullet, we are able to populate the simulated restaurant with a Pepper robot, tables, chairs, and the desired number of customers. We are also able to make on-demand revisions to the domain, e.g., to match changes in the domain knowledge. In addition, our simulator supports the movement of the robot in the restaurant based on the axioms encoded in the ASP program. Furthermore, it is also possible to introduce new objects in the simulator (e.g., using hand gestures, see below) and automatically add this information to the ASP program for further reasoning.

Figure 4: Simulated restaurant layout in PyBullet with robot waiter and customers.

Recall that communication of human instructions to the robot waiter is based on hand gestures made in the physical world. To support such interaction, we first enabled our architecture to recognize a base set of hand gestures; a subset of these gestures is shown in Figure 5 (left). To model and recognize hand gestures, we integrate the OpenPose system [34], which characterizes gestures using 21 keypoints, as shown in Figure 5 (right). After the integration, the simulator allows us to capture images of the hand gestures made in the physical world to quickly train deep network models that can accurately recognize these gestures in new videos (i.e., image sequences). We used an existing Python library for training these deep network models with experimentally determined loss functions—see Figure 6. Note that the modularity of the architecture makes it easy to quickly explore different deep network models without changing other parts of the architecture. The known hand gestures with trained models are then grouped in different categories based on whether they are related to seating customers, handling food orders, or executing terminal transactions (e.g., providing a bill).

Figure 5: (Left) Subset of hand gestures providing directions to the robot (e.g., gestures for Tables 1-5, ordering fries or steak, and asking for the bill); (Right) the 21 keypoints, covering the thumb, index, middle, ring, and little fingers, used to model each hand gesture.

3.3. Interactive Learning and Transparency

The architecture described so far reasons with incomplete domain knowledge, which may lead the robot to make incorrect decisions or cause the robot's performance to suffer, e.g., the robot may compute incorrect or unnecessarily long plans for any given goal. Also, the encoded knowledge and models may need to change over time. We address this requirement by introducing a module for interactive learning and generation of relational descriptions as "explanations" of the robot's decisions and beliefs.

Interactive learning. The interactive learning component of our architecture has two parts. Given the use of hand gestures for human-robot interaction, the first part seeks to detect new gestures and learn models for these gestures. A new hand gesture is detected when the observed gesture differs significantly from any of the known gestures. A significant difference is experimentally determined as a difference in 15% of the keypoints in a sequence of images. When a new gesture is recognized, the robot automatically gathers a sequence of image frames, extracts features from these images, stores them in a separate file, and quickly updates the hand gesture recognition models to include this new gesture. A key feature of our architecture is that reasoning and learning inform and guide each other. For example, when the robot has to recognize and respond to gestures, it automatically limits itself to gestures relevant to its current category of tasks, e.g., a robot delivering food cannot respond to a direction from a supervisor to seat new customers [1]. Also, any newly learned gesture is placed in the appropriate category of gestures (determined based on the purpose of the gesture) for subsequent reasoning. This use of reasoning to direct learning speeds up recognition and learning.

[1] Associating priority levels with tasks will enable the robot to interrupt its current task to execute a higher-priority task.
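As a rough illustration of this first part, the sketch below flags an observed gesture as new when it differs from every known gesture in more than 15% of the OpenPose keypoints, averaged over an image sequence. The per-keypoint distance tolerance and the normalization of keypoint coordinates are assumptions, since the paper only specifies the 15% criterion.

```python
import numpy as np

KEYPOINTS = 21                # OpenPose hand model keypoints
NEW_GESTURE_FRACTION = 0.15   # fraction of differing keypoints used in the paper
POINT_TOLERANCE = 0.05        # per-keypoint distance tolerance (normalized coords); an assumption

def differing_fraction(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Fraction of keypoints that differ between two gesture sequences.

    Each sequence has shape (frames, 21, 2) with normalized (x, y) coordinates.
    A keypoint counts as differing if its mean per-frame distance exceeds
    POINT_TOLERANCE.
    """
    distances = np.linalg.norm(seq_a - seq_b, axis=-1).mean(axis=0)  # shape (21,)
    return float((distances > POINT_TOLERANCE).mean())

def is_new_gesture(observed: np.ndarray, known_gestures: dict) -> bool:
    """A gesture is new if it differs significantly from every known gesture."""
    return all(
        differing_fraction(observed, template) > NEW_GESTURE_FRACTION
        for template in known_gestures.values()
    )

# Usage: known_gestures maps labels (restricted by reasoning to the gestures
# relevant to the current task category) to stored keypoint sequences.
known = {"seat_group": np.zeros((10, KEYPOINTS, 2)),
         "ask_for_bill": np.ones((10, KEYPOINTS, 2)) * 0.5}
observed = np.random.rand(10, KEYPOINTS, 2)
print(is_new_gesture(observed, known))
```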
Figure 6: Learning curves (loss vs. epoch) for acquiring models for the hand gestures using different deep network structures (baseline and improved variants of ANN-3x16, ANN-3x64, and ANN-2x128, with 1691, 12827, and 25499 parameters respectively); models with low loss are obtained over a few epochs when guided by reasoning.
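The paper does not name the library used to train these models; the sketch below uses PyTorch to build one of the small fully connected networks suggested by the labels in Figure 6 (e.g., three hidden layers of 16 units for "ANN-3x16"), taking the 21 OpenPose keypoints of a frame, flattened to 42 (x, y) features, as input. The input encoding and the number of output classes are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 8   # number of gesture labels; an assumption for illustration

class GestureANN(nn.Module):
    """Small fully connected classifier over flattened hand keypoints."""

    def __init__(self, hidden: int = 16, layers: int = 3, in_features: int = 21 * 2):
        super().__init__()
        blocks, width = [], in_features
        for _ in range(layers):
            blocks += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        blocks.append(nn.Linear(width, NUM_CLASSES))
        self.net = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = GestureANN()                      # "ANN-3x16"-style structure
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data standing in for keypoint features.
features = torch.rand(32, 42)
labels = torch.randint(0, NUM_CLASSES, (32,))
optimizer.zero_grad()
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
print(float(loss))
```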



The second part of the learning component focuses on acquiring axioms corresponding to any new gesture, and merging these axioms with the existing ones. This is achieved by taking the label provided by the human for the new gesture and checking whether the corresponding instruction (e.g., seat two people) can be executed with the existing knowledge. If that is possible, no further learning is performed. If the existing knowledge is insufficient to execute the new instruction, or if the human provides feedback that includes an action (e.g., a textual or verbal description that is processed using existing tools), literals extracted from the feedback are used to construct an axiom that is merged with the existing ones. Once again, reasoning helps direct this learning by limiting its scope to the relevant object constants and description. For example, assume that the robot is shown a new gesture for seating a group of customers at a table. The robot will use human feedback about this new gesture, and only consider literals corresponding to the location of these customers, its own location, and the occupancy of tables in the restaurant, to learn axioms for the new action.

Tracing explanations. Our architecture supports the ability to infer the sequence of axioms and beliefs that explains the evolution of any given belief or the non-selection of any given ground action at a given time. We build on the idea of proof trees, which have been used to explain observations in classical first-order logic [35], and adapt it to our architecture that is based on descriptions in non-monotonic logic. Our approach is based on the following sequence of steps (a Python sketch of this procedure is given below):

  1. Select axioms that have the target belief or action in the head.
  2. Ground literals in each such axiom's body and check whether these ground literals are supported (i.e., satisfied) by the current answer set.
  3. Create a new branch in the proof tree (which has the target belief or action as root) for each selected axiom supported by the current answer set, and store the axiom and the related supporting ground literals in suitable nodes.
  4. Repeat Steps 1-3 with the supporting ground literals from Step 3 as the target beliefs in Step 1, until all branches reach a leaf node without further supporting axioms.

Paths from the root to the leaves in these trees provide explanations. If multiple such paths exist, we currently select one of the shortest branches at random; other heuristics could be used to compare the explanations. For example, if the robot is asked why it seated a group of three customers at table5, it can trace the current belief about the group back to the initial state through the application of relevant axioms, and come up with an explanation such as: "The three customers came to the restaurant and wanted to be seated as a group. table5 at node n7 was the table closest to the entrance that had the desired number of seats available. I seated the customers at table5."

In addition to tracing the evolution of a target belief and justifying the non-selection of a particular action, our architecture can also provide: (a) a description of any computed or executed plan in terms of the literals in the plan; (b) a justification for executing a particular action at a particular time step, by examining the change in state caused by the action's execution and how this state change achieves the goal or facilitates the execution of the next action in the plan; and (c) the inferred outcome(s) of the execution of hypothetical actions, based on a mental simulation guided by the current domain knowledge. In all these cases, the identified literals are encapsulated in a prespecified answer template to provide the descriptions. For proof-of-concept examples in simplistic scene understanding scenarios, please see [9]; some specific examples in the RW domain are provided below (Section 4.1).
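The sketch below illustrates the four-step proof-tree procedure above on a toy example. Axioms and literals are plain strings, and the simplification of storing all supporting literals as children of one node (rather than one branch per supporting axiom) is ours.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An axiom is modeled as (head, body) over ground literals (plain strings);
# the answer set is a set of literals. This is a toy stand-in for the ASP machinery.
Axiom = Tuple[str, List[str]]

@dataclass
class Node:
    literal: str
    axiom: Axiom = None
    children: List["Node"] = field(default_factory=list)

def build_proof_tree(target: str, axioms: List[Axiom], answer_set: set) -> Node:
    """Trace support for a target belief (Steps 1-4 of the procedure above).

    Selects axioms whose head matches the target, keeps those whose body
    literals are all in the current answer set, and recurses on the body
    literals until no further supporting axioms exist.
    """
    node = Node(target)
    for head, body in axioms:
        if head == target and all(lit in answer_set for lit in body):
            node.axiom = (head, body)
            for lit in body:
                node.children.append(build_proof_tree(lit, axioms, answer_set))
    return node

# Toy example: why does the robot believe the group is seated at table2?
axioms = [
    ("attable(group1, table2)", ["occurs(seat(rob1, group1, table2))"]),
    ("occurs(seat(rob1, group1, table2))", ["loc(rob1, n6)", "vacant(table2)"]),
]
answer_set = {"attable(group1, table2)", "occurs(seat(rob1, group1, table2))",
              "loc(rob1, n6)", "vacant(table2)"}
tree = build_proof_tree("attable(group1, table2)", axioms, answer_set)
print(tree.children[0].literal)   # -> occurs(seat(rob1, group1, table2))
```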
Control loop. Algorithm 1 is the overall control loop for the architecture. The baseline behavior (lines 3-8) is to plan and execute actions to achieve the given goal as long as a consistent model of history can be computed. If such a model cannot be constructed, it is attributed to an unexplained, unexpected observation, and the robot triggers interactive exploration (lines 9-12). Interactive exploration is also triggered if no active goal exists to be achieved (lines 13-15). Depending on the human input, the architecture either acquires the previously unknown gestures and axioms, or attempts to provide the desired description of a target decision or belief (lines 19-21). When in the learning mode, the robot can be interrupted if needed (lines 17-18), e.g., to pursue a new goal.

Algorithm 1: Our architecture's control loop.
Input: Π(𝒟, ℋ); goal description; initial state σ1.
Output: Control signals for the robot to execute.
 1  planMode = true, learnExplainMode = false
 2  while true do
 3      Add observations to history.
 4      ComputeAnswerSets(Π(𝒟, ℋ))
 5      if planMode then
 6          if existsGoal then
 7              if explainedObs then
 8                  ExecutePlanStep()
 9              else
10                  planMode = false
11                  learnExplainMode = true
12              end
13          else
14              learnExplainMode = true
15          end
16      else
17          if interrupt then
18              planMode = true
19          else if learnExplainMode then
20              AcquireKnowledgeExplain()
21          end
22  end
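A Python rendering of Algorithm 1 is sketched below. The helper callables stand in for the ASP solver call, plan execution, and the learning/explanation routines described above; their names and signatures are ours rather than the paper's.

```python
def control_loop(program, goal, get_observations, compute_answer_sets,
                 exists_goal, explained_obs, execute_plan_step,
                 interrupt_requested, acquire_knowledge_or_explain):
    """Sketch of Algorithm 1: alternate between planning and learn/explain modes."""
    plan_mode, learn_explain_mode = True, False
    while True:
        history = get_observations()                          # line 3: add observations to history
        answer_sets = compute_answer_sets(program, history)   # line 4
        if plan_mode:
            if exists_goal(goal, answer_sets):
                if explained_obs(answer_sets):                 # consistent model of history
                    execute_plan_step(answer_sets)             # lines 7-8
                else:                                          # unexplained observation
                    plan_mode, learn_explain_mode = False, True  # lines 9-11
            else:
                learn_explain_mode = True                      # lines 13-15
        else:
            if interrupt_requested():                          # lines 17-18: e.g., a new goal arrives
                plan_mode = True
            elif learn_explain_mode:
                acquire_knowledge_or_explain()                 # lines 19-20
```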
4. Execution Traces and Results

Meaningfully evaluating architectures for integrated robot systems is challenging. It is difficult to find a baseline that provides all the capabilities supported by our architecture, and it is also difficult to evaluate the capabilities of each component of the architecture in isolation. Also, given that reasoning and learning guide each other in our architecture to automatically identify and focus only on the relevant information, task complexity and scalability do not necessarily change substantially by increasing the number of tasks, and just reporting success in many scenarios is not very informative. In addition, it was difficult to use a physical robot to conduct the experimental trials during the pandemic. We thus focus on illustrating the capabilities of our architecture using a combination of execution traces (i.e., use cases) and some experiments that provide quantitative results. The key hypotheses to be evaluated are:

  H1: our architecture enables the robot to compute and execute plans to achieve desired goals;
  H2: having reasoning inform and guide learning improves the computational efficiency of learning and the recognition accuracy of the learned models; and
  H3: exploiting the links between reasoning and learning provides suitable relational descriptions as explanations of decisions and beliefs.

We explore hypotheses H1 and H3 in the execution traces (Section 4.1), and provide experimental results in support of H2 (Section 4.2).

4.1. Execution traces

We provide two execution traces to illustrate the operation of our architecture in specific scenarios. Videos corresponding to these traces can be viewed online [29][2]. In all the scenarios, the human user (in the physical world) uses hand gestures to create different situations and also to mimic the gestures to be made by the customers or the supervisor in the restaurant environment. The layout used to generate these traces is shown in Figure 7; it is a simplified version of Figure 3.

Figure 7: Example layout of the RW domain used in Execution Examples 1-2.

[2] https://www.cs.bham.ac.uk/~sridharm/KR22/

Execution Example 1. [Plan, execute, explain]
Consider a scenario in which there is one customer cu1 seated at table1 in the restaurant, and the robot waiter is in the region of node n4. In this scenario, the restaurant is organized into regions corresponding to eight nodes: n0-n7. The subsequent steps in this scenario are:

  • Three new customers (cu2-cu4) are introduced into the restaurant as a group by the human designer showing a suitable hand gesture. This information is also added to the ASP program automatically.
  • The hand gesture also lets the robot waiter (rob1) know that the new customers are to be seated at a table. The robot comes up with a plan based on the updated ASP program and the vacant table that is closest to it:

        move(rob1, n5), move(rob1, n0),
        pickup(rob1, group1), move(rob1, n5),
        move(rob1, n6), seat(rob1, group1, table2)

  • Note that applying the pickup action to any customer in a group causes the same effect on all customers in the group. This plan is executed and the state is updated accordingly, e.g., cu2-cu4 are seated at table2 after the plan is executed.
  • The robot can be asked about the executed plan.
    Human: "Why did you seat all the customers at table2?"
    Pepper: "Because all the customers wanted to sit together and table2 was the closest available table."
  • After some time, cu1 has finished eating and would like to leave. The designer imitates the hand gesture that the customer would make in the restaurant to ask for the bill. This is translated into a goal in the ASP program: haspaid(cu1).
  • The robot computes and executes a suitable plan to give the bill to cu1, collect payment, and provide a receipt, after which cu1 leaves the restaurant.

Figure 8 shows snapshots from the beginning, middle, and end of this scenario.

Figure 8: Snapshots from the beginning, middle, and end of the scenario in Execution Example 1: (top) there is initially one customer cu1 seated at table1; (middle) the three new customers are at table2 and cu1 gets the robot waiter's attention to request the bill; and (bottom) cu1 has left the restaurant after paying the bill.

Execution Example 2. [Learn, plan, explain]
Consider another scenario in which the restaurant initially has no customers. Robot waiter rob1 is in the region of node n1 and knows that table1 and table2 have capacity two and four respectively. Once again, the restaurant is organized into regions corresponding to eight nodes: n0-n7. The subsequent steps in this scenario are:

  • The human (in the physical world) makes a hand gesture that is unknown to the robot waiter. The robot responds by identifying this as a new gesture and conveys that it will be added to the database of hand gestures.
  • The robot adds the new hand gesture and solicits feedback about the gesture. The human (designer) intentionally provides a complex instruction (textually) that this gesture corresponds to "serve steak to a group of three new customers, and then give them the bill".
  • Since rob1 knows that serving a customer implies giving them the food item they want, it is able to parse this complex instruction into the component actions. When the human then makes the same hand gesture again and introduces three new customers (cu2-cu4) near the restaurant's entrance, rob1 computes a suitable plan (some steps omitted to promote understanding):

        move(rob1, n2), ..., pickup(rob1, cu2), ...,
        seat(rob1, cu2, table2), ...,
        serve(rob1, steak, table2), ...,
        givebill(rob1, table2), ...

  • The plan is executed and the state is updated accordingly at different time steps, e.g., cu2-cu4 are [...]
     • The plan is executed and the state is updated accordingly at different time steps, e.g., 𝑐𝑢2 − 𝑐𝑢4 are seated at 𝑡𝑎𝑏𝑙𝑒2 after the 𝑠𝑒𝑎𝑡 action is executed.
     • The robot can be asked about specific plan steps.
       Human: “why did you not serve pasta to 𝑡𝑎𝑏𝑙𝑒2 ?”
       Pepper: “Because all customers at 𝑡𝑎𝑏𝑙𝑒2 wanted to eat steak.”
       This explanation is based on the previously described approach of tracing beliefs and applying the relevant axioms.

Figure 9 shows snapshots from the beginning, middle, and end of this scenario.

Figure 9: Snapshots from the beginning, middle, and end of the scenario in Execution Example 2: (top) there is initially no customer in the restaurant; (middle) the newly learned hand gesture is made to get the robot to serve steak to a group of customers; and (bottom) the robot provides a bill to the customers after they have completed their meal.

We evaluated the architecture in many other scenarios grounded in the motivating (restaurant) domain; the robot was able to successfully compute and execute plans to achieve the assigned goals, identify and learn previously unknown knowledge, and provide on-demand explanations of its decisions and beliefs.
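The plans in both execution examples are computed by non-monotonic logical reasoning over the ASP program. As a rough, self-contained illustration of that step (not the authors' program, which is a much larger sorted CR-Prolog encoding), the sketch below writes a tiny fragment of the restaurant domain as an ASP string and asks a solver for a plan; the use of the clingo Python API, the specific constants, and the simplified axioms are our assumptions.

```python
# Minimal planning sketch over a tiny fragment of the restaurant domain.
# Illustration only: sorts, executability conditions, defaults, and most
# actions (pickup, serve, givebill) are omitted, and clingo is used in place
# of the sorted ASP solver employed in the architecture.
import clingo

RW_FRAGMENT = """
#const n = 4.
step(0..n).
node(n1;n2). robot(rob1). customer(cu2). table(table2).

% initial state
holds(loc(rob1, n1), 0).

% simplified causal laws for the move and seat actions
holds(loc(R, N), I+1)     :- occurs(move(R, N), I), robot(R), node(N), step(I), I < n.
holds(attable(C, T), I+1) :- occurs(seat(R, C, T), I), robot(R), customer(C), table(T), step(I), I < n.

% nothing becomes false in this fragment, so fluents simply persist
holds(F, I+1) :- holds(F, I), step(I), I < n.

% choose exactly one action per step until the goal holds
1 { occurs(move(R, N), I) : node(N);
    occurs(seat(R, C, T), I) : customer(C), table(T) } 1 :- robot(R), step(I), I < n, not goal(I).

% goal: customer cu2 is seated at table2
goal(I) :- holds(attable(cu2, table2), I), step(I).
:- not goal(n).

#show occurs/2.
"""

ctl = clingo.Control()
ctl.add("base", [], RW_FRAGMENT)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda model: print("plan:", model))
```

Each answer set contains the occurs/2 atoms of one plan; in the full architecture, the analogous program also carries the recorded history and helper axioms, and each abstract action is then executed using probabilistic reasoning over the relevant fine-resolution description.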
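The answer Pepper gives above comes from tracing a belief back through the axioms that support it. The fragment below is a minimal sketch of one level of that tracing (our construction, not the authors' code); the string encoding of literals, the Axiom container, and the served fluent are illustrative assumptions.

```python
# One step of belief tracing: collect the ground axioms whose head is the
# queried literal and whose body literals are all supported by the current
# answer set. Repeating this on the supporting literals yields a
# proof-tree-like trace of how the belief came about.
from dataclasses import dataclass

@dataclass
class Axiom:
    head: str
    body: tuple[str, ...]

def supporting_axioms(target: str, axioms: list[Axiom], answer_set: set[str]) -> list[Axiom]:
    """Axioms that can justify `target` given the robot's current beliefs."""
    return [ax for ax in axioms
            if ax.head == target and all(lit in answer_set for lit in ax.body)]

# Example: why does the robot believe the customers at table2 were served steak?
axioms = [Axiom("holds(served(table2, steak), 5)",
                ("occurs(serve(rob1, steak, table2), 4)",))]
beliefs = {"occurs(serve(rob1, steak, table2), 4)",
           "holds(served(table2, steak), 5)"}
print(supporting_axioms("holds(served(table2, steak), 5)", axioms, beliefs))
```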
4.2. Experimental results

To further explore the effect of reasoning guiding learning, we conducted some quantitative studies. The first experiment examined the benefits of reasoning guiding the learning of deep network models for hand gestures. Deep learning methods typically need many labeled training examples and epochs to learn models for the target classification task. However, since learning in our architecture is constrained (by reasoning) to specific gestures or classes of gestures at a time, it took fewer samples and fewer epochs to acquire the desired models that provide high accuracy (see Figure 10).

The second experiment examined whether reasoning helped improve the recognition accuracy. In this experiment, we considered 30 hand gestures. One round of testing included 40 iterations of each hand gesture by a person who did not participate in training. We conducted multiple rounds of testing, and ground truth information was provided by the designers (i.e., the student authors). In the absence of the coupling between reasoning and learning, the learned models had (on average) an accuracy of 85% over the different hand gestures. However, with learning directed to specific (classes of) gestures, the learned models provided better classification accuracy (≈ 100%).

The third experiment examined the ability to provide explanatory descriptions in response to different types of queries in different situations. A description was considered to be correct if it had all the correct literals and no additional literals. Overall, the interplay between reasoning (with relevant knowledge) and learning (of previously unknown knowledge) led to correct relational descriptions in 95% of cases, with the “errors” being descriptions that contained additional literals that were not essential to answer the query posed but were not necessarily wrong. In the absence of the learned knowledge, the accuracy (averaged over query types) was 65–80%.

[Figure 10 plot: training and testing accuracy (roughly 0.97 to 1.0) versus training epoch (0 to 14) for three deep network models: ANN-3x16 (1691 parameters), ANN-3x64 (12827 parameters), and ANN-2x128 (25499 parameters).]

Figure 10: Deep network models provide high (recognition) accuracy for hand gestures within a few epochs when
guided by reasoning.
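The first two experiments hinge on reasoning restricting what the deep networks have to learn and recognize at any given time. The sketch below illustrates one simple reading of that idea, assuming each gesture sample is summarized as a fixed-length vector of 2D hand-keypoint coordinates; the PyTorch code, layer sizes, and helper names are our assumptions, not the authors' implementation.

```python
# Illustrative sketch: train a small MLP gesture classifier, but only over the
# gesture classes that reasoning has identified as relevant to the robot's
# current task category (e.g., order-taking gestures once customers are
# seated). The 42 inputs assume 21 hand keypoints with (x, y) coordinates.
import torch
import torch.nn as nn

def build_classifier(num_classes: int, hidden: int = 16, layers: int = 3) -> nn.Module:
    """Small fully connected network, e.g., the 'ANN-3x16' variant in Figure 10."""
    dims = [42] + [hidden] * layers
    blocks = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
    blocks.append(nn.Linear(dims[-1], num_classes))
    return nn.Sequential(*blocks)

def train_on_relevant_gestures(features, labels, relevant_classes, epochs=15):
    """Fit a classifier restricted (by reasoning) to the relevant gesture classes."""
    classes = sorted(relevant_classes)
    mask = torch.isin(labels, torch.tensor(classes))
    x = features[mask]
    # re-index the retained labels to 0..k-1 for the smaller output layer
    index = {c: i for i, c in enumerate(classes)}
    y = torch.tensor([index[int(c)] for c in labels[mask]])
    model = build_classifier(num_classes=len(classes))
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optim.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optim.step()
    return model
```

Restricting the output classes in this way is one concrete interpretation of "learning directed to specific (classes of) gestures": the network has fewer distinctions to learn, which is consistent with the smaller sample and epoch budgets reported above.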
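The correctness criterion used in the third experiment is easy to state operationally. The following sketch (with a string encoding of literals that is our assumption) marks a description as correct only when it contains exactly the expected literals, and separately flags descriptions that add non-essential literals, mirroring how the "errors" above were tallied.

```python
# Sketch of the third experiment's scoring rule: a generated description is
# "correct" only if it contains every expected literal and no extra ones;
# descriptions with extra (but not wrong) literals are counted separately.
def grade_description(generated: set[str], expected: set[str]) -> str:
    if generated == expected:
        return "correct"
    if expected <= generated:          # all required literals plus extras
        return "extra_literals"
    return "incorrect"                 # at least one required literal missing

# Example: an answer that adds an unnecessary (but true) literal.
expected = {"attable(cu2, table2)", "wants(cu2, steak)"}
generated = {"attable(cu2, table2)", "wants(cu2, steak)", "loc(rob1, n2)"}
print(grade_description(generated, expected))   # -> "extra_literals"
```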



5. Discussion and Conclusions

We conclude by highlighting the key capabilities of our architecture:

     • Once the designer has provided the domain-specific information (e.g., arrangement of rooms, range of the robot’s sensors), planning, diagnostics, and plan execution can be automated. The coupling between reasoning and learning enables more complex theories (of cognition, action) to be encoded without increasing the computational effort substantially.
     • Second, exploiting the interplay between knowledge-based reasoning and data-driven learning provides a clear separation of concerns, and helps focus attention automatically on the relevant knowledge and observed anomalies, thus improving the reliability and efficiency of reasoning and learning.
     • Third, it is easier to understand and modify the observed behavior than with architectures that consider all the available knowledge or only support data-driven learning. The robot is able to provide relational descriptions of its decisions and the evolution of its beliefs.
     • Fourth, there is smooth transfer of control and relevant knowledge between components of the architecture, and increased confidence in the correctness of the robot’s behavior. Also, the underlying methodology can be used with different robots and in different application domains.
     • Fifth, using KR tools and the coupling between reasoning and learning as the foundation promotes modularity and simplifies the design and evaluation of architectures for integrated robot systems.

Future work will further explore the interplay between reasoning and learning for explaining decisions and beliefs while performing reasoning and learning in more complex robotics domains. We will also investigate the use of our architecture on a physical robot interacting with humans through noisy sensors and actuators. The longer-term objective is to support transparent reasoning and learning in integrated robot systems operating in complex domains.

References
[1] E. Erdem, V. Patoglu, Applications of ASP in Robotics, Künstliche Intelligenz 32 (2018) 143–149.
[2] E. Erdem, M. Gelfond, N. Leone, Applications of Answer Set Programming, AI Magazine 37 (2016) 53–68.
[3] K. Kersting, L. D. Raedt, Bayesian Logic Programs, in: International Conference on Logic Programming, London, UK, 2000.
[4] L. D. Raedt, A. Kimmig, Probabilistic Logic Programming Concepts, Machine Learning 100 (2015) 5–47.
[5] M. Richardson, P. Domingos, Markov Logic Networks, Machine Learning 62 (2006) 107–136.
[6] S. Zhang, M. Sridharan, A Survey of Knowledge-based Sequential Decision Making under Uncertainty, Artificial Intelligence Magazine 43 (2022) 249–266.
[7] Y. Gil, Learning by Experimentation: Incremental Refinement of Incomplete Planning Domains, in: International Conference on Machine Learning, New Brunswick, USA, 1994, pp. 87–95.
[8] M. Law, A. Russo, K. Broda, The ILASP System for Inductive Learning of Answer Set Programs, Association for Logic Programming Newsletter (2020).
[9] T. Mota, M. Sridharan, A. Leonardis, Integrated Commonsense Reasoning and Deep Learning for Transparent Decision Making in Robotics, Springer Nature CS 2 (2021) 1–18.
[10] M. Sridharan, B. Meadows, Knowledge Representation and Interactive Learning of Domain Knowledge for Human-Robot Collaboration, Advances in Cognitive Systems 7 (2018) 77–96.
[11] J. E. Laird, K. Gluck, J. Anderson, K. D. Forbus, O. C. Jenkins, C. Lebiere, D. Salvucci, M. Scheutz, A. Thomaz, G. Trafton, R. E. Wray, S. Mohan, J. R. Kirk, Interactive Task Learning, IEEE Intelligent Systems 32 (2017) 6–21.
[12] R. Assaf, A. Schumann, Explainable Deep Neural Networks for Multivariate Time Series Predictions, in: International Joint Conference on Artificial Intelligence, Macao, China, 2019, pp. 6488–6490.
[13] W. Samek, T. Wiegand, K.-R. Müller, Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models, ITU Journal: ICT Discoveries (Special Issue 1): The Impact of Artificial Intelligence (AI) on Communication Networks and Services 1 (2017) 1–10.
[14] W. Norcliffe-Brown, E. Vafeais, S. Parisot, Learning Conditioned Graph Structures for Interpretable Visual Question Answering, in: Neural Information Processing Systems, Montreal, Canada, 2018.
[15] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, J. B. Tenenbaum, Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, in: Neural Information Processing Systems, Montreal, Canada, 2018.
[16] M. Ribeiro, S. Singh, C. Guestrin, Why Should I Trust You? Explaining the Predictions of Any Classifier, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[17] Y. Zhang, S. Sreedharan, A. Kulkarni, T. Chakraborti, H. H. Zhuo, S. Kambhampati, Plan explicability and predictability for robot task planning, in: International Conference on Robotics and Automation, 2017, pp. 1313–1320.
[18] R. Borgo, M. Cashmore, D. Magazzeni, Towards Providing Explanations for AI Planner Decisions, in: IJCAI Workshop on Explainable Artificial Intelligence, 2018, pp. 11–17.
[19] P. Bercher, S. Biundo, T. Geier, T. Hoernle, F. Nothdurft, F. Richter, B. Schattenberg, Plan, repair, execute, explain - how planning helps to assemble your home theater, in: Twenty-Fourth International Conference on Automated Planning and Scheduling, 2014.
[20] J. Fandinno, C. Schulz, Answering the "Why" in Answer Set Programming: A Survey of Explanation Approaches, Theory and Practice of Logic Programming 19 (2019) 114–203.
[21] S. Anjomshoae, A. Najjar, D. Calvaresi, K. Främling, Explainable agents and robots: Results from a systematic literature review, in: International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Montreal, Canada, 2019.
[22] T. Miller, Explanations in Artificial Intelligence: Insights from the Social Sciences, Artificial Intelligence 267 (2019) 1–38.
[23] M. Sridharan, M. Gelfond, S. Zhang, J. Wyatt, REBA: A Refinement-Based Architecture for Knowledge Representation and Reasoning in Robotics, Journal of Artificial Intelligence Research 65 (2019) 87–180.
[24] P. Langley, B. Meadows, M. Sridharan, D. Choi, Explainable Agency for Intelligent Autonomous Systems, in: Innovative Applications of Artificial Intelligence, San Francisco, USA, 2017.
[25] M. Sridharan, B. Meadows, Towards a Theory of Explanations for Human-Robot Collaboration, Künstliche Intelligenz 33 (2019) 331–342.
[26] T. Mota, M. Sridharan, Commonsense Reasoning and Knowledge Acquisition to Guide Deep Learning on Robots, in: Robotics Science and Systems, Freiburg, Germany, 2019.
[27] M. Balduccini, M. Gelfond, Logic Programs with Consistency-Restoring Rules, in: AAAI Spring Symposium on Logical Formalization of Commonsense Reasoning, 2003, pp. 9–18.
[28] M. Gelfond, D. Inclezan, Some Properties of System Descriptions of 𝐴𝐿𝑑 , Journal of Applied Non-Classical Logics, Special Issue on Equilibrium Logic and Answer Set Programming 23 (2013) 105–120.
[29] M. Sridharan, Supporting code and videos, 2022. https://www.cs.bham.ac.uk/~sridharm/KRFiles/.
[30] E. Balai, M. Gelfond, Y. Zhang, Towards Answer Set Programming with Sorts, in: International Conference on Logic Programming and Nonmonotonic Reasoning, Corunna, Spain, 2013.
[31] B. Banihashemi, G. D. Giacomo, Y. Lesperance, Abstraction of Agents Executing Online and their Abilities in Situation Calculus, in: International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018.
[32] Z. Saribatur, T. Eiter, P. Schuller, Abstraction for Non-ground Answer Set Programs, Artificial Intelligence 300 (2021) 103563.
[33] E. Coumans, Y. Bai, PyBullet: A Python Module for Physics Simulation for Games, Robotics, and Machine Learning, Technical Report, http://pybullet.org, 2016–2022.
[34] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, Y. A. Sheikh, OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[35] G. Ferrand, W. Lessaint, A. Tessier, Explanations and Proof Trees, Computing and Informatics 25 (2006) 1001–1021.
There and Back Again: Combining Non-monotonic Logical Reasoning and Deep Learning on an Assistive Robot[edit]

load PDF

There and Back Again: Combining Non-monotonic
Logical Reasoning and Deep Learning on an Assistive
Robot
Mohan Sridharan1,* , Chloé Benz2 , Arthur Findelair3 and Kévin Gloaguen4
1
  Intelligent Robotics Lab, School of Computer Science, University of Birmingham, UK
2
  Illinois Institute of Technology, USA
3
  Illinois Institute of Technology, USA
4
  École Nationale Supérieure de Mécanique et d’Aérotechnique, France


                                          Abstract
                                          This paper describes the development of an architecture that combines non-monotonic logical reasoning and deep learning
                                          in virtual (simulated) and real (physical) environments for an assistive robot. As an illustrative example, we consider a robot
                                          assisting in a simulated restaurant environment. For any given goal, the architecture uses Answer Set Prolog to represent and
                                          reason with incomplete commonsense domain knowledge, providing a sequence of actions for the robot to execute. At the
                                          same time, reasoning directs the robot’s learning of deep neural network models for human face and hand gestures made in
                                          the real world. These learned models are used to recognize and translate human gestures to scenarios that mimic real-world
                                          situations in the simulated environment, and to goals that need to be achieved by the robot in the simulated environment. We
                                          report the challenges faced in the development of such an integrated architecture, as well as the insights learned from the design,
                                          implementation, and evaluation of this architecture by a distributed team of researchers during the ongoing pandemic.

                                          Keywords
                                          Non-monotonic logical reasoning, Probabilistic reasoning, Interactive learning, Robotics



1. Motivation
Consider the motivating example of a mobile robot (Pep-
per) waiter in a simulated restaurant, as shown in Figure 1.
The robot has to perform tasks such as seating customers
at suitable tables, taking and delivering food orders, and
collecting payment. To perform these tasks, the robot
extracts and reasons with the information from different
sensors (e.g., camera, range finder) and incomplete com-
monsense domain knowledge. This knowledge includes
relational descriptions of the domain objects and their at-
                                                                                                     Figure 1: Illustrative snapshot of an assistive robot oper-
tributes (e.g., size, number, and relative positions of tables,
                                                                                                     ating as a waiter in a simulated restaurant scenario.
chairs, and people). It also includes axioms governing
actions and change in the domain (e.g., the preconditions
and effects of seating a group of people at a particular
table), including default statements that hold in all but                                              with its knowledge and sensor observations to revise its
a few exceptional circumstances (e.g., “customers typ-                                                 knowledge (e.g., revise the number of people seated at
ically need some time to look at the menu before they                                                  different tables, learn the effects of different gestures).
place an order”). Since the domain description is incom-                                               Furthermore, to promote better interaction with humans
plete and can change over time, the robot also reasons                                                 in the restaurant, the robot provides on-demand relational
                                                                                                       descriptions of its decisions and the evolution of beliefs.
NMR 2022: 20th InternationalWorkshop on Non-Monotonic Reason-                                             Realizing the motivating scenario described above
ing, August 07–09, 2022, Haifa, Israel                                                                 poses fundamental challenges in knowledge represen-
*
  Corresponding author.                                                                                tation, reasoning, and learning. State of the art robot
" m.sridharan@bham.ac.uk (M. Sridharan);
                                                                                                       architectures often seek to address these challenges by
chloe.c.benz@gmail.com (C. Benz); arthfind@gmail.com
(A. Findelair); k.gloaguen1303@gmail.com (K. Gloaguen)                                                 using logics and probabilistic methods to represent and
~ https://www.cs.bham.ac.uk/~sridharm/ (M. Sridharan)                                                  reason with domain knowledge and observations, and
� 0000-0001-9922-8969 (M. Sridharan)                                                                   by using data-driven (deep) learning methods to extract
          © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
          Attribution 4.0 International (CC BY 4.0).                                                   knowledge from large, labeled datasets (e.g., of noisy sen-
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)




                                                                                                  115
�sor observations). However, practical domains make it           2. Related Work
difficult to provide a comprehensive encoding of domain
knowledge, or the computational resources and examples          There is a well-established history of the use of log-
needed to augment or revise the robot’s knowledge. Fur-         ics in different AI and robotics applications. The non-
thermore, circumstances such as the ongoing pandemic            monotonic logical reasoning paradigm used in this paper,
make it rather challenging for a distributed team of re-        ASP, has been used by an international community of re-
searchers to design and evaluate such architectures for         searchers for many applications in robotics [1] and other
integrated robot systems.                                       fields [2]. There has also been a lot of work over multiple
   This paper makes a two-fold contribution towards ad-         decades on integrating logical and probabilistic reason-
dressing the above-mentioned challenges. First, it uses         ing [3, 4, 5], and on using different logics for guiding
the motivating example to describe the development              probabilistic sequential decision making [6]. Our focus
of an architecture that adapts knowledge representation         here is on building on this work to support transparent
(KR) tools to achieve transparent, reliable, and efficient      knowledge-based reasoning and data-driven learning in
knowledge-based reasoning and data-driven learning on           integrated robot systems.
an assistive robot. Second, it highlights the advantages of        There are many methods for learning logic-based rep-
using KR tools, and of formally coupling representation,        resentations of domain knowledge. This includes the
reasoning and learning, to design such an architecture.         incremental revision of action operators in first-order
More specifically, our architecture:                            logic [7], the inductive learning of domain knowledge
                                                                encoded as an Answer Set Prolog program [8], and the
     • Represents and performs non-monotonic logical            work on coupling non-monotonic logical reasoning with
       reasoning with incomplete commonsense domain             inductive learning or relational reinforcement learning to
       knowledge using Answer Set Prolog (ASP) to ob-           learn axioms [9, 10]. Our approach in this architecture is
       tain a plan of abstract actions for any given goal;      inspired by work in interactive task learning [11]; unlike
     • Executes each abstract action as a sequence of           methods that learn from many training examples, our ap-
       concrete actions by automatically identifying and        proach seeks to identify and learn from a limited number
       reasoning probabilistically about the relevant do-       of relevant training examples.
       main knowledge at a finer granularity;                      Given the use of deep networks in different applications,
     • Reasons with domain knowledge to allow humans            there is much interest in understanding their operation in
       making hand gestures in the physical world to            terms of the features influencing network outputs [12, 13].
       interact with the simulated robot in a manner that       There is also work on neuro-symbolic systems that reason
       mimics interaction in the physical world; and            with learned symbolic structure or a scene graph in con-
     • Reasons with domain knowledge to guide the               junction with deep networks to answer questions about
       learning of models for new hand gestures and             images [14, 15]. Work in the broader areas of explainable
       the corresponding axioms, and for providing on-          AI and explainable planning can be categorized into two
       demand relational descriptions as explanations of        groups. Methods in one group modify or map learned
       the robot’s decisions and beliefs.                       models or reasoning systems to make their decisions more
                                                                interpretable [16] or easier for humans to understand [17].
The interactive interface between the virtual and physical      Methods in the other group provide descriptions that make
world helped the three undergraduate student authors de-        a reasoning system’s decisions more transparent [18], help
sign, implement, and evaluate the architecture remotely         humans understand plans [19], and help justify solutions
over different time intervals during the pandemic. It also      obtained by non-monotonic logical reasoning [20]. Re-
helped us explore the interplay between reasoning and           cent survey papers indicate that existing methods: (i) do
learning. The “there and back again” in the title thus refers   not fully integrate reasoning and learning to inform and
to the architecture’s on-demand ability to traverse differ-     guide each other; (ii) do not fully exploit the available
ent points in space and time, and to transition between         commonsense domain knowledge for reliable, efficient,
the physical and virtual world for human-robot collabora-       and transparent reasoning and learning; and (iii) are often
tion. We demonstrate the capabilities of our architecture       agnostic to how an explanation is structured or assumes
through experimental results and execution traces of use        comprehensive domain knowledge [21, 22]
cases in our motivating restaurant domain.                         Our work focuses on transparent, reliable, and efficient
   The remainder of this paper is organized as follows. We      reasoning and learning in integrated robot systems that
begin by discussing related work in Section 2. Next, we         combine reasoning with incomplete commonsense do-
describe our architecture and its components in Section 3.      main knowledge and data-driven learning from limited
The execution traces and results of evaluating our archi-       examples. We seek to demonstrate that this objective can
tecture’s components are described in Section 4, and the        be achieved by building on KR tools. To do so, we build
conclusions are described in Section 5.                         on some of the prior work of the lead author with others.




                                                            116
�                   Knowledge Representation+ Reasoning
                                                                            providing a bill and collecting payment; and (iv) respond-
                  domain knowledge (relations, action theory)               ing to requests from the customer(s) and the designer. The
                      non−monotonic logical reasoning
                          probabilistic reasoning
                                                                            robot uses probabilistic algorithms to model and account
                                                                            for the uncertainty experienced during perception and ac-
                                                                            tuation. Interactions of the robot with a human supervisor
                                                                            are handled through the interface that interprets hand ges-
                                                                            tures made by a human in the physical world. The robot
        virtual world                               deep/reinforcement
                                                                            has incomplete (and potentially imprecise) domain knowl-
                                                         inductive          edge, which includes number, size, and location of tables
       physical world                                                       and chairs; spatial relations between objects; and some
                                                   Interactive Learning     axioms governing domain dynamics such as:
     Interaction Interface

Figure 2: Overview of our architecture combining non-                            • If the robot allocates a group of customers to a
monotonic logical reasoning, probabilistic reasoning, and                          table, all members of the group are considered to
deep learning for reliable, efficient, and transparent rea-                        be seated at that table.
soning and learning.
                                                                                 • The robot cannot seat customers at a table that is
                                                                                   not empty, i.e., is occupied.
                                                                                 • Any customer cannot be allocated to more than
In particular, we build on work on: (i) a refinement-based                         one table at a time.
architecture for representation and reasoning [23]; (ii)
explainable agency and theory of explanations [24, 25];                     This knowledge, e.g., the axioms describing dynamic
and (iii) combining non-monotonic logical reasoning and                     changes and the values of some attributes of the domain
deep learning for axiom learning and scene understand-                      or robot, may need to be revised over time.
ing [9, 26]. The novelty is in bringing these different
strands together in an architecture, and in facilitating
                                                                            3.1. Representation and Reasoning
the interactive interface between the virtual and physi-
cal worlds for design and evaluation.                         To represent and reason with domain knowledge, we use
                                                              CR-Prolog, an extension of Answer Set Prolog (ASP) that
                                                              introduces consistency restoring (CR) rules [27]. ASP
3. Architecture Description                                   is based on stable model semantics, and supports default
                                                              negation and epistemic disjunction, e.g., unlike “¬𝑎” that
Figure 2 presents an overview of the main components implies a is believed to be false, “𝑛𝑜𝑡 𝑎” only implies
of our architecture. As stated earlier, the architecture a is not believed to be true, and unlike “𝑝 ∨ ¬𝑝” in
uses ASP to represent and reason with commonsense do- propositional logic, “𝑝 𝑜𝑟 ¬𝑝” is not tautologous. ASP
main knowledge, e.g., to reason about object and robot can represent recursive definitions and constructs that are
attributes to compute a plan to achieve a given goal. For difficult to express in classical logic formalisms, and it
more complex domains, this reasoning can take place us- supports non-monotonic logical reasoning, i.e., the abil-
ing transition diagrams at two different resolutions, with ity to revise previously held conclusions based on new
the fine-resolution diagram defined as a refinement of evidence. We use the terms “CR-Prolog” and “ASP” in-
the coarse-resolution diagram. Execution of the actions terchangeably in this paper.
by a robot can then involve probabilistic reasoning with
a relevant part of the fine-resolution transition diagram.
Reasoning informs and guides both the interactive learn- Knowledge representation. A domain’s description
ing of previously unknown domain knowledge (which in ASP comprises a system description 𝒟 and a history ℋ.
is used for subsequent reasoning), and the interface for 𝒟 comprises a sorted signature Σ and axioms encoding
interaction between a human in the physical world and the domain’s dynamics. Σ comprises basic sorts, statics,
the robot in the virtual world. Reasoning is also used i.e., domain attributes that do not change over time, fluents,
to identify relevant literals and axioms to provide an on- i.e., domain attributes whose values can be changed, and
demand description of the robot’s decisions and beliefs. actions; note that statics, fluents, and actions are described
The individual components are described below using the in terms of the sorts of their arguments. In the RW domain,
following example domain.                                     the robot needs to reason about spatial relations between
                                                              objects, and to plan and execute actions that change the
Example Domain 1. [Robot Waiter (RW) Domain]                  domain. Such a dynamic domain is modeled in our archi-
A Pepper robot operates as a waiter in a restaurant. Its tecture by first describing Σ and the domain’s transition
tasks include: (i) greeting and seating customers; (ii) tak- diagram in action language 𝒜ℒ𝑑 [28]; this description is
ing food orders and delivering food to specific tables; (iii) then translated to ASP statements. The basic sorts of the




                                                                          117
�                                                                 ¬𝑜𝑐𝑐𝑢𝑟𝑠(𝑚𝑜𝑣𝑒(𝑅, 𝑁 ), 𝐼) ←                           (1e)
                                                                          ℎ𝑜𝑙𝑑𝑠(𝑙𝑜𝑐(𝑅, 𝑀 ), 𝐼), ¬𝑒𝑑𝑔𝑒(𝑀, 𝑁 )
                                                                 ¬𝑜𝑐𝑐𝑢𝑟𝑠(𝑔𝑖𝑣𝑒𝑏𝑖𝑙𝑙(𝑅, 𝑇 ), 𝐼) ←                       (1f)
                                                                          ¬ℎ𝑜𝑙𝑑𝑠(𝑤𝑎𝑛𝑡𝑠𝑏𝑖𝑙𝑙(𝑇 ), 𝐼)

                                                              which encode two causal laws, two state constraints, and
                                                              two executability conditions respectively. For example,
                                                              Statement 1(a) is a causal law that implies that execut-
                                                              ing the move action causes the robot’s location to be the
                                                              desired node in the next time step, Statement 1(c) is a
                                                              constraint stating that a customer can only be at one table
                                                              at a time, and Statement 1(e) is an executability condition
Figure 3: Example layout of the RW domain, which orga- that implies that a move to a target location is not possible
nizes the available space into nodes representing regions if it is not connected to the robot’s current location. The
with specific tables.                                         axioms also encode some default statements that hold in
                                                              all but a few exceptional situations. For example, in the
                                                              RW domain, we may want to encode that “clean plates
RW domain include 𝑡𝑎𝑏𝑙𝑒, 𝑟𝑜𝑏𝑜𝑡, 𝑐𝑢𝑠𝑡𝑜𝑚𝑒𝑟, 𝑒𝑚𝑝𝑙𝑜𝑦𝑒𝑒, are usually in the kitchen” unless stated otherwise:
𝑤𝑎𝑖𝑡𝑒𝑟, 𝑓 𝑢𝑟𝑛𝑖𝑡𝑢𝑟𝑒, 𝑔𝑒𝑠𝑡𝑢𝑟𝑒, 𝑔𝑒𝑠𝑡𝑢𝑟𝑒_𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦, and
𝑠𝑡𝑒𝑝 for temporal reasoning. The sorts may be organized         ℎ𝑜𝑙𝑑𝑠(𝑙𝑜𝑐(𝑃, 𝑘𝑖𝑡𝑐ℎ𝑒𝑛), 𝐼) ← ℎ𝑜𝑙𝑑𝑠(𝑐𝑙𝑒𝑎𝑛(𝑃 ), 𝐼),
hierarchically, e.g., chair and table are subsorts of the              𝑝𝑙𝑎𝑡𝑒(𝑃 ), 𝑛𝑜𝑡 ¬ℎ𝑜𝑙𝑑𝑠(𝑙𝑜𝑐(𝑃, 𝑘𝑖𝑡𝑐ℎ𝑒𝑛), 𝐼) (2)
sort furniture, and the sort employee includes robot and
supervisor as subsorts.                                       where “not” denotes default negation. One potential ex-
   Statics of the RW domain include relations edge(node, ception to this axiom is that some clean plates may also
node) and linked(node, furniture); the former is a graph- be placed near the buffet table; these exceptions can also
based encoding of regions, e.g., see Figure 3, and the latter be encoded. In addition to axioms, information extracted
associates particular tables to particular nodes. Fluents from the sensor inputs (e.g., different hand gestures) are
include relations such as location(robot, node), iswait- also converted to ASP statements at that time step. Each
ing(customer), attable(customer, table), occupancy(table, gesture is also associated with the corresponding axioms;
num), and haspaid(customer). Actions of the RW do- more specific details are provided in Section 3.3.
main include move(robot, node), which causes the robot           A dynamic domain’s history ℋ typically comprises
to move to a particular node; seat(robot, customer, table), records of: (a) fluents observed to be true or false at
which causes the robot to seat particular customer(s) at a a particular time step; and (b) the actual execution of
particular table; and givebill(robot, table), which causes particular actions at particular time steps:
the robot to give the bill to a customer at a particular ta-                 𝑜𝑏𝑠(𝑓 𝑙𝑢𝑒𝑛𝑡, 𝑏𝑜𝑜𝑙𝑒𝑎𝑛, 𝑠𝑡𝑒𝑝)
ble. In addition, relation holds(fluent, step) implies that
a particular fluent holds true at a particular timestep, and                 ℎ𝑝𝑑(𝑎𝑐𝑡𝑖𝑜𝑛, 𝑠𝑡𝑒𝑝)
occurs(action, step) implies the occurrence of a particular    Prior work demonstrated that this notion of history can
action at a particular timestep of the plan.                   be expanded to include defaults describing the values of
   Given the signature Σ, axioms describing a domain’s         fluents in the initial state, along with exceptions [23].
dynamics consist of causal laws, state constraints, and
executability conditions. For the RA domain, these are
                                                               Reasoning. Given the representation of domain knowl-
translated to statements in ASP such as:
                                                               edge described above, the robot still needs to reason with
     ℎ𝑜𝑙𝑑𝑠(𝑙𝑜𝑐(𝑅, 𝑁 ), 𝐼 + 1) ←                        (1a)    this knowledge and observations perform tasks such as in-
                                                               ference, planning, and diagnostics. In our architecture, we
           𝑜𝑐𝑐𝑢𝑟𝑠(𝑚𝑜𝑣𝑒(𝑅, 𝑁 ), 𝐼)
                                                               automatically construct the CR-Prolog program Π(𝒟, ℋ),
     ℎ𝑜𝑙𝑑𝑠(𝑎𝑡𝑡𝑎𝑏𝑙𝑒(𝐶, 𝑇 ), 𝐼 + 1) ←                    (1b)    which includes Σ and axioms of 𝒟, inertia axioms, reality
           𝑜𝑐𝑐𝑢𝑟𝑠(𝑠𝑒𝑎𝑡(𝑅, 𝐶, 𝑇 ), 𝐼)                           check axioms, closed world assumptions for actions, and
   ¬ℎ𝑜𝑙𝑑𝑠(𝑎𝑡𝑡𝑎𝑏𝑙𝑒(𝐶, 𝑇 2), 𝐼) ←                        (1c)    observations, actions, and defaults from ℋ; a basic version
                                                               of this program can be viewed online [29]. For planning
           ℎ𝑜𝑙𝑑𝑠(𝑎𝑡𝑡𝑎𝑏𝑙𝑒(𝐶, 𝑇 1), 𝐼), 𝑇 1 ̸= 𝑇 2               and diagnostics, this program also includes helper axioms
   ¬ℎ𝑜𝑙𝑑𝑠(𝑜𝑐𝑐𝑢𝑝𝑎𝑛𝑐𝑦(𝑇, 𝑋2), 𝐼) ←                       (1d)    that define a goal, and require the robot to search until a
           ℎ𝑜𝑙𝑑𝑠(𝑜𝑐𝑐𝑢𝑝𝑎𝑛𝑐𝑦(𝑇, 𝑋1), 𝐼), 𝑋1 ̸= 𝑋2                consistent model of the world is constructed and a plan
                                                               is computed to achieve the goal. Planning, diagnostics,




                                                           118
�and inference are then reduced to computing answer sets         𝑎𝑡𝑔 = 𝑚𝑜𝑣𝑒(𝑟𝑜𝑏1 , 𝑛2 ). The object constants relevant to
of Π; we use the SPARC system [30] to compute answer            this transition then include 𝑟𝑜𝑏1 , 𝑛1 , 𝑛2 , and 𝑘𝑖𝑡𝑐ℎ𝑒𝑛.
set(s). Each answer set represents the robot’s beliefs in a
possible world; the literals of fluents and statics at a time   Definition 2. [Relevant system description]
step represent the domain’s state at that time step. As         The system description relevant to a transition 𝑇 =
stated earlier, our architecture’s non-monotonic reasoning      ⟨𝜎1 , 𝑎𝑡𝑔 , 𝜎2 ⟩, i.e., 𝒟(𝑇 ), is defined by signature Σ(𝑇 )
ability supports recovery from incorrect inferences due to      and axioms. Σ(𝑇 ) is constructed to comprise:
incomplete knowledge or noisy sensor inputs.                         • Basic sorts of Σ that produce a non-empty inter-
   Prior work by the lead author and others resulted in an             section with 𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 ).
architecture for reasoning with transition diagrams at two
                                                                     • All object constants of basic sorts of Σ(𝑇 ) that
resolutions, with the fine-resolution diagram formally de-
                                                                       form the range of a static attribute.
fined as a refinement of the coarse-resolution diagram [23].
                                                                     • The object constants of basic sorts of Σ(𝑇 ) that
This definition differs from recent work on refinement and
                                                                       form the range of a fluent, or the domain of a
abstraction of ASP programs and other logics [31, 32] in
                                                                       fluent or a static, and are in 𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 ).
how the transition diagrams are coupled formally to satisfy
the requirements in the challenging context of integrated            • Domain attributes restricted to Σ(𝑇 )’s basic sorts.
robot systems. This relation guarantees the existence of a      Axioms of 𝒟(𝑇 ) are those of 𝒟 restricted to Σ(𝑇 ). It
path in the fine-resolution transition diagram implement-       can be shown that for each transition in the transition dia-
ing each coarse-resolution transition. The robot can then       gram of 𝒟, there is a transition in the transition diagram
use non-monotonic logical reasoning to compute a se-            of 𝒟(𝑇 ). States of 𝒟(𝑇 ), i.e., literals comprising fluents
quence of abstract actions for any given goal, implement-       and statics in the answer set of the ASP program, and
ing each abstract action as a sequence of fine-resolution       ground actions of 𝒟(𝑇 ), are candidates for further explo-
actions by automatically zooming to and reasoning prob-         ration. Continuing with the example in Definition 1, for
abilistically with the part of the fine-resolution diagram      𝑎𝑡𝑔 = 𝑚𝑜𝑣𝑒(𝑟𝑜𝑏1 , 𝑛2 ), 𝒟(𝑇 ) will not include axioms
relevant to the coarse-resolution transition. We build on       corresponding to other actions, e.g., for seating customers
that notion of relevance to automatically: (a) constrain the    at a table or giving the bill to a customer. If the robot has
robot’s attention to the nodes and regions relevant to any      to perform fine-resolution probabilistic reasoning for ac-
given transition or plan that the robot has to execute—this     tion execution, only the refinement of the relevant system
supports selective grounding; (b) limit recognition of hand     description will be considered.
gestures to the subset relevant to the task at hand, e.g.,
gestures for placing an order once customers are seated,        A robot waiter equipped with the representation and rea-
and limit learning to previously unknown hand gestures          soning module described above, still needs to interact with
and related axioms—see Section 3.3; and (c) provide rela-       humans. To support design and evaluation when in-person
tional descriptions of decisions by tracing the evolution of    interaction with the robot is not possible, we incorporated
relevant beliefs and application of relevant axioms—see         the interactive simulation module, as described below.
Section 3.3. For ease of understanding, we define the no-
tion of relevance for a given transition; similar definitions   3.2. Interactive Simulation and Hand
can be provided for a given goal or literal.
                                                                     Gestures
Definition 1. [Relevant object constants]                         We developed a simulation environment and interface for
Let 𝑇 = ⟨𝜎1 , 𝑎𝑡𝑔 , 𝜎2 ⟩ be the transition of interest. Let       the design and evaluation of our architecture. We used Py-
𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 ) be the set of object constants of signature Σ          Bullet [33], a Python-based module for simulating games
of 𝒟 identified using the following rules:                        and domains for machine learning and robotics. It enables
     • Object constants from 𝑎𝑡𝑔 are in 𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 );               us to quickly load different articulated bodies and pro-
     • If 𝑓 (𝑥1 , . . . , 𝑥𝑛 , 𝑦) is a literal formed of a domain vides  built-in support for forward and inverse kinematics,
        attribute, and the literal belongs to 𝜎1 or 𝜎2 , but      collision detection, and simulation of domain dynamics.
        not both, then 𝑥1 , . . . , 𝑥𝑛 , 𝑦 are in 𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 );        In our architecture, PyBullet is used to automatically
     • If body 𝐵 of an axiom of 𝑎𝑡𝑔 contains generate a restaurant layout, e.g., see Figure 4, based on
        𝑓 (𝑥1 , . . . , 𝑥𝑛 , 𝑌 ), a term whose domain is the domain information encoded in the ASP program, e.g.,
        ground, and 𝑓 (𝑥1 , . . . , 𝑥𝑛 , 𝑦) ∈ 𝜎1 , then Figure 3. Using the built-in blender of PyBullet, we are
        𝑥1 , . . . , 𝑥𝑛 , 𝑦 are in 𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 ).                    able to populate the simulated restaurant with a Pepper
                                                                  robot, tables, chairs, and the desired number of customers.
Object constants from 𝑟𝑒𝑙𝐶𝑜𝑛(𝑇 ) are said to be rele- We are also able to make on-demand revisions to the
vant to 𝑇 . For example, consider an initial state 𝜎1 domain, e.g., to match changes in the domain knowledge.
with 𝑙𝑜𝑐(𝑟𝑜𝑏1 , 𝑛1 ) and 𝑙𝑜𝑐(𝑤𝑎𝑖𝑡𝑒𝑟, 𝑘𝑖𝑡𝑐ℎ𝑒𝑛), and action In addition, our simulator supports the movement of the




                                                            119
�                                                             are related to seating customers, handling food orders, or
                                                             executing terminal transactions (e.g., provide bill).

                                                             3.3. Interactive Learning and
                                                                  Transparency
                                                             The architecture described so far reasons with incomplete
                                                             domain knowledge, which may lead the robot to make
                                                             incorrect decisions or cause the robot’s performance to
                                                             suffer, e.g., the robot may compute incorrect or unneces-
                                                             sarily long plans for any given goal. Also, the encoded
Figure 4: Simulated restaurant layout in PyBullet with       knowledge and models may need to change over time. We
robot waiter and customers.                                  address this requirement by introducing a module for in-
                                                             teractive learning and generation of relational descriptions
        Table 1         Table 2         Table 3
                                                             as “explanations” of the robot’s decisions and beliefs.

                                                             Interactive learning. The interactive learning com-
        Table 4         Table 5        Order fries           ponent of our architecture has two parts. Given the use
                                                             of hand gestures for human-robot interaction, the first
                                                             part seeks to detect new gestures and learn models for
                                                             these gestures. A new hand gesture is detected when
      Order steak   Ask for the bill
                                                             the observed gesture differs significantly from any of the
                                           Thumb
                                           Index             known gestures. A significant difference is experimen-
                                           Middle
                                           Ring
                                           Little
                                                             tally determined as a difference in 15% of the keypoints
                                                             in a sequence of images. When a new gesture is recog-
                                                             nized, the robot automatically gathers a sequence of image
Figure 5: (Left) Subset of hand gestures providing direc-
tions to robot; (Right) The 21 keypoints used to model
                                                             frames, extracts features from these images, stores them
each hand gesture.                                           in a separate file and quickly updates the hand gesture
                                                             recognition models to include this new gesture. A key
                                                             feature of our architecture is that reasoning and learning
                                                             inform and guide each other. For example, when the robot
robot in the restaurant based on the axioms encoded in the
                                                             has to recognize and respond to gestures, it automatically
ASP program. Furthermore, it is also possible to introduce
                                                             limits itself to gestures relevant to its current category of
new objects in the simulator (e.g., using hand gestures,
                                                             tasks, e.g., a robot delivering food cannot respond to direc-
see below) and automatically add this information to the
                                                             tion from a supervisor to seat new customers1 . Also, any
ASP program for further reasoning
                                                             newly learned gesture is placed in the appropriate cate-
   Recall that communication of human instructions to the
                                                             gory of gestures (determined based on purpose of gesture)
robot waiter is based on hand gestures made in the physi-
                                                             for subsequent reasoning. This use of reasoning to direct
cal world. To support such interaction, we first enabled
                                                             learning speeds up recognition and learning.
our architecture to recognize a base set of hand gestures;
                                                                The second part of the learning component focuses
a subset of these gestures are shown in Figure 5(left).
                                                             on acquiring axioms corresponding to any new gesture,
To model and recognize hand gestures, we integrate the
                                                             and merging the axioms with the existing ones. This is
OpenPose system [34] that characterizes gestures using
                                                             achieved by taking the label provided by human for the
21 keypoints, as shown in Figure 5(right). After the inte-
                                                             new gesture and checking if the corresponding instruction
gration, the simulator allows us to capture images of the
                                                             (e.g., seat two people) can be executed with the existing
hand gestures made in the physical world to quickly train
                                                             knowledge. If that is possible, no further learning is per-
deep network models that can accurately recognize these
                                                             formed. If existing knowledge is insufficient to execute
gestures in new videos (i.e., image sequences). We used
                                                             the new instruction, or if the human provides feedback,
an existing Python library for training these deep network
                                                             e.g., a textual or verbal description that is processed using
models with experimentally determined loss functions—
                                                             existing tools, which includes an action, literals extracted
Figure 6. Note that the modularity of the architecture
                                                             from the feedback are used to construct an axiom that is
makes it easy to quickly explore the different deep net-
                                                             merged with existing ones. Once again, reasoning helps
work models without changing other parts of the architec-
ture. The known hand gestures with trained models are        1
                                                                 Associating priority levels with tasks will enable the robot to inter-
then grouped in different categories based on whether they       rupt its current task to execute a higher-priority task.




                                                         120
�                                100


                               10 1




                        Loss
                               10 2


                               10 3


                               10 4
                                       0         2    4       6           8   10      12      14
                                                                  Epoch
                                      improved       ANN-3x16 (1691)          ANN-2x128 (25499)
                                      baseline       ANN-3x64 (12827)

Figure 6: Learning curves for acquiring models for the hand gestures using different deep network structures; models
with low loss are obtained over a few epochs when guided by reasoning.



Once again, reasoning helps direct this learning by limiting the scope to the relevant object constants and descriptions. For example, assume that the robot is shown a new gesture for seating a group of customers at a table. The robot will use human feedback about this new gesture, and only consider literals corresponding to: the location of these customers, its own location, and the occupancy of tables in the restaurant, to learn axioms for the new action.

Tracing explanations. Our architecture supports the ability to infer the sequence of axioms and beliefs that explains the evolution of any given belief or the non-selection of any given ground action at a given time. We build on the idea of proof trees, which have been used to explain observations in classical first-order logic [35], and adapt it to our architecture that is based on descriptions in non-monotonic logic. Our approach is based on the following sequence of steps (sketched in code below):

   1. Select axioms that have the target belief or action in the head.
   2. Ground the literals in each such axiom's body and check whether these ground literals are supported (i.e., satisfied) by the current answer set.
   3. Create a new branch in the proof tree (that has the target belief or action as root) for each selected axiom supported by the current answer set, and store the axiom and the related supporting ground literals in suitable nodes.
   4. Repeat Steps 1-3 with the supporting ground literals in Step 3 as target beliefs in Step 1, until all branches reach a leaf node without further supporting axioms.

Paths from the root to the leaves in these trees provide explanations. If multiple such paths exist, we currently select one of the shortest branches at random; other heuristics could be used to compare the explanations. For example, if the robot is asked why it seated a group of three customers at Table5, it can trace the current belief about the group back to the initial state through the application of relevant axioms, and come up with an explanation such as: "The three customers came to the restaurant and wanted to be seated as a group. Table5 at node n7 was the table closest to the entrance that had the desired number of seats available. I seated the customers at Table5."

In addition to tracing the evolution of a target belief and justifying the non-selection of a particular action, our architecture can also provide: (a) a description of any computed or executed plan in terms of literals in the plan; (b) justification for executing a particular action at a particular time step, by examining the change in state caused by the action's execution and how this state change achieves the goal or facilitates the execution of the next action in the plan; and (c) inferred outcome(s) of the execution of hypothetical actions, based on a mental simulation guided by the current domain knowledge. In all these cases, the identified literals are encapsulated in a prespecified answer template to provide the descriptions. For proof-of-concept examples in simple scene understanding scenarios, please see [9]; some specific examples in the RW domain are provided below (Section 4.1).
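The following is a minimal Python sketch of Steps 1-4 above. It assumes the axioms have already been grounded (each axiom is a pair of a head literal and a list of body literals), that the answer set is a set of ground literals, and that the support relation is acyclic; the data structures are illustrative rather than the architecture's actual representation.

    # Sketch of the proof-tree construction in Steps 1-4 (assumes ground axioms
    # and an acyclic support relation; names are illustrative).
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        literal: str                                   # target belief or action
        supports: list = field(default_factory=list)   # one entry per supporting axiom

    def build_proof_tree(target, axioms, answer_set):
        node = Node(target)
        for head, body in axioms:                            # Step 1
            if head != target:
                continue
            if all(lit in answer_set for lit in body):       # Step 2
                children = [build_proof_tree(lit, axioms, answer_set)   # Step 4
                            for lit in body]
                node.supports.append(((head, body), children))          # Step 3
        return node

    # Toy usage with literals from the restaurant domain:
    axioms = [("seated(group1, table5)",
               ["requested_group_seating(group1)", "closest_vacant(table5)"])]
    answer_set = {"seated(group1, table5)", "requested_group_seating(group1)",
                  "closest_vacant(table5)"}
    tree = build_proof_tree("seated(group1, table5)", axioms, answer_set)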
Control loop. Algorithm 1 is the overall control loop for the architecture. The baseline behavior (lines 3-8) is to plan and execute actions to achieve the given goal as long as a consistent model of history can be computed. If such a model cannot be constructed, it is attributed to an unexplained, unexpected observation, and the robot triggers interactive exploration (lines 9-12). Interactive exploration is also triggered if no active goal exists to be achieved (lines 13-15). Depending on the human input, the architecture either acquires the previously unknown gestures and axioms, or attempts to provide the desired description of a target decision or belief (lines 19-21). When in the learning mode, the robot can be interrupted if needed (lines 17-18), e.g., to pursue a new goal.
Algorithm 1: Our architecture's control loop.
Input: Π(𝒟, ℋ); goal description; initial state 𝜎1.
Output: Control signals for robot to execute.
 1  planMode = true, learnExplainMode = false
 2  while true do
 3      Add observations to history.
 4      ComputeAnswerSets(Π(𝒟, ℋ))
 5      if planMode then
 6          if existsGoal then
 7              if explainedObs then
 8                  ExecutePlanStep()
 9              else
10                  planMode = false
11                  learnExplainMode = true
12              end
13          else
14              learnExplainMode = true
15          end
16      else
17          if interrupt then
18              planMode = true
19          else if learnExplainMode then
20              AcquireKnowledgeExplain()
21          end
22  end

4. Execution Traces and Results

Meaningfully evaluating architectures for integrated robot systems is challenging. It is difficult to find a baseline that provides all the capabilities supported by our architecture, and it is also difficult to evaluate the capabilities of each component of the architecture in isolation. Also, given that reasoning and learning guide each other in our architecture to automatically identify and focus only on the relevant information, task complexity and scalability do not necessarily change substantially by increasing the number of tasks, and just reporting success in many scenarios is not very informative. In addition, it was difficult to use a physical robot to conduct the experimental trials during the pandemic. We thus focus on illustrating the capabilities of our architecture using a combination of execution traces (i.e., use cases) and some experiments that provide quantitative results. The key hypotheses to be evaluated are:

   H1: our architecture enables the robot to compute and execute plans to achieve desired goals;
   H2: having reasoning inform and guide learning improves the computational efficiency of learning and the recognition accuracy of the learned models; and
   H3: exploiting the links between reasoning and learning provides suitable relational descriptions as explanations of decisions and beliefs.

We explore hypotheses H1 and H3 in the execution traces (Section 4.1), and provide experimental results in support of H2 (Section 4.2).

4.1. Execution traces

We provide two execution traces to illustrate the operation of our architecture in specific scenarios. Videos corresponding to these traces can be viewed online [29]². In all the scenarios, the human user (in the physical world) uses hand gestures to create different situations and also to mimic the gestures to be made by the customers or the supervisor in the restaurant environment. The layout used to generate these traces is shown in Figure 7; it is a simplified version of Figure 3.

Figure 7: Example layout of the RW domain used in Execution Examples 1-2.

² https://www.cs.bham.ac.uk/~sridharm/KR22/

Execution Example 1. [Plan, execute, explain]
Consider a scenario in which there is one customer cu1 seated at table1 in the restaurant, and the robot waiter is in the region of node n4. In this scenario, the restaurant is organized into regions corresponding to eight nodes: n0-n7. The subsequent steps in this scenario are:

     • Three new customers (cu2-cu4) are introduced in the restaurant as a group by the human designer showing a suitable hand gesture. This information is also added to the ASP program automatically.
     • The hand gesture also lets the robot waiter (rob1)
       know that the new customers are to be seated at
       a table. The robot comes up with a plan based
       on the updated ASP program and the vacant table
       that is closest to it:

         move(rob1, n5), move(rob1, n0),
         pickup(rob1, group1), move(rob1, n5),
         move(rob1, n6), seat(rob1, group1, table2)

     • Note that applying the 𝑝𝑖𝑐𝑘𝑢𝑝 action to any cus-
       tomer in a group causes the same effect on all
       customers in the group. This plan is executed and
       the state is updated accordingly, e.g., 𝑐𝑢2 − 𝑐𝑢4
       are seated at 𝑡𝑎𝑏𝑙𝑒2 after the plan is executed.
     • The robot can be asked about the executed plan.
       Human: “why did you seat all the customers at
       𝑡𝑎𝑏𝑙𝑒2 ?”
       Pepper: “Because all the customers wanted to
       sit together and 𝑡𝑎𝑏𝑙𝑒2 was the closest available
       table.”

     • After some time, 𝑐𝑢1 has finished eating and
       would like to leave. The designer imitates the
       hand gesture that the customer would do in the
       restaurant to ask for the bill. This is translated into
       a goal in the ASP program: haspaid(cu1) (a solver-call sketch follows this list).
     • The robot computes and executes a suitable plan to
       give the bill to 𝑐𝑢1 , collect payment, and provide
       a receipt, after which 𝑐𝑢1 leaves the restaurant.
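The translation from a recognized gesture to an ASP goal and then to a plan can be sketched as follows. The paper's reasoner uses a sorted ASP encoding that is not reproduced here; the toy program, the predicate names, and the use of clingo's Python API below are all our assumptions, intended only to illustrate the goal-to-plan step for haspaid(cu1).

    # Sketch: add the goal haspaid(cu1) to a toy ASP program and compute plans.
    # The program fragment and the use of clingo are illustrative assumptions.
    import clingo

    ASP_PROGRAM = """
    step(0..2).
    { occurs(givebill(cu1), T) : step(T) } 1.
    holds(haspaid(cu1), T+1) :- occurs(givebill(cu1), T), step(T).
    goal :- holds(haspaid(cu1), T), step(T).
    :- not goal.                      % the goal must be achieved
    """

    def compute_plans(program):
        ctl = clingo.Control(["0"])               # enumerate all answer sets
        ctl.add("base", [], program)
        ctl.ground([("base", [])])
        plans = []
        with ctl.solve(yield_=True) as handle:
            for model in handle:
                actions = [a for a in model.symbols(shown=True)
                           if a.name == "occurs"]
                plans.append(sorted(actions, key=lambda a: a.arguments[1].number))
        return plans

    print(compute_plans(ASP_PROGRAM))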

Figure 8 shows snapshots from the beginning, middle, and end of this scenario.

Figure 8: Snapshots from the beginning, middle, and end of the scenario in Execution Example 1: (top) there is initially one customer cu1 seated at table1; (middle) the three new customers are at table2 and cu1 gets the robot waiter's attention to request the bill; and (bottom) cu1 has left the restaurant after paying the bill.

Execution Example 2. [Learn, plan, explain]
Consider another scenario in which the restaurant initially has no customers. Robot waiter rob1 is in the region of node n1 and knows that table1 and table2 have capacities two and four, respectively. Once again, the restaurant is organized into regions corresponding to eight nodes: n0-n7. The subsequent steps in this scenario are:

     • The human (in the physical world) makes a hand gesture that is unknown to the robot waiter. The robot responds by identifying this as a new gesture and conveys that it will be added to the database of hand gestures.
     • The robot adds the new hand gesture and solicits feedback about the gesture. The human (designer) intentionally provides a complex instruction (textually) that this gesture corresponds to "serve steak to a group of three new customers, and then give them the bill".
     • Since rob1 knows that serving a customer implies giving them the food item they want, it is able to parse this complex instruction into the component actions. When the human then makes the same hand gesture again and introduces three new customers (cu2-cu4) near the restaurant's entrance, rob1 computes a suitable plan (some steps omitted to promote understanding):

         move(rob1, n2), ..., pickup(rob1, cu2), ...,
         seat(rob1, cu2, table2), ...,
         serve(rob1, steak, table2), ...,
         givebill(rob1, table2), ...
     • The plan is executed and the state is updated accordingly at different time steps, e.g., cu2-cu4 are seated at table2 after the seat action is executed.
     • The robot can be asked about specific plan steps.
       Human: "why did you not serve pasta to table2?"
       Pepper: "Because all customers at table2 wanted to eat steak."
       This explanation is based on the previously-described approach to trace beliefs and the application of relevant axioms.

Figure 9 shows snapshots from the beginning, middle, and end of this scenario.

Figure 9: Snapshots from the beginning, middle, and end of the scenario in Execution Example 2: (top) there is initially no customer in the restaurant; (middle) the newly learned hand gesture is made to get the robot to serve steak to a group of customers; and (bottom) the robot provides a bill to the customers after they have completed their meal.

We evaluated the architecture in many other scenarios grounded in the motivating (restaurant) domain; the robot was able to successfully compute and execute plans to achieve the assigned goals, identify and learn previously unknown knowledge, and provide on-demand explanations of decisions and beliefs.

4.2. Experimental results

To further explore the effect of reasoning guiding learning, we conducted some quantitative studies. The first experiment examined the benefits of reasoning guiding the learning of deep network models for hand gestures. Deep learning methods typically need many labeled training examples and epochs to learn models for the target classification task. However, since learning in our architecture is constrained (by reasoning) to specific gestures or classes of gestures at a time, it took fewer samples and fewer epochs to acquire the desired models that provide high accuracy (see Figure 10).

The second experiment examined whether reasoning helped improve the recognition accuracy. In this experiment, we considered 30 hand gestures. One round of testing included 40 iterations of each hand gesture by a person who did not participate in training. We conducted multiple rounds of testing, and ground truth information was provided by the designers (i.e., the student authors). In the absence of the coupling between reasoning and learning, the learned models had (on average) an accuracy of 85% over the different hand gestures. However, with learning directed to specific (classes of) gestures, the learned models achieved better classification accuracy of approximately 100%.

The third experiment examined the ability to provide explanatory descriptions in response to different types of queries in different situations. A description was considered correct if it had all the correct literals and no additional literals. Overall, the interplay between reasoning (with relevant knowledge) and learning (of previously unknown knowledge) led to correct relational descriptions in 95% of the cases, with the "errors" being descriptions containing additional literals that were not essential to answer the query posed but were not necessarily wrong. In the absence of the learned knowledge, the accuracy (averaged over query types) was 65-80%.

5. Discussion and Conclusions

We conclude by highlighting the key capabilities of our architecture:

     • Once the designer has provided the domain-specific information (e.g., arrangement of rooms, range of the robot's sensors), planning, diagnostics, and plan execution can be automated. The coupling between reasoning and learning enables more complex theories (of cognition, action) to be encoded without increasing the computational effort substantially.
[Figure 10 plot: recognition accuracy vs. epoch, training and testing curves, for ANN-3x16 (1691), ANN-3x64 (12827), and ANN-2x128 (25499).]

Figure 10: Deep network models provide high (recognition) accuracy for hand gestures within a few epochs when
guided by reasoning.



     • Second, exploiting the interplay between knowledge-based reasoning and data-driven learning provides a clear separation of concerns, and helps focus attention automatically on the relevant knowledge and observed anomalies, thus improving the reliability and efficiency of reasoning and learning.
     • Third, it is easier to understand and modify the observed behavior than with architectures that consider all the available knowledge or only support data-driven learning. The robot is able to provide relational descriptions of its decisions and the evolution of its beliefs.
     • Fourth, there is smooth transfer of control and relevant knowledge between the components of the architecture, and increased confidence in the correctness of the robot's behavior. Also, the underlying methodology can be used with different robots and in different application domains.
     • Fifth, using KR tools and the coupling between reasoning and learning as the foundation promotes modularity and simplifies the design and evaluation of architectures for integrated robot systems.

Future work will further explore the interplay between reasoning and learning for explaining decisions and beliefs while performing reasoning and learning in more complex robotics domains. We will also investigate the use of our architecture on a physical robot interacting with humans through noisy sensors and actuators. The longer-term objective is to support transparent reasoning and learning in integrated robot systems operating in complex domains.

References

[1] E. Erdem, V. Patoglu, Applications of ASP in Robotics, Kunstliche Intelligenz 32 (2018) 143-149.
[2] E. Erdem, M. Gelfond, N. Leone, Applications of Answer Set Programming, AI Magazine 37 (2016) 53-68.
[3] K. Kersting, L. D. Raedt, Bayesian Logic Programs, in: International Conference on Logic Programming, London, UK, 2000.
[4] L. D. Raedt, A. Kimmig, Probabilistic Logic Programming Concepts, Machine Learning 100 (2015) 5-47.
[5] M. Richardson, P. Domingos, Markov Logic Networks, Machine Learning 62 (2006) 107-136.
[6] S. Zhang, M. Sridharan, A Survey of Knowledge-based Sequential Decision Making under Uncertainty, Artificial Intelligence Magazine 43 (2022) 249-266.
[7] Y. Gil, Learning by Experimentation: Incremental Refinement of Incomplete Planning Domains, in: International Conference on Machine Learning, New Brunswick, USA, 1994, pp. 87-95.
[8] M. Law, A. Russo, K. Broda, The ILASP System for Inductive Learning of Answer Set Programs, Association for Logic Programming Newsletter (2020).
[9] T. Mota, M. Sridharan, A. Leonardis, Integrated Commonsense Reasoning and Deep Learning for Transparent Decision Making in Robotics, Springer Nature CS 2 (2021) 1-18.
[10] M. Sridharan, B. Meadows, Knowledge Representation and Interactive Learning of Domain Knowledge for Human-Robot Collaboration, Advances in Cognitive Systems 7 (2018) 77-96.
[11] J. E. Laird, K. Gluck, J. Anderson, K. D. Forbus, O. C. Jenkins, C. Lebiere, D. Salvucci, M. Scheutz, A. Thomaz, G. Trafton, R. E. Wray, S. Mohan, J. R. Kirk, Interactive Task Learning, IEEE Intelligent Systems 32 (2017) 6-21.
[12] R. Assaf, A. Schumann, Explainable Deep Neural Networks for Multivariate Time Series Predictions, in: International Joint Conference on Artificial Intelligence, Macao, China, 2019, pp. 6488-6490.
[13] W. Samek, T. Wiegand, K.-R. Muller, Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models, ITU Journal: ICT Discoveries (Special Issue 1): The Impact of Artificial Intelligence (AI) on Communication Networks and Services 1 (2017) 1-10.
[14] W. Norcliffe-Brown, E. Vafeais, S. Parisot, Learning Conditioned Graph Structures for Interpretable Visual Question Answering, in: Neural Information Processing Systems, Montreal, Canada, 2018.
[15] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, J. B. Tenenbaum, Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, in: Neural Information Processing Systems, Montreal, Canada, 2018.
[16] M. Ribeiro, S. Singh, C. Guestrin, Why Should I Trust You? Explaining the Predictions of Any Classifier, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.
[17] Y. Zhang, S. Sreedharan, A. Kulkarni, T. Chakraborti, H. H. Zhuo, S. Kambhampati, Plan Explicability and Predictability for Robot Task Planning, in: International Conference on Robotics and Automation, 2017, pp. 1313-1320.
[18] R. Borgo, M. Cashmore, D. Magazzeni, Towards Providing Explanations for AI Planner Decisions, in: IJCAI Workshop on Explainable Artificial Intelligence, 2018, pp. 11-17.
[19] P. Bercher, S. Biundo, T. Geier, T. Hoernle, F. Nothdurft, F. Richter, B. Schattenberg, Plan, Repair, Execute, Explain - How Planning Helps to Assemble Your Home Theater, in: Twenty-Fourth International Conference on Automated Planning and Scheduling, 2014.
[20] J. Fandinno, C. Schulz, Answering the "Why" in Answer Set Programming: A Survey of Explanation Approaches, Theory and Practice of Logic Programming 19 (2019) 114-203.
[21] S. Anjomshoae, A. Najjar, D. Calvaresi, K. Framling, Explainable Agents and Robots: Results from a Systematic Literature Review, in: International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Montreal, Canada, 2019.
[22] T. Miller, Explanations in Artificial Intelligence: Insights from the Social Sciences, Artificial Intelligence 267 (2019) 1-38.
[23] M. Sridharan, M. Gelfond, S. Zhang, J. Wyatt, REBA: A Refinement-Based Architecture for Knowledge Representation and Reasoning in Robotics, Journal of Artificial Intelligence Research 65 (2019) 87-180.
[24] P. Langley, B. Meadows, M. Sridharan, D. Choi, Explainable Agency for Intelligent Autonomous Systems, in: Innovative Applications of Artificial Intelligence, San Francisco, USA, 2017.
[25] M. Sridharan, B. Meadows, Towards a Theory of Explanations for Human-Robot Collaboration, Kunstliche Intelligenz 33 (2019) 331-342.
[26] T. Mota, M. Sridharan, Commonsense Reasoning and Knowledge Acquisition to Guide Deep Learning on Robots, in: Robotics Science and Systems, Freiburg, Germany, 2019.
[27] M. Balduccini, M. Gelfond, Logic Programs with Consistency-Restoring Rules, in: AAAI Spring Symposium on Logical Formalization of Commonsense Reasoning, 2003, pp. 9-18.
[28] M. Gelfond, D. Inclezan, Some Properties of System Descriptions of ALd, Journal of Applied Non-Classical Logics, Special Issue on Equilibrium Logic and Answer Set Programming 23 (2013) 105-120.
[29] M. Sridharan, Supporting code and videos, 2022. https://www.cs.bham.ac.uk/~sridharm/KRFiles/.
[30] E. Balai, M. Gelfond, Y. Zhang, Towards Answer Set Programming with Sorts, in: International Conference on Logic Programming and Nonmonotonic Reasoning, Corunna, Spain, 2013.
[31] B. Banihashemi, G. D. Giacomo, Y. Lesperance, Abstraction of Agents Executing Online and their Abilities in Situation Calculus, in: International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018.
[32] Z. Saribatur, T. Eiter, P. Schuller, Abstraction for Non-ground Answer Set Programs, Artificial Intelligence 300 (2021) 103563.
[33] E. Coumans, Y. Bai, PyBullet: A Python Module for Physics Simulation for Games, Robotics, and Machine Learning, Technical Report, http://pybullet.org, 2016-2022.
[34] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, Y. A. Sheikh, OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[35] G. Ferrand, W. Lessaint, A. Tessier, Explanations and Proof Trees, Computing and Informatics 25 (2006) 1001-1021.