Vol-2644/paper44

From BITPlan ceur-ws Wiki

Paper

description  scientific paper published in CEUR-WS Volume 2644
id  Vol-2644/paper44
wikidataid  Q117337910
title  One-Shot Rule Learning for Challenging Character Recognition
pdfUrl  https://ceur-ws.org/Vol-2644/paper44.pdf
dblpUrl  https://dblp.org/rec/conf/ruleml/VargheseT20
volume  Vol-2644

Free text

One-Shot Rule Learning for Challenging Character Recognition


                 One-Shot Rule Learning
          for Challenging Character Recognition

                  Dany Varghese and Alireza Tamaddoni-Nezhad

        Department of Computer Science, University of Surrey, Guildford, UK
              {dany.varghese,a.tamaddoni-nezhad}@surrey.ac.uk



        Abstract. Unlike most computer vision approaches, which depend on
        hundreds or thousands of training images, humans can typically learn
        from a single visual example. Humans achieve this ability using background
        knowledge. Rule-based machine learning approaches such as Inductive Logic
        Programming (ILP) provide a framework for incorporating domain-specific
        background knowledge. These approaches have the potential for human-like
        learning from small data or even one-shot learning, i.e. learning from a
        single positive example. By contrast, statistics-based computer vision
        algorithms, including Deep Learning, have no general mechanisms for
        incorporating background knowledge. In this paper, we present an approach
        for one-shot rule learning called One-Shot Hypothesis Derivation (OSHD),
        which is based on using a logic program declarative bias. We apply this
        approach to the challenging task of Malayalam character recognition. This
        task is challenging due to the spherical and complex structure of
        handwritten Malayalam characters. Unlike for other languages, there is
        currently no efficient algorithm for handwritten Malayalam recognition.
        We compare our results with a state-of-the-art Deep Learning approach,
        called Siamese Network, which has been developed for one-shot learning.
        The results suggest that our approach can generate human-understandable
        rules and also outperforms the deep learning approach with a significantly
        higher average predictive accuracy.

        Keywords: One-Shot Learning · Rule-Based Machine Learning · Inductive
        Logic Programming (ILP) · Malayalam Character Recognition ·
        Computer Vision




Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1     Introduction

Deep Neural Networks (DNNs) [7, 2, 15, 3] have demonstrated state-of-the-art results on many pattern recognition tasks, especially in image classification problems [14, 9, 23, 6]. However, recent studies [24, 26] revealed major differences between human visual cognition and DNNs, and, more generally, most statistics-based computer vision learning algorithms. For example, it is easy to produce images that are completely unrecognizable to humans, yet which DNN visual learning algorithms classify as recognizable objects with over 99% confidence [24].

    Another major difference is related to the number of required training examples. Humans can typically learn from a single visual example [10], unlike statistical learning, which depends on hundreds or thousands of images. Humans achieve this ability using background knowledge, which plays a critical role. By contrast, statistics-based computer vision algorithms have no general mechanisms for incorporating background knowledge.

    Computer vision is a multidisciplinary field that aims to create high-level understanding from digital images or videos. The key intention of image analysis is to bridge the semantic gap between low-level descriptions of an image and the high-level concept within the image. The main objective of structural pattern analysis is to present the visual data using natural descriptions. Traditionally, this is achieved by extracting low-level visual cues from the data provided, then applying some grouping algorithm to express relationships, which are then transformed into increasingly complex features that generate higher-level rules [16].

    Document image analysis approaches such as Optical Character Recognition (OCR) are an important part of visual artificial intelligence with many real-world applications. The main objective of these approaches is to identify significant graphical properties from images. In this context, symbol recognition has a long history dating back to the 1970s. In the current state of the art, symbol recognition involves identifying isolated symbols; however, this is not enough for some more challenging real-world applications. As an example, consider an application where the visual data is represented as a combination of isolated symbols as well as composite symbols connected with other graphical elements. Statistical approaches that represent shapes only as low-level features will then have limited success.

    In this paper, we present an approach for one-shot rule learning called One-Shot Hypothesis Derivation (OSHD), which is based on using a logic program declarative bias. We apply this approach to the challenging task of Malayalam character recognition. This task is challenging due to the spherical and complex structure of handwritten Malayalam characters; the language scripts are mainly based on circular geometrical properties. Unlike for other languages, there is currently no efficient algorithm for handwritten Malayalam recognition. We have created a dataset of handwritten Malayalam characters which includes high-level properties of the language, based on the 'Omniglot' dataset designed for developing human-level concept learning algorithms [11]. We compare our results with a state-of-the-art Deep Learning approach, called Siamese Network [8], which has been developed for one-shot learning.
2   Inductive Logic Programming (ILP) and One-Shot Hypothesis Derivation (OSHD)

Inductive Logic Programming (ILP) has been defined as the intersection of in-
ductive learning and logic programming [21]. Thus, ILP employs techniques from
both machine learning and logic programming.
    The main objective of ILP, in its simplest form, is to discover the definition
of a predicate by observing positive and negative examples of that predicate.
Together with positive and negative examples of the target predicate, other
background information may also be provided containing further information
relevant to learning the target predicate. This background information is repre-
sented as a logic program and is called background knowledge (BK). ILP systems
develop predicate descriptions from examples and background knowledge. The
examples, background knowledge and final descriptions are all described as logic
programs.
    The logical notations and foundations of ILP can be found in [21, 25]. The
following definition, adapted from [25], defines the learning problem setting for
ILP.

Definition 1 (ILP problem setting).
Input: Given ⟨B, E⟩, where B is a set of clauses representing the background
knowledge and E is the set of positive (E⁺) and negative (E⁻) examples such
that B ⊭ E⁺.
Output: Find a theory H such that H is complete and (weakly) consistent with
respect to B and E. H is complete with respect to B and E⁺ if B ∧ H ⊨ E⁺.
H is consistent with respect to B and E⁻ if B ∧ H ∧ E⁻ ⊭ □. H is weakly
consistent with respect to B if B ∧ H ⊭ □.

    In this definition, ⊨ represents logical entailment and □ represents the empty
clause or logical refutation. Note that in practice, due to noise in the training
examples, the completeness and consistency conditions are usually relaxed.
For example, weak consistency is usually used and a noise threshold is considered
which allows H to be inconsistent with respect to a certain proportion (or
number) of negative examples.
    The following example is adapted from [13].
Example 1. In Definition 1, let E⁺, E⁻ and B be defined as follows:

       E⁺ = {daughter(mary, ann), daughter(eve, tom)}
       E⁻ = {daughter(tom, ann), daughter(eve, ann)}
        B = {mother(ann, mary), mother(ann, tom), father(tom, eve),
              father(tom, ian), female(ann), female(mary),
              female(eve), male(pat), male(tom),
              parent(X, Y) ← mother(X, Y),
              parent(X, Y) ← father(X, Y)}
Then both theories H1 and H2, defined as follows:

             H1 = {daughter(X, Y) ← female(X), parent(Y, X)}
             H2 = {daughter(X, Y) ← female(X), mother(Y, X),
                   daughter(X, Y) ← female(X), father(Y, X)}

are complete and consistent with respect to B and E.
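
To make Example 1 concrete, the following is a minimal sketch (ours, not from the original paper) of how B and the candidate theory H1 could be encoded and checked as a plain Prolog program:

    % Background knowledge B from Example 1.
    mother(ann, mary).
    mother(ann, tom).
    father(tom, eve).
    father(tom, ian).
    female(ann).
    female(mary).
    female(eve).
    male(pat).
    male(tom).
    parent(X, Y) :- mother(X, Y).
    parent(X, Y) :- father(X, Y).

    % Candidate hypothesis H1.
    daughter(X, Y) :- female(X), parent(Y, X).

    % Checking H1 against the examples:
    %   ?- daughter(mary, ann).   % true  (covers E+)
    %   ?- daughter(eve, tom).    % true  (covers E+)
    %   ?- daughter(tom, ann).    % false (rejects E-)
    %   ?- daughter(eve, ann).    % false (rejects E-)

Since both positive examples succeed and both negative examples fail, H1 is complete and consistent in the sense of Definition 1.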

2.1   One-Shot Hypothesis Derivation (OSHD)
In this paper we adopt a form of ILP which is suitable for one-shot learning and
is based on using a logic program declarative bias, i.e. using a logic program to
represent the declarative bias over the hypothesis space. Using a logic program
declarative bias has several advantages. Firstly, a declarative bias logic program
allows us to easily port bias from one problem to another similar problem (e.g. for
transfer learning). Secondly, it is possible to reason about the bias at the meta-
level. Declarative bias will also help to reduce the size of the search space for
the target concept or hypothesis derivation [1, 22]. We refer to this approach as
One-Shot Hypothesis Derivation (OSHD) which is a special case of Top-Directed
Hypothesis Derivation (TDHD) as described in [18].

Definition 2 (One-Shot Hypothesis Derivation). The input to an OSHD
system is the vector S_OSHD = ⟨NT, ⊤, B, E, e⟩ where NT is a set of "non-
terminal" predicate symbols, ⊤ is a logic program representing the declarative
bias over the hypothesis space, B is a logic program representing the background
knowledge, E is a set of examples and e is a positive example in E. The following
three conditions hold for clauses in ⊤: (a) each clause in ⊤ must contain at
least one occurrence of an element of NT, while clauses in B and E must not
contain any occurrences of elements of NT; (b) any predicate appearing in the
head of some clause in ⊤ must not occur in the body of any clause in B; and (c)
the head of the first clause in ⊤ is the target predicate and the head predicates
of all other clauses in ⊤ must be in NT. The aim of an OSHD learning system is
to find a set of consistent hypothesised clauses H, containing no occurrence of
NT, such that for each clause h ∈ H the following two conditions hold:

                                      ⊤ ⊨ h                                   (1)
                                    B, h ⊨ e                                  (2)
The following theorem is a special case of Theorem 1 in [18].

Theorem 1. Given S_OSHD = ⟨NT, ⊤, B, E, e⟩, conditions (1) and (2) hold
only if there exists an SLD refutation R of ¬e from ⊤ and B, such that R can be
re-ordered to give R′ = D_h R_e, where D_h is an SLD derivation of a hypothesis h
for which (1) and (2) hold.

According to Theorem 1, implicit hypotheses can be extracted from refutations of e. Let us now consider a simple example.
Fig. 1: SLD-refutation of ¬e. Resolving ¬e = ← alphabet(a) with ⊤1 = alphabet(X) ← $body(X) gives G1 = ← $body(a); resolving G1 with ⊤2 = $body(X) ← property1(X) gives G2 = ← property1(a); and resolving G2 with b1 = property1(a) ← gives the empty clause □.


Example 2. Let S_OSHD = ⟨NT, ⊤, B, E, e⟩ where NT, B, e and ⊤ are as follows:

      NT = {$body}
      B  = {b1 = property1(a) ←}
      e  = alphabet(a) ←
      ⊤  = {⊤1 : alphabet(X) ← $body(X),
            ⊤2 : $body(X) ← property1(X),
            ⊤3 : $body(X) ← property2(X)}

Given the linear refutation R = ⟨¬e, ⊤1, ⊤2, b1⟩, as shown in Figure 1, we now
construct the re-ordered refutation R′ = D_h R_e where D_h = ⟨⊤1, ⊤2⟩ derives the
clause h = alphabet(X) ← property1(X), for which (1) and (2) hold.

   The user of OSHD can specify a declarative bias ⊤ in the form of a logic
program. A general ⊤ theory can also be generated from user-specified mode
declarations. Figure 2 shows a simplified example of user-specified mode
declarations and the ⊤ theory automatically constructed from them.

Mode declarations:
    modeh(alphabet(+image)).
    modeb(has_prop1(+image)).
    modeb(has_prop2(+image)).

Automatically constructed ⊤ theory:
    ⊤1 : alphabet(X) ← $body(X).
    ⊤2 : $body(X) ← .                        % empty body
    ⊤3 : $body(X) ← has_prop1(X), $body(X).
    ⊤4 : $body(X) ← has_prop2(X), $body(X).

    Fig. 2: Mode declarations and a ⊤ theory automatically constructed from them

2.2     The OSHD Learning Algorithm
The OSHD learning algorithm can be described in 3 main steps:

1. Generate all hypotheses, H_e, that are generalizations of e.
2. Compute the coverage of each hypothesis in H_e.
3. Build the final theory, T, by choosing a subset of the hypotheses in H_e that
   maximises a given score function (e.g. compression).

    In step 1, H_e is generated using the OSHD hypothesis derivation described
earlier in this section.
    The second step of the algorithm, computing the coverage of each hypothesis,
is not needed for the positive examples if the user program is a pure logic program
(i.e. all relationships in the background knowledge are self-contained and do not
rely on Prolog built-in predicates). This is because, by construction, the OSHD
hypothesis derivation generates all hypotheses that entail a given example with
respect to the user-supplied mode declarations. This implies that the coverage of
a hypothesis is exactly the set of examples that have it as their generalization.
However, this coverage computation step is needed for the negative examples, as
they were not used to build the hypothesis set.
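
To illustrate step 2 concretely, the following is a minimal Prolog sketch of coverage testing. The representation (rule/2 terms for hypotheses, ground atoms for examples) and the predicate names hypothesis_covers/2, all_provable/1 and coverage/3 are our own illustrative choices, not part of Toplog:

    % A hypothesis is represented here as rule(Head, BodyList) and an
    % example as a ground atom. A hypothesis covers an example if a
    % fresh copy of its head unifies with the example and every body
    % literal is provable against the loaded background knowledge.
    hypothesis_covers(Rule, Example) :-
        copy_term(Rule, rule(Head, Body)),   % fresh variables per test
        Head = Example,
        all_provable(Body).

    all_provable([]).
    all_provable([Lit|Lits]) :-
        call(Lit),
        all_provable(Lits).

    % Coverage of a hypothesis over a list of examples:
    coverage(Rule, Examples, Covered) :-
        include(hypothesis_covers(Rule), Examples, Covered).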
    For step 3, the compression-based evaluation function used for the experiments
in this paper is:

        score = ( Σ weights of covered examples ) − ( total literals in the theory )        (3)

    The weight associated with an example may be defined by the user; by
default, positive examples have weight 1 and negative examples weight −1. In
general, negative examples are given a weight smaller than 0 and positive
examples a weight greater than 0.
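
Continuing the sketch above, the score (3) for a single hypothesis could be computed as follows; weight/2 and compression/3 are again illustrative names, with the default weighting scheme assumed:

    % Illustrative default weights for a hypothetical positive and
    % negative example (weight/2 is normally supplied per problem):
    weight(alphabet(char_pos), 1).
    weight(alphabet(char_neg), -1).

    compression(Rule, Examples, Score) :-
        Rule = rule(_Head, Body),
        coverage(Rule, Examples, Covered),          % from the sketch above
        findall(W, (member(E, Covered), weight(E, W)), Ws),
        sum_list(Ws, CoveredWeight),
        length(Body, BodyLength),
        TotalLiterals is BodyLength + 1,            % body literals plus the head
        Score is CoveredWeight - TotalLiterals.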


3   Siamese Neural Networks

In this paper, we use a state-of-the-art Deep Learning approach, called the
Siamese Network [8], which has been developed for one-shot learning. Siamese
networks were first introduced in the early 1990s by Bromley and LeCun to solve
signature verification as an image matching problem [5]. A Siamese network is
a Deep Learning architecture with two parallel neural networks with the same
properties in terms of weights, layers, etc. Each network takes a different input,
and their outputs are combined by an energy function at the top to provide
some prediction. The energy function computes some metric between the
highest-level feature representations on each side (Figure 3). Weight tying
guarantees that two extremely similar images cannot be mapped by their
respective networks to very different locations in feature space, because each
network computes the same function. Also, the network is symmetric, so that
whenever we present two distinct images to the twin networks, the top conjoining
layer will compute the same metric as if we were to present the same two images
to the opposite twins.

            Fig. 3: A simple 2 hidden layer Siamese Neural Network [8]
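
As a concrete illustration of such an energy function: in Koch et al.'s architecture [8], as far as we can tell from the original paper, the twin feature vectors h1 and h2 are compared component-wise, and the prediction is a sigmoid over a learned weighted L1 distance, roughly

        p(x1, x2) = σ( Σ_j α_j |h1,j − h2,j| )

where σ is the sigmoid function and the α_j are weights learned during training. This plays the role of the metric computed by the top conjoining layer described above.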


4     One-Shot learning for Malayalam character recognition
We apply One-Shot Hypothesis Derivation (OSHD) as well as Deep Learning
(i.e. the Siamese Network) to the challenging task of one-shot Malayalam
character recognition. This task is challenging due to the spherical and complex
structure of handwritten Malayalam characters.

4.1    Character recognition and human-like background knowledge
Malayalam is one of the four major languages of the Dravidian language family
and originated from the ancient Brahmi script. Malayalam is the official language
of Kerala, a state of India with roughly forty-five million people. Unlike for other
languages, there is currently no efficient algorithm for handwritten Malayalam
recognition. The basic Malayalam characters, along with their International
Phonetic Alphabet (IPA) transcriptions, are shown in Figure 4.¹

    Fig. 4: Sample Malayalam characters: (a) Malayalam vowels; (b) special
    consonants (chill); (c) consonants and consonant clusters

    Handwriting recognition for Malayalam script is a major challenge compared
to the recognition of other scripts for the following reasons:

 – Presence of a large number of alphabets
 – Different writing styles
 – Spherical features of alphabets
 – Similarity in character shapes

¹ From: https://sites.google.com/site/personaltesting1211/malayalam-alphabet.

    We selected the handwritten characters from the 'Omniglot' dataset [11].
Sample Malayalam alphabets from our dataset are shown in Figure 5. Feature
extraction is conducted using a set of advanced geometrical features [27] and
directional features.

    Fig. 5: Sample Malayalam characters from our dataset: (a) character 'Aha';
    (b) character 'Tha'

Geometrical Features Every character may be identified by its geometric
designations such as loops, junctions, arcs, and terminals. Geometrically, a loop
is a closed path. Malayalam characters contain intricate loops which may contain
some ups and downs within the loop itself, so we follow the concept shown in
Figure 6(b): if the figure has a continuous closed curve, we identify it as a loop.
Junctions may be defined as meeting points of two or more curves or lines; it is
easy for a human to identify a junction in an image, as shown in Figure 6(c). As
per dictionary definitions, an arc is a component of a curve, so in our case a path
with a semi-opening is considered an arc (see Figure 6(d) for more details).
Terminals may be classified as points where the character stroke ends, i.e. there
is no connection beyond that point; Figure 6(e) is a self-explanatory example of
this definition.
    A visual explanation of the geometrical feature extraction is included in
Figure 6. We selected the two characters shown in Figure 5 to explicate these
features and marked each geometrical feature as discussed. Table 1 gives an
abstract view of the dataset we developed for the experiments from the
'Omniglot' dataset.

Directional Features Every character may be identified by its directional
specifications, such as the starting and ending points of the stroke. There are
certain unwritten rules for Malayalam characters, e.g. a character always
commences from the left and moves towards the right. Native Malayalam users
can easily identify the starting and ending points. However, we need to represent
the starting and ending points as features so that they can be identified without
semantic knowledge of a character. The starting and ending points are determined
by the standard direction properties shown in Figure 6(a). Figures 6(f) and 6(g)
illustrate how the directional features are developed from an alphabet; character
ID 13 in Table 1 is the entry corresponding to the character shown in Figure 6(g).
As discussed, a user can easily identify both the starting and ending points of the
character displayed in Figure 6(g), whereas the terminus point of the character
in Figure 6(f) is hard to determine.

4.2   Mode declarations and background knowledge representation
In this section, we define the OSHD-specific details of the declarative bias,
defined by mode declarations, and the background knowledge representation
used in our experiments.
    Fig. 6: Human-like feature extraction criteria: (a) direction properties;
    (b) feature: loop; (c) feature: junction; (d) feature: arc; (e) feature:
    terminals; (f) feature: starting point; (g) feature: ending point


    The first step was to develop and represent the background knowledge based
on the concepts described in Section 4.1. Table 1 shows the geometrical and
directional features of 18 characters from 5 different alphabets used in our ex-
periments.

Mode declaration and declarative bias In this section we describe how the
declarative bias for the hypothesis space was defined using mode declarations.
Here, we use the same notation as Progol [17] and Toplog [18]. There are
two types of mode declarations:
1. modeh : defines the head of a hypothesised rule.
2. modeb : defines the literals (conditions) that may appear in the body of a
   hypothesised rule.
   For example, in our experiments, alphabet(+character) is the head of the
hypothesis, where +character defines the character identifier as an input
argument. We use four predicates in the body of the hypothesis, as shown in
Listing 1.1. Note that +, − and # indicate input, output and constant-value
arguments, respectively.

                   Table 1: Geometrical and Directional Properties

                         Geometrical Properties                   Directional Properties
Character ID  No. Loops  No. Junctions  No. Arcs  No. Terminals  Starting Point  Ending Point
      1           2            4            3            2             sw             null
      2           3            4            3            2             sw             null
      3           3            4            3            2             sw             null
      4           1            2            3            2             null           se
      5           1            3            3            2             nw             se
      6           0            1            3            2             nw             se
      7           1            1            2            1             null           se
      8           1            1            2            2             nw             se
      9           1            1            2            1             null           se
     10           3            3            4            1             null           se
     11           4            4            4            0             null           null
     12           3            4            3            0             null           null
     13           1            1            2            2             sw             se
     14           1            1            1            2             nw             se
     15           1            1            1            2             sw             ne
     16           0            2            1            2             sw             ne
     17           0            0            1            2             sw             ne
     18           0            2            1            2             nw             ne


                               Listing 1.1: Mode declarations
:- modeh(1, alphabet(+character)).
:- modeb(*, has_gemproperties(+character, -properties)).
:- modeb(*, has_gemproperties_count(+properties, #geo_feature_name, #int)).
:- modeb(*, has_dirproperties(+character, -properties)).
:- modeb(*, has_dirproperties_feature(+properties, #dir_feature_name, #feature_value)).

    The meaning of each modeb condition is defined as follows:

has_gemproperties/2 represents the geometrical features defined in Table 1.
   The input argument character is the unique identifier for an alphabet;
   properties refers to the property name.
has_gemproperties_count/3 gives the count of a particular geometrical feature
   associated with the alphabet. Here properties is the unique identifier for a
   particular geometrical property of a particular alphabet, geo_feature_name
   refers to the property name and int stands for the feature count.
has_dirproperties/2 represents the directional features mentioned in Table 1.
   The character is the unique identifier for the alphabet; properties refers to
   the property name.
has_dirproperties_feature/3 gives the value of a particular directional feature
   associated with the alphabet. Here properties is the unique identifier for a
   particular property of a particular alphabet, dir_feature_name refers to the
   property name and feature_value stands for the feature value.


Background knowledge representation As stated in Definition 1, background
knowledge is a set of clauses representing what is known about a problem. In
general, background knowledge can be represented as a logic program and could
include general first-order rules. However, in this paper we only consider
ground-fact background knowledge. Listing 1.2 shows sample background
knowledge for one alphabet.

         Listing 1.2: Sample background knowledge for alphabet 'Aha'
     %% Geometrical feature 'Loops' with feature count
has_gemproperties(character_0, loops_0).
has_gemproperties_count(loops_0, loops, 2).
     %% Geometrical feature 'Arcs' with feature count
has_gemproperties(character_0, arcs_0).
has_gemproperties_count(arcs_0, arcs, 3).
     %% Geometrical feature 'Junctions' with feature count
has_gemproperties(character_0, junctions_0).
has_gemproperties_count(junctions_0, junctions, 4).
     %% Geometrical feature 'Terminals' with feature count
has_gemproperties(character_0, terminals_0).
has_gemproperties_count(terminals_0, terminals, 2).
     %% Directional feature 'Starting Point' with feature value
has_dirproperties(character_0, starts_0).
has_dirproperties_feature(starts_0, starts_at, sw).
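
To illustrate how a learned rule interacts with this background knowledge, consider the rule learned for 'Aha' (column 4 of Table 3, later in the paper) written as a Prolog clause and queried against Listing 1.2; this worked query is our own illustration, not output of the learning system:

    % Learned rule for 'Aha', taken from Table 3. (The literals
    % binding C and D are redundant but appear in the learned rule
    % as reported.)
    alphabet(A) :-
        has_gemproperties(A, B),
        has_gemproperties(A, C),
        has_gemproperties_count(B, junctions, 4),
        has_gemproperties(A, D).

    % With the facts of Listing 1.2 loaded:
    %   ?- alphabet(character_0).
    %   true.
    % The query succeeds because character_0 has a geometrical
    % property (junctions_0) whose junction count is 4.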


5    Experiments

In this section we evaluate the OSHD approach for complex character recognition
as described in this paper. We also compare the performance of OSHD with a
state-of-the-art Deep Learning architecture for one-shot learning, i.e. the Siamese
Network approach described in Section 3. In particular, we test the following null
hypotheses:

Null Hypothesis H1 OSHD cannot outperform Siamese Networks in one-shot
   learning for complex character recognition.
Null Hypothesis H2 OSHD cannot learn human-comprehensible rules for
   complex character recognition.

5.1   Materials and Methods

The OSHD algorithm in this experiment is based on the Top-Directed Hypothesis
Derivation implemented in Toplog [18], and uses the mode declarations and
background knowledge defined earlier in this paper. The Siamese Network used
in the experiment is based on the implementation described in [8].
    The challenging Malayalam character recognition dataset and the machine
learning code, configurations and input files are available from:
https://github.com/danyvarghese/One-Shot-ILP
    We selected 5 complex alphabets from the 'Omniglot' dataset [11]. Example
characters used in our experiment and their visual properties (developed using
the geometrical and directional concepts discussed in Section 4) are listed in
Table 2. We endeavoured to apply the same working setup in both architectures
and repeated the experiments for different numbers of folds, where each fold
consists of a single positive example and n negative examples; n varies from 1 to
4 and the negative examples are selected from other alphabets. In our experiments
we use the term 'number of classes' in a particular sense: the number of classes is
the total number of examples (i.e. 1 positive and n negative) used for the
cross-validation.
    In the following we define the specific parameter settings for each algorithm.


OSHD parameter settings The following Toplog parameter settings were
used in this experiment.

 clause_length (value = 15) defines the maximum number of literals (including
    the head) of a hypothesis.
 weight: the weight of a negative example is always taken as the default value.
    The weight of a positive example is the number of negative examples for that
    class. During the cross-validation test, we add one more positive example of
    the same alphabet for each fold; the weight of the newly added example will
    not be greater than that of the previous one included in the same fold.
 positive_example_inflation (value = 10) multiplies the weights of all positive
    examples by this factor.
 negative_example_inflation (value = 5) multiplies the weights of all negative
    examples by this factor.
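
For concreteness, these settings could be expressed in a parameter file along the following lines; the Progol-style set/2 syntax and the exact flag names are our assumption and should be checked against the Toplog distribution in the repository above:

    % Illustrative parameter settings, assuming the Progol-style
    % set/2 convention (flag names assumed, not verified).
    :- set(clause_length, 15).
    :- set(positive_example_inflation, 10).
    :- set(negative_example_inflation, 5).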
                   Table 2: Sample characters & properties

Alphabet 'Aha' (ID:1)   Loops: 2, Junctions: 4, Arcs: 3, Terminals: 2,
                        Starting Point: SW, Ending Point: Null

Alphabet 'Eh' (ID:4)    Loops: 1, Junctions: 2, Arcs: 3, Terminals: 2,
                        Starting Point: Null, Ending Point: SE

Alphabet 'Uh' (ID:7)    Loops: 1, Junctions: 1, Arcs: 2, Terminals: 1,
                        Starting Point: Null, Ending Point: SE




Siamese Network parameter settings For the implementation of the Siamese
Net, we followed the same setup used by Koch et al. [8]. Koch et al. use a
convolutional Siamese network to classify pairs of 'Omniglot' images, so the twin
networks are both Convolutional Neural Nets (CNNs). The twins each have the
following architecture:

 – Convolution with 64 (10 × 10) filters and 'relu' activation.
 – Max pooling, then convolution with 128 (7 × 7) filters and 'relu' activation.
 – Max pooling, then convolution with 128 (4 × 4) filters and 'relu' activation.
 – Max pooling, then convolution with 256 (4 × 4) filters and 'relu' activation.

The twin networks reduce their inputs down to smaller and smaller 3D tensors.
Finally, there is a fully connected layer with 4096 units.
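
As a sanity check on the architecture above, here is the shape arithmetic for one twin, assuming the 105 × 105 greyscale Omniglot input used by Koch et al. [8] and 'valid' convolutions with 2 × 2 max pooling (our reconstruction):

    105 × 105 × 1  → conv 10 × 10, 64 filters  → 96 × 96 × 64  → pool → 48 × 48 × 64
    48 × 48 × 64   → conv 7 × 7, 128 filters   → 42 × 42 × 128 → pool → 21 × 21 × 128
    21 × 21 × 128  → conv 4 × 4, 128 filters   → 18 × 18 × 128 → pool → 9 × 9 × 128
    9 × 9 × 128    → conv 4 × 4, 256 filters   → 6 × 6 × 256

The final 6 × 6 × 256 = 9216 values are flattened and fed to the fully connected layer with 4096 units.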
    Most implementations of Siamese Networks develop the training model from
a large amount of data. Also, particularly in the case of character recognition,
they compare a character from one language against characters from other
languages [12, 4]. In our experiment we only consider alphabets from a single
language.


5.2   Results and Discussions

Figure 7 shows the average predictive accuracy of ILP (OSHD) vs Deep Learning
(Siamese Net) in one-shot character recognition with an increasing number of
character classes. According to this figure, OSHD outperforms the Siamese Nets
with an average difference of more than 20%. In this figure the random curve
represents the default accuracy of random guessing; the accuracy for one-class
prediction is always 100%. Null hypothesis H1 is therefore refuted by this
experiment. The better predictive accuracy of OSHD compared to the Siamese
Net can be explained by the fact that it uses background knowledge.

Fig. 7: Average predictive accuracy of ILP (OSHD) vs Deep Learning (Siamese
Net) in one-shot character recognition with increasing number of character
classes.
    Table 3 shows an example of rules learned by OSHD from one positive
and two negative examples. One can easily differentiate the alphabet 'Aha' from
'Eh' and 'Uh'. The properties that distinguish 'Aha' from the other alphabets
are given in the column 'Human Interpretations', and they closely match the
learned rule in column 4. It is also clear that the rule in column 4 is
human-comprehensible. Null hypothesis H2 is therefore refuted by this experiment.


6   Conclusion

In this paper, we presented a novel approach for one-shot rule learning called
One-Shot Hypothesis Derivation (OSHD), which is based on using a logic
program declarative bias. We applied this approach to the challenging task of
Malayalam character recognition. This task is challenging due to the spherical
and complex structure of handwritten Malayalam characters. Unlike for other
languages, there is currently no efficient algorithm for handwritten Malayalam
recognition.
                       Table 3: Example of learned rules

+ve Example: Alphabet 'Aha'
    Loops: 2, Junctions: 4, Arcs: 3, Terminals: 2,
    Starting Point: SW, Ending Point: Null

−ve Examples: Alphabet 'Eh'
    Loops: 1, Junctions: 2, Arcs: 3, Terminals: 2,
    Starting Point: Null, Ending Point: SE
  and Alphabet 'Uh'
    Loops: 1, Junctions: 1, Arcs: 2, Terminals: 1,
    Starting Point: Null, Ending Point: SE

Human Interpretations: Loops: 2, Junctions: 4, Starting Point: SW

Learned Rule:
    alphabet(A) if
        has_gemproperties(A, B),
        has_gemproperties(A, C),
        has_gemproperties_count(B, junctions, 4),
        has_gemproperties(A, D)



The features used to express the background knowledge were developed in such
a way that they are also acceptable to human visual cognition. We could learn
rules for each character which are natural and visually acceptable. We compared
our results with a state-of-the-art Deep Learning approach, called Siamese
Network, which has been developed for one-shot learning. The results suggest
that our approach can generate human-understandable rules and also outperforms
the deep learning approach with a significantly higher average predictive accuracy
(an increase of more than 20% on average). It was clear from the results that
the deep learning paradigm needs more data and is less effective when dealing
with a small amount of data. As future work we would like to further extend
the background knowledge to include more semantic information. We will also
explore the new framework of Meta-Interpretive Learning (MIL) [19, 20] in order
to learn recursive rules.

Acknowledgments
We would like to acknowledge Stephen Muggleton and Jose Santos for the development
of Top Directed Hypothesis Derivation and Toplog [18] which was the basis for One-
Shot Hypothesis Derivation (OSHD) presented in this paper. We also acknowledge the
Vice Chancellor’s PhD Scholarship Award at the University of Surrey.
References
 1. Adé, H., Raedt, L.D., Bruynooghe, M.: Declarative bias for specific-to-general ILP
    systems. Machine Learning 20, 119–154 (1995)
 2. Bengio, Y.: Learning Deep Architectures for AI. Found. Trends Mach. Learn. 2(1),
    1–127 (Jan 2009)
 3. Bennett, C.H., Parmar, V., Calvet, L.E., Klein, J., Suri, M., Marinella, M.J., Quer-
    lioz, D.: Contrasting advantages of learning with random weights and backpropaga-
    tion in non-volatile memory neural networks. IEEE Access 7, 73938–73953 (2019)
 4. Bouma, S.: One shot learning and siamese networks in keras.
    https://sorenbouma.github.io/blog/oneshot/ (2017)
 5. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification
    using a “siamese” time delay neural network. In: Proceedings of the 6th Interna-
    tional Conference on Neural Information Processing Systems. p. 737–744. NIPS’93,
    Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
 6. Varghese, D., Shankar, V.: A novel approach for single image super resolution
    using statistical mathematical model. International Journal of Applied Engineering
    Research (IJAER) 10(44) (2015)
 7. Hinton, G.: Learning multiple layers of representation. Trends in cognitive sciences
    11, 428–434 (11 2007)
 8. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot im-
    age recognition. In: Proceedings of the 32nd International Conference on Machine
    Learning. vol. 37 (2015)
 9. Krig, S.: Computer Vision Metrics Survey, Taxonomy, and Analysis. Apress, Berke-
    ley, CA (2014)
10. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple
    visual concepts. In: Proceedings of the 33rd Annual Conference of the Cognitive
    Science Society. pp. 2568–2573 (2011)
11. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning
    through probabilistic program induction. Science 350(6266), 1332–1338 (2015)
12. Lamba, H.: One shot learning with siamese networks using keras.
    https://towardsdatascience.com/one-shot-learning-with-siamese-networks-using-
    keras-17f34e75bb3d (2019)
13. Lavrač, N., Džeroski, S.: Inductive Logic Programming : Techniques and Applica-
    tions. Ellis Horwood (1993)
14. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-
    temporal features for action recognition with independent subspace analysis. In:
    CVPR 2011. pp. 3361–3368 (2011)
15. Liu, X., He, P., Chen, W., Gao, J.: Multi-task deep neural networks for natural
    language understanding. CoRR 1901.11504 (2019)
16. Mas, J., Sanchez, G., Llados, J., Lamiroy, B.: An incremental on-line parsing al-
    gorithm for recognizing sketching diagrams. In: Ninth International Conference on
    Document Analysis and Recognition (ICDAR 2007). vol. 1, pp. 452–456 (2007)
17. Muggleton, S.: Inverse entailment and Progol. New Generation Computing 13,
    245–286 (1995)
18. Muggleton, S.H., Santos, J., Tamaddoni-Nezhad, A.: TopLog: ILP using a logic
    program declarative bias. In: Proceedings of the International Conference on Logic
    Programming 2008. pp. 687–692. LNCS 5366, Springer-Verlag (2010)
19. Muggleton, S., Lin, D., Tamaddoni-Nezhad, A.: Meta-interpretive learning of
    higher-order dyadic datalog: Predicate invention revisited. Machine Learning
    100(1), 49–73 (2015)
20. Muggleton, S., Dai, W.Z., Sammut, C., Tamaddoni-Nezhad, A.: Meta-interpretive
    learning from noisy images. Machine Learning 107 (2018)
21. Muggleton, S., de Raedt, L.: Inductive logic programming: Theory and methods.
    The Journal of Logic Programming 19-20, 629–679 (1994), Special Issue: Ten
    Years of Logic Programming
22. Nedellec, C.: Declarative bias in ILP (1996)
23. Neethu, K.S., Varghese, D.: An incremental semi-supervised approach for visual
    domain adaptation. In: 2017 International Conference on Communication and Sig-
    nal Processing (ICCSP). pp. 1343–1346 (2017)
24. Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High
    confidence predictions for unrecognizable images. In: 2015 IEEE Conference on
    Computer Vision and Pattern Recognition (CVPR). pp. 427–436 (2015)
25. Nienhuys-Cheng, S.H., de Wolf, R.: Foundations of Inductive Logic Programming.
    Springer-Verlag, Berlin (1997). LNAI 1228
26. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.,
    Fergus, R.: Intriguing properties of neural networks. In: International Conference
    on Learning Representations (2014)
27. Usman Akram, M., Bashir, Z., Tariq, A., Khan, S.A.: Geometric feature points
    based optical character recognition. In: 2013 IEEE Symposium on Industrial Elec-
    tronics Applications. pp. 86–89 (2013)