Vol-3194/paper9


id  Vol-3194/paper9
wikidataid  Q117345107
title  Automatic Information Extraction from Investment Product Documents
pdfUrl  https://ceur-ws.org/Vol-3194/paper9.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/ScafoglieriMNLL22
volume  Vol-3194

Automatic Information Extraction from Investment Product Documents


Automatic Information Extraction from Investment Product Documents
Federico Maria Scafoglieri1 , Alessandra Monaco1 , Giulia Neccia1 , Domenico Lembo1 ,
Alessandra Limosani2 and Francesca Medda2,3
1 Sapienza Università di Roma, Piazzale Aldo Moro, 5, 00185 Roma RM
2 Commissione Nazionale per le Società e la Borsa, Via Giovanni Battista Martini, 3, 00198 Roma RM
3 University College London, Gower Street, London, United Kingdom


Abstract
In this paper we report on the activities carried out within a collaboration between Consob and Sapienza University. The project focuses on Information Extraction from documents describing financial investment products. We discuss how we automate this task, via both rule-based and machine learning-based methods, and describe the performance of our approach.

Keywords
Information Extraction, Financial documents, Financial Market Vigilance, Table Extraction




1. Introduction
In Italy, Consob (Commissione Nazionale per le Società e la Borsa) is the supervisory and regulatory authority of the financial market. Among its several functions, Consob monitors and supervises the financial instruments that are issued, with the ultimate aim of detecting illicit conduct and taking enforcement action against it.
   The specific project we discuss in the following involves, besides Consob, researchers from Sapienza University of Rome. To support Consob in its supervision activities, and in particular in the task of verifying the correctness and completeness of the information it collects about financial investment products, within this collaboration we developed a solution for the automatic extraction of structured information from free-text documents. Through this solution, relevant data contained in the documents no longer need to be manually identified and extracted, but can be collected through an automated process that makes them available in a machine-processable format. To this aim, two approaches to Information Extraction (IE) have been adopted: rule-based and machine learning (ML)-based. In the following we describe them, explain why both are useful for our goals, and report on the first results we obtained in the project.



SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
scafoglieri@diag.uniroma1.it (F. M. Scafoglieri); monaco.1706205@studenti.uniroma1.it (A. Monaco); neccia.1847033@studenti.uniroma1.it (G. Neccia); lembo@diag.uniroma1.it (D. Lembo); i.tiddi@vu.nl (A. Limosani); f.medda@ucl.ac.uk (F. Medda)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Figure 1: Portion of a KID by Credit Suisse.




2. The use case
In the EU, the creators of financial products (a.k.a. financial manufacturers) are obliged by law to make information related to so-called PRIIPs (Packaged Retail and Insurance-based Investment Products) publicly available. The NCAs (National Competent Authorities) have supervisory duties on such products, so that they can be safely placed on the respective national markets. The legislation requires information about PRIIPs to be communicated to NCAs through documents called KIDs (Key Information Documents). In practice, this means that the features to be checked are cast into text reports, typically formatted as PDF files (cf. Figure 1, containing a KID for the Italian market), and extracting structured data from them (to bootstrap control activities) is the responsibility of the authority (in Italy, Consob). Due to the massive amount of documents to be analyzed (e.g., more than 1 million KIDs received in 2020), this process cannot be carried out manually, yet to date it is only partially automated.


3. The approach
The KIDs present key information about financial products through free text or tables. We decided to extract relevant data contained in the free-text part of the documents through a rule-based mechanism [1, 2]. This choice is motivated by the fact that KIDs follow a quite rigid template imposed by the European authority, a characteristic that enables domain experts to express quite precise patterns for the identification of the wanted data, and AI experts to encode them into extraction rules. This approach also has the great advantage of providing the final user with a precisely explainable IE mechanism, a characteristic that is particularly critical in the context in which we operate.
   However, rule-based mechanisms do not lend themselves well to the extraction of data contained in tables, which are particularly difficult to identify, depending also on the way in which the table is originally formatted [3]. For this reason we decided to resort to an approach based on learning systems that treat a PDF file as if it were an image and aim to identify the set of pixels that constitute tables, together with the corresponding cells. Afterwards, the textual characters contained inside the detected cells can be recognized through OCR methods. In the following we discuss the two approaches in more detail, highlighting their complete pipelines and showing the results obtained.

3.1. Rule-based IE
We defined around 100 rules to extract 16 data fields, such as product name, manufacturer, and investment risk, from the free text contained in a KID. Below we briefly comment on the data preparation and annotation modules that are part of our rule-based IE pipeline. A third module is in charge of exporting the extracted fields in CSV format.
Data Preparation. This module transforms the PDF into plain text and cleans it from errors. We implemented it by means of PDFBox (https://pdfbox.apache.org), a state-of-the-art, highly customizable Apache library for PDF documents.
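As an illustration only, this step can be sketched as follows, assuming PDFBox's standalone command-line app and its ExtractText tool; the jar name and the cleanup rules below are placeholder assumptions, not the project's actual configuration.

import re
import subprocess

PDFBOX_JAR = "pdfbox-app-2.0.27.jar"  # placeholder path to the PDFBox app jar

def pdf_to_clean_text(pdf_path: str, txt_path: str) -> str:
    # PDFBox's ExtractText tool dumps the text layer of the PDF to a file.
    subprocess.run(["java", "-jar", PDFBOX_JAR, "ExtractText", pdf_path, txt_path],
                   check=True)
    with open(txt_path, encoding="utf-8") as f:
        text = f.read()
    # Illustrative cleanup (assumed): undo end-of-line hyphenation and
    # collapse runs of spaces/tabs left over from the PDF layout.
    text = text.replace("-\n", "")
    text = re.sub(r"[ \t]+", " ", text)
    return text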
Annotation. The annotation task is carried out through CoreNLP (https://stanfordnlp.github.io/CoreNLP/), a popular open-source Java toolkit for natural language processing developed at Stanford University. For our development we relied on the CoreNLP-it module [4], specifically developed for the Italian language. We exploited fundamental services offered by CoreNLP, such as tokenization, lemmatization, and POS tagging, and in particular made use of the TokensRegex component [5], which extends traditional regular expressions by working on tokens instead of characters and by defining pattern matching via a stage-based application. These extensions allow for writing extraction rules, i.e., rule-based extractors matching on additional token-level features, such as part-of-speech annotations, named entity tags, and custom annotations.
   As an example, consider the following TokensRegex extraction rule useful for extracting
International Securities Identification Numbers (ISINs) from KIDs:
$StartISIN = (
        /ISIN/ /:/|/Codice/ /del/ /Prodotto|prodotto/ /:/|
        ... )
$EndISIN = ( /*/ ...)
$code = "/([A-Za-z][A-Za-z][0-9]{10})/"
{ ruleType: "tokens",
    pattern: (
        ($StartISIN) (?$CodeISIN [{word:$code} &
        {SECTION:"SECTION_PRODUCT"}]+?) ($EndISIN)),
    action: ( Annotate($CodeISIN, ISIN, "ISIN")) }


   The ISIN is an alphanumeric sequence of 12 characters identifying a PRIIP. This sequence begins with two letters followed by ten digits. In the rule above, the ISIN format is compiled into the regular expression assigned to the variable $code. $StartISIN and $EndISIN represent the sets of token sequences (separated from each other by the symbol |) preceding and following the ISIN code, respectively.

Figure 2: Pipeline for table extraction.




ruleType specifies that our rule works on "tokens", whereas pattern contains the pattern to be matched over the text. Through the pattern above, the rule looks for a match of the regular expression $code over all the tokens in the product section (i.e., annotated with SECTION_PRODUCT) such that $code is preceded by the tokens in $StartISIN and followed by those in $EndISIN. Finally, action describes which annotation to apply. In our example, we annotate the token identified by the group $CodeISIN with the annotation ISIN. By applying this rule over the KID in Figure 1, we annotate CH0524993752 with ISIN.
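For illustration, the core of the $code pattern can be reproduced as a stand-alone Python regular expression (a minimal sketch, outside of TokensRegex):

import re

# The same 12-character format compiled into $code above:
# two letters followed by ten digits.
ISIN_RE = re.compile(r"[A-Za-z]{2}[0-9]{10}")

assert ISIN_RE.fullmatch("CH0524993752")      # the ISIN annotated in Figure 1
assert not ISIN_RE.fullmatch("CH05249937")    # too short: no match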

3.2. Machine Learning-based IE
We used ML techniques to extract data from the following tables:
     • Performance scenarios: this table provides data about potential risks and yield. We are interested in extracting the potential refund and the average yield per year in all the possible scenarios (stress, unfavorable, moderate, favorable), each at fixed time periods (initial, intermediate, recommended holding).
     • Costs over time: this table shows the cost trend. We want to extract data about total costs and the reduction in yield (RIY) in percentage at the same time periods considered for the performance scenarios.
     • Composition of costs: this table describes the components of costs. We aim to extract the RIY for various cost categories: one-off (una tantum) entry and exit costs, recurrent portfolio transaction costs, other recurrent costs, and performance and overperformance fees.
   We modeled the problem as a Computer Vision task, converting each page of the KID into a set of images. In such a setting, the extraction involves (1) a Table Detection task, aiming at detecting bordered or borderless/semi-bordered tables; (2) a Table Structure Recognition task, able to identify the cells belonging to the detected table; and (3) a Text Extraction task, to get the textual information contained inside each cell.
   We propose an automatic, learning-based approach, exploiting the power of Deep Learning and Convolutional Neural Networks (CNNs) for image processing. In particular, for subtasks 1-2 we use CascadeTabNet [6], a recent implementation of a CNN model that was trained by its authors on the detection of bordered/borderless tables and cells, first on general tables (e.g., Word and LaTeX documents), then on ICDAR-19 (a dataset containing table data, released for a Table Detection and Recognition competition at ICDAR in 2019). For subtask 3 we instead use Tesseract [7], a popular Optical Character Recognition (OCR) engine.
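A minimal sketch of the three subtasks, assuming MMDetection 2.x with the published CascadeTabNet config and checkpoint (the file names below are placeholders), pdf2image for rasterization, and Tesseract via pytesseract; this illustrates the approach, not our exact code:

import numpy as np
import pytesseract
from mmdet.apis import init_detector, inference_detector
from pdf2image import convert_from_path

CONFIG = "cascade_mask_rcnn_hrnetv2p_w32_20e.py"  # placeholder CascadeTabNet config
CHECKPOINT = "cascadetabnet_general.pth"          # placeholder pre-trained weights

model = init_detector(CONFIG, CHECKPOINT, device="cuda:0")

def extract_cells(pdf_path: str, page_index: int, threshold: float = 0.6):
    # The detector treats the PDF page as an image, so rasterize it first.
    page = np.array(convert_from_path(pdf_path)[page_index])
    # (1)-(2) Table Detection and Structure Recognition in one forward pass;
    # bbox_result is a per-class list of [x1, y1, x2, y2, score] arrays.
    bbox_result, _ = inference_detector(model, page)
    # Keep only masks whose confidence exceeds the threshold (0.6 for us).
    # The class order (bordered table, cell, borderless table) is an
    # assumption here and should be checked against model.CLASSES.
    cells = [b for b in bbox_result[1] if b[4] >= threshold]
    # (3) Text Extraction: OCR on each detected cell crop.
    texts = [pytesseract.image_to_string(page[int(y1):int(y2), int(x1):int(x2)],
                                         lang="ita")
             for x1, y1, x2, y2, _ in cells]
    return cells, texts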
   The pipeline we realized is shown in Figure 2. The main modules are:
     • Page Identification Module, which identifies the pages of the document containing the tables we want to extract. Such tables are indeed contained in specific sections of the document, which we can recognize through a simple lookup of certain keywords, like ‘performance scenarios’, ‘costs over time’, and ‘composition of costs’.
     • Table Detection and Table Structure Recognition Module, which uses the pre-trained CascadeTabNet model to identify the masks (i.e., bounding boxes) of bordered/borderless tables and cells. The masks are returned if and only if their confidence level is higher than the input threshold, which we experimentally tuned to 0.6. Cells are then assigned to the corresponding table depending on their positions.
     • Table Identification and Row Extraction Module, which applies the OCR to each cell bounding box. Based on the text extracted by the OCR, we are able to understand whether the considered table is the one of interest; in fact, a single page may contain many different tables. We use the top coordinate value of the cell masks to organize them into rows (see the sketch after this list).
     • Row Processing and Cleaning Module, which produces the final output of the extraction. In particular, from an inspection of the rows created in the previous step, and on the basis of their expected structure, which we know from the regulation, we are able to assign the extracted information to the corresponding information field. A final Cleaning step deals with missing punctuation, extra space removal, OCR error correction, and standardization of numerical formats.
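The row-organization step mentioned above can be sketched as follows; the 10-pixel tolerance is an assumed value, not the one used in the project:

def group_cells_into_rows(cells, tol=10.0):
    """cells: list of (x1, y1, x2, y2) boxes; returns a list of rows,
    each row being a list of boxes ordered left to right."""
    rows = []
    for box in sorted(cells, key=lambda b: b[1]):         # sort by top coordinate
        if rows and abs(box[1] - rows[-1][0][1]) <= tol:  # same top (within tol)
            rows[-1].append(box)
        else:
            rows.append([box])                            # start a new row
    return [sorted(row, key=lambda b: b[0]) for row in rows]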
   The effectiveness of the results often depends on the table template: there are a few tables that the algorithm fails to recognize at all, or for which some cells are not detected (mostly the table headers). Table/cell detection is indeed a challenging task, for several reasons. Firstly, the heterogeneity of the table templates adopted by PRIIP authors makes the task harder for the neural network; the templates are slightly different from those in ICDAR-19, because they are modern, often semi-bordered or borderless, semi-coloured, and with multi-line cells. The same table type may present a different number of rows or columns, either because the specific kind of PRIIP requires it, or because of a different arrangement of the information in the table (for instance, a single cell containing both a cost and a RIY). Moreover, we noticed issues in identifying multi-line cells: at times, true multi-line cells are detected as two distinct cells or, on the contrary, distinct cells that appear one under the other, probably very close, are detected as a single multi-line cell. We addressed this last case in the Row Processing and Cleaning Module: using our knowledge of the table structures, and knowing which field types appear one under the other, we split the wrongly merged multi-line cells and assign the data to the proper fields. Regarding cell masks, we also noticed that, in a few cases, the detected bounding box was quite inaccurate, cropping the text and consequently causing trouble for the OCR. To avoid this, we apply an enlargement factor and run the OCR on a slightly larger area.
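A sketch of this last adjustment (the 5% padding factor is an assumption; the paper does not report the actual value):

def enlarge_box(box, page_w, page_h, factor=0.05):
    # Pad each side of a cell bounding box by a fraction of its size,
    # clamping to the page, so the OCR sees a slightly larger area and
    # text touching the cell border is not cropped away.
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * factor, (y2 - y1) * factor
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(page_w, x2 + dx), min(page_h, y2 + dy))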


4. First results
Table 1 provides some information regarding the two datasets on which we tested our approach. Dataset-1 was created specifically as a test scenario: it is representative of document heterogeneity and includes KIDs from the years 2018-2020. Dataset-2 was selected randomly from a sample of KIDs from the first half of 2021.

Table 1
Dataset Info
                           Dataset      Total KIDs         Size     Manufacturers
                          DATASET-1        1240           250MB          36
                          DATASET-2        7736            3GB           52

  Table 2 reports the average precision and recall over all information extracted through the rule-based approach. The performance we obtained testifies that the effort put into designing the rules definitely pays off, since we reach the very high levels of precision and recall required in the considered application scenario.

Table 2
Results for the rule-based approach

                              Dataset      Precision       Recall   F-Measure
                             DATASET-1        98%           96%        97%
                             DATASET-2        98%           94%        96%

   Table 3 instead shows the number of tables extracted through our ML-based approach for each table type, before the Fine-Tuning phase that we will describe in the next section. We obtained the worst recall for the Costs over time table. This is probably due to the fact that the extraction of this table relies mostly on just two header cells in the Table Identification step; therefore, if the model fails for any reason to detect just those two cell masks, the table is not extracted, even though the numerical cell masks are often correctly detected. This issue is less critical for the other table types, since more cells can be used for Table Identification.

Table 3
Results of the Extractor for the different table types.

                                                                 Dataset
                                                          DATASET-1 DATASET-2
                          Performance     Extracted         1218         6435
                          scenarios        Missing           22          1301
                          Costs           Extracted          752         3925
                          over time        Missing           488         3811
                          Composition     Extracted         1153         6886
                          of costs         Missing           87          850



5. Fine-Tuning to improve Information Extraction from tables
As mentioned, to extract information from tables, we started from a model trained on the detection of generic tables. Since the tables in KID documents may present complex layouts, we fine-tuned the model using custom data, in order to improve its effectiveness. Below we describe the Fine-Tuning process we adopted.
Table 4
Comparing results of the Original and Fine-Tuned Extractor for the different table types.

                                                             Model                  Gap
                                                      Original Fine-Tuned            (%)
                    Performance        Extracted       1218       1237              +1.5%
                    scenarios           Missing          22         3
                    Costs              Extracted        752       1223             +62.63%
                    over time           Missing         488        17
                    Composition        Extracted       1153       1167              +1.2%
                    of costs            Missing          87        73

Labeling Data with Label Studio. The first step involves creating a dataset suitable for training purposes. Here, ‘suitable’ means ‘labeled’: for each image of the dataset we have to annotate the coordinates (bounding boxes) of the tables and of the cells they contain. To this aim we used Label Studio (https://labelstud.io/), an open-source, flexible data annotation tool that can be adapted to different labeling tasks (image classification, object detection, event recognition, etc.) and data types (images, audio, text, time series, etc.), and that allows different users to collaborate on the annotation process, which is well known to be highly time consuming and to require a lot of manual effort. For our object detection annotation task we defined the custom labels ‘bordered’, ‘borderless’, and ‘cell’, and we drew the bounding boxes of each label using the mouse. The tool automatically generates the bounding box coordinates and allows exporting the annotations in different formats. We exported the annotations in Pascal VOC (Visual Object Classes) XML, which creates a separate XML file for each image, containing <object> elements providing the annotation information, in particular the label and the coordinates of the bounding box.
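A minimal sketch of reading one such Pascal VOC file with Python's standard library (the file layout follows the VOC convention; the function name is ours, not tied to any tool):

import xml.etree.ElementTree as ET

def read_voc(xml_path):
    # Each <object> carries a label ('bordered', 'borderless' or 'cell')
    # and a <bndbox> with the corner coordinates of its bounding box.
    annotations = []
    for obj in ET.parse(xml_path).getroot().findall("object"):
        label = obj.find("name").text
        box = obj.find("bndbox")
        coords = tuple(int(float(box.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        annotations.append((label, coords))
    return annotations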
Training the Model. The exported Pascal VOC data are then converted into the COCO (Common Objects in Context) format, typically required by object detection models. We trained the model using MMDetection, an open-source object detection toolbox, loading the pre-trained weights of the CascadeTabNet model and training for 10 epochs on our annotated data. 239 images (KID pages converted to PNG files) were used for the training, chosen so as to maximize the variance of the KID authors and therefore of the templates. Each page image may contain more than a single table. In total, our training dataset contained 10844 objects: 71 bordered tables, 373 borderless tables, and 10400 cells.
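A sketch of what such a fine-tuning configuration can look like in MMDetection's Python config style (2.x conventions; every file name, path, and the class tuple below are placeholder assumptions to be checked against the pre-trained model):

# Start from the published CascadeTabNet config and weights, then train
# for 10 epochs on the COCO-converted KID annotations.
_base_ = "cascade_mask_rcnn_hrnetv2p_w32_20e.py"     # placeholder base config

load_from = "cascadetabnet_general.pth"              # pre-trained weights
runner = dict(type="EpochBasedRunner", max_epochs=10)

data = dict(
    train=dict(
        type="CocoDataset",
        ann_file="kid_tables_train.json",            # our 239 labeled pages
        img_prefix="kid_pages/",
        classes=("bordered", "cell", "borderless"),  # assumed class order
    )
)

Training is then launched with MMDetection's standard tools/train.py script on such a config file.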

Results on table extraction after Fine-Tuning. The fine-tuned model has been tested on Dataset-1, showing a great improvement in extracting data from ‘Costs over time’ tables. Whereas the first (non-fine-tuned) model was not able to extract 488 tables (~40% of the total), the fine-tuned model misses just 17 tables. The gap between the two extractions is impressive, which testifies that Fine-Tuning has been particularly effective. A slight improvement has been observed for the other table types, too. We also tested our approach on Dataset-2; due to the dataset size, that evaluation was done on a sample basis, with results similar to those reported for Dataset-1.
  In Table 4 we give the results of the table extraction, comparing the first model with the fine-tuned one, whereas in Table 5 we report the average precision, recall, and F1-score on all the fields extracted from the three table types for Dataset-1.

Table 5
Results of Fine-Tuning.

                            Dataset     Precision   Recall   F-Measure
                           DATASET-1      ~99%       ~79%      ~88%



6. Conclusion
We are currently working to complete IE from KIDs, by addressing portions of the documents that have so far not been involved in the extraction process. The next project steps concern post-extraction data analysis, aimed at identifying anomalous documents, correcting errors and missing mandatory data, and discovering interesting correlations among the data.
  We are also investigating how to link the collected data to a domain ontology, in order to exploit reasoning services and enhance the entire data acquisition process according to the Ontology-based Data Access and Integration paradigm [8, 9, 10].

References
 [1] J. Doleschal, B. Kimelfeld, W. Martens, Database principles and challenges in text analysis,
     SIGMOD Record 50 (2021) 6–17.
 [2] D. Lembo, F. M. Scafoglieri, Ontology-based document spanning systems for information
     extraction, Int. J. of Semantic Computing 14 (2020) 3–26.
 [3] E. Oro, M. Ruffolo, PDF-TREX: an approach for recognizing and extracting tables from
     PDF documents, in: Proc. of ICDAR, IEEE Computer Society, 2009, pp. 906–910.
 [4] A. P. Aprosio, G. Moretti, Italy goes to Stanford: A collection of CoreNLP modules for Italian,
     Comp. Res. Rep. (CoRR) abs/1609.06204 (2016). URL: http://arxiv.org/abs/1609.06204.
 [5] A. X. Chang, C. D. Manning, TokensRegex: Defining cascaded regular expressions over
     tokens, Technical Report CSTR 2014-02, Department of Computer Science, Stanford University, 2014.
 [6] D. Prasad, A. Gadpal, K. Kapadni, M. Visave, K. Sultanpure, CascadeTabNet: An approach
     for end-to-end table detection and structure recognition from image-based documents,
     Comp. Res. Rep. (CoRR) abs/2004.12629 (2020). URL: https://arxiv.org/abs/2004.12629.
 [7] R. Smith, An overview of the tesseract OCR engine, in: Proc. of ICDAR, volume 2, 2007,
     pp. 629–633. doi:10.1109/ICDAR.2007.4376991.
 [8] D. Calvanese, G. D. Giacomo, D. Lembo, M. Lenzerini, R. Rosati, Ontology-based data
     access and integration, in: L. Liu, M. T. Özsu (Eds.), Encyclopedia of Database Systems,
     Second Edition, Springer, 2018.
 [9] F. M. Scafoglieri, D. Lembo, A. Limosani, F. Medda, M. Lenzerini, Boosting information
     extraction through semantic technologies: The KIDs use case at Consob, volume 2980 of
     CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[10] D. Lembo, Y. Li, L. Popa, K. Qian, F. Scafoglieri, Ontology mediated information extraction
     with MASTRO SYSTEM-T, volume 2721 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 256–261.