=Paper=
 
{{Paper
|id=Vol-3194/paper1
|storemode=property
|title=OpenTRIAGE: Entity Linkage for Detail Webpages
|pdfUrl=https://ceur-ws.org/Vol-3194/paper1.pdf
|volume=Vol-3194
|authors=Roger Voyat,Valter Crescenzi,Paolo Merialdo
|dblpUrl=https://dblp.org/rec/conf/sebd/VoyatCM22
|wikidataid=Q117345168
}}
==OpenTRIAGE: Entity Linkage for Detail Webpages==
<pdf width="1500px">https://ceur-ws.org/Vol-3194/paper1.pdf</pdf>
<pre>
OpenTRIAGE: Entity Linkage for Detail Webpages

Roger Voyat1, Valter Crescenzi1 and Paolo Merialdo1
1 Department of Engineering – Università Roma Tre, Via della Vasca Navale 79, Rome, Italy
Abstract
We present OpenTriage, a system for extracting structured entities from detail Web pages of several sites and finding linkages between the extracted data. The system builds an integrated knowledge base by leveraging the redundancy of information with an Open Information Extraction approach: it incrementally processes all the available pages while discovering new attributes. It is based on a hybrid human-machine learning technique that targets a desired quality level. After two preliminary tasks, i.e., blocking and extraction, OpenTriage interleaves two integration tasks, i.e., linkage and matching, while managing the uncertainty by means of very simple questions posed to an external oracle.

1. Introduction and Overview
Although the amount of data available on the Web is growing at an exponential pace, most of these data cannot be directly processed by applications, as they are published in HTML, a format meant to be displayed by a Web browser and consumed by humans rather than processed by machines. Many of these data are published by means of pages such as those depicted in Figure 1: detail pages [1, 2] that report data about one real world entity, e.g., the Italian basketball player Danilo Gallinari. Several sites may independently publish data about the same real world topic entity; Figure 1, for instance, shows two pages taken from two sports sites (espn.com and foxsport.com).
   Websites may publish different attributes, and detail pages about different sets of entities. Therefore, the integration of partially overlapping information coming from several sites is a challenging, yet rewarding, problem for building knowledge bases for several popular domains. Data are found in detail pages published on the Web and extracted into structured records such as those shown in Figure 1c and 1d before being integrated.
   The problem of deciding whether two attributes of two distinct sources actually have the same semantics is commonly known as schema matching [3, 4, 5] (for example, the pair of attributes (𝑎12, 𝑎22) in Figure 1c and 1d), while the problem of finding which structured records relate to the same real world entity is often called entity linkage [6, 7, 8, 9].

Challenges & Opportunities  Although linkage and matching are both considered difficult Big Data integration problems [10], in many practical settings, including ours, the former is more often considered the most challenging one.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
roger.voyat@uniroma3.it (R. Voyat); valter.crescenzi@uniroma3.it (V. Crescenzi); paolo.merialdo@uniroma3.it (P. Merialdo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Figure 1: Running example: Web detail pages and data extracted from the NBA player domain. (a) Site espn.com; (b) site foxsport.com; (c) extracted records from espn.com; (d) extracted records from foxsport.com.

   First, consider the dimension of the problem: the number of input entities published in a Web source can outgrow the number of their attributes by several orders of magnitude, as a site can publish thousands of detail pages with no more than a few dozen different attributes. One might argue that the number of attributes scales proportionally to the number of Web sources, but in practice there are a few head attributes that are used by most of the sources [11], giving rise to skewed distributions. Second, the challenges posed by the matching problem directly affect the linkage one as well: to decide whether two records are related to the same entity, it is important, especially in the presence of overlap between the entities of two sources, to know the matching of the attributes of the compared records.
   Another challenge is that two matching values can have formats and semantics that vary from source to source, or even from subgroup of entities to subgroup of entities within the same source. For example, the postal code in England is an alphanumeric string locating a single building, while in Italy the postal code is a numeric string often associated with several blocks of a rather big city; in another domain, a smartphone manufacturer might adopt a naming policy for its models that distinguishes one model from another by a single letter, e.g., Apple iPhone X and Apple iPhone XS, whereas another might adopt totally different names, e.g., OPPO Find X5 and OPPO Reno 6.
   Although automatically building an integrated knowledge base from detail pages published by many sources is a challenging problem, there are interesting opportunities as well: especially for the most popular domains, the Web is an almost inexhaustible source of detail pages with redundant information about the same entities, to the extent that we can assume that our source base is redundant enough to contain at least two detail pages and at least two values for each domain entity and every attribute of the knowledge base we want to create.

Problem Description  OpenTriage takes as input a domain 𝐷 made of a set of sources 𝒮 = {𝑆1, 𝑆2, . . .}, where each source 𝑆𝑖 is a set of 𝑛𝑖 detail pages, each publishing information about a single entity from a set of domain entities 𝒪 = {𝑜1, 𝑜2, . . .}. A domain entity 𝑜𝑖 includes a set of domain attributes 𝒜 = {𝑎1, 𝑎2, . . .}, whose values can be populated from the HTML source code of the corresponding detail page by means of a set of extraction rules associated with the domain attributes.
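As an illustration only (not code from the paper), the input just described could be modeled with a few Python dataclasses; all names (Domain, Source, DetailPage) are hypothetical:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DetailPage:
    url: str
    html: str                                              # raw HTML of the detail page
    record: Dict[str, str] = field(default_factory=dict)   # extracted values, keyed by (anonymous) attribute id

@dataclass
class Source:
    name: str                                              # e.g. "espn.com"
    pages: List[DetailPage] = field(default_factory=list)  # the n_i detail pages of S_i

@dataclass
class Domain:
    sources: List[Source] = field(default_factory=list)    # S = {S1, S2, ...}
    attributes: List[str] = field(default_factory=list)    # A = {a1, a2, ...}, grown during extraction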
Figure 2: OpenTriage high-level architecture.

   In the following, we introduce OpenTriage, a Human-in-the-Loop system that produces as output a set of entity linkages, i.e., pairs of pages related to the same domain entity, and a set of attribute matches, i.e., pairs of extraction rules related to the same domain attribute. Indeed, OpenTriage also generates the extraction rule, if any, that locates the value of a domain attribute in the HTML source of any of the detail pages from a source.
   Figure 1 shows two detail pages that include several attributes (highlighted in green). Figure 1c and Figure 1d show the extracted records with some anonymous attributes such as 𝑎11, 𝑎12, 𝑎13, and 𝑎14 (which is the Height). Possible matches are, for example: (𝑎11, 𝑎21) (i.e., First Name), (𝑎12, 𝑎22) (i.e., Last Name), (𝑎13, 𝑎23) (i.e., Team); a linkage is (𝑜11, 𝑜21), which is a pair of entities from detail pages associated with Danilo Gallinari.

2. OpenTriage System
OpenTriage orchestrates several integration tasks, as shown in the high-level architecture of Figure 2. A data extraction algorithm is applied over every input page to generate a set of extraction rules for every source, thus associating each page with a record of extracted data. Simultaneously, an initial blocking phase groups pages into blocks of pages potentially in linkage, with the goal of never separating two pages about the same entity into two different blocks. Then, OpenTriage launches a new integration task over every block of detail pages, as shown in Figure 2: iteratively, a linkage task and an attribute matching task are interleaved, so that the produced linkages are used to discover attribute matches which, in turn, are used to provide better linkages. This process can be supported by an external oracle that contributes to halting the iteration when the desired quality levels are achieved.

Extraction rules generation and blocking  OpenTriage generates a set of candidate extraction rules for every input page and associates each detail page with a record of extracted data. This task is based on a state-of-the-art unsupervised algorithm that works per site. It is beyond the scope of the present paper as it is already extensively discussed in previous works [12, 13, 14]. As already shown in [2], by assuming that the set of candidate extraction rules is complete, i.e., it includes at least one correct rule, the following integration tasks will always prefer the correct rules over the noisy ones, which do not find values that match with values from other sites. In the following we will not consider the presence of such noisy rules, to simplify the discussion.
   Then OpenTriage uses a blocking algorithm working only with the titles of the input pages. The page title is easy to extract from HTML and, with rare exceptions, it contains some information about the entity described in a detail page.

Figure 3: Entity distance, blocking & Block Pyramid. Figure 3a shows entity alignment with similarity distances inter-source (𝑑1, 𝑑2) and intra-source (𝑑3, 𝑑4); Figure 3b shows a blocking result; Figure 3c depicts a Block Pyramid with a green zone (pairs in linkage), a gray zone (the uncertainty), and a red zone (all remaining non-linkage pairs) used for integration.

   More precisely, in blocking, we compute the distance between two pages (Figure 3a) as the inverse of the Jaccard index (𝐽) between the sets composed of the tokens contained in their HTML titles: 𝐽(𝑡1, 𝑡2) = |𝑡1 ∩ 𝑡2| / |𝑡1 ∪ 𝑡2|. All pairs of pages whose similarity is greater than a threshold are considered as belonging to the same block. We empirically set this threshold to a value (0.6) that our experiments showed to be low enough to obtain an almost perfect recall, even if that means sacrificing precision, so that pages in linkage will belong to the same group.
   With blocks generated, OpenTriage may obtain a first approximate noisy linkage. Since the number of input pages can be too big for a traditional quadratic linkage algorithm, as is the case for many other linkage solutions [15, 16], we use a basic blocking algorithm [17, 18, 19] that allows us to linearly group the pages into blocks of potentially similar detail pages; further, and more expensive, elaborations are later confined to the pages of a single block. As we will not look for linkages across distinct blocks, blocking techniques are supposed not to split detail pages related to the same entity into different blocks, to prevent false negatives, yet they should spread the entities over a reasonable number of not-too-big blocks, as shown in Figure 3b.
   Once the bootstrapping phase is over, one record of extracted data is associated with each input detail page, and the pages have been clustered into blocks of pages potentially related to the same real world entity. This is the input for the following processing stage.
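As a rough illustration of the blocking step, the following Python sketch groups pages by the Jaccard similarity of their title tokens with the 0.6 threshold mentioned above; the greedy grouping is quadratic and is only a simplification of the linear blocking techniques cited in the text, and every name is hypothetical:

def title_tokens(title: str) -> frozenset:
    return frozenset(title.lower().split())

def jaccard(t1: frozenset, t2: frozenset) -> float:
    # J(t1, t2) = |t1 ∩ t2| / |t1 ∪ t2|
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0

def block_pages(titles: dict, threshold: float = 0.6) -> list:
    """titles: page_id -> HTML title. Greedily groups pages whose title-token
    Jaccard similarity exceeds the threshold (quadratic, for clarity only)."""
    blocks = []                       # list of sets of page ids
    for pid, title in titles.items():
        toks = title_tokens(title)
        home = None
        for block in blocks:
            if any(jaccard(toks, title_tokens(titles[q])) > threshold for q in block):
                home = block
                break
        if home is None:
            home = set()
            blocks.append(home)
        home.add(pid)
    return blocks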
Linkage of extracted records  After the bootstrap phase generates blocks of pages potentially related to the same topic entity, for each block OpenTriage performs a data integration task that aims at computing both the linkages and the matches over the records of the pages in a block. The data integration task relies on a normalized distance function between pairs of records with known matches for finding the linkages, and symmetrically it leverages a distance function between pairs of attributes with known linkages for finding the matches. For the sake of presentation, in the following we start by assuming that the matches are given, so that we can focus on finding the linkages, which usually is the most challenging integration task. Later, we will show how we can exploit a seed set of (possibly noisy) candidate linkages to compute an instance-based distance measure between attributes to solve the attribute matching problem as well. That explains why an entity linkage task and an attribute matching task are interleaved in Figure 5.
   We compute the normalized distance for every possible pair of pages in a block, and we rank these candidate linkages by distance. These results are organized in a data structure, the Block Pyramid (BP), depicted in Figure 3c, which is the basis of the following operations.
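A minimal sketch of the normalized record distance and of the Block Pyramid construction, assuming the attribute matches and per-attribute weights are given; value_distance is an arbitrary stand-in for the paper's value distance, and all names are hypothetical:

from itertools import combinations
import difflib

def value_distance(v1, v2) -> float:
    # Normalized string distance in [0, 1]; a stand-in for the actual value distance.
    return 1.0 - difflib.SequenceMatcher(None, v1 or "", v2 or "").ratio()

def record_distance(r1: dict, r2: dict, matches, weights) -> float:
    # Weighted average of value distances over the matched attribute pairs.
    total = sum(weights.get((a1, a2), 1.0) * value_distance(r1.get(a1), r2.get(a2))
                for a1, a2 in matches)
    norm = sum(weights.get((a1, a2), 1.0) for a1, a2 in matches)
    return total / norm if norm else 1.0

def block_pyramid(block, records, matches, weights):
    # Rank every candidate pair of a block by increasing distance (top of the BP first).
    return sorted((record_distance(records[p], records[q], matches, weights), p, q)
                  for p, q in combinations(sorted(block), 2))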
Triage  OpenTriage scans every BP to perform a kind of triage over pairs of pages belonging to the same block. For an ideal domain, where the distance is capable of perfectly ranking the candidate linkages, all the pairs that correspond to actual linkages are at the top of the BP (green zone in Figure 3c), whereas all the other pairs (representing pairs of pages that are not in linkage) should be at the bottom of the pyramid (red zone): a single threshold on the distance can perfectly separate the green zone from the red one.
   We can further develop this characterization of a domain in terms of the BP by introducing a domain property called Domain Separability (DS). A domain is separable if the distance between two entities in linkage is always smaller than the distance between any pair of entities that are not in linkage. Unfortunately, this property does not hold in practice, as a distance function cannot always separate linkage and non-linkage pairs: as shown in Figure 3c, there is a “gray” zone that includes pairs that are in linkage but whose distance is greater than the distance of pairs that are not in linkage, or vice-versa.
   The gray zone confirms the presence of DS violations and the possibility that a non-linkage could have a lower distance than an actual linkage. Our experimental evaluation with real pages has shown that the BPs have different sizes and “shapes” not only across distinct domains, but also from block to block within the same domain. Every page and every attribute can exhibit, somehow, characteristics local to the Web source, or even to the block, for several reasons, including: first, the distances are affected by the formats and units of measure of the data adopted by a source; second, there are sources publishing inaccurate, false, or just obsolete information; and finally, there might be other minor semantic differences between published attributes which are really challenging to detect. For example, an address can be published with any of the following values: “1600 Amphitheatre Pkwy, Mountain View, CA 94043, United States”; “1600 Amphitheatre Parkway in Mountain View, California (USA)”.
   Automatically discovering the best triage for every block of pages is therefore a challenging and difficult task, which we approach by looking for two thresholds (the green-to-gray threshold separating the green zone from the gray one, and the gray-to-red threshold) and by making an additional assumption holding for many interesting real domains on the Web.
   We say that a Web source exhibits the Local Consistency property (LC) if it does not publish more than one detail page related to the same real world entity. This means that the domain does not include linkages of pages from the very same source, which is often the case. Otherwise, it would mean that the source is unnecessarily publishing information about the same entity twice or more, with the risk of introducing inconsistencies.
   Assuming the LC, as shown in Figure 3c, OpenTriage can find the first green-to-gray threshold as the distance of the first non-linkage pair under the green zone. The pair can be located by analyzing the BP from the top, as the pair having the smallest distance and made of pages belonging to the same source.
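Under the LC assumption, the green-to-gray threshold can be located as in the following sketch: scanning the BP from the top, the first pair whose pages come from the same source cannot be a linkage, so its distance bounds the green zone (names are hypothetical; source_of maps a page id to its site):

def green_to_gray_threshold(pyramid, source_of) -> float:
    """pyramid: list of (distance, page1, page2) sorted by increasing distance."""
    for distance, p, q in pyramid:
        if source_of[p] == source_of[q]:   # same-source pair: not a linkage under LC
            return distance
    return float("inf")                    # no same-source pair found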
   To identify the second gray-to-red threshold and to complete the triage, OpenTriage leverages the feedback from an oracle, which in practice can be approximated by means of crowdsourcing techniques [20, 21, 22].

Figure 4: An example over the NBA domain: (a) Block 1 duality; (b) Block 2 duality; (c) Block 3 duality. The top Cartesian planes sketch the linkage finding problem: the points correspond to candidate linkages and their distance is computed by considering the values of two matching attributes associated with the axes. The Cartesian planes in the bottom of the figures illustrate the match finding problem: they contain points corresponding to candidate matches, and their distance is computed by considering the values of those attributes of two entities associated with the axes. The gray pairs are those violating the domain separability.

2.1. Human-in-the-Loop
OpenTriage exploits an external oracle to perform an approximate triage for each BP. It poses simple binary questions that double-check whether two structured records refer to the same entity. An example of the questions posed is: Do the records “(Danilo Gallinari, Atlanta Hawks, 6'10'')” and “(Danilo Gallinari, Atlanta Hawks)” refer to the same real world entity?
   Our algorithm processes the candidate pairs in a BP ordered by decreasing distance, and starts posing queries only for pairs coming after the first green-to-gray threshold: the goal is that of finding (if any) a pair of records in linkage with the smallest distance that better approximates the gray-to-red threshold. To avoid posing too many questions that just collect useless negative answers, it uses two stop conditions: (𝑖) halt when it receives too many consecutive negative responses, (𝑖𝑖) halt when the next pair is too far from the last one. Both conditions depend on OpenTriage configuration parameters. The best quality-to-cost ratio in our experiments is found with this setting: we halt after 5 consecutive negative answers, or with a pair which is at least at a 0.2 distance from the last pair. The gray-to-red threshold is then approximated by the smallest distance of the pairs in linkage discovered by the oracle. In case no linkage has been found, it is assumed that the gray zone shrinks to an empty one and the two thresholds collapse.
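One possible reading of this procedure is sketched below, with the settings reported in the text (halt after 5 consecutive negative answers or a 0.2 distance gap); ask_oracle stands for the crowdsourced yes/no question, and every name is illustrative:

def gray_to_red_threshold(pyramid, green_to_gray, ask_oracle,
                          max_negatives=5, max_gap=0.2):
    """Scan the candidate pairs beyond the green-to-gray threshold and ask the oracle
    whether they are linkages; stop on too many negatives or too large a distance gap."""
    last_distance, negatives, best = green_to_gray, 0, None
    for distance, p, q in pyramid:
        if distance <= green_to_gray:
            continue                                       # still in the green zone
        if negatives >= max_negatives or distance - last_distance > max_gap:
            break                                          # stop conditions (i) and (ii)
        if ask_oracle(p, q):                               # oracle confirms a linkage
            best, negatives = distance, 0
        else:
            negatives += 1
        last_distance = distance
    return best if best is not None else green_to_gray     # empty gray zone: thresholds collapse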
2.2. Matches/Linkages Duality
OpenTriage leverages a distance measure between two records for finding the linkages between pairs of pages in the same block: that distance is computed by assuming that the attribute matches of the involved records are known, and then averaging over the distances of the values of the matching attributes. To relax this assumption, which does not hold for our input, made just of records automatically extracted from detail Web pages and having anonymous labels, we also need to find the attribute matches.

Figure 5: Integration tasks over one BP from the NBA domain.

   OpenTriage tackles the problem by using an instance-based distance measure between attributes from two sources: it is computed by averaging over every pair of attribute values from records in linkage, i.e., from entities belonging to the overlap of the sources.¹
   OpenTriage finds the matches by framing the problem as the dual of the linkage finding problem. The dual integration task leverages a dual BP in which all the possible matches are ranked according to their instance-based distance measure. These, in turn, are based on the linkages obtained by a previous execution of the primal task. The dual triage will classify the top candidate attribute matches in the green zone by exploiting the dual property of the LC, i.e., the same source never publishes the same attribute twice, and of the DS, i.e., the instance-based distance between two matching attributes is always smaller than the distance between two non-matching attributes. The dual triage will exploit an oracle to approximate the gray-to-red threshold by posing very simple binary questions such as: “are ‘Danilo Gallinari’ and ‘D.Gallinari’ values of two matching attributes?”.
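A sketch of the instance-based attribute distance used by the dual task, assuming a set of already discovered linkages and reusing the hypothetical value_distance defined earlier:

def attribute_distance(attr1, attr2, linkages, records) -> float:
    """Average value distance between two attributes over the records of pages in
    linkage, i.e., over the entities in the overlap of the two sources."""
    distances = [value_distance(records[p].get(attr1), records[q].get(attr2))
                 for p, q in linkages
                 if attr1 in records[p] and attr2 in records[q]]
    return sum(distances) / len(distances) if distances else 1.0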
   Figure 5 and Figure 2 show that the primal linkage task and the corresponding dual matching task are interleaved. Each task produces results consumed in a following iteration of the other task. OpenTriage’s main processing loop starts with all the weight parameters used in the normalized distance functions set to the same initial values. The linkages/matches produced by every triage are then used to update the weights by means of a statistical linear model (i.e., logistic regression) that also considers the answers provided by the oracle. Updating the weights implies a possible reshaping of the BP (some pairs might change “color”): the iterations are halted as soon as the results do not change too much, usually just after 2 or 3 iterations. Figures 4a and 4b show a block with a few pairs just on, or close to, the border separating positive linkage/matching pairs from negative ones. Those pairs might need a specific oracle question to be fixed after a triage has wrongly classified them. Figure 4c shows an even larger number of questions that might be asked, as the block includes many pairs close to the line separating the positive linkages from the negative ones.

   ¹ This is yet another frequent reason for the existence of the gray zones in the matches BPs: if the overlap is too small, an instance-based distance function might not be reliable enough to provide good distance values.
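The weight update could look like the sketch below, where a logistic regression over the pairs labelled by the triage and by the oracle re-estimates the per-attribute weights of the normalized distance (scikit-learn is used only for illustration; the paper does not specify the implementation, and all names are hypothetical):

from sklearn.linear_model import LogisticRegression

def update_weights(labelled_pairs, matches, records):
    """labelled_pairs: list of ((page1, page2), is_linkage) produced by the triage
    and by the oracle answers. Features are the per-matched-attribute value distances."""
    X = [[value_distance(records[p].get(a1), records[q].get(a2)) for a1, a2 in matches]
         for (p, q), _ in labelled_pairs]
    y = [int(is_linkage) for _, is_linkage in labelled_pairs]
    model = LogisticRegression().fit(X, y)                 # requires both classes in y
    # The absolute coefficients play the role of the attribute weights of the distance.
    return {pair: abs(c) for pair, c in zip(matches, model.coef_[0])}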
3. Experimental Results
OpenTriage discovers extraction rules, attribute matches and entity linkages by interleaving linkage and matching tasks while posing questions to an oracle. Those questions are constrained by an overall budget, i.e., the maximum number of questions that can be posed to solve a whole input domain, which represents the cost measure for our approach. As explained in Section 2.1, OpenTriage adopts two halting conditions that avoid consuming too many questions for the same block. Our preliminary experiments show that the size of the gray zone, as estimated by the block triage procedure, monotonically shrinks during the consecutive iterations of the OpenTriage integration task over that block. Depending on the characteristics of the input domain, OpenTriage can be used either as a stand-alone technique to fully integrate a domain, or as an approach to actively isolate the best samples of a training set, which is later used to feed a machine learning algorithm. Indeed, for easily separable domains, OpenTriage can help to find and fix a few outliers that need a small number of questions to the oracle. For these domains, it can automatically separate most of the positive pairs by means of the triage procedure. Conversely, for a highly non-separable domain, it can be better used to select non-trivial pairs that stand in the gray zone. The questions to the oracle are saved for highly informative samples that can be used to train a non-linear machine learning algorithm.

Setup  As we are not aware of any other approach that, like ours, tackles three integration problems over detail Web pages at the same time (extraction, linkage, and matching), for a preliminary experimental evaluation we compare our system with a state-of-the-art entity resolution approach over Web data. Namely, we consider Ditto [23] over the SWDE [24] benchmark. We focus on 2 out of the 8 domains available in SWDE: NBA players and University. They have rather different and interesting characteristics: the NBA players dataset is composed of 4.5𝑘 pages from 10 websites, with a large overlap of entities across sites as shown in Figure 6a (i.e., 8.8𝑘 entity linkages); on the contrary, the University dataset includes 20𝑘 pages from 10 sites, with a rather low level of redundancy across sources as shown in Figure 6d (i.e., 4.3𝑘 entity linkages). For each processed domain, several executions were made with an increasing number of questions to the oracle (our cost budget). The blocking task for the NBA domain generated 90 blocks, whereas 1060 blocks were generated for the University domain. The former domain has a rather higher level of redundancy than the latter: OpenTriage ends up consuming a much smaller number of questions (80 vs 1350) by exploiting the available redundancy.

With or Without DITTO  Ditto is a deep learning Entity Resolution system based on pre-trained language models, but it solves neither the extraction problem nor the matching problem. We considered two settings for comparing our linkage results to those produced by Ditto.

• Ditto-Title: we use the page title as a textual attribute describing the entity contained in the page. The title can be trivially extracted and it is useful for linkage, as its value is usually strictly related to the topic entity. For example, in the NBA pages the title contains the name of the player, and in the University pages it contains the name of the university.

Figure 6: Domain overlap & results: (a) NBA entity overlap; (b) NBA linkages; (c) NBA budget; (d) University entity overlap; (e) University linkages; (f) University budget. Figures 6a and 6d show that the domains have a different overlap of entities published across websites. Figures 6b and 6e show the experimental 𝑃, 𝑅, 𝐹 of linkages for OpenTriage, Ditto-Title and Ditto-Golden. Figures 6c and 6f show the quality of the output linkages w.r.t. the budget spent.

• Ditto-Golden considers as input the golden structured records provided with the SWDE benchmark: these records correspond to data extracted by manually defined extraction rules. Observe that, in a real scenario, using Ditto would require an additional effort for crafting the extraction rules.

   Ditto is a supervised system: we trained it using all the linkages obtained by OpenTriage, and then we measured the quality of the output linkages w.r.t. a golden set of linkages using standard quality metrics. Precision (𝑃) is the percentage of output linkages which are correct, Recall (𝑅) is the percentage of linkages in the golden set which are in the output linkages, and the 𝐹-measure is their harmonic mean. Figures 6b and 6e show the results OpenTriage achieved for the considered domains with an unlimited budget. In both domains the output quality is good, but the University domain required far more questions than the NBA domain (1,350 vs 83). As long as the input Web sources have enough overlap in the set of provided detail pages, our approach adapts well to the characteristics of the input domain, even if this means paying for many additional questions. Figures 6c and 6f show how the quality of the results is affected by a limited budget. We observed a stronger impact on the recall than on the precision, especially for domains with large overlap, where a few additional questions can significantly improve the triage quality.
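The metrics reported above follow the usual set-based definitions, e.g. (names hypothetical):

def precision_recall_f(output_linkages: set, golden_linkages: set):
    """Precision, recall and F-measure of the output linkages w.r.t. the golden set."""
    correct = len(output_linkages & golden_linkages)
    precision = correct / len(output_linkages) if output_linkages else 0.0
    recall = correct / len(golden_linkages) if golden_linkages else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure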
   For domains with a limited overlap of entities and attributes, the amount of available redundancy might turn out to be too small to support an effective triage. In this situation, OpenTriage would end up generating far more queries supporting the system iterations than it does in the presence of a good amount of redundancy. As future work, to solve this problem, a possible natural evolution for OpenTriage is that of supporting the incremental integration of new sources with different degrees of overlap. Indeed, a new source would increase the size of the problem, but at the same time it might even more incisively contribute to lowering the overall costs if it significantly overlaps with the other sources already available for the domain [25].

4. Related work
Big Data Integration [10] and Information Extraction [1] are broad research topics related to our approach: we only have room to provide some entry points to such a vast literature.
   The three main problems we study in this proposal, i.e., data extraction [12, 26], schema matching [3, 4, 5], and entity linkage [6, 7, 8, 9], have most often been tackled in isolation, with some rare exceptions [2]. However, entity linkage on Web data has received much less attention so far. Blocking techniques for entity resolution have often been studied in the literature [15, 16].
   Unlike the Closed Information Extraction approaches [27, 28], where the managed knowledge base does not grow in terms of subjects and predicates but only in terms of values and data, OpenTriage falls in the realm of the Open Information Extraction approaches, as it finds new domain attributes while processing the input sources. The research community has pursued the problem with different types of solutions. Traditional approaches [2, 13, 29, 30, 28, 31, 32, 33, 34] take advantage of the regularity of the HTML structures for pages published by templated websites. Other recent proposals [1, 35] use a hybrid approach exploiting both the regularity of the publishing patterns for HTML pages and the tagging of attribute values provided by human workers. The main difference with OpenTriage is that our system does not require a partially populated knowledge base to trigger the acquisition of new data.
   Other information extraction works [36, 37, 24] have exploited computer vision methods. For example, they also consider page screenshots, the bounding boxes around the text nodes, and their visual distances. These methods require expensive rendering for data processing and need a lot of training data for ML models. A recent approach is FreeDOM [38], which avoids rendering costs but still needs human feedback for tagging attribute values. Unlike the latter [38], OpenTriage does not need expert crowdsourcing workers.

Conclusions & Future work  OpenTriage can be used either directly as an entity linkage solution over Web pages, or as an active learning algorithm for selecting the most profitable training data to feed a downstream non-linear ML solution like Ditto [23]. Our future work includes evaluating OpenTriage with other datasets. It will be interesting to see if the monotonic trend of the gray areas is always confirmed and to study under which assumptions it can be proved. Another interesting development is the design of an incremental solution leveraging the redundancy for finding the best source integration strategy [25] while integrating a single source at a time.

References
[1] C. Lockard, P. Shiralkar, X. L. Dong, Openceres: When open information extraction meets the semi-structured web, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3047–3056.
[2] M. Bronzi, V. Crescenzi, P. Merialdo, P. Papotti, Extraction and integration of partially overlapping web sources, Proceedings of the VLDB Endowment 6 (2013) 805–816.
[3] Z. Bellahsene, A. Bonifati, F. Duchateau, Y. Velegrakis, On evaluating schema matching and mapping, in: Schema matching and mapping, Springer, 2011, pp. 253–291.
[4] J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with cupid, in: VLDB, volume 1, Citeseer, 2001, pp. 49–58.
[5] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (2001) 334–350.
[6] D. Qiu, L. Barbosa, V. Crescenzi, P. Merialdo, D. Srivastava, Big data linkage for product specification pages, in: Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 67–81.
[7] A. Gruenheid, X. L. Dong, D. Srivastava, Incremental record linkage, Proceedings of the VLDB Endowment 7 (2014) 697–708.
[8] S. E. Whang, H. Garcia-Molina, Incremental entity resolution on rules and data, The VLDB Journal 23 (2014) 77–102.
[9] D. G. Brizan, A. U. Tansel, A survey of entity resolution and record linkage methodologies, Communications of the IIMA 6 (2006) 5.
[10] X. L. Dong, D. Srivastava, Big data integration, in: 2013 IEEE 29th International Conference on Data Engineering (ICDE), IEEE, 2013, pp. 1245–1248.
[11] N. Dalvi, A. Machanavajjhala, B. Pang, An analysis of structured data on the web, arXiv preprint arXiv:1203.6406 (2012).
[12] V. Cetorelli, P. Atzeni, V. Crescenzi, F. Milicchio, The smallest extraction problem, Proceedings of the VLDB Endowment 14 (2021) 2445–2458.
[13] V. Crescenzi, P. Merialdo, Wrapper inference for ambiguous web pages, Applied Artificial Intelligence 22 (2008) 21–52.
[14] V. Cetorelli, V. Crescenzi, P. Merialdo, R. Voyat, Noah: Creating data integration pipelines over continuously extracted web data, in: EDBT/ICDT Workshops, 2021.
[15] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, A survey of blocking and filtering techniques for entity resolution, arXiv preprint arXiv:1905.06167 (2019).
[16] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, N. Tang, Distributed representations of tuples for entity resolution, Proceedings of the VLDB Endowment 11 (2018) 1454–1467.
[17] A. Gionis, P. Indyk, R. Motwani, et al., Similarity search in high dimensions via hashing, in: VLDB, volume 99, 1999, pp. 518–529.
[18] V. Christophides, V. Efthymiou, K. Stefanidis, Entity resolution in the web of data, Synthesis Lectures on the Semantic Web 5 (2015) 1–122.
[19] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, T. Palpanas, Parallel meta-blocking for scaling entity resolution over big heterogeneous data, Information Systems 65 (2017) 137–157.
[20] V. Crescenzi, A. A. Fernandes, P. Merialdo, N. W. Paton, Crowdsourcing for data management, Knowledge and Information Systems 53 (2017) 1–41.
[21] G. Li, Y. Zheng, J. Fan, J. Wang, R. Cheng, Crowdsourced data management: Overview and challenges, in: Proceedings of the 2017 ACM International Conference on Management of Data, 2017, pp. 1711–1716.
[22] T. S. Behrend, D. J. Sharek, A. W. Meade, E. N. Wiebe, The viability of crowdsourcing for survey research, Behavior Research Methods 43 (2011) 800–813.
[23] Y. Li, J. Li, Y. Suhara, A. Doan, W.-C. Tan, Deep entity matching with pre-trained language models, arXiv preprint arXiv:2004.00584 (2020).
[24] Q. Hao, R. Cai, Y. Pang, L. Zhang, From one tree to a forest: a unified solution for structured web data extraction, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011, pp. 775–784.
[25] X. L. Dong, B. Saha, D. Srivastava, Less is more: Selecting sources wisely for integration, Proceedings of the VLDB Endowment 6 (2012) 37–48.
[26] V. Crescenzi, G. Mecca, P. Merialdo, et al., Roadrunner: Towards automatic data extraction from large web sites, in: VLDB, volume 1, 2001, pp. 109–118.
[27] C. Lockard, X. L. Dong, A. Einolghozati, P. Shiralkar, Ceres: Distantly supervised relation extraction from the semi-structured web, arXiv preprint arXiv:1804.04635 (2018).
[28] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang, Diadem: thousands of websites to a single database, Proceedings of the VLDB Endowment 7 (2014) 1845–1856.
[29] N. Bhutani, H. Jagadish, D. Radev, Nested propositions in open information extraction, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 55–64.
[30] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction, arXiv preprint arXiv:1806.05599 (2018).
[31] H. A. Sleiman, R. Corchuelo, Trinity: on using trinary trees for unsupervised web data extraction, IEEE Transactions on Knowledge and Data Engineering 26 (2013) 1544–1556.
[32] M. A. B. M. Azir, K. B. Ahmad, Wrapper approaches for web data extraction: A review, in: 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI), IEEE, 2017, pp. 1–6.
[33] M. Schmitz, S. Soderland, R. Bart, O. Etzioni, et al., Open language learning for information extraction, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 523–534.
[34] A. Fader, S. Soderland, O. Etzioni, Identifying relations for open information extraction, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 1535–1545.
[35] Y. Zhou, Y. Sheng, N. Vo, N. Edmonds, S. Tata, Learning transferable node representations for attribute extraction from web documents, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022, pp. 1479–1487.
[36] C. Lockard, P. Shiralkar, X. L. Dong, H. Hajishirzi, Zeroshotceres: Zero-shot relation extraction from semi-structured webpages, arXiv preprint arXiv:2005.07105 (2020).
[37] A. Carlson, C. Schafer, Bootstrapping information extraction from semi-structured web pages, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2008, pp. 195–210.
[38] B. Y. Lin, Y. Sheng, N. Vo, S. Tata, Freedom: A transferable neural architecture for structured information extraction on web documents, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1092–1102.
</pre>

Latest revision as of 18:01, 30 March 2023

Paper

Paper
edit
description  
id  Vol-3194/paper1
wikidataid  Q117345168→Q117345168
title  OpenTRIAGE: Entity Linkage for Detail Webpages
pdfUrl  https://ceur-ws.org/Vol-3194/paper1.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/VoyatCM22
volume  Vol-3194→Vol-3194
session  →

OpenTRIAGE: Entity Linkage for Detail Webpages

load PDF

OpenTRIAGE: Entity Linkage for Detail Webpages
Roger Voyat1 , Valter Crescenzi1 and Paolo Merialdo1
1
    Department of Engineering – Università Roma Tre, Via della Vasca Navale 79, Rome Italy


                                         Abstract
                                         We present OpenTriage, a system for extracting structured entities from detail Web pages of several
                                         sites and finding linkages between the extracted data. The system builds an integrated knowledge
                                         base by leveraging the redundancy of information with an Open Information Extraction approach: it
                                         incrementally processes all the available pages while discovering new attributes. It is based on a hybrid
                                         human-machine learning technique that targets a desired quality level. After two preliminary tasks,
                                         i.e., blocking and extraction, OpenTriage interleaves two integration tasks, i.e., linkage, and matching,
                                         while managing the uncertainty by means of very simple questions that are posed to an external oracle.




1. Introduction and Overview
Although the amount of data that are available on the Web is growing with an exponentially
pace, most of these data cannot be directly processed by applications, as they are published in
HTML, a format meant to be displayed by a Web browser and consumed by humans rather
than processed by machines. A lot of those data are published by mean of pages such as those
depicted in Figure 1: detail pages [1, 2] that report data about one real world entity, e.g., the
italian basketball player Danilo Gallinari. Several sites may independently publish data about
the same real world topic entity, i.e., Figure 1 shows two pages taken from two sport sites
(espn.com and foxsport.com).
    Websites may publish different attributes, and detail pages about a different set of entities.
Therefore, the integration of partially overlapping information coming from several sites is a
challenging, yet rewarding, problem of building knowledge bases for several popular domains.
Data are found in detail pages published on the Web and extracted into structured records such
those shown in Figure 1c and 1d before being integrated.
    The problem of finding whether two attributes of two distinct sources actually have the same
semantics is commonly known as schema matching [3, 4, 5] (for example the pair of attributes
(𝑎12 , 𝑎22 ) in Figure 1c and 1d), while the problem of finding which structured records relate to
the same real world entity is often called entity linkage [6, 7, 8, 9].

Challenges & Opportunities Although linkage and matching are both considered difficult
Big data integration problems [10], in many practical settings, including ours, the former is
more often considered the most challenging one.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
$ roger.voyat@uniroma3.it (R. Voyat); valter.crescenzi@uniroma3.it (V. Crescenzi); paolo.merialdo@uniroma3.it
(P. Merialdo)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
�              (a) 𝑖 site espn.com.                                         (b) 𝑖𝑖 site foxsport.com.




    (c) Extracted records from espn.com.                    (d) Extracted records from foxsport.com.
Figure 1: Running example: Web detail pages and data extracted from NBA player domain.


   First, for the dimension of the problem, as the number of input entities published in a Web
source can outgrown the number of their attributes by several order of magnitudes: a site can
publish thousands of detail pages with no more than a few dozen different attributes. One might
argue that the number of attributes scales proportionally to the number of Web sources, but
in practice, there are a few head attributes that are used by most of the sources [11], giving
rise to skewed distributions. Second, the challenges posed by the matching problem directly
affect the linkage one, as well: to decide whether two records are related to the same entity, it is
important, especially in the presence of overlap between the entities of two sources, knowing
the matching of the attributes of the compared records.
   Another challenge is that two matching values can have formats and semantics that vary
from source to source, or even from subgroup of entities to subgroup of entities within the same
source. For example, the postal code in England is an alphanumeric string locating a single
building, while in Italy the postal code is a numeric string often associated with several blocks
of a quite big city; in another domain, a smartphone manufacture might adopt a naming policy
for its models that distinguish a model from another for a single letter, e.g., Apple IPhone X and
Apple IPhone XS, whereas another might adopt totally different names, e.g. OPPO Find X5 and
OPPO Reno 6.
   Although automatically building an integrated knowledge base from detail pages published
by many sources is a challenging problem, the are interesting opportunities as well: especially
for the most popular domains, the Web is an almost inexhaustible source of detail pages with
redundant information about the same entities, to the extent that we can assume that our source
base is redundant enough to contain at least two detail pages and at least two values for each
domain entity and every attribute of the knowledge base we want to create.

Problem Description OpenTriage takes as input a domain 𝐷 made of a set of sources
𝒮 = {𝑆1 , 𝑆2 , . . .} where each source 𝑆𝑖 is a set of 𝑛𝑖 detail pages each publishing information
about a single entity from a set of domain entities 𝒪 = {𝑜1 , 𝑜2 , . . .}. A domain entity 𝑜𝑖 includes
a set of domain attributes 𝒜 = {𝑎1 , 𝑎2 , . . .} whose values can be populated with values coming
from the HTML source code of the corresponding detail page by means of a set of extraction
rules associated with the domain attributes.
�Figure 2: OpenTriage high-level architecture.


    In the following, we introduce OpenTriage, a Human-in-the-Loop system that produces as
output a set of entity linkages, i.e., pairs of pages related to the same domain entity, and a set of
attributes matches, i.e., pairs of extraction rules related to the same domain attribute. Indeed,
OpenTriage also generates the extraction rule, if any, that locates the value of domain attribute
from the HTML source of any of the detail pages from a source.
    Figure 1 shows two detail pages that include several attributes (highlighted in green). Figure 1c
and Figure 1d show the extracted records with some anonymous attributes such as 𝑎11 , 𝑎12 , 𝑎13 ,
and 𝑎14 (which is the Height). Possible matches are, for example: (𝑎11 , 𝑎21 ) (i.e., First Name),
(𝑎12 , 𝑎22 ) (i.e., Last Name), 𝑎(13 , 𝑎23 ) (i.e., Team); a linkage is (𝑜11 , 𝑜21 ) which is a pair of entities
from detail pages associated with Danilo Gallinari.


2. OpenTriage System
OpenTriage orchestrates several integration tasks as shown with the high-level architecture
shown in Figure 2. A data extraction algorithm is applied over every input page to generate a
set of extraction rules for every source, thus associating each page with a record of extracted
data. Simultaneously, an initial blocking phase will group pages into blocks of pages potentially
in linkage, with the goal of never separating two pages about the same entity into two different
blocks. Then, OpenTriage launches a new integration task over every block of detail pages, as
shown in Figure 2: iteratively, a linkage task and an attribute matching task are interleaved so
that the produced linkages are used to discover attributes matches which, in turn, are used to
provide better linkages. This process can be supported by an external oracle that contributes to
halt the iteration when the desired quality levels are achieved.

Extraction rules generation and blocking OpenTriage generates a set of candidate extrac-
tion rules for every input page and associates each detail page with a record of extracted data.
This task is based on a state-of-the-art unsupervised algorithm that works per site. It is beyond
the scope of the present paper as it’s already extensively discussed in previous works [12, 13, 14].
As already shown in [2], by assuming that the set of candidate extraction rules is complete, i.e.,
it includes at least one correct rule, the following integration tasks will always prefer the correct
rules over the noisy ones, which do not find values that match with values from other sites. In
the following we will not consider the presence of such noisy rules to simplify the discussion.
    Then OpenTriage use a blocking algorithm working only with the titles of the input pages.
The page title is easy to extract from HTML and, with rare exceptions, it contains some
�      (a) Entity distance               (b) Blocking result                   (c) Block pyramid
Figure 3: Entity distance, block & Pyramid:Figure 3a shows entity alignment with sim. distance inter
(𝑑1 , 𝑑2 ) and intra-source(𝑑3 and 𝑑4 ). In Figure 3b is shown a block result and Figure 3c depict a block
pyramid with green (pairs in linkage), gray (the uncertainty) and red zone (all remaining non-linkage
pairs) used for integration.


information about the entity described in a detail page. More precisely, in blocking, we compute
the distance between two pages (Figure 3a) as the inverse of the Jaccard-index (𝐽) between
the sets composed by the tokens contained in their HTML titles. 𝐽(𝑡1 , 𝑡2 ) = |𝑡|𝑡11 ∪𝑡 2|
                                                                                      ∩𝑡2 | . All pairs
of pages whose similarity is greater than a threshold are considered as belonging to the same
block. We empirically set this threshold to a value (0.6) that our experiments showed to be low
enough to have an almost perfect recall, even if that means sacrificing the precision, so that
pages in linkage will belong to the same group.
   With blocks generated, OpenTriage may obtain a first approximate noisy linkage. Since the
number of input pages can be too big for a traditional quadratic linkage algorithm, as it is the
case for many other linkage solutions [15, 16], we use a basic blocking algorithm [17, 18, 19]
that allows to linearly group the pages into block of potentially similar detail pages and later we
confine only within the pages of a single block further, and more expensive, elaborations. As
later we won’t look for linkages across distinct blocks, blocking techniques are supposed to not
split detail pages related to the same entity in different block to prevent false negatives, yet they
should spread the entities in a reasonable number of not-too-big blocks as shown in Figure 3b.
   Once the bootstrapping phase is over, one record of extracted data is associated with each
input detail page, and the pages have been clustered into blocks of pages potentially related to
the same real world entity. This is the input for the following processing stage.

Linkage of extracted records After the bootstrap phase generates blocks of pages poten-
tially related to the same topic entity, for each block OpenTriage performs a data integration
task that aims at computing both the linkages and the matches over the records of the pages in
a block. The data integration task relies on a normalized distance function between pairs of
records with known matches for finding the linkages, and symmetrically it leverages a distance
function between pairs of attributes with known linkages for finding the matches. For the sake
of presentation, in the following we start assuming that the matches are given, so that we can
focus on finding the linkages, which usually is the most challenging integration task. Later, we
will show how we can exploit a seed set of (possibly noisy) candidate linkages to compute an
instance-based distance measure between attributes to solve the attribute matching problems,
as well. That explains why an entity linkage task and an attribute matching task are interleaved
�in Figure 5.
   We compute the normalized distance for every possible pairs of pages in a block, and we rank
these candidate linkages by distance. These results are organized in a data structure, the Block
Pyramid – BP, that we depicted in Figure 3c, which is at the base of the following operations.

Triage OpenTriage scans every BP to perform a kind of triage over pairs of pages belonging
to the same block. For an ideal domain, where the distance is capable of perfectly ranking the
candidate linkages, all the pairs that correspond to actual linkages are at the top of the BP
(green zone in Figure 3c) whereas all the other pairs (representing pairs of pages that are not in
linkage) should be at the bottom of the pyramid (red zone): a single threshold on the distance
can perfectly separate the green zone from the red one.
   We can further develop this characterization of a domain in term of BP by introducing a
domain property called Domain Separability (DS). A domain is separable if the distance between
two entities in linkage is always smaller that distance between any pair of entities that are not
linkage. Unfortunately, this property does not hold in practice as a distance function cannot
always separate linkage and non-linkage pairs: as shown Figure 3c, there is a “gray” zone that
includes pairs that are in linkage but their distance is greater than the distance of pairs that are
not in linkage, or vice-versa.
   The gray zone confirms the presence of DS violations and the possibility that a non-linkage
could have a lower distance than an actual linkage. Our experimental evaluation with real
pages has shown that the BPs have different size and “shape” not only across distinct domains,
but also from block to block within the same domain. Every page and every attribute can
exhibit, somehow, characteristics local to the Web source, or even to the block, for several
reasons, including: First, the distances are affected by the formats and unit of measure of the
data adopted by a source; second, there are sources publishing inaccurate, false, or just obsolete
information, and finally, there might be other minor semantics differences between published
attributes which are really challenging to detect. For example, an address can be published
with any of the following values: “1600 Amphitheatre Pkwy, Mountain View, CA 94043, United
States” ; “1600 Amphitheatre Parkway in Mountain View, California (USA)”.
   Automatically discovering the best triage for every block of pages is therefore a challenging
and difficult task, that we approach by looking for two thresholds (the green-to-red threshold
separating the green zone from the gray one, and the gray-to-red threshold) and by making an
additional assumption holding for many interesting real domains on the Web.
   We say that a Web source exhibit the Local Consistency property (LC) if it does not publish
more than one detail page related to the same real world entity. This means that the domain does
not include linkages of pages from the very same source, which is often the case. Otherwise, it
would mean that source is unnecessarily publishing twice or more information about the same
entity with the risk of introducing inconsistencies.
   Assuming the LC, as shown in Figure 3c, OpenTriage can find the first green-to-gray thresh-
old as the distance of the first non-linkage pair under the green zone. The pair can be located
by analyzing BP from top, as that having the smallest distance and made of pages belonging to
the same source.
   To identify the second gray-to-red threshold and to complete the triage, OpenTriage lever-
� (a) Block 1 duality                      (b) Block 2 duality                      (c) Block 3 duality
Figure 4: An example over the NBA domain. The top Cartesian planes in sketches the linkages finding
problem: the points correspond to candidate linkages and their distance is computed by considering the
values of two matching attributes associated with the axes. The Cartesian planes in the bottom of the
figures illustrate the matches finding problem: they contain points corresponding candidate matches
and their distance is computed by considering the values of those attributes of two entities associated
with the axes. The gray pairs are those violating the domain separability.


ages the feedback from an oracle, that in practice can be approximated by means of crowdsourc-
ing techniques [20, 21, 22].

2.1. Human-in-the-Loop
OpenTriage exploits an external oracle to perform an approximate triage for each BP. It
poses simple binary questions that double check whether two structured records refer to the
same entity. An example of the questions posed is: do the records “(Danilo Gallinari, Atlanta
Hawks, 6'10'')” and “(Danilo Gallinari, Atlanta Hawks)” refer to the same real world entity?
   Our algorithm processes the candidate pairs in a BP ordered by decreasing distance, and starts
posing queries only for pairs coming after the first green-to-gray threshold: the goal is to find
(if any) the pair of records in linkage with the smallest distance, which better approximates
the gray-to-red threshold. To avoid posing too many questions that just collect useless negative
answers, it uses two stop conditions: (𝑖) halt when it receives too many consecutive negative
responses; (𝑖𝑖) halt when the next pair is too far from the last one. Both conditions depend
on OpenTriage configuration parameters. The best quality-to-cost ratio in our experiments is
found with this setting: we halt after 5 consecutive negative answers, or when the next pair is
at least 0.2 distance away from the last one. The gray-to-red threshold is then approximated by the
smallest distance of the pairs in linkage discovered by the oracle. In case no linkage has been
found, it is assumed that the gray zone shrinks to an empty one and the two thresholds collapse.
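   The questioning loop on a single BP can be sketched as follows, taking the description above
literally; the exact ordering of the candidate pairs is simplified, and the oracle and the data
structures are hypothetical:

    # Sketch of the per-block questioning loop with the two stop conditions.
    # gray_pairs: (distance, record_a, record_b) tuples beyond the green-to-gray
    # threshold, in the order OpenTriage examines them.
    # ask_oracle: callable answering whether two records refer to the same entity.
    def approximate_gray_to_red(gray_pairs, ask_oracle,
                                max_consecutive_negatives=5, max_gap=0.2):
        linkage_distances = []
        consecutive_negatives = 0
        previous_distance = None
        for distance, record_a, record_b in gray_pairs:
            # Stop condition (ii): the next pair is too far from the last one.
            if previous_distance is not None and abs(distance - previous_distance) >= max_gap:
                break
            if ask_oracle(record_a, record_b):
                linkage_distances.append(distance)
                consecutive_negatives = 0
            else:
                consecutive_negatives += 1
                # Stop condition (i): too many consecutive negative answers.
                if consecutive_negatives >= max_consecutive_negatives:
                    break
            previous_distance = distance
        if not linkage_distances:
            return None  # no linkage found: the gray zone shrinks and the thresholds collapse
        return min(linkage_distances)  # approximation of the gray-to-red threshold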

2.2. Matches/Linkages Duality
OpenTriage leverages a distance measure between two records for finding the linkages between
pairs of pages in the same block: that distance is computed by assuming that the attribute
matches of the involved records are known, and then averaging over the distances of the values of
the matching attributes. To relax this assumption, which does not hold for our input, made just
of records automatically extracted from detail Web pages and having anonymous labels, we also
need to find the attribute matches.

Figure 5: Integration tasks over one BP from the NBA domain.

   OpenTriage tackles the problem by using an instance-based
distance measure between attributes from two sources: it is computed by averaging over all
pairs of attribute values from records in linkage, i.e., from entities belonging to the overlap of
the sources.1
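   A minimal sketch of this instance-based measure, assuming the linkages are available as pairs
of extracted records (dictionaries keyed by their anonymous attribute labels) and that a
value-level distance in [0, 1] is given; both are hypothetical simplifications:

    # Sketch: instance-based distance between attribute a1 of one source and
    # attribute a2 of another, averaged over value pairs taken from records in linkage.
    def attribute_distance(linkages, a1, a2, value_distance):
        # linkages: list of (record_1, record_2) pairs for entities in the overlap
        # value_distance: function mapping two values to a distance in [0, 1]
        distances = [value_distance(r1[a1], r2[a2])
                     for r1, r2 in linkages
                     if a1 in r1 and a2 in r2]
        if not distances:
            return 1.0  # no overlapping evidence: treat the attributes as unrelated
        return sum(distances) / len(distances)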
   OpenTriage finds the matches by framing the match finding problem as the dual of the
linkage finding problem. The dual integration task leverages a dual BP in which all the possible
matches are ranked according to their instance-based distance measure. These, in turn, are
based on the linkages obtained by a previous execution of the primal task. The dual triage
classifies the top candidate attribute matches in the green zone by exploiting the dual property
of LC, i.e., the same source never publishes the same attribute twice, and of DS, i.e.,
the instance-based distance between two matching attributes is always smaller than the distance
between two non-matching attributes. The dual triage exploits an oracle to approximate the
gray-to-red threshold by posing very simple binary questions such as: “are ’Danilo Gallinari’
and ’D.Gallinari’ values of two matching attributes?”.
   Figure 5 and Figure 2 show that the primal linkage task and the corresponding dual matching
task are interleaved. Each task produces results consumed in a following iteration of the other
task. OpenTriage’s main processing loop starts with all the weight parameters used in the
normalized distance functions set to the same initial values. The linkages/matches produced by
every triage are then used to update the weights by means of a statistical linear model (i.e., logistic
regression) that also considers the answers provided by the oracle. Updating the weights implies a
possible reshaping of the BP (some pairs might change “color”): the iterations are halted as soon
as the results stop changing significantly, usually after just 2 or 3 iterations. Figures 4a and 4b show
blocks with a few pairs on, or close to, the border separating positive linkage/matching
pairs from negative ones. Those pairs might need a specific oracle question to be fixed after a
triage wrongly classified them. Figure 4c shows a block where an even larger number of questions
might be asked, as it includes many pairs close to the line separating the positive linkages
from the negative ones.
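   The weight update can be sketched as follows, assuming the per-attribute value distances of
each labelled candidate pair are collected as a feature vector and that the labels come from the
oracle answers and the triage output; scikit-learn’s LogisticRegression stands in here for the
statistical linear model mentioned above, as an implementation choice of ours:

    # Sketch of the weight-update step: a logistic regression is fit on the
    # per-attribute distances of labelled candidate pairs, and its coefficients
    # are reused as weights of the normalized record distance.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def update_weights(features, labels):
        # features: one row per candidate pair, one column per matched attribute
        #           (each entry is the distance between the corresponding values)
        # labels:   1 for pairs in linkage, 0 otherwise
        model = LogisticRegression().fit(np.asarray(features), np.asarray(labels))
        weights = np.abs(model.coef_[0])
        return weights / weights.sum()  # normalized weights for the distance function

    # The distance of a candidate pair is then a weighted average of its
    # per-attribute distances, recomputed at every iteration of the main loop.
    def weighted_distance(pair_distances, weights):
        return float(np.dot(pair_distances, weights))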

    1
     This is yet another frequent reason for the existence of gray zones in the matches BPs: if the overlap is too
small, an instance-based distance function might not be reliable enough to provide good distance values.
3. Experimental Results
OpenTriage discovers extraction rules, attribute matches and entity linkages by interleaving
linkage and matching tasks while posing questions to an oracle. Those questions are constrained
by an overall budget, i.e., the maximum number of questions that can be posed to solve a whole
input domain, which represents the cost measure for our approach. As explained in Section 2.1,
OpenTriage adopts two halting conditions that avoid consuming too many questions on the
same block. Our preliminary experiments show that the size of the gray zone, as estimated
by the block triage procedure, monotonically shrinks during the consecutive iterations of the
OpenTriage integration task over that block. Depending on the characteristics of the input
domain, OpenTriage can be used either as a stand-alone technique to fully integrate a domain,
or as an approach to actively isolate the best samples of a training set, which is later used to
feed a machine learning algorithm. Indeed, for easily separable domains, OpenTriage can
help to find and fix a few outliers that need a small number of questions to the oracle. For
these domains, it can automatically separate most of the positive pairs by means of the triage
procedure. Conversely, for a highly non-separable domain, it is better used to select
non-trivial pairs that lie in the gray zone. The questions to the oracle are reserved for highly
informative samples that can be used to train a non-linear machine learning algorithm.

Setup As we are not aware of any other approach that, like ours, tackles three integration
problems over detail Web pages at the same time (extraction, linkage, and matching), for a
preliminary experimental evaluation we compare our system with a state-of-the-art entity
resolution approach over Web data, namely Ditto [23], over the SWDE [24] benchmark. We focus
on 2 out of the 8 domains available in SWDE: NBA players and University. They have rather
different and interesting characteristics: the NBA players dataset is composed of 4.5𝑘 pages
from 10 websites, with a large overlap of entities across sites as shown in Figure 6a (i.e.,
8.8𝑘 entity linkages); on the contrary, the University dataset includes 20𝑘 pages from 10
sites, with a rather low level of redundancy across sources as shown in Figure 6d (i.e., 4.3𝑘
entity linkages). For each processed domain, several executions were made with an increasing
number of questions to the oracle (our cost budget). The blocking task for the NBA domain
generated 90 blocks, whereas 1060 blocks were generated for the University domain. The
former domain has a rather higher level of redundancy than the latter: OpenTriage ends
up consuming a much smaller number of questions (80 vs 1350) by exploiting the available
redundancy.

With or Without DITTO Ditto is a deep learning entity resolution system based on pre-trained
language models, but it solves neither the extraction problem nor the matching problem. We
considered two settings for comparing our linkage results to those produced by Ditto.

• Ditto-Title: we use the page title as a textual attribute describing the entity contained in
  the page. The title can be trivially extracted, and it is useful for linkage as its value is usually
  strictly related to the topic entity. For example, in the NBA pages the title contains the name
  of the player, and in the University pages it contains the name of the university.
(a) NBA entity overlap      (b) NBA linkages      (c) NBA budget
(d) University entity overlap      (e) University linkages      (f) University budget
Figure 6: Domain overlap & results: Figures 6a and 6d show that the domains have a different overlap of
entities published across websites. Figures 6b and 6e show the experimental 𝑃 , 𝑅, 𝐹 of the linkages for
OpenTriage, Ditto-Title and Ditto-Golden. Figures 6c and 6f show the quality of the output linkages w.r.t.
the budget spent.


• Ditto-Golden: considers as input the golden structured records provided with the SWDE
  benchmark: these records correspond to data extracted by manually defined extraction rules.
  Observe that, in a real scenario, using Ditto would require an additional effort for crafting
  the extraction rules.

   Ditto is a supervised system: we trained it using all the linkages obtained by OpenTriage,
and then we measured the quality of the output linkages w.r.t. a golden set of linkages using
standard quality metrics. Precision (𝑃 ) is the percentage of output linkages which are correct,
Recall (𝑅) is the percentage of linkages in the golden set which are in the output linkages, and
the 𝐹 -measure is their harmonic mean. Figures 6b and 6e show the results OpenTriage achieved
for the considered domains with an unlimited budget. In both domains the output quality is
good, but the University domain required far more questions than the NBA domain (1,350
vs 83). As long as the input Web sources have enough overlap in the set of provided detail
pages, our approach adapts well to the characteristics of the input domain, even if this means
paying for many additional questions. Figures 6c and 6f show how the quality of the results is
affected by a limited budget. We observed a stronger impact on the recall than on the precision,
especially for domains with a large overlap, where a few additional questions can significantly
improve the triage quality.
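   For completeness, the three metrics over the output and golden linkage sets (represented here
as sets of hashable pair identifiers; a minimal sketch):

    # Sketch: Precision, Recall and F-measure of the output linkages w.r.t. a golden set.
    def linkage_quality(output, golden):
        output, golden = set(output), set(golden)
        correct = len(output & golden)
        precision = correct / len(output) if output else 0.0
        recall = correct / len(golden) if golden else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall > 0 else 0.0)
        return precision, recall, f_measure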
   For domains with a limited overlap of entities and attributes, the amount of available redun-
dancy might be too small to support an effective triage. In this situation, OpenTriage would
end up generating far more queries to support the system iterations than it does in the presence
of a good amount of redundancy. As future work, to address this problem, a natural
evolution of OpenTriage is to support the incremental integration of new sources
with different degrees of overlap. Indeed, a new source would increase the size of the problem,
but at the same time it might contribute even more incisively to lowering the overall cost if it
significantly overlaps with the other sources already available for the domain [25].


4. Related work
Big Data Integration [10] and Information Extraction [1] are broad research topics related to
our approach: we only have room to provide some entry points to such a vast literature.
   The three main problems we study in this proposal, i.e., data extraction [12, 26], schema
matching [3, 4, 5], and entity linkage [6, 7, 8, 9], have most often been tackled in isolation, with
some rare exceptions [2]. However, entity linkage on Web data has received much less attention so
far. Blocking techniques for entity resolution are a topic that has often been studied in the
literature [16, 15]. Unlike the Closed Information Extraction approaches [27, 28], where the
managed knowledge base does not grow in terms of subjects and predicates but only in terms of
values and data, OpenTriage falls in the realm of the Open Information Extraction approaches, as
it finds new domain attributes while processing the input sources. The research community has
pursued the problem with different types of solutions. Traditional
approaches [2, 13, 29, 30, 28, 31, 32, 33, 34] take advantage of the regularity of the HTML
structures of pages published by templated websites. Other recent proposals [1, 35] use a hybrid
approach exploiting both the regularity of the publishing patterns of HTML pages and the tagging
of attribute values provided by human workers. The main difference with OpenTriage is that our
system does not require a partially populated knowledge base to trigger the acquisition of new data.
   Other information extraction works [36, 37, 24] have exploited computer vision methods: for
example, they also consider page screenshots, the bounding boxes around the text nodes, and
their visual distances. These methods require expensive rendering for data processing and need a
lot of training data for ML models. A recent approach is FreeDOM [38], which avoids rendering
costs but still needs human feedback for tagging attribute values. Unlike the latter [38],
OpenTriage does not need expert crowdsourcing workers.

Conclusions & Future work OpenTriage can be used either directly as an entity linkage
solution over Web pages, or as an active learning algorithm for selecting the most profitable
training data to feed a downstream non-linear ML solution like Ditto [23]. Our future work
includes evaluating OpenTriage on other datasets. It will be interesting to see whether the
monotonic trend of the gray areas is always confirmed and to study under which assumptions it
can be proved. Another interesting development is the design of an incremental solution
leveraging the redundancy for finding the best source integration strategy [25] while integrating
a single source at a time.


References
 [1] C. Lockard, P. Shiralkar, X. L. Dong, Openceres: When open information extraction meets
     the semi-structured web, in: Proceedings of the 2019 Conference of the North American
     Chapter of the Association for Computational Linguistics: Human Language Technologies,
     Volume 1 (Long and Short Papers), 2019, pp. 3047–3056.
 [2] M. Bronzi, V. Crescenzi, P. Merialdo, P. Papotti, Extraction and integration of partially
     overlapping web sources, Proceedings of the VLDB Endowment 6 (2013) 805–816.
 [3] Z. Bellahsene, A. Bonifati, F. Duchateau, Y. Velegrakis, On evaluating schema matching
     and mapping, in: Schema matching and mapping, Springer, 2011, pp. 253–291.
 [4] J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with cupid, in: VLDB,
     volume 1, Citeseer, 2001, pp. 49–58.
 [5] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, the
     VLDB Journal 10 (2001) 334–350.
 [6] D. Qiu, L. Barbosa, V. Crescenzi, P. Merialdo, D. Srivastava, Big data linkage for product
     specification pages, in: Proceedings of the 2018 International Conference on Management
     of Data, 2018, pp. 67–81.
 [7] A. Gruenheid, X. L. Dong, D. Srivastava, Incremental record linkage, Proceedings of the
     VLDB Endowment 7 (2014) 697–708.
 [8] S. E. Whang, H. Garcia-Molina, Incremental entity resolution on rules and data, The VLDB
     journal 23 (2014) 77–102.
 [9] D. G. Brizan, A. U. Tansel, A survey of entity resolution and record linkage methodologies,
     Communications of the IIMA 6 (2006) 5.
[10] X. L. Dong, D. Srivastava, Big data integration, in: 2013 IEEE 29th international conference
     on data engineering (ICDE), IEEE, 2013, pp. 1245–1248.
[11] N. Dalvi, A. Machanavajjhala, B. Pang, An analysis of structured data on the web, arXiv
     preprint arXiv:1203.6406 (2012).
[12] V. Cetorelli, P. Atzeni, V. Crescenzi, F. Milicchio, The smallest extraction problem, Pro-
     ceedings of the VLDB Endowment 14 (2021) 2445–2458.
[13] V. Crescenzi, P. Merialdo, Wrapper inference for ambiguous web pages, Applied Artificial
     Intelligence 22 (2008) 21–52.
[14] V. Cetorelli, V. Crescenzi, P. Merialdo, R. Voyat, Noah: Creating data integration pipelines
     over continuously extracted web data., in: EDBT/ICDT Workshops, 2021.
[15] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, A survey of blocking and filtering
     techniques for entity resolution, arXiv preprint arXiv:1905.06167 (2019).
[16] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, N. Tang, Distributed represen-
     tations of tuples for entity resolution, Proceedings of the VLDB Endowment 11 (2018)
     1454–1467.
[17] A. Gionis, P. Indyk, R. Motwani, et al., Similarity search in high dimensions via hashing,
     in: VLDB, volume 99, 1999, pp. 518–529.
[18] V. Christophides, V. Efthymiou, K. Stefanidis, Entity resolution in the web of data, Synthesis
     Lectures on the Semantic Web 5 (2015) 1–122.
[19] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, T. Palpanas, Parallel meta-
     blocking for scaling entity resolution over big heterogeneous data, Information Systems
     65 (2017) 137–157.
[20] V. Crescenzi, A. A. Fernandes, P. Merialdo, N. W. Paton, Crowdsourcing for data manage-
     ment, Knowledge and Information Systems 53 (2017) 1–41.
[21] G. Li, Y. Zheng, J. Fan, J. Wang, R. Cheng, Crowdsourced data management: Overview and
     challenges, in: Proceedings of the 2017 ACM International Conference on Management of
     Data, 2017, pp. 1711–1716.
[22] T. S. Behrend, D. J. Sharek, A. W. Meade, E. N. Wiebe, The viability of crowdsourcing for
     survey research, Behavior research methods 43 (2011) 800–813.
[23] Y. Li, J. Li, Y. Suhara, A. Doan, W.-C. Tan, Deep entity matching with pre-trained language
     models, arXiv preprint arXiv:2004.00584 (2020).
[24] Q. Hao, R. Cai, Y. Pang, L. Zhang, From one tree to a forest: a unified solution for structured
     web data extraction, in: Proceedings of the 34th international ACM SIGIR conference on
     Research and development in Information Retrieval, 2011, pp. 775–784.
[25] X. L. Dong, B. Saha, D. Srivastava, Less is more: Selecting sources wisely for integration,
     Proceedings of the VLDB Endowment 6 (2012) 37–48.
[26] V. Crescenzi, G. Mecca, P. Merialdo, et al., Roadrunner: Towards automatic data extraction
     from large web sites, in: VLDB, volume 1, 2001, pp. 109–118.
[27] C. Lockard, X. L. Dong, A. Einolghozati, P. Shiralkar, Ceres: Distantly supervised relation
     extraction from the semi-structured web, arXiv preprint arXiv:1804.04635 (2018).
[28] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang, Diadem: thousands
     of websites to a single database, Proceedings of the VLDB Endowment 7 (2014) 1845–1856.
[29] N. Bhutani, H. Jagadish, D. Radev, Nested propositions in open information extraction, in:
     Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,
     2016, pp. 55–64.
[30] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction,
     arXiv preprint arXiv:1806.05599 (2018).
[31] H. A. Sleiman, R. Corchuelo, Trinity: on using trinary trees for unsupervised web data
     extraction, IEEE Transactions on Knowledge and Data Engineering 26 (2013) 1544–1556.
[32] M. A. B. M. Azir, K. B. Ahmad, Wrapper approaches for web data extraction: A review,
     in: 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI),
     IEEE, 2017, pp. 1–6.
[33] M. Schmitz, S. Soderland, R. Bart, O. Etzioni, et al., Open language learning for information
     extraction, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
     Language Processing and Computational Natural Language Learning, 2012, pp. 523–534.
[34] A. Fader, S. Soderland, O. Etzioni, Identifying relations for open information extraction, in:
     Proceedings of the 2011 conference on empirical methods in natural language processing,
     2011, pp. 1535–1545.
[35] Y. Zhou, Y. Sheng, N. Vo, N. Edmonds, S. Tata, Learning transferable node representations
     for attribute extraction from web documents, in: Proceedings of the Fifteenth ACM
     International Conference on Web Search and Data Mining, 2022, pp. 1479–1487.
[36] C. Lockard, P. Shiralkar, X. L. Dong, H. Hajishirzi, Zeroshotceres: Zero-shot relation
     extraction from semi-structured webpages, arXiv preprint arXiv:2005.07105 (2020).
[37] A. Carlson, C. Schafer, Bootstrapping information extraction from semi-structured web
     pages, in: Joint European Conference on Machine Learning and Knowledge Discovery in
     Databases, Springer, 2008, pp. 195–210.
[38] B. Y. Lin, Y. Sheng, N. Vo, S. Tata, Freedom: A transferable neural architecture for structured
     information extraction on web documents, in: Proceedings of the 26th ACM SIGKDD
     International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1092–1102.