Vol-3170/paper11

id  Vol-3170/paper11
wikidataid  Q117351471
title  Towards Automatic Extraction of Events for SON Modelling
pdfUrl  https://ceur-ws.org/Vol-3170/paper11.pdf
dblpUrl  https://dblp.org/rec/conf/apn/Alshammari22
volume  Vol-3170

Towards Automatic Extraction of Events for SON Modelling
Tuwailaa Alshammari
School of Computing, Newcastle University, Science Square, Newcastle upon Tyne, NE4 5TG, United Kingdom



Abstract
Data visualization is the process of transforming data into a visual representation in order to make it easier
for humans to comprehend and derive knowledge from. By offering a detailed overview of crime events,
data visualization technologies have the potential to assist investigators in analysing crimes.
   This paper proposes a new model that takes advantage of statistical natural language processing
technologies to extract people’s names and relevant events from crime documents for SON modelling and
visualisation. The proposed extractor is examined and evaluated. We argue that it achieves reasonable
results when extracting data for SON modelling, compared with human extraction.

Keywords
structured occurrence net, structured acyclic nets, communication structured acyclic net, natural language
processing, event extraction, model building




1. Introduction
Structured occurrence nets (SONs) [1, 2] are a Petri net based formalism for representing the
behaviour of complex systems consisting of interdependent subsystems which proceed concurrently
and interact with each other. This extends the concept of an occurrence net, which represents a
single ‘causal history’ and provides a full and unambiguous record of all causal dependencies
between the events it involves. An example of a complex system is (cyber)crime, whose
computer-based representation and analysis have gained considerable research attention in recent
years using, in particular, the SON model [3].
   Communication structured acyclic nets (CSA-nets) [4] are an extension of SONs which is
based on acyclic nets (ANs) rather than occurrence nets (ONs). A CSA-net joins together two or
more ANs by employing buffer places to connect pairs of events from different ANs. The nature of
such connections can be synchronous or asynchronous. In synchronous communication, events
are executed concurrently, whereas in asynchronous communication, events may be executed
concurrently or sequentially.
   One of the main challenges in conducting effective criminal investigations is the overwhelming
amount of data, which makes it challenging for investigators to comprehend the crime and,
therefore, make decisions. In particular, investigators rely on a variety of sources of information
during criminal investigations, including police written reports and witness statements, which
may contain information that needs to be extracted and analysed. This is performed by connecting
and analysing the different aspects of the crime in order to comprehend the causality behind
crime events.

PNSE’22, International Workshop on Petri Nets and Software Engineering, Bergen, Norway, 2022
t.t.t.alshammari2@ncl.ac.uk (T. Alshammari)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 188–201

   Natural Language Processing (NLP) can help in the analysis of such unstructured data sources
and the extraction of crime events. According to [5], NLP began in the 1950s. With the invention
of the computer, it became necessary to build human-machine interaction relationships in order
to teach computers how to interpret real human language through manipulation and analysis of
human text and speech. NLP is defined as a sub-field of Artificial Intelligence and Linguistics
that focuses on teaching computers to recognise and interpret text, statements, and words written
in human languages. NLP is now employed in a number of applications, including machine
translation, sentiment analysis, and chatbots. As a result, several natural language processing
tools and libraries have emerged in recent years, including CoreNLP, NLTK, and spaCy (used in
the work presented here). spaCy [6] is an open-source natural language processing toolkit that
was created to assist developers in implementing natural language processing annotations and
tasks. It uses statistical models and excels at data extraction and text preparation for deep learning.
As with other NLP libraries, spaCy has a variety of valuable linguistic features, including Part of
Speech (POS) tagging, a dependency parser, and a Named Entity Recognizer (NER). This paper
proposes the integration of SONs with NLP in the extraction and modelling of crime events. The
idea is to extract useful information from unstructured written sources in order to analyse and
visualise it in SONs.
   This paper is organised as follows. Section 2 provides an overview of the research background
on SONs and NLP. Section 3 presents basic definitions concerning SONs. Section 4 presents the
extraction and modelling of events using SONs as performed by humans, and then introduces
Text2SON, our proposed automatic extraction approach. Section 5 discusses and compares the
results and shortcomings of both manual and automatic extraction. Section 6 concludes the paper
and provides an overview of future work.


2. Background
In this section, we look at related work that has attempted to address the challenges mentioned
above through graph analysis and visualisation, NLP, and data extraction for crime
data. In particular, we consider work on crime modelling in SONs as well as crime data extraction
using various NLP techniques.
   SONs have demonstrated promising results for accident, criminal and (cyber)crime investigations.
The work in [7] demonstrates an explicit capture of accident behaviours for multiple sub-systems by
modelling them in SONs. It showed the ability of SONs to aid an investigator in comprehending
how the accident occurred and tracing the sequence of events leading up to the accident cause.
Moreover, [3] suggests the use of SON features to detect DNS tunneling during an actual attack. A
unique method for detecting DNS tunneling based on SONs has been created and implemented.
Additionally, pre-processing of data and a set of experiments were discussed.
   The paper [8] introduced new WoPeD capabilities for integrating NLP and Business Processing.
WoPeD (Petri net editor and simulator), is an open source Java application for building business
processes using workflow nets. Algorithms have been presented for converting graphical process
models into textual process descriptions and vice versa. However, the tool suffers from the
common issue of semantic ambiguity in natural language processing (in both directions).
   News and social media are receiving considerable attention in information extraction and
classification. [9] describes the development of a crime investigation tool that leverages Twitter data to aid
criminal investigations, by providing contextual information about crime occurring in a certain
location. A prototype has been implemented in the San Francisco region. This system provides
users with a spatial view of criminal incidents and associated tweets in the area, allowing them to
investigate the various tweets and crimes that occurred prior to and following a crime incident,
as well as obtaining information about the spatial and temporal characteristics of a crime via
the web. [10] presents data mining techniques, such as clustering, which have been shown to
be useful in extracting insights from publicly available structured data (National Crime Records
Bureau). Additionally, an approach for retrieving data via web scraping from news media has
been presented, as well as the essential NLP techniques for extracting significant information that
is not available through typical structured data sources.
   Ongoing work on the Simple Event Model ontology [11] discusses populating instances
extracted from crime-related documents, aided by an SVO (Subject, Verb, Object) algorithm that
extracts events using hand-crafted rules. The study employs the SVO algorithm to generate SVO
triples by parsing crime-related words using the MALTPARSER 5 dependency parser and then
extracting SVO triples from the parsed sentences.


3. Preliminaries
In this section, we recall basic definitions of the SON model needed in the rest of the paper.

Acyclic nets and occurrence nets
An acyclic net is a ‘database’ of empirical facts (both actual and hypothetical), expressed using
places, transitions, and arcs linking them, and accumulated during an investigation. Acyclic nets
can represent alternative ways of interpreting what has happened, and so may exhibit (backward
and forward) non-determinism. Occurrence nets are an example of acyclic nets: an occurrence net
provides a full and unambiguous record of all causal dependencies between the events it involves,
and so represents a single ‘causal history’.
   Formally, an acyclic net is a triple $\mathit{acnet} = (P, T, F) = (P_{\mathit{acnet}}, T_{\mathit{acnet}}, F_{\mathit{acnet}})$, where $P$ and $T$ are
disjoint sets of places and transitions respectively, and $F \subseteq (P \times T) \cup (T \times P)$ is a flow relation
such that $F$ is acyclic, and, for every $t \in T$, there are $p, q \in P$ such that $pFt$ and $tFq$. Moreover,
$\mathit{acnet}$ is an occurrence net if, for each place $p \in P$, there is at most one $t \in T$ such that $tFp$, and
at most one $u \in T$ such that $pFu$ (places of the initial marking have no incoming arcs at all).
   An acyclic net is well-formed if for every step sequence starting from the default initial marking
(i.e., the set of places without incoming arcs), no transition occurs more than once, and the sets of
post-places of the transitions which have occurred are disjoint. Note that all occurrence nets are
well-formed.
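
The structural part of these definitions can be checked mechanically. The following is a minimal sketch, not code from the paper: the `Net` class and the function names are hypothetical, the occurrence-net condition is read as "no place has more than one input or more than one output transition" (so that initial places without incoming arcs are allowed), and well-formedness, which depends on step sequences, is not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Net:
    """A net (P, T, F); an illustrative container, not the paper's data structure."""
    places: frozenset
    transitions: frozenset
    flow: frozenset  # arcs as (source, target) pairs

    def pre(self, node):
        return {x for (x, y) in self.flow if y == node}

    def post(self, node):
        return {y for (x, y) in self.flow if x == node}


def is_acyclic_net(net):
    # P and T must be disjoint.
    if net.places & net.transitions:
        return False
    # F may only link places to transitions or transitions to places.
    for x, y in net.flow:
        place_to_trans = x in net.places and y in net.transitions
        trans_to_place = x in net.transitions and y in net.places
        if not (place_to_trans or trans_to_place):
            return False
    # Every transition needs at least one pre-place and one post-place.
    if any(not net.pre(t) or not net.post(t) for t in net.transitions):
        return False
    # F is acyclic: repeatedly strip nodes that have no incoming arc.
    nodes = set(net.places | net.transitions)
    arcs = set(net.flow)
    while nodes:
        targets = {y for (_, y) in arcs}
        sources = nodes - targets
        if not sources:
            return False  # the remaining nodes all sit on or behind a cycle
        nodes -= sources
        arcs = {(x, y) for (x, y) in arcs if x not in sources}
    return True


def is_occurrence_net(net):
    # No place may have several producers or several consumers.
    return is_acyclic_net(net) and all(
        len(net.pre(p)) <= 1 and len(net.post(p)) <= 1 for p in net.places
    )


# A single causal chain p0 -> t -> p1.
chain = Net(frozenset({"p0", "p1"}), frozenset({"t"}),
            frozenset({("p0", "t"), ("t", "p1")}))
# Forward non-determinism: p0 can be consumed by either a or b.
choice = Net(frozenset({"p0", "p1", "p2"}), frozenset({"a", "b"}),
             frozenset({("p0", "a"), ("p0", "b"), ("a", "p1"), ("b", "p2")}))
# A cycle p -> t -> p, which is not an acyclic net.
loop = Net(frozenset({"p"}), frozenset({"t"}),
           frozenset({("p", "t"), ("t", "p")}))
```

Here `chain` is both an acyclic net and an occurrence net, `choice` is acyclic but not an occurrence net because `p0` has two consumers, and `loop` fails the acyclicity test.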






Communication structured acyclic nets
A communication structured acyclic net consists of a number of disjoint acyclic nets which
can communicate through special (buffer) places. CSA-nets may exhibit backward and forward
non-determinism. They can contain cycles involving buffer places. Formally, a communication
structured acyclic net (or CSA-net) is a tuple $\mathit{csan} = (\mathit{acnet}_1, \dots, \mathit{acnet}_n, Q, W)$ ($n \geq 1$) such that:
   1. $\mathit{acnet}_1, \dots, \mathit{acnet}_n$ are well-formed acyclic nets with disjoint sets of nodes (i.e., places and
      transitions). We also denote:

         $P_{\mathit{csan}} = P_{\mathit{acnet}_1} \cup \dots \cup P_{\mathit{acnet}_n}$
         $T_{\mathit{csan}} = T_{\mathit{acnet}_1} \cup \dots \cup T_{\mathit{acnet}_n}$
         $F_{\mathit{csan}} = F_{\mathit{acnet}_1} \cup \dots \cup F_{\mathit{acnet}_n}$.

   2. $Q$ is a set of buffer places and $W \subseteq (Q \times T_{\mathit{csan}}) \cup (T_{\mathit{csan}} \times Q)$ is a set of arcs adjacent to the
      buffer places satisfying the following:
        a) $Q \cap (P_{\mathit{csan}} \cup T_{\mathit{csan}}) = \emptyset$.
        b) For every buffer place $q$:
               i. There is at least one transition $t$ such that $tWq$.
              ii. If $tWq$ and $qWu$ then transitions $t$ and $u$ belong to different component acyclic
                  nets.
That is, in addition to requiring the disjointness of the component acyclic nets and the buffer
places, it is required that buffer places pass tokens between different component acyclic nets. In
the step semantics of CSA-nets, the role of the buffer places is special as they can ‘instantaneously’
pass tokens from transitions producing them to transitions needing them. In this way, cycles
involving only the buffer places and transitions do not stop steps from being executable.
   A CSA-net $\mathit{csan} = (\mathit{acnet}_1, \dots, \mathit{acnet}_n, Q, W)$ is a communication structured occurrence net (or
CSO-net) if the following hold:

   1. The component acyclic nets are occurrence nets.
   2. For every $q \in Q$, there is exactly one $t \in T_{\mathit{csan}}$ such that $tWq$, and exactly one $u \in T_{\mathit{csan}}$
      such that $qWu$.
   3. No place in $P_{\mathit{csan}}$ belongs to a cycle in the graph of $F_{\mathit{csan}} \cup W$.
That is, only cycles involving buffer places are allowed.
   All CSO-nets are well-formed in a sense similar to that of well-formed acyclic nets. As a result,
they support clear notions of, in particular, causality and concurrency between transitions.
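
As an illustration of the buffer-place conditions, the sketch below checks conditions 2(a) and 2(b) of the CSA-net definition and the "no place on a cycle" condition for CSO-nets. It is only a sketch under stated assumptions: each component net is passed as a plain `(P, T, F)` tuple of Python sets, the function names are hypothetical, and the well-formedness of the components is taken for granted rather than verified.

```python
def component_of(transition, components):
    """Index of the component net that owns `transition` (None if absent)."""
    for i, (_, transitions, _) in enumerate(components):
        if transition in transitions:
            return i
    return None


def buffer_conditions_hold(components, Q, W):
    """Check conditions 2(a) and 2(b) for buffer places Q and arcs W."""
    P = set().union(*(p for (p, _, _) in components))
    T = set().union(*(t for (_, t, _) in components))
    # 2(a): buffer places are fresh nodes.
    if Q & (P | T):
        return False
    # W only links buffer places with transitions.
    for x, y in W:
        if not ((x in Q and y in T) or (x in T and y in Q)):
            return False
    for q in Q:
        producers = {t for (t, y) in W if y == q}
        consumers = {u for (x, u) in W if x == q}
        # 2(b)i: some transition feeds the buffer place.
        if not producers:
            return False
        # 2(b)ii: producer and consumer lie in different components.
        if any(component_of(t, components) == component_of(u, components)
               for t in producers for u in consumers):
            return False
    return True


def place_on_cycle(components, W):
    """True if some non-buffer place lies on a cycle of F_csan united with W."""
    P = set().union(*(p for (p, _, _) in components))
    arcs = set(W).union(*(f for (_, _, f) in components))
    succ = {}
    for x, y in arcs:
        succ.setdefault(x, set()).add(y)

    def reaches(start, target):
        # Depth-first search for a non-empty path from start to target.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in succ.get(node, ()):
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return any(reaches(p, p) for p in P)


# Two one-transition components, synchronised through buffer places q and q2.
net_a = ({"a0", "a1"}, {"ta"}, {("a0", "ta"), ("ta", "a1")})
net_b = ({"b0", "b1"}, {"tb"}, {("b0", "tb"), ("tb", "b1")})
components = [net_a, net_b]
Q = {"q", "q2"}
W = {("ta", "q"), ("q", "tb"), ("tb", "q2"), ("q2", "ta")}
```

In this example the cycle `ta, q, tb, q2, ta` passes only through transitions and buffer places, so `place_on_cycle` returns `False`, matching the remark that only cycles involving buffer places are allowed.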
   In this paper, we use occurrence nets and CSO-nets rather than more general acyclic nets and
CSA-nets. However, this will change in future work, when we move to the next stages
where alternative statements in textual documents are taken into account.


4. Extraction and Modelling
Crime can be conceptualised as a complex evolving system characterised by the occurrence of
numerous relevant and linked variables. Such systems require the examination and comprehension
of behaviour to assist investigators in the decision-making process. Investigators typically rely
on a variety of sources, including written police reports and/or witness statements. CSA-nets
provide a distinctive method for analysing such types of crimes by representing events and chains
of events in order to uncover causal relationships between them. CSA-nets can also assist in
better comprehension and visualisation of events. Our work relies on integrating NLP techniques
with CSA-nets in order to extract useful information from written sources and to represent crime
events through CSA-nets. This integration aims at the development of an automatic extraction
tool (Text2SON) for criminal cases leveraging statistical NLP models.

4.1. Human extraction: an experiment
Extracting information from unstructured data, such as written investigation reports, aims to
extract valuable information that could aid investigators in analysing and comprehending the
dependencies between crime events. CSA-nets are one of the potential techniques for visually
representing data in order to assist investigators in analysing and identifying causality among
these occurrences.
   The existing CSA-net approach lacks the ability to automatically extract information from
(unstructured) written sources and reports. Figure 2 illustrates the outcome of extracting information
and representing it by three expert SON users. The experiment was focused on a short fragment
of a crime story displayed in Figure 1. The users were asked to extract and represent crime events
as a SON model. In addition, we were interested in observing the style of human extraction and
modelling processes in order to determine the consistency of the models and the amount of time
spent.




Figure 1: A short fragment of a crime story


   The users extracted the following verbs from the sentences: play, lost, leaves, wearing,
goes/returned, and shoot. Nevertheless, not all users agreed on the exact model design or
on the wording. For example, Modeller1 extracted only three verbs (play, lost, and shot),
whereas Modeller3 extracted five verbs, including verbs that were not explicitly
mentioned in the sentences. Modeller3 added the words leaves and goes, which may be explained
by the human capacity to comprehend and express events differently.
   Despite minor representational discrepancies (for example, the extent of information provided
by different modellers), the experiment revealed semantically similar models. In comparison to
other modellers, Modeller1 extracted just enough data. Modeller1 extracted and presented the
offense in a very straightforward manner by extracting two entities and three verbs. Modeller2,
on the other hand, added an additional entity, ON: DICE, which the other two modellers did not;
Modeller1 and Modeller3 instead inserted DICE into the play event, as PLAYED_DICE and PLAY_DICE,
respectively.

Figure 2: Human modelling done by SON expert users

4.2. Automatic extraction: Text2SON
With the above in mind, our goal is to develop a tool to extract crime events and relationships
between crime events, and build a model in SONs for behaviour analysis and visualisation. We
have applied three methods of extraction to identify and evaluate the most accurate method
compared to human extraction presented in the previous section. Initially, we only considered
extracting the verb that spaCy’s parser nominates as the main verb of a sentence, tagged
as ROOT. A ROOT tag appears once in every sentence and marks the main word carrying the
meaning of the sentence (usually a verb). We then considered extracting more information,
by evaluating the most frequently occurring verbs in the reserved data set (note that from the
evaluation of around 570 crime stories we compiled a list of most frequently occurring verbs).
We then tested the second method by extracting both ROOT verbs and verbs that match common
verbs. Finally, we considered extracting all other verbs present in the text, including verbs tagged
as ROOT.
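
The three selection strategies can be sketched as follows. This is a minimal illustration and not the Text2SON implementation: the `Token` tuple, the `extract_events` function, the method names, the sample sentence, and its hand-assigned tags are all assumptions that stand in for the output of a POS tagger and dependency parser.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    lemma: str
    pos: str  # coarse part-of-speech tag, e.g. "VERB"
    dep: str  # dependency label; "ROOT" marks the sentence's main word


METHODS = ("root", "root+common", "all")


def extract_events(sentence, method, common_verbs=frozenset()):
    """Return the verbs selected from one tagged sentence.

    "root"        keeps only the ROOT verb,
    "root+common" adds verbs whose lemma is in `common_verbs`,
    "all"         keeps every verb, the ROOT verb included.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    events = []
    for tok in sentence:
        if tok.pos != "VERB":
            continue
        is_root = tok.dep == "ROOT"
        is_common = method == "root+common" and tok.lemma in common_verbs
        if method == "all" or is_root or is_common:
            events.append(tok.text)
    return events


# Invented sentence with hand-assigned tags (not taken from the crime story).
sentence = [
    Token("Alice", "Alice", "PROPN", "nsubj"),
    Token("returned", "return", "VERB", "ROOT"),
    Token("wearing", "wear", "VERB", "advcl"),
    Token("gloves", "glove", "NOUN", "dobj"),
    Token("and", "and", "CCONJ", "cc"),
    Token("shot", "shoot", "VERB", "conj"),
    Token("Bob", "Bob", "PROPN", "dobj"),
]
common = frozenset({"shoot"})  # stand-in for the frequent-verb list
```

With these inputs, the first method keeps only `returned`, the second adds `shot` because its lemma is in the common-verb list, and the third also keeps `wearing`.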

4.2.1. Terminology
Tokenization: the process of splitting a sentence into a series of tokens.
Part-of-Speech (POS) tagging: assigns part-of-speech tags to tokens.
Dependency Parsing: links tokens (words) as they grammatically appear in the sentence and
assigns them parsing tags that show their relationships, i.e., subject, object, conjunction, etc.
Named Entity Recognizer (NER): a model that identifies entities within a text and
tags them with name types.
Coreferencing: resolves pronouns and mentions to the original names they refer to.
ROOT verbs: main verbs of sentences, predicted by the parser as the primary
word from which the sentence is parsed.
Occurrence Net (ON): an acyclic net which provides a full and unambiguous record of all causal
dependencies between the events it involves.

4.2.2. Main assumptions and preliminary rules
In order to extract data automatically, we propose to extract PEOPLE entities and verbs. In our
methodology, we consider entities as representations of ONs, and verbs as representations of
EVENTs within the ONs. We also consider that EVENTs shared between different ONs
represent potential synchronised communications, and we connect them formally using channel
places.
   A number of assumptions have been put in place in order to carry out the extraction. We first
assume that verbs tagged as ROOT verbs represent events (and each occurs exactly once). Then,
since a ROOT verb appears only once in a sentence, we assume that events are represented either by
verbs tagged as ROOT verbs or by verbs in the list of most frequently occurring verbs. This is due to
the possibility of more than one event occurring in the same sentence.

4.2.3. The Extractor
The proposed Text2SON extractor utilises the statistical models provided by spaCy. The
algorithm in Figure 3 extracts people names present in a text, by searching for names in every
sentence. Following this phase, we analyse every sentence by searching for presence of people
names in each sentence and the occurring verbs labelled as ROOT. Once found, they are grouped
together in a list. This process is repeated until the text reaches its end, resulting in lists of people
names and their associated events (verbs). This is required because we regard names to be the
representations of occurrence nets (i.e., each name is associated with exactly one ON), and verbs
to be the representations of events.

Figure 3: Extractor design

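
The sentence-by-sentence loop described above can be sketched as follows. This is an illustrative reading rather than the algorithm of Figure 3: the `Token` tuple and both functions are hypothetical, the annotations are written by hand in place of real NER and parser output, and the label string `PERSON` (the label spaCy's English models use for people, referred to as PEOPLE in this paper) is an assumption about the underlying model. Multi-word names are treated as single tokens for simplicity.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    pos: str       # coarse part-of-speech tag
    dep: str       # dependency label, "ROOT" for the main word
    ent_type: str  # named-entity label, "" if the token is not an entity


PERSON_LABEL = "PERSON"  # assumed label for people


def extract_sentence(tokens):
    """Return (people, root_verb) for one annotated sentence."""
    people, root_verb = [], None
    for tok in tokens:
        if tok.ent_type == PERSON_LABEL and tok.text not in people:
            people.append(tok.text)
        if tok.dep == "ROOT" and tok.pos == "VERB":
            root_verb = tok.text
    return people, root_verb


def extract_text(sentences):
    """Collect (people, root_verb) pairs in text order.

    Sentences without any person or without a ROOT verb contribute nothing.
    """
    pairs = []
    for tokens in sentences:
        people, root_verb = extract_sentence(tokens)
        if people and root_verb is not None:
            pairs.append((people, root_verb))
    return pairs


# Invented, hand-annotated sentences (not the text of Figure 1).
sentences = [
    [Token("Alice", "PROPN", "nsubj", "PERSON"), Token("and", "CCONJ", "cc", ""),
     Token("Bob", "PROPN", "conj", "PERSON"), Token("played", "VERB", "ROOT", ""),
     Token("dice", "NOUN", "dobj", "")],
    [Token("Alice", "PROPN", "nsubj", "PERSON"), Token("lost", "VERB", "ROOT", "")],
    [Token("It", "PRON", "nsubj", ""), Token("rained", "VERB", "ROOT", "")],
]
pairs = extract_text(sentences)
```

Each resulting pair lists the people found in a sentence together with its ROOT verb; the third sentence is dropped because it mentions no person.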
   Figure 4(a) demonstrates the tools used to create the Text2SON extractor and their respective
versions. We utilised spaCy, a Python library that makes use of pipeline packages with key
linguistic features such as a tagger, parser, and NER.

Figure 4: Tools for extractor design (a); and extractor design (b)

   Figure 4(b) illustrates the steps involved in the extraction process. To begin with, text data
is fed into spaCy’s pipeline, where it is tokenised into individual tokens. The tokens are then
passed to spaCy’s tagger, which assigns anticipated POS tags, depending on the pre-trained
model predictions. The parser next assigns tags indicating the relationships between the tokens.
The NER assigns labels to identified entities, which may be individuals, organizations, or dates, to
mention a few. However, for this part of the research, we are interested only in the PEOPLE tag.
Additional tags will be included in future versions of the tool.
   Among the difficulties is resolving the pronouns to the correct previously named individual.
To address this issue, we integrated NeuralCoref [12], a spaCy-compatible neural network
model capable of annotating and resolving coreferences. Figure 5 shows how crime example
sentences from Figure 1 are modified after applying NeuralCoref. More precisely, we can
observe the replacement of the pronoun HE with the person’s name ROSS.




Figure 5: The example text after applying coreference resolution using NeuralCoref



4.2.4. Formalisation of the construction
In this section we provide a formal description of the steps undertaken by the proposed extraction
method and construction procedure, for the first of the proposed event extraction methods (the
remaining two are similar).
   We assume that the NLP stage generated, from a given text of $k$ sentences (written in a natural
language), an extracted sequence:

$$\mathit{ExtractedText} = \mathit{ExtractedS}_1 \, \mathit{ExtractedS}_2 \dots \mathit{ExtractedS}_k$$

such that each $\mathit{ExtractedS}_i$ is a pair $(\mathit{Entities}_i, \mathit{event}_i)$, where $\mathit{Entities}_i$ are the entities associated
with the $i$-th sentence, and $\mathit{event}_i$ is the root verb of the $i$-th sentence. Moreover, let

$$\mathit{Entities} = \mathit{Entities}_1 \cup \dots \cup \mathit{Entities}_k = \{ent_1, \dots, ent_n\}$$

be the set of all the entities. Then, for every entity $ent$, let $\mathit{events}(ent)$ be the sequence of events

$$\mathit{events}(ent) = x_1 \dots x_k$$

where $x_i = \mathit{event}_i$ if $ent$ belongs to $\mathit{Entities}_i$, and otherwise $x_i$ is the empty string. Intuitively,
$\mathit{events}(ent)$ is the ordered sequence of events in which entity $ent$ ‘participated’, and such a
sequence can be used to provide a time-line for this entity. Following this observation, for each
entity $ent$ with $\mathit{events}(ent) = ev_1 \dots ev_l$, we construct an occurrence net $ON_{ent} = (P, T, F)$, where:

$$
\begin{aligned}
P &= \{p_{(ent,ev_i)} \mid i = 1, \dots, l\} \cup \{p_{ent}\} \\
T &= \{t_{(ent,ev_1)}, \dots, t_{(ent,ev_l)}\} \\
F &= \{(p_{(ent,ev_i)}, t_{(ent,ev_i)}) \mid i = 1, \dots, l\} \\
  &\quad \cup \{(t_{(ent,ev_i)}, p_{(ent,ev_{i+1})}) \mid i = 1, \dots, l-1\} \cup \{(t_{(ent,ev_l)}, p_{ent})\}.
\end{aligned}
$$

Finally, for each pair $(t, t') = (t_{(ent,ev)}, t_{(ent',ev')})$ of transitions in $T$, where $ent$ is different from
$ent'$, we add channel places $q = q_{(t,t')}$ and $q' = q_{(t',t)}$ together with the arcs

$$(t, q) \quad (q, t') \quad (t', q') \quad (q', t)$$

to enforce synchronisation between $t$ and $t'$. One can then show that the result is a CSO-net which
can be used for analysis and visualisation.
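
A small sketch may help to see the construction at work. It is a hedged illustration, not the paper's implementation: the function name and the tuple encoding of nodes are hypothetical, and it assumes that only transitions stemming from the same sentence (that is, a shared event) are synchronised, which follows the informal description of shared events in Section 4.2.2 but is not spelled out in the formal statement above. Nodes carry the sentence index so that a verb repeated in the same time-line yields distinct nodes.

```python
from itertools import combinations


def build_cso_net(extracted):
    """Build one occurrence net per entity plus synchronising channel places.

    `extracted` is a list of (entities, event) pairs, one per sentence,
    where each entity appears at most once per sentence.
    """
    # events(ent): the sentence-ordered events in which ent takes part.
    events = {}
    for s, (entities, event) in enumerate(extracted):
        for ent in entities:
            events.setdefault(ent, []).append((s, event))

    nets, by_sentence = {}, {}
    for ent, evs in events.items():
        # One place before each event, plus the final place p_ent.
        P = [("p", ent, s, ev) for (s, ev) in evs] + [("p", ent)]
        T = [("t", ent, s, ev) for (s, ev) in evs]
        F = [(P[i], T[i]) for i in range(len(T))]        # p_i -> t_i
        F += [(T[i], P[i + 1]) for i in range(len(T))]   # t_i -> p_(i+1); the last arc ends in p_ent
        nets[ent] = (P, T, F)
        for t in T:
            by_sentence.setdefault(t[2], []).append(t)

    # Two channel places per synchronised pair of transitions.
    Q, W = [], []
    for ts in by_sentence.values():
        for t, u in combinations(ts, 2):
            q, q2 = ("q", t, u), ("q", u, t)
            Q += [q, q2]
            W += [(t, q), (q, u), (u, q2), (q2, t)]
    return nets, Q, W


# Invented extraction result for three sentences.
extracted = [(["Alice", "Bob"], "played"),
             (["Alice"], "lost"),
             (["Alice", "Bob"], "shot")]
nets, Q, W = build_cso_net(extracted)
```

For this input, Alice's net has three transitions and four places, Bob's has two transitions and three places, and the two shared events give four channel places with eight arcs.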


5. Discussion
In order to evaluate our modelling approach, we used human modelling and compared it with the
proposed extractor. We have conducted manual extraction and modelling experiments with expert
SON users (researchers). They all produced similar outcomes in terms of a model explaining the
case, but (not surprisingly) in various forms. These models are not fundamentally dissimilar in
terms of meaning, but rather in terms of the amount of information displayed. Figure 2 illustrates
the human expert modelling for the example sentences in Figure 1.
   All of the SON models shown here convey the same narrative because all of the modellers
modelled the semantics (meaning) of the example sentences. However,
different modellers incorporated varying amounts of information (EVENTs and ONs) in their
models based on their judgments. This, however, may indicate a lack of modelling consistency
due to the volume of data presented in the experiment. Another issue is the amount of time
required for such modelling. To construct a model, it is inevitable to spend time reading the text,
comprehending it in order to extract the crime events, and then modelling.
   To address these issues, we employed three extraction methods and compared them to the
human extraction described in Section 4.1. At first, we considered only the verb that
spaCy’s parser indicates as the main verb of the sentence, tagged with ROOT. A ROOT tag appears
once in every sentence and marks the main word carrying the meaning of the sentence, which is
usually a verb. Then we considered extracting ROOT verbs alongside a list of common verbs used
in criminal reporting. This list was compiled after analysing approximately 570 crime stories
from The Violence Policy Center¹ website for the most frequently occurring verbs. Finally, we
considered extracting ROOT verbs as well as all other verbs in the text.




Figure 6: Extractor automated extraction, manually modelled for visualisation purposes

   1 https://vpc.org/






   Text2SON showed a significantly faster extraction process compared to human extraction.
In our observations, we estimated that the time spent for manual extraction and modelling by
the three expert users was on average 8 minutes, whereas the automatic extractor handled the
extraction process in about 6 seconds. This, however, covers only the extraction phase, as we are
still working on the development of an automatic modeller that will function in conjunction with the
extractor.
   The extractor lost accuracy by associating verbs with unrelated entities. The
assumption is that if the verb is a ROOT verb and appears in a sentence with other entities, we can
reasonably presume that they are linked. However, because each sentence has only one ROOT
verb, which may or may not be the intended crime verb, a human modeller may occasionally choose
another verb for extraction. The automatic extractor cannot link such a verb to a
single distinct entity unless that entity is the only entity contained within the sentence.
   However, when we fed the extractor a list of common crime verbs, it performed notably better
in terms of extracting events. Yet, it maintains a connection between the newly extracted verbs
and all other entities in the sentence. This is not necessarily accurate, as previously discussed,
because a non-ROOT verb can refer to a single entity. This leads to the establishment of an
inaccurate communication link between the two entities.
   Another challenge is the amount of information displayed in visualised models in comparison
to human modelling. Human modellers frequently augment the information offered in events with
additional words. Comparing Figures 2 and 6, we noticed that people tend to add more words
to events, such as the PLAYED DICE, LOST GAME, LEAVE UPSET, and GOES TO SPICER’S events.
This addition provides further description for the model, which aids visualisation. On the other
hand, the automatic extractor extracts only one token for each identified event. We are currently
investigating and enhancing the extractor by including information that a human modeller would
incorporate.
   To facilitate comparison, the following table lists all the verbs extracted or selected for
modelling by the various modellers. We can see that nearly all extraction methods agree on the verbs
PLAYED, LOST, and SHOT, which may express the essential shooting events. Extracting only
the ROOT produced satisfactory results except for the omission of the shooting incident, which
we consider noteworthy. As previously stated, some sentences may contain verbs other than
the ROOT verb. This is one of the shortcomings of the proposed approach, which prompted us
to experiment with two additional methods: extracting common verbs, and extracting all verbs.
In comparison with the other two approaches, we see that extracting common criminal verbs
alongside the root verb (second approach) produced more stable and acceptable results.
     Modeller1      Modeller2      Modeller3    ROOT        ROOT and common verbs    ROOT and all verbs
     played         play           play         played      played                   played
     lost           lost           leaves       lost        lost                     lost
     shoot(shot)    return         wearing      returned    according                according
     -              shoot(kill)    goes         -           returned                 wearing
     -              -              shoots       -           shot                     returned
     -              -              -            -           -                        shot




  This provides an opportunity for further enhancement and development of the tool. We are
now working on improving the model by incorporating word relationships such as subject and
object via the use of the dependency parser. Additionally, we are creating and training the NER
model, as well as introducing new labels for criminal extractions for SON modelling.


6. Conclusion and future work
We discussed the initial steps toward automatic SON data extraction using NLP prediction
techniques. We used spaCy and included several of its components without modification,
specifically the TOKENIZER, POS, PARSER, and NER. Additionally, we used NeuralCoref to resolve mentions.
   We developed our algorithm to extract people’s names and the events associated with them. Then
we illustrated how the algorithm works by feeding TEXT2SON a text passage and extracting events
automatically using three different approaches. We then asked expert SON users to extract and model
the entities and events in order to verify and validate the results obtained from the tool. We compared
human extraction to the final output produced by our automatic modeller.
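
One simple way to quantify such a comparison is set overlap between the events chosen by a human modeller and those returned by the tool. The sketch below is an assumption for illustration only: this section does not state which measure, if any, was used, and the two event lists are hypothetical lemmatised examples rather than data from the evaluation.

```python
def overlap_scores(tool_events, human_events):
    """Return (precision, recall) of the tool's events against a human's events.

    precision: share of tool events that the human also selected
    recall:    share of human events that the tool also found
    """
    tool, human = set(tool_events), set(human_events)
    shared = tool & human
    precision = len(shared) / len(tool) if tool else 0.0
    recall = len(shared) / len(human) if human else 0.0
    return precision, recall


# Hypothetical, lemmatised event lists (not taken from the paper's results).
human = ["play", "lose", "shoot"]
tool = ["play", "lose", "return", "shoot"]

precision, recall = overlap_scores(tool, human)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=1.00
```

Comparing lemmas rather than surface forms avoids counting "play" and "played" as different events, which matters because human modellers do not always copy the verb form used in the text.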
   Ongoing work focuses on improving automatic extraction and developing an automatic
modeller, as well as integrating both with SONs. Ongoing and future work includes the
following:

    • Developing and integrating an automatic modeller and evaluating it on a larger data set.
    • Utilising the dependency parser in spaCy to extract events associated with the extracted
      entities. Our current approach uses the sentence’s main verb, ROOT, to express events
      regardless of their relationship to other entities in the sentence.
    • Investigating the effect of various human extraction behaviours on SON modelling. We
      will look for commonalities in human extraction behaviour in order to gain a broader
      understanding of human extraction for the purpose of SON modelling.
    • Developing a new NER model by training the model on a new set of data and introducing
      new distinct NER labels suited for crime extraction.
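
The dependency-based extraction mentioned in the list above might look roughly like the following sketch, which links each verb to its subject and object. It is not the planned implementation: `Tok` and `link` are hypothetical helpers that imitate the `head` and `dep_` attributes of spaCy tokens, the sentence is annotated by hand, and the label sets are assumptions based on the dependency labels commonly produced by English parsers.

```python
class Tok:
    """Hypothetical stand-in for a spaCy token with a dependency head."""

    def __init__(self, text, pos_, dep_):
        self.text = text
        self.pos_ = pos_
        self.dep_ = dep_
        self.head = self  # as in spaCy, the ROOT token is its own head


def link(tokens, head_indices):
    """Attach each token to its head, given the head's index in the sentence."""
    for tok, idx in zip(tokens, head_indices):
        tok.head = tokens[idx]
    return tokens


SUBJECT_LABELS = {"nsubj", "nsubjpass"}  # assumed subject labels
OBJECT_LABELS = {"dobj", "obj"}          # assumed object labels


def subject_verb_object(tokens):
    """Return (subject, verb, object) triples; a missing role is None."""
    triples = []
    for verb in tokens:
        if verb.pos_ != "VERB":
            continue
        # Direct dependents of this verb.
        deps = [t for t in tokens if t.head is verb and t is not verb]
        subj = next((t.text for t in deps if t.dep_ in SUBJECT_LABELS), None)
        obj = next((t.text for t in deps if t.dep_ in OBJECT_LABELS), None)
        triples.append((subj, verb.text, obj))
    return triples


# Hand-annotated sentence: "John shot the guard"
SENTENCE = link(
    [
        Tok("John", "PROPN", "nsubj"),
        Tok("shot", "VERB", "ROOT"),
        Tok("the", "DET", "det"),
        Tok("guard", "NOUN", "dobj"),
    ],
    [1, 1, 3, 1],
)

print(subject_verb_object(SENTENCE))
```

Unlike the ROOT-only approach, a triple of this kind ties the event to the entities taking part in it, which is the information needed to place the event in the right person's part of a SON model.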


References
 [1] M. Koutny, B. Randell, Structured occurrence nets: A formalism for aiding system failure
     prevention and analysis techniques, Fundam. Informaticae 97 (2009) 41–91.
 [2] B. Randell, Occurrence nets then and now: The path to structured occurrence nets, in: L. M.
     Kristensen, L. Petrucci (Eds.), Applications and Theory of Petri Nets - 32nd International
     Conference, PETRI NETS 2011, Newcastle, UK, June 20-24, 2011. Proceedings, volume
     6709 of Lecture Notes in Computer Science, Springer, 2011, pp. 1–16.
 [3] T. Alharbi, M. Koutny, Domain name system (DNS) tunneling detection using structured
     occurrence nets (SONs), in: D. Moldt, E. Kindler, M. Wimmer (Eds.), Proceedings of the
     International Workshop on Petri Nets and Software Engineering (PNSE 2019), volume 2424
     of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 93–108.
 [4] B. Li, M. Koutny, Unfolding CSPT-nets, in: D. Moldt, H. Rölke, H. Störrle (Eds.),
     Proceedings of the International Workshop on Petri Nets and Software Engineering (PNSE’15),
     volume 1372 of CEUR Workshop Proceedings, CEUR-WS.org, 2015, pp. 207–226.
 [5] P. M. Nadkarni, L. Ohno-Machado, W. W. Chapman, Natural language processing: an
     introduction, J. Am. Medical Informatics Assoc. 18 (2011) 544–551.
 [6] spaCy, https://spacy.io, 2022.
 [7] B. Li, Visualisation and Analysis of Complex Behaviours using Structured Occurrence Nets,
     Ph.D. thesis, School of Computing, Newcastle University, 2017.
 [8] T. Freytag, P. Allgaier, WoPeD goes NLP: conversion between workflow nets and natural
     language, in: W. M. P. van der Aalst et al. (Eds.), Proceedings of the Dissertation Award,
     Demonstration, and Industrial Track at BPM 2018, volume 2196 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2018, pp. 101–105.
 [9] P. Siriaraya, Y. Zhang, Y. Wang, Y. Kawai, M. Mittal, P. Jeszenszky, A. Jatowt, Witnessing
     crime through tweets: A crime investigation tool based on social media, in: F. B. Kashani,
     G. Trajcevski, R. H. Güting, L. Kulik, S. D. Newsam (Eds.), Proceedings of the 27th ACM
     International Conference on Advances in Geographic Information Systems, SIGSPATIAL
     2019, Chicago, IL, USA, November 5-8, 2019, ACM, 2019, pp. 568–571.
[10] S. Chakravorty, S. Daripa, U. Saha, S. Bose, S. Goswami, S. Mitra, Data mining techniques
     for analyzing murder related structured and unstructured data, American Journal of
     Advanced Computing 2 (2015) 47–54.
[11] G. Carnaz, V. B. Nogueira, M. Antunes, Knowledge representation of crime-related events: a
     preliminary approach, in: R. Rodrigues, J. Janousek, L. Ferreira, L. Coheur, F. Batista, H. G.
     Oliveira (Eds.), 8th Symposium on Languages, Applications and Technologies, SLATE
     2019, June 27-28, 2019, Coimbra, Portugal, volume 74 of OASIcs, Schloss Dagstuhl -
     Leibniz-Zentrum für Informatik, 2019, pp. 13:1–13:8.
[12] NeuralCoref, NeuralCoref 4.0: Fast coreference resolution in spaCy with neural networks,
     https://github.com/huggingface/neuralcoref, 2022.



