=Paper=
{{Paper
|id=Vol-3194/paper22
|storemode=property
|title=A Network-based Model and a Related Approach to Represent and Handle the Semantics of Comments in a Social Network
|pdfUrl=https://ceur-ws.org/Vol-3194/paper22.pdf
|volume=Vol-3194
|authors=Gianluca Bonifazi,Francesco Cauteruccio,Enrico Corradini,Michele Marchetti,Giorgio Terracina,Domenico Ursino,Luca Virgili
|dblpUrl=https://dblp.org/rec/conf/sebd/BonifaziCCMTUV22
}}
==A Network-based Model and a Related Approach to Represent and Handle the Semantics of Comments in a Social Network==
<pdf width="1500px">https://ceur-ws.org/Vol-3194/paper22.pdf</pdf>
<pre>
A Network-based Model and a Related Approach to
Represent and Handle the Semantics of Comments in
a Social Network
(Discussion Paper)

Gianluca Bonifazi (a), Francesco Cauteruccio (a), Enrico Corradini (a), Michele Marchetti (a),
Giorgio Terracina (b), Domenico Ursino (a) and Luca Virgili (a)

(a) DII, Polytechnic University of Marche
(b) DEMACS, University of Calabria

Abstract
In this paper, we propose a network-based model and a related approach to represent and handle the
semantics of a set of comments expressed by users of a social network. Our model and approach are
multi-dimensional and holistic because they manage the semantics of comments from multiple perspectives.
Our approach first selects the text patterns that best characterize the involved comments. Then, it uses
these patterns and the proposed model to represent each set of comments by means of a suitable network.
Finally, it adopts a suitable technique to measure the semantic similarity of each pair of comment sets.

Keywords
Comment analysis, Social Network Analysis, Text Pattern Mining, Semantic Similarity, Utility Functions

1. Introduction
In the last few years, the investigation of the content of comments expressed by people in
social media has increased enormously [1]. In fact, social media comments are one of the places
where people tend to express their ideas most spontaneously [2]. Consequently, they play a
privileged role in allowing the reconstruction of the real feelings and thoughts of a person, as
well as in building a more faithful profile of her [3, 4, 5]¹. Spontaneity is both the main strength
and one of the main weaknesses of comments. In fact, they are often written on the spur of
the moment, with a language style that is not very structured, apparently confused and in
some cases contradictory. In spite of these flaws, the set of comments written by a certain user

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
g.bonifazi@univpm.it (G. Bonifazi); f.cauteruccio@univpm.it (F. Cauteruccio); e.corradini@pm.univpm.it
(E. Corradini); m.marchetti@pm.univpm.it (M. Marchetti); terracina@mat.unical.it (G. Terracina);
d.ursino@univpm.it (D. Ursino); luca.virgili@univpm.it (L. Virgili)
ORCID: 0000-0002-1947-8667 (G. Bonifazi); 0000-0001-8400-1083 (F. Cauteruccio); 0000-0002-1140-4209 (E. Corradini);
0000-0003-3692-3600 (M. Marchetti); 0000-0002-3090-7223 (G. Terracina); 0000-0003-1360-8499 (D. Ursino);
0000-0003-1509-783X (L. Virgili)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

¹ In this paper, we focus on comments expressed by people through their well-defined accounts. We do not
consider anonymous comments because they are less reliable and, in any case, not useful for the objectives of our
research.
provides an overview of her thoughts and profile. Reconstructing the latter from the apparent
“chaos” inherent in the comments is a challenging issue for researchers working in the context
of the extraction of content semantics.
In this paper, we want to make a contribution in this context by proposing a model, and a
related approach, to detect and handle the content semantics of a set of comments posted on
a social network. We argue that our model and its related approach are able to extract from the
apparent “chaos” of comments the thoughts of their publisher and, possibly, to reconstruct
the corresponding profile. However, the latter is only one of the possible uses of our model
and approach. In fact, if we widen our gaze to the comments written by all the users of a
certain community, we are able to understand the dominant thoughts in it. If we consider all
the comments on a certain topic (e.g., COVID-19), we can reconstruct the various viewpoints
on such a topic. Again, if we consider all the comments in a certain time period (e.g., the first three
months of the year 2022), we can determine what the dominant thoughts in that period are.
Furthermore, the reconstruction of thoughts is only one of the possible applications of our
model and approach. Others may include, for example, constructing recommender systems,
building new user communities, identifying outliers or constructing new topic forums. Some of
the most interesting applications are described in [6].
This paper is organized as follows: In Section 2, we present an overview of our proposal. In
Section 3, we illustrate our model. In Section 4, we describe our approach. Finally, in Section 5,
we draw our conclusions and look at some possible future developments. Due to space
limitations, we cannot describe here the experiments carried out to test our model and approach.
However, the interested reader can find them in [6].

2. An overview of our proposal
Our approach consists of two phases, namely pre-processing and knowledge extraction.
The pre-processing phase is aimed at cleaning and annotating the available comments and,
then, selecting the most meaningful ones. During the cleaning activity, bot-generated content,
errors, inconsistencies, etc., are removed, and comment tokenization and lemmatization tasks
are performed. The annotation activity aims to automatically enrich each lemmatized comment
with some important information, such as the value of the associated sentiment, the post to
which it refers, the author who wrote it, etc.
The selection of the most significant comments is based on a text pattern mining technique.
While most of the approaches proposed in the past to perform this task consider only the
frequency of patterns [7], our technique considers also, and primarily, their utility [8, 9], measured
with the support of a utility function. Regarding this function, we point out that our technique is
orthogonal to the utility function used. As a consequence, it is possible to choose different utility
functions to prioritize certain comment properties over others. A first utility function could
be the sentiment of comments; it could allow, for instance, the identification of only positive
comments or only negative ones. A second utility function might be the rate of comments; it
might allow, for instance, the selection of patterns involving only high-rate comments or only
low-rate ones. A third utility function could be Pearson’s correlation [10] between sentiment
and rate; it could allow, for instance, the selection of patterns involving only comments with
discordant (resp., concordant) sentiment and rate. More details on utility functions can be found
in Section 3.
Once the comments and patterns of interest have been selected, it is necessary to have a
model for their representation and management. As mentioned in the Introduction, in this
paper we propose a new network-based model called CS-Net (Content Semantics Network).
The nodes of a CS-Net represent the comment lemmas. Its arcs can be of two types, which
reflect two different perspectives on the investigation of comment semantics. The first one,
based on the concept of co-occurrence, reflects the past results obtained by many Information
Retrieval researchers [11]. It assumes that two semantically related lemmas tend to appear together
very often in sentences. The second one, based on the concepts of semantic relationships and
semantically related terms, reflects the past results obtained by many researchers working in
Natural Language Processing [12]. Actually, the CS-Net model is extensible: if in the
future we wanted to add further perspectives for investigating comment content, it would
be sufficient to add a new type of arc for each new perspective. The CS-Net model is described
in detail in Section 3.
After selecting the comments and patterns of interest, and after representing them by means
of CS-Nets, a technique to evaluate the semantic similarity of two CS-Nets is necessary. This
technique operates by separately evaluating, and then appropriately combining, the semantic
similarity of each pair of subnets obtained by projecting the original CS-Nets in such a way
as to consider only one type of arc at a time. The combination of the single components is
done by weighting them differently, based on the extension of the CS-Net projections from
which they are derived. This extension is determined by the number of the corresponding
arcs. In particular, our technique favors the most extensive component, because it represents a
larger portion of the content semantics than the other. Analogously to the CS-Net model, our
technique for computing the similarity of two CS-Nets is extensible if one wants to add new
perspectives of semantic similarity evaluation. In fact, to obtain an overall semantic similarity
value, it is sufficient to compute the components related to each perspective separately, and
then combine them according to the procedure mentioned above. In evaluating the semantics
of two homogeneous subnets (i.e., subnets of only co-occurrences or subnets of only semantic
relationships), our technique considers two further aspects, namely the topological similarity
of the subnets and the similarity of the concepts expressed by the corresponding nodes. To
compute the former, we adopt an approach already proposed in the literature, i.e., NetSimile
[13]. To compute the latter, we use an enhanced version of the Jaccard coefficient, capable of
considering synonymies and homonymies as well. Adding these two further contributions to
co-occurrences and semantic relationships makes our approach even more holistic. A detailed
description of our technique for evaluating the semantic similarity of two CS-Nets can be found
in Section 4.


3. Proposed model
Let 𝒞 = {𝑐1, 𝑐2, …, 𝑐𝑛} be a set of lemmatized comments and let ℒ = {𝑙1, 𝑙2, …, 𝑙𝑞} be the set
of all lemmas that can be found in a comment of 𝒞. Each comment 𝑐𝑘 ∈ 𝒞 can be represented
as a set of lemmas 𝑐𝑘 = {𝑙1, 𝑙2, …, 𝑙𝑚}; therefore, 𝑐𝑘 ⊆ ℒ. A text pattern 𝑝ℎ is a set of lemmas;
therefore, 𝑝ℎ ⊆ ℒ.
We are interested in patterns with frequency values and utility functions belonging to
appropriate intervals. In particular, as far as frequency is concerned, we are interested in
patterns whose frequency value is greater than a certain threshold. Instead, as far as
the utility function is concerned, the scenario is more complex, because it depends on the utility function
adopted and the context in which our model is used. For example:

• We could employ as utility function the average sentiment value of the comments to
which the pattern of interest refers. We call this utility function 𝑓𝑠(·). It allows us to
select patterns characterized by a compound score (and, therefore, a sentiment value)
that is very high (e.g., positive patterns), very low (e.g., negative patterns) or belonging to a
given range (e.g., neutral patterns).
• We could adopt as utility function Pearson’s correlation [10] between the sentiment
and the score of the comments in which the pattern of interest is present. We call
this utility function 𝑓𝑝(·). It allows us to select: (i) patterns having a high sentiment value and
stimulating positive comments; (ii) patterns having a low sentiment value and stimulating
negative comments; (iii) patterns having a high sentiment value and stimulating negative
comments; (iv) patterns having a low sentiment value and stimulating positive comments.
Clearly, in the vast majority of investigations, the patterns of interest are those related
to cases (i) and (ii). However, there may be rare cases where the patterns of interest are
those related to cases (iii) and (iv).
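A hedged sketch of how these two utility functions, together with a frequency threshold, could drive pattern selection. The data layout, thresholds and helper names are illustrative assumptions; the paper does not prescribe an implementation:

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson's correlation coefficient, computed from scratch.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def f_s(comments):
    """Utility function f_s: average sentiment of the matching comments."""
    return mean(c["sentiment"] for c in comments)

def f_p(comments):
    """Utility function f_p: Pearson's correlation between sentiment and score."""
    return pearson([c["sentiment"] for c in comments],
                   [c["score"] for c in comments])

def select_patterns(patterns, comments, min_freq, utility, lo, hi):
    """Keep the patterns occurring in more than min_freq comments and whose
    utility value falls in the interval [lo, hi]."""
    kept = []
    for p in patterns:
        matching = [c for c in comments if p <= c["lemmas"]]  # subset test
        if len(matching) > min_freq and lo <= utility(matching) <= hi:
            kept.append(p)
    return kept

comments = [
    {"lemmas": {"great", "movie"}, "sentiment": 0.9, "score": 5},
    {"lemmas": {"great", "plot"}, "sentiment": 0.7, "score": 4},
    {"lemmas": {"bad", "movie"}, "sentiment": -0.8, "score": 1},
]
kept = select_patterns([{"great"}, {"bad"}], comments,
                       min_freq=1, utility=f_s, lo=0.5, hi=1.0)
# kept == [{"great"}]: it occurs in two comments, with average sentiment 0.8
```

Swapping `f_s` for `f_p` (or any other utility function) leaves `select_patterns` unchanged, which mirrors the orthogonality claim made in Section 2.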

In the following, we denote by 𝒫 the set of the patterns of interest, i.e., those whose values of frequency
and utility function belong to the intervals of interest for the application being considered.
We are now able to formalize our model. In particular, a Content Semantics Network (hereafter,
CS-Net) 𝒩 is defined as 𝒩 = ⟨𝑁, 𝐴𝑐 ∪ 𝐴𝑟⟩.
𝑁 is the set of nodes of 𝒩. There is a node 𝑛𝑖 ∈ 𝑁 for each lemma 𝑙𝑖 ∈ ℒ. Since there exists
a one-to-one correspondence between 𝑛𝑖 and 𝑙𝑖, in the following we will use these two symbols
interchangeably.
𝐴𝑐 is the set of co-occurrence arcs. There is an arc (𝑛𝑖, 𝑛𝑗, 𝑤𝑖𝑗) ∈ 𝐴𝑐 if the lemmas 𝑙𝑖 and 𝑙𝑗
appear together at least once in a pattern of 𝒫. 𝑤𝑖𝑗 is a real number belonging to the interval
[0, 1] and denoting the strength of the co-occurrence; the higher 𝑤𝑖𝑗, the higher this strength.
For example, 𝑤𝑖𝑗 could be obtained as a function of the number of patterns in which 𝑙𝑖 and 𝑙𝑗
co-occur.
𝐴𝑟 is the set of semantic relationship arcs. There is an arc (𝑛𝑖, 𝑛𝑗, 𝑤𝑖𝑗) ∈ 𝐴𝑟 if there is a
semantic relationship between 𝑙𝑖 and 𝑙𝑗. 𝑤𝑖𝑗 is a real number in the interval [0, 1] denoting the
strength of the relationship; the higher 𝑤𝑖𝑗, the higher this strength. 𝑤𝑖𝑗 could be computed
using ConceptNet [14], considering both the number of times 𝑙𝑗 is present in the set of
“related terms” of 𝑙𝑖 and the values of the corresponding weights.
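To make the definition concrete, here is a small sketch that builds a CS-Net from a set of selected patterns, using plain dictionaries for the two arc sets. The weight function and the `related_terms` table are illustrative stand-ins for the weighting choices and the ConceptNet lookups discussed above:

```python
from itertools import combinations

def build_cs_net(patterns, cooc_weight, related_terms):
    """Build a CS-Net as (nodes, A_c, A_r).

    patterns      : iterable of lemma sets (the selected patterns P)
    cooc_weight   : maps a co-occurrence count to a weight in [0, 1]
    related_terms : dict lemma -> {related lemma: strength}, standing in
                    for a thesaurus such as ConceptNet
    """
    nodes = set().union(*patterns)

    # A_c: one arc per pair of lemmas co-occurring in at least one pattern.
    counts = {}
    for p in patterns:
        for li, lj in combinations(sorted(p), 2):
            counts[(li, lj)] = counts.get((li, lj), 0) + 1
    a_c = {pair: cooc_weight(n) for pair, n in counts.items()}

    # A_r: one arc per semantically related pair found in the thesaurus.
    a_r = {}
    for li in nodes:
        for lj, w in related_terms.get(li, {}).items():
            if lj in nodes and li < lj:
                a_r[(li, lj)] = w
    return nodes, a_c, a_r

patterns = [{"movie", "plot"}, {"movie", "actor"}]
nodes, a_c, a_r = build_cs_net(
    patterns,
    cooc_weight=lambda n: min(1.0, n / 2),        # illustrative choice
    related_terms={"movie": {"film": 1.0, "plot": 0.6}},
)
```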
An observation on the structure of the CS-Net model is necessary. As specified above, our
goal is to model and manage the semantics of the content of a set of comments. CS-Net is
a model tailored exactly to that goal. For this reason, it considers two perspectives derived
from the past literature. The former is related to the concept of co-occurrence. It indicates
that two semantically related lemmas tend to appear very often together in sentences. This
perspective is probably the most immediate in the context of text mining. In fact, here, it is
well known that the frequency with which two or more lemmas appear together in a text
represents an index of their correlation. The potential weakness of this perspective lies in
the need to compute the frequency of each pair of lemmas. Moreover, this computation must
be continually updated whenever a new comment is taken into consideration. The latter is
related to the concepts of semantic relationships and semantically related terms. These refer
to several studies conducted in the past in the contexts of Information Retrieval [11] and
Natural Language Processing [12]. In this perspective, the meanings of the terms, and thus
their semantics, are taken into consideration. Indeed, semantic relationships between terms
(e.g., synonymies and homonymies) are a very common feature in natural languages. The main
weakness of this perspective lies in the need for a thesaurus, which
stores the semantic relationships between terms. If such a tool exists, the computation of the
strength of a semantic relationship is straightforward. Clearly, additional perspectives could
be considered in the future. This is facilitated by the extensibility of our model. Indeed, if one
wanted to consider a new perspective, it would be sufficient to add to 𝐴𝑐 and 𝐴𝑟 a third set of
arcs representing the new perspective.


4. Evaluation of the semantic similarity of two CS-Nets
In this section, we illustrate our approach for computing the semantic similarity of the content of
two sets of comments represented by means of two CS-Nets 𝒩1 and 𝒩2. The approach receives
𝒩1 and 𝒩2 and returns a coefficient 𝜎12, whose value belongs to the real interval
[0, 1]. It measures the strength of the semantic similarity of the content represented by 𝒩1 and
𝒩2; the higher its value, the higher the semantic similarity. Our technique behaves as follows:

• It constructs two pairs of subnets (𝒩1^c, 𝒩2^c) and (𝒩1^r, 𝒩2^r). The former (resp., latter) is
obtained by selecting only the co-occurrence (resp., semantic relationship) arcs from the
networks 𝒩1 and 𝒩2. Specifically: 𝒩1^c = ⟨𝑁1, 𝐴1^c⟩, 𝒩2^c = ⟨𝑁2, 𝐴2^c⟩, 𝒩1^r = ⟨𝑁1, 𝐴1^r⟩,
and 𝒩2^r = ⟨𝑁2, 𝐴2^r⟩. If, in the future, we want to add a new perspective, and therefore
a new set of arcs beside 𝐴𝑐 and 𝐴𝑟, it will be sufficient to build another pair of subnets
corresponding to the new perspective.
• It computes the semantic similarity degrees 𝜎12^c and 𝜎12^r for the pairs of networks (𝒩1^c, 𝒩2^c)
and (𝒩1^r, 𝒩2^r), respectively. The approach for computing 𝜎12^x, 𝑥 ∈ {𝑐, 𝑟}, should be as
holistic as possible. To this end, it is necessary to define a formula capable of considering
as many factors as possible, among those that are believed to influence the semantic
similarity degree of two networks 𝒩1^x and 𝒩2^x, 𝑥 ∈ {𝑐, 𝑟}. In particular, it is possible to
consider at least two factors with these characteristics.
The first factor concerns the topological similarity of the networks, i.e., the similarity of
their structural characteristics. The structure of a network is ultimately determined by its
nodes and arcs. In our networks, nodes are associated with lemmas, while arcs represent
features (e.g., co-occurrences or semantic relationships) contributing significantly to
defining the semantics of the lemmas they connect. This reasoning is further reinforced
by the fact that the semantics of a lemma is partly determined by the lemmas to which
it is related in the network (in this observation, the principle of homophily, which
characterizes social networks, is applied to the CS-Net). The second
factor is much more immediate. In fact, it concerns the semantic meaning of the concepts
expressed by the nodes of the CS-Net, each representing a lemma of the set of comments
associated with it.
Regarding the first factor, many approaches for computing the similarity degree of the
structures of two networks have been proposed in the past literature. We decided to
adopt one of these approaches, i.e., NetSimile [13]. This choice is motivated by the fact
that the latter has a much shorter computation time than the other related approaches.
At the same time, it guarantees an accuracy level adequate for our reference context.
NetSimile extracts and evaluates the structural characteristics of each node by analyzing
the structural characteristics of its ego network. Then, to return the similarity
score of two networks, it computes the similarity degree of the corresponding vectors of
features.
Regarding the second factor, we decided to consider the portion of nodes with the same or
similar meaning present in the two subnets of the pair. A simple, but very effective, way to
do this is the computation of the Jaccard coefficient between the sets of lemmas associated
with the nodes of the two CS-Nets. Actually, the Jaccard coefficient only considers
equality between two lemmas, while we can also have lexicographic relationships (e.g.,
synonymies and homonymies) between them [15]. These can modify the semantic
relationships between two lemmas and, therefore, must be taken into consideration. To
do so, our technique uses an advanced thesaurus, i.e., ConceptNet [14], which includes
WordNet within it. Based on this thesaurus, we redefine the Jaccard coefficient and
introduce an enhanced version of it, which we call 𝐽*. It behaves like the classic Jaccard
coefficient but takes lexicographic relationships into account.
Given these premises, we can define the formula for the computation of 𝜎12^x:

𝜎12^x = 𝛽^x · 𝜈(𝒩1^x, 𝒩2^x) + (1 − 𝛽^x) · 𝐽*(𝑁1^x, 𝑁2^x)

Here:
– 𝜈(𝒩1^x, 𝒩2^x) is a function that applies NetSimile for computing the topological
similarity of 𝒩1^x and 𝒩2^x.
– 𝐽*(𝑁1^x, 𝑁2^x) is the enhanced Jaccard coefficient between the node sets of 𝒩1^x and 𝒩2^x.
– 𝛽^x represents the weight given to the topological similarity of the CS-Nets with respect
to the lexical similarity of the lemmas associated with their nodes. A discussion
on the possible formulas for 𝛽^x, based on the objectives one wants to pursue in a
specific application, can be found in [6].
Note that our approach for computing 𝜎12^x can operate on any projections 𝒩1^x and 𝒩2^x of
the networks 𝒩1 and 𝒩2. In fact, the only constraint related to it is that the arcs of the
two networks involved are of the same type 𝑥. This allows it to be extensible. Indeed,
if we wish to add a new perspective on modeling content semantics in the future, the
similarity degree of the corresponding projections of 𝒩1 and 𝒩2 can be computed using
the same formula of 𝜎12^x described above.
• It computes the overall semantic similarity degree 𝜎12 of 𝒩1 and 𝒩2 as a weighted mean
of the two semantic similarity degrees 𝜎12^c and 𝜎12^r:

𝜎12 = (𝜔12^c · 𝜎12^c + 𝜔12^r · 𝜎12^r) / (𝜔12^c + 𝜔12^r) = 𝛼 · 𝜎12^c + (1 − 𝛼) · 𝜎12^r

In this formula, 𝛼 = 𝜔12^c / (𝜔12^c + 𝜔12^r) weights the semantic similarity obtained through the
analysis of co-occurrences against the one derived from the analysis of the semantic
relationships between lemmas. The rationale behind it is that the greater the amount of
information carried by one perspective, relative to the other, the greater its weight
in defining the overall semantics. Now, since |𝑁1^c| = |𝑁1^r| and |𝑁2^c| = |𝑁2^r|, the amount
of information carried by the two perspectives can be measured by considering the
cardinality of the corresponding sets of arcs. On the basis of this reasoning, we have
that: 𝜔12^c = (𝜔1^c + 𝜔2^c) / 2, 𝜔12^r = (𝜔1^r + 𝜔2^r) / 2, 𝜔1^c = |𝐴1^c| / (|𝐴1^c| + |𝐴1^r|),
𝜔2^c = |𝐴2^c| / (|𝐴2^c| + |𝐴2^r|), 𝜔1^r = 1 − 𝜔1^c, and 𝜔2^r = 1 − 𝜔2^c.
These formulas essentially tell us that the importance of a perspective in
determining the overall content semantics is directly proportional to the number of pairs
of lemmas it can involve.
Finally, note that 𝜎12 ranges in the real interval [0, 1]. The higher 𝜎12, the greater the
similarity of 𝒩1 and 𝒩2.
Like the other components of our approach, the one for computing 𝜎12 is extensible.
In fact, in the future, if we wanted to add further perspectives for modeling content
semantics, we would simply add to 𝜎12^c and 𝜎12^r an additional similarity coefficient for
each added perspective and modify the weights in the formula of 𝜎12 accordingly.
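The steps above can be sketched in Python as follows. The topological term 𝜈 is left as a pluggable stand-in for NetSimile, synonym handling in 𝐽* is simplified to canonical synonym classes, and 𝛽 is a free parameter, so this is an illustrative reading of the formulas rather than the authors' implementation:

```python
def canonical(lemmas, syn_class):
    # Map each lemma to a canonical representative of its synonym class;
    # syn_class stands in for thesaurus (e.g., ConceptNet) lookups.
    return {syn_class.get(l, l) for l in lemmas}

def j_star(nodes1, nodes2, syn_class):
    """Enhanced Jaccard J*: classic Jaccard after collapsing synonyms."""
    a, b = canonical(nodes1, syn_class), canonical(nodes2, syn_class)
    return len(a & b) / len(a | b) if a | b else 0.0

def sigma_x(net1, net2, beta, nu, syn_class):
    """sigma_12^x = beta^x * nu(N1^x, N2^x) + (1 - beta^x) * J*(N1^x, N2^x).
    A net is a pair (node_set, arc_list); nu stands in for NetSimile."""
    return beta * nu(net1, net2) + (1 - beta) * j_star(net1[0], net2[0], syn_class)

def sigma_overall(n1_c, n1_r, n2_c, n2_r, beta, nu, syn_class):
    """Weighted mean of sigma^c and sigma^r, with weights derived from
    the cardinalities of the arc sets, as in the formulas above."""
    s_c = sigma_x(n1_c, n2_c, beta, nu, syn_class)
    s_r = sigma_x(n1_r, n2_r, beta, nu, syn_class)
    w1_c = len(n1_c[1]) / (len(n1_c[1]) + len(n1_r[1]))
    w2_c = len(n2_c[1]) / (len(n2_c[1]) + len(n2_r[1]))
    w12_c = (w1_c + w2_c) / 2
    w12_r = ((1 - w1_c) + (1 - w2_c)) / 2
    alpha = w12_c / (w12_c + w12_r)
    return alpha * s_c + (1 - alpha) * s_r

# Tiny worked example (all values illustrative):
n1_c = ({"a", "b"}, [("a", "b")]); n1_r = ({"a", "b"}, [("a", "b")])
n2_c = ({"a", "c"}, [("a", "c")]); n2_r = ({"a", "c"}, [("a", "c")])
sim = sigma_overall(n1_c, n1_r, n2_c, n2_r, beta=0.5,
                    nu=lambda x, y: 1.0,           # placeholder for NetSimile
                    syn_class={"c": "b"})          # "c" is a synonym of "b"
```

Because each per-perspective weight pair sums to one, `alpha` reduces to `w12_c`, matching the definition of 𝛼 given above; adding a third perspective would just mean adding a third `sigma_x` term and renormalizing the weights.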


5. Conclusion
In this paper, we have proposed a model and a related approach to represent and handle content
semantics in a social platform. Our model is network-based and is capable of representing
content semantics from different perspectives. It is also extensible, in that new perspectives can
be easily added when desired. Our approach first performs the detection of the text patterns of interest,
based not only on their frequency but also on their utility. Then, it uses these patterns and the
proposed model to represent each set of comments by means of a CS-Net. Finally, it adopts
a suitable technique to measure the semantic similarity of each pair of comment sets. The
latter information can be useful in a variety of applications, ranging from the construction of
recommender systems to the building of new topic forums [6].
In the future, we plan to extend this research in various directions. First, we could use our
approach as the core of a system for the automatic identification of offensive content of a certain
type (cyberbullying, racism, etc.) in a set of comments. In addition, we could study the evolution
of CS-Nets over time. This could allow us to identify new trends and topics that characterize a
social platform. Finally, we plan to use our approach in a sentiment analysis context. Indeed, in
the past literature, there are several studies on how people with anxiety and/or psychological
disorders write their comments on social media. We could contribute to this research effort
by considering sets of comments written by users with these characteristics, constructing the
corresponding CS-Nets and analyzing them in detail. We could also compare these CS-Nets
with “template CS-Nets”, typical of a certain emotional state, to support classification activities.
References
[1] X. Chen, Y. Yuan, M. Orgun, Using Bayesian networks with hidden variables for identifying
trustworthy users in social networks, Journal of Information Science 46 (2020) 600–615.
SAGE Publications Sage UK: London, England.
[2] P. Boczkowski, M. Matassi, E. Mitchelstein, How young users deal with multiple platforms:
The role of meaning-making in social media repertoires, Journal of Computer-Mediated
Communication 23 (2018) 245–259. Oxford University Press.
[3] F. Cauteruccio, E. Corradini, G. Terracina, D. Ursino, L. Virgili, Investigating Reddit to
detect subreddit and author stereotypes and to evaluate author assortativity, Journal
of Information Science (2021). doi:https://doi.org/10.1177/01655515211047428. SAGE.
[4] B. Abu-Salih, P. Wongthongtham, K. Chan, K. Yan, D. Zhu, CredSaT: Credibility ranking
of users in big social data incorporating semantic analysis and temporal factor, Journal of
Information Science 45 (2019) 259–280. SAGE Publications Sage UK: London, England.
[5] S. Ahmadian, M. Afsharchi, M. Meghdadi, An effective social recommendation method
based on user reputation model and rating profile enhancement, Journal of Information
Science 45 (2019) 607–642. SAGE Publications Sage UK: London, England.
[6] G. Bonifazi, F. Cauteruccio, E. Corradini, M. Marchetti, G. Terracina, D. Ursino, L. Virgili,
Representation, detection and usage of the content semantics of comments in a social
platform, Journal of Information Science (Forthcoming). SAGE.
[7] P. Fournier-Viger, J. C.-W. Lin, R. U. Kiran, Y. Koh, R. Thomas, A survey of sequential
pattern mining, Data Science and Pattern Recognition 1 (2017) 54–77.
[8] P. Fournier-Viger, J. Lin, B. Vo, T. Chi, J. Zhang, H. Le, A survey of itemset mining,
WIREs Data Mining and Knowledge Discovery 7 (2017) e1207. doi:https://doi.org/10.1002/widm.1207. Wiley.
[9] L. Gadár, J. Abonyi, Frequent pattern mining in multidimensional organizational networks,
Scientific Reports 9 (2019) 1–12. Nature Publishing Group.
[10] K. Pearson, Note on Regression and Inheritance in the Case of Two Parents, Proceedings
of the Royal Society of London 58 (1895) 240–242. The Royal Society.
[11] Y. Djenouri, A. Belhadi, P. Fournier-Viger, J. Lin, Fast and effective cluster-based
information retrieval using frequent closed itemsets, Information Sciences 453 (2018) 154–167.
Elsevier.
[12] Z. Bouraoui, J. Camacho-Collados, S. Schockaert, Inducing relational knowledge from
BERT, in: Proc. of the AAAI Conference on Artificial Intelligence (AAAI 2020),
volume 34(05), New York, NY, USA, 2020, pp. 7456–7463. Association for the Advancement
of Artificial Intelligence.
[13] M. Berlingerio, D. Koutra, T. Eliassi-Rad, C. Faloutsos, NetSimile: A scalable approach to
size-independent network similarity, arXiv preprint arXiv:1209.2684 (2012).
[14] H. Liu, P. Singh, ConceptNet — a practical commonsense reasoning tool-kit, BT Technology
Journal 22 (2004) 211–226. Springer.
[15] P. De Meo, G. Quattrone, G. Terracina, D. Ursino, Integration of XML Schemas at various
“severity” levels, Information Systems 31(6) (2006) 397–434.
</pre>
Revision as of 17:55, 30 March 2023

Paper
id | Vol-3194/paper22
wikidataid | Q117344911
title | A Network-based Model and a Related Approach to Represent and Handle the Semantics of Comments in a Social Network
pdfUrl | https://ceur-ws.org/Vol-3194/paper22.pdf
dblpUrl | https://dblp.org/rec/conf/sebd/BonifaziCCMTUV22
volume | Vol-3194
A Network-based Model and a Related Approach to Represent and Handle the Semantics of Comments in a Social Network
A Network-based Model and a Related Approach to Represent and Handle the Semantics of Comments in a Social Network
(Discussion Paper)

Gianluca Bonifazi (a), Francesco Cauteruccio (a), Enrico Corradini (a), Michele Marchetti (a), Giorgio Terracina (b), Domenico Ursino (a) and Luca Virgili (a)
(a) DII, Polytechnic University of Marche
(b) DEMACS, University of Calabria

Abstract
In this paper, we propose a network-based model and a related approach to represent and handle the semantics of a set of comments expressed by users of a social network. Our model and approach are multi-dimensional and holistic because they manage the semantics of comments from multiple perspectives. Our approach first selects the text patterns that best characterize the involved comments. Then, it uses these patterns and the proposed model to represent each set of comments by means of a suitable network. Finally, it adopts a suitable technique to measure the semantic similarity of each pair of comment sets.

Keywords
Comment analysis, Social Network Analysis, Text Pattern Mining, Semantic Similarity, Utility Functions

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
g.bonifazi@univpm.it (G. Bonifazi); f.cauteruccio@univpm.it (F. Cauteruccio); e.corradini@pm.univpm.it (E. Corradini); m.marchetti@pm.univpm.it (M. Marchetti); terracina@mat.unical.it (G. Terracina); d.ursino@univpm.it (D. Ursino); luca.virgili@univpm.it (L. Virgili)
ORCID: 0000-0002-1947-8667 (G. Bonifazi); 0000-0001-8400-1083 (F. Cauteruccio); 0000-0002-1140-4209 (E. Corradini); 0000-0003-3692-3600 (M. Marchetti); 0000-0002-3090-7223 (G. Terracina); 0000-0003-1360-8499 (D. Ursino); 0000-0003-1509-783X (L. Virgili)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
(1) In this paper, we focus on comments expressed by people through their well-defined accounts. We do not consider anonymous comments because they are less reliable and, in any case, not useful for the objectives of our research.

1. Introduction

In the last few years, the investigation of the content of comments expressed by people on social media has increased enormously [1]. In fact, social media comments are one of the places where people tend to express their ideas most spontaneously [2]. Consequently, they play a privileged role in allowing the reconstruction of the real feelings and thoughts of a person, as well as in building a more faithful profile of her [3, 4, 5] (1). Spontaneity is both the main strength and one of the main weaknesses of comments. In fact, they are often written on the spur of the moment, with a language style that is not very structured, apparently confused and in some cases contradictory. In spite of these flaws, the set of comments written by a certain user provides an overview of her thoughts and profile. Reconstructing the latter from the apparent "chaos" inherent in the comments is a challenging issue for researchers working on the extraction of content semantics.

In this paper, we want to make a contribution in this context by proposing a model, and a related approach, to detect and handle content semantics from a set of comments posted on a social network. We argue that our model and its related approach are able to extract from the apparent "chaos" of comments the thoughts of their publisher and, eventually, to reconstruct the corresponding profile. However, the latter is only one of the possible uses of our model and approach. In fact, if we widen our gaze to the comments written by all the users of a certain community, we are able to understand the dominant thoughts in it. If we consider all the comments on a certain topic (e.g., COVID-19), we can reconstruct the various viewpoints on such a topic.
Again, if we consider all comments in a certain time period (e.g., the first three months of 2022), we can determine the dominant thoughts in that period. Furthermore, the reconstruction of thoughts is only one of the possible applications of our model and approach. Others may include, for example, constructing recommender systems, building new user communities, identifying outliers or constructing new topic forums. Some of the most interesting applications are described in [6]. This paper is organized as follows: In Section 2, we present an overview of our proposal. In Section 3, we illustrate our model. In Section 4, we describe our approach. Finally, in Section 5, we draw our conclusions and look at some possible future developments. Due to space limitations, we cannot describe here the experiments carried out to test our model and approach. However, the interested reader can find them in [6].

2. An overview of our proposal

Our approach consists of two phases, namely pre-processing and knowledge extraction. The pre-processing phase is aimed at cleaning and annotating the available comments and, then, selecting the most meaningful ones. During the cleaning activity, bot-generated content, errors, inconsistencies, etc., are removed, and comment tokenization and lemmatization tasks are performed. The annotation activity aims to automatically enrich each lemmatized comment with some important information, such as the value of the associated sentiment, the post to which it refers, the author who wrote it, etc. The selection of the most significant comments is based on a text pattern mining technique. While most of the past approaches to this task consider only the frequency of patterns [7], our technique also, and primarily, considers their utility [8, 9], measured with the support of a utility function. Regarding this function, we point out that our technique is orthogonal to the utility function used.
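As a concrete illustration of this phase, the following sketch chains a minimal cleaning/lemmatization/annotation step with a frequency-plus-utility pattern selection. The lemma table, the sentiment lexicon, and the restriction to two-lemma patterns are illustrative assumptions, not the paper's actual pipeline; the point is that any utility function can be plugged in, reflecting the orthogonality noted above.

```python
import re
from itertools import combinations

# Toy lemma table and sentiment lexicon (illustrative only); a real pipeline
# would rely on an NLP library for lemmatization and a trained sentiment model.
LEMMAS = {"loved": "love", "was": "be"}
SENTIMENT = {"love": 1.0, "great": 0.8, "bad": -0.8}

def preprocess(comment, author, post_id):
    """Clean and tokenize a comment, lemmatize it, and annotate it."""
    tokens = re.findall(r"[a-z']+", comment.lower())   # cleaning + tokenization
    lemmas = [LEMMAS.get(t, t) for t in tokens]        # lemmatization
    scores = [SENTIMENT[l] for l in lemmas if l in SENTIMENT]
    return {"lemmas": set(lemmas),
            "sentiment": sum(scores) / len(scores) if scores else 0.0,
            "author": author, "post": post_id}

def select_patterns(comments, utility, min_freq, u_range):
    """Keep (two-lemma) patterns that are frequent enough AND whose utility,
    computed over the comments containing them, falls in u_range."""
    groups = {}
    for c in comments:
        for pair in combinations(sorted(c["lemmas"]), 2):
            groups.setdefault(pair, []).append(c)
    lo, hi = u_range
    return {p for p, cs in groups.items()
            if len(cs) >= min_freq and lo <= utility(cs) <= hi}

comments = [preprocess("A great movie, I loved it!", "alice", "p1"),
            preprocess("Great movie", "bob", "p1"),
            preprocess("Bad movie", "carol", "p2")]
avg_sentiment = lambda cs: sum(c["sentiment"] for c in cs) / len(cs)
positive = select_patterns(comments, avg_sentiment, min_freq=2, u_range=(0.5, 1.0))
```

Swapping `avg_sentiment` for, say, a rate-based or correlation-based function changes which patterns survive without touching the mining code.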
As a consequence, it is possible to choose different utility functions to prioritize certain comment properties over others. A first utility function could be the sentiment of comments; it could allow, for instance, the identification of only positive comments or only negative ones. A second utility function might be the rate of comments; it might allow, for instance, the selection of patterns involving only high-rate comments or only low-rate ones. A third utility function could be Pearson's correlation [10] between sentiment and rate; it could allow, for instance, the selection of patterns involving only comments with discordant (resp., concordant) sentiment and rate. More details on utility functions can be found in Section 3.

Once the comments and patterns of interest have been selected, it is necessary to have a model for their representation and management. As mentioned in the Introduction, in this paper we propose a new network-based model called CS-Net (Content Semantics Network). The nodes of a CS-Net represent the comment lemmas. Its arcs can be of two types, which reflect two different perspectives on the investigation of comment semantics. The first one, based on the concept of co-occurrence, reflects the past results obtained by many Information Retrieval researchers [11]. It assumes that two semantically related lemmas tend to appear together very often in sentences. The second one, based on the concepts of semantic relationships and semantically related terms, reflects the past results obtained by many researchers working in Natural Language Processing [12]. Actually, the CS-Net model is extensible: if, in the future, we wanted to add further perspectives for investigating comment content, it would be sufficient to add a new type of arc for each new perspective. The CS-Net model is described in detail in Section 3.
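Under the model just outlined, a set of selected patterns can be turned into a CS-Net. The sketch below is a minimal, dictionary-based rendition: normalizing co-occurrence weights by the maximum pattern count is only one possible weighting choice, and `related_terms` is a hypothetical stand-in for a ConceptNet-style "related terms" lookup, not the paper's actual procedure.

```python
from itertools import combinations

def build_cs_net(patterns, related_terms):
    """Build a CS-Net as (nodes, A_c, A_r): nodes are lemmas, A_c holds
    weighted co-occurrence arcs, A_r weighted semantic-relationship arcs."""
    nodes, counts = set(), {}
    for p in patterns:
        nodes |= set(p)
        for li, lj in combinations(sorted(p), 2):
            counts[(li, lj)] = counts.get((li, lj), 0) + 1
    max_count = max(counts.values(), default=1)
    A_c = {e: w / max_count for e, w in counts.items()}   # weights in [0, 1]
    A_r = {(li, lj): w for (li, lj), w in related_terms.items()
           if li in nodes and lj in nodes}                # keep known lemmas only
    return nodes, A_c, A_r

patterns = [{"movie", "film", "great"}, {"movie", "great"}]
related = {("film", "movie"): 0.9, ("car", "movie"): 0.2}
N, A_c, A_r = build_cs_net(patterns, related)
```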
After selecting the comments and patterns of interest, and after representing them by means of CS-Nets, a technique to evaluate the semantic similarity of two CS-Nets is necessary. This technique operates by separately evaluating, and then appropriately combining, the semantic similarity of each pair of subnets obtained by projecting the original CS-Nets so as to consider only one type of arc at a time. The combination of the single components is done by weighting them differently, based on the extension of the CS-Net projections from which they are derived. This extension is determined by the number of the corresponding arcs. In particular, our technique favors the most extensive component, because it represents a larger portion of the content semantics than the other. Analogously to the CS-Net model, our technique for computing the similarity of two CS-Nets is extensible if one wants to add new perspectives of semantic similarity evaluation. In fact, to obtain an overall semantic similarity value, it is sufficient to compute the components related to each perspective separately, and then combine them according to the procedure mentioned above. In evaluating the semantics of two homogeneous subnets (i.e., subnets of only co-occurrences or subnets of only semantic relationships), our technique considers two further aspects, namely the topological similarity of the subnets and the similarity of the concepts expressed by the corresponding nodes. To compute the former, we adopt an approach already proposed in the literature, i.e., NetSimile [13]. To compute the latter, we use an enhanced version of the Jaccard coefficient, capable of also considering synonymies and homonymies. Adding these two further contributions to co-occurrences and semantic relationships makes our approach even more holistic. A detailed description of our technique for evaluating the semantic similarity of two CS-Nets can be found in Section 4.

3.
Proposed model

Let 𝒞 = {c_1, c_2, …, c_n} be a set of lemmatized comments and let ℒ = {l_1, l_2, …, l_q} be the set of all lemmas that can be found in a comment of 𝒞. Each comment c_k ∈ 𝒞 can be represented as a set of lemmas c_k = {l_1, l_2, …, l_m}; therefore, c_k ⊆ ℒ. A text pattern p_h is a set of lemmas; therefore, p_h ⊆ ℒ. We are interested in patterns with frequency values and utility functions belonging to appropriate intervals. In particular, as far as frequency is concerned, we are interested in patterns whose frequency value is greater than a certain threshold. Instead, as far as the utility function is concerned, the scenario is more complex, because it depends on the utility function adopted and the context in which our model is used. For example:

• We could employ as utility function the average sentiment value of the comments to which the pattern of interest refers. We call f_s(·) this utility function. It allows us to select patterns characterized by a compound score (and, therefore, a sentiment value) that is very high (e.g., positive patterns), very low (e.g., negative patterns) or belonging to a given range (e.g., neutral patterns).

• We could adopt as utility function Pearson's correlation [10] between the sentiment and the score of the comments in which the pattern of interest is present. We call f_p(·) this utility function. It allows us to select: (i) patterns having a high sentiment value and stimulating positive comments; (ii) patterns having a low sentiment value and stimulating negative comments; (iii) patterns having a high sentiment value and stimulating negative comments; (iv) patterns having a low sentiment value and stimulating positive comments. Clearly, in the vast majority of investigations, the patterns of interest are those related to cases (i) and (ii). However, there may be rare cases where the patterns of interest are those related to cases (iii) and (iv).
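The correlation-based utility f_p(·) can be made concrete as follows. Pearson's coefficient is computed from its textbook definition; the comment fields and the tiny sample are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def f_p(comments):
    """Utility of a pattern: correlation between the sentiment and the score
    of the comments in which the pattern is present."""
    return pearson([c["sentiment"] for c in comments],
                   [c["score"] for c in comments])

# Concordant sentiment and score -> f_p close to +1 (cases (i)-(ii)).
sample = [{"sentiment": 0.9, "score": 10},
          {"sentiment": 0.1, "score": 2},
          {"sentiment": -0.8, "score": -7}]
```

A pattern whose comments show high sentiment but low scores (case (iii)) would instead drive f_p toward −1.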
In the following, we denote by 𝒫 the set of patterns of interest, whose values of frequency and utility function belong to the intervals of interest for the application being considered. We are now able to formalize our model. In particular, a Content Semantics Network (hereafter, CS-Net) 𝒩 is defined as 𝒩 = ⟨N, A^c ∪ A^r⟩. N is the set of nodes of 𝒩. There is a node n_i ∈ N for each lemma l_i ∈ ℒ. Since there exists a one-to-one correspondence between n_i and l_i, in the following we will use these two symbols interchangeably. A^c is the set of co-occurrence arcs. There is an arc (n_i, n_j, w_ij) ∈ A^c if the lemmas l_i and l_j appear together at least once in a pattern of 𝒫. w_ij is a real number belonging to the interval [0, 1] and denoting the strength of the co-occurrence; the higher w_ij, the higher this strength. For example, w_ij could be obtained as a function of the number of patterns in which l_i and l_j co-occur. A^r is the set of semantic relationship arcs. There is an arc (n_i, n_j, w_ij) ∈ A^r if there is a semantic relationship between l_i and l_j. w_ij is a real number in the interval [0, 1] denoting the strength of the relationship; the higher w_ij, the higher this strength. w_ij could be computed using ConceptNet [14], considering both the number of times l_j is present in the set of "related terms" of l_i and the values of the corresponding weights.

An observation on the structure of the CS-Net model is necessary. As specified above, our goal is to model and manage the semantics of the content of a set of comments. CS-Net is a model tailored exactly to that goal. For this reason, it considers two perspectives derived from the past literature. The former is related to the concept of co-occurrence. It indicates that two semantically related lemmas tend to appear together very often in sentences. This perspective is probably the most immediate in the context of text mining.
In fact, it is well known that the frequency with which two or more lemmas appear together in a text is an index of their correlation. The potential weakness of this perspective lies in the need to compute the frequency of each pair of lemmas. Moreover, this computation must be continually updated whenever a new comment is taken into consideration. The latter perspective is related to the concepts of semantic relationships and semantically related terms. These refer to several studies conducted in the past in the contexts of Information Retrieval [11] and Natural Language Processing [12]. In this perspective, the meanings of the terms, and thus their semantics, are taken into consideration. Indeed, semantic relationships between terms (e.g., synonymies and homonymies) are a very common feature of natural languages. The main weakness of this perspective lies in the need for a thesaurus storing the semantic relationships between terms. If such a tool exists, the computation of the strength of a semantic relationship is straightforward. Clearly, additional perspectives could be considered in the future. This is facilitated by the extensibility of our model. Indeed, if one wanted to consider a new perspective, it would be sufficient to add to A^c and A^r a third set of arcs representing the new perspective.

4. Evaluation of the semantic similarity of two CS-Nets

In this section, we illustrate our approach for computing the semantic similarity of the content of two sets of comments represented by means of two CS-Nets 𝒩_1 and 𝒩_2. It receives 𝒩_1 and 𝒩_2 and returns a coefficient σ_12, whose value belongs to the real interval [0, 1]. It measures the strength of the semantic similarity of the content represented by 𝒩_1 and 𝒩_2; the higher its value, the higher the semantic similarity. Our technique behaves as follows:

• It constructs two pairs of subnets (𝒩_1^c, 𝒩_2^c) and (𝒩_1^r, 𝒩_2^r).
The former (resp., latter) pair is obtained by selecting only the co-occurrence (resp., semantic relationship) arcs from the networks 𝒩_1 and 𝒩_2. Specifically: 𝒩_1^c = ⟨N_1, A_1^c⟩, 𝒩_2^c = ⟨N_2, A_2^c⟩, 𝒩_1^r = ⟨N_1, A_1^r⟩, and 𝒩_2^r = ⟨N_2, A_2^r⟩. If, in the future, we want to add a new perspective, and therefore a new set of arcs beside A^c and A^r, it will be sufficient to build another pair of subnets corresponding to the new perspective.

• It computes the semantic similarity degrees σ_12^c and σ_12^r for the pairs of networks (𝒩_1^c, 𝒩_2^c) and (𝒩_1^r, 𝒩_2^r), respectively. The approach for computing σ_12^x, x ∈ {c, r}, should be as holistic as possible. To this end, it is necessary to define a formula capable of considering as many factors as possible, among those believed to influence the semantic similarity degree of two networks 𝒩_1^x and 𝒩_2^x, x ∈ {c, r}. In particular, it is possible to consider at least two factors with these characteristics. The first factor concerns the topological similarity of the networks, i.e., the similarity of their structural characteristics. The structure of a network is ultimately determined by its nodes and arcs. In our networks, nodes are associated with lemmas, while arcs represent features (e.g., co-occurrences or semantic relationships) that contribute significantly to defining the semantics of the lemmas they connect. This reasoning is further reinforced by the fact that the semantics of a lemma is partly determined by the lemmas to which it is related in the network (here, the principle of homophily, which characterizes social networks, applies to the CS-Net). The second factor is much more immediate. In fact, it concerns the semantic meaning of the concepts expressed by the nodes of the CS-Net, each representing a lemma of the set of comments associated with it.
Regarding the first factor, many approaches for computing the similarity degree of the structures of two networks have been proposed in the past literature. We decided to adopt one of them, i.e., NetSimile [13]. This choice is motivated by the fact that NetSimile has a much shorter computation time than the other related approaches, while guaranteeing an accuracy level adequate for our reference context. NetSimile extracts and evaluates the structural characteristics of each node by analyzing its ego network. Then, to return the similarity score of two networks, it computes the similarity degree of the corresponding feature vectors. Regarding the second factor, we decided to consider the portion of nodes with the same or similar meaning present in the two subnets of the pair. A simple, but very effective, way to do this is to compute the Jaccard coefficient between the sets of lemmas associated with the nodes of the two CS-Nets. Actually, the Jaccard coefficient only considers equality between two lemmas, while we can also have lexicographic relationships (e.g., synonymies and homonymies) between them [15]. These can modify the semantic relationships between two lemmas and, therefore, must be taken into consideration. To do so, our technique uses an advanced thesaurus, i.e., ConceptNet [14], which includes WordNet within it. Based on this thesaurus, we redefine the Jaccard coefficient and introduce an enhanced version of it, which we call J*. It behaves like the classic Jaccard coefficient but takes lexicographic relationships into account. Given these premises, we can define the formula for the computation of σ_12^x:

σ_12^x = β^x · ν(𝒩_1^x, 𝒩_2^x) + (1 − β^x) · J*(N_1^x, N_2^x)

Here:
– ν(𝒩_1^x, 𝒩_2^x) is a function that applies NetSimile to compute the topological similarity of 𝒩_1^x and 𝒩_2^x.
– J*(N_1^x, N_2^x) is the enhanced Jaccard coefficient between the node sets N_1^x and N_2^x.
– β^x represents the weight given to the topological similarity of the CS-Nets with respect to the lexical similarity of the lemmas associated with their nodes. A discussion of possible formulas for β^x, based on the objectives one wants to pursue in a specific application, can be found in [6].

Note that our approach for computing σ_12^x can operate on any projections 𝒩_1^x and 𝒩_2^x of the networks 𝒩_1 and 𝒩_2. In fact, the only constraint is that the arcs of the two networks involved are of the same type x. This makes the approach extensible. Indeed, if we wish to add a new perspective on modeling content semantics in the future, the similarity degree of the corresponding projections of 𝒩_1 and 𝒩_2 can be computed using the same formula for σ_12^x described above.

• It computes the overall semantic similarity degree σ_12 of 𝒩_1 and 𝒩_2 as a weighted mean of the two semantic similarity degrees σ_12^c and σ_12^r:

σ_12 = (ω_12^c · σ_12^c + ω_12^r · σ_12^r) / (ω_12^c + ω_12^r) = α · σ_12^c + (1 − α) · σ_12^r

In this formula, α = ω_12^c / (ω_12^c + ω_12^r) weights the semantic similarity obtained through the analysis of co-occurrences against the one derived from the analysis of the semantic relationships between lemmas. The rationale behind it is that the greater the amount of information carried by one perspective, relative to the other, the greater its weight in defining the overall semantics. Now, since |N_1^c| = |N_1^r| and |N_2^c| = |N_2^r|, the amount of information carried by the two perspectives can be measured through the cardinality of the corresponding sets of arcs. On the basis of this reasoning, we have that: ω_12^c = (ω_1^c + ω_2^c)/2, ω_12^r = (ω_1^r + ω_2^r)/2, ω_1^c = |A_1^c| / (|A_1^c| + |A_1^r|), ω_2^c = |A_2^c| / (|A_2^c| + |A_2^r|), ω_1^r = 1 − ω_1^c, and ω_2^r = 1 − ω_2^c. These formulas essentially tell us that the importance of a perspective in determining the overall content semantics is directly proportional to the number of pairs of lemmas it can involve.
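Putting the pieces of this section together, the sketch below mirrors the σ computation end to end. NetSimile itself is out of scope here, so ν is passed in as a precomputed value; the synonym-pair matching in J* is a deliberate simplification of the ConceptNet-based lexicographic matching; all names and sample data are illustrative.

```python
def j_star(N1, N2, synonyms):
    """Enhanced Jaccard sketch: two lemmas match if they are equal or listed
    as a synonym pair (simplifying the ConceptNet-based matching)."""
    def related(a, b):
        return a == b or (a, b) in synonyms or (b, a) in synonyms
    matched = {a for a in N1 if any(related(a, b) for b in N2)}
    matched |= {b for b in N2 if any(related(a, b) for a in N1)}
    union = len(N1 | N2)
    return len(matched) / union if union else 1.0

def sigma_x(nu, N1, N2, synonyms, beta=0.5):
    """sigma_12^x = beta^x * nu + (1 - beta^x) * J*; `nu` is the topological
    similarity, e.g. precomputed with NetSimile."""
    return beta * nu + (1 - beta) * j_star(N1, N2, synonyms)

def overall_sigma(sigma_c, sigma_r, A_c1, A_r1, A_c2, A_r2):
    """Weighted mean of the two perspectives, with weights proportional to
    the arc counts of the projections; since w_12^c + w_12^r = 1, alpha
    reduces to w_12^c."""
    w_c1 = len(A_c1) / (len(A_c1) + len(A_r1))
    w_c2 = len(A_c2) / (len(A_c2) + len(A_r2))
    alpha = (w_c1 + w_c2) / 2
    return alpha * sigma_c + (1 - alpha) * sigma_r

N1, N2 = {"movie", "great", "plot"}, {"film", "great", "actor"}
s_c = sigma_x(0.8, N1, N2, synonyms={("movie", "film")})
s = overall_sigma(s_c, 0.4, A_c1=range(6), A_r1=range(2),
                  A_c2=range(4), A_r2=range(4))
```

With six co-occurrence arcs against two semantic-relationship arcs in the first network (and a balanced second network), the co-occurrence perspective dominates the weighted mean, as the formulas above prescribe.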
Finally, note that σ_12 ranges in the real interval [0, 1]; the higher σ_12, the greater the similarity of 𝒩_1 and 𝒩_2. Like the other components of our approach, the one for computing σ_12 is extensible. In fact, if, in the future, we wanted to add further perspectives for modeling content semantics, we would simply add to σ_12^c and σ_12^r an additional similarity coefficient for each added perspective and modify the weights in the formula of σ_12 accordingly.

5. Conclusion

In this paper, we have proposed a model and a related approach to represent and handle content semantics in a social platform. Our model is network-based and capable of representing content semantics from different perspectives. It is also extensible, in that new perspectives can be easily added when desired. Our approach first detects the text patterns of interest, based not only on their frequency but also on their utility. Then, it uses these patterns and the proposed model to represent each set of comments by means of a CS-Net. Finally, it adopts a suitable technique to measure the semantic similarity of each pair of comment sets. The latter information can be useful in a variety of applications, ranging from the construction of recommender systems to the building of new topic forums [6]. In the future, we plan to extend this research in various directions. First, we could use our approach as the core of a system for the automatic identification of offensive content of a certain type (cyberbullying, racism, etc.) in a set of comments. In addition, we could study the evolution of CS-Nets over time. This could allow us to identify new trends and topics that characterize a social platform. Finally, we plan to use our approach in a sentiment analysis context. Indeed, in the past literature, there are several studies on how people with anxiety and/or psychological disorders write their comments on social media.
We could contribute to this research effort by considering sets of comments written by users with these characteristics, constructing the corresponding CS-Nets and analyzing them in detail. We could also compare these CS-Nets with "template CS-Nets", typical of a certain emotional state, to support classification activities.

References

[1] X. Chen, Y. Yuan, M. Orgun, Using Bayesian networks with hidden variables for identifying trustworthy users in social networks, Journal of Information Science 46 (2020) 600–615. SAGE.
[2] P. Boczkowski, M. Matassi, E. Mitchelstein, How young users deal with multiple platforms: The role of meaning-making in social media repertoires, Journal of Computer-Mediated Communication 23 (2018) 245–259. Oxford University Press.
[3] F. Cauteruccio, E. Corradini, G. Terracina, D. Ursino, L. Virgili, Investigating Reddit to detect subreddit and author stereotypes and to evaluate author assortativity, Journal of Information Science (2021). doi:10.1177/01655515211047428. SAGE.
[4] B. Abu-Salih, P. Wongthongtham, K. Chan, K. Yan, D. Zhu, CredSaT: Credibility ranking of users in big social data incorporating semantic analysis and temporal factor, Journal of Information Science 45 (2019) 259–280. SAGE.
[5] S. Ahmadian, M. Afsharchi, M. Meghdadi, An effective social recommendation method based on user reputation model and rating profile enhancement, Journal of Information Science 45 (2019) 607–642. SAGE.
[6] G. Bonifazi, F. Cauteruccio, E. Corradini, M. Marchetti, G. Terracina, D. Ursino, L. Virgili, Representation, detection and usage of the content semantics of comments in a social platform, Journal of Information Science (Forthcoming). SAGE.
[7] P. Fournier-Viger, J. C.-W. Lin, R. U. Kiran, Y. Koh, R. Thomas, A survey of sequential pattern mining, Data Science and Pattern Recognition 1 (2017) 54–77.
[8] P.
Fournier-Viger, J. Lin, B. Vo, T. Chi, J. Zhang, H. Le, A survey of itemset mining, WIREs Data Mining and Knowledge Discovery 7 (2017) e1207. doi:10.1002/widm.1207. Wiley.
[9] L. Gadár, J. Abonyi, Frequent pattern mining in multidimensional organizational networks, Scientific Reports 9 (2019) 1–12. Nature Publishing Group.
[10] K. Pearson, Note on Regression and Inheritance in the Case of Two Parents, Proceedings of the Royal Society of London 58 (1895) 240–242. The Royal Society.
[11] Y. Djenouri, A. Belhadi, P. Fournier-Viger, J. Lin, Fast and effective cluster-based information retrieval using frequent closed itemsets, Information Sciences 453 (2018) 154–167. Elsevier.
[12] Z. Bouraoui, J. Camacho-Collados, S. Schockaert, Inducing relational knowledge from BERT, in: Proc. of the AAAI Conference on Artificial Intelligence (AAAI 2020), volume 34(05), New York, NY, USA, 2020, pp. 7456–7463. Association for the Advancement of Artificial Intelligence.
[13] M. Berlingerio, D. Koutra, T. Eliassi-Rad, C. Faloutsos, NetSimile: A scalable approach to size-independent network similarity, arXiv preprint arXiv:1209.2684 (2012).
[14] H. Liu, P. Singh, ConceptNet — a practical commonsense reasoning tool-kit, BT Technology Journal 22 (2004) 211–226. Springer.
[15] P. De Meo, G. Quattrone, G. Terracina, D. Ursino, Integration of XML Schemas at various "severity" levels, Information Systems 31(6) (2006) 397–434.