Difference between revisions of "Vol-3194/paper68"

=Paper=
{{Paper
|wikidataid=Q117344906
|id=Vol-3194/paper68
|storemode=property
|title=Gender Discriminatory Language Identification with an Hybrid Algorithm based on Syntactic Rules and Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-3194/paper68.pdf
|volume=Vol-3194
|authors=Valerio Bellandi,Stefano Siccardi
|dblpUrl=https://dblp.org/rec/conf/sebd/BellandiS22
}}
==Gender Discriminatory Language Identification with an Hybrid Algorithm based on Syntactic Rules and Machine Learning==
<pdf width="1500px">https://ceur-ws.org/Vol-3194/paper68.pdf</pdf>
<pre>
Gender Discriminatory Language Identification with
an Hybrid Algorithm based on Syntactic Rules and
Machine Learning
Valerio Bellandi1, Stefano Siccardi1
1 Computer Science Department, Università degli Studi di Milano, Via Celoria 18, Milano (MI), Italy

Abstract
In recent years, gender discrimination in textual documents has emerged as an open problem and is
undergoing analysis. The difficulty of identifying sentences in which this discrimination is present is
linked to the context used and the formalisms adopted. This work describes an exploratory activity
focused on the regulations and official documents of Italian public administrations. A hybrid
algorithm based on syntactic rules and machine learning is therefore proposed, capable of identifying a
specific subset of possible cases of gender discrimination.

Keywords
Gender Discrimination, Syntactic Rules, Entities Extraction

1. Introduction
Discriminatory attitudes against minorities, or related to gender, have been reported in many areas;
they are often conveyed by language, either as open “hate speech” or in more subtle forms,
for instance by associating the discriminated group with specific social roles or professions. Several
social networks, like Facebook and Twitter, define and ban hate speech (see [4]). Nevertheless,
[9], analyzing more than 2 million tweets over a 7-month period, found that women (in first
place), immigrants, gay and lesbian persons, Muslims, Jews and disabled persons were targeted
by more than 100 thousand hateful tweets. Another study ([7]) reports that around 10% of
social media users report being victimized by online hate speech. On the other hand, several
institutions have approved guidelines to promote the use of inclusive language in their
official documents, that is, language that does not carry any explicit or implicit difference
between genders. Examples are the European Parliament (see [8]) and the University
of Milan (see [10]). This work aims to help detect non-inclusive language usage in order to
facilitate its correction.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
Email: valerio.bellandi@unimi.it (V. Bellandi); stefano.siccardi@unimi.it (S. Siccardi)
ORCID: 0000-0003-4473-6258 (V. Bellandi); 0000-0002-6477-3876 (S. Siccardi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

2. Related work
Natural Language Processing (NLP) techniques to mitigate gender bias have been reviewed by [15].
The authors start from the observation that NLP systems containing bias in training data, resources,
pretrained models (e.g. word embeddings), and algorithms can produce gender-biased predictions
and sometimes even amplify biases present in the training sets. An example, drawn from
Machine Translation, is that translating “He is a nurse. She is a doctor.” to Hungarian and back
to English results in “She is a nurse. He is a doctor.” The main contributions of that paper
concern NLP itself rather than user-composed documents. In the same spirit, [2] quantify the
degree to which gender bias differs with the corpora used for training. They look especially at
the impact of starting with a pre-trained model and fine-tuning with additional data. A different
strand of study uses sentiment analysis to try to detect the gender of writers in blogs or social media,
for instance to establish a detailed health policy specialized by patient segment ([12]), or aims
at detecting the gender and age of writers in order to improve sentiment analysis ([16]). [1] tested
tools that simplify the task of selecting information from collections of documents used
by qualitative researchers to analyze discrimination. They compare some methods and
find that relevant words can be found efficiently, but results depend heavily on the quality of
pre-processing. A tool to check German texts for gender discriminatory formulations is
described in [5]. It is based on rules that detect nouns used to denote male persons and then
filter out correct cases, for instance when the noun refers to a specific male person (because a proper
noun is found nearby). In incorrect cases the user is prompted with suitable messages and
hints. This tool works in basically the same way as our rule-based model. More recently,
the Microsoft Word text editor started offering a tool to check for non-inclusive language that,
when activated, shows hints while documents are typed. However useful, it is presently
not configurable and does not cover a large number of cases. Our method relies on the Named
Entity Recognition (NER) capabilities of NLP software to extend the set of entities to check; a complete
review of the topic is, however, beyond the scope of the present work, and we limit ourselves
to describing some specific points. [11] examined 200 English and 200 Spanish tweets containing
hate speech against women and immigrants using a model based on spaCy ([14]). Words were
divided into 27 categories according to the type of negative content; an accuracy of 84% for
English and 62% for Spanish in identifying hate speech is reported. BERT was used in [13]
to analyze 16,000 tweets, of which 1972 were tagged as racist and 3383 as sexist. A goal of the
study was to reduce false positives, so as to detect hate speech without undermining freedom of
expression. The maximum specificity achieved was 83.03. Another study, [3], compared four
algorithms, namely spaCy, BERT and two monolingual algorithms (Flair and camemBERT), to
find entities in a set of 500 French legal cases, where a pool of experts annotated the cases with
60,000 entity quotations. The monolingual tools reached the best precision and recall.

3. Method
The proposed method consists of two interleaved pipelines: one is the “production” tool for
the detection of discriminatory language in official documents; the other is used to expand
the tables used by the system, so that more and more cases can be detected. Referring to fig. 1,
the first pipeline, labelled 1, is fed with documents. The detection software, using rules and
tables of words used as “seed entities”, produces a tagged version of the documents that highlights
potentially discriminatory points. Each word included in the tables and not fitting one of the
required rules is tagged as an entity (in the NER sense of NLP terminology), with a suitable
error tag. A user checks the tags, fixes the document and resubmits it, until a satisfactory result
is obtained. Examples of tables are: entities to check, such as (man: woman), (men: women); lists of
male proper names; and so on. Examples of rules are:
  1. Base rules, evaluating to True or False: R1 = (word is in table of entities); R2 = (one of
      neighbors of word = table of entities[word])
  2. Compound rules, evaluating to a Tag or None: if R1 and R2 then assign Tagxxx to word
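
The base and compound rules listed above can be illustrated with a minimal Python sketch. It is not the authors' detection software: the table contents come from the (man: woman), (men: women) example, while the neighbor window of 3 tokens, the whitespace tokenizer and the tag names BOTH_FORMS and MALE_ONLY are assumptions made only for illustration.

```python
# Hypothetical table of entities: male form -> female form.
ENTITY_TABLE = {"man": "woman", "men": "women"}


def r1(word):
    """Base rule R1: the word is in the table of entities."""
    return word in ENTITY_TABLE


def r2(tokens, i, window=3):
    """Base rule R2: a neighbor of the word equals its table entry.

    The window size is an assumption; the paper does not state it.
    """
    neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return ENTITY_TABLE.get(tokens[i]) in neighbors


def compound_rule(tokens, i):
    """Compound rule: combine R1 and R2 into a tag, or None."""
    if not r1(tokens[i]):
        return None
    # Tag names are hypothetical placeholders for "Tagxxx".
    return "BOTH_FORMS" if r2(tokens, i) else "MALE_ONLY"


def tag_sentence(sentence):
    """Return (token, tag) pairs for every token that receives a tag."""
    tokens = sentence.lower().split()
    tagged = []
    for i, token in enumerate(tokens):
        tag = compound_rule(tokens, i)
        if tag is not None:
            tagged.append((token, tag))
    return tagged


print(tag_sentence("Every man or woman may apply"))
print(tag_sentence("Every man may apply"))
```

Keeping the base rules as separate predicates means new compound rules can be added without touching the table lookup.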
  The second pipeline is in turn divided into two branches. The branch labelled 2 builds a
model and the branch labelled 3 uses it to find new entities. In branch 2, the documents are
processed by a modified version of the detection software that shares the same rules and tables.
However, instead of creating tagged versions of the documents, it creates a set of annotations.
These are single sentences, taken from the documents, each containing an occurrence of an entity tagged
with the proper error, in the format required by the training program of the chosen NLP tool.
Moreover, whenever a “wrong” occurrence is found, all the “correct” versions are added as
separate annotations with the proper tags. The annotations are then used to train a NER model.
We used spaCy, but in principle any NLP system with trainable NER capabilities can be employed.
In branch 3, the model is fed with the documents and produces a new set of tagged ones. The tagged
words do not necessarily coincide with those found in pipeline 1; we often found more errors than
with the rule-based version. However, new entities, not in the seed set, are in general
found. After a user’s review, the approved entities are added to the tables used in pipeline 1.
The role of the user is important, because the system often finds entities that are not of interest
or are even completely wrong. This closes the loop, extending the capabilities of the “production”
rule system. Pipeline 2 can be run again with the extended tables, until a satisfactory set of terms is
obtained.

4. Experiments and Results
A selection of official documents of the University of Milan, chosen among general, departmental,
security/privacy and staff regulations, was used to search for two basic types of “problems”:
  i) sentences containing only the male form of a noun that has a different female form, e.g. “uomo
     – donna” (man – woman);
  ii) sentences containing nouns that have the same male and female form,
     without any other grammatical element to stress reference to both genders, e.g. “il docente”
     instead of “il/la docente” (the teacher; English has no analogous gendered article).
Whenever one of the above was found, the following annotations were created:
    • the original sentence,
    • the sentence with both forms of the noun, for case i above,
    • the sentence with the unique form of the noun and both articles, for case ii above,
    • the sentence with the male form of the noun and the male article, followed by a randomly
      chosen male proper name,
    • the sentence with only the female form of the noun or the female article.

Figure 1: The pipelines used by the method.

We used these annotations to train the model to recognize the possible cases, assigning a label
to each one.
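
As an illustration of how such annotations could be produced for case ii, here is a minimal Python sketch. It is an assumption-laden example, not the authors' code: the label names, the fixed male name "Mario" (the paper picks one at random) and the (text, [(start, end, label)]) character-offset layout are hypothetical, and the format actually required by the NER tool may differ.

```python
def span_annotation(sentence, phrase, label):
    """Annotate the first occurrence of phrase with character offsets."""
    start = sentence.index(phrase)
    return (sentence, [(start, start + len(phrase), label)])


def variants_same_form(sentence, noun, male_name="Mario"):
    """Build the annotation variants for a noun with a single form (case ii).

    Labels and the default male name are hypothetical placeholders.
    """
    wrong = f"il {noun}"
    forms = {
        "MALE_ARTICLE_ONLY": wrong,                  # the original sentence
        "BOTH_ARTICLES": f"il/la {noun}",            # both articles
        "MALE_WITH_NAME": f"{wrong} {male_name}",    # specific male person
        "FEMALE_ARTICLE": f"la {noun}",              # female article only
    }
    return [
        span_annotation(sentence.replace(wrong, phrase, 1), phrase, label)
        for label, phrase in forms.items()
    ]


annotations = variants_same_form("il docente firma il verbale", "docente")
for text, spans in annotations:
    print(text, spans)
```

Each variant carries exactly one labelled span, so the trainable model sees the same sentence context with different labels.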

4.1. Training with a rich seed entity set
We used a set of 23 entities typically found in University documents, such as teacher,
student, researcher and so on, and trained two different models, the first using 4683 annotations and
the second using 8272. After training, both models were used to analyse the same
set of documents. To evaluate the accuracy, as a first approximation we considered
the rule-based detector 100% correct for the seed entities and implemented a semi-automatic
procedure to check detection errors (false positives). The rule-based model found
a total of 1846 errors. The first model found 2337, including 634 false positives, so that its accuracy is
72.9%; it missed 143 errors found by the rule-based model, that is 7.7%. The second model
found 2316 errors, including 503 false positives, with an accuracy of 78.3%, and missed 33 errors (1.8%)
found by the rule-based model. The first model was able to detect 23 new correct entities (e.g.
“referent”, “warranter/sponsor”), the second one just 16. In other words, a larger training set
reduces errors, but also the number of new entities found. It must be noted that the rule-based
model was actually not 100% correct, a circumstance that may have had a negative impact on
spaCy’s training. A manual check performed on ≈ 15% of the documents showed ≈ 4.47 false
positives. For example, the term “componente” (component) may indicate a person in a sentence
like “un componente dello staff...” (a component of the staff) or an abstract entity in the sentence
“la componente studentesca della popolazione” (the student component of the population).
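
The percentages reported in this section can be reproduced from the raw counts. The short Python check below assumes that "accuracy" means the share of detections that are not false positives and that the missed rate is relative to the 1846 errors found by the rule-based model; this reading is an inference, but it matches the published figures.

```python
RULE_BASED_TOTAL = 1846  # errors found by the rule-based model

# model name -> (errors found, false positives, errors missed)
MODELS = {
    "first (4683 annotations)": (2337, 634, 143),
    "second (8272 annotations)": (2316, 503, 33),
}


def accuracy_pct(found, false_positives):
    """Share of detections that are not false positives, in percent."""
    return 100 * (found - false_positives) / found


def missed_pct(missed, reference_total=RULE_BASED_TOTAL):
    """Share of rule-based detections that the model missed, in percent."""
    return 100 * missed / reference_total


for name, (found, false_positives, missed) in MODELS.items():
    print(
        f"{name}: accuracy {accuracy_pct(found, false_positives):.1f}%, "
        f"missed {missed_pct(missed):.1f}%"
    )
```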

Figure 2: The workflow of incremental training.

4.2. Training with an incremental seed entity set
Experiments were performed to check the ability to build a rich set of entities starting with just
a few. The workflow is shown in fig. 2: we started with a table consisting of a few seed
entities and used the rule-based model to create annotations from a document. These were used
to train a model, which was then run on the same document. If some new entities were found, they
were manually checked and, if correct, added to the entity table; then new annotations were
created and the model was trained again. If no new entities were found, a new document was
used without retraining the model. In the picture, continuous arrows denote the control flow,
dashed ones the data used at each step.
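
The control flow of this workflow can be sketched in Python as follows. This is only one reading of the description: the choice to re-run the model on the same document after each retraining is an assumption, and annotate, train, detect and the user check are toy stand-ins (hypothetical names) for the rule-based annotator, the NER training step, the trained model and the manual review.

```python
def incremental_training(documents, seed_entities, annotate, train, detect, approve):
    """Grow the entity table document by document, retraining on new entities."""
    entities = set(seed_entities)
    model = None
    for doc in documents:
        if model is None:
            model = train(annotate(doc, entities))
        while True:
            candidates = detect(model, doc) - entities
            approved = {c for c in candidates if approve(c)}  # manual check
            if not approved:
                break  # nothing new: move on without retraining
            entities |= approved
            model = train(annotate(doc, entities))  # retrain with new table
    return entities


# Toy stand-ins, for illustration only.
def annotate(doc, entities):
    return [(i, word) for i, word in enumerate(doc) if word in entities]


def train(annotations):
    return len(annotations)  # dummy "model"


def detect(model, doc):
    # Crude pattern: any word following the article "il" is a candidate.
    return {doc[i + 1] for i in range(len(doc) - 1) if doc[i] == "il"}


docs = [["il", "docente", "e", "il", "tavolo"], ["il", "referente", "firma"]]
approved_by_user = {"docente", "referente", "studente"}
result = incremental_training(
    docs, {"studente"}, annotate, train, detect, approved_by_user.__contains__
)
print(sorted(result))
```

The inner loop ends as soon as a pass yields no approved entity, so rejected candidates such as "tavolo" never trigger a retraining.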
  Results are summarized in table 1.
  Two experiments were run, the first starting with 7 seed entities (the first 3 columns of
table 1), the second with 3 (the last 3 columns). Each row shows the number of new entities
found, if any, the number of corresponding annotations and the time in minutes needed to
retrain the model. Experiments were run on a “small” system, that is a commercial PC with a
6-core Intel i7 processor at 2.7 GHz and 16 GB of RAM. We can summarize that:

    • the first experiment stopped finding new entities after the 22nd entity
    • 10 entities found by the first and 6 found by the second model were elements of the
      original 23 seed entity set chosen by the user for the experiments in section 4.1
    • some “popular” entities sharply increase the number of annotations
    • the training time increases roughly linearly with the number of annotations, even if it
      does not depend only on it (see fig. 3)

        Experiment 1 (7 seed entities)             Experiment 2 (3 seed entities)
        New ents  Annotations    Train time        New ents  Annotations    Train time
               7           66             1               3            3           0.2
               0            —             —               0            —             —
               1         1069             3               2         1894            11
               2         1107             5               0            —             —
               0            —             —               0            —             —
               0            —             —               2         1928             9
               1         1137             5               0            —             —
               1         1339            13               0            —             —
               2         1977            16               1         1933            19
               0            —             —               1         1933            12
               2         1983            16               1         1971            10
               2         2019             8               3         1977            11
               2         2168             8               2         2054            13
               2         2173             7               3         2676            17
               0            —             —               3         2818            20
               0            —             —               0            —             —
               0            —             —               0            —             —
               0            —             —               2         2861            18
Total         22                                         23

Table 1
Results of incremental training (train time in minutes).
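
The roughly linear trend can be checked directly on the (annotations, train time) pairs of table 1. The Python sketch below fits an ordinary least-squares line to the pooled rows of both experiments; pooling the two experiments and using a plain linear fit are choices made here for illustration, not an analysis taken from the paper.

```python
# (annotations, train time in minutes) pairs copied from the rows of table 1
EXPERIMENT_1 = [(66, 1), (1069, 3), (1107, 5), (1137, 5), (1339, 13),
                (1977, 16), (1983, 16), (2019, 8), (2168, 8), (2173, 7)]
EXPERIMENT_2 = [(3, 0.2), (1894, 11), (1928, 9), (1933, 19), (1933, 12),
                (1971, 10), (1977, 11), (2054, 13), (2676, 17), (2818, 20),
                (2861, 18)]


def linear_fit(points):
    """Ordinary least squares: return slope, intercept and Pearson r."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = linear_fit(EXPERIMENT_1 + EXPERIMENT_2)
print(f"minutes per 1000 annotations: {1000 * slope:.2f}")
print(f"intercept (minutes): {intercept:.2f}")
print(f"Pearson r: {r:.2f}")
```

A correlation clearly below 1 is consistent with the remark that training time does not depend on the number of annotations alone.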

5. Conclusion and future work
We showed that combining a rule-based model with a trainable one is a promising way to
obtain an accurate and extensible tool to detect non-inclusive language. We applied the method
to the official documents of an institution as a first test area, but we think that it can be applied
to wider types of documents and of discriminatory language. In the future, we plan to consider,
for instance, textbooks and blog or newspaper articles. These require handling some more
subtle types of discriminatory language, related for instance to negative sentiments and stereotypes.
Therefore, we plan to make some technical enhancements: i) the rule engine and
rule set will be expanded to handle more complex cases and avoid the small percentage of errors
we found in the present work; ii) we will compare the performance of several Natural Language
Processing tools, instead of using just one; and iii) we will include methods from the Sentiment
Analysis area.

Acknowledgements
The authors wish to thank their students Francesca Aru, Giovanni Lafiandra and Giulia Pagani
for their helpful work, especially during the experimental phase. This work was partly supported
by Università degli Studi di Milano under the program “Piano di sostegno alla ricerca”.

Figure 3: The dependence of training time on annotation number.

References
[1] Alatrista-Salas, H. and Hidalgo-Leon, P. and Nunez-del-Prado, M., Documents Retrieval
    for Qualitative Research: Gender Discrimination Analysis, 2018 IEEE Latin American
    Conference on Computational Intelligence (LA-CCI), 2018, pp. 1-6,
    doi: 10.1109/LA-CCI.2018.8625211
[2] Babaeianjelodar, M. and Lorenz, S. and Gordon, J. and Matthews, J. and Freitag, E.,
    Quantifying Gender Bias in Different Corpora, Companion Proceedings of the Web
    Conference 2020, Association for Computing Machinery, New York, NY, USA, 2020,
    doi: https://doi.org/10.1145/3366424.3383559
[3] Benesty, M., NER algo benchmark: spaCy, Flair, m-BERT and camemBERT
    on anonymizing French commercial legal cases,
    https://towardsdatascience.com/benchmark-ner-algorithm-d4ab01b2d4c3. Last accessed 22 Feb 2022
[4] Bortone, R. and Cerquozzi, F., L’hate speech al tempo di Internet, Aggiornamenti sociali,
    vol. 818, 2017
[5] Carl, M. and Garnier, S. and Haller, J. and Altmayer, A. and Miemietz, B., Controlling gender
    equality with shallow NLP techniques, Proceedings of the 20th international conference
    on Computational Linguistics (COLING ’04), Association for Computational Linguistics,
    USA, 820–es, 2004, doi: https://doi.org/10.3115/1220355.1220473
[6] Devlin, J. and Chang, M.-W. and Lee, K. and Toutanova, K., BERT: Pre-training of Deep
    Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference
    of the North American Chapter of the Association for Computational Linguistics: Human
    Language Technologies, Volume 1 (Long and Short Papers), https://aclanthology.org/N19-1423,
    doi: 10.18653/v1/N19-1423, 2019
[7] Döring, N. and Mohseni, M. R., Gendered hate speech in YouTube and YouNow comments:
    Results of two content analyses, SCM Studies in Communication and Media, vol. 9, n. 1,
    2020
[8] Gender neutral language in the European Parliament,
    https://www.europarl.europa.eu/cmsdata/151780/GNL_Guidelines_EN.pdf. Last accessed 28 Feb 2022
[9] Lingiardi, V. and Carone, N. and Semeraro, G. and Musto, C. and D’Amico, M. and Brena,
    S., Mapping Twitter hate speech towards social and sexual minorities: a lexicon-based
    approach to semantic content analysis, Behaviour & Information Technology, vol. 39, n. 7,
    2020
[10] Linee guida per l’adozione della parità di genere nei testi amministrativi e
    nella comunicazione istituzionale dell’Università degli Studi di Milano (in Italian),
    https://www.unimi.it/sites/default/files/regolamenti/Lineeguidalinguaggiodigenere_2020_UniversitádegliStudidiMilano.pdf.
    Last accessed 28 Feb 2022
[11] Lai, M. and Stranisci, M. A. and Bosco, C. and Damiano, R. and Patti, V., HaMor at the
    Profiling Hate Speech Spreaders on Twitter, Working Notes of CLEF 2021-Conference and
    Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2936, pp. 2047–2055, 2021
[12] Park, S. and Woo, J., Gender Classification Using Sentiment Analysis and Deep Learning in
    a Health Web Forum, Appl. Sci. 2019, 9, 1249, https://doi.org/10.3390/app9061249
[13] Rajput, G. and Punn, N. S. and Sonbhadra, S. K. and Agarwal, S.,
    Hate speech detection using static BERT embeddings, arXiv:2106.15537, 2021
[14] spaCy Homepage, https://spacy.io/. Last accessed 21 Feb 2022
[15] Sun, T. and Gaut, A. and Tang, S. and Huang, Y. and ElSherief, M. and Zhao, J. and
    Mirza, D. and Belding, E. and Chang, K.-W. and Wang, W. Y., Mitigating Gender Bias in
    Natural Language Processing: Literature Review, arXiv:1906.08976, 2019
[16] Volkova, S. and Wilson, T. and Yarowsky, D., Exploring Demographic Language
    Variations to Improve Multilingual Sentiment Analysis in Social Media,
    http://aclweb.org/anthology/D/D13/D13-1187.pdf, EMNLP, 2013
</pre>

Revision as of 17:54, 30 March 2023

Paper

Paper
edit
description  
id  Vol-3194/paper68
wikidataid  →Q117344906
title  Gender Discriminatory Language Identification with an Hybrid Algorithm based on Syntactic Rules and Machine Learning
pdfUrl  https://ceur-ws.org/Vol-3194/paper68.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/BellandiS22
volume  Vol-3194→Vol-3194
session  →

Gender Discriminatory Language Identification with an Hybrid Algorithm based on Syntactic Rules and Machine Learning

load PDF

Gender Discriminatory Language Identification with
an Hybrid Algorithm based on Syntactic Rules and
Machine Learning
Valerio Bellandi1 , Stefano Siccardi1
1
    Computer Science Department, Università Degli Studi di Milano, Via Celoria 18 Milano (MI) Italy


                                         Abstract
                                         In the last years, gender discrimination in textual documents has emerged as an open problem and is
                                         undergoing analysis. The difficulty of identifying sentences in which this discrimination is present is
                                         linked to the context used and the formalisms adopted. This work describes an exploratory activity
                                         linked to the context of regulations and official documents of Italian public administrations. A hybrid
                                         algorithm based on syntactic rules and machine learning is therefore proposed, capable of identifying a
                                         specific subset of possible gender discrimination.

                                         Keywords
                                         Gender Discrimination, Syntactic Rules, Entities Extraction




1. Introduction
Discriminatory attitudes against minorities or gender related have been reported in many areas;
often they are conveyed by language in the form of open “hate speech” or in more subtle forms,
for instance associating the discriminated group to specific social roles or professions. Several
social networks, like Facebook and Twitter, define and ban hate speech (see [4]). Nevertheless,
[9] analyzing more than 2 millions tweets in a 7 months period, found that women (at the first
place), immigrants, gay and lesbian persons, Muslims, Jews and disabled persons were addressed
by more than 100 thousands of hateful tweets. Another study ([7]) reports that around 10% of
social media users reports being victimized by online hate speech. On the other hand, several
institutions have approved guidelines to promote the usage of an inclusive language in their
official documents, that is a language that does not carry any explicit or implicit difference
between genders. For instance, we quote the European Parliament (see [8]) and the University
of Milan (see [10]). This work aims at helping the detection of non inclusive language usage to
facilitate their correction.




SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
$ valerio.bellandi@unimi.it (V. Bellandi); stefano.siccardi@unimi.it (. S. Siccardi)
� 0000-0003-4473-6258 (V. Bellandi); 0000-0002-6477-3876 (. S. Siccardi)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
�2. Related work
Natural Language Processing techniques to mitigate gender bias have been reviewed by [15].
They start from the observation that NLP systems containing bias in training data, resources,
pretrained models (e.g. word embeddings), and algorithms can produce gender biased predictions
and sometimes even amplify biases present in the training sets. An example, driven from
Machine Translation is that translating “He is a nurse. She is a doctor.” to Hungarian and back
to English results in “She is a nurse. He is a doctor.” The main contributions of the paper are
related to NLP itself, more than to user composed documents. In the same spirit, [2] quantify the
degree to which gender bias differs with the corpora used for training. They look especially at
the impact of starting with a pre-trained model and fine-tuning with additional data. A different
strand of study uses sentiment analysis to try to detect gender of writers in blogs or social media,
for instance to establish a detailed health policy specialized into patient segments ([12]), or aims
at detecting gender and age of writers in order to improve sentiment analysis ([16]). [1] tested
the use of tools to simplify the task of information selection from collections of documents used
by qualitative researchers in order to analyze discrimination. They compare some methods and
find that relevant words can be efficiently found, but results heavily depend on the quality of
pre processing. A tool to check German texts for gender discriminatory formulations has been
described in [5]. It is based on rules to detect nouns used to denote male persons and then to
filter out correct cases, for instance when the noun refers to a specific male person (as a proper
noun is found nearby). In non correct cases the user is prompted with suitable messages and
hints. The way this tool works is basically the same of our rule based model. More recently,
Microsoft Word text editor started offering a tool to check for non inclusive language, that,
when activated, prompts some hints during documents input. However useful, it is presently
not configurable and does not cover a large number of cases. Our method relies on Named
Entity Recognition capabilities of NLP software to extend the set of entities to check; a complete
review of the topic is however out of the scope of the present work and we will limit ourselves
to describe some specific points. [11] examined 200 English and 200 Spanish tweets containing
hate speech against women and immigrants using a model based on Spacy ([14]). Words were
divided into 27 categories according to the type of negative contents; an accuracy of 84% for
English and 62% for Spanish in identifying hate speech is reported. Bert has been used in [13]
to analyze 16,000 tweets, containing 1972 tagged as racist and 3383 as sexist. A goal of the
study was to reduce false positives to detect hate speech without undermining the freedom of
expression. The maximum achieved specificity, was 83.03. Another study, [3], compared four
algorithms including Spacy, Bert and two monolingual algorithms (Flair and camemBERT) to
find entities in a set of 500 legal French cases, where a pool of experts annotated the cases with
60,000 entity quotations. The monolingual tools reached the best precision and recall.


3. Method
The proposed methods consists of two interleaved pipelines, one is the “production” tool for
the detection of discriminatory language in official documents, the other is used to expand
the tables used by the system, so that more and more cases can be detected. Referring to fig.
�1, the first pipeline, labelled 1, is fed with documents. A detection software, using rules and
tables with words used as “seed entities”, produces a tagged version of documents, to highlight
potentially discriminatory points. Each word included in the tables and not fitting one of the
required rules is tagged as an entity (in the sense of NER of NLP terminology), with a suitable
error tag. A user checks the tags, fixes the document and resubmit it, until a satisfactory result
is obtained. Examples of tables are: entities to check, like: (man: woman), (men: women); list of
male proper names, and so on. Examples of rules are:
   1. Base rules, evaluating True or False: R1 = (word is in table of entities); R2 = (one of
      neighbors of word = table of entities[word])
   2. Compound rules, evaluating a Tag or None: if R1 and R2 then assign Tagxxx to word
   The second pipeline is in turn divided into two branches. The branch labelled 2 builds a
model and the branch labelled 3 uses it to find new entities. In branch 2, the documents are
processed by a modified version of the detection software, that shares the same rules and tables.
However, instead of creating tagged versions of the documents, it creates a set of annotations.
These are single sentences, taken from the documents, with an occurrence of an entity, tagged
with the proper error, in the format required by the training program of the chosen NLP tool.
Moreover, whenever a “wrong” occurrence is found, all the “correct” version are added, as
separate annotations with the proper tags. The annotation are then used to train a NER model.
We used Spacy, but in principle any NLP system with trainable NER capabilities can be employed.
In branch 3, the model is fed with the documents and produces a new set of tagged ones. Tagged
words will not coincide with those found in pipeline 1; often we found more errors compared
to the rule based version. However new entities, not in the set of the seed ones, are in general
found. After a user’s review, the approved entities are added to the tables used in pipeline 1.
The role of the user is important, because the system often finds entities that are not of interest
or even completely wrong. This closes the loop, extending the capabilities of the “production”
rule system. Pipeline 2 can be run with the extended tables, until a satisfactory set of terms is
obtained.


4. Experiments and Results
A selection of official documents of the University of Milan, chosen among general, departmental,
security/privacy and staff regulations, has been used to search for two basic types of “problems”:
i) sentences containing only the male form of a noun that has a different female form, e.g. “uomo
– donna” (man – woman); ii) sentences containing nouns that have the same male and female form,
without any other grammatical element to stress reference to both genders, e.g. “il docente”
instead of “il/la docente” (the teacher; English has no analogous article form). Whenever one of
the above was found, the following annotations were created:
    • the original sentence,
    • the sentence with both forms of the noun for case i) above,
    • the sentence with the unique form of the noun and both articles for case ii) above,
    • the sentence with the male form of the noun and male article, followed by a randomly
      chosen proper male name,
    • the sentence with only the female form of the noun or the female article.

Figure 1: The pipelines used by the method.

We used these annotations to train the model to recognize the possible cases, assigning a label
to each one.
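
A minimal sketch of how the variants for case ii) could be generated from one detected sentence
is given below. The label names, the list of male names and the helper function are hypothetical;
they only illustrate the idea of producing one labelled sentence per possible case.

```python
import random

# Hypothetical name list; the paper does not publish the one it used.
MALE_NAMES = ["Mario", "Luca", "Paolo"]


def variants_same_form(sentence, male_art, female_art, noun, rng=random):
    """Build labelled variants for a noun with a single form for both genders (case ii)."""
    original = f"{male_art} {noun}"
    if original not in sentence:
        raise ValueError("the male article + noun must occur in the sentence")

    def swap(new):
        # Replace only the first occurrence of the detected phrase.
        return sentence.replace(original, new, 1)

    # Labels are illustrative assumptions, one per case listed above.
    return [
        (sentence, "MALE_ONLY"),
        (swap(f"{male_art}/{female_art} {noun}"), "BOTH_ARTICLES"),
        (swap(f"{original} {rng.choice(MALE_NAMES)}"), "MALE_NAMED_PERSON"),
        (swap(f"{female_art} {noun}"), "FEMALE_ONLY"),
    ]


for text, label in variants_same_form("il docente firma il verbale", "il", "la", "docente"):
    print(label, "->", text)
```

The variant with a proper name plausibly lets the model learn that a male form referring to a
specific, named man is not an error, whereas the generic male form is.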

4.1. Training with a rich seed entity set
We used a set of 23 entities typically found in University documents, such as teacher,
student and researcher, and trained two different models, the first using 4683 annotations and
the second 8272. After training, both models were used to analyse the same
set of documents. To evaluate the accuracy, as a first approximation we considered
the rule based detector 100% correct for the seed entities and implemented a semi automatic
procedure to check detection errors (false positives). The rule based model found
a total of 1846 errors. The first model found 2337, including 634 false positives, so that its
accuracy (the share of its detections that are correct) is 72.9%; it missed 143 errors found by
the rule based model, that is 7.7%. The second model
found 2316 errors, including 503 false positives, with accuracy 78.3%, and missed 33 errors (1.8%)
found by the rule based model. The first model was able to detect 23 new correct entities (e.g.
“referent”, “warranter/sponsor”), the second one just 16. In other words, a larger training set
reduces errors, but also the number of new entities found. It must be noted that the rule based
model was actually not 100% correct, a circumstance that may have had a negative impact on
spaCy’s training. A manual check performed on ≈ 15% of the documents showed ≈ 4.47 false
positives. For example the term “componente” (component) may indicate a person in a sentence
like “un componente dello staff...” (a component of the staff) or an abstract entity in the sentence
“la componente studentesca della popolazione” (the student component of the population).
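
The percentages reported above can be re-derived from the raw counts. The short sketch below
assumes that accuracy is computed as correct detections over all detections and that the miss
rate is taken relative to the 1846 errors found by the rule based model; these formulas are
inferred from the numbers rather than stated explicitly, and the function name is hypothetical.

```python
def detection_stats(found, false_positives, missed, reference_total):
    """Return (accuracy %, miss rate %) rounded to one decimal place."""
    correct = found - false_positives
    accuracy = 100 * correct / found              # share of detections that are correct
    miss_rate = 100 * missed / reference_total    # share of rule based errors not found
    return round(accuracy, 1), round(miss_rate, 1)


RULE_BASED_ERRORS = 1846
print(detection_stats(2337, 634, 143, RULE_BASED_ERRORS))  # (72.9, 7.7)
print(detection_stats(2316, 503, 33, RULE_BASED_ERRORS))   # (78.3, 1.8)
```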
Figure 2: The workflow of incremental training.


4.2. Training with an incremental seed entity set
Experiments were performed to check the ability to build a rich set of entities starting with just
a few. The workflow is shown in fig. 2: we started with a table consisting of a few seed
entities and used the rule based model to create annotations from a document. They were used
to train a model, that was run on the same document. If some new entities were found, they
were manually checked and, if correct, added to the entity table; then new annotations were
created and the model was trained again. If no new entities were found, a new document was
used without retraining the model. In the picture, continuous arrows denote the control flow,
dashed ones the data used at each step.
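
The control flow just described can be sketched as a loop. The four callables are hypothetical
placeholders for the rule based annotator, the NER training step, the model run and the manual
review. The sketch also assumes that, after retraining, the model is run again on the same
document and that annotations are rebuilt from the current document only; the text does not
settle either point.

```python
def incremental_training(documents, seeds, annotate, train, detect, review):
    """Grow the entity table following the incremental workflow.

    annotate(document, entities) -> list of annotations (rule based step)
    train(annotations)           -> model
    detect(model, document)      -> set of candidate entities
    review(candidates)           -> subset approved by the user
    All four callables are hypothetical placeholders.
    """
    entities = set(seeds)
    model = None
    for document in documents:
        if model is None:
            # First document: train on annotations built from the seed entities.
            model = train(annotate(document, entities))
        while True:
            approved = review(detect(model, document) - entities)
            if not approved:
                break  # nothing new: move to the next document without retraining
            entities |= approved
            model = train(annotate(document, entities))
    return entities
```

The loop ends for a document as soon as the review step approves nothing new, which corresponds
to the rows of the results table where no retraining time is reported.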
   Results are summarized in table 1.
   Two experiments were run, the first starting with 7 seed entities (the first 3 columns of
table 1), the second with 3 (the last 3 columns). Each row shows the number of new entities
found, if any, the number of corresponding annotations and the time in minutes needed to
retrain the model. Experiments were run on a “small” system, that is a commercial PC with a
6-core Intel i7 processor at 2.7 GHz and 16 GB of RAM. We can summarize that:

    • the first experiment stopped finding new entities after the 22nd entity
    • 10 entities found by the first and 6 found by the second model were elements of the
      original 23 seed entity set chosen by the user for the experiments in section 4.1
    • some “popular” entities sharply increase the number of annotations
    • the training time roughly increases linearly with the number of annotations, even if it
      does not depend only on it (see fig. 3)

          Experiment 1 (7 seed entities)                Experiment 2 (3 seed entities)
          New ents   Annotations   Train time (min)     New ents   Annotations   Train time (min)
                 7            66                  1            3             3                0.2
                 0             —                  —            0             —                  —
                 1          1069                  3            2          1894                 11
                 2          1107                  5            0             —                  —
                 0             —                  —            0             —                  —
                 0             —                  —            2          1928                  9
                 1          1137                  5            0             —                  —
                 1          1339                 13            0             —                  —
                 2          1977                 16            1          1933                 19
                 0             —                  —            1          1933                 12
                 2          1983                 16            1          1971                 10
                 2          2019                  8            3          1977                 11
                 2          2168                  8            2          2054                 13
                 2          2173                  7            3          2676                 17
                 0             —                  —            3          2818                 20
                 0             —                  —            0             —                  —
                 0             —                  —            0             —                  —
                 0             —                  —            2          2861                 18
Total           22                                            23
Table 1: Results of incremental training.
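
The roughly linear relation between annotations and training time can be checked directly on the
values of Table 1. The sketch below fits an ordinary least-squares line to the (annotations,
minutes) pairs of both experiments. Pooling the two experiments and using a simple linear model
are choices made for this illustration, not part of the original analysis.

```python
# (annotations, training minutes) pairs copied from Table 1, both experiments.
POINTS = [
    (66, 1), (1069, 3), (1107, 5), (1137, 5), (1339, 13),
    (1977, 16), (1983, 16), (2019, 8), (2168, 8), (2173, 7),
    (3, 0.2), (1894, 11), (1928, 9), (1933, 19), (1933, 12), (1971, 10),
    (1977, 11), (2054, 13), (2676, 17), (2818, 20), (2861, 18),
]


def least_squares(points):
    """Return (slope, intercept, r) of the least-squares line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    # r is the Pearson correlation coefficient between annotations and minutes.
    return slope, mean_y - slope * mean_x, sxy / (sxx * syy) ** 0.5


slope, intercept, r = least_squares(POINTS)
print(f"minutes per 1000 annotations: {1000 * slope:.1f}, correlation r = {r:.2f}")
```

A correlation clearly below 1 would be consistent with the remark that training time does not
depend on the number of annotations alone.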


5. Conclusion and future work
We showed that a combination of a rule based model with a trainable one is a promising way to
get an accurate and extensible tool to detect non inclusive language. We applied the method
to official documents of an institution as a first test area, but we think that it can be applied
to wider types of documents and of discriminatory language. In the future, we plan to consider,
for instance, textbooks and blog or newspaper articles. These require handling more
subtle types of discriminatory language, related for instance to negative sentiment and stereotypes.
Therefore, we plan to perform some technical enhancements: i) the rule engine and
rule set will be expanded to handle more complex cases and avoid the small percentage of errors
we found in the present work; ii) we will compare the performance of several Natural Language
Processing tools, instead of using just one; and iii) we will include methods from the Sentiment
Analysis area.


Acknowledgements
The authors wish to thank their students Francesca Aru, Giovanni Lafiandra and Giulia Pagani
for their helpful work, especially during the experimental phase. This work was partly supported
by Università degli Studi di Milano under the program “Piano di sostegno alla ricerca”.
Figure 3: The dependence of training time on annotation number.


References
[1] Alatrista-Salas, H. and Hidalgo-Leon, P. and Nunez-del-Prado, M. Documents Retrieval
    for Qualitative Research: Gender Discrimination Analysis, 2018 IEEE Latin American
    Conference on Computational Intelligence (LA-CCI), 2018, pp. 1-6,
    doi: 10.1109/LA-CCI.2018.8625211.
[2] Babaeianjelodar, M. and Lorenz, S. and Gordon, J. and Matthews, J. and Freitag, E.,
    Quantifying Gender Bias in Different Corpora. Companion Proceedings of the Web
    Conference 2020. Association for Computing Machinery, New York, NY, USA, 2020.
    DOI:https://doi.org/10.1145/3366424.3383559
[3] Benesty, M., NER algo benchmark: spaCy, Flair, m-BERT and camemBERT
    on anonymizing French commercial legal cases,
    https://towardsdatascience.com/benchmark-ner-algorithm-d4ab01b2d4c3, Last accessed 22 Feb 2022
[4] Bortone, R. and Cerquozzi, F., L’hate speech al tempo di Internet, Aggiornamenti sociali,
    vol. 818, 2017
[5] Carl, M. and Garnier, S. and Haller, J. and Altmayer, A. and Miemietz, B. Controlling gender
    equality with shallow NLP techniques. In Proceedings of the 20th international conference
    on Computational Linguistics (COLING ’04). Association for Computational Linguistics,
    USA, 820–es. 2004. DOI:https://doi.org/10.3115/1220355.1220473
[6] Devlin, J. and Chang, M.-W. and Lee, K. and Toutanova, K., BERT: Pre-training of Deep
    Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference
    of the North American Chapter of the Association for Computational Linguistics: Human
    Language Technologies, Volume 1 (Long and Short Papers), url https://aclanthology.org/N19-1423,
    doi:10.18653/v1/N19-1423, 2019
[7] Döring, N. and Mohseni, M. R., Gendered hate speech in YouTube and YouNow comments:
    Results of two content analyses, SCM Studies in Communication and Media, vol. 9, n. 1,
    2020
[8] Gender neutral language in the European Parliament,
    https://www.europarl.europa.eu/cmsdata/151780/GNL_Guidelines_EN.pdf. Last accessed 28 Feb 2022
[9] Lingiardi, V. and Carone, N. and Semeraro, G. and Musto, C. and D’Amico, M. and Brena,
    S., Mapping Twitter hate speech towards social and sexual minorities: a lexicon-based
    approach to semantic content analysis, Behaviour & Information Technology, vol. 39, n. 7,
    2020
[10] Linee guida per l’adozione della parità di genere nei testi amministrativi e
    nella comunicazione istituzionale dell’Università degli Studi di Milano (in Italian),
    https://www.unimi.it/sites/default/files/regolamenti/Lineeguidalinguaggiodigenere_2020_UniversitádegliStudidiMilano.pdf.
    Last accessed 28 Feb 2022
[11] Lai, M. and Stranisci, Marco A. and Bosco, C. and Damiano, R. and Patti, V. HaMor at the
    Profiling Hate Speech Spreaders on Twitter, Working Notes of CLEF 2021-Conference and
    Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2936, pp. 2047–2055, 2021
[12] Park, S. and Woo, J. Gender Classification Using Sentiment Analysis and Deep Learning in
    a Health Web Forum. Appl. Sci. 2019, 9, 1249. https://doi.org/10.3390/app9061249
[13] Gaurav Rajput and Narinder Singh punn and Sanjay Kumar Sonbhadra and Sonali Agarwal,
    Hate speech detection using static BERT embeddings, arxiv:2106.15537, 2021
[14] Spacy Homepage, https://spacy.io/. Last accessed 21 Feb 2022
[15] Tony Sun, T. and Gaut, A. and Tang, S. and Huang, Y. and ElSherief, M. and Zhao, J. and
    Mirza, D. and Belding, E. and Chang, K-W. and Yang Wang, W., Mitigating Gender Bias in
    Natural Language Processing: Literature Review, arxiv 1906.08976, 2019
[16] Volkova, S. and Wilson, T. and Yarowsky, D., Exploring Demographic Language
    Variations to Improve Multilingual Sentiment Analysis in Social Media,
    url=http://aclweb.org/anthology/D/D13/D13-1187.pdf, EMNLP. 2013