Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper44 |
wikidataid | Q117344993→Q117344993 |
title | Towards Extreme Multi-Label Classification of Multimedia Content |
pdfUrl | https://ceur-ws.org/Vol-3194/paper44.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/MiniciPG022 |
volume | Vol-3194→Vol-3194 |
session | → |
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper44 |
wikidataid | Q117344993→Q117344993 |
title | Towards Extreme Multi-Label Classification of Multimedia Content |
pdfUrl | https://ceur-ws.org/Vol-3194/paper44.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/MiniciPG022 |
volume | Vol-3194→Vol-3194 |
session | → |
Towards Extreme Multi-Label Classification of Multimedia Content (Discussion Paper) Marco Minici1,2 , Francesco Sergio Pisani2 , Massimo Guarascio2 and Giuseppe Manco2 1 Università degli Studi di Pisa 2 ICAR-CNR, Rende (CS), Italy Abstract Providing rich and accurate metadata for indexing media content represents a major issue for enterprises offering streaming entertainment services. Metadata information are usually exploited to boost the search capabilities for relevant contents and as such it can be used by recommendation algorithms for yielding recommendation lists matching user interests. In this context, we investigate the problem of associating suitable labels (or tag) to multimedia contents, that can accurately describe the topics associated with such contents. This task is usually performed by domain experts in a fully manual fashion that makes the overall process time-consuming and susceptible to errors. In this work we propose a Deep Learning based framework for semi-automatic, multi-label and semi-supervised classification. By integrating different data types (e.g., text, images, etc.) the approach allows for tagging media contents with specific labels. A preliminary experimentation conducted on a real dataset demonstrates the quality of the approach in terms of predictive accuracy. Keywords Extreme Multi-Label Classification, Data Integration, Natural Language Processing, Semi-supervised Learning 1. Introduction Nowadays, entertainment industry represents one of the most profitable and widespread busi- ness sector, with a constant growth in terms of number of users. With estimated revenues amounting to about 2 trillion of dollars worldwide, providing effective research services is a crucial task for the companies operating in multimedia content delivery. In particular, the rise of streaming services and on-demand contents fostered the interest for AI-based solutions capable to facilitate the research and identification of contents matching the user interests. Just as an example, Recommender Systems (RS) are technologies widely adopted by big players (e.g., Netflix, Disney+, Amazon, etc.) to suggest items of their catalogues able to arouse users’ interest. SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ marco.minici@icar.cnr.it (M. Minici); giuseppe.manco@icar.cnr.it (F. S. Pisani); massimo.guarascio@icar.cnr.it (M. Guarascio); giuseppe.manco@icar.cnr.it (G. Manco) https://mminici.github.io (M. Minici) � 0000-0002-9641-8916 (M. Minici); 0000-0003-2922-0835 (F. S. Pisani); 0000-0001-7711-9833 (M. Guarascio); 0000-0001-9672-3833 (G. Manco) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) � Besides them, the technologies that allow for enriching content metadata with informative labels (or tags) act a key role as they can exploited to improve the RS performances and simultaneously enable a more effective research by means of the traditional research engines. Basically, these labels are used to group contents exhibiting common features and provide aggregated views for the users. However, the labelling task is a time-consuming and prone to the error process since it is manually performed by domain experts. Indeed, the lack of a common shared taxonomy can lead to yield repeated labels describing the same concept. Moreover, the assignment of a label to a content is subjective and depends on the skill and perception of the expert. In this scenario, Artificial Intelligence (AI) techniques represent a valuable tool to automate such a process by limiting the human factor and, as a consequence, reducing the classification error. Anyway, effectively addressing this problem requires the development of specific ap- proaches able to cope with different hard issues, i.e., unbalancing of the classes, lack of labelled data, capability of the models to process different types of data (e.g., text, images, etc.) and providing multi-class predictions on an high number of labels. In particular, Deep Learning (DL) paradigm [1] is considered a state-of-the-art solution for effectively address these issues. DL-based models can be exploited to extract accurate multi- label classification models by combining raw low-level data, gathered from a wide variety of sources (e.g., wikidata, IMDB, etc.). These models learn in a hierarchical fashion: several layers of non-linear processing units are stacked in a single network and each subsequent layer of the architecture can extract features with a higher level of abstraction compared to the previous one. Therefore, Deep Learning-based approaches allow to extract data abstractions and representations at different levels, they also represent a good choice for analyzing raw data provided in different formats and by different types of source. In this work we propose to combine different types of data from different publicly available data sources for classifying media contents and enrich them with informative labels. In Figure 1, we sketched the overall learning process. After an Information Retrieval phase in which data are gathered and wrapped in a single view, these raw data are provided as input to Machine Learning block. Our solution adopts a hierarchical Deep Learning based approach: on top, an ensemble of pre-trained models (Embedder) are fine-tuned and used to map the input (text and/or images) in a low-dimensional space. Here, the main idea is that contents with similar labels generate similar vector representations (embeddings). Then, a clustering algorithm (Clusterer) is used to group similar contents and yield sub-samples of the original dataset. Finally, each sub-sample is exploited to learn a local model focused on a limited set of labels that allows for yielding more accurate predictions for specific cases. Although our approach is totally general and capable to handle different type of data by adding specific models to the Embedder, the current implementation focuses on analyzing text data. An experimental evaluation conducted on a real dataset containing movie plots demonstrates the quality of our approach in providing accurate predictions in this challenging scenario. The rest of this paper is organized as follows: in Section 2 we provide an overview of the main approaches proposed in literature to tackle the automatic content tagging problem. In Section 3, we describe the framework used to address the problem and the deep learning architecture used to learn the multi-label classification model; while in Section 4 we discuss the experimental results. Section 5 concludes the work and introduces some new research lines. �Figure 1: Overview of the Learning Process. 2. Related Work The problem of classifying movies is not new in the literature and can be considered a general classification task on heterogeneous (video, images, audio, text) data. Wu et al. [2] process user reviews to extract relevant tags for movies. Afterward, they propagate these tags to less popular products according to the movie similarity based on multiple attributes (e.g.: title, summary). Hence, this work draws from the collaborative recommendation paradigm, while our proposal exploits deep metric learning and content-based techniques to solve the tag sparsity problem. Arevalo et al. [3] employ a neural architecture - inspired by recurrent units such as LSTM - named Gated Multimodal Unit (GMU) to effectively combine features coming from the poster image and the plot synopsis. They focus on solving the multi-modal fusion problem rather than the movie tagging itself. Indeed, their dataset contains fewer tags than ours. The work that most resembles our approach is [4], which makes use of plot synopses to predict tags in the realm of movies. They focus on modeling the plot text as an emotion flow - i.e.: a series of consecutive states of emotion. Their main conclusion is that incorporating the emotion flow increases the tag prediction quality with respect to naive approaches. Wehrmann and Barros [5] analyze movie trailers for performing multi-label genre classifica- tion. They explore the extraction of the audio and image features to establish spatio-temporal relationships between genres and the entire trailer. Similar to our approach, different learners are combined. Standalone models are trained separately for the image and the audio input, then they are fused using a weighted average. Anyhow, as stated by authors, the main limitation of the work relies in the use of only nine common movie genres. Fish et al. [6] highlight how a single movie genre hold back a large semantic that can be exploited to have a fine-grained description of the movie. The proposed model combines the embeddings produced by four pre-trained multi-modal ‘experts’ processing the audio and video of the movie. The training process is intended to improve the quality of the embeddings, i.e. the similarity between each movie clip and one of the 20 genres of the tags. Table 1 summarizes the most significant approaches among those described above. Compared to these approaches, there are some major differences with regards to the problem we aim to tackle: first, the tagging task is relative to a high number of labels. Second, this large number of labels exhibits a long-tail distribution, as illustrated also in Figure 4 and discussed later in the paper. To the best of our knowledge, our solution is the first approach that can handle large � Approach Dataset Number of tags DL architecture Data Type XMLC Multi-Modal Metric Result Kar et al. [4] MPST 71 LSTM Text y n Micro F1 0.37 Arevalo et al. [3] MM-IMDb 26 Multimodal Fusion with Pre-Trained nets Text, Image n y Micro F1 0.63 Arevalo et al. [3] MM-IMDb 26 Multimodal Fusion with Pre-Trained nets Text, Image n y Macro F1 0.54 Wehrmann et al. [5] LMTD 22 Multimodal Convolutional NN Audio, Image n y Micro AUC-PR 0.65 Wehrmann et al. [5] LMTD 22 Multimodal Convolutional NN Audio, Image n y Macro AUC-PR 0.74 Fish et al. [6] MMX-Trailer-20 20 Multimodal classifiers Audio, Image n y F1-weighted 0.60 Table 1 Analysis of current literature on Genre/Tag classification. amounts of labels (XMLC - eXtreme Multi-Label Classification) and process different types of data. 3. Framework In this section we illustrate our solution and the main components of the proposed DL-based architecture. As highlighted in Section 1, we adopted a hierarchical approach composed of three main components, as shown in Figure 2: (i) an Embedder, devoted to summarizing the original input into a vector representation (embedding); (ii) a cluster module (Clusterer) that allows for identifying media with similar contents and extracting focused sub-samples of the original dataset; and (iii) the local models that perform the final predictions. Embedder. As mentioned above, the current implementation of our technique works on text data (i.e., the movie plots), therefore our Embedder takes the form of a widely adopted (pre- trained) neural network i.e., BERT (Bidirectional Encoder Representations from Transformers) [7]. BERT is a transformer-based neural architecture able to process natural language and trained through an algorithm including two main steps, respectively named Word Masking and Next sentence prediction. In the former step, a percentage of the words composing a sentence is masked and the model is trained to predict the missing terms by considering the word context i.e., the terms that precede and follow the masked one. Then, the model is fine-tuned by considering a further task that allows for understanding the relations among the sentences. Basically, given two subsequent sentences, negative examples are created by replacing the second one with a random sentence. As regards the architecture, BERT can be figured out as a stack of transformer encoder layers that include multiple self attention “heads” [7]. In our framework, we use a BERT instance pre-trained on Wikipedia pages and the final embedding is obtained by averaging the output of the last four layers of the model. Notably, our BERT instance is further fine-tuned by adopting a Deep Metric Learning [8] based approach: three instances of the same architecture sharing the same weights are trained against triplets ⟨𝑎𝑛𝑐ℎ𝑜𝑟, 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒, 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒⟩. Basically, the term anchor refers the reference input whereas positive and negative represent other examples respectively similar and dissimilar to the anchor. The goal consists of minimizing the distance between the anchor and the positive example while, simultaneously, the distance between the anchor and the negative one is maximized. A customized version of the triplet loss for multi-label tasks is exploited in the learning phase. Specifically, we adopted a semi-hard negative mining approach that filters out negative instances which share more tags with the anchor w.r.t. the positive ones. At prediction time, only a model is used to compute the vector � Movie Plots Embedding Module Clustering Module Classification Module In 1981, at a bus stop in Savannah, Georgia, a man named … embedding A gang of criminals rob a Gotham City mob bank; the Joker manipulates them into murdering each other … . . . Amid a galactic civil war, Rebel Alliance spies have stolen plans to the Galactic Empire's Death Star … … Triplet Loss K-Means CB Loss Figure 2: Our proposed 3-step architecture comprises three modules: embedding, clustering, and classification. Each component can be modified to address different goals - e.g.: different data modalities, training strategies, or models. representation of the input data. The main benefit of this approach relies on the possibility of combinatorically increasing the input size and handle the lack of labelled examples. Clusterer. A clustering algorithm is involved in the framework to group similar movies, thus allowing the deployment of local classification models. Each cluster includes movies that share a minimal number of tags with respect to the whole label space. Hence, this phase further alleviates the extreme-classification problem. New instances, to be classified, are assigned to the closest cluster on the basis of a suitable distance metric (in our case, euclidean distance is adopted). As shown in Figure 2, we adopted the K-Means algorithm as our clusterer. Local Models. In our framework, local models take the form of neural networks too. Specifi- cally, we exploited the DNN-based architecture, shown in Figure 3, to provide more accurate predictions also for minority classes. The base building block of our model includes three types of layers: (i) a fully-connected dense layer equipped with Rectified Linear Unit (ReLU) activation function [9], for each node composing the layer, (ii) a batch-normalization layer for improving stability and performances of the current dense layer [10], (iii) and a dropout layer for reducing the overfitting problem [11]. Several instances of this base component can be stacked in a single model, in particular, in our experimentation we tested a solution with three instances. The output layer of the architectures is equipped with a sigmoid activation function[12] and a variable number of neurons depending on the number of labels falling on the cluster associated with the local model. Basically, the output layer provides a class probability for each label. To address the Class Imbalance Problem, each local model is trained by using the Class-Balanced (CB) loss proposed in [13]. Here, the main idea consists of weighting loss inversely with the effective number of samples per class. 4. Experimental Results In order to assess the quality of our approach in labelling movies with related tags, we conducted a preliminary experimentation on a real dataset extracted by fusing data from different data sources. In particular, first we illustrate the dataset and the challenges to address for providing accurate predictions in this scenario; then we describe the adopted evaluation protocol and � Fully Connected Batch Layer Normalization Dropout x1 – μ1 x1 σ1 x1 x2 – μ2 x2 σ2 x2 DNN Blocki+1 ~ Input Layer DNN Block1 Output layer DNN Blockm y x3 – μ3 . . . x3 σ3 x3 . . xN – μN . xN σN xN DNN Blocki (a) Overview of the model. (b) Layers composing a base DNN component Figure 3: Local Model architecture: overview (a) and details of the internal structure for each base component (b). Dataset Number of Number of Avg. number of Avg. number of Movies Unique Tags Tags per Movie words per Movie MoTags 5595 85 2.82 71 Table 2 Main statistics of the dataset. metrics; finally, we show an ablation study aiming at highlight the benefits of the proposed approach. Our approach has been evaluated on a novel media content dataset gathering different types of information on a movie catalogue (e.g., movie plot, trailer, poster, synopsis, tags, etc.). Specifically, we focused our analysis on text data (i.e., movie plots) and tags extracted by multiple open sources such as Wikipedia, Wikimedia, and TMDB. When available, an extended plot is associated with the movie otherwise it is replaced with its synopsis. As shown in Table 2, our dataset contains ∼5k movies, with about 3 tags per movie. The complete list of tags (85) is showed in Figure 4. It is important noting that a restricted number of tags (mainly the genres) occurs more frequently than others, that can be considered as keywords summarizing some aspects of the movie. In particular, we can see that the data distribution exhibits a long tail shape. The experiments are perfomed on a DGX machine equipped with V100 GPUs. The dataset is split in training and test set respectively with 70%/30% percentages. In more detail, the dataset is partitioned in a stratified fashion so to reduce the sampling error. Adam is used as optimizer while, as mentioned above, we exploits two loss functions to handle the imbalance problem i.e., the triplet loss and the CB-loss. The first one is used during the Embedder learning phase while the last is used for training the local models. As regards the Clusterer, the number of groups 𝑘 has been empirically determined to 20. As a result of the clustering phase, each local model can focus the learning on a limited number of tags, in particular the average number of tags per cluster is ∼ 12. In Table 3 we report F1 score averaged according to two different strategies, respectively named macro and micro [14]. In the former, the F1 is computed for each class and then it is � 3 10 occurrences 2 10 scifi LGBT musical biopic drama comedy action thriller crime violence dystopic magic horror romantic fantasy mystery war wartime western noir sexual serial killer splatter animation aviation monsters futuristic love chase space history prison politics nudity teen movie revenge espionage neo-noir martial arts terrorism zombies family suspenseful vampires time travel murder fight scenes sport children alien music creepy small town cult justice romantic comedy adventure comedy drama documentary supernatural inbreeding death surreal pirates melodrama cartoon gunfight loneliness religion erotic thriller horror comedy bank robbery disaster movie coming of age police investigation high school parenthood super hero true story death penalty alcohol addiction friendship action comedies prostitution pornography tag Figure 4: Label distribution. There are 85 tags comprising genres (e.g.: Drama, Comedy) and more specific keywords (e.g.: parenthood, small town). It shows a long-tail shape because of few tags (movie genres) occurring most of the times. Method Micro-F1 Macro-F1 BERT 0.176 0.025 BERT* 0.498 0.163 𝑜𝑢𝑟_𝑎𝑝𝑝𝑟𝑜𝑎𝑐ℎ 0.507 0.253 Table 3 Experimental results on movie tags prediction. BERT* denotes the version trained using Triplet Loss. averaged, whereas in the latter, the cumulative sum of the counts of various true/false posi- tive/negative is computed, and then the overall measure is calculated. While macro-averaging weights all classes equally, micro-averaging favors bigger classes. The results shown in table 3 highlight the poor performance of the base model, i.e. the model trained on all tags. It is unable to handle the high number of tags (i.e. classes) and provides inaccurate predictions. The comparison of the values of Micro-F1 and Macro-F1 highlight the influences of the majority classes such as Drama or Comedy on the overall performances. The low value of Macro-F1 show that the model is unable to detect the under-represented tags which are the majority in the dataset. The adoption of the triplet loss architecture allows for improving the performances of the base model, although the low value of the Macro-F1 indicates poor performances on the minority classes. Finally, the full approach, named in table as 𝑜𝑢𝑟_𝑎𝑝𝑝𝑟𝑜𝑎𝑐ℎ, allows for improving also the Macro-F1 value. 5. Conclusions and future work Enriching metadata with informative labels is a crucial task for the enterprises operating in the media content delivery field. However, automating this process requires to cope with different challenging issues. In this work we proposed a hierarchical DL-Based approach for extreme multi-label classification aiming at providing accurate predictions for movie tagging task. An experimentation conducted on a real dataset demonstrates the quality of the approach. As a pointer of further research, we aim at boosting the overall performance of the proposed �approach by integrating information coming from unlabeled data in a semi-supervised or self- supervised way. Also, active learning schemes can be fruitfully exploited by implementing ad-hoc oracle labeling strategies. Finally, we are interested to extend the experimentation for a fully multi-modal scenario by including heterogeneous data e.g., movie posters and trailers. Acknowledgments This work was supported by PON I&C 2014-2020 FESR MISE, Catch 4.0. References [1] Y. Le Cun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444. [2] C. Wu, C. Wang, Y. Zhou, D. Wu, M. Chen, J. H. Wang, J. Qin, Exploiting user reviews for automatic movie tagging, Multimedia Tools and Applications 79 (2020) 11399–11419. [3] J. Arevalo, T. Solorio, M. Montes-y Gómez, F. A. González, Gated multimodal units for information fusion, arXiv preprint arXiv:1702.01992 (2017). [4] S. Kar, S. Maharjan, T. Solorio, Folksonomication: Predicting tags for movies from plot synopses using emotion flow encoded neural network, in: Proceedings of the 27th Inter- national Conference on Computational Linguistics, 2018, pp. 2879–2891. [5] J. Wehrmann, R. C. Barros, Movie genre classification: A multi-label approach based on convolutions through time, Applied Soft Computing 61 (2017) 973–982. [6] E. Fish, J. Weinbren, A. Gilbert, Rethinking movie genre classification with fine-grained semantic clustering, arXiv preprint arXiv:2012.02639 (2020). [7] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423. [8] M. KAYA, H. S. BILGE, Deep metric learning: A survey, Symmetry 11 (2019). doi:10. 3390/sym11091066. [9] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th Int. Conf. on Machine Learning, ICML’10, 2010, pp. 807–814. [10] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proc. of the 32Nd Int. Conf. on Machine Learning - Volume 37, ICML’15, 2015, pp. 448–456. [11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958. [12] M. Guarascio, G. Manco, E. Ritacco, Deep learning, Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics 1-3 (2018) 634–647. [13] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, S. Belongie, Class-balanced loss based on effective number of samples, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9268–9277. [14] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45 (2009) 427–437. �
Towards Extreme Multi-Label Classification of Multimedia Content (Discussion Paper) Marco Minici1,2 , Francesco Sergio Pisani2 , Massimo Guarascio2 and Giuseppe Manco2 1 Università degli Studi di Pisa 2 ICAR-CNR, Rende (CS), Italy Abstract Providing rich and accurate metadata for indexing media content represents a major issue for enterprises offering streaming entertainment services. Metadata information are usually exploited to boost the search capabilities for relevant contents and as such it can be used by recommendation algorithms for yielding recommendation lists matching user interests. In this context, we investigate the problem of associating suitable labels (or tag) to multimedia contents, that can accurately describe the topics associated with such contents. This task is usually performed by domain experts in a fully manual fashion that makes the overall process time-consuming and susceptible to errors. In this work we propose a Deep Learning based framework for semi-automatic, multi-label and semi-supervised classification. By integrating different data types (e.g., text, images, etc.) the approach allows for tagging media contents with specific labels. A preliminary experimentation conducted on a real dataset demonstrates the quality of the approach in terms of predictive accuracy. Keywords Extreme Multi-Label Classification, Data Integration, Natural Language Processing, Semi-supervised Learning 1. Introduction Nowadays, entertainment industry represents one of the most profitable and widespread busi- ness sector, with a constant growth in terms of number of users. With estimated revenues amounting to about 2 trillion of dollars worldwide, providing effective research services is a crucial task for the companies operating in multimedia content delivery. In particular, the rise of streaming services and on-demand contents fostered the interest for AI-based solutions capable to facilitate the research and identification of contents matching the user interests. Just as an example, Recommender Systems (RS) are technologies widely adopted by big players (e.g., Netflix, Disney+, Amazon, etc.) to suggest items of their catalogues able to arouse users’ interest. SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ marco.minici@icar.cnr.it (M. Minici); giuseppe.manco@icar.cnr.it (F. S. Pisani); massimo.guarascio@icar.cnr.it (M. Guarascio); giuseppe.manco@icar.cnr.it (G. Manco) https://mminici.github.io (M. Minici) � 0000-0002-9641-8916 (M. Minici); 0000-0003-2922-0835 (F. S. Pisani); 0000-0001-7711-9833 (M. Guarascio); 0000-0001-9672-3833 (G. Manco) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) � Besides them, the technologies that allow for enriching content metadata with informative labels (or tags) act a key role as they can exploited to improve the RS performances and simultaneously enable a more effective research by means of the traditional research engines. Basically, these labels are used to group contents exhibiting common features and provide aggregated views for the users. However, the labelling task is a time-consuming and prone to the error process since it is manually performed by domain experts. Indeed, the lack of a common shared taxonomy can lead to yield repeated labels describing the same concept. Moreover, the assignment of a label to a content is subjective and depends on the skill and perception of the expert. In this scenario, Artificial Intelligence (AI) techniques represent a valuable tool to automate such a process by limiting the human factor and, as a consequence, reducing the classification error. Anyway, effectively addressing this problem requires the development of specific ap- proaches able to cope with different hard issues, i.e., unbalancing of the classes, lack of labelled data, capability of the models to process different types of data (e.g., text, images, etc.) and providing multi-class predictions on an high number of labels. In particular, Deep Learning (DL) paradigm [1] is considered a state-of-the-art solution for effectively address these issues. DL-based models can be exploited to extract accurate multi- label classification models by combining raw low-level data, gathered from a wide variety of sources (e.g., wikidata, IMDB, etc.). These models learn in a hierarchical fashion: several layers of non-linear processing units are stacked in a single network and each subsequent layer of the architecture can extract features with a higher level of abstraction compared to the previous one. Therefore, Deep Learning-based approaches allow to extract data abstractions and representations at different levels, they also represent a good choice for analyzing raw data provided in different formats and by different types of source. In this work we propose to combine different types of data from different publicly available data sources for classifying media contents and enrich them with informative labels. In Figure 1, we sketched the overall learning process. After an Information Retrieval phase in which data are gathered and wrapped in a single view, these raw data are provided as input to Machine Learning block. Our solution adopts a hierarchical Deep Learning based approach: on top, an ensemble of pre-trained models (Embedder) are fine-tuned and used to map the input (text and/or images) in a low-dimensional space. Here, the main idea is that contents with similar labels generate similar vector representations (embeddings). Then, a clustering algorithm (Clusterer) is used to group similar contents and yield sub-samples of the original dataset. Finally, each sub-sample is exploited to learn a local model focused on a limited set of labels that allows for yielding more accurate predictions for specific cases. Although our approach is totally general and capable to handle different type of data by adding specific models to the Embedder, the current implementation focuses on analyzing text data. An experimental evaluation conducted on a real dataset containing movie plots demonstrates the quality of our approach in providing accurate predictions in this challenging scenario. The rest of this paper is organized as follows: in Section 2 we provide an overview of the main approaches proposed in literature to tackle the automatic content tagging problem. In Section 3, we describe the framework used to address the problem and the deep learning architecture used to learn the multi-label classification model; while in Section 4 we discuss the experimental results. Section 5 concludes the work and introduces some new research lines. �Figure 1: Overview of the Learning Process. 2. Related Work The problem of classifying movies is not new in the literature and can be considered a general classification task on heterogeneous (video, images, audio, text) data. Wu et al. [2] process user reviews to extract relevant tags for movies. Afterward, they propagate these tags to less popular products according to the movie similarity based on multiple attributes (e.g.: title, summary). Hence, this work draws from the collaborative recommendation paradigm, while our proposal exploits deep metric learning and content-based techniques to solve the tag sparsity problem. Arevalo et al. [3] employ a neural architecture - inspired by recurrent units such as LSTM - named Gated Multimodal Unit (GMU) to effectively combine features coming from the poster image and the plot synopsis. They focus on solving the multi-modal fusion problem rather than the movie tagging itself. Indeed, their dataset contains fewer tags than ours. The work that most resembles our approach is [4], which makes use of plot synopses to predict tags in the realm of movies. They focus on modeling the plot text as an emotion flow - i.e.: a series of consecutive states of emotion. Their main conclusion is that incorporating the emotion flow increases the tag prediction quality with respect to naive approaches. Wehrmann and Barros [5] analyze movie trailers for performing multi-label genre classifica- tion. They explore the extraction of the audio and image features to establish spatio-temporal relationships between genres and the entire trailer. Similar to our approach, different learners are combined. Standalone models are trained separately for the image and the audio input, then they are fused using a weighted average. Anyhow, as stated by authors, the main limitation of the work relies in the use of only nine common movie genres. Fish et al. [6] highlight how a single movie genre hold back a large semantic that can be exploited to have a fine-grained description of the movie. The proposed model combines the embeddings produced by four pre-trained multi-modal ‘experts’ processing the audio and video of the movie. The training process is intended to improve the quality of the embeddings, i.e. the similarity between each movie clip and one of the 20 genres of the tags. Table 1 summarizes the most significant approaches among those described above. Compared to these approaches, there are some major differences with regards to the problem we aim to tackle: first, the tagging task is relative to a high number of labels. Second, this large number of labels exhibits a long-tail distribution, as illustrated also in Figure 4 and discussed later in the paper. To the best of our knowledge, our solution is the first approach that can handle large � Approach Dataset Number of tags DL architecture Data Type XMLC Multi-Modal Metric Result Kar et al. [4] MPST 71 LSTM Text y n Micro F1 0.37 Arevalo et al. [3] MM-IMDb 26 Multimodal Fusion with Pre-Trained nets Text, Image n y Micro F1 0.63 Arevalo et al. [3] MM-IMDb 26 Multimodal Fusion with Pre-Trained nets Text, Image n y Macro F1 0.54 Wehrmann et al. [5] LMTD 22 Multimodal Convolutional NN Audio, Image n y Micro AUC-PR 0.65 Wehrmann et al. [5] LMTD 22 Multimodal Convolutional NN Audio, Image n y Macro AUC-PR 0.74 Fish et al. [6] MMX-Trailer-20 20 Multimodal classifiers Audio, Image n y F1-weighted 0.60 Table 1 Analysis of current literature on Genre/Tag classification. amounts of labels (XMLC - eXtreme Multi-Label Classification) and process different types of data. 3. Framework In this section we illustrate our solution and the main components of the proposed DL-based architecture. As highlighted in Section 1, we adopted a hierarchical approach composed of three main components, as shown in Figure 2: (i) an Embedder, devoted to summarizing the original input into a vector representation (embedding); (ii) a cluster module (Clusterer) that allows for identifying media with similar contents and extracting focused sub-samples of the original dataset; and (iii) the local models that perform the final predictions. Embedder. As mentioned above, the current implementation of our technique works on text data (i.e., the movie plots), therefore our Embedder takes the form of a widely adopted (pre- trained) neural network i.e., BERT (Bidirectional Encoder Representations from Transformers) [7]. BERT is a transformer-based neural architecture able to process natural language and trained through an algorithm including two main steps, respectively named Word Masking and Next sentence prediction. In the former step, a percentage of the words composing a sentence is masked and the model is trained to predict the missing terms by considering the word context i.e., the terms that precede and follow the masked one. Then, the model is fine-tuned by considering a further task that allows for understanding the relations among the sentences. Basically, given two subsequent sentences, negative examples are created by replacing the second one with a random sentence. As regards the architecture, BERT can be figured out as a stack of transformer encoder layers that include multiple self attention “heads” [7]. In our framework, we use a BERT instance pre-trained on Wikipedia pages and the final embedding is obtained by averaging the output of the last four layers of the model. Notably, our BERT instance is further fine-tuned by adopting a Deep Metric Learning [8] based approach: three instances of the same architecture sharing the same weights are trained against triplets ⟨𝑎𝑛𝑐ℎ𝑜𝑟, 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒, 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒⟩. Basically, the term anchor refers the reference input whereas positive and negative represent other examples respectively similar and dissimilar to the anchor. The goal consists of minimizing the distance between the anchor and the positive example while, simultaneously, the distance between the anchor and the negative one is maximized. A customized version of the triplet loss for multi-label tasks is exploited in the learning phase. Specifically, we adopted a semi-hard negative mining approach that filters out negative instances which share more tags with the anchor w.r.t. the positive ones. At prediction time, only a model is used to compute the vector � Movie Plots Embedding Module Clustering Module Classification Module In 1981, at a bus stop in Savannah, Georgia, a man named … embedding A gang of criminals rob a Gotham City mob bank; the Joker manipulates them into murdering each other … . . . Amid a galactic civil war, Rebel Alliance spies have stolen plans to the Galactic Empire's Death Star … … Triplet Loss K-Means CB Loss Figure 2: Our proposed 3-step architecture comprises three modules: embedding, clustering, and classification. Each component can be modified to address different goals - e.g.: different data modalities, training strategies, or models. representation of the input data. The main benefit of this approach relies on the possibility of combinatorically increasing the input size and handle the lack of labelled examples. Clusterer. A clustering algorithm is involved in the framework to group similar movies, thus allowing the deployment of local classification models. Each cluster includes movies that share a minimal number of tags with respect to the whole label space. Hence, this phase further alleviates the extreme-classification problem. New instances, to be classified, are assigned to the closest cluster on the basis of a suitable distance metric (in our case, euclidean distance is adopted). As shown in Figure 2, we adopted the K-Means algorithm as our clusterer. Local Models. In our framework, local models take the form of neural networks too. Specifi- cally, we exploited the DNN-based architecture, shown in Figure 3, to provide more accurate predictions also for minority classes. The base building block of our model includes three types of layers: (i) a fully-connected dense layer equipped with Rectified Linear Unit (ReLU) activation function [9], for each node composing the layer, (ii) a batch-normalization layer for improving stability and performances of the current dense layer [10], (iii) and a dropout layer for reducing the overfitting problem [11]. Several instances of this base component can be stacked in a single model, in particular, in our experimentation we tested a solution with three instances. The output layer of the architectures is equipped with a sigmoid activation function[12] and a variable number of neurons depending on the number of labels falling on the cluster associated with the local model. Basically, the output layer provides a class probability for each label. To address the Class Imbalance Problem, each local model is trained by using the Class-Balanced (CB) loss proposed in [13]. Here, the main idea consists of weighting loss inversely with the effective number of samples per class. 4. Experimental Results In order to assess the quality of our approach in labelling movies with related tags, we conducted a preliminary experimentation on a real dataset extracted by fusing data from different data sources. In particular, first we illustrate the dataset and the challenges to address for providing accurate predictions in this scenario; then we describe the adopted evaluation protocol and � Fully Connected Batch Layer Normalization Dropout x1 – μ1 x1 σ1 x1 x2 – μ2 x2 σ2 x2 DNN Blocki+1 ~ Input Layer DNN Block1 Output layer DNN Blockm y x3 – μ3 . . . x3 σ3 x3 . . xN – μN . xN σN xN DNN Blocki (a) Overview of the model. (b) Layers composing a base DNN component Figure 3: Local Model architecture: overview (a) and details of the internal structure for each base component (b). Dataset Number of Number of Avg. number of Avg. number of Movies Unique Tags Tags per Movie words per Movie MoTags 5595 85 2.82 71 Table 2 Main statistics of the dataset. metrics; finally, we show an ablation study aiming at highlight the benefits of the proposed approach. Our approach has been evaluated on a novel media content dataset gathering different types of information on a movie catalogue (e.g., movie plot, trailer, poster, synopsis, tags, etc.). Specifically, we focused our analysis on text data (i.e., movie plots) and tags extracted by multiple open sources such as Wikipedia, Wikimedia, and TMDB. When available, an extended plot is associated with the movie otherwise it is replaced with its synopsis. As shown in Table 2, our dataset contains ∼5k movies, with about 3 tags per movie. The complete list of tags (85) is showed in Figure 4. It is important noting that a restricted number of tags (mainly the genres) occurs more frequently than others, that can be considered as keywords summarizing some aspects of the movie. In particular, we can see that the data distribution exhibits a long tail shape. The experiments are perfomed on a DGX machine equipped with V100 GPUs. The dataset is split in training and test set respectively with 70%/30% percentages. In more detail, the dataset is partitioned in a stratified fashion so to reduce the sampling error. Adam is used as optimizer while, as mentioned above, we exploits two loss functions to handle the imbalance problem i.e., the triplet loss and the CB-loss. The first one is used during the Embedder learning phase while the last is used for training the local models. As regards the Clusterer, the number of groups 𝑘 has been empirically determined to 20. As a result of the clustering phase, each local model can focus the learning on a limited number of tags, in particular the average number of tags per cluster is ∼ 12. In Table 3 we report F1 score averaged according to two different strategies, respectively named macro and micro [14]. In the former, the F1 is computed for each class and then it is � 3 10 occurrences 2 10 scifi LGBT musical biopic drama comedy action thriller crime violence dystopic magic horror romantic fantasy mystery war wartime western noir sexual serial killer splatter animation aviation monsters futuristic love chase space history prison politics nudity teen movie revenge espionage neo-noir martial arts terrorism zombies family suspenseful vampires time travel murder fight scenes sport children alien music creepy small town cult justice romantic comedy adventure comedy drama documentary supernatural inbreeding death surreal pirates melodrama cartoon gunfight loneliness religion erotic thriller horror comedy bank robbery disaster movie coming of age police investigation high school parenthood super hero true story death penalty alcohol addiction friendship action comedies prostitution pornography tag Figure 4: Label distribution. There are 85 tags comprising genres (e.g.: Drama, Comedy) and more specific keywords (e.g.: parenthood, small town). It shows a long-tail shape because of few tags (movie genres) occurring most of the times. Method Micro-F1 Macro-F1 BERT 0.176 0.025 BERT* 0.498 0.163 𝑜𝑢𝑟_𝑎𝑝𝑝𝑟𝑜𝑎𝑐ℎ 0.507 0.253 Table 3 Experimental results on movie tags prediction. BERT* denotes the version trained using Triplet Loss. averaged, whereas in the latter, the cumulative sum of the counts of various true/false posi- tive/negative is computed, and then the overall measure is calculated. While macro-averaging weights all classes equally, micro-averaging favors bigger classes. The results shown in table 3 highlight the poor performance of the base model, i.e. the model trained on all tags. It is unable to handle the high number of tags (i.e. classes) and provides inaccurate predictions. The comparison of the values of Micro-F1 and Macro-F1 highlight the influences of the majority classes such as Drama or Comedy on the overall performances. The low value of Macro-F1 show that the model is unable to detect the under-represented tags which are the majority in the dataset. The adoption of the triplet loss architecture allows for improving the performances of the base model, although the low value of the Macro-F1 indicates poor performances on the minority classes. Finally, the full approach, named in table as 𝑜𝑢𝑟_𝑎𝑝𝑝𝑟𝑜𝑎𝑐ℎ, allows for improving also the Macro-F1 value. 5. Conclusions and future work Enriching metadata with informative labels is a crucial task for the enterprises operating in the media content delivery field. However, automating this process requires to cope with different challenging issues. In this work we proposed a hierarchical DL-Based approach for extreme multi-label classification aiming at providing accurate predictions for movie tagging task. An experimentation conducted on a real dataset demonstrates the quality of the approach. As a pointer of further research, we aim at boosting the overall performance of the proposed �approach by integrating information coming from unlabeled data in a semi-supervised or self- supervised way. Also, active learning schemes can be fruitfully exploited by implementing ad-hoc oracle labeling strategies. Finally, we are interested to extend the experimentation for a fully multi-modal scenario by including heterogeneous data e.g., movie posters and trailers. Acknowledgments This work was supported by PON I&C 2014-2020 FESR MISE, Catch 4.0. References [1] Y. Le Cun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444. [2] C. Wu, C. Wang, Y. Zhou, D. Wu, M. Chen, J. H. Wang, J. Qin, Exploiting user reviews for automatic movie tagging, Multimedia Tools and Applications 79 (2020) 11399–11419. [3] J. Arevalo, T. Solorio, M. Montes-y Gómez, F. A. González, Gated multimodal units for information fusion, arXiv preprint arXiv:1702.01992 (2017). [4] S. Kar, S. Maharjan, T. Solorio, Folksonomication: Predicting tags for movies from plot synopses using emotion flow encoded neural network, in: Proceedings of the 27th Inter- national Conference on Computational Linguistics, 2018, pp. 2879–2891. [5] J. Wehrmann, R. C. Barros, Movie genre classification: A multi-label approach based on convolutions through time, Applied Soft Computing 61 (2017) 973–982. [6] E. Fish, J. Weinbren, A. Gilbert, Rethinking movie genre classification with fine-grained semantic clustering, arXiv preprint arXiv:2012.02639 (2020). [7] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423. [8] M. KAYA, H. S. BILGE, Deep metric learning: A survey, Symmetry 11 (2019). doi:10. 3390/sym11091066. [9] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th Int. Conf. on Machine Learning, ICML’10, 2010, pp. 807–814. [10] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proc. of the 32Nd Int. Conf. on Machine Learning - Volume 37, ICML’15, 2015, pp. 448–456. [11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958. [12] M. Guarascio, G. Manco, E. Ritacco, Deep learning, Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics 1-3 (2018) 634–647. [13] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, S. Belongie, Class-balanced loss based on effective number of samples, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9268–9277. [14] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45 (2009) 427–437. �