Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper60 |
wikidataid | Q117344891→Q117344891 |
title | The Challenging Reproducibility Task in Recommender Systems Research between Traditional and Deep Learning Models |
pdfUrl | https://ceur-ws.org/Vol-3194/paper60.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/AnelliBFMMPDSN22 |
volume | Vol-3194→Vol-3194 |
session | → |
The Challenging Reproducibility Task in Recommender Systems Research between Traditional and Deep Learning Models

Discussion Paper

Vito Walter Anelli¹, Alejandro Bellogín², Antonio Ferrara¹, Daniele Malitesta¹, Felice Antonio Merra⁴, Claudio Pomo¹, Francesco Maria Donini³, Eugenio Di Sciascio¹ and Tommaso Di Noia¹

¹ Politecnico di Bari, via Orabona, 4, 70125 Bari, Italy
² Universidad Autónoma de Madrid, Ciudad Universitaria de Cantoblanco, 28049 Madrid, Spain
³ Università degli Studi della Tuscia, via Santa Maria in Gradi, 4, 01100 Viterbo, Italy
⁴ Amazon Science Berlin, Invalidenstraße 75, 10557 Berlin, Germany

Abstract

Recommender Systems have been shown to be a useful tool for reducing over-choice and providing accurate, personalized suggestions. On the other hand, the large variety of available recommendation algorithms, splitting techniques, assessment protocols, metrics, and tasks has made thorough experimental evaluation extremely difficult. Elliot is a comprehensive recommendation framework whose goal is to run and reproduce an entire experimental pipeline from a single configuration file. The framework offers a variety of strategies to load, filter, and split data. Elliot optimizes hyper-parameters for a variety of recommendation algorithms, selects the best models, compares them with baselines, computes metrics ranging from accuracy to beyond-accuracy, bias, and fairness, and performs statistical analysis. The aim is to provide researchers with a tool that eases all the phases of experimental evaluation (and makes them reproducible), from data reading to results collection. Elliot is freely available on GitHub at https://github.com/sisinflab/elliot.

Keywords: Recommender Systems, Reproducibility, Adversarial Learning, Visual Recommenders, Knowledge Graphs

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy. Contact: vitowalter.anelli@poliba.it (V. W. Anelli); claudio.pomo@poliba.it (C. Pomo).

1. Introduction

In the last decade, Recommender Systems (RSs) have gained momentum as the pivotal choice for personalized decision-support systems. Recommendation is essentially a retrieval task in which a catalog of items is ranked in a personalized way and the top-scoring items are presented to the user [1]. Once the ability of RSs to provide personalized items to clients had been demonstrated, both academia and industry began to devote attention to them [2]. This collective effort resulted in an impressive number of recommendation algorithms, ranging from memory-based [3] to latent factor-based [4, 5], as well as deep learning-based methods [6]. At the same time, the RS research community realized that focusing only on the accuracy of results could be detrimental, and started exploring beyond-accuracy evaluation [7]. As accuracy was recognized as insufficient to guarantee users' satisfaction [8], novelty and diversity [9, 10] came into play as new dimensions to be analyzed when comparing algorithms. However, this was only the first step in the direction of a more comprehensive evaluation. Indeed, more recently, the presence of biased [11] and unfair recommendations towards user groups and item categories [12] has been widely investigated.
The abundance of possible choices has generated confusion around choosing the correct baselines, conducting the hyperparameter optimization and the experimental evaluation [13], and reporting the details of the adopted procedure. Consequently, two major concerns have arisen: unreproducible evaluation and unfair comparisons [14]. The advent of various frameworks over the last decade has improved the research process, and the RS community has gradually embraced the emergence of recommendation, evaluation, and even hyperparameter tuning frameworks. Starting from 2011, MyMediaLite [15], LensKit [16], LightFM [17], RankSys [9], and Surprise [18] have formed the basic software for rapid prototyping and testing of recommendation models, thanks to easy-to-use model execution and implementations of standard accuracy and beyond-accuracy evaluation measures and splitting techniques. However, the outstanding success of, and community interest in, Deep Learning (DL) recommendation models raised the need for novel instruments. LibRec [19], Spotlight¹, and OpenRec [20] were the first open-source projects to make DL-based recommenders available, with fewer than a dozen available models but, unfortunately, without filtering, splitting, and hyper-parameter tuning strategies. An important step towards a more exhaustive and up-to-date set of model implementations has been taken with the RecQ [21], DeepRec [22], and Cornac [23] frameworks. However, they do not provide a general tool for extensive experiments covering the pre-processing and the evaluation of a dataset. Indeed, after the reproducibility hype [24, 25], DaisyRec [14] and RecBole [26] raised the bar of framework capabilities, making available large sets of models, data filtering/splitting operations, and, above all, hyper-parameter tuning features. Unfortunately, even though these frameworks are a great help to researchers, facilitating reproducibility or extending the provided functionality typically depends on developing bash scripts or programming in whatever language each framework is written in.

This is where Elliot enters the stage. It is a novel kind of recommendation framework, aimed at overcoming these obstacles by proposing a fully declarative approach (by means of a configuration file) to the set-up of an experimental setting. It analyzes the recommendation problem from the researcher's perspective, as it implements the whole experimental pipeline, from dataset loading to results gathering, in a principled way. The main idea behind Elliot is to keep an entire experiment reproducible and to put the user (in our case, a researcher or RS developer) in control of the framework. To date, depending on the recommendation model, Elliot allows choosing among 27 similarity metrics, defining multiple neural architectures, and choosing among 51 combined hyperparameter tuning approaches, unleashing the full potential of the HyperOpt library [27]. To enable evaluation across diverse tasks and domains, Elliot supplies 36 metrics (including Accuracy, Error-based, Coverage, Novelty, Diversity, Bias, and Fairness metrics), 13 splitting strategies, and 8 prefiltering policies.

¹ https://github.com/maciejkula/spotlight
Figure 1: Overview of Elliot. (The original figure depicts the pipeline driven by the configuration file: the Data modules (Loading of ratings and side information, Prefiltering with filter-by-rating and 𝑘-core, Splitting with Temporal, Random, and Fix strategies) feed the Run module (Recommendation, Hyperoptimization, model restore); the Evaluation modules compute Metrics (Accuracy, Error, Coverage, Novelty, Diversity, Bias, Fairness) and Statistical Tests (paired t-test, Wilcoxon); the Output module produces Performance Tables, Model Weights, and Recommendation Lists.)

2. Framework

Elliot is an extendable framework made up of eight functional modules, each of which is in charge of a different phase of the experimental recommendation process. The user is only required to provide a human-level description of the experimental flow via a customisable configuration file, so what happens under the hood (Figure 1) is transparent to them. As a result, Elliot constructs the whole pipeline. What follows presents each of Elliot's modules and how to create a configuration file.

2.1. Data Preparation

The Data modules are in charge of handling and organizing the experiment's input, as well as providing a variety of supplementary data, such as item characteristics, visual embeddings, and images. After being loaded by the Loading module, the input data is taken over by the Prefiltering and Splitting modules, whose techniques are described in Sections 2.1.2 and 2.1.3, respectively.

2.1.1. Loading

RS investigations may require different data sources, such as user-item feedback, or side information, such as item visual aspects. Elliot comes with a variety of Loading module implementations to meet these requirements. Furthermore, the user may create computationally intensive prefiltering and splitting operations that can be saved and reloaded to save time in future experiments. Additional data, such as visual characteristics and semantic features extracted from knowledge graphs, can be handled through data-driven extensions. When a side-information-aware Loading module is selected, it filters out items that lack the required side information, so as to guarantee a fair comparison.

2.1.2. Prefiltering

After data loading, Elliot provides data filtering procedures based on two different strategies. Filter-by-rating, the first method implemented in the Prefiltering module, removes a user-item interaction if the preference score falls below a given threshold. The threshold can be a numerical value (e.g., 3.5), a distributional value (e.g., the global rating average), or a user-based distributional (User Dist.) value (e.g., the user's average rating). The 𝑘-core prefiltering approach eliminates users, items, or both with fewer than 𝑘 recorded interactions. The 𝑘-core technique can be applied to both users and items repeatedly (Iterative 𝑘-core) until the 𝑘-core filtering requirement is fulfilled, i.e., all users and items have at least 𝑘 recorded interactions. Since reaching such a condition might be intractable, Elliot allows specifying the maximum number of iterations (Iter-𝑛-rounds). Finally, the Cold-Users filtering feature allows retaining cold users only.
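To make the two prefiltering policies concrete, the following is a minimal, self-contained sketch in plain pandas (not Elliot's internal API) of filter-by-rating with a numerical threshold and of the iterative 𝑘-core procedure; the column names and the toy data are illustrative assumptions.

```python
# Illustrative sketch (not Elliot's code): filter-by-rating and iterative
# k-core prefiltering on a (user, item, rating) interaction table.
import pandas as pd

def filter_by_rating(df: pd.DataFrame, threshold: float = 3.5) -> pd.DataFrame:
    """Drop interactions whose preference score falls below the threshold."""
    return df[df["rating"] >= threshold]

def iterative_k_core(df: pd.DataFrame, k: int = 2, max_rounds: int = 10) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k interactions until the
    k-core condition holds or max_rounds is reached (the Iter-n-rounds idea)."""
    for _ in range(max_rounds):
        user_counts = df.groupby("user")["item"].transform("count")
        item_counts = df.groupby("item")["user"].transform("count")
        mask = (user_counts >= k) & (item_counts >= k)
        if mask.all():
            break
        df = df[mask]
    return df

# Toy data: one low-rated interaction is removed by the threshold, the rest
# already satisfy the 2-core condition.
ratings = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3],
    "item":   [10, 11, 10, 11, 12],
    "rating": [4.0, 5.0, 4.5, 4.0, 2.0],
})
print(iterative_k_core(filter_by_rating(ratings), k=2))
```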
2.1.3. Splitting

Elliot implements three splitting strategies: (i) Temporal, (ii) Random, and (iii) Fix. The Temporal strategy divides user-item interactions based on the transaction timestamp, either by fixing the timestamp, selecting the best one [28, 29], or using a hold-out (HO) mechanism. Hold-out (HO), 𝐾-repeated hold-out (K-HO), and cross-validation (CV) are all part of the Random strategies. Finally, the Fix strategy leverages an already split dataset.
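As an illustration of the temporal and random hold-out mechanisms mentioned above, here is a minimal sketch, again in plain pandas rather than Elliot's code; the per-user temporal split reserves each user's most recent interactions for the test set, and the column names are assumptions.

```python
# Illustrative sketch (not Elliot's code): temporal and random hold-out
# splitting of a timestamped (user, item, timestamp) interaction table.
import pandas as pd

def temporal_holdout(df: pd.DataFrame, test_ratio: float = 0.2):
    """Per user, put the most recent fraction of interactions in the test set."""
    df = df.sort_values(["user", "timestamp"])
    ranks = df.groupby("user").cumcount(ascending=False)       # 0 = most recent
    sizes = df.groupby("user")["item"].transform("count")
    n_test = (sizes * test_ratio).round().clip(lower=1)        # at least one test item
    test_mask = ranks < n_test
    return df[~test_mask], df[test_mask]

def random_holdout(df: pd.DataFrame, test_ratio: float = 0.2, seed: int = 42):
    """Plain random hold-out over the whole interaction set."""
    test = df.sample(frac=test_ratio, random_state=seed)
    return df.drop(test.index), test

interactions = pd.DataFrame({
    "user":      [1, 1, 1, 2, 2],
    "item":      [10, 11, 12, 10, 13],
    "timestamp": [100, 200, 300, 150, 250],
})
train, test = temporal_holdout(interactions, test_ratio=0.3)
print(test)  # the most recent interaction of each user
```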
2.2. Recommendation Models

After data loading and pre-processing, the Recommendation module (Figure 1) provides the functionalities to train (and restore) both Elliot's state-of-the-art recommendation models and custom user-implemented models, with the possibility of finding the best hyper-parameter setting.

2.2.1. Implemented Models

To date, Elliot integrates around 50 recommendation models grouped into two sets: (i) popular models implemented in at least two of the other reviewed frameworks, and (ii) other well-known state-of-the-art recommendation models which are less common in the reviewed frameworks, such as autoencoder-based, e.g., [6], graph-based, e.g., [30], visually-aware [31], e.g., [32], adversarially-robust, e.g., [33], and content-aware, e.g., [34, 35], models.

2.2.2. Hyper-parameter Tuning

According to Rendle et al. [25] and Anelli et al. [36], hyper-parameter optimization has a significant impact on performance. Elliot supplies Grid Search, Simulated Annealing, Bayesian Optimization, and Random Search, supporting four different traversal techniques of the search space. Grid Search is automatically inferred when the user specifies lists of values for the available hyper-parameters.
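Since these search strategies build on the HyperOpt library [27], a short sketch of how such a search space can be explored with HyperOpt may help. The objective function `evaluate_ndcg` is a hypothetical placeholder for training a model and measuring its validation nDCG, and the hyper-parameter names are illustrative rather than Elliot's configuration keys.

```python
# Illustrative sketch of hyper-parameter tuning with the HyperOpt library.
from hyperopt import Trials, fmin, hp, space_eval, tpe

def evaluate_ndcg(similarity: str, neighbors: int) -> float:
    # Hypothetical placeholder: train an ItemKNN-like model with these
    # hyper-parameters and return its validation nDCG.
    return 0.25 + (0.05 if similarity == "cosine" else 0.0) - abs(neighbors - 100) / 2000

space = {
    "similarity": hp.choice("similarity", ["cosine", "jaccard"]),
    "neighbors": hp.quniform("neighbors", 10, 200, 10),
}

def objective(params: dict) -> float:
    # HyperOpt minimizes the objective, so the validation metric is negated.
    return -evaluate_ndcg(params["similarity"], int(params["neighbors"]))

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=trials)
print(space_eval(space, best))  # e.g. {'neighbors': 100.0, 'similarity': 'cosine'}
```

Swapping `tpe.suggest` for HyperOpt's random-search suggester would give a random traversal of the same space; a plain grid can be enumerated exhaustively when only lists of values are given.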
2.3. Performance Evaluation

After the training phase, Elliot continues its operations by evaluating the recommendations. Figure 1 indicates this phase with two distinct evaluation modules: Metrics and Statistical Tests.

2.3.1. Metrics

Elliot provides a set of 36 evaluation metrics, partitioned into seven families: Accuracy, Error, Coverage, Novelty, Diversity, Bias, and Fairness. It is worth mentioning that Elliot is the framework exposing the largest number of metrics and the only one considering bias and fairness measures. Moreover, the practitioner can choose any metric to drive the model selection and the tuning.

2.3.2. Statistical Tests

None of the other cited frameworks supports statistical hypothesis tests, probably because of the need to compute fine-grained (e.g., per-user or per-partition) results and retain them for each recommendation model. Conversely, Elliot enables the computation of two statistical hypothesis tests, i.e., the Wilcoxon and the paired t-test, with a flag in the configuration file.
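Both tests operate on paired, fine-grained results such as per-user metric values. A minimal sketch with SciPy (not Elliot's implementation, and with synthetic per-user nDCG scores standing in for real results) looks as follows.

```python
# Illustrative sketch: paired significance tests over per-user metric values
# (e.g., per-user nDCG) for two recommenders evaluated on the same users.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ndcg_model_a = rng.uniform(0.2, 0.6, size=500)                      # synthetic per-user scores
ndcg_model_b = ndcg_model_a + rng.normal(0.02, 0.05, size=500)      # model B slightly better on average

t_stat, t_p = stats.ttest_rel(ndcg_model_b, ndcg_model_a)           # paired t-test
w_stat, w_p = stats.wilcoxon(ndcg_model_b, ndcg_model_a)            # Wilcoxon signed-rank test
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```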
2.4. Framework Outcomes

When the training of the recommenders is over, Elliot uses the Output module to gather the results. Three types of output files can be generated: (i) Performance Tables, (ii) Model Weights, and (iii) Recommendation Lists. Performance Tables come in the form of spreadsheets, including all the metric values computed on the test set for each recommendation model given in the configuration file. Cut-off-specific and model-specific tables are included in a final report (i.e., considering each combination of the explored parameters). Statistical hypothesis tests are also reported in the tables, and a JSON file summarizes the optimal model parameters. Optionally, Elliot stores the model weights for the sake of future re-training.

2.5. Preparation of the Experiment

Elliot is triggered by a single configuration file written in YAML (e.g., refer to the toy example sample_hello_world.yml). The first section details the data loading, filtering, and splitting information defined in Section 2.1. The models section represents the recommendation models' configuration, e.g., Item-𝑘NN. Here, the model-specific hyperparameter optimization strategies are specified, e.g., the grid-search. The evaluation section details the evaluation strategy with the desired metrics, e.g., nDCG in the toy example. Finally, keys such as save_recs and top_k control, for example, the Output module capabilities described in Section 2.4.
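For completeness, the following sketch shows how such a configuration file is typically handed to the framework. The import reflects the entry point advertised in the project's README at the time of writing, while the configuration path is an illustrative assumption; see sample_hello_world.yml in the repository for the authoritative file layout and key names.

```python
# Minimal sketch of launching a declarative Elliot experiment.
# The referenced YAML file is assumed to contain the sections described above:
# data loading/filtering/splitting, a models section (e.g., Item-kNN with a
# grid of hyper-parameters), the evaluation metrics (e.g., nDCG), and output
# options such as save_recs and top_k.
from elliot.run import run_experiment

run_experiment("sample_hello_world.yml")  # path is an assumption; adapt to your checkout
```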
3. Conclusion and Future Work

Elliot is a framework that looks at the recommendation process through the eyes of an RS researcher. To undertake a thorough and repeatable experimental assessment, the user only has to write a flexible configuration file. The framework includes several loading, prefiltering, and splitting strategies, hyperparameter optimization approaches, recommendation models, and statistical hypothesis tests. Elliot reports may be analyzed and used directly in research papers. We reviewed the RS evaluation literature, putting Elliot in the context of the other frameworks, and highlighted its benefits and drawbacks. Following that, we looked at the framework's design and how to create a functional (and repeatable) experimental benchmark. Elliot is the only recommendation framework we are aware of that supports a full multi-recommender experimental pipeline from a single configuration file. We intend to expand the framework in the near future to incorporate sequential recommendation scenarios, adversarial attacks, reinforcement learning-based recommendation systems, differential privacy facilities, sampling assessment, and distributed recommendation, among other things.

References

[1] W. Krichene, S. Rendle, On sampled metrics for item recommendation, in: R. Gupta, Y. Liu, J. Tang, B. A. Prakash (Eds.), KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, ACM, 2020, pp. 1748–1757. URL: https://dl.acm.org/doi/10.1145/3394486.3403226.
[2] J. Bennett, S. Lanning, The Netflix prize, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007, ACM, 2007.
[3] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: V. Y. Shen, N. Saito, M. R. Lyu, M. E. Zurko (Eds.), WWW 2001, ACM, 2001, pp. 285–295. doi:10.1145/371920.372071.
[4] Y. Koren, R. M. Bell, Advances in collaborative filtering, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook, Springer, 2015, pp. 77–118. doi:10.1007/978-1-4899-7637-6_3.
[5] S. Rendle, Factorization machines, in: G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, X. Wu (Eds.), ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010, IEEE Computer Society, 2010, pp. 995–1000. doi:10.1109/ICDM.2010.127.
[6] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: P. Champin, F. L. Gandon, M. Lalmas, P. G. Ipeirotis (Eds.), WWW 2018, ACM, 2018, pp. 689–698. doi:10.1145/3178876.3186150.
[7] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: B. Mobasher, R. D. Burke, D. Jannach, G. Adomavicius (Eds.), RecSys 2011, ACM, 2011, pp. 109–116. URL: https://dl.acm.org/citation.cfm?id=2043955.
[8] S. M. McNee, J. Riedl, J. A. Konstan, Being accurate is not enough: how accuracy metrics have hurt recommender systems, in: G. M. Olson, R. Jeffries (Eds.), Extended Abstracts Proceedings of the 2006 Conference on Human Factors in Computing Systems, CHI 2006, Montréal, Québec, Canada, April 22-27, 2006, ACM, 2006, pp. 1097–1101. doi:10.1145/1125451.1125659.
[9] S. Vargas, Novelty and diversity enhancement and evaluation in recommender systems and information retrieval, in: S. Geva, A. Trotman, P. Bruza, C. L. A. Clarke, K. Järvelin (Eds.), SIGIR 2014, ACM, 2014, p. 1281. doi:10.1145/2600428.2610382.
[10] P. Castells, N. J. Hurley, S. Vargas, Novelty and diversity in recommender systems, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook, Springer, 2015, pp. 881–918. doi:10.1007/978-1-4899-7637-6_26.
[11] Z. Zhu, Y. He, X. Zhao, Y. Zhang, J. Wang, J. Caverlee, Popularity-opportunity bias in collaborative filtering, in: WSDM 2021, ACM, 2021. doi:10.1145/3437963.3441820.
[12] Y. Deldjoo, V. W. Anelli, H. Zamani, A. Bellogín, T. Di Noia, A flexible framework for evaluating user and item fairness in recommender systems, User Modeling and User-Adapted Interaction (2020) 1–47.
[13] A. Said, A. Bellogín, Comparative recommender system evaluation: benchmarking recommendation frameworks, in: A. Kobsa, M. X. Zhou, M. Ester, Y. Koren (Eds.), RecSys 2014, ACM, 2014, pp. 129–136. doi:10.1145/2645710.2645746.
[14] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? benchmarking recommendation for reproducible evaluation and fair comparison, in: R. L. T. Santos, L. B. Marinho, E. M. Daly, L. Chen, K. Falk, N. Koenigstein, E. S. de Moura (Eds.), RecSys 2020, ACM, 2020, pp. 23–32. doi:10.1145/3383313.3412489.
[15] Z. Gantner, S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, MyMediaLite: a free recommender system library, in: B. Mobasher, R. D. Burke, D. Jannach, G. Adomavicius (Eds.), RecSys 2011, ACM, 2011, pp. 305–308. doi:10.1145/2043932.2043989.
[16] M. D. Ekstrand, LensKit for Python: Next-generation software for recommender systems experiments, in: M. d'Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM 2020, ACM, 2020, pp. 2999–3006. doi:10.1145/3340531.3412778.
[17] M. Kula, Metadata embeddings for user and item cold-start recommendations, in: T. Bogers, M. Koolen (Eds.), Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with the 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, September 16-20, 2015, volume 1448 of CEUR Workshop Proceedings, CEUR-WS.org, 2015, pp. 14–21. URL: http://ceur-ws.org/Vol-1448/paper4.pdf.
[18] N. Hug, Surprise: A Python library for recommender systems, J. Open Source Softw. 5 (2020) 2174. doi:10.21105/joss.02174.
[19] G. Guo, J. Zhang, Z. Sun, N. Yorke-Smith, LibRec: A Java library for recommender systems, in: A. I. Cristea, J. Masthoff, A. Said, N. Tintarev (Eds.), Posters, Demos, Late-breaking Results and Workshop Proceedings of the 23rd Conference on User Modeling, Adaptation, and Personalization (UMAP 2015), Dublin, Ireland, June 29 - July 3, 2015, volume 1388 of CEUR Workshop Proceedings, CEUR-WS.org, 2015.
[20] L. Yang, E. Bagdasaryan, J. Gruenstein, C. Hsieh, D. Estrin, OpenRec: A modular framework for extensible and adaptable recommendation algorithms, in: Y. Chang, C. Zhai, Y. Liu, Y. Maarek (Eds.), WSDM 2018, ACM, 2018, pp. 664–672. doi:10.1145/3159652.3159681.
[21] J. Yu, M. Gao, H. Yin, J. Li, C. Gao, Q. Wang, Generating reliable friends via adversarial training to improve social recommendation, in: J. Wang, K. Shim, X. Wu (Eds.), 2019 IEEE International Conference on Data Mining, ICDM 2019, Beijing, China, November 8-11, 2019, IEEE, 2019, pp. 768–777. doi:10.1109/ICDM.2019.00087.
[22] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G. Wei, H. S. Lee, D. Brooks, C. Wu, DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference, in: 47th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2020, Valencia, Spain, May 30 - June 3, 2020, IEEE, 2020, pp. 982–995. doi:10.1109/ISCA45697.2020.00084.
[23] A. Salah, Q. Truong, H. W. Lauw, Cornac: A comparative framework for multimodal recommender systems, J. Mach. Learn. Res. 21 (2020) 95:1–95:5. URL: http://jmlr.org/papers/v21/19-805.html.
[24] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: RecSys, ACM, 2019, pp. 101–109.
[25] S. Rendle, W. Krichene, L. Zhang, J. R. Anderson, Neural collaborative filtering vs. matrix factorization revisited, in: RecSys, ACM, 2020, pp. 240–248.
[26] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, K. Li, Y. Chen, Y. Lu, H. Wang, C. Tian, X. Pan, Y. Min, Z. Feng, X. Fan, X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, J. Wen, RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms, CoRR abs/2011.01731 (2020). URL: https://arxiv.org/abs/2011.01731.
[27] J. Bergstra, D. Yamins, D. D. Cox, Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures, in: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, JMLR.org, 2013, pp. 115–123. URL: http://proceedings.mlr.press/v28/bergstra13.html.
[28] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ragone, J. Trotta, Local popularity and time in top-n recommendation, in: ECIR (1), volume 11437 of Lecture Notes in Computer Science, Springer, 2019, pp. 861–868.
[29] A. Bellogín, P. Sánchez, Revisiting neighbourhood-based recommenders for temporal scenarios, in: RecTemp@RecSys, volume 1922 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 40–44.
[30] X. Wang, X. He, M. Wang, F. Feng, T. Chua, Neural graph collaborative filtering, in: B. Piwowarski, M. Chevalier, É. Gaussier, Y. Maarek, J. Nie, F. Scholer (Eds.), SIGIR 2019, ACM, 2019, pp. 165–174. doi:10.1145/3331184.3331267.
[31] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. D. Noia, V-Elliot: Design, evaluate and tune visual recommender systems, in: RecSys, ACM, 2021, pp. 768–771.
[32] R. He, J. J. McAuley, VBPR: visual Bayesian personalized ranking from implicit feedback, in: D. Schuurmans, M. P. Wellman (Eds.), Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, AAAI Press, 2016, pp. 144–150. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11914.
[33] J. Tang, X. Du, X. He, F. Yuan, Q. Tian, T. Chua, Adversarial training towards robust multimedia recommender system, IEEE Trans. Knowl. Data Eng. 32 (2020) 855–867. doi:10.1109/TKDE.2019.2893638.
[34] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ragone, J. Trotta, How to make latent factors interpretable by feeding factorization machines with knowledge graphs, in: ISWC (1), volume 11778 of Lecture Notes in Computer Science, Springer, 2019, pp. 38–56.
[35] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ferrara, A. C. M. Mancino, Sparse feature factorization for recommender systems with knowledge graphs, in: RecSys, ACM, 2021, pp. 154–165.
[36] V. W. Anelli, T. D. Noia, E. D. Sciascio, C. Pomo, A. Ragone, On the discriminative power of hyper-parameters in cross-validation and how to choose them, in: RecSys, ACM, 2019, pp. 447–451.