Vol-3194/paper14
Jump to navigation
Jump to search
Paper
Paper | |
---|---|
edit | |
description | |
id | Vol-3194/paper14 |
wikidataid | Q117345133→Q117345133 |
title | SoBigData RI: European Integrated Infrastructure for Social Mining and Big Data Analytics |
pdfUrl | https://ceur-ws.org/Vol-3194/paper14.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/TrasartiGNR22 |
volume | Vol-3194→Vol-3194 |
session | → |
SoBigData RI: European Integrated Infrastructure for Social Mining and Big Data Analytics
SoBigData RI: European Integrated Infrastructure for Social Mining and Big Data Analytics Roberto Trasarti1 , Valerio Grossi1 , Michela Natilli1 and Beatrice Rapisarda1 1 CNR-ISTI (Institute of Information Science and Technologies), via G. Moruzzi 1, 56124 Pisa, Italy Abstract SoBigData RI has the ambition to support the rising demand for cross-disciplinary research and innovation on the multiple aspects of social complexity from combined data and model-driven perspectives and the increasing importance of ethics and data scientists’ responsibility as pillars of trustworthy use of Big Data and analytical technology. Digital traces of human activities offer a considerable opportunity to scrutinize the ground truth of individual and collective behaviour at an unprecedented detail and on a global scale. This increasing wealth of data is a chance to understand social complexity, provided we can rely on social mining, i.e., adequate means for accessing big social data and models for extracting knowledge from them. SoBigData RI, with its tools and services, empowers researchers and innovators through a platform for the design and execution of large-scale social mining experiments, open to users with diverse backgrounds, accessible on the cloud (aligned with EOSC), and also exploiting supercomputing facilities. Pushing the FAIR (Findable, Accessible, Interoperable) and FACT (Fair, Accountable, Confidential, and Transparent) principles will render social mining experiments more efficiently designed, adjusted, and repeatable by domain experts that are not data scientists. SoBigData RI moves forward from the simple awareness of ethical and legal challenges in social mining to the development of concrete tools that operationalize ethics with value-sensitive design, incorporating values and norms for privacy protection, fairness, transparency, and pluralism. SoBigData RI is the result of two H2020 grants (g.a. n.654024 and 871042), and it is part of the ESFRI 2021 Roadmap. Keywords Data Science, Artificial Intelligence, Social Mining, Big Data Analytics 1. Introduction SoBigData RI (www.sobigdata.eu) is the result of a long-term vision (Fig. 1) and started in 2015 with an initial project called SoBigData funded by H2020 (G.A. n. 654024) and is continuing with a subsequent H2020 project called SoBigData++ (G.A. n. 871042). It also entered the ESFRI RoadMap 2021 and therefore supported in becoming a legal entity in the form of an ERIC as result of a support project called SoBigData PPP (ESFRI Preparatory Phase Project). At the national level, partners in each country within SoBigData RI participate in national calls to expand their connections internally. For example, the Italian node (which is the coordinator) is participating in the PNRR (Piano Nazionale di Ripresa e Resilienza) since SoBigData RI is already in the high priority list in the PNIR (Piano Nazionale Infrastrutture di Ricerca 2021-2027). SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy $ roberto.trasarti@isti.cnr.it (R. Trasarti); valerio.grossi@isti.cnr.it (V. Grossi); michela.natilli@isti.cnr.it (M. Natilli); beatrice.rapisarda@isti.cnr.it (B. Rapisarda) � 0000-0001-5316-6475 (R. Trasarti); 0000-0002-8735-5394 (V. Grossi); 0000-0002-0323-7498 (M. Natilli) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) �Figure 1: Overview of the SoBigData RI Vision and related projects In particular, the currently active project is SoBigData++ with the objective of building effective national research systems within a federated European platform by planning, designing, and integrating services in a research infrastructure called SoBigData RI. The aim is to allow users to access methods and algorithms provided by RI as services in the cloud of e-infrastructure to be executed using the SoBigData RI computational resources. This aspect will follow the concepts of meta-modelling as well as FAIR and FACT principles. Optimal transnational cooperation and competition: this aspect will be granted in 13 European countries, offering world-leading research expertise from multiple disciplines, as well as Big Data computing platforms, big social data resources, and cutting-edge computational methods [1]. SoBigData RI plays a key role in impacting the ERA in several aspects. Firstly, training the next generation of responsible social data scientists engaged in the challenging research questions and ambassadors of critical data literacy to facilitate data citizenship and data democracy. Secondly, providing an accelerator of data-driven innovation that streamlines the collaboration with industries and startups to develop pilot projects and proofs-of-concept. Finally, democratizing the benefits of AI, data science, and big data through a network of excellencies within an ethical-legal framework that harmonizes individual rights and collective interests. SoBigData RI is also instrumental in many training activities related to a novel Ph.D. program centered on Data Science in Pisa - https://datasciencephd.eu/ (the Italian node will work synergically with the Ph.D. in Data Science); a second level master on Big Data Analytics and Social Mining; a national program on Ph.D. in Artificial Intelligence. The SoBigData RI promotes the platform’s services to other EU projects (in the previous years, we already served many H2020 projects (i.e., HUMAN AI Net, HUMMING BIrd, AI4EU, Tailor, XAI). During the ESFRI PPP, SoBigData RI will become an EOSC service provider. These aspects will impact the creation of synergies between SoBigData RI and other RIs in the European landscape. In the following sections, we will describe the different pillars of the SoBigData RI, starting �from the Exploratories, which are thematic virtual laboratories where the researchers study the complexity of social phenomena and collect data to produce new methods and algorithms which are published in the RI Catalogue to be shared with the community or executed in the cloud computational resources offered by the platform. 2. Exploratories SoBigData RI has made a relevant design choice for creating its e-infrastructure and community by introducing the concept ’exploratory’, i.e., a vertical social mining research environment focused on specific societal challenges. The aim was to exemplify the cutting-edge, multidis- ciplinary research supported by the RI and drive the integration into the e-infrastructure of the resources available within the national laboratories [2, 3]. Our exploratories are environ- ments where concrete, substantive multidisciplinary social mining research has been carried out. Moreover, exploratories serve to attract many users, both via transnational access and virtual access, as well as students and innovators attending the project’s training and innovation initiatives. Exploratories are the vehicle for fostering cooperation and synergies across different lines of activity within the research infrastructure, promoting networking, access, and joint research. SoBigData RI currently includes 7 exploratories: Demography, Economy and Finance 2.0 is devoted to the study of traditional, complex socio-economic and financial systems as well as emerging ones (e.g., block-chain and cryptocur- rency markets). The exploratory aims to extend the existing approaches grounded on statistical physics for the analysis of real-world (economic and financial) networks by considering innova- tive models to analyze the dynamics of and on networks and infer their structural details from partial information extending the rich formalism of entropy-based null models towards the development of non-linear Exponential Random Graphs and employ renormalization methods. Migration Studies tries to estimate flows and stocks from available, real-time data, building models that track indicators extracted from unconventional and official data sources. This exploratory evaluates migrants’ integration in new communities through social network and retail data analysis since migration can generate cultural changes in the local and incoming population. Social Impact of AI and explainable ML studies how the AI can replicate/simulate/infer people’s behaviours in the field of computational social choice. This interdisciplinary research area deals with the aggregation of agents’ preferences to reach a consensus decision that achieves some social objectives by finding "simpler" ways to explain this to humans. Societal Debates and Misinformation uses the state-of-the-art algorithms for supervised prevalence estimation that can track collective sentiment even when expressed on fine-grained ordinal scales. We analyze discussions using data from online social networks to investigate topics relevant to society and phenomena such as opinion polarization and echo chambers. Sports Data Science analyses the way scientists, fans, and practitioners conceive sports performance is changing rapidly due to the proliferation of new sensing technologies that provide reliable data streams extracted from every game. Combining this flow of (big) data with robust data science and AI tools, we now have the chance to unveil the great complexity �underlying sports performances and work towards many challenging goals, from automatic tactical analysis to data-driven performance ranking, from game outcome prediction to injury forecasting. Sustainable Cities for Citizens develops innovative methodologies to help Italian munici- palities and other decision-makers in producing policies and strategies for urban sustainability and improve the environmental performance concerning digitization, energy, water, and waste management and pollution. The methodologies are also directed to citizens to be used as a tool to increase their sustainability awareness. This exploratory also focuses on analyzing human mobility using mobile phone traces, vehicular GPS, and social media data are all Big Data sources and proxies of individual and collective behaviours. Network Medicine aims to create a new approach to biomedical science, combining principles and approaches from systems biology and network science to both understand the causes of human diseases and find and develop new personalized treatments. This exploratory also studies and develops new compressed indexes and algorithms for storing, indexing, and mining knowledge graphs and key-value stores that are the backbone of modern Computational Biology platforms. Disaster response and recovery focusing on the research of methods and tools to analyze, monitor, and improve post-disaster reconstruction processes in socio-economic areas, spatial planning, and environmental health in cooperation with national and international institutions. ICT-enhanced solutions to the disaster management cycle’s response and recovery phases can improve the efficacy of "search & rescue" activities, emergency relief, reconstruction, and rehabilitation. This exploratory was explicitly created to tackle the problems with a strong connection with the local institution in the area of L’Aquila; for this reason, it has a dedicated access point1 ). Exploratories had and still have a significant impact on the use of the platform, both in terms of the number of users and experiments carried out; they can be seen as think tank producing research, dataset, and algorithm which are refined and offered by the RI. 3. Ethical, Legal, Socio-Economic and Cultural Framework SoBigData RI adheres to the EU vision on Responsible Research and Innovation and opera- tionalizes values driving the ongoing reform of the EU Data Protection and Fundamental Rights legislation [4]. For this reason, SoBigData RI is operationalizing the ELSEC (Ethical, Social, Legal, Economic, and Cultural Aspects) values within AI and machine learning. FAIR (Findable, Accessible, Interoperable and re-usable) principles [5] are not enough in our work, but we also need FACT (Fairness, Accuracy, Confidentiality, and Transparency) [6] principles. In other words, the goal is to develop a sustainable and ethical framework. This aim relates to several aspects, starting from the ethical and legal ones and including gender balance aspects. From this perspective, SoBigData contributes and will contribute in moving the RI forward from the simple awareness of ethical and legal challenges in social mining to the development of concrete 1 This exploratory relates to the "Territori Aperti" project - https://territoriaperti.univaq.it/ �tools that operationalize ethics with value-sensitive design, incorporating values and norms for privacy protection, fairness, transparency, and pluralism. This can be demanding in an environment where regulations tend to change. The fact that the potential of technology is not always captured in time by the legislative system can result in gaps that need to be anticipated and, if possible, prevented. We are facing a continuous evolution of the ethical-legal framework. After GDPR, many legislative initiatives are developing new ethical-legal dimensions impacting data processing for statistics and scientific purposes and innovation. There are numerous challenges to be faced. Since many private companies and industries have difficulties including and exploiting ELSEC principles in their project, we can act as facilitators showing how to include them in practice (as already done in SoBigData). Another fundamental challenge is the potential conflict of interest that can arise. In some cases, there may be a tension between the requirement to operate ethically and the need to develop and market as many services as possible, to as many users and as many different companies as possible. In this challenging scenario several actions are put in place: 1) the design of a pipeline for data science experiments and services that adheres to the legal-ethical principles but is also effective; 2) an Ethical and legal board (with experts in IT law, IT ethics and data protection) which accommodate within-infrastructure ethical issues as well as will aim at answering new questions on the scope, interpretation, and application of the GDPR along with the expanding role of AI, machine learning and data mining; 3) white papers on ethical issues from actual cases and practical solutions for social science; 4) a variety of privacy-enhancing and discrimination discovery algorithms; 5) guidelines for legal, ethical, methodological and infrastructural issues arising from working with social data, to help scientists to focus on their research; 6) clear policies and agreements for the sharing of resources according to the Privacy and Ethic EU regulations; 7) tools that enable technologies and ethical values transfer to industry and PA, which are not always aware of the responsible data science pipeline, and show how to exploit its full potential in a sustainable way for a long-term added value. 4. SoBigData Research Infrastructure Services In this section, we describe the services offered by the RI, which are designed for different stakeholders and attract and foster a solid and active user community[2]. There are two kinds of services according to the access type: (i) Virtual Access (VA) gives the user the possibility to navigate and discover datasets, methods, services, and other resources (e.g., papers, experiments, etc.) thanks to the online tools provided by the RI; (ii) Transnational Access (TNA) is an integral part to the RI services to grow the social mining community and widen the reach of the RI. 4.1. Virtual Access The leading virtual access service is the RI catalogue which contains all the metadata of the resources, and it is accessible through the web interface. To access existing resources of the e-infrastructure, the user must log in to perform a free registration or use any academic/social credentials supported by the EOSC. The access to the SoBigData RI also grants access to the Catalogue, SoBigData Lab, and the e-learning area. The SoBigData VA services have the main �entry point at the site: www.sobigdata.eu. The gateway of the SoBigData RI publishes all the services related to e-infrastructure organized into six main areas: 1) the catalogue enables the user to search an item given a set of keywords; 2) SoBigData Lab where users can execute methods on the e-infrastructure; 3) the description ant the link to of all the applications available; 4) the e-learning area that enables the user to access all the training material related to the SoBigData community; 5) the portal to access to our HPC facilities, and 6) on the left the work space, an online environment to support secure and controlled data storage and sharing. The catalogue is the primary tool for discovering and searching for an item inside the SoBigData RI. All the elements inside the SoBigData RI are discoverable through this service. The user can insert a set of keywords, and the list of the results will be visualized. The search result lists items included in the catalogue and their classification (e.g., Method, Training Material, Dataset). The complete description is provided on the dedicated page, accessible by clicking on the item. These features can be added to the search filter, which will be recalculated in real-time. The search result can be sorted alphabetically concerning the insertion date or popularity. The SoBigData Lab integrates different methods that can be invoked under the same environment through SoBigData e-infrastructure. A method is the implementation of an algorithm/procedure or is an algorithm that requires an engine to be executed. Different kinds of integration are available based on the programming language in which a method is implemented. Once a method is integrated into the platform, the final user has a homogeneous web form for inserting parameters and invoking it independently from the programming language employed. In particular, JupyterHub is easily accessible by clicking on the link on the top of SoBigData Lab VRE. After starting the server by selecting one of the default profiles available, the user can start to use Jupyter Notebook as the local version. Several services are offered in the form of SoBigData libraries, consisting of a set of thematic methodologies available on the cloud and usable both in a JupyterHub environment or in the SoBigData engine and applications. 4.2. Transnational Access TA visit is a powerful tool to drive to share knowledge and expertise throughout Data Science. By benefitting from a SoBigData TA visit, the researcher is also guided and supported to become a responsible and ethical Data Scientist; this, in turn, will promote one of our primary objectives of creating and maintaining a trusted Research Infrastructure with a highly respected status in the field. TNA, by definition, shares and disseminates expertise from many of the partners of our community. The goal is to provide researchers and professionals with access to big data computing platforms, big social data resources, and cutting-edge computational methods. A TNA visitor will interact with the local experts, discuss research questions, run experiments on non-public big social datasets and algorithms, and present results at workshops/seminars. The RI expertise offer covers multiple and wide-ranging skills and knowledge. The explorato- ries described in Section 2 provide areas that pull together the strands of the RI across lateral and vertical thematic environments for supporting TNA visits. The RI can now offer a visit to one of 17 European SoBigData nodes providing wide-ranging and varied expertise. The visit is a short-term scientific mission (STSM) between 2 weeks and 2 months, covering the travel and living expenses costs2 . 2 http://www.sobigdata.eu/content/call-2022-23-sobigdata-transnational-access �Figure 2: SoBigData RI in numbers 5. Technological infrastructure The technological model of SoBigData RI has the form of a system of systems considering the principles of autonomy of constituents (independence and evolution), openness (join and leave; dynamic reconfiguration), and distribution (interdependence and interoperability). The cited principles represent the building blocks of the policies governing the rules of partic- ipation of the national nodes. From the technical point of view, the RI is a hyper-converged infrastructure, including both the storage area network and the underlying storage abstractions are implemented virtually in software rather than physically in hardware [7, 8]. All physical resources of the infrastructure are manageable through a single platform, reducing inefficiencies and reducing the total cost of ownership. Moreover, the technological model adopted considers some requirements for integrating with EOSC to provide its services and integrating with services providers such as OpenAIRE, Zenodo, EuroHPC, etc. 6. Conclusion The paper presented the main impacts and services that SoBigData RI has in the research community, mainly related to social mining and AI. Our ambition is to update our repertoire of social mining continuously and AI methods (from knowledge extraction, multilingual text classification, and enhanced privacy technology to federated machine learning) with cutting- edge techniques spurring from partners’ research activities or wrapping useful new openly- available methods. In this perspective, Figure 2 completes the SoBigData RI description providing some numbers related to our activities from 2015 to now. At the moment (March 2022), we have more than 9000 registered users, we fully support more than 80 TNA visits, and we trained more than 750 in our courses. We organized events disseminating and organizing workshops on thematic areas, e.g., Special Action on Gender, and with the presence at European Parliament. In the next future, we think that SoBigData RI will contribute to the creation and maintenance of local ecosystems that go in the direction of dynamic and open data spaces aligned with the ambition of the European strategy for data. �7. Acknowledgments This work is supported by the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu). References [1] F. Giannotti, R. Trasarti, K. Bontcheva, V. Grossi, Sobigdata: Social mining & big data ecosystem, in: P. Champin, F. Gandon, M. Lalmas, P. G. Ipeirotis (Eds.), Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon , France, April 23-27, 2018, ACM, 2018, pp. 437–438. URL: https://doi.org/10.1145/3184558.3186205. doi:10.1145/3184558.3186205. [2] V. Grossi, F. Giannotti, D. Pedreschi, P. Manghi, P. Pagano, M. Assante, Data science: a game changer for science and innovation, Int. J. Data Sci. Anal. 11 (2021) 263–278. URL: https://doi.org/10.1007/s41060-020-00240-2. doi:10.1007/s41060-020-00240-2. [3] V. Grossi, B. Rapisarda, F. Giannotti, D. Pedreschi, Data science at sobigdata: the european research infrastructure for social mining and big data analytics, International Journal of Data Science and Analytics 6 (2018) 205–216. [4] N. Forgó, S. Hänold, J. van den Hoven, T. Krügel, I. Lishchuk, R. Mahieu, A. Monreale, D. Pedreschi, F. Pratesi, D. van Putten, An ethico-legal framework for social data science, International Journal of Data Science and Analytics 11 (2021) 377–390. [5] A. Gvishiani, M. Dobrovolsky, A. Rybkina, Big data and fair data for data science, in: Resilience in the Digital Age, Springer, 2021, pp. 105–117. [6] J. van de Hoven, G. Comandè, S. Ruggieri, J. Domingo-Ferrer, F. Musiani, F. Giannotti, F. Pratesi, M. Stauch, Towards a digital ecosystem of trust: Ethical, legal and societal implications, Opinio Juris In Comparatione (2021) 131–156. [7] M. Assante, L. Candela, D. Castelli, R. Cirillo, G. Coro, L. Frosini, L. Lelii, F. Mangiacrapa, V. Marioli, P. Pagano, G. Panichi, C. Perciante, F. Sinibaldi, The gcube system: Delivering virtual research environments as-a-service, Future Gener. Comput. Syst. 95 (2019) 445–453. URL: https://doi.org/10.1016/j.future.2018.10.035. doi:10.1016/j.future.2018.10.035. [8] M. Assante, L. Candela, D. Castelli, R. Cirillo, G. Coro, L. Frosini, L. Lelii, F. Mangiacrapa, P. Pagano, G. Panichi, F. Sinibaldi, Enacting open science by d4science, Future Gener. Comput. Syst. 101 (2019) 555–563. URL: https://doi.org/10.1016/j.future.2019.05.063. doi:10. 1016/j.future.2019.05.063. �