Vol-3194/paper3
Paper
Paper | |
---|---|
description | |
id | Vol-3194/paper3 |
wikidataid | →Q117345089 |
title | Towards an Italian Energy Data Space |
pdfUrl | https://ceur-ws.org/Vol-3194/paper3.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/RuccoLZ22 |
volume | Vol-3194→Vol-3194 |
session | → |
Towards an Italian Energy Data Space
Chiara Rucco¹, Antonella Longo¹ and Marco Zappatore¹
¹ Dept. of Innovation Engineering, University of Salento, via Monteroni sn, 73100 – Lecce (Italy)

Abstract
The efficient use and sustainable production of energy are among the main challenges in meeting the ever-increasing demand for energy while limiting damage to the Earth. Smart energy grids, pervasive computing and communication technologies have enabled stakeholders in the energy industry to collect large amounts of useful and highly granular energy data. These data are generated in large volumes and in a variety of formats, depending on their originating systems and intended purposes. Moreover, the data can be structured or unstructured, in open or proprietary formats. This work harnesses Big Data Management to propose a first model of an Italian Energy Data Lake: the goal is to create a repository of national energy data that respects the key FAIRness principles [1], providing a decision support system and FAIR data for open science. Starting from data of two thematic areas that are part of the nine common European Data Spaces identified in the European Data Strategy [2], namely the Green Deal data space and the Energy data space, an open and extensible platform for secure, resilient acquisition and sharing of information is presented, to enable the Green Deal priority actions on issues such as climate change, circular economy, pollution, biodiversity, and deforestation.

Keywords
Energy, Data Lake, Open Data, FAIRness

1. Introduction
Global energy requirements are continuously increasing. Conventional methods of producing more energy to meet this growth pose a great threat to the environment. CO2 emissions and other by-products of energy production have direct consequences on everyday life.
Therefore, we need to understand and improve energy efficiency on both the producer and the consumer side. ICT-enabled smart energy grids and sensors are being installed globally to measure energy consumption and limit environmental impact. These smart objects produce large volumes of data, generated by different devices and in different formats, so that they embody Gartner's "Big Data 3Vs" [3]: volume, velocity and variety. For knowledge discovery, these data need to be collected and analyzed, and the insights extracted from the analysis need to be visualized for easy and effective understanding. Facing these challenges requires a highly scalable and flexible data analysis platform that automates the whole process. A first model that can meet these requirements is an architecture that draws elements from classic data warehouse systems on the one hand and from pure data lake systems on the other. This model, known as the Data Lakehouse [4], together with other paradigms such as polystore databases, can be implemented and tested with real-life data from smart energy devices in order to contribute to a society that follows the Circular Economy paradigm, a system where resource input and waste, emission, and energy leakage are minimized by slowing, closing, and narrowing material and energy loops [5].

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
chiara.rucco@unisalento.it (C. Rucco); antonella.longo@unisalento.it (A. Longo); marcosalvatore.zappatore@unisalento.it (M. Zappatore)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
The new schema will contain heterogeneous data sources, which will be processed to comply with the FAIR (Findable, Accessible, Interoperable, Reusable) principles: a well-documented and highly reusable data set serves the ultimate aim of trusted, effective and sustained reuse of research resources [6]. This paper aims to identify the key architectural aspects and modelling challenges of an Energy Data Space to be adopted nationwide in Italy. The work is structured as follows: the second section presents an overview of the background and the state of the art, with reference to the rise of the new "Energy of Things", characterised by large heterogeneous data sets, and to three cutting-edge projects on this topic. The main goal of this architectural proposal is described in the third section. The last part discusses the impact and the results of the solution's development.

2. Background and state of the art
Nowadays, the concept of "Data Lake" is popular for accumulating data from heterogeneous sources. Data lakes store large-scale raw data in a single big data repository, providing ingestion, exploration, and monitoring functionality [7]. In contrast to data warehouses, data lakes hold data from different sources in structured, semi-structured and unstructured formats, and can handle both batch and real-time streams. Moreover, data lakes come in different implementation forms (e.g., on premises, cloud or multi-cloud, and hybrid) [8]. Data lakes have been exploited in several application domains, ranging from digital humanities [9] to power grid management [10]. To give a full picture of the scenario analyzed, a general overview is provided: first, the Circular Energy paradigm is discussed, with a view on its heterogeneous energy-related data and on the key data principles for achieving FAIRness [1].
Then, as a starting point for this research, some existing projects for the creation of a national energy data repository are explored: the first is an initiative for creating a digital twin of the Earth, the second is a Danish proposal for renewable energies, and the third is a US initiative to make energy data usable and discoverable by researchers.

2.1. Energy of Things
According to the United Nations Sustainable Development Goals agenda [11], energy efficiency is one of the key factors for sustainable development: "ensuring access to affordable, reliable, sustainable and modern energy for all by 2030 will open a new world of opportunities for billions of people through new economic opportunities and jobs"¹. Furthermore, energy efficiency brings long-term economic benefits by reducing the cost of fuel import/supply, energy production and energy sector emissions. Effective analysis of real-time data in the energy supply chain plays a key role in improving energy efficiency and optimizing energy management [12]. Modern technologies, such as the Internet of Things (IoT), offer a wide range of applications in the energy sector, i.e. in the areas of energy supply, transmission, distribution and demand. With the surge of portable smart devices, consistently equipped with sensors, supported by increasingly capable cloud computing solutions, and heavily used for mobile social networking, "human as sensor" has become a promising sensing paradigm. For this reason, the term ENERNET (Energy of Things) was recently introduced by Steve Collier in an IEEE webinar on the future of energy: it is defined as the convergence of Smart Grids and the IoT [13]. The emerging ENERNET opens up further possibilities for an affordable, reliable, secure and sustainable supply of electrical power and energy.

¹ https://sdgs.un.org/topics/energy
The novel sensing technologies promote the data source into a new information space paradigm, which seamlessly integrates cyber space (CS), physical space (PS) and social space (SS), namely Cyber-Physical-Social Systems (CPSS) [14]. CPSS have a crucial role in improving energy efficiency, increasing the share of renewable energy and reducing the environmental impact of energy use. This is consistent with the Circular Economy, a concept framed by the Ellen MacArthur Foundation as an industrial economy that is restorative or regenerative by intention and design [5]: it represents a system in which resource input and waste, emission, and energy leakage are minimized by slowing, closing, and narrowing material and energy loops. This can be achieved through long-lasting design, maintenance, repair, reuse, re-manufacturing, refurbishing, and recycling.

The amount of data available for energy analysis is growing rapidly due to a large number of data sources, such as smart cities installing sensors, IoT and personal devices capturing regular behaviour, human-curated datasets (e.g. Open Maps), large-scale collaborative data-driven research, satellite imagery, multi-agent computer systems, and open government initiatives. This abundance of data is diverse both in format (e.g. structured, images, graph-based, matrix, time-series, geo-spatial, and textual) and in the types of analysis performed (e.g. linear algebra, classification, graph algorithms, and relational algebra). To support different types of data and analysis activities, scientists and analysts often rely on ad-hoc procedures to integrate various data sources. This typically means manually curating how to clean, convert and integrate data. Such approaches are fragile and time-consuming. In addition, to perform the analysis, they require bringing both data and computation into a single architecture, which is typically a (distributed) system not suitable for all necessary computation.
Most analysts and programmers, however, are not well prepared to handle a multitude of systems, to handle transitions between systems robustly, or to choose the right framework for the task.

2.2. Open initiatives in the energy computing field

2.2.1. DestinE Data Lake
As part of the European Commission's Green Deal and Digital Strategy, Destination Earth (DestinE) [15] is a project focused on contributing to the goals of the twin transition, green and digital. DestinE is designed to unlock the potential of digital Earth system modelling. It will focus on the impacts of climate change, aquatic and marine ecosystems, polar regions, the cryosphere, biodiversity or extreme weather events, as well as possible adaptation and mitigation strategies. It will help predict major environmental disasters and environmental degradation with unprecedented accuracy and reliability. The heart of Destination Earth will be a unified cloud-based modelling and simulation platform that will provide access to data, advanced computing infrastructure, software, artificial intelligence applications and analytics. As seen in figure 1, the project will integrate digital twins (DTs), i.e. digital replicas of different aspects of the Earth system, such as weather and climate change projections, food and water security, global ocean circulation and ocean biogeochemistry, among others, and provide users with access to thematic information, services, models, scenarios, simulations, forecasts and visualisations. The platform will also allow the development of applications and the integration of user data.

Figure 1: DestinE Data Lake as proposed in [15]

The project, which is currently only submitted as a proposal to the European Commission in line with the European Data Strategy, will be implemented gradually over the next 7-10 years starting in 2021.
The basic operational platform, digital twins and services will be developed as part of the Commission's digital programme, while Horizon Europe will provide research and innovation opportunities that will support the further development of DestinE.

2.2.2. Flexible Energy Denmark
Flexible Energy Denmark (FED) [16] is a digitisation project that aims to make Danish electricity consumption flexible, so that it becomes possible to use excess electricity from wind turbines and solar cells. The project brings together leading researchers, organisations, utilities, software companies and numerous living laboratories in the country that provide real data for the project. Specifically, FED collects data from a series of Living Labs (LLs) in physical environments representative of real life. Raw data on the electricity, water and district heating consumption of many thousands of households, as well as indoor climate data from two primary schools and 155 households in Aalborg, are loaded one to four times a day into a data lake called FED Data Lake (FEDDL), which is operated by the independent, non-profit national research centre Center Denmark in Fredericia² and enables efficient and advanced analysis. The FED ecosystem includes:
• A data ecosystem: the data lake, containing a variety of energy-related data that are mainly collected from the living labs in the project, but also from other sources such as BBR (the Danish Registry of Buildings and Houses) and DMI (the Danish Meteorological Institute)
• An ecosystem for digital tools: tools based on artificial intelligence, enabled by Big Data from the data ecosystem
• An ecosystem for digitisation solutions, combining some of the tools developed with the aim of managing energy flexibility in Denmark.
FEDDL is built using only open source tools that can be run either on-premise or in cloud environments.

2.2.3. Open Energy Data Initiative
The Open Energy Data Initiative (OEDI) [17] is a centralized repository of valuable energy research datasets collected by U.S. Department of Energy (DOE) programs, offices, and national laboratories. Designed to enable data discovery, OEDI facilitates access to a wide network of results, including data available in technology-specific catalogs such as the Geothermal Data Repository and the Marine Hydrokinetic Data Repository. The initiative aims to improve and automate access to high-value energy datasets across DOE programs, offices, and national laboratories. The platform is being deployed by the National Renewable Energy Laboratory (NREL) to make data usable and discoverable by researchers and industry, in order to accelerate analysis and innovation. The data lake not only provides analysts with tools to create actionable insights and with high-value open data, but can also be used to build data mashups or calculations that yield new and expanded data sets. OEDI leverages Amazon Web Services (AWS) to enable analytics capabilities and innovative dataset access, and to trigger new relationships among cloud partners. The data lake is based on consolidated AWS storage solutions for datasets (i.e., AWS S3 buckets) with elastic load balancing, and on AWS cloud-optimized analytics tools (e.g., AWS Glue, AWS Athena) that help users consolidate data into non-standard formats, speed up analytics, and pull or move small parts of analytics into their own AWS accounts.³

² https://www.centerdenmark.com/
³ https://openei.org/wiki/

Figure 2: Open Energy Data Initiative as depicted in [17]

3. Design of the Italian Energy Data Space

3.1. Logical architecture
In this context, the aim of the work is to design a resource for the Internet of Energy, capable of collecting energy data from Italian agencies, consortia and research centres, in order to develop a "Google of Energy", a system capable of indexing and searching energy Big Data. It can be used to facilitate future studies in the energy sector and all reliable infrastructures. Developing and consolidating a new approach to energy management, through the analysis of data from institutional databases, sensors, IoT devices and Industry 4.0 infrastructures in the field of energy and its ecosystem, will help to use the potential of Big Data to support the Green Deal's priority actions on issues such as climate change, circular economy, pollution, biodiversity and deforestation.

Based on the model developed in the Danish National Energy Data Lake [16], where a national repository for energy data is created, figure 3 gives an abstract overview of the logical architecture of the proposed Data Lake. It is composed of five separate layers, i.e., Data Sources, Data Collection/Ingestion, Data Storage, Data Exploration, and Data Consumers, and four cross-cutting layers, i.e., Privacy and Data Protection, Access Management, Meta Data Governance and Resources Management. The Data Sources and Data Consumers layers represent systems that are external to the Data Lakehouse structure.

Data Sources: The data sources considered for this purpose are mobile sensors and IoT sensors capable of collecting energy and environmental data, Open Data made available by Public Administrations or research centres, and Living Labs.
Some of the sources for collecting source data are as follows:
• Open data: energy production and consumption
  – GSE: national and regional data about renewable sources, transportation, energy counts;
  – ARERA: monitoring of novel generation plants at national level; data about market, clients, production, consumption;
  – ISTAT: energy production from renewable sources at national level and consumption by families;
  – Terna: national data related to production, generation plants, international benchmarks, peaks, consumption;
  – Eurostat:
    ∗ Energy statistics section: share of renewable energies, energy productivity, energy supply by product, energy consumption by product;
    ∗ Sustainable development: primary energy consumption, population lacking energy due to poverty;
• Industry 4.0: smart meters, data coming from power generation plants.

Figure 3: Logical view of the proposed architecture

Data Collection/Ingestion: Custom data collection retrieves data from sources that require custom scripts, e.g. when the data are embedded in HTML pages, exposed through APIs, or provided manually as files in CSV, TSV or PDF format. Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization. There are different ways of ingesting data, and the design of a particular data ingestion layer can be based on various models or architectures. The two kinds of data ingestion considered here are batch processing and stream processing. In the first, the ingestion layer periodically collects and groups source data and sends it to the destination system. Groups may be processed based on any logical ordering, the activation of certain conditions, or a simple schedule. Real-time processing (also called stream processing or streaming) involves no grouping at all.
Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer.

Data Storage: The data lake storage problem consists of selecting appropriate data stores to preserve the ingested datasets. Many solutions exist in the literature; they apply various relational and NoSQL databases [18] and organize data storage in different ways. Some solutions consider heterogeneous data sources, while others target a particular type, e.g., relational tables. To host different types of data, a solution may adopt a universal format or allow multiple formats. Some approaches rely on common relational or NoSQL stores, while others have developed new storage systems. The data storage systems can be on-premise or cloud-based [8]. This layer can be divided into three different zones:
• Raw data zone: all types of data are ingested without processing and stored in their native format. This zone allows users to find the original version of the data for their analytics and facilitates subsequent treatments. The stored raw data format can be different from the source format.
• Intermediate zone: after ingestion, the data lake is a vast collection of raw datasets with certain metadata. To make the data usable for querying, a number of solutions are proposed for further processing of the raw data, e.g., finding more metadata, discovering hidden relationships, and performing data integration, transformation or cleaning if necessary.
• Structured and unstructured zones: they store all the data available for analytics and provide access to them. These zones allow self-service data consumption for different analytics (reporting, statistical analysis, business intelligence analysis, machine learning algorithms) according to the data format.

Data Exploration: The top layer focuses on the interaction of users with the Data Lake. It is important that useful information can be retrieved from data lakes.
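The batch and streaming ingestion modes and the raw data zone described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the proposed platform: the class names (`RawZone`, `BatchIngestor`, `StreamIngestor`), the `batch_size` trigger and the in-memory storage are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RawZone:
    """Hypothetical raw data zone: keeps records unprocessed, grouped by source."""
    records: dict[str, list[Any]] = field(default_factory=dict)

    def store(self, source: str, items: list[Any]) -> None:
        self.records.setdefault(source, []).extend(items)


@dataclass
class BatchIngestor:
    """Groups incoming records and loads a group once a simple condition is met."""
    zone: RawZone
    batch_size: int = 3  # assumed trigger; a time-based schedule would also fit
    _buffer: dict[str, list[Any]] = field(default_factory=dict)

    def receive(self, source: str, item: Any) -> None:
        pending = self._buffer.setdefault(source, [])
        pending.append(item)
        if len(pending) >= self.batch_size:
            self.flush(source)

    def flush(self, source: str) -> None:
        # Send everything buffered for this source to the destination zone.
        pending = self._buffer.pop(source, [])
        if pending:
            self.zone.store(source, pending)


@dataclass
class StreamIngestor:
    """Loads every record as soon as it arrives, with no grouping."""
    zone: RawZone

    def receive(self, source: str, item: Any) -> None:
        self.zone.store(source, [item])
```

In this sketch a record handed to `BatchIngestor.receive` stays in the buffer until the group is complete or `flush` is called, whereas `StreamIngestor.receive` makes it visible in the raw zone immediately. Records are stored as received, which mirrors the idea that the raw zone performs no processing.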
However, this is challenging due to the large number of ingested sources and the heterogeneity of the data. In data lake systems with a large number of datasets, users may know one or a few data sources, but rarely all of them. The query formulation component should support users in creating formal queries that express their information requirements. The data interaction should cover all the functionalities required to work with the data, including visualization, annotation, selection and filtering of data, and basic analytical methods. Users can first browse the existing data sources, including their description, statistics, and schema; then they can write a query (SQL or JSONiq⁴) for a single dataset, or use the user interface to run a keyword search over the schema or the data. Alternatively, with some knowledge of the datasets, which could be acquired through previous exploration, they can choose to integrate a subset of relevant datasets and query them using formal queries or keyword search [19].

An important kind of output that the architecture could provide is the FAIR Data API: according to the FAIR data principles, research outputs are shared in a way that enables and enhances reuse by humans and machines. The characteristics of these resources can be oriented towards compliance with the FAIR guidelines. For example, the generated output uses globally unique identifiers and can be assigned further identifiers. The data elements described in FAIR correspond to concepts and (meta)data objects that are modeled as our Data Lake resources and described with rich metadata and context information. In the output of the Data Lake, resources are retrievable via open APIs, that is, absolute URIs and standard Representational State Transfer (REST) protocols.

Data Consumers: Human users play an essential role in the management of data lakes.
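The FAIR-oriented output described above (globally unique identifiers, rich metadata, retrieval through absolute URIs) can be sketched in a few lines of Python. This is only an illustration under assumptions: `FairResource`, `Catalog`, the metadata fields and the `BASE_URI` placeholder are hypothetical names introduced here and do not describe the API of the proposed platform.

```python
import uuid
from dataclasses import dataclass, field

BASE_URI = "https://example.org/api/resources/"  # placeholder, not a real endpoint


@dataclass(frozen=True)
class FairResource:
    """Hypothetical data lake resource described with FAIR-style metadata."""
    title: str
    description: str
    keywords: tuple[str, ...]
    license_id: str
    # Globally unique identifier, assigned when the resource is created.
    identifier: str = field(default_factory=lambda: uuid.uuid4().urn)

    @property
    def uri(self) -> str:
        """Absolute URI under which a REST API could expose the resource."""
        return BASE_URI + self.identifier.removeprefix("urn:uuid:")


class Catalog:
    """In-memory stand-in for the metadata catalogue of the data lake."""

    def __init__(self) -> None:
        self._by_id: dict[str, FairResource] = {}

    def register(self, resource: FairResource) -> str:
        self._by_id[resource.identifier] = resource
        return resource.uri

    def get(self, identifier: str) -> FairResource:
        return self._by_id[identifier]

    def search(self, keyword: str) -> list[FairResource]:
        """Case-insensitive keyword search over titles and keywords."""
        kw = keyword.lower()
        return [
            r for r in self._by_id.values()
            if kw in r.title.lower() or kw in (k.lower() for k in r.keywords)
        ]
```

A real deployment would persist the catalogue and serve `uri` through an HTTP service; the sketch only shows how an identifier, descriptive metadata and a resolvable address can travel together with each resource.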
The users of a data lake are also data providers; the insight provided by humans helps the data in the data lake to mature over time. Data consumers range from communities of interest (e.g., citizens' groups and associations interested in performing pollution measurements, and factories and industries interested in their level of environmental pollution) to public authorities, and from citizens (both single individuals and associations) to other end-user categories such as schools or research labs.

3.2. The proposed development process
The development of the platform is based on the following phases.
1. The first phase will focus on an in-depth study of the state of the art and on the analysis of the existing architectures presented in the second section. A first round of heterogeneous data collection from the various energy sources will be carried out.
2. The second phase, aimed at scenario definition, will focus on interviews with SMEs and stakeholders that can help in the design of a use case to prototype the research results. The aim is to elicit their needs and to understand the current state of the art in energy generation, distribution and use.
3. Design of pilot projects and use cases, also creating living labs for involving prosumers and providers.
4. Development of the digital platform for collecting data and providing data services and tools.
5. Incremental extension of use cases and further involvement of new providers, consumers and stakeholders.

⁴ https://www.jsoniq.org/

4. Conclusion
Knowledge of consumers' energy consumption and indoor climate is worth its weight in gold to utilities, industry and researchers. They use the data to plan production and to develop services and algorithms that control energy consumption, so that it becomes more flexible and renewable energy is not wasted.
Our ambition with the national platform is that it can form the basis for the release of data on electricity, water, heat and potentially also gas, so that the data can be used by commercial suppliers to develop new business models that support data-driven models for the green transition. Creating a repository of national energy data that respects the key FAIRness principles is the starting point for an open and extensible platform that enables secure, resilient acquisition and sharing of information, with the aim of improving the well-being and inclusion of citizens, producing a more effective response to pollution or other environmental emergencies, and making Smart Cities and extended urban areas feel safer and more secure to the citizens living in them. Furthermore, citizen efforts and joint academic-community science can help identify environmental health problems related to air quality in metropolitan areas. Unfortunately, there remains a gap between the development and the effective use of these cutting-edge technologies for proactive decision-making within communities [20]. Addressing this topic will help to raise public awareness of energy problems, to highlight the importance of citizens' engagement and to inspire citizens to adopt sustainable consumption habits and behaviour patterns. These habits will promote new sustainable services, e.g. lengthening product life cycles through reuse, repair and refurbishment, and encourage waste reduction, energy savings and circular thinking: the so-called "citizen science" is emerging.

5. Acknowledgment
This research activity is partially funded by the Italian research programme "PON Ricerca e Innovazione 14-20" (DM n.1062, 10 August 2021), in the framework of "The Italian Data Lake for Energy (ItaDL4E)" project.

References
[1] M. D. Wilkinson et al., The FAIR guiding principles for scientific data management and stewardship, Scientific Data 3 (2016). doi:10.1038/sdata.2016.18.
[2] Towards a common European data space, Technical Report, European Commission, COM 232, 2018.
[3] D. Laney, 3D data management: Controlling data volume, velocity and variety, META Group Research Note 6, 2001.
[4] P. G. Alonso, SETA, a suite-independent agile analytical framework, Master's thesis, Universitat Politecnica de Catalunya, 2016.
[5] E. M. Foundation, Towards the Circular Economy, Technical Report, EMF, McKinsey Company, 2013.
[6] F. H. Mardiansjah, Extended urbanization in smaller-sized cities and small town development in java: The case of the tegal region, IOP Conference Series: Earth and Environmental Science 447 (2020).
[7] C. Madera, A. Laurent, The next information architecture evolution: the data lake wave, MEDES: Proceedings of the 8th International Conference on Management of Digital EcoSystems (2016) 174–180. doi:10.1145/3012071.3012077.
[8] E. Zagan, M. Danubianu, Cloud data lake: The new trend of data storage, in: 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2021, pp. 1–4. doi:10.1109/HORA52670.2021.9461293.
[9] J. Darmont, C. Favre, S. Loudcher, C. Noûs, Data lakes for digital humanities, in: Proceedings of the 2nd International Conference on Digital Tools & Uses Congress, DTUC '20, Association for Computing Machinery, New York, NY, USA, 2020. doi:10.1145/3423603.3424004.
[10] Y. Li, A. Zhang, X. Zhang, Z. Wu, A data lake architecture for monitoring and diagnosis system of power grid, in: Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference, AICCC '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 192–198. doi:10.1145/3299819.3299850.
[11] U. Nations, Progress towards the sustainable development goals, 2017.
[12] N. Hossein Motlagh, M. Mohammadrezaei, J. Hunt, B. Zakeri, Internet of things (iot) and the energy sector, Energies 13 (2020). doi:10.3390/en13020494.
[13] S. E. Collier, The emerging enernet: Convergence of the smart grid with the internet of things, IEEE Industry Applications Magazine 23 (2017) 12–16. doi:10.1109/MIAS.2016.2600737.
[14] P. Wang, L. T. Yang, J. Li, J. Chen, S. Hu, Data fusion in cyber-physical-social systems: State-of-the-art and perspectives, Information Fusion 51 (2019) 42–57. doi:10.1016/j.inffus.2018.11.002.
[15] Destination Earth (DestinE) Architecture Validation Workshop, Technical Report, European Commission, 2021.
[16] H. B. Hamadou, T. Pedersen, C. Thomsen, The danish national energy data lake: Requirements, technical architecture, and tool selection, 2020 IEEE International Conference on Big Data (Big Data) (2020) 1523–1532.
[17] re3data.org: Oedi, 2020.
[18] S. Khan, X. Liu, S. Ali, M. Alam, Storage solutions for big data systems: A qualitative study and comparison (2019).
[19] R. Hai, C. Quix, C. Zhou, Query rewriting for heterogeneous data lakes, Advances in Databases and Information Systems - 22nd European Conference (2018) 35–49.
[20] E. Bales, N. Nikzad, N. Q. et al., Citisense: mobile air quality sensing for individuals and communities design and deployment of the citisense mobile air-quality system, 2012 6th international conference on pervasive computing technologies for healthcare (PervasiveHealth) and workshops (2012) 155–158.