Vol-3194/paper39
Paper

| Field | Value |
|---|---|
| id | Vol-3194/paper39 |
| wikidataid | Q117344890 |
| title | Mapping and Compressing a Convolutional Neural Network through a Multilayer Network |
| pdfUrl | https://ceur-ws.org/Vol-3194/paper39.pdf |
| dblpUrl | https://dblp.org/rec/conf/sebd/AmelioBCMUV22 |
| volume | Vol-3194 |
Mapping and Compressing a Convolutional Neural Network through a Multilayer Network (Discussion Paper)

Alessia Amelio¹, Gianluca Bonifazi², Enrico Corradini², Michele Marchetti², Domenico Ursino² and Luca Virgili²

¹ INGEO, University “G. D’Annunzio” of Chieti-Pescara
² DII, Polytechnic University of Marche

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
Emails: a.amelio@unich.it (A. Amelio); g.bonifazi@univpm.it (G. Bonifazi); e.corradini@pm.univpm.it (E. Corradini); m.marchetti@pm.univpm.it (M. Marchetti); d.ursino@univpm.it (D. Ursino); luca.virgili@univpm.it (L. Virgili)
ORCID: 0000-0002-3568-636X (A. Amelio); 0000-0002-1947-8667 (G. Bonifazi); 0000-0002-1140-4209 (E. Corradini); 0000-0003-3692-3600 (M. Marchetti); 0000-0003-1360-8499 (D. Ursino); 0000-0003-1509-783X (L. Virgili)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

This paper falls in the context of the interpretability of the internal structure of deep learning architectures. In particular, we propose an approach to map a Convolutional Neural Network (CNN) into a multilayer network. Next, to show how such a mapping helps to better understand the CNN, we propose a technique for compressing it. This technique detects whether there are convolutional layers that can be removed without reducing the performance too much and, if so, removes them. In this way, we obtain lighter and faster CNN models that can be easily employed in any scenario.

Keywords: Deep Learning, Convolutional Neural Networks, Multilayer Networks, Convolutional Layer Pruning

1. Introduction

In recent years, we have witnessed a massive spread of deep learning models in many contexts [1, 2] with the goal of solving increasingly complex problems. The growing complexity of the problems to solve requires increasingly sophisticated models to achieve the best performance. However, more and more researchers realize the need to reduce the size and complexity of deep networks [3, 4]. As a result, enormous efforts are being made to introduce new architectures of deep learning networks that are more efficient and less complex. In parallel, several methods are being proposed to reduce the size of existing networks without affecting their performance too much [5, 6]. To this end, it is extremely important to be able to explore the various layers and components of a deep learning model. In fact, we could identify the most important components, the most interesting patterns and features, the information flow, and so on.

In this paper, we want to make a contribution in this setting. In particular, we start from the assumption that complex networks, and in particular multilayer ones, can significantly support the representation, analysis, exploration and manipulation of deep learning networks. Based on this insight, we first propose an approach to map deep learning networks into multilayer ones and then use the latter to explore and manipulate the former. We focus on one family of deep learning networks, namely Convolutional Neural Networks (hereafter, CNNs) [7]. Multilayer networks [8] are a type of complex network sophisticated enough to represent all aspects of a CNN.
In fact, through their fundamental components (i.e., nodes, arcs, weights and layers), they are able to represent all the typical concepts of a CNN (i.e., nodes, connections, filters, weights, etc.). Once we have mapped a CNN into a multilayer network, we can use the latter to study and manipulate the former. To give an idea of its potential, in this paper we will use it to support an approach to prune convolutional layers [9] from a CNN. This approach aims to identify whether there are layers in the CNN that can be pruned without reducing the performance too much and, if so, proceeds with pruning and returns a new CNN without those layers.

The outline of this paper is as follows. In Section 2, we describe our approach to mapping a CNN into a multilayer network. In Section 3, we describe the use of the multilayer network to prune one or more convolutional layers of the CNN. Finally, in Section 4, we draw our conclusion and take a look at some possible future developments of our research.

2. Mapping a Convolutional Neural Network into a multilayer network

In this section, we present our approach to mapping a CNN into a multilayer network. It represents the first contribution of this paper.

2.1. Class network definition

In this subsection, we describe a CNN by means of a single-layer network, referred to as a class network. It is a weighted directed graph 𝐺 = (𝑉, 𝐸, 𝑊), where 𝑉 is the set of nodes, 𝐸 is the set of arcs, and 𝑊 is the set of arc weights.

A CNN consists of 𝑀 convolutional layers, each having 𝑥 filters (also called “kernels”). In a convolutional layer, each filter slides over the input with a given stride and creates a feature map. The input ℐ of the first convolutional layer is given by the original image, while the input of each next convolutional layer is given by a feature map. Applying a filter on the element ℐ(𝑖, 𝑗) of the input provides a new element 𝒪(𝑖, 𝑗) of the output feature map 𝒪. Based on this, the set 𝑉 of the nodes of 𝐺 consists of a set {𝑉1, 𝑉2, ..., 𝑉𝑀} of node subsets. The subset 𝑉𝑘 represents the contribution of the 𝑘-th convolutional layer 𝑐𝑘, obtained by applying the 𝑥𝑘 filters of this layer to the input of 𝑐𝑘. Therefore, a node 𝑝 ∈ 𝑉𝑘 represents the output obtained by applying the 𝑥𝑘 filters of 𝑐𝑘 to some position (𝑖, 𝑗) of the input.

Since by applying a filter on ℐ(𝑖, 𝑗) we get a new element 𝒪(𝑖, 𝑗), it is straightforward that there is a direct connection between ℐ(𝑖, 𝑗) and 𝒪(𝑖, 𝑗). Actually, there is a direct connection not only between ℐ(𝑖, 𝑗) and 𝒪(𝑖, 𝑗), but also between ℐ(𝑖, 𝑗) and each element adjacent to 𝒪(𝑖, 𝑗) within the filter area. As each convolutional layer 𝑐𝑘 has a set of 𝑥𝑘 filters, there are 𝑥𝑘 sets of direct connections between ℐ(𝑖, 𝑗) and 𝒪𝑘(𝑖, 𝑗), one for each filter. Since the 𝑥𝑘 filters are applied to the input with a given stride, a set of similar connections towards the feature maps is generated for the different positions of the input.
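As an aside that is not part of the original paper, the following minimal Python sketch shows one possible skeleton for such a class network, assuming NetworkX and a node label of the form (k, (i, j)) for the position (i, j) handled by the 𝑘-th convolutional layer; arcs and weights are added in the later sketches.

```python
# Minimal sketch (not the authors' code) of the skeleton of a class network G = (V, E, W):
# one node per position (i, j) of the input of each convolutional layer c_k, grouped into
# the subsets V_1, ..., V_M.
import networkx as nx

def init_class_network(feature_map_shapes):
    """`feature_map_shapes[k-1]` is the (height, width) of the input of the k-th
    convolutional layer; arcs and weights are added later (see the next sketches)."""
    G = nx.DiGraph()
    for k, (h, w) in enumerate(feature_map_shapes, start=1):
        for i in range(h):
            for j in range(w):
                G.add_node((k, (i, j)), layer=k)   # node of V_k for position (i, j)
    return G

G = init_class_network([(16, 16), (8, 8), (4, 4)])   # e.g., M = 3 convolutional layers
```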
In a CNN, besides the convolutional layers there are pooling layers. A pooling layer shrinks the input feature map and leads to an increase in the number of connections between the input and the output. This is achieved by sliding the filter over the input with a given stride and determining an aggregate value (e.g., the maximum) for each filter window. As aggregate values are still elements of the feature map provided as input, the next application of a convolutional layer to the feature map returned by a pooling layer generates connections between the aggregate values and the elements of the feature map returned by the convolutional layer. In particular, there will be direct connections between the aggregate values of the feature map provided as input to the pooling layer and the adjacent elements of the feature map generated by the next convolutional layer.

Therefore, based on the previous reasoning, we can say that the application of a filter to the element ℐ(𝑖, 𝑗) generates a new element 𝒪(𝑖, 𝑗), whose value is obtained by the following convolution operation:

$$g(i, j) = f(i, j) * \mathcal{I}(i, j) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} f(s, t)\, \mathcal{I}(i + s, j + t)$$

Here, 𝑓 is a filter of size (2𝑎 + 1) × (2𝑏 + 1). The direct connections generated between ℐ(𝑖, 𝑗) and 𝒪(𝑖 + 𝑠, 𝑗 + 𝑡), −𝑎 ≤ 𝑠 ≤ 𝑎, −𝑏 ≤ 𝑡 ≤ 𝑏, are labeled with the same weight 𝑔(𝑖, 𝑗), representing the convolution result. In the presence of 𝑥 filters, the direct connections between ℐ(𝑖, 𝑗) and 𝒪ℎ(𝑖 + 𝑠, 𝑗 + 𝑡), −𝑎 ≤ 𝑠 ≤ 𝑎, −𝑏 ≤ 𝑡 ≤ 𝑏, 1 ≤ ℎ ≤ 𝑥, are weighted with the values of the corresponding convolution results 𝑔1(𝑖, 𝑗), 𝑔2(𝑖, 𝑗), . . . , 𝑔𝑥(𝑖, 𝑗).

Figure 1 illustrates the application of three filters of size 3 × 3 (green, blue and yellow colored, respectively) to ℐ(8, 8). For each filter 𝑓ℎ, it also shows the weighted direct connections generated between ℐ(8, 8) and 𝒪ℎ(8 + 𝑠, 8 + 𝑡), −1 ≤ 𝑠 ≤ 1, −1 ≤ 𝑡 ≤ 1, 1 ≤ ℎ ≤ 3. The three weights are obtained as: 𝑔1(8, 8) = 𝑓1(8, 8) ∗ ℐ(8, 8) = 5; 𝑔2(8, 8) = 𝑓2(8, 8) ∗ ℐ(8, 8) = −2; 𝑔3(8, 8) = 𝑓3(8, 8) ∗ ℐ(8, 8) = 11.

Figure 1: Application of three filters of size 3 × 3 (green, blue and yellow colored, respectively) to ℐ(8, 8), and computation, for each filter, of the weights of the direct connections between ℐ(8, 8) and 𝒪ℎ(8 + 𝑠, 8 + 𝑡), −1 ≤ 𝑠 ≤ 1, −1 ≤ 𝑡 ≤ 1, 1 ≤ ℎ ≤ 3.

To obtain the arcs of 𝐺 from the weights of the direct connections of the 𝑥 filters, we adopt some statistical descriptors. Our objective is to have only one set of arcs from the node corresponding to ℐ(𝑖, 𝑗) to the nodes corresponding to 𝒪(𝑖 + 𝑠, 𝑗 + 𝑡), −𝑎 ≤ 𝑠 ≤ 𝑎, −𝑏 ≤ 𝑡 ≤ 𝑏 (here and in the following, we employ the symbols ℐ(𝑖, 𝑗) and 𝒪(𝑖, 𝑗) to indicate both the elements of the feature maps and the corresponding nodes of the class network). The weight of an arc from ℐ(𝑖, 𝑗) to 𝒪(𝑖 + 𝑠, 𝑗 + 𝑡) is obtained by applying a suitable statistical descriptor to the weights 𝑔ℎ(𝑖, 𝑗), 1 ≤ ℎ ≤ 𝑥. The two statistical descriptors we have chosen to adopt are the mean and the median, as they achieved the best performance results.

As a consequence of the previous reasoning, the set 𝐸 of the arcs of 𝐺 consists of a set of subsets 𝐸 = {𝐸1, 𝐸2, ..., 𝐸𝑀−1}. Here, 𝐸𝑘 denotes the set of arcs connecting nodes of 𝑉𝑘 to nodes of 𝑉𝑘+1. Analogously, the set 𝑊 of the weights of 𝐺 consists of a set of subsets 𝑊 = {𝑊1, 𝑊2, ..., 𝑊𝑀−1}. Here, 𝑊𝑘 is the set of weights associated with the arcs of 𝐸𝑘.
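Continuing the previous sketch under the same assumptions (NumPy, NetworkX, nodes labeled (k, (i, j))), the snippet below illustrates, for a single input position, how the 𝑥 convolution results could be computed and aggregated with the mean or the median to weight the arcs towards the filter area; it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
import networkx as nx

def add_arcs_for_position(G, fmap, filters, k, i, j, descriptor=np.mean):
    """Apply the x filters of layer c_k at position (i, j) of its input feature map `fmap`
    and add the resulting arcs from the node of I(i, j) in V_k towards V_{k+1};
    `descriptor` (mean or median) aggregates the x convolution results into one weight."""
    x, fh, fw = filters.shape
    a, b = fh // 2, fw // 2
    padded = np.pad(fmap, ((a, a), (b, b)))           # zero-padding at the borders
    window = padded[i:i + fh, j:j + fw]               # region centred on I(i, j)
    g = [np.sum(f * window) for f in filters]         # g_1(i, j), ..., g_x(i, j)
    weight = float(descriptor(g))
    for s in range(-a, a + 1):
        for t in range(-b, b + 1):
            # One arc towards each element O(i+s, j+t) within the filter area.
            G.add_edge((k, (i, j)), (k + 1, (i + s, j + t)), weight=weight)
    return G

G = nx.DiGraph()                                      # or the network from the previous sketch
fmap = np.random.rand(16, 16)
filters = np.random.randn(3, 3, 3)                    # x = 3 filters of size 3x3, as in Figure 1
add_arcs_for_position(G, fmap, filters, k=1, i=8, j=8)
```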
2.2. Mapping a CNN into a multilayer network

After showing how it is possible to build a class network representing a CNN, in this section we illustrate how, with the support of the class network model and a dataset containing the training data, it is possible to build a multilayer network capable of fully mapping both the CNN and its behavior. Roughly speaking, a multilayer network is a set of 𝑡 class networks, one for each target class in the dataset.

Formally speaking, let 𝐷 be a dataset consisting of 𝑡 target classes 𝐶𝑙1, 𝐶𝑙2, . . . , 𝐶𝑙𝑡, and let 𝑐𝑛𝑛 be a Convolutional Neural Network. The multilayer network 𝒢 = {𝐺1, 𝐺2, ..., 𝐺𝑡} corresponding to 𝑐𝑛𝑛 is a set of 𝑡 class networks such that 𝐺ℎ, 1 ≤ ℎ ≤ 𝑡, corresponds to the ℎ-th target class of 𝐷.

Figure 2 shows a multilayer network 𝒢 characterized by three layers, 𝐺1, 𝐺2 and 𝐺3, each consisting of a generated class network. Figure 3 shows the generation of a portion of 𝒢 obtained by extracting the three feature maps of three target classes from 𝑐𝑛𝑛. In this last network, a set of filters of size 3 × 3 with stride 1, sliding over positions (3, 2) and (4, 2) of the feature maps, generates some arcs for the class networks of 𝒢. The weights of the arcs are computed by applying the rules described in Section 2.1. In particular, for the class network 𝐺1 (resp., 𝐺2, 𝐺3), two sets of arcs of weights 5 (resp., 1, 3) and 4 (resp., 7, 6) are generated between the first and second feature maps. Similarly, a set of arcs of weights 2 (resp., -2, 4) and 8 (resp., 9, -1) are generated between the second and the third feature maps.

Figure 2: Sample multilayer network 𝒢 composed of three layers corresponding to the class networks 𝐺1 (top), 𝐺2 (middle), and 𝐺3 (bottom).

Our algorithm for constructing 𝒢 from 𝑐𝑛𝑛 and 𝐷 consists of two steps. During the first step, it creates a support data structure consisting of a list of patch lists. A patch is a portion of a feature map having the same size as the filters applied by the next convolutional layer; each patch gives rise to a node in the multilayer network. The list of patch lists is then used during the second step to construct 𝒢.

The function corresponding to the first step, called CREATE_PATCHES, receives 𝑐𝑛𝑛 and a target class 𝐶𝑙ℎ, 1 ≤ ℎ ≤ 𝑡. It operates on the feature maps provided in input to each convolutional layer of 𝑐𝑛𝑛 and proceeds as follows. It uses a list conv_layers that initially contains the convolutional layers of 𝑐𝑛𝑛. Afterwards, it iterates over the elements of conv_layers and provides them with the images of 𝐶𝑙ℎ. During each iteration, it treats the current element as the source and the next element as the target. The feature map returned by source is given as input to target, which processes it as specified below. At the beginning of each iteration, CREATE_PATCHES determines the starting and ending points of the source output. After that, it iterates over the source output and creates a patch for each application of the convolutional filter on an element. At the end of the iteration, it stores the patches corresponding to a convolutional layer in a list called patch_list. Finally, it returns in output the lists corresponding to all convolutional layers of 𝑐𝑛𝑛.
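A simplified sketch of this first step is given below; it is not the authors' code and assumes that each convolutional layer is described by its filter size and stride, and that a caller-supplied helper, here called input_feature_map, returns the 2D feature map given in input to a layer for the images of the chosen target class.

```python
import numpy as np

def create_patches(cnn, class_images, conv_layers, input_feature_map):
    """Return one patch list per convolutional layer: each patch is the portion of the
    layer's input feature map covered by one application of the layer's filters."""
    all_patch_lists = []
    for layer in conv_layers:
        fmap = input_feature_map(cnn, layer, class_images)   # 2D NumPy array (assumption)
        fh, fw = layer["filter_size"]                        # size of this layer's filters
        stride = layer["stride"]
        patch_list = []
        for i in range(0, fmap.shape[0] - fh + 1, stride):
            for j in range(0, fmap.shape[1] - fw + 1, stride):
                # One patch per filter application; (i, j) identifies the resulting node.
                patch_list.append(((i, j), fmap[i:i + fh, j:j + fw]))
        all_patch_lists.append(patch_list)
    return all_patch_lists
```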
The function corresponding to the second step, called CREATE_LAYER_NETWORK, receives 𝑐𝑛𝑛 and a target class 𝐶𝑙ℎ, 1 ≤ ℎ ≤ 𝑡, of the multilayer network 𝒢. It first calls CREATE_PATCHES to receive the list of patch lists corresponding to 𝑐𝑛𝑛 and 𝐶𝑙ℎ. Then, it creates an initially empty network 𝐺ℎ. Afterwards, it scrolls the list of patch lists returned by CREATE_PATCHES, considering the current list as source and the next one as target. For each iteration, it adds to 𝐺ℎ a node for each patch present in source or target, if it is not already present in 𝐺ℎ. Then, it determines the size of the area covered by the filter in target. This size is equal to the one covered in source if we are in the presence of a convolutional layer, or greater if we have a pooling layer. After that, it iterates over the source and target nodes on which the filter acts. Given a target node 𝑣𝑡, it considers all the source nodes that can be processed by a filter whose center falls within the rectangle defined by the coordinates of the patch of 𝑣𝑡 and, for each of them, adds an arc from it to 𝑣𝑡 in 𝐺ℎ. Once this arc has been added, it computes the corresponding weight by applying the formula seen above (see Figure 3 for an example of its application). At the end of its iterations, CREATE_LAYER_NETWORK has created the ℎ-th layer 𝐺ℎ of 𝒢, corresponding to the target class 𝐶𝑙ℎ. Applying this function 𝑡 times, one for each target class in the dataset 𝐷, we obtain the final multilayer network.

Figure 3: Generation of a portion of the class networks 𝐺1 (a), 𝐺2 (b) and 𝐺3 (c) corresponding to the layers of the multilayer network 𝒢 depicted in Figure 2.
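Under the same assumptions as the previous sketches, the second step and the assembly of 𝒢 could look roughly as follows; arc_weight implements the mean/median aggregation of Section 2.1, the node labels are our own convention, and the containment test deliberately ignores stride and padding offsets, so this is only an approximation of the procedure described above.

```python
import numpy as np
import networkx as nx

def arc_weight(patch, filters, descriptor=np.mean):
    # Mean/median of the x convolution results obtained by applying the filters to the patch.
    return float(descriptor([np.sum(f * patch) for f in filters]))

def create_layer_network(patch_lists, layer_filters):
    """Rough sketch of CREATE_LAYER_NETWORK for one target class Cl_h.
    `patch_lists` is the output of create_patches; `layer_filters[k]` is the (x, fh, fw)
    array of filters of the (k+1)-th convolutional layer."""
    G_h = nx.DiGraph()
    for level in range(len(patch_lists) - 1):
        source, target = patch_lists[level], patch_lists[level + 1]
        src_filters = layer_filters[level]                    # filters applied to the source patches
        tgt_fh, tgt_fw = layer_filters[level + 1].shape[1:]   # area covered by one target patch
        for (i, j), patch in source:
            w = arc_weight(patch, src_filters)
            for (ti, tj), _ in target:
                # Arc if the element produced by this source application falls within the
                # rectangle covered by the target patch (stride/padding offsets ignored here).
                if ti <= i < ti + tgt_fh and tj <= j < tj + tgt_fw:
                    G_h.add_edge((level + 1, (i, j)), (level + 2, (ti, tj)), weight=w)
    return G_h

def build_multilayer_network(cnn, dataset_by_class, conv_layers, input_feature_map, layer_filters):
    # One class network per target class Cl_h of D; create_patches is the function sketched above.
    return [create_layer_network(create_patches(cnn, images, conv_layers, input_feature_map),
                                 layer_filters)
            for images in dataset_by_class]
```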
3. Applying the multilayer network model to compress a CNN

Once a CNN is mapped into a multilayer network, the latter can be used to analyze and manipulate the former. There are several operations that can be performed on the CNN thanks to its mapping into a multilayer network. To give an idea of such operations, in this section we examine one of them, namely the compression of a CNN. This represents the second main contribution of this paper.

Let 𝒢 be a multilayer network and let 𝐺ℎ be its ℎ-th layer. 𝐺ℎ is a weighted directed graph. Given a node 𝑣 of 𝐺ℎ, we can define: (i) the indegree of 𝑣, as the sum of the weights of the arcs of 𝐺ℎ incoming into 𝑣; (ii) the outdegree of 𝑣, as the sum of the weights of the arcs outgoing from 𝑣; (iii) the degree of 𝑣, as the sum of its indegree and its outdegree. We adopt the symbol 𝑑ℎ(𝑣) to denote the degree of 𝑣 in 𝐺ℎ and the symbol 𝛿(𝑣) to represent the overall (weighted) degree of 𝑣 in 𝒢. As we will see below, 𝛿(𝑣) is an important indicator of the effectiveness of the filter represented by 𝑣.

As previously pointed out, 𝒢 = {𝐺1, 𝐺2, . . . , 𝐺𝑡} has a layer for each target class. As a consequence, the overall degree 𝛿(𝑣) of the node 𝑣 in 𝒢 can be obtained by suitably aggregating the degrees 𝑑1(𝑣), 𝑑2(𝑣), · · · , 𝑑𝑡(𝑣) of 𝑣 in the 𝑡 layers of 𝒢. More specifically, if we indicate with ℱ a suitable aggregation function, we have that:

$$\delta(v) = \mathcal{F}(d_1(v), d_2(v), \ldots, d_t(v)) \qquad (1)$$

Let $d_{tot}(v) = \sum_{h=1}^{t} d_h(v)$ be the sum of the degrees of 𝑣 in the 𝑡 layers of 𝒢. We adopt the following entropy-based aggregation function [10, 11] for determining 𝛿(𝑣):

$$\delta(v) = -\sum_{h=1}^{t} \frac{d_h(v)}{d_{tot}(v)} \log\left(\frac{d_h(v)}{d_{tot}(v)}\right) \qquad (2)$$

This function refers to the famous concept of information entropy introduced by Shannon [12]. It favors the presence of a uniform distribution of the degree of 𝑣 in the different layers, while it penalizes the presence of a high degree of 𝑣 in few layers and a low degree of 𝑣 in many layers. In this way, we favor those nodes whose feature extraction is balanced across the different target classes [13] and penalize those nodes that do not favor such a balance.

Our approach for the compression of 𝑐𝑛𝑛 selects a subset of the nodes of 𝒢 with the highest values of 𝛿. This is equivalent to selecting the convolutional layers of 𝑐𝑛𝑛 that contribute the most to the quality of the classification and, therefore, cannot be discarded in any way. In particular, our approach selects the nodes of 𝒢 whose values of 𝛿 are higher than a certain threshold 𝑡ℎ𝛿 = 𝛾 · 𝛿̄. Here, 𝛿̄ denotes a statistical descriptor of the values of the overall degree 𝛿 of all the nodes in 𝒢; our approach allows the usage of the mean or the median as statistical descriptors, because they are the ones that allowed us to obtain the best experimental results. 𝛾 is a scaling factor that allows us to tune the contribution of 𝛿̄; its value belongs to the real interval [0, +∞).

After selecting the subset of the nodes of 𝒢 with an overall degree 𝛿 higher than 𝑡ℎ𝛿, our approach determines the set of the convolutional layers of 𝑐𝑛𝑛 from which these nodes were extracted. Finally, it keeps these layers in the compressed version of 𝑐𝑛𝑛 while discarding the others. At this point, our approach terminates by training the pruned network.
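The snippet below is an illustrative sketch of this compression step, not the paper's code; it assumes that the multilayer network is a list of weighted NetworkX DiGraphs (one per target class, with non-negative arc weights) and that a caller-supplied, hypothetical helper node_to_conv_layer maps each node back to the convolutional layer it was extracted from.

```python
import numpy as np
import networkx as nx

def overall_degree(multilayer, v):
    """Entropy-based aggregation (Eq. (2)) of the weighted degrees of v across the t layers."""
    d = np.array([G.in_degree(v, weight="weight") + G.out_degree(v, weight="weight")
                  if v in G else 0.0
                  for G in multilayer])
    d_tot = d.sum()
    if d_tot <= 0:
        return 0.0
    p = d[d > 0] / d_tot                  # degree distribution of v over the t layers
    return float(-(p * np.log(p)).sum())

def layers_to_keep(multilayer, node_to_conv_layer, gamma=1.0, descriptor=np.mean):
    """Return the convolutional layers whose nodes have an overall degree above th_delta."""
    nodes = set().union(*(G.nodes for G in multilayer))
    delta = {v: overall_degree(multilayer, v) for v in nodes}
    th_delta = gamma * descriptor(list(delta.values()))   # th_delta = gamma * mean (or median)
    return {node_to_conv_layer(v) for v, d_v in delta.items() if d_v > th_delta}
```

With the node labels used in the previous sketches, node_to_conv_layer can be as simple as lambda v: v[0]; the layers that do not appear in the returned set are then removed from 𝑐𝑛𝑛 and the pruned network is retrained.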
4. Conclusion

In this paper, we have seen how it is possible to map a deep learning architecture, specifically a CNN, into a multilayer network, which facilitates its analysis, exploration and manipulation. To this end, we proposed an approach to map each element of a CNN into the elements of the multilayer network, namely nodes, arcs, arc weights and layers. Then, to give an idea of what can be done thanks to such a mapping, we proposed an approach to compress a CNN based on the pruning of the layers that least contribute to the quality of the classification.

The approach proposed in this paper is a starting point for further research efforts in this area. For example, our approach does not consider the residual connections typical of ResNet. In the future, we plan to extend it in order to handle this type of architecture. Also, at the moment, our approach is not able to prune single filters from a convolutional layer. Doing so would require a much more detailed mapping of the CNN into the multilayer network. We have not done this yet to keep the size of the multilayer network limited. However, conceptually there are no limitations in performing such a task. In the future, we plan to proceed in that direction so that we can perform the pruning of a CNN with finer granularity.

References

[1] S. Dargan, M. Kumar, M. R. Ayyagari, G. Kumar, A survey of deep learning and its applications: A new paradigm to machine learning, Archives of Computational Methods in Engineering 27 (2020) 1071–1092.
[2] E. Corradini, G. Porcino, A. Scopelliti, D. Ursino, L. Virgili, Fine-tuning SalGAN and PathGAN for extending saliency map and gaze path prediction from natural images to websites, Expert Systems With Applications 191 (2022) 116282. Elsevier.
[3] Z. Chen, Z. Chen, J. Lin, S. Liu, W. Li, Deep neural network acceleration based on low-rank approximated channel pruning, IEEE Transactions on Circuits and Systems I: Regular Papers 67 (2020) 1232–1244. doi:10.1109/TCSI.2019.2958937.
[4] J. Liu, B. Zhuang, Z. Zhuang, Y. Guo, J. Huang, J. Zhu, M. Tan, Discrimination-aware network pruning for deep model compression, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) 1–1. doi:10.1109/TPAMI.2021.3066410.
[5] T. Choudhary, V. Mishra, A. Goswami, J. Sarangapani, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review (2020) 1–43.
[6] Z.-R. Wang, J. Du, Joint architecture and knowledge distillation in CNN for Chinese text recognition, Pattern Recognition 111 (2021) 107722. Elsevier.
[7] A. Khan, A. Sohail, U. Zahoora, A. S. Qureshi, A survey of the recent architectures of deep convolutional neural networks, Artificial Intelligence Review 53 (2020) 5455–5516. doi:10.1007/s10462-020-09825-6.
[8] M. Kivela, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, M. A. Porter, Multilayer networks, Journal of Complex Networks 2 (2014) 203–271. doi:10.1093/comnet/cnu016.
[9] S. Chen, Q. Zhao, Shallowing deep networks: Layer-wise pruning based on feature representations, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 3048–3056. doi:10.1109/TPAMI.2018.2874634.
[10] M. Hmimida, R. Kanawati, Community detection in multiplex networks: A seed-centric approach, Networks & Heterogeneous Media 10 (2015) 71–85.
[11] F. Battiston, V. Nicosia, V. Latora, Structural measures for multiplex networks, Physical Review E 89 (2014) 032804. doi:10.1103/PhysRevE.89.032804.
[12] C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal 27 (1948) 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.
[13] N. Gowdra, R. Sinha, S. MacDonell, W. Q. Yan, Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic, Pattern Recognition (2021) 108057. Elsevier.