Mapping and Compressing a Convolutional Neural Network through a Multilayer Network
(Discussion Paper)

Alessia Amelio1 , Gianluca Bonifazi2 , Enrico Corradini2 , Michele Marchetti2 ,
Domenico Ursino2 and Luca Virgili2
1 INGEO, University “G. D’Annunzio” of Chieti-Pescara
2 DII, Polytechnic University of Marche


Abstract

This paper falls in the context of the interpretability of the internal structure of deep learning architectures. In particular, we propose an approach to map a Convolutional Neural Network (CNN) into a multilayer network. Next, to show how such a mapping helps to better understand the CNN, we propose a technique for compressing it. This technique detects if there are convolutional layers that can be removed without reducing the performance too much and, if so, removes them. In this way, we obtain lighter and faster CNN models that can be easily employed in any scenario.

Keywords
Deep Learning, Convolutional Neural Networks, Multilayer Networks, Convolutional Layer Pruning

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
a.amelio@unich.it (A. Amelio); g.bonifazi@univpm.it (G. Bonifazi); e.corradini@pm.univpm.it (E. Corradini); m.marchetti@pm.univpm.it (M. Marchetti); d.ursino@univpm.it (D. Ursino); luca.virgili@univpm.it (L. Virgili)
ORCID: 0000-0002-3568-636X (A. Amelio); 0000-0002-1947-8667 (G. Bonifazi); 0000-0002-1140-4209 (E. Corradini); 0000-0003-3692-3600 (M. Marchetti); 0000-0003-1360-8499 (D. Ursino); 0000-0003-1509-783X (L. Virgili)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction
In recent years, we have witnessed a massive spread of deep learning models in many contexts
[1, 2] with the goal of solving increasingly complex problems. The growing complexity of the
problems to solve requires increasingly sophisticated models to achieve the best performance.
However, more and more researchers realize the need to reduce the size and complexity of deep
networks [3, 4]. As a result, enormous efforts are being made to introduce new architectures of
deep learning networks that are more efficient and less complex. In parallel, several methods are
being proposed to reduce the size of existing networks without affecting their performance too
much [5, 6]. To this end, it is extremely important to be able to explore the various layers and
components of a deep learning model. In fact, this would allow us to identify the most important
components, the most interesting patterns and features, the information flow, and so on.
   In this paper, we want to make a contribution in this setting. In particular, we start from the
assumption that complex networks, and in particular multilayer ones, can significantly support
the representation, analysis, exploration and manipulation of deep learning networks. Based on
this insight, we first propose an approach to map deep learning networks into multilayer ones
and then use the latter to explore and manipulate the former.
    We focus on one family of deep learning networks, namely Convolutional Neural Networks
(hereafter, CNNs) [7]. Multilayer networks [8] are a type of complex network sophisticated
enough to represent all aspects of a CNN. In fact, through their fundamental components (i.e.,
nodes, arcs, weights and layers), they are able to represent all the typical concepts of a CNN
(i.e., nodes, connections, filters, weights, etc.). Once we have mapped a CNN into a multilayer
network, we can use the latter to study and manipulate the former. To give an idea of its
potential, in this paper we use it to support an approach to prune convolutional layers [9]
from a CNN. This approach aims to identify if there are layers in the CNN that can be pruned
without reducing the performance too much, and, if so, proceeds with pruning and returns a
new CNN without those layers.
    The outline of this paper is as follows: In Section 2, we describe our approach to mapping a
CNN into a multilayer network. In Section 3, we describe the use of the multilayer network
to prune one or more convolutional layers of the CNN. Finally, in Section 4, we draw our
conclusion and take a look at some possible future developments of our research.


2. Mapping a Convolutional Neural Network into a multilayer
   network
In this section, we present our approach to mapping a CNN into a multilayer network. It
represents the first contribution of this paper.

2.1. Class network definition
In this subsection, we describe a CNN by means of a single-layer network, referred to as a class
network. It is a weighted directed graph 𝐺 = (𝑉, 𝐸, 𝑊 ), where 𝑉 is the set of nodes, 𝐸 is the
set of arcs, and 𝑊 is the set of arc weights.
A CNN consists of 𝑀 convolutional layers, the 𝑘𝑡ℎ of which has 𝑥𝑘 filters (also called “kernels”). In a
convolutional layer, each filter slides over the input with a given stride and creates a feature
map. The input ℐ of the first convolutional layer is the original image, while the input
of each subsequent convolutional layer is a feature map. Applying a filter to the element
ℐ(𝑖, 𝑗) of the input produces a new element 𝒪(𝑖, 𝑗) of the output feature map 𝒪.
Based on this, the set 𝑉 of the nodes of 𝐺 consists of a set {𝑉1 , 𝑉2 , ..., 𝑉𝑀 } of node subsets.
The subset 𝑉𝑘 indicates the contribution of the 𝑘 𝑡ℎ convolutional layer 𝑐𝑘 . This is obtained by
applying the 𝑥𝑘 filters of this layer to the input of 𝑐𝑘 . Therefore, a node 𝑝 ∈ 𝑉𝑘 represents the
output obtained by applying the 𝑥𝑘 filters of 𝑐𝑘 to some position (𝑖, 𝑗) of the input.
   Since by applying a filter on ℐ(𝑖, 𝑗) we get a new element 𝒪(𝑖, 𝑗), it is straightforward that
there is a direct connection between ℐ(𝑖, 𝑗) and 𝒪(𝑖, 𝑗). Actually, there is a direct connection
not only between ℐ(𝑖, 𝑗) and 𝒪(𝑖, 𝑗), but also between ℐ(𝑖, 𝑗) and each element adjacent to
𝒪(𝑖, 𝑗) within the filter area. As each convolutional layer 𝑐𝑘 has a set of 𝑥𝑘 filters, there are 𝑥𝑘
sets of direct connections between ℐ(𝑖, 𝑗) and 𝒪𝑘 (𝑖, 𝑗), one for each filter. Since the 𝑥𝑘 filters
are applied to the input with a given stride, a set of similar connections towards the feature
maps is generated for different positions of the input.
   In a CNN, besides the convolutional layers, there are pooling layers. A pooling layer shrinks
the input feature map and leads to an increase in the number of connections between the input
and the output. This is achieved by sliding the filter over the input with a given stride and
determining an aggregate value (e.g., the maximum) for each filter window. As aggregate values
are still elements of the feature map provided as input, the next application of a convolutional
layer to the feature map returned by a pooling layer generates connections between the aggregate
values and the elements of the feature map returned by the convolutional layer. In particular,
there will be direct connections between the aggregate values of the feature map provided as
input to the pooling layer and the adjacent elements of the feature map generated by the next
convolutional layer.
   Therefore, based on the previous reasoning, we can say that the application of a filter to
the element ℐ(𝑖, 𝑗) generates a new element 𝒪(𝑖, 𝑗), whose value is obtained by the following
convolution operation:

$$g(i, j) = f(i, j) * \mathcal{I}(i, j) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} f(s, t)\, \mathcal{I}(i+s, j+t)$$

Here, 𝑓 is a filter of size (2𝑎 + 1) × (2𝑏 + 1).
   The direct connections generated between ℐ(𝑖, 𝑗) and 𝒪(𝑖 + 𝑠, 𝑗 + 𝑡), −𝑎 ≤ 𝑠 ≤ 𝑎, −𝑏 ≤ 𝑡 ≤ 𝑏,
are labeled with the same weight 𝑔(𝑖, 𝑗), representing the convolution result. In the presence of
𝑥 filters, the direct connections between ℐ(𝑖, 𝑗) and 𝒪ℎ(𝑖 + 𝑠, 𝑗 + 𝑡), −𝑎 ≤ 𝑠 ≤ 𝑎, −𝑏 ≤ 𝑡 ≤ 𝑏,
1 ≤ ℎ ≤ 𝑥, are weighted with the values of the corresponding convolution results
𝑔1(𝑖, 𝑗), 𝑔2(𝑖, 𝑗), . . . , 𝑔𝑥(𝑖, 𝑗).
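To make the weight computation concrete, the following minimal Python sketch (our own illustration, not part of the original formulation) computes 𝑔(𝑖, 𝑗) for a single filter according to the convolution formula above; the function name and the zero-padding at the borders are assumptions of ours, and the inputs are assumed to be 2D NumPy-style arrays.

def connection_weight(image, filt, i, j):
    # g(i, j) = sum_{s,t} f(s, t) * I(i+s, j+t): this value labels every direct
    # connection from I(i, j) to the elements O(i+s, j+t) covered by the filter.
    a, b = filt.shape[0] // 2, filt.shape[1] // 2
    g = 0.0
    for s in range(-a, a + 1):
        for t in range(-b, b + 1):
            # positions falling outside the input contribute nothing (zero padding)
            if 0 <= i + s < image.shape[0] and 0 <= j + t < image.shape[1]:
                g += filt[s + a, t + b] * image[i + s, j + t]
    return g
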
   Figure 1 illustrates the application of three filters of size 3 × 3 (green, blue and yellow
colored, respectively) to ℐ(8, 8). For each filter 𝑓ℎ , it also shows the weighted direct connections
generated between ℐ(8, 8) and 𝒪ℎ (8+𝑎, 8+𝑏), −1 ≤ 𝑎 ≤ 1, −1 ≤ 𝑏 ≤ 1, 1 ≤ ℎ ≤ 3. The three
weights are obtained as: 𝑔1 (8, 8) = 𝑓1 (8, 8) · ℐ(8, 8) = 5; 𝑔2 (8, 8) = 𝑓2 (8, 8) · ℐ(8, 8) = −2;
𝑔3 (8, 8) = 𝑓3 (8, 8) · ℐ(8, 8) = 11.




Figure 1: Application of three filters of size 3 × 3 (green, blue and yellow colored, respectively) to
ℐ(8, 8), and computation, for each filter, of the weights of the direct connections between ℐ(8, 8) and
𝒪ℎ (8 + 𝑎, 8 + 𝑏), −1 ≤ 𝑎 ≤ 1, −1 ≤ 𝑏 ≤ 1, 1 ≤ ℎ ≤ 3


  To obtain the arcs of 𝐺 from the weights of the direct connections of the 𝑥 filters, we
adopt some statistical descriptors. Our objective is to have only one set of arcs from the node
corresponding to ℐ(𝑖, 𝑗) to the nodes corresponding to 𝒪(𝑖 + 𝑠, 𝑗 + 𝑡), −𝑎 ≤ 𝑠 ≤ 𝑎, −𝑏 ≤ 𝑡 ≤ 𝑏.¹
The weight of the arc from ℐ(𝑖, 𝑗) to 𝒪(𝑖 + 𝑠, 𝑗 + 𝑡) is obtained by applying a suitable statistical
descriptor to the weights 𝑔ℎ(𝑖, 𝑗), 1 ≤ ℎ ≤ 𝑥. The two statistical descriptors we have chosen
to adopt are the mean and the median, which achieved the best performance results.
   As a consequence of the previous reasoning, the set 𝐸 of the arcs of 𝐺 consists of a set
of subsets 𝐸 = {𝐸1 , 𝐸2 , ..., 𝐸𝑀 −1 }. Here, 𝐸𝑘 denotes the set of arcs connecting nodes of
𝑉𝑘 to nodes of 𝑉𝑘+1 . Analogously, the set 𝑊 of the weights of 𝐺 consists of a set of subsets
𝑊 = {𝑊1 , 𝑊2 , ..., 𝑊𝑀 −1 }. Here, 𝑊𝑘 is the set of weights associated with the arcs of 𝐸𝑘 .
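As an illustration of how the per-filter weights are collapsed into a single arc weight, a small Python sketch follows; the function name is ours, and the choice between mean and median mirrors the two statistical descriptors mentioned above.

import statistics

def arc_weight(filter_weights, descriptor="mean"):
    # Collapse the x values g_1(i, j), ..., g_x(i, j) produced by the x filters
    # into the single weight of the arc from I(i, j) to O(i+s, j+t).
    if descriptor == "mean":
        return statistics.mean(filter_weights)
    if descriptor == "median":
        return statistics.median(filter_weights)
    raise ValueError("unsupported statistical descriptor")
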

2.2. Mapping a CNN into a multilayer network
After showing how it is possible to build a class network representing a CNN, in this section we
illustrate how, with the support of the class network model and a dataset containing the training
data, it is possible to build a multilayer network capable of fully mapping both the CNN and
its behavior. Roughly speaking, a multilayer network is a set of 𝑡 class networks, one for each
target class in the dataset. Formally speaking, let 𝐷 be a dataset consisting of 𝑡 target classes
𝐶𝑙1 , 𝐶𝑙2 , . . . , 𝐶𝑙𝑡 , and let 𝑐𝑛𝑛 be a Convolutional Neural Network. The multilayer network
𝒢 = {𝐺1 , 𝐺2 , ...𝐺𝑡 } corresponding to 𝑐𝑛𝑛 is a set of 𝑡 class networks such that 𝐺ℎ , 1 ≤ ℎ ≤ 𝑡,
corresponds to the ℎ𝑡ℎ target class of 𝐷.
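In terms of data structures, such a multilayer network can be kept, for instance, as one weighted directed graph per target class; the following minimal sketch (the dictionary layout and the use of networkx are our own choices) only fixes the idea.

import networkx as nx

# G = {G_1, ..., G_t}: one class network per target class of the dataset D.
# Each class network is a weighted directed graph.
multilayer_network = {class_label: nx.DiGraph() for class_label in ["Cl_1", "Cl_2", "Cl_3"]}
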
   Figure 2 shows a multilayer network 𝒢 characterized by three layers, 𝐺1, 𝐺2 and 𝐺3, each
consisting of a generated class network. Figure 3 shows the generation of a portion of 𝒢 obtained
by extracting the feature maps of three target classes from 𝑐𝑛𝑛. In this example, a
set of filters of size 3 × 3 with stride 1, sliding over positions (3, 2) and (4, 2) of the feature
maps, generates some arcs for the class networks of 𝒢. The weights of the arcs are computed
by applying the rules described in Section 2.1. In particular, for the class network 𝐺1 (resp., 𝐺2 ,
𝐺3 ), two sets of arcs of weights 5 (resp., 1, 3) and 4 (resp., 7, 6) are generated between the first
and second feature maps. Similarly, a set of arcs of weights 2 (resp., -2, 4) and 8 (resp., 9, -1) are
generated between the second and the third feature maps.
   Our algorithm for constructing 𝒢 from 𝑐𝑛𝑛 and 𝐷 consists of two steps. During the first
step, it creates a support data structure consisting of a list of patch lists. A patch is a portion of
a feature map having the same size as the filter applied in the next convolutional layer; each
patch gives rise to a node in the multilayer network. The list of patch lists is then used during the
second step to construct 𝒢.
   The function corresponding to the first step, called CREATE_PATCHES, receives 𝑐𝑛𝑛 and
a target class 𝐶𝑙ℎ, 1 ≤ ℎ ≤ 𝑡. It operates on the feature maps provided as input to each
convolutional layer of 𝑐𝑛𝑛 and proceeds as follows. It uses a list conv_layers that initially
contains the convolutional layers of 𝑐𝑛𝑛. Afterwards, it iterates over the elements of conv_layers
and provides them with the images of 𝐶𝑙ℎ. During each iteration, it treats the current element
as the source and the next element as the target. The feature map returned by source is given
as input to target, which processes it as specified below.
   At the beginning of each iteration, CREATE_PATCHES determines the starting and ending
points of the source output. After that, it iterates over the source output and creates a patch for
each application of the convolutional filter on an element. At the end of the iteration, it stores
the patches corresponding to a convolutional layer in a list called patch_list. Finally, it returns
the lists corresponding to all the convolutional layers of 𝑐𝑛𝑛.

¹ Here and in the following, we employ the symbols ℐ(𝑖, 𝑗) and 𝒪(𝑖, 𝑗) to indicate both the elements of the feature maps and the corresponding nodes of the class network.

Figure 2: Sample multilayer network 𝒢 composed of three layers corresponding to the class networks 𝐺1 (top), 𝐺2 (middle), and 𝐺3 (bottom).
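A simplified Python sketch of CREATE_PATCHES is reported below; it only covers the patch extraction for square filters, and the argument layout (one 2D feature map, kernel size and stride per convolutional layer) is an assumption of ours rather than the actual implementation.

def create_patches(layer_inputs, kernel_sizes, strides):
    # layer_inputs: the 2D feature map given as input to each convolutional layer
    # (for the first layer, the image of the target class Cl_h).
    # For every layer, one patch is created per application of the filter; each
    # patch will give rise to a node of the class network.
    patch_lists = []
    for fmap, k, stride in zip(layer_inputs, kernel_sizes, strides):
        patches = []
        rows, cols = fmap.shape
        for i in range(0, rows - k + 1, stride):
            for j in range(0, cols - k + 1, stride):
                # (top-left corner, patch content) identifies one patch
                patches.append(((i, j), fmap[i:i + k, j:j + k]))
        patch_lists.append(patches)
    return patch_lists
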
    The function corresponding to the second step, called CREATE_LAYER_NETWORK, receives 𝑐𝑛𝑛
and a target class 𝐶𝑙ℎ, 1 ≤ ℎ ≤ 𝑡, and builds the corresponding layer 𝐺ℎ of the multilayer network
𝒢. It first calls CREATE_PATCHES to receive the list of patch lists corresponding to 𝑐𝑛𝑛 and 𝐶𝑙ℎ.
Then, it creates an initially empty network 𝐺ℎ. Afterwards, it scans the list of patch lists returned
by CREATE_PATCHES, considering the current list as source and the next one as target.
    For each iteration, it adds to 𝐺ℎ a node for each patch present in source or target, if not
already present in 𝐺ℎ. Then, it determines the size of the area covered by the filter in target.
This size is equal to the one covered in source, if we are in the presence of a convolutional layer, or
it is greater, if we have a pooling layer. After that, it iterates over the source and target nodes on
which the filter acts. Given a target node 𝑣𝑡, it considers all the source nodes that can be processed
by a filter whose center falls in the rectangle defined by the coordinates of the patch of 𝑣𝑡 and, for
each of them, adds an arc from it to 𝑣𝑡 in 𝐺ℎ.
    Once this arc has been added, it computes the corresponding weight by applying the
formula seen above (see Figure 3 for an example of its application). At the end of its iterations,
CREATE_LAYER_NETWORK has created the ℎ𝑡ℎ layer 𝐺ℎ of 𝒢, corresponding to the target class
𝐶𝑙ℎ. Applying this function 𝑡 times, once for each target class in the dataset 𝐷, we obtain the
final multilayer network.
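The following Python sketch summarizes CREATE_LAYER_NETWORK under the same simplifying assumptions; overlaps and arc_weight_fn are hypothetical helpers standing in for the geometric test and the weight formula of Section 2.1, and networkx is used only for convenience.

import networkx as nx

def create_layer_network(patch_lists, overlaps, arc_weight_fn):
    # Build the class network G_h from consecutive patch lists: the current
    # list acts as source, the next one as target.
    G_h = nx.DiGraph()
    for level, (source, target) in enumerate(zip(patch_lists, patch_lists[1:])):
        for pos_s, patch_s in source:
            for pos_t, _patch_t in target:
                # connect a source patch to a target patch only if the filter
                # centred on the target patch covers the source position
                if not overlaps(pos_s, pos_t):
                    continue
                u, v = (level, pos_s), (level + 1, pos_t)
                G_h.add_edge(u, v, weight=arc_weight_fn(patch_s, pos_t))
    return G_h

Calling this function once per target class yields the 𝑡 layers 𝐺1, ..., 𝐺𝑡 of the multilayer network.
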
Figure 3: Generation of a portion of the class networks 𝐺1 (a), 𝐺2 (b), and 𝐺3 (c) corresponding to the layers of the multilayer network 𝒢 depicted in Figure 2.


3. Applying the multilayer network model to compress a CNN
Once a CNN is mapped into a multilayer network, the latter can be used to analyze and
manipulate the former. There are several operations that can be performed on the CNN thanks
to its mapping into a multilayer network. To give an idea of such operations, in this section we
examine one of them, namely the compression of a CNN. This represents the second main
contribution of this paper.
    Let 𝒢 be a multilayer network and let 𝐺ℎ be its ℎ𝑡ℎ layer. 𝐺ℎ is a weighted directed graph.
Given a node 𝑣 of 𝐺ℎ , we can define: (i) the indegree of 𝑣, as the sum of the weights of the arcs
of 𝐺ℎ incoming into 𝑣; (ii) the outdegree of 𝑣, as the sum of the weights of the arcs outgoing
from 𝑣; (iii) the degree of 𝑣, as the sum of its indegree and its outdegree. We adopt the symbol
𝑑ℎ (𝑣) to denote the degree of 𝑣 in 𝐺ℎ and the symbol 𝛿(𝑣) to represent the overall (weighted)
degree of 𝑣 in 𝒢. As we will see below, 𝛿(𝑣) is an important indicator of the effectiveness of the
filter represented by 𝑣.
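For a class network stored as a weighted directed graph, these quantities can be computed directly; a minimal sketch follows (the use of networkx and the function name are our own choices).

def degree_in_layer(G_h, v):
    # d_h(v): weighted indegree of v in G_h plus its weighted outdegree.
    return G_h.in_degree(v, weight="weight") + G_h.out_degree(v, weight="weight")
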
  As previously pointed out, 𝒢 = {𝐺1 , 𝐺2 , . . . , 𝐺𝑡 } has a layer for each target class. As a
consequence, the overall degree 𝛿(𝑣) of the node 𝑣 in 𝒢 can be obtained by suitably aggregating
the degrees 𝑑1 (𝑣), 𝑑2 (𝑣), · · · , 𝑑𝑡 (𝑣) of 𝑣 in the 𝑡 layers of 𝒢. More specifically, if we indicate
with ℱ a suitable aggregation function, we have that:

                                 𝛿(𝑣) = ℱ(𝑑1 (𝑣), 𝑑2 (𝑣), ..., 𝑑𝑡 (𝑣))                              (1)
   Let $d_{tot}(v) = \sum_{h=1}^{t} d_h(v)$ be the sum of the degrees of 𝑣 in the 𝑡 layers of 𝒢. We adopt the
following entropy-based aggregation function [10, 11] for determining 𝛿(𝑣):

$$\delta(v) = -\sum_{h=1}^{t} \frac{d_h(v)}{d_{tot}(v)} \log\left(\frac{d_h(v)}{d_{tot}(v)}\right) \qquad (2)$$

   This function refers to the famous concept of information entropy introduced by Shannon
[12]. It favors the presence of a uniform distribution of the degree of 𝑣 in the different layers,
while it penalizes the presence of a high degree of 𝑣 in few layers and a low degree of 𝑣 in many
layers. In this way, we favor those nodes whose feature extraction is balanced for different
target classes [13] and penalize those nodes that do not favor such a balance.
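A direct Python transcription of Equation (2) is shown below; the base of the logarithm is not specified in the text, so the natural logarithm is assumed here, and the guard for isolated nodes is our own addition.

import math

def overall_degree(per_layer_degrees):
    # per_layer_degrees = [d_1(v), ..., d_t(v)]: the degrees of node v in the
    # t layers of the multilayer network. Returns delta(v) as in Equation (2):
    # a uniform degree across layers maximises the value, concentration lowers it.
    d_tot = sum(per_layer_degrees)
    if d_tot == 0:
        return 0.0
    delta = 0.0
    for d_h in per_layer_degrees:
        p = d_h / d_tot
        if p > 0:
            delta -= p * math.log(p)
    return delta
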
   Our approach for the compression of 𝑐𝑛𝑛 selects a subset of the nodes of 𝒢 with the highest
values of 𝛿. This is equivalent to selecting the convolutional layers of 𝑐𝑛𝑛 that contribute the
most to the quality of the classification and, therefore, cannot be discarded in any way. In
particular, our approach selects the nodes of 𝒢 whose values of 𝛿 are higher than a certain
threshold 𝑡ℎ𝛿 = 𝛾 · 𝛿̄.
   Here, 𝛿̄ denotes a statistical descriptor of the values of the overall degree 𝛿 over all the nodes in 𝒢.
Our approach allows the usage of the mean or the median as statistical descriptors because
they are the ones that allowed us to obtain the best experimental results. 𝛾 is a scaling factor
that allows us to tune the contribution of 𝛿̄. Its value belongs to the real interval [0, +∞).
   After selecting the subset of the nodes of 𝒢 with an overall degree 𝛿 higher than 𝑡ℎ𝛿 , our
approach determines the set of the convolutional layers of 𝑐𝑛𝑛 from which these nodes were
extracted. Finally, it keeps these layers in the compressed version of 𝑐𝑛𝑛 while discarding the
others. At this point, our approach terminates by training the pruned network.
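The whole compression step can then be outlined as follows; the two dictionaries passed as arguments (node overall degrees and the node-to-layer bookkeeping) are our own way of organizing the information, not part of the original description.

import statistics

def select_layers_to_keep(node_delta, node_to_conv_layer, gamma=1.0, descriptor="mean"):
    # node_delta: node -> delta(v); node_to_conv_layer: node -> index of the
    # convolutional layer the node was extracted from.
    values = list(node_delta.values())
    delta_bar = statistics.mean(values) if descriptor == "mean" else statistics.median(values)
    threshold = gamma * delta_bar
    # keep every convolutional layer that originated at least one node whose
    # overall degree exceeds the threshold; the others are pruned.
    kept = {node_to_conv_layer[v] for v, d in node_delta.items() if d > threshold}
    return sorted(kept)

The pruned CNN, composed only of the kept layers, is then trained again as described above.
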


4. Conclusion
In this paper, we have seen how it is possible to map a deep learning architecture, specifically a
CNN, into a multilayer network, which facilitates its analysis, exploration and manipulation.
To this end, we proposed an approach to map each element of a CNN into the elements of the
multilayer network, namely nodes, arcs, arc weights and layers. Then, to give an idea of what
can be done thanks to such a mapping, we proposed an approach to compress a CNN based on
the pruning of the layers that least contribute to the quality of the classification.
   The approach proposed in this paper is a starting point for further research efforts in this area.
For example, our approach does not consider the residual connections, typical of ResNet. In the
future, we plan to extend it in order to handle this type of architecture. Also, at the moment, our
approach is not able to prune single filters from a convolutional layer. Doing so would require a
much more detailed mapping of the CNN into the multilayer network. At the moment, we have
not done this to keep the size of the multilayer network limited. However, conceptually there
are no limitations in performing such a task. In the future, we plan to proceed in that direction
so that we can perform the pruning of a CNN with finer granularity.


References
 [1] S. Dargan, M. Kumar, M. R. Ayyagari, G. Kumar, A survey of deep learning and its
     applications: A new paradigm to machine learning, Archives of Computational Methods
     in Engineering 27 (2020) 1071–1092.
 [2] E. Corradini, G. Porcino, A. Scopelliti, D. Ursino, L. Virgili, Fine-tuning SalGAN and
     PathGAN for extending saliency map and gaze path prediction from natural images to
     websites, Expert Systems With Applications 191 (2022) 116282. Elsevier.
 [3] Z. Chen, Z. Chen, J. Lin, S. Liu, W. Li, Deep neural network acceleration based on low-rank
     approximated channel pruning, IEEE Transactions on Circuits and Systems I: Regular
     Papers 67 (2020) 1232–1244. doi:10.1109/TCSI.2019.2958937.
 [4] J. Liu, B. Zhuang, Z. Zhuang, Y. Guo, J. Huang, J. Zhu, M. Tan, Discrimination-aware
     network pruning for deep model compression, IEEE Transactions on Pattern Analysis and
     Machine Intelligence (2021) 1–1. doi:10.1109/TPAMI.2021.3066410.
 [5] T. Choudhary, V. Mishra, A. Goswami, J. Sarangapani, A comprehensive survey on model
     compression and acceleration, Artificial Intelligence Review (2020) 1–43.
 [6] Z.-R. Wang, J. Du, Joint architecture and knowledge distillation in CNN for Chinese text
     recognition, Pattern Recognition 111 (2021) 107722. Elsevier.
 [7] A. Khan, A. Sohail, U. Zahoora, A. S. Qureshi, A survey of the recent architectures of
     deep convolutional neural networks, Artif. Intell. Rev. 53 (2020) 5455–5516. doi:10.1007/
     s10462-020-09825-6.
 [8] M. Kivela, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, M. A. Porter, Multilayer
     networks, Journal of Complex Networks 2 (2014) 203–271. doi:10.1093/comnet/cnu016.
 [9] S. Chen, Q. Zhao, Shallowing deep networks: Layer-wise pruning based on feature
     representations, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019)
     3048–3056. doi:10.1109/TPAMI.2018.2874634.
[10] M. Hmimida, R. Kanawati, Community detection in multiplex networks: A seed-centric
     approach, Networks & Heterogeneous Media 10 (2015) 71–85.
[11] F. Battiston, V. Nicosia, V. Latora, Structural measures for multiplex networks, Phys. Rev.
     E 89 (2014) 032804. doi:10.1103/PhysRevE.89.032804.
[12] C. E. Shannon, A mathematical theory of communication, The Bell System Technical
     Journal 27 (1948) 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.
[13] N. Gowdra, R. Sinha, S. MacDonell, W. Q. Yan, Mitigating severe over-parameterization in
     deep convolutional neural networks through forced feature abstraction and compression
     with an entropy-based heuristic, Pattern Recognition (2021) 108057. Elsevier.