Vol-3194/paper72
Paper | |
---|---|
id | Vol-3194/paper72 |
wikidataid | Q117344951 |
title | Recent Advancements on Bio-Inspired Hebbian Learning for Deep Neural Networks |
pdfUrl | https://ceur-ws.org/Vol-3194/paper72.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/Lagani22 |
volume | Vol-3194 |
Recent Advancements on Bio-Inspired Hebbian Learning for Deep Neural Networks
Gabriele Lagani¹,²
¹ University of Pisa, Pisa, 56127, Italy
² ISTI-CNR, Pisa, 56124, Italy

Abstract

Deep learning is becoming more and more popular for extracting information from multimedia data for indexing and query processing. In recent contributions, we have explored a biologically inspired strategy for Deep Neural Network (DNN) training, based on the Hebbian principle from neuroscience. We studied hybrid approaches in which unsupervised Hebbian learning is used for a pre-training stage, followed by supervised fine-tuning based on Stochastic Gradient Descent (SGD). The resulting semi-supervised strategy exhibited encouraging results on computer vision datasets, motivating further interest towards applications in the domain of large-scale multimedia content-based retrieval.

Keywords: Machine Learning, Hebbian Learning, Deep Neural Networks, Computer Vision

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy. Contact: gabriele.lagani@phd.unipi.it (G. Lagani). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

In the past few years, Deep Neural Networks (DNNs) have emerged as a powerful technology in the domain of computer vision [1, 2]. Consequently, DNNs also started gaining popularity in the domain of large-scale multimedia content-based retrieval, replacing handcrafted feature extractors [3, 4]. Learning algorithms for DNNs are typically based on supervised end-to-end Stochastic Gradient Descent (SGD) training with error backpropagation (backprop). This approach is considered biologically implausible by neuroscientists [5], who instead propose Hebbian learning as a biological model of synaptic plasticity [6]. Backprop-based algorithms need a large number of labeled training samples to achieve high performance, and labels, unlike unlabeled samples, are expensive to gather.

The idea behind our contribution [7, 8] is to tackle this sample efficiency problem by taking inspiration from biology and Hebbian learning. Since Hebbian approaches are mainly unsupervised, we propose to use them for an unsupervised pre-training stage on all the available data, in a semi-supervised setting, followed by end-to-end backprop fine-tuning on the labeled data only. In the rest of this paper, we illustrate the proposed methodology and show experimental results in computer vision. The results are promising, motivating further interest in the application of our approach to the domain of large-scale multimedia content retrieval.

The remainder of this paper is structured as follows: Section 2 gives some background on Hebbian learning and semi-supervised training; Section 3 delves deeper into the semi-supervised approach based on Hebbian learning; Section 4 illustrates our experimental results and discusses the conclusions.

2. Background and related work

Several variants of Hebbian learning rules have been developed over the years. Some examples are: Hebbian learning with Winner-Takes-All (WTA) competition [9], Hebbian learning for Principal Component Analysis (PCA) [6, 10], and Hebbian/anti-Hebbian learning [11]. A brief overview is given in Section 3.
However, it was only recently that Hebbian learning started gaining attention in the context of DNN training [12, 13, 14, 15, 16]. In [14], a Hebbian learning rule based on inhibitory competition was used to train a neural network composed of fully connected layers, and the approach was validated on object recognition tasks. The Hebbian/anti-Hebbian learning rule developed in [11] was instead applied in [13] to train convolutional feature extractors, and the resulting features were shown to be effective for classification. Convolutional layers were also considered in [12], in this case with a Hebbian approach based on WTA competition. However, these approaches were based on relatively shallow network architectures (2-3 layers). A further step was taken in [15, 16], where Hebbian learning rules were applied to train a 6-layer Convolutional Neural Network (CNN).

It is known that a pre-training phase allows network weights to be initialized in a region near a good local optimum [17, 18]. Previous papers investigated the idea of enhancing neural network training with an unsupervised learning objective [19, 20]. In [19], Variational Auto-Encoders (VAE) were considered in order to perform an unsupervised pre-training phase using a limited amount of labeled samples. Also [20] relied on autoencoding architectures to augment supervised training with unsupervised reconstruction objectives, showing that joint optimization of supervised and unsupervised losses helps to regularize the learning process.

3. Hebbian learning strategies and sample efficiency

Consider a single neuron with weight vector $\mathbf{w}$ and input $\mathbf{x}$, and call $y = \mathbf{w}^T \mathbf{x}$ the neuron output. A learning rule defines a weight update as follows:

$$\mathbf{w}_{new} = \mathbf{w}_{old} + \Delta \mathbf{w} \qquad (1)$$

where $\mathbf{w}_{new}$ is the updated weight vector, $\mathbf{w}_{old}$ is the old weight vector, and $\Delta \mathbf{w}$ is the weight update. The Hebbian learning rule, in its simplest form, can be expressed as $\Delta \mathbf{w} = \eta \, y \, \mathbf{x}$, where $\eta$ is the learning rate [6]. Basically, this rule states that the weight on a given synapse is reinforced when the input on that synapse and the output of the neuron are simultaneously high; therefore, connections between neurons whose activations are correlated are reinforced. In order to prevent weights from growing unbounded, a weight decay term is generally added. In the context of competitive learning [9], this is obtained as follows:

$$\Delta \mathbf{w}_i = \eta \, y_i \, \mathbf{x} - \eta \, y_i \, \mathbf{w}_i = \eta \, y_i \, (\mathbf{x} - \mathbf{w}_i) \qquad (2)$$

where the subscript $i$ refers to the $i$-th neuron in a given network layer. Moreover, the output $y_i$ can be replaced with the result $r_i$ of a competitive nonlinearity, which helps decorrelate the activity of different neurons. In the Winner-Takes-All (WTA) approach [9], at each training step, the neuron that produces the strongest activation for a given input is called the winner; in this case, $r_i = 1$ if the $i$-th neuron is the winner and 0 otherwise. In other words, only the winner is allowed to perform the weight update, so that it will be more likely for the same neuron to win again when a similar input is presented in the future. In this way, different neurons are induced to specialize on different patterns. In soft-WTA [21], $r_i$ is computed as $r_i = y_i / \sum_j y_j$. We found this formulation to work poorly in practice, because there is no tunable parameter to cope with the variance of the activations. For this reason, we introduced a variant of this approach that uses a softmax operation in order to compute $r_i$:

$$r_i = \frac{e^{y_i / T}}{\sum_j e^{y_j / T}} \qquad (3)$$

where $T$ is the temperature hyperparameter.
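To make the update rule concrete, the following is a minimal NumPy sketch of one soft-WTA Hebbian step combining Eqs. (1)-(3). It is an illustrative sketch, not the paper's released code: the function and parameter names are our own, and a single fully connected layer with one input vector per step is assumed.

```python
import numpy as np

def soft_wta_hebbian_step(W, x, eta=0.01, temperature=0.1):
    """One soft-WTA Hebbian update for a fully connected layer (illustrative sketch).

    W: (n_neurons, n_inputs) weight matrix, one row per neuron.
    x: (n_inputs,) input vector.
    Implements delta_w_i = eta * r_i * (x - w_i), with r = softmax(y / T).
    """
    y = W @ x                                      # neuron outputs y_i = w_i^T x
    scores = y / temperature
    scores -= scores.max()                         # shift for numerical stability
    r = np.exp(scores) / np.exp(scores).sum()      # competitive nonlinearity (Eq. 3)
    delta_W = eta * r[:, None] * (x[None, :] - W)  # Hebbian term with weight decay (Eq. 2)
    return W + delta_W                             # w_new = w_old + delta_w (Eq. 1)
```

With a low temperature the update approaches hard WTA, where essentially only the strongest neuron moves its weight vector towards the current input; a higher temperature spreads the update across more neurons.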
The advantage of this formulation is that we can tune the temperature to obtain the best performance on a given task, depending on the distribution of the activations.

The Hebbian Principal Component Analysis (HPCA) learning rule, in the case of nonlinear neurons, is obtained by minimizing the so-called representation error:

$$L(\mathbf{w}_i) = E\left[\left\| \mathbf{x} - \sum_{j=1}^{i} f(y_j) \, \mathbf{w}_j \right\|^2\right] \qquad (4)$$

where $f(\cdot)$ is the neuron activation function. Minimization of this objective leads to the nonlinear HPCA rule [10]:

$$\Delta \mathbf{w}_i = \eta \, f(y_i) \left(\mathbf{x} - \sum_{j=1}^{i} f(y_j) \, \mathbf{w}_j\right) \qquad (5)$$

Notice that these learning rules do not require supervision, and that they are local to each network layer, i.e. they do not require backpropagation.

In order to contextualize our approach in a scenario with scarce data, let us define the labeled set $\mathcal{T}_L$ as a collection of elements for which the corresponding label is known. Conversely, the unlabeled set $\mathcal{T}_U$ is a collection of elements whose labels are unknown. The whole training set $\mathcal{T}$ is given by the union of $\mathcal{T}_L$ and $\mathcal{T}_U$. All the samples in $\mathcal{T}$ are assumed to be drawn from the same statistical distribution. In a sample efficiency scenario, the number of samples in $\mathcal{T}_L$ is typically much smaller than the total number of samples in $\mathcal{T}$. In particular, in an $s\%$-sample efficiency regime, the size of the labeled set is $s\%$ of that of the whole training set.

To tackle this scenario, we considered a semi-supervised approach in two phases. During the first phase, latent representations are obtained from the hidden layers of a DNN, which are trained using unsupervised Hebbian learning; this unsupervised pre-training is performed on all the available training samples. During the second phase, a final linear classifier is placed on top of the features extracted from the deep network layers, and classifier and deep layers are fine-tuned in a supervised fashion, by running an end-to-end SGD optimization procedure using only the few labeled samples at our disposal.

Figure 1: The neural network used for the experiments.

4. Results and conclusions

In order to validate our method, we performed experiments on various image datasets and in various sample efficiency regimes (code available at https://github.com/GabrieleLagani/HebbianPCA/tree/hebbpca). For the sake of brevity, but without loss of generality, in this venue we present the results on CIFAR10 [22], in sample efficiency regimes where the amount of labeled samples was respectively 1%, 5%, 10%, and 100% of the whole training set. Further results can be found in [7, 8]. We considered a six-layer neural network as shown in Fig. 1: five deep layers plus a final linear classifier. The various layers were interleaved with other processing stages (such as ReLU nonlinearities, max-pooling, etc.). We first performed unsupervised pre-training with a chosen algorithm. Then, we cut the network at a given layer and attached a new classifier on top of the features extracted from that layer. Deep layers and classifier were then fine-tuned with supervision in an end-to-end fashion, and the resulting accuracy was evaluated. This was done for each layer, in order to evaluate the network on a layer-by-layer basis, and for each sample efficiency regime. For the unsupervised pre-training of the deep layers, we considered both the HPCA and the soft-WTA strategy. In addition, as a baseline for comparison, we considered another popular unsupervised pre-training method, namely the Variational Auto-Encoder (VAE) [23] (considered also in [19]). Note that the VAE is unsupervised, but still backprop-based.
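As an illustration of the layer-wise fine-tuning protocol just described, the following is a minimal PyTorch-style sketch. It is not the paper's actual code: the pre-trained layer stack, feature dimension, and labeled data loader are hypothetical placeholders standing in for the Hebbian-pretrained network and the labeled subset.

```python
import torch
import torch.nn as nn

def cut_and_attach_classifier(pretrained_layers, cut_index, feature_dim, num_classes=10):
    """Keep the (hypothetical) Hebbian-pretrained layers up to `cut_index`
    and place a fresh linear classifier on top of the extracted features."""
    backbone = nn.Sequential(*pretrained_layers[:cut_index])
    return nn.Sequential(backbone, nn.Flatten(), nn.Linear(feature_dim, num_classes))

def finetune(model, labeled_loader, epochs=20, lr=1e-3):
    """Phase 2: end-to-end supervised SGD fine-tuning on the labeled subset only."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in labeled_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model
```

In the experiments reported below, this procedure is repeated for every cut point (L1 to L5) and for every sample efficiency regime.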
The results are shown in Tab. 1, together with 95% confidence intervals obtained from five independent repetitions of the experiments.

Table 1: CIFAR10 accuracy (%) and 95% confidence intervals, on various layers and for various sample efficiency regimes. Results obtained with VAE and Hebbian pre-training are compared.

Regime | Pre-Train | L1 | L2 | L3 | L4 | L5
---|---|---|---|---|---|---
1% | VAE | 33.54 ±0.27 | 34.41 ±0.84 | 29.92 ±1.25 | 24.91 ±0.66 | 22.54 ±0.60
1% | soft-WTA | 35.47 ±0.19 | 35.75 ±0.65 | 36.09 ±0.27 | 30.57 ±0.36 | 30.23 ±0.37
1% | HPCA | 37.01 ±0.42 | 37.65 ±0.19 | 41.88 ±0.53 | 40.06 ±0.65 | 39.75 ±0.50
5% | VAE | 46.31 ±0.39 | 48.21 ±0.21 | 48.98 ±0.34 | 36.32 ±0.35 | 32.75 ±0.32
5% | soft-WTA | 48.34 ±0.27 | 52.90 ±0.28 | 54.01 ±0.24 | 49.80 ±0.16 | 48.35 ±0.26
5% | HPCA | 48.49 ±0.44 | 50.14 ±0.46 | 53.33 ±0.52 | 52.49 ±0.16 | 52.20 ±0.37
10% | VAE | 53.83 ±0.26 | 56.33 ±0.22 | 57.85 ±0.22 | 52.26 ±1.08 | 45.67 ±1.15
10% | soft-WTA | 54.23 ±0.18 | 59.40 ±0.20 | 61.27 ±0.24 | 58.33 ±0.35 | 58.00 ±0.26
10% | HPCA | 54.36 ±0.32 | 56.08 ±0.28 | 58.46 ±0.15 | 56.54 ±0.23 | 57.35 ±0.18
100% | VAE | 67.53 ±0.22 | 75.83 ±0.31 | 80.78 ±0.28 | 84.27 ±0.35 | 85.23 ±0.26
100% | soft-WTA | 67.37 ±0.16 | 77.39 ±0.04 | 81.83 ±0.47 | 84.42 ±0.15 | 85.37 ±0.03
100% | HPCA | 66.76 ±0.13 | 75.16 ±0.20 | 79.90 ±0.18 | 83.55 ±0.33 | 84.38 ±0.22

In summary, the results suggest that our semi-supervised approach based on unsupervised Hebbian pre-training generally performs better than VAE pre-training, especially in low sample efficiency regimes, in which only a small portion of the training set (between 1% and 10%) is assumed to be labeled. In particular, the HPCA approach appears to perform generally better than soft-WTA. Concerning the computational cost of Hebbian learning, the approach converged in just 1-2 epochs of training, while backprop-based approaches required 10-20 epochs, showing promise towards scaling to large-scale scenarios.

In future work, we plan to investigate the combination of Hebbian approaches with alternative semi-supervised methods, namely pseudo-labeling and consistency-based methods [24, 25], which do not exclude unsupervised pre-training but can rather be integrated with it. Moreover, we are currently conducting more thorough explorations of Hebbian algorithms in the domain of large-scale multimedia content-based retrieval, and the results are promising [26].

Acknowledgments

This work was partially supported by the H2020 projects AI4EU (GA 825619) and AI4Media (GA 951911).

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems (2012).
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[3] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. Li, Deep learning for content-based image retrieval: A comprehensive study, in: Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 157–166.
[4] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural codes for image retrieval, in: European conference on computer vision, Springer, 2014, pp. 584–599.
[5] R. C. O'Reilly, Y. Munakata, Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain, MIT Press, 2000.
[6] S. Haykin, Neural networks and learning machines, 3rd ed., Pearson, 2009.
[7] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Hebbian semi-supervised learning in a sample efficiency setting, Neural Networks 143 (2021) 719–731.
[8] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Evaluating Hebbian learning in a semi-supervised setting, in: International Conference on Machine Learning, Optimization, and Data Science, Springer, 2021, pp. 365–379.
[9] S. Grossberg, Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors, Biological Cybernetics 23 (1976) 121–134.
[10] J. Karhunen, J. Joutsensalo, Generalizations of principal component analysis, optimization problems, and neural networks, Neural Networks 8 (1995) 549–562.
[11] C. Pehlevan, T. Hu, D. B. Chklovskii, A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data, Neural Computation 27 (2015) 1461–1495.
[12] A. Wadhwa, U. Madhow, Bottom-up deep learning using the Hebbian principle, 2016.
[13] Y. Bahroun, A. Soltoggio, Online representation learning with single and multi-layer Hebbian networks for image classification, in: International Conference on Artificial Neural Networks, Springer, 2017, pp. 354–363.
[14] D. Krotov, J. J. Hopfield, Unsupervised learning by competing hidden units, Proceedings of the National Academy of Sciences 116 (2019) 7723–7731.
[15] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Training convolutional neural networks with competitive Hebbian learning approaches, in: International Conference on Machine Learning, Optimization, and Data Science, Springer, 2021, pp. 25–40.
[16] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Comparing the performance of Hebbian against backpropagation learning using convolutional neural networks, Neural Computing and Applications (2022) 1–17.
[17] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in neural information processing systems, 2007, pp. 153–160.
[18] H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring strategies for training deep neural networks, Journal of Machine Learning Research 10 (2009).
[19] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, M. Welling, Semi-supervised learning with deep generative models, Advances in neural information processing systems 27 (2014) 3581–3589.
[20] Y. Zhang, K. Lee, H. Lee, Augmenting supervised neural networks with unsupervised objectives for large-scale image classification, in: International conference on machine learning, 2016, pp. 612–621.
[21] S. J. Nowlan, Maximum likelihood competitive learning, in: Advances in neural information processing systems, 1990, pp. 574–582.
[22] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, 2009.
[23] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[24] A. Iscen, G. Tolias, Y. Avrithis, O. Chum, Label propagation for deep semi-supervised learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5070–5079.
[25] P. Sellars, A. I. Aviles-Rivero, C.-B. Schönlieb, LaplaceNet: A hybrid energy-neural model for deep semi-supervised classification, arXiv preprint arXiv:2106.04527 (2021).
[26] G. Lagani, D. Bacciu, C. Gallicchio, F. Falchi, C. Gennaro, G. Amato, Deep features for CBIR with scarce data using Hebbian learning, submitted to CBMI 2022 (2022). URL: https://arxiv.org/abs/2205.08935.