Vol-3194/paper42

From BITPlan ceur-ws Wiki

Paper

id  Vol-3194/paper42
wikidataid  Q117344982
title  Distributed Heterogeneous Transfer Learning
pdfUrl  https://ceur-ws.org/Vol-3194/paper42.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/MignonePC22
volume  Vol-3194

Distributed Heterogeneous Transfer Learning
Paolo Mignone1,2 , Gianvito Pio1,2 and Michelangelo Ceci1,2
1 Department of Computer Science, University of Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
2 Big Data Lab, National Interuniversity Consortium for Informatics (CINI), Via Ariosto 25, 00185 Rome, Italy


Abstract

Transfer learning has proved to be effective for building predictive models for a target domain, by exploiting the knowledge coming from a related source domain. However, most existing transfer learning methods assume that source and target domains have common feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they often make strong assumptions, e.g., on the number of features, or cannot distribute the workload when working in a big data environment. In this manuscript, we present a novel transfer learning method which: i) can work with heterogeneous feature spaces without imposing strong assumptions; ii) is fully implemented in Apache Spark following the MapReduce paradigm, enabling the distribution of the workload over multiple computational nodes; iii) is able to work also in the very challenging Positive-Unlabeled (PU) learning setting.

We conducted our experiments in two relevant application domains for transfer learning: the prediction of the energy consumption in power grids and the reconstruction of gene regulatory networks. The results show that the proposed approach fruitfully exploits the knowledge coming from the source domain and outperforms 3 state-of-the-art heterogeneous transfer learning methods.

Keywords
Heterogeneous transfer learning, Positive-Unlabeled setting, Distributed computation, Apache Spark


1. Introduction
Machine learning algorithms aim to train a model function that describes and generalizes a set
of observed examples, called training data. The learned model function can be applied to unseen
data with the same feature space and data distribution as the training data. However, in several
real scenarios, it is difficult or expensive to obtain training data described through the same
feature space and following the same data distribution as the examples to which the prediction
function will be applied. A possible solution to the challenges raised by these scenarios comes
from the design and application of transfer learning methods [1], which aim to learn a predictive
function for a target domain by also exploiting an external but related source domain.
   A relevant example of these scenarios can be observed in the energy field, where predicting
the customer energy consumption is a typical task [2]. When new customers are connected to
the energy network in a new area/district, they may not be well represented by the available
training data, preventing accurate predictions for them. In this case, transfer learning would
enable the exploitation of data related to other customers, also residing in different areas,
leveraging common characteristics in terms of type of customers or of behavior.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
Email: paolo.mignone@uniba.it (P. Mignone); gianvito.pio@uniba.it (G. Pio); gianvito.pio@uniba.it (M. Ceci)
Web: http://www.di.uniba.it/~mignone/ (P. Mignone); http://www.di.uniba.it/~gianvitopio/ (G. Pio);
http://www.di.uniba.it/~ceci/ (M. Ceci)
ORCID: 0000-0002-8641-7880 (P. Mignone); 0000-0003-2520-3616 (G. Pio); 0000-0002-6690-7583 (M. Ceci)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
   In several scenarios, data for the target and the source domains come from multiple data
sources, exhibiting different data representations that make the direct adoption of classical
transfer learning approaches unfeasible. A possible solution may consist in performing time-
consuming manual feature engineering steps, aiming to find commonalities among the data
sources and to somehow make the feature spaces homogeneous. However, such manual
operations can be subjective, error-prone, and even unfeasible when no detailed information is
available about the features at hand, or when the features in the target and source domains
are totally different, i.e., heterogeneous. To overcome this issue, several approaches have been
proposed in the literature to automatically identify the best match between features of different
domains, or to identify a novel shared feature space. Such methods cover different real-world
applications, such as biomedical analysis [3], object recognition from images [4], or multilingual
sentiment classification [5]. However, to the best of our knowledge, existing methods exhibit
one or more of the following limitations: i) they make strict assumptions on the number of
features, i.e., even if they are able to work with heterogeneous feature spaces, such spaces are
required to have the same dimensionality [6, 7, 8]; ii) they are not able to distribute the workload
over multiple computational nodes; iii) in the case of classification tasks, they require a fully
labeled training set, or at least a proper representation of each class, i.e., they are not able to
work in the Positive-Unlabeled (PU) setting, where the training set consists of only positive
labeled instances and unlabeled instances [9], which is very common, for example, in the
biological domain [10, 11]; iv) they only work by exploiting the background knowledge of
specific application domains [12, 13].
   To overcome these limitations, in this paper we propose a novel heterogeneous transfer
learning method, called STEAL, that simultaneously exhibits the following characteristics:
    • It is able to work also in the very challenging Positive-Unlabeled (PU) learning setting;
    • It is implemented using the Apache Spark framework following the MapReduce paradigm,
      enabling the distribution of the workload over multiple computational nodes;
    • It exploits the Kullback-Leibler (KL) divergence to align the descriptive variables of the
      source and target domains and, therefore, to work with heterogeneous domains described
      through feature spaces having different dimensionalities;
    • It is not tailored for specific application domains, but can be considered a general purpose
      approach applicable to multiple real-world scenarios.
2. The proposed method STEAL
In this section, we describe the proposed method STEAL, emphasizing its peculiarities. The
distribution of the workload over multiple computational nodes is achieved by designing the
algorithms using the MapReduce paradigm. The pseudo-code description of the algorithms
exploits the Resilient Distributed Dataset (RDD) data structure available in Apache Spark.
   The workflow followed by STEAL consists of 3 stages. The first two stages are dedicated to
aligning the feature spaces and to identifying a proper matching between target and source
instances. Finally, the third stage consists in training a distributed variant of Random Forests
(RF), available in Apache Spark, on the obtained hybrid training set. In the following, we report
some further details about the first two stages, while a minimal sketch of the third stage is
given right below.
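The following PySpark sketch illustrates the third stage under the assumption that the hybrid training set produced by the first two stages is available as (features, label) rows; the toy rows, the application name, and the number of trees are illustrative, not the configuration used in the experiments.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

# Minimal sketch of Stage 3: train Spark's distributed Random Forest on the
# hybrid training set produced by Stages 1-2. Rows and parameters are illustrative.
spark = SparkSession.builder.appName("steal-stage3-sketch").getOrCreate()
hybrid_rows = [(Vectors.dense([0.1, 0.9, 0.3, 0.7]), 1.0),
               (Vectors.dense([0.8, 0.2, 0.6, 0.1]), 0.0)]
train_df = spark.createDataFrame(hybrid_rows, ["features", "label"])
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
model = rf.fit(train_df)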
Stage 1 - Feature Alignment. In the first stage, in the case of the PU learning setting, we first
estimate the labels of the unlabeled instances by resorting to a clustering-based approach.
Specifically, we apply a prototype-based clustering algorithm to identify k representatives of
the positive instances from the positive examples of the target domain, and exploit them to
estimate a score for each unlabeled instance. In particular, we adopt a distributed variant of the
k-means algorithm, since it is well established in the literature and has been effectively exploited
in previous works in the PU learning setting [10, 11, 12, 13]. The score we compute for each
unlabeled instance is in the interval [0, 1], where a value close to 0.0 (resp., 1.0) means that the
unlabeled example is likely to be a negative (resp., positive) example.
   Methodologically, the score is computed according to the similarity between the feature
vector associated with the unlabeled instance and the feature vectors of the cluster prototypes.
As similarity function, we consider $sim(a, b) = 1 - \sqrt{\frac{1}{r}\sum_{i=1}^{r}(a_i - b_i)^2}$,
which computes the similarity between the instance vectors a and b in an r-dimensional space,
based on the Euclidean distance, after applying a min-max normalization (in the range [0, 1]) to
all the features. After this step, we obtain a fully labeled dataset composed of true and estimated
labels, where the computed score represents the confidence that a given unlabeled instance is a
positive instance.
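As a compact, non-distributed sketch of this scoring step (using scikit-learn's k-means as a stand-in for the distributed variant; the aggregation over prototypes, here the maximum similarity, and the function name are assumptions of the sketch), the computation could look as follows:

import numpy as np
from sklearn.cluster import KMeans

def pu_label_confidence(X_pos, X_unl, k):
    # Min-max normalize all features to [0, 1] using the joint value range
    X_all = np.vstack([X_pos, X_unl])
    lo, hi = X_all.min(axis=0), X_all.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    normalize = lambda X: (X - lo) / span

    # k prototypes of the positive class (stand-in for the distributed k-means)
    prototypes = KMeans(n_clusters=k, n_init=10, random_state=0).fit(normalize(X_pos)).cluster_centers_

    # sim(a, b) = 1 - sqrt(sum_i (a_i - b_i)^2 / r), in [0, 1] after normalization
    r = X_pos.shape[1]
    sim = lambda a, b: 1.0 - np.sqrt(np.sum((a - b) ** 2) / r)

    # Score of each unlabeled instance: similarity to its closest prototype (assumption)
    return np.array([max(sim(x, p) for p in prototypes) for x in normalize(X_unl)])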
   Subsequently, if the dataset is highly dimensional¹, STEAL applies a distributed variant of
the PCA, in order to identify a reduced set of (non-redundant) new features. We extract
$\sqrt{min(p_s, p_t)}$ new features, where $p_s$ and $p_t$ are the initial dimensionalities of the
source and the target feature spaces, respectively.
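A minimal sketch of this optional reduction step, using scikit-learn's PCA in place of the distributed implementation and assuming the projection is fitted on each domain separately (the function name is illustrative), could be:

import numpy as np
from sklearn.decomposition import PCA

def reduce_dimensionality(X_source, X_target):
    # Extract sqrt(min(p_s, p_t)) new features from each domain (rounded down);
    # fitting one projection per domain is an assumption of this sketch.
    p_s, p_t = X_source.shape[1], X_target.shape[1]
    n_components = int(np.sqrt(min(p_s, p_t)))
    return (PCA(n_components=n_components).fit_transform(X_source),
            PCA(n_components=n_components).fit_transform(X_target))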
   Finally, we focus on the main objective of this stage, namely, the identification of an alignment
between the descriptive variables of the source and target domains. This step is necessary in
order to make the instances of the source and target domains directly comparable (through a
distance/similarity measure), to carry out the following stage. Specifically, the goal is to find a
corresponding feature in the source domain for each feature of the target domain. To this aim,
we compute the asymmetric KL divergence for each target-source pair of descriptive variables,
which quantifies the loss of information caused by a given feature of the source domain when
used to approximate a given feature of the target domain. Formally, given the i-th feature of
the target domain $F_t^i$, and the j-th feature of the source domain $F_s^j$, we compute the discrete
variant of the KL divergence between $F_t^i$ and $F_s^j$ as follows:
$$KL(F_t^i, F_s^j, X_t, X_s) = \sum_{x \in un(F_t^i, X_t) \cap un(F_s^j, X_s)} p(x, F_t^i, X_t) \ln \frac{p(x, F_t^i, X_t)}{p(x, F_s^j, X_s)}, \qquad (1)$$


where 𝑋𝑡 and 𝑋𝑠 are the training instances of the target and source domains, respectively;
𝑢𝑛(𝑦, 𝑤) represents the set of distinct values of the feature 𝑦 in the dataset 𝑤; 𝑝(𝑥, 𝑦, 𝑤)
represents the relative number of occurrences of the value 𝑥 on the feature 𝑦 in the dataset 𝑤,
after a proper discretization of the feature 𝑦 2 .
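For a single pair of features, a non-distributed sketch of Equation (1), following the discretization of footnote 2, could look like the snippet below (the helper name and variable names are illustrative):

import numpy as np
from collections import Counter

def discrete_kl(ft_values, fs_values, bin_width=0.01):
    # Equal-width discretization after min-max normalization (as in footnote 2)
    def discretize(values):
        v = np.asarray(values, dtype=float)
        span = v.max() - v.min()
        v = (v - v.min()) / span if span > 0 else np.zeros_like(v)
        return np.floor(v / bin_width).astype(int)

    t_counts, s_counts = Counter(discretize(ft_values)), Counter(discretize(fs_values))
    # Sum only over values observed in both domains: un(F_t, X_t) ∩ un(F_s, X_s)
    common = set(t_counts) & set(s_counts)
    t_total = sum(t_counts[x] for x in common)
    s_total = sum(s_counts[x] for x in common)
    kl = 0.0
    for x in common:
        p_t, p_s = t_counts[x] / t_total, s_counts[x] / s_total
        kl += p_t * np.log(p_t / p_s)
    return kl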
   In Figure 1 we graphically show this alignment step, while the pseudo-code is reported in
Algorithm 1. STEAL starts by considering the values assumed by the attributes as the key of
a paired RDD. The result is then used to find common values assumed by target and source
attributes, through a join operation. The frequency of each distinct value is then computed for
each target and source attribute, which are then exploited to estimate the probabilities used in
    1
        This can be considered as an optional step, that mainly depends on the availability of computational resources.
    2
        We discretize continuous features using the equal-width strategy (bin size of 0.01), after min-max normalization.
 Algorithm 1: alignSourceDomain(X_t, X_s)
   Data: X_t, X_s: RDD. Target and source domain instances (in the ⟨instance_id, feature, value⟩ form)
   Result: X_s_aligned: RDD. Source domain instances aligned in the p_t-dimensional feature space.
 1 begin
 2     tValKey ← X_t.map{case(instance_id, feature, value) → ⟨value, feature⟩};
 3     sValKey ← X_s.map{case(instance_id, feature, value) → ⟨value, feature⟩};
       // Join the target and source structures according to the feature values
 4     commVals ← tValKey.join(sValKey)
       // Count the frequencies of values for the target features
 5     tValFreq ← commVals.map{case(value, tFeat, sFeat) → ⟨⟨value, tFeat⟩, 1⟩}.reduceByKey((a, b) → a + b)
       // Compute the number (cardinality) of value matches for each target feature
 6     tAttrCard ← tValFreq.map{case(⟨value, feat⟩, freq) → ⟨feat, freq⟩}.reduceByKey((a, b) → a + b)
       // Compute the value probabilities for the target domain
 7     tValProbs ← tValFreq.map{case(⟨value, feat⟩, freq) → ⟨feat, ⟨value, freq⟩⟩}
 8      .join(tAttrCard).map{case(feat, ⟨value, freq, card⟩) → ⟨value, feat, freq/card⟩}
       // Compute the value probabilities for the source domain in the same way
 9     sValProbs ← ...
       // Compute the KL divergences
10     divergences ← tValProbs.cartesian(sValProbs)
11      .map{case(⟨tVal, tFeat, tProb⟩, ⟨sVal, sFeat, sProb⟩) → ⟨⟨tFeat, sFeat⟩, tProb · log(tProb/sProb)⟩}
12      .reduceByKey((aKLterm, bKLterm) → aKLterm + bKLterm)
       // Find the best target-source feature matching by minimizing the KL divergences
13     minDivergences ← divergences
14      .map{case(⟨tFeat, sFeat⟩, KLdiv) → ⟨tFeat, ⟨sFeat, KLdiv⟩⟩}
15      .reduceByKey{case(⟨sFeat1, KLdiv1⟩, ⟨sFeat2, KLdiv2⟩) →
16          if(KLdiv1 < KLdiv2) then ⟨sFeat1, KLdiv1⟩ else ⟨sFeat2, KLdiv2⟩}
       // Align source features to target features, according to the minimum divergences
17     return align(X_s, minDivergences)




the computation of the KL terms. Such probabilities are computed by dividing the frequencies
by the number of matches (in the join operation) that involved a given attribute. Finally, we
compute the KL divergences for all target-source pairs of features, which are then exploited to
find the best (i.e., minimum-divergence) source feature for each target feature.
   Note that the proposed approach implicitly performs a feature selection on the source domain,
since some of its features may not be matched to any target feature. Analogously, the same
feature of the source domain could be selected multiple times for different features of the target
domain, if it appears strongly aligned to all of them from a statistical viewpoint.
Stage 2 - Instance Matching. After the first stage, target and source instances are represented in
an aligned feature space, and are thus directly comparable through similarity/distance measures.
A straightforward approach to exploit the instances of the source domain would be that of
appending them to the training set of the target domain. However, this approach could introduce
noisy instances, i.e., instances that do not properly represent the data distribution of the target
domain, increasing the chance of negative transfer phenomena.
   On the contrary, we aim to finely exploit only a subset of source instances, by attaching them
to specific target instances, through feature concatenation, according to their similarity. In
this way, STEAL augments the descriptive features of target instances with those coming from
aligned, best-matching, source instances.

            Fs1    Fs2    Fs3
     Ft1    1.8    3.2    1.2
     Ft2    0.3    3.5    0.7
     Ft3    0.7    0.4    0.5
     Ft4    0.5    0.4    0.8

Figure 1: A graphical focus on the feature alignment between the source and the target domains: the
Kullback-Leibler divergence matrix between the target features (Ft1-Ft4) and the source features
(Fs1-Fs3), whose row-wise minima yield the aligned source domain Fs3, Fs1, Fs2, Fs2.
   STEAL computes the Cartesian product between numSamples subsets (randomly selected,
with replacement) of target and source instances, and reduces the result set by taking the source
instances with the maximum similarity. The adoption of multiple random subsets, instead
of working on the whole set of instances, alleviates the negative effects due to the possible
match of a target instance to an improper source instance, since the same target instance may
be sampled multiple times and associated with different source instances. At the end of the
instance matching step, we obtain a single target-source hybrid training set by merging the
target-source concatenated instances constructed from each sample. Due to space constraints,
we omit the pseudo-code description of this stage; a simplified sketch is reported below.
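The sketch replaces the distributed cartesian/reduce operations with in-memory numpy operations; the sampling sizes and the similarity function follow the description above, while stacking the per-sample results into the hybrid set, as well as the function name, are assumptions of the sketch.

import numpy as np

def match_instances(X_t, X_s, num_samples=10, sample_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    r = X_t.shape[1]
    hybrid = []
    for _ in range(num_samples):
        # Random subsets (with replacement) of target and source instances
        T = X_t[rng.choice(len(X_t), size=max(1, int(sample_frac * len(X_t))), replace=True)]
        S = X_s[rng.choice(len(X_s), size=max(1, int(sample_frac * len(X_s))), replace=True)]
        # Pairwise similarity sim(a, b) = 1 - sqrt(mean squared difference)
        sims = 1.0 - np.sqrt(((T[:, None, :] - S[None, :, :]) ** 2).sum(axis=2) / r)
        best = sims.argmax(axis=1)              # most similar source instance per target instance
        hybrid.append(np.hstack([T, S[best]]))  # feature concatenation
    return np.vstack(hybrid)                    # single target-source hybrid training set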

3. Experiments
We conducted our experimental evaluation in a relevant application domain where transfer
learning can be strongly beneficial, namely in biology, and specifically in the reconstruction of
the human gene regulatory network (target domain), using also the knowledge coming from
the mouse gene regulatory network (source domain). The reconstruction of a gene regulatory
network consists in the identification of currently unknown gene regulatory activities. A
gene regulatory network consists of a set of genes (represented as nodes) and a set of known
regulations (represented as edges). The reconstruction of such networks can be performed by
resorting to link prediction methods [14], working in the Positive-Unlabelled learning setting.
   We compared the results achieved by STEAL with some state-of-the-art heterogeneous
transfer learning competitors, namely TJM [6], JGSA [7] and BDA [8]. Note that, even if these
methods are able to work with heterogeneous feature spaces, such spaces are required to have
the same dimensionality. Moreover, we also evaluated the results obtained by some baseline approaches,
that are: i) T (no transfer), that is a predictive model learned only from the dataset of the
target domain; ii) S (optimal feature alignment), that is a predictive model trained only from
the source domain dataset, assuming that the optimal feature alignment is known a priori; iii) T
+ S (optimal feature alignment), that is a predictive model trained from both the target and
the source instances, assuming that the optimal feature alignment is known a priori.
   In our experiments, we used the dataset adopted in [11], which consists of two gene regulatory
networks, one for the human organism and one for the mouse organism. We considered both the
available versions of the dataset, i.e., the heterogeneous version, where human and mouse genes
                            Heterogeneous          Homogeneous        Homogeneous Reduced
                            Human   Mouse          Human   Mouse       Human     Mouse
  Positive interactions    235,706 14,613        235,706  14,613      4,714     4,714
  Unlabeled interactions   235,706 235,706       235,706 235,706      4,714     4,714
  Gene features              174     161            6        6          6         6
  Gene-pair features         348     322           12       12         12         12
Table 1
Quantitative information of the considered datasets.


are described by 174 and 161 expression levels, respectively, and the homogeneous version
obtained by averaging the expression levels per organ. Moreover, since competitor systems
ran out of memory on our server equipped with 64GB RAM, we also ran the experiments on
a reduced version of the homogeneous dataset consisting of 2% randomly selected instances.
Detailed quantitative information about the considered datasets is shown in Table 1.
   The task at hand naturally falls in the PU learning setting, since unobserved links cannot be
considered negative examples. To perform the estimation of the label confidence (Stage 1), we
identified 3 clusters and 2 clusters for the mouse and the human positive instances, respectively.
Such values were identified through the silhouette cluster analysis. For the heterogeneous
version of the dataset, we ran STEAL with the distributed PCA, which led to the identification of 18 new
features. As regards the instance matching, 10 samples were considered (i.e., 𝑛𝑢𝑚𝑆𝑎𝑚𝑝𝑙𝑒𝑠=10),
each involving 10% random training examples from the source and from the target domains.
   All the experiments were performed in the 10-fold cross-validation setting, where each fold
consists of 9/10 positive examples for training and 1/10 positive examples for testing, while all
the unlabeled examples are considered for both training and testing.
   As evaluation measure, since the dataset does not contain any negative example, we considered
the recall@k, defined as $TP_k / (TP + FN)$, where $TP_k$ is the number of returned true positive
interactions within the first top-k interactions, and $(TP + FN)$ corresponds to the number
of positive examples in the testing fold, and computed the area under the recall@k curve
(AUR@K). Note that the computation of other well-known measures, such as the Area Under
ROC curve (AUROC) and the Area Under Precision-Recall curve (AUPR), is not possible at all
in the positive-unlabeled setting, without introducing wrong biases in the ground truth.
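For illustration, a small sketch of the measure follows; normalizing the area by the number of ranked interactions (so that AUR@K lies in [0, 1]) and the function name are assumptions of the sketch, since the exact normalization is not detailed above.

import numpy as np

def aur_at_k(scores, is_positive):
    # Rank test interactions by decreasing predicted confidence
    order = np.argsort(-np.asarray(scores, dtype=float))
    pos = np.asarray(is_positive, dtype=bool)[order]
    # recall@k = TP_k / (TP + FN), for every cut-off k
    recall_at_k = np.cumsum(pos) / pos.sum()
    # Area under the recall@k curve, normalized to [0, 1]
    return recall_at_k.mean()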
   From the results shown in Table 2, we can observe that STEAL is able to outperform all the
considered baselines, in both the heterogeneous and homogeneous settings. In particular, the
results show that STEAL outperforms the counterpart that does not exploit the source domain,
i.e., T (no transfer), by 27% in the homogeneous setting and by 9.7% in the heterogeneous setting,
proving the benefits of exploiting the additional knowledge coming from the mouse organism.
On the other hand, using only the source domain (i.e., S (optimal feature alignment)), leads to
a reduction of the AUR@K with respect to using the target data only. This means that, although
source data can be beneficial for the target task, their exploitation should be done in a smart
manner. Note that, as expected, appending source instances to the target dataset, even if the
optimal feature alignment is known a priori (see T+S (optimal feature alignment)), does not
provide as many benefits as the method adopted by STEAL, which outperforms this approach
by 23.4% and 10% in the homogeneous and heterogeneous settings, respectively.
              Method                Heterogeneous   Homogeneous    Method   Homogeneous Reduced
          T (no transfer)               0.585         0.533        JGSA           0.500
  S (optimal feature alignment)         0.569         0.544         TJM           0.554
 T+S (optimal feature alignment)        0.583         0.551        BDA            0.558
          T+S (STEAL)                  0.642          0.680       STEAL           0.589

Table 2
On the left, the AUR@K results obtained by STEAL and baseline approaches on the full datasets. On
the right, the AUR@K results obtained by STEAL and competitors on the reduced dataset.




Figure 2: Running times and speedup factors measured on a synthetic dataset of 10 million examples.


   The comparison with competitors on the reduced dataset shows that STEAL is able to
outperform all of them. Specifically, STEAL achieves an improvement of 17.8% with respect to
JGSA, 6.3% with respect to TJM, and 5.5% with respect to BDA. Note that the superiority of
STEAL over these competitors is not limited to the observed AUR@K on the reduced dataset,
since JGSA, TJM and BDA were not able to complete the experiments on the full dataset.
   To assess the advantages of STEAL also from a computational viewpoint, we performed
a scalability analysis. We measured the running times and computed the speedup factor to
evaluate the ability of STEAL to exploit additional computational nodes. This analysis was
performed on a cluster with 1, 2, or 3 computational nodes, by resorting to a synthetic dataset
with 10 million instances. The results of this analysis, depicted in Figure 2, show that, although
the communication overhead in principle increases, STEAL is able to take advantage of the
additional computational nodes. Specifically, the measured speedup factors are close to the ideal
ones, i.e., they quickly converge to 2 and 3 with 2 and 3 computational nodes, respectively.

4. Conclusions
In this discussion paper, we proposed a novel heterogeneous transfer learning method, called
STEAL, that can work in the PU learning setting also with heterogeneous feature spaces, thanks
to a feature alignment step based on the KL divergence. The obtained results demonstrate the
effectiveness of the proposed method with respect to baseline and state-of-the-art competitors.
Moreover, it exhibits very interesting scalability results, emphasizing its possible applicability
to large datasets. We evaluated the performance of our method in the reconstruction of
gene regulatory networks, which obtained significant benefits from the application of STEAL.
Currently, we are evaluating the effectiveness of STEAL in other application domains, including
the prediction of the energy consumption in power grids, and we are working on extending it
with the possibility of exploiting multiple source domains.
Acknowledgments
Dr. Paolo Mignone acknowledges the support of Apulia Region through the REFIN project
“Metodi per l’ottimizzazione delle reti di distribuzione di energia e per la pianificazione di
interventi manutentivi ed evolutivi” (CUP H94I20000410008, Grant n. 7EDD092A).


References
 [1] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey
     on transfer learning, Proc. IEEE 109 (2021) 43–76.
 [2] K. Amasyali, N. M. El-Gohary, A review of data-driven building energy consumption
     prediction studies, Renewable and Sustainable Energy Reviews 81 (2018) 1192–1205.
 [3] Y. Wang, Z. Xia, J. Deng, X. Xie, M. Gong, X. Ma, TLGP: a flexible transfer learning
     algorithm for gene prioritization based on heterogeneous source domain, BMC Bioinform.
     22-S (2021) 274.
 [4] W. Li, L. Duan, D. Xu, I. W. Tsang, Learning with augmented features for supervised and
     semi-supervised heterogeneous domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell.
     36 (2014) 1134–1148.
 [5] J. T. Zhou, S. J. Pan, I. W. Tsang, Y. Yan, Hybrid heterogeneous transfer learning through
     deep learning, in: AAAI 2014, 2014, pp. 2213–2220.
 [6] M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer joint matching for unsupervised
     domain adaptation, in: CVPR 2014, 2014, pp. 1410–1417.
 [7] J. Zhang, W. Li, P. Ogunbona, Joint geometrical and statistical alignment for visual domain
     adaptation, in: CVPR 2017, 2017, pp. 5150–5158.
 [8] J. Wang, Y. Chen, S. Hao, W. Feng, Z. Shen, Balanced distribution adaptation for transfer
     learning, in: ICDM 2017, 2017, pp. 1129–1134.
 [9] B. Zhang, W. Zuo, Learning from positive and unlabeled examples: A survey, in: ISIP 2008
     / WMWA 2008, Moscow, Russia, 23-25 May 2008, 2008, pp. 650–654.
[10] P. Mignone, G. Pio, Positive unlabeled link prediction via transfer learning for gene
     network reconstruction, in: ISMIS 2018, 2018, pp. 13–23.
[11] P. Mignone, G. Pio, D. D’Elia, M. Ceci, Exploiting transfer learning for the reconstruction
     of the human gene regulatory network, Bioinformatics 36 (2020) 1553–1561.
[12] P. Mignone, G. Pio, S. Džeroski, M. Ceci, Multi-task learning for the simultaneous recon-
     struction of the human and mouse gene regulatory networks, Scientific Reports 10 (2020)
     22295.
[13] G. Pio, P. Mignone, G. Magazzù, G. Zampieri, M. Ceci, C. Angione, Integrating genome-
     scale metabolic modelling and transfer learning for human gene regulatory network
     reconstruction, Bioinformatics 38 (2021) 487–493.
[14] L. Lu, T. Zhou, Link prediction in complex networks: A survey, Physica A: Statistical
     Mechanics and its Applications 390 (2011).